FreeRADIUS exiting with Signal 11 on FreeBSD

Hi All,

I am coming back to the FreeBSD community having conducted a discussion on the FreeRADIUS mailing list over the last month.

There are a number of FreeBSD users that seem to suffer the same problem. No one using any other distributions have chimed in to say they are suffering the same problem.
Since the FreeRADIUS author isn't a FreeBSD guru I was wondering if there was anything I could test or suggest that might help resolve this issue.

I will summarise the thread below.

Code:
I am running FreeRADIUS 2.1.10 on 64bit FreeBSD 8.1-RELEASE-p1 with LDAP auth and accounting records being sent to file and MySQL.

I get the following in radius.log:
Sun Dec  5 05:14:57 2010 : Info: Exiting normally.

and the following in messages:
Dec  5 05:14:57 cat kernel: pid 68292 (radiusd), uid 133: exited on signal 11

Danial also commmented that he experienced the same problem for quite some time.

Code:
I'm currently running FreeRADIUS 2.1.10 on i386 FreeBSD 8.1-RELEASE. 
Accounting packets are written file + two extra detail logs which are
proxied.

For me I think this "Exiting normally" problem started happening after
upgrading FreeBSD from either 7.x to 8 or 6.x to 7.x. 
It's been happening for over a year now. Sometimes it happens three times a
day and sometimes it runs fine for a month.

Both he and I are running very lightly loaded servers. Mine has a load average of 0.012

Danial ran a ktrace and got the following:

Code:
/var/log/radius.log:
Sat Dec 18 02:43:43 2010 : Info: Exiting normally.

/var/log/messages:
Dec 18 02:43:43 radius kernel: pid 655 (radiusd), uid 133: exited on signal
11

$ kdump | tail -20
   655 radiusd  CALL  fstat(0xa,0xbf8fb280)
   655 radiusd  STRU  struct stat {dev=83, ino=2378852, mode=-rw------- ,
nlink=1, uid=133, gid=133, rdev=0, atime=1292640223, stime=1292640223,
ctime=1292640223, birthtime=1292640223, size=0, blksize=16384, blocks=0,
flags=0x0 }
   655 radiusd  RET   fstat 0
   655 radiusd  CALL  lseek(0xa,0,SEEK_SET,0x2)
   655 radiusd  RET   lseek 0
   655 radiusd  CALL  write(0xa,0x28752000,0x54)
   655 radiusd  GIO   fd 10 wrote 84 bytes
       "Sat Dec 18 02:43:43 2010
                Timestamp = 1292640223
                Request-Authenticator = Verified

       "
   655 radiusd  RET   write 84/0x54
   655 radiusd  CALL  close(0xa)
   655 radiusd  RET   close 0
   655 radiusd  PSIG  SIGSEGV SIG_DFL
   655 radiusd  RET   _umtx_op -1 errno 4 Interrupted system call
   655 radiusd  RET   _umtx_op -1 errno 4 Interrupted system call
   655 radiusd  RET   _umtx_op -1 errno 4 Interrupted system call
   655 radiusd  RET   _umtx_op -1 errno 4 Interrupted system call

Danial ran gdb and got the following output:

(I have cut some out as I ran out of characters for the post. Let me know if I might have chopped something important)

Code:
Starting program: /usr/local/sbin/radiusd -f
[New LWP 100056]
[New Thread 28401140 (LWP 100056)]
[New Thread 28567ac0 (LWP 100051)]
[New Thread 28567980 (LWP 100068)]
[New Thread 28567840 (LWP 100069)]
[New Thread 28567700 (LWP 100070)]
[New Thread 285675c0 (LWP 100102)]



Program received signal SIGABRT, Aborted.
[Switching to Thread 28567ac0 (LWP 100051)]
0x2831866b in thr_kill () from /lib/libc.so.7
(gdb) 
(gdb) 
(gdb) where
#0  0x2831866b in thr_kill () from /lib/libc.so.7
#1  0x282c7896 in pthread_kill () from /lib/libthr.so.3
#2  0x282c53b3 in raise () from /lib/libthr.so.3
#3  0x283bac4a in abort () from /lib/libc.so.7
#4  0x0806c852 in rad_assert_fail (file=Could not find the frame base for
"rad_assert_fail".
) at util.c:365
#5  0x0806af44 in request_dequeue (request=0x28542b7c, fun=0xbf9fef8c) at
threads.c:412
#6  0x0806b18b in request_handler_thread (arg=0x28542b60) at threads.c:516
#7  0x282c175f in pthread_getprio () from /lib/libthr.so.3
#8  0x00000000 in ?? ()
(gdb) thread apply all bt full

(these were the same so I have shortened them)
Thread 7 (Thread 285675c0 (LWP 100102)):
Thread 6 (Thread 28567700 (LWP 100070)):
Thread 5 (Thread 28567840 (LWP 100069)):
Thread 4 (Thread 28567980 (LWP 100068)):
#0  0x282cc227 in __error () from /lib/libthr.so.3
No symbol table info available.
#1  0x282cbc19 in __error () from /lib/libthr.so.3
No symbol table info available.
#2  0x285484fc in ?? ()
No symbol table info available.
#3  0x0000000f in ?? ()
No symbol table info available.
#4  0x00000000 in ?? ()
No symbol table info available.
#5  0x00000000 in ?? ()
No symbol table info available.
#6  0x00000000 in ?? ()
No symbol table info available.
#7  0x282cdbf4 in ?? () from /lib/libthr.so.3
No symbol table info available.
#8  0xbf8fdf68 in ?? ()
No symbol table info available.
#9  0x282c0fcc in sem_wait () from /lib/libthr.so.3
No symbol table info available.
Previous frame identical to this frame (corrupt stack?)

Thread 3 (Thread 28567ac0 (LWP 100051)):
---Type <return> to continue, or q <return> to quit---
#0  0x2831866b in thr_kill () from /lib/libc.so.7
No symbol table info available.
#1  0x282c7896 in pthread_kill () from /lib/libthr.so.3
No symbol table info available.
#2  0x282c53b3 in raise () from /lib/libthr.so.3
No symbol table info available.
#3  0x283bac4a in abort () from /lib/libc.so.7
No symbol table info available.
#4  0x0806c852 in rad_assert_fail (file=Could not find the frame base for
"rad_assert_fail".
) at util.c:365
No locals.
#5  0x0806af44 in request_dequeue (request=0x28542b7c, fun=0xbf9fef8c) at
threads.c:412
        blocked = 134809920
        i = RAD_LISTEN_ACCT
        start = RAD_LISTEN_ACCT
        entry = (request_queue_t *) 0x0
        last_complained = 0
#6  0x0806b18b in request_handler_thread (arg=0x28542b60) at threads.c:516
        fun = 0x804e250 <rad_accounting>
        self = (THREAD_HANDLE *) 0x28542b60
#7  0x282c175f in pthread_getprio () from /lib/libthr.so.3
No symbol table info available.
#8  0x00000000 in ?? ()
No symbol table info available.

Thread 2 (Thread 28401140 (LWP 100056)):
#0  0x283abf9f in unlink () from /lib/libc.so.7
No symbol table info available.
#1  0x0805daf9 in command_socket_free (this=0x28569100) at command.c:1839
        sock = (fr_command_socket_t *) 0x285b4800
#2  0x08060053 in listen_free (head=0x8090b7c) at listen.c:2257
        next = (rad_listen_t *) 0x285690a0
        this = (rad_listen_t *) 0x28569100
#3  0x08061e7c in free_mainconfig () at mainconfig.c:992
        cc = (cached_config_t *) 0x0
        next = (cached_config_t *) 0x0
#4  0x0806859a in main (argc=2, argv=0xbfbfec74) at radiusd.c:443
        rcode = 0
        argval = -1
        spawn_flag = 1
        dont_fork = 1
        flag = 0
        act = {__sigaction_u = {__sa_handler = 0x80687f0 <sig_fatal>,
__sa_sigaction = 0x80687f0 <sig_fatal>}, sa_flags = 0, 
  sa_mask = {__bits = {0, 0, 0, 0}}}
#0  0x2831866b in thr_kill () from /lib/libc.so.7

Alan DeKok the main FreeRADIUS developer looked at the output and commented:

Code:
> #4  0x0806c852 in rad_assert_fail (file=Could not find the frame base for
> > "rad_assert_fail".
> > ) at util.c:365
> > #5  0x0806af44 in request_dequeue (request=0x28542b7c, fun=0xbf9fef8c) at
> > threads.c:412
  Ugh.  Something is free'ing the request when it's still queued.  This
*only* can happen when the request is in the queue for more than 30s.

  So... why is your server so slow?  Fix that, and the problem won't be
fixed, but it *will* go away.

  I can take a look at the code, but this is a very odd edge case, and
hard to track down/fix.

Another user ironrake commented:

Code:
It started for me at FreeBSD 7.2 and has continued through 8.1 for both Freeradius 2 from the ports collection and from the official tar file source distribution.
My server is lightly loaded and not really a production system so I've not run it under gdb. I use the standard kernel scheduler.
The user I run freeradius as has a default login class.
The parent process will always die with this message just after a successful authentication is logged.

Does anyone have any suggestions for testing? Something I could take back to Alan DeKok?

Thanks for your perseverance.

Happy New Year.

Hugh
 
Is the FreeRADIUS port maintainer in the loop about this? This is mostly a user forum, so we mostly deal with setting up and configuring stuff, not with trawling through kernel and system traces. Maybe contacting the port maintainer(1) or filing a PR(2) for the port would greatly expedite proceedings.

(1)
cd /usr/ports/net/freeradius && make maintainer
(2)
http://www.freebsd.org/send-pr.html
 
any updates?

Has there been any movement on this? I have a server doing the same thing. I had to write a cron job to restart radiusd if the pid file isn't there. Do you have a link to the PR?

Thanks!
 
I think I may have fixed it here. I edited /usr/local/etc/raddb/modules/ldap and raised the timeout values in there. That seems to keep radiusd from exiting. It hasn't happened for three days now and it usually crashes about every 1.5 days. I am still getting "LDAP connection lost" errors in my logs, but not nearly as many now.

Of course now that I am posting this, I fully expect to see it crash again...

Update:
Yep, exactly 15 minutes after posting that. Maybe I need to raise the timeouts some more.
 
Using the following settings in /usr/local/etc/raddb/modules/ldap, it has not crashed since I last posted in this thread:

Code:
        # seconds to wait for LDAP query to finish. default: 20
        timeout = 30

        #  seconds LDAP server has to process the query (server-side
        #  time limit). default: 20
        #
        #  LDAP_OPT_TIMELIMIT is set to this value.
        timelimit = 30

        #
        #  seconds to wait for response of the server. (network
        #   failures) default: 20
        #
        #  LDAP_OPT_NETWORK_TIMEOUT is set to this value.
        net_timeout = 20
 
Back
Top