What causes a box to lock up, and then spontaneously reboot itself 50 min later?

So this was a weird one. Got a page at 18:31 today that abc.xyz.com was down. FreeBSD 6.4-STABLE box. Jumped in the car and headed down to the datacenter, about 40 mins away. As I pulled into the parking lot, I get another page: abc.xyz.com is UP! Huh??

Go inside, ask the datacenter techs if they did anything. Nope, they didn't even know it was down. It's been safely locked in its cage, with nobody touching it, this entire time.

So I log in and start poking around. All the logs go right up to 18:31, and then nothing. No error messages anywhere that I can find. But /var/log/messages does show something interesting... the box rebooted itself at 19:19:

Code:
Jul 21 17:47:24 abc ntpd[679]: kernel time sync status change 6001
Jul 21 18:21:30 abc ntpd[679]: kernel time sync status change 2001
Jul 21 19:19:34 abc syslogd: kernel boot file is /boot/kernel/kernel
Jul 21 19:19:34 abc kernel: Copyright (c) 1992-2008 The FreeBSD Project.
Jul 21 19:19:34 abc kernel: Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jul 21 19:19:34 abc kernel: The Regents of the University of California. All rights reserved.
Jul 21 19:19:34 abc kernel: FreeBSD is a registered trademark of The FreeBSD Foundation.
Jul 21 19:19:34 abc kernel: FreeBSD 6.4-STABLE #0: Sat Mar  7 01:08:35 PST 2009

So what would cause a box to lock up so bad it can't even respond to pings, sit there for 48 minutes, and then decide to reboot, all without a human even coming near it?
 
Depending on the amount of memory and the size of the hard disks, it could just have panicked, after which it rebooted, did a core dump of its memory and an fsck of the hard drives. Both could take a long time.
 
Hmm. How long does a core dump take? It only has 4GB of memory; that doesn't take 45 min, does it?

It wasn't fsck, because it didn't start that until after it was booted (I have soft updates on; delays fsck for a background check)
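
A back-of-the-envelope estimate helps with the core dump question. The sketch below assumes the dump time is dominated by writing all of RAM to disk at a steady rate; the 20 MB/s figure is a deliberately pessimistic guess, not a measurement from this box, and `dump_minutes` is just an illustrative helper name.

```python
def dump_minutes(ram_gib, write_mib_per_s):
    """Estimate minutes needed to write ram_gib GiB of memory to disk.

    Assumes a constant sustained write rate, which is a simplification.
    """
    seconds = ram_gib * 1024 / write_mib_per_s
    return seconds / 60


# 4 GiB of RAM at an assumed (pessimistic) 20 MiB/s
print(f"{dump_minutes(4, 20):.1f} minutes")
```

Even at that slow assumed rate the estimate comes out to a few minutes, so a full memory dump alone is unlikely to account for a 45-minute gap.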
 
dordal said:
So what would cause a box to lock up so bad it can't even respond to pings, sit there for 48 minutes, and then decide to reboot, all without a human even coming near it?

I'd challenge the idea that the time between 18:21 and 19:19 represented a system "lock up". Is there any other logging you could review to see if in fact the system was live during that period of time?

You might want to enable syslogd mark messages in case this occurs again. Add to /etc/rc.conf:
Code:
syslogd_flags="-m 5"

That will create an entry in /var/log/messages every 5 minutes, and it will make it easier to determine whether the system was hung, or whether there was just no activity.
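
If the mark entries do end up in /var/log/messages, spotting a hang becomes a matter of looking for gaps longer than the mark interval. Here is a minimal Python sketch of that check. It assumes classic syslog timestamps ("Jul 21 18:21:30") at the start of each line, an English/C locale for month names, and a log that does not cross a year boundary; `find_gaps`, the 10-minute threshold and the `year` argument are my own illustrative choices, not anything syslogd provides.

```python
from datetime import datetime, timedelta


def find_gaps(lines, max_gap=timedelta(minutes=10), year=2009):
    """Return (previous, current) timestamp pairs more than max_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        try:
            # Classic syslog lines start with a 15-character timestamp such
            # as "Jul 21 18:21:30". The year is not logged, so the caller
            # supplies it.
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous is not None and stamp - previous > max_gap:
            gaps.append((previous, stamp))
        previous = stamp
    return gaps
```

Something like `find_gaps(open("/var/log/messages"))` would then list every stretch where nothing at all was logged. With marks every 5 minutes, a gap well over 10 minutes suggests the box really was hung (or powered off) rather than just idle.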

All that said, spontaneous reboots (i.e. without a kernel panic) are generally hardware (RAM, hard drive, PSU) related.
 
Maybe a memory leak caused an overflow that locked the system in some sort of loop, and it rebooted once there was no more memory left to leak into.
This week I had something similar caused by mdconfig/jail. The system froze completely, the CPU fan went to full speed, and after a minute it started dumping page fault errors and rebooted.
 
Could it be a power supply error?
 
I had random FreeBSD-STABLE reboots until I checked my RAM and discovered 1 bad bit...
I replaced my 512MB RAM module with a 1GB one, and since then no more random reboots.
 
anomie said:
I'd challenge the idea that the time between 18:21 and 19:19 represented a system "lock up". Is there any other logging you could review to see if in fact the system was live during that period of time?

You might want to enable syslogd mark messages in case this occurs again. Add to /etc/rc.conf:
Code:
syslogd_flags="-m 5"

That's a great idea. I'll give it a go. I'm virtually certain the box really was locked up, because:
- it didn't respond to pings on either the public or private network (different NICs)
- there isn't a SINGLE message in any log that I could find between those times. It's a pretty busy box; usually 200-300 processes running, and so the logs usually have lots of stuff in them.

It's the 48-minute part that I just don't get. What does it do for that long before rebooting? Maybe it was a PSU issue... e.g. the PSU is on the edge, stopped providing enough power, then reset itself after it cooled off or something.
 
While you are checking the hardware, make sure to check that the CPU is properly cooled. An overheated CPU will automatically slow down to the point that the system seems locked up, but in fact is just taking next-to-forever to execute a single calculation.
 
No news yet: I haven't figured out what happened.

I'm suspecting a heat-related issue; this box is at the top of the rack. But I'm not really sure what is failing... I mean, what takes 45 minutes to cool off?
 
Okay, keep us posted! I hate dead threads.

If the top of the rack is hot enough to cause problems then, sure, it will be hot enough to keep things warm while the server cools down (especially if the air isn't moving outside and inside the box).
 