The "hung on reboot" problem is really annoying

It's more common than not that a 'reboot' of a FreeBSD server results in a hung machine requiring manual intervention (via reset or power-cycling). It even happens after kernel panics. Sometimes it works (more often if the server was rebooted recently, but long uptimes definitely make a "hang" more probable).

Is anyone looking into this issue, which has been there for ages?
Code:
> FreeBSD/amd64 (balrog) (ttyu1)
>
> login: Oct  2 13:14:10 balrog reboot: rebooted by fancypants
> Oct  2 13:14:25 balrog syslogd: exiting on signal 15
> pflog0: promiscuous mode disabled
> Waiting (max 60 seconds) for system process `vnlru' to stop... done
> Waiting (max 60 seconds) for system process `bufdaemon' to stop... done
> Waiting (max 60 seconds) for system process `syncer' to stop... 
> Syncing disks, vnodes remaining... 0 0 0 0 0 0 0 0 0 0 0 done
> All buffers synced.
... after which nothing more happens.

FreeBSD 11.2 & 11.3; hardware is Dell PowerEdge R730xd & R740xd, HP ProLiant DL380g9 and others. A lot of disks attached to SAS HBA controllers and Intel X710 10GbE Ethernet controllers, but otherwise fairly standard machines...

We've tried mitigating this issue somewhat by enabling the hardware watchdog via watchdogd(8), and that helps sometimes, but not always (it seems the hardware watchdog also gets lost every now and then).

One possible workaround that has been mentioned before is adding to /boot/loader.conf:
Code:
   hw.usb.no_shutdown_wait = "1"
which we have, but it doesn't make any difference, so it's probably something else preventing the machine from rebooting properly.
 
Don't put spaces around the '=' in loader.conf. But as far as I know that setting is really only useful if it 'hangs' due to an attached USB external disk. If you have no external USB disks it's not going to help much.
 
Don't put spaces around the '=' in loader.conf.


From the man page for "loader.conf":

"The general parsing rules are:
• Spaces and empty lines are ignored."

And it seems to work just fine....
(The settings are manipulated via the Puppet "augeas" tool - which adds them with those spaces around the "=")
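
If the spacing is still in doubt, one way to take it out of the equation is to strip the spaces before the file gets deployed. A minimal sketch in POSIX sh; the function name normalize_loader_line is made up for illustration, and it assumes simple one-line name = "value" entries:

```shell
#!/bin/sh
# Rewrite 'name = "value"' as 'name="value"'.
# Lines that do not start with a variable name (comments, blanks)
# pass through unchanged.
normalize_loader_line() {
    printf '%s\n' "$1" |
        sed -E 's/^[[:space:]]*([A-Za-z0-9_.-]+)[[:space:]]*=[[:space:]]*/\1=/'
}

normalize_loader_line 'hw.usb.no_shutdown_wait = "1"'
# prints: hw.usb.no_shutdown_wait="1"
```

That way the setting can be tested in both forms without touching the Puppet side.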
 
I have never had this issue and never met anyone who did.
Well, I have. And I have had this issue too. But only when there was an external USB drive attached to the machine. And this actually turned out to be quite common.

Also,

Yes, those are old. This issue crops up from time to time. It's been a while since I last had this issue but it's been happening on and off for a couple of versions now.
 
In 15 years of running FreeBSD servers, I have never had this issue and never met anyone who did. So, no, it is not more common than not.

... more common than not on our 18 servers, I probably should have written. Different vendors (HP & Dell) & hardware (R730xd, R740xd, DL380g9), different FreeBSD versions (11.0/11.1/11.2/11.3/12.0). So they probably have something in common. Hm...

An old DL380g5 reboots every Thursday nicely though.

What the servers with problems have in common is that they use UEFI boot and have Intel X710 10G Ethernet adapters.

(The old DL380 uses BIOS boot and has no X710 10G Ethernet adapter either.)
 
In 15 years of running FreeBSD servers, I have never had this issue and never met anyone who did. So, no, it is not more common than not.
I've run into it a fair bit on older FreeBSD releases, usually in crash dumps. The crash dump would hang (not unreasonable in the case of a disk subsystem failure, but these were Ethernet-related panics). I did a fair amount of debugging and working with the developers at the time, including finding some "this should never have worked at all" things, like not halting the other CPUs - they kept doing whatever they were doing (user code, syscalls, whatever) while part of the kernel thought it was crashing and dumping core. Those issues were fixed nearly a decade ago, though.

Sometimes the answer is in hardware, often with a "you can't fix stupidity" - again going back decades, to a Gateway system running a BBS. The Gateway decided to shadow the BIOS in RAM for performance reasons, but didn't mark that RAM as allocated and unavailable. So FreeBSD happily used it, overwriting the BIOS shadow in RAM. Since FreeBSD doesn't use the BIOS once the kernel starts, it went unnoticed until a reboot. At which point, no BIOS and <splat>. Gateway suggested disabling BIOS shadowing, which has its own amusing anecdote - instead of going "beep" at startup, it went "mooooooooooooooooo" due to the much slower BIOS execution. We joked that we knew it was a Gateway (for those who weren't around then, the Gateway mascot and box colors were patterned on a Holstein cow).

Enough digressions. If the issue is that the system isn't handling the reboot properly (as opposed to never getting to the point where it actually tries to reboot), try looking at the various hw.acpi.*reboot sysctl knobs. Other than that, it will likely involve instrumenting the code path after the "All buffers synced" to see where things are hanging up.
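
For the ACPI knobs, a quick way to see what the box is currently set to. This is only a sketch: hw.acpi.handle_reboot and hw.acpi.disable_on_reboot are the two names I know from acpi(4), whether they exist on a given machine depends on its ACPI support, and the function name show_reboot_knobs is made up:

```shell
#!/bin/sh
# Print the current value of each ACPI reboot knob,
# or a note if the knob (or sysctl itself) is not available.
show_reboot_knobs() {
    for knob in hw.acpi.handle_reboot hw.acpi.disable_on_reboot; do
        if value=$(sysctl -n "$knob" 2>/dev/null); then
            echo "$knob=$value"
        else
            echo "$knob: not available on this system"
        fi
    done
}

show_reboot_knobs
```

Flipping one of them as root (for example sysctl hw.acpi.disable_on_reboot=1) before the next reboot is then a cheap experiment.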

To the OP - I am running 12-STABLE (and have been for quite a few months) on a Dell R730 (not an R730xd, but I think they share the same motherboard and BIOS) and haven't had any cases where a reboot failed (and I build new kernels and reboot regularly). What happens if you comment out all of your ports in /etc/rc.conf and reboot right after the system goes multi-user? If it reboots OK, bisect the problem and repeat until you find a possible problematic port. Or, if it reboots fine when you reboot it right after startup, even with all your ports running, but fails after the system has been up for some time, that points elsewhere.
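
To get the list of candidates for that bisection, something like this pulls the enabled services out of rc.conf. A sketch only: list_enabled is a made-up name, it assumes the usual name_enable="YES" form on one line, and it cannot tell ports from base services, so that sorting is left to you:

```shell
#!/bin/sh
# Print the names of services enabled in an rc.conf-style file.
# Commented-out lines and anything not set to YES are skipped.
list_enabled() {
    sed -n -E 's/^[[:space:]]*([A-Za-z0-9_]+)_enable="?[Yy][Ee][Ss]"?.*$/\1/p' "$1"
}

# Only run against the real file if it is there and readable.
if [ -r /etc/rc.conf ]; then
    list_enabled /etc/rc.conf
fi
```

Comment out half of the ports on that list, reboot, and keep halving whichever half still hangs.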
 
In 15 years of running FreeBSD servers, I have never had this issue and never met anyone who did. So, no, it is not more common than not.

Well, I had that.
In my case, the problem was that I had removed options VESA from the kernel configuration. I don't really understand why this would be necessary, and I fear, I don't really want to know either...
 
This might be far-fetched, but perhaps it is worth considering the following.

Last week, I had an issue where a sloppy USB hub connected to a FreeBSD 12.0-RELEASE system drew a lot of power from the system. I noticed that the power supply was humming, and the 4.2 GHz i7 processor throttled its speed down to the level of an 8088. When shutting down, syncing disks took forever. I removed the defective hub, and everything has been back to normal since.

Now the question is: are your SAS stacks self-powered, or do the disks draw their power from the computer's power supply? It might happen that disk syncing spins up all the disks at the same time, which might cause a current spike on the power supply, which would then send the whole system south.

On another very slow Atom machine, also running FreeBSD 12 and connected to an APC UPS, it sometimes happened on power loss that disk syncing took longer than the 90-second UPS timeout, and the system was shut off in the middle of synchronization. This is a cheap UPS, and the timeout cannot be configured, so I added the following job to /etc/crontab:
Code:
...
# sync the file system every minute
*    *    *    *    *    root    /bin/sync

With that in place, a smooth shutdown reliably takes less than 30 s, which is fine for this 12-year-old system.
 
Seems Twitter has been blessed to be the new world's communications standard. Now, I reduce my above message to 160 chars. Not that it would be better to understand, but at least it does not exceed the tweet limit :-D

This might be far stretched, but perhaps it is worse to consider the following.

Last week, I had an issue, where a sloppy USB hub which was connected to a Free
 
Gateway suggested disabling BIOS shadowing, which has its own amusing anecdote - instead of going "beep" at startup, it went "mooooooooooooooooo"

Haha, laughed out loud on that one :) I totally remember the Gateway cow packaging and always thought it was super lame. Why would anyone style their packaging after a cow? Now we know why. Oh, wait, is that more than 160 characters?
 
This might be far-fetched, but perhaps it is worth considering the following.
...
Now the question is: are your SAS stacks self-powered, or do the disks draw their power from the computer's power supply? It might happen that disk syncing spins up all the disks at the same time, which might cause a current spike on the power supply, which would then send the whole system south.
...
With that in place, a smooth shutdown reliably takes less than 30 s, which is fine for this 12-year-old system.

Hmm... Well, our systems are Dell PowerEdge R730xd with internal SAS disks, so probably not the same issue. Anyway, I've been instrumenting the shutdown sequence with some printf debugging, and so far it might point to this part of kern_reboot() in /usr/src/sys/kern/kern_shutdown.c

Code:
        EVENTHANDLER_INVOKE(shutdown_post_sync, howto);
                
        if ((howto & (RB_HALT|RB_DUMP)) == RB_DUMP && !cold && !dumping) 
                doadump(TRUE);

        /* Now that we're going to really halt the system... */
        EVENTHANDLER_INVOKE(shutdown_final, howto);

Probably the "shutdown_post_sync" part. I've seen it take quite a long time in some cases (but no complete "hang" as of yet). Adding more printf:s...
 

Having completely avoided ZFS except for a couple of rare instances, I'd have no general idea what that was about, though that answer is very interesting, as are the adventures of those who tried to help find a solution, even if 'experience' was the final answer. // ' 3*3+3-1 ( ' ' ); ';

An original theory, prior to reading many posts, was that your computer thought everything was fine. ( log, i, c:, al ; logical exit[ ' ' ]; );
 