[Solved] Shutdown/Reboot hangs for over an hour at "All buffers synced."

Hello everyone! I fairly recently upgraded a server from FreeBSD 12.2 to 13.2 (with a separate issue along the way) and during the required reboots, the system started hanging after one of the last outputs: "All buffers synced." This occurs with both shutdown and restart, i.e. with the shutdown and reboot commands alike. After letting the system sit for about an hour and a quarter, it did eventually finish shutting down and could be powered back up, but I can't find anything useful in the logs as to the cause of the issue.

The system has two zpools: one for storage and a zroot pool for the operating system. During the upgrade process, I had to unseat the drives for the storage pool in order to boot (I needed to update the EFI bootloader) and noticed that the issue only occurs when the storage drives are installed and the pool is mounted.

It seems that there was a bug report against 10.1 for a similar issue, but from what I understand, that should have been resolved by now.

Does anyone have an idea as to what the cause for this issue might be, or where to look next for this? It's a system that doesn't get shut down often, so it's entirely possible that it's been a latent issue since before the upgrade.

Any help is appreciated! Thanks!
 
Is the system EFI booting? If it is, you might want to try setting hw.efi.poweroff=0 in /boot/loader.conf. This issue should only happen with a shutdown though, not with a reboot. Are there any USB disk devices attached when you shut down/reboot? If so, you might want to try hw.usb.no_shutdown_wait="1" in loader.conf.
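
For reference, those two knobs would go into /boot/loader.conf roughly like this (just a sketch of the settings mentioned above; remove the lines again if they make no difference):

    # /boot/loader.conf
    hw.efi.poweroff=0              # power off via ACPI instead of the EFI runtime service
    hw.usb.no_shutdown_wait="1"    # don't wait on USB devices during shutdown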
 
Any chance you mount some shares via NFS from another system on that server?
I noticed this on some clients back around 11.x/12.0-RELEASE with NFS-mounted /home directories, where reboot/shutdown would also hang at the same message. Upon forcefully killing the NFS mounts (by restarting nfs_server/mountd), all clients immediately proceeded to shut down/reboot.
I wasn't able to look any further into the problem back then, but maybe this could also be related to other filesystems being unable to unmount properly (e.g. outstanding/running ZFS housekeeping?) with the shutdown process waiting for a timeout?
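
If you're not sure whether anything NFS is still mounted, a quick way to check (and, if needed, to force a hung mount away before rebooting) would be something along these lines - the path is only an example:

    mount -t nfs                    # list any active NFS mounts
    umount -f /mnt/example_share    # force-unmount a hung NFS share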
 
Is the system EFI booting? If it is, you might want to try setting hw.efi.poweroff=0 in /boot/loader.conf. This issue should only happen with a shutdown though, not with a reboot. Are there any USB disk devices attached when you shut down/reboot? If so, you might want to try hw.usb.no_shutdown_wait="1" in loader.conf.
The server is EFI booting, but after attempting these fixes and later rolling them back, the issue seems to have completely resolved itself. Even with the changes undone, it still completes the shutdown in no more than a couple of minutes.

My best guess as to why it's working now is that letting it sit for the hour-plus shutdown allowed it to finish processing whatever it was getting stuck on. Previously, I'd been forced to hard-shutdown the server early just to get it back up and running.

Any chance you mount some shares via NFS from another system on that server?
I noticed this on some clients back around 11.x/12.0-RELEASE with NFS-mounted /home directories, where reboot/shutdown would also hang at the same message. Upon forcefully killing the NFS mounts (by restarting nfs_server/mountd), all clients immediately proceeded to shut down/reboot.
I wasn't able to look any further into the problem back then, but maybe this could also be related to other filesystems being unable to unmount properly (e.g. outstanding/running ZFS housekeeping?) with the shutdown process waiting for a timeout?
That was something that had come up as a possible issue when I was looking into solutions, but there are no NFS shares mounted on this server.

Thanks to both of you for the suggestions and help!
 
What are the odds...

I just encountered the same problem on my home server yesterday evening, after installing a new set of NVMe drives for a dedicated poudriere pool, zfs send | recv'ing my poudriere datasets to the new pool, and "trying" to delete the old datasets from the previous pool...
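
For context, the migration was essentially of this form (pool and dataset names here are placeholders, not my actual layout):

    zfs snapshot -r oldpool/poudriere@migrate
    zfs send -R oldpool/poudriere@migrate | zfs recv -u newpool/poudriere
    zfs destroy -r oldpool/poudriere    # the step that ended up sitting for ~1 hour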

The zfs destroy command sat for ~1 hour (for 147GB across ~10 datasets), during which I did some housekeeping (adjusting the poudriere config for the new pool, freebsd-update, pkg updates...) and decided to reboot the host - without first checking on the destroy command, which was still hanging/busy...
The system then sat at the "All buffers synced" message for almost half an hour before rebooting, then hung again during initialization of the zfs pools. About 2 hours later it finally proceeded to boot, and I started digging around for the culprit and/or possible damage.
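
In hindsight I should have checked whether the pool was still busy reclaiming the destroyed data before rebooting - the 'freeing' pool property shows how much space an asynchronous destroy still has to release (pool name is a placeholder again):

    zpool get freeing oldpool    # non-zero means a background destroy is still in progress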

First guess: one of the NVMe drives in the pool that I had moved the poudriere datasets off of (and was deleting them from) was dying. There were still several datasets left undeleted, and all operations on this pool sent the system into a near-catatonic state for several seconds. The nvme log data looks fine (rough commands after this paragraph), yet those are consumer-grade drives, so I won't assume all is OK just because there are no errors logged - been there before with SATA drives that dragged whole systems to a halt...
I then started to send|recv the remaining datasets from that pool to my storage pool (fearing an imminent hardware failure), which was painfully slow at the beginning, but after a while it *mostly* recovered to OK-ish speeds. There were still lots of phases during which the running zfs command and any newly issued zfs/zpool commands against that pool simply hung for seconds to minutes.
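
For reference, the "nvme log data" check above was basically this; the controller name is just an example, adjust it to whatever devlist reports:

    nvmecontrol devlist              # list NVMe controllers and namespaces
    nvmecontrol logpage -p 2 nvme0   # dump the SMART / Health Information log page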

During the send|recv I encountered a single dataset that always interrupted the send process at the exact same snapshot. That dataset only holds a few kB of config data for a chyves/bhyve VM and can be mounted and read/written normally, yet meddling with its snapshots or sending (-R) that dataset fails and always causes zfs commands to completely stall.
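
To narrow it down further, sending that dataset's snapshots one increment at a time (instead of -R) at least shows exactly which increment the transfer dies on - snapshot and dataset names below are placeholders:

    zfs list -t snapshot -r pool/chyves/guests/vm                # list the dataset's snapshots
    zfs send -i @snap1 pool/chyves/guests/vm@snap2 > /dev/null   # test each increment without writing it anywhere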

My best guess: since one of the two drives runs considerably hotter than the other for no apparent reason (other drives in the same carrier are at nominal temperatures), that drive is dying and goes dark from time to time (or at least its latency spikes). As said, I've seen such behaviour with SATA disks multiple times - all SMART data was fine, yet the drives went silent for seconds to minutes, sending the filesystem into a catatonic state.
The pool is currently exported and the system is back to perfectly normal behaviour and reboots without hangs. I have already ordered replacement drives, which should arrive tomorrow; then I will try to resilver the pool onto those new drives and mess around with that "broken" dataset to see if it can be deleted or repaired, e.g. by rolling back to a "working" snapshot.
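
The plan for the replacement drives is the usual replace-and-resilver, one drive at a time (pool and device names are placeholders for whatever the new drives show up as):

    zpool import nvmepool                   # bring the exported pool back online first
    zpool replace nvmepool nda0p1 nda2p1    # swap the suspect drive for a new one and let it resilver
    zpool status nvmepool                   # watch the resilver progress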


TL;DR: corrupted zfs metadata seems to have caused the same problem on my side.



PS: If any ZFS wizard is following along: as that system isn't 'critical' and all data from that pool is backed up, I can serve as a guinea pig and try to debug that issue. Although I'll need some guidance - I'm not completely unfamiliar with zdb or dtrace, but definitely not familiar enough with them or the innards of ZFS to pull that off by myself.
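
One thing I can already capture on my own the next time a zfs command stalls is its kernel stack, which should at least show where in ZFS it is stuck (a rough starting point, nothing more):

    procstat -kk $(pgrep -x zfs)    # kernel thread stacks of the hung zfs process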
 