Situations in which only a hard reboot helps?

On two occasions I had to do a hard reboot because a FreeBSD system was stuck on a connection and/or possibly a file handle.

The first was an NFS mount that got disconnected without being unmounted. I couldn't unmount it or do anything else with it, so I rebooted, and at the very end of the reboot I had to do a hard reboot anyway.

The second was when an npm install ... got stuck and kill -9 <PID> had no effect on the npm process. Every chown ..., rm -r ..., find ... process that tried to access the folder npm was stuck in hung as well, and Ctrl + C didn't work on any of them. There is an issue ticket for npm being stuck, with 143 comments so far, so it seems to be a common issue: https://github.com/npm/npm/issues/7862

In both cases I asked on the IRC channel, but at that moment no one had an alternative idea.

My questions:
- Is it possible that npm was stuck on a file/directory handle, or can a process that ignores "kill -9" only be waiting on a network connection?
- If both cases were network related, what could I have done? Run ifconfig <interface> down; ifconfig <interface> up and hope the machine comes back over SSH?
- How can I debug such cases? lsof -p didn't give me any meaningful information, apart from showing the same directory in all the stuck processes in the npm case.

FreeBSD 10.2 p8, bare metal host
 
Any time a process is deadlocked waiting for I/O and the I/O is not going to happen for whatever reason, hardware bug or something similar, there is no way to avoid a cold restart. The kernel has no recovery path if the hardware doesn't play nicely.
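A rough way to spot such processes, offered as a sketch rather than a definitive recipe: on FreeBSD, ps(1) marks a process in disk (or other short-term, uninterruptible) wait with a state that starts with D, and the wchan column names what it is sleeping on. The function names below are made up for illustration.

```shell
#!/bin/sh
# Sketch: list processes in uninterruptible wait (state starting with "D").
# filter_disk_wait and list_disk_wait are hypothetical helper names.

# Keep the header line plus every row whose second column (state) starts with D.
filter_disk_wait() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Ask ps for PID, state, wait channel and command of all processes,
# then keep only the ones stuck in uninterruptible wait.
list_disk_wait() {
    ps -axo pid,state,wchan,command | filter_disk_wait
}
```

Running list_disk_wait on the affected host should show the stuck processes and their wait channels. procstat -kk <PID> can then print the kernel stack of one of them, which may hint at the subsystem it is blocked in.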
 
kpa, the hardware was and is perfectly fine, with no related lines in dmesg. An NFS mount hanging after a disconnect is a common problem that has happened to other people, and it is easy to replicate. I cannot replicate the "npm install" bug, but there is definitely no hardware component there, just a software bug that has affected a lot of people on other OSes as well.
 
tankist02, I haven't looked into it, but what can I do if I have already mounted one and it got disconnected?

FreeBSD is one of the most stable systems ever, and somehow I cannot believe that for such simple situations only a hard reset can be a solution.
 
The behaviour you describe sounds like the intended default operation of mount_nfs(8):
If the server becomes unresponsive while an NFS file system is mounted,
any new or outstanding file operations on that file system will hang
uninterruptibly until the server comes back. To modify this default
behaviour, see the intr and soft options.
My workstation mounts my home directory over NFS. When the server goes offline, my X session does indeed become unresponsive. However, when the server comes back online, everything starts working again, as if it had only been paused.
Had I used the intr or soft option, which I believe is the behaviour you want, it would hose me by disconnecting all the open FDs. It just depends on how you want to use the tool.
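For reference, a minimal sketch of what opting into those options could look like. The server name, export path and mount point are placeholders, not taken from this thread, and nfs_mount_cmd is a hypothetical helper that only prints the command so it can be reviewed before being run as root.

```shell
#!/bin/sh
# Sketch: build an NFS mount command that uses the soft and intr options
# from mount_nfs(8) instead of the default hard, uninterruptible behaviour.
# fileserver:/export/home and /mnt/home are placeholder names.
NFS_OPTS="soft,intr"

# $1 = server:/export, $2 = local mount point
nfs_mount_cmd() {
    printf 'mount -t nfs -o %s %s %s\n' "$NFS_OPTS" "$1" "$2"
}

# Print the command for review; nothing is mounted by this script.
nfs_mount_cmd fileserver:/export/home /mnt/home
```

With soft, file operations fail after the retransmit limit instead of hanging forever, and with intr a signal can interrupt a blocked operation, so applications have to cope with I/O errors they would never see on a hard mount.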
 
I should have written that a bit differently. What I meant to say is that in addition any resource such as
tankist02, I haven't looked into it, but what can I do if I have already mounted one and it got disconnected?

FreeBSD is one of the most stable systems ever, and somehow I cannot believe that for such simple situations only a hard reset can be a solution.

Please, the problem here has nothing to do with the stability of FreeBSD; it is all about the design of NFS. As I noted above, the kernel has no way of recovering from a wait state that involves waiting for a resource that never appears, unless the subsystem itself (NFS in this case) is designed to abort gracefully from the error condition. NFS was originally designed quite naively, on the assumption that NFS shares would always come back after a timeout. Later revisions of NFS have tried to fix this by introducing the soft and intr mount modes, with varying success.
 
It's not just NFS.

FreeBSD and Samba not playing well? Reboot.

FreeBSD and some driver not playing well? Reboot.

I'm taking two systems offline (both running 9.3-release) because we cannot get them to stop rebooting or hard-locking on us.

Samba 4.x seems to cause all kinds of issues with kqueue and sendfile. Instead of just "failing" or making Samba stop working, the FreeBSD kernel hangs, requiring us to hold the power button on the server to shut it down.

I got around the issue by sticking with Samba 3.6 on the other system. That one has a whole different issue, though. We went to replace a drive, but every time we pull one (they are hot swappable), the FreeBSD kernel panics and reboots (and no errors are logged). I've tested drive pulls back when it was on FreeBSD 9.1, and it worked fine. (mfi driver)
 