How to Recover From Disk Causing Hung Processes

Is there any way to kill a process that is hung writing to disk? I have a failing backup disk which caused my Bacula process to hang; it won't die even with kill -9. This hung process has caused the PostgreSQL service to hang, as it's waiting on too many writes from the Bacula process. Unfortunately, the PostgreSQL server also runs other things at this location, which are down as a result. The PostgreSQL process won't let me kill it either. I tried rebooting the server, and now I can't log in either, and will have to wait until I am on site to reset the machine to restore services. I just want to know, for the future, if I have missed something that would allow me to get other services restored without a physical hard reset of the system.
 
and will have to wait until I am on site to reset machine to restore services.
No IPMI or remote KVM you can use? I can't live with servers that don't have IPMI any more, it saves me so many trips to the datacenter for silly things like hitting a reset button.

Just wanting to know in the future if I have missed something that would allow me to get other services restored without a physical hard reset of system.
Not that I'm aware of. I've had this happen sometimes too. Processes are waiting for signals that never come and as a result get hung up with no way to kill them. Instead of hard resetting try doing a soft power-off first. But that might get stuck on the same thing. A soft power-off would work its way through the normal shutdown procedures. You can always hard reset it if that gets stuck too.
 
It's a low-budget site (it's in my house). If only I had noticed the problem before I went into the office, I could have reset it, instead of discovering that my email wasn't working after I got to work. Remote KVMs cost money. I will be going home at lunch and can reset it then. It's basically used for my personal email and a simple personal web page, but also to test updates before I install them on the FreeBSD servers I manage at work.
 
The root cause of all these problems is that "firmware" is not perfect. By firmware, I mean things below the OS, including the code that runs in the disk interfaces (SATA interfaces on the motherboard, SAS HBAs), whatever is in the storage IO path like SAS expanders or SATA multiplexers, and the disk drives themselves. Ideally, they should all have relatively short timeouts (30 or 90 seconds), where any uncompleted IO request gets correctly aborted. One has to include the lowest levels of the kernel in the term "firmware", because the kernel drivers for certain hardware (like SCSI HBAs) need to participate in keeping track of the state of pending IOs, and in clearing aborted or stuck IOs. In practice, all that code has bugs. Usually they cause one of the parties to forget an IO (typically after an error), so the other party waits "forever" for the IO to complete.

I'm sorry if the following makes you even more depressed ... but reality can be ugly.

Usually a hardware reset fixes these problems, because it is sent not only to the motherboard, but also to the disk interfaces. Typically, that clears whatever stuck software state is causing the hang. But I have seen individual disk drives that are so broken that they will hang the bus on the next reboot, and make booting impossible. I had one at home (SATA disk attached directly to the motherboard, with a low-cost motherboard), and I had one at work (high-quality SAS disk with verified firmware version, installed in the most expensive disk enclosure ever built, and connected to a very high-quality computer). In the latter case, we ended up having to do a binary search among several hundred disks to find out which disk needed to be physically removed to allow the machine to reboot. No amount of IPMI or remote management/reboot would have helped against cases like that.

This just means that you going home at lunch may be a necessary reality.
 
No IPMI or remote KVM you can use? I can't live with servers that don't have IPMI any more, it saves me so many trips to the datacenter for silly things like hitting a reset button.
My tiny server at home doesn't have any remote management (the motherboard cost US$99). And given that it is the network router, it would be pretty hard to reach a remote management interface over the network when the server goes down and takes the network with it.

So I decided to use a watchdog instead. A few years ago, I bought a tiny little board, which is a hardware watchdog. It attaches with two wires to the hardware reset pins on the motherboard, and plugs into an internal USB port. After a powerup or reset, you need to start sending it a little command via USB within a few minutes, and then every 30 seconds or so. Writing the software for it (a tiny little daemon) took one evening. I bought it because we were going to be out of town for 3 weeks, and if our home server went down I would not be able to control garden sprinklers. It was pretty cheap, about $30 or so, and it worked. Fortunately, our machine never hung while we were on that long vacation. I've since taken it out again, because it gets really annoying during things like software updates. This evening I can dig out the information about that board.
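For anyone wanting to try the same thing, a daemon like that can be very small. What follows is only a sketch under stated assumptions, not the software I actually ran: the device path `/dev/cuaU0`, the keepalive byte `b"K"`, and the names `pet` and `run` are all hypothetical, and a real board will define its own command and device node.

```python
import time

# All three values are assumptions for illustration; check the board's
# documentation for its real device node, command, and timeout.
DEVICE_PATH = "/dev/cuaU0"   # hypothetical USB serial device node
KEEPALIVE = b"K"             # hypothetical "I am still alive" command
INTERVAL_SECONDS = 30        # must be shorter than the board's timeout


def pet(port):
    """Send one keepalive to the watchdog and push it out immediately."""
    port.write(KEEPALIVE)
    port.flush()


def run(port, interval=INTERVAL_SECONDS, max_pets=None, sleep=time.sleep):
    """Pet the watchdog forever, or max_pets times when a limit is given.

    Returns the number of keepalives sent. If this loop ever stops
    (because the machine hung), the board stops hearing from us and
    pulls the reset line.
    """
    sent = 0
    while max_pets is None or sent < max_pets:
        pet(port)
        sent += 1
        sleep(interval)
    return sent
```

In real use it would be started with something like `with open(DEVICE_PATH, "wb", buffering=0) as port: run(port)`. The loop deliberately touches nothing but the watchdog device, so a stuck disk cannot block the daemon itself; the flip side is that it only detects a hang of the whole machine, not a single wedged service.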
 
Is there any way to kill a process that is hung writing to disk? I have a failing backup disk which caused my Bacula process to hang, it won't die even with kill -9.

Yeah, and typically those processes show a "D" in the state column seen with ps, which means uninterruptible wait. The cause is usually bad hardware, but you know that already.
All other accesses to that hardware will then end up in the same state.
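
To see which processes are stuck that way, one can filter the ps output for a state beginning with "D". This is a minimal sketch, assuming a ps that accepts `-axo pid,state,wchan,command` (as FreeBSD's does) and prints "-" when there is no wait channel; the function names are made up for illustration.

```python
import subprocess


def parse_uninterruptible(ps_output):
    """Return (pid, state, wchan, command) for every row whose state starts with 'D'.

    Expects the text printed by `ps -axo pid,state,wchan,command`,
    including its header line.
    """
    stuck = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)         # command may contain spaces
        if len(fields) < 4:
            continue
        pid, state, wchan, command = fields
        if state.startswith("D"):            # D = uninterruptible (disk) wait
            stuck.append((int(pid), state, wchan, command))
    return stuck


def list_uninterruptible():
    """Run ps and return the processes currently in uninterruptible wait."""
    result = subprocess.run(
        ["ps", "-axo", "pid,state,wchan,command"],
        capture_output=True, text=True, check=True,
    )
    return parse_uninterruptible(result.stdout)
```

The wait channel column hints at what the process is blocked on. If several unrelated processes show up here, they are most likely all waiting on the same device.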

As far as I know, the only way to get a process out of that state is to have the device driver perform a reset. For disks, that could be done via camcontrol. But since the device is already in a state that the developer did not expect, there is a rather high probability that this will result in a complete system crash.
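
If you want to script that attempt, something like the following could wrap it. It is a sketch only, assuming FreeBSD's `camcontrol reset` accepts a device name such as `da3` or a `bus:target:lun` triple; the helper names are hypothetical, and the default is a dry run precisely because the reset may take the whole system down.

```python
import subprocess


def build_reset_command(target):
    """Build the argv for `camcontrol reset <target>`.

    target is a device name (e.g. 'da3') or a bus:target:lun triple
    (e.g. '2:0:0'). Values that look like options or contain whitespace
    are rejected so nothing unexpected reaches camcontrol.
    """
    if not target or target.startswith("-") or any(c.isspace() for c in target):
        raise ValueError("invalid reset target: %r" % (target,))
    return ["camcontrol", "reset", target]


def reset_device(target, dry_run=True):
    """Return the command line; run it only when dry_run is False.

    Warning: on hardware that is already misbehaving, the reset itself
    can hang or crash the system.
    """
    cmd = build_reset_command(target)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return " ".join(cmd)
```

Calling `reset_device("da3")` just returns the command string so it can be checked first; `reset_device("da3", dry_run=False)` actually issues the reset. `camcontrol devlist` shows which device names and bus:target:lun addresses exist.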

The root cause of all these problems is that "firmware" is not perfect. By firmware, I mean things below the OS, including the code that runs in the disk interfaces (SATA interfaces on the motherboard, SAS HBAs), whatever is in the storage IO path like SAS expanders or SATA multiplexers, and the disk drives themselves. Ideally, they should all have relatively short timeouts (30 or 90 seconds), where any uncompleted IO request gets correctly aborted.

It's about cost savings. The old SCSI stuff didn't do such things: as long as the bus was alright, the device could be as faulty as it liked, and one could always do a reset, or even rip it off the bus and recollect the remainders. But for that to work, one needs independent microcontrollers that are able to manage their own crap.
 