[Solved] Non-responsive ZFS at boot.

I've been beating my head against this one all day... (And panicking a bit, since I need this machine to receive email.)

I have a 10-RELEASE machine running root on ZFS - it's not new; I originally installed it with 8, and have upgraded since. Last night it stopped; the console was putting out 'indefinite wait for swap' errors. I have seen these before: I run swap on ZFS, and have been recently experimenting with a jail to run a CPAN smoker, which has run the machine out of RAM before.

I rebooted (hit the hardware restart) and it took ages to get past 'Mounting root filesystem' - and then didn't appear to get past 'mounting user filesystems'. After half an hour or so, I rebooted again, and went to single user mode; the hope was that it was RAM/swap related, and I could reconfigure the swap. (Or that a 'clean' shutdown of the filesystem would help, after I finished in single user mode.)

Nothing I've tried has helped so far - I've added a 3 GB swap device (a USB flash drive I had lying around), and have tried to do various things with the zpool - none of which have succeeded, or even run to completion. The zpool says it is fine - I'm trying to run a scrub on it right now (and am not confident of its success: under normal circumstances a scrub takes 13 hours, and I'd expect this one to take longer), but otherwise it appears to respond fairly normally - except that it's very slow. (zpool status will take three to four minutes, minimum.) If I'm not interacting with the filesystem, the system appears to work fairly normally - for single-user mode. (Though any command may take ages if it needs to load something or page something out.) It does seem to lock up at times - but that could be because I'm running ZFS commands. (Mostly the lockups were before I added the swap drive, and were complaining about not having swap. When it locks up I've been rebooting the hardware.)
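
For readers following along, the steps described above (add a temporary swap device, check the pool, start a scrub) might look roughly like the sketch below. It is not a transcript of what was actually typed: the pool name zroot and the device /dev/da0 are hypothetical placeholders, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: POOL and SWAPDEV are hypothetical placeholders,
# not values taken from the post. Adjust them for the real system.
POOL="${POOL:-zroot}"
SWAPDEV="${SWAPDEV:-/dev/da0}"
DRY_RUN="${DRY_RUN:-1}"

# Print each command; execute it only when DRY_RUN=0.
run() {
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

recovery_steps() {
    run swapon "$SWAPDEV"        # add the temporary swap device
    run swapinfo -h              # confirm the swap device is active
    run zpool status -v "$POOL"  # check pool health (can take minutes here)
    run zpool scrub "$POOL"      # start a scrub; it continues in the background
}

recovery_steps
```

Run as-is, it just lists the four commands. The dry-run default is deliberate: on a pool that already takes minutes to answer zpool status, it is worth reading the commands before committing to them.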

I have two possible culprits, of things changed recently: I updated ports yesterday (using portmaster) and last week I had to do some metalwork on the rack the server is in. The system has booted fine since both of these, and was running fine until this morning around 6. (I have an email I got around then in a client that downloaded it.) Otherwise there have been no changes in the past week.

Any ideas as to causes or something I can do? I can get into single-user, so it has to be working, somewhat. But it either takes over a half-hour (estimate, could be longer) to get into multi-user or it can't be done. There are two core files on the machine - from ~12 hours before it died, while I was updating ports, nothing more recent.
 
Re: Non-responsive ZFS at boot.

After waiting overnight and letting the scrub finish, I tried another reboot, and it succeeded. I'm still unsure what caused the issue, but I do have a theory: when the machine ran out of memory on Thursday night, it spent several hours trying to write to swap and failing. I'm guessing that put the ZFS write-ahead log out of joint; there were many log entries that hadn't been written to main storage and resolved. ZFS/FreeBSD was trying to resolve those before mounting the drives on boot. (My rebooting of course didn't help, as it had to start over.)

Running in single-user didn't load the ZFS swap file, so it mounted faster (it didn't have to resolve those log files), and my giving it some other swap space and letting it sit overnight let it finally resolve those log entries, putting the filesystem back in a 'current' state.

As I said: this is my guess. I wish I could verify it one way or the other.
 