A ZFS Christmas story

Hi gang,

Just wanted to share with you what happened to me, and hopefully this information is useful to someone. I'll stick to the technical parts as best I can, but I'm also still in the holiday mood ;)

So I run my own server, my girlfriend and I are also Minecraft addicts (well, not that extreme, but we really do enjoy playing), and we had planned to spend the holidays at my place. So far, so good.

Monday I noticed that my server didn't respond and quickly learned that it had issues. Both HDs were nowhere to be found in the setup and it obviously complained about not being able to boot due to missing boot devices. I turned off the machine, checked the hardware and noticed that the power connectors for the (non-SATA) HDs had become rather loose. So I replaced those and what do you know? My HDs were alive again. Because my setup uses a ZFS mirror I figured I wouldn't lose any data (I also had backups, but not very recent ones).

The main problem was that the machine refused to boot. I booted it in single user mode, issued an ls command, and the result was a kernel panic and a reboot.

Further investigation taught me that there was an issue with my ZFS pool. When I tried to import the pool normally ( # zpool import -fR /mnt zroot ) the system would crash. However, if I imported it read-only there was no immediate problem at all: # zpool import -o readonly=on -fNR /mnt zroot. So: I imported it read-only, set the temporary mount point to /mnt and also told the system not to automatically mount any filesystems. This is also where I learned that the initial crash had happened during a scrub, because zpool status -v showed me as much.
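For anyone who ends up in the same boat, this is roughly the sequence I mean (zroot is my pool name and /mnt my temporary altroot; adjust both to your own setup):

# zpool import
(lists the pools ZFS can find on the attached disks, without importing anything)
# zpool import -o readonly=on -fNR /mnt zroot
(force the import, read-only, don't auto-mount anything, and prefix all mount points with /mnt)
# zpool status -v zroot
(shows the state of the pool; in my case this is where the interrupted scrub showed up)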

Now what?

Well, thankfully (and somewhat to my surprise) I could still access almost all ZFS filesystems without a problem. The exception was zroot/var/db, which showed severe filesystem corruption in /var/db/ccache. Notable because some file entries only showed their filename without having any properties anymore (the system couldn't list those and only displayed an error).
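Since the pool was imported with -N nothing gets mounted automatically, so to have a look around you mount things by hand. A minimal sketch, the dataset names are from my pool and yours will differ:

# zfs list -r zroot
(shows all datasets in the pool)
# zfs mount -a
(mounts every dataset under the /mnt altroot; or mount them one by one, e.g. # zfs mount zroot/home)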

This allowed me to create a recent backup of all the data onto a local data disk. I then trashed the pool ( # zpool destroy zroot ), removed all traces from the partitions ( # zpool labelclear -f ada0p2, and the same for ada1p2), re-created zroot using # zpool create -m / -R /mnt zroot mirror ada0p2 ada1p2, and then restored my data.
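Going back a step: the backup itself was nothing fancy. A read-only pool won't let you create new snapshots, so if you go the zfs send route you have to work with snapshots you already have; the @latest snapshot name and the var_db.zfs file name below are just examples (and a plain cp or rsync from the mounted filesystems works just as well for data you have no snapshots of):

# zfs send zroot/home@latest > /net/backups/home.zfs
# zfs send zroot/var/db@latest > /net/backups/var_db.zfs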

Restoration was quite easy because I only needed to overwrite zroot itself; the rest was merely a matter of issuing regular receive commands, for example # zfs recv -v zroot/home < /net/backups/home.zfs.
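A minimal sketch of that restore loop, assuming dump files named like mine and a typical zroot layout; keep in mind that a parent dataset has to exist before you can receive a child like zroot/var/db into it:

# zfs recv -v zroot/var < /net/backups/var.zfs
# zfs recv -v zroot/var/db < /net/backups/var_db.zfs
# zfs list -r zroot
(quick check that everything is back where it belongs)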

After I had restored all the data I took the precaution of running another scrub while still in single user mode, just to make sure everything was fine. When that finished successfully I booted the machine again, and here we are.
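In commands, that final check is simply:

# zpool scrub zroot
# zpool status -v zroot
(run status every now and then; once the scrub has finished it should report no errors)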

On the one hand I'm a little worried about the limitations you run into when accessing a problematic ZFS pool. After all: if something does get corrupted there's a chance that you won't be able to access any of your filesystems. That's quite different from UFS on separate slices, where severe corruption in one filesystem doesn't necessarily affect the others.

Another point of concern is that if you import your pool read-only you won't be able to run any commands that try to clean up the problem. On one hand this makes sense, it is mounted read-only after all, but on the other hand it's a bit of a catch, because the only way to try and fix a problematic pool is by importing it.
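A small illustration of what I mean: with the pool imported read-only, anything that needs to write is simply refused, a scrub being the obvious example (I won't quote the exact error text, that differs per version):

# zpool scrub zroot
(refused on a read-only pool, because a scrub has to be able to write its repairs)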

And I learned that a normal (non-read-only) import will always immediately write data to the pool.

But on the other hand I'm also very impressed with the overall robustness of ZFS. For example: when I noticed these issues I removed one of the HDs "just in case", and I could still easily access and use either HD on its own to check what was going on.

Because I initially had doubts about the server hardware, I ended up connecting one of the HDs to my (FreeBSD) laptop over a USB interface, to verify that what I was experiencing (the kernel panics) was caused by the filesystem and not the hardware.
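In case someone wants to repeat that trick: over USB the disk shows up as a da device rather than ada (check gpart show to find the right partition, da0 is just my assumption here), and a single half of the mirror imports fine, it just sits in a DEGRADED state. And if your laptop already has its own zroot you can give the imported pool a different name on import; "rescue" below is only an example:

# zpool import
(shows the pool ZFS finds on the attached disk)
# zpool import -o readonly=on -fNR /mnt zroot rescue
(imports my zroot read-only under the temporary name rescue)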

So yeah, figured I'd share.

Moral of the story? ZFS + backups = a sure way to keep your data safe.
 