Corrupt ZFS ZIL (log) device

I accidentally corrupted the log device for my pool, and it's irrecoverable. Googling around, I ran across logfix. The problem is that FreeBSD doesn't seem to have /etc/zfs/zpool.cache.log (or zpool.cache.log), at least not for my non-root pool. So given that the log device GUID is gone, I see three options: modify the pool checksum to include a new log device, modify the pool checksum to not include a log device, or modify the import code to ignore an invalid checksum.

Does anyone have experience with any of those, or any other ideas on how to restore my pool?
 
If the ZFS pool version is under 19, and you do not have a mirrored log device, you're screwed. There's no recovery possible, and the only option is to re-create the pool from scratch, and restore the data from backups.

You do have backups, right? ;)
 
I do have backups that are a little more dated than I'd like, but why won't any of those schemes work? Will ZFS not just choke on the missing data for some files, but also have problems reading files that weren't modified?
 
If the ZFS version is under 19, the loss of a non-redundant ZIL means the loss of the pool. Simple as that. It's the way things are in ZFS land.
 
But why? Is there something fundamental the ZIL changes that makes every chunk of data irrecoverable? I can see why the last few minutes of data might be bad, or why a clean import isn't supported, but a complete loss sounds a bit extreme.
 
It's due to the way ZFS and the ZIL work. If a separate ZIL device is present, it is checked during boot to see if there's anything there to replay/copy into the pool. If that separate device is corrupt or missing, the pool cannot be imported. The data is still all present in the pool, but there's no way to import the pool, so there's no way to access the data.

With ZFSv19, the ability to remove ZIL devices was added. Later versions also added the ability to automatically roll back transactions to reach a consistent state, allowing a corrupted pool to be accessed.

There's some zdb hackery that can sometimes be done to access a pre-ZFSv19 pool with a corrupted ZIL. This is why the recommendation is to always use a mirrored ZIL device.
 
Mirrored ZIL makes little sense if all it does is sequential writing, IMO.

It would be better to hold off on a SLOG until ZFS v19. But I have a question: can't he use a system with ZFS pool v19+ to import his array? Or does v19 store data differently to make that possible, meaning he can't use the same trick with his older-version pool?

Very few SSDs can write safely (i.e. have a supercapacitor), so most would not be suitable for a SLOG. That's a pity, I think, since in my experience a SLOG can significantly enhance write performance.
 
The separate SLOG increases performance mostly because the ZLOG is not allocated from the main pool. This also helps reduce fragmentation, as the ZLOG records are variable size and are de-allocated after the commit. You need not use an SSD for the SLOG, but many people prefer to, because its size needs to be only a few megabytes and any current HDD is way larger.

But the idea is interesting: what happens with a pre-v19 zpool with a broken ZLOG if you try to import it in a post-v19 capable OS? Could you then remove the broken ZLOG without having to bump the version number? Will it import at all?

You could use OpenSolaris, but you could also use FreeBSD-CURRENT with the recent "up to v23" patches (it may be unstable, but all you want is to import the pool and remove the ZIL).

Another idea:
Before importing the pool, disable the ZIL by setting the following, possibly in /boot/loader.conf:
vfs.zfs.zil_disable=1
Then copy your data to a safer place :)
 
danbi said:
But the idea is interesting: what happens with a pre-v19 zpool with a broken ZLOG if you try to import it in a post-v19 capable OS? Could you then remove the broken ZLOG without having to bump the version number? Will it import at all?

Another idea:
Before importing the pool, disable the ZIL by setting the following, possibly in /boot/loader.conf:
vfs.zfs.zil_disable=1
Then copy your data to a safer place :)

zil_disable didn't work. I'll try recompiling and trying again with something that supports v19. I wonder if manually bumping up the version in a hex editor would work.

Then copy your data to a safer place

I ordered them yesterday. I'm also trying all this on a file-backed pool, first.
 
danbi said:
The separate SLOG increases performance mostly because the ZLOG is not allocated from the main pool. This also helps reduce fragmentation, as the ZLOG records are variable size and are de-allocated after the commit. You need not use an SSD for the SLOG, but many people prefer to, because its size needs to be only a few megabytes and any current HDD is way larger.

Would cheap USB pens do the same job? Or should it be something faster?
 
It depends on the write speed. The SLOG device needs to write faster than the pool. Some USB sticks may work, others may not. It all depends.
 
phoenix said:
It depends on the write speed. The SLOG device needs to write faster than the pool. Some USB sticks may work, others may not. It all depends.

Faster is a tricky measure. My cheap SSD writes at ~72MB/s. The flash sticks I just bought write at 8MB/s. But since we're talking random access throughput, hard drives will have a hard time keeping up with the flash stick. A hard drive has roughly a 15ms seek time, so it can make something like 60 writes per second on arbitrary sectors. Flash has negligible seek times, so with 16K NAND block size, you can make 512 writes per second to arbitrary blocks (not counting interface time).
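As a rough check of those figures, here is a back-of-the-envelope sketch in Python. The 15 ms seek time, the 8MB/s stick throughput and the 16K block size come from the numbers above; treating 1 MB as 2^20 bytes and ignoring interface and controller overhead are assumptions of the sketch.

```python
MIB = 1024 * 1024


def hdd_random_writes_per_sec(seek_time_s):
    """Upper bound on random writes/s if every write pays one full seek."""
    return 1.0 / seek_time_s


def flash_block_writes_per_sec(throughput_bytes_per_s, block_bytes):
    """Writes/s if the only limit is raw throughput in whole NAND blocks."""
    return throughput_bytes_per_s / block_bytes


hdd = hdd_random_writes_per_sec(0.015)                  # 15 ms per seek
flash = flash_block_writes_per_sec(8 * MIB, 16 * 1024)  # 8 MB/s, 16K blocks
print(f"HDD: ~{hdd:.0f} writes/s, flash stick: ~{flash:.0f} writes/s")
```

That works out to roughly 67 versus 512 writes per second, which matches the estimate above as long as the workload really is random.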
 
The SLOG should be written sequentially and under normal circumstances, never read back. So seek times might not matter here.
My experiments with cheap (4-8 MB/sec write) USB sticks show they are the limiting factor for performance. The same sticks perform wonderfully as L2ARC, if of sufficient size.

What is important for a SLOG is that it quickly says "data is safely written". So the best is battery-backed RAM, then a fast-writing SSD with big capacitors, then maybe hard drives.
 
A cheap alternative can be mirrored CompactFlash cards (they support ATA commands too) rated at 600x (85MB/s write speed). They cost around 50 euros each for 8GB cards.
 
I fixed it!

Here are the somewhat painful steps I went through:

1. Created image files of both drives. I put the images on ZFS, itself, and created snapshots so I could revert any changes.

2. Logfix requires the log device's GUID. It turns out that a sum of all the GUIDs is stored in the uberblock. I wrote some Java that inspects all the uberblocks in each vdev and prints out the sum. Note that it is an unsigned 64-bit int. Here's the code (the links are mirrors): http://pastebin.com/V3jWWcVV http://codetidy.com/423 http://ideone.com/d94OC Usage is java ZfsReader disk.img

3. I used mdconfig -f to attach the images as md devices, then did zdb -l <device> to read the labels and print the GUIDs (in base 10) of the devices in the pool.

4. The value from step 2 is a simple unsigned 64-bit int sum of all the GUIDs. Since we now know the GUIDs of the vdevs, figuring out the GUID of the missing log device just takes a bit of algebra. Be sure the result is a 64-bit int. If you're using a calculator, add or subtract 2^64 as necessary. I can't remember if the pool GUID was in the sum or not, so you might have to play around with this method on a test pool where you know all the GUIDs to confirm it works.

5. Once you have the recovered GUID for the log device, use logfix to build a new device. Rather than recompile logfix for FreeBSD, I used the precompiled binary on an OpenSolaris VM. I tried with OpenIndiana, but I got a segmentation fault, probably from a library mismatch.

6. The pool was never exported, and I couldn't export a broken pool, so I had to trick ZFS into thinking this was the same machine. ZFS seems to check two things here: hostname and machine ID. Hostname is just stored in /etc/rc.conf. The machine ID is set with sysctl kern.hostid=<target id>.

7. Once I had done all this and attached all the devices as mds, I could do zpool import. I think I ended up needing a -f or -F option, but I can't remember which.
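For anyone who wants to try step 2 without the Java program, here is a minimal Python sketch of the same idea. It is not the code behind the links above. It assumes the usual on-disk layout as I understand it: the first vdev label occupies the first 256 KiB of the device, the uberblock ring fills the second half of that label, and each uberblock starts with five 64-bit fields (magic, version, txg, guid sum, timestamp) with magic 0x00bab10c. It only looks at the first label, and the function names are made up for this example.

```python
import struct

UBERBLOCK_MAGIC = 0x00BAB10C
LABEL_SIZE = 256 * 1024              # assumed size of one vdev label
UBERBLOCK_RING_OFFSET = 128 * 1024   # assumed start of the uberblock ring
SLOT_STEP = 1024                     # assumed smallest uberblock slot size


def read_uberblocks(label):
    """Return (txg, guid_sum) for every slot that carries the uberblock magic."""
    found = []
    end = min(len(label), LABEL_SIZE)
    for off in range(UBERBLOCK_RING_OFFSET, end, SLOT_STEP):
        head = label[off:off + 40]
        if len(head) < 40:
            break
        # The byte order is that of the machine that wrote the pool,
        # so try both and let the magic number decide.
        for endian in ("<", ">"):
            magic, _version, txg, guid_sum, _ts = struct.unpack(endian + "5Q", head)
            if magic == UBERBLOCK_MAGIC:
                found.append((txg, guid_sum))
                break
    return found


def latest_guid_sum(label):
    """GUID sum from the uberblock with the highest transaction group."""
    uberblocks = read_uberblocks(label)
    if not uberblocks:
        raise ValueError("no valid uberblock found in this label")
    return max(uberblocks, key=lambda ub: ub[0])[1]


# Hypothetical usage on an image file:
# with open("disk.img", "rb") as f:
#     print(latest_guid_sum(f.read(LABEL_SIZE)))
```

Run it against a copy of the disk image, never the live device, and compare the result across all vdevs in the pool before trusting it.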

Sorry there's not more or a better explanation, but I was a little surprised I got this to work, myself, so I'm not really sure about some of the steps. The most important parts are probably reconstructing the log device's GUID and spoofing the host ID. The rest was halfway easy.
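The algebra in step 4 can be sketched in a few lines of Python, which sidesteps the calculator overflow problem because the wrap-around at 2^64 is applied explicitly. The function name and the GUID values below are made up for illustration. Whether the pool GUID belongs in the list of known GUIDs is the part I wasn't sure about, so try it both ways on a test pool.

```python
MASK64 = (1 << 64) - 1


def recover_missing_guid(guid_sum, known_guids):
    """Solve guid_sum == (sum(known_guids) + missing) mod 2**64 for missing."""
    return (guid_sum - sum(known_guids)) & MASK64


# Self-check with invented GUIDs: build a sum that wraps past 2**64,
# then confirm the "lost" GUID can be recovered from it.
known = [2**63 + 5, 2**63 + 7, 42]
lost = 1234567890123456789
total = (sum(known) + lost) & MASK64
print(recover_missing_guid(total, known) == lost)
```

Pass the GUIDs exactly as zdb -l prints them (base 10) and the sum from the uberblock. The result is the value to hand to logfix.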
 