ZFS Resilvering ZIL reads whole pool?

I had an unpleasant experience this week and am hoping someone can shed some light on it for me.

I have a raidz3 pool with 10 2TB disks plus mirrored ZIL on SSDs, running on FreeBSD 10.2. Pool status:
Code:
  pool: tank
 state: ONLINE
  scan: resilvered 0 in 20h41m with 0 errors on Mon Apr  8 22:10:54 2019
config:

        NAME               STATE     READ WRITE CKSUM
        tank               ONLINE       0     0     0
          raidz3-0         ONLINE       0     0     0
            label/zdisk1   ONLINE       0     0     0
            label/zdisk2   ONLINE       0     0     0
            label/zdisk3   ONLINE       0     0     0
            label/zdisk4   ONLINE       0     0     0
            label/zdisk5   ONLINE       0     0     0
            label/zdisk6   ONLINE       0     0     0
            label/zdisk7   ONLINE       0     0     0
            label/zdisk8   ONLINE       0     0     0
            label/zdisk9   ONLINE       0     0     0
            label/zdisk10  ONLINE       0     0     0
        logs
          mirror-1         ONLINE       0     0     0
            label/zil1     ONLINE       0     0     0
            label/zil2     ONLINE       0     0     0

The pool has about 1.3T of data in use. With snapshots it might be 1.6T, maybe a bit more.

"zil1" failed on Sunday night. I replaced it, and resilver started. It said it had to scan 1.8T, which surprised me, but claimed to be going at a very fast clip so I wasn't worried. But the longer it went, the "slower" it got. By the time it reached 95%, it was crawling at about 1% per hour. During the resilver, gstat showed all the data disks in the pool running at 80-100% utilization all the time, almost all reads.

I then did something dumb that doesn't affect the above, but I'll tell it anyway. My past experience is that adding a new log device is fast, so I gracefully removed the existing ZIL from the pool and re-added it (using the same physical devices). That worked quickly, and status said the pool was not degraded, but it ALSO said it was resilvering 2.2T of data, with no indication in the status output of what it was actually resilvering. gstat looked the same as before: 100% read load on the data disks in the pool, and light/normal activity on the ZIL disks. It started fast, slowed down, and did the last 7-8% at 1%/hr.

My questions are, from most to least important:
* Why does resilvering the ZIL need to scan the pool? I expected it to just take a few minutes to scan the small SSDs.
* What determines the amount of data it claims it needs to scan for a resilver?
* What's with the "phantom resilver" I experienced? Any idea what it was doing? I did find a couple of other oldish threads about that phenomenon, but no answers.

Perhaps some of these are bugs that have been fixed between FreeBSD 10 and 12. An upgrade of the server this happened on is being planned.

Thanks,

Mark
 
Warning: I'm just a ZFS user; I've never inspected its internal design or source code. So the following are general remarks about how RAID (and RAID with log devices) works.

In theory, resilvering only needs to read the data that is degraded, where the technical definition of degraded is: missing at least one redundancy copy. Typically, rebuilding that redundancy requires reading all the data for whatever unit of granularity (track/stripe/block) is degraded.

Let me give a hypothetical but realistic example. You are running raidz3 on 10 disks. ZFS takes the data in files plus metadata (like directories or inodes) and cuts it into blocks of a certain size, for example 1MB (let's not worry about whether "block" is exactly the term ZFS uses, nor whether 1MB is the right granularity). It takes each 1MB block, cuts it into 7 pieces of 1/7 MB each, and writes those to 7 of your disks. It then calculates 3 different "redundancies", which combine the bits of the 7 pieces, and writes them to the remaining 3 disks. (In traditional RAID5 that redundancy was a simple parity, which made for easier-to-understand sentences.)

If any disk fails, ZFS first has to go through its metadata to figure out which data and redundancy pieces were on the failed disk, which requires reading all the metadata. For each missing redundancy piece, it then needs to read the 7 data pieces, recalculate the redundancy, and write it to the new disk; for each missing data piece, it needs to read the 6 surviving data pieces plus 1 of the surviving redundancies, recalculate the missing piece, and write that.
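If it helps to see the counting spelled out, here is that example as a toy Python sketch. Everything in it is the simplification from the paragraph above (fixed 1MB blocks, a fixed 7+3 split, made-up disk numbering), not the real ZFS on-disk layout, and the parity math itself is left out entirely:
Code:
# Toy model of the 10-disk raidz3 example above -- purely illustrative,
# not the real ZFS on-disk format (real stripe widths vary per block).

RECORD  = 1 << 20            # one 1 MiB block of file data
DISKS   = 10                 # width of the raidz3 vdev
PARITY  = 3                  # raidz3 keeps 3 redundancy pieces per block
DATA    = DISKS - PARITY     # so 7 pieces of actual data
PIECE   = RECORD // DATA     # ~150 KiB lands on each disk

def pieces_to_read(failed_disk):
    """Which disks must be read to rebuild the piece that was on failed_disk?"""
    data_disks   = list(range(DATA))          # disks 0..6 hold data pieces
    parity_disks = list(range(DATA, DISKS))   # disks 7..9 hold redundancy
    if failed_disk in data_disks:
        # read the 6 surviving data pieces plus 1 redundancy piece
        return [d for d in data_disks if d != failed_disk] + parity_disks[:1]
    # a lost redundancy piece is recomputed from all 7 data pieces
    return data_disks

for failed in (3, 9):
    reads = pieces_to_read(failed)
    print(f"disk {failed} failed: read {len(reads)} pieces "
          f"(~{len(reads) * PIECE // 1024} KiB) to rebuild one 1 MiB block")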

If you count the amount of data that needs to be read (and in this example, fingers are sufficient), you can see that resilvering 1 disk requires reading the allocated data of the whole pool once. So if your pool was 43% full, it will read roughly 43% of its capacity (plus all the metadata).
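Put as a throwaway formula (the 2% metadata overhead is a pure guess, and I'm glossing over raw vs. usable capacity, compression, and so on):
Code:
# Back-of-the-envelope read volume for resilvering one failed data disk.
# The metadata overhead is a guess; raw vs. usable capacity is ignored.

def resilver_read_tb(raw_capacity_tb, fill_fraction, metadata_overhead=0.02):
    """Rough total the surviving disks must collectively read, in TB."""
    return raw_capacity_tb * fill_fraction * (1 + metadata_overhead)

print(resilver_read_tb(20, 0.43))       # ~8.8 TB for the 43%-full example
print(resilver_read_tb(20, 1.6 / 20))   # ~1.6 TB for ~1.6T allocated on 10 x 2TB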

I'm sure that for rebuilds of failed normal disks this logic works correctly in ZFS, because I've seen it work, and the resilvering speed agreed with a back-of-the-envelope estimate.

BUT: doing the math for a ZIL is more complicated. I simply don't know how ZFS handles redundancy of the ZIL. In theory the ZIL is never read from, as long as ZFS keeps running and has its copies in RAM; so if your ZIL rebuild were interrupted (power outage? reboot?), things would get hairy. How ZFS implements a ZIL rebuild, I don't know. It is an unusual corner case, which the implementors of ZFS probably didn't put much effort into optimizing (they have bigger fish to fry). And given that you are running an ancient ZFS version, I suspect your low (ridiculous?) ZIL resilvering performance might get better in future versions, or it might not. My advice: grit your teeth and wait. And upgrade as soon as practical.
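For what it's worth, the same kind of estimate applied to the numbers in your post, assuming the 20h41m in your status output is the 1.8T resilver you described:
Code:
# Sanity check on the numbers in the status output above.

scanned_tb = 1.8                    # what zpool status claimed it had to scan
hours      = 20 + 41 / 60           # elapsed time from the scan line

tb_per_hour = scanned_tb / hours
mb_per_sec  = scanned_tb * 1e6 / (hours * 3600)   # treating "T" as decimal TB
print(f"~{tb_per_hour:.2f} TB/h, roughly {mb_per_sec:.0f} MB/s aggregate")
That aggregate rate is a small fraction of what ten spinning disks can stream sequentially, so presumably most of the time went into small, scattered reads rather than into the sheer data volume.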
 
running on FreeBSD 10.2.
Keep in mind that FreeBSD 10.2 has been End-of-Life since December 2016 and is not supported any more. As a matter of fact, the entire 10 branch is EoL.

Perhaps some of these are bugs that have been fixed between FreeBSD 10 and 12. An upgrade of the server this happened on is being planned.
I would suggest holding off on 12 until 12.1 gets released, as 12.0 still has some teething problems. In the meantime, plan to upgrade to 11.3-RELEASE: https://www.freebsd.org/releases/11.3R/schedule.html
 