I had an unpleasant experience this week and am hoping someone can shed some light on it for me.
I have a raidz3 pool with 10 2TB disks plus mirrored ZIL on SSDs, running on FreeBSD 10.2. Pool status:
Code:
  pool: tank
 state: ONLINE
  scan: resilvered 0 in 20h41m with 0 errors on Mon Apr 8 22:10:54 2019
config:
        NAME                 STATE     READ WRITE CKSUM
        tank                 ONLINE       0     0     0
          raidz3-0           ONLINE       0     0     0
            label/zdisk1     ONLINE       0     0     0
            label/zdisk2     ONLINE       0     0     0
            label/zdisk3     ONLINE       0     0     0
            label/zdisk4     ONLINE       0     0     0
            label/zdisk5     ONLINE       0     0     0
            label/zdisk6     ONLINE       0     0     0
            label/zdisk7     ONLINE       0     0     0
            label/zdisk8     ONLINE       0     0     0
            label/zdisk9     ONLINE       0     0     0
            label/zdisk10    ONLINE       0     0     0
        logs
          mirror-1           ONLINE       0     0     0
            label/zil1       ONLINE       0     0     0
            label/zil2       ONLINE       0     0     0
The pool has about 1.3T of data in use. With snapshots it might be 1.6T, maybe a bit more.
"zil1" failed on Sunday night. I replaced it, and resilver started. It said it had to scan 1.8T, which surprised me, but claimed to be going at a very fast clip so I wasn't worried. But the longer it went, the "slower" it got. By the time it reached 95%, it was crawling at about 1% per hour. During the resilver, gstat showed all the data disks in the pool running at 80-100% utilization all the time, almost all reads.
I then did something dumb that doesn't impact the above, but I'll tell it anyway. My past experience is that adding a new log device is fast, so I gracefully removed the existing ZIL from the pool, and re-added it (using the same physical devices). That worked, quickly, and it said the pool was not degraded. But it ALSO said it was resilvering 2.2T of data, but with no indication in status output of what it was resilvering. gstat looked the same as before, 100% read load on the data disks in the pool, light/normal activity on the zil disks. Started fast, slowed down, did the last 7-8% at 1%/hr.
My questions are, from most to least important:
* Why does resilvering the ZIL need to scan the pool? I expected it to just take a few minutes to scan the small SSDs.
* What determines the amount of data it claims it needs to scan for a resilver?
* What's with the "phantom resilver" I experienced? Any idea what it was doing? I did find a couple of other oldish threads about that phenomenon, but no answers.
Perhaps some of these are bugs that have been fixed between FreeBSD 10 and 12. An upgrade of the affected server is being planned.
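If it happens again, I could log the progress figures over time rather than eyeball them. Below is a rough sketch of how that might be done. It assumes the in-progress lines of zpool status look like "X scanned out of Y ..." and "N% done", which is my recollection of the FreeBSD 10-era wording and may differ on other versions. The function names and the SAMPLE text are made up for illustration; SAMPLE is not captured from my pool.

```python
import re

# Assumed wording of the progress lines printed during a resilver, e.g.
#   "1.71T scanned out of 1.80T at 24.1M/s, 1h5m to go"
#   "0 resilvered, 95.00% done"
# Newer ZFS versions may format these lines differently.
SCANNED_RE = re.compile(r"([\d.]+[KMGTP]?) scanned out of ([\d.]+[KMGTP]?)")
DONE_RE = re.compile(r"([\d.]+)% done")

# Hypothetical sample text for illustration only.
SAMPLE = (
    "  scan: resilver in progress since Mon Jan  1 00:00:00 2018\n"
    "        1.71T scanned out of 1.80T at 24.1M/s, 1h5m to go\n"
    "        0 resilvered, 95.00% done\n"
)


def parse_resilver_progress(status_text):
    """Return (scanned, total, percent_done), or None if no progress lines are found."""
    scanned = SCANNED_RE.search(status_text)
    done = DONE_RE.search(status_text)
    if not scanned or not done:
        return None
    return scanned.group(1), scanned.group(2), float(done.group(1))


def hours_remaining(percent_done, percent_per_hour):
    """Estimate the hours left, assuming the observed rate stays constant."""
    if percent_per_hour <= 0:
        raise ValueError("rate must be positive")
    return (100.0 - percent_done) / percent_per_hour


if __name__ == "__main__":
    progress = parse_resilver_progress(SAMPLE)
    print(progress)
    if progress is not None:
        # At the ~1%/hr crawl described above.
        print(hours_remaining(progress[2], 1.0))
```

Feeding it the output of zpool status every few minutes and recording the percentage would show exactly where the slowdown starts, instead of my rough "1% per hour" estimate.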
Thanks,
Mark