Severe performance problems with ZFS scrub under some conditions

There seems to be a major performance regression in the handling of scrub operations on ZFS pools under some conditions. I have three servers with identical hardware, each with a 32TB ZFS pool, and all running 8-STABLE. Each system performs normal ZFS reads at 500MB/sec or faster; the only thing that is slow is the scrub operation.

One of the systems, running 8-STABLE as of this morning with the pool still at zfs v4 / zpool v15, scrubs at the expected rate of 500MB/sec+:
Code:
(0:15) rz1:/data# zpool status
  pool: data
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scan: scrub in progress since Mon Jul 18 18:04:14 2011
    907G scanned out of 12.4T at 577M/s, 5h47m to go
    0 repaired, 7.16% done
(0:16) rz1:/data# gstat -f twd
dT: 1.008s  w: 1.000s  filter: twd
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    6    626    626  39071    7.4      0      0    0.0   91.5| label/twd0
    5    622    622  39301    7.8      0      0    0.0   91.6| label/twd1
    7    618    618  39233    8.3      0      0    0.0   92.9| label/twd2
    8    606    606  38816    7.6      0      0    0.0   92.6| label/twd3
    5    605    605  39160    7.2      0      0    0.0   91.8| label/twd4
    5    597    597  39623    7.4      0      0    0.0   88.3| label/twd5
    4    600    600  39880    7.8      0      0    0.0   90.3| label/twd6
    4    581    581  39529    8.4      0      0    0.0   90.2| label/twd7
    8    586    586  38798    7.4      0      0    0.0   88.5| label/twd8
    6    609    609  38759    6.7      0      0    0.0   86.9| label/twd9
    9    585    585  38548    8.8      0      0    0.0   92.8| label/twd10
   10    619    619  39183    7.7      0      0    0.0   91.3| label/twd11
    9    602    602  38793    7.8      0      0    0.0   89.8| label/twd12
    9    631    631  38898    7.5      0      0    0.0   90.3| label/twd13
    6    612    612  39084    7.4      0      0    0.0   88.9| label/twd14
    0      0      0      0    0.0      0      0    0.0    0.0| label/twd15

This second system is running 8-STABLE as of this morning, with the pool at zfs v5 / zpool v28:
Code:
(0:11) rz2:/data# zpool status
  pool: data
 state: ONLINE
 scan: scrub in progress since Mon Jul 18 11:19:04 2011
    2.22T scanned out of 11.8T at 48.5M/s, 35h32m to go
    0 repaired, 18.79% done
(0:12) rz2:/data# gstat -f twd
dT: 1.022s  w: 1.000s  filter: twd
 L(q)  ops/s    r/s   kBps   ms/r    w/s   kBps   ms/w   %busy Name
    0     97     97   2976   36.8      0      0    0.0   84.0| label/twd0
   10    122    122   3803   72.7      0      0    0.0  100.1| label/twd1
    0     98     98   3005   34.7      0      0    0.0   82.4| label/twd2
    0    114    114   3496  170.5      0      0    0.0   95.7| label/twd3
   10    123    123   3793  129.4      0      0    0.0   99.9| label/twd4
    0    114    114   3453   50.4      0      0    0.0   85.5| label/twd5
    0    115    115   3516   48.5      0      0    0.0   91.2| label/twd6
    0    115    115   3483   69.4      0      0    0.0   90.1| label/twd7
    0    115    115   3485   59.2      0      0    0.0   91.8| label/twd8
    0    115    115   3516   72.4      0      0    0.0   93.9| label/twd9
    0     88     88   2700   36.4      0      0    0.0   80.3| label/twd10
    0     94     94   2880   71.2      0      0    0.0   84.9| label/twd11
    0     91     91   2813   44.3      0      0    0.0   85.6| label/twd12
    0     87     87   2687   31.2      0      0    0.0   78.8| label/twd13
    0     90     90   2781   32.0      0      0    0.0   80.4| label/twd14
    0      0      0      0    0.0      0      0    0.0    0.0| label/twd15
Note that even while that system is scrubbing at under 50MB/sec, with disks 90% busy, I can still read from the pool at a reasonable rate of 250MB/sec or so:
Code:
(0:13) rz2:/data# dd if=test.iso of=/dev/null bs=1m
29902+0 records in
29902+0 records out
31354519552 bytes transferred in 117.035626 secs (267905771 bytes/sec)
A third system is running 8-STABLE from June 6th and scrubs at the same slow rate as the second system above. To rule out any kernel mis-tuning I might have done, I booted from an mfsBSD ISO (8.2-RELEASE + ZFS v28) from March 2011 and saw the same low performance.

The hardware on each box is dual E5520 CPUs, 48GB RAM, 3Ware 9650SE-16ML with 16 WD2003FYYS drives, exported as individual drives. The OS is on a separate controller / drive - the only thing on the 3Ware/WD2003's is the ZFS pool.
 
To follow up on this (I'm not marking it "[Solved]" as I don't fully understand the issue): the scrubs completed in much less time than originally indicated. There seem to be long periods of very slow scrub performance, interspersed with a few intervals of normal (500MB/sec) performance.

This isn't related to the setting of vfs.zfs.scrub_limit. As you can see from the previous message, the first system achieves > 500MB/sec with a queue depth of 10 or fewer.
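(For anyone comparing: checking it is just a sysctl read, and on these boxes it reports the default of 10.)
Code:
# sysctl vfs.zfs.scrub_limit
vfs.zfs.scrub_limit: 10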

My current theory is that it is related to deduplication, since that wasn't supported on the v15 zpool, but is enabled on the v28 pools. If I feel adventurous, I'll wipe and re-create one of these pools with dedup off, reload 12TB of data, and see if the problem goes away. These systems have 48GB of RAM and fast CPUs, so I'm not sure why dedup would slow scrubs to a crawl.
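For reference, this is roughly what I plan to look at to confirm the dedup state on the v28 pools (the pool is named data, as in the outputs above; zpool list on a v28 pool also shows a DEDUP ratio column):
Code:
# zpool list data
# zfs get -r dedup data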
 
ZFS scrub touches each block in "linear-time" (i.e. temporal) order: the first block that was written to the pool is read/checked first, then the next block written, and so on. If you create and destroy a lot of snapshots, your pool will be fragmented and the disk heads will thrash around. If the pool is over 50% full, this becomes a real issue.

Ed Harvey has posted about this issue many, many, many, many times on the zfs-discuss list, going on and on about how non-optimal the current scrub code is.

On a clean pool, an empty pool, or a pool where you only ever write (never delete), scrub runs extremely fast, as the "linear-time" order of the blocks is the same as the logical block order on the disks, meaning the scrub is basically one long, continuous, in-order read where the drive heads rarely move.
 
phoenix said:
On a clean pool, an empty pool, or a pool where you only ever write (never delete), scrub runs extremely fast, as the "linear-time" order of the blocks is the same as the logical block order on the disks, meaning the scrub is basically one long, continuous, in-order read where the drive heads rarely move.
The interesting thing is, the "slow" pools should have a lot less of this sort of issue than the "fast" pool. The "fast" one has had lots of create/delete activity on it, going back over a year. The "slow" ones are copies (not snapshots) of the data on the "fast" server at a specific point in time, and haven't had any writes since. There are no snapshots on any of them.
 
Hrm, well that sucks. That should be almost the perfect setup for a fast scrub.

If you check the Warnings page in 3dm2, does it mention anything about drive timeouts or drop-outs on the "slow scrub" box?

If you lower the size of the ATA queue (vfs.zfs.vdev.max_pending="4" in loader.conf), does anything change? I've found on our ZFS boxes that a value of 4 really speeds things up compared to the default of 10 (IOPS in gstat hover around 140 per drive, instead of 80-100).

And, have you confirmed that the 3Ware settings are the same for both boxes? Same StorSave Profile, same queueing setting, auto-verify disabled, write cache enabled, yadda yadda?
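Roughly something like this is what I'd compare on the two boxes; I'm assuming the 3Ware is controller 0 and the pool drives are exported as individual units (u0, u1, ...). Look for the StorSave policy, queuing, write cache and auto-verify settings in the output:
Code:
# tw_cli /c0 show all
# tw_cli /c0/u0 show all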
 
phoenix said:
Hrm, well that sucks. That should be almost the perfect setup for a fast scrub.
Indeed.

If you check the Warnings page in 3dm2, does it mention anything about drive timeouts or drop-outs on the "slow scrub" box?

And, have you confirmed that the 3Ware settings are the same for both boxes? Same StorSave Profile, same queueing setting, auto-verify disabled, write cache enabled, yadda yadda?
No warnings / errors. Settings are identical. Normal read performance is identical (> 500MB/sec).

And even while the scrub is lazing along at 50MB/sec, I can simultaneously do reads of files and achieve > 250MB/sec. So it isn't something like the scrub seeking all over the disk and being slowed down by that.

If you lower the size of the ATA queue (vfs.zfs.vdev.max_pending="4" in loader.conf), does anything change? I've found on our ZFS boxes that a value of 4 really speeds things up compared to the default of 10 (IOPS in gstat hover around 140 per drive, instead of 80-100).
I haven't tried that. Note that on the box with the "good" scrub, I'm getting > 600 I/Os per second per drive (see my original post for an example) using the default value.
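When I get a chance I'll try it on one of the slow boxes. As far as I know max_pending is a loader tunable, so the plan would be to check the current value with sysctl, set it in /boot/loader.conf, and reboot:
Code:
# sysctl vfs.zfs.vdev.max_pending
# echo 'vfs.zfs.vdev.max_pending="4"' >> /boot/loader.conf
# shutdown -r now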
 
I don't have an explanation, but I have seen similar behavior on systems converted from ZFS v15 to v28. I haven't had a chance yet to play with max_pending.

Not confirmed, but it seems v28 issues more, smaller I/Os compared to v15.
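Rough math from the gstat output earlier in the thread points the same way: the fast (v15) box is doing roughly 39000 kBps / 620 ops ≈ 63kB per read on each drive, while the slow (v28) box is doing about 3500 kBps / 115 ops ≈ 30kB per read. So the slow scrub is issuing reads of roughly half the size, and far fewer of them per second.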
 
Hi,

I know scrubbing is auto-throttled under load. Exactly how much it throttles, or how it decides what counts as load, I can't say, but our Sun Unified Storage systems could throttle scrubbing/resilvering down to something like 5MB/s. Since scrubbing is the only thing that is throttled, perhaps that could explain why the scrub goes slowly while other things go faster?

/Sebulon
 
Sebulon said:
I know scrubbing is auto-throttled under load. Exactly how much it throttles, or how it decides what counts as load, I can't say, but our Sun Unified Storage systems could throttle scrubbing/resilvering down to something like 5MB/s. Since scrubbing is the only thing that is throttled, perhaps that could explain why the scrub goes slowly while other things go faster?
Perhaps - the only tunable I found was vfs.zfs.scrub_limit, which limits the queue depth. But all three boxes (one fast and two slow) have it set to the default of 10.
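If there are other scan/scrub throttle knobs in the v28 code, something like this should turn them up (it just lists the vfs.zfs sysctl subtree; the exact names seem to differ between the v15 and v28 code):
Code:
# sysctl vfs.zfs | egrep -i 'scrub|scan|resilver'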

The first box, with the older (v15) pool, completed its scrub in 6h23m:
Code:
(0:27) rz1:/tmp# zpool status
  pool: data
 state: ONLINE
status: The pool is formatted using an older on-disk format.  The pool can
        still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
        pool will no longer be accessible on older software versions.
 scan: scrub repaired 0 in 6h23m with 0 errors on Tue Jul 19 00:27:57 2011

One of the other boxes, running 8-STABLE from the same date, completed its scrub in 15h27m (which is a huge improvement over the 160-hour original estimate):
Code:
(0:1) rz1m:/sysprog/terry# zpool status
  pool: data
 state: ONLINE
 scan: scrub repaired 0 in 15h27m with 0 errors on Tue Jul 19 22:37:38 2011
 
I was having the same problem (slow scrub) and after using
Code:
vfs.zfs.vdev.max_pending="4"
the speed increased from 3MB/s to 400MB/s.

Update: My mistake, this option did not solve anything. I still have the same slow speed; the 400MB/s was in fact for resilvering, not for scrubbing.
 