How often do you scrub?

dvl@ · Nov 9, 2012

At present, my ZFS array scrubs every 7 days.

How often do you scrub?

In the past two years, I've seen no errors. *knock* *knock* Or rather, if I have seen errors, I've completely forgotten about them.

Given the rarity of such errors, I'm tempted to increase the scrub period to every 21 days.

Comments?

phoenix · Nov 9, 2012

I scrub our storage servers once a month, as it can take over a week to do a complete scrub on one of the boxes. These all have multiple raidz2 vdevs and 20-odd TB of storage. I've found a few dying drives this way (zpool status shows increasing errors, SMART shows errors, drives fall out of pool, etc).

Terry_Kennedy · Nov 10, 2012

phoenix said:
I scrub our storage servers once a month, as it can take over a week to do a complete scrub on one of the boxes.

Are you using dedup or something else that slows scrubs way down? Scrubs here run at about 2TB/hour:

Code:

[0] rz1:~> zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 0 in 6h46m with 0 errors on Thu Nov  8 22:28:05 2012
config:

        NAME             STATE     READ WRITE CKSUM
        data             ONLINE       0     0     0
          raidz1-0       ONLINE       0     0     0
            label/twd0   ONLINE       0     0     0
            label/twd1   ONLINE       0     0     0
            label/twd2   ONLINE       0     0     0
            label/twd3   ONLINE       0     0     0
            label/twd4   ONLINE       0     0     0
          raidz1-1       ONLINE       0     0     0
            label/twd5   ONLINE       0     0     0
            label/twd6   ONLINE       0     0     0
            label/twd7   ONLINE       0     0     0
            label/twd8   ONLINE       0     0     0
            label/twd9   ONLINE       0     0     0
          raidz1-2       ONLINE       0     0     0
            label/twd10  ONLINE       0     0     0
            label/twd11  ONLINE       0     0     0
            label/twd12  ONLINE       0     0     0
            label/twd13  ONLINE       0     0     0
            label/twd14  ONLINE       0     0     0
        logs
          da0            ONLINE       0     0     0
        spares
          label/twd15    AVAIL   

errors: No known data errors
[0] rz1:~> df -h /data
Filesystem    Size    Used   Avail Capacity  Mounted on
data           21T     12T    9.1T    57%    /data
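For anyone who wants to compare boxes, the average rate can be worked out from the scrub duration and the used space shown above. This is only a rough sketch: it assumes the `6h46m` duration format seen in this thread, treats the `Used` column from `df` as the amount scrubbed, and the function names are made up for illustration.

```python
import re

def parse_duration_hours(text):
    """Convert a duration such as '6h46m' (as in the scan: line) to hours."""
    match = re.fullmatch(r"(\d+)h(\d+)m", text)
    if match is None:
        raise ValueError(f"unrecognised duration: {text!r}")
    hours, minutes = (int(group) for group in match.groups())
    return hours + minutes / 60

def average_scrub_rate(used_tb, duration):
    """Average scrub rate in TB/hour for used_tb of data scrubbed in duration."""
    return used_tb / parse_duration_hours(duration)

# Figures from the output above: 12T used, scrub finished in 6h46m.
print(f"{average_scrub_rate(12, '6h46m'):.2f} TB/hour")  # 1.77 TB/hour
```

That is in line with the "about 2TB/hour" figure.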

bbzz · Nov 10, 2012

Maybe thousands of snapshots dating years back?

I scrub once a week. It's a much smaller pool, 6TB, and a scrub takes about 10 hours.

phoenix · Nov 11, 2012

40-odd filesystems, each with 300-odd snapshots, all deduped and compressed. The slowest box is also 90% full and takes about 200 hours to resilver a disk or scrub the pool. The fastest pool takes a little over 30 hours.

Sebulon · Nov 11, 2012

phoenix said:
40-odd filesystems, each with 300-odd snapshots, all deduped and compressed. The slowest box is also 90% full and takes about 200 hours to resilver a disk or scrub the pool. The fastest pool takes a little over 30 hours.

Yeah, of course it takes a very long time when the system is almost full. It's strange, though: you can have scrub and resilver performance like what Terry described here, but as soon as you go near dedup, performance drops to a couple of MB/s, regardless of how full the pool is. Well, that's my experience of it.

Scrubbing on our systems is run from cron every Sunday. I think the Admin guide describes that as best practice on enterprise systems, while once a month is sufficient for home systems. Of course it also depends on how fast your systems can actually scrub, as with phoenix's systems, but with his savings on dedup, I wouldn't be complaining.

Phoenix, how much savings was it you had, counting both compression and dedup? x6, x7?

/Sebulon

dvl@ · Nov 12, 2012

Terry_Kennedy said:
Are you using dedup or something else that slows scrubs way down? Scrubs here run at about 2TB/hour:

Hmmm, mine takes about 12 hours:

Code:

 scan: scrub repaired 0 in 11h54m with 0 errors on Sat Nov 10 15:10:45 2012

That's 8x2TB drives in a raidz2, so let's say 13TB.

Code:

$ zpool list
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
storage  12.7T  7.51T  5.18T    59%  1.00x  ONLINE  -

Mine is more like 500G/hour

nORKy · Nov 14, 2012

Code:

echo 'daily_scrub_zfs_enable="YES"' >> /etc/periodic.conf

Look at /etc/periodic/daily/800.scrub-zfs to see what it does and what you can configure.
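As far as I know, the periodic script runs daily but only starts a scrub once the last one is older than a threshold in days (I believe the knob is `daily_scrub_zfs_default_threshold`; check /etc/defaults/periodic.conf to confirm). The idea can be sketched roughly as below. This is not the script itself: the function names are hypothetical, and the parsing assumes the `scan:` line wording shown elsewhere in this thread.

```python
from datetime import datetime, timedelta

def last_scrub_time(scan_line):
    """Extract the completion time from a 'scan:' line of zpool status."""
    # Assumes the wording '... with 0 errors on <weekday> <month> <day> <time> <year>'.
    _, _, stamp = scan_line.partition(" on ")
    stamp = " ".join(stamp.split())  # collapse the double space before single-digit days
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")

def scrub_due(scan_line, now, threshold_days):
    """True when the last scrub finished at least threshold_days before now."""
    return now - last_scrub_time(scan_line) >= timedelta(days=threshold_days)

line = "scan: scrub repaired 0 in 11h54m with 0 errors on Sat Nov 10 15:10:45 2012"
now = datetime(2012, 11, 18, 3, 16, 11)
print(scrub_due(line, now, 7))   # True: a little over 7 days have passed
print(scrub_due(line, now, 21))  # False
```

With a 7-day threshold you get weekly scrubs, with 21 you get the every-three-weeks schedule asked about in the first post.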

Sfynx · Nov 14, 2012

Until recently, almost never: the pool got so full and fragmented (due to some youthful ZFS inexperience) that a scrub took forever to complete.
Now I've ashifted the pool to 12 by zfs sending the entire thing around, rearranging things and adding drives in the process, so now the scrub takes 5 hours tops and I do it every weekend during the night.

phoenix · Nov 15, 2012

Sebulon said:
Phoenix, how much savings was it you had, counting both compression and dedup? x6, x7?

Backup server for schools:

Code:

storage                          23.7T  2.06T   256K  none
dedup = 1.79, compress = 1.59, copies = 1.05, dedup * compress / copies = 2.71

So, about 50 TB of data in 24 TB of disk. This box has 20 GB of RAM (needs more).

Backups for admin sites:

Code:

storage                                 47.1T  4.87T   288K  none
dedup = 3.86, compress = 1.61, copies = 1.06, dedup * compress / copies = 5.88

So, about 275 TB of data in 47 TB of disk. This box has 64 GB of RAM (works well).

Off-site replica of the above two servers:

Code:

storage                                 39.7T  10.6T   288K  none
dedup = 2.41, compress = 1.55, copies = 1.06, dedup * compress / copies = 3.52

So, about 140 TB of data in 40 TB of disk. This box has 32 GB of RAM (needs more).
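The combined ratio in each listing is just dedup * compress / copies. A small sketch that recomputes it from the figures posted above (the function name is only for illustration). Because the inputs here are already rounded to two decimals, the result can differ slightly in the last digits from what the tool itself prints.

```python
def effective_ratio(dedup, compress, copies):
    """Combined space saving: dedup and compression multiply, extra copies divide."""
    return dedup * compress / copies

# (dedup, compress, copies) as posted for each box.
boxes = {
    "schools backups": (1.79, 1.59, 1.05),
    "admin backups": (3.86, 1.61, 1.06),
    "off-site replica": (2.41, 1.55, 1.06),
}

for name, figures in boxes.items():
    print(f"{name}: {effective_ratio(*figures):.2f}x")
```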

dvl@ · Nov 18, 2012

dvl@ said:
Hmmm, mine takes about 12 hours:

Code:

scan: scrub repaired 0 in 11h54m with 0 errors on Sat Nov 10 15:10:45 2012

That's 8x2TB drives in a raidz2, so let's say 13TB.

Code:

$ zpool list
NAME      SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
storage  12.7T  7.51T  5.18T    59%  1.00x  ONLINE  -

Mine is more like 500G/hour

Scrub is underway now:

Code:

$ zpool status storage
  pool: storage
 state: ONLINE
 scan: scrub in progress since Sun Nov 18 03:16:11 2012
    6.20T scanned out of 7.77T at 141M/s, 3h14m to go
    0 repaired, 79.81% done

141M/s = 8460M/minute = 507600M/hour = 495G/hour

OK, my estimate was more or less accurate.
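The same arithmetic as a quick sketch, which also checks the "to go" estimate from the status output. It assumes the units are binary (1G = 1024M, 1T = 1024G), and the function names are made up for illustration.

```python
def mb_per_s_to_gb_per_hour(rate_mb_s):
    """Convert a scan rate in M/s to G/hour, taking 1G = 1024M."""
    return rate_mb_s * 3600 / 1024

def eta_seconds(scanned_tb, total_tb, rate_mb_s):
    """Seconds left to scan the remainder at the current rate."""
    remaining_mb = (total_tb - scanned_tb) * 1024 * 1024
    return remaining_mb / rate_mb_s

rate = mb_per_s_to_gb_per_hour(141)
seconds = int(eta_seconds(6.20, 7.77, 141))
hours, remainder = divmod(seconds, 3600)
print(f"{rate:.1f} G/hour, {hours}h{remainder // 60}m to go")
# 495.7 G/hour, 3h14m to go
```

Both numbers agree with what zpool status reported.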

FYI, most of this cluster is compressed and all of it is running off these HDD:

- Hitachi GST Deskstar HD32000 IDK/7K (0S00164) 2TB 7200 RPM 32MB Cache SATA 3.0Gb/s

using this controller:

- SYBA SY-PEX40008 PCI Express SATA II (3.0Gb/s)

Some more information here on this box.