ZFS Scrub task best practice

I am looking at scheduling the ZFS scrub task, something I understand is important for keeping my ZFS pool healthy. My question is: how often should I run the task, and what kinds of environmental factors affect this choice?
In short: what are the scrub task best practices?

In my case, I have 4 x 3.64 TB drives set up as a raidz2 pool.
I am also assuming that a scrub is a performance hit, so I should not schedule it when I expect heavy use of the NAS. Correct?

Any information, or a link to information, would be very helpful!
Thanks for your time.
 
I've no idea what "best practices" for zfs scrub are, but I'll note that the default of every 35 days has served me well for many years now. FreeBSD handles this via the periodic script /etc/periodic/daily/800.scrub-zfs which gets its default values from /etc/defaults/periodic.conf:
Code:
root@kg-core2# grep scrub_zfs /etc/defaults/periodic.conf
daily_scrub_zfs_enable="NO"
daily_scrub_zfs_pools=""            # empty string selects all pools
daily_scrub_zfs_default_threshold="35"        # days between scrubs
#daily_scrub_zfs_${poolname}_threshold="35"    # pool specific threshold
If you want to change any of them, put your overrides in /etc/periodic.conf (or /etc/periodic.conf.local), not in the defaults file.
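As a minimal sketch (the pool name "tank" is just a placeholder for your own pool), the overrides could look like this:
Code:
# /etc/periodic.conf -- overrides for the daily periodic run
daily_scrub_zfs_enable="YES"            # turn on 800.scrub-zfs
daily_scrub_zfs_pools="tank"            # limit to one pool; leave empty for all pools
daily_scrub_zfs_default_threshold="35"  # days between scrubs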
 
Like tingo, I have no idea what best practice is. I sometimes do only annual scrubs, and the last scrub of my main box (also a sort of NAS) dates from Aug 2019. I don't claim that is a good example, but it works for my media.
I assume your NAS is also built for energy saving, which usually means a weak CPU. When I do my scrubs, I shut down all server functionality (SMB, NFS, SQL, AFP) to let ZFS pull the 4 x 3 TB raidz1 through the scrub quickly.
You also need to consider that a scrub is hard work for the drives, and their temperature rises during it. Even with my rare use of scrubs, I still have my 10-year-old drives in the cage, and smartctl tells me every day that they are OK.
All files you normally touch are checked anyway during those accesses; only files that have not been touched for a long time are checked exclusively by a scrub.
Michael Lucas suggests in his book: if you have server-grade hardware, a quarterly scrub should be sufficient; if it is more home-grade equipment, he suggests a monthly schedule.
 
Scrubbing is a very low-priority background job; you don't have to stop any services for it. Especially because it imposes almost no load on the CPU - only checksums are verified, which could essentially be handled by the equivalent of a modern toaster...
But yes, scrubbing does impose load on the drives and therefore reduces I/O throughput and increases latency on the pool, so better not to run a scrub during Monday-morning office hours on the main storage pool.
But it's MUCH better to have a drive showing errors during a monthly (or bi-weekly) scrub than to wait a full year until maybe 2 other drives are also nearly dying and all of them return no/invalid data for the same blocks. In that case even ZFS can't recover. The goal should be to get a bad drive detected and out of the pool ASAP, not to look away and pretend/hope everything is fine. If some drives die prematurely from the light load of monthly scrubs, they are simply crap and you should stay away from that series/model in the future. (And make sure to beat the remaining drives of that series/model to death...)

Given that a NAS should have "high-endurance"/enterprise disks rated for 24/7 operation and higher daily I/O than a desktop drive anyway, there is absolutely nothing wrong with a monthly scrub. If you think you've gotten a bad batch of disks (which annoyingly seems to have become more common again than ~5-10 years ago...), you can even schedule a weekly scrub to catch a bad drive ASAP and get it RMA'd within the warranty period.


Personally I've always scheduled scrubs according to the importance of the data on the pool and/or how "hot" the data is, using monthly as a baseline. If most of the data on the pool is accessed/modified very often, then it is constantly verified anyway, so monthly scrubs are sufficient. If the data is very important (e.g. a warm backup pool) but seldom read or modified (backups in the form of ZFS snapshots are never modified...), I still go for monthly or even bi-weekly scrubs. On desktops and laptops I just stick with monthly.
I usually run scrubs on Friday evenings so that a) in case of fire I can drop in on Saturday and sort everything out without everyone running around screaming, and b) the scrubs are finished before the big (non-ZFS-snapshot-based) backup jobs run.
And as I said - if I suspect some drives might become faulty (e.g. they start showing reallocated sectors), I often crank that pool up to weekly scrubs to get those drives weeded out ASAP. (E.g. we had a series of ~10 Seagate Constellation ES 3TB drives that all failed within the first ~18 months.) Of course you always have to make sure you have enough spare drives (NOT from that faulty series!) allocated to a pool, and ideally mix different drives (age, vendor, model...) within a vdev to prevent multiple disks within a vdev dying at the same time.
 
Sorry, I have a different opinion: zfs scrub does not replace S.M.A.R.T. - if you hope to find disk errors with zfs scrub, it is much too late. Better to use smartmontools to regularly check the drives behind your vdevs; there you will identify upcoming drive failures earlier.

I agree that the drives you use should be suited for 24/7 NAS operation. But you should _not_ automatically use enterprise drives in a NAS - it depends on the build (e.g. 4 drives vs. many more). Here he has only 4 drives, and enterprise drives may even fail earlier due to vibrations, as these cages are built differently from SAN cages. You also don't need high-speed drives (5400 vs. 7200+ RPM) in a private NAS. I also suggest buying drives from different sources, at different times, and in smaller batches. Also look at the MTBF and, more importantly, at the non-recoverable read error rate (e.g. compare surveillance drives with NAS drives).

A scrub might not create much CPU load, but it will use memory - and that is one of the reasons why I stop all services during a scrub. It also speeds up the scrub. You can influence the load with some sysctls; look for vfs.zfs.scrub_... and vfs.zfs.resilver_...
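The exact names differ between the legacy FreeBSD ZFS and OpenZFS 2.x, so a quick way to see what your system actually exposes (just a sketch) is:
Code:
# list the scrub/resilver/scan-related ZFS tunables together with their descriptions
sysctl -ad | grep -E 'vfs\.zfs\.(scrub|resilver|scan)'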

You should also be aware that during a scrub you could lose data: a file in which the scrub identifies errors that cannot be recovered/repaired will no longer be available, as ZFS will not hand you files with errors. So make sure you have backups ready. You should also remember that ZFS always checks the files you access, so on a heavily used pool a scrub may be less important than on a pool full of archives.
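If a scrub does report unrecoverable errors, the affected files can be listed so you know what to restore from backup; a sketch with a placeholder pool name:
Code:
# -v lists the files with permanent (unrecoverable) errors after a scrub
zpool status -v tank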
 
Sorry, I have a different opinion: zfs scrub does not replace S.M.A.R.T. - if you hope to find disk errors with zfs scrub, it is much too late. Better to use smartmontools to regularly check the drives behind your vdevs; there you will identify upcoming drive failures earlier.

SMART data can be helpful, but the overall SMART "health status" is usually useless, as most drives already return false data or lock up completely before they admit via SMART that they are dying. 80% of the time ZFS throws out a soon-to-be-deceased drive well before any of the SMART thresholds are hit and the drive firmware admits there is an upcoming failure.
Worst of all: a LOT of drives will try to act like everything's fine instead of just dying and staying silent. We've had systems that slowed to a crawl, locked-up HBAs, and systems that were unable to boot due to failing drives. The most annoying and extreme failure scenarios were induced by SATA drives, BTW. SAS drives seem to behave better and just die and stay dead...

... enterprise drives may even fail earlier due to vibrations, as these cages are built differently from SAN cages. You also don't need high-speed drives (5400 vs. 7200+ RPM) in a private NAS. I also suggest buying drives from different sources, at different times, and in smaller batches. Also look at the MTBF and, more importantly, at the non-recoverable read error rate (e.g. compare surveillance drives with NAS drives).
Enterprise drives have much higher vibration tolerance, MUCH higher MTBF, and lower specified error rates (usually 1 in 10^15 vs. 1 in 10^12 or 10^13 for consumer drives). The fact that "NAS drives" are often just glorified desktop drives with some additional firmware features, at nearly the same price as the lower enterprise series, should also simplify the decision...
We once tried 6 WD Red drives for a 3rd-tier backup NAS. They were only a little cheaper than the RE series at that time, but offered only 3 vs. 5 years of warranty, and still 3 of those drives failed well within that 3-year period, so we went for RE (-3 or -4, IIRC) drives to also replace the remaining Red drives. The RE drives are still running and are due to be replaced only because of their age, not because of failures...

You should also be aware that during a scrub you could lose data: a file in which the scrub identifies errors that cannot be recovered/repaired will no longer be available, as ZFS will not hand you files with errors. So make sure you have backups ready.
ZFS never replaces a backup strategy. And data loss because all copies can't be read due to drive failures is not tied to scrubs; but with regular scrubs you are more likely to catch the first erroneous copy of a block, which ZFS can easily repair as long as the other copies are intact. The longer you let the data rest without verifying it (via scrubs or by reading it), the higher the risk of multiple blocks and copies being corrupt... Again: yearly scrubs are only a "looking away" tactic. Just because you don't check doesn't mean there's no error...
 
I agree that even a S.M.A.R.T.-monitored drive can die ;) and, as said, even a drive that is healthy per the spec. If you look at the non-recoverable read error rate, you can calculate that on a 3 TB drive the loss of several MB is within spec - so the drive is still "healthy" even if you lose data. Normally the drive tries to remap sectors to keep itself up and running by using spare sectors. You also need to configure e.g. smartmontools properly to get good diagnostic results from the drives.
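As a sketch of what configuring smartmontools can look like (device name, schedule, and mail target are only examples, not recommendations):
Code:
# one-off check of health status and attributes
smartctl -H -A /dev/ada0
# /usr/local/etc/smartd.conf -- monitor everything, short self-test daily at 02:00,
# long self-test on Saturdays at 03:00, mail root on trouble
/dev/ada0 -a -m root -s (S/../.././02|L/../../6/03)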

As said, I still have a different opinion, especially when I look at genfoch01's NAS specs. I also think that a scrub puts additional load on your drives, so the more frequently you run it, the more you age the drives.
Anyhow, I hope genfoch01 now has enough information to decide on and schedule the scrub 👩‍🔧
 
Thanks for the responses. In retrospect I should have provided more of my specs; I just wasn't thinking.
My NAS has an i3-9100 3.6 GHz CPU, the drives are WD Red Pro, and it has 32 GB of ECC RAM.

From the responses you all have given (and thanks to everyone for that!):

The scrub task does not use a lot of CPU.
The scrub task does put load on the disks.
This is a long-running job, and I need to track drive temperature to ensure the high load does not harm the disks.
I plan to set up a Grafana dashboard for this (once I get the NAS working, anyway).
The scrub is more reliable at detecting a disk fault than SMART, though I plan to have both configured to report errors.
The scrub will use memory, but I'm not clear on how much.
I'd like to think 32 GB of RAM should handle the scrub while leaving more than enough for moderate use.
I don't know if the amount of RAM used is related to the total amount of data or if it will use all the RAM it can get.
There is no single best practice for how often you should run a scrub.
Some of my data is important, which would imply the scrub should be run more often.
On the other hand, a scrub does put more wear on the drives and could impact performance,
so there is a balancing act between these two.

ZFS/scrub/RAID (whatever) does not remove the need for a backup. Putting it all together, it seems the backups (and the number of backups retained) are part of the scrub equation: if you run a scrub, say, once a month and backups once a week, you could get data corruption right after the last scrub, which would then be included in your backups (giving you backups with corrupt data) and which you would not notice until the next scrub. So my oldest retained backup needs to be older than my last scrub?

Does the act of backing up (reading the data) run the same validation as a scrub? If so, if I back up my entire NAS weekly, would I ever need to run a scrub? (Note: I was thinking about exporting a snapshot as a backup. I have read this is how it can be done, though I'm still doing research and have not actually tried it.)
 
In short: you need both, backup and scrub.
And now the long story 😃
OK, with 32 GB of memory and only 4 drives you should have enough memory to run the scrub without stopping the services - in my ten-year-old box I could only fit 16 GB of ECC. There is a sysctl to assign more or less memory to ZFS - I can't remember the name right now, I need to look it up. One rule is: unused memory is wasted memory! So you might look up how to adjust it for your situation.
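The tunable meant here is most likely the ARC size limit; a sketch (the value is only an example, and on newer OpenZFS the name is vfs.zfs.arc.max, with the old name kept as an alias):
Code:
# /boot/loader.conf -- cap the ZFS ARC so the services keep enough free memory
vfs.zfs.arc_max="17179869184"   # 16 GiB, in bytes
# check the current ARC size at runtime:
# sysctl kstat.zfs.misc.arcstats.size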

When ZFS reads a block, it always verifies the stored checksum against the calculated one to identify an error. In the case of RAID it can use the parity to correct it, and if that doesn't work, there may be additional stored copies that give it a second chance.

The challenge is to push ZFS to read the blocks. Depending on how you back up, you might touch all blocks - but that is a worst-case scenario, as it means a lot of useless reading of empty blocks and reading the same used blocks multiple times during the backup.

If you use zfs send/recv this does not happen; there you only read the used blocks. It goes a step further: when you send a full snapshot, you read all used blocks of that dataset, but when you then do an incremental send/recv after the first full send, you only read the blocks that have changed - so not even all used blocks, and never unused blocks. So depending on your backup strategy (how often you do a full send, how often incremental) you touch/verify more or fewer used blocks. (Empty blocks never! SMART doesn't care about block usage; there you can test the whole disk at a low level.) But to be clear: your backup (using zfs send/recv) will never contain corrupt data.
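A minimal sketch of that full-then-incremental pattern (pool, dataset, and snapshot names are placeholders):
Code:
# first, a full send of one dataset to the backup pool
zfs snapshot tank/data@2024-01-01
zfs send tank/data@2024-01-01 | zfs recv backup/data
# later: only the blocks that changed between the two snapshots are read and sent
zfs snapshot tank/data@2024-01-08
zfs send -i tank/data@2024-01-01 tank/data@2024-01-08 | zfs recv backup/data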

So if you do incremental backups with send/recv, your backup does not replace a scrub; you still need to schedule one.
Also, scrub is a pool-level command, while your backup works at the dataset level - so if you don't back up every dataset, you won't read every block of your pool the way a scrub does.

I am not sure how ZFS handles fragmentation. When you destroy a dataset, the space is not available immediately - the blocks are just marked and freed asynchronously. IMHO: if you immediately create a new dataset, I _believe_ ZFS uses empty blocks and not the just-destroyed blocks. That means there may be blocks that end up empty later but have already been read and verified.

Oh, for backups with send/recv you might also have a look at sysutils/zxfer.
 
This is a very complicated question. One on which there isn't any good agreement in the storage research or industry community, as far as I know.

To begin with, as others said above, scrub is only one part of a strategy to keep a redundant storage system healthy. Another part of the strategy is SMART, and other forms of checking disk health (like tracking disk performance, which can be an early indicator of reliability problems). SMART is neither a panacea (which reliably predicts disk failure long before any data is lost on the drive), nor completely useless. Instead it is somewhat reliable: it often predicts disk data loss, but sometimes drives fail (completely or gradually) without SMART giving any warning, and sometimes SMART declares a drive to be PFA dead, and then the drive continues functioning for a long period. But one should not ignore SMART just because it isn't perfect.

Another vital tool is backup, because even with the best failure prediction and failure search, the system will fail occasionally - for example due to effects that redundancy can't help against, such as correlated failures. Those are often failures of the wetware: a human doing something very wrong. One classic example is "rm -Rf /", which no amount of RAID guards you against, but the data can be restored from a backup.

That leaves the question of: how often should one scrub? That's a terribly difficult question. There are three forces in play.

First, scrub catches errors, both CRC and metadata inconsistency errors that the disk drive (and SMART) can't even begin to catch, and latent errors of the disk hardware. Since the number of errors increases over time (sometimes stepwise, when a whole drive fails at once and remains failed), scrubbing early can only help, since it might (not always!) catch errors while their number is still small. This argues that one should be scrubbing as much as possible. A theoretically optimal implementation would be that the drive is always scrubbing (as fast as it can) when there is no foreground workload. In reality, this is not practical, since moving the head in the short idleness gap between two IOs of the foreground workload will destroy performance. But using QoS techniques such as disk schedulers that are aware of different classes of service (such as emergency resilvering, foreground workload, and background scrub) one can approximate this. AFAIK, ZFS does have some IO scheduling mechanisms. If they were perfect, then scrub should not affect foreground workload performance, which brings us to ...

Second, scrub does affect foreground workload performance. Whether the effect is huge (system basically unusable while scrubbing, need to shut all services that use the file system down before scrubbing) or small but measurable (slight slowdown while scrubbing) is a matter of a lot of debate. I think the reason for the debate is that it depends heavily on the setup of the system. My personal experience is: while scrubbing, the file system is very slow, so much so that human activities that are IO intensive (like building large software, or organizing and moving lots of files) are painful. If I hit the system simultaneously with scrub, backup (which walks all file system metadata) and the nightly periodic run, it may become so slow that I need to reboot to regain control of the system. And since I have a disk that is shared between two zpools, if I scrub both pools simultaneously, performance becomes ridiculously low. For this reason, I have arranged my scrub so it finishes in a few hours, and I start at most one scrub in the middle of the night (at 1:15, right after the last hourly backup at 1am), and I suspend most other nighttime maintenance activities (such as periodic and backup) while scrub is running, so scrub gets done by 7am, when normal human activity may resume. But: My system is very small, only three disk drives in use by ZFS (of which one is a slow backup disk), and very little memory (only 3 GiB in use, due to the limitations of a 32-bit system). And being a home server, there is virtually no activity during the night (since the humans who could cause activity are sleeping).
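In cron terms that schedule is nothing more than something like this (the pool name and day of week here are placeholders):
Code:
# /etc/crontab -- start the scrub at 01:15, right after the 01:00 hourly backup
#minute hour mday month wday who   command
15      1    *    *     6    root  zpool scrub mypool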

Part of the question of how scrub interacts with foreground workload performance is the converse: how fast does scrub run? And that depends crucially on (a) how large and complex the zpool is, and (b) what the foreground workload is doing. As an example, on my system the largest pool is a 3 TiB pool on two mirrored drives, and that scrub takes about 3-4 hours, so it is reliably done before the 7am deadline. But I know that other people (with much larger pools containing more and larger drives) have scrubs that take a day or two, and in some cases a week. That pretty much means that scrub has to run while there is foreground activity, and this will make the scrub run for a long time.

Third, and this is the most difficult tradeoff. As I argued above, scrub improves reliability by detecting errors early. Great. But does scrub also cause errors? The answer is: unfortunately, yes. That's because with modern disks and extremely low head fly-heights, any access to the disk (both read and write) causes wear and tear and makes the data less reliable. One way this is visible is that disk drive vendors now specify a maximum I/O rate (the number is typically 550 TB/year), and above that rate the warranty on the drive becomes void. The disk vendors do that for a good reason: any I/O increases the number of data errors, and above a certain I/O rate their published and contractually warranted error rate (10^-14 or -15 or around there) can no longer be held. But note that 550 TB/year means that a 20 TB drive can only be read 27.5 times per year. Which means: if there were absolutely no foreground workload, then scrubbing every 2 weeks would already use all the available I/O. So just from this simple bookkeeping argument, one should probably not scrub more often than once a month on a system with modern large drives. Since my personal disks are much smaller (the largest ones are 4 TB), I scrub once a week. Note that ZFS only scrubs allocated data (files and metadata), so if a file system is 50% full, only half the platter will be scrubbed.

But: nobody really knows how much extra I/O activity accelerates disk errors. An accurate measurement of that effect would be required to analytically optimize the scrub rate, to get it to the point where scrub catches the most problems without causing more problems than it catches. The disk manufacturers have some internal measurements (which they do not share with the public, for good reasons); those measurements (and competitive pressure) are where that 550 TB limitation comes from. Big disk users (the companies that buy disks by the million) have internal measurements, which are also absolutely not shared. And the academic/research literature has virtually nothing in this area (but hold that thought, I know some groups are working on it). So for now, I would be a little careful with scrubbing, and try to limit it to far less than 550 TB per drive per year.
 
I fully agree! The possibility that scrubs push too much load on my disks, and the duration of multiple days - even when I stop the server services - have led to my current situation: I don't schedule scrubs; I start them from time to time, when I think the last one was too long ago.

Anyhow, just one additional hint for genfoch01: /etc/periodic/daily contains an 800.scrub-zfs script with 35 days as the default threshold, so daily_scrub_zfs_enable="YES" (e.g. in /etc/periodic.conf) would enable it. It can also be configured with per-pool thresholds.
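A sketch of such a per-pool configuration (pool names and thresholds are only examples):
Code:
# /etc/periodic.conf
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_tank_threshold="14"      # scrub the busy pool every two weeks
daily_scrub_zfs_backup_threshold="35"    # keep the 35-day default for the backup pool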
 
Yes, it appears to be scheduled to run at 00:00 on Sunday every 35 days. (According to the docs, that's the first Sunday 35 days AFTER the last scrub ran, so technically I guess it could be up to 41 days if your 35th day landed on a Monday.)
 
I fully agree! The possibility that scrubs push too much load on my disks, and the duration of multiple days - even when I stop the server services - have led to my current situation: I don't schedule scrubs; I start them from time to time, when I think the last one was too long ago.

Anyhow, just one additional hint for genfoch01: /etc/periodic/daily contains an 800.scrub-zfs script with 35 days as the default threshold, so daily_scrub_zfs_enable="YES" (e.g. in /etc/periodic.conf) would enable it. It can also be configured with per-pool thresholds.
And the scrub frequency of one every 35 days agrees reasonably well with the back-of-the-napkin calculation I did above that gives one every four weeks. And it runs in the middle of the night (I think periodic monthly starts at 5:30am by default on a stock FreeBSD distribution), so it is less likely to interfere with other workload on machines that follow a single-geography diurnal cycle, like households. So this is an excellent starting point.
 
offered only 3 vs 5 years warranty and still 3 of those drives failed well within that 3 year period, so we went for RE (-3 or -4 IIRC) drives to also replace the remaining Red drives. The RE are still running and are due to be replaced only due to their age, not because of failures...
Yeah, the WD RE4 drives have been really good. We bought some 200 of the WD2003FYYS 2TB RE4 drives some 10 years ago to replace other drives in some Sun Thumpers and only a handful have failed us. And the rest are still going strong (going to retire them this year though - hopefully)
 
The computer runs whether I sleep or not; that's what cron is for. Now, if you put your computer to sleep every night, then scrub won't be run. In that case, you should run it during the daytime. I think in ZFS you can pause and resume a scrub, so you could run it during lunch break (or some similar time when the foreground workload is not very intense).
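On reasonably recent OpenZFS the pause/resume looks like this (pool name is a placeholder):
Code:
zpool scrub tank       # start a new scrub, or resume a paused one
zpool scrub -p tank    # pause the running scrub, e.g. before suspending the machine
zpool status tank      # shows whether the scrub is running, paused, or finished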
 
PS: I switched from cron to fcron. It allows me more flexibility.
About scrub: when I reboot my PC during a scrub, the scrub continues after the reboot and doesn't start from zero. It says 1 TB already done, and carries on. I think this is a newer ZFS feature.
 
PS: I switched from cron to fcron. It allows me more flexibility.
If your computer is always down at night, then running a version of cron that performs the "missed" tasks later might be a good idea. I'm particularly thinking of periodic daily/weekly/monthly, which are valuable.

About scrub: when I reboot my PC during a scrub, the scrub continues after the reboot and doesn't start from zero. It says 1 TB already done, and carries on. I think this is a newer ZFS feature.
I think I've been seeing that for at least a few years now. It is such an obviously good idea, wouldn't surprise me if it were much older.
 
My 2 cents, feel free to ignore :)
A lot depends on the hardware: the type of drive (consumer grade vs. enterprise/NAS grade) and the system memory (ECC vs. non-ECC).
A zfs scrub basically walks a dataset/zpool, recomputes the checksums, and compares the newly computed checksums against the on-disk checksums (ZFS has checksums for just about everything). If there is a difference, a recovery happens: if the zpool has redundant vdevs (like a mirror or raidz-X), it reads the other members and repairs the bad copy from a good one.
Keep in mind what a pool is used for: your typical zroot pool is reasonably static (assuming you aren't installing/deinstalling packages); mostly the log files change on a daily basis. This means low load on the ZFS "system", so you aren't calculating checksums a lot.
Zpools that hold user data? Those may change a lot more, so you have a lot of block updates, lots of checksum changes, and more potential for errors (checksums on read-only media should never change, right?).

Now what if your memory got whacked by cosmic rays just as it computed a checksum for a data block? You may wind up with a bit flip, but you don't know that. The scrub sees an error and goes into "fix it" mode. That is why ECC vs. non-ECC memory is important. Enterprise vs. consumer-grade devices also play into this. Heat and power can also get into the mix (too high a temperature and data is unstable; too little power and data is unstable).

So what do I do? I have a mix of consumer and enterprise-grade hardware (the WD RE drives are awesome), so I have the SMART status in the periodic daily logs, and then I manually scrub about every 2 months or so. Plus good power supplies and airflow in the box.

Yes, a manual scrub, but I like the control. It's worked for me, but may not work for you. In general, for consumer-grade devices I think a 30-60 day scrub interval is about right.
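For reference, the manual version is just (pool name is a placeholder):
Code:
zpool scrub tank     # kick off a scrub by hand
zpool status tank    # watch progress; the "scan:" line also shows when the last scrub finished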
 
I shut down my home desktop PC every night and get round that by using sysutils/anacron to run most of my scheduled tasks including periodic daily, weekly and monthly.

Be careful with shutting down your server after long uptimes, or on a daily basis: the power supply could fail!
Statistics show that the temperature change (from a warm running server to cold) introduces mechanical stress in the boards and components, which may lead to failures, and power supplies die more often during power-up than during normal operation due to the significant inrush current.

During Y2K we asked our customers NOT to shut down their servers - they could stop all programs, but not do an electrical shutdown. We had calculated that even the normal statistical rate of power supply failures would cause problems with spare-parts supply. Now, that was 20 years ago - but are power supplies today any better?
I would recommend using all possible sleep modes before regularly powering off the server. (And I already had to change the power supply of one server in the last 10 years.)
 
I wonder: if I scrub a drive two times directly after each other, will the second scrub be much faster than the first, or not?
 
I wonder: if I scrub a drive two times directly after each other, will the second scrub be much faster than the first, or not?
Why would it? There might be something in the ARC, but in general the whole point of a scrub is to re-read and verify all data blocks. So it can't be faster - unless your computer runs faster when hot :D
 
Well, I was wondering whether it only checked the data blocks that had changed in the meantime.
I understand, but a scrub will check all the data again from the beginning. Resilvering concentrates on the "obsolete" blocks only. Both (scrub and resilvering) skip unused blocks - that's why, IMHO, SMART monitoring is still needed to identify drive failures.
I would be careful about starting a second scrub right after the first is done: this is a significant load on your drives and could lead to earlier drive damage. That is the reason why you should balance scrub frequency against data safety and drive lifetime.
 