ZFS scrubbing

When the handbook talks about scrubbing, it mentions the additional disk space required for the checksums. Isn't scrubbing just checking the stripes and recalculating the parity to see whether the parity is accurate? If not, why is the checksum needed? Isn't it a little redundant when we have parity? Can someone explain in a little more detail how it works?

Does each block's inode reference a checksum or is the scrub a completely independent mechanism?

Thank you.
 
Also wanted to ask: is scrubbing automatic? And how do we check whether a scrub is being performed?
 
I'll provide what information I can; I'm no expert on how raidz actually works.

All data written to the zpool is split into records of a certain size (usually 128 kB or less). Each record is then checksummed and written to disk. The checksum is checked every time that record is read. If you compress data, the compression is done first and the compressed record is then checksummed. If dedupe is enabled, the checksum is looked up in the DDT (DeDupe Table) to see whether that record already exists.
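
If you want to see how these settings apply to a particular dataset, the relevant properties can be inspected with zfs get (the dataset name below is just a placeholder):
Code:
# zfs get recordsize,compression,dedup,checksum pool/dataset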

I'm not sure how exactly records are written with raidz, but obviously data is not always striped with parity - many people use mirrored pools or even single-disk pools. In a single-disk pool data can still be recovered from a checksum error if you tell ZFS to store more than one copy of each record.
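
For that single-disk case, the number of copies is a per-dataset property; something like the following would store two copies of every record (dataset name is a placeholder, and it only affects data written after the property is set):
Code:
# zfs set copies=2 pool/dataset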

From what I can find on the net, I think that data read from raidz is validated with the checksum. The parity is only read if data needs to be reconstructed, although it's not very clear.

Scrub is not automatic. You can start a scrub, and then check progress with the commands below. If a scrub is in progress the status command will tell you and give the progress, speed and estimated time to completion. I believe there are also periodic scripts that can be enabled to do this automatically at a set interval.

Code:
# zpool scrub pool
# zpool status pool
# zpool scrub -s pool (stop the scrub)
 
A note on those periodic scrubs:
Put the following in /etc/periodic.conf:
Code:
daily_scrub_zfs_enable="YES"
daily_scrub_zfs_default_threshold=7
See periodic.conf(5) for more info and options.

It looks like periodic scrubbing is recommended. This is the output of zpool status:

Code:
  pool: tank
 state: ONLINE
 scan: scrub repaired 1.23M in 21h0m with 0 errors on Thu Nov  8 00:21:47 2012
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da0     ONLINE       0     0     0
            da1     ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da4     ONLINE       0     0     0
            da5     ONLINE       0     0     0

errors: No known data errors

If I'm not mistaken, this silent corruption would not have been found without ZFS. Looks like I have to watch those disks!
 
Yeah, just realize that if you experience a total disk failure on one of the drives, all used space on all other drives has to be completely readable without any more errors, or else you will have to restore something from backups. Good recipe for heart problems during a lengthy rebuild; use raidz2 to rectify the problem (space is cheap) :p
 
There's a nice lengthy post on the zfs-discuss mailing list that covers how scrub works on a raidz vdev. The gist of it is:
  • read data block, compute checksum
  • read checksum from disk, compare to computed checksum
  • compute parity for block, compute checksum for parity block
  • read checksum for parity block from disk, compare to computed checksum
If the computed checksums don't match the checksums on disk, then the data block is rebuilt from parity, the checksums are compared again, and the data is written over the top of the bad data block.
 
Thank you all for the input.

Is this the same mechanism fsck uses on UFS/UFS2 filesystems? Comparing checksums from the inodes but not being able to correct from parity? fsck can fix inodes though, right? So does it just rewrite the inode checksum based on the checksum computed from the data block?

I.e. "I've discovered corrupted data! Let's fix that by rewriting the checksum"? But I guess this would mean the FS continues to write to bad blocks? So on reflection, is the inode simply deleted, or is the block reference removed to mark the block as bad? Thing is, bitrot doesn't necessarily indicate a bad (completely non-functional) physical medium, right?
 
Thing is, bitrot doesn't necessarily indicate a bad (completely non-functional) physical medium, right?

Yes, this is correct. The behavior of HDDs is a bit strange here and I don't think that most file system drivers consider this (some more expensive RAID controllers do, though). Most people take S.M.A.R.T. seriously, even though the values are often misunderstood and behave differently for different manufacturers.

Use smartmontools... this actually saves your a$$ before a failure happens. It gives you a reaction time of 7-30 days before you see the indicators dropping below sane levels (the THRESH value).
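
For reference, basic smartmontools usage on FreeBSD looks something like the following (the device name is just an example); the daemon can be enabled by putting smartd_enable="YES" in /etc/rc.conf:
Code:
# smartctl -a /dev/ada0          (full S.M.A.R.T. report, including the VALUE/WORST/THRESH columns)
# smartctl -H /dev/ada0          (quick overall health assessment)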

I have watched several times what happens when a physical error occurs. The internal drive controller will not correct the error on its own, so you keep getting a defect (read error) every time. The point is that you really need to write to the particular sector to get it replaced by the sector relocation mechanism. I used dd with bs=512 exactly at the affected position (please don't try this on your precious data without a proper backup!). The overwrite procedure eventually has to be repeated several times, and also for the neighboring sectors, which are often affected as well.
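
As an illustration only (the device name and sector number below are made up, and overwriting a sector destroys whatever was stored there), forcing a single suspect sector to be rewritten looks roughly like this:
Code:
# dd if=/dev/zero of=/dev/ada0 bs=512 seek=123456789 count=1          (overwrite one 512-byte sector at that LBA)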

It's perfectly normal to have an error on day 1. The drives are fully functional when you overwrite the erroneous areas and will work for years. When you get a new drive, overwrite it with zeros with dd and make a read pass (also with dd). If dd stops before reaching the end of the drive, you should try to overwrite the disk several times. The corrections are usually logged by S.M.A.R.T. (but not always; I also have drives that show 0 corrected sectors again after having had 1 or 2).
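
A full burn-in pass with dd might look like the following (device name is an example again, and the write pass destroys everything on the disk):
Code:
# dd if=/dev/zero of=/dev/ada0 bs=1m          (write pass: zero the whole drive)
# dd if=/dev/ada0 of=/dev/null bs=1m          (read pass: verify every sector is readable)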

All in all, it's pretty confusing... but it's worth knowing how HDDs generally behave. Many people panic because they found an error. You don't need to panic about it, but: verify your backups, then take a close look at the defect and watch it in the future (you should always keep an eye on your drives, even when they are healthy; smartmontools' smartd helps here).
 
nakal said:
It's perfectly normal to have an error on day 1. The drives are fully functional when you overwrite the erroneous areas and will work for years.
I disagree. Any bad spots on the drive are re-mapped during manufacturing test, using far more sophisticated methods than are available from the customer interface. If you have a bad spot on a new drive, it was either damaged in shipment or the manufacturer's quality control is bad.

Unless you know for sure what caused the bad spot (for example, a power failure during a write causing a write splice error), you can't predict anything about the drive's future error rate. If the reason for the bad sector(s) is damage of the platter surface, you've got pieces of the surface loose in the drive. Depending on where they land, you could get another bad sector or a complete head crash. I've seen pieces land in the head stack actuator and cause all sorts of weird problems.

I've never had a manufacturer decline an RMA due to even a single bad sector. And they do keep track of returns per customer, recording what faults (if any) were found on the drives the customer has returned.
 
I just do the S.M.A.R.T. conveyance and long self-tests when a drive comes in. If these pass I start using the drive; if not, I schedule it for the next RMA shipment immediately. I trust ZFS to handle any later errors, and that's about it.
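
For anyone wondering, those self-tests can be started and read back with smartctl (device name is an example):
Code:
# smartctl -t conveyance /dev/ada0
# smartctl -t long /dev/ada0
# smartctl -l selftest /dev/ada0          (show the self-test log once the tests have finished)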
 
Terry_Kennedy said:
I've never had a manufacturer decline an RMA due to even a single bad sector. And they do keep track of returns per customer, recording what faults (if any) were found on the drives the customer has returned.

When you have 1 TB drives, the chance of getting a drive with an initial bad sector is around 50% (for me here). You will definitely notice this if you use RAID on such a drive. On cheap RAID implementations it will look like a failure. Better controllers will try to get the replacement mechanisms to work on this.

I have several drives which had 1-2 faulty sectors on day 1. They have all worked for years. I only watch whether the drives accumulate errors at a constant rate. That is the time when you should panic and look for a replacement.

I even have 2 drives where S.M.A.R.T. is broken and shows me 2047 (the highest value) faulty sectors. The drives are perfectly fine! No read errors at all.

What I take seriously are the THRESH values on Pre-fail attributes.

Of course the manufacturers will replace your drive. It's more expensive to check it thoroughly than to send you a replacement. ;)
 
nakal said:
When you have 1 TB drives, the chance of getting a drive with an initial bad sector is around 50% (for me here).
I still think you have drives damaged in shipping or a manufacturer with poor quality control.

My most recent experiences have been with WD2003FYYS drives, where I've received over 200 of them (both directly from Western Digital and from various distributors). None of those drives had any defects when installed (both a surface scan and SMART showed no problems).

Some years ago, a number of resellers were simply putting bare drives in antistatic bags into a box and then using some air pillow packing in the hope that the drive wouldn't move around. I've learned to not buy from those resellers (hence my buying drives in OEM 20-packs for the most part).

Even longer ago when I was using Seagate (this was old-logo Seagate) and had a dedicated on-site account rep, I reported some resellers to him for poor packaging. At that time, Seagate shipped the resellers knocked-down Seagate boxes and packing material for each drive, to be used when shipping drives to end users, and some resellers weren't using it for some reason. The resellers were eventually told to either use the Seagate packaging or else Seagate wouldn't sell drives to them.

You will definitely notice this if you use RAID on such a drive. On cheap RAID implementations it will look like a failure. Better controllers will try to get the replacement mechanisms to work on this.
I guess it depends on how you define RAID. I agree that there's a great deal of quality variation in RAID implementations, ranging from the excellent (3Ware), to middle-of-the-range (discount controllers), to poor (most BIOS PseudoRAID). When you're using a raidz* in ZFS, ZFS is only going to look at the areas of the disk that you have files on.

I use 3Ware 9xxx-series controllers and export each drive as an individual volume to FreeBSD (as opposed to just exporting the raw drives). This gives me full control of the drives - I can have the controller do scheduled background verification without impacting FreeBSD, request specific scan operations, etc. The controller also operates the 3 drive bay LEDs on each drive to show activity / status / errors. And it supports SMART passthru so I can use sysutils/smartmontools to monitor each individual drive.
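
In case it helps anyone with a similar 3Ware setup, smartmontools can reach the individual drives behind the controller with its 3ware device type; the controller device node and port number below are just examples:
Code:
# smartctl -a -d 3ware,0 /dev/twa0          (query the drive on port 0 of the first 9xxx-series controller)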

I have several drives which had 1-2 faulty sectors on day 1. They have all worked for years. I only watch whether the drives accumulate errors at a constant rate. That is the time when you should panic and look for a replacement.
It may be that your manufacturer doesn't do a particularly thorough surface analysis when they build the drives, and it is a simple "bad spot" on the media and not a sign of physical damage. However, as I said in my earlier reply, without OEM-type analysis you can't tell what caused the bad sectors. You've had good experience where no additional errors appear.

In my case, when a drive develops an error (which SMART will usually report as an Offline Uncorrectable), if I tell the controller to put the drive back into service, within a few days the number of bad sectors will ramp up rapidly, from the initial 1 to something like 49.

I even have 2 drives where S.M.A.R.T. is broken and shows me 2047 (the highest value) faulty sectors. The drives are perfectly fine! No read errors at all.
Out of curiosity, are these drives the same brand / model / firmware as other drives that don't have the problem, or are they different? There have been a number of drives from various manufacturers where there were problems with the SMART implementation. Those are usually "mostly harmless" and just report preposterous values. There was at least one type of drive where an inopportune SMART request would cause actual data corruption on the drive.

I simply don't buy drives built by companies that don't fix this sort of thing. And when I'm planning a major buy, I'll "taste" the drive model by first buying a single drive and testing it heavily, and then a single 20-pack to build a complete array and do more testing.

Of course the manufacturers will replace your drive. It's more expensive to check it thoroughly than to send you a replacement. ;)
Trust me, the manufacturers don't simply junk the drive you send back - particularly in this economic climate, and with the recent drive shortages.

I deal almost exclusively with WD these days, but procedures at other manufacturers will be similar.

Drives that are returned are processed on test equipment to determine if there is a fault, and if a fault is found, to determine if the problem is inside the sealed environment or is a logic board problem. At that point, the path diverges depending on whether or not repair needs to be done in a clean room.

Let me give you a sample of an incoming test (with identifying numbers removed):
Code:
Category            Mode                Submode
-----------------   -----------------   -----------------   -----------------   
Complete            Unable to Process
                    Through FSPT/FTA

Drive               HSA                 Poor On-Track       Over Write Degraded
                                        Write
PRELIMINARY LEVEL

DRIVE LEVEL

Failure Code Observations: Customer reported failure "WD2003FYYS Failing 20%+ with
RAID drop offs." You can also see detailed customer information on attachment for 
ITR # xxxx

Relolist has 8 entry, 1 relocated, glist has 1 entry. Validate reserve cyl OK. SMART
OFFLINE failed on head 5. DST test passed. POH = 3980 hours 42 minutes, 1367 active
logs, 7 ECC errors reported but may not see by Host, no errors reported to Host. Head 
5 has excessive ECC errors on all logs.

Found bad head 5, OW degraded by 8DB, drop below limit.

COMPONENT LEVEL
n/a

CONCLUSION
OW degraded on head 5, drop 8DB fall below 25DB limit.

If an end user returns a bunch of drives that show no problem found on the incoming test, WD may contact the user to ask them why they feel the drives failed and needed to be returned. Depending on the outcome of that discussion, either the incoming test will be expanded to detect a missed fault, or (more likely) the user will be told that none of the drives were actually bad, and to investigate other components (controller, cables, power supply, environment, etc.). In addition to showing the customer that they're being listened to, this will save the customer time and money in not RMA-ing good drives. And, of course, it also saves WD money as they don't have to process good drives as RMA's.
 
Out of curiosity, are these drives the same brand / model / firmware as other drives that don't have the problem, or are they different?

I use mostly Seagate and Samsung; both brands have different kinds of errors.

Seagate drives:
- 2 HDDs with 2047 replaced sectors
- once had 1 HDD with several billion raw read errors (it worked fine, accumulating several thousand per second)
- also had many broken drives from Seagate (mostly failed in RAID)

Samsung drives:
- 50% chance of getting 1 faulty sector, I would say (shows 0 in S.M.A.R.T. after a while). I am talking about the Reallocated_Sector_Ct attribute here.

There was at least one type of drive where an inopportune SMART request would cause actual data corruption on the drive.

Yes. These were exactly that kind of drive. They were working fine until I did a firmware upgrade; 2 of 4 drives showed 2047 replaced sectors after that procedure.

I've had some WD drives of the "Green" series and consider the latest ones trash (broken power management, a well-known problem: an ever-accumulating Load_Cycle_Count). I don't know about the Blue/Black series though. I decided not to buy a drive from WD until they test them on operating systems beyond the Window$ family.
 
nakal said:
I've had some WD drives of the "Green" series and consider the latest ones trash (broken power management, a well-known problem: an ever-accumulating Load_Cycle_Count). I don't know about the Blue/Black series though. I decided not to buy a drive from WD until they test them on operating systems beyond the Window$ family.
The vast majority of retail drives go into Windows machines. That's just the way the world is at this time.

OEM drives can go into Windows machines (Dell, etc.) or non-Windows systems (EMC, etc.)

The difference between a retail drive and an OEM drive might only be the labeling and the warranty, or it can be more substantial. For example, a common firmware change for OEM's is to reduce the reported capacity of the drive so that all "146GB" (or whatever) drives from that OEM have the exact same number of sectors, regardless of manufacturer or generation. They might also have the identify data changed to show the OEM name instead of the standard drive name. Sometimes, vendors claim that their drives are "special" and are thus worthy of the much-higher price they charge. You might want to look at my article Low-end disk devices - The Digital difference from nearly 25 years ago to see how long this game has been going on.

Sometimes there are actually substantial firmware changes for OEM drives. For example, Seagate added the ability to spindle-sync between different generations of [SCSI] Barracuda drives for us.

BTW, I'm not a Western Digital apologist - I've raked them over the coals for the same silliness with their Green drives. For example, here (I think - that forum is returning a "too busy" error right now).
 