ZFS Permanent errors in ZVOL / ZFS checksum error question

Hello everyone,

* It's me again *

Today I have the following problem:

I recently created a ZVOL to use it as a disk for a Windows VM with bhyve.
It worked fine for a few days until some software reported I/O problems with that "disk".

admin@server:~ % sudo zpool status -v tank
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 16.1M in 7h16m with 68 errors on Wed Jun  7 07:49:15 2017

    NAME                      STATE     READ WRITE CKSUM
    tank                     ONLINE       0     0     0
     raidz2-0                ONLINE       0     0     0
       gpt/wd_redpro_2tb_1   ONLINE       0     0     0
       gpt/wd_redpro_2tb_2   ONLINE       0     0     0
       gpt/wd_redpro_2tb_3   ONLINE       0     0     0
       gpt/wd_redpro_2tb_4   ONLINE       0     0     0
       gpt/wd_redpro_2tb_13  ONLINE       0     0     0
       gpt/wd_redpro_2tb_14  ONLINE       0     0     0
       gpt/wd_redpro_2tb_15  ONLINE       0     0     0
       gpt/wd_redpro_2tb_16  ONLINE       0     0     0
     mirror-1                ONLINE       0     0     0
       nvd0                  ONLINE       0     0     0
       nvd1                  ONLINE       0     0     0
     nvd2                    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:


I did a scrub, rebooted the server, and executed zpool clear tank, but the status is still the same. Apparently this ZVOL is damaged. To create it, I executed:
zfs create -V 650G -b 64k -o volmode=dev tank/veeam/backup_disk
Does anyone think I did something wrong / not recommended?

I have seen some reports in the FreeNAS forums that the HBA I'm using is not recommended.
I am using an Adaptec 71605H.

General question, though: how many checksum errors are considered "normal"? Up until now I ran a scrub about once every three to four months and had no problems whatsoever. I did see some checksum errors on all drives, though, so it's certainly possible the HBA is to blame.

Honestly any answer would be much appreciated.

Here's a sample output from one of the drives:

root@server:/home/admin # smartctl -a /dev/da0
smartctl 6.5 2016-05-07 r4318 [FreeBSD 11.0-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Model Family:     Western Digital Red Pro
Device Model:     WDC WD2001FFSX-68JNUN0
Serial Number:    WD-WMC5C0E6M3T1
LU WWN Device Id: 5 0014ee 0aec6698e
Firmware Version: 81.00A81
User Capacity:    2'000'398'934'016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jun  7 14:03:56 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (23280) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 254) minutes.
Conveyance self-test routine
recommended polling time:     (   5) minutes.
SCT capabilities:           (0x70bd)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   187   171   021    Pre-fail  Always       -       5633
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       132
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5476
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       132
 16 Unknown_Attribute       0x0022   005   195   000    Old_age   Always       -       103127219295
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       130
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3732
194 Temperature_Celsius     0x0022   118   110   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      5476         -

SMART Selective self-test log data structure revision number 1
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Other drives look similar (no read or seek errors, no bad blocks on any drive). Did I miss something?

No, those look fine. The important attributes are typically Offline_Uncorrectable and Current_Pending_Sector, and both are zero here.

Thanks very much for your responses. I'm still not sure how to proceed, though.

Do you actually think zvols corrupt unrecoverably just like that?
Or is there a way to avoid this? Is this a rare use case?
Would you say the HBA is a poor choice? So poor that it matters?

Do you actually think zvols corrupt unrecoverably just like that?
I very much doubt that. I'm sure there's a reason for it; files (even ZVOLs) don't magically corrupt themselves. At least not with ZFS (I have a different opinion regarding ext3/4, but that's another discussion).
What does this line mean:
scrub repaired 16.1M in 7h16m with 68 errors

Does it mean that ZFS scanned 16.1 million things, and found 68 of them with errors?

Or does it mean that ZFS scanned the whole pool, found 16.1 million things that had errors, and was able to repair most of them, but 68 of them were unrepairable?

When I say "thing", I also don't know whether ZFS is reporting bytes, sectors (512 bytes or 4K?), or some sort of ZFS allocation unit (blocks, extents, who knows).

Given modern disks (these seem to be 2TB drives), and the known failure rates (the uncorrectable bit error rate is usually spec'ed as 10^-14), it is theoretically possible that you had 68 sectors with an uncorrectable read error. But that seems wrong, since the SMART data shows that the drives themselves have not found any read errors.

It is completely impossible that you would have 16.1M media errors; any sensible SMART implementation (and the good people at WD are very sensible, I know some of them) would have long since raised serious alarms. So it seems likely that the corruption of the on-disk data happened at a layer above the disk. Suspecting the HBA seems implausible to me; HBAs don't quietly corrupt data (they tend instead to lose the connection to the disk, or to fabricate IO errors).

If you really got 16.1M sector errors, my suspicion would be that someone actively overwrote the disk (by going behind ZFS's back, directly to the device). Perhaps someone got confused, didn't use your /dev/gpt/wd_redpro_2tb_4 device entry for the disk, and by mistake created a new file system on /dev/da7? Maybe someone was doing a performance test on the raw disk with dd, and switched "if=" and "of=" around?

I work on file systems for a living, and we have a joke in the group: the worst thing that can happen to a file system is that someone tries to reformat one of our disks with a Reiser file system (to understand how cruel this joke is, you need to know what Hans Reiser did to his wife).

You ask what checksum error rate you should be seeing. That's a difficult question. Let me first answer a different question: Disk manufacturers quote an uncorrectable bit error rate, typically 10^-14 for consumer grade, and 10^-15 for enterprise grade drives. You can do a quick back-of-the-envelope calculation: Typical disks have a performance of roughly 100 MByte/s (that's accurate to within a factor of two), but are typically under-utilized by a factor of 10. So they read 10 MByte per second, or 100 MBit/second (I'm rounding to make the math easier). Multiply that by 10^-14 errors per bit, and you should get one error every million seconds, or roughly 30 errors per year (warning: that estimate is likely inaccurate by a factor of 10 or 100 due to various effects).
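The back-of-the-envelope arithmetic above can be sketched in a few lines of Python. The throughput, utilization, and BER figures are the rough assumptions from the text, not measurements:

```python
BITS_PER_BYTE = 8
SECONDS_PER_YEAR = 365 * 24 * 3600

def expected_ures_per_year(throughput_mb_s=100, utilization=0.1, ber=1e-14):
    """Expected drive-reported unrecoverable read errors per year,
    given a raw throughput, an effective utilization factor, and an
    uncorrectable bit error rate."""
    bytes_per_second = throughput_mb_s * 1e6 * utilization
    bits_per_year = bytes_per_second * BITS_PER_BYTE * SECONDS_PER_YEAR
    return bits_per_year * ber

# Consumer-grade drive (BER 10^-14), ~10 MB/s effective read rate:
print(f"{expected_ures_per_year():.1f}")           # → 25.2 errors per year
# Enterprise-grade drive (BER 10^-15):
print(f"{expected_ures_per_year(ber=1e-15):.2f}")  # → 2.52 errors per year
```

The result lands near the "roughly 30 errors per year" figure above; as the text warns, the estimate can easily be off by a factor of 10 or 100.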

BUT: Those are cases where the drive detects an internal unrecoverable read error, and reports a media error or read error up to the operating system. These are *not* cases where the drive silently corrupts data and returns it to ZFS, pretending it is good data. The disk drive has very extensive ECC (error correcting code) internally, and should fundamentally never return wrong data. Vendors typically don't even specify numbers for silent corruption, and the rate of such errors is supposed to be several orders of magnitude smaller than the uncorrectable BER. So checksum errors should never happen; when they do, the problem is likely not the drive itself. Only when you have systems so large that you have seen thousands of media errors should you even entertain the thought of silent corruption. (In this discussion I'm not specifically talking about off-track writes, but they are likely the only real-world mechanism for silent data corruption in the drives themselves.)

Or in other words: On a system of your scale, any checksum error means that you need to check the IO stack between ZFS and the drive, as it is statistically unlikely to come from the drive itself.
You have to be careful with that statement. You can look at SMART on two levels.

The level that lots of people want is perfect PFA - predictive failure analysis. In the best of all possible worlds, the disk drive would use SMART to diagnose itself, and give a very simple, clear, and correct answer to the host: either (a) I am in perfect health, functioning great, and you can store data on me with very little risk, or (b) I am very sick or perhaps already dead, do not store new data on me, if there is still any data on me then read it ASAP and save it elsewhere, and replace me as fast as possible. Ideally, there would be no gray zone in the middle. SMART is not capable of doing that, and the ideal version I described never can and never will work. But the PFA that is built into SMART on enterprise SCSI drives is a reasonably good approximation of this capability: when drives report that they will fail imminently, that is strongly correlated with actual failures (not 100% identical, but correlated).

On the SATA implementation of SMART, PFA isn't that easy; it really requires the host to cooperate. That is typically done by looking at error counters (attributes 196 through 198 are the common ones) and making heuristic decisions. It works reasonably well, but not great.
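As a rough sketch of that host-side heuristic, the check could look something like this in Python: scrape the raw values of attributes 196-198 out of smartctl -a output (the table format quoted earlier in the thread) and flag any that are non-zero. The attribute numbers come from the text; the parsing details are illustrative assumptions:

```python
import re

# SMART attributes the heuristic watches (per the discussion above):
# 196 Reallocated_Event_Count, 197 Current_Pending_Sector,
# 198 Offline_Uncorrectable.
WATCHED = {196, 197, 198}

def suspect_attributes(smartctl_output):
    """Return {attribute_id: raw_value} for watched attributes whose
    raw value is non-zero in `smartctl -a` style output."""
    suspects = {}
    for line in smartctl_output.splitlines():
        # Matches attribute rows like:
        # 197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
        m = re.match(r"\s*(\d+)\s+\S+\s+0x[0-9a-fA-F]{4}\s+.*\s(\d+)\s*$", line)
        if m:
            attr_id, raw = int(m.group(1)), int(m.group(2))
            if attr_id in WATCHED and raw != 0:
                suspects[attr_id] = raw
    return suspects
```

For the drive quoted above this returns an empty dict (all three raw values are zero), which is exactly the point: the drive itself reports no errors.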

The other level of looking at SMART is: it is a simple mechanism to get measurements of real-world effects. The simplest example is temperature: you can pretty much trust a drive to report its temperature via SMART correctly. And since we know that temperature is correlated with disk failure (although the correlation is not a simple linear one), a sensible storage system should monitor the temperature and do something when it becomes dangerous. That is pretty obvious and trivial.

The part that is less trivial, and of much greater value: we can pretty much trust drives to report internal read/write errors via SMART (the reporting in SCSI is nicer and easier to deal with than in SATA, but either works). And we know that internal errors are very strongly correlated with eventual drive failure.

Even more importantly: if at the host, OS, or file system level we see errors (like the ZFS checksum errors we are discussing in this thread), but the drive itself says that it is perfect and has not had any errors, then we know that the cause of the data corruption must be something else ... above the drive and below ZFS. This kind of measurement helps narrow down the root cause analysis and thereby perhaps find and fix the problem.
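To illustrate the temperature part, a minimal sketch that pulls attribute 194's raw value out of smartctl -a output and compares it to a warning threshold. The 45 C threshold is my own illustrative assumption, not a vendor figure:

```python
import re

def drive_temperature(smartctl_output):
    """Raw value of SMART attribute 194 (Temperature_Celsius) from
    `smartctl -a` style output, or None if the row is missing."""
    for line in smartctl_output.splitlines():
        m = re.match(r"\s*194\s+Temperature_Celsius\s+.*\s(\d+)\s*$", line)
        if m:
            return int(m.group(1))
    return None

def temperature_ok(smartctl_output, warn_at=45):
    """True if the drive reports a temperature below the threshold."""
    temp = drive_temperature(smartctl_output)
    return temp is not None and temp < warn_at
```

Against the output quoted above this reads 32 C, comfortably under the threshold.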

You are right when you say that you shouldn't have too much faith in SMART. But that doesn't mean that you should ignore it either. It's one piece of the puzzle, probably the biggest single one.

P.S.: I just looked; the OP's drives are at 32 degrees C, which is pretty much perfect. Don't change anything in the cooling; the problem must be elsewhere.