ZFS - Interpreting output of "zpool status"

I have set up a home fileserver, using 3 drives in a raidz1 configuration. It's all working great, but I'd like some confirmation that I'm interpreting the output of "zpool status" correctly.

Here it is:

Code:
  pool: keg2
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 448K in 1h26m with 0 errors on Sun Jan 10 01:50:44 2021
config:

    NAME                                          STATE     READ WRITE CKSUM
    keg2                                          ONLINE       0     0     0
      raidz1-0                                    ONLINE       0     0     0
        ata-WDC_WD20EFRX-68AX9N0_WD-WMC301719728  ONLINE       0     0    12
        ata-WDC_WD20EFRX-68AX9N0_WD-WMC301719975  ONLINE       0     0     3
        ata-WDC_WD20EFRX-68EUZN0_WD-WCC4M2TF1P1Y  ONLINE       0     0     3

errors: No known data errors

So - on the last "scrub", I had some errors repaired. I assume that the errors still listed for each drive under "CKSUM" are errors that still exist (because it says they are "unrecoverable"). As it also says "no known data errors", I assume that my data is intact, but I have lost some of the redundancy built into ZFS. I also assume that, if I get more "CKSUM" errors across the three drives, eventually they may line up across a particular piece of data and I will get corruption.

Am I correct? Do people run drives with CKSUM errors, and if so, what are your thresholds for replacement (they are old and quite used drives)?

Thanks in advance!
 
[I wonder whether ZFS can differentiate between checksum errors arising from defective media and those arising from defective memory. So I'd be curious: does this computer use ECC memory (server/workstation) or memory without parity bits (consumer-grade PC)?]
 
Check the actual disks themselves with sysutils/smartmontools. ZFS is quite resilient to errors but it's not bulletproof. If you have bad spots on the drives then it's probably time to replace the disks.
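If you want a concrete starting point, something like this would do (a sketch only; the device names /dev/ada0 to /dev/ada2 are assumptions, substitute whatever your drives are called):

Code:
# full SMART report for one drive: health, attributes, error log
smartctl -a /dev/ada0

# the attributes that usually betray surface problems
smartctl -A /dev/ada0 | egrep 'Reallocated_Sector|Current_Pending_Sector|Offline_Uncorrectable'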
 
So - on the last "scrub", I had some errors repaired. I assume that the errors still listed for each drive under "CKSUM" are errors that still exist (because it says they are "unrecoverable"). As it also says "no known data errors", I assume that my data is intact, but I have lost some of the redundancy built into ZFS. I also assume that, if I get more "CKSUM" errors across the three drives, eventually they may line up across a particular piece of data and I will get corruption.

According to the scrub output it repaired the data with 0 errors, so I suspect you are fine. I believe the error counts stay there unless you do a zpool clear.

As you state, you would need errors to hit the same data across multiple disks to actually get corruption. I believe the status actually says that "applications may be affected" in this case.

It depends how critical the data is as to how much money you want to spend. 2TB disks are pretty cheap these days. You could also consider a 4TB disk just as a standalone secondary copy of the data. If there's anything at all you wouldn't want to lose I would do at least one of those things. Note though that WD RED disks come in both CMR and SMR forms these days, and I generally stay away from the SMR ones.
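If the errors turn out to be explained (or you just want a clean baseline), resetting the counters and re-verifying looks like this (a sketch; keg2 is the pool name from the output above):

Code:
# reset the per-device READ/WRITE/CKSUM counters
zpool clear keg2

# re-read and verify every allocated block, then check the counters again
zpool scrub keg2
zpool status keg2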
 
Check the actual disks themselves with sysutils/smartmontools. ZFS is quite resilient to errors but it's not bulletproof. If you have bad spots on the drives then it's probably time to replace the disks.
I checked all the disks with a thorough run of badblocks -w before putting them into the array. They are old though, so I'm not too surprised that they are showing some errors in daily use.
According to the scrub output it repaired the data with 0 errors, so I suspect you are fine. I believe the error counts stay there unless you do a zpool clear.

As you state, you would need errors to hit the same data across multiple disks to actually get corruption. I believe the status actually says that "applications may be affected" in this case.

It depends how critical the data is as to how much money you want to spend. 2TB disks are pretty cheap these days. You could also consider a 4TB disk just as a standalone secondary copy of the data. If there's anything at all you wouldn't want to lose I would do at least one of those things. Note though that WD RED disks come in both CMR and SMR forms these days, and I generally stay away from the SMR ones.
So are those "CKSUM" errors the ones that were corrected during the scrub, or are they outstanding errors? I know they can be cleared, but I wasn't sure if I was just hiding problems that way.
 
Am I correct? Do people run drives with CKSUM errors, and if so, what are your thresholds for replacement (they are old and quite used drives)?
No, we usually do not. If there are such errors, you should try to identify how they came to appear.
Even an old disk is not supposed to produce surface errors in a regular fashion. It may occasionally hit a bad block, and then there will be a checksum error. But then that sector must be re-written, so the disk can map it away, and everything should be fine again.
But there are also lots of other flaws that can lead to such error counts: weak cable connectors, a weak power supply, critical timings in the CMOS setup, and so on. And keep in mind: the error count appears when the scrub reads the data back, but the error itself may have happened earlier, when that data was written.

Then, when ZFS shows the message as reproduced, that means
1. the checksum errors on disk have been fixed from redundancy, and
2. the counters in the status are kept until a zpool clear (or, occasionally, another scrub or a reboot).

[I wonder whether ZFS can differentiate between checksum errors arising from defective media and those arising from defective memory. So I'd be curious: does this computer use ECC memory (server/workstation) or memory without parity bits (consumer-grade PC)?]
No, it cannot, but you might: memory errors are usually not associated with a single disk; they would appear on the vdev.
But then, actual random memory errors of the kind ECC would correct are very rare (less than once a year, and dependent on altitude).
Depending on the quality of the hardware assembly, other causes may be more common: almost any CMOS timing, or a badly seated PCI controller, can show up as nothing but checksum errors on an entirely unrelated(!) ZFS pool.
In short, anything that can distort bus timings can result in ZFS checksum errors.
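If you want to rule out cabling or controller trouble rather than the disks, the kernel log is the place to look (a sketch, assuming FreeBSD's default logging; the exact driver messages will vary):

Code:
# recent kernel messages touching the disks or the controller
dmesg | egrep -i 'cam|ahci|ata.*error|timeout'

# the same, further back in time
egrep -i 'cam status|ahci|timeout' /var/log/messages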
 
You want to try and establish whether the issue is with the disks themselves or whether something else is causing a fault. If you check smartctl -a and see a corresponding I/O error in the logs then you know the disks are problematic. Faults should show up in the system log as well. If there are no SMART errors then it could be a controller fault, or more likely just a faulty cable (I've even had issues with loose power cables causing the same problem).

You need to work out whether you're getting faults on one disk or all of them. And if any of your disks are failing smart tests then that's a pretty good warning sign.

If a disk is failing, then whether to replace it is up to you. It depends how important the data is, whether you keep backups, etc. A small number of faults is "normal" for old disks, but if the rate of errors is high (or quickly climbing) then you are inviting trouble.
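One way to narrow it down per disk is to run a long self-test on every member and compare the verdicts (a sketch; ada0 through ada2 are assumed device names):

Code:
# start a long (surface) self-test on each member disk
for d in ada0 ada1 ada2; do smartctl -t long /dev/$d; done

# once the tests have finished (typically hours later), read the results
for d in ada0 ada1 ada2; do echo "== $d =="; smartctl -H -l selftest /dev/$d; done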
 
Am I correct? Do people run drives with CKSUM errors, and if so, what are your thresholds for replacement (they are old and quite used drives)?
Different people have different tolerances with this sort of stuff. If I saw this output (as a home user) then most likely I wouldn't be too alarmed. I would just "zpool clear" and get on with my life.

But if you're getting these sorts of things weekly or monthly then you definitely want to investigate, especially if your data is important to you.

I think what others are saying is that it's important to try and understand WHY the faults are occurring. Then make your choices based on that.
 
Many thanks for the in-depth answers, I have all the information I was looking for, and more - always good to learn new things. Definitely a very good first impression of the quality of this forum!

One of the disks has a Raw_Read_Error_Rate of 2 (raw value, in the smartctl report), but apart from that all other values across all disks look OK (other than around 50k power-on hours... I did say they were used, right??). I'm not overly concerned, as I have a decent backup strategy and it's not critical data. I'll think about swapping in some new disks sometime, though.
 