ZFS zpool strange cksum errors

I am experiencing "strange" cksum errors recently.
I am looking to understand the underlying problem (and fix it).


Since updating to FreeBSD 13.0-STABLE #0 stable/13-n248455-dbb2f1cdb84 (Thu Dec 9 04:43:44 UTC 2021), the pool shows uncorrectable errors.
Multiple scrubs produce (different!) error counts on each run, but always only on mirror-0 & mirror-2:
Code:
        NAME        STATE     READ WRITE CKSUM
        Pool1       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada3p3  ONLINE       0     0    44
            ada5p3  ONLINE       0     0    44
          mirror-1  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            ada0p3  ONLINE       0     0    52
            ada1p3  ONLINE       0     0    52
        logs
          ada6p3    ONLINE       0     0     0
        cache
          ada6p5    ONLINE       0     0     0
I do not think these are actual device errors because:
- smartmontools shows nothing unusual (the kind of check I mean is sketched just below this list).
- the problem only appeared after upgrading from release-12.
- the four "BAD" drives are 1-year-old Seagate 6TB ST6000VN001 (~10,000 h power-on), while the two "OK" drives are 4-year-old Western Digital 3TB WDC WD30EFRX drives (~30,000 h power-on).
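For reference, this is the kind of SMART check I mean (the device name is just an example; smartctl is from the smartmontools port):
Code:
smartctl -a /dev/ada3 | egrep -i 'reallocated|pending|uncorrect|crc'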
Lastly: the "cksum" error looks non-random, more like a systematic error.
From the event log:
Code:
cksum_expected = 0xab1efa6343 0x1b7b7ee8409a2 0x26b1f9284404425 0x70e837fc698b506d
cksum_actual = 0xab3efa6343 0x1b8144e8409a2 0x26ba51544404425 0x71690923a98b506d
cksum_algorithm = "fletcher4"
time = 0x61b99952 0x35d7e963
eid = 0x2372

==> XOR 0x0020000000 0x00fa3a0000000 0x0000ba87c0000000 0x01813edfc0000000
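(The XOR above can be reproduced with plain 64-bit shell arithmetic, using the values from the event entry:)
Code:
printf '0x%x\n' $(( 0xab1efa6343       ^ 0xab3efa6343 ))
printf '0x%x\n' $(( 0x1b7b7ee8409a2    ^ 0x1b8144e8409a2 ))
printf '0x%x\n' $(( 0x26b1f9284404425  ^ 0x26ba51544404425 ))
printf '0x%x\n' $(( 0x70e837fc698b506d ^ 0x71690923a98b506d ))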

I swapped all drives around/replaced cabling... to no avail.

Before this error came up, I had an issue with one of the Seagate drives "regularly" disconnecting every 2-4 weeks:
Code:
May 16 22:39:44 maggi kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
May 16 22:39:44 maggi kernel: ada5: <ST6000VN001-2BB186 SC60> s/n ZCT26VXB detached
May 16 22:39:44 maggi kernel: (ada5:ahcich5:0:0:0): Periph destroyed
May 16 22:39:44 maggi ZFS[66677]: vdev state changed, pool_guid=15273872331664054620 vdev_guid=18411497283176214413
May 16 22:39:44 maggi ZFS[66681]: vdev is removed, pool_guid=15273872331664054620 vdev_guid=18411497283176214413
May 16 22:39:52 maggi kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
May 16 22:39:52 maggi kernel: ada5: <ST6000VN001-2BB186 SC60> ACS-3 ATA SATA 3.x device
May 16 22:39:52 maggi kernel: ada5: Serial Number ZCT26VXB
May 16 22:39:52 maggi kernel: ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 16 22:39:52 maggi kernel: ada5: Command Queueing enabled
May 16 22:39:52 maggi kernel: ada5: 5723166MB (11721045168 512 byte sectors)

Back then, after re-onlining the drive and a scrub, no problems were found.

Does anyone have a suggestion what to look for (tuning parameters in BIOS/kernel, ...)?
Could this be a problem with the SATA drivers for the mobo's controller?
Anything else to investigate?

System:
CPU: AMD FX-8320E Eight-Core Processor (3193.15-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x600f20 Family=0x15 Model=0x2 Stepping=0
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x3e98320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C>
AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
AMD Features2=0x1ebbfff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,XOP,SKINIT,WDT,LWP,FMA4,TCE,NodeId,TBM,Topology,PCXC,PNXC>
Structured Extended Features=0x8<BMI1>
SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=65536
TSC: P-state invariant, performance statistics
real memory = 17179869184 (16384 MB)
avail memory = 16572493824 (15804 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <ALASKA A M I>
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s)
random: unblocking device.
Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)

ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfeb0b000-0xfeb0b3ff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported
ahci0: quirks=0x22000<ATI_PMP_BUG,1MSI>
ahcich0: <AHCI channel> at channel 0 on ahci0

Thilo
 
FreeBSD 13.0-STABLE #0 stable/13-n248455-dbb2f1cdb84: Thu Dec 9 04:43:44 UTC 2021

Please share the output of:
zfs --version
zpool list
geom disk list
pkg -vv | grep -e url -e enabled

Please make appropriate use of CODE markup, thanks.

<https://forums.freebsd.org/help/bb-codes/#code>

Before this error came up, I had an issue with one of the seagate drives "regularly" every 2-4 weeks disconnecting

Was that with an inferior build of stable/13, or with 12.something?
 
Thank you for the markup hints (it was my first post here).
The disk disconnects were with
FreeBSD 13.0-RELEASE-p4 #0: Tue Aug 24 07:33:27 UTC 2021 (and p3/p2...)
Then I moved to
FreeBSD 13.0-STABLE #0 stable/13-n247742-e058d44fda3: Thu Oct 21 and
FreeBSD 13.0-STABLE #0 stable/13-n248455-dbb2f1cdb84: Thu Dec 9
because of some issues with the Intel network card & VLANs that I tried to solve...

At first I blamed the routing of the SATA power cables (being too close to the disk PCBs). I noticed fewer problems after the power cables were routed clearly away from the disks.

The cksum issue came after installing
FreeBSD 13.0-STABLE #0 stable/13-n247742-e058d44fda3: Thu Oct 21 03:01:41 UTC 2021


Code:
zfs --version
zfs-2.1.1-FreeBSD_g71c609852
zfs-kmod-2.1.1-FreeBSD_g71c609852

NAME     SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
Pool1   13.6T  12.5T  1.12T        -         -    56%    91%  1.00x    ONLINE  -
system  15.5G  14.4G  1.15G        -         -    69%    92%  1.00x    ONLINE  -


Geom name: ada0
Providers:
1. Name: ada0
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   descr: ST6000VN001-2BB186
   lunid: 5000c500c537931f
   ident: ZCT26VXB
   rotationrate: 5425
   fwsectors: 63
   fwheads: 16

Geom name: ada1
Providers:
1. Name: ada1
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   descr: ST6000VN001-2BB186
   lunid: 5000c500c688f767
   ident: ZCT2WZF1
   rotationrate: 5425
   fwsectors: 63
   fwheads: 16

Geom name: ada2
Providers:
1. Name: ada2
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   descr: WDC WD30EFRX-68EUZN0
   lunid: 50014ee2ba5e28f4
   ident: WD-WCC4N5SYJNRT
   rotationrate: 5400
   fwsectors: 63
   fwheads: 16

Geom name: ada3
Providers:
1. Name: ada3
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   descr: ST6000VN001-2BB186
   lunid: 5000c500c6895002
   ident: ZCT2WXZY
   rotationrate: 5425
   fwsectors: 63
   fwheads: 16

Geom name: ada4
Providers:
1. Name: ada4
   Mediasize: 3000592982016 (2.7T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   descr: WDC WD30EFRX-68EUZN0
   lunid: 50014ee265088eb5
   ident: WD-WCC4N6PZR9KA
   rotationrate: 5400
   fwsectors: 63
   fwheads: 16

Geom name: ada5
Providers:
1. Name: ada5
   Mediasize: 6001175126016 (5.5T)
   Sectorsize: 512
   Stripesize: 4096
   Stripeoffset: 0
   Mode: r1w1e2
   descr: ST6000VN001-2BB186
   lunid: 5000c500c689463e
   ident: ZCT2WY73
   rotationrate: 5425
   fwsectors: 63
   fwheads: 16

Geom name: ada6
Providers:
1. Name: ada6
   Mediasize: 240057409536 (224G)
   Sectorsize: 512
   Mode: r4w4e7
   descr: Apacer AS350 240GB
   lunid: 502b2a201d1c1b1a
   ident: I37394R001697
   rotationrate: 0
   fwsectors: 63
   fwheads: 16


 # pkg -vv | grep -e url -e enabled
    url             : "pkg+http://pkg.FreeBSD.org/FreeBSD:13:amd64/latest",
    enabled         : yes,



Thilo
 
Hmm. Real cksum errors are rare. Most people who have just one or ten disks will never see them in their lifetime. People who have a thousand or hundred thousand disks will see them somewhat regularly (see anecdote below). That's because the disk drive causing a cksum error means that you have an undetected read error on the disk, and that means that the bit-pattern on the disk managed to fool the ECC on the disk hardware, which is spectacularly unlikely. The only other "common" reason for cksum errors is off-track writes, which on modern disks should also be vanishingly rare (they were a worry about 20 years ago, due to vibration on single-CPU disks, but that was long ago).

Most likely, a lot of cksum errors have a different common cause: Some errant process is overwriting the disk. In the big cluster computing business we used to joke that one of the major causes of data death is some sys admin re-formatting a disk with a Reiser file system (it's a joke about Hans Reiser murdering his wife, so it's in bad taste, but still funny). So could it be that there was something that wrote on the disk? Can you check what location on disk the checksum errors are?
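One way to get at that (assuming the usual OpenZFS 2.x ereport fields; I have not double-checked the exact names) would be:
Code:
zpool events -v Pool1 | less
# in each ereport.fs.zfs.checksum record, look at vdev_path,
# zio_offset and zio_size to see where on which disk the error sits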

The "cksum" error looks non random but more like a systematic error:
Find a pattern in that. I have no idea what the pattern could be. Try to correlate it to things you have done to the system.

Could that be a problem with the SATA drivers in the Mobo?
Exceedingly unlikely. Once FreeBSD is booted, the BIOS drivers are not in the picture.

Anything else to investigate?
Power tends to be my "favorite" problem to diagnose (not!).

Anecdote from a now retired colleague: In a sufficiently large system, everything that's unlikely happens all the time. And things that are impossible happen occasionally.
 
Taken out, wiped with isopropyl, and put back in reverse.

cksum error means that you have an undetected read error on the disk

I had a couple (< 10) of them in the last 5 years; I blame the cheap mobo (no ECC) for those. -- Another good reason to use ZFS...


Power, tends to be my
Mine too. As said, yes, I was blaming it when one drive was having hiccups; I didn't change the supply, though (I will consider it).

there was something that wrote on the disk

dd of=my-valuable-disks?
Unlikely after 20 years of admin on NetBSD & FreeBSD ;-) ... and this is my personal storage server; I think I know what I shouldn't do.



Find a pattern in that.

Doing this over the collected pool events:

Code:
egrep -h 'cksum_(ac|ex)' zpool.events* | sort | uniq -c
  20         cksum_actual = 0x10c761e3d4501 0x26759b24ab17d5 0x0 0x0
  20         cksum_actual = 0x57b6b01888e1 0xb524cecf749a8 0x0 0x0
  20         cksum_actual = 0xab3efa6343 0x1b8144e8409a2 0x26ba51544404425 0x71690923a98b506d
  20         cksum_expected = 0x95e1d299d 0x218b4a5e507 0x0 0x0
  20         cksum_expected = 0xab1efa6343 0x1b7b7ee8409a2 0x26b1f9284404425 0x70e837fc698b506d
  20         cksum_expected = 0xc7e3cb33f 0x508fa78cd58 0x0 0x0

gives me only 3 distinct checksum errors, and not

Code:
        NAME        STATE     READ WRITE CKSUM
        Pool1       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada3p3  ONLINE       0     0    88
            ada5p3  ONLINE       0     0    88
          mirror-1  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            ada0p3  ONLINE       0     0   104
            ada1p3  ONLINE       0     0   104
        logs
          ada6p3    ONLINE       0     0     0
        cache
          ada6p5    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Pool1/Storage/raid3e/Backups/backup/tmp/Backup_2010_02_13-13-30/2001/2001-05-09 100MSDCF
        /mnt/Pool1/Storage/dk3/users/BSD/NetBSD/midge/obj/usr.sbin/bind/check
        /mnt/Pool1/Storage/raid3e/raid1e/Backups/t.....
        /mnt/Pool1/Storage/raid3e/.....
        /mnt/Pool1/Storage/dk3/wd0a_xen/sys/pkgsrc/devel/argp/patches
        Pool1/jails/ts2a@auto-20211211.060000-99y:<0x0>
        Pool1/jails/ts2a:<0x0>

Does it make sense to anyone?



off-track writes,

The corrupted areas are ALL nowhere near anything active in the last 4 years, so I rule out write errors caused by missing ECC on the RAM. And on four disks of the same vendor at the same time?

The only thing that could possibly hint at this would be the snapshot taken (or renamed?) on 11 December.

Could that be the crucial hint?
As said, jails/ts2a has not been in use for a number of years; only the automatic snapshots still touch it.

Another question: as it seems that some snapshots are corrupt, is there any means of repairing the errors other than destroying the dataset (and all dependent snapshots in the same tree)?
 
Another "interesting" thing I noticed:

Before the move to STABLE, I only saw ~600MB/s resilvering performance.
After the update, I saw a ~900MB/s read rate.

That indicates some performance optimisations in the drivers.
If that is the case, are there any sysctls I can tweak to throttle something?
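(This is only how I would list the candidate tunables the ZFS module exposes; I do not yet know which, if any, actually throttle the scan:)
Code:
sysctl vfs.zfs | egrep -i 'scrub|resilver|scan'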
 
Cause being cheap MoBo with no ECC? Possible. The rate of memory errors is REALLY hard to predict. Big computer users that have zillions of machines with ECC and keep track of (corrected or detected) memory errors have statistics for that, and I've seen internal studies. The thing that strikes me about those statistics is how exceedingly variable memory errors are.

You joke about "dd of=my-valuable-disks?". Clearly, few people are dumb enough to do something that blatantly wrong. But other things happen. For example, you change the configuration of another disk (which should not be in the ZFS array), run gpart on adaX, notice it didn't work, rerun the command, and don't notice that the first time around you had typed adaX when you meant adaY -- and there you go. Humans make errors, which is why computers have redundancy.

About the pattern: Is there any way to determine *where* (which disk, what offset) the checksum errors were found? That pattern might help find a common cause. But you say that all errors are on four disks which are the same manufacturer and the same age, while other disks (different manufacturer, different age) have no errors. That's hard to understand and explain ... which doesn't make it untrue.

Your theory that the snapshot taken on December 11 is the common cause seems possible. That would mean you found a software bug in ZFS, or at least bad handling of a hardware error (perhaps during the snapshot a hardware problem occurred, like an IO error, and ZFS didn't do the right thing). How to get rid of it? I think the errors should vanish once no snapshots refer to the underlying disk blocks any longer. But remember: snapshots share disk blocks.

One strategy might be: Just ignore the snapshots, try not to read them, and in a while delete them all.
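If it comes to deleting them, something along these lines would do it (dataset and snapshot names taken from the error list above; do a dry run first):
Code:
# list what is there, then dry-run the destroy (-n, verbose -v) before doing it for real
zfs list -r -t snapshot -o name Pool1/jails/ts2a
zfs destroy -nv Pool1/jails/ts2a@auto-20211211.060000-99y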
 
… Humans make errors, …

Yeah, it's difficult (impossible, really) to make things foolproof.

From Sun's draft documentation in 2006, with added emphasis:

… placing the labels in non-contiguous locations (front and back) provides ZFS with a better probability that some label will remain accessible in the case of media failure or accidental overwrite (eg. using the disk as a swap device while it is still part of a ZFS storage pool).
 
Sysadmins do the accidental overwrite; everyone else does the accidental delete of their master's thesis... Wasn't "restore from backup" the most requested job of admins?

I would think 95% of the redundancy (backups/snapshots/...) is there because "to err is human". I don't recall having a broken (head crash or similarly fatal) rotating disk in the last 20 years. But the ratio of 10 bugs per 1000 lines of code seems to stay more or less static.

Going back on topic:

How do I remove the cksum errors?
The directories it complains about cannot be removed with rm/rmdir (they show as empty, with no . or .., but seem to still have a link count > 0). Any alternatives?
Would zfs send | zfs receive be able to "recreate" an error-free volume?
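Something along these lines is what I have in mind (an untested sketch, the names are just examples; I expect the send to stop with an I/O error if it hits one of the damaged blocks):
Code:
zfs snapshot Pool1/jails/ts2a@migrate
zfs send Pool1/jails/ts2a@migrate | zfs receive Pool1/jails/ts2a_new
# if that survives: retire the damaged dataset, then
zpool clear Pool1
zpool scrub Pool1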
 
I restarted a scrub last night, and it started collecting new errors:
Code:
  scan: scrub in progress since Sat Dec 25 01:41:47 2021
        3.53T scanned at 103M/s, 3.22T issued at 93.8M/s, 12.5T total
        0B repaired, 25.76% done, 1 days 04:49:43 to go
config:

        NAME        STATE     READ WRITE CKSUM
        Pool1       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada3p3  ONLINE       0     0    74
            ada5p3  ONLINE       0     0    74
          mirror-1  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            ada0p3  ONLINE       0     0    82
            ada1p3  ONLINE       0     0    82
        logs
          ada6p3    ONLINE       0     0     0
        cache
          ada6p5    ONLINE       0     0     0

grep 'class =' zpool.eventlog.ff | sort | uniq -c
  28         class = "ereport.fs.zfs.checksum"
  49         class = "ereport.fs.zfs.data"
   1         class = "sysevent.fs.zfs.config_sync"
2853         class = "sysevent.fs.zfs.history_event"
   1         class = "sysevent.fs.zfs.scrub_start"

Here, too, the number of errors in the event log (this time completely captured) doesn't really match the CKSUM column.

I will abort this and try to reboot into the "-RELEASE" branch (thanks to beadm!) and see if that reproduces the scrub errors.
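Roughly like this (the boot-environment name is just a placeholder; I will pick the right one from the list):
Code:
beadm list
beadm activate <13.0-RELEASE-BE>
shutdown -r now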
 