I have recently been seeing "strange" checksum (cksum) errors on my ZFS pool.
I would like to understand the underlying problem (and fix it).
Since updating to >FreeBSD 13.0-STABLE #0 stable/13-n248455-dbb2f1cdb84: Thu Dec 9 04:43:44 UTC 2021< the pool shows uncorrectable errors.
Multiple scrubs produce a different error count on each run, but always only on mirror-0 and mirror-2.
        NAME        STATE     READ WRITE CKSUM
        Pool1       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada3p3  ONLINE       0     0    44
            ada5p3  ONLINE       0     0    44
          mirror-1  ONLINE       0     0     0
            ada2p2  ONLINE       0     0     0
            ada4p2  ONLINE       0     0     0
          mirror-2  ONLINE       0     0     0
            ada0p3  ONLINE       0     0    52
            ada1p3  ONLINE       0     0    52
        logs
          ada6p3    ONLINE       0     0     0
        cache
          ada6p5    ONLINE       0     0     0
I do not think these are actual device errors because:
- smartmontools shows nothing unusual.
- The problem appeared only after upgrading from release-12.
- The four "BAD" drives are 1-year-old Seagate 6TB ST6000VN001 drives (10,000 h power-on), and the two "OK" drives are 4-year-old Western Digital 3TB WDC WD30EFRX drives (30,000 h power-on).
Lastly:
The cksum errors do not look random; they look more like a systematic error.
From the event log:
cksum_expected = 0xab1efa6343 0x1b7b7ee8409a2 0x26b1f9284404425 0x70e837fc698b506d
cksum_actual = 0xab3efa6343 0x1b8144e8409a2 0x26ba51544404425 0x71690923a98b506d
cksum_algorithm = "fletcher4"
time = 0x61b99952 0x35d7e963
eid = 0x2372
==> XOR 0x0020000000 0x00fa3a0000000 0x0000ba87c0000000 0x01813edfc0000000
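For anyone who wants to re-check the XOR line above, here is a minimal Python sketch that recomputes it from the logged values. It also looks at the arithmetic difference of the first fletcher4 word, which is the plain sum of all 32-bit data words in the block. Reading a power-of-two difference as a single flipped bit is an assumption made for illustration, not a confirmed diagnosis; the variable names are my own.

```python
# Checksum words copied verbatim from the event-log entry above.
expected = [0xab1efa6343, 0x1b7b7ee8409a2, 0x26b1f9284404425, 0x70e837fc698b506d]
actual = [0xab3efa6343, 0x1b8144e8409a2, 0x26ba51544404425, 0x71690923a98b506d]

# Bitwise difference of each of the four fletcher4 words.
xor_words = [e ^ a for e, a in zip(expected, actual)]

# The first fletcher4 word is the running sum of all 32-bit data words,
# so its arithmetic difference is the net change in the data that was read.
delta_a = actual[0] - expected[0]

# Assumption for illustration: a positive power-of-two delta is what a single
# 0 -> 1 bit flip in one 32-bit word would produce.
is_power_of_two = delta_a > 0 and (delta_a & (delta_a - 1)) == 0
bit_index = delta_a.bit_length() - 1 if is_power_of_two else None

for word in xor_words:
    print(f"0x{word:016x}")
print(f"delta of first word: {delta_a:#x}, candidate flipped bit: {bit_index}")
```

For this event the first-word delta is 0x20000000, which is bit 29. That would be consistent with one bit being set in a single 32-bit word, although other combinations of changes could add up to the same sum.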
I swapped all the drives around and replaced the cabling, to no avail.
Before this error came up, I had an issue with one of the Seagate drives disconnecting "regularly", every 2-4 weeks:
May 16 22:39:44 maggi kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
May 16 22:39:44 maggi kernel: ada5: <ST6000VN001-2BB186 SC60> s/n ZCT26VXB detached
May 16 22:39:44 maggi kernel: (ada5:ahcich5:0:0:0): Periph destroyed
May 16 22:39:44 maggi ZFS[66677]: vdev state changed, pool_guid=15273872331664054620 vdev_guid=18411497283176214413
May 16 22:39:44 maggi ZFS[66681]: vdev is removed, pool_guid=15273872331664054620 vdev_guid=18411497283176214413
May 16 22:39:52 maggi kernel: ada5 at ahcich5 bus 0 scbus5 target 0 lun 0
May 16 22:39:52 maggi kernel: ada5: <ST6000VN001-2BB186 SC60> ACS-3 ATA SATA 3.x device
May 16 22:39:52 maggi kernel: ada5: Serial Number ZCT26VXB
May 16 22:39:52 maggi kernel: ada5: 600.000MB/s transfers (SATA 3.x, UDMA6, PIO 8192bytes)
May 16 22:39:52 maggi kernel: ada5: Command Queueing enabled
May 16 22:39:52 maggi kernel: ada5: 5723166MB (11721045168 512 byte sectors)
Back then, after a re-online and scrub, no problems were found.
Does anyone have a suggestion for what to look for (tuning parameters in the BIOS or kernel, or something else)?
Could this be a problem with the motherboard's SATA controller or its driver?
Is there anything else I should investigate?
System:
CPU: AMD FX-8320E Eight-Core Processor (3193.15-MHz K8-class CPU)
Origin="AuthenticAMD" Id=0x600f20 Family=0x15 Model=0x2 Stepping=0
Features=0x178bfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,MMX,FXSR,SSE,SSE2,HTT>
Features2=0x3e98320b<SSE3,PCLMULQDQ,MON,SSSE3,FMA,CX16,SSE4.1,SSE4.2,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C>
AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
AMD Features2=0x1ebbfff<LAHF,CMP,SVM,ExtAPIC,CR8,ABM,SSE4A,MAS,Prefetch,OSVW,IBS,XOP,SKINIT,WDT,LWP,FMA4,TCE,NodeId,TBM,Topology,PCXC,PNXC>
Structured Extended Features=0x8<BMI1>
SVM: NP,NRIP,VClean,AFlush,DAssist,NAsids=65536
TSC: P-state invariant, performance statistics
real memory = 17179869184 (16384 MB)
avail memory = 16572493824 (15804 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <ALASKA A M I>
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 core(s)
random: unblocking device.
Firmware Warning (ACPI): Optional FADT field Pm2ControlBlock has valid Length but zero Address: 0x0000000000000000/0x1 (20201113/tbfadt-796)
ahci0: <AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller> port 0xf040-0xf047,0xf030-0xf033,0xf020-0xf027,0xf010-0xf013,0xf000-0xf00f mem 0xfeb0b000-0xfeb0b3ff irq 19 at device 17.0 on pci0
ahci0: AHCI v1.20 with 6 6Gbps ports, Port Multiplier supported
ahci0: quirks=0x22000<ATI_PMP_BUG,1MSI>
ahcich0: <AHCI channel> at channel 0 on ahci0
Thilo