Solved Zpool checksum errors

Giteh · Apr 22, 2021

Hello,

I have a zpool with the following structure:

Code:

        NAME          STATE     READ WRITE CKSUM
        zdata         ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada3p1    ONLINE       0     0     4
            ada5p1    ONLINE       0     0     1
          mirror-1    ONLINE       0     0     0
            gpt/zfs2  ONLINE       0     0     2
            gpt/zfs3  ONLINE       0     0     1
          mirror-2    ONLINE       0     0     0
            gpt/zfs4  ONLINE       0     0     2
            gpt/zfs5  ONLINE       0     0     1

This pool was moved from a different server using 'zpool export' and 'zpool import'. A few months after the export/import I noticed checksum errors. If I run 'zpool clear', new errors appear within a few minutes.
After 'zpool clear' I also get this output:

Code:

status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: none requested

Is it possible that upgrading the pool will help? If not, what other options do I have?

Thanks for all answers!

SirDice · Apr 22, 2021

Giteh said:
Is it possible that upgrading the pool will help?

No, these checksum errors have nothing to do with features.

I would take a good look at the disks themselves using smartctl(8) and check the logs for any issues with the controller. Also check if there's a port extender involved; a dodgy extender can cause all sorts of problems. It might be as simple as a couple of dodgy cables or a power supply that's barely holding up, so definitely check your cabling and power requirements too.

VladiBG · Apr 22, 2021

If there's a faulty cable, it will cause UDMA CRC errors logged in SMART, but it will not write corrupted data to the disk.
Do you have ECC memory? Do you see any CAM errors on the console? Can you replace your HBA? Can you replace your PSU?
How are the disks connected to the HBA: via a backplane or cables?

Giteh · Apr 22, 2021

VladiBG said:
If there's a faulty cable, it will cause UDMA CRC errors logged in SMART, but it will not write corrupted data to the disk.
Do you have ECC memory? Do you see any CAM errors on the console? Can you replace your HBA? Can you replace your PSU?
How are the disks connected to the HBA: via a backplane or cables?

Thank you for the answer!

- I'm not sure whether this machine has ECC memory, and I cannot find a simple way to check it from the FreeBSD console. Do you know one?
- I cannot see any CAM-related errors in dmesg or in /var/log/messages, only "ZFS[36678]: checksum mismatch" (in /var/log/messages).
- If it could help, I can try to replace the HBA or even the PSU.

Could you also tell me how to check with smartctl whether there are any UDMA CRC errors? Will they be visible in 'smartctl -a'?

VladiBG · Apr 22, 2021

You need to install sysutils/smartmontools first; then you can use smartctl -a /dev/ada3 to show the status of drive ada3.
Under SMART Attributes, look for any CRC errors, ECC recovered errors, read errors and reallocated sectors.
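
As a rough sketch of that check, the attribute table can be filtered down to the rows that matter here. The function name smart_suspects is made up for illustration, and the attribute names are an assumption: they are the ones commonly reported by ATA drives, but they vary by vendor, and raw values for the read-error and ECC counters are vendor-specific and not always meaningful on their own.

```sh
#!/bin/sh
# smart_suspects (hypothetical helper): reads `smartctl -A` output on stdin
# and prints "ATTRIBUTE_NAME RAW_VALUE" for a handful of attributes.
# The attribute list is an assumption; adjust it for your drive vendor.
smart_suspects() {
    awk '
        # Column 2 is the attribute name, column 10 is the raw value.
        $2 ~ /^(UDMA_CRC_Error_Count|Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable|Hardware_ECC_Recovered|Raw_Read_Error_Rate)$/ {
            printf "%s %s\n", $2, $10
        }
    '
}

# Usage on a live system (needs sysutils/smartmontools and root):
#   smartctl -A /dev/ada3 | smart_suspects
```

A non-zero and growing UDMA_CRC_Error_Count usually points at cabling or the backplane, while reallocated or pending sectors point at the disk itself.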

Then, if the SMART status of all disks is OK, the next step is to test your RAM using memtest86.

SirDice · Apr 22, 2021

Giteh said:
I'm not sure whether this machine has ECC memory, and I cannot find a simple way to check it from the FreeBSD console. Do you know one?

sysutils/dmidecode
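
A minimal sketch of how that could look, assuming the BIOS fills in the SMBIOS memory tables correctly (not every board does). The helper name ecc_type is made up for illustration; it just pulls the "Error Correction Type" field out of the dmidecode output.

```sh
#!/bin/sh
# ecc_type (hypothetical helper): reads `dmidecode -t memory` output on stdin
# and prints the value of each "Error Correction Type" field.
ecc_type() {
    sed -n 's/^[[:space:]]*Error Correction Type:[[:space:]]*//p'
}

# Usage on a live system (needs sysutils/dmidecode and root):
#   dmidecode -t memory | ecc_type
```

A value such as "None" suggests non-ECC memory; anything mentioning ECC suggests the array supports it, though that is only what the firmware reports.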

Giteh · Apr 26, 2021

It appears that the issue was related to ZFS itself. I ran 'zpool scrub' and afterwards there were no more errors. It has worked for three days without trouble, so I think it'll be fine.
Once again thank you for your help!

mtu · Apr 26, 2021

There's still a chance of faults with your hardware. Your scrub has made ZFS check every checksum of every block on every disk. If a mismatch was found with one block on one disk, the correct data was copied over from the mirror partner to replace the bad data. So the scrub probably just "flushed" all the checksum errors (and automatic corrections-from-the-mirror-partner) that you would have run into over time, and resolved them at once.

It looks like you were lucky that no block was faulty on both disks of a mirror. But I'd be wondering what caused those errors in the first place, especially since you said that errors were piling up quickly before the scrub.
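
To see whether the errors come back, one option is to scrub again after a while and list only the devices with a non-zero CKSUM count. This is a sketch, not a monitoring solution: the helper name cksum_offenders is made up, and it assumes the usual five-column "NAME STATE READ WRITE CKSUM" layout of the zpool status config table.

```sh
#!/bin/sh
# cksum_offenders (hypothetical helper): reads `zpool status` output on stdin
# and prints "DEVICE CKSUM" for every row whose CKSUM column is not 0.
cksum_offenders() {
    awk '
        # The config table starts at its header row...
        $1 == "NAME" && $NF == "CKSUM" { in_cfg = 1; next }
        # ...and ends at the first blank line after it.
        in_cfg && NF == 0 { in_cfg = 0 }
        # Column 5 is the checksum error counter.
        in_cfg && NF >= 5 && $5 != "0" { print $1, $5 }
    '
}

# Usage on a live system (pool name taken from the first post):
#   zpool scrub zdata
#   zpool status zdata | cksum_offenders
```

If the list stays empty over the following weeks, the old errors were probably a one-off; if counts start climbing again on several disks at once, shared hardware (RAM, HBA, PSU, cabling) is the more likely suspect.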
