ZFS data corruption - Smart fine

weberjn · Jul 6, 2024

I have a fairly new Seagate Exos X18 disk with a ZFS pool that now shows two files with errors.
I find it rather implausible that the disk (which is half full) is fine with the exception of just two files.
Especially as smartctl does not show errors.

Can this be a problem with the ZFS code or the checksum of the two files?

sh:

root:~ # zpool status -v
  pool: seagate16tb
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
config:

        NAME         STATE     READ WRITE CKSUM
        seagate16tb  ONLINE       0     0     0
          ada0       ONLINE       0     0    22

errors: Permanent errors have been detected in the following files:

...

root:~ # smartctl /dev/ada0 -a
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.1-RELEASE amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST18000NM000J-2TV103
Serial Number:    ..
LU WWN Device Id: 5 ..
Firmware Version: SN02
User Capacity:    18,000,207,937,536 bytes [18.0 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database 7.3/5528
ATA Version is:   ACS-4 (minor revision not indicated)
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sat Jul  6 14:53:13 2024 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

cracauer@ · Jul 6, 2024

Could also be other hardware failure such as RAM errors.

tingo · Jul 6, 2024

SMART is better than nothing. But I would trust zfs over smart any day. Just my 0.02 eurocents.

Mirror176 · Jul 21, 2024

zpool scrub makes ZFS verify that all allocated blocks read back correctly, but it ignores unallocated space, and it depends on the system functioning as a whole (disk > controller > RAM > CPU > ZFS). SMART tests more properties of the drive that ZFS and the OS either do not see or do not watch: temperature, read retries before the drive tells the OS a sector is unreadable, reallocated blocks, etc.
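To see at a glance which device ZFS is blaming, the error counters in zpool status output can be filtered. This is only a rough sketch: the function name zpool_nonzero_errors is made up for illustration, and it assumes the usual "NAME STATE READ WRITE CKSUM" column layout shown in the first post.

```shell
# Hypothetical helper: reads `zpool status` output on stdin and prints
# every pool/vdev/device row whose READ, WRITE or CKSUM counter is not 0.
# Assumes the standard five-column config table.
zpool_nonzero_errors() {
    awk '
        # Only rows of the config table: second field is a device state
        # and there are at least five columns.
        $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ && NF >= 5 {
            if ($3 != "0" || $4 != "0" || $5 != "0")
                printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage (pool name taken from the first post):
#   zpool status seagate16tb | zpool_nonzero_errors
```

Checksum errors with zero read and write errors, as in the output above, mean the data came back from the device without an I/O error but did not match the stored checksum, which is why the other parts of the chain stay under suspicion.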

If you suspect the drive: reading the SMART log only shows what the drive has recorded so far and its current running properties. A short or conveyance test only partially tests the disk, while a long test includes a full surface read. Those tests keep the load on the disk itself, so cable, controller, RAM, etc. issues will likely not be picked up. smartctl -x is the new smartctl -a, or at least it gives more detail in a different presentation, so it could be worth a try too.
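As a sketch of what to look for in that output, the attribute table from smartctl -A can be narrowed down to a few rows. The function name smart_key_attrs and the choice of attribute IDs (5, 187, 194, 197, 198, 199) are assumptions on my part; vendors differ in which IDs they report, and since this drive is not in the smartctl database the attribute names may show up as generic ones.

```shell
# Hypothetical helper: reads `smartctl -A` output on stdin and prints
# ID, attribute name and the first word of the raw value for a few
# commonly watched attributes (reallocated/pending/uncorrectable
# sectors, temperature, interface CRC errors).
# Assumes the classic 10-column ATA attribute table.
smart_key_attrs() {
    awk '$1 ~ /^(5|187|194|197|198|199)$/ && NF >= 10 {
        printf "%s %s %s\n", $1, $2, $10
    }'
}

# Usage (device name taken from the first post):
#   smartctl -A /dev/ada0 | smart_key_attrs
```

A rising raw value for attribute 199 (interface CRC errors) usually points at the cable or connector rather than the platters, so it is one of the few SMART values that hints at problems outside the disk.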

Though you can try the shorter SMART tests, I'd just run the long one; it aborts and logs the error if it runs into an unreadable sector. I think gsmartcontrol gives a periodically updated status as a test runs, and you can check that status from the command line by reading smartctl output, but otherwise the test runs silently in the background. A zpool scrub may be in order too.
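Roughly, the sequence could look like the following. The device and pool names are taken from the first post; the function names are only for illustration, and everything needs to run as root.

```shell
# Device and pool names as they appear in the first post; adjust as needed.
DISK=/dev/ada0
POOL=seagate16tb

# Start the long (extended) self-test. The command returns right away
# and the test continues inside the drive.
start_long_test() {
    smartctl -t long "$DISK"
}

# Check on the test: -c includes the self-test execution status,
# -l selftest prints the log of finished or aborted tests.
show_selftest_progress() {
    smartctl -c "$DISK"
    smartctl -l selftest "$DISK"
}

# Start a scrub and show the pool status, including scrub progress
# and any files with permanent errors.
scrub_pool() {
    zpool scrub "$POOL" && zpool status -v "$POOL"
}

# Usage:
#   start_long_test
#   show_selftest_progress   # repeat until the test is reported as finished
#   scrub_pool
```

On a disk this size the long test and the scrub can each take many hours, and running both at once makes them compete for the same disk, so running them one after the other is probably the saner choice.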

I have had disks intermittently corrupt data when they overheated (it showed up as unreadable sectors that succeeded without reallocation when overwritten), and years ago I had a disk that would write and read data with occasional corruption but with no disk errors reported as it happened.
