Solved: Checksum errors on my ZFS pool

Hello,

I'm experiencing weird issues with ZFS.
I have a home server running FreeBSD 14.2-RELEASE with a single NVMe disk.

Today I ran a scrub of my pool (zroot) and got a lot of checksum errors.
There was no power failure or any other event that could explain it.

This is the scrub report:
Code:
root@nuc:~ # zpool scrub zroot
root@nuc:~ # zpool status zroot -v
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 00:00:07 with 18 errors on Thu Apr 24 12:42:10 2025
config:
        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          nda0p4    ONLINE       0     0    37
errors: Permanent errors have been detected in the following files:
        /usr/local/bastille/jails/srv/root/var/cache/pkg/perl5-5.36.3_3~d97227b0c9.pkg
        /usr/local/bastille/jails/srv/root/var/cache/pkg/mariadb114-client-11.4.5_1~9ca28e7b33.pkg
        /usr/local/bastille/jails/srv/root/var/cache/pkg/dotnet-9.0.3~894952f590.pkg
        /usr/local/bastille/jails/srv/root/usr/home/pix/aaa/Guadeloupe/Guadeloupe-1720-NEF_DxO_DeepPRIMEXD.jpg
        /usr/local/bastille/jails/srv/root/usr/local/share/dotnet/library-packs/runtime.freebsd.14-x64.Microsoft.DotNet.ILCompiler.9.0.3.nupkg
        /usr/local/bastille/jails/srv/root/usr/local/share/dotnet/library-packs/runtime.freebsd.14-x64.Microsoft.NETCore.DotNetAppHost.9.0.3.nupkg
        /usr/local/bastille/jails/srv/root/usr/src/sys/contrib/openzfs/tests/zfs-tests/tests/functional/cli_root/zpool_create/draidcfg.gz
        /usr/local/bastille/jails/srv/root/var/cache/pkg/mariadb114-server-11.4.5_1~2db63a4a70.pkg
        /usr/local/bastille/jails/srv/root/usr/src/contrib/libarchive/libarchive/test/test_read_format_zip_winzip_aes256_large.zip.uu
        /usr/local/bastille/jails/srv/root/var/cache/pkg/webmin-2.013~9ed48768d8.pkg
        /usr/local/bastille/jails/srv/root/var/db/freebsd-update/files/fab7de1b74ef80c3ff0760fdf182c077ded2aab249e0ea3a252f6d3cd5a15e6d.gz
        /usr/local/bastille/jails/srv/root/usr/local/www/share/hxsafe/EDINBURGH.zip
        zroot/ROOT/default@2025-04-14-12:18:03-0:/var/cache/pkg/python311-3.11.11~093fbc3d04.pkg
        zroot/ROOT/default@2025-04-14-12:18:03-0:/usr/freebsd-dist/kernel.txz
        zroot/ROOT/default@2025-04-14-12:18:03-0:/usr/freebsd-dist/src.txz
        zroot/ROOT/default@2025-04-14-12:18:03-0:/var/db/freebsd-update/files/5b7b75b9ec886a56ceccab6c3b495186a643d111746d6b0e462548fa671c48b7.gz
        /usr/local/bastille/jails/matrix/root/var/db/freebsd-update/files/578c8b1ce80566a6412fe5c3bd5982c87e5cfbdc6952f66ea75b53e49e5c8e59.gz
        zroot/ROOT/default@2025-03-29-18:16:24-0:/usr/freebsd-dist/kernel.txz
        zroot/ROOT/default@2025-03-29-18:16:24-0:/usr/freebsd-dist/src.txz
        zroot/ROOT/default@2025-03-29-18:16:24-0:/var/db/freebsd-update/files/5b7b75b9ec886a56ceccab6c3b495186a643d111746d6b0e462548fa671c48b7.gz
        zroot/ROOT/14.2-RELEASE-p2_2025-04-14_121803:/var/cache/pkg/python311-3.11.11~093fbc3d04.pkg
        zroot/ROOT/14.2-RELEASE-p2_2025-04-14_121803:/usr/freebsd-dist/kernel.txz
        zroot/ROOT/14.2-RELEASE-p2_2025-04-14_121803:/usr/freebsd-dist/src.txz
        zroot/ROOT/14.2-RELEASE-p2_2025-04-14_121803:/var/db/freebsd-update/files/5b7b75b9ec886a56ceccab6c3b495186a643d111746d6b0e462548fa671c48b7.gz
        //var/cache/pkg/python311-3.11.11~093fbc3d04.pkg
        //usr/freebsd-dist/kernel.txz
        //usr/freebsd-dist/src.txz
        //var/db/freebsd-update/files/5b7b75b9ec886a56ceccab6c3b495186a643d111746d6b0e462548fa671c48b7.gz
        zroot/ROOT/14.2-RELEASE_2025-03-29_181624:/usr/freebsd-dist/kernel.txz
        zroot/ROOT/14.2-RELEASE_2025-03-29_181624:/usr/freebsd-dist/src.txz
        zroot/ROOT/14.2-RELEASE_2025-03-29_181624:/var/db/freebsd-update/files/5b7b75b9ec886a56ceccab6c3b495186a643d111746d6b0e462548fa671c48b7.gz
        /usr/local/bastille/cache/14.2-RELEASE/base.txz

I removed the /usr/local/bastille/jails/srv/root/home/pix/aaa directory and the /usr/local/bastille/jails/srv/root/usr/local/www/share/hxsafe/EDINBURGH.zip file and redid a scrub; now this is what I get
(the number of checksum errors has increased because I ran several scrubs):
Code:
root@nuc:~ # zpool scrub zroot
root@nuc:~ # zpool status zroot -v
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:07 with 0 errors on Thu Apr 24 13:44:48 2025
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          nda0p4    ONLINE       0     0   199

errors: No known data errors
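
If I read the status message correctly, the CKSUM column is just a running counter per device, so once the real cause is fixed it can be reset and re-checked. A rough sketch of what I plan to do later (only worth running once the hardware is trusted again):
Code:
# reset the per-vdev error counters, then verify with a fresh scrub
zpool clear zroot
zpool scrub zroot
zpool status -v zroot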

Then I ran a self-test with nvmecontrol:
Code:
root@nuc:~ # nvmecontrol selftest -c 2 nvme0
root@nuc:~ # nvmecontrol logpage -p 2 nvme0
SMART/Health Information Log
============================
Critical Warning State:         0x00
 Available spare:               0
 Temperature:                   0
 Device reliability:            0
 Read only:                     0
 Volatile memory backup:        0
Temperature:                    310 K, 36.85 C, 98.33 F
Available spare:                100
Available spare threshold:      10
Percentage used:                0
Data units (512,000 byte) read: 3754076
Data units written:             3261524
Host read commands:             18228121
Host write commands:            84273103
Controller busy time (minutes): 717
Power cycles:                   92
Power on hours:                 153
Unsafe shutdowns:               34
Media errors:                   0
No. error info log entries:     1
Warning Temp Composite Time:    0
Error Temp Composite Time:      0
Temperature 1 Transition Count: 0
Temperature 2 Transition Count: 0
Total Time For Temperature 1:   0
Total Time For Temperature 2:   0

root@nuc:~ # nvmecontrol logpage -p 1 nvme0
Error Information Log
=====================
No error entries found
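
By the way, if I read the man page right, the outcome of the self-test itself is reported in the device self-test log (log page 6 in the NVMe spec), which the SMART page doesn't show. Something like this should display it once the test finishes (the -x variant is a raw hex dump, in case this nvmecontrol doesn't decode that page):
Code:
# check the result of the self-test started above
nvmecontrol logpage -p 6 nvme0
nvmecontrol logpage -p 6 -x nvme0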

The 34 unsafe shutdowns date from this NVMe disk's previous life in another computer, which experienced a lot of power failures.

So I don't understand what happened.
Is my zpool really healthy?

Thank you :)
 
This could be a wide range of things. With an M.2 NVMe you usually don't suffer power or connection problems.

I would proceed by testing the computer itself for memory problems, starting with memtest86, superpi, mprime/prime95, etc.
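
memtest86 needs a reboot onto a USB stick, but mprime's torture test can run from a shell inside FreeBSD. A rough sketch, assuming the package is still called mprime on your release:
Code:
# install and run the prime95/mprime torture test (stop it with Ctrl+C)
pkg install mprime
mprime -t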
 
Okay, thank you, I'll run memtest86.

Do you think my data are okay? The scrub report said:
Code:
errors: No known data errors

I made a backup of the entire pool with zfs send, and I was planning to zfs receive it onto another 1 TB NVMe I have.
Not sure if it's a good idea.
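
For reference, what I had in mind is roughly this (newpool stands for whatever I create on the spare NVMe, and the snapshot name is made up):
Code:
# recursive snapshot of everything, replicated to the other disk
zfs snapshot -r zroot@backup-2025-04-24
zfs send -R zroot@backup-2025-04-24 | zfs receive -Fu newpool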
 
That's not starting very well:
[attached image: 1000009143.jpg]
 
I removed my RAM sticks, cleaned them up, and am running memtest86 again.

In case my RAM turns out to be faulty I'll replace it, but do you think my ZFS pool is corrupted? I mean, should I restart from scratch or keep my NVMe as it is currently?
 
Okay, so you think I might have other corrupted files besides the ones listed in the first scrub report?
I ask because I don't have a pre-memory-failure backup (I'm still building my home server, I didn't think it would break so soon ^^), and the listed files are almost exclusively data I don't care about, cache, or things I can easily fetch again from freebsd.org servers.
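
If the remaining errors only point at old snapshots and boot environments (like the zroot/ROOT/... entries in the first report), my understanding is that I could simply destroy those, then clear and re-scrub, and the references should go away. A rough sketch with names taken from the report above, to be tried only once the RAM is sorted out:
Code:
# drop the stale boot environment and its origin snapshot, then re-check
bectl list
bectl destroy 14.2-RELEASE-p2_2025-04-14_121803
zfs destroy zroot/ROOT/default@2025-04-14-12:18:03-0
zpool clear zroot
zpool scrub zroot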
 
Yeah, undetected errors can happen. I recently had an import panic on a pool because of some linked list something or other. Remember that there is no fsck for ZFS.

Also, I would do a low-level format because of all the power failures in its previous install.
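
From memory nvmecontrol can do that in place, something along these lines; double-check the flags in the man page for your release first, because it wipes the namespace:
Code:
# destructive: reformat the namespace with a user-data erase
nvmecontrol format -E nvme0ns1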
 
The bottom line:
I checked my two memory sticks in each socket and identified the guilty one (the other passes memtest86 in either socket).
I ordered two new memory sticks, and while I'm waiting I rebuilt my home server from scratch with the good memory stick and a spare NVMe disk I had forgotten about.
Time to make a trustworthy backup now.

Thank you everybody for your answers and your knowledge :)
Solved.
 
Yeah, undetected errors can happen. I recently had an import panic on a pool because of some linked list something or other. Remember that there is no fsck for ZFS.
I am not sure what you mean. "fsck", where supported, usually checks file-system structures only, not whether the actual file contents are correct. A "zpool scrub" checks both the structures and the file contents against their checksums. I would personally trust data that passes a zpool scrub.

With SSDs being as affordable as they are, I would always use a mirrored zroot. This will allow the scrub command not only to detect, but also to correct errors.
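
Turning a single-disk zroot into a mirror later is also straightforward if a second slot is available; roughly like this (the second partition name is an example, and the new disk needs the same partition layout plus boot code first):
Code:
# attach a second device to the existing vdev, turning it into a mirror
zpool attach zroot nda0p4 nda1p4
zpool status zroot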
 
I am not sure what you mean. "fsck", where supported, usually checks file-system structures only, not whether the actual file contents are correct. A "zpool scrub" checks both the structures and the file contents against their checksums. I would personally trust data that passes a zpool scrub.

There are some conditions with metadata inconsistency that can hose you at import time. I recently discussed one on #openzfs, forgot the details.
 
There are some conditions with metadata inconsistency that can hose you at import time. I recently discussed one on #openzfs, forgot the details.
That sounds like an incredibly rare circumstance, much rarer than silent corruption of files in non-ZFS file systems. I stand by my viewpoint that ZFS is more resilient than anything using "fsck".
 
With SSDs being as affordable as they are, I would always use a mirrored zroot. This will allow the scrub command not only to detect, but also to correct errors.
Sadly, my home server is a NUC12 (WSHi7000) with only one M.2 slot (and the NVMe plugged into it holds the system). So, no mirror for the system in my case. 😅

I ordered two HDDs and a 2-disk bay for a mirrored data pool.
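
If I understand it right, creating the mirrored data pool should be as simple as something like this once the drives arrive (device names and the pool name are placeholders):
Code:
# create a two-way mirror for the data pool
zpool create tank mirror /dev/ada0 /dev/ada1
zpool status tank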
 