Solved zpool error

/var/log/messages

Code:
ZFS[78724]: vdev I/O failure, zpool=zroot path=/dev/nvd0p3 offset=695657820160 size=131072 error=5

Message from Monit:

Code:
 Description: status failed (1) -- ==== ZPOOL STATUS ====
 pool: zroot
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
  see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
 scan: scrub repaired 0B in 00:00:05 with 0 errors on Sat Oct 15 12:29:50 2022
config:

    NAME        STATE     READ WRITE CKSUM
    zroot       ONLINE       0     0     0
      mirror-0  ONLINE       0     0     0
        nvd0p3  ONLINE       0     1     0
        nvd1p3  ONLINE       0     0     0

errors: No known data errors

zpool status

Code:
  pool: zroot
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 00:00:05 with 0 errors on Sat Oct 15 12:29:50 2022
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            nvd0p3  ONLINE       0     1     0
            nvd1p3  ONLINE       0     0     0

errors: No known data errors

smartctl -a /dev/nvme0

Code:
smartctl 7.3 2022-02-28 r5338 [FreeBSD 13.1-RELEASE-p2 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       KIOXIA KCD71RUG3T84
Serial Number:                      3250A1HWTQN8
Firmware Version:                   0104
PCI Vendor/Subsystem ID:            0x1e0f
IEEE OUI Identifier:                0x8ce38e
Total NVM Capacity:                 3,840,755,982,336 [3.84 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               64
Local Time is:                      Sun Oct 16 16:18:58 2022 EEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x025f):   Security Format Frmw_DL NS_Mngmt Self_Test MI_Snd/Rec Get_LBA_Sts
Optional NVM Commands (0x00ff):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Resv Timestmp Verify
Log Page Attributes (0x1e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg
Maximum Data Transfer Size:         1024 Pages
Warning  Comp. Temp. Threshold:     72 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    25.00W   25.00W       -    0  0  0  0   500000  500000
 1 +    18.00W   18.00W       -    0  0  1  1   500000  500000
 2 +    16.00W   16.00W       -    0  0  2  2   500000  500000
 3 +    14.00W   14.00W       -    1  1  3  3   500000  500000
 4 +    11.00W   11.00W       -    2  2  4  4   500000  500000
 5 +     9.00W    9.00W       -    3  3  5  5   500000  500000
 6 -     5.00W       -        -    6  6  6  6   500000  500000

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        56 Celsius
Available Spare:                    100%
Available Spare Threshold:          26%
Percentage Used:                    0%
Data Units Read:                    247,886,257 [126 TB]
Data Units Written:                 311,308,900 [159 TB]
Host Read Commands:                 5,631,341,557
Host Write Commands:                641,940,409
Controller Busy Time:               6,869
Power Cycles:                       16
Power On Hours:                     3,661
Unsafe Shutdowns:                   2
Media and Data Integrity Errors:    0
Error Information Log Entries:      214
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 256 entries)
No Errors Logged

I did a zpool scrub and here is the result:

Code:
scrub repaired 384K in 00:00:49 with 0 errors on Sun Oct 16 16:21:47 2022

How should I proceed with this error? Do I have to replace the disk now, or should I run zpool clear and replace the disk only if the error happens again?
 
It is weird that the scrub shows no errors while the status shows errors. That looks like conflicting information, though I'm not sure.
You could try rerunning the scrub.
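
The two outputs are not necessarily in conflict: the READ/WRITE/CKSUM counters in the config table accumulate until they are reset (for example with zpool clear), while the scan line only describes the most recent scrub. As a rough sketch, the devices with non-zero counters can be pulled out of the status output like this. The function name nonzero_vdevs is made up for illustration and is not part of ZFS.

```shell
#!/bin/sh
# Hypothetical helper: list vdevs whose READ, WRITE or CKSUM counter is
# non-zero. Reads `zpool status` output on stdin.
nonzero_vdevs() {
    awk '
        # The config table starts at the NAME/STATE header line.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # The "errors:" summary line ends the table.
        /^errors:/                    { in_config = 0 }
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}
```

Usage would be something like `zpool status zroot | nonzero_vdevs`. With the status shown earlier in the thread it should report only nvd0p3 with one write error.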
 
Yes, it clears the error, but if I run "zpool scrub" a few times, it sometimes reports that it repaired data:

Code:
scrub repaired 128K in 00:00:53 with 0 errors on Sun Oct 16 16:55:44 2022
 
These are the only errors in /var/log/messages:

Code:
ZFS[78724]: vdev I/O failure, zpool=zroot path=/dev/nvd0p3 offset=695657820160 size=131072 error=5
nvme0: WRITE sqid:3 cid:121 nsid:1 lba:1392263160 len:256
nvme0: DATA TRANSFER ERROR (00/04) sqid:3 cid:121 cdw0:0
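
The three log lines above look like one failed write seen from two layers. A quick sanity check of that reading, assuming the namespace uses 512-byte logical blocks (the logs do not state the block size, so this is an assumption):

```shell
#!/bin/sh
# Assumption: 512-byte logical blocks on the NVMe namespace.
SECTOR=512

# Values copied from the log lines above.
nvme_lba=1392263160
nvme_len=256
zfs_offset=695657820160
zfs_size=131072

# Size of the failed NVMe write in bytes.
nvme_bytes=$((nvme_len * SECTOR))
# Byte position of the failed LBA, counted from the start of the disk.
lba_byte_offset=$((nvme_lba * SECTOR))
# Gap between the on-disk position and the offset ZFS reported for nvd0p3.
delta=$((lba_byte_offset - zfs_offset))

echo "NVMe transfer size: $nvme_bytes bytes (ZFS reported $zfs_size)"
echo "byte offset of the LBA on disk: $lba_byte_offset"
echo "disk offset minus ZFS offset: $delta"
```

Under that assumption the NVMe transfer size comes out equal to the size ZFS reported. If the last number also matches the start of nvd0p3 in `gpart show nvd0`, both messages very likely refer to the same 128K write.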
 
You can try running smartctl from sysutils/smartmontools to see if there are drive errors. Also run some of the built-in self-tests.
 
Finally, I asked the datacenter to replace the disk and added the new one back to the mirror using:

Code:
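# Copy the partition table from the healthy disk to the new one
gpart backup nvd1 | gpart restore -F nvd0
# Drop the missing component from the swap mirror, then add the new partition
gmirror forget swap
gmirror insert swap /dev/nvd0p2
# Install the protective MBR and the ZFS boot code on the new disk
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 nvd0
# Resilver the pool onto the new partition
zpool replace zroot /dev/nvd0p3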
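
After zpool replace the pool resilvers in the background. A minimal sketch for waiting until that is done, assuming the status output contains the text "resilver in progress" while it runs. The function names and the 60-second default are my own choices, not anything ZFS provides.

```shell
#!/bin/sh
# Hypothetical helper: exits 0 while `zpool status` for the given pool
# reports a running resilver.
resilver_running() {
    zpool status "$1" | grep -q 'resilver in progress'
}

# Hypothetical helper: poll until the resilver has finished, then print
# the final pool state.
wait_for_resilver() {
    pool=$1
    interval=${2:-60}   # seconds between checks; 60 is an arbitrary default
    while resilver_running "$pool"; do
        sleep "$interval"
    done
    zpool status "$pool"
}
```

It would be called as `wait_for_resilver zroot`. Once it returns, the config table should show both nvd0p3 and nvd1p3 as ONLINE with zero counters if the replacement went well.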
 