ZFS Zpool device unrecoverable error

Hello friends,

Is there a way to resolve this zpool status issue? I have cleared this error/warning several times using zpool clear, but it keeps coming back every few days.

I don't see any noticeable issue in my system's behavior, but it is concerning nonetheless.

Thank you for your time.


----------
Code:
pool: zfspool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: [URL]https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P[/URL]
  scan: scrub repaired 137M in 00:12:24 with 0 errors on Tue Aug  11 10:11:05 2023
config:

    NAME          STATE     READ WRITE CKSUM
    zfspool       ONLINE       0     0     0
      mirror-0    ONLINE       0   607     0
        mfisysp1  ONLINE       0   842     0
        mfisysp2  ONLINE       0   954     0

errors: No known data errors
 
Is there a way to resolve this zpool status issue? I have cleared this error/warning several times using zpool clear, but it keeps coming back every few days.

Something about your hardware is bad. The disks themselves might be, or it might be the SATA cables, or the motherboard/SATA controller. In your case, ZFS is managing to cope with the errors and no known data errors have occurred, so you are still able to rectify this by fixing the failing hardware.

Given that both disks in your mirror are failing, it might be best to replace them: use zpool-attach(8) to add new disks to the mirror (making it 3- or 4-wide), then detach the old ones.
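Roughly, that sequence could look like this (the new device node /dev/ada2 is an assumption; your device names will differ):

```shell
# Attach a new, healthy disk as an additional side of the existing mirror;
# mfisysp1 is one of the current members (from the status output above),
# /dev/ada2 is a hypothetical new disk:
zpool attach zfspool mfisysp1 /dev/ada2

# Watch the resilver until it completes:
zpool status zfspool

# Once resilvered, remove one of the old, failing disks:
zpool detach zfspool mfisysp1
```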
 
Check the disks themselves with smartctl(8).
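For example (device names are assumptions; behind a MegaRAID controller the disks have to be addressed through the megaraid passthrough):

```shell
# Health summary, attributes and error log for a directly attached disk:
smartctl -a /dev/ada0

# Behind a MegaRAID controller, address each physical drive by its
# device ID (the numbers are assumptions -- enumerate your own drives):
smartctl -a -d megaraid,0 /dev/mfi0
smartctl -a -d megaraid,1 /dev/mfi0

# Start a long self-test on a drive and check the results afterwards:
smartctl -t long -d megaraid,0 /dev/mfi0
```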

I just installed this utility and tried to run a diagnostic, as suggested in an online post.

Are there specific commands you would suggest I try, to see whether there is a problem?

Thank you all for the help.

-------
Here is what I got from smartctl -d megaraid.
All of the disks showed no apparent errors:


SMART Error Log Version: 1
No Errors Logged

All drives had:
Reallocated Sector Count: 0
Current Pending Sector Count: 0
 
Here is what I got from smartctl -d megaraid.
Are you using ZFS on top of hardware RAID?
Don't do that.

Nuke the pool from orbit and restore from backups, using a plain HBA without any lying RAID firmware between ZFS and the disks.
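A minimal sketch of that rebuild, assuming backups exist as a zfs send stream (device names, dataset and backup file are all hypothetical):

```shell
# Destroy the old pool -- irreversible, so verify the backups first:
zpool destroy zfspool

# Recreate the mirror on disks hanging off a plain HBA
# (/dev/da0 and /dev/da1 are assumptions):
zpool create zfspool mirror /dev/da0 /dev/da1

# Restore the data from a replicated snapshot stream:
zfs receive zfspool/data < /backup/zfspool-data.zfs
```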
 
It seems that this pool is not built using raw disks, but a MegaRAID disk array. It could even be that the disk array internally uses redundancy to "be perfect"; it could also be that you're running it in JBOD mode, with physical drives having a 1-to-1 correspondence to logical disks (= zpool vdevs). You should probably tell us more about your setup.
 
Don't do that.
Correct. ZFS has a RAID layer built in, which is usually better than a hardware solution. I say "usually", because it's not black and white; a really good underlying RAID layer (which is not the one you find on consumer cards) could do things that ZFS doesn't do.

Nuke the pool from orbit, ...
That's a bit extreme; it's quite possible that the setup can be saved for now.
 
It seems that this pool is not built using raw disks, but a MegaRAID disk array. It could even be that the disk array internally uses redundancy to "be perfect"; it could also be that you're running it in JBOD mode, with physical drives having a 1-to-1 correspondence to logical disks (= zpool vdevs). You should probably tell us more about your setup.
To be honest, I'm not really sure what was done here, as it was quite a long time ago.
That said, although the servers had RAID capability built into the BIOS, I don't think any RAID configuration is in use; my understanding is that ZFS was installed with direct access to the hardware.
Not sure, though.
 
The problem with RAID controllers, even in JBOD mode, is that they still meddle with the on-disk format and lie about a lot of things (latencies, caching, whether data has actually been committed...).
Given that the Windows folks pay good money even for older used RAID controllers, just throw that thing out and get a plain, simple HBA. For a home server, a used or "white box" SAS9300-based HBA (never had any problems with those) can regularly be found for under $100.


That's a bit extreme; it's quite possible that the setup can be saved for now.
It *might* be. Given the large number of write errors and the fact that there's a RAID controller in between, there's a chance data hasn't been committed to disk even though the controller claimed it was (with high latency).
One might add one (or two) disks that are *NOT* connected to the RAID controller (i.e. attached to an HBA or the onboard SATA controller) to the mirror, let them resilver, then move at least one of the other two drives off the RAID controller, wipe it, and re-add it to the pool. With the pool moved completely off the RAID controller, a scrub will show if/how much damage was done.

DON'T just connect those drives to an HBA: they most likely carry proprietary headers, or even a completely proprietary on-disk format, and the data won't be accessible from a plain HBA. Worst case, those headers get damaged in the process, and the RAID controller will no longer recognize the drives and will wipe them when reconfiguring.
 
sko, thank you. It seems quite involved.

I have a couple of servers, and it seems strange that both have the same kind of error.
Scrubbing and clearing tell me the zpool is fine, and I cannot see any issue from the smartctl commands either.

I found this in dmesg:

-----

mfi0: I/O error, cmd=0xfffffe00e441fc18, status=0x3c, scsi_status=0
mfi0: sense error 13, sense_key 15, asc 137, ascq 67

mfi0: sense error 47, sense_key 15, asc 175, ascq 175
mfisyspd3: hard error cmd=write #######608- 143
 
Unfortunately, I cannot find these particular errors/warnings in the lists available online.
In any case, would it be possible to run a specific list of commands to figure out where the problem is, if there is a problem at all?
 
You are correct. Sense key 15 = Fh does not exist, nor do ASCs 137 = 89h and 175 = AFh. To decode them, you probably need to find vendor documentation for your RAID card, or contact the manufacturer's tech support.
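A quick way to double-check that is to convert the decimal values from the log to hex with printf(1):

```shell
# Decimal sense values from the mfi(4) log lines, shown in hex.
# Valid sense keys are 0h-Eh, so 15 (Fh) is already out of range.
for dec in 15 137 175; do
    printf '%d = %Xh\n' "$dec" "$dec"
done
# prints:
# 15 = Fh
# 137 = 89h
# 175 = AFh
```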
 
It's better to first read the DESCRIPTION section of mrsas(4), check which RAID controller version you have, and see whether there are any firmware updates for it.

The mrsas name is derived from the phrase "MegaRAID SAS HBA", which is
substantially different than the old "MegaRAID" Driver mfi(4) which does
not connect targets to the cam(4) layer and thus requires a new driver
which attaches targets to the cam(4) layer. Older MegaRAID controllers
are supported by mfi(4) and will not work with mrsas, but both the mfi(4)
and mrsas drivers can detect and manage the LSI MegaRAID SAS
2208/2308/3008/3108 series of controllers.
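On FreeBSD, a few commands can show which controller and driver are actually in use (mfiutil(8) only talks to mfi(4)-attached controllers):

```shell
# List PCI devices the kernel sees, with vendor information:
pciconf -lv | grep -B 3 -i raid

# For an mfi(4)-attached controller, show model and firmware version:
mfiutil show adapter

# See which driver claimed the controller at boot:
dmesg | egrep -i 'mfi|mrsas'
```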
 