ZFS External Drive ZFS - power loss, insufficient replicas, pool suspended, cannot online /dev/da0: pool I/O is currently suspended

nerozero · Jul 12, 2024

Hello,

How should I deal with the situation where an external drive with ZFS was temporarily disconnected or lost power?

After reconnecting the drive:

Code:

# zpool status
  pool: zrbackup01
 state: SUSPENDED
status: One or more devices are faulted in response to IO failures.
action: Make sure the affected devices are connected, then run 'zpool clear'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-HC
  scan: scrub repaired 0B in 1 days 15:03:03 with 0 errors on Fri Jul 12 04:53:55 2024
config:

    NAME           STATE     READ WRITE CKSUM
    zrbackup01     UNAVAIL      0     0     0  insufficient replicas
      da0          REMOVED      0     0     0

# zpool clear zrbackup01
cannot clear errors for zrbackup01: I/O error

# zpool clear -nF zrbackup01

# zpool online zrbackup01 /dev/da0
cannot online /dev/da0: pool I/O is currently suspended

# geom disk list
...
Geom name: da0
Providers:
1. Name: da0
   Mediasize: 14000519639040 (13T)
   Sectorsize: 4096
   Mode: r0w0e0
   descr: Seagate USB
   lunid: 41736d6564696120
   ident: XXXXXX
   rotationrate: unknown
   fwsectors: 63
   fwheads: 255
...

At this stage, if I reboot the PC, FreeBSD hangs with no response (probably because the drive's I/O is suspended); the only way to bring the system back to life is a hard reset.

Thanks

ralphbsz · Jul 12, 2024

nerozero said:
# zpool clear zrbackup01
cannot clear errors for zrbackup01: I/O error

What was the IO error? You should probably look at system logs to diagnose errors, before marching ahead blindly.

# zpool clear -nF zrbackup01

I don't see options -n or -F in the documentation, looking at both 13.3 and 14.1. Anyone know what they do?

# geom disk list
...

Is this the correct disk? From the geom output, I can't tell.

At this stage, if I reboot the PC, FreeBSD hangs with no response (probably because the drive's I/O is suspended); the only way to bring the system back to life is a hard reset.

You are telling me that reboot and hard reset give different results at this stage. That's weird, and points to a hardware problem. From a FreeBSD software viewpoint, both should be identical.

When you say "hang with no response", how far does the boot get? What is the last message? Are there any error messages?

nerozero · Jul 12, 2024

ralphbsz said:
What was the IO error? You should probably look at system logs to diagnose errors, before marching ahead blindly.

kernel: Solaris: WARNING: Pool 'zrbackup01' has encountered an uncorrectable I/O failure and has been suspended.
kernel: ZFS[55529]: pool I/O failure, zpool=zrbackup01 error=28

kernel: Solaris: WARNING: Pool 'zrbackup01' has encountered an uncorrectable I/O failure and has been suspended.
kernel: ZFS[55645]: pool I/O failure, zpool=zrbackup01 error=28

kernel: Solaris: WARNING: Pool 'zrbackup01' has encountered an uncorrectable I/O failure and has been suspended.
kernel: ZFS[57506]: pool I/O failure, zpool=zrbackup01 error=28

ralphbsz said:
I don't see options -n or -F in the documentation, looking at both 13.3 and 14.1. Anyone know what they do?

-F: (undocumented for clear, the same as for import) Rewind. Recovery mode for a non-importable pool. Attempt to return the pool to an importable state by discarding the last few transactions. Not all damaged pools can be recovered by using this option. If successful, the data from the discarded transactions is irretrievably lost. This option is ignored if the pool is importable or already imported.
-n: (undocumented for clear, the same as for import) Used with the -F recovery option. Determines whether a non-importable pool can be made importable again, but does not actually perform the pool recovery. For more details about pool recovery mode, see the -F option, above.
-X (undocumented): Extreme rewind. The effect of -X seems to be that some extremely lengthy operation is attempted, that never finishes. In some cases, a reboot was necessary to terminate the process.
-V (undocumented): Option found by UTSLing; when used for import, it gets the pool imported again, but still without an attempt at resilvering.

ralphbsz said:
Is this the correct disk? From the geom output, I can't tell.

Yes, no other daX is connected. smartctl can read the drive normally, and dd and other tools can communicate with the drive without problems.

ralphbsz said:
You are telling me that reboot and hard reset give different results at this stage. That's weird, and points to a hardware problem. From a FreeBSD software viewpoint, both should be identical.

After restarting the system, the drive is imported without any issues or errors.
I have had this issue on multiple systems, with the same outcome. I just upgraded FreeBSD from 13.3 to the latest 14.1: same issue (I have not yet upgraded the ZFS version on the drive).

ralphbsz said:
When you say "hang with no response", how far does the boot get? What is the last message? Are there any error messages?

From memory: "Syncing disks"

nerozero · Jul 12, 2024

Found this: https://github.com/openzfs/zfs/pull/11082

nerozero · Jul 12, 2024

I also forgot to mention: if I try to export or force-export the pool, the command never completes:

Code:

# zpool export -f zrbackup01
..... Never ends ......

Andriy · Jul 12, 2024

nerozero said:
error=28

This is the main clue.

nerozero · Jul 12, 2024

Andriy said:
This is the main clue.

Please go on. I have found no reference to this error.
I would be grateful if you could share where this clue should lead to.

Erichans · Jul 12, 2024

nerozero said:
After restarting the system, the drive is imported without any issues or errors.
I have had this issue on multiple systems, with the same outcome. I just upgraded FreeBSD from 13.3 to the latest 14.1: same issue (I have not yet upgraded the ZFS version on the drive).

I suggest you leave the zpool upgrade consideration for another day.

~~After the [external] "drive imported without ...", did you perform a zfs scrub?~~ missed the scrub in your OP.

nerozero said:
-X (undocumented): Extreme rewind. The effect of -X seems to be that some extremely lengthy operation is attempted, that never finishes. In some cases, a reboot was necessary to terminate the process.

Referring to zpool-history(8); have you actually issued a command with -X*, like zpool clear -FX ... or zpool import -FX ?
zpool-import.8:

-X
Used with the -F recovery option. Determines whether extreme measures to find a valid txg should take place. This allows the pool to be rolled back to a txg which is no longer guaranteed to be consistent. Pools imported at an inconsistent txg may contain uncorrectable checksum errors. For more details about pool recovery mode, see the -F option, above. WARNING: This option can be extremely hazardous to the health of your pool and should only be used as a last resort.

Likewise -V ?
zpool_main.c - #L3719-#L3723:

Code:

 *	-V	Import even in the presence of faulted vdevs.  This is an
 *		intentionally undocumented option for testing purposes, and
 *		treats the pool configuration as complete, leaving any bad
 *		vdevs in the FAULTED state. In other words, it does verbatim
 *		import.

If you've used those options, that might have complicated things.

___
* zpool_main.c - #L7410-#L7414 (for zpool_do_clear(int argc, char **argv)):

Code:

	if ((dryrun || xtreme_rewind) && !do_rewind) {
		(void) fprintf(stderr,
		    gettext("-n or -X only meaningful with -F\n"));
		usage(B_FALSE);
	}

Likewise: zpool_main.c - #L3906-#L3910, for zpool_do_import(int argc, char **argv)

Erichans · Jul 12, 2024

nerozero said:
Please go on. I have found no reference to this [28] error.

I could only find two instances of any reference to a ZFS error 28:

Pool err=28 flags=0x4000 bookmark=515 - task txg_sync:3412555 blocked for more than 120 seconds. Call Trace #13959
Regression: Temporary USB issues cause ZFS lockup until forced reboot, and sometimes data loss! #12007

However, I have no idea whether those ZFS error numbers are "absolute" and unique, or relative and dependent on context. Unfortunately, I do not know where in the code those error messages are listed.

nerozero · Jul 13, 2024

Erichans Thanks, will try that, no physical access to the server till Monday.

Andriy · Jul 13, 2024

nerozero said:
Please go on. I have found no reference to this error.
I would be grateful if you could share where this clue should lead to.

Here is a handy little python script that lets you "decode" any POSIX / errno error code:
python -c 'import os ; import errno ; x = 28 ; print(errno.errorcode[x]) ; print(os.strerror(x))'

ralphbsz · Jul 14, 2024

On FreeBSD, "man errno" has the same information.
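
For decoding more than one log entry, the one-liner above can be stretched into a small script. This is only a sketch: it assumes the error= field in the kernel message is a plain errno value (the thread does not confirm that), and the function name decode_zfs_log_line and the regular expression are made up for illustration, based only on the log lines posted earlier.

```python
import errno
import os
import re

# Pattern modeled on the log lines posted in this thread, e.g.
#   kernel: ZFS[55529]: pool I/O failure, zpool=zrbackup01 error=28
# Assumption: other ZFS messages may use a different layout and will not match.
LOG_RE = re.compile(r"pool I/O failure, zpool=(?P<pool>\S+) error=(?P<err>\d+)")


def decode_zfs_log_line(line):
    """Return (pool, errno number, symbolic name, message), or None if no match.

    Assumption: the number after 'error=' is an ordinary errno value.
    """
    match = LOG_RE.search(line)
    if match is None:
        return None
    number = int(match.group("err"))
    # errno.errorcode maps numbers to names for the platform Python runs on.
    name = errno.errorcode.get(number, "UNKNOWN")
    return match.group("pool"), number, name, os.strerror(number)


if __name__ == "__main__":
    sample = "kernel: ZFS[55529]: pool I/O failure, zpool=zrbackup01 error=28"
    print(decode_zfs_log_line(sample))
```

Errno numbering is platform-specific, so run it on the machine that produced the log (or compare against "man errno" there). On FreeBSD and Linux, errno 28 is ENOSPC.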