ZFS cannot import pool after booting 13.2 following an upgrade to 14.0

First, a quick warning to beadm users:

If you have
  • upgraded to 14.0
  • and you boot from ZFS
  • and you ran zpool upgrade on your root ZFS pool
  • and you beadm activate a pre-14.0 snapshot
then you may be in for some pain :'‑(
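
If you're not sure whether you have already run zpool upgrade on your root pool, a quick check looks something like this (assuming the root pool is called zroot; the feature name is the one from the loader error further down):
Code:
# zpool get feature@vdev_zaps_v2 zroot      # "active" means a pre-14.0 loader can no longer read the pool
# zpool get all zroot | grep feature@       # full list of feature flags and their current states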


How did I get here:


My FreeBSD system uses UEFI, boots from ZFS, and I use beadm just in case I have to roll back when things go wrong.

Last week I updated to FreeBSD 14.0-RELEASE via freebsd-update.
I pkg upgrade'd everything, performed several reboots, and everything seemed to be wonderful.
... until today when I discovered that my teamspeak server was quietly crashing, without a core dump or error message.

I have been using beadm for years, so the solution was obvious: roll back, then look for answers.
So, a quick beadm activate 13.2-RELEASE-p9_xxx followed by a reboot and then...
... crickets :-(

Unfortunately I did not see the notice about not running zpool upgrade on your boot pool, so I rammed straight into the "ZFS: unsupported feature: com.klarasystems:vdev_zaps_v2" problem that has been reported here so often.
In hindsight this is obvious, because the 13.2 boot loader doesn't know about the newly enabled ZFS pool features.

I managed to repair the boot partition enough to boot again, mostly by following the tips already posted here.
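
For anyone landing here with the same loader error, the gist of those tips is to get the 14.0 loader onto the EFI system partition (and, for BIOS setups, refresh gptzfsboot). A rough sketch, assuming the ESP is ada0p1 and the default efi/freebsd/loader.efi layout; check your own device names first:
Code:
# mount_msdosfs /dev/ada0p1 /mnt                     # mount the EFI system partition
# cp /boot/loader.efi /mnt/efi/freebsd/loader.efi    # install the 14.0 loader
# cp /boot/loader.efi /mnt/efi/boot/bootx64.efi      # also the fallback path, if that's what your firmware boots
# umount /mnt
# gpart bootcode -p /boot/gptzfsboot -i 2 ada0       # only needed for legacy/BIOS boot via the freebsd-boot partition
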
However...

My problem:

zpool is no longer able to import my data pool.
This may be unrelated to booting 13.2 in a 14.0 environment, but it's the pool I use daily and it was working just fine until I unwittingly rolled back to 13.2.

My machine is an old Intel NUC, with a 60GB internal SSD (zroot) and two external USB 3.0 SSDs for data (zreserve).

The data pool (zreserve) is striped over the two external SSDs, completely separate from the internal boot SSD, but they were still attached while I went through the steps above. (Yes, striping is fragile, and yes, I have backups.)
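
(In ZFS terms that just means two top-level vdevs with no redundancy, roughly the equivalent of having created the pool like this, using the GPT labels that show up below:)
Code:
# zpool create zreserve gpt/zdomus-4tb-13t gpt/zdomus-4tb-19v   # plain stripe: lose either disk and the whole pool is gone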

The symptoms:

zpool doesn't see a problem, and says that I can import my pool:
Code:
# zpool import
   pool: zreserve
     id: 15131527649563311145
  state: ONLINE
 action: The pool can be imported using its name or numeric identifier.
 config:

    zreserve              ONLINE
      gpt/zdomus-4tb-13t  ONLINE
      gpt/zdomus-4tb-19v  ONLINE

... and then fails when I actually try to:
Code:
# zpool import zreserve
cannot import 'zreserve': one or more devices is currently unavailable

gpart shows the internal and external disks, all OK so far:
Code:
# gpart show -p
=>       40  117231328    ada0  GPT  (56G)
         40     532480  ada0p1  efi  (260M)
     532520       1024  ada0p2  freebsd-boot  (512K)
     533544        984          - free -  (492K)
     534528    4194304  ada0p3  freebsd-swap  (2.0G)
    4728832  112500736  ada0p4  freebsd-zfs  (54G)
  117229568       1800          - free -  (900K)

=>        40  7814037088    da0  GPT  (3.6T) [CORRUPT]
          40        4056         - free -  (2.0M)
        4096  7759462400  da0p1  freebsd-zfs  (3.6T)
  7759466496    54570632         - free -  (26G)

=>        40  7814037088    da1  GPT  (3.6T) [CORRUPT]
          40  7814037088  da1p1  freebsd-zfs  (3.6T)

The "CORRUPT" status has been following me for years. It always happens on this machine, but
a) zfs always worked fine, even if gpart said the partition tables were incomplete and
b) gpart recover <device> always fixed it.
Code:
# gpart recover da0
gpart: Input/output error
# gpart recover da1
da1 recovered
oops :-(

Even so, gpart status doesn't see any problems anymore:
Code:
# gpart status
  Name  Status  Components
ada0p1      OK  ada0
ada0p2      OK  ada0
ada0p3      OK  ada0
ada0p4      OK  ada0
 da0p1      OK  da0
 da1p1      OK  da1

and glabel also looks normal (it has always shown N/A for the status):
Code:
# glabel status
              Name  Status  Components
      gpt/efiboot0     N/A  ada0p1
      gpt/gptboot0     N/A  ada0p2
         gpt/swap0     N/A  ada0p3
gpt/zdomus-4tb-13t     N/A  da0p1
gpt/zdomus-4tb-19v     N/A  da1p1

but zdb has got me worried:
Code:
# zdb -l da0
failed to unpack label 0
failed to unpack label 1
failed to unpack label 2      <<---- shouldn't there be a ZFS label here?
failed to unpack label 3
# zdb -l da1
failed to unpack label 0
failed to unpack label 1
------------------------------------
LABEL 2 (Bad label cksum)
------------------------------------
    version: 5000
    state: 3
    guid: 4620694731288109924
    labels = 2
failed to unpack label 3
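
(Thinking about it, the pool members are the GPT labels, i.e. the partitions rather than the whole disks, so pointing zdb at those would probably be the fairer test:)
Code:
# zdb -l /dev/gpt/zdomus-4tb-13t
# zdb -l /dev/gpt/zdomus-4tb-19v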

and yet, it quite happily seems to be able to access the whole pool:
Code:
# zdb -d zreserve | grep -v 'objects$'   # grep out the full list of datasets - they all look OK to me :-)
Verified large_blocks feature refcount of 0 is correct
Verified large_dnode feature refcount of 0 is correct
Verified sha512 feature refcount of 0 is correct
Verified skein feature refcount of 0 is correct
Verified edonr feature refcount of 0 is correct
Verified userobj_accounting feature refcount of 649 is correct
Verified encryption feature refcount of 0 is correct
Verified project_quota feature refcount of 649 is correct
Verified redaction_bookmarks feature refcount of 0 is correct
Verified redacted_datasets feature refcount of 0 is correct
Verified bookmark_written feature refcount of 0 is correct
Verified livelist feature refcount of 0 is correct
Verified zstd_compress feature refcount of 0 is correct
Verified zilsaxattr feature refcount of 6 is correct
Verified blake3 feature refcount of 0 is correct
Verified device_removal feature refcount of 0 is correct
Verified indirect_refcount feature refcount of 0 is correct

but these messages are new:
Code:
# grep -w da0 /var/log/messages
Jan 25 23:50:24 domus kernel: (da0:umass-sim0:0:0:0): WRITE(10). CDB: 2a 00 00 00 12 10 00 00 10 00
Jan 25 23:50:24 domus kernel: (da0:umass-sim0:0:0:0): CAM status: SCSI Status Error
Jan 25 23:50:24 domus kernel: (da0:umass-sim0:0:0:0): SCSI status: Check Condition
Jan 25 23:50:24 domus kernel: (da0:umass-sim0:0:0:0): SCSI sense: ABORTED COMMAND asc:0,0 (No additional sense information)
Jan 25 23:50:24 domus kernel: (da0:umass-sim0:0:0:0): Error 5, Retries exhausted

So, does this mean my SSD just died?
Are there any other things I could try to recover the pool?
zdb -c says it'll take 10-20 hours to complete... would it be worth running anyway, and maybe even zdb -b to get one last picture of the pool before it dies completely?
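
One more idea I'm toying with (tell me if it's a bad one): force a read-only import and copy the important data off while the disk still responds, roughly like this (paths are just placeholders):
Code:
# zpool import -o readonly=on -f -R /mnt zreserve
# cp -Rp /mnt/some/dataset /another/disk/       # or zfs send an existing snapshot somewhere safe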

If you're still reading... thanks, I'm impressed and grateful for any tips!

🙇‍♂️
 
... and a few hours later...

zdb -c seems to be happy at least:
Code:
# zdb -c zreserve

Traversing all blocks to verify metadata checksums and verify nothing leaked ...

loading concrete vdev 1, metaslab 231 of 232 ...
4.77T completed (3127MB/s) estimated time remaining: 0hr 00min 00sec
        No leaks (block sum matches space maps exactly)

        bp count:              53976456
        ganged count:                 0
        bp logical:       6189612133888      avg: 114672
        bp physical:      5221710064640      avg:  96740     compression:   1.19
        bp allocated:     5240758358016      avg:  97093     compression:   1.18
        bp deduped:                   0    ref>1:      0   deduplication:   1.00
        bp cloned:                    0    count:      0
        Normal class:     5240736829440     used: 66.17%
        Embedded log class         454656     used:  0.00%

        additional, non-pointer bps of type 0:    1052806
        Dittoed blocks on same vdev: 467155
        Dittoed blocks in same metaslab: 1
 
So, does this mean my SSD just died?

Looks more like a bad cable to me. Assuming you have a cable (SATA, not M.2).
 
Maybe it is an issue with USB stability. Do you have another system where you could try plugging them directly into a SATA port?
 
  • upgraded to 14.0
  • and you boot from ZFS
  • and you ran zpool upgrade on your root ZFS pool
  • and you beadm activate a pre-14.0 snapshot
This is a potential pain point regardless of versions (it's not limited to 13.x and 14.x). That's why a lot of folks recommend never running zpool upgrade if there is any possibility of needing an older boot environment.
Newer kernels can typically understand previous versions (standard backwards compatibility), but older kernels will almost never understand newer ones.
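
FWIW, running zpool upgrade with no arguments only lists pools that still have features left to enable, it changes nothing; and if you want newer features while keeping an older loader bootable, the pool's compatibility property can pin the feature set. A sketch, assuming the stock files under /usr/share/zfs/compatibility.d:
Code:
# zpool upgrade                                       # list only; does not modify any pool
# zpool get compatibility zroot                       # show the current feature-set restriction, if any
# zpool set compatibility=openzfs-2.1-freebsd zroot   # limit what a future 'zpool upgrade' may enable (won't undo features already active)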
 
This is a potential pain point regardless of versions (it's not limited to 13.x and 14.x). That's why a lot of folks recommend never running zpool upgrade if there is any possibility of needing an older boot environment.
Newer kernels can typically understand previous versions (standard backwards compatibility), but older kernels will almost never understand newer ones.
Yep, like I said, obvious in hindsight.
Just like "fire is hot", sometimes it just needs to be said explicitly ;-)
 
Thanks for the "check the cables" tips...

The disks are both external USB, and no, I don't have any SATA ports where I could attach them directly
(I only have NUCs, Raspberry Pis and laptops), but I do have external USB cases from different manufacturers.

So, I stuck a green sticker on the "-19v" (good SSD) and a red sticker on the "-13t" (bad SSD), then switched them over in every combination of cable and case that I could think of, but the error stubbornly followed the -13t disk.
(The IDs are just the last part of their serial numbers, by the way.)

So it definitely seems to be related to the SSD itself.

I also just had a look with smartctl:

"-13t"'s SMART report:
Code:
# smartctl -a /dev/da0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 4TB
...
Local Time is:    Fri Jan 26 16:24:35 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Read SMART Data failed: scsi error aborted command

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: scsi error aborted command
SMART overall-health self-assessment test result: UNKNOWN!
SMART Status, Attributes and Thresholds cannot be read.

Read SMART Log Directory failed: scsi error aborted command
Read SMART Error Log failed: scsi error aborted command
Read SMART Self-test Log failed: scsi error aborted command

vs "-19v"'s
Code:
# smartctl -a /dev/da1
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p3 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 4TB
...
Local Time is:    Fri Jan 26 16:24:35 2024 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

...

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   096   096   010    Pre-fail  Always       -       167
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       23801
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       65
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       5
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   096   096   010    Pre-fail  Always       -       167
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   096   096   010    Pre-fail  Always       -       167
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   070   032   000    Old_age   Always       -       30
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   100   100   000    Old_age   Always       -       0
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       36
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       26169839826

SMART Error Log Version: 1
No Errors Logged
...

I think it's time to call the manufacturer :-(
 
Looking at the -19v logs, it seems they have about 2.7 years of 24/7 usage (23801 power-on hours).
So you should be able to claim the warranty.

You could try contacting the reseller first, I guess?
 
Looking at the -19v logs, it seems they have about 2.7 years of 24/7 usage (23801 power-on hours).
So you should be able to claim the warranty.

You could try contacting the reseller first, I guess?
Pretty close; I'd say 3.5 years of on/off use, but yep, that's exactly my plan.
Thanks for taking the time!
 