Solved Storage reboots after trying to import pool

gkontos · May 14, 2024

Hi everyone,

We have an old supermicro server with 11 WD-RE 4TB disks in a raidz-2 setup running fine for 6 years. I had to reboot it today and the system reboots when trying to import the pool.

It is a rather old set up running 11.2-RELEASE-p8. I know it is EOL but I am asking in case anyone has any ideas.

The pool consists of 11 disks encrypted with GELI. I can attach the drives fine with no issues and I can import the pool only read-only, otherwise the system will reboot.

I would appreciate any troubleshooting here! We are decommissioning those old storages but until we do this one is still being used!

Thanks

George

mer · May 14, 2024

Is the boot media different than the raidz-2 setup?
You say you can import read-only?
can one do a zpool scrub if a pool is imported read only? asking because I don't know and it would be interesting to see if zpool scrub actually does anything.
Since you can import read only, it does give you a chance to recover the data. yes it will be painful, but painful recovery is better than no recovery (my opinion)

gkontos · May 14, 2024

The system boots from a UFS mem stick. I haven't been able to scrub in a read-only mode. The problem is that this storage is still being used so I would prefer to find a way to make it writable again

mer · May 14, 2024

If you import in read only, what does "zpool status -x myzpoolnamehere" show? Just wondering if status shows it's ok with no errors, check system logs for any kind of errors on the devices. Maybe don't import but run things like smartctl or other checks for physical errors.

gkontos · May 14, 2024

 

# zpool status -x datastore

pool 'datastore' is healthy

 

# zpool status  datastore

  pool: datastore

 state: ONLINE

  scan: resilvered 1.99T in 168h45m with 0 errors on Sat Oct 15 08:48:54 2022

config

    NAME          STATE     READ WRITE CKSUM

    datastore     ONLINE       0     0     0

      raidz2-0    ONLINE       0     0     0

        da0.eli   ONLINE       0     0     0

        da1.eli   ONLINE       0     0     0

        da2.eli   ONLINE       0     0     0

        da3.eli   ONLINE       0     0     0

        da4.eli   ONLINE       0     0     0

        da5.eli   ONLINE       0     0     0

        da6.eli   ONLINE       0     0     0

        da7.eli   ONLINE       0     0     0

        da8.eli   ONLINE       0     0     0

        da9.eli   ONLINE       0     0     0

        da10.eli  ONLINE       0     0     0

        da11.eli  ONLINE       0     0     0

errors: No known data errors

mer · May 14, 2024

Hmm. So that looks like datastore is apparently healthy. raidz2: is that concatinating all the devices? Asking because I'm being lazy about what raidz2 means and don't want to look it up. That also shows da0 through da11 which is 12 devices.
If that is concatinating the devices we can't try and remove one, run in degraded mode and muck around. I do like the "resilvered 1.99T in 168h45m".
Nothing in the logs when trying to import r/w?

Cath O'Deray · May 14, 2024

gkontos said:
The system boots from a UFS mem stick. …

Also check things at that level.

Single user mode, then fsck to check but not (yet) repair; let's see whether the file system is OK.

ralphbsz · May 14, 2024

When it reboots, do you get kernel messages on the console? Does dmesg or /var/log/messages have any record of WHY it reboots? Can we distinguish whether it is a software problem (in the kernel? in ZFS?) versus a hardware problem (bad cables, power supply ill) combined with bad error handling.

For mer: Raid-Z2 means that there are two disks' worth of redundancy, which are stored using an encoding (typically parity based). So any two disks can fail (partially or completely), and the pool (array) will not lose data. Which is the same as saying: The OP has 11 disks, and gets 9 disks' worth of capacity. In ZFS, the redundant data is spread over all 11 disks pretty uniformly, somewhat similar to traditional RAID-6.

Given that the machine uses a memory stick as its root file system, theoretically no failure of the 11 ZFS data disks should cause a kernel panic and reboot. In practice, error handling of hardware errors is less than perfect, and old software versions (11.2 is old) have more bugs. Personally, I don't suspect it is a disk error (so smartctl is probably not a good investment of effort), since the ZFS pool comes online just fine. I suspect it is a hardware problem causing either the kernel or the whole CPU to have big trouble.

mer · May 14, 2024

thanks ralphbsz so in theory, one could create a new root/boot stick on say 13 or 14 latest and see if importing the pool acts differently

sko · May 14, 2024

mer said:
thanks [FONT=monospace]ralphbsz[/FONT] so in theory, one could create a new root/boot stick on say 13 or 14 latest and see if importing the pool acts differently

Thats exactly what I'd suggest. I remember FreeBSD 11 and sometimes also 12 could be a bit delicate if it came to things like borked metadata. I experienced this twice with pools where a dying HBA corrupted some metadata. With one pool (back with FreeBSD 11.X) I always got immediate kernel panics when accessing the affected dataset while on the other pool (12.4) it was only one specific snapshot that locked up the whole OS when touched.
I was able to send|recv all remaining deltas of the second pool by using a 13.1 memdisk image. OpenZFS 2 seems to be a bit more resilient in those cases as I was even able to delete the snapshot which always caused a crash on 12.4 and even fully recover the pool by letting ZFS just finish whatever it did while sending the whole system into a half-catatonic state. *All* ZFS commands just stalled in that time and all disk I/O was painfully slow, but it eventually recovered after ~10 hours. (Of course this was only out of interest - the data was already fully backed up and restored on a replacement system; the limping pool and disks were decomissioned)

So TL;DR:
Put a recent memdisk image on a new/'known good' flashdisk, import that pool readonly and backup all important data.

Then you could try to let ZFS recover by scrubbing the pool and also check SMART data of all disks. And don't trust that bogus "health status" which always reads "OK" until the disk is finally dead - look at the values for things like reallocated sectors, pending sectors etc.
But TBH I wouldn't put that ancient pool in any production use. Get new disks, restore from backups (or send|recv off that pool if you are confident it is healthy) and move on.
I suspect those disks are older than dirt, so don't try to replace a single disk - this would take ages with spining rust in such a single-vdev raidz configuration and the high load will very likely cause other disks to fail. For any replacement you should maybe consider using mirror vdevs - those are much more flexible and easier/faster to resilver and give you much more options to recover/repair a pool in case of failures e.g. by removing a whole vdev after adding a new one.

gkontos · May 14, 2024

Thanks for your suggestions, nothing suspicious in the logs but like you guys mentioned this is a very old version with bugs. Also, the hardware is rather old...

I will try to import the pool on a new installation with a recent FreeBSD.

gkontos · May 15, 2024

Just an update in case it helps someone else. I upgraded the system to the latest patchset and then to 12.4 and 13.2
After that, the pool was imported successfully!

mer · May 15, 2024

gkontos said:
Just an update in case it helps someone else. I upgraded the system to the latest patchset and then to 12.4 and 13.2
After that, the pool was imported successfully!

Awesome; thanks for the update.

mer · May 15, 2024

Forgot to add; after importing onto 13.2, did you start a scrub? From your zpool status I see that would take about a week, but it may be worth it just to see if anything pops up

gkontos · May 15, 2024

I have not yet, yes it takes about a week or more to resilver the pool after replacing a faulty drive. The scrub that you saw was the result of a drive replacement back in 2022. That is the reason we use raidz-2. Reslivering takes a long time and the pool is still ok if lose another drive.

Cath O'Deray · May 15, 2024

gkontos said:
… upgraded the system to the latest patchset and then to 12.4 and 13.2 …

Using the same boot drive? (I assume so.)

I do recommend a file system check, if you have not already done so.

Thanks

mer · May 15, 2024

Cath O'Deray said:
I do recommend a file system check, if you have not already done so.

OP posted zpool status showing about 7 days to resilver. scrub would likely be on that same order of magnitude.

Cath O'Deray · May 15, 2024

mer thanks, however the repeat recommendation relates to the boot drive,

gkontos said:
… UFS …