Solved Unable to import pool after firmware upgrade & BIOS reset to defaults

I'm running a FreeBSD 11.1-RELEASE-p9 NAS with a ZFS pool named sat, comprised of two two-way mirrors. The hardware is an HP MicroServer Gen8, with the SATA controller set to AHCI mode in the BIOS.

The mirror-0 disks are ada1 and ada3; the mirror-1 disks are ada0 and ada2.
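
For reference, the layout corresponds to something like the following (a sketch of an equivalent zpool create, not the command the pool was actually created with):
Code:
# Sketch only - equivalent vdev layout, not my original create command
zpool create sat mirror ada1 ada3 mirror ada0 ada2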

Yesterday I upgraded the iLO (HP's name for IPMI) firmware, which in turn included updates for some additional firmware in the system.

After a reboot, I was no longer able to access the pool, although all four drives were still assigned to the same device names, i.e. /dev/ada1 was still /dev/ada1 as far as I could tell.

Here's what I got from zpool status (sorry for the truncated output, I no longer have the complete message):
Code:
NAME                      STATE     READ WRITE CKSUM
sat                       FAULTED      0     0     0
  mirror-0                ONLINE       0     0     0
    1151882951            UNAVAIL      0     0     0  was /dev/ada1
    8929664670538268915   UNAVAIL      0     0     0  was /dev/ada3
  mirror-1                ONLINE       0     0     0
    10322645952576320949  UNAVAIL      0     0     0  was /dev/ada0
    18320716168250509702  UNAVAIL      0     0     0  was /dev/ada2

I immediately suspected that the new firmware was making the integrated RAID controller behave differently, so I went ahead and reset the BIOS to system defaults (probably not the smartest idea to introduce more changes at this point, but I did it anyway...).

After the BIOS reset, my pool still can't be imported, and zpool import -f reports corrupted data:
Code:
sudo zpool import -f
   pool: sat
     id: 1167020886
  state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
comment: inherit
   see: http://illumos.org/msg/ZFS-8000-5E
config:

        sat                       FAULTED  corrupted data
          mirror-0                ONLINE
            1151882951            UNAVAIL  corrupted data
            8929664670538268915   UNAVAIL  corrupted data
          mirror-1                ONLINE
            10322645952576320949  UNAVAIL  corrupted data
            18320716168250509702  UNAVAIL  corrupted data

The zdb -l output for all four devices is linked here:

* ada0 - http://ix.io/17U1
* ada1 - http://ix.io/17U2
* ada2 - http://ix.io/17U3
* ada3 - http://ix.io/17U4

So, what options do I have here? Do I understand correctly that this is a device metadata (or whatever the correct term is) issue, i.e. ZFS doesn't recognise what the storage controller is presenting?

Right now I'm planning to try the following two things:

1. Re-check my BIOS settings to make sure everything is sane, RAID is disabled, and the controller is running in the simplest HBA-like mode I can get - basically try to return to the configuration I had before applying the firmware update and performing the BIOS reset, aka the "last known good configuration".
2. Try importing the pool on a different physical machine - I need to figure out where to find one first :) Do you think this is worthwhile?

Thanks for reading and I'd appreciate any additional suggestions you folks might have.

----

Update: I don't have a ZFS cache file; neither /etc/zfs/zpool.cache nor /boot/zfs/zpool.cache exists.
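
A couple of other import variants I still want to try, in case the cache-file situation matters (sketches, untested on this box so far):
Code:
# Point zpool import at an explicit device directory to search:
sudo zpool import -d /dev -f sat
# Or attempt a read-only import, so nothing gets written to the pool:
sudo zpool import -o readonly=on -f sat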

I also tried the -F (recovery mode) option to zpool import, with -n for a dry run, but it doesn't help:
Code:
sudo zpool import -nF
   pool: sat
     id: 1167020886
  state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
comment: inherit
   see: http://illumos.org/msg/ZFS-8000-5E
config:

        sat                       FAULTED  corrupted data
          mirror-0                ONLINE
            1151882951            UNAVAIL  corrupted data
            8929664670538268915   UNAVAIL  corrupted data
          mirror-1                ONLINE
            10322645952576320949  UNAVAIL  corrupted data
            18320716168250509702  UNAVAIL  corrupted data
 
Did you perhaps use encryption on the RAID card? I can imagine the key being lost after a firmware update (factory reset); it may need to be entered correctly again.
 
For anyone following this epic at home: so far I've not had any luck with option #1 (getting the BIOS & firmware settings back to what they were before) - after lots of BIOS config changes and reboots, the pool is still unimportable.

No luck with option #2 either - nobody within reasonable distance of where I live has a desktop computer. I'm not that surprised by this, but I still have a few people to reach who live further away.

In other news, I'm of course searching the deep web for ideas as well and came across this excellent video, appropriately titled "Adventure in ZFS Data Recovery" and have been following along.

One of the commands demonstrated shows how to dump the blocks associated with files in a given ZFS pool. And I'm amazed to see that it's actually able to find the files still present in my pool (of course they're there - it's not like I'm dealing with a physical device or even filesystem-level failure here):

Code:
zdb -AAAAA -ddddd -e sat/pictures
<snip>
    Object  lvl   iblk   dblk  dsize  lsize   %full  type
       187    2    16K   128K  2.38M  2.38M  100.00  ZFS plain file
                                        168   bonus  System attributes
        dnode flags: USED_BYTES USERUSED_ACCOUNTED
        dnode maxblkid: 18
        path    /2007/abc-2007/still/dscn1183.jpg
        uid     5006       
        gid     5006
        atime   Wed Dec 11 07:32:21 2013                           
        mtime   Thu Aug  9 22:06:48 2007
        ctime   Wed Dec 11 07:32:21 2013                                                                                                                                                                                               
        crtime  Wed Dec 11 07:32:21 2013
        gen     14217
        mode    100664
        size    2434144
        parent  21
        links   1
        pflags  40800000004
Indirect blocks:
               0 L1  0:9e126b1c00:600 4000L/600P F=19 B=14217/14217
               0  L0 0:9e12451c00:20000 20000L/20000P F=1 B=14217/14217
           20000  L0 0:9e12431c00:20000 20000L/20000P F=1 B=14217/14217
           40000  L0 0:9e12471c00:20000 20000L/20000P F=1 B=14217/14217
           60000  L0 0:9e124d1c00:20000 20000L/20000P F=1 B=14217/14217
           80000  L0 0:9e12491c00:20000 20000L/20000P F=1 B=14217/14217
           a0000  L0 0:9e124b1c00:20000 20000L/20000P F=1 B=14217/14217
<snip>

And he also shows a Perl(!) script that can actually re-assemble a file from the blocks on disk.
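
As far as I can tell, the script just walks the indirect block list and dumps each data block by its DVA. A rough manual equivalent with zdb -R, assuming I'm reading zdb(8) correctly (the :r flag should print the raw block to stdout), would look like this for the file above:
Code:
# Rough sketch only - my reconstruction of the idea, not the script from the video.
# Each L0 line is a DVA (vdev:offset:size); 20000L/20000P means no compression,
# so the raw blocks can simply be dumped and concatenated in file-offset order.
zdb -e -R sat 0:9e12451c00:20000:r >  dscn1183.jpg   # block at file offset 0
zdb -e -R sat 0:9e12431c00:20000:r >> dscn1183.jpg   # block at file offset 0x20000
# ...repeat for the remaining L0 entries, then trim to the real file size:
truncate -s 2434144 dscn1183.jpg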

In other words, while I've made no progress getting the entire pool back online, there's a chance I'll be able to recover the data off of it. I'll need to find ~6TB worth of storage first, but it could be worse.

Also - and this is stated multiple times in the video, but it's worth repeating - most (all?) of the commands shown are at the very least "experimental", and you should be prepared for data loss if you're venturing into that territory.
 
Update: I don't have a ZFS cache file; neither /etc/zfs/zpool.cache nor /boot/zfs/zpool.cache exists.
On FreeBSD the cache file is automatically generated in /boot/zfs/zpool.cache.

I suppose I am a little too late right now but I can't help but wonder... If you boot using a rescue CD (anything but the main OS itself) then what entries do you see in /dev/gptid? And/or: /dev/gpt? Of course assuming a GPT partitioning scheme here.

Also: in the boot menu what does lsdev show you?

Can you give us the output of gpart list perhaps? Or at least gpart show?

I can't help but wonder if this could be caused by some IDs which went haywire, but it's a bit of a stretch. I've only seen this happen once, and to be perfectly honest I'm not even convinced that my conclusions were correct.

If you don't mind me asking (feel free to ignore): why did you apply that update in the first place?
 
Thanks for your reply, all good questions.

On FreeBSD the cache file is automatically generated in /boot/zfs/zpool.cache.

I don't see /boot/zfs/zpool.cache anywhere, and I'm not entirely sure how it gets created automatically. I'll look into this more when I'm done with the recovery; maybe it'll be useful next time.
Code:
ls -lh /boot/zfs/zpool.cache
ls: /boot/zfs/zpool.cache: No such file or directory
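
From what I can tell reading zpool(8), the cache file should get (re)created once a pool is imported while its cachefile property points at that path, so something along these lines might regenerate it later (a sketch for once the pool imports again, untested here):
Code:
# Untested sketch: regenerate the cache file after a successful import
zpool set cachefile=/boot/zfs/zpool.cache sat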

If you boot using a rescue CD (anything but the main OS itself) then what entries do you see in /dev/gptid? And/or: /dev/gpt? Of course assuming a GPT partitioning scheme here.

I'm using whole disks with ZFS here, so there's no GPT on any of the drives.

Also: in the boot menu what does lsdev show you?

I've not tried this one, will check later and report back.

Can you give us the output of gpart list perhaps? Or at least gpart show?

These also don't reveal anything related to the ZFS pool, as there are no GPT partitions on the disks in question.

I can't help but wonder if this could be caused by some IDs which went haywire, but it's a bit of a stretch. I've only seen this happen once, and to be perfectly honest I'm not even convinced that my conclusions were correct.

I'm absolutely sure of this. And I just tested that I'm able to recover at least a single file using the script shown here, so the data (or at least some of it) is intact:
Code:
sudo ./extract-zfs-file 187 dscn1183.jpg
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror
Found vdev type: mirror

file dscn1183.jpg
dscn1183.jpg: JPEG image data, Exif standard: [TIFF image data, little-endian, direntries=11, description=          , manufacturer=NIKON, model=COOLPIX S6, orientation=upper-left, xresolution=216, yresolution=224, resolutionunit=2, software=COOLPIX S6V1.0, datetime=2007:08:09 15:06:49], baseline, precision 8, 2816x2112, frames 3

If you don't mind me asking (feel free to ignore): why did you apply that update in the first place?

These HP Microservers have a few quirks; one of them is that you can't boot from the USB 3.0 ports, only from the USB 2.0 ports. USB 2.0 performance is of course poor in comparison to USB 3.0, so I was hoping the latest firmware & BIOS (which I didn't update in the end) would add support for booting from USB 3.0. So much for that, but I'm now considering finding new, less "smart" and more basic hardware to replace this server with.
 
Update on the progress.

I managed to borrow a Dell machine from a friend, moved all four of my drives to it, booted from a FreeBSD 11.1-RELEASE USB image (LiveCD), and was able to import the pool as easily as zpool import -f sat.

Listed the pool status and datasets - everything was looking great. I exported the pool with zpool export, moved the disks back to my HP Microserver and, guess what, the pool still couldn't be imported:
Code:
sudo zpool import -nF
   pool: sat
     id: 1167020886
  state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
comment: inherit
   see: http://illumos.org/msg/ZFS-8000-5E
config:

        sat                       FAULTED  corrupted data
          mirror-0                ONLINE
            1151882951            UNAVAIL  corrupted data
            8929664670538268915   UNAVAIL  corrupted data
          mirror-1                ONLINE
            10322645952576320949  UNAVAIL  corrupted data
            18320716168250509702  UNAVAIL  corrupted data

This is nuts - what on earth is this HP Smart Array RAID card doing?! I'll probably never find the answer to that, but I either need a new server or at the very least a reliable 4-port SATA HBA I can install in the existing server.
 
If you boot off a LiveCD on the HP box, what does it show under /dev for block devices? From the sound of it, it's no longer exposing them as adaX devices, which usually points toward the configuration of the RAID controller having changed.
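
For example, something along these lines (standard tools, run from the LiveCD) should show what the controller is actually presenting:
Code:
camcontrol devlist     # CAM-attached devices (ada/da disks and their controllers)
geom disk list         # GEOM view of the disk providers
sysctl kern.disks      # plain list of disk device names
ls /dev/ada* /dev/da*  # raw device nodes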

You'll want to go through the RAID controller configuration utility, look for a JBOD setting, and enable it. Most likely this was disabled after the firmware upgrade (possibly removed?).

Or, pick up an LSI/Avago/Broadcom HBA and use that instead of the RAID controller.
 
Late reply, but to conclude: in the end I "solved" this by building a new NAS from scratch with off-the-shelf components and no RAID card, and was able to use the ZFS pool without any issues.

I suspect I could've gotten the HP RAID controller working eventually, but I simply lost trust in it and didn't want to rely on it for my data going forward.

Thanks for all the helpful comments. I learned a bunch about ZFS internals as a result of this and managed to avoid data loss, so it was almost worth it (minus the initial stress of "is all my data gone?").
 