zpool unavailable after 9.0 to 9.1 upgrade

This may or may not be similar to http://forums.freebsd.org/showthread.php?t=36571, so please bear with me.

I ran the upgrade from 9.0-RELEASE to 9.1-RELEASE and discovered that my two previously functional zpools are now flagged as "unavailable". For the sake of caution, I have not yet attempted any form of recovery; I thought I'd ask the experts about this first. It should surprise no one that I'm extremely interested in recovering these volumes rather than losing them.

I have verified that there are no hardware issues and the host system appears to be recognizing all the attached hardware. The zpools are for storage only-- I'm not booting from them and there's nothing critical to the correct operation of the host system on them.

'dmesg' shows me pretty much what I was expecting to see:

Code:
mpt0: <Dual LSILogic FC929X 2Gb/s FC PCI-X Adapter>
mpt0: MPI Version=1.3.2.0

mpt1: <Dual LSILogic FC929X 2Gb/s FC PCI-X Adapter>
mpt1: MPI Version=1.3.2.0

da0 at mpt0 bus 0 scbus0 target 0 lun 0
da0: <APPLE Xserve RAID 1.51> Fixed Direct Access SCSI-5 device
da0: 100.000MB/s transfers
da0: Command Queueing enabled
da0: 4292376MB (8790786048 512 byte sectors: 255H 63S/T 547201C)

da1 at mpt1 bus 0 scbus1 target 0 lun 0
da1: <APPLE Xserve RAID 1.51> Fixed Direct Access SCSI-5 device
da1: 100.000MB/s transfers
da1: Command Queueing enabled
da1: 2861608MB (5860573184 512 byte sectors: 255H 63S/T 364803C)

But I did find something new after the 9.1 upgrade (it's repeated a few times before it "gives up"). I'm not entirely sure what it's trying to tell me:

Code:
(da0:mpt0:0:0:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0
(da0:mpt0:0:0:0): CAM status: CCB request terminated by the host
(da0:mpt0:0:0:0): Retrying command

Running 'zpool status -x' gives me this:

Code:
  pool: xraid0
 state: UNAVAIL
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-3C
  scan: none requested
config:

        NAME                    STATE     READ WRITE CKSUM
        xraid0                  UNAVAIL      0     0     0
          18446743527590293691  UNAVAIL      0     0     0  was /dev/da0

Does anyone have any suggestions for safely recovering from this? I've been reading other threads, and I'm guessing that running a 'zpool import' might be the right way to handle this, but I really hate guessing and I'd love an actual explanation of what went wrong. If it's something I did, I'm keenly interested in not repeating my mistakes.
 
If it's something I did
Not likely, IMHO; it looks like a hardware error (guessing in the absence of more info).

Normally, if it were just a pool error, you could recover by either:
* scrubbing the pool
# zpool scrub <pool>
* or importing the pool as read-only
# zpool import -o readonly=on <pool>

But this message kills it:
pool: xraid0
state: UNAVAIL
status: One or more devices could not be opened. There are insufficient
replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.

Check the connector cables and power to the HDD. The fact that the BIOS detects the HDD is only the first step - it does not necessarily mean that the HDD is actually online, no matter what filesystem is on it.
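
If you want to gather more information before touching the pool, these should all be safe, read-only checks (assuming da0 is still the device the pool lived on, as your zpool status suggests):
* see what CAM actually detects
# camcontrol devlist
* dump the ZFS labels from the device, if any survive
# zdb -l /dev/da0
* scan for importable pools without importing anything
# zpool import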
 
This exact same thing happened to me too. My setup is 6 drives - 2 mirrored with gmirror and 4 in a RAIDZ pool. My guess is that a driver update might have caused this; for example, dmesg now says this:

GEOM_RAID: Promise: Array Promise created.
GEOM_RAID: Promise: Disk ada2 state changed from NONE to SPARE.
GEOM_RAID: Promise: Disk ada3 state changed from NONE to SPARE.
GEOM_RAID: Promise: Disk ada4 state changed from NONE to SPARE.
GEOM_RAID: Promise: Disk ada5 state changed from NONE to SPARE.

And ada2-5 are the missing drives from the pool.

Haven't found a solution yet, will report back when I do...
 
I managed to get mine back online; this thread provided the key. I guess my drives had some Promise RAID metadata on them that was previously ignored but got picked up by GEOM after the update, since geom_raid is now built into the GENERIC kernel.
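
If you want to confirm you are hitting the same thing, graid should show whether geom_raid has claimed your disks (this is just how I checked; see graid(8) for details):

# graid status
# graid list

If a Promise array containing your pool's member disks shows up there, it is the same issue.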

Here's what I did:

# svn checkout svn://svn.freebsd.org/base/release/9.1.0/ /usr/src
# cp /usr/src/sys/amd64/conf/GENERIC /usr/src/sys/amd64/conf/MYCONF
# nano /usr/src/sys/amd64/conf/MYCONF
(comment out the "options GEOM_RAID" line and save)
# cd /usr/src; make buildkernel KERNCONF=MYCONF
# make installkernel KERNCONF=MYCONF
# shutdown -r now
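
(To verify that the custom kernel is actually the one running, uname -i should print the kernel ident, i.e. MYCONF in this example.)

# uname -i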

After the reboot the pool was back online and I was pleasantly surprised that my boot volume mirror still worked.

The next step would be to clear the metadata with graid(8) so I can go back to using GENERIC, but I will only try that after I've taken a complete backup of the data. :)
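
For the record, I expect the metadata clearing to look something like this (untested on my side so far, so double-check graid(8) and use the array name that graid status actually reports rather than my placeholder):

# graid status
# graid delete <arrayname>

graid(8) says that deleting the last volume also erases the on-disk metadata, which should keep GENERIC's geom_raid from assembling the array again.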


HTH,
-flip
 