ZFS 4-disk RAIDZ1 showing as "UNAVAIL" - one disk missing, one disk OFFLINE

I had just started the process of expanding my RAIDZ1 by OFFLINEing the first disk, shutting down, replacing the disk with a larger one, then running "zpool replace omnius media-0E2FD10D-C34A-D544-A04F-811FDF4714B9". I let the system take the day to resilver, everything was OK, and I probably wrote some data to the RAID. The next day I took the next drive offline, shut down, and replaced it with a larger disk...

But when it came back up, the pool state was UNAVAIL. The first disk, the one I had replaced the day before, is now showing as UNAVAIL "cannot open". The second disk is showing as OFFLINE, because I took it offline to replace it. I've physically replaced the original second disk, but the pool is still showing as UNAVAIL. I can't "online" the second disk, because it seems the pool has to be imported in order to do that, and I can't get the pool online because one disk is now bad and the second (which should be fine) is OFFLINE.

Is there a way to online the disk without the pool being online? I've tried a forced import, but it won't do it.

I have backups, but it will be a huge pain to restore (large amount of data, all offsite).

I'm really hoping I'm just missing something here that can bring this second original disk back online so I can at least have 3 of my 4 disks online and get the pool back online again to replace the bad disk.
 
Code:
pool: omnius
     id: 14509745266368669880
  state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-EY
config:
    omnius                                          UNAVAIL  insufficient replicas
      raidz1-0                                      UNAVAIL  insufficient replicas
        9775893318672239851                         UNAVAIL  cannot open
        14344325973176569304                        OFFLINE
        gptid/cefa2c89-4ec6-8e4f-90a0-082598104763  ONLINE
        gptid/7ca3306b-3627-5549-89a2-d98aaf5a8069  ONLINE
 
One remark: Please paste the code in a code block, otherwise it's confusing to read.

Your pool has one device that has "disappeared" and one that is offlined. As it is a RAIDZ1, it is incomplete: it is missing two devices, and that is too many.
In order to prevent corrupting your data, ZFS refuses to import the pool in this state.

I tried different forcing options but in this situation it seems the only way to import it is to make sure the "UNAVAIL" device is attached to the system. This is what you need to do:
  • Attach the disk that is displayed as UNAVAIL (9775893318672239851).
  • Make sure it is visible in /dev/gptid. Unfortunately I cannot see its GPT GUID from your log.
  • Import the ZFS pool using zpool import omnius.
  • Make the OFFLINE disk online again, or if not available - insert a new one and use zpool replace.
You CANNOT bring the OFFLINE disk online without importing first, because it is potentially missing some updates from the other drives. It needs to be resilvered when brought online. So you have, in effect, two disks that are not available, and for RAIDZ1 this means a lost pool.
Your best shot is to find a way to bring the UNAVAIL device back.
Otherwise you have lost your pool and it can't be recovered without some ZFS hacking, I'm afraid.

Maybe someone else with a lot of ZFS experience knows another trick?
 
Now, reading your post again, I think what happened is pretty simple. You offlined one disk but removed another. That by itself is not a problem.
Ignore what I have written above.

Just put the original disk back into the slot you removed it from, then import your pool or reboot again.

Then online the disk again and make sure everything in your pool shows ONLINE (a rough command sketch is at the end of this post). In the future, do not OFFLINE anything; simply shut down and remove the drive.

What you did initially was remove the wrong drive, so you have one disk OFFLINE and one removed. OFFLINE-ing is actually not necessary; if you do offline a disk, make sure it is the drive you are actually going to remove!
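
In commands, the recovery would look roughly like this (just a sketch; the GUID is the OFFLINE disk from your zpool import listing):
Code:
# with the removed original disk back in its slot
zpool import omnius
# bring the offlined disk back; it will resilver any updates it missed
zpool online omnius 14344325973176569304
zpool status omnius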
 
First, thank you very much for your responses. Second, sorry about the bad pasting and about any lack of clarity. The disk that is UNAVAIL is installed; it is for some reason showing as unavailable. Here is some more information I've collected as I've played around with it this weekend, plus some that I did not include originally since I did not think it relevant (which means it probably is very relevant):

It is a 3+1 RAIDZ1 that I am having the issue with - "omnius".

I now think my issue may be that it's tied to a second system, a FreeNAS system with an 8+2-disk RAIDZ2 - "Shared". These RAIDZs are not linked by any physical means, only by my frugality: I was attempting to recycle drives from the RAIDZ2/Shared into the RAIDZ1/omnius. Here are the steps I took; hopefully this will be clearer than my original post. I will paste any historic output I still have saved into the code block for clarity.

1) RAIDZ2/Shared - 'offline' 1st 4TB disk, then physically remove
2) RAIDZ2/Shared - install new 8TB disk, then 'replace'
3) RAIDZ2/Shared - resilvered successfully, all is happy
4) RAIDZ1/omnius - 'offline' 1st 2TB disk, shut down, then physically remove
5) RAIDZ1/omnius - install recycled 4TB disk, then power on
6) RAIDZ1/omnius - 'replace', resilvered successfully, all is happy (pool state showed as ONLINE)
7) RAIDZ2/Shared - 'offline' 2nd 4TB disk, then physically remove
8) RAIDZ2/Shared - install new 8TB disk, then 'replace'
9) RAIDZ2/Shared - resilvered successfully, all is happy
10) RAIDZ1/omnius - 'offline' 2nd 2TB disk, shut down, then physically remove
11) RAIDZ1/omnius - install recycled 4TB disk, then power on
12) RAIDZ1/omnius - pool state showed as UNAVAIL; 1st disk now showing as "UNAVAIL cannot open", 2nd disk showing as OFFLINE
13) RAIDZ1/omnius - powered off, re-installed original 2nd 2TB disk, powered back on
14) RAIDZ1/omnius - pool state still showed as UNAVAIL; 1st disk still showing as "UNAVAIL cannot open", 2nd disk showing as OFFLINE
15) RAIDZ1/omnius - 'zpool import -f omnius' -> cannot import 'omnius': one or more devices is currently unavailable
16) RAIDZ1/omnius - 'zpool online omnius 14344325973176569304' -> cannot open 'omnius': no such pool
17) RAIDZ1/omnius - tried various combinations of original and replacement drives in the 1st & 2nd bays, all with the same results
18) RAIDZ1/omnius - 'zpool import' - this one gave very interesting results, and is probably the key to why the 1st disk is now showing as UNAVAIL:
Code:
  pool: omnius
     id: 14509745266368669880
  state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-EY
config:

    omnius                                          UNAVAIL  insufficient replicas
      raidz1-0                                      UNAVAIL  insufficient replicas
        9775893318672239851                         UNAVAIL  cannot open
        14344325973176569304                        OFFLINE
        gptid/cefa2c89-4ec6-8e4f-90a0-082598104763  ONLINE
        gptid/7ca3306b-3627-5549-89a2-d98aaf5a8069  ONLINE

   pool: Shared
     id: 17235548798910189170
  state: UNAVAIL
status: The pool was last accessed by another system.
action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-EY
config:

    Shared                             UNAVAIL  insufficient replicas
      raidz2-0                         UNAVAIL  insufficient replicas
        16185004583841295555           UNAVAIL  cannot open
        11306961280199829528           UNAVAIL  cannot open
        diskid/DISK-WD-WCC4E5DEVY95p2  ONLINE
        3749584798590910767            UNAVAIL  cannot open
        9647032758289143148            UNAVAIL  cannot open
        13082751735808943102           UNAVAIL  cannot open
        6880421956224970715            UNAVAIL  cannot open
        14933410017921218789           UNAVAIL  cannot open
        13898314399446029648           UNAVAIL  cannot open
        7944725303299886127            UNAVAIL  cannot open

So, to my very untrained eye, the 1st 4TB disk I recycled from RAIDZ2/Shared is now reporting itself as being part of RAIDZ2/Shared, not of RAIDZ1/omnius, even though it had successfully resilvered into RAIDZ1/omnius.

My guess is that I should not try to recycle drives like this, and if I do, there is probably something I should do to the disk before re-using it in another ZFS system.

Hope this extra information is helpful. Thank you again for your time and knowledge.


As an aside, I've tried (in desperation) to use a RAID5 data recovery utility on the 3 original RAIDZ1/omnius drives, thinking MAYBE RAIDZ1 is similar enough to RAID5 that it would piece things together. Probably no surprise to anyone on this forum, but it did not work. Does anyone know of any ZFS data recovery software?
 
I think you have the same problem as I do on my controller. If you (physically) remove a disk, the controller starts renumbering disks and the device names start moving around. So, suppose da1 is broken: if I remove it, disks da2 and da3 move to da1 and da2 respectively. This seems to confuse the heck out of ZFS. I found the only way to recover is to reboot. The disks are still shuffled around, but at least ZFS will find the correct disk for the right mirror and things appear to work again.

I also have to take extra care when issuing a replace command; I have to double-check the disk name or else it blows up. The best way I found was to make a note of the drive's serial number and use smartctl(8) to verify the disk name against the serial number (see the sketch after the output below).
Code:
config:
        NAME        STATE     READ WRITE CKSUM
        stor10k     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da0     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da2     ONLINE       0     0     0
            da1     ONLINE       0     0     0
This was once set up with da0/da1 + da2/da3 but a string of bad disks caused everything to move around.
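For example, something along these lines (a minimal sketch; da1 here is just a placeholder for whichever device you are about to replace):
Code:
# print the identity block for the device and check the reported serial number
# against the label on the physical drive before issuing zpool replace
smartctl -i /dev/da1 | grep -i 'serial'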
 
But his pool references the disks by gptid, not by device numbers. Even if the numbering changes, the gptids stay the same.
What I think happened is that the disks bsandor wants to recycle still have the "Shared" pool on them.

@gsandor: Did you format the disks you want to use for replacement? If you take a disk out of a ZFS pool on one machine and stick it into another machine, it still contains the old pool's labels, and the other machine recognizes this and tells you:
The pool was last accessed by another system.
This is what zpool import tells you, but you don't need to import Shared there. You just add the disk to "omnius" and this will overwrite the "Shared" data that is still on the disk.

To be on the safe side, you could recreate the partitions on the drive and then add it to "omnius". You could of course dd the whole disk with zeroes, but that's not necessary.
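
A rough sketch of what that could look like on FreeBSD (assuming the recycled disk shows up as da0 with the old Shared labels on da0p2, as in the zdb output later in this thread, and that nothing on it is still needed):
Code:
# wipe the old ZFS labels left over from the "Shared" pool
zpool labelclear -f /dev/da0p2
# recreate the partition table from scratch
gpart destroy -F da0
gpart create -s gpt da0
gpart add -t freebsd-zfs da0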

What I think you shouldn't do is offline the disks. Just shut down and pull out the drive. Do not offline them, because if you do and then pull the wrong drive, your RAIDZ1 pool is suddenly missing two drives and it can't work.
So go back a step and make the unavailable disk available again. Then, without offlining anything, pull the disk you want to replace.
 
I did not format the disk I recycled, nor did it tell me the pool was last accessed by another system. I simply used the replace command, and it accepted the disk and resilvered with no errors.

Good to know going forward that offlining the disk is not necessary before replacing. I always stress myself out trying to triple-check that I am offlining the correct disk since I cannot see the serial numbers of the installed disks to be sure.

Unfortunately, I cannot seem to "go back a step" and make the unavailable disk (the 4TB disk originally from Shared) available again. It's in the bay, but it shows as "UNAVAIL cannot open". I tried putting the original 2TB disk back in that bay, and it still tells me "UNAVAIL cannot open" for that first device when I do a zpool import. The only difference is that when I use the original 2TB disk rather than the recycled 4TB disk, zpool import no longer shows me the information for "Shared". It doesn't matter which disk I put in the first bay; when I run zpool import -f omnius, I only get cannot import 'omnius': one or more devices is currently unavailable.
 
The UNAVAIL was first shown when you removed a drive to replace it, so in my opinion it is your original 2TB drive that should fix it.
You have to find the drive with ZFS GUID "9775893318672239851" (original 2TB drive) and insert it in the bay.

Normally, if your ZFS pool were imported, the device would still be UNAVAIL and you would need to online it first with zpool online omnius 9775893318672239851.
I don't know how this can be done on a pool that is not imported.
Try zpool import -Ff omnius.

If this does not work, maybe try zpool online omnius 9775893318672239851, though I'm not sure whether that works on a non-imported pool.
We need advice from a ZFS guru here.

Another idea: insert the original 2TB drive and reboot the machine - is it still unavail?
 
I found this article: https://serverfault.com/questions/562998/zfs-bringing-a-disk-online-in-an-unavailable-pool
The dude has the same problem as you - an OFFLINE and an UNAVAIL device in a pool that can't be imported.

They say this:
1: Make sure the disks are safe - even if that means unplugging all of them.
2: Update to the latest version of FreeBSD - you want the latest ZFS bits you can get your hands on.
3: Put the original gpt/ta4 (that is supposedly 'OK' and just experiencing read errors) back in the system or into a new system with newer ZFS bits (as well as all the others if you've removed them), boot it, and run, in order until one works (be forewarned - these are not safe, especially the last one, in that in their attempts to recover the system they're likely to roll back and thus lose recently written data):
  • zpool import -f tank
  • zpool import -fF tank
  • zpool import -fFX tank
If all 3 fail, you're outside the realm of "simple" recovery. Some Googling for 'importing bad pools', 'zdb', 'zpool import -F', 'zpool import -X', 'zpool import -T' (danger!), and the like, might provide you some additional blogs and information on recovery attempts made by others, but it's already on very dangerous and potentially further-data-damaging ground at that point and you're rapidly entering territory of paid recovery services (and not from traditional data recovery companies, they have zero expertise with ZFS and will not be of any use to you).

I hope this helps.
 
What you need to do is make the UNAVAIL disk available again. The OFFLINE one is no good on its own, because it needs to be resilvered from the others in order to be usable again. If this can't be done, you have probably lost the whole pool. Had you not OFFLINEd the second drive, this would not have happened. Don't offline devices when replacing, it's just too risky, unless you have RAIDZ2, where you can play with one drive and still have one drive of buffer.
Also - if you make a mistake, just slide the drive back in and wait for resilvering to complete.

On the debugging side, things could (theoretically) be rolled back to the last checkpoint where the OFFLINE disk was online, after which it could be brought back online. I have no idea how to do that.
 
Thank you again for all your help. Those 3 options did not do the trick. I'm going to try the import -T option tomorrow when I get a chance to double-check which drives I left in the server (I've left this server at work and am attempting the suggestions remotely from home). It looks like I can use zdb -l on the disks to determine some information about the pools they've been used in, which could guide me toward the more dangerous import/recovery options. Since I'm mostly considering the data lost at this point, I'm going to continue to play around with these tools and see if I can get any results. I'll post them if I do (or don't).

You have all been great and very helpful. I have to tell you, this has been a totally different experience from when I post on the FreeNAS forums for help with my FreeNAS system. There tends to be a LOT of judgement and condescension there if you don't know as much as the most experienced moderators or users. I typically go to forums to admit I don't know something and hope to get help, not to have my lack of knowledge thrown back at me with contempt.
 
I don't know ixsystems' forums but the FreeBSD forums are very friendly. That's one of the things I really like about this OS.
 
Here's more data for anyone interested. I'm not quite sure I understand most of it, let alone all of it, but what I have found is that if I have what should be the "correct" drives in the machine for pool omnius, I definitely get different pool info from the 4 drives. The 4TB recycled drive, even though it resilvered and ran for almost 24 hours in the omnius pool without error, only shows information for its original Shared pool. The other 3 disks seem to show the GUIDs of the 3 2TB drives, plus a GUID for what I assume is the recycled 4TB drive that was resilvered into omnius but has now somehow vanished from the physical disk. I'm abbreviating the zdb -l output because it's way too long:

bay1 - /dev/da0 - 4TB recycled drive
bay2 - /dev/da1 - 2TB original drive (offlined)
bay3 - /dev/da2 - 2TB original drive
bay4 - /dev/da3 - 2TB original drive

Code:
root@FreeBSD:~ # zpool import
   pool: omnius
     id: 14509745266368669880
  state: UNAVAIL
 status: The pool was last accessed by another system.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-EY
 config:

    omnius                                          UNAVAIL  insufficient replicas
      raidz1-0                                      UNAVAIL  insufficient replicas
        9775893318672239851                         UNAVAIL  cannot open
        14344325973176569304                        OFFLINE
        gptid/cefa2c89-4ec6-8e4f-90a0-082598104763  ONLINE
        gptid/7ca3306b-3627-5549-89a2-d98aaf5a8069  ONLINE

   pool: Shared
     id: 17235548798910189170
  state: UNAVAIL
 status: The pool was last accessed by another system.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-EY
 config:

    Shared                             UNAVAIL  insufficient replicas
      raidz2-0                         UNAVAIL  insufficient replicas
        16185004583841295555           UNAVAIL  cannot open
        11306961280199829528           UNAVAIL  cannot open
        diskid/DISK-WD-WCC4E5DEVY95p2  ONLINE
        3749584798590910767            UNAVAIL  cannot open
        9647032758289143148            UNAVAIL  cannot open
        13082751735808943102           UNAVAIL  cannot open
        6880421956224970715            UNAVAIL  cannot open
        14933410017921218789           UNAVAIL  cannot open
        13898314399446029648           UNAVAIL  cannot open
        7944725303299886127            UNAVAIL  cannot open



root@FreeBSD:/dev # zdb -l /dev/da0p2
------------------------------------
LABEL 0
------------------------------------
failed to unpack label 0
------------------------------------
LABEL 1
------------------------------------
failed to unpack label 1
------------------------------------
LABEL 2
------------------------------------
    version: 5000
    name: 'Shared'
    state: 0
    txg: 27761907
    pool_guid: 17235548798910189170
    hostid: 3412306411
    hostname: 'freenas01.allegiance-it.private'
    top_guid: 6187037690749441769
    guid: 2651163555300036725
    vdev_children: 6
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 6187037690749441769
        nparity: 2
        metaslab_array: 31
        metaslab_shift: 37
        ashift: 9
        asize: 39986347376640
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 16185004583841295555
            path: '/dev/gptid/0b184901-4126-11e5-a934-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@5/elmdesc@ArrayDevice04'
            whole_disk: 1
            DTL: 511
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 11306961280199829528
            path: '/dev/gptid/a71c4aec-4cdc-11e5-b035-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@6/elmdesc@ArrayDevice05'
            whole_disk: 1
            DTL: 510
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 2651163555300036725
            path: '/dev/gptid/0a208c45-49c0-11e5-a8c9-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@7/elmdesc@ArrayDevice06'
            whole_disk: 1
            DTL: 509
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 3749584798590910767
            path: '/dev/gptid/fb6e3c21-57d9-11e9-abc8-18a9055a8b30'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@8/elmdesc@ArrayDevice07'
            whole_disk: 1
            DTL: 4258
            create_txg: 4
        children[4]:
            type: 'disk'
            id: 4
            guid: 9647032758289143148
            path: '/dev/gptid/41abb3b5-5af1-11e5-964e-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@a/elmdesc@ArrayDevice09'
            whole_disk: 1
            DTL: 507
            create_txg: 4
        children[5]:
            type: 'disk'
            id: 5
            guid: 13082751735808943102
            path: '/dev/gptid/0b61db0f-5711-11e5-90ec-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@b/elmdesc@ArrayDevice0A'
            whole_disk: 1
            DTL: 506
            create_txg: 4
        children[6]:
            type: 'disk'
            id: 6
            guid: 6880421956224970715
            path: '/dev/gptid/5fe6acb4-6154-11e5-aef2-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@14/elmdesc@ArrayDevice13'
            whole_disk: 1
            DTL: 505
            create_txg: 4
        children[7]:
            type: 'disk'
            id: 7
            guid: 14933410017921218789
            path: '/dev/gptid/193e97fb-50d0-11e5-9a9d-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@c/elmdesc@ArrayDevice0B'
            whole_disk: 1
            DTL: 504
            create_txg: 4
        children[8]:
            type: 'disk'
            id: 8
            guid: 13898314399446029648
            path: '/dev/gptid/6e9292a8-646a-11e5-baa7-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@13/elmdesc@ArrayDevice12'
            whole_disk: 1
            DTL: 503
            create_txg: 4
        children[9]:
            type: 'disk'
            id: 9
            guid: 7944725303299886127
            path: '/dev/gptid/89dfd2f2-5fbb-11e5-9dd4-f46d042cc584'
            phys_path: 'id1,enc@n5001e677bb75effd/type@0/slot@9/elmdesc@ArrayDevice08'
            whole_disk: 1
            DTL: 502
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data


root@FreeBSD:~ # zdb -l /dev/da1p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'omnius'
    state: 0
    txg: 23528409
    pool_guid: 14509745266368669880
    errata: 0
    hostid: 2446817539
    hostname: ''
    top_guid: 17608848787523622521
    guid: 7688033009861353422
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 17608848787523622521
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 35
        ashift: 12
        asize: 8001538752512
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 9775893318672239851
            path: '/private/var/run/disk/by-id/media-0E2FD10D-C34A-D544-A04F-811FDF4714B9'
            whole_disk: 0
            DTL: 273
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14344325973176569304
            path: '/private/var/run/disk/by-id/media-F2C5AAAF-4045-114B-8836-CB35D5584327'
            whole_disk: 1
            DTL: 42
            create_txg: 4
            offline: 1
        children[2]:
            type: 'disk'
            id: 2
            guid: 7688033009861353422
            path: '/private/var/run/disk/by-id/media-CEFA2C89-4EC6-8E4F-90A0-082598104763'
            whole_disk: 1
            DTL: 39
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 6390400256698347316
            path: '/private/var/run/disk/by-id/media-7CA3306B-3627-5549-89A2-D98AAF5A8069'
            whole_disk: 1
            DTL: 41
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data


root@FreeBSD:~ # zdb -l /dev/da2p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'omnius'
    state: 0
    txg: 23528409
    pool_guid: 14509745266368669880
    errata: 0
    hostid: 2446817539
    hostname: ''
    top_guid: 17608848787523622521
    guid: 6390400256698347316
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 17608848787523622521
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 35
        ashift: 12
        asize: 8001538752512
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 9775893318672239851
            path: '/private/var/run/disk/by-id/media-0E2FD10D-C34A-D544-A04F-811FDF4714B9'
            whole_disk: 0
            DTL: 273
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14344325973176569304
            path: '/private/var/run/disk/by-id/media-F2C5AAAF-4045-114B-8836-CB35D5584327'
            whole_disk: 1
            DTL: 42
            create_txg: 4
            offline: 1
        children[2]:
            type: 'disk'
            id: 2
            guid: 7688033009861353422
            path: '/private/var/run/disk/by-id/media-CEFA2C89-4EC6-8E4F-90A0-082598104763'
            whole_disk: 1
            DTL: 39
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 6390400256698347316
            path: '/private/var/run/disk/by-id/media-7CA3306B-3627-5549-89A2-D98AAF5A8069'
            whole_disk: 1
            DTL: 41
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data




root@FreeBSD:~ # zdb -l /dev/da3p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'omnius'
    state: 0
    txg: 23525225
    pool_guid: 14509745266368669880
    errata: 0
    hostid: 2446817539
    hostname: ''
    top_guid: 17608848787523622521
    guid: 14344325973176569304
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 17608848787523622521
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 35
        ashift: 12
        asize: 8001538752512
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 9775893318672239851
            path: '/private/var/run/disk/by-id/media-0E2FD10D-C34A-D544-A04F-811FDF4714B9'
            whole_disk: 0
            DTL: 273
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14344325973176569304
            path: '/private/var/run/disk/by-id/media-F2C5AAAF-4045-114B-8836-CB35D5584327'
            whole_disk: 1
            DTL: 42
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 7688033009861353422
            path: '/private/var/run/disk/by-id/media-CEFA2C89-4EC6-8E4F-90A0-082598104763'
            whole_disk: 1
            DTL: 39
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 6390400256698347316
            path: '/private/var/run/disk/by-id/media-7CA3306B-3627-5549-89A2-D98AAF5A8069'
            whole_disk: 1
            DTL: 41
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
 
But if I put all the original 2TB drives back in, the first disk seems to show all the correct GUIDs for omnius, while the other 3 obviously show a different GUID for disk 0, since they are expecting the recycled 4TB disk.

bay1 - /dev/da0 - 2TB original drive (outdated, some data written to omnius after removal)
bay2 - /dev/da1 - 2TB original drive (offlined)
bay3 - /dev/da2 - 2TB original drive
bay4 - /dev/da3 - 2TB original drive


Code:
root@FreeBSD:~ # zpool import 
     pool: omnius
     id: 14509745266368669880
  state: UNAVAIL
 status: The pool was last accessed by another system.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://illumos.org/msg/ZFS-8000-EY
 config:

    omnius                                          UNAVAIL  insufficient replicas
      raidz1-0                                      UNAVAIL  insufficient replicas
        9775893318672239851                         UNAVAIL  cannot open
        14344325973176569304                        OFFLINE
        gptid/cefa2c89-4ec6-8e4f-90a0-082598104763  ONLINE
        gptid/7ca3306b-3627-5549-89a2-d98aaf5a8069  ONLINE




root@FreeBSD:~ # zdb -l /dev/da0p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'omnius'
    state: 0
    txg: 23514860
    pool_guid: 14509745266368669880
    errata: 0
    hostid: 2446817539
    hostname: ''
    top_guid: 17608848787523622521
    guid: 8654596240248325070
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 17608848787523622521
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 35
        ashift: 12
        asize: 8001538752512
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 8654596240248325070
            path: '/private/var/run/disk/by-id/media-0E2FD10D-C34A-D544-A04F-811FDF4714B9'
            whole_disk: 1
            DTL: 43
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14344325973176569304
            path: '/private/var/run/disk/by-id/media-F2C5AAAF-4045-114B-8836-CB35D5584327'
            whole_disk: 1
            DTL: 42
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 7688033009861353422
            path: '/private/var/run/disk/by-id/media-CEFA2C89-4EC6-8E4F-90A0-082598104763'
            whole_disk: 1
            DTL: 39
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 6390400256698347316
            path: '/private/var/run/disk/by-id/media-7CA3306B-3627-5549-89A2-D98AAF5A8069'
            whole_disk: 1
            DTL: 41
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data


root@FreeBSD:~ # zdb -l /dev/da1p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'omnius'
    state: 0
    txg: 23528409
    pool_guid: 14509745266368669880
    errata: 0
    hostid: 2446817539
    hostname: ''
    top_guid: 17608848787523622521
    guid: 7688033009861353422
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 17608848787523622521
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 35
        ashift: 12
        asize: 8001538752512
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 9775893318672239851
            path: '/private/var/run/disk/by-id/media-0E2FD10D-C34A-D544-A04F-811FDF4714B9'
            whole_disk: 0
            DTL: 273
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14344325973176569304
            path: '/private/var/run/disk/by-id/media-F2C5AAAF-4045-114B-8836-CB35D5584327'
            whole_disk: 1
            DTL: 42
            create_txg: 4
            offline: 1
        children[2]:
            type: 'disk'
            id: 2
            guid: 7688033009861353422
            path: '/private/var/run/disk/by-id/media-CEFA2C89-4EC6-8E4F-90A0-082598104763'
            whole_disk: 1
            DTL: 39
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 6390400256698347316
            path: '/private/var/run/disk/by-id/media-7CA3306B-3627-5549-89A2-D98AAF5A8069'
            whole_disk: 1
            DTL: 41
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data


root@FreeBSD:~ # zdb -l /dev/da2p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'omnius'
    state: 0
    txg: 23528409
    pool_guid: 14509745266368669880
    errata: 0
    hostid: 2446817539
    hostname: ''
    top_guid: 17608848787523622521
    guid: 6390400256698347316
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 17608848787523622521
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 35
        ashift: 12
        asize: 8001538752512
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 9775893318672239851
            path: '/private/var/run/disk/by-id/media-0E2FD10D-C34A-D544-A04F-811FDF4714B9'
            whole_disk: 0
            DTL: 273
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14344325973176569304
            path: '/private/var/run/disk/by-id/media-F2C5AAAF-4045-114B-8836-CB35D5584327'
            whole_disk: 1
            DTL: 42
            create_txg: 4
            offline: 1
        children[2]:
            type: 'disk'
            id: 2
            guid: 7688033009861353422
            path: '/private/var/run/disk/by-id/media-CEFA2C89-4EC6-8E4F-90A0-082598104763'
            whole_disk: 1
            DTL: 39
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 6390400256698347316
            path: '/private/var/run/disk/by-id/media-7CA3306B-3627-5549-89A2-D98AAF5A8069'
            whole_disk: 1
            DTL: 41
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data



root@FreeBSD:~ # zdb -l /dev/da3p1
------------------------------------
LABEL 0
------------------------------------
    version: 5000
    name: 'omnius'
    state: 0
    txg: 23525225
    pool_guid: 14509745266368669880
    errata: 0
    hostid: 2446817539
    hostname: ''
    top_guid: 17608848787523622521
    guid: 14344325973176569304
    vdev_children: 1
    vdev_tree:
        type: 'raidz'
        id: 0
        guid: 17608848787523622521
        nparity: 1
        metaslab_array: 34
        metaslab_shift: 35
        ashift: 12
        asize: 8001538752512
        is_log: 0
        create_txg: 4
        children[0]:
            type: 'disk'
            id: 0
            guid: 9775893318672239851
            path: '/private/var/run/disk/by-id/media-0E2FD10D-C34A-D544-A04F-811FDF4714B9'
            whole_disk: 0
            DTL: 273
            create_txg: 4
        children[1]:
            type: 'disk'
            id: 1
            guid: 14344325973176569304
            path: '/private/var/run/disk/by-id/media-F2C5AAAF-4045-114B-8836-CB35D5584327'
            whole_disk: 1
            DTL: 42
            create_txg: 4
        children[2]:
            type: 'disk'
            id: 2
            guid: 7688033009861353422
            path: '/private/var/run/disk/by-id/media-CEFA2C89-4EC6-8E4F-90A0-082598104763'
            whole_disk: 1
            DTL: 39
            create_txg: 4
        children[3]:
            type: 'disk'
            id: 3
            guid: 6390400256698347316
            path: '/private/var/run/disk/by-id/media-7CA3306B-3627-5549-89A2-D98AAF5A8069'
            whole_disk: 1
            DTL: 41
            create_txg: 4
    features_for_read:
        com.delphix:hole_birth
        com.delphix:embedded_data
 
So, I'm not 100% sure what zpool import -T does (from what I can gather online, it reads all the raw data and attempts to rebuild the pool from it), but I'm going to try it first with only disks 2, 3 & 4 (their data should match). If that doesn't work, I'm going to try it with the original 2TB disk 1 plus the other 3 disks. From everything I can see, the recycled 4TB disk 1 only has its old pool info on it, so I'm going to exclude it from all further import attempts. I'll keep you posted.
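
From what I've read, the attempt would look something like this (just a sketch of my plan, not something I know to be safe; the txg value is simply the one reported in the zdb -l labels above, and I would start read-only so nothing gets written):
Code:
# attempt to import the pool at a specific transaction group, read-only first
zpool import -o readonly=on -f -T 23528409 omnius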
 
bay1 - /dev/da0 - 2TB original drive (outdated, some data written to omnius after removal)
bay2 - /dev/da1 - 2TB original drive (offlined)
bay3 - /dev/da2 - 2TB original drive
bay4 - /dev/da3 - 2TB original drive
I don't know how this situation could have happened.
If you removed da0, no information could have been written to omnius! It's not possible.
With da1 offline and da0 removed, omnius should have been UNAVAIL as a whole, so no new data could have been written.

If you had removed da0 first, da1 could not have been offlined, because ZFS would not allow it.

All in all, in my opinion your da0 should still have the latest data, unless you replaced it with another drive, but that is not yet the case.
 
If you find a way to make da0 ONLINE again (even sacrificing some inconsistent data after the last consistent checkpoint) you could import omnius again.
Try to focus on how you can make da0 ONLINE again, if possible.

The zpool import output shows "cannot open". Try to debug why exactly it cannot be opened; what is the exact reason?
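
A starting point could be to check what the operating system itself sees (a sketch using standard FreeBSD tools; adjust the device name to match the suspect disk):
Code:
# list the disks the OS detects, with model and serial numbers
camcontrol devlist
geom disk list
# confirm the expected partitions and gptid entries exist for the suspect disk
gpart show da0
ls -l /dev/gptid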
 
After da0 was removed and replaced with the recycled 4TB drive, the pool resilvered, and it was approximately one day before I offlined da1. During that day, some data was likely written to omnius. The pool was online and available once da0 was replaced, but went offline after da1 was offlined. Once I offlined da1, shut the system down, replaced it, and turned it back on, that's when the new da0 drive seems to have lost all omnius pool data. Very strange.
 
If I can find a way to online da1, then I have 3 drives out of 4 that are all at the same version of the data in relation to omnius.

The original 2TB da0 shows omnius data, but it is older than the other 3 drives. The replacement/recycled 4TB da0 should be consistent with the rest of omnius, but it has somehow lost all data pertaining to omnius and only shows Shared pool data.

I've been searching online for ways to online a drive in a pool that is not online itself, but have not yet found a solution. I have found many articles on using the zdb tool to attempt to recover corrupted ZFS pools, but none of those have worked for me, because every variation I try complains that I do not have enough drives online. I am surprised there is no way to force either the original da0 or the current da1 online, but I have not been able to discover one. I will keep looking and post if I find something, since I would imagine it would be useful information to others.
 
Oh, wait! I probably misunderstood you in the beginning.

So you did resilver your 4TB drive on da0? Then forget about the original 2TB drive. After you replaced it with the 4TB drive, it was no longer in sync. You can't use it anymore.
You have to find the last drive that was working (ONLINE) in the da0 slot and put that back online.

Basically, try to remember which drives were in the bays when you last saw da0, da2 and da3 ONLINE. Only da1 is OFFLINE, and you know why; you can't change that.
 
Unless the old drive is defective and no longer seen as online by the pool, I would recommend not simply removing the old drive and plugging in a new one - you are deliberately taking away the only redundancy your pool has (RAIDZ1). If things fail now - and the extra workload of resilvering may push another drive that is borderline over the edge - you are SOL.

Rather, attach the new drive in addition to the existing set and use (an example; in practice, reference the disks by diskid or gptid):

zpool replace tank da1 da5

I realize that you may not always have a free bay available. In that case, just get creative. For example, you could use an external USB or eSATA dock or simply hotwire from an internal SATA port to a drive resting on the case of your system. If you work by diskid or gptid, that will not cause any issues when putting the new drive in its permanent place.
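
For instance, a rough sketch of that approach (the gpt label name and the assumption that the new disk shows up as da5 are mine, just for illustration):
Code:
# partition and label the new disk so it can be referenced by a stable name
gpart create -s gpt da5
gpart add -t freebsd-zfs -l omnius-new1 da5
# replace the old device while it is still attached, keeping redundancy during the resilver;
# here da1 stands in for whichever device you are retiring
zpool replace omnius da1 gpt/omnius-new1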
 

Good advice. Thanks.
 