FreeBSD 10 won't boot to ZFS root after power failure

While installing new hardware in a rack today, I inadvertently disconnected the power to one of our servers. This server is running FreeBSD 10 with Root-on-ZFS as offered by the installer. It has a total of 36 disks, distributed across two RAID-Z2 vdevs that belong to the same pool (called zroot).

Code:
$ zpool status
  pool: zroot
 state: ONLINE
  scan:
config:

    NAME                                            STATE     READ WRITE CKSUM
    zroot                                           ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/f8c57b3a-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/f97b7e8b-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/fa3c41d9-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/faf62101-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/fbb19e1b-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/fc6b75db-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/fd26cd36-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/fddb4b8e-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/fe9a55f6-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/ff582110-083e-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/001713d1-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/00d90b6c-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/0192be91-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/023ea058-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/02fb8ee4-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/03ab78ec-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/04632542-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
        gptid/052144fd-083f-11e4-b11b-002590e745f4  ONLINE       0     0     0
      raidz2-1                                      ONLINE       0     0     0
        da18                                        ONLINE       0     0     0
        da19                                        ONLINE       0     0     0
        da20                                        ONLINE       0     0     0
        da21                                        ONLINE       0     0     0
        da22                                        ONLINE       0     0     0
        da23                                        ONLINE       0     0     0
        da24                                        ONLINE       0     0     0
        da25                                        ONLINE       0     0     0
        da26                                        ONLINE       0     0     0
        da27                                        ONLINE       0     0     0
        da28                                        ONLINE       0     0     0
        da29                                        ONLINE       0     0     0
        da30                                        ONLINE       0     0     0
        da31                                        ONLINE       0     0     0
        da32                                        ONLINE       0     0     0
        da33                                        ONLINE       0     0     0
        da34                                        ONLINE       0     0     0
        da35                                        ONLINE       0     0     0

errors: No known data errors

After switching the server back on, it no longer boots. Before the boot menu shows up, the following messages are printed:

Code:
Loading /boot/defaults/loader.conf
ZFS: i/o error - all block copies unavailable
Warning: error reading file /boot/loader.conf

Despite these messages, the system continues to load the kernel until it stops at a mountroot> prompt, and I can't get it to continue from there. If I enter zfs:zroot/ROOT/default (which is what it should be), it just says "unknown filesystem". I can, however, boot from a USB stick, import the zpool, and read both /boot/defaults/loader.conf and /boot/loader.conf. In fact, the zpool appears to be perfectly fine.

From the live USB environment I have since tried:
1) reinstalling the bootcode: gpart bootcode -b /tmp/zroot/boot/pmbr -p /boot/gptzfsboot -i 1 da0
2) recreating the zpool cache file: zpool set cachefile=/tmp/zroot/boot/zfs/zpool.cache zroot
3) manually telling the boot loader to load the ZFS module.

All of these attempts resulted in the same behaviour, and I am now completely stuck on what I could possibly do to make the system boot again.
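For completeness, this is roughly the sequence I ran from the live USB environment (a sketch; the /tmp/zroot altroot and the da0 device come from my setup, and the bootcode step would have to be repeated for every disk that carries a freebsd-boot partition):

Code:
# import the pool under an altroot so the USB stick's / stays untouched
zpool import -f -o altroot=/tmp/zroot zroot
# reinstall the protective MBR and the ZFS-aware boot blocks on one disk
gpart bootcode -b /tmp/zroot/boot/pmbr -p /boot/gptzfsboot -i 1 da0
# regenerate the pool cache file inside the pool's own /boot/zfs
zpool set cachefile=/tmp/zroot/boot/zfs/zpool.cache zroot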

Output of zpool list:

Code:
$ zpool list
NAME    SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
zroot  97.8T  28.0T  69.7T    28%  1.33x  ONLINE  /tmp/zfs

Currently the system is booted from a USB stick, so I could not import the pool at / (hence the altroot shown above).

Contents of /boot/loader.conf:

Code:
$ cat /boot/loader.conf
zfs_load="YES"
ipmi_load="YES"

I also posted a question on ServerFault: http://serverfault.com/questions/616991/freebsd-10-wont-boot-to-zfs-root-after-power-failure. Unfortunately, it has not gained much traction so far, and I hoped someone here could give me pointers on how to rectify the situation. I mainly have Linux experience and am really new to FreeBSD, so please bear with me and consider that I may have forgotten something very basic or done something basic very wrong (apart from using Root-on-ZFS, apparently).
 
See if setting this at the loader prompt (choose the option to drop to the loader prompt before the loader starts loading the kernel) makes any difference:
Code:
set kern.geom.label.gptid.enable=0
boot
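
If it does help, the setting can be made permanent by adding the corresponding line to /boot/loader.conf on the pool (a one-line sketch):

Code:
# disable the gptid labels so the providers appear under their plain device names
kern.geom.label.gptid.enable="0"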
 
Unfortunately, this did not help. Thank you anyway :)

While I did this, I had a new idea: when booting up, the BIOS seems to know about only the first 16 drives, not all 36, and so does the BTX loader. Taking into account the ZFS message that all block copies are unavailable, I suppose it is possible that every copy of the files in /boot lives on disks 17 to 36, which would inevitably produce exactly that error.
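
One way to verify this is to drop to the loader prompt and list the devices the loader can actually see (lsdev is a built-in loader command; -v prints extra detail):

Code:
lsdev -v

If only 16 BIOS drives are listed there, the loader simply cannot read any block that lives exclusively on the remaining 20 disks.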

My next thought was to tell ZFS to keep 36 copies of (at least) the dataset containing /boot, maybe even of the whole base system. I think this could solve the issue.

Edit: Okay, my idea did not work out either, because
1) the copies property can only be set to 1, 2, or 3 (I would have liked to set it to 36), and
2) changing the property only affects newly written data (see the sketch below).
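
For the record, this is as far as the copies property goes (a sketch; the dataset name is the one from the default Root-on-ZFS layout):

Code:
# the maximum accepted value is 3, not 36
zfs set copies=3 zroot/ROOT/default
# and only blocks written from now on get the extra copies; existing data
# would have to be rewritten (e.g. via zfs send/receive) to pick them up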
 
If you have the option to redo the pools before putting the system into production, you should create a very small root pool made of only a few disks that contains just the basic operating system; possibly just use a mirror vdev for that. I also think that the ZFS experts strongly advise against putting too many disks in a vdev in RAIDZ pools; I believe the recommendation is a maximum of 8-9 disks per vdev.
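
For illustration, such a setup could look like this (a hypothetical sketch; the device and partition names are assumptions):

Code:
# small mirrored root pool on a dedicated GPT partition of two disks
zpool create zroot mirror da0p3 da1p3
# the remaining 34 disks then go into one or more separate data pools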
 
kpa said:
If you have the option to redo the pools before putting the system into production, you should create a very small root pool made of only a few disks that contains just the basic operating system; possibly just use a mirror vdev for that. I also think that the ZFS experts strongly advise against putting too many disks in a vdev in RAIDZ pools; I believe the recommendation is a maximum of 8-9 disks per vdev.

Yes, I think I learned my lesson the hard way :)

As it seems no one is able to solve this problem, I will now try to rearrange the existing pool and re-install FreeBSD on a two-disk mirror that it can hopefully boot from.
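
The rough plan, as a sketch (assuming a scratch pool with enough space to hold the evacuated data; the pool name is an assumption):

Code:
# snapshot everything recursively and copy it off the pool
zfs snapshot -r zroot@evacuate
zfs send -R zroot@evacuate | zfs receive -F scratch/zroot
# then destroy zroot, reinstall onto a two-disk mirror and send the data back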

Thank you very much everyone for your help!
 