FreeBSD 10.1 doesn't boot from zroot anymore after applying p25

Hi list,

[I think it is better placed in storage than in Installing/Upgrading. Excuse me for the double post.]

I am a bit lost right now.

I thought I'd just update my system, which is:
Code:
10.1-RELEASE FreeBSD 10.1-RELEASE #0 r274401: Tue Nov 11 21:02:49 UTC 2014 root@releng1.nyi.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64

I applied the updates with freebsd-update fetch && freebsd-update install. p24 was applied, and after a reboot I got:

Code:
FreeBSD 10.1-RELEASE-p24 (GENERIC) #0: Mon Nov 2 12:17:28 UTC 2015

I started the updates again. It said:

Code:
The following files will be added as part of updating to 10.1-RELEASE-p25:
*a long list with mostly stuff in /usr/src/contrib/*

I rebooted again and then it got stuck at:

Code:
ZFS: i/o error - all block copies unavailable
ZFS: can't read object set for dataset u
ZFS: can't open root filesystem
gptzfsboot: failed to mount default pool zroot

I read that the boot code has to be rewritten. I booted 10.1 from a USB stick and did:
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 mfid0 (and the same for mfid1 through mfid5)
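In other words, roughly this loop over all six disks (the device names are the ones on my controller; adjust to yours):
Code:
for d in mfid0 mfid1 mfid2 mfid3 mfid4 mfid5; do
    gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 $d   # -i 1 targets the freebsd-boot partition
done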

Nothing changed.

Then I booted 10.2 and did the above again. Now it says:

Code:
ZFS: i/o error - all block copies unavailable (3 times)

Can't find /boot/zfsloader

...
boot:
ZFS: i/o error - all block copies unavailable

Can't find /boot/kernel/kernel

With 10.2 I tried to import the zpool to see the status.
zpool import -f zroot

That worked but zpool status gave me:
Code:
internal error: failed to initialize ZFS library

Same with 10.1.

I rewrote the bootcode again with 10.1. Same errors.

So, now I can't boot and I can't access the zpool just because I did some updates. How can that be?

I'd really appreciate some help!

Cheers,

FreeBeer
 
I solved one problem. Importing the pool has to be done like this:

zpool import -o altroot=/mnt zroot

That way I am able to chroot into /mnt.

I read that the directory /boot would "heal" by doing this:
Code:
mv boot boot.orig
mkdir boot
cd boot.orig
cp -r * /boot

Then I exited the chroot, exported zroot and rebooted. Now it actually does boot, but I am stuck again at:
Code:
Mounting from zfs:zroot/ROOT/default failed with error 2: unknown file system

mountroot>

Back in the chroot shell I tried:

freebsd-update rollback

Still the same.

Does anyone have an idea?

FreeBeer
 
I gave up after two days of trying. The last thing I did was copy the kernel and modules from the 10.1 live CD to /boot on the zroot. It still didn't boot, dropping me at the boot prompt with the ZFS i/o error and so on.

In order to access the pool and share it via NFS, I installed 10.1 on an external USB disk. That took about 10 minutes.

I have really come to the conclusion that root on ZFS is not fit for production if it fails after something as routine as applying security updates.
I'll install the system on a mirror of external devices using UFS and keep the zpool separate. Maybe PCIe SSDs or the internal dual SD cards (it is a Dell R710) would be best.

FreeBeer
 
I just ran into the same issue on one of my 10.1-RELEASE systems.

Thanks for posting your experience even though it didn't work out for you. You may have just missed a minor detail.

I was able to recover easily by doing the following:

1) Boot from 10.2-RELEASE USB stick, CD, or DVD
2) Select Live CD from Install/Shell/Live CD menu
3) Log in as root
4) Run the following at the shell prompt:
# zpool import -R /mnt -f zroot # Probably equivalent to what you did for this purpose
# cd /mnt
# mv boot boot.orig
# mkdir boot
# cd boot.orig
# cp -Rp * /mnt/boot # Note -p to make sure permissions are correct in the new /boot
# zpool export zroot
# reboot

The system booted fine after this.

I'd be interested in any comments from someone who knows ZFS better than I do.

Is there a cleaner solution than this?

Is this to be expected for everyone booting 10.1 from ZFS with this round of updates?

Has the issue been corrected in 10.2?

Thanks,

Jason
 
Very similar symptoms to yours, but on 10.2: after yesterday's updates it wouldn't boot. This was a clean 10.2 ZFS-on-root install.

Tried your method: no luck so far.
 
I verified that it's not inevitable for all 10.x ZFS-booted systems. I updated our test cluster head node this afternoon and it rebooted fine. FYI, I had also rewritten the boot block prior to recreating /boot, but it didn't appear to have any effect on its own. Also removed a bunch of files while it was mounted under /mnt, but I doubt this had any effect since the pool was only at about 17% capacity to begin with.
 
I've been on this for nearly 24 hours now (4 hours of sleep ;-)) and still no joy. Some more detail:

This is a clean ZFS-on-root install of 10.2 from about September. I used the guided zfs-root setup. I then added a second mirror vdev, so the root zpool has 2 mirror vdevs (4 disks total). The 2 disks in the 1st vdev have 3 partitions (freebsd-boot, freebsd-swap, freebsd-zfs); the 2nd vdev pair has no partitions, they are raw disks. Strangely, though, the boot loader reports them as having 2 partitions each (freebsd-swap and freebsd-zfs).
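For reference, the second vdev was added with nothing more than something like this (raw devices, no partitioning, which turns out to matter later):
Code:
zpool add zroot mirror ada2 ada3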

The error I am getting is AFTER the boot loader menu, i.e. the menu loads fine, then it tries to boot and fails with:
Code:
ZFS: i/o error - all block copies unavailable
/boot/kernel/kernel text 0xfc8dc8 ZFS: i/o error - all block copies unavailable.

readin failed
elf64_loadimage: read failed
can't load file /boot/kernel/kernel: input/output error
Error while including menu.rc, in line
menu display

Then it tries, and fails, a second time.

The funny part is that if I try your cp -Rp /boot.orig /boot procedure, it "gets worse": it doesn't even get as far as the boot loader menu; the system already fails at loader level (just before the menu loads).

The machine boots perfectly from the 10.2 live USB stick, and the pool imports fine and is healthy (I even did a scrub). The pool is 97% empty!

Somehow the loader manages to partially mount/read the pool, but not well enough to read the kernel and boot.

The only weird thing I have found about the whole setup is that a zpool import from the live USB followed by a zpool export zroot works fine, but produces the following:
Code:
GEOM: ada2: the primary GPT table is corrupt or invalid.
GEOM: ada2: using the secondary instead -- recovery strongly advised.
...repeated for ada3, and then for the GEOM disk IDs

Not sure what this is about or how to fix it.

I also noticed that ada2 and ada3 (i.e. the second mirror vdev) do not get listed in /dev/gpt.

They do get listed fine under gpart show and gpart list, but only AFTER zpool import.

My gut feeling is that the problem is all about the ZFS root pool being spread over 2 vdevs: with the routine upgrade I did, the /boot contents (including the new kernel) got striped onto the second vdev. When the pool is fully imported and ONLINE this is totally fine, but the boot loader has trouble bringing the second vdev online early enough to be able to boot off it...

Thoughts, ideas...
 
If this is your actual command:

cp -Rp /boot.orig /boot

then you're trying to modify the USB stick, not the mounted ZFS pool. This command would also create a directory called /boot/boot.orig rather than sync /boot.orig to /boot. If you show the *exact* sequence of commands you ran while running the live CD image, it might become apparent why it's not working.
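To illustrate the difference, assuming the pool is imported under /mnt (the paths here are only an example):
Code:
# as typed: operates on the live environment and would nest boot.orig inside /boot
cp -Rp /boot.orig /boot
# intended: copy the contents of the saved directory into the pool's freshly created /boot
cd /mnt/boot.orig && cp -Rp * ../boot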

Your error message is also different from what I saw:

Code:
ZFS: i/o error - all block copies unavailable
ZFS: can't find dataset u
Default: pool-name: <0x0>:
boot:

This may not be quite the same issue.

Regards,

JB
 
I'm 2 for 3 now. Our test cluster and production cluster head nodes both came up fine after updates. I ran a full backup of the production cluster head node before applying updates just in case, so I could recover quickly via a fresh install if necessary.

I will note that on the one server that had a problem, there had been a large amount of data written to it recently, although the pool was still nowhere near full. That made this post catch my attention:

http://octobertoad.tumblr.com/post/24050828517/freebsd-zfs-all-block-copies-unavailable

It appears that this problem is uncommon. Every PC-BSD 10.x system boots from ZFS, so I'd think there would be a lot of hits on the search engines otherwise. At any rate, I hope the exact causes are discovered soon.

I might opt for a UFS2 boot partition + ZFS pool for user partitions next time just to be safe.

JB
 
If this is your actual command:
cp -Rp /boot.orig /boot

then you're trying to modify the USB stick, not the mounted ZFS pool. This command would also create a directory called /boot/boot.orig rather than sync /boot.orig to /boot. If you show the *exact* sequence of commands you ran while running the live CD image, it might become apparent why it's not working.

No, sorry, that was bad reporting on my part. I followed your sequence of commands exactly. I noticed that you used

Code:
cp -Rp * /mnt/boot

which actually wouldn't have copied hidden (i.e. .*) files, but there are none, so that's OK. Not that it makes any sense to me why your cp command would make any difference anyway, either "making it work" in your case or "making it worse" in mine. Why should an identical copy of the same files (using different inodes and disk block layout only) make any difference to what happens, unless the FS is seriously flaky?

I've spent a few more hours on it. Still no joy. I have started pulling critical data off using tar/ssh from the live USB. I also chrooted from the live USB into the zroot system and ran freebsd-update rollback. That changed the error message subtly, but not by much; it is now identical to what I get if I boot kernel.old.

The problem seems to be that when boot fails and I get dropped to the loader prompt, I can:
  • lszfs all over the pool and see my files.
  • lsdev reports a sensible set of pool devices. Although it reports ada2 & ada3 as having 2 partitions each, which they do not: they are raw, "all ZFS" block devices. Is this part of the problem?
  • ls /boot/kernel shows the relevant files.
BUT
  • lszfs zroot/ROOT/default gives an EMPTY result
  • load /boot/kernel/kernel gives the same error message as "autoboot" (see above)
It's like the metadata is there, but when I try to read the contents of those files, ZFS is missing the block devices to serve them from. Is it those ada2 & ada3 mirror drives from the 2nd vdev?

All this is a bit of a worry because, while this particular server is only an inconvenience, I was planning to put the same setup onto a whole set of brand-new servers for a production datacenter installation. They would be simpler, though, in that they really would be just the bsdinstall default, i.e. 2 disks in 1 mirror vdev for root (not 2 vdevs like this machine).
 
I solved my problem.

The issue was this:

At install time I chose ZFS-on-root using a simple 2-way mirror. bsdinstall partitioned 2 of the 4 disks in this box in the usual way. I later added the other two disks, ada2 & ada3, as an additional mirror vdev in the same root pool. I added them as raw disks without partitioning; ZFS is fine with that. So my root pool layout was like this:

Code:
zpool status
  pool: zroot
state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            ada2    ONLINE       0     0     0
            ada3    ONLINE       0     0     0

So, just after the install, the physical data distribution (ZFS internals) was such that 99% of the base system, including the kernel and /boot etc., was on the first vdev: the one with proper partitions, created by bsdinstall.

As data got written, ZFS striped it across the 2 vdevs, and favoured the second vdev (ada2/ada3), to even things up. This worked perfectly. I updated the system and rebooted maybe twice over 4 months.

Then, on Thursday, Jan 14 2016, I did another freebsd-update fetch install && reboot: BOOM! (see above for the detailed errors)

The reason was that, PURELY BY CHANCE, ZFS decided to put some of my new /boot and kernel files on the second vdev (ada2/ada3). As reported above, the BTX loader is incapable of properly recognising non-partitioned ZFS disks: it misreported them as having 2 partitions when they actually had none (lsdev from the loader prompt was the proof). As a result, the metadata (all or mostly on the 1st vdev) was fine; in other words, I could browse the file/directory tree with ls from the loader prompt. However, attempting to read the contents of those files resulted in "ZFS: i/o error - all block copies unavailable", because the loader could not properly use the second vdev's block devices.

So, with hindsight, each time I updated the system I was playing Russian roulette: a 50/50 or worse chance of ZFS deciding to put the new /boot data onto the 2nd vdev.

The solution was as follows:
  • Install a spare drive.
  • Create a zpool called zbackup on it, and zfs send the entire zroot pool to it. Luckily, it was only 10% full. I roughly followed this post.
  • zpool destroy zroot. Use gpart to create single partitions on ada2 & ada3 of type=freebsd-zfs, taking up almost the whole disk.
  • Recreate the zroot pool, including both mirrors, using all 4 drives from the start. I could have also added the 2 additional drives later, as long as they had proper partitioning.
  • zfs receive the backup.
For reference, these are the commands that I used. If you find them useful, you will need to adapt them:
Code:
# credit to: https://www.dan.me.uk/blog/2012/08/05/full-system-backups-for-freebsd-systems-using-zfs/
# and I obtained a copy of the bsdinstall_log to get the latest, best practice

# to make the backup

gpart destroy -F ada4
dd if=/dev/zero of=/dev/ada4 bs=1m count=128
zpool create zbackup /dev/ada4
zfs set mountpoint=/var/full-backup zbackup

# DON'T PIPE THROUGH GZIP, IT WILL LIKELY BE A CPU BOTTLENECK!
zfs snapshot -r zroot@backup
zfs send -Rv zroot@backup > /var/full-backup/full-system-backup.zfs
zfs destroy -r zroot@backup

# ensure our new partitions use 4K sectors
sysctl vfs.zfs.min_auto_ashift=12

# kill the old pool
zpool destroy zroot

# clear the labels of vdev1 drives (not sure if really required)
zpool labelclear -f /dev/ada0p3
zpool labelclear -f /dev/ada1p3

# properly partition the vdev2 drives (one big partition)
gpart destroy -F ada2
gpart destroy -F ada3

zpool labelclear -f /dev/ada2
zpool labelclear -f /dev/ada3

gpart create -s gpt ada2
gpart create -s gpt ada3

gpart add -a 1m -l zfs2 -t freebsd-zfs ada2
gpart add -a 1m -l zfs3 -t freebsd-zfs ada3

# check structure
gpart show
=>        34  7814037101  ada0  GPT  (3.6T)
          34           6        - free -  (3.0K)
          40        1024     1  freebsd-boot  (512K)
        1064         984        - free -  (492K)
        2048     4194304     2  freebsd-swap  (2.0G)
     4196352  7809839104     3  freebsd-zfs  (3.6T)
  7814035456        1679        - free -  (840K)

=>        34  7814037101  ada1  GPT  (3.6T)
          34           6        - free -  (3.0K)
          40        1024     1  freebsd-boot  (512K)
        1064         984        - free -  (492K)
        2048     4194304     2  freebsd-swap  (2.0G)
     4196352  7809839104     3  freebsd-zfs  (3.6T)
  7814035456        1679        - free -  (840K)

=>        34  7814037101  ada2  GPT  (3.6T)
          34        2014        - free -  (1.0M)
        2048  7814033408     1  freebsd-zfs  (3.6T)
  7814035456        1679        - free -  (840K)

=>        34  7814037101  ada3  GPT  (3.6T)
          34        2014        - free -  (1.0M)
        2048  7814033408     1  freebsd-zfs  (3.6T)
  7814035456        1679        - free -  (840K)


zpool create -o altroot=/mnt -O compress=lz4 -O atime=off -m none -f zroot mirror ada0p3 ada1p3 mirror ada2p1 ada3p1

# to restore the backup
zfs receive -vdF zroot < /var/full-backup/full-system-backup.zfs

zpool set bootfs=zroot/ROOT/default zroot

# ensure we have a valid zpool.cache
zpool export zroot
zpool import -o altroot=/mnt zroot
mkdir -p /mnt/boot/zfs
zpool set cachefile=/mnt/boot/zfs/zpool.cache zroot

reboot
The new pool structure is now:

Code:
zpool status
  pool: zroot
state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        zroot       ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            ada0p3  ONLINE       0     0     0
            ada1p3  ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            ada2p1  ONLINE       0     0     0
            ada3p1  ONLINE       0     0     0


(Note that ada2 and ada3 both have a partition now.)
Using lsdev at the loader prompt now reports the true pool and partition structure.

And importantly, the system BOOTS FINE, because the loader can now understand that ada2p1 and ada3p1 are part of the pool.

Hope this helps someone. Thanks to kpa, for steering me in the right direction.

I do wonder whether this cp -Rp boot.orig boot technique, mentioned above, is a way of "playing Russian roulette in reverse": pull the trigger until you live, i.e. keep cp'ing until your critical boot data lands on a vdev/disk that is healthy and available at boot time. Perhaps not a terribly reassuring fix?

Anyway, the moral of my story is: if you want to grow your root-on-ZFS pool, adding a further vdev is a good and perfectly legitimate way to do it. However, you must make sure that the disks which make up the new vdev are properly partitioned, with the ZFS partition being of type freebsd-zfs. If not, the early loader code will not recognise them and the whole system will eventually fail to boot.

Of course, there is ZFS documentation from Sun/Oracle out there already that tells you this:
Using Slices in a ZFS Storage Pool
Disks can be labeled with a traditional Solaris VTOC (SMI) label when you create a storage pool with a disk slice.

For a bootable ZFS root pool, the disks in the pool must contain slices and the disks must be labeled with an SMI label. The simplest configuration would be to put the entire disk capacity in slice 0 and use that slice for the root pool.

It would be nice if the FreeBSD Handbook were more detailed on the options and pitfalls.
 
Note: ignore just about all Solaris disk labelling/partitioning information you find online if it mentions ZFS, as it does not apply to FreeBSD.
 
phoenix: Agreed. However, basic disk partitioning with the appropriate type was necessary to make the system boot at all in this narrow case.

If we had authoritative info on what FreeBSD requires, that would be preferable. Unfortunately, I found no FreeBSD docs on how to add a 2nd (or further) vdev to a root-on-ZFS pool.

I found the Solaris info after the event. It seemed to highlight the fact that we are somewhat behind on clearly documenting what is and isn't possible, and if so, how.
 
... bsdinstall partitioned 2 of the 4 disks in this box in the usual way...
So do you think this would not have happened had you handled partitioning and installation manually instead of using bsdinstall?

It has just been my feeling for quite some time now that bsdinstall is not the best tool to use for complicated setups... what do you think?
 
free-and-bsd

So do you think this would not have happened had you handled partitioning and installation manually instead of using bsdinstall?

Not quite. I found no docs which would have told me how to partition a second vdev for ZFS-on-root. bsdinstall did a perfect job for the 1st vdev, and it provided a template for me to eventually copy for ada2/ada3.

The key to my particular problem was to understand that the BTX/loader stage needs more information than a fully booted system to identify which disks are part of a zpool. Therefore, for root-on-ZFS, you need to "tag" all the disks in the pool with partition type freebsd-zfs so that the boot code can understand the pool structure. Once fully booted, ZFS doesn't need these partition "tags".

bsdinstall does very well at covering a small subset of an almost infinite number of options for using ZFS on root. IMHO, what we need are docs which describe the why/how so that individual users can expand on that. My suggestion would be to update and formalise the root-on-ZFS wiki content and put it in the Handbook.

In addition, rather than spend lots of effort teaching bsdinstall to cover more options, just offer an additional one or two "breakpoints" in bsdinstall, e.g. allow it to "set up the basic pool", then break to a shell to customise it. You may also wish to "set up the basic pool" and then restore from a backup (as in my case), rather than "install the system".

As it stands, I had to "reverse engineer" what bsdinstall was doing in order to customise it. Yet what it did was a good example of current best practice and worth following.

It's tricky to get right: ZFS-on-root partitioning and pool configuration is not trivial, and there are many, many valid options.
 
Every disk in the root pool should be configured identically. Using gpart show -l on an existing disk will give you all the information you need for configuring new disks. Use the information from that command to manually partition new disks the same way. :)

Once the disks are partitioned and labelled, then you also need to install the bootcode onto each disk. That way, the BIOS can load the boot loader off any disk, in the case that the "first" disk fails. gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 (or similar) is what's needed.

Once the disks have the bootcode installed, then they can be added to the pool as a new vdev.
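As a rough sketch of those steps, assuming the new pair shows up as ada4/ada5 and the existing disks follow the bsdinstall boot/swap/zfs layout shown earlier in this thread (sizes and labels are illustrative):
Code:
# partition the new disk to match the existing ones
gpart create -s gpt ada4
gpart add -a 4k -s 512k -t freebsd-boot ada4
gpart add -a 1m -s 2g -t freebsd-swap -l swap4 ada4
gpart add -a 1m -t freebsd-zfs -l zfs4 ada4
# (or clone an existing layout with: gpart backup ada0 | gpart restore -F ada4)

# install the boot loader so the BIOS can boot from this disk too
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada4

# repeat for ada5, then attach the pair as a new mirror vdev
zpool add zroot mirror ada4p3 ada5p3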

I stopped using the automatic disk partitioning tools in OS installers several years ago. They handle the "normal" or "easy" situations, but are not very good for even slightly difficult/different situations. :) Do things manually a few times, and you'll really start to understand how it all works, and you'll be all the better for doing so. :D

My home setup, running FreeBSD 9.3, with root-on-ZFS:
Code:
[phoenix@rogue /home/phoenix]$ zpool status
  pool: pool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(7) for details.
  scan: scrub repaired 0 in 5h57m with 0 errors on Thu Feb 19 20:54:00 2015
config:

    NAME           STATE     READ WRITE CKSUM
    pool           ONLINE       0     0     0
     mirror-0     ONLINE       0     0     0
       gpt/disk1  ONLINE       0     0     0
       gpt/disk3  ONLINE       0     0     0
     mirror-1     ONLINE       0     0     0
       gpt/disk4  ONLINE       0     0     0
       gpt/disk2  ONLINE       0     0     0
     mirror-2     ONLINE       0     0     0
       gpt/disk5  ONLINE       0     0     0
       gpt/disk6  ONLINE       0     0     0

errors: No known data errors

Each disk is configured the same:
Code:
[phoenix@rogue /home/phoenix]$ gpart show -l ada0
=>        34  1953525101  ada0  GPT  (931G)
          34         256     1  boot2  (128k)
         290        1758        - free -  (879k)
        2048  1953523087     2  disk2  (931G)
 
Yes, I agree with "doing it manually yourself". I have been doing that with FreeBSD on UFS for 15 years on all our datacenter-based production machines. However, being new to ZFS, and with the docs being sparse or outdated, I started with the bsdinstall "GUI", because it was the only official solution to this task that I could find.

I don't (yet?) agree that "all the disks have to be the same", at least not for a multi-vdev, mirror-based zroot pool.

My first 2 disks are the same, ie they have a duplicate copy of the bootcode and they have a gmirror of the swap. If they both fail, the machine is hosed anyway, because that would mean the entire first mirror is gone and the zroot pool is faulted. No boot, no data: BRICK

So why would I want copies of the bootcode on ada2 and ada3? How are they going to help me? Perhaps I have missed something?

With all due respect, it is just this kind of carte blanche statement:
It just has to be this way. All the disks have to have 2 partitions including a copy of the bootcode.
which is not written down anywhere official and provides no clear reasoning, that confuses this subject. And, to my mind, it makes no sense. Also, how do I run gpart show when the disks are broken or new?

Hence my request for some official guidelines/docs in the man pages or the Handbook on the subject of "partitioning for root-on-ZFS". By making it into the official docs, the information will have to pass peer-review scrutiny and can therefore be relied upon more than the scattered and often dated info from many parties, which is all we have on this subject right now.
 
Yes, I agree with "doing it manually yourself". I have been doing that with FreeBSD on UFS for 15 years on all our datacenter-based production machines. However, being new to ZFS, and with the docs being sparse or outdated, I started with the bsdinstall "GUI", because it was the only official solution to this task that I could find.

I don't (yet?) agree that "all the disks have to be the same", at least not for a multi-vdev, mirror-based zroot pool.

My first 2 disks are the same, ie they have a duplicate copy of the bootcode and they have a gmirror of the swap. If they both fail, the machine is hosed anyway, because that would mean the entire first mirror is gone and the zroot pool is faulted. No boot, no data: BRICK

So why would I want copies of the bootcode on ada2 and ada3? How are they going to help me? Perhaps I have missed something?

Because you can't guarantee that "disk1", as seen by the BIOS, is the disk attached to SATA port 1. Maybe the disk attached to port 1 dies, and the BIOS selects the disk attached to port 3 as the new "disk1" and tries to load the boot loader from it, which will then fail. Or someone hits F11 during POST and selects a different disk to boot from, which doesn't have the boot loader. Or you are doing some work in the case and some cables get shuffled around, and now "disk1" is actually "disk4". Or you are playing around in the BIOS and decide to change the order of the disks as seen by the BIOS, or change which one to use as "hard disk" to boot from. Or ...

All disks in the root pool should be the same, configured such that any disk can be removed, and everything will continue normally; and such that the BIOS can choose any of the disks to boot from, without causing issues. And really, the root pool shouldn't be more than 4-6 disks or so (depending on usage). If you are using more disks than that, you should consider splitting the system into 2 pools: one for the OS, one for data storage.

And, to my mind, it makes no sense. Also, how do I run gpart show when the disks are broken or new?

You can run gpart show at any time, against any disk. If the disk is broken/not working, it will return an error message. If the disk is unpartitioned, then it will say so. Otherwise, it will show any and all partitioning data for it.

Hence my request for some official guidelines/docs in the man pages or the Handbook on the subject of "partitioning for root-on-ZFS". By making it into the official docs, the information will have to pass peer-review scrutiny and can therefore be relied upon more than the scattered and often dated info from many parties, which is all we have on this subject right now.

That's a good idea. Nobody is arguing against that. :)
 
Because you can't guarantee that "disk1", as seen by the BIOS, is the disk attached to SATA port 1. Maybe the disk attached to port 1 dies, and the BIOS selects the disk attached to port 3 as the new "disk1" and tries to load the boot loader from it, which will then fail. Or someone hits F11 during POST and selects a different disk to boot from, which doesn't have the boot loader. Or you are doing some work in the case and some cables get shuffled around, and now "disk1" is actually "disk4". Or ...

OK, some of those might be relevant in some cases. With datacenter-class hardware, less so: disk1 is always disk1, because it's in a hot-plug bay labelled disk1. It never comes out of there unless it fails, because the machine never gets touched (typically for 5+ years in my case). I also set up the BIOS to never boot off ada2/ada3, obviously: I know it can't, so I told the BIOS.

You can run gpart show at any time, against any disk. If the disk is broken/not working, it will return an error message. If the disk is unpartitioned, then it will say so. Otherwise, it will show any and all partitioning data for it.

Sure you can. But in a locked room with a brand-new server and brand-new disks, gpart show will not help you. That's when you can use any of these:
  • bsdinstall with all its inherent limitations
  • some random, dated wiki/forum
  • read the source code of bsdinstall and try to cook your own from that (that's what I did)
  • read some authoritative, peer reviewed docs, checked by people who understand the internals
That's a good idea. Nobody is arguing against that. :)

Great, that's all I am suggesting. Let's write that stuff down somewhere useful, so people don't have to search random forums/wikis, many of which are dated and full of "half information". Partitioning is something you want to do precisely once per machine and then leave alone until you add more drives or throw the machine away. So you want to get it right the first time.

Not having decent docs contributed to the original poster of this thread concluding:

I have really come to the conclusion that root on ZFS is not fit for production if it fails after something as routine as applying security updates.
I'll install the system on a mirror of external devices using UFS and keep the zpool separate.

And the next person concluding:

I might opt for a UFS2 boot partition + ZFS pool for user partitions next time just to be safe.

While using a UFS boot drive might be "safer", it's pretty messy because it's not redundant unless you use RAID. And if you are using RAID, then one of the biggest reasons for ZFS goes out the window. As someone commented in a related thread:

You have a very nice [RAID] controller and flashing it to IT mode - hmmm... bad idea - think its going to be better to use UFS - it will surelly outperform ZFS, just because of installed cache.

How does one get something submitted to the Handbook? I've never tried to do that, though I am not the person to write this (yet).
 
Ran into this again on a 12.1 system.

I don't know if it's related, but this error arose after /var/log/messages was flooded with entries like the following:

kernel: swap_pager_getswapspace(X): failed

The system has a Dell PERC H700 with 4 disks configured as individual RAID-0 volumes in the hardware RAID (it does not support JBOD), set up as a RAIDZ by bsdinstall using mfid0 - mfid3 as the physical disks.

Updating my own prior instructions from above, as a couple of new steps were necessary this time:

1) Boot from 12.1-RELEASE USB stick, CD, or DVD
2) Select Live CD from Install/Shell/Live CD menu
3) Log in as root
4) Run the following at the shell prompt:

# zpool import -R /tmp/mnt -fF zroot # Probably equivalent to what you did for this purpose
# cd /tmp/mnt
# zfs mount zroot/ROOT/default # Not mounted by default
# mv boot boot.orig
# mkdir boot # Slightly longer route to avoid issues with cp and trailing /
# cd boot.orig
# cp -Rp * ../boot # Note -p to make sure permissions are correct in the new /boot
# zpool export zroot
# reboot

After this, it successfully loaded the kernel, but then halted at mountroot. Trying to mount manually with

zfs:zroot/ROOT/default

produced an "unknown filesystem" error.

"zpool scrub" did not find any errors and all datasets appeared to be intact when importing manually, though some did not mount automatically. "zfs get canmount" showed a few datasets such as zroot/var/log set to "off". I restored them to "on" using a pristine system as an example:

Code:
NAME                PROPERTY  VALUE     SOURCE
zroot               canmount  on        default
zroot/ROOT          canmount  on        default
zroot/ROOT/default  canmount  noauto    local
zroot/tmp           canmount  on        default
zroot/usr           canmount  off       local
zroot/usr/ports     canmount  on        default
zroot/usr/src       canmount  on        default
zroot/var           canmount  off       local
zroot/var/audit     canmount  on        default
zroot/var/crash     canmount  on        default
zroot/var/log       canmount  on        default
zroot/var/mail      canmount  on        default
zroot/var/tmp       canmount  on        default
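Restoring each one was just a matter of the following (zroot/var/log shown as the example; repeat for the others that had flipped to "off"):
Code:
zfs set canmount=on zroot/var/log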

It still would not boot completely, though.

Ultimately I got it to boot successfully by copying /boot from the USB stick and restoring the following from boot.orig:

/mnt/boot/loader.conf
/mnt/boot/zfs/zpool.cache

Apparently something in my old /boot directory was corrupted, so moving it to boot.orig and copying it back was not sufficient this time.
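Roughly, what that amounted to, with the pool mounted under /mnt and the live environment's /boot coming from the 12.1 install media (exact paths will differ):
Code:
mv /mnt/boot /mnt/boot.orig
cp -Rp /boot /mnt/boot                                    # pristine /boot from the install media
cp -p /mnt/boot.orig/loader.conf /mnt/boot/loader.conf    # keep the old loader settings
mkdir -p /mnt/boot/zfs
cp -p /mnt/boot.orig/zfs/zpool.cache /mnt/boot/zfs/zpool.cache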

This of course left me booting from the original 12.1-RELEASE kernel, so I ran freebsd-update to bring everything up to date.

From here on, I'm going to make an occasional copy of /boot on working systems (/boot.1, /boot.2, etc), at least one after each freebsd-update, so I can more quickly recover from this if it happens again.
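Something along these lines (the naming scheme is just my own convention):
Code:
cp -Rp /boot /boot.1    # after the next freebsd-update, /boot.2, and so on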
 
This happened again on a 12.1 workstation with a 4-disk raidz. Renaming and recreating /boot as described above worked on the first try this time.

I'm wondering if anyone has encountered this issue on a simple 2-disk mirror. I never have and the reports I've seen all seem to indicate 4 disks or more.
 