Adventures in upgrading boot environments

I've been struggling for a while now to find the proper upgrade procedure when using boot environments and freebsd-update with the -b and -d flags (i.e. upgrading a new BE without rebooting into it first). I thought I had found something that worked, but was proven wrong when our primary storage server failed to boot earlier this week.

There are two scenarios I've found that are pretty easy to get yourself into when upgrading boot environments this way:
  1. A mismatch between the kernel and kernel modules. This is what happened to us this week. I believe the kernel had been upgraded to 12.1-RELEASE, but the procedure I was using led to 12.0-RELEASE kernel modules being re-installed. The system couldn't load these old modules, so it couldn't mount the root filesystem (ZFS). Booting into a previous boot environment and redoing the upgrade a bit differently got us back up and running, but left the system in a very odd state (more on this below).
  2. The more worrisome scenario: after performing a major version upgrade, e.g. from 12.0-RELEASE to 12.1-RELEASE, it's entirely possible to have freebsd-version -kru report that the kernel and userland are 12.1-RELEASE when the userland is still 12.0-RELEASE! We ran like this for several weeks and probably would have kept running this way had I not noticed a difference in man pages between two systems that should have been identical.

I think these issues might stem from freebsd-update storing some of its state somewhere other than the directories you provide with the -b and -d flags. Consider a server that has been upgraded from 12.0-RELEASE to 12.0-RELEASE-p4 and is now being upgraded to 12.0-RELEASE-p10. This is what I was doing:

Code:
bectl create 12.0-RELEASE-p10

bectl mount 12.0-RELEASE-p10 /mnt

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update fetch

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update install

bectl activate 12.0-RELEASE-p10

bectl unmount 12.0-RELEASE-p10

shutdown -r now

# !!! THIS CAN BREAK THE INSTALL !!!
freebsd-update install

pkg update

pkg upgrade

That last freebsd-update install (the one run after the reboot) will either report that there are no updates to install, if this is the first minor upgrade, OR it will install the PREVIOUS upgrade's files, leading to something like this:

Code:
# freebsd-version -kru
12.0-RELEASE-p10
12.0-RELEASE-p10
12.0-RELEASE-p4 <- userland is still reporting the old version

Running freebsd-update fetch and freebsd-update install again appears to fix this scenario.
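
A quick way to catch this state after a reboot is to compare the three values freebsd-version reports. The following is a minimal sketch; check_versions is a hypothetical helper name, not something shipped with FreeBSD. Keep in mind that kernel and userland patch levels can legitimately differ when a patch only touches userland, so a mismatch is a reason to look closer rather than proof of a broken install.

```shell
#!/bin/sh
# Sketch: flag a mismatch between installed kernel, running kernel and userland.
# check_versions is a hypothetical helper name, not part of FreeBSD.

check_versions() {
    # $1 = installed kernel, $2 = running kernel, $3 = userland
    if [ "$1" = "$2" ] && [ "$2" = "$3" ]; then
        echo "OK: $1"
        return 0
    fi
    echo "MISMATCH: kernel=$1 running=$2 userland=$3"
    return 1
}

# On a FreeBSD host the three values come from freebsd-version(1).
if command -v freebsd-version >/dev/null 2>&1; then
    check_versions "$(freebsd-version -k)" "$(freebsd-version -r)" "$(freebsd-version -u)"
fi
```

Fed the values from the output above, it prints the MISMATCH line and returns non-zero, which makes it easy to hook into a post-reboot check.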

The second scenario is even scarier. Consider a server that has been upgraded a few times, is now on 12.0-RELEASE-p10, and needs to be upgraded to 12.1-RELEASE. This is what I was doing:

Code:
bectl create 12.1-RELEASE

bectl mount 12.1-RELEASE /mnt

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update -r 12.1-RELEASE upgrade

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update install

bectl umount 12.1-RELEASE

bectl activate 12.1-RELEASE

shutdown -r now

# !!! THIS BREAKS THE SYSTEM !!!
freebsd-update install

# I believe I also ended up having to do this:
freebsd-update fetch
freebsd-update install

pkg update

pkg upgrade

That second freebsd-update install appears to install the previous 12.0-RELEASE patch, which can leave the system in an unbootable state along with an old userland install. I couldn't figure out how to fix this, so I started over. I slightly modified my procedure; the result booted just fine but led to unexpected results with the userland components:

Code:
bectl create 12.1-RELEASE

bectl mount 12.1-RELEASE /mnt

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update -r 12.1-RELEASE upgrade

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update install

bectl umount 12.1-RELEASE

bectl activate 12.1-RELEASE

shutdown -r now

# !!! THIS LEADS TO UNEXPECTED RESULTS !!!
freebsd-update fetch
freebsd-update install

pkg update

pkg upgrade

This appeared to work as expected. freebsd-version -kru showed matching versions, and the system booted and operated just fine. However, this led to a system that was booting the 12.1 kernel and reporting a 12.1 userland install, but was still running the 12.0 userland! We ran like this for weeks, and I would never have known if I hadn't happened to notice that the man page for bectl was different on two systems that should have been identical. That led me to compare md5 sums on several binaries and realize that the whole upgrade was botched and had to be redone.
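
Since freebsd-version can't be trusted in this state, comparing the files themselves is the only reliable check. Here is a rough sketch of that idea using two root trees on the same machine instead of two separate systems (an assumption: a known-good BE is mounted at /mnt). diff_roots is a hypothetical helper, and the file list in the example is only illustrative. freebsd-update IDS, which compares installed files against the release index, may be another option worth trying.

```shell
#!/bin/sh
# Sketch: report files that differ between two root trees, e.g. the running
# system (/) and a boot environment mounted at /mnt.
# diff_roots is a hypothetical helper; the file list is only an example.

diff_roots() {
    root_a=$1
    root_b=$2
    shift 2
    status=0
    for path in "$@"; do
        # cmp -s is silent and returns non-zero if the files differ
        # or if either one is missing.
        if ! cmp -s "$root_a/$path" "$root_b/$path"; then
            echo "DIFFERS: $path"
            status=1
        fi
    done
    return $status
}

# Example (assumes a known-good BE is mounted at /mnt):
# diff_roots / /mnt bin/sh sbin/bectl usr/bin/man
```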

Short sidetrack here: it's pretty easy to get your BEs into such a state that they can't be destroyed without manual intervention. Here's a VM I installed 12.0 on, upgraded to 12.0-RELEASE-p13, then ran the botched upgrade procedures above. I booted into the 12.0-RELEASE-p13 BE in order to start the upgrade again. You can see what happened when I tried to destroy one of the botched BEs:

Code:
root@freebsd-12-test:~ # bectl list
BE                  Active Mountpoint Space Created
12.0-RELEASE-p13    NR     /          286M  2020-09-16 15:29
12.1-RELEASE        -      -          2.71G 2020-09-15 23:32
12.1-RELEASE-FIXED  -      -          1.62G 2020-09-16 16:01
12.1-RELEASE-BROKEN -      -          2.04G 2020-09-16 15:39
root@freebsd-12-test:~ # bectl destroy 12.1-RELEASE-BROKEN
cannot destroy 'zroot/ROOT/12.1-RELEASE-BROKEN@2020-09-16-15:39:46': dataset already exists
unknown error

Even the latest version of bectl could not fix this. You can see why this is happening here:

Code:
root@freebsd-12-test:~ # zfs list -rt all -o name,type,clones zroot/ROOT/12.1-RELEASE-BROKEN
NAME                                                TYPE        CLONES
zroot/ROOT/12.1-RELEASE-BROKEN                      filesystem  -
zroot/ROOT/12.1-RELEASE-BROKEN@2020-09-16-15:39:46  snapshot    zroot/ROOT/12.0-RELEASE-p13
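
The same information can be pulled out in scripted form to see which snapshots are still pinned by a clone before attempting a destroy. This is a minimal sketch; snapshots_with_clones is a hypothetical helper that filters the tab-separated output of zfs list -H.

```shell
#!/bin/sh
# Sketch: print snapshots that still have dependent clones.
# snapshots_with_clones is a hypothetical helper. It reads tab-separated
# "name<TAB>clones" lines, which is what `zfs list -H -o name,clones` emits.

snapshots_with_clones() {
    # Keep only snapshot rows (name contains @) whose clones column is set.
    awk -F '\t' '$1 ~ /@/ && $2 != "" && $2 != "-" { print $1 " -> " $2 }'
}

# Example on a ZFS host:
# zfs list -H -rt all -o name,clones zroot/ROOT/12.1-RELEASE-BROKEN | snapshots_with_clones
```

Any dataset printed on the right-hand side depends on that snapshot, so the snapshot (and the BE that owns it) can't be destroyed while that dependency exists.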

To be honest, I'm not sure why bectl does this or how it even works; I haven't touched clones manually at all. This is what I had to do to get this cleaned up:

Code:
root@freebsd-12-test:~ # zfs promote zroot/ROOT/12.0-RELEASE-p13
root@freebsd-12-test:~ # bectl destroy 12.1-RELEASE-BROKEN
root@freebsd-12-test:~ # bectl destroy 12.1-RELEASE-FIXED
root@freebsd-12-test:~ # bectl destroy 12.1-RELEASE
cannot destroy 'zroot/ROOT/12.1-RELEASE@2020-09-15-23:32:52': dataset already exists
unknown error
root@freebsd-12-test:~ # zfs list -rt all -o name,type,clones zroot/ROOT/12.1-RELEASE
NAME                                         TYPE        CLONES
zroot/ROOT/12.1-RELEASE                      filesystem  -
zroot/ROOT/12.1-RELEASE@2020-09-15-23:17:04  snapshot
zroot/ROOT/12.1-RELEASE@2020-09-15-23:32:52  snapshot    zroot/ROOT/12.0-RELEASE-p13
root@freebsd-12-test:~ # zfs promote zroot/ROOT/12.0-RELEASE-p13
root@freebsd-12-test:~ # bectl destroy 12.1-RELEASE
root@freebsd-12-test:~ # bectl list
BE               Active Mountpoint Space Created
12.0-RELEASE-p13 NR     /          1.70G 2020-09-16 15:29

I'm not sure if this is the correct thing to do and I don't fully understand it all but it does seem to work. This got me back to a clean start so I could perform the upgrade again. Here is the procedure that appears to work well for major upgrades:

Code:
bectl create 12.1-RELEASE

bectl mount 12.1-RELEASE /mnt

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update -r 12.1-RELEASE upgrade

# This needs to run three times BEFORE reboot. Here is what I think each run does:
# Installs the new kernel.
freebsd-update -b /mnt -d /mnt/var/db/freebsd-update install
# This installs the new userland.
freebsd-update -b /mnt -d /mnt/var/db/freebsd-update install
# This cleans up any old shared object files. If you have custom compiled software, you might need to do something differently here.
freebsd-update -b /mnt -d /mnt/var/db/freebsd-update install

# This is needed for some package upgrades. I added these steps recently.
mount -t devfs devfs /mnt/dev/

pkg -c /mnt update

pkg -c /mnt upgrade

umount /mnt/dev

bectl umount 12.1-RELEASE

bectl activate 12.1-RELEASE

# If I find out during the upgrade that the patch level is p10, I'll rename the BE:
bectl rename 12.1-RELEASE 12.1-RELEASE-p10

shutdown -r now

This procedure may not work in every scenario, but it seems to work well for us. Everything appears to be properly upgraded and operating as expected. freebsd-version -kru reports all the correct versions and the userland binaries are actually what they're supposed to be now.
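
It may also be worth asking the staged BE what it contains before unmounting it. The sketch below assumes the BE is mounted at /mnt and relies on two things from freebsd-version(1): the script honors the ROOT environment variable when looking up the kernel, and -u reports the userland version baked into the script itself, so the BE's own copy has to be the one that runs. staged_versions is a hypothetical helper name. Given the second scenario above, treat this as a first sanity check only, not as proof that the userland binaries were actually replaced.

```shell
#!/bin/sh
# Sketch: ask a mounted boot environment which versions it contains.
# staged_versions is a hypothetical helper; /mnt is assumed to be the BE mount.

staged_versions() {
    root=$1
    # freebsd-version is a shell script. Run the BE's own copy so -u reflects
    # the staged userland; ROOT points the -k lookup at the staged kernel
    # instead of the live one.
    ROOT="$root" sh "$root/bin/freebsd-version" -ku
}

# Example: staged_versions /mnt
```

The userland line also shows the patch level, which is handy for picking the final BE name before the rename step.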

For minor upgrades, say you're running 12.1-RELEASE:

Code:
# I can't figure out how to find out what patch level I'm going to get until AFTER I run the upgrade, so I just use a placeholder name:
bectl create 12.1-RELEASE-pX

bectl mount 12.1-RELEASE-pX /mnt

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update fetch

freebsd-update -b /mnt -d /mnt/var/db/freebsd-update install

mount -t devfs devfs /mnt/dev/

pkg -c /mnt update

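pkg -c /mnt upgrade

# Unmount devfs again before unmounting the BE, same as in the major upgrade procedure:
umount /mnt/dev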

bectl umount 12.1-RELEASE-pX

bectl activate 12.1-RELEASE-pX

# We learn what the patch level is during the upgrade so rename the placeholder:
bectl rename 12.1-RELEASE-pX 12.1-RELEASE-p10

shutdown -r now
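
The devfs mount is easy to leave behind if one of the pkg steps fails partway through. Here is a sketch of a wrapper that always unmounts it again. pkg_in_be, run, and the DRY_RUN variable are hypothetical names made up for this example; the commands themselves are the same ones used in the procedure above.

```shell
#!/bin/sh
# Sketch: run the pkg steps inside a mounted BE and always unmount devfs,
# even if pkg fails. pkg_in_be, run and DRY_RUN are hypothetical names.

run() {
    # With DRY_RUN=1 the command is printed instead of executed.
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

pkg_in_be() {
    mnt=$1
    run mount -t devfs devfs "$mnt/dev" || return 1
    status=0
    # Skip the upgrade if the update fails, but remember the exit status.
    run pkg -c "$mnt" update && run pkg -c "$mnt" upgrade || status=$?
    # Always release devfs so a leftover mount does not get in the way
    # of unmounting the BE afterwards.
    run umount "$mnt/dev" || status=$?
    return $status
}

# Example: DRY_RUN=1 pkg_in_be /mnt
```

Running it with DRY_RUN=1 first prints the four commands in order without touching the system, which is a cheap way to double-check the mount point before doing it for real.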

The cool thing about doing upgrades this way is you can stage the upgrade while everything is up and running then simply reboot during a maintenance window, minimizing downtime. This saves us an additional 10-15 minute reboot as well (these servers we have are SO slow to boot).

ZFS on root + BEs is such a cool setup that it has me seriously considering replacing more of our Linux systems with FreeBSD. Super cool stuff once you get it all working properly!

This has taken me many hours of research, trial and error, and a major production outage this week to get right (at least I hope it's right anyway). Hopefully this helps someone else avoid at least some of the hassle!
 