Upgrade within 13-stable: no modules found by bootloader, so ZFS fails

dl8dtl

Developer
Yesterday, I tried upgrading my system (self-compiled) from 13.2-stable to the most recent 13-stable. All the usual `make` processes (buildkernel, installkernel, buildworld, installworld) went fine. However, when booting, the bootloader is unable to load any modules, thus `zfs.ko` is not there, so root on ZFS fails to be found.

That was also the first time I noticed the "Bootloader needs to be upgraded" message. Not sure whether it's related. To the best of my knowledge (the machine is 10+ years old), the system doesn't use EFI at all. I tried to upgrade the primary bootblocks and gptzfsboot on all relevant SSDs, but the problem persists.

Right now, I manually loaded `/boot/kernel.old/kernel`, and set the module path accordingly, so it works, using the previous version.

Do I have to install an EFI partition now? There is a bit of free space on the boot SSDs:

# gpart show ada3
=>       34  976773101  ada3  GPT  (466G)
         34  943718400     1  freebsd-zfs  (450G)
  943718434       1024     2  freebsd-boot  (512K)
  943719458   33053677        - free -  (16G)

that could be used for it.
 
That should work with BIOS booting. Did you run this?
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i2 ada3

I'm not sure it's a problem with the bootcodes.
 
Now, after looking at the output of gpart show, I noticed that my (mirrored) boot disks ada2 and ada3 have swapped freebsd-zfs vs. freebsd-boot partitions. So I might have written the bootcode to partition one (rather than 2) even on ada3.
Apart from that, yes, that's what I've been doing.
I'm also not so sure it is really related to the bootcode, but what else might have changed within 13-stable in such a dramatic way that the modules are not found anymore?
 
Just wondering, there's a bunch of old stuff (like the FORTH files) in my /boot. Could that be an issue?
 
Now, after looking at the output of gpart show, I noticed that my (mirrored) boot disks ada2 and ada3 have swapped freebsd-zfs vs. freebsd-boot partitions. So I might have written the bootcode to partition one (rather than 2) even on ada3.
And the zpool is happy with that?

You should try to update the bootcodes on index 2 partitions (just gptzfsboot, since pmbr has been already written) to see if something changes at boot time.
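Because the freebsd-boot partition has a different index on each mirror disk, the index can be read from the `gpart show` output instead of being typed by hand. The following is only a minimal sketch: `boot_index` is a hypothetical helper name, the two layouts are copied from this thread, and the script merely prints the `gpart bootcode` command rather than running it.

```shell
#!/bin/sh
# Print the index of the first freebsd-boot partition found in `gpart show`
# output read from stdin. boot_index is a hypothetical helper name.
boot_index() {
    awk '$4 == "freebsd-boot" { print $3; exit }'
}

# Layouts as posted in this thread (ada3, and ada2 shown by its disk ID).
ada3_layout='=>       34  976773101  ada3  GPT  (466G)
         34  943718400     1  freebsd-zfs  (450G)
  943718434       1024     2  freebsd-boot  (512K)
  943719458   33053677        - free -  (16G)'

ada2_layout='=>       34  976773101  diskid/DISK-S2RBNX0J600264R  GPT  (466G)
         34       1024     1  freebsd-boot  (512K)
       1058  943718400     2  freebsd-zfs  (450G)
  943719458   33053677        - free -  (16G)'

# Dry run: show the command that would be used for each disk.
idx=$(printf '%s\n' "$ada3_layout" | boot_index)
echo "gpart bootcode -p /boot/gptzfsboot -i $idx ada3"

idx=$(printf '%s\n' "$ada2_layout" | boot_index)
echo "gpart bootcode -p /boot/gptzfsboot -i $idx diskid/DISK-S2RBNX0J600264R"
```

On the real machine, the sample text would be replaced by something like `gpart show ada3 | boot_index`.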
 
Code:
#!/bin/sh

# Variables
BOOT_DEVICE="ada3" # Set this to the correct boot device
BOOT_PARTITION_INDEX="2" # Index of the freebsd-boot partition (2 on ada3; it is 1 on the other mirror disk)
BACKUP_DIR="/root/boot_backup_$(date +%Y%m%d)"
LOADER_CONF="/boot/loader.conf"
OLD_KERNEL_DIR="/boot/kernel.old"
KERNEL_DIR="/boot/kernel"

# Step 1: Update bootcode on the freebsd-boot partition
echo "Updating bootcode on ${BOOT_DEVICE} partition ${BOOT_PARTITION_INDEX}..."
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i "${BOOT_PARTITION_INDEX}" "${BOOT_DEVICE}"
if [ $? -eq 0 ]; then
    echo "Bootcode updated successfully."
else
    echo "Failed to update bootcode. Exiting."
    exit 1
fi

# Step 2: Backup and clean old files in /boot
echo "Backing up /boot to ${BACKUP_DIR}..."
mkdir -p "${BACKUP_DIR}"
# -R copies directories recursively, -p preserves modes and timestamps
cp -Rp /boot/* "${BACKUP_DIR}/"
if [ $? -eq 0 ]; then
    echo "Backup completed."
else
    echo "Backup failed. Exiting."
    exit 1
fi

# Remove stray regular files named *.old under /boot. The kernel.old
# directory and the files inside it do not match and are left alone.
echo "Cleaning up old files in /boot..."
find /boot -type f -name "*.old" -exec rm -- {} +

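# Step 3: Check ZFS pool status
echo "Checking ZFS pool status..."
# A zero exit status from `zpool status` does not by itself mean the pools
# are healthy, so check the summary line printed by `zpool status -x`.
if zpool status -x | grep -q "all pools are healthy"; then
    echo "ZFS pool is healthy."
else
    echo "ZFS pool has issues. Please check manually."
    zpool status
fi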

# Step 4: Verify boot configurations and module paths
echo "Verifying boot configurations and module paths..."
if grep -q "^module_path" ${LOADER_CONF}; then
    echo "Module path is set in ${LOADER_CONF}."
else
    echo "module_path is not set in ${LOADER_CONF}. You may want to add it if necessary."
fi

# Prompt user to reboot
echo "All tasks completed. It's recommended to reboot the system to apply changes."
echo "Would you like to reboot now? (yes/no)"
read REBOOT

if [ "${REBOOT}" = "yes" ]; then
    echo "Rebooting..."
    reboot
else
    echo "Please remember to reboot later."
fi
 
noticed that my (mirrored) boot disks ada2 and ada3 have swapped freebsd-zfs vs. freebsd-boot partitions. So I might have written the bootcode to partition one (rather than 2) even on ada3.
As these are mirrored, have you tried simply removing the one you think you broke by writing the bootcode to the freebsd-zfs partition? Forcing it to boot from the "good" disk? If the "good" disk boots, then remove the partitions on the other disk, set it up again, in the same order this time, and re-add the mirror.
 
As these are mirrored, have you tried simply removing the one you think you broke by writing the bootcode to the freebsd-zfs partition? Forcing it to boot from the "good" disk? If the "good" disk boots, then remove the partitions on the other disk, set it up again, in the same order this time, and re-add the mirror.
I'm not even sure I broke it by that. I just might not have upgraded the bootcode. Offhand, I'm not sure which of the mirrored SSDs the BIOS actually prefers to boot from.
And the zpool is happy with that?
Sure, as long as both parts of the pool are the same size, it doesn't care about the physical location.
Until now, I didn't even notice they are not symmetrical. The pool has been running that way for 7 years now. ;-)

Still, I'm wondering whether this "You need to upgrade" warning is actually causing the issue that the newly built .ko files cannot be loaded, or whether it's something else. I tried to understand the Lua code that triggers the warning, and to me, it seems this is related to EFI (which I don't use).
 
Loader has a command line interface that lets you list (ls) files and directories. Display and note the currdev variable (show command), then set it to the other device. The lsdev command lists the devices loader sees.

As to why your devices are enumerated in reverse order, that may need a bugzilla PR. FreeBSD shouldn't be as flaky as Linux randomly enumerating devices.
 
Loader has a command line interface that lets you list (ls) files and directories. Display and note the currdev variable (show command), then set it to the other device. The lsdev command lists the devices loader sees.

As to why your devices are enumerated in reverse order, that may need a bugzilla PR. FreeBSD shouldn't be as flaky as Linux randomly enumerating devices.
It's not that the devices are enumerated in reverse order.
I don't know why, but when I created the mirrored ZFS pool 7 years ago, the partitions were created in reverse (illogical) order on one of the devices:

=>       34  976773101  ada3  GPT  (466G)
         34  943718400     1  freebsd-zfs  (450G)
  943718434       1024     2  freebsd-boot  (512K)
  943719458   33053677        - free -  (16G)

=>       34  976773101  diskid/DISK-S2RBNX0J600264R  GPT  (466G)
         34       1024     1  freebsd-boot  (512K)
       1058  943718400     2  freebsd-zfs  (450G)
  943719458   33053677        - free -  (16G)
(Also, I have no idea why geom calls ada2 only by that unmemorable name.)
Loader has a command line interface that lets you list (ls) files and directories. Display and note the currdev variable (show command), then set it to the other device. The lsdev command lists the devices loader sees.
Good idea.
 
Sure, as long as both parts of the pool are the same size, it doesn't care about the physical location.
Until now, I didn't even notice they are not symmetrical. The pool has been running that way for 7 years now. ;-)
You didn't understand my point. It's not about the lack of symmetry, but about the fact you may have poked a bootcode in a freebsd-zfs partition.

I just tried it in a VM with a zfs mirror. The system replies: "Operation not permitted".
I set sysctl kern.geom.debugflags=16, just to see. It replies the same. So, I guess you didn't do that in fact.

Edit: that said, I booted on a 14.1-RELEASE dvd and did the thing. Rebooted, no effect so far, all seems to work... :rolleyes:
 
I just verified: data at the beginning of all the partitions marked as freebsd-zfs looks identical: all zeroes until 0x3fd8. So for sure, I didn't damage any of these.
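A check like this can be scripted. The following is a minimal sketch, not necessarily how it was done here: `leading_zeroes` is a hypothetical helper name, and the demo runs against a temporary file rather than a real partition. The offset 0x3fd8 equals 16344 bytes.

```shell
#!/bin/sh
# Succeed if the first $2 bytes of the file or device $1 are all NUL bytes.
# leading_zeroes is a hypothetical helper name.
leading_zeroes() {
    # Strip NUL bytes from the leading region and count what is left.
    nonzero=$(head -c "$2" "$1" | tr -d '\000' | wc -c | tr -d ' ')
    [ "$nonzero" -eq 0 ]
}

# Demo on a temporary file: 16344 (0x3fd8) zero bytes followed by other data.
demo=$(mktemp)
head -c 16344 /dev/zero > "$demo"
printf 'data' >> "$demo"

if leading_zeroes "$demo" 16344; then
    echo "first 0x3fd8 bytes are all zero"
else
    echo "non-zero data found before 0x3fd8"
fi
rm -f "$demo"
```

Against a real partition this would be run as root, for example `leading_zeroes /dev/ada3p1 16344`.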
I just verified all the freebsd-boot partitions I have (beyond the ZFS mirror, there's another one on a 3rd SSD), and it seems diskid/DISK-S2RBNX0J600264Rp1 still had some old bootcode (much shorter than the current one). So I upgraded it.
Now, I need another maintenance window from the family to try rebooting. ;-)
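Whether a freebsd-boot partition already holds the current bootcode can also be checked byte for byte instead of by length. This is a minimal sketch under the assumption that `gpart bootcode -p` writes the image to the very start of the partition; `bootcode_matches` is a hypothetical helper name, and the demo uses temporary files in place of /boot/gptzfsboot and the partition device.

```shell
#!/bin/sh
# Succeed if the partition (or file) $2 begins with exactly the bytes of the
# bootcode image $1. bootcode_matches is a hypothetical helper name.
bootcode_matches() {
    size=$(wc -c < "$1" | tr -d ' ')
    # Compare only the first $size bytes of the partition with the image.
    head -c "$size" "$2" | cmp -s - "$1"
}

# Demo with temporary files standing in for the image and the partition.
img=$(mktemp)
part=$(mktemp)
printf 'NEW-BOOTCODE' > "$img"
{ cat "$img"; head -c 64 /dev/zero; } > "$part"   # image plus zero padding

if bootcode_matches "$img" "$part"; then
    echo "bootcode is up to date"
else
    echo "bootcode differs"
fi
rm -f "$img" "$part"
```

On the real system the call would look like `bootcode_matches /boot/gptzfsboot /dev/diskid/DISK-S2RBNX0J600264Rp1`, run as root.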
 
Loader has a command line interface that lets you list (ls) files and directories. Display and note the currdev variable (show command), then set it to the other device. The lsdev command lists the devices loader sees.
That's not possible with ZFS. With ZFS, currdev doesn't select the default device to load the kernel from by disk/partition name (like disk2p2 or disk3p1), but by zfs:dataset.

Since there is only one specific zfs:dataset where the kernel is located, currdev can't be used to choose one or the other disk of the ZFS mirror.

loader_simp(8)
Rich (BB code):
    currdev   Selects the default device to loader the kernel from.  The
               syntax is:
                     loader_device:
               or
                     zfs:dataset:
               Examples:
                     disk0p2:
                     zfs:zroot/ROOT/default:
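To illustrate the two currdev forms from the manual page excerpt, here is a minimal sketch that classifies a value by its syntax alone. `currdev_kind` is a hypothetical helper name; it does not talk to the loader, it only pattern-matches the string.

```shell
#!/bin/sh
# Classify a loader currdev value by its syntax.
# currdev_kind is a hypothetical helper name.
currdev_kind() {
    case "$1" in
        zfs:*:)
            d=${1#zfs:}                  # drop the leading "zfs:"
            echo "zfs dataset ${d%:}"    # drop the trailing ":"
            ;;
        *:)
            echo "loader device ${1%:}"
            ;;
        *)
            echo "unrecognized"
            ;;
    esac
}

currdev_kind "zfs:zroot/ROOT/default:"
currdev_kind "disk0p2:"
```

With a zfs:dataset value, the name identifies the pool and dataset, not one particular disk of the mirror.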
 
(Also, I have no idea why geom calls ada2 only by that unmemorable name.)
I can't tell why the disk ID is shown instead of the device name (did a quick grep(1) through the source code, but it didn't show much), however device names can be made visible by setting kern.geom.label.disk_ident.enable="0" in /boot/loader.conf.

To the issue: can the system be booted when the upgraded kernel and zfs.ko are loaded from the loader command line? Also check whether currdev points to the correct dataset.

At the boot menu, escape to the loader prompt:
Code:
show currdev

load boot/kernel/kernel
load boot/kernel/zfs.ko

boot
 
To the issue, can the system be booted when the upgraded kernel and zfs.ko are loaded from the loader command line?
Nope. That I tried before, of course.
I also tried swapping the boot disks in BIOS, no change.
I still get the loader warning message, and I suspect it is not the actual problem.
Still, the only way to boot is using /boot/kernel.old/kernel (and the respective zfs.ko from there).
Btw., currdev is set to zfs:zroot/ROOT/default:.
I'm starting to wonder what I might have broken in the newly built kernel itself …
Attached is the list of possible root devices at the mountroot> prompt – sorry, only as a photo taken with the mobile phone.
2024-08-30-21-06-14-198.jpg
 
I still get the loader warning message, and I suspect it is not the actual problem.
Probably not.

I'm starting to wonder what I might have broken in the newly built kernel itself …
You could try the pre-built 13.4-STABLE kernel temporarily, to check whether it's the kernel and modules from the local build that fail.

I would also rename kernel.old to make sure it's safe from being overwritten by another installkernel, if you have plans in this direction.
 
Escape to the loader prompt and set zfs_load="YES", or boot from a live CD and edit your loader.conf as follows:

/boot/loader.conf
kern.geom.label.disk_ident.enable="0"
kern.geom.label.gptid.enable="0"
cryptodev_load="YES"
zfs_load="YES"

/etc/rc.conf
zfs_enable="YES"
 
Just to make sure, I also restored the previous kernel configuration from backup, and compared it to the current one.
Code:
 options        INET6                   # IPv6 communications protocols
+options                IPSEC_SUPPORT           # Allow kldload of ipsec and tcpmd5
+options                NETLINK                 # netlink(4) support
+options                ROUTE_MPATH             # Multipath routing support
+options                FIB_ALGO                # Modular fib lookups
 options        TCP_OFFLOAD             # TCP offload
 options                TCP_BLACKBOX            # Enhanced TCP event logging
 options                TCP_HHOOK               # hhook(9) framework for TCP
@@ -31,6 +35,7 @@
 options        CD9660                  # ISO 9660 Filesystem
 options        PROCFS                  # Process filesystem (requires PSEUDOFS)
 options        PSEUDOFS                # Pseudo-filesystem framework
+options                TMPFS                   # Efficient memory filesystem
 options        GEOM_RAID               # Soft RAID functionality.
 options        GEOM_LABEL              # Provides labelization
 options                EFIRT                   # EFI Runtime Services support
@@ -74,6 +79,8 @@
 options                GZIO                    # gzip-compressed kernel and user dumps
 options                ZSTDIO                  # zstd-compressed kernel and user dumps
 options                NETDUMP                 # netdump(4) client support
+options                DEBUGNET                # debugnet networking
+options                NETGDB                  # netgdb(4) client support
 
 # Make an SMP-capable kernel by default
 options        SMP                     # Symmetric MultiProcessor Kernel
Thus, nothing has been removed, only new options added.
set zfs_load="YES"
That's all been the case for many years now.
cryptodev_load is not there, but I can't imagine why it would be needed here.
 
Finally had a time slot to reboot into the pre-built GENERIC kernel. It finds the ZFS root without any problems. (But I had to figure out how to get the X server to work with the evdev driver contained in GENERIC. ;-)

So somehow, the custom kernel broke it, even though it has only few changes compared to the previous version.
 
I would also rename kernel.old to make sure it's safe from being overwritten by another installkernel, if you have plans in this direction.
I'll be careful, and rather use make reinstallkernel. ;-)
Hopefully no muscle memory kicks in. That happens to me sometimes (not particularly when installing a kernel): I need a few attempts to type the command I actually want, as if my fingers had a mind of their own. I'm not talking about typos, but about similar commands.


Just to make sure, I also restored the previous kernel configuration from backup, and compared it to the current one.
Thus, nothing has been removed, only new options added.
So somehow, the custom kernel broke it, even though it has only few changes compared to the previous version.
Did you compare the previous kernel configuration file with the custom file or perhaps by mistake with GENERIC?

Because all those options shown as added are in GENERIC of stable/13.
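One way to settle which files were actually compared is to list the option names per configuration file and diff the lists. The sketch below is only an illustration: `kernconf_options` and `options_added` are hypothetical helper names, and the demo uses two small temporary files rather than the real configuration files.

```shell
#!/bin/sh
# Print the option names set in a kernel configuration file, sorted.
# kernconf_options and options_added are hypothetical helper names.
kernconf_options() {
    awk '$1 == "options" { print $2 }' "$1" | sort -u
}

# Print options present in the new config ($2) but absent from the old ($1).
options_added() {
    old_list=$(mktemp)
    new_list=$(mktemp)
    kernconf_options "$1" > "$old_list"
    kernconf_options "$2" > "$new_list"
    comm -13 "$old_list" "$new_list"   # lines unique to the second file
    rm -f "$old_list" "$new_list"
}

# Demo with two tiny stand-in configuration files.
prev=$(mktemp)
cur=$(mktemp)
printf 'options \tINET6\t# IPv6\noptions \tTCP_OFFLOAD\n' > "$prev"
printf 'options \tINET6\noptions \tNETLINK\t# netlink(4)\noptions \tTCP_OFFLOAD\n' > "$cur"

options_added "$prev" "$cur"
rm -f "$prev" "$cur"
```

Running `options_added` once with the previous custom configuration and once with GENERIC as the first argument would show which comparison produces the list of added options quoted above; the file locations depend on the local setup.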
 