ZFS Mirrored zroot doesn't boot from second disk

Hi all,

when adding a second disk to my zroot boot pool, I basically followed this guide: https://dan.langille.org/2019/10/15/creating-a-mirror-from-your-zroot/
My source device is ada1 and the new one is ada0, so I had to swap the device names in the commands and it seemed to work fine.
The pool was resilvered after adding the disk, gmirror for swap works fine, etc.

To verify the setup and to test if the system boots on drive failure, I removed the old disk from the boot order in BIOS and tried to boot from the second one.
Unfortunately I get this:
Code:
BIOS drive C: is disk0
ZFS: i/o error - all block copies unavailable
ZFS: can't read MOS of pool zroot
ZFS: can't find pool by guid

Can't find /boot/zfsloader

Can't find /boot/loader

Can't find /boot/kernel/kernel

FreeBSD/x86 boot
Default: /boot/kernel/kernel
boot:

Booting from the old drive works well.

During the process I installed the bootcode to the new drive with
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

Can somebody push me in the right direction to fix this?
The OS is FreeBSD 12.4.

Many thanks in advance!
 
Please show the output of gpart show (should show information from both disks).
 
It says:

Code:
root@mortimer# gpart show
=>       40  976773088  ada0  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   33554432     2  freebsd-swap  (16G)
   33556480  939524096     3  freebsd-zfs  (448G)
  973080576    3692552        - free -  (1.8G)

=>       40  976773088  ada1  GPT  (466G)
         40       1024     1  freebsd-boot  (512K)
       1064        984        - free -  (492K)
       2048   33554432     2  freebsd-swap  (16G)
   33556480  939524096     3  freebsd-zfs  (448G)
  973080576    3692552        - free -  (1.8G)
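
For anyone repeating this check, the listing above can also be scanned mechanically. This is only a sketch: `boot_partitions` is a made-up helper name, and it assumes the `gpart show` layout exactly as printed above (disk name in the fourth column of the `=>` line, partition index in the third column of a partition line).

```shell
#!/bin/sh
# Hypothetical helper: read `gpart show` output on stdin and print every
# freebsd-boot partition as <disk>p<index>, e.g. ada0p1.
boot_partitions() {
    awk '
        $1 == "=>"           { disk = $4 }           # header line: remember the disk name
        $4 == "freebsd-boot" { print disk "p" $3 }   # partition line: index is column 3
    '
}

# Typical use on the live system (not run here):
#   gpart show | boot_partitions
```

A disk that is missing from the result has no freebsd-boot partition at all, which would rule out the bootcode as the place to look.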
 
Is it possible ada0 and ada1 have swapped places? That's not uncommon to happen when adding a disk. Try creating the boot on the 'other' drive too.
 
Good hint, I hadn't checked that! But apparently not: the old drive (ada1) is a Samsung 860 EVO, the new one (ada0) an 870 EVO:

Code:
root@mortimer# smartctl  -i /dev/ada0
smartctl 7.3 2022-02-28 r5338 [FreeBSD 12.4-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 870 EVO 500GB
Serial Number:    S6PYNL0TB39854A
LU WWN Device Id: 5 002538 f52b2ebfe
Firmware Version: SVT02B6Q
User Capacity:    500.107.862.016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  8 11:48:09 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

root@mortimer# smartctl  -i /dev/ada1
smartctl 7.3 2022-02-28 r5338 [FreeBSD 12.4-RELEASE-p1 amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 860 EVO 500GB
Serial Number:    S3Z2NB1K393794X
LU WWN Device Id: 5 002538 e402d402b
Firmware Version: RVT01B6Q
User Capacity:    500.107.862.016 bytes [500 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
TRIM Command:     Available, deterministic, zeroed
Device is:        In smartctl database 7.3/5319
ATA Version is:   ACS-4 T13/BSR INCITS 529 revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon May  8 11:48:14 2023 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
If this box was installed before 12.4, ada1 may have different bootcode (it is not updated automatically).
You can try to dd the partition ada1p1 to ada0p1 and see what happens (this copies the old bootcode to the new drive).
It probably won't change anything, but it is quick to try.
 
I'd just run gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0 and gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1. Just to be on the safe side. It will write the bootcode from the currently installed version (which is 12.4). That should be fine.
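
The two commands can be wrapped in a small loop. This is a sketch only: `refresh_bootcode` and the `DRYRUN` switch are names invented for this example, and the disk names are the ones from this thread. While `DRYRUN` is non-empty (the default here) the commands are printed rather than executed, so the list can be reviewed before anything is written.

```shell
#!/bin/sh
# DRYRUN is a hypothetical switch for this example: non-empty means
# "only print the gpart commands"; set it to empty to really run them.
DRYRUN=${DRYRUN-1}

refresh_bootcode() {
    for disk in "$@"; do
        # -b: protective MBR, -p: partition bootcode, -i 1: the freebsd-boot partition
        ${DRYRUN:+echo} gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 "$disk"
    done
}

refresh_bootcode ada0 ada1
```

Running the script with `DRYRUN=` in the environment (as root) would execute the commands for real.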
 
Indeed, the box has been around since FreeBSD 12.1 or so and has been updated since.
If for some reason the old bootcode on ada1 is able to boot 12.4 and the "new" one on ada0 not, that would lead to an unbootable machine, wouldn't it?
 
I would think 12.4 bootcode should be able to boot anything 12.x, should be backwards compatible to 12.1
What's the output of zpool status? I'm looking more for the names used to create the vdev. I like to use labels instead of the raw partition names like ada0p3.
I removed the old disk from the boot order in BIOS and tried to boot from the second one.
Perhaps this makes the BIOS do something funky with the disk numbering and references to "ada###" somewhere are causing the failure.

Is there a way to stop in the loader and look for/scan for pools to boot from?
How about a boot device reference in the loader?
 
You can make it a two-part change: first upgrade the bootcode on ada0 as SirDice mentioned; as it is, it does not boot off that one, so you have nothing to lose (yes, that may look like a repetition of your earlier upgrade). Check whether ada0 as a stand-alone disk gets you into a booted system. If so, you can upgrade the bootcode for ada1 as mentioned; if not, there's something else going on.
 
Indeed, the box has been around since FreeBSD 12.1 or so and has been updated since.
If for some reason the old bootcode on ada1 is able to boot 12.4 and the "new" one on ada0 not, that would lead to an unbootable machine, wouldn't it?
In theory, yes. But I suppose it's a BIOS compatibility problem rather than broken 12.4 bootcode.
You should have backups of the boot partitions in /var/backups.
I suspect that using either all old or all new bootcode won't solve the problem (unless your /boot/gptzfsboot.bin is somehow borked),
but these experiments are quick to try and roll back (as long as the box is not remote), so they are worth trying in my opinion.
 
Thanks for your replies, guys!
I would think 12.4 bootcode should be able to boot anything 12.x, should be backwards compatible to 12.1
What's the output of zpool status? I'm looking more for the names used to create the vdev. I like to use labels instead of the raw partition names like ada0p3.

Same here. This is the partitioning with GPT labels:
Code:
root@mortimer# gpart show -l
=>       40  976773088  ada0  GPT  (466G)
         40       1024     1  boot0  (512K)
       1064        984        - free -  (492K)
       2048   33554432     2  swap0  (16G)
   33556480  939524096     3  disk0  (448G)
  973080576    3692552        - free -  (1.8G)

=>       40  976773088  ada1  GPT  (466G)
         40       1024     1  boot1  (512K)
       1064        984        - free -  (492K)
       2048   33554432     2  swap1  (16G)
   33556480  939524096     3  disk1  (448G)
  973080576    3692552        - free -  (1.8G)

root@mortimer# zpool status zroot
  pool: zroot
 state: ONLINE
  scan: resilvered 105G in 0 days 00:05:27 with 0 errors on Sun May  7 10:21:30 2023
config:

        NAME           STATE     READ WRITE CKSUM
        zroot          ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            gpt/disk1  ONLINE       0     0     0
            gpt/disk0  ONLINE       0     0     0

Perhaps this makes the BIOS do something funky with the disk numbering and references to "ada###" somewhere are causing the failure.

Is there a way to stop in the loader and look for/scan for pools to boot from?
How about a boot device reference in the loader?
I'm not sure. Where are the error messages "ZFS: i/o error - all block copies unavailable" and so on coming from? Are they maybe from the bootcode, which cannot access the ZFS pool for some reason?

You can make it a two-part change: first upgrade the bootcode on ada0 as SirDice mentioned; as it is, it does not boot off that one, so you have nothing to lose (yes, that may look like a repetition of your earlier upgrade). Check whether ada0 as a stand-alone disk gets you into a booted system. If so, you can upgrade the bootcode for ada1 as mentioned; if not, there's something else going on.
But that's what I did. As mentioned in my original post, I wrote the bootcode to ada0 with
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0

Since ada0 doesn't boot with the updated code, I'm afraid of rendering my system unbootable by overwriting the bootcode on ada1 as well.

In theory, yes. But I suppose it's a BIOS compatibility problem rather than broken 12.4 bootcode.
You should have backups of the boot partitions in /var/backups.
I suspect that using either all old or all new bootcode won't solve the problem (unless your /boot/gptzfsboot.bin is somehow borked),
but these experiments are quick to try and roll back (as long as the box is not remote), so they are worth trying in my opinion.
OK, I can try. But how can I roll back the bootcode from the backup in /var/backups if it fails?
 
gpart bootcode -p /boot/gptzfsboot -i 1 ada0 is the same as doing dd if=/boot/gptzfsboot of=/dev/ada0p1. The contents of /boot/gptzfsboot is simply written as-is to the partition. There's nothing 'magical' happening here.

You can make your own backup quite easily: dd if=/dev/ada0p1 of=./mybackupbootcode
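
To make the rollback question concrete, here is a minimal sketch built on the same dd idea. The function names and the backup file names are invented for this example. `bootcode_matches` only compares the first bytes of the partition against the bootcode file, on the assumption (from the description above) that the file is written as-is to the start of the partition.

```shell
#!/bin/sh
# Hypothetical helpers; names and paths are illustrative only.

# Save a boot partition to an image file.
backup_bootcode() {   # usage: backup_bootcode /dev/ada0p1 ./ada0p1.boot
    dd if="$1" of="$2" bs=512 2>/dev/null
}

# Write a previously saved image back to the partition.
restore_bootcode() {  # usage: restore_bootcode ./ada0p1.boot /dev/ada0p1
    dd if="$1" of="$2" bs=512 2>/dev/null
}

# Succeed if the partition starts with exactly the bytes of the bootcode file.
bootcode_matches() {  # usage: bootcode_matches /dev/ada0p1 /boot/gptzfsboot
    size=$(( $(wc -c < "$2") ))
    sectors=$(( (size + 511) / 512 ))   # read whole sectors, then trim to size
    dd if="$1" bs=512 count="$sectors" 2>/dev/null | head -c "$size" | cmp -s - "$2"
}
```

For example, `backup_bootcode /dev/ada1p1 /root/ada1p1.boot` before touching ada1, and `restore_bootcode /root/ada1p1.boot /dev/ada1p1` from a memstick shell if the new code does not boot.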
 
So, I took my time to make some backups, got me a 12.4 memstick and saved the old bootcode from ada0p1 and ada1p1 on the stick, to be able to recover quickly and then I refreshed the bootcode on both drives as suggested:

Code:
root@mortimer# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0
partcode written to ada0p1
bootcode written to ada0
root@mortimer# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1
partcode written to ada1p1
bootcode written to ada1
Unfortunately the result is still the same. The box is booting from ada1 (Samsung 860 EVO) but not booting from ada0 (Samsung 870 EVO) with the same error message as in the original post.

I guess, something must be wrong with the zfs on ada0 as I suppose the error messages actually come from the bootcode. Do they?

I'm about to remove ada0 from the zpool, wipe it and start all over. Does anybody have any other suggestions?
 
I guess, something must be wrong with the zfs on ada0 as I suppose the error messages actually come from the bootcode. Do they?
They do, from gptzfsboot precisely. I have to admit that I do not understand why it doesn't boot given the same code runs well on the other disk.
 
[...] but not booting from ada0 (Samsung 870 EVO) with the same error message as in the original post.
It seems that the boot process gets stuck at gptzfsboot as Emrion mentioned. The boot process is also described in
The FreeBSD Boot Process, specifically, in your case for BIOS/ GPT/ZFS at +-> gptzfsboot | STAGE 1 + STAGE 2.

I am not familiar with the inner workings of scrubbing but I would try a scrub of the pool as a complete mirror.

If I am reading your messages correctly then the system has never fully booted from ada0 (you previously had only one disk: ada1). There might be something wrong with ada0 with respect to the boot process; you could try physically attaching the "original" Samsung 860 EVO to the ada0 connection as a sole disk and see what happens at boot. Also, after that you could try attaching the new Samsung 870 EVO to ada1 and see what happens at boot.

Looking at gptzfsboot(8), there is a possibility to drop into the gptzfsboot boot prompt (I do not have experience with that):
Code:
USAGE
     Normally gptzfsboot will boot in fully automatic mode.  However, like
     boot(8), it is possible to interrupt the automatic boot process and
     interact with gptzfsboot through a prompt.  gptzfsboot accepts all the
     options that boot(8) supports.

     The filesystem specification and the path to loader(8) are different from
     boot(8).  The format is

     [zfs:pool/filesystem:][/path/to/loader]

     Both the filesystem and the path can be specified.  If only a path is
     specified, then the default filesystem is used.  If only a pool and
     filesystem are specified, then /boot/loader is used as a path.

     Additionally, the status command can be used to query information about
     discovered pools.  The output format is similar to that of zpool status
     (see zpool(8)).

     The configured or automatically determined ZFS boot filesystem is stored
     in the loader(8) loaddev variable, and also set as the initial value of
     the currdev variable.
That pauses the boot process at that stage, where you can manually tell it how to proceed. You could let it continue as it normally would, resuming the boot process from the same disk (= ada1 with the "original" Samsung 860 EVO). It should also be possible to specify the alternate filesystem at ada0 of the mirror and see what happens. I have no idea what the precise problem is, but the above suggestions could at least yield more detailed information.
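
To illustrate the format from the man page excerpt, here is a tiny sketch that composes the string one would type at the `boot:` prompt. `boot_spec` is a made-up helper, and `zroot/ROOT/default` is an assumption (the usual installer layout); the real dataset name should be checked with `zfs list` on the running system. Typing `status` at the same prompt should list the pools gptzfsboot discovered, as the excerpt describes.

```shell
#!/bin/sh
# Hypothetical helper: build a gptzfsboot boot specification following
# the format [zfs:pool/filesystem:][/path/to/loader] from gptzfsboot(8).
boot_spec() {   # usage: boot_spec [pool/filesystem] [/path/to/loader]
    fs=$1
    path=$2
    # The zfs:...: prefix is emitted only when a filesystem was given.
    printf '%s%s\n' "${fs:+zfs:$fs:}" "$path"
}

# zroot/ROOT/default is an assumed dataset name, not taken from this thread.
boot_spec zroot/ROOT/default /boot/loader   # prints zfs:zroot/ROOT/default:/boot/loader
boot_spec "" /boot/loader                   # prints /boot/loader
```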

If you cannot get more suggestions or a solution from this forum, I'd suggest you state your problem in an appropriate FreeBSD mailing list (see C.2. Mailing Lists). IMO, you have a clearly defined problem with a clear history and without any complexities such as encryption; that should help in diagnosing the problem further and hopefully a solution.
 
Many thanks for your reply.

The long story is: When I set up the system like 3 years ago, I equipped it with one 860 as boot and system drive (the system has two more RaidZ-Pools consisting of 5 harddrives each). I installed it as a non-redundant zpool for convenience (no more partition resizing, etc.).
A couple of months later, I bought a second 860 and went for a zroot mirror (ada0 and ada1) for redundancy. At that time the system booted from either of them (checked that!). Last December one of them (ada0) died, and since it was still in the warranty period, I removed it from the zpool and sent it in for RMA. A few weeks later I received an 870 of the same size as a replacement. It then collected dust on my desk for some weeks until I got around to plugging it into the box and making the zroot a mirror again. And that's where we are now.

I tried scrubbing the zroot, but it didn't help.

I think I'm going to start all over and try again. Maybe I borked something with the GPT or the ZFS on the way. As a last resort I can still reinstall FreeBSD and let the installer do the work. At least I'm now prepared to recover the system within an acceptable time, did some housekeeping and updated backups ...
And I learned that the installer is now able to install zroot mirrors right away. I don't know if this was the case back then.

Thanks for your support guys, I will keep you updated if and when I was successful.
 
What I would try is:
1. try a gptzfsboot (bootcode) from 13.2 or even 14
2. if EFI is available, you can shrink the swap and create EFI partitions on both drives
3. if EFI is not available, again shrink the swap and create UFS partitions, install the UFS bootcode over the current ZFS bootcode, put /boot/<loader_stuff> on the new UFS partitions and tell that loader to use the pool as root (rootdev)
4. if boot redundancy is not *that* important, do nothing :)
 
4. if boot redundancy is not *that* important do nothing :)
Redundancy already paid off when the first drive died. It's not completely impossible that the other one might die in the near future, so it's more than a nice-to-have to me.
 
I think he created a copy of the GPT from disk 0 and restored it on the other disk, so he ended up with a cloned GUID partition table instead of unique partition GUIDs. Instead of cloning the GPT, you need to create it on the secondary disk and then mirror only the ZFS partition data. That way your boot loader will identify the correct GUID of the partition.

edit:
After some testing, that's not the case. The GUIDs are not the same, so you can ignore my post.
 