Solved UEFI won't always boot drives larger than 2TB without the right BIOS setting

I have a two-drive ZFS mirror of 16TB drives, everything in zroot. x86_64. Automatic ZFS install. After a reboot, without notable changes to the filesystem, I'm getting these kinds of boot errors.

Code:
zio_read error: 5
zio_read error: 5
zio_read error: 5
ZFS: i/o error - all block copies unavailable
ZFS: failed to read pool zroot directory object

Can't find /boot/zfsloader
Can't find /boot/loader
Can't find /boot/kernel/kernel

I seem to get them whether I try BIOS booting or UEFI booting.

I can mount the filesystem and all looks well. I did a scrub and there are no errors.

I've read suggestions that zroot should be 1TB or smaller, in case the /boot files end up located more than 1TB into the drive. I'm pretty sure UEFI is supposed to work fine in these cases as well.

I'm at a bit of a loss. I'm debating just reinstalling to get the machine going again, but it's very perplexing and I'm wondering what went wrong.

Current synopsis as of 2024-04-19:

I cannot reproduce this with a 1TB or 2TB drive. The 16TB drives, by themselves, or in a mirror, will do this readily. This seems to be some kind of size allocation issue with the larger drives, perhaps on certain UEFI firmware versions.

16TB drives with automatic UFS install will not boot, period.

Solved!

This was an issue with the UEFI configuration on my particular hardware. Nothing to do with FreeBSD at all. ZFS just took longer to trigger the issue, for whatever reason.

Under PCIe/PCI/PnP Configuration in the BIOS, "Launch Storage OpRom Policy" must be set to "EFI Compatible" or something along those lines. It's required to reliably boot drives larger than 2TB.
 
Though you say there were no notable filesystem changes, were there any OS or hardware changes? Was this a fresh 14 install, or did you upgrade to it? Was it booting previously and then stopped? Threads like https://forums.freebsd.org/threads/...d-error-5-messages-on-boot-please-help.83577/ would suggest things like reseating cables. I would look at the SMART status of the drives too. Though it's unlikely, if the drives seem to take a long time to spin up, you could try booting into the BIOS/UEFI setup and waiting a bit, then either selecting a boot device from there or pressing ctrl+alt+delete to reboot and give the drives more time to spin up to speed.
 
Thanks for replying!

No OS or hardware changes. Fresh install, not used much.

SMART checked out fine. I could mount the drives. I did a zpool scrub and there were no errors.
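For reference, those checks amounted to roughly this (smartctl comes from the sysutils/smartmontools package; the device and pool names are only examples):
Code:
smartctl -a /dev/ada0
zpool scrub zroot
zpool status zroot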

I tried both BIOS and UEFI boot.
 
I ended up installing to two other drives, so I could go back to the first two if need be.

With the system running and not under load, I unplugged power from each drive one by one. Mostly to see how high the power usage of the drives was.

Starting back up, I get the exact same error as earlier. Not sure if this is an issue with mirrors and power loss or what.

Traditionally, a journaling filesystem should survive this, or at least still be able to make progress towards booting.
 
I took another set of drives and installed 14.0-RELEASE in the same manner, with ZFS on a 16TB mirror.

I pressed the reset button to force a hard reboot after it booted. No issues coming back up.

I did freebsd-update and did the same. I also disconnected power to the drives on another boot to simulate what happened before.

No issues.

I then wrote several tens of gigabytes from /dev/urandom to a file. I pressed the reset button and no issues.
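Roughly something like this, where the file path and size are purely illustrative:
Code:
# ~50GB of random data
dd if=/dev/urandom of=/root/fill.bin bs=1m count=51200 status=progress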

Finally, I wrote over a terabyte of "random" data. I pressed the reset button.

At the boot loader, I saw repeated
Code:
zio_read error: 5
errors overlaid on top of the numbered (1-5) boot loader menu. It booted fine anyway.

I logged in and ran `reboot`.

This time, the boot loader couldn't find the relevant bits and it failed to boot entirely.

I booted the installer and am able to mount zroot without issues. I'm running a scrub now, and I suspect it will come back clean.
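From the installer's live shell, that was roughly the following (pool name as created by the automatic install; flags from memory, so adjust as needed):
Code:
zpool import -f -R /mnt zroot
zpool scrub zroot
zpool status zroot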

It seems that a valid ZFS filesystem can get into a state where the boot loader cannot read it. I don't know how this happens, but it's certainly concerning. I don't know if it's a bug particular to my hardware. I'm hoping someone else is able to reproduce this.

I have not tested this with single drive configurations, smaller filesystems, or on another piece of hardware.
 
Even if it should not be the problem in the case of UEFI booting (I had that problem with old BIOS booting), did you try creating a small system partition (say 20GB for the whole FreeBSD installation and log files) and another partition after it to store the data? (Or you could try just a minimal /boot partition, if you know what you're doing.)

I still think it would be related to the partition being too big for booting.
 
I still think it would be related to the partition being too big for booting.
The partition size does not change as data is written to it, so if it boots even once then it's not an issue with some ancient BIOS not being able to boot from a large partition.
 
The partition size does not change as data is written to it, so if it boots even once then it's not an issue with some ancient BIOS not being able to boot from a large partition.
It is not that easy. With BIOS boot it is not about the partition size, but about the position of the boot loader data inside the partition (filesystem) in question. You can create a perfectly working ZFS setup on 4TB drives (in my case), everything boots, you copy data from an old pool with zfs send, and then it becomes unbootable: about 2TB of data had been written, then the boot loader data was updated (on a COW filesystem that means written to another position), and the system won't boot anymore because the boot loader data was stored too far from the beginning of the partition.
However, the system was bootable when the drives were moved to a different computer with a different BIOS (moving from an HP Proliant to some Supermicro X9-SCA). A combination of "bad BIOS" and filesystem data location can cause problems.

That's why I chose a small system partition where I can be sure that the boot loader data will never be too far away.
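A rough sketch of that kind of layout (device name, sizes, and labels are just examples):
Code:
gpart create -s gpt ada0
gpart add -t efi -s 260m -l efiboot ada0
gpart add -t freebsd-zfs -s 20g -l sys ada0    # small pool near the start of the disk for the OS
gpart add -t freebsd-zfs -l data ada0          # rest of the disk for a separate data pool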
 
and then it becomes unbootable: about 2TB of data had been written, then the boot loader data was updated (on a COW filesystem that means written to another position), and the system won't boot anymore
While this is true, OP reported simply writing tons of data to a file then rebooting and the box not coming back up — so the bootloader never had the chance to be rewritten outside the magic boundary.

Hence, my comment stands.
 
I haven't tried creating a smaller zroot. That may well solve the issue if that's the problem.

I do feel like this should be documented if it's an issue, even with BIOS. I do wonder how it might be possible for UEFI to have the same problem. I don't know enough about it to understand.
 
Very unlikely.

I realized I would need a separate zpool, on its own partition, to keep zroot from growing past 1TB. Putting things in a different volume probably won't help if it's not also in a different zpool.

If this won't help, what will?
 
If this won't help, what will?
I have no idea what is causing the issue; however, I believe faulty hardware is to blame. The ECC RAM and the memtest run you mentioned on the mailing list (which is what brought me here) would indicate it's almost certainly not the RAM.

Can you elaborate on ALL the hardware in the machine, including motherboard, CPU, disk controller, power supply, etc? Exact model numbers are best.
 
It won't, because unless you are updating your booter it does not move on disk, and thus it cannot be the cause of what you are experiencing, going by what you have described.
What Quip says is possible (but can't be the OP's trouble). It's not related to the loaders themselves but to the files the second stage has to load. At this point the loader (say gptzfsboot, for instance) still uses the BIOS functions. Normally, IIRC, the LBAs to load are passed via a 64-bit variable, so it shouldn't have any problem accessing beyond the 2TB boundary (if the block size is 512 bytes). But what if that BIOS function is buggy? So, definitely, it's possible, but uncommon for sure.
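For reference, the 2TB boundary comes from what a 32-bit LBA can address with 512-byte sectors:
Code:
# 2^32 sectors * 512 bytes/sector
echo $((4294967296 * 512))    # 2199023255552 bytes = 2 TiB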

Unfortunately, I have no solution to offer the OP. I have read here and there that it may be a problem with the disk's hardware cache, but nothing seems really relevant to me.
 
I've been able to reproduce this again, this time without the mirror (single drive, still 16TB), and after allocating under 1TB (though somewhat near it).

Motherboard: Supermicro X9SBAA-F

SATA is just plugged into the motherboard. Supermicro power supply. CPU is Intel Atom S1260.

I don't see why /boot would be walking around, but I'm quite suspicious as the system is 100% stable other than the boot process.
 
The drive, SATA cable, SATA controller, motherboard, or RAM (or some combination of the preceding) is bad, or the power supply is putting out very dirty power (unlikely given the brand). RAM is also unlikely because you ran memtest, but at this point I would retest it and wait for two or three passes.

Reboot to a Debian live image on a flash drive and run `badblocks -svw -b 4096 /dev/sda` (or whatever the drive is). ALL DATA ON THE DRIVE WILL BE LOST.

You might need to install badblocks with apt first.

 
I've reproduced this on 6 different drives... are you telling me that all are bad? If they were bad, they shouldn't have passed zpool scrub.

More testing:

I installed with a 1TB drive, filled it up, and it rebooted fine.

Same with a 2TB drive.

I'll try a little harder to make the 1TB and 2TB fail and go from there.
 
I've tried extensively to make the 1TB drive have the same boot failure and I cannot. I believe there is clearly an issue with booting zroots larger than 2TB with UEFI, at least on this hardware.

Unfortunately, my options for testing on other UEFI systems are limited, although maybe I'll come up with something.

I did run a thorough test of the RAM with the latest version of Memtest86+, booted through UEFI. This included the Rowhammer test.
 
Another piece of data...

If I install with UFS (automatic) on one of the 16TB drives, it will not boot (UEFI). It'll load the EFI partition but can't boot any further.

If I install UFS on a 2TB drive, it boots just fine with UEFI.

This indicates that it has nothing to do with ZFS; it's simply about booting drives larger than 2TB with UEFI on this hardware.
 
Really surprised. A bug of that sort in a 64-bit UEFI firmware is just unbelievable.
I entirely agree.

In every case, the BIOS/EFI is finding the booter and passing control to it, so I believe the BIOS/EFI is not to blame.

This doesn't account for the filesystem eating itself.

My conclusion has not changed: I believe there is a hardware issue with this box. ZFS simply does not do this.
 