ZFS boot thoughts

The following is something of a thought-exercise, hoping to prompt discussion from interested or informed parties.

I've been thinking a lot about the process involved in booting a ZFS-based FreeBSD system, specifically whether it could be made possible to boot a system from a pool composed of whole disks. I'm a strong believer in using whole disks for pools whenever possible, due to the benefits it brings: elimination of partition alignment issues on devices with >512-byte sectors and greater ease of device replacement in the event of failure, to name but two. As we are aware, though, there is currently no way of booting a system from a pool built on whole disks, either in FreeBSD or in Solaris.

I've worked around this limitation on some of my systems by booting from a BSD- or GPT-partitioned USB stick containing a UFS filesystem with a copy of the /boot directory. The kernel and modules are loaded from the stick, then at the end of kernel initialisation the root filesystem is mounted from the whole-disk pool. This is an acceptable solution for me, as I can install the USB stick inside the chassis where it can't easily be removed, but it might not suit everyone. There's also a slight inelegance in copying the /boot directory to the stick and then having to keep it synchronised with the original in the root filesystem.
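
For reference, the stick setup looks roughly like this (da0 and tank/root are placeholders rather than my real device and pool names):

Code:
# GPT-partitioned stick with the UFS boot code and a small UFS filesystem
gpart create -s gpt da0
gpart add -t freebsd-boot -s 64k da0
gpart bootcode -b /boot/pmbr -p /boot/gptboot -i 1 da0
gpart add -t freebsd-ufs da0
newfs -U /dev/da0p2

# copy /boot onto the stick and point the loader at the ZFS root
mount /dev/da0p2 /mnt
cp -Rp /boot /mnt/
echo 'zfs_load="YES"' >> /mnt/boot/loader.conf
echo 'vfs.root.mountfrom="zfs:tank/root"' >> /mnt/boot/loader.conf
umount /mnt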

The first thing I considered is whether it would be possible to build a third type of ZFS boot image. The existing two, zfsboot and gptzfsboot, are clearly designed to probe for zpools based on MBR and GPT partitions respectively. This third type would probe only for whole-disk zpools.

I suspect that such a boot image could be quite a bit simpler and smaller than the others, since it wouldn't need the ability to interpret partition schemes. It could either be dd'd directly to a USB stick (probably leaving the rest of the device unusable) or, like gptzfsboot, written into a freebsd-boot partition on a stick. This would eliminate the need to copy /boot to the stick - only a bare minimum of boot code would be needed on it.


I then considered the possibility of booting from the actual pool disks themselves. From experimentation, it appears that almost the first 16KB of a disk used in a whole-disk pool is left unused by ZFS:

Code:
# mdconfig -a -t malloc -s 128M
md0
# hexdump /dev/md0
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
8000000
# zpool create mdpool md0
# hexdump /dev/md0 | head -3
0000000 0000 0000 0000 0000 0000 0000 0000 0000
*
0003fd0 0000 0000 0000 0000 7a11 b10c da7a 0210

The ZFS On-Disk Specification document partially confirms the non-use of this 16KB region. ZFS places two 256KB vdev labels at the beginning of a device (and two more at the end). Of each vdev label, the first 8KB is unused by design and the second 8KB is reserved for future "Boot Block Headers".

Even so, judging by the sizes of zfsboot and gptzfsboot, this 16KB region might not be big enough for boot code. But might simpler whole-disk boot code be small enough to fit there?

If not, the on-disk specification also describes a 3.5MB region of unused space following the initial two 256KB vdev labels. Could a small "zpmbr", akin to GPT's pmbr, be placed in sector zero, containing just enough code to jump to the 3.5MB region and run larger second-stage boot code from there?
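
To make the geometry clearer, here's a rough sketch of the start-of-disk layout as I read the spec:

Code:
offset    size    contents
0         8KB     blank space at the front of vdev label L0
8KB       8KB     reserved "Boot Block Header" in L0
16KB      240KB   remainder of L0 (name/value pairs + uberblock array)
256KB     256KB   vdev label L1
512KB     3.5MB   reserved boot block region
4MB       -       start of allocatable space
(two more 256KB labels, L2 and L3, sit at the end of the device)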

Would be interested in hearing anyone's thoughts on this.


References:

ZFS On-Disk Specification - http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/ondiskformat0822.pdf
 
I'm not sure if it helps, but I'm running full-disk ZFS and booting off a USB memory stick with just gptzfsboot on it. No /boot.
 
Do you use partitions on those disks? I thought the pool needs to be on a GPT partition of type freebsd-zfs for gptzfsboot to be able to locate the loader(8).
 
No, I use full disks.

Code:
# zpool status
  pool: rpool
 state: ONLINE
  scan: scrub repaired 0 in 23h7m with 0 errors on Sun Jan 29 02:08:41 2012
config:

	NAME        STATE     READ WRITE CKSUM
	rpool       ONLINE       0     0     0
	  raidz1-0  ONLINE       0     0     0
	    ada0    ONLINE       0     0     0
	    ada1    ONLINE       0     0     0
	    ada2    ONLINE       0     0     0
	    ada3    ONLINE       0     0     0

errors: No known data errors
 
I'm a big believer in separating "the OS" (the thing that boots) from "bulk storage". Thus, every system should have two ZFS pools: one pool with a single mirror vdev, the other pool with whatever vdevs make sense for the system.

Depending on the use of the server, the "root pool" could be as simple as a UFS+gmirror setup using USB sticks, CF disks, SSDs, or even HDs.

The problem with trying to boot from the storage pool is that you need to make sure that every single disk in the pool has the boot information on it, and that every single disk in the pool is visible to the BIOS/loader. IOW, everything may work with your single-vdev pool ... but what happens when you expand it to 2 vdevs? To 5 vdevs? To 20+ disks?

There are threads on the -stable? -fs? -current? mailing list about this exact issue (unable to boot from a pool with multiple raidz2 vdevs because the BIOS/loader only sees the first 12 disks, and not the second set of 12).

Keep the boot process as simple as possible. :)
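
As a rough illustration (device names and vdev sizes made up), the kind of split I mean:

Code:
# small, simple root pool: a single mirror vdev the BIOS/loader can always see
zpool create rpool mirror ada0 ada1

# separate bulk storage pool: grow it however you like without touching the boot path
zpool create tank raidz2 da0 da1 da2 da3 da4 da5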
 
flow said:
I'm not sure if it helps, but I'm running full-disk ZFS and booting off a USB memory stick with just gptzfsboot on it. No /boot.

Let me understand you correctly...

gptzfsboot is able to discover your whole-disk pool and run loader(8) from a dataset within that pool?

If so, that's my first point answered. No need for a new zfs bootcode. I had thought that gptzfsboot only looks for pools built upon freebsd-zfs GPT partitions.


phoenix said:
I'm a big believer in separating "the OS" (the thing that boots) from "bulk storage". Thus, every system should have two ZFS pools: one pool with a single mirror vdev, the other pool with whatever vdevs make sense for the system.

Depending on the use of the server, the "root pool" could be as simple as a UFS+gmirror setup using USB sticks, CF disks, SSDs, or even HDs.

The problem with trying to boot from the storage pool is that you need to make sure that every single disk in the pool has the boot information on it, and that every single disk in the pool is visible to the BIOS/loader. IOW, everything may work with your single-vdev pool ... but what happens when you expand it to 2 vdevs? To 5 vdevs? To 20+ disks?

There are threads on the -stable? -fs? -current? mailing list about this exact issue (unable to boot from a pool with multiple raidz2 vdevs because the BIOS/loader only sees the first 12 disks, and not the second set of 12).

Keep the boot process as simple as possible. :)

I mostly agree with you, but even if you put the OS on a dedicated mirrored zpool, it'd still be nice to make it a whole-disk pool. I wouldn't consider putting a root filesystem on a complex multi-vdev pool either.
 
jem said:
gptzfsboot is able to discover your whole-disk pool and run loader(8) from a dataset within that pool?

Indeed, that is exactly what I do.

phoenix said:
The problem with trying to boot from the storage pool is that you need to make sure that every single disk in the pool has the boot information on it, and that every single disk in the pool is visible to the BIOS/loader.

I'm not sure exactly what you mean by "boot information", but you define where ZFS should boot from with (in my case):

Code:
zpool set bootfs=rpool/ROOT rpool

That is where my /boot directory lives, and it's the only place in the pool with any boot information. None of the disks in the pool has a boot sector or anything like that.

And wouldn't every single disk always have to be visible to the loader?
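
For what it's worth, the stick was set up with something along these lines (da0 standing in for the actual stick):

Code:
# the stick carries nothing but the GPT boot code: no UFS filesystem, no copy of /boot
gpart create -s gpt da0
gpart add -t freebsd-boot -s 128k da0
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

gptzfsboot then finds the pool on the raw disks and loads the loader from whichever dataset the bootfs property points at.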
 