ZFS Fixing partition alignment on a ZFS boot disk?

Code:
~ gpart show nda1
=>      34  62914493  nda1  GPT  (30G)
        34       345     1  freebsd-boot  (173K)
       379     66584     2  efi  (33M)
     66963   2097152     3  freebsd-swap  (1.0G)
   2164115  60748397     4  freebsd-zfs  (29G)
  62912512      2015        - free -  (1.0M)

I don't care about the first two partitions. I do care about swap and the zfs partition.

I was contemplating gpart backup, making edits, then gpart restore onto a new disk, then dd the contents across. But I don't think ZFS would take kindly to an unannounced lift-and-shift onto new LBAs.

Are zfs send and zfs recv the answer? Or perhaps create a mirror vdev from the old+new partitions, boot from the new device, then break the mirror at which point ZFS complains about the missing device for the rest of eternity?

This is a rather messy problem...
 

If you move the partition towards the beginning of the disk and are mega-brave, then you can dd it in place without a spare disk.
You still have to do it booted from external media.
The theory goes like:
dd if=/dev/nda1p4 bs=1m | dd of=/dev/nda1 seek=1056 bs=1m
The second dd only overwrites what the first has already read.

A power failure mid-copy would foobar everything hard.

Then you fix the partition table so the zfs partition starts at 1056m.
The swap partition gets shrunk by a bit less than 1 MB to make room.

Don't try this at home.
 
Never dd zfs providers! For one, it circumvents all integrity checks and self-healing capabilities of zfs and may damage the vdev or even the whole pool (e.g. due to single-bit errors in zfs metadata during the dd - been there, wasn't funny), and secondly it is horribly inefficient compared to a proper zfs resilver.

Add a new disk (image) to the VM, create a GPT table with properly aligned partitions, add the bootcode/UEFI partitions etc., then zpool attach(8) the new zfs partition to the existing vdev and let it resilver. After resilvering, zpool detach(8) the old, misaligned provider. Shut down the VM, remove the old disk/image from the VM configuration, start the VM, collect underpants, then profit.
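An untested sketch of that sequence, assuming the pool is called zroot, the old provider is nda1p4 and the new disk shows up as nda2 (adjust names and sizes to your setup):
Code:
# partition the new disk with 1 MiB / 4k alignment
gpart create -s gpt nda2
gpart add -a 1m -s 260m -t efi nda2                  # p1: ESP
gpart add -a 4k -s 512k -t freebsd-boot nda2         # p2: legacy boot
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 nda2
gpart add -a 1m -s 1g -t freebsd-swap nda2           # p3: swap
gpart add -a 1m -t freebsd-zfs nda2                  # p4: rest of the disk
# make the ESP bootable
newfs_msdos /dev/nda2p1
mount_msdosfs /dev/nda2p1 /mnt
mkdir -p /mnt/EFI/BOOT && cp /boot/loader.efi /mnt/EFI/BOOT/BOOTX64.EFI
umount /mnt
# mirror onto the new provider, wait for the resilver, then drop the old one
# (nda2p4 must be at least as large as nda1p4 or the attach will fail,
#  so make the new disk image the same size or slightly larger)
zpool attach zroot nda1p4 nda2p4
zpool status zroot          # wait until resilvering is done
zpool detach zroot nda1p4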


That being said - depending on what kind of disk image the hypervisor is using, the partition alignment is completely irrelevant anyways...
 
Then you fix the partition table so the zfs partition starts at 1056m.

Out of curiosity, how do you fix the partition table? Is there a tool for that?
I ask because it's somewhat complex to do by hand.

I imagine:

- Copy the fourth entry into the first slot.
- Change the starting and ending LBA of the partition in that entry.
- Zero out the other entries (it's not clear whether we keep the swap partition or not).
- Recompute the CRC32 of the partition entry array and update it (and the header's own CRC) in the primary GPT header.
- Copy the new table to the backup area at the end of the disk and correct the checksums in the corresponding header.
 
I can write you a short how-to for migrating the zpool from one disk to another using send/receive. As this is a VM, why are you using ZFS on it, and what do you expect to gain from aligning the partitions on a virtual disk?
 
Out of curiosity, how do you fix the partition table? Is there a tool for that?
I ask because it's somewhat complex to do by hand.

I imagine:

- Copy the fourth entry into the first slot.
- Change the starting and ending LBA of the partition in that entry.
- Zero out the other entries (it's not clear whether we keep the swap partition or not).
- Recompute the CRC32 of the partition entry array and update it (and the header's own CRC) in the primary GPT header.
- Copy the new table to the backup area at the end of the disk and correct the checksums in the corresponding header.
You can just delete and recreate it with another starting block using gpart.
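Something like this for the layout in the first post, after the dd has been done and still booted from external media (sector numbers are derived from that gpart output - double-check them before running):
Code:
# shrink swap so it ends on the 1 MiB boundary at sector 2162688 (1056m)
gpart resize -i 3 -s 2095725 nda1
# recreate the zfs partition at the new, aligned start, keeping its old size
gpart delete -i 4 nda1
gpart add -i 4 -t freebsd-zfs -b 2162688 -s 60748397 nda1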
 
I just had the bright idea to set up a new installation from scratch, mount my old zroot to the new install, and copy over my scripts, confs, tunables, and whatnot. DEAR LORD what a trainwreck when two zroot pools are present. ZFS is bratty and I hate it a little right now. There's got to be a clean way to do this...
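The usual way around the name clash is to import the old pool by its numeric ID under a different name. Roughly (the ID below is made up; plain zpool import shows the real one):
Code:
zpool import                                          # lists importable pools with their numeric IDs
zpool import -N -R /mnt 1234567890123456 zroot_old    # import the old zroot under a new name, nothing mounted yet
mount -t zfs zroot_old/ROOT/default /mnt              # the boot environment is canmount=noauto, so mount it by hand
# ...copy scripts, confs, tunables...
umount /mnt
zpool export zroot_old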
 
In the past I moved partitions because of alignment nags. They never got any faster. I guess I don't really understand all the bits involved. Maybe it isn't an exact science.
 
Too many translation layers in between. If the disk is native 4k and emulates 512b, it might be the case that a 16k filesystem block spans 5 physical blocks instead of 4 when the alignment isn't optimal.
Same with SSD write/erase zones, which are 32k or something.
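Quick arithmetic to illustrate (offsets made up): a 16 KiB block that starts 512 bytes past a 4 KiB boundary touches one extra physical block.
Code:
# physical blocks spanned = ceil((start % 4096 + 16384) / 4096)
echo $(( (0 % 4096 + 16384 + 4095) / 4096 ))      # aligned start: 4
echo $(( (512 % 4096 + 16384 + 4095) / 4096 ))    # start 512 B off: 5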
 
Here's how you can migrate your ZFS zroot to a new disk (da1). First you will need a USB stick with the same FreeBSD version as the currently running OS.
Boot from the FreeBSD installer and select "Live System", then log in as root (no password).

# The currently running da0 disk, which we are going to migrate to the new disk (da1)
Code:
# gpart show
=>       40  266338224  da0  GPT  (127G)
         40     532480    1  efi  (260M)
     532520       1024    2  freebsd-boot  (512K)
     533544        984       - free -  (492K)
     534528    4194304    3  freebsd-swap  (2.0G)
    4728832  261607424    4  freebsd-zfs  (125G)
  266336256       2008       - free -  (1.0M)

# camcontrol devlist
<Msft Virtual Disk 1.0> at scbus0 target 0 lun 0 (pass0,da0)
<Msft Virtual Disk 1.0> at scbus0 target 0 lun 1 (pass1,da1)

# Create a new partitioning scheme on the new disk
gpart create -s gpt da1

# Add a new efi system partition (ESP)
gpart add -a 4k -l efiboot0 -t efi -s 260M da1

# Format the ESP
newfs_msdos da1p1

# Add new Boot partition for Legacy boot (BIOS)
gpart add -a 4k -l gptboot0 -t freebsd-boot -s 512k da1

# Add the protective master boot record and bootcode
gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 da1

# Create new swap partition
gpart add -a 1m -l swap0 -t freebsd-swap -s 2G da1

# Create new ZFS partition to the rest of the disk space
gpart add -a 1m -l zfs0 -t freebsd-zfs da1

# mount the ESP partition
mount_msdosfs /dev/da1p1 /mnt

# Create the directories and copy the efi loader in the ESP
mkdir -p /mnt/efi/boot
mkdir -p /mnt/efi/freebsd
cp /boot/loader.efi /mnt/efi/boot/bootx64.efi
cp /boot/loader.efi /mnt/efi/boot/loader.efi

# Create the new UEFI boot variable and unmount the ESP
efibootmgr -a -c -l /mnt/efi/boot/loader.efi -L FreeBSD-14
umount /mnt

# Create mountpoint for zroot and zroot_new
mkdir /tmp/zroot
mkdir /tmp/zroot_new

# Create the new ZFS pool on the new disk (zroot_new)
zpool create -o altroot=/tmp/zroot_new -O compress=lz4 -O atime=off -m none -f zroot_new da1p4

# Import the original zroot
zpool import -R /tmp/zroot zroot

# Create a snapshot and send it to the zroot_new on the other disk.
zfs snapshot -r zroot@migration
zfs send -vR zroot@migration | zfs receive -Fdu zroot_new

# Export zroot and zroot_new, then import zroot_new again under the new name (rename zroot_new to zroot)
zpool export zroot
zpool export zroot_new
zpool import -R /tmp/zroot zroot_new zroot

# Set the default boot
zpool set bootfs=zroot/ROOT/default zroot

# cleanup the snapshot created for migration
zfs list -t snapshot -H -o name | grep migration | xargs -n1 zfs destroy

# export the pool
zpool export zroot

# Shut down and remove the old disk
shutdown -p now
# After the reboot, select FreeBSD-14 from the UEFI boot menu and, if everything is OK, clean up the old UEFI boot entry using efibootmgr
 
In the past I moved partitions because of alignment nags. They never got any faster. I guess I don't really understand all the bits involved. Maybe it isn't an exact science.
Back in the day when disks' geometry mattered and LBA wasn't a thing, alignment was a thing. Today it doesn't matter. I ignore the nags.
 
Back in the day when disks' geometry mattered and LBA wasn't a thing, alignment was a thing. Today it doesn't matter. I ignore the nags.

On flash drives it really doesn't matter - the ancient concept of blocks and sectors doesn't apply any more. Those drives just pretend to be structured like magnetic drums from 70 years ago, but their actual IO patterns are completely different and managed by their firmware, so they absolutely don't care whether your IO lands 512 bytes of a fictional mapping earlier or later; your 512-byte or even 4k chunks are comically small for them anyway.

Roughly the same applies for VMs sitting on non-raw disk images that have their own internal data structure and possibly even some compression going on...
 
On flash drives it really doesn't matter - the ancient concept of blocks and sectors doesn't apply any more. Those drives just pretend to be structured like magnetic drums from 70 years ago, but their actual IO patterns are completely different and managed by their firmware, so they absolutely don't care whether your IO lands 512 bytes of a fictional mapping earlier or later; your 512-byte or even 4k chunks are comically small for them anyway.

Nice that you restated what I just said. Thank you for reinforcing that.

Roughly the same applies for VMs sitting on non-raw disk images that have their own internal data structure and possibly even some compression going on...
Ditto.
 
On flash drives it really doesn't matter - the ancient concept of blocks and sectors doesn't apply any more. Those drives just pretend to be structured like magnetic drums from 70 years ago, but their actual IO patterns are completely different and managed by their firmware, so they absolutely don't care whether your IO lands 512 bytes of a fictional mapping earlier or later; your 512-byte or even 4k chunks are comically small for them anyway.

Roughly the same applies for VMs sitting on non-raw disk images that have their own internal data structure and possibly even some compression going on...
I think the more appropriate answer is "it's complicated". Just as too small a sector size (usually below 4k) makes performance tank disproportionately, too large a sector size can also cause performance drops on some drives that are known to work internally with much larger sizes, like 256k. If misaligned, the drive might absorb the misaligned piece just fine, or it might slow down as writes straddle real (or firmware-simulated) boundaries and take a performance hit. It also matters less if you don't do a lot of small-block activity and the misalignment only shows up at the beginning and end of significantly larger transfers.

VMs sometimes take a noticeable hit from any number of disagreements between host and guest, especially if the guest disk is an image file sitting inside a host filesystem.
 
Here's how you can migrate your ZFS zroot to a new disk (da1). First you will need a USB stick with the same FreeBSD version as the currently running OS.
Boot from the FreeBSD installer and select "Live System", then log in as root (no password).
[snipped content as it didn't seem to quote-copy correctly anyway]
I think the formal answer is that the loader.efi file goes into efi/freebsd rather than efi/boot. It also seems the efi/boot/boot<arch>.efi path is normally capitalized.

My understanding is that current ZFS treats compression=on (the default) as lz4, so setting it explicitly could be skipped; unless I'm wrong, bsdinstall could similarly be adjusted to no longer set compression at all.
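If you want to check what a given pool actually does (pool name assumed to be zroot):
Code:
zpool get feature@lz4_compress zroot     # enabled/active means compression=on is treated as lz4
zfs get -o name,property,value,source compression zroot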

The installer makes an exception and sets atime back on for /var/mail .

For send/recv it is good to review command options. Some aspects of the old pool can be modified at this time.

For sending: -L is needed if recordsize was set above 128k; my understanding is that excluding it will rewrite larger records into smaller ones. I use -e out of habit to keep streams smaller, but it likely has minimal impact when send is piped into recv on the same local machine; it's also incompatible with receiving into an encrypted dataset. -v makes things noisy; that comes with both good and bad, so it's just a choice.

For receiving, if you want to adjust any dataset properties you can do so with -o setting=value. Applying different values to different datasets means you need a series of send+recv commands instead of one recursive command set. Adjusting a property sets that value for all future data, while only some changes restructure the old data: setting compression will recompress all received data at the new value, whereas setting recordsize will not rewrite the data into new record sizes.
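Putting the send and receive options above together, a rough example (pool names and property values are placeholders):
Code:
zfs snapshot -r zroot@migration
zfs send -R -L -e zroot@migration | zfs receive -Fdu -o compression=zstd -o atime=off zroot_new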

I recently went through a series of similar steps to redo an old disk onto itself. Not sure why, but the drive would legacy/CSM boot yet refused to show up as a choice for UEFI booting until I redid it. I thought it was going to be alignment, since the layout was old (512b sector size, alignment not forced), but destroying and recreating the same partition layout, types and content also turned out to be UEFI-bootable (efibootmgr had no impact on this). Maybe I'll tear down the old disk image to see if I can find a difference; at a glance it looked right but just failed to work.
 