[ZFS] 512b -> 4k alignment for existing pool?

I have a ZFS filesystem whose underlying pool is a RAID1 mirror of two spinning-rust drives. The drives are quite new and I bet they really have 4k sectors, but they report themselves as having 512-byte sectors and because I'm an idiot I didn't do anything about this when I set up the system. As a result, my mirror vdev has ashift=9.

This seems like it may be bad for two reasons.
  1. Presumably ZFS will be doing lots of 512-byte I/O operations on the drive, and they will all really be reading or writing 4096 bytes at a time, so everything will be less efficient than it should be.
  2. I am about to add an SSD as an L2ARC device; my hazy understanding is that getting the sizes and alignments right is especially important for SSDs, and I have a vague fear (not backed by any concrete evidence) that this will be harder if my existing storage is all using ashift=9 and the SSD wants everything 4k-aligned.

So:

  1. Can I somehow make all my existing storage use ashift=12? If so, how? Answers that don't begin "first, transfer all your data onto another device because we're going to blow it all away" would be particularly welcome, because I am easily scared :).
  2. Suppose for the sake of argument that I can't, or that I don't even though I can; so everything in my pool is still on a vdev with ashift=9. Can I / should I / need I do anything special when setting up my new SSD as an L2ARC device, to make sure that I/O operations done on it are properly aligned?

A few more details, in case they're relevant. I'm running FreeBSD 9.1 on an amd64 system. It has 4GB of RAM. (At some point I will surely increase that; is it urgent?) My existing setup has a zpool containing just one vdev, a RAID1 mirror of two 2TB hard drives. The device I intend to use for L2ARC is an Intel 520-series SSD, 120GB in size.
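In case it matters how I'm reading the current alignment: I believe plain zdb just dumps the cached pool config (ashift included), and diskinfo shows what the drives claim about themselves (ada0 here standing in for one of the mirror drives):
Code:
[CMD="#"]zdb | grep ashift[/CMD]
[CMD="#"]diskinfo -v ada0 | grep -E 'sectorsize|stripesize'[/CMD]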

Many thanks in advance for any advice!
 
@gjm

You should always have a proper backup, period. I'm guessing that's where your fears stem from; get that sorted and you'll feel a lot less scared about your data :) Unfortunately, there is no other way than to back up and restore. Obviously, I take no responsibility for any harm done by following this guide. Consider it only as a guideline for performing such an operation, and make sure you have a proper understanding of the commands before trying any of it.
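If you need a quick way to get that backup first, a recursive snapshot sent to any other pool works. A rough sketch, assuming your pool is called oldpool (as in the guide below) and your external disk holds a pool called backup:
Code:
[CMD="#"]zfs snapshot -r oldpool@backup[/CMD]
[CMD="#"]zfs send -R oldpool@backup | zfs recv -d backup[/CMD]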

Boot from a FreeBSD install CD and choose to enter Live Mode:
Code:
ada0 = disk1
ada1 = disk2

[CMD="#"]mkdir /tmp/oldpool[/CMD]
[CMD="#"]mkdir /tmp/newpool[/CMD]
[CMD="#"]zpool import -o cachefile=/tmp/zpool.cache -o altroot=/tmp/oldpool oldpool[/CMD]

----------------------------------------------------
Make sure you're good to go. This can of course be done before starting any of this.
[CMD="#"]zpool scrub oldpool[/CMD]
----------------------------------------------------

[CMD="#"]zpool status oldpool[/CMD]
...
	NAME             STATE     READ WRITE CKSUM
	oldpool          ONLINE       0     0     0
	  mirror-0       ONLINE       0     0     0
	    gpt/disk1    ONLINE       0     0     0
	    gpt/disk2    ONLINE       0     0     0

errors: No known data errors
[CMD="#"]zpool detach oldpool gpt/disk2[/CMD]
[CMD="#"]gpart destroy -F ada1[/CMD]
[CMD="#"]gpart create -s gpt ada1[/CMD]

----------------------------------------------------
Only needed if you're booting off of these disks.
[CMD="#"]gpart add -t freebsd-boot -s 64k ada1[/CMD]
[CMD="#"]gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada1[/CMD]
Also remember this value, you'll need it to set bootfs on the new pool later:
[CMD="#"]zpool get bootfs oldpool[/CMD]
NAME     PROPERTY  VALUE         SOURCE
oldpool  bootfs    oldpool/root  local
----------------------------------------------------

[CMD="#"]gpart add -t freebsd-zfs -b 2048 -a 4k -l disk2 ada1[/CMD]
[CMD="#"]gnop create -S 4096 /dev/gpt/disk2[/CMD]
[CMD="#"]zpool create -m /tmp/newpool -o cachefile=/tmp/zpool.cache -o autoexpand=on newpool gpt/disk2.nop[/CMD]
[CMD="#"]zpool export newpool[/CMD]
[CMD="#"]gnop destroy /dev/gpt/disk2.nop[/CMD]
[CMD="#"]zpool import -d /dev/gpt -o cachefile=/tmp/zpool.cache -o altroot=/tmp/newpool newpool[/CMD]
[CMD="#"]zfs snapshot -r oldpool@now[/CMD]
[CMD="#"]zfs send -R oldpool@now | zfs recv -dF newpool[/CMD]
This will take a while, depending on how much data there is.
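If you happen to have sysutils/pv available in the live environment, you can run the send through it instead to get a rough progress meter:
[CMD="#"]zfs send -R oldpool@now | pv | zfs recv -dF newpool[/CMD]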

At this point you have replicated all of your data from the old ashift=9 pool over to your new ashift=12 pool, and now you need to switch over to using the new pool:
Code:
[CMD="#"]zfs destroy -r newpool@now[/CMD]
[CMD="#"]zpool destroy oldpool[/CMD]
[CMD="#"]gpart destroy -F ada0[/CMD]
[CMD="#"]gpart create -s gpt ada0[/CMD]

----------------------------------------------------
Again, only needed if you're booting off of these disks.
[CMD="#"]gpart add -t freebsd-boot -s 64k ada0[/CMD]
[CMD="#"]gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ada0[/CMD]
[CMD="#"]zpool set bootfs=newpool/root newpool[/CMD]
----------------------------------------------------

[CMD="#"]gpart add -t freebsd-zfs -b 2048 -a 4k -l disk1 ada0[/CMD]
[CMD="#"]zpool attach newpool gpt/disk2 gpt/disk1[/CMD]
Wait for resilver to finish.
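You can keep an eye on it with:
[CMD="#"]zpool status newpool[/CMD]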

----------------------------------------------------
To give the pool the same name it had before:
[CMD="#"]zpool export newpool[/CMD]
[CMD="#"]zpool import -d /dev/gpt -o cachefile=/tmp/zpool.cache -o altroot=/tmp/oldpool newpool oldpool[/CMD]
----------------------------------------------------

[CMD="#"]cp /tmp/zpool.cache /tmp/oldpool/boot/zfs/[/CMD]
[CMD="#"]zfs set mountpoint=/ oldpool/root[/CMD]
It's likely going to whine here about not being able to mount your pool over the existing [FILE]/[/FILE]
while in the Live environment; that is to be expected.
[CMD="#"]shutdown -r now[/CMD]

Now, I've read through that again and again, tried to be as detailed as possible, and it should "just work". But as I said, these are only guidelines. Use at your own risk.


As for optimizing cache devices, there's only partition alignment to worry about:
Code:
ada2 = cache1

[CMD="#"]gpart destroy -F ada2[/CMD]
[CMD="#"]gpart create -s gpt ada2[/CMD]
[CMD="#"]gpart add -t freebsd-zfs -b 2048 -a 4k -l cache1 ada2[/CMD]
[CMD="#"]zpool add oldpool cache gpt/cache1[/CMD]

Know that the system still needs RAM to be able to allocate the L2ARC headers; a conservative rule of thumb is 1GB of RAM per 10GB of L2ARC. On most workloads the actual overhead is much smaller, e.g. on one of my systems:
Code:
[CMD="#"]zfs-stats -L[/CMD]
...
L2 ARC Size: (Adaptive)				166.94	GiB
	Header Size:			0.89%	1.48	GiB
...
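If that ratio holds for you (headers at roughly 0.9% of the L2ARC size), a 120GB cache device would cost somewhere around 1GB of RAM just for headers, so your 4GB should be workable, if a bit tight.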

Also note that by default ZFS does not feed prefetched (streaming) reads into the L2ARC; it mostly caches small, random reads. If you want streaming data cached as well, set this in /etc/sysctl.conf:
Code:
vfs.zfs.l2arc_noprefetch=0
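I believe that one is runtime-tunable as well, so you can flip it immediately without a reboot:
Code:
[CMD="#"]sysctl vfs.zfs.l2arc_noprefetch=0[/CMD]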

Good luck!

/Sebulon
 
Sebulon, many thanks for the detailed and helpful reply!

Yes, strongly agree about backups. I am, however, paranoid even when (I believe) well backed up :). (If I screw up, delete all my data, and render my system unbootable, then getting it back into working order is going to be a nuisance even with good backups. At least, that's the case with the backup hardware I currently have.)

I have some ignorant questions about your procedure.

  1. The first thing you do is a zpool import. How is that possible? The "-m /tmp/oldpool" makes me wonder whether it was meant to be a zpool create -- but everything that follows makes it look as if oldpool is meant to be my existing pool. I suspect I'm missing something very simple here.
    • There's another zpool import with that mysterious -m PATH, later on. Just to be clear about what's confusing me: on my machine, man zpool says that -m in zpool import means "enable import with missing log devices" and doesn't seem to assign any meaning to passing a pathname.
  2. Just to check that I understand what the procedure is doing:
    • we detach disk2=ada1, so now the old pool still exists but is no longer mirrored;
    • we redo the partitioning of disk2 from scratch, using the gnop hack (zpool create, gnop create, zpool export, gnop destroy, zpool import) to make it look like a 4k-sector device, so we end up with a new zpool living on (just) disk2 with ashift=12;
    • we copy the data from the old pool (now just on disk1=ada0) to the new pool (on disk2=ada1) by taking a recursive snapshot and send/receiving it;
    • now everything should be alive and well on disk2=ada1, and disk1=ada0 can be re-gpart'ed and attached. It doesn't need the gnop hack because the pool we're adding it to already has ashift=12, and attaching disk1 automatically makes a mirror vdev, because that's the (very sensible) default behaviour.
  3. There's clearly some magic going on involving /tmp/zpool.cache. Is it explained somewhere?
  4. If I have -- as I do -- considerably less RAM than 10% of the size of the L2ARC device, what will happen? (I'm guessing "usually nothing much", but what's the worst case? Panics? Less caching than I might have hoped for?)

Once again, many many thanks.
 
@gjm

Well there you go, there were some things I overlooked :)

  1. Right, that's your existing, old ashift=9 pool.
    • Dang it, I forgot that -m means different things for create and for import. It was supposed to mount the pool at the mountpoint created earlier. I'll correct the original post with the appropriate changes.
  2. Correct.
  3. Not much magic really; it's just a cache that ZFS keeps so it doesn't have to read the metadata off all the disks every time. I know development is leaning towards removing it altogether, but until FreeBSD 10, or maybe even 11, you have to make sure you handle it manually. (There's a quick check shown below if you ever want to see which cache file a pool is using.)
  4. It just won't be able to cache as much as you'd perhaps expect; nothing harmful.
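The quick check I mentioned for point 3, if you want to see which cache file a pool is currently using (if I recall, a dash means the default, which is /boot/zfs/zpool.cache on FreeBSD):
Code:
[CMD="#"]zpool get cachefile oldpool[/CMD]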

/Sebulon
 