ZFS: Best recordsize for bhyve disk image storage

Hello,

I’m in the process of moving my production storage from a raidz1 SAS HDD pool (4 drives) to a mirrored SSD pool (2 drives), and it’s a great opportunity to reshape and tweak the ZFS layout and configuration.

I’ve read some of Klara’s posts about bhyve and ZFS; they give nice input, but they don’t go very deep when it comes to recordsize.
Future storage is SSD, and I would like to ensure I won’t generate too much write amplification.

I was thinking about an 8k recordsize for the storage of the VMs’ disk images, but I’m really not sure.

Any pointers appreciated :)
 
In your case, you need to choose the block size based on the planned usage of the applications that will write files to your pool.
An example is described for MySQL in the FreeBSD ZFS Tuning Guide: https://wiki.freebsd.org/ZFSTuningGuide
...
  • zfs set recordsize=16k tank/db/innodb
  • zfs set recordsize=128k tank/db/logs
...
You write that you plan to store VM images, so find out the block sizes of the file systems in your guest systems and choose a value that averages out across those virtual machines.
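For example, to look up the block size inside a guest (the device names here are just examples; the exact command depends on the guest OS):

Code:
# Linux guest, ext4 (example device /dev/vda1):
tune2fs -l /dev/vda1 | grep 'Block size'
# FreeBSD guest, UFS (example device /dev/vtbd0p2); look for the bsize line:
dumpfs /dev/vtbd0p2 | grep bsize
# Windows guest, NTFS; look for "Bytes Per Cluster":
fsutil fsinfo ntfsinfo C: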
 
As you're moving from spinning rust to SSDs, I'd say take extra care in setting the right ashift* value, as this is an immutable vdev property (and SSDs are known to lie about their sector size); when in doubt, test/benchmark. See OpenZFS' Workload Tuning** and, since you're using VMs, the section on zvol volblocksize; note: volblocksize >= guest FS's block size >= ashift. I have no experience with these three as a triple, but it seems to me that they should be multiples (1x or more) of each other.

* Preferred Ashift by George Wilson - OH SH*FT!! - slides
** containing this reference, i.e. # 2, which you might already be familiar with ...
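For the zvol side, a minimal sketch (pool/zvol names are only examples; if you use file-backed images on a dataset, recordsize is the corresponding knob instead of volblocksize):

Code:
# zvol backing a bhyve guest; volblocksize is fixed at creation time
zfs create -V 40G -o volblocksize=16K ssdpool/guest0
zfs get volblocksize ssdpool/guest0
# the vdev's ashift actually in use
zdb -C ssdpool | grep ashift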
 
You write that you plan to store VM images, so find out the block sizes of the file systems in your guest systems and choose a value that averages out across those virtual machines.

Well, I get what you mean, but I have no idea what I might run in the future, so it’s a bit tricky. I could probably settle for something like 16k or 32k, but I would rather benefit from the experience (or benchmarks) of an expert on the subject.
 
As you're moving from spinning rust to SSDs, I'd say take extra care in setting the right ashift* value, as this is an immutable vdev property (and SSDs are known to lie about their sector size);

Thanks for the pointers, I’ll take a deep look into that.
As for the lying, how can I get real info? Is smartctl reliable?

Code:
# smartctl -a /dev/ada0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SEDC600M7680G
../..
Sector Size:      512 bytes logical/physical
...
 
Well, I get what you mean, but I have no idea what I might run in the future, so it’s a bit tricky. I could probably settle for something like 16k or 32k, but I would rather benefit from the experience (or benchmarks) of an expert on the subject.
Fine. Now say you have decided that your images need a block size of 16k or 32k, so you set it to 32k:
zfs set recordsize=32k <dataset>
and you use your images.
Then, when you plan to create a new virtual machine, find out its block size; if it uses, for example, a block size of 64k, run
zfs set recordsize=64k <dataset>
before creating (writing) the image, so the new value applies to it.
I don't see any problems. Everything will work as it should; ZFS will take care of this.
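A variation on the same idea (dataset names are just examples): give each image its own child dataset, so every VM keeps its own recordsize and you don't have to flip the property back and forth:

Code:
zfs create -p -o recordsize=32K tank/vms/guest-a
zfs create -p -o recordsize=64K tank/vms/guest-b
zfs get recordsize tank/vms/guest-a tank/vms/guest-b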
 
Your smartctl size output is almost certainly not usable as a true indicator of the correct ashift value. AFAIK ZFS determines its ashift value at creation via its own method (when not specified it will probably default to 9), but it won't overcome a lying SSD (as seems to be the case with your Kingston SSD). You can check the ashift value after creation with zdb -C <poolname> (grep/look for the ashift entries). A value of 12 (= 4K sector drives) or 13 is common for flash-based drives. Unless you have specific data showing that an SSD has/requires a value of 13, I'd test with 12. Specify an appropriate value for the sysctl vfs.zfs.min_auto_ashift before vdev creation, or use an explicit value on the command line at the time of vdev creation, for example -o ashift=12.
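Roughly, the two routes look like this (pool/device names are just placeholders):

Code:
# option 1: raise the auto-detection floor before creating the vdev
sysctl vfs.zfs.min_auto_ashift=12
zpool create ssdpool mirror ada0 ada1
# option 2: set it explicitly at creation time
zpool create -o ashift=12 ssdpool mirror ada0 ada1
# either way, verify what was actually used
zdb -C ssdpool | grep ashift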
 
Thanks for the pointers, I’ll take a deep look into that.
As for the lying, how can I get real info? Is smartctl reliable?

Code:
# smartctl -a /dev/ada0
smartctl 7.4 2023-08-01 r5530 [FreeBSD 14.0-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     KINGSTON SEDC600M7680G
../..
Sector Size:      512 bytes logical/physical
...
This shows the sector size of the physical disk, not the block size of a file system like UFS, ext2/3 or NTFS (Windows).
For the ashift when creating a ZFS pool, use this:
An ashift value is a bit shift value (i.e. 2 to the power of ashift), so a 512-byte sector size corresponds to ashift=9 (2^9 = 512). The ashift values range from 9 to 16, with the default value 0 meaning that ZFS should auto-detect the sector size.
So, create a pool and see what value is set automatically:
zpool create -o ashift=0 ...
zdb -C <pool> | grep ashift
(for your KINGSTON SEDC600M7680G I recommend ashift=12)
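As a worked example of the mapping, plus one more place to cross-check (though diskinfo relies on the same device-reported values as smartctl):

Code:
# 2^9  = 512   -> ashift=9
# 2^12 = 4096  -> ashift=12  (typical for flash)
# 2^13 = 8192  -> ashift=13
diskinfo -v /dev/ada0 | grep -E 'sectorsize|stripesize'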
 
I have a related question: what experiment would you set up / design to measure recordsize effectiveness (for bhyve) with the following constraint: you don’t have a dedicated zpool?
 
As for me, I don't see the point in experimenting, because experiments have already been conducted and conclusions have been published. There are examples for databases (block size 4-8k: PostgreSQL, MySQL), for operating systems (block size 16k: ext, NTFS), for network storage with small files (block size 4k), and for large-file storage (block size 1M).
You can create several file systems and define your own block size for each.
Example:
#zfs create zroot/databases
#zfs create zroot/share
#zfs create zroot/backups
#zfs set recordsize=8K zroot/databases
#zfs set recordsize=4K zroot/share
#zfs set recordsize=1M zroot/backups
In the guest operating system:
#dd if=/dev/random of=/<mountpoint>/test.file bs=4k count=1024 status=progress
#dd if=/dev/random of=/<mountpoint>/test.file bs=8k count=512 status=progress
#dd if=/dev/random of=/<mountpoint>/test.file bs=1m count=4 status=progress
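To also exercise reads (same caveat: this measures throughput, not amplification, and a freshly written file may be served from cache):
#dd if=/<mountpoint>/test.file of=/dev/null bs=4k status=progress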
 
I did a similar (non-scientific) benchmark on a Windows VM using various record sizes, but did not notice a difference worth mentioning between 64k and 128k. However, contrary to klarasystems, where they boast that bhyve is faster for Windows virtualization than KVM, I have found KVM to be at least 50% faster on both hardware systems where I tested it (one with an AMD CPU, one with an Intel CPU, both dual-booting FreeBSD 14 and Ubuntu 22.04; tested on NVMe, SATA SSD and spinning SATA HDD; tested the virtio and nvme drivers, and virtio seemed more consistent across the tests).
 
As for me, I don't see the point in experimenting, because experiments have already been conducted and conclusions have been published. There are examples for databases (block size 4-8k: PostgreSQL, MySQL), for operating systems (block size 16k: ext, NTFS), for network storage with small files (block size 4k), and for large-file storage (block size 1M).
You can create several file systems and define your own block size for each.
../..
I get your point, but I’m way more interested in reducing read & write amplification than in maximising speed. Your tests will only measure speed.
The measure of speed may or may not be related to the measure of amplification, but that’s not a hypothesis I’ll make without a proper experiment.
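For what it’s worth, here’s the kind of measurement I have in mind on the host (device/pool names are only examples, and it depends on the drive exposing a lifetime-writes counter):

Code:
# note the drive's lifetime-writes counter before the test, if it exposes one
smartctl -A /dev/ada0 | grep -i written
# watch bytes written at the pool/vdev level while the guest does a fixed,
# known amount of I/O (e.g. a few GiB with dd or fio inside the VM)
zpool iostat -v ssdpool 10
# read the counter again afterwards and compare the delta with the guest-side bytes
smartctl -A /dev/ada0 | grep -i written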
 
../.. However, contrary to klarasystems, where they boast that bhyve is faster for Windows virtualization than KVM, I have found KVM to be at least 50% faster on both hardware systems ../..
This is why I would be interested in running tests on my own hardware.
 