ZFS ashift settings for ZFS raidz and many small files

We have concerns about the massive increase in storage requirements since moving to FreeBSD and RAIDZ. I have been reading a thread on this subject, Thread 71099, and would like the ashift settings put into some context by someone who knows what is going on.

Our host was set up with vfs.zfs.min_auto_ashift: 12 and vfs.zfs.max_auto_ashift: 13. This must have been done by the installer. I infer that selecting 4096-byte sectors resulted in the value of 12. I have no idea where the 13 comes from.
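
For reference, the values currently in force can be read back with sysctl, and the ashift a pool was actually created with is recorded in its configuration; a rough check (assuming the pool is named zroot, as it is later in this thread, and that zdb can see the pool's cache file):
Code:
# Auto-ashift bounds currently in effect
sysctl vfs.zfs.min_auto_ashift vfs.zfs.max_auto_ashift
# ashift recorded in the pool configuration (one line per vdev)
zdb -C zroot | grep ashift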

On this host we have four disks, encrypted, all in a single raidz2 pool created by the installer. The host runs the bhyve hypervisor and has several guests, all running with ZFS backends. The applications running in these guests include postfix, postgresql and cyrus-imap. We are migrating off the existing host onto another machine, also running FreeBSD. We are still debating whether to stick with 12 or revert to 11 on the new host. Whatever the decision, I would like to learn what the most appropriate values for the block size / ashift would be.
 
You said "many small files". I don't know the details of how ZFS handles small files, but I've worked on the internals of other file systems. Regularly, users and customers come and say that they want their file system to be efficient for "many small files". But when one goes through the numbers, it usually turns out that "many" is actually very few.

Concrete example. Say you have a file system with 10TB of usable space (today, that's easy to do; even with good redundancy it only requires a handful of disks). Say that your file system is half full, which is a good way to run a file system. You then think that your storage needs are dominated by small files, and as evidence you point out that you have hundreds of thousands, perhaps as many as a million, files that are under 4KB in size. Well, that's nonsense. Because 1 million x 4KB = 4GB. Say for fun that your file system is inefficient by a factor of 3 once you include metadata, directory entries, and rounding file sizes up to the nearest allocation block; that is still only 12GB. On a 10TB file system that is half full, that is a little over 0.2% of the space in use, which means that about 99.8% of the space usage has to be coming from medium-size and large files. For small files (around the size of the 4KB or 8KB allocation block) to make any difference in space usage on a modern file system, you need billions of them, and very few workloads do that. Most people who think that their small files dominate space efficiency are wrong.
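
To make it easy to plug in your own numbers, here is the same back-of-the-envelope arithmetic as a one-liner (the values below are the ones from the example above, not measurements):
Code:
# percent of used space explained by small files:
# 100 * files * avg_size * overhead / bytes_in_use
echo 'scale=2; 100 * 1000000 * 4096 * 3 / 5000000000000' | bc
# prints .24, i.e. the "a little over 0.2%" figure above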

So here is my suggestion: gather some quick statistics on how many files you have and what their size distribution is. Ideally, do that in powers of 2: count how many files are between 512 bytes and 1KB, how many between 1KB and 2KB, and so on. Ideally, don't do it only by file size, but also by the actual disk usage of each file, because that better captures sparse files (or do both versions). If you combine those statistics with knowledge of how ZFS space allocation actually works (you won't get that from me), you can make sensible predictions.
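
A rough sketch of such a survey using FreeBSD's stat(1), where %z is the apparent file size and %b the number of 512-byte blocks actually allocated (the path is a placeholder; point it at the dataset you care about):
Code:
# Power-of-two histogram of file count, apparent size, and actual disk usage
find /path/to/dataset -type f -print0 | xargs -0 stat -f '%z %b' | awk '
{
    sz = $1; alloc = $2 * 512
    b = 9                          # start at the 512-byte bucket
    while (2^b < sz) b++           # smallest power of two >= file size
    files[b]++; bytes[b] += sz; disk[b] += alloc
}
END {
    for (b in files)
        printf "<= %10d bytes: %8d files, %10.1f MB apparent, %10.1f MB on disk\n",
            2^b, files[b], bytes[b] / 2^20, disk[b] / 2^20
}' | sort -n -k2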

Different topic: what disk drives are you using? Most modern spinning-rust disk drives actually have physical 4K sectors, but many are able to emulate 512-byte logical sectors, usually using some read-modify-write technique (sometimes assisted by internal flash or log-structured writing within the disk). Using any block size smaller than 4K will have terrible performance effects on those disks.
 
Disks = 4 x 3T WD Reds. RAIDZ2 leaves 10T. "Small" depends upon your point of view; for my purposes I consider anything under 10K small. The existing system is set with ashift = 12 (4096-byte blocks). What I am trying to find out is what a larger ashift would accomplish in terms of storage utilization.
 
The math doesn't work out: RAID-Z2 has two redundancy disks (it can handle two faults). With four disks total, you should get two disks' worth of capacity (the other two are redundancy), or about 6TB usable. And 3TB WD Reds are recent enough that I bet they have 4K sectors. Here is an example of output from my system (which has 3TB and 4TB Hitachi disks):
Code:
# camcontrol identify /dev/ada3
pass3: <HGST HMS5C4040BLE640 MPAOA5D0> ATA8-ACS SATA 3.x device
...
sector size           logical 512, physical 4096, offset 0
...
Which means that ashift=12 is the smallest practical value for me, from a performance standpoint.
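
For what it's worth, when the new pool gets created, the way to make sure it really ends up with ashift=12 depends on the ZFS version; a sketch, with the pool and disk names as placeholders:
Code:
# Legacy FreeBSD ZFS: raise the floor before creating the pool
sysctl vfs.zfs.min_auto_ashift=12
zpool create newpool raidz2 da0 da1 da2 da3
# OpenZFS (FreeBSD 13 and later): the ashift can be given directly
zpool create -o ashift=12 newpool raidz2 da0 da1 da2 da3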

Try my trick of measuring the distribution of files by file size (or disk usage); I suspect you will be surprised.
 
I may have picked the wrong figure from the report, so I reproduce it here:
Code:
# zpool list
NAME       SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
bootpool  1.98G   258M  1.73G        -         -    13%    12%  1.00x  ONLINE  -
zroot     10.6T  8.21T  2.42T        -         -    55%    77%  1.00x  ONLINE  -
 
If both pools share the same 4 disks, then the total is wrong. You have 10.6T + 0.002T of disk space, which is more than 4 x 3TB disks can provide with RAID-Z2. I worry that you may have a setup mistake, and your pool is really not redundant at all.
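
One way to settle that worry is to look at the vdev layout directly (zroot is the pool name from the listing above):
Code:
zpool status zroot
# Look for a "raidz2-0" vdev with the four disks indented under it.
# If the disks appear directly under the pool name instead, they are
# striped with no redundancy.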
 
Two points:

* zpool list shows all the “storage space” in a zpool, which includes the space consumed by redundancy in raidz setups, so 10.6TB sounds about right for 4 x 3T.
* As ralphbsz points out, it is hard to fill up a modern, large filesystem with small files; it is, however, easy to fill it with small records if you set recordsize low, or if you use zvols with a small volblocksize (a quick way to check both is sketched below).
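
A minimal sketch of both checks, assuming the pool name zroot from the listing above: zfs list reports the usable, after-parity space that zpool list does not, and the zfs get commands show which datasets and zvols are using small block sizes:
Code:
# Usable space after parity, versus the raw figure from zpool list
zfs list -o name,used,avail,logicalused zroot
# Block sizes in use; a small recordsize or volblocksize on raidz2 gives up
# a lot of space to parity and padding
zfs get -r -t filesystem recordsize zroot
zfs get -r -t volume volblocksize zroot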

See my posts in the thread you reference.
 