ZFS primarycache: all versus metadata


I've run some tests and benchmarks on ZFS and found something I wasn't expecting. I've written a long post with plots here: http://www.patpro.net/blog/index.php/20 ... -metadata/ but the main problem boils down to this:

I create two brand new datasets, both with primarycache=none and compression=lz4, and I copy into each one a 4.8 GB file (2.05x compressratio). Then I set primarycache=all on the first one and primarycache=metadata on the second one.
I cat the first file into /dev/null with zpool iostat running in another terminal. Finally, I cat the second file the same way.
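For reference, the test setup can be sketched as follows. This is an approximation, not the exact commands from the benchmark; the pool name "tank", the dataset names, and the file name are all assumptions:

```shell
# Hypothetical reconstruction of the benchmark setup ("tank" and the
# dataset/file names are placeholders, not the originals).
zfs create -o primarycache=none -o compression=lz4 tank/test-all
zfs create -o primarycache=none -o compression=lz4 tank/test-meta
cp big-4.8G.file /tank/test-all/
cp big-4.8G.file /tank/test-meta/

# Change the cache policy only after the copies, so neither file is cached yet.
zfs set primarycache=all      tank/test-all
zfs set primarycache=metadata tank/test-meta

# In another terminal: zpool iostat tank 1
cat /tank/test-all/big-4.8G.file  > /dev/null
cat /tank/test-meta/big-4.8G.file > /dev/null
```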

The sum of the read bandwidth column is (almost) exactly the physical size of the file on disk (du output) for the dataset with primarycache=all: 2.44 GB.
For the other dataset, with primarycache=metadata, the sum of the read bandwidth column is... wait for it... 77.95 GB.

Any idea/hint about this behavior? I don't understand why reading a 4.8 GB file from a ZFS dataset could yield 77.95 GB of read bandwidth when primarycache is set to "metadata".
clamscan reads a file, gets 4k (pagesize?) of data and processes it, then it reads the next 4k, etc.

ZFS, however, cannot read just 4k. It reads 128k (the default recordsize). Since there is no cache (you've turned it off), the rest of the data is thrown away.

128k / 4k = 32
32 x 2.44GB = 78.08GB
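The arithmetic above can be checked with a quick shell calculation (the 2.44 GB figure is the on-disk size reported by du in the test):

```shell
# Read amplification when every 4k application read forces a full 128k record read.
recordsize_kb=128
chunk_kb=4
amplification=$((recordsize_kb / chunk_kb))
echo "$amplification"    # 32

# Each 128k record is read ~32 times, so the 2.44 GB on-disk file costs roughly:
awk -v n="$amplification" 'BEGIN { printf "%.2f GB\n", n * 2.44 }'    # 78.08 GB
```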
@worldi, thank you very much for your explanation. It makes sense, but the quoted example above uses the cat command, not clamscan. That would mean that most commands use a 4k read chunk. It looks very suboptimal when it happens on top of a non-caching ZFS dataset :).

# dd if=test.dat of=/dev/null bs=128k
temp         544G   152G     50     33  6.31M   119K

# dd if=test.dat of=/dev/null bs=4k
temp         544G   152G    238      0  29.8M      0
temp         544G   152G    568      0  71.1M      0
temp         544G   152G    559      0  69.9M      0
temp         544G   152G    231      0  29.0M      0

# dd if=test.dat of=/dev/null bs=1k
temp         544G   152G    236      0  29.6M      0
temp         544G   152G    665      0  83.2M      0
temp         544G   152G    622      0  77.8M      0
temp         544G   152G    624      0  78.0M      0
temp         544G   152G    616      0  77.1M      0
temp         544G   152G    581      0  72.7M      0
temp         544G   152G    558      0  69.9M      0
temp         544G   152G    559      0  70.0M      0
temp         544G   152G    559      0  69.9M      0
temp         544G   152G    559      0  69.9M      0
temp         544G   152G    559      0  69.9M      0
temp         544G   152G    245      0  30.7M      0

I always thought primarycache was something that could be used to tune the dataset, in case you decided it made more sense to cache the metadata describing the location of the data than the data itself. Actually, it looks like a terrible idea.

Searching around I came across this post (viewtopic.php?f=48&t=44035) where they see the same issue.

ZFS can write data in records of varying size. If you write a 2 kB file, it should create a 2 kB record. However, a large file will, by default, be split into 128 kB records. Reading such a file with code that pulls the data in chunks smaller than 128 kB appears to be disastrous. If you are running a dataset for a specific purpose (like a database), you can probably mitigate this by setting recordsize to match the read size used by the database engine.
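As a sketch of that mitigation, assuming a database engine that issues 16k reads (16 kB is, for example, InnoDB's default page size; the pool/dataset names here are hypothetical):

```shell
# Match recordsize to the engine's I/O size BEFORE creating the data files;
# recordsize only applies to files written after the property is set.
zfs create -o recordsize=16k tank/db

# Or on an existing (still empty) dataset:
zfs set recordsize=16k tank/db
```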

It's quite a shock though. I guess I was expecting ZFS to have some built-in cache to deal with data that is currently being read, but it obviously just relies on the MRU in the ARC.
I've checked the behavior of the cat command with dtruss; the 4k chunk size is clearly shown for the read syscalls.

About the primarycache and recordsize settings, you have a point: they're only for specific purposes like databases. I've posted a short article about tuning ZFS for MySQL on FreeBSD, with some interesting references: http://www.patpro.net/blog/index.php/20 ... n-freebsd/

This forum thread clearly shows that playing with primarycache is a very bad idea, unless you know exactly how it works and why you are doing it.
playing with primarycache is a very bad idea, unless you know exactly how it works, and why you do it.

That would be my conclusion as well. Because it's a ZFS property that's easy to change, I think it gives the impression that it's something you can play with to tune the dataset. In reality it should only ever be changed for a very specific reason, and you have to do a bit of work to make sure your application is configured correctly for it (and that nothing other than that application stores data there).