ZFS 1MB recordsize performance - recordsize discussion

Hi
Recently I installed a new FreeBSD 10.1-STABLE server and discovered that 1MB recordsize support has been added. Theoretically, a large recordsize will waste more space and increase I/O latency, especially when reading 4K of data out of a 1MB block.

However, with LZ4 compression, disk access is minimal when reading 4K of data on a 1MB recordsize, since the compression ratio on the empty data is high, so only 4K or less of physical blocks on disk are accessed.

The overhead of writing 4K of data to disk is also minimal, because the latency of inline compression is very low on a high-performance CPU.

May I know whether my assumptions are correct? If so, enabling a 1MB recordsize won't harm read/write performance on ZFS; in addition, it also increases the compression ratio and speeds up the scrubbing process.
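
For reference, this is roughly how I plan to enable it (the pool and dataset names below are just placeholders, not my real setup):

# the large_blocks pool feature has to be enabled before records larger than 128K can be used
zpool set feature@large_blocks=enabled tank

# raise the recordsize and turn on LZ4 for the dataset
zfs set recordsize=1M tank/data
zfs set compression=lz4 tank/data

# verify
zfs get recordsize,compression tank/data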
 
Your assumption is correct if you want your storage to store many zeroes. :) On real data, compression will be less efficient, and as a result, to read 4KB your storage may need to read 1MB. For writes the situation is even worse, because most writes will cause a read-modify-write cycle of a 1MB block. Large blocks indeed improve compression and reduce processing overhead for large files, but unless you have a very specific workload, the excessive I/O and cache usage for small operations may be too big a cost.
 
Hi
Recently I installed a new FreeBSD 10.1-STABLE server and discovered that 1MB recordsize support has been added. Theoretically, a large recordsize will waste more space and increase I/O latency, especially when reading 4K of data out of a 1MB block.

"recordsize" on ZFS is not a hard rule, it's an upper limit. Meaning, ZFS will use a recordsize that matches the size of the data block.

For example, if you create a 4 KB text file, ZFS will store that in a 4 KB record. If you create a 12 KB text file, ZFS will store that in a 16 KB record. If you create a 720 KB text file, ZFS will store that in a 1 MB record. Etc.

In other words, there won't be any extra wasted space from using 1 MB records; unless all of your files are 129 KB in size, of course. :)
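
A quick way to see this for yourself (just a sketch; the dataset path is made up, and it assumes compression is off so the numbers aren't skewed):

# write a 4 KB and a 12 KB file onto the 1M-recordsize dataset
dd if=/dev/random of=/tank/data/small4k bs=4k count=1
dd if=/dev/random of=/tank/data/small12k bs=4k count=3

# compare the logical size with the space actually allocated on disk
ls -l /tank/data/small4k /tank/data/small12k
du -h /tank/data/small4k /tank/data/small12k

du will report a bit more than the file size because of metadata, but nowhere near 1 MB per file.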
 
Nowadays a CPU is fast enough to compress data in real time faster than an HDD can write it, so I guess enabling the 1M recordsize won't add extra overhead to the HDD, only to the CPU; however, CPU processing power is still far faster than an HDD or SSD even with that extra overhead added.

I'm thinking of turning it on and forgetting about it.
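
If you want a rough number for your own CPU, the lz4 command-line tool (from ports, not in the base system) has a built-in benchmark mode; this is only a sketch and the file path is a placeholder:

# benchmark LZ4 compression level 1 against a sample file; compare the MB/s against your disk's write speed
lz4 -b1 /path/to/sample/file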
 
"recordsize" on ZFS is not a hard rule, it's an upper limit. Meaning, ZFS will use a recordsize that matches the size of the data block.

For example, if you create a 4 KB text file, ZFS will store that in a 4 KB record. If you create a 12 KB text file, ZFS will store that in a 16 KB record. If you create a 720 KB text file, ZFS will store that in a 1 MB record. Etc.

In other words, there won't be any extra wasted space from using 1 MB records; unless all of your files are 129 KB in size, of course. :)
Originally, writing a 1.3MB file on a 1M-recordsize volume would result in 2MB being occupied. However, with compression enabled, only about 1.3MB is used.
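
You can check how this works out on real data with something like the following (the dataset name is just an example):

# logicalused = size of the data before compression, used = what it actually occupies on disk
zfs get compressratio,logicalused,used tank/data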

I'm not sure whether enabling 1MB blocks will improve NFS sequential read speed from ESXi or not. However, it should improve scrubbing and resilvering speed, since ESXi only deals with those large VMDK images, am I right?
 
Usually ESXi and other VMs tend to create a massive random write load. Such a load will cause huge read-modify-write overhead: for each 4KB write, your disks will have to read and then write a 1MB block. Even if you get good 2-3x compression, that is still 300-500KB. That may not compensate for the scrub/resilver benefits.
 
Usually ESXi and other VMs tend to create a massive random write load. Such a load will cause huge read-modify-write overhead: for each 4KB write, your disks will have to read and then write a 1MB block. Even if you get good 2-3x compression, that is still 300-500KB. That may not compensate for the scrub/resilver benefits.
I found something new today: a large recordsize doesn't work very well for any kind of virtual disk image, such as VMDK, qcow2, etc. Read amplification shows up on those VM images with a large recordsize.

For example, reading 4K of data inside the VM will trigger a read of the surrounding 256 * 4K blocks at the same time, so triggering random reads of 10 different 4K blocks in the VM amplifies the reads to 10MB from ZFS. Most of the VM images are thin provisioned, so there is no empty or zeroed data inside for compression to shrink away.
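
A rough way to see this amplification (just a sketch; the pool name is a placeholder) is to watch pool bandwidth while the guest does small random reads:

# watch per-vdev read bandwidth at 1-second intervals while the VM issues 4K random reads;
# with 1M records, the bandwidth read from the pool will be far larger than what the guest asked for
zpool iostat -v tank 1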
 
By the way, what is the best recordsize for ESXi over NFS or iSCSI? Is anyone using a 4K recordsize to avoid alignment issues?

4K might be the best for ESXi, but the drawbacks are higher space overhead and slower scrubbing and resilvering.
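
For what it's worth, one possible middle ground (the names, size, and the 16K figure are only examples, not tested recommendations) would be something like:

# NFS datastore: a smaller recordsize to limit read-modify-write on random 4K guest I/O
zfs create -o recordsize=16K -o compression=lz4 tank/esxi-nfs

# iSCSI LUN: a zvol, where volblocksize plays the same role as recordsize and can only be set at creation
zfs create -V 200G -o volblocksize=16K -o compression=lz4 tank/esxi-lun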
 