Block sizes

jimbobmcgee · Jun 10, 2014

I gather a ZFS filesystem uses a variable block size, between ashift and recordsize, but that the block size of a zvol is fixed to volblocksize.

As such, when sharing a zvol to an iSCSI client, there are four distinct block sizes to consider:

The volblocksize of the zvol (defaults to 8KB)
The BlockLength of the iSCSI target (currently recommended 512B)
The MTU of the iSCSI network layer (commonly ~1500B or ~9000B)
The block size or allocation unit of the filesystem, as created by the OS of the iSCSI client (e.g. in WinNT/NTFS, this defaults to 4KB)

It strikes me that disparity between these could lead to a lot of wasted writes, and also wasted space due to metadata. One or more of the following must surely happen:

The iSCSI client might break up its 4K writes into little 512B I/Os (suboptimal network transfer)
The iSCSI client might break up its 4K writes into 1500B I/Os (requiring TCP reassembly in the network stack)
The iSCSI server might write each of its 512B writes to a separate ZFS record (significant bloat from metadata, suboptimal compress/dedupe unit)
The iSCSI server might batch up its 512B writes into 8K blocks (latency between receipt and commit)
The iSCSI server might ignore its own block-length and accept a 4K block from the client OS filesystem, and write that as a single ZFS record
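
To put rough numbers on the scenarios above, here is a minimal Python sketch of the size arithmetic only. The sizes are the ones listed in this post; the helper name `units_needed` is made up for illustration, protocol headers are ignored, and nothing here claims to describe what an initiator or target actually does.

```python
def units_needed(io_size: int, unit_size: int) -> int:
    """Number of unit_size pieces needed to hold io_size bytes (ceiling division)."""
    return -(-io_size // unit_size)

# Sizes taken from the list above (assumed defaults, not measured values).
NTFS_CLUSTER = 4096   # client filesystem allocation unit
ISCSI_BLOCK = 512     # BlockLength of the iSCSI target
MTU_STANDARD = 1500   # standard Ethernet MTU, headers ignored here
VOLBLOCKSIZE = 8192   # zvol volblocksize

sectors_per_cluster = units_needed(NTFS_CLUSTER, ISCSI_BLOCK)
frames_per_cluster = units_needed(NTFS_CLUSTER, MTU_STANDARD)
clusters_per_volblock = VOLBLOCKSIZE // NTFS_CLUSTER

print(f"one 4K cluster spans {sectors_per_cluster} logical blocks of 512B")
print(f"one 4K cluster needs at least {frames_per_cluster} frames at MTU 1500")
print(f"one 8K volblock holds {clusters_per_volblock} clusters of 4K")
```

A single 4K cluster write therefore covers only half of an 8K volblock, which is the mismatch the questions below are about.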

Given these:

Which (if any) of the above best describes the actual I/O operations?
What is the actual size of the record, written to the ZFS pool, for any arbitrary write (i.e. the one to which ZFS metadata is written and compression/dedupe might be applied)?
What are the "best" values to "match up"? volblocksize and FS allocation unit?
What is the average size of ZFS metadata per written record?
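
To illustrate what "matching up" would mean, this is a small sketch of the alignment arithmetic, assuming writes are addressed by byte offset into the zvol. The function name `touched_blocks` is hypothetical, and whether ZFS really does a read-modify-write for a partially covered block is exactly the open question; the code only shows which volblocksize-sized blocks a write touches and whether it covers them completely.

```python
def touched_blocks(offset: int, length: int, volblocksize: int = 8192):
    """Return (block_index, fully_covered) for each volblock a write touches."""
    first = offset // volblocksize
    last = (offset + length - 1) // volblocksize
    result = []
    for idx in range(first, last + 1):
        start = idx * volblocksize
        end = start + volblocksize
        # A block is fully covered only if the write spans it end to end.
        covered = offset <= start and offset + length >= end
        result.append((idx, covered))
    return result

# A 4K cluster write at offset 4096 lands inside block 0 but does not fill it.
print(touched_blocks(4096, 4096))
# An aligned 8K write fills block 0 exactly.
print(touched_blocks(0, 8192))
```

With a 4K allocation unit on an 8K volblocksize, no single cluster write can ever cover a whole block; with the two values equal, every aligned cluster write does.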

Sebulon · Jun 16, 2014

jimbobmcgee said:
The iSCSI client might break up its 4K writes into little 512B I/Os (suboptimal network transfer)

The iSCSI client might break up its 4K writes into 1500B I/Os (requiring TCP reassembly in the network stack)

The iSCSI server might write each of its 512B writes to a separate ZFS record (significant bloat from metadata, suboptimal compress/dedupe unit)

The iSCSI server might batch up its 512B writes into 8K blocks (latency between receipt and commit)

The iSCSI server might ignore its own block-length and accept a 4K block from the client OS filesystem, and write that as a single ZFS record

Of course I can't speak for everyone, but we primarily use BlockLength 4096 and fall back to 512 only when that fails. Currently 2 of the 23 LUNs exported from one storage server use 512. Those two are exported to a Microsoft Data Protection Manager (DPM) server that handles backups of other servers (virtual and physical), and which was apparently picky about the disk block size. Running tcpdump on our iSCSI network has shown us that packets come in at around 8000-9000 B regardless of BlockLength.

Jumbo frames have always been considered best practice for iSCSI, for good reason. ZFS also always buffers in ARC (and possibly logs to the ZIL as well) before flushing to disk, so all those tiny writes are coalesced into a larger, variably sized stream.
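
As a rough illustration of the jumbo frame point, the sketch below estimates how many Ethernet frames one write needs at a given MTU. It assumes IPv4 and TCP headers without options (20 bytes each), a single iSCSI PDU with one 48-byte basic header segment carrying all the data, and no digests; `frames_for_write` is a made-up helper, and real traffic will differ with TCP options and the negotiated iSCSI parameters.

```python
IP_HEADER = 20   # IPv4 header without options (assumption)
TCP_HEADER = 20  # TCP header without options (assumption)
ISCSI_BHS = 48   # iSCSI basic header segment, one per PDU

def frames_for_write(payload: int, mtu: int) -> int:
    """Estimate frames needed to carry one iSCSI PDU with `payload` data bytes."""
    mss = mtu - IP_HEADER - TCP_HEADER   # TCP payload bytes available per frame
    total = payload + ISCSI_BHS          # bytes handed to TCP for this PDU
    return -(-total // mss)              # ceiling division

for mtu in (1500, 9000):
    print(f"MTU {mtu}: 8K write needs about {frames_for_write(8192, mtu)} frame(s)")
```

Under these assumptions an 8K write fits in one jumbo frame but needs several standard frames, which is consistent with the 8000-9000 B packets seen in tcpdump.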

jimbobmcgee said:
Given these:

Which (if any) of the above best describes the actual I/O operations?

What is the actual size of the record, written to the ZFS pool, for any arbitrary write (i.e. the one to which ZFS metadata is written and compression/dedupe might be applied)?

What are the "best" values to "match up"? volblocksize and FS allocation unit?

What is the average size of ZFS metadata per written record?

In our experience, BlockLength and volblocksize have essentially little to do with each other; what's of more concern is:
Mitigating 4k disk issues on RAIDZ
http://forums.freebsd.org/viewtopic.php?f=48&t=37365&p=206910

/Sebulon