ZFS For large files, the ZFS recordsize can be increased to 1M; what about the GELI sectorsize?

Hi,

When a pool or dataset will mostly store large files, it can be an advantage to use a larger ZFS recordsize of 1M. Suppose the pool is encrypted with GELI: would it be better or worse to align the GELI sectorsize with the ZFS recordsize? In general I see GELI sectorsizes of 4096. Does anybody have experience with this kind of configuration?
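For reference, the layering I mean would look roughly like this (device, pool and dataset names are just examples):

    # GELI provider with the common 4k sector size
    geli init -s 4096 /dev/ada0p4
    geli attach /dev/ada0p4
    # pool on top of the .eli provider, large-file dataset with 1M records
    zpool create tank /dev/ada0p4.eli
    zfs create -o recordsize=1M tank/media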

Thanks in advance,

Best Regards,
 
I've not done this, but GELI sits below ZFS in this case, yes? I would leave GELI alone and set the ZFS layer to use the larger recordsize.
If you are using at least 13.0, I think (not positive) that ZFS has native encryption, which could be an option.
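For example (I have not tested this myself; pool and dataset names are placeholders), a natively encrypted dataset would be created along these lines:

    # OpenZFS native encryption, per dataset, no GELI layer involved
    zfs create -o encryption=on -o keyformat=passphrase tank/secure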

The above is my opinion only.
 
So GELI is being used for "whole disk encryption"? If I recall correctly, recordsize is a per-dataset property (it can also be set on the pool's root dataset and inherited). To me that leads more to "leave GELI alone, muck with ZFS".
Why? My opinion: because to ZFS the GELI GEOM "is the physical device"; to ZFS the underlying device has a 4k sector size (an ashift of 12, I think) and it all just works out.
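A quick way to check what ZFS is actually presented with (provider and pool names are examples):

    # sector size reported by the GELI provider
    diskinfo -v /dev/ada0p4.eli
    # ashift actually used by the pool
    zdb -C tank | grep ashift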
 
Yeah, but you don't want larger single transfers from ZFS to be split up into multiple smaller transfers when they pass through GELI to the disk. I don't know the answer to the original question. Try it out and tell us :)
 
My suggestion: use geli sectorsize == 2^ashift of the pool.

Note that setting recordsize only sets a maximum record size that will be used internally by ZFS (think checksums and compression function calls); smaller I/Os will still be issued (down to the pool's 2^ashift size) for small reads/writes or for metadata (ZFS/zpool operations themselves). Depending on the pool layout, even when a full 128k record is written, the data written to each device may be significantly less than 128k.

Put together, this means that increasing the GELI sector size above 2^ashift will lead to write amplification and extra load whenever reading or writing any data smaller than the GELI sector. Increasing the ashift to force large I/Os would only move this amplification up the stack, and would make the pool much less efficient at storing small files and metadata, and potentially at compressing data. In a RAIDZn setup this is further aggravated by the (2^ashift)*(n+1) minimum allocation size.
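In command form, keeping the two in step would look something like this (device and pool names are placeholders). As a worked example of the last point: on a RAIDZ2 (n=2) with ashift=12, the minimum allocation is (2^12)*(2+1) = 12k.

    # GELI sector size = 2^ashift = 4096
    geli init -s 4096 /dev/da0p3
    geli attach /dev/da0p3
    # create the pool with a matching ashift (OpenZFS on FreeBSD 13+)
    zpool create -o ashift=12 tank /dev/da0p3.eli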
 
Hi,

When using a sectorsize of 8192 in GELI, I get a warning that the sectorsize is larger than the pagesize. When I continue anyway, I get an error when creating the ZFS pool. I could not really find a way to increase the pagesize: sysctl hw.pagesize is read-only.
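For what it's worth, the page size can be inspected but not changed at runtime; on amd64 it reports 4096:

    sysctl hw.pagesize
    hw.pagesize: 4096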

Best Regards,
 
That is interesting. To me, the implication is "GELI doesn't want bigger than that".

cracauer@ I understand and don't disagree, but if we take GELI out and put the zpool directly on a device with 4096-byte physical sectors, we would still have a problem if the dataset has a 1M recordsize, no?
I don't know the answer, but I do know that if you create ZFS structures with an ashift of 9 on a device that has physical sectors of 4096, you have performance issues. I would think that recordsize > physical sector size is a "normal" situation; 1M (1024*1024) is an even multiple of 4096.
But as I said "I've not done this, I'm speculating, blah blah blah"
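One related knob, for what it's worth (my understanding, not something I tested for this thread): FreeBSD can force a minimum ashift for newly created vdevs, which guards against devices that advertise 512-byte logical sectors:

    # make newly created pools/vdevs use at least 4k allocation units
    sysctl vfs.zfs.min_auto_ashift=12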
 
if you create ZFS structures with an ashift of 9 on a device that has physical sectors of 4096, you have performance issues.
Yes, because then there is an alignment flaw: the ZFS record (whatever it is, 4k or 128k or 1M) is allowed to start at any 512-byte boundary, not necessarily on a 4k boundary. When that happens, the 512e emulation layer in the storage device has to do a read-modify-write of an extra 4k device block for every ZFS block written.

Also consider that the ZFS recordsize is not the blocking size; the ashift is the blocking size: if you set recordsize to 1M and write a 4k file, it will use one 4k block. If you write a 1M file, it will also be stored in 4k blocks (possibly spread over the raidz).
But if you change a single byte in that 1M file, the entire record will be fetched from disk and rewritten to disk (to a new location). Transactional integrity guarantees that this 1M record is either entirely changed or not at all.
Also, compression is done over the whole 1M recordsize, and that is where the fun starts, because better compression means less I/O and better cache utilization.
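As a concrete, example-only illustration of the recordsize and compression side of this (pool/dataset names are placeholders):

    # large-file dataset: 1M records with lz4 compression
    zfs create -o recordsize=1M -o compression=lz4 tank/media
    # verify the settings and watch the achieved ratio
    zfs get recordsize,compression,compressratio tank/media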
 