[Solved] 512B L2ARC with a 4K pool?

I've been upgrading one of our old pools at work, going from 512B drives to 4K. However, I realized that I wanted to keep our SSD L2ARC, which reports 512B sectors (which I think is accurate). I also stumbled across an old thread suggesting that even if your L2ARC device is 512B, it should be overridden and forced to 4K. My questions are:
  1. Is this (forcing 4K on a cache device for a 4K pool) still the correct configuration now?
  2. Will running sysctl vfs.zfs.min_auto_ashift=12 (prior to adding it as a cache device) do the right thing for L2ARC, or do we still need to use gnop?
  3. Is there ever a case where you want a mismatch between L2ARC sector size and pool sector size? If no, why isn't this handled automatically (or is it)?
 
SSD L2ARC, which reports 512B sectors (which I think is accurate)

Then those drives are lying - there have never been any SSDs with (physical) 512B sectors... Some SSDs (esp. consumer devices) report 512B sectors for compatibility with some primitive OSes and filesystems (most notably from Redmond).
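
You can see what a drive reports with diskinfo; a 512e drive will typically show a 512B logical sector but a 4K stripesize. The device name is just an example, and not every drive reports a stripesize:
Code:
        # diskinfo -v /dev/ada0 | grep -E 'sectorsize|stripesize'
                512             # sectorsize
                4096            # stripesize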

As for your question: you can't specify a block size/ashift for the L2ARC - it always ingests the data in the blocksize in which it is stored in the pool, which is variable and defaults to 128K.
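
If you want to see that for yourself, the relevant dataset property is recordsize (the pool name tank is just a placeholder):
Code:
        # zfs get recordsize tank
        NAME  PROPERTY    VALUE    SOURCE
        tank  recordsize  128K     default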
 
Some SSDs (esp. consumer devices) report 512B sectors for compatibility
Yeah, the 4K drives are the same way.

it always ingests the data in the blocksize in which it is stored in the pool, which is variable and defaults to 128K.
My understanding was that the pool blocksize, i.e. ashift, is based on the block devices that are initially added to it. I've forced it to 12 via the sysctl since the drives lie and emulate 512B for the reasons you specified. In any case, as long as the cache device will automatically use the same, I'll just go ahead and add the raw device without doing anything special. Thanks!
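
For anyone following along, the whole sequence looks something like this (the pool name tank and device da2 are placeholders for our actual names):
Code:
        # sysctl vfs.zfs.min_auto_ashift=12
        vfs.zfs.min_auto_ashift: 9 -> 12
        # zpool add tank cache da2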
 
The ARC sits much further up the stack, so the underlying concept of raw blocks doesn't apply here.
ZFS stores data in metaslabs, which are further divided into records of variable size. It's been a while since I wrapped my head around the inner workings of ZFS, but both books by Michael W. Lucas give pretty good insight into how ZFS works underneath (esp. the "Advanced ZFS" book). I'll try to give a _very_ short summary so you can understand why the blocksize doesn't matter for an L2ARC device.

Here is a quick example of how data is arranged in a metaslab, taken from the output of zdb -mmm <poolname> on one of our servers:
Code:
        metaslab      7   offset   e000000000   spacemap  65538   free    46.4G
                          segments      11810   maxsize   1.82G   freepct   36%
        In-memory histogram:
                         13:   2698 ****************************************
                         14:   2396 ************************************
                         15:   1279 *******************
                         16:    869 *************
                         17:    928 **************
                         18:   2411 ************************************
                         19:    424 *******
                         20:    120 **
                         21:    103 **
                         22:    120 **
                         23:    121 **
                         24:     74 **
                         25:     67 *
                         26:    117 **
                         27:     46 *
                         28:     28 *
                         29:      6 *
                         30:      3 *
        On-disk histogram:              fragmentation 14
                         13:  25481 ****************************************
                         14:  11709 *******************
                         15:   9187 ***************
                         16:   9710 ****************
                         17:   8145 *************
                         18:   6875 ***********
                         19:   5085 ********
                         20:   3847 *******
                         21:   1751 ***
                         22:   1221 **
                         23:    245 *
                         24:    146 *
                         25:     81 *
                         26:    117 *

The histograms show the distribution of how blocks in this metaslab (7) are allocated in memory (ARC, and hence in L2ARC) and on-disk. The first column is the block size as a power of two (the same notation ashift uses) - so this metaslab holds blocks from 8KB (2^13) up to 1GB (2^30) for in-memory data and up to 64MB (2^26) for on-disk data.
Records are stored in memory (ARC) at their respective size and are not further fragmented into smaller blocks by ZFS. So a 1MB record is stored as a 1MB record in memory from the (high-level) view of ZFS. As the L2ARC is only an extension of the ARC, it ingests records that fall out of the ARC at that same size.
Hence, as long as you use an ashift equal to or larger than the physical block size you won't have any problems: records are always at least 2^ashift bytes, so you won't end up with read/write amplification at the disk level.
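
If you want to double-check which ashift a pool's vdevs actually ended up with, zdb will show it from the cached config (the pool name is a placeholder):
Code:
        # zdb -C tank | grep ashift
                        ashift: 12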
 
vfs.zfs.min_auto_ashift=12 was all I did. No further measures seem necessary.
Only make certain that the slice (or partition, or whatever) used for L2ARC actually starts on a 4K boundary; see the sketch below.
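
With GPT that just means creating the partition 4K-aligned, e.g. with gpart (device name and label are placeholders):
Code:
        # gpart create -s gpt da2
        da2 created
        # gpart add -t freebsd-zfs -a 4k -l l2arc0 da2
        da2p1 added
        # zpool add tank cache gpt/l2arc0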
 