ZFS How to dimension log device and cache device of a zpool

Let's say I have a PC with X memory and a pool of size Y. How do you dimension the size of the log device (write cache) and cache device (read cache) for this zpool? Guidelines, best practices, tips.
 
It's not really about the size of the memory or the pool. Well, sort of, but the type of access has much more impact. If you have a bunch of large files (movie files for example) that are mostly read sequentially, those aren't even cached. Small files and lots of random reads, those could benefit from the L2ARC.

I added an L2ARC to my storage thinking it might improve performance, but after some time it turned out the cache was mostly empty and, like you, I had a really high miss rate. So it was mostly useless and I removed it again.
 
Log device: the amount of data that your system can write within vfs.zfs.txg.timeout (default is 5 or 10 seconds, depending on version), probably times 2 for reserve. That is usually only a couple of GB.
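As a very rough back-of-envelope sketch (the 10 GbE ingest rate below is an assumed worst case for illustration, not a figure from this thread):

Code:
# check the current transaction group timeout
sysctl vfs.zfs.txg.timeout

# assume the pool is fed over 10 GbE at ~1.25 GB/s:
#   1.25 GB/s * 5 s      = ~6.25 GB per txg
#   times 2 for reserve  = ~12-13 GB of log device
# anything beyond that will simply never be used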

The L2ARC is a bit more delicate. There is no simple rule-of-thumb.
Miss ratio of my L2-ARC is 95%
Mine is at 85%, and still it does exactly what I need it for: 1. keep the database mostly in the L2ARC, 2. hold the metadata for the large-files pool, so that the disks don't spin up just for a lookup, only when the files themselves are read.
That utilization figure is calculated in a way I do not fully understand, so it doesn't seem very helpful to me. But I clearly notice that on the first night after a reboot the nightly periodic run takes half an hour with the spinning disks at 100%, while on subsequent nights it takes 5 minutes and zpool iostat -v shows throughput on the cache device.
(But then, I have separate pools for the database, the large files and the other stuff, and therefore also separate L2ARC devices, so one kind of data cannot evict the other.)

It may be helpful to have a plan for what the L2ARC should specifically achieve. Then configure the secondarycache attribute of the filesystems accordingly, and size the L2ARC to that requirement.
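As a sketch (pool and dataset names are made up, just to illustrate the idea):

Code:
# cache data and metadata of the database dataset
zfs set secondarycache=all tank/db

# only cache metadata for the large-files dataset,
# so directory walks don't need the spinning disks
zfs set secondarycache=metadata tank/media

# keep bulk/backup data out of the L2ARC entirely
zfs set secondarycache=none tank/backup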
 
I added an L2ARC to my storage thinking it might improve performance, but after some time it turned out the cache was mostly empty and, like you, I had a really high miss rate. So it was mostly useless and I removed it again.
I once tried to add one on the desktop, for general acceleration, and found it utterly useless. Investing in a second SSD and running the frequently used filesystems entirely from it gives results that are an order of magnitude better.
 
I once tried to add one on the desktop, for general acceleration, and found it utterly useless. Investing in a second SSD and running the frequently used filesystems entirely from it gives results that are an order of magnitude better.
Just adding more memory will help a lot too. More room for ARC or filesystem caches (in case of UFS).

But yeah, there's no simple calculation to show how much ZIL or L2ARC you're going to need. It really depends on the load, what's being stored and how it's accessed. Just add some, and do a lot of measurements and performance tests. That's really the only way to find out what you need.
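A minimal way to do that, assuming a spare SSD partition (ada1p4 and the pool name tank are placeholders): add the device, let the normal workload run for a few days, then look at the numbers before deciding whether to keep it.

Code:
# add an L2ARC (cache) device to the pool
zpool add tank cache ada1p4

# watch per-vdev activity in 5-second intervals
zpool iostat -v tank 5

# check L2ARC hit/miss ratios after some days of normal use
zfs-stats -L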
 
One thing that is often missed is: L2ARC also needs RAM for its tables - sometimes A LOT of it. So adding a huge L2ARC effectively decreases the ARC size, thus hurting performance (badly!).
Basically you should first completely max out the RAM configuration before you consider an L2ARC, because RAM is still several orders of magnitude faster than even the fastest PCIe/NVMe drives.
A cache/L2ARC usually makes most sense on big NAS/SANs with a big set of 'hot' data. For most fileservers where only a small percentage of files are accessed regularly, normal ARC in RAM (with maxed out RAM configuration) is more than sufficient.
Use the zfs-stats tool to look at your ARC stats; if most of your ARC is used for "frequently used" blocks AND has a high hit rate (~80%+), an L2ARC can make sense. For a large "recently used" cache it doesn't make that much sense, as this only caches blocks that have been recently read and are held in cache in case they will be accessed again. Only blocks that are accessed multiple times before falling out of the ARC are counted towards the "frequently used" cache.
For caches with a low amount of "frequently used" blocks, your L2ARC will most likely have a hit ratio of <1% and will thus only block valuable space in RAM (~1-2GB per 100GB of cache).

Here's an example of this scenario on our old fileserver (maxed out at 64GB RAM), now mostly used for backup purposes:
Code:
# zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Jun  9 13:50:55 2021
------------------------------------------------------------------------
[...]
ARC Size:                               42.25%  25.89   GiB
        Target Size: (Adaptive)         42.31%  25.93   GiB
        Min Size (Hard Limit):          12.50%  7.66    GiB
        Max Size (High Water):          8:1     61.29   GiB
        Decompressed Data Size:                 160.21  GiB
        Compression Factor:                     6.19

ARC Size Breakdown:
        Recently Used Cache Size:       87.93%  22.80   GiB
        Frequently Used Cache Size:     12.07%  3.13    GiB
[...]

With the ARC mostly filled with recently used data, the L2ARC hit rate is terrible:

Code:
# zfs-stats -L

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Jun  9 13:59:38 2021
------------------------------------------------------------------------
[...]
L2 ARC Size: (Adaptive)                         68.31   GiB
        Decompressed Data Size:                 95.12   GiB
        Compression Factor:                     1.39
        Header Size:                    0.94%   915.03  MiB
[...]
L2 ARC Breakdown:                               6.03    b
        Hit Ratio:                      0.12%   7.42    m
        Miss Ratio:                     99.88%  6.02    b
        Feeds:                                  6.14    m

For the minuscule hit rate of 0.12% the L2ARC takes up 915MB of RAM (L2ARC header size) - nearly 1GB that would be put to better use if it were available to the ARC. (So I really should remove the L2ARC from that pool soon...)
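Removing it again is trivial; cache devices can be detached from a live pool without data loss, since the L2ARC only holds copies (device name below is a placeholder):

Code:
# find the device listed under the "cache" section
zpool status tank

# detach it from the pool
zpool remove tank ada1p4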


As for separate ZIL devices: this fully depends on the purpose of the pool. The ZIL is only used for synchronous writes, so usually just for some database configurations or VMs. Your normal "fileserver" workload will never touch the ZIL.
ZFS _always_ keeps a ZIL, but usually spreads it over _all_ available providers. So the ZIL is already quite fast and can easily handle the few synchronous writes that might occur with "normal" database configurations and even a few VMs (although Windows as a guest OS is always a PITA regarding sync writes, because it often decides to disable disk caches in a VM and forces sync writes for *everything*).
If you decide you really need a separate ZIL, _always_ use mirrored devices. If the ZIL gets lost or damaged, the pool might end up in an inconsistent state.
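A sketch of adding a mirrored log vdev (device names are placeholders; small partitions are enough, since only a few GB are ever used, see above):

Code:
# mirrored SLOG: a single SSD failure cannot take
# the most recent sync writes with it
zpool add tank log mirror nvd0p1 nvd1p1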
 
sko that is some good info right there. The biggest thing is "why do you think you need XYZ?" One needs to look at the data before one can say "adding XYZ may help".
Your specific workload has a lot to do with how to interpret that data.
zfs-stats is a good tool. Lots of information to sift through, then you need to understand what it means.
Think of how a typical desktop is used:
Boot, go into graphical environment, use a browser, use an editor, play some music.
Now keep that in mind with whats been said about the ARC, L2ARC, etc.
 
On performance: the L2ARC is adaptive, so it does not hurt much either.
On consistency: if I understood you correctly, sko, the ZIL is more fragile to sudden power outages ...
 
Think of how a typical desktop is used:
Boot, go into graphical environment, use a browser, use an editor, play some music.
Now keep that in mind with whats been said about the ARC, L2ARC, etc.

Almost no desktop and only some servers really need L2ARC and/or SLOG devices. In most cases it is by far the best solution to just use a "standard" pool, and only if you really run into bottlenecks or performance issues that can be solved with L2ARC or SLOG should you add them. Size and type then still depend on the actual workload.
For desktops, which nowadays run almost exclusively off SSDs, an L2ARC on another SSD makes absolutely no sense at all (except perhaps an NVMe device, but then you are still better off using that NVMe for the pool itself).

The L2ARC is adaptive, so it does not hurt much either.
Except for what I've said above: it costs memory, and with a low amount of frequently read data in your ARC you won't gain any performance but will still reduce your ARC size...
 
zfs-stats -E shows a bunch of good stuff.
Rereading a bit of ZFS Mastery by Michael W Lucas, I was reminded of this one key statement (basically supporting everything that sko was saying):

L2ARC will only cache items that fall off of ARC.
The important part is this from zfs-stats -E (a desktop system):
Code:
        CACHE HITS BY CACHE LIST:
          Most Recently Used:           11.18%  8.92    m
          Most Frequently Used:         88.73%  70.77   m
          Most Recently Used Ghost:     0.29%   227.57  k
          Most Frequently Used Ghost:   0.58%   460.02  k
Things move from MRU to MFU, then to the Ghosts. Ghosts represent things that have recently fallen off the lists.
If only a tiny fraction of hits is resolved from the ghost lists, L2ARC isn't really going to help. If that fraction is significant, then L2ARC will likely help.
 
One thing that is often missed is: L2ARC also needs RAM for its tables - sometimes A LOT of it. So adding a huge L2ARC effectively decreases the ARC size, thus hurting performance (badly!).
That can be estimated as ~1% of the L2ARC size (give or take). That needs to be put into the calculation.
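In numbers, using that ~1% rule of thumb (only a rough estimate; the real header overhead depends on the record sizes that end up in the cache):

Code:
# ~1% of the cache device size ends up as headers in RAM:
#   200 GB L2ARC -> ~2 GB of ARC space lost to headers
#   800 GB L2ARC -> ~8 GB of ARC space lost to headers
# compare against what zfs-stats actually reports:
zfs-stats -L | grep "Header Size"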

A cache/L2ARC usually makes most sense on big NAS/SANs with a big set of 'hot' data. For most fileservers where only a small percentage of files are accessed regularly, normal ARC in RAM (with maxed out RAM configuration) is more than sufficient.
For a database where the working set cannot fit into RAM, but can fit into the L2ARC, it works really well. The trade-off in RAM doesn't matter then, because the RAM cannot hold the working set anyway and is best used as L2ARC header space.

So it is really necessary to know the use case and how things work, and then either do measurements or do the math.
Code:
[...]
L2 ARC Size: (Adaptive)                         68.31   GiB
        Decompressed Data Size:                 95.12   GiB
        Compression Factor:                     1.39
        Header Size:                    0.94%   915.03  MiB
[...]
L2 ARC Breakdown:                               6.03    b
        Hit Ratio:                      0.12%   7.42    m
        Miss Ratio:                     99.88%  6.02    b
        Feeds:                                  6.14    m
Yeah, that's pointless. At these figures the data is indeed best accommodated directly in memory, where the ARC will do a much better job of optimizing what is actually needed.

That is the other point here: the L2ARC does not have the intelligent optimization of the ARC; it only stores whatever falls off the ARC. One would need to optimize manually, by properly partitioning the filesystems and setting secondarycache options.

As for separate ZIL devices: this fully depends on the purpose of the pool. The ZIL is only used for synchronous writes, so usually just for some database configurations or VMs. Your normal "fileserver" workload will never touch the ZIL.
NFS is by default 100% sync; it all goes through the ZIL. (Whether you want or need that is another question; I have switched mine to async.)
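For reference, switching an NFS-exported dataset to async on the ZFS side is a per-dataset property (tank/export is a made-up name); be aware that sync=disabled means writes the client believes are committed can be lost in a crash:

Code:
# treat all writes to this dataset as asynchronous
zfs set sync=disabled tank/export

# revert to honouring the clients' sync requests
zfs set sync=standard tank/export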
 
On performance: the L2ARC is adaptive, so it does not hurt much either.
On consistency: if I understood you correctly, sko, the ZIL is more fragile to sudden power outages ...
The ZIL is only for power outages. It is not a cache, it is an "intent log". It doesn't speed up writes by itself, it only makes sure that you don't need an fsck when restarting. (And since ZFS does not have fsck, you would be out of luck if you needed one.)

Therefore, as was already said, ZFS does always write a ZIL. But normally it writes the ZIL into the normal pool data, which may be on spinning disks and may be slow. And this is why a separate ZIL device on SSD can make sync writes faster. And then you must make sure that the thing doesn't fail, because when it fails you might need an fsck, which doesn't exist...
 
NFS is by default 100% sync; it all goes through the ZIL. (Whether you want or need that is another question; I have switched mine to async.)

Isn't the default to only write metadata synchronously and all data asynchronously?
At least that's also what mount(8) states:
Code:
             noasync
                     Metadata I/O should be done synchronously, while data I/O
                     should be done asynchronously.  This is the default.

Only the 'sync' flag enforces fully synchronous writes; 'async' means everything (including metadata) is written asynchronously.

Looking at the NFS sysctls, there is only 'vfs.nfsd.async', which is set to 0 by default, and the nfsd daemon doesn't have any flags regarding sync (because sync/async/noasync is set on the mounting side?).
 
Let's say I have a PC with X memory and a pool of size Y. How do you dimension the size of the log device (write cache) and cache device (read cache) for this zpool? Guidelines, best practices, tips.
I have 64GB of fast SSD L2ARC on my desktop and this seems to be good enough for regular use. The drive is bigger, but I partitioned it that way; I also have swap on the same drive. It depends on the usage profile of course, but in my case the cache is not even full all the time, it stays somewhere near 60GB. It is good that the L2ARC is persistent now with FreeBSD 13.0 🙂
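Carving a cache partition (plus swap) out of a larger SSD could look roughly like this; the device name, pool name, labels and sizes are all hypothetical:

Code:
# create the partition scheme once, if the disk is blank
gpart create -s gpt nvd0

# 64 GB partition for the L2ARC, plus swap on the same drive
gpart add -t freebsd-zfs -s 64G -l l2arc nvd0
gpart add -t freebsd-swap -s 16G -l swap0 nvd0

# attach the labelled partition as cache device
zpool add zroot cache gpt/l2arc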
 
Argentum out of curiosity, what is the output of zfs-stats -E under your typical load on that system? Particularly the "CACHE HITS BY CACHE LIST" section, then the MRU/MFU Ghost lines.
L2ARC gets populated by stuff that falls out of the ARC; the Ghosts represent "recently fell out of the ARC".
 
Argentum out of curiosity, what is the output of zfs-stats -E under your typical load on that system? Particularly the "CACHE HITS BY CACHE LIST" section, then the MRU/MFU Ghost lines.
L2ARC gets populated by stuff that falls out of the ARC; the Ghosts represent "recently fell out of the ARC".
Can answer that later. Not sitting behind my desktop now.
 
Isn't the default to only write metadata synchronously and all data asynchronously?
Yes, for normal mounts. But NFS is different. I did the mount, and then I saw all my writes go through the ZIL in full.
Then I searched and found some statements to the effect that it is totally obvious that nfsd must do sync writes. Now I can't even find those statements anymore. But here is one:

https://lists.freebsd.org/pipermail/freebsd-hackers/2014-November/046481.html: "Since you're using an NFS server, it cannot reply success to an operation till it is committed to stable storage"
 
There are countless posts on this topic online. These days there is little or no need for it. The RAM matters more. And if it is ECC, that's superb.
 
Argentum out of curiosity, what is the output of zfs-stats -E under your typical load on that system? Particularly the "CACHE HITS BY CACHE LIST" section, then the MRU/MFU Ghost lines.
L2ARC gets populated by stuff that falls out of the ARC; the Ghosts represent "recently fell out of the ARC".
Here it is:

Code:
------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Jun 10 17:35:28 2021
------------------------------------------------------------------------

ARC Efficiency:                                 46.85   m
        Cache Hit Ratio:                96.57%  45.24   m
        Cache Miss Ratio:               3.43%   1.61    m
        Actual Hit Ratio:               95.73%  44.85   m

        Data Demand Efficiency:         88.84%  9.20    m
        Data Prefetch Efficiency:       36.03%  274.54  k

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.57%   258.46  k
          Most Recently Used:           22.74%  10.29   m
          Most Frequently Used:         76.39%  34.56   m
          Most Recently Used Ghost:     0.28%   124.62  k
          Most Frequently Used Ghost:   0.03%   12.59   k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  18.06%  8.17    m
          Prefetch Data:                0.22%   98.91   k
          Demand Metadata:              80.36%  36.36   m
          Prefetch Metadata:            1.36%   616.62  k

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  63.95%  1.03    m
          Prefetch Data:                10.94%  175.63  k
          Demand Metadata:              15.86%  254.66  k
          Prefetch Metadata:            9.25%   148.44  k

------------------------------------------------------------------------
 
From my desktop PC:
Code:
ZFS Subsystem Report                            Thu Jun 10 16:55:26 2021
ARC Efficiency:                                 3.88    b
        Cache Hit Ratio:                99.97%  3.87    b
        Cache Miss Ratio:               0.03%   1.15    m
        Actual Hit Ratio:               99.93%  3.87    b

        Data Demand Efficiency:         99.27%  50.51   m
        Data Prefetch Efficiency:       62.78%  542.10  k

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.02%   905.46  k
          Most Recently Used:           0.87%   33.85   m
          Most Frequently Used:         99.09%  3.84    b
          Most Recently Used Ghost:     0.01%   306.90  k
          Most Frequently Used Ghost:   0.01%   253.79  k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  1.29%   50.14   m
          Prefetch Data:                0.01%   340.35  k
          Demand Metadata:              98.65%  3.82    b
          Prefetch Metadata:            0.05%   1.86    m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  32.07%  367.52  k
          Prefetch Data:                17.60%  201.75  k
          Demand Metadata:              40.12%  459.88  k
          Prefetch Metadata:            10.21%  116.98  k

L2 ARC Summary: (HEALTHY)
        Low Memory Aborts:                      209
        Free on Write:                          17.80   k
        R/W Clashes:                            1
        Bad Checksums:                          0
        IO Errors:                              0

L2 ARC Size: (Adaptive)                         10.15   GiB
        Decompressed Data Size:                 12.43   GiB
        Compression Factor:                     1.22
        Header Size:                    0.25%   32.40   MiB

L2 ARC Evicts:
        Lock Retries:                           15
        Upon Reading:                           0

L2 ARC Breakdown:                               1.14    m
        Hit Ratio:                      52.94%  601.71  k
        Miss Ratio:                     47.06%  534.91  k
        Feeds:                                  39.38   k

L2 ARC Writes:
        Writes Sent:                    100.00% 33.08   k
 
Code:
        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.57%   258.46  k
          Most Recently Used:           22.74%  10.29   m
          Most Frequently Used:         76.39%  34.56   m
          Most Recently Used Ghost:     0.28%   124.62  k
          Most Frequently Used Ghost:   0.03%   12.59   k

Code:
        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.02%   905.46  k
          Most Recently Used:           0.87%   33.85   m
          Most Frequently Used:         99.09%  3.84    b
          Most Recently Used Ghost:     0.01%   306.90  k
          Most Frequently Used Ghost:   0.01%   253.79  k
See how small the Ghost entries are? That is the stuff that would wind up in the L2ARC.
Alain De Vos looking at your numbers, especially the L2 stats, it just doesn't seem to me that the L2ARC is providing much of a benefit. But it's your system and your choice, so don't change anything just on my say-so.
 
... and L2ARC

Code:
------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Jun 10 18:35:47 2021
------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
        Low Memory Aborts:                      6
        Free on Write:                          4.69    k
        R/W Clashes:                            0
        Bad Checksums:                          0
        IO Errors:                              0

L2 ARC Size: (Adaptive)                         55.87   GiB
        Decompressed Data Size:                 78.85   GiB
        Compression Factor:                     1.41
        Header Size:                    0.14%   116.00  MiB

L2 ARC Evicts:
        Lock Retries:                           4
        Upon Reading:                           0

L2 ARC Breakdown:                               1.61    m
        Hit Ratio:                      29.55%  475.94  k
        Miss Ratio:                     70.45%  1.13    m
        Feeds:                                  84.89   k

L2 ARC Writes:
        Writes Sent:                    100.00% 16.04   k

------------------------------------------------------------------------

But this is all after I ran the Chia plotter recently. That affected my L2ARC seriously; it has not recovered yet.

Another thing is that the main disks are rotating. The L2ARC makes the system almost silent.
 