ZFS How to dimension log device and cache device of a zpool

Let's say I have a PC with X memory and a pool of size Y. How do you dimension the size of the log device (write cache) and cache device (read cache) for this zpool? Guidelines, best practices, tips.
 
It's not really about the size of the memory or the pool. Well, sort of, but the type of access has much more impact. If you have a bunch of large files (movie files for example) that are mostly read sequentially, those aren't even cached. Small files and lots of random reads, those could benefit from the L2ARC.

I added an L2ARC to my storage thinking it might improve performance, but after some time it turned out the cache was mostly empty and, like you, I had a really high miss rate. So it was mostly useless and I removed it again.
 
Log device: the amount of data that your system can write within vfs.zfs.txg.timeout (default is 5 or 10 seconds, depending on version), probably times 2 for reserve. That is usually only a couple of GB.
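As a very rough back-of-envelope sketch (the 10 GbE ingest rate below is an assumed worst case for illustration, not a figure from this thread):

Code:
# check the current transaction group timeout
sysctl vfs.zfs.txg.timeout

# assume the pool is fed over 10 GbE at ~1.25 GB/s:
#   1.25 GB/s * 5 s      = ~6.25 GB per txg
#   times 2 for reserve  = ~12-13 GB of log device
# anything beyond that will simply never be used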

The L2ARC is a bit more delicate. There is no simple rule-of-thumb.
Miss ratio of my L2-ARC is 95%
Mine is at 85%, and still it does exactly what I need it for: 1. keep the database mostly in the L2ARC, 2. hold the metadata for the large-files pool, so that the disks don't spin up just for a lookup, only when the files themselves are read.
That utilization figure is calculated in a way I do not fully understand, so it doesn't seem very helpful to me. But I clearly notice that on the first night after a reboot the nightly periodic run takes half an hour with the spinning disks at 100%, while on subsequent nights it takes 5 minutes and zpool iostat -v shows throughput on the cache device.
(But then, I have separate pools for the database, the large files and the other stuff, and therefore also separate L2ARC devices, so one kind of data cannot evict the other.)

It may be helpful to have a plan for what the L2ARC should specifically achieve. Then configure the secondarycache attribute of the filesystems accordingly, and size the L2ARC to that requirement.
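As a sketch (pool and dataset names are made up, just to illustrate the idea):

Code:
# cache data and metadata of the database dataset
zfs set secondarycache=all tank/db

# only cache metadata for the large-files dataset,
# so directory walks don't need the spinning disks
zfs set secondarycache=metadata tank/media

# keep bulk/backup data out of the L2ARC entirely
zfs set secondarycache=none tank/backup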
 
I added an L2ARC to my storage thinking it might improve performance, but after some time it turned out the cache was mostly empty and, like you, I had a really high miss rate. So it was mostly useless and I removed it again.
I once tried to add one on the desktop, for general acceleration, and found it utterly useless. Investing in a second SSD and running the frequently used filesystems entirely from it gives results that are an order of magnitude better.
 
I once tried to add one on the desktop, for general acceleration, and found it utterly useless. Investing in a second SSD and running the frequently used filesystems entirely from it gives results that are an order of magnitude better.
Just adding more memory will help a lot too. More room for ARC or filesystem caches (in case of UFS).

But yeah, there's no simple calculation to show how much ZIL or L2ARC you're going to need. It really depends on the load, what's being stored and how it's accessed. Just add some, and do a lot of measurements and performance tests. That's really the only way to find out what you need.
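A minimal way to do that, assuming a spare SSD partition (ada1p4 and the pool name tank are placeholders): add the device, let the normal workload run for a few days, then look at the numbers before deciding whether to keep it.

Code:
# add an L2ARC (cache) device to the pool
zpool add tank cache ada1p4

# watch per-vdev activity in 5-second intervals
zpool iostat -v tank 5

# check L2ARC hit/miss ratios after some days of normal use
zfs-stats -L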
 
One thing that is often missed is: L2ARC also needs RAM for its tables - sometimes A LOT of it. So adding a huge L2ARC effectively decreases the ARC size, thus hurting performance (badly!).
Basically you should first completely max out the RAM configuration before you consider an L2ARC, because RAM is still several orders of magnitude faster than even the fastest PCIe/NVMe drives.
A cache/L2ARC usually makes most sense on big NAS/SANs with a big set of 'hot' data. For most fileservers where only a small percentage of files are accessed regularly, normal ARC in RAM (with maxed out RAM configuration) is more than sufficient.
Use the zfs-stats tool to look at your ARC stats; if most of your ARC is used for "frequently used" blocks AND has a high hit rate (~80%+), an L2ARC can make sense. For a large "recently used" cache it doesn't make that much sense, as this only caches blocks that have been recently read and are held in cache in case they will be accessed again. Only blocks that are accessed multiple times before falling out of the ARC are counted towards the "frequently used" cache.
For caches with a low amount of "frequently used" blocks, your L2ARC will most likely have a hit ratio of <1% and will thus only block valuable space in RAM (~1-2GB per 100GB of cache).

Here's an example of this scenario on our old fileserver (maxed out at 64GB RAM), now mostly used for backup purposes:
Code:
# zfs-stats -A

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Jun  9 13:50:55 2021
------------------------------------------------------------------------
[...]
ARC Size:                               42.25%  25.89   GiB
        Target Size: (Adaptive)         42.31%  25.93   GiB
        Min Size (Hard Limit):          12.50%  7.66    GiB
        Max Size (High Water):          8:1     61.29   GiB
        Decompressed Data Size:                 160.21  GiB
        Compression Factor:                     6.19

ARC Size Breakdown:
        Recently Used Cache Size:       87.93%  22.80   GiB
        Frequently Used Cache Size:     12.07%  3.13    GiB
[...]

With the ARC mostly filled with recently used data, the L2ARC hit rate is terrible:

Code:
# zfs-stats -L

------------------------------------------------------------------------
ZFS Subsystem Report                            Wed Jun  9 13:59:38 2021
------------------------------------------------------------------------
[...]
L2 ARC Size: (Adaptive)                         68.31   GiB
        Decompressed Data Size:                 95.12   GiB
        Compression Factor:                     1.39
        Header Size:                    0.94%   915.03  MiB
[...]
L2 ARC Breakdown:                               6.03    b
        Hit Ratio:                      0.12%   7.42    m
        Miss Ratio:                     99.88%  6.02    b
        Feeds:                                  6.14    m

For the minuscule hit rate of 0.12% the L2ARC takes up 915MB of RAM (L2ARC header size) - nearly 1GB that would be put to better use if it were available to the ARC. (So I really should remove the L2ARC from that pool soon...)
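Removing it again is trivial; cache devices can be detached from a live pool without data loss, since the L2ARC only holds copies (device name below is a placeholder):

Code:
# find the device listed under the "cache" section
zpool status tank

# detach it from the pool
zpool remove tank ada1p4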


As for separate ZIL devices: this fully depends on the purpose of the pool. The ZIL is only used for synchronous writes, so usually just for some database configurations or VMs. Your normal "fileserver" workload will never touch the ZIL.
ZFS _always_ keeps a ZIL, but usually spreads it over _all_ available providers. So the ZIL is already quite fast and can easily handle the few synchronous writes that might occur with "normal" database configurations and even a few VMs (although Windows as a guest OS is always a PITA regarding sync writes, because it often decides to disable disk caches in a VM and forces sync writes for *everything*).
If you decide you really need a separate ZIL, _always_ use mirrored devices. If the ZIL gets lost or damaged, the pool might end up in an inconsistent state.
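A sketch of adding a mirrored log vdev (device names are placeholders; small partitions are enough, since only a few GB are ever used, see above):

Code:
# mirrored SLOG: a single SSD failure cannot take
# the most recent sync writes with it
zpool add tank log mirror nvd0p1 nvd1p1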
 
sko that is some good info right there. The biggest thing is "why do you think you need XYZ?" One needs to look at the data before one can say "adding XYZ may help".
Your specific workload has a lot to do with how to interpret that data.
zfs-stats is a good tool. Lots of information to sift through, then you need to understand what it means.
Think of how a typical desktop is used:
Boot, go into graphical environment, use a browser, use an editor, play some music.
Now keep that in mind with whats been said about the ARC, L2ARC, etc.
 
On performance: the L2ARC is adaptive, so it does not hurt much either.
On consistency: if I understood you correctly, sko, the ZIL is more fragile to sudden power outages ...
 
Think of how a typical desktop is used:
Boot, go into graphical environment, use a browser, use an editor, play some music.
Now keep that in mind with whats been said about the ARC, L2ARC, etc.

Almost no desktop and only some servers really need L2ARC and/or SLOG devices. In most cases it is by far the best solution to just use a "standard" pool, and only if you really run into bottlenecks or performance issues that can be solved with L2ARC or SLOG should you add them. Size and type then still depend on the actual workload.
For desktops, which nowadays run almost exclusively off SSDs, an L2ARC on another SSD makes absolutely no sense at all (except perhaps an NVMe device, but then you are still better off using that NVMe for the pool itself).

The L2ARC is adaptive, so it does not hurt much either.
Except for what I've said above: it costs memory, and with a low amount of frequently read data in your ARC you won't gain any performance but will still reduce your ARC size...
 
zfs-stats -E shows a bunch of good stuff.
Rereading a bit of ZFS Mastery by Michael W Lucas, I was reminded of this one key statement (basically supporting everything that sko was saying):

L2ARC will only cache items that fall off of ARC.
The important part is this from zfs-stats -E (a desktop system):
Code:
        CACHE HITS BY CACHE LIST:
          Most Recently Used:           11.18%  8.92    m
          Most Frequently Used:         88.73%  70.77   m
          Most Recently Used Ghost:     0.29%   227.57  k
          Most Frequently Used Ghost:   0.58%   460.02  k
Things move from MRU to MFU, then to the Ghosts. Ghosts represent things that have recently fallen off the lists.
If only a tiny fraction of hits is resolved from the ghost lists, L2ARC isn't really going to help. If that fraction is significant, then L2ARC will likely help.
 
One thing that is often missed is: L2ARC also needs RAM for its tables - sometimes A LOT of it. So adding a huge L2ARC effectively decreases the ARC size, thus hurting performance (badly!).
That can be estimated as ~1% of the L2ARC size (give or take). That needs to be put into the calculation.
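In numbers, using that ~1% rule of thumb (only a rough estimate; the real header overhead depends on the record sizes that end up in the cache):

Code:
# ~1% of the cache device size ends up as headers in RAM:
#   200 GB L2ARC -> ~2 GB of ARC space lost to headers
#   800 GB L2ARC -> ~8 GB of ARC space lost to headers
# compare against what zfs-stats actually reports:
zfs-stats -L | grep "Header Size"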

A cache/L2ARC usually makes most sense on big NAS/SANs with a big set of 'hot' data. For most fileservers where only a small percentage of files are accessed regularly, normal ARC in RAM (with maxed out RAM configuration) is more than sufficient.
For a database where the working set cannot fit into RAM, but can fit into the L2ARC, it works really well. The trade-off in RAM doesn't matter then, because the RAM cannot hold the working set anyway and is best used as L2ARC header space.

So it is really necessary to know the use case and how things work, and then either do measurements or do the math.
Code:
[...]
L2 ARC Size: (Adaptive)                         68.31   GiB
        Decompressed Data Size:                 95.12   GiB
        Compression Factor:                     1.39
        Header Size:                    0.94%   915.03  MiB
[...]
L2 ARC Breakdown:                               6.03    b
        Hit Ratio:                      0.12%   7.42    m
        Miss Ratio:                     99.88%  6.02    b
        Feeds:                                  6.14    m
Yeah, that's pointless. At these figures the data is indeed best accommodated directly in memory, where the ARC will do a much better job of optimizing what is actually needed.

That is the other point here: the L2ARC does not have the intelligent optimization of the ARC; it only stores whatever falls off the ARC. One would need to optimize manually, by properly partitioning the filesystems and setting secondarycache options.

As for separate ZIL devices: this fully depends on the purpose of the pool. The ZIL is only used for synchronous writes, so usually just for some database configurations or VMs. Your normal "fileserver" workload will never touch the ZIL.
NFS is by default 100% sync; it all goes through the ZIL. (Whether you want or need that is another question; I have switched mine to async.)
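For reference, switching an NFS-exported dataset to async on the ZFS side is a per-dataset property (tank/export is a made-up name); be aware that sync=disabled means writes the client believes are committed can be lost in a crash:

Code:
# treat all writes to this dataset as asynchronous
zfs set sync=disabled tank/export

# revert to honouring the clients' sync requests
zfs set sync=standard tank/export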
 
On performance: the L2ARC is adaptive, so it does not hurt much either.
On consistency: if I understood you correctly, sko, the ZIL is more fragile to sudden power outages ...
The ZIL is only for power outages. It is not a cache, it is an "intent log". It doesn't speed up writes by itself, it only makes sure that you don't need an fsck when restarting. (And since ZFS does not have fsck, you would be out of luck if you needed one.)

Therefore, as was already said, ZFS does always write a ZIL. But normally it writes the ZIL into the normal pool data, which may be on spinning disks and may be slow. And this is why a separate ZIL device on SSD can make sync writes faster. And then you must make sure that the thing doesn't fail, because when it fails you might need an fsck, which doesn't exist...
 
NFS is by default 100% sync; it all goes through the ZIL. (Whether you want or need that is another question; I have switched mine to async.)

Isn't the default to only write metadata synchronously and all data asynchronously?
At least that's also what mount(8) states:
Code:
             noasync
                     Metadata I/O should be done synchronously, while data I/O
                     should be done asynchronously.  This is the default.

Only the 'sync' flag enforces fully synchronous writes; 'async' means everything (including metadata) is written asynchronously.

Looking at the NFS sysctls, there is only 'vfs.nfsd.async', which is set to 0 by default, and the nfsd daemon doesn't have any flags regarding sync (because sync/async/noasync is set on the mounting side?).
 
Let's say I have a PC with X memory and a pool of size Y. How do you dimension the size of the log device (write cache) and cache device (read cache) for this zpool? Guidelines, best practices, tips.
I have 64GB of fast SSD L2ARC on my desktop and this seems to be good enough for regular use. The drive is bigger, but I partitioned it that way; I also have swap on the same drive. It depends on the usage profile of course, but in my case the cache is not even full all the time, it stays somewhere near 60GB. It is good that the L2ARC is persistent now with FreeBSD 13.0 🙂
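Carving a cache partition (plus swap) out of a larger SSD could look roughly like this; the device name, pool name, labels and sizes are all hypothetical:

Code:
# create the partition scheme once, if the disk is blank
gpart create -s gpt nvd0

# 64 GB partition for the L2ARC, plus swap on the same drive
gpart add -t freebsd-zfs -s 64G -l l2arc nvd0
gpart add -t freebsd-swap -s 16G -l swap0 nvd0

# attach the labelled partition as cache device
zpool add zroot cache gpt/l2arc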
 
Argentum out of curiosity, what is the output of zfs-stats -E under your typical load on that system? Particularly the "CACHE HITS BY CACHE LIST" section, then the MRU/MFU Ghost lines.
L2ARC gets populated by stuff that falls out of the ARC; the Ghosts represent "recently fell out of the ARC".
 
Argentum out of curiosity, what is the output of zfs-stats -E under your typical load on that system? Particularly the "CACHE HITS BY CACHE LIST" section, then the MRU/MFU Ghost lines.
L2ARC gets populated by stuff that falls out of the ARC; the Ghosts represent "recently fell out of the ARC".
Can answer that later. Not sitting behind my desktop now.
 
Isn't the default to only write metadata synchronously and all data asynchronously?
Yes, for normal mounts. But NFS is different. I did the mount, and then I saw all my writes go through the ZIL in full.
Then I searched and found some statements to the effect that it is totally obvious that nfsd must do sync writes. Now I can't even find those statements anymore. But here is one:

https://lists.freebsd.org/pipermail/freebsd-hackers/2014-November/046481.html: "Since you're using an NFS server, it cannot reply success to an operation till it is committed to stable storage"
 
There are countless posts on this topic online. These days there is little or no need for it. The RAM matters more. And if it is ECC, that's superb.
 
Argentum out of curiosity, what is the output of zfs-stats -E under your typical load on that system? Particularly the "CACHE HITS BY CACHE LIST" section, then the MRU/MFU Ghost lines.
L2ARC gets populated by stuff that falls out of the ARC; the Ghosts represent "recently fell out of the ARC".
Here it is:

Code:
------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Jun 10 17:35:28 2021
------------------------------------------------------------------------

ARC Efficiency:                                 46.85   m
        Cache Hit Ratio:                96.57%  45.24   m
        Cache Miss Ratio:               3.43%   1.61    m
        Actual Hit Ratio:               95.73%  44.85   m

        Data Demand Efficiency:         88.84%  9.20    m
        Data Prefetch Efficiency:       36.03%  274.54  k

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.57%   258.46  k
          Most Recently Used:           22.74%  10.29   m
          Most Frequently Used:         76.39%  34.56   m
          Most Recently Used Ghost:     0.28%   124.62  k
          Most Frequently Used Ghost:   0.03%   12.59   k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  18.06%  8.17    m
          Prefetch Data:                0.22%   98.91   k
          Demand Metadata:              80.36%  36.36   m
          Prefetch Metadata:            1.36%   616.62  k

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  63.95%  1.03    m
          Prefetch Data:                10.94%  175.63  k
          Demand Metadata:              15.86%  254.66  k
          Prefetch Metadata:            9.25%   148.44  k

------------------------------------------------------------------------
 
From my desktop PC:
Code:
ZFS Subsystem Report                            Thu Jun 10 16:55:26 2021
ARC Efficiency:                                 3.88    b
        Cache Hit Ratio:                99.97%  3.87    b
        Cache Miss Ratio:               0.03%   1.15    m
        Actual Hit Ratio:               99.93%  3.87    b

        Data Demand Efficiency:         99.27%  50.51   m
        Data Prefetch Efficiency:       62.78%  542.10  k

        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.02%   905.46  k
          Most Recently Used:           0.87%   33.85   m
          Most Frequently Used:         99.09%  3.84    b
          Most Recently Used Ghost:     0.01%   306.90  k
          Most Frequently Used Ghost:   0.01%   253.79  k

        CACHE HITS BY DATA TYPE:
          Demand Data:                  1.29%   50.14   m
          Prefetch Data:                0.01%   340.35  k
          Demand Metadata:              98.65%  3.82    b
          Prefetch Metadata:            0.05%   1.86    m

        CACHE MISSES BY DATA TYPE:
          Demand Data:                  32.07%  367.52  k
          Prefetch Data:                17.60%  201.75  k
          Demand Metadata:              40.12%  459.88  k
          Prefetch Metadata:            10.21%  116.98  k

L2 ARC Summary: (HEALTHY)
        Low Memory Aborts:                      209
        Free on Write:                          17.80   k
        R/W Clashes:                            1
        Bad Checksums:                          0
        IO Errors:                              0

L2 ARC Size: (Adaptive)                         10.15   GiB
        Decompressed Data Size:                 12.43   GiB
        Compression Factor:                     1.22
        Header Size:                    0.25%   32.40   MiB

L2 ARC Evicts:
        Lock Retries:                           15
        Upon Reading:                           0

L2 ARC Breakdown:                               1.14    m
        Hit Ratio:                      52.94%  601.71  k
        Miss Ratio:                     47.06%  534.91  k
        Feeds:                                  39.38   k

L2 ARC Writes:
        Writes Sent:                    100.00% 33.08   k
 
Code:
        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.57%   258.46  k
          Most Recently Used:           22.74%  10.29   m
          Most Frequently Used:         76.39%  34.56   m
          Most Recently Used Ghost:     0.28%   124.62  k
          Most Frequently Used Ghost:   0.03%   12.59   k

Code:
        CACHE HITS BY CACHE LIST:
          Anonymously Used:             0.02%   905.46  k
          Most Recently Used:           0.87%   33.85   m
          Most Frequently Used:         99.09%  3.84    b
          Most Recently Used Ghost:     0.01%   306.90  k
          Most Frequently Used Ghost:   0.01%   253.79  k
See how small the Ghost entries are? That is the stuff that would wind up in the L2ARC.
Alain De Vos looking at your numbers, especially the L2 stats, it just doesn't seem to me that the L2ARC is providing much of a benefit. But it's your system and your choice, so don't change anything just on my say-so.
 
... and L2ARC

Code:
------------------------------------------------------------------------
ZFS Subsystem Report                            Thu Jun 10 18:35:47 2021
------------------------------------------------------------------------

L2 ARC Summary: (HEALTHY)
        Low Memory Aborts:                      6
        Free on Write:                          4.69    k
        R/W Clashes:                            0
        Bad Checksums:                          0
        IO Errors:                              0

L2 ARC Size: (Adaptive)                         55.87   GiB
        Decompressed Data Size:                 78.85   GiB
        Compression Factor:                     1.41
        Header Size:                    0.14%   116.00  MiB

L2 ARC Evicts:
        Lock Retries:                           4
        Upon Reading:                           0

L2 ARC Breakdown:                               1.61    m
        Hit Ratio:                      29.55%  475.94  k
        Miss Ratio:                     70.45%  1.13    m
        Feeds:                                  84.89   k

L2 ARC Writes:
        Writes Sent:                    100.00% 16.04   k

------------------------------------------------------------------------

But this is all after I ran the Chia plotter recently. That affected my L2ARC seriously; it has not recovered yet.

Another thing is that the main disks are rotating. The L2ARC makes the system almost silent.
 