ZFS tuning metadata

Just a heads up.

I recently analysed a lot of servers using zfs-stats and the 'kstat.zfs.misc.arcstats' sysctls. I observed that the vast majority of cache hits were on metadata; the MRU cache was almost useless on every single server, whilst the MFU was working brilliantly. There was some data caching going on, but metadata was the priority.
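
For reference, these are roughly the counters involved (a sketch assuming a FreeBSD box; the exact kstat names can differ a little between ZFS versions):

Code:
# overall ARC hits/misses and the MRU/MFU split
sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.mru_hits kstat.zfs.misc.arcstats.mfu_hits
sysctl kstat.zfs.misc.arcstats.mru_size kstat.zfs.misc.arcstats.mfu_size
# how much of that is metadata vs data, and the metadata limit
sysctl kstat.zfs.misc.arcstats.demand_metadata_hits kstat.zfs.misc.arcstats.demand_data_hits
sysctl kstat.zfs.misc.arcstats.arc_meta_used kstat.zfs.misc.arcstats.arc_meta_limit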

The servers previously didn't have a lot of tuning: basically prefetch disabled (as I wanted maximum random I/O at the expense of some sequential I/O), the ARC capped but still allowed to use more than half of RAM, and the metadata limit raised from its default 25% to around 50% of the ARC size.
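
On FreeBSD that sort of tuning usually sits in /boot/loader.conf and looks roughly like this (values purely illustrative, and the tunable names have shifted a bit in newer OpenZFS-based releases, so check what your system exposes):

Code:
# disable file-level prefetch, favouring random I/O over sequential
vfs.zfs.prefetch_disable="1"
# cap the ARC (example value)
vfs.zfs.arc_max="17G"
# let metadata use ~50% of the ARC instead of the default 25%
vfs.zfs.arc_meta_limit="8G"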

After analysing the stats I determined my ARC cap was way too high: it still filled up, but with lots of MRU data that had poor hit rates, so it was just wasting RAM. As an example, on one server the MFU was about 7.6 GB while the ARC was capped at 17 GB. 80% of cache hits were metadata, and metadata was limited to 8 GB with 6.5 GB actively used. So only about 1.1 GB of the MFU was data (the 7.6 GB of MFU minus the 6.5 GB of metadata).

The tuning I applied was to reduce the ARC to 10 GB (leaving some wiggle room over the current MFU usage) and increase the metadata limit to 80% of the ARC.
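
In loader.conf terms the change boils down to something like this (again only a sketch; pick values based on your own MFU and metadata numbers):

Code:
# shrink the ARC to a bit above the observed MFU size
vfs.zfs.arc_max="10G"
# allow metadata to use ~80% of the (now smaller) ARC
vfs.zfs.arc_meta_limit="8G"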

I feel the stock 25% metadata limit is way too low based on what I am observing on every single server. There has been no performance loss from the changes, so I have essentially freed up 7 GB of RAM on that server which can be used for a larger InnoDB buffer pool or something else. The overall ARC hit rate is the same as before, around 99.7%.

I found other discussions on the mailing lists with similar findings, where people were actually advised to raise the metadata limit to the full ARC size in those cases.

I plan to enable prefetch again just to see if it helps any of the servers, so bear in mind these findings are with prefetch off, and that different workloads can also have different caching patterns.
 
Hi,

I was going to open a new thread, but I think yours broadly covers what I was going to ask.

Firstly, can you elaborate a little more on how you collected and analysed the data?

Secondly, I'm curious about why you chose to cap the ARC, rather than let ZFS use whatever free memory there is? Have you run into situations where the ARC is not flushed quickly enough to satisfy a malloc request?

====

Onto my specific question:

I've recently been experimenting with an NVMe SSD L2ARC in front of a couple of mirrored HDDs, but I'm frustrated that zfs-stats (and, it looks like, the kstat.zfs.misc.arcstats.l2_* sysctls) focus on hits - ignoring data size - rather than (say) percentage of bytes fetched from cache versus bytes fetched from storage.

For example, the reported hit rate on my L2ARC is kind of meh (about 25% after 4 days) but the SSD is currently reading at 150MB+ per second sustained (writes less than 2MB/sec), which suggests there's a fair amount of successful read caching happening.

So, is there a way to calculate ARC/L2ARC hits by volume?
 
Secondly, I'm curious about why you chose to cap the ARC, rather than let ZFS use whatever free memory there is? Have you run into situations where the ARC is not flushed quickly enough to satisfy a malloc request?
If you search the forums you'll find many people having issues with this, especially in combination with other memory-hungry applications like MySQL/MariaDB. It's best not to rely on the automatic management and simply put a limit on the ARC. Otherwise you're likely to end up in a situation where various applications (including the ARC) are battling for the same memory, never agreeing, and eventually everything stalls.
 
Secondly, I'm curious about why you chose to cap the ARC, rather than let ZFS use whatever free memory there is?
Basically ZFS is a junkie when it comes to memory: it doesn't care about anyone else.
There seems to be a natural point at which ZFS decides it has "enough" memory, but that point is quite high up, especially if you have directory trees with many files.

Have you run into situations where the ARC is not flushed quickly enough to satisfy a malloc request?

ZFS is not very fast at freeing memory, so in fact the struggle is first pushed onto the pagedaemon (or whatever grabs inactive memory and puts it to new use). The question is whether we want that struggle to happen.

I'm frustrated that zfs-stats (and, it looks like, the kstat.zfs.misc.arcstats.l2_* sysctls) focus on hits - ignoring data size - rather than (say) percentage of bytes fetched from cache versus bytes fetched from storage.

Performance matters. Incrementing a counter comes at almost no cost; working out a size and adding it to another value would need additional instructions.
I think the hits are counted per record, with sizes somewhere between the sector size and the record size, depending on the application.
For some analysis it should be quite easy to put in a counter at a DTrace probe.
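
For a rough byte-based view of the L2ARC specifically, something along these lines should get you in the ballpark (just a sketch: it assumes the l2_read_bytes counter is present on your ZFS version, and 'tank' is only a placeholder pool name):

Code:
#!/bin/sh
# sample L2ARC bytes read 60 seconds apart and report the delta
a=$(sysctl -n kstat.zfs.misc.arcstats.l2_read_bytes)
sleep 60
b=$(sysctl -n kstat.zfs.misc.arcstats.l2_read_bytes)
echo "bytes served from L2ARC in the last 60s: $((b - a))"
# compare that against what the pool's HDDs read over the same window,
# e.g. the bandwidth column of: zpool iostat -v tank 60 2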
 