vfs.zfs.arc_max question

I have a FreeBSD 8.3 (64-bit) file server using ZFS with 16 GB of RAM. At the moment, vfs.zfs.arc_max is not tuned, so it defaults to about 1 GB less than the default vm.kmem_size of 15.4 GB. There are two pools with 20 TB and 64 TB of usable space, each with its own cache and log devices. I recently discovered that I am able to stall the server (it won't even respond to ping) with long rsyncs from an NFS client. Looking at the free memory pages (sysctl vm.vmtotal), it turns out that the server is running out of RAM; there is a sharp drop soon after the transfer starts. So I started digging for guidelines on reducing vfs.zfs.arc_max.

From the FreeBSD forums, what I understood initially was to set it like this: vfs.zfs.arc_max plus the RAM needed for other running applications should be less than or equal to vm.kmem_size. But when I searched a little further, there are opinions that vfs.zfs.arc_max is not an exact hard limit on the ARC size. It is more of a ballpark number which, when crossed, causes FreeBSD to start a thread or two to flush data to disk until the ARC shrinks back toward vfs.zfs.arc_min. In the meantime, if writes keep arriving faster than FreeBSD can flush them to disk, the ARC will grow past vfs.zfs.arc_max and could potentially use up all the available memory, causing a stall. Did I understand this correctly? If I did, vfs.zfs.arc_max is roughly like vm.dirty_background_ratio in Linux, where the flush to disk starts. In that case, to be on the safe side, vfs.zfs.arc_max should be less than half the RAM on the server, so there is less chance of the headroom between vm.kmem_size and vfs.zfs.arc_max being overrun by continuous writes.
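Whatever number it ends up being, my understanding is that it goes into /boot/loader.conf as a loader tunable and only takes effect after a reboot; the value below is just a placeholder from the conservative end of the reasoning above:

Code:
# /boot/loader.conf -- placeholder value; replace with whatever
# "arc_max + RAM for applications <= vm.kmem_size" (or "< half the RAM") gives
vfs.zfs.arc_max="6G"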

I found opinions like this one:
Sebulon said:
With 16 GB of RAM, about 15 is really available. Consider what other applications you have and make a rough guess at how much RAM they'd like, let's say another 4, so 15-4=11 should end you up with vfs.zfs.arc_max="11G". Comment out the rest and see your performance shoot through the roof.

This supports what I initially understood from the forum, and it kind of assumes that vfs.zfs.arc_max is a hard limit that the ARC size won't cross. (http://forums.freebsd.org/showthread.php?t=38979&highlight=vfs.zfs.arc_max)


The use case here is HPC: there are 15 NFS clients doing reads and writes at the same time while lots of users are running their compute jobs. If Sebulon's opinion is correct, then vfs.zfs.arc_max could be set between 10 and 11 GB. If not, I will go with setting vfs.zfs.arc_max to 6 GB or less and hope that the 7-8 GB of headroom I have won't get overrun. With not much load, I have just over 2 GB in free memory pages, and this drops quite fast once the rsync starts from the NFS client. In vmstat -m, it is the Solaris line that increases in memory usage when the rsync starts. Any opinion on this is much appreciated.

Thanks.
 
You're doing rsync over NFS? Why not rsync over SSH (with None cipher)? Just curious.

We don't have NFS enabled on our ZFS boxes, they just do rsyncs over SSH, and I have ARC set to "RAM - 4 GB". The biggest box has 128 GB of RAM, the smallest has 32 GB. Compression and dedupe enabled on all 4 of them.

With dedupe disabled, 16 GB is usually plenty. However, depending on what the boxes do, I'd just add RAM. :) You can never have too much RAM! :)

Also, on our boxes, I have primarycache and secondarycache set to metadata, so that actual file data is never cached. This is mainly due to having dedupe enabled on pools with 20-50 TB of storage in use, so the DDT is very large. Plus, rsync just looks at metadata first to determine which files to transfer, so having that in RAM already helps a lot more than caching the actual file data. Depending on your needs, you may want to play with those settings as well.
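For example, something along these lines (tank/data is just a placeholder dataset name; the properties are inherited by child datasets):

Code:
zfs set primarycache=metadata tank/data
zfs set secondarycache=metadata tank/data

# verify
zfs get primarycache,secondarycache tank/data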

Also, install sysutils/zfs-stats and look through the output of -A and -L to see how your ARC/L2ARC is being used.
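Roughly like this (the installation method depends on your FreeBSD version; the port name is the one above):

Code:
cd /usr/ports/sysutils/zfs-stats && make install clean

zfs-stats -A     # ARC summary: size, target, hit ratios
zfs-stats -L     # L2ARC summary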
 
Thanks @phoenix. This server is used for primary file storage, not for backups. NFS is the bread-and-butter solution for HPC clusters until there is a need to scale out with Lustre, FhGFS, etc. Coming back to this issue, I already have zfs-stats installed, and the ARC summary shows similar info to what I can get from vmstat -m by looking at the Solaris line. Our typical usage patterns for storage fall under the new jargon "big data". I noticed this issue when I ran a script that uses wget to fetch the 250 GB NCBI BLAST databases, and after a while the storage server stalled. The rsync test was a good way to emulate the constant writes, but that said, rsync over NFS is used very frequently here.

The idea is to prevent the server from getting overrun by excessive writes. I faced a similar issue with Linux storage servers a few years ago, and it was solved by decreasing vm.dirty_background_ratio to below 5 and increasing vm.dirty_ratio to above 80. This makes sure that disk flushes start early enough while still leaving headroom for the buffers to build up before they hit the upper RAM limit set by dirty_ratio, which would lead to a stall. The newer 3.x kernels have a throttling mechanism to prevent this upper limit from being reached too soon. I am hoping to use a similar approach here with FreeBSD and ZFS, so that flushes to disk start sooner and the ARC can continue to build before it hits the RAM limit. This would let the heavy writes continue without stalling.
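For comparison, the Linux-side tuning looked roughly like this (the exact figures are illustrative; only the "below 5" / "above 80" ranges are what mattered):

Code:
# /etc/sysctl.conf on the Linux storage servers
vm.dirty_background_ratio = 2    # start background writeback early (< 5)
vm.dirty_ratio = 90              # lots of headroom before writers are blocked (> 80)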

Please see http://lists.freebsd.org/pipermail/freebsd-stable/2010-September/059172.html

There are opinions that describe arc_max as a "high water mark" rather than a hard limit. This makes a difference in how the arc_max value should be set. If it is a high water mark that the ARC can grow beyond, then I would prefer arc_max to be less than half the RAM.

If it is a hard limit, then I would like arc_max to be as big as possible, leaving enough RAM for FreeBSD. What are your thoughts on this? Is arc_max a hard limit or a high water mark that can be crossed?
 
# Set TXG write limit to a lower threshold. This helps "level out"
# the throughput rate (see "zpool iostat"). A value of 256MB works well
# for systems with 4 GB of RAM, while 1 GB works well for us w/ 8 GB on
# disks which have 64 MB cache.

# NOTE: in <v28, this tunable is called 'vfs.zfs.txg.write_limit_override'.
vfs.zfs.write_limit_override=1073741824
https://wiki.freebsd.org/ZFSTuningGuide
 
Hi @pillai_hfx!

In my experience, arc_max has been a hard limit you could count on, but my systems are general-purpose file servers for NFS, CIFS, AFP and iSCSI, where other workloads may affect things differently. There was this memory leak reported a while back, but that was for 9.0 and NFS, so it's probably not related. You could try upgrading to 9.1 to see if that matters.

Don't really use rsync so I can't give any advice on that.

/Sebulon
 
I have found ZFS on FreeBSD getting into trouble when other file systems are used in parallel, as the vnode cache tries to push the ARC out of existence. What is the "inactive" memory on the machine doing while you perform the rsync? It would also prove that this is the problem if the same rsync over SSH does not interfere with the performance.
 
Crivens said:
I have found ZFS on FreeBSD getting into trouble when other file systems are used in parallel, as the vnode cache tries to push the ARC out of existence.

@pillai_hfx

I can add to that: our systems only use ZFS, just to avoid that kind of situation.

/Sebulon
 
To mitigate that, would it be possible to use a "file" for the ARC, maybe with its own pager, so the memory handling might be better? It would not need to be a real file, only to attach the memory in it to a vnode. This would, IMHO, remove a lot of these problems.
 
Thank you all for responding with suggestions. I did some more testing, and the problem I was facing is more or less solved. The main culprit was the primarycache setting; changing it to metadata, as mentioned by @phoenix, made the difference. Before changing this setting, I noticed the following behavior while tracking the free memory pages: once free memory drops below 1 GB, which is consistent with crossing the current arc_max setting, it starts to go up and down instead of constantly going down. I am guessing this is a tug of war between the aggressive writes and the flushes to disk that free up ARC. With file data caching disabled and primarycache limited to metadata, free memory stays between 1.5 GB and 2 GB during the write tests. Since there are 100 GB L2ARCs on both pools, the secondarycache should take care of read caching.

With zfs-stats -A, please have a look at the ARC Size section. The target size is shown as adaptive, the minimum size as a hard limit, and the maximum size as a high water mark. Unless this is a mistake in the script, it may be true that arc_max is a high water mark and not a hard limit. This makes sense to me, as there is no reason for writes to stall when there is still RAM available. As per the mailing list, crossing the mark leads to attempts to bring the ARC down to arc_min by flushing data to disk. I guess it may not be as essential as I thought to bring arc_max down to less than half the RAM, if data caching is disabled for primarycache.

I am planning to use ZFS for much bigger storage solutions down the road for HPC, as Btrfs is not showing signs of becoming production-ready soon, and it was good to figure out the small but critical details that could affect the long-term stability of the storage server. HPC is a usage scenario where workloads are not very predictable, unlike infrastructure use of storage. Preparing for the worst case and protecting the storage server from being overrun by the running compute jobs becomes as important as tuning for speed.
 
I did some further testing on the effects of setting primarycache. Apparently disabling data caching on primarycache also disables secondarycache! Please see
http://docs.oracle.com/cd/E26502_01/html/E29022/chapterzfs-db1.html

"ZFS uses the primarycache and secondarycache properties to manage buffering data in the ARC. Note that using the secondarycache (L2ARC) property to improve random reads also requires the primarycache property to be enabled."

I ran the IOR MPI benchmarks with two hosts. Here are the results:

Code:
with primarycache=metadata

Max Write: 103.61 MiB/sec (108.64 MB/sec)
Max Read:  45.72 MiB/sec (47.94 MB/sec)


with primarycache=all

Max Write: 104.20 MiB/sec (109.27 MB/sec)
Max Read:  210.32 MiB/sec (220.54 MB/sec)

As the results show, setting primarycache to metadata effectively disables the L2ARC. The writes used to be close to 200 MB/sec before I had to slice the same SSD into the log and cache devices. To avoid taking a hit in read performance, I guess it is not a bad idea to set arc_max a bit low, to about half the RAM in this case, leave data caching enabled, and let the ARC grow beyond the high water mark into the headroom between arc_max and free memory.
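For what it's worth, the runs were along these lines; the mount point, transfer and block sizes are illustrative, and the exact flags depend on the IOR version and MPI stack:

Code:
# one IOR process per NFS client host
mpirun -np 2 -machinefile hosts ior -w -r -t 1m -b 4g -F -o /mnt/tank/ior_testfile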
 
pillai_hfx said:
This makes sense to me, as there is no reason for writes to stall when there is still RAM available.

ZFS isn't just going to ignore vfs.zfs.arc_max and keep buffering writes. Once it hits vfs.zfs.arc_max it's going to do everything it can to not go over it.

By the way, did you set vfs.zfs.write_limit_override? It limits how much data ZFS buffers before flushing.
 
pillai_hfx said:
I did some further testing on the effects of setting primarycache. Apparently disabling data caching on primarycache also disables secondarycache!

Semi-correct. The L2ARC is populated by the l2arc_feed_thread (kernel process), which scans the ARC looking for entries that are about to be evicted and copies them into the L2ARC.

Thus, only what's cached in the ARC can be cached in the L2ARC. :) So, you can set primarycache to all and secondarycache to metadata and everything works; but if you set primarycache to metadata and secondarycache to all, only metadata will be cached in both.

IOW, the primarycache setting doesn't "disable" secondarycache; it just limits what can be cached in the L2ARC.
 
Toast said:
ZFS isn't just going to ignore vfs.zfs.arc_max and keep buffering writes. Once it hits vfs.zfs.arc_max it's going to do everything it can to not go over it.

By the way, did you set vfs.zfs.write_limit_override? It limits how much data ZFS buffers before flushing.

Thanks. I haven't tried vfs.zfs.write_limit_override yet; it is still set to zero. The more I read about it, the less sure I am what the total implications of this setting would be. vfs.zfs.write_limit_override is supposed to limit the size of ZFS transaction groups to match the I/O capabilities of the underlying vdevs, if I understood correctly. According to Solaris-related mailing lists and blogs, there can be three or so transaction groups active at any time. There are two more transaction-group-related sysctls I could find, vfs.zfs.txg.synctime_ms and vfs.zfs.txg.timeout, and I am reading up on what they actually do.
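For anyone following along, the current values can be checked (and, as far as I understand, write_limit_override can be changed at runtime) like this:

Code:
sysctl vfs.zfs.write_limit_override
sysctl vfs.zfs.txg.synctime_ms
sysctl vfs.zfs.txg.timeout

# e.g. a 1 GB transaction group limit, as in the earlier post:
# sysctl vfs.zfs.write_limit_override=1073741824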

For my purposes, it looks like reducing arc_max from the current setting of 14.4 GB to 12 GB should reduce the chances of the server running out of RAM during heavy writes. Since I am quite happy with the performance of the storage, I am trying to avoid any ZFS throttling settings unless I really have to use them. The next idle window for a reboot is several months from now, and I am planning to do more tests after I get the chance to reboot the server to apply the new arc_max setting.
 
pillai_hfx said:
Thanks. I haven't tried vfs.zfs.write_limit_override yet; it is still set to zero. The more I read about it, the less sure I am what the total implications of this setting would be. vfs.zfs.write_limit_override is supposed to limit the size of ZFS transaction groups to match the I/O capabilities of the underlying vdevs, if I understood correctly. According to Solaris-related mailing lists and blogs, there can be three or so transaction groups active at any time. There are two more transaction-group-related sysctls I could find, vfs.zfs.txg.synctime_ms and vfs.zfs.txg.timeout, and I am reading up on what they actually do.

For my purposes, it looks like reducing arc_max from the current setting of 14.4 GB to 12 GB should reduce the chances of the server running out of RAM during heavy writes. Since I am quite happy with the performance of the storage, I am trying to avoid any ZFS throttling settings unless I really have to use them. The next idle window for a reboot is several months from now, and I am planning to do more tests after I get the chance to reboot the server to apply the new arc_max setting.

Any updates on this?
 
The problem got fixed. I eventually set arc_max to 8 GB. The problem was not really ZFS-related: I had set kern.ipc.nmbclusters and a few other buffers to very high values. In a RAM-constrained situation, this caused the server to stall. Once I removed all of those custom settings, the server has been working fine. I have been monitoring the free memory pages for a while, and everything looks OK.
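For anyone hitting something similar, this is roughly how I checked the mbuf cluster usage against the configured limit before removing the custom settings:

Code:
netstat -m                        # mbuf/cluster usage and denied requests
sysctl kern.ipc.nmbclusters       # the configured cluster limit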
 
pillai_hfx said:
The problem got fixed. I eventually set arc_max to 8 GB. The problem was not really ZFS-related: I had set kern.ipc.nmbclusters and a few other buffers to very high values. In a RAM-constrained situation, this caused the server to stall. Once I removed all of those custom settings, the server has been working fine. I have been monitoring the free memory pages for a while, and everything looks OK.

I see. Caused by bumping kern.ipc.nmbclusters... Maybe you need to tweak your NIC and route MTU, jumbo frames, etc. Anyway, good for you.
 