TCP connection failures under heavy NFS load (8.4)

I have an 8.4 host (bare metal, no virtualization) that serves as a backup destination for a few VMware hosts (weekly) and for rsync/rsnapshot backups on a daily basis. To complicate my life, there's also a monitoring server running in a jail.

What I'm finding is that sometimes an rsync run and a VMware snapshot copy (which goes over NFS) end up running at the same time (there aren't enough hours in the day to avoid some overlap). When this happens, I see what looks like network starvation. The monitoring host has a Java app that polls a number of wifi devices over ssh, and those polls start to fail; the ticket system that polls a POP mailbox for new tickets times out; and connections to the MySQL server from local web apps time out. As soon as the backup processes finish, everything returns to normal.

My knowledge of network tuning dates to the FreeBSD 4.x era; where should I be poking around to gather more evidence on what's being starved? I'm a bit lost because the combo of NFS (UDP) and rsync (TCP) seems to be the trigger.

Here's netstat -s output:

https://gist.github.com/sporkman/eae104710657436bc8af

I see obvious issues in the UDP section, but nothing is standing out in the TCP section...
 
Bump.

Love it when googling the problem brings up your own question as the first result. :)
 
On the TCP front, rsync is just going to open a single socket between the two hosts for the backup, so it won't impose much of a load. On the UDP front, maybe the drops are because the nfsd(8) threads are blocking on disk or something else. What state are the nfsd(8) threads in when performance has tanked? Try ps auxl | grep nfs. Are they waiting on disk? Does gstat indicate that the disks are saturated?
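Something along these lines, for example (flags from memory, so double-check the man pages on 8.4):

Code:
# show the nfsd threads and their wait channel (MWCHAN column)
ps auxl | grep nfsd
# per-disk latency and %busy, refreshed every second
gstat -I 1s
# NFS server request counters, once per second
nfsstat -s -w 1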
 
I think what's confusing me is that when this pileup of disk activity happens, networking is affected. For example, a jail on this host runs some Ubiquiti monitoring software, Nagios, and other typical monitoring stuff. During times when I find that interactive performance on the command line is horrid, I also start seeing even simple ping monitoring checks time out and report an error. That's what sent me in the network starvation/NFS direction.
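Next time it hits I'll also grab the buffer stats, in case mbufs or socket buffers are getting exhausted (just the stock counters, nothing exotic):

Code:
# mbuf usage, including any "requests for mbufs denied"
netstat -m
# per-protocol counters again, UDP drops in particular
netstat -s -p udp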

Last night this popped up again - a perfect storm where a periodic ZFS scrub, VMware backups, and rsync backups all happened at the same time.

Still ongoing; this is all ps is showing for nfsd:

Code:
ps auxl | grep nfs
...
USER      PID %CPU %MEM   VSZ   RSS  TT  STAT STARTED      TIME COMMAND            UID  PPID CPU PRI NI MWCHAN
root      748  0.0  0.0  5832   128  ??  Is   30Jan15   0:00.31 nfsd: master (nf     0     1   0  44  0 accept
root      749  0.0  0.0  5832    92  ??  S    30Jan15 950:21.38 nfsd: server (nf     0   748   0  44  0 rpcsvc

gstat is showing saturation - all drives are in the red, going between 75% and 100% (and for giggles, sometimes 104%).

Obviously, if I don't ask the box to do all this work in parallel, these problems disappear, but I feel like there should be some tuning available (likely in ZFS) to keep interactive response in check in exchange for slower backups. The network timeouts still have me totally stumped.
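For what it's worth, these are the scrub-throttling knobs I've been reading about (names and defaults may differ on 8.4, so treat the values as guesses):

Code:
# add more delay between scrub I/Os when the pool has other work to do
sysctl vfs.zfs.scrub_delay=20
# cap the number of outstanding I/Os queued per vdev
sysctl vfs.zfs.vdev.max_pending=4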

Perhaps this is just an 8.x thing.
 
I've certainly seen general latency on heavily loaded systems, but nothing like what you are describing. Knowing that you use ZFS is helpful: there is some potential that the steady stream of fixes from OpenZFS over the past two years could improve things on newer releases. Scrubs do get lower priority than access to real data, but there's still going to be some slowdown while one is running. It's just speculation, but with 8.4-RELEASE coming up on EOL this summer, it's worth keeping in mind that this may already be fixed on a newer release.

If you can, running truss -p <PID> on one of the processes that you know becomes unresponsive might provide some telling information as to why it's timing out. Preferably pick one that is lightly loaded, to reduce the noise.
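Something like this, for instance (the PID is obviously just a placeholder):

Code:
# attach to the unresponsive process, timestamp each syscall, log to a file
truss -d -o /tmp/truss.out -p 1234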
 
ZFS and NFS are known to be a problematic combination when there are lots of writes happening to the pool, because NFS by default insists on synchronous writes, and that stresses the ZIL (ZFS intent log) heavily. A common solution is to use a separate fast ZIL device, usually an SSD. Other options are to use async NFS mounts or to turn off the ZIL altogether.

https://wiki.freebsd.org/ZFSTuningGuide#NFS_tuning
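Roughly like this (pool, dataset, and device names are just examples; the sync property needs pool version 28, and if I remember right older pools used a vfs.zfs.zil_disable tunable instead):

Code:
# add a fast SSD as a dedicated intent-log (SLOG) device
zpool add tank log /dev/ada2
# or stop honouring synchronous writes on the dataset the VMs live on
# (risks losing the last few seconds of writes on a crash)
zfs set sync=disabled tank/vmware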
 
Thanks - I already took care of that issue. The NFS clients are the VMware hosts, and there are documented issues with that combination. Initially I disabled logging on the dataset VMware was using, and later added an SSD for the ZIL.
 
Just a quick update: I'm on 9.3 now. Things feel a bit more responsive when backups are running.

However I have some new/weird interactions that probably come down to "that's how ZFS and NFS work".

Here's the scenario. I build my kernels and world on this host and export /usr/src and /usr/obj over NFS. When my other hosts need to be updated, I use these remotely mounted directories. I find that if poudriere is running at the same time (it's not uncommon to be building something like openssl while I'm doing some other update), make installworld and make installkernel tend to bail out with "permission denied" or "file not found" errors (I only have an example of the former in scrollback):

Code:
===> drm2/radeonkmsfw/RS780_pfp (install)
install -o root -g wheel -m 555   radeonkmsfw_RS780_pfp.ko /boot/kernel
install -o root -g wheel -m 555   radeonkmsfw_RS780_pfp.ko.symbols /boot/kernel
===> drm2/radeonkmsfw/RV610_me (install)
cd: /usr/src/sys/modules/drm2/radeonkmsfw/RV610_me: Permission denied
*** [realinstall] Error code 2

Stop in /usr/src/sys/modules/drm2/radeonkmsfw.
*** [realinstall] Error code 1

It's never at the same place. I can re-run with poudriere stopped, and all is well.
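For reference, the client side is plain NFS, roughly this in fstab (the hostname and options here are just an illustration, not my exact config):

Code:
buildhost:/usr/src   /usr/src   nfs   rw,nfsv3,tcp   0   0
buildhost:/usr/obj   /usr/obj   nfs   rw,nfsv3,tcp   0   0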

Known weirdness or should I be digging for more info?
 