ZFS NFS server stalling

Hi,

I have a FreeBSD 11 NFS server with two Intel 10 Gb interfaces serving 18 Linux NFS clients. Unfortunately the NFS server stops responding a few times a day, for several minutes at a time. I can still ssh to the NFS server when this happens, but the server is idle and there is very little traffic on the 10 Gb interfaces. I had this issue with 10.3 and upgraded to 11 today. I disabled all offloads on the NICs and it made no difference.

The Linux clients use NFSv3 over TCP, and the number of NFS server threads is set to 128. The server has 64 GB of RAM and 6 cores, with the ARC capped at 50 GB. Jumbo frames are in use. I have disabled Energy Efficient Ethernet on the switch just to be sure, but since I can ssh to the NFS server without issues, the switch doesn't seem to be the problem. netstat -m doesn't show any denied or delayed instances, and when the server works, the performance is excellent.

Has anybody seen this behaviour before? Thanks.
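For completeness, the relevant bits of the configuration look roughly like this (ix0 is a placeholder interface name and the address is made up; adjust for your hardware):

Code:
# /etc/rc.conf
nfs_server_enable="YES"
nfs_server_flags="-u -t -n 128"    # 128 nfsd threads
# jumbo frames, all offloads disabled
ifconfig_ix0="inet 10.0.0.1/24 mtu 9000 -tso -lro -rxcsum -txcsum"

# /boot/loader.conf
vfs.zfs.arc_max="53687091200"      # cap the ARC at 50 GB (value in bytes)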
 
I had similar issues in the past with a 96 TB storage server. It was using two Intel SSDs as CACHE devices and was affected by the following bug: https://forums.freebsd.org/threads/47540/
At some point the issue was fixed, but later on I would see the storage with all processes stuck in the "ZIO>I" state.
The problem was solved by detaching the SSDs, dd'ing them, and then re-attaching them.
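From memory, the cycle was roughly the following (pool name "tank" and device names da2/da3 are placeholders; the dd wipes the SSDs completely):

Code:
# detach the cache devices from the pool
zpool remove tank da2 da3

# zero the SSDs (destroys everything on them)
dd if=/dev/zero of=/dev/da2 bs=1m
dd if=/dev/zero of=/dev/da3 bs=1m

# re-attach them as cache devices
zpool add tank cache da2 da3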
 
Out of the four pools, only one had a cache device, and its hit ratio was less than one percent. I have removed the cache, and hopefully this improves the situation. Thanks a lot.
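For reference, the hit ratio can be estimated from the L2ARC counters in arcstats (hits divided by hits plus misses):

Code:
sysctl kstat.zfs.misc.arcstats.l2_hits
sysctl kstat.zfs.misc.arcstats.l2_misses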
 
After digging a fair bit, it looks like I may have found the cause. Here it is:

https://svnweb.freebsd.org/base?view=revision&revision=280930

The server has 64 GB of RAM, and the OS auto-tunes nmbclusters to use about 8 GB of it. Each mbuf cluster is 2 KB. Also see svc.c in the sources, which says the following:

"Don't use more than a quarter of mbuf clusters"

A quarter of nmbclusters is ~2 GB. With two 10 Gb interfaces it takes only a second or two to reach that limit, and the OS then starts throttling. The behaviour I am seeing is very consistent with that: traffic exceeds 1 GB/s one moment and stalls to a few kilobytes per second the next. At that point the client mounts hang due to the throttling, and then the NFS server picks up traffic again.

The lesson is that busy NFS servers with 10 Gb interfaces should take care to raise nmbclusters within the available RAM, to push the throttling point out as far as possible. Throttling itself is a good thing though, even at the expense of hung NFS client mounts. The alternative is the NFS server being overrun by its clients, which is what I faced with old 2.6-series Linux kernels and hardware RAID a long time ago.
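To put numbers on it: 8 GB of clusters at 2 KB each is about 4M clusters, so a quarter is ~1M clusters, or ~2 GB, and two saturated 10 Gb links (~2.5 GB/s combined) can fill that in under a second. Checking and raising the limit looks roughly like this (the 8M-cluster value, ~16 GB, is only an example; size it against your RAM):

Code:
# current mbuf/cluster usage and limits
netstat -m
sysctl kern.ipc.nmbclusters

# raise the limit at runtime (8388608 clusters * 2 KB ~= 16 GB)
sysctl kern.ipc.nmbclusters=8388608

# or persistently, in /boot/loader.conf
kern.ipc.nmbclusters="8388608"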
 