NFS woes

Hi, all. We have a simple setup: one NFS server and many clients having RW access. Server setup is as follows:

/etc/rc.conf:
Code:
nfs_server_enable="YES"
nfsv4_server_enable="YES"
nfs_server_flags="$nfs_server_flags --maxthreads 2048"
mountd_flags="$mountd_flags -p 878"
rpcbind_enable="YES"
rpcbind_flags="-s"
rpc_lockd_enable="YES"
rpc_statd_enable="YES"

/etc/exports:
Code:
V4: /path/to/storage -network 100.64.0.0 -mask 255.255.0.0

And each client is created and destroyed dynamically. Normally around 30-40 running at any given time:
Code:
rpc_lockd_enable="YES"
rpcbind_enable="YES"
rpcbind_flags="-s"
rpc_lockd_flags="-h 127.0.0.1"       # Flags to rpc.lockd (if enabled).
nfs_client_enable="YES"
rpc_statd_enable="YES"

And the mountpoint is mounted like this when starting:
Code:
/sbin/mount -t nfs -o nfsv4,nosuid,acregmin=3600,acregmax=86400,acdirmin=3600,acdirmax=86400 masterdb.local:/$project_name/ /mnt

(The ac* options are my attempt to fix the problems.)
Now the problem: more often than not the mount hangs, or file access hangs, or `df` hangs, or umount hangs, which all makes my life miserable. A plain `mount` call never hangs, though, and displays the list of mounted filesystems; sometimes the mount on /mnt is duplicated literally hundreds of times. That is probably the result of me trying to umount /mnt and mount it again whenever file access (`cp` from the server to local) fails - the umount/mount probably fail but leave the mount around.

If this wasn't enough, nfsd on the server chews up a lot of system CPU (with no meaningful disk access at that time according to `gstat`): literally 100% CPU for about a minute, then it eases off for 30-40 seconds or so. I can "fix" the CPU chewing by not trying to umount/mount on the clients when `cp` fails, but that leaves the clients inoperable without the files they need. I'm not really sure how to fix this NFS setup and I'm open to any ideas. The NFS server also constantly logs lines like this:

Code:
nfsrv_cache_session: no session IPaddr=100.64.87.21, check NFS clients for unique /etc/hostid's
(Needless to say, /etc/hostid is unique on each client.)
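For reference, this is roughly the cleanup my scripts would need to do instead of blindly remounting (just a sketch - the forced unmount and the paths are illustrative, I'm not claiming this is the proper fix):
Code:
#!/bin/sh
# unmount any stacked mounts on /mnt before mounting again
while mount -p | awk '$2 == "/mnt"' | grep -q .; do
    umount -f /mnt || break   # force it; give up if even that fails
done
mount -t nfs -o nfsv4,nosuid masterdb.local:/$project_name/ /mnt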
 
If other protocols work better, I could try them. Whatever it is, it must support shared file locking to control the writing, and that's it. NFS supports that, but it's either buggy or badly misconfigured by me.
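For context, the locking I need is nothing fancier than this kind of thing (a rough sketch; the lock file name and timeout are made up):
Code:
# serialise writers on the share with a lock file that lives on the NFS mount
lockf -k -t 30 /mnt/write.lock sh -c 'cp /local/results.dat /mnt/results.dat'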
 
Now the problem: more often than not the mount hangs, or file access hangs, or `df` hangs, or umount hangs, which all makes my life miserable.
Is there a firewall between the NFS server and the clients?
 
Which FreeBSD version is the server running on?

/etc/exports:
Code:
V4: /path/to/storage -network 100.64.0.0 -mask 255.255.0.0
Is that all in /etc/exports?

Because the V4: line doesn't export any file systems, only marks the root of the NFSv4 tree.

exports(5)
Code:
     if the -alldirs flag has not been specified.  The third form has the
     string ``V4:'' followed by a single absolute path name, to specify the
     NFSv4 tree root.  This line does not export any file system, but simply
     marks where the root of the server's directory tree is for NFSv4 clients.
     The exported file systems for NFSv4 are specified via the other lines in
     the exports file in the same way as for NFSv2 and NFSv3.

Furthermore,
Code:
     mapped to user credentials on the server.  For the NFSv4 tree root, the
     only options that can be specified in this section are ones related to
     security: -sec, -tls, -tlscert and -tlscertuser.
so, no -network -mask
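So something along these lines in /etc/exports should be closer (just a sketch - adjust the path and add -maproot etc. as needed):
Code:
/path/to/storage -alldirs -network 100.64.0.0 -mask 255.255.0.0
V4: /path/to/storage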
 
Is there a firewall between the NFS server and the clients?
There is (ipfw), but I don't think the problem is firewall-related, because everything works about half of the time.

On the server:
Code:
01006  58041150  10578308138 allow tcp from table(0) to me 5432,111,718,823,873,878,885,958,2049,6379,8118,8123,9000,9004,9009,9200,9201,9202,9300,9301,9302,11212,26379

01011         0            0 allow udp from table(0) to me 111,823,878,885,958
01016 64356355 17474382672 allow udp from table(1) to me 41641
Table 0 holds the tailnet 100.* addresses of each connected client, table 1 their regular IPv4 addresses; 41641 is Tailscale's transport port.

On the clients, everything is allowed from the server:
Code:
01001  4022485 1330248448 allow udp from 1.2.3.4 to me 41641 // server
01006  4005204 1064935815 allow ip from 100.64.240.7 to me // server
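For what it's worth, this is how I quickly check from a client that the NFS port is reachable at all through ipfw/Tailscale (nothing fancy):
Code:
# from a client: can we open TCP 2049 on the server?
nc -z -w 5 masterdb.local 2049 && echo open || echo blocked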
 
Which FreeBSD version is the server running on?


Is that all in /etc/exports?

Because the V4: line doesn't export any file systems, only marks the root of the NFSv4 tree.


Forgot to mention, there's also /etc/zfs/exports:
Code:
/path/to/storage     -maproot=normaluser -alldirs -network 100.64.0.0 -mask 255.255.0.0
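(If I'm not mistaken, that file is maintained by ZFS itself; the line comes from the dataset's sharenfs property, set roughly like this - the dataset name here is made up:)
Code:
zfs set sharenfs="-maproot=normaluser -alldirs -network 100.64.0.0 -mask 255.255.0.0" tank/path/to/storage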
 
I have an NFS server that feeds mainly diskless boot machines and hence gets hammered a bit. I have not observed such problems.
 
I would assume this is related to the Tailscale VPN (a UDP-based protocol) and NFS being run on top of it using UDP.

Did you try using TCP instead of UDP when mounting the NFS share on the clients?
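E.g. something like this on the client, forcing TCP explicitly (untested, paths as in your first post):
Code:
mount -t nfs -o nfsv4,tcp,nosuid masterdb.local:/$project_name/ /mnt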
 
What else is running on the machine? Is it something like poudriere or other app that continually mounts/unmounts filesystems? This can overwhelm mountd, resulting in the NFS server (not the machine itself) becoming deadlocked. The workaround is to stop mountd and use NFSv4 without it. If you're using NFSv3 you still need mountd, so you'll need to migrate to NFSv4 before attempting this workaround.
 
I suppose Tailscale does its own transport handling at the application layer, so that shouldn't be an issue. PostgreSQL etc. works fine over Tailscale between the same client/server machines; it's only NFS that has problems.
NFS itself uses TCP by default - see the ipfw counters I posted above, the UDP rules are at zero. So basically it's TCP over UDP.
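This is what I used on a client to double-check which transport and options are actually in effect (both tools are in the base system):
Code:
nfsstat -m                   # options the kernel actually negotiated for each NFS mount
sockstat -4 -P tcp -p 2049   # any TCP connections on the NFS port?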
 
What else is running on the machine? Is it something like poudriere or other app that continually mounts/unmounts filesystems?
It's me doing umount+mount in the scripts on the client machines trying to fix the failing NFS accesses.

This can overwhelm mountd, resulting in the NFS server (not the machine itself) becoming deadlocked. The workaround is to stop mountd and use NFSv4 without it. If you're using NFSv3 you still need mountd, so you'll need to migrate to NFSv4 before attempting this workaround.
Hm, mountd is running but I don't think it's in use for NFSv4? It was needed on v3, but then I switched to v4 because v3 was even worse.
I think I'm currently using nfsv4 exclusively, as stated by the mount flags in the first post.
 
Here's how mountd is currently run on the server:
Code:
/usr/sbin/mountd -r -S -p 878 /etc/exports /etc/zfs/exports
Not sure if NFSv4 relies on it or not.
 
It's me doing umount+mount in the scripts on the client machines trying to fix the failing NFS accesses.


Hm, mountd is running but I don't think it's in use for NFSv4? It was needed on v3, but then I switched to v4 because v3 was even worse.
I think I'm currently using nfsv4 exclusively, as stated by the mount flags in the first post.
Sure, it's not used by NFSv4. But it still issues locks in the kernel, which results in the behaviour you're seeing. It doesn't matter that you don't use NFSv3; there's still an impact.
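If the server is on a reasonably recent FreeBSD, there is an NFSv4-only server mode worth reading up on; from memory the rc.conf(5) knob looks roughly like this (do check rc.conf(5) and nfsd(8) on your release for what exactly it disables before relying on it):
Code:
nfs_server_enable="YES"
nfsv4_server_enable="YES"
nfsv4_server_only="YES"   # NFSv4-only service, see rc.conf(5) for details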
 