Solved NFS woes

Speedy · Jul 19, 2017

I had to kill a process and apparently mistyped the PID, killing an NFS instance instead. As a result, a Linux machine can no longer connect; all attempts time out. As a last resort I even rebooted both boxes, still no joy. FreeBSD 11.0-RELEASE, NFSv3 share.
Any clues where to look?

SirDice · Jul 19, 2017

NFS should automatically recover when it's back up again. But maybe the Linux client is getting errors and therefore failing?

Check a few of the obvious things. showmount(8): is the share still exported? rpcinfo(8): are all the necessary RPC services correctly registered?
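As a rough illustration of the rpcinfo(8) check, here is a minimal Python sketch that parses `rpcinfo -p` output and reports which of the usual NFSv3 services are not registered. It assumes the common column layout (program, vers, proto, port, service); the function names are hypothetical, and the program numbers are the standard ONC RPC assignments.

```python
import subprocess

# Standard ONC RPC program numbers an NFSv3 server normally registers.
REQUIRED_PROGRAMS = {
    100000: "rpcbind/portmapper",
    100003: "nfs",
    100005: "mountd",
    100021: "nlockmgr",
    100024: "status (rpc.statd)",
}


def parse_rpcinfo(text):
    """Return a set of (program, version, proto) tuples from `rpcinfo -p` output."""
    entries = set()
    for line in text.splitlines():
        fields = line.split()
        # Data lines start with a numeric program number; the header does not.
        if len(fields) >= 4 and fields[0].isdigit() and fields[1].isdigit():
            entries.add((int(fields[0]), int(fields[1]), fields[2]))
    return entries


def missing_programs(text):
    """Sorted names of required RPC programs that are not registered at all."""
    registered = {prog for prog, _vers, _proto in parse_rpcinfo(text)}
    return sorted(name for prog, name in REQUIRED_PROGRAMS.items()
                  if prog not in registered)


def check_host(host):
    # Hypothetical helper: runs the real rpcinfo(8) binary against `host`
    # and raises CalledProcessError if rpcinfo itself fails.
    out = subprocess.run(["rpcinfo", "-p", host], capture_output=True,
                         text=True, check=True).stdout
    return missing_programs(out)
```

For example, `check_host("192.168.2.254")` run from a client should return an empty list when everything is registered. Program numbers are compared rather than the service-name column, since the names printed can differ between systems.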

Speedy · Jul 19, 2017

Thanks for the reply. Yes, the share is still exported, and all other boxes connect without problems. The Linux box in question mounts other NFS shares successfully.
Summary:
One Linux box connects to other boxes and mounts NFS shares successfully, but cannot connect to the FreeBSD server any more; it times out.
Other Linux boxes connect to the FreeBSD server successfully.
No firewall is involved.

Me scratching head.

SirDice · Jul 19, 2017

Speedy said:
One Linux box connects to other boxes and mounts NFS shares successfully, but cannot connect to the FreeBSD server any more; it times out.

Timeouts usually mean firewall or routing issues. There's a huge difference between a "connection failed" and "connection timed out". Use tcpdump(8) on the FreeBSD host and see if any of the requests from the Linux client actually make it to the server. Your traffic may be dropped somewhere in between.
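One way to tell the two cases apart without tcpdump is a plain TCP connect with a timeout. This Python sketch (the helper name `probe_tcp` is hypothetical) reports "refused" when the connection is actively rejected and "timeout" when nothing comes back at all, which is the pattern that points at dropped packets.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout' or 'error: ...'.

    'refused': something actively rejected the connection (usually a TCP
    reset), so the host is reachable but nothing accepts on that port.
    'timeout': no answer at all, typical of a firewall or routing problem
    silently dropping packets.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # e.g. "No route to host" reported by the local stack.
        return "error: " + (exc.strerror or str(exc))
```

`probe_tcp("192.168.2.254", 2049)` tests nfsd and port 111 tests rpcbind. NFSv3 clients may also use UDP, which this TCP-only check does not cover.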

Speedy · Jul 19, 2017

It does connect. The way I see it, there is a stale record somewhere causing the trouble.

Code:

 tcpdump -i em0 host 192.168.2.57
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on em0, link-type EN10MB (Ethernet), capture size 262144 bytes
10:26:57.103394 IP 192.168.2.57.45016 > turtle.sunrpc: Flags [S], seq 762707414, win 29200, options [mss 1460,sackOK,TS val 2889165310 ecr 0,nop,wscale 7], length 0
10:26:57.103452 IP turtle.sunrpc > 192.168.2.57.45016: Flags [S.], seq 4061648648, ack 762707415, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 3735043158 ecr 2889165310], length 0
10:26:57.104023 IP 192.168.2.57.45016 > turtle.sunrpc: Flags [.], ack 1, win 229, options [nop,nop,TS val 2889165310 ecr 3735043158], length 0
10:26:57.104774 IP 192.168.2.57.45016 > turtle.sunrpc: Flags [P.], seq 1:61, ack 1, win 229, options [nop,nop,TS val 2889165311 ecr 3735043158], length 60
10:26:57.104893 IP turtle.sunrpc > 192.168.2.57.45016: Flags [P.], seq 1:33, ack 61, win 1026, options [nop,nop,TS val 3735043160 ecr 2889165311], length 32
10:26:57.105020 IP 192.168.2.57.45016 > turtle.sunrpc: Flags [.], ack 33, win 229, options [nop,nop,TS val 2889165311 ecr 3735043160], length 0
10:26:57.105143 IP 192.168.2.57.45016 > turtle.sunrpc: Flags [F.], seq 61, ack 33, win 229, options [nop,nop,TS val 2889165311 ecr 3735043160], length 0
10:26:57.105176 IP turtle.sunrpc > 192.168.2.57.45016: Flags [.], ack 62, win 1026, options [nop,nop,TS val 3735043160 ecr 2889165311], length 0
10:26:57.105183 IP 192.168.2.57.46278 > turtle.nfsd: Flags [S], seq 2729923481, win 29200, options [mss 1460,sackOK,TS val 2889165311 ecr 0,nop,wscale 7], length 0
10:26:57.105213 IP turtle.nfsd > 192.168.2.57.46278: Flags [S.], seq 1902691979, ack 2729923482, win 65535, options [mss 1460,nop,wscale 6,sackOK,TS val 1635852321 ecr 2889165311], length 0
10:26:57.105254 IP turtle.sunrpc > 192.168.2.57.45016: Flags [F.], seq 33, ack 62, win 1026, options [nop,nop,TS val 3735043160 ecr 2889165311], length 0
10:26:57.105663 IP 192.168.2.57.46278 > turtle.nfsd: Flags [.], ack 1, win 229, options [nop,nop,TS val 2889165311 ecr 1635852321], length 0
10:26:57.105731 IP 192.168.2.57.45016 > turtle.sunrpc: Flags [.], ack 34, win 229, options [nop,nop,TS val 2889165311 ecr 3735043160], length 0
...
Keeps going ...

Speedy · Jul 19, 2017

There is more going on than meets the eye. Below is a snippet from the messages log; the host in question (coder) was permanently removed from the network 3 years ago. The question: how can I flush everything NFS related and start from a clean slate?

Code:

Jul 19 00:20:34 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
Jul 19 01:20:34 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
Jul 19 02:20:34 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
Jul 19 03:20:34 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
Jul 19 04:20:36 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
Jul 19 05:20:37 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
Jul 19 06:20:37 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
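To see which hosts rpc.statd keeps retrying, a small Python sketch can tally the "Failed to contact host" lines. It assumes the syslog line format shown above; the function name is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
# Jul 19 00:20:34 turtle4 rpc.statd: Failed to contact host coder: RPC: Unknown host
# An optional "[pid]" after the daemon name is also accepted.
STATD_FAIL = re.compile(r"rpc\.statd(?:\[\d+\])?: Failed to contact host (\S+):")


def stale_statd_hosts(lines):
    """Count how often rpc.statd failed to contact each host."""
    counts = Counter()
    for line in lines:
        match = STATD_FAIL.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Feeding it the open log file (for instance `stale_statd_hosts(open(path))`, with `path` pointing at wherever syslog writes these messages) gives one entry per host that statd still thinks it must notify.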

SirDice · Jul 20, 2017

Speedy said:
The question: how can I flush everything NFS related and start from a clean slate?

There's really not much to configure for NFS: there's /etc/exports and that's about it. Everything else is set in rc.conf.

Speedy · Jul 20, 2017

That's what I thought, but why is it looking for this host, which was disconnected three years ago? And how do I explain the problem I described earlier?

SirDice · Jul 20, 2017

Have a look at the /etc/hosts file on both the client and the server. It may have been lingering there.

Speedy · Jul 20, 2017

Yes, I did, and found nothing. Thanks for staying with me. Using IP addresses instead of hostnames changes nothing. When I run showmount -e 192.168.2.254 from the troubled Linux box it takes a long time, but eventually it gets the exports. From all other Linux boxes the same command returns results instantly. I even looked at the routing tables; everything is nice and clean there. And it all started when I accidentally killed a process on the FreeBSD host... it just does not make any sense.

SirDice · Jul 20, 2017

Have you checked DNS too? I've seen some instances where a service initially started properly and had been running for ages, but a restart caused all sorts of failures. It turned out that DNS had been updated incorrectly somewhere along the way, but because the service was still running nobody noticed until it needed to be restarted.

Speedy said:
When I run showmount -e 192.168.2.254 from the troubled Linux box it takes a long time, but eventually it gets the exports. From all other Linux boxes the same command returns results instantly.

This does sound like a name resolution issue. Not forward lookups (you're connecting to an IP address), but reverse lookups may be causing the delay.
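A quick way to test the reverse-lookup theory is to time a lookup of the troubled client's address on the server and compare it with a working client. This Python sketch (hypothetical helper name) uses `socket.gethostbyaddr`, which goes through the system resolver configuration; it is only an approximation of what the NFS daemons themselves do.

```python
import socket
import time


def time_reverse_lookup(ip):
    """Time a reverse lookup of `ip` via the system resolver.

    Returns (hostname or None, seconds taken). A lookup that only fails
    after several seconds suggests an unreachable or misconfigured
    resolver rather than a simple missing entry.
    """
    start = time.monotonic()
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        name = None
    return name, time.monotonic() - start
```

On the server, comparing `time_reverse_lookup("192.168.2.57")` (the troubled client seen in the tcpdump output) with the same call for a working client would show whether one of them is noticeably slower to resolve.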

Speedy · Jul 22, 2017

Mystery solved. The troubled box connects now. I did nothing except run pkg_cutleaves and remove a bunch of packages, and I do not see how that could have affected the NFS issue.

Speedy · Jul 22, 2017

BTW, to get rid of the "unknown host" messages mentioned above, I had to remove /var/db/statd.status.
