Hello all, I need help troubleshooting a mystery NFSv3 lockd issue.
I am using a pretty plain vanilla FreeBSD 13.0-RELEASE-p4 VMware VM as an NFS server for Linux clients; it has been working perfectly for more than 7 months. Last Friday we started getting strange issues with our build utilities, which I think I have traced to NFSv3 mounts and file system locking. To our knowledge, no changes have been applied to either our FreeBSD server or any of the clients. The symptom I see in the FreeBSD messages log is:
Jan 26 10:00:09 siren kernel: NLM: failed to contact remote rpcbind, stat = 5, port = 28416
(and that repeated endlessly)
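A small aside on reading that log line, in case it helps anyone looking at it: stat = 5 matches RPC_TIMEDOUT in the standard Sun RPC `enum clnt_stat`, and the odd-looking port 28416 is what 111 (the rpcbind port) looks like when its two bytes are swapped. The sketch below only checks that arithmetic. It assumes (I have not confirmed this in the kernel source) that the kernel prints the raw network-byte-order port on a little-endian x86 VM; the helper names are mine.

```python
# Hypothetical helpers for decoding the values in the NLM log message.
# Assumption: the logged port is a 16-bit value in network byte order
# printed as-is by a little-endian host.

# First few values of enum clnt_stat from the Sun RPC headers.
CLNT_STAT = {
    0: "RPC_SUCCESS",
    1: "RPC_CANTENCODEARGS",
    2: "RPC_CANTDECODERES",
    3: "RPC_CANTSEND",
    4: "RPC_CANTRECV",
    5: "RPC_TIMEDOUT",
}


def swap16(value: int) -> int:
    """Swap the two bytes of a 16-bit integer, independent of host endianness."""
    return int.from_bytes(value.to_bytes(2, "little"), "big")


def describe(stat: int, logged_port: int) -> str:
    """Render the logged stat/port pair in a more readable form."""
    name = CLNT_STAT.get(stat, "unknown")
    return f"stat={stat} ({name}), port={swap16(logged_port)} if byte-swapped"


print(describe(5, 28416))
```

If that reading is right, the message would mean the server timed out trying to reach rpcbind on some remote host, not that it is using a strange port.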
I've tried rebooting our FreeBSD server, and also removing /var/db/statd.status and rebooting. I suspect that there is some kind of problem with lockd: either it somehow got corrupt data, or there's a client that is doing something to kill it. When I try to run "service lockd stop", it hangs indefinitely and I cannot kill it from the command line.
I need advice on how to troubleshoot this, what it might be and where to look.
Longer details:
NFSv4 is not affected at all. Re-mounting our mounts with v4 passes all tests and works perfectly (since, as I understand it, v4 has its own built-in locking protocol). Through some trial and error I was able to narrow this down and easily reproduce build tool hangs that seem related to file system locking. One example: a Maven build on an NFS mount that attempts to grab assets from a Nexus server hangs indefinitely. A more obscure but easier test (on Linux clients) is to have an NFS home directory with valid data populated in the ~/.pki (certificate) cache, then try a simple "curl --verbose https://google.com"; the utility hangs indefinitely while attempting to lock SQLite DB files in ~/.pki. I also tested running the FreeBSD NFS server with lockd disabled; in this case the 'curl' test eventually times out on the file lock and proceeds, but Maven still fails with a Java IO exception, as it requires locking some things.
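To take curl and Maven out of the picture, the same hang can probably be reproduced with a bare lock request. Below is a minimal sketch, assuming a Linux client with Python 3 and a file path on the NFSv3 mount; `lock_works` and `LOCK_SNIPPET` are names I made up. The lock is taken in a child process so that a request stuck waiting on lockd can be killed after a timeout instead of hanging the shell (this relies on the client letting SIGKILL interrupt the wait, which I have not verified for every mount option).

```python
import subprocess
import sys

# Code run in the child: request an exclusive POSIX (fcntl) lock on the
# file named in argv[1], then release it. On an NFSv3 mount this kind of
# lock is what gets forwarded to the server's lockd.
LOCK_SNIPPET = (
    "import fcntl, sys\n"
    "with open(sys.argv[1], 'a+') as handle:\n"
    "    fcntl.lockf(handle, fcntl.LOCK_EX)\n"
    "    fcntl.lockf(handle, fcntl.LOCK_UN)\n"
)


def lock_works(path, timeout=10.0):
    """Return True if a lock on `path` is granted within `timeout` seconds."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", LOCK_SNIPPET, path],
            timeout=timeout,  # on expiry the child is killed and reaped
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```

Calling `lock_works("/path/on/nfs/locktest.tmp")` from each client in turn (the path is a placeholder) should show whether every client hangs on locking or only some of them, which might help narrow down a rogue host.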
I need to use NFSv3 as it seems to work better with some of my older clients.
I need to either find some way to clear out whatever is corrupting lockd on the FreeBSD server, or track down the rogue host that is causing lockd to have trouble.
Thanks in advance!