FreeBSD 11.1 server occasionally freezing/locking up for about 10 minutes at a time

Several times a day this server "locks up" for about 10 minutes at a time. During these events nothing is recorded by rsyslog. Open ssh connections stay connected and I can do things like "echo hello" successfully, but any attempt to run a command such as "ls" causes that shell to lock up until the event passes. NFS clients are also unable to access the server during these events. No kernel messages show up in dmesg before or during these events, and the only message that shows up afterwards is "sonewconn: pcb 0xfffff801d6465e10: Listen queue overflow: 16 already in queue awaiting acceptance (23 occurrences)", which is related to the SSSD unix socket.

I have a script running that's gathering some basic information, and I'm attaching the output from immediately before and after one of these events (during these events no files are generated):
Bash:
#!/bin/bash

while sleep 1;
do
    fname="data/$(date +%Y-%m-%dT%T)"
    uptime > "$fname"
    sysctl vm vfs.zfs vfs.nfsd kstat.zfs >> "$fname"
    echo "$fname"
done

The machine is a SuperMicro system with 2x Intel(R) Xeon(R) Silver 4114 2.20GHz processors and 92GB of RAM. It hosts a ZFS pool with 164T of raw disk (71.8T after RAID, currently 33.7T used). It currently serves an NFS-heavy load (NFS for my organization's internal and public Linux mirrors). For networking it's using the integrated Intel X722 NIC with two 10GBASE-T connections LACP bonded together, currently using version 1.9.5 of the Intel driver from ports.

Other services on the machine include a SAMBA 1.6 server that doesn't have any active clients, and a 5 minute cron job that creates and destroys regular ZFS snapshots.
 

Attachments

  • 2018-05-12T22:18:22.txt (17.8 KB)
  • 2018-05-12T22:09:12.txt (17.8 KB)

Only theory that crosses my mind quickly: the root file system gets stuck (perhaps because the underlying device gets stuck). That would explain why the already running shell still functions: the "echo" command is probably a shell builtin and doesn't require any disk IO, so it works, while "ls" needs the file system and therefore disk IO.

So here are the questions: What file system is the root on? Is it also ZFS? If not, does it share the same devices?

And: to test whether the root file system is working while the system is taking its short nap: "/bin/echo hallo". If that freezes, but "echo hallo" works, then we know that it is the root file system (or more accurately the file system containing the /bin directory, but that's likely root). If that works, then try whether writing to the /tmp file system works. If you get this far, try whether networking (even the most basic stuff) is working, for example "ping 127.0.0.1".
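
The three checks above can be sketched as a small Python program that is started before an event and kept running, so the interpreter is already in memory when the stall begins. This is only an illustration under assumptions: the function names and the probe file name are made up, and a probe that touches a hung file system will most likely just block, so the useful signal is a "starting" line that never gets its matching result line.

```python
import os
import subprocess
import time


def stamp():
    """Wall-clock time for log lines."""
    return time.strftime("%H:%M:%S")


def timed(label, action):
    """Announce a probe, run it, and return (label, ok, seconds)."""
    print(f"{stamp()} starting {label}", flush=True)
    start = time.monotonic()
    try:
        action()
        ok = True
    except (OSError, subprocess.SubprocessError):
        ok = False
    elapsed = time.monotonic() - start
    print(f"{stamp()} {label}: {'ok' if ok else 'FAILED'} after {elapsed:.2f}s",
          flush=True)
    return label, ok, elapsed


def run_binary(argv):
    """Execute an on-disk program; raises if it is missing or exits non-zero."""
    subprocess.run(argv, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=True)


def write_tmp(directory="/tmp"):
    """Create, sync, and delete a small file (hypothetical name) in `directory`."""
    path = os.path.join(directory, f"probe.{os.getpid()}")
    with open(path, "w") as handle:
        handle.write("probe\n")
        handle.flush()
        os.fsync(handle.fileno())
    os.remove(path)


def run_probes():
    """Run the checks in the order suggested in the post above."""
    return [
        timed("/bin/echo (on-disk binary)",
              lambda: run_binary(["/bin/echo", "hallo"])),
        timed("write to /tmp", write_tmp),
        timed("ping loopback",
              lambda: run_binary(["ping", "-c", "1", "127.0.0.1"])),
    ]
```

Calling run_probes() by hand (or in a loop with time.sleep between rounds) during an event shows which probe is the first one that fails to report back.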

The fact that your script hangs during the outage doesn't surprise me: The loop is controlled by a "sleep" command. Sleep is a program on the root file system (it's really "/bin/sleep"); if the whole root file system is hung, then your script will go down with it. What you could try is this: Replace your script with the equivalent perl or python program. Here's why: in those languages, you can do things like "sleep" from within the running program itself, without having to go to disk.

By the way, just to be clear: All I've done above is give you hints on how to gather more data to find out where the root cause might be. I have no idea what the root cause is, and even less how to fix it.
 
So here are the questions: What file system is the root on? Is it also ZFS? If not, does it share the same devices?

Root is on a separate ZFS pool consisting of two mirrored devices. I've added another Python script that appends the current time to a file in tmpfs, and I will watch whether it also freezes when one of these events occurs.
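
For readers who want to try the same thing, here is a minimal sketch of such a heartbeat. It is not the script actually used in this thread; the function name, the log format, and the "twice the interval" stall threshold are assumptions. The sleep happens inside the interpreter, so no program has to be loaded from disk between iterations, and each line records how long it has been since the previous one.

```python
import time


def heartbeat(path, interval=1.0, iterations=None,
              clock=time.time, sleep=time.sleep):
    """Append a timestamp to `path` every `interval` seconds.

    Each line also records the gap since the previous line; a gap of more
    than twice the interval (an arbitrary threshold) is marked as STALL.
    `iterations=None` means run until interrupted.
    """
    previous = None
    count = 0
    # Line buffering pushes every record out as soon as it is written.
    with open(path, "a", buffering=1) as log:
        while iterations is None or count < iterations:
            now = clock()
            gap = 0.0 if previous is None else now - previous
            flag = " STALL" if gap > 2 * interval else ""
            when = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
            log.write(f"{when} gap={gap:.1f}s{flag}\n")
            previous = now
            count += 1
            sleep(interval)  # in-process sleep, no /bin/sleep needed
```

Pointing `path` at a file on a memory-backed file system such as tmpfs keeps the write itself off the disks. If the log still shows a STALL line after an event, the freeze is wider than disk IO alone.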

vfs.zfs.arc_max: 97949433856

97949433856 = ~91GB

Try to limit this number to something reasonable. Like 1/2 of your total memory.

Err, I typo'd the system ram in my initial post, it has 96GB. I've now limited the arc size to around 45GB, and will see if it makes a difference.
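
The arithmetic behind these numbers, as a quick sketch. The byte value is the one quoted from the sysctl output; treating "GB" as GiB (2^30 bytes) and writing the limit as a /boot/loader.conf line are assumptions, since the thread does not say how the limit was applied.

```python
GIB = 1024 ** 3  # treat "GB" in the thread as GiB


def bytes_to_gib(value):
    """Convert a byte count to GiB."""
    return value / GIB


def half_of_ram_bytes(ram_gib):
    """Half of the installed RAM, in bytes."""
    return ram_gib * GIB // 2


current = 97949433856  # vfs.zfs.arc_max as reported above
print(f"current arc_max: {bytes_to_gib(current):.1f} GiB")  # 91.2 GiB

limit = half_of_ram_bytes(96)  # 96 GiB of RAM, per the correction above
print(f'vfs.zfs.arc_max="{limit}"')  # 51539607552 bytes = 48 GiB
```

Half of 96 GiB comes out at 48 GiB, so the roughly 45GB limit chosen here is slightly more conservative than the "1/2 of total memory" rule of thumb.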
 
Since the server becomes unresponsive until the event passes, and is not 'dumping', I am suspicious that it's workload-related: IO getting bogged down. You did say it's serving a heavy load.

Consider these commands, each running in its own shell. You can use script(1) to record the output to a file. I realize some of these commands seem trivial, but each reveals a lot of info. Try to determine your baselines, and then see what goes off the rails when your 'seizure event' occurs.

top -C -P -s 5
top -m io -C -P -s 5
vmstat -w 5
iostat -w 5
systat 5
systat -vmstat 5
systat -iostat 5


I don't have a lot of real world experience with heavy load machines acting awry, but I will say this: On one of my home machines I really bogged it down one day, just having fun with it. In the 'b' column of the vmstat(8) output I saw that it was consistently a non-zero number, which means "blocked for resources".
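
If staring at the 'b' column for hours is not appealing, something like the following could flag the interesting samples. It is a rough sketch that assumes the default FreeBSD vmstat layout, where the first three columns are r, b and w; the function names are made up.

```python
import sys
import time


def blocked_counts(lines):
    """Yield the 'b' value from each vmstat data line.

    Header lines are skipped because their first three fields are not
    plain integers. Assumes 'b' is the second column (r b w ...).
    """
    for line in lines:
        fields = line.split()
        if len(fields) >= 3 and all(f.isdigit() for f in fields[:3]):
            yield int(fields[1])


def report_blocked(lines, out=sys.stdout):
    """Print a timestamped note for every sample with a non-zero 'b'."""
    for blocked in blocked_counts(lines):
        if blocked > 0:
            when = time.strftime("%Y-%m-%dT%H:%M:%S")
            print(f"{when} {blocked} blocked for resources",
                  file=out, flush=True)
```

Passing sys.stdin to report_blocked() while piping "vmstat -w 5" into the script gives a running log of only the samples where something was blocked.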

This might take some effort, but I think it's worth it.

You also said "Several times a day". Does that mean the seizing only happens during business hours when users are pounding the server, or does it also happen at night when there are few or no users on the server? If it's only during heavy load, then resources are where to look. If it happens even during the wee hours of the morning, then you might have some sort of malfunction.

Feel free to post the output of the above commands. I will try to look at it when I can, maybe this evening or tomorrow, or maybe someone else can beat me to it.
 