System Occasionally Becomes Unresponsive on FreeBSD 13/14

Hi
I’m having a weird issue where our FreeBSD systems occasionally become unresponsive. I've seen it so far on the following versions:
  • 13.3
  • 14.2
  • 14.3
It's also been seen on different hardware, and on VMs too, so I've ruled out a hardware issue. The obvious things like CPU, memory, and swap all look fine.

When the issue occurs, it locks out new SSH sessions, but existing ones may keep working for a while. Sometimes the filesystem becomes inaccessible (though not always). Eventually the console locks up and we can no longer access the server at all.

All the servers that have the issue seem to be our largest, busiest servers, with lots of network activity. The issue feels like some kind of resource starvation, but we're not sure which resource it could be. Eventually the box needs to be hard reset, and the logs are always clean after a reboot. We can't find any errors in the logs, e.g. dmesg or /var/log/messages. It also seems to be occurring more frequently since we upgraded servers to 14.3.

Can anyone provide some ideas on what we can do or where we can look to track this one down? Any suggestions of what the problem might be? We haven't been able to reliably reproduce it anywhere, it just seems to randomly occur.

Thanks for any help anyone can provide.
 
I've observed this sort of thing when:
1) some third-party kernel modules are loaded, for example the VirtualBox modules;
2) some process gets stuck trying to do something with a disk or filesystem. For example, a fusefs program, let's say ntfs-3g, is told to mount a device read-write (
Code:
-o rw
), but its node, let's say
Code:
/dev/ada0p3
only allows "r". The process starts and hangs. All my attempts to kill the process hang. Every command that touches the mount point's directory, e.g. ls -l /media, hangs too. Until reboot.
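Processes wedged like that usually sit in uninterruptible disk wait ("D" state). A quick way to spot them is to filter the STAT column of ps axl; this is just a sketch with a plain awk filter (column 10 is STAT in ps axl output), not a standard tool:

```shell
# Print the header plus any process whose state (column 10 of `ps axl`)
# starts with "D" -- uninterruptible disk wait.
ps axl | awk 'NR == 1 || $10 ~ /^D/'
```

For a matching PID, procstat -kk <pid> on FreeBSD prints the kernel stack, which usually names the lock or I/O path the process is blocked on.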

Can't say more, but try checking for situations like those.
 
I started a thread about what may be the same issue two months ago or so. In my case, the cause was ZFS.

I was having major lockups when I would copy a very large file from an external drive into my home folder (which is a dataset on zroot, the default setup). It blocked everything - from network to I/O operations.

What alleviated this was either limiting the I/O bandwidth of that rsync operation or, and here's the revealing part, turning off fsync for the target dataset before copying the file.

So the issue is fsync being enabled on a ZFS dataset that sits on zroot, where your root filesystem also lives: it hogs all the I/O. This is not an issue if the target dataset is NOT on the zroot pool.
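For reference, those two workarounds look roughly like this as commands. The source path, bandwidth cap, and the dataset name zroot/usr/home are illustrative assumptions; the ZFS property that governs synchronous-write behavior is sync. Note that sync=disabled can lose the most recent writes on a crash or power failure, so restore it after the copy:

```shell
# Workaround 1: cap rsync's bandwidth so the copy can't saturate the pool.
rsync --bwlimit=50m /mnt/external/bigfile /usr/home/user/

# Workaround 2: temporarily disable synchronous writes on the target dataset.
zfs set sync=disabled zroot/usr/home
# ... run the copy ...
zfs inherit sync zroot/usr/home   # restore the inherited default
```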
 
Unfortunately we're not using ZFS, and there are no externally mounted drives. There shouldn't be any third-party kernel modules loaded either, but I can double-check all of those things.
We've had another server show the issue, so it's now a total of 4 servers. There must be some common element we're not seeing. We're now running the following commands and dumping the results to disk every 10-30 s, in the hope that if one crashes again we might see something in the data. None have crashed yet with this script on them:

top -b -n 30
iostat -x -c 2
df -hi
tunefs -p /
sysctl -a
fstat
ps auxwwwd
vmstat
vmstat -m
vmstat -z
sockstat
netstat -s
netstat -m
check_procstat
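Roughly, the wrapper around those commands looks like this. This is a sketch, not the exact script: OUTDIR, INTERVAL, COUNT, and the shortened command list are illustrative placeholders.

```shell
#!/bin/sh
# Periodic diagnostic collector: every INTERVAL seconds, append a timestamped
# section of each command's output to a per-command log file under OUTDIR.
OUTDIR=${OUTDIR:-/tmp/diag}   # where the snapshots accumulate
INTERVAL=${INTERVAL:-10}      # seconds between snapshots
COUNT=${COUNT:-1}             # snapshots to take; set 0 to run until killed
mkdir -p "$OUTDIR"
i=0
while [ "$COUNT" -eq 0 ] || [ "$i" -lt "$COUNT" ]; do
    ts=$(date +%Y%m%d-%H%M%S)
    for cmd in "vmstat -m" "vmstat -z" "netstat -m" "sockstat" "ps auxwwwd"; do
        # one log per command, e.g. vmstat_-z.log, with timestamped sections
        log="$OUTDIR/$(echo "$cmd" | tr ' ' '_').log"
        { echo "=== $ts ==="; $cmd; } >> "$log" 2>&1
    done
    i=$((i + 1))
    sleep "$INTERVAL"
done
```

Keeping one file per command makes it easy to diff the last few snapshots after a crash, e.g. to see whether a vmstat -z zone or netstat -m mbuf counter was climbing.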

Anything else we could add to that list to catch the problem?
 