Unresponsive system after upgrade to 13.3

Hi folks,

I wonder if somebody else has the same issues. I've upgraded my laptop from 13.2 to 13.3. Upgraded all the packages. Having the problem with rsync almost completely locking up the system with near-100% system load. Starting it up via nice (with the default increment of 10) doesn't help. The same script starting rsync ran absolutely fine on 13.2. Firefox now randomly does the same - 100% system load, the whole system becomes nearly unresponsive, X locking up. Can fix by switching to text console and killing the process. systat vmstat doesn't show anything unusual. Nothing in the system logs.

Any ideas where to even begin looking?

Update: Can reproduce the problem every time I start rsync. It takes 100% CPU (top shows 87% system, 12% idle) but X and everything else is locking up. Strangely enough, whenever I start rsync, firefox shows as consuming 200% WCPU in top at the same time. Once I kill the rsync task, X and firefox are automatically back to normal, so looks like rsync is the culprit here.

Also, Xfce's thunar randomly does a similar thing on start - producing 100% CPU load, but X remains responsive, and top shows only 37% load when thunar starts. It eventually draws a window after some time, but when switching through folders it locks up randomly, sometimes producing gvfsd-recent at 100% CPU load and an error window "Failed to open recent:/// - Timeout was reached."

Randomly, also, similar behavior when starting a terminal window with zsh stuck at 100% load for a few seconds before showing prompt. Almost as if all related to disk activity.

Update 2 (edited): Ran fsck on the external drive I'm backing up to, now rsync runs fine, no lockup. After some time the rsync began to misbehave again (post #4 below).
 
Actually, the problem is back after fsck'ing the external drive. rsync is locking up the entire system, now with 100% CPU load but only about 37% system load. Nothing in /var/log/messages. vmstat output is attached for reference. Whenever that happens, the whole system is barely responsive - it takes maybe half a minute to log in (launch a shell), and X windows are not responsive. Whenever rsync is taking the CPU, other random processes like firefox and zsh also begin to show 100% CPU load. The rsync session itself also seems to be stuck without progress.

rsync is run with these flags: rsync --log-file=$RSYNC_LOG_FILE --archive --hard-links --delete --delete-excluded --sparse --xattrs --numeric-ids --acls --progress
 

Attachments

  • Screenshot_2024-03-11_08-41-24.png
Yes, since it is 100% CPU the top output would be useful. If it is system time you could then do flame graphs using dtrace, but let's take it one step at a time.

Is the system disk USB?
 
Here is the top -HSz:

I apologize for the quality of the screenshot - this was done from the phone, as the system was unresponsive during rsync.
First, rsync started at about 30% CPU load, then the kernel{arc_prune} process appeared at 100%, followed by other processes like firefox at 100%, at which point the system became unresponsive. Killing rsync made the system responsive again, with other processes returning to normal CPU usage. The only exception was kernel{arc_prune}, which kept running at 100% until I disconnected the external hard disk.

The external hard disk is /dev/da0p1, a WD Green SATA SSD connected via a USB to SATA adapter. The file system on it is UFS. The laptop's internal NVMe drive is ZFS. I am unsure why kernel{arc_prune} would be related to a UFS file system. Looks like this may be somehow related to this bug as well:

Submitted a comment in Bugzilla.
 
That's why I ask for top output. I wanted to see if arc_prune usage was high.

Do you use rsync to copy files between ZFS and UFS?
 
A few days ago I upgraded 150+ systems to FreeBSD 13.3. I use rsync daily to copy files between ZFS datasets and also to remote servers. I was lucky and everything works fine.
 
Could you run this (as root) and make out.kern_stacks available somewhere?

Code:
(sleep 60; dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' -o out.kern_stacks) &
rsynccommand...
wait

This sleeps 60 seconds and then runs dtrace for kernel profiling for 60 seconds (the tick-60s probe ends the run), hopefully while this condition is ongoing. You might want to modify the times if required.
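
If the lockup starts later or lasts longer than that window, the two times can be pulled out into variables. What follows is only a sketch under assumptions: the names DELAY, DURATION and dscript are made up for illustration and are not part of the original command. The D program text and the dtrace options are the same as in the command above, with only the tick-Ns interval changed.

```shell
#!/bin/sh
# Hypothetical helper: print the same D program as above, with the
# profiling duration (in seconds) passed as $1 instead of a fixed 60.
dscript() {
    printf 'profile-997 /arg0/ { @[stack()] = count(); } tick-%ss { exit(0); }' "$1"
}

# Illustrative defaults; both names are assumptions, not dtrace settings.
DELAY=${DELAY:-60}        # seconds to wait before profiling starts
DURATION=${DURATION:-60}  # seconds to profile

# Usage as root on FreeBSD (left commented out so nothing runs by accident):
# (sleep "$DELAY"; dtrace -x stackframes=100 -n "$(dscript "$DURATION")" -o out.kern_stacks) &
# rsynccommand...
# wait
```

For example, DELAY=120 DURATION=30 would wait two minutes and then sample for 30 seconds.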
 
Hmmm,
Code:
diff --git a/sys/contrib/openzfs/META b/sys/contrib/openzfs/META
index e9a809aef3b8..6e199face590 100644
--- a/sys/contrib/openzfs/META
+++ b/sys/contrib/openzfs/META
@@ -1,10 +1,10 @@
 Meta:          1
 Name:          zfs
 Branch:        1.0
-Version:       2.1.9
+Version:       2.1.14

I am glad that a point release of FreeBSD didn't do a major openzfs update. But then this doesn't look like a big enough change to introduce pathological behavior of this magnitude.

The total diff between 13.2 and 13.3 is 400000 lines. I can't read that "today" :)
 
I would really like to avoid using a workaround for rsync. It's really a big part of my backup strategy. Unfortunately, thunar is not providing any info on what it was doing, but I think it being stuck was related to the rsync process running in the background.
 
I am glad that a point release of FreeBSD didn't do a major openzfs update. But then this doesn't look like a big enough change to introduce pathological behavior of this magnitude.

The total diff between 13.2 and 13.3 is 400000 lines. I can't read that "today" :)
Don't waste your time. With 99.9% probability this is PR 275594.
The patches there seem to be working so far for everybody who tried them.
 
I would really like to avoid using a workaround for rsync. It's really a big part of my backup strategy. Unfortunately, thunar is not providing any info on what it was doing, but I think it being stuck was related to the rsync process running in the background.
No. It is somehow related to the inode cache (aka vnode wtf) in relation to available memory. It seems trouble starts when many files are handled - by a backup task, by a tar unpack or a parallel compile, or by rsync. The patch Seigo Tanimura wrote looks wonderful when you look at the code, but it's lengthy and I didn't try to understand it all, so I'm not fully sure about cause and effect.

If you're bored, you might try an old trick of mine to get ZFS to behave: significantly lower kern.maxvnodes. This will cost performance, and I don't know if it helps in this case, but it might.
I didn't try that, I just installed the patches from Seigo Tanimura, and I'm fine.
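
For anyone who wants to try the kern.maxvnodes trick, here is a minimal sketch. It assumes that halving the current limit counts as "significantly lower"; that factor is an assumption, not a tested recommendation, and the helper name halved is made up for illustration. kern.maxvnodes and vfs.numvnodes are the stock FreeBSD sysctls.

```shell
#!/bin/sh
# Hypothetical helper: print half of a vnode limit (integer division).
# The factor of two is an arbitrary starting point, not a known-good value.
halved() {
    echo $(( $1 / 2 ))
}

# Usage as root on FreeBSD (left commented out so nothing changes by accident):
# cur=$(sysctl -n kern.maxvnodes)            # current limit
# sysctl vfs.numvnodes                       # vnodes currently in use
# sysctl kern.maxvnodes="$(halved "$cur")"   # lower the limit until reboot
```

A value set this way does not survive a reboot, so it is easy to back out if performance suffers too much.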
 
It looks like the issue in this thread is on its way to resolution, but I have a few remarks:
1. I have read that 'rsync -a' could have performance issues with ZFS because when moving/copying files ZFS automatically checksums each block as it's transferred, for integrity. Since rsync -a is doing the same thing there's a doubling up of work per block. I'd guess this doesn't happen when it's ZFS -> UFS (like my case), but I don't know if ZFS recognizes the transfer is going to a non-ZFS filesystem (and does not do the initial checksum) since it's not going to be used by the recipient machine. Checksumming is possibly the core feature of ZFS so maybe it's never turned off.
2. The phonegrab I posted is there because I was surprised by the filesystem being updated by fsck to track directory depth, something I have never seen. I attributed it to the update and not fsck. I have a second backup disk but thought it unwise to use it. After rebooting into 13.2-p10 I did back up to it without issue. If you know any info about this directory depth thing I'd appreciate a link; I could find nothing.
3. That screenshot of yours is pretty funky, fvwm(1?). What's the window bg color? I'm needing a change.

thx
s-a
 