Unresponsive system after upgrade to 13.3

blackhaz · Mar 10, 2024

Hi folks,

I wonder if somebody else has the same issues. I've upgraded my laptop from 13.2 to 13.3. Upgraded all the packages. Having the problem with rsync almost completely locking up the system with near-100% system load. Starting it up via nice (with the default increment of 10) doesn't help. The same script starting rsync ran absolutely fine on 13.2. Firefox now randomly does the same - 100% system load, the whole system becomes nearly unresponsive, X locking up. Can fix by switching to text console and killing the process. systat vmstat doesn't show anything unusual. Nothing in the system logs.

Any ideas where to even begin looking?

Update: Can reproduce the problem every time I start rsync. It takes 100% CPU (top shows 87% system, 12% idle) but X and everything else is locking up. Strangely enough, whenever I start rsync, firefox shows as consuming 200% WCPU in top at the same time. Once I kill the rsync task, X and firefox are automatically back to normal, so looks like rsync is the culprit here.

Also, Xfce's thunar randomly does a similar thing on start - producing 100% CPU load, but X remains responsive, and top shows only 37% load when thunar starts. It eventually draws a window after some time, but when switching through folders it locks up randomly, sometimes producing gvfsd-recent at 100% CPU load and an error window "Failed to open recent:/// - Timeout was reached."

Randomly, also, similar behavior when starting a terminal window with zsh stuck at 100% load for a few seconds before showing prompt. Almost as if all related to disk activity.

Update 2 (edited): Ran fsck on the external drive I'm backing up to, now rsync runs fine, no lockup. After some time the rsync began to misbehave again (post #4 below).

cracauer@ · Mar 11, 2024

Anything in dmesg during those fun times?

blackhaz · Mar 11, 2024

Nope, there was nothing at all.

blackhaz · Mar 11, 2024

Actually, the problem is back after fsck'ing the external drive. rsync is locking up the entire system, now with 100% CPU load but only about 37% of the system load. Nothing in /var/log/messages. vmstat output is attached for reference. Whenever that happens, the whole system is barely responsive - it takes maybe half a minute to log in (launch a shell), and X windows are not responsive. Whenever rsync is taking the CPU, other random processes like firefox and zsh also begin to show 100% CPU load. The rsync session itself also seems to be stuck without progress.

rsync is run with these flags: rsync --log-file=$RSYNC_LOG_FILE --archive --hard-links --delete --delete-excluded --sparse --xattrs --numeric-ids --acls --progress

gotnull · Mar 11, 2024

Could it be the same problem? rsync involved, high CPU usage.

ZFS - disappointing ZFS(?) performance from 13.3

This morning I got to watch my 13.3 (ZFS zroot) desktop take 14 minutes to rsync about 21GB of data to a locally attached UFS disk with a steady 85%-88% CPU usage. Both parameters double last week's typical run. I have a BE to "fall back to" but this morning's boot gave the attached message prior...

forums.freebsd.org

CyberCr33p · Mar 11, 2024

Can you show the output of top?

cracauer@ · Mar 11, 2024

Yes, since it is 100% CPU the top output would be useful. If it is system time you could then do flame graphs using dtrace, but let's take one step at a time.

Is the system disk USB?

blackhaz · Mar 11, 2024

Here is the top -HSz:

[Screenshot: IMG-0224, hosted at ImgBB (ibb.co)]

I apologize for the quality of the screenshot - this was done from the phone, as the system was unresponsive during rsync.
First, the rsync started at about 30% CPU load, then the kernel{arc_prune} process appeared at 100%, followed by other processes like firefox at 100%, at which point the system became unresponsive. Killing the rsync made the system responsive again, with other processes returning to normal CPU usage. The only exception was kernel{arc_prune}, which kept running at 100% until I disconnected the external hard disk.

The external hard disk is /dev/da0p1, a WD Green SATA SSD connected via a USB to SATA adapter. The file system on it is UFS. The laptop's internal NVMe drive is ZFS. I am unsure why kernel{arc_prune} would be related to a UFS file system. Looks like this may be somehow related to this bug as well:

275063 – kernel using 100% CPU in arc_prune

bugs.freebsd.org

Submitted a comment in Bugzilla.
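
The sequence described above (arc_prune pegged at 100% while rsync runs) could also be watched from a text console by sampling counters instead of photographing top. The following is only a minimal sketch: it assumes the `kstat.zfs.misc.arcstats.arc_prune`, `vfs.numvnodes` and `kern.maxvnodes` sysctls are present on the affected system, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: print how fast the ARC prune counter grows, next to vnode usage.
# Assumes FreeBSD with OpenZFS; verify the sysctl names locally before relying on them.

# counter_delta OLD NEW -> prints NEW - OLD
counter_delta() {
    echo $(( $2 - $1 ))
}

# watch_arc_prune INTERVAL_SECONDS SAMPLES (hypothetical helper)
watch_arc_prune() {
    prev=$(sysctl -n kstat.zfs.misc.arcstats.arc_prune) || return 1
    i=0
    while [ "$i" -lt "$2" ]; do
        sleep "$1"
        cur=$(sysctl -n kstat.zfs.misc.arcstats.arc_prune) || return 1
        echo "arc_prune +$(counter_delta "$prev" "$cur")" \
             "numvnodes=$(sysctl -n vfs.numvnodes)" \
             "maxvnodes=$(sysctl -n kern.maxvnodes)"
        prev=$cur
        i=$((i + 1))
    done
}

# Example: watch_arc_prune 5 12   # twelve samples, five seconds apart
```

If the counter keeps climbing while numvnodes sits close to maxvnodes, that would be consistent with the vnode-related explanation discussed in the bug report, though it does not prove it.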

CyberCr33p · Mar 11, 2024

That's why I ask for top output. I wanted to see if arc_prune usage was high.

Do you use rsync to copy files between ZFS and UFS?

blackhaz · Mar 11, 2024

Yes, it was rsync from the internal ZFS NVMe to an external UFS USB drive.

CyberCr33p · Mar 11, 2024

A few days ago I upgraded 150+ systems to FreeBSD 13.3. I use rsync daily to copy files between ZFS datasets and also to remote servers. I was lucky and everything works fine.

blackhaz · Mar 11, 2024

I also do rsync to a remote server and that one works fine, no lockups.

CyberCr33p · Mar 11, 2024

If you rsync from ZFS to ZFS then do you have lockups?

cracauer@ · Mar 11, 2024

99.9% system (in the totals line).

cracauer@ · Mar 11, 2024

Could you run this (as root) and make out.kern_stacks available somewhere?

Code:

(sleep 60; dtrace -x stackframes=100 -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' -o out.kern_stacks) &
rsynccommand...
wait

This sleeps 60 seconds and then runs dtrace for kernel profiling for 60 seconds (the tick-60s probe ends the run), hopefully while this condition is ongoing. Modify the times if required.
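
Since the delay and the sampling window may need adjusting, the same invocation can be wrapped so both are parameters. This is a rough sketch rather than a tested tool: the helper names `profile_prog` and `kernel_profile` are hypothetical, and only the dtrace options and the D program text are taken from the command above.

```shell
#!/bin/sh
# Sketch: the dtrace kernel profiling command with adjustable timings.

# profile_prog SECONDS -> prints the D program with a tick-<SECONDS>s exit probe
profile_prog() {
    printf 'profile-997 /arg0/ { @[stack()] = count(); } tick-%ss { exit(0); }' "$1"
}

# kernel_profile DELAY_SECONDS WINDOW_SECONDS OUTPUT_FILE (run as root)
kernel_profile() {
    sleep "$1"
    dtrace -x stackframes=100 -n "$(profile_prog "$2")" -o "$3"
}

# Example, mirroring the original: wait 60 s, then sample for 60 s.
# kernel_profile 60 60 out.kern_stacks &
# rsynccommand...
# wait
```

A shorter delay may be useful when the lockup starts almost immediately, because dtrace itself can be slow to start on an already unresponsive system.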

cracauer@ · Mar 11, 2024

Hmmm,

Code:

diff --git a/sys/contrib/openzfs/META b/sys/contrib/openzfs/META
index e9a809aef3b8..6e199face590 100644
--- a/sys/contrib/openzfs/META
+++ b/sys/contrib/openzfs/META
@@ -1,10 +1,10 @@
 Meta:          1
 Name:          zfs
 Branch:        1.0
-Version:       2.1.9
+Version:       2.1.14

I am glad that a point release of FreeBSD didn't do a major openzfs update. But then this doesn't look like a big enough change to introduce pathological behavior of this magnitude.

The total diff between 13.2 and 13.3 is 400000 lines. I can't read that "today"
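
To confirm which OpenZFS version a given machine is actually running, rather than reading it from the source tree, something like the following could be used. It is a small sketch that assumes `zfs version` prints a first line of the form `zfs-2.1.14-FreeBSD_g<hash>`; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: extract the userland OpenZFS version from `zfs version` output.
# Assumption: the userland line starts with "zfs-<digits and dots>".

# Reads `zfs version` output on stdin, prints e.g. "2.1.14".
zfs_userland_version() {
    # The "zfs-kmod-..." line does not match because a digit must follow "zfs-".
    sed -n 's/^zfs-\([0-9][0-9.]*\).*/\1/p'
}

# Example (on the affected machine):
# zfs version | zfs_userland_version
```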

blackhaz · Mar 11, 2024

cracauer@ here is the out.kern_stacks: https://www.dropbox.com/scl/fi/nea1...n_stacks?rlkey=vjgdjnr3k8yfpivhsbui5jpti&dl=0

CyberCr33p, I haven't tried ZFS->ZFS yet.

cracauer@ · Mar 11, 2024

blackhaz said:
cracauer@ here is the out.kern_stacks: https://www.dropbox.com/scl/fi/nea1...n_stacks?rlkey=vjgdjnr3k8yfpivhsbui5jpti&dl=0

Here is the profiling flame graph. Looks like a locking problem:

https://www.cons.org/20240311.svg
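
For anyone who wants to render such a graph themselves: one common route is the FlameGraph Perl scripts (`stackcollapse.pl` and `flamegraph.pl`). The sketch below assumes a local checkout of those scripts at a path you supply; it is not necessarily how the graph linked above was produced, and the function name is made up.

```shell
#!/bin/sh
# Sketch: turn dtrace stack output into an SVG flame graph.
# Assumes perl and a checkout of the FlameGraph scripts are available.

# make_flamegraph FLAMEGRAPH_DIR DTRACE_OUTPUT SVG_FILE
make_flamegraph() {
    if [ ! -f "$1/stackcollapse.pl" ] || [ ! -f "$1/flamegraph.pl" ]; then
        echo "FlameGraph scripts not found in $1" >&2
        return 1
    fi
    # Fold each stack into one line, then render the folded stacks as SVG.
    perl "$1/stackcollapse.pl" "$2" | perl "$1/flamegraph.pl" > "$3"
}

# Example: make_flamegraph "$HOME/FlameGraph" out.kern_stacks kernel.svg
```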

richardtoohey2 · Mar 11, 2024

cracauer@ said:
I am glad that a point release of FreeBSD didn't do a major openzfs update.

Is that sarcasm? Just curious - don't know if going from 2.1.x to 2.1.y is "major" (would think 2.1 to 2.2 or 3 or whatever would definitely count as major) - so wondering if I've missed something.

cracauer@ · Mar 12, 2024

richardtoohey2 said:
Is that sarcasm? Just curious - don't know if going from 2.1.x to 2.1.y is "major" (would think 2.1 to 2.2 or 3 or whatever would definitely count as major) - so wondering if I've missed something.

No sarcasm. Didn't go to 2.2.x. But apparently there is still a major behavior change that sneaked in.

Alain De Vos · Mar 12, 2024

Try to use "clone" instead of "rsync".
Does thunar try to mount something?

blackhaz · Mar 12, 2024

I would really like to avoid using a workaround for rsync. It's really a big part of my backup strategy. Unfortunately, thunar is not providing any info on what it was doing, but I think it being stuck was related to the rsync process running in the background.

PMc · Mar 12, 2024

cracauer@ said:
I am glad that a point release of FreeBSD didn't do a major openzfs update. But then this doesn't look like a big enough change to introduce pathological behavior of this magnitude.

The total diff between 13.2 and 13.3 is 400000 lines. I can't read that "today"

Don't waste your time. With 99.9% probability this is PR 275594.
The patches there seem to be working so far for everybody who has tried them.

PMc · Mar 12, 2024

blackhaz said:
I would really like to avoid using a workaround for rsync. It's really a big part of my backup strategy. Unfortunately, thunar is not providing any info on what it was doing, but I think it being stuck was related to the rsync process running in the background.

No. It is somehow related to the inode cache (aka vnode wtf) in relation to available memory. It seems trouble starts when many files are handled - by a backup task, by a tar unpack or a parallel compile, or by rsync. The patch Seigo Tanimura wrote looks wonderful when you look at the code, but it's lengthy and I didn't try to understand it all, so I'm not fully sure about cause and effect.

If you're bored, you might try an old trick of mine to get ZFS to behave: significantly lower kern.maxvnodes. This will cost performance, and I don't know if it helps in this case, but it might.
I didn't try that, I just installed the patches from Seigo Tanimura, and I'm fine.
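
The kern.maxvnodes trick can be tried at runtime with sysctl. The sketch below is untested against this particular bug: the divisor of 4 in the example is an arbitrary illustration, not a recommended value, and the function names are hypothetical. A value set this way lasts only until reboot.

```shell
#!/bin/sh
# Sketch: lower kern.maxvnodes to a fraction of its current value.
# Assumes FreeBSD; the right target value is workload-dependent.

# reduced_maxvnodes CURRENT DIVISOR -> prints CURRENT / DIVISOR (integer division)
reduced_maxvnodes() {
    echo $(( $1 / $2 ))
}

# lower_maxvnodes DIVISOR (run as root)
lower_maxvnodes() {
    cur=$(sysctl -n kern.maxvnodes) || return 1
    new=$(reduced_maxvnodes "$cur" "$1")
    echo "kern.maxvnodes: $cur -> $new (vfs.numvnodes now: $(sysctl -n vfs.numvnodes))"
    sysctl kern.maxvnodes="$new"
}

# Example: lower_maxvnodes 4
```

Noting the original value before changing it makes it easy to restore if performance suffers more than the lockups did.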

semi-ambivalent · Mar 13, 2024

blackhaz said:
Actually, the problem is back after fsck'ing the external drive. rsync is locking up the entire system, now with 100% CPU load but only about 37% of the system load. Nothing in /var/log/messages. vmstat output is attached for reference. Whenever that happens, the whole system is barely responsive - it takes maybe half a minute to log in (launch a shell), and X windows are not responsive. Whenever rsync is taking the CPU, other random processes like firefox and zsh also begin to show 100% CPU load. The rsync session itself also seems to be stuck without progress.

rsync is run with these flags: rsync --log-file=$RSYNC_LOG_FILE --archive --hard-links --delete --delete-excluded --sparse --xattrs --numeric-ids --acls --progress

It looks like the issue in this thread is on its way to resolution, but I have a couple of remarks:
1. I have read that 'rsync -a' could have performance issues with ZFS because when moving/copying files ZFS automatically checksums each block as it's transferred, for integrity. Since rsync -a is doing the same thing there's a doubling up of work per block. I'd guess this doesn't happen when it's ZFS -> UFS (like my case), but I don't know if ZFS recognizes the transfer is going to a non-ZFS filesystem (and does not do the initial checksum) since it's not going to be used by the recipient machine. Checksumming is possibly the core feature of ZFS so maybe it's never turned off.
2. The phonegrab I posted is there because I was surprised by the filesystem being updated by fsck to track directory depth, something I have never seen. I attributed it to the update and not fsck. I have a second backup disk but thought it unwise to use it. After rebooting into 13.2-p10 I did back up to it without issue. If you know any info about this directory depth thing I'd appreciate a link; I could find nothing.
3. That screenshot of yours is pretty funky, fvwm(1?). What's the window bg color? I'm needing a change.

thx
s-a
