[Solved] System becomes unresponsive with large data transfer

I'm setting up a home system: FreeBSD 14 running on an AMD Ryzen 5 5600G, on an ASRock B550 board with an Intel® I225V NIC. In general, everything is up and running OK.

However, whenever there is a large network data transfer, the system becomes unresponsive. For example, I'm trying to transfer media files from my previous home system to the new one using rsync, and after about 30 seconds the system stops responding entirely. I have to power-cycle it to bring it back up.

It's the same if I try it the other way round, pulling data from my old system to the new one.

Sorry to be vague. Any idea what this might be?
 
Please try turning off the hardware offload and VLAN features with ifconfig: "-tso -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum". Can you also try setting sysctl kern.sched.preempt_thresh=224 and report back?
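
The suggestion above could be tried one flag at a time, re-running the transfer after each change, so that any flag which makes a difference can be identified. A minimal sketch follows. It assumes the interface is called igc0; the helper names (run, disable_offloads, tune_scheduler) and the DRY_RUN variable are made up for illustration and are not part of FreeBSD. By default it only prints the commands; set DRY_RUN=0 and run it as root to apply them. Settings changed this way do not survive a reboot.

```shell
#!/bin/sh
# Illustrative helpers (not part of FreeBSD).
# With DRY_RUN=1 (the default) commands are printed instead of executed.

OFFLOAD_FLAGS="-tso -rxcsum -txcsum -vlanmtu -vlanhwtag -vlanhwtso -vlanhwcsum"

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# disable_offloads IFACE [FLAG...]
# Applies each flag in turn; with no flags given, applies the whole list.
disable_offloads() {
    iface="$1"
    shift
    [ "$#" -gt 0 ] || set -- $OFFLOAD_FLAGS
    for flag in "$@"; do
        run ifconfig "$iface" "$flag"
    done
}

# Scheduler tunable suggested in the post above.
tune_scheduler() {
    run sysctl kern.sched.preempt_thresh=224
}
```

For example, "disable_offloads igc0 -tso" changes only TSO, while "disable_offloads igc0" walks through the whole list.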
 
You could run "htop", "gstat" or "iostat -c 1000" in a separate terminal to see what is happening.
Try "clone" on one disk to see whether the problem is related to disk writes.
You could tar and gzip part of the filesystem and transfer it with ftp to see whether the problem is related to network activity.
If both work well, the problem might be rsync itself.
 
Vague often makes it hard to help others.

A bit of information that may help others help you:
The output of ifconfig -a, so we can see what speed and duplex the link has negotiated, along with the other options suggested by rootbert.

How about some basic information like what filesystem you are copying to, what is the system you are copying from?

Have you tried to monitor the system with something like iostat (man iostat) in another window to see if something is getting interrupt bound like the network device driver or the disk device?

Have you tried something other than rsync, such as scp of a large file? This helps eliminate rsync being part of the problem and also helps determine if there is a threshold where things go bad.

Are you doing this from the console or within a desktop environment? Have you tried not booting to a GUI and running the rsync from the console? This helps eliminate something in the DE that is creating contention (say a file manager parsing/reparsing a directory tree that is constantly changing).


As Geezer says, FreeBSD-14 is CURRENT which is "head of development, not a production release" so there may be things turned on that help debug issues. Kernel options like WITNESS and others may reduce performance. Unless you absolutely need to use 14 because of hardware support, using FreeBSD-13.1-RELEASE would be better.
 
In addition to the suggestions mer made: Separate the copying into two parts. First do a network copy, with rsync, but to /dev/null, so there is no file system or disk IO. Then do a copy, but not from the network (not via rsync) but from /dev/zero, so only file system and disk IO are stressed, but not the network. Which one works? How fast is each? Do the speeds agree with your intuition about how fast they should be?
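
The two halves of that experiment could be sketched roughly as follows. This is only a sketch under assumptions: the function names, host name and paths are placeholders, and "ssh ... cat" stands in for rsync in the network-only test because it can stream straight into /dev/null. dd reports the throughput when it finishes. One caveat: on a ZFS dataset with compression enabled, zeros compress to almost nothing, so the /dev/zero test can report an unrealistically high speed; a dataset with compression turned off gives a fairer number.

```shell
#!/bin/sh
# Network only: stream a remote file over SSH and discard it locally,
# so no local file system or disk is involved.
# $1 = user@host, $2 = path of a large file on the remote machine.
net_only_test() {
    ssh "$1" "cat '$2'" | dd of=/dev/null bs=1048576
}

# Disk only: write $2 MiB of zeros to the local file $1, no network involved.
disk_only_test() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2"
}
```

Usage might look like "net_only_test user@oldmedia /path/to/large-file" followed by "disk_only_test /path/on/pool/zero.bin 10240", removing the test file afterwards.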
 
FreeBSD 14
FreeBSD 14 is an unsupported, development version.

 
… from my previous home system to the new …

Both FreeBSD?

Diagnosis should be simplified if you can (please) share the result of a hardware probe for each affected system.

pkg install sysutils/hw-probe sysutils/pciutils sysutils/usbutils

hw-probe -all -upload
 
I couldn't get my NIC working with 13.1, but it worked fine (initially) with 14 without any additional work.
That's what I was referring to "...if you need to because of hardware support..."

Now we know you need to use 14 because of hardware support, so everything I said about kernel options possibly hurting performance holds.

What is the destination filesystem, ZFS or UFS? That could be important. Options on the mount could matter (atime, sync, etc).
The output of the ifconfig command would really help out here. We can see if the link has negotiated correctly, what options are applied to the hardware, all the little things so we don't have to figure out what we need to ask, ask one question, and repeat. Kind of like telling a mechanic "car won't start" and then he needs to start asking you questions. Makes the whole debugging hard for everyone.
 
Vague often makes it hard to help others.
I know, I hate that I raised such a vague question, and I do apologise. Unfortunately, I don't have any further information. It simply locks up and is not available on the network.
A bit of information that may help others help you:
The output of ifconfig -a, so we can see what speed and duplex the link has negotiated, along with the other options suggested by rootbert.

# ifconfig -a
igc0: flags=8863<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=4e527bb<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,JUMBO_MTU,VLAN_HWCSUM,TSO4,TSO6,LRO,WOL_MAGIC,VLAN_HWFILTER,VLAN_HWTSO,RXCSUM_IPV6,TXCSUM_IPV6,NOMAP>
        ether 70:85:c2:98:64:9b
        inet 192.168.2.201 netmask 0xffffff00 broadcast 192.168.2.255
        inet6 fe80::7285:c2ff:fe98:649b%igc0 prefixlen 64 scopeid 0x1
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=23<PERFORMNUD,ACCEPT_RTADV,AUTO_LINKLOCAL>
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=680003<RXCSUM,TXCSUM,LINKSTATE,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x2
        inet 127.0.0.1 netmask 0xff000000
        groups: lo
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
pflog0: flags=0<> metric 0 mtu 33160
        groups: pflog

How about some basic information like what filesystem you are copying to, what is the system you are copying from?
ZFS to ZFS.
Have you tried to monitor the system with something like iostat (man iostat) in another window to see if something is getting interrupt bound like the network device driver or the disk device?
I haven't yet, and will try to. Unfortunately it seems to happen somewhat randomly and I've found it is not only linked to rsync, though rsync can definitely trigger it.
Have you tried something other than rsync, such as scp of a large file? This helps eliminate rsync being part of the problem and also helps determine if there is a threshold where things go bad.
Yes, in fact scp seemed to work fine, if slowly. The system remained responsive, and transferred the files I needed to transfer.
Are you doing this from the console or within a desktop environment? Have you tried not booting to a GUI and running the rsync from the console? This helps eliminate something in the DE that is creating contention (say a file manager parsing/reparsing a directory tree that is constantly changing).
This is all via the console, over SSH, to both machines.
As Geezer says, FreeBSD-14 is CURRENT which is "head of development, not a production release" so there may be things turned on that help debug issues. Kernel options like WITNESS and others may reduce performance. Unless you absolutely need to use 14 because of hardware support, using FreeBSD-13.1-RELEASE would be better.
Understood, and I usually go for something older and well supported, but I couldn't get the NIC to work on anything but 14.
 
BTW, sorry for the late replies on this thread – the forum didn't email me any notifications!
 
Now we know you need to use 14 because of hardware support, so everything I said about kernel options possibly hurting performance holds.

What is the destination filesystem, ZFS or UFS? That could be important. Options on the mount could matter (atime, sync, etc).
ZFS to ZFS, with the following options set (at least on the receiving side):
compress=lz4 atime=off xattr=sa recordsize=1M logbias=throughput
The output of the ifconfig command would really help out here. We can see if the link has negotiated correctly, what options are applied to the hardware, all the little things so we don't have to figure out what we need to ask, ask one question, and repeat. Kind of like telling a mechanic "car won't start" and then he needs to start asking you questions. Makes the whole debugging hard for everyone.
Again, thanks for your help and apologies for not providing the information sooner. The forum didn't notify me of responses.
 
Please don't discourage people from running 14-CURRENT. (BTW, I've been running almost bleeding edge -CURRENT here on all my machines for a good number of years with no such hangs when performing large file transfers.)

What tool or tools are you using to perform the large file transfer?

Is the machine in question acting as the client or the server?

Are you able to rebuild the kernel with the following option added:

options BREAK_TO_DEBUGGER # A BREAK on a serial console goes to
# ddb, if available.

If using a VGA-like console, press Ctrl-Alt-Esc and type bt (for backtrace) at the debugger prompt.

If using a serial console, send a break (on conserver-com that would be Ctrl-E, c, l, 1).

Then list the backtrace here.

Also, an excellent resource is the freebsd-current mailing list. You will receive a lot of help from developers. Developers monitor that mailing list very closely.
 
cy: Sorry, I just wanted to make sure it was needed, because by default there are some things enabled in GENERIC that may affect performance (or at least there used to be).

So a 1 Gb link, full duplex, with jumbo-frame support advertised (JUMBO_MTU), although the MTU is still 1500. I'm assuming that makes sense for your physical link.
Is this system going into a switch? Is the switch 1 Gb?
 
What tool or tools are you using to perform the large file transfer?
rsync was the main culprit. scp seemed to work ok. But I've seen hangs while software on the system is downloading from the internet (torrent for disk images).
Is the machine in question acting as the client or the server?
My previous home media machine was pushing files to the new machine using rsync -aP ./* user@newmedia:/path
Are you able to rebuild the kernel with the following option added:

options BREAK_TO_DEBUGGER # A BREAK on a serial console goes to
# ddb, if available.
I've never (re)built a kernel on FreeBSD or *nix, and I'm a little uncomfortable doing that. However, if someone can point me to a couple of tutorials I could give it a try!
If using a VGA-like console, press Ctrl-Alt-Esc and type bt (for backtrace) at the debugger prompt.

If using a serial console, send a break (on conserver-com that would be Ctrl-E, c, l, 1).

Then list the backtrace here.
Thanks for this information. I'll give it a try next time it happens.
 
So a 1 Gb link, full duplex, with jumbo-frame support advertised (JUMBO_MTU), although the MTU is still 1500. I'm assuming that makes sense for your physical link.
Is this system going into a switch? Is the switch 1 Gb?
Both machines are connected directly to an Asus RT-AC68U, which states "RJ45 for 10/100/1000/Gigabits BaseT for LAN x 4".
 
My systems are connected to two 1 Gb networks internally, while the firewall has three additional 100 Mb interfaces -- two upstream and one to the DMZ. I haven't bothered with it yet. I had reasons not to before but those have evaporated and I never got around to configuring jumbo frames again.

In my network, each server downstairs is connected to two 1 Gb switches. The firewall is connected to one 100 Mb switch as well, plus two 100 Mb uplinks to my cable provider. One of my 1 Gb networks has two WiFi networks attached to it, while the DMZ has another, and potentially one more should I choose to enable WiFi there (for guests, when we have them here).

I enable break to debugger on all my systems in case of hangs. However, sometimes a hang occurs while interrupts are disabled, in which case break to debugger won't work.

You can diagnose the hang by pinging the affected system. If it responds to pings, you know that the lower half of the kernel is responsive while the upper half is not. In that case the debugger may have been disabled because of a buffer overrun or some other kind of memory overlay that affects the debugger. If the system fails to respond to pings, you know that interrupts have been disabled and it's truly a hard hang. If the system is in a spin loop, you'll hear the cooling fan spin up loudly to cool the CPU, which is another clue that it's a spin loop (probably a deadlock condition).
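
That rule of thumb could be captured in a small script, run from another machine on the LAN while the box is hung. This is a minimal sketch: the function names are made up for illustration, and the address in the usage note is the one from the ifconfig output earlier in the thread.

```shell
#!/bin/sh
# Map a ping exit status ($1) to the interpretation described above.
interpret_ping() {
    if [ "$1" -eq 0 ]; then
        echo "replies to ping: interrupts are still serviced, the upper half of the kernel is stuck"
    else
        echo "no reply to ping: interrupts are probably disabled, hard hang"
    fi
}

# Ping the hung host ($1) three times and print the interpretation.
check_hang() {
    ping -c 3 "$1" > /dev/null 2>&1
    interpret_ping "$?"
}
```

For example: "check_hang 192.168.2.201". A missing reply can also mean that only the NIC or its driver has stopped working, so treat the result as a hint rather than proof.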

Your best bet is to post something on freebsd-current@.
 
Well, it has locked up again while I had a screen attached, and I got this lovely message:

IMG_5527.jpeg


After a reboot, checking the status I get this:

# smartctl -H /dev/ada8
smartctl 7.3 2022-02-28 r5338 [FreeBSD 14.0-CURRENT amd64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Please note the following marginal Attributes:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 048 035 040 Old_age Always In_the_past 52 (Min/Max 49/52 #589)
 
The uncorrectable sectors are probably not the cause of this, though your drive needs a bit of TLC (you should replace the drive or, if the data is replaceable, write zeros to the drive to map the bad sectors to spares; open a new thread for this).

The double free is a problem. I cannot diagnose this without a dump. Try to get a core dump and either post to freebsd-current@ or open a bugzilla bug. markj@ is the person who's put the lion's share of the work into VM. He'll be able to quickly fix this.
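
As a rough starting point for getting a dump, crash dumps are normally enabled through /etc/rc.conf. The fragment below is a sketch of the usual settings, not something taken from this system; it assumes a swap device large enough to hold the dump.

```shell
# /etc/rc.conf fragment (assumed typical settings)
dumpdev="AUTO"        # write kernel dumps to the configured swap device
dumpdir="/var/crash"  # where savecore(8) stores the dump on the next boot
```

After the next panic and reboot, savecore(8) should leave vmcore.N and info.N files under /var/crash. These can be opened with kgdb (from the devel/gdb port), for example kgdb /boot/kernel/kernel /var/crash/vmcore.0, or attached to a bug report.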
 
The uncorrectable sectors are probably not the cause of this, though your drive needs a bit of TLC (you should replace the drive or, if the data is replaceable, write zeros to the drive to map the bad sectors to spares; open a new thread for this).
I've re-seated all the drives in the machine to leave a little extra space around the troubled drive for airflow. I'll look at pulling it from the zpool for now so I can try zeroing out the bad sectors. The data on there isn't critical.
The double free is a problem. I cannot diagnose this without a dump. Try to get a core dump and either post to freebsd-current@ or open a bugzilla bug. markj@ is the person who's put the lion's share of the work into VM. He'll be able to quickly fix this.
Thanks very much for your help so far! Could you point me to somewhere that can walk me through the core dump process?
 