FreeBSD 14.1 intermittent freeze

lassino · Dec 23, 2024

Hello everyone! I use FreeBSD on my home NAS, and recently the system has been freezing intermittently. It freezes for several minutes, becomes responsive again for several minutes, and then freezes again. During a freeze, not only do pings and TCP connections to the system time out, but when I connect a display over HDMI, the terminal and the X11 GUI are unresponsive too.

I noticed this problem recently, when the freezes became more frequent and longer. But I suppose it also existed before, because connections to the system sometimes took much longer than usual.

I mainly use the system for seeding and downloading with net-p2p/qbittorrent (the nox flavor), and as a net/samba416 server for network file access. Shortly before I noticed the freezes, I increased the torrenting workload, seeding around 1000 torrents at the same time (previously around 500 torrents).

I searched for information about the problem and tried some measures myself, but failed to identify the root cause:
1. I tried to inspect the system load with systat -ifstat and top right after the system recovered from a freeze, but the statistics seem completely normal. The CPU load is around 3.00 when the system is responsive, which does not seem very high. I also used sysctl dev.re.0.stats=1, provided by the net/realtek-re-kmod driver, to display network statistics, and nothing significantly abnormal could be observed.
2. I found some discussions saying that Realtek Ethernet controllers (which is exactly the NIC in my system) hang under high UDP traffic, so I turned off µTP in qBittorrent and also limited the upload and download bandwidth (to 30 MB/s). I also limited the maximum number of connections to 200. This seems to mitigate the problem, with each freeze seeming to last a shorter time, but it is still occurring. Plus, I suppose a NIC hang should not make the complete system unresponsive.

Here are some other possible causes I suspect, but I don't know how to confirm them:
1. The CPU in my NAS is an Intel i7-12800H, with a hybrid architecture of 6 P-cores and 8 E-cores. Could this be related to scheduling between the E-cores and P-cores?
2. The storage I use for torrenting and Samba access is a ZFS RAID-Z pool on four 8 TB SATA hard drives. Is this related to filesystem I/O?
3. Samba and qBittorrent are running in a jail, which is connected to the network via an epair bound to the network controller. Is this related to the jail or the epair setup?

Do you have any idea what causes the system freeze, and can you suggest what I can do further to identify and solve my problem?

SirDice · Dec 23, 2024

lassino said:
Do you have any idea what causes the system freeze

Check your disks. Any weird DMA errors for example in /var/log/messages? Use sysutils/smartmontools on the disk(s) too. When disks are on their last legs they often give lots of time-outs, that can certainly cause apparent freezes.
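
The two checks above could be sketched roughly as follows. This is only a sketch under assumptions: the grep pattern is a guess at typical disk error wording rather than a complete list, the `/dev/ada[0-9]` glob assumes the SATA disks show up as ada devices, and `smartctl` comes from sysutils/smartmontools. The function names are made up for illustration.

```sh
#!/bin/sh
# Sketch: look for disk trouble in the system log and via SMART.
# Assumptions: the error pattern below is illustrative, not exhaustive,
# and the SATA disks appear as /dev/adaN.

# Filter log text on stdin for lines that look like disk errors.
disk_errors() {
    grep -Ei '(ada|ahcich|nda)[0-9]+.*(timeout|error|retry|retries)'
}

# Print SMART health status and attributes for every ada device present.
smart_report() {
    for dev in /dev/ada[0-9]; do
        [ -e "$dev" ] || continue
        echo "== $dev"
        smartctl -H -A "$dev"
    done
}

# Usage on the NAS (as root):
#   disk_errors < /var/log/messages
#   smart_report
```

Rising reallocated or pending sector counts in the SMART attributes, or repeated timeouts in the log, would both point towards a failing disk.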

Phishfry · Dec 23, 2024

lassino said:
Plus, I suppose a NIC hang should not make the complete system unresponsive.

Realtek would be my suspect since this only happens when you increase torrent seeding.
Check the interrupts for re0 and see if they are the culprit.
vmstat -i
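
To make the culprit easier to spot, the `vmstat -i` output can be ranked by rate. A minimal sketch, assuming each data line ends with two numeric columns (total, then rate); `top_interrupts` is a made-up helper name:

```sh
#!/bin/sh
# Sketch: rank interrupt sources from `vmstat -i` by rate, highest first.
# Assumption: each data line ends with two numeric columns (total, rate).

top_interrupts() {
    awk '
        # Keep lines whose last two fields are integers; skip the summary row.
        $NF ~ /^[0-9]+$/ && $(NF-1) ~ /^[0-9]+$/ && $1 != "Total" {
            # The source name may contain spaces, so rebuild it from the
            # leading fields.
            name = $1
            for (i = 2; i <= NF - 2; i++) name = name " " $i
            printf "%d\t%s\n", $NF, name
        }
    ' | sort -rn | head -n "${1:-5}"
}

# Usage on the NAS:
#   vmstat -i | top_interrupts 5
```

If re0 sits far above everything else in that list, the NIC is at least worth a closer look.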

Phishfry · Dec 23, 2024

My personal feelings are that jails deserve their own interfaces. Vnet passthrough interfaces instead of epair.
Slap in an Intel Gigabit Adapter for your high load jails and use the re0 interface for the host system.

Truthfully you have added on so much fluff any one could be the problem. I like to KISS then migrate.
ZFS
Jails
Realtek
Samba
X11 GUI on a Server with 1000 clients all hitting you up for small files?

I doubt the cores are a problem like this.

Phishfry · Dec 23, 2024

Tangentially, do you think this processor is best for a NAS? No ECC support and you are using ZFS?

Intel® Core™ i7-12800H Processor (24M Cache, up to 4.80 GHz) - Product Specifications | Intel


With 1000 clients I think the term is File Server and not NAS.
Using Realtek for 1000 clients is probably the problem but you also have a mobile CPU without ECC serving ZFS...
So that is my critique of the build. Sorry to be so brutal.

lassino · Dec 23, 2024

Well, although I seed 1000 torrent files, usually only 10-20 seeds are actually active. Meanwhile, in qBittorrent I set the maximum number of connections to 200. I don't think such a torrenting workload is very heavy.

As for the system spec, you are right that it is not ideal, since I just meant to build a personal home NAS out of my old laptop. By the way, is there any severe problem of running ZFS without ECC?

CeXP1917 · Dec 23, 2024

Is there some racctl / rctl limitations for jails?

cracauer@ · Dec 23, 2024

A ZFS hang would still have the machine answer to ping.

A network hardware problem is more likely.

cracauer@ · Dec 23, 2024

lassino said:
By the way, is there any severe problem of running ZFS without ECC?

Not more so than other filesystems.

Phishfry · Dec 23, 2024

lassino said:
I increased the torrenting workload, seeding around 1000 torrents at the same time (previously around 500 torrents).

So I have to wonder about this.
Was it running OK at 500 torrents?
Problems only start when increasing to 1000 torrent seeds???
If so that points to a problem with the torrent serving program.
If 500 ran OK I would gradually increase to 1000 seeders. See where the threshold is, e.g. 600, then 700, then 800.
With minimal actual clients hitting the box the problem may be with the torrent server itself.

ralphbsz · Dec 24, 2024

Ethernet hardware? Hard to debug. My dumb suggestion (probably impractical): Borrow a different Ethernet card (Intel for example), see whether the problem goes away.

Networking stack software? One suggestion would be to get rid of the jails.

One more thing: Under extreme intense storage workloads with deep queues, I've seen storage bring the whole kernel to its knees (not on FreeBSD, on Linux, but the principles are the same). That is sort of the exception to Cracauer's (otherwise correct) observation that disk problems wouldn't break ping. To see whether that's happening, keep "iostat -d 1" running (perhaps with a -n option to see all disks), and save the output to a file. By the way, keeping "vmstat 1" running at all times isn't a bad idea either. Then look whether you see any unusual spikes in workload or latency at the time of your freeze.
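
A minimal sketch of that kind of logging, assuming it is enough to prefix each output line with a wall-clock timestamp so it can be lined up with a freeze afterwards. The `stamp` helper and the log paths are placeholders, not anything standard:

```sh
#!/bin/sh
# Sketch: prefix every line read from stdin with a timestamp.
# Assumption: the log file names in the usage notes are placeholders.

stamp() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$line"
    done
}

# Usage on the NAS:
#   iostat -d 1 | stamp >> /var/log/iostat-freeze.log &
#   vmstat 1    | stamp >> /var/log/vmstat-freeze.log &
```

The timestamp records when a line was read, which can lag behind if a tool buffers its output when writing to a pipe. A gap between consecutive timestamps that is much longer than the one-second interval is itself a useful marker of when a freeze started and ended.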

lassino · Feb 4, 2025

Hi everyone! Thanks a lot for all your suggestions. It has been a while since I posted the thread. After that, the freezes disappeared, probably because I tuned some parameters of net-p2p/qbittorrent, so I could not investigate the problem further with the suggested commands.
Recently the freezes came back, so I tried to obtain some monitoring results with iostat and vmstat. Here are the results from when a freeze happens.

Code:

~ > iostat -n 12 # n = 12 to show all major devices.
            nda0             ada0             ada1             ada2             ada3             ada4             ada5             ada6             ada7            pass0            pass1            pass2
KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s KB/t   tps  MB/s
 8.6    97  0.82  171    47  7.87  0.0     0  0.00  0.0     0  0.00  0.0     0  0.00  150    14  2.01  0.0     0  0.00  0.0     0  0.00  0.0     0  0.00  0.0     0  0.00  0.0     0  0.00  0.0     0  0.00

Code:

~ > vmstat -n 12 # n = 12 to show all major devices.
procs    memory    page                      disks                                                         faults       cpu
r  b  w  avm  fre  flt  re  pi  po   fr   sr nda0 ada0 ada1 ada2 ada3 ada4 ada5 ada6 ada7 pas0 pas1 pas2   in   sy   cs us sy id
 1  0  1 565119758336 575651840 4307   0   0   0 14697  294    1  178    0    0    2  230    2    0    0    0    0    0 27583 98266 151279  3  2 94

Code:

~> vmstat -i
cpu0:timer                          1634        539
cpu1:timer                          1638        540
cpu2:timer                          1618        533
cpu3:timer                          1902        627
cpu4:timer                          1609        530
cpu5:timer                          1867        615
cpu6:timer                          1775        585
cpu7:timer                          1498        494
cpu8:timer                          1853        611
cpu9:timer                          1667        549
cpu10:timer                         3423       1128
cpu11:timer                         1257        414
cpu12:timer                         2113        696
cpu13:timer                         2262        746
cpu14:timer                         2072        683
cpu15:timer                         2079        685
cpu16:timer                         1964        647
cpu17:timer                         1805        595
cpu18:timer                         2078        685
cpu19:timer                         2017        665
irq128: ahci0                        148         49
irq129: xhci0                         16          5
irq130: ahci1                          0          0
irq131: nvme0:admin                    0          0
irq132: nvme0:io0                      9          3
irq133: nvme0:io1                      7          2
irq134: nvme0:io2                      3          1
irq135: nvme0:io3                     14          5
irq136: nvme0:io4                      1          0
irq137: nvme0:io5                      4          1
irq138: nvme0:io6                     10          3
irq139: nvme0:io7                      2          1
irq140: nvme0:io8                      3          1
irq141: nvme0:io9                     12          4
irq142: nvme0:io10                    13          4
irq143: nvme0:io11                     1          0
irq144: nvme0:io12                    13          4
irq145: nvme0:io13                     2          1
irq146: nvme0:io14                     4          1
irq147: re0                       133864      44121
irq149: hdac0                          0          0
irq150: vgapci0                       27          9
Total                             172284      56784

So in my understanding, this seems to be related to the interrupts on the NIC (re0). Is there anything I can do to further identify and solve the problem? Meanwhile, what does the high "fr" value ("pages freed" as per the manual) of vmstat indicate?
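
One way to narrow this down is to measure the re0 interrupt rate over a known window, during a freeze and during normal operation, and compare the two. A rough sketch, assuming two `vmstat -i` snapshots saved to files and that the total is the second-to-last field on each line; `intr_delta` and the /tmp paths are made up for illustration:

```sh
#!/bin/sh
# Sketch: per-second interrupt rate for one source over a sampling window,
# computed from two `vmstat -i` snapshots saved to files.
# Assumption: the "total" column is the second-to-last field on each line.

# intr_delta NAME BEFORE_FILE AFTER_FILE SECONDS
intr_delta() {
    awk -v name="$1" -v secs="$4" '
        # Pick the line for the named source in each snapshot.
        index($0, name) && $(NF-1) ~ /^[0-9]+$/ {
            if (FILENAME == ARGV[1]) before = $(NF-1)
            else after = $(NF-1)
        }
        END { printf "%d\n", (after - before) / secs }
    ' "$2" "$3"
}

# Usage on the NAS:
#   vmstat -i > /tmp/intr.1; sleep 10; vmstat -i > /tmp/intr.2
#   intr_delta re0 /tmp/intr.1 /tmp/intr.2 10
```

If the rate during a freeze is many times the rate seen while the system is responsive, that would support the interrupt theory; if it is about the same, the NIC interrupts are probably just normal traffic.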
