3 cores pegged 24/7 by interrupts with no endpoints + MCE

NAS server based on Supermicro H13SSL-N w/ EPYC 9214. TrueNAS CORE Community 13.0-U6.8 (FreeBSD 13.1-RELEASE-p9)

A month ago we replaced the HBAs, the RAM and the OS; since then, 3 CPU cores have been pegged at 100% continuously. It persists across reboots.

Usually this would point to a process, or to a device such as a NIC, HBA or USB controller. Here, though, it's all interrupt time, and the interrupts appear to come from PCIe root ports that have no downstream endpoints at all.

Code:
top -SH
last pid: 30669;  load averages:  3.33,  3.15,  3.12    up 1+12:17:11  16:52:25
2199 threads:  36 running, 2020 sleeping, 143 waiting
CPU:  0.0% user,  0.0% nice,  0.0% system,  9.4% interrupt, 90.6% idle
Mem: 128K Active, 1500M Inact, 179M Laundry, 236G Wired, 11G Free
ARC: 211G Total, 71G MFU, 140G MRU, 31K Anon, 53M Header, 61M Other
     205G Compressed, 252G Uncompressed, 1.23:1 Ratio
Swap: 10G Total, 10G Free
  PID USERNAME    PRI NICE   SIZE    RES STATE    C   TIME    WCPU COMMAND
   11 root        155 ki31     0B   512K CPU23   23  36.4H 100.00% idle{idle: cpu23}
   11 root        155 ki31     0B   512K CPU25   25  36.0H 100.00% idle{idle: cpu25}
   11 root        155 ki31     0B   512K CPU13   13  35.8H 100.00% idle{idle: cpu13}
   11 root        155 ki31     0B   512K CPU15   15  35.8H 100.00% idle{idle: cpu15}
   11 root        155 ki31     0B   512K CPU12   12  35.8H 100.00% idle{idle: cpu12}
   11 root        155 ki31     0B   512K RUN     14  35.8H 100.00% idle{idle: cpu14}
   11 root        155 ki31     0B   512K CPU10   10  35.7H 100.00% idle{idle: cpu10}
   11 root        155 ki31     0B   512K CPU7     7  35.9H  99.73% idle{idle: cpu7}
   12 root        -80    -     0B  2336K CPU18   18  36.3H  99.56% intr{irq156: pcib9}
   12 root        -80    -     0B  2336K CPU26   26  36.3H  98.92% intr{irq176: pcib12}
   12 root        -80    -     0B  2336K CPU20   20  36.3H  98.17% intr{irq157: pcib10}
   11 root        155 ki31     0B   512K CPU9     9  35.8H  94.85% idle{idle: cpu9}
...


Code:
pciconf -lv | egrep -n 'pcib9@|pcib10@|pcib12@' -A6
112:pcib9@pci0:128:1:1:    class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
113-    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
114-    class      = bridge
115-    subclass   = PCI-PCI
116:pcib10@pci0:128:1:2:    class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
117-    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
118-    class      = bridge
119-    subclass   = PCI-PCI
120-pcib11@pci0:128:1:3:    class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
121-    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
122-    class      = bridge
--
124:pcib12@pci0:128:1:4:    class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
125-    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
126-    class      = bridge
127-    subclass   = PCI-PCI
128-hostb9@pci0:128:2:0:    class=0x060000 rev=0x01 hdr=0x00 vendor=0x1022 device=0x149f subvendor=0x0000 subdevice=0x0000
129-    vendor     = 'Advanced Micro Devices, Inc. [AMD]'
130-    class      = bridge

MSI and MSI-X are enabled.
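
To put numbers on the interrupt load, one option would be to filter `vmstat -i` for sources with a high average rate. This is only a minimal sketch: the `hot_irqs` name and the 10000/s default threshold are placeholders of mine, not anything shipped with FreeBSD or TrueNAS, and it assumes the usual `vmstat -i` layout with the rate in the last column.

```shell
#!/bin/sh
# hot_irqs: hypothetical helper, not part of FreeBSD or TrueNAS.
# Reads `vmstat -i` style output on stdin and prints the rows whose
# last column (average interrupts per second) exceeds a threshold.
# The default of 10000 is an arbitrary placeholder.
hot_irqs() {
    threshold="${1:-10000}"
    awk -v t="$threshold" '
        $1 == "Total" { next }                         # skip the summary row
        $NF ~ /^[0-9]+$/ && $NF + 0 > t + 0 { print }  # header row fails the numeric check
    '
}

# Usage on the affected box (not run here):
#   vmstat -i | hot_irqs 10000
```

If pcib9, pcib10 and pcib12 turn up here with rates far above the HBAs and NICs, that would at least be consistent with what top shows.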

Also, as it happens, the server had an MCE yesterday. It was the first one, after weeks of testing plus 3 weeks in production.
Version String: FreeBSD 13.1-RELEASE-p9 n245433-9dc2dc9b081 TRUENAS
Panic String: Unrecoverable machine check exception
2026-02-07 04:28:33 Memory [MEM-0001] Uncorrectable ECC / other uncorrectable memory error @DIMMG2 - Assertion Sensor-specific
2026-02-07 04:28:33 ProcessorConfiguration [PC-0153] Configuration error - CPU 1 LS Uncorrectable error - Assertion Sensor-specific
2026-02-07 04:28:15 Watchdog [WDT-0131] Timer interrupt - interrupt type: none, timer use at expiration: SMS/OS - Assertion Sensor-specific
2026-02-07 04:24:49 ProcessorConfiguration [PC-0153] Configuration error - CPU 1 LS Uncorrectable error - Assertion

With all this in mind…

1. How likely is it that CPU hardware is the cause for both issues?
2. Is the next step to temporarily disable MSI and MSI-X (sysctl hw.pci.enable_msix=0, sysctl hw.pci.enable_msi=0)? How disruptive is that to a production system? (Seconds of downtime? Risk of dropping storage/network mid-I/O?)

My colleague, who's the more experienced Linux guy, is "not going to worry about the pegged threads now". That worries me, and so does the issue itself. Also, this is something I can look into now, while conjuring up a replacement CPU+RAM+mobo will take days at best.
 
TrueNAS CORE Community 13.0-U6.8 (FreeBSD 13.1-RELEASE-p9)

You did read the forum rules, didn't you?

FreeBSD 13.1 has been EOL for over *two and a half years* now.