NAS server based on a Supermicro H13SSL-N with an EPYC 9214, running TrueNAS CORE Community 13.0-U6.8 (FreeBSD 13.1-RELEASE-p9).
A month ago we replaced the HBAs, the RAM and the OS; since then, 3 CPU cores have been pegged at 100% continuously. This persists across reboots.
Usually this would point to a process, or to a device such as a NIC, HBA or USB controller. Here, however, all of the load is interrupt time, and the interrupts appear to come from PCIe root ports that have no downstream endpoints at all.
Code:
top -SH
last pid: 30669; load averages: 3.33, 3.15, 3.12 up 1+12:17:11 16:52:25
2199 threads: 36 running, 2020 sleeping, 143 waiting
CPU: 0.0% user, 0.0% nice, 0.0% system, 9.4% interrupt, 90.6% idle
Mem: 128K Active, 1500M Inact, 179M Laundry, 236G Wired, 11G Free
ARC: 211G Total, 71G MFU, 140G MRU, 31K Anon, 53M Header, 61M Other
205G Compressed, 252G Uncompressed, 1.23:1 Ratio
Swap: 10G Total, 10G Free
PID USERNAME PRI NICE SIZE RES STATE C TIME WCPU COMMAND
11 root 155 ki31 0B 512K CPU23 23 36.4H 100.00% idle{idle: cpu23}
11 root 155 ki31 0B 512K CPU25 25 36.0H 100.00% idle{idle: cpu25}
11 root 155 ki31 0B 512K CPU13 13 35.8H 100.00% idle{idle: cpu13}
11 root 155 ki31 0B 512K CPU15 15 35.8H 100.00% idle{idle: cpu15}
11 root 155 ki31 0B 512K CPU12 12 35.8H 100.00% idle{idle: cpu12}
11 root 155 ki31 0B 512K RUN 14 35.8H 100.00% idle{idle: cpu14}
11 root 155 ki31 0B 512K CPU10 10 35.7H 100.00% idle{idle: cpu10}
11 root 155 ki31 0B 512K CPU7 7 35.9H 99.73% idle{idle: cpu7}
12 root -80 - 0B 2336K CPU18 18 36.3H 99.56% intr{irq156: pcib9}
12 root -80 - 0B 2336K CPU26 26 36.3H 98.92% intr{irq176: pcib12}
12 root -80 - 0B 2336K CPU20 20 36.3H 98.17% intr{irq157: pcib10}
11 root 155 ki31 0B 512K CPU9 9 35.8H 94.85% idle{idle: cpu9}
...
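To put numbers on the interrupt load, it may help to rank interrupt sources by rate. This is a minimal sketch, assuming the usual three-column `vmstat -i` layout (name, total, rate); `top_irqs` is a hypothetical helper name, not an existing tool:

```shell
#!/bin/sh
# top_irqs [N]: read `vmstat -i` output on stdin and print the N busiest
# interrupt sources as "rate<TAB>name", highest rate first (default N=5).
# Hypothetical helper for illustration only.
top_irqs() {
    n="${1:-5}"
    awk 'NR > 1 && $1 != "Total" && NF >= 3 {
             # The last two fields are total and rate; everything before is the name.
             name = $1
             for (i = 2; i <= NF - 2; i++) name = name " " $i
             printf "%s\t%s\n", $NF, name
         }' | sort -rn | head -n "$n"
}
```

Usage would be `vmstat -i | top_irqs 3`. Note that `vmstat -i` reports the average rate since boot, so comparing the totals from two runs a few seconds apart gives a better picture of the current rate.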
Code:
pciconf -lv | egrep -n 'pcib9@|pcib10@|pcib12@' -A6
112:pcib9@pci0:128:1:1: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
113- vendor = 'Advanced Micro Devices, Inc. [AMD]'
114- class = bridge
115- subclass = PCI-PCI
116:pcib10@pci0:128:1:2: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
117- vendor = 'Advanced Micro Devices, Inc. [AMD]'
118- class = bridge
119- subclass = PCI-PCI
120-pcib11@pci0:128:1:3: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
121- vendor = 'Advanced Micro Devices, Inc. [AMD]'
122- class = bridge
--
124:pcib12@pci0:128:1:4: class=0x060400 rev=0x01 hdr=0x01 vendor=0x1022 device=0x14ab subvendor=0x1022 subdevice=0x1453
125- vendor = 'Advanced Micro Devices, Inc. [AMD]'
126- class = bridge
127- subclass = PCI-PCI
128-hostb9@pci0:128:2:0: class=0x060000 rev=0x01 hdr=0x00 vendor=0x1022 device=0x149f subvendor=0x0000 subdevice=0x0000
129- vendor = 'Advanced Micro Devices, Inc. [AMD]'
130- class = bridge
MSI and MSI-X are enabled.
Also, as it happens, the server had an MCE yesterday. It was the first one, after weeks of testing plus 3 weeks in production:
Version String: FreeBSD 13.1-RELEASE-p9 n245433-9dc2dc9b081 TRUENAS
Panic String: Unrecoverable machine check exception
2026-02-07 04:28:33 Memory [MEM-0001] Uncorrectable ECC / other uncorrectable memory error @DIMMG2 - Assertion Sensor-specific
2026-02-07 04:28:33 ProcessorConfiguration [PC-0153] Configuration error - CPU 1 LS Uncorrectable error - Assertion Sensor-specific
2026-02-07 04:28:15 Watchdog [WDT-0131] Timer interrupt - interrupt type: none, timer use at expiration: SMS/OS - Assertion Sensor-specific
2026-02-07 04:24:49 ProcessorConfiguration [PC-0153] Configuration error - CPU 1 LS Uncorrectable error - Assertion
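The BMC event log above is only one side of the picture; the kernel's own machine-check records (bank, status, address) may help tell a memory fault from a CPU fault. This is a minimal sketch, assuming the records were logged with FreeBSD's usual `MCA:` line prefix and survive in `/var/log/messages` or in the crash dump's message buffer; `mca_records` is a hypothetical helper name:

```shell
#!/bin/sh
# mca_records: read kernel log text on stdin and keep only machine-check lines.
# Assumes the "MCA:" prefix; a syslog timestamp/hostname prefix is tolerated.
# Hypothetical helper for illustration only.
mca_records() {
    grep -E '(^|[[:space:]])MCA: '
}
```

Usage would be `mca_records < /var/log/messages` or `dmesg | mca_records`.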
With all this in mind…
1. How likely is it that the CPU hardware is the cause of both issues?
2. Is the next step to temporarily disable MSI and MSI-X (sysctl hw.pci.enable_msix=0, sysctl hw.pci.enable_msi=0)? How disruptive is that on a production system: seconds of downtime, or a risk of dropping storage/network mid-I/O?
My colleague, who is the more experienced Linux guy, is "not going to worry about the pegged threads now". That worries me, and so does the issue itself. Also, this is something I can look into now, while conjuring a replacement CPU+RAM+mobo will take days at best.

