System freeze with all FreeBSD (Windows/Linux OK)

I submitted this as a FreeBSD bug, but thought I'd post it here because it looks like whoever took the bug is just treating it as a random system hang (as in something wrong with the hardware) when there's evidence to the contrary. I have a dual-socket 7742 system (128 total real cores, 256 threads) that will completely lock up in under an hour if left idle. Windows 10/11 and Ubuntu Linux work without issue. By "lock up", I mean:

* Console unresponsive (no keyboard/USB/numlock)
* Networking unresponsive (no pings, no arps, nothing)

It's as if the CPU is "jumping to self" with all interrupts disabled. The system needs to be reset or power cycled. I have tried the following releases over the last few months with the same results:

FreeBSD 13.2-RELEASE releng/13.2-n254617-525ecfdad597 GENERIC amd64
FreeBSD 13.1
FreeBSD 13.0
FreeBSD 12.3
Several memstick images of 14.0 since December 2022

Other notes:

* The lockup is guaranteed. I've never had it not lock up when left idle. Always locks up in <1 hour (usually in 10-20 minutes).

* If I run a "stress" program, the system runs for days at a time without any observed lockups. If there's any significant system activity, it appears to not lock up.

* At one point (on a 14.0 build) I was able to get the kernel debugger compiled in. When the system locked up, hitting the local USB keyboard sequence to get into the kernel debugger worked. This also seemed to unlock the system: after I exited the kernel debugger, the system was alive again.

* I've installed the OSes on either 2GB M.2 Samsung SSDs *OR* on a Western Digital SN200 NVMe disk. No change in behavior. Storage does not appear to be a factor.

* I've halved the memory and swapped DIMMs entirely. No change.

System specs:

Motherboard : Asus rs700a-e11-rs12u-wocpu009z
CPUs : Dual AMD 7742 CPUs
BIOS Version : 0901
BMC Firmware model : RS700A-E11-RS12U
BMC Firmware version: 1.2.15
Installed ECC memory: 512GB
Storage : Two Samsung EVO 980 TB M.2 SSDs, and a WD SN200 7.68TB NVMe U.2 disk

Video is the ASpeed AST2500, which supplies video for the system.

I'd be happy to put this system on the internet and allow any and all interested parties access to it for troubleshooting/debugging. Does this look even remotely familiar to anyone? I had thought at one point that it might be related to the number of CPU cores (128 cores, 256 threads total), or perhaps CPU state related since it locks up at idle and never when loaded. Either way, I'd love to deploy FreeBSD on this system if I can.
 


Your posting was waiting to be cleared by a moderator. The forum told you about that, so there is no need to post this 3 times. Until you have reached a certain post count, everything you write needs to be cleared by a moderator, so please have some patience. It is Sunday, and I do happen to have kids to raise. OK?
 
Got it - from my perspective, it looked like they were disappearing. I'd post it, see the "Awaiting moderator" notification, then they'd disappear entirely from the summary for that group with no further notification. It did look like they were being deleted. Sorry for the thrash - I couldn't figure out what was going on!
 
No problem.

For your system, does it still happen when powerd is not running? Also, you may try setting cx_lowest for some cores to C0 when powerd is running. I had this problem also with an ASUS board. Their BIOS seems... suboptimal.
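
A minimal Python sketch of the cx_lowest suggestion above, for anyone who wants to try it on a subset of cores. It assumes the per-CPU sysctl is named dev.cpu.N.cx_lowest and defaults to C1 on the assumption that this is the shallowest state FreeBSD names; the helper names cx_lowest_commands and apply_commands are hypothetical. By default it only prints the commands instead of running them.

```python
import re
import subprocess


def cx_lowest_commands(cpu_ids, state="C1"):
    """Build one sysctl command per CPU that limits its deepest idle state.

    Assumption: the per-CPU knob is dev.cpu.<N>.cx_lowest and accepts
    values such as C1, C2, ... or Cmax.
    """
    if not re.fullmatch(r"C(\d+|max)", state):
        raise ValueError(f"unexpected C-state name: {state!r}")
    return [["sysctl", f"dev.cpu.{cpu}.cx_lowest={state}"] for cpu in cpu_ids]


def apply_commands(commands, dry_run=True):
    """Print the commands, or run them (as root) when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling apply_commands(cx_lowest_commands(range(4))) would print the four sysctl lines for CPUs 0 to 3 so they can be reviewed before anything is changed on the box.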
 
Thanks Crivens/Facedebouc! powerd isn't running currently. Facedebouc, I couldn't find a C6 state in BIOS, but I did find something called "Global C-State control". It was set to auto, and now I've set it to disabled, so maybe that'll help? Time will tell... will report back when I have some evidence (or lack of it) one way or another. Thank you both!
 
tl;dr version: setting "Global C-State control" to "enabled" or "disabled" (instead of "auto") fixes the problem. With either setting, the system runs 24/7. I'll also update the bug with details.

However... if I have it set to "disabled", the system draws considerably more power:

Power in = 300 watts
Power out = 276 watts
CPU = 168 watts
Mem = 112 watts

When set to "enabled", it's quite a bit more reasonable, and in line with Windows and Linux distros I tried:

Power in = 120 watts
Power out = 108 watts
CPU = 88 watts
Mem = 16 watts

So... recommendation is to set it to enabled. Now I'll go mess around with powerd. Thanks much for the assistance folks!
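
To put numbers on the difference, here is a small Python sketch that compares the two sets of readings quoted above. The dictionary keys and the savings helper are made up for illustration; the wattages are the ones from this post.

```python
# Readings in watts, copied from the post above.
disabled = {"power_in": 300, "power_out": 276, "cpu": 168, "mem": 112}
enabled = {"power_in": 120, "power_out": 108, "cpu": 88, "mem": 16}


def savings(before, after):
    """Return (watts saved, percent saved) for each reading."""
    result = {}
    for key in before:
        drop = before[key] - after[key]
        result[key] = (drop, round(100 * drop / before[key], 1))
    return result


for name, (watts, percent) in savings(disabled, enabled).items():
    print(f"{name}: {watts} W less ({percent}%)")
```

On these figures, "enabled" draws 180 W less at the input than "disabled", a 60% reduction at idle.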
 