Random lockups, no pattern to it

Okay, so this is probably not a question that can be answered, but I'll give it a shot. Long post!

The system started out on FreeBSD 11.1, I think, and from the get-go suffered from random lockups, which we noticed from the network side. We couldn't monitor it (no GPU permanently installed), so it was hard to tell what was going on. The swap size is smaller than RAM, so core dumps wouldn't save (as I found out later). Sometimes it would spin the fans up to 100% and stay like that until reboot. Regardless, some OS updates later those lockups randomly stopped. We didn't do anything hardware-wise, they just stopped. Later it got upgraded to 11.2 and ran extremely well, until the recent upgrade to 12.0, when the problem started happening again: once a week at first and probably once a day now.

I was lucky enough to witness a moment when the system appeared to turn off by itself, only to turn back on a moment later (judging by the fans and the clicks from the PSU), but it didn't come back fully. A display was connected at the time and it stayed black, though the moment before it had been showing the usual login screen after a boot. I fumbled with it a bit and realized that suspend/resume could cause this kind of behavior. And indeed I triggered the exact same behavior by putting it to sleep and trying to wake it: no display, no network, but it did react appropriately to a blindly typed "shutdown -r now" and to ctrl+alt+del. Okay, a known issue with FreeBSD's sleep/resume. But that obviously left two questions:
  • supposing it was sleep/resume, why on earth would the system even try to enter sleep when that wasn't configured anywhere, and at random times too?
  • why would it try to resume half a second after going to sleep?
Weird. So anyway, I went ahead and disabled sleep mode everywhere I could find - that included S3 mode in the BIOS (confirmed by checking from the OS) and putting kern.suspend_blocked="1" in sysctl.conf.
But the problem didn't go away as I'd hoped. A day later I got a kernel panic that references a spin lock being held for too long. A photo of the screen is attached. Weirdly, it didn't even try to reboot after that, even though I had explicitly told it to (kdb and ddb were in the kernel config along with KDB_UNATTENDED) and a check with sysctl confirmed it was set to reboot. At the panic screen the keyboard wasn't active, and all I could do was physically power cycle the box.
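
For anyone who wants to repeat the "is sleep really disabled, and will a panic reboot" check on their own box, here is a minimal sketch in Python. It is not what I actually ran, just an illustration. It assumes the usual FreeBSD sysctl names (hw.acpi.supported_sleep_state, hw.acpi.suspend_state, the hw.acpi.*_button_state and lid switch knobs, debug.debugger_on_panic) and the default "name: value" output of sysctl. The helper names parse_sysctl, review and read_live are made up for this example, and the way the values are interpreted is my assumption.

```python
import subprocess

# ACPI knobs that can map an event to a sleep state (assumed FreeBSD names).
SLEEP_KNOBS = (
    "hw.acpi.suspend_state",
    "hw.acpi.power_button_state",
    "hw.acpi.sleep_button_state",
    "hw.acpi.lid_switch_state",
)


def parse_sysctl(text):
    """Turn sysctl's default 'name: value' lines into a dict."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # skip lines that are not 'name: value'
            values[name.strip()] = value.strip()
    return values


def review(values):
    """Return human-readable notes about settings that still allow S3
    or that would keep a panicked kernel from rebooting (assumed meaning)."""
    findings = []
    supported = values.get("hw.acpi.supported_sleep_state", "").split()
    if "S3" in supported:
        findings.append("firmware still advertises S3")
    for knob in SLEEP_KNOBS:
        if values.get(knob) == "S3":
            findings.append(f"{knob} is mapped to S3")
    if values.get("debug.debugger_on_panic", "0") != "0":
        findings.append("debug.debugger_on_panic is nonzero: "
                        "a panic enters the debugger instead of rebooting")
    return findings


def read_live():
    """Query the running system; only meaningful where sysctl(8) exists."""
    names = ("hw.acpi.supported_sleep_state",
             "debug.debugger_on_panic") + SLEEP_KNOBS
    result = subprocess.run(["sysctl", *names],
                            capture_output=True, text=True, check=False)
    return parse_sysctl(result.stdout)
```

On the machine itself, printing each entry of review(read_live()) lists anything that still points at S3. An empty list only means these particular knobs look clean, not that suspend is ruled out as the cause.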

So that's the story so far, and I find it really hard to understand what may be going on. The RAM tested fine and smartctl doesn't show any issues either. I thought the PSU (which is a Corsair, by the way) might be faulty, but that wouldn't explain the panic or the fact that the box was happily running without a single lockup for months just before the upgrade. I also tried running it on the GENERIC kernel for a while after upgrading to 12 - same thing. I checked CPU temperatures from time to time - all normal. It doesn't lock up under stress, whether it's compiling something or running mprime, although I have never left it running continuously. The BIOS is now updated to the latest version, and there was no BIOS update during the 11.2->12 transition.

Any ideas where to look?

Quick specs:
ASRock X370 Mini ITX motherboard, Ryzen 1700X, Intel 512 GB SSD, ZFS.
 

Attachments

  • IMG_7296.JPG