Solved: support for hardware watchdogs on Supermicro mainboard (X10SRA-F)

dch

Developer
My system started hanging a few weeks ago, and I suspect hardware problems. It's a hard hang: the whole system freezes but never reboots. This post is about finding a way to trigger a reboot if the system hangs, not about the underlying problem itself!

My mainboard has both IPMI and a BIOS-enabled hardware watchdog feature, which seems to be set to around the 5-minute mark. I've not yet found how to inform the BIOS watchdog that the system is running, so I turned it off, as 5 minutes of uptime is not my thing.

There's ichwd(4) and watchdogd(8), which in theory should be sufficient:

Code:
# kldload ipmi
# kldload ichwd
# ls /dev/fido
/dev/fido
# watchdogd -d
watchdogd: mlockall failed: Cannot allocate memory
...

That's an ominous error message!

After loading these drivers, dmesg reports:

Code:
[23] ipmi0: <IPMI System Interface> port 0xca2,0xca3 on acpi0
[23] ipmi0: KCS mode found at io 0xca2 on acpi
[23] ipmi0: IPMI device rev. 1, firmware rev. 3.45, version 2.0, device support mask 0xbf
[23] ipmi0: Number of channels 2
[23] ipmi0: Attached watchdog
[23] ipmi0: Establishing power cycle handler
[25] ipmi1 failed to probe on isa0
[45] ichwd0: <Intel Wellsburg watchdog timer> on isa0

As an alternative, the sysutils/freeipmi port has bmc-watchdog(8), which reports:

Code:
root@wintermute /u/h/dch# bmc-watchdog --get
Timer Use:                   SMS/OS
Timer:                       Stopped
Logging:                     Enabled
Timeout Action:              None
Pre-Timeout Interrupt:       None
Pre-Timeout Interval:        0 seconds
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Clear
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           0 seconds
Current Countdown:           0 seconds

which appears to be completely unrelated, but might do the job if it communicates with the BMC?

I will try this next and update the thread as I go. The hang can be hours away, so it could take some time.
 
Well I am in luck - 3 hangs today! So far, experimental results:

- the external BIOS watchdog appears not to be useful for anything other than annoying me every 5 minutes
- watchdogd(8) along with ichwd(4) does exactly what I wanted

When I kldload the appropriate things, and start watchdogd, this is what we see in dmesg:

Code:
[242] ichwd0: <Intel Wellsburg watchdog timer> on isa0
[242] pcib1: allocated type 4 (0x430-0x437) for rid 0 of ichwd0
[242] pcib1: allocated type 4 (0x460-0x47f) for rid 1 of ichwd0
[242] ichwd0: timer disabled
[242] pcib1: allocated type 4 (0x3f0-0x3f5) for rid 0 of fdc0
[242] pcib1: allocated type 4 (0x3f7-0x3f7) for rid 1 of fdc0
[242] ppc0: cannot reserve I/O port range
[298] ichwd0: timer enabled
[298] ichwd0: timeout set to 229 ticks

It seems to work perfectly. Case closed? If anybody has more wisdom to share on watchdogs, that would be appreciated!
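
For anyone wanting the same setup to survive a reboot, here is a minimal sketch of the usual FreeBSD boot-time settings. It is not taken from my running config: it assumes the stock ichwd(4) and ipmi(4) modules and the base-system watchdogd rc script, and the two fragments belong in two different files as marked in the comments.

```shell
# Fragment for /boot/loader.conf: load the watchdog drivers at boot.
ichwd_load="YES"
ipmi_load="YES"

# Fragment for /etc/rc.conf: start watchdogd(8) at boot.
watchdogd_enable="YES"
```

After editing, the daemon can be started without rebooting via "service watchdogd start".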

As a bonus, using bmc-watchdog --get from sysutils/freeipmi seems to interrogate the same underlying hardware - although possibly via a different mechanism:

Code:
# bmc-watchdog --get
Timer Use:                   SMS/OS
Timer:                       Running     <-------- this has changed
Logging:                     Enabled
Timeout Action:              Power Cycle <-------- this has changed
Pre-Timeout Interrupt:       None
Pre-Timeout Interval:        120 seconds <-------- this has changed
Timer Use BIOS FRB2 Flag:    Clear
Timer Use BIOS POST Flag:    Clear
Timer Use BIOS OS Load Flag: Clear
Timer Use BIOS SMS/OS Flag:  Clear
Timer Use BIOS OEM Flag:     Clear
Initial Countdown:           137 seconds
Current Countdown:           132 seconds

And querying it repeatedly shows the countdown doing what we'd expect, albeit with slightly wonky numbers. But the result seems to work.
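
The numbers may be less wonky than they look. A rough check, under two assumptions that do not come from this thread: that the watchdog interval handed to the kernel is a power-of-two number of nanoseconds (2^37 ns being the one nearest a 128 second timeout), and that one ichwd tick is roughly 0.6 seconds. Both are my reading of watchdog(4) and the ichwd driver, so treat this as a sketch rather than the definitive explanation.

```shell
#!/bin/sh
# Assumption: the kernel watchdog interval is 2^37 nanoseconds.
timeout_ns=$(( 1 << 37 ))

# Assumption: one ichwd tick is about 0.6 s, i.e. 600,000,000 ns.
tick_ns=600000000

# Integer division, which truncates any fractional part.
ticks=$(( timeout_ns / tick_ns ))
seconds=$(( timeout_ns / 1000000000 ))

echo "ticks=$ticks seconds=$seconds"
```

Under those assumptions the script prints ticks=229 seconds=137, which matches both the "timeout set to 229 ticks" line in dmesg and the 137 second initial countdown reported by bmc-watchdog.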
 
Try to figure out why it just hangs in the first place though. While this may be helpful to combat some of the symptoms, those freezes don't sound good. Eventually they're going to cause filesystem errors (freezes and instant reboots are never good for the consistency of the filesystem) and those could lead to data-loss or worse.
 
I have an old Thinkpad in the closet which I suspect has a BGA problem.
When the thing gets flexed a bit, the card reader chip causes an interrupt storm, causing the computer to noticeably slow down.
Luckily the reader can be disabled.

If you have an interrupt storm, it can also cause the watchdog to reset the machine, even if the computer does not really "hang".
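
One way to look for a storm is the per-source interrupt rate from "vmstat -i". The helper below is only a sketch: the function name busy_interrupts and the default limit of 10000 interrupts per second are made up for illustration, and it assumes the rate is the last column of each line.

```shell
# Print lines from `vmstat -i` output whose last column (the rate)
# is a number greater than the given limit. The header line is skipped
# because its last field is not numeric.
# The name and the default limit are hypothetical, chosen for illustration.
busy_interrupts() {
    awk -v limit="${1:-10000}" '$NF ~ /^[0-9]+$/ && $NF + 0 > limit + 0 { print }'
}

# Usage on FreeBSD:
#   vmstat -i | busy_interrupts 10000
```

A source that stays far above its normal rate while the machine feels sluggish would be a candidate to disable or investigate.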
dch said: "And querying it repeatedly shows the countdown doing what we'd expect, albeit with slightly wonky numbers. But the result seems to work."
Could this be caused by missing some tick interrupts?

I would be curious if you'd post when you have found out what is going wrong :)
 