jails system powered off on its own, twice in about 2 weeks

I have modified my system to be virtualized as much as possible (in an attempt to minimize the number of reboots). The only physical part is the kernel and base system (micro, some ZFS management utilities, etc.). I have two separate jails: one for routing traffic in my home and the other for my workstation usage.

I'm trying to debug the events, but I'm not seeing anything standing out in /var/log/messages. The only thing I see is that I booted up the system and the corresponding output from that. My system is certainly old, I think it is from around 2012/2013, so it is entirely possible it is dying, but I don't understand what would cause a sudden power off yet still allow it to power on.

My prior experience with hardware failures is that they are persistent, not intermittent.

Any ideas on debugging this? The reason I put this here is that I'm partially thinking that perhaps I might also have resource contention between the host and jailed guest, especially the video part.
 
Try replacing your PSU and take a closer look at the motherboard for faulty capacitors. Also check the temps of the CPU/VGA etc.
 
Hmm. Mine has powered off on its own twice in the last two weeks, also. I thought it might be a hardware issue, too, but haven't taken the time to look into it.
 
My power settings have not changed. I'm not sure they make sense, but this is what I set many years ago when I first started using FreeBSD. I am using:

Code:
hw.acpi.cpu.cx_lowest: C8

/etc/rc.conf
Code:
powerd_enable="YES"

performance_cx_lowest="Cmax"
economy_cx_lowest="Cmax"

/boot/loader.conf
Code:
hint.p4tcc.0.disabled="1"
hint.acpi_throttle.0.disabled="1"
hint.apic.0.clock="0"
kern.hz="100"
hint.atrtc.0.clock="0"
hw.pci.do_power_nodriver="3"

/etc/sysctl.conf
Code:
# system/power-management
dev.cpu.0.cx_lowest="C2"

# system/acpi
machdep.idle="hlt"

I thought I checked this before when setting up my system, but this is what is supported:

Code:
dev.cpu.0.cx_supported: C1/1/1 C2/2/59 C3/3/80

Does this mean C8 is invalid? Looking at cx_lowest for each CPU, all but one are set to C8:
Code:
dev.cpu.7.cx_lowest: C8
dev.cpu.5.cx_lowest: C8
dev.cpu.3.cx_lowest: C8
dev.cpu.1.cx_lowest: C8
dev.cpu.6.cx_lowest: C8
dev.cpu.4.cx_lowest: C8
dev.cpu.2.cx_lowest: C8
dev.cpu.0.cx_lowest: C2
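
On the C8 question, here is a rough sketch of how the two sysctls relate. It assumes that cx_lowest acts as a limit rather than an exact state, so a value deeper than anything in cx_supported simply resolves to the deepest state the CPU reports. That clamping is my understanding of the ACPI CPU driver, not something confirmed in this thread, and the function names are made up for illustration.

```python
import re

def parse_cx_supported(line):
    """Return the C-state numbers listed in a dev.cpu.N.cx_supported line."""
    _, _, value = line.partition(":")
    return [int(n) for n in re.findall(r"C(\d+)/", value)]

def effective_cx_lowest(requested, supported):
    """Deepest supported state that is not deeper than the requested one.

    Assumption: the kernel treats cx_lowest as an upper bound and clamps it.
    """
    wanted = int(requested.lstrip("C"))
    usable = [s for s in supported if s <= wanted]
    return "C%d" % max(usable) if usable else None

supported = parse_cx_supported("dev.cpu.0.cx_supported: C1/1/1 C2/2/59 C3/3/80")
print(supported)                             # [1, 2, 3]
print(effective_cx_lowest("C8", supported))  # C3
print(effective_cx_lowest("C2", supported))  # C2
```

If that assumption holds, C8 is not invalid as such; it would just behave like C3 on this CPU.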

Code:
freq_levels: 3401/77000 3400/77000 3300/73840 3100/67694 3000/64705 2900/61772 2800/58907 2600/53315 2500/50601 2400/47940 2200/42787 2100/40284 2000/37833 1900/35433 1700/31421 1600/29164

sysctl dev.cpu.{0,1,2,3,4,5,6,7}.cx_usage:
Code:
dev.cpu.0.cx_usage: 0.00% 0.00% 0.00% last 10178us
dev.cpu.1.cx_usage: 0.00% 0.00% 0.00% last 5445us
dev.cpu.2.cx_usage: 0.00% 0.00% 0.00% last 4061us
dev.cpu.3.cx_usage: 0.00% 0.00% 0.00% last 11574us
dev.cpu.4.cx_usage: 0.00% 0.00% 0.00% last 4249us
dev.cpu.5.cx_usage: 0.00% 0.00% 0.00% last 12940us
dev.cpu.6.cx_usage: 0.00% 0.00% 0.00% last 12922us
dev.cpu.7.cx_usage: 0.00% 0.00% 0.00% last 12903us

sysctl hw.acpi.thermal
Code:
hw.acpi.thermal.tz1._TSP: 10
hw.acpi.thermal.tz1._TC2: 5
hw.acpi.thermal.tz1._TC1: 1
hw.acpi.thermal.tz1._ACx: -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
hw.acpi.thermal.tz1._CRT: 106.1C
hw.acpi.thermal.tz1._HOT: -1
hw.acpi.thermal.tz1._PSV: 106.1C
hw.acpi.thermal.tz1.thermal_flags: 0
hw.acpi.thermal.tz1.passive_cooling: 1
hw.acpi.thermal.tz1.active: -1
hw.acpi.thermal.tz1.temperature: 29.9C
hw.acpi.thermal.tz0._TSP: -1
hw.acpi.thermal.tz0._TC2: -1
hw.acpi.thermal.tz0._TC1: -1
hw.acpi.thermal.tz0._ACx: 85.1C 55.1C 0.1C 0.1C 0.1C -1 -1 -1 -1 -1
hw.acpi.thermal.tz0._CRT: 106.1C
hw.acpi.thermal.tz0._HOT: -1
hw.acpi.thermal.tz0._PSV: -1
hw.acpi.thermal.tz0.thermal_flags: 0
hw.acpi.thermal.tz0.passive_cooling: 0
hw.acpi.thermal.tz0.active: 2
hw.acpi.thermal.tz0.temperature: 27.9C
hw.acpi.thermal.user_override: 0
hw.acpi.thermal.polling_rate: 10
hw.acpi.thermal.min_runtime: 0
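
To put those numbers in perspective, a small sketch that works out how far each thermal zone currently is from its critical trip point (_CRT). The sample text is copied from the output above; the helper names are hypothetical.

```python
import re

def parse_thermal(text):
    """Map zone -> {field: degrees C} for fields holding a single 'NN.NC' value."""
    pattern = re.compile(r"^hw\.acpi\.thermal\.(tz\d+)\.(\w+): (-?\d+(?:\.\d+)?)C$")
    zones = {}
    for line in text.splitlines():
        m = pattern.match(line.strip())
        if m:
            zone, field, value = m.groups()
            zones.setdefault(zone, {})[field] = float(value)
    return zones

def headroom(zones):
    """Degrees C between the current temperature and the critical trip point."""
    return {zone: round(f["_CRT"] - f["temperature"], 1)
            for zone, f in zones.items()
            if "_CRT" in f and "temperature" in f}

sample = """\
hw.acpi.thermal.tz1._CRT: 106.1C
hw.acpi.thermal.tz1.temperature: 29.9C
hw.acpi.thermal.tz0._CRT: 106.1C
hw.acpi.thermal.tz0.temperature: 27.9C
"""
print(headroom(parse_thermal(sample)))  # {'tz1': 76.2, 'tz0': 78.2}
```

At idle both zones are more than 70 degrees below the critical point, so this snapshot alone says nothing about what the temperature was just before a power-off.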
 
If you have an Intel CPU, load coretemp(4) and keep an eye on the CPU's temperature too. Maybe the fan died, or got clogged up with dust bunnies, and the CPU is overheating sometimes. That will definitely shut down the machine without any warning.
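
Since the machine dies without warning, it may help to log the temperature to disk continuously so the last reading before a power-off survives. A minimal sketch, assuming coretemp(4) is loaded and exposes dev.cpu.N.temperature; the log path and function names are only examples.

```python
import os
import subprocess
import time

def parse_temp(text):
    """Turn a sysctl value such as '45.0C' into a float."""
    return float(text.strip().rstrip("C"))

def read_cpu_temp(cpu=0):
    """Ask sysctl for one core's temperature (needs coretemp(4) loaded)."""
    out = subprocess.run(
        ["sysctl", "-n", "dev.cpu.%d.temperature" % cpu],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_temp(out)

def log_temps(path, samples, interval, reader=read_cpu_temp):
    """Append timestamped readings, forcing each line to disk immediately."""
    with open(path, "a") as log:
        for _ in range(samples):
            stamp = time.strftime("%Y-%m-%dT%H:%M:%S")
            log.write("%s %.1f\n" % (stamp, reader()))
            log.flush()
            os.fsync(log.fileno())  # survive a sudden power cut
            time.sleep(interval)
```

Something like `log_temps("/var/log/cputemp.log", samples=720, interval=5)` would cover an hour. The fsync after every line is deliberate: without it the final readings could still be sitting in a buffer when the power goes.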
 
One other thing to note is that the drive the OS is on does get fairly hot (not scalding, but probably around 110F). There isn't a fan blowing air over it, and it doesn't get much airflow because there is another drive adjacent to it. I'm not sure whether that could cause this, but that setup hasn't changed recently, other than that I swapped physical drives when I switched over (so I can easily revert to the old one).
 
[…] I'm trying to debug the events, but I'm not seeing anything standing out in /var/log/messages. […]
Like VladiBG, I would first suspect a power failure. Examine /var/log/utx.log with the last(1) utility for that. If it shows “crash”, there was no proper shutdown.
[…] CPU is overheating sometimes. That will definitely shut down the machine without any warning.
Is there a shutdown(8) mechanism installed? I would expect the power management of a CPU – after throttling did apparently not help – to simply freeze the CPU (“hcf”), but leave other hardware unaffected (so e. g. a hard drive can finish its write operations).
[…] the drive the OS is on does get fairly hot (not scalding, but probably around 110F) […]
If it is a spinning disk, this is a normal operational temperature.
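
For comparison with a drive datasheet, which usually gives the operating range in Celsius, the conversion of the estimated 110F is:

```python
def fahrenheit_to_celsius(f):
    """Standard conversion: subtract 32, then scale by 5/9."""
    return (f - 32) * 5 / 9

print(round(fahrenheit_to_celsius(110), 1))  # 43.3
```

So the estimate is roughly 43 C; check that against the range listed for the specific drive model.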
 
I should add that this machine is on a UPS and generally has a runtime of 60 minutes. At the time it turned off, I didn't hear the UPS make any sounds like it was correcting the voltage or the input power had failed (resulting in the unit beeping).

last only shows that it booted up last night, no shutdown.
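
A rough way to check this over a longer history is to count boots that have no shutdown record before them. The sketch below assumes last(1) prints newest entries first and labels the records "boot time" and "shutdown time"; the sample lines are made up for illustration and are not from my machine.

```python
def unclean_boots(last_lines):
    """Count 'boot time' records whose previous record is another boot.

    Output is newest first, so the record for the preceding power-down
    is the next boot/shutdown line *after* a given boot line.
    The oldest boot is skipped because its history is unknown.
    """
    events = [line.split(" time")[0].strip() for line in last_lines
              if line.startswith(("boot time", "shutdown time"))]
    count = 0
    for i, event in enumerate(events[:-1]):
        if event == "boot" and events[i + 1] == "boot":
            count += 1
    return count

# Hypothetical sample, newest first.
sample = [
    "boot time                                  Tue Jan  9 21:40",
    "boot time                                  Sat Jan  6 08:15",
    "shutdown time                              Fri Jan  5 23:02",
    "boot time                                  Mon Jan  1 10:00",
]
print(unclean_boots(sample))  # 1
```

Each counted boot means the machine went down without writing a shutdown record, which is what a hard power-off would look like.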
 
Just powered off again :(. I have the cover off and am checking for dust... Everything is spinning, there is some dust, but nothing is clogged up.
 
Is there a shutdown(8) mechanism installed? I would expect the power management of a CPU – after throttling did apparently not help – to simply freeze the CPU (“hcf”), but leave other hardware unaffected (so e. g. a hard drive can finish its write operations).
Nope. Nothing needs to be installed, or configured. It's not controlled from the OS either. It's the hardware itself that will just kill the power immediately. But yeah, good luck if you were writing something at that time.
 
I am still investigating - I was out of town and had it happen again. In the meantime, I am building a hotspare so I can just use that without needing to transfer the drive to other hardware.
 
It powered down again on its own, nothing in the logs to speak of. I think I have the latest Intel firmware applied so my Intel management firmware should be patched.
 
Ah, good point, I think there should be something, let me have a quick peek.

I just checked, but there wasn't anything matching today and the date/time is accurate. I occasionally got a keyboard not connected error, there was one interesting one a while back about an unknown exception, but that was in 2022. Unfortunately, that had little information and I don't believe I can action that for that reason and the fact that it is over a year old.
 
Could be bad caps on your motherboard or PSU. Could also be exposed wiring. Could be an issue with your power source as well. Could be all kinds of things. Have you added any new components that could overdraw the max output of your PSU?
 
No, in fact I went from having a dedicated video card to using the on-board video 6+ months ago to simplify my setup. The box is old, so I will move to a 'newer' box, same age, but far fewer operational hours.
 
  • No, if last(1) does not reveal a “crash” we can rule out a power failure. I see that reboot(2) “is invoked automatically in the event of unrecoverable system failures.” I’m not sure whether such an automatic reboot is logged as an “orderly” shutdown.
  • Staying with hardware failures, check your RAM with sysutils/memtest86+ (modern BIOSes are sometimes also shipped with some test cycles [incl. memory test] in case you don’t operate an x86 machine).
  • Also, provided you are using a generic kernel, I would simply migrate components (hard drive, graphics card, NIC) to a different machine (a different compatible motherboard) just to isolate potential causes of errors. If the error persists you can tell it has something to do with the components that remained the same.
 
It was never a reboot, just a hard poweroff like the power went out, but my monitors still had power and they're on the same UPS.

Yes, I was thinking about running a memory test to confirm that it is or isn't the memory. I will do that after I migrate the components.

Yes, I will simply migrate components (I'll put other hard drives with a duplicate configuration into the other 2 boxes to replace this one). I'm taking a bit longer to migrate because I am tweaking my utility to automate my installation and configuration of FreeBSD.
 
Yes, I was thinking about running a memory test to confirm that it is or isn't the memory.
If memory is corrupt you just get weird, random crashes. Can't hurt to test though.

Could be bad caps on your motherboard or PSU.
Yeah, I'm leaning towards this as well. Some kind of short-circuit, the PSU then cuts out to protect the system from further damage. I'd replace the PSU, it's the easiest to swap out. Especially if this is an older system, the PSU is usually the first thing to break or go bad.
 
Yeah, I'm leaning towards this as well. Some kind of short-circuit, the PSU then cuts out to protect the system from further damage. I'd replace the PSU, it's the easiest to swap out. Especially if this is an older system, the PSU is usually the first thing to break or go bad.
I have a power supply from 2011, and the manufacturer of the wiring sleeves used some kind of rubber that dried out very quickly and began to crack and flake off. It would short and cause the power to cut off. I still use it, but I had to replace the wire harnesses on it. But this PSU is a proprietary HP product. Could be a possibility here too.
 
I have a power supply from 2011, and the manufacturer of the wiring sleeves used some kind of rubber that dried out very quickly and began to crack and flake off.
The older electronics get, the greater the risk of breakage. I have lots of retro tech that often doesn't work any more (but can be fixed); tantalum capacitors often blow up or simply short out when powered on for the first time in decades. Then there's also a long period when every manufacturer used flaky capacitors. Those often start leaking corrosive fluids that slowly eat away the copper traces of a PCB. Cold solder joints can also be quite a tricky problem to track down and cause lots of intermittent failures.
 