Crashing Server - Temp/Voltage Monitoring

I have a 7.1-RELEASE-p3 server running in a remote data center. Starting about 3 days ago, the server has been rebooting randomly (about once per day). Nothing about the problem shows up in /var/log/messages (only the normal boot-up messages). The `last` command does show that the server did in fact crash.

I'm trying to figure out what is causing this. I suspect a hardware problem, and I'd like to install a tool that lets me monitor the temperature and voltage of the processors and system. I tried healthd, but after playing with it for a short time I realized that it wasn't detecting anything (it reported 0 temp, 0 volts, etc).

I was wondering if anyone can help me figure out how I can monitor the hardware of the server. Here's some info on the hardware:

Intel Xeon CPU 2.40GHz
PCI Devices:
  • ATI Technologies Inc - Rage XL PCI
  • Intel Corporation - 82540EM Gigabit Ethernet Controller
  • Intel Corporation - 82801 Family (ICH2/3/4/4/5/5/6/7/8/9,63xxESB) Hub Interface to PCI Bridge
  • Intel Corporation - 82801CA (ICH3) UltraATA/100 EIDE Controller
  • Intel Corporation - 82801CA/CAM (ICH3-S/ICH3-M) LPC Interface
  • Intel Corporation - 82801CA/CAM (ICH3-S/ICH3-M) SMBus Controller
  • (2x) Intel Corporation - 82801CA/CAM (ICH3-S/ICH3-M) USB Controller
  • Intel Corporation - E7500 System Controller (MCH, Hub Interface A) Error Reporter
  • Intel Corporation - E7501 Host Controller
IDE Devices
  • ad0: WDC WD1600AAJB-00WRA0 58.01H58

I was reading up on various software I can use to do the monitoring, such as lmmon, mbmon, healthd, and ipmitool. From what I can tell, I will be required to recompile my kernel after adding a few options to it in order to have support for /dev/smb or something.

Also, will I need to enable ACPI? Right now it is disabled.

Thanks.
 
You will most likely need ACPI. I'm not sure what else, but /dev/smb isn't always needed (it isn't on my board, and temp monitoring works there with mbmon).
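
If the tool you end up with does want /dev/smb, it may be possible to avoid recompiling the kernel by loading the SMBus drivers as modules. This is only a sketch: it assumes the ichsmb(4) driver handles the ICH3 SMBus controller from the PCI list above and that these modules were built alongside your kernel in /boot/kernel.

```shell
# /boot/loader.conf fragment (assumption: smbus.ko, ichsmb.ko and smb.ko exist)
smbus_load="YES"    # generic SMBus layer
ichsmb_load="YES"   # Intel ICH SMBus controller driver
smb_load="YES"      # provides the /dev/smb* device nodes
```

To try it without rebooting, the same modules can be loaded by hand with kldload (for example `kldload ichsmb` and then `kldload smb`), after which `ls /dev/smb*` should show whether a device node appeared.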

Also consider installing smartmontools if you have S.M.A.R.T. enabled drives.
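
A rough sketch of a health check with smartctl from smartmontools. The device name /dev/ad0 is taken from the ad0 entry in the hardware list above, and the helper name smart_health_ok is made up for this example.

```shell
#!/bin/sh
# Sketch only: DISK defaults to the ad0 drive from the hardware list above.
DISK=${DISK:-/dev/ad0}

# Hypothetical helper: reads `smartctl -H` output on stdin and succeeds
# only if the drive reports an overall health result of PASSED.
smart_health_ok() {
    grep -q 'overall-health self-assessment test result: PASSED'
}

# Only run the check where smartctl is actually installed.
if command -v smartctl >/dev/null 2>&1; then
    if smartctl -H "$DISK" | smart_health_ok; then
        echo "$DISK: SMART health PASSED"
    else
        echo "$DISK: SMART health not PASSED, see: smartctl -a $DISK"
    fi
fi
```

`smartctl -a /dev/ad0` prints the full attribute table and error log, which is worth a look even when the overall result is PASSED.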
 
If you can't get the thermal sensors working, you could load the cpufreq module:

kldload /boot/kernel/cpufreq.ko

After that you can use the sysctl variable "dev.cpu.0.freq" to set a lower frequency (sysctl -a | grep freq lists the related variables).

I know this is only a poor workaround, but if you don't have physical access to the server it may help.

Btw, don't run powerd for that test, since it adjusts the frequency itself and would override your setting.
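
Roughly, picking the lower frequency could look like the sketch below. It assumes the cpufreq module is loaded and exposes dev.cpu.0.freq_levels as a list of MHz/milliwatt pairs. The helper name lowest_freq is made up for this example, and the script only prints the command instead of changing anything.

```shell
#!/bin/sh
# Hypothetical helper: takes a freq_levels string such as
# "2400/75000 2100/65000 1200/40000" and prints the lowest MHz value.
lowest_freq() {
    echo "$1" | tr ' ' '\n' | cut -d/ -f1 | sort -n | head -n 1
}

# Only do anything if the cpufreq sysctl is actually present.
if sysctl -n dev.cpu.0.freq_levels >/dev/null 2>&1; then
    levels=$(sysctl -n dev.cpu.0.freq_levels)
    echo "available levels: $levels"
    # Print the command rather than running it; setting it needs root.
    echo "to slow the CPU down: sysctl dev.cpu.0.freq=$(lowest_freq "$levels")"
fi
```

If the crashes stop at the lower frequency, that would point towards heat or power rather than software.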
 