sudden "calcru" problem and various CPU oddities

Hello, (forgive my bad english, it's not my main language)

Yesterday, my server behaved very strangely. It's a 4-core Xeon running:
Code:
FreeBSD 8.1-RELEASE #0: Mon Jul 19 02:36:49 UTC 2010  root@mason.cse.buffalo.edu:/usr/obj/usr/src/sys/GENERIC  amd64

dmesg.boot is too long for the post, so it's available here: dmesg

Around 19:30, my munin monitoring started to display a less than normal CPU utilization (less than 400%, idle included). ACPI uses factory settings (never touched).
I don't see any good reason for my cpu utilization probe to report less than 400%. See graph:
cpu-day.png


At 21:08:23, I've launched the steam client, a linux binary, to update my source dedicated server (a game server). The process downloads available updates after scanning few GB of file hosted on a ZFS formated disk (compression enabled). The disk is a SATA HDD, plugged on a 3Ware RAID card. The process ended around 21:10:30

The CPU core temperature started to rise at this moment, increasing by few degrees C. It never get back to its normal value. The hosting provider tells me his monitoring for the rack shows no temperature increase.

cputemp-day.png


At 21:09:09, the system logs showed a bunch of alert:
Code:
21:09:09 rack kernel: calcru: runtime went backwards from 122407344 usec 
to 108865844 usec for pid 2 (g_event)
21:09:09 rack kernel: calcru: runtime went backwards from 3293977567 
usec to 2931366351 usec for pid 12 (intr)
...

Around 21:12, I've launched the updated game server (linux binary). The game was almost unplayable. Game clients (player side) reported a ping around 60, in general we have between 15 and 30. But 60 should be playable. Despite a not-so-bad ping, the game was constantly freezing on client side. The freeze was short, but occurred every 3-5 seconds, and eventually every seconds.

Uptime is 53 days. The game server, and the whole server in general behaved flawlessly until yesterday. The game server is launched only few times a week, for few hours.

Any idea is greatly appreciated.

patpro
 
SirDice, thank you for the reply, but:
- I've already read the faq, but I don't have any physical access to the server, so I can't check the EIST status in BIOS;
- lines like this one in the dmesg est: CPU supports Enhanced Speedstep, but is not recognized. suggest that EIST is disabled (disabled is also the default setting according to my motherboard documentation);
- the server is up for 53 days, I would expect this kind of alert to come up way earlier, if it's caused by a BIOS setting;
- it doesn't explain the CPU temperature rise, and CPU utilization odd values appearing without any setup changes;

I'm currently running a "cpu burn" task (md5 on one core), to raise temp, raise cpu utilization, and see how system stats are changing.
And I see in my /var/log/messages[\FILE] a fresh entry:

Code:
Jan  5 15:02:41 rack named[806]: *** POKED TIMER ***[\CODE]

As expected, the total CPU utilization falls as the load (and temperature) rise:
[img]http://patpro.net/~patpro/cpu-day-1.png[/img]

I've done many benchmarks and CPU burn tests earlier, and none of them revealed a problem like this one.

We agree that a cpu temp/voltage control procedure is triggered, but I think it's not EIST. I'll ask my hosting provider to check my BIOS settings if possible. I'm also afraid the whole problem is triggered by a dead fan.
 
The time issues could also be caused by a wacky NTP server.
 
SirDice said:
The time issues could also be caused by a wacky NTP server.

That would be a nasty coincidence... I think it's more likely that hardware CPU throttling induce timer problem.

It looks like he problem is getting worse:
- 2 calcru alerts tonight (around 2:05 and around 5:15)
- temperature of HDDs is (very) slowly increasing
 
patpro said:
- lines like this one in the dmesg est: CPU supports Enhanced Speedstep, but is not recognized. suggest that EIST is disabled (disabled is also the default setting according to my motherboard documentation);
I've seen the above boot message, and experienced the time problems on a system where speedstep was enabled, but set to give only the BIOS control over it. You really need to eliminate the BIOS as a potential cause...
 
aragon: thanks for your reply. I'll ask to my hosting provider to know if he can take a look at this. But it does not really explain the temperature rise. I think I'll also buy a couple a spare fans and have them delivered to him.
 
Ok, I'm back.
My hosting provider checked the hardware, and the BIOS. He found nothing unusual, fans are running fine and EIST is disabled.

I made a new quick test: I've launched the command [CMD=""]while true; do /usr/bin/time sleep 10 ; done[/CMD] and the md5 command on a big file (few GB). Here are the results:

Code:
# while true; do /usr/bin/time sleep 10 ; done
       10.00 real         0.00 user         0.00 sys
       21.88 real         0.00 user         0.00 sys
       18.39 real         0.00 user         0.00 sys
       20.48 real         0.00 user         0.00 sys
       16.99 real         0.00 user         0.00 sys
       20.58 real         0.00 user         0.00 sys
       19.08 real         0.00 user         0.00 sys
       19.15 real         0.00 user         0.00 sys
       22.28 real         0.00 user         0.00 sys
       12.09 real         0.00 user         0.00 sys
       10.00 real         0.00 user         0.00 sys
       10.00 real         0.00 user         0.00 sys

before the md5 launch, `sleep 10` runs in 10 seconds, but when the md5 is running, `sleep 10` takes up to 22 seconds. At the end of the md5 computation, the system date/time was lagging about ~5 seconds.

Any other idea is welcome.
 
Hi all.

It might help someone, so I'm posting a conclusion about all this story: the CPU throttling (and subsequent time issues) was caused by a defective power supply. I've replaced the faulty PSU, and now the server is back to normal. No more CPU throttling under load, no more time shifting.
 
Back
Top