9.2 - Timer Issues

grabes · Oct 28, 2013

I have rolled back systems to 9.1 after a lot of trial and error between 9.1 and 9.2. I get massive system time skews on a semi-loaded system. The systems run FreeSwitch at realtime priority (which I think is the issue). Other than that, I am not sure how to get more debugging information. Any ideas on how to report this one better to the developers?

Code:

Oct 28 12:24:55 sbc11 ntpd[1730]: time reset +193.766034 s
Oct 28 12:45:48 sbc11 ntpd[1730]: time reset +6.150141 s
Oct 28 13:09:22 sbc11 ntpd[1730]: time reset +166.084942 s
Oct 28 13:25:30 sbc11 ntpd[1730]: time reset +3.075496 s
Oct 28 13:42:20 sbc11 ntpd[1730]: time reset +15.378863 s
Oct 28 14:43:21 sbc11 ntpd[1730]: time reset +39.983595 s
Oct 28 15:08:16 sbc11 ntpd[1730]: time reset +21.526845 s
Oct 28 15:27:43 sbc11 ntpd[1730]: time reset +33.832378 s
Oct 28 15:53:22 sbc11 ntpd[1730]: time reset +6.151294 s
Oct 28 16:13:57 sbc11 ntpd[1730]: time reset +39.982847 s
Oct 28 16:35:24 sbc11 ntpd[1730]: time reset +18.454572 s
Oct 28 17:20:06 sbc11 ntpd[1730]: time reset +15.377594 s
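
For a bug report it may help to total the corrections instead of pasting raw lines. A minimal sketch, assuming the offset is always the second-to-last field as in the log above; the function name `sum_time_resets` and the log path in the usage comment are illustrative assumptions, not anything ntpd provides.

```shell
# Summarise ntpd "time reset" steps from syslog lines read on stdin.
# Assumes the offset is the second-to-last field, as in the log above.
sum_time_resets() {
    awk '/ntpd\[[0-9]+\]: time reset/ {
             total += $(NF - 1); n++
         }
         END { printf "%d resets, %.3f s total\n", n, total }'
}

# Usage (the log path is an assumption; adjust to your syslog setup):
# grep ntpd /var/log/messages | sum_time_resets
```

Fed the twelve lines above, this should report 12 resets adding up to roughly 559.8 s, all positive, in just under five hours.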

Hardware: Dell R610 (2x Xeon 5560, 32 GB RAM)

Uniballer · Oct 29, 2013

What is in your ntp.conf?

grabes · Oct 29, 2013

Apparently, this is not related to the realtime priority. I ran some tests with normal priorities, and was able to cause time to skew as well.

ntp.conf:

Code:

server 0.freebsd.pool.ntp.org iburst
server 1.freebsd.pool.ntp.org iburst
server 2.freebsd.pool.ntp.org iburst

aupanner · Oct 29, 2013

One thing to check is ntpq: run it and issue the "peers" command. This will show you which NTP servers you are connecting to, and what their individual jitter looks like. This link could help you make sense of the display (http://log.or.cz/?p=80).

It might also be useful to see the contents of /var/db/ntpd.drift (Mine is -40.695)
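
A small sketch of how those two checks could be scripted. It assumes the usual `ntpq -p` layout, where the currently selected peer is flagged with a leading `*` and offset and jitter are the last two columns (in milliseconds); `selected_peer` is a made-up helper name.

```shell
# Print the peer ntpd has currently selected (the line marked "*" in
# "ntpq -p" output) together with its offset and jitter.
# Reads "ntpq -p" output on stdin.
selected_peer() {
    awk '/^\*/ { print substr($1, 2), "offset=" $(NF - 1), "jitter=" $NF }'
}

# Usage on the affected host:
# ntpq -p | selected_peer
# cat /var/db/ntpd.drift    # frequency correction in PPM
```

If nothing is printed, ntpd has not selected any peer, which is worth mentioning in a report.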

The other question is which timer source your box is using. You can see this in the dmesg output right after the memory lines. This is mine, for example.

Code:

real memory  = 4294967296 (4096 MB)
avail memory = 4080443392 (3891 MB)
Event timer "LAPIC" quality 400

grabes · Oct 30, 2013

I don't think it's an NTP(D) issue. It seems that NTP(D) is merely trying to correct the skew.

Code:

Event timer "LAPIC" quality 400
Event timer "RTC" frequency 32768 Hz quality 0
attimer0: <AT timer> port 0x40-0x5f irq 0 on acpi0
Event timer "i8254" frequency 1193182 Hz quality 100
hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Event timer "HPET" frequency 14318180 Hz quality 450
Event timer "HPET1" frequency 14318180 Hz quality 440
Event timer "HPET2" frequency 14318180 Hz quality 440
Event timer "HPET3" frequency 14318180 Hz quality 440
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x808-0x80b on acpi0
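
To compare lists like this across machines, the "Event timer" lines can be ranked by their quality value. This is only a sketch: the helper name `rank_event_timers` is invented, and the top entry is a guess at what the kernel prefers, not a reading of what it actually selected.

```shell
# Rank the "Event timer" lines from dmesg by their quality value,
# highest first. Reads dmesg output on stdin.
rank_event_timers() {
    awk -F'"' '/Event timer "/ {
                   n = split($3, f, " ")   # quality is the last word
                   print $2, f[n]
               }' | sort -k2,2nr
}

# Usage:
# dmesg | rank_event_timers
```

For the lines above, the first row would be HPET at 450, followed by the three HPET1-3 entries at 440 and LAPIC at 400.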

kpa · Oct 30, 2013

The dmesg(8) output will not show directly which hardware timer is selected although you can deduce it from the quality values. This sysctl(8) will show what is currently in use:

sysctl kern.timecounter.hardware

The available choices are in kern.timecounter.choice; you can set kern.timecounter.hardware on the fly and see if a change makes a difference.
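
Wrapped as two small shell functions, that could look like the sketch below. The function names are made up for illustration; the sysctl names are the ones given above, and changing the value needs root.

```shell
# Show the timecounter in use and the alternatives the kernel offers.
show_timecounter() {
    sysctl kern.timecounter.hardware kern.timecounter.choice
}

# Switch timecounter at runtime (needs root). $1 must be one of the
# names listed in kern.timecounter.choice, e.g. HPET.
set_timecounter() {
    sysctl kern.timecounter.hardware="$1"
}

# Usage:
# show_timecounter
# set_timecounter HPET
```

A change made this way lasts only until the next reboot, which makes it easy to back out of an experiment.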

grabes · Nov 2, 2013

Thanks, changing the timer to HPET seems to have corrected all the craziness I was experiencing.

I just want to add that there is still something very "off" about 9.2: I have odd CPU spikes and periods of being semi-unresponsive. I have 20 machines running 9.1 and 9.2. It's happening only on the 9.2 machines, and they all have exactly the same workloads.

When the machines are busy, they behave perfectly. It is when they are lightly loaded that I have issues.

This is still broken for me: I lost 12 hours on 2 servers in the last 2 days. 4 others stayed in check. I actually thought one of them was locked solid, as it was unresponsive, and then it came back to life at some point.

aupanner · Nov 10, 2013

You've got everything you need to figure out what's going on: an "A" system that doesn't have the issue and a "B" system that does. Apparently your boxes have lots of clock options; based on quality, it seems that they would be using HPET (450) by default.

I'd compare sysctl kern.eventtimer.choice and sysctl kern.eventtimer.timer on the two systems (not sure how those differ from kern.timecounter.choice and kern.timecounter.hardware). Also I'd compare the ntpd.drift file values on your various boxes.

Maybe you need to use LAPIC clock because your HPET device goes to sleep sometimes or is too busy doing something else?

No way to know until you tabulate data from your systems for us to look at.
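
One way to gather that comparison, as a rough sketch: dump the four sysctls named in this thread to a per-host file on each machine and diff the results. The function name, the output file naming, and the optional directory argument are all assumptions made for the example.

```shell
# Write the timer-related sysctls to a per-host file so an "A" (9.1)
# and a "B" (9.2) machine can be compared with diff.
# $1 = output directory (hypothetical; defaults to the current one).
dump_timer_sysctls() {
    out="${1:-.}/timers.$(hostname).txt"
    sysctl kern.timecounter.hardware kern.timecounter.choice \
           kern.eventtimer.timer kern.eventtimer.choice > "$out"
    echo "$out"
}

# Usage: run on each box, copy the files to one place, then
# diff timers.hostA.txt timers.hostB.txt
```

Adding the contents of /var/db/ntpd.drift from each box to the same file would cover the drift comparison as well.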

grabes · Nov 12, 2013

Thanks, so far I am seeing stable load averages by setting

Code:

kern.eventtimer.timer=LAPIC

The difference is almost night and day on lightly loaded systems. I will report more results in 48 hours.

KernelPanic · Nov 13, 2013

This seems very similar to the issue I experienced with a pair of Intel SR1500 servers: http://forums.freebsd.org/showthread.php?t=42918

If switching to LAPIC doesn't help, you should try removing one of the two CPUs to see if the problem goes away.

It's not really a fix, of course. It would just be more of a double confirmation that FreeBSD 9.2 may have issues with systems that have two (or more?) Xeon processors.

grabes · Nov 14, 2013

I feel that changing kern.eventtimer.choice to HPET may have helped a little bit (I have no evidence to support this). Changing

Code:

kern.eventtimer.timer=LAPIC

was much more relevant. On an unloaded server handling a few requests (heavy threading) I would see load averages around 60 with 95% idle CPU. As soon as I changed the event timer, things made sense: load averages are around 1, and my timekeeping problems seem to be solved.
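
For anyone wanting to keep the setting across reboots, a minimal sketch follows. It assumes the value is persisted through /etc/sysctl.conf; `persist_sysctl` is a hypothetical helper, not a FreeBSD tool, and it refuses to touch a file that already has a line for the same name.

```shell
# Append "name=value" to a sysctl.conf-style file unless a line for
# that name is already there. $3 defaults to /etc/sysctl.conf.
persist_sysctl() {
    name=$1 value=$2 file=${3:-/etc/sysctl.conf}
    if grep -q "^${name}=" "$file" 2>/dev/null; then
        echo "$name already set in $file; edit it by hand" >&2
        return 1
    fi
    printf '%s=%s\n' "$name" "$value" >> "$file"
}

# Usage (as root):
# sysctl kern.eventtimer.timer=LAPIC          # apply now
# persist_sysctl kern.eventtimer.timer LAPIC  # keep across reboots
```

Applying the change at runtime first and persisting it only after a few days of stable clocks keeps a bad choice from surviving a reboot.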