NTP huge offset mystery

chavez243ca · Feb 18, 2020

I have one host, 12.1-p2, part of an Elasticsearch (Graylog) cluster, that keeps drifting far out of sync. It runs BSD on bare metal, like the other members of the cluster, and it is the only one of all my BSDs, virtual or otherwise, that exhibits this behaviour. As of right now it is 38 seconds out. When this started the system was a PE 2950, but it has since been replaced with an R710, which somewhat rules out an RTC issue. Unfortunately, I don't see much useful detail in ntp.log.

ntp.conf is relatively vanilla, and the same as other hosts on the network that are keeping time just fine. I did re-point the servers to the freebsd pool instead of our internal NTP server as a test, but saw no improvement.

The clock will correct eventually, but not very quickly and it drops out of the Graylog cluster when the offset gets too severe.

Looking for any troubleshooting / debugging tips, or thoughts on what might be going on here.

acheron · Feb 18, 2020

Have you tried to remove /var/db/ntp/ntpd.drift?

chavez243ca · Feb 18, 2020

In my case it's: /var/db/ntpd.drift

root 58976 0.0 0.0 19348 19428 - Ss 7Feb20 0:32.76 /usr/sbin/ntpd -dd -4 -p /var/db/ntp/ntpd.pid -c /etc/ntp.conf -f /var/db/ntpd.drift -g

I'm fairly certain one of my steps has been to delete that and restart ntpd.

PMc · Feb 19, 2020

Had that problem once, offset was precisely 27 seconds. Reason was I had compiled base with wrong option for the leap seconds (see src.conf(5): LEAPSECONDS).
UTC would show the correct time, but localtime was wrong.

Otherwise, the thing offers lots of information output, beginning with ntpq -p, where it shows what server it locks on and what quality it obtains. If it diverges into the tens of seconds, that should be visible there, no matter what the driftfile might say. ntptime tells a bit more of the details, but I'm not sure what all that means...

`Orum · Feb 19, 2020

I've seen weird things happen with time when the MB's battery dies, or on extremely spotty (high jitter/loss) connections, but other than that I haven't had too many problems.

However, I would consider not using the ntpd in base, or its equivalent from ports. Several headaches from them over the years, for something that should be very simple, have pushed me to use other options (e.g. chrony) that are easier to configure, use, diagnose, and troubleshoot.

Phishfry · Feb 19, 2020

My money is on a hardware issue.
Either Motherboard RTC issues or Interrupts.

Incorrect time on Server | DELL Technologies (www.dell.com)
"I've got a power edge 710 that continually shows the wrong time. Server 2003r2 enterprise. I've tried the following: Re registering windows time service. Using domhier settings in reg..."

549480 – On a Dell R710, a read() operation on /dev/rtc blocks if the frequency is 8192. (bugzilla.redhat.com)

Phishfry · Feb 19, 2020

Some more reading for you on the Dell R710

time issues and some more

Nicola Mingotti · Feb 20, 2020

I agree with `Orum. When I need the time to be under scrutiny I don't use ntpd any more; instead I do something like this:
ntpdate XXX.ntp.org
If necessary, under cron.

ralphbsz · Feb 20, 2020

These lines in /etc/rc.conf worked perfectly for me to fix this problem; the hardware clock on my motherboard is VERY sloppy, and if the computer is down for many hours, it starts with the time so wrong that ntp can't fix it. In that case, a hard setting of the clock (not drift to it, but set it) really helps:

Code:

# When NTPD starts, sync the clock hard, in case it has become wrong:
ntpd_enable="YES"
ntpd_sync_on_start="YES"

PMc · Feb 20, 2020

Nicola Mingotti said:
I agree with `Orum. When I need the time to be under scrutiny I don't use ntpd any more; instead I do something like this:
ntpdate XXX.ntp.org
If necessary, under cron.

Ups? AFAIK this should run automatically at boot, not during regular operation. Afterwards the time should be correct to within some fraction of a second, and then ntpd can start and eventually bring the precision to some fraction of a millisecond. The downside is that ntpdate takes about a minute to complete, which delays the boot, so the more modern approach is to use ntpd_sync_on_start instead, as ralphbsz showed. But using ntpdate at boot has the advantage that the logfiles already carry correct timestamps when applications start.

chavez243ca · Feb 20, 2020

This host and pretty much all the BSDs in this environment use the base ntpd, with only one having an issue. ntpd_sync_on_start is also a standard line in rc.conf on all these systems. Most of our infrastructure is virtual, so I'll have to do a quick check to see just how many R710 boxen are in the herd.

[UPDATE]
At least 6 R710s deployed, with at least one other running the same BIOS version.

Presently system has kept good time for 15 hours now; still marked as flapping in Nagios so will give it some more time to prove itself.

Nicola Mingotti · Feb 20, 2020

PMc said:
Ups? AFAIK this should run automatically at boot, not during regular operation. Afterwards the time should be correct to within some fraction of a second, and then ntpd can start and eventually bring the precision to some fraction of a millisecond. The downside is that ntpdate takes about a minute to complete, which delays the boot, so the more modern approach is to use ntpd_sync_on_start instead, as ralphbsz showed. But using ntpdate at boot has the advantage that the logfiles already carry correct timestamps when applications start.

i am in the only starbucks without wifi... arrrgggg . Phone typing. telegraphic. sorry.

ntpd has given several issues to me in the past. especially at boot. probably i did not put enough effort to study it well. on the other side, it could be probably made simpler.

ntpdate is easy, just a command, you run when you wish and it is out of your way. the manpage says it will be retired, well that would be a mistake.

my use case is in the BeagleBoneBlack mostly, since they do not have battery for the clock.

gpw928 · Feb 24, 2020

I would keep using multiple external time servers until the problem is sorted.

Is the time correct just after boot? Repeated use of ntpq -pn should settle fairly quickly with the "reach" at 377 for each time server. A working ntp client will show offsets in milliseconds. These are normal:

Code:

[ritz.134] # ntpq -pn 
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+203.26.24.6     216.218.254.202  2 u  815 1024  377   68.044   16.950 173.634
+203.12.160.2    133.243.238.163  2 u  893 1024  377   77.520   -9.385 194.406
+203.0.178.191   180.150.12.46    2 u  744 1024  377  274.477   54.269 153.390
*150.203.1.10    203.35.83.242    2 u  832 1024  377   57.518  -19.504 215.309
+150.203.22.28   13.55.50.68      4 u  817 1024  377   59.597  -14.446 217.340
+202.22.158.31   131.203.16.6     2 u  778 1024  377  281.932   16.512 186.021
+13.55.50.68     203.206.205.83   3 u 1009 1024  377   79.522   -8.024 157.460
+103.214.220.220 221.95.54.210    2 u  659 1024  377  522.974  169.933  51.774
+220.158.215.20  131.203.16.6     2 u 1009 1024  377  129.220   -9.129 235.570
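If you want to watch these numbers from a script rather than by eye, here is a minimal Python sketch, assuming the column layout of the `ntpq -pn` output above. The names `parse_peers` and `worst_offset_ms` are made up for illustration, and the sample is just three rows copied from the listing.

```python
# Three rows copied from the listing above; offset and jitter are in milliseconds.
SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+203.26.24.6     216.218.254.202  2 u  815 1024  377   68.044   16.950 173.634
*150.203.1.10    203.35.83.242    2 u  832 1024  377   57.518  -19.504 215.309
+103.214.220.220 221.95.54.210    2 u  659 1024  377  522.974  169.933  51.774
"""


def parse_peers(text):
    """Parse `ntpq -pn` peer rows into a list of dicts (hypothetical helper)."""
    peers = []
    for line in text.splitlines():
        fields = line.split()
        # Peer rows have ten columns; skip the header row and the ===== rule.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        tally = line[0]  # first column: '*' is the selected peer, '+' a candidate
        remote = fields[0] if tally == " " else fields[0][1:]
        peers.append({
            "tally": tally,
            "remote": remote,
            "reach": fields[6],              # octal string, e.g. "377"
            "offset_ms": float(fields[8]),
        })
    return peers


def worst_offset_ms(peers):
    """Largest absolute offset among the peers, in milliseconds."""
    return max(abs(p["offset_ms"]) for p in peers)


if __name__ == "__main__":
    for peer in parse_peers(SAMPLE):
        print(peer["tally"], peer["remote"], peer["reach"], peer["offset_ms"])
```

On the problem host, feeding the real `ntpq -pn` output through something like this at regular intervals would show whether the offset creeps up steadily or jumps.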

ntpd exits with a message to the system log if the offset exceeds the panic threshold, which is 1000 seconds by default.

So with an offset around 38 seconds, I expect ntpd would soldier on, and eventually correct, which seems like what you have observed.
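To make the thresholds concrete, here is a rough Python sketch of how an offset compares against them. The 1000 s panic threshold is the one mentioned above; the 128 ms step threshold is ntpd's documented default and is not something from this thread. The function name is invented, and the sketch ignores timing details such as how long an offset must persist before ntpd steps, as well as the -g and -x options.

```python
STEP_THRESHOLD_S = 0.128    # ntpd default step threshold (assumption: defaults in use)
PANIC_THRESHOLD_S = 1000.0  # ntpd default panic threshold


def expected_action(offset_s, step=STEP_THRESHOLD_S, panic=PANIC_THRESHOLD_S):
    """Roughly classify what ntpd does with an offset of offset_s seconds.

    A panic value of 0 models "tinker panic 0", i.e. no panic limit at all.
    """
    magnitude = abs(offset_s)
    if panic > 0 and magnitude > panic:
        return "panic"   # ntpd logs a message and exits
    if magnitude > step:
        return "step"    # clock is set in one jump
    return "slew"        # clock rate is adjusted gradually


if __name__ == "__main__":
    for offset in (0.0195, 38.0, 1500.0):
        print(offset, expected_action(offset))
```

By this simplified model, a 38-second offset is far above the step threshold and far below the panic threshold, so ntpd should keep running and eventually correct it.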

Is your system clock always slow (i.e. behind the "real" time)? Be sure of the answer, as you don't want the clock stepping backwards on any system that matters.

If so, you could try adding "tinker panic 0" to the ntp.conf file, and restarting ntpd. This is primarily intended for VMs waking up after being moved, and should correct the clock instantly, as required.

You will also get the corrections logged. So investigating will be easier.

Since this is a physical system, and the only one in the fleet misbehaving, I'm suspecting the battery or clock chip.

PMc · Feb 24, 2020

gpw928 said:
These are normal:

Strong mass&gravity fluctuations at Your place?

Here that's normal:

Code:

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
-syrte8.obspm.fr 145.238.203.14   2 u   35  256  377   31.386    2.278   0.427
-schnitzel.team  10.1.105.2       2 u  231  256  377   32.061    0.994   0.401
+mail.stygium.ne 193.67.79.202    2 u  236  256  357   22.339    2.294   0.532
-stratum2-4.NTP. 129.70.130.70    2 u   37  256  377   36.340    3.712   0.412
*formularfetisch 131.188.3.222    2 u  261  256  377   25.141    2.065   0.725
+time3.hs-augsbu 131.188.3.222    2 u  235  256  377   49.093    1.875   0.466
-atlas.linocomm. 130.149.17.21    2 u  255  256  377   25.093    0.230   0.591
+217.144.138.234 237.17.204.95    2 u  228  256  377   26.016    2.039   0.418
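One peer in this listing shows reach 357 instead of 377. The reach column is an 8-bit shift register printed in octal, with the lowest bit standing for the most recent poll, so it can be decoded with a few lines of Python. This is only a sketch; `decode_reach` is a made-up name.

```python
def decode_reach(reach_octal):
    """Decode ntpq's octal reach value into the last eight poll results.

    Returns a list of booleans, newest poll first (True = reply received).
    """
    register = int(reach_octal, 8)
    return [bool((register >> bit) & 1) for bit in range(8)]


def missed_polls(reach_octal):
    """Count how many of the last eight polls went unanswered."""
    return decode_reach(reach_octal).count(False)


if __name__ == "__main__":
    for value in ("377", "357"):
        print(value, decode_reach(value), missed_polls(value))
```

So 377 means all of the last eight polls were answered, and 357 means exactly one of them was missed (the fifth most recent).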

tingo · Feb 24, 2020

OP, you should check that it has selected the right timer hardware.
Example:

Code:

root@bvm5# sysctl kern.timecounter.hardware
kern.timecounter.hardware: HPET

you want the one with the best quality, see dmesg output on your machine

Code:

root@bvm5# dmesg | grep quality
Event timer "LAPIC" quality 600
Event timer "RTC" frequency 32768 Hz quality 0
Timecounter "i8254" frequency 1193182 Hz quality 0
Event timer "i8254" frequency 1193182 Hz quality 100
Timecounter "HPET" frequency 16777216 Hz quality 950
Event timer "HPET" frequency 16777216 Hz quality 550
Event timer "HPET1" frequency 16777216 Hz quality 450
Event timer "HPET2" frequency 16777216 Hz quality 450
Event timer "HPET3" frequency 16777216 Hz quality 450
Event timer "HPET4" frequency 16777216 Hz quality 450
Event timer "HPET5" frequency 16777216 Hz quality 450
Event timer "HPET6" frequency 16777216 Hz quality 450
Event timer "HPET7" frequency 16777216 Hz quality 450
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
Timecounter "TSC" frequency 1496735006 Hz quality 1000

Here it seems like "best" is not the highest number - I have no idea why.
Normally the correct selection happens automagically. I have no idea why it sometimes selects an inferior timer.
To be clear: my example here is when everything is working all right. When it's wrong it is like this:

Code:

root@bvm5# sysctl kern.timecounter.hardware
kern.timecounter.hardware: TSC

gpw928 · Feb 24, 2020

PMc said:
Strong mass&gravity fluctuations at Your place?

Australia is an island/continent of 25 million people in an area more than 20 times the size of Germany. Outside major cities, all the infrastructure has to stretch a long way.

My Internet connection uses a 3G GPRS (cellular mobile) service, with a 6 m antenna mast pointed towards the township of Bega (12 km away). It's quite reliable (I monitor and actively manage it from my firewall). I routinely get up to 800 kilobytes a second, and pay Aus$60/month for a 200 GB bi-directional up/download limit. This is as good as it gets in my locality.

However, by your standards, there is little doubt that the UDP packets to and from the ntp servers have to run a bit of a torture test.

PMc · Feb 25, 2020

gpw928 said:
Australia is an island/continent of 25 million people in an area more than 20 times the size of Germany. Outside major cities, all the infrastructure has to stretch a long way.

My Internet connection uses a 3G GPRS (cellular mobile) service

I already thought that could only be cellular. Anyway, Your location appears to have its charms compared to overcrowded Europe.

With cellular service You have the telephony already solved, while here it is the other way round: when my Telco went for VoIP, I had to see how I can get that through my machine. And they were absolutely certain one cannot use telephony with a computer, one must buy a "router" (aka: plastic thing with linux, a bunch of security holes and probably a backdoor). So ntpq became my measurement tool for proof that VoIP problems cannot be on my side (all network tools get rtprio(1) set).

`Orum · Feb 25, 2020

gpw928 said:
It's quite reliable
...
However, by your standards, there is little doubt that the UDP packets to and from the ntp servers have to run a bit of a torture test.

So reliable (low loss/high availability), but high latency, or high jitter? ntpd usually works well with reliable connections, but can suffer on networks with lots of jitter. It also tends to have problems if the latency is significantly asymmetric. Of course, no NTP daemon is perfect in this regard as the protocol isn't well equipped to deal with that situation, but some are better than others.

Your problem seems much more like a hardware one though, especially because it happens after a cold boot. Check that your battery is good, and select a good source as tingo suggested, and you should be okay.

If your connection is really bad, you can also just get a GPS to sync to a stratum 0 source (making your server stratum 1).

sko · Feb 26, 2020

tingo said:
OP, you should check that it has selected the right timer hardware.
Example:

Code:

root@bvm5# sysctl kern.timecounter.hardware
kern.timecounter.hardware: HPET

you want the one with the best quality, see dmesg output on your machine

Code:

...

Here it seems like "best" is not the highest number - I have no idea why.
Normally the correct selection happens automagically. I have no idea why it sometimes selects an inferior timer.
To be clear: my example here is when everything is working all right. When it's wrong it is like this:

Code:

root@bvm5# sysctl kern.timecounter.hardware
kern.timecounter.hardware: TSC

Regarding timecounters(4): the highest number IS the best-quality timecounter, and that is the one the kernel chooses.
Although there might be one caveat:

Time counters are the lowest level of time tracking in the kernel. They
provide monotonically increasing timestamps with known width and update
frequency. They can overflow, drift, etc and so in raw form can be used
only in very limited performance-critical places like the process
scheduler.

I suspect that on a very busy machine the counter of a very high-frequency timecounter overflows too early to provide useful timestamps (i.e. they are no longer increasing over usable periods of time). This might explain why manually selecting a lower-frequency counter fixed the issue on your system.
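To put numbers on both points, here is a small Python sketch that reads the Timecounter lines from the dmesg output tingo posted, picks the highest quality, and estimates how quickly a counter would wrap. The 32-bit width is purely an assumption for illustration; the real width depends on the hardware and driver, and the helper names are invented.

```python
import re

# Timecounter lines copied from tingo's dmesg output (one Event timer line
# is included to show that it is ignored).
DMESG = """\
Timecounter "i8254" frequency 1193182 Hz quality 0
Timecounter "HPET" frequency 16777216 Hz quality 950
Event timer "HPET" frequency 16777216 Hz quality 550
Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
Timecounter "TSC" frequency 1496735006 Hz quality 1000
"""

TC_LINE = re.compile(r'^Timecounter "([^"]+)" frequency (\d+) Hz quality (-?\d+)$')


def parse_timecounters(text):
    """Map timecounter name -> frequency (Hz) and quality, skipping other lines."""
    found = {}
    for line in text.splitlines():
        match = TC_LINE.match(line.strip())
        if match:
            found[match.group(1)] = {
                "hz": int(match.group(2)),
                "quality": int(match.group(3)),
            }
    return found


def best_timecounter(counters):
    """Name of the timecounter with the highest quality value."""
    return max(counters, key=lambda name: counters[name]["quality"])


def wrap_seconds(hz, bits=32):
    """Seconds until a counter of the given width wraps (width is an assumption)."""
    return 2 ** bits / hz


if __name__ == "__main__":
    counters = parse_timecounters(DMESG)
    print("highest quality:", best_timecounter(counters))
    for name, info in counters.items():
        print(name, info["quality"], round(wrap_seconds(info["hz"]), 2))
```

Under that 32-bit assumption the HPET would wrap every 256 seconds and the TSC in a little under 3 seconds, which at least shows how much sooner a fast counter runs out of range.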

chavez243ca · Mar 3, 2020

Interestingly, host has been keeping good time lately.

Only two changes I can think of - freebsd-update from 12.1-RELEASE-p1 to -p2, and altered ntp.conf to point to the freebsd ntp pool rather than our internal time sources.

Now monitoring ntp across numerous hosts to try to catch the problem, but might just mark this as resolved.