What is your tolerance for clock drift?

I have some Beaglebone projects nearing completion and I see a wee bit of clock drift on the external DS1307 RTC.

What is your threshold for drift adjustment?

I am seeing roughly -0.5 s/24 h, and that concerns me. This is on a box running GPIO services. Nothing critical.
 
It appears the drift file is only used when ntpd is in use.
/var/db/ntpd.drift
I use ntp manually for now.

I guess this board is a good candidate for ntpd.
 
Upon further review, this might be a good task for a weekly cron job to check the time.
I don't want the bulk of a service. I can live with 4 seconds drift a week.
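A weekly check could be a single query-only crontab entry - a sketch, assuming ntpdate(8) is available; the server and log path here are placeholders:

```shell
# hypothetical weekly crontab entry: query the offset without touching the clock
0 3 * * 0  ntpdate -q pool.ntp.org >> /var/log/drift.log 2>&1
```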

Is 0.5 seconds over 24 hours pretty bad? This is just a temperature monitoring box.
 
Depends on what you use the timestamp information for.

Say you're monitoring temperatures like the weather, or in your beer fermentation or water processing. The temperatures probably change by several percent per hour: if you make 0...100 the normalized range of temperature, then you probably have no more than 5 normalized units per hour of change. Let's say that your temperature measurement is accurate to 0.1 units (accuracy = both reproducibility and absolute tolerance). Then the rate of change is 50 "accuracy bands" per hour. So if your timing uncertainty is significantly better than 1/50th of an hour, your overall accuracy is dominated by temperature measurement, not recording time. Given that 1/50th of an hour is roughly a minute, and you're talking about timing accuracy of a second, you are in GREAT shape, and can simply ignore the problem.
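As a sanity check, the arithmetic above fits in a one-liner (the numbers are the assumed ones from the example, not measurements):

```shell
awk 'BEGIN {
  rate_per_hour = 5       # normalized units of temperature change per hour (assumed)
  accuracy      = 0.1     # measurement accuracy in the same units (assumed)
  bands_per_hour = rate_per_hour / accuracy          # 50 "accuracy bands" per hour
  printf "%.0f s per accuracy band\n", 3600 / bands_per_hour
}'
```

That is roughly 72 s per band, so a 1 s timing error is indeed negligible.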

Different example. You have three of these systems, spaced in a triangle 1 km apart. You want to use them to locate the direction to gunshots. The speed of sound in air is roughly 333 m/s, so sound takes about 3 s to cross that baseline, and those are the time differences you need to measure accurately. Say you want your shot spotter to be accurate to 5 degrees of angle, which is roughly 1% of a radian or of the full circle (I'm rounding to make the math easier). So you need to be able to measure relative times to an accuracy of 1% of 3 seconds, or 30 ms. At this point, your 4-second clock drift has completely wiped out your measurement. On the other hand, if all three computers are reachable by WiFi or cellphone, you can easily get NTP client accuracy to 30 ms.
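The same back-of-the-envelope math as a script (333 m/s and 1 km are the rounded figures from above):

```shell
awk 'BEGIN {
  c = 333          # speed of sound in m/s (rounded, as above)
  d = 1000         # distance between the systems in m
  max_tdoa = d / c             # sound needs ~3 s to cross the baseline
  budget = 0.01 * max_tdoa     # ~1% angular accuracy -> ~1% timing accuracy
  printf "timing budget: %.0f ms\n", budget * 1000
}'
```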

Ah, but there is one more problem (which I face at home, except that the things I measure are pressures and water levels, and I use an RPi instead of a beaglebone): When the network fails, the clock time of the RPi can be inaccurate to multiple hours. That's fine, since I know when the RPi booted (that's recorded in the data logs), and when the clock jumps forward (also in the log). But the code that processes the measurements needs to have lots of special cases: What do you do if the next measurement is several hours away backwards or forwards? What if you have two measurements that look like they were taken at the same time, but one really happened several hours away? What if you (god forbid) put an assert into the code that enforces that new measurements always have to have a time stamp that is larger than the previous one? Having the time jump around is much more of a problem than it being slightly wrong.
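Those special cases have to be made explicit in whatever post-processes the log. A minimal sketch, assuming a hypothetical CSV log of `epoch_seconds,value` pairs (the file name and the one-hour threshold are made up):

```shell
# flag timestamps that go backwards or jump forward by more than an hour
awk -F, '
  NR > 1 && $1 <  prev       { printf "line %d: timestamp went backwards\n", NR }
  NR > 1 && $1 - prev > 3600 { printf "line %d: gap of more than an hour\n", NR }
  { prev = $1 }
' measurements.csv
```

The point is to flag anomalies for a human rather than assert monotonicity and crash.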

So here would be my advice: If all you are doing is slow-moving temperatures, ignore the problem. Otherwise, just run an NTP client on all machines that are reachable by network, and deal with the inevitable clock jumping by hand.
 
If I remember correctly, the drift of consumer grade crystal based clocks has been about 5 s per day. Mains powered clocks have a better long term stability as long as the average of the mains frequency is maintained.
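For reference, 5 s per day works out to roughly 58 ppm of frequency error, which is in the ballpark of a cheap uncompensated crystal:

```shell
# 5 s/day expressed as a fractional frequency error in parts per million
awk 'BEGIN { printf "%.0f ppm\n", 5 / 86400 * 1e6 }'
```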
 
Mains powered clocks have a better long term stability as long as the average of the mains frequency is maintained.
Mains frequency has nothing to do with it. Think about it for a second, these chips typically run on 5V or 3.3V DC. How in the world would the mains frequency have any influence here? This could only possibly have an influence if the analog clock's motor is directly (or through a transformer) fed with 50 (or 60) Hz AC power.
 
Mains frequency has nothing to do with it. Think about it for a second, these chips typically run on 5V or 3.3V DC. How in the world would the mains frequency have any influence here? This could only possibly have an influence if the analog clock's motor is directly (or through a transformer) fed with 50 (or 60) Hz AC power.
I think some digital mains clocks and timers pick up the reference frequency from the mains instead of using a crystal oscillator. More likely to provide long term accuracy than a cheap oscillator circuit.
 
I think some digital mains clocks and timers pick up the reference frequency from the mains instead of using a crystal oscillator.
Nope. Requires a lot more electronics (thus more expensive) than generating a clock from a crystal. Some clock circuits have builtin oscillators, those are definitely not as accurate as crystals but they're not depending on the mains frequency (chips are fed with DC power).
 
How in the world would the mains frequency have any influence here? This could only possibly have an influence if the analog clock's motor is directly (or through a transformer) fed with 50 (or 60) Hz AC power.
Yes, that is how it is done. One example is at the bottom of http://www.decodesystems.com/clock-ics.html with the IC named 5316. This chip uses the stepped-down mains both for LED multiplexing and for the internal clock. In the past Toshiba had a lot of those chips in production. Today it is difficult to find data sheets.
 
Nope. Requires a lot more electronics (thus more expensive) than generating a clock from a crystal. Some clock circuits have builtin oscillators, those are definitely not as accurate as crystals but they're not depending on the mains frequency (chips are fed with DC power).
Electronics engineer here. Picking up mains frequency is very inexpensive to do - especially if the downstream device is already powered by mains. There are a couple of different ways of doing it - some more elegant than others - but depending on the design of the power supply you may not need any additional components at all. In fact, the creme-de-la-creme of the cheapest possible consumer electronics featuring clock capabilities is built using mains frequency rather than an oscillator (be it crystal or otherwise), because it is the cheapest thing you can do.
A lot of cheapskate microcontrollers and purpose-built integrated circuits, including tons of no-name Chinese stuff that doesn't even come in a package, feature zero-crossing detector circuits to derive a clocking signal from mains.

There are also legit applications for this in the non-budget world. Tapping mains frequency is probably the most popular way of synchronizing electronics circuits over comparably large distances (this is for phase synchronization, not absolute synchronization).
There are other reasonable applications of using mains frequency for a clocking signal such as for dimmer circuits, motor controllers and many others.

Especially with regards to long term drifts there are few options better than tracking mains frequency (at least in Europe) without exploding costs.

Also, "built in oscillator" does not mean that it is "less accurate". An oscillator is a complete component, not just a crystal or an RC network. While using a crystal is generally notably more accurate than something else (such as a simple RC network), it is not as simple as that. It just so happens that what you call a "builtin oscillator" is not only the driving circuit but also the filter network, and silicon doesn't really allow building good networks (with a high Q factor). However, that doesn't mean that the rest of the oscillator is less accurate, bad, or worse. After all, a crystal is really nothing else than an LCR network with a really, really high Q factor. There is theoretically nothing preventing you from building an oscillator from discrete inductors, resistors and capacitors that will perform equally well or better than a crystal. It just costs a shit ton more money and is undesirable for a bunch of other reasons such as temperature stability.

TL;DR: Many clocks are driven by mains frequency references. The cheaper they are the more likely it is. Voltage has nothing to do with it.
 
Why not run an "ntp client"? The offset will be within one tenth of a second even after running for a year.
As long as the network doesn't get saturated...
Code:
Local system status:
 3:01AM  up 9 days, 20:49, 17 users, load averages: 25.68, 18.57, 16.46
NTP status:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+2a01:4f8:221:3b 130.149.17.21    2 u  923 1024  377   24.379   -1.153   3.780
+2001:638:504:20 129.70.137.82    2 u  536 1024  377   34.931   -1.498   9.397
+2a01:4f8:141:28 192.171.1.150    2 u  606 1024  377   27.219   -0.664   4.372
+2a03:4000:6:f07 130.149.17.8     2 u  205 1024  377   23.988   -1.559   2.328
+2a03:4000:40:36 131.188.3.223    2 u  258 1024  377   24.842   +0.216   1.825
+2001:4ba0:ffa4: 124.216.164.14   2 u  325 1024  377   23.287   -0.128   2.127
*2a01:4f8:201:41 36.224.68.195    2 u  327 1024  377   24.932   -1.220   2.858
#2a01:238:4244:b 131.188.3.220    2 u  330 1024  377   32.624   +1.908   1.713
[...]
Local system status:
 3:02AM  up 10 days, 20:50, 21 users, load averages: 24.33, 17.72, 15.19
NTP status:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+2a01:4f8:221:3b 130.149.17.21    2 u    - 1024  377   27.134  -101.40 132.278
+2001:638:504:20 129.70.137.82    2 u  795 1024  377   38.267  -98.793 108.638
+2a01:4f8:141:28 192.171.1.150    2 u  726 1024  377   29.503  -98.080 140.123
+2a03:4000:6:f07 192.53.103.108   2 u  189 1024  377   24.820  -104.51 101.263
+2a03:4000:40:36 193.204.114.233  2 u  512 1024  377   24.683  -97.607  90.782
+2001:4ba0:ffa4: 237.17.204.95    2 u  313 1024  377   23.709  -103.14 134.917
*2a01:4f8:201:41 36.224.68.195    2 u  544 1024  377   31.457  -95.471 105.490
#2a01:238:4244:b 40.33.41.76      2 u  449 1024  377   35.521  -92.758 135.705
 
I don't want the bulk of a service. I can live with 4 seconds drift a week.
Is the required correction method (a significant jump) going to cause any undesirable consequences?

ntpdate(8) suggests that a low impact method to keep time in good sync without running an NTP client is to set the time correctly at boot (ntpdate -b 162.159.200.123), and then slew the time every hour (ntpdate 162.159.200.123) from cron. The hourly interval generally ensures that the clock can be slewed and not stepped.
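Put into a crontab, that scheme might look like this (a sketch, assuming a cron that supports @reboot; the address is the one from the post above):

```shell
# step the clock once at boot, then slew hourly; the hourly offset
# generally stays under the 128 ms step threshold, so the clock is
# slewed rather than stepped
@reboot    ntpdate -b 162.159.200.123
0 * * * *  ntpdate    162.159.200.123
```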
 
As long as the network doesn't get saturated...
The first set of numbers are very good, you have single-digit milliseconds jitter. And the second set is not terrible either; being accurate to within 130ms is pretty good. Clearly, the correct fix is to make the network not overload.
 
Using make and NFS I have a low tolerance for clock drift.
Using make over NFS really points out how badly designed the original NFS protocol was. Sadly, that makes sense: it was originally designed as a quick hack to build networks of diskless machines. At the time it was designed (early 80s), our knowledge of distributed systems was in its infancy; for example, Leslie Lamport's paper about clocks was from that era, and his consensus paper (Paxos) came 10 years later. The sad thing is that the VAXcluster protocol had solved many of the same problems; I don't know why this didn't reach a wider audience.
 
It's worth a few megabytes to me to have accurate time. I run my own NTP server that's connected to a public stratum 1 server. Yes, I contacted the stratum 1 server's administrators for approval. Your mileage definitely varies on NTP's memory usage.

FreeBSD 12.3
Code:
 $  ps ax -o args,vsz,rss | grep -i ntp
/usr/sbin/ntpd -    18884    5472
OpenBSD 7.2 (OpenNTPD)
Code:
$  ps ax -o args,vsz,rss | grep -i ntp
ntpd: ntp engine  1212  2988
ntpd: dns engine  1060  2756
/usr/sbin/ntpd    1124  1644
Oldish Gentoo
Code:
# ps ax -o args,vsz,rss | grep -i ntp
/usr/sbin/ntpd -p /var/run/  73984  8556

...Leslie Lamport's paper about clocks was from that time, and his consensus paper (Paxos) came 10 years later...
I just recently watched a presentation on Google Spanner. They use atomic clocks in the Paxos implementation they rely on for global strong consistency.
The sad thing is that the VAXcluster protocol had solved many of the same problems; I don't know why this didn't reach a wider audience.
We had a VAX cluster in one of the jobs I had early in my career. That thing was subtle and quick to anger.

ntpdate(8) suggests that a low impact method to keep time in good sync without running an NTP client is to set the time correctly at boot (ntpdate -b 162.159.200.123), and then slew the time every hour (ntpdate 162.159.200.123) from cron. The hourly interval generally ensures that the clock can be slewed and not stepped.
Supposedly ntpdate(8) is deprecated and will be removed Real Soon Now. Running ntpd(8) with arguments -gq should be functionally equivalent. Slewing the time is exactly what ntpd(8) does.
 
The first set of numbers are very good, you have single-digit milliseconds jitter. And the second set is not terrible either; being accurate to within 130ms is pretty good. Clearly, the correct fix is to make the network not overload.
Yes, and that is the difficult part. 130 ms is rather bad - there is some spec saying VoIP demands 40 ms, and in any case trying to read tcpdumps across the cloud is impractical that way.
I tried a lot of things to make this better: I run ntpd and all routing components at rtprio, I try to give bandwidth privilege to NTP, and nothing led to a significant success.

Until I figured out where the problem actually is: the congestion algorithms (reno, cdg) are not suitable. They feed packets into the link for as long as the TCP window allows (and the TCP window does not care about the link bandwidth, only about the buffers at the endpoints), until all the buffers along the way are full. Then packets get dropped, and only then does the congestion algorithm react: it limits the traffic, the packet loss goes away, and then it all starts anew.
The outcome: half of the time the link bandwidth is not fully utilized, packet loss is high nevertheless, and the roundtrip delay fluctuates heavily and creates jitter.

Finally I figured out how vegas works - it works completely differently and aims to keep the delay constant - and now it looks like this:
Code:
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+ntp3.linocomm.n 130.149.17.21    2 u  477 1024  377   33.635   +0.642   5.954
*stratum2-4.NTP. 129.70.137.82    2 u  959 1024  377   47.236   +1.827   2.566
+2a01:4f8:141:28 131.188.3.223    2 u  467 1024  377   37.507   +2.310   1.734
+mail.rleh.de    130.149.17.8     2 u  971 1024  377   33.672   +1.545   2.737
#daphne.ea1145.o 131.188.3.223    2 u  151 1024  377   33.584   +1.635   1.635
+2001:41d0:700:1 131.188.3.222    2 u  648 1024  377   30.743   +1.408   1.404
-srv.hueske-edv. 36.224.68.195    2 u  702 1024  377   39.281   +3.483   2.685
#ns2.madavi.de   131.188.3.220    2 u  555 1024  377   39.299   +2.870   4.666

What is obvious here: the absolute delay is 10ms more. This is because I have kern.hz=200 and net.inet.ip.dummynet.io_fast=1. The latter means that the dummynet pipes and queues will only engage when the link is actually filled, otherwise packets bypass dummynet and are sent directly. And dummynet runs on the HZ clock. So now dummynet is always engaged, the link is always fully saturated, packet loss is below what TCP SACK can handle inline, and the jitter is gone.
I think this is how it was intended to work.

Code:
net.inet.tcp.cc.algorithm="vegas"
net.inet.tcp.cc.vegas.beta="14"
net.inet.tcp.cc.vegas.alpha="8"

This works with nested VPNs, so the alpha and beta need to be larger than the default. [Alpha and Beta are the min and max of surplus packets inflight to fill the buffers on the way. Vegas measures the natural -minimum- roundtrip time, and computes the number of packets that should be naturally inflight at the current sending rate. If the actual number is above this+beta, the rate is reduced, if it is below this+alpha, the rate is increased.]
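The rule in the brackets can be illustrated with a toy calculation (all the traffic numbers are made up; alpha and beta are the sysctl values above):

```shell
awk 'BEGIN {
  alpha = 8; beta = 14          # the sysctl values above
  base_rtt = 0.030              # measured minimum roundtrip time in s (made up)
  rate     = 1000               # current sending rate in packets/s (made up)
  expected = rate * base_rtt    # packets that should "naturally" be in flight: 30
  actual   = 50                 # packets actually measured in flight (made up)
  surplus  = actual - expected  # 20 surplus packets sitting in buffers
  if      (surplus > beta)  print "reduce rate"
  else if (surplus < alpha) print "increase rate"
  else                      print "hold rate"
}'
```

With 20 surplus packets against beta = 14, vegas would back off before any buffer overflows and drops packets.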

The main problem with vegas is that it is fairness-based and therefore cannot compete with the other, more aggressive (and less effective) algorithms on the same link. As is so often the case in life.
 
Slewing the time is exactly what ntpd(8) does.
Both ntpdate and ntpd will step time if the time offset exceeds the step threshold, which is 128 ms by default, and I was just quoting ntpdate(8) suggesting that running ntpdate hourly would provide "precise enough timekeeping to avoid stepping the clock".
 