strange ntpd behaviour or perhaps a motherboard problem (or both)?

New system as follows:

* Gigabyte GA-Z77-D3H Intel Z77 Ivy Bridge Motherboard
* Intel Core i5 2500K Sandy Bridge 3.30GHz
* Corsair CML16GX3M4A1600C9 Vengeance LP 4x4GB DDR3-1600 RAM
* Crucial M4 64GB SSD

Motherboard BIOS/Firmware updated to the most recent: F13
http://www.gigabyte.com/products/product-page.aspx?pid=4140#bios

Motherboard settings are at their defaults, except that the clock is set to UTC and the SATA mode has been changed from IDE to AHCI. The operating system is Debian GNU/Linux 6.0.5 (squeeze) with a backported Linux 3.2.0 kernel, which is needed to support the new motherboard.

I'm running ntp and I notice that when I restart the ntp service the system time is accurate.
But after a while it drifts outside the limits of what ntp can handle.

For example, restarting ntp and checking the status at 5-second intervals:

# service ntp restart; sleep 5; ntpq -pn; sleep 5; ntpq -pn; sleep 5; ntpq -pn
Stopping NTP server: ntpd.
Starting NTP server: ntpd.
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
219.88.250.190 119.47.118.129   3 u    2   64    1   32.344 -4645.6 24.938
202.21.137.10   211.219.125.170 2 u    1   64    1   42.656 -4657.0 24.844
202.6.116.123   202.46.183.13    2 u    2   64    1   52.763 -4642.1   0.000
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
219.88.250.190 119.47.118.129   3 u    2   64    1   32.373 -12.021   0.000
202.21.137.10   211.219.125.170 2 u    2   64    1   42.895 -11.757   0.000
202.6.116.123   202.46.183.13    2 u    3   64    0    0.000    0.000   0.000
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
*219.88.250.190 119.47.118.129   3 u    1   64    1   32.713 -85.881 53.304
+202.21.137.10   211.219.125.170 2 u    1   64    1   42.945 -85.295 53.092
202.6.116.123   202.46.183.13    2 u    8   64    0    0.000    0.000   0.000

Then check about 15 minutes later.

# ntpq -pn
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
219.88.250.190 119.47.118.129   3 u   61   64 377   31.689 -14676. 3779.72
202.21.137.10   211.219.125.170 2 u   63   64 377   42.899 -14650. 3699.08
202.6.116.123   202.46.181.123   2 u   34   64 377   49.662 -15006. 3725.38

Note: the first one is my ISP's time server -> ntp.orcon.net.nz = 219.88.250.190

The offset (above) has reached about 15 seconds less than 20 minutes after the ntp restart, and none of the time sources have "+" or "*" marks, which means that ntpd has not selected any of them as a usable synchronisation source.
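To make this check repeatable, here is a minimal Python sketch that reads pasted `ntpq -pn` output and reports which peers carry a "*" or "+" mark and how large the worst offset is. The function names (`parse_ntpq_peers`, `summarise`) are made up for illustration. The ten-column layout and the set of tally characters are assumptions based on the output shown in this post, not a complete description of ntpq.

```python
# Tally characters ntpq may print in front of a peer (assumed set).
TALLY_CODES = set("*+-#ox.")
HEADER_PREFIXES = ("remote", "=")

SAMPLE = """\
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
219.88.250.190 119.47.118.129   3 u   61   64 377   31.689 -14676. 3779.72
202.21.137.10   211.219.125.170 2 u   63   64 377   42.899 -14650. 3699.08
202.6.116.123   202.46.181.123   2 u   34   64 377   49.662 -15006. 3725.38
"""


def parse_ntpq_peers(text):
    """Parse the peer rows of `ntpq -pn` output into a list of dicts."""
    peers = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(HEADER_PREFIXES):
            continue  # skip blank lines, the column header and the ==== rule
        tally = " "
        if stripped[0] in TALLY_CODES:
            tally, stripped = stripped[0], stripped[1:]
        fields = stripped.split()
        if len(fields) != 10:
            continue  # not a peer row in the layout assumed here
        peers.append({
            "tally": tally,
            "remote": fields[0],
            "reach": fields[6],
            "offset_ms": float(fields[8]),  # ntpq prints offsets in milliseconds
        })
    return peers


def summarise(peers):
    """Return (remotes marked '*' or '+', largest absolute offset in ms)."""
    selected = [p["remote"] for p in peers if p["tally"] in "*+"]
    worst = max((abs(p["offset_ms"]) for p in peers), default=0.0)
    return selected, worst


selected, worst_ms = summarise(parse_ntpq_peers(SAMPLE))
print("selected peers:", selected or "none")
print(f"worst offset: {worst_ms / 1000:.1f} s")
```

Because `-n` prints numeric addresses, a leading "o" or "x" can safely be read as a tally mark. With hostnames in the first column that shortcut would not be safe.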

Then the next day after about 14 hours:

$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
219.88.250.190 119.47.118.129   3 u    3   64 377   31.807 -544726 3599.06
202.6.116.123   202.46.183.13    2 u   15   64 377   51.482 -544573 3576.86
202.21.137.10   211.219.125.170 2 u    1   64 377   42.417 -544749 3612.79

This appears to be a sign of a bad hardware clock that is drifting too fast for NTP to cope with.

I started to follow the procedure here:
http://support.ntp.org/bin/view/Support/HowToCalibrateSystemClockUsingNTP

I changed /etc/ntp.conf to simplify it with just one server line:
server ntp.orcon.net.nz iburst

Then restart NTP and check after a few minutes.

$ ntpq -c rv
associd=0 status=c028 leap_alarm, sync_unspec, 2 events, no_sys_peer,
version="ntpd 4.2.6p2@1.2194-o Sun Oct 17 13:35:13 UTC 2010 (1)",
processor="x86_64", system="Linux/3.2.0-0.bpo.1-amd64", leap=11,
stratum=4, precision=-23, rootdelay=44.022, rootdisp=2693.963,
refid=219.88.250.190,
reftime=d36c13cc.47138567 Sun, May 27 2012 14:45:32.277,
clock=d36c15ea.13d6b8c4 Sun, May 27 2012 14:54:34.077, peer=0, tc=6,
mintc=3, offset=0.000, frequency=15.778, sys_jitter=1153.294,
clk_jitter=30.364, clk_wander=0.000

But I don't think the reported frequency is valid: the status shows "no_sys_peer", so ntpd has not actually synchronised to the server.
The same query on another machine reports "clock_sync".

The next step is to check the drift by switching the ntp daemon off and using ntpdate (the -q option means query but don't set the clock).

First I ran "ntptime -f 0" and removed the ntp.drift file, as suggested in the link above.
Then I stopped ntp and used ntpdate.

root@debian-i5:~# service ntp stop
Stopping NTP server: ntpd.
root@debian-i5:~# ntpdate ntp.orcon.net.nz
27 May 16:47:15 ntpdate[21638]: adjust time server 219.88.250.190 offset -0.449953 sec

Then after about one hour:

root@debian-i5:~# ntpdate -q ntp.orcon.net.nz
server 219.88.250.190, stratum 3, offset -46.641260, delay 0.05699
27 May 17:50:33 ntpdate[21804]: step time server 219.88.250.190 offset -46.641260 sec

Huge jump of more than 46 seconds, but perhaps it might settle down after a while...

But an hour later we see a similar drift of about 44 seconds, making a total of more than 90 seconds in two hours.

root@debian-i5:~# ntpdate -q ntp.orcon.net.nz
server 219.88.250.190, stratum 3, offset -90.685901, delay 0.05710
27 May 18:50:15 ntpdate[22051]: step time server 219.88.250.190 offset -90.685901 sec
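For comparison with the PPM figures usually quoted for clock quality, the three ntpdate readings can be turned into a drift rate. The Python sketch below is only an estimate under stated assumptions: the offset is treated as roughly zero right after the first ntpdate run (it reported a 0.45 second adjustment), the year 2012 is taken from the `ntpq -c rv` output, and the timestamps come from the drifting local clock itself. The helper name `drift_ppm` is made up for illustration.

```python
from datetime import datetime

YEAR = 2012  # taken from the ntpq -c rv output; ntpdate does not print the year
STAMP_FORMAT = "%Y %d %b %H:%M:%S"


def drift_ppm(stamp_a, offset_a, stamp_b, offset_b):
    """Average drift between two ntpdate readings, in parts per million."""
    start = datetime.strptime(f"{YEAR} {stamp_a}", STAMP_FORMAT)
    end = datetime.strptime(f"{YEAR} {stamp_b}", STAMP_FORMAT)
    elapsed = (end - start).total_seconds()
    return (offset_b - offset_a) / elapsed * 1e6


# (timestamp printed by ntpdate, offset in seconds)
READINGS = [
    ("27 May 16:47:15", 0.0),  # assumption: clock roughly correct after the first run
    ("27 May 17:50:33", -46.641260),
    ("27 May 18:50:15", -90.685901),
]

for (t0, o0), (t1, o1) in zip(READINGS, READINGS[1:]):
    print(f"{t0} -> {t1}: {drift_ppm(t0, o0, t1, o1):.0f} PPM")
```

Both intervals come out at roughly 12,000 PPM in magnitude (a bit over 1%), which is a couple of orders of magnitude beyond what is normally expected from a crystal-driven clock.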

So I'm about to e-mail Computerlounge and request a replacement motherboard, but I decide to just reboot and see what happens.
System reboots and ntpd starts.

Several hours later ntp is doing just fine.

# ntpq -pn
     remote           refid      st t when poll reach   delay   offset jitter
==============================================================================
*219.88.250.190 119.47.118.129   3 u 257 1024 377   31.948   -2.031   1.403

# ntpdate -q ntp.orcon.net.nz
server 219.88.250.190, stratum 3, offset -0.002055, delay 0.05696
28 May 00:05:43 ntpdate[30722]: adjust time server 219.88.250.190 offset -0.002055 sec

Then I find some similar cases - at least the first case appears to be software related:
https://bugzilla.redhat.com/show_bug.cgi?id=666558
http://forums.fedoraforum.org/showthread.php?p=1443346

Then I remember that the previous day I had been using the power control button on the Logitech K200 keyboard to suspend the PC, once while I went out for dinner and a couple of other times. Maybe this upset ntpd or the kernel timekeeping.

Questions:
1. Has anyone else seen this sort of strange behaviour with ntpd?
2. If I eliminate the operating system issues and find there is some drift in the hardware clock, how much is too much?
3. Has anyone here measured the drift of their real time clock?

Here it suggests that 12 PPM (one second per day) might be considered normal/acceptable, but 500 PPM would be considered very bad (only poor old mechanical wristwatches are worse):
http://www.ntp.org/ntpfaq/NTP-s-sw-clocks-quality.htm

From memory, I recall reading elsewhere that ntp might be able to cope with a drift of 100 PPM, but that this should only be seen in some old systems and not in a brand new motherboard. My understanding is that the real time clock is integrated into the Southbridge chipset, so replacing it would mean swapping the motherboard.
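The PPM figures above are easier to judge as seconds gained or lost per day. A minimal Python sketch of the conversion follows; the function names are made up for illustration, and the only inputs are the 12, 100 and 500 PPM values mentioned in this post.

```python
SECONDS_PER_DAY = 86_400


def ppm_to_seconds_per_day(ppm):
    """Seconds of drift accumulated in one day at a given rate in PPM."""
    return ppm * 1e-6 * SECONDS_PER_DAY


def seconds_per_day_to_ppm(seconds):
    """Drift rate in PPM for a clock that is off by `seconds` after one day."""
    return seconds / SECONDS_PER_DAY * 1e6


for ppm in (12, 100, 500):
    print(f"{ppm:>4} PPM = {ppm_to_seconds_per_day(ppm):.2f} s/day")
```

As far as I know, ntpd limits its own frequency correction to 500 PPM, so a clock that drifts faster than about 43 seconds per day cannot be disciplined by slewing alone.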

Edit: fix some typos.

alexx
