Closed Bug 646056 Opened 13 years ago Closed 13 years ago

Releng machines should use ntp.build.mozilla.org as their time server

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zandr, Assigned: bhearsum)

References

Details

(Whiteboard: [puppet][opsi][buildmasters][buildslaves])

Attachments

(4 files)

In bug 617414 we've been working to lock down internet access for test machines. In the logs, I've been specifically ignoring ntp because the use of pool.ntp.org means we have literally hundreds of hosts listed. The macs appear to be hitting time.apple.com, and there's at least a little traffic to Microsoft.

If we shut down internet access before this is resolved, we can leave port 123 (NTP) open, but we really should use a local ntp server: ntp.build.mozilla.org.
There's a real mixture of behaviours here, depending on the age of the reference platform for various classes of machines. Some examples:
* bm-xserveN points at ntp1.bmo via /etc/ntpd.conf
* moz2-darwin10-slaveN points to time.apple.com and ntp1.bmo (if you modify ntp.conf don't use the System Preferences to modify the server afterwards)
* moz2-darwin9-slaveN points to ntp1.bmo only
* moz2-linux-slaveN are getting info from dhcp (end of /etc/ntp.conf):
# servers generated by /sbin/dhclient-script
server 63.245.208.36
server 63.245.208.37
server 10.2.71.5
server 127.127.1.0
fudge	127.127.1.0 stratum 10	
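
A minimal sketch of how the mixture above could be audited, assuming the ntp.conf text has already been collected from each host. The function names and the `allowed` default are hypothetical and not part of any existing releng tooling; the sample text is the moz2-linux-slaveN excerpt quoted above.

```python
def ntp_servers(conf_text):
    """Return the hosts named on 'server' lines, ignoring comments."""
    servers = []
    for line in conf_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "server":
            servers.append(fields[1])
    return servers


def unexpected_servers(conf_text, allowed=("ntp.build.mozilla.org",)):
    """List servers that are neither allowed nor the 127.127.x.x local clock."""
    return [s for s in ntp_servers(conf_text)
            if s not in allowed and not s.startswith("127.127.")]


# The dhclient-generated tail of /etc/ntp.conf quoted in this comment.
sample = """\
# servers generated by /sbin/dhclient-script
server 63.245.208.36
server 63.245.208.37
server 10.2.71.5
server 127.127.1.0
fudge\t127.127.1.0 stratum 10
"""

print(unexpected_servers(sample))
```

For the sample, the three DHCP-supplied addresses are reported and the local-clock pseudo-server is skipped.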

I'd be surprised if talos-r3* was modified from the default install behaviour.

Bug 539278 is related.

Also, why do both ntp1.bmo and ntp.bmo point at machines in scl1 ? It used to be a border box in MPT (like moz2-linux-slaveN get now).
(In reply to comment #1)

> Also, why do both ntp1.bmo and ntp.bmo point at machines in scl1 ? It used to
> be a border box in MPT (like moz2-linux-slaveN get now).

I didn't configure it, but ntp.b.m.o is maintained by IT as The Right Thing. 

At the moment, it's a round-robin of ntp1 and ntp2.infra.scl1, which are actually ns1 and ns2.infra.scl1. VMs on a lightly loaded cluster seem like a better choice to me than the ntp server on heavily loaded routers, and using that service name will let us change where that service comes from without having to touch all the machines.
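
To illustrate why a service name is convenient here, the following is a rough sketch that lists every address a name resolves to. `resolve_all` and `fake_resolver` are hypothetical names, and the stub returns made-up documentation-range addresses rather than the real ntp1/ntp2.infra.scl1 records; pass nothing for `resolver` to use the system's DNS instead.

```python
import socket


def resolve_all(name, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated addresses a service name resolves to."""
    infos = resolver(name, 123, type=socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})


def fake_resolver(name, port, type=0):
    # Stand-in for DNS: invented addresses, with one duplicate on purpose.
    return [
        (socket.AF_INET, type, 17, "", ("192.0.2.1", port)),
        (socket.AF_INET, type, 17, "", ("192.0.2.2", port)),
        (socket.AF_INET, type, 17, "", ("192.0.2.1", port)),
    ]


print(resolve_all("ntp.build.mozilla.org", resolver=fake_resolver))
```

Clients only ever see the name, so the set of addresses behind it can change without any client-side edit.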
Priority: -- → P3
Whiteboard: [puppet][opsi][buildmasters][buildslaves]
Now that DHCP will be serving option ntp-servers, we should use that. This will allow a local NTP server without having to rely on split horizon or search path.
Summary: Releng machines should use ntp.build.mozilla.org for time. → Releng machines should use follow DHCP option ntp-servers for time.
so we should revert the suggestion in https://bugzilla.mozilla.org/show_bug.cgi?id=646563#c0 to set PEERNTP=no in our dhcp configs?
Blocks: 646046
(In reply to comment #4)
> so we should revert the suggestion in
> https://bugzilla.mozilla.org/show_bug.cgi?id=646563#c0 to set PEERNTP=no in
> our dhcp configs?

Google says yes.
Assignee: nobody → bhearsum
Once bug 646126 is resolved, I'll have a look and see which other classes of machines need updates.
This discussion seems to indicate that it's not possible for Windows to obey an NTP server: http://superuser.com/questions/147248/windows-clients-not-using-ntp-server-provided-via-dhcp. And by way of omission, http://support.microsoft.com/kb/121005 claims NTP servers aren't supported.

So, looks like we're managing it ourselves on Windows. We've got OPSI in some places (XP test machines, build machines, maybe some Windows 7 ones?), which will easily manage this. For everything else, should be easy to do over ssh.
(In reply to comment #7)
> This discussion seems to indicate that it's not possible for Windows to obey
> an NTP server:
> http://superuser.com/questions/147248/windows-clients-not-using-ntp-server-provided-via-dhcp.
> And by way of omission,
> http://support.microsoft.com/kb/121005 claims NTP servers aren't supported.
> 
> So, looks like we're managing it ourselves on Windows. We've got OPSI in
> some places (XP test machines, build machines, maybe some Windows 7 ones?),
> which will easily manage this. For everything else, should be easy to do
> over ssh.

Looks like the OS X DHCP Client may not support it either. It's very hard to find information on it, but based on my googling of www.opensource.apple.com, it looks like this option is only referenced in a few header files:
http://www.google.ca/search?hl=en&client=firefox-a&hs=OJI&rls=org.mozilla%3Atl%3Aunofficial&q=DHCPTAG_NETWORK_TIME_PROTOCOL_SERVERS+site%3Awww.opensource.apple.com&aq=f&aqi=&aql=&oq=
http://www.google.ca/search?hl=en&client=firefox-a&hs=ygI&rls=org.mozilla%3Atl%3Aunofficial&q=dhcptag_network_time_protocol_servers_e+site%3Awww.opensource.apple.com&aq=f&aqi=&aql=&oq=

Based on that, I'm not going to spend any more time researching this, and will set them explicitly through Puppet on Mac.
Summary: Releng machines should use follow DHCP option ntp-servers for time. → Releng machines should use follow DHCP option ntp-servers for time (or use ntp.build.mozilla.org)
So, after doing a bit more research, it turns out we need to remove existing NTP servers from ntp.conf before DHCP will update that file. However, to do that, we'd have to manage ntp.conf completely, which means that at boot, DHCP would set the servers in ntp.conf and then get overridden by Puppet very shortly afterwards. Given that, I'm tossing out the idea of using this DHCP option and going with the simple plan of hardcoding ntp.build.mozilla.org everywhere. For Linux and Mac this will be managed by Puppet. For XP and Windows 2003, by OPSI. For Windows 7, it'll have to be done by hand over SSH.
Summary: Releng machines should use follow DHCP option ntp-servers for time (or use ntp.build.mozilla.org) → Releng machines should use ntp.build.mozilla.org as their time server
This patch syncs out new ntp.conf files everywhere. Mostly it's just re-arranging existing options, but it does make sure the server is "ntp.build.mozilla.org" everywhere and, in some places, adds the "restrict 10.0.0.0 mask 255.0.0.0" line. On Linux build machines we're actually not syncing the time currently, and I've kept that behaviour because I don't want to deal with potential issues with ntp + VMs here. These machines will get a useful ntp.conf, however, so turning it on later will be trivial.

Tested this across all types of machines that sync with Puppet.
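
For readers without access to the attachment, here is a rough sketch of the kind of file the patch describes. It is not the patch itself: `render_ntp_conf` is a hypothetical helper, the header comment is invented, and only the `server` and `restrict` lines are taken from the description above.

```python
def render_ntp_conf(server="ntp.build.mozilla.org", restrict_private=True):
    """Build a minimal ntp.conf body with one server and an optional restrict line."""
    lines = [
        "# Managed centrally; local edits will be overwritten.",
        f"server {server}",
    ]
    if restrict_private:
        # Added only "in some places" according to the patch description.
        lines.append("restrict 10.0.0.0 mask 255.0.0.0")
    return "\n".join(lines) + "\n"


print(render_ntp_conf(), end="")
```

Any other options the real files carry (drift file, local clock fallback, and so on) are deliberately left out, since the comment does not list them.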
Attachment #535333 - Flags: review?(dustin)
Tested on XP and 2003.
Attachment #535347 - Flags: review?(dustin)
Attached file reg keys for Windows 7 machines (deleted) —
Attachment #535347 - Flags: review?(dustin) → review+
Attachment #535333 - Flags: review?(dustin) → review+
Got bogged down with other work, planning to land all of these changes on Monday.
Attachment #535333 - Flags: checked-in+
I tested the Puppet part on one of each type of slave, and it seems to be landing fine. Moving on to the OPSI and Windows 7 parts.
Comment on attachment 535347 [details] [diff] [review]
OPSI package to set the time server

Landed, and set to deploy across the board on Windows 2003 build machines & XP test machines.
Attachment #535347 - Flags: checked-in+
Turns out I can't deploy to Windows 7 over ssh, so I'll have to do it over VNC. Planning to do so in tomorrow morning's downtime, because it'll be much easier when I don't have to worry as much about breaking things....

Deployment will happen with these commands, in a cmd.exe started through "run as administrator":
wget -O time.reg --no-check-certificate https://bugzilla.mozilla.org/attachment.cgi?id=535348
reg import time.reg
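
The attached .reg file has since been deleted, so its exact contents are unknown. As a guess at its general shape only, this sketch generates a .reg body that points the Windows Time service at a given server. The `W32Time\Parameters` key and its `NtpServer` and `Type` values are standard Windows settings; the helper name `time_reg` and the `0x9` flag (special poll interval plus client mode) are assumptions, not taken from the attachment.

```python
def time_reg(server="ntp.build.mozilla.org", flags="0x9"):
    """Return .reg file text that sets the W32Time NTP server (assumed layout)."""
    key = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Parameters"
    return "\n".join([
        "Windows Registry Editor Version 5.00",
        "",
        f"[{key}]",
        f'"NtpServer"="{server},{flags}"',
        '"Type"="NTP"',
        "",
    ])


print(time_reg())
```

The output would be saved as time.reg and loaded with the `reg import` command shown above; the Windows Time service typically needs a restart before it picks up the new value.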
I've rolled out changes to all of the talos-r3-w7 machines except:
- 001, which is awaiting a re-image
- 011, 032, and 045, which are awaiting reboots

I've left a comment in bug 649835 about updating the NTP server after 001 gets re-imaged, and I'll take care of updating the other 3 when they come back from the reboot.
talos-r3-w7-032 and 045 are done. Just 001 and 011 left.
Dustin reminded me that the ref machines need doing, too. Earlier this morning I made sure talos-r3-xp-ref, win2k3sp2-ref (the master VMware image), and win32-ix-ref were up to date. Just a few minutes ago, I updated talos-r3-w7-ref. That's all of them.

(In reply to comment #18)
> talos-r3-w7-032 and 045 are done. Just 001 and 011 left.

001 is being re-imaged from the ref machine in bug 649835, so scratch that from the list. Only have 011 to worry about now.
No longer blocks: 646046
A log capture in bug 646046 showed that the Windows machines are still hitting time.apple.com. Turns out they run AppleTimeSrv.exe, which is at fault. This service claims to keep time in sync when rebooting between OS X and Windows, and indeed, just rebooting a Windows machine with the service disabled leaves me with the correct time. I'm going to disable it by hand on the staging Windows test machines, and if they all still have the correct time next week, I'll roll that change out to the rest of them.
Finally got talos-r3-w7-011 updated. Only thing left to do is figure out if turning off AppleTimeSrv is safe, and if so, do it.
(In reply to comment #20)
> A log capture in bug 646046 showed that the Windows machines are still
> hitting time.apple.com. Turns out they run AppleTimeSrv.exe, which is at
> fault. This service claims to keep time in sync when rebooting between OS X
> and Windows, and indeed, just rebooting a Windows machine with the service
> disabled leaves me with the correct time. I'm going to disable it by hand on
> the staging Windows test machines, and if they all still have the correct
> time next week, I'll roll that change out to the rest of them.

All of the staging machines still have the correct time, so I'll update the OPSI package to disable this service everywhere, and manually disable it on Windows 7.
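
One way a "still has the correct time" check could be scripted is to compare the local clock against an SNTP reply. The sketch below covers only the packet handling, with hypothetical function names; it is not how the staging machines were actually checked, and the demonstration uses a hand-built reply rather than a live query.

```python
import struct

NTP_UNIX_DELTA = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request():
    """48-byte SNTP client request: LI=0, version 3, mode 3 (client)."""
    return b"\x1b" + 47 * b"\0"


def transmit_time(reply):
    """Extract the server's transmit timestamp (bytes 40-47) as Unix seconds."""
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_UNIX_DELTA + fraction / 2**32


def clock_offset(reply, local_unix_time):
    """Rough offset (server minus local), ignoring network delay."""
    return transmit_time(reply) - local_unix_time


# Hand-built reply: server says Unix time 1000.5, local clock says 998.0.
fake_reply = 40 * b"\0" + struct.pack("!II", NTP_UNIX_DELTA + 1000, 2**31)
print(clock_offset(fake_reply, 998.0))  # prints 2.5
```

A real check would send `build_request()` over UDP to port 123 of the time server and feed the 48-byte response to `clock_offset` along with `time.time()`.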
Attached patch disable appletimesrv service (deleted) — Splinter Review
Attachment #538988 - Flags: review?(dustin)
Comment on attachment 538988 [details] [diff] [review]
disable appletimesrv service

+sc config AppleTimeSrv start= disabled
                              ^ intentional?

r+ on the assumption it worked for you..
Attachment #538988 - Flags: review?(dustin) → review+
(In reply to comment #24)
> Comment on attachment 538988 [details] [diff] [review] [review]
> disable appletimesrv service
> 
> +sc config AppleTimeSrv start= disabled
>                               ^ intentional?
> 
> r+ on the assumption it worked for you..

Yup, in fact, it's *required*: http://www.techrepublic.com/forum/discussions/47-171983
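
The odd spacing is easier to see when the command is built as an argument list: `start=` and `disabled` are separate tokens. This is a small sketch with hypothetical helper names, not the contents of the OPSI package; the injectable `run` parameter exists only so the command can be inspected without executing it.

```python
import subprocess


def sc_disable_args(service):
    """Argument list for disabling a Windows service; 'start=' is its own token."""
    return ["sc", "config", service, "start=", "disabled"]


def disable_service(service, run=subprocess.run):
    """Run the sc command (Windows only, needs an administrator shell)."""
    return run(sc_disable_args(service), check=True)


print(" ".join(sc_disable_args("AppleTimeSrv")))
```

Joined with spaces, the list reproduces the line from the patch, including the space after `start=`.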
Comment on attachment 538988 [details] [diff] [review]
disable appletimesrv service

I landed this, and marked all of the XP machines for a re-install of the package.
Attachment #538988 - Flags: checked-in+
I went through all the Windows 7 machines (as they became idle), and disabled the AppleTimeSrv service on them, too. As far as I know, we're all done here.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering