Closed Bug 906785 Opened 11 years ago Closed 11 years ago

Ubuntu systems fail when any puppet master is down

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

Details

Attachments

(2 files)

In bug 906782, one of the puppet masters went down. This should be more or less impact-free, but unfortunately we got 150-some-odd of:

Mon Aug 19 10:47:06 -0700 2013 /Stage[packagesetup]/Packages::Setup/Exec[apt-get-update] (err): Failed to call refresh: /usr/bin/apt-get update returned 100 instead of one of [0]
Mon Aug 19 10:47:06 -0700 2013 /Stage[packagesetup]/Packages::Setup/Exec[apt-get-update] (err): /usr/bin/apt-get update returned 100 instead of one of [0]

I think this occurs because apt-get expects all of its mirrors to be up. I'm not sure *why* it would require that. Rail, any ideas here?
Ah, this was a repeat of bug 876812, in which I figured out that apt-get treats different errors differently, some fatally. That said, I can't find any accesses from 10.26.56.140 around the time in the logs in comment 0. So I'm not sure what happened here. I think we should at least see the full output when this fails.
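For the "see the full output when this fails" part, here is a minimal sketch of the idea, not the attached patch (whose contents aren't shown in this bug). It assumes a small wrapper around the command; the name `run_and_report` is hypothetical. It captures stdout and stderr together and only surfaces them when the exit status is nonzero.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd and return (exit_status, combined_output).

    The captured output is written to stderr only when the command fails,
    so a failing run shows everything the command said, not just its status.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into stdout, in order
        universal_newlines=True,   # decode bytes to str
    )
    if proc.returncode != 0:
        sys.stderr.write(proc.stdout)
    return proc.returncode, proc.stdout
```

Called as `run_and_report(["/usr/bin/apt-get", "update"])`, this would hand back status 100 along with the `W:` and `E:` lines that explain it.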
Assignee: relops → dustin
Attached patch bug906785.patch (deleted) — Splinter Review
Attachment #792460 - Flags: review?(bugspam.Callek)
Attachment #792460 - Flags: review?(bugspam.Callek) → review+
We have a nice natural experiment going on today:

W: Failed to fetch http://releng-puppet1.srv.releng.use1.mozilla.com/repos/apt/puppetlabs/dists/precise/main/binary-amd64/Packages  Unable to connect to releng-puppet1.srv.releng.use1.mozilla.com:http:
...
E: Some index files failed to download. They have been ignored, or old ones used instead.
[root@talos-linux64-ix-044.test.releng.scl3.mozilla.com ~]# echo $?
100
So apparently it's not just 404's that are treated as errors.

http://anonscm.debian.org/loggerhead/apt/apt/debian-squeeze/view/head:/apt-pkg/acquire-worker.cc#L340

340 if(LookupTag(Message,"FailReason") == "Timeout" ||
341    LookupTag(Message,"FailReason") == "TmpResolveFailure" ||
342    LookupTag(Message,"FailReason") == "ResolveFailure" ||
343    LookupTag(Message,"FailReason") == "ConnectionRefused")
344    Owner->Status = pkgAcquire::Item::StatTransientNetworkError;

so this might require a bug report to debian..
There's definitely a bug here somewhere. I stuck a random non-working IP in, and here's the log:

> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/universe Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/dependencies Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/main Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise-updates/all Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/main Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/main Translation-en
> Ign http://10.26.48.99 precise InRelease
> Err http://10.26.48.99 precise Release.gpg
>   Unable to connect to 10.26.48.99:http:
> Ign http://10.26.48.99 precise Release
> Ign http://10.26.48.99 precise/dependencies TranslationIndex
> Ign http://10.26.48.99 precise/main TranslationIndex
> Err http://10.26.48.99 precise/dependencies amd64 Packages
>   Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/main amd64 Packages
>   Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/dependencies i386 Packages
>   Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/main i386 Packages
>   Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/dependencies Translation-en
>   Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/main Translation-en
>   Unable to connect to 10.26.48.99:http:

The "Unable to connect" comes from http://anonscm.debian.org/loggerhead/apt/apt/debian-squeeze/view/head:/methods/connect.cc#L243. I tried reproducing with resolution errors, which produced:

Err http://does.not.resolve precise/dependencies amd64 Packages
  Something wicked happened resolving 'does.not.resolve:http' (-5 - No address associated with hostname)
...
W: Failed to fetch http://does.not.resolve/repos/apt/puppetlabs/dists/precise/Release.gpg  Something wicked happened resolving 'does.not.resolve:http' (-5 - No address associated with hostname)
...
E: Some index files failed to download. They have been ignored, or old ones used instead.
[root@talos-linux64-ix-044.test.releng.scl3.mozilla.com ~]# echo $?
100

and with a bogus IP for a recognized host (in which case I would expect apt to have cached information for that host):

Err http://releng-puppet2.build.scl1.mozilla.com precise/main Translation-en
  Unable to connect to releng-puppet2.build.scl1.mozilla.com:http:
...
W: Failed to fetch http://releng-puppet2.build.scl1.mozilla.com/repos/apt/xorg-edgers/dists/precise/main/i18n/Translation-en  Unable to connect to releng-puppet2.build.scl1.mozilla.com:http:
E: Some index files failed to download. They have been ignored, or old ones used instead.
[root@talos-linux64-ix-044.test.releng.scl3.mozilla.com ~]# echo $?
100

In other words, I can't replicate the 0 return status for transient errors observed in bug 876812.

Looking more deeply at sources.list, it doesn't actually have a notion of mirrors of the same repo, the way yum does. Instead, each line is treated as a distinct repo. I'm not sure that the transient-error stuff ever works.
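Since each sources.list line is its own repo and any one failure yields exit status 100, it can help to see which hosts were behind the `Err` lines. A rough sketch, assuming the output format shown in the logs above (`Err <url> <suite> ...`, optionally prefixed with `> ` quoting); the helper name `failed_hosts` is made up for illustration and is not part of apt.

```python
import re
from urllib.parse import urlsplit

# Matches "Err http://host ..." lines, tolerating leading "> " quote markers.
ERR_LINE = re.compile(r"^[>\s]*Err\s+(\S+)")


def failed_hosts(apt_output):
    """Return the sorted, de-duplicated hostnames named on Err lines."""
    hosts = set()
    for line in apt_output.splitlines():
        match = ERR_LINE.match(line)
        if match:
            host = urlsplit(match.group(1)).hostname
            if host:
                hosts.add(host)
    return sorted(hosts)
```

Fed the first log above, this would report only `10.26.48.99`, which makes it easier to tell a single dead master from a wider outage.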
I filed this: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=720345 but traffic in #debian suggests that this isn't expected to work anyway.
Someone at PyCon suggested python-apt: http://anonscm.debian.org/gitweb/?p=apt/python-apt.git;a=tree but this seems to be a wrapper around the apt library, so not necessarily that helpful.

Andrii Senkovych suggests:

---
Hi,

If I understand correctly, you name all your mirrors inside /etc/apt/sources.list on every host like:

deb http://mirror1.mozilla.org/debian/ squeeze main
deb http://mirror2.mozilla.org/debian/ squeeze main

and so on... But since some mirror can be possibly down you get error 100 when running apt-get update despite all packages are still available on other mirrors.

I'm not sure if it's possible to make on the apt (e.g. client) side, but you may be satisfied with some server-side solutions, for example:

1) use round-robin DNS and let your OS connect to the mirror currently online
2) use some kind of HTTP-redirector (for example the one used by http://http.debian.net/)

In each case you should provide only one mirror on the client and have some dedicated DNS name for that purpose.
---
I don't think using /etc/hosts will help here. Some simple test showed that libc resolver will always search host record in /etc/hosts until the first match and thus you cannot provide several IPs to one host name to use it for round-robin. You still can provide multiple records for one DNS name to make back-resolving mechanism work, but not the other way. Using a proper DNS server (even a dnsmasq) is necessary in this case

I have also checked the behaviour in case of simple round-robin. It worked with no errors several times in a row with a RR-list of 10 servers with only one host really up.
---

so, I think we'll need to set up a DNS round-robin for this :(
So here's the proposal:

- add a puppet config option for the apt repo host, defaulting to 'repos', and use that to configure sources.list.d
- set it to 'puppetagain-apt.pvt.build.mozilla.org' in moco-config.pp
- give that hostname an A record for each master; named will use prefix matching and round-robin to load balance them exactly the way we want
- write a script to run on the DM that will verify that the A records correspond exactly to the configured puppet masters, and run that from nagios

Rail, any thoughts?
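The last item of the proposal (checking that the A records match the configured masters) could look roughly like this. It is only a sketch: the function names are hypothetical, the list of configured masters is assumed to come from wherever puppet keeps it, and the only outside convention relied on is the usual Nagios plugin exit statuses (0 for OK, 2 for CRITICAL).

```python
import socket

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit statuses


def lookup_a_records(fqdn):
    """Return the IPv4 addresses the local resolver reports for fqdn."""
    return socket.gethostbyname_ex(fqdn)[2]


def compare_records(configured_masters, a_records):
    """Return (status, message) saying whether the two IP sets match exactly."""
    expected, actual = set(configured_masters), set(a_records)
    missing = sorted(expected - actual)   # masters with no A record
    extra = sorted(actual - expected)     # A records that are not masters
    if not missing and not extra:
        return OK, "OK: %d A records match the configured masters" % len(expected)
    problems = []
    if missing:
        problems.append("missing " + ", ".join(missing))
    if extra:
        problems.append("unexpected " + ", ".join(extra))
    return CRITICAL, "CRITICAL: " + "; ".join(problems)
```

A check script would call `compare_records(masters, lookup_a_records("puppetagain-apt.pvt.build.mozilla.org"))`, print the message, and exit with the status.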
Flags: needinfo?(rail)
Overall, this sounds good to me.

(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> - give that hostname an A record for each master; named will use prefix
> matching and round-robin to load balance them exactly the way we want

What happens if one of the hosts goes down? Will bind remove it from the list automatically, or will we need to do that manually?
Flags: needinfo?(rail)
In that case apt-get will automatically move on to the next IP in the list. This is actually the *only* way to get apt-get to do so.
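To make the "move on to the next IP" behaviour concrete, here is an illustration of the general pattern: try each address for one hostname in turn and use the first that accepts a connection. This is not apt's code, just a sketch of the fallback described; `connect_first_reachable` and its `connect` parameter are invented for the example (the parameter exists so the logic can be exercised without a network).

```python
import socket


def connect_first_reachable(addresses, port=80, timeout=5.0,
                            connect=socket.create_connection):
    """Try each address in order; return (address, connection) for the first success.

    Raises OSError listing every failure if no address is reachable.
    """
    errors = []
    for addr in addresses:
        try:
            return addr, connect((addr, port), timeout=timeout)
        except OSError as exc:
            # A dead host costs one failed attempt; move on to the next one.
            errors.append((addr, str(exc)))
    raise OSError("no address reachable: %r" % (errors,))
```

Note that DNS itself does nothing when a master dies: the dead A record stays in the answer, and it is the client walking the list that hides the outage.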
https://inventory.mozilla.org/en-US/core/search/?search=puppetagain-apt#q=puppetagain-apt

done via

for ip in $(invtool search -q 'releng-puppet (type=:A)' | awk '{print $6}'); do invtool A create --ip $ip --fqdn puppetagain-apt.pvt.build.mozilla.org --no-public --private; done
on a test system, reconfig'd by hand:

[root@talos-linux64-ix-002.test.releng.scl3.mozilla.com ~]# apt-get update
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Get:1 http://puppetagain-apt.pvt.build.mozilla.org precise-security Release.gpg [198 B]
Get:2 http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg [198 B]
Get:3 http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg [836 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates Release.gpg
Ign http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg
Get:4 http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg [316 B]
Get:5 http://puppetagain-apt.pvt.build.mozilla.org precise-security Release [49.6 kB]
Get:6 http://puppetagain-apt.pvt.build.mozilla.org precise Release [49.6 kB]
Get:7 http://puppetagain-apt.pvt.build.mozilla.org precise Release [8877 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates Release
Get:8 http://puppetagain-apt.pvt.build.mozilla.org precise Release [2288 B]
Get:9 http://puppetagain-apt.pvt.build.mozilla.org precise Release [11.9 kB]
Get:10 http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all amd64 Packages [2357 B]
Get:11 http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all i386 Packages [2354 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all TranslationIndex
Get:12 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [9898 B]
Get:13 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [9907 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Get:14 http://puppetagain-apt.pvt.build.mozilla.org precise-security/main amd64 Packages [228 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise Release
Ign http://puppetagain-apt.pvt.build.mozilla.org precise Release
Get:15 http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted amd64 Packages [3969 B]
Get:16 http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe amd64 Packages [68.2 kB]
Get:17 http://puppetagain-apt.pvt.build.mozilla.org precise-security/main i386 Packages [237 kB]
Get:18 http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted i386 Packages [3968 B]
Get:19 http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe i386 Packages [69.9 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/main TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe TranslationIndex
Get:20 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [1273 kB]
Get:21 http://puppetagain-apt.pvt.build.mozilla.org precise/restricted amd64 Packages [8452 B]
Get:22 http://puppetagain-apt.pvt.build.mozilla.org precise/universe amd64 Packages [4786 kB]
Get:23 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [1274 kB]
Get:24 http://puppetagain-apt.pvt.build.mozilla.org precise/restricted i386 Packages [8431 B]
Get:25 http://puppetagain-apt.pvt.build.mozilla.org precise/universe i386 Packages [4796 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/restricted TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/universe TranslationIndex
Get:26 http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies amd64 Packages [5638 B]
Get:27 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [39.5 kB]
Get:28 http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies i386 Packages [5647 B]
Get:29 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [39.5 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Get:30 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [37.1 kB]
Get:31 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [37.6 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/restricted Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/universe Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Fetched 13.1 MB in 3s (3581 kB/s)
Reading package lists... Done
W: GPG error: http://puppetagain-apt.pvt.build.mozilla.org precise Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 1054B7A24BD6EC30
W: GPG error: http://puppetagain-apt.pvt.build.mozilla.org precise Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 4F191A5A8844C542

which is *much* shorter and less bandwidth, yay

I'll come up with a patch for this next week, then.
Attached patch bug906785-dnsrr.patch (deleted) — Splinter Review
This doesn't implement the script to check that the DNS is correct, but that can come in a subsequent commit.
Attachment #796161 - Flags: review?(rail)
Attachment #796161 - Flags: review?(rail) → review+
Attachment #796161 - Flags: checked-in+
No issues here. I'd like to take down a master and see what happens. Tomorrow.
I took down a master, ran apt-get update, and it worked fine. Yay!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED