Closed
Bug 906785
Opened 11 years ago
Closed 11 years ago
Ubuntu systems fail when any puppet master is down
Categories
(Infrastructure & Operations :: RelOps: Puppet, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dustin, Assigned: dustin)
Details
Attachments (2 files)
(deleted), patch | Callek: review+
(deleted), patch | rail: review+, dustin: checked-in+
In bug 906782, one of the puppet masters went down. This should have been more or less impact-free, but unfortunately we got roughly 150 instances of:
Mon Aug 19 10:47:06 -0700 2013 /Stage[packagesetup]/Packages::Setup/Exec[apt-get-update] (err): Failed to call refresh: /usr/bin/apt-get update returned 100 instead of one of [0]
Mon Aug 19 10:47:06 -0700 2013 /Stage[packagesetup]/Packages::Setup/Exec[apt-get-update] (err): /usr/bin/apt-get update returned 100 instead of one of [0]
I think this occurs because apt-get expects all of its mirrors to be up. I'm not sure *why* it would require that.
Rail, any ideas here?
Comment 1 • 11 years ago (Assignee)
Ah, this was a repeat of bug 876812, in which I figured out that apt-get treats different errors differently, some fatally.
That said, I can't find any accesses from 10.26.56.140 around the time in the logs in comment 0. So I'm not sure what happened here.
I think we should at least see the full output when this fails.
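The Puppet log lines in comment 0 carry only the exit status, not what apt-get printed. As a minimal sketch of the idea, assuming a small Python wrapper were acceptable here (the name `run_and_report` is hypothetical and is not taken from the attached patch), the full output could be captured and logged on failure like this:

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd; on a non-zero exit, write the full combined output to stderr."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # interleave stderr with stdout
        universal_newlines=True,   # decode output as text
    )
    if proc.returncode != 0:
        # Surface everything the command printed, not just the exit status.
        sys.stderr.write(
            "%s returned %d; full output follows:\n%s"
            % (" ".join(cmd), proc.returncode, proc.stdout)
        )
    return proc.returncode, proc.stdout
```

Calling `run_and_report(["/usr/bin/apt-get", "update"])` would then return the exit status together with the captured text, so the caller can decide what to do with both.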
Updated • 11 years ago (Assignee)
Assignee: relops → dustin
Comment 2 • 11 years ago (Assignee)
Attachment #792460 -
Flags: review?(bugspam.Callek)
Updated • 11 years ago
Attachment #792460 -
Flags: review?(bugspam.Callek) → review+
Comment 3 • 11 years ago (Assignee)
We have a nice natural experiment going on today:
W: Failed to fetch http://releng-puppet1.srv.releng.use1.mozilla.com/repos/apt/puppetlabs/dists/precise/main/binary-amd64/Packages Unable to connect to releng-puppet1.srv.releng.use1.mozilla.com:http:
...
E: Some index files failed to download. They have been ignored, or old ones used instead.
[root@talos-linux64-ix-044.test.releng.scl3.mozilla.com ~]# echo $?
100
Comment 4 • 11 years ago (Assignee)
So apparently it's not just 404s that are treated as errors.
http://anonscm.debian.org/loggerhead/apt/apt/debian-squeeze/view/head:/apt-pkg/acquire-worker.cc#L340
340 if(LookupTag(Message,"FailReason") == "Timeout" ||
341 LookupTag(Message,"FailReason") == "TmpResolveFailure" ||
342 LookupTag(Message,"FailReason") == "ResolveFailure" ||
343 LookupTag(Message,"FailReason") == "ConnectionRefused")
344 Owner->Status = pkgAcquire::Item::StatTransientNetworkError;
So this might require a bug report to Debian.
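For readers who do not want to follow the link, the classification in the quoted lines can be restated as a small Python sketch. Only the four `FailReason` strings come from the quoted source; the names `TRANSIENT_FAIL_REASONS` and `is_transient` are hypothetical, and this says nothing about how apt uses the resulting status afterwards.

```python
# The four FailReason values that the quoted acquire-worker.cc lines map to
# pkgAcquire::Item::StatTransientNetworkError.
TRANSIENT_FAIL_REASONS = frozenset(
    ["Timeout", "TmpResolveFailure", "ResolveFailure", "ConnectionRefused"]
)


def is_transient(fail_reason):
    """Return True if fail_reason is one of the four values checked above."""
    return fail_reason in TRANSIENT_FAIL_REASONS
```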
Comment 5 • 11 years ago (Assignee)
There's definitely a bug here somewhere. I stuck a random non-working IP in, and here's the log:
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/universe Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/dependencies Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/main Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise-updates/all Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/main Translation-en
> Ign http://releng-puppet1.srv.releng.use1.mozilla.com precise/main Translation-en
> Ign http://10.26.48.99 precise InRelease
> Err http://10.26.48.99 precise Release.gpg
> Unable to connect to 10.26.48.99:http:
> Ign http://10.26.48.99 precise Release
> Ign http://10.26.48.99 precise/dependencies TranslationIndex
> Ign http://10.26.48.99 precise/main TranslationIndex
> Err http://10.26.48.99 precise/dependencies amd64 Packages
> Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/main amd64 Packages
> Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/dependencies i386 Packages
> Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/main i386 Packages
> Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/dependencies Translation-en
> Unable to connect to 10.26.48.99:http:
> Err http://10.26.48.99 precise/main Translation-en
> Unable to connect to 10.26.48.99:http:
The "Unable to connect" comes from http://anonscm.debian.org/loggerhead/apt/apt/debian-squeeze/view/head:/methods/connect.cc#L243.
I tried reproducing with resolution errors, which produced:
Err http://does.not.resolve precise/dependencies amd64 Packages
Something wicked happened resolving 'does.not.resolve:http' (-5 - No address associated with hostname)
...
W: Failed to fetch http://does.not.resolve/repos/apt/puppetlabs/dists/precise/Release.gpg Something wicked happened resolving 'does.not.resolve:http' (-5 - No address associated with hostname)
...
E: Some index files failed to download. They have been ignored, or old ones used instead.
[root@talos-linux64-ix-044.test.releng.scl3.mozilla.com ~]# echo $?
100
and with a bogus IP for a recognized host (in which case I would expect apt to have cached information for that host):
Err http://releng-puppet2.build.scl1.mozilla.com precise/main Translation-en
Unable to connect to releng-puppet2.build.scl1.mozilla.com:http:
...
W: Failed to fetch http://releng-puppet2.build.scl1.mozilla.com/repos/apt/xorg-edgers/dists/precise/main/i18n/Translation-en Unable to connect to releng-puppet2.build.scl1.mozilla.com:http:
E: Some index files failed to download. They have been ignored, or old ones used instead.
[root@talos-linux64-ix-044.test.releng.scl3.mozilla.com ~]# echo $?
100
In other words, I can't replicate the 0 return status for transient errors observed in bug 876812.
Looking more deeply at sources.list, it doesn't actually have a notion of mirrors of the same repo, the way yum does. Instead, each line is treated as a distinct repo.
I'm not sure that the transient-error stuff ever works.
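To make the "each line is a distinct repo" point concrete, here is a toy Python model of the behaviour observed in this comment. It is not apt's implementation: the parser ignores options such as `[arch=...]`, the function names are hypothetical, and `update_status` simply encodes the observation that one unreachable entry was enough to produce exit status 100.

```python
def parse_sources(text):
    """Return (type, uri, suite, components) for each deb/deb-src entry."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        fields = line.split()
        if fields[0] not in ("deb", "deb-src") or len(fields) < 3:
            continue
        entries.append((fields[0], fields[1], fields[2], tuple(fields[3:])))
    return entries


def update_status(entries, reachable):
    """Toy model: 100 if any entry's URI is unreachable, otherwise 0."""
    return 0 if all(uri in reachable for _, uri, _, _ in entries) else 100
```

With two entries that point at different hosts carrying the same content, the model still reports 100 as soon as one of the two URIs is missing from `reachable`, which matches what the tests above showed.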
Comment 6 • 11 years ago (Assignee)
I filed this:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=720345
but traffic in #debian suggests that this isn't expected to work anyway.
Comment 7 • 11 years ago (Assignee)
Someone at PyCon suggested python-apt:
http://anonscm.debian.org/gitweb/?p=apt/python-apt.git;a=tree
but this seems to be a wrapper around the apt library, so not necessarily that helpful.
Andrii Senkovych suggests:
---
Hi,
If I understand correctly, you name all your mirrors inside
/etc/apt/sources.list on every host like:
deb http://mirror1.mozilla.org/debian/ squeeze main
deb http://mirror2.mozilla.org/debian/ squeeze main
and so on...
But since some mirror can be possibly down you get error 100 when
running apt-get update despite all packages are still available on
other mirrors.
I'm not sure if it's possible to make on the apt (e.g. client) side,
but you may be satisfied with some server-side solutions, for example:
1) use round-robin DNS and let your OS connect to the mirror currently online
2) use some kind of HTTP-redirector (for example the one used by
http://http.debian.net/)
In each case you should provide only one mirror on the client and have
some dedicated DNS name for that purpose.
---
I don't think using /etc/hosts will help here. Some simple test showed
that libc resolver will always search host record in /etc/hosts until
the first match and thus you cannot provide several IPs to one host
name to use it for round-robin. You still can provide multiple records
for one DNS name to make back-resolving mechanism work, but not the
other way. Using a proper DNS server (even a dnsmasq) is necessary in
this case
I have also checked the behaviour in case of simple round-robin. It
worked with no errors several times in a row with a RR-list of 10
servers with only one host really up.
---
So I think we'll need to set up a DNS round-robin for this :(
Comment 8 • 11 years ago (Assignee)
So here's the proposal:
- add a puppet config option for the apt repo host, defaulting to 'repos', and use that to configure sources.list.d
- set it to 'puppetagain-apt.pvt.build.mozilla.org' in moco-config.pp
- give that hostname an A record for each master; named will use prefix matching and round-robin to load balance them exactly the way we want
- write a script to run on the DM that will verify that the A records correspond exactly to the configured puppet masters, and run that from nagios
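The verification script in the last item could look roughly like the following Python sketch. It assumes the list of configured masters is available as a collection of IPv4 addresses (where that list comes from is not specified here), and the function names are hypothetical. The exit codes follow the usual Nagios plugin convention of 0 for OK and 2 for CRITICAL.

```python
import socket

# Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def resolve_a_records(hostname):
    """Return the set of IPv4 addresses that hostname resolves to."""
    infos = socket.getaddrinfo(hostname, 80, socket.AF_INET, socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def compare(actual, expected):
    """Return (exit_code, message) for a Nagios-style check."""
    actual, expected = set(actual), set(expected)
    if actual == expected:
        return OK, "OK: %d A records match the configured masters" % len(actual)
    missing = sorted(expected - actual)  # masters without an A record
    extra = sorted(actual - expected)    # A records without a master
    return CRITICAL, "CRITICAL: missing=%s extra=%s" % (missing, extra)
```

A check would call `compare(resolve_a_records("puppetagain-apt.pvt.build.mozilla.org"), expected)`, print the message, and exit with the returned code.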
Rail, any thoughts?
Flags: needinfo?(rail)
Comment 9 • 11 years ago
Overall, this sounds good to me.
(In reply to Dustin J. Mitchell [:dustin] from comment #8)
> - give that hostname an A record for each master; named will use prefix
> matching and round-robin to load balance them exactly the way we want
What happens if one of the hosts goes down? Will bind remove it from the list automatically, or will we need to do that manually?
Flags: needinfo?(rail)
Comment 10 • 11 years ago (Assignee)
In that case apt-get will automatically move on to the next IP in the list. This is actually the *only* way to get apt-get to do so.
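The general pattern being relied on, trying each address of a multi-record hostname in turn until one accepts the connection, can be sketched in Python as below. This illustrates the idea only and is not apt's code; `connect_first_available` and its `connect` parameter are hypothetical names, with `connect` defaulting to the standard library's `socket.create_connection`.

```python
import socket


def connect_first_available(addresses, port=80, timeout=5.0,
                            connect=socket.create_connection):
    """Try each address in order; return (address, connection) on first success."""
    last_error = None
    for address in addresses:
        try:
            return address, connect((address, port), timeout)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next address
    raise OSError("no address reachable: %s" % last_error)
```

The `connect` parameter exists so the fallback logic can be exercised with a stub instead of real network connections.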
Comment 11 • 11 years ago (Assignee)
https://inventory.mozilla.org/en-US/core/search/?search=puppetagain-apt#q=puppetagain-apt
done via
for ip in $(invtool search -q 'releng-puppet (type=:A)' | awk '{print $6}'); do invtool A create --ip $ip --fqdn puppetagain-apt.pvt.build.mozilla.org --no-public --private; done
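For anyone who prefers to review the commands before running them, the loop above can be mirrored in a Python sketch that only builds the argument lists. The `invtool` arguments are copied from the one-liner; the helper names are hypothetical, and the layout of the `invtool search` output is assumed only to the extent that the address is the sixth whitespace-separated field, as the `awk` call implies.

```python
FQDN = "puppetagain-apt.pvt.build.mozilla.org"


def extract_ips(search_output, field=6):
    """Return the given field (1-based, as in awk) of each line that has it."""
    ips = []
    for line in search_output.splitlines():
        fields = line.split()
        if len(fields) >= field:
            ips.append(fields[field - 1])
    return ips


def build_create_commands(ips, fqdn=FQDN):
    """Return one `invtool A create` argument list per IP address."""
    return [
        ["invtool", "A", "create", "--ip", ip, "--fqdn", fqdn,
         "--no-public", "--private"]
        for ip in ips
    ]
```

Unlike the `awk` version, `extract_ips` skips lines with fewer than six fields rather than emitting an empty value for them.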
Comment 12 • 11 years ago (Assignee)
on a test system, reconfig'd by hand:
[root@talos-linux64-ix-002.test.releng.scl3.mozilla.com ~]# apt-get update
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Ign http://puppetagain-apt.pvt.build.mozilla.org precise InRelease
Get:1 http://puppetagain-apt.pvt.build.mozilla.org precise-security Release.gpg [198 B]
Get:2 http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg [198 B]
Get:3 http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg [836 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates Release.gpg
Ign http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg
Get:4 http://puppetagain-apt.pvt.build.mozilla.org precise Release.gpg [316 B]
Get:5 http://puppetagain-apt.pvt.build.mozilla.org precise-security Release [49.6 kB]
Get:6 http://puppetagain-apt.pvt.build.mozilla.org precise Release [49.6 kB]
Get:7 http://puppetagain-apt.pvt.build.mozilla.org precise Release [8877 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates Release
Get:8 http://puppetagain-apt.pvt.build.mozilla.org precise Release [2288 B]
Get:9 http://puppetagain-apt.pvt.build.mozilla.org precise Release [11.9 kB]
Get:10 http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all amd64 Packages [2357 B]
Get:11 http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all i386 Packages [2354 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all TranslationIndex
Get:12 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [9898 B]
Get:13 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [9907 B]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Get:14 http://puppetagain-apt.pvt.build.mozilla.org precise-security/main amd64 Packages [228 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise Release
Ign http://puppetagain-apt.pvt.build.mozilla.org precise Release
Get:15 http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted amd64 Packages [3969 B]
Get:16 http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe amd64 Packages [68.2 kB]
Get:17 http://puppetagain-apt.pvt.build.mozilla.org precise-security/main i386 Packages [237 kB]
Get:18 http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted i386 Packages [3968 B]
Get:19 http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe i386 Packages [69.9 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/main TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe TranslationIndex
Get:20 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [1273 kB]
Get:21 http://puppetagain-apt.pvt.build.mozilla.org precise/restricted amd64 Packages [8452 B]
Get:22 http://puppetagain-apt.pvt.build.mozilla.org precise/universe amd64 Packages [4786 kB]
Get:23 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [1274 kB]
Get:24 http://puppetagain-apt.pvt.build.mozilla.org precise/restricted i386 Packages [8431 B]
Get:25 http://puppetagain-apt.pvt.build.mozilla.org precise/universe i386 Packages [4796 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/restricted TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/universe TranslationIndex
Get:26 http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies amd64 Packages [5638 B]
Get:27 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [39.5 kB]
Get:28 http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies i386 Packages [5647 B]
Get:29 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [39.5 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Get:30 http://puppetagain-apt.pvt.build.mozilla.org precise/main amd64 Packages [37.1 kB]
Get:31 http://puppetagain-apt.pvt.build.mozilla.org precise/main i386 Packages [37.6 kB]
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main TranslationIndex
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-updates/all Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/restricted Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise-security/universe Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/restricted Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/universe Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/dependencies Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Ign http://puppetagain-apt.pvt.build.mozilla.org precise/main Translation-en
Fetched 13.1 MB in 3s (3581 kB/s)
Reading package lists... Done
W: GPG error: http://puppetagain-apt.pvt.build.mozilla.org precise Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 1054B7A24BD6EC30
W: GPG error: http://puppetagain-apt.pvt.build.mozilla.org precise Release: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 4F191A5A8844C542
which is *much* shorter and uses less bandwidth, yay
I'll come up with a patch for this next week, then.
Comment 13 • 11 years ago (Assignee)
This doesn't implement the script to check that the DNS is correct, but that can come in a subsequent commit.
Attachment #796161 -
Flags: review?(rail)
Updated • 11 years ago
Attachment #796161 -
Flags: review?(rail) → review+
Updated • 11 years ago (Assignee)
Attachment #796161 -
Flags: checked-in+
Comment 14 • 11 years ago (Assignee)
No issues here. I'd like to take down a master and see what happens. Tomorrow.
Comment 15 • 11 years ago (Assignee)
I took down a master, ran apt-get update, and it worked fine. Yay!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED