Closed
Bug 1430670
Opened 7 years ago
Closed 7 years ago
AWS network issues - Connection failure with the signing server, nagios socket timeouts
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dluca, Unassigned)
Details
(Whiteboard: [stockwell infra])
https://treeherder.mozilla.org/logviewer.html#?job_id=156436485&repo=mozilla-inbound#L63
[Errno -3] Cannot connect to host signing5.srv.releng.scl3.mozilla.com:9110 ssl:True [Temporary failure in name resolution]
exit code: 1
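The `[Errno -3] ... [Temporary failure in name resolution]` error above corresponds to `EAI_AGAIN` from the resolver, which is by definition a transient condition. As a minimal sketch only (the signing client's real retry behaviour is not shown in this bug), a caller could retry the lookup a few times before giving up. The function name `resolve_with_retry` and its `attempts`, `delay`, `resolver` and `sleep` parameters are hypothetical, introduced here for illustration.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=3, delay=1.0,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host:port, retrying only on temporary DNS failures.

    Hypothetical helper: `resolver` and `sleep` are injectable so the
    retry logic can be exercised without touching the network.
    """
    for attempt in range(1, attempts + 1):
        try:
            return resolver(host, port)
        except socket.gaierror as exc:
            # EAI_AGAIN is "Temporary failure in name resolution".
            # Any other resolver error is treated as permanent.
            if exc.errno != socket.EAI_AGAIN or attempt == attempts:
                raise
            # Linear backoff between attempts (an arbitrary choice).
            sleep(delay * attempt)


if __name__ == "__main__":
    # Host and port taken from the log line above.
    print(resolve_with_retry("signing5.srv.releng.scl3.mozilla.com", 9110))
```

Retrying only papers over short blips; with sustained packet loss on the path to the DNS servers the final attempt still raises.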
Comment 1•7 years ago
We're also seeing nagios alerts like
Mon 22:22:00 UTC [7728] [] signing-linux-8.srv.releng.usw2.mozilla.com:load is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds.
Mon 22:28:23 UTC [7752] [] signingworker-3.srv.releng.use1.mozilla.com:load is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds.
across several classes of systems in AWS, in both usw2 and use1, while the nagios server is in scl3. The alerts seem to be quite bursty.
Running mtr to look for packet loss:
1, scl3 -> use1
buildbot-master81.bb.releng.scl3.mozilla.com (0.0.0.0)                Mon Jan 15 14:51:21 2018
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                        Packets               Pings
 Host                                              Loss%   Snt  Drop    Avg  Best  Wrst StDev
 1. fw1.bb.releng.scl3.mozilla.net                  0.0%  1363     0    0.5   0.5   2.5   0.2
 2. ???
 3. beetmoverworker-1.srv.releng.use1.mozilla.com  40.6%  1362   553   73.4  72.6  78.3   0.5
------
2, scl3 -> usw2
buildbot-master81.bb.releng.scl3.mozilla.com (0.0.0.0)                Mon Jan 15 14:51:49 2018
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                        Packets               Pings
 Host                                              Loss%   Snt  Drop    Avg  Best  Wrst StDev
 1. fw1.bb.releng.scl3.mozilla.net                  0.0%  1367     0    0.5   0.5   2.8   0.2
 2. ???
 3. beetmoverworker-4.srv.releng.usw2.mozilla.com  27.5%  1366   376   24.7  23.5  42.7   1.1
------
Meanwhile use1 -> use1 and usw2 -> usw2 have zero packet loss.
Summary: Connection failure with the signing server → AWS network issues - Connection failure with the signing server, nagios socket timeouts
Comment 2•7 years ago
Also ran some traceroutes from scl3 to the AWS endpoints for the IPsec tunnels. There seem to be issues within the AWS networks, e.g.:
buildbot-master81.bb.releng.scl3.mozilla.com (0.0.0.0)  Mon Jan 15 15:15:21 2018
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                          Packets               Pings
 Host                                Loss%   Snt  Drop    Avg  Best  Wrst StDev
 1. fw1.bb.releng.scl3.mozilla.net    0.0%   549     0    0.7   0.5   2.9   0.3
 2. [redacted]                        0.0%   549     0    2.1   1.6   6.5   0.5
 3. [redacted]                        0.0%   549     0    1.0   0.8   2.8   0.3
 4. [redacted]                        0.0%   549     0    3.8   1.0  48.9   8.2
 5. [redacted]                        0.0%   549     0    1.6   1.4   8.5   0.6
 6. xe-3-1-0.mpr2.pao1.us.above.net   0.0%   549     0    2.2   1.4  45.1   3.3
 7. ae7.cr2.sjc2.us.zip.zayo.com      0.0%   549     0    3.1   1.9  35.0   3.5
 8. ae16.mpr4.sjc7.us.zip.zayo.com    0.0%   549     0    2.9   2.0  75.9   4.6
 9. 52.95.217.150                     0.0%   549     0    2.4   2.1  10.5   0.8
10. 54.240.242.32                     0.0%   549     0   24.7  23.5  35.4   1.4
11. 54.240.242.35                     0.0%   549     0   26.9  24.4 134.0   9.2
12. ???
13. 54.239.41.228                     0.9%   549     5   25.5  23.6  73.6   3.6
    52.93.128.36
    52.93.128.34
    52.93.128.42
    54.239.43.40
14. 54.239.46.104                    11.5%   549    63   25.5  23.6  51.5   2.9
    54.239.45.127
    54.239.46.102
    54.239.43.130
15. 52.93.12.114                     88.5%   549   485   40.6  26.1  77.2  16.3
16. 52.93.12.137                      0.0%   549     0   33.5  23.8  70.8  11.1
    52.93.13.38
    52.93.12.112
    52.93.12.50
    52.93.13.10
17. 52.93.15.217                      0.0%   548     0   25.1  23.2  57.3   2.6
    52.93.13.61
    52.93.12.131
    52.93.12.65
    52.93.13.19
18. 205.251.233.122                   0.0%   548     0   24.6  23.8  46.1   1.2
    52.93.15.217
    205.251.232.255
    205.251.232.167
I don't see all that packet loss to use1, but we still get nagios alerts. :/
Comment 4•7 years ago
This doesn't seem to be resolving itself; we should look at opening a ticket with AWS.
Comment 5•7 years ago
Some new tunnels were added in bug 1415678 an hour or so before this kicked off, so I've asked there whether they could be connected with the issue here.
Comment 6•7 years ago
Belatedly remembered that loss on only a few intermediate hops of an mtr report (rather than from hop N through to the last hop) only indicates rate limiting of ICMP packets at those hops. Since I haven't been seeing packet loss on the IPsec endpoints, the transit across the internet seems OK. This means we should look even more closely at the system we changed at about the same time (bug 1415678).
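The rule of thumb above can be sketched in a few lines of Python: a hop's loss only counts as real if every later responding hop also shows loss; otherwise it is most likely ICMP rate limiting on that router. This is an illustrative sketch, not a tool used in this bug. The function name `classify_hops` and the 1% `threshold` are assumptions, and the hop data is copied by hand from the mtr reports in comments 1 and 2.

```python
def classify_hops(hops, threshold=1.0):
    """Label each hop of an mtr report.

    hops: list of (host, loss_percent) in path order; loss_percent is
    None for hops that never replied ("???").
    threshold: loss percentage below which a hop is treated as clean
    (1.0 is an arbitrary assumption, not an mtr default).
    """
    results = []
    for i, (host, loss) in enumerate(hops):
        if loss is None:
            results.append((host, "no reply"))
        elif loss < threshold:
            results.append((host, "ok"))
        else:
            # Loss is only meaningful if it carries through to the
            # destination, i.e. all later responding hops show it too.
            later = [l for _, l in hops[i + 1:] if l is not None]
            if all(l >= threshold for l in later):
                results.append((host, "loss persists to destination"))
            else:
                results.append((host, "likely ICMP rate limiting"))
    return results


# Hops 13-18 of the traceroute to the IPsec endpoint (comment 2).
COMMENT_2_TAIL = [
    ("54.239.41.228", 0.9),
    ("54.239.46.104", 11.5),
    ("52.93.12.114", 88.5),
    ("52.93.12.137", 0.0),
    ("52.93.15.217", 0.0),
    ("205.251.233.122", 0.0),
]

# scl3 -> use1 report through the tunnel (comment 1).
COMMENT_1_USE1 = [
    ("fw1.bb.releng.scl3.mozilla.net", 0.0),
    ("???", None),
    ("beetmoverworker-1.srv.releng.use1.mozilla.com", 40.6),
]

if __name__ == "__main__":
    for report in (COMMENT_2_TAIL, COMMENT_1_USE1):
        for host, verdict in classify_hops(report):
            print(f"{host:50} {verdict}")
        print()
```

Applied to the data above, the 11.5% and 88.5% hops inside AWS are flagged as probable rate limiting because the final hop shows 0.0% loss, while the 40.6% loss to beetmoverworker-1 is on the last hop and so counts as real loss through the tunnel.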
Comment 7•7 years ago
09:57:37 < fubar> do we have any net.ops folks awake atm? bug 1430670 is causing us some
indigestion
09:58:06 < fubar> it's not clear if that's related to some recent changes or not
10:01:26 <@cshields> fubar: justdave might be up but it looks like you probably want johnb as
he created the mdc2 tunnels.
11:56:47 < johnb> fubar: still seeing issues?
11:57:05 < fubar> johnb: I believe so, based on chatter in #moc
12:01:37 < johnb> looked like MDC2 might have been advertising too many routes so that *might*
have been a problem... which I fixed w/in the last 15 min
Puppet has stopped throwing its regular errors, which seemed to be related.
Comment 8•7 years ago
That has resolved the issue, thanks.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
Whiteboard: [stockwell disable-recommended] → [stockwell infra]