Closed Bug 1430670 Opened 7 years ago Closed 7 years ago

AWS network issues - Connection failure with the signing server, nagios socket timeouts

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dluca, Unassigned)

Details

(Whiteboard: [stockwell infra])

https://treeherder.mozilla.org/logviewer.html#?job_id=156436485&repo=mozilla-inbound#L63

[Errno -3] Cannot connect to host signing5.srv.releng.scl3.mozilla.com:9110 ssl:True [Temporary failure in name resolution]
exit code: 1
We're also seeing nagios alerts like

Mon 22:22:00 UTC [7728] [] signing-linux-8.srv.releng.usw2.mozilla.com:load is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds.
Mon 22:28:23 UTC [7752] [] signingworker-3.srv.releng.use1.mozilla.com:load is CRITICAL: CHECK_NRPE: Socket timeout after 15 seconds.

across several classes of systems in AWS, both usw2 and use1 while the nagios server is in scl3. Seem to be quite bursty.

Running mtr to look for packet loss:

1, scl3 -> use1

buildbot-master81.bb.releng.scl3.mozilla.com (0.0.0.0)          Mon Jan 15 14:51:21 2018
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                        Packets               Pings
 Host                                              Loss%   Snt  Drop   Avg  Best  Wrst StDev
 1. fw1.bb.releng.scl3.mozilla.net                  0.0%  1363     0   0.5   0.5   2.5   0.2
 2. ???
 3. beetmoverworker-1.srv.releng.use1.mozilla.com  40.6%  1362   553  73.4  72.6  78.3   0.5

------

2, scl3 -> usw2

buildbot-master81.bb.releng.scl3.mozilla.com (0.0.0.0)          Mon Jan 15 14:51:49 2018
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                        Packets               Pings
 Host                                              Loss%   Snt  Drop   Avg  Best  Wrst StDev
 1. fw1.bb.releng.scl3.mozilla.net                  0.0%  1367     0   0.5   0.5   2.8   0.2
 2. ???
 3. beetmoverworker-4.srv.releng.usw2.mozilla.com  27.5%  1366   376  24.7  23.5  42.7   1.1

------

Meanwhile use1 -> use1 and usw2 -> usw2 have zero packet loss.
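As a sanity check on the mtr figures above, the Loss% column should simply be Drop divided by Snt. A minimal sketch, assuming that relationship holds for these reports; the function name `loss_percent` is hypothetical and only the Snt/Drop values come from this bug:

```python
def loss_percent(sent: int, dropped: int) -> float:
    """Return packet loss as a percentage, rounded to one decimal like mtr's Loss% column."""
    if sent <= 0:
        raise ValueError("sent must be a positive packet count")
    return round(100.0 * dropped / sent, 1)


# Snt and Drop for the final hop of each report in this comment.
use1_loss = loss_percent(sent=1362, dropped=553)  # scl3 -> use1
usw2_loss = loss_percent(sent=1366, dropped=376)  # scl3 -> usw2

print(f"scl3 -> use1: {use1_loss}%")
print(f"scl3 -> usw2: {usw2_loss}%")
```

Both values agree with the reported 40.6% and 27.5%, so the loss figures are consistent with the raw packet counts.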
Summary: Connection failure with the signing server → AWS network issues - Connection failure with the signing server, nagios socket timeouts
Also ran some traceroutes from scl3 to the AWS endpoints for the IPsec tunnels. There seem to be issues within the AWS networks, eg:

buildbot-master81.bb.releng.scl3.mozilla.com (0.0.0.0)          Mon Jan 15 15:15:21 2018
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                            Packets               Pings
 Host                                  Loss%   Snt  Drop   Avg  Best  Wrst StDev
 1. fw1.bb.releng.scl3.mozilla.net      0.0%   549     0   0.7   0.5   2.9   0.3
 2. [redacted]                          0.0%   549     0   2.1   1.6   6.5   0.5
 3. [redacted]                          0.0%   549     0   1.0   0.8   2.8   0.3
 4. [redacted]                          0.0%   549     0   3.8   1.0  48.9   8.2
 5. [redacted]                          0.0%   549     0   1.6   1.4   8.5   0.6
 6. xe-3-1-0.mpr2.pao1.us.above.net     0.0%   549     0   2.2   1.4  45.1   3.3
 7. ae7.cr2.sjc2.us.zip.zayo.com        0.0%   549     0   3.1   1.9  35.0   3.5
 8. ae16.mpr4.sjc7.us.zip.zayo.com      0.0%   549     0   2.9   2.0  75.9   4.6
 9. 52.95.217.150                       0.0%   549     0   2.4   2.1  10.5   0.8
10. 54.240.242.32                       0.0%   549     0  24.7  23.5  35.4   1.4
11. 54.240.242.35                       0.0%   549     0  26.9  24.4 134.0   9.2
12. ???
13. 54.239.41.228                       0.9%   549     5  25.5  23.6  73.6   3.6
    52.93.128.36
    52.93.128.34
    52.93.128.42
    54.239.43.40
14. 54.239.46.104                      11.5%   549    63  25.5  23.6  51.5   2.9
    54.239.45.127
    54.239.46.102
    54.239.43.130
15. 52.93.12.114                       88.5%   549   485  40.6  26.1  77.2  16.3
16. 52.93.12.137                        0.0%   549     0  33.5  23.8  70.8  11.1
    52.93.13.38
    52.93.12.112
    52.93.12.50
    52.93.13.10
17. 52.93.15.217                        0.0%   548     0  25.1  23.2  57.3   2.6
    52.93.13.61
    52.93.12.131
    52.93.12.65
    52.93.13.19
18. 205.251.233.122                     0.0%   548     0  24.6  23.8  46.1   1.2
    52.93.15.217
    205.251.232.255
    205.251.232.167

Don't see all that packet loss to use1 but still get nagios alerts though. :/
This doesn't seem to be resolving itself; we should look at opening a ticket with AWS.
Some new tunnels were added in bug 1415678 an hour or so before this kicked off, so I've asked there whether there could be any connection with the issue here.
Belatedly remembered that loss on only a few intermediate hops of an mtr report (rather than from hop N all the way through to the last hop) only indicates rate limiting of ICMP packets. Since I haven't been seeing packet loss on the IPsec endpoints, the transit across the internet seems OK. This means we should look even more closely at the system we changed at about the same time (bug 1415678).
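The rule of thumb above can be sketched in a few lines: loss that appears at a hop and then persists on every later hop through to the destination is probably real, while loss confined to intermediate hops is probably routers rate limiting their ICMP replies. This is only an illustration; the function name `classify_loss` and the 1% threshold are assumptions, and the Loss% values are copied from the tunnel endpoint report in this bug.

```python
from typing import Dict, List, Optional, Tuple

# (hop number, Loss%) pairs; None marks a hop that never replied ("???").
Hop = Tuple[int, Optional[float]]


def classify_loss(hops: List[Hop], threshold: float = 1.0) -> Dict[str, List[int]]:
    """Split lossy hops into probable ICMP rate limiting and probable real loss."""
    answered = [(n, loss) for n, loss in hops if loss is not None]
    # Walk backwards from the destination: real loss forms an unbroken
    # run of lossy hops that ends at the final hop.
    real_loss: List[int] = []
    for n, loss in reversed(answered):
        if loss < threshold:
            break
        real_loss.append(n)
    real_loss.reverse()
    # Any other lossy hop is followed by a clean hop, so the loss did not
    # carry through and is most likely ICMP rate limiting at that router.
    rate_limited = [n for n, loss in answered
                    if loss >= threshold and n not in real_loss]
    return {"rate_limited": rate_limited, "real_loss": real_loss}


# Loss% column of the scl3 -> AWS tunnel endpoint report (hop 12 never replied).
tunnel_report: List[Hop] = [
    (1, 0.0), (2, 0.0), (3, 0.0), (4, 0.0), (5, 0.0), (6, 0.0),
    (7, 0.0), (8, 0.0), (9, 0.0), (10, 0.0), (11, 0.0), (12, None),
    (13, 0.9), (14, 11.5), (15, 88.5), (16, 0.0), (17, 0.0), (18, 0.0),
]
# Loss% column of the scl3 -> use1 report through the tunnel.
use1_report: List[Hop] = [(1, 0.0), (2, None), (3, 40.6)]

print(classify_loss(tunnel_report))
print(classify_loss(use1_report))
```

On these inputs the tunnel endpoint report shows only rate-limited hops (14 and 15) and no real loss, while the scl3 -> use1 report shows real loss at its final hop, which matches the conclusion that the internet transit is fine and the problem lies with the tunnels.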
09:57:37 < fubar> do we have any net.ops folks awake atm? bug 1430670 is causing us some indigestion
09:58:06 < fubar> it's not clear if that's related to some recent changes or not
10:01:26 <@cshields> fubar: justdave might be up but it looks like you probably want johnb as he created the mdc2 tunnels.
11:56:47 < johnb> fubar: still seeing issues?
11:57:05 < fubar> johnb: I believe so, based on chatter in #moc
12:01:37 < johnb> looked like MDC2 might have been advertising too many routes so that *might* have been a problem... which I fixed w/in the last 15 min

Puppet has stopped throwing regular errors which seemed to be related.
That has resolved the issue, thanks.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell disable-recommended] → [stockwell infra]