Closed
Bug 1491948
Opened 6 years ago
Closed 6 years ago
connections between some RelEng machines in AWS and some in mdc1/mdc2 are much slower since Friday
Categories: Infrastructure & Operations :: SRE (task)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: bhearsum; Assigned: dividehex
Attachments: 1 file (text/x-github-pull-request, deleted)
This appears to have begun between 9:30am and 11:30am EDT on Friday, September 14th. The most obvious symptom is connections between depsigning workers (RelEng use1/usw2) and signing servers (RelEng mdc1/mdc2) taking a lot longer than before.
Requests used to take less than a second, eg:
2018-09-14 14:44:13,606 - signingscript.utils - INFO - 2018-09-14 14:44:13,606 - Starting new HTTPS connection (1): signing7.srv.releng.mdc1.mozilla.com:9110
2018-09-14 14:44:13,995 - signingscript.utils - INFO - 2018-09-14 14:44:13,995 - https://signing7.srv.releng.mdc1.mozilla.com:9110 "GET /sign/sha2signcode/bc0f99f65d37709f2b4475af58c63aa0b9474260 HTTP/1.1" 404 None
And now take multiple seconds:
2018-09-14 21:04:18,862 - signingscript.utils - INFO - 2018-09-14 21:04:18,861 - Starting new HTTPS connection (1): signing11.srv.releng.mdc2.mozilla.com:9110
2018-09-14 21:04:23,929 - signingscript.utils - INFO - 2018-09-14 21:04:23,928 - https://signing11.srv.releng.mdc2.mozilla.com:9110 "GET /sign/sha2signcode/5127b34c5cea78b2d28d0f210730cfc6c9a5789a HTTP/1.1" 404 None
This is mtr output from depsigning-worker14.srv.releng.usw2.mozilla.com -> signing11.srv.releng.mdc2.mozilla.com:
Host                                     Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. 169.254.249.69                        0.0%   807    1.2    2.8    0.3   41.6    6.5
 2. ???
 3. 54.239.51.210                         0.0%   806    0.8    1.5    0.6  129.2    5.7
 4. 52.93.15.222                          0.0%   806    1.5    1.9    1.3   76.0    3.4
    52.93.13.74
 5. 52.93.15.219                          0.0%   806    1.6    2.1    1.3   84.6    4.7
    54.239.48.179
 6. 169.254.249.13                        0.0%   806    1.1    1.9    1.1   74.7    6.0
 7. 169.254.249.14                        0.0%   806   84.7   84.7   83.3   86.4    0.2
 8. signing11.srv.releng.mdc2.mozilla.com 0.0%   806   87.6   86.2   83.5   88.7    1.4
This has resulted in Taskcluster jobs taking ~10min instead of ~3min, which is causing backlogs. It is not closing the trees, so I'm filing this as critical instead of blocker.
Comment 1•6 years ago
Once we shut off the scl3 vpns, it looks like we've lost all internet connectivity in the releng use1/usw2 regions... essentially, `mtr github.com` hangs.
Comment 2•6 years ago
Trees appear to be closed for this now?
Comment 4•6 years ago
Just FYI, the timing of when this started coincides with the area surrounding MDC2 and AWS us-east being hit by a hurricane... may or may not be related, but just putting that out there.
Reporter
Comment 5•6 years ago
I realize this is still important, but I want to note that jobs are still passing - we're just backlogged because things are slow. We had a brief period where jobs were failing (when we tried a fix that didn't work), but that is long over.
It's not my decision whether or not to close the trees, but I'll note that we've been in this state since Friday with trees open.
Comment 6•6 years ago
For timing/correlation/history, 2018-09-13 was the day we powered off the majority of hosts in scl3. We left up most infra items, for IT and releng.
After caucusing at a standup on 2018-09-14, we agreed to power down the remainder of the releng.scl3 infra. It was at that point that admin1[ab].private.releng.scl3 and ns[12].private.releng.scl3 went down, ~1612 UTC, which would line up with the impact times.
Comment 7•6 years ago
Trees re-opened at 2018-09-17T23:09:05+00:00. We have a workaround in place to improve signing speed by gcox turning ns[12].private.releng.scl3 back on, and releng reverting the DHCP options change reported in bug 1491497.
With SCL3 networking not lasting much longer, the search for a proper fix is on. dividehex and gcox have been using tcpdump to check network traffic.
Severity: blocker → critical
Comment 8•6 years ago
We can potentially solve this by:
- pointing at scl3 dns servers, without any of the below; this is not a long-term option
- turning off ipv6 on the scriptworker hosts, even if we're pointing at mdc dns. Append to /etc/sysctl.conf:
      net.ipv6.conf.all.disable_ipv6 = 1
      net.ipv6.conf.default.disable_ipv6 = 1
      net.ipv6.conf.lo.disable_ipv6 = 1
  then reload with `sysctl -p`. We'd need to add this to puppet if we want it to persist. This is not ideal; ipv6 is the future.
- enabling single-request in /etc/resolv.conf, even if we're pointing at mdc dns (see the sketch after this list):
      options single-request
  (we'd need to add this via the aws dhcp config somehow?)
- It's possible we could see an improvement by going to CentOS > 6.5; infra is using 7 and isn't seeing this issue. Untested.
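For concreteness, a minimal sketch of what the resolv.conf option would look like on a scriptworker, assuming we manage resolv.conf directly rather than via DHCP; the nameserver addresses are the mdc resolvers referenced in comment 9, and the search domain is only illustrative:
    # /etc/resolv.conf (sketch only)
    search srv.releng.use1.mozilla.com
    nameserver 10.48.75.120
    nameserver 10.50.75.120
    options single-request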
Comment 9•6 years ago
tl;dr
Adding `options single-request-reopen` to resolv.conf seems to fix the issue.
Doing a dig @10.48.75.120 and @10.50.75.120 returns responses immediately.
Longer explanation
Best guess is that the difference in firewalls between SCL3 and MDC1/MDC2 is the culprit. By default the resolver sends the A and AAAA queries in parallel over the same socket; the A response comes back immediately, but the AAAA reply never arrives, so the resolver sits in its ~5-second timeout, hence the slowdown.
The firewalls in MDC1/MDC2, being Panorama, are much more application-specific.
From man resolv.conf:
    single-request-reopen (since glibc 2.9)
        The resolver uses the same socket for the A and AAAA requests. Some hardware
        mistakenly only sends back one reply. When that happens the client system will
        sit and wait for the second reply. Turning this option on changes this behavior
        so that if two requests from the same port are not handled correctly it will
        close the socket and open a new one before sending the second request.
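As a quick check, one can time a lookup through the glibc stub resolver (dig talks to the nameserver directly, which is why the queries above return immediately even without the option) before and after adding it. A sketch, using one of the signing hosts from this bug:
    # Without the option: the A reply arrives but the AAAA reply is dropped,
    # so the resolver waits out its ~5s timeout before returning.
    time getent hosts signing11.srv.releng.mdc2.mozilla.com

    # Add the workaround and retry; the lookup should now return immediately.
    # (dhclient may rewrite resolv.conf later, so this is only for testing.)
    echo 'options single-request-reopen' >> /etc/resolv.conf
    time getent hosts signing11.srv.releng.mdc2.mozilla.com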
Comment 10•6 years ago
I remember having to do that options single-request thing on the office admin hosts when we first got IPv6 in the offices.
Comment 11•6 years ago
Digging in further, it looks like this could be a network-layer issue inside the VPC.
We've engaged jbircher for additional troubleshooting.
Assignee
Comment 12•6 years ago
Assignee: network-operations → jwatkins
Comment 13•6 years ago
Worked further with :johnb; there were multiple paths in/out of the VPC.
He downed the 2nd tunnel from MDC2 to the VPC and things are working now.
:bhearsum,
Can you please test further?
Flags: needinfo?(bhearsum)
Comment 14•6 years ago
on signing-linux-1.srv.releng.use1.mozilla.com, I:
- kept ipv6 enabled
- pointed resolv.conf at mdc*
- removed the `options` line from the resolv.conf
This resulted in slow wget. Adding the `options` line back in sped up wget.
That tells me that downing the 2nd tunnel didn't resolve the issue. It may be part of the solution, but it's not the whole thing.
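(Roughly the shape of that check, for reference; the URL below is illustrative rather than the exact one used:)
    # With the `options` line removed from resolv.conf, each wget stalls ~5s in
    # name resolution; with it added back, the transfer starts immediately.
    time wget -q -O /dev/null https://ftp.mozilla.org/pub/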
Flags: needinfo?(bhearsum)
Comment 15•6 years ago
(In reply to Aki Sasaki [:aki] from comment #14)
> on signing-linux-1.srv.releng.use1.mozilla.com, I:
>
> - kept ipv6 enabled
> - pointed resolv.conf at mdc*
> - removed the `options` line from the resolv.conf
>
> This resulted in slow wget. Adding the `options` line back in sped up wget.
> That tells me that downing the 2nd tunnel didn't resolve the issue. It may
> be part of the solution, but it's not the whole thing.
Per :dividehex, we need the 2nd tunnel to avoid SPOF.
Comment 16•6 years ago
We've rolled out the `single-request-reopen` fix, pointed DNS back at mdc*, and rebooted all the scriptworkers.
This appears to have resolved our network slowdowns.
If we find a network-level fix, we can test by removing the single-request-reopen line from /etc/resolv.conf.
Comment 17•6 years ago
(In reply to Nick Thomas [:nthomas] (UTC+12) from comment #7)
> Trees re-opened at 2018-09-17T23:09:05+00:00. We have a workaround in place
> to improve signing speed by gcox turning ns[12].private.releng.scl3 back on,
> and releng reverting the DHCP options change reported in bug 1491497.
nameservers for scl3 powered back down as of 2018-09-18T21:27:00+00:00, r=aki
Comment 18•6 years ago
Releng's EC2 routing tables have all been flipped, modulo the unused subnets with names like "relops-vpc" or "upload-nat".
We should be able to turn off the SCL3 VPNs.
Updated•6 years ago
Comment 19•6 years ago
I think we can resolve this bug. (Alternatively, we can wait until we ship releases tomorrow with the current network settings, but I'm not too worried.)
Updated•6 years ago
Component: NetOps → Infrastructure: AWS
QA Contact: jbircher → cshields
Assignee
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED