Closed
Bug 1445580
Opened 7 years ago
Closed 7 years ago
issue on OSX machines - possible network connectivity issue
Categories
(Infrastructure & Operations :: RelOps: General, task)
Infrastructure & Operations
RelOps: General
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nli, Unassigned)
References
(Blocks 1 open bug)
Details
(Whiteboard: [stockwell infra])
Attachments
(1 file)
(deleted),
text/plain
This was brought to us via IRC. There was a discussion on #taskcluster.
For scrollback: https://mozilla.logbot.info/taskcluster/20180314
07:29:35 <pmoore> morning all
08:01:09 <dluca|sheriffduty> !t-rex: Hello, I'm seeing failure logs on OS X finishing showing Unnamed step, a lot of them, on both inbound and autoland. Can it be machine related?
08:09:42 <pmoore> dluca|sheriffduty: is the workerId always the same?
08:11:23 <dluca|sheriffduty> pmoore: Nope, IDs are different
08:12:19 <dluca|sheriffduty> pmoore: It is the same machine type / test type: gecko-t-osx-1010
08:13:33 <pmoore> dluca|sheriffduty: this worker type is unfortunately on a slightly older version of the worker that makes it difficult for treeherder to parse the error messages, but looking at one of them i see
08:13:36 <pmoore> Aborting task - max run time exceeded!
08:13:51 <pmoore> i'll see if this is a common pattern....
08:14:45 <dluca|sheriffduty> I can see the same thing
08:15:10 <pmoore> this looks like some network slowness: https://treeherder.mozilla.org/logviewer.html#?job_id=167866941&repo=mozilla-inbound&lineNumber=413-427
08:15:41 <pmoore> i suspect problems with network connectivity - these are macs running in our data center ....
08:16:02 <pmoore> grenade: ^
08:16:30 <pmoore> dluca|sheriffduty: i'm not sure who can support these types of issues at this time
08:16:43 <pmoore> might be worth raising in #moc ?
08:17:15 <dluca|sheriffduty> pmoore: Ok, thank you for looking into it!
08:17:24 <pmoore> yw :)
08:19:42 <pmoore> dluca|sheriffduty: it might be worth retrying some of the failed jobs, in case the network slowness was intermittent
08:20:20 <dluca|sheriffduty> pmoore: Was going to ask about that
https://treeherder.mozilla.org/logviewer.html#?job_id=167866941&repo=mozilla-inbound&lineNumber=413-427
From the log, it says the task was aborted because it hit the max run time.
Frankly, I'm not able to pinpoint the cause myself.
Relops,
Could you please take a look at this?
Thank you very much.
Comment 1•7 years ago
I've attached an MTR generated from nagios1.private.releng.mdc1 to tools.taskcluster.net
It looks like the latency increase is due to routing through Japan and Taiwan. Is this intended?
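The attached report itself is deleted; for reference, an MTR of that kind can be produced with mtr in report mode. This is only a sketch; the exact options used for the attachment aren't recorded in the bug:

# 10 probe cycles, wide report output, against the same destination as the attachment
mtr --report --report-wide --report-cycles 10 tools.taskcluster.net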
Updated•7 years ago
Flags: needinfo?(klibby)
Comment 2•7 years ago
Hey Chris, from Kendall's irc nick it looks like he might be on PTO - do you know anything about this?
Flags: needinfo?(catlee)
Comment 4•7 years ago
So I remember a bug from late last year (bug 1413585) about how GeoIP was somehow wrong and impacted routing from SCL3.
Flags: needinfo?(catlee)
Comment 5•7 years ago
Re-directing the NI to Jake who's filling in for :fubar this week.
Flags: needinfo?(klibby) → needinfo?(jwatkins)
Comment 6•7 years ago
A few observations here:
1. The host this task ran on (t-yosemite-r7-0214) is now located in MDC2 (it was on a long-haul moving truck just last week, going from the west coast to the east coast) and really shouldn't be running in production at the moment. The OS X hosts that are there still need to be reimaged. Needless to say, we need to disable the worker on these hosts or quarantine them until MDC2 is 'production' ready.
2. The slowdown seems to be coming from pypi. From the log, you can see pip timing out and retrying downloads, which seems to have added enough time for the task to exceed its max runtime. pypi is hosted in SCL3 on the releng web cluster, so it might be worth having Webops take a look at it. If the cluster looks OK, it might have been network connectivity issues between SCL3 and MDC2.
3. I'm having trouble replicating the issue, which makes me wonder if it was a transient problem that has since cleared itself up. Are we still seeing this?
Flags: needinfo?(jwatkins)
Comment 7•7 years ago
This was still an issue an hour ago: https://treeherder.mozilla.org/logviewer.html#?job_id=167992179&repo=mozilla-inbound&lineNumber=397-399
Comment 8•7 years ago
The releng web cluster servers that serve pypi are lightly loaded and the site seems responsive from mdc2 to me using curl. I checked the Apache logs and don't see many errors or anything about lost connections, etc. I'm running a curl loop from a server in mdc2 to look for connection errors.
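The loop itself isn't pasted here; below is a minimal sketch of that kind of probe. The URL and the 60-second interval are assumptions, not the actual values used:

#!/bin/bash
# Probe the pypi endpoint once a minute and log a timestamped OK/FAIL line
# so intermittent connect errors stand out.
url=http://pypi.pub.build.mozilla.org/pub
while true; do
  if out=$(curl -sS -o /dev/null -w '%{http_code} in %{time_total}s' --connect-timeout 30 "$url" 2>&1); then
    echo "$(date -u +%FT%TZ) OK   $out"
  else
    echo "$(date -u +%FT%TZ) FAIL $out"
  fi
  sleep 60
done >> pypi-probe.log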
Comment 9•7 years ago
In the 17 hours since this was filed, according to Neglected Oranges, this has had 180 occurrences, and I can confirm that it is still occurring: https://treeherder.mozilla.org/logviewer.html#?job_id=168107489&repo=mozilla-inbound&lineNumber=697-699
Reporter
Updated•7 years ago
Group: mozilla-employee-confidential
Comment 10•7 years ago
Could this be related to https://github.com/mozilla/DeepSpeech/issues/1289 ?
Flags: needinfo?(lissyx+mozillians)
Updated•7 years ago
Whiteboard: [stockwell infra]
Comment 11•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #10)
> Could this be related to https://github.com/mozilla/DeepSpeech/issues/1289 ?
Proper answer: nothing to do with that.
Flags: needinfo?(lissyx+mozillians)
Comment 12•7 years ago
I was able to manually reproduce this using pip on an MDC2 host but not on an MDC1 host. This problem seems unique to MDC2. I've also opened a bug (bug 1446176) with Netops to help troubleshoot this issue.
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# pip -V
pip 8.1.2 from /private/var/root/testing/lib/python2.7/site-packages (python 2.7)
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# pip install --timeout 120 --no-index --find-links http://pypi.pvt.build.mozilla.org/pub --find-links http://pypi.pub.build.mozilla.org/pub --trusted-host pypi.pub.build.mozilla.org --trusted-host pypi.pvt.build.mozilla.org psutil>=3.1.1
Retrying (Retry(total=4, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fca950>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=3, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcaad0>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=2, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcac50>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=1, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcadd0>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Retrying (Retry(total=0, connect=None, read=None, redirect=None)) after connection broken by 'NewConnectionError('<pip._vendor.requests.packages.urllib3.connection.HTTPConnection object at 0x105fcaf50>: Failed to establish a new connection: [Errno 60] Operation timed out',)': /pub
Comment 13•7 years ago
Can you reproduce it using curl to just fetch a package from pypi? Might be a slightly simpler test case if that's available.
Comment 14•7 years ago
MDC2 gecko-t-osx-1010 workers need to stop taking jobs until MDC2 is 'production ready'
See: https://bugzilla.mozilla.org/show_bug.cgi?id=1445899#c2
Comment hidden (Intermittent Failures Robot)
Comment 16•7 years ago
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pub.build.mozilla.org/pub
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
(testing) [root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:01:14 --:--:-- 0curl: (7) Failed to connect to pypi.pvt.build.mozilla.org port 80: Operation timed out
The connection to pub seems to be OK when using curl, but the connection to pvt times out. Maybe we need to add the MDC2 CIDR to a whitelist for pvt?
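For a bit more detail than the progress meter above, verbose curl output shows whether the connection stalls at the TCP handshake (typical of a filter silently dropping packets) or is actively refused (which would instead point at the service or port being down). A quick sketch:

# -v prints each connection stage; --connect-timeout bounds the hang
curl -v -o /dev/null --connect-timeout 15 http://pypi.pvt.build.mozilla.org/pub
# A silent stall ending in "Operation timed out" is consistent with an ACL or
# load-balancer rule dropping the traffic; "Connection refused" would not be.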
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Severity: normal → critical
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Comment 22•7 years ago
(In reply to Jake Watkins [:dividehex] from comment #16)
>
> The connection to pub seems to be OK when using curl, but the connection to
> pvt times out. Maybe we need to add the MDC2 CIDR to a whitelist for pvt?
We had updated the apache configs for everything on the relengweb cluster a while back when we noticed traffic from MDC1 was randomly failing (due to zeus cache misses). But it looks like we all ALSO missed a Zeus rule (releng-net-only) on the internal ZLB which hosts pypi.pvt.build.mozilla.org; in fact, that rule was missing the releng networks in MDC1, MDC2, us-west-2, and us-east-1. I've updated the releng-net-only rule to include those subnets (10.49.0.0/16, 10.51.0.0/16, 10.132.0.0/16, and 10.134.0.0/16).
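Since all four networks are /16s, checking whether a given host is covered by the updated rule reduces to a prefix match on the first two octets. The following is only an illustrative sketch of that check, not the Zeus rule itself (and hostname -I is Linux-specific; on the OS X workers something like ipconfig getifaddr en0 would be needed instead):

# Print whether this host's primary IP falls inside the subnets added to releng-net-only
ip=$(hostname -I 2>/dev/null | awk '{print $1}')
case "$ip" in
  10.49.*|10.51.*|10.132.*|10.134.*) echo "$ip is inside the updated releng-net-only list" ;;
  *) echo "$ip is NOT covered by the updated rule" ;;
esac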
There's also a note in IT puppet's modules/releng/manifests/init.pp:
# NOTE: if you change these, you'll also need to change the network_regexps
# secret in PuppetAgain. See
# https://wiki.mozilla.org/ReleaseEngineering/PuppetAgain/Secrets and check
# with someone from relops/releng.
Jake, can you verify that's been updated?
Flags: needinfo?(jwatkins)
Comment 23•7 years ago
(In reply to Kendall Libby [:fubar] (PTO Mar 14-18) from comment #22)
> We had updated the apache configs for everything on the relengweb cluster a
> while back when we noticed traffic from MDC1 was randomly failing (due to
> zeus cache misses). But it looks like we all ALSO missed a Zeus rule
> (releng-net-only) on the internal ZLB which hosts
> pypi.pvt.build.mozilla.org; in fact, that rule was missing the releng
> networks in MDC1, MDC2, us-west-2, and us-east-1. I've updated the
> releng-net-only rule to include those subnets (10.49.0.0/16, 10.51.0.0/16,
> 10.132.0.0/16, and 10.134.0.0/16).
> Jake, can you verify that's been updated?
I noticed that rule on the internal ZLB also, but it was set to disabled, and the fact that it didn't have most of our DC CIDR blocks in it led me to believe it was defunct anyway. I can confirm this is STILL being blocked from MDC2, so I think Netops should take a look at this in bug 1446176.
[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pub.build.mozilla.org/pub
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:01:14 --:--:-- 0curl: (7) Failed to connect to pypi.pvt.build.mozilla.org port 80: Operation timed out
Flags: needinfo?(jwatkins)
Comment 24•7 years ago
There are similar issues affecting Linux (and possibly Android) tests, reported in bug 1411358 (alongside failures from other causes). For example:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=168901986&lineNumber=1019
[task 2018-03-19T10:52:31.184Z] 10:52:31 INFO - Installing None into virtualenv /builds/worker/workspace/build/venv
[task 2018-03-19T10:52:31.187Z] 10:52:31 INFO - error resolving pypi.pvt.build.mozilla.org (ignoring):
[taskcluster:error] Task timeout after 3600 seconds. Force killing container.
Comment 25•7 years ago
(In reply to Geoff Brown [:gbrown] from comment #24)
> There are similar issues affecting Linux (and possibly Android) tests,
> reported in bug 1411358 (alongside failures from other causes). For example:
>
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-
> inbound&job_id=168901986&lineNumber=1019
>
> [task 2018-03-19T10:52:31.184Z] 10:52:31 INFO - Installing None into
> virtualenv /builds/worker/workspace/build/venv
> [task 2018-03-19T10:52:31.187Z] 10:52:31 INFO - error resolving
> pypi.pvt.build.mozilla.org (ignoring):
>
> [taskcluster:error] Task timeout after 3600 seconds. Force killing container.
I'm not entirely sure this is related. This looks like it is from an AWS taskcluster worker (as opposed to a hardware worker), and it is a failure to resolve the DNS name for pypi's external VIPs rather than a failure to connect. But it is still something that needs to be looked into.
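The two failure modes can be told apart directly on a worker: a resolution failure shows up at the lookup step, while the MDC2 problem in this bug only appeared at the TCP connect step. A rough sketch (it assumes the host utility is installed, which may not be the case on a minimal docker image):

# Step 1: does the name resolve at all? (the failure in comment 24)
host pypi.pvt.build.mozilla.org || echo "DNS resolution failed"
# Step 2: if it resolves, can a TCP connection actually be opened?
# (the failure the MDC2 OS X workers were hitting)
curl -sS -o /dev/null --connect-timeout 15 http://pypi.pvt.build.mozilla.org/pub || echo "TCP connect failed"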
Comment 27•7 years ago
Netops has fixed this issue in bug 1446176, and I have confirmed the traffic is no longer being blocked in MDC2.
[root@t-yosemite-r7-219.test.releng.mdc2.mozilla.com ~]# curl -O http://pypi.pvt.build.mozilla.org/pub
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 246 100 246 0 0 357 0 --:--:-- --:--:-- --:--:-- 357
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Comment hidden (Intermittent Failures Robot)