Bug 588711 (Closed)
[tracking bug] Frequent build automation failures due to "Gateway Time-out" during network activity related steps
Opened 14 years ago • Closed 14 years ago
Categories: mozilla.org Graveyard :: Server Operations (task)
Tracking: (Not tracked)
Status: RESOLVED FIXED
People: Reporter: dholbert • Assigned: aravind
I've seen at least 10 TryServer builds fail today due to "abort: HTTP Error 504: Gateway Time-out" during hg clone.
The logs all end with something like this:
{
argv: ['/tools/python/bin/hg', 'clone', '--verbose', '--noupdate', '--rev', '583ae843a43488f3bfda25ff4615887cfc8a57f5', u'http://hg.mozilla.org/try', '/builds/slave/tryserver-linux64-debug/build']
environment:
[...]
using PTY: False
requesting all changes
abort: HTTP Error 504: Gateway Time-out
elapsedTime=1800.137150
program finished with exit code 255
}
List of failure logs with this problem (all of which just completed recently):
==============================================================================
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203583.1282205659.24030.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203477.1282205509.23531.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203476.1282205520.23582.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205314.22453.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205501.23520.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205496.23498.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205295.22395.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203474.1282205288.22338.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205314.22454.gz
And here's one that failed with this error earlier today:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282178040.1282180111.11311.gz
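Since the failures look transient, a stopgap while the server side is investigated could be to retry the clone step a few times. A minimal sketch of that idea (this is not the actual buildbot step; the rev, URL, and path are just copied from the log above):

import shutil
import subprocess
import time

def clone_with_retries(rev, url, dest, attempts=3, delay=60):
    # Run `hg clone` and retry on what look like transient gateway errors.
    cmd = ["hg", "clone", "--noupdate", "--rev", rev, url, dest]
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return
        # hg exits 255 and prints e.g. "abort: HTTP Error 504: Gateway Time-out"
        transient = "504" in result.stderr or "502" in result.stderr
        if not transient or attempt == attempts:
            raise RuntimeError("hg clone failed: " + result.stderr.strip())
        shutil.rmtree(dest, ignore_errors=True)  # clear any partial clone
        time.sleep(delay)

clone_with_retries("583ae843a43488f3bfda25ff4615887cfc8a57f5",
                   "http://hg.mozilla.org/try",
                   "/builds/slave/tryserver-linux64-debug/build")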
Updated (Reporter) • 14 years ago
OS: Linux → All
Hardware: x86 → All
Comment 1 (Reporter) • 14 years ago
I initially thought this might be a case of "try server hg repository has too many heads & needs a fresh start", but I don't think that's actually what's going on.
I just tried running the |hg| command from comment 0 locally (at home)...
> hg clone --verbose --noupdate --rev 583ae843a43488f3bfda25ff4615887cfc8a57f5 http://hg.mozilla.org/try
...and it gets past the "requesting all changes" stage and on to "adding changesets" pretty much instantaneously.
So, this looks like a case of network congestion, I guess?
Comment 2 (Reporter) • 14 years ago
(In reply to comment #1)
> So, this looks like a case of network congestion, I guess?
(I mean: sporadic congestion, affecting the builders)
Comment 3 • 14 years ago
(In reply to comment #1)
> I initially thought this might be a case of "try server hg repository has too
> many heads & needs a fresh start", but I don't think that's actually what's
> going on.
Definitely not, because it does shallow clones (only the required changesets for the head you want).
> So, this looks like a case of network congestion, I guess?
504 Gateway Time-out means that the proxy made an HTTP request to the HG server, which didn't respond in a timely manner (in this case, ~30min). Generally, it indicates issues on the HG server, not network congestion.
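(For what it's worth, one way to check whether a given error comes from the caching/proxy layer rather than hgweb itself is to look at the response headers: varnish normally adds an X-Varnish and a "Via: 1.1 varnish" header to responses it handles. A quick sketch, with no claim that this matches the actual hg.m.o deployment:)

import urllib.error
import urllib.request

def probe(url="http://hg.mozilla.org/try"):
    # Fetch the URL and report which layer (proxy vs. origin) answered.
    try:
        resp = urllib.request.urlopen(url, timeout=120)
        status, headers = resp.status, resp.headers
    except urllib.error.HTTPError as e:
        status, headers = e.code, e.headers  # a 504 here implicates the proxy's upstream timeout
    print("status:", status)
    # varnish typically sets these; their absence suggests we reached the origin directly
    print("Via:", headers.get("Via"))
    print("X-Varnish:", headers.get("X-Varnish"))

probe()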
Moving this to Server Ops. Marked critical -- but it's really a blocker once it starts happening again.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Updated • 14 years ago
Assignee: server-ops → aravind
Comment 4 • 14 years ago
Here's a tally of failure times from the last 24h (times listed are the hg clone start times, in PDT):
August 18:
11:45am - 12:00pm: 5 failures
12:15pm - 12:30pm: 2 failures
3:15pm - 3:30pm: 5 failures
5:00pm - 5:15pm: 22 failures
5:15pm - 5:30pm: 11 failures
5:30pm - 5:45pm: 22 failures
5:45pm - 6:00pm: 7 failures
8:30pm - 8:45pm: 10 failures
August 19:
12:00am - 12:15am: 2 failures
12:15am - 12:30am: 3 failures
12:30am - 12:45am: 11 failures
2:30am - 2:45am: 11 failures
3:30am - 3:45am: 5 failures
7:30am - 7:45am: 12 failures
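(If anyone wants to reproduce or extend this tally: it's just the clone start times from the failure logs bucketed into 15-minute windows. A throwaway sketch, assuming the timestamps have already been extracted from the logs:)

from collections import Counter
from datetime import datetime

def tally_15min(timestamps):
    # Count failures per 15-minute window, e.g. "2010-08-18 17:05" lands in the 17:00 bucket.
    buckets = Counter()
    for ts in timestamps:
        dt = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        buckets[dt.replace(minute=(dt.minute // 15) * 15)] += 1
    return buckets

for bucket, count in sorted(tally_15min(["2010-08-18 17:05",
                                         "2010-08-18 17:10"]).items()):
    print(bucket.strftime("%b %d %I:%M%p"), "-", count, "failures")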
Comment 6 (Assignee) • 14 years ago
I took the caching layer out of the picture and am sending try requests directly to the backend server. Hopefully this helps with the problem. Please let me know if you notice any more try failures (since 6:30 AM PST).
Comment 7 (Assignee) • 14 years ago
Getting varnish out of the picture seems to have helped. It's not a permanent fix; the permanent fix will be Build & Release reworking the way they do checkouts. Please re-open if you notice this issue again.
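(For context, the checkout rework mentioned above would presumably mean keeping a persistent local clone per builder and pulling only the new changesets into it, instead of re-cloning the whole repository from hg.mozilla.org on every job. A sketch of that pattern, with hypothetical paths and no claim that this is the eventual RelEng implementation:)

import os
import subprocess

def checkout(rev, url, repo_dir):
    # Pull into a persistent local clone instead of re-cloning every build.
    if os.path.isdir(os.path.join(repo_dir, ".hg")):
        # Existing clone: only the new changesets cross the network.
        subprocess.check_call(["hg", "--cwd", repo_dir, "pull", "--rev", rev, url])
    else:
        # First run on this slave: pay the full clone cost once.
        subprocess.check_call(["hg", "clone", "--noupdate", "--rev", rev, url, repo_dir])
    subprocess.check_call(["hg", "--cwd", repo_dir, "update", "--rev", rev])

checkout("583ae843a43488f3bfda25ff4615887cfc8a57f5",
         "http://hg.mozilla.org/try",
         "/builds/slave/tryserver-linux64-debug/build")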
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 8 • 14 years ago
We had some more failures last night, unfortunately:
Tryserver Mercurial failures on 2010/08/24
20:15: 1
21:15: 12
21:30: 14
21:45: 6
23:30: 2
I'm not sure what else we can do besides wait for the real fix in bug 589885.
Comment 9 (Reporter) • 14 years ago
Ok - reopening per comment 7, then.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 10 (Reporter) • 14 years ago
(Even if there's nothing that can be done here at the moment, it's probably best to have an open bug tracking this, so that other people who hit this issue can more easily find it.)
Comment 11 • 14 years ago
More than 10 builds had problems at 18:34 today, with
abort: HTTP Error 502: Bad Gateway
Updated • 14 years ago
Whiteboard: [pending build fix]
Comment 12 • 14 years ago
Yesterday, 13 Sep 2010, we had about 4 hours of intermittent 502 gateway errors, happening with increasing frequency from 10am EDT until IT restarted the varnish proxy.
It impacted more than 50 builds and triggered many more retries.
Updated • 14 years ago
Summary: Frequent TryServer failures due to "abort: HTTP Error 504: Gateway Time-out" during hg clone → Frequent build automation failures due to "Gateway Time-out" during network activity related steps
Updated • 14 years ago
Summary: Frequent build automation failures due to "Gateway Time-out" during network activity related steps → [tracking bug] Frequent build automation failures due to "Gateway Time-out" during network activity related steps
Comment 13 • 14 years ago
Adding more details: two nightlies and a chunk of the mobile/Maemo builds during the late overnight also hit hg 502 failures.
Comment 14 • 14 years ago
Raising priority, as these hg failures are causing builds and tests to fail in production. This caused delays yesterday and last night, which led to long wait times. It also requires a lot of manual re-queuing of jobs.
Removing whiteboard "[pending build fix]", because while comment #8 is about enhancements to RelEng infrastructure, it is orthogonal to what needs to be fixed here.
Severity: critical → blocker
Whiteboard: [pending build fix]
Comment 15 • 14 years ago
From a quick chat with aravind:
1) The errors yesterday morning were caused by someone running a spider on hg.m.o.
2) Unknown what caused the errors last night.
Comment 16 (Assignee) • 14 years ago
Is this still consistently happening? Is there a reproducible test case I can use?
Comment 17 • 14 years ago
No occurrences have been reported to me today, and checkin/checkout activity seems much lower than earlier.
I will close, with the option to reopen if something spikes - thanks.
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Updated • 10 years ago
Product: mozilla.org → mozilla.org Graveyard