Closed Bug 934938 Opened 11 years ago Closed 11 years ago

Intermittent ftp.m.o "ERROR 503: Server Too Busy" during download-and-extract step

Categories

(Infrastructure & Operations Graveyard :: WebOps: Product Delivery, task, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cbook, Unassigned)


Details

Ubuntu VM 12.04 mozilla-central pgo test jetpack on 2013-11-05 02:40:55 PST for push 770de5942471
slave: tst-linux32-ec2-021

Not sure if there is anything we can do about this.

https://tbpl.mozilla.org/php/getParsedLog.php?id=30130965&tree=Mozilla-Central

--2013-11-05 02:41:19--  http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux-pgo/1383638406/firefox-28.0a1.en-US.linux-i686.tar.bz2
Resolving ftp.mozilla.org (ftp.mozilla.org)... 63.245.215.46
Connecting to ftp.mozilla.org (ftp.mozilla.org)|63.245.215.46|:80... connected.
HTTP request sent, awaiting response... 503 Server Too Busy
2013-11-05 02:41:19 ERROR 503: Server Too Busy.
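A minimal Python 3 sketch of how a download step could treat a 503 as transient and retry a few times before failing the job; this is an illustration only, not the actual buildbot/mozharness download-and-extract logic, and the helper name is hypothetical.

import time
import urllib.error
import urllib.request

def download_with_retry(url, dest, attempts=5, backoff=30):
    """Retry a download when the server answers 503 Server Too Busy."""
    for attempt in range(1, attempts + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            return
        except urllib.error.HTTPError as err:
            # Treat only 503 as transient; re-raise anything else, or give up
            # once the retry budget is exhausted.
            if err.code != 503 or attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff between tries

download_with_retry(
    "http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/"
    "mozilla-central-linux-pgo/1383638406/firefox-28.0a1.en-US.linux-i686.tar.bz2",
    "firefox-28.0a1.en-US.linux-i686.tar.bz2",
)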
Assignee: relops → server-ops-webops
Component: RelOps → WebOps: Product Delivery
QA Contact: arich → nmaul
:Tomcat - could you help us find other occurrences of these HTTP 503s? You mention that you're seeing them intermittently; did this just start recently, or have you observed it for some time now? Any more details you can provide would be helpful.
Flags: needinfo?(cbook)
Hardware: x86 → All
(In reply to Chris Turra [:cturra] from comment #1)
> :Tomcat - could you help us find other occurrences of these HTTP 503s? You
> mention that you're seeing them intermittently; did this just start recently,
> or have you observed it for some time now? Any more details you can provide
> would be helpful.

Hey Chris, so far this is the first time I have seen this, and TBPL has also reported it only once. Ed Morley mentioned that 503s on ftp.m.o were seen in the past too (and that if they happen frequently, the trees will be closed); maybe he has a history of this issue here. :)
Flags: needinfo?(cbook) → needinfo?(emorley)
We intermittently see 503s (maybe a few times a week), and often don't file, since the issue has resolved itself by the time it is spotted. However, it seems worthwhile to track them in this bug, so we can see if there is a pattern (e.g. a logrotate cron job causing issues, or other similar things we've had happen in the past).
Flags: needinfo?(emorley)
:edmorley - I agree it's worth looking into, but I'm going to need more information to do that. Can you help me track down dates/times when you have observed these in the past? And if there is any header information from those failed attempts, that would also be useful.
Flags: needinfo?(emorley)
Not easily, sadly. What I meant in comment 3 is that this bug is what we will track them in from this point forward - i.e. no action is required here until we have sufficient data points.
Flags: needinfo?(emorley)
Summary: Intermittent ERROR 503: Server Too Busy. on ftp.m.o → Intermittent ftp.m.o "ERROR 503: Server Too Busy" or "command timed out: 1200 seconds without output, attempting to kill" during download-and-extract step
This most recent episode appears to only be hitting EC2 slaves, which makes me think that this is our recurring link-to-AWS-is-slow issue. I'm getting 12M/sec now though, so it might be over already.
(In reply to Ben Hearsum [:bhearsum] from comment #8)
> This most recent episode appears to only be hitting EC2 slaves, which makes
> me think that this is our recurring link-to-AWS-is-slow issue. I'm getting
> 12M/sec now though, so it might be over already.

Have all the failures noted in this bug been from EC2, or have any been from a Mozilla DC?

Dropping to normal for now.
Severity: blocker → normal
Raising severity since this spiked last night.
Severity: normal → major
(In reply to Carsten Book [:Tomcat] from comment #583)
> Raising severity since this spiked last night.

Raising the severity is fine, but that also causes this bug to page the on-call sysadmin. Reading the comments here, I have no idea what action needs to be taken.

-> P1:normal so it doesn't page.
Severity: major → normal
Priority: -- → P1
What to do is say: "in its current state, this bug is purely dependent on bug 957502, so there's no point in twiddling flags here."
Depends on: 957502
Depends on: 961030
I would like to request that we stop sending these TBPL robot comments to this bug. It's great that we're tracking the frequency of these timeouts, and I firmly believe we should continue to do so, but I don't think that filling up a bug with comments is the right way to do it. Additionally, while this information is useful, there are a couple of dependent bugs here that have been identified as the root cause (bugs 957502 & 961030). Might it be better to track the frequency of timeouts against those instead?
Flags: needinfo?(emorley)
There isn't a way to stop them on a per-bug basis, and we need to be able to star these failures, so there's really no way to avoid it. For frequently-occurring bugs like this, we recommend leaving this bug as a dumping ground for stars and using dependencies for investigating and fixing the underlying causes.
Flags: needinfo?(emorley)
I switched ftp.mozilla.org away from Dynect DNS load balancing to simple DNS round robin. This should result in better load balancing between the two IPs hosting ftp.mozilla.org.

The suspected problem solved by this is a thundering herd: many (all?) of the releng nodes downloading files from ftp.mozilla.org do so in a very bursty fashion, and they all use the same DNS resolvers. Dynect only returns one record, not both. The effect is that all of these machines burst all at once to the same IP and (I think) hit its 2 Gbps data transfer cap (licensing). That's why there are two IPs in the first place: to get up to 4 Gbps.

DNS round robin returns both IPs and leaves it up to the client to decide which to use. Thanks to the law of large numbers, this results in a very even distribution of traffic between the two IPs, because most operating systems will randomize (or otherwise alternate) which IP in a result set they use.

Since making this change, traffic has not yet been high enough to reach the point where we think the problem was triggering, so at the moment I can't say for certain how effective this change is. I can say the bandwidth distribution between the two nodes is much more even and consistent; what I can't say is whether this will actually cure the problem reported here.
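For illustration, a minimal Python sketch of the client-side selection that round robin relies on (an assumption about typical client behaviour, not a description of the releng tooling): the resolver returns every A record and each client picks one, so across a large fleet connections spread roughly evenly over the available IPs.

import random
import socket

def pick_address(hostname, port=80):
    """Resolve all A records for hostname and pick one at random."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = sorted({info[4][0] for info in infos})
    return random.choice(addresses)

# With DNS round robin in place, repeated calls spread connections across
# both ftp.mozilla.org IPs rather than piling onto a single one.
print(pick_address("ftp.mozilla.org"))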
(In reply to Jake Maul [:jakem] from comment #1519)
> Since making this change, traffic has not yet been high enough to reach the
> point where we think the problem was triggering, so at the moment I can't say
> for certain how effective this change is. I can say the bandwidth distribution
> between the two nodes is much more even and consistent; what I can't say is
> whether this will actually cure the problem reported here.

I'm leaning towards it not being the cure.
In order to prevent further tbpl-robot comments on this bug, I've enabled comment restriction.
Restrict Comments: true
It was the cure for a different disease: the tiny smattering of 503 Too Busy failures during the day. The choking out of the VPN around 19:30 every night, which results in the timeouts, is another thing entirely.
https://tbpl.mozilla.org/php/getParsedLog.php?id=33463523&tree=Mozilla-Inbound, though, says it wasn't entirely a magic bullet for 503 Too Busy.
Trees closed again; not inclined to open them any time soon. This bug / bug 957502 needs escalating ASAP - my patience is wearing somewhat thin :-( I'll send some emails out.
Things seem to have settled down, so I've reopened for now.
Thanks for the 503 fix, sorry we got 86ed from the bug.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Summary: Intermittent ftp.m.o "ERROR 503: Server Too Busy" or "command timed out: 1200 seconds without output, attempting to kill" during download-and-extract step → Intermittent ftp.m.o "ERROR 503: Server Too Busy" during download-and-extract step
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard