Closed
Bug 969590
Opened 11 years ago
Closed 11 years ago
Temporarily revert the change to m3.medium AWS instances to see if they are behind the recent increase in test timeouts
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RyanVM, Assigned: rail)
References
Details
Within the last week, we've seen a dramatic increase in test timeouts on Linux64. The timeframe appears to line up with when we switched from m1 instances to m3 instances. To test that theory, we would like to temporarily revert to m1 instances for a period of time to see if the rate of timeouts decreases with it.
Comment 1•11 years ago
Can you look at switching back from m3.medium for a day?
Flags: needinfo?(rail)
Comment 2•11 years ago
The two easier-to-spot failures should be bug 777574, Linux64 ASan webgl timeouts which were a couple of times a week and are now 30 times a day since the afternoon of January 28th, and bug 926264, Linux64 Jetpack shutdown hangs that were all-platform and once a week until the afternoon of January 28th when they became 40 or 50 a day.
Comment 3•11 years ago
Another very likely candidate is bug 967816, in which gaia-ui-tests on linux64 became nearly permafail on Feb 3, which is the date that spot instances were converted to m3.medium. The same tests running on osx didn't experience a change in failure rates.
Blocks: 967816
Assignee
Comment 4•11 years ago
I have a couple of questions here and some input.
Converting the instances back to m1.medium is not a bug deal; it will just take some time, and we will have a mixed pool of m1 and m3 instances for a while. The conversion will happen whenever we start instances (not on reboot).
1) Do we want to switch all tst-linux64 instances to m1.medium (spot+on-demand)?
2) When do we want to start switching? Are we OK starting this weekend?
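For illustration only, a conversion pass along these lines could look roughly like the following sketch, assuming boto3, a made-up tst-linux64 Name-tag pattern, and that instances are converted while stopped; the real cloud-tools automation may do this differently.

```python
# Illustrative sketch only: switch stopped tst-linux64 test instances to
# m1.medium before starting them again. The region, tag filter, and pool
# naming are assumptions, not the actual cloud-tools configuration.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def convert_stopped_instances(target_type="m1.medium"):
    pages = ec2.get_paginator("describe_instances").paginate(Filters=[
        {"Name": "tag:Name", "Values": ["tst-linux64-*"]},       # hypothetical tag pattern
        {"Name": "instance-state-name", "Values": ["stopped"]},  # type changes need a stopped instance
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                if inst["InstanceType"] == target_type:
                    continue
                ec2.modify_instance_attribute(
                    InstanceId=inst["InstanceId"],
                    InstanceType={"Value": target_type},
                )
                ec2.start_instances(InstanceIds=[inst["InstanceId"]])

if __name__ == "__main__":
    convert_stopped_instances()
```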
Flags: needinfo?(rail)
Assignee
Comment 5•11 years ago
a bug deal == a big deal :)
Comment 6•11 years ago
I think it would make a clearer signal if we made the change on Monday, when there is likely to be heavier commit traffic.
We'd also see the clearest signal if we changed both spot and on-demand instances, but if it's easier just to handle spot instances, I imagine we could see enough of a signal to tell whether additional investigation in this direction is warranted.
If we go with the spot-only approach, how long would it take before all of the spot instances were running on the old node type? We'd want to wait about a day after we got to that state before attempting to make a call as to whether the node type change is the culprit.
Assignee
Comment 7•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #6)
> I think it would make a clearer signal if we made the change on Monday, when
> there is likely to be heavier commit traffic.
WFM
>
> If we go with the spot-only approach, how long would it take before all of
> the spot instances were running on the old node type? We'd want to wait
> about a day after we got to that state before attempting to make a call as
> to whether the node type change is the culprit.
I have no precise figures here, unfortunately... I can start the conversion on Sunday evening when things are quiet, so we get a faster turnaround and probably have everything ready by Monday morning.
Assignee
Updated•11 years ago
Assignee: nobody → rail
Comment 8•11 years ago
Thanks Rail!
Assignee
Comment 9•11 years ago
To make the plan clear:
* we are going to switch to m1.medium for spot instances only
* the process will be started this Sunday
* we evaluate the results on Tuesday
Anything missing?
Comment 10•11 years ago
That sounds like a great plan to me.
Comment 11•11 years ago
While adding those dependencies, I ran through the intermittent-failure bugs I've filed since January 28th, and especially on February 2nd, and there are lots more one-off failures on Linux64 besides those (I've since stopped filing them, just blowing off 10 or 30 per day). So that's another thing to watch for disappearing from spot but not from on-demand: random timeouts in tests that have never timed out before.
Comment 12•11 years ago
Favoritest one: bug 965534 failed only on on-demand slaves between January 29 and February 3, when it began also failing on spot slaves.
Blocks: 965534
Assignee
Comment 13•11 years ago
Assignee
Comment 14•11 years ago
As of now:
m3.medium: 8 / m1.medium: 385
I'm going to monitor the remaining 8 VMs and terminate them on reboot.
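A breakdown like the one above can be pulled with a few lines against the EC2 API; the sketch below assumes boto3 and a hypothetical tst-linux64 Name-tag pattern, and is not the tooling that produced these numbers.

```python
# Illustrative sketch only: count test instances by EC2 instance type.
# The tag filter is a guess at the pool's naming convention.
from collections import Counter
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def instance_type_breakdown():
    counts = Counter()
    pages = ec2.get_paginator("describe_instances").paginate(Filters=[
        {"Name": "tag:Name", "Values": ["tst-linux64-*"]},  # hypothetical tag pattern
        {"Name": "instance-state-name",
         "Values": ["pending", "running", "stopping", "stopped"]},
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                counts[inst["InstanceType"]] += 1
    return counts

if __name__ == "__main__":
    breakdown = instance_type_breakdown()
    print(" / ".join(f"{t}: {n}" for t, n in sorted(breakdown.items())))
```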
Assignee
Comment 15•11 years ago
m3.medium: 0 / m1.medium: 382
Comment 16•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> Another very likely candidate is bug 967816, in which gaia-ui-tests on
> linux64 became nearly permafail on Feb 3, which is the date that spot
> instances were converted to m3.medium. The same tests running on osx didn't
> experience a change in failure rates.
This looks very promising so far. There have been 0 of these Gu timeouts on b2g-inbound and mozilla-inbound today on spot instances; the only occurrences have been on on-demand instances.
Comment 17•11 years ago
Yeah, the only thing that keeps it from being a no-questions absolutely perfect success is the nasty surprise of bug 970239 being m1.medium only, but that's still a deal I'd take in a heartbeat: throw away two tests in order to get back the entire suites and platforms that we hid over this.
Blocks: 970239
Comment 18•11 years ago
So, I declare this experiment a success (comment #17 notwithstanding). Rail, can we move the on-demand instances back to m1.medium too? Then, we can unhide several test suites that have been hidden since around Feb 3.
Since the new node type is more efficient, we should try to acquire some engineering resources to help identify the source of the hangs.
Assignee
Comment 19•11 years ago
I pushed http://hg.mozilla.org/build/cloud-tools/rev/1e8ba299c4ab#l1.42 to let automation change the instance type of newly started instances. This may take some time. I'll take a look at the instance type breakdown tomorrow.
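The actual change is the linked cloud-tools revision; purely as an illustration of the idea (not of that code), a per-pool, config-driven type choice applied when a stopped instance is started might look like the sketch below, with the pool names, mapping, and helper invented for the example.

```python
# Illustrative sketch only: pick the instance type from a per-pool config and
# apply it before starting a stopped instance. The pool names and mapping are
# invented; the real logic lives in the linked cloud-tools revision.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Flipping a value here is what drives a conversion in either direction.
DESIRED_TYPE_BY_POOL = {
    "tst-linux64-spot": "m1.medium",       # hypothetical pool names
    "tst-linux64-ondemand": "m1.medium",
}

def start_with_desired_type(instance_id, pool):
    """Align a stopped instance's type with its pool config, then start it."""
    desired = DESIRED_TYPE_BY_POOL[pool]
    inst = ec2.describe_instances(
        InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]
    if inst["State"]["Name"] == "stopped" and inst["InstanceType"] != desired:
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            InstanceType={"Value": desired},
        )
    ec2.start_instances(InstanceIds=[instance_id])
```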
Assignee
Comment 20•11 years ago
m3.medium: 88 / m1.medium: 167
Assignee
Comment 21•11 years ago
This is done now. Zarro m3.medium instances.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 22•11 years ago
So, after failing to reproduce bug 926264 on a freshly installed m3.medium with Ubuntu 12.04, Xvfb, and Unity, I tried on a loaner, in case my fresh install didn't quite match our actual test environment. Guess what? I've been equally unable to reproduce it. Whatever is happening on m3.mediums that made us back out this change is not triggered by repeatedly running the same test. I'm afraid the only way to find what's wrong here would be to catch those timeouts while they are happening... in production. Could we perhaps set up duplicate test jobs on a small pool of m3.mediums?
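As a rough illustration of the "repeatedly running the same test" attempt described above (not the actual harness invocation), a loop like this could be left running on a loaner to count hangs; the command line, timeout, and run count are placeholders.

```python
# Illustrative sketch only: hammer one suspect test on a loaner and count
# per-run timeouts. The command line, timeout, and run count are placeholders.
import subprocess

TEST_CMD = ["./mach", "test", "path/to/suspect-test"]  # placeholder invocation
TIMEOUT_S = 330                                        # placeholder per-run timeout
RUNS = 200

def hammer():
    hangs = failures = 0
    for i in range(RUNS):
        try:
            result = subprocess.run(TEST_CMD, capture_output=True, timeout=TIMEOUT_S)
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            hangs += 1
            print(f"run {i}: timed out after {TIMEOUT_S}s")
    print(f"{RUNS} runs: {hangs} hangs, {failures} non-timeout failures")

if __name__ == "__main__":
    hammer()
```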
Updated•7 years ago
Component: General Automation → General