Closed
Bug 969590
Opened 11 years ago
Closed 11 years ago
Temporarily revert the change to m3.medium AWS instances to see if they are behind the recent increase in test timeouts
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: RyanVM, Assigned: rail)
References
Details
Within the last week, we've seen a dramatic increase in test timeouts on Linux64. The timeframe appears to line up with when we switched from m1 instances to m3 instances. To test that theory, we would like to temporarily revert to m1 instances for a period of time to see if the rate of timeouts decreases with it.
Comment 1•11 years ago
Can you look at switching back from m3.medium for a day?
Flags: needinfo?(rail)
Comment 2•11 years ago
The two easier-to-spot failures should be bug 777574, Linux64 ASan webgl timeouts which were a couple of times a week and are now 30 times a day since the afternoon of January 28th, and bug 926264, Linux64 Jetpack shutdown hangs that were all-platform and once a week until the afternoon of January 28th when they became 40 or 50 a day.
Comment 3•11 years ago
Another very likely candidate is bug 967816, in which gaia-ui-tests on linux64 became nearly permafail on Feb 3, which is the date that spot instances were converted to m3.medium. The same tests running on osx didn't experience a change in failure rates.
Blocks: 967816
Assignee
Comment 4•11 years ago
I have a couple of questions here and some input.
Converting the instances back to m1.medium is not a bug deal; it will just take some time, and we will have a mixed pool of m1 and m3 instances for a while. The conversion will happen whenever we start instances (not on reboot).
1) Do we want to switch all tst-linux64 instances to m1.medium (spot+on-demand)?
2) When do we want to start switching? Are we OK starting this weekend?
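For illustration only, a conversion pass along these lines could look roughly like the following sketch, assuming boto3, a made-up tst-linux64 Name-tag pattern, and that instances are converted while stopped; the real cloud-tools automation may do this differently.

```python
# Illustrative sketch only: switch stopped tst-linux64 test instances to
# m1.medium before starting them again. The region, tag filter, and pool
# naming are assumptions, not the actual cloud-tools configuration.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def convert_stopped_instances(target_type="m1.medium"):
    pages = ec2.get_paginator("describe_instances").paginate(Filters=[
        {"Name": "tag:Name", "Values": ["tst-linux64-*"]},       # hypothetical tag pattern
        {"Name": "instance-state-name", "Values": ["stopped"]},  # type changes need a stopped instance
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                if inst["InstanceType"] == target_type:
                    continue
                ec2.modify_instance_attribute(
                    InstanceId=inst["InstanceId"],
                    InstanceType={"Value": target_type},
                )
                ec2.start_instances(InstanceIds=[inst["InstanceId"]])

if __name__ == "__main__":
    convert_stopped_instances()
```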
Flags: needinfo?(rail)
Assignee
Comment 5•11 years ago
a bug deal == a big deal :)
Comment 6•11 years ago
I think it would make a clearer signal if we made the change on Monday, when there is likely to be heavier commit traffic.
We'd also see the clearest signal if we changed both spot and on-demand instances, but if it's easier just to handle spot instances, I imagine we could see enough of a signal to tell whether additional investigation in this direction is warranted.
If we go with the spot-only approach, how long would it take before all of the spot instances were running on the old node type? We'd want to wait about a day after we got to that state before attempting to make a call as to whether the node type change is the culprit.
Assignee
Comment 7•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #6)
> I think it would make a clearer signal if we made the change on Monday, when
> there is likely to be heavier commit traffic.
WFM
>
> If we go with the spot-only approach, how long would it take before all of
> the spot instances were running on the old node type? We'd want to wait
> about a day after we got to that state before attempting to make a call as
> to whether the node type change is the culprit.
I have no precise figures here, unfortunately... I can start the conversion on Sunday evening when things are quiet, so we get a faster turnaround and probably have everything ready by Monday morning.
Assignee
Updated•11 years ago
Assignee: nobody → rail
Comment 8•11 years ago
Thanks Rail!
Assignee
Comment 9•11 years ago
To make the plan clear:
* we are going to switch to m1.medium for spot instances only
* the process will be started this Sunday
* we evaluate the results on Tuesday
Anything missing?
Comment 10•11 years ago
That sounds like a great plan to me.
Comment 11•11 years ago
While adding those dependencies, I ran through the intermittent-failure bugs I've filed since January 28th, and especially on February 2nd, and there are lots more one-off failures on Linux64 besides those (I've since stopped filing them, just blowing off 10 or 30 per day). So that's another thing to watch for disappearing from spot but not from on-demand: random timeouts in tests that have never timed out before.
Comment 12•11 years ago
Favoritest one: bug 965534 failed only on on-demand slaves between January 29 and February 3, when it began also failing on spot slaves.
Blocks: 965534
Assignee
Comment 13•11 years ago
Assignee
Comment 14•11 years ago
As of now:
m3.medium: 8 / m1.medium: 385
I'm going to monitor the remaining 8 VMs and terminate them on reboot.
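A breakdown like the one above can be pulled with a few lines against the EC2 API; the sketch below assumes boto3 and a hypothetical tst-linux64 Name-tag pattern, and is not the tooling that produced these numbers.

```python
# Illustrative sketch only: count test instances by EC2 instance type.
# The tag filter is a guess at the pool's naming convention.
from collections import Counter
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

def instance_type_breakdown():
    counts = Counter()
    pages = ec2.get_paginator("describe_instances").paginate(Filters=[
        {"Name": "tag:Name", "Values": ["tst-linux64-*"]},  # hypothetical tag pattern
        {"Name": "instance-state-name",
         "Values": ["pending", "running", "stopping", "stopped"]},
    ])
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                counts[inst["InstanceType"]] += 1
    return counts

if __name__ == "__main__":
    breakdown = instance_type_breakdown()
    print(" / ".join(f"{t}: {n}" for t, n in sorted(breakdown.items())))
```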
Assignee
Comment 15•11 years ago
m3.medium: 0 / m1.medium: 382
Comment 16•11 years ago
(In reply to Jonathan Griffin (:jgriffin) from comment #3)
> Another very likely candidate is bug 967816, in which gaia-ui-tests on
> linux64 became nearly permafail on Feb 3, which is the date that spot
> instances were converted to m3.medium. The same tests running on osx didn't
> experience a change in failure rates.
This looks very promising so far. There have been 0 of these Gu timeouts on b2g-inbound and mozilla-inbound today on spot instances; the only occurrences have been on on-demand instances.
Comment 17•11 years ago
Yeah, the only thing that keeps it from being a no-questions absolutely perfect success is the nasty surprise of bug 970239 being m1.medium only, but that's still a deal I'd take in a heartbeat: throw away two tests in order to get back the entire suites and platforms that we hid over this.
Blocks: 970239
Comment 18•11 years ago
So, I declare this experiment a success (comment #17 notwithstanding). Rail, can we move the on-demand instances back to m1.medium too? Then, we can unhide several test suites that have been hidden since around Feb 3.
Since the new node type is more efficient, we should try to acquire some engineering resources to help identify the source of the hangs.
Assignee
Comment 19•11 years ago
I pushed http://hg.mozilla.org/build/cloud-tools/rev/1e8ba299c4ab#l1.42 to let automation change the instance type of newly started instances. This may take some time. I'll take a look at the instance type breakdown tomorrow.
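The actual change is the linked cloud-tools revision; purely as an illustration of the idea (not of that code), a per-pool, config-driven type choice applied when a stopped instance is started might look like the sketch below, with the pool names, mapping, and helper invented for the example.

```python
# Illustrative sketch only: pick the instance type from a per-pool config and
# apply it before starting a stopped instance. The pool names and mapping are
# invented; the real logic lives in the linked cloud-tools revision.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region is an assumption

# Flipping a value here is what drives a conversion in either direction.
DESIRED_TYPE_BY_POOL = {
    "tst-linux64-spot": "m1.medium",       # hypothetical pool names
    "tst-linux64-ondemand": "m1.medium",
}

def start_with_desired_type(instance_id, pool):
    """Align a stopped instance's type with its pool config, then start it."""
    desired = DESIRED_TYPE_BY_POOL[pool]
    inst = ec2.describe_instances(
        InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]
    if inst["State"]["Name"] == "stopped" and inst["InstanceType"] != desired:
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            InstanceType={"Value": desired},
        )
    ec2.start_instances(InstanceIds=[instance_id])
```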
Assignee
Comment 20•11 years ago
m3.medium: 88 / m1.medium: 167
Assignee
Comment 21•11 years ago
This is done now. Zarro m3.medium instances.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 22•11 years ago
So, after failing to reproduce bug 926264 on a freshly installed m3.medium with Ubuntu 12.04, Xvfb, and Unity, I tried on a loaner, in case my fresh install didn't quite match our actual test environment. Guess what? I've been equally unable to reproduce it. Whatever is happening on m3.mediums that made us back out this change is not triggered by repeatedly running the same test. I'm afraid the only way to find what's wrong here would be to catch those timeouts while they are happening... in production. Could we perhaps set up duplicate test jobs on a small pool of m3.mediums?
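As a rough illustration of the "repeatedly running the same test" attempt described above (not the actual harness invocation), a loop like this could be left running on a loaner to count hangs; the command line, timeout, and run count are placeholders.

```python
# Illustrative sketch only: hammer one suspect test on a loaner and count
# per-run timeouts. The command line, timeout, and run count are placeholders.
import subprocess

TEST_CMD = ["./mach", "test", "path/to/suspect-test"]  # placeholder invocation
TIMEOUT_S = 330                                        # placeholder per-run timeout
RUNS = 200

def hammer():
    hangs = failures = 0
    for i in range(RUNS):
        try:
            result = subprocess.run(TEST_CMD, capture_output=True, timeout=TIMEOUT_S)
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            hangs += 1
            print(f"run {i}: timed out after {TIMEOUT_S}s")
    print(f"{RUNS} runs: {hangs} hangs, {failures} non-timeout failures")

if __name__ == "__main__":
    hammer()
```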
Updated•7 years ago
Component: General Automation → General