Closed
Bug 812395
Opened 12 years ago
Closed 12 years ago
Make mozharness emulator scripts re-try the run if a failure occurs during gecko installation
Categories
(Release Engineering :: Applications: MozharnessCore, defect)
Release Engineering
Applications: MozharnessCore
All
Gonk (Firefox OS)
Tracking
(firefox18 fixed, firefox19 fixed, firefox20 fixed)
RESOLVED
FIXED
B2G C2 (20nov-10dec)
People
(Reporter: jgriffin, Assigned: jgriffin)
References
Details
Attachments
(3 files, 3 obsolete files)
(deleted),
patch
|
mozilla
:
review+
jgriffin
:
checked-in+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mozilla
:
review+
jgriffin
:
checked-in+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
jgriffin
:
review+
|
Details | Diff | Splinter Review |
This is in response to bug 809437.
It currently seems like the fastest way (but by no means the best way) to get tests unhidden in TBPL again is to change the mozharness script so that it can detect when the test run it has invoked fails due to a problem installing gecko into the emulator, and re-try it.
Longer-term solutions are to either identify and fix the B2G or emulator bug that's causing this problem (which would require B2G dev support), or to use a full emulator build and avoid the necessity of installing gecko into the emulator during the test run.
I have a patch for this that I'm currently debugging; hopefully it will be ready to try Friday.
Assignee | ||
Updated•12 years ago
|
Target Milestone: --- → B2G C2 (20nov-10dec)
Assignee | ||
Updated•12 years ago
|
Assignee: nobody → jgriffin
Assignee | ||
Comment 1•12 years ago
|
||
This comes in two patches:
- a gecko patch which changes emulator.py to fail gracefully during install_gecko, so instead of raising an exception, it prints the exception, prints "Error installing gecko!", and exits with sys.exit(). This is done to avoid tripping mozharness error detection.
- a patch for the marionette mozharness script which runs Marionette, and if it sees "Error installing gecko!", it re-runs it, up to 5 times, before bailing. No ERROR lines appear in the log during install_gecko failures, unless it fails for all 5 times.
This is a little yucky (well, probably more than a little), but I think it has a high likelihood of working. If this works, I'll need to tweak things a bit more so it works with mochitest/reftest.
To test it in production, I propose landing the mozharness script (which should have no effect on trees where the corresponding gecko patch hasn't landed) and then landing the gecko changes on cedar and retriggering several times (we haven't had much luck reproducing the failure in question on try).
Assignee | ||
Comment 2•12 years ago
|
||
Attachment #682582 -
Flags: review?(aki)
Assignee | ||
Comment 3•12 years ago
|
||
Attachment #682584 -
Flags: review?(ahalberstadt)
Attachment #682584 -
Flags: feedback?(aki)
Comment 4•12 years ago
|
||
Comment on attachment 682584 [details] [diff] [review]
gecko patch
Review of attachment 682584 [details] [diff] [review]:
-----------------------------------------------------------------
Someday someone is going to look at this and go: http://i.imgur.com/UOdjD.png
Patch looks good, but my guess is you made this off cedar. This will bitrot as soon m-c gets merged. Anyway, I think it is a little excessive to have the whole retry self._restart_b2g as well as the retry at the mozharness level. We might want to even get rid of the sleeps (which is what's currently in m-c).
nit: add this bug number somewhere
Attachment #682584 -
Flags: review?(ahalberstadt) → review+
Assignee | ||
Comment 5•12 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #4)
> Comment on attachment 682584 [details] [diff] [review]
> gecko patch
>
> Review of attachment 682584 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> Someday someone is going to look at this and go: http://i.imgur.com/UOdjD.png
I think we already look like that. :)
>
> Patch looks good, but my guess is you made this off cedar. This will bitrot
> as soon m-c gets merged. Anyway, I think it is a little excessive to have
> the whole retry self._restart_b2g as well as the retry at the mozharness
> level. We might want to even get rid of the sleeps (which is what's
> currently in m-c).
>
> nit: add this bug number somewhere
Yes, I'll take out the additional attempts at restarting b2g, and add this bug # to the comments. Then I'll land on cedar and we'll see how it does.
Comment 6•12 years ago
|
||
Comment on attachment 682582 [details] [diff] [review]
Mozharness patch
If you want additional retry, you can set self.buildbot_status(TBPL_RETRY) and the job will retry automatically in buildbot. This would go above the self.fatal() in the else: . (This would require importing TBPL_RETRY as well.)
r=me either way.
Attachment #682582 -
Flags: review?(aki) → review+
Comment 7•12 years ago
|
||
Comment on attachment 682582 [details] [diff] [review]
Mozharness patch
Also, this fixes marionette-webapi but we might need something similar for b2g_emulator_unittest.py .
Comment 8•12 years ago
|
||
Comment on attachment 682584 [details] [diff] [review]
gecko patch
I wish we could limit the except: to one or a handful of exceptions, but I'll take it if it helps here :)
Attachment #682584 -
Flags: feedback?(aki) → feedback+
Assignee | ||
Comment 9•12 years ago
|
||
(In reply to Aki Sasaki [:aki] from comment #7)
> Comment on attachment 682582 [details] [diff] [review]
> Mozharness patch
>
> Also, this fixes marionette-webapi but we might need something similar for
> b2g_emulator_unittest.py .
Yes, I'll work on that next if this doesn't completely blow something up.
Assignee | ||
Comment 10•12 years ago
|
||
(In reply to Aki Sasaki [:aki] from comment #6)
> Comment on attachment 682582 [details] [diff] [review]
> Mozharness patch
>
> If you want additional retry, you can set self.buildbot_status(TBPL_RETRY)
> and the job will retry automatically in buildbot. This would go above the
> self.fatal() in the else: . (This would require importing TBPL_RETRY as
> well.)
>
> r=me either way.
Cool, I'll add that!
Comment 11•12 years ago
|
||
(In reply to Aki Sasaki [:aki] from comment #8)
> Comment on attachment 682584 [details] [diff] [review]
> gecko patch
>
> I wish we could limit the except: to one or a handful of exceptions, but
> I'll take it if it helps here :)
Sadly over the course of bug 809437 I've seen about 4-5 different exceptions thrown from that chunk of code.
p.s. it kind of feels like we are breaking the tests out of the matrix here. :)
Assignee | ||
Comment 12•12 years ago
|
||
Comment on attachment 682582 [details] [diff] [review]
Mozharness patch
http://hg.mozilla.org/build/mozharness/rev/b97f11b392ef, with changes suggested by aki
Attachment #682582 -
Flags: checked-in+
Assignee | ||
Comment 13•12 years ago
|
||
Carry r+ forward. Updated per aki's and ahal's comments.
Attachment #682584 -
Attachment is obsolete: true
Attachment #682609 -
Flags: review+
Assignee | ||
Comment 14•12 years ago
|
||
I've landed this on cedar. I'll retrigger the Mn tests a few times, and if all is well, I'll land on inbound, then get to work on corresponding changes for mochitest/reftest.
https://tbpl.mozilla.org/?tree=Cedar&rev=6d2ff516cd68
Assignee | ||
Comment 15•12 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #11)
> (In reply to Aki Sasaki [:aki] from comment #8)
> > Comment on attachment 682584 [details] [diff] [review]
> > gecko patch
> >
> > I wish we could limit the except: to one or a handful of exceptions, but
> > I'll take it if it helps here :)
>
> Sadly over the course of bug 809437 I've seen about 4-5 different exceptions
> thrown from that chunk of code.
>
> p.s. it kind of feels like we are breaking the tests out of the matrix here.
> :)
On the cedar version I'm only catching DMError and MarionetteException (which will catch call subclasses of that). This accounts for the majority of failures; if we see exceptions not covered by those, I'll add them.
Assignee | ||
Comment 16•12 years ago
|
||
So far so good on cedar. Here's the corresponding mozharness patch for emulator unittests.
Attachment #682675 -
Flags: review?(aki)
Updated•12 years ago
|
Attachment #682675 -
Flags: review?(aki) → review+
Assignee | ||
Comment 17•12 years ago
|
||
Comment on attachment 682675 [details] [diff] [review]
mozharness patch for unittests
http://hg.mozilla.org/build/mozharness/rev/1a82918622af
Attachment #682675 -
Flags: checked-in+
Assignee | ||
Comment 18•12 years ago
|
||
This patch works with reftest and mochitests as well. It also adds a convenience method to Marionette, getMarionetteOrExit, to handle all of the dirty work, so we don't have to repeat similar code elsewhere.
Attachment #682609 -
Attachment is obsolete: true
Attachment #682686 -
Flags: review?(ahalberstadt)
Assignee | ||
Comment 19•12 years ago
|
||
Landed newest gecko patch on cedar to see how it works: https://tbpl.mozilla.org/?tree=Cedar&rev=e4fd39c0ee1b
Comment 20•12 years ago
|
||
Comment on attachment 682686 [details] [diff] [review]
gecko patch
Review of attachment 682686 [details] [diff] [review]:
-----------------------------------------------------------------
r+. Personally I think I preferred just printing the string and bailing at the source. But no point in delaying this over bikeshedding
::: testing/marionette/client/marionette/emulator.py
@@ +411,5 @@
> + time.sleep(5)
> + self.dm.shellCheckOutput(['stop', 'b2g'])
> + time.sleep(10)
> + self.dm.shellCheckOutput(['start', 'b2g'])
> + time.sleep(5)
nit: we should probably remove at least the first and last sleep (not sure why I put them in in the first place) in the interest of saving 10sec/job on the linux slaves. I'm not even sure if the middle sleep makes much of a difference.
Attachment #682686 -
Flags: review?(ahalberstadt) → review+
Assignee | ||
Comment 21•12 years ago
|
||
This job on cedar triggered the retry code and it worked:
https://tbpl.mozilla.org/php/getParsedLog.php?id=17160532&tree=Cedar&full=1
I'm going to land this on inbound after taking out the extra sleeps.
One of the things that we'll have to watch for is that each retry can add up to 10 min to the test run. So if the test run is close to the 60min cutoff, adding a retry or two could push it over that threshold and cause the test run to timeout. We should make sure our test chunks normally don't run over 40 min or so, I think.
Assignee | ||
Comment 22•12 years ago
|
||
Remove extra sleeps; carry r+ forward.
Attachment #682686 -
Attachment is obsolete: true
Attachment #683172 -
Flags: review+
Assignee | ||
Comment 23•12 years ago
|
||
Comment 24•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Comment 25•12 years ago
|
||
Were going to need this on aurora and beta still
Whiteboard: [automation-needed-in-aurora][automation-needed-in-beta]
Comment 26•12 years ago
|
||
Reopening to land on those branches
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 27•12 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/221d91248314
Will land beta once tests are passing in aurora.
Whiteboard: [automation-needed-in-aurora][automation-needed-in-beta] → [automation-needed-in-beta]
Comment 28•12 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
status-firefox18:
--- → fixed
status-firefox19:
--- → fixed
status-firefox20:
--- → fixed
Resolution: --- → FIXED
Whiteboard: [automation-needed-in-beta]
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
Updated•10 years ago
|
Component: General Automation → Mozharness
You need to log in
before you can comment on or make changes to this bug.
Description
•