Closed
Bug 949028
Opened 11 years ago
Closed 11 years ago
gaia-ui tests need to dump a stack when the process crashes
Categories
(Release Engineering :: General, defect)
Tracking
(b2g-v1.3 fixed)
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
b2g-v1.3 | --- | fixed |
People
(Reporter: bholley, Assigned: ahal)
References
Details
Attachments
(3 files, 1 obsolete file)
(deleted),
patch
|
jgriffin
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mdas
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mdas
:
review+
|
Details | Diff | Splinter Review |
This bit us in bug 937317.
The only recourse was to get Zac to run the tests locally (I tried, and failed to reproduce the crash), then have him link to the crash on socorro, then have bsmedberg send bz the minidump, then have bz run minidump_stackwalk on it to get the crash results. I think we need to do better here.
I'll also send some mail to get this added to the TBPL job visibility policy.
Assignee | ||
Comment 1•11 years ago
|
||
I'll take a look when I have some time.
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
Assignee | ||
Comment 2•11 years ago
|
||
The base marionette harness is set up to check for crashes (http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#731) but it looks like it never receives a symbols_path from mozharness. We'll likely need to modify gaiatest to forward the symbols_path on as well.
Note that none of the logs from bug 937317 are available anymore, so I'll be feeling around in the dark a bit.
Assignee | ||
Comment 3•11 years ago
|
||
Looks like gaiatest just passes in cli args directly to the marionette runner which already has a --symbols-path option, so no changes should be needed there.
I believe this is all that's missing. Though since this is fairly difficult to test in production, and I can't see the original crashing logs anymore, I can't guarantee this will solve all crash stack issues. That being said, this patch needs to land either way.
Attachment #8356816 -
Flags: review?(jgriffin)
Assignee | ||
Updated•11 years ago
|
Component: General → General Automation
Product: Testing → Release Engineering
QA Contact: catlee
Reporter | ||
Comment 4•11 years ago
|
||
You should be able to reproduce the crash by removing the null check here:
http://mxr.mozilla.org/mozilla-central/source/dom/bindings/CallbackObject.cpp#105
Assignee | ||
Comment 5•11 years ago
|
||
Thanks, here's a try run: https://tbpl.mozilla.org/?tree=Try&rev=85ff590e740e
I'll do some re-triggers both before and after landing my patch to see if it helps.
Comment 6•11 years ago
|
||
Comment on attachment 8356816 [details] [diff] [review]
Patch 1.0 - mozharness should pass symbols to gaia ui tests
Review of attachment 8356816 [details] [diff] [review]:
-----------------------------------------------------------------
Looks reasonable to me too.
Attachment #8356816 -
Flags: review?(jgriffin) → review+
Assignee | ||
Comment 7•11 years ago
|
||
I passed in emulator instead of b2g desktop to that last try run by accident:
https://tbpl.mozilla.org/?tree=Try&rev=94769f1a9489
Assignee | ||
Comment 8•11 years ago
|
||
https://hg.mozilla.org/build/mozharness/rev/25b3d7cf9535
Try won't pick this up until the next reconfig, so should be enough time to get some re-triggers going to try and reproduce the crash.
Comment 9•11 years ago
|
||
Merged mozharness (not getting CCed to this bug).
Assignee | ||
Comment 10•11 years ago
|
||
It looks like the try run reproduced it:
https://tbpl.mozilla.org/php/getParsedLog.php?id=32706648&tree=Try&full=1
If that's the same crash, then there is still no stacktrace in the logs. At least the harness is receiving the symbols now. Either it isn't a crash, crashreporter isn't enabled, or we aren't checking for crashes in the right place (though after a brief look at the runner, I'm pretty sure we are). I'll have to test this out some more on Ash to see what's going on.
Comment 11•11 years ago
|
||
Ahal if I recall from this failure what happens is the crash reporter pops up and it's 'modal' of sorts so no more data can come in and out of the system. I had to clear the crash report prompt for it to be able to continue.
I think if you can the test locally that is what you would see.
This may be a totally separate bug - Marionette or gaiatest not disabling the crash reporter, or not making it crash silently.
Assignee | ||
Comment 12•11 years ago
|
||
Oh ok, that helps a lot. I think all we need to do is set the CRASHREPORTER_NO_REPORT env in the b2g process then. If the marionette runner had been using mozbase, this would have been done for us, but I guess it isn't.
Assignee | ||
Comment 13•11 years ago
|
||
Here's a new try run with check_for_crashes added to b2g desktop builds as well:
https://tbpl.mozilla.org/?tree=Try&rev=c1aae692f012
I realized after pushing I can just use runner.check_for_crashes, I'll update then when I upload the patch.
Assignee | ||
Comment 14•11 years ago
|
||
Missed the last try results due to try being reset. Here's a new push along with the patch:
https://tbpl.mozilla.org/?tree=Try&rev=2bbd8eba68de
Assignee | ||
Comment 15•11 years ago
|
||
Hmm, I thought that bustage in the last try run was something unrelated to my patch, but here's another with the same problem:
https://tbpl.mozilla.org/?tree=Try&rev=d137a2b0f187
And Gu is green before my push. I have no idea how my patch could be causing that...
Comment 16•11 years ago
|
||
Andrew, I have a linux64 machine so I'll run the test locally today and see what it is, but a hunch is that it's just failing to start up b2g at all.
Comment 17•11 years ago
|
||
hmm, works fine locally for me, both starting up the binary and using it locally (no errors in the console, Marionette reported that the port is open and "Marionette server ready".
Running tests was fine, too.
Sorry!
Assignee | ||
Comment 18•11 years ago
|
||
(In reply to Zac C (:zac) from comment #17)
> hmm, works fine locally for me, both starting up the binary and using it
> locally (no errors in the console, Marionette reported that the port is open
> and "Marionette server ready".
> Running tests was fine, too.
> Sorry!
Thanks for checking. The potential for my patch to cause this failure is there because I'm modifying the code that starts the emulator. I just don't understand why passing in some environment variables would stop it from coming up. I'll need to spend some more time debugging. It could be a bug in mozrunner.
Assignee | ||
Comment 19•11 years ago
|
||
Er, not emulator, b2g desktop binary.
Assignee | ||
Comment 20•11 years ago
|
||
I see what's going on (was also able to reproduce with desktop marionette tests as well). In LocalRunner if you don't pass in an env, it'll default to os.environ.copy():
https://github.com/mozilla/mozbase/blob/master/mozrunner/mozrunner/local.py#L123
I'm not sure if this is a bug or not.. the behaviour is definitely weird and unintuitive. I kind of just want to leave it as is as that way there's no risk of breaking something else depending on it.
Anyway, here's an updated try run with the crasher patch:
https://tbpl.mozilla.org/?tree=Try&rev=c431328b8cc8
Attachment #8366692 -
Attachment is obsolete: true
Comment 21•11 years ago
|
||
Out of curiosity, how did you replicate it locally? (wondering what I did differently).
Assignee | ||
Comment 22•11 years ago
|
||
Did you have my patch applied? I just tried it with Firefox desktop marionette tests (which use the same code path), didn't try with gaia-ui-tests, so maybe that's why. Also it only seemed to fail on linux, osx worked fine in the try run.
Comment 23•11 years ago
|
||
I just downloaded the build straight from TBPL and ran Gu locally in the way I thought TBPL did but I must have been missing something.
Assignee | ||
Comment 24•11 years ago
|
||
(In reply to Zac C (:zac) from comment #23)
> I just downloaded the build straight from TBPL and ran Gu locally in the way
> I thought TBPL did but I must have been missing something.
The error was caused by my patch in comment 20, so if you didn't have that applied to marionette, that's why you didn't see it :)
I made a typo in my last try run, didn't pluralize gaia-ui-test (filed bug 966260 to make it so pluralization doesn't matter). Maybe one of these days I'll get a useful try run on this bug:
https://tbpl.mozilla.org/?tree=Try&rev=34c10c663b34
Comment 25•11 years ago
|
||
I was trying to replicate the wait_for_port timeout from comment#20.
Never mind let's see how c#24 goes :)
Assignee | ||
Comment 26•11 years ago
|
||
(In reply to Zac C (:zac) from comment #25)
> I was trying to replicate the wait_for_port timeout from comment#20.
> Never mind let's see how c#24 goes :)
Yeah, the wait for port timeout was caused by my patch :)
Assignee | ||
Comment 27•11 years ago
|
||
Comment on attachment 8368133 [details] [diff] [review]
Make sure marionette geckoinstance checks for crashes
I give up, that try run still isn't scheduling Gu and I have no idea why. Neither gaia-ui-test or gaia-ui-tests seem to work, must be a bug in trychooser. I tested this patch locally by running desktop Mn tests, and it at least doesn't cause bustage there. So time to just review and land.
Attachment #8368133 -
Flags: review?(mdas)
Comment 28•11 years ago
|
||
Comment on attachment 8368133 [details] [diff] [review]
Make sure marionette geckoinstance checks for crashes
Review of attachment 8368133 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for doing this!
::: testing/marionette/client/marionette/geckoinstance.py
@@ +43,5 @@
> os.remove(self.gecko_log)
> +
> + env = os.environ.copy()
> + env.update({ 'MOZ_CRASHREPORTER': '1',
> + 'MOZ_CRASHREPORTER_NO_REPORT': '1', })
Mind adding a comment here about what these two variables do? If I didn't know anything about this code, and just read this bit... then these variable names and values are confusing since they sound like they contradict each other.
Attachment #8368133 -
Flags: review?(mdas) → review+
Assignee | ||
Comment 29•11 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/cdc2338e71a3
Unfortunately I got that try run working, reproduced the crash and still no stack. This is another piece in place, but I'm not done yet. Looks like the harness needs to call check for crashes somewhere.
https://tbpl.mozilla.org/?tree=Try&rev=5830aa168a5d
Whiteboard: [leave-open]
Assignee | ||
Comment 30•11 years ago
|
||
So apparently marionette has two runners (the second being based on unittest.py).. and the gaia-ui-tests use the latter (even though runtests.py in gaiatest uses the base marionette runner?). Anyway, the unittest version doesn't check for crashes anywhere.
Comment 31•11 years ago
|
||
We cleared this up in irc, but fyi, there is one base runner (http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py) provided by marionette and it is based on unittest.py. The base runner doesn't have any test cases it can run by default, so it doesn't really do anything until you set test_handlers for it. Marionette's Runtests.py and Gaia-ui-tests extend this runner and give it test_handlers so you can actually use it.
Comment 32•11 years ago
|
||
Assignee | ||
Comment 33•11 years ago
|
||
Yeah, ignore comment 30 :).
What was tripping me up was I though the function run_test in the base runner ran a single test, when it actually runs a suite of tests. This meant we were only checking for crashes after the whole suite finished (and in this case buildbot kills the job before that happens).
So I added a check for crashes call to the tearDown method of the test case, but that still didn't work. Here's an in-progress try run where I added some extra debugging info to maybe help see what's going on:
https://tbpl.mozilla.org/?tree=Try&rev=b958f8609c44
Jgriffin, when this works, it might help with that linux only Gu socket timeout we're seeing.
Comment 34•11 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #33)
> So I added a check for crashes call to the tearDown method of the test case,
> but that still didn't work. Here's an in-progress try run where I added some
> extra debugging info to maybe help see what's going on:
> https://tbpl.mozilla.org/?tree=Try&rev=b958f8609c44
>
> Jgriffin, when this works, it might help with that linux only Gu socket
> timeout we're seeing.
In that run the crash there, I don't think it's the one you're trying to invoke, but it does look more like the linux only Gu socket timeout.
Assignee | ||
Comment 35•11 years ago
|
||
That's a good point. The diagnostics I added confirmed that we were checking for crashes after each test so it's likely that the socket timeout simply isn't a crash. Maybe I can test it on osx since that doesn't seem to suffer from the socket timeout.
Assignee | ||
Comment 36•11 years ago
|
||
Malini, in my previous try run, I added check_for_crashes to the cleanTest() method [1], but this got called about 2-3 times between each test as seen in [2]. Ideally this would only get called when we detect something went wrong. I'm trying to figure out where that would be, but the maze of of try/except/else blocks [3] is making it very difficult to figure out where this is. Do you have any ideas?
[1] https://hg.mozilla.org/try/rev/a41f614974d9
[2] https://tbpl.mozilla.org/php/getParsedLog.php?id=34098874&tree=Try&full=1#error3
[3] http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#136
Flags: needinfo?(mdas)
Assignee | ||
Comment 37•11 years ago
|
||
Same try run as last time except on osx in an attempt to avoid the socket timeout:
https://tbpl.mozilla.org/?tree=Try&rev=2a42b52a8426
Comment 38•11 years ago
|
||
I looked at this on friday and the ideal thing would be to put this in MarionetteTestCase's tearDown, but that's dependent on being able to detect if a test passed or failed, and from unittest's documentation, it doesn't look like you can determine that within tearDown. I don't see anything provided by unittest that does what we want to do in this scenario.
The next best thing may be to add this check in MarionetteTestResult(http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#37), maybe in the add_test_result or addError/addFailure methods. It isn't ideal, since this is where result processing should be but it may be the closest thing we can get.
Flags: needinfo?(mdas)
Comment 39•11 years ago
|
||
It should be fine to check for crashes regardless of the success of the test. We do this unilaterally in desktop harnesses:
http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#951
This has the nice property that you can catch crashes from child processes that didn't cause obvious failures and other things like that.
Comment 40•11 years ago
|
||
in that case, then tearDown() would be the right place for it. It should be called only once per test function
Assignee | ||
Comment 41•11 years ago
|
||
Thanks for the clarification. I wasn't sure about tearDown because there's an ominous comment referencing a bug, but it's probably unrelated.
Comment 42•11 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #41)
> Thanks for the clarification. I wasn't sure about tearDown because there's
> an ominous comment referencing a bug, but it's probably unrelated.
ah to further clarify, I'd add it here http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#339, so that we can diagnose the crash before the TEST-END logline is output.
Also, mind removing the ominous bug comment from the other tearDown method in your patch? That's been resolved :)
Assignee | ||
Comment 43•11 years ago
|
||
Also removed some dead code I noticed.
Attachment #8373451 -
Flags: review?(mdas)
Comment 44•11 years ago
|
||
Comment on attachment 8373451 [details] [diff] [review]
Add check_for_crash() to MarionetteTestCase's tearDown method
Review of attachment 8373451 [details] [diff] [review]:
-----------------------------------------------------------------
looks good, thanks!
You may want to run these patches in try to make sure check_for_crash behaves sanely in all environments since I haven't tried these locally.
Attachment #8373451 -
Flags: review?(mdas) → review+
Assignee | ||
Comment 45•11 years ago
|
||
It got tested on b2g desktop on try already. And check_for_crash just calls through to mozcrash, so I think it should be good to go without further try testing:
https://hg.mozilla.org/integration/mozilla-inbound/rev/11c2e5f4fe48
Comment 46•11 years ago
|
||
Assignee | ||
Comment 47•11 years ago
|
||
I was unable to reproduce the crash on try: https://tbpl.mozilla.org/?tree=Try&rev=2a42b52a8426
I'm going to call this fixed for now. All the pieces should be in place, but I'll keep an eye out for crashes in Gu on tbpl and re-open if I don't see a stack.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•11 years ago
|
status-b2g-v1.3:
--- → fixed
Whiteboard: [leave-open]
Updated•7 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•