Closed Bug 949028 Opened 11 years ago Closed 11 years ago

gaia-ui tests need to dump a stack when the process crashes

Categories

(Release Engineering :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(b2g-v1.3 fixed)

RESOLVED FIXED
Tracking Status
b2g-v1.3 --- fixed

People

(Reporter: bholley, Assigned: ahal)

References

Details

Attachments

(3 files, 1 obsolete file)

This bit us in bug 937317. The only recourse was to get Zac to run the tests locally (I tried, and failed to reproduce the crash), then have him link to the crash on socorro, then have bsmedberg send bz the minidump, then have bz run minidump_stackwalk on it to get the crash results. I think we need to do better here. I'll also send some mail to get this added to the TBPL job visibility policy.
Blocks: 948395
I'll take a look when I have some time.
Assignee: nobody → ahalberstadt
Status: NEW → ASSIGNED
The base marionette harness is set up to check for crashes (http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#731) but it looks like it never receives a symbols_path from mozharness. We'll likely need to modify gaiatest to forward the symbols_path on as well. Note that none of the logs from bug 937317 are available anymore, so I'll be feeling around in the dark a bit.
Looks like gaiatest just passes in cli args directly to the marionette runner which already has a --symbols-path option, so no changes should be needed there. I believe this is all that's missing. Though since this is fairly difficult to test in production, and I can't see the original crashing logs anymore, I can't guarantee this will solve all crash stack issues. That being said, this patch needs to land either way.
Attachment #8356816 - Flags: review?(jgriffin)
Component: General → General Automation
Product: Testing → Release Engineering
QA Contact: catlee
You should be able to reproduce the crash by removing the null check here: http://mxr.mozilla.org/mozilla-central/source/dom/bindings/CallbackObject.cpp#105
Thanks, here's a try run: https://tbpl.mozilla.org/?tree=Try&rev=85ff590e740e I'll do some re-triggers both before and after landing my patch to see if it helps.
Comment on attachment 8356816 [details] [diff] [review] Patch 1.0 - mozharness should pass symbols to gaia ui tests Review of attachment 8356816 [details] [diff] [review]: ----------------------------------------------------------------- Looks reasonable to me too.
Attachment #8356816 - Flags: review?(jgriffin) → review+
I passed in emulator instead of b2g desktop to that last try run by accident: https://tbpl.mozilla.org/?tree=Try&rev=94769f1a9489
https://hg.mozilla.org/build/mozharness/rev/25b3d7cf9535 Try won't pick this up until the next reconfig, so should be enough time to get some re-triggers going to try and reproduce the crash.
Merged mozharness (not getting CCed to this bug).
It looks like the try run reproduced it: https://tbpl.mozilla.org/php/getParsedLog.php?id=32706648&tree=Try&full=1 If that's the same crash, then there is still no stacktrace in the logs. At least the harness is receiving the symbols now. Either it isn't a crash, crashreporter isn't enabled, or we aren't checking for crashes in the right place (though after a brief look at the runner, I'm pretty sure we are). I'll have to test this out some more on Ash to see what's going on.
Ahal if I recall from this failure what happens is the crash reporter pops up and it's 'modal' of sorts so no more data can come in and out of the system. I had to clear the crash report prompt for it to be able to continue. I think if you can the test locally that is what you would see. This may be a totally separate bug - Marionette or gaiatest not disabling the crash reporter, or not making it crash silently.
Oh ok, that helps a lot. I think all we need to do is set the CRASHREPORTER_NO_REPORT env in the b2g process then. If the marionette runner had been using mozbase, this would have been done for us, but I guess it isn't.
Here's a new try run with check_for_crashes added to b2g desktop builds as well: https://tbpl.mozilla.org/?tree=Try&rev=c1aae692f012 I realized after pushing I can just use runner.check_for_crashes, I'll update then when I upload the patch.
Missed the last try results due to try being reset. Here's a new push along with the patch: https://tbpl.mozilla.org/?tree=Try&rev=2bbd8eba68de
Hmm, I thought that bustage in the last try run was something unrelated to my patch, but here's another with the same problem: https://tbpl.mozilla.org/?tree=Try&rev=d137a2b0f187 And Gu is green before my push. I have no idea how my patch could be causing that...
Andrew, I have a linux64 machine so I'll run the test locally today and see what it is, but a hunch is that it's just failing to start up b2g at all.
hmm, works fine locally for me, both starting up the binary and using it locally (no errors in the console, Marionette reported that the port is open and "Marionette server ready". Running tests was fine, too. Sorry!
(In reply to Zac C (:zac) from comment #17) > hmm, works fine locally for me, both starting up the binary and using it > locally (no errors in the console, Marionette reported that the port is open > and "Marionette server ready". > Running tests was fine, too. > Sorry! Thanks for checking. The potential for my patch to cause this failure is there because I'm modifying the code that starts the emulator. I just don't understand why passing in some environment variables would stop it from coming up. I'll need to spend some more time debugging. It could be a bug in mozrunner.
Er, not emulator, b2g desktop binary.
I see what's going on (was also able to reproduce with desktop marionette tests as well). In LocalRunner if you don't pass in an env, it'll default to os.environ.copy(): https://github.com/mozilla/mozbase/blob/master/mozrunner/mozrunner/local.py#L123 I'm not sure if this is a bug or not.. the behaviour is definitely weird and unintuitive. I kind of just want to leave it as is as that way there's no risk of breaking something else depending on it. Anyway, here's an updated try run with the crasher patch: https://tbpl.mozilla.org/?tree=Try&rev=c431328b8cc8
Attachment #8366692 - Attachment is obsolete: true
Out of curiosity, how did you replicate it locally? (wondering what I did differently).
Did you have my patch applied? I just tried it with Firefox desktop marionette tests (which use the same code path), didn't try with gaia-ui-tests, so maybe that's why. Also it only seemed to fail on linux, osx worked fine in the try run.
I just downloaded the build straight from TBPL and ran Gu locally in the way I thought TBPL did but I must have been missing something.
(In reply to Zac C (:zac) from comment #23) > I just downloaded the build straight from TBPL and ran Gu locally in the way > I thought TBPL did but I must have been missing something. The error was caused by my patch in comment 20, so if you didn't have that applied to marionette, that's why you didn't see it :) I made a typo in my last try run, didn't pluralize gaia-ui-test (filed bug 966260 to make it so pluralization doesn't matter). Maybe one of these days I'll get a useful try run on this bug: https://tbpl.mozilla.org/?tree=Try&rev=34c10c663b34
I was trying to replicate the wait_for_port timeout from comment#20. Never mind let's see how c#24 goes :)
(In reply to Zac C (:zac) from comment #25) > I was trying to replicate the wait_for_port timeout from comment#20. > Never mind let's see how c#24 goes :) Yeah, the wait for port timeout was caused by my patch :)
Comment on attachment 8368133 [details] [diff] [review] Make sure marionette geckoinstance checks for crashes I give up, that try run still isn't scheduling Gu and I have no idea why. Neither gaia-ui-test or gaia-ui-tests seem to work, must be a bug in trychooser. I tested this patch locally by running desktop Mn tests, and it at least doesn't cause bustage there. So time to just review and land.
Attachment #8368133 - Flags: review?(mdas)
Comment on attachment 8368133 [details] [diff] [review] Make sure marionette geckoinstance checks for crashes Review of attachment 8368133 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for doing this! ::: testing/marionette/client/marionette/geckoinstance.py @@ +43,5 @@ > os.remove(self.gecko_log) > + > + env = os.environ.copy() > + env.update({ 'MOZ_CRASHREPORTER': '1', > + 'MOZ_CRASHREPORTER_NO_REPORT': '1', }) Mind adding a comment here about what these two variables do? If I didn't know anything about this code, and just read this bit... then these variable names and values are confusing since they sound like they contradict each other.
Attachment #8368133 - Flags: review?(mdas) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/cdc2338e71a3 Unfortunately I got that try run working, reproduced the crash and still no stack. This is another piece in place, but I'm not done yet. Looks like the harness needs to call check for crashes somewhere. https://tbpl.mozilla.org/?tree=Try&rev=5830aa168a5d
Whiteboard: [leave-open]
So apparently marionette has two runners (the second being based on unittest.py).. and the gaia-ui-tests use the latter (even though runtests.py in gaiatest uses the base marionette runner?). Anyway, the unittest version doesn't check for crashes anywhere.
We cleared this up in irc, but fyi, there is one base runner (http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py) provided by marionette and it is based on unittest.py. The base runner doesn't have any test cases it can run by default, so it doesn't really do anything until you set test_handlers for it. Marionette's Runtests.py and Gaia-ui-tests extend this runner and give it test_handlers so you can actually use it.
Yeah, ignore comment 30 :). What was tripping me up was I though the function run_test in the base runner ran a single test, when it actually runs a suite of tests. This meant we were only checking for crashes after the whole suite finished (and in this case buildbot kills the job before that happens). So I added a check for crashes call to the tearDown method of the test case, but that still didn't work. Here's an in-progress try run where I added some extra debugging info to maybe help see what's going on: https://tbpl.mozilla.org/?tree=Try&rev=b958f8609c44 Jgriffin, when this works, it might help with that linux only Gu socket timeout we're seeing.
Blocks: 967816
(In reply to Andrew Halberstadt [:ahal] from comment #33) > So I added a check for crashes call to the tearDown method of the test case, > but that still didn't work. Here's an in-progress try run where I added some > extra debugging info to maybe help see what's going on: > https://tbpl.mozilla.org/?tree=Try&rev=b958f8609c44 > > Jgriffin, when this works, it might help with that linux only Gu socket > timeout we're seeing. In that run the crash there, I don't think it's the one you're trying to invoke, but it does look more like the linux only Gu socket timeout.
That's a good point. The diagnostics I added confirmed that we were checking for crashes after each test so it's likely that the socket timeout simply isn't a crash. Maybe I can test it on osx since that doesn't seem to suffer from the socket timeout.
Malini, in my previous try run, I added check_for_crashes to the cleanTest() method [1], but this got called about 2-3 times between each test as seen in [2]. Ideally this would only get called when we detect something went wrong. I'm trying to figure out where that would be, but the maze of of try/except/else blocks [3] is making it very difficult to figure out where this is. Do you have any ideas? [1] https://hg.mozilla.org/try/rev/a41f614974d9 [2] https://tbpl.mozilla.org/php/getParsedLog.php?id=34098874&tree=Try&full=1#error3 [3] http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#136
Flags: needinfo?(mdas)
Same try run as last time except on osx in an attempt to avoid the socket timeout: https://tbpl.mozilla.org/?tree=Try&rev=2a42b52a8426
I looked at this on friday and the ideal thing would be to put this in MarionetteTestCase's tearDown, but that's dependent on being able to detect if a test passed or failed, and from unittest's documentation, it doesn't look like you can determine that within tearDown. I don't see anything provided by unittest that does what we want to do in this scenario. The next best thing may be to add this check in MarionetteTestResult(http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#37), maybe in the add_test_result or addError/addFailure methods. It isn't ideal, since this is where result processing should be but it may be the closest thing we can get.
Flags: needinfo?(mdas)
It should be fine to check for crashes regardless of the success of the test. We do this unilaterally in desktop harnesses: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#951 This has the nice property that you can catch crashes from child processes that didn't cause obvious failures and other things like that.
in that case, then tearDown() would be the right place for it. It should be called only once per test function
Thanks for the clarification. I wasn't sure about tearDown because there's an ominous comment referencing a bug, but it's probably unrelated.
(In reply to Andrew Halberstadt [:ahal] from comment #41) > Thanks for the clarification. I wasn't sure about tearDown because there's > an ominous comment referencing a bug, but it's probably unrelated. ah to further clarify, I'd add it here http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#339, so that we can diagnose the crash before the TEST-END logline is output. Also, mind removing the ominous bug comment from the other tearDown method in your patch? That's been resolved :)
Also removed some dead code I noticed.
Attachment #8373451 - Flags: review?(mdas)
Comment on attachment 8373451 [details] [diff] [review] Add check_for_crash() to MarionetteTestCase's tearDown method Review of attachment 8373451 [details] [diff] [review]: ----------------------------------------------------------------- looks good, thanks! You may want to run these patches in try to make sure check_for_crash behaves sanely in all environments since I haven't tried these locally.
Attachment #8373451 - Flags: review?(mdas) → review+
It got tested on b2g desktop on try already. And check_for_crash just calls through to mozcrash, so I think it should be good to go without further try testing: https://hg.mozilla.org/integration/mozilla-inbound/rev/11c2e5f4fe48
I was unable to reproduce the crash on try: https://tbpl.mozilla.org/?tree=Try&rev=2a42b52a8426 I'm going to call this fixed for now. All the pieces should be in place, but I'll keep an eye out for crashes in Gu on tbpl and re-open if I don't see a stack.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [leave-open]
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: