Closed Bug 949028 Opened 11 years ago Closed 11 years ago

gaia-ui tests need to dump a stack when the process crashes

Categories

(Release Engineering :: General, defect)

Product:

Component:

Platform:

x86

macOS

Type:

defect

Priority:

Not set

Severity:

normal

Tracking

(b2g-v1.3 fixed)

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

b2g-v1.3

---

fixed

People

(Reporter: bholley, Assigned: ahal)

References

Details

Attachments

(3 files, 1 obsolete file)

Patch 1.0 - mozharness should pass symbols to gaia ui tests 11 years ago Andrew Halberstadt [:ahal] (deleted), patch	jgriffin : review+	Details \| Diff \| Splinter Review
Make sure marionette geckoinstance checks for crashes 11 years ago Andrew Halberstadt [:ahal] (deleted), patch		Details \| Diff \| Splinter Review
Make sure marionette geckoinstance checks for crashes 11 years ago Andrew Halberstadt [:ahal] (deleted), patch	mdas : review+	Details \| Diff \| Splinter Review
Add check_for_crash() to MarionetteTestCase's tearDown method 11 years ago Andrew Halberstadt [:ahal] (deleted), patch	mdas : review+	Details \| Diff \| Splinter Review

Bobby Holley (:bholley)

Reporter

Description

•

11 years ago

This bit us in bug 937317. The only recourse was to get Zac to run the tests locally (I tried, and failed to reproduce the crash), then have him link to the crash on socorro, then have bsmedberg send bz the minidump, then have bz run minidump_stackwalk on it to get the crash results. I think we need to do better here. I'll also send some mail to get this added to the TBPL job visibility policy.

Bobby Holley (:bholley)

Reporter

Updated

•

11 years ago

Blocks: 948395

Andrew Halberstadt [:ahal]

Assignee

Comment 1

•

11 years ago

I'll take a look when I have some time.

Assignee: nobody → ahalberstadt

Status: NEW → ASSIGNED

Andrew Halberstadt [:ahal]

Assignee

Comment 2

•

11 years ago

The base marionette harness is set up to check for crashes (http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#731) but it looks like it never receives a symbols_path from mozharness. We'll likely need to modify gaiatest to forward the symbols_path on as well. Note that none of the logs from bug 937317 are available anymore, so I'll be feeling around in the dark a bit.

Andrew Halberstadt [:ahal]

Assignee

Comment 3

•

11 years ago

Attached patch Patch 1.0 - mozharness should pass symbols to gaia ui tests (deleted) — Details — Splinter Review

Looks like gaiatest just passes in cli args directly to the marionette runner which already has a --symbols-path option, so no changes should be needed there. I believe this is all that's missing. Though since this is fairly difficult to test in production, and I can't see the original crashing logs anymore, I can't guarantee this will solve all crash stack issues. That being said, this patch needs to land either way.

Attachment #8356816 - Flags: review?(jgriffin)

Andrew Halberstadt [:ahal]

Assignee

Updated

•

11 years ago

Component: General → General Automation

Product: Testing → Release Engineering

QA Contact: catlee

Bobby Holley (:bholley)

Reporter

Comment 4

•

11 years ago

You should be able to reproduce the crash by removing the null check here: http://mxr.mozilla.org/mozilla-central/source/dom/bindings/CallbackObject.cpp#105

Andrew Halberstadt [:ahal]

Assignee

Comment 5

•

11 years ago

Thanks, here's a try run: https://tbpl.mozilla.org/?tree=Try&rev=85ff590e740e I'll do some re-triggers both before and after landing my patch to see if it helps.

Jonathan Griffin (:jgriffin)

Comment 6

•

11 years ago

Comment on attachment 8356816 [details] [diff] [review] Patch 1.0 - mozharness should pass symbols to gaia ui tests Review of attachment 8356816 [details] [diff] [review]: ----------------------------------------------------------------- Looks reasonable to me too.

Attachment #8356816 - Flags: review?(jgriffin) → review+

Andrew Halberstadt [:ahal]

Assignee

Comment 7

•

11 years ago

I passed in emulator instead of b2g desktop to that last try run by accident: https://tbpl.mozilla.org/?tree=Try&rev=94769f1a9489

Andrew Halberstadt [:ahal]

Assignee

Comment 8

•

11 years ago

https://hg.mozilla.org/build/mozharness/rev/25b3d7cf9535 Try won't pick this up until the next reconfig, so should be enough time to get some re-triggers going to try and reproduce the crash.

Armen [:armenzg]

Comment 9

•

11 years ago

Merged mozharness (not getting CCed to this bug).

Andrew Halberstadt [:ahal]

Assignee

Comment 10

•

11 years ago

It looks like the try run reproduced it: https://tbpl.mozilla.org/php/getParsedLog.php?id=32706648&tree=Try&full=1 If that's the same crash, then there is still no stacktrace in the logs. At least the harness is receiving the symbols now. Either it isn't a crash, crashreporter isn't enabled, or we aren't checking for crashes in the right place (though after a brief look at the runner, I'm pretty sure we are). I'll have to test this out some more on Ash to see what's going on.

Comment 11

•

11 years ago

Ahal if I recall from this failure what happens is the crash reporter pops up and it's 'modal' of sorts so no more data can come in and out of the system. I had to clear the crash report prompt for it to be able to continue. I think if you can the test locally that is what you would see. This may be a totally separate bug - Marionette or gaiatest not disabling the crash reporter, or not making it crash silently.

Andrew Halberstadt [:ahal]

Assignee

Comment 12

•

11 years ago

Oh ok, that helps a lot. I think all we need to do is set the CRASHREPORTER_NO_REPORT env in the b2g process then. If the marionette runner had been using mozbase, this would have been done for us, but I guess it isn't.

Andrew Halberstadt [:ahal]

Assignee

Comment 13

•

11 years ago

Here's a new try run with check_for_crashes added to b2g desktop builds as well: https://tbpl.mozilla.org/?tree=Try&rev=c1aae692f012 I realized after pushing I can just use runner.check_for_crashes, I'll update then when I upload the patch.

Andrew Halberstadt [:ahal]

Assignee

Comment 14

•

11 years ago

Attached patch Make sure marionette geckoinstance checks for crashes (obsolete) (deleted) — Details — Splinter Review

Missed the last try results due to try being reset. Here's a new push along with the patch: https://tbpl.mozilla.org/?tree=Try&rev=2bbd8eba68de

Andrew Halberstadt [:ahal]

Assignee

Comment 15

•

11 years ago

Hmm, I thought that bustage in the last try run was something unrelated to my patch, but here's another with the same problem: https://tbpl.mozilla.org/?tree=Try&rev=d137a2b0f187 And Gu is green before my push. I have no idea how my patch could be causing that...

Comment 16

•

11 years ago

Andrew, I have a linux64 machine so I'll run the test locally today and see what it is, but a hunch is that it's just failing to start up b2g at all.

Comment 17

•

11 years ago

hmm, works fine locally for me, both starting up the binary and using it locally (no errors in the console, Marionette reported that the port is open and "Marionette server ready". Running tests was fine, too. Sorry!

Andrew Halberstadt [:ahal]

Assignee

Comment 18

•

11 years ago

(In reply to Zac C (:zac) from comment #17) > hmm, works fine locally for me, both starting up the binary and using it > locally (no errors in the console, Marionette reported that the port is open > and "Marionette server ready". > Running tests was fine, too. > Sorry! Thanks for checking. The potential for my patch to cause this failure is there because I'm modifying the code that starts the emulator. I just don't understand why passing in some environment variables would stop it from coming up. I'll need to spend some more time debugging. It could be a bug in mozrunner.

Andrew Halberstadt [:ahal]

Assignee

Comment 19

•

11 years ago

Er, not emulator, b2g desktop binary.

Andrew Halberstadt [:ahal]

Assignee

Comment 20

•

11 years ago

Attached patch Make sure marionette geckoinstance checks for crashes (deleted) — Details — Splinter Review

I see what's going on (was also able to reproduce with desktop marionette tests as well). In LocalRunner if you don't pass in an env, it'll default to os.environ.copy(): https://github.com/mozilla/mozbase/blob/master/mozrunner/mozrunner/local.py#L123 I'm not sure if this is a bug or not.. the behaviour is definitely weird and unintuitive. I kind of just want to leave it as is as that way there's no risk of breaking something else depending on it. Anyway, here's an updated try run with the crasher patch: https://tbpl.mozilla.org/?tree=Try&rev=c431328b8cc8

Attachment #8366692 - Attachment is obsolete: true

Comment 21

•

11 years ago

Out of curiosity, how did you replicate it locally? (wondering what I did differently).

Andrew Halberstadt [:ahal]

Assignee

Comment 22

•

11 years ago

Did you have my patch applied? I just tried it with Firefox desktop marionette tests (which use the same code path), didn't try with gaia-ui-tests, so maybe that's why. Also it only seemed to fail on linux, osx worked fine in the try run.

Comment 23

•

11 years ago

I just downloaded the build straight from TBPL and ran Gu locally in the way I thought TBPL did but I must have been missing something.

Andrew Halberstadt [:ahal]

Assignee

Comment 24

•

11 years ago

(In reply to Zac C (:zac) from comment #23) > I just downloaded the build straight from TBPL and ran Gu locally in the way > I thought TBPL did but I must have been missing something. The error was caused by my patch in comment 20, so if you didn't have that applied to marionette, that's why you didn't see it :) I made a typo in my last try run, didn't pluralize gaia-ui-test (filed bug 966260 to make it so pluralization doesn't matter). Maybe one of these days I'll get a useful try run on this bug: https://tbpl.mozilla.org/?tree=Try&rev=34c10c663b34

Comment 25

•

11 years ago

I was trying to replicate the wait_for_port timeout from comment#20. Never mind let's see how c#24 goes :)

Andrew Halberstadt [:ahal]

Assignee

Comment 26

•

11 years ago

(In reply to Zac C (:zac) from comment #25) > I was trying to replicate the wait_for_port timeout from comment#20. > Never mind let's see how c#24 goes :) Yeah, the wait for port timeout was caused by my patch :)

Andrew Halberstadt [:ahal]

Assignee

Comment 27

•

11 years ago

Comment on attachment 8368133 [details] [diff] [review] Make sure marionette geckoinstance checks for crashes I give up, that try run still isn't scheduling Gu and I have no idea why. Neither gaia-ui-test or gaia-ui-tests seem to work, must be a bug in trychooser. I tested this patch locally by running desktop Mn tests, and it at least doesn't cause bustage there. So time to just review and land.

Attachment #8368133 - Flags: review?(mdas)

Malini Das [:mdas] - Away, not checking bugmail

Comment 28

•

11 years ago

Comment on attachment 8368133 [details] [diff] [review] Make sure marionette geckoinstance checks for crashes Review of attachment 8368133 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for doing this! ::: testing/marionette/client/marionette/geckoinstance.py @@ +43,5 @@ > os.remove(self.gecko_log) > + > + env = os.environ.copy() > + env.update({ 'MOZ_CRASHREPORTER': '1', > + 'MOZ_CRASHREPORTER_NO_REPORT': '1', }) Mind adding a comment here about what these two variables do? If I didn't know anything about this code, and just read this bit... then these variable names and values are confusing since they sound like they contradict each other.

Attachment #8368133 - Flags: review?(mdas) → review+

Andrew Halberstadt [:ahal]

Assignee

Comment 29

•

11 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/cdc2338e71a3 Unfortunately I got that try run working, reproduced the crash and still no stack. This is another piece in place, but I'm not done yet. Looks like the harness needs to call check for crashes somewhere. https://tbpl.mozilla.org/?tree=Try&rev=5830aa168a5d

Whiteboard: [leave-open]

Andrew Halberstadt [:ahal]

Assignee

Comment 30

•

11 years ago

So apparently marionette has two runners (the second being based on unittest.py).. and the gaia-ui-tests use the latter (even though runtests.py in gaiatest uses the base marionette runner?). Anyway, the unittest version doesn't check for crashes anywhere.

Malini Das [:mdas] - Away, not checking bugmail

Comment 31

•

11 years ago

We cleared this up in irc, but fyi, there is one base runner (http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py) provided by marionette and it is based on unittest.py. The base runner doesn't have any test cases it can run by default, so it doesn't really do anything until you set test_handlers for it. Marionette's Runtests.py and Gaia-ui-tests extend this runner and give it test_handlers so you can actually use it.

Ryan VanderMeulen [:RyanVM]

Comment 32

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/cdc2338e71a3

Andrew Halberstadt [:ahal]

Assignee

Comment 33

•

11 years ago

Yeah, ignore comment 30 :). What was tripping me up was I though the function run_test in the base runner ran a single test, when it actually runs a suite of tests. This meant we were only checking for crashes after the whole suite finished (and in this case buildbot kills the job before that happens). So I added a check for crashes call to the tearDown method of the test case, but that still didn't work. Here's an in-progress try run where I added some extra debugging info to maybe help see what's going on: https://tbpl.mozilla.org/?tree=Try&rev=b958f8609c44 Jgriffin, when this works, it might help with that linux only Gu socket timeout we're seeing.

Andrew Halberstadt [:ahal]

Assignee

Updated

•

11 years ago

Blocks: 967816

Comment 34

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #33) > So I added a check for crashes call to the tearDown method of the test case, > but that still didn't work. Here's an in-progress try run where I added some > extra debugging info to maybe help see what's going on: > https://tbpl.mozilla.org/?tree=Try&rev=b958f8609c44 > > Jgriffin, when this works, it might help with that linux only Gu socket > timeout we're seeing. In that run the crash there, I don't think it's the one you're trying to invoke, but it does look more like the linux only Gu socket timeout.

Andrew Halberstadt [:ahal]

Assignee

Comment 35

•

11 years ago

That's a good point. The diagnostics I added confirmed that we were checking for crashes after each test so it's likely that the socket timeout simply isn't a crash. Maybe I can test it on osx since that doesn't seem to suffer from the socket timeout.

Andrew Halberstadt [:ahal]

Assignee

Comment 36

•

11 years ago

Malini, in my previous try run, I added check_for_crashes to the cleanTest() method [1], but this got called about 2-3 times between each test as seen in [2]. Ideally this would only get called when we detect something went wrong. I'm trying to figure out where that would be, but the maze of of try/except/else blocks [3] is making it very difficult to figure out where this is. Do you have any ideas? [1] https://hg.mozilla.org/try/rev/a41f614974d9 [2] https://tbpl.mozilla.org/php/getParsedLog.php?id=34098874&tree=Try&full=1#error3 [3] http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#136

Flags: needinfo?(mdas)

Andrew Halberstadt [:ahal]

Assignee

Comment 37

•

11 years ago

Same try run as last time except on osx in an attempt to avoid the socket timeout: https://tbpl.mozilla.org/?tree=Try&rev=2a42b52a8426

Malini Das [:mdas] - Away, not checking bugmail

Comment 38

•

11 years ago

I looked at this on friday and the ideal thing would be to put this in MarionetteTestCase's tearDown, but that's dependent on being able to detect if a test passed or failed, and from unittest's documentation, it doesn't look like you can determine that within tearDown. I don't see anything provided by unittest that does what we want to do in this scenario. The next best thing may be to add this check in MarionetteTestResult(http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/runner/base.py#37), maybe in the add_test_result or addError/addFailure methods. It isn't ideal, since this is where result processing should be but it may be the closest thing we can get.

Flags: needinfo?(mdas)

(not currently active) Ted Mielczarek

Comment 39

•

11 years ago

It should be fine to check for crashes regardless of the success of the test. We do this unilaterally in desktop harnesses: http://mxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#951 This has the nice property that you can catch crashes from child processes that didn't cause obvious failures and other things like that.

Malini Das [:mdas] - Away, not checking bugmail

Comment 40

•

11 years ago

in that case, then tearDown() would be the right place for it. It should be called only once per test function

Andrew Halberstadt [:ahal]

Assignee

Comment 41

•

11 years ago

Thanks for the clarification. I wasn't sure about tearDown because there's an ominous comment referencing a bug, but it's probably unrelated.

Malini Das [:mdas] - Away, not checking bugmail

Comment 42

•

11 years ago

(In reply to Andrew Halberstadt [:ahal] from comment #41) > Thanks for the clarification. I wasn't sure about tearDown because there's > an ominous comment referencing a bug, but it's probably unrelated. ah to further clarify, I'd add it here http://mxr.mozilla.org/mozilla-central/source/testing/marionette/client/marionette/marionette_test.py#339, so that we can diagnose the crash before the TEST-END logline is output. Also, mind removing the ominous bug comment from the other tearDown method in your patch? That's been resolved :)

Andrew Halberstadt [:ahal]

Assignee

Comment 43

•

11 years ago

Attached patch Add check_for_crash() to MarionetteTestCase's tearDown method (deleted) — Details — Splinter Review

Also removed some dead code I noticed.

Attachment #8373451 - Flags: review?(mdas)

Malini Das [:mdas] - Away, not checking bugmail

Comment 44

•

11 years ago

Comment on attachment 8373451 [details] [diff] [review] Add check_for_crash() to MarionetteTestCase's tearDown method Review of attachment 8373451 [details] [diff] [review]: ----------------------------------------------------------------- looks good, thanks! You may want to run these patches in try to make sure check_for_crash behaves sanely in all environments since I haven't tried these locally.

Attachment #8373451 - Flags: review?(mdas) → review+

Andrew Halberstadt [:ahal]

Assignee

Comment 45

•

11 years ago

It got tested on b2g desktop on try already. And check_for_crash just calls through to mozcrash, so I think it should be good to go without further try testing: https://hg.mozilla.org/integration/mozilla-inbound/rev/11c2e5f4fe48

Ryan VanderMeulen [:RyanVM]

Comment 46

•

11 years ago

https://hg.mozilla.org/mozilla-central/rev/11c2e5f4fe48

Andrew Halberstadt [:ahal]

Assignee

Comment 47

•

11 years ago

I was unable to reproduce the crash on try: https://tbpl.mozilla.org/?tree=Try&rev=2a42b52a8426 I'm going to call this fixed for now. All the pieces should be in place, but I'll keep an eye out for crashes in Gu on tbpl and re-open if I don't see a stack.

Status: ASSIGNED → RESOLVED

Closed: 11 years ago

Resolution: --- → FIXED

Malini Das [:mdas] - Away, not checking bugmail

Updated

•

11 years ago

status-b2g-v1.3: --- → fixed

Whiteboard: [leave-open]

Nobody; OK to take it and work on it

Updated

•

7 years ago

Component: General Automation → General

You need to log in before you can comment on or make changes to this bug.