Closed Bug 616383 Opened 14 years ago Closed 14 years ago

mozmill tests need to disable crash reporter

Categories

(Testing Graveyard :: Mozmill, defect)

defect
Not set
critical

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: nthomas, Assigned: k0scist)

References

Details

(Whiteboard: [mozmill-2.0+][mozmill-1.5.2+])

Attachments

(3 files, 1 obsolete file)

I was checking talos-r3-leopard slaves and found a couple disconnected from buildbot. What appears to have happened is Firefox crashed doing Mozmill tests and the CrashReporter fired. For tests/talos there are environment variables set so that crash reporter doesn't run on a crash and I think we should do that here too. Perhaps there's an environment we're not setting on the step ?
The twistd.log on the slave has: 2010-11-30 12:22:46-0800 [Broker,client] ShellCommand._startCommand 2010-11-30 12:22:46-0800 [Broker,client] bash -c bin/python -E bin/mozmill --show-all -b .././MinefieldDebug.app/Contents/MacOS/firefox-bin -t tests/firefox 2010-11-30 12:22:46-0800 [Broker,client] in dir /Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill (timeout 600 secs) 2010-11-30 12:22:46-0800 [Broker,client] watching logfiles {} 2010-11-30 12:22:46-0800 [Broker,client] argv: ['bash', '-c', 'bin/python -E bin/mozmill --show-all -b .././MinefieldDebug.app/Contents/MacOS/firefox-bin -t tests/firefox'] 2010-11-30 12:22:46-0800 [Broker,client] environment: {'SHELL': '/bin/bash', 'PYTHONPATH': '/Library/Python/2.5/site-packages', 'SSH_AUTH_SOCK': '/tmp/launch-dDzU1h/Listeners', 'PWD': '/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill', 'Apple_PubSub_Socket_Render': '/tmp/launch-n5kpjE/Render', 'LOGNAME': 'cltbld', 'USER': 'cltbld', 'PATH': '/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'HOME': '/Users/cltbld', 'DISPLAY': '/tmp/launch-9XYYV7/:0', 'TMPDIR': '/var/folders/Xr/Xr--yJnSEY0U11ET5NZuMU+++TM/-Tmp-/'} 2010-11-30 12:22:46-0800 [Broker,client] closing stdin 2010-11-30 12:22:46-0800 [Broker,client] using PTY: False 2010-11-30 12:33:57-0800 [-] command timed out: 600 seconds without output 2010-11-30 12:33:57-0800 [-] self.process has no pid 2010-11-30 12:33:57-0800 [-] trying process.signalProcess('KILL') 2010-11-30 12:33:57-0800 [-] Unhandled Error Traceback (most recent call last): File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/Twisted-9.0.0-py2.5-macosx-10.5-i386.egg/twisted/application/app.py", line 445, in startReactor self.config, oldstdout, oldstderr, self.profiler, reactor) File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/Twisted-9.0.0-py2.5-macosx-10.5-i386.egg/twisted/application/app.py", line 348, in runReactorWithLogging reactor.run() File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/Twisted-9.0.0-py2.5-macosx-10.5-i386.egg/twisted/internet/base.py", line 1166, in run self.mainLoop() File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/Twisted-9.0.0-py2.5-macosx-10.5-i386.egg/twisted/internet/base.py", line 1175, in mainLoop self.runUntilCurrent() --- <exception caught here> --- File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/Twisted-9.0.0-py2.5-macosx-10.5-i386.egg/twisted/internet/base.py", line 779, in runUntilCurrent call.func(*call.args, **call.kw) File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/buildbot-0.8.0-py2.5.egg/buildbot/slave/commands/base.py", line 726, in doTimeout self.kill(msg) File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/buildbot-0.8.0-py2.5.egg/buildbot/slave/commands/base.py", line 791, in kill self.process.signalProcess(self.KILL) File "/tools/buildbot-0.8.0/lib/python2.5/site-packages/Twisted-9.0.0-py2.5-macosx-10.5-i386.egg/twisted/internet/process.py", line 333, in signalProcess raise ProcessExitedAlready() twisted.internet.error.ProcessExitedAlready:
The mozmill log had: ###!!! ASSERTION: You can't dereference a NULL nsCOMPtr with operator->().: 'mRawPtr != 0', file ../../dist/include/nsCOMPtr.h, line 819 Timeout: bridge.describe("9ed1bf46-fcbf-11df-9bd7-d49a20d94aca", bridge.registry["{87695e32-a877-f644-90c6-1045460647dc}"]["appInfo"]) Traceback (most recent call last): File "bin/mozmill", line 8, in <module> load_entry_point('mozmill==1.5.1rc2', 'console_scripts', 'mozmill')() File "/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 811, in cli CLI().run() File "/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 758, in run self.mozmill.start(runner=runner, profile=runner.profile) File "/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 244, in start self.appinfo = self.get_appinfo(self.bridge) File "/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 334, in get_appinfo 'application_version': str(appInfo.version), File "/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill/lib/python2.5/site-packages/jsbridge/jsobjects.py", line 94, in __getattr__ attributes = self._bridge_.describe(self._name_)['attributes'] File "/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill/lib/python2.5/site-packages/jsbridge/network.py", line 231, in describe return self.run(_uuid, 'bridge.describe('+', '.join([encoder.encode(_uuid), obj_name])+')') File "/Users/cltbld/talos-slave/mozilla-central_leopard-o-debug_test-mozmill-all/build/mozmill/lib/python2.5/site-packages/jsbridge/network.py", line 195, in run raise JSBridgeDisconnectError("Connection timed out") jsbridge.network.JSBridgeDisconnectError: Connection timed out This was for mozmill on leopard, debug build on rev 2af96a164c62.
The buildbot error is handled in https://github.com/buildbot/buildbot/commit/91343c2ba3fa840cca3d358087878a72ad8085c4 but that was only committed a month ago, so it will be in 0.8.3. So this is a two-pronged error: buildbot is failing to recover from a botched kill (but would have been fine if the process exited on its own), and the test code didn't exit. What's confusing is that the process exited already - that should cause a call to self.finish when Buildbot gets the SIGCHLD.
When I ran ps there wasn't a firefox process but there was a crashreporter one. There's probably something tricksy going on so that crashreporter can catch the memory state at the Firefox crash.
Is this an OS specific issue or are the flags incorrectly set?
Whiteboard: [mozmill-2.0?][mozmill-1.5.2?]
Not sure. I had thought we were setting MOZ_CRASHREPORTER_NO_REPORT=1 in the environment for all tests/talos, but it turns out the test suites are as well http://mxr.mozilla.org/mozilla-central/search?string=MOZ_CRASHREPORTER_NO_REPORT so perhaps the fix belongs in Mozmill.
How would this work in the future if we plan to also run some crash tests with Mozmill? Tests which are hard to implement as a crashtest itself? How are operating the current crash tests to retrieve the stack? Or don't they and just fail with a short crash message in the logs?
So it seems it is open ended what to do here. Currently, mozrunner tries to aggressively kill crashreporter: https://github.com/mozautomation/mozmill/blob/master/mozrunner/mozrunner/runner.py#L56 https://github.com/mozautomation/mozmill/blob/master/mozrunner/mozrunner/runner.py#L199 This uses the very aggressive kill_process_by_name function that, on unix, will (attempt to) kill any process with crashreporter anyway in the output of `ps`: https://github.com/mozautomation/mozmill/blob/master/mozprocess/mozprocess/kill.py#L44 For this reason, I don't particularly like this function. It could be made smarter, but since this is the only place its used, I would advocate removing it, removing the aggressively_kill stuff, and setting the environment variable in the harness class, MozMill. I'm not sure what the status is on if crashrepoter is used or not. If not, it is my inclination to make it work when it is needed versus future proofing, not only because it isn't needed now but because we can't perceive of how we'll need it to work ahead of time and a need-driven solution is better than a future proofed one. If that's not the case, there are several alternatives: - is the kill_process_by_name failing or is it not being called? If the latter, we could try to tackle making stopping conditions safer for 1.5.2. Anytime a harness failure occurs (e.g. a JSBridgeDisconnect), the cleanup code should be called. - We could set the environment variable in the buildbotcustom code. Again, in my opinion, this is more of a harness issue than a buildbot issue. I'd assume use the simplest solution, which is remove the existing messy code and set the environment variable in the harness. I'm not sure if this breaks anything.
I agree that we shouldn't choose a solution for now based on an unspecified future. The bottom line is that slave capacity is being (slowly) diminished because of crash reporter on mozmill jobs, and we need to avoid that. Thank you for offering to set the environment variable in Mozmill. It may also be that there is a crash under Mozmill that should be investigated (or it might just be a crash we're hitting elsewhere). Looks like we're already downloading symbols, and you can look at the other tests to see how they run the minidump stack walker to resolve a crash.
Component: Release Engineering → Mozmill
Product: mozilla.org → Testing
QA Contact: release → mozmill
Version: other → Trunk
OS: Mac OS X → All
We should solve this for 1.5.2, and take the crashreporter killing code out of mozrunner, and do whatever the other tests do to handle this issue. Also, we need to look into the tests on buildbot tools-staging and see if there is a reprodeucible crash that we are exposing.
Whiteboard: [mozmill-2.0?][mozmill-1.5.2?] → [mozmill-2.0+][mozmill-1.5.2+]
Assignee: nobody → jhammel
Hardware: x86 → All
Attached patch hacky fix for 1.5.2 (deleted) — Splinter Review
This is a decent workaround. It removes the killing code in MozMill class. Currently, all internal execution paths will behave correctly. But it is *NOT* code-path proof, so if someone used the API badly...this problem could show up. I doubt that's the case though. For 2.0, this should probably be done more carefully
Attachment #496208 - Flags: review?(ctalbert)
(In reply to comment #7) > How would this work in the future if we plan to also run some crash tests with > Mozmill? Tests which are hard to implement as a crashtest itself? How are > operating the current crash tests to retrieve the stack? Or don't they and just > fail with a short crash message in the logs? If there are someday crash tests for mozmill to test the crash reporter, then those will have to run in automation outside of buildbot, I imagine. If the process crashes, we should use minidumpStackWalk [1] to retrieve the stack for the logs like the other test harnesses do. [1]: http://mxr.mozilla.org/mozilla-central/source/build/automationutils.py#118
(In reply to comment #11) > Created attachment 496208 [details] [diff] [review] > hacky fix for 1.5.2 > > This is a decent workaround. It removes the killing code in MozMill class. > Currently, all internal execution paths will behave correctly. But it is *NOT* > code-path proof, so if someone used the API badly...this problem could show up. > I doubt that's the case though. > > For 2.0, this should probably be done more carefully This is also untested (the code is tested, but the fix isn't) but it should work
Attached patch proposed fix for master (obsolete) (deleted) — Splinter Review
basically a port of the 1.5.2 fix. Again, untested.
Attachment #496268 - Flags: review?(ctalbert)
see also bug 592750
Attached patch proposed fix for master (deleted) — Splinter Review
oops, i generated the patch wrongly. here is the correct version
Attachment #496268 - Attachment is obsolete: true
Attachment #496274 - Flags: review?(ctalbert)
Attachment #496268 - Flags: review?(ctalbert)
Comment on attachment 496208 [details] [diff] [review] hacky fix for 1.5.2 >diff --git a/mozmill/mozmill/__init__.py b/mozmill/mozmill/__init__.py >index 9a4af6e..da20811 100644 >--- a/mozmill/mozmill/__init__.py >+++ b/mozmill/mozmill/__init__.py >@@ -175,6 +175,10 @@ class MozMill(object): > self.add_listener(self.startTest_listener, eventType='mozmill.setTest') > self.add_listener(self.userShutdown_listener, eventType='mozmill.userShutdown') > >+ # disable the crashreporter >+ os.environ['MOZ_CRASHREPORTER_NO_REPORT'] = '1' >+ os.putenv('MOZ_CRASHREPORTER_NO_REPORT', '1') # just to be sure >+ I think even on 2.4 environ should be enough. let's test this on the 3 os's we care about + 2.4 and make sure. Rest of it looks good. r+ with removing the putenv.
Attachment #496208 - Flags: review?(ctalbert) → review+
Comment on attachment 496274 [details] [diff] [review] proposed fix for master r+ with removing the putenv.
Attachment #496274 - Flags: review?(ctalbert) → review+
pushed to master: https://github.com/mozautomation/mozmill/commit/e41b2cf738e304156720a87ac2eddb1cb8d9f86d It might be worthwhile to open another bug to figure out the correct going-forward strategy for crashreporting. Currently we just throw away the reports, but it might be worthwhile e.g. doing what mochitest does (as :ted suggested)
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #20) > It might be worthwhile to open another bug to figure out the correct > going-forward strategy for crashreporting. Currently we just throw away the > reports, but it might be worthwhile e.g. doing what mochitest does (as :ted > suggested) Yes, please get this on file. I have no idea what ted suggested offline from this bug, so you should preferable create the bug.
Aaron, could you please verify this fix with the webgl crash you experience on Linux? Thanks.
Verified fixed on 1.5.2pre Tested on Linux with the known crash (bug 622850), where the crash occurs without the reporter. Output visible from last visible pass to exit of Mozmill below (removed the crash stack): TEST-START | /home/aaronmt/Mozilla/mozmill-tests/firefox/testPrivateBrowsing/testAboutPrivateBrowsing.js | testCheckPrivateBrowsingMode failed to create drawable ###!!! ABORT: X_FreeGC: BadGC (invalid GC parameter); 3 requests ago: file /builds/slave/cen-lnx64-ntly/build/toolkit/xre/nsX11ErrorHandler.cpp, line 190 TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed INFO Passed: 5 INFO Failed: 1 INFO Skipped: 0
Status: RESOLVED → VERIFIED
Is this code in production ? I'm still seeing a lot of machines blocked by prompts to submit a crash, and if I jump into /Users/cltbld/talos-slave/test/build/mozmill/bin activate the virtualenv, and call 'mozmill --version' I get '1.5.1rc2'.
And is there a bug for the steps in comment #10 to catch the crash-under-mozmill in staging ?
Nick, it's not on production yet. The version which will include this fix will be Mozmill 1.5.2. Current ETA for release is by end of this week or early next week. Clint can give in-detail information.
OK. Next time a bug is RESOLVED FIXED but the change is not actually deployed could you guys please explicitly note that 1) it's not deployed, 2) when it will be.
(In reply to comment #28) > OK. Next time a bug is RESOLVED FIXED but the change is not actually deployed > could you guys please explicitly note that 1) it's not deployed, 2) when it > will be. Sorry, about that Nick. The bug you want to watch for the full deployment of 1.5.2 onto mozilla-central is bug 618924.
Depends on: 618924
I had to hit "quit Firefox" twice on a snow leopard machine for the job to continue. This is quite important to fix as it puts our slaves out of action and it cannot be noticed easily. What is the new ETA for this to be in production? bash -c bin/python -E bin/mozmill --show-all -b .././MinefieldDebug.app/Contents/MacOS/firefox-bin -t tests/firefox in dir /Users/cltbld/talos-slave/test/build/mozmill (timeout 600 secs) watching logfiles {} argv: ['bash', '-c', 'bin/python -E bin/mozmill --show-all -b .././MinefieldDebug.app/Contents/MacOS/firefox-bin -t tests/firefox'] environment: Apple_PubSub_Socket_Render=/tmp/launch-O5Rr6Z/Render CVS_RSH=ssh DISPLAY=/tmp/launch-GgqdXm/:0 HOME=/Users/cltbld LOGNAME=cltbld PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin PWD=/Users/cltbld/talos-slave/test/build/mozmill PYTHONPATH=/Library/Python/2.5/site-packages SHELL=/bin/bash SSH_AUTH_SOCK=/tmp/launch-gcUuZd/Listeners TMPDIR=/var/folders/H5/H5TD8hgwEqKq9hgKlayjWU+++TM/-Tmp-/ USER=cltbld VERSIONER_PYTHON_PREFER_32_BIT=no VERSIONER_PYTHON_VERSION=2.6 __CF_USER_TEXT_ENCODING=0x1F6:0:0 closing stdin using PTY: False nsNativeModuleLoader::LoadModule("/Users/cltbld/talos-slave/test/build/MinefieldDebug.app/Contents/MacOS/components/libxpcomsample.dylib") - load FAILED, rv: 80004005, error: <unknown; can't get error from NSPR> Traceback (most recent call last): File "bin/mozmill", line 8, in <module> load_entry_point('mozmill==1.5.1rc2', 'console_scripts', 'mozmill')() File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 811, in cli CLI().run() File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 758, in run self.mozmill.start(runner=runner, profile=runner.profile) File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 243, in start self.create_network() File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 214, in create_network self.jsbridge_port) File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/jsbridge/__init__.py", line 72, in wait_and_create_network raise Exception("Sorry, cannot connect to jsbridge extension, port %s" % port) Exception: Sorry, cannot connect to jsbridge extension, port 24242 command timed out: 600 seconds without output (view as text) bash -c bin/python -E bin/mozmill-restart --show-all -b .././MinefieldDebug.app/Contents/MacOS/firefox-bin -t tests/firefox/restartTests in dir /Users/cltbld/talos-slave/test/build/mozmill (timeout 300 secs) watching logfiles {} argv: ['bash', '-c', 'bin/python -E bin/mozmill-restart --show-all -b .././MinefieldDebug.app/Contents/MacOS/firefox-bin -t tests/firefox/restartTests'] environment: Apple_PubSub_Socket_Render=/tmp/launch-O5Rr6Z/Render CVS_RSH=ssh DISPLAY=/tmp/launch-GgqdXm/:0 HOME=/Users/cltbld LOGNAME=cltbld PATH=/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin PWD=/Users/cltbld/talos-slave/test/build/mozmill PYTHONPATH=/Library/Python/2.5/site-packages SHELL=/bin/bash SSH_AUTH_SOCK=/tmp/launch-gcUuZd/Listeners TMPDIR=/var/folders/H5/H5TD8hgwEqKq9hgKlayjWU+++TM/-Tmp-/ USER=cltbld VERSIONER_PYTHON_PREFER_32_BIT=no VERSIONER_PYTHON_VERSION=2.6 __CF_USER_TEXT_ENCODING=0x1F6:0:0 closing stdin using PTY: False nsNativeModuleLoader::LoadModule("/Users/cltbld/talos-slave/test/build/MinefieldDebug.app/Contents/MacOS/components/libxpcomsample.dylib") - load FAILED, rv: 80004005, error: <unknown; can't get error from NSPR> Traceback (most recent call last): File "bin/mozmill-restart", line 8, in <module> load_entry_point('mozmill==1.5.1rc2', 'console_scripts', 'mozmill-restart')() File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 817, in restart_cli RestartCLI().run() File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 765, in run self.mozmill.run_tests(self.options.test) File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 667, in run_tests self.run_dir(d, sleeptime) File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 612, in run_dir frame = self.start_runner() File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 575, in start_runner self.create_network() File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/mozmill/__init__.py", line 214, in create_network self.jsbridge_port) File "/Users/cltbld/talos-slave/test/build/mozmill/lib/python2.6/site-packages/jsbridge/__init__.py", line 72, in wait_and_create_network raise Exception("Sorry, cannot connect to jsbridge extension, port %s" % port) Exception: Sorry, cannot connect to jsbridge extension, port 24242 program finished with exit code 1 elapsedTime=131.852607
Severity: normal → critical
This patch doesn't seem to handle all situations very well. Using the NTT to crash the browser I get an exception and Mozill stops. I will file it as a new bug immediately.
(In reply to comment #31) > This patch doesn't seem to handle all situations very well. Using the NTT to > crash the browser I get an exception and Mozill stops. I will file it as a new > bug immediately. I think that what Henrik is seeing is unrelated to this issue. Landing all of 1.5.2 is gated on several issues, see bug 618924. If you want to just land this fix alone onto m-c then we can do that now while we wait for the rest of the reviews to come in.
Attached patch fix for 1.5.1 (deleted) — Splinter Review
This is a fix for 1.5.1. Ideally we'd not set the default environ in runner to always disable crashreporter, but for 1.5.1 that is likely necessary since jetpack also needs this functionality and uses Runner directly. Note that the calling application can still override this default setting by passing in an env to runner. The rest of the patch is an identical port of the 1.5.2 fix.
Attachment #507280 - Flags: review?(fayearthur+bugs)
Comment on attachment 507280 [details] [diff] [review] fix for 1.5.1 Not sure we need to set the environment var in mozrunner and mozmill, but can't hurt, tested with NTT.
Attachment #507280 - Flags: review?(fayearthur+bugs) → review+
Blocks: 629026
No longer depends on: 629026
ctalbert is this the fix you mentioned for disabling crashreporter?
Yep, landed it on Friday, so you should (hopefully) stop seeing mozmill hung builders.
Product: Testing → Testing Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: