Closed Bug 1339594 Opened 8 years ago Closed 8 years ago

Intermittent TEST-UNEXPECTED-ERROR | damp,ts_paint,tps,a11yr,tresize | psutil.NoSuchProcess process no longer exists (pid=3740, name=u'firefox.exe')

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(firefox52 unaffected, firefox53 fixed, firefox54 fixed)

RESOLVED FIXED
mozilla54
Tracking Status
firefox52 --- unaffected
firefox53 --- fixed
firefox54 --- fixed

People

(Reporter: philor, Assigned: gbrown)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed])

Attachments

(1 file)

+++ This bug was initially created as a clone of Bug #1236770 +++ Carrying on the intermittents, since bug 1236770 got hijacked by one cause of one flavor in one subtest of one suite.
With mis-stars eliminated, I think this may be Windows-only and e10s-only, and mostly pgo.
This seems typical in recent failures: 23:53:35 INFO - TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe') 23:53:35 ERROR - Traceback (most recent call last): 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests 23:53:35 INFO - talos_results.add(mytest.runTest(browser_config, test)) 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest 23:53:35 INFO - return self._runTest(browser_config, test_config, setup) 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest 23:53:35 INFO - if counter_management else None), 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser 23:53:35 INFO - return_code = context.kill_process() 23:53:35 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process 23:53:35 INFO - procs = self.process.children() 23:53:35 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper 23:53:35 INFO - raise NoSuchProcess(self.pid, self._name) 23:53:35 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3360, name=u'firefox.exe') It looks like, if self.process.terminate() throws NoSuchProcess, there is a good chance that self.process.children() will also throw NoSuchProcess, making the change in bug 1236770 not very effective. Brad, could we just ignore NoSuchProcess in kill_process? Or get a list of children some other way/at some other time??
Flags: needinfo?(blassey.bugs)
This is automation code, I have no idea.
Flags: needinfo?(blassey.bugs)
I don't mind ignoring NoSuchProcess, or when we get it follow another code path. I am just not sure how to ensure we get the child processes and why we would get NoSuchProcess. Maybe it terminated before we get to this point?
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=79049477&lineNumber=1947 06:12:52 INFO - Running cycle 13/20 for ts_paint test... 06:12:52 INFO - TEST-INFO | started process 1944 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmpkpbugm\profile http://localhost:49677/startup_test/tspaint_test.html) 06:15:22 INFO - Terminating psutil.Process(pid=1944, name=u'firefox.exe') 06:15:23 INFO - TEST-UNEXPECTED-ERROR | ts_paint | psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe') 06:15:23 ERROR - Traceback (most recent call last): 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\run_tests.py", line 202, in run_tests 06:15:23 INFO - talos_results.add(mytest.runTest(browser_config, test)) 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 70, in runTest 06:15:23 INFO - return self._runTest(browser_config, test_config, setup) 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\ttest.py", line 174, in _runTest 06:15:23 INFO - if counter_management else None), 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 140, in run_browser 06:15:23 INFO - return_code = context.kill_process() 06:15:23 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 40, in kill_process 06:15:23 INFO - procs = self.process.children() 06:15:23 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper 06:15:23 INFO - raise NoSuchProcess(self.pid, self._name) 06:15:23 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=1944, name=u'firefox.exe') After "started process", there is a 150 second pause with no output. Perhaps there has been a timeout and TalosError has been raised: https://dxr.mozilla.org/mozilla-central/source/testing/talos/talos/talos_process.py#136 but I think that won't be reported if we raise again in the finally handler, in kill_process.
Assignee: nobody → gbrown
Attached patch handle timeouts better (deleted) — Splinter Review
I reproduced the NoSuchProcess on try and determined that the browser was hung and the test timed out at https://dxr.mozilla.org/mozilla-central/rev/a180b976c165b6cd174d24bbb77941919cdc53cb/testing/talos/talos/talos_process.py#124. A minidump was probably produced, but then the TalosError was raised, triggering the finally handler and the attempt to kill the browser...which had already been killed in kill_and_get_minidump(). In this patch, I catch and log exceptions from the possibly-second kill attempt, but then allow processing to continue. In the case of a timeout, the TalosError will still be raised, which would bypass crash reporting. To handle that, I added another check_for_crashes() call. This part of the patch is not an elegant solution, but gets the job done. Will try to demonstrate the new handling at https://treeherder.mozilla.org/#/jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect the NoSuchProcess failures to be transformed into PROCESS-CRASH failures, with crash reports.
Attachment #8840540 - Flags: review?(jmaher)
Comment on attachment 8840540 [details] [diff] [review] handle timeouts better Review of attachment 8840540 [details] [diff] [review]: ----------------------------------------------------------------- this looks good. consider in the future possibly raising something from the finally:try/catch. ::: testing/talos/talos/talos_process.py @@ +145,5 @@ > + return_code = proc.wait(1) > + except: > + # Maybe killed by kill_and_get_minidump(), maybe ended? > + LOG.info("Unable to kill process") > + LOG.info(traceback.format_exc()) should we raise anything here?
Attachment #8840540 - Flags: review?(jmaher) → review+
(In reply to Joel Maher ( :jmaher) from comment #11) > ::: testing/talos/talos/talos_process.py > @@ +145,5 @@ > > + return_code = proc.wait(1) > > + except: > > + # Maybe killed by kill_and_get_minidump(), maybe ended? > > + LOG.info("Unable to kill process") > > + LOG.info(traceback.format_exc()) > > should we raise anything here? Maybe...sometimes? If there has been a timeout, a TalosError has already been raised - no sense in raising another exception. Otherwise, I suppose the test has completed successfully but there was a problem with killing the process; that might be cause for raising an exception. I'm hoping that in that case either 1. everything is actually okay and we complete successfully, or 2. something goes wrong at a later time and someone will notice the NoSuchProcess exception in the log as a possibly related clue. It might be better to restructure the code somehow to handle timeouts more directly...I'm not sure how exactly. I'll press on with this change for now.
(In reply to Geoff Brown [:gbrown] from comment #10) > Will try to demonstrate the new handling at > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=33f2d50b458bae93328e3eea526b85a25ea77b3c; I expect > the NoSuchProcess failures to be transformed into PROCESS-CRASH failures, > with crash reports. Here's one: https://archive.mozilla.org/pub/firefox/try-builds/gbrown@mozilla.com-33f2d50b458bae93328e3eea526b85a25ea77b3c/try-win32/try_win7_ix_test-chromez-e10s-bm112-tests1-windows-build37.txt.gz 16:10:11 INFO - TEST-INFO | started process 3588 (C:\slave\test\build\application\firefox\firefox -profile c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile http://localhost:49816/startup_test/tresize/addon/content/tresize-test.html) 16:12:41 INFO - Timeout waiting for test completion; killing browser... 16:12:42 INFO - Terminating psutil.Process(pid=3588, name=u'firefox.exe') 16:12:42 INFO - Unable to kill process 16:12:42 ERROR - Traceback (most recent call last): 16:12:42 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 143, in run_browser 16:12:42 INFO - return_code = context.kill_process() 16:12:42 INFO - File "C:\slave\test\build\tests\talos\talos\talos_process.py", line 41, in kill_process 16:12:42 INFO - procs = self.process.children() 16:12:42 INFO - File "C:\slave\test\build\venv\lib\site-packages\psutil\__init__.py", line 299, in wrapper 16:12:42 INFO - raise NoSuchProcess(self.pid, self._name) 16:12:42 INFO - NoSuchProcess: psutil.NoSuchProcess process no longer exists (pid=3588, name=u'firefox.exe') 16:12:42 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/B6M4Wx_rQCGEqZ2av4vOHQ/artifacts/public/build/firefox-54.0a1.en-US.win32.crashreporter-symbols.zip 16:12:49 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp c:\users\cltbld\appdata\local\temp\tmprp75bd 16:12:56 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp 16:12:56 INFO - PROCESS-CRASH | tresize | application crashed [unknown top frame] 16:12:56 INFO - Crash dump filename: c:\users\cltbld\appdata\local\temp\tmp_xpvjo\profile\minidumps\6456ec98-13ba-4cc8-aeea-7e46e8ef4c20.dmp 16:12:56 INFO - Operating system: Windows NT 16:12:56 INFO - 6.1.7601 Service Pack 1 16:12:56 INFO - CPU: x86 16:12:56 INFO - GenuineIntel family 6 model 30 stepping 5 16:12:56 INFO - 8 CPUs 16:12:56 INFO - GPU: UNKNOWN 16:12:56 INFO - No crash 16:12:56 INFO - Process uptime: 151 seconds 16:12:56 INFO - Thread 0 16:12:56 INFO - 0 KERNELBASE.dll!GetLocaleInfoHelper + 0xb1 16:12:56 INFO - eip = 0x7577092c esp = 0x002ec9d8 ebp = 0x002ecac0 ebx = 0x000cee7c 16:12:56 INFO - esi = 0x000b65cc edi = 0x00000000 eax = 0x00000019 ecx = 0x00000001 16:12:56 INFO - edx = 0x00000000 efl = 0x00000202 16:12:56 INFO - Found by: given as instruction pointer in context
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/eb573b6f4d66 Improve Talos error handling for hung browser; r=jmaher
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla54
I am reopening as this is more debugging code, but please close if we think this fixes the bug or will morph a new bug as more actionable.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [stockwell needswork]
Blocks: 1343549
(In reply to Joel Maher ( :jmaher) from comment #16) > I am reopening as this is more debugging code, but please close if we think > this fixes the bug or will morph a new bug as more actionable. We haven't had any new reports of NoSuchProcess (just a few mis-stars) and I don't think we will. The change has correctly morphed the NoSuchProcess reports into crashes (see see-also bugs).
Status: REOPENED → RESOLVED
Closed: 8 years ago8 years ago
Resolution: --- → FIXED
Whiteboard: [stockwell needswork] → [stockwell fixed]
Depends on: 1622257
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: