Closed
Bug 1276220
Opened 9 years ago
Closed 8 years ago
Intermittent Automation Error: Could not terminate process - OSError: [Errno 1] Operation not permitted | 'firefox_ui_harness/cli_update.py'
Categories
(Testing :: Firefox UI Tests, defect)
Tracking
(firefox48 unaffected, firefox49 fixed, firefox50 fixed)
RESOLVED
FIXED
Release | Tracking | Status
---|---|---
firefox48 | --- | unaffected
firefox49 | --- | fixed
firefox50 | --- | fixed
People
(Reporter: whimboo, Unassigned)
References
Details
(Keywords: hang, intermittent-failure, regression)
Attachments
(1 obsolete file)
Since yesterday we have a couple of busted update tests on OS X:
Automation Error: mozprocess timed out after 300 seconds running ['/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/bin/python', '/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_harness/cli_update.py', '--binary', '/Users/mozauto/jenkins/workspace/mozilla-central_update/build/application/FirefoxNightly.app/Contents/MacOS/firefox', '--address', 'localhost:2828', '--server-root', '/Users/mozauto/jenkins
timed out after 300 seconds of no output
Return code: -9
The return code looks like a fatal error or crash. Mozcrash doesn't detect anything.
Comment 1•9 years ago (Reporter)
This is most likely fallout from Stephen's patch on bug 394984. I can reproduce it locally with the following command:
./mach firefox-ui-tests --binary %path_to_build_from_May_26_or_25%
Flags: needinfo?(spohl.mozilla.bugs)
Keywords: hang, regression
Comment 2•9 years ago (Reporter)
Maybe the commit https://hg.mozilla.org/mozilla-central/rev/a468fe7af937 caused it, given that it changed how child processes are launched. I suspect that mozprocess loses the pid here and cannot connect to Firefox anymore.
Comment 3•9 years ago
(In reply to Henrik Skupin (:whimboo) from comment #1)
> ./mach firefox-ui-tests --binary %path_to_build_from_May_26_or_25%
I'm confused by this timeline. If I'm reading the changelog properly, bug 394984 didn't land in time for the 5/25 nightly. Could you confirm?
Flags: needinfo?(spohl.mozilla.bugs) → needinfo?(hskupin)
Comment 4•9 years ago (Reporter)
Updates have been broken for us starting with the builds from May 25th. The nightly build on that day was built from:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=d6d4e8417d2fd71fdf47c319b7a217f6ace9d5a5
Hg shows me that your patches are part of that Nightly:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=d6d4e8417d2fd71fdf47c319b7a217f6ace9d5a5
Flags: needinfo?(hskupin)
Updated•9 years ago
Attachment #8757588 - Attachment is obsolete: true
Attachment #8757588 - Attachment is patch: false
Attempt to get this starrable.
Summary: Automation Error: mozprocess timed out after 300 seconds running ['python', 'firefox_ui_harness/cli_update.py'] → Intermittent Automation Error: mozprocess timed out after 300 seconds running | 'firefox_ui_harness/cli_update.py'
Comment 9•9 years ago (Reporter)
Stephen, I'm not familiar with process handling on OS X, so the following is just speculation... With the change linked below we seem to have changed the way child processes are started. Am I right that we no longer use the POSIX methods for this but some OS X-specific launcher tools?
https://hg.mozilla.org/integration/mozilla-inbound/rev/a468fe7af93708141cbbc201890909adf4acad83
Right now mozprocess can only follow processes that remain in the same process group. To kill the remaining open Firefox process it uses os.killpg(). I wonder if the above change breaks out of this process group and leaves us with an independent process.
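To make that concrete, here is a minimal sketch (not the actual mozprocess code; the binary path and flags are just examples) of how process-group-based cleanup works: the launcher puts the child into its own process group, remembers the group id, and later signals the whole group. That only reaches restarted children as long as they stay in that group.

import os
import signal
import subprocess

# Start the child in its own process group so we can clean up everything later.
proc = subprocess.Popen(
    ["/Applications/FirefoxNightly.app/Contents/MacOS/firefox", "-marionette"],
    preexec_fn=os.setpgrp,
)
pgid = os.getpgid(proc.pid)  # the group id we track for cleanup

# ... run tests, possibly restarting the browser in between ...

# Signals every process still in the tracked group; a child that moved to a
# different group (as after the updater change) is no longer reached.
os.killpg(pgid, signal.SIGTERM)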
Flags: needinfo?(spohl.mozilla.bugs)
Comment 10•9 years ago (Reporter)
I think that this is the problem. I tried the restart() command for one of our restart tests and, as you can see, the process group always stays the same across the three restarts:
COMM PID PPID PGID
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly 9624 9607 9624
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly 9629 1 9624
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly 9633 1 9624
$ ps xao comm,pid,ppid,pgid | grep Nightly
./FirefoxNightly 9636 1 9624
The following now shows the new behavior for update tests:
$ ps xao comm,pid,ppid,pgid | grep /var
/var/folders/z5/ 9661 9644 9661 (first instance to apply some prefs)
$ ps xao comm,pid,ppid,pgid | grep /var
/var/folders/z5/ 9671 1 9661 (after first restart to download and apply updates)
$ ps xao comm,pid,ppid,pgid | grep /var
/var/folders/z5/ 9678 1 9678 (after application has been upgraded)
Because the process group id changed, we lost track of the underlying process.
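A hypothetical helper (not part of mozprocess) that makes this check explicit: after each restart, compare the new pid's group against the originally tracked group. In the update scenario above, the last check would fail.

import os

def still_in_tracked_group(new_pid, tracked_pgid):
    # True if new_pid is still a member of the process group we originally
    # tracked, i.e. os.killpg(tracked_pgid, ...) would still reach it.
    try:
        return os.getpgid(new_pid) == tracked_pgid
    except OSError:
        return False  # process already gone, or we may not query it

# e.g. still_in_tracked_group(9678, 9661) -> False after the upgrade restart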
Comment 11•9 years ago (Reporter)
I'm confident now that this is clearly a regression from the mac updater changes on bug 394984.
Comment 12•9 years ago
This change may contribute to fixing bugs such as bug 1258515. We should make mozprocess compatible with this change. We still use posix_spawnp in other places, which might be why you're not seeing the same behavior for regular restart tests. However, we may change this as well to further reduce problems like bug 1258515.
Flags: needinfo?(spohl.mozilla.bugs)
Comment 13•9 years ago (Reporter)
So that means escaping the process group under those conditions is expected and necessary to fix issues like the one you laid out above? If not, why can't we make sure to stay inside it?
Comment 14•9 years ago
(In reply to Henrik Skupin (:whimboo) from comment #13)
> So that means escaping the process group under those conditions is expected
> and necessary to fix the issues like the one you layed out above?
Correct.
Comment 15•9 years ago (Reporter)
I see. OK, so as a first step I want to get rid of the 300s timeout so that we get a better error message in automation. I will work on that via bug 1276886.
Depends on: 1276886
Comment 17•9 years ago (Reporter)
With the patch on bug 1276886 landed over the weekend, we should at least no longer hit the hang after the update in nightly builds, starting maybe Sunday. I'm now working on bug 1176758 so that we can track the process even outside the known process group.
05:57:17 INFO - Could not terminate process: 57921
05:57:17 ERROR - Failure during execution of the update test.
05:57:17 ERROR - Traceback (most recent call last):
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/firefox_ui_harness/runners/update.py", line 60, in _run_tests
05:57:17 ERROR - FirefoxUITestRunner.run_tests(self, tests)
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/marionette/runner/base.py", line 827, in run_tests
05:57:17 ERROR - self._print_summary(tests)
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/marionette/runner/base.py", line 859, in _print_summary
05:57:17 ERROR - self.marionette.instance.close()
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/marionette_driver/geckoinstance.py", line 150, in close
05:57:17 ERROR - self.runner.stop()
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozrunner/base/runner.py", line 165, in stop
05:57:17 ERROR - self.process_handler.kill(sig=sig)
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 743, in kill
05:57:17 ERROR - self.proc.kill(sig=sig)
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 164, in kill
05:57:17 ERROR - send_sig(signal.SIGTERM)
05:57:17 ERROR -
05:57:17 ERROR - File "/Users/mozauto/jenkins/workspace/mozilla-central_update/build/venv/lib/python2.7/site-packages/mozprocess/processhandler.py", line 151, in send_sig
05:57:17 ERROR - os.killpg(self.pid, sig)
05:57:17 ERROR -
05:57:17 ERROR - OSError: [Errno 1] Operation not permitted
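The traceback shows os.killpg() failing with EPERM. Purely as an illustration (this is not the fix that eventually landed), a harness could guard the group kill and fall back to signalling just the tracked pid:

import errno
import os
import signal

def terminate(pid):
    try:
        # Try to take down the whole process group first.
        os.killpg(os.getpgid(pid), signal.SIGTERM)
    except OSError as e:
        if e.errno == errno.EPERM:
            # Group signal not permitted (e.g. the process escaped into a
            # different group); fall back to the single known pid.
            os.kill(pid, signal.SIGTERM)
        else:
            raise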
Summary: Intermittent Automation Error: mozprocess timed out after 300 seconds running | 'firefox_ui_harness/cli_update.py' → Intermittent Automation Error: Could not terminate process - OSError: [Errno 1] Operation not permitted | 'firefox_ui_harness/cli_update.py'
Comment 26•8 years ago (Reporter)
The mozprocess process tracking is not easy to solve, so I would propose a workaround patch in Marionette for now. It would mean we pass the current process id via the capabilities to the client and update the pid of the runner appropriately.
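Roughly, the workaround would look like the sketch below (names are illustrative, not the actual Marionette API): after each (re)start, read the process id the server reports in the session capabilities and point the runner's process handler at it, so a later stop() targets the right process even if it left the original process group.

def update_runner_pid(marionette, runner):
    # "processId" is a hypothetical capability key used here for illustration.
    pid = marionette.session_capabilities.get("processId")
    if pid is not None and runner.process_handler is not None:
        runner.process_handler.pid = int(pid)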
Comment 34•8 years ago (Reporter)
This should be fixed for the nightly builds today by my patch on bug 1282570.
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox50: --- → fixed
Depends on: 1282570
Resolution: --- → FIXED
Does this only run on nightly or does the fix need uplift?
Flags: needinfo?(hskupin)
Reporter | ||
Comment 39•8 years ago
|
||
The uplift has been handled via bug 1282570 and was done on July 12th.
Flags: needinfo?(hskupin)