Closed
Bug 1434618
Opened 7 years ago
Closed 6 years ago
Worker shutdown should start at xpcom-will-shutdown notification
Categories
(Core :: DOM: Workers, defect, P2)
Tracking
()
People
(Reporter: baku, Assigned: baku)
References
(Blocks 3 open bugs)
Details
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f31755ef4145
Worker shutdown should start at xpcom-will-shutdown notification, r=smaug
If this ends up helping out with shutdown hangs, we might want to consider uplift to 59.
status-firefox59:
--- → affected
status-firefox60:
--- → affected
tracking-firefox59:
--- → +
tracking-firefox60:
--- → +
Comment 3•7 years ago
|
||
Backed out for webdriver failures, e.g. webdriver/tests/actions/mouse_dblclick.py:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0aa3298451dc7d725b1d98deb9461ae190cd635d
Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=cf81d20f671314bd737e8db7ef9ed97f48f183b1&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=159591549&repo=mozilla-inbound
[task 2018-01-31T16:03:59.497Z] 16:03:59 INFO - STDOUT: tests/web-platform/tests/webdriver/tests/actions/mouse_dblclick.py::test_no_dblclick
[task 2018-01-31T16:03:59.498Z] 16:03:59 INFO - PID 842 | 1517414639495 mozrunner::runner INFO Running command: "/builds/worker/workspace/build/application/firefox/firefox" "-marionette" "-profile" "/tmp/rust_mozprofile.bcqhtUmGL4p0"
[task 2018-01-31T16:03:59.575Z] 16:03:59 INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/actions/mouse_dblclick.py | expected OK
Flags: needinfo?(amarchesini)
Updated•7 years ago
|
Severity: normal → major
Priority: -- → P2
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(amarchesini)
Pushed by amarchesini@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/22fe0145f9ed
Worker shutdown should start at xpcom-will-shutdown notification, r=smaug
Backout by shindli@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1ed7050a69b4
Backed out changeset 22fe0145f9ed for Wd failures in /webdriver/tests/actions/mouse_dblclick.py on a CLOSED TREE
Assignee | ||
Comment 6•7 years ago
|
||
It seems that this patch is amplifying an existing problem. Bug 1414144 is an example of it.
Tomorrow I'll take a more closer look. whimboo, if you have any idea, let me know.
Flags: needinfo?(hskupin)
Comment 7•7 years ago
|
||
We have a couple of know shutdown hang scenarios. So I wouldn't be surprised if one of those could be the cause. Or do those pass or fail in a crash only?
Btw. the restarts in-between the tests in this test module aren't necessary anymore. A patch on bug 1432105 could help a lot here in reducing the runtime to maybe at maximum 15s of this test. Maja, do you think you can have a look at this? It would help a lot, so that Andrea can continue to land patches for slow shutdowns.
Depends on: 1432105
Flags: needinfo?(hskupin) → needinfo?(mjzffr)
Yes, I'll probably have time for bug 1432105 later today.
Flags: needinfo?(mjzffr)
Likely too late for 59 at this point with release 2 weeks away, but we can still consider taking patches for 60.
Comment 11•7 years ago
|
||
(In reply to Marion Daly [:mdaly] from comment #10)
> :baku, update?
We are blocked landing this patch by bug 1432105, which itself was blocked by a couple of other bugs. I worked hard to get all the dependencies fixed, and by today we should be able to get the patch on bug 1432105 landed too. So later today or tomorrow we can try to land this patch again, or start as best with a try build.
Comment 12•7 years ago
|
||
Andrea, would you mind rebasing your patch and push to try? We won't be able to land bug 1432105 today due to some other issues, but I would like to know how this patch works now for wdspec tests. Given that I enabled trace logs for Marionette yesterday we should get way better failure details. Thanks!
Comment 13•7 years ago
|
||
Andrea pushed a try build for which the linux32 wdspec tests are still failing:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=32e799340b1bf34d6f1d9c2932d30fbbe7318f0a&filter-searchStr=linux%20wd&selectedJob=166721977
The problem seems to be that moving to `xpcom-will-shutdown` causes even more / longer hangs. You can check the log which prints all the details now:
https://treeherder.mozilla.org/logviewer.html#?job_id=166721977&repo=try&lineNumber=5535
See all the wait for browser shutdown lines above.
Maybe this is all related to the tabmodal dialog tests in each of those files and the hangs as reported on bug 1443524 by Andreas.
Assignee | ||
Comment 14•7 years ago
|
||
Andreas, any comment about this?
Flags: needinfo?(amarchesini) → needinfo?(ato)
Comment 15•7 years ago
|
||
mouse_dblclick.py doesn’t, as far as I can tell from just
inspecting the source code, cause any tabmodal dialogues from
appearing. But I suppose it is always possible that it could be
left over from an earlier test.
The best way to debug this would be to run it locally and inspect
what’s happening:
> % ./mach wpt testing/web-platform/tests/webdriver/tests/actions/mouse_dblclick.py --timeout-multiplier=100 --webdriver-arg=-vv
My experience has been that the shutdown hang is fairly easily
reproducible.
Flags: needinfo?(ato)
Updated•7 years ago
|
status-firefox61:
--- → affected
tracking-firefox61:
--- → +
Comment 16•7 years ago
|
||
Andrea, please rebase your patch on mozilla-central and try again. Thanks.
Flags: needinfo?(amarchesini)
Assignee | ||
Comment 17•7 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=90ab5c9fd2d7bf02d3f38b74bb553aca65bb4da8
still unhappy results.
Flags: needinfo?(amarchesini)
Comment 18•7 years ago
|
||
(In reply to Andrea Marchesini [:baku] from comment #17)
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=90ab5c9fd2d7bf02d3f38b74bb553aca65bb4da8
>
> still unhappy results.
Would you want to hand this off to someone else?
Assignee | ||
Comment 19•7 years ago
|
||
So far it's just about pushing to try. Not a big deal.
whimboo do you have time to see why the patch still breaks the WD tests?
Flags: needinfo?(hskupin)
Comment 20•7 years ago
|
||
I see some strange behavior here. All is based on the following log:
https://treeherder.mozilla.org/logviewer.html#?job_id=169826750&repo=try (log viewer)
https://taskcluster-artifacts.net/W8In7kJvTES01pc-37mDGA/0/public/logs/live_backing.log (full log)
1) Geckodriver reports after only 5s that it failed to stop the browser process. How can this be given that we should wait 70s for that to be happening. Andreas, do you have an idea? Here the excerpt from the log:
> [task 2018-03-23T09:02:46.253Z] 09:02:46 INFO - PID 8530 | 1521795766245 Marionette TRACE 0 -> [0,3,"quit",{"flags":["eForceQuit"]}]
[..]
> [task 2018-03-23T09:02:51.563Z] 09:02:51 INFO - PID 8530 | 1521795771555 geckodriver::marionette ERROR Failed to stop browser process
2) There are failures during shutdown for which I don't know if those could affect the shutdown time, and as such the issue as mentioned in 1). Maybe those are unrelated because I also see 1) without those failures:
[task 2018-03-23T09:02:53.945Z] 09:02:53 INFO - PID 8530 | ###!!! [Parent][DispatchAsyncMessage] Error: PClientManager::Msg_Teardown Route error: message sent to unknown actor ID
[task 2018-03-23T09:02:53.945Z] 09:02:53 INFO - PID 8530 |
[task 2018-03-23T09:02:53.946Z] 09:02:53 INFO - PID 8530 | JavaScript error: resource://gre/modules/osfile/osfile_async_front.jsm, line 408: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_close
[task 2018-03-23T09:02:53.947Z] 09:02:53 INFO - PID 8530 | JavaScript error: resource://gre/modules/osfile/osfile_async_front.jsm, line 408: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_close
[task 2018-03-23T09:02:53.948Z] 09:02:53 INFO - PID 8530 | JavaScript error: resource://gre/modules/ProfileAge.jsm, line 174: Error: Unable to fetch oldest profile entry: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_next
[task 2018-03-23T09:02:53.948Z] 09:02:53 INFO - PID 8530 | JavaScript error: resource://gre/modules/ProfileAge.jsm, line 174: Error: Unable to fetch oldest profile entry: Error: OS.File has been shut down. Rejecting post to DirectoryIterator_prototype_next
Flags: needinfo?(hskupin) → needinfo?(ato)
Comment 21•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #20)
> 1) Geckodriver reports after only 5s that it failed to stop the
> browser process. How can this be given that we should wait 70s
> for that to be happening. Andreas, do you have an idea? Here the
> excerpt from the log:
>
> > [task 2018-03-23T09:02:46.253Z] 09:02:46 INFO - PID 8530 | 1521795766245 Marionette TRACE 0 -> [0,3,"quit",{"flags":["eForceQuit"]}]
> [..]
> > [task 2018-03-23T09:02:51.563Z] 09:02:51 INFO - PID 8530 | 1521795771555 geckodriver::marionette ERROR Failed to stop browser process
I think std::process::Child::kill() errors if the process isn’t
running:
https://doc.rust-lang.org/std/process/struct.Child.html#method.kill
mozrunner unconditionally calls this after
mozrunner::RunnerProcess::running() is false in the ::wait() loop:
https://searchfox.org/mozilla-central/rev/003262ae12ce937950ffb8d3b0fa520d1cc38bff/testing/mozbase/rust/mozrunner/src/runner.rs#147
Flags: needinfo?(ato)
Comment 22•7 years ago
|
||
Although killing twice an ended process does not:
https://play.rust-lang.org/?gist=85ae7d8f983307245872fb328b5685b4&version=stable
Comment 23•7 years ago
|
||
I filed bug 1448900 to get the geckodriver issue investigated.
Comment 24•7 years ago
|
||
Investigation has shown that the problem is just a confusing output in the log, and the browser process has been exited on its own. So it's nothing we would have to worry about. The problem should be something else...
A couple of tests require a new session of Firefox, and as such the browser is getting restarted. As it looks like there is always a 5s gap for debug builds in-between those two lines:
[task 2018-03-23T08:44:32.763Z] 08:44:32 INFO - PID 5373 | JavaScript error: resource://gre/modules/ProfileAge.jsm, line 174: Error: Unable to fetch oldest profile entry: Error: OS.File has been shut down. Rejecting request to DirectoryIterator_prototype_next
[task 2018-03-23T08:44:37.917Z] 08:44:37 INFO - PID 5373 | [Parent 5514, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /builds/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 809
That increases the shutdown time from 5s to 10s for every shutdown and results in a test duration of 365 seconds:
> [task 2018-03-23T08:44:14.042Z] 08:44:14 INFO - TEST-START | /webdriver/tests/sessions/new_session/create_alwaysMatch.py
[..]
> [task 2018-03-23T08:50:19.044Z] 08:50:19 INFO - TEST-UNEXPECTED-TIMEOUT | /webdriver/tests/sessions/new_session/create_alwaysMatch.py | expected OK
> [task 2018-03-23T08:50:19.044Z] 08:50:19 INFO - TEST-INFO took 365002ms
I do not see that hang for regular CI builds on Treeherder. For those we have the following timing:
>[task 2018-03-25T23:09:34.731Z] 23:09:34 INFO - TEST-START | /webdriver/tests/sessions/new_session/create_alwaysMatch.py
[..]
> [task 2018-03-25T23:13:44.860Z] 23:13:44 INFO - TEST-OK | /webdriver/tests/sessions/new_session/create_alwaysMatch.py | took 250110ms
That means with your patch the affected tests need 50% more time to complete. I would also say that all tests are involved here which causes a restart of Firefox, but only those 3 are permanently failing because of the higher number of subtests needed to run.
Andrea, I hope that helps for further debugging the shutdown delays.
Flags: needinfo?(amarchesini)
Updated•7 years ago
|
Blocks: RunWatchdogShutdownhang
Updated•7 years ago
|
Blocks: IPCError_ShutDownKill
Updated•7 years ago
|
Comment 25•7 years ago
|
||
Probably wontfix for 60 at this point.
Comment 26•7 years ago
|
||
:baku update?
Comment 27•7 years ago
|
||
Andrea, did you had a chance to test again on try? The timeouts of wdspec are all gone now.
Assignee | ||
Comment 28•7 years ago
|
||
Flags: needinfo?(amarchesini)
Comment 29•7 years ago
|
||
I checked a try push from Andrea (as made with the patch on bug 1460239) and I noticed something interesting:
try:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b20ebbb2bc64c73e1d7f606db8e99ada7a367775&filter-searchStr=wd
autoland:
https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=07d5c09b696accd800da11b297cd1f57c6d0f761&filter-searchStr=wd
Take for example the test create_alwaysMatch.py. Here I can see a 5s delay with the build on the try push every time the browser gets shutdown:
https://treeherder.mozilla.org/logviewer.html#?job_id=177700590&repo=try&lineNumber=62190-62193
> [task 2018-05-09T16:43:53.317Z] 16:43:53 INFO - PID 8331 | 1525884233312 Marionette DEBUG Received observer notification xpcom-will-shutdown
> [task 2018-05-09T16:43:53.318Z] 16:43:53 INFO - PID 8331 | 1525884233312 Marionette DEBUG Resetting recommended pref toolkit.cosmeticAnimations.enabled
> [task 2018-05-09T16:43:53.319Z] 16:43:53 INFO - PID 8331 | 1525884233313 Marionette DEBUG Resetting recommended pref datareporting.policy.dataSubmissionPolicyAccepted
> [task 2018-05-09T16:43:58.415Z] 16:43:58 INFO - PID 8331 | 1525884238411 geckodriver::marionette DEBUG Browser process stopped: exit code: 0
Marionette receives the xpcom-will-shutdown notification, and resets the prefs. Then it takes 5s until the browser exited.
Now compare it with browser shutdowns on autoland:
https://treeherder.mozilla.org/logviewer.html#?job_id=177714286&repo=autoland&lineNumber=64025-64028
> [task 2018-05-09T18:09:20.611Z] 18:09:20 INFO - PID 8253 | 1525889360603 Marionette DEBUG Received observer notification xpcom-will-shutdown
> [task 2018-05-09T18:09:20.612Z] 18:09:20 INFO - PID 8253 | 1525889360604 Marionette DEBUG Resetting recommended pref toolkit.cosmeticAnimations.enabled
> [task 2018-05-09T18:09:20.614Z] 18:09:20 INFO - PID 8253 | 1525889360604 Marionette DEBUG Resetting recommended pref datareporting.policy.dataSubmissionPolicyAccepted
> [task 2018-05-09T18:09:20.877Z] 18:09:20 INFO - PID 8253 | 1525889360872 geckodriver::marionette DEBUG Browser process stopped: exit code: 0
Firefox quits immediately without the 5s delay.
So moving the worker shutdown to xpcom-will-shutdown clearly produces another unwanted shutdown delay.
No longer depends on: 1460239
Comment 30•7 years ago
|
||
I'm still very interested in seeing this land in time for uplift to 61, but it's not a new issue either and I don't think tracking it for each release on an ongoing basis adds much value to anything.
status-firefox62:
--- → fix-optional
status-firefox-esr52:
--- → unaffected
status-firefox-esr60:
--- → fix-optional
status-firefox63:
--- → affected
Comment 31•6 years ago
|
||
noise |
(In reply to Ryan VanderMeulen [:RyanVM] from comment #30)
> ... I don't think tracking it for each release
> on an ongoing basis adds much value to anything.
This wouldn't be a problem if you hadn't switched to a rapid release schedule for new Firefox versions.
Assignee | ||
Comment 33•6 years ago
|
||
Yes, but it seems that is not needed anymore. The worker shutdown seems in a good shape at the moment. Am I right?
Initially I want to have this to separate the worker shutdown from any other component shutdown, but probably we can mark this as invalid.
Flags: needinfo?(amarchesini) → needinfo?(mdaly)
Updated•6 years ago
|
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INVALID
Updated•5 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•