Closed
Bug 1370027
Opened 7 years ago
Closed 7 years ago
Runaway subprocess_worker_unix.js worker thread is pegging an entire CPU
Categories
(Toolkit :: Async Tooling, defect)
Toolkit
Async Tooling
Tracking
()
People
(Reporter: dbaron, Assigned: kmag)
References
Details
(Keywords: perf, power)
Attachments
(2 files)
I've had a local opt build of Firefox running for a few days (since Wednesday May 31 at 16:31 PDT, to be exact). It's running https://hg.mozilla.org/mozilla-central/rev/34ac1a5d6576 (plus my local patches as of Monday May 29 at 20:14 PDT up to "Don't propagate style flushes to the parent document").
I notice that Firefox is pegging an entire core and then some. "top -H" shows that the guilty thread is a worker thread, and adding "Worker" to the thread filter in the Gecko profiler (https://perf-html.io/ and see bug 1297773) shows that the problematic worker is an instance of toolkit/modules/subprocess/subprocess_worker_unix.js .
It's pretty bad that our own code has run away like this and is pegging an entire CPU core.
Profile at https://perfht.ml/2rnHMxf
[Tracking Requested - why for this release]: This is a massive power usage bug -- it would lead to users basically reporting all the time that Firefox is killing their battery. It's unacceptable to ship in this state.
Reporter | ||
Comment 1•7 years ago
|
||
s/all the time/whenever this bug occurs/ (since I'm not sure what triggers it)
Reporter | ||
Comment 2•7 years ago
|
||
And here's a profile that also includes the Timer thread: https://perfht.ml/2rnF3UC
(top -H says this worker thread is using 92% CPU and the Timer thread is using 37% CPU, with two other worker threads using 6% and 1%.)
Assignee | ||
Comment 3•7 years ago
|
||
Presumably this is triggered by the profiler's use of the Subprocess API, but it's hard to tell from that profile whether it's a bug in the profiler API or a bug in the subprocess implementation. But from that profile, it *looks* like we're spawning a huge number of processes and waiting for them to exit when there's an error.
It may be caused by another WebExtension that uses the native messaging API, though. What extensions do you have installed in that profile?
Also, a surprising amount of that time seems to be spent in WorkerPrivate::SetGCTimerMode, canceling GC timers.
Reporter | ||
Comment 4•7 years ago
|
||
The only other extension I have installed is https://addons.mozilla.org/en-US/firefox/addon/character-identifier/ (source at https://github.com/dbaron/char-identifier ) which doesn't use native messaging (as far as I'm aware).
Reporter | ||
Comment 5•7 years ago
|
||
(have installed *and enabled*, at least)
Reporter | ||
Comment 6•7 years ago
|
||
Also, this Firefox is still running, so I can debug it to figure out what ran this if there's a good way to do so.
Reporter | ||
Comment 7•7 years ago
|
||
(Also, perhaps this JS code could have profiler labels that say who ran it and/or what it was running? Is that something that we can do with the profiler?)
Assignee | ||
Comment 8•7 years ago
|
||
Do you have any useful output in the browser console?
And, if you can debug that worker thread, can you check the values of the io.processes map? In particular, the exitCode and pid properties should give us some idea of whether we're in an inifinite loop polling for the same process that's already exited (or failed to spawn).
(In reply to David Baron :dbaron: ⌚️UTC-7 from comment #7)
> (Also, perhaps this JS code could have profiler labels that say who ran it
> and/or what it was running? Is that something that we can do with the
> profiler?)
Not at the moment, but I agree that it would be helpful.
Reporter | ||
Comment 9•7 years ago
|
||
There are 39 elements in the map. All have an exitCode of null. None of the PIDs correspond to running processes.
Assignee | ||
Comment 10•7 years ago
|
||
OK. That's interesting... So clearly Process.wait() is failing, and we're continuing to poll for processes that don't exist anymore. But the question is why.
I guess the most likely possibilities are that either something is changing the SIGCHLD handler for your parent process to SIG_IGN, or some other thread is calling wait() or waitpid(-1) and reaping the process before the worker gets a chance. And, unfortunately, [1] looks like a likely culprit for the latter.
So this probably winds up happening as soon as someone uses nsIProcess. From a quick code search, I suspect that's the Browser Toolbox[2] in this case.
If so, the simplest short term solution is probably to update that code not to use nsIProcess. Beyond that, we should probably try to handle that situation non-catastrophically, but I'd be very surprised if that wait loop thread doesn't cause problems elsewhere, particularly for our IPC code.
[1]: http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/nsprpub/pr/src/md/unix/uxproces.c#648
[2]: http://searchfox.org/mozilla-central/rev/507743376d1ba753d14ab6b9305b7c6358570be8/devtools/client/framework/ToolboxProcess.jsm#234
Assignee | ||
Updated•7 years ago
|
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment 14•7 years ago
|
||
mozreview-review |
Comment on attachment 8875055 [details]
Bug 1370027: Part 1 - Cleanly handle a subprocess child being reaped by NSPR.
https://reviewboard.mozilla.org/r/146422/#review150436
I was looking forward to subprocess.rs but this will do in the mean time.
Attachment #8875055 -
Flags: review?(aswan) → review+
Assignee | ||
Comment 15•7 years ago
|
||
(In reply to Andrew Swan [:aswan] from comment #14)
> I was looking forward to subprocess.rs
Some day...
Comment 16•7 years ago
|
||
mozreview-review |
Comment on attachment 8875056 [details]
Bug 1370027: Part 2 - Use Subprocess.jsm rather than nsIProcess to create the Browser Toolbox process.
https://reviewboard.mozilla.org/r/146424/#review150820
This all looks reasonable, and the Browser Toolbox still works, so fine by me
Attachment #8875056 -
Flags: review?(bgrinstead) → review+
Assignee | ||
Comment 17•7 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/f4f2dab8b32338f937a3661946809df4674440d6
Bug 1370027: Part 1 - Cleanly handle a subprocess child being reaped by NSPR. r=aswan
https://hg.mozilla.org/integration/mozilla-inbound/rev/0d16a08bbcfa181535789a43856ef176cc9829c8
Bug 1370027: Part 2 - Use Subprocess.jsm rather than nsIProcess to create the Browser Toolbox process. r=bgrins
Comment 18•7 years ago
|
||
https://hg.mozilla.org/mozilla-central/rev/f4f2dab8b323
https://hg.mozilla.org/mozilla-central/rev/0d16a08bbcfa
Assignee: nobody → kmaglione+bmo
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Updated•7 years ago
|
Whiteboard: [qf] → [qf:p1]
Updated•3 years ago
|
Performance Impact: --- → P1
Whiteboard: [qf:p1]
You need to log in
before you can comment on or make changes to this bug.
Description
•