Closed Bug 958011 Opened 11 years ago Closed 11 years ago

Worker object cycle collection is broken

Categories: Core :: DOM: Workers, defect
Platform: x86 macOS
Severity: normal

Tracking:
Status: RESOLVED FIXED
Target Milestone: mozilla29
firefox27 --- unaffected
firefox28 --- fixed
firefox29 --- fixed

People: (Reporter: sole, Assigned: khuey)

Keywords: regression

Attachments: (2 files)

We've found an issue where repeatedly launching web workers on a page ends up with random timeouts in the workers actually being launched. It happens more or less every 20th web worker, and the first frame that "freezes" is also somewhat random: on Firefox release it seems to be consistently the 11th, but on Nightly it is sometimes the 1st, other times the 2nd, or the 11th; quite unpredictable. I thought it was our Animated_GIF.js library having something wrong, but it also happens with gif.js.

Here are two demos, where up to 200 GIFs will be generated. The text in each square is the time elapsed between generating that one and the previous one; if it takes more than a second it's painted RED because it's slow. You'll see there's a certain pattern with this:

* Using Animated_GIF - we're launching 2 workers per GIF, so it freezes every 10th GIF: http://people.mozilla.org/~spenades/test_cases/workers_every_10th/tests_stress.html
* Using gif.js - since it launches only 1 worker per GIF, it freezes every 20th GIF: http://people.mozilla.org/~spenades/test_cases/workers_every_20th/tests_stress.html

We are also seeing this behaviour in chat.meatspac.es, where we've already filed a bug: https://github.com/meatspaces/meatspace-chat/issues/103

This behaviour is not present in Chrome (it crashes in my stress test, but only after many GIFs have been generated without lag/timeouts between them). I have been looking at the network graph and the profiler but I can't really figure out what is happening :-/
There is a hardcoded limit of 20 workers per origin to prevent DOSing the browser. http://mxr.mozilla.org/mozilla-central/source/modules/libpref/src/init/all.js#100
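That per-origin cap explains the stall pattern in the demos: once 20 workers exist for the origin, the next `new Worker(...)` is queued until an earlier one shuts down. A minimal, hypothetical model of that queuing behavior (plain Node-runnable JS, not Gecko code; the class and method names are invented for illustration):

```javascript
// Hypothetical sketch of a per-origin pool that caps concurrently
// running "workers" at a fixed limit and queues the rest. This is the
// behavior the stress tests above run into at the 20-worker limit.
class WorkerPool {
  constructor(limit = 20) {
    this.limit = limit;
    this.running = 0;
    this.queue = [];
  }

  // Request a worker slot; resolves immediately while under the limit,
  // otherwise waits until some earlier worker shuts down.
  acquire() {
    return new Promise((resolve) => {
      if (this.running < this.limit) {
        this.running++;
        resolve();
      } else {
        this.queue.push(resolve);
      }
    });
  }

  // Called when a worker shuts down (close()/terminate()/GC). If anyone
  // is queued, the freed slot is handed straight to them.
  release() {
    const next = this.queue.shift();
    if (next) {
      next();
    } else {
      this.running--;
    }
  }
}
```

If workers are never released (the GC bug discussed below), every queued `acquire()` in this model waits forever, which matches the "it never happens" symptom in the reports.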
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → INVALID
So if the workers are used and then go away, why are they not being GC'ed earlier? Because it looks to me that what's happening is it takes too long for them to be "restored into the pool". Sometimes it takes 7 seconds. Sometimes it takes A MINUTE. Sometimes IT NEVER HAPPENS. The other browsers are fine. So something is wrong here. Reopening ;)
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
Kyle, this problem doesn't happen in Chrome and it affects sites like https://chat.meatspace.es which is heavily used.
Ok, I think there is actually a Gecko bug here related to GCing cycles involving workers.
Status: REOPENED → NEW
Summary: Random timeouts when repeatedly launching Web Workers → Worker object cycle collection is broken
This library never actually terminates the workers (unless abort is called), so the only way to shut them down is for the GIF object to be GCd. If the library consumer causes a GIF object to be entrained it will leak a running worker. That's not the issue here, but it is poor design.
Hey Kyle, I added an explicit method to the library so you can ask it to release the workers once you don't want to use the parent object anymore, by calling terminate() on them. This makes the test pass without crashing, so maybe this confirms the GC issue? Here's the new test: http://sole.github.io/Animated_GIF/tests_stress.html
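The library-side fix amounts to giving consumers an explicit teardown call instead of relying on GC to reclaim the workers. A sketch of that pattern (the class, method names, and the stub worker are all hypothetical, standing in for Animated_GIF's actual API):

```javascript
// Stub standing in for a DOM Worker; only terminate() matters here.
class FakeWorker {
  constructor() { this.terminated = false; }
  terminate() { this.terminated = true; }
}

// Hypothetical GIF-encoder wrapper that owns its workers and exposes an
// explicit teardown method, so callers need not wait for the parent
// object to be garbage collected.
class GifEncoder {
  constructor(numWorkers = 2, makeWorker = () => new FakeWorker()) {
    this.workers = Array.from({ length: numWorkers }, makeWorker);
  }

  // Explicitly shut down every worker and drop the references so the
  // per-origin worker slots are freed immediately.
  destroy() {
    for (const w of this.workers) w.terminate();
    this.workers = [];
  }
}
```

A consumer would call `destroy()` (or whatever the library names it) as soon as it is done encoding, which is why the stress test stops stalling even with the GC bug present.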
Web workers shut down in one of three ways: when close() is called on the worker thread, when terminate() is called on the parent thread, or when the Worker object is no longer reachable on the parent thread. There's a bug in the implementation of the latter (having a gray JS wrapper should count as unreachable for these purposes). But the page should do one of the first two anyway, and not rely on the third, which is just there as a last resort.
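The three shutdown paths above can be modeled as a tiny state machine (an illustrative mock, not Gecko internals; all names here are invented):

```javascript
// How a worker can end up shut down, per the comment above.
const SHUTDOWN = {
  NONE: "running",
  CLOSE: "close() on the worker thread",
  TERMINATE: "terminate() on the parent thread",
  UNREACHABLE: "wrapper no longer reachable (last-resort GC path)",
};

class MockWorker {
  constructor() { this.shutdown = SHUTDOWN.NONE; }

  // Path 1: worker-side code calls self.close().
  workerSideClose() { this.shutdown = SHUTDOWN.CLOSE; }

  // Path 2: page code calls worker.terminate().
  terminate() { this.shutdown = SHUTDOWN.TERMINATE; }

  // Path 3: the GC/CC notices the wrapper is unreachable. This bug was
  // about this path failing to fire when the wrapper was merely gray,
  // so workers piled up against the per-origin limit.
  onWrapperUnreachable() {
    if (this.shutdown === SHUTDOWN.NONE) {
      this.shutdown = SHUTDOWN.UNREACHABLE;
    }
  }
}
```

In this model, a page that uses only path 3 stalls as soon as `onWrapperUnreachable()` stops being delivered, which is exactly the failure mode sole's demos exercised.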
Assignee: nobody → khuey
Keywords: regression
Attached patch Patch (deleted) — Splinter Review
Attachment #8368810 - Flags: review?(continuation)
Attachment #8368810 - Flags: review?(bent.mozilla)
Comment on attachment 8368810 [details] [diff] [review] Patch

Review of attachment 8368810 [details] [diff] [review]:
-----------------------------------------------------------------

Hooray for code removal! The cycle collector and binding generator changes look good to me.
Attachment #8368810 - Flags: review?(continuation) → review+
Comment on attachment 8368810 [details] [diff] [review] Patch

Review of attachment 8368810 [details] [diff] [review]:
-----------------------------------------------------------------

::: dom/workers/WorkerPrivate.cpp
@@ +3463,5 @@
> +  // down the thread.
> +
> +  if (!tmp->mBusyCount && !tmp->mMainThreadObjectsForgotten) {
> +    NS_CYCLE_COLLECTION_NOTE_EDGE_NAME(cb, "self");
> +    cb.NoteXPCOMChild(tmp);

I'd much prefer this to traverse an actual nsRefPtr<WorkerPrivateParent> called mSelfRef. Then we could set it to something in Constructor() and null it in the WorkerFinishedRunnable. It would at least be more easily searchable than looking for a stray AddRef/Release in here.
Attachment #8368810 - Flags: review?(bent.mozilla) → review+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla29
Comment on attachment 8368810 [details] [diff] [review] Patch

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Bug 928312
User impact if declined: Leaks - worker objects may never be GCd properly; pages may not work if they depend on workers being reclaimed (e.g. sole's demos)
Testing completed (on m-c, etc.): On m-c, has automated test.
Risk to taking this patch (and alternatives if risky): Low risk, well tested. Backing out 928312 would be far more invasive.
String or IDL/UUID changes made by this patch: None
Attachment #8368810 - Flags: approval-mozilla-beta?
Comment on attachment 8368810 [details] [diff] [review] Patch

Switching the approval to aurora as Fx27 is unaffected here.
Attachment #8368810 - Flags: approval-mozilla-beta? → approval-mozilla-aurora?
Attachment #8368810 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This has some non-trivial conflicts with Aurora that I'm not confident trying to resolve.
Flags: needinfo?(khuey)
Attached patch Aurora patch (deleted) — Splinter Review
Flags: needinfo?(khuey)