Intermittent dom/workers/test/browser_worker_use_counters.js | single tracking bug
Categories
(Core :: DOM: Workers, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox113 | --- | fixed |
People
(Reporter: jmaher, Assigned: asuth)
References
Details
(Keywords: intermittent-failure, intermittent-testcase)
Crash Data
Attachments
(5 files, 2 obsolete files)
Reporter | ||
Comment 1•2 years ago
|
||
Additional information about this bug failures and frequency patterns can be found by running: ./mach test-info failure-report --bug 1775784
Comment 2•2 years ago
|
||
Hi :jmaher, ./mach test-info failure-report --bug 1775842
gives me:
no failures were found for given bugid, please ensure bug is
accessible via: https://treeherder.mozilla.org/intermittent-failures
but https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-06-16&endday=2022-06-23&tree=all&bug=1775842
shows there is some.
Please note the tree=all
, with tree=trunk
I get an empty list also there. Is this the intended behavior for ./mach test-info failure-report
?
Reporter | ||
Comment 3•2 years ago
|
||
this report will work once failures are associated with this new bug. the ./mach test-info ...
command reports by default the last 30 days on trunk:
https://searchfox.org/mozilla-central/source/testing/mach_commands.py#996
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 18•2 years ago
|
||
tl;dr: Worker shutdown is happening at Canceling not Killing as far as the remote worker mechanism is concerned for the purposes of deciding when to stop keeping the process alive, which leads to a race between worker shutdown and the process being torn down, and this leads to the telemetry not being reported because the observer shutdown notification has already happened by the time the worker gets around to reporting its usage. By fixing this by moving the remote worker notification to the transition to Killing (via helper lambda passed into the WorkerPrivate constructor) we can also likely address another set of related races.
My longer key notes here from slack discussion:
So I understand that for things to work:
- The worker has to reach the killing stage and report its use counters. (I guess we must do this for reasons related to lock contention? Arguably doing it once at worker shutdown does avoid a lot of pathological possibilities for multiple workers doing the exact same thing!)
- The child has to send telemetry up to the parent. It will do this in "content-child-shutdown".
In the pernosco trace at https://pernos.co/debug/KzErTzqDyGm8hBQz1FdPCg/index.html#f{m[AkVx,ChMJ_,t[AbE,Cmc_,f{e[AkVx,ChMC_,s{af4gCXQAA,bARo,uHk90Sw,oHlBm3Q___/ at that timestamp we can see that the "content-child-shutdown" telemetry IPC sending is happening prior to the worker reaching the killing state and so the telemetry never gets sent.The reason the child is able to start shutdown for SharedWorkers or ServiceWorkers is that RemoteWorkerParent::ActorDestroy notifies the ContentParent which can then start the parent shutdown.
Presumably we can address this by keeping the RemoteWorkerParent alive slightly longer, and in doing so we'll probably eliminate a whole bunch of intermittents, which could be very nice! We have a somewhat nice diagram that suggests that RemoteWorkerChild::ShutdownOnWorker is responsible for sending the close message that causes problems, and indeed that's what happened in the above pernosco trace. This WeakWorkerRef is triggered on the transition to Canceling which is strictly before Killing.
Unfortunately, it's a bit awkward to be notified of the transition to Killing right now. We won't transition to killing if we have active worker refs. Also it would be a wacky concept to be able to have WorkerRefs that notify on killing because that's the point where everything should have stopped and you're explicitly not allowed to do anything. Maybe the most practical thing is to add an optional lambda argument to WorkerPrivate::Constructor like RemoteWorkerChild::ExecWorkerOnMainThread calls and call that lambda in WorkerPrivate::ClearSelfAndParentEventTargetRef. This is potentially a bit extreme but we know the worker is basically entirely dead at that point. As the logic that creates and dispatches the worker finished runnable says, the worker literally can't do anything after dispatching that runnable! I don't see any real harm in making sure the worker is totally, totally done. (Well, we will be less efficient, but from our team's perspectives, a LOT of edge-cases will go away, so it's a win for us!)
Assignee | ||
Comment 19•2 years ago
|
||
This blocks landing of the GC/CC enhancements in bug 1777921
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 32•2 years ago
|
||
Assignee | ||
Comment 33•2 years ago
|
||
Depends on D164643
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 37•2 years ago
|
||
Depends on D164643
Assignee | ||
Comment 38•2 years ago
|
||
Depends on D166261
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 40•2 years ago
|
||
nsIAsyncShutdownService defined in nsIAsyncShutdown.idl is the preferred
approach for handling asynchronous shutdowns as it provides for parallelism
in shutdown rather than having N independent observer service-based
implementations each running in parallel because they each spin their own
nested event loop. (There have been discussions in #xpcom about moving the
service to C++ from JS, but there is not currently a bug.)
In this patch stack on (this) bug 1775784 we move RemoteWorkerService to
use the async shutdown service to address correctness problems related
to its shutdown while avoiding adding a serializing event loop spin.
As it relates to the browser_startup_content*.js tests:
- nsAsyncShutdown.sys.mjs is already present, but its dependencies that
are loaded if it is actually used (AsyncShutdown.sys.mjs,
PromiseUtils.sys.mjs) were not listed. Presumably some code that
lazily adds a blocker was loaded but it did not add a blocker. - RemoteWorkerService is unconditionally started up (this has not
changed; it's necessary and important for latency for ServiceWorkers)
and so it is definitely loading the async shtudown service and using
it, so the additional deps need to be present. - Currently async shutdown lives in the intermittently loaded scripts
section. Since RemoteWorkerService is unconditionally initialized,
it's not clear if this should be moved up toknown_scripts
since
it currently will always loaded or if it's better to leave it as
intermittent because it may be possible that upcoming worker cleanups
and refactorings may be able to moot the async shutdown service load,
in which case it would be appropriate to move it back to intermittent
but it's possible some other component might have made it
unconditionally load, but it would be hard to tell that because they
would not have run into this test failing. - I've decided to leave the load in the intermittent section because
the async shutdown service is arguably fundamental infrastructure
which we want people to use, but we also don't want to require that
it's running at startup; that's just an implemention artifact. (And
if we have concerns about the amount of JS in the async shutdown
service as exposed via nsIAsyncShutdownService, that should be done
as an enhancement bug in XPCOM.)
Depends on D166262
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 45•2 years ago
|
||
First step for some docs/diagrams for RemoteWorker. I'm using plantuml
because it's backed by graphviz's layout algorithm and has a built-in
preprocessor which makes it easy to add searchfox links for things
without crufting up the document.
The trade-offs relative to mermaid in more detail:
- Mermaid does not require someone to run
plantuml -tsvg OVERVIEW.md
to
update the diagram, which is nice. - Mermaid diagrams are currently only rendered by firefox-source-docs
client-side using JS and on firefox-source-docs.mozilla.org this results
in a very noticeable delay even for trivial diagrams. This is bad UX but
could be improved. - Mermaid uses the deprecated/abandoned dagre layout lib which I've found
through prior extensive investigation to have much more limited automatic
layout capabilities as it relates to (nested) clusters and in particular
is not comparable in sophistication for graphviz's record-type (HTML
label) mechanisms. (I believe this largely stems from dagre's primary
use case lying in bioinformatics where nodes tended towards points
rather than complicated records, although dagre is not without related
capabilities.) - PlantUML explicitly exposes graphviz's record/HTML label mechanism with
a friendly syntax, including ports. This also includes neat tricks with
ports on containers. Mermaid does not have anything comparable.- PlantUML actually has some extremely neat capabilities involving JSON.
- PlantUML has a sophisticated preprocessor mechanism that makes it easy to
create helper wrappers to automatically create searchfox searches
without having to manually create and formulate links which would be
inherently distracting in working with the source.- Furthermore, its include mechanism and JSON processing support in
conjunction with the preprocessor would make it possible to issue
queries against searchfox's sorch endpoint to automatically produce
some simple diagrams like class hierarchies. (Arguably those should
just be produced by searchfox, but that's not currently in the cards.)
- Furthermore, its include mechanism and JSON processing support in
Depends on D164644
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 51•2 years ago
|
||
As discussed in https://phabricator.services.mozilla.com/D166269, the
nsIAsyncShutdownService does not normally run at startup in content
processes, so anything that causes it to newly run at startup
unconditionally potentially introduces a non-trivial amount of jank
because this necessitates synchronous loads of JS code.
Bug 1760855 tracks re-implementing nsIAsyncShutdownService in non-JS
for C++ consumers like the RemoteWorkerService. At that time, this
patch can be reverted and any bit-rot fixed so that we go back to
using nsIAsyncShutdownService, with the following extra changes:
- The runnable that defers the call to InitializeOnMainThread should
no longer be necessary, although if left intact it is harmless,
just not necessary. There is more discussion on
https://phabricator.services.mozilla.com/D164644.
Depends on D164644
Updated•2 years ago
|
Comment 52•2 years ago
|
||
Comment 53•2 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/dc1a4b1f617f
https://hg.mozilla.org/mozilla-central/rev/295469f72761
https://hg.mozilla.org/mozilla-central/rev/6dc3d26eda30
https://hg.mozilla.org/mozilla-central/rev/6c391b739e84
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Updated•2 years ago
|
Comment 57•1 years ago
|
||
Copying crash signatures from duplicate bugs.
Description
•