1775784 - Intermittent dom/workers/test/browser_worker_use_counters.js | single tracking bug

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Description

•

2 years ago

No description provided.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 1

•

2 years ago

Additional information about this bug failures and frequency patterns can be found by running: ./mach test-info failure-report --bug 1775784

Jens Stutte [:jstutte]

Comment 2

•

2 years ago

Hi :jmaher, ./mach test-info failure-report --bug 1775842 gives me:

no failures were found for given bugid, please ensure bug is
accessible via: https://treeherder.mozilla.org/intermittent-failures

but https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-06-16&endday=2022-06-23&tree=all&bug=1775842 shows there is some.

Please note the tree=all, with tree=trunk I get an empty list also there. Is this the intended behavior for ./mach test-info failure-report ?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 3

•

2 years ago

this report will work once failures are associated with this new bug. the ./mach test-info ... command reports by default the last 30 days on trunk:
https://searchfox.org/mozilla-central/source/testing/mach_commands.py#996

Flags: needinfo?(jmaher)

Comment hidden (Intermittent Failures Robot)

Randell Jesup [:jesup] (needinfo me)

Updated

•

2 years ago

Severity: -- → S3

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

2 years ago

Assignee: nobody → bugmail

Status: NEW → ASSIGNED

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 18

•

2 years ago

tl;dr: Worker shutdown is happening at Canceling not Killing as far as the remote worker mechanism is concerned for the purposes of deciding when to stop keeping the process alive, which leads to a race between worker shutdown and the process being torn down, and this leads to the telemetry not being reported because the observer shutdown notification has already happened by the time the worker gets around to reporting its usage. By fixing this by moving the remote worker notification to the transition to Killing (via helper lambda passed into the WorkerPrivate constructor) we can also likely address another set of related races.

My longer key notes here from slack discussion:

So I understand that for things to work:

The worker has to reach the killing stage and report its use counters. (I guess we must do this for reasons related to lock contention? Arguably doing it once at worker shutdown does avoid a lot of pathological possibilities for multiple workers doing the exact same thing!)
The child has to send telemetry up to the parent. It will do this in "content-child-shutdown".

In the pernosco trace at https://pernos.co/debug/KzErTzqDyGm8hBQz1FdPCg/index.html#f{m[AkVx,ChMJ_,t[AbE,Cmc_,f{e[AkVx,ChMC_,s{af4gCXQAA,bARo,uHk90Sw,oHlBm3Q___/ at that timestamp we can see that the "content-child-shutdown" telemetry IPC sending is happening prior to the worker reaching the killing state and so the telemetry never gets sent.The reason the child is able to start shutdown for SharedWorkers or ServiceWorkers is that RemoteWorkerParent::ActorDestroy notifies the ContentParent which can then start the parent shutdown.

Presumably we can address this by keeping the RemoteWorkerParent alive slightly longer, and in doing so we'll probably eliminate a whole bunch of intermittents, which could be very nice! We have a somewhat nice diagram that suggests that RemoteWorkerChild::ShutdownOnWorker is responsible for sending the close message that causes problems, and indeed that's what happened in the above pernosco trace. This WeakWorkerRef is triggered on the transition to Canceling which is strictly before Killing.

Unfortunately, it's a bit awkward to be notified of the transition to Killing right now. We won't transition to killing if we have active worker refs. Also it would be a wacky concept to be able to have WorkerRefs that notify on killing because that's the point where everything should have stopped and you're explicitly not allowed to do anything. Maybe the most practical thing is to add an optional lambda argument to WorkerPrivate::Constructor like RemoteWorkerChild::ExecWorkerOnMainThread calls and call that lambda in WorkerPrivate::ClearSelfAndParentEventTargetRef. This is potentially a bit extreme but we know the worker is basically entirely dead at that point. As the logic that creates and dispatches the worker finished runnable says, the worker literally can't do anything after dispatching that runnable! I don't see any real harm in making sure the worker is totally, totally done. (Well, we will be less efficient, but from our team's perspectives, a LOT of edge-cases will go away, so it's a win for us!)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 19

•

2 years ago

This blocks landing of the GC/CC enhancements in bug 1777921

Blocks: 1777921

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

2 years ago

Blocks: 1749068

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 32

•

2 years ago

Attached file Bug 1775784 - Make PRemoteWorker refcounted instead of ManualDealloc. r=#dom-workers! (deleted) — Details

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 33

•

2 years ago

Attached file Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! (deleted) — Details

Depends on D164643

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

2 years ago

Blocks: 1769350

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 37

•

2 years ago

Attached file Bug 1775784 - Make PRemoteWorkerService refcounted instead of ManualDealloc. r=#dom-workers! (deleted) — Details

Depends on D164643

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 38

•

2 years ago

Attached file Bug 1775784 - Make RemoteWorkerService use an async shutdown blocker. r=#dom-workers! (obsolete) (deleted) — Details

Depends on D166261

Comment hidden (Intermittent Failures Robot)

Phabricator Automation

Updated

•

2 years ago

Attachment #9308210 - Attachment description: WIP: Bug 1775784 - Make PRemoteWorker refcounted instead of ManualDealloc. r=#dom-workers! → Bug 1775784 - Make PRemoteWorker refcounted instead of ManualDealloc. r=#dom-workers!

Phabricator Automation

Updated

•

2 years ago

Attachment #9311239 - Attachment description: WIP: Bug 1775784 - Make PRemoteWorkerService refcounted instead of ManualDealloc. r=#dom-workers! → Bug 1775784 - Make PRemoteWorkerService refcounted instead of ManualDealloc. r=#dom-workers!

Phabricator Automation

Updated

•

2 years ago

Attachment #9311240 - Attachment description: WIP: Bug 1775784 - Make RemoteWorkerService use an async shutdown blocker. r=#dom-workers! → Bug 1775784 - Make RemoteWorkerService use an async shutdown blocker. r=#dom-workers!

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 40

•

2 years ago

Attached file Bug 1775784 - Adjust tests that check JS loads for nsIAsyncShutdownService. r=Standard8! (obsolete) (deleted) — Details

nsIAsyncShutdownService defined in nsIAsyncShutdown.idl is the preferred
approach for handling asynchronous shutdowns as it provides for parallelism
in shutdown rather than having N independent observer service-based
implementations each running in parallel because they each spin their own
nested event loop. (There have been discussions in #xpcom about moving the
service to C++ from JS, but there is not currently a bug.)

In this patch stack on (this) bug 1775784 we move RemoteWorkerService to
use the async shutdown service to address correctness problems related
to its shutdown while avoiding adding a serializing event loop spin.

As it relates to the browser_startup_content*.js tests:

nsAsyncShutdown.sys.mjs is already present, but its dependencies that
are loaded if it is actually used (AsyncShutdown.sys.mjs,
PromiseUtils.sys.mjs) were not listed. Presumably some code that
lazily adds a blocker was loaded but it did not add a blocker.
RemoteWorkerService is unconditionally started up (this has not
changed; it's necessary and important for latency for ServiceWorkers)
and so it is definitely loading the async shtudown service and using
it, so the additional deps need to be present.
Currently async shutdown lives in the intermittently loaded scripts
section. Since RemoteWorkerService is unconditionally initialized,
it's not clear if this should be moved up to known_scripts since
it currently will always loaded or if it's better to leave it as
intermittent because it may be possible that upcoming worker cleanups
and refactorings may be able to moot the async shutdown service load,
in which case it would be appropriate to move it back to intermittent
but it's possible some other component might have made it
unconditionally load, but it would be hard to tell that because they
would not have run into this test failing.
I've decided to leave the load in the intermittent section because
the async shutdown service is arguably fundamental infrastructure
which we want people to use, but we also don't want to require that
it's running at startup; that's just an implemention artifact. (And
if we have concerns about the amount of JS in the async shutdown
service as exposed via nsIAsyncShutdownService, that should be done
as an enhancement bug in XPCOM.)

Depends on D166262

Phabricator Automation

Updated

•

2 years ago

Attachment #9308211 - Attachment description: WIP: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! → Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers!

Comment hidden (Intermittent Failures Robot)

Phabricator Automation

Updated

•

2 years ago

Attachment #9311259 - Attachment is obsolete: true

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 45

•

2 years ago

Attached file WIP: Bug 1775784 - WIP RemoteWorker documentation diagrams with links (deleted) — Details

First step for some docs/diagrams for RemoteWorker. I'm using plantuml
because it's backed by graphviz's layout algorithm and has a built-in
preprocessor which makes it easy to add searchfox links for things
without crufting up the document.

The trade-offs relative to mermaid in more detail:

Mermaid does not require someone to run plantuml -tsvg OVERVIEW.md to
update the diagram, which is nice.
Mermaid diagrams are currently only rendered by firefox-source-docs
client-side using JS and on firefox-source-docs.mozilla.org this results
in a very noticeable delay even for trivial diagrams. This is bad UX but
could be improved.
Mermaid uses the deprecated/abandoned dagre layout lib which I've found
through prior extensive investigation to have much more limited automatic
layout capabilities as it relates to (nested) clusters and in particular
is not comparable in sophistication for graphviz's record-type (HTML
label) mechanisms. (I believe this largely stems from dagre's primary
use case lying in bioinformatics where nodes tended towards points
rather than complicated records, although dagre is not without related
capabilities.)
PlantUML explicitly exposes graphviz's record/HTML label mechanism with
a friendly syntax, including ports. This also includes neat tricks with
ports on containers. Mermaid does not have anything comparable.
- PlantUML actually has some extremely neat capabilities involving JSON.
PlantUML has a sophisticated preprocessor mechanism that makes it easy to
create helper wrappers to automatically create searchfox searches
without having to manually create and formulate links which would be
inherently distracting in working with the source.
- Furthermore, its include mechanism and JSON processing support in
  conjunction with the preprocessor would make it possible to issue
  queries against searchfox's sorch endpoint to automatically produce
  some simple diagrams like class hierarchies. (Arguably those should
  just be produced by searchfox, but that's not currently in the cards.)

Depends on D164644

Comment hidden (Intermittent Failures Robot)

Eden Chuang[:edenchuang]

Updated

•

2 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1814368

Marian-Vasile Laza

Updated

•

2 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1703952

Comment hidden (Intermittent Failures Robot)

Cosmin Sabou [:CosminS]

Updated

•

2 years ago

Blocks: 1703952

Cosmin Sabou [:CosminS]

Updated

•

2 years ago

Blocks: 1717657

Cosmin Sabou [:CosminS]

Updated

•

2 years ago

Blocks: 1743092

Comment hidden (Intermittent Failures Robot)

Phabricator Automation

Updated

•

2 years ago

Attachment #9311240 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9308211 - Attachment description: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! → WIP: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers!

Phabricator Automation

Updated

•

2 years ago

Attachment #9308211 - Attachment description: WIP: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! → Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers!

Andrew Sutherland [:asuth] (he/him)

Assignee

Comment 51

•

2 years ago

Attached file Bug 1775784 - Switch back to event loop spinning. r=#dom-workers! (deleted) — Details

As discussed in https://phabricator.services.mozilla.com/D166269, the
nsIAsyncShutdownService does not normally run at startup in content
processes, so anything that causes it to newly run at startup
unconditionally potentially introduces a non-trivial amount of jank
because this necessitates synchronous loads of JS code.

Bug 1760855 tracks re-implementing nsIAsyncShutdownService in non-JS
for C++ consumers like the RemoteWorkerService. At that time, this
patch can be reverted and any bit-rot fixed so that we go back to
using nsIAsyncShutdownService, with the following extra changes:

The runnable that defers the call to InitializeOnMainThread should
no longer be necessary, although if left intact it is harmless,
just not necessary. There is more discussion on
https://phabricator.services.mozilla.com/D164644.

Depends on D164644

Jens Stutte [:jstutte]

Updated

•

2 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1760855

Pulsebot

Comment 52

•

2 years ago

Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/dc1a4b1f617f Make PRemoteWorker refcounted instead of ManualDealloc. r=dom-worker-reviewers,smaug https://hg.mozilla.org/integration/autoland/rev/295469f72761 Make PRemoteWorkerService refcounted instead of ManualDealloc. r=dom-worker-reviewers,smaug https://hg.mozilla.org/integration/autoland/rev/6dc3d26eda30 Improve worker shutdown ordering. r=dom-worker-reviewers,smaug https://hg.mozilla.org/integration/autoland/rev/6c391b739e84 Switch back to event loop spinning. r=dom-worker-reviewers,smaug

Atila Butkovits

Comment 53

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/dc1a4b1f617f
https://hg.mozilla.org/mozilla-central/rev/295469f72761
https://hg.mozilla.org/mozilla-central/rev/6dc3d26eda30
https://hg.mozilla.org/mozilla-central/rev/6c391b739e84

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

status-firefox113: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 113 Branch

Comment hidden (Intermittent Failures Robot)

Treeherder Bug Filer

Updated

•

2 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1823742

Comment hidden (Intermittent Failures Robot)

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

2 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1825446

Andrew Sutherland [:asuth] (he/him)

Assignee

Updated

•

1 years ago

Duplicate of this bug: 1619923

BugBot [:suhaib / :marco/ :calixte]

Comment 57

•

1 years ago

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::dom::WorkerPrivate::Notify]

Bug 1775784 - Make PRemoteWorker refcounted instead of ManualDealloc. r=#dom-workers! 2 years ago Andrew Sutherland [:asuth] (he/him) (deleted), text/x-phabricator-request		Details
Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! 2 years ago Andrew Sutherland [:asuth] (he/him) (deleted), text/x-phabricator-request		Details
Bug 1775784 - Make PRemoteWorkerService refcounted instead of ManualDealloc. r=#dom-workers! 2 years ago Andrew Sutherland [:asuth] (he/him) (deleted), text/x-phabricator-request		Details
Bug 1775784 - Make RemoteWorkerService use an async shutdown blocker. r=#dom-workers! 2 years ago Andrew Sutherland [:asuth] (he/him) (deleted), text/x-phabricator-request		Details
Bug 1775784 - Adjust tests that check JS loads for nsIAsyncShutdownService. r=Standard8! 2 years ago Andrew Sutherland [:asuth] (he/him) (deleted), text/x-phabricator-request		Details
WIP: Bug 1775784 - WIP RemoteWorker documentation diagrams with links 2 years ago Andrew Sutherland [:asuth] (he/him) (deleted), text/x-phabricator-request		Details
Bug 1775784 - Switch back to event loop spinning. r=#dom-workers! 2 years ago Andrew Sutherland [:asuth] (he/him) (deleted), text/x-phabricator-request		Details