Closed Bug 1775784 Opened 2 years ago Closed 2 years ago

Intermittent dom/workers/test/browser_worker_use_counters.js | single tracking bug

Categories

(Core :: DOM: Workers, defect, P3)

defect

Tracking

()

RESOLVED FIXED
113 Branch
Tracking Status
firefox113 --- fixed

People

(Reporter: jmaher, Assigned: asuth)

References

Details

(Keywords: intermittent-failure, intermittent-testcase)

Crash Data

Attachments

(5 files, 2 obsolete files)

No description provided.

Additional information about this bug failures and frequency patterns can be found by running: ./mach test-info failure-report --bug 1775784

Hi :jmaher, ./mach test-info failure-report --bug 1775842 gives me:

no failures were found for given bugid, please ensure bug is
accessible via: https://treeherder.mozilla.org/intermittent-failures

but https://treeherder.mozilla.org/intermittent-failures/bugdetails?startday=2022-06-16&endday=2022-06-23&tree=all&bug=1775842 shows there is some.

Please note the tree=all, with tree=trunk I get an empty list also there. Is this the intended behavior for ./mach test-info failure-report ?

Flags: needinfo?(jmaher)

this report will work once failures are associated with this new bug. the ./mach test-info ... command reports by default the last 30 days on trunk:
https://searchfox.org/mozilla-central/source/testing/mach_commands.py#996

Flags: needinfo?(jmaher)
Severity: -- → S3
Assignee: nobody → bugmail
Status: NEW → ASSIGNED

tl;dr: Worker shutdown is happening at Canceling not Killing as far as the remote worker mechanism is concerned for the purposes of deciding when to stop keeping the process alive, which leads to a race between worker shutdown and the process being torn down, and this leads to the telemetry not being reported because the observer shutdown notification has already happened by the time the worker gets around to reporting its usage. By fixing this by moving the remote worker notification to the transition to Killing (via helper lambda passed into the WorkerPrivate constructor) we can also likely address another set of related races.

My longer key notes here from slack discussion:

So I understand that for things to work:

  • The worker has to reach the killing stage and report its use counters. (I guess we must do this for reasons related to lock contention? Arguably doing it once at worker shutdown does avoid a lot of pathological possibilities for multiple workers doing the exact same thing!)
  • The child has to send telemetry up to the parent. It will do this in "content-child-shutdown".

In the pernosco trace at https://pernos.co/debug/KzErTzqDyGm8hBQz1FdPCg/index.html#f{m[AkVx,ChMJ_,t[AbE,Cmc_,f{e[AkVx,ChMC_,s{af4gCXQAA,bARo,uHk90Sw,oHlBm3Q___/ at that timestamp we can see that the "content-child-shutdown" telemetry IPC sending is happening prior to the worker reaching the killing state and so the telemetry never gets sent.The reason the child is able to start shutdown for SharedWorkers or ServiceWorkers is that RemoteWorkerParent::ActorDestroy notifies the ContentParent which can then start the parent shutdown.

Presumably we can address this by keeping the RemoteWorkerParent alive slightly longer, and in doing so we'll probably eliminate a whole bunch of intermittents, which could be very nice! We have a somewhat nice diagram that suggests that RemoteWorkerChild::ShutdownOnWorker is responsible for sending the close message that causes problems, and indeed that's what happened in the above pernosco trace. This WeakWorkerRef is triggered on the transition to Canceling which is strictly before Killing.

Unfortunately, it's a bit awkward to be notified of the transition to Killing right now. We won't transition to killing if we have active worker refs. Also it would be a wacky concept to be able to have WorkerRefs that notify on killing because that's the point where everything should have stopped and you're explicitly not allowed to do anything. Maybe the most practical thing is to add an optional lambda argument to WorkerPrivate::Constructor like RemoteWorkerChild::ExecWorkerOnMainThread calls and call that lambda in WorkerPrivate::ClearSelfAndParentEventTargetRef. This is potentially a bit extreme but we know the worker is basically entirely dead at that point. As the logic that creates and dispatches the worker finished runnable says, the worker literally can't do anything after dispatching that runnable! I don't see any real harm in making sure the worker is totally, totally done. (Well, we will be less efficient, but from our team's perspectives, a LOT of edge-cases will go away, so it's a win for us!)

This blocks landing of the GC/CC enhancements in bug 1777921

Blocks: 1777921
Blocks: 1749068
Blocks: 1769350
Attachment #9308210 - Attachment description: WIP: Bug 1775784 - Make PRemoteWorker refcounted instead of ManualDealloc. r=#dom-workers! → Bug 1775784 - Make PRemoteWorker refcounted instead of ManualDealloc. r=#dom-workers!
Attachment #9311239 - Attachment description: WIP: Bug 1775784 - Make PRemoteWorkerService refcounted instead of ManualDealloc. r=#dom-workers! → Bug 1775784 - Make PRemoteWorkerService refcounted instead of ManualDealloc. r=#dom-workers!
Attachment #9311240 - Attachment description: WIP: Bug 1775784 - Make RemoteWorkerService use an async shutdown blocker. r=#dom-workers! → Bug 1775784 - Make RemoteWorkerService use an async shutdown blocker. r=#dom-workers!

nsIAsyncShutdownService defined in nsIAsyncShutdown.idl is the preferred
approach for handling asynchronous shutdowns as it provides for parallelism
in shutdown rather than having N independent observer service-based
implementations each running in parallel because they each spin their own
nested event loop. (There have been discussions in #xpcom about moving the
service to C++ from JS, but there is not currently a bug.)

In this patch stack on (this) bug 1775784 we move RemoteWorkerService to
use the async shutdown service to address correctness problems related
to its shutdown while avoiding adding a serializing event loop spin.

As it relates to the browser_startup_content*.js tests:

  • nsAsyncShutdown.sys.mjs is already present, but its dependencies that
    are loaded if it is actually used (AsyncShutdown.sys.mjs,
    PromiseUtils.sys.mjs) were not listed. Presumably some code that
    lazily adds a blocker was loaded but it did not add a blocker.
  • RemoteWorkerService is unconditionally started up (this has not
    changed; it's necessary and important for latency for ServiceWorkers)
    and so it is definitely loading the async shtudown service and using
    it, so the additional deps need to be present.
  • Currently async shutdown lives in the intermittently loaded scripts
    section. Since RemoteWorkerService is unconditionally initialized,
    it's not clear if this should be moved up to known_scripts since
    it currently will always loaded or if it's better to leave it as
    intermittent because it may be possible that upcoming worker cleanups
    and refactorings may be able to moot the async shutdown service load,
    in which case it would be appropriate to move it back to intermittent
    but it's possible some other component might have made it
    unconditionally load, but it would be hard to tell that because they
    would not have run into this test failing.
  • I've decided to leave the load in the intermittent section because
    the async shutdown service is arguably fundamental infrastructure
    which we want people to use, but we also don't want to require that
    it's running at startup; that's just an implemention artifact. (And
    if we have concerns about the amount of JS in the async shutdown
    service as exposed via nsIAsyncShutdownService, that should be done
    as an enhancement bug in XPCOM.)

Depends on D166262

Attachment #9308211 - Attachment description: WIP: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! → Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers!
Attachment #9311259 - Attachment is obsolete: true

First step for some docs/diagrams for RemoteWorker. I'm using plantuml
because it's backed by graphviz's layout algorithm and has a built-in
preprocessor which makes it easy to add searchfox links for things
without crufting up the document.

The trade-offs relative to mermaid in more detail:

  • Mermaid does not require someone to run plantuml -tsvg OVERVIEW.md to
    update the diagram, which is nice.
  • Mermaid diagrams are currently only rendered by firefox-source-docs
    client-side using JS and on firefox-source-docs.mozilla.org this results
    in a very noticeable delay even for trivial diagrams. This is bad UX but
    could be improved.
  • Mermaid uses the deprecated/abandoned dagre layout lib which I've found
    through prior extensive investigation to have much more limited automatic
    layout capabilities as it relates to (nested) clusters and in particular
    is not comparable in sophistication for graphviz's record-type (HTML
    label) mechanisms. (I believe this largely stems from dagre's primary
    use case lying in bioinformatics where nodes tended towards points
    rather than complicated records, although dagre is not without related
    capabilities.)
  • PlantUML explicitly exposes graphviz's record/HTML label mechanism with
    a friendly syntax, including ports. This also includes neat tricks with
    ports on containers. Mermaid does not have anything comparable.
    • PlantUML actually has some extremely neat capabilities involving JSON.
  • PlantUML has a sophisticated preprocessor mechanism that makes it easy to
    create helper wrappers to automatically create searchfox searches
    without having to manually create and formulate links which would be
    inherently distracting in working with the source.
    • Furthermore, its include mechanism and JSON processing support in
      conjunction with the preprocessor would make it possible to issue
      queries against searchfox's sorch endpoint to automatically produce
      some simple diagrams like class hierarchies. (Arguably those should
      just be produced by searchfox, but that's not currently in the cards.)

Depends on D164644

Blocks: 1703952
Blocks: 1717657
Blocks: 1743092
Attachment #9311240 - Attachment is obsolete: true
Attachment #9308211 - Attachment description: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! → WIP: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers!
Attachment #9308211 - Attachment description: WIP: Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers! → Bug 1775784 - Improve worker shutdown ordering. r=#dom-workers!

As discussed in https://phabricator.services.mozilla.com/D166269, the
nsIAsyncShutdownService does not normally run at startup in content
processes, so anything that causes it to newly run at startup
unconditionally potentially introduces a non-trivial amount of jank
because this necessitates synchronous loads of JS code.

Bug 1760855 tracks re-implementing nsIAsyncShutdownService in non-JS
for C++ consumers like the RemoteWorkerService. At that time, this
patch can be reverted and any bit-rot fixed so that we go back to
using nsIAsyncShutdownService, with the following extra changes:

Depends on D164644

Pushed by bugmail@asutherland.org: https://hg.mozilla.org/integration/autoland/rev/dc1a4b1f617f Make PRemoteWorker refcounted instead of ManualDealloc. r=dom-worker-reviewers,smaug https://hg.mozilla.org/integration/autoland/rev/295469f72761 Make PRemoteWorkerService refcounted instead of ManualDealloc. r=dom-worker-reviewers,smaug https://hg.mozilla.org/integration/autoland/rev/6dc3d26eda30 Improve worker shutdown ordering. r=dom-worker-reviewers,smaug https://hg.mozilla.org/integration/autoland/rev/6c391b739e84 Switch back to event loop spinning. r=dom-worker-reviewers,smaug
Duplicate of this bug: 1619923

Copying crash signatures from duplicate bugs.

Crash Signature: [@ mozilla::dom::WorkerPrivate::Notify]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: