1743671 - Crash in [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript], [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef ]

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

3 years ago

7 crashes from 5+ different machines, oldest with Firefox Nightly 96.0a1 20211127092810. 6/7 crashes on about: pages (newtab, preferences).

Based on the first affected builds, this looks like a regression from the changes in bug 1742333. Masayuki, could you take the time and verify this, please?

Crash report: https://crash-stats.mozilla.org/report/index/832ee87d-63c3-45ee-b180-8e4ff0211128

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll mozilla::DOMEventTargetHelper::AddRef dom/events/DOMEventTargetHelper.cpp:86
1 xul.dll mozilla::dom::AutoEntryScript::AutoEntryScript dom/script/AutoEntryScript.cpp:75
2 xul.dll mozilla::dom::Promise::MaybeSomething<const JS::Handle<JS::Value>&> dom/promise/Promise.h:326
3 xul.dll mozilla::MozPromise<mozilla::Ok, mozilla::dom::IOUtils::IOError, 1>::ThenValue<`lambda at /builds/worker/checkouts/gecko/dom/system/IOUtils.cpp:266:9', `lambda at /builds/worker/checkouts/gecko/dom/system/IOUtils.cpp:269:9'>::DoResolveOrRejectInternal xpcom/threads/MozPromise.h:846
4 xul.dll mozilla::MozPromise<bool, mozilla::dom::IOUtils::IOError, 1>::ThenValueBase::ResolveOrRejectRunnable::Run xpcom/threads/MozPromise.h:487
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1177
6 xul.dll NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:432
7 xul.dll mozilla::dom::WorkerPrivate::ClearMainEventQueue dom/workers/WorkerPrivate.cpp:3718
8 xul.dll mozilla::dom::WorkerPrivate::ScheduleDeletion dom/workers/WorkerPrivate.cpp:3545
9 xul.dll mozilla::dom::workerinternals::`anonymous namespace'::WorkerThreadPrimaryRunnable::Run dom/workers/RuntimeService.cpp:2277

Flags: needinfo?(masayuki)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 1

•

3 years ago

It just converted if and else ifs to a switch... If it caused this bug, I did something simple mistake... But this happens in another event loop, so I guess that this is caused by different bug, anyway, I'll check my changes though.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Updated

•

3 years ago

No longer regressed by: 1742333

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 2

•

3 years ago

As far as I checked, my patches didn't change any object's life cycle. So bug 1742333 shouldn't cause this bug.

According to the stack, it tries to call AddRef of an EventTargetHelper pointer whose value is nullptr. I don't find any changes of dom::Promise nor MozPromise in November. So I have no idea which change causes this crash. On the other hand, I have no idea how to call AddRef of nullptr since nsCOMPtr and RefPtr do null-check. NS_ADDREF is used? or the cycle collector related to this?

Flags: needinfo?(masayuki)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 3

•

3 years ago

Olli, could you check the questions raised by Masayuki in comment 2, please?

Flags: needinfo?(bugs)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 4

•

3 years ago

Something to do with https://searchfox.org/mozilla-central/rev/33641307ec33033f5129826d8d6eda19feb8a01f/dom/workers/WorkerScope.h#82-83
and higher up in the stack there is IOUtils.cpp (in all the crashes). That is some chrome JS only API.

Aryx, do you have a possible regression range for this? My guess is some change to some chrome JS code dealing with IO in a worker.

Flags: needinfo?(bugs) → needinfo?(aryx.bugmail)

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 5

•

3 years ago

bholley, do you recall why AutoEntryScript doesn't have Init(), which AutoJSAPI has (to handle error cases)?

Flags: needinfo?(bholley)

Bobby Holley (:bholley)

Comment 6

•

3 years ago

(In reply to Olli Pettay [:smaug] from comment #5)

bholley, do you recall why AutoEntryScript doesn't have Init(), which AutoJSAPI has (to handle error cases)?

I think the fallible init functions were added after (see discussion in bug 1025476 comment 0). I'm guessing the failure modes we were handling weren't a problem for AutoEntryScript, and we didn't want to go and update all the consumers. Making it fallible like AutoJSAPI seems fine if needed, just a bit of work.

Flags: needinfo?(bholley)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 7

•

3 years ago

No crash reports for the last 7 Nightlies, this might have already been fixed. Potential regression range when this started (global Nightly usage should have been higher on Friday and more likely to get a report for the signature, hence commits closer to the top of the list are more likely to have changed the underlying behavior).

Flags: needinfo?(aryx.bugmail)

Dianna Smith [:diannaS]

Updated

•

3 years ago

status-firefox96: affected → fix-optional

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 8

•

3 years ago

For Firefox 96 betas, there are 14 crashes from 8+ devices with [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef], e.g. bp-d49bcac5-5af1-404e-83dd-a27e50211212. The difference is [thunk]: virtual unsigned long __stdcall mozilla::dom::WorkerGlobalScope::AddRef in stack frame #1.

Crash Signature: [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] → [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef]

Summary: Crash in [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] → Crash in [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript], [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef ]

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Updated

•

3 years ago

Crash Signature: [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef] → [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::DedicatedWorkerGlobalScope::AddRef] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobal…

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

status-firefox96: fix-optional → wontfix

status-firefox97: --- → affected

status-firefox98: --- → affected

Ryan VanderMeulen [:RyanVM]

Comment 9

•

3 years ago

This hasn't gone away on Nightly and is showing up on the Beta97 topcrash list. Any chance we can take another look at this?

Ryan VanderMeulen [:RyanVM]

Updated

•

3 years ago

Flags: needinfo?(masayuki)

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)

Comment 10

•

3 years ago

No, probably I'm not related to the regression cause as I said above.

Flags: needinfo?(masayuki)

Ryan VanderMeulen [:RyanVM]

Comment 11

•

3 years ago

Hsin-Yi, can you please help find an owner for this?

Flags: needinfo?(htsai)

Hsin-Yi Tsai (she/her) [:hsinyi]

Comment 12

•

3 years ago

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #7)

No crash reports for the last 7 Nightlies, this might have already been fixed. Potential regression range when this started (global Nightly usage should have been higher on Friday and more likely to get a report for the signature, hence commits closer to the top of the list are more likely to have changed the underlying behavior).

Olli, something suspicious from the potential regression range? Thank you.

Flags: needinfo?(htsai) → needinfo?(bugs)

Hsin-Yi Tsai (she/her) [:hsinyi]

Comment 13

•

3 years ago

Andrew, I'm NI you here to try to get more insights .

Flags: needinfo?(continuation)

Andrew Sutherland [:asuth] (he/him)

Comment 14

•

3 years ago

(Edit note Jan 27th: this section is overly concerned about the event targets in use, but these end up being more indicative of a lifecycle problem than the grungy event target things going on in the bowels of workers. Although we now have some good plans about improving the worker event target machinery.)

The most recent crash for this in nightly at https://crash-stats.mozilla.org/report/index/9c97e29c-a05c-40ac-897a-0753b0220126 is fairly interesting:

Its crash being MOZ_CRASH(Found dangling CheckedUnsafePtr) but this is not surprising because...
IOUtils scheduled a MozPromise against the bare nsThread underlying the worker, rather than the worker's event target, but...
IOUtils is quite reasonably using GetCurentSerialEventTarget which falls back to using NS_GetCurrentThread if there's no known serial event target.

I think there's potentially 2 things to be addressed here:

It seems like we definitely need some custom behavior for workers for GetCurentSerialEventTarget. In particular, for workers (edited Jan 27 for future clarity) there are lifecycle concerns and our current WorkerThread::Dispatch logic is troublesome and potentially no longer needed because we can hand out something more granular than GetCurentSerialEventTarget. While this would workers-ish specific, I wonder if CycleCollectedJSRuntime might be the right place for the glue. :mccr8, when you address the needinfo, I think this is a question for you.
Given that IOUtils is explicitly an implementation of a WebIDL interface maybe it should be using nsIGlobalObject's inherited implementation of DispatcherTrait which provides Dispatch and EventTargetFor.
- Note that the comments for Dispatch and EventTargetFor quite unhelpfully say "This method may or may not be safe off of the main thread." and we should probably make that more helpful. For workers it's already the case that EventTargetFor returns the worker private's hybrid event target which is the wacky target that wraps runnables in WorkerRunnables until the worker gets canceled and then it starts wrapping them in WorkerControlRunnables. In the future we will almost certainly change the event target so that it starts discarding runnables in non-debug builds and maybe asserts hard in debug builds because it's generally an indication that 1) you wanted a WorkerRef and/or 2) better shutdown logic is needed
- :smaug, does this make sense for event dispatching that is explicitly about a global?
- :barret, same question from the IOUtils application domain.

Component: DOM: Events → XPCOM

Flags: needinfo?(brennie)

Andrew Sutherland [:asuth] (he/him)

Comment 15

•

3 years ago

(I moved this into XPCOM for now because that's where GetCurentSerialEventTarget lives and this is the most direct point at which things are going wrong because of foot-gun issues but as noted in the previous comment, there are potentially multiple fixes to do and where those happen may be in XPConnect and/or workers and/or IOUtils and/or elsewhere.)

Ryan VanderMeulen [:RyanVM]

Comment 16

•

3 years ago

The crash volume on Beta remains worryingly-high. I'm worried we're going to see this spike when 97 goes to Release in a little over a week. Anything we can do to work around this in the near future would be appreciated.

tracking-firefox97: --- → +

tracking-firefox98: --- → +

Andrew McCreight [:mccr8]

Comment 17

•

3 years ago

I don't know anything about IOUtils, workers or event targets so I don't think I have anything to contribute here. It does look like there have been a fair amount of changes to IOUtils in the last few months.

Flags: needinfo?(continuation)

Barret Rennie [:barret] (they/them)

Assignee

Comment 18

•

3 years ago

:asuth I think this makes sense for IOUtils, as long as the original worker is getting the task dispatched back to it.

Flags: needinfo?(brennie)

Andrew Sutherland [:asuth] (he/him)

Comment 19

•

3 years ago

:nika and I talked on matrix about event targets. I need to still do some conceptual untangling of the relationship of WorkerThread::Dispatch and related logic which pre-dates GetCurentSerialEventTarget, but I think the core decisions relative to GetCurentSerialEventTarget are:

WorkerThreadPrimaryRunnable should directly instantiate a SerialEventTargetGuard if it wants something other than its nsIThread exposed. This is not something we want in CycleCollectedJSRuntime because the main thread definitely doesn't want behavior like the worker's for this.
This means IOUtils should still be able to use GetCurentSerialEventTarget as it is currently using. The nsIGlobalObject EventTargetFor logic from Quantum DOM isn't something we want to use.
Switching to a SerialEventTargetGuard could potentially help eliminate the need for WorkerThread.
:nika did some exploratory work in bug 1738103 "to add a way to observe event target shutdown and react to it" which could be relevant if we're trying to be able to shutdown the event target we hand out for workers.

Unfortunately, I think this may still leave us with some lifecycle issues around the DOM Promise. I think IOUtils potentially needs to either hold a (Strong)WorkerRef or we need to ensure that the dom::Promise gets torn down when the global gets torn down. If we're trying to fix this just by having an event target that gets turned off when the worker destroys the global, we'd otherwise have to leak all runnables sent to it, as we can't safely clean up the relevant objects. I'm going to try and look at this a little more.

Flags: needinfo?(bugs)

Andrew Sutherland [:asuth] (he/him)

Comment 20

•

3 years ago

(moving back to the component identified for IOUtils.cpp by the moz.build files as displayed in searchfox)

Component: XPCOM → DOM: Core & HTML

Andrew Sutherland [:asuth] (he/him)

Comment 21

•

3 years ago

Preliminary investigation didn't show an easy way to systematically fix this at the DOM Promise level at this time as it's the MozPromise lifecycle that's entraining the DOM Proimise that's problematic. That is, if the DOM Promise was neutralized when the worker changed to canceling, that wouldn't do anything about the MozPromise dispatched Then lambda still trying to run after the worker has gone away.

The worker either needs to live as long as the IOUtils-scheduled task lasts, or IOUtils needs to be cleaning up the task when the worker transitions to canceling. From a simplicity and sanity perspective and given that this is explicitly a System API (and if uplift wants to happen), I think it makes sense to just use a StrongWorkerRef that does nothing when notified and just gets dropped when the Then handler runs. IOUtils already has a shutdown mechanism that shutdown mechanism that prevents issuing new tasks once the shutdown blocker gets tripped which presents a reasonable mechanism to ensure shutdown happens in a timely fashion modulo in-flight requests. Adding complexity in order to try and abort in-progress sync-IO seems like a lot of work and new edge cases that will almost never prevent I/O from happening beyond what the existing shutdown infrastructure already does.

Given that IOUtils already unifies the lifetime through IOUtils::DispatchAndResolve, I think it then makes sense to, in there:

Optional: Consider switching to use the single-lambda Then variant that takes ResolveOrRejectValue random example. I think I'm just obsessing over having 2 concrete references to the StrongWorkerRef when conceptually we only need 1, but they are inline refcounted, so it's fine to have 2 references to it.
When not on the main thread (ex: web locks doing so), create a StrongWorkerRef with no callback that will get dropped when the promise resolves on the worker thread. random example. If it's possible for the lambda to get released on a different thread somehow, then use a ThreadSafeWorkerRef instead.

:barret If this sounds reasonable, can you work up a fix for this and ask the #dom-workers review group for review for the worker parts (and whoever is appropriate for the IOUtils parts)? Thanks!

Flags: needinfo?(brennie)

Barret Rennie [:barret] (they/them)

Assignee

Comment 22

•

3 years ago

Attached file Bug 1743671 - Keep worker threads alive while IOUtils tasks are running r?#dom-workers!,gijs! (deleted) — Details

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → brennie

Status: NEW → ASSIGNED

Pulsebot

Comment 23

•

3 years ago

Pushed by brennie@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4484267f0b66 Keep worker threads alive while IOUtils tasks are running r=Gijs,dom-worker-reviewers,asuth

Norisz Fay [:noriszfay]

Comment 24

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4484267f0b66

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox98: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 98 Branch

Barret Rennie [:barret] (they/them)

Assignee

Comment 25

•

3 years ago

Comment on attachment 9261553 [details]
Bug 1743671 - Keep worker threads alive while IOUtils tasks are running r?#dom-workers!,gijs!

Beta/Release Uplift Approval Request

User impact if declined: This patch addresses a crash on nightly and beta that his happening with a high frequency. Declining means this crash will start to occur on our release population, which could affect retention
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: No
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Unfortunately this patch is not testable at this time due to the complicated nature of causing the crash. However, this patch is simple and only affects the worker threads by keeping them alive slightly longer (so the IOUtils tasks can run to completion without the worker going away) which should not cause any issues.
String changes made/needed:

Flags: needinfo?(brennie)

Attachment #9261553 - Flags: approval-mozilla-release?

Hsin-Yi Tsai (she/her) [:hsinyi]

Comment 26

•

3 years ago

Thank you everyone for helping this out quickly.

Ryan VanderMeulen [:RyanVM]

Comment 27

•

3 years ago

Comment on attachment 9261553 [details]
Bug 1743671 - Keep worker threads alive while IOUtils tasks are running r?#dom-workers!,gijs!

Topcrash fix, results looking good on Nightly so far. Approved for 97.0rc2.

Attachment #9261553 - Flags: approval-mozilla-release? → approval-mozilla-release+

Ryan VanderMeulen [:RyanVM]

Comment 28

•

3 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/263fa5e14da5

status-firefox97: affected → fixed