Closed Bug 1743671 Opened 3 years ago Closed 3 years ago

Crash in [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript], [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef ]

Categories

(Core :: DOM: Core & HTML, defect)

Unspecified
All
defect

Tracking

()

RESOLVED FIXED
98 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox94 --- unaffected
firefox95 --- unaffected
firefox96 --- wontfix
firefox97 + fixed
firefox98 + fixed

People

(Reporter: aryx, Assigned: barret)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(1 file)

7 crashes from 5+ different machines, oldest with Firefox Nightly 96.0a1 20211127092810. 6/7 crashes on about: pages (newtab, preferences).

Based on the first affected builds, this looks like a regression from the changes in bug 1742333. Masayuki, could you take the time and verify this, please?

Crash report: https://crash-stats.mozilla.org/report/index/832ee87d-63c3-45ee-b180-8e4ff0211128

Reason: EXCEPTION_ACCESS_VIOLATION_READ

Top 10 frames of crashing thread:

0 xul.dll mozilla::DOMEventTargetHelper::AddRef dom/events/DOMEventTargetHelper.cpp:86
1 xul.dll mozilla::dom::AutoEntryScript::AutoEntryScript dom/script/AutoEntryScript.cpp:75
2 xul.dll mozilla::dom::Promise::MaybeSomething<const JS::Handle<JS::Value>&> dom/promise/Promise.h:326
3 xul.dll mozilla::MozPromise<mozilla::Ok, mozilla::dom::IOUtils::IOError, 1>::ThenValue<`lambda at /builds/worker/checkouts/gecko/dom/system/IOUtils.cpp:266:9', `lambda at /builds/worker/checkouts/gecko/dom/system/IOUtils.cpp:269:9'>::DoResolveOrRejectInternal xpcom/threads/MozPromise.h:846
4 xul.dll mozilla::MozPromise<bool, mozilla::dom::IOUtils::IOError, 1>::ThenValueBase::ResolveOrRejectRunnable::Run xpcom/threads/MozPromise.h:487
5 xul.dll nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1177
6 xul.dll NS_ProcessPendingEvents xpcom/threads/nsThreadUtils.cpp:432
7 xul.dll mozilla::dom::WorkerPrivate::ClearMainEventQueue dom/workers/WorkerPrivate.cpp:3718
8 xul.dll mozilla::dom::WorkerPrivate::ScheduleDeletion dom/workers/WorkerPrivate.cpp:3545
9 xul.dll mozilla::dom::workerinternals::`anonymous namespace'::WorkerThreadPrimaryRunnable::Run dom/workers/RuntimeService.cpp:2277
Flags: needinfo?(masayuki)

It just converted if and else ifs to a switch... If it caused this bug, I did something simple mistake... But this happens in another event loop, so I guess that this is caused by different bug, anyway, I'll check my changes though.

As far as I checked, my patches didn't change any object's life cycle. So bug 1742333 shouldn't cause this bug.

According to the stack, it tries to call AddRef of an EventTargetHelper pointer whose value is nullptr. I don't find any changes of dom::Promise nor MozPromise in November. So I have no idea which change causes this crash. On the other hand, I have no idea how to call AddRef of nullptr since nsCOMPtr and RefPtr do null-check. NS_ADDREF is used? or the cycle collector related to this?

Flags: needinfo?(masayuki)

Olli, could you check the questions raised by Masayuki in comment 2, please?

Flags: needinfo?(bugs)

Something to do with https://searchfox.org/mozilla-central/rev/33641307ec33033f5129826d8d6eda19feb8a01f/dom/workers/WorkerScope.h#82-83
and higher up in the stack there is IOUtils.cpp (in all the crashes). That is some chrome JS only API.

Aryx, do you have a possible regression range for this? My guess is some change to some chrome JS code dealing with IO in a worker.

Flags: needinfo?(bugs) → needinfo?(aryx.bugmail)

bholley, do you recall why AutoEntryScript doesn't have Init(), which AutoJSAPI has (to handle error cases)?

Flags: needinfo?(bholley)

(In reply to Olli Pettay [:smaug] from comment #5)

bholley, do you recall why AutoEntryScript doesn't have Init(), which AutoJSAPI has (to handle error cases)?

I think the fallible init functions were added after (see discussion in bug 1025476 comment 0). I'm guessing the failure modes we were handling weren't a problem for AutoEntryScript, and we didn't want to go and update all the consumers. Making it fallible like AutoJSAPI seems fine if needed, just a bit of work.

Flags: needinfo?(bholley)

No crash reports for the last 7 Nightlies, this might have already been fixed. Potential regression range when this started (global Nightly usage should have been higher on Friday and more likely to get a report for the signature, hence commits closer to the top of the list are more likely to have changed the underlying behavior).

Flags: needinfo?(aryx.bugmail)

For Firefox 96 betas, there are 14 crashes from 8+ devices with [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef], e.g. bp-d49bcac5-5af1-404e-83dd-a27e50211212. The difference is [thunk]: virtual unsigned long __stdcall mozilla::dom::WorkerGlobalScope::AddRef in stack frame #1.

Crash Signature: [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] → [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef]
Summary: Crash in [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] → Crash in [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript], [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef ]
Crash Signature: [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobalScope::AddRef] → [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::AutoEntryScript::AutoEntryScript] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::DedicatedWorkerGlobalScope::AddRef] [@ mozilla::DOMEventTargetHelper::AddRef | mozilla::dom::WorkerGlobal…

This hasn't gone away on Nightly and is showing up on the Beta97 topcrash list. Any chance we can take another look at this?

Flags: needinfo?(masayuki)

No, probably I'm not related to the regression cause as I said above.

Flags: needinfo?(masayuki)

Hsin-Yi, can you please help find an owner for this?

Flags: needinfo?(htsai)

(In reply to Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout) from comment #7)

No crash reports for the last 7 Nightlies, this might have already been fixed. Potential regression range when this started (global Nightly usage should have been higher on Friday and more likely to get a report for the signature, hence commits closer to the top of the list are more likely to have changed the underlying behavior).

Olli, something suspicious from the potential regression range? Thank you.

Flags: needinfo?(htsai) → needinfo?(bugs)

Andrew, I'm NI you here to try to get more insights .

Flags: needinfo?(continuation)

(Edit note Jan 27th: this section is overly concerned about the event targets in use, but these end up being more indicative of a lifecycle problem than the grungy event target things going on in the bowels of workers. Although we now have some good plans about improving the worker event target machinery.)

The most recent crash for this in nightly at https://crash-stats.mozilla.org/report/index/9c97e29c-a05c-40ac-897a-0753b0220126 is fairly interesting:

  • Its crash being MOZ_CRASH(Found dangling CheckedUnsafePtr) but this is not surprising because...
  • IOUtils scheduled a MozPromise against the bare nsThread underlying the worker, rather than the worker's event target, but...
  • IOUtils is quite reasonably using GetCurentSerialEventTarget which falls back to using NS_GetCurrentThread if there's no known serial event target.

I think there's potentially 2 things to be addressed here:

  • It seems like we definitely need some custom behavior for workers for GetCurentSerialEventTarget. In particular, for workers (edited Jan 27 for future clarity) there are lifecycle concerns and our current WorkerThread::Dispatch logic is troublesome and potentially no longer needed because we can hand out something more granular than GetCurentSerialEventTarget. While this would workers-ish specific, I wonder if CycleCollectedJSRuntime might be the right place for the glue. :mccr8, when you address the needinfo, I think this is a question for you.
  • Given that IOUtils is explicitly an implementation of a WebIDL interface maybe it should be using nsIGlobalObject's inherited implementation of DispatcherTrait which provides Dispatch and EventTargetFor.
    • Note that the comments for Dispatch and EventTargetFor quite unhelpfully say "This method may or may not be safe off of the main thread." and we should probably make that more helpful. For workers it's already the case that EventTargetFor returns the worker private's hybrid event target which is the wacky target that wraps runnables in WorkerRunnables until the worker gets canceled and then it starts wrapping them in WorkerControlRunnables. In the future we will almost certainly change the event target so that it starts discarding runnables in non-debug builds and maybe asserts hard in debug builds because it's generally an indication that 1) you wanted a WorkerRef and/or 2) better shutdown logic is needed
    • :smaug, does this make sense for event dispatching that is explicitly about a global?
    • :barret, same question from the IOUtils application domain.
Component: DOM: Events → XPCOM
Flags: needinfo?(brennie)

(I moved this into XPCOM for now because that's where GetCurentSerialEventTarget lives and this is the most direct point at which things are going wrong because of foot-gun issues but as noted in the previous comment, there are potentially multiple fixes to do and where those happen may be in XPConnect and/or workers and/or IOUtils and/or elsewhere.)

The crash volume on Beta remains worryingly-high. I'm worried we're going to see this spike when 97 goes to Release in a little over a week. Anything we can do to work around this in the near future would be appreciated.

I don't know anything about IOUtils, workers or event targets so I don't think I have anything to contribute here. It does look like there have been a fair amount of changes to IOUtils in the last few months.

Flags: needinfo?(continuation)

:asuth I think this makes sense for IOUtils, as long as the original worker is getting the task dispatched back to it.

Flags: needinfo?(brennie)

:nika and I talked on matrix about event targets. I need to still do some conceptual untangling of the relationship of WorkerThread::Dispatch and related logic which pre-dates GetCurentSerialEventTarget, but I think the core decisions relative to GetCurentSerialEventTarget are:

  • WorkerThreadPrimaryRunnable should directly instantiate a SerialEventTargetGuard if it wants something other than its nsIThread exposed. This is not something we want in CycleCollectedJSRuntime because the main thread definitely doesn't want behavior like the worker's for this.
  • This means IOUtils should still be able to use GetCurentSerialEventTarget as it is currently using. The nsIGlobalObject EventTargetFor logic from Quantum DOM isn't something we want to use.
  • Switching to a SerialEventTargetGuard could potentially help eliminate the need for WorkerThread.
  • :nika did some exploratory work in bug 1738103 "to add a way to observe event target shutdown and react to it" which could be relevant if we're trying to be able to shutdown the event target we hand out for workers.

Unfortunately, I think this may still leave us with some lifecycle issues around the DOM Promise. I think IOUtils potentially needs to either hold a (Strong)WorkerRef or we need to ensure that the dom::Promise gets torn down when the global gets torn down. If we're trying to fix this just by having an event target that gets turned off when the worker destroys the global, we'd otherwise have to leak all runnables sent to it, as we can't safely clean up the relevant objects. I'm going to try and look at this a little more.

Flags: needinfo?(bugs)

(moving back to the component identified for IOUtils.cpp by the moz.build files as displayed in searchfox)

Component: XPCOM → DOM: Core & HTML

Preliminary investigation didn't show an easy way to systematically fix this at the DOM Promise level at this time as it's the MozPromise lifecycle that's entraining the DOM Proimise that's problematic. That is, if the DOM Promise was neutralized when the worker changed to canceling, that wouldn't do anything about the MozPromise dispatched Then lambda still trying to run after the worker has gone away.

The worker either needs to live as long as the IOUtils-scheduled task lasts, or IOUtils needs to be cleaning up the task when the worker transitions to canceling. From a simplicity and sanity perspective and given that this is explicitly a System API (and if uplift wants to happen), I think it makes sense to just use a StrongWorkerRef that does nothing when notified and just gets dropped when the Then handler runs. IOUtils already has a shutdown mechanism that shutdown mechanism that prevents issuing new tasks once the shutdown blocker gets tripped which presents a reasonable mechanism to ensure shutdown happens in a timely fashion modulo in-flight requests. Adding complexity in order to try and abort in-progress sync-IO seems like a lot of work and new edge cases that will almost never prevent I/O from happening beyond what the existing shutdown infrastructure already does.

Given that IOUtils already unifies the lifetime through IOUtils::DispatchAndResolve, I think it then makes sense to, in there:

  • Optional: Consider switching to use the single-lambda Then variant that takes ResolveOrRejectValue random example. I think I'm just obsessing over having 2 concrete references to the StrongWorkerRef when conceptually we only need 1, but they are inline refcounted, so it's fine to have 2 references to it.
  • When not on the main thread (ex: web locks doing so), create a StrongWorkerRef with no callback that will get dropped when the promise resolves on the worker thread. random example. If it's possible for the lambda to get released on a different thread somehow, then use a ThreadSafeWorkerRef instead.

:barret If this sounds reasonable, can you work up a fix for this and ask the #dom-workers review group for review for the worker parts (and whoever is appropriate for the IOUtils parts)? Thanks!

Flags: needinfo?(brennie)
Assignee: nobody → brennie
Status: NEW → ASSIGNED
Pushed by brennie@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4484267f0b66 Keep worker threads alive while IOUtils tasks are running r=Gijs,dom-worker-reviewers,asuth
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 98 Branch

Comment on attachment 9261553 [details]
Bug 1743671 - Keep worker threads alive while IOUtils tasks are running r?#dom-workers!,gijs!

Beta/Release Uplift Approval Request

  • User impact if declined: This patch addresses a crash on nightly and beta that his happening with a high frequency. Declining means this crash will start to occur on our release population, which could affect retention
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Unfortunately this patch is not testable at this time due to the complicated nature of causing the crash. However, this patch is simple and only affects the worker threads by keeping them alive slightly longer (so the IOUtils tasks can run to completion without the worker going away) which should not cause any issues.
  • String changes made/needed:
Flags: needinfo?(brennie)
Attachment #9261553 - Flags: approval-mozilla-release?

Thank you everyone for helping this out quickly.

Comment on attachment 9261553 [details]
Bug 1743671 - Keep worker threads alive while IOUtils tasks are running r?#dom-workers!,gijs!

Topcrash fix, results looking good on Nightly so far. Approved for 97.0rc2.

Attachment #9261553 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: