Crash in [@ mozilla::DOMEventTargetHelper::Release]
Categories
(WebExtensions :: Request Handling, defect, P2)
Tracking
(firefox-esr68 wontfix, firefox-esr78 wontfix, firefox72 wontfix, firefox73 wontfix, firefox74 wontfix, firefox87 wontfix, firefox88 wontfix, firefox89 wontfix, firefox90 wontfix)
People
(Reporter: marcia, Assigned: kmag)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file)
(deleted),
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta-
|
Details |
This bug is for crash report bp-d811269a-12d0-4682-b1d1-040d80191220.
Seen while looking at nightly crash stats - Linux crashes started in 20191202220401 and were joined by some macOS crashes shortly thereafter: https://bit.ly/2ZMB2sF. 10 crashes/10 installs so far so rather low volume. There are some Linux and Windows crashes before 73 nightly with the same signature, but I am not sure if they are the same issue.
Possible nightly regression range based on build id: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=8504d70d827261346737af1cbe9b96acf6756b6d&tochange=9420b5dc27e0cdf30a65b1b733400a970b3cb708
Top 9 frames of crashing thread:
0 XUL mozilla::DOMEventTargetHelper::Release dom/events/DOMEventTargetHelper.cpp:87
1 XUL mozilla::UniquePtr<mozilla::extensions:: toolkit/components/extensions/webrequest/ChannelWrapper.cpp:1144
2 libsystem_c.dylib __cxa_finalize_ranges
3 libsystem_c.dylib exit
4 XUL google_breakpad::ExceptionHandler::WaitForMessage toolkit/crashreporter/breakpad-client/mac/handler/exception_handler.cc:631
5 libsystem_pthread.dylib _pthread_body
6 libsystem_pthread.dylib _pthread_start
7 libsystem_pthread.dylib thread_start
8 XUL XUL@0x41e737f
Comment 1•5 years ago
|
||
ChannelWrapper just extends DOMEventTargetHelper
Reporter | ||
Comment 2•5 years ago
|
||
Last crash was in 1-02 build and one crash in 73.0b1. Jim any ideas what might have caused this small volume regression?
Comment 3•5 years ago
|
||
From the crash data reports above, there seems to be a lot of different reasons for a crash on mozilla::DOMEventTargetHelper::Release. Looking at the potential regression range, I don't see anything that sticks out to me.
a) is there a way to limit the crash data report to those including ChannelWrapper?
b) probably will need someone more fluent in c++ than me to look into this, but if I can get a tighter report, I might be able to see something.
Reporter | ||
Comment 4•5 years ago
|
||
There is a small subset of crashes that have ChannelWrapper (https://bit.ly/303QosK), but given the small overall volume on release I don't think we should probably devote that much time to figure out what might be causing them. In terms of the nightly crashes, since I filed this there haven't been any crashes since 1-02. I thought maybe this was a nightly only regression, but it seems as Shane notes in Comment 3 there may be lots of causes for this crash.
Comment 5•5 years ago
|
||
Thanks for that...Another link:
One thing I notice on a random sampling of these, with Linux, the crash seems to happen during a run_exit_handlers call. As well, earlier in the stack for all of those is some webrender stuff.
Updated•5 years ago
|
Comment 6•5 years ago
|
||
The priority flag is not set for this bug.
:jimm, could you have a look please?
For more information, please visit auto_nag documentation.
Updated•5 years ago
|
Comment 7•4 years ago
|
||
Just bumped on this because our new WER-based crash reporting logic picked up this crash on Windows:
https://crash-stats.mozilla.org/report/index/d40073e9-5cec-4b11-9550-ee8500210408
Looking at the stacks on Linux and macOS this always seems to be happening when running exit handlers. This is a shortened stack on Windows:
0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::reset(mozilla::extensions::`anonymous namespace'::ChannelListHolder*)
...
4 execute_onexit_table
...
11 exit_or_terminate_process
And this is Linux:
0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::reset(mozilla::extensions::(anonymous namespace)::ChannelListHolder*)
2 __run_exit_handlers
3 exit
And macOS:
0 mozilla::DOMEventTargetHelper::Release()
1 mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder, mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder> >::~UniquePtr()
2 __cxa_finalize_ranges
3 exit
Comment 8•4 years ago
|
||
The work I've done in bug 1682507 seems to be capturing quite a few crashes on Windows under this signature that we were missing before. I've opened a few minidumps in Visual Studio to analyze the issue and it's probably more complicated than it appears. First of all this is the stack trace as reported by VS:
1. xul.dll!mozilla::dom::Animation::Release() Line 42 C++
2. [Inline Frame] xul.dll!mozilla::extensions::ChannelWrapper::Release() Line 1201 C++
3. [Inline Frame] xul.dll!mozilla::RefPtrTraits<mozilla::extensions::ChannelWrapper>::Release(mozilla::extensions::ChannelWrapper * aPtr) Line 50 C++
4. [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::ConstRemovingRefPtrTraits<mozilla::extensions::ChannelWrapper>::Release(mozilla::extensions::ChannelWrapper * aPtr) Line 381 C++
5. [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::assign_assuming_AddRef(mozilla::extensions::ChannelWrapper * aNewPtr) Line 69 C++
6. [Inline Frame] xul.dll!RefPtr<mozilla::extensions::ChannelWrapper>::operator=(void *) Line 168 C++
7. [Inline Frame] xul.dll!mozilla::extensions::ChannelWrapper::Die() Line 137 C++
8. [Inline Frame] xul.dll!mozilla::extensions::`anonymous namespace'::ChannelListHolder::~ChannelListHolder() Line 95 C++
9. [Inline Frame] xul.dll!mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder>::operator()(mozilla::extensions::`anonymous namespace'::ChannelListHolder *) Line 463 C++
10. xul.dll!mozilla::UniquePtr<mozilla::extensions::(anonymous namespace)::ChannelListHolder,mozilla::DefaultDelete<mozilla::extensions::(anonymous namespace)::ChannelListHolder>>::reset(mozilla::extensions::`anonymous namespace'::ChannelListHolder * aPtr) Line 307 C++
11. [External Code]
The crash reason is a NULL pointer access, but it's not in our code. What's happening is that somehow in the Release()
call we're trying to access thread local storage and the _tls_end
symbol is NULL. To understand what's going on a bit of background on how Windows implements TLS is needed: what's going on is that in every DLL that uses TLS variables those are stored in an area bound by two variables called _tls_start
and _tls_end
. When the DLL is loaded those symbols become available and are used at runtime by the TLS access code. This crash is happening very late during shutdown, somewhere in an atexit()
callback from what I can tell. At that point it's likely that whatever DLL contains the TLS variables we're trying to access (probably XUL) is being unloaded - or has already been unloaded, so the _tls_start
and _tls_end
symbols aren't available anymore and the code trying to access them fails with the reported NULL pointer access.
This might prove tricky to fix: we probably have to avoid accessing TLS variables inside atexit()
handlers. On the upside this crash only happens very, very late during shutdown so it shouldn't be too annoying for users.
Updated•4 years ago
|
Comment 9•4 years ago
|
||
This bug is spiking on nightly 89 the last days
Comment 10•4 years ago
|
||
Change the status for beta to have the same as nightly and release.
For more information, please visit auto_nag documentation.
Comment 11•4 years ago
|
||
(In reply to Pascal Chevrel:pascalc from comment #9)
This bug is spiking on nightly 89 the last days
Note that the spike is "artificial" in that these crashes were already happening but we didn't catch them; they'd go straight to Microsoft health dashboards. What's happening is that we're now seeing their actual volume.
Comment 12•4 years ago
|
||
Thanks Gabriele, this is what I suspected. We should probably revisit the priority of this bug given that our users are crashing a lot more than we thought with this signature.
Updated•4 years ago
|
Comment 13•4 years ago
|
||
(In reply to Gabriele Svelto [:gsvelto] from comment #8)
The crash reason is a NULL pointer access, but it's not in our code. What's happening is that somehow in the
Release()
call we're trying to access thread local storage and the_tls_end
symbol is NULL.
The addref and release methods of cycle collected classes call into TLS (to get the cycle collector object for the current thread). This could be the same underlying issue as the spike in crashes seen with the signature NS_CycleCollectorSuspect3, as that is the method that AddRef and Release call into.
Looking at the types, it seems like we must be destroying sChannelList. Bug 1495748 added some cleanup code that runs at XPCOMShutdown, but maybe that is failing to run for some reason, or maybe new things are being added to the list after shutdown has occurred. Maybe we could just not add things to sChannelList after XPCOMShutdown has occurred somehow? In fact, this whole mechanism seems like something added to make the leak checker happy, so maybe we could simply not use this list at all except in builds where we care about leaks.
Assignee | ||
Comment 14•4 years ago
|
||
If we do, we create a new instance after its ClearOnShutdown
handler has
run, and it gets destroyed from a static destructor after the cycle collector
has shutdown.
Updated•4 years ago
|
Assignee | ||
Comment 15•4 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #13)
In fact, this whole mechanism seems like something added to make the leak checker happy, so maybe we could simply not use this list at all except in builds where we care about leaks.
I think we do need to run this even in cases where we don't care about leaks, since if we don't, the ChannelWrappers will still get destroyed possibly too late in shutdown to be safe in those cases. It won't matter when we do early exits, but when we don't, it could potentially lead to weird crashes.
Comment 16•4 years ago
|
||
I am tracking this bug for 89 as this has a high volume of crashes on nightly.
Comment 17•4 years ago
|
||
Comment 18•4 years ago
|
||
bugherder |
Comment 19•4 years ago
|
||
This grafts cleanly as far back as ESR78 also. Under the assumption that the frequency is likely being under-reported there due to the aforementioned WER spike, I'm thinking we probably want to uplift this fix there too.
Assignee | ||
Comment 20•4 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #19)
This grafts cleanly as far back as ESR78 also. Under the assumption that the frequency is likely being under-reported there due to the aforementioned WER spike, I'm thinking we probably want to uplift this fix there too.
The PastShutdownPhase
function doesn't exist before 83.
Comment 21•4 years ago
|
||
Well that's certainly a problem :). Guess we'll live with it until ESR91 then!
Assignee | ||
Comment 22•4 years ago
|
||
Comment on attachment 9216484 [details]
Bug 1606347: Don't create a new ChannelWrapper list late in shutdown. r=mccr8
Beta/Release Uplift Approval Request
- User impact if declined: Shutdown crashes
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Adds a simple check so we don't reinstantiate things during shutdown.
- String changes made/needed:
Comment 23•4 years ago
|
||
Doug, are you working on the fast shutdown stuff? Is it expected that we're running static destructors like this on Nightly?
Comment 24•4 years ago
|
||
Crash volume is unchanged in builds with the fix.
Comment 25•4 years ago
|
||
This issue is old enough that it isn't really a regression at this point. The surge on crash-stats is just due to increased reporting.
Maybe we need to move ahead with a solution that makes the cycle collector suspect not crash when called after CC shutdown. It would be good if it still continued to crash when called from a thread where the CC was never started (before shutdown).
Comment 26•4 years ago
|
||
Comment on attachment 9216484 [details]
Bug 1606347: Don't create a new ChannelWrapper list late in shutdown. r=mccr8
No change on nightly crashes so no need to uplift that patch at the moment, thanks
Updated•4 years ago
|
Assignee | ||
Updated•3 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Updated•2 years ago
|
Description
•