Crash in [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging]. Shutdown problem in workers.
Categories
(Core :: DOM: Workers, defect, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr52 | --- | wontfix |
firefox-esr60 | --- | wontfix |
firefox-esr78 | --- | wontfix |
firefox59 | --- | wontfix |
firefox60 | --- | wontfix |
firefox61 | --- | wontfix |
firefox62 | --- | wontfix |
firefox63 | - | wontfix |
firefox82 | --- | wontfix |
firefox83 | --- | wontfix |
firefox84 | --- | wontfix |
firefox85 | --- | wontfix |
firefox86 | --- | fix-optional |
People
(Reporter: mccr8, Assigned: jstutte)
References
(Depends on 3 open bugs, Blocks 2 open bugs)
Details
(5 keywords, Whiteboard: [DWS_NEXT][stockwell unknown][tbird topcrash],qa-not-actionable)
Crash Data
Attachments
(1 obsolete file)
Comment 2•7 years ago
|
||
Comment 3•7 years ago
|
||
Comment 4•7 years ago
|
||
Updated•7 years ago
|
Comment 5•7 years ago
|
||
Updated•7 years ago
|
Comment 6•7 years ago
|
||
Comment 7•7 years ago
|
||
Comment 8•7 years ago
|
||
Comment 9•7 years ago
|
||
Comment 10•7 years ago
|
||
Comment 11•7 years ago
|
||
Updated•7 years ago
|
Updated•7 years ago
|
Comment 12•7 years ago
|
||
Comment 13•7 years ago
|
||
Comment 15•7 years ago
|
||
Comment 16•7 years ago
|
||
Comment 17•7 years ago
|
||
Comment 18•7 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Comment 20•7 years ago
|
||
Updated•7 years ago
|
Updated•7 years ago
|
Comment 22•7 years ago
|
||
Updated•7 years ago
|
Comment 23•7 years ago
|
||
Comment 24•7 years ago
|
||
Updated•7 years ago
|
Comment 27•7 years ago
|
||
Comment 28•6 years ago
|
||
Comment 30•6 years ago
|
||
Comment 32•6 years ago
|
||
Updated•6 years ago
|
Updated•6 years ago
|
Comment hidden (off-topic) |
Comment 36•6 years ago
|
||
Updated•6 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (off-topic) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (off-topic) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (off-topic) |
Updated•5 years ago
|
Assignee | ||
Comment 68•5 years ago
|
||
I want to understand better, what all these signatures are about, so ni to myself.
Reporter | ||
Comment 69•5 years ago
|
||
[@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ] is the number one crash signature for the April 15 Linux Nightlies. I don't know if somebody was just having a really bad day or what.
Reporter | ||
Updated•5 years ago
|
Assignee | ||
Comment 70•5 years ago
|
||
Digging a bit into the signatures, I see three main buckets of signatures:
-
Crashes at arbitrary places caused by MOZ_CRASH("Shutdown hanging before starting."); or by MOZ_CRASH("Shutdown too long, probably frozen, causing a crash.");
1.1 shutdownhang | nsThread::Shutdown | mozilla::net::nsSocketTransportService::ShutdownThread
Happens mostly on 68.x but has some occurrences also on 75.0.1.2 shutdownhang | mozilla::net::ShutdownEvent::PostAndWait
Here 75.0 and 68.x are dominant, but the total volume is one order of magnitude lower than 1.1.1.3 shutdownhang | mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
Here we have only versions up to 68.x. and mostly ESR. Volume is about half of 1.1.1.4 shutdownhang | mozilla::net::nsHttpConnectionMgr::Shutdown
Here 75 and 52.9.0esr dominate the ranks. I assume this to be just a variant of 1.3.1.5 shutdownhang | mozilla::dom::workers::RuntimeService::Cleanup
Happens only on versions up to 52.9.0esr and can be safely ignored.All these crashes seem not really worker related to me (or I am overlooking something not evident), at least we do not know, what caused the hang.
-
Crashes with worker specific "Workers Hanging ..." MOZ_CRASH messages
2.1. mozilla::dom::workerinternals::RuntimeService::CrashIfHanging
(This signature has been added twice, it seems.)
Here we have a collection of many different (but similar) MOZ_CRASH reasons. I assume, they reflect the evolution of those messages through the different versions (as we can see versions back to 60.2.0esr here).
The 4 top scorers (making together more than 60%) are:
1 Workers Hanging - 1|A:1|S:0|Q:0-BC:1|WorkerDebuggeeRunnable::mSender 393 21.30 %
2 Workers Hanging - 1|A:1|S:0|Q:0-BC:0Dispatch Error 283 15.34 %
3 Workers Hanging - 1|A:3|S:0|Q:0-BC:0Dispatch Error-BC:0Dispatch Error-BC:0Dispatch Error 282 15.28 %
4 Workers Hanging - 1|A:1|S:0|Q:0-BC:1|IDBOpenDBRequest 165 8.94 %
5 Workers Hanging - 1|A:2|S:0|Q:0-BC:0Dispatch Error-BC:0Dispatch Error 141 7.64 %
. These are the cases to care (most) about in this bug, I think. It would be interesting to relate the different messages to the versions we see in order to narrow down similar causes.
. 2.2 mozilla::dom::workers::RuntimeService::CrashIfHanging
This signature happens on very old, unsupported versions only and can be safely ignored.
-
Signatures without MOZ_CRASH message at all
3.1 mozilla::net::CacheFileIOManager::SyncRemoveDir
Has a very low but recent volume. I think this signature deserves a bug on its own.3.2 shutdownhang | static bool mozilla::SpinEventLoopUntil<T> | mozilla::net::nsHttpConnectionMgr::Shutdown
Has no occurrences at all in our data and can be removed from the signatures.
:mccr8, do I read well in the crash data and can we adjust a bit the signatures relevant for this bug?
(edit: it seems I am unable to format this well - hope it works anyway)
Assignee | ||
Comment 71•5 years ago
|
||
Looking into case 2.1. for "Dispatch Error" messages:
The "Dispatch Error" message is constructed here if and only if the Dispatch() returns false.
The first occasion to fail is the call to PreDispatch(mWorkerPrivate), which is a virtual function with many overrides.
Most implementations of that function just return true (some of them doing AssertIsOnMainThread(); some not) but there are four, that do more:
EventRunnable::PreDispatch
Despite its length, it returns always true (if it does not crash). So probably not relevant here.
WorkerDebuggeeRunnable::PreDispatch
Has a special behavior in case of ParentThreadUnchangedBusyCount
, which can lead to false responses. This smells, as the busy count might be involved in determining pending workers? Interestinglee, WorkerDebuggeeRunnable
has its own shutdown hang messages, too.
WorkerRunnable::PreDispatch
Here we have a special behavior in case of WorkerThreadModifyBusyCount
which returns the result of aWorkerPrivate->ModifyBusyCount(true);
. Again this smells.
NotifyRunnable::PreDispatch
Here we always return the result of aWorkerPrivate->ModifyBusyCount(true);
. This smells even more with respect to the previous case?
Andrew, it might be just a gut feeling (ignoring the details of that code), but my impression is, that a PreDispatch returning false might provoke a shutdown hang (not) manipulating the BC correctly (in some cases)?
Assignee | ||
Comment 72•5 years ago
|
||
BTW, I made a sheet with the "Workers Hanging ..." messages for 75.0. Note the very long messages for some WorkerDebuggeeRunnable::mSender
which are caused by many WorkerRefs in WorkerPrivate. This looks suspicious, too.
Reporter | ||
Comment 73•5 years ago
|
||
I don't know anything about worker shutdown, but your analysis sounds reasonable to me. It looks like baku fixed a bunch of issues back in 2018 when this was first filed, so it would make sense that some of the signatures might not be happening in recent versions. I'm not sure why he added the HTTP connection manager signatures to this bug.
Assignee | ||
Comment 74•5 years ago
|
||
Removed all signatures but case 2.1 from this bug, created bug 1633342 to collect the other (probably net related) signatures (and dropped the signatures with cases for unsupported versions only).
Assignee | ||
Comment 75•5 years ago
|
||
Moved the single dependencies to bug 1633342. Still I am not sure, if all these dependencies are real.
Assignee | ||
Updated•5 years ago
|
Assignee | ||
Comment 76•5 years ago
|
||
Not sure why bugzilla switched those dependencies.
Assignee | ||
Updated•5 years ago
|
Comment 77•5 years ago
|
||
Expanding on comment 71, the relevant Runnable
is a CrashIfHangingRunnable
, whose PreDispatch
always returns true
, so that can't be the source of the failure. Going one step further into the relevant DispatchInternal
, WorkerPrivate::DispatchControlRunnable
is getting called. This function only fails if the worker's status is Dead
, so that seems to be the source of the problem. This failure is possible because there's a race condition between a worker's status turning Dead
and its removal from RuntimeService::mDomainMap
. If we consider the simple case with a single worker, its removal happens here, and gets scheduled from here.
Long story short, it looks like the worker isn't really hanging, but rather the main thread doesn't run the Runnable
that removes the record of the worker's existence. If we take a look at one of the relevant reports, we can see that the DOM Worker thread is indeed idle. It might be possible to reduce the number of reports that falsely attribute the hang to workers by removing the worker's entry from RuntimeService::mDomainMap
at the same time its status changes to Dead
, I have to look into it.
Updated•5 years ago
|
Updated•5 years ago
|
Assignee | ||
Comment 78•4 years ago
|
||
Shall we expect then with the fix from bug 1636147, that all the crashes with DispatchError
in the message (around 70%) would go away, leaving around only the ones with WorkerDebuggeRunnable::mSender
? That would be a great reduction!
Comment 79•4 years ago
|
||
Yes, that's the idea. Those messages indicate that the main thread is hanging, not the workers.
Comment hidden (off-topic) |
Assignee | ||
Comment 81•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #72)
BTW, I made a sheet with the "Workers Hanging ..." messages for 75.0. Note the very long messages for some
WorkerDebuggeeRunnable::mSender
which are caused by many WorkerRefs in WorkerPrivate. This looks suspicious, too.
(In reply to Jens Stutte [:jstutte] from comment #78)
Shall we expect then with the fix from bug 1636147, that all the crashes with
DispatchError
in the message (around 70%) would go away, leaving around only the ones withWorkerDebuggeRunnable::mSender
? That would be a great reduction!
It seems, that the so-far remaining cases on 78 are all carrying WorkerDebuggeeRunnable::mSender
messages, which did not go away as predicted by Yaron and for which we have not yet a clear understanding, what is causing them.
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Assignee | ||
Updated•4 years ago
|
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 83•4 years ago
|
||
Looking at the first new crashes coming in from beta 83 with enhanced reporting as of bug 1664386 I see:
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
It may be early to say this definitely, but it seems, that the original assumption that we have chrome workers blocking us, is false (if we can trust the result of IsChromeWorker()
).
As of bug 1664386 comment 1, this means that:
a) the (single) shutdown timeout has been reached by the RunWatchdog (active only on the parent process)
b) the shutdown steps were completed (sShutdownNotified == true)
c) there is a worker associated to any domain which is still able to receive runnables (and to respond!)
d) the blocking worker is not (necessarily) a chrome worker
Asuth, Yaron, are we aware of any non-chrome worker that may run in the parent process?
Updated•4 years ago
|
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•4 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•4 years ago
|
Updated•4 years ago
|
Comment 88•3 years ago
|
||
I crashed with Thunderbird 90.0b3 on Mac during shutdown - not password related
bp-3ca8e432-e5b3-4c41-b9e3-2ba500210701
0 XUL mozilla::dom::workerinternals::RuntimeService::CrashIfHanging() dom/workers/RuntimeService.cpp:1708 context
1 XUL mozilla::(anonymous namespace)::RunWatchdog(void*) toolkit/components/terminator/nsTerminator.cpp:230 scan
2 libnss3.dylib _pt_root nsprpub/pr/src/pthreads/ptthread.c:201 scan
3 libsystem_pthread.dylib _pthread_start scan
4 libsystem_pthread.dylib thread_start scan
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 90•3 years ago
|
||
The variant with () is not happening any more.
Reporter | ||
Comment 91•3 years ago
|
||
FWIW, the () now get removed on crash stats, so they'll never show up in signatures.
Updated•3 years ago
|
Reporter | ||
Comment 92•3 years ago
|
||
I filed bugbug issues on it reverting a change and on it generating junk signatures.
https://github.com/mozilla/bugbug/issues/2540
https://github.com/mozilla/bugbug/issues/2541
Comment 93•3 years ago
|
||
IIRC the bot is just grabbing the signatures from the duplicate bugs.
Assignee | ||
Comment 94•3 years ago
|
||
I adjusted the signature in the other bug accordingly, too.
Assignee | ||
Updated•3 years ago
|
Comment 95•3 years ago
|
||
In this case it's not a bugbug-based change, so I moved the issues to the relman-auto-nag repository.
(In reply to Julien Cristau [:jcristau] from comment #93)
IIRC the bot is just grabbing the signatures from the duplicate bugs.
Yes, exactly. Let's discuss it in the issues.
Reporter | ||
Comment 96•3 years ago
|
||
I removed the () from the dupes, and filed a bug on the TreeHerder intermittent filer, as I think that's actually where the () in signatures are from. Sorry for my confusion! I forgot about the duplicate bug signature thing.
Assignee | ||
Comment 97•3 years ago
|
||
FWIW, in the most frequent case we still see any variation of recursion depth for:
Workers Hanging - 1|A:1|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 98•3 years ago
|
||
Adjusting severity/priority based on the frequency.
Comment 99•2 years ago
|
||
The severity field for this bug is set to S3. However, the bug has the topcrash
keyword.
:jstutte, could you consider increasing the severity of this top-crash bug? If the crash isn't "top" anymore, could you drop the topcrash
keyword?
For more information, please visit auto_nag documentation.
Assignee | ||
Comment 100•2 years ago
|
||
This is a top-crash only for Thunderbird, it seems. There it spiked up from version 91.7, it seems.
I am not sure what this would mean then for our severity/priority here.
Assignee | ||
Comment 101•2 years ago
|
||
So looking a bit at some crashes, we (still) mostly seem to have a problem with the WorkerDebuggeeRunnable
here.
Looking at https://searchfox.org/mozilla-central/rev/da6a85e615827d353e5ca0e05770d8d346b761a9/dom/workers/WorkerPrivate.h#1245 I am wondering if we just never get a chance to execute the debugee runnable, being this a throttled event queue that targets the main thread (which is probably busy all the time during shutdown).
I am wondering if it is really a good idea to carry away a ThreadSafeWorkerRef
here (which is a StrongWorkerRef
) and if it wouldn't be better to downgrade this to a WeakWorkerRef
?
Comment 102•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #100)
This is a top-crash only for Thunderbird, it seems. There it spiked up from version 91.7, it seems.
Indeed it is #3 crash for Thunderbird. I had marked up bug 1435961 for this.
The spike is false - it is the result of Thunderbird not having crash reporting on crash-stats from Nov 2021 to April 2022.
I am not sure what this would mean then for our severity/priority here.
I am wondering if it is really a good idea to carry away a ThreadSafeWorkerRef here (which is a StrongWorkerRef) and if it wouldn't be better to downgrade this to a WeakWorkerRef ?
Any idea if this would help Thunderbird crashes?
Assignee | ||
Comment 103•2 years ago
|
||
The sole purpose of this mSender
worker ref seems to be to keep the worker alive, as there is no real use of that variable. Downgrading it to a weak worker ref would be equivalent to removing it, at this point.
What is not clear to me is why we think we need to keep the worker alive? I assume we want to be sure to have still a living worker when we execute the runnable on the main thread (if we are on the worker thread, we are surely alive).
Looking at the sub-classes of WorkerDebuggeeRunnable
it seems:
ServiceWorkerOpRunnable
is only meant to run on the worker thread.CheckScriptEvaluationWithCallback
is only meant to run on the worker thread.CompileScriptRunnable
is only meant to run on the worker thread.
Instead:
MessageEventRunnable
can be run on the main thread.ReportErrorRunnable
can be run on the main thread.ReportGenericErrorRunnable
can be run on the main threadCancelingOnParentRunnable
can be run on the main thread.
I assume if we move to a weak worker ref, those runnables should check the worker ref before doing anything with the WorkerPrivate*
?
Assignee | ||
Comment 104•2 years ago
|
||
thunderbird |
(In reply to Wayne Mery (:wsmwk) from comment #102)
Any idea if this would help Thunderbird crashes?
Well, the huge difference in numbers here seem to indicate, that Thunderbird's main thread loop is too busy to ever let the RefPtr<ThrottledEventQueue> mMainThreadDebuggeeEventTarget;
event queue execute their events on the main thread. So apparently there is an issue on Thunderbird's side with being too busy on the main thread with whatever during worker shutdown.
But if we ensure that the worker can go away without harm before our WorkerDebuggeeRunnable
ever executes on the main thread, that could definitely help, both Firefox & Thunderbird. But I need some more expertise from Eden here if this is not going to brake other things.
Assignee | ||
Comment 105•2 years ago
|
||
Updated•2 years ago
|
Assignee | ||
Comment 106•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #104)
(In reply to Wayne Mery (:wsmwk) from comment #102)
Any idea if this would help Thunderbird crashes?
Well, the huge difference in numbers here seem to indicate, that Thunderbird's main thread loop is too busy to ever let the
RefPtr<ThrottledEventQueue> mMainThreadDebuggeeEventTarget;
event queue execute their events on the main thread. So apparently there is an issue on Thunderbird's side with being too busy on the main thread with whatever during worker shutdown.But if we ensure that the worker can go away without harm before our
WorkerDebuggeeRunnable
ever executes on the main thread, that could definitely help, both Firefox & Thunderbird.
To be clear: The patch here would not make block the worker shutdown by those runnables. Still on Thunderbird something frequently seems to prevent those runnables from ever being executed in time. Whatever this is, it still might cause a different flavor of hang after this patch lands. I would thus not be too optimistic that this patch fights those hangs, but it might help to have better diagnostic.
Canceling the ni? as I asked for review on the patch.
Assignee | ||
Comment 107•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #103)
The sole purpose of this
mSender
worker ref seems to be to keep the worker alive, as there is no real use of that variable. Downgrading it to a weak worker ref would be equivalent to removing it, at this point.What is not clear to me is why we think we need to keep the worker alive? I assume we want to be sure to have still a living worker when we execute the runnable on the main thread (if we are on the worker thread, we are surely alive).
Looking at the sub-classes of
WorkerDebuggeeRunnable
it seems:
ServiceWorkerOpRunnable
is only meant to run on the worker thread.CheckScriptEvaluationWithCallback
is only meant to run on the worker thread.CompileScriptRunnable
is only meant to run on the worker thread.Instead:
MessageEventRunnable
can be run on the main thread.ReportErrorRunnable
can be run on the main thread.ReportGenericErrorRunnable
can be run on the main threadCancelingOnParentRunnable
can be run on the main thread.I assume if we move to a weak worker ref, those runnables should check the worker ref before doing anything with the
WorkerPrivate*
?
So I fear things are a bit more complicated at least for the MessageEventRunnable
. If a worker dispatches this kind of event, we must make sure it arrives somewhere. It seems as if keeping the worker alive was kind of a trick to ensure this. Actually I think the MessageEventRunnable
should extract all needed information from the worker on dispatch in order to be able to deliver the event even when the worker ended?
Comment 108•2 years ago
|
||
Comment on attachment 9276716 [details]
Bug 1435343: Use a weak worker reference for WorkerDebuggeeRunnable. r?#dom-worker-reviewers
Revision D146447 was moved to bug 1769913. Setting attachment 9276716 [details] to obsolete.
Comment 110•2 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #107)
(In reply to Jens Stutte [:jstutte] from comment #103)
The sole purpose of this
mSender
worker ref seems to be to keep the worker alive, as there is no real use of that variable. Downgrading it to a weak worker ref would be equivalent to removing it, at this point.What is not clear to me is why we think we need to keep the worker alive? I assume we want to be sure to have still a living worker when we execute the runnable on the main thread (if we are on the worker thread, we are surely alive).
Looking at the sub-classes of
WorkerDebuggeeRunnable
it seems:
ServiceWorkerOpRunnable
is only meant to run on the worker thread.CheckScriptEvaluationWithCallback
is only meant to run on the worker thread.CompileScriptRunnable
is only meant to run on the worker thread.Instead:
MessageEventRunnable
can be run on the main thread.ReportErrorRunnable
can be run on the main thread.ReportGenericErrorRunnable
can be run on the main threadCancelingOnParentRunnable
can be run on the main thread.I assume if we move to a weak worker ref, those runnables should check the worker ref before doing anything with the
WorkerPrivate*
?So I fear things are a bit more complicated at least for the
MessageEventRunnable
. If a worker dispatches this kind of event, we must make sure it arrives somewhere. It seems as if keeping the worker alive was kind of a trick to ensure this. Actually I think theMessageEventRunnable
should extract all needed information from the worker on dispatch in order to be able to deliver the event even when the worker ended?
Maybe having some sort of interim "check-in" thread between those two would work. Just thinking out loud here. It seems there are a lot of things happening, and they need to happen in a more organized manner overall.
Assignee | ||
Comment 112•2 years ago
|
||
(In reply to Worcester12345 from comment #110)
Maybe having some sort of interim "check-in" thread between those two would work. Just thinking out loud here. It seems there are a lot of things happening, and they need to happen in a more organized manner overall.
Not sure I get the idea here.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 114•2 years ago
|
||
(In reply to Intermittent Failures Robot from comment #113)
For more details, see:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1435343&startday=2022-09-05&endday=2022-09-11&tree=all
The things I see here here do not really seem to be related to the original bug?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 119•2 years ago
|
||
TB 102.3.2 (32bit) on Windows 10 64bit cashed after first closing TB main window and after that sending composed mail from open mail-editor window. (Master-Password active)
bp-7e21d996-0b93-4f56-8904-742c00221013
Thunderbird 102.3.2 Crash Report [@ mozilla::dom::workerinternals::RuntimeService::CrashIfHanging ]
MOZ_CRASH Reason (Sanitized)
Workers Hanging - 1|A:6|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Crashing Thread (40), Name: Shutdown Hang Terminator
Frame Module Signature Source Trust
0 xul.dll mozilla::dom::workerinternals::RuntimeService::CrashIfHanging() dom/workers/RuntimeService.cpp:1603 context
1 xul.dll mozilla::`anonymous namespace'::RunWatchdog(void*) toolkit/components/terminator/nsTerminator.cpp:232 cfi
2 nss3.dll _PR_NativeRunThread(void*) nsprpub/pr/src/threads/combined/pruthr.c:399 cfi
3 nss3.dll pr_root(void*) nsprpub/pr/src/md/windows/w95thred.c:139 cfi
4 ucrtbase.dll thread_start<unsigned int (__stdcall*)(void*), 1> cfi
5 kernel32.dll BaseThreadInitThunk cfi
6 mozglue.dll patched_BaseThreadInitThunk(int, void*, void*) toolkit/xre/dllservices/mozglue/WindowsDllBlocklist.cpp:572 cfi
7 ntdll.dll __RtlUserThreadStart cfi
8 ntdll.dll _RtlUserThreadStart cfi
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 129•2 years ago
|
||
The remaining intermittent failure instances seem unrelated to workers. See bug 1805147, for example.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 137•1 year ago
|
||
Is https://crash-stats.mozilla.org/report/index/6ec67d36-0c11-4739-a2e4-6025b0230322 an example** of this bug, where the Thunderbird user was locked out by something password related, perhaps too many failed password attempts.
**Crash reason is listed as
Workers Hanging - 1|A:2|S:0|Q:0-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender-BC:1IsChromeWorker(false)|WorkerDebuggeeRunnable::mSender|WorkerDebuggeeRunnable::mSender
Also https://crash-stats.mozilla.org/report/index/fe8b8258-7db3-4c48-9244-d9e3c0230323
Or should we be filing these as Thunderbird bugs?
Comment 138•1 year ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #137)
Is https://crash-stats.mozilla.org/report/index/6ec67d36-0c11-4739-a2e4-6025b0230322 an example** of this bug, where the Thunderbird user was locked out by something password related, perhaps too many failed password attempts.
Also https://crash-stats.mozilla.org/report/index/fe8b8258-7db3-4c48-9244-d9e3c0230323
Or should we be filing these as Thunderbird bugs?
Meta for a sec: This bug exists primarily to track the Firefox shutdown hangs attributed to workers by way of the associated crash signature. These hangs are potentially attributable to:
- Bugs in the core worker implementation. (Frequently technical debt related.)
- Bugs in Web APIs exposed on workers.
- Bugs in system JS code using workers, potentially involving a failure to pay attention to shutdown phases.
- Bugs in system code related to shutdown phases, such as failing to generate appropriate shutdown phases or tear down content globals, etc.
For the specific crashes you identify above, it's very likely that bug 1800659 will address the (type 1) problem. Unfortunately, that's only going to be landing in v116 and it's unlikely we'll be able to uplift[1], which is very unfortunate for Thunderbird's model of building against ESR (and the Firefox ESR itself).
Because of Thunderbird building against ESR and because I think it's potentially difficult to distinguish type 3 and type 4 problems that are specific to Thunderbird until after performing a potentially detailed investigation, it's likely appropriate to file distinct Thunderbird bugs which can be marked as depending on platform bugs as appropriate.
For TB built against m-c trunk/beta/release the calculus changes a bit because type 1 and type 2 problems are more likely to be timely. However, I think it could still make sense to file distinct TB bugs because type 3 and type 4 factors are still potentially so significant and in the event we get users commenting on the bug, the potential for confusion goes up. Also, this simplifies bug prioritization since the product impacts may vary.
For the filed TB bugs where the TB team would like input from the workers team the best practices would probably be:
- Try and make sure the bug is shovel ready by including:
- If available, the Workers Hanging string (as you've done above, thank you!). This should be available in crash reports (as protected data that is okay to report in bugs because we explicitly do not include any origin data, although any propagating the information should of course confirm there is nothing potentially identifying before pasting), and in debug builds where MOZ_ReportCrash does a printf.
- If stdout/wherever MOZ_LOG would go if enabled (which may be to MOZ_LOG_FILE) is available, any of the worker state information added in https://phabricator.services.mozilla.com/D173430 which is automatically emitted as MOZ_LOG category "WorkerShutdownDump" which gets temporarily force-enabled. The output looks like https://bugzilla.mozilla.org/show_bug.cgi?id=1805613#c83
- Links to any Thunderbird documentation about:
- Its shutdown phases for content and system/app logic
- Its use of workers for system/app logic. This should also include mention of any subsystems that might have previously been used in Firefox and m-c but which are no longer used (or maybe even present) in Firefox/m-c but have been forked into TB, etc.
- Any context about extensions indicated by the crash-stats which might use the TB extension experiments mechanism that allows extensions to do all the legacy add-on stuff that Firefox is able to assume is no longer possible. My concern here would be add-ons that are creating workers and are not aware of shutdown phases as opposed to things the add-on would be doing in the worker since XPConnect is not exposed to workers and "ctypes" usage should show up in crash stacks (if active at the time of the crash).
- Ask about the bug in the Workers & Storage chat.mozilla.org channel, pasting the bug link there. The rationale is that there really isn't 1 right person to answer questions about the factors that might contribute to shutdown hangs, especially as type 2 issues will be something that the core worker peers won't necessarily be directly aware of.
- That said, it might make sense for the TB team to designate a dev as the "worker liaison"/similar and so anyone triaging TB crashes/bugs in this space could needinfo the relevant TB dev.
1: It's likely that the bug 1800659 fixes will be a massive improvement for these specific hangs and would be appropriate for uplift on its own, but it also:
- Represents a major shift in worker behavior that potentially will result in a number of fixes in other components and this would increase uplift risk because those fixes may result in their own cascade of fixes which could intertwine with new functionality, etc.
- Is expected to be followed-up by a number of other worker technical debt paydown refactorings for which it's also not clear we could uplift. And arguably it would be better to leave ESR in the pre-bug 1800659 state that has been the equilibrium state for a long time rather than having ESR have a temporary intermediate equilibrium that might only exist in Fx116 or maybe not exist in any shipped Firefox if we land more refactorings in 116 that we definitely don't want to risk uplifting (I have a few of these...).
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•1 year ago
|
Comment 141•1 year ago
|
||
(In reply to Andrew Sutherland [:asuth] (he/him) from comment #138)
...
1: It's likely that the bug 1800659 fixes will be a massive improvement for these specific hangs and would be appropriate for uplift on its own, but it also:
- Represents a major shift in worker behavior that potentially will result in a number of fixes in other components and this would increase uplift risk because those fixes may result in their own cascade of fixes which could intertwine with new functionality, etc.
- Is expected to be followed-up by a number of other worker technical debt paydown refactorings for which it's also not clear we could uplift. And arguably it would be better to leave ESR in the pre-bug 1800659 state that has been the equilibrium state for a long time rather than having ESR have a temporary intermediate equilibrium that might only exist in Fx116 or maybe not exist in any shipped Firefox if we land more refactorings in 116 that we definitely don't want to risk uplifting (I have a few of these...).
Thanks for that info. So indeed bug 1800659 is on version 116 and not back ported to esr115
Description
•