(Fixed) intermittent signature for nullptr deref in RemoteWorkerChild::SharedWorkerOp::MaybeStart
Categories
(Core :: DOM: Workers, defect, P3)
Tracking
()
People
(Reporter: intermittent-bug-filer, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:backout][stockwell unknown])
Crash Data
Filed by: aiakab [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=291610092&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SqjrRmV-S6anOLaOrFRQag/runs/0/artifacts/public/logs/live_backing.log
[task 2020-03-04T11:24:37.041Z] 11:24:37 INFO - TEST-START | dom/workers/test/test_bug978260.html
[task 2020-03-04T11:24:37.041Z] 11:24:37 INFO - GECKO(2335) | [Parent 2335, IPDL Background] WARNING: Unknown MessagePort in ForceClose(): file /builds/worker/workspace/build/src/dom/messagechannel/MessagePortService.cpp, line 392
[task 2020-03-04T11:24:37.041Z] 11:24:37 INFO - GECKO(2335) | [Parent 2335, IPDL Background] WARNING: Unknown MessagePort in ForceClose(): file /builds/worker/workspace/build/src/dom/messagechannel/MessagePortService.cpp, line 392
[task 2020-03-04T11:24:37.079Z] 11:24:37 INFO - GECKO(2335) | Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #01: mozilla::dom::RemoteWorkerChild::SharedWorkerOp::Exec(mozilla::dom::SelfHolder&) [dom/workers/remoteworkers/RemoteWorkerChild.cpp:896]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #02: mozilla::detail::RunnableFunction<mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart(mozilla::dom::RemoteWorkerChild*, mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending, mozilla::dom::RemoteWorkerChild::Running, mozilla::dom::RemoteWorkerChild::PendingTerminated, mozilla::dom::RemoteWorkerChild::Terminated>&)::'lambda'()>::Run() [xpcom/threads/nsThreadUtils.h:558]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #03: mozilla::SchedulerGroup::Runnable::Run() [xpcom/threads/SchedulerGroup.cpp:282]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #04: nsThread::ProcessNextEvent(bool, bool*) [mfbt/RefPtr.h:314]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #05: NS_ProcessNextEvent(nsIThread*, bool) [xpcom/threads/nsThreadUtils.cpp:481]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #06: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:87]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #07: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:291]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #08: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:139]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #09: nsAppShell::Run() [widget/cocoa/nsAppShell.mm:705]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #10: XRE_RunAppShell() [toolkit/xre/nsEmbedFunctions.cpp:926]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #11: mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:237]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #12: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:291]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #13: XRE_InitChildProcess(int, char**, XREChildData const*) [toolkit/xre/nsEmbedFunctions.cpp:765]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | #14: plugin-container + 0xf0b
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/ipc/glue/ProtocolUtils.cpp, line 477
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x370131,name=PContent::Msg_DetachBrowsingContext) Channel error: cannot send/recv
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/ipc/glue/ProtocolUtils.cpp, line 477
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/ipc/glue/ProtocolUtils.cpp, line 477
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/ipc/ProtocolUtils.h, line 294
[task 2020-03-04T11:25:00.379Z] 11:25:00 ERROR - GECKO(2335) | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: '!inner', file /builds/worker/workspace/build/src/dom/ipc/JSWindowActorService.cpp, line 182
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Child 2336, Main Thread] WARNING: nsAppShell::Exit() called redundantly: file /builds/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 725
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/worker/workspace/build/src/netwerk/cache/nsCacheService.cpp, line 855
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/worker/workspace/build/src/netwerk/cache/nsCacheService.cpp, line 818
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/worker/workspace/build/src/netwerk/cache/nsApplicationCacheService.cpp, line 161
[task 2020-03-04T11:25:00.380Z] 11:25:00 INFO - GECKO(2335) | [Child 2336, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/workspace/build/src/xpcom/base/nsCycleCollector.cpp, line 3352
[task 2020-03-04T11:25:00.380Z] 11:25:00 INFO - GECKO(2335) | nsStringStats
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 4•5 years ago
|
||
So we have a mWorkerPrivate nullptr access. I assume this is related to bug 1539508.
Comment hidden (Intermittent Failures Robot) |
Comment 6•5 years ago
|
||
(In reply to Intermittent Failures Robot from comment #5)
1 failures in 4528 pushes (0.0 failures/push) were associated with this bug in the last 7 days.
Repository breakdown:
- try: 1
Platform breakdown:
- linux1804-64: 1
For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1619923&startday=2020-04-13&endday=2020-04-19&tree=all
This happens in a totally different code here. It seems, that host can be null here and we do not deal with it (or do not expect this to be possible). Ryan, you introduced that host variable not too long ago?
Comment hidden (Intermittent Failures Robot) |
Updated•5 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 10•5 years ago
|
||
The 7th of May spike seems to have been fixed by https://hg.mozilla.org/integration/autoland/rev/3523714afacddd8e35b603a862229935c3139fdb
Comment 11•5 years ago
|
||
I have corrected the classifications, all the failures were fixed by commit:
https://hg.mozilla.org/integration/autoland/rev/ad5be7cecd83b9de5fd520ad0e19de7ea44b1466
Comment hidden (Intermittent Failures Robot) |
Comment 13•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #6)
(In reply to Intermittent Failures Robot from comment #5)
1 failures in 4528 pushes (0.0 failures/push) were associated with this bug in the last 7 days.
Repository breakdown:
- try: 1
Platform breakdown:
- linux1804-64: 1
For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1619923&startday=2020-04-13&endday=2020-04-19&tree=allThis happens in a totally different code here. It seems, that host can be null here and we do not deal with it (or do not expect this to be possible). Ryan, you introduced that host variable not too long ago?
No, if you look at the diff you can see that this was part of a refactoring and previously this variable was called browserParent. It's possible the refactoring caused this error to be possible, but my guess would be that this is pre-existing or caused by something else.
Comment hidden (Intermittent Failures Robot) |
Comment 15•4 years ago
|
||
I see three different origins, only two of them worker related, one not:
Linux + Windows
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=302715437&repo=mozilla-central&lineNumber=5979
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=301979097&repo=autoland&lineNumber=5637
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298836794&repo=autoland&lineNumber=12451
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293473260&repo=autoland&lineNumber=7169
Linux stack start: mozilla::dom::RemoteWorkerChild::CloseWorkerOnMainThread(mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending, mozilla::dom::RemoteWorkerChild::Running, mozilla::dom::RemoteWorkerChild::PendingTerminated, mozilla::dom::RemoteWorkerChild::Terminated>&) [dom/workers/remoteworkers/RemoteWorkerChild.cpp:615]
Windows stack start: xul.dll!mozilla::dom::RemoteWorkerChild::CloseWorkerOnMainThread(mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>&) [RemoteWorkerChild.cpp:0d6adfdd3b38601bf4949673c95e617a3eacff0e : 0 + 0x2a]
OSX
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=292263044&repo=autoland&lineNumber=29298
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291610092&repo=autoland&lineNumber=29335
OSX Stack start: mozilla::dom::RemoteWorkerChild::SharedWorkerOp::Exec(mozilla::dom::SelfHolder&) [dom/workers/remoteworkers/RemoteWorkerChild.cpp:896]
OSX + Linux
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=299611143&repo=autoland&lineNumber=81163
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=297918122&repo=try&lineNumber=75842
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=300125171&repo=try&lineNumber=81651
OSX stack start: XUL!mozilla::AppWindow::GetPrimaryRemoteTabSize(int*, int*) [AppWindow.cpp:d920c3959dc28c50c5ef0acebcf3de0fce5a9700 : 2002 + 0x29]
Linux stack start: libxul.so!mozilla::AppWindow::GetPrimaryRemoteTabSize(int*, int*) [AppWindow.cpp:180f0cbaea8765e95c8a573b9bb89705dee1a080 : 2003 + 0x9]
Should we add a prefix for those?
Comment 16•4 years ago
|
||
Unfortunately the crash signatures generated from test failures don't go through Socorro's signature generation code so we can't change them easily. We've got a long standing bug to do that (bug 867571) and we have the necessary tools, but we never had the spare cycles to get to it.
Comment 17•4 years ago
|
||
OK, and I just noticed, that the first two signatures are actually the same, it seems that CloseWorkerOnMainThread
is just omitted on the OSX stack trace, as it is called there, too.
FWIW, I found also a small number of similar real life signatures :
https://crash-stats.mozilla.org/report/index/6b726166-00f7-4473-ad97-b7c4c0200526
Just two lines ahead. It seems, like the lock->as<Running>()
can be null here, which is not checked before use, just asserted in debug builds. Note, that this is NOT the CloseWorkerOnMainThread
case, though. Still there is something unexpected happening with those variants, it seems.
I am wondering, if this might be related to bug 1638170 as it seems to have also here Workers around in an unclear state?
Comment 18•4 years ago
|
||
I don't think it's related. In bug 1638170 there weren't any nullptr
pointers involved, only something happening on the wrong thread.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 23•4 years ago
|
||
This signature happens for different root causes:
So here we have a MutexAutoLock lock(mMutex);
and an if (mEventSource)
in place in order to protect mEventSource
. Thus I must assume, that another thread sets mEventSource
to null between the if and the dereference.
The only place where mEventSource
is explicitly set to null, seems to be here called from here (thus jumping forth and back between EventSourceImpl
and EventSource
). To my understanding, we probably miss a lock on mMutex that surrounds the call to UpdateDontKeepAlive
.
(more to come)
Comment 24•4 years ago
|
||
Here instead we have an mWorkerPrivate
null pointer dereference. Here I see no mutex and no check at all, if we have a valid workerprivate pointer. I see however that writes to the same mWorkerPrivate
seem to be guarded by a lock here and here but not here. So my assumption here is, that all accesses to mWorkerPrivate (both write and read) should be made thread safe.
Comment 26•4 years ago
|
||
I suspect bug 1545733 to have the same cause as in comment 24.
Comment 27•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #24)
The last location you mentioned is in the destructor of WorkerPrivateAccessibleState
. An access to its member as such is thread-safe and doesn't need synchronization. It can only be called from the destructor of RemoteWorkerChild
, or when *RemoteWorkerChild::mState
is re-assigned, but this can only happen under a lock.
However, mWorkerPrivate
can apparently be legitimately null, as this comment indicates: https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/workers/remoteworkers/RemoteWorkerChild.cpp#729-732
Comment 28•4 years ago
|
||
(In reply to Jens Stutte [:jstutte] from comment #23)
This signature happens for different root causes:
So here we have a
MutexAutoLock lock(mMutex);
and anif (mEventSource)
in place in order to protectmEventSource
. Thus I must assume, that another thread setsmEventSource
to null between the if and the dereference.The only place where
mEventSource
is explicitly set to null, seems to be here called from here (thus jumping forth and back betweenEventSourceImpl
andEventSource
). To my understanding, we probably miss a lock on mMutex that surrounds the call toUpdateDontKeepAlive
.(more to come)
That seems like a possible explanation. However, then the comment (https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/base/EventSource.cpp#273-275) on mMutex
is wrong, which states that only mEventSource->mReadyState
is protected by the mutex, not mEventSource
itself. If that were actually true, then the locking in GetReadyState
would be too broad.
Updated•4 years ago
|
Updated•4 years ago
|
Comment 29•4 years ago
|
||
(In reply to Simon Giesecke [:sg] [he/him] from comment #27)
The last location you mentioned but not here is in the destructor of
WorkerPrivateAccessibleState
. An access to its member as such is thread-safe and doesn't need synchronization. It can only be called from the destructor ofRemoteWorkerChild
, or when*RemoteWorkerChild::mState
is re-assigned, but this can only happen under a lock.
Thanks for clarifying!
However,
mWorkerPrivate
can apparently be legitimately null, as this comment indicates: https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/workers/remoteworkers/RemoteWorkerChild.cpp#729-732
Interesting! And once we check mWorkerPrivate for null ptr, we probably need to do so guarded by the same lock used for write access here and here in order to avoid to end up in a similar situation like the one of comment 28.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 32•4 years ago
|
||
(In reply to Simon Giesecke [:sg] [he/him] from comment #28)
(In reply to Jens Stutte [:jstutte] from comment #23)
This signature happens for different root causes:
So here we have a
MutexAutoLock lock(mMutex);
and anif (mEventSource)
in place in order to protectmEventSource
. Thus I must assume, that another thread setsmEventSource
to null between the if and the dereference.The only place where
mEventSource
is explicitly set to null, seems to be here called from here (thus jumping forth and back betweenEventSourceImpl
andEventSource
). To my understanding, we probably miss a lock on mMutex that surrounds the call toUpdateDontKeepAlive
.(more to come)
That seems like a possible explanation. However, then the comment (https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/base/EventSource.cpp#273-275) on
mMutex
is wrong, which states that onlymEventSource->mReadyState
is protected by the mutex, notmEventSource
itself. If that were actually true, then the locking inGetReadyState
would be too broad.
I would guess that the comment is wrong... It's probably better to use a DataMutex
here for these fields.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 57•4 years ago
|
||
Hi Yaron, this is still happening, though the stack changed slightly with the other recent patches around EventSource. Does this give any new clue?
Comment 58•4 years ago
|
||
All three stacks from last week's failures have nothing to do with EventSource
anymore, so those are different bugs now.
Comment 59•4 years ago
|
||
Yes, you are right, and it is not even clear to my why those have been associated here - there is no trace of that assertion at all. Gabriele, any clue?
Comment 60•4 years ago
|
||
Given this stack the crash must be happening here. Quickly looking at the code I'd say that mMutex
must contain a NULL
mMutex
field (yeah, it's unfortunate that both are called the same but you get the idea). Since the mutex stored in a SharedMutex
is held in a RefPtr
then the assertion makes sense.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 83•3 years ago
|
||
Hi Marco, the failures here are just different places all over our code that happen to do an unexpected nullptr deref on a RefPtr
. Could we have some prefix here? The original bug seems to not happen any more.
Comment 84•3 years ago
|
||
I don't know if there's a way to do that, but it would make perfect sense. Gabriele, do you know?
If there is no way, we can ask sheriffs to classify based on the line after the failure line instead of just the failure line.
Comment 85•3 years ago
|
||
I don't know which part of our testing harness highlights these errors (and chooses the line to put in the signature). I grep'd around a bit but couldn't find an obvious place where this happens.
Comment 86•3 years ago
|
||
What should the prefix be? Those logs are coming out of Gecko so a prefix would need to be added there. If there's a line that should show up in the failure summary but isn't, we could add it to here:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/errors.py#152
Comment 87•3 years ago
|
||
(In reply to Andrew Halberstadt [:ahal] from comment #86)
What should the prefix be? Those logs are coming out of Gecko so a prefix would need to be added there.
Basically in this case we don't care that the assertion failure is coming from https://searchfox.org/mozilla-central/rev/d10188f9b4f1c4974264f3925505a0498d346c57/mfbt/RefPtr.h#314, but we want to know the caller of that function.
I don't think there is currently a way to add an assertion that mentions the caller in the same line as the log for the assertion failure.
If there's a line that should show up in the failure summary but isn't, we could add it to here:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/errors.py#152
After the assertion failure is logged, the stack is dumped. Is there a way to use both the assertion failure message and the first line of the stack trace?
We'd only do it for this specific assertion though (or maybe for all assertions in header files?), not all assertions.
Comment 88•3 years ago
|
||
Yeah, it's possible.. we'd have to store state in this function and search for the stack lines after detecting an assertion we care about:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/structuredlog.py#67
Though this feels a bit hacky.
Comment 89•3 years ago
|
||
I'll just mention that this reminds me of https://searchfox.org/mozilla-central/rev/d10188f9b4f1c4974264f3925505a0498d346c57/testing/mozbase/mozcrash/mozcrash/mozcrash.py#177, where we ignore some abort signatures in favour of reporting what caused the abort....but I don't think there's anything that can be leveraged there.
How about just filing dependent bugs for "Assertion failure: mRawPtr != nullptr -- caused by ...."?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 95•3 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #89)
How about just filing dependent bugs for "Assertion failure: mRawPtr != nullptr -- caused by ...."?
The most recent failures I see here happen to be a duplicate of bug 1545733. I filed bug 1741155 as proposed and closed it immediately as duplicate. I assume we should move this bug to a more general component if we want to keep it open?
Updated•3 years ago
|
Comment 96•3 years ago
|
||
Thanks for that follow-up!
Yes, it doesn't make sense to keep this generic failure report in DOM : workers. Technically, this assertion is in mfbt code...
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 104•3 years ago
|
||
(In reply to Intermittent Failures Robot from comment #103)
For more details, see:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1619923&startday=2022-01-10&endday=2022-01-16&tree=all
The latest instance is again a case of bug 1545733, it seems.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 107•3 years ago
|
||
Looking at the last instance, I see always an instance of bug 1545733:
- On the main thread we receive a close operation which makes us lock our state variable in
SharedWorkerOp::Exec
and then we'll findmWorkerPrivate
beingnullptr
.
xul.dll!mozilla::dom::WorkerPrivate::Notify(mozilla::dom::WorkerStatus aStatus) Zeile 1660 C++
[Inlineframe] xul.dll!mozilla::dom::WorkerPrivate::Cancel() Zeile 158 C++
> xul.dll!mozilla::dom::RemoteWorkerChild::CloseWorkerOnMainThread(mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated> & aState) Zeile 646 C++
xul.dll!mozilla::dom::RemoteWorkerChild::SharedWorkerOp::Exec(mozilla::dom::SelfHolder & aOwner) Zeile 942 C++
[Inlineframe] xul.dll!mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart::<lambda_1>::operator()() Zeile 906 C++
xul.dll!mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/dom/workers/remoteworkers/RemoteWorkerChild.cpp:896:19'>::Run() Zeile 532 C++
- On the worker thread we see a race of the
TransitionStateToRunning
locked on the very same state variable:
ntdll.dll!NtWaitForAlertByThreadId() Unbekannt
ntdll.dll!RtlAcquireSRWLockExclusive() Unbekannt
[Inlineframe] xul.dll!mozilla::OffTheBooksMutex::Lock() Zeile 63 C++
> [Inlineframe] xul.dll!mozilla::DataMutexBase<mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>,mozilla::Mutex>::AutoLock::AutoLock(mozilla::DataMutexBase<mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>,mozilla::Mutex> * aDataMutex) Zeile 81 C++
[Inlineframe] xul.dll!mozilla::DataMutexBase<mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>,mozilla::Mutex>::Lock() Zeile 92 C++
xul.dll!mozilla::dom::RemoteWorkerChild::TransitionStateToRunning(already_AddRefed<mozilla::dom::WorkerPrivate> aWorkerPrivate, already_AddRefed<mozilla::dom::WeakWorkerRef> aWorkerRef) Zeile 799 C++
xul.dll!mozilla::dom::RemoteWorkerChild::InitializeOnWorker() Zeile 571 C++
xul.dll!mozilla::dom::RemoteWorkerChild::InitializeWorkerRunnable::WorkerRun(JSContext *, mozilla::dom::WorkerPrivate *) Zeile 226 C++
xul.dll!mozilla::dom::WorkerRunnable::Run() Zeile 377 C++
xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1190 C++
This confirms the suspect from bug 1545733 comment 18 that we close a worker that has never been transitioned to Running
. It is not clear from the stack traces, who scheduled the close operation, though.
In any case it seems the intuition of this comment was the right one (in the sense that this is a weird way of initializing mWorkerPrivate
late). We might need to expect mWorkerPrivate
to be nullptr
before using it in the Pending
state?
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 110•2 years ago
|
||
Moving ni? to :asuth, see comment 107
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Comment 113•2 years ago
|
||
Update:
There have been 36 failures within the last 7 days, all of them on Linux 18.04 x64 WebRender debug
.
Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=413497701&repo=autoland&lineNumber=10321
[task 2023-04-23T22:15:26.765Z] 22:15:26 INFO - TEST-START | /infrastructure/server/wpt-server-wpt-flags.sub.html
[task 2023-04-23T22:15:26.767Z] 22:15:26 INFO - Setting pref network.webtransport.enabled to true
[task 2023-04-23T22:15:26.784Z] 22:15:26 INFO - Setting pref network.webtransport.datagrams.enabled to true
[task 2023-04-23T22:15:26.791Z] 22:15:26 INFO - PID 9050 | Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/obj-build/dist/include/mozilla/RefPtr.h:354
[task 2023-04-23T22:15:26.800Z] 22:15:26 INFO - STDOUT: Initializing stack-fixing for the first stack frame, this may take a while...
[task 2023-04-23T22:15:26.803Z] 22:15:26 INFO - Setting pref security.OCSP.enabled to 0
[task 2023-04-23T22:15:26.826Z] 22:15:26 INFO - Setting pref network.webtransport.enabled to true
[task 2023-04-23T22:15:26.866Z] 22:15:26 INFO - Setting pref network.webtransport.datagrams.enabled to true
[task 2023-04-23T22:15:26.896Z] 22:15:26 INFO - Setting pref security.OCSP.enabled to 0
[task 2023-04-23T22:15:26.918Z] 22:15:26 INFO - Closing window 4f216fad-9009-4dd8-aafe-31edcb39ae0e
Mike, any chance you could assign this to someone?
Thank you.
Comment hidden (Intermittent Failures Robot) |
Comment 115•2 years ago
|
||
Looking at recent failure logs, this is a problem with InputToReadableStreamAlgorithms::OnInputStreamReady, not RefPtr. This assertion failure is always going to be a problem with whatever is calling the function, not the function itself.
Updated•2 years ago
|
I can take a look, but InputToReadableStreamAlgorithms is very new. It's kinda weird to track a recent regression in a 3 years old bug, should I file a separate bug?
For now let me assign myself.
Comment 118•2 years ago
|
||
(In reply to Kagami [:saschanaz] from comment #116)
I can take a look, but InputToReadableStreamAlgorithms is very new. It's kinda weird to track a recent regression in a 3 years old bug, should I file a separate bug?
Yes, please do. This bug is really a dumping ground for unrelated issues. We can move it back to Core::General or something.
Updated•2 years ago
|
Updated•2 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 122•1 years ago
|
||
Comment 0 and comment 107 were about bug 1775784, which is now fixed.
The 1 m-c occurrence from comment 121 and 4 m-c occurrences from comment 120 are all instances of InputToReadableStreamAlgorithms::OnInputStreamReady
which was spun off into bug 1829961 which was then duped into bug 1829166.
Any new instances of the mRawPtr != nullptr
signature are probably bug 1829166 and should be filed against that bug. Someone may need to updated that bug's subject?
I am marking this bug as a dupe of bug 1775784 because that's what it was about and changing its subject to avoid having any confusion about new incidences of the null RefPtr access signature since all of the recent intermittents are actually bug 1829166 and in comment 118 it was determined to let that be its own bug. (I'm also moving the bug back into workers.)
Description
•