Closed Bug 1619923 Opened 5 years ago Closed 1 years ago

(Fixed) intermittent signature for nullptr deref in RemoteWorkerChild::SharedWorkerOp::MaybeStart

Categories

(Core :: DOM: Workers, defect, P3)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1775784

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:backout][stockwell unknown])

Crash Data

Filed by: aiakab [at] mozilla.com
Parsed log: https://treeherder.mozilla.org/logviewer.html#?job_id=291610092&repo=autoland
Full log: https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/SqjrRmV-S6anOLaOrFRQag/runs/0/artifacts/public/logs/live_backing.log


[task 2020-03-04T11:24:37.041Z] 11:24:37 INFO - TEST-START | dom/workers/test/test_bug978260.html
[task 2020-03-04T11:24:37.041Z] 11:24:37 INFO - GECKO(2335) | [Parent 2335, IPDL Background] WARNING: Unknown MessagePort in ForceClose(): file /builds/worker/workspace/build/src/dom/messagechannel/MessagePortService.cpp, line 392
[task 2020-03-04T11:24:37.041Z] 11:24:37 INFO - GECKO(2335) | [Parent 2335, IPDL Background] WARNING: Unknown MessagePort in ForceClose(): file /builds/worker/workspace/build/src/dom/messagechannel/MessagePortService.cpp, line 392
[task 2020-03-04T11:24:37.079Z] 11:24:37 INFO - GECKO(2335) | Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #01: mozilla::dom::RemoteWorkerChild::SharedWorkerOp::Exec(mozilla::dom::SelfHolder&) [dom/workers/remoteworkers/RemoteWorkerChild.cpp:896]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #02: mozilla::detail::RunnableFunction<mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart(mozilla::dom::RemoteWorkerChild*, mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending, mozilla::dom::RemoteWorkerChild::Running, mozilla::dom::RemoteWorkerChild::PendingTerminated, mozilla::dom::RemoteWorkerChild::Terminated>&)::'lambda'()>::Run() [xpcom/threads/nsThreadUtils.h:558]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #03: mozilla::SchedulerGroup::Runnable::Run() [xpcom/threads/SchedulerGroup.cpp:282]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #04: nsThread::ProcessNextEvent(bool, bool*) [mfbt/RefPtr.h:314]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #05: NS_ProcessNextEvent(nsIThread*, bool) [xpcom/threads/nsThreadUtils.cpp:481]
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.377Z] 11:25:00 INFO - GECKO(2335) | #06: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:87]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #07: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:291]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #08: nsBaseAppShell::Run() [widget/nsBaseAppShell.cpp:139]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #09: nsAppShell::Run() [widget/cocoa/nsAppShell.mm:705]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #10: XRE_RunAppShell() [toolkit/xre/nsEmbedFunctions.cpp:926]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #11: mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) [ipc/glue/MessagePump.cpp:237]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #12: MessageLoop::Run() [ipc/chromium/src/base/message_loop.cc:291]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO - GECKO(2335) | #13: XRE_InitChildProcess(int, char**, XREChildData const*) [toolkit/xre/nsEmbedFunctions.cpp:765]
[task 2020-03-04T11:25:00.378Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | #14: plugin-container + 0xf0b
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO -
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/ipc/glue/ProtocolUtils.cpp, line 477
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | ###!!! [Parent][MessageChannel] Error: (msgtype=0x370131,name=PContent::Msg_DetachBrowsingContext) Channel error: cannot send/recv
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/ipc/glue/ProtocolUtils.cpp, line 477
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/ipc/glue/ProtocolUtils.cpp, line 477
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: IPC message discarded: actor cannot send: file /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/ipc/ProtocolUtils.h, line 294
[task 2020-03-04T11:25:00.379Z] 11:25:00 ERROR - GECKO(2335) | A content process crashed and MOZ_CRASHREPORTER_SHUTDOWN is set, shutting down
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: '!inner', file /builds/worker/workspace/build/src/dom/ipc/JSWindowActorService.cpp, line 182
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Child 2336, Main Thread] WARNING: nsAppShell::Exit() called redundantly: file /builds/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 725
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/worker/workspace/build/src/netwerk/cache/nsCacheService.cpp, line 855
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/worker/workspace/build/src/netwerk/cache/nsCacheService.cpp, line 818
[task 2020-03-04T11:25:00.379Z] 11:25:00 INFO - GECKO(2335) | [Parent 2335, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /builds/worker/workspace/build/src/netwerk/cache/nsApplicationCacheService.cpp, line 161
[task 2020-03-04T11:25:00.380Z] 11:25:00 INFO - GECKO(2335) | [Child 2336, Main Thread] WARNING: Extra shutdown CC: 'i < NORMAL_SHUTDOWN_COLLECTIONS', file /builds/worker/workspace/build/src/xpcom/base/nsCycleCollector.cpp, line 3352
[task 2020-03-04T11:25:00.380Z] 11:25:00 INFO - GECKO(2335) | nsStringStats

So we have a mWorkerPrivate nullptr access. I assume this is related to bug 1539508.

Depends on: 1539508
Priority: P5 → P3

(In reply to Intermittent Failures Robot from comment #5)

1 failures in 4528 pushes (0.0 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:

  • try: 1

Platform breakdown:

  • linux1804-64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1619923&startday=2020-04-13&endday=2020-04-19&tree=all

This happens in a totally different code here. It seems, that host can be null here and we do not deal with it (or do not expect this to be possible). Ryan, you introduced that host variable not too long ago?

Flags: needinfo?(rhunt)
Summary: Intermittent GECKO(2335) | Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315 → Intermittent Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315

I have corrected the classifications, all the failures were fixed by commit:
https://hg.mozilla.org/integration/autoland/rev/ad5be7cecd83b9de5fd520ad0e19de7ea44b1466

Whiteboard: [stockwell needswork:owner] → [stockwell fixed:backout]

(In reply to Jens Stutte [:jstutte] from comment #6)

(In reply to Intermittent Failures Robot from comment #5)

1 failures in 4528 pushes (0.0 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:

  • try: 1

Platform breakdown:

  • linux1804-64: 1

For more details, see:
https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?bug=1619923&startday=2020-04-13&endday=2020-04-19&tree=all

This happens in a totally different code here. It seems, that host can be null here and we do not deal with it (or do not expect this to be possible). Ryan, you introduced that host variable not too long ago?

No, if you look at the diff you can see that this was part of a refactoring and previously this variable was called browserParent. It's possible the refactoring caused this error to be possible, but my guess would be that this is pre-existing or caused by something else.

Flags: needinfo?(rhunt)

I see three different origins, only two of them worker related, one not:

Linux + Windows
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=302715437&repo=mozilla-central&lineNumber=5979
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=301979097&repo=autoland&lineNumber=5637
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=298836794&repo=autoland&lineNumber=12451
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=293473260&repo=autoland&lineNumber=7169
Linux stack start:   mozilla::dom::RemoteWorkerChild::CloseWorkerOnMainThread(mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending, mozilla::dom::RemoteWorkerChild::Running, mozilla::dom::RemoteWorkerChild::PendingTerminated, mozilla::dom::RemoteWorkerChild::Terminated>&) [dom/workers/remoteworkers/RemoteWorkerChild.cpp:615]
Windows stack start: xul.dll!mozilla::dom::RemoteWorkerChild::CloseWorkerOnMainThread(mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>&) [RemoteWorkerChild.cpp:0d6adfdd3b38601bf4949673c95e617a3eacff0e : 0 + 0x2a]

OSX
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=292263044&repo=autoland&lineNumber=29298
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=291610092&repo=autoland&lineNumber=29335
OSX Stack start:     mozilla::dom::RemoteWorkerChild::SharedWorkerOp::Exec(mozilla::dom::SelfHolder&) [dom/workers/remoteworkers/RemoteWorkerChild.cpp:896]

OSX + Linux
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=299611143&repo=autoland&lineNumber=81163
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=297918122&repo=try&lineNumber=75842
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=300125171&repo=try&lineNumber=81651
OSX stack start:     XUL!mozilla::AppWindow::GetPrimaryRemoteTabSize(int*, int*) [AppWindow.cpp:d920c3959dc28c50c5ef0acebcf3de0fce5a9700 : 2002 + 0x29]
Linux stack start:   libxul.so!mozilla::AppWindow::GetPrimaryRemoteTabSize(int*, int*) [AppWindow.cpp:180f0cbaea8765e95c8a573b9bb89705dee1a080 : 2003 + 0x9]

Should we add a prefix for those?

Flags: needinfo?(gsvelto)

Unfortunately the crash signatures generated from test failures don't go through Socorro's signature generation code so we can't change them easily. We've got a long standing bug to do that (bug 867571) and we have the necessary tools, but we never had the spare cycles to get to it.

Flags: needinfo?(gsvelto)

OK, and I just noticed, that the first two signatures are actually the same, it seems that CloseWorkerOnMainThread is just omitted on the OSX stack trace, as it is called there, too.

FWIW, I found also a small number of similar real life signatures :
https://crash-stats.mozilla.org/report/index/6b726166-00f7-4473-ad97-b7c4c0200526
Just two lines ahead. It seems, like the lock->as<Running>() can be null here, which is not checked before use, just asserted in debug builds. Note, that this is NOT the CloseWorkerOnMainThread case, though. Still there is something unexpected happening with those variants, it seems.

I am wondering, if this might be related to bug 1638170 as it seems to have also here Workers around in an unclear state?

Flags: needinfo?(ytausky)

I don't think it's related. In bug 1638170 there weren't any nullptr pointers involved, only something happening on the wrong thread.

Flags: needinfo?(ytausky)

This signature happens for different root causes:

So here we have a MutexAutoLock lock(mMutex); and an if (mEventSource) in place in order to protect mEventSource. Thus I must assume, that another thread sets mEventSource to null between the if and the dereference.

The only place where mEventSourceis explicitly set to null, seems to be here called from here (thus jumping forth and back between EventSourceImpl and EventSource). To my understanding, we probably miss a lock on mMutex that surrounds the call to UpdateDontKeepAlive.

(more to come)

Here instead we have an mWorkerPrivate null pointer dereference. Here I see no mutex and no check at all, if we have a valid workerprivate pointer. I see however that writes to the same mWorkerPrivate seem to be guarded by a lock here and here but not here. So my assumption here is, that all accesses to mWorkerPrivate (both write and read) should be made thread safe.

Flags: needinfo?(bugmail)

I suspect bug 1545733 to have the same cause as in comment 24.

(In reply to Jens Stutte [:jstutte] from comment #24)
The last location you mentioned is in the destructor of WorkerPrivateAccessibleState. An access to its member as such is thread-safe and doesn't need synchronization. It can only be called from the destructor of RemoteWorkerChild, or when *RemoteWorkerChild::mState is re-assigned, but this can only happen under a lock.

However, mWorkerPrivate can apparently be legitimately null, as this comment indicates: https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/workers/remoteworkers/RemoteWorkerChild.cpp#729-732

(In reply to Jens Stutte [:jstutte] from comment #23)

This signature happens for different root causes:

So here we have a MutexAutoLock lock(mMutex); and an if (mEventSource) in place in order to protect mEventSource. Thus I must assume, that another thread sets mEventSource to null between the if and the dereference.

The only place where mEventSourceis explicitly set to null, seems to be here called from here (thus jumping forth and back between EventSourceImpl and EventSource). To my understanding, we probably miss a lock on mMutex that surrounds the call to UpdateDontKeepAlive.

(more to come)

That seems like a possible explanation. However, then the comment (https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/base/EventSource.cpp#273-275) on mMutex is wrong, which states that only mEventSource->mReadyState is protected by the mutex, not mEventSource itself. If that were actually true, then the locking in GetReadyState would be too broad.

Crash Signature: [@ mozilla::dom::WorkerPrivate::Notify(mozilla::dom::WorkerStatus)]

(In reply to Simon Giesecke [:sg] [he/him] from comment #27)

The last location you mentioned but not here is in the destructor of WorkerPrivateAccessibleState. An access to its member as such is thread-safe and doesn't need synchronization. It can only be called from the destructor of RemoteWorkerChild, or when *RemoteWorkerChild::mState is re-assigned, but this can only happen under a lock.

Thanks for clarifying!

However, mWorkerPrivate can apparently be legitimately null, as this comment indicates: https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/workers/remoteworkers/RemoteWorkerChild.cpp#729-732

Interesting! And once we check mWorkerPrivate for null ptr, we probably need to do so guarded by the same lock used for write access here and here in order to avoid to end up in a similar situation like the one of comment 28.

(In reply to Simon Giesecke [:sg] [he/him] from comment #28)

(In reply to Jens Stutte [:jstutte] from comment #23)

This signature happens for different root causes:

So here we have a MutexAutoLock lock(mMutex); and an if (mEventSource) in place in order to protect mEventSource. Thus I must assume, that another thread sets mEventSource to null between the if and the dereference.

The only place where mEventSourceis explicitly set to null, seems to be here called from here (thus jumping forth and back between EventSourceImpl and EventSource). To my understanding, we probably miss a lock on mMutex that surrounds the call to UpdateDontKeepAlive.

(more to come)

That seems like a possible explanation. However, then the comment (https://searchfox.org/mozilla-central/rev/af5cff556a9482d7aa2267a82f2ccfaf18e797e9/dom/base/EventSource.cpp#273-275) on mMutex is wrong, which states that only mEventSource->mReadyState is protected by the mutex, not mEventSource itself. If that were actually true, then the locking in GetReadyState would be too broad.

I would guess that the comment is wrong... It's probably better to use a DataMutex here for these fields.

Depends on: 1658776

Hi Yaron, this is still happening, though the stack changed slightly with the other recent patches around EventSource. Does this give any new clue?

Flags: needinfo?(bugmail) → needinfo?(ytausky)

All three stacks from last week's failures have nothing to do with EventSource anymore, so those are different bugs now.

Flags: needinfo?(ytausky)

Yes, you are right, and it is not even clear to my why those have been associated here - there is no trace of that assertion at all. Gabriele, any clue?

Flags: needinfo?(gsvelto)

Given this stack the crash must be happening here. Quickly looking at the code I'd say that mMutex must contain a NULL mMutex field (yeah, it's unfortunate that both are called the same but you get the idea). Since the mutex stored in a SharedMutex is held in a RefPtr then the assertion makes sense.

Flags: needinfo?(gsvelto)

Hi Marco, the failures here are just different places all over our code that happen to do an unexpected nullptr deref on a RefPtr. Could we have some prefix here? The original bug seems to not happen any more.

Flags: needinfo?(mcastelluccio)

I don't know if there's a way to do that, but it would make perfect sense. Gabriele, do you know?

If there is no way, we can ask sheriffs to classify based on the line after the failure line instead of just the failure line.

Flags: needinfo?(mcastelluccio) → needinfo?(gsvelto)

I don't know which part of our testing harness highlights these errors (and chooses the line to put in the signature). I grep'd around a bit but couldn't find an obvious place where this happens.

Flags: needinfo?(gsvelto)

What should the prefix be? Those logs are coming out of Gecko so a prefix would need to be added there. If there's a line that should show up in the failure summary but isn't, we could add it to here:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/errors.py#152

(In reply to Andrew Halberstadt [:ahal] from comment #86)

What should the prefix be? Those logs are coming out of Gecko so a prefix would need to be added there.

Basically in this case we don't care that the assertion failure is coming from https://searchfox.org/mozilla-central/rev/d10188f9b4f1c4974264f3925505a0498d346c57/mfbt/RefPtr.h#314, but we want to know the caller of that function.
I don't think there is currently a way to add an assertion that mentions the caller in the same line as the log for the assertion failure.

If there's a line that should show up in the failure summary but isn't, we could add it to here:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/testing/errors.py#152

After the assertion failure is logged, the stack is dumped. Is there a way to use both the assertion failure message and the first line of the stack trace?
We'd only do it for this specific assertion though (or maybe for all assertions in header files?), not all assertions.

Yeah, it's possible.. we'd have to store state in this function and search for the stack lines after detecting an assertion we care about:
https://searchfox.org/mozilla-central/source/testing/mozharness/mozharness/mozilla/structuredlog.py#67

Though this feels a bit hacky.

I'll just mention that this reminds me of https://searchfox.org/mozilla-central/rev/d10188f9b4f1c4974264f3925505a0498d346c57/testing/mozbase/mozcrash/mozcrash/mozcrash.py#177, where we ignore some abort signatures in favour of reporting what caused the abort....but I don't think there's anything that can be leveraged there.

How about just filing dependent bugs for "Assertion failure: mRawPtr != nullptr -- caused by ...."?

Depends on: 1741155

(In reply to Geoff Brown [:gbrown] from comment #89)

How about just filing dependent bugs for "Assertion failure: mRawPtr != nullptr -- caused by ...."?

The most recent failures I see here happen to be a duplicate of bug 1545733. I filed bug 1741155 as proposed and closed it immediately as duplicate. I assume we should move this bug to a more general component if we want to keep it open?

Flags: needinfo?(gbrown)

Thanks for that follow-up!

Yes, it doesn't make sense to keep this generic failure report in DOM : workers. Technically, this assertion is in mfbt code...

Component: DOM: Workers → MFBT
Flags: needinfo?(gbrown)

(In reply to Intermittent Failures Robot from comment #103)

For more details, see:
https://treeherder.mozilla.org/intermittent-failures/bugdetails?bug=1619923&startday=2022-01-10&endday=2022-01-16&tree=all

The latest instance is again a case of bug 1545733, it seems.

Looking at the last instance, I see always an instance of bug 1545733:

 	xul.dll!mozilla::dom::WorkerPrivate::Notify(mozilla::dom::WorkerStatus aStatus) Zeile 1660	C++
 	[Inlineframe] xul.dll!mozilla::dom::WorkerPrivate::Cancel() Zeile 158	C++
>	xul.dll!mozilla::dom::RemoteWorkerChild::CloseWorkerOnMainThread(mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated> & aState) Zeile 646	C++
 	xul.dll!mozilla::dom::RemoteWorkerChild::SharedWorkerOp::Exec(mozilla::dom::SelfHolder & aOwner) Zeile 942	C++
 	[Inlineframe] xul.dll!mozilla::dom::RemoteWorkerChild::SharedWorkerOp::MaybeStart::<lambda_1>::operator()() Zeile 906	C++
 	xul.dll!mozilla::detail::RunnableFunction<`lambda at /builds/worker/checkouts/gecko/dom/workers/remoteworkers/RemoteWorkerChild.cpp:896:19'>::Run() Zeile 532	C++

  • On the worker thread we see a race of the TransitionStateToRunning locked on the very same state variable:
 	ntdll.dll!NtWaitForAlertByThreadId()	Unbekannt
 	ntdll.dll!RtlAcquireSRWLockExclusive()	Unbekannt
 	[Inlineframe] xul.dll!mozilla::OffTheBooksMutex::Lock() Zeile 63	C++
>	[Inlineframe] xul.dll!mozilla::DataMutexBase<mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>,mozilla::Mutex>::AutoLock::AutoLock(mozilla::DataMutexBase<mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>,mozilla::Mutex> * aDataMutex) Zeile 81	C++
 	[Inlineframe] xul.dll!mozilla::DataMutexBase<mozilla::Variant<mozilla::dom::RemoteWorkerChild::Pending,mozilla::dom::RemoteWorkerChild::Running,mozilla::dom::RemoteWorkerChild::PendingTerminated,mozilla::dom::RemoteWorkerChild::Terminated>,mozilla::Mutex>::Lock() Zeile 92	C++
 	xul.dll!mozilla::dom::RemoteWorkerChild::TransitionStateToRunning(already_AddRefed<mozilla::dom::WorkerPrivate> aWorkerPrivate, already_AddRefed<mozilla::dom::WeakWorkerRef> aWorkerRef) Zeile 799	C++
 	xul.dll!mozilla::dom::RemoteWorkerChild::InitializeOnWorker() Zeile 571	C++
 	xul.dll!mozilla::dom::RemoteWorkerChild::InitializeWorkerRunnable::WorkerRun(JSContext *, mozilla::dom::WorkerPrivate *) Zeile 226	C++
 	xul.dll!mozilla::dom::WorkerRunnable::Run() Zeile 377	C++
 	xul.dll!nsThread::ProcessNextEvent(bool aMayWait, bool * aResult) Zeile 1190	C++

This confirms the suspect from bug 1545733 comment 18 that we close a worker that has never been transitioned to Running. It is not clear from the stack traces, who scheduled the close operation, though.

In any case it seems the intuition of this comment was the right one (in the sense that this is a weird way of initializing mWorkerPrivate late). We might need to expect mWorkerPrivate to be nullptr before using it in the Pending state?

Flags: needinfo?(echuang)

Moving ni? to :asuth, see comment 107

Flags: needinfo?(echuang) → needinfo?(bugmail)
Severity: normal → S3
Crash Signature: [@ mozilla::dom::WorkerPrivate::Notify(mozilla::dom::WorkerStatus)] → [@ mozilla::dom::WorkerPrivate::Notify]
Duplicate of this bug: 1649451

Update:

There have been 36 failures within the last 7 days, all of them on Linux 18.04 x64 WebRender debug.

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=413497701&repo=autoland&lineNumber=10321

[task 2023-04-23T22:15:26.765Z] 22:15:26     INFO - TEST-START | /infrastructure/server/wpt-server-wpt-flags.sub.html
[task 2023-04-23T22:15:26.767Z] 22:15:26     INFO - Setting pref network.webtransport.enabled to true
[task 2023-04-23T22:15:26.784Z] 22:15:26     INFO - Setting pref network.webtransport.datagrams.enabled to true
[task 2023-04-23T22:15:26.791Z] 22:15:26     INFO - PID 9050 | Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/obj-build/dist/include/mozilla/RefPtr.h:354
[task 2023-04-23T22:15:26.800Z] 22:15:26     INFO - STDOUT: Initializing stack-fixing for the first stack frame, this may take a while...
[task 2023-04-23T22:15:26.803Z] 22:15:26     INFO - Setting pref security.OCSP.enabled to 0
[task 2023-04-23T22:15:26.826Z] 22:15:26     INFO - Setting pref network.webtransport.enabled to true
[task 2023-04-23T22:15:26.866Z] 22:15:26     INFO - Setting pref network.webtransport.datagrams.enabled to true
[task 2023-04-23T22:15:26.896Z] 22:15:26     INFO - Setting pref security.OCSP.enabled to 0
[task 2023-04-23T22:15:26.918Z] 22:15:26     INFO - Closing window 4f216fad-9009-4dd8-aafe-31edcb39ae0e

Mike, any chance you could assign this to someone?
Thank you.

Flags: needinfo?(mh+mozilla)
Whiteboard: [stockwell fixed:backout] → [stockwell fixed:backout][stockwell needswork:owner]

Looking at recent failure logs, this is a problem with InputToReadableStreamAlgorithms::OnInputStreamReady, not RefPtr. This assertion failure is always going to be a problem with whatever is calling the function, not the function itself.

Component: MFBT → DOM: Streams
Flags: needinfo?(mh+mozilla) → needinfo?(jmarshall)

I can take a look, but InputToReadableStreamAlgorithms is very new. It's kinda weird to track a recent regression in a 3 years old bug, should I file a separate bug?

Flags: needinfo?(jmarshall)

For now let me assign myself.

Assignee: nobody → krosylight

(In reply to Kagami [:saschanaz] from comment #116)

I can take a look, but InputToReadableStreamAlgorithms is very new. It's kinda weird to track a recent regression in a 3 years old bug, should I file a separate bug?

Yes, please do. This bug is really a dumping ground for unrelated issues. We can move it back to Core::General or something.

Assignee: krosylight → nobody
Component: DOM: Streams → General

Comment 0 and comment 107 were about bug 1775784, which is now fixed.

The 1 m-c occurrence from comment 121 and 4 m-c occurrences from comment 120 are all instances of InputToReadableStreamAlgorithms::OnInputStreamReady which was spun off into bug 1829961 which was then duped into bug 1829166.

Any new instances of the mRawPtr != nullptr signature are probably bug 1829166 and should be filed against that bug. Someone may need to updated that bug's subject?

I am marking this bug as a dupe of bug 1775784 because that's what it was about and changing its subject to avoid having any confusion about new incidences of the null RefPtr access signature since all of the recent intermittents are actually bug 1829166 and in comment 118 it was determined to let that be its own bug. (I'm also moving the bug back into workers.)

Status: NEW → RESOLVED
Closed: 1 years ago
Component: General → DOM: Workers
Duplicate of bug: 1775784
Flags: needinfo?(bugmail)
Resolution: --- → DUPLICATE
Summary: Intermittent Assertion failure: mRawPtr != nullptr (You can't dereference a NULL RefPtr with operator->().), at /builds/worker/workspace/build/src/obj-firefox/dist/include/mozilla/RefPtr.h:315 → (Fixed) intermittent signature for nullptr deref in RemoteWorkerChild::SharedWorkerOp::MaybeStart
You need to log in before you can comment on or make changes to this bug.