Closed Bug 1685666 Opened 4 years ago Closed 4 years ago

RDD Crash in [@ mozilla::ipc::IToplevelProtocol::OtherPid]

Categories

(Core :: Audio/Video: Playback, defect, P2)

defect

Tracking

()

RESOLVED FIXED
88 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox84 --- unaffected
firefox85 --- wontfix
firefox86 --- wontfix
firefox87 --- fixed
firefox88 --- fixed

People

(Reporter: aryx, Assigned: alwu)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: crash, regression, Whiteboard: [not-a-fission-bug])

Crash Data

Attachments

(1 file)

Maybe Fission related. (DOMFissionEnabled=1)

Crash report: https://crash-stats.mozilla.org/report/index/9570ab9a-a94f-4d74-a127-8f5c30210107

Reason: SIGSEGV /SEGV_MAPERR

Top 10 frames of crashing thread:

0 libxul.so mozilla::ipc::IToplevelProtocol::OtherPid const ipc/glue/ProtocolUtils.cpp:618
1 libxul.so mozilla::RDDProcessManager::CreateContentBridge dom/media/ipc/RDDProcessManager.cpp:268
2 libxul.so mozilla::RDDProcessManager::EnsureRDDProcessAndCreateBridge const dom/media/ipc/RDDProcessManager.cpp:192
3 libxul.so mozilla::MozPromise<bool, nsresult, false>::ThenValue<mozilla::RDDProcessManager::EnsureRDDProcessAndCreateBridge xpcom/threads/MozPromise.h:835
4 libxul.so mozilla::MozPromise<bool, nsresult, false>::ThenValueBase::ResolveOrRejectRunnable::Run xpcom/threads/MozPromise.h:476
5 libxul.so mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal xpcom/threads/TaskController.cpp:739
6 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1200
7 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:87
8 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:309
9 libxul.so nsBaseAppShell::Run widget/nsBaseAppShell.cpp:137
Flags: needinfo?(jya-moz)
Flags: needinfo?(jya-moz) → needinfo?(matt.woodrow)

Bryce, do you know who plans to take over ownership of the RDD process?

We run LaunchRDDProcess() which launches the process, and then asynchronously runs the callback. The callback calls CreateContentBridge, which assumes mRDDChild exists and dereferences it.

In this case it's nullptr, because we ran code between LaunchRDDProcess and cleared it (I guess because it crashed).

We could try make this not async (direct dispatch promises?), fail gracefully, or retry.

Fairly low volume crash, but probably worth fixing.

Flags: needinfo?(matt.woodrow) → needinfo?(bvandyk)

(In reply to Matt Woodrow (:mattwoodrow) from comment #1)

Bryce, do you know who plans to take over ownership of the RDD process?

We run LaunchRDDProcess() which launches the process, and then asynchronously runs the callback. The callback calls CreateContentBridge, which assumes mRDDChild exists and dereferences it.

In this case it's nullptr, because we ran code between LaunchRDDProcess and cleared it (I guess because it crashed).

We could try make this not async (direct dispatch promises?), fail gracefully, or retry.

Fairly low volume crash, but probably worth fixing.

I think :jhlin has expressed some interest, but he's on PTO for the near future. CCing :alwu who may be able to help. I don't have a lot of cycles right now, but will hold NI in case I get time to look.

Blocks: RDD
Severity: -- → S2
Flags: needinfo?(bvandyk)
Priority: -- → P3

Changing the priority to p2 as the bug is tracked by a release manager for the current nightly.
See What Do You Triage for more information

Priority: P3 → P2
Severity: S2 → S3

This crash is not related to Fission. About 23% of this signature's crash reports have Fission enabled, but that's roughly the percentage of Nightly users with Fission enabled, so there is no correlation.

OS: Unspecified → All
Hardware: Unspecified → All
Summary: Crash in [@ mozilla::ipc::IToplevelProtocol::OtherPid] → RDD Crash in [@ mozilla::ipc::IToplevelProtocol::OtherPid]
Whiteboard: [fission-]

Not a Fission bug

Whiteboard: [fission-] → [not-a-fission-bug]

With a suitable selection of open Web pages, this turns into a crash upon session restore. (I have no clue what page among my many tabs triggers this crash.)

[Tracking Requested - why for this release]:

jbauman, I suggest restoring the more severe designation of the bug considering that it seems easy for this to be a startup crash (really session restore-time crash) from the user perspective.

Flags: needinfo?(jbauman)

I don't have strong feelings about S2/S3. I probably changed it because it's possible to disable RDD, therefore a workaround exists (which is contrary to the definition of S2) and unassigned S2 bugs lead to triage automation spam for our team.

Flags: needinfo?(jbauman)

(In reply to Jon Bauman [:jbauman:] from comment #9)

I don't have strong feelings about S2/S3. I probably changed it because it's possible to disable RDD, therefore a workaround exists (which is contrary to the definition of S2) and unassigned S2 bugs lead to triage automation spam for our team.

A workaround exists but not in a way that's discoverable and understandable to a user experiencing this problem.

Hi, do any of you have a bandwidth to take this? If not, then feel free to cancel my NI and I will take this later. Thank you.

Flags: needinfo?(jolin)
Flags: needinfo?(jbauman)

I need to focus on the AVIF stuff at least until we get it shipping by default, so I don't think I can spare the necessary time for this in the short term, but please do include me on any reviews since I'm still planning to get more involved in RDD generally.

Flags: needinfo?(jbauman)

Will take a look on it later today.

Assignee: nobody → alwu
Flags: needinfo?(jolin)

(In reply to Matt Woodrow (:mattwoodrow) from comment #1)

We could try make this not async (direct dispatch promises?), fail gracefully, or retry.

I do see some value of keeping creating content bridge async, which separate the process of "creating RDD + Video bridge" and "creating content bridge". So I will keep this approach and add new check to avoid getting the crash.

As the process of creating content process is async, the RDD process might be shutdown before that (eg. crash). So we should check if RDD still exists first.

Pushed by alwu@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bf32f5089547 check RDDChild before creating content bridge. r=mattwoodrow
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 88 Branch

Do you want to uplift the fix to beta/87? I'm not sure about uplifting to release/86 because this has been tagged as 'wontfix' for 86.

Flags: needinfo?(alwu)

Sure, uplifting it to Beta seems enough because this is not a serious crash.

Flags: needinfo?(alwu)

Comment on attachment 9206170 [details]
Bug 1685666 - check RDDChild before creating content bridge.

Beta/Release Uplift Approval Request

  • User impact if declined: Crash.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch doesn't introduce any new behavior, it simply added a protection to prevent from accessing the dangling pointer.
  • String changes made/needed: none
Attachment #9206170 - Flags: approval-mozilla-beta?

Thanks for fixing!

(In reply to Alastor Wu [:alwu] from comment #19)

Sure, uplifting it to Beta seems enough because this is not a serious crash.

If the user has content that triggers this bug in tabs that get restored by session restore, the UX is very bad and, I assume, could push the user to try another browser. I suggest uplifting to release as well in case we end up doing a point release for reasons beyond this bug.

Comment on attachment 9206170 [details]
Bug 1685666 - check RDDChild before creating content bridge.

crash fix, approved for 87.0b6

Attachment #9206170 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

(In reply to Henri Sivonen (:hsivonen) from comment #21)

Thanks for fixing!
If the user has content that triggers this bug in tabs that get restored by session restore, the UX is very bad and, I assume, could push the user to try another browser. I suggest uplifting to release as well in case we end up doing a point release for reasons beyond this bug.

On the last seven days, we got 77 crashs total on Fx86, which is not a large amount. I'm not sure if that crosses the line of worth to uplift to Release. I personally prefer to bake the fix on Beta for a while even if the fix is only about a small change, because sometime the thing just happens on someplace you didn't expect. In addition, this crash is avoidable by turning off the pref media.rdd-process.enabled, so it seems not an urgent problem.

Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: