Closed Bug 1562616 Opened 5 years ago Closed 5 years ago

Crash in [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::ReadIPDLParam<T>]

Categories

(Core :: Graphics, defect, P1)

Unspecified
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla70
Tracking Status
firefox-esr60 --- unaffected
firefox-esr68 --- unaffected
firefox67 --- unaffected
firefox68 --- unaffected
firefox69 blocking verified
firefox70 + verified

People

(Reporter: marcia, Assigned: mattwoodrow)

References

Details

(Keywords: crash, regression, topcrash, Whiteboard: [rca - Coding Error])

Crash Data

Attachments

(1 file, 1 obsolete file)

This bug is for crash report bp-8688e7fb-c841-4264-99c8-ea8e60190629.

Seen while reviewing nightly crashes. These crashes started in 20190628161248: https://bit.ly/302JP8m. Many of the URLS are youtube.com, but there are some others.

Possible regression range based on build id: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=9ee669c657c79b26a6ec0df246cc5bfe2b1d9a9a&tochange=8537d24d8aa8c0c2189c308d5835def2feec313c

Bug 1547218 touched some code in this area when it landed, ni on nika

Top 10 frames of crashing thread:

0 XUL mozilla::ipc::FatalError ipc/glue/ProtocolUtils.cpp:259
1 XUL mozilla::ipc::IProtocol::HandleFatalError const ipc/glue/ProtocolUtils.cpp:495
2 XUL bool mozilla::ipc::ReadIPDLParam<mozilla::layers::TimedTexture> ipc/glue/IPDLParamTraits.h:61
3 XUL bool mozilla::ipc::ReadIPDLParam<nsTArray<mozilla::layers::TimedTexture> > ipc/glue/IPDLParamTraits.h:61
4 XUL bool mozilla::ipc::ReadIPDLParam<mozilla::layers::OpUseTexture> ipc/glue/IPDLParamTraits.h:61
5 XUL bool mozilla::ipc::ReadIPDLParam<mozilla::layers::CompositableOperationDetail> ipc/glue/IPDLParamTraits.h:61
6 XUL mozilla::ipc::IPDLParamTraits<mozilla::layers::CompositableOperation>::Read ipc/ipdl/LayersMessages.cpp:10713
7 XUL mozilla::ipc::IPDLParamTraits<nsTArray<mozilla::layers::CompositableOperation> >::Read ipc/glue/IPDLParamTraits.h:182
8 XUL mozilla::layers::PImageBridgeParent::OnMessageReceived ipc/ipdl/PImageBridgeParent.cpp:349
9 XUL mozilla::ipc::MessageChannel::DispatchMessage ipc/glue/MessageChannel.cpp:2158

Flags: needinfo?(nika)

The error message (IPCFatalErrorMsg in the metadata) is Error deserializing 'textureParent' (PTexture) member of 'TimedTexture', so this is a problem with deserializing an actor reference.

I think this is probably showing up as being blamed by that code as I changed the names of some of the stack stack frames used there. I think this is more likely to be caused by some graphics code failing to manage actor lifecycles. Perhaps in webrender (no clue tbh)

ni? :jrmuizel because they might know more?

Flags: needinfo?(nika) → needinfo?(jmuizelaar)

Sotaro, do you have any guesses as to what this could be?

Flags: needinfo?(jmuizelaar) → needinfo?(sotaro.ikeda.g)

(In reply to :Nika Layzell (ni? for response) from comment #2)

Perhaps in webrender (no clue tbh)

All crash logs that I check was "WR-", then WR seems not related to the crash.

(In reply to :Nika Layzell (ni? for response) from comment #2)

I think this is probably showing up as being blamed by that code as I changed the names of some of the stack stack frames used there. I think this is more likely to be caused by some graphics code failing to manage actor lifecycles.

I checked the related graphics code. It seemed not to have a problem of how to send ipdl message.

When the error happened, crash report had the following error log.

  • Error deserializing 'textureParent' (PTexture) member of 'TimedTexture'

'TimedTexture' is sent with ipdl only by OpUseTexture. On ImageBridgeChild, it is created only by ImageBridgeChild::UseTextures().
https://searchfox.org/mozilla-central/source/gfx/layers/ipc/LayersMessages.ipdlh#439
https://searchfox.org/mozilla-central/source/gfx/layers/ipc/ImageBridgeChild.cpp#127

ImageBridgeChild::UseTextures() is called under ImageBridgeChild::UpdateImageClient(). The UpdateImageClient() manages ipc transaction.
https://searchfox.org/mozilla-central/source/gfx/layers/ipc/ImageBridgeChild.cpp#304

Function call sequence is like the following.

ImageContainer::SetCurrentImages()
->// dispatch to ImageBridge thread.
->ImageBridgeChild::UpdateImageClient()
-->ImageBridgeChild::BeginTransaction()
-->ImageClientSingle::UpdateImage()
--->ImageBridgeChild::UseTextures()
---->// Add OpUseTexture message
-->ImageBridgeChild::EndTransaction()
--->PImageBridgeChild::SendUpdate()

During the transaction, It checks if PTextureClient is valid. Therefore PTexture should be valid when ipdl message was sent by the SendUpdate().

In crash reports on Windows, there are several types of IPC error messages like the following. All of them include array. From it, I wonder if array in IPDL message might be related to the problem.

  • Error deserializing 'securityPolicies' (ContentSecurityPolicy[]) member of 'ContentPrincipalInfo'
  • Error deserializing 'originNoSuffix' (nsCString) member of 'ContentPrincipalInfo'
  • Error deserializing 'attrs' (OriginAttributes) member of 'ContentPrincipalInfo'
  • Error deserializing 'textureParent' (PTexture) member of 'TimedTexture'
  • Error deserializing variant TArrayOfTransformFunction of union Animatable
  • Error deserializing 'displayListLog' (nsCString) member of 'CommonLayerAttributes'
  • Error deserializing 'scrollMetadata' (ScrollMetadata[]) member of 'CommonLayerAttributes'
  • Error deserializing 'ancestorMaskLayers' (LayerHandle[]) member of 'CommonLayerAttributes'
Flags: needinfo?(sotaro.ikeda.g)

The “crash signature” here just means that something failed to be deserialized. It's not surprising that there are a number of different failures grouped together under it; that doesn't imply anything one way or the other about whether they're actually related.

I notice that the vast majority (93%) of reports over the past week are for Error deserializing 'textureParent' (PTexture) member of 'TimedTexture'; the other error messages are much less common.

Jed, does an error deserializing an actor (PTexture) suggest that we've got a race condition where we're sending a message containing a reference to an actor that has already been deleted?

Bug 1561178 is in the regression range, which did touch some graphics IPC code, though it shouldn't affect PImageBridge.

We have ImageBridgeChild::DestroyInTransaction which should detect the case where we want to destroy a PTexture while in the process of accumulating a transaction, and puts the delete request into the transaction (and processes it on the parent side after the transaction).

It's possible that there's a case where we miss that, but I don't see it.

One possible place of failure is IPDLParamTraits<mozilla::layers::PTextureParent*>::Read().
https://searchfox.org/mozilla-central/source/__GENERATED__/ipc/ipdl/PTextureParent.cpp#201

Another possible place of failure is IPDLParamTraits<nsTArray<T>>::Read()
https://searchfox.org/mozilla-central/source/ipc/glue/IPDLParamTraits.h#153

When ImageBridgeParent::AllocPTextureParent() failed because of TextureHost::CreateIPDLActor() failure. I also saw the following error message.

IPDL protocol error: Error deserializing 'textureParent' (PTexture) member of 'TimedTexture'

https://searchfox.org/mozilla-central/source/gfx/layers/ipc/ImageBridgeParent.cpp#291

Depends on: 1564667

[Tracking Requested - why for this release]:
this is a topcrasher in early results from 69.0b3 - together with the signature from bug 1561976 it's accounting for over a third of all browser crashes on beta currently.

as mentioned at https://bugzilla.mozilla.org/show_bug.cgi?id=1561976#c2 too, timing-wise it seems possible that bug 1561178 has regressed this.

Keywords: topcrash
Crash Signature: [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::ReadIPDLParam<T>] → [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::ReadIPDLParam<T>] [@ mozilla::ipc::FatalError | .str.161.llvm.970794318124662302]

ni'ing Matt to see if there are any leads, as this is a beta topcrasher.

Flags: needinfo?(matt.woodrow)
Component: IPC → Graphics

There is a patch here: 1564667 for more error logging so we can determine what is actually going on

If Sotaro's theory is right (and I suspect it is), this could be because GPUVideoTextureHost::CreateFromDescriptor is fallible.

We should probably unconditionally create the actor, and move the LookupTexture call into GPUVideoTextureHost::Lock.

Flags: needinfo?(matt.woodrow)
Priority: -- → P1
Assignee: nobody → matt.woodrow

Assuming this and bug 1561976 have the same root cause, this is blocking wider rollout of Fx69 to the Beta population.

Crash Signature: [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::ReadIPDLParam<T>] [@ mozilla::ipc::FatalError | .str.161.llvm.970794318124662302] → [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::ReadIPDLParam<T>] [@ mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::ipc::IPDLParamTraits<T>::Read] [@ mozilla::ipc::FatalError | .st…

Comment on attachment 9077526 [details]
Bug 1562616 - Don't fail to create a GPUVideoTextureHost if the dependent texture isn't available, since that crashes the compositor. r?sotaro

Beta/Release Uplift Approval Request

  • User impact if declined: Race condition that causes browser (parent process) crashes.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Should be very low risk since it changes the failing condition from causing an IPDL assert to instead just fail to render the racy video frame.

The other patch tries to fix the race condition, but we don't need to rush to uplift that.

  • String changes made/needed:
Attachment #9077526 - Flags: approval-mozilla-beta?

Filed bug 1565405 to make this sort of issue easier to diagnose in the future.

Pushed by mwoodrow@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/805e1d6269d9 Don't fail to create a GPUVideoTextureHost if the dependent texture isn't available, since that crashes the compositor. r=sotaro

Comment on attachment 9077526 [details]
Bug 1562616 - Don't fail to create a GPUVideoTextureHost if the dependent texture isn't available, since that crashes the compositor. r?sotaro

Approved for 69.0b4 to hopefully blunt this topcrash.

Attachment #9077526 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Flags: qe-verify+
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
QA Whiteboard: [qa-triaged]

Matt, maybe it'd be better to move the other patch to a new bug at this point to make our lives easier for tracking?

Flags: needinfo?(matt.woodrow)

Hi, I managed to reproduce this issue in Firefox Beta 69.0b3 on a Mac OSX 10.14.5 using multiple tabs with Office, Powerpoint and Excel files as well as a few Youtube videos and Netflix, after Refreshing some of those pages Firefox would crash with the same crash signature as this issue.

I was unable to reproduce this issue on Firefox Beta 69.0b4 using the same profile, a few more tabs open like Reddit and other common websites while the videos from Youtube and Netflix were playing in the background. I do believe this issue is Verified as Fixed but we will not update any flags until we are certain of the Fix. We will keep our eyes on the crash reports for the next week or so and make sure that this crash no longer occurs in newer versions of Firefox.

Status: RESOLVED → VERIFIED

Comment on attachment 9077525 [details]
Bug 1562616 - Wait for a response from the VideoBridge to ensure textures are available before telling the content process about them. r?sotaro

Revision D37781 was moved to bug 1566956. Setting attachment 9077525 [details] to obsolete.

Attachment #9077525 - Attachment is obsolete: true
Flags: needinfo?(matt.woodrow)
Regressions: 1567737
No longer regressions: 1567737
Regressions: 1568419

Based on Crash stats and previous comments, remove the Qe-verify+ flag.

QA Whiteboard: [qa-triaged]
Flags: qe-verify+

This bug has been identified as part of a pilot on determining root causes of blocking and dot release drivers.

It needs a root-cause set for it. Please see the list at https://docs.google.com/document/d/1FFEGsmoU8T0N8R9kk-MXWptOPtXXXRRIe4vQo3_HgMw/.

Add the root cause as a whiteboard tag in the form [rca - <cause> ] and remove the rca-needed keyword.

If you have questions, please contact :tmaity.

Keywords: rca-needed

Propagating an error within an IPDL/IPC message handler causes a deliberate process crash, as that's the desired behaviour if we receive corrupt input.

The issue here is that a runtime error (not related to bad IPC input) was propagating, causing us to crash in situations that we'd prefer to handle silently.

Going with 'Coding Error' for now, though it could be argued to be an architecture issue that it's too easy to trigger the deliberate crash code.

Keywords: rca-needed
Whiteboard: [rca - Coding Error]
Regressions: 1617096
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: