Closed Bug 1779578 Opened 2 years ago Closed 1 year ago

Crash in [@ mozilla::ipc::IProtocol::ChannelSend | mozilla::a11y::PDocAccessibleChild::SendShowEvent | IPC_Message_Name=PDocAccessible::Msg_ShowEvent]

Categories

(Core :: Disability Access APIs, defect)

defect

Tracking

()

VERIFIED FIXED
117 Branch
Tracking Status
firefox-esr102 --- wontfix
firefox-esr115 --- fix-optional
firefox116 --- wontfix
firefox117 --- wontfix
firefox118 --- fixed

People

(Reporter: manuel, Assigned: Jamie)

References

(Blocks 1 open bug, Regressed 1 open bug)

Details

(Keywords: crash, topcrash, Whiteboard: [ctw-23h2])

Crash Data

Attachments

(2 files)

Crash report: https://crash-stats.mozilla.org/report/index/7dea1601-1203-4eae-b833-984860220714

MOZ_CRASH Reason: MOZ_CRASH(IPC message size is too large)

Top 10 frames of crashing thread:

0 libxul.so mozilla::ipc::IProtocol::ChannelSend /build/firefox/src/firefox-102.0.1/ipc/glue/ProtocolUtils.cpp:487
1 libxul.so mozilla::a11y::PDocAccessibleChild::SendShowEvent s3:gecko-generated-sources-l1:05e7b0735999cb80520a210a9906635a37fa583e4e1f4e967a53e718cdc2415da987df2e56d2f110dcd48bb9981897dfd5456f07d2ce3b91a11378d9eb21a298/ipc/ipdl/PDocAccessibleChild.cpp::254
2 libxul.so mozilla::a11y::DocAccessibleChildBase::MaybeSendShowEvent /build/firefox/src/firefox-102.0.1/accessible/ipc/DocAccessibleChildBase.h:86
3 libxul.so mozilla::a11y::DocAccessibleChildBase::InsertIntoIpcTree /build/firefox/src/firefox-102.0.1/accessible/ipc/DocAccessibleChildBase.cpp:91
4 libxul.so mozilla::a11y::LocalAccessible::HandleAccEvent /build/firefox/src/firefox-102.0.1/accessible/generic/LocalAccessible.cpp:931
5 libxul.so mozilla::a11y::AccessibleWrap::HandleAccEvent /build/firefox/src/firefox-102.0.1/accessible/atk/AccessibleWrap.cpp:1021
6 libxul.so nsEventShell::FireEvent /build/firefox/src/firefox-102.0.1/accessible/base/nsEventShell.cpp:54
7 libxul.so mozilla::a11y::NotificationController::ProcessMutationEvents /build/firefox/src/firefox-102.0.1/accessible/base/NotificationController.cpp:577
8 libxul.so mozilla::a11y::NotificationController::WillRefresh /build/firefox/src/firefox-102.0.1/accessible/base/NotificationController.cpp:896
9 libxul.so nsRefreshDriver::Tick /build/firefox/src/firefox-102.0.1/layout/base/nsRefreshDriver.cpp:2468

STR:

  1. Go to https://crash-stats.mozilla.org/report/index/5db7e2f3-187b-4ff3-ae90-ade790220714
  2. Click on "Raw Data and Minidumps"

Ouch. That's pretty nasty. :(

Given that this is happening without the full cache (just the tree structure), this is even more likely with the full cache. We've discussed ways to paginate the cache. There are a few challenges there:

  1. How do we paginate? Do we paginate up to a certain number of nodes or something like that?
  2. When we paginate, how do we handle that in terms of IPDL calls? The current SendShowEvent depends on having a single root, but we might paginate at any position in the tree.
  3. On the parent side, we might yield to the event loop between handling of incoming IPDL calls, which means an a11y client query can be handled. I don't think we have any way of stopping that. Even if we did, that would make the browser unresponsive for a potentially long period of time.
  4. Because of 3), a client might see an incomplete view of the tree. This is particularly important on Windows where a screen reader might have cached its own representation of the tree. Thus, we need to fire mutation events for anything we add to the tree in subsequent pages. That's going to be tricky, since pagination could occur at any position in the tree. This problem is similar to 2).
Blocks: a11y-ctw

This is nasty, but also extremely rare. Downgrading to s3.

Severity: S2 → S3

(In reply to James Teh [:Jamie] from comment #1)

  1. On the parent side, we might yield to the event loop between handling of incoming IPDL calls, which means an a11y client query can be handled. ...
  2. Because of 3), a client might see an incomplete view of the tree. ... Thus, we need to fire mutation events for anything we add to the tree in subsequent pages.

We could avoid these problems by sending the (sub)tree such that it is detached from its document until the last page is received. Clients would not be able to access the subtree. In the content process, pages could be sent without yielding to the event loop, thus avoiding the tree changing underneath us as we're sending it. The only problem we need to solve then is how to paginate the IPDL calls, but that's a lot simpler when we don't have to worry about events, the tree changing, etc.

Note that this only solves the IPDL message size problem and maybe a bit of parent process jank. It does not make things any faster for a11y clients. That is, a11y clients don't get the first part of the tree any earlier; they still have to wait until the whole tree is received before they can access anything at all.

It looks like the crash signature for this has morphed.
bp-1f84d790-2eb2-4a6e-90ad-40b770230606

Crash Signature: [@ mozilla::ipc::IProtocol::ChannelSend | mozilla::a11y::PDocAccessibleChild::SendShowEvent | IPC_Message_Name=PDocAccessible::Msg_ShowEvent] → [@ mozilla::ipc::IProtocol::ChannelSend | mozilla::a11y::PDocAccessibleChild::SendShowEvent | IPC_Message_Name=PDocAccessible::Msg_ShowEvent] [@ mozilla::ipc::PortLink::SendMessage | IPC_Message_Name=PDocAccessible::Msg_ShowEvent ]
Duplicate of this bug: 1837207
Whiteboard: [ctw-soon]
Whiteboard: [ctw-soon] → [ctw-23h2]

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 content process crashes on beta

:Jamie, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(jteh)
Keywords: topcrash

I wonder why this is spiking now? I would have expected this to spike as soon as we shipped CtW, rather than waiting a few releases.

Severity: S3 → S2
Flags: needinfo?(jteh)
Assignee: nobody → jteh

When we serialize a subtree, we put it into a flat list.
Previously, we included the child count for each Accessible so that we knew how many Accessibels to consume as children when de-serializing.
We also de-serialized recursively.
This made it very difficult to split serialization across IPDL calls, since we would always end up splitting in the middle of some Accessible's children.
Instead, we now no longer include the child count, but we do include the parent id and the index of the child in that parent.
This means that each Accessible can be de-serialized independently and iteratively, making it possible to split wherever we need to.
RemoteAccessible creation has also been separated from attachment of the child to its parent, since we will need this when splitting.

In the content process, we simply split into multiple calls when the number of Accessibles exceeds our maximum.
In the parent process, we defer attaching the root of the new subtree to its parent until the final call.
This is achieved by saving the root during the first call and using that to attach and fire events in the final call.

The maximum is calculated to allow for every Accessible to consume 2 KB in the IPDL message.
Currently, this means we split every 131072 Accessibles.
Of course, we could still exceed this IPDL message size if one or more Accessibles consumed a lot more than this; e.g. many labels longer than 2 KB.
However, this seems unlikely in the real world.
If this turns out to be a problem, we'll need to count the actual size of the serialized data for each Accessible.
For example, we could use AccAttributes::SizeOfExcludingThis, though that isn't exactly the serialized size.
I worry though that such data structure traversal could get expensive at scale.

Unfortunately, the URL in the STR in comment 0 is now dead. However, you can reliably reproduce this crash by loading this page with accessibility active:
https://registry.khronos.org/vulkan/specs/1.3-extensions/html/vkspec.html

Pushed by jteh@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/cc0a76b6a875 part 1: When serializing a11y subtrees, include parent id and index in parent for each Accessible. r=eeejay https://hg.mozilla.org/integration/autoland/rev/91dd0dbb8e73 part 2: Split serialization of a11y subtrees across multiple IPDL calls if we are likely to exceed the IPDL maximum message size. r=eeejay
Regressions: 1846212
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 117 Branch
Flags: qe-verify+

Reproduced with the info from comment 10 on Fx 116.0.2 on Windows 10.
I can no longer crash the browser on Fx 117.0b6 and Fx 118.0a1 (2023-08-10), tested on Windows 10.

Status: RESOLVED → VERIFIED
Flags: qe-verify+

Is this something we should uplift to ESR115? It'll need a rebased patch if so.

Flags: needinfo?(jteh)

Potentially, but this change is not what I'd consider low risk. I'd prefer to let it bake on release for a bit before making that call.

Flags: needinfo?(jteh)
Regressions: 1848991

Backed out from Beta for 117.0b9 for causing bug 1848991 and to allow for more bake time. It remains fixed for 118+.
https://hg.mozilla.org/releases/mozilla-beta/rev/4a0c1a095483

You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: