Closed Bug 1757534 Opened 3 years ago Closed 3 years ago

[ARM64] Crash [@ mozilla::ipc::FatalError | mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::gmp::PGMPParent::OnMessageReceived ]

Categories

(Core :: IPC, defect, P1)

ARM64
Windows 10
defect

Tracking

()

VERIFIED FIXED
99 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox97 --- unaffected
firefox98 blocking verified
firefox99 + verified

People

(Reporter: cbaica, Assigned: mccr8)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(7 files)

Attached file about_support info (deleted) —

Affected versions

  • Fx 98.0
  • Fx99.0a1

Affected platforms

  • Windows 10 ARM

Steps to reproduce

  1. Launch Firefox.
  2. Connect to VPN. (optional if Hulu already works in your region)
  3. Navigate to www.hulu.com and login.
  4. Start any live video.

Expected result

  • Video plays back without any issue.

Actual result

  • Firefox crashes.

Regression range

  • Will come back with a regression range ASAP. This issue didn't occur during the RC 97 validation.

Additional notes

  • Firefox also crashes when accessing netflix.com and logging in. Didn't even reach a video playback. This happened without any connection to a VPN. Same type of crash signature.
  • In Firefox Nightly, the crash signature is different https://crash-stats.mozilla.org/report/index/4d266dba-21b3-4fc5-805d-b3c660220301 , but still an IPC error.
  • The crash in nightly is pointing to this bug 1416560.
  • I've also attached the about:support info, for details on the machine.
  • This was not reproducible on other OSs.
Crash Signature: https://crash-stats.mozilla.org/report/index/e0e81346-3085-4c0a-bda9-d7c4c0220301 → [@ mozilla::ipc::FatalError | mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::gmp::PGMPParent::OnMessageReceived ]
Crash Signature: [@ mozilla::ipc::FatalError | mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::gmp::PGMPParent::OnMessageReceived ] → [@ mozilla::ipc::FatalError | mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::gmp::PGMPParent::OnMessageReceived ] [@ mozilla::ipc::SentinelReadError ]

Jed, this is a new crasher found by QA while doing 98 RC build validatioon, could you help us find an owner rapidly? Thanks!

Flags: needinfo?(jld)

The crash reason from the report in comment 0 is "incorrect sentinel when reading Error deserializing 'ByteBuf'".

Bug 1745511 is in the regression range, and adds a few messages from a GMP process that send a ByteBuf.

Also, when looking at the crash report, we're in the parent process deserializing a FlushFOGData message, so that's also a good indicator.

Regressed by: 1745511
Keywords: crash

I could imagine there being some kind of endianness issue, but we use ByteBufs in a few places, so I'm not sure why this is the only place we see the crash. Maybe this specific message hits some edge case.

FWIW, there's been a steady trickle of these crashes going back at least a month, but I guess we didn't get more than one or two a day, so nobody noticed. bp-ad3c659f-0bb1-4ab4-b636-77f490220201

Are we still using the x64 Widevine CDM for the Win ARM64 builds? If so, might that be relevant here?

Flags: needinfo?(bvandyk)

(In reply to Andrew McCreight [:mccr8] from comment #5)

I could imagine there being some kind of endianness issue, but we use ByteBufs in a few places, so I'm not sure why this is the only place we see the crash. Maybe this specific message hits some edge case.

Nika said the endianness should be the same, so never mind about that.

Looking back over the last 3 months, every one of these crashes is ARM64 Windows.

(In reply to Ryan VanderMeulen [:RyanVM] from comment #7)

Are we still using the x64 Widevine CDM for the Win ARM64 builds? If so, might that be relevant here?

The about:support info in the attachment has this: "media.gmp-widevinecdm.abi": "x86-msvc-x86",

As Nika said, if the GMP process is really still running as 32-bit (and it sounds like it was as of bug 1527811), then there's probably a pointer width issue here. The length of a ByteBuf is sent as a size_t, which of course will vary between 32 bit and 64 bit, so that would be enough to break things.

Assignee: nobody → continuation
Flags: needinfo?(jld)

On ARM64 Windows, the GMP process is running in 32-bit, but
the parent process is running in 64-bit. The FlushFOGData reply
sends a ByteBuf from a GMP process to the parent process. This
is a problem because size_t has different sizes in each process.
This patch addresses that by sending the size as a uint32_t.

WriteBytesZeroCopy already implicitly converts the length
argument, so it seems unlikely that the other users of ByteBuf
depend on this behavior. Additionally, I think that any ByteBuf
large enough to cause issues here would end up crashing due to
hitting the IPC message size limit.

In terms of uplift risk, ByteBuf is used in a few other places, but like I said in the commit message, any place that depend on mLen being large enough that it can't be represented as a uint32_t probably has other issues, because mLen gets passed to WriteBytesZeroCopy, which will convert it (unchecked) to a uint32_t, and also because IPC messages are limited to 256MB, so if mLen is big enough to not be representable in a uint32_t, then we're probably crashing anyways due to hitting the message limit.

Priority: -- → P1

Here's a try push that hopefully has the right kind of build that can be used for testing on ARM64.

Andrew, is that something we could consider uplifting in a RC2? My main concern is if there is a potential side effect on non ARM64 builds.

Flags: needinfo?(continuation)

Yeah, the risk here is that it will change the behavior slightly on other platforms, and also of other IPC messages besides the specific FOG messages that are crashing. However, as far as I'm able to tell, I don't think any existing place can be depending on the length being a 64 bit value, because it would be crashing for two other different reasons. But I could be wrong.

A more targeted fix might be to disable this GMP FOG telemetry on ARM64 Windows, maybe by disabling the GMP FogData and FlushFogData messages on that platform. That would probably require Florian or somebody who is familiar with the GMP telemetry to figure out exactly what needs to be fixed for that (for instance, some tests will probably have to be disabled) because the patch is a bit large. So, that seems more complex, but it would be nice in that the build should be bit-for-bit identical on other platforms.

It does seem like we should do something, as I think without it then on ARM64 Win10 you'll just always crash when trying to use Hulu, Disney+, Netflix, Amazon Video, Direct TV video, etc., judging by the crash reports.

Flags: needinfo?(continuation)
Pushed by amccreight@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bf53c704f431 Send ByteBuf::mLen as a uint32_t. r=ipc-reviewers,nika

(In reply to Andrew McCreight [:mccr8] from comment #15)

A more targeted fix might be to disable this GMP FOG telemetry on ARM64 Windows, maybe by disabling the GMP FogData and FlushFogData messages on that platform. That would probably require Florian or somebody who is familiar with the GMP telemetry to figure out exactly what needs to be fixed for that (for instance, some tests will probably have to be disabled) because the patch is a bit large. So, that seems more complex, but it would be nice in that the build should be bit-for-bit identical on other platforms.

Florian, can you provide such a patch for uplift?

Flags: needinfo?(florian)

(In reply to Andrew McCreight [:mccr8] from comment #15)

A more targeted fix might be to disable this GMP FOG telemetry on ARM64 Windows, maybe by disabling the GMP FogData and FlushFogData messages on that platform. That would probably require Florian or somebody who is familiar with the GMP telemetry to figure out exactly what needs to be fixed for that (for instance, some tests will probably have to be disabled)

I actually don't understand how it's possible that we crash whenever that IPC is sent, but the test exercising this IPC didn't fail. Are we running browser chrome mochitests at all on Windows Arm64?

(In reply to Florian Quèze [:florian] from comment #18)

I actually don't understand how it's possible that we crash whenever that IPC is sent, but the test exercising this IPC didn't fail.

Answering my own question: the test sends a boolean instead of a ByteBuf, so it doesn't hit the crash.

(In reply to Florian Quèze [:florian] from comment #18)

I actually don't understand how it's possible that we crash whenever that IPC is sent, but the test exercising this IPC didn't fail. Are we running browser chrome mochitests at all on Windows Arm64?

Looking at mozilla-central, in terms of Windows AArch64, it looks like we only run test-windows10-aarch64-qr/opt-mochitest-media-fis-e10s, which looks like various tests under dom/media. There don't appear to be any on mozilla-inbound. It looks like the test added in bug 1745511 is a browser chrome mochitest, and is over with Glean tests, so I guess we're not running it. (I also don't know whether the tests that we do run have a 32-bit or a 64-bit GMP process, and the former is needed to catch this issue.) If telemetry is disabled during normal tests, then we also wouldn't get incidental testing of the telemetry in the media mochitests.

I submitted on phabricator a best-effort patch to stop the parent process from requesting FOG data from GMP processes and stop the GMP child processes from automatically sending FOG data when they shutdown. That should be the minimum amount of ifdefs required to avoid crashing. I verified locally using #if 0 that this patch makes the test fail, and I also disabled the test, although I can't figure out how to run that test on try. Are we still running browser-chrome mochitests in CI for Windows Arm64? I can't see them either on the mozilla-release treeherder: https://treeherder.mozilla.org/jobs?repo=mozilla-release&searchStr=aarch%2Cwin

Anyway, I pushed my patch to try to verify that it builds: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=9d39e31222152b4904cf434e9899ad54e4078698

I'm on PTO until the end of the week, so please take over from here.

Flags: needinfo?(florian)

(In reply to Florian Quèze [:florian] from comment #22)

Anyway, I pushed my patch to try to verify that it builds: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=9d39e31222152b4904cf434e9899ad54e4078698

Hopefully these builds can also be used by QA to verify that the patch is enough to no longer encounter the crash.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 99 Branch

QA is testing the try build with Florian's patch, can you check it as well and proceed with an uplift request to mozilla-release? If we feel confident that this may fix or mitigate the crashes, I will probably build a RC2 with this patch. Thanks!

Flags: needinfo?(continuation)

Hello,

So this is what the investigation of the fix uncovered:

** Trybuild fix ** (provided by Florian)

  • Hulu no longer crashes and works without any issues
  • Netflix video playback is impossible. Regardless of how much time passes or how many restarts are performed the following error is received: F7121-1331, which seems to point to an HTML error. For more details please see the attached screenshot related to the try-build.

** Nightly fix **

  • Hulu no longer crashes and works without any issues
  • Netflix video playback receives an HTML 5 error. On one win 10 ARM machine, after a while and some restarts, video playback became possible, without any apparent cause. On another win 10 ARM machine, video playback never became possible and the same error as on the try-build was constantly received. For more details please see the attached screenshot related to nightlybuild.
Attached image trybuild-netflix_error (deleted) —
Attached image nightlybuild-error (deleted) —

Comment on attachment 9265956 [details]
Bug 1757534 - disable FOG IPCs in GMP processes on Windows Arm64, r=chutten.

Beta/Release Uplift Approval Request

  • User impact if declined: crashes when playing videos on many sites, on ARM64 Windows
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: Yes
  • If yes, steps to reproduce: given in comment 0
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch disables some telemetry, and only on Windows ARM64.
  • String changes made/needed: none
Flags: needinfo?(continuation)
Attachment #9265956 - Flags: approval-mozilla-release?
Flags: qe-verify+

Comment on attachment 9265956 [details]
Bug 1757534 - disable FOG IPCs in GMP processes on Windows Arm64, r=chutten.

Approved for a 98 RC2, thanks!

Attachment #9265956 - Flags: approval-mozilla-release? → approval-mozilla-release+

I filed bug 1757763 about getting testing in automation from GMP telemetry on Aarch64 Win64. I also previously filed bug 1757605 about adding SentinelRead to the prefix list, in the hope that splitting signatures out for these deserialization errors into separate messages will give our automated crash detection more of a chance to notice regressions.

QA Whiteboard: [qa-triaged]

The issue is verified fixed in Fx98.0-build2 RC and latest nightly. Hulu and Netflix no longer crash and play back without any issues.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
Flags: needinfo?(bvandyk)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: