[ARM64] Crash [@ mozilla::ipc::FatalError | mozilla::ipc::FatalError | mozilla::ipc::IProtocol::HandleFatalError | mozilla::gmp::PGMPParent::OnMessageReceived ]
Categories
(Core :: IPC, defect, P1)
Tracking
()
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox97 | --- | unaffected |
firefox98 | blocking | verified |
firefox99 | + | verified |
People
(Reporter: cbaica, Assigned: mccr8)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(7 files)
Affected versions
- Fx 98.0
- Fx99.0a1
Affected platforms
- Windows 10 ARM
Steps to reproduce
- Launch Firefox.
- Connect to VPN. (optional if Hulu already works in your region)
- Navigate to www.hulu.com and login.
- Start any live video.
Expected result
- Video plays back without any issue.
Actual result
- Firefox crashes.
Regression range
- Will come back with a regression range ASAP. This issue didn't occur during the RC 97 validation.
Additional notes
- Firefox also crashes when accessing netflix.com and logging in. Didn't even reach a video playback. This happened without any connection to a VPN. Same type of crash signature.
- In Firefox Nightly, the crash signature is different https://crash-stats.mozilla.org/report/index/4d266dba-21b3-4fc5-805d-b3c660220301 , but still an IPC error.
- The crash in nightly is pointing to this bug 1416560.
- I've also attached the about:support info, for details on the machine.
- This was not reproducible on other OSs.
Updated•3 years ago
|
Updated•3 years ago
|
Updated•3 years ago
|
Comment 1•3 years ago
|
||
Jed, this is a new crasher found by QA while doing 98 RC build validatioon, could you help us find an owner rapidly? Thanks!
Reporter | ||
Comment 2•3 years ago
|
||
Best I could narrow it down to is:
-
First bad: 2022-19-01 - https://hg.mozilla.org/mozilla-central/rev/cc33400f0ff80f0eada6c3aa637f37d247a3ff46
-
Last good: 2022-18-01 - https://hg.mozilla.org/mozilla-central/rev/89aa2c8696b7b10a4e71f95d4a468171b92bb828
-
Unfortunately nothing stands out to me from the specified regression range.
Updated•3 years ago
|
Assignee | ||
Comment 3•3 years ago
|
||
The crash reason from the report in comment 0 is "incorrect sentinel when reading Error deserializing 'ByteBuf'".
Assignee | ||
Comment 4•3 years ago
|
||
Bug 1745511 is in the regression range, and adds a few messages from a GMP process that send a ByteBuf.
Also, when looking at the crash report, we're in the parent process deserializing a FlushFOGData message, so that's also a good indicator.
Assignee | ||
Comment 5•3 years ago
|
||
I could imagine there being some kind of endianness issue, but we use ByteBufs in a few places, so I'm not sure why this is the only place we see the crash. Maybe this specific message hits some edge case.
Assignee | ||
Comment 6•3 years ago
|
||
FWIW, there's been a steady trickle of these crashes going back at least a month, but I guess we didn't get more than one or two a day, so nobody noticed. bp-ad3c659f-0bb1-4ab4-b636-77f490220201
Comment 7•3 years ago
|
||
Are we still using the x64 Widevine CDM for the Win ARM64 builds? If so, might that be relevant here?
Assignee | ||
Comment 8•3 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #5)
I could imagine there being some kind of endianness issue, but we use ByteBufs in a few places, so I'm not sure why this is the only place we see the crash. Maybe this specific message hits some edge case.
Nika said the endianness should be the same, so never mind about that.
Looking back over the last 3 months, every one of these crashes is ARM64 Windows.
Assignee | ||
Comment 9•3 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #7)
Are we still using the x64 Widevine CDM for the Win ARM64 builds? If so, might that be relevant here?
The about:support info in the attachment has this: "media.gmp-widevinecdm.abi": "x86-msvc-x86",
Assignee | ||
Comment 10•3 years ago
|
||
As Nika said, if the GMP process is really still running as 32-bit (and it sounds like it was as of bug 1527811), then there's probably a pointer width issue here. The length of a ByteBuf is sent as a size_t, which of course will vary between 32 bit and 64 bit, so that would be enough to break things.
Assignee | ||
Updated•3 years ago
|
Updated•3 years ago
|
Assignee | ||
Comment 11•3 years ago
|
||
On ARM64 Windows, the GMP process is running in 32-bit, but
the parent process is running in 64-bit. The FlushFOGData reply
sends a ByteBuf from a GMP process to the parent process. This
is a problem because size_t has different sizes in each process.
This patch addresses that by sending the size as a uint32_t.
WriteBytesZeroCopy already implicitly converts the length
argument, so it seems unlikely that the other users of ByteBuf
depend on this behavior. Additionally, I think that any ByteBuf
large enough to cause issues here would end up crashing due to
hitting the IPC message size limit.
Assignee | ||
Comment 12•3 years ago
|
||
In terms of uplift risk, ByteBuf is used in a few other places, but like I said in the commit message, any place that depend on mLen being large enough that it can't be represented as a uint32_t probably has other issues, because mLen gets passed to WriteBytesZeroCopy, which will convert it (unchecked) to a uint32_t, and also because IPC messages are limited to 256MB, so if mLen is big enough to not be representable in a uint32_t, then we're probably crashing anyways due to hitting the message limit.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 13•3 years ago
|
||
Here's a try push that hopefully has the right kind of build that can be used for testing on ARM64.
Comment 14•3 years ago
|
||
Andrew, is that something we could consider uplifting in a RC2? My main concern is if there is a potential side effect on non ARM64 builds.
Assignee | ||
Comment 15•3 years ago
|
||
Yeah, the risk here is that it will change the behavior slightly on other platforms, and also of other IPC messages besides the specific FOG messages that are crashing. However, as far as I'm able to tell, I don't think any existing place can be depending on the length being a 64 bit value, because it would be crashing for two other different reasons. But I could be wrong.
A more targeted fix might be to disable this GMP FOG telemetry on ARM64 Windows, maybe by disabling the GMP FogData and FlushFogData messages on that platform. That would probably require Florian or somebody who is familiar with the GMP telemetry to figure out exactly what needs to be fixed for that (for instance, some tests will probably have to be disabled) because the patch is a bit large. So, that seems more complex, but it would be nice in that the build should be bit-for-bit identical on other platforms.
It does seem like we should do something, as I think without it then on ARM64 Win10 you'll just always crash when trying to use Hulu, Disney+, Netflix, Amazon Video, Direct TV video, etc., judging by the crash reports.
Comment 16•3 years ago
|
||
Comment 17•3 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #15)
A more targeted fix might be to disable this GMP FOG telemetry on ARM64 Windows, maybe by disabling the GMP FogData and FlushFogData messages on that platform. That would probably require Florian or somebody who is familiar with the GMP telemetry to figure out exactly what needs to be fixed for that (for instance, some tests will probably have to be disabled) because the patch is a bit large. So, that seems more complex, but it would be nice in that the build should be bit-for-bit identical on other platforms.
Florian, can you provide such a patch for uplift?
Comment 18•3 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #15)
A more targeted fix might be to disable this GMP FOG telemetry on ARM64 Windows, maybe by disabling the GMP FogData and FlushFogData messages on that platform. That would probably require Florian or somebody who is familiar with the GMP telemetry to figure out exactly what needs to be fixed for that (for instance, some tests will probably have to be disabled)
I actually don't understand how it's possible that we crash whenever that IPC is sent, but the test exercising this IPC didn't fail. Are we running browser chrome mochitests at all on Windows Arm64?
Comment 19•3 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #18)
I actually don't understand how it's possible that we crash whenever that IPC is sent, but the test exercising this IPC didn't fail.
Answering my own question: the test sends a boolean instead of a ByteBuf, so it doesn't hit the crash.
Assignee | ||
Comment 20•3 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #18)
I actually don't understand how it's possible that we crash whenever that IPC is sent, but the test exercising this IPC didn't fail. Are we running browser chrome mochitests at all on Windows Arm64?
Looking at mozilla-central, in terms of Windows AArch64, it looks like we only run test-windows10-aarch64-qr/opt-mochitest-media-fis-e10s, which looks like various tests under dom/media. There don't appear to be any on mozilla-inbound. It looks like the test added in bug 1745511 is a browser chrome mochitest, and is over with Glean tests, so I guess we're not running it. (I also don't know whether the tests that we do run have a 32-bit or a 64-bit GMP process, and the former is needed to catch this issue.) If telemetry is disabled during normal tests, then we also wouldn't get incidental testing of the telemetry in the media mochitests.
Comment 21•3 years ago
|
||
Comment 22•3 years ago
|
||
I submitted on phabricator a best-effort patch to stop the parent process from requesting FOG data from GMP processes and stop the GMP child processes from automatically sending FOG data when they shutdown. That should be the minimum amount of ifdefs required to avoid crashing. I verified locally using #if 0
that this patch makes the test fail, and I also disabled the test, although I can't figure out how to run that test on try. Are we still running browser-chrome mochitests in CI for Windows Arm64? I can't see them either on the mozilla-release treeherder: https://treeherder.mozilla.org/jobs?repo=mozilla-release&searchStr=aarch%2Cwin
Anyway, I pushed my patch to try to verify that it builds: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=9d39e31222152b4904cf434e9899ad54e4078698
I'm on PTO until the end of the week, so please take over from here.
Comment 23•3 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #22)
Anyway, I pushed my patch to try to verify that it builds: https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=9d39e31222152b4904cf434e9899ad54e4078698
Hopefully these builds can also be used by QA to verify that the patch is enough to no longer encounter the crash.
Comment 24•3 years ago
|
||
bugherder |
Comment 25•3 years ago
|
||
QA is testing the try build with Florian's patch, can you check it as well and proceed with an uplift request to mozilla-release? If we feel confident that this may fix or mitigate the crashes, I will probably build a RC2 with this patch. Thanks!
Reporter | ||
Comment 26•3 years ago
|
||
Hello,
So this is what the investigation of the fix uncovered:
** Trybuild fix ** (provided by Florian)
- Hulu no longer crashes and works without any issues
- Netflix video playback is impossible. Regardless of how much time passes or how many restarts are performed the following error is received: F7121-1331, which seems to point to an HTML error. For more details please see the attached screenshot related to the try-build.
** Nightly fix **
- Hulu no longer crashes and works without any issues
- Netflix video playback receives an HTML 5 error. On one win 10 ARM machine, after a while and some restarts, video playback became possible, without any apparent cause. On another win 10 ARM machine, video playback never became possible and the same error as on the try-build was constantly received. For more details please see the attached screenshot related to nightlybuild.
Reporter | ||
Comment 27•3 years ago
|
||
Reporter | ||
Comment 28•3 years ago
|
||
Reporter | ||
Comment 29•3 years ago
|
||
Reporter | ||
Comment 30•3 years ago
|
||
Assignee | ||
Comment 31•3 years ago
|
||
Comment on attachment 9265956 [details]
Bug 1757534 - disable FOG IPCs in GMP processes on Windows Arm64, r=chutten.
Beta/Release Uplift Approval Request
- User impact if declined: crashes when playing videos on many sites, on ARM64 Windows
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce: given in comment 0
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch disables some telemetry, and only on Windows ARM64.
- String changes made/needed: none
Assignee | ||
Updated•3 years ago
|
Comment 32•3 years ago
|
||
Comment on attachment 9265956 [details]
Bug 1757534 - disable FOG IPCs in GMP processes on Windows Arm64, r=chutten.
Approved for a 98 RC2, thanks!
Comment 33•3 years ago
|
||
bugherder uplift |
Updated•3 years ago
|
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 34•3 years ago
|
||
I filed bug 1757763 about getting testing in automation from GMP telemetry on Aarch64 Win64. I also previously filed bug 1757605 about adding SentinelRead to the prefix list, in the hope that splitting signatures out for these deserialization errors into separate messages will give our automated crash detection more of a chance to notice regressions.
Updated•3 years ago
|
Reporter | ||
Comment 35•3 years ago
|
||
The issue is verified fixed in Fx98.0-build2 RC and latest nightly. Hulu and Netflix no longer crash and play back without any issues.
Reporter | ||
Updated•3 years ago
|
Updated•3 years ago
|
Description
•