Page loads failing intermittently with black screen
Categories
(GeckoView :: General, defect, P1)
Tracking
(firefox-esr91 unaffected, firefox99 unaffected, firefox100+ disabled, firefox101+ fixed)
Tracking | Status | |
---|---|---|
firefox-esr91 | --- | unaffected |
firefox99 | --- | unaffected |
firefox100 | + | disabled |
firefox101 | + | fixed |
People
(Reporter: csadilek, Assigned: jnicol)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: regression)
Attachments
(3 files)
(deleted),
text/x-phabricator-request
|
Details | |
(deleted),
text/x-phabricator-request
|
diannaS
:
approval-mozilla-release+
|
Details |
(deleted),
text/x-phabricator-request
|
diannaS
:
approval-mozilla-release+
|
Details |
This has been detected by Raptor tests and was also reported by Fenix users.
STR that work reliably for me in Fenix Nightly (101.0a1 #201587625):
- Open Home Screen
- Tap on wikipedia shortuct
Expected:
Page loads.
Actual:
Black (sometimes white) screen.
Needs to be repeated few times to trigger the problem. I've also encountered this just following a link.
This started happening at 2022-04-05 based on bug 1692701. It looks like a regression of 1753700, but still needs to be confirmed.
Fenix ticket: https://github.com/mozilla-mobile/fenix/issues/24782
Reporter | ||
Comment 1•3 years ago
|
||
Agi, can you take a look? Timing-wise the tests started failing right after we landed the patch for bug 1753700 to adjust process priorities, but I haven't confirmed manually. Happy to help. Luckily this is reproducible, see above.
Reporter | ||
Comment 2•3 years ago
|
||
There are no content process crashes reported and locally I don't see anything obvious in my logcat. Raptor tests have some "process died" messages though: https://firefoxci.taskcluster-artifacts.net/IHq16g62T7S_pz-iMsRmUg/0/public/test_info/logcat-ZY322HN8G4.log
Comment 3•3 years ago
|
||
I doubt Bug 1753700 can cause something like that, and the effects of this bug are exactly the same as Bug 1762025. And indeed, I can only reproduce this bug with the GPU process turned on on my Samsung S10e. I can look more on Monday but my guess is that we lose the GPU process at some point and we're not able to restart it for some reason.
FYI S1 is for browser breaking bugs, and I don't believe this one is one of them.
Comment 4•3 years ago
|
||
[Tracking Requested - why for this release]: Didn't mean to remove tracking flags.
Updated•3 years ago
|
Assignee | ||
Comment 5•3 years ago
|
||
Can also reproduce, though it takes quite a lot of closing and opening tabs. And also haven't reproduced with the GPU process disabled, so it probably only occurs with it enabled.
But it doesn't look like the GPU process is crashing at all, nor that we have a unfixable issue rendering like in bug 1762025. Webrender seems to believe that there's nothing to render for some reason, so chooses not to. Rotating the device or clicking the url bar and back again seems to force it to start rendering again.
Reporter | ||
Comment 6•3 years ago
|
||
Interestingly, I cannot reproduce the problem on Beta/100, only Nightly/101, and the GPU process was enabled in 100 already. It seems like 100 is not affected.
Comment 7•3 years ago
|
||
I'm trying to bisect.
Comment 8•3 years ago
|
||
OK so good news is I can reproduce on GVE fairly easily:
- close all tabs
- open new tab (this is to work around a bug in GVE)
- launch
adb shell am start -W -n org.mozilla.geckoview_example/.GeckoViewActivity -a android.intent.action.VIEW -d "https://www.wikipedia.com"
to load wikipedia in the new tab
trying this a few times I get a black screen.
Unfortunately at the moment I only have an Android 12 device available, where I can test 4/19-4/25 only due to Bug 1762424. If somebody has a non-android-12 device handy to bisect that would be great! Otherwise I'll do that tonight.
Assignee | ||
Comment 9•3 years ago
|
||
I have a pretty good idea what is going on here. Webrender is bailing on rendering here because there is no active document. This is because we didn't create the frame here because has_pixels()
returned false. This is because the scene's device rect is zero-sized. The device rect gets set here in SetDisplayList()
for the root WebRenderBridgeParent (ie the browser chrome, rather than the tab). It gets the size from AndroidCompositorWidget::GetClientSize()
, which is returning 0x0
intermittently. This explains why interacting with the page (eg scrolling) doesn't cause us to render again, but changing the device's orientation does (as that will cause the window to resize and a new display list to be sent for the chrome, and the size will be valid by this point).
We've seen something similar before in bug 1749745.
Our widget size handling on android is a little weird. We mostly ignore the size set from the main thread and instead try to calculate the size from the Surface (which is set from the UI thread). This means it is zero initially, then when the compositor gets Resumed we can calculate the size from the Surface.
The reason we get a zero-size is because SyncResumeAndResize()
sometimes occurs before the compositor has been initialized. Or more importantly, before the UI thread has been notified that the compositor has been initialized. This means we are unable to resume the compositor here. The compositor will instead be resumed after it is finished initializing here. This leaves us in a race between the first paint occuring (which will set the webrender display list) and the compositor being resumed (which will set the CompositorWidget size). Since these occur cross-process, from different threads, and using different IPDL protocols, we have no guarantee about their ordering.
I don't think this explains why this only affects 101 and not 100. Nor why it's GPU process specific. I guess it's just timing sensitive and the GPU process can affect the timings quite significantly.
I think we can fix this by allowing the CompositorWidget
's size to be updated from the main thread, as well as from the UI thread like it currently is. If we add an IPDL call to PCompositorWidget
to set the size, this will be guaranteed to occur before the PWebRenderBridge
SetDisplayList
IPDL call (as both protocols have the same manager)
Assignee | ||
Comment 10•3 years ago
|
||
Unfortunately I cannot reproduce in GVE even with these STR, so I cannot bisect either.
This reproduces very reliably if I change this DispatchToUiThread
in to a DelayedDispatch
with a 1000ms wait. Which seems to confirm my theory. And my proposed fix avoids the bug even with the delay.
Assignee | ||
Comment 11•3 years ago
|
||
On Android we have long since calculated the compositor widget size
from the Surface rather than using the main thread widget size. This
is to avoid a trip via the main thread in response to a
ResumeAndResize event on the UI thread.
AndroidCompositorWidget::mClientSize therefore gets calculated when
the compositor is resumed.
However, it is possible in some circumstances for the compositor to
receive a display list prior to it being resumed. In this bug's case,
SyncResumeAndResize is getting called before the UI thread has been
notified that the compositor has been initialized. It therefore cannot
resume the compositor, and we instead resume the compositor on the
subsequent NotifyCompositorCreated call. This starts a race between
the main thread paint and NotifyCompositorCreated being scheduled on
the UI thread then resuming the compositor via
PUiCompositorController. If we receive
WebRenderBridgeParent::RecvSetDisplayList prior to
UiCompositorControllerParent::RecvResumeAndResize, then
AndroidCompositorWidget::mClientSize will be zero. This results in us
setting zero-sized webrender scene rect, leading to webrender exiting
early during rendering, resulting in a black screen.
To fix this, allow the main thread to set the AndroidCompositorWidget
size, in addition to the UI thread being able to set it. We do so by
adding a NotifyClientSizeChanged method to
PlatformCompositorWidgetDelegate, called from
nsWindow::OnSizeChanged. The cross-process implementation of this uses
an IPDL call on PCompositorWidget, which shares a top-level protocol
with PWebRenderBridge, meaning calls are guaranteed to occur in
order. This means a resize event on the main thread is guaranteed to
set the CompositorWidget size before the display list from the
subsequent paint is received.
Updated•3 years ago
|
Comment 12•3 years ago
|
||
Do we need to uplift this fix to Beta 100? It will merge to the Release channel next week, so we have very little time to test a fix in Beta.
Comment 13•3 years ago
|
||
I bisected this and the bisection does point to Bug 1331109 as the root cause.
My test was:
- close tab (GVE has 0 tabs now)
- open tab
until the tab becomes black (instead of white) indicating that nothing is painting on the surface.
13:12.17 INFO: No more integration revisions, bisection finished.
13:12.17 INFO: Last good revision: c526b3d6a76d508e67fa61483669605b4ed89f4b
13:12.17 INFO: First bad revision: 1e862c8fcaceb95c6577bc32be1169821af8faff
13:12.17 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=c526b3d6a76d508e67fa61483669605b4ed89f4b&tochange=1e862c8fcaceb95c6577bc32be1169821af8faff
Comment 14•3 years ago
|
||
Assignee | ||
Comment 15•3 years ago
|
||
(In reply to Chris Peterson [:cpeterson] from comment #12)
Do we need to uplift this fix to Beta 100? It will merge to the Release channel next week, so we have very little time to test a fix in Beta.
Yes, we should uplift this. I'll request as soon as it's landed
Comment 16•3 years ago
|
||
The nominated patch does not graft cleanly to release
widget/android/AndroidCompositorWidget.h
widget/android/CompositorWidgetParent.h
widget/android/InProcessAndroidCompositorWidget.h
Comment 17•3 years ago
|
||
Given that we found Bug 1766127 so late in the beta cycle, let's disable the
GPU process for one more version, the GPU process will ride the trains on 101
instead.
Comment 18•3 years ago
|
||
Comment on attachment 9273704 [details]
Bug 1766127 - Disable GPU process in beta.
Beta/Release Uplift Approval Request
- User impact if declined: Potentially undiscovered bugs that cause pages to go blank intermittently
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Medium
- Why is the change risky/not risky? (and alternatives if risky): This code path has been the default for years and is the default for Android 12 devices, so we don't expect regressions.
- String changes made/needed:
- Is Android affected?: Yes
Comment 19•3 years ago
|
||
Comment on attachment 9273704 [details]
Bug 1766127 - Disable GPU process in beta.
Approved for 100.0rc1
Comment 20•3 years ago
|
||
bugherder uplift |
Updated•3 years ago
|
Assignee | ||
Comment 21•3 years ago
|
||
On Android we have long since calculated the compositor widget size
from the Surface rather than using the main thread widget size. This
is to avoid a trip via the main thread in response to a
ResumeAndResize event on the UI thread.
AndroidCompositorWidget::mClientSize therefore gets calculated when
the compositor is resumed.
However, it is possible in some circumstances for the compositor to
receive a display list prior to it being resumed. In this bug's case,
SyncResumeAndResize is getting called before the UI thread has been
notified that the compositor has been initialized. It therefore cannot
resume the compositor, and we instead resume the compositor on the
subsequent NotifyCompositorCreated call. This starts a race between
the main thread paint and NotifyCompositorCreated being scheduled on
the UI thread then resuming the compositor via
PUiCompositorController. If we receive
WebRenderBridgeParent::RecvSetDisplayList prior to
UiCompositorControllerParent::RecvResumeAndResize, then
AndroidCompositorWidget::mClientSize will be zero. This results in us
setting zero-sized webrender scene rect, leading to webrender exiting
early during rendering, resulting in a black screen.
To fix this, allow the main thread to set the AndroidCompositorWidget
size, in addition to the UI thread being able to set it. We do so by
adding a NotifyClientSizeChanged method to
PlatformCompositorWidgetDelegate, called from
nsWindow::OnSizeChanged. The cross-process implementation of this uses
an IPDL call on PCompositorWidget, which shares a top-level protocol
with PWebRenderBridge, meaning calls are guaranteed to occur in
order. This means a resize event on the main thread is guaranteed to
set the CompositorWidget size before the display list from the
subsequent paint is received.
Assignee | ||
Comment 22•3 years ago
|
||
Comment on attachment 9273794 [details]
Bug 1766127 - Allow AndroidCompositorWidget::mClientSize to be updated from main thread.
Beta/Release Uplift Approval Request
- User impact if declined: Potential black screens when opening a new tab (though less likely now we have disabled the GPU process)
- Is this code covered by automated tests?: Yes
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Just ensures the compositor widget size is correctly set from the main thread. If the main-thread widget size was ever wrong we'd encounter other issues, so this should be safe
- String changes made/needed:
- Is Android affected?: Yes
Assignee | ||
Comment 23•3 years ago
|
||
D144687 is the patch rebased on release. I think we should take it even though the GPU process has now been disabled, as the bug could still occur anyway (it's just a timing issue)
Comment 24•3 years ago
|
||
bugherder |
Updated•3 years ago
|
Comment 25•3 years ago
|
||
Comment on attachment 9273794 [details]
Bug 1766127 - Allow AndroidCompositorWidget::mClientSize to be updated from main thread.
Approved for 100.0rc2
Comment 26•3 years ago
|
||
uplift |
Updated•3 years ago
|
Description
•