Firefox (intermittently) failing WPTs in css/cssom-view/scroll-behavior-main-frame-root.html and scroll-behavior-main-frame-window.html
Categories
(Core :: Panning and Zooming, defect, P3)
Tracking
()
People
(Reporter: twisniewski, Assigned: hiro)
References
(Blocks 1 open bug)
Details
The tests being failed in both of these are Main frame with auto scroll-behavior ; scroll() with smooth behavior
, so I presume they may have the same underlying cause:
Assignee | ||
Comment 1•1 year ago
|
||
The failure happens at the first async scrolling which should have been processed by APZ, but in fact it doesn't since the test expands the root scroll container and at the time the failure test runs, APZ seems to not yet know the expanded size.
Assignee | ||
Comment 2•1 year ago
|
||
This is tough. The comment 1 isn't true. There's a race condition in between a paint in the browser parent process and a paint in the content process.
The race comes from this SetReferentId call in the paint in the browser parent process. When we call the function, it's too late for processing the async scroll update in APZ, the scroll update had been already reached to APZ and without the proper ReferentId it was ignored unfortunately.
I don't have any clear idea to solve the race. Queuing the scroll updates until the proper ReferentId comes if we know for sure it's coming (but how?)? Or we do skip painting in the content process until we got the proper ReferentId? The latter will probably regress first-contentful paint?
CCing Botond.
Assignee | ||
Comment 3•1 year ago
|
||
Note that just making the test pass reliably is to give a chance to the parent process to do a paint, waiting two rAF callbacks just before starting the async scroll test case here would be sufficient.
Assignee | ||
Comment 4•1 year ago
|
||
Kats had already noticed the race in bug 1437295.
Assignee | ||
Comment 5•1 year ago
|
||
There's something which horribly takes some amount of times on the parent process. The time difference between when we call a paint when we got the LayersId of the content and when we call the SetReferentId
in a paint is over 150ms, it's truly slow given that my machine is not so slow (16 cores machine)
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 6•1 year ago
|
||
Okay, I've tracked down what's going on in between the 150ms time span.
When we got the proper ReferentId (i.e. LayersId) in BrowserParent::InitRendering()
, we actually call InvalidateFrame, so a paint will be triggered in the next refresh driver's tick. Indeed it happens.
But unfortunately when we query the LayersId in the paint, we use the outdated mRetainedRemoteFrame
. It was introduced in bug 1677324 to avoid flashing on page navigation.
Just calling RecvDidUnsuppressPainting
instead of InvalidateFrame
in BrowserParent::InitRendering()
would solve this race, but I am almost 100% sure it will regress bug 1677324.
Emilio, what's the right course to fix this test failure and fix potential failure risks caused by the race? I can add awaiting two rAF callbacks into this test, but I have no idea how we can eliminate the potential risks.
Comment 7•1 year ago
|
||
Do you know why we haven't unsuppressed by the time the test runs? The test runs on load, so we should have unsuppressed then.
Is the issue we don't hit this code on time? Or something else?
Do you have a pernosco trace we could poke at or something?
Assignee | ||
Comment 8•1 year ago
|
||
(In reply to Emilio Cobos Álvarez (:emilio) from comment #7)
Do you know why we haven't unsuppressed by the time the test runs? The test runs on load, so we should have unsuppressed then.
Is the issue we don't hit this code on time? Or something else?
Yeah the code actually hit, but it's too late. In the content process after calling UnsuppressAndInvalidate we build the scroll data in question and it's sent to APZ. In the meantime, the UnsuppressAndInvalidate reaches to the parent process via an IPC call, then in the parent process we do display list building with the proper LayersId and it's sent to APZ. But unfortunately when we got the proper LayersId in APZ the scroll data from the content process has been already processed.
Assignee | ||
Comment 9•1 year ago
|
||
I'd like to propose to do a workaround in the tests themselves, and file a followup bug to fix the race because the purpose of the tests, I believe, is testing each scroll-behavior for each scroll operations (e.g. scrollTo) works correctly. Adding the workaround (awaiting two rAF) would not spoil the purpose.
We could also add a new wpt for the race itself. Actually I've written it. A new interesting found during I was writing it is the race happens only if there's a paint in the content after the paint where we sent the scroll update data. If there's no such paint we don't clobber the scroll update data here in APZUpdater::UpdateScrollDataAndTreeState even if the parent process's paint is delayed. And the paint in question in the test is triggered if there are two instant scroll operations before the async one. I don't know why. Though the new wpt I've written doesn't use instant scroll operations, it rather just changes the scroll container's color to make a paint happen.
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 10•1 year ago
|
||
I sent a PR to the upstream; https://github.com/web-platform-tests/wpt/pull/41200
Assignee | ||
Updated•1 year ago
|
Assignee | ||
Comment 11•1 year ago
|
||
The PR has been merged.
Description
•