Closed Bug 1841305 Opened 1 year ago Closed 1 year ago

Firefox (intermittently) failing WPTs in css/cssom-view/scroll-behavior-main-frame-root.html and scroll-behavior-main-frame-window.html

Categories

(Core :: Panning and Zooming, defect, P3)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: twisniewski, Assigned: hiro)

References

(Blocks 1 open bug)

Details

The tests being failed in both of these are Main frame with auto scroll-behavior ; scroll() with smooth behavior, so I presume they may have the same underlying cause:

The failure happens at the first async scrolling which should have been processed by APZ, but in fact it doesn't since the test expands the root scroll container and at the time the failure test runs, APZ seems to not yet know the expanded size.

Severity: -- → S3
Component: Layout → Panning and Zooming
Priority: -- → P3

This is tough. The comment 1 isn't true. There's a race condition in between a paint in the browser parent process and a paint in the content process.

The race comes from this SetReferentId call in the paint in the browser parent process. When we call the function, it's too late for processing the async scroll update in APZ, the scroll update had been already reached to APZ and without the proper ReferentId it was ignored unfortunately.

I don't have any clear idea to solve the race. Queuing the scroll updates until the proper ReferentId comes if we know for sure it's coming (but how?)? Or we do skip painting in the content process until we got the proper ReferentId? The latter will probably regress first-contentful paint?

CCing Botond.

Note that just making the test pass reliably is to give a chance to the parent process to do a paint, waiting two rAF callbacks just before starting the async scroll test case here would be sufficient.

Kats had already noticed the race in bug 1437295.

There's something which horribly takes some amount of times on the parent process. The time difference between when we call a paint when we got the LayersId of the content and when we call the SetReferentId in a paint is over 150ms, it's truly slow given that my machine is not so slow (16 cores machine)

Summary: Firefox failing WPTs in css/cssom-view/scroll-behavior-main-frame-root.html and scroll-behavior-main-frame-window.html → Firefox (intermittently) failing WPTs in css/cssom-view/scroll-behavior-main-frame-root.html and scroll-behavior-main-frame-window.html

Okay, I've tracked down what's going on in between the 150ms time span.

When we got the proper ReferentId (i.e. LayersId) in BrowserParent::InitRendering(), we actually call InvalidateFrame, so a paint will be triggered in the next refresh driver's tick. Indeed it happens.

But unfortunately when we query the LayersId in the paint, we use the outdated mRetainedRemoteFrame. It was introduced in bug 1677324 to avoid flashing on page navigation.

Just calling RecvDidUnsuppressPainting instead of InvalidateFrame in BrowserParent::InitRendering() would solve this race, but I am almost 100% sure it will regress bug 1677324.

Emilio, what's the right course to fix this test failure and fix potential failure risks caused by the race? I can add awaiting two rAF callbacks into this test, but I have no idea how we can eliminate the potential risks.

Flags: needinfo?(emilio)

Do you know why we haven't unsuppressed by the time the test runs? The test runs on load, so we should have unsuppressed then.

Is the issue we don't hit this code on time? Or something else?

Do you have a pernosco trace we could poke at or something?

Flags: needinfo?(emilio) → needinfo?(hikezoe.birchill)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #7)

Do you know why we haven't unsuppressed by the time the test runs? The test runs on load, so we should have unsuppressed then.

Is the issue we don't hit this code on time? Or something else?

Yeah the code actually hit, but it's too late. In the content process after calling UnsuppressAndInvalidate we build the scroll data in question and it's sent to APZ. In the meantime, the UnsuppressAndInvalidate reaches to the parent process via an IPC call, then in the parent process we do display list building with the proper LayersId and it's sent to APZ. But unfortunately when we got the proper LayersId in APZ the scroll data from the content process has been already processed.

Flags: needinfo?(hikezoe.birchill)

I'd like to propose to do a workaround in the tests themselves, and file a followup bug to fix the race because the purpose of the tests, I believe, is testing each scroll-behavior for each scroll operations (e.g. scrollTo) works correctly. Adding the workaround (awaiting two rAF) would not spoil the purpose.

We could also add a new wpt for the race itself. Actually I've written it. A new interesting found during I was writing it is the race happens only if there's a paint in the content after the paint where we sent the scroll update data. If there's no such paint we don't clobber the scroll update data here in APZUpdater::UpdateScrollDataAndTreeState even if the parent process's paint is delayed. And the paint in question in the test is triggered if there are two instant scroll operations before the async one. I don't know why. Though the new wpt I've written doesn't use instant scroll operations, it rather just changes the scroll container's color to make a paint happen.

The PR has been merged.

Assignee: nobody → hikezoe.birchill
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.