Closed Bug 1571758 Opened 5 years ago Closed 3 years ago

(un)smooth scroll bug

Categories

(Core :: Panning and Zooming, defect, P3)

defect

Tracking

()

VERIFIED FIXED
98 Branch
Tracking Status
firefox98 --- verified

People

(Reporter: 1.1.1998, Assigned: hiro)

References

(Blocks 3 open bugs, Regressed 1 open bug)

Details

Attachments

(11 files, 5 obsolete files)

(deleted), text/html
Details
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
(deleted), text/x-phabricator-request
Details
Attached file bug.html (deleted) —

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.143 YaBrowser/19.7.2.455 Yowser/2.5 Safari/537.36

Steps to reproduce:

i scroll content on page like(attached bug.html)

Actual results:

element with position absolute jumping on new position.

Expected results:

element with position absolute moving synchronously with scroll without jumping.

Component: Untriaged → DOM: CSS Object Model
Product: Firefox → Core

I also see flickering (though less) on other browsers. In general I don't think you're guaranteed getting async scroll events before paint, though Botond can confirm.

Why not using position: sticky for this?

Component: DOM: CSS Object Model → Layout: Scrolling and Overflow

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

In general I don't think you're guaranteed getting async scroll events before paint, though Botond can confirm.

That's correct.

The page is using what we call a scroll-linked effect - please see this MDN page for more info about those.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

I also see flickering (though less) on other browsers. In general I don't think you're guaranteed getting async scroll events before paint, though Botond can confirm.

Why not using position: sticky for this?

1)In Chromium browsers i don't see it(maybe my eyes liying me))
2)this example enough for demonstration bug and for my task need compatibility with old browsers(for now) sticky don't work everywhere

(In reply to Long from comment #3)

1)In Chromium browsers i don't see it(maybe my eyes liying me))

I definitely see some flickering in Chrome when I move back and forth close to the top.

2)this example enough for demonstration bug and for my task need compatibility with old browsers(for now) sticky don't work everywhere

There's various ways to support older browsers while using (faster, more correct, and power-efficent) scrolling on newer browsers.

An example, if you change .abs to be position: sticky, you can use the following as a fallback:

<script>
window.addEventListener("load", function() {
  if (window.CSS && window.CSS.supports && window.CSS.supports("position: sticky"))
    return; //Nothing to do, position: sticky does it for us.
  let abs = document.querySelector(".abs");
  abs.style.position = "absolute";
  abs.parentNode.addEventListener("scroll", function() {
    abs.style.top = this.scrollTop + "px";
  }, false);
}, false);
</script>

(In reply to Emilio Cobos Álvarez (:emilio) from comment #4)

(In reply to Long from comment #3)

1)In Chromium browsers i don't see it(maybe my eyes liying me))

I definitely see some flickering in Chrome when I move back and forth close to the top.

I tested in Chrome version 76. Maybe my GPU or CPU helps paint right.

2)this example enough for demonstration bug and for my task need compatibility with old browsers(for now) sticky don't work everywhere

There's various ways to support older browsers while using (faster, more correct, and power-efficent) scrolling on newer browsers.

An example, if you change .abs to be position: sticky, you can use the following as a fallback:

<script>
window.addEventListener("load", function() {
  if (window.CSS && window.CSS.supports && window.CSS.supports("position: sticky"))
    return; //Nothing to do, position: sticky does it for us.
  let abs = document.querySelector(".abs");
  abs.style.position = "absolute";
  abs.parentNode.addEventListener("scroll", function() {
    abs.style.top = this.scrollTop + "px";
  }, false);
}, false);
</script>

Task consists in fixing position <thead> in <table>.

(In reply to Botond Ballo [:botond] from comment #2)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #1)

In general I don't think you're guaranteed getting async scroll events before paint, though Botond can confirm.

That's correct.

The page is using what we call a scroll-linked effect - please see this MDN page for more info about those.

Thanks for article.

You can test for sticky support of a <thead> in a <table>, though I agree it's more elaborate (you need to create a scrollable table with a sticky header, then read out the position of the header).

I'm not sure whether we can do better at this... I suspect the fact that with devtools open this flickers way less means that paint-skipping may be involved somehow?

Botond (sorry for the continuous ni? storm), in this case paint-skipping seems to be doing wrong, isn't it? Or maybe I'm completely off and there's another reason for this.

Flags: needinfo?(botond)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #7)

I suspect the fact that with devtools open this flickers way less means that paint-skipping may be involved somehow?

Paint skipping shouldn't make a difference here, because while we may not schedule a paint in response to the scroll itself, we still should in response to the style.top change.

In principle, with the APZ frame delay in place, scroll-linked effects should remain in sync if we can paint the page within the frame budget.

The next step here would be to investigate whether, on this page, we (a) can't paint the page within the frame budget, or (b) there's something else going on that's causing the effect to be out of sync in spite of the frame delay and being able to paint within budget.

Bug 1367770 tracks investigating cases like this, and the page here makes a good test case.

Blocks: 1367770
Flags: needinfo?(botond)
OS: Unspecified → All
Hardware: Unspecified → All
Version: unspecified → 68 Branch

The priority flag is not set for this bug.
:TYLin, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(aethanyc)
Flags: needinfo?(aethanyc)
Priority: -- → P3
Version: 68 Branch → unspecified

I did track down this issue (since it's somewhat related to bug 1692708), as far as I can tell the one-frame delay isn't working as expected, in fact it's two-frames delay. That's because we do invoke SampleCompositedAsyncTransform before sampling an animation, whereas a RepaintRequest will be based on the metric after the animation sampling, thus scrollTop value on the main-thread will be actually an one-frame ahead value.

For example, given that there's a queued sampled offset (0, 100) after this AdvanceToNextSample call,

  1. SampleCompositedAsyncTransform adds a new offset, say (0, 110)
  2. sampling an animation produces the latest scroll offset (i.e. the APZC's metric), say (0, 120)
  3. APZCTreeManager::SampleForWebRender uses the (0, 100) offset for the composition
    in the meantime the (0, 120) offset is delivered to the main-thread and used there and "updated position:absolute position" is reflected to the compositor
  4. in the next APZCTreeManager::SampleForWebRender the (0, 110) is used but the position:absolute element's position is based on the (0, 120)

That's what I've observed. Moving the SampleCompositedAsyncTransform call after sampling an animation fixes this issue at least on my local Linux box. Though as of now I am not sure it's a right thing to do.

Status: UNCONFIRMED → NEW
Component: Layout: Scrolling and Overflow → Panning and Zooming
Ever confirmed: true

This is what I commented in comment 10. With this change, all test cases blocking bug 1367770 other than bug 1665900 case look better. I don't know why bug 1665900 looks worse. My wild guess is that painting consumes over 16ms in the case, but not sure yet. I'd need some debugging metrics to tell what's going on there easier.

Attachment #9248348 - Attachment is patch: true
Attachment #9248348 - Attachment mime type: application/octet-stream → text/plain
Attached patch Dump scroll linked effects (deleted) — Splinter Review

I managed to add a (ugly) hack to dump scroll-linked-effects element positions and WebRender's scroll frame metrics. This hack is not perfect since the target scroll frame is not the linked one, it's (probably I guess) ASR.

Here is a typical two frames dump dumped by using a modified test case in comment 0.

--- start --- SystemTime { tv_sec: 1637134429, tv_nsec: 522116809 }
scroll linked effects: Some(SpatialNodeIndex(9))
	 element pos:            Box2D((0.0, 4176.0), (1114.0, 4226.0))
	 viewport:               Box2D((0.0, 0.0), (1114.4, 884.1))
	 scrollable_size:        0.0x7596.9995
	 scroll offset:          (0.0, -4156.317)
	 external_scroll_offset: (0.0, 4176.0)
--- end   --- SystemTime { tv_sec: 1637134429, tv_nsec: 522116809 }

--- start --- SystemTime { tv_sec: 1637134429, tv_nsec: 538789320 }
scroll linked effects: Some(SpatialNodeIndex(9))
	 element pos:            Box2D((0.0, 4176.0), (1114.0, 4226.0))
	 viewport:               Box2D((0.0, 0.0), (1114.4, 884.1))
	 scrollable_size:        0.0x7596.9995
	 scroll offset:          (0.0, -4176.0)
	 external_scroll_offset: (0.0, 4176.0)
--- end   --- SystemTime { tv_sec: 1637134429, tv_nsec: 538789320 }

element pos is representing the abs pos position in the frame, scroll offset is a value changed by APZ. And external_scroll_offset is a value set by display list, it's kinda scroll offset at the time when NotifyLayersUpdated gets called.

At the first frame, the abs-pos element position is slightly off to down from the scroll top position at that moment. At the second frame its position is same as the scroll top.

With the patch I posted in comment 11, the issue just like at the first frame will be mostly eliminated, but I think it is not a fundamental fix.

A problem I can see by the dump is the element pos and external_scroll_offset are basically based on the last metrics when NotifyLayersUpdated gets called (from the perspective of APZ) whereas the scroll offset is a value sampled before.

Though I haven't figured out what a proper solution here is, we should fix this inconsistency (along with the patch in comment 11.

Assignee: nobody → hikezoe.birchill

What I am currently thinking is we will have to have queued SampledAPZState in WebRender (ScrollFrame I assume) as well as AsyncPanZoomController does. And we will have a value representing generation in the SampledAPZState, and we need to pick a proper SampledAPZState in the webrender renderer thread to find the same generation sampled result. Is it worth a shot?

(In reply to Hiroyuki Ikezoe (:hiro) from comment #13)

What I am currently thinking is we will have to have queued SampledAPZState in WebRender (ScrollFrame I assume) as well as AsyncPanZoomController does. And we will have a value representing generation in the SampledAPZState, and we need to pick a proper SampledAPZState in the webrender renderer thread to find the same generation sampled result. Is it worth a shot?

That approach sounds like it should fix the problem. We would want to have some kind of limit in place, such that if the generation from the main thread is "too old", we give up trying to stay in sync and just use a newer APZ offset (otherwise, if the main thread takes a long time to paint we are back to seeming unresponsive / getting "sync" scrolling).

That said, I'm a bit curious what makes this approach necessary. I think the original idea behind the frame delay was that if the main thread consistently paints things within the frame budget, then the display lists and corresponding samples (that would have the same generation with this approach) would just naturally align. I'm curious why this doesn't happen -- does a mis-alignment get introduced because sometimes one of the steps involved takes longer than one frame? Or sometimes a step happens twice in one frame?

Basically, I'm wondering if the described fix might be papering over a bug introduced in bug 1630781 (or not properly fixed by bug 1630781), which may cause other symptoms, and which might have a less invasive fix.

(In reply to Botond Ballo [:botond] from comment #14)

That said, I'm a bit curious what makes this approach necessary. I think the original idea behind the frame delay was that if the main thread consistently paints things within the frame budget, then the display lists and corresponding samples (that would have the same generation with this approach) would just naturally align. I'm curious why this doesn't happen -- does a mis-alignment get introduced because sometimes one of the steps involved takes longer than one frame? Or sometimes a step happens twice in one frame?

No, what happens is that an opposed case. A round trip scroll offset (initiated by an APZC and reported to the main-thread, and the offset has been received in the WebRender's scroll offset), happens within a Vsync tick. (Vsync tick isn't precisely describing the situation because there are a couple of different threads running).

For example, an APZC has two sampled offsets, say 10px and 20px, and the APZC notify the 20px in a RepaintRequest to the main-thread, then the main-thread gets the 20px and update the scroll position in question and an abs-pos element uses 20px, then an display item having 20px is notified to the WebRender, then WebRender renders that the scroller's position 10px, but the abs-pos position is 20px.

Ah, interesting, so it's kind of "two steps in one frame (vsync)". (I wonder if maybe that wasn't the case yet when bug 1630781 was fixed, and then was introduced by a subsequent performance optimization or someting like that.)

Anyways, if I'm understanding correctly that it's actually the main thread position that's ahead of the (composited) async scroll position, then this sounds like pretty good news, in that fixing this will not only put scroll-linked effects into sync, but also increase the responsiveness of async scrolling in general!

FWIW here is a try run with the recent changes I did locally;
https://treeherder.mozilla.org/jobs?repo=try&revision=8ca74557303118556ad153562722aaeaf0a8e81b

There remain a couple of relatively big issues;

  1. test_wheel_scroll.html fails due to some hit testing issues caused by the changes
  2. no automated tests has been made
  3. scroll thumb positions (and fixed/sticky positioned elements on mobile) might be slightly mis-positioned because I haven't changed those transforms, e.g. this transform for scroll thumbs.

I was naively thinking we don't need to change eForHitTesting cases, but given that there's at least one hit testing issue 1), we might need to. I'd think 3) can be deferred in a follow up.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #18)

  1. test_wheel_scroll.html fails due to some hit testing issues caused by the changes

I did address this test failure locally with a very ugly hack. To be honest, I don't understand how the failure was caused exactly, with the original changes WebRender hit tester sometimes fails to return eApzAwareListeners for some reasons. Anyway I found what makes the test failure. In the current m-c SpatialNode::set_scroll_origin has a calculation like this;

  let new_offset = normalized_offset - scrolling.external_scroll_offset;

SpatialNode::set_scroll_origin is triggered by an APZ sampling and it gets called in a frame transaction (in WebRender's term I think). normalized_offset is a scroll offset caused by an APZC, scrolling.external_scroll_offset is a scroll offset came from a display item on the main-thread. My original change deferred this calculation as much as possible, I mean, the calculation is done every time we want the result, e.g. here. I am still believing deferring the calculation as mush as possible is a right thing to do ideally, but given that it introduce an inconsistency between hit tester, I am not going to defer the calculation as much as possible for now, I am going to do the calculation in the frame transaction as what the current code does. The test case in this bug still works with the way.

Also note that I realized the my changes doesn't fix cases where any APZ animation is not involved, e.g. scroll thumb dragging, I hope those cases can be relatively easily fixed by bumping up the scroll generation I will introduce in this bug.

Depends on: 1744842

We are going to use it in various places, APZC, SampledAPZState etc.

Each APZC will have a ScrollGeneratioin and will use this 0th generation as a
special case where there's no active animation in APZC.

Depends on D133437

Depends on D133439

In a subsequent change, we'd like to use this getting async transform
calculation for each SampledAPZCState.

Depends on D133440

It's not used for Gecko at all. And in the next commit we are going to have
multiple scroll offsets in a ScrollFrameInfo and encapsulate
ScrollFrameInfo.scroll_offset method so with this unused scroll function,
the next change will be quite messy.

Depends on D133442

This change mitigates the gap between the external_scroll_offset informed from
the main-thread and scroll_offset informed from APZ.

The change inside GetAsyncScrollDeltaForSampling (which is renamed to
GetSampledScrollOffsets in this change) is basically doing what
GetCurrentAsyncTransformWithOverscroll does but for each SampledAPZCState,
i.e. getting the async transform and applying overscroll transform if it's
necessary. Unfortunately I don't have any great idea to generalize them.
GetCurrentAsyncTransformWithOverscroll will be limited to eHitTesting in
a later commit.

Some wrench reftests for this change are in the next commit.

Depends on D133443

Depends on D133444

Now the function gets caled only for eHitTesting.

Depends on D133445

Now I realized that the way will introduce jitter scrolling in some cases.

So for example, given that there were two sampled scroll offsets;

  • 10px with generation: 1
  • 20px with generation: 2

and if on the main-thread we had used the 20px offset with generation 2, then rendered it. And in the next sampling we have

  • 20px with generation 2
  • 30px with generation 3

in the meantime if the main-thread is busy for some reasons, then in the next rendering we do re-use 20px offset again. Though in the next rendering we will use 30px with generation 3 or the next sampled offset depending on the main-thread work.

A way to mitigate is checking scroll-linked effects (we will have to check scroll-linked animations too) and if there's any scroll-linked effect, we use this new machinery, and if there's no such thing, we don't use this machinery.

Botond, what do you think? I am quite unsure how often this situation happens in the wild.

Flags: needinfo?(botond)

I am going to add a pref to be able to flip the behavior at least.

(In reply to Botond Ballo [:botond] from comment #14)

We would want to have some kind of limit in place, such that if the generation from the main thread is "too old", we give up trying to stay in sync and just use a newer APZ offset (otherwise, if the main thread takes a long time to paint we are back to seeming unresponsive / getting "sync" scrolling).

Now I understand this part: the limit is naturally in place due to this code which ensures we don't retain too many samples (never more than 2, I think).

(In reply to Hiroyuki Ikezoe (:hiro) from comment #30)

Now I realized that the way will introduce jitter scrolling in some cases.

So for example, given that there were two sampled scroll offsets;

  • 10px with generation: 1
  • 20px with generation: 2

and if on the main-thread we had used the 20px offset with generation 2, then rendered it. And in the next sampling we have

  • 20px with generation 2
  • 30px with generation 3

in the meantime if the main-thread is busy for some reasons, then in the next rendering we do re-use 20px offset again. Though in the next rendering we will use 30px with generation 3 or the next sampled offset depending on the main-thread work.

Yeah, good catch. I guess, if some paints complete before the end of the current vsync interval and others not until the next vsync interval, we will in fact get some jitter.

It wouldn't surprise me if this happens somewhat often.

A way to mitigate is checking scroll-linked effects (we will have to check scroll-linked animations too) and if there's any scroll-linked effect, we use this new machinery, and if there's no such thing, we don't use this machinery.

I think this is a promising idea. In the case we have a scroll-linked effect, the jitter is probably preferable to having the effect be out of sync. But in the case we don't have a scroll-linked effect, it would be a regression compared to the current consistent one-frame delay.

In fact, if we do this, we could consider doing something else: using the most recent sample (offsets.last() instead of offsets.first()) in the case where we have no scroll-linked effect. That would effectively get rid of the one-frame delay altogether (which has been requested by users, for example in this bug) if there are no scroll-linked effects.

One caveat here is that our detection logic for "are there any scroll-linked effects on this page?" is incomplete (see bug 1276361). If we are going with this route, we may want to take the opportunity to improve that detection logic as well.

An alternative solution to the jitter that we could consider, is to keep a queue of main-thread samples as well, and if the latest main-thread sample is very fresh (latest vsync), use a sample corresponding to the previous vsync (and the APZ sample with the corresponding generation) instead. This would mean the one-frame delay will always be there, but the jitter would be avoided (even with scroll-linked effects).

Flags: needinfo?(botond)

(In reply to Botond Ballo [:botond] from comment #32)

An alternative solution to the jitter that we could consider, is to keep a queue of main-thread samples as well, and if the latest main-thread sample is very fresh (latest vsync), use a sample corresponding to the previous vsync (and the APZ sample with the corresponding generation) instead. This would mean the one-frame delay will always be there, but the jitter would be avoided (even with scroll-linked effects).

Thanks for the alternative solution. It took time to understand what it means. As I understand it, it can be done by without having the queue of the main-thread since we can tell the freshness by just comparing the last APZ sampled generation with the main-thread generation. And indeed, it fixes this bug pretty well as I've confirmed it locally. Hooray! I am going to drop the pref I added to flip this machinery.

Thank you!

Attachment #9254674 - Attachment description: Bug 1571758 - Add a generic GetAsyncTransform function. r?botond → Bug 1571758 - Add an optional index argument to some APZC methods to be able to get an arbirary sampled metrics. r?botond
Attachment #9254679 - Attachment is obsolete: true
Attachment #9254671 - Attachment is obsolete: true

(In reply to Botond Ballo [:botond] from comment #32)

An alternative solution to the jitter that we could consider, is to keep a queue of main-thread samples as well, and if the latest main-thread sample is very fresh (latest vsync), use a sample corresponding to the previous vsync (and the APZ sample with the corresponding generation) instead. This would mean the one-frame delay will always be there, but the jitter would be avoided (even with scroll-linked effects).

Thanks for the alternative solution. It took time to understand what it means. As I understand it, it can be done by without having the queue of the main-thread since we can tell the freshness by just comparing the last APZ sampled generation with the main-thread generation. And indeed, it fixes this bug pretty well as I've confirmed it locally. Hooray! I am going to drop the pref I added to flip this machinery.

As Botond noticed me in a review comment, I did misunderstand it.

In fact we can tell "if the latest main-thread sample is very fresh (latest vsync)" situation to compare the last APZ scroll generation with the main-thread one (that's what I did in D133444), but we can't tell the difference between;

  1. the main-thread generation is constantly one-frame delay
  2. a non-fresh main-thread generation is a one-time or a couple of time delays

To tell the difference, what I can think of is;

  1. add the last rendered generation to ScrollFrameInfo
  2. if the main-thread generation is same as the last rendered generation, we use the next generation's scroll offset

With this way, a question did arise in my head, that is, given that a situation;

  1. We did keep using the last fresh APZ sampled offset for a while
  2. The main-thead got busy for a while
  3. So we couldn't get any information from the main-thead
  4. Thus we did still keep using the last fresh APZ sampled offset during the busy
  5. The main-thread notified a very stale offset and a generation

In such case, we should keep using the last fresh APZ sampled offset, otherwise the last scroll position will jump back.

Another issue I have in my mind is our scroll generation implementation is not generating continuous numbers, it may be tricky to get the next generation.

As discussed with Botond in #apz channel, I introduced a new flag representing there's any scroll linked effect or not and used it to use this new machinery. The new flag is stored in each ScrollFrameInfo, fwiw (I was initially thinking I should ask Glenn where we should store it since in Gecko it's a flag per DOM document, but as far as I can tell, in WebRender there's no struct corresponding to DOM document).

Attachment #9254670 - Attachment description: Bug 1571758 - Split out ScrollGeneration into a new header. r?botond → WIP: Bug 1571758 - Split out ScrollGeneration into a new header. r?botond
Attachment #9254672 - Attachment description: Bug 1571758 - Introduce scroll generation in APZC and SampledAPZCState and inform it to the scrollable frame on the main-thread via RepaintRequest. r?botond → WIP: Bug 1571758 - Introduce scroll generation in APZC and SampledAPZCState and inform it to the scrollable frame on the main-thread via RepaintRequest. r?botond
Attachment #9254673 - Attachment description: Bug 1571758 - Factor out GetAsyncScrollDeltaForSampling. r?botond → WIP: Bug 1571758 - Factor out GetAsyncScrollDeltaForSampling. r?botond
Attachment #9254674 - Attachment description: Bug 1571758 - Add an optional index argument to some APZC methods to be able to get an arbirary sampled metrics. r?botond → WIP: Bug 1571758 - Add an optional index argument to some APZC methods to be able to get an arbirary sampled metrics. r?botond
Attachment #9254675 - Attachment description: Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r?gw!,botond! → WIP: Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r?gw!,botond!
Attachment #9254676 - Attachment description: Bug 1571758 - Drop scroll method from SpatialNode. r?gw → WIP: Bug 1571758 - Drop scroll method from SpatialNode. r?gw
Attachment #9254677 - Attachment description: Bug 1571758 - Inform multiple sampled scroll offsets to WR and pick the most appropriate one in WR. r?botond!,gw! → WIP: Bug 1571758 - Inform multiple sampled scroll offsets to WR and pick the most appropriate one in WR. r?botond!,gw!
Attachment #9254678 - Attachment description: Bug 1571758 - Wrench reftests for scroll offset generation. r?gw!,botond! → WIP: Bug 1571758 - Wrench reftests for scroll offset generation. r?gw!,botond!
Attachment #9254670 - Attachment description: WIP: Bug 1571758 - Split out ScrollGeneration into a new header. r?botond → Bug 1571758 - Split out ScrollGeneration into a new header. r?botond
Attachment #9254672 - Attachment description: WIP: Bug 1571758 - Introduce scroll generation in APZC and SampledAPZCState and inform it to the scrollable frame on the main-thread via RepaintRequest. r?botond → Bug 1571758 - Introduce scroll generation in APZC and SampledAPZCState and inform it to the scrollable frame on the main-thread via RepaintRequest. r?botond
Attachment #9254673 - Attachment description: WIP: Bug 1571758 - Factor out GetAsyncScrollDeltaForSampling. r?botond → Bug 1571758 - Factor out GetAsyncScrollDeltaForSampling. r?botond
Attachment #9254674 - Attachment description: WIP: Bug 1571758 - Add an optional index argument to some APZC methods to be able to get an arbirary sampled metrics. r?botond → Bug 1571758 - Add an optional index argument to some APZC methods to be able to get an arbirary sampled metrics. r?botond
Attachment #9254675 - Attachment description: WIP: Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r?gw!,botond! → Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r?gw!,botond!
Attachment #9254676 - Attachment description: WIP: Bug 1571758 - Drop scroll method from SpatialNode. r?gw → Bug 1571758 - Drop scroll method from SpatialNode. r?gw
Attachment #9254677 - Attachment description: WIP: Bug 1571758 - Inform multiple sampled scroll offsets to WR and pick the most appropriate one in WR. r?botond!,gw! → Bug 1571758 - Inform multiple sampled scroll offsets to WR and pick the most appropriate one in WR. r?botond!,gw!
Attachment #9254678 - Attachment description: WIP: Bug 1571758 - Wrench reftests for scroll offset generation. r?gw!,botond! → Bug 1571758 - Wrench reftests for scroll offset generation. r?gw!,botond!
Attachment #9256114 - Attachment is obsolete: true

Depends on D133445

Hey :aosmond, I am running into a floating point rounding issue here in this bug.

In short with APZ's scroll offset this snap function gets called with negative value, thus, the snapped result causes unexpected 1px difference. For negative value, I'd guess it would be (value - 0.5).ceil() instead of (value + 0.5).floor()? Note that during scrolling downwards scroll offset in the WebRender is basically negative so that the snap function gets called with the negative values.

Flags: needinfo?(aosmond)

FWIW, with the (value - 0.5).ceil() change, the new mochitest in comment 37 worked properly in a try run; https://treeherder.mozilla.org/jobs?repo=try&revision=b2b2569ac4b7ca5b472f5d7c0e708e670885eb79 (It's the first time the test doesn't cause any oranges, hooray!)

I'd be very wary about changing the snapping. That will have implications far beyond your test case, even if the tree is green, I would expect to see complaints in the field. Can you link to the failing test cases?

Generally we want to snap with + 0.5 but I suppose it is possible there is something special about scroll offsets that require a difference touch. Often this sort of problem is caused by us snapping too early. Any thoughts Jeff on snapping away from zero for negative scroll offsets?

Flags: needinfo?(aosmond) → needinfo?(jmuizelaar)

(value - 0.5).ceil() also seems wrong to me. Hiro, do you have an explanation for why the direction of snapping should change when we cross 0?

Flags: needinfo?(jmuizelaar) → needinfo?(hikezoe.birchill)

I am afraid I don't have any explanation on that. The way is a speculative attempt to make the rounding way symmetric to zero because the place is the only one suspicious place I had found at that moment, given that async scroll can be both negative/positive values I thought it kinda makes sense. That said, I think now I found another suspicious place, will dig into there.

Here is a failed try run, fwiw; https://treeherder.mozilla.org/jobs?repo=try&revision=ee34988c082def104d7bb51123d8e02f40835e0d

Thanks!

Flags: needinfo?(hikezoe.birchill)

Now I have no idea how to solve the rounding issue. The another suspicious place I meant in comment 42 was this ClampAndAlignWithPixels, the function is invoked by APZ operations, but I have no idea the function is correct or not.

So for example, if APZ scrolls at (0, 170.5), then the ClampAndAlignWithPixels function aligns it to (0, 171), then if a JS script tried to use the aligned scroll position for another element, say element.style.top = "171px", the element will be rendered at (0, 171). In the mean time, APZ also notifies the scroll offset in a different route, as a result of it, this scrolling.offset value will be (0, -170.5). Then unfortunately with the snap function in question clamped to (0, -170). That's what I've been seeing.

Note that the reason why offset is negative is describing in this comment, though TBH I don't quite understand it (I guess we can make it positive, even so there are cases the value would be definitely negative in cases of overscrolling).

Andrew and Jeff, what's the best solution here?

Flags: needinfo?(jmuizelaar)
Flags: needinfo?(aosmond)

My instinct here would be to either try consistently using the snapped result from ClampAndAlignWithPixels, so that when it calculates the scrolling.offset passed into WebRender, we would get -171 instead of -170.5, or to change how we snap scrolling.offset inside of WR to mirror APZ's expectations. This should not change how we snap anything else however, just APZ.

Flags: needinfo?(aosmond)

The latter likely makes more sense (unless we remove it entirely in favour of ClampAndAlignWithPixels on the content side.)

Attachment #9254675 - Attachment description: Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r?gw!,botond! → Bug 1571758 - Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r?botond!
Attachment #9254677 - Attachment description: Bug 1571758 - Inform multiple sampled scroll offsets to WR and pick the most appropriate one in WR. r?botond!,gw! → Bug 1571758 - Inform multiple sampled scroll offsets to WR and pick the most appropriate one in WR. r?botond!

Comment on attachment 9254678 [details]
Bug 1571758 - Wrench reftests for scroll offset generation. r?botond!

Revision D133445 was moved to bug 1571759. Setting attachment 9254678 [details] to obsolete.

Attachment #9254678 - Attachment is obsolete: true

(In reply to Andrew Osmond [:aosmond] (he/him) from comment #44)

My instinct here would be to either try consistently using the snapped result from ClampAndAlignWithPixels, so that when it calculates the scrolling.offset passed into WebRender, we would get -171 instead of -170.5, or to change how we snap scrolling.offset inside of WR to mirror APZ's expectations. This should not change how we snap anything else however, just APZ.

Thank you Andrew for the great suggestion. With your suggestion and as Botond realized me on Matrix, I am going to apply the same logic as what ClampAndAlignWithPixels does here in AsyncPanZoomController::GetCurrentAsyncTransform, it should probably fix the rounding issue.

Flags: needinfo?(jmuizelaar)
Attachment #9254678 - Attachment description: Bug 1571758 - Wrench reftests for scroll offset generation. r?gw!,botond! → Bug 1571758 - Wrench reftests for scroll offset generation. r?botond!
Attachment #9254678 - Attachment is obsolete: false

(In reply to Hiroyuki Ikezoe (:hiro) from comment #47)

I am going to apply the same logic as what ClampAndAlignWithPixels does here in AsyncPanZoomController::GetCurrentAsyncTransform, it should probably fix the rounding issue.

It's quite hard to fix this rounding issue.

Firstly, it turned out GetCurrentAsyncTransform is not the right place to do the stuff. The reason is the ClampAndAlignWithPixels logic uses NS_round() against the delta between the new scroll offset and the previous scroll offset, in the APZ side, the previous scroll offset could be the aligned value. So for example, once we sampled a new async scroll offset 29.5 and the scroll position on the main-thread at that moment was 160, then it will be aligned to 190, in the mean time, the FrameMetrics in question still has 189.5 ( = 160 + 295) offset, thus if we received the new aligned 190 value, the delta in the APZ side will be -0.5 (= 189.5 - 190), thus the NS_round() clamps the value to 189 unfortunately.

Secondly I tried to apply the ClampAndAlignWithPixels logic with mLastContentPaintMetrics in AsyncPanZoomController::SampleCompositedAsyncTransform, but there's still a race conditions that the mlastContentPaintMetrics is updated with the new scroll position from the main-thread (i.e. NotifyLayersUpdated gets called).

So, a last resort now I can think of is to store the aligned offset in FrameMetrics (i.e. in mLayoutViewport) in the first place. Though I am not 100% sure whether it works as expected.

(In reply to Hiroyuki Ikezoe (:hiro) from comment #48)

So, a last resort now I can think of is to store the aligned offset in FrameMetrics (i.e. in mLayoutViewport) in the first place. Though I am not 100% sure whether it works as expected.

This seems to be working, to be correct, I did apply ClampAndAlignWithLayerPixels against FrameMetrics::mScrollOffset in FrameMetrics::SetVisualScrollOffset. (Applying ClampAndAlignWithLayerPixels to mLayoutViewport didn't help). That said, this will more or less change our scroll behavior. So for example, here is an early state of scroll positions with/without the change

the original position the rendered position with the change the rendered position without the change (snapped in WR)
0.116667 0 0
0.516667 1 0
1.366670 1 1
2.783330 3 2

With the change the second value 0.516667 is aligned by floor(0.516667 + 0.5), it will be 1.0.
Whereas without the change the second value is aligned by (-0.516667 + 0.5).floor(), it will be 0.0.

Due to this behavior change, the tests added in bug 1692708 will frequently fail since we will observe 1 twice.

Also note that I've noticed that this floating rounding error has been causing a race condition that we mis-detect this userScrolled flag at the end of async scroll, but that's not a big deal.

Anyway, I am going to upload the change here in this bug.

After some more attempts I found out;

  1. the reason why clamp-and-align FrameMetrics::mLayoutViewport didn't work as expected is AsyncPanZoomController::ScrollBy scrolls based on FrameMetrics::mScrollOffset instead of the previous mLayoutViewport.TopLeft()
  2. Doing clamp-and-align in FrameMetrics::SetVisualScrollOffset is overkill since we don't do clamp-and-align the visual viewport offset on the main-thread, it causes some test failures (e.g. helper_zoom_restore_position_tabswitch.html)
  3. there's still a race condition that an APZC samples multiple times but on the main-thread we update the scroll position just once due to heavy load on the main-thread, thus delta value in ClampAndAlignWithPixels will defer, so that we can't have the same clamp-and-align-ed result both on the APZC and on the main-thread

Botond, I am leaning to defer this floating rounding difference into a later bug. Even without fixing the difference rest of patches would work in most cases of scroll linked effect issue. What do you think?

Flags: needinfo?(botond)

(In reply to Hiroyuki Ikezoe (:hiro) from comment #51)

Botond, I am leaning to defer this floating rounding difference into a later bug. Even without fixing the difference rest of patches would work in most cases of scroll linked effect issue. What do you think?

Thanks for your investigation on this, Hiro. It sounds like a tricky issue, and I think it's completely reasonable to land what we have so far and defer this issue to a later bug.

Flags: needinfo?(botond)

Thank you Botond. FIled bug 1752789.

Attachment #9259474 - Attachment is obsolete: true
Attachment #9260963 - Attachment is obsolete: true
Pushed by hikezoe.birchill@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/70bbedfc5b91 Split out ScrollGeneration into a new header. r=botond https://hg.mozilla.org/integration/autoland/rev/f60c125e2015 Introduce scroll generation in APZC and SampledAPZCState and inform it to the scrollable frame on the main-thread via RepaintRequest. r=botond https://hg.mozilla.org/integration/autoland/rev/58003e7dc5a0 Factor out GetAsyncScrollDeltaForSampling. r=botond https://hg.mozilla.org/integration/autoland/rev/75a9175ff83a Add an optional index argument to some APZC methods to be able to get an arbirary sampled metrics. r=botond https://hg.mozilla.org/integration/autoland/rev/3f9fdc20eac2 Inform apz scroll generation to WebRender's ScrollFrame from the main-thread. r=botond https://hg.mozilla.org/integration/autoland/rev/b13b738bc31b Drop scroll method from SpatialNode. r=gw https://hg.mozilla.org/integration/autoland/rev/9e0f35846353 Inform multiple sampled scroll offsets to WR and pick the most appropriate one in WR. r=botond https://hg.mozilla.org/integration/autoland/rev/c55e3a40a4de Wrench reftests for scroll offset generation. r=botond
Regressions: 1752893
Regressions: 1753436
Blocks: 1754129
Attachment #9259474 - Attachment is obsolete: false

Comment on attachment 9259474 [details]
Bug 1571758 - Add a mochitest for scroll linked effects. r?botond

Revision D136207 was moved to bug 1754129. Setting attachment 9259474 [details] to obsolete.

Attachment #9259474 - Attachment is obsolete: true
Flags: qe-verify+

Reproducible on Linux x86_64(Ubuntu 20.04) on Firefox 97(20220202182137). This is fixed on Firefox 98.0b5(20220215194438) and the latest Nightly 99.0a1(20220215210051) build on Linux x86_64(Ubuntu 20.04), macOS Big Sur 11 and Win10.

Status: RESOLVED → VERIFIED
Flags: qe-verify+
Regressions: 1759147

Hi, this bug was not present to me until recently, so I bisected it with mozregression GUI on Linux.

It can be always reproduced on the phoronix forum and golem.de with enabled auto scrolling via middle mouse.

This is the result of mozregression:

Bug 1571758 - Wrench reftests for scroll offset generation. r=botond

Differential Revision: https://phabricator.services.mozilla.com/D133445

Please feel free to move this comment to a new bug in case it does not belong here, regards.

(In reply to puxplaying from comment #58)

Hi, this bug was not present to me until recently, so I bisected it with mozregression GUI on Linux.

It can be always reproduced on the phoronix forum and golem.de with enabled auto scrolling via middle mouse.

This is the result of mozregression:

Bug 1571758 - Wrench reftests for scroll offset generation. r=botond

Differential Revision: https://phabricator.services.mozilla.com/D133445

Please feel free to move this comment to a new bug in case it does not belong here, regards.

That sounds like bug 1759017. The patches in bug 1760222 should fix that. You can test a build with those patches here https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/bt1XBV_HQNuF0cfOkMQydg/runs/0/artifacts/public/build/target.tar.bz2 if you'd like.

(In reply to Timothy Nikkel (:tnikkel) from comment #59)

That sounds like bug 1759017. The patches in bug 1760222 should fix that. You can test a build with those patches here https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/bt1XBV_HQNuF0cfOkMQydg/runs/0/artifacts/public/build/target.tar.bz2 if you'd like.

I can confirm that this build does resolve all scrolling issues in my case, thanks!

Blocks: 1802263
Regressions: 1812940
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: