Frames skipped during fling on mac
Categories
(Core :: Graphics: WebRender, defect, P3)
People
(Reporter: jrmuizel, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Attachments
(3 files)
Take a look at https://share.firefox.dev/2ZGH6om
There should be a steady stream of new frames but sometimes we don't paint.
Comment 1 • 4 years ago (Reporter)
We're calling wr_notifier_nop_frame_done for the skipped frames
Comment 2 • 4 years ago (Reporter)
And we call wr_notifier_nop_frame_done because https://searchfox.org/mozilla-central/rev/7dafc35406b9c945189c617d427f5458933fd3fb/gfx/wr/webrender/src/render_backend.rs#1606
Comment 3 • 4 years ago
IIRC Jamie was looking at this code not long ago; there was a patch fixing something here that might be related. I don't recall the details though.
Comment 4 • 4 years ago
Oh, I was thinking of https://hg.mozilla.org/mozilla-central/rev/2760b8a14f17ac39b2b6cc1ac2d63f2f1d05bb77 - maybe not related after all.
Comment 5 • 4 years ago
It could be related: my patch inverted the logic for how doc.rendered_frame_is_valid is set. I would have thought that during a fling the composite_descriptor would be changing, so we'd set it to false (and therefore render the frame), but maybe I misunderstand what that describes.
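For context, the skip decision under discussion can be sketched roughly like this. This is a simplified illustration only, not the actual WebRender source; `CompositeDescriptor`, `Document`, and `FrameOutcome` here are hypothetical stand-ins for the real types:

```rust
// Hypothetical, simplified sketch of WebRender's frame-skip decision.
// Names echo the real code, but this is not the actual implementation.
#[derive(PartialEq)]
struct CompositeDescriptor {
    // In real WebRender this describes the surfaces/offsets to composite;
    // a single scroll offset stands in for it here.
    scroll_offset_y: i32,
}

struct Document {
    rendered_frame_is_valid: bool,
    last_descriptor: CompositeDescriptor,
}

#[derive(Debug)]
enum FrameOutcome {
    NewFrame,
    NopFrame, // this path leads to wr_notifier_nop_frame_done
}

impl Document {
    fn build_frame(&mut self, descriptor: CompositeDescriptor) -> FrameOutcome {
        if self.rendered_frame_is_valid && descriptor == self.last_descriptor {
            // Nothing changed since the last rendered frame: skip rendering.
            FrameOutcome::NopFrame
        } else {
            self.last_descriptor = descriptor;
            self.rendered_frame_is_valid = true;
            FrameOutcome::NewFrame
        }
    }
}
```

Under this sketch, two consecutive frames with an identical scroll offset (as in the (0,144) pair in the log below) would make the second frame a nop.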
Comment 6 • 4 years ago (Reporter)
Here's a bit of logging:
update_doc true
FrameMsg::ScrollId (0,100) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame
update_doc true
FrameMsg::ScrollId (0,144) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame
update_doc true
FrameMsg::ScrollId (0,144) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
nop Frame
update_doc false
FrameMsg::ScrollId (0,-78) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
update_doc true
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame
update_doc true
FrameMsg::ScrollId (0,34) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame
Comment 7 • 4 years ago (Reporter)
Of note is the fact that we have two frames with FrameMsg::ScrollId (0,144).
Comment 8 • 4 years ago (Reporter)
A profile showing the problem with some extra annotation: https://share.firefox.dev/2M7XFBE
Comment 9 • 4 years ago
What are the STR for this bug?
Comment 10 • 4 years ago (Reporter)
I believe I was just flinging on a long hackernews thread.
Comment 11 • 4 years ago
I think this is the same problem as the one we had on Android before we switched to delivering Vsync from the main thread: The fling "wheel" events fire on the main thread, at vsync rate, and we can get unlucky if they fire just after we start the composite.
You can see this in the profile. For every no-op composite in the renderer thread, zoom in on the previous frame (since we composite APZ scrolls one frame after they happen) and compare the time between the CompositeToTarget marker on the compositor thread and the DOMEvent marker on the main thread. If the DOMEvent happens after the CompositeToTarget, then the event fired too late and we skip the next frame.
Of course this is not a good situation to be in.
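The comparison described above can be expressed as a one-liner. This is illustrative only; the timestamp names are made up for the sketch and are not actual Gecko marker APIs:

```rust
// Illustrative only: decides whether a main-thread wheel event fired after
// the compositor already started compositing for the current vsync.
// Timestamps are milliseconds from the same profiler clock.
fn event_misses_frame(composite_to_target_ts: f64, dom_event_ts: f64) -> bool {
    // If the DOMEvent fires after CompositeToTarget, the compositor has
    // already sampled APZ state for this frame, so the event's scroll delta
    // cannot be visible until the frame after next: a skipped frame.
    dom_event_ts > composite_to_target_ts
}
```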
Comment 12 • 4 years ago
Linking to the relevant Android bug. Thanks for the analysis, it makes sense (from what I recall of this bug).
Comment 13 • 4 years ago
We can fix this in a bit of a hacky way if we define some threshold time (e.g. 3ms after vsync) during which any user input events will cause a resampling of APZ state for the next vsync. With the changes in bug 1630781 we have a better encapsulation of sampled state and implementing this sort of change should be relatively easy. However defining the threshold time itself feels not-great.
If the input events dispatched from the OS have timestamps that are consistently related to the vsync timestamp then we can use that for a less-hacky solution. i.e. if we get user inputs that arrive after CompositeToTarget but with a timestamp from before the vsync then we know they just got held up somewhere for Reasons, and we should resample the APZ state after the user input is applied, and use that for the next composite. I don't know if the timestamps will line up though.
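A rough sketch of both ideas side by side. These are hypothetical helper functions, not Gecko code; all timestamps are assumed to be milliseconds on one shared clock, and the 3 ms threshold is the example value from above:

```rust
// Hypothetical sketches of the two proposed fixes. Not Gecko code.

// Hacky fix: resample APZ state if an input event arrives within a fixed
// window (e.g. 3 ms) after the vsync it should have preceded.
fn should_resample_threshold(vsync_ts: f64, input_arrival_ts: f64) -> bool {
    const THRESHOLD_MS: f64 = 3.0; // arbitrary; this is the not-great part
    input_arrival_ts - vsync_ts <= THRESHOLD_MS
}

// Less hacky fix: resample if the input *arrived* after compositing started
// but its OS timestamp says it was generated before the vsync, i.e. it was
// merely delayed in delivery "for Reasons".
fn should_resample_timestamp(
    vsync_ts: f64,
    composite_start_ts: f64,
    input_arrival_ts: f64,
    input_event_ts: f64,
) -> bool {
    input_arrival_ts > composite_start_ts && input_event_ts < vsync_ts
}
```

The second variant only works if the OS event timestamps and the vsync timestamps are on comparable clocks, which is exactly the open question in the comment above.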
Comment 14 • 4 years ago
I can do some investigation into the timestamp approach.
Comment 15 • 4 years ago
Comment 16 • 4 years ago
Depends on D85297
Comment 17 • 4 years ago
Depends on D85298
Comment 18 • 4 years ago
Attaching WIPs for what I had in mind. I'm not sure of a good way to test this yet.
Comment 19 • 4 years ago
https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=2865599c5d48cf05d46b3cee7eafc4e66427a32a is a try push which includes these patches, in case people want to try it locally.
Comment 20 • 4 years ago
I tested the try build for a bit. It was a bit hard to get a conclusive comparison because in my machine's current state, even a regular Nightly is mostly smooth during flings. I'm not sure how to increase the machine load in a way that reliably results in jittery wheel event timing.
However, while scrolling was mostly smooth, I did encounter a few instances of frame skips during flings in the try build.
Here's a profile from a jittery fling: https://share.firefox.dev/3gfPTD6
You can see a mix of 1-frame and 2-frame "APZScroll Payload Presented" latencies.
Comment 21 • 4 years ago
Ah, the patch didn't touch the payload stuff, so the payload for "just after vsync" events wouldn't get reported properly. I can fix that but I think your experience underscores the need for a more reliable way to test this. I'll give this some thought.
Comment 22 • 4 years ago
Oh I see. Yeah, switching to the Renderer thread, the no-op composites are in similar but different places: https://share.firefox.dev/2XfVIcj
Comment 23 • 4 years ago
I wrote some patches to get test_smoothness usable again (bug 1251612) and added some more logging to understand this problem better. It looks like the unsmoothness actually happens because the events are processed too early rather than too late. i.e. this bit is smooth:
0:13.30 GECKO(91709) 0x12506d000 Sampling for frame at 5377.784193
0:13.34 GECKO(91709) 0x12506d000 Handled event with timestamp 5387.830474
0:13.34 GECKO(91709) 0x12506d000 Sampling for frame at 5411.221955
0:13.34 GECKO(91709) 0x12506d000 Handled event with timestamp 5399.613695
and this bit is not:
0:13.42 GECKO(91709) 0x12506d000 Sampling for frame at 5495.568431
0:13.43 GECKO(91709) 0x12506d000 Handled event with timestamp 5484.319876
0:13.44 GECKO(91709) 0x12506d000 Handled event with timestamp 5496.650982
0:13.44 GECKO(91709) 0x12506d000 Sampling for frame at 5511.339022
0:13.46 GECKO(91709) 0x12506d000 Sampling for frame at 5528.897627
0:13.46 GECKO(91709) 0x12506d000 Handled event with timestamp 5516.373299
0:13.48 GECKO(91709) 0x12506d000 Handled event with timestamp 5530.763990
We can see that the events arrive bunched up, but the second event actually has a timestamp that's later than the previous "sampling for frame" timestamp, so we actually want to exclude that event from going into the sample. That seems to be the opposite problem from the one I thought we had, and maybe not as easy to fix. Maybe we should keep the input events in the input queue until the sampling for the previous interval happens.
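The queue-holding idea at the end could look something like this. This is a hypothetical sketch, not Gecko's actual input queue; the timestamps in the test come from the log above:

```rust
// Hypothetical sketch of holding late-stamped events back: when a batch of
// delayed events arrives, only events whose timestamp precedes the frame's
// sample time contribute to that sample; later ones wait for the next
// interval. Not Gecko code.
struct InputEvent {
    timestamp_ms: f64,
}

/// Splits the queue into (events to apply for this sample, events to hold).
fn partition_for_sample(
    queue: Vec<InputEvent>,
    sample_ts: f64,
) -> (Vec<InputEvent>, Vec<InputEvent>) {
    queue.into_iter().partition(|e| e.timestamp_ms <= sample_ts)
}
```

In the jittery trace above, the event stamped 5496.65 would be held back from the sample taken at 5495.57 and applied to the next one instead.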
Comment 24 • 4 years ago
I'm going to leave this bug parked with me for now although I don't plan on working on it in the immediate future as I'm focused on desktop zooming stuff. The discovery outlined in the previous comment means the fix is more complicated than I anticipated and I want to think about how to do it best.
Comment 25 • 4 years ago
This came up during the gfx-daily today. I doubt I'll be working on this in near/medium future, so I'm going to unassign myself. Markus has plans for a bit of vsync rewrite which might make fixing this easier.
Comment 26 • 3 years ago
Is this still happening? I looked at the profile in comment 0, I'm guessing the problem is that there is not a payload presented marker for every vsync timestamp?
It looks like our markers have changed quite a bit since that profile, but in my local profiles (https://share.firefox.dev/3ssydKp) I see an apz payload presented marker every vsync. Or should I be looking at contentpaint payload presented, which only seems to happen after we get a new display list?
Comment 27 • 3 years ago
Yes, I still see this regularly. Profile from just now: https://share.firefox.dev/3uzXLan
You can see the APZ scroll payload markers fluctuate between 20ms and 36ms.
Comment 28 • 3 years ago
Yeah, I see that in your profile, thanks. Is there anything more to the STR than just flinging a simple page that I might be missing?
Comment 29 • 3 years ago
I'm not sure why it only happens sometimes. Maybe the synthetic momentum events aren't generated based on vsync; maybe there's another clock in play. In that case it might depend on the clock drift between those two timing sources.
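As a back-of-the-envelope illustration of the clock-drift hypothesis (the frequencies below are assumptions; the real momentum-event source rate is unknown):

```rust
// If momentum events come from a nominal 60.000 Hz timer while vsync runs at
// 59.940 Hz (or any slightly different rate), the two clocks slip past each
// other. Once per beat period the phase wraps, producing a vsync interval
// with zero (or two) events, i.e. an intermittent skipped/doubled frame.
// Beat period = 1 / |f_events - f_vsync|.
fn beat_period_seconds(event_hz: f64, vsync_hz: f64) -> f64 {
    1.0 / (event_hz - vsync_hz).abs()
}
```

With these assumed rates the phases would realign roughly every 16.7 seconds, which would be consistent with the stutter showing up only intermittently, and only on machines where the two clocks actually drift.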
Comment 30 • 3 years ago
Tried to reproduce on two other macs and couldn't. (Don't spend any time trying to figure out why, I'm not sure if I'll look into this.)
Comment 31 • 2 years ago
I see this problem all the time now. I'm not sure what has changed. Whether wheel events get delivered at a consistent rate seems to be really finicky; it seems to depend on the window size, and possibly on the power state. The stutter effect becomes way less pronounced when the profiler is running, or when I record the screen using QuickTime. This makes it really hard to get a handle on it.