Open Bug 1641070 Opened 4 years ago Updated 2 years ago

Frames skipped during fling on mac

Categories

(Core :: Graphics: WebRender, defect, P3)

Unspecified
macOS
defect

People

(Reporter: jrmuizel, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(3 files)

Take a look at https://share.firefox.dev/2ZGH6om
There should be a steady stream of new frames, but sometimes we don't paint.

Blocks: wr-mac

We're calling wr_notifier_nop_frame_done for the skipped frames.

Severity: -- → S3
OS: Unspecified → macOS
Priority: -- → P3

IIRC Jamie was looking at this code not long ago; there was a patch fixing something here that might be related. I don't recall the details, though.

It could be related: my patch inverted the logic for how doc.rendered_frame_is_valid is set. I would have thought that during a fling the composite_descriptor would be changing, so we'd set it to false (and therefore render the frame), but maybe I misunderstand what that describes.
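
(For context, a minimal sketch of the check being described, with simplified, invented names rather than the actual WebRender render backend code:)

// Illustrative sketch only; not the real WebRender code.
// The idea: a rendered frame stays valid only while the composite descriptor
// (roughly, the set of surface offsets/clips) is unchanged.
#[derive(Clone, PartialEq)]
struct CompositeDescriptor {
    surface_offsets: Vec<(f32, f32)>, // simplified stand-in
}

struct Document {
    rendered_frame_is_valid: bool,
    last_descriptor: CompositeDescriptor,
}

impl Document {
    fn on_frame_built(&mut self, new_descriptor: CompositeDescriptor) {
        if new_descriptor != self.last_descriptor {
            // During a fling the scroll offsets change every frame, so we'd
            // expect to land here and render the frame.
            self.rendered_frame_is_valid = false;
            self.last_descriptor = new_descriptor;
        } else {
            // Nothing moved since the last composite; the renderer can report
            // a no-op frame (which is where wr_notifier_nop_frame_done fires).
            self.rendered_frame_is_valid = true;
        }
    }
}

fn main() {
    let mut doc = Document {
        rendered_frame_is_valid: true,
        last_descriptor: CompositeDescriptor { surface_offsets: vec![(0.0, 100.0)] },
    };
    doc.on_frame_built(CompositeDescriptor { surface_offsets: vec![(0.0, 144.0)] });
    assert!(!doc.rendered_frame_is_valid);
}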

Here's a bit of logging:

update_doc true
FrameMsg::ScrollId (0,100) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame
update_doc true
FrameMsg::ScrollId (0,144) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame
update_doc true
FrameMsg::ScrollId (0,144) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
nop Frame
update_doc false
FrameMsg::ScrollId (0,-78) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
update_doc true
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame
update_doc true
FrameMsg::ScrollId (0,34) ExternalScrollId(2, PipelineId(1, 10))
FrameMsg::ScrollId (0,0) ExternalScrollId(2, PipelineId(1, 1))
BuildFrame
new Frame

Of note: we get two consecutive frames with the same FrameMsg::ScrollId offset of (0,144)

A profile showing the problem with some extra annotation: https://share.firefox.dev/2M7XFBE

Flags: needinfo?(kats)

What are the STR for this bug?

Flags: needinfo?(kats) → needinfo?(jmuizelaar)

I believe I was just flinging on a long hackernews thread.

Flags: needinfo?(jmuizelaar)

I think this is the same problem as the one we had on Android before we switched to delivering Vsync from the main thread: The fling "wheel" events fire on the main thread, at vsync rate, and we can get unlucky if they fire just after we start the composite.

You can see this in the profile. For every no-op composite in the renderer thread, zoom in on the previous frame (since we composite APZ scrolls one frame after they happen) and compare the time between the CompositeToTarget marker on the compositor thread and the DOMEvent marker on the main thread. If the DOMEvent happens after the CompositeToTarget, then the event fired too late and we skip the next frame.

Of course this is not a good situation to be in.

Linking to the relevant Android bug. Thanks for the analysis, it makes sense (from what I recall of this bug).

We can fix this in a bit of a hacky way if we define some threshold time (e.g. 3ms after vsync) during which any user input events will cause a resampling of APZ state for the next vsync. With the changes in bug 1630781 we have better encapsulation of the sampled state, so implementing this sort of change should be relatively easy. However, defining the threshold time itself feels not-great.
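
Roughly what I mean, as a sketch (invented names, not the actual APZ or bug 1630781 sampled-state code): if an input event arrives within the grace window after a vsync we already sampled for, resample so the upcoming composite picks up the new offset.

use std::time::{Duration, Instant};

// Illustrative only; hypothetical types standing in for APZ's sampled state.
const RESAMPLE_THRESHOLD: Duration = Duration::from_millis(3);

struct SampledState {
    offset_y: f32,
}

struct Apz {
    current_offset_y: f32,
    sampled: SampledState,
}

impl Apz {
    fn handle_input(&mut self, delta_y: f32, now: Instant, last_vsync: Instant) {
        self.current_offset_y += delta_y;
        if now.duration_since(last_vsync) <= RESAMPLE_THRESHOLD {
            // The event just missed the sample taken at this vsync; resample
            // so the next composite reflects it instead of skipping a frame.
            self.sampled = SampledState { offset_y: self.current_offset_y };
        }
    }
}

fn main() {
    let vsync = Instant::now();
    let mut apz = Apz {
        current_offset_y: 0.0,
        sampled: SampledState { offset_y: 0.0 },
    };
    // An event that lands 1 ms after the vsync still updates the sampled state.
    apz.handle_input(34.0, vsync + Duration::from_millis(1), vsync);
    assert_eq!(apz.sampled.offset_y, 34.0);
}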

If the input events dispatched from the OS have timestamps that are consistently related to the vsync timestamp, then we can use that for a less hacky solution. That is, if we get user inputs that arrive after CompositeToTarget but with a timestamp from before the vsync, then we know they just got held up somewhere for Reasons, and we should resample the APZ state after the user input is applied and use that for the next composite. I don't know if the timestamps will line up, though.
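
A sketch of that timestamp-based variant (again with invented names, not a Gecko API): rather than a fixed threshold, compare the event's OS timestamp to the vsync timestamp the composite was sampled for.

// Illustrative only; invented helper.
// `sampled_for_vsync_ms` is the timestamp of the vsync the compositor already
// sampled for; `event_timestamp_ms` is the OS timestamp on the input event.
fn should_resample(event_timestamp_ms: f64, sampled_for_vsync_ms: f64) -> bool {
    // The event was generated before the vsync but delivered after
    // CompositeToTarget: it got held up somewhere, so it belongs to this
    // frame and we should resample APZ state for the next composite.
    event_timestamp_ms <= sampled_for_vsync_ms
}

fn main() {
    // Stamped before the vsync, delivered late: resample.
    assert!(should_resample(5494.9, 5495.6));
    // Genuinely generated after the vsync: leave it for the next frame.
    assert!(!should_resample(5496.7, 5495.6));
}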

I can do some investigation into the timestamp approach.

Assignee: nobody → kats

Attaching WIPs for what I had in mind. I'm not sure of a good way to test this yet.

I tested the try build for a bit. It was a bit hard to get a conclusive comparison because in my machine's current state, even a regular Nightly is mostly smooth during flings. I'm not sure how to increase the machine load in a way that reliably results in jittery wheel event timing.

However, while scrolling was mostly smooth, I did encounter a few instances of frame skips during flings in the try build.
Here's a profile from a jittery fling: https://share.firefox.dev/3gfPTD6
You can see a mix of 1-frame and 2-frame "APZScroll Payload Presented" latencies.

Ah, the patch didn't touch the payload stuff, so the payload for "just after vsync" events wouldn't get reported properly. I can fix that but I think your experience underscores the need for a more reliable way to test this. I'll give this some thought.

Oh I see. Yeah, switching to the Renderer thread, the no-op composites are in similar but different places: https://share.firefox.dev/2XfVIcj

I wrote some patches to get test_smoothness usable again (bug 1251612) and added some more logging to understand this problem better. It looks like the unsmoothness actually happens because the events are processed too early rather than too late. That is, this bit is smooth:

 0:13.30 GECKO(91709) 0x12506d000 Sampling for frame at 5377.784193
 0:13.34 GECKO(91709) 0x12506d000 Handled event with timestamp 5387.830474
 0:13.34 GECKO(91709) 0x12506d000 Sampling for frame at 5411.221955
 0:13.34 GECKO(91709) 0x12506d000 Handled event with timestamp 5399.613695

and this bit is not:

 0:13.42 GECKO(91709) 0x12506d000 Sampling for frame at 5495.568431
 0:13.43 GECKO(91709) 0x12506d000 Handled event with timestamp 5484.319876
 0:13.44 GECKO(91709) 0x12506d000 Handled event with timestamp 5496.650982
 0:13.44 GECKO(91709) 0x12506d000 Sampling for frame at 5511.339022
 0:13.46 GECKO(91709) 0x12506d000 Sampling for frame at 5528.897627
 0:13.46 GECKO(91709) 0x12506d000 Handled event with timestamp 5516.373299
 0:13.48 GECKO(91709) 0x12506d000 Handled event with timestamp 5530.763990

We can see that the events arrive bunched up, but the second event actually has a timestamp that's later than the previous "Sampling for frame" timestamp, so we actually want to exclude that event from going into the sample. That's the opposite of the problem I thought we had, and maybe not as easy to fix. Maybe we should keep the input events in the input queue until the sampling for the previous interval happens.
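
A sketch of that "keep events in the input queue" idea (invented types, not the real APZ input queue): when sampling, only drain events stamped at or before the sample timestamp, and hold anything newer for the next sample.

use std::collections::VecDeque;

// Illustrative only; invented types standing in for the APZ input queue.
struct WheelEvent {
    timestamp_ms: f64,
    delta_y: f32,
}

struct InputQueue {
    pending: VecDeque<WheelEvent>,
}

impl InputQueue {
    // Called when we sample scroll state for the frame at `sample_time_ms`.
    // Events stamped after the sample time stay queued so they contribute to
    // the next sample instead of double-stepping this one.
    fn drain_for_sample(&mut self, sample_time_ms: f64) -> Vec<WheelEvent> {
        let mut taken = Vec::new();
        while let Some(ev) = self.pending.front() {
            if ev.timestamp_ms <= sample_time_ms {
                taken.push(self.pending.pop_front().unwrap());
            } else {
                break;
            }
        }
        taken
    }
}

fn main() {
    // Timestamps loosely based on the "unsmooth" log excerpt above.
    let mut queue = InputQueue {
        pending: VecDeque::from(vec![
            WheelEvent { timestamp_ms: 5484.3, delta_y: 30.0 },
            WheelEvent { timestamp_ms: 5496.7, delta_y: 28.0 },
        ]),
    };
    let applied = queue.drain_for_sample(5495.6);
    assert_eq!(applied.len(), 1);        // only the first event feeds this sample
    assert_eq!(queue.pending.len(), 1);  // the newer one waits for the next sample
}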

I'm going to leave this bug parked with me for now although I don't plan on working on it in the immediate future as I'm focused on desktop zooming stuff. The discovery outlined in the previous comment means the fix is more complicated than I anticipated and I want to think about how to do it best.

This came up during the gfx-daily today. I doubt I'll be working on this in the near or medium term, so I'm going to unassign myself. Markus has plans for a bit of a vsync rewrite which might make fixing this easier.

Assignee: kats → nobody

Is this still happening? I looked at the profile in comment 0; I'm guessing the problem is that there isn't a payload-presented marker for every vsync timestamp?

It looks like our markers have changed quite a bit since that profile, but in my local profiles (https://share.firefox.dev/3ssydKp) I see an APZ payload presented every vsync. Or should I be looking at the contentpaint payload presented, which only seems to happen after we get a new display list?

Yes, I still see this regularly. Profile from just now: https://share.firefox.dev/3uzXLan
You can see the APZ scroll payload markers fluctuate between 20ms and 36ms.

Yeah, I see that in your profile, thanks. Is there anything more to the STR than just flinging a simple page that I might be missing?

I'm not sure why it only happens sometimes. Maybe the synthetic momentum events aren't generated based on vsync and there's another clock in play, in which case it might depend on the clock drift between those two timing sources.

I tried to reproduce this on two other Macs and couldn't. (Don't spend any time trying to figure out why; I'm not sure if I'll look into this.)

Depends on: 1735391

I see this problem all the time now. I'm not sure what has changed. Whether wheel events get delivered at a consistent rate seems to be really finicky; it appears to depend on the window size, and possibly on the power state. The stutter becomes much less pronounced when the profiler is running, or when I record the screen with QuickTime, which makes it really hard to get a handle on.
