Bug 1720068 (Open) - Opened 3 years ago, Updated 3 years ago

3.95 - 3.85% wikipedia SpeedIndex / wikipedia FirstVisualChange + 1 more (Linux) regression on Fri June 25 2021

Categories

(Core :: Layout: Text and Fonts, defect)

Tracking

Status: REOPENED
Performance Impact: medium
Tracking Status: firefox91 --- wontfix

People

(Reporter: Bebe, Unassigned)

References

(Regression)

Details

(4 keywords)

Perfherder has detected a browsertime performance regression from push 365a5fd3033bc32d07cf1d82f28c408eacb6bf86. Since you authored one of the patches included in that push, we need your help to address this regression.

Regressions:

Ratio Suite Test Platform Options Absolute values (old vs new)
4% wikipedia SpeedIndex linux1804-64-shippable-qr cold webrender 1,605.23 -> 1,668.62
4% wikipedia ContentfulSpeedIndex linux1804-64-shippable-qr cold webrender 1,611.54 -> 1,673.54
4% wikipedia FirstVisualChange linux1804-64-shippable-qr cold webrender 1,600.00 -> 1,661.54
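
For reference, the Ratio column is simply the relative change between the old and new means; for the SpeedIndex row, for example:

```python
# Relative change for the wikipedia SpeedIndex row above.
old, new = 1605.23, 1668.62
print(f"{(new - old) / old * 100:.2f}% regression")  # 3.95%, reported as 4%
```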

Details of the alert can be found in the alert summary, including links to graphs and comparisons for each of the affected tests. Please follow our guide to handling regression bugs and let us know your plans within 3 business days, or the offending patch(es) will be backed out in accordance with our regression policy.

For more information on performance sheriffing, please see our FAQ.

Flags: needinfo?(jfkthame)

It makes sense that the change in bug 1561868 could cause this, because it prioritizes consistency (ensuring the font-list refresh is handled in the content process) even when that work competes with a visual refresh (handling vsync). The wikipedia cold-start test is something of a "worst case" here, in that reflow of wikipedia.org ends up hitting font fallback for a lot of "unusual" languages at a point where font-list initialization is not yet complete.
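
Purely as an illustration of that tradeoff (a toy model, not Gecko's actual event scheduling; all names, priorities, and costs below are made up), a content process that always services a font-list refresh ahead of the pending vsync tick will delay the first paint whenever the two arrive close together during a cold load:

```python
import heapq

# Illustrative priorities (lower value runs first). In this toy model the
# font-list refresh outranks the vsync/paint tick, mirroring the
# "consistency over visual latency" tradeoff described above.
FONT_LIST_REFRESH = 0
VSYNC_PAINT = 1

class ContentTaskQueue:
    """Toy event queue for a content process; not real Gecko code."""

    def __init__(self):
        self._heap = []
        self._seq = 0  # keeps FIFO order within the same priority

    def post(self, priority, name, cost_ms):
        heapq.heappush(self._heap, (priority, self._seq, name, cost_ms))
        self._seq += 1

    def run(self):
        clock_ms = 0.0
        while self._heap:
            _, _, name, cost_ms = heapq.heappop(self._heap)
            clock_ms += cost_ms
            print(f"{clock_ms:6.1f} ms  ran: {name}")

# Cold start: the first paint and a font-list refresh arrive back to back.
q = ContentTaskQueue()
q.post(VSYNC_PAINT, "vsync paint (first visual change)", cost_ms=10)
q.post(FONT_LIST_REFRESH, "font-list refresh + fallback for unusual languages", cost_ms=60)
q.run()
# The paint only runs after the expensive font-list work, pushing
# FirstVisualChange (and therefore SpeedIndex) later.
```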

I think we should accept this regression. It will affect only a small subset of scenarios: once the browser is fully launched, this message occurs only in response to a (rare) change in the system font configuration or to relevant browser prefs such as the font whitelist, so it will not affect most browsing activity.

ni? to lsalzman for confirmation or any other thoughts here.

Flags: needinfo?(jfkthame) → needinfo?(lsalzman)

That makes sense to me, and I am okay with this.

Flags: needinfo?(lsalzman)

Closing based on comments 1 and 2.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX

Reopening, since in my local tests 91 Release and 93 Nightly both show massive regressions compared to 85 for live measurements of wikipedia. For 91 Release, I see a mean difference of ~30% and a median difference of 233%; for 93 Nightly I see 134% and 282%.

A re-measurement shows similar issues: 85 and 91 both have a lot of noise (std. dev. ~70%), and 91 regresses a lot (27%/42%); 93 Nightly has little noise but is far slower, at 175%/425% regressed.
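
For reference, a minimal sketch of how the mean/median regression percentages and the noise figure above can be computed from repeated live-site runs; the sample values below are placeholders, not the actual measurements:

```python
from statistics import mean, median, stdev

def regression_report(baseline_ms, candidate_ms):
    """Compare two sets of page-load samples (e.g. 85 vs 91 or 93)."""
    mean_diff = (mean(candidate_ms) - mean(baseline_ms)) / mean(baseline_ms) * 100
    median_diff = (median(candidate_ms) - median(baseline_ms)) / median(baseline_ms) * 100
    # "Noise" here is the standard deviation as a percentage of the mean
    # (coefficient of variation), which is how the ~70% figure reads.
    noise = stdev(candidate_ms) / mean(candidate_ms) * 100
    return mean_diff, median_diff, noise

# Placeholder numbers purely to show the calculation, not real data.
fx85 = [1600, 1650, 1620, 1700, 1580]
fx91 = [1900, 2100, 5200, 2000, 4800]
m, md, n = regression_report(fx85, fx91)
print(f"mean +{m:.0f}%, median +{md:.0f}%, noise {n:.0f}%")
```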

On live-site tests in AWFY, I also see indications of a regression, though it is not as obvious (and we measure infrequently): https://arewefastyet.com/linux64/cold-page-load-live/overview?numDays=90&series=Firefox,Firefox-Fission
It is around a 21% regression overall (though hard to see), and the timing matches: it appears in late June. The noise and the small number of data points make it tough to read.

The Wikipedia people saw similar regressions when they switched to 90 in early July (the patch was uplifted to 90); they saw regressions of up to 300% in their tests.

Status: RESOLVED → REOPENED
Flags: needinfo?(jfkthame)
Resolution: WONTFIX → ---
Whiteboard: [qf]

The wikipedia data points shown at https://arewefastyet.com/linux64/cold-page-load-live/overview?numDays=90&series=Firefox,Firefox-Fission are curious: they seem to be almost exactly "snapped" to multiples of 40ms, whereas the data points for other sites don't show that sort of quantization; they vary much more freely. Do we have any idea what's causing that? It feels like the page must be doing something a bit unusual that's somehow timing-related.

Flags: needinfo?(jfkthame) → needinfo?(rjesup)

There are many pages that "snap" to quantized values. Since SpeedIndex is basically the area over the visual-completeness curve, and a lot of pages pop in more or less all at once, the timing of that single paint determines the SpeedIndex score, and paints are tied to vsync. So the quantization makes sense.

It also means that small changes can be amplified if a page was near a quantization breakpoint.

This normally affects very fast-loading pages like google, bing, and yahoo mail, and it would not explain the huge changes here. It can explain the noise levels (bistable between two quantization points) to a degree, but we may also be seeing noise due to races with IPC messages between the parent and child processes, etc.
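
As a rough sketch of why that quantization shows up directly in the score (assuming the usual SpeedIndex definition as the area above the visual-completeness curve; the frame interval and timings below are made up for illustration):

```python
def speed_index(completeness, step_ms):
    """Approximate SpeedIndex: integral of (1 - visual completeness) over time."""
    return sum((1.0 - c) * step_ms for c in completeness)

def page_that_pops_in(paint_ms, total_ms, step_ms):
    """Visual-completeness samples for a page that goes 0% -> 100% in one paint."""
    return [0.0 if t < paint_ms else 1.0 for t in range(0, total_ms, step_ms)]

FRAME_MS = 40  # hypothetical paint interval, chosen to match the 40 ms snapping

# When the whole page pops in on a single paint, SpeedIndex collapses to the
# paint time, and the paint time snaps up to the next frame boundary...
for ready_ms in (1635, 1640, 1645):
    painted_ms = -(-ready_ms // FRAME_MS) * FRAME_MS  # round up to a frame
    vc = page_that_pops_in(painted_ms, total_ms=4000, step_ms=FRAME_MS)
    print(f"ready at {ready_ms} ms -> painted at {painted_ms} ms, "
          f"SpeedIndex {speed_index(vc, FRAME_MS):.0f}")
# ...so a few extra milliseconds near a frame boundary (1640 -> 1645) move the
# score by a whole frame, which is the amplification described above.
```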

The original alert was actually caused by a change from ~1650 to a bistable distribution between ~1650 and ~1950. The live-site values, however, seem much worse than the values we record in automation.

The other possibility (which wouldn't explain the June 25th issue) would be a switch to software WebRender somehow hurting wikipedia (but not other sites). That seemed more plausible when we weren't seeing this locally and they were seeing it in docker tests, but my tests were done on a Linux box with fairly up-to-date NVIDIA drivers and a Quadro board.

Flags: needinfo?(rjesup)
Whiteboard: [qf] → [qf:p1:pageload]
Whiteboard: [qf:p1:pageload] → [qf:p2:pageload]
Has Regression Range: --- → yes
Performance Impact: --- → P2
Keywords: perf:pageload
Whiteboard: [qf:p2:pageload]