Closed
Bug 1396260
Opened 7 years ago
Closed 7 years ago
Intermittent Win7 reftest failures due to failure to load/use the expected font [was: Intermittent race between taking the reftest and loading the WOFF]
Categories
(Core :: CSS Parsing and Computation, defect, P5)
Core
CSS Parsing and Computation
Tracking
()
RESOLVED
FIXED
mozilla58
People
(Reporter: intermittent-bug-filer, Assigned: jmaher, NeedInfo)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: intermittent-failure)
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
jfkthame
:
review+
|
Details | Diff | Splinter Review |
Filed by: archaeopteryx [at] coole-files.de
https://treeherder.mozilla.org/logviewer.html#?job_id=128014276&repo=autoland
https://queue.taskcluster.net/v1/task/K7fvf5c8Tfmzwul_yys44A/runs/0/artifacts/public/test_info/reftest_errorsummary.log
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/K7fvf5c8Tfmzwul_yys44A/runs/0/artifacts/public/test_info/reftest_errorsummary.log&only_show_unexpected=1
There is a general race between taking the reftest and loading the WOFF according to philor who investigated. Previous failures have been filed against individual tests.
Comment 1•7 years ago
|
||
David, can you find anybody to work on this so we see less intermittents (on Windows?)? Thank you.
Severity: normal → major
Flags: needinfo?(dbaron)
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Flags: needinfo?(jfkthame)
Comment 3•7 years ago
|
||
We've been getting a steady trickle of these recently, apparently in more-or-less any reftest that loads a @font-face resource, but any given failure is rare enough that it seems hard to reproduce/test.
What causes the reftest harness to wait for in-progress font loads to finish before taking a snapshot?
Flags: needinfo?(jfkthame)
Comment 4•7 years ago
|
||
I don't see anything in:
https://searchfox.org/mozilla-central/rev/999385a5e8c2d360cc37286882508075fc2078bd/layout/tools/reftest/reftest-content.js
that does so.
(Why didn't we have this problem before, though?)
Comment 5•7 years ago
|
||
No, I didn't see anything obvious there, either.
Given how sporadic this is, and that it seems to be able to hit any test, I'm guessing that maybe it happens when a particularly slow GC (or something like that) happens at a bad moment and causes excessive jank at exactly the wrong time, such that we paint before the font load and following restyle is finished. Perhaps the risk was always there, but some recent changes affecting timing, speed, parallelism, etc., mean it has gone from being a one-in-a-million chance to one-in-a-thousand.
I wonder if we should be checking something like
document.fonts.status != 'loading'
before allowing the snapshot to be taken?
Comment hidden (Intermittent Failures Robot) |
Comment 7•7 years ago
|
||
Replying to bug 1391762 comment 7 (moving discussion here as it's the more general bug):
> It's not clear to me why these font load issues weren't an issue before, but
> presumably we need to teach the reftest harness to wait for downloadable
> font loads to complete.
That was my initial thought, but I don't think it's really an explanation. We have some tests that involve a -lot- of font loads that take multiple seconds (e.g. a bunch of tests in layout/reftests/font-matching/stretchmapping-* and weightmapping-*), and when running these tests locally you can watch as the various fonts get loaded and the rendering "evolves" as different elements of the testcase get painted. If the harness weren't waiting for font loads, ISTM this would always be failing - but it doesn't, the harness does wait until the fonts are loaded and then takes the snapshot it needs.
AFAICS, the tests have never done anything special to cause the harness to wait for fonts, nor to ensure "early" loading of @font-face resources. I assume, therefore (but David, you may know better!) that what happens is something along the lines that any font loads triggered during initial reflow will block the load event, as they add to the page's dependent resources, and reftest is waiting for onload to fire before it proceeds.
It's also worth noting bug 1391515 comment 5. The example there is different in that it doesn't involve any webfonts, only system-installed fonts, yet seems similar in that one font face apparently failed to be used in the testcase (although its bold face worked, and both regular and bold faces worked in the reference file immediately afterwards). So perhaps the (highly intermittent) failure we're seeing is not really specific to webfonts at all, but a more general failure in styling or in font instantiation. Or maybe that was an unrelated case.
Comment 8•7 years ago
|
||
(In reply to Jonathan Kew (:jfkthame) from comment #7)
> We have some tests that involve a -lot- of font loads that take multiple
> seconds (e.g. a bunch of tests in
> layout/reftests/font-matching/stretchmapping-* and weightmapping-*), and
> when running these tests locally you can watch as the various fonts get
> loaded and the rendering "evolves" as different elements of the testcase get
> painted.
FWIW, one of the best tests for watching this (in that it takes a long time to load its fonts) seems to be layout/reftests/font-matching/font-stretch-1.html, which loads a whole collection of differently-styled DejaVu faces, and -- at least when running my local opt+debug build -- can be seen gradually re-rendering various elements as the font faces come in. The onload event doesn't fire until all the loads have completed (and so multiple rounds of reflow and paint have happened), and only then does the reftest harness grab the result.
Comment hidden (Intermittent Failures Robot) |
Comment 10•7 years ago
|
||
Has this only been since stylo was switched on by default?
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 12•7 years ago
|
||
this started about 2 weeks after stylo was on by default and from the bugs I have seen this is windows 7 only (including default + stylo-disabled)
Assignee | ||
Updated•7 years ago
|
Comment 13•7 years ago
|
||
(In reply to Cameron McCormack (:heycam) from comment #10)
> Has this only been since stylo was switched on by default?
I don't think so; AFAICS, that happened around 09/06, but this was filed a few days before that (and refers to "previous failures" that had already been happening). Or was stylo enabled significantly earlier than that?
It does seem to have become a big problem in roughly the same timescale as stylo getting enabled, but some failures are showing up on non-stylo test runs as well, so I'm not sure whether there's a real connection or just an unfortunate correlation.
It's also unclear to me at this point whether the problem here is really style- or reflow-related at all, or if it's a Graphics-level problem whereby we're getting occasional font instantiation failures (most commonly with a newly-loaded webfont, where timing might easily come into play somehow, but there are some examples that appear to show failure with a system font, too).
Assignee | ||
Comment 14•7 years ago
|
||
as a note, I am only looking at font related bugs that show up 30+ times/week, there could be many more.
Assignee | ||
Comment 15•7 years ago
|
||
I did some experiments by running reftest in 32 chunks instead of 8:
https://treeherder.mozilla.org/#/jobs?repo=try&author=jmaher@mozilla.com&fromchange=6ac496c794145ddeaa366bb20d767030bb2683e5&tochange=6fe4911192ec13f2a8541146cda206c70abb8c24
you can see many more failures with 8 chunks- should we consider doing this approach? This would be for windows 7.
Comment 16•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)
> as a note, I am only looking at font related bugs that show up 30+
> times/week, there could be many more.
Indeed, there are lots on file that only show up a handful of times - perhaps only once - but look like they're the same underlying problem. In general, I think testcases that use more fonts, particularly @font-face resources, are much more likely to break, but pretty much any reftest with text in it -might- get unlucky.
I'm not surprised that running in smaller chunks is helpful. It looks like with that change, src-list-local-full is the worst remaining offender (although it's still far from consistent).
Comment 17•7 years ago
|
||
Interesting that your try run shows inconsistency even within how src-list-local-full fails. The common(er) case is that the first chunk of the testcase fails to load the expected src:local font, and falls back to serif:
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/TFZ5vyScRBqj_9FzNWWQfw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
but in one run, the first -two- chunks of the test both failed:
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/BP0OcVXzSiW6cKmoLI0Y3Q/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
Just an observation; I don't see that it particularly helps us figure out the nature of the failure.
Comment 18•7 years ago
|
||
Updating the title here to be more generic, as not all the failures relate to loading WOFF fonts (that's just a particularly common scenario where it shows up); the underlying issue seems to be wider than that.
Summary: Intermittent race between taking the reftest and loading the WOFF → Intermittent Win7 reftest failures due to failure to load/use the expected font [was: Intermittent race between taking the reftest and loading the WOFF]
Assignee | ||
Comment 19•7 years ago
|
||
another technique would be to run the font-face tests in their own suite, then we could make it tier-2 until this is fixed.
It is odd that if we split it into chunks and disable src-list-local-full that we would have a low intermittency rate. Maybe less time the browser is up and running before starting the font-face tests.
Also why is this windows 7 only, not on windows7 debug either?
Comment 20•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #19)
> another technique would be to run the font-face tests in their own suite,
> then we could make it tier-2 until this is fixed.
Note that although this is most common among @font-face tests, there are also scattered occurrences all over the place in other reftest directories. So while demoting these tests would probably reduce the noise considerably, it wouldn't entirely solve it.
> It is odd that if we split it into chunks and disable src-list-local-full
> that we would have a low intermittency rate.
I'm guessing that disabling that particular test wouldn't actually help much, as we'd tend to get the failure on the following test instead. I notice that src-list-local-full is the first test in your reftest-18 chunk that uses @font-face with src:local(), which means it probably triggers a bunch of work inside the gfxPlatformFontList code to retrieve font names for matching, and maybe that doesn't reliably complete in time. Disabling this test would just mean that the next test that uses src:local() will hit the same issue.
> Also why is this windows 7 only, not on windows7 debug either?
I don't know, but perhaps it's timing-related; e.g. because the opt builds run faster, they accumulate more cruft from earlier tests in between clean-up by periodic GCs, and so tend to have higher resource-usage peaks? Or if it really is a race condition of some kind (rather than a temporary resource-exhaustion problem), any kind of timing changes (whether faster or slower) could affect it.
Updated•7 years ago
|
Assignee | ||
Comment 22•7 years ago
|
||
as it currently stands reftest does not meet the requirements of tier-1 status, I think if we do not have any traction on this bug by end of the week we will mark windows7 reftests as tier-2 and ask the sheriffs to stop starring failures. I believe reftests are stable on other platforms- possibly we can just disable them on windows7 altogether and save some resources?
Comment 23•7 years ago
|
||
Disabling on Win7 is not an acceptable option. There are significant differences in config between Win7 and Win10 that affect how reftests run (D2D) and losing that coverage isn't OK.
Assignee | ||
Comment 24•7 years ago
|
||
ok, but this doesn't meet our tree visibility requirements, we need to address this one way or another and I wanted to start that converation here. If the jobs are hidden and not visible by default, will we still get value from the D2D win7 tests running? I am not eager to turn them off, just to solve this problem.
Comment 25•7 years ago
|
||
How did we go from "let's make them tier-2" to "hidden and not visible by default" in the span of two comments?
Comment 26•7 years ago
|
||
ISTM a better way forward would be to run them in smaller chunks, as per comment 15; with 32 chunks the failure rate drops substantially, and maybe we'd then be able to selectively disable just the worst remaining offenders on Win7?
Assignee | ||
Comment 27•7 years ago
|
||
I see a >50% failure rate on rg5:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=windows%20reftest%20rg5&fromchange=e22de36807c6229b3710382f0094ffe5e31eaa26
I am happy to do the 32 chunks and then consider disabling some tests, I think we will need to disable the font-face tests for win7 given the fact that the first test is failing and we suspect the first test will often fail.
Comment 28•7 years ago
|
||
I tried to find a regression range for this, but couldn't find anything conclusive. See https://bugzilla.mozilla.org/show_bug.cgi?id=1386941#c12, for example.
Assignee | ||
Comment 29•7 years ago
|
||
here is a try run, we see 1 failure in 50 on the font tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=504b68a70e5404c9b2137fef4189ccaad19209b3
Comment 30•7 years ago
|
||
I notice the one failure on each of Opt and PGO occurred on different tests, .
Given this, what I'd prefer us to try would be landing the patch to run in 32 chunks, but without the test-disabling line, and see how that goes. If we find we're still getting a significant number of failures on that particular test, we can do a followup to disable it, but I'd like a chance to see how consistent (or not) the failures are with the smaller chunks, before we do so.
Note that the next test after the one you disabled (src-list-local-full-quotes.html) loads exactly the same fonts as src-list-local-full.html, and renders the same text; the only difference is the use of quote-marks on the font names in the second version. It's hard to see how one of these would be more reliable than the other, when running in the same environment. So I think we should give it a little time to see how things go in the new smaller-chunks world, before targeting specific test(s) to disable.
Assignee | ||
Comment 31•7 years ago
|
||
here is the patch with just the larger chunks
Attachment #8918401 -
Attachment is obsolete: true
Attachment #8918401 -
Flags: review?(jfkthame)
Attachment #8918543 -
Flags: review?(jfkthame)
Comment 32•7 years ago
|
||
Comment on attachment 8918543 [details] [diff] [review]
run windows 7 reftests in 32 chunks
Review of attachment 8918543 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks; let's see how this goes, and what the mix of residual failures looks like afterwards.
Attachment #8918543 -
Flags: review?(jfkthame) → review+
Comment 33•7 years ago
|
||
Pushed by jmaher@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f68eef8bbd21
run reftest in smaller browser sessions on windows 7 to reduce font loading/rendering issues. r=jfkthame
Comment 34•7 years ago
|
||
Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/mozilla-inbound/rev/039963b03830
run reftest in smaller browser sessions on windows 7 to reduce font loading/rendering issues: fix indentation. r=linting-fix
Comment 35•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/f68eef8bbd21
https://hg.mozilla.org/mozilla-central/rev/039963b03830
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox58:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Updated•7 years ago
|
status-firefox57:
--- → affected
Comment 36•7 years ago
|
||
bugherder uplift |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 40•7 years ago
|
||
Sweeping this bug under the rug by increasing the chunk count increased our AWS costs by >$250,000 annually. More details in bug 1431291.
Assignee | ||
Comment 41•7 years ago
|
||
possibly another related bug 1407210
Assignee | ||
Updated•7 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•