[tracking] Improve performance of Fluent on the startup path
Categories
(Core :: Internationalization, enhancement, P1)
People
(Reporter: zbraniecki, Assigned: zbraniecki)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
(Keywords: meta)
Attachments
(3 obsolete files)
Comment 35•5 years ago
With bug 1552714 fixed, I'm testing the perf impact of landing a single string on the startup path: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c1a9623c2c56f38cfcef106d2a41b1c8fa058f3d&newProject=try&newRevision=2b5aaba72a5856b507b5741faf4c4ac2ce28ec98
Comment 36•5 years ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #35)
> With bug 1552714 fixed, I'm testing the perf impact of landing a single string on the startup path: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c1a9623c2c56f38cfcef106d2a41b1c8fa058f3d&newProject=try&newRevision=2b5aaba72a5856b507b5741faf4c4ac2ce28ec98
This still shows regressions in ts_paint and twinopen. There's some noise but the overall picture still looks like this would be a regression.
Are we still intending to fix bug 1517880?
Comment 37•5 years ago
> This still shows regressions in ts_paint and twinopen. There's some noise but the overall picture still looks like this would be a regression.

Yeah, before the Windows results showed up it looked better :/

> Are we still intending to fix bug 1517880?

Yes, I believe so. Or switch to fluent-rs, which is likely a larger task, so my focus will be on bug 1517880 first.
Comment 38•5 years ago
Update on this. With all of the Fluent DOM now in C++, and with changes in bug 1558602, we're getting to the point where we may want to consider starting to move away from DTD in browser.xhtml.
As an experiment, I migrated the whole menubar and some bits and pieces from browser.xhtml and panelUI - a total of 250 strings. I didn't remove the strings from the DTD files yet; I just added Fluent on top.
Here are talos results: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6fa61d1c9e669fd5f1f156201dff5ec0d001ceae&newProject=try&newRevision=068bfbc7c468903dd6b9318e2228ad49f7ca4db5
I'm still interested in bug 1517880 and will now work on that, but I think we're at the point where we can start working on bug 1501886 in parallel.
I'd like to verify that claim with someone from the perf team - mconley. Can you take a look at the talos and tell me what you think? Is there anything else you'd like me to test?
In my initial landing, I'd like to migrate just the menubar, which is ~100 strings.
Comment 39•5 years ago
Thanks for tagging me in, gandalf.
So the comparison still shows a regression - but as you say, we've effectively doubled the work in your treatment build; we're loading both the Fluent strings and the DTD strings. Does the regression go away if you remove the DTDs? If so, that's pretty compelling.
Comment 40•5 years ago
Hah, actually, let me provide more context first!
So, starting here, I'd already want to ask a question - what's an acceptable performance penalty for the value we're getting? What's the process, who are the stakeholders, and how can we evaluate that hit against the things the change unblocks us to do?
To reiterate on why we want to unblock Fluent on startup:
- Removal of Yellow Screen of Death https://docs.google.com/document/d/1A2wQI8tdEe0wIFK2KQZniBlQOjSrGoZglQUO_Xfvx2U/edit#
- Introducing a Fluent L10n context in browser.xhtml enables a plethora of cleanups in browser.xhtml's JS
- And unification of that JS l10n with XUL/XHTML l10n into a single context
- Final deprecation of DTD in Firefox
- Runtime locale switching without restart
- Increased security as a result of DTD removal
- Improved l10n experience with better error fallback
- Step on the path to remove all of the old StringBundle, PluralRules etc. codebase
All of those things (and a couple of others) have been blocked on this for the last 6 months. I of course understand that we want to prevent any perf regressions, but I believe this is not the first case where reality requires us to make compromises between two conflicting needs.
On top of that, some of those changes, such as the JS cleanups around browser.xhtml, are likely to bring small perf improvements over time. They won't show up as alerts in talos, and it's impossible to evaluate that claim except through anecdotal experience - porting Preferences to Fluent led to code cleanups and easier startup JS code.
I don't know how much longer I will need to keep shaving off the final milliseconds, but I'd like to understand the criteria for evaluating some performance hit against those sought-after goals.
Maybe the answer is "zero" - as in, if 6 months from now we're still 3ms away from green talos on windows7-32, then none of that can be unblocked. But I hope that the answer is more nuanced.
Now, let's take a dive into the numbers as of today:
> So the comparison still shows a regression
We have 4 "important" changes in this talos result:
sessionrestore opt e10s stylo / windows10-64-shippable - 2.37%
ts_paint opt e10s stylo / windows10-64-shippable - 2.03%
twinopen ext+twinopen:twinopen.html opt e10s stylo / linux64-shippable - 2.30%
twinopen ext+twinopen:twinopen.html opt e10s stylo / macosx1014-64-shippable - 3.72%
In isolation, that may look like "just a regression", but if we look at a couple similar tests, the conclusion may be slightly different:
sessionrestore opt e10s stylo / windows10-64-shippable-qr - -0.47%
ts_paint opt e10s stylo / windows10-64-shippable-qr - -0.20%
In multiple other cases the results for the windows*-qr variants show better performance than non-qr, and we are turning on WebRender everywhere. Is there a reason to argue that this is either noise or not very impactful for us?
The two other changes are a single yellow result on linux64-shippable, not significant on the -qr variant, and the macosx1014-64 twinopen. macOS doesn't have a -qr variant yet, but my guess is that once we add it, it'll show no impact.
My best guess is that there's something about the event loop that quantizes the events into frames - you can notice that if you hover over the twinopen/ts_paint results on Windows: sometimes they have a quasi-normal distribution, sometimes a bimodal one.
> Does the regression go away if you remove the DTDs?
I can't do that yet. Unfortunately, the state of DTD in browser.xhtml is a patchwork built up over the last 15 years. Strings are used all over the place, and a lot of strings from the menubar are used elsewhere. I'd have to migrate all of browser.xhtml at once, which would be daunting to work on and would take much longer.
My question is how far those results are from something acceptable.
Regardless of the answer, I'm going to work on the next steps:
- fluent-minicache (bug 1517880)
- migration to fluent-rs
The mini-cache should be close to being landable, but the fluent-rs will take longer.
I'd prefer to work on the cache and fluent-rs in parallel with unblocking the front-end and l10n teams to migrate browser.xhtml away from DTD, so I'm trying to understand who can make such a decision, and how.
Comment 41•5 years ago
For the record, I got the mini-cache to work (bug 1517880). Talos numbers here: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=6fa61d1c9e669fd5f1f156201dff5ec0d001ceae&newProject=try&newRevision=d8de254ce2afb8540f9c603797b8925b68849432
I still think it would be nice to understand what the criteria are and who the stakeholders are for enabling Fluent on the startup path.
Comment 42•5 years ago
I really apologize for not having gotten back to this sooner - Fission crunch for Whistler and then Whistler kinda took their toll.
> What's an acceptable performance penalty for the value we're getting? What's the process, who are the stakeholders, and how can we evaluate that hit against the things the change unblocks us to do?
Great question. I think we have to look at what we're paying for with the cost of the performance regression - you've done a good job of enumerating the performance benefits. Naturally, the ideal scenario is that we wouldn't regress at all, so the aspirational target is 0 or negative performance regression.
That having been said, I believe it's generally understood that once you've picked most of the low-hanging fruit, the effort and time cost to shave off the last few milliseconds to reach parity will be high.
So the equation roughly is: is the value that we're getting (which you've enumerated) greater than the cost of the regression? And which costs more - the regression, or the effort to address the regression?
I seem to recall a meeting during the All Hands in Orlando where we discussed the Fluent minicache as the thing that will likely "solve" the performance regression. Have you performed any experiments or built any prototypes to demonstrate that this is indeed so?
Comment 43•5 years ago
> I seem to recall a meeting during the All Hands in Orlando where we discussed the Fluent minicache as the thing that will likely "solve" the performance regression. Have you performed any experiments or built any prototypes to demonstrate that this is indeed so?
I'm still working on bug 1558602 and bug 1517880. I'm just getting to the point where things become tradeoffs. The cache, for example, increases code complexity, and depending on how we implement it, may be hard to maintain. I'm still optimistic about getting to "no regression", but I would like to understand how much room I have to at some point say "ok, adding this amount of convoluted code is not worth shaving those X ms", and whom I can consult on that decision.
Comment 44•5 years ago
Olli, I profiled three release builds: m-c (with patch from bug 1558602), menubar on startup path (bug 1501886) async, and sync.
== central ==
https://perfht.ml/2NiFhcR
https://perfht.ml/320ofn8
https://perfht.ml/2NgpYBe
https://perfht.ml/2NjlcTG
https://perfht.ml/2NlJFaM
https://treeherder.mozilla.org/#/jobs?repo=try&revision=708b775e66b3e27a98e13a554766f851c29f02d9
== menubar FTL async ==
https://perfht.ml/2NhhA4w
https://perfht.ml/31XhrGQ
https://perfht.ml/2NhOYIh
https://perfht.ml/2NhUwCL
https://perfht.ml/2NiZ0c4
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b35b97e3b5e96f84463629897eb2e7a80e391b7e
== menubar FTL sync ==
https://perfht.ml/2NsP2Fs
https://perfht.ml/2NfZaRv
https://perfht.ml/2Njpx9q
https://perfht.ml/2NhiuOs
https://perfht.ml/2Ndhc6T
https://treeherder.mozilla.org/#/jobs?repo=try&revision=64737f98524938c9fcaa4ec353a20f37d2208bde
perfherder (vs m-c) - https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=708b775e66b3e27a98e13a554766f851c29f02d9&newProject=try&newRevision=64737f98524938c9fcaa4ec353a20f37d2208bde
Can you tell me what you see in the profiles? Is it still mostly the JS engine initialization?
:mconley - you can also notice from the perfherder results that some swings are wild: +16% here, -33% there. I'm worried that at some point we'll have to try to read the profiles without statistical significance, and then the interpretation will be prone to the "confirmation bias" fallacy - if you want to find a regression in the results, you'll notice the +2.0% here; if you want to believe it's faster, you'll notice the -2.0% there. I'd like to understand whose go-ahead I'm looking for.
Comment 45•5 years ago
(looking at https://perfht.ml/2Njpx9q)
There are definitely a couple of milliseconds (~5) coming from JS component loading.
mozilla::intl::Localization::Init overall takes ~10ms; half of it is JS module loading, the other half is the getLocalization call.
Does getLocalization use generators heavily - in a way that doesn't get JITed?
translateElements is called elsewhere (DocumentL10n::TriggerInitialDocumentTranslation), and takes 3ms.
So yes, JS usage does show up.
Comment 46•5 years ago
> Does getLocalization use generators heavily - in such way which doesn't get jit'ed?

I don't know - is it possible to see that from the profile?
Comment 47•5 years ago
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #44)
> :mconley - you can also notice from the perfherder results that some swings are wild. +16% here, -33% there. I'm worried that at some point we'll have to try to read the profiles without statistical significance and then the interpretation will be prone to "confirmation bias" fallacy - if you want to find regression in the results, you'll notice the +2.0% here, if you want to believe it's faster, you'll notice -2.0% there. I'd like to understand whose go-ahead I'm looking for.
Hey gandalf,
At that point, where you've (admirably!) boiled the difference down to noise, you'll probably want esmyth to make the decision on whether or not the "regression" (real or imagined) is acceptable, based on what we're buying with it.
Comment 48•5 years ago
Since the cache looks to require more wrangling between prototypes, documents, and Fluent, I decided to push bug 1560038 a bit.
I now have a patch that replaces Fluent.jsm with FluentBundle and FluentResource backed by fluent-rs and here are talos numbers: https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=9cf343dbb979f2429785f51be2c44e61e1254286&newProject=try&newRevision=49b4a81baecee9ea35d75dec902963fc331ba196
I think it's pretty promising and I'll continue evaluating it.
Comment 49•5 years ago
With the latest set of performance improvements to the Arc branch of fluent-rs - https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=2812d08947cbcfb966be0d5d148cd8c5cc492ef2
Comment 50•5 years ago
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #49)
> With the latest set of performance improvements to the Arc branch of fluent-rs - https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=2812d08947cbcfb966be0d5d148cd8c5cc492ef2
This still regresses ts_paint on windows pgo builds by nearly 2%. :-(
Comment 51•5 years ago
> This still regresses ts_paint on windows pgo builds by nearly 2%. :-(

This is statistically insignificant. I'm happy to run more reps to try to get a better p=95, but I think we're at the "trying to read into random data" point. If you look for regressions, you'll find them; if you look for wins, you'll find a 25% win in twinopen on the same platform.
I believe both to be flukes.
Comment 52•5 years ago
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #51)
>> This still regresses ts_paint on windows pgo builds by nearly 2%. :-(
> This is statistically insignificant.
I don't think it's necessarily insignificant (we're doing our darndest to shave off fractions of a percent over here), but like I wrote in comment 47, you should double-check with esmyth to see if the possible cost is worth what we're buying.
Comment 53•5 years ago
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #51)
>> This still regresses ts_paint on windows pgo builds by nearly 2%. :-(
> This is statistically insignificant.

Like mconley, I disagree with this assessment. The spread of numbers for ts_paint on win64 (which you can see when hovering over the data points on both sides of the graph) is almost completely disjoint (i.e. no overlap), and perfherder claims high confidence in the regression based on a t-test.
> I'm happy to run more reps to try to get better p=95, but I think we're in the "trying to read into random data" point. If you look for regressions, you'll find them, if you look for wins, you'll find a 25% win in twinopen on the same platform.
I see a 13.26% regression on win10-64. On win10-64-qr, there's an "improvement", but it is low-confidence, and if you look at the spread of the numbers being used, you can see that in the baseline (m-c) dataset there are 2 outliers at about 380ms each, and the rest of the numbers actually come in below the 123.82 average of your trypush - i.e. there, too, this actually looks likely to be a regression. The outliers look like an irregular (but not unprecedented) occurrence based on the graph.
So no, I don't think this just depends on what you're looking for. I think the data points out that fluent is slower.
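For readers less familiar with what perfherder's confidence rating is built on (the t-test mentioned above), here is a minimal sketch of a two-sample t-statistic in plain Python. The Welch form and the sample numbers below are illustrative assumptions, not what perfherder actually runs:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    # Welch's two-sample t-statistic: difference of means divided by
    # the combined standard error; does not assume equal variances.
    se2 = variance(a) / len(a) + variance(b) / len(b)
    return (mean(a) - mean(b)) / math.sqrt(se2)

# Hypothetical ts_paint samples (ms), loosely modeled on the ranges
# quoted later in this bug (314-318 before, 317-324 after).
base = [314, 315, 315, 316, 317, 318, 318]
patched = [317, 319, 320, 321, 322, 323, 324]

t = welch_t(patched, base)  # positive t: patched build slower on average
```

A large |t| is what gets reported as "high confidence" - but, as discussed below, that interpretation leans on the samples being roughly normal.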
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #48)
> Since the cache looks to require more wrangling between prototypes and documents and fluent, I decided to try to push bug 1560038 a bit.
Is this a temporary or fundamental hold-up to the caching work?
Comment 54•5 years ago
> The spread of numbers for ts_paint on win64 (that you can see when hovering over the data points on both sides of the graph) is almost completely disjoint (ie no overlap), and perfherder claims high confidence in the regression based on a t-test.
This is a bimodal distribution, which unfortunately means that the t-test used by talos to calculate significance is not the proper test here.
It seems that for ts_paint and twinopen all Windows results are bimodal (maybe collapsing on frames?), which makes it counter-intuitive to try to read their significance.
> I see a 13.26% regression on win10-64. On win10-64-qr, there's an "improvement", but it is low-confidence, and if you look at the spread of the numbers that are being used, you can see that on the baseline (m-c) dataset, there are 2 outliers at about 380ms each, and the rest of the numbers actually come in below the 123.82 average on your trypush, ie there, too, this actually looks likely to be a regression.
I think the way you analyze the data is not in line with what I was taught in statistics classes.
As I stated before, the t-test is (quoting Wikipedia) "most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known".
In particular, it is a common mistake to apply a t-test to non-normal distributions, like bimodal or multimodal ones.
There are tests that can verify whether a distribution is normal (it visibly is not, but if you prefer I can also calculate it), and in such a case the Mann–Whitney U test would be more appropriate [0], as it (quoting Wikipedia) "unlike the t-test does not require the assumption of normal distributions".
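To make the suggestion concrete, the U statistic itself is straightforward to compute. A sketch in plain Python, pair-counting form only (a real test would also derive a p-value from U, which is omitted here):

```python
def mann_whitney_u(a, b):
    # U statistic for sample `a` versus `b`: the number of pairs
    # (x, y) with x > y, counting ties as half a point. It only uses
    # rank order, so it makes no normality assumption about the
    # underlying distributions - bimodal samples are fine.
    u = 0.0
    for x in a:
        for y in b:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u

# Toy example: every run in `fast` beats (is lower than) every run
# in `slow`, so U is 0 one way and len(a) * len(b) the other.
fast = [100, 101, 102]
slow = [110, 111, 112]
```

The two directional U values always sum to len(a) * len(b), which is a handy sanity check when implementing it.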
What you do past that point, dropping outliers, is a risky operation [1] that has far-reaching consequences for the validity of the interpretation. I believe that if you apply the same manual removal of outliers to twinopen on windows7-32 you'll get a perf win, but I'd be very cautious about its validity.
In other words, if we don't trust the significance of talos results, I believe we should not try to "read between the lines" and deduce what we believe to be true, because the data will naturally give us the ability to select bits that fit our preconceptions.
Instead, we should revisit the tools we use (t-test?), and if we want to remove outliers, we should do it using statistical tools, not human intuition. One such tool is the IQR score [2].
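A sketch of what such a mechanical outlier rule looks like (Tukey's 1.5 x IQR fences; the sample numbers below are made up for illustration, with two slow outliers in the spirit of the ~380 ms ones discussed above):

```python
from statistics import quantiles

def iqr_outliers(data, k=1.5):
    # Tukey's rule: flag points farther than k * IQR outside the
    # first/third quartiles. Purely mechanical - no eyeballing.
    q1, _, q3 = quantiles(data, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lo or x > hi]

# Made-up ts_paint-like runs (ms) with two slow outliers.
runs = [120, 121, 121, 122, 122, 123, 123, 124, 124, 125, 380, 381]
```

Note that the rule only works when outliers are a small fraction of the sample; with too few runs the outliers themselves inflate the quartiles and nothing gets flagged.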
Finally, as you can see from the graphs, we're comparing multimodal and normal distributions, which are bound to have different characteristics.
That means we have a pretty high chance of chasing ghosts and trying to normalize the numbers to behave in line with the baseline, even if that doesn't match the actual impact on user experience.
As I stated in comment 44, at some point, assuming my patches change the behavior and distribution, we'll have to face a balance of gains and costs that are not proven to be statistically significant and try to deduce what it means.
Gijs, in the comment above, removes outliers in a selected result when the outcome doesn't match the hypothesis, until it does. At the same time, he doesn't remove the outlier in "ts_paint/windows10-64-shippable-qr", which might drop the significance from mid to low.
I'm afraid that such a strategy will always be available to prove a point.
We can also delve into outliers. If in twinopen/windows10-64-shippable-qr m-c has outliers, and those outliers also happen in real life on user machines, and my patch removes those outliers but regresses the median by 0.2%, is it a win or a loss?
My question is: if we never get to the point where the data in talos shows exactly what we want to see, and instead it remains chaotic and fluctuating - whose call is it that it's good enough?
Gijs? Mconley? esmyth?
And if we determine that there indeed is a hidden regression, say 0.3% on ts_paint - whose call is it to weigh it against YSOD or the overall removal of DTD?
Because I'm concerned that we're using the wrong tools to read the data and keep delaying migration of the startup path to conform to a wrong test.
[0] https://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U_test
[1] https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/
[2] https://en.wikipedia.org/wiki/Interquartile_range#Outliers
Comment 55•5 years ago
> Is this a temporary or fundamental hold-up to the caching work?
I'm not sure. I'm trying to verify it with smaug and bdahl this week.
Comment 56•5 years ago
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #54)
> My question is: if we never get to the point where the data in talos shows exactly what we want to see, and instead it remains chaotic and fluctuating - whose call is it that it's good enough?
> Gijs? Mconley? esmyth?
esmyth.
Comment 57•5 years ago
(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #54)
> Gijs, in the comment above, removes outliers in a selected result when the outcome doesn't match the hypothesis, until it does. At the same time, he doesn't remove the outlier in "ts_paint/windows10-64-shippable-qr", which might drop the significance from mid to low.
> I'm afraid that such a strategy will always be available to prove a point.
FWIW, I didn't look at the win10 qr result in detail as it had only medium confidence, only the non-qr win10-64 shippable, which afaict doesn't have significant outliers and has a "high" confidence t-test, not medium. Runs before are 314-318 (11 samples), runs after are 317-324 (19), with a 1.98% regression marked. I don't see a bimodal distribution in the graphs for ts_paint win10-shippable (non-qr) either (neither before nor after), but perhaps I'm not looking at the same thing as you?
To get more reliable distributions it's probably worth doing a trypush off the base revision instead of just taking the data from the last 2 days of m-c runs, to reduce the potential for additional noise caused by other patches. For the general point of whether specific tests are bimodal or not, it seems worth having a conversation with the perf infra team folks about stabilizing those tests and/or having perfherder employ better tests when describing the "results" of comparing two revisions -- in a separate bug, of course.
Comment 58•5 years ago
With the landing of bug 1517880 and now bug 1501886 we now have Fluent on the startup path with no talos regressions.
Hooray!
Comment 59•5 years ago
> For the general point of whether specific tests are bimodal or not, it seems worth having a conversation with the perf infra team folks about stabilizing those tests and/or having perfherder employ better tests when describing the "results" of comparing two revisions -- in a separate bug, of course.
I still intend to file a bug about it; I just failed to prioritize it so far. Setting NI.
Comment 60•5 years ago
This is now handled in bug 1581533 and bug 1601952! Hooray!