Closed Bug 1004429 Opened 11 years ago Closed 10 years ago

5% tart regression for linux 64 on fx-team (v32) April 24

Categories

(Firefox :: Search, defect)

Platform: x86_64 Linux
Type: defect
Priority: Not set
Severity: normal
Tracking


RESOLVED WONTFIX
Tracking Status
firefox32 - ---

People

(Reporter: jmaher, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [talos_regression] p=13 [qa-])

Attachments

(2 files)

A few thoughts:

- There's now more visible content in the page's source -- the search bar. I don't know whether the time to build that content and the time to draw it contribute to the regression. In the preloaded case, the time to build it shouldn't contribute.

- When the page moves out of the preloader and is shown, the search panel asynchronously builds itself: it creates and adds to itself a small number of nodes per search engine. "Asynchronously" because it sends an async message to chrome, waits for a response, and then builds itself based on the response. The response is large because it contains some base64-encoded small images. I don't know if that matters.

- When the page loads, the search bar's width is changed along with the grid's. In the preloaded case, that ought to happen before the tab is shown. In the non-preloaded case, resizing the search bar shouldn't be any more expensive than resizing the grid, which was already happening before these changesets.
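(For illustration, the round trip described above looks roughly like this at the frame-script level; the message names, payload fields and element id below are invented and do not match the real newTab.js / ContentSearch code:)

  // Illustrative sketch only - names and fields are made up, not the real protocol.
  // Content side: ask chrome for the current search state...
  sendAsyncMessage("Search:GetState", {});
  // ...and build the panel asynchronously when the (large) reply arrives;
  // each engine in the reply carries a small base64-encoded icon.
  addMessageListener("Search:State", function (msg) {
    let doc = content.document;
    let panel = doc.getElementById("search-panel"); // hypothetical id
    for (let engine of msg.data.engines) {
      let item = doc.createElement("div"); // a small number of nodes per engine
      item.textContent = engine.name;
      item.style.backgroundImage = "url(" + engine.iconDataURI + ")";
      panel.appendChild(item);
    }
  });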
Thanks :adw! I assume the tab animation stuff would be affected by this; is there any reason why this would be only on linux64?
(In reply to Joel Maher (:jmaher) from comment #2)
> is there any reason why this would be only on linux64?

I'm guessing that on other platforms it's within the noise level so it doesn't trigger the notice, since the regression is only for the newtab sub tests (2 tests out of 10).

Drew, when I look at datazilla on the affected graph, I see that newtab suffers, both with and without preload, and both are about the same magnitude of regression:

https://datazilla.mozilla.org/?start=1397615525&stop=1398964948&product=Firefox&repository=Fx-Team&os=linux&os_version=Ubuntu%2012.04&test=tart&x86=false&project=talos

I couldn't quite understand from comment 1 if you also expected the regression with preload. The without-preload numbers are only there to make us notice internal perf changes, but in practice users have preload by default for newtab, so we don't really care if it regressed without preload. newtab regressions with preload are visible to users, so we do care about them.

And when you look only at the preload values, you see that they regressed meaningfully:

- Average interval (.all): from 2.9 to 4.2ms: 45% regression.
- Average interval over the 2nd half of the animation (.half): 2.5 to 3.7ms (and noisier): 48% regression.
- The .error values didn't change meaningfully.

So IMO it warrants at least some investigation. The newtab page is used frequently enough, and a lot of effort went into making its tab animation smooth, including the preload system and other efforts.
Flags: needinfo?(adw)
(In reply to Avi Halachmi (:avih) from comment #3)
> I couldn't quite understand from comment 1 if you also expected the
> regression with preload.

I don't know; I'm just trying to explain things that may cause regressions in either case. I don't see how any of those things would cause meaningful regressions, but what do I know.
Flags: needinfo?(adw)
Hey, we are all in this together. Maybe a next step could be backing out a small part on try to see if we can narrow down the scope of code that causes the problem.
(In reply to Joel Maher (:jmaher) from comment #5)
> Hey, we are all in this together.

??? Did I say something wrong? I'm saying I don't know what caused the regressions, that's all.

I'm starting to think about:newtab should be a blank page that builds itself after the tab animation has finished and then fades itself in. We'd never regress TART no matter what we did (see this bug and the sponsored tile bugs), and we could get rid of the preloader, which is a pain to work with and reason about, especially in tests.
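(A minimal sketch of that idea, assuming some signal fires when the tab animation ends - the event name and the transition details here are invented:)

  // Sketch only: "TabAnimationEnd" is a hypothetical signal, and the div is a
  // stand-in for the real grid/search content the page would build here.
  document.body.style.opacity = "0";
  document.body.style.transition = "opacity 150ms ease-in";
  addEventListener("TabAnimationEnd", function build() {
    removeEventListener("TabAnimationEnd", build);
    let grid = document.createElement("div");
    grid.id = "newtab-grid";
    document.body.appendChild(grid);      // build the real content here
    requestAnimationFrame(function () {
      document.body.style.opacity = "1";  // then fade the finished page in
    });
  }, false);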
Nobody said anything wrong. Nice idea about the newtab being blank, then fading in. Is that on the roadmap?
(In reply to Drew Willcoxon :adw from comment #6)
> I'm starting to think about:newtab should be a blank page that builds itself
> after the tab animation has finished and then fades itself in.

We did consider this approach about a year ago when we tried to improve the animation for about:newtab. Another complementary approach was to delay the animation start by ~100ms so that it only starts after the rendering has settled. Eventually ttaubert came up with the preload system, which seemed reasonably good at the time. Since then we found out that while preload mitigates the perf costs of creating the page, there are still more costs which manifest when the page actually gets rendered (which was followed by some XUL simplification experiments, etc.).

> ... We'd never regress TART no matter what we did ...

The goal is not to never regress the numbers. The goal is to have the best user experience we can produce. The numbers are only a gauge which shows us where the performance changed; they make us, as humans, aware of the cost of a new feature, and then we can evaluate if the value of the feature is worth its cost. We don't have hard thresholds because there can't be any, but hopefully with enough collective experience we can come up with a good answer to this question.

My personal opinion on this is that this specific cost is not negligible. But I can't assess the value of the search bar (personally for me it's zero, but what do I know).
(In reply to Avi Halachmi (:avih) from comment #8)
> My personal opinion on this is that this specific cost is not negligible.
> But I can't assess the value of the search bar (personally for me it's zero,
> but what do I know).

Oh, OK, it's when I said "but what do I know." I've been turning Joel's comments over and over again in my mind all day. Let me apologize right now, here publicly in the bug, before going on.

I perceived this to be a platform problem, and I said that as someone who works on the front end and often doesn't understand how seemingly simple things cause complex platform problems like this. It's probably not right to consider this a platform problem, but I don't think that's important right now. "But what do I know" wasn't directed at you or anyone but at all similar problems that frustrate me and that I feel are out of my reach.

But I took a little offense at your "but what do I know," which makes me see that you probably took offense at mine. I don't know whether yours was directed at me with that intention, but I'm sorry.
(In reply to Drew Willcoxon :adw from comment #9)
> ... But I took a little offense at your
> "but what do I know," which makes me see that you probably took offense at
> mine.

Apologies if that seemed offensive, but it was completely unrelated. I also didn't find your comment offensive in any way :) My "what do I know" was only me acknowledging that even if I don't find it useful, I can't assess the value of the new search bar from a marketing/product perspective. My entire post was informative rather than any kind of retort.

I also don't believe anyone suggested that this is a platform issue. FWIW, to me it looks like a front end issue, and joel also blocked a front end related bug.

As for your general frustration with regression queries coming your way, I feel for you, and you may want to share your thoughts on bug 990644.
(In reply to Drew Willcoxon :adw from comment #9)
> ... I perceived this to be a platform
> problem, and I said that as someone who works on the front end and often
> doesn't understand how seemingly simple things cause complex platform
> problems like this. It's probably not right to consider this a platform
> problem, but I don't think that's important right now.

Sorry, I somehow misread this. I thought you couldn't understand how it's a platform issue.

Anyway, it's an issue, and it's a result of a front end change, which might have triggered a platform deficiency. OTOH, platform is not magic. It's true that we'd like to add elements and expect them to just work without regressing anything, but in practice some stuff costs us. This bug is here to inform us of that cost, and from here we/you can consider alternatives or fixes if enough people think it's worth the time, including the one to delay all the about:newtab rendering.
I pushed some runs to try that disable parts of the page. Each one builds on the previous, so they're cumulative.

Baseline with no changes:
https://tbpl.mozilla.org/?tree=Try&rev=b6fa7d0d83a7

No resizing the search bar:
https://tbpl.mozilla.org/?tree=Try&rev=522c5651e0d2

No building the search panel:
https://tbpl.mozilla.org/?tree=Try&rev=bf0229c8d571

No setting up the logo node:
https://tbpl.mozilla.org/?tree=Try&rev=e8043160258e

No sending a message to chrome to request the initial search state:
https://tbpl.mozilla.org/?tree=Try&rev=a60d5c0b1eea

No message passing machinery at all:
https://tbpl.mozilla.org/?tree=Try&rev=dc3d7381c08a

Reverted changes to the XUL and most of the CSS:
https://tbpl.mozilla.org/?tree=Try&rev=13f45a0f910f
All the patches were around 5 ms or slightly under until the final patch (https://tbpl.mozilla.org/?tree=Try&rev=13f45a0f910f), which went down to 3.58 ms. That's still not the 3 ms that we were at before my changesets (according to the graph Avi linked to in comment 3). Those numbers are from newtab-open-preload-yes.all.TART.

As a sanity check I pushed a patch that completely reverts the newtab changeset, https://hg.mozilla.org/integration/fx-team/rev/a36dd9f25739:
https://tbpl.mozilla.org/?tree=Try&rev=e0ad2dd49013

And then another based on that one that also reverts the lower-level ContentSearch changeset, https://hg.mozilla.org/integration/fx-team/rev/eeafc69ebfb1:
https://tbpl.mozilla.org/?tree=Try&rev=2d160dc30fde

One of those should bring the number back down to 3 ms.
Assignee: nobody → adw
Status: NEW → ASSIGNED
Both of those were 3.7 ms, which is not 3. There was one small subsequent fix that Ed landed in http://hg.mozilla.org/mozilla-central/rev/5a01a53e80e8, so I tried reverting that on top of the other two reverts:
https://tbpl.mozilla.org/?tree=Try&rev=ad5deb12df72

But that was 3.7, too.

Observations so far: merely the presence of the XUL, CSS, and one related querySelector() call (https://hg.mozilla.org/try/rev/13f45a0f910f) seems to account for 1.4 ms of the 2 ms regression. But even after reverting the changesets there's ~0.6 ms that I can't account for.
(In reply to Drew Willcoxon :adw from comment #14)
> But even after reverting the changesets there's ~0.6 ms that I can't account
> for.

This could be due to unrelated regressions which landed since, or due to some noise in the results, or maybe you didn't actually back out everything.

My numbers in comment 3 were not scientific. I looked at the graph and tried to pick the average value before and after the change, but without actually calculating it (I don't know if we have a tool for that other than the dev.tree-management notifications, but those only notify of average values of all the subtests together - they don't do an average value of each subtest by itself).
I did some retriggers and the reduction became clearer.

The first reduction:

No sending a message to chrome to request the initial search state:
https://tbpl.mozilla.org/?tree=Try&rev=a60d5c0b1eea

The second reduction:

Reverted changes to the XUL and most of the CSS:
https://tbpl.mozilla.org/?tree=Try&rev=13f45a0f910f

So with those two changes, we are back to normal.
Flags: firefox-backlog+
Avi, do you know how I might run TART locally and hook the profiler into it? It would be nice to get data instead of pushing to try with things disabled.
Sure:

- Clone the talos repo (http://hg.mozilla.org/build/talos).
- Zip the content of talos/talos/page_load_test/tart/addon, rename the zip to .xpi and install the addon.
- Then visit chrome://tart/content/tart.html and follow the instructions on screen to use TART.
- Use the profile as you would usually do.

You can also get profiling on try pushes by adding 'mozharness: --spsProfile' or via try chooser.
Note that probably due to bug 967635, the TART addon is broken in a normal browser outside of talos. Though it might work when the profiler addon is installed. If tart doesn't start the animation when you request it to, try to build the addon from talos changeset 751:0ce52a5fcce0 (just before the breakage landed).
Whiteboard: [talos_regression] → [talos_regression] s=IT-32C-31A-30B.1
Flags: needinfo?(jbecerra)
Whiteboard: [talos_regression] s=IT-32C-31A-30B.1 → [talos_regression] s=IT-32C-31A-30B.1 [qa?]
Flags: needinfo?(jbecerra)
Whiteboard: [talos_regression] s=IT-32C-31A-30B.1 [qa?] → [talos_regression] s=IT-32C-31A-30B.1 [qa-]
Whiteboard: [talos_regression] s=IT-32C-31A-30B.1 [qa-] → [talos_regression] p=13 s=it-32c-31a-30b.1 [qa-]
Whiteboard: [talos_regression] p=13 s=it-32c-31a-30b.1 [qa-] → [talos_regression] p=13 s=it-32c-31a-30b.2 [qa-]
(In reply to Joel Maher (:jmaher) from comment #16)
> So with those two changes, we are back to normal.

I don't see that at all. Am I doing it wrong? The first changeset has two dots in the newtab-open-preload-yes.all.TART graph, the lower one having an average of 3.81. The second changeset has one dot with an average of 3.58. In the graph in comment 3, all the dots are around 3.0 before the regression.
I have tried a dozen times to get the graph from comment 3 to open with no luck :( Keep in mind that as time goes on, other regressions or improvements to the benchmark can land. If you look at current values (or values on a try push with the latest source as your base), the old absolute values won't be there.
Attached image datazilla-comment-3.png - taken 2014-05-16 (deleted)
I also tried to look at the graph from comment 3. The first time, datazilla seemed to be waiting for data forever; then I reloaded the page, and it loaded after less than a minute. I'm attaching a screenshot of how it looked on my screen.

This looks completely different than what I saw when I posted comment 3 - many of the close animations now look bimodal and I'm pretty certain they weren't back then, and there's no big jump on both preload-yes and preload-no .error on Apr-24. I'm pretty certain I also verified that the link works - after I posted it.

I can only conclude that right now datazilla should not be relied on.

Either that, or I completely messed up in comment 3. But right now I'm suspecting datazilla.
(In reply to Avi Halachmi (:avih) from comment #22)
> I can only conclude that right now datazilla should not be relied on.
>
> Either that, or I completely messed up in comment 3. But right now I'm
> suspecting datazilla.

Or maybe neither. Before Apr-24, the graph shows similar values to the numbers I posted in comment 3 (the .all and .half values of preload-yes), and a jump in values around Apr-25th. Maybe a bit less than what I assessed in comment 3, but still there.

I was expecting to see the jump in the .error values, and only after I posted did I reread comment 3 and notice that the .error values were not affected.
I've tried a lot of things to fix this, but I'm no closer to understanding what the problem is. This is as best as I can describe the problem: about:newtab has more content and is doing more than it used to. That's it. There doesn't seem to be one thing or a smoking gun.

Using Instruments.app, I profiled a baseline (with search-in-newtab reverted) and trunk while I ran TART. The baseline took 28s and trunk took 38s (give or take; I did multiple runs and profiles). Below is a call tree that represents the samples taken during the profiles. It's actually a heavily edited diff between the baseline and trunk. Each node starts off with +Nms, meaning that trunk took N additional ms over the baseline for that stack.

Some things show up that make sense to me, like a ton of JS execution, appendChild, cloneNode, querySelector, setAttribute -- all things that happen, or happen more, with the search bar. But there's a lot of native stuff happening that I don't understand.

Based off the Instruments profiles, I pushed different things to try, like eliminating the appendChilds and querySelector and even removing all the new JS -- so that no JS related to search runs at all -- but nothing alone makes a significant difference, according to try and the resulting datazilla graphs.

> +4298.3ms nsThread::ProcessNextEvent
> +1740.8ms nsRefreshDriver::Tick(long long, mozilla::TimeStamp)
> +1695.5ms mozilla::dom::FrameRequestCallback::Call
> +1695.3ms JS::Call
> +1142.5ms mozilla::dom::NodeBinding::appendChild
> +1088.5ms mozilla::dom::XULDocument::EndUpdate
> +975.6ms nsContentUtils::RemoveScriptBlocker
> +975.5ms nsFrameLoader::MaybeCreateDocShell
> +973.8ms nsFrameLoader::EnsureMessageManager
> +973.6ms nsFrameMessageManager::LoadFrameScript
> +495.6ms nsInProcessTabChildGlobal::Init
> +495.2ms nsXPConnect::InitClassesWithNewWrappedGlobal
> +474.2ms JS_InitStandardClasses
> +478.0ms nsFrameScriptExecutor::LoadFrameScriptInternal
> +477.6ms JS::CloneAndExecuteScript
> +233.6ms js::CallJSNative
> +32.0ms non-virtual thunk to nsInProcessTabChildGlobal::SendSyncMessage
> +222.1ms mozilla::dom::HTMLInputElementBinding::select
> +197.5ms nsFocusManager::SetFocus
> +42.0ms nsViewManager::ProcessPendingUpdates
> +580.7ms non-virtual thunk to mozilla::LazyIdleThread::Notify
> +307.5ms nsTimerEvent::Run
> +165.1ms nsXPCWrappedJS::CallMethod
> +53.0ms mozilla::RefreshDriverTimer::TimerTick
> +239.2ms nsInputStreamReadyEvent::Run
> +218.1ms nsDocShell::CreateContentViewer
> +112.8ms nsDocumentViewer::Init(nsIWidget*, nsIntRect const&)
> +168.2ms nsXPCWrappedJS::CallMethod
> +165.3ms nsINode::AppendChild
> +1147.8ms NotifyOffThreadScriptCompletedRunnable::Run
> +1127.8ms non-virtual thunk to mozilla::dom::XULDocument::OnScriptCompileComplete
> +911.5ms mozilla::dom::XULDocument::ExecuteScript
> +43.9ms mozilla::dom::GenericBindingMethod
> +16.9ms mozilla::dom::NodeBinding::cloneNode
> +13.1ms mozilla::dom::ElementBinding::querySelector
> +8.8ms mozilla::dom::ElementBinding::setAttribute
> +4.8ms mozilla::dom::NodeBinding::appendChild
> +215.6ms mozilla::dom::XULDocument::ResumeWalk
> +214.6ms mozilla::dom::XULDocument::StartLayout()
> +210.2ms nsCSSFrameConstructor::ContentInserted
> +785.0ms nsTimerEvent::Run()
> +1185.9ms mozilla::RefreshDriverTimer::Tick()
Can you help?
Flags: needinfo?(avihpit)
(In reply to Drew Willcoxon :adw from comment #24)
> This is as best as I can describe the problem:
> about:newtab has more content and is doing more than it used to. That's it.

Yup.

> There doesn't seem to be one thing or a smoking gun. Using Instruments.app,
> I profiled a baseline (with search-in-newtab reverted) and trunk while I ran
> TART. The baseline took 28s and trunk took 38s (give or take; I did
> multiple runs and profiles). Below is a call tree that represents the
> samples taken during the profiles. It's actually a heavily edited diff
> between the baseline and trunk. Each node starts off with +Nms, meaning
> that trunk took N additional ms over the baseline for that stack.

I'm guessing 28 ms and 38 ms (not seconds)? Of what exactly?

It would probably be best to profile while all the non-newtab subtests are disabled, but I'm guessing that this is what you did, because it shows a 33% regression, which is roughly what we estimated earlier when we looked only at the newtab tests.

Technically, I'm probably unable to help much, since I'm unfamiliar with the internal implementation of the search field or the costs it apparently induces. But I asked Mike to help a bit.

If you guys conclude that that's more or less the best which can be done with the current code without tweaking it to death, then we'll need to think how much effort we want to put into this and what to do about it in general.
Flags: needinfo?(avihpit)
(In reply to Avi Halachmi (:avih) from comment #26)
> I'm guessing 28 ms and 38 ms (not seconds)?

No, seconds. I did this on a debug build and ran the test 3 times.

> Of what exactly?

I pasted the call tree.

> It would probably be best to profile while all the non-newtab subtests are
> disabled, but I'm guessing that this is what you did

This is what I did.
(In reply to Drew Willcoxon :adw from comment #27)
> (In reply to Avi Halachmi (:avih) from comment #26)
> > I'm guessing 28 ms and 38 ms (not seconds)?
>
> No, seconds. I did this on a debug build and ran the test 3 times.
>
> > Of what exactly?
>
> I pasted the call tree.

I'm confused. TART shouldn't take longer or shorter to complete the run - it only measures higher or lower intervals and other durations, and should take exactly the same amount of seconds to complete if you run the same set of tests with a different build. Are you saying that the TART run with the newtab search takes 38s to complete, while without it it takes 28s to complete?

And even if yes, I also don't understand how you conclude that longer or shorter run time correlates to TART regression reports. Not saying that you made an incorrect conclusion - just that I don't understand it.

When you run TART locally, once the test finishes, it outputs a report to the browser window (at the bottom). Whatever talos thinks regressed, you should be able to see similar regression numbers locally at the end of the TART run, on the browser screen and without a profiler.

Once you're able to do that - see on your own system that TART reports different results for different builds - then you should profile and see if it shows interesting differences in the profile. The numbers which interest us most are the results for:

newtab-open-preload-yes.all
newtab-open-preload-yes.half

Are you able to see in the TART report, once it finishes, that these numbers are worse by about 40-50% with the newtab search than without it?
I call startProfiling at the end of Tart.prototype._startTest, right before the _doSequence call. I call stopProfiling in _doneInternal. In tart.html, the only box I check under "Configure TART" is "newtabYesPreload". I set "Repeat" to 3. I uncheck "[Uncheck when profiling]". I hit the Start button.

When I load the resulting profiles in Instruments, profiles without search are ~30s, profiles with search are ~40s, both of which times match my subjective experience of waiting for the test to finish.

(In reply to Avi Halachmi (:avih) from comment #28)
> And even if yes, I also don't understand how you conclude that longer or
> shorter run time correlates to TART regression reports.

I don't think I said that? But I don't understand what you mean. In the profiles' call trees, I'm looking for calls in the with-search profiles that are absent from the without-search profiles, and calls in the with-search profiles that take longer than their counterparts in the without-search profiles. Is that not right? Some of those calls are in the call tree diff above.

> Once you're able to do that - see on your own system that TART reports
> different results for different builds - then you should profile and see if
> it shows interesting differences in the profile.

I am able to see that, and that's what I've been profiling, reporting, and trying to make sense of. Without profiling, I've been setting "Repeat" to 10 and running "newtabYesPreload": with search, I'm getting anywhere from 90-180ms for newtab-open-preload-yes.all, again on a debug build if that matters; without search, 60-70ms.

When I think I've found a lead and try a patch locally, and the patch reduces the test time significantly, but usually not consistently, I push to try and check datazilla, but the results aren't repeated there. I don't know whether these patches actually improve things but only by a very small amount that's lost in the noise, or they don't improve anything at all.
(In reply to Drew Willcoxon :adw from comment #29)
> I call startProfiling at the end of Tart.prototype._startTest, right before
> the _doSequence call. I call stopProfiling in _doneInternal. In tart.html,
> the only box I check under "Configure TART" is "newtabYesPreload". I set
> "Repeat" to 3. I uncheck "[Uncheck when profiling]". I hit the Start
> button.

I see. Do I understand correctly that you modified TART? If yes, what is your base build? And please post the patch you used to modify it. Without seeing the patch and after reading the above, I'd say that it's probably incorrect and captures more than it should, but post the base TART build and the patch, and let's take it from there.

> When I load the resulting profiles in Instruments, profiles without search
> are ~30s, profiles with search are ~40s, both of which times match my
> subjective experience of waiting for the test to finish.

I'm not familiar with "Instruments", but I still can't understand where the difference could come from. As far as I can guess what your TART patch does, it still shouldn't result in this. I do know that TART is not expected (or programmed) to take a different duration to complete based on different performance of the build. So something here is definitely unexpected, at least to me. We should first make sure we don't have unexpected stuff before going on to interpret the results it produced, IMO.

> (In reply to Avi Halachmi (:avih) from comment #28)
> > And even if yes, I also don't understand how you conclude that longer or
> > shorter run time correlates to TART regression reports.
>
> I don't think I said that? But I don't understand what you mean.

Since we're trying to address an issue which was noticed in TART results, this makes TART the reference, but you quoted numbers which were not taken with TART, so I assume you believe your numbers somehow correlate to the TART regressions. Is this not the case?

> In the profiles' call trees, I'm looking for calls in the with-search
> profiles that are absent from the without-search profiles, and calls in the
> with-search profiles that take longer than their counterparts in the
> without-search profiles. Is that not right? Some of those calls are in the
> call tree diff above.

Yes, this sounds very plausible, as long as your profiles are collected correctly.

> Without profiling, I've been setting "Repeat" to 10 and running
> "newtabYesPreload": with search, I'm getting anywhere from 90-180ms for
> newtab-open-preload-yes.all, again on a debug build if that matters; without
> search, 60-70ms.

Right, so this means you can reproduce the talos results locally, and see even bigger regressions. Excellent. We couldn't continue without it.

> When I think I've found a lead and try a patch locally,
> and the patch reduces the test time significantly, but usually not
> consistently

Yup, that's the hard part. Hopefully we can make it easier by grabbing the profiles more accurately.
Attached patch TART native profiling patch (deleted) — Splinter Review
This patch is based on http://hg.mozilla.org/build/talos/file/53c8b1ad4828.

(In reply to Avi Halachmi (:avih) from comment #30)
> Without seeing the patch and after reading the above, I'd say that it's
> probably incorrect and captures more than it should,

The point is to compare profiles, so as long as two profiles capture the same sequence of events, it doesn't really matter if they capture "more than they should," which as far as I can tell they don't anyway.

> I do know that TART is not expected (or programmed) to take a different
> duration to complete based on different performance of the build.

There are two aspects to what you're saying here that I don't understand. Profiling the build always makes it run way slower, so times under profiling will be longer than when not profiling. And if TART isn't expected to take longer when a build's performance is worse, then what's the point? What's it measuring? Clearly something about newtab search made performance worse, increasing TART's duration from 3ms to 5ms on Linux 64 opt or whatever.

> Since we're trying to address an issue which was noticed in TART results,
> this makes TART the reference, but you quoted numbers which were not taken
> with TART, so I assume you believe your numbers somehow correlate to the
> TART regressions. Is this not the case?

You've lost me. Where did I quote numbers not taken with TART? And why do you assume I'm comparing apples to oranges?
(In reply to Drew Willcoxon :adw from comment #31)
> The point is to compare profiles, so as long as two profiles capture the
> same sequence of events, it doesn't really matter if they capture "more than
> they should," which as far as I can tell they don't anyway.

When you have a hard time using the profile data, then the less irrelevant (not measured in TART) stuff it contains, the easier it will be to see the differences between profiles and what's important.

The TART version you used as a base already starts and stops the profiler (at least it should) whenever it starts and stops the tab animation. See:
http://hg.mozilla.org/build/talos/file/53c8b1ad4828/talos/page_load_test/tart/addon/content/tart.js#l190

If you used one version earlier, then it wouldn't have, and would instead only add profile markers. But your profile still also includes tab close animations, and warmup runs which aren't measured. If the profile is already useful for you as is, fine, but if it isn't, then it could help not to collect those into the profile.

> There are two aspects to what you're saying here that I don't understand.
> Profiling the build always makes it run way slower, so times under profiling
> will be longer than when not profiling.

They run "slower" but it doesn't mean they will take longer to complete. See next.

> And if TART isn't expected to take
> longer when a build's performance is worse, then what's the point? What's
> it measuring? Clearly something about newtab search made performance worse,
> increasing TART's duration from 3ms to 5ms on Linux 64 opt or whatever.

It measures frames per second of the animation. Or rather, average frame intervals, which is 1000/FPS. All the animations will always take 220ms (or whatever they should take), and all the pauses will always take 500ms or 1000ms (depending on the next animation type), and the entire run should always complete in the same duration. The only difference between a "slow" build and a faster one is that the faster one will manage to squeeze more animation frames into each animation (so lower average intervals). And this is what TART measures.

> You've lost me. Where did I quote numbers not taken with TART? And why do
> you assume I'm comparing apples to oranges?

Apologies for any wrong assumptions I may have made. Let's put it aside.
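(In other words, each subtest number is roughly the following - a sketch of the metric, not the actual tart.js code:)

  // Sketch of the metric: the average interval in ms between the frame
  // timestamps recorded during one ~220ms animation. A slower build produces
  // fewer frames (higher intervals), not a longer test run.
  function averageFrameInterval(frameTimestamps) {
    let intervals = [];
    for (let i = 1; i < frameTimestamps.length; i++) {
      intervals.push(frameTimestamps[i] - frameTimestamps[i - 1]);
    }
    let sum = intervals.reduce(function (a, b) { return a + b; }, 0);
    return sum / intervals.length; // e.g. ~16.7ms average ~= 60 FPS (1000 / interval)
  }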
So I took a few minutes to look at this today. I'm afraid I haven't fully gone through the other comments in this bug, so if I'm repeating anything that's already been said or discovered, my apologies for wasting your time. :)

So here's what I did. First - I successfully reproduced the regression on my OS X 10.8 machine.

Next, I turned on paint flashing, and noticed that with the patches applied, we do extra paints for the placeholder text, and then again for the search input, that we didn't do before adw's patches. I thought that was pretty interesting, and so I put visibility: hidden on the newtab-search-container, and noticed that most (if not all) of the regression disappeared. This seems to indicate that painting might be where the bottleneck is.

I also noticed that if I disabled or delayed the handleEvent function in search.js, the newtab-preload regression pretty much disappeared. This makes me think we might be able to eliminate a lot of the TART regression for preloading if we:

a) Make it so that the preloaded browser also sets the current engine
b) The TART test ensures that the newtab preload is complete before continuing the test

I'm not sure if (b) already happens, but I'm reasonably certain that (a) is not currently happening, and I don't see why it shouldn't.

I'll note that this doesn't seem to solve all of the non-preload case, so I'm still poking around there. I hypothesized it had something to do with the box-shadow styling being applied to the search input, since that has always been expensive, but commenting those style rules out didn't give us much win.

I might have some more time to poke around today, but I just thought I'd share my findings and thoughts.
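(For reference, "delaying handleEvent" can be as crude as something like this - a diagnostic hack along the lines described, not the actual change I tried; gSearch/handleEvent are as named in the comments and the 500ms value is arbitrary:)

  // Diagnostic sketch only: postpone the search UI update until well after the
  // ~220ms tab-open animation, so its cost can't land on animation frames.
  let realHandleEvent = gSearch.handleEvent.bind(gSearch);
  gSearch.handleEvent = function (event) {
    setTimeout(function () { realHandleEvent(event); }, 500);
  };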
Thanks, Mike!

(In reply to Mike Conley (:mconley) from comment #33)
> b) The TART test ensures that the newtab preload is complete before
> continuing the test
> I'm not sure if (b) already happens, ...

(b) It doesn't ensure it, but it definitely tries to make that the case. In coordination with ttaubert at the time, the preload was set to start IIRC after 600ms from $something, while TART waits 1000ms before starting a measured newtab animation (in tart.js):

> function(){animate(1000, addTab, next, true, "newtab-open-preload-yes", tabRefDuration);},

For the sake of verifying this and/or making progress, we could change 1000 to 2000 or 5000 to make sure that preload is done before we animate and measure. As a more permanent solution we could consider keeping the wait longer or hanging it off some event. For now, a longer delay should neutralize this factor, if it is indeed a factor.

A small comment though: I prefer to use a duration-based delay rather than an event, because an event will ensure that preload is completed, but if, due to some bug or other reason, the event doesn't fire for 10 or 20 seconds, then we do want to detect it. Also, as a side effect, it could make TART runs considerably longer. But I'm listening to interesting solutions here.

> I'll note that this doesn't seem to solve all of the non-preload case, ...

Re improving non-preload cases, I'll consider it a bonus if it gets fixed, but I probably wouldn't put extra time into fixing it if that's the only issue. Since users do have preload by default and we do expect their newly opened tabs to be preloaded, then as far as I can tell, regressions here don't affect users.

The preload-no test was added to give us some estimation of how "heavy" the newtab page is. If we didn't have it and only tested with preload, the cost could very well be masked, and then one day we could suddenly find that it takes us 500ms to paint the page. The preload-no sub-result prevents that, and gives us an internal gauge to follow, while still knowing it wouldn't affect users.
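(Concretely, that experiment is just bumping the first argument - the pre-animation delay in ms - of the tart.js entry quoted above, e.g.:)

  // Same subtest entry as quoted above, with the pre-animation delay bumped
  // from 1000 to 2000 so preload has more headroom before we measure.
  function(){animate(2000, addTab, next, true, "newtab-open-preload-yes", tabRefDuration);},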
Whiteboard: [talos_regression] p=13 s=it-32c-31a-30b.2 [qa-] → [talos_regression] p=13 s=it-32c-31a-30b.3 [qa-]
(In reply to Avi Halachmi (:avih) from comment #34)
> The preload-no test was added to give us some estimation of how "heavy" the
> newtab page is. If we didn't have it and only tested with preload, the cost
> could very well be masked, and then one day we could suddenly find that it
> takes us 500ms to paint the page. The preload-no sub-result prevents that,
> and gives us an internal gauge to follow, while still knowing it wouldn't
> affect users.

Ok, cool. Thanks Avi.

So, I poked at this a little more after my last comment, and I'm a little confused - it seems that handleEvent in search.js receives some event that causes us to update the UI in two ways during the tab animation:

1) I think it collapses the space for the graphic asset that's supposed to go to the left of the search input
2) I think it sets the placeholder text

However, setting a breakpoint on the only thing that seems to send those events[1] didn't cause me to break, which I found pretty mysterious. Maybe that's just the debugger failing out, or me not understanding how things work here, but it certainly caught my attention. Maybe this is because content.js is being reloaded somehow, and the debugger doesn't know how to deal with it. I've certainly had that happen with breakpoints set for separate windows.

Anyhow, I'm going to assume for a second that the world operates as it should, and this thing actually sends events that are being received by newTab.js / search.js. My question is: is it possible to have that event fired, and that event received while the document is loaded in the hidden preload docshell / browser? In my estimate, that'd be the clearest way to knock out this regression.

[1]: http://dxr.mozilla.org/mozilla-central/source/browser/base/content/content.js#233
Flags: needinfo?(adw)
From IRC:

<ttaubert> the problem is that the event listener only exists in browser windows, not the hidden window
<ttaubert> so initializing the about:newtab search wouldn't work in the preloaded tab
Is it possible to change some of the async stuff to sync? Would it work better? Any downsides to such an approach?
Yeah, we can't do that synchronously as we want to support e10s in the future. There are a lot of bugs waiting for about:newtab to become unprivileged.
Thanks, Mike.

(In reply to Mike Conley (:mconley) from comment #33)
> I thought that was pretty interesting, and so I put visibility: hidden on
> the newtab-search-container, and noticed that most (if not all) of the
> regression disappeared. This seems to indicate that painting might be where
> the bottleneck is.

I couldn't reproduce this locally. TART was slightly better but most of the regression didn't disappear.

> I also noticed that if I disabled or delayed the handleEvent function in
> search.js, the newtab-preload regression pretty much disappeared.

Locally, I can reproduce this some of the time -- it's inconsistent. But when I've pushed similar patches to try, I could never reproduce there, according to datazilla.

(In reply to Mike Conley (:mconley) from comment #35)
> So, I poked at this a little more after my last comment, and I'm a little
> confused - it seems that handleEvent in search.js receives some event that
> causes us to update the UI in two ways during the tab animation:
>
> 1) I think it collapses the space for the graphic asset that's supposed to
> go to the left of the search input
> 2) I think it sets the placeholder text

That's right. When the page first becomes visible (because it either moved out of the preloader or bypassed the preloader), gSearch.setUpInitialState sends a message to chrome, and then chrome replies, which results in a ContentSearchService event sent to the page with a "State" "subtype," which handleEvent dispatches to onState. Then onState creates and adds some nodes to the search popup panel, updates the graphic at the start of the input, and sets the input's placeholder text.

> However, setting a breakpoint on the only thing that seems to send those
> events[1] didn't cause me to break, which I found pretty mysterious.

That does sound strange. Maybe it has to do with the preloader's docshell swap? dump()s would work as a poor alternative.

> My question is: is it possible to have that event fired, and that event
> received while the document is loaded in the hidden preload docshell /
> browser? In my estimate, that'd be the clearest way to knock out this
> regression.

To sidestep your question first, I've had some success locally skipping all of this event and message passing so that no events and messages are passed, which also means that the page is not updated as a result (although here too it's inconsistent). But then again, when I try on try, the numbers don't budge.

But to answer your question, right now we don't load the content script while the page is in the preloader browser. We could do that and then start the message passing while the page is in the preloader. I don't think there would be a problem with that, but maybe Tim would disagree. Since the message passing is async, it would be possible for a new tab to be opened while the message passing is ongoing, so in that case the search part of the page would not be set up in the preloader, but that's no worse than what happens currently.
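(Spelled out as a sketch: the "ContentSearchService" event and the "State" subtype are as described above; the field names, init wiring and console.log are illustrative stand-ins for the real search.js code.)

  // Paraphrased sketch of the page-side dispatch described above.
  let gSearchSketch = {
    init: function () {
      // the page listens for replies relayed from chrome as DOM events
      addEventListener("ContentSearchService", this, false);
    },
    handleEvent: function (event) {
      // dispatch on the message subtype, e.g. "State" -> this.onState(data)
      let methodName = "on" + event.detail.type;
      if (typeof this[methodName] == "function") {
        this[methodName](event.detail.data);
      }
    },
    onState: function (data) {
      // the real code builds the engine list in the panel, updates the graphic
      // at the start of the input, and sets the input's placeholder text here
      console.log("current engine: " + data.currentEngineName);
    },
  };
  gSearchSketch.init();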
Flags: needinfo?(adw)
I pushed Mike's visibility: hidden idea to try: https://tbpl.mozilla.org/?tree=Try&rev=9cb170b869ee I couldn't reproduce the win locally, but maybe try can.
Maybe, as an alternative, somehow build and prepare the search bar widget in advance, so that when the new tab is opened, we don't have to update anything? I think the current system also has the negative side effect that the search bar starts one way, then immediately changes in front of the user with a visible "flicker" which also causes this bug's performance regression.

As for the implementation details of when to do this update, I could think of two approaches:

1. Once on startup, and then whenever the default search engine changes - update this widget.

2. When opening a new tab, query the search provider; if it's the same as last time, do nothing, as we already have the widget for it. If it has changed since we built/got the widget, then behave like we do now - with the flicker and perf regression.

I think 2 would be simpler to implement (see the sketch below), and having this regression only on the first newtab, and once after the search engine has changed, would be a very good improvement compared to now.
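(Approach 2, sketched with invented names just to show the shape of the check:)

  // Sketch of approach 2: only rebuild the search widget when the default
  // engine actually changed since the last newtab. All names are invented.
  let gLastEngineName = null;

  function rebuildSearchWidget(state) {
    // the expensive path we currently take on every newtab: rebuild the panel,
    // the logo and the placeholder text from the state we got from chrome
  }

  function maybeUpdateSearchWidget(state) {
    if (state.currentEngineName == gLastEngineName) {
      return; // same provider as last time: keep the prebuilt widget, no flicker
    }
    gLastEngineName = state.currentEngineName;
    rebuildSearchWidget(state);
  }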
(In reply to Avi Halachmi (:avih) from comment #41)
> Maybe, as an alternative, somehow build and prepare the search bar widget in
> advance, so that when the new tab is opened, we don't have to update
> anything?

That's what Mike was suggesting and I commented on, yes. But again, in one of the comments above I ripped out all the JS and the regression remained, according to try.

> I think the current system also has the negative side effect that the search
> bar starts one way, then immediately changes in front of the user with a
> visible "flicker" which also causes this bug's performance regression.

It's not at all clear what's causing the regression.
(In reply to Drew Willcoxon :adw from comment #42)
> It's not at all clear what's causing the regression.

I'd take back, then, the claim that this is the cause of the regression, but the "flicker" side effect remains.
I pushed to try the parent revision of my revisions, i.e., the tip of the tree before my revisions:
https://tbpl.mozilla.org/?tree=Try&rev=253b4fb54379

And then the last of my search revisions, i.e., the tip of the tree after my revisions:
https://tbpl.mozilla.org/?tree=Try&rev=ab1678b13ec8

That's different from what I did in comment 12, where all of those try pushes were based on the same parent, but that parent was the current tip at the time I wrote comment 12, not the parent of my search revisions. So this should factor out any unrelated changes that landed between the time that my search revisions landed and now.

If you look at newtab-open-preload-yes.all.TART on datazilla.m.o, the two pushes have the same average, 3.75. In other words, datazilla shows no regression between the parent of my revisions and my revisions:
https://datazilla.mozilla.org/?start=1400798201&stop=1401403001&product=Firefox&repository=Try-Non-PGO&os=linux&os_version=Ubuntu%2012.04&test=tart&graph_search=ab1678b13ec8,253b4fb54379&tr_id=5704335&graph=newtab-open-preload-yes.all.TART&x86=false&project=talos

If you look at the graph from comment 0 on graphs.m.o and expand it from March to now, you see a regression around Apr 24, when my search revisions landed, but then a comparable improvement around May 8. In other words, there doesn't seem to be a sustained regression:
http://graphs.mozilla.org/graph.html#tests=[[293,64,35]]&sel=none&displayrange=90&datatype=running

If you look at compare-talos, you see a 39.14% regression from the parent revision of my revisions to my revisions:
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=37344817&newTestIds=37344577&testName=tart&osName=Ubuntu%2064&server=graphs.mozilla.org

I pushed other patches to try that disable parts of search, and at least compare-talos responded with changes in the numbers, unlike datazilla, so now I feel like I have something to work with.

I am so sick of this. Which of our many tools and sites am I supposed to be using to confirm and fix this? Which numbers do I believe? Which numbers are the numbers reported by whatever reports them when you land things on m-i/m-c? What am I supposed to do?
(In reply to Drew Willcoxon :adw from comment #44)
> I pushed to try the parent revision of my revisions ...
> If you look at the graph ... there doesn't seem to be a sustained regression.

Great approach, very useful results, and indeed unfortunate.

> If you look at compare-talos, you see a 39.14% regression from the parent
> revision of my revisions to my revisions ... so now I feel like I have
> something to work with.

Indeed it looks like it.

> I am so sick of this.

Understandably, and so was I when my patches were generating "regressions" which I just didn't believe. The way I handled it was to investigate the tests themselves, which resulted in the understanding that many of them are not as reliable as we wanted them to be. I ended up putting most of my time into making sure that the talos tests are as reliable as I can make them, and that we don't have false regression reports. It's an ongoing process, and I've been doing it for about a year now almost exclusively. I really think it's _that_ important.

> Which of our many tools and sites am I supposed to be
> using to confirm and fix this? Which numbers do I believe? Which numbers
> are the numbers reported by whatever reports them when you land things on
> m-i/m-c? What am I supposed to do?

These are indeed very good questions. While I've been working on making sure the _tests_ produce reliable results, I didn't get to the backend tools which process the results and allow exploring them - both because backend systems, servers, databases etc. are not really my domain, and also because there's still work to do on the test reliability front and this is where I'm able to make a meaningful impact right now.

The backend tools are in a somewhat unfortunate position right now. We have 2.5 "official" systems:

- Graph server - reliable and useful for an overview, but the way it aggregates the sub results into a single number leaves something to be desired in terms of resolution.
- DataZilla - a newer system aimed at replacing graphserver. It's not out of the woods yet in terms of usefulness compared to graphserver, but it is able to display higher resolution results - even if it loses some of the higher level overview which graphserver provides.
- Compare talos - this was started by one person as a "side tool", and became useful enough for many. It isn't void of issues, but it does provide a specific value which neither graphserver nor datazilla provide.

The problem with the backend systems is that apparently they don't get enough resources to improve them. No one is working on graphserver because datazilla should replace it, yet datazilla is still not as useful as graphserver, and it also doesn't get resources to improve it. So we're kinda stuck on this front. MattN seems to be currently able to provide fixes for compare-talos, so effectively he's maintaining it and responds to requests as much as he can.

So there you go. I fully understand your frustration. I'm spending almost 100% of my time trying to make sure one aspect of this system works well (the numbers themselves), and am aware that we're still very much lacking on other fronts. And yet, most of the time, we're able to put these systems to good and effective use. Sometimes, like in this case, the tools don't provide the absolute reliability and usefulness we wish them to provide, and therefore I'm here, gladly spending as much time as you need to make sure you can make progress on your issues. That's how it is.
(In reply to Drew Willcoxon :adw from comment #44)
> Which numbers are the numbers reported by whatever reports them when
> you land things on m-i/m-c?

Specifically for this, it uses the graphserver values. Talos runs each test (all the sub-tests) 25 times, and the numbers which graphserver collects are as follows:

- Median of each subtest's 25 runs.
- Average of these medians across the sub-tests, excluding the highest median <-- the graphserver datapoint.

The system which produces the regression reports on the dev.tree-management list looks at the last 12 (IIRC) results of this test, and if it detects a shift with a magnitude higher than (IIRC) 2x the noise level of this value (~stddev), then it sends a regression email.

Two specific issues I see with this system are the use of medians rather than averages, and ignoring the subtest with the highest absolute value. Though I'm not sure what would be a good general alternative to this.

You can read more about it here: https://wiki.mozilla.org/Buildbot/Talos
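(Put as code, that aggregation is roughly the following - a sketch based on the description above, not on the graphserver source:)

  // One graphserver datapoint = average of each subtest's median,
  // dropping the subtest with the highest median.
  function median(values) {
    let sorted = values.slice().sort(function (a, b) { return a - b; });
    let mid = Math.floor(sorted.length / 2);
    return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
  }

  function graphserverDatapoint(subtestRuns) {
    // subtestRuns: e.g. { "newtab-open-preload-yes.all": [/* 25 values */], ... }
    let medians = Object.keys(subtestRuns).map(function (name) {
      return median(subtestRuns[name]);
    });
    medians.sort(function (a, b) { return a - b; });
    medians.pop(); // exclude the subtest with the highest median
    let sum = medians.reduce(function (a, b) { return a + b; }, 0);
    return sum / medians.length;
  }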
According to http://compare-talos.mattn.ca/, the regression was nearly halved by passing images between chrome and content using ArrayBuffers instead of base64-encoded data URI strings:

https://tbpl.mozilla.org/?tree=Try&rev=5293ebbbe638
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=37344817,37355643,37355653,37355663,37355693&newTestIds=37392939,37393141,37393151,37393181,37393191&testName=tart&osName=Ubuntu%2064&server=graphs.mozilla.org

The latter link shows a regression of 15.41% for newtab-open-preload-yes.all. The original regression was 28.65% [1]. That's a 46% improvement.

I'm waiting on tryserver results for other ideas so I can run them through compare-talos.

[1] This number is different from the 39.14% number I quoted in comment 44. It's based on the same revisions comparison; the only difference is that it's based on more test retriggers. It's bad that more test runs changed the regression number so much. Here's the link showing the new, lower number:
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=37344817,37355643,37355653,37355663,37355693&newTestIds=37344577,37355803,37355813,37355951,37355979&testName=tart&osName=Ubuntu%2064&server=graphs.mozilla.org
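(The gist of that change, as a sketch rather than the actual ContentSearch patch:)

  // Sketch of the idea: ship the raw icon bytes as an ArrayBuffer, which the
  // message manager can structured-clone, instead of building a large
  // "data:image/...;base64,..." string. Function names are invented.
  function iconPayloadFromBytes(bytes) { // bytes: a Uint8Array read from the icon
    let buffer = new ArrayBuffer(bytes.length);
    new Uint8Array(buffer).set(bytes);
    return buffer;
  }

  // Content side: turn the buffer back into something a CSS background can use.
  function iconURLFromBuffer(buffer, mimeType) {
    return URL.createObjectURL(new Blob([buffer], { type: mimeType }));
  }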
Depends on: 1019989
Depends on: 1019990
Depends on: 1019991
Removed from Iteration 32.3
Assignee: adw → nobody
Status: ASSIGNED → NEW
Whiteboard: [talos_regression] p=13 s=it-32c-31a-30b.3 [qa-] → [talos_regression] p=13 [qa-]
(In reply to Drew Willcoxon :adw from comment #47)
> I'm waiting on tryserver results for other ideas so I can run them through
> compare-talos.

Nothing really panned out, and in some cases the results were weird. I talked to Gavin about all of this, and we're just going to fix bugs that seem like they might be causing the regression even though I can't seem to prove it. I filed three bugs an hour ago and marked them as blocking this one.

Back to the results:

Use ArrayBuffers for images + use a document fragment for populating the panel with engines: 16.49% regression (vs. parent revision 253b4fb54379 of my search revisions, on newtab-open-preload-yes.all)
https://tbpl.mozilla.org/?tree=Try&rev=b8e56a752371
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=37344817,37355643,37355653,37355663,37355693&newTestIds=37474845,37475705,37475715,37475735,37475755,37475765&testName=tart&osName=Ubuntu%2064&server=graphs.mozilla.org

The previous + add hidden="true" to the logo div in the XUL so that the logo is hidden immediately by the CSS instead of by the JS after the page is set up: 17.03%
https://tbpl.mozilla.org/?tree=Try&rev=54a49d95ed28
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=37344817,37355643,37355653,37355663,37355693&newTestIds=37474875,37475775,37475785,37475827,37475841,37475869&testName=tart&osName=Ubuntu%2064&server=graphs.mozilla.org

The previous + don't set/remove the "selected" attribute on engines in the panel: 21.35% (!)
https://tbpl.mozilla.org/?tree=Try&rev=81065124c4ae
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=37344817,37355643,37355653,37355663,37355693&newTestIds=37474855,37475805,37475851,37475883,37475893,37475903&testName=tart&osName=Ubuntu%2064&server=graphs.mozilla.org

The previous + don't populate the panel at all: 22.16% (!)
https://tbpl.mozilla.org/?tree=Try&rev=d21830073e0b
http://compare-talos.mattn.ca/breakdown.html?oldTestIds=37344817,37355643,37355653,37355663,37355693&newTestIds=37474917,37475913,37475925,37475959,37475969,37475999&testName=tart&osName=Ubuntu%2064&server=graphs.mozilla.org

The last two just don't make sense. The code is doing less than in the other tries above but the regression is worse.
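(For reference, the "document fragment" change mentioned in the first item above is the usual DOM batching trick - a sketch with illustrative names, not the actual patch:)

  // Build all engine nodes off-DOM, then append them in one go so the live
  // panel is only mutated once instead of once per engine.
  function populatePanel(panel, engines) {
    let frag = document.createDocumentFragment();
    for (let engine of engines) {
      let item = document.createElement("div"); // node types are illustrative
      item.textContent = engine.name;
      frag.appendChild(item);
    }
    panel.appendChild(frag); // single insertion into the live DOM
  }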
By the way, I also used the Gecko profiler to try and find differences in the call trees between the parent revision and my patches. Thanks for your help with that, Avi (and BenWa). It's actually really nice to use, but unfortunately I don't see anything different. Between the start and end markers of newtab-open-preload-yes.all, there's a lot of painting going on, but that's true for both the parent revision and my patches, and it doesn't look like my patches are doing more of it. But I think this captures things that are happening in the chrome, too, while I'm primarily interested in what's going on in the page and what happens as a result of what's going on in the page, so it's possible I'm missing some needle in the haystack.
So this regression was solid and we uplifted. Luckily, when 32 uplifted to Aurora on June 8th, we recovered about half of our loss:
http://graphs.mozilla.org/graph.html#tests=[[293,52,35]]&sel=1397326252589,1405102252589&displayrange=90&datatype=running
(In reply to Joel Maher (:jmaher) from comment #51)
> So this regression was solid and we uplifted. Luckily, when 32 uplifted to
> Aurora on June 8th, we recovered about half of our loss:
> http://graphs.mozilla.org/graph.html#tests=[[293,52,35]]&sel=1397326252589,1405102252589&displayrange=90&datatype=running

Sounds like perhaps the bugs from comment 49 had an effect that was only noticeable later? In any case I don't think there's much value in tracking this further. It would be nice to get another look at this and see if we can tell exactly what regression is remaining if any, but we're not going back on the search bar addition, and trying to squeeze out small percentages on TART has a high cost/benefit ratio.
Gavin, the regression was very meaningful - while the overall TART regression which this bug reports is 5%, the newtab tab animation regressed by about 40% (many other sub-tests didn't regress, so the average only regressed a little).

Some of it got mitigated by bug 1019990, which improved the preload, but after bug 1026561 was filed, I experimented locally with more tiles, and the tab animation of the newtab page still depends on the amount of content it has (e.g. number of tiles), and does regress with more tiles.

It's hard for me to show this in a graph or examine it exactly because DataZilla is either too slow or broken (DZ shows individual sub-tests such as for newtab, which graphserver can't do), and I'm not sure if treeherder will have this data for April when this regression started. jeads will help with extracting the TART data from April, and I hope to be able to script and analyze it instead of datazilla.
I thought the proximate cause here was the search field - are you saying it's actually tiles (or both)?
This bug was not filed in order to back out the newtab search bar. It was filed in order to track and hopefully fix the regression caused by the search bar. This bug's regression of ~40% on newtab tab animation was strictly from the search field, and some of it got mitigated by bug 1019990. I can't tell how much exactly because DataZilla is the only tool which can show it, and it's kinda mostly not working right now. My hunch, however, is that what's left is still meaningful.

The tiles reference was with regard to the newtab page in general. I took your suggestion to "tell exactly what regression is remaining" as a suggestion to look at the big picture of the newtab page, of which the search and tiles are part.

Trying to "fix" every small 3-10% regression is both probably impossible and might also paint an incorrect or partial picture of "this only ended up as 3%, and that only regressed overall at 4%", while in practice, when you look at the newtab tab animation, it's considerably worse than plain tab animation without the newtab page.
If you want to see how much worse "populated" newtab tab animation is compared to "empty" (controlled via the top-right "mode" icon on the newtab page) on nightly 33:

- [1] Empty newtab: Median: 17 ms, Mean: 22 ms, 95th percentile: 44 ms.
- [2] Populated newtab: Median: 21 ms, Mean: 33 ms, 95th percentile: 89 ms.

And even if you remove the search bar completely, populated newtab still performs considerably worse than an empty newtab. So newtab needs a big picture approach, not unlike the one which ended with the newtab preload system about a year ago.

[1] http://telemetry.mozilla.org/#filter=nightly%2F33%2FFX_TAB_ANIM_OPEN_FRAME_INTERVAL_MS&aggregates=multiselect-all!Submissions!Mean!5th%20percentile!25th%20percentile!median!75th%20percentile!95th%20percentile&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Graph
[2] http://telemetry.mozilla.org/#filter=nightly%2F33%2FFX_TAB_ANIM_OPEN_PREVIEW_FRAME_INTERVAL_MS&aggregates=multiselect-all!Submissions!Mean!5th%20percentile!25th%20percentile!median!75th%20percentile!95th%20percentile&evoOver=Builds&locked=true&sanitize=true&renderhistogram=Graph
I would be more than happy to dedicate resources to improve new tab "big picture" perf, particularly if you're able to help guide us through the difficulty of dealing with our perf tools. Drew spent a significant amount of time chasing this down previously and didn't get very far, mostly because it was near impossible to get a clear picture from our perf measurements (either locally or on try). I marked this tracking32- not because I don't think it's important, but because I have no idea what we can realistically do here on beta in the next few weeks. I would love it if you could help me figure that out.
I would like to close this out; we have shipped with this code. Here is a graph over time:
http://graphs.mozilla.org/graph.html#tests=%5B%5B293,53,35%5D%5D&sel=1380648746043,1412184746043&displayrange=365&datatype=running

Avi, thoughts?
Flags: needinfo?(avihpit)
The original issue was somewhat mitigated in bug 1019990. The real problem here is that we can't track how much it improved since then, because this 5% overall regression is actually about a 40% newtab-related regression, but we can't view newtab-only results because datazilla is non-functional, and graphserver's averaging just doesn't give us the view we need.

OTOH, it's been so long since this regression appeared that it's impractical to fix it specifically. However, there's some major fx-team newtab perf work which now hangs off bug 1059558. So I'm closing this as WONTFIX, and the overall newtab perf discussion moves to bug 1059558.
Status: NEW → RESOLVED
Closed: 10 years ago
Flags: needinfo?(avihpit)
Resolution: --- → WONTFIX