Closed Bug 1059353 Opened 10 years ago Closed 10 years ago

[Clock] Startup time regressions of ~205 ms from 2.0 to 2.1

Categories

(Firefox OS Graveyard :: Gaia::Clock, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(tracking-b2g:+)

RESOLVED WONTFIX
tracking-b2g +

People

(Reporter: tchung, Unassigned)

References

Details

(Keywords: perf, regression, Whiteboard: [priority])

[Blocking Requested - why for this release]: Per Datazilla, Clock cold startup numbers have regressed about 205 ms from 2.0 to 2.1.

Environment: startup_>_moz-app-visually-complete, flame-319MB, workload = light. Human perception goal = 1000 ms.

* 2.0 (all numbers use the median, looking at the latest numbers from 8-20 to 8-25):
https://datazilla.mozilla.org/b2g/?branch=v2.0&device=flame-319MB&range=7&test=startup_%3E_moz-app-visually-complete&app_list=clock&app=clock&gaia_rev=d72f8ad53448aed5&gecko_rev=2a18149b3ae8&plot=avg
clock -- High, 963; Low, 891; Trend ~905

* 2.1 (all numbers use the median, looking at numbers from 8-08 to 8-13 (last available)):
https://datazilla.mozilla.org/b2g/?branch=master&device=flame-319MB&range=30&test=startup_%3E_moz-app-visually-complete&app_list=clock&app=clock&gaia_rev=9483fb304235e7ce&gecko_rev=7f32109287e9&plot=avg
clock -- High, 1155; Low, 1063; Trend ~1100

Please investigate, thanks.
Blocking based on the performance regression; the team will investigate.
blocking-b2g: 2.1? → 2.1+
Assignee: nobody → m
Target Milestone: --- → 2.1 S4 (12sep)
I did some basic investigation on the datazilla graph + commits that happened between July 23-24th, and there are a few commits that affected multiple apps. More info at https://bugzilla.mozilla.org/show_bug.cgi?id=1059349#c2
Keywords: perf
Looking at the latest Datazilla graphs, the numbers continue to fluctuate wildly, with a median of around ~1050 ms and a fairly consistent ~900 ms as the lowest time; the max ranges wildly from one second to over two seconds.

Very little changed in Clock over the course of 2.0 -> 2.1, and it doesn't appear that anything done on Clock's end would have meaningfully regressed its performance numbers. Per Datazilla, the reported startup numbers are inconsistent from day to day, indicating larger reporting-level inconsistencies from run to run, or system-level changes that affect either the test harness or actual performance.

This bug is marked blocking for 2.1, but it isn't clear to me that it's actionable. The startup numbers don't appear to be caused by Clock changes, and I don't see any low-hanging fruit that would shave off additional time. We could try to change Clock's startup behavior more substantially (add a caching layer or something equally involved) to meaningfully reduce Clock's perceived startup time; however, it seems risky to attempt a major change for uplift in 2.1 at this stage.

Tony, any suggestions as to how to proceed here? Do we absolutely need to get the number under 1000 ms for 2.1, or is this primarily to track investigation?
Flags: needinfo?(tchung)
Target Milestone: 2.1 S4 (12sep) → ---
(In reply to Marcus Cavanaugh [:mcav] <mcav@mozilla.com> from comment #3)
> Tony, any suggestions as to how to proceed here? Do we absolutely need to
> get the number under 1000ms for 2.1, or is this primarily to track
> investigation?

Thanks for investigating. I don't have the answers here, because we could be seeing something systemic across the OS that is causing overall startup regressions. Getting the number under 1000 ms isn't the actual goal; finding and keeping a baseline number for Clock startup is the goal. I actually don't know what that Clock number should be, since every app is different.
We should bring this up with the product and engineering teams to define a baseline to begin with; that way we can always check, and alert ourselves going forward, when the tests have surpassed some defined threshold (e.g. 1000 ms with a 100 ms standard deviation?). What I do know is that we should continue to watch why the previous release -> current release has regressed, which circles back to the original bug here. I am concerned about such a wild swing of ~200 ms, but if product/engineering defines that 1000 ms is the right call, then a 1000 ms median isn't so bad against our own standards.

Given that you didn't see anything in the app code itself that could have caused this, I propose 2 steps from here:

1) Collaborate with the other app developers to see why their apps have also been regressing from 2.0 -> 2.1. Camera, Gallery, and Calendar have all shown regressions in the ~75 to ~500 ms range. Talk to the other teams and see if we can track down where that systemic problem lies. (I heard the Settings API made changes in 2.1?)

2) For Clock, and Email and Calendar too, we need to define a baseline of what is justified for 2.1. We also need to define the minimum threshold that is safe before we set Datazilla alerts to go off. (See bug 1060545.)

ni? Sri for the product perspective. Mcav, can you work to help get more information on 1 and 2?
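As a rough illustration of the alerting idea above, a check like the following could flag a run whose median startup time exceeds an agreed baseline by more than a fixed tolerance. The baseline and threshold values here are hypothetical placeholders, not numbers anyone has agreed on:

```python
# Hypothetical sketch of a Datazilla-style threshold alert: flag a run when
# its median startup time exceeds a per-app baseline by more than a fixed
# tolerance. BASELINE_MS and THRESHOLD_MS are illustrative values only.
from statistics import median

BASELINE_MS = 1000   # hypothetical agreed per-app baseline
THRESHOLD_MS = 100   # hypothetical allowed deviation

def startup_regressed(samples_ms):
    """Return (regressed, median) for a list of startup times in ms."""
    m = median(samples_ms)
    return m > BASELINE_MS + THRESHOLD_MS, m

ok_run = [996, 960, 895, 910, 955]
bad_run = [1155, 1063, 1100, 1120, 1140]

print(startup_regressed(ok_run))   # (False, 955)
print(startup_regressed(bad_run))  # (True, 1120)
```

The median is used rather than the mean so that one noisy replicate can't trip the alert on its own.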
Flags: needinfo?(tchung)
Flags: needinfo?(skasetti)
Flags: needinfo?(m)
Target Milestone: --- → 2.1 S5 (26sep)
Flags: needinfo?(m)
[update] I used today's build to test again; the results follow.

[Device] Flame + KK

[Environment]
Gaia-Rev 13973ab50760d1e8bb773082163f0dff19d35a44
Gecko-Rev https://hg.mozilla.org/releases/mozilla-aurora/rev/6e317e075d04
Build-ID 20140928160204
Version 34.0a2
Device-Name flame
FW-Release 4.4.2
FW-Incremental eng.cltbld.20140925.192608
FW-Date Thu Sep 25 19:26:18 EDT 2014
Bootloader L1TC10011800

[Testing result comparison] Clock:
* High, 1155; Low, 1063; Trend ~1100 - Comment 1
* High, 747; Low, 458; median: 499, mean: 512 - My testing
I would love to believe that we just sliced our startup time in half but the comment 1 numbers are coming from datazilla so I don't think we can compare those directly to the numbers in your test setup. Is this a 319mb configuration? Can you set up a 2.0 build with the same configuration so that we can get a more accurate base to compare against?
Flags: needinfo?(edchen)
I tested v2.0 with 319mb today, and actually got results lower than Datazilla displayed. I just wanted to let you know the results in my testing environment. Of course we can keep monitoring to trace it.
Flags: needinfo?(edchen)
(In reply to Edward Chen[:edchen] from comment #7) > I used the v2.0 with 319mb to test it today, actually I got result less than > datazilla displayed. I just let you know what result in my testing > environment. Yes, so what are those 2.0 numbers? We are trying to understand if there is a regression between 2.0 and 2.1, so we need to see numbers from the same configurations running the same tests in the same environment.
Flags: needinfo?(edchen)
It may be possible to just blindly install the 2.1 version of Clock on a 2.0 device, to compare with absolute certainty. We didn't add any functionality that depends on 2.1 Gaia.
Target Milestone: 2.1 S5 (26sep) → 2.1 S6 (10oct)
ni? on :mcav to report back after trying the test in comment #9
Flags: needinfo?(m)
Flags: needinfo?(edchen)
I've gathered data which confirms my hypothesis that Clock is not the cause of this performance regression. I ran "make test-perf" several times on a v2.0 device, using two versions of Clock: v2.0 and v2.1. I dropped the highest number from each test run (which was highly variable, anywhere from 1000 ms to 1800 ms) to remove some of the substantial noise.

v2.0 Flame (with Clock from v2.0) -- 933 ms average:
[996, 960, 895, 910]
[955, 880, 925, 881]
[949, 971, 995, 881]

v2.0 Flame (with Clock from v2.1) -- 992 ms average:
[1035, 946, 986, 940]
[1053, 935, 930, 943]
[1084, 955, 1102, 1001]

We could run the numbers on a v2.1 Flame, and I'd expect to see similar results. From that data, I note the following:

The numbers vary widely from run to run, making even these numbers a very rough estimate of performance. However, most of these runs hover around 940 ms, for both the 2.0 and 2.1 Clock versions. The number of runs here is still relatively small given the noise, and I would expect the two sets of performance numbers to roughly converge somewhere between 900 and 1000 ms given enough test runs. What is clear, however, is that the difference between Clock in v2.0 and v2.1 is negligible, and clearly not an actionable Clock regression.

The Datazilla performance numbers for v2.1 also vary widely (on the order of hundreds of milliseconds), even in the absence of Clock changes <http://goo.gl/JOR2JZ>, indicating either system-level changes or pure sampling noise. These numbers are at or near our team-level target of 1000 ms startup time for Clock, further indicating that spending extra time on this here would be a misuse of resources.
Based on this data, combined with the lack of Clock changes between v2.0 and v2.1, the DataZilla graph numbers, and System bugs that cause performance regressions (bug 1069450), this bug is not clock-caused nor clock-actionable, and should be either unblocked, closed, or reassigned to a system component for further investigation.
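For the record, the summary arithmetic in comment 11 can be reproduced directly from the listed replicates (the highest replicate of each run has already been dropped, as described there). A minimal sketch:

```python
# Reproduces the averages quoted above from the pooled replicates of the
# three runs per configuration (highest replicate per run already dropped).
v20_clock = [996, 960, 895, 910, 955, 880, 925, 881, 949, 971, 995, 881]
v21_clock = [1035, 946, 986, 940, 1053, 935, 930, 943, 1084, 955, 1102, 1001]

avg_20 = sum(v20_clock) / len(v20_clock)
avg_21 = sum(v21_clock) / len(v21_clock)

# ~933 ms vs ~992 ms: a ~59 ms gap, well inside the run-to-run noise.
print(round(avg_20), round(avg_21), round(avg_21 - avg_20))
```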
Flags: needinfo?(m)
I'm about to publish large-dataset test run results for 2.1 acceptance -- just waiting for the run to finish for the post-9/29 patched 2.1, which improved performance. Pre-9/29, with 480 repetitions per launch, 2.0 Clock had a median of 919 and 2.1 Clock a median of 1097, with distribution graphs that reasonably support that conclusion. I expect the second number will probably come down with the new run.

A couple of comments on the rest:

The first, highly variable result is indeed invalid for this test (and my test run actually threw it out). That it's variable and a lot different from the others is an excellent clue, but moreover it's not logically the same scenario as the rest: the first result is the first run after FTE, with any system/app initialization required, while the rest are post-init after a hard kill. This is typical for load-time tests, where the first run is really a priming run.

The average is invalid and shouldn't be used. The distribution for load time isn't symmetric normal, so the average doesn't really tell you the convergence point (it overestimates because of positive skew, fwiw). The median is the closest simple statistic that works.

I agree that 30 repetitions is probably insufficient to do more than act as a general canary for whether something is grossly wrong. Trying to compare numerical results with any precision there doesn't make a ton of sense.

Re: this not being Clock's issue, I suspect it is a system issue for all the reasons you said. That said, it'll probably have to be diagnosed by profiling the individual apps that are affected. Not sure how that should be handled re: ownership.
Re: the first result, btw, what I said is true for Datazilla's runs, which re-flash before the test run. If you still saw a highly variable first result when repeating make test-perf within the same execution context, there must be something extra happening after the test run initialization (which I think restarts the b2g process). If the highly variable result was elsewhere, that's a deeper issue. Most of the DZ tests (and my local ones) show a high first replicate and a more or less consistent rest, which is what I'd expect to see here too. A high result in the middle is probably system-generated; if there's a lot of that, it's problematic for how the results relate to real-world usage -- it means the user might experience lag spikes too, so our result doesn't tell the whole story.
Blocks: 1079564
Team confirmed this will not block, but it is high priority for the backlog.
blocking-b2g: 2.1+ → backlog
Whiteboard: [priority]
Assignee: m → nobody
[priority] --> tracking-b2g:+ conversion
tracking-b2g: --- → +
Target Milestone: 2.1 S6 (10oct) → ---
Flags: needinfo?(skasetti)
We don't have anything actionable here for Clock in the 2.0 to 2.1 migration, as there doesn't appear to be any regression specifically in the Clock code (see comment #11). Closing this 2.1-specific bug; we will focus any efforts on master and the more generic startup-time bug 1079564.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
blocking-b2g: backlog → ---