Closed
Bug 1059353
Opened 10 years ago
Closed 10 years ago
[Clock] Startup time regressions of ~205 ms from 2.0 to 2.1
Categories
(Firefox OS Graveyard :: Gaia::Clock, defect)
Tracking
(tracking-b2g:+)
RESOLVED
WONTFIX
People
(Reporter: tchung, Unassigned)
References
Details
(Keywords: perf, regression, Whiteboard: [priority])
[Blocking Requested - why for this release]:
Per Datazilla, Clock cold startup numbers have regressed by about 205 ms from 2.0 to 2.1.
Environment: startup_>_moz-app-visually-complete, flame-319MB, workload = light. Human perception goal = 1000 ms.
* 2.0 (All numbers use median, looking at latest numbers from 8-20 to 8-25)
** https://datazilla.mozilla.org/b2g/?branch=v2.0&device=flame-319MB&range=7&test=startup_%3E_moz-app-visually-complete&app_list=clock&app=clock&gaia_rev=d72f8ad53448aed5&gecko_rev=2a18149b3ae8&plot=avg
clock -- High, 963; Low, 891; Trend ~905
* 2.1 (All numbers use median, looking at numbers from 8-08 to 8-13 (last available))
** https://datazilla.mozilla.org/b2g/?branch=master&device=flame-319MB&range=30&test=startup_%3E_moz-app-visually-complete&app_list=clock&app=clock&gaia_rev=9483fb304235e7ce&gecko_rev=7f32109287e9&plot=avg
clock -- High, 1155; Low, 1063; Trend ~1100
Please investigate, thanks.
Comment 1•10 years ago
Blocking based on performance regression. Team to investigate.
blocking-b2g: 2.1? → 2.1+
Updated•10 years ago
Assignee: nobody → m
Target Milestone: --- → 2.1 S4 (12sep)
Comment 2•10 years ago
I did some basic investigation on the datazilla graph + commits that happened between July 23-24th, and there are a few commits that affected multiple apps. More info at https://bugzilla.mozilla.org/show_bug.cgi?id=1059349#c2
Comment 3•10 years ago
Looking at the latest Datazilla graphs, the numbers continue to fluctuate wildly, with a median of around ~1050ms and a fairly consistent ~900ms as the lowest time; the max ranges from one second to over two seconds.
Very little changed in clock over the course of 2.0->2.1, and it doesn't appear that anything done on Clock's end would have meaningfully regressed its performance numbers. Per Datazilla, it appears the reported startup numbers are inconsistent from day to day, indicating larger reporting-level inconsistencies from run to run, or system-level changes that affect either the test harness or actual performance.
This bug is marked blocking for 2.1, but it isn't clear to me that it's actionable. The startup numbers don't appear to be caused by Clock changes, and I don't see any low-hanging fruit that would let us shave off additional time. We could change Clock's startup behavior more substantially (add a caching layer or something equally involved) to meaningfully reduce Clock's perceived startup time; however, it seems risky to attempt a major change for uplift in 2.1 at this stage.
Tony, any suggestions as to how to proceed here? Do we absolutely need to get the number under 1000ms for 2.1, or is this primarily to track investigation?
Flags: needinfo?(tchung)
Target Milestone: 2.1 S4 (12sep) → ---
Reporter
Comment 4•10 years ago
(In reply to Marcus Cavanaugh [:mcav] <mcav@mozilla.com> from comment #3)
> Looking at the latest datazilla graphs, the numbers continue to fluctuate
> wildly, with a median of around ~1050ms and a fairly consistent ~900ms as
> the lowest time; the max ranges wildly from one second to over two seconds.
>
> Very little changed in clock over the course of 2.0->2.1, and it doesn't
> appear that anything done on Clock's end would have meaningfully regressed
> its performance numbers. Per Datazilla, it appears the reported startup
> numbers are inconsistent from day to day, indicating larger reporting-level
> inconsistencies from run to run, or system-level changes that affect either
> the test harness or actual performance.
>
> This bug is marked blocking for 2.1, but it isn't clear to me that it's
> actionable. Given that the startup numbers don't appear to be caused by
> Clock changes, and I don't see any low-hanging fruit for which we could
> shave off additional time. We could try to change Clock's startup behavior
> more substantially (add a caching layer or something equally involved) to
> meaningfully reduce Clock's perceived startup time, however it seems risky
> to attempt a major change for uplift in 2.1 at this stage.
>
> Tony, any suggestions as to how to proceed here? Do we absolutely need to
> get the number under 1000ms for 2.1, or is this primarily to track
> investigation?
Thanks for investigating. I don't have the answers here, because we could be seeing something systemic across the OS that is causing overall startup regressions.
Getting the number under 1000ms isn't the actual goal; finding and keeping a baseline number for Clock startup is. I actually don't know what that Clock number should be, since every app is different. We should bring this up with the product and engineering teams to create a defined baseline to begin with; that way we can always check and alert ourselves going forward when the tests surpass some defined threshold (e.g. 1000ms with a 100ms std?).
What I do know we should do is continue to watch why prev release -> curr release has regressed, which circles back to the original bug here. I am concerned about such a wild swing of ~200ms, but if product/engineering decides that 1000ms is the right call, then a 1000ms median isn't so bad against our own standards. Given that you didn't see anything in the app code itself that could have caused this, I propose 2 steps from here:
1) Collaborate with the other app developers to see why their apps have also been regressing from 2.0 -> 2.1. Camera, Gallery, and Calendar have all shown regressions in the ~75 to ~500ms range. Talk to the other teams and see if we can track down where that systemic problem lies. (I heard the Settings API made changes in 2.1?)
2) With Clock, and Email and Calendar too, we need to define a baseline of what is justified for 2.1. We also need to define the minimum threshold that is safe before we set Datazilla alerts to go off. (See bug 1060545.) ni? Sri for product perspective.
Mcav, can you work to help get more information on 1 and 2?
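[Editor's note] A minimal sketch of the kind of alert check floated in point 2. The 1000ms baseline and 100ms margin are the example numbers from this comment, not agreed values, and the function name is hypothetical:

```python
# Hypothetical Datazilla-style alert check: flag a run when its median
# startup time exceeds an agreed baseline plus an allowed margin.
# BASELINE_MS and MARGIN_MS are the example numbers from this comment,
# not settled values.
from statistics import median

BASELINE_MS = 1000
MARGIN_MS = 100

def should_alert(startup_times_ms):
    """True when the median startup time exceeds baseline + margin."""
    return median(startup_times_ms) > BASELINE_MS + MARGIN_MS

print(should_alert([996, 960, 895, 910]))  # median 935 ms -> no alert
print(should_alert([1155, 1130, 1097]))    # median 1130 ms -> alert
```

Using the median rather than the mean here matters: a single slow outlier launch would push a mean-based check over the threshold without reflecting typical behavior.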
Flags: needinfo?(tchung)
Flags: needinfo?(skasetti)
Flags: needinfo?(m)
Updated•10 years ago
Target Milestone: --- → 2.1 S5 (26sep)
Updated•10 years ago
Flags: needinfo?(m)
Comment 5•10 years ago
[update]
I used today's build to test again; the results are below.
[Device]
Flame + KK
[Environment]
Gaia-Rev 13973ab50760d1e8bb773082163f0dff19d35a44
Gecko-Rev https://hg.mozilla.org/releases/mozilla-aurora/rev/6e317e075d04
Build-ID 20140928160204
Version 34.0a2
Device-Name flame
FW-Release 4.4.2
FW-Incremental eng.cltbld.20140925.192608
FW-Date Thu Sep 25 19:26:18 EDT 2014
Bootloader L1TC10011800
[Testing result comparison]
Clock:
* High, 1155; Low, 1063; Trend ~1100 (comment 1)
* High, 747; Low, 458; Median, 499; Mean, 512 (my testing)
Comment 6•10 years ago
I would love to believe that we just sliced our startup time in half, but the comment 1 numbers come from Datazilla, so I don't think we can compare them directly to the numbers from your test setup. Is this a 319MB configuration? Can you set up a 2.0 build with the same configuration so that we get a more accurate base to compare against?
Flags: needinfo?(edchen)
Comment 7•10 years ago
I tested v2.0 with 319MB today; I actually got results lower than what Datazilla displayed. I just wanted to share the results from my testing environment. Of course, we can keep monitoring to trace it.
Flags: needinfo?(edchen)
Comment 8•10 years ago
(In reply to Edward Chen[:edchen] from comment #7)
> I used the v2.0 with 319mb to test it today, actually I got result less than
> datazilla displayed. I just let you know what result in my testing
> environment.
Yes, so what are those 2.0 numbers? We are trying to understand if there is a regression between 2.0 and 2.1, so we need to see numbers from the same configurations running the same tests in the same environment.
Flags: needinfo?(edchen)
Comment 9•10 years ago
It may be possible to just blindly install the 2.1 version of Clock on a 2.0 device, to compare with absolute certainty. We didn't add any functionality that depends on 2.1 Gaia.
Updated•10 years ago
Target Milestone: 2.1 S5 (26sep) → 2.1 S6 (10oct)
Comment 10•10 years ago
ni? on :mcav to report back after trying the test in comment #9
Flags: needinfo?(m)
Updated•10 years ago
Flags: needinfo?(edchen)
Comment 11•10 years ago
I've gathered data which confirms my hypothesis that Clock is not the
cause of this performance regression.
I ran "make test-perf" several times on a v2.0 device, using two
versions of Clock: v2.0 and v2.1. I dropped the highest number from
each test run (which was highly variable, anywhere from 1000ms to
1800ms) to remove some of the substantial noise.
v2.0 Flame (with Clock from v2.0):
_
[996, 960, 895, 910] \
[955, 880, 925, 881] } 933ms average
[949, 971, 995, 881] _/
v2.0 Flame (with Clock from v2.1):
_
[1035, 946, 986, 940] \
[1053, 935, 930, 943] } 992ms average
[1084, 955, 1102, 1001] _/
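[Editor's note] As a sanity check, the quoted averages can be reproduced from the samples listed above (this assumes the replicates shown are the post-trim values, i.e. the highest number per run has already been dropped):

```python
# Reproduce the 933 ms / 992 ms averages from the listed samples.
from statistics import mean

v20 = [996, 960, 895, 910, 955, 880, 925, 881, 949, 971, 995, 881]
v21 = [1035, 946, 986, 940, 1053, 935, 930, 943, 1084, 955, 1102, 1001]

print(round(mean(v20)))  # 933
print(round(mean(v21)))  # 992
```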
We could run the numbers on a v2.1 Flame, and I'd expect to see
similar results. From that data, I note the following:
The numbers vary widely from run to run, making even these numbers a
very rough estimate of performance. However, most of these runs hover
around 940ms, both for the 2.0 and 2.1 Clock versions. The number of
runs here is still relatively small given the noise, and I would
expect the two sets of performance numbers to roughly converge
somewhere between 900 and 1000ms given enough test runs.
What is clear, however, is that the difference between Clock in v2.0
and v2.1 is negligible, and clearly not an actionable Clock
regression. The Datazilla performance numbers for v2.1 also vary
widely (on the order of hundreds of milliseconds), even in the absence
of Clock changes <http://goo.gl/JOR2JZ>, indicating either
system-level changes or pure sampling noise.
These numbers are at or near our team-level target of 1000ms startup
time for Clock, further indicating that attempting to spend extra time
addressing this here would be a misuse of resources.
Based on this data, combined with the lack of Clock changes between
v2.0 and v2.1, the DataZilla graph numbers, and System bugs that cause
performance regressions (bug 1069450), this bug is neither clock-caused
nor clock-actionable, and should be unblocked, closed, or
reassigned to a system component for further investigation.
Flags: needinfo?(m)
I'm about to publish large-dataset test run results for 2.1 acceptance--just waiting for the run to finish for the post-9/29 patched 2.1, which improved performance.
Pre-9/29, with 480 repetitions per launch, 2.0 Clock was median 919 and 2.1 Clock was median 1097, with distribution graphs that reasonably support that conclusion. I expect the second number to come down with the new run.
Couple of comments on the rest:
The first, highly variable result is indeed invalid for this test (and my test run actually threw it out). That it's variable and a lot different from the others is an excellent clue, but moreover it's not logically the same scenario as the rest: the first result is the first run after FTE, with any system/app initialization required, while the rest are post-init, after a hard kill. This is typical for load-time tests, where the first run is really a priming run.
Average is invalid and shouldn't be used: the distribution of load times isn't a symmetric normal, so the average doesn't really tell you the convergence point (it overestimates because of positive skew, fwiw). Median is the closest simple statistic that works.
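[Editor's note] To illustrate the skew point with made-up numbers: one slow outlier launch drags the mean well above the typical value, while the median barely moves.

```python
from statistics import mean, median

# Hypothetical launch times with one slow outlier (all values invented).
launches_ms = [900, 910, 905, 920, 915, 1800]

print(median(launches_ms))       # 912.5 -- near the typical launch
print(round(mean(launches_ms)))  # 1058  -- pulled up by the outlier
```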
I agree that 30 repetitions is probably insufficient to do more than be a general canary re: whether something is grossly wrong. Trying to compare numerical results with any precision there doesn't make a ton of sense.
Re: this not being Clock's issue, I suspect it's a system issue for all the reasons you said. That said, it'll probably have to be diagnosed by profiling the individual apps that are affected. Not sure how that should be handled re: ownership.
Re: first result, btw, what I said is true for Datazilla's runs, which re-flash before the test run.
If you still saw a highly variable first result repeating make test-perf within the same execution context, there must be something extra happening after the test run initialization (which I think restarts the b2g process).
If the highly variable result was elsewhere, that's a deeper issue. Most of the DZ tests (and my local ones) show a high first replicate and a more or less consistent rest, which is what I'd expect to see here too. A high result in the middle is probably system-generated.
If there's a lot of it, that's problematic for the relationship of the results to real-world usage--it means the user might experience lag spikes too, so our result doesn't tell the whole story.
Comment 15•10 years ago
The team confirmed this will not block, but it is high priority for the backlog.
blocking-b2g: 2.1+ → backlog
Whiteboard: [priority]
Updated•10 years ago
Assignee: m → nobody
Updated•10 years ago
Target Milestone: 2.1 S6 (10oct) → ---
Updated•10 years ago
Flags: needinfo?(skasetti)
Comment 17•10 years ago
We don't have anything actionable here for Clock in the 2.0 to 2.1 migration as there doesn't appear to be any regression specifically in the Clock code. (See comment #11)
Closing this 2.1-specific bug; we will focus any efforts on master and the more generic startup time bug 1079564.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Assignee
Updated•10 years ago
blocking-b2g: backlog → ---