Closed Bug 1112149 Opened 10 years ago Closed 10 years ago

[raptor] Determine the feasibility of running perf tests on the b2g-emulator on the cloud

Categories

Product/Component: Firefox OS Graveyard :: Gaia::PerformanceTest
Platform: Other / Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rwood, Assigned: rwood)

References

Details

(Keywords: perf)

Determine if it is feasible to use the b2g-emulator to generate a performance baseline for the raptor tests. Also determine whether AWS will be the location/cloud of choice for this automation. Will the raptor perf tests run successfully on the b2g-emulator (stability-wise)? Will the generated numbers be consistent enough to establish a perf baseline? Try this locally first, then on an AWS instance.
Created an AWS raptor instance for this work; working directly on the instance, since we want to get numbers from there to see if it's feasible to establish a raptor performance baseline.
Depends on: 1123884
Depends on: 1129948
AWS instance: r3.xlarge, Ubuntu 14.04
Docker image: https://github.com/rwood-moz/raptor-docker-runner
Cmd: DEBUG=* RUNS=100 APPS='clock' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Duration for 1x cmd: approx. 25 minutes
Restarted the emulator (and did a make raptor) before each cmd.
Results: https://gist.github.com/rwood-moz/17604d572bfa4be06217
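For reproducibility, the per-result-set loop (restart the emulator, rebuild the raptor profile, run the launch test, append to a log) could be scripted along these lines. This is only a sketch under assumptions: run-emulator.sh, the fixed boot wait, and the GAIA_DIR path are hypothetical, not the actual harness used here:

  // run_iterations.js - sketch of the manual loop described above
  const { execSync, spawn } = require('child_process');

  const GAIA_DIR = process.env.GAIA_DIR || '/home/user/gaia'; // hypothetical path

  for (let i = 0; i < 3; i++) {
    // Kill any running emulator; 'adb emu kill' is a standard adb command
    try { execSync('adb emu kill'); } catch (e) { /* none running yet */ }

    // Relaunch the emulator detached; run-emulator.sh is a hypothetical wrapper
    const emu = spawn('./run-emulator.sh', [], {
      cwd: GAIA_DIR, detached: true, stdio: 'ignore',
    });
    emu.unref();
    execSync('sleep 360'); // crude boot wait (~6 min, per timings later in this bug)

    // Rebuild the raptor profile, then run 100 launches of the clock app
    execSync('make raptor', { cwd: GAIA_DIR, stdio: 'inherit' });
    execSync(
      "DEBUG=* RUNS=100 APPS='clock' RAPTOR_EMULATOR=1 " +
      `node tests/raptor/emulator_launch_test.js >> launch-${i}.log 2>&1`,
      { cwd: GAIA_DIR, shell: '/bin/bash' }
    );
  }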
Flags: needinfo?(eperelman)
Same emulator setup as in comment 2, with updated/latest gaia and gecko.
Results: https://gist.github.com/rwood-moz/e52492f11323c484ee2e

For comparison, the launch test was run on an actual local flame-kk device from my local VM:
Cmd: DEBUG=* RUNS=100 APPS='clock' node tests/raptor/launch_test.js
Duration for 1x cmd: approx. 14 minutes
Results: https://gist.github.com/rwood-moz/db58832d9ab1089efcb8
For the results of the emulator tests in comment 2 (values in ms):

logfile 1: Mean 6814.98 | Median 6668.5 | Min 5421 | Max 8961 | StdDev 724.53
logfile 2: Mean 6792.14 | Median 6681.5 | Min 5370 | Max 8942 | StdDev 760.24
logfile 3: Mean 6704.27 | Median 6529   | Min 5691 | Max 9571 | StdDev 768.07
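For reference, figures like these can be reproduced from a raw log of launch times with a short Node script along the following lines. This is a minimal sketch (one millisecond value per line is an assumed logfile format), not the actual test-stats script mentioned later in this bug:

  // stats.js - summarize launch times; usage: node stats.js logfile.txt
  const fs = require('fs');

  const times = fs.readFileSync(process.argv[2], 'utf8')
    .split('\n')
    .filter(line => line.trim())
    .map(Number);

  const sorted = times.slice().sort((a, b) => a - b);
  const mean = times.reduce((a, b) => a + b, 0) / times.length;
  const mid = sorted.length / 2;
  const median = sorted.length % 2
    ? sorted[Math.floor(mid)]
    : (sorted[mid - 1] + sorted[mid]) / 2;
  // Population standard deviation; the quoted figures may use the sample formula
  const stdev = Math.sqrt(
    times.reduce((sum, t) => sum + Math.pow(t - mean, 2), 0) / times.length
  );
  // Nearest-rank p95; the quoted p95 values appear to be interpolated instead
  const p95 = sorted[Math.min(sorted.length - 1, Math.floor(0.95 * sorted.length))];

  console.log('Mean:', mean, 'Median:', median,
              'Minimum:', sorted[0], 'Maximum:', sorted[sorted.length - 1],
              'Standard Deviation:', stdev, '95th Percentile:', p95);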
Flags: needinfo?(eperelman)
For emulator tests in comment 3 (values in ms; the stats script reports the mode as identical to the median, so it is omitted here):

Chunk 0:  Mean 7077.55 | Median 7042   | Min 5529 | Max 9318 | StdDev 896.15  | p95 8925.65
Chunk 1:  Mean 7080.03 | Median 6981   | Min 5510 | Max 8851 | StdDev 910.07  | p95 8714.20
Chunk 2:  Mean 6939.93 | Median 6912   | Min 5491 | Max 9589 | StdDev 863.00  | p95 8307.45
Chunk 3:  Mean 7251.59 | Median 7224   | Min 5394 | Max 9852 | StdDev 1072.86 | p95 9344.70
Chunk 4:  Mean 6885.48 | Median 6784   | Min 5510 | Max 8529 | StdDev 851.63  | p95 8280.10
Chunk 5:  Mean 6998.62 | Median 6981   | Min 5491 | Max 9589 | StdDev 860.08  | p95 8887.90
Chunk 6:  Mean 7171.10 | Median 7053   | Min 5394 | Max 9852 | StdDev 985.05  | p95 8952.35
Chunk 7:  Mean 6946.79 | Median 6912   | Min 5529 | Max 9318 | StdDev 917.60  | p95 8568.45
Chunk 8:  Mean 7057.17 | Median 6981   | Min 5491 | Max 8851 | StdDev 854.30  | p95 8714.20
Chunk 9:  Mean 7108.28 | Median 6942   | Min 5394 | Max 9852 | StdDev 1056.70 | p95 9602.15
Chunks 10-15: identical to chunks 0-5 (duplicated verbatim in the script output).
Chunk 16: Mean 7168.80 | Median 7072.5 | Min 5394 | Max 9852 | StdDev 1052.20 | p95 9171.00
Status: NEW → ASSIGNED
Keywords: perf
For device results in comment 3 (values in ms; mode omitted, as above):

Chunk 0:  Mean 1177.59 | Median 1181   | Min 1101 | Max 1252 | StdDev 39.15 | p95 1238.70
Chunk 1:  Mean 1188.24 | Median 1163   | Min 1077 | Max 1338 | StdDev 65.50 | p95 1330.40
Chunk 2:  Mean 1168.14 | Median 1173   | Min 1053 | Max 1281 | StdDev 51.41 | p95 1244.90
Chunk 3:  Mean 1181.17 | Median 1181   | Min 1080 | Max 1273 | StdDev 42.81 | p95 1251.15
Chunk 4:  Mean 1185.90 | Median 1163   | Min 1077 | Max 1338 | StdDev 64.54 | p95 1330.40
Chunk 5:  Mean 1168.03 | Median 1163   | Min 1053 | Max 1263 | StdDev 50.40 | p95 1245.90
Chunk 6:  Mean 1184.17 | Median 1187   | Min 1080 | Max 1281 | StdDev 48.01 | p95 1273.40
Chunk 7:  Mean 1177.38 | Median 1171   | Min 1077 | Max 1330 | StdDev 54.42 | p95 1310.05
Chunk 8:  Mean 1179.14 | Median 1163   | Min 1107 | Max 1338 | StdDev 57.33 | p95 1297.15
Chunk 9:  Mean 1171.34 | Median 1170   | Min 1053 | Max 1281 | StdDev 53.74 | p95 1273.40
Chunks 10-15: identical to chunks 0-5 (duplicated verbatim in the script output).
Chunk 16: Mean 1180.60 | Median 1179.5 | Min 1080 | Max 1281 | StdDev 52.31 | p95 1277.00
Just a note that the results from comments 5 and 6 were re-chunked into groups of 30 runs each.
AWS instance: c3.2xlarge, Ubuntu 14.04
Docker image: https://github.com/rwood-moz/raptor-docker-runner
Cmd: DEBUG=* RUNS=30 APPS='clock' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Duration for 1x cmd: approx. 10 minutes
Restarted the emulator (and did a make raptor) before each cmd.
Results: https://gist.github.com/rwood-moz/5f43461b237118ecd198
For emulator-kk results in comment 8 (thanks for the script, Eli!); values in ms, mode omitted as above:

Chunk 0: Mean 6335.17 | Median 6086 | Min 5311 | Max 8721 | StdDev 843.04 | p95 7528.75
Chunk 1: Mean 6330.00 | Median 6244 | Min 4919 | Max 7786 | StdDev 622.05 | p95 7420.25
Chunk 2: Mean 6004.86 | Median 5950 | Min 4870 | Max 7528 | StdDev 709.39 | p95 7443.45
Chunk 3: Mean 5969.62 | Median 5608 | Min 4814 | Max 9034 | StdDev 931.00 | p95 7767.65
Chunk 4: Mean 6380.79 | Median 6069 | Min 5052 | Max 8127 | StdDev 877.96 | p95 7969.30
Chunk 5: Mean 5848.31 | Median 5734 | Min 4632 | Max 7849 | StdDev 800.28 | p95 7676.10
Chunk 6: Mean 6036.76 | Median 5847 | Min 4911 | Max 7882 | StdDev 678.36 | p95 7572.30
Chunk 7: Mean 6458.72 | Median 6253 | Min 5291 | Max 7844 | StdDev 750.33 | p95 7672.05
Chunk 8: Mean 6522.28 | Median 6561 | Min 5022 | Max 8422 | StdDev 785.30 | p95 7695.25
Chunk 9: Mean 6437.83 | Median 6300 | Min 5252 | Max 8384 | StdDev 835.94 | p95 8162.65
For the emulator results in comment 5:

Smallest mean: 6885ms
Largest mean: 7251ms
Range: 366ms

10% regression:
6885 + 689 = 7574
7251 + 725 = 7976

5% regression:
6885 + 344 = 7229
7251 + 363 = 7614

Assessment:
A 5% regression falls within the normal range of values, making it impossible to detect regressions unless they are greater than 5%, and even then a regression should probably be significantly above 5% in order to prove that the maximum of the normal range hasn't simply increased. Falling back to requiring a 10% regression to fail a test means needing a *700ms* regression. Regressions in the 3-10% range, which are the most common, would usually never be caught, leaving this mechanism able to catch only the most serious regressions. I'm not sure, but it seems inefficient to allocate this much time to testing every patch if it can only catch regressions of 10% or more. The emulator values exhibit quite a bit of variance, with minimum-to-maximum swings approaching 100%.

---

For the device results in comment 6:

Smallest mean: 1168ms
Largest mean: 1188ms
Range: 20ms

10% regression:
1168 + 117 = 1285
1188 + 119 = 1307

5% regression:
1168 + 58 = 1226
1188 + 60 = 1248

3% regression:
1168 + 35 = 1203
1188 + 36 = 1224

Assessment:
Even a 3% regression in the smallest mean puts it 15ms higher than the largest mean. I would feel comfortable failing regressions of 3-4% if the tests ran on devices. The values produced by devices seem pretty stable, with minimal variance compared to emulator values.
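A sketch of the detectability check behind this arithmetic (the means are taken from the chunked results above; the helper function is illustrative, not part of raptor):

  // Is a regression of `pct` on the smallest observed mean distinguishable
  // from the spread of means already seen across identical runs?
  function regressionDetectable(means, pct) {
    const smallest = Math.min(...means);
    const largest = Math.max(...means);
    return smallest * (1 + pct) > largest;
  }

  const emulatorMeans = [7077.55, 7080.03, 6939.93, 7251.59, 6885.48, 6998.62]; // comment 5
  console.log(regressionDetectable(emulatorMeans, 0.05)); // false: 6885 * 1.05 = 7230 < 7252
  console.log(regressionDetectable(emulatorMeans, 0.10)); // true:  6885 * 1.10 = 7574 > 7252

  const deviceMeans = [1177.59, 1188.24, 1168.14, 1181.17, 1185.90, 1168.03]; // comment 6
  console.log(regressionDetectable(deviceMeans, 0.03)); // true: 1168 * 1.03 = 1203 > 1188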
For the more powerful emulator results in comment 9:

Smallest mean: 5848ms
Largest mean: 6522ms
Range: 674ms

Assessment:
With a more powerful emulator, we are experiencing a much larger range of means? That doesn't bode well for the reliability of the emulator. A range of 674ms already represents more than 10% of the value of the largest mean, meaning we would need a regression threshold of around 15% just to ensure its validity, which also means that we wouldn't catch any regressions of less than 15%. :(
Duration for emulator-kk results in comment 5:

6 min (approx.) for initial emulator boot-up and make raptor
+ 25 min for 100 RUNS of 1 single app (a single test iteration of 100 launches of a single app)
= 31 minutes for a single app

Extended to all apps:
6 min initial emulator boot + (25 min for 100 RUNS x 12 apps minimum) = 306 minutes for all apps if done in the same test run
x2 (master, master + PR), although the two could run concurrently via different tasks

Duration for device results in comment 6:

2 min (approx.) for initial device boot-up and make raptor
+ 14 min for 100 RUNS of 1 single app (a single test iteration of 100 launches of a single app)
= 16 minutes for a single app

Extended to all apps:
2 min initial device boot + (14 min for 100 RUNS x 12 apps minimum) = 170 minutes for all apps if done in the same test run
x2 (master, master + PR), although the two could run concurrently via different devices
Duration for emulator-kk results in comment 9:

6 min (approx.) for initial emulator boot-up and make raptor
+ 10 min for 30 RUNS of 1 single app (a single test iteration of 30 launches of a single app)
= 16 minutes for a single app

Extended to all apps:
6 min initial emulator boot + (10 min for 30 RUNS x 12 apps minimum) = 126 minutes for all apps if done in the same test run
x2 (master, master + PR), although the two could run concurrently via different tasks
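The estimates in the last few comments all follow the same simple model (boot once, then pay a fixed per-app cost); as a sketch:

  // Suite duration model used in the estimates above (all values in minutes)
  function suiteMinutes(bootMin, perAppMin, appCount) {
    return bootMin + perAppMin * appCount;
  }

  console.log(suiteMinutes(6, 25, 12)); // 306 - emulator-kk, comment 5 (100 runs/app)
  console.log(suiteMinutes(2, 14, 12)); // 170 - device, comment 6 (100 runs/app)
  console.log(suiteMinutes(6, 10, 12)); // 126 - emulator-kk, comment 9 (30 runs/app)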
Depends on: 1138074
(In reply to :Eli Perelman from comment #10)

> Assessment:
> A 5% regression falls within the normal range of values, making it
> impossible to detect regressions unless they are greater than 5%, and even
> then a regression should probably be significantly above 5% in order to
> prove that the maximum of the normal range hasn't simply increased. Falling
> back to requiring a 10% regression to fail a test means needing a *700ms*
> regression.

This doesn't surprise me in the least. I feel that emulator checks are going to be useful as sanity checks only. We shouldn't count on these as the be-all-end-all of how we monitor performance.

Among other things, as we discussed on IRC, there is no guarantee of similar performance between separate instances hosting the emulator, or even the same instance at different times. That means the most we can pull out is a simple back-to-back A->B test on the same instance.

So while this can be OK, within reason, for "fail a build if it's grossly different", it'll be pretty useless for giving us a picture of performance over time. Yesterday's results won't be comparable to today's results because they'll possibly be on a fundamentally different host speed. We could *try* to stitch it together from a series of deltas (last test had an A->B diff of 3%, next test a B->C diff of 5%, therefore we extrapolate that A->B->C had a diff of (1.03 * 1.05) = 8.15%). But that's going to be awfully indirect compared to just testing on a more stable environment like a device.

Even using this for sanity checks, I suspect we will have spikes that foul those tests, depending on how long the run takes to execute; the performance can change underneath the test as the hosting service prioritizes/deprioritizes VMs. I think we should roll it out and see, but I also think we need to have good sanity-check protocols in hand, like retesting the same build a bunch of times through the final test architecture. If it starts failing itself when the build is otherwise bit-identical, we might have an issue, depending on how often that happens.

> Even a 3% regression in the smallest mean puts it 15ms higher than the
> largest mean. I would feel comfortable failing regressions of 3-4% if they
> ran on devices. The values produced by devices seem to be pretty stable and
> with minimal variance compared to emulator values.

Again, reiterating an IRC discussion for the purposes of rolling it up into Bugzilla: my concern with this line of thinking, on both device and emulator, is that we shouldn't over-rely on the mean, or even the median. Two very different sets of performance characteristics can have similar values for these depending on how spread out the results are on either side of the center.

While I'm wary of standard deviation on a non-Gaussian distribution (task-time tests are always positively skewed due to a hard lower bound on the minimum possible time to complete the task), it may be good enough to correlate with the distribution shape. Alternately, possibly looking at a higher percentile (90th/95th) along with the mean or median would capture that.

The problem with both of these approaches is that I'm not sure you'll get stable results on them from only 30 data points; they're both very distribution-dependent, and small samples tend to have variable distributions. A 90th percentile on that is the 4th-highest point, and the 95th percentile is the 2nd-highest point. Similarly, a standard deviation covers 20 out of the 30 points, with only 10 points outside it, split across both sides.

Those all strike me as pretty thin. On the plus side, I do think even a mean/median-only approach will be OK at catching instances where the distribution doesn't change and the numbers just get higher. That would correspond to a single operation adding a predictable amount of time, which is probably the most common scenario. As long as we supplement this kind of approach with longer-running tests that gather more data (I'd suggest a daily 20-something-hour run, with as many points as we can gather in that time) to catch other types of perf regressions, I think it can be OK.
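The delta-stitching idea above composes multiplicatively rather than additively; a sketch:

  // Extrapolate a cumulative change from a chain of back-to-back A/B deltas
  function cumulativeDelta(deltas) {
    return deltas.reduce((acc, d) => acc * (1 + d), 1) - 1;
  }

  // A->B of 3%, then B->C of 5%, as in the example above:
  console.log(cumulativeDelta([0.03, 0.05])); // 0.0815 -> an 8.15% A->C difference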
(In reply to :Eli Perelman from comment #11)

> For the more powerful emulator results in comment 9:
>
> Smallest mean: 5848ms
> Largest mean: 6522ms
> Range: 674ms
>
> Assessment:
> With a more powerful emulator, we are experiencing a much larger range of
> means? That doesn't bode well for the reliability of the emulator. A range
> of 674ms already represents more than 10% of the value of the largest mean,
> meaning we would need a regression threshold of around 15% just to ensure
> its validity, which also means that we wouldn't catch any regressions of
> less than 15%.
>
> :(

I'm not ready to conclude that the increased variability correlates with the power of the emulator. I think you might find that there's a high level of variability, period, and any one of these sessions may or may not capture the minimum or maximum amount of fluctuation. I would run a *lot* of these before drawing any conclusions. Try running a day's worth of 30-test runs and see what comes out then.
(In reply to Geo Mealer [:geo] from comment #15)

> I'm not ready to conclude that the increased variability correlates with the
> power of the emulator.

Oh yeah, I wasn't making that connection; it was just disappointing to see that increased power hadn't improved the situation. That said, I agree with all your points.

I do believe that on the post-commit automation side, we need to crank up the amount of data so the visualization is logical and useful, contrary to what we currently have. Also, like you said, reliance on mean/median/p50 in post-commit automation hasn't been particularly insightful, hence the need to move to p90/p95 there with much more data. Then again, that data will continue to be gathered on devices for the foreseeable future, and those have been pretty reliable.

For the pre-commit side, it's really a balance between time and resources. Do you think it's possible to get a valuable determination from the current state of the emulators if we had more runs (maybe 40-60) and did a percentile analysis?
Flags: needinfo?(gmealer)
Depends on: 1139428
Depends on: 1139448
(In reply to :Eli Perelman from comment #16)

> For the pre-commit side, it's really a balance between time and resources.
> Do you think it's possible to get a valuable determination from the current
> state of the emulators if we had more runs (maybe 40-60) and did a
> percentile analysis?

Probably, but I honestly still don't know. But I think you should try it, just maybe not with a blocking test yet. This is long. Sorry. Let me separate concerns.

First, "current state of the emulators":

My primary concern comes back to how performance-stable the host instance for the emulator is, not anything about the emulators themselves. If your host's performance is changing (or I/O latency to virtualized disk stores changes, or any other systemic dependency) it's kind of obvious that your overall performance will change with it. Further, we host on Amazon, and it's obvious from hunting around the web that the performance of those hosts is typically quite variable.

So my conclusion there: your Firefox OS emulators, as currently hosted on Amazon (and probably any other VPS- or EC2-like host), will not be performance-stable unless you've somehow locked down performance-stable host instances. That drives my conclusion that this won't be reliable for comparing results from sessions at different times; the host instance performance can and probably will change between them.

But what I don't know is *whether the performance fluctuates significantly during a short period of time*. If so, even the proposed immediate A/B test is vulnerable, because the performance might spike or trough during the A/B test. My suspicion is that yes, it can. So then it comes down to how variable it can be, and that's why I suggested a very long-running set of performance tests in the same way you've done above: to get as much data as reasonably possible so you can see what the spread is. It has the side effect of making sure everything happens during one theoretical session, so you can also verify whether or not performance changes during that session.

But there are other concerns, and these -do- have to do with emulators and Firefox OS.

The host OS itself may not be performance-stable. The emulators run on Linux. When do maintenance daemons kick in and steal performance from the emulator? If you don't know that, you don't know when spikes will be introduced.

And Firefox OS isn't very performance-stable as it is. You can rerun simple timing tests on it over and over and get pretty variable results, with unpredictable spikes. That doesn't surprise me in a multi-layered OS with a bunch of concurrency going on, but it does mean performance testing it in general isn't going to be as precise as you'd like unless you learn how to turn off a lot of background activity.

The golden rule of performance testing is that the only thing that's supposed to be variable is the exact thing you're testing. Everything else just craps up the test. But we're talking: variable virtualized instance + variable host OS + variable Firefox OS. The emulator is probably the only stable part of it, to be honest, hence the least of my concerns. And not to be negative, but keep in mind I'm still not sure we've nailed down how to do this right on just "variable Firefox OS".

But I do think you should try it. I'd recommend making sure it works to your satisfaction on-device first, -then- moving to emulator. Take on one stabilization challenge at a time. And you might consider self-hosting the machines or VMs used for this particular test. That might go a long way towards stabilizing their performance. Maybe not everything can/should be elastic. Either way you go, also keep in mind that I/O is a VM's kryptonite. It's not just about stabilizing the VM--you've got to stabilize its external disk stores too.

Second, "40 or 50 data points":

I think you should get as many as you can afford to get time-wise during one run, as long as they're all on the same instance. Don't parallelize this between instances, due to the problem above.

There's nothing magic about 30 data points, aside from that being the point where samples are considered adequate for aggregation via a mean of means, by way of the Central Limit Theorem. I assume someone knew that significance and decided 30 would be a good number. But to make that work, you have to have a lot of independent 30+ point samples from the same build, running in the same environment (i.e. from the same population). We don't test that way, so 30 points is meaningless to us. You can't take samples from different builds and aggregate them like that, and we only generate one sample per build.

So it comes down to this: the more points you can get--again, assuming things are otherwise performance-stable--the closer your mean (or percentile, or whatever) will be to the "true" value, and the less variability you'll see due to random sampling. So get as many as you can afford for now, but temper that by doing more trials after reducing the sample size, to see if things get much more variable. Somewhere there's a sweet spot where sample consistency is "good enough"; aim for that, no less, no more. It may take time and a bunch of false positives or negatives to find it. That's why I'm saying to stage this as a non-blocking test for the time being. Also, this -all- assumes things are even stable enough to test on, per the first concern.

Finally, "percentile analysis":

My observation so far from your samples is that the means/medians are way more consistent than your high percentiles. My guess is that's because you have more outliers than we'd probably like, which probably reflects the shifting ground you're testing on.

So for the type of simple qualification test you're doing, I'd align on the statistic or combination of statistics that A) is the most stable build over build but B) will change if there's a large enough problem. That's probably mean/median, possibly considered in combination with standard deviation.

I do think that *acceptance* should be done on 90+ percentile analysis, for the simple reason that it answers the question we probably care about: "What level of quality will people experience the grand majority of the time?" as opposed to "What level of quality will people experience half the time?" Standard web testing practice seems to agree with me, so I'm confident on that. But you need way more points to make that practical, so I think it's moot here.

For a simple qualification test, just do whatever seems to call out a problem. Just don't get hung up on the threshold. It's probably going to need to be an insensitive test to avoid a lot of false positives, given the stabilization problems I outlined above. It's a sanity check; don't treat it as anything more than that. Even if you only catch regressions at a 20% threshold, go with it. It's better than nothing, and we can shrink it later once we figure out how to stabilize.
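Geo's small-sample concern is easy to demonstrate with a quick simulation: draw repeated samples of various sizes from one positively-skewed distribution and watch how much more the p95 estimate wanders than the median. Everything here is synthetic and illustrative; the distribution is an assumption, chosen only for its hard lower bound and positive skew:

  // How stable are median vs. p95 estimates at different sample sizes?
  function gauss() { // Box-Muller standard normal
    const u = 1 - Math.random(), v = Math.random();
    return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
  }

  function sample(n) { // synthetic "launch times": hard floor plus skewed noise
    return Array.from({ length: n }, () => 5400 + 1500 * Math.exp(0.8 * gauss()))
      .sort((a, b) => a - b);
  }

  function percentile(sorted, p) { // nearest-rank
    return sorted[Math.min(sorted.length - 1, Math.floor(p * sorted.length))];
  }

  for (const n of [30, 100, 300]) {
    const medians = [], p95s = [];
    for (let t = 0; t < 200; t++) {
      const xs = sample(n);
      medians.push(percentile(xs, 0.5));
      p95s.push(percentile(xs, 0.95));
    }
    const spread = arr => Math.max(...arr) - Math.min(...arr);
    console.log(`n=${n}: median spread ${spread(medians).toFixed(0)}ms, ` +
                `p95 spread ${spread(p95s).toFixed(0)}ms`);
  }

With 30 points, the nearest-rank p95 is literally the 2nd-highest observation, so it inherits the full volatility of the extreme order statistics; the spread only tightens as n grows.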
Flags: needinfo?(gmealer)
> And Firefox OS isn't very performance-stable as it is. You can rerun simple
> timing tests on it over and over and get pretty variable results, with
> unpredictable spikes. That doesn't surprise me in a multi-layered OS with a
> bunch of concurrency going on, but it does mean performance testing it in
> general isn't going to be as precise as you'd like unless you learn how to
> turn off a lot of background activity.

I should also call out that this is its own issue. Users experience that kind of variability as well. It's possible one of the things we should be tracking is how variable the OS is inherently, and tightening that up -not- just for the purposes of stabilizing tests, but also to make the user experience more predictable.
More tests in progress, tweaking the emulator to see if that results in reduced variance. https://gist.github.com/rwood-moz/3732062d4aeea123f95a
After more tests with emulator tweaks, this set of numbers is a bit more promising with regard to variance. Note that this is on emulator and not emulator-kk (emulator-kk is blocked by bug 1139428).

AWS instance: r3.xlarge, Ubuntu 14.04
Emulator (NOT emulator-kk)
Cfg: 2048MB RAM, default 2047MB partition size (which is the max)
Added: -wipe-data (rest of settings default)
Cmd: DEBUG=* RUNS=30 APPS='settings' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shut down and restart the emulator and do a make raptor before each cmd iteration); separate iterations, but appending to one log.

Summary of results (via test-stats; values in ms, mode omitted as it always equals the median):

Iteration 0: Mean 14510.14 | Median 14265 | Min 13736 | Max 15820 | StdDev 607.93 | p95 15706.95
Iteration 1: Mean 14179.28 | Median 13896 | Min 13465 | Max 15896 | StdDev 679.90 | p95 15744.00
Iteration 2: Mean 14358.21 | Median 14066 | Min 13661 | Max 15906 | StdDev 623.94 | p95 15639.05
Iteration 3: Mean 14014.45 | Median 13933 | Min 13193 | Max 15219 | StdDev 462.82 | p95 15188.60
Iteration 4: Mean 14843.62 | Median 14652 | Min 14001 | Max 16042 | StdDev 555.36 | p95 15707.60
Iteration 5: Mean 14158.59 | Median 14006 | Min 13627 | Max 15366 | StdDev 452.39 | p95 15283.35
Iteration 6: Mean 14172.24 | Median 13995 | Min 13545 | Max 15519 | StdDev 553.37 | p95 15311.90
Iteration 7: Mean 14049.59 | Median 13991 | Min 13558 | Max 15243 | StdDev 366.07 | p95 14619.80
Iteration 8: Mean 14069.93 | Median 13863 | Min 13558 | Max 15389 | StdDev 488.00 | p95 15214.20
Iteration 9: Mean 14224.31 | Median 13895 | Min 13409 | Max 15485 | StdDev 660.70 | p95 15348.20

(Raw data here: https://gist.github.com/rwood-moz/164db86970da075ea1e6)
Switched to using the 'template' app. Initial results here: https://gist.github.com/rwood-moz/964a4fea4affdae263c4
Making progress. Latest results with more emulator/qemu tweaks: https://gist.github.com/rwood-moz/53ee14e19de8ac63bcaf Next I will try running the emulator on a ramdisk that is mounted in the docker container.
Latest numbers, running the emulator on a ramdisk: https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c
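The ramdisk setup itself is just tmpfs plus a Docker bind mount; host-side it could look roughly like this (paths, the image name, and the copy step are assumptions, not the exact raptor-docker-runner invocation):

  // host_setup.js - sketch of the tmpfs + bind-mount setup (needs root)
  const { execSync } = require('child_process');
  const run = cmd => execSync(cmd, { stdio: 'inherit', shell: '/bin/bash' });

  // 1024MB ramdisk on the host; the mount point is a hypothetical path
  run('mkdir -p /mnt/raptor-ramdisk');
  run('mount -t tmpfs -o size=1024m tmpfs /mnt/raptor-ramdisk');

  // Copy the emulator image onto the ramdisk so qemu disk I/O hits RAM;
  // B2G_EMULATOR_DIR is a hypothetical source location
  run('cp -r "$B2G_EMULATOR_DIR"/. /mnt/raptor-ramdisk/');

  // Bind-mount the ramdisk into the container; image name assumed from the repo
  run('docker run -v /mnt/raptor-ramdisk:/emulator raptor-docker-runner');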
To summarize all the test data thus far, this is the best setup found for reducing launch-test variance. Note that this is using the ICS emulator; emulator-kk is currently out of commission for these tests because of bug 1139428 and bug 1139448. Hopefully this setup will hold for emulator-kk as well.

AWS instance: r3.xlarge, Ubuntu 14.04
Docker container: https://github.com/rwood-moz/raptor-docker-runner.git
Emulator (NOT emulator-kk)
** Emulator running on a 1024MB RAMDISK mounted into the docker container **
Cfg: 2048 RAM, 2047 partition (max partition size)
Removed: -skin, -skindir, -camera-back
Added: -wipe-data, -no-skin (rest of settings default)
Added: '-net none -bt hci,null' to end of -qemu $TAIL_ARGS
Added: -cache-size 2048
Cmd: DEBUG=* RUNS=30 APPS='template' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shut down and restart the emulator and do a make raptor before each cmd iteration)
Same setup as above, except on a c4.2xlarge AWS instance type (thanks to Jonas' info on his instance performance tests/rankings). The app launch times are faster, so the suite will run faster; however, the variance is not improved much at all compared with the tests on the r3.xlarge (at least in this limited sample). https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_6-txt
Did more cycles on the c4.2xlarge and it seems pretty consistent. This is on the ICS emulator currently, but will move to emulator-kk when it is working with raptor again. This looks like the best setup:

AWS instance: c4.2xlarge, Ubuntu 14.04
** Emulator running on a 1024MB RAMDISK mounted into the docker container **
Cfg: 2048 RAM, 2047 partition (max partition size)
Removed: -skin, -skindir, -camera-back
Added: -wipe-data, -no-skin (rest of settings default)
Added: '-net none -bt hci,null' to end of -qemu $TAIL_ARGS
Added: -cache-size 2048
Cmd: DEBUG=* RUNS=30 APPS='template' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shut down and restart the emulator and do a make raptor before each cmd iteration)

Eli, what do you think of these latest numbers (this file and the three more cycles below it)? https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_6-txt
Flags: needinfo?(eperelman)
Looks like we've made some good progress. Just for kicks, can you do a set of 30 runs for the Settings app so we can compare?
Flags: needinfo?(eperelman)
Switched back to emulator-kk, as it is working again now. Unfortunately, the launch time is slower and the variance is higher on emulator-kk. :(

AWS instance: c4.2xlarge, Ubuntu 14.04
** Emulator-kk running on a 1024MB RAMDISK mounted into the docker container **
Cfg: 2048 RAM, 2047 partition (max partition size)
Removed: -skin, -skindir, -camera-back
Added: -no-skin (not using -wipe-data because on emulator-kk it actually removes the /data folder, which we need)
Added: '-net none -bt hci,null' to end of -qemu $TAIL_ARGS
Added: -cache-size 2048
Cmd: DEBUG=* RUNS=30 APPS='template' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shut down and restart the emulator and do a make raptor before each cmd iteration)

Results: https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_9a-txt
Same setup, but with the 'settings' app: https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_9b-txt

What do you think, Eli?
Flags: needinfo?(eperelman)
Rob and I discussed this in person, and while the results from emulator-kk aren't as good as emulator, the percentage standard deviation of p95 sits at 8% for emulator and 12% for emulator-kk. A difference of 4% will have to be tolerated in this instance. I think for getting this to the POC stage, we should start with a threshold of 15% and evaluate the efficacy and rate of false positives to see if this is valid.
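One plausible reading of "percentage standard deviation of p95" is the relative spread of the per-iteration p95 figures (their standard deviation divided by their mean); a sketch of that computation, with made-up input values:

  // Relative (percentage) spread of a set of per-iteration p95 values
  function pctStdDev(values) {
    const mean = values.reduce((a, b) => a + b, 0) / values.length;
    const sd = Math.sqrt(
      values.reduce((s, v) => s + Math.pow(v - mean, 2), 0) / values.length
    );
    return (sd / mean) * 100;
  }

  // Hypothetical per-iteration p95 launch times (ms); not the actual gist data
  console.log(pctStdDev([15200, 14100, 16800, 13900, 15900, 14600]).toFixed(1) + '%'); // ~6.8%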
Flags: needinfo?(eperelman)
No longer depends on: 1139448
Work is now underway to integrate the raptor launch test on the emulator with gaia-try, via TaskCluster, using the parameters/test environment established here.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED