Bug 1222890 - Compare e10s vs non-e10s crash rates from Telemetry crash pings
Status: RESOLVED INVALID (opened 9 years ago, closed 8 years ago)
Product/Component: Toolkit :: Telemetry (defect, P1)
Tracking: e10s +
Reporter: vladan; Assignee: Unassigned
References: Blocks 1 open bug
We need to track e10s vs non-e10s crash rates on an ongoing basis from collected Telemetry crash pings. We should also be able to look at e10s vs non-e10s crash rates during A/B experiments (e.g. bug 1193089)
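For reference, a minimal sketch of the kind of scheduled job discussed below, assuming the python_moztelemetry API of the era (get_pings / get_pings_properties in a Spark notebook where an `sc` SparkContext is available); the channel, dates and doc_type names in PING_OPTIONS are illustrative placeholders, not values from the actual analysis.

    # Minimal sketch (PySpark + python_moztelemetry); PING_OPTIONS values are illustrative.
    from moztelemetry import get_pings, get_pings_properties

    PING_OPTIONS = dict(app="Firefox", channel="aurora",
                        submission_date=("20151101", "20151107"))

    # Crash pings and session pings for the same window.
    crash_pings = get_pings(sc, doc_type="crash", **PING_OPTIONS)
    session_pings = get_pings(sc, doc_type="saved_session", **PING_OPTIONS)

    # Keep only what is needed to split by e10s and normalize by usage.
    subset = get_pings_properties(session_pings, [
        "clientId",
        "meta/submissionDate",
        "environment/settings/e10sEnabled",
        "payload/info/subsessionLength",
    ])

    # Rough crash-ping counts per branch as a starting point.
    crashes_by_branch = crash_pings.map(
        lambda p: ("e10s" if p.get("environment", {}).get("settings", {}).get("e10sEnabled")
                   else "non-e10s", 1)).countByKey()
    print(crashes_by_branch)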
Comment 1 (Reporter)•9 years ago
Birunthan: for now, can you post your findings from Aurora? And also, please re-run your analysis on the Telemetry A/B experiment from bug 1193089
Updated (Reporter)•9 years ago
Flags: needinfo?(birunthan)
Comment 2•9 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #0)
> We need to track e10s vs non-e10s crash rates on an ongoing basis from
> collected Telemetry crash pings. We should also be able to look at e10s vs
> non-e10s crash rates during A/B experiments (e.g. bug 1193089)
I presume this should be done through a scheduled analysis job? If so, how frequently should I make it run and what timeframe should the analysis look at (e.g. last week)?
Comment 3•9 years ago
I opened https://github.com/vitillo/e10s_analyses/pull/5 for this.
You can view the unreviewed analysis here: https://github.com/poiru/e10s_analyses/blob/crash-rate/aurora/e10s_crash_rate.ipynb
Also adding ni? for comment 2.
Flags: needinfo?(birunthan) → needinfo?(vladan.bugzilla)
Comment 4 (Reporter)•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #2)
> I presume this should be done through a scheduled analysis job? If so, how
> frequently should I make it run and what timeframe should the analysis look
> at (e.g. last week)?
Let's copy crash-stats.mozilla.com. So it would be a daily job that plots daily crash statistics. It could run on Nightly, Aurora and Beta to start.
We should also run the stability comparison after each A/B experiment, using only A/B data. The experiment stability analysis will be ad hoc.
Flags: needinfo?(vladan.bugzilla)
Comment 5 (Reporter)•9 years ago
Benjamin: Does Telemetry report content-process crashes? I know we submit crash reports for content-process crashes so Telemetry should generate crash pings for these too, right?
Birunthan: do Telemetry pings have information about whether it was a parent or child e10s process that crashed? We'll want to include this information in the analysis as well.
Flags: needinfo?(benjamin)
Comment 6 (Reporter)•9 years ago
Birunthan: Telemetry reports content-process crashes via SUBPROCESS_ABNORMAL_ABORT, SUBPROCESS_CRASHES_WITH_DUMP, and crash-pings
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/docs/crashes.rst#23
Can you compare:
1) e10s parent-process crash ping rates vs e10s aborted-session ping rates
2) repeat #1 for single-process only
3) e10s child-process crash ping rates vs SUBPROCESS_CRASHES_WITH_DUMP rates vs SUBPROCESS_ABNORMAL_ABORT rates
We'll need all of the above compared vs crash-stats rates, but let's start with this (a rough sketch of pulling these numbers follows below). I can show you how to do it in crash-stats.
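A rough sketch of how items 1 and 2 might be pulled, assuming the python_moztelemetry API of the time and assuming that aborted-session pings are main pings with payload/info/reason == "aborted-session"; PING_OPTIONS is an illustrative placeholder, not the options used in the actual notebook.

    # Sketch only: crash-ping counts vs aborted-session ping counts, split by e10s.
    from moztelemetry import get_pings

    PING_OPTIONS = dict(app="Firefox", channel="aurora",
                        submission_date=("20151101", "20151107"))

    crash_pings = get_pings(sc, doc_type="crash", **PING_OPTIONS)
    main_pings = get_pings(sc, doc_type="main", **PING_OPTIONS)

    def branch(p):
        # The e10s flag lives in the ping's environment.
        settings = p.get("environment", {}).get("settings", {})
        return "e10s" if settings.get("e10sEnabled") else "non-e10s"

    crash_counts = crash_pings.map(lambda p: (branch(p), 1)).countByKey()
    aborted_counts = (main_pings
        .filter(lambda p: p.get("payload", {}).get("info", {}).get("reason") == "aborted-session")
        .map(lambda p: (branch(p), 1))
        .countByKey())
    print(crash_counts, aborted_counts)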
Flags: needinfo?(benjamin)
Comment 7•9 years ago
I'd like to suggest that you should (if possible) also produce the crashes chart excluding accessibility.
I ran a few searches last week on crash-stats that seemed to highlight accessibility as a major cause of extra e10s crashes.
Updated•9 years ago
tracking-e10s: --- → +
Comment 8•9 years ago
There is not a separate crash ping for content crashes, but they are recorded in the following histograms:
SUBPROCESS_ABNORMAL_ABORT
SUBPROCESS_CRASHES_WITH_DUMP
PROCESS_CRASH_SUBMIT_ATTEMPT
PROCESS_CRASH_SUBMIT_SUCCESS
Comment 9•9 years ago
Vladan, the PR was merged so you can now use this link: http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_crash_rate.ipynb
I'll work on another PR to address comment 6.
Updated•9 years ago
Status: NEW → ASSIGNED
Comment 10 (Reporter)•9 years ago
The e10s crash rate in the Beta experiment is more than twice that of non-e10s:
https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate.ipynb
Even when only considering profiles without extensions, the e10s crash rate is improved but still twice as bad as single-process:
https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate_without_extensions.ipynb
Unsurprisingly, the profiles with extensions have the worst e10s crash rate:
https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate_with_extensions.ipynb
Flags: needinfo?(jmathies)
Comment 11 (Reporter)•9 years ago
Btw, I don't think bug 1227312 affects SUBPROCESS_ABNORMAL_ABORT, I think it only affects SUBPROCESS_CRASHES_WITH_DUMP
Comment 12•9 years ago
(In reply to Vladan Djeric (:vladan) -- please needinfo from comment #11)
> Btw, I don't think bug 1227312 affects SUBPROCESS_ABNORMAL_ABORT, I think it
> only affects SUBPROCESS_CRASHES_WITH_DUMP
Yes that's correct.
(In reply to Vladan Djeric (:vladan) -- please needinfo from comment #10)
> The e10s crash rate in the Beta experiment is more than twice that of
> non-e10s:
> https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate.ipynb
OK, this we can ignore for now. Higher crash/hang rates with add-ons installed are unfortunate but not surprising.
>
> Even when only considering profiles without extensions, the e10s crash rate
> is imporved but still twice as bad as single-process:
> https://github.com/poiru/e10s_analyses/blob/beta/beta/
> e10s_crash_rate_without_extensions.ipynb
Are we sure we're not double counting here? Where does 'crashes_per_day' come from in e10s profiles? Is there any chance it includes SUBPROCESS_ABNORMAL_ABORT numbers?
Flags: needinfo?(jmathies)
Comment 13•9 years ago
I'm afraid (more relieved) the script is wrong.
Viewing the 28th vs. crash-stats:
non-e10s browser 2413 v 1087 221%
e10s browser 1340 v 495 271%
e10s content 3797 v 783 485%
I set out to find why content is so much higher.
Now I see it is the browser count that has artificially been made lower.
Two filters:
is_sensible_creation_date (not sure of the impact, probably minor)
is_after_first_session_timestamp (a high number get removed)
The filters don't get applied to content.
"Make sure it's not the first session ping as the experiment branch will not be enforced until the next restart"
That is no longer the case in the beta experiment, as branches are created for the first session.
Comment 14•9 years ago
Also, just noticed that the graph should be generated by dividing by installs_per_day rather than sessions_per_day, where installs_per_day is the list of the number of unique clientIds in each day's sessions.
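A minimal sketch of that denominator, assuming an RDD (hypothetically named `subset`) that was reduced with get_pings_properties to the "clientId" and "meta/submissionDate" fields:

    # Sketch: installs_per_day = number of unique clientIds seen among each day's session pings.
    installs_per_day = (subset  # RDD of dicts with "clientId" and "meta/submissionDate"
        .map(lambda p: (p["meta/submissionDate"], p["clientId"]))
        .distinct()
        .map(lambda day_client: (day_client[0], 1))
        .reduceByKey(lambda a, b: a + b)
        .collectAsMap())

    # crashes_per_day[day] / installs_per_day[day] then gives crashes per client per day.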
Comment 15•9 years ago
I reran the analysis using clients per day instead of sessions per day: https://gist.github.com/poiru/3b7495be8adcf77f020f
The numbers seem quite high. After running this for a different set of dates, I noticed that a particular client sent 1300 crash pings over 4 days. How should we handle outliers like this?
Flags: needinfo?(vladan.bugzilla)
Comment 16•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #15)
> I reran the analysis using clients per day instead of sessions per day:
> https://gist.github.com/poiru/3b7495be8adcf77f020f
>
> The numbers seem quite high. After running this for a different set of
> dates, I noticed that a particular client sent 1300 crash pings over 4 days.
> How should we handle outliers like this?
Do you know if this user had addons? Serious outliers like this should probably be investigated. Can we connect this user to reports on crashstats?
Comment 17•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #15)
> I reran the analysis using clients per day instead of sessions per day:
> https://gist.github.com/poiru/3b7495be8adcf77f020f
These numbers look much better than the first report. We were at a regression of around 140%; this new report brings the regression down to 55%.
Comment 18•9 years ago
Would it be worthwhile running the numbers using the main ping (aborted-session + shutdown) and seeing how they compare to this run, which uses crash and saved-session pings?
Comment 19•9 years ago
Here is a rerun for a different timeframe with a particular client (1000+ non-e10s crashes) excluded: https://gist.github.com/poiru/dbcd6cff0bd7d862fb12
I also included a sorted list of crashes per client.
Comment 20•9 years ago
The 4th is not a good day to include; it's the end of the experiment.
A combined list for e10s might be useful, maybe as tuples, with the length of the list representing the number of clients that crashed at least once.
Comment 21•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #19)
> Here is a rerun for a different timeframe with a particular client (1000+
> non-e10s crashes) excluded:
> https://gist.github.com/poiru/dbcd6cff0bd7d862fb12
>
> I also included a sorted list of crashes per client.
Next time you run this, can you add a comparison of chrome side crashes per client vs. non-e10s crashes per client? It's frustrating to me that browser crashes haven't improved much if at all with e10s.
Comment 22•9 years ago
Added bug 1241106 to cover another theory for the discrepancy.
Comment 23•9 years ago
I'm concerned that we're talking about various crash numbers without having an agreed definition of which ones we should be using. My proposal at https://telemetry.mozilla.org/new-pipeline/crash-summary.html was that we should be using the following two metrics as our official crash rates:
* crash pings/subsessionlength (reported as crashes per 1000 usage-hours)
* SUBPROCESS_CRASHES_WITH_DUMP["content"] / subsessionlength (also reported as crashes per 1000 usage-hours)
Together these are the crash rate for a build. I assert that unless there is a clear reason not to use these, this should be the generally-agreed metric for crash rates. I'd be willing to consider using activeTicks instead of subsessionlength if there's evidence that produces better results.
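Spelled out as code (a sketch of the arithmetic only, not the crash-summary job itself), the proposed metric is crashes divided by usage hours and scaled to 1,000 hours, where usage hours come from summing subsessionLength (seconds). The same function applies to the content rate, with the SUBPROCESS_CRASHES_WITH_DUMP["content"] totals as the numerator.

    # Sketch of the proposed metric: crashes per 1,000 usage-hours.
    def crashes_per_1000_usage_hours(crash_count, total_subsession_seconds):
        usage_hours = total_subsession_seconds / 3600.0
        return 1000.0 * crash_count / usage_hours

    # Illustrative numbers only: 500 crash pings over 50,000 usage-hours -> 10 per 1,000 hours.
    print(crashes_per_1000_usage_hours(500, 50000 * 3600))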
Furthermore, I have the following concern about the analysis at https://gist.github.com/poiru/3b7495be8adcf77f020f . Has anyone else reviewed this analysis yet?
* what's the purpose of is_sensible_creation_date? It's perfectly normal for a ping to have a different submission date than creation date, especially for pings which are created at shutdown (the final subsession). These aren't submitted until the next startup, which can easily be the next day, or several days later if there's a weekend. I'm worried that you're throwing away essential data.
Comment 24•9 years ago
Being able to link two separate systems (with full accountability for the raw discrepancy) is a good indication that what is being analysed is correct. Having extra information (such as top crash numbers and the users experiencing crashes) can be good for identifying problems. I agree a single definition is good for long-term comparison. Personally I will stay far away from determining that definition; arguing over what to use isn't greatly helpful.
My concern with crash pings and the (saved-session ping) SUBPROCESS_CRASHES_WITH_DUMP [on current 44, using SUBPROCESS_ABNORMAL_ABORT as a reasonable approximation due to the bug], hence comment 18, is that they are different systems. I know far too little to have confidence that they combine together. A similar example with crash-stats: the Telemetry crash-submit-success numbers indicate a significant difference. Knowing that allows the discrepancy to be adjusted for.
Comment 25 (Reporter)•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #15)
> I reran the analysis using clients per day instead of sessions per day:
> https://gist.github.com/poiru/3b7495be8adcf77f020f
>
> The numbers seem quite high. After running this for a different set of
> dates, I noticed that a particular client sent 1300 crash pings over 4 days.
> How should we handle outliers like this?
- Yes, I would exclude crazy outliers (e.g. > 10 crashes per day) from any average; they are worth investigating separately, though (see the sketch after this list)
- I don't see the "crashes per client" plot at your link? https://gist.github.com/poiru/3b7495be8adcf77f020f
- Benjamin, Jim & Jonathan can provide better guidance on crash rate metrics
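A minimal sketch of that kind of outlier exclusion, assuming the crash pings have been reduced with get_pings_properties to the "clientId" and "meta/submissionDate" fields; the 10-crashes-per-day cutoff comes from the suggestion above and is otherwise arbitrary.

    # Sketch: drop clients that exceed a per-day crash-ping cutoff before averaging.
    MAX_CRASHES_PER_DAY = 10

    crashes_per_client_day = (crash_pings  # RDD of dicts with "clientId" and "meta/submissionDate"
        .map(lambda p: ((p["clientId"], p["meta/submissionDate"]), 1))
        .reduceByKey(lambda a, b: a + b))

    outlier_clients = set(crashes_per_client_day
        .filter(lambda kv: kv[1] > MAX_CRASHES_PER_DAY)
        .map(lambda kv: kv[0][0])
        .distinct()
        .collect())

    filtered_crash_pings = crash_pings.filter(
        lambda p: p["clientId"] not in outlier_clients)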
Flags: needinfo?(vladan.bugzilla)
Comment 26•9 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #23)
> I'm concerned that we're talking about various crash numbers without having
> an agreed definition of which ones we should be using. My proposal at
> https://telemetry.mozilla.org/new-pipeline/crash-summary.html was that we
> should be using the following two metrics as our official crash rates:
>
> * crash pings/subsessionlength (reported as crashes per 1000 usage-hours)
> * SUBPROCESS_CRASHES_WITH_DUMP["content"] / subsessionlength (also reported
> as crashes per 1000 usage-hours)
Does this look reasonable: https://gist.github.com/poiru/4cfefe157b9ffa353389
I limited this to just a single build ID and a few days for now. I can include other builds and more days if the analysis looks right.
Also note that the analysis doesn't yet attempt to filter out any disproportionately crashy clients.
Flags: needinfo?(benjamin)
Comment 27•9 years ago
This is currently using SUBPROCESS_ABNORMAL_ABORT because of the double-counting bug with SUBPROCESS_CRASHES_WITH_DUMP?
That sucks because we know that ABNORMAL_ABORT overcounts in some normal shutdown cases. We should at least add a comment to that effect. But overall I think this is the right approach. Clearly what we need to do is repeat this with the 45 experiment and CRASHES_WITH_DUMP and then focus our efforts on either the top or the "new" content-process crashes, since overall chrome-process stability has improved significantly.
Flags: needinfo?(benjamin)
Comment 28•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #26)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #23)
> > I'm concerned that we're talking about various crash numbers without having
> > an agreed definition of which ones we should be using. My proposal at
> > https://telemetry.mozilla.org/new-pipeline/crash-summary.html was that we
> > should be using the following two metrics as our official crash rates:
> >
> > * crash pings/subsessionlength (reported as crashes per 1000 usage-hours)
> > * SUBPROCESS_CRASHES_WITH_DUMP["content"] / subsessionlength (also reported
> > as crashes per 1000 usage-hours)
>
> Does this look reasonable: https://gist.github.com/poiru/4cfefe157b9ffa353389
SUBPROCESS_ABNORMAL_ABORT would include counts for plugins, content, and gmp. When we calculate non-e10s crash numbers, what do we get here?
> crash_pings = get_pings(sc, doc_type="crash", **PING_OPTIONS).filter(is_in_e10s_experiment)
Comment 29•9 years ago
(In reply to Jim Mathies [:jimm] from comment #28)
>
> SUBPROCESS_ABNORMAL_ABORT would include counts for plugins, content, and
> gmp.
Looking at this more, I see we're filtering the SUBPROCESS_ABNORMAL_ABORT set on "content" type crashes here, so I guess we're getting the right subset. I'm still curious what the non-e10s code does, though; that's harder to follow.
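To illustrate what filtering on "content" means here (a sketch assuming the keyed-histogram paths documented in crashes.rst and the `session_pings` RDD from the earlier sketches; not the notebook's actual code): only the "content" key of the keyed histogram is read, so plugin and gmplugin aborts are not counted.

    # Sketch: read only the "content" key of the keyed histogram for e10s pings.
    from moztelemetry import get_pings_properties

    props = get_pings_properties(session_pings, [
        "environment/settings/e10sEnabled",
        "payload/keyedHistograms/SUBPROCESS_ABNORMAL_ABORT/content",
    ])

    def content_aborts(p):
        h = p["payload/keyedHistograms/SUBPROCESS_ABNORMAL_ABORT/content"]
        return int(h.sum()) if h is not None else 0  # histogram values come back as a pandas Series

    e10s_content_aborts = (props
        .filter(lambda p: p["environment/settings/e10sEnabled"])
        .map(content_aborts)
        .sum())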
Comment 30•9 years ago
Here is the analysis for the beta 45 pings so far: https://gist.github.com/poiru/d6c98741f0f7cc81b1f1
It seems like non-e10s and e10s-parent are similar to the beta 44 numbers, but e10s-content has regressed from 27 to 36 crashes per 1000 usage hours.
Comment 31•9 years ago
(In reply to Jim Mathies [:jimm] from comment #29)
> (In reply to Jim Mathies [:jimm] from comment #28)
> >
> > SUBPROCESS_ABNORMAL_ABORT would include counts for plugins, content, and
> > gmp.
>
> Looking at this more, I see we're filtering the SUBPROCESS_ABNORMAL_ABORT
> set on "content" type crashes here, so I guess we're getting the right sub
> set. I'm still curious what the non-e10s code does though, that's harder to
> follow.
I'm not sure what you mean. We don't look at SUBPROCESS_ABNORMAL_ABORT for non-e10s.
Comment 32•9 years ago
I reran the analysis looking at SUBPROCESS_CRASHES_WITH_DUMP instead of SUBPROCESS_ABNORMAL_ABORT since I've been told that we can get an abort on shutdown sometimes.
Sadly, the numbers appear to be no better: https://gist.github.com/chutten/e80d7f2f1a52f07e642b
Comment 33•9 years ago
A newly added top crash appears to occur with add-ons only.
With any luck, running the analysis without add-ons won't add a regression vs. 44.
Comment 34•9 years ago
Just spotted ~10% extra users in both 45 experiments' control groups: accessibility hasn't been removed.
It seems unlikely that bug 1241106 will get dealt with soon, but I'm wondering: should telemetry be added for counting the ShutDownKill KillHards?
https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#3641
Comment 35•9 years ago
Having put a tiny bit more thought into it, that one spot:
- does not cover all possible shutdown content crashes, and
- there can be content shutdown crashes without a full browser shutdown (particularly when moving to having multiple content processes).
It still might not be a bad thing to record; it's just that more could also be recorded.
Comment 36•9 years ago
The crash analyses for the last and current experiment are available at:
https://github.com/vitillo/e10s_analyses/tree/master/beta45-withaddons
https://github.com/vitillo/e10s_analyses/tree/master/beta45-withoutaddons
I reran the withoutaddons experiment using SUBPROCESS_CRASHES_WITH_DUMP instead of SUBPROCESS_ABNORMAL_ABORT.
build ID non-e10s e10s-parent e10s-content
20160211221018 24.325 12.396 19.977 (SUBPROCESS_ABNORMAL_ABORT)
20160211221018 24.325 12.396 21.630 (SUBPROCESS_CRASHES_WITH_DUMP)
The numbers are still very different from those reported by http://bsmedberg.github.io/telemetry-dashboard/new-pipeline/crash-summary.html (select beta channel). I'm looking into it.
Comment 37•9 years ago
I was reviewing the notebook and found it pretty confusing, so I worked up an alternate way of doing it. See https://gist.github.com/bsmedberg/b28263ba0df97ddf0106
NOTE: this is done on a very small sample, so it needs to be re-run on a larger cluster with the full dataset. Here are some notes of what I changed:
* All build IDs: I didn't select a particular buildid, because I wasn't sure why that would be valuable.
* Used get_pings_properties to normalize the data a bit earlier, to make it easier to write filters.
* Did a groupby just to check that the experiment branch matched up with the environment.settings.e10sEnabled setting. They do match.
* Used accumulators to collect the counts of e10s and non-e10s separately (see the sketch after this list).
The numbers here are close to regular beta, but have some surprises:
* There are a rather significant number of content-process crashes showing up in the non-e10s branch with e10s disabled. This could be because we run the thumbnail service in a content process, but it's still pretty weird. It makes me not trust the data and wonder if we should go back and do some basic data sanity-checking with the experiment. Please talk to Felipe and Ryan about that.
* I added some plugin crash rate checking, and according to this the plugin crash rate went down too. Yay.
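A minimal sketch of the accumulator idea described above (not the gist's actual code); it assumes `subset` is an RDD produced by get_pings_properties with the e10sEnabled flag and subsessionLength, and uses 3600 seconds per hour.

    # Sketch: tally usage seconds per branch in a single pass with Spark accumulators.
    e10s_seconds = sc.accumulator(0)
    non_e10s_seconds = sc.accumulator(0)

    def tally(p):
        seconds = p["payload/info/subsessionLength"] or 0
        if p["environment/settings/e10sEnabled"]:
            e10s_seconds.add(seconds)
        else:
            non_e10s_seconds.add(seconds)

    subset.foreach(tally)

    usage_hours = {
        "e10s": e10s_seconds.value / 3600.0,
        "non-e10s": non_e10s_seconds.value / 3600.0,
    }
    # Crash counts collected the same way can then be divided by these to get per-1,000-hour rates.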
Comment 38•9 years ago
Could you take a look at the measurement FX_THUMBNAILS_BG_CAPTURE_DONE_REASON_2 and compare with the list of reasons here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/thumbnails/BackgroundPageThumbs.jsm#25
Comment 39•9 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #37)
> I was reviewing the notebook and found it pretty confusing, so I worked up
> an alternate way of doing it. See
> https://gist.github.com/bsmedberg/b28263ba0df97ddf0106
I think this is off by a factor of 10. `self.session_seconds.value / 360.0 / 1000` should probably use 3600 instead of 360.
I ran the fixed version using the same PING_OPTIONS as my analysis:
non-e10s e10s
usage hours 808 721
chrome crashes 19534 9054
content crashes 2070 15601
plugin crashes 7471 9485
main crash rate 24.18 12.55
main+content crash rate 26.75 34.18
plugin crash rate 9.25 13.15
https://gist.github.com/poiru/67383381d6895cb90347
The results are similar to the numbers I got with my analysis in comment 36.
Comment 40•9 years ago
(In reply to :Felipe Gomes (needinfo me!) from comment #38)
> Could you take a look at the measurement
> FX_THUMBNAILS_BG_CAPTURE_DONE_REASON_2 and compare with the list of reasons
> here:
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/thumbnails/
> BackgroundPageThumbs.jsm#25
Done: https://gist.github.com/poiru/9af631c6c5f60d5025f0
The results (for non-e10s parent) are:
0 16230 (TEL_CAPTURE_DONE_OK)
1 41817 (TEL_CAPTURE_DONE_TIMEOUT)
2 0
3 0
4 95 (TEL_CAPTURE_DONE_CRASHED)
5 46 (TEL_CAPTURE_DONE_BAD_URI)
Comment 41•9 years ago
Social is another feature that might use content processes without e10s. Shane said:
> I doubt the worker is creating a significant contributor to crashes unless they
> also correlate to the Chinese distribution where Weibo is pre-installed.
Could you also look at the non-e10s content crash submissions and see if they are coming mostly from Chinese builds? I don't know exactly what to look for in the Telemetry environment to determine that, but let me find out.
Comment 42•9 years ago
You're right about 360/3600. I made the same mistake in the dashboard at https://github.com/mozilla/telemetry-dashboard/blob/gh-pages/new-pipeline/src/crash-summary.js#L65 which I'll fix shortly.
Why are you still using those restrictive PING_OPTIONS? In particular, you're limiting to a particular buildid, which seems unfortunate given that we're running the experiment for a longer time so that we can have high confidence in the results.
That the non-e10s main crash rate varies between 11.6 and 24.18 is evidence of something wrong or at least weird. What are next steps now?
Comment 43•9 years ago
(In reply to :Felipe Gomes (needinfo me!) from comment #41)
> Could you also look at the non-e10s content crashes submissions and see if
> they are coming mostly from Chinese builds? I don't know exactly what to
> look for in the telemetry environment to determine that but let me find out
You could look first to see if their locale is zh-CN, and some more validation would be to see if they have an add-on with the id of "m.weibo.cn@services.mozilla.org" in the list of activeAddons.
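A sketch of that check, assuming raw ping dicts and the environment layout of the time (environment.settings.locale and environment.addons.activeAddons keyed by add-on id); the variable `non_e10s_content_crash_pings` is a hypothetical RDD of the submissions in question.

    # Sketch: what fraction of these submissions look like the Chinese distribution?
    WEIBO_ID = "m.weibo.cn@services.mozilla.org"

    def looks_like_chinese_build(ping):
        env = ping.get("environment", {})
        locale = env.get("settings", {}).get("locale")
        active_addons = env.get("addons", {}).get("activeAddons", {})
        return locale == "zh-CN" or WEIBO_ID in active_addons

    zh_fraction = (non_e10s_content_crash_pings
        .map(lambda p: 1.0 if looks_like_chinese_build(p) else 0.0)
        .mean())
    print(zh_fraction)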
Comment 44•9 years ago
(In reply to Benjamin Smedberg [:bsmedberg] from comment #42)
> Why are you still using those restricting PING_OPTIONS? In particular you're
> limiting to a particular buildid, which seems unfortunate given that we're
> running the experiment for a longer time so that we can have high confidence
> in the results.
I did that just for a fair comparison between our analyses. I went ahead and ran your analysis on the full dataset with your original PING_OPTIONS: https://gist.github.com/poiru/363fa343dd7cbec17b1f
Looking at 4 additional days of pings, we get: https://gist.github.com/poiru/1ceec405172bcafa7305
> That the non-e10s main crash rate varies between 11.6 and 24.18 is evidence
> of something wrong or at least weird. What are next steps now?
The numbers in the two links above are pretty consistent. The numbers for the 20160211221018 build (comment 39) were higher, but perhaps that build was in fact crashier than the other builds in the experiment. I'm looking into this and will post an update within a day.
Comment 45•9 years ago
I ran the analysis using all pings for specific build IDs without restricting the submission date. While looking at the results, I noticed that 20160215141016 and 20160211221018 happened to have similar usage hours for non-e10s, but the non-e10s crash rate seems to be around 50% higher for 20160215141016.
20160215141016:
non-e10s e10s
usage hours 2157 1851
chrome crashes 41629 18405
content crashes 4765 32738
plugin crashes 17688 30040
main crash rate 19.30 9.94
main+content crash rate 21.50 27.63
plugin crash rate 8.20 16.23
20160211221018:
non-e10s e10s
usage hours 2179 2150
chrome crashes 25278 12337
content crashes 4046 29275
plugin crashes 14129 17321
main crash rate 11.60 5.74
main+content crash rate 13.46 19.35
plugin crash rate 6.48 8.06
I've tried to figure out why the non-e10s main crash rate in particular is different. I looked at the distribution of crashes per client and it seems consistent. However, 20160215141016 has about 20k clients that sent crash pings and 20160211221018 only has 13k clients that sent crash pings.
I've tried various things like checking only clientIds that sent crash pings for both builds, but the number of crashes is still quite different (16k/11k vs 41k/25k).
I'll spend more time on this, but I'm mostly taking shots in the dark here. If someone has any idea what might be going on, please do share!
Comment 46•9 years ago
Do we know how many clients make up the usage hour buckets for each build / experiment branch?
Where I'm going with this: non-e10s usage hours of 2157 with high crash rates (41629 crashes) might cause some users to stop using this beta. Hence I would expect usage hours to fall in the next beta build, but they don't. If client counts fall, though, that would show that we lose testers between builds. Also, in both branches we see crash rates fall between the two builds, which also seems to indicate that users who experience high instability might be bailing on the beta.
Comment 47•9 years ago
(In reply to Birunthan Mohanathas [:poiru] from comment #45)
> I ran the analysis using all pings for specific build IDs without
> restricting the submission date. While looking at the results, I noticed
> that 20160215141016 and 20160211221018 happened to have similar usage hours
> for non-e10s, but the non-e10s crash rate seems to be around 50% higher for
> 20160215141016.
Yes, that's expected; we have a number of crashes we have been tracking for this and some other builds. What counts for the experiment is the difference between e10s and non-e10s and whether, in sum, we crash more or less with e10s. It's my (and other people's) job to figure out which issues to fix in general to bring the crash levels down, and we have had some issues on 45 beta that we had to get fixed (hopefully we have now, but that doesn't affect the experiment any more).
Comment 48•9 years ago
20160215141016 b6 is the current build (b9.2 just out as I post) and has been running longer.
Your problem is timing in the system:
Telemetry is registering crash pings quicker than main.
The oddity to me is that the 20160211221018 usage hours seem too close for non-e10s and e10s.
Just now, from crash-stats:
non-e10s includes accessibility (~10% extra)
20160215141016 b6
non-e10s e10s
chrome 13,996 4,890
content 0 11,339
plugin 1,306 514
20160211221018 b5
chrome 8,386 3,076
content 0 7,238
plugin 746 293
Comment 49•9 years ago
(In reply to Jim Mathies [:jimm] from comment #46)
> Do we know how many clients make up the usage hour buckets for each build /
> experiment branch?
253k for 20160215141016 and 238k for 20160211221018 in the control-no-addons branch.
Comment 50•9 years ago
(In reply to Jonathan Howard from comment #48)
> Telemetry is registering crash pings quicker than main.
That doesn't sound good; it makes it hard to compare or relate numbers for the two pings in any recent days. :(
Is there any way to deal with that?
Comment 51•9 years ago
(In reply to Jonathan Howard from comment #48)
> Telemetry is registering crash pings quicker than main.
I just filed bug 1251623 on that.
Updated•9 years ago
Assignee: birunthan → nobody
Status: ASSIGNED → NEW
Comment 52•9 years ago
https://github.com/vitillo/e10s_analyses
Comparing beta45-withoutaddons to beta46-noapz:
Most numbers at a glance have the same ratio (including session count), but e10s usage hours are significantly lower in the 46 sample.
SIMPLE_MEASURES_UPTIME on https://telemetry.mozilla.org also highlights this, strongly at the start.
No idea why.
Comment 53•8 years ago
Chris, does this bug still seem relevant?
Or should we just close it?
Flags: needinfo?(chutten)
Comment 54•8 years ago
Yup, no longer relevant. E10s is shipping, so comparing it to !e10s is not so useful.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(chutten)
Resolution: --- → INVALID