Closed Bug 1222890 Opened 9 years ago Closed 8 years ago

Compare e10s vs non-e10s crash rates from Telemetry crash pings

Categories

(Toolkit :: Telemetry, defect, P1)

Tracking

()

RESOLVED INVALID
Tracking Status
e10s + ---

People

(Reporter: vladan, Unassigned)

References

(Blocks 1 open bug)

Details

We need to track e10s vs non-e10s crash rates on an ongoing basis from collected Telemetry crash pings. We should also be able to look at e10s vs non-e10s crash rates during A/B experiments (e.g. bug 1193089)
Birunthan: for now, can you post your findings from Aurora? And also, please re-run your analysis on the Telemetry A/B experiment from bug 1193089
Flags: needinfo?(birunthan)
(In reply to Vladan Djeric (:vladan) -- please needinfo! from comment #0)
> We need to track e10s vs non-e10s crash rates on an ongoing basis from
> collected Telemetry crash pings. We should also be able to look at e10s vs
> non-e10s crash rates during A/B experiments (e.g. bug 1193089)
I presume this should be done through a scheduled analysis job? If so, how frequently should I make it run and what timeframe should the analysis look at (e.g. last week)?
Flags: needinfo?(birunthan) → needinfo?(vladan.bugzilla)
(In reply to Birunthan Mohanathas [:poiru] from comment #2)
> I presume this should be done through a scheduled analysis job? If so, how
> frequently should I make it run and what timeframe should the analysis look
> at (e.g. last week)?
Let's copy crash-stats.mozilla.com. So it would be a daily job that plots daily crash statistics. It could run on Nightly, Aurora and Beta to start. We should also run the stability comparison after each A/B experiment, using only A/B data. The experiment stability analysis will be ad hoc.
Flags: needinfo?(vladan.bugzilla)
Benjamin: Does Telemetry report content-process crashes? I know we submit crash reports for content-process crashes so Telemetry should generate crash pings for these too, right?
Birunthan: Do Telemetry pings have information about whether it was a parent or child e10s process that crashed? We'll want to include this information in the analysis as well.
Flags: needinfo?(benjamin)
Birunthan: Telemetry reports content-process crashes via SUBPROCESS_ABNORMAL_ABORT, SUBPROCESS_CRASHES_WITH_DUMP, and crash-pings
http://mxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/docs/crashes.rst#23
Can you compare:
1) e10s parent-process crash ping rates vs e10s aborted-session ping rates
2) repeat #1 for single-process only
3) e10s child-process crash ping rates vs SUBPROCESS_CRASHES_WITH_DUMP rates vs SUBPROCESS_ABNORMAL_ABORT rates
We'll need all of the above compared vs crash-stats rates, but let's start with this. I can show you how to do it in crash-stats
Flags: needinfo?(benjamin)
I'd like to suggest that you should (if possible) also produce the crashes chart excluding accessibility. I ran a few searches last week on crash-stats that seemed to highlight accessibility as a major cause of extra e10s crashes.
Depends on: 1224364
There is not a separate crash ping for content crashes, but they are recorded in the following histograms:
SUBPROCESS_ABNORMAL_ABORT
SUBPROCESS_CRASHES_WITH_DUMP
PROCESS_CRASH_SUBMIT_ATTEMPT
PROCESS_CRASH_SUBMIT_SUCCESS
Vladan, the PR was merged so you can now use this link: http://nbviewer.ipython.org/github/vitillo/e10s_analyses/blob/master/aurora/e10s_crash_rate.ipynb I'll work on another PR to address comment 6.
Status: NEW → ASSIGNED
The e10s crash rate in the Beta experiment is more than twice that of non-e10s:
https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate.ipynb
Even when only considering profiles without extensions, the e10s crash rate is improved but still twice as bad as single-process:
https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate_without_extensions.ipynb
Unsurprisingly, the profiles with extensions have the worst e10s crash rate:
https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate_with_extensions.ipynb
Flags: needinfo?(jmathies)
Btw, I don't think bug 1227312 affects SUBPROCESS_ABNORMAL_ABORT, I think it only affects SUBPROCESS_CRASHES_WITH_DUMP
(In reply to Vladan Djeric (:vladan) -- please needinfo from comment #11)
> Btw, I don't think bug 1227312 affects SUBPROCESS_ABNORMAL_ABORT, I think it
> only affects SUBPROCESS_CRASHES_WITH_DUMP
Yes that's correct.
(In reply to Vladan Djeric (:vladan) -- please needinfo from comment #10)
> The e10s crash rate in the Beta experiment is more than twice that of
> non-e10s:
> https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate.ipynb
Ok, this we can ignore for now. Higher crash / hang rates with addons installed is unfortunate but not surprising.
> Even when only considering profiles without extensions, the e10s crash rate
> is improved but still twice as bad as single-process:
> https://github.com/poiru/e10s_analyses/blob/beta/beta/e10s_crash_rate_without_extensions.ipynb
Are we sure we're not double counting here? Where does 'crashes_per_day' come from in e10s profiles? Is there any chance it includes SUBPROCESS_ABNORMAL_ABORT numbers?
Flags: needinfo?(jmathies)
Blocks: 1234647
I'm afraid (more relieved) the script is wrong. Viewing on 28th vs crash stats:
non-e10s browser   2413 v 1087   221%
e10s browser       1340 v 495    271%
e10s content       3797 v 783    485%
I set out to find why content is so much higher. Now I see it is browser that has artificially been made lower. Two filters:
is_sensible_creation_date (not sure of impact, probably minor)
is_after_first_session_timestamp (high number get removed)
The filters don't get applied to content.
"Make sure it's not the first session ping as the experiment branch will not be enforced until the next restart"
No longer the case in beta experiment as branches are created for first session.
Just noticed also that the graph should be generated by dividing by installs_per_day and not sessions_per_day, installs_per_day being the list of the number of unique clientIds in each day's sessions.
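A minimal sketch of the normalization suggested above: daily crash counts divided by the number of unique clientIds seen in that day's sessions. The `(day, client_id)` tuple shape and the function name are assumptions made for illustration; the actual analysis runs over Telemetry pings in Spark.

```python
from collections import defaultdict


def crashes_per_client_per_day(crash_pings, session_pings):
    """Return {day: crash count / unique clients} for each day with sessions.

    Both arguments are iterables of (day, client_id) tuples. This record
    shape is a hypothetical simplification, not the real ping schema.
    """
    crashes = defaultdict(int)
    clients = defaultdict(set)
    for day, _client_id in crash_pings:
        crashes[day] += 1
    for day, client_id in session_pings:
        # A set keeps each client once, however many sessions it had.
        clients[day].add(client_id)
    return {day: crashes.get(day, 0) / len(ids) for day, ids in clients.items()}
```

Dividing by unique clients rather than sessions keeps a client that restarts many times in a day from shrinking the denominator-adjusted rate.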
I reran the analysis using clients per day instead of sessions per day: https://gist.github.com/poiru/3b7495be8adcf77f020f The numbers seem quite high. After running this for a different set of dates, I noticed that a particular client sent 1300 crash pings over 4 days. How should we handle outliers like this?
Flags: needinfo?(vladan.bugzilla)
(In reply to Birunthan Mohanathas [:poiru] from comment #15)
> I reran the analysis using clients per day instead of sessions per day:
> https://gist.github.com/poiru/3b7495be8adcf77f020f
>
> The numbers seem quite high. After running this for a different set of
> dates, I noticed that a particular client sent 1300 crash pings over 4 days.
> How should we handle outliers like this?
Do you know if this user had addons? Serious outliers like this should probably be investigated. Can we connect this user to reports on crashstats?
(In reply to Birunthan Mohanathas [:poiru] from comment #15)
> I reran the analysis using clients per day instead of sessions per day:
> https://gist.github.com/poiru/3b7495be8adcf77f020f
These numbers look much better than the first report. We were at a regression of around 140%, this new report brings the regression down to 55%.
Would it be worthwhile running the numbers using the main ping (aborted-session + shutdown) and see how these compare to this run that is using crash & saved-session?
Here is a rerun for a different timeframe with a particular client (1000+ non-e10s crashes) excluded: https://gist.github.com/poiru/dbcd6cff0bd7d862fb12 I also included a sorted list of crashes per client.
The 4th is not a good day to include: end of experiment. A combined list for e10s might be useful, maybe as a tuple. The length of the list represents the number of clients that crashed at least once.
(In reply to Birunthan Mohanathas [:poiru] from comment #19)
> Here is a rerun for a different timeframe with a particular client (1000+
> non-e10s crashes) excluded:
> https://gist.github.com/poiru/dbcd6cff0bd7d862fb12
>
> I also included a sorted list of crashes per client.
Next time you run this, can you add a comparison of chrome side crashes per client vs. non-e10s crashes per client? It's frustrating to me that browser crashes haven't improved much if at all with e10s.
Added bug 1241106 to cover another theory of reason for discrepancy.
I'm concerned that we're talking about various crash numbers without having an agreed definition of which ones we should be using. My proposal at https://telemetry.mozilla.org/new-pipeline/crash-summary.html was that we should be using the following two metrics as our official crash rates:
* crash pings/subsessionlength (reported as crashes per 1000 usage-hours)
* SUBPROCESS_CRASHES_WITH_DUMP["content"] / subsessionlength (also reported as crashes per 1000 usage-hours)
Together these are the crash rate for a build. I assert that unless there is a clear reason not to use these, this should be the generally-agreed metric for crash rates. I'd be willing to consider using activeTicks instead of subsessionlength if there's evidence that produces better results.
Furthermore, I have the following concern about the analysis at https://gist.github.com/poiru/3b7495be8adcf77f020f . Has anyone else reviewed this analysis yet?
* what's the purpose of is_sensible_creation_date? It's perfectly normal for a ping to have a different submission date than creation date, especially for pings which are created at shutdown (the final subsession). These aren't submitted until the next startup, which can easily be the next day, or several days later if there's a weekend. I'm worried that you're throwing away essential data.
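A rough sketch of the two proposed rates (crash pings and content-process crashes, each per 1000 usage-hours). It assumes every main ping has already been reduced to a dict with the hypothetical keys `subsession_length` (in seconds) and `content_crashes` (the "content" count from SUBPROCESS_CRASHES_WITH_DUMP); neither key name is taken from the real ping schema.

```python
def per_1000_usage_hours(count, usage_seconds):
    """Scale a raw event count to events per 1000 hours of usage."""
    if usage_seconds <= 0:
        raise ValueError("usage_seconds must be positive")
    return count * 1000.0 * 3600.0 / usage_seconds


def crash_rates(crash_ping_count, main_pings):
    """Return the main and content crash rates for one set of pings.

    `main_pings` is an iterable of dicts using the hypothetical keys
    described in the text above.
    """
    main_pings = list(main_pings)  # iterated twice below
    usage_seconds = sum(p["subsession_length"] for p in main_pings)
    content_crashes = sum(p.get("content_crashes", 0) for p in main_pings)
    return {
        "main": per_1000_usage_hours(crash_ping_count, usage_seconds),
        "content": per_1000_usage_hours(content_crashes, usage_seconds),
    }
```

Both rates share one denominator, so they can be added to get a combined figure for a build.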
Being able to link two separate systems (with full accountability of raw discrepancy) is a good indication what is being analysed is correct. Having extra information (such as top crash numbers and users experiencing crashes) can be good in identifying problems. I agree a single definition is good for long term comparison. Personally will stay far away from determining definition, argument over what to use isn't greatly helpful. My concern with crash pings and (saved-session ping) SUBPROCESS_CRASHES_WITH_DUMP [current 44 using reasonable approximation SUBPROCESS_ABNORMAL_ABORT due to bug] hence comment 18 is, they are different systems. I know far too little to have confidence they combine together. Similar example with crash-stats; The telemetry crash submit success is indicating a significant difference. Knowing so allows discrepancy to be adjusted for.
(In reply to Birunthan Mohanathas [:poiru] from comment #15)
> I reran the analysis using clients per day instead of sessions per day:
> https://gist.github.com/poiru/3b7495be8adcf77f020f
>
> The numbers seem quite high. After running this for a different set of
> dates, I noticed that a particular client sent 1300 crash pings over 4 days.
> How should we handle outliers like this?
- Yes, I would exclude crazy outliers (e.g. > 10 crashes per day) from any average, they are worth investigating separately though
- I don't see the "crashes per client" plot at your link? https://gist.github.com/poiru/3b7495be8adcf77f020f
- Benjamin, Jim & Jonathan can provide better guidance on crash rate metrics
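One way the outlier rule above could look, as a sketch only: clients averaging more than 10 crashes per day are set aside for separate investigation and left out of the average. The input shape (`{client_id: total crashes}`) and the function name are assumptions for illustration.

```python
def split_outliers(crashes_by_client, days, max_per_day=10):
    """Split clients into (kept, outliers) by average crashes per day.

    `crashes_by_client` maps a client id to its total crash count over
    `days` days. A client is an outlier only when its daily average is
    strictly greater than `max_per_day`.
    """
    kept, outliers = {}, {}
    for client_id, total in crashes_by_client.items():
        target = outliers if total / days > max_per_day else kept
        target[client_id] = total
    return kept, outliers


def mean_crashes(crashes_by_client):
    """Average crash count per client, or 0.0 for an empty mapping."""
    if not crashes_by_client:
        return 0.0
    return sum(crashes_by_client.values()) / len(crashes_by_client)
```

Returning the outliers instead of dropping them keeps them available for follow-up, for example matching them against crash-stats reports.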
Flags: needinfo?(vladan.bugzilla)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #23)
> I'm concerned that we're talking about various crash numbers without having
> an agreed definition of which ones we should be using. My proposal at
> https://telemetry.mozilla.org/new-pipeline/crash-summary.html was that we
> should be using the following two metrics as our official crash rates:
>
> * crash pings/subsessionlength (reported as crashes per 1000 usage-hours)
> * SUBPROCESS_CRASHES_WITH_DUMP["content"] / subsessionlength (also reported
> as crashes per 1000 usage-hours)
Does this look reasonable: https://gist.github.com/poiru/4cfefe157b9ffa353389
I limited this to just a single build ID and a few days for now. I can include other builds and more days if the analysis looks right. Also note that the analysis doesn't yet attempt to filter out any disproportionately crashy clients.
Flags: needinfo?(benjamin)
This is currently using SUBPROCESS_ABNORMAL_ABORT because of the double-counting bug with SUBPROCESS_CRASHES_WITH_DUMP? That sucks because we know that ABNORMAL_ABORT overcounts in some normal shutdown cases. We should at least add a comment to that effect. But overall I think this is the right approach. Clearly what we need to do is repeat this with the 45 experiment and CRASHES_WITH_DUMP and then focus our efforts on either the top or the "new" content-process crashes, since overall chrome-process stability has improved significantly.
Flags: needinfo?(benjamin)
(In reply to Birunthan Mohanathas [:poiru] from comment #26)
> (In reply to Benjamin Smedberg [:bsmedberg] from comment #23)
> > I'm concerned that we're talking about various crash numbers without having
> > an agreed definition of which ones we should be using. My proposal at
> > https://telemetry.mozilla.org/new-pipeline/crash-summary.html was that we
> > should be using the following two metrics as our official crash rates:
> >
> > * crash pings/subsessionlength (reported as crashes per 1000 usage-hours)
> > * SUBPROCESS_CRASHES_WITH_DUMP["content"] / subsessionlength (also reported
> > as crashes per 1000 usage-hours)
>
> Does this look reasonable: https://gist.github.com/poiru/4cfefe157b9ffa353389
SUBPROCESS_ABNORMAL_ABORT would include counts for plugins, content, and gmp. When we calculate non-e10s crash numbers, what do we get here?
> crash_pings = get_pings(sc, doc_type="crash", **PING_OPTIONS).filter(is_in_e10s_experiment)
(In reply to Jim Mathies [:jimm] from comment #28)
> SUBPROCESS_ABNORMAL_ABORT would include counts for plugins, content, and
> gmp.
Looking at this more, I see we're filtering the SUBPROCESS_ABNORMAL_ABORT set on "content" type crashes here, so I guess we're getting the right subset. I'm still curious what the non-e10s code does though, that's harder to follow.
Here is the analysis for the beta 45 pings so far: https://gist.github.com/poiru/d6c98741f0f7cc81b1f1 It seems like non-e10s and e10s-parent are similar to the beta 44 numbers, but e10s-content has regressed from 27 to 36 crashes per 1000 usage hours.
(In reply to Jim Mathies [:jimm] from comment #29)
> (In reply to Jim Mathies [:jimm] from comment #28)
> > SUBPROCESS_ABNORMAL_ABORT would include counts for plugins, content, and
> > gmp.
>
> Looking at this more, I see we're filtering the SUBPROCESS_ABNORMAL_ABORT
> set on "content" type crashes here, so I guess we're getting the right sub
> set. I'm still curious what the non-e10s code does though, that's harder to
> follow.
I'm not sure what you mean. We don't look at SUBPROCESS_ABNORMAL_ABORT for non-e10s.
I reran the analysis looking at SUBPROCESS_CRASHES_WITH_DUMP instead of SUBPROCESS_ABNORMAL_ABORT since I've been told that we can get an abort on shutdown sometimes. Sadly, the numbers appear to be no better: https://gist.github.com/chutten/e80d7f2f1a52f07e642b
Blocks: e10s-perf
Priority: -- → P1
A newly added top crash appears to occur with add-ons only. With any luck, running the analysis without add-ons won't show a regression vs 44.
Just spotted ~10% extra users in both 45-experiments control groups. Accessibility hasn't been removed. It seems unlikely bug 1241106 will get dealt with soon, but I'm wondering: should telemetry be added for counting the ShutDownKill KillHards? https://dxr.mozilla.org/mozilla-central/source/dom/ipc/ContentParent.cpp#3641
Now having put a tiny bit more thought into it, that one spot:
- Does not cover all possible shutdown content crashes.
- There can be content shutdown crashes without full browser shutdown (particularly when moving to having multiple content processes.)
Still might not be a bad thing to record, though more could also be recorded.
The crash analyses for the last and current experiment are available at:
https://github.com/vitillo/e10s_analyses/tree/master/beta45-withaddons
https://github.com/vitillo/e10s_analyses/tree/master/beta45-withoutaddons
I reran the withoutaddons experiment using SUBPROCESS_CRASHES_WITH_DUMP instead of SUBPROCESS_ABNORMAL_ABORT.
build ID         non-e10s   e10s-parent   e10s-content
20160211221018   24.325     12.396        19.977 (SUBPROCESS_ABNORMAL_ABORT)
20160211221018   24.325     12.396        21.630 (SUBPROCESS_CRASHES_WITH_DUMP)
The numbers are still very different from those reported by http://bsmedberg.github.io/telemetry-dashboard/new-pipeline/crash-summary.html (select beta channel). I'm looking into it.
I was reviewing the notebook and found it pretty confusing, so I worked up an alternate way of doing it. See https://gist.github.com/bsmedberg/b28263ba0df97ddf0106
NOTE: this is done on a very small sample, so it needs to be re-run on a larger cluster with the full dataset.
Here are some notes of what I changed:
* all buildIDs: I didn't select a particular buildid, because I wasn't sure why that would be valuable.
* used get_pings_properties to normalize the data a bit earlier, make it easier to write filters
* Did a groupby just to check that the experiment branch matched up with the environment.settings.e10sEnabled setting. They do match.
* Used accumulators to collect the counts of e10s and non-e10s separately.
The numbers here are close to regular beta, but have some surprises:
* There are a rather significant number of content process crashes showing up in the non-e10s branch with e10s disabled. This could be because we run the thumbnail service in a content process, but it's still pretty weird. It makes me not trust the data and wonder if we should go back and do some basic data sanity-checking with the experiment. Please talk to Felipe and Ryan about that.
* I added some plugin crash rate checking and according to this the plugin crash rate went down too. Yay.
Depends on: 1249880
Could you take a look at the measurement FX_THUMBNAILS_BG_CAPTURE_DONE_REASON_2 and compare with the list of reasons here: http://mxr.mozilla.org/mozilla-central/source/toolkit/components/thumbnails/BackgroundPageThumbs.jsm#25
(In reply to Benjamin Smedberg [:bsmedberg] from comment #37)
> I was reviewing the notebook and found it pretty confusing, so I worked up
> an alternate way of doing it. See
> https://gist.github.com/bsmedberg/b28263ba0df97ddf0106
I think this is off by a factor of 10. `self.session_seconds.value / 360.0 / 1000` should probably use 3600 instead of 360. I ran the fixed version using the same PING_OPTIONS as my analysis:
                          non-e10s   e10s
usage hours               808        721
chrome crashes            19534      9054
content crashes           2070       15601
plugin crashes            7471       9485
main crash rate           24.18      12.55
main+content crash rate   26.75      34.18
plugin crash rate         9.25       13.15
https://gist.github.com/poiru/67383381d6895cb90347
The results are similar to the numbers I got with my analysis in comment 36.
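To illustrate the factor of 10: dividing seconds by 360 instead of 3600 inflates usage hours tenfold and therefore shrinks every rate tenfold. The sketch below assumes, based on the `/ 1000` in the quoted expression, that the "usage hours" row is expressed in thousands of hours; the function names are made up for this example and the session-seconds total is back-computed from the table, not taken from the data.

```python
def usage_khours(session_seconds, seconds_per_hour=3600.0):
    """Convert total session seconds to thousands of usage hours."""
    return session_seconds / seconds_per_hour / 1000


def rate_per_khours(crashes, khours):
    """Crashes per 1000 usage hours."""
    return crashes / khours


# Back-computed seconds matching the non-e10s "usage hours" row (808).
seconds = 808 * 1000 * 3600

correct = rate_per_khours(19534, usage_khours(seconds))
# Same counts with the mistaken divisor of 360.
buggy = rate_per_khours(19534, usage_khours(seconds, seconds_per_hour=360.0))
```

With the correct divisor the non-e10s chrome rate lands close to the "main crash rate" row; small differences elsewhere in the table are expected because the displayed usage hours are rounded.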
(In reply to :Felipe Gomes (needinfo me!) from comment #38)
> Could you take a look at the measurement
> FX_THUMBNAILS_BG_CAPTURE_DONE_REASON_2 and compare with the list of reasons
> here:
> http://mxr.mozilla.org/mozilla-central/source/toolkit/components/thumbnails/BackgroundPageThumbs.jsm#25
Done: https://gist.github.com/poiru/9af631c6c5f60d5025f0
The results (for non-e10s parent) are:
0   16230   (TEL_CAPTURE_DONE_OK)
1   41817   (TEL_CAPTURE_DONE_TIMEOUT)
2   0
3   0
4   95      (TEL_CAPTURE_DONE_CRASHED)
5   46      (TEL_CAPTURE_DONE_BAD_URI)
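For scale, a small sketch that turns the bucket counts above into the share of background thumbnail captures that ended in a crash. Only the counts and the bucket labels come from the results above; the dict layout and helper name are assumptions.

```python
# Bucket index -> count, copied from the results above.
capture_done_reasons = {0: 16230, 1: 41817, 2: 0, 3: 0, 4: 95, 5: 46}
CRASHED_BUCKET = 4  # labelled TEL_CAPTURE_DONE_CRASHED above


def bucket_share(histogram, bucket):
    """Fraction of all recorded values that fell into `bucket`."""
    total = sum(histogram.values())
    return histogram[bucket] / total if total else 0.0


crashed_share = bucket_share(capture_done_reasons, CRASHED_BUCKET)
```

The crashed share comes out well under 1% of captures, which is worth weighing against the non-e10s content crash counts reported earlier in this bug.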
Social is another feature that might use content processes without e10s. Shane said:
> I doubt the worker is creating a significant contributor to crashes unless they
> also correlate to the Chinese distribution where Weibo is pre-installed.
Could you also look at the non-e10s content crashes submissions and see if they are coming mostly from Chinese builds? I don't know exactly what to look for in the telemetry environment to determine that but let me find out
You're right about 360/3600. I made the same mistake in the dashboard at https://github.com/mozilla/telemetry-dashboard/blob/gh-pages/new-pipeline/src/crash-summary.js#L65 which I'll fix shortly. Why are you still using those restricting PING_OPTIONS? In particular you're limiting to a particular buildid, which seems unfortunate given that we're running the experiment for a longer time so that we can have high confidence in the results. That the non-e10s main crash rate varies between 11.6 and 24.18 is evidence of something wrong or at least weird. What are next steps now?
(In reply to :Felipe Gomes (needinfo me!) from comment #41)
> Could you also look at the non-e10s content crashes submissions and see if
> they are coming mostly from Chinese builds? I don't know exactly what to
> look for in the telemetry environment to determine that but let me find out
You could look first to see if their locale is zh-CN, and some more validation would be to see if they have an add-on with the id of "m.weibo.cn@services.mozilla.org" in the list of activeAddons.
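A sketch of the two checks suggested above, applied to a ping represented as a plain nested dict. The paths `environment.settings.locale` and `environment.addons.activeAddons` (a mapping keyed by add-on id) are assumptions about where these fields sit in the Telemetry environment; the function name is hypothetical.

```python
WEIBO_ADDON_ID = "m.weibo.cn@services.mozilla.org"


def chinese_distribution_signals(ping):
    """Return (locale_is_zh_cn, has_weibo_addon) for one ping dict.

    Missing sections are treated as absent rather than raising, since
    not every ping is guaranteed to carry a full environment.
    """
    env = ping.get("environment") or {}
    locale = (env.get("settings") or {}).get("locale")
    active_addons = (env.get("addons") or {}).get("activeAddons") or {}
    return locale == "zh-CN", WEIBO_ADDON_ID in active_addons
```

Keeping the two signals separate makes it possible to count pings where only the locale matches, which is the weaker of the two indicators.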
(In reply to Benjamin Smedberg [:bsmedberg] from comment #42)
> Why are you still using those restricting PING_OPTIONS? In particular you're
> limiting to a particular buildid, which seems unfortunate given that we're
> running the experiment for a longer time so that we can have high confidence
> in the results.
I did that just for a fair comparison between our analyses. I went ahead and ran your analysis on the full dataset with your original PING_OPTIONS: https://gist.github.com/poiru/363fa343dd7cbec17b1f
Looking at 4 additional days of pings, we get: https://gist.github.com/poiru/1ceec405172bcafa7305
> That the non-e10s main crash rate varies between 11.6 and 24.18 is evidence
> of something wrong or at least weird. What are next steps now?
The numbers in the two links above are pretty consistent. The numbers for the 20160211221018 build (comment 39) were higher, but perhaps that build was in fact crashier than the other builds in the experiment. I'm looking into this and will post an update within a day.
I ran the analysis using all pings for specific build IDs without restricting the submission date. While looking at the results, I noticed that 20160215141016 and 20160211221018 happened to have similar usage hours for non-e10s, but the non-e10s crash rate seems to be around 50% higher for 20160215141016.
20160215141016:
                          non-e10s   e10s
usage hours               2157       1851
chrome crashes            41629      18405
content crashes           4765       32738
plugin crashes            17688      30040
main crash rate           19.30      9.94
main+content crash rate   21.50      27.63
plugin crash rate         8.20       16.23
20160211221018:
                          non-e10s   e10s
usage hours               2179       2150
chrome crashes            25278      12337
content crashes           4046       29275
plugin crashes            14129      17321
main crash rate           11.60      5.74
main+content crash rate   13.46      19.35
plugin crash rate         6.48       8.06
I've tried to figure out why the non-e10s main crash rate in particular is different. I looked at the distribution of crashes per client and it seems consistent. However, 20160215141016 has about 20k clients that sent crash pings and 20160211221018 only has 13k clients that sent crash pings. I've tried various things like checking only clientIds that sent crash pings for both builds, but the number of crashes is still quite different (16k/11k vs 41k/25k). I'll spend more time on this, but I'm mostly taking shots in the dark here. If someone has any idea what might be going on, please do share!
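The "clients that sent crash pings for both builds" check described above could be sketched as follows. Each input is assumed to be a list with one clientId per crash ping for a given build; that shape and the function name are hypothetical, and this is not the code used in the analysis.

```python
from collections import Counter


def crashes_for_shared_clients(crash_clients_a, crash_clients_b):
    """Crash totals per build, restricted to clients present in both.

    Each argument holds one client id per crash ping for a build.
    Returns (crashes_in_a, crashes_in_b, number_of_shared_clients).
    """
    counts_a = Counter(crash_clients_a)
    counts_b = Counter(crash_clients_b)
    shared = counts_a.keys() & counts_b.keys()
    return (
        sum(counts_a[c] for c in shared),
        sum(counts_b[c] for c in shared),
        len(shared),
    )
```

Holding the client population fixed this way separates "the build is crashier" from "a different set of clients is reporting".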
Do we know how many clients make up the usage hour buckets for each build / experiment branch? Where I'm going with this - non-e10s usage hours of 2157 with high crash rates (41629 crashes) might cause some users to stop using this beta. Hence I would expect usage hours to fall in the next beta build, but they don't. If client counts fall though, that would show that we lose testers between builds. Also, in both branches we see crash rates fall between the two builds, which also seems to indicate users who experience high instability might be bailing on the beta.
(In reply to Birunthan Mohanathas [:poiru] from comment #45)
> I ran the analysis using all pings for specific build IDs without
> restricting the submission date. While looking at the results, I noticed
> that 20160215141016 and 20160211221018 happened to have similar usage hours
> for non-e10s, but the non-e10s crash rate seems to be around 50% higher for
> 20160215141016.
Yes, that's expected, we have a number of crashes we have been tracking, for this and some other builds. What counts for the experiment is the differences between e10s and non-e10s and if we in sum crash more or less with e10s. It's my (and other people's) job to figure out the issues to fix in general to bring the crash levels down, and we have had some issues on 45 beta that we had to get fixed (hopefully we have now, but that doesn't affect the experiment any more).
20160215141016 b6 is the current build (b9.2 just out as I post) and has been running longer. Your problem is timing in system. Telemetry is registering crash pings quicker than main. Oddity to me is 20160211221018 usage hours seems too close for non-e10s and e10s.
Just now from crash stats; non-e10s includes accessibility (~10% extra):
20160215141016 b6
          non-e10s   e10s
chrome    13,996     4,890
content   0          11,339
plugin    1,306      514
20160211221018 b5
          non-e10s   e10s
chrome    8,386      3,076
content   0          7,238
plugin    746        293
(In reply to Jim Mathies [:jimm] from comment #46)
> Do we know how many clients make up the usage hour buckets for each build /
> experiment branch?
253k for 20160215141016 and 238k for 20160211221018 in the control-no-addons branch.
(In reply to Jonathan Howard from comment #48)
> Telemetry is registering crash pings quicker than main.
That doesn't sound good, makes it hard to compare or relate numbers for the two pings in any recent days. :( Is there any way to deal with that?
(In reply to Jonathan Howard from comment #48)
> Telemetry is registering crash pings quicker than main.
I just filed bug 1251623 on that.
Assignee: birunthan → nobody
Status: ASSIGNED → NEW
https://github.com/vitillo/e10s_analyses
Comparing beta45-withoutaddons to beta46-noapz:
Most at a glance have the same ratio (including session count), but e10s usage hours is significantly lower in the 46 sample. SIMPLE_MEASURES_UPTIME using https://telemetry.mozilla.org also highlights this, strongly at start. No idea why.
Chris, does this bug still seem relevant? Or should we just close it?
Flags: needinfo?(chutten)
Yup, no longer relevant. E10s is shipping, so comparing it to !e10s is not so useful.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(chutten)
Resolution: --- → INVALID