Get some kind of regular reporting of crash ping telemetry
Categories
(Core :: Graphics: WebRender, task, P3)
Tracking
()
People
(Reporter: jrmuizel, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
We get way more reports in telemetry crash pings than we do crash reports. We should try to have some way of routinely looking at the telemetry results. This will be especially helpful for monitoring release 67.
Reporter
Comment 1•6 years ago
Here's kats' databricks workbook: https://dbc-caf9527b-e073.cloud.databricks.com/#notebook/101076/command/101077
Comment 2•6 years ago
As an FYI: that workbook uses fx-crash-sig which is (as near as I can tell) unmaintained. It requires a really old version of the experimental signature generation library. That's one of the reasons it's seeing lots of GeckoCrash signatures.
If you're going to go this route, siggen and fx-crash-sig will need updates and active maintenance.
Comment 3•5 years ago
Some new possibilities have opened up here with the new telemetry.crash dataset: https://mail.mozilla.org/pipermail/fx-data-dev/2019-October/000269.html
siggen and fx-crash-sig would need to be rewritten to use this new dataset, but getting the actual crash data should be much easier and faster than it was previously.
Comment 4•5 years ago
How do siggen and fx-crash-sig need to be rewritten?
Comment 5•5 years ago
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #4)
> How do siggen and fx-crash-sig need to be rewritten?
Perhaps "rewritten" is the wrong way of phrasing this. They would need to process the crash payload as it appears in the telemetry.crash dataset, which I think might be a bit different from how you would have fetched it via (e.g.) the python_moztelemetry API.
Comment 6•5 years ago
Ahhh--got it! If the structure of the data set that fx-crash-sig works on has changed, then I think we only need to change fx-crash-sig. It's the library that's responsible for taking a crash ping, extracting the bits that are needed for symbolication, symbolicating using Symbols, and then running the results of that through siggen for signature generation.
Comment 7•5 years ago
That needinfo reminded me, I wrote up a cookbook a few weeks ago on working with crash pings using bigquery: https://docs.telemetry.mozilla.org/cookbooks/crash_pings.html
Much of kats' databricks notebook could be reproduced in sql.tmo as a dashboard using some of the techniques described in there. Getting data on specific signatures is a slightly more complicated beast but, as mentioned above, much more tractable than previously.
Comment 8•5 years ago
I can take a look and figure out next steps here.
Comment 9•5 years ago
I started playing around with the data in STMO. It seems relatively straightforward to get the crash data and plot number of crashes broken down by buildid and/or vendorId. But I'm not sure what would be the most useful data to display. If anybody has thoughts on that please chime in.
So far I'm thinking of plotting number of crashes as well as average uptime based on buildid, for the last 3 months. Different charts for release vs beta vs nightly. And additional charts to break the numbers down by vendorId. So that would be six charts in total (two for each channel - one aggregate and one broken by vendor). But the data is still fairly noisy and it's not obvious that this will produce the desired result of "look at the graph and immediately notice we introduced a crasher bug".
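A minimal sketch of the aggregation described above: crash count and average uptime per build, with an optional breakdown by vendor. The rows and field names (`channel`, `build_id`, `vendor_id`, `uptime_s`) are made up for illustration and are not the actual column names of the crash dataset.

```python
from collections import defaultdict

# Hypothetical rows, shaped like what a crash-ping query might return.
# Field names and values are illustrative assumptions only.
rows = [
    {"channel": "nightly", "build_id": "20200301", "vendor_id": "0x10de", "uptime_s": 120},
    {"channel": "nightly", "build_id": "20200301", "vendor_id": "0x8086", "uptime_s": 60},
    {"channel": "nightly", "build_id": "20200302", "vendor_id": "0x10de", "uptime_s": 30},
    {"channel": "beta", "build_id": "20200225", "vendor_id": "0x1002", "uptime_s": 300},
]


def summarize(rows, keys):
    """Group rows by `keys`; return {key_tuple: (crash_count, mean_uptime_s)}."""
    totals = defaultdict(lambda: [0, 0.0])  # [count, summed uptime]
    for row in rows:
        bucket = totals[tuple(row[k] for k in keys)]
        bucket[0] += 1
        bucket[1] += row["uptime_s"]
    return {key: (count, total / count) for key, (count, total) in totals.items()}


# One aggregate series per channel, and one broken down by vendor.
by_build = summarize(rows, ["channel", "build_id"])
by_vendor = summarize(rows, ["channel", "build_id", "vendor_id"])

for key, (count, mean_uptime) in sorted(by_build.items()):
    print(key, count, mean_uptime)
```

Filtering `by_build` and `by_vendor` on the channel component of the key would give the two series per channel that the six charts need.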
Reporter
Comment 10•5 years ago
I'd mostly like to see a list of top signatures
Comment 11•5 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #9)
> So far I'm thinking of plotting number of crashes as well as average uptime based on buildid, for the last 3 months. Different charts for release vs beta vs nightly. And additional charts to break the numbers down by vendorId. So that would be six charts in total (two for each channel - one aggregate and one broken by vendor). But the data is still fairly noisy and it's not obvious that this will produce the desired result of "look at the graph and immediately notice we introduced a crasher bug".
Yeah, this is what missioncontrol v1 and v2 try to do (try to chart/track crashes normalized by other things over time):
v1: https://missioncontrol.telemetry.mozilla.org
v2: https://metrics.mozilla.com/~sguha/mz/missioncontrol/ex1/mc2/missioncontrol_v2.html
It's a bit of a topic on its own-- it's basically very complicated to get a good signal out of this type of normalized error rate and you need to really consider a wide variety of factors (release dates, update schedules, etc.) to be able to properly interpret what's going on. That said, I don't think we've tried to break this down by graphics chipset before -- it's possible that might yield useful results in some cases.
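To make "normalized error rate" concrete, here is a small sketch of one possible normalization, crashes per 1,000 usage hours. The choice of denominator is an assumption (the thread does not say what missioncontrol divides by), and the per-build numbers are invented.

```python
# Hypothetical per-build totals; the numbers are made up for illustration.
builds = {
    "20200301": {"crashes": 50, "usage_hours": 25_000.0},
    "20200302": {"crashes": 12, "usage_hours": 1_500.0},
    "20200303": {"crashes": 0, "usage_hours": 0.0},
}


def crash_rate(crashes, usage_hours, per=1000.0):
    """Crashes per `per` usage hours, or None when there is no usage to divide by."""
    if usage_hours <= 0:
        return None
    return crashes * per / usage_hours


rates = {
    build: crash_rate(totals["crashes"], totals["usage_hours"])
    for build, totals in builds.items()
}

for build, rate in sorted(rates.items()):
    print(build, rate)
```

The second build shows why interpretation is hard: it has far fewer crashes than the first but a higher rate, simply because a freshly shipped build has accumulated little usage.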
Comment 12•5 years ago
If all we want is a list of top signatures then I'm not sure STMO is the way to go. It's probably better to spruce up my original databricks workbook and dashboard-ize it.
Comment 13•5 years ago
Hm, looks like per https://mail.mozilla.org/pipermail/fx-data-dev/2019-November/000291.html the moztelemetry python thing isn't a thing anymore, so I guess I have to use STMO.
(In reply to William Lachance (:wlach) (use needinfo!) from comment #3)
> Some new possibilities have opened up here with the new telemetry.crash dataset: https://mail.mozilla.org/pipermail/fx-data-dev/2019-October/000269.html
Read this, and it talks about the crash stacks being available, but I don't see them anywhere in the payload record in the telemetry.crash table. Am I missing something?
Comment 14•5 years ago
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #13)
> Hm, looks like per https://mail.mozilla.org/pipermail/fx-data-dev/2019-November/000291.html the moztelemetry python thing isn't a thing anymore, so I guess I have to use STMO.
You can access BigQuery from Databricks, but we're not really encouraging use of it these days. I'd encourage you to explore what redash can do; it's pretty powerful, e.g. https://sql.telemetry.mozilla.org/dashboard/windows-10-client-distributions
It's also possible to pull data out from STMO and display it in a different way, this is e.g. what I hooked up for Mike Conley's tab spinner dashboard a few months ago:
> Read this, and it talks about the crash stacks being available, but I don't see them anywhere in the payload record in the telemetry.crash table. Am I missing something?
We need to add the stacks to the schema so they have their own BigQuery column; I just filed bug 1623626 for this. For now you should be able to find them (along with other fields that haven't yet made it into the schema) inside the additional_properties column.
One idea I had was to create a derived dataset of telemetry.crash which included crash signatures derived from this type of information. In theory this shouldn't be terribly difficult, and would enable some interesting things.
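As a rough sketch of pulling stacks out of the additional_properties column: the column holds a JSON string, so a consumer parses it and walks down to the stack traces. The nesting used below (`payload` → `stackTraces` → `crash_info` / `threads` → `frames`) and the sample values are assumptions for illustration; check a real row before relying on them.

```python
import json

# Hypothetical additional_properties value. The layout below is an assumed
# shape for illustration, not a confirmed schema.
additional_properties = json.dumps({
    "payload": {
        "stackTraces": {
            "crash_info": {"crashing_thread": 1},
            "threads": [
                {"frames": [{"ip": "0x1000", "module_index": 0}]},
                {"frames": [
                    {"ip": "0x2000", "module_index": 0},
                    {"ip": "0x2010", "module_index": 0},
                ]},
            ],
        }
    }
})


def crashing_thread_frames(raw):
    """Return the frames of the crashing thread, or [] if no usable stack is present."""
    if not raw:
        return []
    data = json.loads(raw)
    traces = (data.get("payload") or {}).get("stackTraces") or {}
    threads = traces.get("threads") or []
    index = (traces.get("crash_info") or {}).get("crashing_thread")
    if not isinstance(index, int) or not 0 <= index < len(threads):
        return []
    return threads[index].get("frames") or []


frames = crashing_thread_frames(additional_properties)
print([frame["ip"] for frame in frames])
```

The function returns an empty list rather than raising when fields are missing, since not every crash ping can be expected to carry a stack.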
Comment 15•5 years ago
Thanks, that helps point me in the right direction. I didn't realize there was more stuff in the additional_properties column. Having a derived dataset with crash signatures would certainly simplify what I'm trying to do!
Comment 16•5 years ago
A note on stacks in crash pings: the ones in Windows crash pings should be every bit as good as the ones on crash-stats. On macOS and Linux not so much.
Comment 17•5 years ago
Just saving some WIP links here so I don't lose them:
https://sql.telemetry.mozilla.org/queries/69281/source
https://iodide.telemetry.mozilla.org/notebooks/470/
Comment 18•4 years ago
Quick update on the current status here.
This is waiting on bug 1631563 which in turn is waiting on bug 1636210 for faster symbolication. Once the new symbolication server is deployed, bug 1631563 tracks the work to automatically symbolicate incoming crash pings and create a derived dataset in STMO with that information. I submitted a PoC (with much help from :wlach and :willkg) which demonstrates how it could be done. Once all that is done and the derived dataset is in place, this bug tracks doing whatever gfx-specific thing we want using that data. It should really just be a matter of doing a SQL query against the derived dataset to group by crash signature and sort by descending count, and that will give us the list of top crashing signatures.
So until the dependencies are resolved there's nothing to do here. As it's unlikely to get done in the next week, I'll unassign this bug so somebody else can pick it up when the time comes.
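The top-signatures query described in this comment might look roughly like the sketch below. It runs against an in-memory SQLite table so it is self-contained; the table name `crash_signatures`, its columns, and the signature strings are all hypothetical stand-ins for the derived dataset, which did not exist at the time of writing.

```python
import sqlite3

# In-memory stand-in for the (hypothetical) derived dataset of symbolicated
# crash pings. Table name, columns, and signatures are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE crash_signatures (signature TEXT, channel TEXT)")
conn.executemany(
    "INSERT INTO crash_signatures VALUES (?, ?)",
    [
        ("example::alloc_failure", "release"),
        ("example::alloc_failure", "release"),
        ("example::alloc_failure", "release"),
        ("example::render_frame", "release"),
        ("example::render_frame", "release"),
        ("example::upload_texture", "release"),
        ("example::render_frame", "nightly"),
    ],
)

# Group by signature and sort by descending count; the secondary sort on
# signature keeps ties in a stable order.
TOP_SIGNATURES = """
    SELECT signature, COUNT(*) AS crashes
    FROM crash_signatures
    WHERE channel = ?
    GROUP BY signature
    ORDER BY crashes DESC, signature
    LIMIT ?
"""

top = conn.execute(TOP_SIGNATURES, ("release", 10)).fetchall()
for signature, crashes in top:
    print(crashes, signature)
```

The same GROUP BY / ORDER BY shape should carry over to a BigQuery query in STMO, with the parameter syntax adjusted.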
Reporter
Comment 19•3 years ago
We have this here: https://mathies.com/mozilla/crashes/