Considerable drop in number of crash reports getting indexed
Categories
(Toolkit :: Crash Reporting, defect, P1)
Tracking
Root Cause: Coding: Other

| Release | Tracking | Status |
|---|---|---|
| firefox-esr68 | --- | unaffected |
| firefox71 | --- | unaffected |
| firefox72 | --- | unaffected |
| firefox73 | blocking | verified |
People
(Reporter: philipp, Assigned: gsvelto)
References
(Regression)
Details
(Keywords: regression)
Attachments
(1 obsolete file)
After 73.0a1 build 20191202220401 there's a considerable drop in the number of crash reports we're receiving on crash-stats.mozilla.com. Bug 1420363 was an obvious change related to crash handling in this build, so I'll assume it is the regressor.
This is most easily spotted by looking at the crash graph for a long-running top crash signature on the nightly channel:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=IPCError-browser%20%7C%20ShutDownKill&date=%3E%3D2019-06-11#graphs
Another way to notice this is to look at the following super search query, sorted ascending by build ID. Before the change we normally had 1500-2000 crashes reported per build; afterwards it's more in the area of 300-400 crashes (20191203094830 is an outlier where a single install was causing 1000 reports):
https://crash-stats.mozilla.com/search/?release_channel=nightly&build_id=%3E%3D20191127215655&build_id=%3C20191206214833&platform=Windows&platform=Mac%20OS%20X&date=%3E%3D2019-11-11T19%3A39%3A00.000Z&date=%3C2019-12-11T19%3A39%3A00.000Z&_facets=build_id#facet-build_id
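The super search query above can be rebuilt programmatically, which makes it easier to shift the build ID or date window when checking other builds. The following is a minimal sketch using only Python's standard library. The parameter names and values are copied from the URL above, while the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://crash-stats.mozilla.com/search/"


def build_search_url(min_build, max_build, start, end, platforms):
    """Rebuild the crash-stats super search URL used in this report.

    Repeated keys (build_id, platform, date) are passed as a list of
    tuples so that urlencode emits each occurrence separately.
    """
    params = [("release_channel", "nightly"),
              ("build_id", f">={min_build}"),
              ("build_id", f"<{max_build}")]
    params += [("platform", p) for p in platforms]
    params += [("date", f">={start}"),
               ("date", f"<{end}"),
               ("_facets", "build_id")]
    # quote_via=quote encodes spaces as %20 (as in the original URL)
    # instead of the "+" produced by the default quote_plus.
    return BASE_URL + "?" + urlencode(params, quote_via=quote) + "#facet-build_id"


url = build_search_url("20191127215655", "20191206214833",
                       "2019-11-11T19:39:00.000Z", "2019-12-11T19:39:00.000Z",
                       ["Windows", "Mac OS X"])
```

With these arguments the function reproduces the URL quoted above exactly. Changing `min_build` and `max_build` moves the window to other nightlies.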
Comment 1•5 years ago (Assignee)
This is bad. There are two possible explanations for this: either we're doing something wrong in the exception handler that triggers recursive exceptions, so the crash report is never written at all, or we're emitting invalid JSON when we write the .extra file. Either way we need to back this out ASAP, and that requires a dedicated patch because some things have changed in the meantime. I'll prepare a patch.
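The second hypothesis, invalid JSON in the .extra file, can be tested against a saved report. The following is a rough sketch, not code from the crash reporter. It assumes, hypothetically, that a well-formed .extra file is one flat JSON object whose values are all strings, and the name `check_extra_contents` is made up for illustration.

```python
import json


def check_extra_contents(text):
    """Return a list of problems found in the contents of a .extra file.

    Assumption (hypothetical): a well-formed file is a single flat JSON
    object whose values are all strings.
    """
    try:
        data = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"invalid JSON: {err.msg} at line {err.lineno}, column {err.colno}"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]
    # Flag any annotation whose value is not a plain string
    return [f"{key}: expected a string, got {type(value).__name__}"
            for key, value in data.items()
            if not isinstance(value, str)]
```

An empty list means the file parsed and matched the assumed shape. Anything else points at which of the two failure modes to investigate.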
Comment 2•5 years ago (Assignee)
Comment 3•5 years ago (Assignee)
Try run of the backout: https://treeherder.mozilla.org/#/jobs?repo=try&revision=75482e1b304217bf6ba83bb97c7990ac8206c609
I'll try to land this tomorrow but I don't want to rush it either because what I really don't want is even more breakage.
Comment 4•5 years ago (Reporter)
From the discussion in the #stability Slack channel, the current theory is that the reports are still getting submitted properly and end up on Socorro, but there may be some indexing problems due to the format change.
Comment 5•5 years ago
I think this is caused by bug #1603236 and is a bug in Socorro. I think Socorro is getting the crash reports, but is throwing an error when indexing the processed crashes for crash reports that were sent in JSON. I think once I fix bug #1603236, I can reprocess those crash reports and the dip will go away.
Comment 6•5 years ago (Assignee)
As per the discussion on bug 1603236, I'll just revert the change to the `ModuleSignatureInfo` field so that it's a string again. With Will's fixes applied on Socorro's side, we should recover all the crashes that we missed in the past two weeks.
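Making the field a string again amounts to serializing the nested value before it goes into the payload, so that any consumer indexing the field as text sees the same type as before the format change. The following is a Python sketch of that idea only. The actual change belongs in Firefox's crash reporter, and both the helper name `stringify_field` and the example value shape are hypothetical.

```python
import json


def stringify_field(annotations, key="ModuleSignatureInfo"):
    """Return a copy of annotations with annotations[key] as a JSON string.

    Nested values (objects, arrays) are serialized with json.dumps;
    values that are already strings, or a missing key, are left untouched.
    """
    fixed = dict(annotations)
    value = fixed.get(key)
    if value is not None and not isinstance(value, str):
        # Compact separators and sorted keys give a stable string form
        fixed[key] = json.dumps(value, separators=(",", ":"), sort_keys=True)
    return fixed


example = stringify_field({"ModuleSignatureInfo": {"Vendor": ["a.dll"]},
                           "ProductName": "Firefox"})
```

Here `example["ModuleSignatureInfo"]` becomes the string `'{"Vendor":["a.dll"]}'`, while `ProductName` passes through unchanged.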
Comment 7•5 years ago
I deployed the fix for bug #1603236 about 20 minutes ago and the problem is gone now. We don't need to change `ModuleSignatureInfo`. I've got something to go reprocess all the reports we got over the last couple of weeks that didn't get into Elasticsearch. I'll let you know when that's done.
Comment 8•5 years ago (Assignee)
Alright, thanks!
Comment 9•5 years ago
It took a while to get a list of crash IDs for crash reports that were processed but didn't make it into Elasticsearch. I reprocessed about 29k crash reports in the last couple of hours. The ShutDownKill graph looks fine now.
Comment 11•5 years ago
How do I recognize a Fenix 3.0 crash report? Is the Fenix version number in the annotations somewhere?
Comment 12•5 years ago (Assignee)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #10)
> I think we're good here now?

Yeah, everything's back to normal.
Comment 13•5 years ago (Assignee)
(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #11)
> How do I recognize a Fenix 3.0 crash report? Is the Fenix version number in the annotations somewhere?

I think there is no such thing at the moment; you should file a bug to add it.
Comment 14•5 years ago
Please specify a root cause for this bug. See :tmaity for more information.
Comment 15•5 years ago (Assignee)
I'd say that the most appropriate root cause here is "Coding: Compatibility Issue", but it's not available from the drop-down menu, so I picked "Coding: Other" instead. The patch in bug 1420363 introduced a small change in the format of the payload we send to Socorro, which caused Socorro to start discarding the new reports as malformed.