Closed Bug 1603201 Opened 5 years ago Closed 5 years ago

Considerable drop in number of crash reports getting indexed

Categories

(Toolkit :: Crash Reporting, defect, P1)

73 Branch

VERIFIED FIXED
Root Cause Coding: Other
Tracking Status
firefox-esr68 --- unaffected
firefox71 --- unaffected
firefox72 --- unaffected
firefox73 blocking verified

People

(Reporter: philipp, Assigned: gsvelto)

References

(Regression)

Details

(Keywords: regression)

Attachments

(1 obsolete file)

After 73.0a1 build 20191202220401 there's a considerable drop in the number of crash reports we're receiving on crash-stats.mozilla.com. Bug 1420363 was an obvious change related to crash handling in this build, so I'll assume it was the regressor.

This is most easily spotted by looking at the crash graph for a long-running top crash signature on the nightly channel:
https://crash-stats.mozilla.com/signature/?product=Firefox&release_channel=nightly&signature=IPCError-browser%20%7C%20ShutDownKill&date=%3E%3D2019-06-11#graphs

Another way to notice this is to look at the following Super Search query, sorted ascending by build ID. Before the change we normally had 1500-2000 crashes reported per build; afterwards it's more in the area of 300-400 (20191203094830 is an outlier where a single install was causing 1000 reports):
https://crash-stats.mozilla.com/search/?release_channel=nightly&build_id=%3E%3D20191127215655&build_id=%3C20191206214833&platform=Windows&platform=Mac%20OS%20X&date=%3E%3D2019-11-11T19%3A39%3A00.000Z&date=%3C2019-12-11T19%3A39%3A00.000Z&_facets=build_id#facet-build_id

Flags: needinfo?(gsvelto)
Priority: -- → P1

This is bad. There are two possible explanations for this: either we're doing something wrong in the exception handler and it's triggering recursive exceptions, so the crash report isn't written at all, or we're emitting invalid JSON when we write the .extra file. Either way we need to back this out ASAP, and it requires a dedicated patch because some things have changed in the meantime. I'll prepare a patch.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Flags: needinfo?(gsvelto)
Attached file Bug 1603201 - Back out bug 1420363 (obsolete) (deleted) —

Try run of the backout: https://treeherder.mozilla.org/#/jobs?repo=try&revision=75482e1b304217bf6ba83bb97c7990ac8206c609

I'll try to land this tomorrow but I don't want to rush it either because what I really don't want is even more breakage.

From the discussion in the #stability Slack channel, the current theory is that the reports are still getting submitted properly and end up on Socorro, but there may be some indexing problems due to the format change.

Summary: Considerable drop in number of crash reports getting submitted → Considerable drop in number of crash reports getting indexed

I think this is caused by bug #1603236 and is a bug in Socorro. I think Socorro is getting the crash reports, but is throwing an error when indexing the processed crashes for crash reports that were sent in JSON. I think once I fix bug #1603236, I can reprocess those crash reports and the dip will go away.

Attachment #9115283 - Attachment is obsolete: true

As per the discussion on bug 1603236 I'll just revert the change to the ModuleSignatureInfo field so that it's a string again and with Will's fixes applied to Socorro's side we should recover all the crashes that we missed in the past two weeks.
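
To illustrate the kind of type change discussed in this comment, here is a minimal sketch of a consumer that tolerates the field arriving either as a JSON-encoded string or as a nested object. It is not Socorro code: the function names `normalize_module_signature_info` and `to_indexable_string` are hypothetical, and the assumption that the decoded value is a JSON object mapping names to lists is for illustration only.

```python
import json


def normalize_module_signature_info(value):
    """Return the annotation as a dict, whichever form it arrived in.

    Hypothetical helper: accepts a dict (nested JSON object) or a str
    holding JSON-encoded text. Anything else yields an empty dict.
    """
    if isinstance(value, dict):
        # Already a nested object: use it as-is.
        return value
    if isinstance(value, str):
        # A string that should itself contain JSON.
        try:
            parsed = json.loads(value)
        except json.JSONDecodeError:
            return {}
        return parsed if isinstance(parsed, dict) else {}
    return {}


def to_indexable_string(value):
    """Serialize the normalized annotation back to a single string,
    so a downstream field that expects a string always gets one."""
    return json.dumps(normalize_module_signature_info(value), sort_keys=True)
```

Normalizing at the boundary like this means a downstream index that expects one field type keeps seeing that type even if the sender changes its encoding.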

I deployed the fix for bug #1603236 about 20 minutes ago and the problem is gone now. We don't need to change ModuleSignatureInfo. I've got something to go reprocess all the reports we got over the last couple of weeks that didn't get into Elasticsearch. I'll let you know when that's done.

Alright, thanks!

It took a while to get a list of crash IDs for crash reports that were processed but didn't make it into Elasticsearch. I reprocessed about 29k crash reports in the last couple of hours. The ShutDownKill graph looks fine now.
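
The step of finding reports that were processed but never indexed amounts to a set difference between two lists of crash IDs. A minimal sketch, assuming both lists are already available as iterables of ID strings; `find_missing_crash_ids` is a hypothetical name and this is not the actual tooling used here.

```python
def find_missing_crash_ids(processed_ids, indexed_ids):
    """Return crash IDs present in processed_ids but absent from indexed_ids.

    Order of first appearance in processed_ids is preserved and duplicates
    are dropped, so the result can be fed to a reprocessing queue directly.
    """
    indexed = set(indexed_ids)
    seen = set()
    missing = []
    for crash_id in processed_ids:
        if crash_id not in indexed and crash_id not in seen:
            seen.add(crash_id)
            missing.append(crash_id)
    return missing
```

Using a set for the indexed IDs keeps each membership check constant-time, which matters when the lists hold tens of thousands of entries.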

I think we're good here now?

Flags: needinfo?(gsvelto)

How do I recognize a Fenix 3.0 crash report? Is the Fenix version number in the annotations somewhere?

(In reply to Ryan VanderMeulen [:RyanVM] from comment #10)

> I think we're good here now?

Yeah, everything's back to normal.

Flags: needinfo?(gsvelto)

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #11)

> How do I recognize a Fenix 3.0 crash report? Is the Fenix version number in the annotations somewhere?

I think there is no such thing ATM, you should file a bug to add it.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Depends on: 1603236
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED

Please specify a root cause for this bug. See :tmaity for more information.

Root Cause: --- → ?

I'd say that the most appropriate root cause here is "Coding: Compatibility Issue", but it's not available from the drop-down menu so I picked "Coding: Other" instead. The patch in bug 1420363 introduced a small change in the format of the payload we send to Socorro, and because of that Socorro started discarding the new reports as malformed.

Root Cause: ? → Coding: Other
Has Regression Range: --- → yes