Closed Bug 1733904 Opened 3 years ago Closed 2 years ago

mark STATUS_DEVICE_DATA_ERROR crashes as "bad hardware" in signature

Categories

(Socorro :: Signature, task, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gsvelto, Assigned: willkg)

Details

Attachments

(2 files)

We seem to receive ~2K crashes a week (which should translate to ~20K incoming crashes IIRC) with a reason set to EXCEPTION_IN_PAGE_ERROR_READ / STATUS_DEVICE_DATA_ERROR, EXCEPTION_IN_PAGE_ERROR_EXEC / STATUS_DEVICE_DATA_ERROR and EXCEPTION_IN_PAGE_ERROR_WRITE / STATUS_DEVICE_DATA_ERROR. See this query.

These crashes are caused by damaged blocks on the machines' drive and we can't do anything about it. We should consider dropping them entirely.

I'll look into this.

Assignee: nobody → willkg
Status: NEW → ASSIGNED
Priority: -- → P2

I don't see a good way to figure this out from the crash annotations. Seems like Socorro needs to process the crash report and look at the minidump.

Generally, we don't delete crash reports that have been collected (which removes them from the user's device) until the expiration of the data.

We could add another marker in the signature, though. That would bump these crash reports out of top crashers, signature reports, etc and make them clearly identifiable. Maybe a flag similar to what we do with OOM crashes.

What do you think about that? Maybe something like "BH" for "bad hardware"?

Flags: needinfo?(gsvelto)

Oh! Earlier today, we were talking about running something on the minidump to extract annotations. Can we write a tool that quickly extracts the crash_info.type bit from the minidump? If we can do that, then maybe we could throttle on it.

(Sorry for all the replies....) If we did add a throttle rule that rejects the crash report because it's from bad hardware, I'd really like for the rejection reason to percolate up to the user so they know why we rejected the crash. We keep having issues where the Socorro collector intentionally rejects crashes, but the user has no idea why and keeps retrying.

crash_info.type comes from the MinidumpException stream so there isn't any shortcut for checking it. If you really wanted we could add a flag to minidump-stackwalk to have it compute that value first and bail on that value. Not sure if you would prefer just failing to produce output or a mini-json output with { status: "rejected for xyz" }.

We'd have to run this new utility in the collector--not the processor. The collector doesn't currently look at minidumps at all--it only looks at crash annotations in the HTTP POST payload.

When you say "there isn't any shortcut for checking it", are you saying we have to completely parse the minidump and extract the stack and symbols and do all the work that minidump-stackwalk does and we can't write a new utility that does the minimum it needs to do to extract the crash_info.type?

(Edited: example was wrong)

Ahh ok. I was mentally contrasting with an "ideal" situation of just "oh yeah that value's always at offset 72" or whatever.

The minidump has a directory-like structure of its "streams", so you can check the directory for the stream you're interested in and only parse that one. The core rust-minidump crate is specifically designed to do this, so it would be ~trivial for me to write a new application that just does this one check. In fact, it's basically a simplified version of the example in rust-minidump's README lol:

fn main() {
  let mut dump = minidump::Minidump::read_path("../testdata/test.dmp").unwrap();
  let sys  = dump.get_stream::<MinidumpSystemInfo>().unwrap();
  let exception = dump.get_stream::<MinidumpException>().unwrap();

  // print crash_info.type (can also `match` on it to check for specific values)
  println!("{}", exception.get_crash_reason(sys.os, sys.cpu));
}

In this case you do also need to parse the SystemInfo stream because interpreting the error code is a platform-specific operation. But neither of these things requires symbols or whatever. Basically as light-weight as it gets.

Nice! I'll put this in my list of things to test out when I get a chance.

Flags: needinfo?(gsvelto)

This would in fact be a really nice starter bug if you were at all interested in learning more basic rust stuff / how rust-minidump works. If you are, I would be happy to answer any questions and point out good places to use as reference (read: copy-paste snippets from).

Otherwise I could knock it out pretty quick, depending on the specific requirements.

(In reply to Will Kahn-Greene [:willkg] ET needinfo? me from comment #2)

We could add another marker in the signature, though. That would bump these crash reports out of top crashers, signature reports, etc and make them clearly identifiable. Maybe a flag similar to what we do with OOM crashes.

What do you think about that? Maybe something like "BH" for "bad hardware"?

I like that! It would immediately give crash triagers the piece of information they need (ignore this crash) but it would also allow us to gather statistics about them (how many of these crashes are caused by bad hardware?). We would then go and re-use that for other crashes that might be caused by bad hardware such as bug 1738651.

As for the signature I'd spell it out (i.e. "bad hardware") because it's hard to tell what's going on from just "BH". So this could be something like [@ bad hardware | _getdrive] or maybe just [@ bad hardware].

I tweaked the summary accordingly. I'll go with that.

I'll table the "let's add minidump parsing things to the collector" thing to later. I do like that idea and having that kind of ability gives us a lot of interesting options. Definitely something to think about in the future.

Component: Antenna → Signature
Summary: Consider ignoring STATUS_DEVICE_DATA_ERROR crashes → mark STATUS_DEVICE_DATA_ERROR crashes as "bad hardware" in signature

I deployed these changes to prod in bug #1790053. I reprocessed all the crash reports with STATUS_DEVICE_DATA_ERROR in the reason for the last week. There were like 800 of them or so. They now have "bad hardware" in the signature.

https://crash-stats.mozilla.org/search/?reason=~STATUS_DEVICE_DATA_ERROR&product=Firefox&date=%3E%3D2022-09-02T00%3A00%3A00.000Z&date=%3C2022-09-09T23%3A57%3A00.000Z&_facets=signature&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature

Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: