Closed Bug 1603236 Opened 5 years ago Closed 5 years ago

[traceback] TransportError: MapperParsingException[failed to parse [raw_crash.ModuleSignatureInfo]]; nested: ElasticsearchIllegalArgumentException

Categories

(Socorro :: Processor, defect, P1)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: willkg, Assigned: willkg)

References

Details

Attachments

(1 file)

Starting December 3rd, we're seeing a ton of these errors every day.

https://sentry.prod.mozaws.net/operations/socorro-new-prod/issues/6322185/

RequestError: TransportError(400, 'RemoteTransportException[[Shaman][inet[/172.31.10.146:9300]][indices:data/write/index]]; nested: MapperParsingException[failed to parse [raw_crash.ModuleSignatureInfo]]; nested: ElasticsearchIllegalArgumentException[unknown property [XXX]] ')
  File "socorro/external/es/crashstorage.py", line 320, in _submit_crash_to_elasticsearch
    conn, index_name, es_doctype, crash_document, crash_id
  File "socorro/external/es/crashstorage.py", line 290, in _index_crash
    index=es_index, doc_type=es_doctype, body=crash_document, id=crash_id
  File "elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "elasticsearch/client/__init__.py", line 263, in index
    _make_path(index, doc_type, id), params=params, body=body)
  File "elasticsearch/transport.py", line 307, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "elasticsearch/connection/http_requests.py", line 78, in perform_request
    self._raise_error(response.status_code, raw_data)
  File "elasticsearch/connection/base.py", line 105, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)

The XXX in the exception message is a placeholder--there are a variety of possible properties there.

This bug covers figuring out what's going on here and fixing it.

On or around that date, Firefox nightly started sending crash reports in JSON format as opposed to the multipart/form-data format. That work is in bug #1420363. Antenna (the Socorro collector) converts the JSON to a Python object and persists it as JSON again. That's super, except that fields like TelemetryEnvironment and ModuleSignatureInfo used to have JSON-encoded string values, but are now Python dicts, and our Elasticsearch crash storage can't handle that.

For example:

{
    ...
    "TelemetryEnvironment": "{\"build\":{\"applicationId\":\"{ec8030f7-c20a-464f-9b0e-13a3a9e97384}\",\"...",
    ...
}

vs.

{
    ...
    "TelemetryEnvironment": {
        "build": {
            "applicationId": "{ec8030f7-c20a-464f-9b0e-13a3a9e97384}",
            ...
        },
        ...
    },
    ...
}

Here's a raw crash from one of the affected crash reports:

https://crash-stats.mozilla.org/rawdumps/4dba56a7-19ce-456d-8ded-8100c0191211.json

Here's what they used to look like:

https://crash-stats.mozilla.org/rawdumps/2b272167-6040-4d90-b058-012b90191211.json

Our Elasticsearch crash storage will drop fields it has trouble with, so it's dropping these fields from search. In the crash reports I've looked at, it manages to save the crash report, but probably without the problematic fields.

I think the problem is that these fields aren't specified in super search fields, so they're getting a default mapping of string or something like that, and since the values are definitely not strings, they kick up an error.

Things to do:

  1. find out if there are any incoming crash reports that weren't processed (bug #1603231)
  2. figure out what Elasticsearch should do with fields that aren't in super search fields and don't have string values
Assignee: nobody → willkg
Status: NEW → ASSIGNED

Philip crashed his Firefox nightly and sent a crash report. He can see the crash report in Crash Stats, but can't search for it.

That has a TelemetryEnvironment as a big string, but it has ModuleSignatureInfo as a nested object.

I can see the report for bp-4dba56a7-19ce-456d-8ded-8100c0191211 from comment 1, but can't search for it, either.

Weirdly, I thought I had a pair of raw crashes that showed the problem with TelemetryEnvironment, but now I don't see any TelemetryEnvironment that is a non-string value, so maybe it's just a problem with ModuleSignatureInfo.

Also, I thought Elasticsearch would take the problematic field out and try indexing again, but maybe that code is broken or I'm misremembering.

I can reproduce the problem in my local dev environment with two crash reports that have different ModuleSignatureInfo value types. One has a string value. The other has a nested Python dict value.

This field is not in super search fields, so Elasticsearch infers the mapping type when it sees the first one. If the first one is a string, then when it tries to index a non-string version, it fails. That's what we're seeing here.
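
Here's a minimal reproduction sketch (not Socorro code) of what's going on: with no explicit mapping, Elasticsearch infers the field type from the first document it sees, and a later document with a conflicting type gets rejected. The index name, doc type, and local ES instance below are placeholders; the client calls match the elasticsearch-py usage in the traceback above.

# Hypothetical reproduction: index a string-valued ModuleSignatureInfo first,
# then try to index a dict-valued one into the same index.
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import RequestError

es = Elasticsearch(["localhost:9200"])
index = "testing_crashes"    # placeholder index name
doctype = "crash_reports"    # placeholder doc type

# First document: ModuleSignatureInfo is a JSON-encoded string, so Elasticsearch
# dynamically maps raw_crash.ModuleSignatureInfo as a string field.
es.index(index=index, doc_type=doctype, id="crash-1",
         body={"raw_crash": {"ModuleSignatureInfo": '{"Mozilla": ["xul.dll"]}'}})

# Second document: the same field is a nested object, which conflicts with the
# inferred string mapping and raises the MapperParsingException we see in Sentry.
try:
    es.index(index=index, doc_type=doctype, id="crash-2",
             body={"raw_crash": {"ModuleSignatureInfo": {"Mozilla": ["xul.dll"]}}})
except RequestError as exc:
    print(exc)  # 400, "failed to parse [raw_crash.ModuleSignatureInfo]"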

I think we have a few possible solutions:

  1. Maintain the existing system. Add some code that does a pass over the crash report values and converts anything that's a list or a dict into a JSON-encoded string (see the sketch after this list). Then ModuleSignatureInfo would get converted to a JSON-encoded string just like in older crash reports.

  2. Drop problematic fields. Users can't do anything with the values in these fields through SuperSearch, so they're just hanging out. Plus, if they're lists or dicts, they're likely to be large values. They're large and not useful--we should just remove them.

We have code that does this already, but it doesn't handle MapperParsingException, which is the exception we're getting now. We could just add handling for this exception, too.

  3. Rework Elasticsearch somehow to handle fields that have values of different types. I have no idea how this might work.
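
For option 1, the conversion pass could be as small as this sketch; the function name and where it would hook into the processor are made up for illustration.

import json

def jsonify_complex_values(raw_crash):
    # Hypothetical pass for option 1: convert any list or dict values in the
    # raw crash into JSON-encoded strings so they match the shape of older
    # crash reports.
    return {
        key: json.dumps(value) if isinstance(value, (list, dict)) else value
        for key, value in raw_crash.items()
    }

# ModuleSignatureInfo goes back to being a string, like in older crash reports:
raw_crash = {"ModuleSignatureInfo": {"Mozilla": ["xul.dll", "firefox.exe"]}}
print(jsonify_complex_values(raw_crash)["ModuleSignatureInfo"])
# '{"Mozilla": ["xul.dll", "firefox.exe"]}'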

I'm inclined to go with option 2. It uses existing code to drop fields that don't index well, and it's easy to implement now. If we want to index these fields later, we'll write a processor rule that converts the value into something that's searchable; that work won't benefit from having a bunch of unsearchable data already sitting in Elasticsearch.
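
Roughly, the option 2 change has this shape. This is a sketch, not the actual crashstorage code: the error-message regex and the helper name are made up for illustration. The idea is to catch the indexing error, pull the offending field name out of the error message, drop the field from the document, and retry.

import re
from elasticsearch.exceptions import RequestError

FIELD_RE = re.compile(r"failed to parse \[(raw_crash|processed_crash)\.([\w-]+)\]")

def index_with_field_dropping(es, index, doctype, doc, crash_id, max_retries=5):
    # Sketch of the "drop problematic fields and retry" approach.
    for _ in range(max_retries):
        try:
            return es.index(index=index, doc_type=doctype, body=doc, id=crash_id)
        except RequestError as exc:
            match = FIELD_RE.search(str(exc))
            if match is None:
                raise  # not a mapping problem we know how to recover from
            namespace, field = match.groups()
            doc[namespace].pop(field, None)  # drop the field and try again
    raise RuntimeError("gave up dropping fields for %s" % crash_id)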

I've looked through crash reports. I only see the ModuleSignatureInfo field being problematic. There are 25k errors, though, so maybe I just haven't seen the others.

I implemented option 2. I think it's the right thing to do.

I wrote up bug #1603308 to fix verifyprocessed to look at Elasticsearch and treat missing data there as missing processed crashes. Then it'll self-heal over time.

willkg merged PR #5058: "bug 1603236: fix unknown property indexing failure" in 98f825a.

We don't have this problem in stage since all the crash reports that get submitted to stage come from prod, but are in multipart/form-data format. I can reproduce the issue if I manually send crash reports to stage.

I'll wait for this to deploy to stage, then verify it there using my script.

Will, making ModuleSignatureInfo into a string again is a one-line change in the minidump analyzer, so if it makes your life easier I can do it today.

Flags: needinfo?(willkg)

That would be cool, but we'd still have lost a week and a half of processed crash reports, plus I don't think we know whether this will pop up again.

I've landed the fix and tested it on stage. Then I wanted to reprocess all the affected crash reports, and that's when I discovered that the other thing I use for catching stuff that got missed wasn't working. So I went to fix that and discovered that a whole bunch of its tests were "weird" and wouldn't do what I wanted... and a whole bunch of time passed.

I have all the things fixed. I'm pushing them to stage and I'll test them out there. Then I'll push to prod and reprocess and we should be fine after that.

Thanks for the offer! It's an interesting option if this ever comes up again and I can't fix it in the processor.

Flags: needinfo?(willkg)

This was deployed in bug #1603593 an hour ago. Sentry is showing no further instances of this problem.

The verifyprocessed job will find the crash reports that were processed and made it to S3, but didn't make it to Elasticsearch and set them up for reprocessing. I'll check in tomorrow and make sure that worked.
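
For reference, the check is conceptually something like this sketch; the bucket name, key prefix, and index naming are placeholders and not Socorro's actual layout.

import boto3
from elasticsearch import Elasticsearch

s3 = boto3.client("s3")
es = Elasticsearch(["localhost:9200"])

def missing_from_es(bucket, prefix, index, doctype):
    # Yield crash ids that have a processed crash in S3 but no document in
    # Elasticsearch; these are the ones to queue for reprocessing.
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            crash_id = obj["Key"].rsplit("/", 1)[-1]
            if not es.exists(index=index, doc_type=doctype, id=crash_id):
                yield crash_id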

It took a while to get a list of crash ids for crash reports that were processed but didn't make it into Elasticsearch. I reprocessed about 29k crash reports in the last couple of hours. The ShutDownKill graph looks fine now.

Marking as FIXED.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED