Open Bug 1268484 Opened 9 years ago Updated 3 years ago

Fuzzy autoclassification using ElasticSearch

Categories

(Tree Management :: Treeherder, defect, P3)

Tracking

(Not tracked)

People

(Reporter: jgraham, Unassigned)

Attachments

(1 file)

To deal with cases where the autoclassifier doesn't make a match because of variable data, try using ElasticSearch to do word-only matching.
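As a rough illustration of what "word-only" matching could mean, the sketch below drops every token that contains a digit before comparing two messages. The `word_tokens` helper and its tokenization rule are assumptions made for this example; they are not taken from the Treeherder patch, and an ElasticSearch analyzer would do the equivalent work on the server side.

```python
import re


def word_tokens(message):
    """Return lowercase tokens from a failure message, skipping any token
    that contains a digit (hypothetical rule for 'variable data')."""
    tokens = re.findall(r"\w+", message)
    return [t.lower() for t in tokens if not any(c.isdigit() for c in t)]


# Two messages that differ only in a timeout value and an address.
first = "Test timed out after 30000 ms at 0x7f3a"
second = "Test timed out after 45000 ms at 0x81bc"
same_words = word_tokens(first) == word_tokens(second)
```

With this rule the two messages reduce to the same word list, so an exact comparison on the words succeeds even though the raw strings differ.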
Blocks: 1268485
No longer blocks: 1268485
Depends on: 1268485
Depends on: 1267943
Here is the output from running `./manage.py es_import_failure_lines` with a larger dyno size (needed to prevent the one-off dyno from being killed): https://emorley.pastebin.mozilla.org/8869563
I've also filed an issue against Heroku asking for `heroku run` to show a clearer error message in this case, to save others from spending ages debugging like we did: https://help.heroku.com/tickets/359578
Attachment #8751686 - Flags: review?(wlachance)
Attachment #8751686 - Flags: review?(emorley)
Comment on attachment 8751686 [details] [treeherder] mozilla:es_matcher > mozilla:master Left some initial feedback. Please could you also add a multi-line commit message explaining the reasoning for the change and an overview of the feature. There are a few other places where there could be some additional inline comments/docstrings. Re-request review when you'd like me to take a final look :-)
Attachment #8751686 - Flags: review?(emorley)
Attachment #8751686 - Flags: review?(emorley)
Assignee: nobody → james
Attachment #8751686 - Flags: review?(emorley) → review+
Comment on attachment 8751686 [details] [treeherder] mozilla:es_matcher > mozilla:master I don't have any experience with elasticsearch, but this seems to make sense to me.
Attachment #8751686 - Flags: review?(wlachance) → review+
Comment on attachment 8751686 [details] [treeherder] mozilla:es_matcher > mozilla:master I added some new commits which I think/hope improve the performance somewhat. At least it hasn't entirely blown up on Heroku so far.
Attachment #8751686 - Flags: review?(wlachance)
Attachment #8751686 - Flags: review?(emorley)
Attachment #8751686 - Flags: review+
Comment on attachment 8751686 [details] [treeherder] mozilla:es_matcher > mozilla:master This pretty much all looks fine to me, but I'm definitely not an elasticsearch expert.
Attachment #8751686 - Flags: review?(wlachance) → review+
Attachment #8751686 - Flags: review?(emorley) → review+
Attachment #8751686 - Flags: review+ → review?(emorley)
Attachment #8751686 - Flags: review?(emorley) → review+
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/52607c2a57c9f9780b2b7b96c39124db700357d2
Bug 1268484 - Add elastic-search based matcher for test failure lines (#1488)
Add support for matching test failures where the test, subtest, status, and expected status are all exact matches, but the message is not an exact match. The matching uses ElasticSearch and is initially optimised for cases where the messages differ only in numeric values since this is a relatively common case.
This commit also adds ElasticSearch to the travis environment.
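The matching rule described in the commit message (exact match on test, subtest, status and expected status, fuzzy match on the message) maps naturally onto an Elasticsearch `bool` query. The sketch below only builds the query body as a plain dict. The field names and the `build_failure_line_query` helper are hypothetical, and it assumes the four exact-match fields are indexed as non-analysed strings; the query actually used by the commit may differ.

```python
def build_failure_line_query(test, subtest, status, expected, message):
    """Build an Elasticsearch query body (hypothetical field names)."""
    return {
        "query": {
            "bool": {
                # Exact-match constraints; filters do not affect the score.
                "filter": [
                    {"term": {"test": test}},
                    {"term": {"subtest": subtest}},
                    {"term": {"status": status}},
                    {"term": {"expected": expected}},
                ],
                # Analysed full-text match on the message; this part is
                # scored, so the closest messages rank first.
                "must": [{"match": {"message": {"query": message}}}],
            }
        }
    }


query = build_failure_line_query(
    test="dom/tests/example.html",
    subtest="loads in time",
    status="FAIL",
    expected="PASS",
    message="Timed out after 30000 ms",
)
```

Keeping the exact fields in `filter` rather than `must` means only the message similarity contributes to ranking, which fits the goal of tolerating differences in numeric values.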
Depends on: 1283252
Depends on: 1283253
Depends on: 1283316
Depends on: 1284429
Depends on: 1284432
Depends on: 1285539
Elasticsearch 5.1.2 is now out and supported by Elastic Cloud. The addon on the treeherder-prototype app is currently using Elasticsearch 2.3.5. Given we'll need to set up new Elasticsearch addons on stage/prod when this lands, I think it makes sense to use 5.x from the outset. Please can we try updating both the treeherder-prototype's addon and also the Python clients to match?
Depends on: 1331397
Depends on: 1340505
Depends on: 1340552
Depends on: 1382227
Depends on: 1382229
We now have a clearer plan on how to improve this:

* ML approaches seem like overkill; text search probably does what we need whilst benefiting from robust implementations like ES.
* Instead of working line-by-line, we should consider the full set of lines from a document together. In the simplest case we can tokenize to remove expected-useless data and do an exact match on the full set of lines from a previous job. This has a couple of nice properties, notably that any context-dependent classifications (cases where the classification of identical lines with text M depends on whether they follow lines with text X or Y) will be retained as long as we see the same lines in the errorsummary file.
* For cases where a full match fails to produce a result, we can take the lines and match using Tf-idf weighting to find the most similar previous classifications, and apply some threshold to only get good-enough matches.
* To the extent that it's possible to work with full job summaries, it should be possible to work with job classification data rather than line classification data, which will allow more jobs to be autoclassified. It's not clear to me if this can work for the case where we aren't matching a full error summary.
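The two-stage plan above (exact match on the tokenized full set of lines, then a Tf-idf fallback with a threshold) can be sketched in memory as follows. Everything here is an assumption made for illustration: the function names, the rule of dropping tokens that contain digits, the smoothed idf formula, and the default threshold of 0.8. In practice ES would be expected to do the scoring.

```python
import math
import re
from collections import Counter


def tokenize(line):
    """Lowercase word tokens, dropping any token that contains a digit."""
    return [t.lower() for t in re.findall(r"\w+", line)
            if not any(c.isdigit() for c in t)]


def job_key(lines):
    """Tokenized form of a whole error summary, for the exact-match step."""
    return tuple(tuple(tokenize(line)) for line in lines)


def tfidf_vectors(docs):
    """Map each token list to a {token: tf * idf} dict (smoothed idf)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    return [{tok: tf * idf[tok] for tok, tf in Counter(doc).items()}
            for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def classify(new_lines, history, threshold=0.8):
    """history is a list of (lines, classification) pairs from earlier jobs.
    Returns a classification, or None if nothing is similar enough."""
    # Stage 1: exact match on the tokenized full set of lines.
    key = job_key(new_lines)
    for lines, classification in history:
        if job_key(lines) == key:
            return classification
    # Stage 2: Tf-idf similarity against each previous job, with a threshold.
    docs = [[t for line in lines for t in tokenize(line)]
            for lines, _ in history]
    query = [t for line in new_lines for t in tokenize(line)]
    vectors = tfidf_vectors(docs + [query])
    query_vec = vectors[-1]
    best_score, best = 0.0, None
    for vec, (_, classification) in zip(vectors, history):
        score = cosine(query_vec, vec)
        if score > best_score:
            best_score, best = score, classification
    return best if best_score >= threshold else None


history = [
    (["TEST-UNEXPECTED-FAIL | a.html | timed out after 3000 ms"], "bug-1"),
    (["PROCESS-CRASH | b.html | application crashed [@ mozalloc_abort]"], "bug-2"),
]
result = classify(
    ["TEST-UNEXPECTED-FAIL | a.html | timed out after 4500 ms"], history)
```

Because `job_key` keeps the order of lines, the exact-match stage preserves the context-dependent behaviour mentioned above: the same line following different lines produces a different key. The fallback stage ignores order and only compares weighted token counts.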
Component: Treeherder → Treeherder: Log Parsing & Classification
Assignee: james → nobody
Depends on: 1527868
Priority: -- → P3
Component: Treeherder: Log Parsing & Classification → TreeHerder