Open Bug 1268484 Opened 9 years ago Updated 3 years ago

Fuzzy autoclassification using ElasticSearch

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: jgraham, Unassigned)

Attachments

(1 file)

[treeherder] mozilla:es_matcher > mozilla:master, 9 years ago, GitHub Autolander Bot (deleted), text/x-github-pull-request, wlach: review+, emorley: review+

James Graham [:jgraham]

Reporter

Description

•

9 years ago

To deal with cases where the autoclassifier doesn't make a match because of variable data, try using ElasticSearch to do word-only matching.
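As a rough illustration of what word-only matching means here, the sketch below compares two failure messages after discarding everything that is not an alphabetic word. It is plain Python rather than an ElasticSearch analyzer, and the helper names (`word_tokens`, `words_match`) are hypothetical, not taken from the Treeherder code.

```python
import re

# Assumption: "word-only" means keeping runs of letters and dropping
# numbers, punctuation and other variable data.
WORD_RE = re.compile(r"[A-Za-z_]+")


def word_tokens(message):
    """Return the lowercase word tokens of a failure message."""
    return [word.lower() for word in WORD_RE.findall(message)]


def words_match(first, second):
    """True when two messages contain the same words in the same order."""
    return word_tokens(first) == word_tokens(second)


if __name__ == "__main__":
    old = "Test timed out after 30000 ms"
    new = "Test timed out after 45012 ms"
    print(words_match(old, new))
```

A real deployment would push this normalisation into the search index so that candidates can be found without scanning every stored line; the sketch only shows the comparison itself.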

James Graham [:jgraham]

Reporter

Updated

•

9 years ago

Blocks: 1268485

James Graham [:jgraham]

Reporter

Updated

•

9 years ago

No longer blocks: 1268485

Depends on: 1268485

Ed Morley [:emorley]

Updated

•

9 years ago

Depends on: 1267943

Ed Morley [:emorley]

Comment 1

•

9 years ago

The output from running `./manage.py es_import_failure_lines` with a larger dyno size, to prevent the one-off dyno from being killed:
https://emorley.pastebin.mozilla.org/8869563

I've also filed an issue against Heroku for making `heroku run` more clearly show an error message, to save others from spending ages debugging like we did:
https://help.heroku.com/tickets/359578

James Graham [:jgraham]

Reporter

Comment 2

•

9 years ago

https://treeherder-heroku.herokuapp.com/#/jobs?repo=try&revision=778e8b422d15&autoclassify

This seems to work, at least for simple things.

GitHub Autolander Bot

Comment 3

•

9 years ago

Attached file: [treeherder] mozilla:es_matcher > mozilla:master (deleted)

James Graham [:jgraham]

Reporter

Updated

•

9 years ago

Attachment #8751686 - Flags: review?(wlachance)

Attachment #8751686 - Flags: review?(emorley)

Ed Morley [:emorley]

Comment 4

•

9 years ago

Comment on attachment 8751686: [treeherder] mozilla:es_matcher > mozilla:master

Left some initial feedback. Please could you also add a multi-line commit message explaining the reasoning for the change and an overview of the feature. There are a few other places where there could be some additional inline comments/docstrings. Re-request review when you'd like me to take a final look :-)

Attachment #8751686 - Flags: review?(emorley)

James Graham [:jgraham]

Reporter

Updated

•

8 years ago

Attachment #8751686 - Flags: review?(emorley)

Ed Morley [:emorley]

Updated

•

8 years ago

Assignee: nobody → james

Ed Morley [:emorley]

Updated

•

8 years ago

Attachment #8751686 - Flags: review?(emorley) → review+

William Lachance (:wlach)

Comment 5

•

8 years ago

Comment on attachment 8751686: [treeherder] mozilla:es_matcher > mozilla:master

I don't have any experience with Elasticsearch, but this seems to make sense to me.

Attachment #8751686 - Flags: review?(wlachance) → review+

James Graham [:jgraham]

Reporter

Comment 6

•

8 years ago

Comment on attachment 8751686: [treeherder] mozilla:es_matcher > mozilla:master

I added some new commits which I think/hope improve the performance somewhat. At least it hasn't entirely blown up on Heroku so far.

Attachment #8751686 - Flags: review?(wlachance)

Attachment #8751686 - Flags: review?(emorley)

Attachment #8751686 - Flags: review+

William Lachance (:wlach)

Comment 7

•

8 years ago

Comment on attachment 8751686: [treeherder] mozilla:es_matcher > mozilla:master

This pretty much all looks fine to me, but I'm definitely not an Elasticsearch expert.

Attachment #8751686 - Flags: review?(wlachance) → review+

Ed Morley [:emorley]

Updated

•

8 years ago

Attachment #8751686 - Flags: review?(emorley) → review+

James Graham [:jgraham]

Reporter

Updated

•

8 years ago

Attachment #8751686 - Flags: review+ → review?(emorley)

Ed Morley [:emorley]

Updated

•

8 years ago

Attachment #8751686 - Flags: review?(emorley) → review+

Treeherder GitHub Bugbot

Comment 8

•

8 years ago

Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/52607c2a57c9f9780b2b7b96c39124db700357d2
Bug 1268484 - Add elastic-search based matcher for test failure lines (#1488)

Add support for matching test failures where the test, subtest, status, and expected status are all exact matches, but the message is not an exact match. The matching uses ElasticSearch and is initially optimised for cases where the messages differ only in numeric values since this is a relatively common case.

This commit also adds ElasticSearch to the travis environment.
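For readers unfamiliar with how such a matcher can be phrased, the following is a minimal sketch of an ElasticSearch query body that requires exact agreement on test, subtest, status and expected status while scoring the message by full-text relevance. The field names and the `build_failure_query` helper are assumptions for illustration; the query actually used in the commit may differ.

```python
def build_failure_query(test, subtest, status, expected, message):
    """Build a bool query: exact filters plus a scored match on the message.

    Assumption: the four filtered fields are indexed as exact values
    (not analysed text), so `term` clauses compare them verbatim.
    """
    return {
        "query": {
            "bool": {
                # Filters must all hold and do not contribute to the score.
                "filter": [
                    {"term": {"test": test}},
                    {"term": {"subtest": subtest}},
                    {"term": {"status": status}},
                    {"term": {"expected": expected}},
                ],
                # The message is analysed, so near-identical text ranks highest.
                "must": [{"match": {"message": message}}],
            }
        }
    }


if __name__ == "__main__":
    import json

    body = build_failure_query(
        "dom/tests/a.html", "load", "FAIL", "PASS", "expected 3 but got 4"
    )
    print(json.dumps(body, indent=2))
```

The resulting dictionary would be passed as the request body of a search call; how numeric tokens are treated depends on the analyzer configured for the message field, which this sketch does not cover.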

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1283252

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1283253

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1283316

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1284429

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1284432

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1285539

Ed Morley [:emorley]

Comment 9

•

8 years ago

Elasticsearch 5.1.2 is now out and supported by Elastic Cloud. The addon on the treeherder-prototype app is currently using Elasticsearch 2.3.5. Given we'll need to set up new Elasticsearch addons on stage/prod when this lands, I think it makes sense to use 5.x from the outset. Please can we try updating both the treeherder-prototype's addon and also the Python clients to match?

Depends on: 1331397

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1340505

Ed Morley [:emorley]

Updated

•

8 years ago

Depends on: 1340552

Ed Morley [:emorley]

Updated

•

7 years ago

Depends on: 1382227

Ed Morley [:emorley]

Updated

•

7 years ago

Depends on: 1382229

James Graham [:jgraham]

Reporter

Comment 10

•

7 years ago

We now have a clearer plan on how to improve this:

* ML approaches seem like overkill; text search probably does what we need whilst benefiting from robust implementations like ES.
* Instead of working line-by-line, we should consider the full set of lines from a document together. In the simplest case we can tokenize to remove expected-useless data and do an exact match on the full set of lines from a previous job. This has a couple of nice properties, notably that any context-dependent classifications (cases where the classification of identical lines with text M depends on whether they follow lines with text X or Y) will be retained as long as we see the same lines in the errorsummary file.
* For cases where a full match fails to produce a result, we can take the lines and match using Tf-idf weighting to find the most similar previous classifications, and apply some threshold to only get good-enough matches.
* To the extent that it's possible to work with full job summaries, it should be possible to work with job classification data rather than line classification data, which will allow more jobs to be autoclassified. It's not clear to me if this can work for the case where we aren't matching a full error summary.
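The two-stage plan above (exact match on the normalised set of lines, then a thresholded Tf-idf fallback) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the planned implementation: numbers are taken to be the only "expected-useless" data, the smoothed idf formula and the 0.8 threshold are arbitrary choices, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

NUMBER_RE = re.compile(r"\d+")


def normalize(line):
    """Replace digit runs with a placeholder so variable numbers compare equal."""
    return NUMBER_RE.sub("N", line).strip().lower()


def job_key(lines):
    """Exact-match key: the normalised lines of a job, order preserved."""
    return tuple(normalize(line) for line in lines)


def tokens(lines):
    return [tok for line in lines for tok in normalize(line).split()]


def tfidf_vectors(docs):
    """Turn token lists into {token: weight} dicts using a smoothed idf."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    return [
        {tok: count * (math.log((1 + n) / (1 + df[tok])) + 1)
         for tok, count in Counter(doc).items()}
        for doc in docs
    ]


def cosine(a, b):
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(lines, history, threshold=0.8):
    """history is a list of (lines, classification) pairs from earlier jobs."""
    # Stage 1: exact match on the full normalised set of lines.
    key = job_key(lines)
    for old_lines, classification in history:
        if job_key(old_lines) == key:
            return classification
    if not history:
        return None
    # Stage 2: most similar earlier job by Tf-idf cosine similarity.
    vectors = tfidf_vectors([tokens(old) for old, _ in history] + [tokens(lines)])
    query = vectors[-1]
    score, best = max(
        ((cosine(query, vec), cls) for vec, (_, cls) in zip(vectors, history)),
        key=lambda pair: pair[0],
    )
    return best if score >= threshold else None
```

Because the exact-match key keeps line order, a line whose classification depends on what precedes it is only reused when the same context appears again, which is the context-dependence property mentioned in the plan.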

Ed Morley [:emorley]

Updated

•

7 years ago

Component: Treeherder → Treeherder: Log Parsing & Classification

Ed Morley [:emorley]

Updated

•

7 years ago

Assignee: james → nobody

Ed Morley [:emorley]

Updated

•

6 years ago

Depends on: 1527868

Karl Thiessen [:kthiessen, he/him]

Updated

•

5 years ago

Priority: -- → P3

Nobody; OK to take it and work on it

Assignee

Updated

•

3 years ago

Component: Treeherder: Log Parsing & Classification → TreeHerder
