Closed
Bug 1330354
Opened 8 years ago
Closed 8 years ago
SETA - failures fixed by commit endpoint times out
Categories
(Tree Management Graveyard :: Treeherder: SETA, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Assigned: rwood)
References
Details
Attachments
(2 files)
emorley, how can we see what is going wrong?
How can we simulate an environment like this?
The number of JobNotes on production is probably far larger than on stage.
Jan 11 11:35:32 treeherder-prod heroku/router: at=error code=H12 desc="Request timeout" method=GET path="/api/seta/failures-fixed-by-commit/" host=treeherder.mozilla.org request_id=9c28532c-84a8-4579-9b68-0f60c6701a2a fwd="66.207.193.22" dyno=web.2 connect=0ms service=30001ms status=503 bytes=0
Code:
* https://github.com/mozilla/treeherder/blob/master/treeherder/webapp/api/seta.py#L42-L51
* https://github.com/mozilla/treeherder/blob/master/treeherder/seta/analyze_failures.py#L36-L108
Comment 1•8 years ago
I'd look at the New Relic profiles - check for reported slow transactions.
Reporter
Comment 2•8 years ago
I loaded the API (so it falls within 30 minutes) and I found it under 'transactions' [1]
That shows that it took 49.9 seconds. Most of it goes into MySql operations.
Is that too long?
[1] https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d
Comment 3•8 years ago
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it goes into MySql operations.
Yeah, we have a limit of 30 seconds for web transactions.
Options:
1. Speed up the SQL query
2. Pre-generate the data somehow and store it in memcache
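A rough sketch of what option 2 could look like, assuming Django's cache framework backed by memcached; the key name, timeout and helper function below are illustrative assumptions, not actual Treeherder code:

# Illustrative sketch only -- not the actual Treeherder implementation.
from django.core.cache import cache

SETA_CACHE_KEY = 'seta-failures-fixed-by-commit'  # hypothetical cache key
SETA_CACHE_TIMEOUT = 60 * 60                      # eg regenerate at most hourly

def cached_failures_fixed_by_commit():
    # Serve pre-generated data from the cache, rebuilding it only on a miss.
    data = cache.get(SETA_CACHE_KEY)
    if data is None:
        data = get_failures_fixed_by_commit()  # the expensive query in analyze_failures.py
        cache.set(SETA_CACHE_KEY, data, SETA_CACHE_TIMEOUT)
    return data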
Comment 4•8 years ago
(In reply to Armen Zambrano - Soon to be gone until end of March [:armenzg] (EDT/UTC-4) from comment #2)
> I loaded the API (so it falls within 30 minutes) and I found it under
> 'transactions' [1]
> That shows that it took 49.9 seconds. Most of it goes into MySql operations.
You can view older transactions by adjusting the time window on the menu at the top.
The trace in question is:
https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e736574613a536574614661696c7572657346697865644279436f6d6d69742e6c697374222c22225d
The total transaction duration was 30s -- the 49.9 was, from what I can tell, actually "49.9%", the percentage breakdown of the Python parts in the profile.
I see that there are 2800 SELECTs made against each of 4 different tables (ie over 10000 SELECTs in total).
I'd start by trying in Vagrant, and loading the API in the browser to see what the debug toolbar SQL analyzer says. Even though the query time will be quicker locally, you should still be able to identify cases where we should be using eg select_related().
Comment 5•8 years ago
Ah yes there's no select_related() being used:
for job_note in models.JobNote.objects.filter(failure_classification=2):
So Django lazy-loads the additional fields for each usage, rather than fetching all in one shot using a JOIN.
Something like this should work:
job_notes = models.JobNote.objects.filter(failure_classification=2).select_related(
'job', 'job__signature', 'job__job_type', 'job__option_collection')
(I can never decide the best indentation strategy for these)
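For context, a minimal sketch of the difference; the loop body here is illustrative, since the real code in analyze_failures.py touches more related fields:

# Without select_related(): each access to a related object lazily issues
# another SELECT, so N job notes turn into thousands of queries.
for job_note in models.JobNote.objects.filter(failure_classification=2):
    job = job_note.job            # extra SELECT per note
    signature = job.signature     # extra SELECT per note

# With select_related(): one JOINed SELECT fetches the related rows up front.
job_notes = models.JobNote.objects.filter(failure_classification=2).select_related(
    'job', 'job__signature', 'job__job_type', 'job__option_collection')
for job_note in job_notes:
    job = job_note.job            # already loaded; no extra query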
Comment 6•8 years ago
Updated•8 years ago
Component: Treeherder → Treeherder: SETA
Assignee
Updated•8 years ago
Assignee: nobody → rwood
Status: NEW → ASSIGNED
Comment 8•8 years ago
Assignee
Updated•8 years ago
Attachment #8836800 - Flags: review?(emorley)
Comment 9•8 years ago
Assignee
Comment 10•8 years ago
Comment on attachment 8836888 [details]
[treeherder] rwood-moz:bug1330354-optimize > mozilla:master
Another pull request for a small optimization suggested by :wlach
Attachment #8836888 - Flags: review?(emorley)
Updated•8 years ago
Attachment #8836888 - Flags: review?(emorley) → review+
Comment 11•8 years ago
Comment on attachment 8836800 [details]
[treeherder] rwood-moz:bug1330354 > mozilla:master
:-)
Attachment #8836800 - Flags: review?(emorley) → review+
Comment 12•8 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/42ee3c1ce8d185b1830d033152bacb9c5337a2e3
Bug 1330354 - Use select_related in get_failures_fixed_by_commit() (#2162)
To prevent thousands of MySQL requests that cause both the API endpoint
and seta-analyse-failures task to time out.
Comment 13•8 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/2af1cefb827c022c56172adb9952f403b696582e
Bug 1330354 - Small optimization for Job.get_platform_option (#2164)
Assignee
Comment 14•8 years ago
Leaving this bug open until both of these patches are deployed to production, and we can confirm the API/analyze_failures no longer times out there.
Assignee
Comment 15•8 years ago
After the patches were deployed, I ran analyze_failures on production and it appears to have finished successfully. I didn't see any exceptions on treeherder-prod in New Relic, and I also didn't see any SETA APIs in the list of most time-consuming transactions. I tried to find the total run time of analyze_failures but couldn't see it listed. I ran the script via heroku run .. "./manage.py analyze_failures".
:emorley, is that sufficient to verify that analyze_failures now runs in < 30s? Is there any way to get an actual number? I dug around in New Relic but couldn't find it.
Flags: needinfo?(emorley)
Comment 16•8 years ago
(In reply to Robert Wood [:rwood] from comment #15)
> I ran the script via heroku run .. "./manage.py analyze_failures".
>
> :emorley, is that sufficient to verify that analyze_failures runs in < 30s
> now? Is there any way to get an actual number? I dug around in New Relic but
> couldn't find it
In order to show in New Relic, the agent has to wrap the command being run (like the commands listed in `Procfile`), eg:
newrelic-admin run-program ./manage.py analyze_failures
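For reference, every entry in the Procfile uses the same wrapping pattern so the work shows up in New Relic. A hypothetical example (the dyno names and commands are assumptions, not copied from the real Procfile):

web: newrelic-admin run-program gunicorn treeherder.config.wsgi
worker_default: newrelic-admin run-program celery -A treeherder worker

(Both lines above are illustrative only.)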
Flags: needinfo?(emorley)
Assignee
Comment 17•8 years ago
> In order to show in New Relic, the agent has to wrap the command being run
> (like the commands listed in `Procfile`), eg:
>
> newrelic-admin run-program ./manage.py analyze_failures
Thanks :emorley. So I would run that command in a heroku dyno? Actually, will it show up in New Relic when it is run automatically as the regularly scheduled celery task?
Flags: needinfo?(emorley)
Comment 18•8 years ago
Yes, eg:
$ thp run newrelic-admin run-program ./manage.py analyze_failures
(Using a helper alias like: `alias thp="HEROKU_APP=treeherder-prod heroku"`)
It will show in New Relic when run automatically, since the worker that handles the job also uses the wrapper (every command in the Procfile uses the wrapper for this reason).
Flags: needinfo?(emorley)
Comment 19•8 years ago
(In reply to Ed Morley [:emorley] from comment #18)
> Yes, eg:
>
> $ thp run newrelic-admin run-program ./manage.py analyze_failures
Hmm, I tried that and it didn't appear in New Relic. I think that because it's being run outside of the Celery task framework and/or a Django web request, the New Relic agent doesn't know how to categorise it.
A better way of ensuring the task appears is to schedule it as an async task (as happens for the celerybeat scheduled periodic tasks), using:
$ thp run ./manage.py shell
...
from treeherder.seta.tasks import seta_analyze_failures
seta_analyze_failures.apply_async(time_limit=15*60, soft_time_limit=10*60)
(The `./manage.py shell` command doesn't need to be run with the New Relic wrapper script itself, since it's only adding the task to the queue, not running it. The task will run shortly afterwards on the worker_default dyno, not the one-off dyno.)
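For completeness, a hypothetical sketch of how a periodic task like this is typically registered with celerybeat; the entry name, task name and interval below are assumptions rather than the actual Treeherder settings:

# Hypothetical CELERYBEAT_SCHEDULE entry -- illustrative only.
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    'seta-analyze-failures-daily': {      # assumed entry name
        'task': 'seta-analyze-failures',  # assumed registered task name
        'schedule': timedelta(days=1),    # assumed interval
        'options': {
            # execution options passed to apply_async, mirroring the limits above
            'time_limit': 15 * 60,
            'soft_time_limit': 10 * 60,
        },
    },
}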
Updated•8 years ago
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•4 years ago
Product: Tree Management → Tree Management Graveyard