Closed
Bug 1501682
Opened 6 years ago
Closed 6 years ago
Slow API response times due to higher number of requests to /jobs/ API than usual
Categories
(Tree Management :: Treeherder: Frontend, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aryx, Assigned: camd)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
Today treeherder often feels slow when loaded.
When I loaded 50 pushes (one in each tab), many tabs took dozens of seconds before they showed the favicon. Some XHR actions, such as removing bug annotations from the annotations tab, took so long that I considered them broken.
A few requests take very long, e.g. for https://treeherder.mozilla.org/api/project/autoland/jobs/?return_type=list&count=2000&push_id=394869&last_modified__gt=2018-10-24T14%3A30%3A43.289
Blocked: 1 ms
DNS resolution: 1 ms
Connecting: 0 ms
TLS setup: 0 ms
Sending: 0 ms
Waiting: 10738 ms
Receiving: 0 ms
Similar requests before and after are in the 200-500 ms range, with a few taking 4000+ ms.
Comment 1•6 years ago
Looking at New Relic, a fair bit of time is spent waiting for a free web dyno to serve the request ("request queueing"):
https://screenshots.firefox.com/lrUPiyKhrbsPSaeQ/rpm.newrelic.com
(https://rpm.newrelic.com/accounts/677903/applications/14179757?tw%5Bend%5D=1540394332&tw%5Bstart%5D=1540383532)
The number of web dynos we have has not changed for some time. This means that either:
* more people are using Treeherder, or are having multiple tabs open, or are having more pushes displayed at once
* the quantity or type of requests that the UI is now making has changed (eg if there was a bug with polling frequency or ...)
* the requests are somehow each taking longer to process, therefore hogging the workers for longer (eg DB slower to respond, API code regression, Heroku noisy neighbour, ...)
Looking at the web transactions view for the last 3 hours:
* 82% of the wall time is consumed by JobsViewSet.list (followed by PushViewSet.list at 12%)
* looking instead at request throughput, JobsViewSet.list is again the leader at 1.1k/min, followed by PushViewSet.list at 169/min
(https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1540394481&tw%5Bstart%5D=1540383681#id=&sort_by=throughput)
Looking at JobsViewSet.list, the approx time-spent breakdown is:
* Python code within JobsViewSet.list: 50%
* MySQL job select: 24%
* WSGI bits: 10%
* other DB: ~8%
Looking at New Relic Insights, a significant proportion (60+%) of API requests are coming from Windows User Agents, which is unexpected:
https://insights.newrelic.com/accounts/677903/explorer/events?eventType=Transaction&duration=43200000&facet=user_agent
Are the sheriffs using Windows machines?
I wonder as well, how many of these requests come from Treeherder tabs that have been left open and forgotten about - and that have hundreds of pushes displayed/continually polling? Perhaps we could:
* set a limit on the max number of pushes displayed at the same time (similar to bug 1107667)
* have polling automatically stop (or reduce in frequency) for older pushes that are now complete, or for tabs that have been open for days? (eg with a "Data refresh paused, refresh page to resume" message)
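A minimal sketch of the second suggestion, assuming hypothetical fields such as `push.isComplete` and `push.timestampMs` (Treeherder's actual model names may differ):

```javascript
// Hypothetical sketch (not Treeherder's actual code): decide how long to
// wait before the next poll for a given push, or whether to stop polling
// entirely. The field names and thresholds are assumptions for illustration.
const POLL_INTERVAL_MS = 60 * 1000;           // the normal once-a-minute poll
const MAX_PUSH_AGE_MS = 24 * 60 * 60 * 1000;  // treat pushes older than a day as stale

function nextPollDelay(push, now = Date.now()) {
  // A completed push can never gain new jobs, so stop polling it.
  if (push.isComplete) {
    return null;
  }
  // Old, still-running pushes poll at a much reduced frequency.
  if (now - push.timestampMs > MAX_PUSH_AGE_MS) {
    return POLL_INTERVAL_MS * 10;
  }
  return POLL_INTERVAL_MS;
}
```

A tab's poll loop could check the return value each cycle and show the "Data refresh paused" banner whenever it gets `null` back.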
Cameron, thoughts?
Flags: needinfo?(cdawson)
Comment 2•6 years ago
Thinking about the 1.1k/min request throughput, that's higher than I remember in the past, but still not a huge amount given the way we poll for pushes.
ie I believe we poll once a minute for every visible push, and each push is queried separately. As such, 1.1k/min means that roughly 1100 pushes are open across all Treeherder tabs - and if users are using the "view next 50 pushes" button, that's only about 22 tabs.
I suppose the real fix for this is just to not load each push's results separately.
Assignee
Comment 3•6 years ago
So, one change that would fall into Ed's analysis here is that we now let each push poll for its own jobs. We used to do one http request for ALL pushes on the page and get any new jobs, then split the jobs up to each push and distribute them accordingly. Now, each push does its own polling/http request. This would account for the increase in requests. This new approach simplified some of the code, and I had made the assumption that it wouldn't make any difference in performance. I was wrong. :) I'll work to remedy this right away.
Assignee
Comment 4•6 years ago
Heh, Ed and I were writing our comments at the same time. :) Agreed.
Flags: needinfo?(cdawson)
Assignee
Updated•6 years ago
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Comment 5•6 years ago
(In reply to Cameron Dawson [:camd] from comment #3)
> We used to do one http request for ALL pushes on the page
Ah I didn't realise that was how we used to do it :-)
Comment 6•6 years ago
Comment 7•6 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/d5b85be8968fea6d3dce6c7a50737c2e1e7934b1
Bug 1501682 - Poll for new jobs on all pushes at once
Instead of polling for new jobs on each push, this will
poll for new jobs on all loaded pushes in one http call,
then distribute them to each Push.
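The consolidated approach the commit describes might look roughly like this. This is a hedged sketch, not Treeherder's actual code: the `push_id__in` query filter and the helper names are assumptions modelled on the URL shape in the original report.

```javascript
// Bucket a flat list of job objects by their push_id, so each Push
// component can be handed only its own new jobs.
function groupJobsByPush(jobs) {
  const byPush = new Map();
  for (const job of jobs) {
    if (!byPush.has(job.push_id)) {
      byPush.set(job.push_id, []);
    }
    byPush.get(job.push_id).push(job);
  }
  return byPush;
}

// One HTTP call for all visible pushes, instead of one call per push.
// `push_id__in` is an assumed filter name; `lastPolled` is a Date marking
// the previous poll, matching the last_modified__gt param in the report.
async function pollAllPushes(repoName, pushes, lastPolled) {
  const url =
    `/api/project/${repoName}/jobs/?return_type=list` +
    `&push_id__in=${pushes.map(p => p.id).join(',')}` +
    `&last_modified__gt=${lastPolled.toISOString()}`;
  const resp = await fetch(url);
  const { results } = await resp.json();
  return groupJobsByPush(results);
}
```

With 50 visible pushes, this replaces 50 polls per minute per tab with a single request, which is consistent with the throughput drop reported in comment 9.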
Assignee
Comment 8•6 years ago
This has been deployed to production. Please reopen or file a new bug if you notice more issues in this area.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
Blocks: 1450040
Component: Treeherder → Treeherder: Frontend
Priority: -- → P1
Summary: slow loading of both the treeherder page and some XHRs → Slow API response times due to higher number of requests to /jobs/ API than usual
Comment 9•6 years ago
This looks to be greatly improved after the change here (see screenshot).
Previously there were between 900 and 1600 requests/min to JobsViewSet.list, now there are only 350-500.
I've scaled the prod Heroku web dynos back down from 6 to 4.
Comment 10•6 years ago
This can actually be seen on the deployments overview page (which shows the total requests throughput for the whole app). That deployment is:
https://rpm.newrelic.com/accounts/677903/applications/14179757/deployments/18001385
That deployment occurred on 16th Oct - the day bug 1499551 was filed - which means the fix here should hopefully help with that too.
Updated•6 years ago
Attachment #9020370 - Attachment filename: Screenshot_2018-10-26 Deployments - treeherder-prod - New Relic.png → Deployments overview
Updated•6 years ago
Attachment #9020370 - Attachment description: Screenshot_2018-10-26 Deployments - treeherder-prod - New Relic.png → Deployments overview
Attachment #9020370 - Attachment filename: Deployments overview → Deployments-overview.png