Closed
Bug 1501682
Opened 6 years ago
Closed 6 years ago
Slow API response times due to higher number of requests to /jobs/ API than usual
Categories
(Tree Management :: Treeherder: Frontend, defect, P1)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: aryx, Assigned: camd)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
Today treeherder often feels slow when loaded.
When I loaded 50 pushes (one in each tab), many tabs took dozens of seconds before they showed the favicon. Some XHR actions, such as removing bug annotations from the annotations tab, took so long that I considered them broken.
A few requests take very long, e.g. for https://treeherder.mozilla.org/api/project/autoland/jobs/?return_type=list&count=2000&push_id=394869&last_modified__gt=2018-10-24T14%3A30%3A43.289
Blocked: 1 ms
DNS resolution: 1 ms
Connecting: 0 ms
TLS setup: 0 ms
Sending: 0 ms
Waiting: 10738 ms
Receiving: 0 ms
Similar requests before and after are in the 200-500 ms range, with a few taking 4000+ ms.
Comment 1•6 years ago
Looking at New Relic, a fair bit of time is spent waiting for a free web dyno to serve the request ("request queueing"):
https://screenshots.firefox.com/lrUPiyKhrbsPSaeQ/rpm.newrelic.com
(https://rpm.newrelic.com/accounts/677903/applications/14179757?tw%5Bend%5D=1540394332&tw%5Bstart%5D=1540383532)
The number of web dynos we have has not changed for some time. This means that either:
* more people are using Treeherder, or are having multiple tabs open, or are having more pushes displayed at once
* the quantity or type of requests that the UI is now making has changed (eg if there was a bug with polling frequency or ...)
* the requests are somehow each taking longer to process, therefore hogging the workers for longer (eg DB slower to respond, API code regression, Heroku noisy neighbour, ...)
Looking at the web transactions view for the last 3 hours:
* 82% of the wall time is consumed by JobsViewSet.list (followed by PushViewSet.list at 12%)
* looking instead at request throughput, JobsViewSet.list is again the leader at 1.1k/min, followed by PushViewSet.list at 169/min
(https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1540394481&tw%5Bstart%5D=1540383681#id=&sort_by=throughput)
Looking at JobsViewSet.list, the approx time-spent breakdown is:
* Python code within JobsViewSet.list: 50%
* MySQL job select: 24%
* WSGI bits: 10%
* other DB: ~8%
Looking at New Relic Insights, a significant proportion (60+%) of API requests are coming from Windows User Agents, which is unexpected:
https://insights.newrelic.com/accounts/677903/explorer/events?eventType=Transaction&duration=43200000&facet=user_agent
Are the sheriffs using Windows machines?
I wonder as well, how many of these requests come from Treeherder tabs that have been left open and forgotten about - and that have hundreds of pushes displayed/continually polling? Perhaps we could:
* set a limit on the max number of pushes displayed at the same time (similar to bug 1107667)
* have polling automatically stop (or reduce in frequency) for older pushes that are now complete, or for tabs that have been open for days? (eg with a "Data refresh paused, refresh page to resume" message)
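A minimal sketch of the second suggestion, assuming hypothetical fields such as `push.isComplete` and `push.timestampMs` (Treeherder's actual model names may differ):

```javascript
// Hypothetical sketch (not Treeherder's actual code): decide how long to
// wait before the next poll for a given push, or whether to stop polling
// entirely. The field names and thresholds are assumptions for illustration.
const POLL_INTERVAL_MS = 60 * 1000;           // the normal once-a-minute poll
const MAX_PUSH_AGE_MS = 24 * 60 * 60 * 1000;  // treat pushes older than a day as stale

function nextPollDelay(push, now = Date.now()) {
  // A completed push can never gain new jobs, so stop polling it.
  if (push.isComplete) {
    return null;
  }
  // Old, still-running pushes poll at a much reduced frequency.
  if (now - push.timestampMs > MAX_PUSH_AGE_MS) {
    return POLL_INTERVAL_MS * 10;
  }
  return POLL_INTERVAL_MS;
}
```

A tab's poll loop could check the return value each cycle and show the "Data refresh paused" banner whenever it gets `null` back.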
Cameron, thoughts?
Flags: needinfo?(cdawson)
Comment 2•6 years ago
Thinking about the 1.1k/min request throughput, that's higher than I remember in the past, but still not a huge amount given the way we poll for pushes.
ie I believe we poll once a minute for every visible push, and each push is queried separately. As such, 1.1k/min means that roughly 1100 pushes are open across all Treeherder tabs - and if users are using the "view next 50 pushes" button, that's only about 22 tabs.
I suppose the real fix for this is just to not load each push's results separately.
Assignee
Comment 3•6 years ago
So, one change that would fall into Ed's analysis here is that we now let each push poll for its own jobs. We used to do one http request for ALL pushes on the page and get any new jobs, then split the jobs up to each push and distribute them accordingly. Now, each push does its own polling/http request. This would account for the increase in requests. This new approach simplified some of the code, and I had made the assumption that it wouldn't make any difference in performance. I was wrong. :) I'll work to remedy this right away.
Assignee
Comment 4•6 years ago
Heh, Ed and I were writing our comments at the same time. :) Agreed.
Flags: needinfo?(cdawson)
Assignee
Updated•6 years ago
Assignee: nobody → cdawson
Status: NEW → ASSIGNED
Comment 5•6 years ago
(In reply to Cameron Dawson [:camd] from comment #3)
> We used to do one http request for ALL pushes on the page
Ah I didn't realise that was how we used to do it :-)
Comment 6•6 years ago
Comment 7•6 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/d5b85be8968fea6d3dce6c7a50737c2e1e7934b1
Bug 1501682 - Poll for new jobs on all pushes at once
Instead of polling for new jobs on each push, this will
poll for new jobs on all loaded pushes in one http call,
then distribute them to each Push.
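The consolidated approach the commit describes might look roughly like this. This is a hedged sketch, not Treeherder's actual code: the `push_id__in` query filter and the helper names are assumptions modelled on the URL shape in the original report.

```javascript
// Bucket a flat list of job objects by their push_id, so each Push
// component can be handed only its own new jobs.
function groupJobsByPush(jobs) {
  const byPush = new Map();
  for (const job of jobs) {
    if (!byPush.has(job.push_id)) {
      byPush.set(job.push_id, []);
    }
    byPush.get(job.push_id).push(job);
  }
  return byPush;
}

// One HTTP call for all visible pushes, instead of one call per push.
// `push_id__in` is an assumed filter name; `lastPolled` is a Date marking
// the previous poll, matching the last_modified__gt param in the report.
async function pollAllPushes(repoName, pushes, lastPolled) {
  const url =
    `/api/project/${repoName}/jobs/?return_type=list` +
    `&push_id__in=${pushes.map(p => p.id).join(',')}` +
    `&last_modified__gt=${lastPolled.toISOString()}`;
  const resp = await fetch(url);
  const { results } = await resp.json();
  return groupJobsByPush(results);
}
```

With 50 visible pushes, this replaces 50 polls per minute per tab with a single request, which is consistent with the throughput drop reported in comment 9.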
Assignee
Comment 8•6 years ago
This has been deployed to production. Please reopen or file a new bug if you notice more issues in this area.
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
Blocks: 1450040
Component: Treeherder → Treeherder: Frontend
Priority: -- → P1
Summary: slow loading of both the treeherder page and some XHRs → Slow API response times due to higher number of requests to /jobs/ API than usual
Comment 9•6 years ago
This looks to be greatly improved after the change here (see screenshot).
Previously there were between 900 and 1600 requests/min to JobsViewSet.list, now there are only 350-500.
I've scaled the prod Heroku web dynos back down from 6 to 4.
Comment 10•6 years ago
This can actually be seen on the deployments overview page (which shows the total requests throughput for the whole app). That deployment is:
https://rpm.newrelic.com/accounts/677903/applications/14179757/deployments/18001385
That deployment occurred on 16th Oct - the day bug 1499551 was filed - which means the fix here should hopefully help with that too.
Updated•6 years ago
Attachment #9020370 - Attachment filename: Screenshot_2018-10-26 Deployments - treeherder-prod - New Relic.png → Deployments overview
Updated•6 years ago
Attachment #9020370 - Attachment description: Screenshot_2018-10-26 Deployments - treeherder-prod - New Relic.png → Deployments overview
Attachment #9020370 - Attachment filename: Deployments overview → Deployments-overview.png