Closed Bug 1507397 Opened 6 years ago Closed 6 years ago

/performance/data/ endpoint is frequently timing out

Categories

(Tree Management :: Perfherder, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: KWierso, Assigned: igoldan)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Attached image Perfherder requests (deleted) —
Example New Relic slow transaction trace: https://rpm.newrelic.com/accounts/677903/applications/14179757/transactions?tw%5Bend%5D=1542281239&tw%5Bstart%5D=1542194839#id=5b225765625472616e73616374696f6e2f46756e6374696f6e2f747265656865726465722e7765626170702e6170692e706572666f726d616e63655f646174613a506572666f726d616e6365446174756d566965775365742e6c697374222c22225d&app_trace_id=c8173206-e843-11e8-aade-0242ac11000b_0_3311 (note: it's not the same request as in comment 0, so might differ slightly)

80% of the profile was spent in `MySQL performance_datum select`. The DB queries tab for that trace reports:

SELECT `performance_datum`.`id`, `performance_datum`.`signature_id`, `performance_signature`.`signature_hash`,
       `performance_datum`.`job_id`, `performance_datum`.`push_id`, `performance_datum`.`push_timestamp`,
       `performance_datum`.`value`
FROM `performance_datum`
INNER JOIN `performance_signature` ON (`performance_datum`.`signature_id` = `performance_signature`.`id`)
WHERE (`performance_datum`.`repository_id` = %s
       AND `performance_datum`.`signature_id` IN (%s, %s, ...)
       AND `performance_signature`.`framework_id` IN (%s)
       AND `performance_datum`.`push_timestamp` > %s)
ORDER BY `performance_datum`.`push_timestamp` ASC
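For reference, one way to replay and time a single request against the /performance/data/ endpoint from the browser console is sketched below. This is illustrative only: the project name and query parameter names (signature_id, framework, interval) are assumptions, not values taken from this trace.

// Rough sketch: replay one chunked /performance/data/ request and time it.
// The path and parameter names here are illustrative guesses, not copied from the trace.
async function timePerfDataRequest(signatureIds) {
  const params = new URLSearchParams({
    framework: '1',               // e.g. Talos (assumed id)
    interval: String(86400 * 2),  // seconds of history requested (assumed parameter)
  });
  signatureIds.forEach(id => params.append('signature_id', id));

  const url = `/api/project/mozilla-central/performance/data/?${params}`;
  const start = performance.now();
  const response = await fetch(url);
  const elapsed = Math.round(performance.now() - start);
  console.log(`${url} -> HTTP ${response.status} in ${elapsed} ms`);
  return response.ok ? response.json() : null;
}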
Blocks: 1504990
Flags: needinfo?(igoldan)
Priority: -- → P1
Summary: performance/data endpoint is frequently timing out → /performance/data/ endpoint is frequently timing out
The EXPLAIN section says:

performance_signature
* The table was retrieved with this index: PRIMARY
* A temporary table was created to access this part of the query, which can cause poor performance. This typically happens if the query contains GROUP BY and ORDER BY clauses that list columns differently.
* MySQL had to do an extra pass to retrieve the rows in sorted order, which is a cause of poor performance but sometimes unavoidable.
* You can speed up this query by querying only fields that are within the index. Or you can create an index that includes every field in your query, including the primary key.
* Approximately 100 rows of this table were scanned.

performance_datum
* The table was retrieved with this index: performance_datum_repository_id_795c16d99ae557e2_idx
* You can speed up this query by querying only fields that are within the index. Or you can create an index that includes every field in your query, including the primary key.
* Approximately 403 rows of this table were scanned.
Will look into this.
Flags: needinfo?(igoldan)
Assignee: nobody → igoldan
It would maybe be worth also adding some error handling for the cases where the requests fail: unless you're looking at the network monitor while the page loads, there's no indication that the page isn't still attempting to load the data, even though the requests timed out at 20 seconds.
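A rough sketch of what such error handling could look like on the frontend (illustrative only; showError is a hypothetical UI hook, not an existing Treeherder function):

// Illustrative only: surface failed /performance/data/ requests in the UI instead of
// leaving the compare view looking like it is still loading.
async function fetchWithErrorReporting(url, showError) {
  try {
    const response = await fetch(url);
    if (!response.ok) {
      showError(`Request failed with HTTP ${response.status}: ${url}`);
      return null;
    }
    return await response.json();
  } catch (err) {
    // Network-level failures (including timed-out or aborted requests) end up here.
    showError(`Request error for ${url}: ${err.message}`);
    return null;
  }
}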
The row count of this table is around 500 million, so I understand the high computation time. The performance_datum table's index already covers all the fields accessed in comment 1, so I don't think there is anything more to index.

From what I could see in the devconsole, we're spawning a bulk of 10 requests, each with its own parameters, but all leveraging the same SQL query. :wlach, basically we're chunking a single request into multiple smaller ones, or am I wrong? I have to go through the FE code to properly confirm.

If the chunking assumption is correct, then I see 2 (not necessarily distinct) solutions:
1. get rid of the ORDER BY on the server side and do it on the FE
2. between each request from that big bulk, wait until the response comes back from the server, and only then make another one

The 2nd approach seems more reasonable, as it will not flood the database with too many expensive queries at once. We could of course make 2 or even 3 requests like these at a time, but not more.
Flags: needinfo?(wlachance)
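A minimal sketch of the second approach described above (illustrative only; fetchChunk is a hypothetical helper that performs one chunked request, not the actual frontend code):

// Illustrative only: limit how many /performance/data/ requests are in flight at once,
// instead of firing the whole bulk simultaneously.
async function fetchAllChunks(chunks, fetchChunk, maxConcurrent = 2) {
  const results = new Array(chunks.length);
  let next = 0;

  async function worker() {
    while (next < chunks.length) {
      const index = next++;
      results[index] = await fetchChunk(chunks[index]);
    }
  }

  // Start a small fixed number of workers; each picks up the next chunk when it finishes.
  await Promise.all(Array.from({ length: maxConcurrent }, worker));
  return results;
}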
Yeah, I think Ionut has it right. It looks like `performance_datum_repository_id_795c16d99ae557e2_idx` is the one that covers repository, signature, and push timestamp: https://sql.telemetry.mozilla.org/queries/60141/source

It doesn't sound like we're doing anything all that crazy in the query. We're just asking for a *lot* of data, and making many concurrent queries at the same time.

There are probably some things we could do at the margins here -- for example, design an alternate version of the endpoint which doesn't need to join on the performance_signature table, and then get the compare view to use it. I have a half-finished patch to do this which I can share with people if they're interested, but I would not expect an order-of-magnitude improvement. Unfortunately I don't think shifting the ORDER BY to the client will help much (if at all), and the problem with throttling requests is that the compare view is already rather slow -- the last thing we want to do is make it even slower and more frustrating to use.

I think the right solution is to move much of this logic to the server so we can avoid making so many concurrent queries (bug 1505796). This will probably be slower than the current method, but should be *much* more efficient.

I feel like the best short-term solution might just be to throw more machine cycles at this problem. How expensive would it be to upgrade the treeherder db instance on production to something a little beefier?
Flags: needinfo?(wlachance)
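For clarity, a sketch of what "shifting the ORDER BY to the client" (the first option above) would mean, keeping in mind the doubt that it would help much; the row shape is an assumption:

// Illustrative only: drop ORDER BY on the server and sort the merged rows on the FE.
// `rows` is assumed to be the combined datum payload from all chunked requests,
// each row carrying a push_timestamp field.
function sortByPushTimestamp(rows) {
  // Sort ascending by push_timestamp, matching the current server-side ORDER BY.
  return rows.slice().sort(
    (a, b) => new Date(a.push_timestamp) - new Date(b.push_timestamp),
  );
}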
We're currently using an RDS db.m4.2xlarge on prod, which has 8 vCPU and 32GB RAM. The next size up is double on both specs, and also double the cost (so ~$700-1000/month extra, depending on reserved vs on-demand).

Other choices would be switching to:
* the new m5 range (db.m5.2xlarge)
* the memory optimised range (db.r4.*), iff it turns out we're memory constrained
* RDS Aurora

Though all of those would be premature without first monitoring resource usage on the DB instance and seeing if that is actually the issue. At least in the past (haven't checked for a few months) our prod DB remained under 30% CPU on average - so this could easily be due to a temporary spike.

As a next step I would recommend running the exact queries that were slow against prod and seeing if they still take that long now (my bet would be that they don't).
(In reply to Ed Morley [:emorley] from comment #7)
> As a next step I would recommend running the exact queries that were slow
> against prod and seeing if they still take that long now (my bet would be
> that they don't).

Will do this. Already, you may be betting right, as the :KWierso request isn't erroring out anymore. It just takes 10 seconds to complete.
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #8)
> (In reply to Ed Morley [:emorley] from comment #7)
> > As a next step I would recommend running the exact queries that were slow
> > against prod and seeing if they still take that long now (my bet would be
> > that they don't).
>
> Will do this. Already, you may be betting right, as the :KWierso request
> isn't erroring out anymore. It just takes 10 seconds to complete.

Oh, sorry. I looked at the wrong tab.

Testing :KWierso's URL locally, using |yarn start|'s environment, the FE only throws a bulk of 3 requests, which are successful. Each request has a max of 150 signature ids.

Testing that URL directly on production, the FE throws a bulk of 10 requests; it still causes those 503 errors. Each request here has a max of 40 signature ids.

I guess we have a setting for that bulk's size somewhere. Could we just tweak that?
Still, something to keep in mind: the reported URL approaches a worst-case scenario; it produced performance_datum rows for 7 platforms X all Talos tests (many!) X 20 runs of each test. Tweaking the bulk size as I suggested can address some cases that are a little worse than today's typical ones. But in the long run, we will be adding more and more perf tests. It's next year's target for the Performance team: monitor everything we can.
(In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #9)
> (In reply to Ionuț Goldan [:igoldan], Performance Sheriffing from comment #8)
> > (In reply to Ed Morley [:emorley] from comment #7)
> > > As a next step I would recommend running the exact queries that were slow
> > > against prod and seeing if they still take that long now (my bet would be
> > > that they don't).
> >
> > Will do this. Already, you may be betting right, as the :KWierso request
> > isn't erroring out anymore. It just takes 10 seconds to complete.
>
> Oh, sorry. I looked at the wrong tab.
>
> Testing :KWierso's URL locally, using |yarn start|'s environment, the FE
> only throws a bulk of 3 requests, which are successful. Each request has a
> max of 150 signature ids.
>
> Testing that URL directly on production, the FE throws a bulk of 10
> requests; it still causes those 503 errors. Each request here has a max of
> 40 signature ids.
>
> I guess we have a setting for that bulk's size somewhere. Could we just
> tweak that?

I think it's set here: https://github.com/mozilla/treeherder/blob/master/ui/perfherder/helpers.js#L315

I don't remember it being so high (and apparently it isn't in production), maybe it got upped recently as part of the recent refactoring work? Perhaps all we need to do is deploy the current master branch to production and we'll magically improve things? Would be nice if this problem had such an easy solution...
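For reference, a hedged sketch of the chunking being discussed; this is not the actual helpers.js code, and the batch-size constant is illustrative:

// Illustrative sketch only, not the real helpers.js implementation.
// A large list of signature ids is split into batches, and one /performance/data/
// request is made per batch; the batch size controls how many requests hit the backend.
const SIGNATURE_IDS_PER_REQUEST = 150; // hypothetical; the real value lives in helpers.js

function chunkSignatureIds(signatureIds, size = SIGNATURE_IDS_PER_REQUEST) {
  const chunks = [];
  for (let i = 0; i < signatureIds.length; i += size) {
    chunks.push(signatureIds.slice(i, i + size));
  }
  return chunks;
}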
(In reply to William Lachance (:wlach) (use needinfo!) from comment #11)
> I don't remember it being so high (and apparently it isn't in production),
> maybe it got upped recently as part of the recent refactoring work? Perhaps
> all we need to do is deploy the current master branch to production and
> we'll magically improve things? Would be nice if this problem had such an
> easy solution...

:emorley, could we just deploy the current master? If we're not happy with it, we'll just reopen this bug and tweak that chunk size setting.
Flags: needinfo?(emorley)
I have no stake in the implementation approach chosen here, so deploying `master` to production seems orthogonal to this bug?

Regarding deploying prod in general: we had a few unrelated log viewer regressions on `master` that should mostly be resolved, so we were hoping to do a production deploy soon.
Flags: needinfo?(emorley)
Previously this link [1] was breaking all the time. Now it seems to work, at least most of the time. Out of the 20 requests I've made, 3 still threw 503 errors for the last sub-request.

[1] https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=0334e963e8ea8800f12edb3b3b4cfc77acce842d&newProject=try&newRevision=f01fe1e5d3811917e5fe57765c03a2a33090b8d0&framework=1
I've tested this again and I no longer experience any errors. The response times are also much better, by a factor of 4 to 5. I guess this is due to low load on our database.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME