Closed Bug 926640 Opened 11 years ago Closed 10 years ago

[tracking] Improve API performance

Categories

(Marketplace Graveyard :: API, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: chuck, Unassigned)

References

Details

(Keywords: perf)

Attachments

(3 files, 2 obsolete files)

In the weekly performance report, New Relic indicated some potential performance problems in Marketplace production. Despite no notable differences in traffic, the Apdex score dipped to its 12-week low, largely on the back of a 4.4% "Frustrated" rate, several times higher than it has been in recent weeks.

Full SLA report (login required): https://rpm.newrelic.com/accounts/315282/applications/2914756/optimize/sla_report

We should investigate possible causes of this spike. GA and Nagios might be good places to start. Screenshot of email attached.

For reference, Apdex calculation methodology: https://docs.newrelic.com/docs/site/apdex-measuring-user-satisfaction
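For context, the standard Apdex score counts satisfied requests (response time <= T) fully, tolerating requests (<= 4T) at half weight, and frustrated requests (> 4T) not at all. A minimal sketch in Python; the sample counts in the example are made up to show how a 4.4% frustrated rate pulls the score down, and are not taken from the report above.

    def apdex(satisfied, tolerating, frustrated):
        """Apdex = (satisfied + tolerating / 2) / total samples."""
        total = satisfied + tolerating + frustrated
        return (satisfied + tolerating / 2.0) / total if total else 1.0

    # Hypothetical sample: 4.4% frustrated, 6% tolerating, rest satisfied.
    print(apdex(satisfied=896, tolerating=60, frustrated=44))  # ~0.926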
I'd bet that it has to do with the filtering stuff. I'm not really sure what we can do. Are there statsd graphs that correspond to this?
(In reply to Matt Basta [:basta] from comment #1)
> I'd bet that it has to do with the filtering stuff. I'm not really sure what
> we can do. Are there statsd graphs that correspond to this?

If by filtering you mean search and buckets, I see no indication of that in the graphite charts: http://dashboard.mktadm.ops.services.phx1.mozilla.com/graphite?site=marketplace&graph=search
I mean collection filtering in the search/featured/ endpoint(s).
It might also be the case that it's the rocketfuel endpoints and a significant portion of our traffic is just curators/admins :-/
I've spent some time digging into the numbers this morning, and it's pretty safe to say that the new filtering features are at least partially responsible for these problems. The view has been consuming almost 70% of our wall clock time (a high number is expected, as it is one of the most-accessed views in Marketplace), and it has the highest average response time of any of our public-facing views by an order of magnitude (see attachment in comment 5).

The complexity of this view will make optimization difficult: we serve different responses based on a large number of variables: carrier, region, category, device capabilities, etc. I'd like to improve those numbers, as it is a very high-traffic endpoint that will contribute to first impressions of an important part of the platform.

What are some strategies we can take to improve our response time? One potential tack: gather metrics on each of those variables, then precalculate and cache responses for the most common configurations for a short period of time.
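A minimal sketch of the "precalculate and cache per configuration" idea, assuming Django's cache framework (the Marketplace backend is Django); the key layout, TTL, and build_featured_response() helper are hypothetical, not existing code.

    import hashlib

    from django.core.cache import cache

    FEATURED_TTL = 60 * 5  # cache each configuration for a short period


    def featured_cache_key(carrier, region, category, device):
        # One key per combination of the variables mentioned above.
        raw = 'featured:%s:%s:%s:%s' % (carrier, region, category, device)
        return 'api:featured:%s' % hashlib.md5(raw.encode('utf-8')).hexdigest()


    def cached_featured_response(carrier, region, category, device):
        key = featured_cache_key(carrier, region, category, device)
        data = cache.get(key)
        if data is None:
            # Expensive path: run the full filtering logic for this
            # combination and cache the serialized result.
            data = build_featured_response(carrier, region, category, device)
            cache.set(key, data, FEATURED_TTL)
        return data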
We could try caching collections/FACs/OSCs independently (not sure if we do this already). Curated collections can't be created per-carrier, so anyone in a particular region will see them. OSCs can only be created for the homepage, so there might be an optimization there.
Some additional data: a breakdown of where time is spent in various segments of the response. It appears that the majority of the time is spent outside of database transactions, so caching at that level probably wouldn't be sufficient.
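One way to feed this kind of per-segment breakdown into the graphite dashboards mentioned earlier is to wrap each segment of the view in a statsd timer. A rough sketch assuming the Python statsd client; the metric prefix, segment names, and helper functions are hypothetical.

    from statsd import StatsClient

    statsd = StatsClient('localhost', 8125, prefix='marketplace.api.featured')


    def timed_featured_view(request):
        with statsd.timer('parse_params'):
            params = parse_request_params(request)        # hypothetical helper

        with statsd.timer('db_queries'):
            collections = fetch_collections(params)       # hypothetical helper

        with statsd.timer('serialize'):
            payload = serialize_collections(collections)  # hypothetical helper

        return payload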
The 3 types of collections (basic/fac/os) in search/featured are queried, and therefore cached, separately by cache-machine. However, note that if the fallback kicks in, each query made by the fallback is also cached separately, so we may want to add a layer of caching on top of it all (or maybe not).

We should probably gather more metrics to find out what exactly is slow, since, as chuck pointed out, we do quite a lot in that view because of all the variables.

One thing to note is that we didn't create any specific database indexes for the collections. That might be worth looking into; even though we have a very limited number of collections right now, and they are quite small, it might help a little.
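If indexes do turn out to help, a sketch of what they could look like on a hypothetical collection membership model, using era-appropriate Django options (index_together; newer Django would use Meta.indexes). Field and model names are placeholders, not the actual Marketplace schema.

    from django.db import models


    class FeaturedCollectionMembership(models.Model):
        # Placeholder model, not the real schema.
        collection = models.ForeignKey('Collection', on_delete=models.CASCADE)
        region = models.PositiveIntegerField(db_index=True)
        carrier = models.PositiveIntegerField(null=True, db_index=True)
        category = models.ForeignKey('Category', null=True,
                                     on_delete=models.SET_NULL)

        class Meta:
            # Composite index matching the common lookup pattern:
            # filter by region and carrier together.
            index_together = [('region', 'carrier')]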
Component: General → API
Turning this into a tracking bug for our API performance issues. We are going to open new bugs and make them block this one.
Depends on: 927420
Summary: New Relic indicates performance problems week of 10/3 → [tracking] API performance problems
Priority: -- → P2
Depends on: 881926
Depends on: 946061
To follow: attachments containing updated data for the 7-day period ending right now, after the DRF conversion has finished.
Attachment #817205 - Attachment is obsolete: true
This is, by far, the highest-impact view: 54.7% of processor time is spent processing these requests.
Attachment #817225 - Attachment is obsolete: true
Depends on: 956860
Depends on: 956987
Depends on: 958608
Depends on: 912097
Summary: [tracking] API performance problems → [tracking] Improve API performance
Depends on: 961719
Whiteboard: [perf]
Depends on: 963730
Depends on: 973735
Keywords: perf
Whiteboard: [perf]
Removing priority now that it's a tracking bug.
Priority: P2 → --
No longer blocks: tarako-marketplace
Blocks: 992365
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED