Bug 1151806 (Closed) · Opened 10 years ago · Closed 9 years ago

Add chunking to treeherder-client / ETL to keep POSTs under the 30s Heroku request time limit

Categories

(Tree Management :: Treeherder: Infrastructure, defect, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: camd)

References

Details

Attachments

(1 file, 1 obsolete file)

(deleted), text/x-github-pull-request; mdoglio: review+
Breaking this out of the overall Heroku bug (bug 1145606), since it is camd's Treeherder deliverable for this quarter: https://docs.google.com/a/mozilla.com/document/d/1U3VXk7K5iTmZvqqtX4sc-Znhch9ZAD98Vl1OyWRqZwo/edit

Heroku enforces a 30 second cutoff for requests to web nodes (https://devcenter.heroku.com/articles/request-timeout). Our current ETL process submits data to the publicly accessible API on the web nodes, quite often in big chunks due to builds-4hr etc. We'll likely hit the 30s limit unless we do one of:

1) Chunk the submissions to the API.
2) Make the ETL layer bypass the web-accessible API (eg make the model/DB updates internally).
3) Be more intelligent about the amount of busywork we repeat (eg switch to builds-2hr, or use memcached to keep track of ingested jobs, so we don't continually re-insert the builds-4hr jobs list when only a small percentage of it is new each time).
No longer depends on: 1151803
Use memcache to keep track of which jobs we've already ingested from pending.js, running.js or build4hr.js, so we can check that prior to ingestion rather than always relying on the failover of ON DUPLICATE KEY. This will reduce DB traffic and speed ingestion up. One memcache key per repo; add the job_guid to the list only on successful ingestion.
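The dedup idea above can be sketched roughly as follows. This is a hypothetical illustration, not Treeherder's actual code: `SimpleCache` stands in for a memcached client, and the key format and helper names are made up.

```python
# A minimal sketch of the memcache-based dedup idea. SimpleCache, the
# key format, and the helper names are illustrative assumptions, not
# Treeherder's actual implementation.

class SimpleCache:
    """In-memory stand-in for a memcached client (get/set only)."""
    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def filter_new_jobs(cache, repo, jobs):
    """Return only jobs whose job_guid hasn't been ingested for this repo."""
    seen = cache.get("ingested-job-guids:%s" % repo) or set()
    return [job for job in jobs if job["job_guid"] not in seen]


def record_ingested(cache, repo, jobs):
    """Record job_guids only after successful ingestion (one key per repo)."""
    key = "ingested-job-guids:%s" % repo
    seen = cache.get(key) or set()
    seen.update(job["job_guid"] for job in jobs)
    cache.set(key, seen)
```

Checking the set before ingesting means ON DUPLICATE KEY becomes a safety net rather than the normal path, which is where the DB traffic saving comes from.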
Chunking can be added as a param in the th_client. We can specify the chunk size in our settings file, potentially with separate chunk sizes for pending, running, builds-4hr, and even resultsets. So the code change will be primarily in th_client, but then OAuthLoaderMixin will need to pass the param from the settings. Mauro also mentioned changing our timeout to match the Heroku limit, so that when we deploy to the existing staging env we'll know we're good.
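The chunking described above amounts to splitting one large submission into several smaller POSTs. A minimal sketch, assuming a hypothetical `submit` callable in place of the client's real POST machinery:

```python
def chunked(items, chunk_size):
    """Yield successive slices of at most chunk_size items."""
    for i in range(0, len(items), chunk_size):
        yield items[i:i + chunk_size]


def submit_in_chunks(jobs, chunk_size, submit):
    """Issue one POST (here, one `submit` call) per chunk, so no single
    request has to process the whole batch inside Heroku's 30s window.
    `submit` is a stand-in for the client's actual POST method."""
    for chunk in chunked(jobs, chunk_size):
        submit(chunk)
```

With `chunk_size` read from settings, the caller (e.g. OAuthLoaderMixin) only needs to thread the value through to the client.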
Above are just some notes from chatting with mdoglio about this task. Sorry they're a bit choppy. :)
Depends on: 1096878
Status: NEW → ASSIGNED
Attached file Ingestion Chunking PR (obsolete) (deleted) —
Attachment #8606000 - Flags: review?(mdoglio)
Attachment #8606000 - Attachment description: PR → Ingestion Chunking PR
When this lands, we'll want to double-check that it wasn't the cause of the memory usage spikes seen in bug 1164888 comment 2, which were due to either that bug or this one. Perhaps Mauro's idea of using a generator (https://github.com/mozilla/treeherder/pull/533#discussion_r30598047) will help avoid this? :-)
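The generator idea referenced above would avoid materializing every chunk (or a copy of the whole batch) in memory at once. A sketch of what that might look like, using `itertools.islice` to consume the source lazily; the function name is illustrative, not the actual PR code:

```python
from itertools import islice


def iter_chunks(iterable, chunk_size):
    """Lazily yield lists of up to chunk_size items. Only one chunk is
    held in memory at a time, so the full job list never needs to be
    materialized or copied up front."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk
```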
Attachment #8606000 - Flags: review?(mdoglio) → review+
Depends on: 1167091
Something that occurred to me: the "too many requests" errors are presumably us hitting the API rate limits we put in place for things like Taskcluster (though it does seem right that we hit them too; it makes sense for us to do so). To decide what value to set the chunk size to, I think it would help to know what a typical batch size would be if we weren't using chunking at all (likely for builds-4hr, since that's the worst-case file). ie: if previously we'd been submitting up to 10,000 jobs at once, then perhaps the 150-job chunk size we have after the followup https://github.com/mozilla/treeherder/commit/e9d127f7eee4a3dbffee6b421e293adcf46fcc52 is still a case of "one extreme to the other"? (So we could, say, set it to 500 or 1000 jobs and avoid the timeouts on Heroku, while also not increasing the number of requests ten-fold.)
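The trade-off in the comment above is easy to quantify: request count is just the batch size divided by the chunk size, rounded up. A tiny sketch (the 10,000-job figure is the hypothetical worst case from the comment, not a measured number):

```python
import math


def request_count(total_jobs, chunk_size):
    """Number of POSTs needed to submit total_jobs in chunk_size batches."""
    return math.ceil(total_jobs / chunk_size)


# For a hypothetical 10,000-job builds-4hr batch:
#   chunk_size 150  -> 67 requests
#   chunk_size 500  -> 20 requests
#   chunk_size 1000 -> 10 requests
```

So moving from 150 to 500 or 1000 cuts the request count by 3-7x while each chunk stays small enough to finish well inside the 30s limit.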
Summary: Get ETL layer working with Heroku → Add chunking to treeherder-client / ETL to keep POSTs under the 30s Heroku request time limit
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/942e314361ed3fbe90e58780f7bc983cbd9edd4a
Revert "Bug 1151806 - Implement chunking for job ingestion"

This reverts commit e71e78156555baada7f6d60d291188387ce96b12. That commit caused pending and running jobs to be put into the objectstore, which in turn caused their completed versions not to be ingested.
Attached file fixed PR after backout (deleted) —
Attachment #8609401 - Flags: review?(mdoglio)
Attachment #8606000 - Attachment is obsolete: true
mdoglio hasn't actually marked this r+, but he said on a Vidyo chat today that it was.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Attachment #8609401 - Flags: review?(mdoglio) → review+