Closed Bug 1419478 Opened 7 years ago Closed 7 years ago

Re-enable Funsize diff caching

Categories

(Release Engineering :: General, enhancement)


Tracking

(firefox59 fixed)

RESOLVED FIXED

People

(Reporter: sfraser, Assigned: sfraser)

References

Details

Attachments

(1 file)

tools/update-packaging/make_incremental_update.sh has the ability to use a script for caching of diffs, to prevent wasted CPU cycles. Funsize does set this up, setting the MBSDIFF_HOOK environment variable to point to https://dxr.mozilla.org/mozilla-central/source/taskcluster/docker/funsize-update-generator/scripts/mbsdiff_hook.sh

At the moment there's no external service for storing the diffs, and the way partials generation jobs are organised means there are very few cache hits for this feature within a single task.

After speaking with jonasfj, we think the best way forward is to use one of the taskcluster S3 buckets to store these short-lived artifacts in. A task with the right scopes can ask for some temporary S3 credentials, meaning that no new secrets need to be issued to the funsize docker image.

1. Find the right S3 bucket to use
2. Arrange an area in that S3 bucket to use, and set an expiry time. <=24h should be good.
3. Find the right scopes to add to get temporary S3 credentials
4. Modify/replace mbsdiff_hook.sh to store objects in S3. Either:
   a. Add the aws-cli to the docker image, and get mbsdiff_hook.sh to use those tools to upload files, or
   b. Replace mbsdiff_hook.sh with a python script that does a similar job - but keep the same calling interface! (see the sketch below)
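A minimal sketch of what option 4b could look like: a cache-aware wrapper that keys the patch on the content of both input files, checks S3 before diffing, and uploads on a miss. The calling interface, bucket name, and error handling here are assumptions for illustration - the real mbsdiff_hook.sh interface may differ.

#!/usr/bin/env python3
"""Hypothetical sketch of a cache-aware mbsdiff wrapper (option 4b)."""
import hashlib
import subprocess
import sys

import boto3

BUCKET = "example-cache-bucket"    # placeholder, not the real bucket
PREFIX = "releng/mbsdiff-cache/"   # prefix agreed later in this bug

def cache_key(source, destination):
    # Key the patch on the content of both inputs, so any change is a miss.
    h = hashlib.sha256()
    for path in (source, destination):
        with open(path, "rb") as f:
            h.update(f.read())
    return PREFIX + h.hexdigest()

def main(source, destination, patch):
    s3 = boto3.client("s3")
    key = cache_key(source, destination)
    try:
        s3.download_file(BUCKET, key, patch)   # cache hit: reuse the diff
        return
    except Exception:
        pass                                   # cache miss: fall through
    subprocess.check_call(["mbsdiff", source, destination, patch])
    s3.upload_file(patch, BUCKET, key)         # populate the cache for later tasks

if __name__ == "__main__":
    main(*sys.argv[1:4])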
Some data to provide context, using the nightly taskgraph TJvK0TwiReSLTR8kEvZ2eg as an example and looking at partials-generating tasks only:

- Total task run time was 5 days, 14:51:39.604 (a direct summation, ignoring parallel runs)
- There were 510 partials-generating tasks, giving a mean runtime per task of 0:15:51
- Each task made four partials (N-1,2,3,4 for one platform/locale combination), so 0:03:57 per partial on average

The elapsed wall time, from the first task starting to the latest finishing, was 2:39:54; however, en-US ran a lot earlier than the others, and 101 locales started and finished within a 33-minute window. Because en-US runs so much earlier, it is a good candidate for populating a cache that the later tasks consume.

Caching the results of diffing small files may not help the wall time, even if it reduces compute - it simply swaps CPU for network access to S3. However, caching the results for the bigger files, like libxul.so, can save ~2-3 minutes per partial.
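For reference, the per-task and per-partial averages above follow directly from the totals (the quoted figures are truncated to whole seconds); a quick check:

from datetime import timedelta

total = timedelta(days=5, hours=14, minutes=51, seconds=39.604)
per_task = total / 510        # ~0:15:51.96, i.e. the 0:15:51 above
per_partial = per_task / 4    # ~0:03:57.99, i.e. the 0:03:57 above
print(per_task, per_partial)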
Assignee: nobody → sfraser
Options:

1. S3 bucket cache
   Activate MBSDIFF_HOOK with an S3 bucket, to store the results of the en-US partial generation. This will then be used when the later tasks start.
   Difficulty: Low.
   Requires:
   - mbsdiff_hook.sh rewriting
   - S3 bucket with a short lifetime
   Access:
   - access keys in taskcluster secrets, OR
   - TC scope giving temporary AWS credentials, OR
   - IAM role giving temporary AWS credentials
   Cons: Chain of trust issues, as the S3 bucket may be poisoned.

2. One* big job (*more than one)
   Use a local filesystem cache, to avoid CoT issues. Requires reworking the partial generation so that instead of '1 platform, 1 locale, all of N-1,2,3,4' tasks we have '1 platform, all/chunked locales, N-1' plus separate N-2, 3, 4 tasks. To avoid excessive runtimes, parallelism must be done in the worker. Assuming 1 platform and all locales, that's 102 source MARs and 102 destination MARs, or 204 * 43MB = 8772MB to download. If using CoT, this will be done before the task starts. Using signing as a baseline, ~6 seconds to verify the source task and the signature on the file means 204 * 6s = ~20 minutes of CoT overhead.
   Pros: No external dependency on the caching, reducing the chance of poisoning the cached diffs.
   Cons: Lose treeherder visibility of what a task is doing. Instead of one label for each locale on each platform's row, it's a generic one, and a manual search to find the right task for a locale.

3. All the CoT
   Like option 1, but we support CoT on the S3 bucket.

4. A bit more artifactory
   Like option 1, but instead of using an S3 bucket we extract the patches from make_incremental_update.sh and store them as artifacts in the en-US task.
   Cons: Would need to make the other locales have the en-US generation as a dependency. Would need to modify make_incremental_update.sh - or have a dummy one that generates these diffs for make_incremental_update.sh to use later.
Discussions with security indicate low concerns about using S3 as long as the credentials used are temporary. :jonasfj from the TC team prefers scopes that generate temporary credentials, as that means an idle worker has no permission to access the S3 bucket.

This has been set up, using the scope auth:aws-s3:read-write:tc-gp-private-1d-us-east-1/releng/* to access tc-gp-private-1d-us-east-1 (a bucket with a lifetime of 1 day) under the prefix 'releng/' (as other teams may use this bucket in the future). Funsize should use 'releng/mbsdiff-cache/' as a prefix, so other releng tools can use this area. Note the trailing slash is important: AWS permissions do a straight string comparison, so access to 'foo' allows access to 'foo.txt' as well as 'foo/bar.txt'.

There is potential to wrap up this caching as a utility for use elsewhere - memoizing external features.
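A minimal sketch of how a task holding that scope might exchange it for temporary credentials and write under the agreed prefix, assuming the taskcluster Python client and boto3 are available in the image. The local file name and object key are illustrative, and client construction details vary across taskcluster-client versions.

import boto3
import taskcluster

BUCKET = "tc-gp-private-1d-us-east-1"
PREFIX = "releng/mbsdiff-cache/"   # trailing slash matters: permissions are prefix string matches

def s3_client():
    # Exchange the task's auth:aws-s3:read-write:... scope for short-lived
    # credentials, so the idle worker holds no long-lived secrets.
    auth = taskcluster.Auth(taskcluster.optionsFromEnvironment())
    creds = auth.awsS3Credentials("read-write", BUCKET, PREFIX)["credentials"]
    return boto3.client(
        "s3",
        aws_access_key_id=creds["accessKeyId"],
        aws_secret_access_key=creds["secretAccessKey"],
        aws_session_token=creds["sessionToken"],
    )

# Example upload; the bucket's 1-day lifecycle rule expires the object automatically.
s3_client().upload_file("libxul.so.patch", BUCKET, PREFIX + "example-cache-key")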
Comment on attachment 8937122 [details]
Bug 1419478 Enable S3 caching for binary diff patch files in partial update tasks

https://reviewboard.mozilla.org/r/207824/#review213712

Wooooohooo! Great job!
Attachment #8937122 - Flags: review?(rail) → review+
Pushed by sfraser@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7eecc63dcdec
Enable S3 caching for binary diff patch files in partial update tasks r=rail
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: General Automation → General