Open Bug 1432287 Opened 7 years ago Updated 2 years ago

[meta] Run tests from source checkouts in CI

Categories

(Testing :: General, enhancement, P3)

Version 3
enhancement

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

References

(Depends on 2 open bugs, Blocks 2 open bugs)

Details

(Keywords: meta)

We have a number of bugs floating around for this already, but we're lacking a tracking bug.

Essentially, we want to run test tasks from source checkouts. Today, we ship zip or compressed tar files around to different machines. This has a number of problems:

1) Creating archives adds overhead.
2) Extracting archives adds overhead.
3) Tasks don't have full source context and thus can't easily do things like implement task logic as mach commands.
4) Managing the state of the test environment is more complicated than it needs to be.

Expanding on these...

Creating an archive of files adds overhead to the task producing that archive. It takes time to read files, compress them, and upload them. This increases end-to-end time. While we've spent many engineering hours optimizing this process, no matter how you slice it, it is still overhead.

Extracting archives to the local filesystem obviously takes time. You need to download the archive, spend CPU to decompress it, and incur I/O to write out the files. But the big inefficiency here is that naive archive extraction is wasteful. You can't just extract an archive over an existing directory, because files orphaned since the last extraction would be left behind. So we tend to blow away the destination directory first, or we use a separate destination directory for each source archive. CI redundantly extracts the same files over and over. To make matters worse, many files don't change from archive to archive; e.g. in the archives of test files, typically only a few files change. Assuming you could use a cache, ~100% of the file extractions are redundant.

Yes, we could implement "smart" archive extraction. It would know how to delete orphaned files and how to detect unmodified files and skip them. But at the point you do this, you are reinventing version control. And version control is highly optimized to solve this problem of incrementally updating a working directory. So why not use version control?

Using version control also gives tasks access to (potentially) the full contents of the repo. This means you can do things like run `mach` without jumping through hoops. Things like mozharness (which was invented to bridge the gap between mozilla-central and CI) wouldn't need to exist (although mozharness is in tree today, so that gap doesn't really exist as much). Having access to the full repo contents would allow more test logic to live in the repo. It would enable the code used by people and machines to converge.

Finally, by not leaning on source archives and by using version control checkouts, we remove a number of problems around managing task state. If most of the files we care about are in the version control checkout, we can use version control to manage those files for us. We have a lot of code around ensuring the state of caches, extracting archives to the correct places, etc. If your goal is "ensure a pristine state of files is in a directory," version control solves that quite well and eliminates a whole lot of code from CI land.

This work is currently blocked on getting efficient partial clones rolled out to Firefox CI. That is tracked in bug 1428470. That work is staffed for 2018 and should hopefully be deployed and ready to go sometime in Q3.
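For illustration, here is a minimal sketch of the "use version control to ensure a pristine state of files" idea, using plain hg commands driven from Python. The paths, revision, and flow are hypothetical and not the actual run-task code; on older Mercurial versions, purge requires enabling the purge extension.

    """Sketch: keeping a task's working directory pristine with Mercurial
    instead of re-downloading and re-extracting source archives.
    Hypothetical paths/revision, for illustration only."""
    import subprocess

    REPO_URL = "https://hg.mozilla.org/mozilla-central"   # assumed upstream
    CHECKOUT = "/builds/worker/checkouts/gecko"           # hypothetical path


    def hg(*args):
        """Run an hg command inside the checkout and fail loudly."""
        subprocess.run(["hg", "--cwd", CHECKOUT] + list(args), check=True)


    def ensure_pristine(revision):
        """Incrementally update the working directory to `revision`.

        hg only rewrites files that actually changed, removes files deleted
        upstream, and purge drops untracked leftovers from previous tasks --
        exactly the bookkeeping a "smart" archive extractor would have to
        reimplement. Assumes the clone already exists at CHECKOUT.
        """
        hg("pull", REPO_URL)                            # fetch only missing changesets
        hg("update", "--clean", "--rev", revision)      # incremental working-dir update
        hg("purge")                                     # remove untracked/orphaned files


    if __name__ == "__main__":
        ensure_pristine("abcdef123456")                 # hypothetical target revision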
Priority: -- → P3

nalexander: fyi

(In reply to Bob Clary [:bc:] from comment #1)

nalexander: fyi

Thanks, Bob. Just for history's sake: I got source checkouts working for Raptor jobs, and found that on Windows just doing a run-task with checkout: true takes 9-10 minutes. (I don't think a sparse profile would help that much here, either.) Whatever this "incremental" HG operation is/could be, it is absolutely essential, because just using HG naively is orders of magnitude slower than the mozharness.zip approach.

I think this bug is predicated on the checkouts being cached on the workers. If raptor is running on a worker type that doesn't already have a cached checkout, then the initial overhead will be large. As more workers claim these types of tasks, then we should theoretically start seeing faster setup times.

There are lots of variables at play here though. If the cache gets too large, taskcluster will just delete the older stuff. I'm also not sure what happens to a cache if an instance is killed. I don't believe anyone took the time to make sure that our caching strategy on Windows is working as expected.
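To make the cold-vs-warm cache distinction concrete, here's a rough sketch of why the first task on a fresh worker pays the full clone cost while subsequent tasks only pay for an incremental pull and update. The paths and revision are hypothetical, and the real run-task/robustcheckout logic (with HG_STORE_PATH shared stores) is more involved.

    """Sketch: cold vs. warm checkout cache on a worker. Illustrative only."""
    import os
    import subprocess

    CACHE_DIR = "/builds/worker/checkouts"       # hypothetical persistent cache
    CHECKOUT = os.path.join(CACHE_DIR, "gecko")
    REPO_URL = "https://hg.mozilla.org/mozilla-central"


    def ensure_checkout(revision):
        if not os.path.isdir(os.path.join(CHECKOUT, ".hg")):
            # Cold cache: full clone. This is the slow, multi-minute path
            # described in the Windows numbers above.
            subprocess.run(["hg", "clone", "--noupdate", REPO_URL, CHECKOUT],
                           check=True)
        # Warm cache: pull only new changesets, then update incrementally.
        subprocess.run(["hg", "--cwd", CHECKOUT, "pull", REPO_URL], check=True)
        subprocess.run(["hg", "--cwd", CHECKOUT, "update", "--clean",
                        "--rev", revision], check=True)


    ensure_checkout("deadbeefcafe")              # hypothetical revision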

Also, switching between esr60, beta, trunk, etc. takes a long time. For me locally, on Windows with an SSD, it takes ~5 minutes to switch from esr68 to trunk; imagine esr60 :)

The good news for perf is that we only run perf tests on beta/trunk, so it's not as extreme there. For all other unittests, though, this is something we need to support.
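If anyone wants to reproduce these numbers, a throwaway timing sketch like the one below works. The revisions are placeholders, and it assumes you run it from inside an existing clone.

    """Rough way to measure how long a cross-branch hg update takes."""
    import subprocess
    import time

    def time_update(rev):
        start = time.monotonic()
        subprocess.run(["hg", "update", "--clean", "--rev", rev], check=True)
        return time.monotonic() - start

    # e.g. hop from a trunk revision to an ESR revision and back
    for rev in ("TRUNK_REV_PLACEHOLDER", "ESR68_REV_PLACEHOLDER"):
        print(rev, f"{time_update(rev):.1f}s")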

(In reply to Andrew Halberstadt [:ahal] from comment #3)

I think this bug is predicated on the checkouts being cached on the workers. If raptor is running on a worker type that doesn't already have a cached checkout, then the initial overhead will be large. As more workers claim these types of tasks, then we should theoretically start seeing faster setup times.

In fact, the one thing I needed to do was to put HG_STORE_PATH somewhere that wasn't on a cache, since the Win 10 HW devices don't have the default cache location.

There are lots of variables at play here though. If the cache gets too large, taskcluster will just delete the older stuff. I'm also not sure what happens to a cache if an instance is killed. I don't believe anyone took the time to make sure that our caching strategy on Windows is working as expected.

Thanks for the context. I'm still concerned that the hg checkout I/O alone will dwarf any benefits we might get. I did wonder about using the .tar.gz and/or .zip HG archive endpoints to do a "super sparse" checkout, which would actually make sense for Raptor. Server load and the complexity that is https://searchfox.org/mozilla-central/source/python/mozbuild/mozbuild/action/test_archive.py are why it wouldn't make sense more generally. If I'm wrong about server load, in either direction, please feed that back to me!
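For the record, a "super sparse" fetch via the hgweb archive endpoint could look roughly like the sketch below. It assumes the archive URL accepts a trailing path filter to limit it to a subdirectory (worth verifying against hg.mozilla.org before relying on it); the revision and subdirectory are hypothetical.

    """Sketch of a "super sparse" fetch via hgweb's archive endpoint."""
    import io
    import tarfile
    import urllib.request

    BASE = "https://hg.mozilla.org/mozilla-central"
    REV = "deadbeefcafe"          # hypothetical revision
    SUBDIR = "testing/raptor"     # hypothetical subtree a Raptor task needs

    # Assumption: .../archive/<rev>.tar.gz/<subdir> limits the archive to <subdir>.
    url = f"{BASE}/archive/{REV}.tar.gz/{SUBDIR}"
    with urllib.request.urlopen(url) as resp:
        data = resp.read()

    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
        tf.extractall("raptor-sources")   # lands under <repo>-<rev>/<SUBDIR>/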

Severity: normal → S3