Determine if checking out tests from hg can be fast enough on test workers
Categories
(Release Engineering :: General, task)
Tracking
(Not tracked)
People
(Reporter: catlee, Assigned: rail)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
(deleted), patch
We'd like to know if using tests & harnesses directly from hg is feasible on the test machines, or if we should continue on using test archives produced by the build or toolchain tasks.
For each of our test worker types, we'd like to know:
- how much time does the initial clone take?
- how much time do incremental pulls take?
- can we use sparse/shallow clones?
Comment 1•5 years ago
what criteria would we want to measure?
- is this possible?
- how does setup time compare to the current method?
- failure rate?
- is this an all-or-none requirement? i.e. if win7-32 is much slower, is it ok to keep generating test packages and downloading them? what about the bitbar machines?
I think we need to account for switching branches (esr, beta, central, try), where there can be thousands of commits to update the source across.
Are there other consumers of the test packages? i.e. if we didn't produce them, what else would fail? comm-central? fenix? mozregression? some mach commands used to test locally?
Assignee
Comment 2•5 years ago
Just started working on this. Looks like I need to patch mozharness, the Python requirements, and taskgraph quite a bit to make this work.
BTW, do we have anything I can use to measure the impact of the changes without reinventing the wheel?
Assignee
Comment 3•5 years ago
I've got some numbers, more to come.
Linux, m5.large (t-linux-large) running opt-mochitest-remote-e10s, no sparse checkout, vcs operations only:
Initial clone (mozilla-unified clone, then pull from a try head + update): 512s
try -> autoland switch: 6s
autoland -> central switch: 7s
central -> beta switch: 12s
beta -> release switch: 17s
release -> esr68 switch: 87s
Switching between different try pushes: 22-26s
Switching between different autoland pushes: 5-7s
Switching between different central pushes: 5-7s
Switching between different beta pushes: 5-7s
Switching between different release pushes: 5-7s
Switching between different esr68 pushes: 6-10s
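For reference, here is roughly how vcs-only timings like the ones above could be collected. This is a minimal sketch assuming plain hg invoked via subprocess against mozilla-unified; it is not the actual harness used for these measurements, and TRY_REV is a placeholder (in CI it would come from GECKO_HEAD_REV):

import subprocess
import time

def timed(label, *cmd, cwd=None):
    """Run one command and print how long it took."""
    start = time.monotonic()
    subprocess.run(cmd, cwd=cwd, check=True)
    print("{}: {:.1f}s".format(label, time.monotonic() - start))

# Clone the unified repo without a working copy, then pull and update to a try head.
timed("initial clone", "hg", "clone", "--noupdate",
      "https://hg.mozilla.org/mozilla-unified", "checkout")
timed("pull try head", "hg", "pull", "-r", "TRY_REV",
      "https://hg.mozilla.org/try", cwd="checkout")
timed("update to try head", "hg", "update", "-r", "TRY_REV", cwd="checkout")

# Switching branches is just a working-copy update within the same store;
# the bookmark names are assumed to match mozilla-unified's.
for bookmark in ("autoland", "central", "beta", "release", "esr68"):
    timed("switch to " + bookmark, "hg", "update", "-r", bookmark, cwd="checkout")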
Comment 4•5 years ago
This is really good data; looking forward to the other platforms.
Curious what we replace in-task. I assume:
- download and extract of mozharness.zip
- download and extract of test packages
not sure about:
- download and extract of build and built binaries (like minidump-stackwalk, symbols)
- virtualenv setup (download and install packages)
Would it be possible to get data on total end-to-end job runtime?
Assignee
Comment 5•5 years ago
Moar numbers
Using a sparse checkout on Linux didn't help: the initial clone took 515s (vs. 512s).
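For context, sparse checkouts on the gecko trees are normally driven by in-tree profiles; a rough sketch of enabling one on an existing clone is below. The profile name and the debugsparse invocation are assumptions (based on Mercurial's experimental sparse extension), not a record of what was actually run for this experiment:

import subprocess

# Path to the existing clone (hypothetical).
REPO = "checkout"

# Enable the experimental sparse extension for this invocation and activate an
# in-tree profile; "build/sparse-profiles/taskgraph" is only an example profile.
subprocess.run(
    ["hg", "--config", "extensions.sparse=",
     "debugsparse", "--enable-profile", "build/sparse-profiles/taskgraph"],
    cwd=REPO, check=True)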
Mac test-macosx1014-64/debug-mochitest-a11y-1proc based run:
Initial clone: 3m50s
try -> autoland switch: 13s
autoland -> central switch: 14s
central -> beta switch: 24s
beta -> release switch: 23s
release -> esr68 switch: 23s
Switching between different try pushes: 30-45s
Switching between different autoland pushes: 13-18s
Switching between different central pushes: 13s
Switching between different beta pushes: 13-24s
Switching between different release pushes: 13-24s
Switching between different esr68 pushes: 17-26s
Assignee
Comment 6•5 years ago
Win10-64 gecko-t/t-win10-64 c5.2xlarge test-windows10-64/opt-mochitest-a11y-1proc
Initial clone: 9m52s
try -> autoland switch: 22s
autoland -> central switch: 12s
central -> beta switch: 32s
beta -> release switch: 32s
release -> esr68 switch: 34s
Switching between different try pushes: 12-40s
Switching between different autoland pushes: 11-22s
Switching between different central pushes: 10-12s
Switching between different beta pushes: 11s
Switching between different release pushes: 10-12s
Switching between different esr68 pushes: 10-13s
Assignee
Comment 7•5 years ago
win7-32 gecko-t/t-win7-32 c4.2xlarge test-windows7-32/opt-mochitest-a11y-1proc
Initial clone: 47m58s :(
try -> autoland switch: 15s
autoland -> central switch: 19s
Then the task exceeded its max run time, but it looks like switching between branches is not that bad:
Switching between different try pushes: 85-190s
Switching between different autoland pushes: 15s
Switching between different central pushes: 17-19s
Assignee
Comment 8•5 years ago
We can probably use the same instance type for all Windows jobs in order to improve the initial clone time. We can also try prepopulating the docker images with a mozilla-unified hg store to improve this further.
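To illustrate the prepopulated-store idea, here is a sketch of what the task-time side might look like. It assumes the image ships a mozilla-unified clone at a known path (hypothetical below) and reuses the GECKO_HEAD_REPOSITORY / GECKO_HEAD_REV variables the test tasks already get, so only changesets that landed after the image was built travel over the wire:

import os
import subprocess

# Hypothetical location of the image-baked mozilla-unified clone.
BAKED_CLONE = r"C:\hg-shared\mozilla-unified" if os.name == "nt" else "/builds/hg-shared/mozilla-unified"

repo = os.environ["GECKO_HEAD_REPOSITORY"]  # e.g. https://hg.mozilla.org/try
rev = os.environ["GECKO_HEAD_REV"]

# Pull only the target revision into the existing store, then update to it.
subprocess.run(["hg", "pull", "-r", rev, repo], cwd=BAKED_CLONE, check=True)
subprocess.run(["hg", "update", "-r", rev], cwd=BAKED_CLONE, check=True)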
Assignee
Comment 9•5 years ago
This is what I used to hack around the checkout process.
Reporter
Comment 10•4 years ago
Great, thanks rail!
I think the next step here is to investigate if we can bring down the initial clone times by including a copy of the unified repo in the worker image. ni? coop to let us know how feasible that is.
Comment 11•4 years ago
I'll get a card added for our worker lifecycle management sprint next week and try to get it scoped.
Fair warning: with Wander's contract ending next week, and Miles on PTO, we only have 4 people right now.
Comment 12•4 years ago
I have a vague memory of someone investigating something along the lines of embedding a clone in a worker image, and the concern being raised that updating an old clone can be more expensive than cloning from scratch (at least on the hg server).
Comment 13•4 years ago
<tomprince> Do you recall any investigation of updating old clones vs. fresh clones being expensive?
<glandium> depends how old of a clone you talk about
but yes, that can be expensive on the server
<tom.prince> At the current rate we rebuild images, I think it'd be weeks, but I guess we could update them every week or something.
<glandium> I'm not particularly convinced baking a clone in an image is better than not baking one and having clones in caches
BTW, we should figure out if we're getting enough from caches, because it sure as hell looks like I get a lot of clones from scratch on at least try
<tom.prince> I think we are fairly aggressive about shutting down instances (and are looking at getting more aggressive).
<glandium> all baking a clone in an image does, IMHO, is increase the size of the images by a non-trivial amount
I wouldn't expect any speed gain or anything.
except maybe on windows
Reporter
Comment 14•4 years ago
Yeah, certainly those are valid concerns. If we do this, we would need to refresh the images quite often so that the cached versions don't get stale.
Comment 15•4 years ago
(In reply to Chris AtLee [:catlee] from comment #14)
Yeah, certainly those are valid concerns. If we do this, we would need to refresh the images quite often so that the cached versions don't get stale.
Talking about this with Rob this morning:
- we've done baked-in clones before, on builders primarily but also on testers at some point, so it's definitely possible
- it's not free (i.e. fire and forget) from a maintenance perspective, on the image building side
- there are some questions/concerns around image workflow and how it affects the testers (e.g. mid-day image changes, overnight image bustage, rollbacks, and how any of those could inadvertently affect the testers well after the fact)
Assignee
Comment 16•4 years ago
Let's call this task done and explore the other proposed solutions. So far it looks like the initial clone is a big no-no here. Let's see if other methods are better.
Updated•4 years ago
Comment 17•4 years ago
Many moons ago, when I was actively hacking on the Raptor/Browsertime harness and tests (Bug 1564256), I hacked some "download from hg.mozilla.org" logic to speed up my cycle. I think there are two parts that are relevant:
1. We require builds to do the test packaging, and the builds use the same architecture as the tests. That means bumping a single Windows test triggered an hour of Windows build time (before cross builds)! Surely test packages can be platform-agnostic and can be generated without a build.
2. The whole test archive thing is complicated: surely we can get rid of all this path rewriting and special casing. If we do that, perhaps we can make sure that we can fetch specific test directories from hg.mozilla.org efficiently? I don't have a link to this commit anywhere, but some of my work looked like:
changeset: 572386:0e4a7a0de5e9
user: Nick Alexander <nalexander@mozilla.com>
date: Thu Jul 18 15:02:39 2019 -0700
instability: orphan
files: testing/mozharness/mozharness/base/script.py testing/mozharness/mozharness/mozilla/testing/testbase.py
description:
Bug 1564256 - DO NOT LAND - Download testing/raptor/raptor/** from hg.mozilla.org.
This means that changes to the Raptor harness itself (changes under
testing/raptor/raptor/**) don't require a new
target.tests.raptor.tar.gz, which takes 20-50 minutes to produce as
part of a Win10 _artifact_ build. The downloaded files are overlaid
on top of what is extracted from target.tests.raptor.tar.gz, replacing
existing files but not deleting files (that have either been removed
-- this isn't VCS -- or that come from other parts of the Raptor test
archive defined in `test_archive.py`).
testing/mozharness/mozharness/base/script.py | 26 ++++++++-----
testing/mozharness/mozharness/mozilla/testing/testbase.py | 28 +++++++++++++++
2 files changed, 44 insertions(+), 10 deletions(-)
modified testing/mozharness/mozharness/base/script.py
@@ -394,6 +394,8 @@ class ScriptMixin(PlatformMixin):
Returns:
BytesIO: contents of url
'''
+ content_length = None
+
self.info('Fetch {} into memory'.format(url))
parsed_url = urlparse.urlparse(url)
@@ -425,16 +427,16 @@ class ScriptMixin(PlatformMixin):
# Bug 1309912 - Adding timeout in hopes to solve blocking on response.read() (bug 1300413)
response = urllib2.urlopen(request, timeout=30)
- if parsed_url.scheme in ('http', 'https'):
+ if parsed_url.scheme in ('http', 'https') and response.headers.get('Content-Length'):
content_length = int(response.headers.get('Content-Length'))
+ self.info('Content-Length response header: {}'.format(content_length))
response_body = response.read()
response_body_size = len(response_body)
- self.info('Content-Length response header: {}'.format(content_length))
self.info('Bytes received: {}'.format(response_body_size))
- if response_body_size != content_length:
+ if content_length and response_body_size != content_length:
raise ContentLengthMismatch(
'The retrieved Content-Length header declares a body length '
'of {} bytes, while we actually retrieved {} bytes'.format(
@@ -646,7 +648,7 @@ class ScriptMixin(PlatformMixin):
t = tarfile.open(fileobj=compressed_file, mode=mode)
t.extractall(path=extract_to)
- def download_unpack(self, url, extract_to='.', extract_dirs='*', verbose=False):
+ def download_unpack(self, url, extract_to='.', extract_dirs='*', verbose=False, mimetype=None):
"""Generic method to download and extract a compressed file without writing it to disk first.
Args:
@@ -658,8 +660,11 @@ class ScriptMixin(PlatformMixin):
verbose (bool, optional): whether or not extracted content should be displayed.
Defaults to False.
+ mimetype (str, optional): The content mimetype.
+ Defaults to `None`, which determines the mimetype from the
+ extension of the filename given in the URL.
"""
- def _determine_extraction_method_and_kwargs(url):
+ def _determine_extraction_method_and_kwargs(url, mimetype=None):
EXTENSION_TO_MIMETYPE = {
'bz2': 'application/x-bzip2',
'gz': 'application/x-gzip',
@@ -687,10 +692,11 @@ class ScriptMixin(PlatformMixin):
},
}
- filename = url.split('/')[-1]
- # XXX: bz2/gz instead of tar.{bz2/gz}
- extension = filename[filename.rfind('.')+1:]
- mimetype = EXTENSION_TO_MIMETYPE[extension]
+ if not mimetype:
+ filename = url.split('/')[-1]
+ # XXX: bz2/gz instead of tar.{bz2/gz}
+ extension = filename[filename.rfind('.')+1:]
+ mimetype = EXTENSION_TO_MIMETYPE[extension]
self.debug('Mimetype: {}'.format(mimetype))
function = MIMETYPES[mimetype]['function']
@@ -735,7 +741,7 @@ class ScriptMixin(PlatformMixin):
# 2) We're guaranteed to have download the file with error_level=FATAL
# Let's unpack the file
- function, kwargs = _determine_extraction_method_and_kwargs(url)
+ function, kwargs = _determine_extraction_method_and_kwargs(url, mimetype=mimetype)
try:
function(**kwargs)
except zipfile.BadZipfile:
modified testing/mozharness/mozharness/mozilla/testing/testbase.py
@@ -371,6 +371,34 @@ You can set this by specifying --test-ur
self.download_unpack(url, target_dir,
extract_dirs=unpack_dirs)
+ if 'raptor' in file_name \
+ and 'GECKO_HEAD_REPOSITORY' in os.environ \
+ and 'GECKO_HEAD_REV' in os.environ:
+ RAPTOR_ARCHIVE_URL = '{GECKO_HEAD_REPOSITORY}/archive/{GECKO_HEAD_REV}.zip/testing/raptor' # noqa: E501
+ url = RAPTOR_ARCHIVE_URL.format(
+ GECKO_HEAD_REPOSITORY=os.environ['GECKO_HEAD_REPOSITORY'],
+ GECKO_HEAD_REV=os.environ['GECKO_HEAD_REV'])
+ self.download_unpack(url, target_dir, extract_dirs='*', mimetype='application/zip')
+
+ # target.raptor.tests.tar.gz contains 'raptor/raptor/raptor.py'.
+ # HG's archive contains try-6ae3ca22378129beed0ae637679b989b1bea60ee/testing/raptor/raptor/raptor.py.
+ _, repo = os.environ['GECKO_HEAD_REPOSITORY'].rsplit('/', 1) # Turn 'https://hg.mozilla.org/try' into 'try'.
+
+ ROOT = '{repo}-{GECKO_HEAD_REV}'.format(
+ repo=repo,
+ GECKO_HEAD_REV=os.environ['GECKO_HEAD_REV'])
+ root = os.path.join(target_dir, ROOT, 'testing', 'raptor', 'raptor')
+ self.info('root: {}'.format(root))
+ for dirpath, dirs, files in os.walk(root):
+ for f in files:
+ src_abspath = os.path.join(dirpath, f)
+ relpath = os.path.relpath(src_abspath, root)
+ dst_abspath = os.path.join(target_dir, 'raptor', 'raptor', relpath)
+ self.info("Renaming: {src} -> {dst}".format(src=src_abspath, dst=dst_abspath))
+ if os.path.isfile(dst_abspath):
+ os.remove(dst_abspath)
+ os.rename(src_abspath, dst_abspath)
+
def _download_test_zip(self, extract_dirs=None):
dirs = self.query_abs_dirs()
test_install_dir = dirs.get('abs_test_install_dir',
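For what it's worth, the archive endpoint this patch leans on can be exercised on its own. A small sketch, using the URL pattern from RAPTOR_ARCHIVE_URL above with hypothetical repo/revision values and a local output directory:

import io
import os
import urllib.request
import zipfile

repo = "https://hg.mozilla.org/mozilla-central"  # hypothetical
rev = "0123456789ab"                             # placeholder revision
url = "{}/archive/{}.zip/testing/raptor".format(repo, rev)

with urllib.request.urlopen(url, timeout=30) as response:
    archive = zipfile.ZipFile(io.BytesIO(response.read()))

# hgweb prefixes every member with "<reponame>-<rev>/" (the ROOT handling in the
# patch above), so strip the first path component when extracting.
for member in archive.namelist():
    if member.endswith("/") or "/" not in member:
        continue
    relpath = member.split("/", 1)[1]
    target = os.path.join("raptor-download", relpath)
    os.makedirs(os.path.dirname(target), exist_ok=True)
    with open(target, "wb") as out:
        out.write(archive.read(member))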
Reporter
Comment 18•4 years ago
re 1) - we're planning to tackle that by breaking out test archive generation into separate tasks - bug 1628981
re 2) - hopefully we can fix the layout issue at the same time. I don't understand why the layout of the test archives doesn't match the layout of the source directories
Comment 19•4 years ago
(In reply to Chris AtLee [:catlee] from comment #18)
re 1) - we're planning to tackle that by breaking out test archive generation into separate tasks - bug 1628981
re 2) - hopefully we can fix the layout issue at the same time. I don't understand why the layout of the test archives doesn't match the layout of the source directories
Thrilled to hear that we're going to do some of the lower-hanging fruit. I can only imagine the _tests directory stuff is related to 2). I've never understood those choices.