Closed Bug 1636201 Opened 5 years ago Closed 4 years ago

Determine if checking out tests from hg can be fast enough on test workers

Categories

(Release Engineering :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: rail)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

We'd like to know if using tests & harnesses directly from hg is feasible on the test machines, or if we should continue using the test archives produced by the build or toolchain tasks.

For each of our test worker types, we'd like to know:

  • how much time does the initial clone take?
  • how much time do incremental pulls take?
  • can we use sparse/shallow clones?
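
One low-tech way to collect these numbers (a sketch, not the actual instrumentation used here) is to wrap each vcs operation in a timer:

```python
import subprocess
import time

def timed_run(argv, cwd=None):
    """Run a command and return (elapsed_seconds, returncode)."""
    start = time.monotonic()
    proc = subprocess.run(argv, cwd=cwd)
    return time.monotonic() - start, proc.returncode

# e.g. time an initial clone (URL and dest are illustrative):
# elapsed, rc = timed_run(
#     ["hg", "clone", "https://hg.mozilla.org/mozilla-unified", "checkout"])
```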

What criteria would we want to measure?

  • is this possible?
  • how does the setup time compare to the current method?
  • failure rate?
  • is this an all-or-nothing requirement? i.e. if win7-32 is much slower, is it OK to keep generating test packages and downloading them for it? What about the Bitbar machines?

I think we need to account for switching branches (esr, beta, central, try), where there can be thousands of commits between the revisions being updated.

Are there other consumers of the test packages? i.e. if we didn't produce them, what else would fail? comm-central? fenix? mozregression? some mach commands used to test locally?

Just started working on this. Looks like I need to patch mozharness, the Python requirements, and taskgraph quite a bit to make this work.

BTW, do we have anything I can use to measure the impact of these changes, without building something new?

I've got some numbers, more to come.

Linux, m5.large (t-linux-large) running opt-mochitest-remote-e10s, no sparse checkout, vcs operations only:

Initial clone (mozilla-unified clone, then pull from a try head + update): 512 secs

try -> autoland switch: 6s
autoland -> central switch: 7s
central -> beta switch: 12s
beta -> release switch: 17s
release -> esr68 switch: 87s

Switching between different try pushes: 22-26s
Switching between different autoland pushes: 5-7s
Switching between different central pushes: 5-7s
Switching between different beta pushes: 5-7s
Switching between different release pushes: 5-7s
Switching between different esr68 pushes: 6-10s
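
The clone-plus-switch sequence measured above corresponds roughly to the following hg invocations. This sketch only builds the argv lists; the exact flags are assumptions, not what the task actually ran:

```python
def hg_checkout_cmds(head_repo, rev, dest="checkout"):
    """Build the hg command sequence measured above: clone
    mozilla-unified once, then pull the head revision from the
    push's repository and update to it. Flags are illustrative."""
    return [
        ["hg", "clone", "--noupdate",
         "https://hg.mozilla.org/mozilla-unified", dest],
        ["hg", "-R", dest, "pull", "-r", rev, head_repo],
        ["hg", "-R", dest, "update", "-r", rev],
    ]

# For a try push (hypothetical revision):
# for argv in hg_checkout_cmds("https://hg.mozilla.org/try", "abcdef012345"):
#     subprocess.run(argv, check=True)
```

Switching branches re-runs only the pull + update pair against the same clone, which is why the incremental numbers stay in the seconds range.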

This is really good data, looking forward to other platforms.

Curious what we'd replace in-task; I assume:

  • download and extract of mozharness.zip
  • download and extract of test packages

not sure about:

  • download and extract of build and built binaries (like minidump-stackwalk, symbols)
  • virtualenv setup (download and install packages)

Would it be possible to get data on total end-to-end job runtime?

Moar numbers
Using sparse checkout on linux didn't help, the initial clone time was 515 secs (vs 512s).

Mac test-macosx1014-64/debug-mochitest-a11y-1proc based run:

Initial clone: 3m50s

try -> autoland switch: 13s
autoland -> central switch: 14s
central -> beta switch: 24s
beta -> release switch: 23s
release -> esr68 switch: 23s

Switching between different try pushes: 30-45s
Switching between different autoland pushes: 13-18s
Switching between different central pushes: 13s
Switching between different beta pushes: 13-24s
Switching between different release pushes: 13-24s
Switching between different esr68 pushes: 17-26s

Win10-64 gecko-t/t-win10-64 c5.2xlarge test-windows10-64/opt-mochitest-a11y-1proc

Initial clone: 9m52s

try -> autoland switch: 22s
autoland -> central switch: 12s
central -> beta switch: 32s
beta -> release switch: 32s
release -> esr68 switch: 34s

Switching between different try pushes: 12-40s
Switching between different autoland pushes: 11-22s
Switching between different central pushes: 10-12s
Switching between different beta pushes: 11s
Switching between different release pushes: 10-12s
Switching between different esr68 pushes: 10-13s

win7-32 gecko-t/t-win7-32 c4.2xlarge test-windows7-32/opt-mochitest-a11y-1proc

Initial clone: 47m58s :(

try -> autoland switch: 15s
autoland -> central switch: 19s

Then it exceeded the max time, but looks like switching between branches is not that bad:

Switching between different try pushes: 85-190s
Switching between different autoland pushes: 15s
Switching between different central pushes: 17-19s

We can probably use the same instance type for the win7 jobs as for win10 in order to improve the initial clone time. We can also try prepopulating the images with a mozilla-unified hg store.

Attached patch try.diff (deleted) — Splinter Review

This is what I used to hack around the checkout process.

Great, thanks rail!

I think the next step here is to investigate if we can bring down the initial clone times by including a copy of the unified repo in the worker image. ni? coop to let us know how feasible that is.

Flags: needinfo?(coop)

I'll get a card added for our worker lifecycle management sprint next week and try to get it scoped.

Fair warning: with Wander's contract ending next week, and Miles on PTO, we only have 4 people right now.

Flags: needinfo?(coop)

I have a vague memory of someone investigating something along the lines of embedding a clone in the worker image, and a concern being raised that updating an old clone can be more expensive than cloning from scratch (at least on the hg server).

Flags: needinfo?(sheehan)

<tomprince> Do you recall any investigation of update old clones vs. fresh clones being expensive
<glandium> depends how old of a clone you talk about
but yes, that can be expensive on the server
<tom.prince> At the current rate we rebuild images, I think it'd be weeks, but I guess we could update them every week or something.
<glandium> I'm not particularly convinced baking a clone in an image is better than not baking one and having clones in caches
BTW, we should figure out if we're getting enough from caches, because it sure as hell looks like I get a lot of clones from scratch on at least try
<tom.prince> I think we are fairly aggressive about shutting down instances (and are looking at getting more aggressive).
<glandium> all baking a clone in an image does, IMHO, is increase the size of the images by a non-trivial amount
I wouldn't expect any speed gain or anything.
except maybe on windows

Yeah, certainly those are valid concerns. If we do this, we would need to refresh the images quite often so that the cached versions don't get stale.

(In reply to Chris AtLee [:catlee] from comment #14)

Yeah, certainly those are valid concerns. If we do this, we would need to refresh the images quite often so that the cached versions don't get stale.

Talking about this with Rob this morning;

  1. we've done baked-in clones before, on builders primarily but also on testers at some point, so it's definitely possible
  2. it's not free (i.e. fire-and-forget) from a maintenance perspective, on the image building side
  3. there are some questions/concerns around image workflow and how it affects the testers (e.g. mid-day image changes, overnight image bustage, rollbacks, and how any of those could inadvertently affect the testers well after the fact)

Let's call this task done and explore the other proposed solutions. So far it looks like the initial clone is a big no-no here. Let's see if other methods are better.

Status: NEW → RESOLVED
Closed: 4 years ago
Flags: needinfo?(sheehan)
Resolution: --- → FIXED

Many moons ago, when I was actively hacking on the Raptor/Browsertime harness and tests (Bug 1564256), I hacked some "download from hg.mozilla.org" logic to speed up my cycle. I think there are two parts that are relevant:

  1. We require builds to do the test packaging, and the builds use the same architecture as the tests. That means bumping a single Windows test triggered an hour of Windows build time (before cross builds)! Surely test packages can be platform-agnostic, and can be generated without a build.

  2. The whole test archive thing is complicated: surely we can get rid of all this path rewriting and special casing. If we do that, perhaps we can make sure that we can fetch specific test directories from hg.mozilla.org efficiently? I don't have a link to this commit anywhere, but some of my work looked like:

changeset:   572386:0e4a7a0de5e9
user:        Nick Alexander <nalexander@mozilla.com>
date:        Thu Jul 18 15:02:39 2019 -0700
instability: orphan
files:       testing/mozharness/mozharness/base/script.py testing/mozharness/mozharness/mozilla/testing/testbase.py
description:
Bug 1564256 - DO NOT LAND - Download testing/raptor/raptor/** from hg.mozilla.org.

This means that changes to the Raptor harness itself (changes under
testing/raptor/raptor/**) don't require a new
target.tests.raptor.tar.gz, which takes 20-50 minutes to produce as
part of a Win10 _artifact_ build.  The downloaded files are overlaid
on top of what is extracted from target.tests.raptor.tar.gz, replacing
existing files but not deleting files (that have either been removed
-- this isn't VCS -- or that come from other parts of the Raptor test
archive defined in `test_archive.py`).


 testing/mozharness/mozharness/base/script.py              |  26 ++++++++-----
 testing/mozharness/mozharness/mozilla/testing/testbase.py |  28 +++++++++++++++
 2 files changed, 44 insertions(+), 10 deletions(-)

modified   testing/mozharness/mozharness/base/script.py
@@ -394,6 +394,8 @@ class ScriptMixin(PlatformMixin):
         Returns:
             BytesIO: contents of url
         '''
+        content_length = None
+
         self.info('Fetch {} into memory'.format(url))
         parsed_url = urlparse.urlparse(url)
 
@@ -425,16 +427,16 @@ class ScriptMixin(PlatformMixin):
         # Bug 1309912 - Adding timeout in hopes to solve blocking on response.read() (bug 1300413)
         response = urllib2.urlopen(request, timeout=30)
 
-        if parsed_url.scheme in ('http', 'https'):
+        if parsed_url.scheme in ('http', 'https') and response.headers.get('Content-Length'):
             content_length = int(response.headers.get('Content-Length'))
+            self.info('Content-Length response header: {}'.format(content_length))
 
         response_body = response.read()
         response_body_size = len(response_body)
 
-        self.info('Content-Length response header: {}'.format(content_length))
         self.info('Bytes received: {}'.format(response_body_size))
 
-        if response_body_size != content_length:
+        if content_length and response_body_size != content_length:
             raise ContentLengthMismatch(
                 'The retrieved Content-Length header declares a body length '
                 'of {} bytes, while we actually retrieved {} bytes'.format(
@@ -646,7 +648,7 @@ class ScriptMixin(PlatformMixin):
         t = tarfile.open(fileobj=compressed_file, mode=mode)
         t.extractall(path=extract_to)
 
-    def download_unpack(self, url, extract_to='.', extract_dirs='*', verbose=False):
+    def download_unpack(self, url, extract_to='.', extract_dirs='*', verbose=False, mimetype=None):
         """Generic method to download and extract a compressed file without writing it to disk first.
 
         Args:
@@ -658,8 +660,11 @@ class ScriptMixin(PlatformMixin):
             verbose (bool, optional): whether or not extracted content should be displayed.
                                       Defaults to False.
 
+            mimetype (str, optional): The content mimetype.
+                                      Defaults to `None`, which determines the mimetype from the
+                                      extension of the filename given in the URL.
         """
-        def _determine_extraction_method_and_kwargs(url):
+        def _determine_extraction_method_and_kwargs(url, mimetype=None):
             EXTENSION_TO_MIMETYPE = {
                 'bz2': 'application/x-bzip2',
                 'gz':  'application/x-gzip',
@@ -687,10 +692,11 @@ class ScriptMixin(PlatformMixin):
                 },
             }
 
-            filename = url.split('/')[-1]
-            # XXX: bz2/gz instead of tar.{bz2/gz}
-            extension = filename[filename.rfind('.')+1:]
-            mimetype = EXTENSION_TO_MIMETYPE[extension]
+            if not mimetype:
+                filename = url.split('/')[-1]
+                # XXX: bz2/gz instead of tar.{bz2/gz}
+                extension = filename[filename.rfind('.')+1:]
+                mimetype = EXTENSION_TO_MIMETYPE[extension]
             self.debug('Mimetype: {}'.format(mimetype))
 
             function = MIMETYPES[mimetype]['function']
@@ -735,7 +741,7 @@ class ScriptMixin(PlatformMixin):
 
         # 2) We're guaranteed to have download the file with error_level=FATAL
         #    Let's unpack the file
-        function, kwargs = _determine_extraction_method_and_kwargs(url)
+        function, kwargs = _determine_extraction_method_and_kwargs(url, mimetype=mimetype)
         try:
             function(**kwargs)
         except zipfile.BadZipfile:
modified   testing/mozharness/mozharness/mozilla/testing/testbase.py
@@ -371,6 +371,34 @@ You can set this by specifying --test-ur
             self.download_unpack(url, target_dir,
                                  extract_dirs=unpack_dirs)
 
+            if 'raptor' in file_name \
+              and 'GECKO_HEAD_REPOSITORY' in os.environ \
+              and 'GECKO_HEAD_REV' in os.environ:
+                RAPTOR_ARCHIVE_URL = '{GECKO_HEAD_REPOSITORY}/archive/{GECKO_HEAD_REV}.zip/testing/raptor'  # noqa: E501
+                url = RAPTOR_ARCHIVE_URL.format(
+                    GECKO_HEAD_REPOSITORY=os.environ['GECKO_HEAD_REPOSITORY'],
+                    GECKO_HEAD_REV=os.environ['GECKO_HEAD_REV'])
+                self.download_unpack(url, target_dir, extract_dirs='*', mimetype='application/zip')
+
+                # target.raptor.tests.tar.gz contains 'raptor/raptor/raptor.py'.
+                # HG's archive contains try-6ae3ca22378129beed0ae637679b989b1bea60ee/testing/raptor/raptor/raptor.py.
+                _, repo = os.environ['GECKO_HEAD_REPOSITORY'].rsplit('/', 1)  # Turn 'https://hg.mozilla.org/try' into 'try'.
+
+                ROOT = '{repo}-{GECKO_HEAD_REV}'.format(
+                    repo=repo,
+                    GECKO_HEAD_REV=os.environ['GECKO_HEAD_REV'])
+                root = os.path.join(target_dir, ROOT, 'testing', 'raptor', 'raptor')
+                self.info('root: {}'.format(root))
+                for dirpath, dirs, files in os.walk(root):
+                    for f in files:
+                        src_abspath = os.path.join(dirpath, f)
+                        relpath = os.path.relpath(src_abspath, root)
+                        dst_abspath = os.path.join(target_dir, 'raptor', 'raptor', relpath)
+                        self.info("Renaming: {src} -> {dst}".format(src=src_abspath, dst=dst_abspath))
+                        if os.path.isfile(dst_abspath):
+                            os.remove(dst_abspath)
+                        os.rename(src_abspath, dst_abspath)
+
     def _download_test_zip(self, extract_dirs=None):
         dirs = self.query_abs_dirs()
         test_install_dir = dirs.get('abs_test_install_dir',

re 1) - we're planning to tackle that by breaking out test archive generation into separate tasks - bug 1628981

re 2) - hopefully we can fix the layout issue at the same time. I don't understand why the layout of the test archives doesn't match the layout of the source directories
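
For reference, the layout mismatch is what the rename loop in the attached patch works around: hg archives prefix every path with `<repo>-<rev>/testing/...`, while the test packages expect paths rooted at e.g. `raptor/raptor/...`. A minimal sketch of that rewrite (the function name is illustrative):

```python
def archive_to_test_path(archive_path, repo, rev):
    """Map an hg-archive path like
    'try-<rev>/testing/raptor/raptor/raptor.py' to the test-package
    layout 'raptor/raptor/raptor.py'. Mirrors the rename loop in the
    attached patch; purely illustrative."""
    prefix = "{}-{}/testing/".format(repo, rev)
    if not archive_path.startswith(prefix):
        raise ValueError("unexpected archive path: " + archive_path)
    return archive_path[len(prefix):]
```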

(In reply to Chris AtLee [:catlee] from comment #18)

re 1) - we're planning to tackle that by breaking out test archive generation into separate tasks - bug 1628981

re 2) - hopefully we can fix the layout issue at the same time. I don't understand why the layout of the test archives doesn't match the layout of the source directories

Thrilled to hear that we're going to do some of the lower hanging fruit. I can only imagine the _tests directory stuff is related to 2). I've never understood those choices.
