Closed Bug 1430878 Opened 7 years ago Closed 7 years ago

Add worker type with >30 CPU cores to support some toolchain tasks

Categories: Firefox Build System :: Task Configuration (task)
Tracking: Not tracked
Status: RESOLVED FIXED
Target Milestone: mozilla60
People: Reporter: gps, Assigned: gps
References: Blocks 1 open bug
Attachments: 1 file
We currently use the standard builder worker type for toolchain tasks. That worker type is backed by [cm][45].4xlarge instance types (i.e. c4/c5/m4/m5 .4xlarge), which have 16 vCPUs.
Some of the toolchain tasks (notably Clang and GCC builds) are heavily CPU bound and should be able to scale up to take advantage of >>30 vCPUs.
I think we should add a new worker type backed by c5.18xlarge (72 vCPU), m5.24xlarge (96 vCPU), etc., and move CPU-bound toolchain tasks to it. This *should* shave minutes off individual toolchain tasks, and it could potentially shave dozens of minutes off the end-to-end time for rebuilding toolchains.
I /think/ this should be relatively low-effort:
1) Define the new worker type(s)
2) Perform Try pushes to see where the sweet spot is for the underlying EC2 instance type, so we're not throwing away too much money on cores we don't use (a rough cost model follows this list)
3) Switch tasks to the new worker type
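For (2), here's a back-of-the-envelope Amdahl's law sketch of the trade-off. The serial fraction and the $/hour figures are illustrative guesses on my part, not measurements; the Try pushes are what will pin down the real sweet spot:

# Rough cost/time model for sizing instances. Assumes a 60 minute
# baseline on 16 cores with a fixed serial fraction (Amdahl's law).
# Hourly prices are illustrative, not authoritative.
CANDIDATES = {
    # name: (vCPUs, approximate on-demand $/hour)
    "c5.4xlarge": (16, 0.68),
    "c5.9xlarge": (36, 1.53),
    "c5.18xlarge": (72, 3.06),
}

def estimate(base_minutes=60.0, base_cores=16, serial_fraction=0.15):
    serial = base_minutes * serial_fraction
    parallel = base_minutes - serial
    for name, (cores, price) in CANDIDATES.items():
        wall = serial + parallel * base_cores / cores
        cost = wall / 60.0 * price
        print(f"{name}: ~{wall:.0f} min, ~${cost:.2f}/task")

estimate()

The point being that past some size the serial portion dominates, and extra cores cost money without buying wall time.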
Strictly speaking, we don't need to define the new worker type to test things: I can provision new AWS instances at will now and run things in Docker containers manually. But it's certainly easier to do it in the context of TC.
Also, we may want to wait for bug 1424376 so we don't waste money on idle instances: the workers still behave as if AWS bills hourly.
Comment 1•7 years ago (Assignee)
From a c5.18xlarge worker building Clang 6:
[task 2018-01-16T21:35:04.398Z] -- Configuring done
[task 2018-01-16T21:36:25.043Z] -- Generating done
[task 2018-01-16T21:36:25.127Z] CMake Warning:
[task 2018-01-16T21:36:25.127Z] Manually-specified variables were not used by the project:
[task 2018-01-16T21:36:25.127Z]
[task 2018-01-16T21:36:25.127Z] LIBCXX_LIBCPPABI_VERSION
[task 2018-01-16T21:36:25.127Z]
[task 2018-01-16T21:36:25.127Z]
[task 2018-01-16T21:36:25.129Z] -- Build files have been written to: /builds/worker/workspace/moz-toolchain/build/stage1/build
[task 2018-01-16T21:36:25.759Z] cd "/builds/worker/workspace/build/src/build/build-clang"
[task 2018-01-16T21:36:25.759Z] cd "/builds/worker/workspace/moz-toolchain/build/stage1/build"
[task 2018-01-16T21:36:25.759Z] ninja install
[task 2018-01-16T21:36:26.954Z] [1/3524] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/APInt.cpp.o
[task 2018-01-16T21:38:14.234Z] [3000/3524] Building CXX object tools/clang/lib/Driver/CMakeFiles/clangDriver.dir/ToolChains/Haiku.cpp.o
[task 2018-01-16T21:39:35.046Z] [3524/3524] Install the project...
[task 2018-01-16T21:39:35.052Z] -- Install configuration: "Release"
So, stage1 took ~190s. It was at ~100% CPU usage for most of that, although it definitely trailed off towards the end (likely as it finished generating .o files and moved on to linking).
Compare to https://public-artifacts.taskcluster.net/Jb_6wdWUTDqxnbaHvp1x0g/0/public/logs/live_backing.log:
[task 2018-01-16T19:39:24.099Z] -- Build files have been written to: /builds/worker/workspace/moz-toolchain/build/stage1/build
[task 2018-01-16T19:39:24.657Z] cd "/builds/worker/workspace/build/src/build/build-clang"
[task 2018-01-16T19:39:24.657Z] cd "/builds/worker/workspace/moz-toolchain/build/stage1/build"
[task 2018-01-16T19:39:24.657Z] ninja install
[task 2018-01-16T19:39:26.252Z] [1/3524] Building CXX object lib/Support/CMakeFiles/LLVMSupport.dir/APInt.cpp.o
[task 2018-01-16T19:50:00.299Z] [3000/3524] Building CXX object tools/clang/lib/Frontend/CMakeFiles/clangFrontend.dir/CacheTokens.cpp.o
[task 2018-01-16T19:53:22.800Z] [3524/3524] Install the project...
That took ~840s, so the bigger instance gives an ~11 minute speedup.
And that's only for stage 1.
Stage 2 appears to take a similar amount of wall time. It completes at:
[task 2018-01-16T21:44:03.156Z] [3083/3083] Install the project...
And stage 3 at:
[task 2018-01-16T21:48:23.858Z] [3083/3083] Install the project...
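(All of these stage timings are just deltas between the [task <timestamp>Z] prefixes in live_backing.log. A throwaway script like the following computes them; the marker strings are whatever uniquely identifies the boundary lines:)

import re
from datetime import datetime

TS = re.compile(r"\[task (\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+)Z\]")

def timestamp(line):
    m = TS.search(line)
    return datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%f") if m else None

def elapsed(log, start_marker, end_marker):
    # Seconds between the first line containing start_marker and the
    # next subsequent line containing end_marker.
    start = None
    for line in open(log):
        if start is None and start_marker in line:
            start = timestamp(line)
        elif start is not None and end_marker in line:
            return (timestamp(line) - start).total_seconds()

print(elapsed("live_backing.log", "ninja install", "Install the project"))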
It's worth noting that there are some very slow single core bottlenecks in this task:
* Subversion operations against llvm.org (glandium wants to "cache" source archives in another task's artifact)
* cmake takes ~90s per invocation to generate backend files (after the "configure"-like functionality). I'm not sure why it takes so long. It's painful enough that it might be worth our time to profile it and report/fix the problem upstream. Maybe upgrading cmake will magically make it faster?
* fetching and xz decompression of other toolchains
* xz compression for ourselves (see the sketch after this list)
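The compression bullet is probably the cheapest to fix: xz can parallelize compression itself, assuming the workers have xz >= 5.2. (Decompression in xz is single-threaded, so the fetch side needs parallelism across archives instead.) A sketch:

import subprocess

def compress(path):
    # -T0 runs one compression thread per core (xz >= 5.2). Input is
    # split into blocks, so this scales with core count at a small
    # cost in compression ratio.
    subprocess.run(["xz", "-T0", "-9", path], check=True)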
But even with those bottlenecks, the wall time savings over the [cm][45].4xlarge are massive. On this c5.18xlarge, the task took ~25 minutes instead of ~60 minutes. We can probably shave another 10 minutes by improving the single core bottlenecks.
Comment 2•7 years ago (Assignee)
GCC 6's build system is quite obviously not as efficient as Clang's:
real 24m15.192s
user 149m1.928s
sys 5m15.660s
This is end-to-end time, including a Mercurial clone, toolchain downloads, xz compression, etc.
Chalk the CPU inefficiency up to a poorly-parallelizing GNU make build system versus CMake+Ninja. It's quite obvious from watching CPU load that GCC's build does a lot of per-directory traversal: a batch of processes spawns and CPU jumps up, then slowly tails off to ~0%, then a new batch comes in.
By contrast, an existing task (https://tools.taskcluster.net/task-inspector/#QRxcSFCTRzuVyE9ZcnO9Tg) took ~34 minutes.
While the c5.18xlarge is faster, it is a waste for GCC 6 tasks because too many cores sit idle for too much of the time. I rarely saw CPU usage approach 100%; I don't think it hit 90% once. Clang, by contrast, was at ~100% for dozens of seconds at a stretch, multiple times during its build. A c5.9xlarge would likely be the sweet spot for GCC tasks.
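The sawtooth is easy to see with a crude sampler running next to the build. A sketch, assuming psutil is installed in the image (it isn't by default):

import time
import psutil  # third-party module; an assumption, not in the stock image

# Print overall CPU utilization once a second. On the GCC build this
# shows the pattern described above: a spike as make recurses into a
# directory, a slow decay towards 0%, then the next batch.
for _ in range(600):
    pct = psutil.cpu_percent(interval=1)  # blocks for the 1s sample window
    print(f"{time.strftime('%H:%M:%S')} {pct:5.1f}%")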
Comment 3•7 years ago (Assignee)
And with Clang 3.9:
real 19m1.485s
user 301m7.308s
sys 15m53.840s
~6 minutes of that was xz.
By comparison, https://tools.taskcluster.net/task-inspector/#SOKOIfXtS6urJ--jtWpOAw took 34 minutes. So not as big a win as Clang 6. But once our single-threaded bottlenecks are optimized, this build, like Clang 6, appears highly CPU efficient and worth throwing cores at.
Comment 4•7 years ago (Assignee)
The docker-worker changes to move us off hourly billing have now been deployed, so idle instances will linger for only 15 minutes before terminating. That should reduce our exposure to being billed for expensive instances sitting idle.
dustin: would you be willing to create a new AWS worker type for us?
I'm not sure of the names, but I think we'll want two flavors: beefy and super beefy.
beefy: c4.8xlarge, c5.9xlarge. Maybe add m4.10xlarge and m5.12xlarge with a higher utility factor (more cores but more expensive)
super beefy: m4.16xlarge, c5.18xlarge. Maybe add m5.24xlarge with a higher utility factor.
We'll likely only use "beefy" for now. I'd like to have a worker type with 64 cores available to facilitate testing if nothing else. Although post bug 1392370, we could likely deploy Clang toolchain tasks to 64 core machines and not feel too guilty about wasting cores...
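For concreteness, the aws-provisioner-v1 definition for "beefy" would be shaped roughly like this. Field names are from memory and illustrative only; the worker type name is a placeholder, and the utility factors express how much extra we're willing to pay per unit of capacity:

BEEFY = {
    "workerType": "gecko-1-b-beefy",  # placeholder name, not yet decided
    "minCapacity": 0,
    "maxCapacity": 20,
    "instanceTypes": [
        {"instanceType": "c4.8xlarge", "capacity": 1, "utility": 1.0},
        {"instanceType": "c5.9xlarge", "capacity": 1, "utility": 1.0},
        # more cores but more expensive, so weighted as worth more:
        {"instanceType": "m5.12xlarge", "capacity": 1, "utility": 1.3},
    ],
}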
Depends on: 1392370
Flags: needinfo?(dustin)
Comment 5•7 years ago
I don't think I'm the best person for that -- I'm a little out of touch with how the deployments work. Maybe Wander can help?
Flags: needinfo?(dustin)
Comment 6•7 years ago
gecko-L-toolchain and gecko-L-toolchain-huge were created.
Comment 7•7 years ago
(In reply to Gregory Szorc [:gps] from comment #1)
> * fetching and xz decompression of other toolchains
I don't know if toolchain tasks use `mach artifact toolchain`, but I filed bug 1421734 not long ago when looking at the setup overhead of some task. It didn't look like it'd be hard to make that code work in parallel, which should be a decent win when we have enough cores to decompress all the packages in parallel. (I assume we can download as much as we want in parallel from S3 without maxing anything out.)
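Something like the following is the shape of it. This is not what `mach artifact toolchain` does today, just a sketch of the parallel approach bug 1421734 suggests, with hypothetical URLs:

import subprocess
import urllib.request
from concurrent.futures import ThreadPoolExecutor

def fetch_and_extract(url, dest="."):
    filename = url.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(url, filename)
    # Each xz stream decompresses on a single core, but N archives
    # decompress on N cores when extracted concurrently.
    subprocess.run(["tar", "-C", dest, "-xJf", filename], check=True)

def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(fetch_and_extract, urls))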
Comment 8•7 years ago
(In reply to Gregory Szorc [:gps] from comment #1)
> * Subversion operations against llvm.org (glandium wants to "cache" source
> archives in another task's artifact)
I think we've discussed this before, but in general removing dependencies on external resources would be great for reproducibility and reliability. I remember someone (grenade?) saying that our OpenCloudConfig repo had CI that would take URLs mentioned, upload them to a blob store, and then commit the resulting hash to the repo so the file could be fetched from there instead. Something like that preserves nice developer usability (you just put the upstream URL in the config) but also removes the external dependency.
Comment 9•7 years ago (Assignee)
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #8)
> (In reply to Gregory Szorc [:gps] from comment #1)
> > * Subversion operations against llvm.org (glandium wants to "cache" source
> > archives in another task's artifact)
>
> I think we've discussed this before, but in general removing dependencies on
> external resources would be great for reproducibility and reliability. I
> remember someone (grenade?) saying that our OpenCloudConfig repo had CI that
> would take URLs mentioned, upload them to a blob store, and then commit the
> resulting hash to the repo so the file could be fetched from there instead.
> Something like that preserves nice developer usability (you just put the
> upstream URL in the config) but also removes the external dependency.
That's basically tooltool :)
We can also create tasks that securely download these artifacts and make them TaskCluster artifacts. If we e.g. change the revision of Clang we build, we'll store clang.tar.xz (or whatever) as an artifact the first time we schedule things.
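i.e. a task whose payload is shaped like this. URL, digest, and artifact path here are hypothetical; the digest would be pinned in-tree so the download is tamper-evident:

import hashlib
import urllib.request

URL = "https://releases.llvm.org/6.0.0/llvm-6.0.0.src.tar.xz"  # hypothetical
EXPECTED_SHA256 = "<digest pinned in-tree>"                    # hypothetical
OUT = "/builds/worker/artifacts/llvm-6.0.0.src.tar.xz"

urllib.request.urlretrieve(URL, OUT)
digest = hashlib.sha256(open(OUT, "rb").read()).hexdigest()
if digest != EXPECTED_SHA256:
    raise SystemExit(f"digest mismatch: got {digest}")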
Comment 10•7 years ago (Assignee)
(In reply to Wander Lairson Costa [:wcosta] from comment #6)
> gecko-L-toolchain and gecko-L-toolchain-huge were created.
The "L" here is supposed to be a placeholder for {1, 2, 3}. So we actually need 3 variations of each worker type.
Also, we may not use these worker types for just toolchain tasks. We currently run these tasks on gecko-{{level}}-b-linux.
How about the following for the worker names:
gecko-{{level}}-b-xlarge
gecko-{{level}}-b-xxlarge
Also, I ran into scopes problems with a try task on these worker types. e.g. https://public-artifacts.taskcluster.net/D0o-KF3iSXacESfcO7i4Tg/0/public/logs/live_backing.log. If we name the workers gecko-{{level}}-*, I /think/ the scopes might magically sort themselves out?
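(For reference: a granted scope satisfies a required one if it matches exactly, or if it ends in "*" and the required scope starts with the part before the "*", which is why the naming matters. A minimal sketch; the granted scope below is my guess at what a level-1 repo holds, not a verified value:)

def satisfies(granted: str, required: str) -> bool:
    # Exact match, or trailing-star prefix match.
    if granted == required:
        return True
    return granted.endswith("*") and required.startswith(granted[:-1])

granted = "queue:create-task:aws-provisioner-v1/gecko-1-*"  # assumed scope
print(satisfies(granted, "queue:create-task:aws-provisioner-v1/gecko-1-b-xlarge"))  # True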
Flags: needinfo?(wcosta)
Comment 11•7 years ago
I created gecko-1-b-xlarge and gecko-1-b-xxlarge. If they are ok, I will create levels 2 and 3 on Monday.
Flags: needinfo?(wcosta)
Comment 12•7 years ago (Assignee)
Thanks for the worker definitions!
I should have said that "linux" belongs in the worker name somewhere, e.g. gecko-{level}-b-linux-xlarge. Sorry about that.
Anyway, tasks ran on these worker types properly! See try pushes at https://treeherder.mozilla.org/#/jobs?repo=try&revision=607cbc9486b19ae97f0264470c0c01a13cced215 and https://treeherder.mozilla.org/#/jobs?repo=try&revision=7a06fbe140a11f722f95e6245dbed316b9601866. So, I think we can proceed with creating workers for other levels.
Our final names should be:
gecko-1-b-linux-xlarge
gecko-1-b-linux-xxlarge
gecko-2-b-linux-xlarge
gecko-2-b-linux-xxlarge
gecko-3-b-linux-xlarge
gecko-3-b-linux-xxlarge
Flags: needinfo?(wcosta)
Comment 13•7 years ago
I created them as gecko-L-b-linux-large and gecko-L-b-linux-xlarge (i.e. gecko-L-b-linux-[x]large) because the xxlarge suffix exceeds the maximum length for worker type names.
I also created a PR [1] to add these new worker types to the docker-worker update list.
[1] https://github.com/taskcluster/docker-worker/pull/376
Flags: needinfo?(wcosta)
Comment 14•7 years ago
Commit pushed to master at https://github.com/taskcluster/docker-worker
https://github.com/taskcluster/docker-worker/commit/1c44b9d390c6c608e1fe446d4a96907703f9102f
Bug 1430878: Add gecko-L-b-linux-[x]large worker types to update list
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment 17•7 years ago (mozreview-review)
Comment on attachment 8952509 [details]
Bug 1430878 - Use larger EC2 instances for Clang toolchain tasks;
https://reviewboard.mozilla.org/r/221724/#review227734
Attachment #8952509 -
Flags: review?(mh+mozilla) → review+
Comment 18•7 years ago
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f0e351b32622
Use larger EC2 instances for Clang toolchain tasks; r=glandium
Comment 19•7 years ago (Assignee)
wcosta: We had a worker misnamed in AWS Provisioner: gecko-3-linux-b-xlarge was created instead of gecko-3-b-linux-large. Jonas created a new worker as a copy last night. While he said he would clean up the old worker today, I figured I'd needinfo you in case you want to perform any additional auditing.
Flags: needinfo?(wcosta)
Comment 20•7 years ago (bugherder)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Comment 21•7 years ago
(In reply to Gregory Szorc [:gps] from comment #19)
> wcosta: We had a worker misnamed in AWS Provisioner: gecko-3-linux-b-xlarge
> was created instead of gecko-3-b-linux-large. Jonas created a new worker as
> a copy last night. While he said he would clean up the old worker today, I
> figured I'd needinfo you in case you want to perform any additional auditing.
It feels like everything is good. Thanks for the heads up.
Flags: needinfo?(wcosta)
Updated•7 years ago
Product: TaskCluster → Firefox Build System
Updated•6 years ago
Assignee: nobody → gps