Closed
Bug 1469398
Opened 6 years ago
Closed 6 years ago
Windows AMI support for c5 instances
Categories
(Infrastructure & Operations :: RelOps: OpenCloudConfig, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: gps, Assigned: grenade)
References
(Blocks 1 open bug)
Details
Amazon c5 and m5 instances have faster CPUs and are cheaper. The new c5d and m5d variants have local NVMe storage, and their I/O is substantially faster than what you can get via EBS. We want to use these 5th generation instances throughout CI, and on Linux we already do.
However, if you attempt to run a c5 or m5 instance with the latest Windows AMIs we use for build tasks (I believe these are based on Windows Server), you receive the following error message:
InvalidParameterCombination: Enhanced networking with the Elastic Network Adapter (ENA) is required for the 'c5.4xlarge' instance type. Ensure that you are using an AMI that is enabled for ENA.
This error is likely due to the AMI not having the ENA driver installed. Enabling the ENA driver should hopefully be as simple as installing the driver from the zip file linked from https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/enhanced-networking-ena.html.
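One way to confirm whether an AMI is the problem is to inspect its ENA flag before launching. A minimal sketch, assuming the image description has the shape returned by the EC2 DescribeImages API (where ENA-enabled images carry an `EnaSupport` boolean); the helper name and the sample data are hypothetical:

```python
def ena_ready(image: dict) -> bool:
    """Return True if an EC2 image description reports ENA support.

    `image` is assumed to be one entry of the "Images" list from an
    EC2 DescribeImages response. A missing key is treated as "not enabled".
    """
    return bool(image.get("EnaSupport", False))


# Hypothetical sample descriptions, not real AMIs.
images = [
    {"ImageId": "ami-old-example", "Platform": "windows"},
    {"ImageId": "ami-new-example", "Platform": "windows", "EnaSupport": True},
]

usable_on_c5 = [img["ImageId"] for img in images if ena_ready(img)]
print(usable_on_c5)
```

The flag on the image is separate from the driver being installed inside it, so a check like this only tells you whether EC2 will accept the launch request.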
Once instances have ENA support, it is quite possible that they'll fail to mount EBS volumes. This is because c5 instances expose EBS volumes as NVMe devices, at least on Linux. I'm not sure about Windows.
Anyway, I'm filing this bug against Taskcluster Operations so we have a dedicated bug on file to support the 5th generation instance types with our Windows AMIs.
needinfo to fubar to triage the request.
Flags: needinfo?(klibby)
Reporter | ||
Comment 1•6 years ago
|
||
Adding needinfo on coop in case triage falls on his team. (Ownership of OpenCloudConfig is a bit nebulous.)
Not running c5 and c5d instances on Windows is likely costing us *minutes* per build in Firefox CI due to I/O. That itself likely translates to a few thousand dollars in EC2 costs per year due to extra machine time. When you factor in that c5 instances are cheaper than c4 instances *and* have faster CPUs, we're likely talking several thousand to $10,000+ in annual savings from this switch. And there's a good chance this bug is the lone blocker.
Flags: needinfo?(coop)
Comment 2•6 years ago
|
||
It's not coop, it's on us, sorry. We're working on a new process to repeatably build AMIs, but have been stalled because of issues in MDC2.
Flags: needinfo?(coop)
Reporter | ||
Comment 3•6 years ago
|
||
Thanks for the update!
Correct me if I'm wrong, but aren't there 2 phases to AMI generation (base and top)? If OCC runs as part of the top-most AMI layer, presumably it could install the ENA driver. As long as AMI generation isn't running on a c5 instance (it shouldn't need to), I would think we could install the ENA driver in the top-most AMI and this may "just work."
Comment 4•6 years ago
|
||
Rob, what do you think? Can you take a look and see if you can make this go in our current state?
Flags: needinfo?(klibby) → needinfo?(rthijssen)
Assignee | ||
Comment 5•6 years ago
|
||
yep. i'll have a go.
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Component: Operations → Relops: OpenCloudConfig
Flags: needinfo?(rthijssen)
Product: Taskcluster → Infrastructure & Operations
QA Contact: rthijssen
Assignee | ||
Comment 6•6 years ago
|
||
testing on gecko-1-b-win2012-beta:
https://github.com/mozilla-releng/OpenCloudConfig/commit/0f2a565
Assignee | ||
Comment 7•6 years ago
|
||
testing firefox build on c5.4xlarge workers:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce8c079
Assignee | ||
Comment 8•6 years ago
|
||
beta builds succeeded.
promoting to gecko-1-b-win2012 & gecko-2-b-win2012:
https://github.com/mozilla-releng/OpenCloudConfig/commit/562946e
https://tools.taskcluster.net/groups/DlYOexCVRFCgkztF9et0xA
Assignee | ||
Comment 9•6 years ago
|
||
if we've no issues over the next few hours, i'll deploy to l3 builders tomorrow morning (utc).
https://github.com/mozilla-releng/OpenCloudConfig/pull/159
Comment 10•6 years ago
|
||
(In reply to Rob Thijssen (:grenade UTC+2) from comment #9)
> if we've no issues over the next few hours, i'll deploy to l3 builders
> tomorrow morning (utc).
> https://github.com/mozilla-releng/OpenCloudConfig/pull/159
Reminder that tomorrow is a holiday in the US; I don't think that should necessarily change your plan, but it won't be a normal weekday in many regards.
Assignee | ||
Comment 11•6 years ago
|
||
we had problems with provisioning. it seems we can't get enough c5.4xlarge spot instances (this instance type is also used by linux builders). i have changed the aws provisioner config for gecko-1-b-win2012 so that it keeps the new amis (with ena drivers installed) but goes back to c4.4xlarge instance types until the provisioning issue is resolved.
Reporter | ||
Comment 12•6 years ago
|
||
On the Linux builders, we define a mix of instance types with different utility factors:
c5d.4xlarge: 1.3
m5d.4xlarge: 1.2
c5.4xlarge: 1.1
m5.4xlarge: 1.1
c4.4xlarge: 1.0
m4.4xlarge: 0.9
From https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=autoland,1444817,1,2&series=autoland,1582195,1,2&series=autoland,1691781,1,2&series=autoland,1444828,1,2&series=autoland,1618745,1,2&series=autoland,1697674,1,2, you can see that a healthy mix of these instances is provisioned in the wild.
Using a mix of instance types gives the provisioner access to additional instance types in case one instance type is not available or is prohibitively expensive. May I suggest applying this approach to the Windows workers as well?
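A rough sketch of why a mix helps, assuming the provisioner ranks instance types by spot price divided by utility. That ranking rule and all prices below are illustrative assumptions, not taken from the provisioner's code or from real spot markets; only the utility factors come from the list above:

```python
from typing import Optional

# Utility factors quoted above for the Linux builders.
UTILITY = {
    "c5d.4xlarge": 1.3,
    "m5d.4xlarge": 1.2,
    "c5.4xlarge": 1.1,
    "m5.4xlarge": 1.1,
    "c4.4xlarge": 1.0,
    "m4.4xlarge": 0.9,
}


def best_type(spot_prices: dict) -> Optional[str]:
    """Pick the type with the lowest price per unit of utility.

    `spot_prices` maps instance type -> current price, or None when no
    capacity is available. Unavailable or unknown types are skipped.
    """
    candidates = {
        name: price / UTILITY[name]
        for name, price in spot_prices.items()
        if price is not None and name in UTILITY
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Made-up prices: c5 is cheapest per unit of utility while available.
prices = {"c5.4xlarge": 0.30, "c4.4xlarge": 0.32, "m4.4xlarge": 0.31}
print(best_type(prices))

# If c5 capacity dries up, the choice falls back to another type.
prices["c5.4xlarge"] = None
print(best_type(prices))
```

With a single instance type in the mapping, the second call would have nothing to fall back to, which is the situation a mixed configuration is meant to avoid.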
Note that the c5d and m5d instance types have local/ephemeral NVMe storage available as a single drive. That replaces the use of an EBS volume. However, the Windows builders are using 2 EBS volumes and tasks actively use both volumes. To fully realize the benefits of the c5d and m5d instance types, we'll want tasks to use a single volume for task data and caches. I believe this will require a bit of work. So it's probably best to ignore the c5d and m5d instance types for Windows builders at this time. We can sort that out in bug 1462528 or in a derivative of it. i.e. we'll want the Windows builders to be a mix of c5.4xlarge, m5.4xlarge, c4.4xlarge, and m4.4xlarge.
Reporter | ||
Comment 13•6 years ago
|
||
Regarding the performance of the c5.4xlarge instance type... it is only 1 sample point, but it looks real.
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=try,1713271,1,2&series=try,1460207,1,2
Previously, the best time for a Windows 64 opt build was ~2000s. This build clocked in at ~1875s. So roughly a 2 minute speedup.
Honestly, this isn't as significant as I was hoping for. And there appeared to be no improvement with Mercurial operation times either. This is... disappointing. Especially the lack of improvement with I/O as measured by Mercurial. I thought for sure we'd see a significant win there, as the c5 instances are supposed to have better EBS performance.
Despite the apparent lackluster performance gains, the 5th generation instances are faster and cheaper, so it is still a good idea to press on. And having the compatible AMI will enable us to experiment with c5d and m5d instances, which will almost certainly demonstrate significant I/O wins.
Assignee | ||
Comment 14•6 years ago
|
||
(In reply to Gregory Szorc [:gps] from comment #12)
> Using a mix of instance types gives the provisioner access to additional
> instance types in case one instance type is not available or is
> prohibitively expensive. May I suggest applying this approach to the Windows
> workers as well?
yes, sounds very reasonable. i hope i've understood the configuration properly. does this look sane:
"instanceTypes": [
  {
    "instanceType": "c4.4xlarge",
    "utility": 0.9,
    ...
  },
  {
    "instanceType": "c5.4xlarge",
    "utility": 1,
    ...
  }
],
i'm hoping that the provisioner will see `utility: 1` and spot request c5.4xlarge, falling back to c4.4xlarge (`utility: 0.9`) if it can't get enough of the ena instances. or have i misinterpreted how that works?
i'm in the process of rewriting https://github.com/mozilla-releng/OpenCloudConfig/blob/master/ci/update-workertype.sh (bug 1441402, bug 1460535) as a python module. this script is what currently manages the windows worker type configurations. as a bash script, it's difficult to support multiple instance types with readable code, so hopefully in python, it will be easier to support all of the available instance types.
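As a sketch of what multi-type support could look like in python, the snippet below builds the `instanceTypes` list from a single mapping. Only the `instanceType` and `utility` keys come from the config quoted above. The function name, the `shared` parameter, and the utility values (borrowed from the Linux builders in comment 12) are assumptions for illustration, not the actual OCC module:

```python
import json


def build_instance_types(utilities: dict, shared: dict) -> list:
    """Build one provisioner entry per instance type.

    `shared` holds whatever per-type settings are identical across
    types (elided here); each entry gets a shallow copy of it.
    Entries are ordered from highest to lowest utility.
    """
    return [
        {"instanceType": name, "utility": utility, **shared}
        for name, utility in sorted(
            utilities.items(), key=lambda item: item[1], reverse=True
        )
    ]


# Utility values assumed from the Linux builders; tune for Windows.
windows_builders = {
    "c5.4xlarge": 1.1,
    "m5.4xlarge": 1.1,
    "c4.4xlarge": 1.0,
    "m4.4xlarge": 0.9,
}

entries = build_instance_types(windows_builders, shared={})
print(json.dumps(entries, indent=2))
```

Keeping the per-type differences in one mapping means adding or dropping an instance type is a one-line change rather than another hand-written JSON stanza.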
Assignee | ||
Comment 15•6 years ago
|
||
Assignee | ||
Comment 16•6 years ago
|
||
Assignee | ||
Comment 17•6 years ago
|
||
(In reply to Rob Thijssen (:grenade UTC+2) from comment #14)
> i'm hoping that the provisioner will see `utility: 1` and spot request
> c5.4xlarge, falling back to c4.4xlarge (`utility: 0.9`) if it can't get
> enough of the ena instances. or have i misinterpreted how that works?
seems to work. the jobs on gecko-1-b-win2012-beta triggered instantiation of both c4.4xlarge & c5.4xlarge instances.
Reporter | ||
Comment 18•6 years ago
|
||
Are there plans to finish this rollout?
FWIW Perfherder build metrics show a clear performance win with c5 instances:
https://treeherder.mozilla.org/perf.html#/graphs?series=try,1460207,1,2&series=try,1713271,1,2 (~300s faster on average)
I'm keen to realize these speedups outside of Try!
Flags: needinfo?(rthijssen)
Assignee | ||
Comment 19•6 years ago
|
||
thanks for the follow up.
the gecko-3-b-win2012 rollout is now complete.
https://github.com/mozilla-releng/OpenCloudConfig/commit/d1969a3a
https://tools.taskcluster.net/groups/dKeYmTbpScev4M034CDh9A/tasks/BuE3yhxbRdq9nPRtDkKCDQ/runs/0/logs/public%2Flogs%2Flive.log
i'll leave the bug open until i've updated occ to also update the aws provisioner config (this is being handled manually now) with the extra instance types.
Flags: needinfo?(rthijssen)
Reporter | ||
Comment 20•6 years ago
|
||
Thank you, Rob!
We have limited data, but the build time improvements are already pretty clear.
A ~300s speedup in opt builds:
https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1460301,1,2&series=autoland,1724153,1,2
An ~800s speedup in pgo builds:
https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1460459,1,2&series=autoland,1724384,1,2
(This graph is wonky because we rolled out c5's around the same time thinLTO landed and thinLTO made builds much slower. But it looks like the c5's effectively offset the build time loss from thinLTO!)
Developers and sheriffs should notice these speedups. Especially people who have tight iteration loops.
Assignee | ||
Comment 21•6 years ago
|
||
the occ ci updates were dealt with in bug 1480108 and occ commit https://github.com/mozilla-releng/OpenCloudConfig/commit/331e1ff7bbfd2f309d7520be7833792af7db2eee
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED