Open Bug 1462528 Opened 6 years ago Updated 2 years ago

Investigate and potentially use c5d instance types

Categories

(Firefox Build System :: Task Configuration, task, P2)

3 Branch
task

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

References

(Blocks 1 open bug)

Details

https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/

The new c5d instance types offer local NVMe storage (as opposed to using EBS for storage). The amount of local storage increases as you use beefier instance types. The c5d.4xlarge has 450 GB of local storage. We're currently using c5.4xlarge for build workers. Our current EBS volumes are actually smaller (120 GB) than what we'd get with c5d.4xlarge!

Spot prices for c5d are currently lower than c5. But regular instances are a tad more expensive. (Spot is probably lower due to ~0 demand at present.) When factoring in cost, you need to consider that you don't have to pay for EBS volumes with c5d instances. We currently mount a 120 GB EBS volume + 8 GB for the root volume on build workers using gp2 storage. EBS gp2 is $0.10/GB-month.
It looks like the root volume still needs to be provisioned on EBS volumes. So ignore the 8 GB EBS volume savings per instance when doing cost calculations.
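The per-instance savings implied by the numbers above can be sketched out; the prices and volume sizes come from the comments in this bug, and `monthly_ebs_cost` is a hypothetical helper, not a billing API:

```python
# Back-of-envelope EBS savings per build worker when moving c5 -> c5d.
# $0.10/GB-month gp2 rate and 120 GB scratch volume are from this bug.

GP2_PRICE_PER_GB_MONTH = 0.10  # USD

def monthly_ebs_cost(volume_gb: float) -> float:
    """Monthly gp2 cost for a single EBS volume of `volume_gb` GB."""
    return volume_gb * GP2_PRICE_PER_GB_MONTH

# The 120 GB scratch volume goes away with c5d. The 8 GB root volume
# stays on EBS either way, so it is excluded from the savings.
savings = monthly_ebs_cost(120)
print(savings)  # 12.0 (USD/month per instance)
```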
The I/O on these instances is *substantially* better than EBS volumes - at least compared to the provisioned I/O we get with 120 GB volumes. Mercurial operations appear to be at least 50% faster. This is the first time I've seen I/O in AWS come close to matching what the SSD in my development machine can accomplish. These instances look *very* promising.

I /think/ if we configure a docker-worker based worker type to use this instance type and remove the EBS volume from the worker config, things will "just work."

Note: we currently use a mix of m4/c4/m5/c5 instances. And there are no m5d instance types (yet). Since the EBS volumes in the worker config are defined as part of the "launch spec" and the launch spec is shared across all EC2 instance types, we'd need to switch exclusively to c5d instance types in order to use them without EBS volumes. That's kind of unfortunate: we do benefit from specifying multiple EC2 instance types and bidding on the cheapest one at a given time.
Priority: -- → P2
It's worth noting that with local storage, you don't have to deal with any of the complexities of EBS:

* Initial I/O to a block is slow (bug 1305174)
* Provisioned IOPS
* IOPS accumulation/balance

Instead, you just have to deal with contention from whatever other tenants are on that shared hardware. And presumably the hypervisor will reasonably guarantee equal access when loads are high. I'm tentatively *very* willing to make that trade-off.
I hooked up the gecko-1-b-linux-gps worker type to use c5d.4xlarge instances. The results are... very promising!

A c5.4xlarge does a full clone + checkout on a 120 GB EBS volume in ~190s. The c5d.4xlarge does it in 90-100s. This operation is very I/O intensive - writing a few hundred thousand files essentially as fast as the filesystem will keep up. So the fact this is ~2x faster with local NVMe storage is saying something!

Builds themselves also appear to be a bit faster. Maybe a minute or two. But the difference isn't as pronounced. Probably because most parts of builds aren't I/O bound as much as VCS is. But I did appear to record the fastest build time ever recorded on a .4xlarge instance with a c5d. So that says something.
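For the record, the "~2x faster" claim above follows directly from the measured times:

```python
# Speedup implied by the clone+checkout numbers in the previous comment.
ebs_seconds = 190          # c5.4xlarge + 120 GB gp2 EBS volume
nvme_seconds = (90, 100)   # c5d.4xlarge local NVMe, observed range

speedups = [round(ebs_seconds / s, 2) for s in nvme_seconds]
print(speedups)  # [2.11, 1.9] -- i.e. roughly 2x
```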
I switched gecko-1-decision workers to use c5d.xlarge instances. As part of this, the BlockDeviceMappings section was removed since EBS volumes are no longer needed. ~6 decision tasks have successfully executed on new workers backed by c5d.xlarge instances. So it appears everything "just worked."

For the record, they were previously m5.xlarge and c5.xlarge instances. The BlockDeviceMappings value was:

  "BlockDeviceMappings": [
    {
      "DeviceName": "/dev/xvdb",
      "Ebs": {
        "DeleteOnTermination": true,
        "VolumeSize": 120,
        "VolumeType": "gp2"
      }
    }
  ],

We can currently copy settings from gecko-[2,3]-decision if we need to restore things. At least until those workers are switched to c5d instances as well. I'll let gecko-1-decision bake for a little bit first.
There appear to have been no issues with the c5d instances on gecko-1-decision, so I gave the same treatment to gecko-2-decision and gecko-3-decision. I also updated gecko-1-images to c5d as well. I'll revert that if my Try push running tasks on that worker runs into issues.
gecko-1-images worked well with the c5d instances. So I converted gecko-2-images and gecko-3-images to match.

The old worker definition was using a 160 GB EBS volume and a .xlarge instance type. The c5d.xlarge only has 100 GB of storage. I was concerned about the worker running out of disk. So I bumped the instances to c5d.2xlarge (225 GB storage) and increased their utility factor from 1 to 2. If the 160 GB wasn't necessary and 100 GB is sufficient, we should consider reverting to .xlarge. (I'm a fan of having only a single task run on a worker at a time.)

Also, I *really* wish we had inline comments in the worker definitions and/or a changelog so you could look at history. As it stands, there's a lot of FUD when it comes to changing the worker definitions because the historical context is not easily discoverable.
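The sizing decision above can be expressed as picking the smallest c5d size whose local NVMe store covers the disk the workers need. The storage sizes are the ones quoted in this bug; `smallest_fitting` is an illustrative helper, not part of any real tooling:

```python
# Local NVMe storage per c5d size, per the figures quoted in this bug.
C5D_LOCAL_STORAGE_GB = {
    "c5d.xlarge": 100,
    "c5d.2xlarge": 225,
    "c5d.4xlarge": 450,
}

def smallest_fitting(required_gb: int) -> str:
    """Smallest c5d instance type with at least `required_gb` of storage."""
    for name, gb in sorted(C5D_LOCAL_STORAGE_GB.items(), key=lambda kv: kv[1]):
        if gb >= required_gb:
            return name
    raise ValueError(f"no c5d size holds {required_gb} GB")

print(smallest_fitting(160))  # c5d.2xlarge -- hence the bump from .xlarge
print(smallest_fitting(100))  # c5d.xlarge would do if 100 GB is enough
```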
Have I got a bug for you! Bug 1465851.
Dustin also enlightened me in bug 1466186 that we can in fact attach EBS volumes to only the specific instance types that need them, rather than to every instance type in the worker definition. I've tested this with gecko-1-images and it appears to work! So I'm going to repopulate gecko-{L}-decision and gecko-{L}-images with regular c5 and m5 instance types with EBS volumes so the provisioner can take advantage of cheaper instances, if available. This also means we can add c5d instances to the mix for build workers!!!
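In other words, the worker type can carry a per-instance-type launch spec override so that only the non-d types get a BlockDeviceMappings entry. A rough sketch of what that might look like (field names are from memory of the aws-provisioner worker type schema and may not be exact):

```json
{
  "instanceTypes": [
    {
      "instanceType": "c5.4xlarge",
      "utility": 1,
      "launchSpec": {
        "BlockDeviceMappings": [
          {
            "DeviceName": "/dev/xvdb",
            "Ebs": {
              "DeleteOnTermination": true,
              "VolumeSize": 120,
              "VolumeType": "gp2"
            }
          }
        ]
      }
    },
    {
      "instanceType": "c5d.4xlarge",
      "utility": 1.5,
      "launchSpec": {}
    }
  ]
}
```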
I restored c5.xlarge instances with 120 GB EBS volumes into the gecko-{L}-decision and gecko-{L}-images workers. I'm not sure why we were using 160 GB EBS volumes for gecko-{L}-images and I suspect it was arbitrary. So I've changed the volumes to 120 GB.

Furthermore, I dropped gecko-{L}-images from c5d.2xlarge back to c5d.xlarge. This means they have utility 1 (like before). They also have 100 GB of local storage instead of a 120 GB volume. I don't think this will matter. I'll trigger a bunch of image builds on Try to tease out any issues.

I also added c5d.4xlarge into the gecko-1-b-linux worker at utility 1.5. I've previously tested that builds work with this instance type. So things should "just work."
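My understanding is that the provisioner bids by comparing spot price divided by utility, so the utility 1.5 above lets a c5d.4xlarge win even at a somewhat higher raw price. A sketch with illustrative (not real) spot prices:

```python
# How the utility factor shifts bidding: the provisioner effectively
# compares price-per-utility. Prices below are made up for illustration.

def price_per_utility(spot_price: float, utility: float) -> float:
    return spot_price / utility

c5 = price_per_utility(0.30, 1.0)    # hypothetical c5.4xlarge spot price
c5d = price_per_utility(0.40, 1.5)   # hypothetical c5d.4xlarge spot price
print(c5d < c5)  # True: c5d wins despite the higher raw price
```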
c5d.4xlarge seemed to be working on gecko-1-b-linux. So I also added it to gecko-2-b-linux and gecko-3-b-linux.
I also added c5d instances to gecko-{1,2,3}-b-linux-{large,xlarge} workers. I believe we can now use c5d instances in most build tasks where we were previously using c5 instances. (We still have some workers lingering on the c4 instances.)
And AWS announced m5d instances: https://aws.amazon.com/blogs/aws/ec2-instance-update-m5-instances-with-local-nvme-storage-m5d/ We'll probably want to add these to the worker definitions as well so we have 1 more instance type for the provisioner to choose from.
I just now manually updated all worker types that I had previously modified to use c5d instances to include m5d instances as well.
Version: Version 3 → 3 Branch
Severity: normal → S3