Closed Bug 1578460 Opened 5 years ago Closed 5 years ago

Evaluate the viability of migrating from packet.net to AWS bare metal

Categories

(Taskcluster :: Workers, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wcosta, Assigned: wcosta)

References

Details

Attachments

(10 files)

We have a first try push of Android tests running on AWS bare metal:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8070f42f462b059ca53e098b72f590b302adfbce

:coop could you please find someone who can check this try push? There are some tests failing, but it feels like the failures aren't related to the environment they are running in. My only option is :jmaher, but he is on PTO.

Flags: needinfo?(coop)

I did a quick comparison of the red/orange results that didn't go green on a re-run:

Red

Android 4.0 API16+ opt :: A (tier 2) - this is broken everywhere (bug 1540782)

Orange

Android 4.3 API16+ opt :: M-1proc(23) - intermittent (bug 1565119)
Android 4.3 API16+ pgo :: M-1proc(7) - intermittent (bug 1434744)
Android 4.3 API16+ pgo :: M-1proc(23) - intermittent (bug 1565119)
Android 4.3 API16+ debug :: R-1proc(R53) - intermittent (bug 1543639)
Android 4.3 API16+ debug :: M-1proc(20) - intermittent (bug 1576379)
Android 4.3 API16+ debug :: M-1proc(57) - intermittent (bug 1565119)
Android 4.3 API16+ debug :: R-1proc (tier 2) - intermittent (bug 1578043)
Android 7.0 x86-64 debug :: W(wpt2) - intermittent (bug 1577075) ***

I'm highlighting the last one -- Android 7.0 x86-64 debug :: W(wpt2) -- because despite it matching an intermittent bug, I haven't been able to get it to go green after multiple re-triggers.

The other thing we need to look at is the performance for the green/passing jobs. I'm going to loop in :bc at this point to help with that.

:bc - do you have a way to compare the results from this push (see comment #0) to historical data for each test? I paged through the Similar Jobs for a bunch of tests. They all seem to finish in about the same time, plus-or-minus a few minutes. That's hardly scientific though.

:bc - also let us know if there is a better way to structure this so you can help us more easily.

Flags: needinfo?(coop) → needinfo?(bob)

Looking now. Sorry for the delay.

Please note that the Android 4.3 tests are obsolete -- we no longer run those tests on trunk. (Also, those have always run on aws, not packet.net!)

(In reply to Geoff Brown [:gbrown] from comment #5)

Please note that the Android 4.3 tests are obsolete -- we no longer run those tests on trunk. (Also, those have always run on aws, not packet.net!)

It looks like Wander pushed with all the Android jobs using a fuzzy query, rather than triggering just the packet.net-specific tests. Sorry about that.

I had waited too long and the logs expired. wcosta did another push to try for me and I added the tests again to have a total of 5 jobs for each test. I also did a try push from autoland e859a5aebb5b with 5 jobs each for the tests.

aws-metal try push.

packetnet push to try from autoland

The test results are comparable. 5 is a pretty low number for comparison, but no new failures appeared, and it appears that aws-metal may be somewhat better.

Working on timing now. First impression is that aws-metal took much longer, but I'll have concrete numbers soon.

Remember this is with 5 runs of each test:
https://docs.google.com/spreadsheets/d/1u2I4cf97aGTouAimeC2v-pL-QJPKJaOBgZcvY-FlNx4/edit#gid=926559078

AWS metal takes 14 more hours of elapsed time to run than packet.net.

Flags: needinfo?(bob)

Discussed this with :miles and :wcosta yesterday. Here's our current state:

  • Miles was going to revisit the monopacker images we were using before. It's likely they'll need some tweaks to work in the new Firefox CI cluster.
  • To use those images, we need to recreate the AWS bare metal pool in the new Firefox CI cluster. We should file a bug with releng to get that going.
  • Once we have the instances running, we'll want to re-validate that the tests still work. Bob's mach try fuzzy logic from comment #7 can help with this.
  • Once we know things are still working, we'll want to up the concurrency of workers per instance. We're running with 4 workers/instance in packet.net, so that's the first target.
  • Try scaling up the concurrency as far as we can get away with. Our packet.net instances only have 4 cores, whereas something like c5.metal in AWS has 96 vCPUs. While we probably can't get away with a concurrency of 96, greater than 4 seems likely.
  • Based on our final concurrency numbers, we should evaluate whether the migration makes sense cost-wise. In fact, we can calculate our cost efficiency threshold per AWS instance type at any point: we know what we pay right now per instance in packet.net, and we can use the AWS calculator to ballpark our break-even point in terms of concurrency (see the sketch after this list). Note that this doesn't need to be a perfect mapping; I'm willing to pay (slightly) more in AWS if it means eliminating the management headache of packet.net.
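For illustration, a minimal sketch of that break-even calculation. The prices below are placeholders, not figures from the AWS calculator or our packet.net contract:

```python
import math

# Placeholder figures for illustration only -- not actual contract or calculator prices.
PACKET_COST_PER_WORKER_HOUR = 0.10   # hypothetical packet.net cost per worker-hour
AWS_INSTANCE_COST_PER_HOUR = 4.00    # hypothetical hourly price of one metal instance

def break_even_concurrency(instance_cost_per_hour: float,
                           target_cost_per_worker_hour: float) -> int:
    """Minimum workers per instance for the AWS per-worker cost to match packet.net."""
    return math.ceil(instance_cost_per_hour / target_cost_per_worker_hour)

workers = break_even_concurrency(AWS_INSTANCE_COST_PER_HOUR, PACKET_COST_PER_WORKER_HOUR)
print(f"Need at least {workers} workers per instance to break even")
```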

:miles, :wcosta - can I get you to turn that list into dependent bugs and action items, please?

Flags: needinfo?(wcosta)
Flags: needinfo?(miles)

I've confirmed that our recent images are working on metal instances. I baked some fresh images here:

us-east-1: ami-0ccb2b4c4a0694e01
us-west-1: ami-0601039bc3bee6675
us-west-2: ami-00885e2f54435445e

I'll create blocker bugs for this one.

Flags: needinfo?(wcosta)
Flags: needinfo?(miles)
Depends on: 1596948

Last week I tried another approach: I instantiated a bare metal machine by hand and ran the same scripts I used to provision packet, which would make the instance configuration identical. Unfortunately, the kernel on bare metal doesn't ship with the snd-aloop and V4L2 modules that docker-worker needs.
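As a rough illustration of the check involved (not the actual provisioning script), something like the following could confirm whether a kernel ships the loopback modules docker-worker needs; the module names, in particular v4l2loopback, are assumptions on my part:

```python
import subprocess

# Assumed module names: snd-aloop for ALSA loopback, v4l2loopback for V4L2 devices.
REQUIRED_MODULES = ["snd-aloop", "v4l2loopback"]

def module_loadable(name: str) -> bool:
    """Return True if modprobe can resolve and (dry-run) load the module."""
    result = subprocess.run(["modprobe", "--dry-run", name],
                            capture_output=True, text=True)
    return result.returncode == 0

for module in REQUIRED_MODULES:
    print(f"{module}: {'available' if module_loadable(module) else 'MISSING'}")
```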

Assignee: nobody → coop
Assignee: coop → nobody
Assignee: nobody → miles

Here's the recent push that Wander did:

https://treeherder.mozilla.org/#/jobs?repo=try&searchStr=geckoview%2Candroid%2C7.0&revision=af5b54c22e150b1dcde4137e06b0b7ecbb0f56eb&selectedJob=279000057

He has helpfully collated the perf results here:

https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=0

Note that this is a comparison between two single runs, but it illustrates that overall perf is not that different (~5%).

Assignee: miles → wcosta
Status: NEW → ASSIGNED

We use m5.metal and c5.metal instance types, since their availability in
the spot market is higher than their 5d counterparts.

m5 has more RAM compared to c5, so it can run more parallel tasks.

This uses monopacker AWS images without an official CoT key.

GeckoView tasks require privileged containers.

(In reply to Wander Lairson Costa from comment #18)

Created attachment 9126693 [details]
Bug 1578460: enable run privileged tasks in baremetal r=jlorenzo

GeckoView tasks require privileged containers.

I would very strongly prefer not having any workers with privileged enabled in the firefox-ci cluster

Flags: needinfo?(wcosta)

(In reply to Tom Prince [:tomprince] from comment #19)

(In reply to Wander Lairson Costa from comment #18)

Created attachment 9126693 [details]
Bug 1578460: enable run privileged tasks in baremetal r=jlorenzo

GeckoView tasks require privileged containers.

I would very strongly prefer not having any workers with privileged enabled in the firefox-ci cluster

Privileged tasks are a requirement for GeckoView to run the emulator correctly; that's how they have been running in packet since the beginning. The alternative is to move them to generic-worker.

Flags: needinfo?(wcosta)

(In reply to Wander Lairson Costa from comment #20)

Privileged tasks are a requirement for GeckoView to run the emulator correctly; that's how they have been running in packet since the beginning. The alternative is to move them to generic-worker.

As Wander indicates, we are already running privileged tasks in docker for GeckoView in packet.net. I'm going to argue that moving all workloads from packet.net back into AWS is a net security win because it's one less cloud provider to secure.

We can certainly look at improving the security story of these tasks as a second step. Running them using task user separation under generic-worker is an obvious choice, but we could also experiment with podman, rkt, etc.

Are we still blocked on privileged mode?

I think I introduced use of privileged at packet.net specifically to access the kvm device from the docker container -- necessary for hardware acceleration of the android x86 emulator. kvm is essential for the emulator used for all the geckoview tests. However, :snorp points out that, in modern versions of docker, the --device argument should be able to provide kvm access without --privileged. A way forward?
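To illustrate the distinction :snorp describes, a minimal sketch (the image name and invocation are placeholders, not the actual docker-worker setup): instead of granting everything with --privileged, only the KVM device node is exposed to the container.

```python
import os
import subprocess

IMAGE = "example/android-emulator-test"  # placeholder image name

# Old approach: full --privileged grants all capabilities and host devices.
privileged_cmd = ["docker", "run", "--rm", "--privileged", IMAGE]

# Narrower approach: expose only /dev/kvm for emulator hardware acceleration.
device_cmd = ["docker", "run", "--rm", "--device", "/dev/kvm", IMAGE]

if os.path.exists("/dev/kvm"):
    subprocess.run(device_cmd, check=True)
else:
    print("/dev/kvm not present on this host; hardware acceleration unavailable")
```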

(In reply to Geoff Brown [:gbrown] from comment #22)

Are we still blocked on privileged mode?

We shouldn't be, per comment #21.

Wander: what are we blocked on? Do you have an update?

Flags: needinfo?(wcosta)

(In reply to Geoff Brown [:gbrown] from comment #22)

Are we still blocked on privileged mode?

I think I introduced use of privileged at packet.net specifically to access the kvm device from the docker container -- necessary for hardware acceleration of the android x86 emulator. kvm is essential for the emulator used for all the geckoview tests. However, :snorp points out that, in modern versions of docker, the --device argument should be able to provide kvm access without --privileged. A way forward?

We recently added support for /dev/kvm in docker-worker. I will investigate the feasibility of removing the privileged requirement.

Flags: needinfo?(wcosta)

These images add a number of fixes and improvements for baremetal
machines.

This updates the worker pool to allow workers to run at full capacity.

Baremetal machines run up to 36 tasks in parallel, and measurements show
this imposes a bottleneck on disk writes. To mitigate this, instead of
configuring one single disk, we build the volume from several smaller
disks.
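Purely as an illustration of the multi-disk idea (the real change lives in the monopacker/worker-pool launch config; device names, sizes, and volume types here are assumptions): the launch spec requests several smaller EBS volumes that the host later stripes into a single volume.

```python
def block_device_mappings(count: int, size_gb: int, volume_type: str = "gp2"):
    """Build `count` identical EBS data volumes to be striped into a single volume."""
    return [
        {
            "DeviceName": f"/dev/xvd{chr(ord('b') + i)}",
            "Ebs": {
                "VolumeSize": size_gb,
                "VolumeType": volume_type,
                "DeleteOnTermination": True,
            },
        }
        for i in range(count)
    ]

# e.g. four 250 GB volumes instead of a single 1000 GB disk
print(block_device_mappings(count=4, size_gb=250))
```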

Depends on: 1624642
Depends on: 1624649

We use [cmr]5.metal instance types with io1 because gp2 doesn't scale with
multiple parallel tasks. We also add [cmr]5d.metal instances, which
already ship with SSD disks and don't need a custom disk setup.

We decrease the number of tasks to 24 per instance to make sure we don't
have an I/O bottleneck. In the future we should investigate the optimal
number of tasks for each instance type.

As this bug describes an evaluation, what other criteria do we need to confirm before discussing moving from packet to AWS?

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #29)

As this bug describes an evaluation, what other criteria do we need to confirm before discussing moving from packet to AWS?

We're currently working to find metal instance variants that:

a) are generally available on spot in the regions we support;
b) have adequate performance relative to packet.net; and
c) cost around the same amount per hour per worker as packet.net.

Until recently, a) was the primary concern, but by using multiple metal instance types we've now moved past that hurdle.

Recall from https://bugzilla.mozilla.org/show_bug.cgi?id=1599144#c2 that in packet.net we're paying $0.103/hour/worker. Due to the nature of the instances there, that's a static cost that makes it easy to compare.

Using the new AWS calculator, I priced out the current config with the io1-optimized storage. I come up with a cost of $0.279/hour/worker for the m5.metal type in us-east-1 ($4,880.94 instance/month / 730 hours/month / 24 workers/instance).

So the current config is a non-starter from a cost perspective. I've asked Wander to use the calculator to create a short list of configs that would work from a cost standpoint. Decreasing the number of workers per instance is one way we could achieve this, and it might preclude the need for special IOPS configs.

e.g. if we used an m5.metal instance with a slightly larger gp2 disk (1000GB), we would only need to run 8 workers/instance to bring the cost down to $0.098/hour/worker ($570.94 instance/month / 730 hours/month / 8 workers/instance). Does performance suffer with a gp2 disk and 8 workers? We'd need to test to find out.
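For reference, a small helper reproducing the per-worker arithmetic above; the monthly instance prices are the calculator figures quoted in this comment:

```python
def cost_per_worker_hour(instance_cost_per_month: float, workers: int,
                         hours_per_month: float = 730.0) -> float:
    """Per-worker hourly cost: monthly instance price / hours in a month / workers."""
    return instance_cost_per_month / hours_per_month / workers

# io1-optimized m5.metal config, 24 workers/instance
print(round(cost_per_worker_hour(4880.94, 24), 3))  # ~0.279
# 1000GB gp2 m5.metal config, 8 workers/instance
print(round(cost_per_worker_hour(570.94, 8), 3))    # ~0.098
```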

Once Wander has a list of viable configs from a cost perspective, he can quickly iterate through them on Try and find the best one.

This is my first attempt at reducing costs:

28 tasks, 1000GB io1, 28000 IOPS
Cost ~ $0.177/hour/worker

Try push: https://treeherder-taskcluster-staging.herokuapp.com/#/jobs?repo=try&revision=10ea99e97f36c334f5054b79e015aa23247e9e08&selectedJob=5166

When this finishes I will run another configuration.

Since spot prices fluctuate, I revised the costs with a more conservative estimate of a 50% discount in the spot market.

My last try push was with 28 workers, 1000GB io1, 16000 IOPS
Cost ~ $0.13/hour/worker

Try push: https://treeherder-taskcluster-staging.herokuapp.com/#/jobs?repo=try&revision=43fad293337de6a0f1b85a7ca7d6424dfa8eaac5

I updated the performance spreadsheet with this push: https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=311893925

tl;dr: we are significantly worse in terms of performance compared with the [cmr]5d.metal instances we tried in December

(In reply to Wander Lairson Costa from comment #34)

tl;dr: we are significantly worse in terms of performance compared with the [cmr]5d.metal instances we tried in December

Remind me again why we can't use those instance types? That doesn't seem to be documented in the bug anywhere.

Is it due to lack of availability? I guess that's comment #16.

(In reply to Chris Cooper [:coop] pronoun: he from comment #35)

(In reply to Wander Lairson Costa from comment #34)

tl;dr: we are significantly worse in terms of performance compared with the [cmr]5d.metal instances we tried in December

Remind me again why we can't use those instance types? That doesn't seem to be documented in the bug anywhere.

Is it due to lack of availability? I guess that's comment #16.

Exactly. But bear in mind that the comparison isn't totally fair: the machines in packet have been running for a long time, so they have the docker images, hg repo, and tooltool downloads all cached, while the AWS instances have to download them.

(In reply to Wander Lairson Costa from comment #36)

Exactly. But bear in mind that the comparison isn't totally fair: the machines in packet have been running for a long time, so they have the docker images, hg repo, and tooltool downloads all cached, while the AWS instances have to download them.

That's also why I'm a little flexible on cost, though. The AWS instances will get shut down when not in use, so we won't be paying for them 24/7 like we are for packet.net.

(In reply to Wander Lairson Costa from comment #34)

I updated the performance spreadsheet with this push: https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=311893925

You've run a bunch of different configs now. Which config was this updated performance data for? We want to track the performance of all the configs so we know which ones are promising and which are non-starters. This will avoid re-doing work in a few weeks because we didn't write it down.

I would also suggest trying again with the [cmr]5d.metal instances in all the regions we support.

(In reply to Chris Cooper [:coop] pronoun: he from comment #38)

(In reply to Wander Lairson Costa from comment #34)

I updated the performance spreadsheet with this push: https://docs.google.com/spreadsheets/d/1GuptChtL3JV3bM3QUIkVQ91FjZt-XMepkRBkGain07s/edit#gid=311893925

You've run a bunch of different configs now. Which config was this updated performance data for? We want to track the performance of all the configs so we know which ones are promising and which are non-starters. This will avoid re-doing work in a few weeks because we didn't write it down.

I would also suggest trying again with the [cmr]5d.metal instances in all the regions we support.

It was for the latest config, which had the best prices. I am starting to look into how to automate these tests; it is getting too time-consuming to do manually.

I created a new spreadsheet with more diverse and precise data on worker performance and costs.

The performance data are measured in minutes and the costs are based on the average spot market prices from the last 30 days.

Depends on: 1631049

Last Tuesday we had the idea of using part of the available RAM as an in-memory disk, and it turns out it worked (with patches to monopacker and ci-admin). You can check the results in the performance spreadsheet. In summary, we could get r5.metal running 32 parallel tasks at an hourly cost of ~$0.07 per worker, and m5.metal running 15 tasks at an hourly cost of ~$0.14 per worker. I am going to close this bug and open a new one to track migration progress.
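For context, a minimal sketch of the in-memory disk idea, assuming the instance setup mounts a tmpfs over the directory the worker uses for task data; the mount point and size are placeholders, and the real values live in the monopacker/ci-admin patches mentioned above:

```python
import subprocess

MOUNT_POINT = "/mnt/worker-tmpfs"   # hypothetical path
SIZE = "64G"                        # hypothetical slice of the instance's RAM

# Create the mount point and back it with a RAM-based tmpfs filesystem.
subprocess.run(["mkdir", "-p", MOUNT_POINT], check=True)
subprocess.run(["mount", "-t", "tmpfs", "-o", f"size={SIZE}", "tmpfs", MOUNT_POINT],
               check=True)
```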

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED