<a class="header-button" href="https://bugzilla-dev.allizom.org/home" title="Go to home page"> Bugzilla

Assignee

Comment 1

•

8 years ago

Looking at the RelEng AWS bill seems to indicate we are running a bunch of c3.2xlarge instances: we ran ~313k instance-hours in October. Compare with 71.6k for m3.xlarge, 34.0k r3.xlarge and 90.6k c3.xlarge. Not sure why these release jobs all seem to be running on the slower instances.

Kim Moir [:kmoir] ET

Comment 2

•

8 years ago

Moving to buildduty to investigate. It would be good to look at this more recent data in treeherder and see if this is still the case since we recently moved our release and nightly linux builds to tc.

Component: General Automation → Buildduty

QA Contact: catlee → bugspam.Callek

Comment 3

•

8 years ago

For linux all builds are running on Taskcluster under m4.4xlarge and we have for Windows and OS X some builds which run under b-2008-spot instances which are c3.2xlarge(see more details in the links bellow) https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=build&selectedJob=86623903 https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=build&filter-build_system_type=Buildbot&fromchange=517c553ad64746c479456653ce11b04ab8e4977f Gregory should we change also the machine type for b-2008-spot or it's fine with this configuration?

Flags: needinfo?(gps)

Assignee

Comment 4

•

8 years ago

To be conservative, I'd have release builds match what non-release builds are using. I suspect there are reasons that the b-2008-spot instances are still using c3.2xlarge. We can worry about getting those switched in another bug. Also, TC uses a combo of m4 and c4 instances. It just so happens that spot bidding often delivers m4's more often. The c4's are a few minutes faster than m4's. So if you want to prioritize wall time of release builds over a marginal cost premium (which I think you do since release builds are important), you should go with the c4's. https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=%5Bmozilla-inbound,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bmozilla-inbound,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D

Flags: needinfo?(gps)

Comment 5

•

8 years ago

Did some investigation here but I couldn't find the place where the configuration for c4.4xlarge/m4.4xlarge AWS instances is made.Found some old tests and configuration in Bug 1287604 and Bug 1290282 (https://bug1290282.bmoattachments.org/attachment.cgi?id=8778997#)but things look different now. Also based on the discutions on Bug 1287604,are we sure that we want to change the configuration from combo of m4 and c4 instances to c4 instances?

Flags: needinfo?(gps)

Assignee

Comment 6

•

8 years ago

I cannot recall where the buildbot EC2 instance types are defined. catlee?

Flags: needinfo?(gps) → needinfo?(catlee)

https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg#L67

Comment 7

•

8 years ago

Flags: needinfo?(catlee)

Kim Moir [:kmoir] ET

Comment 8

•

8 years ago

Andrei further to our discussion regarding the bug this morning, this is a request to change the instance type on release branches + m-c for nightly only

Dustin J. Mitchell [:dustin] (he/him)

Comment 9

•

8 years ago

So from what I understand, for now we want to change the configuration on taskcluster for the builds to not use anymore a combo of m4.4xlarge and c4.4xlarge instances,but to use only c4.4xlarge based on https://bugzilla.mozilla.org/show_bug.cgi?id=1321168#c4id=1321168#c4. For that I will need to know where we define the characteristics of our TC worker types,particularly the instance type we are going to use,for example gecko-1-b-linux uses both c4.4xlarge and m4.4xlarge (https://tools.taskcluster.net/aws-provisioner/#gecko-1-b-linux/view);this is not found in watch_pending.cfg,there we only set the instance type for spot instances, c4.4xlarge and m4.4xlarge are not even in the list. Dustin,do you know where we define the characteristics of our TC worker? Gregory do we want to also change b-2008-spot instance type to c4.4xlarge?so that we will have for all builds in release branches + m-c, c4.4xlarge?We currently use c3.4xlarge there(https://github.com/mozilla-releng/build-cloud-tools/blob/master/configs/watch_pending.cfg#L69).

Flags: needinfo?(dustin)

Comment 10

•

8 years ago

It doesn't seem this bug is about Taskcluster. The instance types are controlled by the AWS provisioner. But the workerTypes used for releases are the same as those used for CI builds, so this particular change is impossible. I would want :gardnt's feedback before we change all workers' workerType -- there's a cost at which "dozens of minutes" might not be worthwhile.

Flags: needinfo?(dustin)

Comment 11

•

8 years ago

> The instance types are controlled by the AWS provisioner. But the > workerTypes used for releases are the same as those used for CI builds, so > this particular change is impossible. I would want :gardnt's feedback > before we change all workers' workerType -- there's a cost at which "dozens > of minutes" might not be worthwhile.

Flags: needinfo?(garndt)

Assignee

Comment 12

•

8 years ago

How did we start talking about Taskcluster? I think Taskcluster's instance selection is fine. It is buildbot that is severely lagging behind the times. Yes, we would ideally be building only on c4's instead of m4's in Taskcluster. But it isn't worth the complexity and cost at this time. Let's focus on getting buildbot release builds to something modern and we can worry about micro-optimizing release builds in TC to use c4 another time.

Greg Arndt [:garndt]

Comment 13

•

8 years ago

Based on Greg's comment above, I'm going to remove ni?

Flags: needinfo?(garndt)

Comment 14

•

8 years ago

Attached file changing instace type to c4.4xlarge for b-2008 and y-2008 (deleted) — Details

Not sure if there is something else that need to be change.

Attachment #8857968 - Flags: feedback?(catlee)

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 15

•

8 years ago

Comment on attachment 8857968 [details] changing instace type to c4.4xlarge for b-2008 and y-2008 Looks fine, but I'm not sure if we require local instance storage or not for the Windows builds. c4 instances are EBS-only whereas c3 have local SSD storage. Rail, grenade, do you know if we require on local instance storage for windows instances?

Attachment #8857968 - Flags: feedback?(rthijssen)

Attachment #8857968 - Flags: feedback?(rail)

Attachment #8857968 - Flags: feedback?(catlee)

Attachment #8857968 - Flags: feedback+

Rail Aliiev [:rail]

Comment 16

•

8 years ago

Comment on attachment 8857968 [details] changing instace type to c4.4xlarge for b-2008 and y-2008 TBH, I'm not sure how the lack of instance store may affect Windows builds. As a possible test-in-production we can land the patch and watch the builds.

Attachment #8857968 - Flags: feedback?(rail) → feedback+

Comment 17

•

8 years ago

Comment on attachment 8857968 [details] changing instace type to c4.4xlarge for b-2008 and y-2008 should be fine. on tc we use c4s with instance storage on extra ebs drives configured into the ami. but pretty sure bb just builds on the c: drive so should be no problems.

Attachment #8857968 - Flags: feedback?(rthijssen) → feedback+

Comment 18

•

8 years ago

Comment on attachment 8857968 [details] changing instace type to c4.4xlarge for b-2008 and y-2008 https://github.com/mozilla-releng/build-cloud-tools/pull/291

Attachment #8857968 - Flags: checked-in+

Assignee

Comment 19

•

8 years ago

(In reply to Rob Thijssen (:grenade - CEST) from comment #17) > Comment on attachment 8857968 [details] > changing instace type to c4.4xlarge for b-2008 and y-2008 > > should be fine. on tc we use c4s with instance storage on extra ebs drives > configured into the ami. but pretty sure bb just builds on the c: drive so > should be no problems. If builds are on c: then bug 1305174 may come into play and this may make builds significantly slower. Builds should be performed on EBS volumes that aren't initialized from an AMI to avoid this problem.

Comment 20

•

8 years ago

This was merged today.

Comment 21

•

8 years ago

Things look good,b-2008-spot and y-2008-spot instances are now c4.4xlarge. Since only these builds are executed by Buildbot,and we don't want to do yet changes in Taskcluster I think we can change the status to fixed.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Comment 22

•

8 years ago

Before we close this out, we should compare before/after times to make sure that we didn't make things worse instead of better (per comment 19).

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Add the link bellow for monitoring: https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=%5Bmozilla-inbound,8ddd7cf44dad0352a7c715dcb6df1776fb2d3df0,1,2%5D&series=%5Bmozilla-inbound,99aa49aa36f11a56ffb54f9ce5720bb82a7a6f8d,0,2%5D&series=%5Bmozilla-inbound,7e2973e7febe97736736e1373db56d25161d45ce,0,2%5D&series=%5Bmozilla-inbound,18851c0bb8875e85890ba4b1aed2439f8c682743,1,2%5D

Comment 23

•

8 years ago

Comment 24

•

8 years ago

As I suspected, since these only have the single disk, many times are increasing, not decreasing. I doubt that the aggregate of those jobs that are marginally faster makes up for the drastic increases on freshly booted instances.

Comment 25

•

8 years ago

This has caused a significant increase in build times as per the graph. We should back this out.

Flags: needinfo?(aobreja)

Assignee

Comment 26

•

8 years ago

Is it too much work to add EBS volumes and perform work on them? We could still use existing paths that reference C:\ if an NTFS junction point is used to map an EBS-backed drive to a directory on C:\.

Assignee

Comment 27

•

8 years ago

The reason I'd like to use the C4's is because the C5's should be out any month now and they should offer a substantial speedup over C4's. So I see the choices as "stick with C3's until this automation is retired [and replaced by TaskCluster]" or move forward to C4's so we can swap in C5's whenever they are ready.

Comment 28

•

8 years ago

Between datacenter work, security work, and trying to get things working for taskcluster and windows 10, we don't have time to prioritize optimization at the moment. If someone outside of relops wants to make modifications, that'd be great.

Assignee

Comment 29

•

8 years ago

Not having consistency between TC and BB annoys me from the perspective of someone who optimizes the build system and version control. I am currently working on some patches to build-cloud-tools to enable EBS volumes on these instances and to install junctions in the appropriate locations.

Assignee: nobody → gps

Status: REOPENED → ASSIGNED

Assignee

Comment 30

•

8 years ago

My reading of the existing code in Ec2UserdataUtils.psm1 for provisioning a new spot instance indicates that everything here should have "just worked." However, the problem appears to be that the b-2008 and y-2008 configs still thought they had ephemeral drives: "device_map": { "/dev/sda1": { "delete_on_termination": true, "skip_resize": true, "volume_type": "gp2", "size": 120, "instance_dev": "C:" }, "/dev/sdb": { "ephemeral_name": "ephemeral0", "instance_dev": "/dev/xvdb", "skip_resize": true, "delete_on_termination": false }, "/dev/sdc": { "ephemeral_name": "ephemeral1", "instance_dev": "/dev/xvdc", "skip_resize": true, "delete_on_termination": false } }, I /think/ the creation of those volumes is silently failing. Ec2UserdataUtils.psm1 treats that as "no extra volumes, no work to do" and c:\builds is run from the root EBS volume and bad performance ensues. I'm optimistic reworking the device_map to properly list a gp2 volume will "just work." However, the code in Ec2UserdataUtils.psm1 is a bit fragile. So I'm going to refactor that as well. PR coming shortly.

Flags: needinfo?(aobreja)

Mihai Tabara [:mtabara]⌚️GMT

Comment 31

•

7 years ago

I backed this out since it seems to line up with a spike in failures of bug 1147271.

Comment 32

•

7 years ago

Note explaining the priority level: P5 doesn't mean we've lowered the priority, but the contrary. However, we're aligning these levels to the buildduty quarterly deliverables, where P1-P3 are taken by our daily waterline KTLO operational tasks.

Priority: -- → P5