Closed Bug 1555458 Opened 5 years ago Closed 5 years ago

Options to reduce Google Pixel 2 backlog on Bitbar

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: egao, Assigned: egao)

Attachments

(3 files)

Bug 1555458 - separate android-hw-arm7 test-set to pgo and opt; stop running jsreftests on opt builds 5 years ago Edwin Takahashi (:egao | infrequent contributor) (deleted), text/x-phabricator-request
Bug 1555458 - Stop android-hw debug jittest/jsreftest on integration branches; r=egao 5 years ago Geoff Brown [:gbrown] (deleted), text/x-phabricator-request
Bug 1555458 - require --full for android-hw job scheduling with fuzzy 5 years ago Edwin Takahashi (:egao | infrequent contributor) (deleted), text/x-phabricator-request

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Description

•

5 years ago

Issue summary

As :aerickson pointed out in IRC:

<aerickson> gbrown: adding jsreftests has added 12 hours of bitbar machine time to each push... our p2-unit queue is backed up (10 hours is the oldest job)
12:15 <gbrown> aerickson: we stopped running android-hw aarch64 jittest recently; had hoped that would balance out
12:16 <egao> aerickson: I played a part in enabling jsreftests on p2 - let's restrict the platforms to try and m-c?
12:16 <aerickson> gbrown: aah, ok.
12:16 <aerickson> egao: that would definitely help. we can move devices from perf, but then if that ramps up we'll fall behind there.
12:17 <gbrown> I guess m-c/try would be okay for the short-term, but we really wanted integration branches covered too
12:17 <egao> gbrown, aerickson: i'll create a bug to weigh our options
12:18 <gbrown> thanks egao
12:18 <egao> 360 backlog is quite bad, worse than when I was waiting for G5

This backlog of Pixel 2 devices occurred shortly after bug #1553310 landed.

In that bug, these tradeoffs were made:

turn off jsreftest on emulator
turn off jsreftest on android-hw-aarch64
turn on jsreftest on android-hw-arm7

Options

reduce coverage (temporarily) of jsreftests on android-hw-arm7 to:

try
mozilla-central

This would lead to integration branches not being covered by jsreftests.

stop/reduce coverage of jsreftests on certain build types

opt
pgo
debug

Geoff Brown [:gbrown]

Comment 1

•

5 years ago

I notice android-hw jsreftests running on pgo, opt, and debug currently. I was only expecting pgo and debug -- probably opt can be stopped?

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Comment 2

•

5 years ago

(In reply to Geoff Brown [:gbrown] from comment #1)

I notice android-hw jsreftests running on pgo, opt, and debug currently. I was only expecting pgo and debug -- probably opt can be stopped?

I think that would help. We can take a rolling approach (leaving this bug open) and try solutions one at a time to see whether each one reduces the load. I can cook up a proposed patch.

Geoff Brown [:gbrown]

Comment 3

•

5 years ago

Sounds good - thanks.

btw, on branches other than try, jittest and jsreftest should only be run when code in js/ is modified -- they are subject to the schedules at

https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/js/moz.build#15-19

and that seems to be working.
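
For readers unfamiliar with path-based scheduling, here is a minimal sketch of the rule described above. It is not the actual taskgraph implementation; INCLUSIVE_SCHEDULES and should_schedule are hypothetical names, and the behavior on try is simplified to "always eligible".

```python
from typing import Iterable

# Hypothetical stand-in for the schedule declared in js/moz.build:
# a suite listed here runs only when a changed file is under one of its prefixes.
INCLUSIVE_SCHEDULES = {
    "jittest": ("js/",),
    "jsreftest": ("js/",),
}


def should_schedule(suite: str, project: str, changed_files: Iterable[str]) -> bool:
    """Return True if `suite` is eligible to run for a push to `project`."""
    if project == "try":
        # Assumption: on try the developer's selection decides, not the changed paths.
        return True
    prefixes = INCLUSIVE_SCHEDULES.get(suite)
    if prefixes is None:
        # Suites without an inclusive schedule are not path-restricted in this sketch.
        return True
    # str.startswith accepts a tuple of prefixes.
    return any(path.startswith(prefixes) for path in changed_files)
```

Under these assumptions, a push to autoland that only touches dom/ would skip jsreftest, while a push touching js/src/ would schedule it.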

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Comment 4

•

5 years ago

Attached file Bug 1555458 - separate android-hw-arm7 test-set to pgo and opt; stop running jsreftests on opt builds (deleted) — Details

Pulsebot

Comment 5

•

5 years ago

Pushed by egao@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0489b95c196f separate android-hw-arm7 test-set to pgo and opt; stop running jsreftests on opt builds r=gbrown

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Updated

•

5 years ago

Keywords: leave-open

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Updated

•

5 years ago

Assignee: nobody → egao

Stefan Hindli [:stefan_hindli]

Comment 6

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/0489b95c196f

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Updated

•

5 years ago

Blocks: 1555479

Geoff Brown [:gbrown]

Comment 7

•

5 years ago

Attached file Bug 1555458 - Stop android-hw debug jittest/jsreftest on integration branches; r=egao (deleted) — Details

Incremental effort to improve android-hw device availability: Stop running android-hw
debug jsreftest and jittest on integration branches.
Also, remove the option for android-hw opt jittest on try. opt is a nice alternative
to pgo on try in general, but the risk of accidental (unnecessary) inclusion in
try pushes makes this a luxury we cannot afford on android-hw.
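
As a rough illustration of the combined policy after this patch and the earlier one, the following sketch models it as a single predicate. It is an assumption-laden model, not the real taskcluster configuration: the function name, the branch set, and the treatment of opt as fully disabled and pgo as unrestricted are all simplifications made for illustration.

```python
# Hypothetical model of the android-hw jittest/jsreftest policy discussed in this bug.
INTEGRATION_BRANCHES = {"autoland", "mozilla-inbound"}
JS_SUITES = {"jittest", "jsreftest"}


def runs_android_hw_js_suite(suite: str, build_type: str, project: str) -> bool:
    """Return True if the suite would still run on android-hw in this model."""
    if suite not in JS_SUITES:
        raise ValueError(f"not a JS suite: {suite}")
    if build_type == "opt":
        # Assumption: opt is treated as fully off (jsreftest stopped on opt,
        # opt jittest no longer offered on try).
        return False
    if build_type == "debug":
        # Debug jittest/jsreftest no longer run on integration branches.
        return project not in INTEGRATION_BRANCHES
    # Assumption: pgo keeps running wherever it ran before.
    return build_type == "pgo"
```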

Phabricator Automation

Updated

•

5 years ago

Attachment #9068703 - Attachment description: Bug 1555458 - Stop android-hw debug jittest/jsreftest on integration branches; r= → Bug 1555458 - Stop android-hw debug jittest/jsreftest on integration branches; r=egao

Pulsebot

Comment 8

•

5 years ago

Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4796f71f2be4 Stop android-hw debug jittest/jsreftest on integration branches; r=egao

Narcis Beleuzu [:NarcisB]

Comment 9

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4796f71f2be4

Geoff Brown [:gbrown]

Comment 10

•

5 years ago

Backlog seems to be caught up now.

Andrew Erickson [:aerickson]

Comment 11

•

5 years ago

I moved 10 perf devices to the unit queue yesterday at 11:20am pacific.

Since the backlog is gone (and we now have perf backlog), I've moved them back to perf.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 12

•

5 years ago

It might just be random fluctuation but the backlog is building up again on the unit workers.

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Comment 13

•

5 years ago

:gbrown, aerickson - looks like backlog is building. As of this moment I count 86 for unittest-p2, 218 for perf-p2.

Are these regular numbers?

Geoff Brown [:gbrown]

Comment 14

•

5 years ago

I don't know.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 15

•

5 years ago

When I posted comment 12 it was ~120 for unittest-p2 and over 300 for perf-p2. So maybe it's getting better/stabilizing.

Bob Clary [:bc] (inactive)

Comment 16

•

5 years ago

More than half of the pending tasks are from try:

try: pending 'android-hw' tasks: 211, oldest pending submitted 9h, 43m ago
mozilla-inbound: pending 'android-hw' tasks: 40, oldest pending submitted 4h, 23m ago
autoland: pending 'android-hw' tasks: 55, oldest pending submitted 5h ago
mozilla-central: pending 'android-hw' tasks: 91, oldest pending submitted 2h, 14m ago
mozilla-beta: pending 'android-hw' tasks: 0
mozilla-release: pending 'android-hw' tasks: 0
total: pending 'android-hw' tasks: 397

But we are lagging the production branches by too much.
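
The "more than half" observation can be checked directly from the per-branch counts listed above. This small sketch only re-does that arithmetic; the dictionary is just the numbers from this comment.

```python
# Pending 'android-hw' task counts per branch, copied from the listing above.
pending = {
    "try": 211,
    "mozilla-inbound": 40,
    "autoland": 55,
    "mozilla-central": 91,
    "mozilla-beta": 0,
    "mozilla-release": 0,
}

total = sum(pending.values())
try_share = pending["try"] / total

print(f"total pending: {total}")      # 397, matching the reported total
print(f"try share: {try_share:.1%}")  # 53.1%
```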

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 17

•

5 years ago

/cc Jessie

Per https://tools.taskcluster.net/provisioners/proj-autophone/worker-types the perf-p2 backlog is currently at 1914 jobs while the unit-p2 backlog is at 302. We could keep shifting more devices from unit to perf, but I'm not actually suggesting we do that right now because it means my pending try push will take even longer to complete. :)

I think we need a plan to add more devices, because this doesn't feel sustainable, and increasing the number of jobs (which I'll be doing in bug 1555479 and bug 1525314) will only make this worse.

Bob Clary [:bc] (inactive)

Comment 18

•

5 years ago

Most of these were initially from https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=gpascutto%40mozilla.com&searchStr=android-hw . Those jobs have failed with exceptions because it took too long for them to run, but they prevented other jobs from starting, and those other jobs have now accumulated in the pending queues. The failed jobs will be removed from the queues over time.

With a limited number of physical devices we will always be subject to being DOSed. We need to educate people that android-hw is a limited resource and can be easily overwhelmed. We also need to be proactive by reaching out to the people and asking them to cancel their jobs if they don't really need them.

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Comment 19

•

5 years ago

Though developer education about the limited supply of Android hardware can be effective, past experience with windows10-aarch64 shows that most people simply don't pay attention. Often, pushes simply select all jobs, so a lot of unnecessary tasks are scheduled.

My proposed solution, which will hold us over until we have sufficient device count, is to:

prevent android-hw from showing up with mach try syntax
require --full keyword with mach try fuzzy

For windows10-aarch64, the combination of these approaches helped reduce the backlog from 100+ to near zero, which supports my (anecdotal) claim that most of the jobs are actually unnecessary.

:bc, :gbrown - what do you think about the proposal? Do we want to take a less drastic measure (further restrict branches? other measures?)

Bob Clary [:bc] (inactive)

Comment 20

•

5 years ago

(In reply to Edwin Gao (:egao) from comment #19)

prevent android-hw from showing up with mach try syntax

This is with the interactive chooser? Yes, that sounds good.

require --full keyword with mach try fuzzy

This also sounds good.

I approve.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 21

•

5 years ago

Last night (~12 hours ago) I checked the backlog and it was on the order of ~1750 for perf and ~500 for unit. I went hunting for try pushes that might have inadvertently included hw-p2 jobs and found a few, so I cancelled those jobs. It brought the numbers down slightly but today they're back up to 1500+ for perf and 623 for unit.

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Comment 22

•

5 years ago

So it looks like we already prevent users from scheduling android-hw jobs with the mach try syntax approach: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/try_option_syntax.py#580

# Don't schedule android-hw tests when try option syntax is used
if 'android-hw' in task.label:
    return False

I have put up a patch to require --full for the mach try fuzzy query.
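
To make the intent of the --full requirement concrete, here is a minimal sketch of the kind of filter involved. It is not the patch itself: UNCOMMON_LABEL_FRAGMENTS and fuzzy_candidates are hypothetical names, and the task labels in the usage lines are illustrative only.

```python
from typing import Iterable, List

# Hypothetical list of label fragments that mark scarce-resource tasks.
UNCOMMON_LABEL_FRAGMENTS = ("android-hw",)


def fuzzy_candidates(labels: Iterable[str], full: bool = False) -> List[str]:
    """Return the task labels a fuzzy selector would offer.

    Without `full`, tasks whose label contains an uncommon fragment are hidden,
    so they cannot be picked up by accident with a broad query.
    """
    if full:
        return list(labels)
    return [
        label
        for label in labels
        if not any(fragment in label for fragment in UNCOMMON_LABEL_FRAGMENTS)
    ]


# Illustrative labels only; real task labels may look different.
example_labels = [
    "test-linux64/opt-jsreftest-1",
    "test-android-hw-p2/pgo-jsreftest-1",
]
default_view = fuzzy_candidates(example_labels)
full_view = fuzzy_candidates(example_labels, full=True)
```

In this sketch the default view contains only the linux64 task, and the android-hw task appears only when full=True is passed, mirroring the opt-in behavior proposed in comment 19.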

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Comment 23

•

5 years ago

Attached file Bug 1555458 - require --full for android-hw job scheduling with fuzzy (deleted) — Details

Geoff Brown [:gbrown]

Comment 24

•

5 years ago

Seems like a good idea. Thanks.

Pulsebot

Comment 25

•

5 years ago

Pushed by egao@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1a696234bf69 require --full for android-hw job scheduling with fuzzy r=jmaher,gbrown

Stefan Hindli [:stefan_hindli]

Comment 26

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/1a696234bf69

Gian-Carlo Pascutto [:gcp]

Comment 27

•

5 years ago

We need to educate people that android-hw is a limited resource and can be easily overwhelmed

FWIW my push was originally made because Android (raptor) performance numbers are very noisy and it's hard to make sure nothing was regressed without sufficient samples (and the original push had a number of Android tests that looked like they might). So nothing in comment 19 would have stopped it because it was very much intentional.

Perf numbers being very noisy and there being limited hardware is a painful recipe. I've spent the last few days poring over all the results and manually re-triggering the individual tests that were still inconclusive.

Bob Clary [:bc] (inactive)

Comment 28

•

5 years ago

(In reply to Edwin Gao (:egao) from comment #22)

Right. I was thinking of the interactive UI, but --full should do the job, I think.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #27)

Thanks. It is sometimes necessary to run lots of jobs against android-hw. We just need to be mindful of the timing. Large jobs on try are best done overnight or on weekends if possible. Multiple large jobs should be staged so that they don't form a bottleneck for other developers.

The way the scheduling works is that production branches like mozilla-central, autoland, mozilla-inbound, etc. will always be served before try. With the current heavy load on production, especially during the day, that means that try does not get jobs as often as we would like, or in fact at all. This results in jobs staying in the queue for longer than a day and then being cancelled.
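
The starvation effect described above can be sketched with a simple two-level priority queue. This is an illustration only and does not model the real Taskcluster queue; DeviceQueue, its methods, and the PRODUCTION set are hypothetical.

```python
import heapq
from itertools import count
from typing import List, Optional, Tuple

# Assumption: these branches are served ahead of try in this model.
PRODUCTION = {"mozilla-central", "autoland", "mozilla-inbound"}


class DeviceQueue:
    """Toy queue in which production jobs are always claimed before try jobs."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str, str]] = []
        self._order = count()  # tie-breaker that keeps FIFO order within a priority

    def submit(self, project: str, label: str) -> None:
        priority = 0 if project in PRODUCTION else 1
        heapq.heappush(self._heap, (priority, next(self._order), project, label))

    def claim(self) -> Optional[Tuple[str, str]]:
        """Hand the next job to a free device, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, project, label = heapq.heappop(self._heap)
        return project, label


queue = DeviceQueue()
queue.submit("try", "jsreftest-1")        # submitted first
queue.submit("autoland", "jsreftest-1")   # submitted later, claimed first
queue.submit("mozilla-central", "jittest-1")
claimed = [queue.claim() for _ in range(3)]
```

In this model the try job is claimed last even though it was submitted first. As long as production jobs keep arriving faster than devices free up, a try job can wait indefinitely, which is the behavior described above.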

It is unfortunate that we do not have more devices but that is the situation we find ourselves in.

Geoff Brown [:gbrown]

Updated

•

5 years ago

Priority: -- → P3

Edwin Takahashi (:egao | infrequent contributor)

Assignee

Comment 29

•

5 years ago

Closing this bug as I believe the backlog for P2 has been addressed satisfactorily.

Status: NEW → RESOLVED

Closed: 5 years ago

Keywords: leave-open

Resolution: --- → FIXED
