Closed Bug 1555458 Opened 5 years ago Closed 5 years ago

Options to reduce Google Pixel 2 backlog on Bitbar

Categories

(Testing :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: egao, Assigned: egao)

References

Details

Attachments

(3 files)

Issue summary

As :aerickson pointed out in IRC:

<aerickson> gbrown: adding jsreftests has added 12 hours of bitbar machine time to each push... our p2-unit queue is backed up (10 hours is the oldest job)
12:15 <gbrown> aerickson: we stopped running android-hw aarch64 jittest recently; had hoped that would balance out
12:16 <egao> aerickson: I played a part in enabling jsreftests on p2 - let's restrict the platforms to try and m-c?
12:16 <aerickson> gbrown: aah, ok.
12:16 <aerickson> egao: that would definitely help. we can move devices from perf, but then if that ramps up we'll fall behind there.
12:17 <gbrown> I guess m-c/try would be okay for the short-term, but we really wanted integration branches covered too
12:17 <egao> gbrown, aerickson: i'll create a bug to weigh our options
12:18 <gbrown> thanks egao
12:18 <egao> 360 backlog is quite bad, worse than when I was waiting for G5

This backlog of Pixel 2 devices occurred shortly after bug #1553310 landed.

In that bug, these tradeoffs were made:

  • turn off jsreftest on emulator
  • turn off jsreftest on android-hw-aarch64
  • turn on jsreftest on android-hw-arm7

Options

  1. reduce coverage (temporarily) of jsreftests on android-hw-arm7 to:
     • try
     • mozilla-central

     This would lead to integration branches not being covered by jsreftests.

  2. stop/reduce coverage of jsreftests on certain build types:
     • opt
     • pgo
     • debug

I notice android-hw jsreftests running on pgo, opt, and debug currently. I was only expecting pgo and debug -- probably opt can be stopped?

(In reply to Geoff Brown [:gbrown] from comment #1)

I notice android-hw jsreftests running on pgo, opt, and debug currently. I was only expecting pgo and debug -- probably opt can be stopped?

I think that would help. We can take a rolling approach (leaving this bug open) and try solutions one at a time to see whether each reduces the load. I can cook up a proposed patch.

Sounds good - thanks.

btw, on branches other than try, jittest and jsreftest should only be run when code in js/ is modified -- they are subject to the schedules at

https://searchfox.org/mozilla-central/rev/7556a400affa9eb99e522d2d17c40689fa23a729/js/moz.build#15-19

and that seems to be working.

Pushed by egao@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/0489b95c196f separate android-hw-arm7 test-set to pgo and opt; stop running jsreftests on opt builds r=gbrown
Assignee: nobody → egao

Incremental effort to improve android-hw device availability: Stop running android-hw
debug jsreftest and jittest on integration branches.
Also, remove the option for android-hw opt jittest on try. opt is a nice alternative
to pgo on try in general, but the risk of accidental (unnecessary) inclusion in
try pushes makes this a luxury we cannot afford on android-hw.

Attachment #9068703 - Attachment description: Bug 1555458 - Stop android-hw debug jittest/jsreftest on integration branches; r= → Bug 1555458 - Stop android-hw debug jittest/jsreftest on integration branches; r=egao
Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4796f71f2be4 Stop android-hw debug jittest/jsreftest on integration branches; r=egao

Backlog seems to be caught up now.

I moved 10 perf devices to the unit queue yesterday at 11:20am pacific.

Since the backlog is gone (and we now have perf backlog), I've moved them back to perf.

It might just be random fluctuation but the backlog is building up again on the unit workers.

:gbrown, aerickson - looks like backlog is building. As of this moment I count 86 for unittest-p2, 218 for perf-p2.

Are these regular numbers?

I don't know.

When I posted comment 12 it was ~120 for unittest-p2 and over 300 for perf-p2. So maybe it's getting better/stabilizing.

More than half of pending are try

try: pending 'android-hw' tasks: 211, oldest pending submitted 9h, 43m ago
mozilla-inbound: pending 'android-hw' tasks: 40, oldest pending submitted 4h, 23m ago
autoland: pending 'android-hw' tasks: 55, oldest pending submitted 5h ago
mozilla-central: pending 'android-hw' tasks: 91, oldest pending submitted 2h, 14m ago
mozilla-beta: pending 'android-hw' tasks: 0
mozilla-release: pending 'android-hw' tasks: 0
total: pending 'android-hw' tasks: 397

But we are lagging the production branches by too much.
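As a quick check of the "more than half" observation, here is a minimal sketch that recomputes each branch's share of the pending android-hw tasks. The counts are copied from the listing above; the variable names are illustrative only.

```python
# Pending 'android-hw' task counts per branch, copied from the listing above.
pending = {
    "try": 211,
    "mozilla-inbound": 40,
    "autoland": 55,
    "mozilla-central": 91,
    "mozilla-beta": 0,
    "mozilla-release": 0,
}

total = sum(pending.values())
shares = {branch: count / total for branch, count in pending.items()}

# Print branches from largest to smallest backlog with their share of the total.
for branch, count in sorted(pending.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{branch:16} {count:4d}  {shares[branch]:6.1%}")
print(f"{'total':16} {total:4d}")
```

The per-branch counts sum to the reported total of 397, and try accounts for a little over half of it.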

/cc Jessie

Per https://tools.taskcluster.net/provisioners/proj-autophone/worker-types, the perf-p2 backlog is currently at 1914 jobs while the unit-p2 backlog is at 302. We can keep shifting more devices from unit to perf, but I'm not actually suggesting we do that right now because it means my pending try push will take even longer to complete. :)

I think we need a plan to add more devices, because this doesn't feel sustainable, and increasing the number of jobs (which I'll be doing in bug 1555479 and bug 1525314) will only make this worse.

Most of these were initially from https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=gpascutto%40mozilla.com&searchStr=android-hw. Those jobs failed with exceptions because they took too long to run, but in the meantime they prevented other jobs from starting, and those jobs have now accumulated in the pending queues. The failed jobs will be removed from the queues over time.

With a limited number of physical devices we will always be subject to being DOSed. We need to educate people that android-hw is a limited resource and can be easily overwhelmed. We also need to be proactive by reaching out to the people and asking them to cancel their jobs if they don't really need them.

Although educating developers that Android hardware is a limited resource can be effective, past experience with windows10-aarch64 shows that most simply don't pay attention. Often the pushes simply select "all", so a lot of unnecessary tasks are scheduled.

My proposed solution, which will hold us over until we have sufficient device count, is to:

  • prevent android-hw from showing up with mach try syntax
  • require --full keyword with mach try fuzzy

For windows10-aarch64, the combination of these approaches helped reduce the backlog from 100+ to near-zero, which supports my (anecdotal) claim that most of the jobs are actually unnecessary.

:bc, :gbrown - what do you think about the proposal? Do we want to take a less drastic measure (further restrict branches? other measures?)?

(In reply to Edwin Gao (:egao) from comment #19)

  • prevent android-hw from showing up with mach try syntax

This is with the interactive chooser? Yes, that sounds good.

  • require --full keyword with mach try fuzzy

This also sounds good.

I approve.

Last night (~12 hours ago) I checked the backlog and it was on the order of ~1750 for perf and ~500 for unit. I went hunting for try pushes that might have inadvertently included hw-p2 jobs and found a few, so I cancelled those jobs. It brought the numbers down slightly but today they're back up to 1500+ for perf and 623 for unit.

So it looks like we already prevent users from scheduling android-hw jobs with the mach try syntax approach: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/try_option_syntax.py#580

# Don't schedule android-hw tests when try option syntax is used
if 'android-hw' in task.label:
    return False

I have put up a patch to require --full for the mach try fuzzy query.
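For illustration only, the intended behavior can be sketched roughly as below. This is not the landed patch: `filter_for_fuzzy`, its `full` parameter, and the task labels are hypothetical names invented for this example, and the real mach try fuzzy code is structured differently.

```python
def filter_for_fuzzy(labels, full=False):
    """Return the task labels offered to the fuzzy selector.

    Hypothetical helper: android-hw tasks are hidden unless the caller
    explicitly asks for the full task set (the role --full plays here).
    """
    if full:
        return list(labels)
    return [label for label in labels if "android-hw" not in label]


# Made-up labels, used only to exercise the sketch.
labels = [
    "test-linux64/opt-mochitest-1",
    "test-android-hw-p2/pgo-jsreftest-1",
]

default_view = filter_for_fuzzy(labels)          # android-hw hidden
full_view = filter_for_fuzzy(labels, full=True)  # everything offered
```

The point of the design is that selecting android-hw work becomes an explicit opt-in, so a broad fuzzy query can no longer pull in device jobs by accident.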

Seems like a good idea. Thanks.

Pushed by egao@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1a696234bf69 require --full for android-hw job scheduling with fuzzy r=jmaher,gbrown

We need to educate people that android-hw is a limited resource and can be easily overwhelmed

FWIW my push was originally made because Android (raptor) performance numbers are very noisy and it's hard to make sure nothing was regressed without sufficient samples (and the original push had a number of Android tests that looked like they might). So nothing in comment 19 would have stopped it because it was very much intentional.

Perf numbers being very noisy and there being limited hardware is a painful recipe. I've spent the last few days poring over all the results and manually re-triggering the individual tests that were still inconclusive.

(In reply to Edwin Gao (:egao) from comment #22)

Right. I was thinking of the interactive UI, but --full should do the job, I think.

(In reply to Gian-Carlo Pascutto [:gcp] from comment #27)

Thanks. It is sometimes necessary to run lots of jobs against android-hw. We just need to be mindful of the timing. Large jobs on try are best done overnight or on weekends if possible. Multiple large jobs should be staged so that they don't form a bottleneck for other developers.

The way the scheduling works is that production branches like mozilla-central, autoland, mozilla-inbound, etc. will always be served before try. With the current heavy load on production, especially during the day, that means that try does not get jobs as often as we would like, or in fact at all. This results in jobs staying in the queue for longer than a day and then being cancelled.

It is unfortunate that we do not have more devices but that is the situation we find ourselves in.
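The scheduling behavior described in this comment can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the actual Taskcluster or Bitbar scheduler: `DeviceQueue`, `tier`, and the job labels are hypothetical. It only shows why try jobs starve while production branches keep submitting work.

```python
import heapq
import itertools

# Branches treated as "production" in this toy model (taken from the comment).
PRODUCTION = {"mozilla-central", "autoland", "mozilla-inbound"}


def tier(branch):
    """Lower tier is served first: production is 0, everything else is 1."""
    return 0 if branch in PRODUCTION else 1


class DeviceQueue:
    """Toy pending queue: production first, then oldest-first within a tier."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # submission order as tie-breaker

    def submit(self, branch, label):
        heapq.heappush(
            self._heap, (tier(branch), next(self._counter), branch, label)
        )

    def next_job(self):
        _, _, branch, label = heapq.heappop(self._heap)
        return branch, label


queue = DeviceQueue()
queue.submit("try", "jsreftest-1")
queue.submit("autoland", "jsreftest-1")
queue.submit("try", "jittest-1")
queue.submit("mozilla-central", "jittest-1")

# Branch of each job in the order a free device would pick them up.
order = [queue.next_job()[0] for _ in range(4)]
```

Even though the first try job was submitted before anything else, both production jobs are served ahead of it. Under sustained production load the try tier never reaches the front of the queue.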

Priority: -- → P3

Closing this bug as I believe the backlog for P2 has been addressed satisfactorily.

Status: NEW → RESOLVED
Closed: 5 years ago
Keywords: leave-open
Resolution: --- → FIXED