Options to reduce Google Pixel 2 backlog on Bitbar
Categories
(Testing :: General, defect, P3)
Tracking
(Not tracked)
People
(Reporter: egao, Assigned: egao)
References
Details
Attachments
(3 files)
Issue summary
As :aerickson pointed out in IRC:
<aerickson> gbrown: adding jsreftests has added 12 hours of bitbar machine time to each push... our p2-unit queue is backed up (10 hours is the oldest job)
12:15 <gbrown> aerickson: we stopped running android-hw aarch64 jittest recently; had hoped that would balance out
12:16 <egao> aerickson: I played a part in enabling jsreftests on p2 - let's restrict the platforms to try and m-c?
12:16 <aerickson> gbrown: aah, ok.
12:16 <aerickson> egao: that would definitely help. we can move devices from perf, but then if that ramps up we'll fall behind there.
12:17 <gbrown> I guess m-c/try would be okay for the short-term, but we really wanted integration branches covered too
12:17 <egao> gbrown, aerickson: i'll create a bug to weigh our options
12:18 <gbrown> thanks egao
12:18 <egao> 360 backlog is quite bad, worse than when I was waiting for G5
This backlog of Pixel 2 devices occurred shortly after bug #1553310 landed.
In that bug, these tradeoffs were made:
- turn off jsreftest on emulator
- turn off jsreftest on android-hw-aarch64
- turn on jsreftest on android-hw-arm7
Options
- reduce coverage (temporarily) of jsreftests on android-hw-arm7 to:
  - try
  - mozilla-central
  This would lead to integration branches not being covered by jsreftests.
- stop/reduce coverage of jsreftests on certain build types
  - opt
  - pgo
  - debug
Comment 1•5 years ago
I notice android-hw jsreftests running on pgo, opt, and debug currently. I was only expecting pgo and debug -- probably opt can be stopped?
Assignee | ||
Comment 2•5 years ago
(In reply to Geoff Brown [:gbrown] from comment #1)
I notice android-hw jsreftests running on pgo, opt, and debug currently. I was only expecting pgo and debug -- probably opt can be stopped?
I think that would help. We can take a rolling approach (leaving this bug open) and try solutions one at a time to see whether each reduces the load. I can cook up a proposed patch.
Comment 3•5 years ago
Sounds good - thanks.
btw, on branches other than try, jittest and jsreftest should only be run when code in js/ is modified -- they are subject to the schedules at
and that seems to be working.
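As a rough illustration of that kind of path-based scheduling, here is a minimal sketch. It is not the taskgraph implementation; the function name and the prefix list are assumptions made up for this example, and the real schedule definitions are the ones linked above.

```python
# Hypothetical sketch: JS_PREFIXES and touches_js are illustrative names,
# not part of taskgraph.
JS_PREFIXES = ("js/",)


def touches_js(changed_files, prefixes=JS_PREFIXES):
    """Return True if any changed path falls under one of the prefixes."""
    # str.startswith accepts a tuple of candidate prefixes.
    return any(path.startswith(prefixes) for path in changed_files)


# A push that only touches dom/ would skip jittest and jsreftest,
# while a push that touches js/ would schedule them.
print(touches_js(["dom/base/Document.cpp"]))
print(touches_js(["js/src/jit/Ion.cpp", "dom/base/Document.cpp"]))
```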
Assignee | ||
Comment 4•5 years ago
Comment 6•5 years ago
bugherder |
Comment 7•5 years ago
Incremental effort to improve android-hw device availability: Stop running android-hw
debug jsreftest and jittest on integration branches.
Also, remove the option for android-hw opt jittest on try. opt is a nice alternative
to pgo on try in general, but the risk of accidental (unnecessary) inclusion in
try pushes makes this a luxury we cannot afford on android-hw.
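The rule described in this comment can be written down as a small predicate. This is only a sketch of the policy as stated here, not the patch itself; the function name is hypothetical, and treating autoland and mozilla-inbound as the integration branches is an assumption.

```python
# Assumption: these two projects are the integration branches in question.
INTEGRATION_BRANCHES = {"autoland", "mozilla-inbound"}
JS_SUITES = {"jsreftest", "jittest"}


def android_hw_js_task_allowed(suite, build_type, project):
    """Return False for android-hw tasks that this comment proposes to stop."""
    if suite not in JS_SUITES:
        # The rule only covers jsreftest and jittest.
        return True
    if build_type == "debug" and project in INTEGRATION_BRANCHES:
        # No debug jsreftest/jittest on integration branches.
        return False
    if suite == "jittest" and build_type == "opt" and project == "try":
        # No opt jittest option on try.
        return False
    return True
```

Under this sketch, debug jsreftest still runs on mozilla-central and try, and pgo jittest remains available on try.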
Comment 9•5 years ago
bugherder |
Comment 10•5 years ago
Backlog seems to be caught up now.
Comment 11•5 years ago
I moved 10 perf devices to the unit queue yesterday at 11:20am pacific.
Since the backlog is gone (and we now have perf backlog), I've moved them back to perf.
Comment 12•5 years ago
It might just be random fluctuation but the backlog is building up again on the unit workers.
Assignee | ||
Comment 13•5 years ago
:gbrown, aerickson - looks like backlog is building. As of this moment I count 86 for unittest-p2, 218 for perf-p2.
Are these regular numbers?
Comment 14•5 years ago
I don't know.
Comment 15•5 years ago
When I posted comment 12 it was ~120 for unittest-p2 and over 300 for perf-p2. So maybe it's getting better/stabilizing.
Comment 16•5 years ago
More than half of the pending tasks are from try:
try: pending 'android-hw' tasks: 211, oldest pending submitted 9h, 43m ago
mozilla-inbound: pending 'android-hw' tasks: 40, oldest pending submitted 4h, 23m ago
autoland: pending 'android-hw' tasks: 55, oldest pending submitted 5h ago
mozilla-central: pending 'android-hw' tasks: 91, oldest pending submitted 2h, 14m ago
mozilla-beta: pending 'android-hw' tasks: 0
mozilla-release: pending 'android-hw' tasks: 0
total: pending 'android-hw' tasks: 397
But we are lagging the production branches by too much.
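For reference, the per-branch counts above can be checked with a few lines of Python. The numbers are copied from this comment; the variable names are just for illustration.

```python
# Pending 'android-hw' task counts per branch, as listed in this comment.
pending = {
    "try": 211,
    "mozilla-inbound": 40,
    "autoland": 55,
    "mozilla-central": 91,
    "mozilla-beta": 0,
    "mozilla-release": 0,
}

total = sum(pending.values())
try_share = pending["try"] / total

print(f"total pending: {total}")
print(f"try share of pending: {try_share:.0%}")
```

The sum matches the reported total of 397, and try accounts for roughly 53% of it.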
Comment 17•5 years ago
/cc Jessie
Per https://tools.taskcluster.net/provisioners/proj-autophone/worker-types the perf-p2 backlog is currently at 1914 jobs while the unit-p2 backlog is at 302. We can keep shifting more devices from unit to perf but I'm not actually suggesting we do that right now because it means my pending try push will take even longer to complete. :)
I think we need a plan to add more devices, because this doesn't feel sustainable, and increasing the number of jobs (which I'll be doing in bug 1555479 and bug 1525314) will only make this worse.
Comment 18•5 years ago
Most of these were initially from https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&author=gpascutto%40mozilla.com&searchStr=android-hw. Those jobs have failed with exceptions because they took too long to run, but they prevented other jobs from starting, and those other jobs have now accumulated in the pending queues. The failed jobs will be removed from the queues over time.
With a limited number of physical devices we will always be subject to being DOSed. We need to educate people that android-hw is a limited resource and can be easily overwhelmed. We also need to be proactive by reaching out to the people and asking them to cancel their jobs if they don't really need them.
Assignee | ||
Comment 19•5 years ago
Though developer education about the limited resource of Android hardware can be effective, past experience with windows10-aarch64 shows that most simply don't pay attention. Often, the pushes are simply all
so a lot of unnecessary tasks are scheduled.
My proposed solution, which will hold us over until we have sufficient device count, is to:
- prevent android-hw from showing up with mach try syntax
- require --full keyword with mach try fuzzy
For windows10-aarch64, the combination of these approaches helped reduce the backlog from 100+ to near-zero, which proved my (anecdotal) claims that most of the jobs are actually unnecessary.
:bc, :gbrown - what do you think about the proposal? Do we want to take a less drastic measure (further restrict branches? other measures?).
Comment 20•5 years ago
(In reply to Edwin Gao (:egao) from comment #19)
- prevent android-hw from showing up with mach try syntax
This is with the interactive chooser? Yes, that sounds good.
- require --full keyword with mach try fuzzy
This also sounds good.
I approve.
Comment 21•5 years ago
Last night (~12 hours ago) I checked the backlog and it was on the order of ~1750 for perf and ~500 for unit. I went hunting for try pushes that might have inadvertently included hw-p2 jobs and found a few, so I cancelled those jobs. It brought the numbers down slightly but today they're back up to 1500+ for perf and 623 for unit.
Assignee | ||
Comment 22•5 years ago
So it looks like we already prevent users from scheduling android-hw jobs using the mach try syntax approach: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/try_option_syntax.py#580
    # Don't schedule android-hw tests when try option syntax is used
    if 'android-hw' in task.label:
        return False
I have put up a patch to require --full for the mach try fuzzy query.
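To make the intended behaviour concrete, here is a minimal sketch of hiding android-hw tasks from a selector unless a full listing is requested. It is not the code from the patch; the function name, the `full` parameter, and the example labels are assumptions for illustration only.

```python
def selectable_labels(labels, full=False):
    """Return the task labels offered to the user.

    Hypothetical helper: android-hw tasks are only offered when the caller
    asks for the full listing, mirroring the --full requirement above.
    """
    if full:
        return list(labels)
    return [label for label in labels if "android-hw" not in label]


# Illustrative labels only; real task labels may differ.
labels = [
    "test-linux64/opt-jsreftest-1",
    "test-android-hw-p2-8-0-arm7-api-16/debug-jsreftest-1",
]
print(selectable_labels(labels))
print(selectable_labels(labels, full=True))
```

The point of the design is that scheduling android-hw work stays possible, but takes a deliberate extra step rather than happening by accident in a broad query.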
Assignee | ||
Comment 23•5 years ago
Comment 24•5 years ago
Seems like a good idea. Thanks.
Comment 25•5 years ago
Comment 26•5 years ago
bugherder |
Comment 27•5 years ago
We need to educate people that android-hw is a limited resource and can be easily overwhelmed
FWIW my push was originally made because Android (raptor) performance numbers are very noisy and it's hard to make sure nothing was regressed without sufficient samples (and the original push had a number of Android tests that looked like they might). So nothing in comment 19 would have stopped it because it was very much intentional.
Perf numbers being very noisy and there being limited hardware is a painful recipe. I've spent the last few days poring over all results and manually re-triggering the individual tests that were still inconclusive.
Comment 28•5 years ago
(In reply to Edwin Gao (:egao) from comment #22)
Right. I was thinking of the interactive ui but --full should do the job I think.
(In reply to Gian-Carlo Pascutto [:gcp] from comment #27)
Thanks. It is sometimes necessary to run lots of jobs against android-hw. We just need to be mindful of the timing. Large jobs on try are best done overnight or on weekends if possible. Multiple large jobs should be staged so that they don't form a bottleneck for other developers.
The way the scheduling works is that production branches like mozilla-central, autoland, mozilla-inbound etc will always be served before try. With the current heavy load on production especially during the day, that means that try does not get jobs as often as we would like or in fact at all. This results in jobs staying in the queue for longer than a day and then being cancelled.
It is unfortunate that we do not have more devices but that is the situation we find ourselves in.
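The scheduling behaviour described above can be modelled with a small priority queue. This is a toy model, not the actual queue implementation; the class name, the two priority levels, and the task labels are assumptions for illustration.

```python
import heapq
import itertools

# Production branches named in this comment; everything else is treated
# as lower priority in this toy model.
PRODUCTION = {"mozilla-central", "autoland", "mozilla-inbound"}


class DeviceQueue:
    """Toy model: production tasks are always handed out before try tasks."""

    def __init__(self):
        self._heap = []
        # Monotonic counter keeps FIFO order within one priority level.
        self._counter = itertools.count()

    def submit(self, project, label):
        priority = 0 if project in PRODUCTION else 1
        heapq.heappush(self._heap, (priority, next(self._counter), project, label))

    def next_task(self):
        """Return the next (project, label) to run, or None if empty."""
        if not self._heap:
            return None
        _, _, project, label = heapq.heappop(self._heap)
        return project, label


queue = DeviceQueue()
queue.submit("try", "jsreftest-1")
queue.submit("autoland", "jsreftest-1")
queue.submit("mozilla-central", "jittest-1")
# Both production tasks come out before the earlier try task.
print(queue.next_task(), queue.next_task(), queue.next_task())
```

As long as production work keeps arriving faster than devices free up, the try entries in such a queue never reach the front, which matches the starvation described above.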
Assignee | ||
Comment 29•5 years ago
Closing this bug as I believe the backlog for P2 has been addressed satisfactorily.