Bug 1194213 (Closed) - Retrigger job on same push besides automatic backfilling
Opened 9 years ago
Closed 9 years ago
Category: Testing :: General (defect)
Tracking: Not tracked
Status: RESOLVED INVALID
Reporter: armenzg; Assigned: adusca
References: Blocks 1 open bug
I believe we reduced the dial on automatic backfilling to *only* backfill without re-triggering the job that goes orange.
jgraham tells me that it would be useful for him.
adusca: would you mind looking into it?
<armenzg> we have automated backfilling
<armenzg> we fill the holes in coverage when a test failure appears
<jgraham> armenzg: I want it to tell the autostarring code about new intermittents without the need for manual intervention
<armenzg> jgraham, can you rephrase?
<armenzg> do you want your autostarring code to know when a new orange job appears?
<Ms2ger> Filing bugs for new intermittents?
<jgraham> armenzg: One way to tell the autostarring code about a previously unknown (to it) intermittent would be to use retriggered jobs on the same push; if one run is green and another has a test failure, that test failure is probably an intermittent that can be automarked in the future
<jgraham> Then if we see it a number of times (and jmaher|afk will tell you we usually don't) we file a bug for it automatically
<armenzg> jgraham, would ActiveData give you this info?
<armenzg> or you need that extra re-trigger?
<jgraham> armenzg: The retrigger seems important otherwise you aren't comparing like-for-like
<armenzg> jgraham, OK - we can add that back (the extra re-trigger)
<armenzg> after that, what else would you need?
<armenzg> or you're good at that point?
<jgraham> Well for now that seems like a good start
<armenzg> (you have two jobs of the same build on the same push)
<jgraham> I imagine there will be all sorts of problems with this system at first, but it will be more obvious what those problems actually are once we encounter them
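For illustration only, here is a minimal sketch of the heuristic jgraham describes above. The Job structure and function names are hypothetical, not the actual autostarring code:

from collections import defaultdict
from typing import NamedTuple


class Job(NamedTuple):
    name: str                # e.g. "mochitest-2"
    result: str              # "success", "testfailed", ...
    failed_tests: frozenset  # names of the tests that failed in this run


def likely_intermittents(jobs_on_push):
    # Group all runs of the same job on one push; if at least one run is
    # green and another failed, the failing tests are probably intermittent.
    runs_by_name = defaultdict(list)
    for job in jobs_on_push:
        runs_by_name[job.name].append(job)

    intermittents = set()
    for runs in runs_by_name.values():
        greens = [j for j in runs if j.result == "success"]
        reds = [j for j in runs if j.result == "testfailed"]
        if greens and reds:
            for j in reds:
                intermittents.update(j.failed_tests)
    return intermittents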
Assignee
Comment 1 • 9 years ago
The new version is deployed. Now automatic backfilling will also trigger failed jobs one extra time to help check for intermittents.
Reporter
Updated • 9 years ago
Reporter
Comment 2 • 9 years ago
adusca: can you please run automatic backfilling in dry-run mode for now?
We need to turn the dial back until we handle backouts properly.
We need to mimic what sheriffs do, and right now we don't handle it properly.
philor: can you please file bugs about issues like this? We're happy to fix them.
jgraham: we added your requested re-trigger of failed jobs on inbound; however, it seems to add a bit too much extra load on inbound.
What do you feel is the right balance?
What specifically are we trying to achieve?
00:47 <philor> has someone gone mad and starting blindly retriggering failed jobs as though that would produce something other than more pain, or have we started automatically retriggering failed jobs as though that would produce something other than more pain?
00:48 <heycam> the latter
00:48 <heycam> at least on try
00:48 <heycam> (where I find it useful)
00:48 <heycam> haven't paid enough attention to notice if it happens on other trees
00:48 <philor> yeah, I know about on try, where it's... not entirely unreasonable, quite, though not something we can actually afford
00:50 <philor> but I mean shit like running Waldo's jittests twice, or having two failed Android mochitest-9s on almost every push, as though someone cared
00:50 <philor> nearly every debug Mac mochitest-2 run leaks and IOSurface, great, I don't want to see it twice per push
00:52 <philor> on try, probably useful. in production, particularly on m-i and m-c, at the absolute most maybe 2% of failed jobs should be retriggered by someone manually looking at them and deciding to roll the dice again
00:52 <philor> probably less than that, except nights and weekends
00:53 <philor> Windows mochitest-2? never. b2g opt M6 and debug M15? they shouldn't even run once, running them twice is batshit insane
00:56 — philor goes on a killing spree
00:56 <philor> oh, ffs
00:56 <philor> ARMENZG@!#
00:57 <philor> now he's automatically filling in skipped jittest runs back to the last green one
00:57 <philor> IT'S ALREADY BACKED OUT
00:58 <philor> we have 200 pending win7 test jobs even though it's 9pm, and he's pointlessly retriggering jobs to fill in below a BACKED OUT PUSH
Comment 3 • 9 years ago
OK, so I guess what I want is, or can be, slightly more nuanced than this. Basically I want an automatic retrigger on the same job if we can't pin the failures on a known problem (either an existing intermittent or an existing bug). This is probably something that treeherder should do in response to trying to autostar a job, finding failures it can't star and determining that there are no more instances of that job on that push.
Having jobs that are too expensive to retrigger is a problem. Why are they so big?
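A rough sketch of the decision described above, assuming hypothetical helpers; known_problem(), instances_of() and request_retrigger() stand in for whatever Treeherder's autostarring code would actually call:

def maybe_retrigger(push, job, failures,
                    known_problem, instances_of, request_retrigger):
    # Retrigger the job once if some failures could not be matched to a
    # known problem and there is no other run of this job on the push to
    # compare against.
    unmatched = [f for f in failures if not known_problem(f)]
    if not unmatched:
        return False  # everything matched an existing intermittent or bug
    if len(instances_of(push, job)) > 1:
        return False  # another run of this job already exists on this push
    request_retrigger(push, job)
    return True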
Comment 4 • 9 years ago
Interesting bug. I think we can build robots that do a lot of great work, but building them to handle every situation is exponentially harder. While we can account for the case :philor mentions above fairly easily, and likewise for any other single abnormal case, it becomes difficult to outline all the cases up front and handle them reliably.
There are safeguards in place for the automatic retriggering/bisection, or at least decent first takes at them.
:philor, it is frustrating when there are backlogs, failures, closures and other random things slowing down progress, but it would be more welcoming to new contributors who are interested in helping out if comments and conversations were friendlier. Even qualifying it with "*I am frustrated because XYZ*, so why are you doing ABC!!" would at least put things into more context. Needless to say, you always figure out the situation and are usually right on the ball when commenting on root causes.
Glad we have a use case to work around, or at least work towards reducing. jgraham brings up a good point about why the jobs are so long! Most likely more chunking would help us get them under 20 minutes/job, and we could have fewer of these conversations.
Reporter
Comment 5 • 9 years ago
philor, I believe what you're complaining about is:
1) too many jobs triggered in the case of a back out (easy to fix - bug 1195809)
2) we re-trigger perma failures (bug 1195824)
> running Waldo's jittests twice, or having two failed Android mochitest-9s on almost every push, as though someone cared
Could you elaborate on this?
> Windows mochitest-2? never. b2g opt M6 and debug M15? they shouldn't even run once, running them twice is ... insane
What is the reason you refer to those specific builders?
What do you think could be done differently?
Query ActiveData for specific criteria?
Query pool capacity or pending jobs? (this might be good regardless - bug 1195851)
###########
jgraham:
> Basically I want an automatic retrigger on the same job if we can't
> pin the failures on a known problem (either an existing intermittent or an existing bug)
Any suggestions on how to query for this?
Should you file a treeherder bug for this?
Treeherder can request from pulse_actions to re-trigger any job after failing to auto-star.
This is done through a pulse message on a known exchange (we do this for sheriffs' manual backfilling); see the sketch below.
Would this work for you?
Can we disable this extra re-trigger we're doing and close this bug?
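For illustration, a rough sketch of what publishing such a re-trigger request over Pulse could look like with kombu. The exchange name, routing key and message fields below are made up; the real exchange and payload used by pulse_actions are not shown in this bug:

import json

from kombu import Connection, Exchange, Producer

PULSE_URL = "amqps://someuser:somepassword@pulse.mozilla.org:5671"  # placeholder credentials
RETRIGGER_EXCHANGE = "exchange/example/retrigger-requests"  # made-up exchange name


def request_retrigger(repo, revision, buildername):
    # Publish a JSON message asking the retrigger service (hypothetically)
    # to re-trigger a job for the given repo/revision/buildername.
    exchange = Exchange(RETRIGGER_EXCHANGE, type="topic")
    with Connection(PULSE_URL) as connection:
        producer = Producer(connection.channel(), exchange=exchange)
        producer.publish(
            json.dumps({
                "action": "retrigger",
                "repo": repo,
                "revision": revision,
                "buildername": buildername,
            }),
            routing_key="%s.retrigger" % repo,
            content_type="application/json",
        )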
###########
jmaher: yes, we should file a bug for investigating chunking or reducing runtimes.
I feel we should build a new gofaster to answer some of this:
http://brasstacks.mozilla.com/gofaster/#/
Comment 6 • 9 years ago
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #5)
> jgraham:
> > Basically I want an automatic retrigger on the same job if we can't
> > pin the failures on a known problem (either an existing intermittent or an existing bug)
>
> Any suggestions on how to query for this?
> Should you file a treeherder bug for this?
> Treeherder can request from pulse_actions to re-trigger any job after
> failing to auto-star.
> This is done through a pulse message on a known exchange (We do this for
> sheriffs' manual backfilling)
> Would this work for you?
> Can we disable this extra re-trigger we're doing and close this bug?
I can build this into treeherder, rather than relying on this bot to do the work. That will give me the context to make more intelligent decisions about when to retrigger.
I'm happy if you revert this change.
Assignee
Comment 7 • 9 years ago
I reverted the change, so now automatic backfilling is back to just backfilling failed jobs without extra retriggers.
Reporter
Comment 8 • 9 years ago
Thanks adusca!
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID