Closed Bug 1625168 Opened 5 years ago Closed 5 years ago

Decision task frequently fails with mach try auto

Component: Firefox Build System :: Task Configuration
Type: defect
Priority: Not set
Severity: critical
Tracking: Not tracked
Status: RESOLVED FIXED
Reporter: sg
Assignee: ahal
References: Blocks 1 open bug
Attachments: 1 file

Over the last few days, I have frequently experienced Decision task failures/timeouts on try pushes. Today, a push only worked on the fourth attempt.

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=294851879&revision=17bdf54eb556907310e1194393a0de6490a2daa4

https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=294855005&revision=beec17c2c8722532518477fc6e1faf3e1646f030

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a381ecb762e2fd51d9b2f38f0ebbcbe650f2b2a0&selectedJob=294858656

all failed, then finally

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1fc665bd3fd471d955dd7c816b077c9165685970

succeeded (I changed the commit message on the last one, that's why it shows a different revision, but the content was exactly the same).

This has cost me a lot of time; I'm not sure whether others are affected as well.

I've asked the Taskcluster team to look into this.
Treeherder mainly displays what happens on Taskcluster.

Hi marco, ahal,
This seems to be an issue with mach try auto.

Is there a way to make it more obvious which component/repo issues should be filed against?

Flags: needinfo?(mcastelluccio)
Flags: needinfo?(ahal)

Fyi ./mach try auto is very experimental atm (we haven't announced it anywhere yet), so expect issues.

I think what's happening here is that for some reason the bugbug service is failing to compute the results for this push, then the taskgraph isn't propagating the error properly. It would also help if ./mach try auto enabled verbose logging in the Decision task to help see what's going on.

Component: Treeherder: Infrastructure → Task Configuration
Flags: needinfo?(ahal)
Product: Tree Management → Firefox Build System
Version: --- → unspecified

Oh, interesting. Sorry I didn't mention that these used mach try auto. Since it didn't fail deterministically, I thought it was an infrastructure issue. (mach try auto is incredibly useful, so it would be really great if it worked reliably.)

Summary: Decision task frequently fails → Decision task frequently fails with mach try auto

Yikes.. I forgot to increment i in the timeout code:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/optimize/bugbug.py#46

So my guess was correct. I'll fix the timeout so that this doesn't wait 30 minutes to fail. Though the underlying cause seems to be that the service just isn't processing this push (it presumably keeps returning 202).
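For illustration, the kind of bounded polling loop discussed here could look roughly like the sketch below. This is not the actual code in bugbug.py; the names `poll_until_ready`, `ServiceTimeout`, and the `fetch` callable are hypothetical, and the only behaviour taken from this bug is that the service answers 202 while a push is still being processed and that the loop counter has to be incremented for the timeout to ever trigger.

```python
import time


class ServiceTimeout(Exception):
    """Raised when the service never returns a final answer (hypothetical name)."""


def poll_until_ready(fetch, attempts=5, interval=0.0, sleep=time.sleep):
    """Call fetch() until it stops returning HTTP 202, or give up.

    `fetch` is a hypothetical callable returning (status_code, payload);
    in a real client it would wrap the HTTP request to the service.
    """
    i = 0
    while i < attempts:
        status, payload = fetch()
        if status != 202:
            # Anything other than "still processing" is treated as final.
            return payload
        sleep(interval)
        i += 1  # without this increment the attempt limit is never reached
    # Fail loudly instead of returning nothing, so the caller sees the timeout.
    raise ServiceTimeout(f"no result after {attempts} attempts")
```

Raising an exception on the timeout path, rather than falling through silently, is what lets the Decision task report the failure instead of running until it is killed.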

Keywords: leave-open
Assignee: nobody → ahal
Status: NEW → ASSIGNED
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/cd0c3c759c83
[taskgraph] Raise exception when timing out waiting for bugbug service, r=marco

(In reply to Simon Giesecke [:sg] [he/him] from comment #4)

> Oh, interesting. Sorry I didn't mention that these used mach try auto. Since it didn't fail deterministically, I thought it was an infrastructure issue. (mach try auto is incredibly useful, so it would be really great if it worked reliably.)

Have you seen failures with specific patches, or generically? I'm going to add more logging in the bugbug service so I can more easily find out what happens when things go wrong.

Flags: needinfo?(mcastelluccio)

Just a suggestion, while this is still experimental, instead of pushing again you could retrigger the decision task.

(In reply to Marco Castelluccio [:marco] from comment #8)

> (In reply to Simon Giesecke [:sg] [he/him] from comment #4)
>
> > Oh, interesting. Sorry I didn't mention that these used mach try auto. Since it didn't fail deterministically, I thought it was an infrastructure issue. (mach try auto is incredibly useful, so it would be really great if it worked reliably.)
>
> Have you seen failures with specific patches, or generically? I'm going to add more logging in the bugbug service so I can more easily find out what happens when things go wrong.

I am not completely sure, but I think the failed attempts all changed quite basic things in mfbt or xpcom/ds.

(In reply to Marco Castelluccio [:marco] from comment #9)

> Just a suggestion, while this is still experimental, instead of pushing again you could retrigger the decision task.

Unfortunately, due to an issue with my account, I can't retrigger any tasks at the moment. Hope this will be resolved soon.

I made quite a few improvements in the bugbug HTTP service, so this should be fixed.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED