1255225 - TC jobs occasionally fail and don't get a parsed failure in Treeherder.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Description

•

9 years ago

This job failed: https://public-artifacts.taskcluster.net/OxYnpNZFSmyTF3ZKRwJpyw/0/public/logs/live_backing.log All treeherder sees is "Failure summary is empty" in the failure summary pane. Looking in the log, the very last line is: [taskcluster] Unsuccessful task run with exit code: 141 completed in 312.984 seconds I'm not seeing the actual failure in the log, but it really should print something that treeherder can pick up and use as a failure summary.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Comment 1

•

9 years ago

Not sure if this should block switching TC stuff to tier-1, but I'll throw it on the pile for someone else to look at...

Blocks: 1243024

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Comment 2

•

9 years ago

Then there's this one: https://public-artifacts.taskcluster.net/fUFD14ZfTWKxg0xyup3Mww/0/public/logs/live_backing.log No summary gets shown in Treeherder. Bottom of the log has a line with: 92 657k 92 611k 0 0 115 0 1:37:37 1:29:56 0:07:41 0[taskcluster:error] Task timeout after 5400 seconds. Force killing container. Followed by: [taskcluster] Unsuccessful task run with exit code: -1 completed in 5402.56 seconds Should that first bit have been split into two separate lines so that "[taskcluster:error]" was at the start of a new line, and would that have made it get parsed on Treeherder?

Joel Maher ( :jmaher ) (UTC -8)

Comment 3

•

9 years ago

this seems pretty important, as in we either block or fix it asap after we switch (i.e. not Q2). I assume this is reporting status to treeherder which is the problem.

Greg Arndt [:garndt]

Assignee

Comment 4

•

9 years ago

Looking at the task in the bug description, it looks like the task failed because 'curl' couldn't download the file. As far as there being an error put into the logs, it looks like this should be handled in the script that's running in the task. If we want that error to be parsed and displayed in the treeherder logs, that error must conform to one of treeherder's error parsing regexes. (prefixing a line with [<component>:error] will get it parsed. As far as comment 2 it does look like somehow there was a race between writing the last line from the task and the line when the task was killed. We probably could/should write a new line when writing that error. Usually the force kill message appears on its own line and gets parsed, but clearly that didn't happen in this case.

Joel Maher ( :jmaher ) (UTC -8)

Comment 5

•

9 years ago

I had thought this was an intermittent/perma failure for a test case/build. If this is core harness/infra, I would vote this down from must have to should have- as in we can fix the harness issues, but core sheriffing would not be affected the majority of the time. :kwierso, do chime in if you feel I am off base.

Dustin J. Mitchell [:dustin] (he/him)

Comment 6

•

9 years ago

Bug 1231200 covers converting some more of the shell-script errors into things treeherder will show.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Comment 7

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #5) > I had thought this was an intermittent/perma failure for a test case/build. > If this is core harness/infra, I would vote this down from must have to > should have- as in we can fix the harness issues, but core sheriffing would > not be affected the majority of the time. :kwierso, do chime in if you feel > I am off base. Yeah, at the moment, I either star and move on with my day, or star then retrigger then move on with my day. It's possible these could hide some actual bustage, but it's unlikely that it would happen frequently enough to truly make things unreasonably bad.

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Comment 8

•

9 years ago

I'm not reading the visibility policy's failure output requirements strictly enough to say these non-parsed instances mean the tc jobs can't be tier-1, though I'll defer to the other sheriffs on this if they disagree.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 9

•

9 years ago

I already filed this as bug 1254063 a couple days ago.

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

9 years ago

Blocks: 1254063

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

9 years ago

Depends on: 1231200

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

•

9 years ago

From bug 1254063: --- https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Firefox%20UI&filter-tier=2&filter-tier=3&fromchange=71c56930fb7f&selectedJob=23106936 This job has no failure summary set so that you are forced to manually update the log and check for the failure. In this case it is because of: Failed to start Xvfb, retry: 2 + sleep 2 + '[' 2 -gt 2 ']' + xvinfo xvinfo: Unable to open display :0 --- https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-searchStr=Firefox%20UI&filter-tier=2&filter-tier=3&fromchange=41d7e0f6b503&selectedJob=23190672 curl --fail -o mozharness.zip --retry 10 -L $MOZHARNESS_URL + curl --fail -o mozharness.zip --retry 10 -L https://queue.taskcluster.net/v1/task/VwFh2hFKQH-6maIm5zyNUQ/artifacts/public/build/mozharness.zip % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0 [..] 72 657k 72 475k 0 0 6045 0 0:01:51 0:01:20 0:00:31 0 72 657k 72 475k 0 0 6045 0 0:01:51 0:01:20 0:00:31 0 curl: (18) transfer closed with 187040 bytes remaining to read --- The first of those is https://tools.taskcluster.net/task-inspector/#e6AI4R1lQkOiW6LfiBiwjA/0 and the second https://tools.taskcluster.net/task-inspector/#WRuZ43i_TI2z1p2yQOCo9Q/0 In the first case, it looks like xvfb was the issue, and I don't see any more useful information - it looks like it just died. In the second case, the curl download failed, always a possibility with long HTTP downloads.

Assignee: nobody → dustin

Dustin J. Mitchell [:dustin] (he/him)

Comment 12

•

9 years ago

So I think the ask here is better compliance with "Failures must appear in the Treeherder failure summary in order to avoid having to open the full log for every failure." I'm not sure what that means, though. https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy linked to a utils.py that didn't seem relevant, but going back to a refactor it seems like the interesting stuff was moved to https://github.com/mozilla/treeherder/blob/master/treeherder/model/error_summary.py. Even then, I don't really see what the format is (I don't see where the 'all_errors' attribute of an artifact blob is set). From context, I'm guessing it's looking for lines matching /^\[(?<component>.*):error\]/ So more specifically, we're looking at having all possible failures include a log artifact with a line matching that pattern. I'm not sure "all" is achievable, but certainly bug 1231200 addresses a class of failure modes (including Henrik's first comment), and including an extra newline before the timeout error will help with comment 1. Comment 0 and Henrik's second comment are the same, and can also be addressed in bug 1231200. But more generally, I think this will have to be an ongoing process - we've identified a number of failure modes, but it's impossible to ever say we've identified all failure modes. As we discover more, we'll make sure they're identified correctly. Wes, did I get my guesses correct at the top? If so, I'll edit the wiki to match. If not, please edit it to enlighten me :)

Flags: needinfo?(wkocher)

Dustin J. Mitchell [:dustin] (he/him)

Comment 13

•

9 years ago

Attached file https://github.com/taskcluster/docker-worker/pull/215 (deleted) — Details

Attachment #8729237 - Flags: review?(garndt)

Dustin J. Mitchell [:dustin] (he/him)

Comment 14

•

9 years ago

Worth noting, I would have thought curl's --retry option would retry the mozharness download :(

Greg Arndt [:garndt]

Assignee

Updated

•

9 years ago

Attachment #8729237 - Flags: review?(garndt) → review+

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Reporter

Comment 15

•

9 years ago

> Wes, did I get my guesses correct at the top? If so, I'll edit the wiki to match. If not, please edit it to enlighten me :) Yeah, looks like it was already changed.

Flags: needinfo?(wkocher)

Dustin J. Mitchell [:dustin] (he/him)

Comment 16

•

9 years ago

Greg will be deploying this.

Assignee: dustin → garndt

Greg Arndt [:garndt]

Assignee

Comment 17

•

9 years ago

New AMI started rolling out yesterday and all workers should be cycled by the weekend. Considering this resolved.

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → FIXED