1204281 - Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container. / [taskcluster:error] Task timeout after 7200 seconds. Force killing container.

Phil Ringnalda (:philor)

Reporter

Description

•

9 years ago

+++ This bug was initially created as a clone of Bug #1198092 +++

Comment hidden (Legacy TBPL/Treeherder Robot)

Comment hidden (Intermittent Failures Robot)

Nigel Babu [:nigelb]

Comment 53

•

9 years ago

I think this has to do with Gij's retry logic. As Michael said on the mailing list, each Gij test is run 5 times within a test chunk (g. Gij4) before it is marked as failing. Then that chunk itself is retried up to 5 times before the whole thing is marked as failing.[1] This means we may end up retrying so much that we timeout the task. I wouldn't consider this a bug for TC but rather a bug for b2g Gij automation. What do you guys think about us changing the component here? [1]: https://groups.google.com/d/msg/mozilla.dev.fxos/LTTobhx4tCc/nN_gad51AgAJ

Flags: needinfo?(mhenretty)

Michael Henretty [:mikehenrty][:mhenretty]

Comment 54

•

9 years ago

(In reply to Nigel Babu [:nigelb] from comment #53) > I think this has to do with Gij's retry logic. As Michael said on the > mailing list, each Gij test is run 5 times within a test chunk (g. Gij4) > before it is marked as failing. Then that chunk itself is retried up to 5 > times before the whole thing is marked as failing.[1] This means we may end > up retrying so much that we timeout the task. > > I wouldn't consider this a bug for TC but rather a bug for b2g Gij > automation. What do you guys think about us changing the component here? > > [1]: > https://groups.google.com/d/msg/mozilla.dev.fxos/LTTobhx4tCc/nN_gad51AgAJ Certainly our retry logic makes this worse, but the real problem is that for these bad (ie. long) runs, something happens which makes the test runner pause 11 minutes between test runs. If you take a look at one of the failures [1], there was really only 1 test that was being retried, apps/system/test/marionette/text_selection_test.js. But several tests before that (which passed the first time) would have an 11 minute delay between tests. This is where the problem is. Now this could still totally be a bug in our Gij automation rather than taskcluster, but we should investigate what is happening during those 11 minutes before drawing any conclusions. 1.) https://public-artifacts.taskcluster.net/XMYzs9RrQyyVTjlQy7d_Aw/0/public/logs/live_backing.log

Flags: needinfo?(mhenretty)

Comment hidden (Intermittent Failures Robot)

Greg Arndt [:garndt]

Comment 64

•

9 years ago

:rail it looks like some of the spikes recently were related to funsize tasks timing out, such as https://tools.taskcluster.net/task-inspector/#WXTTC2IoQ-WesLVclUDX7A/0 . Most oft he tasks seem to fail with "HTTPError: 400 Client Error: BAD REQUEST" The last week or so of tasks, there were 755 failures and 625 of them were from a funsize task. I wasn't sure if you were aware of anything going on. Let me know if there is anything I can dig into with this.

Flags: needinfo?(rail)

Rail Aliiev [:rail]

Comment 65

•

9 years ago

I filed bug 1223872 to make balrog submission have less race conditions, but I'm not sure if it can be quickly and easily resolved. There is also bug 1224698 to help with networking issues we have after aus migrated to scl3, but still routed through phx.

Flags: needinfo?(rail)

Greg Arndt [:garndt]

Comment 66

•

9 years ago

Thanks for the update! At least there are bugs out there to fix this. I just didn't know if you were aware of the spike or not. Thanks again!

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 85

•

9 years ago

looking at the last few items here, I see we timeout when the entire job is >60 minutes- not the tests. This includes setup, running tests, cleanup. I see 27 minutes for setup due to vcs errors: https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=6263182 https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=6263184 https://treeherder.mozilla.org/logviewer.html#?repo=fx-team&job_id=6263184 all of these started at roughly the same time, so this looks to be a root issue in the time to complete: curl --connect-timeout 30 --speed-limit 500000 -L -o /home/worker/.tc-vcs/clones/hg.mozilla.org/integration/gaia-central.tar.gz https://queue.taskcluster.net/v1/task/Dq4miM-9T6ygRTP9h-XWkQ/artifacts/public/hg.mozilla.org/integration/gaia-central.tar.gz this should normally take <5 minutes and it shows 20+ minutes in many logs. :rail, can you help me find the person who would know why the vcs sync is so slow- I assume this was an isolated incident.

Flags: needinfo?(rail)

Joel Maher ( :jmaher ) (UTC -8)

Comment 86

•

9 years ago

then I see a few others where we have no vcs issues and I see: 14:35:39 INFO - Running tests in /home/worker/gaia/apps/system/test/marionette/audio_channel_competing_test.js 15:03:25 INFO - ..................................................................................................................................................... 15:03:25 INFO - /home/worker/gaia/apps/system/test/marionette/audio_channel_competing_test.js failed. Will retry. 15:03:30 INFO - Running tests in /home/worker/gaia/apps/system/test/marionette/audio_channel_competing_test.js [taskcluster:error] Task timeout after 3600 seconds. Force killing container. this seems as though that specific test case is failing. :bkelly, can you help me find the owner of gaia/apps/system/test/marionette/audio_channel_competing_test.js, I cannot find it in dxr, and this must live somewhere in b2g land.

Flags: needinfo?(bkelly)

Rail Aliiev [:rail]

Comment 87

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #85) > :rail, can you help me find the person who would know why the vcs sync is so > slow- I assume this was an isolated incident. I'd talk to hwine.

Flags: needinfo?(rail)

Ben Kelly [:bkelly, not reviewing]

Comment 88

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #86) > :bkelly, can you help me find the owner of > gaia/apps/system/test/marionette/audio_channel_competing_test.js, I cannot > find it in dxr, and this must live somewhere in b2g land. I emailed gregor and mhenretty. I think those are bug 1233565. Thanks for checking on this!

Flags: needinfo?(bkelly)

Michael Henretty [:mikehenrty][:mhenretty]

Comment 89

•

9 years ago

Right, we do have an owner for gaia/apps/system/test/marionette/audio_channel_competing_test.js in bug 1233565. But also note that this test has been disabled for about a week [1], and we have still been seeing this failure in automation since. So I still think this issue is caused by some large slowdown in the test runner (maybe the VM gets choked in amazon or something), and not due to any individual test(s). 1.) https://github.com/mozilla-b2g/gaia/commit/0ffdc828b44dc33b84a3b34ce1643102d04b116a

Joel Maher ( :jmaher ) (UTC -8)

Comment 90

•

9 years ago

:hwine, when you get back, could you look at why this vcs sync is taking so long intermittently? In fact, I would rather kill the job if we have a vcs sync error that takes so long as it would terminate the job faster since we know it will always terminate. Can we define a guarantee of 6 minutes for all source syncing and if we cross that threshold we fail?

Flags: needinfo?(hwine)

Comment hidden (Intermittent Failures Robot)

Hal Wine [:hwine] (use NI)

Comment 92

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #90) > :hwine, when you get back, could you look at why this vcs sync is taking so > long intermittently? Can you define "this", please? There are many (>500) vcs-sync jobs. There is lots of tuning that can be done once we know which repos are involved. > In fact, I would rather kill the job if we have a vcs > sync error that takes so long as it would terminate the job faster since we > know it will always terminate. Can we define a guarantee of 6 minutes for > all source syncing and if we cross that threshold we fail? I'm not sure I'm following here - there should be no time based dependencies in vcs-sync (it is supposed to be event driven). So "all source syncing" doesn't mean anything to me. Let's do a vidyo to educate me. However, no guarantee of that short (6m) a time. See https://wiki.mozilla.org/User:Hwine/Holiday_VCS-Sync_Troubleshooting#Diagnosing_Single_Repo_Issues for a diagram of what is happening, and how there are some events that should gate potential race conditions.

Flags: needinfo?(hwine) → needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 93

•

9 years ago

this is more of an issue with curling a repo from https://queue.taskcluster.net: https://public-artifacts.taskcluster.net/KoAcWC4BR62evcQkVSXQ1A/0/public/logs/live_backing.log (search for "Operation too slow") as per comment 85, this seems to be clustered at certain times. The link at the top of this current comment is from 2 days ago and was the only instance. Is this something that has a capacity of XX connections/second and we hit a perfect storm every now and then? :garndt, how can we debug this and ensure that the .gz is available and we can determine where the proper caches are updated/out of date? Maybe :gps would know more details.

Flags: needinfo?(jmaher) → needinfo?(garndt)

Greg Arndt [:garndt]

Comment 94

•

9 years ago

I have setup monitoring of some of our taskcluster-vcs cached repos (basically everything but emulator and device image 'repo' repos, those will come soon with a patch) so we should notice hopefully when one of those are out of date (> 48 hours old). Caches expire after 30 days so for a new cache task to not be indexed in 48 hours is enough time to look into it as long as we receive the alert. As far as ensuring that the .gz is available, it is in this case because it started the transfer. Also, in tc-vcs >2.3.17 it will give an error if the artifact couldn't be found for a particular indexed task. I'm in the process of upgrading our builders and phone builder images with that, and then can move onto the tester image used here. In this case, the artifact fact was found and was being downloaded, but it was just too slow (<500 kB/s) for too long (> 30 seconds I think). Looking at when this task was run and when the cache for gaia was created, it's possible that there was slow transfer from us-west-1 (where the ec2 instance was) and us-west-2 (where the artifact in s3 lives). If I recall correctly (I would need to double check this), our s3-copy-proxy will copy to the requesting region but in the meantime, will redirect the client to the canonical region in us-west-2 for an artifact until the copy is completed.

Flags: needinfo?(garndt)

Joel Maher ( :jmaher ) (UTC -8)

Comment 95

•

9 years ago

nice, it sounds like we have a proactive solution to this. Now to figure out how to get test timeouts in a different bug :) I will wait to see how many issues show up with this bug over the next couple of days.

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 100

•

9 years ago

:garndt, we are seeing a lot of these errors still, can you check your monitoring and see if anything stands out? Maybe we have small time windows where a lot happens, 16 issues today, 246 yesterday (Sunday), 24 last Friday. The more issues we can pinpoint the better!

Flags: needinfo?(garndt)

Greg Arndt [:garndt]

Comment 101

•

9 years ago

Spot checking some of these it appears a majority of the time is spent within the tests (usually around 56-58 minutes of the 1 hour max run time) and eventually time out. I checked out some metrics related to these instances, and it appears that a majority of them are running out of memory. Is the spike in these failing around the same time a majority of these instances on m1.medium?

Flags: needinfo?(garndt)

Joel Maher ( :jmaher ) (UTC -8)

Comment 102

•

9 years ago

ah, I overlooked the obvious! lets split the R(J) jobs into 2 chunks, :armenzg, can you look at splitting Jsreftests into 2 chunks?

Flags: needinfo?(armenzg)

Armen [:armenzg]

Comment 103

•

9 years ago

I will.

Flags: needinfo?(armenzg)

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 106

•

9 years ago

Armen, can we do mochitest-plain in more chunks, we are hitting the 3600 second timeout on chunk 4 many times. Maybe 15? 5 of the jobs are >40 minutes, most in the 50+ minute range with chunk 4 at 55 minutes+. The other 5 are in the 15-22 minute range. We should verify that --chunk-by-runtime is defined as well.

Flags: needinfo?(armenzg)

Armen [:armenzg]

Comment 107

•

9 years ago

Working on it. Bug 1242502.

Flags: needinfo?(armenzg)

Comment hidden (Intermittent Failures Robot)

Armen [:armenzg]

Comment 109

•

9 years ago

A change landed yesterday and it got merged today. We will have few more instances but this should quite down (including today): https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204281&startday=2016-01-25&endday=2016-01-26&tree=all

Comment hidden (Intermittent Failures Robot)

Armen [:armenzg]

Comment 111

•

9 years ago

If we ignore the first day of the range we go down to 16: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204281&startday=2016-01-26&endday=2016-01-31&tree=all If we ignore one more day we're down to 6: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204281&startday=2016-01-27&endday=2016-01-31&tree=all As we enable more jobs we will keep an eye on this.

Comment hidden (Intermittent Failures Robot)

Armen [:armenzg]

Comment 114

•

9 years ago

rail, funzise tasks have started spiking up (scroll down): [funsize] Publish to Balrog (today-2, chunk 4, subchunk 1)

Comment hidden (Intermittent Failures Robot)

Armen [:armenzg]

Comment 118

•

9 years ago

It seems that funsize accounts for more than half of all these occurrences and most occurred on the 26th.

Flags: needinfo?(rail)

Rail Aliiev [:rail]

Comment 119

•

9 years ago

I looked at those and most of them are timeout due to balrog submission retries. This is a known issue and aki is going to look at new worker type for balrog submission.

Flags: needinfo?(rail)

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 122

•

9 years ago

wow, 75 failures in the last week on Aurora! This looks all related to funsize stuff. :bhearsum, can you take a look at this?

Flags: needinfo?(bhearsum)

bhearsum@mozilla.com (:bhearsum)

Comment 123

•

9 years ago

(In reply to Joel Maher (:jmaher) from comment #122) > wow, 75 failures in the last week on Aurora! This looks all related to > funsize stuff. :bhearsum, can you take a look at this? It looks like Aki is going to be looking at this (maybe in a roundabout way) soon, based on comment #119.

Flags: needinfo?(bhearsum) → needinfo?(rail)

Rail Aliiev [:rail]

Comment 124

•

9 years ago

We can retry harder, but I'm not sure if it's going to be better...

Flags: needinfo?(rail)

Joel Maher ( :jmaher ) (UTC -8)

Comment 125

•

9 years ago

ok, if :aki is reworking the balrog submission that sounds like it should resolve this funsize stuff. Is this a March thing or a Q2 thing? This specific error is pretty high on the orange factor list.

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 128

•

9 years ago

:aki, can you comment on when your work for the new balrog worker will be completed and in pace? :rail, is there anything else we can do in the meantime?

Flags: needinfo?(aki)

Aki Sasaki (not active)

Comment 129

•

9 years ago

:jmaher, currently this looks like a [mid-?]q2 thing.

Flags: needinfo?(aki)

Joel Maher ( :jmaher ) (UTC -8)

Comment 130

•

9 years ago

thanks :aki! I wonder if there are ways to reduce this error outside of the new balrog worker in the short term. If not, this doesn't seem to be getting much worse, it is still one of the top issues the sheriffs have to star though.

Comment hidden (Intermittent Failures Robot)

Rail Aliiev [:rail]

Updated

•

9 years ago

Depends on: 1259423

Comment hidden (Intermittent Failures Robot)

Selena Deckelmann :selenamarie :selena

Updated

•

9 years ago

Depends on: 1244181

Phil Ringnalda (:philor)

Reporter

Updated

•

9 years ago

Summary: Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. → Intermittent [taskcluster:error] Task timeout after 3600 seconds. Force killing container. / [taskcluster:error] Task timeout after 5400 seconds. Force killing container.

Comment hidden (Intermittent Failures Robot)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 139

•

9 years ago

This got very frequent. Joel, can you investigate if this caused by a performance regression, please?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 140

•

9 years ago

ok this is 2 issues: * linux64 debug xpcshell (chunk 4/5 are hitting the 1 hour limit)- this is already fixed as this is now 10 chunks * linux64 debug mda, 1 chunk, hitting the 55+ minutes normally I am testing a patch on try to see if I can split mda into 2 chunks

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 142

•

9 years ago

ok, mda needs 3 chunks but doing so yields a failure in test_zmedia_cleanup.html: https://treeherder.mozilla.org/#/jobs?repo=try&revision=081c2ddc2501a54467d3e86be0a91ee837ac2bf5 I am thinking it might be better to just extend the timeout. :dminor, can you weigh in here and figure out if we need a longer timeout or should split this up and make it green?

Flags: needinfo?(jmaher) → needinfo?(dminor)

Dan Minor [:dminor]

Comment 143

•

9 years ago

So it looks like test_zmedia_cleanup.html was added as a hacky of way of cleaning up network state for B2G testing and doesn't actually test anything. We have a Bug 1188120 on file to remove it, I'm going to see if we can go ahead and do that. I've hit lots of intermittent failures with that test.

Flags: needinfo?(dminor)

Joel Maher ( :jmaher ) (UTC -8)

Comment 144

•

9 years ago

:dminor, would you be fine splitting this to 3 chunks for linux64 debug?

Joel Maher ( :jmaher ) (UTC -8)

Comment 145

•

9 years ago

dminor, or would you rather see a longer timeout?

Depends on: 1188120

Flags: needinfo?(dminor)

Dan Minor [:dminor]

Comment 146

•

9 years ago

Extending the timeout is fine by me.

Flags: needinfo?(dminor)

Joel Maher ( :jmaher ) (UTC -8)

Comment 147

•

9 years ago

Attached file MozReview Request: Bug 1204281 - split linux64 debug taskcluster M(mda) into 3 chunks. r?dminor (deleted) — Details

Review commit: https://reviewboard.mozilla.org/r/53970/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/53970/

Attachment #8754427 - Flags: review?(dminor)

Dan Minor [:dminor]

Comment 148

•

9 years ago

Comment on attachment 8754427 [details] MozReview Request: Bug 1204281 - split linux64 debug taskcluster M(mda) into 3 chunks. r?dminor https://reviewboard.mozilla.org/r/53970/#review50682 Please revise the commit message to reflect changing the timeout rather than splitting the test into three chunks.

Attachment #8754427 - Flags: review?(dminor) → review+

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

9 years ago

Keywords: leave-open

Pulsebot

Comment 149

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/03ed23408215

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 150

•

9 years ago

Backed out for breaking gecko decision task: https://hg.mozilla.org/integration/mozilla-inbound/rev/2b227a22287677ac7af098166a632e768e70d022 Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=03ed23408215dbc98f987c68a568af89adb25eb8

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 151

•

9 years ago

Attached patch increase timeout to 5400 seconds (deleted) — Details — Splinter Review

last patch had an indentation problem and broke the tree! this is what I get for landing code after just removing unused lines from a patch and accidentally hitting the space bar. also mozreview doesn't work here as there is already some parent review request.

Flags: needinfo?(jmaher)

Attachment #8754532 - Flags: review?(dminor)

Dan Minor [:dminor]

Comment 152

•

9 years ago

Comment on attachment 8754532 [details] [diff] [review] increase timeout to 5400 seconds Review of attachment 8754532 [details] [diff] [review]: ----------------------------------------------------------------- lgtm

Attachment #8754532 - Flags: review?(dminor) → review+

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

9 years ago

Keywords: checkin-needed

Comment hidden (Intermittent Failures Robot)

Pulsebot

Comment 154

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/823f49140d69

Keywords: checkin-needed

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 156

•

9 years ago

:dustin, can you help me out here? I extended the timeout from 3600 -> 5400 and we are still getting 5400 second timeouts. example job: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=28622438#L44951 code I used to fix the timeout: https://hg.mozilla.org/integration/mozilla-inbound/rev/823f49140d69 I tested this on try server with a few different cycles and never saw the timeout, but it could have been luck that my jobs finished in <60 minutes.

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 157

•

9 years ago

The maxRunTime on https://tools.taskcluster.net/task-inspector/#Nggm4AmrRPOmmw27HgBrbg/ is still 3600. However, timeout, which is not a parameter that means anything to docker-worker, is set to 5400. I think you want to set maxRunTime to 5400 :)

Flags: needinfo?(dustin)

Joel Maher ( :jmaher ) (UTC -8)

Comment 158

•

9 years ago

Attached patch use maxRunTime instead of Timeout (deleted) — Details — Splinter Review

thanks Dustin for the pointer, this should resolve things.

Attachment #8756500 - Flags: review?(dminor)

Dan Minor [:dminor]

Updated

•

9 years ago

Attachment #8756500 - Flags: review?(dminor) → review+

Carsten Book [:Tomcat]

Comment 159

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/823f49140d69

Comment hidden (Intermittent Failures Robot)

Pulsebot

Comment 161

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/316d1c40f0fc

Ryan VanderMeulen [:RyanVM]

Comment 162

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/316d1c40f0fc

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 171

•

8 years ago

current issue is that we run out of time on the job- it is normally 55 minutes- take some slowdown for a failed test or cleanup/symbols and we cross the 60 minute threshold. What really makes this difficult is that the times are so variable due to the docker image download/setup. I cannot just load the meta data in treeherder, I have to click on each log file which takes a long time (i.e. this random docker setup is making life harder) My overall impression here is that we need 1 or 2 more chunks, but for now I would like to just bump this up to 90 minutes. It is foolish to split this into more chunks until we can realistically reduce the 20 minute blocks of randomness in the docker setup.

Joel Maher ( :jmaher ) (UTC -8)

Comment 172

•

8 years ago

Attached patch increase browser-chrome timeout from 60 to 90 minutes (obsolete) (deleted) — Details — Splinter Review

Attachment #8763502 - Flags: review?(cbook)

Carsten Book [:Tomcat]

Comment 173

•

8 years ago

Comment on attachment 8763502 [details] [diff] [review] increase browser-chrome timeout from 60 to 90 minutes Review of attachment 8763502 [details] [diff] [review]: ----------------------------------------------------------------- looks good, thanks joel!

Attachment #8763502 - Flags: review?(cbook) → review+

Pulsebot

Comment 174

•

8 years ago

Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/f89175185de0 90 minute timeout for linux64 mochitest-browser-chrome chunks. r=Tomcat

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 175

•

8 years ago

Backed out for gecko-decision opt failures: https://hg.mozilla.org/integration/mozilla-inbound/rev/bc341233192c Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=f89175185de0a20650609a24ead315dc02c4e01f Traceback (most recent call last): File "/workspace/gecko/taskcluster/mach_commands.py", line 151, in taskgraph_decision return taskgraph.decision.taskgraph_decision(options) File "/workspace/gecko/taskcluster/taskgraph/decision.py", line 79, in taskgraph_decision create_tasks(tgg.optimized_task_graph, tgg.label_to_taskid) File "/workspace/gecko/taskcluster/taskgraph/create.py", line 61, in create_tasks f.result() File "/workspace/gecko/python/futures/concurrent/futures/_base.py", line 396, in result return self.__get_result() File "/workspace/gecko/python/futures/concurrent/futures/thread.py", line 55, in run result = self.fn(*self.args, **self.kwargs) File "/workspace/gecko/taskcluster/taskgraph/create.py", line 73, in _create_task res.raise_for_status() File "/workspace/gecko/python/requests/requests/models.py", line 840, in raise_for_status raise HTTPError(http_error_msg, response=self) HTTPError: 400 Client Error: Bad Request for url: http://taskcluster/queue/v1/task/UINhsKfoSF-0VKRdHS6uVA

Flags: needinfo?(jmaher)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 176

•

8 years ago

Hi Dustin, can you take a look at the patch and check what's wrong with it? Thank you.

Flags: needinfo?(dustin)

Joel Maher ( :jmaher ) (UTC -8)

Comment 177

•

8 years ago

I believe the problem with the patch is that I had 2 space indentation vs 4 space indentation (wrong scope...call it scope creep!).

Flags: needinfo?(jmaher)

Dustin J. Mitchell [:dustin] (he/him)

Comment 178

•

8 years ago

The actual error was quite a ways up in the logfile, and indicated a JSON schema failure because maxRunTime was added at the task, rather than task.payload, level.

Flags: needinfo?(dustin)

Joel Maher ( :jmaher ) (UTC -8)

Comment 179

•

8 years ago

Attached patch increase browser-chrome timeout from 60 to 90 minutes (v.2) (deleted) — Details — Splinter Review

ok, this has proper spacing and green on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=703cec16666126931b77b69eef4ccfbc09b7b237

Attachment #8763502 - Attachment is obsolete: true

Attachment #8763826 - Flags: review?(cbook)

Carsten Book [:Tomcat]

Comment 180

•

8 years ago

Comment on attachment 8763826 [details] [diff] [review] increase browser-chrome timeout from 60 to 90 minutes (v.2) r+ and fingers crossed - try looked also ok

Attachment #8763826 - Flags: review?(cbook) → review+

Pulsebot

Comment 181

•

8 years ago

Pushed by jmaher@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/bd33dc6449d7 90 minute timeout for linux64 mochitest-browser-chrome chunks. r=Tomcat

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 182

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/bd33dc6449d7

Comment hidden (Intermittent Failures Robot)

Joel Maher ( :jmaher ) (UTC -8)

Comment 185

•

8 years ago

:gbrown, looking at orangefactor in the previous comment, the majority of the 29 failures are linux64 asan tests, primarily: * [TC] Linux64 mochitest-media-e10s * [TC] Linux64 xpcshell-6 can you investigate those and fix the timeouts or split up the tests accordingly?

Flags: needinfo?(gbrown)

Pulsebot

Comment 186

•

8 years ago

Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/5c0b7eae936a Adjust chunks and maxRunTime to avoid tc Linux x86 intermittent timeouts; r=me

MozReview Request: Bug 1204281 - split linux64 debug taskcluster M(mda) into 3 chunks. r?dminor 9 years ago Joel Maher ( :jmaher ) (UTC -8) (deleted), text/x-review-board-request	dminor : review+	Details
increase timeout to 5400 seconds 9 years ago Joel Maher ( :jmaher ) (UTC -8) (deleted), patch	dminor : review+	Details \| Diff \| Splinter Review
use maxRunTime instead of Timeout 9 years ago Joel Maher ( :jmaher ) (UTC -8) (deleted), patch	dminor : review+	Details \| Diff \| Splinter Review
increase browser-chrome timeout from 60 to 90 minutes 8 years ago Joel Maher ( :jmaher ) (UTC -8) (deleted), patch	cbook : review+	Details \| Diff \| Splinter Review
increase browser-chrome timeout from 60 to 90 minutes (v.2) 8 years ago Joel Maher ( :jmaher ) (UTC -8) (deleted), patch	cbook : review+	Details \| Diff \| Splinter Review
increase linux64-asan mochitest-bc chunks from 10 to 16 8 years ago Geoff Brown [:gbrown] (deleted), patch	jmaher : review+	Details \| Diff \| Splinter Review