1369083 - Android Debug marionette-4 job runs for too long

Assignee

Description

•

8 years ago

Bug 1204281 records "Task timeout after 3600 seconds" failures; recently, there have been intermittent failures of that type for test-android-4.3-arm7-api-15/debug-marionette-4 jobs.

The run-time of that job increased substantially with bug 1368101:

https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=android%20marionette&tochange=7b9687c90aea55f7893ecfb0ccd5f0c954e36eb0&fromchange=740d674779eb4dada7c7f47ef03fb3aaaa65d212

Before bug 1368101, Android Debug marionette-4 jobs completed in about 35 minutes; afterwards, they required 60 minutes to complete.

Geoff Brown [:gbrown]

Assignee

Comment 1

•

8 years ago

:whimboo - Do you know why job run-time increased so much? What do you want to do to avoid timeouts?

Flags: needinfo?(hskupin)

Geoff Brown [:gbrown]

Assignee

Updated

•

8 years ago

Blocks: 1204281

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 2

•

8 years ago

The bad thing for Marionette tests on Android is that no gecko.log is provided. :( As such, it is very hard to say anything about what's going on here. But from the standard log I can see that in both cases the job gets stalled in the following test:

test_window_handles_content.py TestWindowHandles.test_window_handles_after_opening_new_tab

Besides the bug you mentioned above, I also landed a change to those tests via bug 1368526. So the default wait timeout for page_load is set to 300s. Does the task timeout mean that this is for the whole job? I could imagine that we accumulate long delays in any of those tests and finally get killed.

The chance to hit this intermittent failure seems to be kinda low. Besides those two jobs I cannot see another one with this type of failure.
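To put rough numbers on the accumulation idea, here is a minimal sketch. It assumes the 3600-second task limit and the roughly 35-minute baseline from the description, plus the 300s page_load timeout mentioned in this comment. The function name and the assumption that every stalled page load burns the full timeout are illustrative only and are not taken from the harness.

```python
TASK_TIMEOUT_S = 3600      # "Task timeout after 3600 seconds"
PAGE_LOAD_TIMEOUT_S = 300  # default page_load wait timeout from this comment


def stalls_until_task_timeout(baseline_runtime_s,
                              stall_cost_s=PAGE_LOAD_TIMEOUT_S,
                              task_timeout_s=TASK_TIMEOUT_S):
    """Return how many full-length stalls use up the remaining task budget.

    Hypothetical model: each stalled page load costs the whole timeout.
    """
    headroom = task_timeout_s - baseline_runtime_s
    if headroom <= 0:
        return 0
    return headroom // stall_cost_s


# Baseline of about 35 minutes, as reported before bug 1368101.
stalls = stalls_until_task_timeout(35 * 60)
print(stalls)
```

Under these assumptions only a handful of stalled page loads would be enough to push an otherwise healthy job into the task timeout.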

Blocks: 1368526

Flags: needinfo?(hskupin)

Keywords: regression

Geoff Brown [:gbrown]

Assignee

Comment 3

•

8 years ago

(In reply to Henrik Skupin (:whimboo) from comment #2)
> Does the task timeout mean that this is for the whole job? I could imagine
> that we accumulate long delays in any of those tests and finally get killed.

Yes, the "Task timeout after 3600 seconds" is for the whole job. I think it happens specifically when the mozharness job run from taskcluster does not complete after 3600 seconds.

> The chance to hit this intermittent failure seems to be kinda low. Beside
> those two jobs I cannot see another one with this type of failure.

From https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204281&endday=2017-05-31&startday=2017-05-30&tree=trunk, I believe there are 8 of these failures on May 30, which is concerning...but none so far on May 31 - encouraging!

If I browse current Android Debug marionette jobs on mozilla-central and look at the "Duration" reported by treeherder, chunk 4 seems to be running in 20 to 30 minutes again, but chunk 3 is now running closer to 55 minutes -- no errors, but not much room for change.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 4

•

8 years ago

Due to a Taskcluster issue with creating one-click loaners, I'm not able to check that live. Jonas put a fix in place, which will allow me to get a loaner tomorrow.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 5

•

8 years ago

(In reply to Geoff Brown [:gbrown] from comment #3)
> If I browse current Android Debug marionette jobs on mozilla-central and
> look at the "Duration" reported by treeherder, chunk 4 seems to be running
> in 20 to 30 minutes again, but chunk 3 is now running closer to 55 minutes
> -- no errors, but not much room for change.

The chunk selection implementation is pretty bad. I would assume that some tests might have been executed in chunk 3 instead of 4, and as such chunk 4 doesn't fail.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 6

•

8 years ago

Maybe this is somewhat related to bug 1368787. In some of the test jobs listed on OF I can see a lot of MessageChannel errors like `SendAccumulateChildKeyedHistograms`. Here is a non-Marionette test job:

https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=103298490&lineNumber=42249

Not sure if this might affect non-e10s builds/tests like those on Android.

Geoff Brown [:gbrown]

Assignee

Comment 7

•

8 years ago

(In reply to Henrik Skupin (:whimboo) from comment #6)
> Maybe this is somewhat related to bug 1368787. In some of the listed test
> jobs on OF I can see a lot of MessageChannel errors like
> `SendAccumulateChildKeyedHistograms`. Here a non Marionette test job:
>
> https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=103298490&lineNumber=42249

That failure is in mochitest-media-e10s and the MessageChannel errors are during shutdown, which is taking forever. That's a pattern I am aware of -- bug 1339568 -- but as far as I know, it only affects linux mochitest-media-e10s jobs and only affects shutdown.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 8

•

8 years ago

Ok, so that should be unrelated for Android then, because on that platform none of our restart tests are getting run. Given that the frequency of this failure is so low at the moment, I don't think it makes sense to dig into it. I can do so if it becomes more prominent.

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 9

•

8 years ago

The reason why we do not have the detailed gecko.log files could be that this task gets killed, so the artifacts that are usually created do not get uploaded. This is sad, because it would definitely help us here. :/ Here is an example of another job which wasn't killed:

https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=103286935&lineNumber=2139

Henrik Skupin [:whimboo][⌚️UTC+2]

Comment 10

•

8 years ago

Also, the problem with Mn3 is that it contains all the navigation tests, which take a long time, and all of them are in the same file. Chunking only picks whole files, so we would have to speed up the tests or split the file into one or two additional files.
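A minimal sketch of why file-level chunking leaves one chunk so heavy. This is not the harness's actual chunking code: `chunk_by_file`, the file names, and the runtimes in minutes are all made up for illustration, and the only property taken from this comment is that a file is never split across chunks.

```python
def chunk_by_file(files, total_chunks):
    """Deal whole files into contiguous chunks by file count.

    Hypothetical chunker: it balances the number of files per chunk,
    not their runtime, and it never splits a file.
    """
    per_chunk, extra = divmod(len(files), total_chunks)
    chunks, start = [], 0
    for index in range(total_chunks):
        size = per_chunk + (1 if index < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


def chunk_runtimes(chunks):
    """Sum the (made-up) runtime in minutes of every chunk."""
    return [sum(minutes for _, minutes in chunk) for chunk in chunks]


# One long navigation file dominates whichever chunk it lands in.
one_big_file = [
    ("test_a.py", 5), ("test_b.py", 5),
    ("test_navigation.py", 40), ("test_c.py", 5),
    ("test_d.py", 5), ("test_e.py", 5),
]

# The same tests with the long file split into two halves.
split_file = [
    ("test_a.py", 5), ("test_b.py", 5),
    ("test_navigation_1.py", 20), ("test_navigation_2.py", 20),
    ("test_c.py", 5), ("test_d.py", 5), ("test_e.py", 5),
]

before = chunk_runtimes(chunk_by_file(one_big_file, 3))
after = chunk_runtimes(chunk_by_file(split_file, 3))
print(before, after)
```

With these invented numbers the heaviest chunk drops from 45 to 30 minutes once the long file is split, which is the effect splitting the navigation tests would be aiming for.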

Geoff Brown [:gbrown]

Assignee

Comment 11

•

7 years ago

Android Mn4 is running fine lately, often completing in under 30 minutes.

Assignee: nobody → gbrown

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → WORKSFORME

Geoff Brown [:gbrown]

Assignee

Updated

•

7 years ago

Blocks: 1411358

Geoff Brown [:gbrown]

Assignee

Updated

•

7 years ago

No longer blocks: 1411358

BMO Automation

Updated

•

2 years ago

Product: Testing → Remote Protocol

Categories

(Remote Protocol :: Marionette, defect)

Tracking

(Not tracked)

People

(Reporter: gbrown, Assigned: gbrown)

References

Details

(Keywords: regression)

Security

(public)
