Closed Bug 1241297 Opened 9 years ago Closed 9 years ago

Wpt5 fails on TC but not on Buildbot

Categories

(Testing :: General, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(e10s+, firefox46 fixed, firefox47 fixed)

Status: RESOLVED FIXED
Target Milestone: mozilla47

People

(Reporter: armenzg, Assigned: armenzg)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The wpt-3 e10s job always times out on TC/docker regardless of how long you give it (now timing out at 120 minutes) [1][2]. It seems that if I bumped it beyond 120 minutes it might finish in time. Buildbot has that same chunk running hidden [3].

jgraham, do you want us to try chunking this job further, or try it on an m3.xlarge instance? Eventually we want to disable the Buildbot jobs and use the ones running on TaskCluster.

[1] https://public-artifacts.taskcluster.net/FVFCzYC3SCWrdYZ-D1suhA/1/public/logs/live_backing.log
[2] https://treeherder.mozilla.org/#/jobs?repo=try&author=armenzg@mozilla.com&filter-searchStr=web-platform-tests%203%29&selectedJob=15573412
[3] https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=linux%20x64%20debug%20web-platform-tests-e10s&group_state=expanded
    https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=linux%20x64%20debug%20web-platform-tests-e10s&group_state=expanded&exclusion_profile=false

[4]
16:41:01 INFO - PROCESS | 1207 | JavaScript error: executormarionette.py, line 33: Error: Permission denied to access property "timeout"
16:41:05 INFO - TEST-UNEXPECTED-TIMEOUT | /html/dom/reflection-embedded.html | expected OK

[5]
17:56:16 CRITICAL - Loading initial page http://web-platform.test:8000/testharness_runner.html failed. Ensure that the there are no other programs bound to this port and that your firewall rules or network setup does not prevent access.
17:56:16 CRITICAL - Traceback (most recent call last):
17:56:16 CRITICAL - File "/home/worker/workspace/build/tests/web-platform/harness/wptrunner/executors/executormarionette.py", line 124, in load_runner
17:56:16 CRITICAL - self.marionette.navigate(url)
17:56:16 CRITICAL - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1505, in navigate
17:56:16 CRITICAL - self._send_message("get", {"url": url})
17:56:16 CRITICAL - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 36, in _
17:56:16 CRITICAL - return func(*args, **kwargs)
17:56:16 CRITICAL - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 748, in _send_message
17:56:16 CRITICAL - self._handle_error(err)
17:56:16 CRITICAL - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 809, in _handle_error
17:56:16 CRITICAL - raise errors.lookup(error)(message, stacktrace=stacktrace)
17:56:16 CRITICAL - UnknownException: UnknownException: Error loading page
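For whoever looks at [5]: that error is ambiguous between "wptserve never came up (or something else grabbed port 8000)" and "the browser could not navigate". A minimal standalone check along the following lines, run inside the test container while the job is wedged, would tell those apart. The hostname and port come from the log above; everything else is a hypothetical diagnostic, not part of the harness.

# Hypothetical diagnostic, not part of the harness: check whether anything is
# accepting connections on the address the harness tries to load.
import socket
import sys

HOST, PORT = "web-platform.test", 8000  # taken from the failing URL in [5]

try:
    conn = socket.create_connection((HOST, PORT), timeout=10)
    conn.close()
    print("%s:%d is accepting connections; the server side looks alive" % (HOST, PORT))
except (socket.timeout, socket.error) as exc:
    print("cannot connect to %s:%d -> %r" % (HOST, PORT, exc))
    sys.exit(1)

If the connection succeeds but Marionette still raises UnknownException on navigate, the problem is on the browser side rather than the server side.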
This changed in the last 2 weeks from <50 minutes to >90 minutes - we should identify the root cause of that.
There are generally unexplained problems with W3 on Linux (see https://bugzilla.mozilla.org/show_bug.cgi?id=1238435#c26). Let's revisit after that is fixed.
Depends on: 1238435
W3 seems to have become much worse in runtime, looking at this range:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=Linux%20x64%20debug%20W3C%20Web%20Platform%20Tests%20W3C%20Web%20Platform%20Tests%20W%283%29&group_state=expanded&fromchange=c33f30666b37&tochange=6020a4cb41a7

Judging by the running jobs, I suspect our culprit lies in this set of changes:
https://hg.mozilla.org/mozilla-central/pushloghtml?changeset=359f86fecbc2

and we have new tests which showed up there:
https://hg.mozilla.org/mozilla-central/rev/31a86d5e5ffa

I am working on filling in the gaps on inbound to prove that:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-str=web&filter-searchStr=Linux%20x64%20debug%20W3C%20Web%20Platform%20Tests%20W3C%20Web%20Platform%20Tests%20W%283%29&tochange=31a86d5e5ffa&fromchange=cd6b93ff2af7

If we added a handful of tests and our runtime went from ~45 to ~80 minutes, that is suspect: are these tests necessary, are there bugs, can we optimize, is this across all platforms, etc.?

At the very least we should chunk more, which we can do in TaskCluster land; I am not sure about available builders in Buildbot land (linux64 is close to full).
Adding bkelly and yury as they landed/reviewed the new wpt tests. I know there are other issues with the stability of the wpt(3) tests; here I care about runtime :)
jmaher: Yeah, those tests are already implicated. But the mechanism is a mystery because they don't actually run in W3. Which suggests chunking changes. But why that would cause the browser to be unable to load pages I don't know. Possibly something is crashing the web server or similar, but I need to actually reproduce the issue to be sure.
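As an aside, here is why adding tests can move unrelated tests between chunks even though the new tests never run in W3. This is a toy sketch of fixed-size chunking with made-up test names; the real harness balances chunks differently (e.g. by expected runtime), but the reshuffling effect is the same.

# Toy fixed-size chunking: sort the manifest, cut it into `total_chunks` slices.
# NOT wptrunner's real algorithm, just an illustration of the effect.
def chunk_index(position, n_tests, total_chunks):
    per_chunk = -(-n_tests // total_chunks)  # ceiling division
    return position // per_chunk

before = ["test-%03d.html" % i for i in range(80)]                      # made-up names
after = sorted(before + ["reflection-%d.html" % i for i in range(6)])   # 6 new tests land

moved = sum(1 for t in before
            if chunk_index(before.index(t), len(before), 8)
            != chunk_index(after.index(t), len(after), 8))
print("%d of %d pre-existing tests landed in a different chunk" % (moved, len(before)))

With those numbers, 16 of the 80 pre-existing tests end up in a different chunk even though only 6 tests were added, which is enough to change what runs alongside what in W3.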
(In reply to James Graham [:jgraham] from comment #5)
> jmaher: Yeah, those tests are already implicated. But the mechanism is a
> mystery because they don't actually run in W3. Which suggests chunking
> changes. But why that would cause the browser to be unable to load pages I
> don't know. Possibly something is crashing the web server or similar, but I
> need to actually reproduce the issue to be sure.

As noted at https://bugzilla.mozilla.org/show_bug.cgi?id=1238435#c11, the reflection wpt tests are crashing on e10s. The re-chunking moved the reflection tests into W3 for linux64 after the new tests were added. It's hard to tell which chunk the reflection tests were in before, or which chunks they are in on other platforms (e.g. see the raw logs of W2 on Mac OS X: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&selectedJob=20210378). But the reflection tests definitely produce crashes on linux64 e10s due to DNS host name resolution.
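If the crashes really come down to host name resolution, a quick check like this in the failing environment would show whether the web-platform.test aliases resolve at all. The harness conventionally maps these names to 127.0.0.1 via /etc/hosts; the exact subdomain list is an assumption here and may differ between harness versions.

# Hypothetical standalone check: do the wpt hostnames resolve locally?
import socket

for host in ("web-platform.test",
             "www.web-platform.test",
             "www1.web-platform.test",
             "www2.web-platform.test"):
    try:
        print("%s -> %s" % (host, socket.gethostbyname(host)))
    except socket.gaierror as exc:
        print("%s failed to resolve: %s" % (host, exc))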
The runtime difference shows up in both e10s and non-e10s.
Is disabling the test(s) causing the crash a possibility? At least we would get results from other tests until the crash is ironed out.
(In reply to Armen Zambrano [:armenzg] - Engineering productivity from comment #8)
> Is disabling the test(s) causing the crash a possibility?
> At least we would get results from other tests until the crash is ironed out.

This solution was r- at https://bugzilla.mozilla.org/show_bug.cgi?id=1238435#c16
We run an entirely different set of tests now in chunk 3 (keep in mind this is debug chunk 3, where we have 8 chunks). I think this bug is not useful other than for splitting wpt into more chunks; the crashes/timeouts should be solved in bug 1238435.
Has anything changed? It seems that going from 8 to 12 chunks has cleared this:
https://treeherder.mozilla.org/#/jobs?repo=try&author=armenzg@mozilla.com&filter-searchStr=web-platform-tests&group_state=expanded

It seems that the lengthy tests are now running across chunks 4, 5 and 6 (all between 40 and 50 minutes).
If chunking exposed this bug in the first place, it's not too surprising if rechunking hides it again. I certainly don't object to increasing the number of chunks in general; do you want to do that on buildbot and TC?
I don't know if we will be able to do it for Buildbot, as the number of builders there is already pretty close to the limit. We can increase the TC chunking and at least have that visible.
Good luck with working around it by changing your chunk numbers - the bug 1242153 wpt update moved things around enough to get whatever two tests it is that don't like being in the same chunk into your chunk 4, both e10s and not-e10s, so I added them to the exclusion hiding buildbot's e10s-3.
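For whoever ends up digging into this in bug 1238435: one way to pin down which earlier test breaks a given test inside a chunk is to hold the failing test fixed and bisect over the tests that run before it. A rough sketch; run_chunk is a hypothetical callable that would invoke the real harness on a subset and report whether it was green (the fake below only simulates that), and bad-neighbour.html is a made-up name.

def find_trigger(preceding, failing_test, run_chunk):
    # Bisect the tests that run before `failing_test` to find which single
    # predecessor makes it fail. Assumes one trigger and a deterministic failure.
    tests = list(preceding)
    while len(tests) > 1:
        half = len(tests) // 2
        if not run_chunk(tests[:half] + [failing_test]):
            tests = tests[:half]   # trigger is in the first half
        else:
            tests = tests[half:]   # trigger is in the second half
    return tests[0]

# Fake harness for demonstration only: pretend reflection-embedded.html (seen
# timing out in [4] above) fails whenever the made-up bad-neighbour.html ran
# earlier in the same chunk.
def fake_run_chunk(tests):
    return not ("bad-neighbour.html" in tests[:-1]
                and tests[-1] == "/html/dom/reflection-embedded.html")

preceding = ["t%02d.html" % i for i in range(16)]
preceding.insert(9, "bad-neighbour.html")
print(find_trigger(preceding, "/html/dom/reflection-embedded.html", fake_run_chunk))

Each round halves the candidate list, so even a few hundred preceding tests only need around ten harness runs.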
Assignee: nobody → armenzg
Summary: wpt-3 e10s always takes as long as the max runtime allows it to → Bump timeout for wpt tests (wpt4 times out)
Going back from 12 chunks to 8 chunks and increasing the timeout does not necessarily improve matters (AFAIK):
https://treeherder.mozilla.org/#/jobs?repo=try&revision=788f6950a3bd

I will have to see whether the tests run on Buildbot or not and compare with the TC jobs.
Summary: Bump timeout for wpt tests (wpt4 times out) → Wpt5 fails on TC but not on Buildbot
Having the same chunks will hopefully make comparing Buildbot and TaskCluster easier (it might fix the problem).
Attachment #8715849 - Flags: review?(jmaher)
Comment on attachment 8715849 [details] [diff] [review]
wpt test jobs from 12 to 8 chunks to match Buildbot

Review of attachment 8715849 [details] [diff] [review]:
-----------------------------------------------------------------

I would like to go back to 12 chunks as soon as we are on taskcluster only.
Attachment #8715849 - Flags: review?(jmaher) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/6391ae6db0ff732c8affd9143e8d080c22d1c4c6
Bug 1241297 - Bump timeout for TC Linux64 wpt tests and go from 12 chunks to 8 chunks. DONTBUILD. r=jmaher
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla47
Going back to 8 chunks has cleared the issue.

(In reply to Joel Maher (:jmaher) from comment #22)
> I would like to go back to 12 chunks as soon as we are on taskcluster only.

Someone will have to work on bug 1238435 before we can go to any different chunking.