Closed Bug 1345904 (Opened 8 years ago, Closed 8 years ago)

Frequent failures in Windows 7 VM PGO tests with "Automation Error: Received unexpected exception while running application"

Categories

(Core :: General, defect)

Version: 54 Branch
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED FIXED
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- fixed
firefox55 + fixed

People

(Reporter: cbook, Unassigned)

References


Details

As in https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=windows%207%20vm%20pgo&fromchange=d9d91d6e389ef19bd48eb5153c9e32a122a1eecc&selectedJob=82019761, the browser-chrome/mochitest jobs fail a lot but go green on retrigger; this seems to have started on the last merge day.

From the logs (https://treeherder.mozilla.org/logviewer.html#?job_id=82790773&repo=mozilla-aurora&lineNumber=2859):

7:18:07 INFO - TEST-INFO | started process Main app process
07:20:08 INFO - Traceback (most recent call last):
07:20:08 INFO - File "c:\slave\test\build\tests\mochitest\runtests.py", line 2482, in doTests
07:20:08 INFO - marionette_args=marionette_args,
07:20:08 INFO - File "c:\slave\test\build\tests\mochitest\runtests.py", line 2086, in runApp
07:20:08 INFO - self.marionette.start_session(timeout=port_timeout)
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
07:20:08 INFO - m._handle_socket_failure()
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
07:20:08 INFO - return func(*args, **kwargs)
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1284, in start_session
07:20:08 INFO - self.protocol, _ = self.client.connect()
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
07:20:08 INFO - self.sock.connect((self.addr, self.port))
07:20:08 INFO - File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
07:20:08 INFO - return getattr(self._sock,name)(*args)
07:20:08 INFO - error: [Errno 10061] No connection could be made because the target machine actively refused it
07:20:08 ERROR - Automation Error: Received unexpected exception while running application
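For orientation, here is a minimal Python sketch of the pattern the traceback shows: start_session() is wrapped by a decorator in decorators.py that catches socket failures and routes them through _handle_socket_failure() before the original error surfaces. This is not the real marionette_driver code; the decorator name, the FakeMarionette class, and the default port (2828 is Marionette's usual default) are illustrative assumptions.

import functools
import socket

def handle_socket_failures(func):   # hypothetical name; the log only shows "decorators.py ... in _"
    @functools.wraps(func)
    def _(client, *args, **kwargs):
        try:
            return func(client, *args, **kwargs)
        except socket.error:
            # Mirrors decorators.py line 28 in the log: let the client react,
            # then re-raise so the harness reports the original [Errno 10061].
            client._handle_socket_failure()
            raise
    return _

class FakeMarionette(object):       # stand-in for marionette_driver.marionette.Marionette
    def __init__(self, host="127.0.0.1", port=2828):
        self.host, self.port = host, port

    def _handle_socket_failure(self):
        print("socket failure while connecting to %s:%d" % (self.host, self.port))

    @handle_socket_failures
    def start_session(self, timeout=60.0):
        # Bottom of the real stack: a plain TCP connect that Windows refuses
        # with Errno 10061 when nothing is listening on the Marionette port,
        # i.e. the browser never got far enough to start the Marionette server.
        sock = socket.create_connection((self.host, self.port), timeout=timeout)
        sock.close()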
Have you tried bisecting this on Try with uplift simulation patches?
Flags: needinfo?(cbook)
(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)
> Have you tried bisecting this on Try with uplift simulation patches?

At first glance, I would say this happened between February 15 and 27, based on the simulations.
Flags: needinfo?(cbook)
Component: General Automation → General
Flags: needinfo?(dburns)
Product: Release Engineering → Core
QA Contact: catlee
Version: unspecified → 54 Branch
Bisect builds are underway. From a look back at the uplift simulations, https://mzl.la/2lVzN9I seems to be the regression range. Joel, can you take a look? Maybe something stands out to you as a likely cause (besides the bisect Try runs that are already going).
Flags: needinfo?(jmaher)
I looked back at the simulation runs and discovered that, while this problem looks like it only affects Windows 7 VM PGO, it also shows up on Windows 7 VM opt (which it seems we don't run by default on production trees), while the normal (non-VM) Windows 7 opt runs don't show this failure at the same time. I guess that if we also ran e10s mochitests on non-VM machines for Windows 7, we would see the same thing. arr, markco, grenade: do you know of a fundamental difference between Windows 7 VM opt (and PGO) and the non-VM machines that could explain this?

01:24:25 INFO - chunk = self.sock.recv(bytes_to_recv)
01:24:25 INFO - error: [Errno 10054] An existing connection was forcibly closed by the remote host
01:24:25 ERROR - Automation Error: Received unexpected exception while running application

(from https://treeherder.mozilla.org/logviewer.html#?job_id=81827388&repo=try&lineNumber=2838)
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(arich)
The only difference in the environment setup between the VM and non-VM instances should be the presence of GPU drivers and direct access to the GPU (rather than through the VM layer) on the non-VM instances.
Flags: needinfo?(rthijssen)
PGO, opt, and debug all run on the same machines; we don't configure different instances for PGO. Please ping me again when the range is narrowed down to a day or two; looking over thousands of changesets is not very useful.
Flags: needinfo?(jmaher)
The difference in errors between comment 0 and comment 7 suggests there are potentially two problems. In comment 7 Marionette started, whereas in comment 0 it never did. This suggests it could be a startup crash happening at different times, but further bisection might help find the cause.
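To make that distinction concrete, here is a hedged sketch (plain Python, not harness code; the host/port defaults and the probe itself are assumptions) of what the two error codes imply: Errno 10061 in comment 0 means the connect itself was refused because nothing was listening on the Marionette port, while Errno 10054 in comment 7 means the server accepted the connection and then went away.

import socket

WSAECONNREFUSED = 10061  # "No connection could be made ... actively refused it" (comment 0)
WSAECONNRESET = 10054    # "An existing connection was forcibly closed by the remote host" (comment 7)

def classify_marionette_failure(host="127.0.0.1", port=2828, timeout=5.0):
    # Stage 1: can we connect at all? If not, the Marionette server never started listening.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.error as e:
        if e.errno == WSAECONNREFUSED:
            return "refused: server never listened (crash before Marionette init?)"
        raise
    # Stage 2: we connected, but does the peer stay alive long enough to talk?
    try:
        sock.recv(4096)
    except socket.error as e:
        if e.errno == WSAECONNRESET:
            return "reset: server started, then died while the session was being set up"
        raise
    finally:
        sock.close()
    return "socket looks healthy"

Either way the browser process is going away; the difference is just how far it gets before dying, which is consistent with the startup-crash-at-different-times theory above.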
Flags: needinfo?(dburns)
Flags: needinfo?(arich)
Flags: needinfo?(mcornmesser)
Spent some time on bisect builds and Try builds:

Uplift simulations [1] on Feb 27 were already showing the error we see now, as in https://treeherder.mozilla.org/logviewer.html#?job_id=80447625&repo=try, but a Try run with a later revision shows it is fairly intermittent, 2 failures in 230 tests -> https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c69d44946b98040c5f9c23cb1e175bb42548d5e

A Try run with the current Aurora tree also shows some of the failures, but a lot of green runs for the same test as well -> https://treeherder.mozilla.org/#/jobs?repo=try&revision=92df3032b2e9633b2b0c5e3861378704f46daa81

And the latest Aurora-as-Beta uplift simulation doesn't show this error at all: https://treeherder.mozilla.org/#/jobs?repo=try&revision=68f4d254bc812ab4003b31e67be4b265f0fcf30f (ignore the other failures; only Windows 7 VM opt is interesting here :)

So this could mean we either have a startup crash somewhere that went unnoticed, or something is not starting up as fast as we expect?

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=4319e10f0da544b50835676761b0289fafda3404&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception
I discussed comment 12 with Tomcat and we sorted out some issues in the bisection procedure. Taking another stab at it, bug 1312883 is beginning to look like a plausible candidate for this. A Try push with it backed out from Aurora is looking green: https://treeherder.mozilla.org/#/jobs?repo=try&revision=123fa634bd19692571790b5cc0803bb14d376753

Mike, any ideas what might be going on here?
Flags: needinfo?(mconley)
I think the Try run in comment 13 is conclusive enough to officially claim causality. I pinged Mike about it on IRC and he asked for a day to look into it before formally doing the backout, which sounds reasonable enough to me.
Blocks: 1312883
(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)
> I discussed comment 12 with Tomcat and we sorted out some issues in the
> bisection procedure. Taking another stab at it, bug 1312883 is beginning to
> look like a plausible candidate for this. A Try push with it backed out
> from Aurora is looking green:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=123fa634bd19692571790b5cc0803bb14d376753
>
> Mike, any ideas what might be going on here?

Yeah, I agree with Ryan: the latest Try run with the backout is entirely green, so backing it out should fix this de facto tree-closure regression.
So I've been looking at this today, and I'm having trouble getting actionable data out of this. Do we know for certain that the parent process is crashing? If so, where are the stacks? How do I recreate this scenario locally so I can debug it? Who can answer these questions?
Flags: needinfo?(mconley) → needinfo?(ryanvm)
To close the loop from comment 17, Mike's been working with people on IRC and IRL to debug this. In the meantime, bug 1312883 was backed out from Aurora and the tests are all greening up. Leaving this bug open for the investigation so we don't get burned by this again when 55 goes to Aurora next month.
Flags: needinfo?(ryanvm)
Update: Setting --enable-profiling didn't help, and RyanVM was also able to reproduce in a non-PGO build. The backout has occurred, which means I've bought a month to figure this out. Joy! I'll file a bug to request a Win 7 VM loaner.
From the pushes I did on Try today, I learned a couple of things:
* While the failures *can* happen with regular opt builds without PGO enabled, PGO makes the failures a lot more frequent.
* --enable-profiling is a red herring; adding it back to the configs made no difference in frequency.
* Changing the version number back to 54.0a1 (and in effect getting NIGHTLY_BUILD set again) makes the failures go away.
[Tracking Requested - why for this release]: needs a fix before 55 hits Aurora on 2017-04-17, or we'll be right back in the same boat again.
Tracking 55+ based on the reason in Comment 23.
The patch that mystor posted to Try [1] gets rid of a deadlock that might be responsible for this problem. When that lands, I'm going to re-apply bug 1312883 plus that patch on my Aurora tip, push to Try, and see whether this addresses the issue.

[1]: https://hg.mozilla.org/try/rev/b613b42a25bad5765af95b03e9b0ef8bc7f9d44c
Flags: needinfo?(mconley)
Bug 1346415 has now hit central. Can this be marked as fixed in 55?
Flags: needinfo?(mconley)
Yeah, I think so. Let's re-open if uplift simulations show the bug rearing its head again.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mconley)
Resolution: --- → FIXED