Closed
Bug 1345904
Opened 8 years ago
Closed 8 years ago
Frequent failures in windows 7 vm pgo tests with Automation Error: Received unexpected exception while running application
Categories
(Core :: General, defect)
Tracking
RESOLVED
FIXED
 | Tracking | Status
---|---|---
firefox-esr52 | --- | unaffected
firefox53 | --- | unaffected
firefox54 | --- | fixed
firefox55 | + | fixed
People
(Reporter: cbook, Unassigned)
References
Details
For example: https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&filter-searchStr=windows%207%20vm%20pgo&fromchange=d9d91d6e389ef19bd48eb5153c9e32a122a1eecc&selectedJob=82019761
Browser-chrome mochitests seem to fail a lot and go green on retrigger; this seems to have started at the last merge day.
From the logs:
https://treeherder.mozilla.org/logviewer.html#?job_id=82790773&repo=mozilla-aurora&lineNumber=2859
07:18:07 INFO - TEST-INFO | started process Main app process
07:20:08 INFO - Traceback (most recent call last):
07:20:08 INFO - File "c:\slave\test\build\tests\mochitest\runtests.py", line 2482, in doTests
07:20:08 INFO - marionette_args=marionette_args,
07:20:08 INFO - File "c:\slave\test\build\tests\mochitest\runtests.py", line 2086, in runApp
07:20:08 INFO - self.marionette.start_session(timeout=port_timeout)
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 28, in _
07:20:08 INFO - m._handle_socket_failure()
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\decorators.py", line 23, in _
07:20:08 INFO - return func(*args, **kwargs)
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1284, in start_session
07:20:08 INFO - self.protocol, _ = self.client.connect()
07:20:08 INFO - File "c:\slave\test\build\venv\lib\site-packages\marionette_driver\transport.py", line 223, in connect
07:20:08 INFO - self.sock.connect((self.addr, self.port))
07:20:08 INFO - File "c:\mozilla-build\python27\Lib\socket.py", line 224, in meth
07:20:08 INFO - return getattr(self._sock,name)(*args)
07:20:08 INFO - error: [Errno 10061] No connection could be made because the target machine actively refused it
07:20:08 ERROR - Automation Error: Received unexpected exception while running application
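Note on the error above: Errno 10061 is the Windows equivalent of ECONNREFUSED, i.e. nothing was listening on the Marionette port when the harness called connect(), which usually means the browser never got far enough to open it. Below is a minimal, hypothetical sketch of that kind of port polling; the 127.0.0.1:2828 host/port and the timeouts are assumptions for illustration, not values taken from this log.

    # Hypothetical sketch, not the actual harness code: poll a TCP port until
    # something is accepting connections. Errno 10061 / ECONNREFUSED means the
    # connect() below fails because no server socket is listening yet.
    import socket
    import time

    def wait_for_port(host="127.0.0.1", port=2828, timeout=120.0):
        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                with socket.create_connection((host, port), timeout=5.0):
                    return True  # a server is listening and accepted us
            except OSError:
                time.sleep(1.0)  # refused or timed out; retry until deadline
        return False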
Comment 1•8 years ago
Have you tried bisecting this on Try with uplift simulation patches?
Flags: needinfo?(cbook)
Reporter
Comment 2•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #1)
> Have you tried bisecting this on Try with uplift simulation patches?
At first glance I would say this happened between February 15th and 27th, based on the simulations.
Flags: needinfo?(cbook)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Updated•8 years ago
Component: General Automation → General
Flags: needinfo?(dburns)
Product: Release Engineering → Core
QA Contact: catlee
Version: unspecified → 54 Branch
Reporter
Comment 6•8 years ago
Bisect builds are underway.
From a look back at the uplift simulations, https://mzl.la/2lVzN9I seems to be the regression range. Joel, can you take a look? Maybe something stands out to you as the likely cause (besides the bisect Try runs that are already in flight).
Flags: needinfo?(jmaher)
Reporter
Comment 7•8 years ago
I looked back at the simulation runs and discovered that this problem, while it appears to be limited to Windows 7 VM PGO, also shows up on Windows 7 VM opt (which it seems we don't run by default on production trees), while the normal (non-VM) Windows 7 opt runs don't have this failure at the same time. I suspect that if we ran e10s mochitests on non-VM Windows 7 machines too, we would see the same thing.
arr, markco, grenade: do you know if there is a fundamental difference between the Windows 7 VM opt (and PGO) machines and the non-VM ones that could explain this:
01:24:25 INFO - chunk = self.sock.recv(bytes_to_recv)
01:24:25 INFO - error: [Errno 10054] An existing connection was forcibly closed by the remote host
01:24:25 ERROR - Automation Error: Received unexpected exception while running application
(from https://treeherder.mozilla.org/logviewer.html#?job_id=81827388&repo=try&lineNumber=2838) ?
Flags: needinfo?(rthijssen)
Flags: needinfo?(mcornmesser)
Flags: needinfo?(arich)
Comment 8•8 years ago
The only difference in the environment setup between VM and non-VM should be the presence of GPU drivers and access to the GPU on the non-VM instances.
Flags: needinfo?(rthijssen)
Comment 9•8 years ago
PGO vs. opt vs. debug use the same machines, and we don't configure different instances for PGO. Please ping me again when the range is narrowed down to a day or two; looking over thousands of changesets is not very useful.
Flags: needinfo?(jmaher)
Comment 10•8 years ago
The difference in errors between comment 0 and comment 7 suggests potentially two problems. In comment 7, Marionette had started, whereas in comment 0 it never did. This suggests it could be a startup crash happening at different times, but further bisection might help find the cause.
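As an aside, for readers mapping the two error codes, here is a small, hypothetical illustration (assuming a Windows Python, where the errno module exposes the WinSock constants):

    # Hypothetical illustration of the two failure modes discussed above:
    # 10061 (WSAECONNREFUSED): connect() was refused - nothing was listening,
    #   i.e. Marionette never came up (the comment 0 traceback).
    # 10054 (WSAECONNRESET): recv() was reset - the connection existed and the
    #   peer went away mid-session, i.e. the browser died after startup
    #   (the comment 7 snippet).
    import errno
    import sys

    if sys.platform == "win32":
        assert errno.WSAECONNREFUSED == 10061
        assert errno.WSAECONNRESET == 10054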
Flags: needinfo?(dburns)
Updated•8 years ago
Flags: needinfo?(arich)
Updated•8 years ago
Flags: needinfo?(mcornmesser)
Comment hidden (Intermittent Failures Robot)
Reporter
Comment 12•8 years ago
Spent some time on bisect builds and Try builds:
The uplift simulations [1] on Feb 27th were already showing the error we see now, e.g. in https://treeherder.mozilla.org/logviewer.html#?job_id=80447625&repo=try, but a Try run with a later rev shows this is fairly intermittent:
2 failures in 230 tests -> https://treeherder.mozilla.org/#/jobs?repo=try&revision=8c69d44946b98040c5f9c23cb1e175bb42548d5e
A Try run with the current Aurora tree also shows some of the failures, but also a lot of green runs for the same tests
-> https://treeherder.mozilla.org/#/jobs?repo=try&revision=92df3032b2e9633b2b0c5e3861378704f46daa81
And the latest Aurora-as-Beta uplift simulation doesn't show this error at all in https://treeherder.mozilla.org/#/jobs?repo=try&revision=68f4d254bc812ab4003b31e67be4b265f0fcf30f (ignore the other failures; only Windows 7 VM opt is interesting here :)
So this could mean we either have a startup crash somewhere that previously went unnoticed, or something is not starting up as fast as we expect?
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=4319e10f0da544b50835676761b0289fafda3404&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception
Comment 13•8 years ago
I discussed comment 12 with Tomcat and we sorted out some issues in the bisection procedure. Taking another stab at it, bug 1312883 is beginning to look like a plausible candidate for this. The Try push for it backed out from Aurora is looking green:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=123fa634bd19692571790b5cc0803bb14d376753
Mike, any ideas what might be going on here?
Flags: needinfo?(mconley)
Comment 14•8 years ago
I think the Try run in comment 13 is conclusive enough to officially claim causality. I pinged Mike about it on IRC and he asked for a day to look into it before formally doing the backout, which sounds reasonable enough to me.
Blocks: 1312883
Comment hidden (Intermittent Failures Robot)
Reporter
Comment 16•8 years ago
(In reply to Ryan VanderMeulen [:RyanVM] from comment #13)
> I discussed comment 12 with Tomcat and we sorted out some issues in the
> bisection procedure. Taking another stab at it, bug 1312883 is beginning to
> look like a plausible candidate for this. The Try push for it backed out
> from Aurora is looking green:
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=123fa634bd19692571790b5cc0803bb14d376753
>
> Mike, any ideas what might be going on here?
Yeah, I agree with Ryan: the latest Try run with the backout is entirely green, so this should fix this de facto tree-closure bug/regression.
Comment 17•8 years ago
So I've been looking at this today, and I'm having trouble getting actionable data out of this.
Do we know for certain that the parent process is crashing? If so, where are the stacks? How do I recreate this scenario locally so I can debug it? Who can answer these questions?
Flags: needinfo?(mconley) → needinfo?(ryanvm)
Comment 18•8 years ago
To close the loop from comment 17, Mike's been working with people on IRC and IRL to debug this. In the meantime, bug 1312883 was backed out from Aurora and the tests are all greening up. Leaving this bug open for the investigation so we don't get burned by this again when 55 goes to Aurora next month.
status-firefox53: --- → unaffected
status-firefox54: --- → fixed
status-firefox55: --- → affected
status-firefox-esr52: --- → unaffected
Flags: needinfo?(ryanvm)
Comment 19•8 years ago
Update:
Setting --enable-profiling didn't help, and RyanVM was also able to reproduce in a non-PGO build.
The backout has occurred, which means I've bought a month to figure this out. Joy!
I'll file a bug to request a Win 7 VM loaner.
Comment 20•8 years ago
From the pushes I did on Try today, I learned a couple things:
* While the failures *can* happen with regular opt builds without PGO enabled, PGO makes the failures a lot more frequent.
* --enable-profiling is a red herring, adding it back to the configs made no difference in frequency.
* Changing the version number back to 54.0a1 (and in effect getting NIGHTLY_BUILD set again) makes the failures go away.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 23•8 years ago
[Tracking Requested - why for this release]: needs a fix before 55 hits aurora on 2017-04-17, or we'll be right back in the same boat again.
tracking-firefox55: --- → ?
Comment 25•8 years ago
The patch that mystor posted to try here[1] gets rid of a deadlock that might be responsible for this problem.
When that lands, I'm going to try to re-apply bug 1312883 and that patch to my aurora tip and push to try, and see if this addresses the issue.
[1]: https://hg.mozilla.org/try/rev/b613b42a25bad5765af95b03e9b0ef8bc7f9d44c
Comment 26•8 years ago
Comment 27•8 years ago
Updated•8 years ago
Flags: needinfo?(mconley)
Comment 28•8 years ago
|
||
I believe this Try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=85b53054fdb3887713bb5ad0e5ea928092336ea8&selectedJob=90647115
suggests that bug 1346415 will fix this.
Depends on: 1346415
Flags: needinfo?(mconley)
Comment 29•8 years ago
Bug 1346415 has now hit central. Can this be marked as fixed in 55?
Flags: needinfo?(mconley)
Comment 30•8 years ago
Yeah, I think so. Let's re-open if uplift simulations show the bug rearing its head again.
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(mconley)
Resolution: --- → FIXED