Closed
Bug 1261598
Opened 9 years ago
Closed 7 years ago
Intermittent Automation Error: Received unexpected exception while running application | after error: [Errno 111] Connection refused
Categories
(Testing :: Marionette Client and Harness, defect)
Testing
Marionette Client and Harness
Tracking
(firefox57 fixed, firefox58 fixed)
RESOLVED
FIXED
mozilla58
People
(Reporter: aryx, Assigned: whimboo)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:other])
Attachments
(1 file)
(deleted),
text/x-review-board-request
|
Details |
https://treeherder.mozilla.org/logviewer.html#?job_id=25030028&repo=mozilla-inbound
00:20:46 INFO - MochitestServer : launching [u'/home/worker/workspace/build/tests/bin/xpcshell', '-g', '/home/worker/workspace/build/application/firefox', '-v', '170', '-f', '/home/worker/workspace/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpYwE2t1.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/home/worker/workspace/build/tests/mochitest/server.js']
00:20:46 INFO - runtests.py | Server pid: 1145
00:20:46 INFO - runtests.py | Websocket server pid: 1148
00:20:46 INFO - runtests.py | SSL tunnel pid: 1152
00:20:47 INFO - runtests.py | Running tests: start.
00:20:47 INFO - runtests.py | Application pid: 1173
00:20:47 INFO - TEST-INFO | started process Main app process
00:20:47 INFO - ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpYwE2t1.mozrunner/runtests_leaks.log
00:20:58 INFO - ++DOCSHELL 0x7f254ede3800 == 1 [pid = 1173] [id = 1]
00:20:58 INFO - ++DOMWINDOW == 1 (0x7f254ede4000) [pid = 1173] [serial = 1] [outer = (nil)]
00:20:58 INFO - [1173] WARNING: No inner window available!: file /home/worker/workspace/build/src/dom/base/nsGlobalWindow.cpp, line 9773
00:20:58 INFO - ++DOMWINDOW == 2 (0x7f254ede5000) [pid = 1173] [serial = 2] [outer = 0x7f254ede4000]
00:21:48 INFO - Traceback (most recent call last):
00:21:48 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2357, in doTests
00:21:48 INFO - marionette_args=marionette_args,
00:21:48 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 1984, in runApp
00:21:48 INFO - self.marionette.start_session(timeout=port_timeout)
00:21:48 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1166, in start_session
00:21:48 INFO - self.protocol, _ = self.client.connect()
00:21:48 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 226, in connect
00:21:48 INFO - self.sock.connect((self.addr, self.port))
00:21:48 INFO - File "/usr/lib/python2.7/socket.py", line 224, in meth
00:21:49 INFO - return getattr(self._sock,name)(*args)
00:21:49 INFO - error: [Errno 111] Connection refused
00:21:49 INFO - 0 ERROR Automation Error: Received unexpected exception while running application
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 8•8 years ago
|
||
There's been a significant spike in frequency today, and higher-than-normal frequency for this bug for the last couple of days. It seems like a lot of the recent failures are linux64-asan and gpu or mochitest-gl.
:whimboo -- Do you know what's going on?
Flags: needinfo?(hskupin)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 11•8 years ago
|
||
A pattern I see that this is all mochitest only, right? And all the time when it happens we run some tests and then the Mochitest server seems to start a new Firefox process:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=74110423&lineNumber=1840
I don't know those suites to give a clear answer, but maybe this is a shutdown/startup issue? It would be great to see a test job with gecko logging turned on (if possible), if the failure rate stays that high.
Flags: needinfo?(hskupin)
Updated•8 years ago
|
Comment 12•8 years ago
|
||
:gbrown is this something you could look at in the next week or two?
Flags: needinfo?(gbrown)
Comment 13•8 years ago
|
||
note, this pattern changed around Feb 1st and our rate of hitting this bug increased.
Assignee | ||
Comment 14•8 years ago
|
||
Isn't that bug 1318951 what Geoff added a day ago under "see also"?
Comment 15•8 years ago
|
||
I am not clear if that bug is really misstars of this bug or vice versa. I am under the impression this bug is different from the other and needs to be fixed by someone on our team.
Comment hidden (Intermittent Failures Robot) |
Comment 18•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #11)
> I don't know those suites to give a clear answer, but maybe this is a
> shutdown/startup issue? It would be great to see a test job with gecko
> logging turned on (if possible), if the failure rate stays that high.
I can reproduce on try fairly easily. How do I turn on gecko logging?
Flags: needinfo?(hskupin)
Assignee | ||
Comment 19•8 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #18)
> (In reply to Henrik Skupin (:whimboo) from comment #11)
> > I don't know those suites to give a clear answer, but maybe this is a
> > shutdown/startup issue? It would be great to see a test job with gecko
> > logging turned on (if possible), if the failure rate stays that high.
>
> I can reproduce on try fairly easily. How do I turn on gecko logging?
With marionette harness itself its just "--gecko-log", but I don't know how other test harnesses are doing that. Maybe Joel knows about?
Flags: needinfo?(hskupin) → needinfo?(jmaher)
Comment 20•8 years ago
|
||
I have no idea, gbrown had asked me before asking in the bug.
Flags: needinfo?(jmaher)
Assignee | ||
Comment 21•8 years ago
|
||
Ok, so I had a look and there is an option when starting Firefox which is "--console". If the test harness you are checking here isn't setting it, please hack it into the mozrunner run_command call.
Comment 22•8 years ago
|
||
Thanks :whimboo.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=97becc65975e0567282ab2fb941971ea0cf7132d has --console and reproduces this failure. For example:
https://public-artifacts.taskcluster.net/AxKCpVtaRFGg6Suwe2anEQ/0/public/logs/live_backing.log
[task 2017-02-10T23:38:04.684283Z] 23:38:04 INFO - Application command: /home/worker/workspace/build/application/firefox/firefox --console -marionette -foreground -profile /tmp/tmp1gV_zP.mozrunner
[task 2017-02-10T23:38:04.715424Z] 23:38:04 INFO - runtests.py | Application pid: 1128
[task 2017-02-10T23:38:04.716011Z] 23:38:04 INFO - TEST-INFO | started process Main app process
[task 2017-02-10T23:40:04.754794Z] 23:40:04 INFO - Traceback (most recent call last):
[task 2017-02-10T23:40:04.755610Z] 23:40:04 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2417, in doTests
[task 2017-02-10T23:40:04.756719Z] 23:40:04 INFO - marionette_args=marionette_args,
[task 2017-02-10T23:40:04.758099Z] 23:40:04 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2020, in runApp
[task 2017-02-10T23:40:04.759219Z] 23:40:04 INFO - self.marionette.start_session(timeout=port_timeout)
[task 2017-02-10T23:40:04.760575Z] 23:40:04 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-02-10T23:40:04.762437Z] 23:40:04 INFO - m._handle_socket_failure()
[task 2017-02-10T23:40:04.763931Z] 23:40:04 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-02-10T23:40:04.765374Z] 23:40:04 INFO - return func(*args, **kwargs)
[task 2017-02-10T23:40:04.767229Z] 23:40:04 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1284, in start_session
[task 2017-02-10T23:40:04.769433Z] 23:40:04 INFO - self.protocol, _ = self.client.connect()
[task 2017-02-10T23:40:04.770252Z] 23:40:04 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-02-10T23:40:04.770992Z] 23:40:04 INFO - self.sock.connect((self.addr, self.port))
[task 2017-02-10T23:40:04.771704Z] 23:40:04 INFO - File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-02-10T23:40:04.776997Z] 23:40:04 INFO - return getattr(self._sock,name)(*args)
[task 2017-02-10T23:40:04.782340Z] 23:40:04 INFO - error: [Errno 111] Connection refused
[task 2017-02-10T23:40:04.783482Z] 23:40:04 ERROR - Automation Error: Received unexpected exception while running application
I don't think that helped.
I also added gecko_log to the Marionette arguments, but that didn't seem to work.
I also defined MOZ_LOG with some common nspr logging components ("all" doesn't work currently)...that generated some additional info in the mozLogs.zip artifact, but didn't produce anything that jumps out at me as an explanation for what's going wrong here.
Comment 23•8 years ago
|
||
I suppose Firefox may be crashing on startup; I think that would explain Connection Refused? I wonder, if runtests.py caught the failure here and continued, would it find the app process still running, would it find a crash dump? I'll look into that next.
Comment hidden (Intermittent Failures Robot) |
Comment 25•8 years ago
|
||
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 27•8 years ago
|
||
Maybe other test harnesses beside Mn do not correctly handle thrown exceptions when trying to use Marionette for installing an addon? Keep in mind that in such cases Marionette doesn't do any crash checking and such. It would be all up to the harness which using Marionette.
Comment hidden (Intermittent Failures Robot) |
Comment 29•8 years ago
|
||
Recent findings...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eea99c2033987e186e9d43fd49ff6311b44ad287
Fail: https://public-artifacts.taskcluster.net/SVI8qSOaTTClKpArCIrb5A/0/public/logs/live_backing.log
OK: https://public-artifacts.taskcluster.net/KDi2NWBzTUWoxvVcCVVYoA/0/public/logs/live_backing.log
I don't think Firefox is crashing:
[task 2017-02-13T02:47:25.051571Z] 02:47:25 ERROR - Force-terminating active process(es).
[task 2017-02-13T02:47:25.052837Z] 02:47:25 INFO - Determining child pids from psutil
[task 2017-02-13T02:47:25.059342Z] 02:47:25 INFO - Found child pids: [1959, 1981]
[task 2017-02-13T02:47:25.060637Z] 02:47:25 INFO - Killing process: 1959
[task 2017-02-13T02:47:25.061522Z] 02:47:25 INFO - TEST-INFO | started process screentopng
[task 2017-02-13T02:47:25.581131Z] 02:47:25 INFO - TEST-INFO | screentopng: exit 0
[task 2017-02-13T02:47:25.585511Z] 02:47:25 INFO - Can't trigger Breakpad, just killing process
[task 2017-02-13T02:47:25.586586Z] 02:47:25 INFO - Killing process: 1981
[task 2017-02-13T02:47:25.588030Z] 02:47:25 INFO - Not taking screenshot here: see the one that was previously logged
[task 2017-02-13T02:47:25.589271Z] 02:47:25 INFO - Can't trigger Breakpad, just killing process
[task 2017-02-13T02:47:26.699131Z] 02:47:26 INFO - psutil found pid 1959 dead
[task 2017-02-13T02:47:26.700186Z] 02:47:26 INFO - psutil found pid 1981 dead
[task 2017-02-13T02:47:26.700899Z] 02:47:26 INFO - Killing process: 1955
"Marionette enabled via command-line flag" is missing from the failed cases:
https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/testing/marionette/components/marionette.js#111
but it should be there:
[task 2017-02-13T02:41:54.105081Z] 02:41:54 INFO - Application command: /home/worker/workspace/build/application/firefox/firefox --console -marionette -foreground -profile /tmp/tmpntlzpC.mozrunner
Comment hidden (Intermittent Failures Robot) |
Comment 31•8 years ago
|
||
Startup is hanging at https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsXREDirProvider.cpp#1155:
em->Observe(nullptr, "addons-startup", nullptr);
DoStartup() is called from https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsAppRunner.cpp#4289; since that doesn't progress, the command line handlers are never called and marionette is never started.
I'll check into why the addon manager startup appears to be hanging.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 34•8 years ago
|
||
...and addon manager startup hangs due to a hang somewhere in CryptoTask.
Still making slow progress here, tracking down the source of the hang with logging and try pushes.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=bfb2e557d65cfb110b70f8b82eb259896a6ccdc1
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 39•8 years ago
|
||
:gbrown, any updates in this bug? I recall you mentioning some progress, it would be good to document it in here.
Flags: needinfo?(gbrown)
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Whiteboard: [stockwell needswork]
Comment 41•8 years ago
|
||
tl;dr - During automated test runs, browser startup sometimes hangs at https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1596; the CryptoTask that it is waiting for was called off main thread, called NS_DispatchToMainThread(), but is not called again.
The details:
- Marionette's connection is refused because the marionette server (in firefox) is not running.
- Marionette is not running because it was never started; it was not started because firefox startup has hung before processing command line arguments like '-marionette'.
- Startup is hung at https://dxr.mozilla.org/mozilla-central/rev/0eef1d5a39366059677c6d7944cfe8a97265a011/toolkit/xre/nsXREDirProvider.cpp#1155:
em->Observe(nullptr, "addons-startup", nullptr);
- Addons startup is hung at:
syncLoadManifestFromFile@resource://gre/modules/addons/XPIProvider.jsm:1598:16
processPendingFileChanges@resource://gre/modules/addons/XPIProvider.jsm:3408:19
checkForChanges@resource://gre/modules/addons/XPIProvider.jsm:3698:19
startup@resource://gre/modules/addons/XPIProvider.jsm:2802:25
callProvider@resource://gre/modules/AddonManager.jsm:272:12
_startProvider@resource://gre/modules/AddonManager.jsm:756:5
startup@resource://gre/modules/AddonManager.jsm:939:9
startup@resource://gre/modules/AddonManager.jsm:3136:5
observe@resource://gre/components/addonManager.js:65:9
Sorry, those line numbers were modified by my logging in https://treeherder.mozilla.org/#/jobs?repo=try&revision=b1ecf27fcebc97a0c7d997a3e2f06631d5c24786...so, addons startup is hung at:
https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1596
while (success === undefined)
thread.processNextEvent(true);
I can reproduce this failure on try, on linux64-asan. If I trigger a crash from the harness when the marionette connection times out, I consistently get a stack in nsAppShell::ProcessNextNativeEvent:
https://public-artifacts.taskcluster.net/VRVjeVOzSxuOGwAywnDkaA/0/public/logs/live_backing.log
#26 0x7f4954a044ab in g_main_context_iteration (/lib/x86_64-linux-gnu/libglib-2.0.so.0+0x4a4ab)
#27 0x7f4946de010e in nsAppShell::ProcessNextNativeEvent(bool) /home/worker/workspace/build/src/widget/gtk/nsAppShell.cpp:270:12
#28 0x7f4946d3d114 in DoProcessNextNativeEvent /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:138:17
#29 0x7f4946d3d114 in nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:271
#30 0x7f4946d3dbaf in non-virtual thunk to nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:233:17
#31 0x7f4940b2e7f4 in nsThread::ProcessNextEvent(bool, bool*) /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp:1220:5
That seems consistent/unsurprising.
- Meanwhile, we have this happening:
CryptoTask::Run(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/security/manager/ssl/CryptoTask.cpp#55
...
openSignedAppFileAsync(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/security/apps/AppSignatureVerification.cpp#1044
verifyZipSignedState(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1829
loadManifestFromZipReader(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1527
loadManifestFromFile(): https://dxr.mozilla.org/mozilla-central/rev/e1135c6fdc9bcd80d38f7285b269e030716dcb72/toolkit/mozapps/extensions/internal/XPIProvider.jsm#1573
In all failures, CryptoTask::Run() is called off main thread and completes the call to NS_DispatchToMainThread(), then is never called again (is not called on the main thread).
My interpretation is that CryptoTask::Run() is waiting for the main thread, but the main thread is blocked on thread.processNextEvent(true)...but I'm out of my depth and unsure of the details.
If I change to thread.processNextEvent(false), I can no longer reproduce the hang, but that doesn't seem like a sensible solution.
:mossop -- It looks like you might know this code. Can you help me understand the issues, or suggest a solution?
Flags: needinfo?(gbrown) → needinfo?(dtownsend)
Comment 42•8 years ago
|
||
I don't know the crypto code or why it might be blocked on the main thread, you probably want Dan for that.
Flags: needinfo?(dtownsend) → needinfo?(dveditz)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 46•8 years ago
|
||
no activity here in a week, checking in as this is still a very high frequency failure.
Assignee | ||
Comment 47•8 years ago
|
||
This is actually happening nearly all the time for Linux64 and I wonder if this is related to bug 1293093.
Reason is:
[task 2017-03-05T23:54:21.632367Z] 23:54:21 INFO - runtests.py | Application pid: 1642
[task 2017-03-05T23:54:21.633212Z] 23:54:21 INFO - TEST-INFO | started process Main app process
[task 2017-03-05T23:56:21.722802Z] 23:56:21 INFO - Traceback (most recent call last):
We try to start Firefox but fail after 2 minutes.
Depends on: 1345105
Comment 48•8 years ago
|
||
Comment 41 documents the issue: A deadlock involving crypto code waiting for the main thread during startup.
Bug 1345105 may help (or worsen) the behavior by changing startup timing, but will not address the issue.
I'm waiting on the needinfo to dveditz.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 50•8 years ago
|
||
I think David Keeler can also help us here.
Flags: needinfo?(dkeeler)
(In reply to Geoff Brown [:gbrown] from comment #41)
...
> In all failures, CryptoTask::Run() is called off main thread and completes
> the call to NS_DispatchToMainThread(), then is never called again (is not
> called on the main thread).
>
>
> My interpretation is that CryptoTask::Run() is waiting for the main thread,
> but the main thread is blocked on thread.processNextEvent(true)...but I'm
> out of my depth and unsure of the details.
It's not waiting for the main thread - it's waiting to run on the main thread (which processNextEvent should do).
> If I change to thread.processNextEvent(false), I can no longer reproduce the
> hang, but that doesn't seem like a sensible solution.
This turns an idle wait into a busy wait.
It seems like what's happening here is that the CryptoTask is being added to the main thread's event queue but for some reason processNextEvent doesn't process the event unless called repeatedly. Maybe running with the nsThread log set to debug would be informative?
Flags: needinfo?(dkeeler)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 55•8 years ago
|
||
I am not sure who is owning fixing this bug? :gbrown is this something you can do, or do you need someone like :keeler to fix this?
Flags: needinfo?(gbrown)
Comment 56•8 years ago
|
||
Sorry, couldn't look at this yesterday; it remains a high priority for me.
I will try to reproduce with nsThread logging, as suggested in comment 51.
I am happy to hand-off this bug to anyone who is interested, but I think it remains unclear what component is to blame (we know where browser startup is hanging now, but don't fully understand why).
Flags: needinfo?(gbrown)
Comment 57•8 years ago
|
||
The platform breakdown in comment 54 is quite different from earlier. Curiously, this bug remains mostly Linux on trunk, but spiked suddenly on aurora March 6-9, primarily on Windows PGO.
Comment 58•8 years ago
|
||
The aurora/windows-pgo failures look quite different:
[Errno 10061] No connection could be made because the target machine actively refused it
Those started with https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=0931190cd725a744b74f8f47863b7b38be8e8ebf&filter-searchStr=windows+pgo. I understand tomcat is looking into it.
I'll remain focused on the Linux failures.
Comment hidden (Intermittent Failures Robot) |
Comment 60•8 years ago
|
||
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce67f735b6525cb5cb6a9a4f629a205a6ffc7008, I added nsThread logging, which I see in mozLogs.zip files for all the successful test runs. Unfortunately, for the failed cases, like https://treeherder.mozilla.org/logviewer.html#?job_id=83862794&repo=try, mozLogs.zip does not contain any traces for the failed pid.
Assignee | ||
Comment 61•8 years ago
|
||
Well, for successful jobs we clearly see that events are getting processed on the main thread:
[task 2017-03-15T00:51:03.032162Z] 00:51:03 INFO - runtests.py | Application pid: 1130
[task 2017-03-15T00:51:03.032244Z] 00:51:03 INFO - TEST-INFO | started process GECKO(1130)
[task 2017-03-15T00:51:04.063991Z] 00:51:04 INFO - GECKO(1130) | ZZZ XREMain::XRE_mainRun calls dirprovider.DoStartup
[task 2017-03-15T00:51:04.105691Z] 00:51:04 INFO - GECKO(1130) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-15T00:51:04.731357Z] 00:51:04 INFO - GECKO(1130) | ZZZ OpenSignedAppFileAsync
[task 2017-03-15T00:51:04.732343Z] 00:51:04 INFO - GECKO(1130) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T00:51:04.733710Z] 00:51:04 INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.735141Z] 00:51:04 INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.749637Z] 00:51:04 INFO - GECKO(1130) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T00:51:04.749777Z] 00:51:04 INFO - GECKO(1130) | ZZZ CryptoTask::Run 0x60b000041ee0 - NS_DispatchToMainThread okay
[task 2017-03-15T00:51:04.765823Z] 00:51:04 INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T00:51:04.766754Z] 00:51:04 INFO - GECKO(1130) | ProcessNextNativeEvent on main, mayWait=0
Whereby for failing jobs the following is happening:
[task 2017-03-15T01:33:50.248452Z] 01:33:50 INFO - runtests.py | Application pid: 1804
[task 2017-03-15T01:33:50.248541Z] 01:33:50 INFO - TEST-INFO | started process GECKO(1804)
[task 2017-03-15T01:33:51.090452Z] 01:33:51 INFO - GECKO(1804) | ZZZ XREMain::XRE_mainRun calls dirprovider.DoStartup
[task 2017-03-15T01:33:51.126704Z] 01:33:51 INFO - GECKO(1804) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-15T01:33:51.756964Z] 01:33:51 INFO - GECKO(1804) | ZZZ OpenSignedAppFileAsync
[task 2017-03-15T01:33:51.757053Z] 01:33:51 INFO - GECKO(1804) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T01:33:51.757988Z] 01:33:51 INFO - GECKO(1804) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T01:33:51.758663Z] 01:33:51 INFO - GECKO(1804) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T01:33:51.782045Z] 01:33:51 INFO - GECKO(1804) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T01:33:51.785720Z] 01:33:51 INFO - GECKO(1804) | ZZZ CryptoTask::Run 0x60b000041e30 - NS_DispatchToMainThread okay
[task 2017-03-15T01:35:50.279997Z] 01:35:50 INFO - Traceback (most recent call last):
The crypto task got dispatched to the main thread, right? But then we have 2 minutes without any processing of events. At least this proves the assumptions so far.
Would it help to enable all kind of logging, so we might see what is blocking the main thread from processing events?
Flags: needinfo?(dkeeler)
Comment 62•8 years ago
|
||
Thanks for having a look, whimboo.
I tried changing MOZ_LOG to "all" - I got no mozLogs.zip output at all! https://treeherder.mozilla.org/#/jobs?repo=try&revision=9de48e76eef69082f8d67c2f52ad0372ee6cb11f
Assignee | ||
Comment 63•8 years ago
|
||
You missed to specify the level. So "all:5" should do it.
Comment 64•8 years ago
|
||
I did that: https://hg.mozilla.org/try/rev/3c1b7908bfaa2fa8ebfad9ec1dfd126f2be696b4#l4.13. It just doesn't seem to work.
Looks like some of those tests are faling with "Log file size 83275573 is greater than max allowed 52428800! Setting TBPL_FAILURE (was SUCCESS)..." - could that be preventing the log file from being uploaded? You could try just enabling some logs (although it looks like you did that in comment 60 and it didn't help...) . In PSM, the relevant ones would be pipnss:4 and maybe certverifier:5.
Flags: needinfo?(dkeeler)
Comment 66•8 years ago
|
||
Rather than fiddle with MOZ_LOG further, I just changed the LOG statements in nsThread.cpp into printf's:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=42299ff365d4d3827a19ac8713e0a95a8e073435
That causes even more "Log file size is greater than max allowed" failures, and also seems to make it more difficult to reproduce this hang, but I was still able to produce a hang and a corresponding successful run:
Hang:
https://public-artifacts.taskcluster.net/TCIijhPiSY-yvA3Dv8M8gg/0/public/logs/live_backing.log
[task 2017-03-15T19:54:10.847829Z] 19:54:10 INFO - GECKO(2276) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T19:54:10.848564Z] 19:54:10 INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:54:10.849391Z] 19:54:10 INFO - GECKO(2276) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:54:10.850233Z] 19:54:10 INFO - GECKO(2276) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:54:10.871748Z] 19:54:10 INFO - GECKO(2276) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T19:54:10.874809Z] 19:54:10 INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:54:10.875728Z] 19:54:10 INFO - GECKO(2276) | ZZZ CryptoTask::Run 0x60b0000420f0 - NS_DispatchToMainThread okay
[task 2017-03-15T19:54:10.876495Z] 19:54:10 INFO - GECKO(2276) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:54:10.877298Z] 19:54:10 INFO - GECKO(2276) | THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:55:00.167762Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:00.168578Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.169364Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-15T19:55:00.170168Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.170924Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:00.171643Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.172459Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x607000065580]
[task 2017-03-15T19:55:00.173193Z] 19:55:00 INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:00.173760Z] 19:55:00 INFO - GECKO(2276) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:55:10.173609Z] 19:55:10 INFO - GECKO(2276) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-15T19:55:10.174726Z] 19:55:10 INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:55:10.174793Z] 19:55:10 INFO - GECKO(2276) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-15T19:55:10.174852Z] 19:55:10 INFO - GECKO(2276) | THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-15T19:56:09.314943Z] 19:56:09 INFO - Traceback (most recent call last):
[task 2017-03-15T19:56:09.315267Z] 19:56:09 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2491, in doTests
[task 2017-03-15T19:56:09.315942Z] 19:56:09 INFO - marionette_args=marionette_args,
[task 2017-03-15T19:56:09.317786Z] 19:56:09 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2095, in runApp
[task 2017-03-15T19:56:09.317854Z] 19:56:09 INFO - self.marionette.start_session(timeout=port_timeout)
[task 2017-03-15T19:56:09.318038Z] 19:56:09 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-03-15T19:56:09.318753Z] 19:56:09 INFO - m._handle_socket_failure()
[task 2017-03-15T19:56:09.319745Z] 19:56:09 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-03-15T19:56:09.320979Z] 19:56:09 INFO - return func(*args, **kwargs)
[task 2017-03-15T19:56:09.322240Z] 19:56:09 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1292, in start_session
[task 2017-03-15T19:56:09.323561Z] 19:56:09 INFO - self.protocol, _ = self.client.connect()
[task 2017-03-15T19:56:09.325582Z] 19:56:09 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-03-15T19:56:09.326296Z] 19:56:09 INFO - self.sock.connect((self.addr, self.port))
[task 2017-03-15T19:56:09.326902Z] 19:56:09 INFO - File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-03-15T19:56:09.327380Z] 19:56:09 INFO - return getattr(self._sock,name)(*args)
[task 2017-03-15T19:56:09.327855Z] 19:56:09 INFO - error: [Errno 111] Connection refused
Success:
[task 2017-03-15T19:59:39.346412Z] 19:59:39 INFO - GECKO(2207) | ZZZ OpenSignedAppFileTask::CalculateResult called
[task 2017-03-15T19:59:39.352121Z] 19:59:39 INFO - GECKO(2207) | ZZZ OpenSignedAppFile OK
[task 2017-03-15T19:59:39.353319Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.354864Z] 19:59:39 INFO - GECKO(2207) | ZZZ CryptoTask::Run 0x60b000042040 - NS_DispatchToMainThread okay
[task 2017-03-15T19:59:39.355903Z] 19:59:39 INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:59:39.356826Z] 19:59:39 INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.379060Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x607000000f00]
[task 2017-03-15T19:59:39.381582Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.382870Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.383540Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.384713Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.385528Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x604000017390]
[task 2017-03-15T19:59:39.386594Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.387948Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.389559Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x604000040f10]
[task 2017-03-15T19:59:39.390943Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.391911Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.393182Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6120000c5d40]
[task 2017-03-15T19:59:39.395848Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.396906Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.398041Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.398829Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.399920Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6040000bde90]
[task 2017-03-15T19:59:39.400867Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.402731Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.403491Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x607000065740]
[task 2017-03-15T19:59:39.404278Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.405062Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.405678Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x60700006d150]
[task 2017-03-15T19:59:39.406488Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.407274Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.408085Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x6040000f3990]
[task 2017-03-15T19:59:39.408950Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 0]
[task 2017-03-15T19:59:39.416002Z] 19:59:39 INFO - GECKO(2207) | ProcessNextNativeEvent on main, mayWait=0
[task 2017-03-15T19:59:39.418090Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) running [0x60b000042040]
[task 2017-03-15T19:59:39.418860Z] 19:59:39 INFO - GECKO(2207) | ZZZ CryptoTask::Run on main thread
[task 2017-03-15T19:59:39.421100Z] 19:59:39 INFO - GECKO(2207) | ZZZ OpenSignedAppFileTask::CallCallback called
[task 2017-03-15T19:59:39.422265Z] 19:59:39 INFO - GECKO(2207) | THRD(0x6120000e6bc0) sync shutdown
[task 2017-03-15T19:59:39.422841Z] 19:59:39 INFO - GECKO(2207) | THRD(0x6120000e6bc0) running [0x6060002753c0]
[task 2017-03-15T19:59:39.423602Z] 19:59:39 INFO - GECKO(2207) | THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-15T19:59:39.425564Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
[task 2017-03-15T19:59:39.426182Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) ProcessNextEvent [1 1]
[task 2017-03-15T19:59:39.428992Z] 19:59:39 INFO - GECKO(2207) | THRD(0x612000009dc0) Dispatch [(nil) 0]
Comment hidden (Intermittent Failures Robot) |
Comment 68•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f048161f5db572f518b449fad9d2399b5bcf8eb0 has thread names plus some additional logging.
In the failed cases, "Cache2 I/O" is the last thread to call ProcessNextEvent before the hang, but that ProcessNextEvent call seems to complete - not clear that there's any problem there. In the failed cases, the main thread is last seen calling ProcessNextEvent just before OpenSignedAppFileAsync is called; I still don't see what happens to that thread.
Assignee | ||
Comment 69•8 years ago
|
||
Well, the interesting piece here is the timing. Have a look at the following excerpt:
[task 2017-03-16T04:05:52.028141Z] 04:05:52 INFO - GECKO(1128) | [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
[task 2017-03-16T04:05:52.028184Z] 04:05:52 INFO - GECKO(1128) | ==> return
[task 2017-03-16T04:05:52.028236Z] 04:05:52 INFO - GECKO(1128) | [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
[task 2017-03-16T04:06:41.348327Z] 04:06:41 INFO - GECKO(1128) | THRD(0x6120000d7d40) Dispatch [(nil) 0]
[task 2017-03-16T04:06:41.348608Z] 04:06:41 INFO - GECKO(1128) | [Cache2 I/O] THRD(0x6120000d7d40) ProcessNextEvent [0 0]
[task 2017-03-16T04:06:41.349038Z] 04:06:41 INFO - GECKO(1128) | THRD(0x6120000d7d40) running [0x621000643d20]
[task 2017-03-16T04:06:41.350368Z] 04:06:41 INFO - GECKO(1128) | ==> return
There is a hang of about 50s between the return of the SingedJAR thread, and dispatching and running the Cache2 I/O thread.
Would strace maybe give us further information?
Comment 70•8 years ago
|
||
That is an interesting pause, present in the failures and not present in the successful runs.
strace might produce something interesting...do you know how to set that up, whimboo?
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 72•8 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #70)
> strace might produce something interesting...do you know how to set that up,
> whimboo?
Sadly not. But Dustin gave a quick hint in bug 1331198 comment 16. So something like "strace -p <firefox pid>" should be enough. Finally I haven't had to use it to fix the problem.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 75•8 years ago
|
||
The Aurora/Windows failures were more properly bug 1345904, apparently resolved by https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=6fb7eeaf573e8f25d1cf77900f4daaeb5f90c2c3.
Comment 76•8 years ago
|
||
My latest try push, with lots of logging: https://treeherder.mozilla.org/#/jobs?repo=try&revision=331bfa1250d8ef90a5a6a684f404d5b4d9d40b1c
Note that "-b o -p linux64-asan -u mochitest-gpu -t none --rebuild 20" seems to reliably reproduce the hang.
Comment 77•8 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #69)
> Well, the interesting piece here is the timing. Have a look at the following
> excerpt:
>
> [task 2017-03-16T04:05:52.028141Z] 04:05:52 INFO - GECKO(1128) |
> [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [0 0]
> [task 2017-03-16T04:05:52.028184Z] 04:05:52 INFO - GECKO(1128) | ==>
> return
> [task 2017-03-16T04:05:52.028236Z] 04:05:52 INFO - GECKO(1128) |
> [SignedJAR] THRD(0x6120000e6bc0) ProcessNextEvent [1 0]
> [task 2017-03-16T04:06:41.348327Z] 04:06:41 INFO - GECKO(1128) |
> THRD(0x6120000d7d40) Dispatch [(nil) 0]
> [task 2017-03-16T04:06:41.348608Z] 04:06:41 INFO - GECKO(1128) | [Cache2
> I/O] THRD(0x6120000d7d40) ProcessNextEvent [0 0]
> [task 2017-03-16T04:06:41.349038Z] 04:06:41 INFO - GECKO(1128) |
> THRD(0x6120000d7d40) running [0x621000643d20]
> [task 2017-03-16T04:06:41.350368Z] 04:06:41 INFO - GECKO(1128) | ==>
> return
>
> There is a hang of about 50s between the return of the SingedJAR thread, and
> dispatching and running the Cache2 I/O thread.
At first I thought that was interesting too, but actually, I think there's no reason to expect the Cache2 thread to be active immediately after the SignedJAR thread event completes. In successful runs, there is also a long pause between Cache2 activity. (That may not be apparent in successful runs because the main thread, and other threads, are continuously active.)
Comment 78•8 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #76)
> My latest try push, with lots of logging:
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=331bfa1250d8ef90a5a6a684f404d5b4d9d40b1c
Looking at one of the failures:
https://public-artifacts.taskcluster.net/G4-HSvXAQV6KEbrO6N_yGw/0/public/logs/live_backing.log
[task 2017-03-22T18:05:22.879544Z] 18:05:22 INFO - GECKO(1468) | nsXREDirProvider::DoStartup calls addons Observe for addons-startup on main thread
[task 2017-03-22T18:05:23.138075Z] 18:05:23 INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.478692Z] 18:05:23 INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.480116Z] 18:05:23 INFO - GECKO(1468) | 1490205923472 addons.xpi WARN syncLoadManifestFromFile waiting for completion
[task 2017-03-22T18:05:23.480809Z] 18:05:23 INFO - GECKO(1468) | [Main Thread] THRD(0x61400000ce40) ProcessNextEvent [1 0]
[task 2017-03-22T18:05:23.482189Z] 18:05:23 INFO - GECKO(1468) | ZZZ OpenSignedAppFileAsync on main thread
[task 2017-03-22T18:05:23.483832Z] 18:05:23 INFO - GECKO(1468) | THRD(0x6140000a8640) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.485374Z] 18:05:23 INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Dispatch OK
[task 2017-03-22T18:05:23.485962Z] 18:05:23 INFO - GECKO(1468) | ZZZ OpenSignedAppFileAsync - SignedJAR Dispatch OK
[task 2017-03-22T18:05:23.486627Z] 18:05:23 INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.486726Z] 18:05:23 INFO - GECKO(1468) | [Main Thread] THRD(0x61400000ce40) ProcessNextEvent calls OnProcessNextEvent [1]
[task 2017-03-22T18:05:23.487338Z] 18:05:23 INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.490190Z] 18:05:23 INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [0 0]
[task 2017-03-22T18:05:23.490953Z] 18:05:23 INFO - GECKO(1468) | THRD(0x6140000a8640) running [0x60b000041a10]
[task 2017-03-22T18:05:23.491779Z] 18:05:23 INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent returns 1
[task 2017-03-22T18:05:23.492888Z] 18:05:23 INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.518732Z] 18:05:23 INFO - GECKO(1468) | OpenSignedAppFile returns NS_OK
[task 2017-03-22T18:05:23.521801Z] 18:05:23 INFO - GECKO(1468) | (0x60b000041a10) OpenSignedAppFileTask::CalculateResult returns
[task 2017-03-22T18:05:23.522734Z] 18:05:23 INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.523675Z] 18:05:23 INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Run - NS_DispatchToMainThread okay
[task 2017-03-22T18:05:23.524548Z] 18:05:23 INFO - GECKO(1468) | ==> THRD(0x6140000a8640) ProcessNextEvent return
[task 2017-03-22T18:05:23.525419Z] 18:05:23 INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [0 0]
[task 2017-03-22T18:05:23.526485Z] 18:05:23 INFO - GECKO(1468) | ==> THRD(0x6140000a8640) ProcessNextEvent return
[task 2017-03-22T18:05:23.527467Z] 18:05:23 INFO - GECKO(1468) | [SignedJAR] THRD(0x6140000a8640) ProcessNextEvent [1 0]
[task 2017-03-22T18:06:12.826174Z] 18:06:12 INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:12.827068Z] 18:06:12 INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.828077Z] 18:06:12 INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x621000643d20]
[task 2017-03-22T18:06:12.828912Z] 18:06:12 INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.829729Z] 18:06:12 INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.830412Z] 18:06:12 INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.831404Z] 18:06:12 INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:12.832129Z] 18:06:12 INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.832858Z] 18:06:12 INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x607000065580]
[task 2017-03-22T18:06:12.833615Z] 18:06:12 INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.834452Z] 18:06:12 INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:12.835233Z] 18:06:12 INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:12.835908Z] 18:06:12 INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:22.845240Z] 18:06:22 INFO - GECKO(1468) | THRD(0x61400008fe40) Dispatch [(nil) 0]
[task 2017-03-22T18:06:22.845358Z] 18:06:22 INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:22.846286Z] 18:06:22 INFO - GECKO(1468) | THRD(0x61400008fe40) running [0x621000643d20]
[task 2017-03-22T18:06:22.846355Z] 18:06:22 INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:06:22.846815Z] 18:06:22 INFO - GECKO(1468) | [Cache2 I/O] THRD(0x61400008fe40) ProcessNextEvent [0 0]
[task 2017-03-22T18:06:22.847242Z] 18:06:22 INFO - GECKO(1468) | ==> THRD(0x61400008fe40) ProcessNextEvent return
[task 2017-03-22T18:07:22.068483Z] 18:07:22 INFO - Traceback (most recent call last):
Note particularly this part:
[task 2017-03-22T18:05:23.491779Z] 18:05:23 INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent returns 1
[task 2017-03-22T18:05:23.492888Z] 18:05:23 INFO - GECKO(1468) | [Main Thread] ProcessNextNativeEvent, mayWait=0
[task 2017-03-22T18:05:23.518732Z] 18:05:23 INFO - GECKO(1468) | OpenSignedAppFile returns NS_OK
[task 2017-03-22T18:05:23.521801Z] 18:05:23 INFO - GECKO(1468) | (0x60b000041a10) OpenSignedAppFileTask::CalculateResult returns
[task 2017-03-22T18:05:23.522734Z] 18:05:23 INFO - GECKO(1468) | THRD(0x61400000ce40) Dispatch [(nil) 0]
[task 2017-03-22T18:05:23.523675Z] 18:05:23 INFO - GECKO(1468) | (0x60b000041a10) CryptoTask::Run - NS_DispatchToMainThread okay
https://hg.mozilla.org/try/rev/51e6963f68a269fc69c76f9844f588a94fd58af7#l7.18
nsAppShell::ProcessNextNativeEvent(), running on the main thread, called g_main_context_iteration() but that never returned. OpenSignedAppFile() and OpenSignedAppFileTask::CalculateResult() completed immediately after, but that's not on the main thread. What happened to the main thread?
:keeler -- Can you have another look? I don't know what else to do here.
Assignee: gbrown → nobody
Flags: needinfo?(dkeeler)
Comment 79•8 years ago
|
||
do we have an idea of what tests are causing this? Maybe we can narrow it down that way.
Comment 80•8 years ago
|
||
I don't think there is any connection to individual tests -- this is a hang on browser startup, well before any tests are run. It might be affected by add-ons or anything else configured in the test profile.
It seems to affect all mochitests, but hits mochitest-gl and mochitest-gpu more often than plain - I don't know why.
I wouldn't be surprised if there is a similar issue affecting reftests that is reported differently, but I haven't noticed anything specific.
To test out the assumption that something in the threading of OpenSignedAppFile was causing the timeouts, I wrote a patch that (I believe) made it synchronous rather than async. With 30 retriggers, the intermittent didn't show up:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e24de258c0bca34c852af74bdf61ee1b668e85ea
This then reminded me of bug 1185489, which involves code in this area that isn't thread-safe. I wrote a different patch that just removed the code in question. Again, after 30 retriggers, the intermittent didn't show up:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=fb748ae85e71956819f37502008d508813682cd2
To convince myself that these results were significant, I attempted to reproduce the intermittent on the base revision that both of those patches are based on:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=32bad569ddff5bc4c1be3dd7ed0122090af401ec&filter-searchStr=gpu
.. and unfortunately I haven't been able to make that fail, despite that being the same code as this:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=4c987b7ed54a&filter-searchStr=linux%20x64%20asan%20gpu
which does have a failure.
I'll keep looking, but unfortunately I really don't have anything to add at the moment.
Flags: needinfo?(dkeeler)
Comment hidden (Intermittent Failures Robot) |
Comment 83•8 years ago
|
||
Thanks :keeler. Since my logging seems to make this happen with greater regularity, here are try pushes with my logging on its own + with each of your patches: https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&tochange=86612c999285064cdc09abcb3c7398e1069089aa&fromchange=032ca845c994d6d9118c1db376daa0072d72fac6
Comment 84•8 years ago
|
||
...your "convert openSignedAppFileAsync to sync" patch seems to fix the hang!
Comment 85•8 years ago
|
||
:keeler - Not trying to rush you, but thought I'd check in to see if you noticed the previous couple of comments. Is converting to sync a viable solution here? Let me know if I can run more try experiments or help with anything else.
Flags: needinfo?(dkeeler)
I did see those comments, but thanks for the needinfo anyway. Converting that function to a synchronous call isn't really viable - the file reading/public key operations are long running and not something we want to block the main thread. What's odd is this use of CryptoTask is failing so frequently while others don't seem to. I'll keep looking, though.
Flags: needinfo?(dkeeler)
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Flags: needinfo?(dveditz)
Comment hidden (Intermittent Failures Robot) |
Comment 89•8 years ago
|
||
:keeler, this is still causing a lot of pain on our trees, can you look into a more appropriate fix for openSignedAppFileAsync (per comment 13) ?
Flags: needinfo?(dkeeler)
I didn't manage to get to this today, but it's on my radar. Keeping the ni? so I remember to have a look.
Geoff, I'm having the hardest time reproducing this, even when basing off your original logging patch. Any idea what I'm doing wrong?
https://treeherder.mozilla.org/#/jobs?repo=try&revision=850c91be735b521ac0db3637d572bf0cd5a80f2c
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d18042d0b35b1341d1e0527ff6f115d44dbef18
https://treeherder.mozilla.org/#/jobs?repo=try&revision=37ee630d4769525854ef565b0d3917e2c4657d84
Flags: needinfo?(dkeeler) → needinfo?(gbrown)
Comment 92•8 years ago
|
||
It looks like you are doing everything right. I think something has shifted and my trick for reproducing is not working now.
I thought perhaps we would have better luck with a different test job, but that logging patch generates so much output, almost all the tests fail, and create massive log files. I trimmed it down with more hackery, in https://treeherder.mozilla.org/#/jobs?repo=try&revision=46f757fb8a70752db14cad81718db384b610cc17, but I'm still not having any luck reproducing the failure.
The failure is still happening, according to Orange Factor, but reproducing it on try seems difficult. If you have a possible solution, I'd suggest landing it and seeing if the failures go away. Otherwise, we may just need to wait and see if we can reproduce it again in future. (Or rollback and push a revision from a couple weeks ago?)
Flags: needinfo?(gbrown)
Comment hidden (Intermittent Failures Robot) |
Comment 94•8 years ago
|
||
oddly this has greatly reduced in frequency after April 11th.
Whiteboard: [stockwell needswork] → [stockwell unknown]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•8 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 102•8 years ago
|
||
This is failing frequently again. I pushed the logging patch that :keeler and I were using earlier: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ab246e341381addda1437dd67039e0506d5fadde. Good news: It is now easy to reproduce this on linux32-debug. Bad news: The hang seems to have moved.
I'll try to track it down again.
Flags: needinfo?(gbrown)
Comment hidden (Intermittent Failures Robot) |
Comment 104•8 years ago
|
||
I am seeing a few different causes of this failure. Let's look at one...
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8d621a1f21ba3c5f6c63e3595c2c0c0d9b2e028
https://public-artifacts.taskcluster.net/CH-cN9NLRpqunwGTEKcUxQ/0/public/logs/live_backing.log
20:39:51 INFO - TEST-INFO | started process GECKO(1082)
20:39:51 INFO - GECKO(1082) | ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to /tmp/tmpxcE5GX.mozrunner/runtests_leaks.log
20:41:44 INFO - GECKO(1082) | XRE_mainRun inits commandline...
...
20:41:45 INFO - GECKO(1082) | 1494362505551 Marionette INFO marionette enabled by command line flag
...
20:41:49 INFO - GECKO(1082) | ++DOMWINDOW == 12 (0xde3e3c00) [pid = 1082] [serial = 12] [outer = 0xde4b8c00]
20:41:51 INFO - Traceback (most recent call last):
...
20:41:51 INFO - error: [Errno 111] Connection refused
20:41:51 ERROR - Automation Error: Received unexpected exception while running application
It seems the -marionette flag has been seen and Marionette enabled -- https://hg.mozilla.org/try/rev/af523ebe17152d836b87244f56a79d81aed0337a#l1.12 -- but not initialized -- https://hg.mozilla.org/try/rev/af523ebe17152d836b87244f56a79d81aed0337a#l1.28.
There was activity just before the connection error, at 20:41:49...I wonder if we just hadn't gotten around to initializing MarionetteComponent. It looks like the connection error happens 120 seconds after process launch; could that time period be lengthened, and might that avoid some of these failures, especially on slow platforms like linux32-debug?
:whimboo -- Thoughts?
Flags: needinfo?(gbrown) → needinfo?(hskupin)
Assignee | ||
Comment 105•8 years ago
|
||
Hm, we wait for the "sessionstore-windows-restored" observer topic to be fired, before we do the initialization. Maybe this one is coming through too late or maybe never?
https://dxr.mozilla.org/mozilla-central/rev/b21b974d60d3075ae24f6fb1bae75d0f122f28fc/testing/marionette/components/marionette.js#197
On Windows we also wait until the GFX sanity window has been closed.
Flags: needinfo?(hskupin)
Comment 106•8 years ago
|
||
Thanks - I was wondering how that worked.
Reviewing some of the successful jobs in https://treeherder.mozilla.org/#/jobs?repo=try&revision=c8d621a1f21ba3c5f6c63e3595c2c0c0d9b2e028, I see more evidence that it can take ~2 minutes to init marionette:
https://public-artifacts.taskcluster.net/FmPpTwH8SOS9xFVUKNGhMg/0/public/logs/live_backing.log
20:42:26 INFO - TEST-INFO | started process GECKO(1096)
20:44:18 INFO - GECKO(1096) | 1494362658486 Marionette INFO Marionette init
20:44:18 INFO - GECKO(1096) | 1494362658960 Marionette INFO Listening on port 2828
https://public-artifacts.taskcluster.net/SPeyE3TqRXq9EAUhcBxEaQ/0/public/logs/live_backing.log
20:40:44 INFO - TEST-INFO | started process GECKO(1088)
20:42:24 INFO - GECKO(1088) | 1494362544274 Marionette INFO Marionette init
20:42:24 INFO - GECKO(1088) | 1494362544695 Marionette INFO Listening on port 2828
It can take less time - between 60 and 90 seconds seems common - but it isn't hard to find cases close to 120 seconds.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 108•8 years ago
|
||
Geoff, would you mind to push a try with the following changes:
https://hg.mozilla.org/try/rev/d53acf55de513e4eaa6915877d9556e9d896bd3d
I use that also for another bug and this output is pretty helpful. It will show us which observer notifications Marionette actually receives. It should give us an indication where in the initialization code we are waiting.
Comment 109•8 years ago
|
||
No problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bacb39ac18ec2b9bee13028ced5b1f0b666ab6d7
Here's an example of this failure with your changes: https://public-artifacts.taskcluster.net/ABmnEm2_RXCfCYbOgPEi_Q/0/public/logs/live_backing.log
Comment 110•8 years ago
|
||
I also tried increasing DEFAULT_STARTUP_TIMEOUT from 120 seconds to 180 seconds: https://treeherder.mozilla.org/#/jobs?repo=try&revision=300dc393115b1973cf7c28ee522c45dd1dec11d0. With this change, I could not reproduce this failure.
Assignee | ||
Comment 111•8 years ago
|
||
So here the timing for a failing test:
[task 2017-05-10T15:08:46.572745Z] 15:08:46 INFO - TEST-INFO | started process GECKO(1085)
[..]
[task 2017-05-10T15:09:54.628136Z] 15:09:54 INFO - GECKO(1085) | 1494428994626 Marionette INFO Received observer notification "sessionstore-windows-restored"
[task 2017-05-10T15:09:55.037043Z] 15:09:55 INFO - GECKO(1085) | 1494428995035 Marionette INFO Listening on port 2828
It means that it has been taken 1:10min to start Marionette for this debug build. This is insanely long!
The Linux workers are using Ubuntu 16.04.2 LTS, so bug 1345105 shouldn't have an effect here.
Geoff I would propose that you request a one-click loaner, run the setup task, and run the tests manually by inspecting the cpu load time while firefox is starting up. I could imagine there is some other process which takes a fair amount of cpu power and slows Firefox down.
Comment hidden (Intermittent Failures Robot) |
Comment 113•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #111)
> It means that it has been taken 1:10min to start Marionette for this debug
> build. This is insanely long!
It is a long time from a user perspective, but this is linux32 and debug and some cheap aws machine -- I don't feel qualified to criticize! :)
It looks to me like startup in this environment is usually between 60 and 120 seconds, with something like 3% of jobs taking more than 120 seconds (causing failures here).
> Geoff I would propose that you request a one-click loaner, run the setup
> task, and run the tests manually by inspecting the cpu load time while
> firefox is starting up. I could imagine there is some other process which
> takes a fair amount of cpu power and slows Firefox down.
Good idea. I did that, I noted startup taking 60 seconds+, as expected. top indicated firefox-bin taking 95% CPU, python taking 4% CPU during startup; I didn't see any other significant CPU use during startup.
Assignee | ||
Comment 114•7 years ago
|
||
Ok, in that case we have two options... 1) invest in faster workers, or 2) bump up the start time timeout. I will leave it up to you to decide. I only wonder if we save money or not when using such slow workers which take much, much longer to run a test job, than having a faster worker and shorter job times.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Assignee: nobody → gbrown
Comment 118•7 years ago
|
||
I am trying to address this in bug 1364228, and am hopeful that changes there will reduce failure frequency by 80%+.
Whiteboard: [stockwell unknown] → [stockwell needswork]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 122•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #118)
> I am trying to address this in bug 1364228, and am hopeful that changes
> there will reduce failure frequency by 80%+.
Nearly all (but not quite all) linux failures have been eliminated.
Assignee: gbrown → nobody
Whiteboard: [stockwell needswork] → [stockwell fixed]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 128•7 years ago
|
||
this picked up on the 17th, primarily on osx opt in mochitest*; :gbrown, can you maybe help narrow this down or do your magic to make this actionable?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell fixed] → [stockwell needswork]
Assignee | ||
Comment 129•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) from comment #128)
> this picked up on the 17th, primarily on osx opt in mochitest*; :gbrown,
> can you maybe help narrow this down or do your magic to make this actionable?
Maybe a side-effect from moving from Buildbot to Taskcluster?
Comment hidden (Intermittent Failures Robot) |
Comment 131•7 years ago
|
||
Looking at the osx-10-10 failures only, I see -
- opt only, non-e10s only
- various mochitests...perhaps biased to mochitest-chrome, mochitest-bc?
- fails after 2 minutes of trying to connect; successful runs generally connect in a few seconds
- earliest that I see is https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=00fa4f730bf6b7779a8fb78aaef3e23322f5750a&filter-searchStr=osx, but that certainly didn't cause this.
Comment hidden (Intermittent Failures Robot) |
Comment 133•7 years ago
|
||
I'm debugging on try...may take a while.
Also, note whimboo's comments in bug 1374762.
Comment 134•7 years ago
|
||
In the failed cases, I find profile-after-change is seen but sessionstore-windows-restored is not.
Assignee | ||
Comment 135•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #134)
> In the failed cases, I find profile-after-change is seen but
> sessionstore-windows-restored is not.
Yeah, that would indicate that we might have started Firefox in safe mode. The application command doesn't show it:
17:37:54 INFO - Application command: /Users/cltbld/tasks/task_1498003975/build/application/Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/p5/c6r39pb97p5___0gvb4wyhzr00000w/T/tmpPpotiL.mozrunner
But I wonder which code paths could actually trigger the safe mode to become active. Geoff, have you had a chance to see it on a one click loaner? If we can see it with VNC connected that would be great.
Comment hidden (Intermittent Failures Robot) |
Comment 137•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #135)
> active. Geoff, have you had a chance to see it on a one click loaner? If we
> can see it with VNC connected that would be great.
I have not seen it on a one click loaner. I have only reproduced it with try pushes like https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309.
The latest result indicates sessionstore-windows-restored is not received because browser.js's onLoad() is not called:
https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1255
and therefore doesn't notify sessionstore-windows-restored:
https://dxr.mozilla.org/mozilla-central/rev/92eb911c35da48907d326604c4c92cf55e551895/browser/base/content/browser.js#1313
Assignee | ||
Comment 138•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #137)
> I have not seen it on a one click loaner. I have only reproduced it with try
> pushes like
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=8e781e0c5b0bf42723b7a8f2cd14062916f3b309.
bc2 and bc3 are failing way more than other tests. So having a look at the VNC screen, while running such a test multiple times, should give us more information. It would be good to know if the browser windows comes up at all, or eg if safe mode is active.
Comment 139•7 years ago
|
||
I don't think there is a vnc screen to access- these tests all run in AWS and as gbrown mentioned he hasn't seen this on a one click loaner.
Assignee | ||
Comment 140•7 years ago
|
||
Every one click loaner at least for Linux workers allow you to also connect to the screen via VNC. If he hasn't seen it yet, he might have not run enough test jobs to trigger the behavior. It's all the same environment so I do not understand why a one click loaner should not be helpful here?
Comment 141•7 years ago
|
||
:whimboo, a one click loaner is not always realistic to reproduce problems- many times I have had to give up and go to try server, have you seen this failure while inside a one click loaner- maybe you have tips for how to reproduce it.
Assignee | ||
Comment 142•7 years ago
|
||
What is try server? If you mean try builds then those jobs also allow a one click loaner. Again, tests on those run totally the same way. I really don't understand the problem here, and we may want to discuss this outside of this bug.
Comment 143•7 years ago
|
||
I just haven't used a one-click loaner for this because it seemed inconvenient: With a try push I can start 100+ mochitest jobs and almost always reproduce this problem a couple of times. I don't fancy sitting on a one-click loaner re-running mochitest 100 times. I'll give it a shot with bc2/bc3...see if I can catch it.
Comment 144•7 years ago
|
||
...or not!
It looks like one-click loaners are not available for osx (can select one-click loaner option in treeherder, but disabled in taskcluster).
This happens too infrequently on linux to try to reproduce there.
Comment hidden (Intermittent Failures Robot) |
Comment 146•7 years ago
|
||
possibly getting a physical loaner?
this could be a side effect of running tests in taskcluster vs buildbot- the timing looks about right.
Comment hidden (Intermittent Failures Robot) |
Comment 148•7 years ago
|
||
There have been no new failures on any platform since June 24 and none on OSX since June 23. That's a weekend, with fewer pushes than normal, but a try push seems to confirm: This isn't happening any more.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=eb4128adf5fa4d197c868e2f33d5039c18d1eb99
My planned next step here was to try to track down why browser.js' onLoad() was not being called, even though xpcom startup seemed fine. I also wanted to check if this might be related to safe mode, but didn't have a concrete idea of how to pursue either idea.
Hoping this stays fixed...
Flags: needinfo?(gbrown)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 150•7 years ago
|
||
My patch for bug 1374762 landed 2 days ago. Since then we haven't had any single failure anymore. So I really wonder if that could have been due to an inappropriate start in safe mode.
Please note that with my patch I added a DEBUG log line for Marionette which prints out when the safe mode dialog gets handled. Sadly be default this will not show up in the logs of other test suites because the preference `marionette.log.level = 'trace'` is not used. Maybe it should be added so that Marionette prints a bit more output in case of failures.
Geoff, what do you think?
Assignee | ||
Updated•7 years ago
|
Flags: needinfo?(gbrown)
Comment 151•7 years ago
|
||
There have been a couple of failures today. :(
...But there has been lots of improvement here, the majority of the improvement around June 24, and perhaps some additional improvement around the time of your safe mode change - that's great!
I find investigating and debugging these "Connection refused" failures tiresome and welcome additional logging that would help narrow the cause of the failure more quickly. Looking at the current Marionette test logs, I'm not convinced the TRACE messages would help much: It looks like "Listening on port 2828" is still the first message, and I don't see much more logging at startup. I'm also concerned by all the TRACE messages while tests are running: Our tests logs are already so big and convoluted, I don't want to see that much more added unless we really need it. Maybe it's best to keep it the way it is? It's easy to add TRACE logs in try pushes.
Flags: needinfo?(gbrown)
Assignee | ||
Comment 152•7 years ago
|
||
Geoff, it should actually be enough to use `marionette.log.level = 'debug'`. Because that is what we actually use to print out other information during startup and page loads. The latter shouldn't matter here because it would only happen for page load requests as done via Marionette, and that's nothing those test harnesses is doing.
Otherwise we could add a info level output for how Marionette has been enabled (command line flag or environment variable). It would indicate that it's running but not necessarily have the server started.
Comment 153•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #152)
> startup and page loads. The latter shouldn't matter here because it would
> only happen for page load requests as done via Marionette, and that's
> nothing those test harnesses is doing.
Oh, I see - I hadn't thought of that. That sounds okay then.
> Otherwise we could add a info level output for how Marionette has been
> enabled (command line flag or environment variable). It would indicate that
> it's running but not necessarily have the server started.
That sounds good too!
Assignee | ||
Comment 154•7 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #153)
> > startup and page loads. The latter shouldn't matter here because it would
> > only happen for page load requests as done via Marionette, and that's
> > nothing those test harnesses is doing.
>
> Oh, I see - I hadn't thought of that. That sounds okay then.
I would leave this up to you then given that this is outside of Marionette.
> > Otherwise we could add a info level output for how Marionette has been
> > enabled (command line flag or environment variable). It would indicate that
> > it's running but not necessarily have the server started.
>
> That sounds good too!
I filed bug 1378209 to get this implemented soon.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 156•7 years ago
|
||
So a normal application start looks like this:
[task 2017-07-09T10:27:16.225590Z] 10:27:16 INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpABZapw.mozrunner
[task 2017-07-09T10:27:16.246287Z] 10:27:16 INFO - runtests.py | Application pid: 1813
[task 2017-07-09T10:27:16.246599Z] 10:27:16 INFO - TEST-INFO | started process GECKO(1813)
[task 2017-07-09T10:27:18.870075Z] 10:27:18 INFO - GECKO(1813) | 1499596038867 Marionette INFO Enabled via --marionette
[task 2017-07-09T10:27:23.520937Z] 10:27:23 INFO - GECKO(1813) | 1499596043514 Marionette INFO Listening on port 2828
As you can see Marionette is enabled via the command line flag. But here is the output from a non-working case:
[task 2017-07-09T10:30:16.098939Z] 10:30:16 INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpoguz_J.mozrunner
[task 2017-07-09T10:30:16.114279Z] 10:30:16 INFO - runtests.py | Application pid: 2431
[task 2017-07-09T10:30:16.114769Z] 10:30:16 INFO - TEST-INFO | started process GECKO(2431)
[task 2017-07-09T10:33:16.192469Z] 10:33:16 INFO - Traceback (most recent call last):
[task 2017-07-09T10:33:16.195312Z] 10:33:16 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2534, in doTests
[task 2017-07-09T10:33:16.197115Z] 10:33:16 INFO - marionette_args=marionette_args,
[task 2017-07-09T10:33:16.199086Z] 10:33:16 INFO - File "/home/worker/workspace/build/tests/mochitest/runtests.py", line 2135, in runApp
[task 2017-07-09T10:33:16.200887Z] 10:33:16 INFO - self.marionette.start_session(timeout=port_timeout)
[task 2017-07-09T10:33:16.202293Z] 10:33:16 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 28, in _
[task 2017-07-09T10:33:16.203538Z] 10:33:16 INFO - m._handle_socket_failure()
[task 2017-07-09T10:33:16.205368Z] 10:33:16 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/decorators.py", line 23, in _
[task 2017-07-09T10:33:16.207145Z] 10:33:16 INFO - return func(*args, **kwargs)
[task 2017-07-09T10:33:16.208370Z] 10:33:16 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/marionette.py", line 1208, in start_session
[task 2017-07-09T10:33:16.210069Z] 10:33:16 INFO - self.protocol, _ = self.client.connect()
[task 2017-07-09T10:33:16.211347Z] 10:33:16 INFO - File "/home/worker/workspace/build/venv/local/lib/python2.7/site-packages/marionette_driver/transport.py", line 223, in connect
[task 2017-07-09T10:33:16.212672Z] 10:33:16 INFO - self.sock.connect((self.addr, self.port))
[task 2017-07-09T10:33:16.213892Z] 10:33:16 INFO - File "/usr/lib/python2.7/socket.py", line 228, in meth
[task 2017-07-09T10:33:16.215644Z] 10:33:16 INFO - return getattr(self._sock,name)(*args)
[task 2017-07-09T10:33:16.216847Z] 10:33:16 INFO - error: [Errno 111] Connection refused
[task 2017-07-09T10:33:16.218212Z] 10:33:16 ERROR - Automation Error: Received unexpected exception while running application
The Marionette component didn't get started. Even not after 3 minutes! And as such the client times out. This indicates that there is clearly a problem in Firefox, whether with initializing components, or a hang / crash during startup.
Updated•7 years ago
|
Whiteboard: [stockwell needswork] → [stockwell unk
Updated•7 years ago
|
Whiteboard: [stockwell unk → [stockwell unknown]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Whiteboard: [stockwell unknown] → [stockwell needswork]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 164•7 years ago
|
||
It appears this bug has become a catch all... There are totally unrelated classifications against this bug.
Updated•7 years ago
|
Updated•7 years ago
|
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 168•7 years ago
|
||
It was my impression that mochitests would soon move to install
addons in a different way and not use Marionette? Has this plan
changed, or am I disillusioned?
From looking at a recent log:
> 16:35:06 INFO - GECKO(1097) | 1502728506714 Marionette INFO Enabled via --marionette
[…]
> 16:35:12 INFO - error: [Errno 111] Connection refused
> 16:35:12 ERROR - Automation Error: Received unexpected exception while running application
In between here, we would have expected to see “Marionette INFO
Listening on port 2828”, indicating that the TCP listener in the
remote protocol has started up.
We use Services.tm.idleDispatchToMainThread to delay starting
up Marionette until Firefox has finished whatever it is
doing on the main thread to avoid potential synchronisation
problems. You can see the relevant code for this in
https://searchfox.org/mozilla-central/rev/b258e6864ee3e809d40982bc5d0d5aff66a20780/testing/marionette/components/marionette.js#292-311.
I suspect that this check is not relevant to mochitest and that
mochitest may be instrumenting the content shell in such a way that
it never reaches an idle main thread. jmaher, does this sound like a
plausible theory?
Updated•7 years ago
|
Flags: needinfo?(jmaher)
Assignee | ||
Comment 169•7 years ago
|
||
Andreas please see bug 1382162 which might be the underlying issue here too. This is marked as dependency.
Comment 170•7 years ago
|
||
:ato, the work to not use Marionette was for Talos only, not mochitest/reftest- that work has been done and mozprofile installs the webextensions. Talos is different than Mochitest/Reftest in that it has signed addons/webextensions and we can have persistent installations- if we were to sign our addons for the unittests, then we could install them in a different way.
Flags: needinfo?(jmaher)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 173•7 years ago
|
||
:automatedtester- can you find someone to look at fixing this? we have many failures/week showing up from this and marionette connections have been a persistent problem since we starting using marionette.
Flags: needinfo?(dburns)
Assignee | ||
Comment 174•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #173)
> :automatedtester- can you find someone to look at fixing this? we have many
> failures/week showing up from this and marionette connections have been a
> persistent problem since we starting using marionette.
Once again, all those failures should be related to bug 1382162 which needs to be investigated and fixed first. Geoff spent most time on all this so far but has a very hard time to get it reproduced. I would suggest we are moving any discussion item over to bug 1382162?
Comment 175•7 years ago
|
||
I agree, but we need someone to own resolving this issue as that bug has been idle for 10 days.
Comment hidden (mozreview-request) |
Assignee | ||
Comment 177•7 years ago
|
||
mozreview-review |
Comment on attachment 8902637 [details]
Bug 1261598 - Set port timeout to be more than 60s for ASAN.
https://reviewboard.mozilla.org/r/174308/#review179476
::: testing/mozharness/configs/unittests/linux_unittest.py:119
(Diff revision 1)
> "--log-errorsummary=%(error_summary_file)s",
> "--use-test-media-devices",
> "--screenshot-on-fail",
> "--cleanup-crashes",
> "--marionette-startup-timeout=180",
> + "--marionette-port-timeout=180"
The port timeout is by default 360s. Why should we reduce it here, so that its value is even clashing with the --marionette-startup-timeout?
I don't see that this will change anything. It only will abort tests earlier.
Comment 178•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #177)
> Comment on attachment 8902637 [details]
> Bug 1261598 - Set port timeout to be more than 60s for ASAN.
>
> https://reviewboard.mozilla.org/r/174308/#review179476
>
> ::: testing/mozharness/configs/unittests/linux_unittest.py:119
> (Diff revision 1)
> > "--log-errorsummary=%(error_summary_file)s",
> > "--use-test-media-devices",
> > "--screenshot-on-fail",
> > "--cleanup-crashes",
> > "--marionette-startup-timeout=180",
> > + "--marionette-port-timeout=180"
>
> The port timeout is by default 360s. Why should we reduce it here, so that
> its value is even clashing with the --marionette-startup-timeout?
>
> I don't see that this will change anything. It only will abort tests earlier.
The tests are timing out after 60s so its clearly not waiting 360s
Flags: needinfo?(dburns)
Assignee | ||
Comment 179•7 years ago
|
||
(In reply to David Burns :automatedtester from comment #178)
> > ::: testing/mozharness/configs/unittests/linux_unittest.py:119
> > (Diff revision 1)
> > > "--log-errorsummary=%(error_summary_file)s",
> > > "--use-test-media-devices",
> > > "--screenshot-on-fail",
> > > "--cleanup-crashes",
> > > "--marionette-startup-timeout=180",
> > > + "--marionette-port-timeout=180"
> >
> > I don't see that this will change anything. It only will abort tests earlier.
>
> The tests are timing out after 60s so its clearly not waiting 360s
No, tests are timing out after 3min (180s) which is startup timeout as listed above:
> [task 2017-08-26T08:04:23.069762Z] 08:04:23 INFO - Application command: /home/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpGylzXD.mozrunner
[..]
[task 2017-08-26T08:04:27.051735Z] 08:04:27 INFO - GECKO(1205) | 1503734667044 Marionette INFO Enabled via --marionette
[task 2017-08-26T08:07:29.551916Z] 08:07:29 INFO - Traceback (most recent call last):
To get more details about this failure we could temporarily do the following:
1) Change Marionette's log lines for received observer notifications from DEBUG to INFO level
2) Update test harnesses which use Marionette to use the DEBUG log level instead of INFO
I would actually prefer option 2) which could simply be done via a preference as already mentioned in comment 150.
But basically it's still bug 1382162 we have to work on, because that is the underlying issue here.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 181•7 years ago
|
||
Please note that some failures like the following are not caused by Marionette but we hit a MOZ_CRASH which causes a shutdown of Firefox.
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=127986873&lineNumber=22362
[task 2017-09-02T11:53:03.283917Z] 11:53:03 INFO - GECKO(2785) | Hit MOZ_CRASH(Aborting on channel error.) at /builds/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:2535
[task 2017-09-02T11:53:03.290153Z] 11:53:03 INFO - GECKO(2785) | #01: mozilla::ipc::ProcessLink::OnChannelError [ipc/glue/MessageLink.cpp:393]
[task 2017-09-02T11:53:03.290404Z] 11:53:03 INFO -
[task 2017-09-02T11:53:03.293195Z] 11:53:03 INFO - GECKO(2785) | #02: event_process_active_single_queue [ipc/chromium/src/third_party/libevent/event.c:1648]
Shouldn't we also add MOZ_CRASH to the error list, so that those cases are getting reported correctly on Treeherder? Otherwise we will get falsely starred test failures like in the above case. Ed, can you give feedback?
Flags: needinfo?(emorley)
Assignee | ||
Comment 182•7 years ago
|
||
So what I think goes also wrong here is the usage of Marionette by other test harnesses. I already reported that on bug 1393366 for wpt-tests. What's basically happening here is the following:
By default the startup_timeout of Marionette driver is 120s. BUT as we can see here it errors out already after 60s. The 120s are only used when you make use of `wait_for_port` or `raise_for_port` before the first call to `start_session`. Starting a session defaults to 60s, and uses this value inappropriately to wait for the server to be ready. Instead it would have to use the 120s:
http://searchfox.org/mozilla-central/rev/44c693914255638d74bcf1ec3b0bcd448dfab2fd/testing/marionette/client/marionette_driver/marionette.py#1173-1202
I will take care of that in my patch on bug 1397675. This should definitely give us 60s more time, which should be sufficient.
Depends on: 1397675
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 185•7 years ago
|
||
(In reply to Henrik Skupin (:whimboo) from comment #181)
> Shouldn't we also add MOZ_CRASH to the error list, so that those cases are
> getting reported correctly on Treeherder? Otherwise we will get falsely
> starred test failures like in the above case. Ed, can you give feedback?
I agree this sounds useful to have marked as an error (presuming it would result in no false positives). However we're trying to move away from hardcoding error strings in Treeherder, so I would prefer this to be addressed by making the mozharness log level prefix be ERROR rather than the current INFO. eg:
[task 2017-09-02T11:53:03.283917Z] 11:53:03 ERROR - GECKO(2785) | Hit MOZ_CRASH(...
Flags: needinfo?(emorley)
Assignee | ||
Comment 186•7 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #185)
> hardcoding error strings in Treeherder, so I would prefer this to be
> addressed by making the mozharness log level prefix be ERROR rather than the
Sure, I filed bug 1399993.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 188•7 years ago
|
||
The patch on bug 1397675 has been landed yesterday. I will observe the results on Monday to see if also fixes this problem.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 190•7 years ago
|
||
The number of failures dropped a lot after bug 1397675 got fixed, but there are still issues, and I think this is because of bug
1382162 (missing session restore notification).
Comment 191•7 years ago
|
||
I don't see many failures in the last day, in fact it looks silent for almost 16 hours now. Lets see if this continues the trend by the end of the week :)
Assignee | ||
Comment 192•7 years ago
|
||
Well, there are still quite a lot of failures. And with each job waiting 3 minutes now for Firefox and Marionette to be ready nothing seem to have changed. This clearly indicates that the build is stuck during startup, and Marionette doesn't start because of it. Note, that even the command line arguments handler is not getting called!
Most of the test failures are for mochitest-chrome on Windows 10, and similar to bug 1352671 only happening for jobs as run on buildbot. Given that Joel moved those to TaskCluster via bug 1401184 lets see if this changes something.
If not I'm still working on improving the logging of Marionette driver now.
Depends on: 1401184
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 194•7 years ago
|
||
There are mainly failures which happen for browser chrome tests, Joel is trying to move to TC now. Maybe that helps us.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 198•7 years ago
|
||
I can only see failures for browser-chrome jobs yesterday which run via buildbot but there is no single entry for those running via TC now. So moving them from hardware to VM seems to have helped a lot, and that was done via bug 1402068.
Top-most remaining failing jobs seem to be for mochitest-gl-e10s. Joel, is there anything we can do to also port them to VM?
Depends on: 1402068
Flags: needinfo?(jmaher)
Comment 199•7 years ago
|
||
for webgl/gpu/reftest we need a working GPU instance with a supported video card- right now there are issues with that which is why these are running on hardware. That work is tracked in bug 1366288.
Flags: needinfo?(jmaher)
Comment 200•7 years ago
|
||
a chance this could still come up on the radar for disable-recommended, but we are under the 200 threshold and I suspect will remain there.
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
Comment 201•7 years ago
|
||
sorry, I was looking at bug 1391545 and accidentally changed this bug.
Whiteboard: [stockwell needswork] → [stockwell disable-recommended]
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 203•7 years ago
|
||
(In reply to OrangeFactor Robot from comment #202)
> 40 failures in 178 pushes (0.225 failures/push) were associated with this
> bug yesterday.
Only 5 of those failures are from test jobs as run on TC workers. All the rest is coming from Buildbot workers. Especially the mochitest-gl-e10-* jobs seem to be mostly affected.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 206•7 years ago
|
||
There are no more failures since September 28th. This is when my patch on bug 1403616 landed.
Assignee: nobody → hskupin
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
status-firefox58:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla58
Updated•7 years ago
|
Whiteboard: [stockwell disable-recommended] → [stockwell fixed:other]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Updated•2 years ago
|
Product: Testing → Remote Protocol
Comment 221•2 years ago
|
||
Moving bug to Testing::Marionette Client and Harness component per bug 1815831.
Component: Marionette → Marionette Client and Harness
Product: Remote Protocol → Testing
You need to log in
before you can comment on or make changes to this bug.
Description
•