Open Bug 1414495 Opened 7 years ago Updated 1 year ago

Intermittent non-asan TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (Waiting for browser...)

Categories

(Testing :: General, defect, P5)

Version 3
defect

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 5 open bugs)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra] [see summary at comment 139][stockwell needswork:owner])

Filed by: archaeopteryx [at] coole-files.de https://treeherder.mozilla.org/logviewer.html#?job_id=142108991&repo=mozilla-inbound https://queue.taskcluster.net/v1/task/RhxWTJ_kRR2RXHpbbWWJoQ/runs/0/artifacts/public/test_info/mochitest-devtools-chrome-chunked_errorsummary.log 05:46:09 INFO - TEST-START | devtools/client/performance/test/browser_timeline-waterfall-workers.js 05:46:09 INFO - TEST-SKIP | devtools/client/performance/test/browser_timeline-waterfall-workers.js | took 0ms 05:46:09 INFO - Running manifest: devtools\client\debugger\new\test\mochitest\browser.ini 05:46:09 INFO - Z:\task_1509773846\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL 05:46:09 INFO - MochitestServer : launching [u'Z:\\task_1509773846\\build\\tests\\bin\\xpcshell.exe', '-g', 'Z:\\task_1509773846\\build\\application\\firefox', '-v', '170', '-f', 'Z:\\task_1509773846\\build\\tests\\bin\\components\\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\\\users\\\\genericworker\\\\appdata\\\\local\\\\temp\\\\tmpxtl83a.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'Z:\\task_1509773846\\build\\tests\\mochitest\\server.js'] 05:46:09 INFO - runtests.py | Server pid: 7068 05:46:09 INFO - runtests.py | Websocket server pid: 8724 05:46:09 INFO - runtests.py | SSL tunnel pid: 7308 05:46:10 INFO - runtests.py | Running with e10s: True 05:46:10 INFO - runtests.py | Running tests: start. 05:46:10 INFO - Application command: Z:\task_1509773846\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\genericworker\appdata\local\temp\tmpxtl83a.mozrunner 05:46:10 INFO - runtests.py | Application pid: 6256 05:46:10 INFO - TEST-INFO | started process GECKO(6256) 05:46:10 INFO - GECKO(6256) | 1509774370676 Marionette DEBUG Received observer notification "profile-after-change" 05:46:10 INFO - GECKO(6256) | 1509774370793 Marionette DEBUG Received observer notification "command-line-startup" 05:46:10 INFO - GECKO(6256) | 1509774370793 Marionette INFO Enabled via --marionette 05:46:11 INFO - GECKO(6256) | 1509774371746 Marionette DEBUG Received observer notification "sessionstore-windows-restored" 05:46:12 INFO - GECKO(6256) | 1509774372064 Marionette DEBUG New connections are accepted 05:46:12 INFO - GECKO(6256) | 1509774372064 Marionette INFO Listening on port 2828 05:52:22 INFO - Buffered messages finished 05:52:22 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
Depends on: 1397201
I added some logging in https://treeherder.mozilla.org/#/jobs?repo=try&revision=6737b2be84e9b7cd23fcb1dcf4586abab865498a and reproduced the timeouts. In the timeout cases, it looks to me like start_session() completes very quickly -- maybe even before the server has started?? I wonder if it is actually connecting to something (what?) or if it is just failing but not noticing that it's failing. I then added more logging and pushed to try, but then couldn't reproduce any time outs on startup -- just as jmaher reported in https://bugzilla.mozilla.org/show_bug.cgi?id=1397201#c79.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Go ahead and successfully get that bug suggested for both 330 second failures on beta/release/esr52 and for 370 second failures on the trunk, and only *then* mark this as a duplicate of it.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
There are 73 failures related to this bug in the last 7 days. We can see this occurring on Windows 10, OSX, Linux 64 and Windows 7 (opt, pgo and debug). :whimboo any updates on this? 09:55:30 INFO - Calling ['C:\\slave\\test\\build\\venv\\Scripts\\python', '-u', 'C:\\slave\\test\\build\\tests\\reftest\\runreftest.py', '--appname=C:\\slave\\test\\build\\application\\firefox\\firefox.exe', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=https://queue.taskcluster.net/v1/task/Bq3iCutpRqyt6w1eZve9WQ/artifacts/public/build/target.crashreporter-symbols.zip', '--log-raw=C:\\slave\\test\\build\\blobber_upload_dir\\reftest_raw.log', '--log-errorsummary=C:\\slave\\test\\build\\blobber_upload_dir\\reftest_errorsummary.log', '--cleanup-crashes', '--marionette-startup-timeout=180', '--log-raw=-', '--suite=reftest', '--', 'tests/reftest/tests/layout/reftests/reftest.list'] with output_timeout 1000 09:55:31 INFO - REFTEST INFO | Running with e10s: True 09:55:31 INFO - REFTEST INFO | Application command: C:\slave\test\build\application\firefox\firefox.exe -marionette -profile c:\users\cltbld\appdata\local\temp\tmp9br3ex.mozrunner 10:01:41 ERROR - REFTEST ERROR | reftest | application timed out after 370 seconds with no output 10:01:41 ERROR - REFTEST ERROR | Force-terminating active process(es). 10:01:41 INFO - REFTEST TEST-INFO | started process screenshot 10:01:41 INFO - REFTEST TEST-INFO | screenshot: exit 0 10:01:41 INFO - TEST-INFO | crashinject: exit 0 10:01:42 ERROR - TEST-UNEXPECTED-FAIL | reftest | application terminated with exit code 1 10:01:42 INFO - REFTEST INFO | Downloading symbols from: https://queue.taskcluster.net/v1/task/Bq3iCutpRqyt6w1eZve9WQ/artifacts/public/build/target.crashreporter-symbols.zip 10:01:48 INFO - REFTEST INFO | Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld\appdata\local\temp\tmp9br3ex.mozrunner\minidumps\b53c3a0f-9d59-44f4-a529-321d52056490.dmp c:\users\cltbld\appdata\local\temp\tmplrafuh 10:01:53 INFO - REFTEST INFO | Saved minidump as C:\slave\test\build\blobber_upload_dir\b53c3a0f-9d59-44f4-a529-321d52056490.dmp 10:01:53 INFO - REFTEST INFO | Saved app info as C:\slave\test\build\blobber_upload_dir\b53c3a0f-9d59-44f4-a529-321d52056490.extra 10:01:53 INFO - REFTEST PROCESS-CRASH | reftest | application crashed [@ CrashingThread(void *)] 10:01:53 INFO - Crash dump filename: c:\users\cltbld\appdata\local\temp\tmp9br3ex.mozrunner\minidumps\b53c3a0f-9d59-44f4-a529-321d52056490.dmp 10:01:53 INFO - Operating system: Windows NT 10:01:53 INFO - 10.0.15063 10:01:53 INFO - CPU: amd64 10:01:53 INFO - family 6 model 30 stepping 5 10:01:53 INFO - 8 CPUs 10:01:53 INFO - 10:01:53 INFO - GPU: UNKNOWN 10:01:53 INFO - 10:01:53 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE 10:01:53 INFO - Crash address: 0x0 10:01:53 INFO - Process uptime: 371 seconds 10:01:53 INFO - 10:01:53 INFO - Thread 29 (crashed) 10:01:53 INFO - 0 crashinjectdll.dll!CrashingThread(void *) [crashinjectdll.cpp:389bd78cf79d : 17 + 0x0] 10:01:53 INFO - rax = 0x00007ffbc5c01000 rdx = 0x00007ffbc5c01000 10:01:53 INFO - rcx = 0x0000000000000000 rbx = 0x0000000000000000 10:01:53 INFO - rsi = 0x0000000000000000 rdi = 0x0000000000000000 10:01:53 INFO - rbp = 0x0000000000000000 rsp = 0x0000009a24dffcc8 10:01:53 INFO - r8 = 0x0000000000000000 r9 = 0x0000000000000000 10:01:53 INFO - r10 = 0x0000000000000000 r11 = 0x0000000000000246 10:01:53 INFO - r12 = 0x0000000000000000 r13 = 0x0000000000000000 10:01:53 INFO - r14 = 0x0000000000000000 r15 = 0x0000000000000000 10:01:53 INFO - rip = 0x00007ffbc5c01000 10:01:53 INFO - Found by: given as instruction pointer in context 10:01:53 INFO - 1 kernel32.dll!LdrpLoadResourceFromAlternativeModule + 0x27c 10:01:53 INFO - rsp = 0x0000009a24dffcd0 rip = 0x00007ffbc86d2774 10:01:53 INFO - Found by: stack scanning
Flags: needinfo?(hskupin)
Whiteboard: [stockwell needswork]
I checked some of the screenshots and found the following: Mac OS with a system notification present and in the foreground: https://public-artifacts.taskcluster.net/bPC-phMtQp-XV3W97WGDhw/0/public/test_info/mozilla-test-fail-screenshot_xGwbMf.png Here a Windows case when Marionette installed the Mochikit extension, but then Firefox seems to hang, and doesn't correctly load the Mochitest window: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=144259262&lineNumber=6238 https://public-artifacts.taskcluster.net/f6p4HQJGQsCBeockIpEjRw/0/public/test_info/mozilla-test-fail-screenshot_bgwyes.png Here a job on Linux, which also shows that Marionette has been finished installing the addons but then we seem to hang with maybe the local webserver not being ready? The invalid cert message is strange: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=143862071&lineNumber=9519 https://public-artifacts.taskcluster.net/EdJZ9Ei4SYuQV4XQdCXmtA/0/public/test_info//mozilla-test-fail-screenshot_KPLPcM.png Windows again, with a command shell window in-front (I thought this was fixed): https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=143629526&lineNumber=1754 https://public-artifacts.taskcluster.net/fVHbhN6MRKiQgPymuZCUlQ/0/public/test_info/mozilla-test-fail-screenshot_uuieaw.png Geoff, it would be good if you could take over some of the work now. I will still try to get logging enabled for Marionette client, but I doubt that this client is the problem here.
Flags: needinfo?(hskupin) → needinfo?(gbrown)
OK - I'll try to look into this early next week, including following up on your excellent screenshots.
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
Depends on: 1419191
The Mac notification has been there for multiple versions of OS X, we've filed multiple bugs about it, multiple times discovered that it wasn't causing any of the problems we blamed on it, and multiple times discovered that none of the things people say will allow for automating getting rid of it actually will get rid of it.
Btw. what happens when mochitests are getting started, and Firefox gets moved to the background? Some tests do need explicit focus set to Firefox, and print a message to the console. Is this just a print or a logging statement? I wonder if this could cause a hang we do not see in the logs.
There were several hangs reported here on Nov 16 for Windows test-verify. Those were triggered by bug 1396620, which was backed out; on re-landing, there were no test-verify failures.
There were also several hangs in Windows 10 ccov tests, a new platform. Those tests are hidden now - tier 3 - so let's ignore those too.
I've made some progress on bug 1419121, which I thought might be related, but as it turns out probably isn't very important to this bug. The majority of recent failures here are Windows 10 mochitests, with perhaps a bias toward pgo builds. They seem consistent in that logs typically end with 'Received observer notification "profile-after-change"' before the timeout. The parent process is alive at the time of the timeout, and crash reports are produced. Screenshots typically do not show the browser. All of that seems consistent with a browser hang early in startup. The location of the hang is not apparent to me from the crash reports. I typically see threads running MessagePumpForNonMainThreads and helper threads and native Windows code, and most threads just waiting. Still investigating...
Component: Marionette → General
(In reply to Geoff Brown [:gbrown] from comment #19) > Now narrowed down to > > https://dxr.mozilla.org/mozilla-central/rev/ > 38f49346a200cc25492236c7b3c536fc835fe031/toolkit/xre/nsXREDirProvider. > cpp#1024-1025 Does it hang? I wonder for which component registered under "profile-after-change" that is happening. Looks like we have to add some log statements to: https://dxr.mozilla.org/mozilla-central/source/xpcom/components/nsCategoryManager.cpp#757
It does hang, at least for 370 seconds. https://treeherder.mozilla.org/#/jobs?repo=try&revision=53d0f4bc5311160f45bf9bf4fd9fa061263fb9ca narrows it down just a little more, to https://dxr.mozilla.org/mozilla-central/rev/574f4f58fe09dd590ea892406e237318c31705b4/xpcom/components/nsCategoryManager.cpp#784-817 I assume one of the profile-after-change observers is hanging, but I cannot reproduce the hang once I add logging to the loop (and that wouldn't be very helpful anyway -- still wouldn't know which observer is to blame). A better crash report would be very useful...why are they so consistently unhelpful?
Shouldn't we better find someone who has knowledge in that area? Dave, do you know who could help us in investigating this hang/crash during start-up? I would have suggested Benjamin but he is no longer with us. Thanks.
Flags: needinfo?(dtownsend)
Help would be great! https://treeherder.mozilla.org/#/jobs?repo=try&revision=cdf7b1aecc826f41e45e25fb9b249fa5d5df8906 demonstrates what I have found so far. On Windows 10/pgo, there is an intermittent hang during startup. XRE_mainRun() calls mDirProvider.DoStartup() https://hg.mozilla.org/try/rev/5e8a028c3373c8d8c8af2b98923c1cb636610eb9#l1.14 which calls NS_CreateServicesFromCategory() https://hg.mozilla.org/try/rev/5e8a028c3373c8d8c8af2b98923c1cb636610eb9#l2.16 which enumerates and notifies observers of 'profile-after-change' https://hg.mozilla.org/try/rev/5e8a028c3373c8d8c8af2b98923c1cb636610eb9#l3.14 There are other calls to NS_CreateServicesFromCategory(), so on a successful run this logging produces something like 21:45:46 INFO - GECKO(5920) | ZZZ NS_CreateServicesFromCategory starts loop 21:45:46 INFO - GECKO(5920) | ZZZ NS_CreateServicesFromCategory starts loop 21:45:46 INFO - GECKO(5920) | ZZZ XRE_mainRun calls mDirProvider.DoStartup 21:45:46 INFO - GECKO(5920) | ZZZ NS_CreateServicesFromCategory starts loop 21:45:46 INFO - GECKO(5920) | ZZZ DoStartup calls NS_CreateServicesFromCategory 21:45:46 INFO - GECKO(5920) | ZZZ NS_CreateServicesFromCategory starts loop 21:45:46 INFO - GECKO(5920) | 1512423946889 Marionette DEBUG Received observer notification "profile-after-change" 21:45:47 INFO - GECKO(5920) | ZZZ NS_CreateServicesFromCategory done 21:45:47 INFO - GECKO(5920) | ZZZ mDirProvider.DoStartup done Intermittently, there is a hang and we get: https://treeherder.mozilla.org/logviewer.html#?job_id=149681058&repo=try&lineNumber=11692 22:43:24 INFO - GECKO(6604) | ZZZ NS_CreateServicesFromCategory starts loop 22:43:25 INFO - GECKO(6604) | ZZZ NS_CreateServicesFromCategory starts loop 22:43:25 INFO - GECKO(6604) | ZZZ XRE_mainRun calls mDirProvider.DoStartup 22:43:25 INFO - GECKO(6604) | ZZZ NS_CreateServicesFromCategory starts loop 22:43:25 INFO - GECKO(6604) | ZZZ DoStartup calls NS_CreateServicesFromCategory 22:43:25 INFO - GECKO(6604) | ZZZ NS_CreateServicesFromCategory starts loop 22:43:25 INFO - GECKO(6604) | 1512427405433 Marionette DEBUG Received observer notification "profile-after-change" 22:46:25 INFO - runtests.py | Waiting for browser... 22:49:35 INFO - Buffered messages finished 22:49:35 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output If I add additional logging within the observer loop, I can no longer reproduce the hang. Curiously, I never see XRE_mainRun or DoStartup or NS_CreateServicesFromCategory in the crash report. This is a startup hang, but I think the hang is after the crash reporter is initialized. These are pgo builds -- that probably doesn't help.
Florian, can you help with this startup-related issue?
Flags: needinfo?(dtownsend) → needinfo?(florian)
Blocks: 1418575
This bug has failed 36 times in the last 7 days on Windows 10-64, Linux64 and OS X 10.10 on asan, pgo and debug build types. Here is a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=153056867 And here is part of the log: WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output 00:38:10 INFO - 1141 ERROR Force-terminating active process(es). 00:38:10 INFO - Determining child pids from psutil... 00:38:10 INFO - [] 00:38:10 INFO - Found child pids: set([]) 00:38:10 INFO - Killing process: 600 00:38:10 INFO - TEST-INFO | started process screenshot 00:38:10 INFO - TEST-INFO | screenshot: exit 0 00:38:10 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 600: error 5 00:38:10 INFO - psutil found pid 600 dead 00:38:10 INFO - TEST-INFO | Main app process: exit 1 00:38:10 INFO - Buffered messages finished 00:38:10 ERROR - 1142 ERROR TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 1 00:38:10 INFO - runtests.py | Application ran for: 0:06:10.602000 00:38:10 INFO - zombiecheck | Reading PID log: c:\users\genericworker\appdata\local\temp\tmp4pdowzpidlog 00:38:10 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/KgFDiTGcTb2MD86V9ZQtqw/artifacts/public/build/target.crashreporter-symbols.zip 00:38:13 INFO - mozcrash Copy/paste: Z:\task_1513901523\build\win32-minidump_stackwalk.exe c:\users\genericworker\appdata\local\temp\tmpq0ik1j.mozrunner\minidumps\f1c8369b-a44c-4d8d-a21a-e6d41cd3490e.dmp c:\users\genericworker\appdata\local\temp\tmpvlr48b 00:38:24 INFO - mozcrash Saved minidump as Z:\task_1513901523\build\blobber_upload_dir\f1c8369b-a44c-4d8d-a21a-e6d41cd3490e.dmp 00:38:24 WARNING - PROCESS-CRASH | automation.py | application crashed [None] 00:38:24 INFO - Crash dump filename: c:\users\genericworker\appdata\local\temp\tmpq0ik1j.mozrunner\minidumps\f1c8369b-a44c-4d8d-a21a-e6d41cd3490e.dmp 00:38:24 INFO - Operating system: Windows NT 00:38:24 INFO - 10.0.15063 00:38:24 INFO - CPU: amd64 00:38:24 INFO - family 6 model 63 stepping 2 00:38:24 INFO - 8 CPUs 00:38:24 INFO - GPU: UNKNOWN 00:38:24 INFO - No crash 00:38:24 INFO - Process uptime: 371 seconds 00:38:24 INFO - Thread 0 Hi :gbrown, can you please take a look at this bug?
Flags: needinfo?(gbrown)
Whiteboard: [stockwell disable-recommended] → [stockwell disable-recommended][stockweel needswork]
I'm waiting for Florian. This may take a while to sort out.
Flags: needinfo?(gbrown)
Whiteboard: [stockwell disable-recommended][stockweel needswork] → [stockwell disable-recommended]
On 2018-01-11 there was a spike of failures in Windows test-verify reported here. Those are likely a separate, non-recurring problem that should probably be ignored for now.
These failures have been very infrequent for the last 2 weeks. I don't know why. Similar failures continue in reftests in bug 1418575.
Temporary spike on Feb 2 was Windows test-verify of js\\xpconnect\\tests\\mochitest\\test_bug790732.html (triggered by bug 1435138).
Depends on: 1437420
Lower frequency failures continue here, perhaps from multiple causes. I no longer see any of the Windows screenshot oddities that were previously noted. I expect bug 1437420 to reduce failure frequency by about 30% more. I would still like to hear from :florian regarding investigation of the possible intermittent browser startup hang. Changing to [stockwell infra] to reflect that this bug may indicate multiple on-going issues.
Whiteboard: [stockwell disable-recommended] → [stockwell infra]
(In reply to Geoff Brown [:gbrown] from comment #43) > I would still like to hear from :florian regarding investigation of the > possible intermittent browser startup hang. Which kind of investigation do you expect from me here? From looking at the failures in comment 42, most have a JS error like: 15:32:55 INFO - GECKO(3416) | JavaScript error: http://mochi.test:8888/tests/js/xpconnect/tests/mochitest?autorun=1&closeWhenDone=1&consoleLevel=INFO&runUntilFailure=1&repeat=10&testname=tests/js\xpconnect\tests\mochitest\test_bug790732.html&manifestFile=tests.json&dumpOutputDirectory=c%3A%5Cusers%5Cgenericworker%5Cappdata%5Clocal%5Ctemp&cleanupCrashes=true, line 11: SyntaxError: malformed hexadecimal character escape sequence Then there are a few isolate odd ones, like https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=160828169&lineNumber=57455 that has "Assertion failure: aVsyncTimestamp <= TimeStamp::Now(), at z:/build/build/src/layout/base/nsRefreshDriver.cpp:680"
Flags: needinfo?(florian)
"malformed hexadecimal character escape sequence" should be handled by bug 1437420. The aVsyncTimestamp assertion is interesting. There are also cases like https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=161236315&lineNumber=12171 02:39:15 INFO - Application command: Z:\task_1518142850\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\genericworker\appdata\local\temp\tmpdymd4w.mozrunner 02:39:15 INFO - runtests.py | Application pid: 5076 02:39:15 INFO - TEST-INFO | started process GECKO(5076) 02:39:15 INFO - GECKO(5076) | 1518143955773 Marionette DEBUG Received observer notification profile-after-change 02:42:15 INFO - runtests.py | Waiting for browser... 02:45:25 INFO - Buffered messages finished 02:45:25 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output Marionette would normally report next: GECKO(6080) | 1518349905300 Marionette DEBUG Received observer notification command-line-startup but there's no sign of that in these cases. If command-line-startup is not received, I'd expect marionette to not start, and no tests will run. It would be great if we could determine why firefox startup seems to hang like this.
(In reply to Geoff Brown [:gbrown] from comment #45) > There are also cases like > > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > inbound&job_id=161236315&lineNumber=12171 This contains a crash dump, which I don't know how to read. Who could help to understand it? xul.dll!profiler_suspend_and_sample_thread at https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=161236315&lineNumber=13199 makes me wonder if the profiler is somehow enabled, and if so how we could access the captured profile.
Florian, this is not a crash. After the test failure as covered by this bug mozprocess killed the process: > Killing process: 5076 The crash info also mentions 'No Crash'. So this is a hang between: 02:39:15 INFO - Application command: Z:\task_1518142850\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\genericworker\appdata\local\temp\tmpdymd4w.mozrunner 02:39:15 INFO - runtests.py | Application pid: 5076 02:39:15 INFO - TEST-INFO | started process GECKO(5076) 02:39:15 INFO - GECKO(5076) | 1518143955773 Marionette DEBUG Received observer notification profile-after-change ..... [1] 02:42:15 INFO - runtests.py | Waiting for browser... 02:45:25 INFO - Buffered messages finished 02:45:25 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output 02:45:25 INFO - 911 ERROR Force-terminating active process(es). [1] Here nothing happened and the -marionette argument is not handled.
(In reply to Henrik Skupin (:whimboo) from comment #47) > Florian, this is not a crash. The process gets killed and we have a crash dump. I would like to understand what the main thread was doing at the time we killed the process, and don't know how to get that information from the crash dump.
(In reply to Florian Quèze [:florian] from comment #46) > (In reply to Geoff Brown [:gbrown] from comment #45) > > > There are also cases like > > > > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > > inbound&job_id=161236315&lineNumber=12171 > > This contains a crash dump, which I don't know how to read. Who could help > to understand it? :ted knows a lot about how crash dumps are generated, crash dump format, etc. :mconley comes to mind as someone who's done a lot of crash debugging. Perhaps needinfo one of them?
I talked to Ted on IRC and he mentioned to me mozcrash.kill_and_get_minidump(): https://dxr.mozilla.org/mozilla-central/rev/27fd083ed7ee5782e52a5eaf0286c5ffa8b35a9e/testing/mozbase/mozcrash/mozcrash/mozcrash.py#512 Maybe we should use that method instead of mozprocess' kill(SIGKILL) in case of detected application hangs and timeouts? In that case we should get at least a better stack.
Please skip my last comment. We already do that. So the stack we see here shows what was going on at that time. Ted suggested to load the minidump as attached as artifact into the debugger and to get more details: https://queue.taskcluster.net/v1/task/Mt-3rb_sST270x1rN7P-Qw/runs/0/artifacts/public/test_info/8a7e9064-b5dc-4108-82e4-b704d7c64012.dmp
I managed to open this crash dump with Visual Studio, here is the stack I got: [External Code] xul.dll!mozilla::widget::GfxInfo::Init() Line 393 C++ > xul.dll!mozilla::widget::GfxInfoConstructor(nsISupports * aOuter, const nsID & aIID, void * * aResult) Line 152 C++ xul.dll!nsComponentManagerImpl::CreateInstance(const nsID & aClass, nsISupports * aDelegate, const nsID & aIID, void * * aResult) Line 1003 C++ xul.dll!nsComponentManagerImpl::GetService(const nsID & aClass, const nsID & aIID, void * * aResult) Line 1247 C++ xul.dll!nsJSCID::GetService(JS::Handle<JS::Value> iidval, JSContext * cx, unsigned char optionalArgc, JS::MutableHandle<JS::Value> retval) Line 702 C++ [External Code] xul.dll!XPC_WN_CallMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Line 929 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 473 C++ xul.dll!Interpret(JSContext * cx, js::RunState & state) Line 3096 C++ xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Line 423 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 495 C++ xul.dll!Interpret(JSContext * cx, js::RunState & state) Line 3096 C++ xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Line 423 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 495 C++ xul.dll!JS_CallFunctionValue(JSContext * cx, JS::Handle<JSObject *> obj, JS::Handle<JS::Value> fval, const JS::HandleValueArray & args, JS::MutableHandle<JS::Value> rval) Line 2978 C++ xul.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS * wrapper, unsigned short methodIndex, const XPTMethodDescriptor * info_, nsXPTCMiniVariant * nativeParams) Line 1307 C++ xul.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex, const XPTMethodDescriptor * info, nsXPTCMiniVariant * params) Line 616 C++ xul.dll!PrepareAndDispatch(nsXPTCStubBase * self, unsigned int methodIndex, unsigned __int64 * args, unsigned __int64 * gprData, double * fprData) Line 174 C++ [External Code] xul.dll!NS_CreateServicesFromCategory(const char * aCategory, nsISupports * aOrigin, const char * aObserverTopic, const char16_t *) Line 816 C++ xul.dll!nsXREDirProvider::DoStartup() Line 1035 C++ xul.dll!XREMain::XRE_mainRun() Line 4512 C++ xul.dll!XREMain::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4808 C++ xul.dll!XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4901 C++ GfxInfo::Init() Line 393 looks like https://searchfox.org/mozilla-central/rev/74b7ffee403c7ffd05b8b476c411cbf11d134eb9/widget/windows/GfxInfo.cpp#393 HDEVINFO devinfo = SetupDiGetClassDevsW(nullptr, mDeviceID[0].get(), nullptr, DIGCF_PRESENT | DIGCF_PROFILE | DIGCF_ALLCLASSES);
Milan, are we aware of such a hang in GfxInfo::Init() during startup of Firefox?
Flags: needinfo?(milan)
There's a good chance I don't know what I'm talking about here, but, it looks like mDeviceID[0] is being used as a "PnP device instance ID" and https://msdn.microsoft.com/en-us/library/windows/hardware/ff551069(v=vs.85).aspx says When specifying a PnP device instance ID, DIGCF_DEVICEINTERFACE must be set in the Flags parameter. :milan - should DIGCF_DEVICEINTERFACE be added to the SetupDiGetClassDevsW() call?
The documentation could certainly be suggesting that, so if it takes away the intermittent hang, I'd r+ it. Geoff, want to to a try push with a bunch of repeats and see if the hang goes away?
Flags: needinfo?(milan) → needinfo?(gbrown)
Unfortunately, adding DIGCF_DEVICEINTERFACE seems to cause some new failures: https://treeherder.mozilla.org/#/jobs?repo=try&revision=699ca312a40c81bbc6d0a939d57696c43df3791e
Flags: needinfo?(gbrown)
:gbrown, is this what I"m experiencing with bug 1427969?
(In reply to Edmund Wong (:ewong) from comment #58) > :gbrown, is this what I"m experiencing with bug 1427969? You are getting the same "application timed out" message obviously, which really just means the harness has not seen any test output for a long time -- either the browser is hung or a test is hung. There are lots of ways to hang. Note that most of the specific issues identified so far in this bug - bug 1437420, bug 1419191, and comment 54 - only happen on Windows. We are also seeing hangs on Linux, which mostly haven't been investigated yet.
Depends on: 1441869
As it looks like the landed patch on bug 1441869 helped a lot. I don't see failing tests anymore for normal test jobs. But the QuantumRender jobs still have the problem: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165175543&lineNumber=1753 Maybe we also have to update some Rust code for that GFX initialization?
Flags: needinfo?(gbrown)
The only rust call to SetupDiGetClassDevs* that I can find is https://dxr.mozilla.org/mozilla-central/rev/714a94df1a99b933a4eef2149819d2af319a1ed5/dom/webauthn/u2f-hid-rs/src/windows/winapi.rs#78-85 and that looks correct to me.
Flags: needinfo?(gbrown)
:florian -- The crash report that you obtained from Visual Studio (comment 52) was quite helpful - thanks! Could you do the same thing for a few of the most recent failures, like https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165307692 https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=165230332 https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165175543 ?
Flags: needinfo?(florian)
Geoff, I checked some of the latest reports and each of those jobs do not show an active Firefox window in the failing screenshot at all, but only a command shell with a Notepad command for visual effects. https://taskcluster-artifacts.net/O-UKxZLIQ0WEVgHnsJc4Gg/0/public/test_info/mozilla-test-fail-screenshot_xzyfty.png So whether we do no longer create a Firefox window anymore due to the changes on bug 1441869, or we fail somewhere else now and leave a zombie process behind?
(In reply to Geoff Brown [:gbrown] from comment #63) > https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165307692 ntdll.dll!NtWaitForAlertByThreadId() Unknown ntdll.dll!RtlpWaitOnAddressWithTimeout() Unknown ntdll.dll!RtlpWaitOnCriticalSection() Unknown ntdll.dll!RtlpEnterCriticalSectionContended() Unknown ntdll.dll!RtlEnterCriticalSection() Unknown ucrtbase.dll!__acrt_update_thread_locale_data() Unknown ucrtbase.dll!_LocaleUpdate::_LocaleUpdate(struct __crt_locale_pointers * const) Unknown ucrtbase.dll!__crt_stdio_output::write_multiple_characters<class __crt_stdio_output::stream_output_adapter<wchar_t>,char>(class __crt_stdio_output::stream_output_adapter<wchar_t> const &,char,int,int * const) Unknown ucrtbase.dll!__stdio_common_vfwscanf() Unknown ucrtbase.dll!__stdio_common_vfprintf() Unknown xul.dll!fprintf(_iobuf * const _Stream, const char * const _Format, ...) Line 840 C > xul.dll!CVRPathRegistry_Public::BLoadFromFile() Line 223 C++ xul.dll!CVRPathRegistry_Public::GetPaths(std::basic_string<char,std::char_traits<char>,std::allocator<char> > * psRuntimePath, std::basic_string<char,std::char_traits<char>,std::allocator<char> > * psConfigPath, std::basic_string<char,std::char_traits<char>,std::allocator<char> > * psLogPath, const char * pchConfigPathOverride, const char * pchLogPathOverride, std::vector<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,std::allocator<std::basic_string<char,std::char_traits<char>,std::allocator<char> > > > * pvecExternalDrivers) Line 337 C++ xul.dll!vr::VR_IsRuntimeInstalled() Line 213 C++ xul.dll!mozilla::gfx::VRSystemManagerOpenVR::Create() Line 609 C++ xul.dll!mozilla::gfx::VRManager::VRManager() Line 90 C++ xul.dll!mozilla::gfx::VRManager::ManagerInit() Line 51 C++ xul.dll!gfxPlatform::gfxPlatform() Line 530 C++ xul.dll!gfxWindowsPlatform::gfxWindowsPlatform() Line 310 C++ xul.dll!gfxPlatform::Init() Line 745 C++ xul.dll!gfxPlatform::GetPlatform() Line 538 C++ xul.dll!nsRefreshDriver::ChooseTimer() Line 1141 C++ xul.dll!nsRefreshDriver::EnsureTimerStarted(nsRefreshDriver::EnsureTimerStartedFlags aFlags) Line 1378 C++ xul.dll!nsIPresShell::DoObserveStyleFlushes() Line 9387 C++ xul.dll!nsPresContext::MediaFeatureValuesChanged(const mozilla::MediaFeatureChange & aChange) Line 303 C++ xul.dll!mozilla::PresShell::ResizeReflowIgnoreOverride(int aWidth, int aHeight, int aOldWidth, int aOldHeight, nsIPresShell::ResizeReflowOptions aOptions) Line 1942 C++ xul.dll!mozilla::PresShell::ResizeReflow(int aWidth, int aHeight, int aOldWidth, int aOldHeight, nsIPresShell::ResizeReflowOptions aOptions) Line 1919 C++ xul.dll!nsViewManager::DoSetWindowDimensions(int aWidth, int aHeight) Line 193 C++ xul.dll!nsViewManager::SetWindowDimensions(int aWidth, int aHeight, bool aDelayResize) Line 228 C++ xul.dll!nsDocumentViewer::SetBoundsWithFlags(const mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> & aBounds, unsigned int aFlags) Line 2071 C++ xul.dll!nsDocShell::SetPositionAndSize(int aX, int aY, int aWidth, int aHeight, unsigned int aFlags) Line 5673 C++ xul.dll!nsWebShellWindow::WindowResized(nsIWidget * aWidget, int aWidth, int aHeight) Line 305 C++ xul.dll!nsWebShellWindow::WidgetListenerDelegate::WindowResized(nsIWidget * aWidget, int aWidth, int aHeight) Line 808 C++ xul.dll!nsWindow::OnResize(mozilla::gfx::IntRectTyped<mozilla::gfx::UnknownUnits> & aWindowRect) Line 7262 C++ xul.dll!nsWindow::OnWindowPosChanged(tagWINDOWPOS * wp) Line 6805 C++ xul.dll!nsWindow::ProcessMessage(unsigned int msg, unsigned __int64 & wParam, __int64 & lParam, __int64 * aRetValue) Line 6022 C++ xul.dll!nsWindow::WindowProcInternal(HWND__ * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) Line 5015 C++ xul.dll!CallWindowProcCrashProtected(__int64(*)(HWND__ *, unsigned int, unsigned __int64, __int64) aWndProc, HWND__ * aHWnd, unsigned int aMsg, unsigned __int64 aWParam, __int64 aLParam) Line 32 C++ xul.dll!nsWindow::WindowProc(HWND__ * hWnd, unsigned int msg, unsigned __int64 wParam, __int64 lParam) Line 4968 C++ user32.dll!UserCallWinProcCheckWow() Unknown user32.dll!DispatchClientMessage() Unknown user32.dll!__fnINLPWINDOWPOS() Unknown ntdll.dll!KiUserCallbackDispatcherContinue() Unknown win32u.dll!NtUserSetWindowPos() Unknown xul.dll!nsWindow::Resize(double aWidth, double aHeight, bool aRepaint) Line 1904 C++ xul.dll!nsXULWindow::SetSize(int aCX, int aCY, bool aRepaint) Line 585 C++ xul.dll!nsXULWindow::SizeShellToWithLimit(int aDesiredWidth, int aDesiredHeight, int shellItemWidth, int shellItemHeight) Line 2294 C++ xul.dll!nsXULWindow::SizeShellTo(nsIDocShellTreeItem * aShellItem, int aCX, int aCY) Line 1882 C++ xul.dll!nsXULWindow::SetRootShellSize(int aWidth, int aHeight) Line 1862 C++ xul.dll!nsWindowWatcher::SizeOpenedWindow(nsIDocShellTreeOwner * aTreeOwner, mozIDOMWindowProxy * aParent, bool aIsCallerChrome, const SizeSpec & aSizeSpec, const mozilla::Maybe<float> & aOpenerFullZoom) Line 2491 C++ xul.dll!nsWindowWatcher::OpenWindowInternal(mozIDOMWindowProxy * aParent, const char * aUrl, const char * aName, const char * aFeatures, bool aCalledFromJS, bool aDialog, bool aNavigate, nsIArray * aArgv, bool aIsPopupSpam, bool aForceNoOpener, nsIDocShellLoadInfo * aLoadInfo, mozIDOMWindowProxy * * aResult) Line 1249 C++ xul.dll!nsWindowWatcher::OpenWindow(mozIDOMWindowProxy * aParent, const char * aUrl, const char * aName, const char * aFeatures, nsISupports * aArguments, mozIDOMWindowProxy * * aResult) Line 354 C++ xul.dll!XPTC__InvokebyIndex() Unknown xul.dll!XPCWrappedNative::CallMethod(XPCCallContext & ccx, XPCWrappedNative::CallMode mode) Line 1234 C++ xul.dll!XPC_WN_CallMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Line 929 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 467 C++ xul.dll!Interpret(JSContext * cx, js::RunState & state) Line 3085 C++ xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Line 417 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 489 C++ xul.dll!js::Call(JSContext * cx, JS::Handle<JS::Value> fval, JS::Handle<JS::Value> thisv, const js::AnyInvokeArgs & args, JS::MutableHandle<JS::Value> rval) Line 535 C++ xul.dll!JS_CallFunctionValue(JSContext * cx, JS::Handle<JSObject *> obj, JS::Handle<JS::Value> fval, const JS::HandleValueArray & args, JS::MutableHandle<JS::Value> rval) Line 2969 C++ xul.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS * wrapper, unsigned short methodIndex, const nsXPTMethodInfo * info, nsXPTCMiniVariant * nativeParams) Line 1308 C++ xul.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex, const nsXPTMethodInfo * info, nsXPTCMiniVariant * params) Line 615 C++ xul.dll!PrepareAndDispatch(nsXPTCStubBase * self, unsigned int methodIndex, unsigned __int64 * args, unsigned __int64 * gprData, double * fprData) Line 174 C++ xul.dll!SharedStub() Unknown xul.dll!NS_CreateServicesFromCategory(const char * aCategory, nsISupports * aOrigin, const char * aObserverTopic, const char16_t * aObserverData) Line 812 C++ xul.dll!nsXREDirProvider::DoStartup() Line 1025 C++ xul.dll!XREMain::XRE_mainRun() Line 4518 C++ xul.dll!XREMain::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4814 C++ xul.dll!XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4906 C++ firefox.exe!do_main(int argc, char * * argv, char * * envp) Line 232 C++ firefox.exe!NS_internal_main(int argc, char * * argv, char * * envp) Line 306 C++ firefox.exe!wmain(int argc, wchar_t * * argv) Line 132 C++ firefox.exe!__scrt_common_main_seh() Line 283 C++ kernel32.dll!BaseThreadInitThunk() Unknown ntdll.dll!RtlUserThreadStart() Unknown
(In reply to Geoff Brown [:gbrown] from comment #63) > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > inbound&job_id=165230332 > https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165175543 These 2 other failures have the exact same stack as the first one (comment 66).
Flags: needinfo?(florian)
(In reply to Florian Quèze [:florian] from comment #66) > (In reply to Geoff Brown [:gbrown] from comment #63) > > > https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165307692 > > xul.dll!fprintf(_iobuf * const _Stream, const char * const _Format, ...) > Line 840 C > > xul.dll!CVRPathRegistry_Public::BLoadFromFile() Line 223 C++ This is https://searchfox.org/mozilla-central/rev/d2b4b40901c15614fad2fa34718eea428774306e/gfx/vr/openvr/src/vrpathregistry_public.cpp#223 See bug 1373279.
Depends on: 1373279
(In reply to Florian Quèze [:florian] from comment #68) > This is > https://searchfox.org/mozilla-central/rev/ > d2b4b40901c15614fad2fa34718eea428774306e/gfx/vr/openvr/src/ > vrpathregistry_public.cpp#223 This line is: VRLog( "Unable to read VR Path Registry from %s\n", sRegPath.c_str() );
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=165897117&lineNumber=5110 (most recent failure in the link from comment 64) has this stack: > ntdll.dll!NtProtectVirtualMemory() Unknown ntdll.dll!LdrpChangeMrdataProtection() Unknown ntdll.dll!LdrProtectMrdata() Unknown ntdll.dll!RtlInsertInvertedFunctionTable() Unknown ntdll.dll!LdrpProcessMappedModule() Unknown ntdll.dll!LdrpMapDllWithSectionHandle() Unknown ntdll.dll!LdrpMapDllNtFileName() Unknown ntdll.dll!LdrpMapDllSearchPath() Unknown ntdll.dll!LdrpProcessWork() Unknown ntdll.dll!LdrpLoadDllInternal() Unknown ntdll.dll!LdrpLoadForwardedDll() Unknown ntdll.dll!LdrpGetDelayloadExportDll() Unknown ntdll.dll!LdrpHandleProtectedDelayload() Unknown ntdll.dll!LdrResolveDelayLoadedAPI() Unknown setupapi.dll!__delayLoadHelper2() Unknown setupapi.dll!__tailMerge_DEVOBJ_dll() Unknown setupapi.dll!SetupDiCreateDeviceInfoListExW() Unknown setupapi.dll!SetupDiGetClassDevsExW() Unknown setupapi.dll!SetupDiGetClassDevsW() Unknown xul.dll!mozilla::widget::GfxInfo::Init() Line 394 C++ xul.dll!mozilla::widget::GfxInfoConstructor(nsISupports * aOuter, const nsID & aIID, void * * aResult) Line 152 C++ xul.dll!nsComponentManagerImpl::CreateInstance(const nsID & aClass, nsISupports * aDelegate, const nsID & aIID, void * * aResult) Line 1003 C++ xul.dll!nsComponentManagerImpl::GetService(const nsID & aClass, const nsID & aIID, void * * aResult) Line 1247 C++ xul.dll!nsJSCID::GetService(JS::Handle<JS::Value> iidval, JSContext * cx, unsigned char optionalArgc, JS::MutableHandle<JS::Value> retval) Line 691 C++ xul.dll!XPTC__InvokebyIndex() Line 99 Unknown xul.dll!XPCWrappedNative::CallMethod(XPCCallContext & ccx, XPCWrappedNative::CallMode) Line 1234 C++ xul.dll!XPC_WN_CallMethod(JSContext * cx, unsigned int argc, JS::Value * vp) Line 929 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 467 C++ xul.dll!Interpret(JSContext * cx, js::RunState & state) Line 3085 C++ xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Line 417 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 489 C++ xul.dll!Interpret(JSContext * cx, js::RunState & state) Line 3085 C++ xul.dll!js::RunScript(JSContext * cx, js::RunState & state) Line 417 C++ xul.dll!js::InternalCallOrConstruct(JSContext * cx, const JS::CallArgs & args, js::MaybeConstruct construct) Line 489 C++ xul.dll!JS_CallFunctionValue(JSContext * cx, JS::Handle<JSObject *> obj, JS::Handle<JS::Value> fval, const JS::HandleValueArray & args, JS::MutableHandle<JS::Value> rval) Line 2969 C++ xul.dll!nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS * wrapper, unsigned short methodIndex, const nsXPTMethodInfo * info, nsXPTCMiniVariant * nativeParams) Line 1308 C++ xul.dll!nsXPCWrappedJS::CallMethod(unsigned short methodIndex, const nsXPTMethodInfo * info, nsXPTCMiniVariant * params) Line 615 C++ xul.dll!PrepareAndDispatch(nsXPTCStubBase * self, unsigned int methodIndex, unsigned __int64 * args, unsigned __int64 * gprData, double * fprData) Line 174 C++ xul.dll!SharedStub() Line 61 Unknown xul.dll!NS_CreateServicesFromCategory(const char * aCategory, nsISupports * aOrigin, const char * aObserverTopic, const char16_t *) Line 816 C++ xul.dll!nsXREDirProvider::DoStartup() Line 1025 C++ xul.dll!XREMain::XRE_mainRun() Line 4518 C++ xul.dll!XREMain::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 4814 C++ xul.dll!mozilla::BootstrapImpl::XRE_main(int argc, char * * argv, const mozilla::BootstrapConfig & aConfig) Line 49 C++ firefox.exe!NS_internal_main(int argc, char * * argv, char * * envp) Line 304 C++ firefox.exe!wmain(int argc, wchar_t * * argv) Line 132 C++ firefox.exe!__scrt_common_main_seh() Line 283 C++ kernel32.dll!BaseThreadInitThunk() Unknown ntdll.dll!RtlUserThreadStart() Unknown This is worryingly similar to the stack in comment 52.
There have been 58 failures in the last 7 days Summary: TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start) Fails on -windows10-64 opt/pgo/debug amounting to 50 out of 58 failures as follows: -31 on windows10-64 PGO -17 on windows10-64 OPT -2 on windows10-64 DEBUG -windows10-64-nightly OPT - 2 failures -windows10-64-qr OPT - 3 failures -windows7-32-nightly OPT - 3 failures Here is a recent log life and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=167830234&lineNumber=4729 00:44:17 INFO - Application command: Z:\task_1520987284\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\genericworker\appdata\local\temp\tmpxlb8z_.mozrunner 00:44:17 INFO - runtests.py | Application pid: 6860 00:44:17 INFO - TEST-INFO | started process GECKO(6860) 00:44:17 INFO - GECKO(6860) | 1520988257528 Marionette DEBUG Received observer notification profile-after-change 00:47:17 INFO - runtests.py | Waiting for browser... 00:50:27 INFO - Buffered messages finished 00:50:27 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output 00:50:27 INFO - 752 ERROR Force-terminating active process(es). 00:50:27 INFO - Determining child pids from psutil... 00:50:27 INFO - [] 00:50:27 INFO - Found child pids: set([]) 00:50:27 INFO - Killing process: 6860 00:50:27 INFO - TEST-INFO | started process screenshot 00:50:27 INFO - TEST-INFO | screenshot: exit 0 00:50:27 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 6860: error 5 00:50:27 INFO - psutil found pid 6860 dead 00:50:27 INFO - TEST-INFO | Main app process: exit 1 00:50:27 INFO - Buffered messages finished 00:50:27 ERROR - 753 ERROR TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 1 00:50:27 INFO - runtests.py | Application ran for: 0:06:10.614000 00:50:27 INFO - zombiecheck | Reading PID log: c:\users\genericworker\appdata\local\temp\tmpkswgr0pidlog 00:50:27 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/D3QPLiWRSsuSQuOxVCfe_g/artifacts/public/build/target.crashreporter-symbols.zip 00:50:30 INFO - mozcrash Copy/paste: Z:\task_1520987284\build\win32-minidump_stackwalk.exe c:\users\genericworker\appdata\local\temp\tmpxlb8z_.mozrunner\minidumps\6bacc407-1bca-44cd-9dbe-b16a905e8598.dmp c:\users\genericworker\appdata\local\temp\tmp9pckdd 00:50:36 INFO - mozcrash Saved minidump as Z:\task_1520987284\build\blobber_upload_dir\6bacc407-1bca-44cd-9dbe-b16a905e8598.dmp 00:50:36 WARNING - PROCESS-CRASH | automation.py | application crashed [None] 00:50:36 INFO - Crash dump filename: c:\users\genericworker\appdata\local\temp\tmpxlb8z_.mozrunner\minidumps\6bacc407-1bca-44cd-9dbe-b16a905e8598.dmp
this appears to be windows only- is it possibly that we need to wait a bit longer between cycling the browser? I have been investigating some reftest issues on startup with the new taskcluster worker and noticed that adding a time.sleep(10) after shutting down the browser seems to resolve almost all my issues. In fact most of the failures in bug 1444642 are the same thing as this bug- I assume the root cause is the OS doing cleanup or not releasing a file handle until some other OS level operations are done.
Maybe this is due to bug 1438009? At least I know that there is a race condition right now on Windows in reporting the exit code of the browser.
Thanks Joel. I've pushed a sleep() patch to try to see if it helps. Meanwhile, let's keep in mind that we have multiple confirmed causes of these hangs, including bug 1373279 (I just commented there), more cases of SetupDiGetClassDevsW (I need to follow-up) and bug 1440719 (waiting on needinfos).
(In reply to Geoff Brown [:gbrown] from comment #60) > (In reply to Edmund Wong (:ewong) from comment #58) > > :gbrown, is this what I"m experiencing with bug 1427969? > > You are getting the same "application timed out" message obviously, which > really just means the harness has not seen any test output for a long time > -- either the browser is hung or a test is hung. There are lots of ways to > hang. Note that most of the specific issues identified so far in this bug - > bug 1437420, bug 1419191, and comment 54 - only happen on Windows. We are > also seeing hangs on Linux, which mostly haven't been investigated yet. This is a part of the log from a seamonkey linux debug mochitest: GECKO(11732) | --DOMWINDOW == 14 (0xe06e7800) [pid = 11732] [serial = 14] [outer = (nil)] [url = about:blank] GECKO(11732) | --DOMWINDOW == 13 (0xdd52b800) [pid = 11732] [serial = 10] [outer = (nil)] [url = about:blank] GECKO(11732) | --DOMWINDOW == 12 (0xe627d800) [pid = 11732] [serial = 2] [outer = (nil)] [url = about:blank] [11707, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80004002: file /builds/slave/c-cen-t-lnx-dbg/build/mozilla/toolkit/components/resistfingerprinting/nsRFPService.cpp, line 680 runtests.py | Waiting for browser... Buffered messages finished TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output 0 ERROR Force-terminating active process(es). Determining child pids from psutil... [] Found child pids: set([]) Killing process: 11732 Failed to retrieve MOZ_UPLOAD_DIR env var psutil found pid 11732 dead runtests.py | Failed to get app exit code - running/crashed? TEST-INFO | Main app process: exit 0 runtests.py | Application ran for: 0:06:25.944597 zombiecheck | Reading PID log: /tmp/tmpk9YuVWpidlog mozcrash Copy/paste: /builds/slave/test/tools/breakpad/linux/minidump_stackwalk /tmp/tmpiXbfRY.mozrunner/minidumps/0282b304-a7b8-cc35-0d86-7b0de34acf62.dmp /builds/slave/test/build/symbols mozcrash Saved minidump as /builds/slave/test/minidumps/0282b304-a7b8-cc35-0d86-7b0de34acf62.dmp mozcrash Saved app info as /builds/slave/test/minidumps/0282b304-a7b8-cc35-0d86-7b0de34acf62.extra PROCESS-CRASH | Main app process exited normally | application crashed [@ linux-gate.so + 0x430] Crash dump filename: /tmp/tmpiXbfRY.mozrunner/minidumps/0282b304-a7b8-cc35-0d86-7b0de34acf62.dmp Operating system: Linux 0.0.0 Linux 2.6.32-220.7.1.el6.x86_64 #1 SMP Wed Mar 7 00:52:02 GMT 2012 x86_64 CPU: x86 GenuineIntel family 6 model 42 stepping 7 4 CPUs Crash reason: SIGABRT Crash address: 0x2da1 Thread 0 (crashed) 0 linux-gate.so + 0x430 eip = 0x00565430 esp = 0xffec3db4 ebp = 0xffec3dd8 ebx = 0xe5eb6420 esi = 0x00000000 edi = 0x0029bff4 eax = 0xfffffffc ecx = 0x00000003 edx = 0xffffffff efl = 0x00000296 Found by: given as instruction pointer in context 1 libglib-2.0.so.0.3400.3 + 0x53dba eip = 0x079a5dbb esp = 0xffec3de0 ebp = 0xffec3df8 Found by: previous frame's frame pointer 2 libxul.so + 0x25be5d7 eip = 0xf3d3b5d8 esp = 0xffec3e00 ebp = 0xffec3e38 Found by: previous frame's frame pointer 3 libglib-2.0.so.0.3400.3 + 0x462bb eip = 0x079982bc esp = 0xffec3e40 ebp = 0xffec3e98 Found by: previous frame's frame pointer 4 libglib-2.0.so.0.3400.3 + 0x463b2 eip = 0x079983b3 esp = 0xffec3ea0 ebp = 0xffec3ec8 Found by: previous frame's frame pointer 5 libxul.so + 0x25bc5cf eip = 0xf3d395d0 esp = 0xffec3ed0 ebp = 0xffec3ee8 Found by: previous frame's frame pointer 6 libxul.so + 0x258d56d eip = 0xf3d0a56e esp = 0xffec3ef0 ebp = 0xffec3f18 Found by: previous frame's frame pointer 7 libxul.so + 0x2590f8c eip = 0xf3d0df8d esp = 0xffec3f20 ebp = 0xffec3f58 Found by: previous frame's frame pointer 8 libxul.so + 0x81acb4 eip = 0xf1f97cb5 esp = 0xffec3f60 ebp = 0xffec44a8 Found by: previous frame's frame pointer 9 libxul.so + 0x820c88 eip = 0xf1f9dc89 esp = 0xffec44b0 ebp = 0xffec44e8 Found by: previous frame's frame pointer 10 libxul.so + 0xc13eee eip = 0xf2390eef esp = 0xffec44f0 ebp = 0xffec4538 Found by: previous frame's frame pointer 11 libxul.so + 0xbe6bb1 eip = 0xf2363bb2 esp = 0xffec4540 ebp = 0xffec4568 Found by: previous frame's frame pointer 12 libxul.so + 0xbe6bd8 eip = 0xf2363bd9 esp = 0xffec4570 ebp = 0xffec4598 Found by: previous frame's frame pointer 13 libxul.so + 0x258c242 eip = 0xf3d09243 esp = 0xffec45a0 ebp = 0xffec45c8 Found by: previous frame's frame pointer 14 libxul.so + 0x34c7673 eip = 0xf4c44674 esp = 0xffec45d0 ebp = 0xffec45e8 Found by: previous frame's frame pointer 15 libxul.so + 0x3561458 eip = 0xf4cde459 esp = 0xffec45f0 ebp = 0xffec4718 Found by: previous frame's frame pointer 16 libxul.so + 0x3561c00 eip = 0xf4cdec01 esp = 0xffec4720 ebp = 0xffec4788 Found by: previous frame's frame pointer 17 libxul.so + 0x3561f72 eip = 0xf4cdef73 esp = 0xffec4790 ebp = 0xffec48e8 Found by: previous frame's frame pointer 18 seamonkey-bin + 0x543c eip = 0x0804d43d esp = 0xffec48f0 ebp = 0xffec5928 Found by: previous frame's frame pointer 19 seamonkey-bin + 0x4d55 eip = 0x0804cd56 esp = 0xffec5930 ebp = 0xffec5978 Found by: previous frame's frame pointer 20 libc-2.12.so + 0x16ce5 eip = 0x00126ce6 esp = 0xffec5980 ebp = 0xffec5a08 Found by: previous frame's frame pointer 21 seamonkey-bin + 0x503c eip = 0x0804d03d esp = 0xffec5a10 ebp = 0x00000000 Found by: previous frame's frame pointer 22 seamonkey-bin + 0x4ca7 eip = 0x0804cca8 esp = 0xffec5a14 ebp = 0x00000000 Found by: stack scanning 23 seamonkey-bin + 0x27d9f eip = 0x0806fda0 esp = 0xffec5a20 ebp = 0x00000000 Found by: stack scanning 24 seamonkey-bin + 0x27d8f eip = 0x0806fd90 esp = 0xffec5a24 ebp = 0x00000000 Found by: stack scanning 25 ld-2.12.so + 0xf25f eip = 0x006c0260 esp = 0xffec5a28 ebp = 0x00000000 Found by: stack scanning I'm just super confused with the "PROCESS-CRASH" then "Main app process exited normally" then "application crashed". Might someone be able to clarify if bug 1427969 is an application hang issue or a process crash issue? or is it a log that shows two different bugs? Any clarifications appreciated.
(In reply to Edmund Wong (:ewong) from comment #79) > This is a part of the log from a seamonkey linux debug mochitest: > runtests.py | Waiting for browser... > Buffered messages finished > TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 > seconds with no output > 0 ERROR Force-terminating active process(es). > Killing process: 11732 > runtests.py | Failed to get app exit code - running/crashed? > TEST-INFO | Main app process: exit 0 > Crash reason: SIGABRT > Might someone be able to clarify if bug 1427969 is an application hang issue > or > a process crash issue? or is it a log that shows two different bugs? > > Any clarifications appreciated. I quoted only the parts of your log that are relevant to answer your question. It shows a timeout and then killing the process to force generating a crash dump. It's a hang issue, not a crash.
(In reply to Geoff Brown [:gbrown] from comment #78) > Thanks Joel. I've pushed a sleep() patch to try to see if it helps. Comparing: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1aec580486b41835e15e46eba69200d1d54a98d2 https://treeherder.mozilla.org/#/jobs?repo=try&revision=aabdb4e412306f72b8ef5b8e695ffc83923d1082 I see no improvement with the sleep(). I understand Joel did his own test and found a possible minor improvement. Let's continue to pursue existing leads: > Meanwhile, let's keep in mind that we have multiple confirmed causes of > these hangs, including bug 1373279 (I just commented there), more cases of > SetupDiGetClassDevsW (I need to follow-up) and bug 1440719 (waiting on > needinfos).
There have been 168 failures in the last week. It fails almost exclusively on windows10-64 / opt & pgo, with for 4 failures also for windows10-64-nightly / opt. Log file and snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=170954514&lineNumber=4551 02:51:40 INFO - TEST-INFO | started process GECKO(5992) 02:51:40 INFO - GECKO(5992) | 1522291900840 Marionette DEBUG Received observer notification profile-after-change 02:54:40 INFO - runtests.py | Waiting for browser... 02:57:50 INFO - Buffered messages finished 02:57:50 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output 02:57:50 INFO - 508 ERROR Force-terminating active process(es). 02:57:50 INFO - Determining child pids from psutil... 02:57:50 INFO - [] 02:57:50 INFO - Found child pids: set([]) 02:57:50 INFO - Killing process: 5992 02:57:50 INFO - TEST-INFO | started process screenshot 02:57:50 INFO - TEST-INFO | screenshot: exit 0 02:57:51 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 5992: error 5 02:57:51 INFO - psutil found pid 5992 dead 02:57:51 INFO - TEST-INFO | Main app process: exit 1 02:57:51 INFO - Buffered messages finished 02:57:51 ERROR - 509 ERROR TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 1 02:57:51 INFO - runtests.py | Application ran for: 0:06:10.740000 02:57:51 INFO - zombiecheck | Reading PID log: c:\users\genericworker\appdata\local\temp\tmplkiurupidlog 02:57:51 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/EFbKvJ_DQOm8uQRowPpyEw/artifacts/public/build/target.crashreporter-symbols.zip 02:57:54 INFO - mozcrash Copy/paste: Z:\task_1522289749\build\win32-minidump_stackwalk.exe c:\users\genericworker\appdata\local\temp\tmpg8tsmh.mozrunner\minidumps\5d4c02bb-abfe-4133-8a4e-18895b2e0fc8.dmp c:\users\genericworker\appdata\local\temp\tmpmm9ug6 02:58:01 INFO - mozcrash Saved minidump as Z:\task_1522289749\build\blobber_upload_dir\5d4c02bb-abfe-4133-8a4e-18895b2e0fc8.dmp 02:58:01 WARNING - PROCESS-CRASH | automation.py | application crashed [None] Waiting for updates on: 1373279
Flags: needinfo?(gbrown)
(In reply to Geoff Brown [:gbrown] from comment #81) > Let's continue to pursue existing leads: > > > Meanwhile, let's keep in mind that we have multiple confirmed causes of > > these hangs, including bug 1373279 (I just commented there), more cases of > > SetupDiGetClassDevsW (I need to follow-up) and bug 1440719 (waiting on > > needinfos). Bug 1440719 was resolved last night. I'll be watching to see how much of a difference it makes. Still waiting for bug 1373279. I have started but still need to investigate more cases of SetupDiGetClassDevsW.
Flags: needinfo?(gbrown)
(In reply to Geoff Brown [:gbrown] from comment #90) > Bug 1440719 was resolved last night. I'll be watching to see how much of a > difference it makes. That didn't make a big difference. :( Watching for any effects from the Windows update now: https://groups.google.com/forum/#!topic/mozilla.tools.taskcluster/UHeTl1ZMEy8 Still waiting for bug 1373279. I have started but still need to investigate more cases of SetupDiGetClassDevsW.
We should check later today if the switch to generic worker on Windows (see bug 1399401) had an impact here.
OS: Unspecified → Windows
Hardware: Unspecified → All
Henrik, I haven't seen any more occurrences since the upgrade in bug 1399401, so things are looking good at the moment. The last one I saw was https://treeherder.mozilla.org/logviewer.html#?job_id=171656558&repo=mozilla-inbound&lineNumber=1635 which is still from generic-worker 8.3.0.
Pete, I hope you are talking about the overlapping windows in the screenshots? I agree that is no longer present. Great! I still see failures after the upgrade, maybe even at the same rate. (Still hopeful bug 1373279 will make a big difference.) https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=172128314&lineNumber=7702
(In reply to Geoff Brown [:gbrown] from comment #98) > I still see failures after the upgrade, maybe even at the same rate. (Still > hopeful bug 1373279 will make a big difference.) > > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla- > central&job_id=172128314&lineNumber=7702 This job still has a command window open, but no Firefox window is visible. Also sadly no Windows taskbar. Pete, do you really think that the problem is fixed?
Flags: needinfo?(pmoore)
(In reply to Henrik Skupin (:whimboo) from comment #99) > (In reply to Geoff Brown [:gbrown] from comment #98) > > https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=172128314&lineNumber=7702 > > This job still has a command window open, but no Firefox window is visible. > Also sadly no Windows taskbar. Pete, do you really think that the problem is > fixed? We have code to explicitly remove the Windows taskbar which was requested in bug 1373178: Windows 10: https://github.com/mozilla-releng/OpenCloudConfig/blob/902d24ab34314b314ec98a18512cfc12ed4fcfa0/userdata/Configuration/GenericWorker/task-user-init-win10.cmd#L14-L16 Windows 7: https://github.com/mozilla-releng/OpenCloudConfig/blob/902d24ab34314b314ec98a18512cfc12ed4fcfa0/userdata/Configuration/GenericWorker/task-user-init-win7.cmd#L14-L24 Can you clarify if this is no longer required?
Flags: needinfo?(pmoore) → needinfo?(hskupin)
Note, probably the test harnesses should take care of removing taskbar, or showing it, to get this logic out of OpenCloudConfig and into gecko. It doesn't require special privileges - the steps can be run as the task user itself.
I would argue that the test harnesses should not be modifying the OS config- it should fail if the config isn't acceptable (like screen resolution, etc.) For all of the history we have assumed the machine would be setup properly and crafted our tests to work in that environment- just having a better list of what custom modifications we require goes a long way towards making our tests better.
(In reply to Pete Moore [:pmoore][:pete] from comment #100) > We have code to explicitly remove the Windows taskbar which was requested in > bug 1373178: Interesting. I nearly forgot about this bug. So if hiding the taskbar stops the notifications from popping up, it's clearly a must have for I would surely say all test suites which interact with the UI of Firefox/Thunderbird/etc. I can second that open notifications irritate tests which require focus of the Firefox window. Pete, but what about the command window, which I'm actually more interested in here?
Flags: needinfo?(hskupin) → needinfo?(pmoore)
OF reports from the last 3 weeks only show Windows 10 64bit as being affected by this hang. (In reply to Henrik Skupin (:whimboo) from comment #103) > Interesting. I nearly forgot about this bug. So if hiding the taskbar stops > the notifications from popping up, it's clearly a must have for I would > surely say all test suites which interact with the UI of > Firefox/Thunderbird/etc. I can second that open notifications irritate tests > which require focus of the Firefox window. So how would that look like if you actually want to use a one-click loaner? Is the taskbar hidden/disabled in those cases too? If yes, how could someone enable it? Shouldn't it be better to find and disable all notifications which could cause such a focus issue instead? I really miss the taskbar for inspecting per test job screenshot which tasks are currently running on that machine.
Flags: needinfo?(rthijssen)
OS: Windows → Windows 10
Hardware: All → x86_64
if you hover the mouse pointer over the lower edge of the screen, the task bar should pop up. the task bar isn't disabled it's just set to "auto hide". on a loaner, you can also reverse the registry setting which was used to enable auto hide. - Win 7: HKCU:SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StuckRects2 - Win 10: HKCU:SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StuckRects3
Flags: needinfo?(rthijssen)
Ok, that sounds better, but do you also have a reply to the second part of the comment?
Flags: needinfo?(rthijssen)
if you mean "Shouldn't it be better to find and disable all notifications which could cause such a focus issue instead?" then the answer is probably yes. but there are probably quite an infinite number of notifications that could pop up and steal focus. disabling them will depend on what they are and what triggered them. when we receive bugs about specific notification popups, we do generally implement some mechanism to disable them. usually a registry setting or some configuration change. if the registry setting or configuration to disable a pop-up is user specific (eg: HKEY_CURRENT_USER), then it should be disabled in tree rather than in OCC. note that the mechanism to disable the task bar is also in-tree: https://dxr.mozilla.org/mozilla-central/rev/cb2bb4a7c1c39a017aa8c24c30b62a5f48922677/testing/mozharness/configs/unittests/win_taskcluster_unittest.py#278-287
Flags: needinfo?(rthijssen)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #102) > I would argue that the test harnesses should not be modifying the OS config- > it should fail if the config isn't acceptable (like screen resolution, etc.) > For all of the history we have assumed the machine would be setup properly > and crafted our tests to work in that environment- just having a better list > of what custom modifications we require goes a long way towards making our > tests better. Joel, Luckily for us in this case, it turns out that hiding the task bar doesn't require any system setup, it is a just a user-specific change that requires no special privileges and can be run as the task user during task execution. This makes it a great candidate for the test harness, and it looks like from Rob's comment 110 that this is already the case. One issue with that though that might bite us - on Windows 7 I experienced that the HKCU:SOFTWARE\Microsoft\Windows\CurrentVersion\Explorer\StuckRects2 key didn't exist at all, so I couldn't just amend it, I actually needed to create it[1] with a bunch of default values. On Windows 10, we can just take the existing value, and flip some bits. In any case, this logic should probably be moved in-tree like we have for win10, and individual tests should be able to decide if they want to hide the task bar or not. Henrik, are you able to RDP onto a task to see the focus issue firsthand to get an idea what might be stealing focus etc? I haven't quite understood how the focus issue might relate to a hidden task bar. --- [1] https://github.com/mozilla-releng/OpenCloudConfig/blob/dbd82051b22f3cd336153d257f5509bf838652d0/userdata/Configuration/GenericWorker/task-user-init-win7.cmd#L19
Flags: needinfo?(pmoore)
Flags: needinfo?(jmaher)
Flags: needinfo?(hskupin)
I doubt that this related to focus given that there is not a Firefox window open at all since we have the new generic worker. The only thing I want to know what this command window is: https://taskcluster-artifacts.net/Ft5Tp50aTKu1tqdRT1uY1w/0/public/test_info/mozilla-test-fail-screenshot_jzic_2.png
Flags: needinfo?(hskupin) → needinfo?(pmoore)
are we able to realistically set these registry values from a restricted user? One difference is mozharness vs testharness- possibly we need to ensure that mozharness does more of the os setup/config in addition to downloading builds/artifacts and setting up the virtualenv.
Flags: needinfo?(jmaher)
(In reply to Henrik Skupin (:whimboo) from comment #112) > I doubt that this related to focus given that there is not a Firefox window > open at all since we have the new generic worker. The only thing I want to > know what this command window is: > > https://taskcluster-artifacts.net/Ft5Tp50aTKu1tqdRT1uY1w/0/public/test_info/ > mozilla-test-fail-screenshot_jzic_2.png Henrik, if you can get the test to keep running (so the cmd window doesn't get closed), you should be able to discover what process created that window with the Process Explorer / Process Monitor or similar tools we have on the workers. See https://wiki.mozilla.org/ReleaseEngineering/How_To/Self_Provision_a_TaskCluster_Windows_Instance#For_generic-worker_10.5.0_onwards for getting a loaner.
Flags: needinfo?(pmoore) → needinfo?(hskupin)
(In reply to Pete Moore [:pmoore][:pete] from comment #114) > Henrik, if you can get the test to keep running (so the cmd window doesn't > get closed), you should be able to discover what process created that window > with the Process Explorer / Process Monitor or similar tools we have on the > workers. The only command window I see while the tests are running is the one which seems to get used by the harness, and is open all the time while the tests are running. Could we make it happen that this window is minimized? Pete, once the tests have finished, how can I start the tests again? The docs are missing that and I don't want to create a new one click loaner all the time.
Flags: needinfo?(hskupin) → needinfo?(pmoore)
Geoff, I don't see that we have browser log output available during startup which might help us to better investigate the problem. So something I would consider is to enable the tracing logs for Marionette, which means that only the very first run will show the output but no test afterward. Given that the hang happens during the first start, this seems to be a good trade off. Shall we enable it?
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #120) > Given that > the hang happens during the first start, this seems to be a good trade off. > Shall we enable it? Are you saying this about a specific failure log? I just looked at a few failure logs and it seems the hang is during startup when restarting the browser for another folder of tests, after having already ran a random number of tests from other folders.
You are correct, sorry. In that case we will have it indeed for each and every start of Firefox when Marionette is enabled. But we can stop the trace log right after the extension gets installed. Maybe we could just trigger a try build with that and hope we run into such a situation.
I'm confused: Haven't we already enabled Marionette trace logs, at least for mochitests? https://treeherder.mozilla.org/logviewer.html#?job_id=175000549&repo=mozilla-inbound&lineNumber=1902-1916 10:24:50 INFO - Application command: Z:\task_1524390763\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\task_1524390763\appdata\local\temp\tmpqpvsc8.mozrunner 10:24:50 INFO - runtests.py | Application pid: 2960 10:24:50 INFO - TEST-INFO | started process GECKO(2960) 10:24:51 INFO - GECKO(2960) | 1524392691355 Marionette DEBUG Received observer notification profile-after-change 10:24:51 INFO - GECKO(2960) | 1524392691471 Marionette DEBUG Received observer notification command-line-startup 10:24:51 INFO - GECKO(2960) | 1524392691471 Marionette DEBUG Received observer notification nsPref:changed 10:24:52 INFO - GECKO(2960) | 1524392692403 Marionette DEBUG Received observer notification sessionstore-windows-restored 10:24:52 INFO - GECKO(2960) | 1524392692632 Marionette DEBUG New connections are accepted 10:24:52 INFO - GECKO(2960) | 1524392692633 Marionette INFO Listening on port 2828 10:24:52 INFO - GECKO(2960) | 1524392692695 Marionette DEBUG Accepted connection 0 from 127.0.0.1:49991 10:24:52 INFO - GECKO(2960) | 1524392692708 Marionette DEBUG Closed connection 0 10:24:52 INFO - GECKO(2960) | 1524392692709 Marionette DEBUG Accepted connection 1 from 127.0.0.1:49992 10:24:52 INFO - GECKO(2960) | 1524392692732 Marionette TRACE 1 -> [0,1,"WebDriver:NewSession",{}] 10:24:52 INFO - GECKO(2960) | 1524392692759 Marionette DEBUG Register listener.js for window 4294967297 10:24:52 INFO - GECKO(2960) | 1524392692768 Marionette TRACE 1 <- [1,1,null,{"sessionId":"0e264648-1daf-45bb-a01b-c32af3d5625e","capabilities":{"browserName":"firefox","browserVersion":"61.0a ... 4390763\\AppData\\Local\\Temp\\tmpqpvsc8.mozrunner","moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true}}] We don't see that output for the 370 second timeout case because Marionette is inactive, waiting for the browser to start - right? If there is more logging that can be enabled, I'd do that on try, at least initially: It should not be too hard to reproduce this on try. I'm waiting for / hoping for action on bug 1373279.
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #117) > (In reply to Pete Moore [:pmoore][:pete] from comment #114) > > Henrik, if you can get the test to keep running (so the cmd window doesn't > > get closed), you should be able to discover what process created that window > > with the Process Explorer / Process Monitor or similar tools we have on the > > workers. > > The only command window I see while the tests are running is the one which > seems to get used by the harness, and is open all the time while the tests > are running. Could we make it happen that this window is minimized? > > Pete, once the tests have finished, how can I start the tests again? The > docs are missing that and I don't want to create a new one click loaner all > the time. There is a script for each command in the task payload. They are named like this: Z:\task_XXXXXXX\command_000000_wrapper.bat Z:\task_XXXXXXX\command_000001_wrapper.bat Z:\task_XXXXXXX\command_000002_wrapper.bat .... You should just be able to run them. Sorry for delayed reply!
Flags: needinfo?(pmoore) → needinfo?(hskupin)
(In reply to Geoff Brown [:gbrown] from comment #124) > I'm confused: Haven't we already enabled Marionette trace logs, at least for > mochitests? Oh, the problem might be then that we don't see anything because Marionette isn't handling the Firefox instance but the mochitest runner is doing that. I'm not sure where the difference actually is and what exactly enables the tracing output in the former case before Marionette is enabled. Andreas, what do I miss? > We don't see that output for the 370 second timeout case because Marionette > is inactive, waiting for the browser to start - right? In parts of the cases we fail before Marionette gets enabled. In others we fail later most likely in the mochitest extension which should start the tests but fails somehow. > I'm waiting for / hoping for action on bug 1373279. Not sure if that will help given that we see this warning in a lot of places also for passing tests. But who knows...
Flags: needinfo?(hskupin) → needinfo?(ato)
Hm, maybe this is `browser.dom.window.dump.enabled` which is missing in this case? Or does the Firefox log output goes somewhere else.
(In reply to Henrik Skupin (:whimboo) from comment #128) > (In reply to Geoff Brown [:gbrown] from comment #124) > > I'm confused: Haven't we already enabled Marionette trace logs, at least for > > mochitests? > > Oh, the problem might be then that we don't see anything because Marionette > isn't handling the Firefox instance but the mochitest runner is doing that. > I'm not sure where the difference actually is and what exactly enables the > tracing output in the former case before Marionette is enabled. Andreas, > what do I miss? I’m confused too. The log pasted in https://bugzilla.mozilla.org/show_bug.cgi?id=1414495#c124 appears to contain trace-level logging entries?
Flags: needinfo?(ato)
testing locally on win10, I ran into a similar problem- here I found we didn't cleanup properly (I assume ssltunnel was a zombie process as I couldn't remove that file from my objdir nor did I see it in task manager). Could it be that when taskcluster hits a timeout (aborted task) we don't cleanup everything properly. I see the failures here are win10 and mochitest specific- could we somehow determine which jobs ran previously? could we enforce a reboot on a device when taskcluster aborts a task?
Flags: needinfo?(gbrown)
Joel, sorry, I'm not sure I understand what happened to you locally: Was it a 370 second no output timeout? Did it happen on browser startup? Was "runtests.py | Waiting for browser..." logged immediately before the timeout? I checked several recent logs and found the associated system-info.logs show they were freshly rebooted. Of course, there are multiple browser starts in each task, and many (but not all) of the failures in this bug happen on the 2nd or later browser start, so there is a possibility of an interaction with processes launched previously...but I have no evidence for that.
Flags: needinfo?(gbrown)
Time for a new summary: This bug tracks intermittent test failures in which the test harness times out after 370 seconds waiting for firefox to start. In recent months, the great majority of known failures are on Windows 10 opt and pgo on various flavors of mochitest. Instead of seeing: 19:20:31 INFO - TEST-INFO | started process GECKO(672) 19:20:31 INFO - GECKO(672) | 1525288831693 Marionette DEBUG Received observer notification profile-after-change 19:20:31 INFO - GECKO(672) | 1525288831788 Marionette DEBUG Received observer notification command-line-startup 19:20:31 INFO - GECKO(672) | 1525288831788 Marionette DEBUG Received observer notification nsPref:changed 19:20:32 INFO - GECKO(672) | 1525288832579 Marionette DEBUG Received observer notification sessionstore-windows-restored ... The failure logs have: 19:20:41 INFO - TEST-INFO | started process GECKO(4768) 19:20:42 INFO - GECKO(4768) | 1525288842452 Marionette DEBUG Received observer notification profile-after-change 19:23:42 INFO - runtests.py | Waiting for browser... 19:26:52 INFO - Buffered messages finished 19:26:52 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output Note that Marionette does not receive command-line-startup, so never completes its startup. Not receiving command-line-startup is consistent with a hang early in gecko startup. Screenshots taken at the time of the timeout show only a cmd.exe window -- no firefox window -- possibly consistent with a startup hang prior to window initialization. After the timeout, the harness kills the browser and almost always collects a minidump, but the logged crash report is nonsensical. However, loading the minidump in Visual Studio produces a sensible stack, as seen in comments like comment 66. Comments 66 to 69 suggest a hang in CVRPathRegistry_Public::BLoadFromFile: we hope that bug 1373279 will address that, but there's been no recent activity there. Comments 70 and 71 suggest a hang in SetupDiGetClassDevsW; that should also be investigated further. Want to help? There are lots of failure logs here, each with an associated .dmp file. Please load some .dmp files in Visual Studio and document the stacks found: Is it in BLoadFromFile? SetupDiGetClassDevsW? Somewhere else??
Whiteboard: [stockwell infra] → [stockwell infra] [see summary at comment 139]
Something I want to note here... Even those failures exist for a long time we ONLY see them for builds on any development branch but never on beta, release, or esr. So I wonder if a feature which is only enabled in Nightly causes us to fail here. It would be good to know which feature can affect the Firefox startup process, and given our former analysis this might be a GFX related feature? Milan, mind telling us which GFX features are only enabled by default for Nightly on Windows 10?
Flags: needinfo?(milaninbugzilla)
Some possibilities: We build (but not enable) WebRender. Perhaps the presence of the WebRender code, and maybe some parts that are still run? We do some checkerboarding telemetry by default. We restart the GPU process a few extra times, if it crashes or fails.
Flags: needinfo?(milaninbugzilla) → needinfo?(bugmail)
(In reply to Milan Sreckovic [:milan] (needinfo for best results) from comment #144) > Some possibilities: > We build (but not enable) WebRender. Perhaps the presence of the WebRender > code, and maybe some parts that are still run? I would say it's unlikely that building webrender would cause this sort of startup hang. But it should be easy enough to check, we can do a try push with --enable-webrender=false in the mozconfig and see if the problem still occurs. > We do some checkerboarding telemetry by default. The telemetry is recorded on all channels, but there is some extra recording of state that happens on Nightly. However that should only come into play after APZ is initialized, which in turn should only be after the gfx startup is done. The recording can be disabled by setting apz.record_checkerboarding to false. > We restart the GPU process a few extra times, if it crashes or fails. Of the three options this seems the most plausible to me, but I don't have any concrete idea of why GPU process restarting might trigger the hang. Again though, it should be easy enough to check as we just need to change the value of the layers.gpu-process.max_restarts pref - it's 1 on release but set to 3 in all.js inside an ifdef for NIGHTLY. I would suggest doing a try push with all three of these changes and retrigger to see if we still have the problem. If the problem is resolved we can bisect the changes to see which one is causing it and go from there. Otherwise we'll have to look elsewhere.
Flags: needinfo?(bugmail)
(In reply to Henrik Skupin (:whimboo) from comment #143) > Something I want to note here... Even those failures exist for a long time > we ONLY see them for builds on any development branch but never on beta, > release, or esr. Another factor to consider: We don't seem to run mochitests on "Windows 10 x64 opt" or "Windows 10 x64 pgo" on beta/release/esr ... only on "Windows 10 x64 Nightly opt".
Assignee: gbrown → nobody
Btw the fix for "Unable to read VR Path Registry ..." didn't had an effect here.
Depends on: 1461353
I am disappointed that startup hangs continue with about the same frequency after the fix for bug 1373279. However, there is good news: in about 30% of the cases, we are now getting a more sensible stack reported in the crash report -- bug 1461353.
Over the last 7 days there are 127 failures. These happen on windows10-64, windows10-64-nightly Here is the most recent log example: https://treeherder.mozilla.org/logviewer.html#?job_id=179573601&repo=autoland&lineNumber=5350 00:28:15 INFO - runtests.py | Waiting for browser... 00:31:26 INFO - Buffered messages finished 00:31:26 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output 00:31:26 INFO - 572 ERROR Force-terminating active process(es). 00:31:26 INFO - Determining child pids from psutil... 00:31:26 INFO - [] 00:31:26 INFO - Found child pids: set([]) 00:31:26 INFO - Killing process: 6404 00:31:26 INFO - TEST-INFO | started process screenshot 00:31:26 INFO - TEST-INFO | screenshot: exit 0 00:31:26 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 6404: error 5 00:31:26 INFO - psutil found pid 6404 dead 00:31:26 INFO - TEST-INFO | Main app process: exit 1 00:31:26 INFO - Buffered messages finished 00:31:26 ERROR - 573 ERROR TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 1 00:31:26 INFO - runtests.py | Application ran for: 0:06:10.768000 00:31:26 INFO - zombiecheck | Reading PID log: c:\users\task_1526947977\appdata\local\temp\tmp1k35cwpidlog 00:31:26 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/IBjTCj0cRS2e_mboBfcWeg/artifacts/public/build/target.crashreporter-symbols.zip 00:31:30 INFO - mozcrash Copy/paste: Z:\task_1526947977\build\win32-minidump_stackwalk.exe c:\users\task_1526947977\appdata\local\temp\tmpipgryd.mozrunner\minidumps\ecfcc165-a3ff-4e79-9dbf-40f49d88879c.dmp c:\users\task_1526947977\appdata\local\temp\tmpcejvwg 00:31:35 INFO - mozcrash Saved minidump as Z:\task_1526947977\build\blobber_upload_dir\ecfcc165-a3ff-4e79-9dbf-40f49d88879c.dmp 00:31:35 WARNING - PROCESS-CRASH | automation.py | application crashed [None]
Flags: needinfo?(gbrown)
Flags: needinfo?(gbrown)
Update: There have been 44 failures in the last week. All the failures occur on the windows10-64 platform on all the build types except debug. Recent log file: https://treeherder.mozilla.org/logviewer.html#?job_id=181826361&repo=autoland&lineNumber=5208
one random theory worth considering. While working on cleaning up test-verify, I noticed many failures. before: https://treeherder.mozilla.org/#/jobs?repo=try&revision=25f98606954d6a785becca862c2d056abea8e8b7&filter-searchStr=windows%20x64%20test-verify%20opt after skip-if=verify on many tests: https://treeherder.mozilla.org/#/jobs?repo=try&revision=183e10a5f6103373ae2d098e0a4e24add9413e86&filter-searchStr=windows%20x64%20test-verify%20opt what I noticed is we are running ~2000 jobs on each of the pushes. the before has many more of these startup failures than the after. Some initial questions about this problem in general: 1) are the timeouts on the first time we launch the browser or a secondary one (could we analyze many of the failures on integration branches) 2) is it possible that certain tests are affecting the state of the browser/OS which causes the startup failure on subsequent launches 3) not sure how to do this programatically- possibly some simply log parsing trick we could use?
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #178) > 1) are the timeouts on the first time we launch the browser or a secondary > one (could we analyze many of the failures on integration branches) Both: The normal (not test-verify) failures reported here are often on a secondary launch of the browser, but it is not hard to find cases where the startup hang occurs on the very first start of the browser: https://treeherder.mozilla.org/logviewer.html#?job_id=183091128&repo=autoland&lineNumber=1177
(In reply to Geoff Brown [:gbrown] from comment #154) > I am disappointed that startup hangs continue with about the same frequency > after the fix for bug 1373279. However, there is good news: in about 30% of > the cases, we are now getting a more sensible stack reported in the crash > report -- bug 1461353. Crash reports consistent with bug 1461353 continue and may be the most frequent failure mode: https://treeherder.mozilla.org/logviewer.html#?job_id=183053190&repo=mozilla-inbound&lineNumber=1822 https://treeherder.mozilla.org/logviewer.html#?job_id=183012047&repo=mozilla-inbound&lineNumber=12237 https://treeherder.mozilla.org/logviewer.html#?job_id=183003540&repo=autoland&lineNumber=2298
Recent discussion in bug 1461353 suggests the Windows 10 hangs are related to the BackgroundHangMonitor. Comparing https://treeherder.mozilla.org/#/jobs?repo=try&revision=6fbd7b2164e560948314fcc7cdd0b37d2925ddb1 https://treeherder.mozilla.org/#/jobs?repo=try&revision=255a4bf742be01631249345592fa5a8de9367de5 seems to confirm this: I cannot reproduce the startup hang when the BackgroundHangMonitor is disabled. It's great to see progress in bug 1461353 -- waiting to see how that works out...
Geoff, lately I see lots of those hangs also reported on Linux. That happens here on this bug directly, but also for Marionette restart tests. Maybe something changed for Linux in the last weeks?
Update: In the last 7 days, there have been 157 failures. Occurrences per platform and build type: - 122 failures on windows10-64 / pgo, opt - 26 failures on linux64 / asan, opt, pgo, debug - 4 failures on linux32 / opt, debug - 2 failures on windows10-64-qr / opt - 2 failures on windows10-64-nightly / opt - 1 failure on linux64-qr / debug Here is a recent relevant log file and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?job_id=186098012&repo=autoland&lineNumber=3466 101:42:57 INFO - runtests.py | Server pid: 8032 01:42:57 INFO - runtests.py | Websocket server pid: 2584 01:42:57 INFO - runtests.py | websocket/process bridge pid: 6196 01:42:57 INFO - runtests.py | SSL tunnel pid: 5576 01:42:57 INFO - runtests.py | Running with e10s: True 01:42:57 INFO - runtests.py | Running tests: start. 01:42:57 INFO - Application command: Z:\task_1530580643\build\application\firefox\firefox.exe -marionette -foreground -profile c:\users\task_1530580643\appdata\local\temp\tmpr9km34.mozrunner 01:42:57 INFO - runtests.py | Application pid: 7268 01:42:57 INFO - TEST-INFO | started process GECKO(7268) 01:42:58 INFO - GECKO(7268) | 1530582178454 Marionette DEBUG Received observer notification profile-after-change 01:45:58 INFO - runtests.py | Waiting for browser... 01:49:08 INFO - Buffered messages finished 01:49:08 WARNING - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output 01:49:08 INFO - 961 ERROR Force-terminating active process(es). 01:49:08 INFO - Determining child pids from psutil... 01:49:08 INFO - [] 01:49:08 INFO - Found child pids: set([]) 01:49:08 INFO - Killing process: 7268 01:49:08 INFO - TEST-INFO | started process screenshot 01:49:08 INFO - TEST-INFO | screenshot: exit 0 01:49:08 INFO - mozcrash kill_pid(): wait failed (-1) terminating pid 7268: error 5 01:49:08 INFO - psutil found pid 7268 dead 01:49:08 INFO - TEST-INFO | Main app process: exit 1 01:49:08 INFO - Buffered messages finished 01:49:08 ERROR - 962 ERROR TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 1 01:49:08 INFO - runtests.py | Application ran for: 0:06:10.819000e
Andreas, can you take a look if the surge here is a regression from Webdriver or not? Thank you in advance. Most new failures are on Linux 64: https://treeherder.mozilla.org/logviewer.html#?job_id=187604233&repo=mozilla-inbound&lineNumber=4228 [task 2018-07-11T13:56:19.764Z] 13:56:19 INFO - TEST-INFO | started process GECKO(4595) [task 2018-07-11T13:56:21.947Z] 13:56:21 INFO - GECKO(4595) | 1531317381937 Marionette DEBUG Received observer notification profile-after-change [task 2018-07-11T13:56:22.367Z] 13:56:22 INFO - GECKO(4595) | 1531317382353 Marionette DEBUG Received observer notification command-line-startup [task 2018-07-11T13:56:22.368Z] 13:56:22 INFO - GECKO(4595) | 1531317382353 Marionette DEBUG Received observer notification nsPref:changed [task 2018-07-11T13:56:22.368Z] 13:56:22 INFO - GECKO(4595) | 1531317382354 Marionette DEBUG Init aborted (running=false, enabled=true, finalUIStartup=false) [task 2018-07-11T13:56:23.092Z] 13:56:23 INFO - GECKO(4595) | 1531317383081 Marionette DEBUG Received observer notification toplevel-window-ready [task 2018-07-11T13:56:25.917Z] 13:56:25 INFO - GECKO(4595) | 1531317385908 Marionette DEBUG Received observer notification sessionstore-windows-restored [task 2018-07-11T13:56:25.917Z] 13:56:25 INFO - GECKO(4595) | 1531317385908 Marionette DEBUG Waiting for delayed startup... [task 2018-07-11T13:56:26.978Z] 13:56:26 INFO - GECKO(4595) | 1531317386971 Marionette DEBUG Waiting for startup tests... [task 2018-07-11T13:56:27.380Z] 13:56:27 INFO - GECKO(4595) | 1531317387364 Marionette INFO Listening on port 2828 [task 2018-07-11T13:56:27.380Z] 13:56:27 INFO - GECKO(4595) | 1531317387365 Marionette DEBUG Remote service is active [task 2018-07-11T13:56:27.441Z] 13:56:27 INFO - GECKO(4595) | 1531317387433 Marionette DEBUG Accepted connection 0 from 127.0.0.1:47366 [task 2018-07-11T13:56:27.479Z] 13:56:27 INFO - GECKO(4595) | 1531317387477 Marionette DEBUG Closed connection 0 [task 2018-07-11T13:56:27.489Z] 13:56:27 INFO - GECKO(4595) | 1531317387482 Marionette DEBUG Accepted connection 1 from 127.0.0.1:47368 [task 2018-07-11T13:56:27.573Z] 13:56:27 INFO - GECKO(4595) | 1531317387566 Marionette TRACE 1 -> [0,1,"WebDriver:NewSession",{}] [task 2018-07-11T13:56:27.912Z] 13:56:27 INFO - GECKO(4595) | 1531317387908 Marionette DEBUG [2147483649] Frame script loaded [task 2018-07-11T13:56:27.953Z] 13:56:27 INFO - GECKO(4595) | 1531317387945 Marionette DEBUG [2147483649] Frame script registered [task 2018-07-11T13:56:27.996Z] 13:56:27 INFO - GECKO(4595) | 1531317387992 Marionette TRACE 1 <- [1,1,null,{"sessionId":"c17f5179-9586-4d69-86e3-f60e57502a7f","capabilities":{"browserName":"firefox","browserVersion":"63.0a ... ssID":4595,"moz:profile":"/tmp/tmpBhzAcU.mozrunner","moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true}}] [task 2018-07-11T13:56:28.074Z] 13:56:28 INFO - GECKO(4595) | 1531317388065 Marionette TRACE 1 -> [0,2,"Addon:Install",{"path":"/tmp/tmpZg_lzW.zip","temporary":false}] [task 2018-07-11T13:56:28.204Z] 13:56:28 INFO - GECKO(4595) | 1531317388198 Marionette TRACE 1 <- [1,2,null,{"value":"special-powers@mozilla.org"}] [task 2018-07-11T13:56:28.451Z] 13:56:28 INFO - GECKO(4595) | 1531317388440 Marionette TRACE 1 -> [0,3,"Addon:Install",{"path":"/tmp/tmpt4asp7.zip","temporary":false}] [task 2018-07-11T13:56:28.560Z] 13:56:28 INFO - GECKO(4595) | 1531317388552 Marionette TRACE 1 <- [1,3,null,{"value":"mochikit@mozilla.org"}] [task 2018-07-11T13:56:28.585Z] 13:56:28 INFO - GECKO(4595) | 1531317388582 Marionette TRACE 1 -> [0,4,"Marionette:GetContext",{}] [task 2018-07-11T13:56:28.588Z] 13:56:28 INFO - GECKO(4595) | 1531317388584 Marionette TRACE 1 <- [1,4,null,{"value":"content"}] [task 2018-07-11T13:56:28.613Z] 13:56:28 INFO - GECKO(4595) | 1531317388607 Marionette TRACE 1 -> [0,5,"Marionette:SetContext",{"value":"chrome"}] [task 2018-07-11T13:56:28.619Z] 13:56:28 INFO - GECKO(4595) | 1531317388609 Marionette TRACE 1 <- [1,5,null,{"value":null}] [task 2018-07-11T13:56:28.624Z] 13:56:28 INFO - GECKO(4595) | 1531317388620 Marionette TRACE 1 -> [0,6,"WebDriver:ExecuteScript",{"scriptTimeout":null,"newSandbox":true,"args":[{"testUrl":"about:blank","flavor":"browser-chr ... new CustomEvent(\"mochitest-load\", {\"detail\": [flavor, url]});\nwin.dispatchEvent(ev);","sandbox":"default","line":1755}] [task 2018-07-11T13:56:28.640Z] 13:56:28 INFO - GECKO(4595) | 1531317388630 Marionette TRACE 1 <- [1,6,null,{"value":null}] [task 2018-07-11T13:56:28.650Z] 13:56:28 INFO - GECKO(4595) | 1531317388644 Marionette TRACE 1 -> [0,7,"Marionette:SetContext",{"value":"content"}] [task 2018-07-11T13:56:28.650Z] 13:56:28 INFO - GECKO(4595) | 1531317388645 Marionette TRACE 1 <- [1,7,null,{"value":null}] [task 2018-07-11T13:56:28.652Z] 13:56:28 INFO - GECKO(4595) | 1531317388647 Marionette TRACE 1 -> [0,8,"WebDriver:DeleteSession",{}] [task 2018-07-11T13:56:28.657Z] 13:56:28 INFO - runtests.py | Waiting for browser... [task 2018-07-11T13:56:28.658Z] 13:56:28 INFO - GECKO(4595) | 1531317388652 Marionette TRACE 1 <- [1,8,null,{"value":null}] [task 2018-07-11T13:56:28.661Z] 13:56:28 INFO - GECKO(4595) | 1531317388658 Marionette DEBUG Closed connection 1 [task 2018-07-11T14:02:38.666Z] 14:02:38 INFO - Buffered messages finished [task 2018-07-11T14:02:38.666Z] 14:02:38 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output [task 2018-07-11T14:02:38.668Z] 14:02:38 ERROR - Force-terminating active process(es).
Severity: normal → major
Flags: needinfo?(ato)
(In reply to Sebastian Hengst [:aryx] (needinfo on intermittent or backout) from comment #210) > Andreas, can you take a look if the surge here is a regression > from Webdriver or not? Thank you in advance. First of all, I seem unable to reproduce this instability in a local linux64 build. Since mochitest/browserchrome tests use Marionette directly, we can exclude that any changes to geckodriver would have impacted this surge. I looked at the commit log for testing/geckodriver and there have only been two changes in the time frame 9-11 July: * https://bugzilla.mozilla.org/show_bug.cgi?id=1452490 removes the marionetteScriptFinished global for injected scripts and I verified that we have no further mentions of this type in central. The change itself to testing/marionette/evaluate.js is minimal. * https://bugzilla.mozilla.org/show_bug.cgi?id=1447977 reimplements the cyclic object check for returned values from WebDriver:{ExecuteScript,ExecuteAsyncScript}. The new implementation uses a stack to keep track of seen elements in collections (arrays/objects). If we are leaking the stack this could slow down Firefox to the point where the application could seem unresponsive. However I see no evidence this is the case from glancing at the Marionette trace logs.
Flags: needinfo?(ato)
Depends on: 1475264
I have landed temporarily landed https://bugzilla.mozilla.org/show_bug.cgi?id=1475264 on inbound to see if this mitigates the recent intermittent spike.
Hi Henrik, can you look into this after your return, please?
Flags: needinfo?(hskupin)
From a quick look Marionette correctly installs the SpecialPowers extension, and runs some Javascript commands. But as it looks like the execute script call is not run correctly, which results in the Mochitest extension not being initialized and running. As such it times out. Given that this problem is highly reproducible those days in CI I wonder why no-one actually tried to investigate this problem with a one-click loaner on Taskcluster. This is pretty easy, and most likely should have reproduced it immediately. Something else which might help here is to check for all the commits as landed between July 9th and July 10th, which also hasn't been done yet. I will see what I can do soon to get it investigated.
I was not able yet to create any one-click loaner for a Mochitest job, because of the following error: > 04:54:13 INFO - Home directory not accessible: Permission denied > 04:54:13 INFO - No PulseAudio daemon running, or not running as session daemon. > 04:54:13 INFO - Traceback (most recent call last): > 04:54:13 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3115, in <module> > 04:54:13 INFO - sys.exit(cli()) > 04:54:13 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3111, in cli > 04:54:13 INFO - return run_test_harness(parser, options) > 04:54:13 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 3090, in run_test_harness > 04:54:13 INFO - result = runner.runTests(options) > 04:54:13 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2611, in runTests > 04:54:13 INFO - res = self.runMochitests(options, tests_in_manifest) > 04:54:13 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2406, in runMochitests > 04:54:13 INFO - result = self.doTests(options, testsToRun) > 04:54:13 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 2669, in doTests > 04:54:13 INFO - devices = findTestMediaDevices(self.log) > 04:54:13 INFO - File "/builds/worker/workspace/build/tests/mochitest/runtests.py", line 780, in findTestMediaDevices > 04:54:13 INFO - subprocess.check_call([pacmd, 'set-default-sink', 'null']) > 04:54:13 INFO - File "/usr/lib/python2.7/subprocess.py", line 541, in check_call > 04:54:13 INFO - raise CalledProcessError(retcode, cmd) > 04:54:13 INFO - subprocess.CalledProcessError: Command '['/usr/bin/pacmd', 'set-default-sink', 'null']' returned non-zero exit status 1 > 04:54:13 ERROR - Return code: 1 Geoff or Joel, do I miss something? I just selected option 1 in the run wizard which usually should work fine. Did something change for the PulseAudio setup of Mochitests on Linux in the last 2 weeks?
Flags: needinfo?(hskupin) → needinfo?(gbrown)
It looks like this problem goes back in time at least until early June. So I believe that some specific settings in Taskcluster workers for Linux caused this problem. I will file a separate bug to get this fixed.
Flags: needinfo?(gbrown)
Btw here is the script as executed by Marionette to initialize Mochitest: https://dxr.mozilla.org/mozilla-central/rev/085cdfb90903d4985f0de1dc7786522d9fb45596/testing/mochitest/start_desktop.js#15-18 So the custom `mochitest-load` event is not correctly dispatched, or the listener as setup by the Mochitest extension failed to receive it, or to handle its data: https://dxr.mozilla.org/mozilla-central/rev/085cdfb90903d4985f0de1dc7786522d9fb45596/testing/mochitest/bootstrap.js#78-95
So finally I was able to reproduce this failure but only once with a one-click loaner. Not sure why all successive runs with debug output never showed this problem again. Maybe it's a caching issue, and I should better add those debug lines locally and submit to try. That way it might be easier to catch.
Actually I was wondering what happened to the bootstrap.js file for the mochitest extension. After a pull it was no longer existent. Now all the code is in api.js. As such I checked its history, and I think that bug 1451513 caused the recent spike of this intermittent failure. I think we should try to backout the webextension port of mochitest and see if that helps.
Flags: needinfo?(aswan)
Flags: needinfo?(aryx.bugmail)
I pushed a try build with some debug logging which may help us to narrow down the problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f84d59ed985dd6b3a3d5cd45ee66f7aeb71680c2
Backed out bug 1451513 from central.
Flags: needinfo?(aryx.bugmail)
(In reply to OrangeFactor Robot from comment #238) > 1175 failures in 760 pushes (1.546 failures/push) were associated with this > bug in the last 7 days. > > This is the #1 most frequent failure this week. Just to note that since July 27th we are back at a normal rate for that failure.
Flags: needinfo?(aswan)
Update: There have been 33 failures in the last week. Failures per platform: - linux64 / asan: 24 - windows10-64-qr / opt: 4 - windows10-64 / pgo & opt: 3 - osx-10-10 / opt: 1 - windows10-64-msvc / opt: 1 Recent relevant log file and snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?job_id=195648175&repo=mozilla-inbound&lineNumber=2995 [task 2018-08-24T00:11:48.260Z] 00:11:48 INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpVhR9bQ.mozrunner [task 2018-08-24T00:11:48.269Z] 00:11:48 INFO - runtests.py | Application pid: 4623 [task 2018-08-24T00:11:48.269Z] 00:11:48 INFO - TEST-INFO | started process GECKO(4623) [task 2018-08-24T00:14:48.356Z] 00:14:48 INFO - runtests.py | Waiting for browser... [task 2018-08-24T00:17:58.284Z] 00:17:58 INFO - Buffered messages finished [task 2018-08-24T00:17:58.286Z] 00:17:58 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output [task 2018-08-24T00:17:58.287Z] 00:17:58 ERROR - Force-terminating active process(es). [task 2018-08-24T00:17:58.289Z] 00:17:58 INFO - Determining child pids from psutil... [task 2018-08-24T00:17:58.294Z] 00:17:58 INFO - [4627, 4641] [task 2018-08-24T00:17:58.295Z] 00:17:58 INFO - Found child pids: set([4641, 4627]) [task 2018-08-24T00:17:58.295Z] 00:17:58 INFO - Killing process: 4641 [task 2018-08-24T00:17:58.296Z] 00:17:58 INFO - TEST-INFO | started process screentopng [task 2018-08-24T00:17:58.847Z] 00:17:58 INFO - TEST-INFO | screentopng: exit 0 [task 2018-08-24T00:17:58.848Z] 00:17:58 INFO - Can't trigger Breakpad, just killing process
(In reply to Tiberius Oros[:tiberius_oros] from comment #243) > Update: > There have been 33 failures in the last week. > > Failures per platform: > - linux64 / asan: 24 Hey Geoff, please note the increase of failures on Linux. As it looks like there might be a recent regression especially for ASAN builds which is causing this hang. Windows looks much better those days.
Flags: needinfo?(gbrown)
Depends on: 1487243
(In reply to Henrik Skupin (:whimboo) from comment #244) > Hey Geoff, please note the increase of failures on Linux. As it looks like > there might be a recent regression especially for ASAN builds which is > causing this hang. Windows looks much better those days. Thanks. I'll see if I can figure something out in bug 1487243.
Flags: needinfo?(gbrown)
There is a total of 121 failures in the last 7 days, on linux64, OSX and win10-64-qr. This is being worked on as per Bug 1487243.
In the last 7 days, there are 74 failures associated with this bug. They occur mostly on linux64 (asan) and the rest of them on windows10-64 (opt, pgo), windows10-64-qr (opt). Recent failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=198353046&repo=autoland&lineNumber=3645
This bug failed 78 times in the last 7 days. These occur mostly on Linux64 and a few failures on OSX and windows10-64-qr. On asan, debug, opt and pgo build types. Log: https://treeherder.mozilla.org/logviewer.html#?job_id=201363178&repo=autoland&lineNumber=2649 INFO - runtests.py | Application pid: 2708 [task 2018-09-25T06:24:43.659Z] 06:24:43 INFO - TEST-INFO | started process GECKO(2708) [task 2018-09-25T06:27:43.729Z] 06:27:43 INFO - runtests.py | Waiting for browser... [task 2018-09-25T06:30:53.669Z] 06:30:53 INFO - Buffered messages finished [task 2018-09-25T06:30:53.671Z] 06:30:53 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output [task 2018-09-25T06:30:53.673Z] 06:30:53 ERROR - Force-terminating active process(es). [task 2018-09-25T06:30:53.675Z] 06:30:53 INFO - Determining child pids from psutil... [task 2018-09-25T06:30:53.680Z] 06:30:53 INFO - [2712, 2726] [task 2018-09-25T06:30:53.682Z] 06:30:53 INFO - Found child pids: set([2712, 2726]) [task 2018-09-25T06:30:53.683Z] 06:30:53 INFO - Killing process: 2712 [task 2018-09-25T06:30:53.684Z] 06:30:53 INFO - TEST-INFO | started process screentopng [task 2018-09-25T06:30:54.208Z] 06:30:54 INFO - TEST-INFO | screentopng: exit 0 [task 2018-09-25T06:30:54.210Z] 06:30:54 INFO - Can't trigger Breakpad, just killing process [task 2018-09-25T06:31:24.248Z] 06:31:24 INFO - failed to kill pid 2712 after 30s [task 2018-09-25T06:31:24.249Z] 06:31:24 INFO - Killing process: 2726 [task 2018-09-25T06:31:24.250Z] 06:31:24 INFO - Not taking screenshot here: see the one that was previously logged [task 2018-09-25T06:31:24.251Z] 06:31:24 INFO - Can't trigger Breakpad, just killing process [task 2018-09-25T06:31:24.260Z] 06:31:24 INFO - psutil found pid 2726 dead [task 2018-09-25T06:31:25.049Z] 06:31:25 INFO - psutil found pid 2726 dead [task 2018-09-25T06:31:25.050Z] 06:31:25 INFO - psutil found pid 2712 dead [task 2018-09-25T06:31:25.051Z] 06:31:25 INFO - Killing process: 2708 [task 2018-09-25T06:31:25.053Z] 06:31:25 INFO - Not taking screenshot here: see the one that was previously logged [task 2018-09-25T06:31:25.054Z] 06:31:25 INFO - Can't trigger Breakpad, just killing process [task 2018-09-25T06:31:25.080Z] 06:31:25 INFO - psutil found pid 2708 dead [task 2018-09-25T06:31:25.081Z] 06:31:25 INFO - psutil found pid 2708 dead [task 2018-09-25T06:31:25.088Z] 06:31:25 WARNING - runtests.py | Failed to get app exit code - running/crashed? gbrown: Can you please take a look at this bug?
Flags: needinfo?(gbrown)
(In reply to Dorel Luca [:dluca] from comment #267) > This bug failed 78 times in the last 7 days. These occur mostly on Linux64 > and a few failures on OSX and windows10-64-qr. On asan, debug, opt and pgo > build types. Most of the failures which are actually ASAN builds are wrongly classified against this bug while they should have been marked for bug 1487243. As such this bug got a high failure rate for OF. Dorel, please inform the whole sheriffing team so that the failures will be classified correctly. Thanks.
Flags: needinfo?(gbrown) → needinfo?(dluca)
Whimboo: Ok, Sheriffs have been informed.
Flags: needinfo?(dluca)
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start) → Intermittent TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start) NOT FOR LINUX ASAN
Andreea, this bug doesn't track the ASAN build hangs, but it was bug 1487243. I don't know why the summary has been updated for this bug, which caused false classification.
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start) NOT FOR LINUX ASAN → Intermittent TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start)
Note, that people might not have read/seen the `NOT` in the summary. So lets revert the summary now given that the other bug is fixed.
No longer blocks: 1471264
No longer blocks: 1413560
No longer blocks: 1458725
(In reply to Henrik Skupin (:whimboo) from comment #271) > Andreea, this bug doesn't track the ASAN build hangs, but it was bug > 1487243. I don't know why the summary has been updated for this bug, which > caused false classification. The bug summary was changed specifically so that people would not use this for asan failures, since the asan failures bug is not displayed as a suggestion in treeherder :( i'm sorry it did not work.
Note that in the last two weeks this failure mostly occurs for Quantum Render builds on Windows 10 64bit.
No longer blocks: 1439869
This still has a high occurrence rate, summary/details available in Comment 139.
I checked the recent logs, and the browser is up and running, and Marionette installed the mochitest extension: > 15:25:49 INFO - GECKO(2083) | 1542842749958 Marionette TRACE 1 -> [0,8,"WebDriver:DeleteSession",{}] > 15:25:49 INFO - GECKO(2083) | 1542842749959 Marionette TRACE 1 <- [1,8,null,{"value":null}] > 15:25:49 INFO - runtests.py | Waiting for browser... > 15:25:49 INFO - GECKO(2083) | 1542842749961 Marionette DEBUG Closed connection 1 > 15:31:59 INFO - Buffered messages finished > 15:31:59 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output But then I assume the mochitest extension doesn't start the execution of the tests. Geoff, could this be the same problem with the focus as we had for the reftest extension?
Flags: needinfo?(gbrown)
(In reply to Henrik Skupin (:whimboo) from comment #284) That's from one of the recent osx failures from the look of it: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=213269235&repo=autoland&lineNumber=1013 Did you notice the screenshots, saying "Please sign in with your Apple ID". That's not helpful. Filing a bug...
Flags: needinfo?(gbrown)
Depends on: 1510413
(In reply to Henrik Skupin (:whimboo) from comment #284) > But then I assume the mochitest extension doesn't start the execution of the > tests. Geoff, could this be the same problem with the focus as we had for > the reftest extension? I'm not sure about that. Maybe? I don't know much about focus in mochitest.
(In reply to Geoff Brown [:gbrown] from comment #286) > (In reply to Henrik Skupin (:whimboo) from comment #284) > > But then I assume the mochitest extension doesn't start the execution of the > > tests. Geoff, could this be the same problem with the focus as we had for > > the reftest extension? > > I'm not sure about that. Maybe? I don't know much about focus in mochitest. I think about bug 1453895 here. I'm fairly sure that Mochitest has a similar behavior. Also it requests focus and waits until it gets it. (In reply to Geoff Brown [:gbrown] from comment #285) > Did you notice the screenshots, saying "Please sign in with your Apple ID". > That's not helpful. Filing a bug... No I didn't. But it would manifest a focus issue.
(In reply to Henrik Skupin (:whimboo) from comment #287) > (In reply to Geoff Brown [:gbrown] from comment #285) > > Did you notice the screenshots, saying "Please sign in with your Apple ID". > > That's not helpful. Filing a bug... > > No I didn't. But it would manifest a focus issue. Or not. This actually looks like a hang during the final setup steps after an upgrade.
OS: Windows 10 → All
Hardware: x86_64 → All

Over the last 7 days there are 40 failures on this bug. These happen on linux32, windows7-32-nightly, windows7-32, windows10-64-nightly, windows10-64, osx-10-10, linux64, linux32-nightly.

Here is the latest log: https://treeherder.mozilla.org/logviewer.html#?job_id=226091513&repo=mozilla-inbound

19:16:50 INFO - runtests.py | Waiting for browser...
19:16:50 INFO - GECKO(1979) | 1549336610472 Marionette DEBUG Closed connection 1
19:17:03 INFO - GECKO(1979) | 2019-02-04 19:17:03.734 firefox[1979:8794] Persistent UI failed to open file file:///Users/cltbld/Library/Saved%20Application%20State/org.mozilla.nightly.savedState/window_1.data: No such file or directory (2)
19:23:13 INFO - Buffered messages finished
19:23:13 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
19:23:13 ERROR - Force-terminating active process(es).
19:23:13 INFO - Determining child pids from psutil...
19:23:13 INFO - [1980, 1981, 1982]
19:23:13 INFO - ==> process 1979 launched child process 1980
19:23:13 INFO - ==> process 1979 launched child process 1981
19:23:13 INFO - ==> process 1979 launched child process 1982
19:23:13 INFO - Found child pids: set([1980, 1981, 1982])
19:23:13 INFO - Killing process: 1980
19:23:13 INFO - TEST-INFO | started process screencapture
19:23:13 INFO - TEST-INFO | screencapture: exit 0
19:23:13 INFO - Killing process: 1981
19:23:13 INFO - Not taking screenshot here: see the one that was previously logged
19:23:13 INFO - Killing process: 1982
19:23:13 INFO - Not taking screenshot here: see the one that was previously logged
19:23:14 INFO - psutil found pid 1981 dead
19:23:14 INFO - psutil found pid 1982 dead
19:23:14 INFO - psutil found pid 1980 dead
19:23:14 INFO - Killing process: 1979
19:23:14 INFO - Not taking screenshot here: see the one that was previously logged
19:23:14 INFO - psutil found pid 1979 dead
19:23:14 WARNING - runtests.py | Failed to get app exit code - running/crashed?
19:23:14 INFO - TEST-INFO | Main app process: exit 0
19:23:14 INFO - runtests.py | Application ran for: 0:06:26.240797
19:23:14 INFO - zombiecheck | Reading PID log: /var/folders/6v/fxwclqb90fs15bn9_v03004m00000w/T/tmphbpc_Gpidlog
19:23:14 INFO - ==> process 1979 launched child process 1980
19:23:14 INFO - ==> process 1979 launched child process 1981
19:23:14 INFO - ==> process 1979 launched child process 1982
19:23:14 INFO - zombiecheck | Checking for orphan process with PID: 1980
19:23:14 INFO - zombiecheck | Checking for orphan process with PID: 1981
19:23:14 INFO - zombiecheck | Checking for orphan process with PID: 1982
19:23:14 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/HguGXdkkQJm-Ot32mgMGYA/artifacts/public/build/target.crashreporter-symbols.zip
19:23:23 INFO - mozcrash Copy/paste: /Users/cltbld/tasks/task_1549336442/build/macosx64-minidump_stackwalk /var/folders/6v/fxwclqb90fs15bn9_v03004m00000w/T/tmpLTnx2e.mozrunner/minidumps/99003FA2-F4A4-4F4F-A1EF-B42CF2820517.dmp /var/folders/6v/fxwclqb90fs15bn9_v03004m00000w/T/tmpefF3E3
19:23:40 INFO - mozcrash Saved minidump as /Users/cltbld/tasks/task_1549336442/build/blobber_upload_dir/99003FA2-F4A4-4F4F-A1EF-B42CF2820517.dmp
19:23:40 INFO - mozcrash Saved app info as /Users/cltbld/tasks/task_1549336442/build/blobber_upload_dir/99003FA2-F4A4-4F4F-A1EF-B42CF2820517.extra
19:23:41 INFO - PROCESS-CRASH | Main app process exited normally | application crashed [@ libsystem_platform.dylib + 0x5cce]
19:23:41 INFO - Crash dump filename: /var/folders/6v/fxwclqb90fs15bn9_v03004m00000w/T/tmpLTnx2e.mozrunner/minidumps/99003FA2-F4A4-4F4F-A1EF-B42CF2820517.dmp
19:23:41 INFO - Operating system: Mac OS X
19:23:41 INFO - 10.10.5 14F27

Flags: needinfo?(gbrown)

There was a big spike in failures Feb 4/5, which appears to have been resolved by https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=2c1f0275ef803e8142d993138b08b7cd696ea850 -- expired certificates. osx screenshots confirm that firefox is alerting for invalid certificates.

Earlier on-going failures are primarily Windows -- maybe still bug 1461353?

Flags: needinfo?(gbrown)

There are 31 failures over the past 7 days happening on windows10-64 pgo/opt/.

Recent Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=230238991&repo=autoland&lineNumber=5742

Recent Log snippet:

05:16:48 INFO - TEST-OK | dom/base/test/test_youtube_flash_embed.html | took 192ms
05:16:48 INFO - TEST-START | Shutdown
05:16:48 INFO - Passed: 43991
05:16:48 INFO - Failed: 0
05:16:48 INFO - Todo: 102
05:16:48 INFO - Mode: e10s
05:16:48 INFO - Slowest: 140117ms - /tests/dom/base/test/test_timeout_clamp.html
05:16:48 INFO - SimpleTest FINISHED
05:16:48 INFO - TEST-INFO | Ran 1 Loops
05:16:48 INFO - SimpleTest FINISHED
05:16:48 INFO - GECKO(5840) | [Parent 1736, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
05:16:48 INFO - GECKO(5840) | [Parent 1736, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
05:16:48 INFO - GECKO(5840) | [Child 2188, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
05:16:48 INFO - GECKO(5840) | [Parent 1736, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
05:16:48 INFO - GECKO(5840) | [Child 4468, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
05:16:48 INFO - GECKO(5840) | [Child[Parent 1736, Gecko_IOThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
05:16:48 INFO - GECKO(5840) | [Child 6280, Chrome_ChildThread] WARNING: pipe error: 109: file z:/build/build/src/ipc/chromium/src/chrome/common/ipc_channel_win.cc, line 341
05:16:48 INFO - GECKO(5840) | [Child 6280, Chrome_ChildThrea[Parent 1736, Gecko_IOThread] WARNING: file z:/build/build/src/ipc/chromium/src/base/process_util_win.cc, line 160
05:16:48 INFO - JavaScript error: Z:\task_1551069064\build\tests\bin\components\httpd.js, line 3311: NS_ERROR_NOT_AVAILABLE:
05:16:48 INFO - GECKO(5840) | 1551071808866 Marionette TRACE Received observer notification xpcom-will-shutdown
05:16:48 INFO - GECKO(5840) | 1551071808866 Marionette INFO Stopped listening on port 2828
05:16:48 INFO - GECKO(5840) | 1551071808866 Marionette DEBUG Remote service is inactive
05:16:48 INFO - TEST-INFO | Main app process: exit 0
05:16:48 INFO - runtests.py | Application ran for: 0:04:59.017000
05:16:48 INFO - zombiecheck | Reading PID log: c:\users\task_1551069064\appdata\local\temp\tmpfj_lifpidlog
05:16:48 INFO - ==> process 1736 launched child process 6280 ("Z:\task_1551069064\build\application\firefox\firefox.exe" -contentproc --channel="1736.0.1235215167\1689496918" -childID 1 -isForBrowser -prefsHandle 2632 -prefMapHandle 2628 -prefsLen 1 -prefMapSize 194466 -parentBuildID 20190225030302 -greomni "Z:\task_1551069064\build\application\firefox\omni.ja" -appomni "Z:\task_1551069064\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1551069064\build\application\firefox\browser" - 1736 "\.\pipe\gecko-crash-server-pipe.1736" 2452 tab)
05:16:48 INFO - ==> process 1736 launched child process 1112 ("Z:\task_1551069064\build\application\firefox\firefox.exe" -contentproc --channel="1736.6.389835714\1704013093" -childID 2 -isForBrowser -prefsHandle 2856 -prefMapHandle 1776 -prefsLen 132 -prefMapSize 194466 -parentBuildID 20190225030302 -greomni "Z:\task_1551069064\build\application\firefox\omni.ja" -appomni "Z:\task_1551069064\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1551069064\build\application\firefox\browser" - 1736 "\.\pipe\gecko-crash-server-pipe.1736" 2204 tab)
05:16:48 INFO - ==> process 1736 launched child process 2188 ("Z:\task_1551069064\build\application\firefox\firefox.exe" -contentproc --channel="1736.12.1251021330\1395388970" -childID 3 -isForBrowser -prefsHandle 1620 -prefMapHandle 2228 -prefsLen 132 -prefMapSize 194466 -parentBuildID 20190225030302 -greomni "Z:\task_1551069064\build\application\firefox\omni.ja" -appomni "Z:\task_1551069064\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1551069064\build\application\firefox\browser" - 1736 "\.\pipe\gecko-crash-server-pipe.1736" 2232 tab)
05:16:48 INFO - ==> process 1736 launched child process 4468 ("Z:\task_1551069064\build\application\firefox\firefox.exe" -contentproc --channel="1736.18.1122998081\2082216341" -childID 4 -isForBrowser -prefsHandle 4940 -prefMapHandle 4924 -prefsLen 9255 -prefMapSize 194466 -parentBuildID 20190225030302 -greomni "Z:\task_1551069064\build\application\firefox\omni.ja" -appomni "Z:\task_1551069064\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1551069064\build\application\firefox\browser" - 1736 "\.\pipe\gecko-crash-server-pipe.1736" 4824 tab)
05:16:48 INFO - ==> process 1736 launched child process 3560 ("Z:\task_1551069064\build\application\firefox\plugin-container.exe" --channel="1736.24.305114370\1943496949" "C:\Users\task_1551069064\AppData\Local\Temp\tmp8s_aqg.mozrunner\plugins\nptest.dll" "C:\Users\task_1551069064\AppData\LocalLow\Mozilla\Temp-{f0265b23-de91-4f7b-a2fb-b53793984787}" "Z:\task_1551069064\AppData\Roaming\Adobe\" -greomni "Z:\task_1551069064\build\application\firefox\omni.ja" -appomni "Z:\task_1551069064\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1551069064\build\application\firefox\browser" - 1736 "\.\pipe\gecko-crash-server-pipe.1736" 3648 plugin)
05:16:48 INFO - zombiecheck | Checking for orphan process with PID: 6280
05:16:48 INFO - zombiecheck | Checking for orphan process with PID: 1112
05:16:48 INFO - zombiecheck | Checking for orphan process with PID: 4468
05:16:48 INFO - zombiecheck | Checking for orphan process with PID: 2188
05:16:48 INFO - zombiecheck | Checking for orphan process with PID: 3560
05:16:48 INFO - Stopping web server
05:16:48 INFO - Stopping web socket server
05:16:48 INFO - Stopping ssltunnel
05:16:48 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
05:16:48 INFO - runtests.py | Running tests: end.
05:16:49 INFO - Buffered messages finished
05:16:49 INFO - Running manifest: dom\battery\test\mochitest.ini
05:16:49 INFO - Z:\task_1551069064\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
05:16:49 INFO - MochitestServer : launching [u'Z:\task_1551069064\build\tests\bin\xpcshell.exe', '-g', 'Z:\task_1551069064\build\application\firefox', '-f', 'Z:\task_1551069064\build\tests\bin\components\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\users\\task_1551069064\\appdata\\local\\temp\\tmpfd9oyb.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'Z:\task_1551069064\build\tests\mochitest\server.js']
05:16:49 INFO - runtests.py | Server pid: 756
05:16:49 INFO - runtests.py | Websocket server pid: 4596
05:16:49 INFO - runtests.py | SSL tunnel pid: 6756
05:16:49 INFO - runtests.py | Running with scheme: http
05:16:49 INFO - runtests.py | Running with e10s: True
05:16:49 INFO - runtests.py | Running with serviceworker_e10s: False
05:16:49 INFO - runtests.py | Running with socketprocess_e10s: False
05:16:49 INFO - runtests.py | Running tests: start.
05:16:49 INFO -
05:16:50 INFO - Application command: Z:\task_1551069064\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\task_1551069064\appdata\local\temp\tmpfd9oyb.mozrunner
05:16:50 INFO - runtests.py | Application pid: 8824
05:16:50 INFO - TEST-INFO | started process GECKO(8824)
05:16:50 INFO - GECKO(8824) | *** You are running in headless mode.
05:16:50 INFO - GECKO(8824) | 1551071810484 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: mozillaAddons
05:16:50 INFO - GECKO(8824) | 1551071810484 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: resource://pdf.js/
05:16:50 INFO - GECKO(8824) | 1551071810485 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
05:16:50 INFO - GECKO(8824) | 1551071810617 Marionette TRACE Received observer notification profile-after-change
05:16:50 INFO - GECKO(8824) | 1551071810639 Marionette TRACE Received observer notification toplevel-window-ready
05:16:50 INFO - GECKO(8824) | 1551071810694 Marionette TRACE Received observer notification command-line-startup
05:16:50 INFO - GECKO(8824) | 1551071810694 Marionette TRACE Received observer notification nsPref:changed
05:16:50 INFO - GECKO(8824) | 1551071810694 Marionette DEBUG Init aborted (running=false, enabled=true, finalUIStartup=false)
05:19:50 INFO - runtests.py | Waiting for browser...
05:23:00 INFO - Buffered messages finished
05:23:00 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
05:23:00 ERROR - Force-terminating active process(es).
05:23:00 INFO - Determining child pids from psutil...
05:23:00 INFO - [7472]
05:23:00 INFO - Found child pids: set([7472])
05:23:00 INFO - Killing process: 7472
05:23:00 INFO - TEST-INFO | started process screenshot
05:23:00 INFO - TEST-INFO | screenshot: exit 0
05:23:00 INFO - psutil found pid 7472 dead
05:23:00 INFO - Killing process: 8824
05:23:00 INFO - Not taking screenshot here: see the one that was previously logged
05:23:00 WARNING - mozcrash kill_pid(): unable to terminate pid 8824: 5
05:23:00 INFO - psutil found pid 8824 dead
05:23:01 INFO - TEST-INFO | Main app process: exit 1
05:23:01 INFO - Buffered messages finished
05:23:01 ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 1
05:23:01 INFO - runtests.py | Application ran for: 0:06:10.951000
05:23:01 INFO - zombiecheck | Reading PID log: c:\users\task_1551069064\appdata\local\temp\tmpycim8vpidlog
05:23:01 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/Gz7_UDtyQvadjPWLYmO-EA/artifacts/public/build/target.crashreporter-symbols.zip
05:23:05 INFO - mozcrash Copy/paste: Z:\task_1551069064\build\win32-minidump_stackwalk.exe c:\users\task_1551069064\appdata\local\temp\tmpfd9oyb.mozrunner\minidumps\1789e32e-7d38-40a6-8665-c1787c3d5572.dmp c:\users\task_1551069064\appdata\local\temp\tmpklcrn9
05:23:11 INFO - mozcrash Saved minidump as Z:\task_1551069064\build\blobber_upload_dir\1789e32e-7d38-40a6-8665-c1787c3d5572.dmp
05:23:11 INFO - PROCESS-CRASH | automation.py | application crashed [None]
05:23:11 INFO - Crash dump filename: c:\users\task_1551069064\appdata\local\temp\tmpfd9oyb.mozrunner\minidumps\1789e32e-7d38-40a6-8665-c1787c3d5572.dmp
05:23:11 INFO - Operating system: Windows NT
05:23:11 INFO - 10.0.17134
05:23:11 INFO - CPU: amd64
05:23:11 INFO - family 6 model 85 stepping 3
05:23:11 INFO - 8 CPUs
05:23:11 INFO -
05:23:11 INFO - GPU: UNKNOWN
05:23:11 INFO -
05:23:11 INFO - No crash
05:23:11 INFO - Process uptime: 370 seconds
05:23:11 INFO -
05:23:11 INFO - Thread 0
05:23:11 INFO - 0 ntdll.dll!NtProtectVirtualMemory + 0x14
05:23:11 INFO - rax = 0x0000000000000050 rdx = 0x00000090913fa7c0
05:23:11 INFO - rcx = 0xffffffffffffffff rbx = 0x0000000000000001
05:23:11 INFO - rsi = 0x00007fff750b0000 rdi = 0x0000000000000000
05:23:11 INFO - rbp = 0x00007fff750b0000 rsp = 0x00000090913fa768
05:23:11 INFO - r8 = 0x00000090913fa768 r9 = 0x00007fff750b0000
05:23:11 INFO - r10 = 0x0000000000000000 r11 = 0x0000000000000246
05:23:11 INFO - r12 = 0x0000000000000000 r13 = 0x0000000000000003
05:23:11 INFO - r14 = 0x0000000000000001 r15 = 0x0000000000008020
05:23:11 INFO - rip = 0x00007fff7946a8f4
05:23:11 INFO - Found by: given as instruction pointer in context
05:23:11 INFO - 1 ntdll.dll!LdrpChangeMrdataProtection + 0x4e
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fa770 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff793f1762
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 2 ntdll.dll!LdrProtectMrdata + 0x72
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fa7b0 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff793eea16
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 3 ntdll.dll!RtlInsertInvertedFunctionTable + 0x6d
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fa7e0 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff793f3eb5
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 4 ntdll.dll!LdrpProcessMappedModule + 0x159
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fa820 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff793f4105
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 5 ntdll.dll!LdrpMapDllWithSectionHandle + 0x143
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fa880 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff793f4d0f
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 6 ntdll.dll!LdrpMapDllNtFileName + 0x18a
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fa8e0 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff7940bd9e
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 7 ntdll.dll!LdrpMapDllSearchPath + 0x1de
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fa9b0 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff7940adea
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 8 ntdll.dll!LdrpProcessWork + 0x83
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fac10 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff7940b23b
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 9 ntdll.dll!LdrpLoadDllInternal + 0x13e
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fac80 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff79400eba
05:23:11 INFO - Found by: call frame info
05:23:11 INFO - 10 ntdll.dll!LdrpLoadForwardedDll + 0x129
05:23:11 INFO - rbx = 0x0000000000000001 rbp = 0x00007fff750b0000
05:23:11 INFO - rsp = 0x00000090913fad00 r12 = 0x0000000000000000
05:23:11 INFO - r13 = 0x0000000000000003 r14 = 0x0000000000000001
05:23:11 INFO - r15 = 0x0000000000008020 rip = 0x00007fff7940a50d
05:23:11 INFO - Found by: call frame info

Maybe the upgrade to Windows 10 1803 via bug 1522896 increased the number of intermittents for test jobs on Windows?

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] (away 02/28 - 03/03) from comment #307)

Maybe the upgrade to Windows 10 1803 via bug 1522896 increased the number of intermittents for test jobs on Windows?

There are a lot of hangs during startup of Firefox which don't let Marionette initialize due to a missing sessionstore-windows-restored notification. On bug 1530979 I will work soon on a custom observer notification which will hopefully fix bug 1382162.

No longer blocks: 1382162
Depends on: 1382162

There are 35 failures over the past 7 days happening on windows10-64 pgo/opt.

:whimboo any updates for this?

Flags: needinfo?(hskupin)

For a status update please see bug 1530979 which I said is the bug I'm working on.

Flags: needinfo?(hskupin)

Good news. The last recorded failure from the last days is from 2019-03-16! This could mean that my patch on bug 1530979 could have fixed that bug, which was caused by not starting Marionette correctly due to a missing sessionstore-windows-restored notification.

I will revisit this bug next week and close it if no further failures on mozilla-central and beta will occur.

Flags: needinfo?(hskupin)

I was too optimistic. The failures from over the weekend might not have been classified when I took a look at OF. Now it can be seen that it is still happening.

Flags: needinfo?(hskupin)

There are 42 failures associated to this bug in the last 7 days. These are occurring mostly on windows10-64-shippable opt but also on osx-10 and linux (asan, debug).

whimboo: can you take a look?

There are 51 total failures in the last 7 days: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2019-05-02&endday=2019-05-09&tree=trunk&bug=1414495

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=245446963&repo=mozilla-central&lineNumber=2260

00:56:05 INFO - TEST-START | devtools/client/application/test/browser_application_panel_list-unicode.js
01:02:23 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
01:02:23 ERROR - Force-terminating active process(es).
01:02:23 INFO - Determining child pids from psutil...
01:02:23 INFO - [2276]
01:02:23 INFO - ==> process 2276 launched child process 3712 ("C:\tasks\task_1557359896\build\application\firefox\firefox.exe" -contentproc --channel="2276.0.1004273798\1610296460" -parentBuildID 20190508215144 -greomni "C:\tasks\task_1557359896\build\application\firefox\omni.ja" -appomni "C:\tasks\task_1557359896\build\application\firefox\browser\omni.ja" -appdir "C:\tasks\task_1557359896\build\application\firefox\browser" - 2276 "\.\pipe\gecko-crash-server-pipe.2276" 2160 gpu)
01:02:23 INFO - Found child pids: set([3712, 2276])
01:02:23 INFO - Killing process: 3712
01:02:23 INFO - TEST-INFO | started process screenshot
01:02:24 INFO - TEST-INFO | screenshot: exit 0
01:02:24 INFO - Can't trigger Breakpad, just killing process
01:02:24 INFO - psutil found pid 3712 dead
01:02:24 INFO - Killing process: 2276
01:02:24 INFO - Not taking screenshot here: see the one that was previously logged
01:02:24 INFO - Can't trigger Breakpad, just killing process
01:02:24 INFO - psutil found pid 2276 dead
01:02:24 INFO - psutil found pid 2276 dead
01:02:24 INFO - psutil found pid 3712 dead
01:02:24 INFO - Killing process: 10220
01:02:24 INFO - Not taking screenshot here: see the one that was previously logged
01:02:24 INFO - Can't trigger Breakpad, just killing process
01:02:24 INFO - Error: Failed to kill process 10220: psutil.NoSuchProcess no process found with pid 10220
01:02:24 INFO - psutil found pid 10220 dead
01:02:24 INFO - TEST-INFO | Main app process: exit f
01:02:24 INFO - Buffered messages finished
01:02:24 ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 15
01:02:24 INFO - runtests.py | Application ran for: 0:06:11.504000
01:02:24 INFO - zombiecheck | Reading PID log: c:\users\testdr1\appdata\local\temp\tmpeh5jn3pidlog
01:02:24 INFO - ==> process 2276 launched child process 3712 ("C:\tasks\task_1557359896\build\application\firefox\firefox.exe" -contentproc --channel="2276.0.1004273798\1610296460" -parentBuildID 20190508215144 -greomni "C:\tasks\task_1557359896\build\application\firefox\omni.ja" -appomni "C:\tasks\task_1557359896\build\application\firefox\browser\omni.ja" -appdir "C:\tasks\task_1557359896\build\application\firefox\browser" - 2276 "\.\pipe\gecko-crash-server-pipe.2276" 2160 gpu)
01:02:24 INFO - zombiecheck | Checking for orphan process with PID: 3712
01:02:24 INFO - Traceback (most recent call last):
01:02:24 INFO - File "C:\tasks\task_1557359896\build\tests\mochitest\runtests.py", line 2834, in doTests
01:02:24 INFO - e10s=options.e10s
01:02:24 INFO - File "C:\tasks\task_1557359896\build\tests\mochitest\runtests.py", line 2270, in runApp
01:02:24 INFO - self.marionette.start_session()
01:02:24 INFO - File "C:\tasks\task_1557359896\build\venv\lib\site-packages\marionette_driver\decorators.py", line 36, in _
01:02:24 INFO - m._handle_socket_failure()
01:02:24 INFO - File "C:\tasks\task_1557359896\build\venv\lib\site-packages\marionette_driver\marionette.py", line 650, in _handle_socket_failure
01:02:24 INFO - reraise(exc, val, tb)
01:02:24 INFO - File "C:\tasks\task_1557359896\build\venv\lib\site-packages\marionette_driver\decorators.py", line 26, in _
01:02:24 INFO - return func(*args, **kwargs)
01:02:24 INFO - File "C:\tasks\task_1557359896\build\venv\lib\site-packages\marionette_driver\marionette.py", line 1108, in start_session
01:02:24 INFO - self.raise_for_port(timeout=timeout)
01:02:24 INFO - File "C:\tasks\task_1557359896\build\venv\lib\site-packages\marionette_driver\marionette.py", line 569, in raise_for_port
01:02:24 INFO - self.host, self.port))
01:02:24 INFO - timeout: Timed out waiting for connection on 127.0.0.1:2828!
01:02:24 ERROR - Automation Error: Received unexpected exception while running application
01:02:24 ERROR -
01:02:24 INFO - Stopping web server
01:02:24 INFO - Stopping web socket server
01:02:24 INFO - Stopping ssltunnel
01:02:24 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
01:02:24 INFO - runtests.py | Running tests: end.
01:02:24 INFO - Buffered messages finished
01:02:24 INFO - Running manifest: devtools\client\inspector\boxmodel\test\browser.ini
01:02:24 INFO - The following extra prefs will be set:
01:02:24 INFO - dom.animations-api.core.enabled=true
01:02:24 INFO - dom.animations-api.timelines.enabled=true
01:02:25 INFO - C:\tasks\task_1557359896\build\tests\bin\pk12util.exe: PKCS12 IMPORT SUCCESSFUL
01:02:25 INFO - Increasing default timeout to 240.0 seconds
01:02:25 INFO - MochitestServer : launching [u'C:\tasks\task_1557359896\build\tests\bin\xpcshell.exe', '-g', 'C:\tasks\task_1557359896\build\application\firefox', '-f', 'C:\tasks\task_1557359896\build\tests\bin\components\httpd.js', '-e', "const _PROFILE_PATH = 'c:\\users\\testdr
1\\appdata\\local\\temp\\tmprnoqa3.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', 'C:\tasks\task_1557359896\build\tests\mochitest\server.js']
01:02:25 INFO - runtests.py | Server pid: 1028
01:02:25 INFO - runtests.py | Websocket server pid: 9796
01:02:25 INFO - runtests.py | SSL tunnel pid: 8420
01:02:25 INFO - runtests.py | Running with scheme: http
01:02:25 INFO - runtests.py | Running with e10s: True
01:02:25 INFO - runtests.py | Running with serviceworker_e10s: False
01:02:25 INFO - runtests.py | Running with socketprocess_e10s: False
01:02:25 INFO - runtests.py | Running tests: start.
01:02:25 INFO -
01:02:25 INFO - Application command: C:\tasks\task_1557359896\build\application\firefox\firefox.exe -marionette --wait-for-browser -foreground -profile c:\users\testdr~1\appdata\local\temp\tmprnoqa3.mozrunner
01:02:25 INFO - runtests.py | Application pid: 6792

There are 18 total failures in the last 7 days on windows7-32-shippable opt, windows7-32 debug, windows10-64 opt, macosx1014-64 debug and linux64-shippable-qr opt and 53 total failures in the last 30 days.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=277236061&repo=autoland&lineNumber=1110

[task 2019-11-20T18:54:36.599Z] 18:54:36 INFO - runtests.py | Application pid: 5880
[task 2019-11-20T18:54:36.599Z] 18:54:36 INFO - TEST-INFO | started process GECKO(5880)
[task 2019-11-20T18:54:37.919Z] 18:54:37 INFO - GECKO(5880) | 1574276077910 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: mozillaAddons
[task 2019-11-20T18:54:37.919Z] 18:54:37 INFO - GECKO(5880) | 1574276077911 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: telemetry
[task 2019-11-20T18:54:37.920Z] 18:54:37 INFO - GECKO(5880) | 1574276077911 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: resource://pdf.js/
[task 2019-11-20T18:54:37.920Z] 18:54:37 INFO - GECKO(5880) | 1574276077911 addons.webextension.screenshots@mozilla.org WARN Loading extension 'screenshots@mozilla.org': Reading manifest: Invalid extension permission: about:reader*
[task 2019-11-20T18:54:38.041Z] 18:54:38 INFO - GECKO(5880) | 1574276078040 Marionette TRACE Received observer notification profile-after-change
[task 2019-11-20T18:54:38.081Z] 18:54:38 INFO - GECKO(5880) | 1574276078077 Marionette TRACE Received observer notification toplevel-window-ready
[task 2019-11-20T18:54:38.104Z] 18:54:38 INFO - GECKO(5880) | 1574276078101 Marionette TRACE Received observer notification command-line-startup
[task 2019-11-20T18:54:38.104Z] 18:54:38 INFO - GECKO(5880) | 1574276078102 Marionette DEBUG Init aborted (running=false, enabled=true, finalUIStartup=false)
[task 2019-11-20T18:54:38.159Z] 18:54:38 INFO - GECKO(5880) | 1574276078155 Marionette TRACE Received observer notification toplevel-window-ready
[task 2019-11-20T18:54:39.842Z] 18:54:39 INFO - GECKO(5880) | 1574276079837 Marionette TRACE Received observer notification marionette-startup-requested
[task 2019-11-20T18:54:39.842Z] 18:54:39 INFO - GECKO(5880) | 1574276079837 Marionette TRACE Waiting until startup recorder finished recording startup scripts...
[task 2019-11-20T18:54:39.857Z] 18:54:39 INFO - GECKO(5880) | 1574276079846 Marionette TRACE All scripts recorded.
[task 2019-11-20T18:54:39.892Z] 18:54:39 INFO - GECKO(5880) | 1574276079889 Marionette INFO Listening on port 2828
[task 2019-11-20T18:54:39.892Z] 18:54:39 INFO - GECKO(5880) | 1574276079889 Marionette DEBUG Remote service is active
[task 2019-11-20T18:54:40.007Z] 18:54:40 INFO - GECKO(5880) | 1574276080000 Marionette DEBUG Accepted connection 0 from 127.0.0.1:49373
[task 2019-11-20T18:54:40.102Z] 18:54:40 INFO - GECKO(5880) | 1574276080095 Marionette DEBUG Accepted connection 1 from 127.0.0.1:49376
[task 2019-11-20T18:54:40.102Z] 18:54:40 INFO - GECKO(5880) | 1574276080096 Marionette DEBUG Closed connection 1
[task 2019-11-20T18:54:40.102Z] 18:54:40 INFO - GECKO(5880) | 1574276080096 Marionette DEBUG Closed connection 0
[task 2019-11-20T18:54:40.102Z] 18:54:40 INFO - GECKO(5880) | 1574276080097 Marionette DEBUG Accepted connection 2 from 127.0.0.1:49378
[task 2019-11-20T18:54:40.105Z] 18:54:40 INFO - GECKO(5880) | 1574276080101 Marionette DEBUG 2 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2019-11-20T18:54:40.125Z] 18:54:40 INFO - GECKO(5880) | 1574276080121 Marionette TRACE [19] Frame script loaded
[task 2019-11-20T18:54:40.132Z] 18:54:40 INFO - GECKO(5880) | 1574276080127 Marionette TRACE [19] Frame script registered
[task 2019-11-20T18:54:40.132Z] 18:54:40 INFO - GECKO(5880) | 1574276080128 Marionette TRACE [25] Frame script loaded
[task 2019-11-20T18:54:40.135Z] 18:54:40 INFO - GECKO(5880) | 1574276080131 Marionette DEBUG 2 <- [1,1,null,{"sessionId":"461b6dcd-5db5-44b8-a1c1-ebc112a6a46d","capabilities":{"browserName":"firefox","browserVersion":"72.0a ... p\tmpxew112.mozrunner","moz:shutdownTimeout":60000,"moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true}}]
[task 2019-11-20T18:54:40.135Z] 18:54:40 INFO - GECKO(5880) | 1574276080133 Marionette TRACE [25] Frame script registered
[task 2019-11-20T18:54:40.155Z] 18:54:40 INFO - GECKO(5880) | 1574276080151 Marionette DEBUG 2 -> [0,2,"Addon:Install",{"path":"c:\users\task_1574275192\appdata\local\temp\tmpdfpjax.zip","temporary":false}]
[task 2019-11-20T18:54:40.190Z] 18:54:40 INFO - GECKO(5880) | 1574276080184 Marionette DEBUG 2 <- [1,2,null,{"value":"special-powers@mozilla.org"}]
[task 2019-11-20T18:54:40.217Z] 18:54:40 INFO - GECKO(5880) | 1574276080213 Marionette DEBUG 2 -> [0,3,"Addon:Install",{"path":"c:\users\task_1574275192\appdata\local\temp\tmphj7s1j.zip","temporary":false}]
[task 2019-11-20T18:54:40.252Z] 18:54:40 INFO - GECKO(5880) | 1574276080242 Marionette DEBUG 2 <- [1,3,null,{"value":"mochikit@mozilla.org"}]
[task 2019-11-20T18:54:40.252Z] 18:54:40 INFO - GECKO(5880) | 1574276080243 Marionette DEBUG 2 -> [0,4,"Marionette:GetContext",{}]
[task 2019-11-20T18:54:40.252Z] 18:54:40 INFO - GECKO(5880) | 1574276080244 Marionette DEBUG 2 <- [1,4,null,{"value":"content"}]
[task 2019-11-20T18:54:40.253Z] 18:54:40 INFO - GECKO(5880) | 1574276080244 Marionette DEBUG 2 -> [0,5,"Marionette:SetContext",{"value":"chrome"}]
[task 2019-11-20T18:54:40.253Z] 18:54:40 INFO - GECKO(5880) | 1574276080245 Marionette DEBUG 2 <- [1,5,null,{"value":null}]
[task 2019-11-20T18:54:40.253Z] 18:54:40 INFO - GECKO(5880) | 1574276080247 Marionette DEBUG 2 -> [0,6,"WebDriver:ExecuteScript",{"script":"/* This Source Code Form is subject to the terms of the Mozilla Public\n * License, ... l","flavor":"a11y"}],"filename":"Z:\task_1574275192\build\tests\mochitest\runtests.py","sandbox":"default","line":1772}]
[task 2019-11-20T18:54:40.260Z] 18:54:40 INFO - GECKO(5880) | 1574276080253 Marionette DEBUG 2 <- [1,6,null,{"value":null}]
[task 2019-11-20T18:54:40.260Z] 18:54:40 INFO - GECKO(5880) | 1574276080256 Marionette DEBUG 2 -> [0,7,"Marionette:SetContext",{"value":"content"}]
[task 2019-11-20T18:54:40.260Z] 18:54:40 INFO - GECKO(5880) | 1574276080257 Marionette DEBUG 2 <- [1,7,null,{"value":null}]
[task 2019-11-20T18:54:40.260Z] 18:54:40 INFO - GECKO(5880) | 1574276080258 Marionette DEBUG 2 -> [0,8,"WebDriver:DeleteSession",{}]
[task 2019-11-20T18:54:40.263Z] 18:54:40 INFO - GECKO(5880) | 1574276080259 Marionette DEBUG 2 <- [1,8,null,{"value":null}]
[task 2019-11-20T18:54:40.263Z] 18:54:40 INFO - runtests.py | Waiting for browser...
[task 2019-11-20T18:54:40.263Z] 18:54:40 INFO - GECKO(5880) | 1574276080260 Marionette DEBUG Closed connection 2
[task 2019-11-20T18:54:40.704Z] 18:54:40 INFO - GECKO(5880) | JavaScript error: chrome://mochikit/content/browser-test.js, line 115: TypeError: content is null
[task 2019-11-20T19:00:50.727Z] 19:00:50 INFO - Buffered messages finished
[task 2019-11-20T19:00:50.727Z] 19:00:50 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
[task 2019-11-20T19:00:50.728Z] 19:00:50 ERROR - Force-terminating active process(es).
[task 2019-11-20T19:00:50.728Z] 19:00:50 INFO - Determining child pids from psutil...
[task 2019-11-20T19:00:50.728Z] 19:00:50 INFO - [5820]
[task 2019-11-20T19:00:50.729Z] 19:00:50 INFO - ==> process 5820 launched child process 520 ("Z:\task_1574275192\build\application\firefox\firefox.exe" -contentproc --channel="5820.0.1678139090\2117864965" -childID 1 -isForBrowser -prefsHandle 1152 -prefMapHandle 2148 -prefsLen 1 -prefMapSize 223873 -parentBuildID 20191120161549 -greomni "Z:\task_1574275192\build\application\firefox\omni.ja" -appomni "Z:\task_1574275192\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1574275192\build\application\firefox\browser" - 5820 "\.\pipe\gecko-crash-server-pipe.5820" 2184 tab)
[task 2019-11-20T19:00:50.729Z] 19:00:50 INFO - ==> process 5820 launched child process 6016 ("Z:\task_1574275192\build\application\firefox\firefox.exe" -contentproc --channel="5820.6.533656122\1332268244" -childID 2 -isForBrowser -prefsHandle 2392 -prefMapHandle 1632 -prefsLen 98 -prefMapSize 223873 -parentBuildID 20191120161549 -greomni "Z:\task_1574275192\build\application\firefox\omni.ja" -appomni "Z:\task_1574275192\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1574275192\build\application\firefox\browser" - 5820 "\.\pipe\gecko-crash-server-pipe.5820" 2404 tab)
[task 2019-11-20T19:00:50.730Z] 19:00:50 INFO - ==> process 5820 launched child process 1624 ("Z:\task_1574275192\build\application\firefox\firefox.exe" -contentproc --channel="5820.12.1968968017\93997382" -childID 3 -isForBrowser -prefsHandle 2592 -prefMapHandle 2588 -prefsLen 159 -prefMapSize 223873 -parentBuildID 20191120161549 -greomni "Z:\task_1574275192\build\application\firefox\omni.ja" -appomni "Z:\task_1574275192\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1574275192\build\application\firefox\browser" - 5820 "\.\pipe\gecko-crash-server-pipe.5820" 2604 tab)
[task 2019-11-20T19:00:50.730Z] 19:00:50 INFO - Found child pids: set([520, 6016, 1624, 5820])
[task 2019-11-20T19:00:50.730Z] 19:00:50 INFO - Killing process: 520
[task 2019-11-20T19:00:50.730Z] 19:00:50 INFO - TEST-INFO | started process screenshot
[task 2019-11-20T19:00:50.789Z] 19:00:50 INFO - TEST-INFO | screenshot: exit 0
[task 2019-11-20T19:00:50.904Z] 19:00:50 INFO - Killing process: 6016
[task 2019-11-20T19:00:50.904Z] 19:00:50 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-11-20T19:00:50.924Z] 19:00:50 INFO - Killing process: 1624
[task 2019-11-20T19:00:50.924Z] 19:00:50 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-11-20T19:00:51.031Z] 19:00:51 INFO - Killing process: 5820
[task 2019-11-20T19:00:51.031Z] 19:00:51 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-11-20T19:00:51.074Z] 19:00:51 INFO - psutil found pid 520 dead
[task 2019-11-20T19:00:51.074Z] 19:00:51 INFO - psutil found pid 6016 dead
[task 2019-11-20T19:00:51.074Z] 19:00:51 INFO - psutil found pid 5820 dead
[task 2019-11-20T19:00:51.074Z] 19:00:51 INFO - psutil found pid 1624 dead
[task 2019-11-20T19:00:51.075Z] 19:00:51 INFO - Killing process: 5880
[task 2019-11-20T19:00:51.075Z] 19:00:51 INFO - Not taking screenshot here: see the one that was previously logged
[task 2019-11-20T19:00:51.078Z] 19:00:51 WARNING - mozcrash kill_pid(): unable to terminate pid 5880: 5
[task 2019-11-20T19:00:51.078Z] 19:00:51 INFO - psutil found pid 5880 dead
[task 2019-11-20T19:00:51.133Z] 19:00:51 INFO - TEST-INFO | Main app process: exit 1
[task 2019-11-20T19:00:51.133Z] 19:00:51 INFO - Buffered messages finished
[task 2019-11-20T19:00:51.133Z] 19:00:51 ERROR - TEST-UNEXPECTED-FAIL | automation.py | application terminated with exit code 1

Geoff since you are the triage owner, based on the failure rate should someone take a look at this?

Flags: needinfo?(gbrown)

This seems to be a problem with the mochikit extension and it would make sense to file it as a new bug:

JavaScript error: chrome://mochikit/content/browser-test.js, line 115: TypeError: content is null

Due to that error the tests aren't get started as it looks like.

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #358)

This seems to be a problem with the mochikit extension and it would make sense to file it as a new bug:

JavaScript error: chrome://mochikit/content/browser-test.js, line 115: TypeError: content is null

Due to that error the tests aren't get started as it looks like.

Oh look, we already have bug 1549939 for that!

Most of the recent failures happened during a narrow range of commits on Nov 20: I think there was a temporary problem fixed by backout or follow-up. There are still on-going failures, but the rate is fairly low.

Flags: needinfo?(gbrown)

(In reply to Cristina Coroiu [:ccoroiu] from comment #377)

Continued retriggers and backfills up to:
https://treeherder.mozilla.org/#/jobs?repo=autoland&searchStr=linux%2C18.04%2Cx64%2Cccov%2Copt%2Cmochitests%2Ctest-linux1804-64-ccov%2Fopt-mochitest-browser-chrome-e10s-2%2Cm%28bc2%29&tochange=9f194eeeafdfc5890b823c97da887ea295062a87&fromchange=aef6f407e2e1ef8a5790b363f3356c706fa2cfd9&selectedJob=293258493

It seems that started to fail from this push: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=293268342&searchStr=linux%2C18.04%2Cx64%2Cccov%2Copt%2Cmochitests%2Ctest-linux1804-64-ccov%2Fopt-mochitest-browser-chrome-e10s-2%2Cm%28bc2%29&revision=06ca3c111fc7b2fd2c2c8137bdac13f8d52bfb98, and the culprit is https://bugzilla.mozilla.org/show_bug.cgi?id=1618188.

Gijs, can you please take a look?

This is incorrect; the test changed chunks as a result of that commit. Tests from browser/base/content/test/performance/io/browser.ini run on the bc2 chunk in that push and the following ones, but not in the previous ones. You can filter on that test directory using the "funnel" icon in treeherder and filtering on test path, and then retrigger the "right" chunks of the ccov runs to find the "real" culprit. I'll retrigger all ccov bc jobs on the previous run to allow us to find out which job that is.

Anyway, in case it's helpful for the next person looking at this, this seems to be affecting browser/base/content/test/performance/io/browser.ini tests; the browser has come up successfully per the screenshot and marionette is listening (per the log output; search for "timed out").

Flags: needinfo?(gijskruitbosch+bugs) → needinfo?(ccoroiu)

Sorry about that, I'll use next time "filter by test path" in case of changing chunks failures.

Flags: needinfo?(ccoroiu)
Flags: needinfo?(mcastelluccio)
Flags: needinfo?(florian)

:whimboo, looks like marionette waits for output here and never gets any (cf. https://treeherder.mozilla.org/logviewer.html#?job_id=294275391&repo=mozilla-central ). Any idea what's going on?

Flags: needinfo?(hskupin)

Here the log for reference:

[task 2020-03-22T22:18:00.719Z] 22:18:00     INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmphUcmCl.mozrunner
[task 2020-03-22T22:18:00.727Z] 22:18:00     INFO - runtests.py | Application pid: 3175
[..]
[task 2020-03-22T22:18:15.533Z] 22:18:15     INFO - GECKO(3175) | 1584915495528	Marionette	TRACE	Marionette enabled
[task 2020-03-22T22:18:16.033Z] 22:18:16     INFO - GECKO(3175) | 1584915496025	Marionette	TRACE	Received observer notification toplevel-window-ready
[task 2020-03-22T22:18:28.301Z] 22:18:28     INFO - GECKO(3175) | [CodeCoverage] Setting handlers for process 3254.
[task 2020-03-22T22:18:29.373Z] 22:18:29     INFO - GECKO(3175) | [CodeCoverage] Setting handlers for process 3371.
[task 2020-03-22T22:20:26.914Z] 22:20:26     INFO - GECKO(3175) | 1584915626905	Marionette	TRACE	Received observer notification marionette-startup-requested
[task 2020-03-22T22:20:26.915Z] 22:20:26     INFO - GECKO(3175) | 1584915626907	Marionette	TRACE	Waiting until startup recorder finished recording startup scripts...
[task 2020-03-22T22:20:35.546Z] 22:20:35     INFO - GECKO(3175) | [CodeCoverage] Setting handlers for process 3466.
[task 2020-03-22T22:21:00.800Z] 22:21:00     INFO - runtests.py | Waiting for browser...
[task 2020-03-22T22:21:49.439Z] 22:21:49     INFO - GECKO(3175) | 1584915709437	Marionette	TRACE	All scripts recorded.
[task 2020-03-22T22:21:49.784Z] 22:21:49     INFO - GECKO(3175) | 1584915709776	Marionette	INFO	Listening on port 2828
[task 2020-03-22T22:21:49.784Z] 22:21:49     INFO - GECKO(3175) | 1584915709777	Marionette	DEBUG	Marionette is listening

It means that the Marionette component in Firefox is ready and waits for incoming connections. Given that Mochitest isn't using the Marionette harness it handles the connection itself. So I assume it somewhat fails because the startup time is nearly 4 minutes for this ccov build? It might make sense to add more logging around those lines:

https://searchfox.org/mozilla-central/rev/3fd53c47864fedb916f0ed8f002f15456324f729/testing/mochitest/runtests.py#2303-2305

Geoff, could you maybe do that?

Flags: needinfo?(hskupin) → needinfo?(gbrown)

(In reply to Andreea Pavel [:apavel] from comment #381)

Marco/Florian can you provide any input here?

I continued retriggers as Gijs suggested above, on linux ccov opt, on a different chunk: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&tochange=06ca3c111fc7b2fd2c2c8137bdac13f8d52bfb98&test_paths=browser%2Fbase%2Fcontent%2Ftest%2Fperformance%2Fio%2Fbrowser.ini&fromchange=f1a38b2d35f3c5baf83d73fd1d8e467bf35cb4ac&selectedJob=293258493

From looking at the log of the passing bc2 jobs on this try push, it seems startup takes about 3 minutes when starting to run the browser/base/content/test/performance/io/browser.ini tests. We could make startup for this folder faster by disabling the afterPaintListener of startupRecorder.js for this folder where the tests use the startupRecorder but don't care about the screenshots (that are taken for the flicker test).
If we are also seeing timeouts for the browser/base/content/test/performance/browser.ini jobs, we should consider disabling these tests for linux ccov, where I don't think performance tests are relevant.

Flags: needinfo?(florian)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #384)

[task 2020-03-22T22:21:00.800Z] 22:21:00 INFO - runtests.py | Waiting for browser...
[task 2020-03-22T22:21:49.439Z] 22:21:49 INFO - GECKO(3175) | 1584915709437 Marionette TRACE All scripts recorded.
[task 2020-03-22T22:21:49.784Z] 22:21:49 INFO - GECKO(3175) | 1584915709776 Marionette INFO Listening on port 2828
[task 2020-03-22T22:21:49.784Z] 22:21:49 INFO - GECKO(3175) | 1584915709777 Marionette DEBUG Marionette is listening

It means that the Marionette component in Firefox is ready and waits for incoming connections. Given that Mochitest isn't using the Marionette harness it handles the connection itself. So I assume it somewhat fails because the startup time is nearly 4 minutes for this ccov build? It might make sense to add more logging around those lines:

https://searchfox.org/mozilla-central/rev/3fd53c47864fedb916f0ed8f002f15456324f729/testing/mochitest/runtests.py#2303-2305

Geoff, could you maybe do that?

But I see "waiting for browser" -- runtests.py is at least getting to:

https://searchfox.org/mozilla-central/rev/3fd53c47864fedb916f0ed8f002f15456324f729/testing/mochitest/runtests.py#2339-2340

Seems like browser startup is taking too long.

Flags: needinfo?(gbrown)

(In reply to Geoff Brown [:gbrown] (reduced availability) from comment #387)

But I see "waiting for browser" -- runtests.py is at least getting to:

https://searchfox.org/mozilla-central/rev/3fd53c47864fedb916f0ed8f002f15456324f729/testing/mochitest/runtests.py#2339-2340

Seems like browser startup is taking too long.

Well, Marionette trace logs are enabled so I would expect to see some lines for installing the mochikit and specialpowers extensions. But that is not happening. So as long as those aren't installed the harness will wait forever.

Flags: needinfo?(gbrown)

...and the deferred exception is later lost (after the long wait for the process to end) because
https://searchfox.org/mozilla-central/rev/c80fa7258c935223fe319c5345b58eae85d4c6ae/testing/mochitest/runtests.py#2388
uses with_traceback -- not available in python 2.

I forced an increase in the timeout used by raise_for_port:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e1a215f0a275334100bbcf90ca3b6ef52d3dabfd
and that avoided the timeout. I'm not sure where that timeout value (180) comes from, but it looks like we could avoid this problem by increasing the timeout to ~300.

I guess startup just takes a long time on ccov for this directory of tests (maybe because of the prefs or environment used by this manifest? https://searchfox.org/mozilla-central/rev/c80fa7258c935223fe319c5345b58eae85d4c6ae/browser/base/content/test/performance/io/browser.ini#11-20); given comment 386, perhaps best to skip this manifest on ccov?

Flags: needinfo?(gbrown)

180s is indeed strange given that we use 120s by default for the startup timeout:

https://searchfox.org/mozilla-central/rev/3fd53c47864fedb916f0ed8f002f15456324f729/testing/marionette/client/marionette_driver/marionette.py#397

But yes, if that is a ccov build and it takes that long you might want to increase the timeout. Just pass in the value you want when calling self.marionette.start_session(). It's the timeout argument. You might even want to adjust it based on the type of build you run.

Beside that I wonder if it is ok when it takes that long even for a ccov build to startup.

Depends on: 1624857

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+1] from comment #392)

Beside that I wonder if it is ok when it takes that long even for a ccov build to startup.

It's not a normal startup, we capture a screenshot every single time we paint something on screen, to then compare the screenshots to catch unexpected flicker.

Depends on: 1624868

(In reply to Andreea Pavel [:apavel] from comment #381)

I continued retriggers as Gijs suggested above, on linux ccov opt, on a different chunk: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&tochange=06ca3c111fc7b2fd2c2c8137bdac13f8d52bfb98&test_paths=browser%2Fbase%2Fcontent%2Ftest%2Fperformance%2Fio%2Fbrowser.ini&fromchange=f1a38b2d35f3c5baf83d73fd1d8e467bf35cb4ac&selectedJob=293258493

So do we know when this started failing? It seems to be currently (almost) perma fail on Linux ccov and Windows 10-64 asan for the browser/base/content/test/performance/io/ folder.

Flags: needinfo?(apavel)

(In reply to Florian Quèze [:florian] from comment #395)

(In reply to Andreea Pavel [:apavel] from comment #381)

I continued retriggers as Gijs suggested above, on linux ccov opt, on a different chunk: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&tochange=06ca3c111fc7b2fd2c2c8137bdac13f8d52bfb98&test_paths=browser%2Fbase%2Fcontent%2Ftest%2Fperformance%2Fio%2Fbrowser.ini&fromchange=f1a38b2d35f3c5baf83d73fd1d8e467bf35cb4ac&selectedJob=293258493

So do we know when this started failing? It seems to be currently (almost) perma fail on Linux ccov and Windows 10-64 asan for the browser/base/content/test/performance/io/ folder.

I think the linux ccov thing is maybe the blocklist removal, which changed chunking. When it was in the bc3 chunk, prior to my push, it was green ( https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&selectedJob=293691734&tochange=06ca3c111fc7b2fd2c2c8137bdac13f8d52bfb98&fromchange=f1a38b2d35f3c5baf83d73fd1d8e467bf35cb4ac&searchStr=ccov%2Cbrowser ).

I don't know why it changing chunk and/or removing a bunch of dead code would have caused this test to start perma-failing.

The windows thing seems more recent.

(In reply to Florian Quèze [:florian] from comment #395)

(In reply to Andreea Pavel [:apavel] from comment #381)

I continued retriggers as Gijs suggested above, on linux ccov opt, on a different chunk: https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&tochange=06ca3c111fc7b2fd2c2c8137bdac13f8d52bfb98&test_paths=browser%2Fbase%2Fcontent%2Ftest%2Fperformance%2Fio%2Fbrowser.ini&fromchange=f1a38b2d35f3c5baf83d73fd1d8e467bf35cb4ac&selectedJob=293258493

So do we know when this started failing? It seems to be currently (almost) perma fail on Linux ccov and Windows 10-64 asan for the browser/base/content/test/performance/io/ folder.

No, we do not. Retriggers show what Cristina mentioned above, this started with Bug 1618188. Below this bug the failure did not occur on any chunk browser/base/content/test/performance/io/ ran.

https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&resultStatus=success%2Ctestfailed%2Cbusted%2Cexception&tochange=06ca3c111fc7b2fd2c2c8137bdac13f8d52bfb98&fromchange=f1a38b2d35f3c5baf83d73fd1d8e467bf35cb4ac&searchStr=browser%2Cchrome&selectedJob=293258493

Flags: needinfo?(apavel)

This bug failed 38 times in the last 7 days. Occurs on linux1804-64-ccov, windows10-64, windows10-aarch6 platforms on asan build and opt build types.

Recent log:
https://treeherder.mozilla.org/logviewer.html#?job_id=296302356&repo=autoland

There have been 67 failures in the last 7 days: https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-04-06&endday=2020-04-13&tree=trunk&bug=1414495

Affected platforms are windows10-64 asan, linux1804-64-ccov opt, linux1804-64-qr debug and linux1804-64-ccov opt.
Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=297335607&repo=mozilla-central&lineNumber=2206

In all the Windows cases the hang happened because the startup recorder never finished to record the startup scripts. We always hang in Marionette for this line:

[task 2020-04-10T18:51:51.255Z] 18:51:51 INFO - GECKO(7540) | 1586544711253 Marionette TRACE Waiting until startup recorder finished recording startup scripts...

https://searchfox.org/mozilla-central/rev/2cd2d511c0d94a34fb7fa3b746f54170ee759e35/testing/marionette/components/marionette.js#476-486

Flags: needinfo?(florian)
Depends on: 1629824

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #406)

In all the Windows cases the hang happened because the startup recorder never finished to record the startup scripts. We always hang in Marionette for this line:

[task 2020-04-10T18:51:51.255Z] 18:51:51 INFO - GECKO(7540) | 1586544711253 Marionette TRACE Waiting until startup recorder finished recording startup scripts...

https://searchfox.org/mozilla-central/rev/2cd2d511c0d94a34fb7fa3b746f54170ee759e35/testing/marionette/components/marionette.js#476-486

Note that here I meant non-ASAN builds because I thought that those issues were already handled by a different bug. But after speaking to Florian that's not the case yet. As such I filed bug 1629824. Sheriffs will update the classification data for already tagged failures on Treeherder.

In the last 7 days there have been 29 occurrences on linux1804-64 and windows10-64 asan and opt.

Recent failure: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=301533308&repo=mozilla-central&lineNumber=1154

Checking some of the classified failures here I can see that all are hanging waiting for the browser after the first top-level window has been opened. Interestingly this matches my observation for webdriver tests (bug 1630133). So I wonder if that is also a hang between:

[task 2020-04-15T00:55:51.226Z] 00:55:51     INFO - PID 1920 | DEBUG: Adding blocker Remote Settings profile-before-change for phase profile-before-change
[task 2020-04-15T00:55:51.471Z] 00:55:51     INFO - PID 1920 | DEBUG: Adding blocker MediaShutdownManager: shutdown for phase xpcom-will-shutdown

Shouldn't we actually try to use the kill_and_get_minidump feature with a bit shorter timeout here to get a crash stack of the start-up hang? I assume that has to be implemented in the harness, means Mochitest here. Whereby it might also be a good feature for the Marionette harness as used for the Mn jobs.

Flags: needinfo?(gijskruitbosch+bugs)
Flags: needinfo?(gbrown)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #414)

Checking some of the classified failures here I can see that all are hanging waiting for the browser after the first top-level window has been opened. Interestingly this matches my observation for webdriver tests (bug 1630133). So I wonder if that is also a hang between:

[task 2020-04-15T00:55:51.226Z] 00:55:51     INFO - PID 1920 | DEBUG: Adding blocker Remote Settings profile-before-change for phase profile-before-change
[task 2020-04-15T00:55:51.471Z] 00:55:51     INFO - PID 1920 | DEBUG: Adding blocker MediaShutdownManager: shutdown for phase xpcom-will-shutdown

Shouldn't we actually try to use the kill_and_get_minidump feature with a bit shorter timeout here to get a crash stack of the start-up hang? I assume that has to be implemented in the harness, means Mochitest here. Whereby it might also be a good feature for the Marionette harness as used for the Mn jobs.

I don't understand the question. I also don't understand what to take from the log you posted - it shows us adding 2 shutdown blockers, but we're doing it milliseconds apart, so that doesn't appear to be a hang?

Flags: needinfo?(gijskruitbosch+bugs)

(In reply to :Gijs (he/him) from comment #415)

I don't understand the question. I also don't understand what to take from the log you posted - it shows us adding 2 shutdown blockers, but we're doing it milliseconds apart, so that doesn't appear to be a hang?

See our conversation on bug 1630133. The above log snippet simply referenced some output where I assume the hang could happen. Seems like that wasn't clear enough. For Mochitests we do not have the async shutdown logs enabled, so we won't see that output - as such my speculation.

I had missed the needinfo, but I don't think I have anything to add. Ccov issues are almost always timing-related (since the ccov build is slower compared to other builds), and this seems not specific to ccov, which supports the hypothesis.

Flags: needinfo?(mcastelluccio)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #414)

Shouldn't we actually try to use the kill_and_get_minidump feature with a bit shorter timeout here to get a crash stack of the start-up hang? I assume that has to be implemented in the harness, means Mochitest here. Whereby it might also be a good feature for the Marionette harness as used for the Mn jobs.

That "application timed out after..." timeout is set up here, I believe:

https://searchfox.org/mozilla-central/rev/446160560bf32ebf4cb7c4e25d7386ee22667255/testing/mochitest/runtests.py#2297

But I would expect, if it's really hung, killing earlier will probably produce the same stack, unless the kill happens before the hang, in which case the stack may not be relevant.

Flags: needinfo?(gbrown)
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start) → Intermittent non-asan TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start)

Oh I see. killAndGetStack is actually used, and for non-asan builds we get the expected crash stack. There have been wrongly classified a lot of Windows asan builds recently, which first caused my confusion.

(In reply to Intermittent Failures Robot from comment #420)

19 failures in 997 pushes (0.019 failures/push) were associated with this bug yesterday.

Only 3 of those were non-asan builds. Everything else was wrongly classified.

There were 25 failures in the past week on linux1804-64-tsan opt, windows10-64-shippable opt, debug.

Recent failure log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=305294432&repo=autoland&lineNumber=4670

[task 2020-06-06T00:51:33.121Z] 00:51:33 INFO - TEST-START | dom/xml/test/test_bug691215.html
[task 2020-06-06T00:51:34.825Z] 00:51:34 INFO - GECKO(3250) | MEMORY STAT | vsize 130550474MB | residentFast 414MB
[task 2020-06-06T00:51:35.069Z] 00:51:35 INFO - TEST-OK | dom/xml/test/test_bug691215.html | took 1952ms
[task 2020-06-06T00:51:35.224Z] 00:51:35 INFO - TEST-START | Shutdown
[task 2020-06-06T00:51:35.225Z] 00:51:35 INFO - Passed: 11
[task 2020-06-06T00:51:35.226Z] 00:51:35 INFO - Failed: 0
[task 2020-06-06T00:51:35.228Z] 00:51:35 INFO - Todo: 0
[task 2020-06-06T00:51:35.228Z] 00:51:35 INFO - Mode: e10s
[task 2020-06-06T00:51:35.229Z] 00:51:35 INFO - Slowest: 2051ms - /tests/dom/xml/test/test_bug232004.xhtml
[task 2020-06-06T00:51:35.236Z] 00:51:35 INFO - SimpleTest FINISHED
[task 2020-06-06T00:51:35.240Z] 00:51:35 INFO - TEST-INFO | Ran 1 Loops
[task 2020-06-06T00:51:35.241Z] 00:51:35 INFO - SimpleTest FINISHED
[task 2020-06-06T00:51:37.265Z] 00:51:37 INFO - GECKO(3250) | 1591404697256 Marionette TRACE Received observer notification xpcom-will-shutdown
[task 2020-06-06T00:51:37.265Z] 00:51:37 INFO - GECKO(3250) | 1591404697258 Marionette INFO Stopped listening on port 2828
[task 2020-06-06T00:51:37.265Z] 00:51:37 INFO - GECKO(3250) | 1591404697259 Marionette DEBUG Marionette stopped listening
[task 2020-06-06T00:51:43.490Z] 00:51:43 INFO - GECKO(3250) | [2020-06-06T00:51:43Z ERROR xulstore::persist] removeDocument error: unavailable
[task 2020-06-06T00:51:44.662Z] 00:51:44 INFO - TEST-INFO | Main app process: exit 0
[task 2020-06-06T00:51:44.662Z] 00:51:44 INFO - runtests.py | Application ran for: 0:00:45.356879
[task 2020-06-06T00:51:44.662Z] 00:51:44 INFO - zombiecheck | Reading PID log: /tmp/tmpQVNFbjpidlog
[task 2020-06-06T00:51:44.662Z] 00:51:44 INFO - ==> process 3250 launched child process 3265
[task 2020-06-06T00:51:44.662Z] 00:51:44 INFO - ==> process 3250 launched child process 3363
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - ==> process 3250 launched child process 3379
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - ==> process 3250 launched child process 3449
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - ==> process 3250 launched child process 3474
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - zombiecheck | Checking for orphan process with PID: 3265
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - zombiecheck | Checking for orphan process with PID: 3474
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - zombiecheck | Checking for orphan process with PID: 3363
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - zombiecheck | Checking for orphan process with PID: 3379
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - zombiecheck | Checking for orphan process with PID: 3449
[task 2020-06-06T00:51:44.663Z] 00:51:44 INFO - Stopping web server
[task 2020-06-06T00:51:44.679Z] 00:51:44 INFO - Stopping web socket server
[task 2020-06-06T00:51:44.699Z] 00:51:44 INFO - Stopping ssltunnel
[task 2020-06-06T00:51:44.719Z] 00:51:44 WARNING - leakcheck | refcount logging is off, so leaks can't be detected!
[task 2020-06-06T00:51:44.720Z] 00:51:44 INFO - runtests.py | Running tests: end.
[task 2020-06-06T00:51:44.735Z] 00:51:44 INFO - Buffered messages finished
[task 2020-06-06T00:51:44.735Z] 00:51:44 INFO - Running manifest: intl/uconv/tests/mochitest.ini
[task 2020-06-06T00:51:44.755Z] 00:51:44 INFO - Setting pipeline to PAUSED ...
[task 2020-06-06T00:51:44.756Z] 00:51:44 INFO - Pipeline is PREROLLING ...
[task 2020-06-06T00:51:44.756Z] 00:51:44 INFO - Pipeline is PREROLLED ...
[task 2020-06-06T00:51:44.756Z] 00:51:44 INFO - Setting pipeline to PLAYING ...
[task 2020-06-06T00:51:44.756Z] 00:51:44 INFO - New clock: GstSystemClock
[task 2020-06-06T00:51:44.792Z] 00:51:44 INFO - Got EOS from element "pipeline0".
[task 2020-06-06T00:51:44.792Z] 00:51:44 INFO - Execution ended after 0:00:00.033442265
[task 2020-06-06T00:51:44.792Z] 00:51:44 INFO - Setting pipeline to PAUSED ...
[task 2020-06-06T00:51:44.792Z] 00:51:44 INFO - Setting pipeline to READY ...
[task 2020-06-06T00:51:44.792Z] 00:51:44 INFO - (gst-launch-1.0:3516): GStreamer-CRITICAL **: 00:51:44.788: gst_object_unref: assertion '((GObject *) object)->ref_count > 0' failed
[task 2020-06-06T00:51:44.793Z] 00:51:44 INFO - Setting pipeline to NULL ...
[task 2020-06-06T00:51:44.793Z] 00:51:44 INFO - Freeing pipeline ...
[task 2020-06-06T00:51:44.828Z] 00:51:44 INFO - INFO | runtests.py | TSan using symbolizer at /builds/worker/workspace/build/application/firefox/llvm-symbolizer
[task 2020-06-06T00:51:45.492Z] 00:51:45 INFO - pk12util: PKCS12 IMPORT SUCCESSFUL
[task 2020-06-06T00:51:45.515Z] 00:51:45 INFO - INFO | runtests.py | TSan using symbolizer at /builds/worker/workspace/build/application/firefox/llvm-symbolizer
[task 2020-06-06T00:51:45.516Z] 00:51:45 INFO - INFO | runtests.py | TSan using symbolizer at /builds/worker/workspace/build/application/firefox/llvm-symbolizer
[task 2020-06-06T00:51:45.520Z] 00:51:45 INFO - MochitestServer : launching [u'/builds/worker/workspace/build/tests/bin/xpcshell', '-g', '/builds/worker/workspace/build/application/firefox', '-f', '/builds/worker/workspace/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/tmp/tmpHmTKMs.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/builds/worker/workspace/build/tests/mochitest/server.js']
[task 2020-06-06T00:51:45.520Z] 00:51:45 INFO - runtests.py | Server pid: 3535
[task 2020-06-06T00:51:45.527Z] 00:51:45 INFO - runtests.py | Websocket server pid: 3538
[task 2020-06-06T00:51:45.529Z] 00:51:45 INFO - INFO | runtests.py | TSan using symbolizer at /builds/worker/workspace/build/application/firefox/llvm-symbolizer
[task 2020-06-06T00:51:45.536Z] 00:51:45 INFO - runtests.py | SSL tunnel pid: 3541
[task 2020-06-06T00:51:46.096Z] 00:51:46 INFO - runtests.py | Running with scheme: http
[task 2020-06-06T00:51:46.096Z] 00:51:46 INFO - runtests.py | Running with e10s: True
[task 2020-06-06T00:51:46.096Z] 00:51:46 INFO - runtests.py | Running with fission: False
[task 2020-06-06T00:51:46.096Z] 00:51:46 INFO - runtests.py | Running with serviceworker_e10s: True
[task 2020-06-06T00:51:46.096Z] 00:51:46 INFO - runtests.py | Running with socketprocess_e10s: False
[task 2020-06-06T00:51:46.096Z] 00:51:46 INFO - runtests.py | Running tests: start.
[task 2020-06-06T00:51:46.096Z] 00:51:46 INFO -
[task 2020-06-06T00:51:46.103Z] 00:51:46 INFO - Application command: /builds/worker/workspace/build/application/firefox/firefox -marionette -foreground -profile /tmp/tmpHmTKMs.mozrunner
[task 2020-06-06T00:51:46.111Z] 00:51:46 INFO - runtests.py | Application pid: 3560
[task 2020-06-06T00:51:46.111Z] 00:51:46 INFO - TEST-INFO | started process GECKO(3560)
[task 2020-06-06T00:51:50.547Z] 00:51:50 INFO - GECKO(3560) | 1591404710538 Marionette TRACE Marionette enabled
[task 2020-06-06T00:51:51.293Z] 00:51:51 INFO - GECKO(3560) | 1591404711291 Marionette TRACE Received observer notification toplevel-window-ready
[task 2020-06-06T00:51:58.445Z] 00:51:58 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:58.445Z] 00:51:58 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.069Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.069Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.069Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.069Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.069Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.069Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.073Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.073Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.073Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.073Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.076Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.076Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.076Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.076Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.080Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.080Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.080Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.080Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.081Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.081Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.085Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.085Z] 00:51:59 INFO - GECKO(3560) | ==3657==WARNING: Symbolizer buffer too small
[task 2020-06-06T00:51:59.750Z] 00:51:59 INFO - GECKO(3560) | console.error: SearchService: "_readCacheFile: Error reading cache file:" (new Error("", "(unknown module)"))
[task 2020-06-06T00:52:07.265Z] 00:52:07 INFO - GECKO(3560) | 1591404727257 Marionette TRACE Received observer notification marionette-startup-requested
[task 2020-06-06T00:52:07.265Z] 00:52:07 INFO - GECKO(3560) | 1591404727258 Marionette TRACE Waiting until startup recorder finished recording startup scripts...
[task 2020-06-06T00:52:07.562Z] 00:52:07 INFO - GECKO(3560) | 1591404727552 Marionette TRACE All scripts recorded.
[task 2020-06-06T00:52:08.227Z] 00:52:08 INFO - GECKO(3560) | 1591404728225 Marionette INFO Listening on port 2828
[task 2020-06-06T00:52:08.228Z] 00:52:08 INFO - GECKO(3560) | 1591404728227 Marionette DEBUG Marionette is listening
[task 2020-06-06T00:52:08.296Z] 00:52:08 INFO - GECKO(3560) | 1591404728292 Marionette DEBUG Accepted connection 0 from 127.0.0.1:33824
[task 2020-06-06T00:52:08.372Z] 00:52:08 INFO - GECKO(3560) | 1591404728367 Marionette DEBUG Closed connection 0
[task 2020-06-06T00:52:08.373Z] 00:52:08 INFO - GECKO(3560) | 1591404728372 Marionette DEBUG Accepted connection 1 from 127.0.0.1:33826
[task 2020-06-06T00:52:08.393Z] 00:52:08 INFO - GECKO(3560) | 1591404728382 Marionette DEBUG 1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2020-06-06T00:52:08.718Z] 00:52:08 INFO - GECKO(3560) | 1591404728712 Marionette TRACE [7] Frame script loaded
[task 2020-06-06T00:52:09.061Z] 00:52:09 INFO - GECKO(3560) | 1591404729057 Marionette TRACE [23] Frame script loaded
[task 2020-06-06T00:58:08.484Z] 00:58:08 INFO - runtests.py | Waiting for browser...
[task 2020-06-06T00:58:19.069Z] 00:58:19 INFO - Buffered messages finished
[task 2020-06-06T00:58:19.069Z] 00:58:19 ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
[task 2020-06-06T00:58:19.069Z] 00:58:19 ERROR - Force-terminating active process(es).

Geoff, could you take a look, please?

Flags: needinfo?(gbrown)

These failures can be quite difficult to debug and it has been a long time since I've investigated this type of failure. I will try, but it may take a while to get around to this....

linux1804-64-tsan failures seem to be most frequent currently; let's concentrate on those.

I note warnings like "Symbolizer buffer too small" and "_readCacheFile: Error reading cache file:", but these are seen in successful runs also; I assume they are normal and not cause for concern.

I also see this pattern:

https://firefoxci.taskcluster-artifacts.net/DxwzkYsMSdyEm_4C6Hj6Vw/0/public/logs/live_backing.log

[task 2020-06-08T10:16:03.776Z] 10:16:03     INFO - GECKO(2134) | 1591611363771	Marionette	DEBUG	Accepted connection 0 from 127.0.0.1:47286
[task 2020-06-08T10:16:03.784Z] 10:16:03     INFO - GECKO(2134) | 1591611363778	Marionette	DEBUG	Accepted connection 1 from 127.0.0.1:47288
[task 2020-06-08T10:16:03.784Z] 10:16:03     INFO - GECKO(2134) | 1591611363782	Marionette	DEBUG	Closed connection 0
[task 2020-06-08T10:16:03.900Z] 10:16:03     INFO - GECKO(2134) | 1591611363897	Marionette	DEBUG	1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2020-06-08T10:16:04.197Z] 10:16:04     INFO - GECKO(2134) | 1591611364192	Marionette	TRACE	[7] Frame script loaded
[task 2020-06-08T10:16:04.619Z] 10:16:04     INFO - GECKO(2134) | 1591611364615	Marionette	TRACE	[23] Frame script loaded
[task 2020-06-08T10:22:03.931Z] 10:22:03     INFO - runtests.py | Waiting for browser...
[task 2020-06-08T10:22:14.621Z] 10:22:14     INFO - Buffered messages finished
[task 2020-06-08T10:22:14.623Z] 10:22:14    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output

whereas a successful run would have something like:

[task 2020-06-08T10:15:42.734Z] 10:15:42     INFO - GECKO(1898) | 1591611342731	Marionette	DEBUG	Accepted connection 0 from 127.0.0.1:49634
[task 2020-06-08T10:15:42.835Z] 10:15:42     INFO - GECKO(1898) | 1591611342829	Marionette	DEBUG	Closed connection 0
[task 2020-06-08T10:15:42.838Z] 10:15:42     INFO - GECKO(1898) | 1591611342834	Marionette	DEBUG	Accepted connection 1 from 127.0.0.1:49636
[task 2020-06-08T10:15:42.855Z] 10:15:42     INFO - GECKO(1898) | 1591611342846	Marionette	DEBUG	1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2020-06-08T10:15:43.261Z] 10:15:43     INFO - GECKO(1898) | 1591611343253	Marionette	TRACE	[7] Frame script loaded
[task 2020-06-08T10:15:43.297Z] 10:15:43     INFO - GECKO(1898) | 1591611343288	Marionette	TRACE	[7] Frame script registered
[task 2020-06-08T10:15:43.414Z] 10:15:43     INFO - GECKO(1898) | 1591611343409	Marionette	DEBUG	1 <- [1,1,null,{"sessionId":"1be035ef-44e6-4e0e-9f9d-5a262c9d0d98","capabilities":{"browserName":"firefox","browserVersion":"79.0a ... p/tmpG4SAjf.mozrunner","moz:shutdownTimeout":360000,"moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true}}]
[task 2020-06-08T10:15:43.530Z] 10:15:43     INFO - GECKO(1898) | 1591611343526	Marionette	DEBUG	1 -> [0,2,"Addon:Install",{"path":"/tmp/tmpGlrsm6.zip","temporary":false}]
[task 2020-06-08T10:15:43.888Z] 10:15:43     INFO - GECKO(1898) | 1591611343883	Marionette	TRACE	[23] Frame script loaded
[task 2020-06-08T10:15:43.892Z] 10:15:43     INFO - GECKO(1898) | 1591611343889	Marionette	TRACE	[23] Frame script registered

So in the failure cases, "Frame script registered" is missing.
https://searchfox.org/mozilla-central/rev/fac90408bcf52ca88a3dcd2ef30a379b68ab24e2/testing/marionette/listener.js#512

:whimboo -- Any ideas?

Flags: needinfo?(gbrown) → needinfo?(hskupin)

Marionette was using the outerWindowID to track the browsing contexts in content processes. This might have caused several issues eg. when a remoteness change happened, especially for Fission. But I'm not sure how this can happen for the NewSession command.

Anyway two days ago I landed a patch for Marionette (bug 1519335) which changed the handling of browsing contexts by using their id instead. Here is what I can see:

Frame scripts are registered but still the session is not created:

[task 2020-06-08T23:59:51.107Z] 23:59:51     INFO - GECKO(2237) | 1591660791105	Marionette	DEBUG	1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2020-06-08T23:59:51.446Z] 23:59:51     INFO - GECKO(2237) | 1591660791439	Marionette	TRACE	[5] Frame script loaded
[task 2020-06-08T23:59:51.466Z] 23:59:51     INFO - GECKO(2237) | 1591660791463	Marionette	TRACE	[5] Frame script registered
[task 2020-06-08T23:59:51.574Z] 23:59:51     INFO - GECKO(2237) | 1591660791571	Marionette	DEBUG	1 <- [1,1,null,{"sessionId":"499c6fb3-d260-40c9-8b7d-8eaca00b6582","capabilities":{"browserName":"firefox","browserVersion":"79.0a ... p/tmps70KFy.mozrunner","moz:shutdownTimeout":360000,"moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true}}]
[task 2020-06-08T23:59:51.784Z] 23:59:51     INFO - GECKO(2237) | 1591660791780	Marionette	DEBUG	1 -> [0,2,"Addon:Install",{"path":"/tmp/tmpEgtflB.zip","temporary":false}]
[task 2020-06-08T23:59:52.288Z] 23:59:52     INFO - GECKO(2237) | 1591660792283	Marionette	TRACE	[18] Frame script loaded
[task 2020-06-08T23:59:52.336Z] 23:59:52     INFO - GECKO(2237) | 1591660792331	Marionette	TRACE	[18] Frame script registered

Framescripts aren't registered and session is not created:

[task 2020-06-10T05:03:17.975Z] 05:03:17     INFO - GECKO(1270) | 1591765397967	Marionette	DEBUG	1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2020-06-10T05:03:18.272Z] 05:03:18     INFO - GECKO(1270) | 1591765398260	Marionette	TRACE	[5] Frame script loaded
[task 2020-06-10T05:03:18.574Z] 05:03:18     INFO - GECKO(1270) | 1591765398567	Marionette	TRACE	[18] Frame script loaded

I wonder why we have multiple framescript registrations at all. As you can see in the working case, it should only be a single one. Maybe something screws up with about:blank? I would suggest to add some dump() calls to the following method before we send the registration message to the parent process:

https://searchfox.org/mozilla-central/rev/fac90408bcf52ca88a3dcd2ef30a379b68ab24e2/testing/marionette/listener.js#503

Dump the value of curContainer.frame.location.href to the console. Maybe that gives us an idea what's going on here. If it's both for about:blank we might have a hick-up during startup.

There are also still cases were we do not initialize marionette due to bug 1382162.

Flags: needinfo?(hskupin)

Hm, I just saw the following in a working case:

[task 2020-06-10T05:08:37.541Z] 05:08:37     INFO - GECKO(1260) | 1591765717536	Marionette	DEBUG	1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2020-06-10T05:08:37.637Z] 05:08:37     INFO - GECKO(1260) | 1591765717632	Marionette	TRACE	[5] Frame script loaded
[task 2020-06-10T05:08:37.638Z] 05:08:37     INFO - GECKO(1260) | 1591765717635	Marionette	TRACE	[5] Frame script registered
[task 2020-06-10T05:08:37.641Z] 05:08:37     INFO - GECKO(1260) | 1591765717637	Marionette	TRACE	[18] Frame script loaded
[task 2020-06-10T05:08:37.657Z] 05:08:37     INFO - GECKO(1260) | 1591765717651	Marionette	DEBUG	1 <- [1,1,null,{"sessionId":"b7de4c8f-cb4e-40e3-941d-994381c54c02","capabilities":{"browserName":"firefox","browserVersion":"79.0a ... p/tmpIL2mvA.mozrunner","moz:shutdownTimeout":180000,"moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true}}]
[task 2020-06-10T05:08:37.677Z] 05:08:37     INFO - GECKO(1260) | 1591765717671	Marionette	TRACE	[18] Frame script registered

So maybe something is wrong with the message manager and the message is never received in driver.js?

https://searchfox.org/mozilla-central/rev/fac90408bcf52ca88a3dcd2ef30a379b68ab24e2/testing/marionette/driver.js#603
https://searchfox.org/mozilla-central/rev/fac90408bcf52ca88a3dcd2ef30a379b68ab24e2/testing/marionette/driver.js#3574

Mind also adding some dump() calls there?

https://treeherder.mozilla.org/#/jobs?repo=try&revision=26c84b3d7a959fe46f20eb9ac4d8bbd83c37ac67
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=305946159&repo=try&lineNumber=2832

[task 2020-06-11T14:43:00.750Z] 14:43:00     INFO - GECKO(3317) | 1591886580745	Marionette	INFO	Listening on port 2828
[task 2020-06-11T14:43:00.751Z] 14:43:00     INFO - GECKO(3317) | 1591886580746	Marionette	DEBUG	Marionette is listening
[task 2020-06-11T14:43:00.907Z] 14:43:00     INFO - GECKO(3317) | 1591886580899	Marionette	DEBUG	Accepted connection 0 from 127.0.0.1:46846
[task 2020-06-11T14:43:00.935Z] 14:43:00     INFO - GECKO(3317) | 1591886580928	Marionette	DEBUG	Closed connection 0
[task 2020-06-11T14:43:00.935Z] 14:43:00     INFO - GECKO(3317) | 1591886580932	Marionette	DEBUG	Accepted connection 1 from 127.0.0.1:46848
[task 2020-06-11T14:43:00.951Z] 14:43:00     INFO - GECKO(3317) | 1591886580942	Marionette	DEBUG	1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2020-06-11T14:43:00.951Z] 14:43:00     INFO - GECKO(3317) | 1591886580946	Marionette	INFO	ZZZ registerPromise
[task 2020-06-11T14:43:00.952Z] 14:43:00     INFO - GECKO(3317) | 1591886580947	Marionette	INFO	ZZZ registerPromise promise added listener
[task 2020-06-11T14:43:01.265Z] 14:43:01     INFO - GECKO(3317) | 1591886581253	Marionette	TRACE	[5] Frame script loaded
[task 2020-06-11T14:43:01.266Z] 14:43:01     INFO - GECKO(3317) | 1591886581259	Marionette	TRACE	[5] ZZZ registerSelf curContainer.frame.location.href=
[task 2020-06-11T14:43:01.266Z] 14:43:01     INFO - GECKO(3317) | 1591886581260	Marionette	TRACE	[5] about:blank
[task 2020-06-11T14:43:01.266Z] 14:43:01     INFO - GECKO(3317) | 1591886581262	Marionette	TRACE	[5] ZZZ registerSelf sandboxes cleared
[task 2020-06-11T14:43:01.267Z] 14:43:01     INFO - GECKO(3317) | 1591886581263	Marionette	TRACE	[5] ZZZ registerSelf sending Register...
[task 2020-06-11T14:43:01.603Z] 14:43:01     INFO - GECKO(3317) | 1591886581598	Marionette	TRACE	[18] Frame script loaded
[task 2020-06-11T14:43:01.603Z] 14:43:01     INFO - GECKO(3317) | 1591886581600	Marionette	TRACE	[18] ZZZ registerSelf curContainer.frame.location.href=
[task 2020-06-11T14:43:01.606Z] 14:43:01     INFO - GECKO(3317) | 1591886581602	Marionette	TRACE	[18] about:blank
[task 2020-06-11T14:43:01.606Z] 14:43:01     INFO - GECKO(3317) | 1591886581604	Marionette	TRACE	[18] ZZZ registerSelf sandboxes cleared
[task 2020-06-11T14:43:01.606Z] 14:43:01     INFO - GECKO(3317) | 1591886581604	Marionette	TRACE	[18] ZZZ registerSelf sending Register...
[task 2020-06-11T14:49:01.032Z] 14:49:01     INFO - runtests.py | Waiting for browser...
[task 2020-06-11T14:49:11.623Z] 14:49:11     INFO - Buffered messages finished
[task 2020-06-11T14:49:11.623Z] 14:49:11    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output

So, curContainer.frame.location.href=about:blank, and it looks like we block on sendSyncMessage("Marionette:Register", ...):
https://hg.mozilla.org/try/rev/2a7721020a8f2fac864d90cb8a17e418469c11b4#l2.23

It means calling sendSyncMessage() causes a hang here. I wonder why the message never arrives in the registered frame script. Maybe the message manager got disposed because of another process change and followed by a framescript reload while in the middle of an active communication?

Nika or Kris, any idea why this can happen? Our Fission work will start soon, but not sure how long it will take and maybe we can find a plausible workaround for the above situation in the meantime.

Flags: needinfo?(nika)
Flags: needinfo?(kmaglione+bmo)

(In reply to Henrik Skupin (:whimboo) [⌚️UTC+2] from comment #432)

Nika or Kris, any idea why this can happen? Our Fission work will start soon, but not sure how long it will take and maybe we can find a plausible workaround for the above situation in the meantime.

It's possible that the parent process is hung, and that this is just the first place you're seeing the effects.

Flags: needinfo?(kmaglione+bmo)

(In reply to Kris Maglione [:kmag] from comment #433)

It's possible that the parent process is hung, and that this is just the first place you're seeing the effects.

It would at least explain why that happens on various different places during startup. I wonder what we could do to get a profile of Firefox for the startup, especially when the process is hung and needs to be killed. Do mochitests kill the browser in a way that we get a proper stack where it actually hangs? Something what we do for Talos. I'm asking because I cannot see anything logged.

Flags: needinfo?(nika) → needinfo?(gbrown)

For these tsan hangs, there are no crash reports. Logs show "Can't trigger Breakpad, just killing process", because the tsan mozinfo has crashreporter = false.

Flags: needinfo?(gbrown)

This bug failed 37 times in the past 7 days. Failures occured on linux1804-64-tsan opt, windows10-64-shippable opt, windows10-64-qr debug.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=306878118&repo=autoland&lineNumber=5641

In the last 7 days, there are 83 failures, we can see an increase of the failures since 20th of June https://treeherder.mozilla.org/intermittent-failures.html#/bugdetails?startday=2020-05-29&endday=2020-06-28&tree=trunk&bug=1414495

Failures occurred on linux1804-64 opt, linux1804-64-qr debug, linux1804-64-shippable opt, linux1804-64-tsan opt, windows10-64-shippable opt.

Recent failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=307780434&repo=autoland&lineNumber=4944

This bug failed 25 times in the last 7 days. This happend on linux1804-64-qr debug, linux1804-64-shippable opt, linux1804-64 opt, windows10-64-shippable opt, linux1804-64-tsan opt.

Recent log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=309425873&repo=autoland&lineNumber=2961

In all the cases that I can see the hang seem to happen when the Mochitest extension is waiting for focus on the browser window. There seem to be no more hang during startup because of Marionette.

Geoff, any idea what we could do here to improve the situation? As it looks like slower builds are more likely to fail.

Status: REOPENED → NEW
Flags: needinfo?(florian) → needinfo?(gbrown)
Summary: Intermittent non-asan TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (hang on start) → Intermittent non-asan TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output (Waiting for browser...)

It looks like most of these failures are on ccov now. I wonder if :marco would be interested?

Flags: needinfo?(gbrown) → needinfo?(mcastelluccio)

Probably this is more frequently happening on ccov because it's slower than other builds, but there's unlikely anything wrong specifically with ccov.
IDK if this information could help: ccov is slower at runtime, but it's particularly slower during Firefox shutdown because it needs to dump the counters (and dumping the counters means writing a lot of files).

Flags: needinfo?(mcastelluccio)

So I wonder why the browser window doesn't get focus so that the tests will be started. So I assume it's around these lines:
https://searchfox.org/mozilla-central/rev/2398ab33adcea896838b3da678ff6480dbb98b9a/testing/mochitest/browser-harness.xhtml#260-273

Geoff, maybe we could add a log line when the focus of the harness window happened and before we are going to wait for the focus for the test window? Maybe this could give us some more insights.

Flags: needinfo?(gbrown)

Yes, logging might help.

Also, I notice this pattern frequently:

[task 2022-06-08T04:52:41.763Z] 04:52:41     INFO - GECKO(4824) | [CodeCoverage] JS flush completed.
[task 2022-06-08T04:52:41.835Z] 04:52:41     INFO - GECKO(4824) | JavaScript error: resource://testing-common/PerTestCoverageUtils.jsm, line 72: NS_ERROR_FILE_DIR_NOT_EMPTY: Component returned failure code: 0x80520014 (NS_ERROR_FILE_DIR_NOT_EMPTY) [nsIFile.remove]
[task 2022-06-08T04:58:51.844Z] 04:58:51     INFO - Buffered messages finished
[task 2022-06-08T04:58:51.845Z] 04:58:51    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
[task 2022-06-08T04:58:51.845Z] 04:58:51    ERROR - Force-terminating active process(es).

I wonder if avoiding or handling the NS_ERROR_FILE_DIR_NOT_EMPTY would resolve the timeout failures.

Flags: needinfo?(gbrown)

Maybe a fix for bug 1784591 could help here as well.

Depends on: 1784591

(In reply to Intermittent Failures Robot from comment #558)

  • windows10-64-2004-shippable-qr: 23
    • opt: 23
  • windows10-32-2004-shippable-qr: 11
    • opt: 11

Note that all of these failures are actually related to bug 1784591.

No longer depends on: 1784591
Severity: major → S2

Recent builds show a problem with loading the LogController in the SimpleTest extension:
https://treeherder.mozilla.org/logviewer?job_id=399444797&repo=autoland&lineNumber=2360-2363

[task 2022-12-13T08:29:44.303Z] 08:29:44     INFO - runtests.py | Waiting for browser...
[task 2022-12-13T08:29:44.303Z] 08:29:44     INFO - GECKO(1275) | 1670920184303	Marionette	DEBUG	Closed connection 1
[task 2022-12-13T08:29:44.506Z] 08:29:44     INFO - GECKO(1275) | JavaScript error: https://example.com/tests/SimpleTest/setup.js, line 21: ReferenceError: LogController is not defined
[task 2022-12-13T08:29:44.514Z] 08:29:44     INFO - GECKO(1275) | JavaScript error: https://example.com/tests/SimpleTest/setup.js, line 287: TypeError: can't access property "manifestFile", params is undefined

This is most likely the reason why it fails starting the tests.

Flags: needinfo?(gbrown)

I haven't made changes to SimpleTest in a long time. Very speculatively, I wonder if this error might somehow be a regression from bug 1801368?

Flags: needinfo?(standard8)

(In reply to Geoff Brown [:gbrown] from comment #580)

I haven't made changes to SimpleTest in a long time. Very speculatively, I wonder if this error might somehow be a regression from bug 1801368?

I don't think so, it didn't touch anything to do with LogController, though I'm happy to be wrong.

It would probably be worth looking at how setup.js is getting loaded and if TestRunner.js has also been loaded in that scope, as per the comment in the file.

Flags: needinfo?(standard8)

For non-chrome tests (like these mochitest-media tests that intermittently fail) it looks like setup.js is only loaded from server.js:
https://searchfox.org/mozilla-central/rev/57527d50ef5d3df412caa5d99536f0709399be6f/testing/mochitest/server.js#787

TestRunner.js, and LogController.js, are loaded just before that,
https://searchfox.org/mozilla-central/rev/57527d50ef5d3df412caa5d99536f0709399be6f/testing/mochitest/server.js#769-780

server.js is loaded from runtests.py:
https://searchfox.org/mozilla-central/rev/57527d50ef5d3df412caa5d99536f0709399be6f/testing/mochitest/runtests.py#539
when it runs the mochitest server in xpcshell.

I don't see how that could be going wrong.

Flags: needinfo?(gbrown)

Yeah, I don't see anything obvious. I'd suggest that we file a separate bug and look at it there, rather than trying to debug it here. I don't think it is related to my bug, but it could be. In any case, I probably won't have time to look before the holidays.

There have been 67 total failures in the last 7 days, recent failure log.
Affected platforms are:

  • linux1804-64-ccov-qr
  • linux1804-64-shippable-qr
  • macosx1015-64-qr
  • macosx1015-64-shippable-qr
  • macosx1100-64-shippable-qr
  • windows10-64-2004-ccov-qr
[task 2023-02-18T18:37:19.668Z] 18:37:19     INFO - TEST-START | tools/profiler/tests/browser/browser_test_profile_slow_capture.js
[task 2023-02-18T18:37:19.668Z] 18:37:19     INFO - TEST-SKIP | tools/profiler/tests/browser/browser_test_profile_slow_capture.js | took 1ms
[task 2023-02-18T18:37:19.669Z] 18:37:19     INFO - Running manifest: browser/actors/test/browser/browser.ini
[task 2023-02-18T18:37:21.174Z] 18:37:21     INFO - PID 1311 | pk12util: PKCS12 IMPORT SUCCESSFUL
[task 2023-02-18T18:37:21.233Z] 18:37:21     INFO - MochitestServer : launching ['/opt/worker/tasks/task_167674531544934/build/tests/bin/xpcshell', '-g', '/opt/worker/tasks/task_167674531544934/build/application/Firefox Nightly.app/Contents/Resources', '-f', '/opt/worker/tasks/task_167674531544934/build/tests/bin/components/httpd.js', '-e', "const _PROFILE_PATH = '/var/folders/fz/cnm2dyxn5gb6ldkjy3p_xm10000014/T/tmpyrh4yeoj.mozrunner'; const _SERVER_PORT = '8888'; const _SERVER_ADDR = '127.0.0.1'; const _TEST_PREFIX = undefined; const _DISPLAY_RESULTS = false;", '-f', '/opt/worker/tasks/task_167674531544934/build/tests/mochitest/server.js']
[task 2023-02-18T18:37:21.233Z] 18:37:21     INFO - runtests.py | Server pid: 1312
[task 2023-02-18T18:37:21.236Z] 18:37:21     INFO - runtests.py | Websocket server pid: 1313
[task 2023-02-18T18:37:21.240Z] 18:37:21     INFO - runtests.py | SSL tunnel pid: 1314
[task 2023-02-18T18:37:21.451Z] 18:37:21     INFO - runtests.py | Running with scheme: http
[task 2023-02-18T18:37:21.451Z] 18:37:21     INFO - runtests.py | Running with e10s: True
[task 2023-02-18T18:37:21.451Z] 18:37:21     INFO - runtests.py | Running with fission: True
[task 2023-02-18T18:37:21.452Z] 18:37:21     INFO - runtests.py | Running with cross-origin iframes: False
[task 2023-02-18T18:37:21.452Z] 18:37:21     INFO - runtests.py | Running with serviceworker_e10s: True
[task 2023-02-18T18:37:21.453Z] 18:37:21     INFO - runtests.py | Running with socketprocess_e10s: False
[task 2023-02-18T18:37:21.453Z] 18:37:21     INFO - runtests.py | Running tests: start.
[task 2023-02-18T18:37:21.453Z] 18:37:21     INFO - 
[task 2023-02-18T18:37:21.459Z] 18:37:21     INFO - Application command: /opt/worker/tasks/task_167674531544934/build/application/Firefox Nightly.app/Contents/MacOS/firefox -marionette -foreground -profile /var/folders/fz/cnm2dyxn5gb6ldkjy3p_xm10000014/T/tmpyrh4yeoj.mozrunner
[task 2023-02-18T18:37:21.463Z] 18:37:21     INFO - runtests.py | Application pid: 1315
[task 2023-02-18T18:37:21.463Z] 18:37:21     INFO - TEST-INFO | started process GECKO(1315)
[task 2023-02-18T18:37:21.818Z] 18:37:21     INFO - GECKO(1315) | 1676745441817	Marionette	INFO	Marionette enabled
[task 2023-02-18T18:37:21.844Z] 18:37:21     INFO - GECKO(1315) | 1676745441843	Marionette	TRACE	Received observer notification final-ui-startup
[task 2023-02-18T18:37:21.846Z] 18:37:21     INFO - GECKO(1315) | 1676745441845	Marionette	INFO	Listening on port 2828
[task 2023-02-18T18:37:21.846Z] 18:37:21     INFO - GECKO(1315) | 1676745441845	Marionette	DEBUG	Marionette is listening
[task 2023-02-18T18:37:22.263Z] 18:37:22     INFO - GECKO(1315) | 1676745442263	Marionette	DEBUG	Accepted connection 0 from 127.0.0.1:49255
[task 2023-02-18T18:37:22.280Z] 18:37:22     INFO - GECKO(1315) | 1676745442280	Marionette	DEBUG	Closed connection 0
[task 2023-02-18T18:37:22.281Z] 18:37:22     INFO - GECKO(1315) | 1676745442280	Marionette	DEBUG	Accepted connection 1 from 127.0.0.1:49256
[task 2023-02-18T18:37:22.356Z] 18:37:22     INFO - GECKO(1315) | 1676745442355	Marionette	DEBUG	1 -> [0,1,"WebDriver:NewSession",{"strictFileInteractability":true}]
[task 2023-02-18T18:37:22.359Z] 18:37:22     INFO - GECKO(1315) | 1676745442358	Marionette	DEBUG	Waiting for initial application window
[task 2023-02-18T18:37:23.319Z] 18:37:23     INFO - GECKO(1315) | console.warn: SearchSettings: "get: No settings file exists, new profile?" (new NotFoundError("Could not open the file at /var/folders/fz/cnm2dyxn5gb6ldkjy3p_xm10000014/T/tmpyrh4yeoj.mozrunner/search.json.mozlz4", (void 0)))
[task 2023-02-18T18:37:23.335Z] 18:37:23     INFO - GECKO(1315) | console.error: (new Error("Unexpected content-type \"text/plain;charset=US-ASCII\"", "resource://services-settings/Utils.jsm", 407))
[task 2023-02-18T18:37:24.066Z] 18:37:24     INFO - GECKO(1315) | console.error: (new Error("Unexpected content-type \"text/plain;charset=US-ASCII\"", "resource://services-settings/Utils.jsm", 407))
[task 2023-02-18T18:37:24.067Z] 18:37:24     INFO - GECKO(1315) | console.error: (new Error("Unexpected content-type \"text/plain;charset=US-ASCII\"", "resource://services-settings/Utils.jsm", 407))
[task 2023-02-18T18:37:24.067Z] 18:37:24     INFO - GECKO(1315) | console.error: (new Error("Unexpected content-type \"text/plain;charset=US-ASCII\"", "resource://services-settings/Utils.jsm", 407))
[task 2023-02-18T18:37:24.207Z] 18:37:24     INFO - GECKO(1315) | 1676745444206	Marionette	TRACE	Received observer notification browser-idle-startup-tasks-finished
[task 2023-02-18T18:37:24.211Z] 18:37:24     INFO - GECKO(1315) | 1676745444210	RemoteAgent	TRACE	[10] ProgressListener Start: expectNavigation=false resolveWhenStarted=false unloadTimeout=200 waitForExplicitStart=false
[task 2023-02-18T18:37:24.211Z] 18:37:24     INFO - GECKO(1315) | 1676745444210	RemoteAgent	TRACE	[10] ProgressListener Setting unload timer (200ms)
[task 2023-02-18T18:37:24.212Z] 18:37:24     INFO - GECKO(1315) | 1676745444210	RemoteAgent	TRACE	[10] Document already finished loading: about:blank
[task 2023-02-18T18:37:24.212Z] 18:37:24     INFO - GECKO(1315) | 1676745444210	RemoteAgent	TRACE	[10] ProgressListener Stop: has error=false
[task 2023-02-18T18:37:24.215Z] 18:37:24     INFO - GECKO(1315) | 1676745444213	Marionette	DEBUG	1 <- [1,1,null,{"sessionId":"3b346205-a81c-42da-b065-6946c513c663","capabilities":{"browserName":"firefox","browserVersion":"112.0a1","platformName":"mac","acceptInsecureCerts":false,"pageLoadStrategy":"normal","setWindowRect":true,"timeouts":{"implicit":0,"pageLoad":300000,"script":30000},"strictFileInteractability":true,"unhandledPromptBehavior":"dismiss and notify","moz:accessibilityChecks":false,"moz:buildID":"20230218180146","moz:headless":false,"moz:platformVersion":"19.6.0","moz:processID":1315,"moz:profile":"/var/folders/fz/cnm2dyxn5gb6ldkjy3p_xm10000014/T/tmpyrh4yeoj.mozrunner","moz:shutdownTimeout":60000,"moz:useNonSpecCompliantPointerOrigin":false,"moz:webdriverClick":true,"moz:windowless":false,"proxy":{}}}]
[task 2023-02-18T18:37:24.226Z] 18:37:24     INFO - GECKO(1315) | 1676745444225	Marionette	DEBUG	1 -> [0,2,"Addon:Install",{"path":"/var/folders/fz/cnm2dyxn5gb6ldkjy3p_xm10000014/T/tmphmy0w1nq.zip","temporary":false}]
[task 2023-02-18T18:37:24.264Z] 18:37:24     INFO - GECKO(1315) | 1676745444263	Marionette	DEBUG	1 <- [1,2,null,{"value":"special-powers@mozilla.org"}]
[task 2023-02-18T18:37:24.287Z] 18:37:24     INFO - GECKO(1315) | 1676745444286	Marionette	DEBUG	1 -> [0,3,"Addon:Install",{"path":"/var/folders/fz/cnm2dyxn5gb6ldkjy3p_xm10000014/T/tmpaf8yog7e.zip","temporary":false}]
[task 2023-02-18T18:37:24.305Z] 18:37:24     INFO - GECKO(1315) | 1676745444304	Marionette	DEBUG	1 <- [1,3,null,{"value":"mochikit@mozilla.org"}]
[task 2023-02-18T18:37:24.306Z] 18:37:24     INFO - GECKO(1315) | 1676745444305	Marionette	DEBUG	1 -> [0,4,"Marionette:GetContext",{}]
[task 2023-02-18T18:37:24.307Z] 18:37:24     INFO - GECKO(1315) | 1676745444306	Marionette	DEBUG	1 <- [1,4,null,{"value":"content"}]
[task 2023-02-18T18:37:24.307Z] 18:37:24     INFO - GECKO(1315) | 1676745444306	Marionette	DEBUG	1 -> [0,5,"Marionette:SetContext",{"value":"chrome"}]
[task 2023-02-18T18:37:24.308Z] 18:37:24     INFO - GECKO(1315) | 1676745444306	Marionette	DEBUG	1 <- [1,5,null,{"value":null}]
[task 2023-02-18T18:37:24.310Z] 18:37:24     INFO - GECKO(1315) | 1676745444309	Marionette	DEBUG	1 -> [0,6,"WebDriver:ExecuteScript",{"script":"/* This Source Code Form is subject to the terms of the Mozilla Public\n * License, v. 2.0. If a copy of the MPL was not distr ... s which flavor and url to load.\nlet ev = new CustomEvent(\"mochitest-load\", { detail: [flavor, url] });\nwin.dispatchEvent(ev);","args":[{"flavor":"browser-chrome","testUrl":"about:blank"}],"newSandbox":true,"sandbox":"default","line":2061,"filename":"tests/mochitest/runtests.py"}]
[task 2023-02-18T18:37:24.312Z] 18:37:24     INFO - GECKO(1315) | 1676745444311	RemoteAgent	TRACE	WebDriverProcessData actor created for PID 1315
[task 2023-02-18T18:37:24.312Z] 18:37:24     INFO - GECKO(1315) | 1676745444312	Marionette	TRACE	[2] MarionetteCommands actor created for window id 4
[task 2023-02-18T18:37:24.336Z] 18:37:24     INFO - GECKO(1315) | 1676745444336	Marionette	TRACE	Received observer notification domwindowopened
[task 2023-02-18T18:37:24.344Z] 18:37:24     INFO - GECKO(1315) | 1676745444343	Marionette	DEBUG	1 <- [1,6,null,{"value":null}]
[task 2023-02-18T18:37:24.355Z] 18:37:24     INFO - GECKO(1315) | 1676745444354	Marionette	DEBUG	1 -> [0,7,"Marionette:SetContext",{"value":"content"}]
[task 2023-02-18T18:37:24.355Z] 18:37:24     INFO - GECKO(1315) | 1676745444354	Marionette	DEBUG	1 <- [1,7,null,{"value":null}]
[task 2023-02-18T18:37:24.368Z] 18:37:24     INFO - GECKO(1315) | 1676745444367	Marionette	DEBUG	1 -> [0,8,"WebDriver:DeleteSession",{}]
[task 2023-02-18T18:37:24.368Z] 18:37:24     INFO - GECKO(1315) | 1676745444367	Marionette	TRACE	[2] MarionetteCommands actor destroyed for window id 4
[task 2023-02-18T18:37:24.369Z] 18:37:24     INFO - GECKO(1315) | 1676745444368	Marionette	DEBUG	1 <- [1,8,null,{"value":null}]
[task 2023-02-18T18:37:24.369Z] 18:37:24     INFO - runtests.py | Waiting for browser...
[task 2023-02-18T18:37:24.370Z] 18:37:24     INFO - GECKO(1315) | 1676745444369	Marionette	DEBUG	Closed connection 1
[task 2023-02-18T18:37:25.399Z] 18:37:25     INFO - GECKO(1315) | JavaScript error: resource://gre/modules/XULStore.jsm, line 58: Error: Can't find profile directory.
[task 2023-02-18T18:43:35.469Z] 18:43:35     INFO - Buffered messages finished
[task 2023-02-18T18:43:35.469Z] 18:43:35    ERROR - TEST-UNEXPECTED-TIMEOUT | automation.py | application timed out after 370 seconds with no output
[task 2023-02-18T18:43:35.470Z] 18:43:35  WARNING - Force-terminating active process(es).
[task 2023-02-18T18:43:35.470Z] 18:43:35     INFO - Determining child pids from psutil...
[task 2023-02-18T18:43:35.472Z] 18:43:35     INFO - [1317, 1318, 1320, 1321, 1322, 1323, 1324]
[task 2023-02-18T18:43:35.472Z] 18:43:35     INFO - ==> process 1315 launched child process 1317
[task 2023-02-18T18:43:35.473Z] 18:43:35     INFO - ==> process 1315 launched child process 1318
[task 2023-02-18T18:43:35.473Z] 18:43:35     INFO - ==> process 1315 launched child process 1320
[task 2023-02-18T18:43:35.473Z] 18:43:35     INFO - ==> process 1315 launched child process 1321
[task 2023-02-18T18:43:35.474Z] 18:43:35     INFO - ==> process 1315 launched child process 1322
[task 2023-02-18T18:43:35.474Z] 18:43:35     INFO - ==> process 1315 launched child process 1323
[task 2023-02-18T18:43:35.474Z] 18:43:35     INFO - ==> process 1315 launched child process 1324
[task 2023-02-18T18:43:35.475Z] 18:43:35     INFO - Found child pids: {1317, 1318, 1320, 1321, 1322, 1323, 1324}
[task 2023-02-18T18:43:35.475Z] 18:43:35     INFO - Killing process: 1317
[task 2023-02-18T18:43:35.476Z] 18:43:35     INFO - TEST-INFO | started process screencapture
[task 2023-02-18T18:43:35.580Z] 18:43:35     INFO - TEST-INFO | screencapture: exit 0

Update:

There have been 30 failures within the last 7 days:

  • 4 failures on Linux 18.04 x64 CCov WebRender opt

  • 7 failures on Linux 18.04 x64 WebRender debug/opt

  • 1 failure on Linux 18.04 x64 WebRender Shippable opt

  • 3 failures on OS X 11 WebRender Shippable opt

  • 2 failures on Windows 11 x86 22H2 WebRender Shippable opt

  • 1 failure on Windows 11 x64 22H2 asan WebRender opt

  • 8 failures on Windows 11 x64 22H2 CCov WebRender opt

  • 4 failures on Windows 11 x64 22H2 WebRender opt

  • Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=414235141&repo=mozilla-central&lineNumber=2816

Whiteboard: [stockwell infra] [see summary at comment 139] → [stockwell infra] [see summary at comment 139][stockwell needswork:owner]

Looks like there is one asan failure that made its way on this bug, so that would actually render the occurrences to 29.

Update:

There have been 42 failures within the last 7 days:

  • 5 failures on Linux 18.04 x64 CCov WebRender opt
  • 11 failures on Linux 18.04 x64 WebRender opt
  • 1 failure on Linux 18.04 x64 WebRender Shippable opt
  • 4 failures on OS X 11 WebRender Shippable opt
  • 1 failure on Windows 11 x86 22H2 MinGW WebRender debug
  • 13 failures on Windows 11 x64 22H2 CCov WebRender opt
  • 2 failures on Windows 11 x64 22H2 MinGW WebRender debug
  • 5 failures on Windows 11 x64 22H2 WebRender debug/opt

Recent log: https://treeherder.mozilla.org/logviewer?job_id=415688556&repo=mozilla-central&lineNumber=2201

Mark, Geoff, are you still looking into this?
Thank you.

Flags: needinfo?(standard8)
Flags: needinfo?(gbrown)

I don't think the issues here are related to anything I've changed.

Flags: needinfo?(standard8)

I am not investigating this.

Flags: needinfo?(gbrown)

Update

There have been 36 total failures within the last 7 days:

  • 7 failures on Linux 18.04 x64 CCov WebRender opt
  • 1 failure on Linux 18.04 x64 WebRender debug
  • 1 failure on Linux 18.04 x64 WebRender Shippable opt
  • 3 failures on OS X 11 WebRender Shippable opt
  • 1 failure on Windows 11 x64 22H2 asan WebRender opt
  • 14 failures on Windows 11 x64 22H2 CCov WebRender opt
  • 7 failures on Windows 11 x64 22H2 WebRender debug/opt
  • 2 failures on Windows 11 x64 22H2 WebRender Shippable opt

Recent log: https://treeherder.mozilla.org/logviewer?job_id=423511007&repo=autoland&lineNumber=2228

Duplicate of this bug: 1843204

Update

There have been 33 failures within the last 7 days:

  • 2 failures on Linux 18.04 x64 CCov WebRender opt
  • 4 failures on Linux 18.04 x64 WebRender opt/debug
  • 6 failures on OS X 11 WebRender Shippable opt
  • 1 failure on Windows 11 x86 22H2 MinGW WebRender debug
  • 1 failure on Windows 11 x86 22H2 WebRender debug
  • 2 failures on Windows 11 x86 22H2 WebRender Shippable opt
  • 1 failure on Windows 11 x64 22H2 asan WebRender opt
  • 6 failures on Windows 11 x64 22H2 CCov WebRender opt
  • 7 failures on Windows 11 x64 22H2 WebRender debug/opt
  • 3 failures on Windows 11 x64 22H2 WebRender Shippable opt

Recent failure log: https://treeherder.mozilla.org/logviewer?job_id=427104672&repo=mozilla-central&lineNumber=3291

You need to log in before you can comment on or make changes to this bug.