Closed Bug 1418575 Opened 7 years ago Closed 7 years ago

Intermittent reftest | application timed out after 370 seconds with no output

Categories

(Testing :: Reftest, defect, P5)

Version 3
defect

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

It looks like most recent failures here are on startup, similar in appearance to bug 1414495, (but with a slightly different title, for reftest). Bug 1414495 is waiting on a needinfo currently.
Depends on: 1414495
Whiteboard: [stockwell needswork]
There are 40 failures in the past 7 days, most of them on windows10-64, windows10-64-qr and one on windows10-64-ccov, opt and debug. Recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=153572745 And a relevant part of the log: nsStringStats 1733 23:10:28 INFO - => mAllocCount: 14287 1734 23:10:28 INFO - => mReallocCount: 112 1735 23:10:28 INFO - => mFreeCount: 14286 -- LEAKED 1 !!! 1736 23:10:28 INFO - => mShareCount: 7533 1737 23:10:28 INFO - => mAdoptCount: 1628 1738 23:10:28 INFO - => mAdoptFreeCount: 1628 1739 23:10:28 INFO - => Process ID: 9236, Thread ID: 9240 1740 23:10:28 INFO - DLL blocklist was unable to intercept AppInit DLLs. 1741 23:10:28 INFO - ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\runreftest_leaks_tab_pid9784.log 1742 23:10:28 INFO - [Child 9784, Main Thread] WARNING: No CID found when attempting to map contract ID: file z:/build/build/src/xpcom/components/nsComponentManager.cpp, line 508 1743 23:10:30 INFO - --DOCSHELL 0000023E726A2800 == 7 [pid = 9140] [id = {1629306b-b100-4255-b0ba-f785b6dd951f}] 1744 23:10:30 INFO - --DOCSHELL 0000023E76C4C000 == 6 [pid = 9140] [id = {f8870708-5c93-4fc3-ab4b-91d9401353cf}] 1745 23:10:30 INFO - --DOCSHELL 0000023E75116000 == 5 [pid = 9140] [id = {13e6f8e3-4414-4009-8cb0-12d8d89cb225}] 1746 23:10:30 INFO - --DOCSHELL 0000023E74BE8000 == 4 [pid = 9140] [id = {3e18e2e6-aba4-4e2c-86fc-f5c81c3d4e96}] 1747 23:10:31 INFO - --DOMWINDOW == 18 (0000023E71AC83F0) [pid = 9140] [serial = 8] [outer = 0000000000000000] [url = about:blank] 1748 23:10:33 INFO - --DOCSHELL 000002A8A2340000 == 1 [pid = 8652] [id = {041af833-57cf-4884-b033-bfeaac86faef}] 1749 23:10:35 INFO - --DOMWINDOW == 17 (0000023E71E6CC00) [pid = 9140] [serial = 4] [outer = 0000000000000000] [url = about:blank] 1750 23:10:35 INFO - --DOMWINDOW == 16 (0000023E72071800) [pid = 9140] [serial = 13] [outer = 0000000000000000] [url = about:blank] 1751 23:10:35 INFO - --DOMWINDOW == 15 (0000023E72073800) [pid = 9140] [serial = 14] [outer = 0000000000000000] [url = about:blank] 1752 23:10:35 INFO - --DOMWINDOW == 14 (0000023E78041800) [pid = 9140] [serial = 11] [outer = 0000000000000000] [url = about:blank] 1753 23:10:37 INFO - --DOMWINDOW == 5 (000002A8AC51D400) [pid = 8652] [serial = 5] [outer = 0000000000000000] [url = about:blank] 1754 23:10:37 INFO - --DOMWINDOW == 4 (000002A8A6992800) [pid = 8652] [serial = 2] [outer = 0000000000000000] [url = about:blank] 1755 23:10:38 INFO - --DOMWINDOW == 13 (0000023E71ACA4E0) [pid = 9140] [serial = 9] [outer = 0000000000000000] [url = about:blank] 1756 23:10:40 INFO - [DEBUG SHUTDOWN] Shutdown: decoder=000002A8AC7CB000 state machine=000002A8AC7FE000 1757 23:10:40 INFO - --DOMWINDOW == 3 (000002A8A67B6470) [pid = 8652] [serial = 1] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html] 1758 23:10:40 INFO - [DEBUG SHUTDOWN] Enter: state machine=000002A8A5A20980 reader=000002A8A69EB670 1759 23:10:40 INFO - [DEBUG SHUTDOWN] Shutdown: reader=000002A8AC7FB000 shutdown demuxer=000002A8A5A20280 1760 23:10:40 INFO - [DEBUG SHUTDOWN] Shutdown: pool=000002A8AC5EFB80 count=1 1761 23:10:40 INFO - [DEBUG SHUTDOWN] operator (): pool=000002A8AC5EFB80 shutdown=true count=0 1762 23:10:40 INFO - [DEBUG SHUTDOWN] TearDownDecoders: reader=000002A8AC7FB000 shut down audio task queue 1763 23:10:40 INFO - [DEBUG SHUTDOWN] TearDownDecoders: reader=000002A8AC7FB000 shut down video task queue 1764 23:10:40 INFO - [DEBUG SHUTDOWN] FinishShutdown: state machine=000002A8AC7FE000 1765 23:10:40 INFO - [DEBUG SHUTDOWN] Unregister: decoder=000002A8AC7CB000, count=0 1766 23:10:41 INFO - [Parent 9140, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/dom/fetch/FetchDriver.cpp, line 696 1767 23:10:41 INFO - --DOMWINDOW == 12 (0000023E76AB0C00) [pid = 9140] [serial = 10] [outer = 0000000000000000] [url = about:blank] 1768 23:10:43 INFO - --DOMWINDOW == 2 (000002A8AC572400) [pid = 8652] [serial = 3] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html] 1769 23:10:43 INFO - --DOMWINDOW == 2 (0000016D269F6C00) [pid = 9440] [serial = 2] [outer = 0000000000000000] [url = about:blank] 1770 23:16:53 INFO - REFTEST ERROR | reftest | application timed out after 370 seconds with no output 1771 23:16:53 INFO - REFTEST ERROR | Force-terminating active process(es). 1772 23:16:53 INFO - REFTEST TEST-INFO | started process screenshot 1773 23:16:53 INFO - REFTEST TEST-INFO | screenshot: exit 0 1774 23:16:53 INFO - TEST-INFO | crashinject: exit 0 1775 23:16:54 WARNING - TEST-UNEXPECTED-FAIL | reftest | application terminated with exit code 1 1776 23:16:54 INFO - REFTEST INFO | Copy/paste: Z:\task_1514588294\build\win32-minidump_stackwalk.exe c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\minidumps\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp Z:\task_1514588294\build\symbols 1777 23:17:06 INFO - REFTEST INFO | Saved minidump as Z:\task_1514588294\build\blobber_upload_dir\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp 1778 23:17:06 INFO - REFTEST INFO | Saved app info as Z:\task_1514588294\build\blobber_upload_dir\7a64de14-1cbd-4174-b541-c65a261e9b90.extra 1779 23:17:06 ERROR - REFTEST PROCESS-CRASH | reftest | application crashed [@ CrashingThread(void *)] 1780 23:17:06 INFO - Crash dump filename: c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\minidumps\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp 1781 23:17:06 INFO - Operating system: Windows NT 1782 23:17:06 INFO - 10.0.15063 1783 23:17:06 INFO - CPU: amd64 1784 23:17:06 INFO - family 6 model 45 stepping 7 1785 23:17:06 INFO - 8 CPUs 1786 23:17:06 INFO - GPU: UNKNOWN 1787 23:17:06 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE 1788 23:17:06 INFO - Crash address: 0x0 1789 23:17:06 INFO - Process uptime: 395 seconds 1790 23:17:06 INFO - Thread 72 (crashed) :gbrown Any updates here?
Flags: needinfo?(gbrown)
Depends on: 1427946
(In reply to Noemi Erli[:noemi_erli] from comment #10) > :gbrown Any updates here? I had been waiting for progress on bug 1414495, as indicated in comment 6. However, I noticed that many recent failures are in Windows 10 crashtests and most of those have similar screenshots, which I have not seen before; I filed bug 1427946 to try to understand those screenshots better.
Flags: needinfo?(gbrown)
I think the fix for bug 1427946 has helped a lot: I no longer see those TeamViewer screenshots, and there are fewer failures reported here in the last 24 hours or so. Remaining failures might require progress on bug 1414495.
most of these failures are in QR builds- lets see if :kats might have ideas?
Flags: needinfo?(bugmail)
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
All the QR failures I looked at seem to have a configmymonitor.exe startup failure, see example screenshot: https://public-artifacts.taskcluster.net/SzkAyxr8ScSrcK4jNablXg/0/public/test_info/mozilla-test-fail-screenshot_ricvh2.png So this seems to point to a configuration issue similar to the TeamViewer thing. Also according to OF the last one of these that happened on a QR build was jan 5, so maybe it's not an issue any more.
Flags: needinfo?(bugmail)
There have been 41 failures in the last 7 days and 168 failures in teh last 21 days. For the failure pattern see Comment 10 Recent failure log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=157051190&lineNumber=1733 I checked a few logs and noticed that the reftest analyzer is not loading properly. Example: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/ID5xSavmQmyp1-hqAkayvw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1 :jmaher there are more than 150 failures in the last 21 days, but the failure rate went down in the last week. Should we still consider disabling this test? Thank you!
Flags: needinfo?(jmaher)
what is interesting here is that out of 36 failures in the last week, only 3 are on hardware (i.e. win10 reftest) and the rest are on VM (i.e. taskcluster)- also 2 of the failures are not win10 (1xlinux64, 1xwin7). Given that reftests run at the same rate as crashtests or jsreftests, I would conclude that any focus here is on the VMs, and not reftests on hardware. I looked at some logs and saw some differences between the hardware and vms: hardware: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=157030919&lineNumber=1949 vm: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=156698165&lineNumber=1675 on hardware we do not have statements like: [DEBUG SHUTDOWN] Shutdown: decoder=00000227C5E85800 state machine=00000227CBA96000 The above statement is from the above mentioned VM log- which is an opt build. So either we are not hitting any initialization when on hardware, or there is something different in the OS/Env which is causing this difference. The screenshot from the vm in the above log is: https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/test_info/mozilla-test-fail-screenshot_fdtd7g.png there is a reference to a failure for configmymonitor.exe which cannot start- not sure if this is related or not, but we should fix that. A few differences in the environment variables: 1) live logging ports specified for VM/Taskcluster 2) KTS/monitoring for Hardware/Buildbot 3) taskcluster is missing these MOZ_ env variables: 'MOZ_CRASHREPORTER_NO_REPORT': '1', 'MOZ_HIDE_RESULTS_TABLE': '1', 'MOZ_NO_REMOTE': '1', 'NO_EM_RESTART': '1', 'NO_FAIL_ON_TEST_ERRORS': '1', 'XPCOM_DEBUG_BREAK': 'warn' ^ NOTE: I have a webrender link above which has different env vars, but all VM logs are missing these env vars. I cannot find a spot in the scripts where we set these environment variables- many harnesses set them, but it seems as though these are typically setup at the OS level and inherited prior to running. :grenade, do you know how to determine on windows 10 what the environment variables are by default on the IX hardware vs the VM environment?
Flags: needinfo?(jmaher) → needinfo?(rthijssen)
i would just use task creator [1] to run a task that runs a single command of: `set` [2]. once on a hw worker type and once on a vm. that should give a list of the vars set for the user that runs the tests. the default vars would differ for the admin user so it's important to list the vars that are set for the user that actually runs tasks. generic-worker creates the user that runs the tasks and also sets a number of environment and profile configurations for that user so it's more relevant to get the env settings at the beginning of a task run than env settings at machine configuration time which would be quite a different thing. [1]: https://tools.taskcluster.net/task-creator/ [2]: https://ss64.com/nt/set.html
Flags: needinfo?(rthijssen)
ok, I did the edit task and added some env vars, they seem to pass through to the test harness- it looks like I could set them here in the docker image creation area: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/docker_image.py#148 in addition I see other places we set similar variables: https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/mozharness_test.py#251 likewise, possibly we ensure the harness is set properly (we would need to do this for ALL harnesses to be more complete.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #22) > The screenshot from the vm in the above log is: > https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/ > test_info/mozilla-test-fail-screenshot_fdtd7g.png > > there is a reference to a failure for configmymonitor.exe which cannot > start- not sure if this is related or not, but we should fix that. thanks for spotting that. i've now removed configmymonitor and fakemon from these instances as it doesn't work on windows 10 thus far. https://github.com/mozilla-releng/OpenCloudConfig/commit/a3453c1e0cb373a381a53d0f3b3d733668cad41d
here is a case where I have added environment variables: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c21026ce655b91ecbf061363c84ba268f27f5709&selectedJob=157409919 looking at the jsreftest log, I see we time out, then start up again and run all the tests successfully: https://public-artifacts.taskcluster.net/Qu13jFheQ86LgnlvXvBAKA/0/public/logs/live_backing.log same for the crashtest failure: https://public-artifacts.taskcluster.net/LrrI6B4tSA-dNu_VSyLA6A/0/public/logs/live_backing.log I noticed the first time we launched the browser I see this: 01:10:33 INFO - REFTEST INFO | Application command: Z:\task_1516410509\build\application\firefox\firefox.exe -marionette -profile c:\users\genericworker\appdata\local\temp\tmpufuw0c.mozrunner then the second time, I see this: 01:17:09 INFO - REFTEST INFO | Application command: Z:\task_1516410509\build\application\firefox\firefox.exe -marionette -marionette -profile c:\users\genericworker\appdata\local\temp\tmpdtfhav.mozrunner why do we have two -marionette args- there must be a loop somewhere. This also proves the machine is valid and workable- we just didn't start the test after launching the browser.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #26) > why do we have two -marionette args- there must be a loop somewhere. This > also proves the machine is valid and workable- we just didn't start the test > after launching the browser. I think we run tests until success, to allow for resume-after-crash: https://dxr.mozilla.org/mozilla-central/rev/5faab9e619901b1513fd4ca137747231be550def/layout/tools/reftest/runreftest.py#787 "-marionette" will be added every time, but I don't think that makes a difference to anything.
Depends on: 1434123
I don't know what is going wrong here. We recently added marionette trace logging to reftests and we have some failure logs with that additional logging now -- but the marionette trace logging looks normal to me. Logs suggest to me that firefox starts up normally, but then hangs. On the other hand, screenshots for these failures do not show firefox displayed - odd.
(In reply to Rob Thijssen (:grenade UTC+2) from comment #25) > (In reply to Joel Maher ( :jmaher) (UTC-5) from comment #22) > > The screenshot from the vm in the above log is: > > https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/ > > test_info/mozilla-test-fail-screenshot_fdtd7g.png > > > > there is a reference to a failure for configmymonitor.exe which cannot > > start- not sure if this is related or not, but we should fix that. > > thanks for spotting that. i've now removed configmymonitor and fakemon from > these instances as it doesn't work on windows 10 thus far. > https://github.com/mozilla-releng/OpenCloudConfig/commit/ > a3453c1e0cb373a381a53d0f3b3d733668cad41d :grenade -- Some recent screenshots still show configmymonitor errors. https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=158892135&lineNumber=1810 https://public-artifacts.taskcluster.net/DePvXD_2STKi6w8mLvRoFw/1/public/test_info/mozilla-test-fail-screenshot_rpfod_.png
Flags: needinfo?(rthijssen)
the removal patch went in but was reverted due to an unrelated problem. an ami redeploy should fix it. in progress now: https://tools.taskcluster.net/groups/BWLMji0_Qmek1p3KCl8HqQ/tasks/M10X4t_vRtCW7BYmRLypAQ/details
Flags: needinfo?(rthijssen)
No new failures since Feb 8! (I don't know why. Maybe comment 32...?)
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Geoff, I think we can close this bug?
Flags: needinfo?(gbrown)
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(gbrown)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.