Closed
Bug 1418575
Opened 7 years ago
Closed 7 years ago
Intermittent reftest | application timed out after 370 seconds with no output
Categories
(Testing :: Reftest, defect, P5)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: intermittent-bug-filer, Unassigned)
References
(Depends on 1 open bug)
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])
Filed by: nbeleuzu [at] mozilla.com
https://treeherder.mozilla.org/logviewer.html#?job_id=145637201&repo=mozilla-inbound
https://queue.taskcluster.net/v1/task/GQcNEgk3Rem4ZLDV8GTnIg/runs/0/artifacts/public/logs/live_backing.log
https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/GQcNEgk3Rem4ZLDV8GTnIg/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 6•7 years ago
|
||
It looks like most recent failures here are on startup, similar in appearance to bug 1414495, (but with a slightly different title, for reftest). Bug 1414495 is waiting on a needinfo currently.
Depends on: 1414495
Updated•7 years ago
|
Whiteboard: [stockwell needswork]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 10•7 years ago
|
||
There are 40 failures in the past 7 days, most of them on windows10-64, windows10-64-qr and one on windows10-64-ccov, opt and debug.
Recent log example: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-central&job_id=153572745
And a relevant part of the log:
nsStringStats
1733
23:10:28 INFO - => mAllocCount: 14287
1734
23:10:28 INFO - => mReallocCount: 112
1735
23:10:28 INFO - => mFreeCount: 14286 -- LEAKED 1 !!!
1736
23:10:28 INFO - => mShareCount: 7533
1737
23:10:28 INFO - => mAdoptCount: 1628
1738
23:10:28 INFO - => mAdoptFreeCount: 1628
1739
23:10:28 INFO - => Process ID: 9236, Thread ID: 9240
1740
23:10:28 INFO - DLL blocklist was unable to intercept AppInit DLLs.
1741
23:10:28 INFO - ### XPCOM_MEM_BLOAT_LOG defined -- logging bloat/leaks to c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\runreftest_leaks_tab_pid9784.log
1742
23:10:28 INFO - [Child 9784, Main Thread] WARNING: No CID found when attempting to map contract ID: file z:/build/build/src/xpcom/components/nsComponentManager.cpp, line 508
1743
23:10:30 INFO - --DOCSHELL 0000023E726A2800 == 7 [pid = 9140] [id = {1629306b-b100-4255-b0ba-f785b6dd951f}]
1744
23:10:30 INFO - --DOCSHELL 0000023E76C4C000 == 6 [pid = 9140] [id = {f8870708-5c93-4fc3-ab4b-91d9401353cf}]
1745
23:10:30 INFO - --DOCSHELL 0000023E75116000 == 5 [pid = 9140] [id = {13e6f8e3-4414-4009-8cb0-12d8d89cb225}]
1746
23:10:30 INFO - --DOCSHELL 0000023E74BE8000 == 4 [pid = 9140] [id = {3e18e2e6-aba4-4e2c-86fc-f5c81c3d4e96}]
1747
23:10:31 INFO - --DOMWINDOW == 18 (0000023E71AC83F0) [pid = 9140] [serial = 8] [outer = 0000000000000000] [url = about:blank]
1748
23:10:33 INFO - --DOCSHELL 000002A8A2340000 == 1 [pid = 8652] [id = {041af833-57cf-4884-b033-bfeaac86faef}]
1749
23:10:35 INFO - --DOMWINDOW == 17 (0000023E71E6CC00) [pid = 9140] [serial = 4] [outer = 0000000000000000] [url = about:blank]
1750
23:10:35 INFO - --DOMWINDOW == 16 (0000023E72071800) [pid = 9140] [serial = 13] [outer = 0000000000000000] [url = about:blank]
1751
23:10:35 INFO - --DOMWINDOW == 15 (0000023E72073800) [pid = 9140] [serial = 14] [outer = 0000000000000000] [url = about:blank]
1752
23:10:35 INFO - --DOMWINDOW == 14 (0000023E78041800) [pid = 9140] [serial = 11] [outer = 0000000000000000] [url = about:blank]
1753
23:10:37 INFO - --DOMWINDOW == 5 (000002A8AC51D400) [pid = 8652] [serial = 5] [outer = 0000000000000000] [url = about:blank]
1754
23:10:37 INFO - --DOMWINDOW == 4 (000002A8A6992800) [pid = 8652] [serial = 2] [outer = 0000000000000000] [url = about:blank]
1755
23:10:38 INFO - --DOMWINDOW == 13 (0000023E71ACA4E0) [pid = 9140] [serial = 9] [outer = 0000000000000000] [url = about:blank]
1756
23:10:40 INFO - [DEBUG SHUTDOWN] Shutdown: decoder=000002A8AC7CB000 state machine=000002A8AC7FE000
1757
23:10:40 INFO - --DOMWINDOW == 3 (000002A8A67B6470) [pid = 8652] [serial = 1] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
1758
23:10:40 INFO - [DEBUG SHUTDOWN] Enter: state machine=000002A8A5A20980 reader=000002A8A69EB670
1759
23:10:40 INFO - [DEBUG SHUTDOWN] Shutdown: reader=000002A8AC7FB000 shutdown demuxer=000002A8A5A20280
1760
23:10:40 INFO - [DEBUG SHUTDOWN] Shutdown: pool=000002A8AC5EFB80 count=1
1761
23:10:40 INFO - [DEBUG SHUTDOWN] operator (): pool=000002A8AC5EFB80 shutdown=true count=0
1762
23:10:40 INFO - [DEBUG SHUTDOWN] TearDownDecoders: reader=000002A8AC7FB000 shut down audio task queue
1763
23:10:40 INFO - [DEBUG SHUTDOWN] TearDownDecoders: reader=000002A8AC7FB000 shut down video task queue
1764
23:10:40 INFO - [DEBUG SHUTDOWN] FinishShutdown: state machine=000002A8AC7FE000
1765
23:10:40 INFO - [DEBUG SHUTDOWN] Unregister: decoder=000002A8AC7CB000, count=0
1766
23:10:41 INFO - [Parent 9140, Main Thread] WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80520012: file z:/build/build/src/dom/fetch/FetchDriver.cpp, line 696
1767
23:10:41 INFO - --DOMWINDOW == 12 (0000023E76AB0C00) [pid = 9140] [serial = 10] [outer = 0000000000000000] [url = about:blank]
1768
23:10:43 INFO - --DOMWINDOW == 2 (000002A8AC572400) [pid = 8652] [serial = 3] [outer = 0000000000000000] [url = chrome://gfxsanity/content/sanitytest.html]
1769
23:10:43 INFO - --DOMWINDOW == 2 (0000016D269F6C00) [pid = 9440] [serial = 2] [outer = 0000000000000000] [url = about:blank]
1770
23:16:53 INFO - REFTEST ERROR | reftest | application timed out after 370 seconds with no output
1771
23:16:53 INFO - REFTEST ERROR | Force-terminating active process(es).
1772
23:16:53 INFO - REFTEST TEST-INFO | started process screenshot
1773
23:16:53 INFO - REFTEST TEST-INFO | screenshot: exit 0
1774
23:16:53 INFO - TEST-INFO | crashinject: exit 0
1775
23:16:54 WARNING - TEST-UNEXPECTED-FAIL | reftest | application terminated with exit code 1
1776
23:16:54 INFO - REFTEST INFO | Copy/paste: Z:\task_1514588294\build\win32-minidump_stackwalk.exe c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\minidumps\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp Z:\task_1514588294\build\symbols
1777
23:17:06 INFO - REFTEST INFO | Saved minidump as Z:\task_1514588294\build\blobber_upload_dir\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp
1778
23:17:06 INFO - REFTEST INFO | Saved app info as Z:\task_1514588294\build\blobber_upload_dir\7a64de14-1cbd-4174-b541-c65a261e9b90.extra
1779
23:17:06 ERROR - REFTEST PROCESS-CRASH | reftest | application crashed [@ CrashingThread(void *)]
1780
23:17:06 INFO - Crash dump filename: c:\users\genericworker\appdata\local\temp\tmpjgu35o.mozrunner\minidumps\7a64de14-1cbd-4174-b541-c65a261e9b90.dmp
1781
23:17:06 INFO - Operating system: Windows NT
1782
23:17:06 INFO - 10.0.15063
1783
23:17:06 INFO - CPU: amd64
1784
23:17:06 INFO - family 6 model 45 stepping 7
1785
23:17:06 INFO - 8 CPUs
1786
23:17:06 INFO - GPU: UNKNOWN
1787
23:17:06 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_WRITE
1788
23:17:06 INFO - Crash address: 0x0
1789
23:17:06 INFO - Process uptime: 395 seconds
1790
23:17:06 INFO - Thread 72 (crashed)
:gbrown Any updates here?
Flags: needinfo?(gbrown)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 13•7 years ago
|
||
(In reply to Noemi Erli[:noemi_erli] from comment #10)
> :gbrown Any updates here?
I had been waiting for progress on bug 1414495, as indicated in comment 6.
However, I noticed that many recent failures are in Windows 10 crashtests and most of those have similar screenshots, which I have not seen before; I filed bug 1427946 to try to understand those screenshots better.
Flags: needinfo?(gbrown)
Comment 14•7 years ago
|
||
I think the fix for bug 1427946 has helped a lot: I no longer see those TeamViewer screenshots, and there are fewer failures reported here in the last 24 hours or so.
Remaining failures might require progress on bug 1414495.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 17•7 years ago
|
||
most of these failures are in QR builds- lets see if :kats might have ideas?
Flags: needinfo?(bugmail)
Updated•7 years ago
|
Whiteboard: [stockwell disable-recommended] → [stockwell needswork]
Comment 18•7 years ago
|
||
All the QR failures I looked at seem to have a configmymonitor.exe startup failure, see example screenshot: https://public-artifacts.taskcluster.net/SzkAyxr8ScSrcK4jNablXg/0/public/test_info/mozilla-test-fail-screenshot_ricvh2.png
So this seems to point to a configuration issue similar to the TeamViewer thing. Also according to OF the last one of these that happened on a QR build was jan 5, so maybe it's not an issue any more.
Flags: needinfo?(bugmail)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 21•7 years ago
|
||
There have been 41 failures in the last 7 days and 168 failures in teh last 21 days.
For the failure pattern see Comment 10
Recent failure log:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=157051190&lineNumber=1733
I checked a few logs and noticed that the reftest analyzer is not loading properly.
Example: https://hg.mozilla.org/mozilla-central/raw-file/tip/layout/tools/reftest/reftest-analyzer.xhtml#logurl=https://queue.taskcluster.net/v1/task/ID5xSavmQmyp1-hqAkayvw/runs/0/artifacts/public/logs/live_backing.log&only_show_unexpected=1
:jmaher there are more than 150 failures in the last 21 days, but the failure rate went down in the last week. Should we still consider disabling this test?
Thank you!
Flags: needinfo?(jmaher)
Comment 22•7 years ago
|
||
what is interesting here is that out of 36 failures in the last week, only 3 are on hardware (i.e. win10 reftest) and the rest are on VM (i.e. taskcluster)- also 2 of the failures are not win10 (1xlinux64, 1xwin7).
Given that reftests run at the same rate as crashtests or jsreftests, I would conclude that any focus here is on the VMs, and not reftests on hardware.
I looked at some logs and saw some differences between the hardware and vms:
hardware:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=157030919&lineNumber=1949
vm:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=156698165&lineNumber=1675
on hardware we do not have statements like:
[DEBUG SHUTDOWN] Shutdown: decoder=00000227C5E85800 state machine=00000227CBA96000
The above statement is from the above mentioned VM log- which is an opt build. So either we are not hitting any initialization when on hardware, or there is something different in the OS/Env which is causing this difference.
The screenshot from the vm in the above log is:
https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/test_info/mozilla-test-fail-screenshot_fdtd7g.png
there is a reference to a failure for configmymonitor.exe which cannot start- not sure if this is related or not, but we should fix that.
A few differences in the environment variables:
1) live logging ports specified for VM/Taskcluster
2) KTS/monitoring for Hardware/Buildbot
3) taskcluster is missing these MOZ_ env variables:
'MOZ_CRASHREPORTER_NO_REPORT': '1',
'MOZ_HIDE_RESULTS_TABLE': '1',
'MOZ_NO_REMOTE': '1',
'NO_EM_RESTART': '1',
'NO_FAIL_ON_TEST_ERRORS': '1',
'XPCOM_DEBUG_BREAK': 'warn'
^ NOTE: I have a webrender link above which has different env vars, but all VM logs are missing these env vars.
I cannot find a spot in the scripts where we set these environment variables- many harnesses set them, but it seems as though these are typically setup at the OS level and inherited prior to running.
:grenade, do you know how to determine on windows 10 what the environment variables are by default on the IX hardware vs the VM environment?
Flags: needinfo?(jmaher) → needinfo?(rthijssen)
Comment 23•7 years ago
|
||
i would just use task creator [1] to run a task that runs a single command of: `set` [2]. once on a hw worker type and once on a vm. that should give a list of the vars set for the user that runs the tests. the default vars would differ for the admin user so it's important to list the vars that are set for the user that actually runs tasks. generic-worker creates the user that runs the tasks and also sets a number of environment and profile configurations for that user so it's more relevant to get the env settings at the beginning of a task run than env settings at machine configuration time which would be quite a different thing.
[1]: https://tools.taskcluster.net/task-creator/
[2]: https://ss64.com/nt/set.html
Flags: needinfo?(rthijssen)
Comment 24•7 years ago
|
||
ok, I did the edit task and added some env vars, they seem to pass through to the test harness- it looks like I could set them here in the docker image creation area:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/docker_image.py#148
in addition I see other places we set similar variables:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/mozharness_test.py#251
likewise, possibly we ensure the harness is set properly (we would need to do this for ALL harnesses to be more complete.
Comment 25•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #22)
> The screenshot from the vm in the above log is:
> https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/
> test_info/mozilla-test-fail-screenshot_fdtd7g.png
>
> there is a reference to a failure for configmymonitor.exe which cannot
> start- not sure if this is related or not, but we should fix that.
thanks for spotting that. i've now removed configmymonitor and fakemon from these instances as it doesn't work on windows 10 thus far.
https://github.com/mozilla-releng/OpenCloudConfig/commit/a3453c1e0cb373a381a53d0f3b3d733668cad41d
Comment 26•7 years ago
|
||
here is a case where I have added environment variables:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=c21026ce655b91ecbf061363c84ba268f27f5709&selectedJob=157409919
looking at the jsreftest log, I see we time out, then start up again and run all the tests successfully:
https://public-artifacts.taskcluster.net/Qu13jFheQ86LgnlvXvBAKA/0/public/logs/live_backing.log
same for the crashtest failure:
https://public-artifacts.taskcluster.net/LrrI6B4tSA-dNu_VSyLA6A/0/public/logs/live_backing.log
I noticed the first time we launched the browser I see this:
01:10:33 INFO - REFTEST INFO | Application command: Z:\task_1516410509\build\application\firefox\firefox.exe -marionette -profile c:\users\genericworker\appdata\local\temp\tmpufuw0c.mozrunner
then the second time, I see this:
01:17:09 INFO - REFTEST INFO | Application command: Z:\task_1516410509\build\application\firefox\firefox.exe -marionette -marionette -profile c:\users\genericworker\appdata\local\temp\tmpdtfhav.mozrunner
why do we have two -marionette args- there must be a loop somewhere. This also proves the machine is valid and workable- we just didn't start the test after launching the browser.
Comment hidden (Intermittent Failures Robot) |
Comment 28•7 years ago
|
||
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #26)
> why do we have two -marionette args- there must be a loop somewhere. This
> also proves the machine is valid and workable- we just didn't start the test
> after launching the browser.
I think we run tests until success, to allow for resume-after-crash:
https://dxr.mozilla.org/mozilla-central/rev/5faab9e619901b1513fd4ca137747231be550def/layout/tools/reftest/runreftest.py#787
"-marionette" will be added every time, but I don't think that makes a difference to anything.
Comment hidden (Intermittent Failures Robot) |
Comment 30•7 years ago
|
||
I don't know what is going wrong here.
We recently added marionette trace logging to reftests and we have some failure logs with that additional logging now -- but the marionette trace logging looks normal to me.
Logs suggest to me that firefox starts up normally, but then hangs. On the other hand, screenshots for these failures do not show firefox displayed - odd.
Comment 31•7 years ago
|
||
(In reply to Rob Thijssen (:grenade UTC+2) from comment #25)
> (In reply to Joel Maher ( :jmaher) (UTC-5) from comment #22)
> > The screenshot from the vm in the above log is:
> > https://public-artifacts.taskcluster.net/HUe7OyOZQ0ifBBbN56Og2w/0/public/
> > test_info/mozilla-test-fail-screenshot_fdtd7g.png
> >
> > there is a reference to a failure for configmymonitor.exe which cannot
> > start- not sure if this is related or not, but we should fix that.
>
> thanks for spotting that. i've now removed configmymonitor and fakemon from
> these instances as it doesn't work on windows 10 thus far.
> https://github.com/mozilla-releng/OpenCloudConfig/commit/
> a3453c1e0cb373a381a53d0f3b3d733668cad41d
:grenade -- Some recent screenshots still show configmymonitor errors.
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=158892135&lineNumber=1810
https://public-artifacts.taskcluster.net/DePvXD_2STKi6w8mLvRoFw/1/public/test_info/mozilla-test-fail-screenshot_rpfod_.png
Flags: needinfo?(rthijssen)
Comment 32•7 years ago
|
||
the removal patch went in but was reverted due to an unrelated problem. an ami redeploy should fix it. in progress now:
https://tools.taskcluster.net/groups/BWLMji0_Qmek1p3KCl8HqQ/tasks/M10X4t_vRtCW7BYmRLypAQ/details
Flags: needinfo?(rthijssen)
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 35•7 years ago
|
||
No new failures since Feb 8! (I don't know why. Maybe comment 32...?)
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(gbrown)
Resolution: --- → WORKSFORME
Comment hidden (Intermittent Failures Robot) |
Updated•7 years ago
|
Comment hidden (Intermittent Failures Robot) |
You need to log in
before you can comment on or make changes to this bug.
Description
•