Bug 1378002 (Closed) - Opened 7 years ago, Closed 5 years ago

Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, ts_paint_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]

Categories

(Testing :: Talos, defect, P3)

Version: 3

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(1 file)

Summary: Intermittent PROCESS-CRASH | tps | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy | application crashed [unknown top frame]
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, glvideo | application crashed [unknown top frame]
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, glvideo | application crashed [unknown top frame]
Blocks: 1422525
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]
There have been 39 total failures in the last 7 days. There are failures across 3 platforms and 2 build types:
- Windows 7: 18
- windows10-64: 12
- OS X 10.10: 9
- opt: 29
- pgo: 10

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=159981381&lineNumber=1974

16:19:54 INFO - mozcrash Downloading symbols from: https://queue.taskcluster.net/v1/task/HLdGg4LuRKGRsrIIESRRUA/artifacts/public/build/target.crashreporter-symbols.zip
16:20:03 INFO - mozcrash Copy/paste: C:\slave\test\build\win32-minidump_stackwalk.exe c:\users\cltbld~1.t-w\appdata\local\temp\tmptjw_b3\profile\minidumps\4b13ff50-6d26-4605-adc4-1e1f8300b7fd.dmp c:\users\cltbld~1.t-w\appdata\local\temp\tmpfd76kq
16:20:11 INFO - mozcrash Saved minidump as C:\slave\test\build\blobber_upload_dir\4b13ff50-6d26-4605-adc4-1e1f8300b7fd.dmp
16:20:11 INFO - PROCESS-CRASH | tp5o_scroll | application crashed [unknown top frame]
Flags: needinfo?(rwood)
Whiteboard: [stockwell needswork]
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]
Lots of failures, but we are moving to the new Windows hardware next week; I am fine waiting until then.
This bug has failed 70 times in the last 7 days, mainly on Windows 7 and Windows 10, but there are a few occurrences on OS X 10.10, affecting pgo and opt build types.

Failing tests: talos-tp6-stylo-threads-e10, talos-tp6-e10.

Link to a recent log:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=162317413&lineNumber=1540

Part of that log:

IndexError: list index out of range
18:15:09 INFO - During handling of the above exception, another exception occurred:
18:15:09 INFO - Traceback (most recent call last):
18:15:09 INFO - File "<string>", line 1, in <module>
18:15:09 INFO - File "C:\Users\CLTBLD~1.T-W\AppData\Local\Temp\pip-build-a91etw0y\cffi\setup.py", line 113, in <module>
18:15:09 INFO - if sys.platform == 'win32' and uses_msvc():
18:15:09 INFO - File "C:\Users\CLTBLD~1.T-W\AppData\Local\Temp\pip-build-a91etw0y\cffi\setup.py", line 91, in uses_msvc
18:15:09 INFO - return config.try_compile('#ifndef _MSC_VER\n#error "not MSVC"\n#endif')
18:15:09 INFO - File "C:\slave\test\build\python3.6\lib\distutils\command\config.py", line 227, in try_compile
18:15:09 INFO - self._compile(body, headers, include_dirs, lang)
18:15:09 INFO - File "C:\slave\test\build\python3.6\lib\distutils\command\config.py", line 133, in _compile
18:15:09 INFO - self.compiler.compile([src], include_dirs=include_dirs)
18:15:09 INFO - File "C:\slave\test\build\python3.6\lib\distutils\_msvccompiler.py", line 304, in compile
18:15:09 INFO - self.initialize()
18:15:09 INFO - File "C:\slave\test\build\python3.6\lib\distutils\_msvccompiler.py", line 197, in initialize
18:15:09 INFO - vc_env = _get_vc_env(plat_spec)
18:15:09 INFO - File "c:\slave\test\py3venv\lib\site-packages\setuptools\msvc.py", line 172, in msvc14_get_vc_env
18:15:09 INFO - return EnvironmentInfo(plat_spec, vc_min_ver=14.0).return_env()
18:15:09 INFO - File "c:\slave\test\py3venv\lib\site-packages\setuptools\msvc.py", line 771, in __init__
18:15:09 INFO - self.si = SystemInfo(self.ri, vc_ver)
18:15:09 INFO - File "c:\slave\test\py3venv\lib\site-packages\setuptools\msvc.py", line 480, in __init__
18:15:09 INFO - raise distutils.errors.DistutilsPlatformError(err)
18:15:09 INFO - distutils.errors.DistutilsPlatformError: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

:rwood did you have the chance to look at this bug?
Over the last 7 days there are 65 failures on this bug. These happen on Linux x64, macosx64-nightly, OS X 10.10, Windows 7, and windows10-64.

Here is the most recent log example:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=163916283&lineNumber=2682

Here is a relevant part of that log:

13:18:45 INFO - Found by: given as instruction pointer in context
13:18:45 INFO - TEST-UNEXPECTED-ERROR | damp | Found crashes after test run, terminating test
13:18:45 ERROR - Traceback (most recent call last):
13:18:45 INFO - File "C:\Users\task_1519387965\build\tests\talos\talos\run_tests.py", line 296, in run_tests
13:18:45 INFO - talos_results.add(mytest.runTest(browser_config, test))
13:18:45 INFO - File "C:\Users\task_1519387965\build\tests\talos\talos\ttest.py", line 62, in runTest
13:18:45 INFO - return self._runTest(browser_config, test_config, setup)
13:18:45 INFO - File "C:\Users\task_1519387965\build\tests\talos\talos\ttest.py", line 209, in _runTest
13:18:45 INFO - test_config['name'])
13:18:45 INFO - File "C:\Users\task_1519387965\build\tests\talos\talos\ttest.py", line 46, in check_for_crashes
13:18:45 INFO - raise TalosCrash('Found crashes after test run, terminating test')
13:18:45 INFO - TalosCrash: Found crashes after test run, terminating test
13:18:45 INFO - TEST-INFO took 3607888ms
13:18:45 INFO - SUITE-END | took 3607s
13:18:45 ERROR - Return code: 2
13:18:45 WARNING - setting return code to 2
13:18:45 ERROR - # TBPL FAILURE #
These are primarily Windows failures, and we just made a major change to the Windows hardware yesterday. I wonder what this will look like a week from now.
There have been 85 failures in the last week and 207 failures in the last 21 days. For the last 7 days I see that the failures occur on:

Platforms:
- OS X 10.10: 48
- windows10-64: 22
- Windows 7: 10
- macosx64-nightly: 3
- linux64-stylo-disabled: 1
- linux64-nightly: 1

Build type:
- opt: 65
- pgo: 19
- debug: 1

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=165736923&lineNumber=3687

04:13:02 INFO - mozcrash Copy/paste: /Users/cltbld/tasks/task_1520074748/build/macosx64-minidump_stackwalk /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmp_gExnn
04:13:02 INFO - mozcrash Saved minidump as /Users/cltbld/tasks/task_1520074748/build/blobber_upload_dir/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp
04:13:02 INFO - PROCESS-CRASH | damp | application crashed [unknown top frame]
04:13:02 INFO - Crash dump filename: /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp
04:13:02 INFO - stderr from minidump_stackwalk:
04:13:02 INFO - 2018-03-03 04:13:02: minidump.cc:4359: INFO: Minidump opened minidump /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp
04:13:02 INFO - 2018-03-03 04:13:02: minidump.cc:4808: ERROR: ReadBytes: read 0/32
04:13:02 INFO - 2018-03-03 04:13:02: minidump.cc:4453: ERROR: Minidump cannot read header
04:13:02 INFO - 2018-03-03 04:13:02: stackwalk.cc:133: ERROR: Minidump /var/folders/46/s7s5j29s62v4stfn569xx97m00000w/T/tmprtPiG6/profile/minidumps/79C62B57-26C8-4CB1-AD2A-53A839CD82B5.dmp could not be read

All of the bug dependencies have been solved. :jmaher do you think we should wait some more on disabling this? About 41% of failures are on OS X 10.10 / opt in the last week.
Flags: needinfo?(jmaher)
I will look into this more. This is the g2 job failing on OS X and Windows; the profiling jobs on nightly are different issues, and a random browser-chrome failure is mixed in there. We will disable or fix this. Thanks for filing!
The main crash here is damp, and this is related to the logs we see:

03:33:36 INFO - PID 965 | Open toolbox on 'complicated.inspector'
03:33:37 INFO - PID 965 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
03:33:37 INFO - PID 965 | 2018-03-03 03:33:37.195 plugin-container[999:66350] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x9057, name = 'com.apple.tsm.portname'
03:33:37 INFO - PID 965 | See /usr/include/servers/bootstrap_defs.h for the error codes.
03:33:38 INFO - PID 965 | Garbage collect
03:33:38 INFO - PID 965 | Reload page on 'complicated.inspector'
03:33:42 INFO - PID 965 | Close toolbox on 'complicated.inspector'
03:33:42 INFO - PID 965 | Garbage collect
03:33:44 INFO - PID 965 | Open toolbox on 'complicated.jsdebugger'
03:33:44 INFO - PID 965 | Unable to read VR Path Registry from /Users/cltbld/Library/Application Support/OpenVR/.openvr/openvrpaths.vrpath
03:33:44 INFO - PID 965 | 2018-03-03 03:33:44.814 plugin-container[1000:66676] *** CFMessagePort: bootstrap_register(): failed 1100 (0x44c) 'Permission denied', port = 0x953b, name = 'com.apple.tsm.portname'
03:33:44 INFO - PID 965 | See /usr/include/servers/bootstrap_defs.h for the error codes.
03:33:45 INFO - PID 965 | Waiting for state change: count sources
03:33:45 INFO - PID 965 | Finished waiting for state change: count sources
03:33:45 INFO - PID 965 | Selecting source: ga.js
03:33:45 INFO - PID 965 | Waiting for state change: selected source
04:30:46 INFO - Timeout waiting for test completion; killing browser...
04:30:46 INFO - Terminating psutil.Process(pid=965, name='firefox', started='03:30:46')
04:30:46 INFO - PID 965 | [GFX1-]: Receive IPC close with reason=AbnormalShutdown
04:30:46 INFO - PID 965 | ** Unknown exception behavior: -2147483647
04:30:46 INFO - PID 965 | ** Unknown exception behavior: -2147483647
04:30:46 INFO - PID 965 | ** Unknown exception behavior: -2147483647

OSX seems to be the most affected. :ochameau, are you familiar with any recent changes in damp which would cause the above hang?
Flags: needinfo?(rwood)
Flags: needinfo?(poirot.alex)
Flags: needinfo?(jmaher)
Not that I know of. It seems to have increased significantly around early February.

Yulia, any idea why this method would never resolve:
https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/damp.js#167-188

selectSource(dbg, url) {
  dump(`Selecting source: ${url}\n`);
  const line = 1;
  const source = this.findSource(dbg, url);
  dbg.actions.selectLocation({ sourceId: source.get("id"), line });
  return this.waitForState(
    dbg,
    state => {
      const source = dbg.selectors.getSelectedSource(state);
      const isLoaded = source && source.get("loadedState") === "loaded";
      if (!isLoaded) {
        return false;
      }
      // wait for symbols -- a flat map of all named variables in a file -- to be calculated.
      // this is a slow process and becomes slower the larger the file is
      return dbg.selectors.hasSymbols(state, source.toJS());
    },
    "selected source"
  );
}

I'm wondering if hasSymbols can be the issue here? Or is loadedState never ending up with "loaded"?

DAMP times out here intermittently:
https://searchfox.org/mozilla-central/source/testing/talos/talos/tests/devtools/addon/content/damp.js#821
Flags: needinfo?(poirot.alex) → needinfo?(ystartsev)
I'm taking a look. In late January the sources and symbols were both refactored, and this could be caused by that. For now I will split the test so we have a better idea of why it's failing.
Assignee: nobody → ystartsev
Flags: needinfo?(ystartsev)
I looked into the code and I am having a hard time seeing what might be causing this. I also reran the test on Talos 30 times and it didn't fail. For now I have made a change that adds more logging to the test, so that we can narrow down the issue: whether it has to do with the source never loading or the symbols never loading. It is not a fix, but it might give us more information.
Comment on attachment 8959608 [details]
Bug 1378002 - add more information to track intermittent process crash.

https://reviewboard.mozilla.org/r/228424/#review234556

This change adds quite a lot of logs:

02:49:55 INFO - PID 28724 | Waiting for state change: selected source
02:49:55 INFO - PID 28724 | Checking if source is loaded
02:49:56 INFO - PID 28724 | Checking if source is loaded
02:49:56 INFO - PID 28724 | Waiting for symbols
02:49:56 INFO - PID 28724 | Checking if source is loaded
02:49:56 INFO - PID 28724 | Waiting for symbols
02:49:56 INFO - PID 28724 | Checking if source is loaded
02:49:56 INFO - PID 28724 | Waiting for symbols
02:49:56 INFO - PID 28724 | Checking if source is loaded
02:49:56 INFO - PID 28724 | Waiting for symbols
02:49:56 INFO - PID 28724 | Checking if source is loaded
02:49:56 INFO - PID 28724 | Waiting for symbols
02:49:56 INFO - PID 28724 | Checking if source is loaded
02:49:56 INFO - PID 28724 | Waiting for symbols
02:49:57 INFO - PID 28724 | Checking if source is loaded
02:49:57 INFO - PID 28724 | Waiting for symbols
02:49:57 INFO - PID 28724 | Finished waiting for state change: selected source

Instead of landing it, you can push it to try like this:

./mach try -b o -p macosx64 -t g2-e10s --rebuild-talos 10 --artifact

Unless you think it is valuable to always have these logs? Or were you unable to reproduce this intermittent on try?
Attachment #8959608 - Flags: review?(poirot.alex)
I wasn't able to reproduce it. I am rerunning it this time with 20 rebuilds and with the logs; let's see if that gives more information...
Joel, I looked at the last 10 failures for this bug on OrangeFactor; they are all happening on Windows, and the vast majority of them are related to this kind of failure:

01:57:52 INFO - PID 3888 | JavaScript note: http://localhost:49424/tests/tp5n/goo.ne.jp/search.goo.ne.jp/cdn/goo/js/v2/sgt.v2.js, line 1: { opened at line 1, column 11981
01:58:07 INFO - PID 3888 | Cycle 1(4): loaded http://localhost:49424/tests/tp5n/goo.ne.jp/goo.ne.jp/index.html (next: http://localhost:49424/tests/tp5n/guardian.co.uk/www.guardian.co.uk/index.html)
01:58:08 INFO - PID 3888 | JavaScript note: http://localhost:49424/tests/tp5n/goo.ne.jp/search.goo.ne.jp/cdn/goo/js/v2/sgt.v2.js, line 1: { opened at line 1, column 11981
01:58:09 INFO - Timeout waiting for test completion; killing browser...

It looks like it is mostly tp5o_scroll timing out.
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=170226617&lineNumber=1818

The timeout related to the debugger seems to be gone. But I was wondering if we could make logviewer/Perfherder produce a better summary. It isn't handy to have the various, very different errors happening during g2 runs all aggregated into this bug. We could tweak DAMP to print particular strings when one test times out or throws. Are there particular patterns that are picked up or ignored by Perfherder, or whatever tool processes the test output?
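For reference, a minimal sketch of the kind of pattern-based scan being discussed (an illustration only, not the actual Treeherder/mozharness failure-summary code; the marker strings are simply taken from the log excerpts earlier in this bug):

import re

FAILURE_PATTERNS = [
    re.compile(r"PROCESS-CRASH \|"),
    re.compile(r"TEST-UNEXPECTED-\w+ \|"),
    re.compile(r"Timeout waiting for test completion"),
]

def failure_summary(log_lines):
    # Keep only the lines that match a known failure marker; everything
    # else (INFO chatter, page-load noise) is dropped from the summary.
    return [line for line in log_lines
            if any(p.search(line) for p in FAILURE_PATTERNS)]

# Example:
# failure_summary(open("live_backing.log").read().splitlines())

That is roughly why a line such as "PROCESS-CRASH | damp | ..." ends up as the headline even when the real problem was a timeout.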
Flags: needinfo?(jmaher)
I agree this isn't damp-related anymore. I am open to suggestions for making the error reporting better so we can reduce the time spent looking at the large list of raw logs to find a real failure pattern.

One problem is that when we get a timeout in the browser, we have to emit a failure message such as "Timeout waiting for test completion" and then close the browser and clean up. When we force the browser closed, we also look for crash dumps; unfortunately that produces the same pattern in many of these logs.

What is interesting is that we seem to be running for >1 hour to complete the test when it should take ~30 minutes. In the case of a few logs I looked at, we are loading each page very slowly (5-15 seconds instead of 1-3 seconds) and we time out.
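To illustrate why a forced shutdown shows up the same way as a real crash, here is a rough sketch of the post-kill check described above (assumed behaviour for illustration, not the actual mozcrash/talos code; the directory layout matches the minidump paths seen in the logs):

import glob
import os

def report_crashes(profile_dir, test_name):
    # Rough sketch (assumed behaviour, not the real mozcrash/talos code):
    # after the browser has been killed, look for minidump files left in
    # the profile and report each one as a PROCESS-CRASH line. A browser
    # killed by the 1-hour timeout is therefore reported just like a
    # browser that genuinely crashed.
    dumps = glob.glob(os.path.join(profile_dir, "minidumps", "*.dmp"))
    for dump in dumps:
        print("PROCESS-CRASH | %s | application crashed [unknown top frame]" % test_name)
        print("Crash dump filename: %s" % dump)
    return len(dumps) > 0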
Flags: needinfo?(jmaher)
Is there a particular string/pattern that we can use to ensure that's the first line logged in Perfherder?

Today, in 99% of DAMP failures, the first error displayed in Perfherder > Failure summary is:

PROCESS-CRASH | damp | application crashed [unknown top frame]

Like here:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=34ae98a0d2e19d03c137d8d747c1525ff699719d&selectedJob=168907788

We can tweak damp.js so that it prints which test timed out and stops reporting it as a crash, reporting just a test failure instead. Do you know where the code lives that decides which log lines are kept in the "Failure summary" screen? If not, any idea who would know?
We are killing this from the Python code that launched the browser:
https://searchfox.org/mozilla-central/source/testing/talos/talos/talos_process.py#138

We time out after 1 hour, so if the process still exists after 1 hour, the Python code takes over and kills the browser. We do something different for TART, where it detects an internal timeout:
https://searchfox.org/mozilla-central/source/testing/talos/talos/talos_process.py#71

I suppose damp could do the same thing by having a smaller internal timeout using tptimeout. Here is an example of a 5-second timeout from tp5:
https://searchfox.org/mozilla-central/source/testing/talos/talos/test.py#497

I am not sure what value would make sense for damp, or whether it needs to do something more like TART, but there seem to be some options to pursue.
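For context, a minimal sketch of the harness-level timeout described above (an illustration, not the actual talos_process.py code; the command and values are placeholders): the Python side waits up to an hour, and only if the browser is still alive does it take over and kill it. An internal, per-test timeout such as tptimeout would fire well before this outer limit.

import subprocess

def run_browser(cmd, timeout=3600):
    # Launch the browser and wait up to `timeout` seconds (1 hour here,
    # matching the harness-level limit mentioned above).
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The browser never finished on its own: log the same message seen
        # in the failing jobs, kill it, and let the caller check for dumps.
        print("Timeout waiting for test completion; killing browser...")
        proc.kill()
        proc.wait()
        return None

# Example (placeholder command):
# run_browser(["firefox", "-profile", "/tmp/talos-profile"], timeout=3600)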
Bug 1444826, which is on autoland, should help us better sort Talos crashes related to DAMP, so it would be great to revisit this bug in one week and see if DAMP still crashes; we should now be able to tell why more easily.
Summary: Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame] → Intermittent PROCESS-CRASH | tps, tp5o_scroll, damp, tp6_youtube_heavy, ts_paint_heavy, tp6_google_heavy, tp6_amazon_heavy, tp6_google, glvideo | application crashed [unknown top frame]
In the last 2 weeks we had 3 failures:
- 2: Windows 7 g2 (tps)
- 1: Windows 10 h1 (ts_paint_heavy)

I don't think this warrants a disable recommendation anymore; the failure rate is low and I will leave it as [stockwell unknown].
Whiteboard: [stockwell disable-recommended] → [stockwell unknown]
Assignee: ystartsev → nobody
Priority: -- → P3
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE
