Closed Bug 934310 Opened 11 years ago Closed 7 years ago

Intermittent talos timeout on startup, "command timed out: 3600 seconds without output, attempting to kill", "timed out after 3600 seconds of no output"

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(firefox58 fixed, firefox59 fixed)

RESOLVED FIXED
mozilla59
Tracking Status
firefox58 --- fixed
firefox59 --- fixed

People

(Reporter: philor, Assigned: rwood)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

Attachments

(1 file)

You'll probably find a fair few of these in bug 920161, since we tend to assume without looking that a 3600 second timeout in talos is a clone timeout. https://tbpl.mozilla.org/php/getParsedLog.php?id=30054331&tree=Fx-Team Rev4 MacOSX Lion 10.7 fx-team talos svgr on 2013-11-03 14:47:10 PST for push 1478a4c6fe96 slave: talos-r4-lion-002 14:49:30 INFO - Running test tsvgx: 14:49:30 INFO - Started Sun, 03 Nov 2013 14:49:30 14:49:30 INFO - DEBUG : operating with platform_type : mac_ 14:49:30 INFO - DEBUG : created profile 14:49:30 INFO - DEBUG : command line: /builds/slave/talos-slave/test/build/application/FirefoxNightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/kl/22tr3t1d4f35n34w5z9p3xvr00000w/T/tmpSOgV8Z/profile http://localhost/getInfo.html 14:49:40 INFO - DEBUG : initialized firefox 14:49:40 INFO - DEBUG : command line: /builds/slave/talos-slave/test/build/application/FirefoxNightly.app/Contents/MacOS/firefox -foreground -profile /var/folders/kl/22tr3t1d4f35n34w5z9p3xvr00000w/T/tmpSOgV8Z/profile -tp file:/builds/slave/talos-slave/test/build/venv/lib/python2.7/site-packages/talos/page_load_test/svgx/svgx.manifest -tpchrome -tpnoisy -tpcycles 1 -tppagecycles 25 command timed out: 3600 seconds without output, attempting to kill process killed by signal 9 program finished with exit code -1 elapsedTime=3743.149791
Summary: Intermittent talos timeout on startup, command timed out: 3600 seconds without output, attempting to kill → Intermittent talos timeout on startup, "command timed out: 3600 seconds without output, attempting to kill", "timed out after 3600 seconds of no output"
That wasn't the least bit suspicious, oh no, not at all.
Depends on: 1154434
There's been a recent increase in frequency of this bug, but there have been on-going failures for months. Over the last month or so, all failures are Windows, mostly pgo. The recent spike appears to be talos-other-e10s; those logs end like: 11:31:25 INFO - PROCESS | 2156 | SessionRestoreTalosTest: INIT 11:31:25 INFO - PROCESS | 2156 | SessionRestoreTalosTest: constructor 11:31:25 INFO - PROCESS | 2156 | SessionRestoreTalosTest: profile-after-change 11:31:25 INFO - PROCESS | 2156 | ==============> SessionRestoreTalosTest: 0 command timed out: 3600 seconds without output, attempting to kill
I am confused why this is win pgo primarily- possibly a race condition upon startup.
Whiteboard: [stockwell needswork]
wow, this took a spike, I believe this is related to all win7-pgo issues on aurora which :tomcat is looking into.
we just backed this out, I suspect this will get fixed and we will see failures go down drastically in the next day or two
this seems to be fixed, although we should leave this open for while.
Whiteboard: [stockwell needswork] → [stockwell fixed]
Whiteboard: [stockwell fixed] → [stockwell fixed:product]
Failures started again on June 26, for Windows 7, talos-other-e10s. The earliest instance I see is https://treeherder.mozilla.org/#/jobs?repo=autoland&revision=e3571b957e4fe3175abf93a389ceecbd802088ba&filter-searchStr=talos. :jmaher - Any idea what might have caused this?
Flags: needinfo?(jmaher)
Whiteboard: [stockwell fixed:product] → [stockwell needswork]
this is interesting and looks to be related to sessionrestore (30 failures since Monday). While there is a new sessionrestore test (landed today) and tpaint changes (landing later today), they are not related to Monday. :gbrown, could you help bisect the problem down- this is odd that this is a windows 7 only failure.
Flags: needinfo?(jmaher) → needinfo?(gbrown)
(In reply to Joel Maher ( :jmaher) from comment #935) > this is interesting and looks to be related to sessionrestore That reminds me of bug 1244904 and several related bugs (shutdown hang after sessionstore mochitests). > :gbrown, could you help bisect the problem down- this is odd that this is a windows 7 only failure. Working on it now...
https://treeherder.mozilla.org/#/jobs?repo=autoland&filter-searchStr=windows%20pgo%20talos%20e10s%20other&tochange=a0f804d0b4873e168d31ad0e49f8b2deeb4fe52c&fromchange=629f02cdaee74c1744d874544a3155fa26a0d957 strongly suggests this problem started (well, re-started, comment 934) with bug 934967. Bug 934967 is related to session store, consistent with observations in comment 935 and comment 936. :beekill -- Can you look into this, to see how your change may be triggering this intermittent hang in windows talos tests?
Blocks: 934967
Flags: needinfo?(gbrown) → needinfo?(nnn_bikiu0707)
I'm not have much experience in dealing with this intermittent hang but I can definitely take a look at this since this is may be from my patches. But I have a question: can we know at which part does firefox hangs? Is there a stack trace? If there isn't, how can output to the talos log?
Flags: needinfo?(nnn_bikiu0707) → needinfo?(gbrown)
Depends on: 1378816
That's a good point: There is no stack trace in this case. I'll try to fix that in bug 1378816.
Flags: needinfo?(gbrown)
Now the browser is killed on timeout (after 15 minutes), so failures are being reported in other bugs, like bug 1343532 and bug 1109124.
I did push try to output whether the previous session files was read [1]. The session file was read successfully, however, the wrong session file was read. Instead of sessionstore.js, it read recovery.jsonlz4. So I suspect the problem maybe because of that. I tried the patch of bug 1376173 and the intermittent didn't show up [2]. [1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=6b81282294da5bdaccd57ddf83b81fd1dff5c024 [2]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=84473a7910dc8062db2a59e1659a45d65b748220
Whiteboard: [stockwell needswork] → [stockwell fixed:other]
This bug had a spike in the last 7 day hitting 40 fails. Failed on Windows 7, build types are opt and pgo. Here is a recent log: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=150335243&lineNumber=1519 And here is a part of the log: command timed out: 3600 seconds without output, attempting to kill 15:49:45 INFO - Automation Error: mozprocess timed out after 3600 seconds running ['C:\\slave\\test\\build\\venv\\Scripts\\python', 'C:\\slave\\test\\build\\tests\\talos\\talos\\run_tests.py', '--branchName', 'Mozilla-Inbound-Non-PGO', '--suite', 'h1-e10s', '--executablePath', 'C:\\slave\\test\\build\\application\\firefox\\firefox', '--symbolsPath', 'https://queue.taskcluster.net/v1/task/BD387rOSTteh_QIilZXFRQ/artifacts/public/build/target.crashreporter-symbols.zip', '--title', 'T-W732-IX-035', '--webServer', 'localhost', '--log-tbpl-level=debug', '--log-errorsummary=C:\\slave\\test\\build\\blobber_upload_dir\\h1-e10s_errorsummary.log', '--log-raw=C:\\slave\\test\\build\\blobber_upload_dir\\h1-e10s_raw.log'] 15:49:45 ERROR - timed out after 3600 seconds of no output 15:49:46 ERROR - Return code: 572 15:49:46 WARNING - setting return code to 572 program finished with exit code 1 elapsedTime=4596.694000 ========= master_lag: 3.41 ========= ========= Finished 'c:/mozilla-build/python27/python -u ...' warnings (results: 1, elapsed: 1 hrs, 16 mins, 40 secs) (at 2017-12-06 15:49:50.466961) ========= Hi :rwood: Can you please take a look at this bug?
Flags: needinfo?(rwood)
Whiteboard: [stockwell fixed:other] → [stockwell needswork]
These are Windows and talos h1, like in bug 1422863...still happening after the change to max-run-time. Looking more closely, I think I pointed :rwood in the wrong direction: I think it is talos_script.py that is timing out, rather than taskcluster (probably both need extended timeouts if the job is going to run for >3600 seconds).
Thanks Geoff. I'm going to remove h1 from win 7 - we had the same problem for h2 and we already disabled that on win 7 in Bug 1415858 Comment 26.
Flags: needinfo?(rwood)
Comment on attachment 8935394 [details] Bug 934310 - Remove talos h1 from win 7 to fix intermittent faiure; https://reviewboard.mozilla.org/r/206290/#review211884 thanks for fixing this
Attachment #8935394 - Flags: review?(jmaher) → review+
Pushed by rwood@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f67d4dd8bf0a Remove talos h1 from win 7 to fix intermittent faiure; r=jmaher
the fix has not been merged to central/inbound yet
Flags: needinfo?(rwood)
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla59
Blocks: 1427930
No longer blocks: 1427930
Assignee: nobody → rwood
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: