Windows Talos sessionrestore_no_auto_restore | timeout
Categories
(Testing :: Talos, defect, P1)
Tracking
(firefox-esr68 unaffected, firefox74 wontfix, firefox75 wontfix, firefox76 fixed)
Tracking | Status
---|---
firefox-esr68 | unaffected
firefox74 | wontfix
firefox75 | wontfix
firefox76 | fixed
People
(Reporter: RaulG, Assigned: Gijs)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: intermittent-failure, regression, Whiteboard: [stockwell fixed:backout])
Attachments
(1 file)
(deleted), text/x-phabricator-request
Comment hidden (obsolete)
Reporter
Updated•5 years ago
Comment hidden (Intermittent Failures Robot)
Updated•5 years ago
Comment hidden (Intermittent Failures Robot)
Comment 3•5 years ago
This seems to be the important part of the log:
[task 2019-11-23T22:56:20.222Z] 22:56:20 INFO - TEST-INFO | started process 4604 (C:\Users\task_1574539317\build\application\firefox\firefox -wait-for-browser -no-deelevate -profile c:\users\task_1574539317\appdata\local\temp\tmp2svst3\profile about:home)
[task 2019-11-23T22:56:26.965Z] 22:56:26 INFO - PID 8064 | console.error: (new Error("SessionFile is closed", "resource:///modules/sessionstore/SessionFile.jsm", 433))
[task 2019-11-23T22:56:42.356Z] 22:56:42 INFO - PID 8064 |
[task 2019-11-23T22:56:42.356Z] 22:56:42 INFO - PID 8064 | ###!!! [Child][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2019-11-23T22:56:42.356Z] 22:56:42 INFO - PID 8064 |
[task 2019-11-23T22:56:42.358Z] 22:56:42 INFO - PID 8064 |
[task 2019-11-23T22:56:42.358Z] 22:56:42 INFO - PID 8064 | ###!!! [Child][MessageChannel::SendAndWait] Error: Channel error: cannot send/recv
[task 2019-11-23T22:56:42.358Z] 22:56:42 INFO - PID 8064 |
[task 2019-11-23T23:11:20.216Z] 23:11:20 INFO - Timeout waiting for test completion; killing browser...
[task 2019-11-23T23:11:22.265Z] 23:11:22 INFO - TEST-UNEXPECTED-ERROR | sessionrestore_no_auto_restore | timeout
[task 2019-11-23T23:11:22.265Z] 23:11:22 ERROR - Traceback (most recent call last):
[task 2019-11-23T23:11:22.265Z] 23:11:22 INFO - File "C:\Users\task_1574539317\build\tests\talos\talos\run_tests.py", line 290, in run_tests
[task 2019-11-23T23:11:22.265Z] 23:11:22 INFO - talos_results.add(mytest.runTest(browser_config, test))
[task 2019-11-23T23:11:22.265Z] 23:11:22 INFO - File "C:\Users\task_1574539317\build\tests\talos\talos\ttest.py", line 64, in runTest
[task 2019-11-23T23:11:22.265Z] 23:11:22 INFO - return self._runTest(browser_config, test_config, setup)
[task 2019-11-23T23:11:22.265Z] 23:11:22 INFO - File "C:\Users\task_1574539317\build\tests\talos\talos\ttest.py", line 216, in _runTest
[task 2019-11-23T23:11:22.265Z] 23:11:22 INFO - debugger_args=browser_config['debugger_args']
[task 2019-11-23T23:11:22.266Z] 23:11:22 INFO - File "C:\Users\task_1574539317\build\tests\talos\talos\talos_process.py", line 143, in run_browser
[task 2019-11-23T23:11:22.266Z] 23:11:22 INFO - raise TalosError("timeout")
[task 2019-11-23T23:11:22.266Z] 23:11:22 INFO - TalosError: timeout
[task 2019-11-23T23:11:22.266Z] 23:11:22 INFO - TEST-INFO took 943588ms
[task 2019-11-23T23:11:22.266Z] 23:11:22 INFO - SUITE-END | took 1190s
[task 2019-11-23T23:14:25.284Z] 23:14:25 INFO - WARNING | IO Completion Port failed to signal process shutdown
[task 2019-11-23T23:14:25.284Z] 23:14:25 INFO - Parent process 4604 exited with children alive:
[task 2019-11-23T23:14:25.284Z] 23:14:25 INFO - PIDS: 8132, 6828
[task 2019-11-23T23:14:25.284Z] 23:14:25 INFO - Attempting to kill them, but no guarantee of success
[task 2019-11-23T23:14:25.328Z] 23:14:25 ERROR - Return code: 2
[task 2019-11-23T23:14:25.328Z] 23:14:25 WARNING - setting return code to 2
[task 2019-11-23T23:14:25.328Z] 23:14:25 ERROR - # TBPL FAILURE #
I'm going to add some additional instrumentation to see if I can find out why the SessionFile is closed.
Updated•5 years ago
Comment hidden (Intermittent Failures Robot)
Comment 5•5 years ago
Looks like the SessionFile stuff is a red herring. Even if we don't throw there, the test still times out. Adding more logging.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0125395d774fe0f776fe22f17a9c0524fb768f26
Comment 6•5 years ago
Okay, I think I have a sense of what's happening.
Before the run where the test is hanging, we're forcibly killing the parent process after 5 seconds of inactivity. Presumably, we're taking too long to shut down in the parent process (likely due to bug 1597312).
Killing the parent process probably creates a busted session in the profile, or puts it in the "crash recovery state", which is why the test times out - the session isn't actually being restored.
I suspect this is bug 1597312 or similar due to the error message during the shutdown hang:
[task 2019-11-26T20:46:20.503Z] 20:46:20 INFO - PID 5948 | console.warn: services.settings: Error retrieving the getLastModified timestamp from blocklists/addons RemoteSettingClient Message: Error: IndexedDB getLastModified() The operation failed for reasons unrelated to the database itself and not covered by any other error code.
Hey asuth, since I might have a fairly straightforward way of recreating bug 1597312 in automation (on Windows-only though :/), is there anything useful I can provide that might help solve bug 1597312? Or do you have any suggestions on how I can work around it, presuming this is the issue?
Comment hidden (Intermittent Failures Robot)
Comment 8•5 years ago
I'm not sure I understand enough context about what's happening here but...
(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #6)

> Before the run where the test is hanging, we're forcibly killing the parent process after 5 seconds of inactivity. Presumably, we're taking too long to shut down in the parent process (likely due to bug 1597312).

Shouldn't this trigger a test failure at this point?

> Killing the parent process probably creates a busted session in the profile, or puts it in the "crash recovery state", which is why the test times out - the session isn't actually being restored.

Shouldn't this also cause a test failure?
I'm not trying to avoid fixing bugs in our components, but there are a ton of reasons shutdown can time out, and bug 1597312 should not be an issue given that bug 1594521 should mean that remote settings stops doing things at shutdown. (Activity stream could perhaps be ignoring shutdown, though?)
It seems like maybe the sessionstore base test class at https://searchfox.org/mozilla-central/rev/4df8821c1b824db5f40f381f48432f219d99ae36/testing/talos/talos/test.py#207 should:
- Set a flag or something so that if Firefox exits with an error code or reports a crash, the test fails. The crash should then be symbolicated and the minidump artifact exposed. This should help immediately demonstrate what's actually causing the hang and make it directly actionable instead of a timeout.
- Have a hook that checks that the `reinstall` files are present and that they don't indicate a crash happened.
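The first suggestion above could be sketched roughly as follows. This is a minimal illustration in Python, not the real talos API: `check_for_crash`, its signature, and the minidump-directory layout are all invented for the example.

```python
import os

def check_for_crash(exit_code, minidump_dir):
    # Hypothetical post-run hook (names are illustrative, not real talos
    # code): collect reasons the run should be reported as a failure
    # immediately, instead of surfacing later as an opaque timeout.
    problems = []
    if exit_code != 0:
        problems.append("browser exited with code %d" % exit_code)
    if os.path.isdir(minidump_dir):
        # A leftover .dmp file means the browser crashed; the real hook
        # would symbolicate it and attach it as a job artifact.
        dumps = [f for f in os.listdir(minidump_dir) if f.endswith(".dmp")]
        if dumps:
            problems.append("found %d minidump(s)" % len(dumps))
    return problems
```

A harness could call this after every cycle and turn any non-empty result into a TEST-UNEXPECTED-FAIL line, which is exactly the "directly actionable instead of a timeout" behavior suggested above.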
Comment hidden (Intermittent Failures Robot)
Updated•5 years ago
Comment 10•5 years ago
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 14•5 years ago
Mike, is there a way to disable this test on windows until you have a fix?
Comment 15•5 years ago
Just a little update here - I have crash reports being (apparently) generated on try during shutdown hangs, but for some reason they're not being collected / dumped / put into the MOZ_UPLOAD_DIR. Still trying to figure out why.
Give me until the end of the day to sort this out. If I can't get it, the best idea would probably be to back out bug 1545438 for now until I sort this out and re-land.
Comment 16•5 years ago
Thank you, will do.
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 19•5 years ago
Hey apavel, can we please back out bug 1545438 (b4b3dc473bc4) for mozilla-central and mozilla-beta until I can get this figured out?
Comment 20•5 years ago
(In reply to Mike Conley (:mconley) (:⚙️) (Wayyyy behind on needinfos) from comment #19)

> Hey apavel, can we please back out bug 1545438 (b4b3dc473bc4) for mozilla-central and mozilla-beta until I can get this figured out?
Backed out on both trees
Central backout: https://hg.mozilla.org/mozilla-central/rev/32f84899fc000d95337a25c14a27470d0c52e3ee
Beta backout: https://hg.mozilla.org/releases/mozilla-beta/rev/283b4b57c72dad92198af43fea25e42bd15c5047
Updated•5 years ago
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 23•5 years ago
The priority flag is not set for this bug.
:rwood, could you have a look please?
For more information, please visit auto_nag documentation.
Updated•5 years ago
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 26•5 years ago
This is a timeout where we have comparatively little info in general. For a specific case that seems to make it worse, I'm investigating in bug 1545438, but this is happening on current central even without the patch from that bug, and there's very little information on what causes it. What seems to happen is:
- we hang on shutdown (or at least, don't respond / output for 5 seconds? Loop is at https://searchfox.org/mozilla-central/rev/2fd8ffcf087bc59a8e5c962965bbb7bf230bcd28/testing/talos/talos/talos_process.py#150 )
- talos_process terminates the launcher process; the main process stays up
- we start another run on the same profile
- we hit the "this profile is in use" dialog on startup, and then end up hanging and hitting the main talos job timeout
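The inactivity loop described above can be sketched like this. This is a simplified stand-in, not the actual talos_process.py code (the real harness is built on mozprocess); it only illustrates why killing the spawned process is not enough when that process is a launcher for the "real" browser process.

```python
import queue
import subprocess
import threading

def run_with_inactivity_timeout(cmd, output_timeout):
    # Simplified sketch of the talos watchdog: kill the browser if it
    # prints nothing for `output_timeout` seconds.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    lines = queue.Queue()

    def reader():
        # Feed output lines into a queue so the main thread can wait on
        # them with a timeout.
        for line in proc.stdout:
            lines.put(line)

    threading.Thread(target=reader, daemon=True).start()
    while True:
        try:
            lines.get(timeout=output_timeout)
        except queue.Empty:
            if proc.poll() is not None:
                return proc.returncode
            # No output for `output_timeout` seconds. Note: on Windows
            # this kills only the process we spawned (the launcher); the
            # main firefox process can survive and keep holding the
            # profile lock, so the next run hits "this profile is in use".
            proc.kill()
            proc.wait()
            raise RuntimeError("timeout")
```

Under this model, the two issues below follow directly: the kill needs to cover the whole process tree, and the kill site is the natural place to collect a crash report.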
I think there are 2 issues here:
- when terminating things, we should terminate the launcher process as well as the main process
- when we terminate things, we should be attaching a crash report to the job so we can debug what happened.
:aklotz, for the former, how can we update talos_process.py to do the Right Thing and also end the "main" process -- ideally while...
:gsvelto: how do we make sure we get a crash report for these crashes and/or shutdown hangs ?
Comment 27•5 years ago
(In reply to :Gijs (back Thu 12; he/him) from comment #26)

> :gsvelto: how do we make sure we get a crash report for these crashes and/or shutdown hangs ?

It seems to me that the code is already doing the right thing by calling mozcrash.kill_and_get_minidump() (see here). For that to work, however, an exception handler must be installed, and the launcher process doesn't have one, which is why I think we're not getting a crash report. On the other hand, killing the browser process that way should always yield a crash report.
Comment 28•5 years ago
(In reply to :Gijs (he/him) from comment #26)

> :aklotz, for the former, how can we update talos_process.py to do the Right Thing and also end the "main" process -- ideally while...
I seem to remember having a discussion with Henrik on how to do this with job objects for a different harness. Can we adapt that code to Talos?
Comment 29•5 years ago
Hm, the only thing I can remember was for bug 1493796, but that is to keep a handle to the job process after a restart. Sorry, but I cannot remember anything else, and my email archive doesn't contain anything related to that.
Updated•5 years ago
Assignee
Comment 30•5 years ago
I've filed bug 1622257 on fixing talos to shut down the launcher process, and on making it report those shutdowns as oranges on treeherder (instead of the timeout, or nothing (!) on macos/linux).
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 32•5 years ago
(oops)
Assignee
Comment 33•5 years ago
Assignee
Comment 34•5 years ago
By and large, this change accomplishes two things:
- Run db.close() in finally clauses so that even if db access fails, we close our connections. It also tries to avoid waiting on other, non-DB operations before calling close, to avoid the DB connection needlessly hanging around.
- Intercept all async database operations from the remote settings client to kinto and ensure they complete before the end of profile-before-change.

Any operations started after Services.startup.isShuttingDown (so after quit/restart is initiated by the user) will throw. Operations started beforehand are put in a set of operations, and remove themselves once complete. We AsyncShutdown block on that set of operations completing.
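The pattern described above — reject new operations once shutdown has started, track in-flight operations in a set, and have the shutdown blocker wait for that set to drain — can be illustrated like this. The actual patch is JavaScript in the remote settings client using AsyncShutdown; this is an illustrative Python/asyncio analogue, and the class and method names are invented.

```python
import asyncio

class ShutdownGuardedClient:
    # Illustrative analogue of the patch: operations started before
    # shutdown are tracked and awaited; operations started after
    # shutdown began are rejected (like throwing once
    # Services.startup.isShuttingDown is true).
    def __init__(self):
        self._pending = set()
        self._shutting_down = False

    async def run_db_op(self, make_coro):
        if self._shutting_down:
            raise RuntimeError("shutdown already started")
        task = asyncio.ensure_future(make_coro())
        self._pending.add(task)
        # Operations remove themselves from the set once complete.
        task.add_done_callback(self._pending.discard)
        return await task

    async def shutdown(self):
        # Analogue of the AsyncShutdown blocker on profile-before-change:
        # wait for every tracked operation before letting shutdown proceed.
        self._shutting_down = True
        if self._pending:
            await asyncio.gather(*self._pending, return_exceptions=True)
```

The key design point is that the blocker only waits on work that was already in flight; nothing new can be enqueued once shutdown begins, so the wait is bounded.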
Comment hidden (Intermittent Failures Robot)
Comment 36•5 years ago
Comment 37•5 years ago
Backed out for perma-failing bug 1621759
Push that started the failures: https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=294399547&resultStatus=testfailed%2Cbusted%2Cexception&revision=3b7ccffc05460949c4d5503ae2acd800eb0b8364
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=294399547&repo=autoland&lineNumber=5501
Backout: https://hg.mozilla.org/integration/autoland/rev/da7b6f1b04fbd5c4343b9e3f67fc320a497742e3
Comment 38•5 years ago
Assignee
Updated•5 years ago
Comment 39•5 years ago
bugherder
Updated•5 years ago
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 41•5 years ago
(In reply to Intermittent Failures Robot from comment #40)

> 4 failures in 4773 pushes (0.001 failures/push) were associated with this bug in the last 7 days.
> Repository breakdown:
> - autoland: 1
Hrmpf. I filed bug 1625888.
Updated•3 years ago