Talos terminating processes that take too long to shut down should fail the task more reliably and work correctly on Windows
Categories
(Testing :: Talos, defect, P1)
Tracking
(firefox76 fixed)
Tracking | Status | |
---|---|---|
firefox76 | --- | fixed |
People
(Reporter: Gijs, Assigned: Gijs)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
Talos' process wrapper has some shutdown logic ( https://searchfox.org/mozilla-central/rev/278046367dab878316f60f0bd7f740cf73f3c447/testing/talos/talos/talos_process.py#148-152 ) where it'll kill the process if it stops responding.
The resulting crashes get found and reported and break the task (see deps of bug 1339594).
There are two missing bits here:
- having to terminate the process should just fail the task anyway. If the crashreporter fails or the minidumps end up in the wrong directory or, like on Windows, the launcher process means we fail to terminate the process, we should still be causing orange immediately. It seems right now, the fact that we don't do this is hiding shutdown hangs: the timeouts in bug 1598924 are caused by us ending the previous instance of Firefox, that only terminating the launcher (not the "real" Firefox), and then hanging on the next startup. When the same shutdown hang happens on Linux, we successfully terminate the browser in the first instance, so the next startup works fine, and the resulting run is green, but the shutdown issue is the same and should also be turning the linux job orange.
- on Windows, we should terminate the real parent process, not the launcher.
I have patches for both of these.
Assignee | ||
Comment 1•5 years ago
|
||
I thought we'd be OK for crashreports based on bug 1598924 comment #27, but re-reading, I don't actually think this is the case.
(In reply to Gabriele Svelto [:gsvelto] from bug 1598924 comment #27)
It seems to me that the code is already doing the right thing by calling
mozcrash.kill_and_get_minidump()
(see here).
Fixed the link to have a line number. However, this is only the case where we time out waiting for the event. Here's the code:
# wait until we saw __endTimestamp in the proc output,
# or the browser just terminated - or we have a timeout
if not event.wait(timeout):
LOG.info("Timeout waiting for test completion; killing browser...")
# try to extract the minidump stack if the browser hangs
mozcrash.kill_and_get_minidump(proc.pid, minidump_dir)
raise TalosError("timeout")
if reader.got_end_timestamp:
for i in range(1, wait_for_quit_timeout):
if proc.wait(1) is not None:
break
if proc.poll() is None:
LOG.info(
"Browser shutdown timed out after {0} seconds, terminating"
" process.".format(wait_for_quit_timeout)
)
If the talos run itself outputs the number for which we wait (which it does in this case), we then progress to waiting for the process to end. We do this for 5 seconds by default. If that doesn't work, we progress to killing the process differently:
finally:
# this also handle KeyboardInterrupt
# ensure early the process is really terminated
return_code = None
try:
return_code = context.kill_process()
if return_code is None:
return_code = proc.wait(1)
except Exception:
# Maybe killed by kill_and_get_minidump(), maybe ended?
LOG.info("Unable to kill process")
LOG.info(traceback.format_exc())
This doesn't use mozcrash, and so we don't get a dump.
For that to work however an exception handler must be installed and the launcher process doesn't have one which is why I think we're not getting a crash report. On the other hand killing the browser process that way should always yield a crash report.
Right. I suspect this is also true on Windows, ie we kill the "wrong" process anyway.
Gabriele, can you sanity-check the above?
Comment 2•5 years ago
|
||
I've got limited experience with this code but I agree with your analysis. context.kill_process()
will never yield a minidump on Linux because it's using SIGKILL
IIRC (you'd need SIGABRT
do get one) and on Windows one would need to explicitly grab a mindump first, like mozcrash does.
Assignee | ||
Comment 3•5 years ago
|
||
Assignee | ||
Comment 4•5 years ago
|
||
This ensures that we shut down the "real" Firefox, rather than the launcher process,
when forcing process termination. It also ensures that, even when shutdown times out,
we use mozcrash so we get minidump information.
Depends on D66783
Updated•5 years ago
|
Updated•5 years ago
|
Comment 6•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/182f6b349371
https://hg.mozilla.org/mozilla-central/rev/3537ff03365b
Description
•