Terminate mochitests using exit(0) as soon as we have test results
Categories
(Testing :: Mochitest, task)
Tracking
(firefox78 fixed)
Tracking | Status | |
---|---|---|
firefox78 | --- | fixed |
People
(Reporter: florian, Assigned: florian)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [ci-costs-2020:done])
Attachments
(3 files, 1 obsolete file)
Shutdown is slow because we write things to disk and do cleanups (eg. garbage collection). There's an ongoing effort (tracked in bug 1606879) to make shutdown faster in general by doing the writes during the beginning of shutdown, and then terminating the process with an exit(0) for optimized builds.
For browser chrome mochitests, we can terminate without worrying about what we write to disk, so we can be much more agressive.
Assignee | ||
Comment 1•5 years ago
|
||
Updated•5 years ago
|
Comment 2•5 years ago
|
||
Would this skip leak checking?
Comment 3•5 years ago
|
||
I see this is skipping debug, so leak checking would still be valid. we should validate ccov, tsan, asan needs; assuming this is successful, we should do the rest of mochitest, xpcshell, reftest, wpt as well.
Updated•5 years ago
|
Comment 4•5 years ago
|
||
For ccov, we need to take the long shutdown road, as coverage counters are written during shutdown.
Assignee | ||
Comment 5•5 years ago
|
||
So far I’ve tested on try with the browser/base/content/test/webrtc
folder of tests that I know well (as I wrote most of them years ago). Initial results seem encouraging: 53s instead of 75s on Win7, 63 instead of 73 on Win10, 55 instead of 66 on OS X. On linux 299s without the patch and 246s with it (but both runs lost 119s during startup on the initialization of dbus that seems to be terribly slow when startup profiling is enabled, as we discovered in bug 1624868).
Comment 6•5 years ago
|
||
How different is this to just sending SIGKILL to the browser? For wpt control has to return to the Python and we currently do a dance where we try to do a clean shutdown using marionette, then send a SIGTERM and then a SIGKILL. So for an implementation there there are two options: either allow marionette to perform a fast shutdown, or just kill the process agressively.
Assignee | ||
Comment 7•5 years ago
|
||
Assignee | ||
Comment 8•5 years ago
|
||
One behavior that I don't like with the current patch is that terminating abruptly the parent process without saying goodbye to child processes means they output noise in the terminal before also terminating themselves. Probably not much of a problem for runs on infra, but I would also like this patch to improve the experience for developers running mochitests locally. Sending an IPC messages to tell child processes to quietly exit could be a follow-up.
Assignee | ||
Comment 9•5 years ago
|
||
(In reply to James Graham [:jgraham] from comment #6)
How different is this to just sending SIGKILL to the browser?
If profiling is enabled, this patch dumps the profile to the disk before calling exit. I would like profiling to be available for all tests, so that we can see where they spend time, and find more opportunities for optimization.
I think we should allow marionette to perform a fast shutdown.
Comment 10•5 years ago
|
||
Overall I think this is a great initiative and I am happy to see the implementation in the proposed patch. However, I feel we should keep in mind -- especially if we are considering expanding the scope of changes -- that we want to test the entire browser life-cycle, and various tests have, on occasion, demonstrated shutdown hangs.
Assignee | ||
Comment 11•5 years ago
|
||
(In reply to Geoff Brown [:gbrown] from comment #10)
various tests have, on occasion, demonstrated shutdown hangs.
Were these hangs not happening on debug builds?
Comment 12•5 years ago
|
||
I'm not sure, but probably. As I said, I support the proposed change; I just don't want us to try to short cut shutdown everywhere.
Updated•5 years ago
|
Assignee | ||
Comment 13•5 years ago
|
||
Assignee | ||
Comment 14•5 years ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #3)
I see this is skipping debug, so leak checking would still be valid. we should validate ccov, tsan, asan needs; assuming this is successful, we should do the rest of mochitest, xpcshell, reftest, wpt as well.
The patch already skips debug builds, and also skips the this._coverageCollector
case, so I'm hoping it's good for ccov. Anything specific that needs to be done to validate that tsan and asan are not going to be negatively affected?
Comment 15•5 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #14)
The patch already skips debug builds, and also skips the
this._coverageCollector
case, so I'm hoping it's good for ccov.
No, _coverageCollector is not used in normal coverage builds (it was used for coverage collected via the debugger). There is a MOZ_CODE_COVERAGE definition that you can use instead.
Assignee | ||
Comment 16•5 years ago
|
||
xpcshell profile of the same test without the exit(0) https://perfht.ml/2VxnujY and with it https://perfht.ml/2x9hHYE
Comment 17•5 years ago
|
||
let me check with :decoder about tsan (and he might know about asan).
:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?
Comment 18•5 years ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)
let me check with :decoder about tsan (and he might know about asan).
:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?
Is this just for tests, or is this the same we do in reality in the release builds?
Comment 20•5 years ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #19)
just for tests, hacked into the harnesses
I don't think it's a good idea to do this (not just in TSan, but for all builds), in particular because we had and have problems with shutdown. Shutdown hangs or races have been a constant problem in Firefox and skipping this part just in tests greatly increases the risk that we ship a shutdown regression of some sort.
Comment 21•5 years ago
|
||
we could keep the normal exit process enabled for the m-c job (and disable it for try and autoland)
or some variance of this :)
and, anyway, if we regress too much, we can always come back to the previous behavior
Assignee | ||
Comment 22•5 years ago
|
||
(In reply to Christian Holler (:decoder) from comment #20)
I don't think it's a good idea to do this (not just in TSan, but for all builds), in particular because we had and have problems with shutdown. Shutdown hangs or races have been a constant problem in Firefox and skipping this part just in tests greatly increases the risk that we ship a shutdown regression of some sort.
Which kind of shutdown hangs would happen in opt builds but not in debug builds? I'm under the impression that debug builds are more likely to hang than optimized builds, so I don't think we would be losing significant test coverage.
Adding the needinfo back for the initial question:
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #17)
let me check with :decoder about tsan (and he might know about asan).
:decoder, this patch is doing a quick browser shutdown when we can to save time. Is this ok for TSAN? we are not doing this for debug as we look for leaks. Are there similar things for TSAN? If so how can we look for them?
Comment 23•5 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #22)
Which kind of shutdown hangs would happen in opt builds but not in debug builds? I'm under the impression that debug builds are more likely to hang than optimized builds, so I don't think we would be losing significant test coverage.
That is correct because in opt builds, the content processes have a fast exit path, while in debug they don't (hence in total, you will see more hangs with debug builds, if the hang is caused by a child).
But there is no such thing for the parent process in debug vs. opt and since opt has a vastly different timing compared to debug, it is not viable to assume that opt has less hangs than debug. It could well be exactly the other way around.
Adding the needinfo back for the initial question:
With TSan we should not exit because we would not be able to catch shutdown races anymore. There is no debug TSan configuration running because that would be way too slow.
Updated•5 years ago
|
Assignee | ||
Comment 24•5 years ago
|
||
(In reply to Sylvestre Ledru [:Sylvestre] from comment #21)
we could keep the normal exit process enabled for the m-c job (and disable it for try and autoland)
or some variance of this :)
After a discussion on matrix with decoder, I think we should avoid the exit(0) shutdown path for RELEASE_OR_BETA. This is a good compromise between ensuring we test what we ship to users, and reducing CI costs.
Updated•5 years ago
|
Comment 25•5 years ago
|
||
Assignee | ||
Comment 26•5 years ago
|
||
I reduced the patch to only contain the part related to browser-chrome tests so that it can land sooner.
For the mochitest-plain tests, I have test failures on try when a profiler folder is re-used for the edge case of a test manifest specifying tests using both the http and https schemes.
For the xpcshell tests, I have failures for tests that run in a child process, I need to figure out how to handle this case.
I'll handle finishing both of these in follow-up bugs.
Comment 27•5 years ago
|
||
Backed out changeset b77388326b1b (bug 1630982) for causing leaks on mochitest jobs. CLOSED TREE
Log:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=301178919&repo=autoland&lineNumber=1666
Push with failures:
https://treeherder.mozilla.org/#/jobs?repo=autoland&group_state=expanded&revision=b77388326b1bf23441fad1c3aa22ad04443550f9\
Backout:
https://hg.mozilla.org/integration/autoland/rev/e2f1a6b3f0a865cbdbf6810d2278de290ce17773
Assignee | ||
Comment 28•5 years ago
|
||
Depends on D71336
Comment 29•5 years ago
|
||
Comment 30•5 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/ae9720f97abb
https://hg.mozilla.org/mozilla-central/rev/a8cd211f5393
Assignee | ||
Comment 31•5 years ago
|
||
(In reply to Florian Quèze [:florian] from comment #26)
I reduced the patch to only contain the part related to browser-chrome tests so that it can land sooner.
For the mochitest-plain tests, I have test failures on try when a profiler folder is re-used for the edge case of a test manifest specifying tests using both the http and https schemes.
This part actually landed in this bug.
For the xpcshell tests, I have failures for tests that run in a child process, I need to figure out how to handle this case.
I fixed this case where xpcshell tests using a child process failed, but I still have 2 xpcshell tests that crash at shutdown and I haven't figured out why yet; I'll file a follow-up for the xpcshell part of my patches.
Comment 32•5 years ago
|
||
Comment on attachment 9143127 [details]
Bug 1630982 - Exit(0) at the end of xpcshell tests, r=gbrown.
Revision D72407 was moved to bug 1637637. Setting attachment 9143127 [details] to obsolete.
Updated•4 years ago
|
Comment 33•2 years ago
|
||
Did we ever consider doing this for other test suites, e.g. reftest?
Assignee | ||
Comment 34•2 years ago
|
||
(In reply to Marco Castelluccio [:marco] from comment #33)
Did we ever consider doing this for other test suites, e.g. reftest?
I did it for xpcshell tests too in bug 1637637. I haven't attempted to do it for reftests. I just grabbed a profile of running locally a random folder of reftests: https://share.firefox.dev/3NIFAHM On my machine we could save about 200ms with an exit(0) after the last test finishes. That seems small compared to the 1.6s it took to start running tests (we do a full Firefox startup, initializing a full browser window before we close it to open the reftests window instead).
Feel free to open bugs for reftests (or other test suites) if you think there are interesting potential wins to be had. I'm probably not going to take them myself, but happy to help.
Comment 35•2 years ago
|
||
Given the large number of manifests/groups that we run, 200 ms per manifest/group is still a pretty large saving.
Joel, other than reftest, where could we do this?
Comment 36•2 years ago
|
||
reftest and web-platform-tests might be interesting. I assume we have all mochitests (webgl, media, a11y, browser, chrome, devtools, plain, etc.) with this exit(0)
. I don't believe there are other harnesses where we would see the scale of large volumes of tests and browser restarts.
This bug would be a good bug for a contributor or new hire.
Comment 37•2 years ago
|
||
(In reply to Joel Maher ( :jmaher ) (UTC -0800) (back July 11th) from comment #36)
reftest and web-platform-tests might be interesting.
I filed bug 1778458 for reftest.
wpt has bug 1608454 implemented according to bug 1608454 comment 4, so IDK if it would benefit from this as well.
I assume we have all mochitests (webgl, media, a11y, browser, chrome, devtools, plain, etc.) with this
exit(0)
. I don't believe there are other harnesses where we would see the scale of large volumes of tests and browser restarts.
The two patches in this bug mention browser-chrome and plain, so I don't know if the other mochitest are covered as well by them. Are they? If not, we should file something like bug 1778458 for the rest of mochitest.
Description
•