Open
Bug 1473626
Opened 6 years ago
Updated 2 years ago
[meta] Use a code coverage build to find things running on timers that should be disabled during tests
Categories
(Testing :: Code Coverage, enhancement)
Testing
Code Coverage
Tracking
(Not tracked)
NEW
People
(Reporter: marco, Unassigned)
References
(Depends on 2 open bugs, Blocks 1 open bug)
Details
(Keywords: meta)
Attachments
(7 files)
We could:
1) Run an empty test;
2) Reset coverage counters;
3) Wait 120 seconds;
4) Dump coverage counters.
After 4, we can generate a report and see what was executed.
By disabling things that are executed on timers that shouldn't be executed during tests, we could make per-test coverage data cleaner, and we might fix some intermittents due to things running on timers badly interacting with tests.
Comment 1•6 years ago
|
||
Good idea! The last timer was found by looking at the raw lines and files covered so doing this would be much easier. We could use the baseline tests as the empty tests, and reset counters and wait in the test harness before the browser is closed, but after the test finishes. I think that would also remove the timers we already know about.
Reporter | ||
Comment 2•6 years ago
|
||
I've generated a test report, there are several Telemetry functions covered (so I guess we are collecting some Telemetry data on a timer, we should likely disable it during tests). E.g. GetWebrtcStats.
Comment 3•6 years ago
|
||
:marco, could you post a link to the code you used to find that?
Updated•6 years ago
|
Flags: needinfo?(mcastelluccio)
Comment 4•6 years ago
|
||
I think we should try this with telemetry disabled in the moz-configs: https://hg.mozilla.org/try/rev/334dd0c80f0eea321c2eea1ce4667af66e7accf0
You'll see it was explicitly enabled for windows but I'm not sure why. Telemetry was enabled by default (in common-opt [1][2]) in mac and linux.
[1]: https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/linux64/common-opt
[2]: https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/macosx64/common-opt
Reporter | ||
Comment 5•6 years ago
|
||
(In reply to Greg Mierzwinski [:sparky] from comment #3)
> :marco, could you post a link to the code you used to find that?
I don't have it anymore unfortunately, but I think I recall all the steps I followed:
- set the GCOV_PREFIX and GCOV_RESULTS_DIR environment variables to two directories on the system;
- remove the requestResetCoverageCounters and requestDumpCoverageCounters calls from testing/mochitest/tests/SimpleTest/TestRunner.js, so that they don't conflict with the calls from inside the test;
- modify a bit tools/code-coverage/tests/mochitest/test_coverage_specialpowers.html (or just create a new test) to make it:
1) Wait a few seconds (so that any delayed initialization finishes)
2) Reset coverage counters
3) Wait 120 seconds
4) Dump coverage counters
Then, run the test, then parse the gcda files that are in GCOV_RESULTS_DIR.
(In reply to Greg Mierzwinski [:sparky] from comment #4)
> I think we should try this with telemetry disabled in the moz-configs:
> https://hg.mozilla.org/try/rev/334dd0c80f0eea321c2eea1ce4667af66e7accf0
>
> You'll see it was explicitly enabled for windows but I'm not sure why.
> Telemetry was enabled by default (in common-opt [1][2]) in mac and linux.
>
> [1]:
> https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/
> linux64/common-opt
> [2]:
> https://dxr.mozilla.org/mozilla-central/source/browser/config/mozconfigs/
> macosx64/common-opt
I'm not sure we can disable Telemetry like this, we might stop collecting coverage for Telemetry if we do, or we might have differences in the tests between the coverage build and the other builds.
Flags: needinfo?(mcastelluccio)
Comment 6•6 years ago
|
||
Good point, and I don't think we can disable telemetry in any other way because it looks like we need to build with it. But I may be wrong since I'm not too familiar with how telemetry works.
What if we filter out these telemetry functions from the coverage data by using file and function names in the etl or with grcov? I could see some coverage for that code getting through. But, it might be better than adding something hacky to get telemetry data out of per-test data. Assuming it's hard to get telemetry data removed with other methods.
Reporter | ||
Comment 7•6 years ago
|
||
I think there might be a timer somewhere that when triggered calls some Telemetry function to collect data. We just need to find out how to disable this timer (probably through a preference), and set this preference during tests (we already do this for several other similar things).
Comment 8•6 years ago
|
||
Ok sounds goood, I'll start looking around for that timer.
Comment 9•6 years ago
|
||
There are at least some portions of Telemetry that still work when we set the pref 'initDelay' to a very high number: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b5183a4169f6352c261f2ba045eba9c30d608f31&selectedJob=192884180
Here's a full linux64-ccov run with initDelay set to a very high number: https://treeherder.mozilla.org/#/jobs?repo=try&revision=094e5accd4569f41dd0dd8351803132c5a217984
I'm not sure if those failures are legitimately caused by the change to initDelay because I have yet to find the link between the test that failed in 'dt4' at [1], and the Telemetry functions. I've re-triggered those chunks to see if it's an intermittent.
[1]: https://dxr.mozilla.org/mozilla-central/source/devtools/client/inspector/animation/test/browser_animation_rewind-button.js
Comment 10•6 years ago
|
||
There's only one failure that seems to caused by setting this value very high, and that is [1] mentioned in comment 9. So, we could disable telemetry with this flag - and either skip or fix that one test.
I'm going to work on trying to recreate the initial test that found this difference first to see if we still find telemetry functions in there after this change.
Comment 11•6 years ago
|
||
if it is 1 failure, lets fix the test if possible or we can |skip-if = coverage|. Good idea to revisit this to see if changing the initDelay will show telemetry functions in per-test code coverage.
Comment 12•6 years ago
|
||
I think I recreated it in this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=37022d37ae3a7cb3bce5740dc20b27809b95fb1c&selectedJob=194154218
I've attached a file containing all the coverage that I see in the test-coverage task for tools/code-coverage/tests/mochitest/test_coverage_specialpowers.html with the following changes (see comment 5): https://hg.mozilla.org/try/rev/0eaaf5290d5c683daa93603b326370e98c577a2b
The coverage was found with the command `py pertestcoverage_view.py ~\Downloads\per-test-coverage-reports(93) -t special -s tele` using: https://github.com/gmierz/coco-tools
Unfortunately, this change didn't disable telemetry completely. I'll attach the coverage that it has in this test: https://treeherder.mozilla.org/#/jobs?repo=try&revision=73253340f90877cc00ad48ea1e24d9ed4f5a50fc
I'm testing this with a different change here as well: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e02f39cabcc5ada10bd3901e61e7664b468a506
Comment 13•6 years ago
|
||
Coverage with telemetry disabled through the initDelay change.
Comment 14•6 years ago
|
||
These are the differences between the two. Each entry, (-1--2), means "what is in the coverage when telemetry is enabled but not in the coverage when it is disabled", and (-2--1) means the opposite.
Comment 15•6 years ago
|
||
Also, here is another preference which could cause false differences between tests, and variability in certain cases: https://dxr.mozilla.org/mozilla-central/source/testing/profiles/unittest/user.js#16
But I don't think we can't disable this one, maybe trying to capture it in the baseline would be better?
Comment 16•6 years ago
|
||
Here's a graph for the change in coverage across the files in the .txt attachments. Point 1 is where the initDelay was unchanged, and points 2, and 3 showed a stable drop in coverage from the change.
So the change does help but it doesn't stop everything from coming through. I'm going to look into how variable the coverage from this task is because I haven't seen a call to GetWebrtcStats yet [1] (around lines 687 to 689).
Luckily for us though, it seems like the baseline captures everything that's left over. This can be seen just by looking at the Telemetry.cpp coverage:
Baseline coverage: toolkit/components/telemetry/Telemetry.cpp: [120, 293, 542, 548, 552, 553, 565, 604, 607, 608, 611, 613, 614, 618, 620, 648, 650, 651, 653, 655, 657, 660, 663, 667, 669, 1088, 1091, 1095, 1098, 1110, 1111, 1112, 1116, 1120, 1137, 1138, 1139, 1143, 1147, 1148, 1149, 1150, 1151, 1169, 1173, 1175, 1179, 1181, 1183, 1184, 1185, 1186, 1193, 1197, 1200, 1201, 1204, 1207, 1210, 1212, 1214, 1215, 1217, 1229, 1547, 1549, 1553, 1554, 1558, 1560, 1564, 1565, 1580, 1581, 1628, 1630, 1634, 1638, 1639, 1646, 1652, 1654, 1658, 1660, 1664, 1666, 1748, 1752, 1801, 1803, 1804, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 1991, 1996, 2000, 2001, 2010, 2012, 2016, 2018, 2048, 2052, 2053, 2054, 2114, 2117, 2121, 2123, 2127, 2129, 2132, 2133, 2134, 2137, 2143, 2145, 2146, 2149, 2151, 2152, 2155, 2157, 2158, 2167, 2169, 2170, 2173, 2175, 2176, 2185, 2187, 2188, 2197]
Without baseline correction, with telemetry disabled: toolkit/components/telemetry/Telemetry.cpp: [1088, 1091, 1138, 1558, 1560, 1564, 1565, 1580, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 2016, 2018, 2143, 2145, 2146, 2167, 2169, 2170]
With baseline correction, with telemetry disabled: toolkit/components/telemetry/Telemetry.cpp: []
Without baseline correction, with telemetry enabled: toolkit/components/telemetry/Telemetry.cpp: [443, 448, 453, 454, 458, 463, 468, 469, 473, 476, 482, 490, 491, 492, 493, 630, 634, 635, 639, 643, 644, 693, 695, 696, 1088, 1091, 1137, 1138, 1139, 1169, 1173, 1175, 1558, 1560, 1564, 1565, 1580, 1628, 1630, 1670, 1673, 1698, 1701, 1702, 1932, 1934, 1935, 1944, 1946, 1947, 1980, 1982, 1986, 1987, 2016, 2018, 2143, 2145, 2146, 2167, 2169, 2170]
With baseline correction, with telemetry enabled: toolkit/components/telemetry/Telemetry.cpp: [443, 448, 453, 454, 458, 463, 468, 469, 473, 476, 482, 490, 491, 492, 493, 630, 634, 635, 639, 643, 644, 693, 695, 696, 1670, 1673, 1698, 1701, 1702, 2167, 2169, 2170]
As mentioned above, I'm going to look into how variable these results are. There is at least line 2170 that seems to be have been an intermittent line before disabling.
[1]: https://dxr.mozilla.org/mozilla-central/source/toolkit/components/telemetry/Telemetry.cpp?q=Telemetry.cpp&redirect_type=direct#685
Comment 17•6 years ago
|
||
These are the differences between chunks left over after disabling telemetry but without baseline correction.
Comment 18•6 years ago
|
||
And these are the differences with telemetry disabled, and after baseline correction.
There are some other files which were used randomly (grouped into file-level variability): 'toolkit/components/telemetry/ipc/TelemetryComms.h', 'toolkit/components/telemetry/TelemetryHistogram.cpp', 'toolkit/components/telemetry/Telemetry.cpp'
Or:
toolkit/components/telemetry/TelemetryHistogram.cpp: [1410, 1416, 1417]
toolkit/components/telemetry/Telemetry.cpp: [2167, 2169, 2170]
toolkit/components/telemetry/ipc/TelemetryComms.h: [189, 190, 220, 222, 225, 226]
So, disabling it with this flag definitely helped - but there may be another setting we could change to fix this last bit of variability.
Comment 19•6 years ago
|
||
Complete variability of test_coverage_specialpowers after base correction and disabling telemetry.
Comment 20•6 years ago
|
||
I was looking through the data in this push [1] and at files with 'time' in their name, and I found this file, with intermittent coverage of certain lines, but only happening in the first time the file is seen across all the pushes:
"js/xpconnect/src/XPCRuntimeService.cpp": {
"-1--2": [
53,
58,
59,
60,
61
],
"-2--1": [ ]
},
This is called by: https://dxr.mozilla.org/mozilla-central/source/js/xpconnect/src/XPCWrappedNativeJSOps.cpp#734
And these lines are called uniquely in the first run:
"js/xpconnect/src/XPCWrappedNativeJSOps.cpp": {
"-1--2": [
768,
384,
385,
775,
753,
754,
755,
756,
757,
758,
760,
764,
767
],
"-2--1": [ ]
}
With this coverage afterwards:
js/xpconnect/src/XPCWrappedNativeJSOps.cpp: [99, 101, 103, 104, 106, 107, 108, 111, 114, 119, 120, 121, 123, 124, 125, 128, 222, 236, 237, 238, 242, 244, 245, 248, 252, 253, 255, 258, 261, 262, 270, 271, 272, 277, 278, 279, 280, 281, 282, 351, 373, 374, 375, 376, 377, 378, 379, 382, 383, 389, 390, 393, 394, 395, 396, 397, 402, 404, 405, 406, 407, 408, 409, 410, 413, 414, 415, 417, 424, 426, 427, 428, 431, 496, 498, 499, 502, 503, 506, 507, 509, 513, 515, 516, 519, 520, 521, 525, 527, 528, 539, 541, 542, 543, 545, 547, 548, 549, 550, 553, 555, 556, 559, 561, 562, 563, 565, 566, 570, 573, 579, 617, 619, 620, 621, 623, 664, 666, 668, 671, 672, 675, 676, 677, 681, 683, 684, 685, 689, 690, 693, 694, 695, 699, 702, 703, 704, 705, 715, 717, 718, 719, 720, 721, 722, 724, 726, 727, 729, 731, 732, 734, 736, 737, 740, 741, 743, 747, 748, 749, 779, 854, 856, 859, 860, 861, 871, 873, 874, 875, 877, 878, 881, 883, 884, 885, 887, 890, 892, 893, 897, 899, 900, 901, 903, 907, 909, 911, 912, 913, 915, 918, 921, 922, 923, 924, 925, 930, 931, 970, 973, 974, 975, 976, 979, 982, 983, 986, 987, 991, 995, 996, 997, 998]
Now, that function is called by either (from what I've found so far):
(1): https://dxr.mozilla.org/mozilla-central/search?q=%2Bcallers%3A%22nsIXPCScriptable%3A%3AResolve%28nsIXPConnectWrappedNative+%2A%2C+JSContext+%2A%2C+JSObject+%2A%2C+jsid%2C+bool+%2A%2C+bool+%2A%29%22
(2): https://dxr.mozilla.org/mozilla-central/search?q=%2Bcallers%3A%22BackstagePass%3A%3AResolve%28nsIXPConnectWrappedNative+%2A%2C+JSContext+%2A%2C+JSObject+%2A%2C+jsid%2C+bool+%2A%2C+bool+%2A%29%22
But I found no references to (2) being used in the code and it's a '.h' file which is already known to have problems. The intermittent line here is 121 and it's only hit in the first run (everything else is always hit):
obj-firefox/dist/include/nsIXPCScriptable.h: [63, 108, 111, 118, 119, 120, 121]
Furthermore, the line which calls them (line 734 in XPCWrappedNativeJSOps.cpp) is always used in all 21 instances of the test runs. Which makes me think that this is a bug, or I can't find the other references to this call so I can't check if it has coverage, or it's always hit in (2) but since it's a '.h' file it's bugged. This might be a different issue we could get around with the baseline though.
(test_coverage_specialpowers without baseline correction, with telemetry disabled)
[1]: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e02f39cabcc5ada10bd3901e61e7664b468a506
Comment 21•6 years ago
|
||
I tested out changing the `apz.content_response_timeout` preference but I can't find a significant change in variability from this push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=39bb578d8b5427aeeeaceab23a29ed221302ae2a
It might have helped, but it would be a small difference that is not yet visible until we find larger sources of variability.
Reporter | ||
Comment 22•6 years ago
|
||
In this bug, I would only consider things that we find using a code coverage build which we let stay idle for a while.
These things should be disabled for all build configurations, not only code coverage ones, as they are sources of intermittent failures (and because we want the code coverage configuration to be as close as possible to normal configurations).
Sparky, I'd make this a meta bug and file separate bugs for every thing we find so that we can fix one at a time, otherwise we get down a rabbit hole trying to fix all possible variability sources.
Reporter | ||
Updated•6 years ago
|
Summary: Use a code coverage build to find things running on timers that should be disabled during tests → [meta] Use a code coverage build to find things running on timers that should be disabled during tests
Comment 23•6 years ago
|
||
As an alternative approach, have you tried waiting for browser-delayed-startup-finished, or similar?
Comment 24•6 years ago
|
||
:gbrown, I haven't tried that, thanks for pointing it out! I'll try that out next.
What other similar approaches are you thinking off?
Comment 25•6 years ago
|
||
(In reply to Greg Mierzwinski [:sparky] from comment #24)
> What other similar approaches are you thinking off?
Nothing specific. Generally, "wait for some event after startup", so that most off-main-thread slightly-delayed-after-startup initialization is likely complete. Or maybe just wait for 60? seconds after startup before starting tests?
Comment 26•6 years ago
|
||
Ok, we currently wait for 30 seconds in the baseline tests. I'll check to see if waiting on that observer notification can change our results.
But I'll look into the notifications some more because there could be others we could wait for.
Reporter | ||
Comment 27•6 years ago
|
||
In the run I did I was waiting for way longer (120 s) before starting to collect the coverage data, as I just wanted to see what are the things running on timers that we might want to disable during tests.
Updated•2 years ago
|
Severity: normal → S3
You need to log in
before you can comment on or make changes to this bug.
Description
•