Closed
Bug 631811
(valgrind-on-tbpl)
Opened 14 years ago
Closed 11 years ago
[tracking] Turn Valgrind builds into a full-fledged acceptable test
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: n.nethercote)
References
Details
(Keywords: sec-want, Whiteboard: [sg:want P2][MemShrink:P2])
Attachments
(1 obsolete file)
Cribbing from my list in http://groups.google.com/group/mozilla.dev.tree-management/msg/834a2146d0bd4894 we need to:
* have it running on try
* come as close as we can to letting developers run it themselves with something like `make valgrind`
* not require pushes to a separate repo (for the suppression files) to fix problems
* maybe improve error reporting with tinderbox logs
* get it green after bug 620828
Assignee | ||
Updated•13 years ago
|
Whiteboard: [MemShrink:P1]
Assignee | ||
Updated•13 years ago
|
Blocks: MemShrinkTools
Assignee | ||
Comment 2•13 years ago
|
||
Some clarifications and updates:
- The biggest obstacle: we need the Valgrind runs on the try server (bug 631838) before enabling them on mozilla-central (they're currently hidden and can be viewed by appending '&ignore=1' to a TBPL URL). Without that, many developers won't be able to learn about Valgrind failures until their patch lands on mozilla-central, which is not good. But getting the try server runs happening isn't high on RelEng's priority list, apparently.
- We need to decide what to do about suppressing leaks. Easiest thing is to suppress all of the current ones. Even though some of them might be real leaks in our code, at least then any new ones will be detected.
- There's some breakage at the moment:
make[3]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/locales'
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make tools
make[2]: Entering directory `/builds/slave/valgrind-linux/objdir/browser/installer'
make[2]: Nothing to be done for `tools'.
make[2]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
if test -d ../../dist/bin ; then touch ../../dist/bin/.purgecaches ; fi
make[1]: Leaving directory `/builds/slave/valgrind-linux/objdir/browser/installer'
Traceback (most recent call last):
File "_profile/pgo/profileserver.py", line 57, in <module>
MOZ_JAR_LOG_DIR = os.path.abspath(os.getenv("JARLOG_DIR"))
File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 402, in abspath
if not isabs(path):
File "/tools/python-2.5.1/lib/python2.5/posixpath.py", line 49, in isabs
return s.startswith('/')
AttributeError: 'NoneType' object has no attribute 'startswith'
program finished with exit code 1
I don't know anything about that.
- philor says the current workload is to open a few sites in the browser, the same thing that is run for PGO. This is better than nothing, but it'd be good to have something more strenuous. Tp4 or Tp5, maybe? Something that results in a runtime that's comparable to other test suites seems reasonable; maybe 10--15 minutes.
Reporter | ||
Comment 3•13 years ago
|
||
The bustage is no big deal, bug 663737. It's only interesting in that I didn't notice it for a couple of days, and I wouldn't have noticed it even then except I was stuck having to watch &noignore=1 for other things. Now I'm not anymore, so I'd only see further bustage if I actually went looking for it, which I very rarely do. The current "once or maybe twice a day, but nobody will see it even when it does run" isn't quite worthless, but it's awfully close.
Depends on: 663737
Comment 4•13 years ago
|
||
> We need to decide what to do about suppressing leaks.
Let's suppress existing leaks, and include bug numbers in the suppressions file.
Whiteboard: [MemShrink:P1] → [MemShrink:P1][sg:want P2]
Assignee | ||
Comment 5•13 years ago
|
||
jst will talk to joduinn about this.
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.
We've got two valgrind machines (one linux, one mac) running all unit tests each day as part of the bughunter automation. Once the UI is ready from bughunter and you can query these messages and see what is being found, would that be enough to satisfy this goal? I don't think we can run all unittests through valgrind on a per checkin basis and maintain our standard of reporting results within 2hours of checkin.
The bughunter UI will be ready for use at the end of December 2011.
Comment 7•13 years ago
|
||
That will probably be sufficient. I haven't seen bughunter mockups or prototypes, so I don't know whether it already does everything I want, but I'm happy to file bugs :)
Does bughunter use the in-tree Valgrind suppression files? Will logs be available? Can I get notified if it goes from green to red / if a new error appears? Can we (eventually) make it test both 32-bit and 64-bit Linux?
Comment 8•13 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #5)
> jst will talk to joduinn about this.
jst + joduinn met :-) and details are in bug#706695. Worth noting that valgrind builds are already being run once a night on mozilla-central, but are consistently failing - see bug#696299 for more details.
Comment 9•13 years ago
|
||
(In reply to Jesse Ruderman from comment #7)
>
> Does bughunter use the in-tree Valgrind suppression files?
Not currently, but that should be easy to add.
> Will logs be available?
Yes.
> Can I get notified if it goes from green to red / if a new error appears?
We don't have automatic notification via email in the current plan, but that is something that could be added later.
> Can we (eventually) make it test both 32-bit and 64-bit Linux?
It really depends on the hardware. But yes, both are possible.
Assignee | ||
Comment 10•13 years ago
|
||
We're downgrading this to MemShrink:P2 because the leaks that Valgrind finds are only a fraction of total leaks. Having said that, the other stuff Valgrind finds -- undefined memory errors, etc -- are extremely useful, not to mention often security bugs, and so this would still be a very good thing to have.
Whiteboard: [MemShrink:P1][sg:want P2] → [MemShrink:P2][sg:want P2]
Updated•12 years ago
|
Updated•12 years ago
|
Alias: valgrind-green
Comment 11•12 years ago
|
||
I moved most of the bugs to be suppressed out into bug 793882.
Comment 12•12 years ago
|
||
This comes from:
https://tbpl.mozilla.org/php/getParsedLog.php?id=15562383&tree=Firefox&full=1
(Linux mozilla-central valgrind on 2012-09-26 11:25:42 PDT for push 212cf709135c)
64-bit one is at: https://tbpl.mozilla.org/php/getParsedLog.php?id=15562882&tree=Firefox&full=1
The harness is still sending error exit code 1 out even though all errors have been suppessed.
Comment 13•12 years ago
|
||
> The harness is still sending error exit code 1 out even though all errors
> have been suppressed.
Julian, is it because --error-exitcode=1 doesn't play nice with --show-possibly-lost=no, in addition to the suppression files?
Comment 14•12 years ago
|
||
I think that's what is happening. https://bugs.kde.org/show_bug.cgi?id=307465
Assignee | ||
Comment 15•12 years ago
|
||
I think you're right. Julian, I think in get_printing_rules() in mc_leakcheck.c you should probably use the same logic for |count_as_error| as you do for |print_record|.
Assignee | ||
Comment 16•12 years ago
|
||
> The harness is still sending error exit code 1 out even though all errors
> have been suppessed.
I haven't kept track with the many Valgrind-on-TBPL fixes that have been happening. Is this exit code blocking this bug from completion? (I see there are other blocking bugs, I'm not sure which of those are critical and which would just be nice...)
Comment 17•12 years ago
|
||
> Is this exit code blocking this bug from completion? (I see
> there are other blocking bugs, I'm not sure which of those are critical and
> which would just be nice...)
As far I can tell, the exit code is the only issue preventing it from turning green. I'm not sure if merely turning green == full-fledged acceptable test, it certainly will be acceptable, but probably not full-fledged enough if the blocking bugs are not yet fixed.
Comment 18•12 years ago
|
||
(In reply to Nicholas Nethercote [:njn] from comment #15)
> I think you're right. Julian, I think in get_printing_rules() in
> mc_leakcheck.c you should probably use the same logic for |count_as_error|
> as you do for |print_record|.
Ah, that makes it very easy. Thanks! A patch to exactly to do
exactly that is at https://bugs.kde.org/show_bug.cgi?id=307465#c2
for testing now. I think it fixes the problem.
Comment 19•12 years ago
|
||
Gary, this test is running and finding leaks. It's not a "full-fleged acceptance test", but I'm tempted to close this bug, since it seems like we're as close as we expect to get.
Whiteboard: [MemShrink:P2][sg:want P2] → [sg:want P2]
Comment 20•12 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #19)
> Gary, this test is running and finding leaks. It's not a "full-fleged
> acceptance test", but I'm tempted to close this bug, since it seems like
> we're as close as we expect to get.
We're close, but not really entirely 100% there yet, since it's not yet exposed to TBPL by default.
Comment 21•12 years ago
|
||
Do you still hope that we'll be able to expose it on TBPL by default sometime in the foreseeable future, or have you given up on that cause?
Comment 22•12 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?
Personally I don't see it happening any time soon, see bug 800435 comment 23.
Updated•12 years ago
|
Comment 23•12 years ago
|
||
(In reply to Justin Lebar [:jlebar] from comment #21)
> Do you still hope that we'll be able to expose it on TBPL by default
> sometime in the foreseeable future, or have you given up on that cause?
In the future, yes, but probably not in the foreseeable future.
Please feel free to resolve this any way you like.
Comment 24•12 years ago
|
||
We took the MemShrink tag off, so I guess we don't care whether this gets resolved or not anymore! :)
Assignee | ||
Comment 25•11 years ago
|
||
(This is a copy of gkw's massively informative comment from bug 934257 comment
11.)
The last time this was ever green was likely about 6 months ago, last I filed
was in May 2013, see bug 793882.
Steps:
1 - Go to https://tbpl.mozilla.org/?noignore=1&jobname=valgrind and find the
first V instance, sometimes I'd have to click the "green arrow" a few times
to get the first one.
2 - If it's green, check that everything is indeed okay by going to the full
log, going to the bottom and seeing that the browser quit properly after
running the PGO tests. See bug 885232 for a "green" V but it really is red.
Bug 823787 also describes another quirk.
3 - If it's red, click full log to find out why it is red. It usually might be
a recent error. Or it might be a harness breakage, e.g. bug 854253.
4 - File a bug, e.g. bug 866959, inspect the regression window, e.g.
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=3ada6a2fd0c6&tochange=64d6d002e888
and cc the devs likely involved.
5 - add the suppression to any applicable file in
https://hg.mozilla.org/mozilla-central/file/70de5e24d79b/build/valgrind/
5a - cross-architecture.sup is for suppressions that affect all platforms and
architectures.
5b - i386-redhat-linux-gnu.sup is for suppressions that affect only x86 Linux.
(now no longer being run due to bug 873904)
5c - x86_64-redhat-linux-gnu.sup is for suppressions that affect only x64
Linux.
5d - Bug 817853 turns on --track-origins=yes by default (even though there was
a speed penalty), because I didn't want to turn it on, wait a day, hope the
conditional error shows up, get a stack, file it, add the suppression, then
remove this flag. (I did not add this flag in the past and it got too
cumbersome)
6 - In the list of bugs present in the suppression files, find out which ones
were fixed recently, and remove them from the suppression list.
6b - Sometimes a bad changeset is pushed and causes errors in the nightly build
(which then gets fixed by a subsequent backout *after* the nightly has been
created), but I inspect this particular nightly build log and report the
bug. However, this bug may not exist anymore due to the subsequent
backout, but I don't know if this is the case anymore, so I still file the
bug. This should be mitigated by testing-on-checkin rather than
testing-on-nightly.
7 - Next day, look at the results and cross your fingers for a green build.
Alternative path:
* After pushing a suppression, politely ping a buildduty releng folk to
custom-start a Valgrind build to see that the suppression works as expected
in a few hours, rather than waiting one more day, due to bug 793989 (which
apparently prevents one from doing it oneself).
Other notes:
* Note that as mentioned above, due to bug 873904, 32-bit Linux TBPL builds
were also disabled.
* Valgrind 3.9.0 has also been released so probably an upgrade might be
warranted. The last upgrade was to a special post-3.8.1 build in bug 797573
about just over a year ago.
* We may want to consider running Valgrind TBPL on more tests (bug
795124)/platforms/branches, e.g. bug 803763 (Android platforms), bug 803758
(jsreftests), bug 803757 (crashtests), bug 803739 (xpcshell tests), bug
801913 (reftests), bug 801911 (jstests and jit-tests), bug 794627/bug 799232
(mochitests), bug 797158 (B2G tests?), bug 801955 (more branches, e.g.
mozilla-inbound, fx-team, try, etc.), but we may run into issues like bug
810753.
* I'm not sure if upgrading the Valgrind TBPL machines would help, but this
might help mitigate some slowness.
Assignee: nobody → n.nethercote
Assignee | ||
Updated•11 years ago
|
Whiteboard: [sg:want P2] → [sg:want P2][MemShrink:P2]
Assignee | ||
Updated•11 years ago
|
Attachment #665095 -
Attachment is obsolete: true
Assignee | ||
Comment 26•11 years ago
|
||
FWIW, the runs on m-c are now successful -- no errors reported, and the suppression files have been minimized. And it takes about 45 minutes for a run, which is pretty reasonable. Time to start pushing on the infrastructure side.
Assignee | ||
Updated•11 years ago
|
Alias: valgrind-green → valgrind-on-tbpl
Assignee | ||
Comment 27•11 years ago
|
||
Note to self: once all the requirements for making this test visible are met, a bug should be filed as described by https://wiki.mozilla.org/Sheriffing/Job_Visibility_Policy#Requesting_changes_in_visibility.
Assignee | ||
Comment 28•11 years ago
|
||
All the blocking bugs are now fixed. I believe the job is ready to be unhidden, but I won't file a bug on that until the new year -- making the job visible just before I disappear for two weeks is a bad idea.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•11 years ago
|
Assignee | ||
Comment 29•11 years ago
|
||
Ok, it's really done now. \o/
Updated•7 years ago
|
Component: New Frameworks → General
You need to log in
before you can comment on or make changes to this bug.
Description
•