Closed Bug 1337865 Opened 8 years ago Closed 8 years ago

Leaks detected by xpcshell tests failing on the atom leak assertion (NNTP)

Categories

(MailNews Core :: Backend, defect)

defect
Not set
critical

Tracking

(thunderbird_esr5253+ fixed, thunderbird53 fixed, thunderbird54 fixed)

RESOLVED FIXED
Thunderbird 54.0
Tracking Status
thunderbird_esr52 53+ fixed
thunderbird53 --- fixed
thunderbird54 --- fixed

People

(Reporter: aleth, Assigned: jorgk-bmo)

References

Details

(Whiteboard: [Thunderbird-testfailure: X all debug])

Attachments

(4 files, 8 obsolete files)

(deleted), patch
aceman
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
aceman
: review+
Details | Diff | Splinter Review
(deleted), patch
aceman
: review+
Details | Diff | Splinter Review
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActions.js PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_online.js PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_offline.js PROCESS-CRASH | mailnews/news/test/unit/test_server.js PROCESS-CRASH | mailnews/news/test/unit/test_internalUris.js
Whiteboard: [Thunderbird-testfailure]
Whiteboard: [Thunderbird-testfailure] → [Thunderbird-testfailure: X all debug]
Severity: normal → critical
I am not sure where (which bugzilla) to comment. This bug is about data structure not being released peroperly, correct? I know this bug was mentioned for test failures in xpcshell-test. I have been running |make mozmill| with full DEBUG build of C-C TB and occasionally run this under valgrind/memcheck. valgrind/memcheck no longer runs on latest Debian GNU/Linux kernel for unknown reason (I know it is a SIGSEGV but noone knows exactly why and where it is caused. I have nailed it down to a region of code where a fork() is called inside about the time a helper program is called with the called stack. (this is the forked side, i.e. children before spawn/clone. #0 0x0000000005bae19a in __libc_fork () at ../sysdeps/nptl/fork.c:145 #1 0x0000000013fe7347 in fire_glxtest_process() () at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/xre/glxtest.cpp:313 #2 0x0000000013fddfbe in XREMain::XRE_mainInit(bool*) (this=<optimized out>, aExitFlag=<optimized out>) at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/xre/nsAppRunner.cpp:3118 #3 0x0000000013fe223d in XREMain::XRE_main(int, char**, (anonymous namespace)::BootstrapConfig const&) (this=<optimized out>, argc=<optimized out>, argv=<optimized out>, aConfig=...) at /NREF-COMM-CENTRAL/comm-central/mozilla/toolkit/xre/nsAppRunner.cpp:4628 The caller/parent experiences a SIGSEGV after the fork was invoked. This problem is kernel specific and dependent on some features such as heap/stack randomization and MMAP features, and phase of the moon when the kernel was compiled (so many kernel options are different from the kernel where valgrind can run C-C TB successfully and the kernel where it fails. It is almost impossible to tell what is going on. My current theory is that after fork() is called and some twiddling of getenv/setenv is done BEFORE the signal handler is modified/reset for handling the signal propagation betwen the child and parent correctly, a sigsegv condtion arises and since it is not trapped very well at the time, there is no clear explanation of what went wrong and simply valgrind segfaults. Current educated theory. But I digress.) When valgrind/C-C TB combination runs well under locally built 3.19.5 kernel, and finishes more or less cleanly, I realize that I can capture the UNRELEASED memory (i.e., leaked memory) very clearly. Mind you, this records ALL the allocated memory structure during run-time which the developer may not care aoubt: for example, if a top-level global variable contains the copy (strdup'ed) option strings at the time of program termination, probably noone cares. Or for that matter if a whole JavaString environment is kept around because of bad release practice, we are overwhelmed with unreleased data structure that were created and attached to that JS envrionment. However, valgrind/memcheck records all such allocations done during execution and still left unallocated at the time of termination. So the data structure for JS environment (presumably the root of such big data structure) is reported in the valgrind/memcheck log. At least, someone in the know should be able to identify such data structure (valbrind/memcheck does not say exactly what the memory blob is. But it reports the SIZE of such memory blob and WHERE it is allocated in a nice stack trace. Unfortunately, the allocation in JS code will be reported by way of JS interpreter and so may not be very revealing to what JS source file and line caused the allocation to happen. If someone is interested in looking at the log, I can make the log available somewhere. It is HUGE. (256MB and 16+MB compressed). Some interesting short excerpts are attached.
If we can restrict the |make mozmill| run to a test that shows similar errors as noted by xpcshell-test, then the turn around to see the fix works is rather short. Hoever, full |make mozmill| run takes more than 24 hours on my PC currently (due to the debug/Full DEBUG) version. I need to co-relate the messages I dump explicitly with some memory issues reported by valgrind/memcheck.
Note that test_nsMsgDBView.js and test_junkingWhenDisabled.js also fail again due to a backout in bug 1334874.
Documenting the process of trying to find leaks. First replace the NS_ASSERT() at nsAtomTable.cpp:411 with a print so application doesn't crash before logs are closed. According to https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Refcount_tracing_and_balancing ran: XPCOM_MEM_LOG_CLASSES=nsMsgNewsFolder XPCOM_MEM_REFCNT_LOG=$HOME/refcounts.log mozilla/mach xpcshell-test mailnews/news/test/unit/test_internalUris.js python mozilla/tools/rb/find_leakers.py $HOME/refcounts.log Output: 000000000C4B6C00 (1) @ <nsMsgNewsFolder> According to Aleth' instructions from bug 1334874 comment #35 repeated with XPCOM_MEM_LOG_CLASSES=nsMsgNewsFolder XPCOM_MEM_LOG_OBJECTS=1 XPCOM_MEM_REFCNT_LOG=$HOME/refcounts.log mozilla/mach xpcshell-test mailnews/news/test/unit/test_internalUris.js python mozilla/tools/rb/find_leakers.py $HOME/refcounts.log Output: 000000000C5D6C00 (1) @ <nsMsgNewsFolder> Then perl -w mozilla/tools/rb/make-tree.pl --ignore-balanced --object 000000000C5D6C00 < $HOME/refcounts.log > $HOME/xx.log Now the art is to interpret the resulting file which is huge. Also a quick run through for test_server.js: XPCOM_MEM_LOG_CLASSES=nsMsgNewsFolder XPCOM_MEM_REFCNT_LOG=$HOME/refcounts1.log mozilla/mach xpcshell-test mailnews/news/test/unit/test_server.js python mozilla/tools/rb/find_leakers.py $HOME/refcounts1.log 000000000C84D400 (1) @ <nsMsgNewsFolder> XPCOM_MEM_LOG_CLASSES=nsMsgNewsFolder XPCOM_MEM_LOG_OBJECTS=1 XPCOM_MEM_REFCNT_LOG=$HOME/refcounts1.log mozilla/mach xpcshell-test mailnews/news/test/unit/test_server.js python mozilla/tools/rb/find_leakers.py $HOME/refcounts1.log 000000000C34F400 (1) @ <nsMsgNewsFolder> perl -w mozilla/tools/rb/make-tree.pl --ignore-balanced --object 000000000C34F400 < $HOME/refcounts1.log > $HOME/xx1.log With luck, test_internalUris.js and test_server.js detect the same leak somewhere in news code.
Attached patch 1337865-unrelated-news-leaks.patch (obsolete) (deleted) — Splinter Review
I found these leaks, but they don't fix PROCESS-CRASH | mailnews/news/test/unit/test_server.js PROCESS-CRASH | mailnews/news/test/unit/test_internalUris.js Too bad.
Attached patch 1337865-unrelated-news-leaks.patch (v2). (obsolete) (deleted) — Splinter Review
Attachment #8840404 - Attachment is obsolete: true
Attachment #8840421 - Attachment is obsolete: true
I am uploading log from a run of the following command: $ mach xpcshell-test --verbose --debugger valgrind --debugger-args="--malloc-fill=0xA5 --free-fill=0xC3 --leak-check=full --show-mismatched-frees=no" mailnews/imap/test/unit/test_imapFilterActions.js This means that I ran the test for test_imapFilterActions.js, and let valgrind/memcheck to report memory leaks at the time of TB binary terminating. I have not patched my local source tree with the suggestions of comment 4, and so when unfreed atoms are detected, the abort is taken and so maybe objects that would have been released if abort is not taken are present as leaks. But you get the idea what valgrind/memcheck can report. I wonder if this may be helpful in addition to the high-level leak detector mentioned previously (as in comment 4). If this looks promising, I can tweak the local source and run the tests with similar valgrind/memcheck setting. BTW, the intense slow down by memcheck (since it emulates each instruction to check for memory access issues), some valid operation inside C-C TB may timeout occasionally and the test may report something which is not usually seen. (However, |mach| seems to understand the timeout issue rather well by seemingly extending the timeout internally to avoid such errors. Unfortunately we may be seeing one such error here.) Suggestions welcome. TIA
Looking at test_internalUris.js again and commenting out the subtests, I found the test that causes the leak, it's test_grouplist() and in there this statement: subserver.subscribeListener = subscribeListener; Changing that to subserver.subscribeListener = null; still leaks. I did some debugging in nsSubscribableServer::SetSubscribeListener(): mSubscribeListener which held a null pointer gets assigned a null pointer. So that can possibly not leak. Reducing the test to function* test_grouplist() { let subserver = localserver.QueryInterface(Ci.nsISubscribableServer); subserver.subscribeListener = null; leaks, and strangely enough it leaks 22 atoms, the ones from nsMsgDBFolderAtomList.h and "subscribe" and "nntp" which are allocated here: https://dxr.mozilla.org/comm-central/rev/819fc5b85f129d9ab1c4ab222c68ce4aba65ae7a/mailnews/news/src/nsNntpIncomingServer.cpp#109 So by the looks of if, the whole nsNntpIncomingServer object doesn't get destroyed, which I've just confirmed, the constructor runs but not the destructor.
OK, some more debugging. The nsNntpIncomingServer is the server that's created in the test in the run_test function like so: |localserver = setupLocalServer(server.port);|. It gets created in the nsMsgAccountManager::createKeyedServer() like so: nsCOMPtr<nsIMsgIncomingServer> server = do_CreateInstance(serverContractID.get(), &rv); As soon as in the test I have: let subserver = localserver.QueryInterface(Ci.nsISubscribableServer); subserver.subscribeListener = null; or localserver instanceof Ci.nsISubscribableServer; localserver.subscribeListener = null; the object held in localserver leaks, that is, its destructor isn't run. When removing the line with the assignment, the destructor is run. I find that very weird, Kent, can you shed some light on this?
Flags: needinfo?(rkent)
Also leaking: function* test_grouplist() { let subserver = localserver.QueryInterface(Ci.nsISubscribableServer); let x = subserver.subscribeListener; Or variations of that: subserver.delimiter = ","; let x = subserver.delimiter; Not leaking: function* test_grouplist() { let subserver = localserver.QueryInterface(Ci.nsISubscribableServer); if (subserver) { dump("huhu\n"); } So basically any access to member variables of the "cast" object will cause it to leak.
I did a bit more poking around. I checked that the nsMsgAccountManager object is destroyed. Good. I checked that nsMsgAccountManager::UnloadAccounts() is run, good, since that empties out internal references to incoming servers which are helps in some sort of list (m_incomingServers). Next I tried doing the cast in C++. In nsMsgAccountManager::createKeyedServer() I added: nsCOMPtr<nsIMsgIncomingServer> server = do_CreateInstance(serverContractID.get(), &rv); NS_ENSURE_SUCCESS(rv, NS_ERROR_NOT_AVAILABLE); nsCOMPtr<nsISubscribableServer> ss = do_QueryInterface(server); printf("=== SS=%p\n", (void*)ss); That doesn't leak. nsCOMPtr<nsIMsgIncomingServer> server = do_CreateInstance(serverContractID.get(), &rv); NS_ENSURE_SUCCESS(rv, NS_ERROR_NOT_AVAILABLE); nsCOMPtr<nsISubscribableServer> ss = do_QueryInterface(server); printf("=== SS=%p\n", (void*)ss); if (ss) { char d; ss->GetDelimiter(&d); } That leaks. So as we've seen in the JS example, any access to the "cast" nsISubscribableServer object causes a leak. Finally, I found the problem. nsNntpIncomingServer::GetDelimiter() calls EnsureInner() and that creates a new incoming server, d'oh!! So no XPCOM magic gone wrong, just bad code!!
Flags: needinfo?(rkent)
Comment on attachment 8840423 [details] [diff] [review] 1337865-unrelated-news-leaks.patch (v2b). Miscellaneous leaks found by code inspection and addref/release removals. Try run to come.
Attachment #8840423 - Flags: review?(acelists)
As discussed with Kent on IRC, we should remove mIncomingServer from nsSubscribableServer.cpp, since that back reference will dead-lock object destruction. But that's for another bug.
Attachment #8841270 - Flags: review?(acelists)
(In reply to Jorg K (GMT+1) from comment #15) > As discussed with Kent on IRC, we should remove mIncomingServer from > nsSubscribableServer.cpp, since that back reference will dead-lock object > destruction. This idea is experimentally implemented at https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=906c3e9295c4251a246488c2a989b32026129553 It seems to only fix mailnews/news/test/unit/test_server.js but not the others.
Attached patch leak-search.patch (deleted) — Splinter Review
And you waited for the result until 3 AM. That's dedication ;-) While I was hoping for more tests to get fixed, it's good that fixing a problem we recognised at least fixes one test. Let's return to test_internalUris.js. As we diagnosed, the nsNntpIncomingServer holds a reference to the nsSubscribableServer object and vice versa. With your change, we lose the back reference. This *DOES* fix one of the problems I detected. With your patch applied, the reduced test passes and the nsNntpIncomingServer object is destroyed. Run with this debug patch to see it. Remember that the leaked object I found was a nsMsgNewsFolder (comment #4) and not a nsNntpIncomingServer, for which I didn't look. So there are more leaks to be found here.
Attached patch 1337865-break-reference-loop.patch (obsolete) (deleted) — Splinter Review
I think this is the way to go. It fixes one test and part of another.
Attachment #8841303 - Flags: review+
This together with attachment 8841303 [details] [diff] [review] fixes test_internalUris.js.
Attachment #8841270 - Attachment is obsolete: true
Attachment #8841270 - Flags: review?(acelists)
Attachment #8841305 - Flags: review?(acelists)
Comment on attachment 8840473 [details] Log of running xpcshell-test mailnews/imap/test/unit/test_imapFilterAction under valgrind/memcheck.js With all due respect, I don't think this helps.
Attachment #8840473 - Attachment is obsolete: true
Attachment #8838956 - Attachment is obsolete: true
Let's do the IMAP and Gloda parts in bug 1342745.
Summary: Leaks detected by xpcshell tests failing on the atom leak assertion → Leaks detected by xpcshell tests failing on the atom leak assertion (NNTP)
Try run with the three patches here plus the "view cycle leaks" one from bug 1334874 (and one from bug 1339811) for a greener result: https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=223953da26deb389039f473922a951bc2d029f90 I'm not convinced that Aleth "view cycle leaks" patch from bug 1334874 wasn't the way to go. It has some opposition from Kent and Aceman, but it's not correct that it only fixed tests: It did address issues in code. This this, we should be down to four failures, IMAP and Gloda from bug 1342745.
Let's add some comment on why we do not store exactly what is passed to us. Carrying over r=jorgk.
Attachment #8841303 - Attachment is obsolete: true
Attachment #8841312 - Flags: review+
Comment on attachment 8840423 [details] [diff] [review] 1337865-unrelated-news-leaks.patch (v2b). Review of attachment 8840423 [details] [diff] [review]: ----------------------------------------------------------------- I assume these leaks do not cause test failures. But the fixed look useful.
Attachment #8840423 - Flags: review?(acelists) → review+
(In reply to :aceman from comment #23) > Let's add some comment on why we do not store exactly what is passed to us. > Carrying over r=jorgk. Great! (In reply to :aceman from comment #24) > I assume these leaks do not cause test failures. But the fixes look useful. They do, because they avoid running up a high memory bill. Besides, that addref/release stuff is extremely error-prone and ugly. But let's check that those changes don't do damage. That's why I started that try run (comment #22) with all the accumulated goodies ;-)
Comment on attachment 8841305 [details] [diff] [review] 1337865-release-reference.patch - fixes test test_internalUris.js Review of attachment 8841305 [details] [diff] [review]: ----------------------------------------------------------------- It's interesting that this is needed. I would expect the reference to be killed when the let-declared variable goes out of scope (at the end of test_grouplist). Maybe because it is a generator we never run it to the end of the function because we stop execution on the 'yield true', so it keeps handing around waiting if we call it again. Maybe we should call .return() on the generator when it yields true in https://hg.mozilla.org/comm-central/file/tip/mailnews/test/resources/asyncTestUtils.js
Attachment #8841305 - Flags: feedback?(rkent)
Attachment #8841305 - Flags: feedback?(arai.unmht)
(In reply to :aceman from comment #26) > ... so it keeps handing around waiting if we call it again. That's what I thought. Alternate fixes welcome, especially those, which will fix more tests ;-)
(In reply to Jorg K (GMT+1) from comment #22) > Try run with the three patches here plus the "view cycle leaks" one from bug > 1334874 (and one from bug 1339811) for a greener result: > https://treeherder.mozilla.org/#/jobs?repo=try-comm- > central&revision=223953da26deb389039f473922a951bc2d029f90 > With this, we should be down to four failures, IMAP and Gloda from bug > 1342745. Right, four failures left, the ones I moved to bug 1342745. Aceman, how much more time would you like to invest into understanding JS GC when I'm pretty sure that we fixed the TB code issues here ;-)
Comment on attachment 8841305 [details] [diff] [review] 1337865-release-reference.patch - fixes test test_internalUris.js Review of attachment 8841305 [details] [diff] [review]: ----------------------------------------------------------------- at least the lexical scope itself doesn't hold the reference at the end of test. I don't see the scope alive in cc log. maybe null-ing the somehow triggers cc for underlying XPCOM things, I guess. as said in IRC, doing Cu.forceCC() at the end of test may give us some hint.
Attachment #8841305 - Flags: feedback?(arai.unmht)
Alternate solution only nulling the listener.
Attachment #8841305 - Attachment is obsolete: true
Attachment #8841305 - Flags: review?(acelists)
Attachment #8841305 - Flags: feedback?(rkent)
Attachment #8841334 - Flags: review?(acelists)
Sorry about the noise. Fixed comment and commit message. The sad thing is that the test still leaks two databases: db left open c:\\users\\jorgk\\appdata\\local\\temp\\firefox\\xpcshellprofile\\News\\localhost\\test.filter.msf" db left open c:\\users\\jorgk\\appdata\\local\\temp\\firefox\\xpcshellprofile\\News\\localhost\\test.malformed&-name.msf" I suggest that once we plugged the leaking atoms, we invest some time into stopping those leaked databases.
Attachment #8841334 - Attachment is obsolete: true
Attachment #8841334 - Flags: review?(acelists)
Attachment #8841335 - Flags: review?(acelists)
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
OS: Unspecified → All
Hardware: Unspecified → All
Version: unspecified → Trunk
Comment on attachment 8841335 [details] [diff] [review] 1337865-release-reference.patch (v2b) - fixes test test_internalUris.js Review of attachment 8841335 [details] [diff] [review]: ----------------------------------------------------------------- I like this version as it is a balanced operation. We set the listener then remove it.
Attachment #8841335 - Flags: review?(acelists) → review+
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 54.0
Comment on attachment 8841312 [details] [diff] [review] 1337865-break-reference-loop.patch v1.1 We should uplift this in due course to fix leaks and tests on the branches.
Attachment #8841312 - Flags: approval-comm-beta?
Attachment #8841312 - Flags: approval-comm-aurora?
(In reply to Jorg K (GMT+1) from comment #31) > Created attachment 8841335 [details] [diff] [review] > 1337865-release-reference.patch (v2b) - fixes test test_internalUris.js > > Sorry about the noise. Fixed comment and commit message. > > The sad thing is that the test still leaks two databases: > db left open > c: > \\users\\jorgk\\appdata\\local\\temp\\firefox\\xpcshellprofile\\News\\localho > st\\test.filter.msf" > db left open > c: > \\users\\jorgk\\appdata\\local\\temp\\firefox\\xpcshellprofile\\News\\localho > st\\test.malformed&-name.msf" > > I suggest that once we plugged the leaking atoms, we invest some time into > stopping those leaked databases. This would be great, and I think would have a significant impact on memory in the wild. The test doesn't set msgDatabase = null for either folder, but I suspect even if it did, and even if that were followed up with an explicit GC, there would be a 'leak'. Meaning that eventually the msgDatabase would be released, just not when expected/measured.
(In reply to alta88 from comment #35) > Meaning that eventually the msgDatabase would be released, just not when expected/measured. Yes, that's an interesting aspect which we'll investigate in bug 1342858. Help welcome!
(In reply to Jorg K (GMT+1) from comment #12) bableServer); > if (subserver) { > dump("huhu\n"); > } > > So basically any access to member variables of the "cast" object will cause > it to leak. Great detective work. I think the setting of the attribute (even if the value is null) makes the object to have a permanent value and thus non-eligible for automatic destruction.
Let's keep to the facts here and not create any myth: The problem was caused by a "reference loop", A points to B and B points to A, so unless you manually break the loop, these objects won't be refcounted down and die. We removed the back reference, one test got fixed and a second one needed a little helping along by triggering JS GC. > So basically any access to member variables of the "cast" object will cause > it to leak. That was *only* true in this case, since the access internally created a new object ("composition": https://en.wikipedia.org/wiki/Composition_over_inheritance) which had the back reference.
(In reply to Jorg K (GMT+1) from comment #38) > Let's keep to the facts here and not create any myth: The problem was caused > by a "reference loop", A points to B and B points to A, so unless you > manually break the loop, these objects won't be refcounted down and die. We > removed the back reference, one test got fixed and a second one needed a > little helping along by triggering JS GC. > > > So basically any access to member variables of the "cast" object will cause > > it to leak. > That was *only* true in this case, since the access internally created a new > object ("composition": > https://en.wikipedia.org/wiki/Composition_over_inheritance) which had the > back reference. Thanks for the clarification. I am not familiar with the internal implementation of JS symbol tables, and such. So, OK, in this case, the reference to delimiter in JS code a la let x = subserver.delimiter; causes the corresponding Getter function via XPCOM and the call of GetDelimiter() automatically created the loop-reference. Tough. I am afraid that there may be tons of such yet-to-be-discovered/dormant issues in resolving the unclosed database reference. :-( Well, but as you say, we finally get proper attention on this issue. That *IS* great. TIA
Attachment #8841312 - Flags: approval-comm-aurora? → approval-comm-aurora+
Attachment #8841312 - Flags: approval-comm-esr52?
Depends on: 1346338
Attachment #8841312 - Flags: approval-comm-beta?
Attachment #8841312 - Flags: approval-comm-esr52? → approval-comm-esr52+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: