Crash [@ OOM | small] while generating MSF file from many emails in Maildir format. Memory leak.
Categories
(MailNews Core :: Database, defect)
Tracking
(Not tracked)
People
(Reporter: t.matsuu, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash, memory-leak, stackwanted, Whiteboard: [maildirblocker?])
Comment 15•6 years ago
I used to run valgrind to see if there are memory alloc/free mismatches, etc.
This bug should be easy to spot if valgrind runs today.
Unfortunately, for some months since late last year, I cannot run valgrind on my local development PC.
I am running Debian kernel inside virtualbox.
It probably is a particular kernel configuration parameter or two of official Debian supported kernels
that must be the cause of strange valgrind crash.
In the Linux 3.x kernel series, some versions allowed me to run valgrind (and run Thunderbird under it), but others did not, and I could not figure out which kernel config parameter changes were responsible for the failure of valgrind.
So I used to keep a particular 3.y kernel so that I could run valgrind to test TB's memory issues.
However, time moved on and the Debian userland now requires the 4.x kernel series.
And valgrind has been crashing and I have not been able to run it. I tried creating a reasonably configured kernel from pristine source, but
valgrind still crashed. I am not sure what is wrong.
I am not sure if this is Debian-specific (probably so, because in the 3.x days I could produce a kernel from pristine Linux source with my own config that allowed me to run valgrind, while some Debian-supported 3.y versions did not) or related to some arcane VirtualBox issue.
Anyway, if someone can run valgrind on their local development machine, the cause of the bug should be very easy to spot.
Or does mozilla still run TB under valgrind for testing purposes from time to time?
(I understand FF is run under valgrind from time to time. No?)
I understand such testing is done maybe once a week or less often.
(It used to take almost 24 hours on my PC to run |make mozmill| under valgrind.
Last November I specifically rebuilt my home PC with a Ryzen CPU with 8 cores and 20MB cache in the hope of running |make mozmill| under valgrind faster, but due to the mysterious valgrind crash under the Debian-supported 4.x series kernel, I have not been able to run it yet.)
ADDED: So if a test that mimics the behavior that triggers the problem is created and testing under valgrind is done from time to time in Mozilla's testing farm, this problem could probably be analyzed there.
(Wait, I wonder if |make mozmill| in stock form can be run under valgrind. I locally create a dummy thunderbird binary that actually invokes the original thunderbird under valgrind and let this fake thunderbird binary run within the |make mozmill| testing scheme. If it is not easy to run TB under valgrind, the testing in the Mozilla testing farm may not be easy...)
So if a developer uses, say, Fedora (I believe the main developer of valgrind/memcheck uses Fedora) on their development machine and can run a TB test suite such as |make mozmill| under valgrind, the cause of this problem and of other bugzilla entries related to memory in Maildir usage should be easy to spot in no time.
Well I have used Debian for almost 20 years now, and so I am not inclined to switch to Fedora anytime soon... Maybe at the office...
If someone knows the particular change to the kernel config file that would allow Debian to run valgrind, I would love to hear about it.
NOTE: for a small program, valgrind has no issue even under Debian's 4.x kernel.
It is a huge program with many dynamically loaded libraries that causes this mysterious crash of valgrind. TB fits the bill.
I can run small programs under valgrind without an issue to my consternation. So debugging is very hard...
Comment 16•5 years ago
Do you still crash when using version 68?
Comment 17•5 years ago
(In reply to Wayne Mery (:wsmwk) from comment #16)
> Do you still crash when using version 68?
Tested with 14000+ mails. Haven't noticed significant memory increase or crash. It looks fixed.
Regards!
Comment 18•5 years ago
(In reply to ISHIKAWA, Chiaki (may be slow to respond until Jan 4.) from comment #15)
> I used to run valgrind to see if there are memory alloc/free mismatches, etc.
> This bug should be easy to spot if valgrind runs today. Unfortunately, for some months since late last year, I cannot run valgrind on my local development PC.
> I am running Debian kernel inside virtualbox.
> It probably is a particular kernel configuration parameter or two of official Debian supported kernels
> that must be the cause of strange valgrind crash.
I have found out that if I log in as superuser, valgrind no longer crashes during TB testing.
It seems that, in my setup, either the kernel or a security protection module such as SELinux does not allow an ordinary user to extend the stack at runtime.
valgrind bugzilla:
https://bugs.kde.org/show_bug.cgi?id=405295
See comment 9 there.
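Whether a plain stack-size limit plays a part can be checked without valgrind. The following is only a minimal sketch, assuming a Linux/POSIX system and Python's standard `resource` module; the helper names `stack_limits` and `fmt` are made up for illustration. It prints the soft and hard stack limits of the current process, so running it once as an ordinary user and once as superuser would show whether the two accounts differ. It says nothing about SELinux or other security modules, which would need separate checks.

```python
import resource


def fmt(limit):
    """Render an rlimit value; RLIM_INFINITY means 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else f"{limit} bytes"


def stack_limits():
    """Return the (soft, hard) stack-size limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_STACK)


if __name__ == "__main__":
    soft, hard = stack_limits()
    print("soft stack limit:", fmt(soft))
    print("hard stack limit:", fmt(hard))
```

If the two accounts report the same limits, the restriction is more likely to come from somewhere other than the rlimit settings.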
Back to the original issue of the bugzilla.
Maybe I can test the reported scenario this afternoon.
Comment 19•5 years ago
Can someone enlighten me regarding how to enable maildir in the latest code?
I can't find any reference to "maildir" in the preferences or in the general settings of TB. The preference setting dialogs have changed quite a bit.
BTW, the Sunday/Saturday code (from C-C) had a problem of not allowing one to input anything into the body text(?). I spent a couple of days trying to figure out what was wrong, since I could not even write a message after starting it up.
I refreshed the code in the last 12 hours, and it seems to work now without such an issue.
Despite comment 17, I just wanted to see if there is anything we have not covered...
Comment 20•5 years ago
(In reply to ISHIKAWA, Chiaki (may be slow to respond until Jan 4.) from comment #19)
> Can someone enlighten me regarding how to enable maildir in the latest code?
> I can't find any reference to "maildir" in the preferences or in the general settings of TB. The preference setting dialogs have changed quite a bit.
> BTW, the Sunday/Saturday code (from C-C) had a problem of not allowing one to input anything into the body text(?). I spent a couple of days trying to figure out what was wrong, since I could not even write a message after starting it up.
> I refreshed the code in the last 12 hours, and it seems to work now without such an issue.
> Despite comment 17, I just wanted to see if there is anything we have not covered...
It's under Server Settings in Account Setup, but I think it's only available when you create your first account. Otherwise, it's disabled.
Comment 21•5 years ago
(In reply to Branimir Amidžić from comment #20)
> It's under Server Settings in Account Setup, but I think it's only available when you create your first account. Otherwise, it's disabled.
Thank you. This should get me going.
Comment 22•5 years ago
I thought I would copy a couple of messages to the Maildir account from an account's folder, and repeat
copy 2 messages
copy 4 messages
copy 8 messages
...
copy 1024 messages.
However, I got a fatal bug elsewhere at the initial copy. :-(
bug 1609789
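For planning such a run, the batch sizes and the cumulative message count of the doubling sequence above (2, 4, 8, ..., 1024) can be tallied with a short Python sketch. It only does the arithmetic; the copying itself would still be done by hand in TB, and the function name `doubling_batches` is just illustrative.

```python
def doubling_batches(first=2, last=1024):
    """Yield batch sizes first, 2*first, ... up to and including last."""
    size = first
    while size <= last:
        yield size
        size *= 2


batches = list(doubling_batches())
total = sum(batches)
print(batches)
print("rounds:", len(batches), "messages copied in total:", total)
```

With the defaults this gives 10 rounds and 2046 copied messages in total, which is a modest count compared with the 14000+ mails mentioned in comment 17.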
Comment 23•5 years ago
Hi Chiaki,
Which bug(s) block this?
Comment 24•5 years ago
Three months ago, I could not test this due to the bug I mentioned in comment 22.
> However, I got a fatal bug elsewhere at the initial copy. :-(
> bug 1609789
That bug has been taken care of.
Since then, I got carried away by the coronavirus outbreak in Japan, or rather the inept handling of it by the Japanese government, but I digress.
Now that I work from home and have more time (less commute time), I thought I would test this and other bugzilla entries.
But for the last 10 days or so, I cannot build TB any more due to a few issues such as
Bug 1633092
TB build failure: GLSL optimizer output causes a compiler error (GCC-9) error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]
Bug 1630345
./mach bootstrap fails with python-pip dependency issue: python-pip : Depends: python-pip-whl (= 18.1-5) but 20.0.2-4 is to be installed
(Well, actually I just found a workaround for bug 1630345 on my PC.)
Once I sort them out, I will come back to this bug. :-(
Comment 25•4 years ago
(In reply to ISHIKAWA, Chiaki from comment #24)
> Three months ago, I could not test this due to the bug I mentioned in comment 22.
> ...
> But for the last 10 days or so, I cannot build TB any more due to a few issues such as Bug 1633092
> ...
> Once I sort them out, I will come back to this bug. :-(
Only one left!
Comment 26•3 years ago
Let me try again this week.
I have finally been able to build TB locally after I updated my M-C/C-C tree (!)
Comment 27•3 years ago
I tested with a locally created TB.
Not great news. TB crashes at shutdown.
I built a working copy of TB from the locally updated M-C/C-C tree (with only minimal local patches so as not to disturb TB operation).
So this may be even newer than the Daily available on the web.
As I learned a couple of years back, I can select mbox or maildir when a new account is created.
(And if I select maildir, TB needs to restart.)
Well, I received about 180 messages from locally running daemons. (This was tested on a local Linux machine. Actually a Linux image inside VirtualBox.)
I repeatedly copied these messages to a newly created empty folder.
Even after the message count exceeded 1000, there was no problem.
I did something similar by copying these 1000+ messages at once to a different folder.
Again, no problem.
But please note that such repeated copying of the same 180+ messages may not reproduce real-world conditions, since the message ID strings, the time/date, and the sender/receiver information are all repeated.
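One way around the repeated headers would be to generate test messages whose Message-ID, Date and Subject all differ. The following is a rough sketch using only Python's standard `mailbox` and `email` modules. The function name `fill_maildir`, the placeholder addresses and the start timestamp are assumptions made for illustration, and it has not been checked whether TB's maildir store picks up a directory written this way, so treat it as a starting point for producing varied input rather than a verified reproduction recipe.

```python
import mailbox
from email.message import EmailMessage
from email.utils import formatdate, make_msgid


def fill_maildir(path, count, sender="sender@example.invalid",
                 recipient="recipient@example.invalid"):
    """Write `count` messages with distinct Message-ID, Date and Subject."""
    box = mailbox.Maildir(path, create=True)
    base = 1_600_000_000  # arbitrary start time (seconds since the epoch)
    try:
        for i in range(1, count + 1):
            msg = EmailMessage()
            msg["From"] = sender
            msg["To"] = recipient
            msg["Subject"] = f"test subject {i}"
            msg["Date"] = formatdate(base + 60 * i)  # one minute apart
            msg["Message-ID"] = make_msgid(domain="example.invalid")
            msg.set_content(f"test body {i}\n")
            box.add(msg)  # delivered as a new file in the Maildir
    finally:
        box.close()
    return path
```

A call such as `fill_maildir("/tmp/test-maildir", 2000)` (the path is hypothetical) would produce 2000 distinct messages that could then be imported or copied for a test run.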
OK, I quit the running TB. Here, I did not see a crash.
So I used the following short shell snippet to send myself a couple of thousand e-mails to see how this impacts TB.
$ for f in $(seq 1 2000)
> do
> echo "\ntest test" | mail -s "test subject $f" ishikawa
> done
Well, actually the "\n" in "\ntest ..." is meaningless; it can simply be "test test".
|mail| under Debian GNU/Linux prompts for a "cc:" address when called interactively from the shell, but it turns out
that it does not seem to ask when it is invoked as part of a shell script (or rather, when its input is a pipe).
So it sent 2000 e-mails to me, and I thought I would receive those 2000 e-mails by running TB again.
So I ran TB and tried receiving messages. Great, I thought.
Actually, since TB is set up to receive messages automatically at startup (by default, I think, and periodically, too),
it began receiving e-mails soon after it was started.
But then, strangely, I found only 980 new e-mails in TB's Inbox. (I only found messages up to "test subject 980".)
Something went wrong and TB was no longer able to receive further e-mails.
I was not sure what was happening.
So I tried to receive remaining e-mails from mail command invoked from the shell.
I got this message:
$mail
Cannot read mailbox /var/mail/ishikawa: Conflict with previous locker
and sure enough, there was an "ishikawa.lock" file.
Who, or rather which program, created this lock file? TB (?), most likely.
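A leftover lock can at least be dated. The sketch below assumes the dot-lock naming seen above (the mailbox path with `.lock` appended) and uses only Python's standard library; `describe_dotlock` is a hypothetical helper name. It cannot tell which program created the lock, but the owner uid and the age help to judge whether the lock dates from the moment TB stopped receiving.

```python
import os
import time


def describe_dotlock(mailbox_path):
    """Return info about `<mailbox_path>.lock`, or None if it is absent."""
    lock_path = mailbox_path + ".lock"
    try:
        st = os.stat(lock_path)
    except FileNotFoundError:
        return None
    return {
        "path": lock_path,
        "owner_uid": st.st_uid,
        # Seconds since the lock file was last modified.
        "age_seconds": max(0.0, time.time() - st.st_mtime),
    }
```

For the case above one would call `describe_dotlock("/var/mail/ishikawa")` and compare the reported age with the time at which TB stopped at message 980.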
So I thought I would quit TB and see if the lock file would disappear.
Then I hit this crash at the program exit stage.
Hit MOZ_CRASH(mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction) at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5e9]
#02: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#03: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#04: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#05: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird +0x162fa]
#06: ??? (???:???)
Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird (pid = 50352) received signal 11.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x6604035]
#02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x13200]
#03: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5f3]
#04: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#05: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#06: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#07: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird +0x162fa]
#08: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird 50352' to attach your debugger to this thread.
Ouch, so there could be a memory structure (a list in this case) which is not properly cleared.
On a memory-tight system, this could be a problem. I have 16GB of memory assigned to VirtualBox on a PC with 32GB of real memory.
When I looked at the stack using gdb, this is what I got:
(gdb) where
#0 0x00007f290ce30335 in __GI___clock_nanosleep
(clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffe9c218f50, rem=rem@entry=0x7ffe9c218f50) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:43
#1 0x00007f290ce353f3 in __GI___nanosleep
(req=req@entry=0x7ffe9c218f50, rem=rem@entry=0x7ffe9c218f50)
at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2 0x00007f290ce3532a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3 0x00007f2904a28af9 in common_crap_handler(int, void const*)
(signum=11, aFirstFramePC=0x7f2904a03035 <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+197>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:95
#4 0x00007f2904a28b1d in ah_crap_handler(int) (signum=<optimized out>)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:103
#5 0x00007f2904a03035 in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)
(signo=11, info=0x7ffe9c2191b0, context=0x7ffe9c219080)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
#6 0x00007f290d174200 in <signal handler called> () at /lib/x86_64-linux-gnu/libpthread.so.0
#7 MOZ_Crash
(aReason=0x55c9973b0e80 <sPrintfCrashReason> "mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction", aLine=440, aFilename=0x7f2906ef2370 "/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h") at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Assertions.h:261
#8 mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>)
at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
--Type <RET> for more, q to quit, c to continue without paging--
#9 mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>)
at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:437
#10 0x00007f290cda9f67 in __run_exit_handlers
(status=0, listp=0x7f290cf28738 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#11 0x00007f290cdaa10a in __GI_exit (status=<optimized out>) at exit.c:139
#12 0x00007f290cd927f4 in __libc_start_main (main=
0x55c99730ae90 <main(int, char**, char**)>, argc=2, argv=0x7ffe9c2197a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe9c219798) at ../csu/libc-start.c:366
#13 0x000055c99730b2fa in _start ()
(gdb) up
#1 0x00007f290ce353f3 in __GI___nanosleep (req=req@entry=0x7ffe9c218f50,
rem=rem@entry=0x7ffe9c218f50) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
25 ../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) up
#2 0x00007f290ce3532a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
55 ../sysdeps/posix/sleep.c: No such file or directory.
(gdb) up
#3 0x00007f2904a28af9 in common_crap_handler (signum=11,
aFirstFramePC=0x7f2904a03035 <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+197>)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:95
95 sleep(_gdb_sleep_duration);
(gdb) up
#4 0x00007f2904a28b1d in ah_crap_handler (signum=<optimized out>)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:103
103 common_crap_handler(signum, CallerPC());
(gdb) up
#5 0x00007f2904a03035 in nsProfileLock::FatalSignalHandler (signo=11, info=0x7ffe9c2191b0,
context=0x7ffe9c219080)
at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
183 oldact->sa_handler(signo);
(gdb) up
#6 <signal handler called>
(gdb) up
#7 MOZ_Crash (
aReason=0x55c9973b0e80 <sPrintfCrashReason> "mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction", aLine=440,
aFilename=0x7f2906ef2370 "/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h")
at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Assertions.h:261
261 MOZ_REALLY_CRASH(aLine);
(gdb) list
256 MOZ_FUZZING_HANDLE_CRASH_EVENT4("MOZ_CRASH", aFilename, aLine, aReason);
257 #if defined(DEBUG) || defined(FUZZING)
258 MOZ_ReportCrash(aReason, aFilename, aLine);
259 #endif
260 MOZ_CRASH_ANNOTATE(aReason);
261 MOZ_REALLY_CRASH(aLine);
262 }
263 #define MOZ_CRASH_UNSAFE(reason) MOZ_Crash(__FILE__, __LINE__, reason)
264
265 static const size_t sPrintfMaxArgs = 4;
(gdb) list 200
^CQuit
(gdb) list
266 static const size_t sPrintfCrashReasonSize = 1024;
267
268 MFBT_API MOZ_COLD MOZ_NEVER_INLINE MOZ_FORMAT_PRINTF(1, 2) const
269 char* MOZ_CrashPrintf(const char* aFormat, ...);
270
271 /*
272 * MOZ_CRASH_UNSAFE_PRINTF(format, arg1 [, args]) can be used when more
273 * information is desired than a string literal can supply. The caller provides
274 * a printf-style format string, which must be a string literal and between
275 * 1 and 4 additional arguments. A regular MOZ_CRASH() is preferred wherever
(gdb) list 265
^CQuit
(gdb) print aLine
$1 = 440
(gdb)
$2 = 440
(gdb)
I am not even sure whether this particular instance of TB was running under valgrind. I believe it was. But TB now seems to start a few processes, and if this particular TB instance is a child of the original process, it was not running under valgrind, since I forgot to pass valgrind the flag that traces child processes.
Anyway, I am keeping the crashed process attached in gdb so that if anyone has a suggestion about where to look, I can inspect it.
Or I may try to run this again with a clean slate and make sure TB runs under valgrind.
Anyway, it looks like there IS a piece of code that does not release the list structure properly.
Oh, I forgot to mention that I am using a locally created DEBUG build; that is why I saw this crash.
I don't believe I regularly see such a "list not released" message when using mbox.
(But maybe I should test TB in a similar manner with an mbox message folder.)
The above is what I found in a very short testing session.
Comment 28•3 years ago
One other observation.
When one uses the maildir format, does IncorporateMessage have to call
|nsMsgLocalMailFolder::GetDatabaseWOReparse|
8, 12, or 13 times?
(I mean not just a few times: from what I observe in my local verbose dump, this function gets called many times between the call to IncorporateMessage and its return.)
(Does IncorporateMessage try to batch the receiving and incorporating of e-mails, handling a few e-mails at a time, in the case of |maildir| support?)
Again, I am not sure if I saw this with mbox format folders. But until the current pending TB instance (and possibly valgrind, too) is purged from memory (I am keeping the process image live under gdb just in case somebody wants to look at a particular data structure from gdb), I cannot easily test other scenarios.
Comment 29•3 years ago
I observe
- that I could quit TB successfully after copying repeated messages up to 1000+ and 2000+
- that somehow TB could not receive more than 980 e-mail messages. Why? Could there have been some throttling in the interaction between the sending side and the local mail server, and a race when TB tried to access the mail system? Very unlikely.
- that, at least, TB failed to cleanly release some data structure, as shown by the crash message from the list destructor.
This resulted in a crash caused by MOZ_REALLY_CRASH().
Another thing: I failed to show the tail end of the gdb backtrace, so here it is.
It is not terribly useful IMHO, but here it goes.
#6 0x00007f290d174200 in <signal handler called> () at /lib/x86_64-linux-gnu/libpthread.so.0
#7 MOZ_Crash
(aReason=0x55c9973b0e80 <sPrintfCrashReason> "mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction", aLine=440, aFilename=0x7f2906ef2370 "/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h") at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Assertions.h:261
#8 mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>)
at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
--Type <RET> for more, q to quit, c to continue without paging--c
#9 mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>) at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:437
#10 0x00007f290cda9f67 in __run_exit_handlers (status=0, listp=0x7f290cf28738 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#11 0x00007f290cdaa10a in __GI_exit (status=<optimized out>) at exit.c:139
#12 0x00007f290cd927f4 in __libc_start_main (main=0x55c99730ae90 <main(int, char**, char**)>, argc=2, argv=0x7ffe9c2197a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe9c219798) at ../csu/libc-start.c:366
#13 0x000055c99730b2fa in _start ()
(gdb)
Comment 30•3 years ago
I don't know if the incomplete release of the elements in the linked list has anything to do with the cyclic nature of the linked list, which
I found out about in a gdb session.
That is, I have found that the list stored in gSHistoryList is cyclic. (Is this variable related to web browser session history? I
wonder where this sneaked in during TB interaction. Maybe the message pane display, especially the message text window, is related to the web
browser display?)
Anyway, the following are the GDB commands with which I figured out that gSHistoryList may point to a circular linked list.
Whether the cyclic nature of the list has anything to do with the unreleased elements reported by the ASSERT is not clear to me.
Also, the ASSERT problem may be orthogonal to the original memory issue in |maildir| folders.
If anyone wants me to play with the GDB session (which I am keeping at this moment), let me know.
I may need to terminate it in another day or two.
GDB session explanation.
I do not understand very well how the class inheritance of LinkedList and LinkedListElement, and their instantiation with the type nsSHistory, translate into memory layout. So this was trial and error.
I started with |gSHistoryList| because I noticed that it is one of the persistent variables that seem to hold lists of the type reported in the ASSERT error.
Then I found that it seems to point to a circular list. I wonder if this circular property is intended or not, and if it has anything to do with the unreleased elements that caused the ASSERT.
(gdb) print gSHistoryList
$13 = mozilla::LinkedList<nsSHistory> = {0x55c99d00b170, 0x55c99e1ab1b0}
(gdb) print /x (int) gSHistoryList
$14 = 0x9d00b178
(gdb) print /x (long) gSHistoryList <--- the first pointer in gSHistoryList
$15 = 0x55c99d00b178 <--- is this value, it seems.
(gdb) print * (LinkedListElement *) 0x55c99d00b170
No symbol "LinkedListElement" in current context.
I needed to specify the type specification as follows.
But I don't think I am looking at the right pointer because
mIsSentinel contains a strange value.
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b170
$16 = {mNext = 0x7f290a2b1320 <vtable for nsSHistory+16>, mPrev = 0x55c99e1ab1b8, mIsSentinel = 192}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b1b0
$17 = {mNext = 0x55c99c0f2ea0, mPrev = 0xffffff00, mIsSentinel = 96}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b0
$18 = {mNext = 0x7f290a2b1320 <vtable for nsSHistory+16>, mPrev = 0x7f290a785fc0 <gSHistoryList>,
mIsSentinel = 120}
The next one seems to point to the correct data type since mIsSentinel
is printed as false. So from there I tried to print the object in
mNext and mPrev.
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b8
$19 = {mNext = 0x7f290a785fc0 <gSHistoryList>, mPrev = 0x55c99d00b178, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b178
$20 = {mNext = 0x55c99e1ab1b8, mPrev = 0x7f290a785fc0 <gSHistoryList>, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x7f290a785fc0
$21 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
(gdb) print (LinkedListElement<nsSHistory>) gSHistoryList
$22 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
See? It seems there is a cyclic list.
gSHistoryList -> 0x55c99d00b178 -> 0x55c99e1ab1b8 -> back to gSHistoryList
Actually, I suspected this when I tried to execute the following
command and saw the output with repeated data. I had to quit the GDB output.
print (LinkedList<nsSHistory>) gSHistoryList
$23 = mozilla::LinkedList<nsSHistory> = {0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb)
Maybe I should file a separate bug for the assert.
Comment 31•3 years ago
Well, I found a similar shutdown crash bugzilla. So I added a comment there instead of creating a new bugzilla at this moment.
Bug 1745864 Opened 1 month ago
Hit MOZ_CRASH(mozilla::LinkedList<mozilla::dom::ContentParent>::~LinkedList() [T = mozilla::dom::ContentParent] has a buggy user: it should have removed all this list's elements before the list's destruction).
comment 11 is my addition.
https://bugzilla.mozilla.org/show_bug.cgi?id=1745864#c1
Comment 32•3 years ago
Hi Chiaki, the cyclic reference is wanted by design. If you look at:
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b8
$19 = {mNext = 0x7f290a785fc0 <gSHistoryList>, mPrev = 0x55c99d00b178, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b178
$20 = {mNext = 0x55c99e1ab1b8, mPrev = 0x7f290a785fc0 <gSHistoryList>, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x7f290a785fc0
$21 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
(gdb) print (LinkedListElement<nsSHistory>) gSHistoryList
$22 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
you will find that the element at 0x7f290a785fc0 has the mIsSentinel flag set to true. This element is always present (in fact it is the sentinel member variable of gSHistoryList) and closes the cycle. It cannot be removed, and when it is the only element in the list, the list is considered to be empty.
If that gdb output was captured at the time of the assertion, it just means that the list was not empty and that the two elements with mIsSentinel = false have not been removed. If you are able to reproduce the issue, you might have some luck with rr for debugging this.
The issue in the other bug 1745864 is similar but on a different list, so the underlying reasons why the elements have not been removed here and there are most probably different. You might want to look out for RefPtr<nsSHistory> that are never cleared.
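The sentinel arrangement described above can be illustrated with a minimal sketch. This is not the real mozilla::LinkedList code: Node and List are hypothetical stand-ins that only reuse the field names seen in the gdb output (mNext, mPrev, mIsSentinel), and the method names are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a list element; NOT the real mozilla class.
struct Node {
  Node* mNext;
  Node* mPrev;
  bool mIsSentinel;

  // A freshly created node points at itself, i.e. it is in no list.
  explicit Node(bool aIsSentinel = false)
      : mNext(this), mPrev(this), mIsSentinel(aIsSentinel) {}

  // Unlink this node from the list it is currently in.
  void remove() {
    mPrev->mNext = mNext;
    mNext->mPrev = mPrev;
    mNext = this;
    mPrev = this;
  }
};

// Circular doubly linked list whose cycle is closed by a sentinel node.
struct List {
  Node sentinel{true};

  // Insert aNode just before the sentinel, i.e. at the back of the list.
  void insertBack(Node* aNode) {
    aNode->mNext = &sentinel;
    aNode->mPrev = sentinel.mPrev;
    sentinel.mPrev->mNext = aNode;
    sentinel.mPrev = aNode;
  }

  // The list is empty when the sentinel is the only node in the cycle.
  bool isEmpty() const { return sentinel.mNext == &sentinel; }

  // Count the non-sentinel nodes by walking until the sentinel is reached.
  std::size_t length() const {
    std::size_t n = 0;
    for (const Node* p = sentinel.mNext; !p->mIsSentinel; p = p->mNext) {
      ++n;
    }
    return n;
  }
};
```

With two nodes inserted, the sentinel's mNext and mPrev point at the first and last real element, which matches the shape of the three gdb records above. A destructor-time check like the one that fired would then amount to requiring isEmpty() to be true.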
Comment 33•3 years ago
(In reply to Jens Stutte [:jstutte] from comment #32)
Thank you for the comment.
Hi Chiaki, the cyclic reference is wanted by design. If you look at:
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b8
$19 = {mNext = 0x7f290a785fc0 <gSHistoryList>, mPrev = 0x55c99d00b178, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b178
$20 = {mNext = 0x55c99e1ab1b8, mPrev = 0x7f290a785fc0 <gSHistoryList>, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x7f290a785fc0
$21 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
(gdb) print (LinkedListElement<nsSHistory>) gSHistoryList
$22 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
you will find that the 0x7f290a785fc0 has the mIsSentinel flag set to true. This element is always present (in fact it is the sentinel member variable of gSHistoryList) and closes the cycle. It cannot be removed and when it is the only element in the list, the list is considered to be empty.
I see. This is how gSHistoryList was designed to behave.
If that output of gdb is shown at the time of the assertion, it just means that the list was not empty and that the two elements with mIsSentinel = false have not been removed. If you are able to reproduce the issue you might have some luck with rr for debugging this.
The gdb output is indeed from the crashing TB, which was stopped by the assert, so the list really was non-empty at TB shutdown.
That is why the assert fired.
The non-sentinel members have not been removed for some reason.
I might retry the scenario of copying a large number of messages from one folder to another to see whether this happens again.
The issue in the other bug 1745864 is similar but on a different list, so the underlying reasons why the elements have not been removed here and there are most probably different. You might want to look out for RefPtr<nsSHistory> that are never cleared.
I will investigate this "look out for RefPtr<nsSHistory> that are never cleared" hint a bit more and probably file a separate bug based on the findings. Then I will terminate the TB process I have kept alive, along with the gdb process attached to it, and start over.
https://searchfox.org/mozilla-central/search?q=RefPtr%3CnsSHistory%3E&path=
shows the following.
Textual Occurrences
docshell/base/CanonicalBrowsingContext.h
495 RefPtr<nsSHistory> mSessionHistory;
docshell/shistory/nsSHEntryShared.cpp
111 RefPtr<nsSHistory> nsshistory = static_cast<nsSHistory*>(shistory.get());
docshell/shistory/nsSHistory.cpp
181 RefPtr<nsSHistory> mSHistory;
1580 RefPtr<nsSHistory> mSHistory;
docshell/shistory/nsSHistory.h
329 RefPtr<nsSHistory> mSHistory;
These all turn out to be class members.
It is a bit awkward to figure out WHICH globally persistent variables store these class member variables (= fields) in the runtime memory snapshot.
|gSHistoryList| was easy to spot since it is, after all, a variable with file-wide scope.
Hmm...
I tried to see, for example, how the "495 RefPtr<nsSHistory> mSessionHistory;" is used, and found this.
https://searchfox.org/mozilla-central/search?q=symbol:F_%3CT_mozilla%3A%3Adom%3A%3ACanonicalBrowsingContext%3E_mSessionHistory&redirect=false
Uses
docshell/base/CanonicalBrowsingContext.cpp
149 if (mSessionHistory) { // found in mozilla::dom::CanonicalBrowsingContext::~CanonicalBrowsingContext
150 mSessionHistory->SetBrowsingContext(nullptr); // found in mozilla::dom::CanonicalBrowsingContext::~CanonicalBrowsingContext
288 MOZ_ASSERT(!aNewContext->mSessionHistory); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
331 if (mSessionHistory) { // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
332 mSessionHistory->SetBrowsingContext(aNewContext); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
336 mSessionHistory->SetEpoch(0, Nothing()); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
337 mSessionHistory.swap(aNewContext->mSessionHistory); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
457 if (!mSessionHistory && GetChildSessionHistory()) { // found in mozilla::dom::CanonicalBrowsingContext::GetSessionHistory
458 mSessionHistory = new nsSHistory(this); // found in mozilla::dom::CanonicalBrowsingContext::GetSessionHistory
461 return mSessionHistory; // found in mozilla::dom::CanonicalBrowsingContext::GetSessionHistory
2799 if (tmp->mSessionHistory) { // found in mozilla::dom::CanonicalBrowsingContext::cycleCollection::Unlink
2800 tmp->mSessionHistory->SetBrowsingContext(nullptr); // found in mozilla::dom::CanonicalBrowsingContext::cycleCollection::Unlink
* 2802 NS_IMPL_CYCLE_COLLECTION_UNLINK(mSessionHistory, mContainerFeaturePolicy, // found in mozilla::dom::CanonicalBrowsingContext::cycleCollection::Unlink
* 2809 NS_IMPL_CYCLE_COLLECTION_TRAVERSE(mSessionHistory, mContainerFeaturePolicy, // found in mozilla::dom::CanonicalBrowsingContext::cycleCollection::TraverseNative
I found NS_IMPL_CYCLE_COLLECTION_UNLINK/TRAVERSE used only for this particular RefPtr<nsSHistory> field; the other fields listed in
https://searchfox.org/mozilla-central/search?q=RefPtr%3CnsSHistory%3E&path=
did not seem to use these macros. I wonder if this could be why the elements are not removed properly.
Or, for that matter, the mSHistory in
docshell/shistory/nsSHistory.cpp
181 RefPtr<nsSHistory> mSHistory;
is set (copied upon creation), but it does not seem to be explicitly
removed(?).
See https://searchfox.org/mozilla-central/source/docshell/shistory/nsSHistory.cpp#181
class MOZ_STACK_CLASS SHistoryChangeNotifier {
 public:
  explicit SHistoryChangeNotifier(nsSHistory* aHistory) {
    // If we're already in an update, the outermost change notifier will
    // update browsing context in the destructor.
    if (!aHistory->HasOngoingUpdate()) {
      aHistory->SetHasOngoingUpdate(true);
      mSHistory = aHistory;
    }
  }
  ~SHistoryChangeNotifier() {
    if (mSHistory) {
      MOZ_ASSERT(mSHistory->HasOngoingUpdate());
      mSHistory->SetHasOngoingUpdate(false);
      if (mozilla::SessionHistoryInParent() &&
          mSHistory->GetBrowsingContext()) {
        mSHistory->GetBrowsingContext()
            ->Canonical()
            ->HistoryCommitIndexAndLength();
      }
    }
    // <----- ??? if mSHistory was not null, should we not clear it after the above is done?
  }
  RefPtr<nsSHistory> mSHistory;
};
I don't know the code at all, but from a cursory look I wonder: if mSHistory was not null, should we not clear it at the end of the destructor?
I have not checked all the RefPtr<nsSHistory> fields, but it seems there may indeed be some coding issues.
This shutdown-time assertion issue seems to run very deep. It may or may not be related to the original bug symptom, because I found that the session history is kept to a reasonably small number of entries; it seems to be trimmed if the list grows beyond the maximum limit.
Another possibility is that explicit removal is not handled properly. But my impression is that the implicit removal at class destruction, etc. is what is not handled properly in the code.
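As a note on the destructor question above: in standard C++, the non-static data members of an object are destroyed automatically once the destructor body has finished, so a smart-pointer member releases its reference even without an explicit clear. The sketch below illustrates only that general language rule. It uses std::shared_ptr as a stand-in for RefPtr, and History, Notifier, Counts and measure() are hypothetical names, so it says nothing about whether the real nsSHistory holders are all released.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the reference-counted history object.
struct History {};

// Stack-only helper holding a strong reference in a smart-pointer member.
class Notifier {
 public:
  explicit Notifier(std::shared_ptr<History> aHistory)
      : mHistory(std::move(aHistory)) {}

  ~Notifier() {
    // Deliberately no mHistory.reset() here: member destructors run
    // automatically after this body and drop the reference.
  }

 private:
  std::shared_ptr<History> mHistory;
};

// Strong reference counts observed while a Notifier exists and afterwards.
struct Counts {
  long during;
  long after;
};

inline Counts measure() {
  auto history = std::make_shared<History>();
  Counts c{};
  {
    Notifier n(history);              // copies the pointer: count goes to 2
    c.during = history.use_count();
  }                                   // n destroyed: count drops back to 1
  c.after = history.use_count();
  return c;
}
```

If the same rule applies to the RefPtr member in SHistoryChangeNotifier, the missing explicit clear would not by itself keep an nsSHistory alive, and the leftover list elements would have to be held by some longer-lived owner.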
I think I will stop the current gdb session and try to check the ORIGINAL issue again.
And I will see if the shutdown-time assert is triggered again. In any case, I think I had better file a separate bug for the
shutdown-time assert of "Hit MOZ_CRASH(mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] "
Thank you again.
Comment 34•3 years ago
OK, I restarted my locally built DEBUG version of TB under Linux.
As soon as I quit the previous TB instance, the mail lock file was gone.
So the new TB instance could read the remaining e-mails (of 2000 e-mails, only the first 980 were read in the previous TB run; I still don't know why).
Next, I tried copying 2000+ and 3000+ e-mails using maildir under valgrind in separate sessions.
(I have 16GB of memory assigned to my Linux image in VirtualBox.)
I could no longer reproduce the shutdown crash caused by that particular MOZ_ASSERT (comment 27).
That assert crash probably needs a separate bug.
But then in one of the runs, I got the following shutdown-time error.
(Other than that, I did not see memcheck errors that seem to be directly related to maildir.)
By the way, the valgrind run used these options and environment variables:
env MOZ_FAKE_NO_SANDBOX=yes MOZ_FAKE_NO_SECCOMP_TSYNC=yes MOZ_DISABLE_CONTENT_SANDBOX=yes MOZ_DISABLE_GMP_SANDBOX=yes MOZ_ASSUME_USER_NS=0 valgrind --trace-children=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3 --max-threads=5000 --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no /KERNEL-SRC/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -p
The new crash I got was as follows.
Note the 0xC3C3C3...C3 value. That is a read of an already freed memory location (I passed "--free-fill=0xC3" to valgrind).
Something is really screwed up at TB termination time. I am not sure whether that is |maildir| specific or not. (It takes time to check the operation of copying a few thousand messages under valgrind. I cannot test the mbox case on the same day in my spare time.)
... toward the shutdown ...
Failed to load file:///NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/base/content/mailCore.js
[Parent 68965, Main Thread] WARNING: 'aOwner->IsDiscarded()', file /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/dom/SyncedContextInlines.h:94
[Parent 68965, Main Thread] WARNING: 'aOwner->IsDiscarded()', file /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/dom/SyncedContextInlines.h:94
==68965== Thread 1:
==68965== Invalid read of size 8
==68965== at 0xCB168F3: mozilla::detail::RunnableFunction<nsXULPopupManager::ShowMenu(nsIContent*, bool, bool)::{lambda()#1}>::Run() (nsCOMPtr.h:855)
==68965== by 0x927FF15: mozilla::RunnableTask::Run() (TaskController.cpp:468)
==68965== by 0x927EC77: mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:771)
==68965== by 0x927F40A: mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:607)
==68965== by 0x927F71B: mozilla::TaskController::ProcessPendingMTTask(bool) (TaskController.cpp:391)
==68965== by 0x927F7EA: mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::{lambda()#1}>::Run() (TaskController.cpp:124)
==68965== by 0x92804D1: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1195)
==68965== by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965== by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965== by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965== by 0xC4CDBC8: nsBaseAppShell::Run() (nsBaseAppShell.cpp:137)
==68965== by 0xDB778C9: nsAppStartup::Run() (nsAppStartup.cpp:295)
==68965== by 0xDC74F27: XREMain::XRE_mainRun() (nsAppRunner.cpp:5342)
==68965== by 0xDC76479: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5527)
==68965== by 0xDC76D98: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5586)
==68965== by 0x11EC73: do_main(int, char**, char**) (nsMailApp.cpp:229)
==68965== by 0x11DF02: main (nsMailApp.cpp:368)
==68965== Address 0x3a5856a0 is 6,944 bytes inside a block of size 8,192 free'd
==68965== at 0x483F74C: free (vg_replace_malloc.c:755)
==68965== by 0xC84F002: nsPresArena<8192ul, mozilla::ArenaObjectID, 163ul>::~nsPresArena() (ArenaAllocator.h:90)
==68965== by 0xC7DD61E: mozilla::PresShell::~PresShell() (PresShell.cpp:879)
==68965== by 0xC7DDC98: mozilla::PresShell::Release() (PresShell.cpp:877)
==68965== by 0xD7F3938: mozilla::AppWindow::RequestWindowClose(nsIWidget*) (RefPtr.h:50)
==68965== by 0xD7F3A8A: mozilla::AppWindow::WidgetListenerDelegate::RequestWindowClose(nsIWidget*) (AppWindow.cpp:3317)
==68965== by 0xC52D2ED: delete_event_cb(_GtkWidget*, _GdkEventAny*) (nsWindow.cpp:3914)
==68965== by 0x594CF93: ??? (in /usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2404.26)
==68965== by 0x6573908: ??? (in /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.7000.2)
==68965== by 0x658B63A: g_signal_emit_valist (in /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.7000.2)
==68965== by 0x658C4FE: g_signal_emit (in /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.7000.2)
==68965== by 0x58F6B93: ??? (in /usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2404.26)
==68965== by 0x57AD362: gtk_main_do_event (in /usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2404.26)
==68965== by 0x5DC86A4: ??? (in /usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2404.26)
==68965== by 0x5DFBD71: ??? (in /usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2404.26)
==68965== by 0x660CCDA: g_main_context_dispatch (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7000.2)
==68965== by 0x660CF87: ??? (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7000.2)
==68965== by 0x660D03E: g_main_context_iteration (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7000.2)
==68965== by 0xC567DE3: nsAppShell::ProcessNextNativeEvent(bool) (nsAppShell.cpp:352)
==68965== by 0xC4D79F6: nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) (nsBaseAppShell.cpp:120)
==68965== by 0x92803B1: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1111)
==68965== by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965== by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965== by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965== by 0xC4CDBC8: nsBaseAppShell::Run() (nsBaseAppShell.cpp:137)
==68965== by 0xDB778C9: nsAppStartup::Run() (nsAppStartup.cpp:295)
==68965== by 0xDC74F27: XREMain::XRE_mainRun() (nsAppRunner.cpp:5342)
==68965== by 0xDC76479: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5527)
==68965== by 0xDC76D98: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5586)
==68965== by 0x11EC73: do_main(int, char**, char**) (nsMailApp.cpp:229)
==68965== by 0x11DF02: main (nsMailApp.cpp:368)
==68965== Block was alloc'd at
==68965== at 0x483CF9B: malloc (vg_replace_malloc.c:380)
==68965== by 0xC869E30: nsPresArena<8192ul, mozilla::ArenaObjectID, 163ul>::Allocate(mozilla::ArenaObjectID, unsigned long) (ArenaAllocator.h:170)
==68965== by 0xC8AF869: NS_NewBlockFrame(mozilla::PresShell*, mozilla::ComputedStyle*) (PresShell.h:280)
==68965== by 0xC82E8E7: nsCSSFrameConstructor::ConstructNonScrollableBlockWithConstructor(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItem&, nsContainerFrame*, nsStyleDisplay const*, nsFrameList&, nsBlockFrame* (*)(mozilla::PresShell*, mozilla::ComputedStyle*)) (nsCSSFrameConstructor.cpp:4620)
==68965== by 0xC82EAD4: nsCSSFrameConstructor::ConstructNonScrollableBlock(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItem&, nsContainerFrame*, nsStyleDisplay const*, nsFrameList&) (nsCSSFrameConstructor.cpp:4593)
==68965== by 0xC828277: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3692)
==68965== by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965== by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965== by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965== by 0xC828D10: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3832)
==68965== by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965== by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965== by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965== by 0xC828D10: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3832)
==68965== by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965== by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965== by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965== by 0xC828D10: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3832)
==68965== by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965== by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965== by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965== by 0xC82E348: nsCSSFrameConstructor::ConstructBlock(nsFrameConstructorState&, nsIContent*, nsContainerFrame*, nsContainerFrame*, mozilla::ComputedStyle*, nsContainerFrame**, nsFrameList&, nsIFrame*) (nsCSSFrameConstructor.cpp:10570)
==68965== by 0xC82F3A0: nsCSSFrameConstructor::ConstructDocElementFrame(mozilla::dom::Element*) (nsCSSFrameConstructor.cpp:2439)
==68965== by 0xC833705: nsCSSFrameConstructor::ContentRangeInserted(nsIContent*, nsIContent*, nsCSSFrameConstructor::InsertionKind) (nsCSSFrameConstructor.cpp:6956)
==68965== by 0xC7E0DD1: mozilla::PresShell::Initialize() [clone .part.0] (PresShell.cpp:1853)
==68965== by 0xBF43EDF: mozilla::dom::PrototypeDocumentContentSink::StartLayout() (PrototypeDocumentContentSink.cpp:700)
==68965== by 0xBF440FF: mozilla::dom::PrototypeDocumentContentSink::DoneWalking() (PrototypeDocumentContentSink.cpp:669)
==68965== by 0xC48A9A4: mozilla::dom::DocumentL10n::InitialTranslationCompleted(bool) [clone .part.0] (DocumentL10n.cpp:321)
==68965== by 0xC48B197: L10nReadyHandler::ResolvedCallback(JSContext*, JS::Handle<JS::Value>) (DocumentL10n.cpp:304)
==68965== by 0xC1A85D8: mozilla::dom::(anonymous namespace)::PromiseNativeHandlerShim::ResolvedCallback(JSContext*, JS::Handle<JS::Value>) (Promise.cpp:385)
==68965== by 0xC1ABC4D: mozilla::dom::NativeHandlerCallback(JSContext*, unsigned int, JS::Value*) (Promise.cpp:338)
==68965== by 0xE1AA81E: CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) (Interpreter.cpp:425)
==68965== by 0xE1BFBAD: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) (Interpreter.cpp:512)
==68965== by 0xE1C0097: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) (Interpreter.cpp:589)
==68965== by 0xE20BE58: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::MutableHandle<JS::Value>) (Interpreter.h:106)
==68965== by 0xE3873E4: PromiseReactionJob(JSContext*, unsigned int, JS::Value*) (Promise.cpp:2067)
==68965== by 0xE1AA81E: CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) (Interpreter.cpp:425)
==68965== by 0xE1BFBAD: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) (Interpreter.cpp:512)
==68965== by 0xE1C0097: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) (Interpreter.cpp:589)
==68965== by 0xE2D4E84: JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) (CallAndConstruct.cpp:117)
==68965== by 0xA8EFC7E: mozilla::dom::VoidFunction::Call(mozilla::dom::BindingCallContext&, JS::Handle<JS::Value>, mozilla::ErrorResult&) (CustomElementRegistryBinding.cpp:503)
==68965== by 0x9166722: mozilla::dom::PromiseJobCallback::Call(mozilla::ErrorResult&, char const*, mozilla::dom::CallbackObject::ExceptionHandling, JS::Realm*) (PromiseBinding.h:89)
==68965== by 0x916697F: mozilla::PromiseJobRunnable::Run(mozilla::AutoSlowOperation&) (PromiseBinding.h:102)
==68965== by 0x917C235: mozilla::CycleCollectedJSContext::PerformMicroTaskCheckPoint(bool) (CycleCollectedJSContext.cpp:674)
==68965== by 0x917CA31: mozilla::CycleCollectedJSContext::AfterProcessTask(unsigned int) (CycleCollectedJSContext.cpp:463)
==68965== by 0x9E68473: XPCJSContext::AfterProcessTask(unsigned int) (XPCJSContext.cpp:1424)
==68965== by 0x92805DC: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1232)
==68965== by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965== by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965== by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965==
{
<insert_a_suppression_name_here>
Memcheck:Addr8
fun:_ZN7mozilla6detail16RunnableFunctionIZN17nsXULPopupManager8ShowMenuEP10nsIContentbbEUlvE_E3RunEv
fun:_ZN7mozilla12RunnableTask3RunEv
fun:_ZN7mozilla14TaskController39DoExecuteNextTaskOnlyMainThreadInternalERKNS_6detail12BaseAutoLockIRNS_5MutexEEE
fun:_ZN7mozilla14TaskController37ExecuteNextTaskOnlyMainThreadInternalERKNS_6detail12BaseAutoLockIRNS_5MutexEEE
fun:_ZN7mozilla14TaskController20ProcessPendingMTTaskEb
fun:_ZN7mozilla6detail16RunnableFunctionIZNS_14TaskController18InitializeInternalEvEUlvE_E3RunEv
fun:_ZN8nsThread16ProcessNextEventEbPb
fun:_Z19NS_ProcessNextEventP9nsIThreadb
fun:_ZN7mozilla3ipc11MessagePump3RunEPN4base11MessagePump8DelegateE
fun:_ZN11MessageLoop3RunEv
fun:_ZN14nsBaseAppShell3RunEv
fun:_ZN12nsAppStartup3RunEv
fun:_ZN7XREMain11XRE_mainRunEv
fun:_ZN7XREMain8XRE_mainEiPPcRKN7mozilla15BootstrapConfigE
fun:_Z8XRE_mainiPPcRKN7mozilla15BootstrapConfigE
fun:_ZL7do_mainiPPcS0_
fun:main
}
==68965== Invalid read of size 8
==68965== at 0xCB16900: mozilla::detail::RunnableFunction<nsXULPopupManager::ShowMenu(nsIContent*, bool, bool)::{lambda()#1}>::Run() (RefPtr.h:49)
==68965== by 0x927FF15: mozilla::RunnableTask::Run() (TaskController.cpp:468)
==68965== by 0x927EC77: mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:771)
==68965== by 0x927F40A: mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:607)
==68965== by 0x927F71B: mozilla::TaskController::ProcessPendingMTTask(bool) (TaskController.cpp:391)
==68965== by 0x927F7EA: mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::{lambda()#1}>::Run() (TaskController.cpp:124)
==68965== by 0x92804D1: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1195)
==68965== by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965== by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965== by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965== by 0xC4CDBC8: nsBaseAppShell::Run() (nsBaseAppShell.cpp:137)
==68965== by 0xDB778C9: nsAppStartup::Run() (nsAppStartup.cpp:295)
==68965== by 0xDC74F27: XREMain::XRE_mainRun() (nsAppRunner.cpp:5342)
==68965== by 0xDC76479: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5527)
==68965== by 0xDC76D98: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5586)
==68965== by 0x11EC73: do_main(int, char**, char**) (nsMailApp.cpp:229)
==68965== by 0x11DF02: main (nsMailApp.cpp:368)
==68965== Address 0xc3c3c3c3c3c3c3c3 is not stack'd, malloc'd or (recently) free'd
==68965==
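To show why the address 0xc3c3c3c3c3c3c3c3 in the second report points at freed memory, here is a minimal sketch of the effect of a fill-on-free option such as the --free-fill=0xC3 used in this run. The buffer is never really freed, so the sketch is safe to run; fillFreed and readPointerSizedField are hypothetical helper names, not valgrind or Mozilla APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Mimic what a fill-on-free debugging option does to a released block:
// overwrite every byte with the same marker value.
inline void fillFreed(void* aBlock, std::size_t aSize, unsigned char aFill) {
  std::memset(aBlock, aFill, aSize);
}

// Read a pointer-sized value out of a block, the way stale code would when
// it loads a pointer field from memory that has already been released.
inline std::uint64_t readPointerSizedField(const void* aBlock,
                                           std::size_t aOffset) {
  std::uint64_t value;
  std::memcpy(&value, static_cast<const unsigned char*>(aBlock) + aOffset,
              sizeof(value));
  return value;
}
```

Any 8-byte field read from a block filled this way comes back as 0xC3C3C3C3C3C3C3C3. That fits the sequence in the log: the first invalid read loads a pointer from the freed nsPresArena block, and the second invalid read dereferences that marker value as an address.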
Comment 35•3 years ago
Oh, one other thing. The crash under valgrind may be caused by a timing race.
There are places where the timing of events is not strictly controlled by the program flow.
Sometimes a group of events can occur in any order.
But sometimes a group of events has to occur in a certain sequence.
In those cases, programmers need to interlock the operations with suitable synchronization primitives.
The problem is that some programmers simply "assume" that some operations finish
before others because that has always been the case before.
Unfortunately, valgrind/memcheck skews the execution speed of the program so much that such an "assumption" no longer holds.
So the error I see, where 0xc3c3c3...c3 is read as an address (possibly of code that was to be executed at termination),
may have been caused by such mis-ordering due to the lack of proper synchronization. Just a theory, but a very plausible one.
I am saying this because shortly after TB starts under valgrind (I say "shortly", but it takes 20-30 seconds after invocation before the messages appear; valgrind/memcheck is sloooow),
I see the following messages.
==68965== Memcheck, a memory error detector
==68965== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==68965== Using Valgrind-3.18.0.GIT and LibVEX; rerun with -h for copyright info
==68965== Command: /KERNEL-SRC/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -p
==68965==
==68965== Warning: set address range perms: large range [0x7a756a20000, 0x7a7d6620000) (noaccess)
[Parent 68965, Main Thread] WARNING: dependent window created without a parent: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/components/startup/nsAppStartup.cpp:744
==68965== Warning: set address range perms: large range [0x59c8a000, 0x459c8a000) (noaccess)
==68965== Warning: set address range perms: large range [0x59c8a000, 0x459c8a000) (noaccess)
[Parent 68965, Main Thread] WARNING: NS_ENSURE_TRUE(rootFrame) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/base/nsGlobalWindowOuter.cpp:4235
[Parent 68965, Main Thread] WARNING: NS_ENSURE_TRUE(root) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/layout/base/nsDocumentViewer.cpp:2619
Warning: asking to enable_gpu_markers but no supporting extension was found
I suspect that the main code of TB NEEDS to wait for a window or something similar to appear BEFORE proceeding, but
under normal circumstances the window is ready before TB needs it, so all is well. Under valgrind, however, creating the required window or whatever it is may be too slow. The messages are very disturbing. It is a bit surprising that TB works at all under this condition.
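The kind of explicit interlock meant in this comment can be sketched in generic C++ as follows. This is only an illustration of enforcing an ordering with a condition variable instead of relying on relative speed. It is not how Thunderbird schedules its main-thread tasks, and runOrdered, windowReady and the step names are hypothetical.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

// Runs a "create" step and a "use" step on separate threads and returns
// the order in which they actually executed.
inline std::vector<std::string> runOrdered() {
  std::mutex m;
  std::condition_variable cv;
  bool windowReady = false;  // hypothetical "resource is ready" flag
  std::vector<std::string> order;

  // The consumer starts first but blocks until the flag is set, so it
  // cannot run ahead no matter how slowly the producer executes.
  std::thread consumer([&] {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [&] { return windowReady; });
    order.push_back("use window");
  });

  std::thread producer([&] {
    {
      std::lock_guard<std::mutex> lock(m);
      order.push_back("create window");
      windowReady = true;
    }
    cv.notify_one();
  });

  producer.join();
  consumer.join();
  return order;
}
```

Because the consumer waits on the flag rather than on timing, the result is always "create window" followed by "use window", even under a tool that slows execution as much as memcheck does. With GCC or Clang on Linux the sketch may need to be built with -pthread.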
Comment 36•3 years ago
Oops, the above messages appear even without valgrind (!). Something is fishy with TB main code.
/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -p
[Parent 69547, Main Thread] WARNING: dependent window created without a parent: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/components/startup/nsAppStartup.cpp:744
... some locally added dumps omitted ...
[Parent 69547, Main Thread] WARNING: NS_ENSURE_TRUE(rootFrame) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/base/nsGlobalWindowOuter.cpp:4235
[Parent 69547, Main Thread] WARNING: NS_ENSURE_TRUE(root) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/layout/base/nsDocumentViewer.cpp:2619
Warning: asking to enable_gpu_markers but no supporting extension was found
[Parent 69547, Compositor] WARNING: Possibly dropping task posted to updater thread: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/gfx/layers/apz/src/APZUpdater.cpp:362
Comment 37•3 years ago
I am afraid that testing TB's maildir operation under valgrind triggers more errors than the original problem, and I am drifting away from the original problem.
I wanted to see whether I could print the memory summary at the end if TB terminates successfully after copying 1000+ messages. With that memory summary, I could say with some confidence whether there are unfreed data structures, etc.
Well, I got hit with another instance of the assert mentioned in comment 27.
I am not sure when this bug crept in, but unless it is taken care of, the DEBUG version of TB cannot print out a meaningful memory summary (and I think this assert is triggered only in DEBUG builds).
[Parent 69547, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsTraceRefcnt.cpp:204
[Parent 69547, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsTraceRefcnt.cpp:204
[Parent 69547, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsTraceRefcnt.cpp:204
Hit MOZ_CRASH(mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction) at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5e9]
#02: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#03: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#04: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#05: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin +0x162fa]
#06: ??? (???:???)
Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin (pid = 69547) received signal 11.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x6604035]
#02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x13200]
#03: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5f3]
#04: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#05: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#06: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#07: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin +0x162fa]
#08: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin 69547' to attach your debugger to this thread.
RunWatchdog: Mainthread nested event loops during hang:
--- (no nested event loop active)
Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/components/terminator/nsTerminator.cpp:256
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x65f654e]
#02: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libnspr4.so +0x2ee4d]
#03: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x8d80]
#04: clone[/lib/x86_64-linux-gnu/libc.so.6 +0xfcb6f]
#05: ??? (???:???)
Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin (pid = 69547) received signal 11.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x6604035]
#02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x13200]
#03: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x65f655f]
#04: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libnspr4.so +0x2ee4d]
#05: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x8d80]
#06: clone[/lib/x86_64-linux-gnu/libc.so.6 +0xfcb6f]
#07: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin 69547' to attach your debugger to this thread.
Done sleeping...
ishikawa@ip030:/NREF-COMM-CENTRAL/work-dir$
I better file a bug for this shutdown-time assert and take care of that first. :-(
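When filing a bug for a log like the one above, it can help to reduce the raw frames to (module, offset) pairs that a symbolizer such as addr2line can resolve later. The following is a minimal sketch, assuming the frame format seen in this log (`#NN: symbol[module +0xoffset]`); the names `FRAME_RE` and `parse_frames` are made up for illustration and are not part of any Mozilla tool.

```python
import re

# Matches frames such as:
#   #01: ???[/path/to/libxul.so +0x616b5e9]
# Frames without a module, e.g. "#06: ??? (???:???)", do not match and are skipped.
FRAME_RE = re.compile(r"^#(\d+): (\S+?)\[(\S+) \+(0x[0-9a-fA-F]+)\]$")


def parse_frames(log_text):
    """Return a list of (frame_number, symbol, module_path, offset) tuples."""
    frames = []
    for line in log_text.splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            number, symbol, module, offset = match.groups()
            frames.append((int(number), symbol, module, int(offset, 16)))
    return frames
```

Each returned offset is relative to the start of the named module, so it can be passed, together with that module's path, to whichever symbolizer is available locally.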
Comment 38•2 years ago
|
||
(In reply to ISHIKAWA, Chiaki from comment #37)
I better file a bug for this shutdown-time assert and take care of that first. :-(
Did you file that bug?
Reporter | ||
Comment 39•2 years ago
|
||
(In reply to Wayne Mery (:wsmwk) from comment #38)
(In reply to ISHIKAWA, Chiaki from comment #37)
I better file a bug for this shutdown-time assert and take care of that first. :-(
Did you file that bug?
I suppose you want to request needinfo to Chiaki.
Comment 40•2 years ago
|
||
(In reply to Takanori MATSUURA from comment #39)
(In reply to Wayne Mery (:wsmwk) from comment #38)
(In reply to ISHIKAWA, Chiaki from comment #37)
I better file a bug for this shutdown-time assert and take care of that first. :-(
Did you file that bug?
I suppose you want to request needinfo to Chiaki.
Thank you, Matsuura-san, for redirecting this to me.
Before filing a new bug, I searched for a similar Bugzilla entry and found bug 1745864, so instead of creating a new one, I posted a comment to that bug.
https://bugzilla.mozilla.org/show_bug.cgi?id=1745864#c11
However, there seem to be a few different underlying causes.
https://bugzilla.mozilla.org/show_bug.cgi?id=1745864#c11
There is an independent bug filed about 6 months ago: bug 1755794.
Also see bug 1661862.
The underlying causes seem very elusive, and now, with the patch in bug 1661862, the problem may show up elsewhere.
I have not run mochitest under valgrind for a while. Maybe I should, but running the mochitest test suite under valgrind takes 20+ hours with heavy memory usage, so I am not tempted to run it often. :-(
That there are some tests that cause timeout errors (even in a regular run) is a big headache.
Even for tests that execute successfully, I need to set the timeout rather long so that the slowdown caused by valgrind does not produce a timeout for a successful test.
However, some failing tests (the ones that time out even during a normal run) then use up the long timeout limit in full, which is a real pain: a few of them add maybe 3-4 hours of inactivity and make the total execution time very long.
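The timeout trade-off described above can be put into rough numbers. This is only a back-of-the-envelope sketch: the function names are hypothetical, and the figures in the usage note (600 tests, 6 seconds each, a 180-second longest test) are made-up assumptions chosen to land near the 20-hour run and x20 slowdown mentioned in this bug, not measurements.

```python
def min_safe_timeout(native_secs, slowdown, margin=1.0):
    """Smallest per-test timeout that lets every passing test finish under the slowdown."""
    return max(native_secs) * slowdown * margin


def estimate_run_hours(native_secs, hanging_tests, slowdown, timeout_secs):
    """Estimate wall-clock hours for a whole test run under an instrumenting tool."""
    # Passing tests just run slower by the slowdown factor.
    passing_total = sum(t * slowdown for t in native_secs)
    # A hanging test never finishes, so it burns the entire timeout.
    hanging_total = hanging_tests * timeout_secs
    return (passing_total + hanging_total) / 3600.0
```

With 600 passing tests of 6 seconds each and a x20 slowdown, `estimate_run_hours([6.0] * 600, 0, 20, 3600)` gives 20 hours. If the longest passing test takes 180 seconds natively, the timeout has to be at least 3600 seconds, so four hanging tests add four more idle hours on top.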
Until a lot of people run the test suite under valgrind, this won't get much attention.
And running the test suite under valgrind is itself a challenge. I had to patch a few issues just to get valgrind to run successfully against TB.
Oh well. (I wonder how FF folks manage to run FF tests successfully under valgrind. I read somewhere that a valgrind run is executed maybe every month or so?)
Comment 41•2 years ago
|
||
(In reply to ISHIKAWA, Chiaki from comment #40)
...
Until a lot of people run the test suite under valgrind, this won't get much attention.
And running the test suite under valgrind is itself a challenge. I had to patch a few issues just to get valgrind to run successfully against TB.
Oh well. (I wonder how FF folks manage to run FF tests successfully under valgrind. I read somewhere that a valgrind run is executed maybe every month or so?)
So, 8 months means 8 valgrind runs. Can it now get the attention you mentioned?
Comment 42•2 years ago
|
||
(In reply to Worcester12345 from comment #41)
(In reply to ISHIKAWA, Chiaki from comment #40)
...
Until a lot of people run the test suite under valgrind, this won't get much attention.
And running the test suite under valgrind is itself a challenge. I had to patch a few issues just to get valgrind to run successfully against TB.
Oh well. (I wonder how FF folks manage to run FF tests successfully under valgrind. I read somewhere that a valgrind run is executed maybe every month or so?)
So, 8 months means 8 valgrind runs. Can it now get the attention you mentioned?
I have no idea. I wonder where the results of the valgrind runs of FF tests are stored on the try server.
As for my local tests, I have realized that maybe I should pick only the one particular test that is under focus as a candidate for a valgrind run.
Right now such a test is being planned (I don't have the time to do it) to check for any suspicious memory-related errors when TB is compiled using gcc-12. That is not directly related to this bug, unfortunately.
The symptom I reported in bug 1824691 could be related to gcc-12 miscompilation or something.
But then something rang a bell, and I recalled that, for a while, my local mochitest run of the debug version of C-C TB under valgrind produced very hard-to-diagnose memory errors in none other than the hashtable code, which appears in the stack trace of bug 1824691.
So it IS POSSIBLE that there is a real memory-related error that is triggered only by the particular way the code is compiled by a given compiler.
I would run the particular test mentioned in bug 1824691. Running a single test should not take long even under valgrind.
And if there is a suspect, I can look at the source code and/or invoke gdb to home in on suspicious values.
But again, even that will probably eat up half a day. :-(
Memory-related errors are really hard to debug.
These days, stress testing TB often uncovers ANOTHER fatal bug along the way. Debugging seems to be a never-ending story.
Comment 43•2 years ago
|
||
BTW, I would think someone in the embedded computer industry who needs a high level of security, such as Airbus, might want to create a version of an ARM CPU (or whatever) that implements valgrind-like checks in firmware/hardware to speed up valgrind-like testing significantly.
Instead of a x20 slowdown, we could live with a x4-x6 slowdown with such a CPU.
Just a thought.