1333342 - Crash [@ OOM | small ] during generating MSF file from many emails in Maildir format. Memory leak. @ OOM | small

Reporter

Description

•

8 years ago

str

How to reproduce:
1. Generate a folder like the one below with Explorer:
   C:\Users\<Windows user name>\AppData\Roaming\Thunderbird\Profiles\<Profile name>\Mail\Local Folders\foo\cur
2. Put many emails into the generated folder in Maildir format.
3. Run Thunderbird.
4. Select the folder to generate the MSF file.
5. Thunderbird crashes after several minutes.

Sometimes 1200 emails are OK, and at other times 1000 emails fail, so it depends on conditions.

Crash reports:
bp-ad27cdaa-770f-4806-b716-f54632170124
bp-c256ca32-4e45-441b-8a8f-78d072170124
bp-25dc17d4-da3b-4a04-ba4c-f781c2170124
bp-6b1067ad-c205-449d-8434-93eee2170124
bp-d8a14acd-51d4-4e21-a01c-93ca72170124
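For anyone trying to reproduce step 2 without a stock of real mail, here is a minimal Python sketch that fills a `cur` directory with small synthetic messages. The function name, the `example.invalid` addresses and the file naming scheme are hypothetical; the report does not say how the original files were produced or named, so treat this as one possible way to get a comparable message count, not as the reporter's method.

```python
import tempfile
from email.message import EmailMessage
from email.utils import formatdate, make_msgid
from pathlib import Path


def fill_maildir_cur(folder: Path, count: int) -> int:
    """Write `count` small synthetic messages into <folder>/cur.

    Returns the number of files written.
    """
    cur = folder / "cur"
    cur.mkdir(parents=True, exist_ok=True)
    for i in range(1, count + 1):
        msg = EmailMessage()
        msg["From"] = "sender@example.invalid"      # placeholder address
        msg["To"] = "recipient@example.invalid"     # placeholder address
        msg["Subject"] = f"test subject {i}"
        msg["Date"] = formatdate(localtime=False)
        msg["Message-ID"] = make_msgid(domain="example.invalid")
        msg.set_content(f"test body {i}\n")
        # Hypothetical file naming; the report does not say what names were used.
        (cur / f"{i:06d}.eml").write_bytes(msg.as_bytes())
    return count


if __name__ == "__main__":
    # Demo on a throwaway directory. For a real test, point this at the
    # "foo" folder under "Local Folders" while Thunderbird is closed.
    with tempfile.TemporaryDirectory() as tmp:
        written = fill_maildir_cur(Path(tmp) / "foo", 1200)
        print(written, "messages written")
```

Every generated message gets a distinct Message-ID and subject, which avoids the "same messages repeated" caveat that applies when a small set of mails is copied over and over.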

Wayne Mery (:wsmwk)

Comment 1

•

8 years ago

3 of the 4 are OOM | small. #3 Either there is a bad memory leak (not closing files?) or foul play.

Please start *Windows'* safe mode with networking enabled - win7 https://support.microsoft.com/en-us/help/17419/windows-7-advanced-startup-options-safe-mode#start-computer-safe-mode=windows-7
While still in Windows safe mode, start Thunderbird in safe mode - https://support.mozilla.org/kb/safe-mode-thunderbird

Does the problem go away?

Severity: normal → critical

Flags: needinfo?(t.matsuu)

Keywords: crash, mlk

Summary: Crash during enerating MSF file from many emails in Maildir format → Crash [@ OOM | small ] during generating MSF file from many emails in Maildir format. Memory leak?

Takanori MATSUURA

Reporter

Comment 2

•

8 years ago

I cannot change the boot option because the machine on which I run Thunderbird Nightly is not under my control for that setting.

When folders "foo" and "bar" each hold more than 1000 emails (and their MSF files were generated successfully), moving all emails from one folder to the other causes no problem.

Flags: needinfo?(t.matsuu)

Wayne Mery (:wsmwk)

Comment 3

•

8 years ago

You must have a very strong interest in maildir.

You have several major extensions installed. At least please try starting Thunderbird in safe mode. Please also use the current nightly 54.0a1. Thanks

Oh, and why are they making you run win7 in 32bit? That's pretty horrible.

Flags: needinfo?(t.matsuu)

Takanori MATSUURA

Reporter

Comment 4

•

7 years ago

(In reply to Wayne Mery (:wsmwk, NI for questions) from comment #3)
> You have several major extensions installed. At least please try starting
> Thunderbird in safe mode. Please also use the current nightly 54.0a1. Thanks

I have tested Thunderbird in safe mode with the latest comm-central build.
Build ID: 20170702030204
https://hg.mozilla.org/mozilla-central/rev/4d3de12dcdc539f14fcb06539da39fa7176c8955

It still reproduces.
bp-b4e3f908-722c-4430-8786-9aec50170703
bp-76fee5df-b57c-48e4-8314-d13a80170703
bp-1703c2b3-f18b-4380-939d-4b5390170703
bp-9c0a1ef5-f3e0-4683-bd62-e97270170703

> Oh, and why are they making you run win7 in 32bit? That's pretty horrible.

The PCs assigned to me at the office run win7 in 32 bit. :-(

Flags: needinfo?(t.matsuu)

Takanori MATSUURA

Reporter

Comment 5

•

7 years ago

Crash reports other than OOM | small:
bp-3e9d71fb-4d2b-4d28-a0cc-2e92f0170703
bp-266efbb6-9891-4f54-ab3c-3f7a40170703
bp-04deb6b1-8881-4ec3-93cb-a333d0170703
bp-0568b78c-d51b-4bbe-8dab-dd3e50170703
bp-b062f838-aa49-4240-b014-e0bc50170703
bp-d98a8c91-eb7a-469f-9ad7-61a750170703
bp-d435504e-09d4-4084-9a0f-ffe200170703
bp-98b531cd-6cdb-4d57-bb0c-d45800170703
bp-fe9148cc-686b-425c-9613-154e30170703
bp-76fee5df-b57c-48e4-8314-d13a80170703

Wayne Mery (:wsmwk)

Comment 6

•

7 years ago

Daniel is seeing this when viewing a folder: bp-9acac2a3-6508-46fc-87e1-806c60170809
http://forums.mozillazine.org/viewtopic.php?f=39&t=3032305&sid=f3450df3e6ef57eea92a0c893ea7a4ed

A different user: "When repairing maildir folder with a lot of messages (about 30.000, overall size still below 4GB on HDD) Thunderbird has a memory leak resulting in near 2GB RAM usage before crash." bp-b86b3d45-4e54-4ce2-a456-dc46e0170801

URL: http://forums.mozillazine.org/viewtop...

Summary: Crash [@ OOM | small ] during generating MSF file from many emails in Maildir format. Memory leak? → Crash [@ OOM | small ] during generating MSF file from many emails in Maildir format. Memory leak.

Wayne Mery (:wsmwk)

Comment 7

•

7 years ago

> 2. Put many emails into the generated folder in Maildir format

Note - I'm not sure this is expected to work.

Most of the crashes are @ OOM | small but I'm not adding this to this bug report's crash signatures. As for the other signatures:

Bug 1272230 4 crashes - Crash in CCGraphBuilder::BuildGraph - possible memory leak - no progress last 5 months and none likely in the near future
Bug 1353702 2 crashes (possibly dependent on bug 1364543) - Crash in PtrToNodeMatchEntry (CompareCacheMatchEntry) and OOM | small during CC. memory leak? bp-b062f838-aa49-4240-b014-e0bc50170703
Bug 1353704 2 crashes - Crash in mozilla::mailnews::MsgDBReporter::GetPath - I doubt this will be actionable bp-d435504e-09d4-4084-9a0f-ffe200170703
Bug 1333038 1 crash - ref bug 1264302, bug 1284302 Crash in nsMsgLineStreamBuffer::ReadNextLine - probably not maildir related but should be actionable bp-fe9148cc-686b-425c-9613-154e30170703
RtlpDosPathNameToRelativeNtPathName_U | RtlDosPathNameToNtPathName_U_WithStatus | GetFileAttributesW - 1 crash, no bug report bp-76fee5df-b57c-48e4-8314-d13a80170703

Depends on: 1264302

Summary: Crash [@ OOM | small ] during generating MSF file from many emails in Maildir format. Memory leak. → Crash [@ OOM | small ] during generating MSF file from many emails in Maildir format. Memory leak. @ OOM | small

Whiteboard: [maildirblocker?]

Jorg K (CEST = GMT+2)

Updated

•

7 years ago

No longer depends on: 1264302

Jorg K (CEST = GMT+2)

Updated

•

7 years ago

Depends on: 1264302

Wayne Mery (:wsmwk)

Updated

•

7 years ago

Blocks: 1410631

Wayne Mery (:wsmwk)

Comment 9

•

7 years ago

Do you also see this problem if you use the beta from http://www.mozilla.org/thunderbird/channel/ ?

Flags: needinfo?(t.matsuu)

Flags: needinfo?(mail)

Takanori MATSUURA

Reporter

Comment 10

•

7 years ago

I checked whether the issue still occurs with Thunderbird 60.0a1 (20180219030201), and I got an OOM | small crash again. b693e37d-9950-4a70-a4a0-33eb60180220

Flags: needinfo?(t.matsuu)

Branimir Amidžić

Comment 11

•

7 years ago

I checked with suggested beta build and the problem is still present. :(

Flags: needinfo?(mail)

skrytka

Comment 12

•

7 years ago

I have the same problem - is there any way to repair/ build msf file without thunderbird crashing in the process??

skrytka

Comment 13

•

7 years ago

(In reply to skrytka from comment #12)
> I have the same problem - is there any way to repair/ build msf file without
> thunderbird crashing in the process??

p.s. I have 32 GB of DDR4 3200 RAM and the maildir folder is on an NVMe SSD drive. I can't build the msf file. I lost all of my emails.

Alexey Koshterik

Comment 14

•

7 years ago

I have a folder with mails (1 year, 70 000 mails). For now I copy 800-900 files to a temporary folder, repair that folder, and then transfer all the messages to the desired folder. If I have more than 1000 mails in a folder, I get a crash. But even with 900 mails, sometimes TB moves 800 of them and 100-120 remain; those cannot be moved until the program is restarted. I'll continue copying for now, and if I notice any other patterns I'll write here.
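The manual splitting described in this workaround can be scripted. Below is a rough Python sketch that moves message files out of one `cur` directory into staging folders of at most 900 files each; the function name, the `batch_NNN` folder layout and the default of 900 are assumptions chosen to match the numbers mentioned above, and the sketch only shuffles files on disk (the repair step still has to be done in Thunderbird for each batch).

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def move_in_batches(source_cur: Path, staging_root: Path, batch_size: int = 900) -> List[Path]:
    """Move files from source_cur into staging_root/batch_NNN/cur.

    Each batch folder receives at most batch_size files.
    Returns the list of batch folders that were created.
    """
    files = sorted(p for p in source_cur.iterdir() if p.is_file())
    batches = []
    for start in range(0, len(files), batch_size):
        batch_dir = staging_root / f"batch_{start // batch_size:03d}"
        cur = batch_dir / "cur"
        cur.mkdir(parents=True, exist_ok=True)
        for f in files[start:start + batch_size]:
            shutil.move(str(f), str(cur / f.name))
        batches.append(batch_dir)
    return batches


if __name__ == "__main__":
    # Demo with dummy files in a throwaway directory.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "big" / "cur"
        source.mkdir(parents=True)
        for i in range(2000):
            (source / f"{i:06d}.eml").write_text("dummy\n")
        for batch in move_in_batches(source, Path(tmp) / "staging"):
            print(batch.name, len(list((batch / "cur").iterdir())))
```

Run it only while Thunderbird is closed, and on a copy of the data first, since moving files underneath a running profile is exactly the kind of thing that may confuse the MSF index.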

Wayne Mery (:wsmwk)

Updated

•

6 years ago

Blocks: maildirblockers

ISHIKAWA, Chiaki

Comment 15

•

6 years ago

I used to run valgrind to see if there are memory alloc/free mismatches, etc.
This bug should be easy to spot if valgrind runs today.

Unfortunately, for some months since late last year, I cannot run valgrind on my local development PC.
I am running Debian kernel inside virtualbox.
It probably is a particular kernel configuration parameter or two of official Debian supported kernels
that must be the cause of strange valgrind crash.
In the Linux 3.x kernel series, some versions allowed me to run valgrind (and run Thunderbird under it), but other versions did not, and I could not figure out which kernel config parameter changes were responsible for the valgrind failure.
So I used to keep one particular 3.y kernel so that I could run valgrind to test TB's memory issues.

However, time moved on and the Debian userland now requires the kernel 4.x series.
Valgrind has been crashing ever since and I have not been able to run it. I tried building a reasonably configured kernel from pristine source, but
valgrind still crashed. I am not sure what is wrong.
I am not sure whether this is Debian specific or related to some arcane VirtualBox issue. (Probably Debian specific: during the 3.x days I could build a kernel from pristine Linux source with my own config that allowed me to run valgrind, while some Debian-supported 3.y versions did not.)

Anyway, if someone can run valgrind on their local development machine the cause of the bug should be very easy to spot.

Or does Mozilla still run TB under valgrind for testing purposes from time to time?
(I understand FF is run under valgrind from time to time. No?)
I understand such testing is done maybe once a week or less often.
(It used to take almost 24 hours on my PC to run |make mozmill| under valgrind.
Last November I specifically rebuilt my home PC with an 8-core Ryzen CPU and 20MB cache in the hope of running |make mozmill| under valgrind faster, but due to the mysterious valgrind crash under the Debian-supported 4.x series kernel, I have not been able to run it yet.)
ADDED: So if a test that mimics the behavior that triggers the problem is created, and testing under valgrind is done from time to time in Mozilla's testing farm, this problem could probably be analyzed there.
(Wait, I wonder if |make mozmill| in stock form can be run under valgrind. Locally I create a dummy thunderbird binary that actually invokes the original thunderbird under valgrind, and let this fake thunderbird binary run within the |make mozmill| testing scheme. If it is not easy to run TB under valgrind, the testing in the Mozilla testing farm may not be easy...)

So if a developer who uses, say, Fedora (I believe the main developer of valgrind/memcheck uses Fedora) on their development machine can run a TB test suite such as |make mozmill| under valgrind, the cause of this problem and of other memory-related Maildir bugs in Bugzilla should be easy to spot in no time.

Well I have used Debian for almost 20 years now, and so I am not inclined to switch to Fedora anytime soon... Maybe at the office...

If someone knows the particular change to the kernel config file that would allow Debian to run valgrind, I would love to hear about it.

NOTE: for a small program, valgrind has no issue even under Debian's 4.x kernel.
It is a huge program with many dynamically loaded libraries that causes this mysterious crash of valgrind. TB fits the bill.
I can run small programs under valgrind without an issue to my consternation. So debugging is very hard...

Wayne Mery (:wsmwk)

Comment 16

•

5 years ago

Do you still crash when using version 68?

Flags: needinfo?(t.matsuu)

Flags: needinfo?(skrytka)

Flags: needinfo?(mail)

Flags: needinfo?(it)

Branimir Amidžić

Comment 17

•

5 years ago

(In reply to Wayne Mery (:wsmwk) from comment #16)

> Do you still crash when using version 68?

Tested with 14000+ mails. Haven't noticed significant memory increase or crash. It looks fixed.

Regards!

Flags: needinfo?(mail)

Wayne Mery (:wsmwk)

Updated

•

5 years ago

Whiteboard: [maildirblocker?] → [closeme 2020-01-20][maildirblocker?]

Version: unspecified → 45

ISHIKAWA, Chiaki

Comment 18

•

5 years ago

(In reply to ISHIKAWA, Chiaki (may be slow to respond until Jan 4.) from comment #15)

> I used to run valgrind to see if there are memory alloc/free mismatches, etc.
> This bug should be easy to spot if valgrind runs today.
>
> Unfortunately, for some months since late last year, I cannot run valgrind on my local development PC.
> I am running Debian kernel inside virtualbox.
> It probably is a particular kernel configuration parameter or two of official Debian supported kernels
> that must be the cause of strange valgrind crash.

I have found out that if I log in as superuser, valgrind no longer crashes during TB testing.
It seems either the kernel or security protection modules such as SELinux won't allow an ordinary user to extend stack during runtime in my setup.

valgrind bugzilla:
https://bugs.kde.org/show_bug.cgi?id=405295
See comment 9 there.

Back to the original issue of the bugzilla.
Maybe I can test the reported scenario this afternoon.

ISHIKAWA, Chiaki

Comment 19

•

5 years ago

Can someone enlighten me regarding how to enable maildir in the latest code?
I don't find any reference to "maildir" in the preference nor in the general setting of TB. The preference setting dialogs have changed quite a bit.

BTW, the Sunday/Saturday code (from C-C) had a problem of not allowing one to input something to the body text(?). I spent a couple of days to
figure out what was wrong since I could not even write a message after starting it up.
I refreshed the code in the last 12 hours, and it seems to work now without such an issue.

Despite comment 17, I just wanted to see if there is anything we have not covered...

Branimir Amidžić

Comment 20

•

5 years ago

(In reply to ISHIKAWA, Chiaki (may be slow to respond until Jan 4.) from comment #19)

> Can someone enlighten me regarding how to enable maildir in the latest code?
> I don't find any reference to "maildir" in the preference nor in the general setting of TB. The preference setting dialogs have changed quite a bit.
>
> BTW, the Sunday/Saturday code (from C-C) had a problem of not allowing one to input something to the body text(?). I spent a couple of days to
> figure out what was wrong since I could not even write a message after starting it up.
> I refreshed the code in the last 12 hours, and it seems to work now without such an issue.
>
> Despite the comment 17, I just wanted to see if there is anything we have not covered...

It's under Server Settings in Account Setup, but I think it's only available when you create your first account. Otherwise, it's disabled.

ISHIKAWA, Chiaki

Comment 21

•

5 years ago

(In reply to Branimir Amidžić from comment #20)

> It's under Server Settings in Account Setup, but I think it's only available when you create your first account. Otherwise, it's disabled.

Thank you. This should get me going.

ISHIKAWA, Chiaki

Comment 22

•

5 years ago

I thought I would copy a couple of messages to the Maildir account from an account's folder, and repeat
copy 2 messages
copy 4 messages
copy 8 messages
...
copy 1024 messages.

However, I got a fatal bug elsewhere at the initial copy. :-(
bug 1609789

Wayne Mery (:wsmwk)

Updated

•

5 years ago

Whiteboard: [closeme 2020-01-20][maildirblocker?] → [maildirblocker?]

Takanori MATSUURA

Reporter

Comment 23

•

5 years ago

Hi Chiaki,
Which bug(s) block this?

Flags: needinfo?(t.matsuu) → needinfo?(ishikawa)

ISHIKAWA, Chiaki

Comment 24

•

5 years ago

Three months ago, I could not test this due to the bug I mentioned in comment 22.

> However, I got a fatal bug elsewhere at the initial copy. :-(
> bug 1609789

That bug has been taken care of.

Since then, I got carried away by this coronavirus outbreak in Japan, or rather the inept handling of it by the Japanese government, but I digress.

Now that I work from home, I have more time (less commute time), so I thought I would test this and other bugzilla entries.

But for the last 10 days or so, I cannot build TB any more due to a few issues such as
Bug 1633092
TB build failure: GLSL optimizer output causes a compiler error (GCC-9) error: comparison of integer expressions of different signedness: ‘long int’ and ‘size_t’ {aka ‘long unsigned int’} [-Werror=sign-compare]

Bug 1630345
./mach bootstrap fails with python-pip dependency issue: python-pip : Depends: python-pip-whl (= 18.1-5) but 20.0.2-4 is to be installed
(Well, actually I just found a workaround for bug 1630345 on my PC.)

Once I sort them out, I will come back to this bug. :-(

Flags: needinfo?(ishikawa)

Wayne Mery (:wsmwk)

Comment 25

•

4 years ago

(In reply to ISHIKAWA, Chiaki from comment #24)

> Three months ago, I could not test this due to the bug I mentioned in comment 22.
> ...
> But for the last 10 days or so, I cannot build TB any more due to a few issues such as Bug 1633092
> ...
> Once I sort them out, I will come back to this bug. :-(

Only one left!

Wayne Mery (:wsmwk)

Updated

•

3 years ago

Severity: critical → S4

Flags: needinfo?(skrytka)

Flags: needinfo?(it)

Keywords: stackwanted

Takanori MATSUURA

Reporter

Updated

•

3 years ago

Flags: needinfo?(ishikawa)

ISHIKAWA, Chiaki

Comment 26

•

3 years ago

Let me try again this week.
I have finally been able to build TB locally after I updated my M-C/C-C tree (!)

Flags: needinfo?(ishikawa)

ISHIKAWA, Chiaki

Comment 27

•

3 years ago

I tested with a locally created TB.
Not great news. TB crashes at shutdown.

I created a working copy of TB from local updated M-C/C-C (with only minimal patches created locally so as not to disturb the TB operation).
So this may be even newer than Daily available on the web.
As I learned a couple of years back, I can specify the mbox/maildir selection when the new account is created.
(And if I specify maildir, TB needed to restart).
Well, I received about 180 messages from locally running daemons. (This was tested under local linux machine. Actually a linux image inside virtualbox.)

I repeatedly copied those same messages to a newly created empty directory.
Even after the message count exceeded 1000, there was no problem.
I did something similar by copying these 1000+ messages at once to a different folder.
Again, no problem.
But please note that such repeated copying of 180+ messages may not reproduce real-world conditions, since the message ID strings, the time/date and the sender/receiver information are all repeated.

OK, I quit the running TB. Here, I did not see crash.

So I used the following short shell snippet to send myself a couple of thousand e-mails to see how this impacts TB.

$ for f in $(seq 1 2000)
> do
> echo "\ntest test" | mail -s "test subject $f" ishikawa 
> done

Well, actually "\ntest ..." is meaningless; it can simply be "test test".
|mail| under Debian GNU/Linux prompts for a "cc:" address when called interactively on the shell line, but it turns out
that it does not ask when it is invoked as part of a shell script (or rather when its input is a pipe).

So it sends 2000 e-mails to me, and I thought I would obtain 2000 e-mails from that by running TB again.
So I ran TB and tried receiving messages. Great, I thought.
Actually, since TB is set up to receive messages automatically at startup (by default, I think, and periodically, too),
it began receiving e-mails soon after it was started.

But then strangely, I only found I received 980 new e-mails in TB in Inbox. (I only found messages up to "test subject 980")
Something went wrong and TB is no longer able to receive the further e-mails.
I was not sure what was happening.
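To pin down exactly where reception stopped, the subjects that did arrive can be tallied from the files on disk. The following is a small Python sketch under the assumption that the Inbox is a Maildir folder whose messages sit as individual files in a `cur` directory and carry the "test subject N" subjects from the loop above; the function names are made up for illustration.

```python
import re
import tempfile
from email import policy
from email.parser import BytesHeaderParser
from pathlib import Path
from typing import List, Set

SUBJECT_RE = re.compile(r"test subject (\d+)")


def received_numbers(cur_dir: Path) -> Set[int]:
    """Collect N from every 'test subject N' header found in cur_dir."""
    parser = BytesHeaderParser(policy=policy.default)
    numbers = set()
    for path in cur_dir.iterdir():
        if not path.is_file():
            continue
        headers = parser.parsebytes(path.read_bytes())
        match = SUBJECT_RE.search(str(headers["Subject"] or ""))
        if match:
            numbers.add(int(match.group(1)))
    return numbers


def missing_numbers(seen: Set[int], expected_total: int) -> List[int]:
    """Return the subject numbers in 1..expected_total that were not seen."""
    return [n for n in range(1, expected_total + 1) if n not in seen]


if __name__ == "__main__":
    # Demo: simulate an Inbox that stopped after message 980 of 2000.
    with tempfile.TemporaryDirectory() as tmp:
        cur = Path(tmp) / "cur"
        cur.mkdir()
        for i in range(1, 981):
            (cur / f"{i}.eml").write_bytes(
                f"Subject: test subject {i}\n\ntest test\n".encode("ascii")
            )
        missing = missing_numbers(received_numbers(cur), 2000)
        print("first missing:", missing[0], "missing count:", len(missing))
```

A contiguous gap starting right after the last received number would suggest reception simply stopped, whereas scattered gaps would point at individual messages being dropped.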

So I tried to receive remaining e-mails from mail command invoked from the shell.
I got this message:

$mail
Cannot read mailbox /var/mail/ishikawa: Conflict with previous locker

and sure enough, there was "ishikawa.lock" file.
Who, or rather which program, created this lock file? TB (?), most likely.
So I thought I would quit TB and see whether the lock file would disappear.

Then I hit this crash at the program finish stage.

Hit MOZ_CRASH(mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction) at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5e9]
#02: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#03: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#04: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#05: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird +0x162fa]
#06: ??? (???:???)

Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird (pid = 50352) received signal 11.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x6604035]
#02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x13200]
#03: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5f3]
#04: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#05: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#06: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#07: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird +0x162fa]
#08: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird 50352' to attach your debugger to this thread.

Ouch, so there could be a memory structure (List in this case), which may not be properly cleared, etc.
In a memory tight system, this could be a problem. I have 16GB memory assigned to virtualbox on a 32GB real memory PC.

When I looked at the stack using gdb, this is what I got.:

(gdb) where
#0  0x00007f290ce30335 in __GI___clock_nanosleep
    (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffe9c218f50, rem=rem@entry=0x7ffe9c218f50) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:43
#1  0x00007f290ce353f3 in __GI___nanosleep
    (req=req@entry=0x7ffe9c218f50, rem=rem@entry=0x7ffe9c218f50)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007f290ce3532a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f2904a28af9 in common_crap_handler(int, void const*)
    (signum=11, aFirstFramePC=0x7f2904a03035 <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+197>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:95
#4  0x00007f2904a28b1d in ah_crap_handler(int) (signum=<optimized out>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:103
#5  0x00007f2904a03035 in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)
    (signo=11, info=0x7ffe9c2191b0, context=0x7ffe9c219080)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
#6  0x00007f290d174200 in <signal handler called> () at /lib/x86_64-linux-gnu/libpthread.so.0
#7  MOZ_Crash
    (aReason=0x55c9973b0e80 <sPrintfCrashReason> "mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction", aLine=440, aFilename=0x7f2906ef2370 "/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h") at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Assertions.h:261
#8  mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>)
    at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
--Type <RET> for more, q to quit, c to continue without paging-- 
#9  mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>)
    at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:437
#10 0x00007f290cda9f67 in __run_exit_handlers
    (status=0, listp=0x7f290cf28738 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#11 0x00007f290cdaa10a in __GI_exit (status=<optimized out>) at exit.c:139
#12 0x00007f290cd927f4 in __libc_start_main (main=
    0x55c99730ae90 <main(int, char**, char**)>, argc=2, argv=0x7ffe9c2197a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe9c219798) at ../csu/libc-start.c:366
#13 0x000055c99730b2fa in _start ()
(gdb) where
#0  0x00007f290ce30335 in __GI___clock_nanosleep
    (clock_id=clock_id@entry=0, flags=flags@entry=0, req=req@entry=0x7ffe9c218f50, rem=rem@entry=0x7ffe9c218f50) at ../sysdeps/unix/sysv/linux/clock_nanosleep.c:43
#1  0x00007f290ce353f3 in __GI___nanosleep
    (req=req@entry=0x7ffe9c218f50, rem=rem@entry=0x7ffe9c218f50)
    at ../sysdeps/unix/sysv/linux/nanosleep.c:25
#2  0x00007f290ce3532a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
#3  0x00007f2904a28af9 in common_crap_handler(int, void const*)
    (signum=11, aFirstFramePC=0x7f2904a03035 <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+197>) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:95
#4  0x00007f2904a28b1d in ah_crap_handler(int) (signum=<optimized out>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:103
#5  0x00007f2904a03035 in nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)
    (signo=11, info=0x7ffe9c2191b0, context=0x7ffe9c219080)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
#6  0x00007f290d174200 in <signal handler called> () at /lib/x86_64-linux-gnu/libpthread.so.0
#7  MOZ_Crash
    (aReason=0x55c9973b0e80 <sPrintfCrashReason> "mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction", aLine=440, aFilename=0x7f2906ef2370 "/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h") at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Assertions.h:261
#8  mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>)
    at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb) up
#1  0x00007f290ce353f3 in __GI___nanosleep (req=req@entry=0x7ffe9c218f50, 
    rem=rem@entry=0x7ffe9c218f50) at ../sysdeps/unix/sysv/linux/nanosleep.c:25
25	../sysdeps/unix/sysv/linux/nanosleep.c: No such file or directory.
(gdb) up
#2  0x00007f290ce3532a in __sleep (seconds=0) at ../sysdeps/posix/sleep.c:55
55	../sysdeps/posix/sleep.c: No such file or directory.
(gdb) up
#3  0x00007f2904a28af9 in common_crap_handler (signum=11, 
    aFirstFramePC=0x7f2904a03035 <nsProfileLock::FatalSignalHandler(int, siginfo_t*, void*)+197>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:95
95	  sleep(_gdb_sleep_duration);
(gdb) up
#4  0x00007f2904a28b1d in ah_crap_handler (signum=<optimized out>)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/xre/nsSigHandlers.cpp:103
103	  common_crap_handler(signum, CallerPC());
(gdb) up
#5  0x00007f2904a03035 in nsProfileLock::FatalSignalHandler (signo=11, info=0x7ffe9c2191b0, 
    context=0x7ffe9c219080)
    at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/profile/nsProfileLock.cpp:183
183	      oldact->sa_handler(signo);
(gdb) up
#6  <signal handler called>
(gdb) up
#7  MOZ_Crash (
    aReason=0x55c9973b0e80 <sPrintfCrashReason> "mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction", aLine=440, 
    aFilename=0x7f2906ef2370 "/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h")
    at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Assertions.h:261
261	  MOZ_REALLY_CRASH(aLine);
(gdb) list
256	  MOZ_FUZZING_HANDLE_CRASH_EVENT4("MOZ_CRASH", aFilename, aLine, aReason);
257	#if defined(DEBUG) || defined(FUZZING)
258	  MOZ_ReportCrash(aReason, aFilename, aLine);
259	#endif
260	  MOZ_CRASH_ANNOTATE(aReason);
261	  MOZ_REALLY_CRASH(aLine);
262	}
263	#define MOZ_CRASH_UNSAFE(reason) MOZ_Crash(__FILE__, __LINE__, reason)
264	
265	static const size_t sPrintfMaxArgs = 4;
(gdb) list 200

^CQuit
(gdb) list
266	static const size_t sPrintfCrashReasonSize = 1024;
267	
268	MFBT_API MOZ_COLD MOZ_NEVER_INLINE MOZ_FORMAT_PRINTF(1, 2) const
269	    char* MOZ_CrashPrintf(const char* aFormat, ...);
270	
271	/*
272	 * MOZ_CRASH_UNSAFE_PRINTF(format, arg1 [, args]) can be used when more
273	 * information is desired than a string literal can supply. The caller provides
274	 * a printf-style format string, which must be a string literal and between
275	 * 1 and 4 additional arguments. A regular MOZ_CRASH() is preferred wherever
(gdb) list 265
^CQuit
(gdb) print aLine
$1 = 440
(gdb) 
$2 = 440
(gdb)

I am not even sure whether this particular instance of TB was running under valgrind. I believe it was. But TB now seems to run a few processes, and if this particular TB instance is a child process of the original process, it was not running under valgrind, since I forgot to pass valgrind the flag to trace child processes.
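For reference, the valgrind option that follows child processes is `--trace-children=yes`. A stand-in launcher of the kind described in comment 15 (a fake thunderbird binary that starts the real one under valgrind) could be sketched in Python as below. The path of the renamed real binary and the `TB_VALGRIND_EXEC` environment variable are hypothetical; only the three valgrind options shown are real ones, and the selection is a guess at what is useful here, not a tested configuration.

```python
import os
import sys
from typing import List

# Hypothetical location of the real binary after it has been renamed so that
# this wrapper can take its place; adjust to the local build.
REAL_BINARY = "/path/to/objdir/dist/bin/thunderbird-real"


def build_valgrind_command(real_binary: str, tb_args: List[str]) -> List[str]:
    """Return the argv that runs real_binary under valgrind."""
    return [
        "valgrind",
        "--trace-children=yes",        # also instrument child processes
        "--leak-check=full",           # memcheck: report details for each leak
        "--log-file=valgrind-%p.log",  # one log per process, %p expands to the PID
        real_binary,
        *tb_args,
    ]


if __name__ == "__main__":
    command = build_valgrind_command(REAL_BINARY, sys.argv[1:])
    # TB_VALGRIND_EXEC is a made-up switch: without it the wrapper only
    # prints the command, so it is safe to try out.
    if os.environ.get("TB_VALGRIND_EXEC") == "1":
        os.execvp(command[0], command)
    else:
        print(" ".join(command))
```

With one log file per PID it becomes easy to tell afterwards whether the process that crashed was actually instrumented.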

Anyway, I am keeping the crashed process debugged by gdb so that if anyone has a suggestion to where to look, I can watch it.

Or I may try to run this again with clean slate of affairs and make sure TB runs under valgrind.

Anyway, it looks like there IS a piece of code not releasing the list structure properly.

Oh, I forgot to mention that I am using a locally created DEBUG build; that is why I saw this
crash.

I don't believe I regularly see such a "list not released" message when using mbox.
(But maybe I should test TB in a similar manner when my message folder is mbox.)

The above is what I found in a very short testing.

ISHIKAWA, Chiaki

Comment 28

•

3 years ago

One other observation.
When one uses maildir format, does IncorporateMessage have to call
|nsMsgLocalMailFolder::GetDatabaseWOReparse|
8, 12, or 13 times?
(I mean not just a few times: from what I observe in my local verbose dump, this function gets called rather many times between the start of IncorporateMessage and its finish.)
(Does IncorporateMessage try to consolidate receiving and incorporating e-mails into a group action that incorporates a few e-mails at a time in the case of |maildir| support?)
Again, I am not sure if I saw this with mbox format folder usage. But until the current pending TB instance (and possibly valgrind, too) is purged from memory (I am keeping the process image live under gdb just in case somebody wants to take a look at a particular data structure from gdb), I cannot test other scenarios easily.

ISHIKAWA, Chiaki

Comment 29

•

3 years ago

I observe

1. that I could finish TB successfully when I tried to copy repeated messages up to 1000+ and 2000+.
2. that somehow TB could not receive more than 980 e-mail messages. Why? Could there have been some throttling in the interaction between the sending side and the local mail server, and somehow a race when TB tried to access the mail system? Very unlikely.
3. that, at least, TB failed to cleanly release some data structure, as shown by the crash message from the List structure destructor.
   This resulted in a crash caused by MOZ_REALLY_CRASH().

Another thing. In the gdb backtrace, I failed to show the tail end of the trace, here is the tail end.
It is not terribly useful IMHO, but here it goes.

#6  0x00007f290d174200 in <signal handler called> () at /lib/x86_64-linux-gnu/libpthread.so.0
#7  MOZ_Crash
    (aReason=0x55c9973b0e80 <sPrintfCrashReason> "mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction", aLine=440, aFilename=0x7f2906ef2370 "/NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h") at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/Assertions.h:261
#8  mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>)
    at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
--Type <RET> for more, q to quit, c to continue without paging--c
#9  mozilla::LinkedList<nsSHistory>::~LinkedList() (this=<optimized out>, __in_chrg=<optimized out>) at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:437
#10 0x00007f290cda9f67 in __run_exit_handlers (status=0, listp=0x7f290cf28738 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true) at exit.c:108
#11 0x00007f290cdaa10a in __GI_exit (status=<optimized out>) at exit.c:139
#12 0x00007f290cd927f4 in __libc_start_main (main=0x55c99730ae90 <main(int, char**, char**)>, argc=2, argv=0x7ffe9c2197a8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7ffe9c219798) at ../csu/libc-start.c:366
#13 0x000055c99730b2fa in _start ()
(gdb)

ISHIKAWA, Chiaki

Comment 30

•

3 years ago

I don't know if the incomplete release of the elements in the linked list has anything to do with the cyclic nature of the linked list, which
I found out about in a gdb session.

That is, I have found that the list stored in gSHistoryList is cyclic. (Is this variable related to web browser session history? I
wonder where this sneaked in during TB interaction. Maybe the message pane display, especially the message text window, is related to web
browser display?)

Anyway, the following are the GDB commands with which I figured out that gSHistoryList may point to a circular linked list.

Whether the cyclic nature of the list has anything to do with the unreleased element reported in ASSERT is not clear to me.

Also, the ASSERT problem may be orthogonal to the original memory issue in |maildir| folders.

If anyone wants me to play with the GDB session (which I am keeping at this moment), let me know.
I may need to terminate this maybe in another day or two.

GDB session explanation.

I do not understand very well how the class inheritance of LinkedList and LinkedListElement, instantiated with the type nsSHistory, translates into a memory layout, so this was trial and error.

I started with |gSHistoryList| because I noticed that it is one of the persistent variables that seem to hold list values of the type reported in the ASSERT error.

Then I found that it seems to point to a circular list. I wonder whether this circular property is intended, and whether it has anything to do with the unreleased element that caused the ASSERT.

(gdb) print gSHistoryList
$13 = mozilla::LinkedList<nsSHistory> = {0x55c99d00b170, 0x55c99e1ab1b0}
(gdb) print /x (int) gSHistoryList
$14 = 0x9d00b178
(gdb) print /x (long) gSHistoryList  <--- the first pointer in gSHistoryList
$15 = 0x55c99d00b178                 <--- is this value, it seems.
(gdb) print * (LinkedListElement *) 0x55c99d00b170
No symbol "LinkedListElement" in current context.

I needed to spell out the template type as follows.
But I don't think I am looking at the right pointer because
mIsSentinel contains a strange value.

(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b170
$16 = {mNext = 0x7f290a2b1320 <vtable for nsSHistory+16>, mPrev = 0x55c99e1ab1b8, mIsSentinel = 192}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b1b0
$17 = {mNext = 0x55c99c0f2ea0, mPrev = 0xffffff00, mIsSentinel = 96}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b0
$18 = {mNext = 0x7f290a2b1320 <vtable for nsSHistory+16>, mPrev = 0x7f290a785fc0 <gSHistoryList>, 
  mIsSentinel = 120}

The next one seems to point to the correct data type since mIsSentinel
is printed as false. So from there I tried to print the object in
mNext and mPrev.

(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b8
$19 = {mNext = 0x7f290a785fc0 <gSHistoryList>, mPrev = 0x55c99d00b178, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b178
$20 = {mNext = 0x55c99e1ab1b8, mPrev = 0x7f290a785fc0 <gSHistoryList>, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x7f290a785fc0
$21 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
(gdb) print (LinkedListElement<nsSHistory>) gSHistoryList
$22 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}

See? It seems there is a cyclic list.
gSHistoryList -> 0x55c99d00b178 -> 0x55c99e1ab1b8 -> back to gSHistoryList
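
A side note on the addresses above: the list itself prints 0x55c99d00b170, while the node that is actually linked sits at 0x55c99d00b178. One plausible explanation (an assumption, not checked against the real nsSHistory declaration) is that the LinkedListElement<nsSHistory> base subobject does not start at offset 0 of the object, because a vtable pointer comes first. The following minimal C++ sketch uses made-up classes (`ListElement`, `Interface`, `History`) only to show how a base-class pointer can differ from the object pointer:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for illustration; these are not the real Gecko classes.
template <typename T>
struct ListElement {
  ListElement* next = nullptr;
  ListElement* prev = nullptr;
  bool isSentinel = false;
};

struct Interface {
  virtual ~Interface() = default;  // forces a vtable pointer into the object
};

// The element is a base class of the object that lives in the list.
struct History : Interface, ListElement<History> {
  int payload = 0;
};

// Byte offset of the ListElement<History> subobject inside a History object.
std::ptrdiff_t elementOffset() {
  History h;
  History* object = &h;
  ListElement<History>* element = object;  // implicit upcast may adjust the address
  // Casting back down restores the original object address.
  assert(static_cast<History*>(element) == object);
  return reinterpret_cast<const char*>(element) -
         reinterpret_cast<const char*>(object);
}
```

The exact value returned by `elementOffset()` depends on the ABI, so the sketch does not rely on it. If it is non-zero, casting the object address directly to the element type in gdb would show garbage in fields such as mIsSentinel, which would match what I saw for 0x55c99d00b170.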

Actually, I suspected this when I tried to execute the following
command and saw the output with repeated data. I had to quit the GDB output.

print (LinkedList<nsSHistory>) gSHistoryList
$23 = mozilla::LinkedList<nsSHistory> = {0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
  0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>,
  0x55c99d00b170, 0x55c99e1ab1b0, 0x7f290a785fb8 <gTouchCounter>, 0x55c99d00b170, 0x55c99e1ab1b0,
--Type <RET> for more, q to quit, c to continue without paging--q
Quit
(gdb)

Maybe I should file a separate bug for the assert.

ISHIKAWA, Chiaki

Comment 31

•

3 years ago

Well, I found a similar shutdown crash bug, so I added a comment there instead of filing a new bug at this moment.
Bug 1745864 Opened 1 month ago
Hit MOZ_CRASH(mozilla::LinkedList<mozilla::dom::ContentParent>::~LinkedList() [T = mozilla::dom::ContentParent] has a buggy user: it should have removed all this list's elements before the list's destruction).

comment 11 is my addition.
https://bugzilla.mozilla.org/show_bug.cgi?id=1745864#c1

Jens Stutte [:jstutte]

Comment 32

•

3 years ago

Hi Chiaki, the cyclic reference is wanted by design. If you look at:

(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b8
$19 = {mNext = 0x7f290a785fc0 <gSHistoryList>, mPrev = 0x55c99d00b178, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b178
$20 = {mNext = 0x55c99e1ab1b8, mPrev = 0x7f290a785fc0 <gSHistoryList>, mIsSentinel = false}
(gdb) print * (LinkedListElement<nsSHistory> *) 0x7f290a785fc0
$21 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
(gdb) print (LinkedListElement<nsSHistory>) gSHistoryList
$22 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}

you will find that the 0x7f290a785fc0 has the mIsSentinel flag set to true. This element is always present (in fact it is the sentinel member variable of gSHistoryList) and closes the cycle. It cannot be removed and when it is the only element in the list, the list is considered to be empty.
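
To make the sentinel idea concrete, here is a minimal C++ sketch of a circular doubly linked list with a sentinel. It is a simplified stand-in with hypothetical names (`Node`, `List`, `demoSentinelList`), not the actual mozilla::LinkedList code, and it assumes that elements unlink themselves when they are destroyed:

```cpp
#include <cassert>
#include <cstddef>

// Simplified stand-in for a list node; not the real LinkedListElement.
struct Node {
  Node* next;
  Node* prev;
  bool isSentinel;

  explicit Node(bool sentinel = false)
      : next(this), prev(this), isSentinel(sentinel) {}

  bool isInList() const { return next != this; }

  // Unlink this node; afterwards it points to itself again.
  void remove() {
    prev->next = next;
    next->prev = prev;
    next = prev = this;
  }

  // A regular node unlinks itself when it is destroyed.
  ~Node() {
    if (!isSentinel && isInList()) remove();
  }
};

struct List {
  Node sentinel{true};  // always present, closes the cycle

  // Empty means the sentinel is the only node, i.e. it points to itself.
  bool isEmpty() const { return sentinel.next == &sentinel; }

  void insertBack(Node* n) {
    n->prev = sentinel.prev;
    n->next = &sentinel;
    sentinel.prev->next = n;
    sentinel.prev = n;
  }

  // Walk the cycle once; reaching the sentinel ends the walk.
  std::size_t length() const {
    std::size_t count = 0;
    for (const Node* n = sentinel.next; !n->isSentinel; n = n->next) ++count;
    return count;
  }
};

// Counts the elements while two nodes are alive, then checks that the
// list is empty again once both nodes have been destroyed.
std::size_t demoSentinelList() {
  List list;
  std::size_t seen = 0;
  {
    Node a, b;
    list.insertBack(&a);
    list.insertBack(&b);
    seen = list.length();
  }  // a and b unlink themselves here
  assert(list.isEmpty());
  return seen;
}
```

In this sketch the list is only non-empty at its own destruction if some element is still alive at that point, which is the situation the crash message complains about.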

If that output of gdb is shown at the time of the assertion, it just means that the list was not empty and that the two elements with mIsSentinel = false have not been removed. If you are able to reproduce the issue you might have some luck with rr for debugging this.

The issue in the other bug 1745864 is similar but on a different list, so the underlying reasons why the elements have not been removed here and there are most probably different. You might want to look out for RefPtr<nsSHistory> that are never cleared.

ISHIKAWA, Chiaki

Comment 33

•

3 years ago

(In reply to Jens Stutte [:jstutte] from comment #32)

Thank you for the comment.

> Hi Chiaki, the cyclic reference is wanted by design. If you look at:
> (gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99e1ab1b8
> $19 = {mNext = 0x7f290a785fc0 <gSHistoryList>, mPrev = 0x55c99d00b178, mIsSentinel = false}
> (gdb) print * (LinkedListElement<nsSHistory> *) 0x55c99d00b178
> $20 = {mNext = 0x55c99e1ab1b8, mPrev = 0x7f290a785fc0 <gSHistoryList>, mIsSentinel = false}
> (gdb) print * (LinkedListElement<nsSHistory> *) 0x7f290a785fc0
> $21 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
> (gdb) print (LinkedListElement<nsSHistory>) gSHistoryList
> $22 = {mNext = 0x55c99d00b178, mPrev = 0x55c99e1ab1b8, mIsSentinel = true}
> you will find that the 0x7f290a785fc0 has the mIsSentinel flag set to true. This element is always present (in fact it is the sentinel member variable of gSHistoryList) and closes the cycle. It cannot be removed and when it is the only element in the list, the list is considered to be empty.

I see. This is how gSHistoryList was designed to behave.

> If that output of gdb is shown at the time of the assertion, it just means that the list was not empty and that the two elements with mIsSentinel = false have not been removed. If you are able to reproduce the issue you might have some luck with rr for debugging this.

The gdb output is indeed from the TB process that crashed on the assert, so the non-empty list at TB shutdown is real.
That is why the assert fired.
The non-sentinel members have not been removed for some reason.

I might try the scenario of copying a large number of messages from one folder to the other to see if this will be repeated.

> The issue in the other bug 1745864 is similar but on a different list, so the underlying reasons why the elements have not been removed here and there are most probably different. You might want to look out for RefPtr<nsSHistory> that are never cleared.

I will look into this "look out for RefPtr<nsSHistory> that are never cleared" advice a bit more and probably file a separate bugzilla entry based on the findings. Then I will
terminate the TB process image I have kept, along with the gdb process attached to it, and start over.

https://searchfox.org/mozilla-central/search?q=RefPtr%3CnsSHistory%3E&path=
shows the following.

Textual Occurrences
	docshell/base/CanonicalBrowsingContext.h
495	RefPtr<nsSHistory> mSessionHistory;
	docshell/shistory/nsSHEntryShared.cpp
111	RefPtr<nsSHistory> nsshistory = static_cast<nsSHistory*>(shistory.get());
	docshell/shistory/nsSHistory.cpp
181	RefPtr<nsSHistory> mSHistory;
1580	RefPtr<nsSHistory> mSHistory;
	docshell/shistory/nsSHistory.h
329	RefPtr<nsSHistory> mSHistory;

These all turn out to be class members.
It is a bit awkward to figure out WHICH globally persistent variables hold the objects containing these member variables (= fields) in the runtime memory snapshot.
|gSHistoryList| was easy to spot since it is, after all, a variable with file-wide scope.
Hmm...
I tried to see, for example, how the "495 RefPtr<nsSHistory> mSessionHistory;" is used, and found this.
https://searchfox.org/mozilla-central/search?q=symbol:F_%3CT_mozilla%3A%3Adom%3A%3ACanonicalBrowsingContext%3E_mSessionHistory&redirect=false

Uses
	docshell/base/CanonicalBrowsingContext.cpp
149	if (mSessionHistory) { // found in mozilla::dom::CanonicalBrowsingContext::~CanonicalBrowsingContext
150	mSessionHistory->SetBrowsingContext(nullptr); // found in mozilla::dom::CanonicalBrowsingContext::~CanonicalBrowsingContext
288	MOZ_ASSERT(!aNewContext->mSessionHistory); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
331	if (mSessionHistory) { // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
332	mSessionHistory->SetBrowsingContext(aNewContext); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
336	mSessionHistory->SetEpoch(0, Nothing()); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
337	mSessionHistory.swap(aNewContext->mSessionHistory); // found in mozilla::dom::CanonicalBrowsingContext::ReplacedBy
457	if (!mSessionHistory && GetChildSessionHistory()) { // found in mozilla::dom::CanonicalBrowsingContext::GetSessionHistory
458	mSessionHistory = new nsSHistory(this); // found in mozilla::dom::CanonicalBrowsingContext::GetSessionHistory
461	return mSessionHistory; // found in mozilla::dom::CanonicalBrowsingContext::GetSessionHistory
2799	if (tmp->mSessionHistory) { // found in mozilla::dom::CanonicalBrowsingContext::cycleCollection::Unlink
2800	tmp->mSessionHistory->SetBrowsingContext(nullptr); // found in mozilla::dom::CanonicalBrowsingContext::cycleCollection::Unlink
* 2802	NS_IMPL_CYCLE_COLLECTION_UNLINK(mSessionHistory, mContainerFeaturePolicy, // found in  mozilla::dom::CanonicalBrowsingContext::cycleCollection::Unlink
* 2809	NS_IMPL_CYCLE_COLLECTION_TRAVERSE(mSessionHistory, mContainerFeaturePolicy, // found in mozilla::dom::CanonicalBrowsingContext::cycleCollection::TraverseNative

I found the NS_IMPL_CYCLE_COLLECTION_UNLINK/TRAVERSE macros used only for this particular RefPtr<nsSHistory> field; the other fields listed at
https://searchfox.org/mozilla-central/search?q=RefPtr%3CnsSHistory%3E&path=
did not seem to use these macros. I wonder if this could be the reason why elements are not properly removed.

Or for that matter, the mSHistory in

docshell/shistory/nsSHistory.cpp
181	RefPtr<nsSHistory> mSHistory;

is created (copied upon creation), but it does not seem to be explicitly
removed(?).
See https://searchfox.org/mozilla-central/source/docshell/shistory/nsSHistory.cpp#181

class MOZ_STACK_CLASS SHistoryChangeNotifier {
 public:
  explicit SHistoryChangeNotifier(nsSHistory* aHistory) {
    // If we're already in an update, the outermost change notifier will
    // update browsing context in the destructor.
    if (!aHistory->HasOngoingUpdate()) {
      aHistory->SetHasOngoingUpdate(true);
      mSHistory = aHistory;
    }
  }

  ~SHistoryChangeNotifier() {
    if (mSHistory) {
      MOZ_ASSERT(mSHistory->HasOngoingUpdate());
      mSHistory->SetHasOngoingUpdate(false);

      if (mozilla::SessionHistoryInParent() &&
          mSHistory->GetBrowsingContext()) {
        mSHistory->GetBrowsingContext()
            ->Canonical()
            ->HistoryCommitIndexAndLength();
      }
    }
    // <----- ??? If mSHistory was not null, should we not clear it after the above is done?
  }

  RefPtr<nsSHistory> mSHistory;
};

I don't know the code at all, but from a cursory look, I wonder: if mSHistory was not null, should we not clear it at the end of the destructor?
I have not checked all the "RefPtr<nsSHistory>" fields, but it seems there may indeed be some coding issues.
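
One thing worth checking against that suspicion: an owning smart pointer held as a member releases its reference in its own destructor, which runs right after the body of the containing class's destructor, and RefPtr behaves this way as well. The sketch below uses std::shared_ptr as a stand-in for RefPtr, and the names `History`, `ChangeNotifier` and `useCountAfterGuard` are made up for illustration; it is loosely modelled on the notifier quoted above, not a copy of it:

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for the session history object.
struct History {
  bool ongoingUpdate = false;
};

// Scope guard loosely modelled on the quoted notifier class.
// std::shared_ptr stands in for RefPtr in this sketch.
class ChangeNotifier {
 public:
  explicit ChangeNotifier(std::shared_ptr<History> history) {
    // Only the outermost guard keeps a reference.
    if (!history->ongoingUpdate) {
      history->ongoingUpdate = true;
      mHistory = std::move(history);
    }
  }

  ~ChangeNotifier() {
    if (mHistory) {
      mHistory->ongoingUpdate = false;
    }
    // No explicit reset here: the destructor of mHistory runs after this
    // body and drops the reference by itself.
  }

 private:
  std::shared_ptr<History> mHistory;
};

// Returns the reference count seen by the caller after the guard is gone.
long useCountAfterGuard() {
  auto history = std::make_shared<History>();
  {
    ChangeNotifier guard(history);
    assert(history.use_count() == 2);  // caller + guard
  }
  return history.use_count();
}
```

If the same reasoning applies to the real class, the missing explicit clear in the destructor would not by itself keep an nsSHistory alive, and the leftover list elements would have to be held by something else.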

This shutdown-time assertion issue seems to be very deep. And it may or may not be related to the original bug symptom because I found that the session history is kept to a reasonably small number. It seems to be trimmed if the list becomes longer than the maximum limit.
There is another possibility: that this removal is not handled properly. But I am under the impression that it is the implicit removal at class destruction, etc., that is not properly handled coding-wise.

I think I will stop the current gdb session and try to check the ORIGINAL issue again.
And I will see if the shutdown-time assert is triggered again. In any case, I think I had better file a separate bug for the
shutdown-time assert of "Hit MOZ_CRASH(mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] "

Thank you again.

ISHIKAWA, Chiaki

Comment 34

•

3 years ago

OK, I restarted a locally built DEBUG version of TB under Linux.
As soon as I quit the previous TB instance, the mail lock file was gone.
So the new TB instance could read the remaining e-mails. (Of 2000 e-mails, only the first 980 were read in the previous TB run. I still don't know why.)
Next, I tried copying 2000+ and 3000+ e-mails using maildir under valgrind in different sessions.
(I have 16GB of memory assigned to my Linux image in VirtualBox.)

I could not reproduce the crash at shutdown due to the particular MOZ_ASSERT any more (comment 27).
That assert crash probably needs to have a separate bugzilla.

But then in one of the runs, I got the following shutdown time error.
(Other than that I did not see memcheck-related errors which seem to be directly related to maildir.)

The valgrind run was done with the following options and environment variables, by the way:

env MOZ_FAKE_NO_SANDBOX=yes MOZ_FAKE_NO_SECCOMP_TSYNC=yes MOZ_DISABLE_CONTENT_SANDBOX=yes MOZ_DISABLE_GMP_SANDBOX=yes MOZ_ASSUME_USER_NS=0 valgrind --trace-children=yes --fair-sched=yes --smc-check=all-non-file --gen-suppressions=all --vex-iropt-register-updates=allregs-at-mem-access --child-silent-after-fork=yes --trace-children-skip=/usr/bin/lsb_release,/usr/bin/hg,/bin/rm,*/bin/certutil,*/bin/pk12util,*/bin/ssltunnel,*/bin/uname,*/bin/which,*/bin/ps,*/bin/grep,*/bin/java,*/fix-stacks,*/firefox/firefox,*/bin/firefox-esr,*/bin/python,*/bin/python2,*/bin/python3,*/bin/python2.7,*/bin/bash,*/bin/nodejs,*/bin/node,*/bin/xpcshell,python3 --max-threads=5000  --max-stackframe=16000000 --num-transtab-sectors=24 --tool=memcheck --freelist-vol=500000000 --redzone-size=128 --px-default=allregs-at-mem-access --px-file-backed=unwindregs-at-mem-access --malloc-fill=0xA5 --free-fill=0xC3 --num-callers=50 --suppressions=/home/ishikawa/Dropbox/myown.sup --show-mismatched-frees=no --show-possibly-lost=no  /KERNEL-SRC/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -p

The new crash I got was as follows.
Note the 0xC3C3C3...C3 value. That is a read of an already freed memory location. (I passed the option "--free-fill=0xC3" to valgrind.)
Something is really screwed up at TB termination time. I am not sure whether that is |maildir|-specific or not. (It takes time to check the operation of copying a few thousand messages under valgrind, so I cannot test the mbox case on the same day in my spare time.)
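
For anyone reading such logs, a quick way to recognize a fill pattern is to check whether every byte of a suspicious value equals the fill byte. This is only a small C++ sketch with hypothetical helper names (`looksLikeFill`, `checkExamples`); the fill bytes 0xC3 and 0xA5 are the ones from my valgrind command line above:

```cpp
#include <cassert>
#include <cstdint>

// True if every byte of `value` equals `fill`, e.g. 0xC3C3C3C3C3C3C3C3
// for the --free-fill=0xC3 setting.
bool looksLikeFill(std::uint64_t value, std::uint8_t fill) {
  for (int i = 0; i < 8; ++i) {
    if (((value >> (8 * i)) & 0xFF) != fill) return false;
  }
  return true;
}

// Small self-check using one filled value and one ordinary heap address.
void checkExamples() {
  assert(looksLikeFill(0xC3C3C3C3C3C3C3C3ULL, 0xC3));
  assert(!looksLikeFill(0x000055c99d00b170ULL, 0xC3));
}
```

A pointer-sized value that passes this check for 0xC3 was most likely read from freed memory, and one that passes for 0xA5 from memory that was allocated but never initialized.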

        ... toward the shutdown ...

Failed to load file:///NEW-SSD/NREF-COMM-CENTRAL/mozilla/comm/mail/base/content/mailCore.js
[Parent 68965, Main Thread] WARNING: 'aOwner->IsDiscarded()', file /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/dom/SyncedContextInlines.h:94
[Parent 68965, Main Thread] WARNING: 'aOwner->IsDiscarded()', file /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/dom/SyncedContextInlines.h:94
==68965== Thread 1:
==68965== Invalid read of size 8
==68965==    at 0xCB168F3: mozilla::detail::RunnableFunction<nsXULPopupManager::ShowMenu(nsIContent*, bool, bool)::{lambda()#1}>::Run() (nsCOMPtr.h:855)
==68965==    by 0x927FF15: mozilla::RunnableTask::Run() (TaskController.cpp:468)
==68965==    by 0x927EC77: mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:771)
==68965==    by 0x927F40A: mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:607)
==68965==    by 0x927F71B: mozilla::TaskController::ProcessPendingMTTask(bool) (TaskController.cpp:391)
==68965==    by 0x927F7EA: mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::{lambda()#1}>::Run() (TaskController.cpp:124)
==68965==    by 0x92804D1: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1195)
==68965==    by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965==    by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965==    by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965==    by 0xC4CDBC8: nsBaseAppShell::Run() (nsBaseAppShell.cpp:137)
==68965==    by 0xDB778C9: nsAppStartup::Run() (nsAppStartup.cpp:295)
==68965==    by 0xDC74F27: XREMain::XRE_mainRun() (nsAppRunner.cpp:5342)
==68965==    by 0xDC76479: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5527)
==68965==    by 0xDC76D98: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5586)
==68965==    by 0x11EC73: do_main(int, char**, char**) (nsMailApp.cpp:229)
==68965==    by 0x11DF02: main (nsMailApp.cpp:368)
==68965==  Address 0x3a5856a0 is 6,944 bytes inside a block of size 8,192 free'd
==68965==    at 0x483F74C: free (vg_replace_malloc.c:755)
==68965==    by 0xC84F002: nsPresArena<8192ul, mozilla::ArenaObjectID, 163ul>::~nsPresArena() (ArenaAllocator.h:90)
==68965==    by 0xC7DD61E: mozilla::PresShell::~PresShell() (PresShell.cpp:879)
==68965==    by 0xC7DDC98: mozilla::PresShell::Release() (PresShell.cpp:877)
==68965==    by 0xD7F3938: mozilla::AppWindow::RequestWindowClose(nsIWidget*) (RefPtr.h:50)
==68965==    by 0xD7F3A8A: mozilla::AppWindow::WidgetListenerDelegate::RequestWindowClose(nsIWidget*) (AppWindow.cpp:3317)
==68965==    by 0xC52D2ED: delete_event_cb(_GtkWidget*, _GdkEventAny*) (nsWindow.cpp:3914)
==68965==    by 0x594CF93: ??? (in /usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2404.26)
==68965==    by 0x6573908: ??? (in /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.7000.2)
==68965==    by 0x658B63A: g_signal_emit_valist (in /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.7000.2)
==68965==    by 0x658C4FE: g_signal_emit (in /usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.7000.2)
==68965==    by 0x58F6B93: ??? (in /usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2404.26)
==68965==    by 0x57AD362: gtk_main_do_event (in /usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2404.26)
==68965==    by 0x5DC86A4: ??? (in /usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2404.26)
==68965==    by 0x5DFBD71: ??? (in /usr/lib/x86_64-linux-gnu/libgdk-3.so.0.2404.26)
==68965==    by 0x660CCDA: g_main_context_dispatch (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7000.2)
==68965==    by 0x660CF87: ??? (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7000.2)
==68965==    by 0x660D03E: g_main_context_iteration (in /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.7000.2)
==68965==    by 0xC567DE3: nsAppShell::ProcessNextNativeEvent(bool) (nsAppShell.cpp:352)
==68965==    by 0xC4D79F6: nsBaseAppShell::OnProcessNextEvent(nsIThreadInternal*, bool) (nsBaseAppShell.cpp:120)
==68965==    by 0x92803B1: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1111)
==68965==    by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965==    by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965==    by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965==    by 0xC4CDBC8: nsBaseAppShell::Run() (nsBaseAppShell.cpp:137)
==68965==    by 0xDB778C9: nsAppStartup::Run() (nsAppStartup.cpp:295)
==68965==    by 0xDC74F27: XREMain::XRE_mainRun() (nsAppRunner.cpp:5342)
==68965==    by 0xDC76479: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5527)
==68965==    by 0xDC76D98: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5586)
==68965==    by 0x11EC73: do_main(int, char**, char**) (nsMailApp.cpp:229)
==68965==    by 0x11DF02: main (nsMailApp.cpp:368)
==68965==  Block was alloc'd at
==68965==    at 0x483CF9B: malloc (vg_replace_malloc.c:380)
==68965==    by 0xC869E30: nsPresArena<8192ul, mozilla::ArenaObjectID, 163ul>::Allocate(mozilla::ArenaObjectID, unsigned long) (ArenaAllocator.h:170)
==68965==    by 0xC8AF869: NS_NewBlockFrame(mozilla::PresShell*, mozilla::ComputedStyle*) (PresShell.h:280)
==68965==    by 0xC82E8E7: nsCSSFrameConstructor::ConstructNonScrollableBlockWithConstructor(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItem&, nsContainerFrame*, nsStyleDisplay const*, nsFrameList&, nsBlockFrame* (*)(mozilla::PresShell*, mozilla::ComputedStyle*)) (nsCSSFrameConstructor.cpp:4620)
==68965==    by 0xC82EAD4: nsCSSFrameConstructor::ConstructNonScrollableBlock(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItem&, nsContainerFrame*, nsStyleDisplay const*, nsFrameList&) (nsCSSFrameConstructor.cpp:4593)
==68965==    by 0xC828277: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3692)
==68965==    by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965==    by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965==    by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965==    by 0xC828D10: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3832)
==68965==    by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965==    by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965==    by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965==    by 0xC828D10: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3832)
==68965==    by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965==    by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965==    by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965==    by 0xC828D10: nsCSSFrameConstructor::ConstructFrameFromItemInternal(nsCSSFrameConstructor::FrameConstructionItem&, nsFrameConstructorState&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:3832)
==68965==    by 0xC8291DC: nsCSSFrameConstructor::ConstructFramesFromItem(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList::Iterator&, nsContainerFrame*, nsFrameList&) (nsCSSFrameConstructor.cpp:5658)
==68965==    by 0xC829521: nsCSSFrameConstructor::ConstructFramesFromItemList(nsFrameConstructorState&, nsCSSFrameConstructor::FrameConstructionItemList&, nsContainerFrame*, bool, nsFrameList&) (nsCSSFrameConstructor.cpp:9521)
==68965==    by 0xC821F00: nsCSSFrameConstructor::ProcessChildren(nsFrameConstructorState&, nsIContent*, mozilla::ComputedStyle*, nsContainerFrame*, bool, nsFrameList&, bool, nsIFrame*) (nsCSSFrameConstructor.cpp:9681)
==68965==    by 0xC82E348: nsCSSFrameConstructor::ConstructBlock(nsFrameConstructorState&, nsIContent*, nsContainerFrame*, nsContainerFrame*, mozilla::ComputedStyle*, nsContainerFrame**, nsFrameList&, nsIFrame*) (nsCSSFrameConstructor.cpp:10570)
==68965==    by 0xC82F3A0: nsCSSFrameConstructor::ConstructDocElementFrame(mozilla::dom::Element*) (nsCSSFrameConstructor.cpp:2439)
==68965==    by 0xC833705: nsCSSFrameConstructor::ContentRangeInserted(nsIContent*, nsIContent*, nsCSSFrameConstructor::InsertionKind) (nsCSSFrameConstructor.cpp:6956)
==68965==    by 0xC7E0DD1: mozilla::PresShell::Initialize() [clone .part.0] (PresShell.cpp:1853)
==68965==    by 0xBF43EDF: mozilla::dom::PrototypeDocumentContentSink::StartLayout() (PrototypeDocumentContentSink.cpp:700)
==68965==    by 0xBF440FF: mozilla::dom::PrototypeDocumentContentSink::DoneWalking() (PrototypeDocumentContentSink.cpp:669)
==68965==    by 0xC48A9A4: mozilla::dom::DocumentL10n::InitialTranslationCompleted(bool) [clone .part.0] (DocumentL10n.cpp:321)
==68965==    by 0xC48B197: L10nReadyHandler::ResolvedCallback(JSContext*, JS::Handle<JS::Value>) (DocumentL10n.cpp:304)
==68965==    by 0xC1A85D8: mozilla::dom::(anonymous namespace)::PromiseNativeHandlerShim::ResolvedCallback(JSContext*, JS::Handle<JS::Value>) (Promise.cpp:385)
==68965==    by 0xC1ABC4D: mozilla::dom::NativeHandlerCallback(JSContext*, unsigned int, JS::Value*) (Promise.cpp:338)
==68965==    by 0xE1AA81E: CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) (Interpreter.cpp:425)
==68965==    by 0xE1BFBAD: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) (Interpreter.cpp:512)
==68965==    by 0xE1C0097: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) (Interpreter.cpp:589)
==68965==    by 0xE20BE58: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::MutableHandle<JS::Value>) (Interpreter.h:106)
==68965==    by 0xE3873E4: PromiseReactionJob(JSContext*, unsigned int, JS::Value*) (Promise.cpp:2067)
==68965==    by 0xE1AA81E: CallJSNative(JSContext*, bool (*)(JSContext*, unsigned int, JS::Value*), js::CallReason, JS::CallArgs const&) (Interpreter.cpp:425)
==68965==    by 0xE1BFBAD: js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct, js::CallReason) (Interpreter.cpp:512)
==68965==    by 0xE1C0097: js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>, js::CallReason) (Interpreter.cpp:589)
==68965==    by 0xE2D4E84: JS::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) (CallAndConstruct.cpp:117)
==68965==    by 0xA8EFC7E: mozilla::dom::VoidFunction::Call(mozilla::dom::BindingCallContext&, JS::Handle<JS::Value>, mozilla::ErrorResult&) (CustomElementRegistryBinding.cpp:503)
==68965==    by 0x9166722: mozilla::dom::PromiseJobCallback::Call(mozilla::ErrorResult&, char const*, mozilla::dom::CallbackObject::ExceptionHandling, JS::Realm*) (PromiseBinding.h:89)
==68965==    by 0x916697F: mozilla::PromiseJobRunnable::Run(mozilla::AutoSlowOperation&) (PromiseBinding.h:102)
==68965==    by 0x917C235: mozilla::CycleCollectedJSContext::PerformMicroTaskCheckPoint(bool) (CycleCollectedJSContext.cpp:674)
==68965==    by 0x917CA31: mozilla::CycleCollectedJSContext::AfterProcessTask(unsigned int) (CycleCollectedJSContext.cpp:463)
==68965==    by 0x9E68473: XPCJSContext::AfterProcessTask(unsigned int) (XPCJSContext.cpp:1424)
==68965==    by 0x92805DC: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1232)
==68965==    by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965==    by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965==    by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965==
{
   <insert_a_suppression_name_here>
   Memcheck:Addr8
   fun:_ZN7mozilla6detail16RunnableFunctionIZN17nsXULPopupManager8ShowMenuEP10nsIContentbbEUlvE_E3RunEv
   fun:_ZN7mozilla12RunnableTask3RunEv
   fun:_ZN7mozilla14TaskController39DoExecuteNextTaskOnlyMainThreadInternalERKNS_6detail12BaseAutoLockIRNS_5MutexEEE
   fun:_ZN7mozilla14TaskController37ExecuteNextTaskOnlyMainThreadInternalERKNS_6detail12BaseAutoLockIRNS_5MutexEEE
   fun:_ZN7mozilla14TaskController20ProcessPendingMTTaskEb
   fun:_ZN7mozilla6detail16RunnableFunctionIZNS_14TaskController18InitializeInternalEvEUlvE_E3RunEv
   fun:_ZN8nsThread16ProcessNextEventEbPb
   fun:_Z19NS_ProcessNextEventP9nsIThreadb
   fun:_ZN7mozilla3ipc11MessagePump3RunEPN4base11MessagePump8DelegateE
   fun:_ZN11MessageLoop3RunEv
   fun:_ZN14nsBaseAppShell3RunEv
   fun:_ZN12nsAppStartup3RunEv
   fun:_ZN7XREMain11XRE_mainRunEv
   fun:_ZN7XREMain8XRE_mainEiPPcRKN7mozilla15BootstrapConfigE
   fun:_Z8XRE_mainiPPcRKN7mozilla15BootstrapConfigE
   fun:_ZL7do_mainiPPcS0_
   fun:main
}
==68965== Invalid read of size 8
==68965==    at 0xCB16900: mozilla::detail::RunnableFunction<nsXULPopupManager::ShowMenu(nsIContent*, bool, bool)::{lambda()#1}>::Run() (RefPtr.h:49)
==68965==    by 0x927FF15: mozilla::RunnableTask::Run() (TaskController.cpp:468)
==68965==    by 0x927EC77: mozilla::TaskController::DoExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:771)
==68965==    by 0x927F40A: mozilla::TaskController::ExecuteNextTaskOnlyMainThreadInternal(mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&) (TaskController.cpp:607)
==68965==    by 0x927F71B: mozilla::TaskController::ProcessPendingMTTask(bool) (TaskController.cpp:391)
==68965==    by 0x927F7EA: mozilla::detail::RunnableFunction<mozilla::TaskController::InitializeInternal()::{lambda()#1}>::Run() (TaskController.cpp:124)
==68965==    by 0x92804D1: nsThread::ProcessNextEvent(bool, bool*) (nsThread.cpp:1195)
==68965==    by 0x925F799: NS_ProcessNextEvent(nsIThread*, bool) (nsThreadUtils.cpp:467)
==68965==    by 0x99B2BC9: mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) (MessagePump.cpp:85)
==68965==    by 0x9956448: MessageLoop::Run() (message_loop.cc:324)
==68965==    by 0xC4CDBC8: nsBaseAppShell::Run() (nsBaseAppShell.cpp:137)
==68965==    by 0xDB778C9: nsAppStartup::Run() (nsAppStartup.cpp:295)
==68965==    by 0xDC74F27: XREMain::XRE_mainRun() (nsAppRunner.cpp:5342)
==68965==    by 0xDC76479: XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5527)
==68965==    by 0xDC76D98: XRE_main(int, char**, mozilla::BootstrapConfig const&) (nsAppRunner.cpp:5586)
==68965==    by 0x11EC73: do_main(int, char**, char**) (nsMailApp.cpp:229)
==68965==    by 0x11DF02: main (nsMailApp.cpp:368)
==68965==  Address 0xc3c3c3c3c3c3c3c3 is not stack'd, malloc'd or (recently) free'd
==68965==

ISHIKAWA, Chiaki

Comment 35

•

3 years ago

Oh, one other thing. The crash under valgrind may be caused by a timing race.

There are places where the timing of events is not strictly controlled by the program flow.
Sometimes a group of events can occur in any order.
But sometimes a group of events has to occur in a certain sequence.
In those cases, programmers need to interlock the operations with synchronization primitives.
The problem is that some programmers simply "assume" that some operations finish
before others because such was the case before.
Unfortunately, valgrind/memcheck slows program execution so much that such an "assumption" no longer holds.
So the error I see, a reference to 0xc3c3c3...c3 made to pick up an address (possibly of code that was to be executed at termination),
may have been caused by such mis-ordering due to the lack of proper synchronization. Just a theory, but very plausible.
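
A minimal sketch of the kind of interlock this theory calls for, written in Python rather than Mozilla C++. All names here (`Startup`, `create_window`, `use_window_safely`) are hypothetical and do not correspond to anything in the TB source; the artificial delay merely stands in for the slowdown under valgrind/memcheck.

```python
import threading
import time


class Startup:
    """Hypothetical stand-in for startup code that needs a window."""

    def __init__(self, creation_delay):
        self.creation_delay = creation_delay  # seconds; models a slow run
        self.window = None
        self.window_ready = threading.Event()

    def create_window(self):
        # Runs on a worker thread; takes longer in an instrumented run.
        time.sleep(self.creation_delay)
        self.window = "main-window"
        self.window_ready.set()

    def use_window_unsafely(self):
        # Assumes the window already exists; yields None when it does not.
        return self.window

    def use_window_safely(self, timeout=5.0):
        # Explicit interlock: block until the producer signals readiness.
        if not self.window_ready.wait(timeout):
            raise TimeoutError("window was not created in time")
        return self.window


def run(creation_delay):
    """Return (unsynchronized read, synchronized read) for one startup."""
    startup = Startup(creation_delay)
    worker = threading.Thread(target=startup.create_window)
    worker.start()
    unsafe = startup.use_window_unsafely()
    safe = startup.use_window_safely()
    worker.join()
    return unsafe, safe
```

With a noticeable delay such as `run(0.5)`, the unsynchronized read would normally come back empty while the synchronized one gets the window. With a near-zero delay the unsynchronized read may happen to succeed, which is exactly how such an "assumption" can go unnoticed at native speed.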

I am saying this because soon after TB starts under valgrind, I see the following messages.
(I say "soon", but it takes 20-30 seconds after invocation before they appear. Valgrind/memcheck is sloooow.)

==68965== Memcheck, a memory error detector
==68965== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==68965== Using Valgrind-3.18.0.GIT and LibVEX; rerun with -h for copyright info
==68965== Command: /KERNEL-SRC/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -p
==68965==
==68965== Warning: set address range perms: large range [0x7a756a20000, 0x7a7d6620000) (noaccess)
[Parent 68965, Main Thread] WARNING: dependent window created without a parent: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/components/startup/nsAppStartup.cpp:744
==68965== Warning: set address range perms: large range [0x59c8a000, 0x459c8a000) (noaccess)
==68965== Warning: set address range perms: large range [0x59c8a000, 0x459c8a000) (noaccess)
[Parent 68965, Main Thread] WARNING: NS_ENSURE_TRUE(rootFrame) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/base/nsGlobalWindowOuter.cpp:4235
[Parent 68965, Main Thread] WARNING: NS_ENSURE_TRUE(root) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/layout/base/nsDocumentViewer.cpp:2619
Warning: asking to enable_gpu_markers but no supporting extension was found

I suspect that the main code of TB NEEDS to wait for a window or something to appear BEFORE proceeding.
Under normal circumstances, the window is ready before TB needs it, so all is well. Under valgrind, however, creating the required window (or whatever it is) may be too slow. The messages are very disturbing. That TB works under this condition is a bit surprising.

ISHIKAWA, Chiaki

Comment 36

•

3 years ago

Oops, the above messages appear even without valgrind (!). Something is fishy with the TB main code.

 /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin -p
[Parent 69547, Main Thread] WARNING: dependent window created without a parent: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/components/startup/nsAppStartup.cpp:744

        ... some locally added dumps omitted ...

[Parent 69547, Main Thread] WARNING: NS_ENSURE_TRUE(rootFrame) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/dom/base/nsGlobalWindowOuter.cpp:4235
[Parent 69547, Main Thread] WARNING: NS_ENSURE_TRUE(root) failed: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/layout/base/nsDocumentViewer.cpp:2619
Warning: asking to enable_gpu_markers but no supporting extension was found
[Parent 69547, Compositor] WARNING: Possibly dropping task posted to updater thread: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/gfx/layers/apz/src/APZUpdater.cpp:362

ISHIKAWA, Chiaki

Comment 37

•

3 years ago

I am afraid that testing the maildir operation of TB under valgrind triggers more errors than just the original problem, and I am drifting away from it.
I wanted to see whether I could print the memory summary at the end, once TB terminates successfully after copying 1000+ messages. With that memory summary, I could say with some confidence whether there are unfreed data structures, etc.
Well, I got hit with another instance of the assert mentioned in comment 27.

I am not sure when this bug crept in, but unless it is taken care of, the DEBUG version of TB cannot print a meaningful memory summary. (This assert is triggered only in DEBUG builds, I think.)

[Parent 69547, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsTraceRefcnt.cpp:204
[Parent 69547, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsTraceRefcnt.cpp:204
[Parent 69547, Main Thread] WARNING: XPCOM objects created/destroyed from static ctor/dtor: file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/xpcom/base/nsTraceRefcnt.cpp:204
Hit MOZ_CRASH(mozilla::LinkedList<T>::~LinkedList() [with T = nsSHistory] has a buggy user: it should have removed all this list's elements before the list's destruction) at /NEW-SSD/moz-obj-dir/objdir-tb3/dist/include/mozilla/LinkedList.h:440
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5e9]
#02: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#03: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#04: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#05: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin +0x162fa]
#06: ??? (???:???)

Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin (pid = 69547) received signal 11.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x6604035]
#02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x13200]
#03: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x616b5f3]
#04: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3ef67]
#05: ???[/lib/x86_64-linux-gnu/libc.so.6 +0x3f10a]
#06: __libc_start_main[/lib/x86_64-linux-gnu/libc.so.6 +0x277f4]
#07: _start[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin +0x162fa]
#08: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin 69547' to attach your debugger to this thread.
RunWatchdog: Mainthread nested event loops during hang: 
 --- (no nested event loop active)
Hit MOZ_CRASH(Shutdown hanging after all known phases and workers finished.) at /NEW-SSD/NREF-COMM-CENTRAL/mozilla/toolkit/components/terminator/nsTerminator.cpp:256
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x65f654e]
#02: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libnspr4.so +0x2ee4d]
#03: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x8d80]
#04: clone[/lib/x86_64-linux-gnu/libc.so.6 +0xfcb6f]
#05: ??? (???:???)

Program /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin (pid = 69547) received signal 11.
Stack:
#01: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x6604035]
#02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x13200]
#03: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libxul.so +0x65f655f]
#04: ???[/NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/libnspr4.so +0x2ee4d]
#05: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x8d80]
#06: clone[/lib/x86_64-linux-gnu/libc.so.6 +0xfcb6f]
#07: ??? (???:???)
Sleeping for 300 seconds.
Type 'gdb /NEW-SSD/moz-obj-dir/objdir-tb3/dist/bin/thunderbird-bin 69547' to attach your debugger to this thread.
Done sleeping...
ishikawa@ip030:/NREF-COMM-CENTRAL/work-dir$

I better file a bug for this shutdown-time assert and take care of that first. :-(

Wayne Mery (:wsmwk)

Comment 38

•

2 years ago

(In reply to ISHIKAWA, Chiaki from comment #37)

I better file a bug for this shutdown-time assert and take care of that first. :-(

Did you file that bug?

Flags: needinfo?(t.matsuu)

Takanori MATSUURA

Reporter

Comment 39

•

2 years ago

(In reply to Wayne Mery (:wsmwk) from comment #38)

(In reply to ISHIKAWA, Chiaki from comment #37)

I better file a bug for this shutdown-time assert and take care of that first. :-(

Did you file that bug?

I suppose you meant to request needinfo from Chiaki.

Flags: needinfo?(t.matsuu) → needinfo?(ishikawa)

ISHIKAWA, Chiaki

Comment 40

•

2 years ago

(In reply to Takanori MATSUURA from comment #39)

(In reply to Wayne Mery (:wsmwk) from comment #38)

(In reply to ISHIKAWA, Chiaki from comment #37)

I better file a bug for this shutdown-time assert and take care of that first. :-(

Did you file that bug?

I suppose you meant to request needinfo from Chiaki.

Thank you, Matsuura-san, for redirecting this to me.

Before filing a new bug, I searched for similar entries and found
bug 1745864,
so instead of creating a new one, I posted a comment there:
https://bugzilla.mozilla.org/show_bug.cgi?id=1745864#c11

However, there seem to be a few different underlying causes.
https://bugzilla.mozilla.org/show_bug.cgi?id=1745864#c11

There is also an independent bug that was filed about 6 months ago:
bug 1755794

Also, bug 1661862

The underlying causes seem very elusive, and now, with the patch in bug 1661862, the problem may show up elsewhere.
I have not run mochitest under valgrind for a while. Maybe I should, but running the mochitest suite under valgrind takes 20+ hours with heavy memory usage, so I am not tempted to run it often. :-(
That there are some tests that cause timeout errors (even in a regular run) is a big headache.
Even for tests that execute successfully, I need to set the timeout rather long so that the slowdown by valgrind does not cause a timeout for a successful test.
However, the failing tests (those that would time out even during a normal run) then use up the long timeout in full, which is a real pain: a few of them add maybe 3-4 hours of inactivity and make the total execution time very long.
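
The cost of those always-failing tests can be estimated with a small sketch. The function name and the sample numbers are hypothetical, picked only to land near the rough figures mentioned above; they are assumptions, not measurements from my runs.

```python
def wasted_hours(failing_tests, native_timeout_s, timeout_scale):
    """Idle time spent waiting on tests that time out anyway.

    failing_tests: number of tests that hit the timeout even in a normal run
    native_timeout_s: per-test timeout used without valgrind, in seconds
    timeout_scale: factor by which the timeout is lengthened for valgrind
    """
    scaled_timeout_s = native_timeout_s * timeout_scale
    return failing_tests * scaled_timeout_s / 3600.0
```

For example, assuming 4 such tests, a 150-second native timeout, and a timeout lengthened 20 times, `wasted_hours(4, 150, 20)` comes to a little over 3.3 hours of pure waiting.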
Until a lot of people run the test suite under valgrind, this won't get much attention.
And running the test suite under valgrind is itself a challenge. I had to patch a few issues just to get valgrind to run successfully against TB.
Oh well. (I wonder how FF folks manage to run FF tests successfully under valgrind. I read somewhere that valgrind run is executed maybe every month or so?)

Flags: needinfo?(ishikawa)

Worcester12345

Comment 41

•

2 years ago

(In reply to ISHIKAWA, Chiaki from comment #40)
...

Until a lot of people run the test suite under valgrind, this won't get much attention.
And running the test suite under valgrind is itself a challenge. I had to patch a few issues just to get valgrind to run successfully against TB.
Oh well. (I wonder how FF folks manage to run FF tests successfully under valgrind. I read somewhere that valgrind run is executed maybe every month or so?)

So, 8 months means about 8 valgrind runs. Can it now get that attention you mentioned?

ISHIKAWA, Chiaki

Comment 42

•

2 years ago

(In reply to Worcester12345 from comment #41)

(In reply to ISHIKAWA, Chiaki from comment #40)
...

Until a lot of people run the test suite under valgrind, this won't get much attention.
And running the test suite under valgrind is itself a challenge. I had to patch a few issues just to get valgrind to run successfully against TB.
Oh well. (I wonder how FF folks manage to run FF tests successfully under valgrind. I read somewhere that valgrind run is executed maybe every month or so?)

So, 8 months means about 8 valgrind runs. Can it now get that attention you mentioned?

I have no idea. I wonder where the results of the valgrind runs of FF tests are stored on the try server.

As for my local tests, I have realized that maybe I should pick only the one particular test that is under focus as a candidate for a valgrind run.
Right now such a test is being planned (I don't have the time to do it) to check for any suspicious memory-related errors when
TB is compiled using gcc-12. That is not directly related to this bug, unfortunately.

The symptom I reported in bug 1824691 could be related to a gcc-12 miscompilation or something.
But then something rang a bell, and
I recalled that, for a while, my local mochitest run of the debug version of C-C TB under valgrind
produced very hard-to-diagnose memory errors in none other than the hashtable code that appears in the stack trace of bug 1824691.

So it IS POSSIBLE that there is a real memory-related error that is triggered only by the particular way a compiler compiles the code.
I would run the particular test mentioned in bug 1824691. Running a single test should not take long, even under valgrind.
And if there is a suspect, I can look at the source code and/or invoke gdb to home in on suspicious values.
But again, even that will probably eat up half a day. :-(
Memory-related errors are really hard to debug.

These days, stress testing TB often uncovers ANOTHER fatal bug along the way. Debugging seems to be a never-ending story.

ISHIKAWA, Chiaki

Comment 43

•

2 years ago

BTW, I would think someone in the embedded computer industry who needs a high level of security, such as Airbus, might want to create a version of an ARM CPU (or whatever) that implements valgrind-like checks in firmware/hardware to speed up valgrind-like testing significantly.
Instead of a x20 slowdown, we could live with a x4-x6 slowdown with such a CPU.
Just a thought.