Closed Bug 1205016 Opened 9 years ago Closed 9 years ago

SIGSEGV in je_bitmap_sfu on startup

Categories

Product/Component: Core :: Memory Allocator
Type: defect
Platform: x86_64 Linux
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla43

Tracking Status
firefox41 --- wontfix
firefox42 --- fixed
firefox43 --- fixed

People

(Reporter: fitzgen, Assigned: lsalzman)

Details

(Keywords: crash, regression)

Attachments

(4 files, 1 obsolete file)

Pulled down the latest m-c today and now I get segfaults in jemalloc on startup on linux64.

My (git) revision is:

> commit 6944c5ba30bcc3f54ec4549fe8c20bdfc4b25b70
> Merge: 0abaaa3 b3c78a8
> Author: Carsten "Tomcat" Book <cbook@mozilla.com>
> Date:   Tue Sep 15 15:05:24 2015 +0200
> 
>     merge mozilla-inbound to mozilla-central a=merge

Here is a gdb session and backtrace:

> bash-4.3$ ./mach run -P dev --debugger gdb
>  0:00.06 /usr/bin/gdb -q --args /home/fitzgen/src/mozilla-central/obj-debug/dist/bin/firefox -P dev -no-remote
> Reading symbols from /home/fitzgen/src/mozilla-central/obj-debug/dist/bin/firefox...done.
> warning: File "/home/fitzgen/src/mozilla-central/.gdbinit" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
> To enable execution of this file add
> 	add-auto-load-safe-path /home/fitzgen/src/mozilla-central/.gdbinit
> line to your configuration file "/home/fitzgen/.gdbinit".
> To completely disable this security protection add
> 	set auto-load safe-path /
> line to your configuration file "/home/fitzgen/.gdbinit".
> For more information about this security protection see the
> "Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
> 	info "(gdb)Auto-loading safe path"
> (gdb) run
> Starting program: /home/fitzgen/src/mozilla-central/obj-debug/dist/bin/firefox -P dev -no-remote
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> warning: File "/home/fitzgen/src/mozilla-central/obj-debug/toolkit/library/libxul.so-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
> Detaching after fork from child process 7404.
> Detaching after fork from child process 7406.
> warning: Corrupted shared library list: 0x7ffff691a200 != 0x7ffff688fd00
> [New Thread 0x7ffff7feb700 (LWP 7411)]
> [New Thread 0x7fffd5a1e700 (LWP 7410)]
> [New Thread 0x7fffd626f700 (LWP 7409)]
> [New Thread 0x7fffd6ac0700 (LWP 7408)]
> [New Thread 0x7fffda9ca700 (LWP 7407)]
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 0x7ffff7feb700 (LWP 7411)]
> 0x000000000042fc87 in je_bitmap_sfu (bitmap=0x0, binfo=0x0) at /home/fitzgen/src/mozilla-central/memory/jemalloc/src/include/jemalloc/internal/bitmap.h:170
> 170	{
> Missing separate debuginfos, use: dnf debuginfo-install alsa-lib-1.0.29-1.fc22.x86_64 atk-2.16.0-1.fc22.x86_64 at-spi2-atk-2.16.0-1.fc22.x86_64 at-spi2-core-2.16.0-1.fc22.x86_64 bzip2-libs-1.0.6-14.fc22.x86_64 cairo-1.14.2-1.fc22.x86_64 cairo-gobject-1.14.2-1.fc22.x86_64 dbus-glib-0.104-1.fc22.x86_64 dbus-libs-1.8.18-1.fc22.x86_64 dconf-0.24.0-1.fc22.x86_64 elfutils-libelf-0.161-6.fc22.x86_64 elfutils-libs-0.161-6.fc22.x86_64 expat-2.1.0-10.fc22.x86_64 fontconfig-2.11.93-2.fc22.x86_64 freetype-2.5.5-1.fc22.x86_64 GConf2-3.2.6-11.fc22.x86_64 gdk-pixbuf2-2.31.4-1.fc22.x86_64 glib2-2.44.1-1.fc22.x86_64 glib-networking-2.44.0-1.fc22.x86_64 gmp-6.0.0-9.fc22.x86_64 gnutls-3.3.15-1.fc22.x86_64 graphite2-1.2.4-3.fc22.x86_64 gtk3-3.16.4-2.fc22.x86_64 gvfs-1.24.1-1.fc22.x86_64 harfbuzz-0.9.40-1.fc22.x86_64 keyutils-libs-1.5.9-4.fc22.x86_64 krb5-libs-1.13.2-5.fc22.x86_64 libattr-2.4.47-9.fc22.x86_64 libbluray-0.7.0-1.fc22.x86_64 libcanberra-0.30-7.fc22.x86_64 libcanberra-gtk3-0.30-7.fc22.x86_64 libcap-2.24-7.fc22.x86_64 libcom_err-1.42.12-4.fc22.x86_64 libdrm-2.4.61-3.fc22.x86_64 libepoxy-1.2-1.fc22.x86_64 libffi-3.1-7.fc22.x86_64 libICE-1.0.9-2.fc22.x86_64 libmodman-2.0.1-9.fc22.x86_64 libogg-1.3.2-2.fc22.x86_64 libpng-1.6.16-3.fc22.x86_64 libproxy-0.4.11-10.fc22.x86_64 libselinux-2.3-10.fc22.x86_64 libSM-1.2.2-2.fc22.x86_64 libtasn1-4.5-1.fc22.x86_64 libtdb-1.3.4-1.fc22.x86_64 libtool-ltdl-2.4.2-34.fc22.x86_64 libuuid-2.26.2-1.fc22.x86_64 libvorbis-1.3.4-3.fc22.x86_64 libwayland-client-1.7.0-1.fc22.x86_64 libwayland-cursor-1.7.0-1.fc22.x86_64 libwayland-server-1.7.0-1.fc22.x86_64 libX11-1.6.3-1.fc22.x86_64 libXau-1.0.8-4.fc22.x86_64 libxcb-1.11-3.fc22.x86_64 libXcomposite-0.4.4-6.fc22.x86_64 libXcursor-1.1.14-4.fc22.x86_64 libXdamage-1.1.4-6.fc22.x86_64 libXext-1.3.3-2.fc22.x86_64 libXfixes-5.0.1-4.fc22.x86_64 libXi-1.7.4-2.fc22.x86_64 libXinerama-1.1.3-4.fc22.x86_64 libxkbcommon-0.5.0-1.fc22.x86_64 libxml2-2.9.2-3.fc22.x86_64 libXrandr-1.4.2-2.fc22.x86_64 
libXrender-0.9.9-1.fc22.x86_64 libxshmfence-1.2-1.fc22.x86_64 libXt-1.1.4-10.fc22.x86_64 libXxf86vm-1.1.4-1.fc22.x86_64 mesa-libEGL-10.5.4-1.20150505.fc22.x86_64 mesa-libgbm-10.5.4-1.20150505.fc22.x86_64 mesa-libGL-10.5.4-1.20150505.fc22.x86_64 mesa-libglapi-10.5.4-1.20150505.fc22.x86_64 mesa-libwayland-egl-10.5.4-1.20150505.fc22.x86_64 nettle-2.7.1-5.fc22.x86_64 openssl-libs-1.0.1k-11.fc22.x86_64 p11-kit-0.23.1-1.fc22.x86_64 PackageKit-gtk3-module-1.0.6-4.fc22.x86_64 pango-1.36.8-5.fc22.x86_64 pcre-8.37-1.fc22.x86_64 pixman-0.32.6-4.fc22.x86_64 systemd-libs-219-13.fc22.x86_64 trousers-0.3.13-3.fc22.x86_64 xz-libs-5.2.0-2.fc22.x86_64 zlib-1.2.8-7.fc22.x86_64
> (gdb) bt
> #0  0x000000000042fc87 in je_bitmap_sfu (bitmap=0x0, binfo=0x0) at /home/fitzgen/src/mozilla-central/memory/jemalloc/src/include/jemalloc/internal/bitmap.h:170
> #1  0x0000000000439888 in arena_run_reg_alloc (run=0x7fffd8406290, bin_info=0x66eda0 <je_arena_bin_info+1920>) at /home/fitzgen/src/mozilla-central/memory/jemalloc/src/src/arena.c:302
> #2  0x0000000000451a08 in je_arena_malloc_small (arena=0x7ffff6a00180, size=1024, zero=true) at /home/fitzgen/src/mozilla-central/memory/jemalloc/src/src/arena.c:2151
> #3  0x00000000004f2f22 in je_calloc (tcache=0x0, zero=true, size=1024, arena=0x7ffff6a00180, tsd=0x7ffff7feb6b0)
>     at /home/fitzgen/src/mozilla-central/memory/jemalloc/src/include/jemalloc/internal/arena.h:1145
> #4  0x00000000004f2f22 in je_calloc (arena=0x0, is_metadata=false, tcache=0x0, zero=true, size=1024, tsd=0x7ffff7feb6b0) at src/include/jemalloc/internal/jemalloc_internal.h:887
> #5  0x00000000004f2f22 in je_calloc (size=1024, tsd=0x7ffff7feb6b0) at src/include/jemalloc/internal/jemalloc_internal.h:920
> #6  0x00000000004f2f22 in je_calloc (num=1, size=1024) at /home/fitzgen/src/mozilla-central/memory/jemalloc/src/src/jemalloc.c:1663
> #7  0x0000000000421ab1 in calloc (num=1, size=1024) at /home/fitzgen/src/mozilla-central/memory/build/replace_malloc.c:181
> #8  0x00007ffff65c9eb3 in PR_Calloc (nelem=1, elsize=1024) at /home/fitzgen/src/mozilla-central/nsprpub/pr/src/malloc/prmem.c:443
> #9  0x00007ffff65c8045 in PR_SetThreadPrivate (index=2, priv=0x7fffd843c5b8) at /home/fitzgen/src/mozilla-central/nsprpub/pr/src/threads/prtpd.c:161
> #10 0x00007fffe2dad66f in mozilla::BlockingResourceBase::ResourceChainAppend(mozilla::BlockingResourceBase*) (this=0x7fffd843c5b8, aPrev=0x0) at ../../dist/include/mozilla/BlockingResourceBase.h:181
> #11 0x00007fffe2da87a5 in mozilla::BlockingResourceBase::Acquire() (this=0x7fffd843c5b8) at /home/fitzgen/src/mozilla-central/xpcom/glue/BlockingResourceBase.cpp:322
> #12 0x00007fffe2da8980 in mozilla::OffTheBooksMutex::Lock() (this=0x7fffd843c5b8) at /home/fitzgen/src/mozilla-central/xpcom/glue/BlockingResourceBase.cpp:383
> #13 0x00007fffe2c48424 in mozilla::Monitor::Lock() (this=0x7fffd843c5b8) at ../../dist/include/mozilla/Monitor.h:35
> #14 0x00007fffe2c4848a in mozilla::MonitorAutoLock::MonitorAutoLock(mozilla::Monitor&) (this=0x7ffff7feae40, aMonitor=...) at ../../dist/include/mozilla/Monitor.h:78
> #15 0x00007fffe2e1f8b2 in mozilla::net::ClosingService::ThreadFunc() (this=0x7fffd843c5a0) at /home/fitzgen/src/mozilla-central/netwerk/base/ClosingService.cpp:206
> #16 0x00007fffe2e35bed in mozilla::net::ClosingService::ThreadFunc(void*) (aClosure=0x7fffd843c5a0) at /home/fitzgen/src/mozilla-central/netwerk/base/ClosingService.h:52
> #17 0x00007ffff65e56f5 in _pt_root (arg=0x7fffd8449f00) at /home/fitzgen/src/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:212
> #18 0x00007ffff7bc7555 in start_thread (arg=0x7ffff7feb700) at pthread_create.c:333
> #19 0x00007ffff6e5cf3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> (gdb)
Crash Signature: 0x000000000042fc87 in je_bitmap_sfu (bitmap=0x0, binfo=0x0)
Keywords: crash
Summary: SIGSEGV in je_bitmap_sfu → SIGSEGV in je_bitmap_sfu on startup
For anyone else hitting this issue, this revision works for me:

> commit 42ed9cb8a74b3f8b4636ad812cb60c6ee9cde4b7
> Author: Kan-Ru Chen <kanru@kanru.info>
> Date:   Thu Sep 3 13:36:02 2015 +0800
> 
>     Bug 1200498 - Clean up dom/browser-element mochitest.ini that has skip-if toolkit != gtk2 now that gtk3 is the default

Going to start bisecting.
Shu says he also hit this, but building with clang sidestepped the problem. GCC only, apparently.
Bisect done!

760a84e7cf7fa49c889a5a17a5935d3ca1e02384 is the first bad commit
commit 760a84e7cf7fa49c889a5a17a5935d3ca1e02384
Author: Dragana Damjanovic <dd.mozilla@gmail.com>
Date:   Thu Sep 10 19:07:00 2015 +0200

    Bug 1152046 - Make separate thread only for PRClose. r=mcmanus r=mayhemer
    
    --HG--
    extra : rebase_source : a4f4845023d6cebdd56d75b1ff7afd29447d2167

:040000 040000 f594d863ab973b3488ba662aa6f879466408b223 79061f13112b6b20fb4d7c9805dc6b799a738aed M	netwerk
:040000 040000 490e7544e5192d2256de88a2e4e86330ca3c73f5 b1f0ff4ef68f679cac4f7e1f84c069e064e8a54a M	toolkit
Flags: needinfo?(mcmanus)
Flags: needinfo?(honzab.moz)
Flags: needinfo?(dd.mozilla)
I ran into something very similar just now while trying to run the xpcshell test netwerk/test/unit/test_post.js locally on Linux x64, with almost the same stack.

I'll back that patch out.
Flags: needinfo?(mcmanus)
Closing this based on the backout: https://bugzilla.mozilla.org/show_bug.cgi?id=1152046#c65
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Flags: needinfo?(honzab.moz)
Flags: needinfo?(dd.mozilla)
Bah, this bug wasn't mentioned on the commit. I had to reland the patch because the backout caused b2g debug hangs on shutdown.

https://hg.mozilla.org/integration/mozilla-inbound/rev/5320f1017b81
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Seeing the same bustage on all my linux builds, and can confirm that removing the patch Nick identified makes things work again.
Keywords: regression
Note: I backed out bug 1152046 (again) locally, since I was insta-crashing with it in.  That lets me run, but testing https://mozilla.github.com/webrtc-landing/pc_test.html crashes in the same place (jemalloc/internal/bitmap.h) when in e10s - but not in a non-e10s profile.  So it's very touchy and machine/perf/allocation/? dependent.
Note: both crash in roughly the same place with the same stack, but with the service code it crashes at startup (instacrash); with the patch backed out it doesn't crash until I use UDPSocket from webrtc (which we only do in e10s).
An ASAN build locally doesn't seem to hit this bug.  However, it appears heavily dependent on timing/order-of-allocation/layout/etc.
Only 16KB (4*4096) of stack was allocated to the ClosingService thread, causing us to crash inside jemalloc when the stack overflowed.

This is one of the few callsites in the entire source code that even passes in an explicit stack size other than 0. Passing 0 selects the default stack size, which should be at least 64KB.

Just pass in 0 to PR_CreateThread here to get the bigger stack we need and avoid overflows.
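For anyone who wants to see the "0 means default, anything else is an explicit size" convention in isolation, here is a minimal sketch using plain POSIX thread attributes rather than NSPR, so that it compiles on its own. The helper names `ConfigureStackSize` and `EffectiveStackSize` are hypothetical and not from the tree, and the idea that NSPR's pthreads backend ends up making a similar `pthread_attr_setstacksize` call is an assumption, not something verified in this bug.

```cpp
#include <pthread.h>
#include <cstddef>

// Hypothetical helper (not from the Firefox tree): apply a requested stack
// size to a thread attribute object. A request of 0 leaves the platform
// default untouched, mirroring the "pass 0 for the default" convention
// described above. Returns the resulting stack size, or 0 on error.
std::size_t ConfigureStackSize(pthread_attr_t* attr, std::size_t requested)
{
  if (requested != 0 &&
      pthread_attr_setstacksize(attr, requested) != 0) {
    return 0;  // rejected, e.g. smaller than the platform minimum
  }
  std::size_t actual = 0;
  if (pthread_attr_getstacksize(attr, &actual) != 0) {
    return 0;
  }
  return actual;
}

// Hypothetical wrapper: report the stack size a new thread would get for a
// given request, without actually starting a thread.
std::size_t EffectiveStackSize(std::size_t requested)
{
  pthread_attr_t attr;
  if (pthread_attr_init(&attr) != 0) {
    return 0;
  }
  std::size_t actual = ConfigureStackSize(&attr, requested);
  pthread_attr_destroy(&attr);
  return actual;
}
```

Calling `EffectiveStackSize(0)` reports whatever the platform default is, while a nonzero request such as `256 * 1024` should come back unchanged if the platform accepts it. The minimum a platform accepts varies, so very small requests may be rejected outright.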
Attachment #8662120 - Flags: review?(mh+mozilla)
Assignee: nobody → lsalzman
Okay, slight modification: instead of using the default stack size, which can be much larger than needed, just double the current stack size of the ClosingService thread, which is still sufficient to avoid the overflow.
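As a quick sanity check on the numbers, the arithmetic can be spelled out as compile-time constants. The constant names are hypothetical and used only for illustration; only the values (4*4096 before, doubled afterwards) come from this bug.

```cpp
#include <cstddef>

// Hypothetical names; only the numbers are taken from this bug.
constexpr std::size_t kOldStackSize = 4 * 4096;           // the size that overflowed
constexpr std::size_t kNewStackSize = 2 * kOldStackSize;  // doubled

static_assert(kOldStackSize == 16 * 1024, "old stack was 16KB");
static_assert(kNewStackSize == 32 * 1024, "doubling gives 32KB");
```

So the doubled value is 32KB, still below the 64KB default mentioned earlier.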
Attachment #8662120 - Attachment is obsolete: true
Attachment #8662120 - Flags: review?(mh+mozilla)
Attachment #8662123 - Flags: review?(mh+mozilla)
Attachment #8662123 - Flags: feedback+
Comment on attachment 8662123 [details] [diff] [review]
double ClosingService thread stack size to avoid stack overflow

Review of attachment 8662123 [details] [diff] [review]:
-----------------------------------------------------------------

::: netwerk/base/ClosingService.cpp
@@ +107,5 @@
>  ClosingService::StartInternal()
>  {
>    mThread = PR_CreateThread(PR_USER_THREAD, ThreadFunc, this,
>                              PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
> +                            PR_JOINABLE_THREAD, 32 * 1024);

Great detective work. There is another one in netwerk/base/nsUDPSocket.cpp (which comes from bug 1152046).

I also went through all the things based on nsThread, which uses PR_CreateThread with a parametrized stack size, and none are setting a stack size smaller than 256K.
Attachment #8662123 - Flags: review?(mh+mozilla) → feedback+
Comment on attachment 8662123 [details] [diff] [review]
double ClosingService thread stack size to avoid stack overflow

Review of attachment 8662123 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Mike Hommey [:glandium] from comment #15)
> Great detective work. There is another one in netwerk/base/nsUDPSocket.cpp
> (which comes from bug 1152046).

Err, it was moved from there to here in bug 1152046, so there's only one.
Attachment #8662123 - Flags: feedback+ → review+
(In reply to Mike Hommey [:glandium] from comment #16)
>
> Err, it was moved from there to here in bug 1152046, so there's only one.

So it sounds like this change should be backported to older channels, using the previous location of the PR_CreateThread() call. Dragana can probably do that.
Flags: needinfo?(dd.mozilla)
Attached patch bug_1205016_backport.patch (deleted)
I am not sure if you need to review it.
Flags: needinfo?(dd.mozilla)
Attachment #8662433 - Flags: review?(mcmanus)
Attachment #8662433 - Flags: review?(mcmanus) → review+
Comment on attachment 8662433 [details] [diff] [review]
bug_1205016_backport.patch

Approval Request Comment
[Feature/regressing bug #]: Bug 1124880 
[User impact if declined]: Crash. A thread's stack size was too small, which caused a stack overflow.
[Describe test coverage new/current, TreeHerder]: A couple of people could reproduce this reliably, and after this change it worked for them.
[Risks and why]: Not high. Only changes the size of the stack to 32KB (most other threads have this set to 0, which uses the default 64KB).
[String/UUID change made/needed]: none
Attachment #8662433 - Flags: approval-mozilla-beta?
Attachment #8662433 - Flags: approval-mozilla-aurora?
https://hg.mozilla.org/mozilla-central/rev/7d9e6debd7e7
Status: REOPENED → RESOLVED
Closed: 9 years ago → 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
Comment on attachment 8662433 [details] [diff] [review]
bug_1205016_backport.patch

Fix a crash, taking it.
Attachment #8662433 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8662433 [details] [diff] [review]
bug_1205016_backport.patch

Too late for 41.
Attachment #8662433 - Flags: approval-mozilla-beta? → approval-mozilla-beta-