Closed Bug 1224057 Opened 9 years ago Closed 9 years ago

[e10s] Multi-process crashes tabs

Categories

(Core :: Security: Process Sandboxing, defect)

44 Branch
x86_64
Linux
defect
Not set
critical

Tracking

()

RESOLVED DUPLICATE of bug 1222500
Tracking Status
e10s - ---

People

(Reporter: marcodv, Assigned: jld)

References

Details

(Keywords: crash)

Attachments

(1 file)

Attached file: about:support raw data (deleted)
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:42.0) Gecko/20100101 Firefox/42.0
Build ID: 20151110142142

Steps to reproduce:

Start Firefox Aurora with a completely fresh profile: no add-ons, nothing else changed. This bug only happens when multi-process is enabled. With multi-process disabled, it works fine.


Actual results:

Every tab shows 'Bad news first: This tab has crashed' http://i.imgur.com/ZY1kMMm.png

output from terminal: http://pastebin.com/v1t9Y5HH


Expected results:

Tabs should have loaded normally.
Summary: Mulyti-process crashes tabs → Multi-process crashes tabs
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Blocks: e10s
Severity: normal → critical
Keywords: crash
Summary: Multi-process crashes tabs → [e10s] Multi-process crashes tabs
marco: do you have any crash reports listed in about:crashes? Please link to them here. Thanks.
Flags: needinfo?(marcodv)
Nope, about:crashes is completely clear. No crash reported. http://i.imgur.com/RH0w4Zf.png
Flags: needinfo?(marcodv)
Hmmm, without any crash data, there isn't much developers can do. I am going to mark this as incomplete. If you see this crash again, please submit the crash report (see https://support.mozilla.org/en-US/kb/mozillacrashreporter?redirectlocale=en-US&redirectslug=Mozilla+Crash+Reporter) and reopen the bug. Thanks.
Status: UNCONFIRMED → RESOLVED
Closed: 9 years ago
Resolution: --- → INCOMPLETE
That's why I tried to include about:support data and terminal output if it's any help.

> If you see this crash again...

Well, it's 100% reproducible on my side.

If there are some advanced logs somewhere that might help, I'll try to provide them...
Make sure the crash reporter is enabled in about:preferences#advanced.
Already enabled (by default). But I guess that only submits crash reports to the devs; if no crash report has actually been captured, it isn't useful.

I'm actually wondering whether it's some other bug. Since Firefox itself didn't crash, it didn't create a crash report, and it ignores crash reports from individual tabs/processes. I'm not sure.

The interesting thing is that it doesn't crash on about:* pages. It crashes when trying to go out to the web.

Crashing that tab logs

    Assertion failure: IsSingleThreaded(), at /builds/slave/m-aurora-l64-ntly-000000000000/build/src/security/sandbox/linux/Sandbox.cpp:527
    [Parent 8790] WARNING: pipe error (57): Connection reset by peer: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459

    ###!!! [Parent][MessageChannel] Error: (msgtype=0x280082,name=PBrowser::Msg_Destroy) Channel error: cannot send/recv

    [Parent 8790] WARNING: pipe error (69): Connection reset by peer: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
    [Parent 8790] WARNING: pipe error (79): Connection reset by peer: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459
    [Parent 8790] WARNING: pipe error (81): Connection reset by peer: file /builds/slave/m-aurora-l64-ntly-000000000000/build/src/ipc/chromium/src/chrome/common/ipc_channel_posix.cc, line 459

into the terminal.
The content crashed. Firefox should be offering to send a report for that.  Is the content crashing on any real website or some page(s) in particular?  

I assume this is a recent regression.  Does it also occur on Nightly builds?
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: INCOMPLETE → ---
Yes, Nightly has the same crashes. It has been quite a while since I last tried Nightly, and it crashed the same way back then. A few days ago I tried Aurora to see whether it was any better or fixed, and it wasn't.

It crashes on every page that is not one of Firefox's about:* pages.

It's possible that the problem is on my system, but I don't know what might be causing it.
My outsider guess is something to do with sandbox/security.
A couple of suggestions:

Check whether running dmesg before and after the crash shows any new messages.

Look for any new non-submitted crash reports. Run:
ls -l ~/.mozilla/firefox/Crash\ Reports/pending/

If you have flash installed check if that works with multiprocess disabled.
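
For anyone repeating the first two checks, a rough sketch of how they could be scripted. The helper names `new_kernel_messages` and `pending_reports` are made up for illustration, and the default directory is simply the path quoted above, so treat this as a starting point rather than a supported procedure.

```shell
#!/bin/sh
# Hypothetical helper: print lines that are in the "after" snapshot ($2)
# but not in the "before" snapshot ($1).
new_kernel_messages() {
    # -F fixed strings, -x whole-line match, -v invert: keep only unseen lines
    grep -Fxv -f "$1" "$2"
}

# Hypothetical helper: list pending crash reports, or say the directory
# is missing. The default path is the one from the suggestion above.
pending_reports() {
    dir=${1:-"$HOME/.mozilla/firefox/Crash Reports/pending"}
    if [ -d "$dir" ]; then
        ls -l "$dir"
    else
        echo "no pending directory: $dir"
    fi
}

# Typical use (not run here, since dmesg may need elevated privileges):
#   dmesg > /tmp/dmesg.before
#   ... reproduce the tab crash ...
#   dmesg > /tmp/dmesg.after
#   new_kernel_messages /tmp/dmesg.before /tmp/dmesg.after
#   pending_reports
```

Comparing snapshots rather than reading the whole log keeps the output limited to messages produced while reproducing the crash.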
I had an idea, compared it to your output, and finally spotted the most obvious thing.

The first step for you to try is to remove mozplugger.

Assuming this hasn't fixed it and nothing turned up from my first comment:
Check that you have extracted Firefox correctly.

When I remove the executable flag from plugin-container (or just (re)move the file), I see terminal output similar to yours and also the tab crash message with no error reporting.
Interesting that you mention mozplugger. I had it installed, but I thought I had removed it a few weeks ago. Somehow, the package manager failed to remove these files:

    ~/.mozilla/plugins 
    └ $ ls -la
    total 120
    drwx------ 2 marek users  4096 Oct 19 22:22 .
    drwx------ 5 marek users  4096 Oct  5 22:07 ..
    -rw-r--r-- 1 marek users 54456 Oct 12 22:17 mozplugger.tmp
    -rw-r--r-- 1 marek users 54456 Oct 19 22:22 mozplugger0.so

And upon checking about:addons, in the Plugins tab, there it was. Unfortunately, deleting those two files didn't solve the crashing.

There is no `pending` folder in `Crash Reports`, so I guess none.

dmesg actually generates an output with crashes:

    [22872.527093] plugin-containe[4579]: segfault at 0 ip 00000000004513d7 sp 00007ffed508c420 error 6 in plugin-container[400000+67000]
    [22879.227073] plugin-containe[4602]: segfault at 0 ip 00000000004513d7 sp 00007ffc70784d70 error 6 in plugin-container[400000+67000]

> Check you have extracted firefox correctly.

I haven't done any manual extracting. I use Arch and this package from AUR: https://aur.archlinux.org/packages/firefox-aurora/

What exactly should I check?
plugin-container already has an X flag:

$ ls -l /opt/firefox-aurora/plugin-container 
-rwxr-xr-x 1 root root 428848 Nov  8 19:57 /opt/firefox-aurora/plugin-container
Just tried the AUR package on an existing install I have. It starts fine, and I can run bad JavaScript, kill it, and crash-report the content. I see the same sandbox settings. The AUR package downloads the latest binaries but modifies them by stripping debug symbols, so someone with more knowledge who can help will possibly require a clean download.

The last suggestion I can think of is to add a new temporary user and try running from that.
Flags: needinfo?(wmccloskey)
I will gladly provide more info, but I need to know what exactly you are looking for. There are no crashes reported.

> Last suggestion I can think of is to add a new temporary user and try running from that.

I've just tried a new user. It didn't help. Fresh user, fresh profile. I also tried running it under i3 in case it was something in Plasma 5, but no luck. It doesn't work in either of them.
Jed, can you maybe take a look here? Somehow our b2g sandboxing code is crashing here (on desktop) because it detects multiple threads at startup. I can't imagine why that might be.

Marco, what do you see when you type "ls -a /proc/$$/task"?
Flags: needinfo?(wmccloskey)
Oops. Jed, see the previous comment.
Flags: needinfo?(jld)
SandboxEarlyInit is for parts of sandboxing that need to happen while single-threaded; that assertion is enabled in all child processes to detect if that regresses, even in process types that aren't using sandboxing yet, with the idea that it's better to find out about those regressions so they can be dealt with sooner.  (Also, it's not just the B2G sandboxing code, as of Firefox 33 when GeckoMediaPlugin sandboxing shipped (bug 1012951).)

This is also before child process crash reporting is started, so that's why there's no crash report.  An OS-produced core dump (run "ulimit -c unlimited" in a shell, then run firefox in that shell and reproduce the bug) should still work, but that needs debug symbols from the same firefox build (and preferably the system libraries in use) to get information from it.

I'm wondering if there's something like an LD_PRELOAD library that might be creating threads in an unexpected place, or some library's initializers creating threads before main().

If all else fails, we should be able to interpose pthread_create and make *sure* this happens before creating other threads, and optionally use dladdr(3) to identify the caller.
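
As a cheaper first check of the LD_PRELOAD theory, the startup environment of the child process can be read from procfs while it is still alive. This is only a sketch: `proc_env_var` is a made-up helper name, it assumes a Linux /proc, and it needs permission to read the target's `environ` file.

```shell
#!/bin/sh
# Hypothetical helper: print NAME=value from the environment a process was
# started with. Usage: proc_env_var PID NAME
proc_env_var() {
    # /proc/PID/environ is NUL-separated; turn it into one entry per line,
    # then keep only the requested variable.
    tr '\0' '\n' < "/proc/$1/environ" | grep "^$2="
}

# Example (the pid is a placeholder for the child process in question):
#   proc_env_var 12345 LD_PRELOAD
```

No output means the variable was not set when the process was exec()ed. It says nothing about libraries pulled in by other means.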
Flags: needinfo?(jld)
Assignee: nobody → jld
Component: General → Security: Process Sandboxing
Product: Firefox → Core
$ ls -a /proc/$$/task
.  ..  14401

$ ulimit -c unlimited
http://pastebin.com/raw.php?i=WyGKcUv9

The build I was using is the one directly taken from https://ftp.mozilla.org/pub/firefox/nightly/latest-mozilla-aurora/firefox-44.0a2.en-US.linux-.tar.bz2

Version: 44.0a2
Build ID: 20151108004059
(In reply to marcodv from comment #18)
> $ ls -a /proc/$$/task
> .  ..  14401

Okay, so the shell doesn't have extra threads, which means that whatever it is doesn't apply to all processes.  It could still be an initializer in something like pulseaudio or a GUI library.
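
For reference, the same check can be wrapped up so it works for any pid, since /proc/PID/task holds one entry per thread. A minimal sketch, assuming Linux procfs; `thread_count` is a name invented here.

```shell
#!/bin/sh
# Hypothetical helper: count the threads of a process by counting the
# entries in /proc/PID/task (ls without -a skips "." and "..").
thread_count() {
    ls "/proc/$1/task" | wc -l
}

# Thread count of the current shell, matching the single entry in the
# listing quoted above.
thread_count "$$"
```

Pointing it at the pid of a freshly started child process instead of `$$` would show whether that process already has extra threads.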

> $ ulimit -c unlimited
> http://pastebin.com/raw.php?i=WyGKcUv9

Ideally, there's now a core file in that directory (named "core", or "core.NNNNN" with the pid); some distributions change what happens to core files (see /proc/sys/kernel/core_pattern and /proc/sys/kernel/core_uses_pid) and I don't know offhand what Arch does.
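
A quick way to see what the system will do with a core file is to print the limit and the two kernel settings together. This is a sketch with an invented function name, `core_settings`, and it assumes the files are readable by the current user.

```shell
#!/bin/sh
# Hypothetical helper: show the settings that decide whether and where a
# core file is written for processes started from this shell.
core_settings() {
    # 0 means no core file is written; "unlimited" removes the size cap.
    echo "core size limit: $(ulimit -c)"
    for f in /proc/sys/kernel/core_pattern /proc/sys/kernel/core_uses_pid; do
        if [ -r "$f" ]; then
            echo "$f: $(cat "$f")"
        else
            echo "$f: not readable"
        fi
    done
}

core_settings
```

If the pattern starts with `|`, the kernel pipes the dump to that program instead of writing a file in the working directory, so the core has to be fetched through that program's own tooling.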

With that file, "gdb $(which firefox) core" should be able to get some information even without debug symbols for Firefox: "info thread" lists threads, "thread N" (for thread number N) followed by "backtrace" or "bt" should be able to at least find out what library is involved.

If you'd rather just send me the core file: there could be privacy issues; this is in a new profile, and the process has just been exec()ed and hasn't done much yet[1], but it would have all of the environment variables, for one thing.  It's probably too large for Bugzilla, but email (jld at mozilla.com) ought to work.


[1] https://dxr.mozilla.org/mozilla-central/source/ipc/contentproc/plugin-container.cpp?case=true&from=content_process_main#150
Ok, it got slightly more complicated, but with the help of the Arch wiki, I think (and hope) I managed to snatch a core dump.

One thing though. I'm not sure if I did some previous commands wrong, but:

$ pgrep -f firefox-aurora 
5021
5210

pgrep returns two processes for firefox-aurora. this means two core dumps (if I understand that correctly). One(core.5021) has 21.7MiB and had only two threads. I did a backtrace for all of them. The second(core.5210) has 514.3MiB and has 47 threads and I did the bt for only the first three.

http://pastebin.com/emtEua7j

I could send you the core dumps, but I'm unsure what you meant by the last paragraph.
(In reply to marcodv from comment #20)
> $ pgrep -f firefox-aurora 
> 5021
> 5210

Normally, one of those is firefox (the parent process) and one is plugin-container (the child process; the executable is still named "plugin-container" even for child processes that aren't plugins).  "ps auxww | grep firefox-aurora" would have more information.

What I don't understand is how that process is still running.  If it's hitting the IsSingleThreaded() assertion failure, it should exit immediately after it's started.

But this is good, because if the process is still running, there's no need to deal with core files; you can just use "gdb -p" and get the stack traces directly from that.

> pgrep returns two processes for firefox-aurora. this means two core dumps
> (if I understand that correctly). One(core.5021) has 21.7MiB and had only
> two threads. I did a backtrace for all of them. The second(core.5210) has
> 514.3MiB and has 47 threads and I did the bt for only the first three.

The second one looks about right for a parent process.  The first one looks like it could be plugin-container if it got stuck while crashing from that assertion somehow, so 5021 is the interesting process.

> http://pastebin.com/emtEua7j

And I made a mistake earlier: I should have said to use "gdb -c [corefile]" instead of trying to specify the executable path, because that's the wrong executable for the process I'm interested in, and as a result the backtraces didn't work correctly.
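
To save some typing, both approaches can be run non-interactively with gdb's batch mode. A sketch only: the function names are invented, and attaching with `-p` may need extra ptrace permissions depending on the system.

```shell
#!/bin/sh
# Hypothetical helper: list threads and dump every backtrace from a core
# file, without naming an executable (see the correction above).
bt_from_core() {
    gdb -batch -ex 'info threads' -ex 'thread apply all bt' -c "$1"
}

# Hypothetical helper: the same, but attached to a process that is still
# running.
bt_from_pid() {
    gdb -batch -ex 'info threads' -ex 'thread apply all bt' -p "$1"
}

# Examples, using the pid that looks like plugin-container:
#   bt_from_core core.5021 > bt-core.txt
#   bt_from_pid 5021 > bt-live.txt
```

Without debug symbols the frames will mostly be addresses, but the library names in each frame should still show where the extra thread comes from.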
ok, interesting progress...

well, first thing:

$ ps auxww | grep firefox
marek     5460 14.3 27.4 2772400 1113660 ?     Rl   Nov19 132:01 /usr/bin/firefox
marek    19632  0.0  0.0  11816  2276 pts/1    S+   07:16   0:00 grep firefox

but more interesting thing is that I was playing around with gdb,

# gdb
(gdb) exec /usr/bin/firefox-aurora
(gdb) run

and I noticed that it stops on SIG38:

Program received signal SIG38, Real-time event 38.

so I tried

(gdb) handle SIG38 nostop
Signal        Stop      Print   Pass to program Description
SIG38         No        Yes     Yes             Real-time event 38
(gdb) run

and it worked! Aurora didn't crash at all. The gdb shell was receiving plenty of SIG38 signals, but I could open pages with e10s: http://pastebin.com/UxmuvDJf. However, I couldn't retrieve a backtrace after quitting Firefox.

There was also this curious thing (I'm not sure whether it's related): normally, I have the Profile Manager show up on startup. It shows up, as it should, when I start stable or Aurora (the crashing one), but when running Aurora via gdb with nostop, it skips the Profile Manager and goes directly to the main window.
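
In case someone wants to repeat the session above in one step, the same gdb commands can be passed on the command line. This is a sketch with an invented wrapper name, `run_ignoring_sig38`. It only reproduces the debugging session and is not a fix.

```shell
#!/bin/sh
# Hypothetical helper: start a program under gdb with the same setting as
# in the session above, so gdb does not stop on real-time signal 38.
# The signal is still passed to the program.
run_ignoring_sig38() {
    gdb -ex 'handle SIG38 nostop' -ex run --args "$@"
}

# Example:
#   run_ignoring_sig38 /usr/bin/firefox-aurora
```

Because gdb stays interactive here, a backtrace can be requested by interrupting with Ctrl-C while the program is still running.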
With help from bug 1222500, I can reproduce this.  I'd noticed the reference to the nVidia GL driver in one of the pastebins, and I have a Debian unstable machine with the same driver version, and I even tried setting __GL_THREADED_OPTIMIZATIONS=1… but I didn't notice the part of the documentation about LD_PRELOAD'ing libpthread and libGL, which is the important part.

Thanks for the information, and apologies for breaking things.
Status: REOPENED → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
No worries. Thanks for looking into this. I just hope it will get resolved soon.