Closed
Bug 1290896
Opened 8 years ago
Closed 8 years ago
Crash in je_free | swrast_dri.so@0x438a90
Categories
(Core :: Security: Process Sandboxing, defect)
Tracking
VERIFIED
FIXED
mozilla51
Flag | Tracking | Status |
---|---|---|
e10s | - | --- |
firefox47 | --- | unaffected |
firefox48 | --- | unaffected |
firefox49 | --- | unaffected |
firefox50 | --- | disabled |
firefox51 | --- | verified |
People
(Reporter: mboldan, Assigned: jld)
References
Details
(Keywords: crash, regression, Whiteboard: sb+)
Crash Data
Attachments
(1 file): patch (deleted), gcp: review+
This bug was filed from the Socorro interface and is
report bp-57b785b8-0756-43de-9da7-40dce2160801.
=============================================================
[Note]:
This crash is reproducible only on Ubuntu, on the Firefox 50.0a1 build, and with e10s enabled.
[Affected versions]:
Firefox 50.0a1 (2016-07-31)
[Affected platforms]:
Ubuntu 16.04 x64
[Steps to reproduce]:
1. Visit https://s3.amazonaws.com/mozilla-games/tmp/2015-08-28-emunittest_0.4-AngryBots-u5.1.3f1_hg-e1.34.6-release-prof/index.html?playback
[Expected result]:
The video is correctly played.
[Actual result]:
The tab is crashing after the Unity page is loaded.
[Regression range]:
Last good revision: c676d55b6b006a2edb37c7c29c64e69f7cb8012a
First bad revision: 23140396a80eb27ff586c41fdc1cad62c875c9b1
Pushlog:
https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c676d55b6b006a2edb37c7c29c64e69f7cb8012a&tochange=23140396a80eb27ff586c41fdc1cad62c875c9b1
Looks like the following bug has the changes which introduced the regression:
https://bugzilla.mozilla.org/show_bug.cgi?id=742434
Comment 1•8 years ago
Note: apparently caused by turning on seccomp-bpf on Linux.
Comment 2•8 years ago
The crash appears to be in mozilla::WebGLContext::DrawElements(), apparently in mSymbols.fDrawRangeElements().
Comment 3•8 years ago
-> WebGL, but sandboxing is the more likely cause.
NI to various people.
Blocks: desktop-seccomp
Component: Web Audio → Canvas: WebGL
Flags: needinfo?(milan)
Flags: needinfo?(julian.r.hector)
Assignee
Comment 4•8 years ago
Crash reason is SIGILL, and the crash address isn't within a mapped file — jitcode, maybe? If someone with permission to look at full minidumps could disassemble the code around %rip (I think that's in the minidump?) that might help understand what's going on.
Assignee
Comment 5•8 years ago
Note that the je_free frame is from stack scanning, so it (and everything else there besides the anonymous code at %rip) could be from old stack frames that have since returned, or could be stack-spilled variables that happen to point into a text section, etc. (This also means the crash signature could be completely meaningless.)
I tried reproducing this in a VM, but it doesn't crash for me. It might depend on CPU model, or maybe GPU.
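To illustrate why stack-scanned frames are unreliable, here is a minimal Python sketch of the idea. It is not Breakpad's actual implementation; `scan_stack` and the example ranges are hypothetical names and values chosen for illustration. It reports every aligned 8-byte word on the stack that falls inside an executable mapping, which is all a scanner can know without unwind information:

```python
import struct


def scan_stack(stack: bytes, exec_ranges: list[tuple[int, int]]) -> list[int]:
    """Return each aligned 8-byte little-endian stack word that lands in an
    executable range (start inclusive, end exclusive).

    A hit only means "this value looks like a code address". It may be a live
    return address, a stale one from a frame that already returned, or an
    unrelated spilled variable.
    """
    usable = len(stack) - len(stack) % 8  # ignore a trailing partial word
    candidates = []
    for (word,) in struct.iter_unpack("<Q", stack[:usable]):
        if any(start <= word < end for start, end in exec_ranges):
            candidates.append(word)
    return candidates
```

Every value returned this way would show up as a "frame" in the report, so a signature built from such frames may have nothing to do with the code that actually crashed.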
Updated•8 years ago
tracking-e10s: --- → ?
Let's deal with this before we start building on top of bug 742434 (I see a few bugs depending on it) in case we don't find a fix and need to back this out?
Component: Canvas: WebGL → Security: Process Sandboxing
Flags: needinfo?(milan)
Comment 7•8 years ago
Mihai - what's the machine configuration (especially GFX card/driver info and OS version)? Can you attach the about:support data?
Flags: needinfo?(mihai.boldan)
Comment 8•8 years ago
I took a look at the minidump file (had to convert it to a core file).
I couldn't disassemble the code around RIP (memory was not accessible from the core file). The violating RIP address (0x7f2aeada80a2) seems to be in an anonymous r-x mapping.
> 7f2aeada3000-7f2aeada4000 r--s 00000000 08:07 278562 [censored]
> 7f2aeada7000-7f2aeadb0000 r-xp 00000000 00:00 0
> 7f2aeadb0000-7f2aeadc4000 r--p 00000000 08:07 658617 [censored]
The backtrace isn't any good either; the return address points somewhere into the stack.
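A minimal Python sketch of the lookup described above, assuming the mapping lines follow the usual /proc/<pid>/maps layout (as the quoted lines do). `parse_maps` and `find_mapping` are illustrative names, not part of any existing tool:

```python
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    start: int
    end: int    # exclusive, as in /proc/<pid>/maps
    perms: str
    path: str   # empty string for anonymous mappings


def parse_maps(text: str) -> list[Mapping]:
    """Parse /proc/<pid>/maps style lines, tolerating leading '>' quote marks."""
    mappings = []
    for line in text.splitlines():
        fields = line.lstrip("> ").split(None, 5)
        if len(fields) < 5 or "-" not in fields[0]:
            continue  # not a mapping line
        start_hex, end_hex = fields[0].split("-", 1)
        path = fields[5].strip() if len(fields) > 5 else ""
        mappings.append(Mapping(int(start_hex, 16), int(end_hex, 16), fields[1], path))
    return mappings


def find_mapping(mappings: list[Mapping], address: int) -> Optional[Mapping]:
    """Return the mapping containing the address, or None if it is unmapped."""
    for mapping in mappings:
        if mapping.start <= address < mapping.end:
            return mapping
    return None
```

Feeding it the three lines quoted above and the address 0x7f2aeada80a2 should select the middle entry: executable, private, and with no backing file.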
Flags: needinfo?(julian.r.hector)
Reporter
Comment 9•8 years ago
(In reply to Randell Jesup [:jesup] from comment #7)
> Mihai - what's the machine configuration (especially GFX card/driver info
> and OS version)? Can you attach the about:support data?
Here is the requested info:
OS: Ubuntu 16.04 x64
About support data: http://pastebin.com/GCAEYjNQ
product: Sky Lake Integrated Graphics [8086:1916]
vendor: Intel Corporation [8086]
bus info: pci@0000:00:02.0
version: 07
width: 64 bits
clock: 33MHz
capabilities:
vga_controller,
bus mastering,
PCI capabilities listing,
extension ROM
configuration:
driver: i915_bpo
latency: 0
resources:
irq: 128
memory: c1000000-c1ffffff
memory: a0000000-afffffff
ioport: 5000(size=64)
Flags: needinfo?(mihai.boldan)
Updated•8 years ago
Whiteboard: sb+
Comment 10•8 years ago
Are you still seeing this in the latest nightlies?
Flags: needinfo?(mihai.boldan)
Assignee
Comment 11•8 years ago
I looked at the minidump with the "minidump_dump" tool, which prints all the metadata structures and hexdumps all the memory regions, and there is a region around the bad instruction. I un-hexdump'ed it and ran the binary through this command (the offset is also from the minidump_dump output) to wrap it in ELF:
objcopy -B i386 -I binary -O elf64-x86-64 --adjust-vma=0x7f2aeada8022 --rename-section=.data=.text,code,contents bug1290896.bin bug1290896.o
And disassembled it, after finding a new enough version of binutils:
7f2aeada8099: 40 18 f6 sbb %sil,%sil
7f2aeada809c: 81 e6 01 00 00 00 and $0x1,%esi
*** 7f2aeada80a2: c5 f8 92 c6 kmovw %esi,%k0
7f2aeada80a6: c5 f9 92 ca kmovb %edx,%k1
7f2aeada80aa: c5 f4 45 c0 korw %k0,%k1,%k0
kmov is an AVX-512 instruction. And it's not JS jitcode (judging by a grep in js/src; also I think we allocate larger memory regions than that?) but I'm guessing it's llvmpipe jitcode (http://www.mesa3d.org/llvmpipe.html).
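For reference, the un-hexdump step can be sketched in Python roughly as follows. This assumes the memory region was copied out of the minidump_dump output as plain hex digits, optionally prefixed with `0x` and broken up by whitespace; the real output layout may differ, and `unhexdump` is just an illustrative name:

```python
def unhexdump(hex_text: str) -> bytes:
    """Turn a textual hex dump into raw bytes.

    Assumed input: hex digit pairs, optionally starting with "0x" and
    optionally separated by spaces or newlines. Anything else (offsets,
    ASCII columns) would have to be stripped before calling this.
    """
    cleaned = hex_text.strip()
    if cleaned[:2].lower() == "0x":
        cleaned = cleaned[2:]
    # bytes.fromhex skips ASCII whitespace between byte pairs.
    return bytes.fromhex(cleaned)
```

Writing the result to a file opened in binary mode (for example `bug1290896.bin`) gives the input that the objcopy command above expects.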
Which leaves me with some questions:
1. Does the CPU actually support AVX-512? Apparently Skylake CPU models that are “Xeon” branded support it and others don't, so this would need the CPU model number.
2a. If it doesn't, then why is something deciding that it does, and how does it make that determination?
2b. If it *does*, then why is it not enabled when that instruction is executed?
3. Is this reproducible on the same / a similar model of CPU?
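One quick way to answer question 1 on the affected machine, sketched in Python: on x86 Linux the kernel lists the CPU features it detected on the `flags` line of /proc/cpuinfo, and `avx512f` is the flag for the AVX-512 Foundation subset. The function names here are illustrative, and this only reports what the kernel sees, not what LLVM's own detection concludes:

```python
def cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def advertises_avx512(cpuinfo_text: str) -> bool:
    """True if the kernel reports the AVX-512 Foundation feature (avx512f)."""
    return "avx512f" in cpu_flags(cpuinfo_text)
```

Calling `advertises_avx512(open("/proc/cpuinfo").read())` on the reporter's laptop would show whether the kernel believes the feature exists; if it returns False while the generated code still uses kmov, the feature probing on the code generator's side is the more likely suspect.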
Possible things to troubleshoot (I can post patches for the ones that need code changes if need be):
* Trying the same build with content sandboxing disabled (MOZ_DISABLE_CONTENT_SANDBOX in the environment, or setting the pref security.sandbox.content.level to 0 and restarting).
* The only syscalls I'm seeing where the policy says to return an error instead of allowing or crashing are readlink/readlinkat. Changing those to Allow() could be informative, especially if the answer to question 1 is “no” and it's a case of broken CPU feature probing.
* Replacing the *entire* sandbox policy with Allow() and seeing if that's different from no seccomp-bpf. (And, if it is, trying with a handwritten BPF program that really accepts everything, instead of having the architecture check prologue that the Chromium seccomp-bpf compiler inserts.) If the answer to question 1 is “yes”, I'm wondering if there might be a bug in syscall handling that gets the FPU state wrong somehow in the slower path used by seccomp-bpf, or something like that.
Assignee
Comment 12•8 years ago
(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #11)
> 1. Does the CPU actually support AVX-512? Apparently Skylake CPU models
> that are “Xeon” branded support it and others don't, so this would need the
> CPU model number.
I think I misread something there — it looks like AVX-512 was maybe planned for Skylake at some point but didn't happen, and actual shipping hardware with AVX-512 hasn't quite happened yet.
Which leaves me confused about how LLVM would misread the CPU features — I've found what looks like the code for it, and it's doing the usual kind of thing with CPUID, which shouldn't be affected by anything to do with system calls.
Reporter
Comment 13•8 years ago
(In reply to Jim Mathies [:jimm] from comment #10)
> are you still seeing this in latest nightlies?
The issue is still reproducible on Firefox 51.0a1 (2016-08-04) and on Ubuntu 16.04 x64.
Note that the internet connection is through Wi-Fi.
Flags: needinfo?(mihai.boldan)
Comment 14•8 years ago
This is a bug in LLVM 3.8 -- https://bugs.archlinux.org/task/49518. LLVM thought Skylake chips did support AVX-512, cf. https://github.com/llvm-mirror/llvm/blob/release_38/lib/Target/X86/X86.td#L487-L523.
Comment 15•8 years ago
So why would sandboxing trigger this bug?
My guess would be that we're blocking something that causes the real graphics driver to fail to load, and the user gets llvmpipe as a fallback, which then ends up crashing.
Unfortunately I don't know an easy way to see which renderer is backing MESA. Here's what we can try:
do a pmap <pid-of-the-main-firefox-process> [1] and pastebin it. Do this on the revision that works and the current ones that are failing. Maybe we can see if the renderer switches.
[1] I would've expected to write content process here, but on my system I only see the gfx driver loaded into chrome.
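The comparison of the two pmap outputs could be sketched in Python like this. It assumes Mesa DRI driver libraries follow the `<name>_dri.so` naming seen in the crash signature (`swrast_dri.so`); `i965_dri.so` in the usage note is an assumed name for the hardware driver, and the function names are illustrative:

```python
import re

# Matches library names such as swrast_dri.so, with or without a leading path.
DRI_LIB = re.compile(r"\b([A-Za-z0-9_]+_dri\.so)\b")


def dri_drivers(pmap_text: str) -> set[str]:
    """Return the set of *_dri.so library names mentioned in pmap output."""
    return set(DRI_LIB.findall(pmap_text))


def renderer_changed(good_pmap: str, bad_pmap: str) -> bool:
    """True if the two pmap outputs show different sets of DRI driver libraries."""
    return dri_drivers(good_pmap) != dri_drivers(bad_pmap)
```

If the working revision maps something like `i965_dri.so` and the failing one maps only `swrast_dri.so`, that would support the fallback theory. An empty set on both sides means the driver is not loaded in the inspected process at all, so the check says nothing either way.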
Flags: needinfo?(mihai.boldan)
Reporter
Comment 16•8 years ago
(In reply to Gian-Carlo Pascutto [:gcp] from comment #15)
> So why would sandboxing trigger this bug?
>
> My guess would be that we're blocking something that causes the real
> graphics driver to fail to load, and the user gets llvmpipe as a fallback,
> which then ends up crashing.
>
> Unfortunately I don't know an easy way to see which renderer is backing
> MESA. Here's what we can try:
> do a pmap <pid-of-the-main-firefox-process> [1] and pastebin it. Do this on
> the revision that works and the current ones that are failing. Maybe we can
> see if the renderer switches.
>
> [1] I would've expected to write content process here, but on my system I
> only see the gfx driver loaded into chrome.
Here are the pmap results:
- Results from latest Nightly - http://pastebin.com/1ULayQ8F
- Results from Firefox 48.0a1 (2016-03-21) - http://pastebin.com/tw5U4fsL
Flags: needinfo?(mihai.boldan)
Comment 17•8 years ago
Unfortunately I don't see any gfx driver libs in there. Might be reproducible with a VM on a Skylake machine.
Comment 18•8 years ago
Per https://bugzilla.mozilla.org/show_bug.cgi?id=1284912#c14 not tracking for 50.
status-firefox51: --- → ?
Assignee
Comment 19•8 years ago
Can you try to reproduce the bug with this Try build? https://archive.mozilla.org/pub/firefox/try-builds/jedavis@mozilla.com-12522edae0a88a52800d2e7c8504d31709214581/try-linux64/firefox-51.0a1.en-US.linux-x86_64.tar.bz2
I changed the sandbox rules that I suspect might be involved: https://hg.mozilla.org/try/rev/12522edae0a88a52800d2e7c8504d31709214581
Flags: needinfo?(mihai.boldan)
Reporter
Comment 20•8 years ago
(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #19)
> Can you try to reproduce the bug with this Try build?
> https://archive.mozilla.org/pub/firefox/try-builds/jedavis@mozilla.com-
> 12522edae0a88a52800d2e7c8504d31709214581/try-linux64/firefox-51.0a1.en-US.
> linux-x86_64.tar.bz2
>
> I changed the sandbox rules that I suspect might be involved:
> https://hg.mozilla.org/try/rev/12522edae0a88a52800d2e7c8504d31709214581
The crash is no longer reproducible on this build.
The tests were performed on the same laptop with Ubuntu 16.04 x64 OS.
Flags: needinfo?(mihai.boldan)
Assignee
Comment 21•8 years ago
Assignee: nobody → jld
Attachment #8779507 - Flags: review?(gpascutto)
Updated•8 years ago
Attachment #8779507 - Flags: review?(gpascutto) → review+
Assignee
Comment 22•8 years ago
Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=12522edae0a88a52800d2e7c8504d31709214581
There's some orange and a bunch of tests that still haven't started, but so far nothing that looks like my fault.
(Self-ni? for uplift request when landed and stable.)
Flags: needinfo?(jld)
Keywords: checkin-needed
Assignee
Comment 23•8 years ago
(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #22)
> (Self-ni? for uplift request when landed and stable.)
Ignore that; seccomp-bpf being on by default is Nightly-only.
Flags: needinfo?(jld)
Comment 24•8 years ago
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/f416db46e66e
Allow readlink() in desktop Linux content processes. r=gcp
Keywords: checkin-needed
Comment 25•8 years ago
bugherder
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
Reporter
Comment 26•8 years ago
The crash is no longer reproducible on Firefox 51.0a1 (2016-08-17), using STR from Comment 0.
The tests were performed under Ubuntu 16.04 x64, Mac OS X 10.11.1 and Windows 10 x64.
I am marking this issue Verified Fixed.
Status: RESOLVED → VERIFIED
Comment 27•8 years ago
I was able to reproduce this issue again on the latest Nightly 52.0a1 (2016-10-19), using the STR from comment 0. It is also reproducible by playing one of the WebGL Samples (e.g. "Aquarium") from here: http://webglsamples.org/
Note: I used the same machine and OS mentioned by Mihai in comment 9.
See Crash Signature: https://crash-stats.mozilla.com/report/index/81282b10-329c-4b50-8a4e-fb9182161020
-- I ran mozregression to find a regression range, and these are the results (although I'm not sure what regressed this):
Last good revision: da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8
First bad revision: d8e1f5cf0a70a53e8a5532809096a0a5bf729196
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8&tochange=d8e1f5cf0a70a53e8a5532809096a0a5bf729196
Given that we're talking about Ubuntu, perhaps the sandboxing work and preference change from bug 1289718?
Flags: needinfo?(gpascutto)
Comment 29•8 years ago
Yes, please file a new bug. The Intel driver is refusing to load and it's falling back to a broken MESA LLVM.
Flags: needinfo?(gpascutto)
Created bug 1312894.