1290896 - Crash in je_free | swrast_dri.so@0x438a90

Reporter

Description

8 years ago

This bug was filed from the Socorro interface and is report bp-57b785b8-0756-43de-9da7-40dce2160801.
=============================================================

[Note]:
This crash is reproducible only on Ubuntu, on the Firefox 50.0a1 build, with e10s enabled.

[Affected versions]:
Firefox 50.0a1 (2016-07-31)

[Affected platforms]:
Ubuntu 16.04 x64

[Steps to reproduce]:
1. Visit https://s3.amazonaws.com/mozilla-games/tmp/2015-08-28-emunittest_0.4-AngryBots-u5.1.3f1_hg-e1.34.6-release-prof/index.html?playback

[Expected result]:
The video is played correctly.

[Actual result]:
The tab crashes after the Unity page is loaded.

[Regression range]:
Last good revision: c676d55b6b006a2edb37c7c29c64e69f7cb8012a
First bad revision: 23140396a80eb27ff586c41fdc1cad62c875c9b1
Pushlog: https://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c676d55b6b006a2edb37c7c29c64e69f7cb8012a&tochange=23140396a80eb27ff586c41fdc1cad62c875c9b1

Looks like the following bug has the changes which introduced the regression:
https://bugzilla.mozilla.org/show_bug.cgi?id=742434

Randell Jesup [:jesup] (needinfo me)

Comment 1

8 years ago

Note: apparently caused by turning on seccomp-bpf on linux

Randell Jesup [:jesup] (needinfo me)

Comment 2

8 years ago

Crash appears to be in mozilla::WebGLContext::DrawElements() apparently in mSymbols.fDrawRangeElements()

Randell Jesup [:jesup] (needinfo me)

Comment 3

8 years ago

-> WebGL, but this is more likely sandboxing. NI to various people.

Blocks: desktop-seccomp

Component: Web Audio → Canvas: WebGL

Flags: needinfo?(milan)

Flags: needinfo?(julian.r.hector)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 4

8 years ago

Crash reason is SIGILL, and the crash address isn't within a mapped file — jitcode, maybe? If someone with permission to look at full minidumps could disassemble the code around %rip (I think that's in the minidump?) that might help understand what's going on.

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 5

8 years ago

Note that the je_free frame is from stack scanning, so it (and everything else there besides the anonymous code at %rip) could be from old stack frames that have since returned, or could be stack-spilled variables that happen to point into a text section, etc. (This also means the crash signature could be completely meaningless.) I tried reproducing this in a VM, but it doesn't crash for me. It might depend on CPU model, or maybe GPU.

Liviu Cirdei

Updated

8 years ago

tracking-e10s: --- → ?

Milan Sreckovic [:milan] (needinfo for best results)

Comment 6

8 years ago

Let's deal with this before we start building on top of bug 742434 (I see a few bugs depending on it) in case we don't find a fix and need to back this out?

Component: Canvas: WebGL → Security: Process Sandboxing

Flags: needinfo?(milan)

Randell Jesup [:jesup] (needinfo me)

Comment 7

8 years ago

Mihai - what's the machine configuration (especially GFX card/driver info and OS version)? Can you attach the about:support data?

Flags: needinfo?(mihai.boldan)

Julian Hector [:tedd] [:jhector]

Comment 8

8 years ago

I took a look at the minidump file (had to convert it to a core file), but I couldn't disassemble the code around RIP (memory was not accessible from the core file).

The violating RIP address (0x7f2aeada80a2) seems to be in an anonymous r-x mapping.

> 7f2aeada3000-7f2aeada4000 r--s 00000000 08:07 278562 [censored]
> 7f2aeada7000-7f2aeadb0000 r-xp 00000000 00:00 0
> 7f2aeadb0000-7f2aeadc4000 r--p 00000000 08:07 658617 [censored]

The backtrace isn't any good either; the return address is pointing somewhere into the stack...
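The lookup described in comment 8 (finding which mapping contains the faulting address) can be sketched in Python. This is only an illustration: `find_mapping` and `SAMPLE_MAPS` are names made up here, and the parser assumes the /proc/<pid>/maps-style layout of the three lines quoted in the comment.

```python
from typing import Optional, Tuple

# The three mapping lines quoted in comment 8 (paths were censored there).
SAMPLE_MAPS = """\
7f2aeada3000-7f2aeada4000 r--s 00000000 08:07 278562 [censored]
7f2aeada7000-7f2aeadb0000 r-xp 00000000 00:00 0
7f2aeadb0000-7f2aeadc4000 r--p 00000000 08:07 658617 [censored]
"""

# (start, end, perms, path); path is "" for an anonymous mapping.
Mapping = Tuple[int, int, str, str]


def find_mapping(maps_text: str, addr: int) -> Optional[Mapping]:
    """Return the mapping whose [start, end) range contains addr, if any."""
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # not a mapping line
        start_s, sep, end_s = fields[0].partition("-")
        if not sep:
            continue
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= addr < end:
            # Everything after the inode column is the (optional) path.
            return (start, end, fields[1], " ".join(fields[5:]))
    return None


if __name__ == "__main__":
    print(find_mapping(SAMPLE_MAPS, 0x7f2aeada80a2))
```

For the RIP value from the crash, this returns the middle entry: an executable private mapping with no backing file, which is consistent with JIT-generated code.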

Flags: needinfo?(julian.r.hector)

Mihai Boldan, Desktop QA [:mboldan]

Reporter

Comment 9

8 years ago

(In reply to Randell Jesup [:jesup] from comment #7)
> Mihai - what's the machine configuration (especially GFX card/driver info
> and OS version)? Can you attach the about:support data?

Here is the requested info:
OS: Ubuntu 16.04 x64
About support data: http://pastebin.com/GCAEYjNQ

product: Sky Lake Integrated Graphics [8086:1916]
vendor: Intel Corporation [8086]
bus info: pci@0000:00:02.0
version: 07
width: 64 bits
clock: 33MHz
capabilities: vga_controller, bus mastering, PCI capabilities listing, extension ROM
configuration: driver: i915_bpo latency: 0
resources: irq: 128 memory: c1000000-c1ffffff memory: a0000000-afffffff ioport: 5000(size=64)

Flags: needinfo?(mihai.boldan)

Jim Mathies [:jimm]

Updated

8 years ago

tracking-e10s: ? → -

Jim Mathies [:jimm]

Updated

8 years ago

Whiteboard: sb+

Jim Mathies [:jimm]

Comment 10

8 years ago

are you still seeing this in latest nightlies?

Flags: needinfo?(mihai.boldan)

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 11

8 years ago

I looked at the minidump with the "minidump_dump" tool, which prints all the metadata structures and hexdumps all the memory regions, and there is a region around the bad instruction. I un-hexdump'ed it and ran the binary through this command (the offset is also from the minidump_dump output) to wrap it in ELF:

    objcopy -B i386 -I binary -O elf64-x86-64 --adjust-vma=0x7f2aeada8022 --rename-section=.data=.text,code,contents bug1290896.bin bug1290896.o

And disassembled it, after finding a new enough version of binutils:

        7f2aeada8099: 40 18 f6             sbb %sil,%sil
        7f2aeada809c: 81 e6 01 00 00 00    and $0x1,%esi
    *** 7f2aeada80a2: c5 f8 92 c6          kmovw %esi,%k0
        7f2aeada80a6: c5 f9 92 ca          kmovb %edx,%k1
        7f2aeada80aa: c5 f4 45 c0          korw %k0,%k1,%k0

kmov is an AVX-512 instruction. And it's not JS jitcode (judging by a grep in js/src; also I think we allocate larger memory regions than that?) but I'm guessing it's llvmpipe jitcode (http://www.mesa3d.org/llvmpipe.html).

Which leaves me with some questions:

1. Does the CPU actually support AVX-512? Apparently Skylake CPU models that are “Xeon” branded support it and others don't, so this would need the CPU model number.
2a. If it doesn't, then why is something deciding that it does, and how does it make that determination?
2b. If it *does*, then why is it not enabled when that instruction is executed?
3. Is this reproducible on the same / a similar model of CPU?

Possible things to troubleshoot (I can post patches for the ones that need code changes if need be):

* Trying the same build with content sandboxing disabled (MOZ_DISABLE_CONTENT_SANDBOX in the environment, or setting the pref security.sandbox.content.level to 0 and restarting).
* The only syscalls I'm seeing where the policy says to return an error instead of allowing or crashing are readlink/readlinkat. Changing those to Allow() could be informative, especially if the answer to question 1 is “no” and it's a case of broken CPU feature probing.
* Replacing the *entire* sandbox policy with Allow() and seeing if that's different from no seccomp-bpf. (And, if it is, trying with a handwritten BPF program that really accepts everything, instead of having the architecture check prologue that the Chromium seccomp-bpf compiler inserts.)

If the answer to question 1 is “yes”, I'm wondering if there might be a bug in syscall handling that gets the FPU state wrong somehow in the slower path used by seccomp-bpf, or something like that.
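The address arithmetic behind the objcopy step can be checked with a few lines of Python. This is a sketch only: `bytes_at` is a helper invented here, and the blob is reconstructed from the five instructions listed in the comment, with zero bytes standing in for the part of the dumped region that the comment does not show.

```python
def bytes_at(blob: bytes, base_vma: int, addr: int, count: int) -> bytes:
    """Return `count` bytes of `blob` at virtual address `addr`,
    given that blob[0] lives at `base_vma`."""
    offset = addr - base_vma
    if offset < 0 or offset + count > len(blob):
        raise ValueError("address range is outside the dumped region")
    return blob[offset:offset + count]


# Values taken from comment 11.
BASE_VMA = 0x7f2aeada8022   # the --adjust-vma value
RIP = 0x7f2aeada80a2        # the faulting instruction
FIRST_LISTED = 0x7f2aeada8099

# Zero padding is a placeholder for the unlisted bytes before the listing.
PADDING = bytes(FIRST_LISTED - BASE_VMA)
LISTING = bytes.fromhex(
    "4018f6"        # sbb   %sil,%sil
    "81e601000000"  # and   $0x1,%esi
    "c5f892c6"      # kmovw %esi,%k0   <- crash here
    "c5f992ca"      # kmovb %edx,%k1
    "c5f445c0"      # korw  %k0,%k1,%k0
)
BLOB = PADDING + LISTING

if __name__ == "__main__":
    print(hex(RIP - BASE_VMA))                  # 0x80
    print(bytes_at(BLOB, BASE_VMA, RIP, 4).hex())  # c5f892c6
```

The faulting instruction sits 0x80 bytes into the extracted region, and its first byte, 0xc5, is the two-byte VEX prefix used by the opmask (k-register) instructions.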

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 12

8 years ago

(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #11)
> 1. Does the CPU actually support AVX-512? Apparently Skylake CPU models
> that are “Xeon” branded support it and others don't, so this would need the
> CPU model number.

I think I misread something there: it looks like AVX-512 was maybe planned for Skylake at some point but didn't happen, and actual shipping hardware with AVX-512 hasn't quite happened yet.

Which leaves me confused about how LLVM would misread the CPU features. I've found what looks like the code for it, and it's doing the usual kind of thing with CPUID, which shouldn't be affected by anything to do with system calls.
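One way to check from userspace what the kernel reports for the CPU is to read the `flags` line of /proc/cpuinfo on x86 Linux, where the AVX-512 foundation feature appears as `avx512f`. The sketch below assumes that layout; `cpu_flags` and `has_cpu_flag` are names made up for this example, and it says nothing about how LLVM itself decides.

```python
from typing import Set


def cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def has_cpu_flag(cpuinfo_text: str, flag: str) -> bool:
    """True if the given flag (e.g. 'avx512f') is listed."""
    return flag in cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        print("no /proc/cpuinfo on this system")
    else:
        print("avx512f reported:", has_cpu_flag(text, "avx512f"))
```

If the flag is absent but generated code still contains kmov instructions, the code generator is getting its answer from somewhere other than the feature bits, for example from a table keyed on the CPU model.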

Mihai Boldan, Desktop QA [:mboldan]

Reporter

Comment 13

8 years ago

(In reply to Jim Mathies [:jimm] from comment #10)
> are you still seeing this in latest nightlies?

The issue is still reproducible on Firefox 51.0a1 (2016-08-04) and on Ubuntu 16.04 x64. Note that the internet connection is through Wi-Fi.

Flags: needinfo?(mihai.boldan)

Samuel Neves

Comment 14

8 years ago

This is a bug in LLVM 3.8 -- https://bugs.archlinux.org/task/49518. LLVM thought Skylake chips did support AVX-512, cf. https://github.com/llvm-mirror/llvm/blob/release_38/lib/Target/X86/X86.td#L487-L523.

Gian-Carlo Pascutto [:gcp]

Comment 15

8 years ago

So why would sandboxing trigger this bug?

My guess would be that we're blocking something that causes the real graphics driver to fail to load, and the user gets llvmpipe as a fallback, which then ends up crashing.

Unfortunately I don't know an easy way to see which renderer is backing MESA. Here's what we can try: do a pmap <pid-of-the-main-firefox-process> [1] and pastebin it. Do this on the revision that works and the current ones that are failing. Maybe we can see if the renderer switches.

[1] I would've expected to write content process here, but on my system I only see the gfx driver loaded into chrome.
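The comparison suggested above could be automated along these lines. This is a rough sketch under assumptions: it treats any name ending in `_dri.so` in the pmap text as a Mesa DRI driver library, takes `swrast_dri.so` (the module in this bug's crash signature) as the software fallback, and uses `i965_dri.so` in the usage example purely as a hypothetical name for a hardware driver. The function names are invented for illustration.

```python
import re
from typing import List

# Any token that looks like "<name>_dri.so" (assumed naming convention).
DRI_LIB = re.compile(r"(\w+_dri\.so)")


def dri_libraries(pmap_text: str) -> List[str]:
    """Return the distinct *_dri.so names seen in pmap output, sorted."""
    return sorted(set(DRI_LIB.findall(pmap_text)))


def uses_software_fallback(pmap_text: str) -> bool:
    """True if swrast_dri.so is the only DRI library that is mapped."""
    return dri_libraries(pmap_text) == ["swrast_dri.so"]


if __name__ == "__main__":
    # Hypothetical pmap excerpts for a good and a bad build.
    good = "00007f00aa000000   4096K r-x-- i965_dri.so\n"
    bad = "00007f00bb000000   4096K r-x-- swrast_dri.so\n"
    print(dri_libraries(good), uses_software_fallback(good))
    print(dri_libraries(bad), uses_software_fallback(bad))
```

An empty result from `dri_libraries` means no driver library was mapped in that process at all, in which case the check needs to be run against a different process.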

Flags: needinfo?(mihai.boldan)

Mihai Boldan, Desktop QA [:mboldan]

Reporter

Comment 16

8 years ago

(In reply to Gian-Carlo Pascutto [:gcp] from comment #15)
> So why would sandboxing trigger this bug?
>
> My guess would be that we're blocking something that causes the real
> graphics driver to fail to load, and the user gets llvmpipe as a fallback,
> which then ends up crashing.
>
> Unfortunately I don't know an easy way to see which renderer is backing
> MESA. Here's what we can try:
> do a pmap <pid-of-the-main-firefox-process> [1] and pastebin it. Do this on
> the revision that works and the current ones that are failing. Maybe we can
> see if the renderer switches.
>
> [1] I would've expected to write content process here, but on my system I
> only see the gfx driver loaded into chrome.

Here are the pmap results:
- Results from latest Nightly - http://pastebin.com/1ULayQ8F
- Results from Firefox 48.0a1 (2016-03-21) - http://pastebin.com/tw5U4fsL

Flags: needinfo?(mihai.boldan)

Gian-Carlo Pascutto [:gcp]

Comment 17

8 years ago

Unfortunately I don't see any gfx driver libs in there. Might be reproducible with a VM on a Skylake machine.

Andrew Overholt [:overholt]

Comment 18

8 years ago

Per https://bugzilla.mozilla.org/show_bug.cgi?id=1284912#c14 not tracking for 50.

status-firefox50: affected → disabled

status-firefox51: --- → ?

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 19

8 years ago

Can you try to reproduce the bug with this Try build? https://archive.mozilla.org/pub/firefox/try-builds/jedavis@mozilla.com-12522edae0a88a52800d2e7c8504d31709214581/try-linux64/firefox-51.0a1.en-US.linux-x86_64.tar.bz2 I changed the sandbox rules that I suspect might be involved: https://hg.mozilla.org/try/rev/12522edae0a88a52800d2e7c8504d31709214581

Flags: needinfo?(mihai.boldan)

Mihai Boldan, Desktop QA [:mboldan]

Reporter

Comment 20

8 years ago

(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #19)
> Can you try to reproduce the bug with this Try build?
> https://archive.mozilla.org/pub/firefox/try-builds/jedavis@mozilla.com-12522edae0a88a52800d2e7c8504d31709214581/try-linux64/firefox-51.0a1.en-US.linux-x86_64.tar.bz2
>
> I changed the sandbox rules that I suspect might be involved:
> https://hg.mozilla.org/try/rev/12522edae0a88a52800d2e7c8504d31709214581

The crash is no longer reproducible on this build. The tests were performed on the same laptop with Ubuntu 16.04 x64 OS.

Flags: needinfo?(mihai.boldan)

Julian Hector [:tedd] [:jhector]

Updated

8 years ago

Blocks: 1280415

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 21

8 years ago

Attached patch bug1290896-readlink-hg0.diff (deleted) — Details — Splinter Review

Assignee: nobody → jld

Attachment #8779507 - Flags: review?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Updated

8 years ago

Attachment #8779507 - Flags: review?(gpascutto) → review+

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 22

8 years ago

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=12522edae0a88a52800d2e7c8504d31709214581 There's some orange and a bunch of tests that still haven't started, but so far nothing that looks like my fault. (Self-ni? for uplift request when landed and stable.)

Flags: needinfo?(jld)

Keywords: checkin-needed

Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧

Assignee

Comment 23

8 years ago

(In reply to Jed Davis [:jld] [⏰PDT; UTC-7] from comment #22)
> (Self-ni? for uplift request when landed and stable.)

Ignore that; seccomp-bpf being on by default is Nightly-only.

Flags: needinfo?(jld)

Pulsebot

Comment 24

8 years ago

Pushed by cbook@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/f416db46e66e Allow readlink() in desktop Linux content processes. r=gps

Keywords: checkin-needed

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 25

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f416db46e66e

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox51: ? → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla51

Mihai Boldan, Desktop QA [:mboldan]

Reporter

Comment 26

8 years ago

The crash is no longer reproducible on Firefox 51.0a1 (2016-08-17), using the STR from comment 0. The tests were performed under Ubuntu 16.04 x64, Mac OS X 10.11.1 and Windows 10 x64. I am marking this issue Verified Fixed.

Status: RESOLVED → VERIFIED

status-firefox51: fixed → verified

Ciprian Georgiu, Desktop QA

Comment 27

8 years ago

I was able to reproduce this issue again on latest Nightly 52.0a1 (2016-10-19), using the STR from comment 0. It is also reproducible by playing one of the WebGL Samples (e.g. "Aquarium") from here: http://webglsamples.org/

Note: I used the same machine and OS mentioned by Mihai in comment 9.

See Crash Signature: https://crash-stats.mozilla.com/report/index/81282b10-329c-4b50-8a4e-fb9182161020

--

I ran mozregression for a regression range and these are the results (although I'm not sure what regressed this):

Last good revision: da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8
First bad revision: d8e1f5cf0a70a53e8a5532809096a0a5bf729196
Pushlog: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=da986c9f1f723af1e0c44f4ccd4cddd5fb6084e8&tochange=d8e1f5cf0a70a53e8a5532809096a0a5bf729196

Milan Sreckovic [:milan] (needinfo for best results)

Comment 28

8 years ago

Given that we're talking about Ubuntu, perhaps the sandboxing work and preference change from bug 1289718?

Flags: needinfo?(gpascutto)

Gian-Carlo Pascutto [:gcp]

Comment 29

8 years ago

Yes, please file a new bug. The Intel driver is refusing to load and it's falling back to a broken MESA LLVM.

Flags: needinfo?(gpascutto)

Milan Sreckovic [:milan] (needinfo for best results)

Updated

8 years ago

Blocks: 1312894

Milan Sreckovic [:milan] (needinfo for best results)

Comment 30

8 years ago

Created bug 1312894.