Closed
Bug 827946
Opened 12 years ago
Closed 12 years ago
Linux crash in skia::BGRAConvolve2D
Categories
(Core :: Graphics, defect)
Tracking
()
VERIFIED
FIXED
mozilla21
People
(Reporter: scoobidiver, Assigned: tnikkel)
References
()
Details
(5 keywords)
Crash Data
Attachments
(4 files)
(deleted),
text/html
|
Details | |
(deleted),
text/html
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
joe
:
review+
lsblakk
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
It has been hit by three users since the latest build. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=605ae260b7c8&tochange=dccab70d8554
It might be related to bug 817356 although it landed one month ago.
Signature skia::BGRAConvolve2D(unsigned char const*, int, bool, skia::ConvolutionFilter1D const&, skia::ConvolutionFilter1D const&, int, unsigned char*, bool) More Reports Search
UUID 924f69e0-fba4-46be-8a7f-dfe922130108
Date Processed 2013-01-08 18:24:21
Uptime 985
Last Crash 16.5 minutes before submission
Install Age 2.9 hours since version was first installed.
Install Time 2013-01-08 15:29:26
Product Firefox
Version 21.0a1
Build ID 20130108033457
Release Channel nightly
OS Linux
OS Version 0.0.0 Linux 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64
Build Architecture amd64
Build Architecture Info family 6 model 42 stepping 7
Crash Reason SIGSEGV
Crash Address 0x7fdbe6900000
App Notes
OpenGL: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile -- 3.0 Mesa 8.0.2 -- texture_from_pixmap
Processor Notes /data/socorro/stackwalk/bin/exploitable: ERROR: unable to analyze dump
EMCheckCompatibility False
Frame Module Signature Source
0 libxul.so skia::BGRAConvolve2D gfx/2d/convolver.cpp:178
1 libxul.so skia::ImageOperations::ResizeBasic gfx/2d/image_operations.cpp:535
2 libxul.so skia::ImageOperations::Resize gfx/2d/image_operations.cpp:371
3 libxul.so skia::ImageOperations::Resize gfx/2d/image_operations.cpp:550
4 libxul.so mozilla::gfx::Scale gfx/2d/Scale.cpp:45
5 libxul.so ScaleRunner::Run image/src/RasterImage.cpp:326
6 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:627
7 libxul.so NS_ProcessNextEvent_P obj-firefox/xpcom/build/nsThreadUtils.cpp:237
8 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:265
9 libnspr4.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:156
10 libpthread-2.15.so libpthread-2.15.so@0x7e99
More reports at:
https://crash-stats.mozilla.com/report/list?signature=skia%3A%3ABGRAConvolve2D%28unsigned+char+const*%2C+int%2C+bool%2C+skia%3A%3AConvolutionFilter1D+const%26%2C+skia%3A%3AConvolutionFilter1D+const%26%2C+int%2C+unsigned+char*%2C+bool%29
Reporter | ||
Comment 1•12 years ago
|
||
Now hit by 8 users in one build.
tracking-firefox21:
--- → ?
Keywords: topcrash
Comment 2•12 years ago
|
||
I hit this clicking on the first link of https://encrypted.google.com/search?hl=en&q=smallworld%20board%20game%20geek
Graphics
Adapter Description: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile
Device ID: Mesa DRI Intel(R) Sandybridge Mobile
Driver Version: 2.1 Mesa 8.0.4
GPU Accelerated Windows: 0/1 Basic
Vendor ID: Tungsten Graphics, Inc
WebGL Renderer: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile
AzureCanvasBackend: cairo
AzureContentBackend: none
AzureFallbackCanvasBackend: none
Comment 4•12 years ago
|
||
Opening http://boardgamegeek.com/boardgame/40692/small-world (from comment #2) crashes most of the times for me. If not when loading it the first time, force-reloading with Ctrl+F5 does.
Reporter | ||
Updated•12 years ago
|
Keywords: reproducible
Reporter | ||
Updated•12 years ago
|
Flags: needinfo?(kbrosnan)
Updated•12 years ago
|
Assignee: nobody → joe
Comment 6•12 years ago
|
||
I have tried and tried and just can't reproduce this. The high quality downscaler is turned on, too. Lots of force reloads seem to have no effect. I'm baffled.
Reporter | ||
Comment 9•12 years ago
|
||
It happens only on 64-bit Linux.
Comment 10•12 years ago
|
||
(In reply to Scoobidiver from comment #9)
> It happens only on 64-bit Linux.
I see that comment #0 is on Intel Sandybridge / Mesa 8.0.2, comment #2 is Intel Sandybridge Mobile / Mesa 8.0.4 - is there a commonality on terms of graphics hardware and/or Mesa version for this bug?
Comment 11•12 years ago
|
||
If anyone in MTV or SF wants to capture this it has been somewhat reproducible.
Updated•12 years ago
|
status-firefox21:
--- → affected
Comment 12•12 years ago
|
||
NVIDIA GeForce 9600M / Mesa 9.0.1 / x64
I just got this on the 20130115040206 build, 100% reproducible on loading: http://imgur.com/a/SWVTE
https://crash-stats.mozilla.com/report/index/bp-7830d459-5ed0-4807-b35d-3193c2130117
Updating to latest stopped the crashing on that page but I still get crashes on loading: http://sohowww.nascom.nasa.gov/data/realtime-images.html
https://crash-stats.mozilla.com/report/index/bp-1aaa631e-643b-4107-b0f3-0a80b2130117
Comment 13•12 years ago
|
||
Aurora (20.0a2) is affected too. Here is a corresponding crash report:
https://crash-stats.mozilla.com/report/index/bp-c7b4de8b-89f8-4331-a07b-e429c2130118
Updated•12 years ago
|
Summary: crash in skia::BGRAConvolve2D → Linux crash in skia::BGRAConvolve2D
Updated•12 years ago
|
Version: 21 Branch → 20 Branch
Comment 14•12 years ago
|
||
I'm setting status for Firefox 20 to affected, see https://crash-stats.mozilla.com/report/index/bp-c7b4de8b-89f8-4331-a07b-e429c2130118
status-firefox20:
--- → affected
tracking-firefox20:
--- → ?
Reporter | ||
Comment 15•12 years ago
|
||
Based on comment 14, the regression range in comment 0 can be restricted to:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=605ae260b7c8&tochange=5bb309998e70
Comment 16•12 years ago
|
||
The timestamps suggest that this happened just before Firefox 20 went from central to aurora.
Comment 17•12 years ago
|
||
Joe, could this be another PGO issue? Could that be related to not being able to reproduce?
Flags: needinfo?(joe)
Assignee | ||
Comment 19•12 years ago
|
||
I managed to reproduce this once while either loading or closing one of
http://imgur.com/a/SWVTE
http://sohowww.nascom.nasa.gov/data/realtime-images.html
but despite much effort have not reproduced it since.
Reporter | ||
Updated•12 years ago
|
Keywords: reproducible
Comment 20•12 years ago
|
||
(In reply to Milan Sreckovic [:milan] from comment #17)
> Joe, could this be another PGO issue?
I thought we didn't do PGO on Linux? Note that this is from what I saw only happening on Linux so far - but there in pretty large numbers (for that platform).
Updated•12 years ago
|
Keywords: reproducible
Comment 21•12 years ago
|
||
Some of these crashes seem to occur even randomly. Like this one using Firefox 20.0a2 as of Jan. 28th, crashing Firefox while not even accessing a particular page:
https://crash-stats.mozilla.com/report/index/bp-f109ad83-e196-499c-8d6d-36c302130129
Comment 22•12 years ago
|
||
(In reply to Joe Drew (:JOEDREW! \o/) from comment #6)
> I have tried and tried and just can't reproduce this. The high quality
> downscaler is turned on, too. Lots of force reloads seem to have no effect.
> I'm baffled.
As Kairo points out in comment 10, are you testing with Intel Sandybridge? Do we have an option of testing with that?
Comment 23•12 years ago
|
||
By the way, I can't reproduce this either, but I do have nvidia card...
Comment 24•12 years ago
|
||
I created a local copy of http://sohowww.nascom.nasa.gov/data/realtime-images.html and removed as much as possible, and still get Firefox to crash. See soho.html for what I got.
Find the corresponding crash report at https://crash-stats.mozilla.com/report/index/bp-0f6236dd-7896-46b1-bbe6-c3b152130129
I hope this helps.
Comment 25•12 years ago
|
||
One more note, be sure to delete all browser data (cache, cookies etc.) and close Firefox completely when trying. My local test page didn't crash every time, but usually it did after cleaning and closing Firefox 21.0a1 completely.
Comment 26•12 years ago
|
||
I managed to remove even more, including the CSS and other meta tags. Now the page only adds http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using the img tag, and still gets Firefox to crash.
However, if I load http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg directly, Firefox doesn't crash (at least I couldn't get it to).
So it seems the difference lays somewhere between adding http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using the img tag, and the image itself.
Assignee | ||
Comment 27•12 years ago
|
||
I can now pretty consistently reproduce this myself with a nightly, but not with a build I've made myself.
Assignee | ||
Comment 28•12 years ago
|
||
try builds also fail to reproduce the crash.
What is different between try builds and nightlies that might cause this?
Comment 29•12 years ago
|
||
(In reply to Timothy Nikkel (:tn) from comment #28)
> What is different between try builds and nightlies that might cause this?
Linux PGO perhaps? I thought we weren't doing that, but I just read some other comments that imply that we actually do. And if MS PGO can cause weird crashes, gcc PGO might just as well.
Assignee | ||
Comment 30•12 years ago
|
||
I was able to reliably reproduce the crash in the pgo build http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-pgo/1359511648/ but was not able to reproduce the crash in the non-pgo build http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1359508908/
They are both build from the same revision. As far as I can tell pgo is the only variable here. So we are either looking at a PGO bug in gcc or a bug in our code that is only tickled when PGO is enabled for some reason.
Comment 31•12 years ago
|
||
(In reply to Timothy Nikkel (:tn) from comment #30)
> but was not able to reproduce the crash in
> the non-pgo build
> http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-
> central-linux64/1359508908/
Confirming this. Couldn't get this build to crash on http://sohowww.nascom.nasa.gov/data/realtime-images.html as opposed to January 29th Nightly.
But I'm confused, isn't this the same source that Nightlies are based on?
Assignee | ||
Comment 32•12 years ago
|
||
Yes, it is exactly the same source, but it is compiled slightly differently. Nightlies have PGO (profile guided optimizations), which take a lot longer to compile but produce slightly faster code. We also make non-PGO builds but they aren't nightlies.
Thanks for testing.
Comment 33•12 years ago
|
||
This is starting to sound like a PGO optimizer bug, then. :(
Assignee | ||
Comment 34•12 years ago
|
||
I figured out how to turn on PGO on try and that try build has the problem. This is good news as now we can do some poking using these try builds to try to figure out the problem.
Assignee | ||
Comment 35•12 years ago
|
||
A local PGO build using gcc 4.7 did not have the crash.
Comment 36•12 years ago
|
||
IIRC, our official build machines use gcc 4.5 right now.
Assignee | ||
Comment 37•12 years ago
|
||
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #36)
> IIRC, our official build machines use gcc 4.5 right now.
Correct, you can tell by looking at the build logs.
Comment 38•12 years ago
|
||
(In reply to Markus Popp from comment #26)
> Created attachment 707773 [details]
> Further stripped down soho.html
>
> I managed to remove even more, including the CSS and other meta tags. Now
> the page only adds
> http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> the img tag, and still gets Firefox to crash.
>
> However, if I load
> http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg
> directly, Firefox doesn't crash (at least I couldn't get it to).
>
> So it seems the difference lays somewhere between adding
> http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> the img tag, and the image itself.
FWIW, I don't crash with latest aurora and latest nightly on this page.
Comment 39•12 years ago
|
||
(In reply to Mike Hommey [:glandium] from comment #38)
> > So it seems the difference lays somewhere between adding
> > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> > the img tag, and the image itself.
>
> FWIW, I don't crash with latest aurora and latest nightly on this page.
Sorry, that was a bit misleading. Accessing latest.jpg directly doesn't crash, but http://sohowww.nascom.nasa.gov/data/realtime-images.html (which includes latest.jpg as img) still does (also in the latest Nightly, just tried it).
Comment 40•12 years ago
|
||
(In reply to Markus Popp from comment #39)
> (In reply to Mike Hommey [:glandium] from comment #38)
> > > So it seems the difference lays somewhere between adding
> > > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using
> > > the img tag, and the image itself.
> >
> > FWIW, I don't crash with latest aurora and latest nightly on this page.
>
> Sorry, that was a bit misleading. Accessing latest.jpg directly doesn't
> crash, but http://sohowww.nascom.nasa.gov/data/realtime-images.html (which
> includes latest.jpg as img) still does (also in the latest Nightly, just
> tried it).
Sorry, my message wasn't clear, I tried the attached page, not the linked page. I also tried ttp://boardgamegeek.com/boardgame/40692/small-world and only managed to crash once when going back and forth between the various pages.
Comment 41•12 years ago
|
||
So, with a crash under a debugger here's what i can see:
(gdb) p $_siginfo._sifields._sigfault
$1 = {si_addr = 0x7fffc6d00000}
Looking at /proc/self/maps:
7fffc6c00000-7fffc6d00000 rw-p 00000000 00:00 0
7fffc6e00000-7fffc7500000 rw-p 00000000 00:00 0
That address is not mapped, but is right at the end of an existing mapping. Could be a region that used to be mapped but isn't anymore, or a buffer overflow.
Comment 42•12 years ago
|
||
On another crash, the crash address is 0x7fc4ed500000, and the last mmap/munmap calls before the crash read:
munmap(0x7fc4ed400000, 1048576) = 0
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed3ff000
munmap(0x7fc4ed3ff000, 4096) = 0
munmap(0x7fc4ed500000, 1044480) = 0
mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed100000
munmap(0x7fc4ed300000, 1048576) = 0
mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed300000
munmap(0x7fc4ed100000, 2097152) = 0
munmap(0x7fc4ed300000, 1048576) = 0
mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc4ecbff000
See the munmap(0x7fc4ed500000, 1044480). It sounds like a race condition.
Comment 43•12 years ago
|
||
Actually, it sounds like a buffer overflow:
This is how things are allocated:
mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed3ff000
munmap(0x7fc4ed3ff000, 4096) = 0
munmap(0x7fc4ed500000, 1044480) = 0
That probably comes from jemalloc chunk allocation, where jemalloc tries to get something 1MB-aligned, so it gets something bigger, and munmaps what it doesn't need.
Assignee | ||
Comment 44•12 years ago
|
||
I'm slowly plodding my way through this x86 assembly stuff.
(gdb) p $_siginfo._sifields._sigfault
$1 = {si_addr = 0x7fffcce00000}
the crashing instruction is
=> 0x00007ffff4175a62 <+1186>: movdqu 0x30(%rcx),%xmm15
and
rcx 0x7fffccdfffc1
movdqu is a 128 bit or 16 byte read, 0x7fffccdfffc1 + 0x30 + 0xF = 0x7fffcce00000 so it segfaulting on the last byte of the read.
Assignee | ||
Comment 45•12 years ago
|
||
So we are in this loop
// Apply the filter to the row to get the destination pixel in |accum|.
int accum[4] = {0};
for (int filter_x = 0; filter_x < filter_length; filter_x++) {
ConvolutionFilter1D::Fixed cur_filter = filter_values[filter_x];
accum[0] += cur_filter * row_to_filter[filter_x * 4 + R_OFFSET_IDX];
accum[1] += cur_filter * row_to_filter[filter_x * 4 + G_OFFSET_IDX];
accum[2] += cur_filter * row_to_filter[filter_x * 4 + B_OFFSET_IDX];
if (has_alpha)
accum[3] += cur_filter * row_to_filter[filter_x * 4 + A_OFFSET_IDX];
}
and row_to_filter = 0x7fffcc1fffc0, filter_length = 16. So the largest address of row_to_filter that the C++ might access is row_to_filter + filter_length*4 - 1.
In the disassembly %rax starts out as row_to_filter. The assembly then loads 16 pixels (4 pixels at a time) starting at %rax, and then increments %rax by 16 pixels (interleaved with various SSE instructions operating on those pixels).
The assembly initializes %rcx to %rax+1. The assembly then loads 16 pixels starting at %rcx (so %rax but offset by 1 byte) and the increments %rcx by 16 pixels (interleaved with various SSE instructions operating on those pixels). I'm not sure why the compiler wants to do these unaligned loads. GCC 4.7 with PGO doesn't do any of these unaligned loads.
So you can see the problem here, the code loads 16 pixels starting from %rax+1 and %rax is an array of 16 pixels, so we go one byte past the end.
It looks like the problem is that this "pull 16 pixels at a time" code path needs to be only used on arrays of length > 16 (because it go a few bytes past 16 pixels) but it is currently used on arrays of length >= 16.
Indeed, before the 16 pixels at a time code path we have
cmp $0xf,%r9d
jbe 0x7ffff4177291
where the jbe looks to jump to alternate code if we have 15 or less.
Assignee | ||
Comment 46•12 years ago
|
||
The disassembly for reference.
Assignee | ||
Comment 47•12 years ago
|
||
This fixes the crash for me. "-O2" does not fix it. I'm open to other ideas for how to make the compiler behave. Also if there was an easy way to make this only apply to PGO that would be better.
I haven't tested the perf difference, but I'm open to suggestions for what would be a good way to test that.
Assignee: joe → tnikkel
Attachment #711675 -
Flags: review?
Assignee | ||
Updated•12 years ago
|
Attachment #711675 -
Flags: review? → review?(joe)
Updated•12 years ago
|
Attachment #711675 -
Flags: review?(joe) → review+
Assignee | ||
Comment 48•12 years ago
|
||
Assignee | ||
Comment 49•12 years ago
|
||
And backed out cause that gcc version macro isn't defined where one would expect
https://hg.mozilla.org/integration/mozilla-inbound/rev/c38ac98bdc95
Assignee | ||
Comment 50•12 years ago
|
||
re-landed with that fixed up
https://hg.mozilla.org/integration/mozilla-inbound/rev/be77205650e3
Comment 51•12 years ago
|
||
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
Assignee | ||
Comment 52•12 years ago
|
||
Comment on attachment 711675 [details] [diff] [review]
Make us use -O1 for this fuction on gcc 4.5
There are some crashes of this on aurora but it seems lower frequency. This is low risk so I think we should take it there.
[Approval Request Comment]
Bug caused by (feature/regressing bug #): it hasn't been tracked down in detail
User impact if declined: crashes on linux
Testing completed (on m-c, etc.): on m-c for a nightly now
Risk to taking this patch (and alternatives if risky): low risk, we just instruct the compiler to use less optimization on gcc 4.5 only (so Linux only for our purposes)
String or UUID changes made by this patch: none
Attachment #711675 -
Flags: approval-mozilla-aurora?
Comment 53•12 years ago
|
||
Comment on attachment 711675 [details] [diff] [review]
Make us use -O1 for this fuction on gcc 4.5
low risk, go ahead with uplift.
Attachment #711675 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 54•12 years ago
|
||
Firefox Nightly on Linux64 has definitely become a lot more stable since yesterday (Feb. 10). Crashed a lot before, no crash since yesterday so far.
Assignee | ||
Comment 55•12 years ago
|
||
Comment 56•12 years ago
|
||
Is this supposed to be fixed on Aurora yet?
While Firefox Nightly looks fine, Firefox Aurora still crashes at http://sohowww.nascom.nasa.gov/data/realtime-images.html as of February 13th build. Here is a corresponding crash report:
https://crash-stats.mozilla.com/report/index/bp-90a3fb6c-c444-4469-ba33-aa2812130213
Reporter | ||
Updated•12 years ago
|
Assignee | ||
Comment 57•12 years ago
|
||
Ah, the reason is that the MOZ_GCC_VERSION_AT_LEAST macro is only defined on mozilla-central and not aurora because it was added recently by bug 833254. I'll see about getting that uplifted.
Assignee | ||
Comment 58•12 years ago
|
||
There's been enough churn that the patch for bug 833254 needs some work to apply. I'm just going to make a patch doing it the old way looking at the gcc defines directly without the shiny new macro.
Assignee | ||
Comment 59•12 years ago
|
||
Pushed using the old way to check the gcc version
https://hg.mozilla.org/releases/mozilla-aurora/rev/0e3b5d954cf1
Comment 60•12 years ago
|
||
Reproduced the crash on Aurora 20.0a2 2013-02-12.
Verified fixed FF 20b1 Ubuntu 12.04 x64.
Comment 61•12 years ago
|
||
Verified fixed FF 21b2 Ubuntu 12.04 x64.
You need to log in
before you can comment on or make changes to this bug.
Description
•