Closed Bug 827946 Opened 12 years ago Closed 12 years ago

Linux crash in skia::BGRAConvolve2D

Categories

(Core :: Graphics, defect)

20 Branch
x86_64
Linux
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla21
Tracking Status
firefox20 + verified
firefox21 + verified

People

(Reporter: scoobidiver, Assigned: tnikkel)

References

()

Details

(5 keywords)

Crash Data

Attachments

(4 files)

It has been hit by three users since the latest build. The regression range is: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=605ae260b7c8&tochange=dccab70d8554 It might be related to bug 817356 although it landed one month ago. Signature skia::BGRAConvolve2D(unsigned char const*, int, bool, skia::ConvolutionFilter1D const&, skia::ConvolutionFilter1D const&, int, unsigned char*, bool) More Reports Search UUID 924f69e0-fba4-46be-8a7f-dfe922130108 Date Processed 2013-01-08 18:24:21 Uptime 985 Last Crash 16.5 minutes before submission Install Age 2.9 hours since version was first installed. Install Time 2013-01-08 15:29:26 Product Firefox Version 21.0a1 Build ID 20130108033457 Release Channel nightly OS Linux OS Version 0.0.0 Linux 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 Build Architecture amd64 Build Architecture Info family 6 model 42 stepping 7 Crash Reason SIGSEGV Crash Address 0x7fdbe6900000 App Notes OpenGL: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile -- 3.0 Mesa 8.0.2 -- texture_from_pixmap Processor Notes /data/socorro/stackwalk/bin/exploitable: ERROR: unable to analyze dump EMCheckCompatibility False Frame Module Signature Source 0 libxul.so skia::BGRAConvolve2D gfx/2d/convolver.cpp:178 1 libxul.so skia::ImageOperations::ResizeBasic gfx/2d/image_operations.cpp:535 2 libxul.so skia::ImageOperations::Resize gfx/2d/image_operations.cpp:371 3 libxul.so skia::ImageOperations::Resize gfx/2d/image_operations.cpp:550 4 libxul.so mozilla::gfx::Scale gfx/2d/Scale.cpp:45 5 libxul.so ScaleRunner::Run image/src/RasterImage.cpp:326 6 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:627 7 libxul.so NS_ProcessNextEvent_P obj-firefox/xpcom/build/nsThreadUtils.cpp:237 8 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:265 9 libnspr4.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:156 10 libpthread-2.15.so libpthread-2.15.so@0x7e99 More reports at: https://crash-stats.mozilla.com/report/list?signature=skia%3A%3ABGRAConvolve2D%28unsigned+char+const*%2C+int%2C+bool%2C+skia%3A%3AConvolutionFilter1D+const%26%2C+skia%3A%3AConvolutionFilter1D+const%26%2C+int%2C+unsigned+char*%2C+bool%29
Now hit by 8 users in one build.
Keywords: topcrash
I hit this clicking on the first link of https://encrypted.google.com/search?hl=en&q=smallworld%20board%20game%20geek Graphics Adapter Description: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile Device ID: Mesa DRI Intel(R) Sandybridge Mobile Driver Version: 2.1 Mesa 8.0.4 GPU Accelerated Windows: 0/1 Basic Vendor ID: Tungsten Graphics, Inc WebGL Renderer: Tungsten Graphics, Inc -- Mesa DRI Intel(R) Sandybridge Mobile AzureCanvasBackend: cairo AzureContentBackend: none AzureFallbackCanvasBackend: none
Kevin, is it 100% reproducible for you?
Flags: needinfo?(kbrosnan)
Opening http://boardgamegeek.com/boardgame/40692/small-world (from comment #2) crashes most of the times for me. If not when loading it the first time, force-reloading with Ctrl+F5 does.
Flags: needinfo?(kbrosnan)
Blocks: 830639
Assignee: nobody → joe
I have tried and tried and just can't reproduce this. The high quality downscaler is turned on, too. Lots of force reloads seem to have no effect. I'm baffled.
George, any thoughts?
Flags: needinfo?(gwright)
Joe - have you been testing on Linux or OS X?
Flags: needinfo?(gwright)
It happens only on 64-bit Linux.
(In reply to Scoobidiver from comment #9) > It happens only on 64-bit Linux. I see that comment #0 is on Intel Sandybridge / Mesa 8.0.2, comment #2 is Intel Sandybridge Mobile / Mesa 8.0.4 - is there a commonality on terms of graphics hardware and/or Mesa version for this bug?
If anyone in MTV or SF wants to capture this it has been somewhat reproducible.
NVIDIA GeForce 9600M / Mesa 9.0.1 / x64 I just got this on the 20130115040206 build, 100% reproducible on loading: http://imgur.com/a/SWVTE https://crash-stats.mozilla.com/report/index/bp-7830d459-5ed0-4807-b35d-3193c2130117 Updating to latest stopped the crashing on that page but I still get crashes on loading: http://sohowww.nascom.nasa.gov/data/realtime-images.html https://crash-stats.mozilla.com/report/index/bp-1aaa631e-643b-4107-b0f3-0a80b2130117
Aurora (20.0a2) is affected too. Here is a corresponding crash report: https://crash-stats.mozilla.com/report/index/bp-c7b4de8b-89f8-4331-a07b-e429c2130118
Summary: crash in skia::BGRAConvolve2D → Linux crash in skia::BGRAConvolve2D
Version: 21 Branch → 20 Branch
The timestamps suggest that this happened just before Firefox 20 went from central to aurora.
Joe, could this be another PGO issue? Could that be related to not being able to reproduce?
Flags: needinfo?(joe)
I'll try to reproduce on a PGO'd build.
Flags: needinfo?(joe)
I managed to reproduce this once while either loading or closing one of http://imgur.com/a/SWVTE http://sohowww.nascom.nasa.gov/data/realtime-images.html but despite much effort have not reproduced it since.
Keywords: reproducible
(In reply to Milan Sreckovic [:milan] from comment #17) > Joe, could this be another PGO issue? I thought we didn't do PGO on Linux? Note that this is from what I saw only happening on Linux so far - but there in pretty large numbers (for that platform).
Some of these crashes seem to occur even randomly. Like this one using Firefox 20.0a2 as of Jan. 28th, crashing Firefox while not even accessing a particular page: https://crash-stats.mozilla.com/report/index/bp-f109ad83-e196-499c-8d6d-36c302130129
(In reply to Joe Drew (:JOEDREW! \o/) from comment #6) > I have tried and tried and just can't reproduce this. The high quality > downscaler is turned on, too. Lots of force reloads seem to have no effect. > I'm baffled. As Kairo points out in comment 10, are you testing with Intel Sandybridge? Do we have an option of testing with that?
By the way, I can't reproduce this either, but I do have nvidia card...
I created a local copy of http://sohowww.nascom.nasa.gov/data/realtime-images.html and removed as much as possible, and still get Firefox to crash. See soho.html for what I got. Find the corresponding crash report at https://crash-stats.mozilla.com/report/index/bp-0f6236dd-7896-46b1-bbe6-c3b152130129 I hope this helps.
One more note, be sure to delete all browser data (cache, cookies etc.) and close Firefox completely when trying. My local test page didn't crash every time, but usually it did after cleaning and closing Firefox 21.0a1 completely.
Attached file Further stripped down soho.html (deleted) —
I managed to remove even more, including the CSS and other meta tags. Now the page only adds http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using the img tag, and still gets Firefox to crash. However, if I load http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg directly, Firefox doesn't crash (at least I couldn't get it to). So it seems the difference lays somewhere between adding http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using the img tag, and the image itself.
I can now pretty consistently reproduce this myself with a nightly, but not with a build I've made myself.
try builds also fail to reproduce the crash. What is different between try builds and nightlies that might cause this?
(In reply to Timothy Nikkel (:tn) from comment #28) > What is different between try builds and nightlies that might cause this? Linux PGO perhaps? I thought we weren't doing that, but I just read some other comments that imply that we actually do. And if MS PGO can cause weird crashes, gcc PGO might just as well.
I was able to reliably reproduce the crash in the pgo build http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64-pgo/1359511648/ but was not able to reproduce the crash in the non-pgo build http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-linux64/1359508908/ They are both build from the same revision. As far as I can tell pgo is the only variable here. So we are either looking at a PGO bug in gcc or a bug in our code that is only tickled when PGO is enabled for some reason.
(In reply to Timothy Nikkel (:tn) from comment #30) > but was not able to reproduce the crash in > the non-pgo build > http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla- > central-linux64/1359508908/ Confirming this. Couldn't get this build to crash on http://sohowww.nascom.nasa.gov/data/realtime-images.html as opposed to January 29th Nightly. But I'm confused, isn't this the same source that Nightlies are based on?
Yes, it is exactly the same source, but it is compiled slightly differently. Nightlies have PGO (profile guided optimizations), which take a lot longer to compile but produce slightly faster code. We also make non-PGO builds but they aren't nightlies. Thanks for testing.
Keywords: testcase
This is starting to sound like a PGO optimizer bug, then. :(
I figured out how to turn on PGO on try and that try build has the problem. This is good news as now we can do some poking using these try builds to try to figure out the problem.
A local PGO build using gcc 4.7 did not have the crash.
IIRC, our official build machines use gcc 4.5 right now.
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #36) > IIRC, our official build machines use gcc 4.5 right now. Correct, you can tell by looking at the build logs.
(In reply to Markus Popp from comment #26) > Created attachment 707773 [details] > Further stripped down soho.html > > I managed to remove even more, including the CSS and other meta tags. Now > the page only adds > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using > the img tag, and still gets Firefox to crash. > > However, if I load > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg > directly, Firefox doesn't crash (at least I couldn't get it to). > > So it seems the difference lays somewhere between adding > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using > the img tag, and the image itself. FWIW, I don't crash with latest aurora and latest nightly on this page.
(In reply to Mike Hommey [:glandium] from comment #38) > > So it seems the difference lays somewhere between adding > > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using > > the img tag, and the image itself. > > FWIW, I don't crash with latest aurora and latest nightly on this page. Sorry, that was a bit misleading. Accessing latest.jpg directly doesn't crash, but http://sohowww.nascom.nasa.gov/data/realtime-images.html (which includes latest.jpg as img) still does (also in the latest Nightly, just tried it).
(In reply to Markus Popp from comment #39) > (In reply to Mike Hommey [:glandium] from comment #38) > > > So it seems the difference lays somewhere between adding > > > http://sohowww.nascom.nasa.gov/data/realtime/eit_171/512/latest.jpg using > > > the img tag, and the image itself. > > > > FWIW, I don't crash with latest aurora and latest nightly on this page. > > Sorry, that was a bit misleading. Accessing latest.jpg directly doesn't > crash, but http://sohowww.nascom.nasa.gov/data/realtime-images.html (which > includes latest.jpg as img) still does (also in the latest Nightly, just > tried it). Sorry, my message wasn't clear, I tried the attached page, not the linked page. I also tried ttp://boardgamegeek.com/boardgame/40692/small-world and only managed to crash once when going back and forth between the various pages.
So, with a crash under a debugger here's what i can see: (gdb) p $_siginfo._sifields._sigfault $1 = {si_addr = 0x7fffc6d00000} Looking at /proc/self/maps: 7fffc6c00000-7fffc6d00000 rw-p 00000000 00:00 0 7fffc6e00000-7fffc7500000 rw-p 00000000 00:00 0 That address is not mapped, but is right at the end of an existing mapping. Could be a region that used to be mapped but isn't anymore, or a buffer overflow.
On another crash, the crash address is 0x7fc4ed500000, and the last mmap/munmap calls before the crash read: munmap(0x7fc4ed400000, 1048576) = 0 mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed3ff000 munmap(0x7fc4ed3ff000, 4096) = 0 munmap(0x7fc4ed500000, 1044480) = 0 mmap(NULL, 3145728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed100000 munmap(0x7fc4ed300000, 1048576) = 0 mmap(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed300000 munmap(0x7fc4ed100000, 2097152) = 0 munmap(0x7fc4ed300000, 1048576) = 0 mmap(NULL, 8392704, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7fc4ecbff000 See the munmap(0x7fc4ed500000, 1044480). It sounds like a race condition.
Actually, it sounds like a buffer overflow: This is how things are allocated: mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fc4ed3ff000 munmap(0x7fc4ed3ff000, 4096) = 0 munmap(0x7fc4ed500000, 1044480) = 0 That probably comes from jemalloc chunk allocation, where jemalloc tries to get something 1MB-aligned, so it gets something bigger, and munmaps what it doesn't need.
I'm slowly plodding my way through this x86 assembly stuff. (gdb) p $_siginfo._sifields._sigfault $1 = {si_addr = 0x7fffcce00000} the crashing instruction is => 0x00007ffff4175a62 <+1186>: movdqu 0x30(%rcx),%xmm15 and rcx 0x7fffccdfffc1 movdqu is a 128 bit or 16 byte read, 0x7fffccdfffc1 + 0x30 + 0xF = 0x7fffcce00000 so it segfaulting on the last byte of the read.
So we are in this loop // Apply the filter to the row to get the destination pixel in |accum|. int accum[4] = {0}; for (int filter_x = 0; filter_x < filter_length; filter_x++) { ConvolutionFilter1D::Fixed cur_filter = filter_values[filter_x]; accum[0] += cur_filter * row_to_filter[filter_x * 4 + R_OFFSET_IDX]; accum[1] += cur_filter * row_to_filter[filter_x * 4 + G_OFFSET_IDX]; accum[2] += cur_filter * row_to_filter[filter_x * 4 + B_OFFSET_IDX]; if (has_alpha) accum[3] += cur_filter * row_to_filter[filter_x * 4 + A_OFFSET_IDX]; } and row_to_filter = 0x7fffcc1fffc0, filter_length = 16. So the largest address of row_to_filter that the C++ might access is row_to_filter + filter_length*4 - 1. In the disassembly %rax starts out as row_to_filter. The assembly then loads 16 pixels (4 pixels at a time) starting at %rax, and then increments %rax by 16 pixels (interleaved with various SSE instructions operating on those pixels). The assembly initializes %rcx to %rax+1. The assembly then loads 16 pixels starting at %rcx (so %rax but offset by 1 byte) and the increments %rcx by 16 pixels (interleaved with various SSE instructions operating on those pixels). I'm not sure why the compiler wants to do these unaligned loads. GCC 4.7 with PGO doesn't do any of these unaligned loads. So you can see the problem here, the code loads 16 pixels starting from %rax+1 and %rax is an array of 16 pixels, so we go one byte past the end. It looks like the problem is that this "pull 16 pixels at a time" code path needs to be only used on arrays of length > 16 (because it go a few bytes past 16 pixels) but it is currently used on arrays of length >= 16. Indeed, before the 16 pixels at a time code path we have cmp $0xf,%r9d jbe 0x7ffff4177291 where the jbe looks to jump to alternate code if we have 15 or less.
Attached file disassembly (deleted) —
The disassembly for reference.
This fixes the crash for me. "-O2" does not fix it. I'm open to other ideas for how to make the compiler behave. Also if there was an easy way to make this only apply to PGO that would be better. I haven't tested the perf difference, but I'm open to suggestions for what would be a good way to test that.
Assignee: joe → tnikkel
Attachment #711675 - Flags: review?
Attachment #711675 - Flags: review? → review?(joe)
Attachment #711675 - Flags: review?(joe) → review+
And backed out cause that gcc version macro isn't defined where one would expect https://hg.mozilla.org/integration/mozilla-inbound/rev/c38ac98bdc95
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla21
Comment on attachment 711675 [details] [diff] [review] Make us use -O1 for this fuction on gcc 4.5 There are some crashes of this on aurora but it seems lower frequency. This is low risk so I think we should take it there. [Approval Request Comment] Bug caused by (feature/regressing bug #): it hasn't been tracked down in detail User impact if declined: crashes on linux Testing completed (on m-c, etc.): on m-c for a nightly now Risk to taking this patch (and alternatives if risky): low risk, we just instruct the compiler to use less optimization on gcc 4.5 only (so Linux only for our purposes) String or UUID changes made by this patch: none
Attachment #711675 - Flags: approval-mozilla-aurora?
Comment on attachment 711675 [details] [diff] [review] Make us use -O1 for this fuction on gcc 4.5 low risk, go ahead with uplift.
Attachment #711675 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Firefox Nightly on Linux64 has definitely become a lot more stable since yesterday (Feb. 10). Crashed a lot before, no crash since yesterday so far.
Is this supposed to be fixed on Aurora yet? While Firefox Nightly looks fine, Firefox Aurora still crashes at http://sohowww.nascom.nasa.gov/data/realtime-images.html as of February 13th build. Here is a corresponding crash report: https://crash-stats.mozilla.com/report/index/bp-90a3fb6c-c444-4469-ba33-aa2812130213
Ah, the reason is that the MOZ_GCC_VERSION_AT_LEAST macro is only defined on mozilla-central and not aurora because it was added recently by bug 833254. I'll see about getting that uplifted.
Depends on: 833254
No longer depends on: 833254
There's been enough churn that the patch for bug 833254 needs some work to apply. I'm just going to make a patch doing it the old way looking at the gcc defines directly without the shiny new macro.
Pushed using the old way to check the gcc version https://hg.mozilla.org/releases/mozilla-aurora/rev/0e3b5d954cf1
Reproduced the crash on Aurora 20.0a2 2013-02-12. Verified fixed FF 20b1 Ubuntu 12.04 x64.
Keywords: verifyme
QA Contact: paul.silaghi
Verified fixed FF 21b2 Ubuntu 12.04 x64.
Status: RESOLVED → VERIFIED
Keywords: verifyme
Depends on: 1120060
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: