Closed
Bug 521549
Opened 15 years ago
Closed 13 years ago
Crash [@ qcms_transform_data_rgba_out_lut_sse2], [@ qcms_transform_data_rgb_out_lut_sse2]
Categories
(Core :: Graphics: Color Management, defect)
Tracking
()
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
blocking2.0 | --- | - |
People
(Reporter: bc, Assigned: swsnyder)
References
Details
(Keywords: crash, intermittent-failure)
Crash Data
Attachments
(3 files, 3 obsolete files)
When running the jsreftests on 32bit linux centos5 vm with a debug mozilla-central build I see a crash after loading the js1_5/Regress/regress-351116.js test at
#6 qcms_transform_data_rgba_out_lut_sse2 (transform=0xb64ddd0,
src=0x99519c1 "", dest=0x99519c1 "", length=16)
at gfx/qcms/transform-sse2.c:152
Running the test by itself does not crash. But the crash occurs regularly at this point when running the full test.
My local profile on that machine also crashes at startup in a related place at
qcms_transform_data_rgb_out_lut_sse2 (transform=0x9121b20, src=0xbf816a0d "\020\020\020 \033\022\t\f\r%Gշ�%@\231\020\t|j\201%G�%@[\205\030%G�%@@q\n\t\001",
dest=0xbf816a0d "\020\020\020 \033\022\t\f\r%Gշ�%@\231\020\t|j\201%G�%@[\205\030%G�%@@q\n\t\001", length=1) at gfx/qcms/transform-sse2.c:41
A fresh profile however does not crash. The crash still occurs in safe mode.
I don't know how to reproduce the second crash other than with the "bad" profile, but to reproduce the first crash.
1. apply bug-515440.patch to disable a failing js test (at least until I check it in).
2. make -C $OBJDIR jstestbrowser where $OBJDIR is your debug objdir.
This appears to have been added in
Bug 512865. qcms: Improve SSE2 performance, add SSE support.
Reporter | ||
Comment 1•15 years ago
|
||
Reporter | ||
Comment 2•15 years ago
|
||
sayrer: This patch should land on mozilla-central and tracemonkey. Would it cause you problems if it was landed on both at the same time? Or would it be better to land on tracemonkey and let you handle it during the next merge to mozilla-central?
Attachment #405644 -
Attachment is obsolete: true
Attachment #407122 -
Flags: review?(sayrer)
Updated•15 years ago
|
Attachment #407122 -
Flags: review?(sayrer) → review+
Reporter | ||
Comment 3•15 years ago
|
||
http://hg.mozilla.org/tracemonkey/rev/4175f7e34648
disable js1_5/Regress/regress-351116.js for debug 32bit linux browser
to test this crash you must re-enable this test.
Comment 4•15 years ago
|
||
This stack is one we've been seeing on shutdown (intermittent, but frequently) on the Thunderbird Bloat tests.
It is slightly different to the other stacks but I'm guessing it is the same issue as I am pretty sure the intermittent crashes started the same time as when bug 512865 landed.
Comment 5•15 years ago
|
||
This is crashing frequently on debug tests on tinderbox. I think this may start affecting Thunderbird devs as soon as we start actively working on trunk.
Also I suspect this could be the cause of SeaMonkey's crashes on Linux recorded in bug 520707 that started on the same day as bug 512865.
Therefore I think this should block 1.9.3 as it is a clear reproducible crash for debug builds at least... and we need a stable environment for our Linux developers.
Assignee | ||
Comment 6•15 years ago
|
||
(In reply to comment #5)
> This is crashing frequently on debug tests on tinderbox.
Mark, can you point me toward doc that explains how the debug tests differ from a standard build? Put another way, how do I build Thunderbird/SeaMonkey in such a way that this crash can be reproduced?
Thanks.
Assignee | ||
Comment 7•15 years ago
|
||
Never mind. Found it.
https://developer.mozilla.org/en/MailNews/Leak_And_Bloat_Tests
Comment 8•15 years ago
|
||
(In reply to comment #6)
> (In reply to comment #5)
> > This is crashing frequently on debug tests on tinderbox.
>
> Mark, can you point me toward doc that explains how the debug tests differ from
> a standard build? Put another way, how do I build Thunderbird/SeaMonkey in
> such a way that this crash can be reproduced?
For Thunderbird we're doing slightly different tests to the Firefox Leak & Bloat tests. Unfortunately I need to update the documentation on how to run these, thankfully it is quite simple.
If you need info on how to pull & build Thunderbird, see https://developer.mozilla.org/En/Simple_Thunderbird_build
This is the mozconfig that we're using for the bloat tests:
http://hg.mozilla.org/build/buildbot-configs/file/c230f470c116/thunderbird/linux/comm-central/debug/mozconfig
once you've got a build, go into the src directory, into mailnews/performance/bloat and run:
./runtest.py --objdir=/path/to/objdir
replacing the path as appropriate, e.g. /home/cltbld/build/objdir-tb
If you want to try and reproduce the actual actions rather than run the automated test, then for Thunderbird it is:
1) Start up with minimal profile (one account, most things turned off including contacting the server on startup).
2) Open and close compose window
3) Open and close address book window
4) Exit Thunderbird
I haven't tried it myself yet as I need to set up a new VM for that, but they are the steps for the mail leak test.
As I said I'm not sure SeaMonkey is definitely the same crash, but afaik their leak & bloat tests are run in exactly the same way as Firefox - unfortunately I don't know where we keep details on that. This is the mozconfig used by SeaMonkey:
http://hg.mozilla.org/build/buildbot-configs/file/c230f470c116/seamonkey/linux/comm-central-trunk/debug/mozconfig
Reporter | ||
Comment 9•15 years ago
|
||
make jstestbrowser in your objdir on mozilla-central will also show this crash
Assignee | ||
Comment 10•15 years ago
|
||
Mark:
The bloat test seems to run correctly (windows opened, closed) until the end, where I get this line:
/home/steve/tmp/comm-central/objdir-tb/mozilla/dist/bin/run-mozilla.sh: line 131: 10088 Segmentation fault "$prog" ${1+"$@"}
TEST-UNEXPECTED-FAIL | runtest.py | Exited with code 139 during test run
It always crashes at the same place, in NS_StackWalk(). No references to the QCMS code being blamed.
I commented-out the QCMS code that selects the SSE2 path, rebuilt, and see the same behavior.
This on a 32-bit Fedora 10 (GCC v4.3.2) system.
Bob:
$ cd -
/home/steve/tmp/comm-central
$ cd objdir-tb/
$ make jstestbrowser
make: *** No rule to make target `jstestbrowser'. Stop.
The name of the test suggests that a SeaMonkey build is required, explaining why it's not found in my Thunderbird build. True.
Reporter | ||
Comment 11•15 years ago
|
||
Firefox build required: mozilla-central.
Reporter | ||
Comment 12•15 years ago
|
||
I am seeing crash on mozilla-central debug build @ qcms_transform_data_rgb_out_lut_sse2 on startup on a centos5 32bit vm.
Assignee | ||
Comment 13•15 years ago
|
||
(In reply to comment #12)
> I am seeing crash on mozilla-central debug build @
> qcms_transform_data_rgb_out_lut_sse2 on startup on a centos5 32bit vm.
I've been unsuccessful in reproducing this on Fedora, but I've got CentOS 5 too.
Are you on CentOS v5.4, fully updated? More specifically, are you building with GCC package gcc-4.1.2-46.el5_4.1 and binutils-2.17.50.0.6-12.el5 ?
By "debug build" do you mean just having "export MOZ_DEBUG_SYMBOLS=1" in your mozconfig file, or is there more to it than that?
Given that you're in a VM, I assume that you do in fact have SSE2 support in your environment. That is, it is likely a misalignment issue, not an attempt to execute instructions unsupported by your CPU.
Reporter | ||
Comment 14•15 years ago
|
||
(In reply to comment #13)
> Are you on CentOS v5.4, fully updated? More specifically, are you building
> with GCC package gcc-4.1.2-46.el5_4.1 and binutils-2.17.50.0.6-12.el5 ?
yes.
>
> By "debug build" do you mean just having "export MOZ_DEBUG_SYMBOLS=1" in your
> mozconfig file, or is there more to it than that?
>
mk_add_options MOZ_CO_PROJECT=browser
ac_add_options --enable-application="browser"
ac_add_options --disable-optimize
ac_add_options --enable-debugger-info-modules=all
ac_add_options --enable-debug
ac_add_options --enable-tests
ac_add_options --enable-default-toolkit=cairo-gtk2
ac_add_options --disable-static
ac_add_options --enable-shared
ac_add_options --disable-crashreporter
ac_add_options --disable-installer
ac_add_options --disable-jemalloc
ac_add_options --with-valgrind
ac_add_options --enable-valgrind
ac_add_options --enable-logrefcnt
> Given that you're in a VM, I assume that you do in fact have SSE2 support in
> your environment. That is, it is likely a misalignment issue, not an attempt
> to execute instructions unsupported by your CPU.
how do I check for SSE2 support?
Assignee | ||
Comment 15•15 years ago
|
||
(In reply to comment #14)
> how do I check for SSE2 support?
Examine the output of "cat /proc/cpuinfo" for SSE2 in the feature flags.
$ cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 13
model name : Intel(R) Pentium(R) M processor 1.80GHz
stepping : 6
cpu MHz : 1798.521
cache size : 2048 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 2
wp : yes
flags : fpu vme de pse tsc msr mce cx8 mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss tm pbe est tm2
bogomips : 3597.04
Reporter | ||
Comment 16•15 years ago
|
||
has both sse and sse2
Assignee | ||
Comment 17•15 years ago
|
||
OK, here's the root of the problem:
transform.c: At top level:
transform.c:1321: warning: ‘__force_align_arg_pointer__’ attribute directive ignored
Googling that text gets a whole lot of complaints from Wine and libpixman users complaining that they can't build with their GCC v4.1.x compilers for this very reason.
In our case we (sometimes) enter the transformation functions with the stack mis-aligned, causing a segfault on the first attempt to write to a local __m128 variable:
const __m128 mat0 = _mm_load_ps(mat[0]);
Confirmation that __force_align_arg_pointer__ is not doing it's job of forcing stack alignment:
0x014acce5 <qcms_transform_data+0>: push %ebp
0x014acce6 <qcms_transform_data+1>: mov %esp,%ebp
0x014acce8 <qcms_transform_data+3>: push %esi
0x014acce9 <qcms_transform_data+4>: sub $0x14,%esp
0x014accec <qcms_transform_data+7>: mov 0x8(%ebp),%eax
0x014accef <qcms_transform_data+10>: mov 0x7c(%eax),%esi
0x014accf2 <qcms_transform_data+13>: mov 0x10(%ebp),%edx
0x014accf5 <qcms_transform_data+16>: mov 0xc(%ebp),%ecx
0x014accf8 <qcms_transform_data+19>: mov 0x14(%ebp),%eax
0x014accfb <qcms_transform_data+22>: mov %eax,0xc(%esp)
0x014accff <qcms_transform_data+26>: mov %edx,0x8(%esp)
0x014acd03 <qcms_transform_data+30>: mov %ecx,0x4(%esp)
0x014acd07 <qcms_transform_data+34>: mov 0x8(%ebp),%eax
0x014acd0a <qcms_transform_data+37>: mov %eax,(%esp)
0x014acd0d <qcms_transform_data+40>: call *%esi
0x014acd0f <qcms_transform_data+42>: add $0x14,%esp
0x014acd12 <qcms_transform_data+45>: pop %esi
0x014acd13 <qcms_transform_data+46>: pop %ebp
0x014acd14 <qcms_transform_data+47>: ret
We don't see similar segfaults from our copy of libpixman because in Linux we use the system pango libraries which, in RHEL5/CentOS5, don't have SSE/SSE2 support.
No fix yet, but at least the problem isn't a mystery anymore.
Assignee | ||
Comment 18•15 years ago
|
||
Jeff:
In brief testing this patch seems to fix the problem. Before I take the time to test it more rigorously (multiple compilers, OSs, etc.), though, I have a question: is this scheme too clunky to be considered for acceptance?
The scheme is that for versions of GCC prior to v4.2 local 128-bit variables are encapulated in a run-time-aligned structure. We decide at build time whether the local vars are accessed as stand-alone variables or as members of the structure, with a macro ( LSVAR(x) ) hiding the decision.
If you're OK with this general course of action, I'll submit a second patch and request a formal review of it.
Assignee | ||
Comment 19•15 years ago
|
||
As described above:
The scheme is that for versions of GCC prior to v4.2 local 128-bit variables
are encapsulated in a run-time-aligned structure. We decide at build time
whether the local vars are accessed as stand-alone variables or as members of
the structure, with a macro ( LSVAR(x) ) hiding the decision.
Addendum:
It turns out that the above wasn't enough. With optimization completely disabled GCC v4.1.2 does a _mm_load_ss() by zeroing the register, storing it to an intermediate temporary location, and overwriting the low 32 bits with the specified float before storing it in the destination specified in the source code. That temporary location is on the stack. Given that we cannot guarantee stack alignment I added the lowest optimization to the make file, ifdef'd to GCC. That causes the compiler to skip the intermediate step and write the intrinsic output direct to specified (aligned) destination. The added switch is specified before the standard compiler switches, overriding it in non-debug builds. This is what the use of intermediate storage looks like:
LSVAR(vec_g) = _mm_load_ss(&igtbl_g[src[1]]);
xorps %xmm0, %xmm0 # tmp325
movaps %xmm0, -776(%ebp) # tmp325, D.6803
movl -752(%ebp), %eax # __F, __F
movl %eax, -776(%ebp) # __F, D.6803
movaps -776(%ebp), %xmm0 # D.6803, D.6804
movl -860(%ebp), %eax # lsptr, lsptr
movaps %xmm0, 112(%eax) # D.6643, <variable>.vec_g
Ramifications of the patch:
1. Building with GCC versions prior to v4.2 will gen code that runs a little slower due to the additional pointer referencing.
2. The SSE1- and SSE2-specific source files can't be built with optimization wholly disabled. (Better would be to add the optimization only for <4.2 GCC but I don't see that the make file can know the version of GCC being used.)
Assignee: nobody → swsnyder
Attachment #411680 -
Attachment is obsolete: true
Attachment #412351 -
Flags: review?(jmuizelaar)
Comment 20•15 years ago
|
||
I'm not sure that all these contortions are worth it. Trunk is currently built with gcc 4.3. So the problem doesn't even occur with those builds. Perhaps we should just not build the sse files with gcc < 4.2 or force optimization and hope nothing goes wrong.
Comment 21•15 years ago
|
||
(In reply to comment #20)
> I'm not sure that all these contortions are worth it. Trunk is currently built
> with gcc 4.3. So the problem doesn't even occur with those builds. Perhaps we
> should just not build the sse files with gcc < 4.2 or force optimization and
> hope nothing goes wrong.
Ah so that's why the Thunderbird tinderboxes are failing with this (and possibly the SeaMonkey ones) and not the Firefox ones, because afaik Thunderbird ones are still on gcc 4.1.
Personally I think whichever version of gcc is used, the build shouldn't fail, or the prerequisites raised so that devs don't spend ages working out why their builds are failing.
Although, having said that, I will now go and raise a bug for seeing if gcc 4.3 will fix the Thunderbird boxes.
Assignee | ||
Comment 22•15 years ago
|
||
(In reply to comment #20)
> I'm not sure that all these contortions are worth it.
Me either. Especially since older compilers would still provide the same functionality through the non-SIMD code path. And the number of people who build with GCC versions prior to v4.2 will only get smaller with time.
On the other hand, a contemporary Linux distribution (RHEL5/CentOS5) still uses v4.1.2 as the standard compiler. On still another other hand, RHEL5/CentOS also offers GCC v4.4 as a non-standard installation option. <shrug>
I looked for some Mozilla-mandated minimum version, but only found an advisement to use GCC v3.4 or newer when building.
Comment 23•15 years ago
|
||
Until this patch gets reviewed/lands, is there any workaround for this? I experience this on CentOS 5 where I cannot update to GCC 4.2 (there is no rpm available for that so far).
Comment 24•15 years ago
|
||
(In reply to comment #23)
> Until this patch gets reviewed/lands, is there any workaround for this? I
> experience this on CentOS 5 where I cannot update to GCC 4.2 (there is no rpm
> available for that so far).
Sure. Just make sse_version_available() in gfx/qcms/transform.c unconditionally return 0;
Reporter | ||
Comment 25•15 years ago
|
||
I updated my vms to gcc 4.4. Hopefully that won't cause more problems.
Comment 26•15 years ago
|
||
Any progress since last month?
Comment 27•15 years ago
|
||
I think the resolution to this is going to be use a gcc that supports __force_align_arg_pointer__ or avoid compiling in the sse code.
Comment 28•15 years ago
|
||
FTR the Thunderbird tinderboxes are now building with gcc 4.3.3 (same as FF boxes) and of the 9 or so builds today, we've not had any fail yet.
No longer blocks: CcMcBuildIssues
Comment 29•15 years ago
|
||
Comment on attachment 412351 [details] [diff] [review]
Prevent <4.2 GCC from generating illegal SIMD addressing
I don't think we need this anymore.
Attachment #412351 -
Flags: review?(jmuizelaar)
Comment 30•15 years ago
|
||
We should fix this somehow, but it doesn't need to block.
blocking2.0: ? → -
Updated•13 years ago
|
Crash Signature: [@ qcms_transform_data_rgba_out_lut_sse2]
[@ qcms_transform_data_rgb_out_lut_sse2]
Comment 32•13 years ago
|
||
We use the attribute unconditionally now. This shouldn't be a problem anymore.
Status: NEW → RESOLVED
Crash Signature: [@ qcms_transform_data_rgba_out_lut_sse2]
[@ qcms_transform_data_rgb_out_lut_sse2] → [@ qcms_transform_data_rgba_out_lut_sse2]
[@ qcms_transform_data_rgb_out_lut_sse2]
Closed: 13 years ago
Resolution: --- → FIXED
Comment 33•13 years ago
|
||
Comment on attachment 407122 [details] [diff] [review]
patch to disable js1_5/Regress/regress-351116.js for debug 32bit linux browser
[Checked in: Comment 33]
http://hg.mozilla.org/mozilla-central/rev/4175f7e34648
Attachment #407122 -
Attachment description: patch to disable js1_5/Regress/regress-351116.js for debug 32bit linux browser → patch to disable js1_5/Regress/regress-351116.js for debug 32bit linux browser
[Checked in: Comment 33]
Updated•13 years ago
|
Attachment #412351 -
Attachment is obsolete: true
Updated•12 years ago
|
Keywords: intermittent-failure
Updated•12 years ago
|
Whiteboard: [orange]
You need to log in
before you can comment on or make changes to this bug.
Description
•