Closed Bug 1273048 Opened 9 years ago Closed 8 years ago

x86 Android crashes in mozilla::PseudoElementForStyleContext

Categories

(Core :: Layout, defect)

x86
Android
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla49
Tracking Status
firefox49 --- fixed

People

(Reporter: n.nethercote, Assigned: dbaron)

References

Details

(Keywords: crash)

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-eb3eccff-dc1a-4034-af8e-8a9732160511. ============================================================= This crash signature first appeared in Fennec Nightly 20160511030221, and has occurred 5 times since, across three installations. I don't see any obvious changes in and around this code recently. One interesting thing is that all 5 crashes have occurred on x86/Android devices. dbaron, any ideas?
Flags: needinfo?(dbaron)
The crash address for the crashes seems to be build-specific, and not near either any of the registers, nor in any of the mapped modules. The CPU for *all* of the Android-x86 crashes is: GenuineIntel family 6 model 55 stepping 8 | 4 (This CPU is the second most common Android-x86 CPU in crash-stats, responsible for 12.66% of crash reports: https://crash-stats.mozilla.com/search/?product=FennecAndroid&cpu_arch=x86&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info The distribution on nightly is quite different, where it's 3rd and 13.08%: https://crash-stats.mozilla.com/search/?product=FennecAndroid&cpu_arch=x86&release_channel=nightly&date=%3E2016-01-01&_facets=cpu_info&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-cpu_info .)
Flags: needinfo?(dbaron)
Bug 1269976 is about pseudo elements and landed on m-c 2016-05-10, fwiw.
My guess is that the compiler somehow removes the sanity check > if (aPseudoType >= CSSPseudoElementType::Count) { above the crashing line. The patch in bug 1269976 moves kPseudoElementFlags from a local const array to a scoped one, which may change where the array lives in memory, and the new location may have less valid space after...
(In reply to Mats Palmgren (:mats) from comment #2) > Bug 1269976 is about pseudo elements and landed on m-c 2016-05-10, fwiw. Seems pretty likely. Given the nightly crash data, I think there's a >75% chance or so that the regression is in the one-day range: https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=043082cb7bd8490c60815f67fbd1f33323ad7663&tochange=674a552743785c28c75866969aad513bd8eaf6ae (especially since the May 9 and May 10 builds were built from the same changeset).
Blocks: 1269976
Summary: Crash in mozilla::PseudoElementForStyleContext → x86 Android crashes in mozilla::PseudoElementForStyleContext
(In reply to David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) from comment #1) > The CPU for *all* of the Android-x86 crashes is: > GenuineIntel family 6 model 55 stepping 8 | 4 This is no longer true. This is also now 17% of x86 Android crashes for 49.0a1: https://crash-stats.mozilla.com/search/?product=FennecAndroid&version=49.0a1&cpu_arch=x86&date=%3E2016-04-01&_facets=signature
Hmmm, I have no idea how to disasm the xul library in x86 Android build... I heard that libxul.so there is compressed by szip, which I don't find any tool to decompress. And I don't know how to get the proper symbols either...
Per IRC discussion between dbaron and glandium, to decompress the .so file, you need to: > mach configure --enable-project=mobile/android > mach build pre-export > mach build -C . mozglue/linker/host > $objdir/dist/host/bin/szip -d libxul.so
I debugged bp-1b4efb55-ecc2-42e4-98c7-27a872160601 a bit In order to objdump the libxul.so from the build, I needed to compile szip with: unset MOZCONFIG; MOZ_LINKER=1 ./mach configure --with-system-zlib ; ./mach build pre-export ; ./mach build -C . mozglue/linker/host Then I could determine that the function we want is: 017eb8ad <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoEle mentType)>: 17eb8ad: 80 fa 16 cmp dl,0x16 17eb8b0: 77 3a ja 17eb8ec <mozilla::PseudoElementFo rStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x3f> 17eb8b2: 55 push ebp 17eb8b3: 89 e5 mov ebp,esp 17eb8b5: 56 push esi 17eb8b6: 89 c6 mov esi,eax 17eb8b8: 53 push ebx 17eb8b9: e8 b2 5a b5 fe call 341370 <__x86.get_pc_thunk.bx> 17eb8be: 81 c3 6a c6 17 02 add ebx,0x217c66a 17eb8c4: f6 84 93 98 1d 33 ff test BYTE PTR [ebx+edx*4-0xcce268],0x4 17eb8cb: 04 17eb8cc: 75 14 jne 17eb8e2 <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x35> 17eb8ce: 83 ec 0c sub esp,0xc 17eb8d1: 52 push edx 17eb8d2: e8 0b 3b f9 ff call 177f3e2 <nsCSSPseudoElements::PseudoElementSupportsUserActionState(mozilla::CSSPseudoElementType)> 17eb8d7: 83 c4 10 add esp,0x10 17eb8da: 84 c0 test al,al 17eb8dc: 75 04 jne 17eb8e2 <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x35> 17eb8de: 31 c0 xor eax,eax 17eb8e0: eb 03 jmp 17eb8e5 <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)+0x38> 17eb8e2: 8b 46 14 mov eax,DWORD PTR [esi+0x14] 17eb8e5: 8d 65 f8 lea esp,[ebp-0x8] 17eb8e8: 5b pop ebx 17eb8e9: 5e pop esi 17eb8ea: 5d pop ebp 17eb8eb: c3 ret 17eb8ec: 31 c0 xor eax,eax 17eb8ee: c3 ret 17eb8ef: 90 nop The module_offset in the raw dump in the crash report is 0x17eb8c4, so our EIP (0x2c6a48c4) means 0x17eb8c4 above. EBX is 0x2e820f28 and EDX is 0x211fa509. Note that the instruction at the very start compares only the low bytes (cmp dl,0x16) when deciding whether to jump to the end. It's not clear to me whether aType is supposed to be passed in DL (with the other 24 bits ignored) or in EDX (consider all 32 bits). Either way, things seem wrong. Looking at the calling function, we have (from the crash report's raw dump), module_offset: 0x17ff1ac (which should be the return address... but apparently it's one byte less than the return address!?), and the calling code there looks like: 17ff19f: 8a 95 5c ff ff ff mov dl,BYTE PTR [ebp-0xa4] 17ff1a5: 8b 45 0c mov eax,DWORD PTR [ebp+0xc] 17ff1a8: e8 00 c7 fe ff call 17eb8ad <mozilla::PseudoElementForStyleContext(nsIFrame*, mozilla::CSSPseudoElementType)> 17ff1ad: 52 push edx This makes me think it's a miscompilation that the above code is using a test instruction that uses all of EDX: 17eb8c4: f6 84 93 98 1d 33 ff test BYTE PTR [ebx+edx*4-0xcce268],0x4 The Android NDK's objdump and my system objdump on Ubuntu 16.04 at least agree on that instruction. In AT&T syntax, it's: 17eb8c4: f6 84 93 98 1d 33 ff testb $0x4,-0xcce268(%ebx,%edx,4)
Looking nsCSSPseudoElements::PseudoElementSupportsUserActionState() which is called a bit later, it seems the function there only uses one byte. I suspect this is what was compiled before I move that function into the header, and thus it didn't crash before bug 1269976: 0177f3e2 <_ZN19nsCSSPseudoElements36PseudoElementSupportsUserActionStateEN7mozilla20CSSPseudoElementTypeE>: 177f3e2: e8 c0 6d bc fe call 3461a7 <__x86.get_pc_thunk.cx> 177f3e7: 81 c1 41 8b 1e 02 add $0x21e8b41,%ecx 177f3ed: 55 push %ebp 177f3ee: 89 e5 mov %esp,%ebp 177f3f0: 0f b6 45 08 movzbl 0x8(%ebp),%eax 177f3f4: 5d pop %ebp 177f3f5: f6 84 81 98 1d 33 ff testb $0x8,-0xcce268(%ecx,%eax,4) 177f3fc: 08 177f3fd: 0f 95 c0 setne %al 177f400: c3 ret 177f401: 90 nop
00:50:42 <jandem> dbaron: that sounds like https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64037, it comes up from time to time in spidermonkey :/ 00:50:55 <jandem> i checked the enum there has uint8_t storage
I mean, before I move nsCSSPseudoElements::PseudoElementSupportsStyleAttribute into the header. So the solution would probably be: move nsCSSPseudoElements::PseudoElementSupportsStyleAttribute back to the source file, so that the compiler wouldn't mis-optimize it when inlining.
Depends on: 1277189
01:58:51 <xidorn> dbaron: I have no idea. if upgrading ndk is non-trivial, I'd suggest we move nsCSSPseudoElements::PseudoElementSupportsStyleAttribute back to the .cpp file, or mark it never inline for x86 android build 02:01:01 → adalucinet and KaiRo joined ⇐ reed and AdrianSV quit ↔ mgoodwin_OoO and sankha nipped out • mtabara|away → mtabara, fscholz|off → fscholz 02:10:56 <glandium> dbaron: it's supposed to be fixed in rev 218721 of the 4.9 branch, and NDK r11 has a newer revision than that 02:11:47 <glandium> dbaron: also, we don't have to land a NDK upgrade to validate that it does indeed fix it 02:11:57 <glandium> that can be checked with a try build Also, we could perhaps use __attribute__((noinline)), #ifdef'd based on the bad GCC versions, based on https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#Common-Function-Attributes . (There's a good bit of __attribute__((noinline)) usage in the tree already; see https://mxr.mozilla.org/mozilla-central/search?string=__%28%28noinline%29%29&tree=mozilla-central
MozReview-Commit-ID: 4VjAra5B6GM
Attachment #8759868 - Flags: review?(nfroyd)
Assignee: nobody → dbaron
Status: NEW → ASSIGNED
MozReview-Commit-ID: IXYYn3mLQBf
Attachment #8759870 - Flags: review?(bugzilla)
I tested locally that the MOZ_GCC_VERSION_AT_LEAST() and MOZ_GCC_VERSION_AT_MOST() expressions do sensible things when I fiddle with the numbers to make them more or less than my local gcc version. (I tested this for all 4 expressions.) I don't know for sure that this will fix the crashes we're seeing, but it seems like it should undo the change that triggered it, so I think it's worth trying. MozReview-Commit-ID: IXYYn3mLQBf
Attachment #8759871 - Flags: review?(bugzilla)
Attachment #8759868 - Flags: review?(nfroyd) → review+
Attachment #8759870 - Attachment is obsolete: true
Attachment #8759870 - Flags: review?(bugzilla)
Comment on attachment 8759871 [details] [diff] [review] Add __attribute__((noinline)) to work around compiler bug on Android/x86 Review of attachment 8759871 [details] [diff] [review]: ----------------------------------------------------------------- Looks fine if it works. ::: layout/style/nsCSSPseudoElements.h @@ +109,5 @@ > + > + // Work around https://gcc.gnu.org/bugzilla/show_bug.cgi?id=64037 , > + // which is a general gcc bug that we seem to have hit only on Android/x86. > +#if defined(ANDROID) && defined(__i386__) && defined(__GNUC__) && !defined(__clang__) > +#if (MOZ_GCC_VERSION_AT_LEAST(4,8,0) && MOZ_GCC_VERSION_AT_MOST(4,8,4)) || (MOZ_GCC_VERSION_AT_LEAST(4,9,0) && MOZ_GCC_VERSION_AT_MOST(4,9,2)) This line, as well as the line above, looks too long. Could you break both of them into two lines?
Attachment #8759871 - Flags: review?(bugzilla) → review+
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla49
It looks like the fix worked, since the last crash in: https://crash-stats.mozilla.com/signature/?date=%3E%3D2016-05-01&signature=mozilla%3A%3APseudoElementForStyleContext#aggregations is still from the 2016-06-05 build.
(In reply to Martin Stránský from comment #20) > This crash is still live in FF 52 ESR/i386, at least on RHEL6/CentOS6: > > https://bugs.centos.org/view.php?id=13253&nbn=1 > https://bugzilla.redhat.com/show_bug.cgi?id=1451055 Same signature doesn't mean it is the same bug, but it could be. This crash was from a compiler bug, and the workaround was only added for Android because that's the only place we use -Os I suppose. It is possible that the compiler RHEL6 / CentOS6 uses still have this bug, and your build configuration triggers it.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: