1019634 - Enabling DMD on flame puts phone in reboot loop

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Reporter

Description

•

10 years ago

Attached file: Logcat snippet for crash (deleted)

I tried enabling DMD on a Flame (a clean gecko build with export MOZ_DMD=1 in my .userconfig) on recent trunk code. After I flashed it, the device rebooted continuously. The reboot happens very early in startup, so I couldn't attach a debugger. Logcat snippet attached.

Eric Rahm [:erahm]

Assignee

Comment 1

•

10 years ago

I've reproduced this with the following backtrace:

#0  0xb6e82ace in jemalloc_crash () at ../../../gecko/memory/mozjemalloc/jemalloc.c:1574
#1  0xb6e82eda in arena_bin_malloc_easy (bin=<optimized out>, run=<optimized out>, arena=Unhandled dwarf expression opcode 0xfa
) at ../../../gecko/memory/mozjemalloc/jemalloc.c:3870
#2  0xb6e8497a in arena_bin_malloc_hard (bin=<optimized out>, arena=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:3891
#3  arena_malloc_small (zero=false, size=112, arena=0xb6ba3040) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4076
#4  arena_malloc (arena=0xb6ba3040, size=<optimized out>, zero=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4150
#5  0xb6e84efe in imalloc (size=100) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4162
#6  imalloc (size=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:6192
#7  je_malloc (size=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:6216
#8  0xb6f2618c in mozilla::dmd::InfallibleAllocPolicy::malloc_ (aSize=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:95
#9  0xb6f27144 in new_<mozilla::dmd::StackTrace, mozilla::dmd::StackTrace> (p1=...) at ../../../../gecko/memory/replace/dmd/DMD.cpp:148
#10 mozilla::dmd::StackTrace::Get (aT=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:903
#11 0xb6f27c56 in AllocCallback (aT=0xb6a02090, aReqSize=96, aPtr=0xb6ae2820) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1178
#12 mozilla::dmd::AllocCallback (aPtr=0xb6ae2820, aReqSize=96, aT=0xb6a02090) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1155
#13 0xb6f2901a in replace_realloc (aOldPtr=0xb6ac75f0, aSize=96) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1299
#14 0xb6cde70c in android::Parcel::continueWrite (this=0xbedf38d0, desired=96) at frameworks/native/libs/binder/Parcel.cpp:1529
#15 0xb6cde7e6 in android::Parcel::writeInplace (this=0xbedf38d0, len=54) at frameworks/native/libs/binder/Parcel.cpp:610
#16 0xb6cdf08c in android::Parcel::writeString16 (this=0xbedf38d0, str=0xb6ad8110 u"android.os.IServiceManager", len=52) at frameworks/native/libs/binder/Parcel.cpp:685
#17 0xb6cdc328 in android::BpServiceManager::checkService (this=0xb6a04fa0, name=...) at frameworks/native/libs/binder/IServiceManager.cpp:148
#18 0xb6cdc7fc in android::BpServiceManager::getService (this=0xb6a04fa0, name=...) at frameworks/native/libs/binder/IServiceManager.cpp:137
#19 0xb4394b96 in ?? ()
#20 0xb4394b96 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Whiteboard: [MemShrink]

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Updated

•

10 years ago

Blocks: CAF-v2.0-FC-metabug

blocking-b2g: --- → 2.0?

Eric Rahm [:erahm]

Assignee

Updated

•

10 years ago

Assignee: nobody → erahm

Eric Rahm [:erahm]

Assignee

Comment 3

•

10 years ago

It appears a run header is being overwritten. The asserting line is:
  at ../../../gecko/memory/mozjemalloc/jemalloc.c:3870
  3870		RELEASE_ASSERT(run->magic == ARENA_RUN_MAGIC);

Eric Rahm [:erahm]

Assignee

Comment 4

•

10 years ago

I tried a FORTIFY build (it caught some unrelated compile-time issues; I'll file bugs for those) and a stackprotect build. Neither caught the issue.

After testing on a debug build, I've narrowed this down: the crash always happens while we're taking a stack trace for an allocation in DMD. It's not on the first measurement; it seems to happen after the 5th or so.

Basically this part:
#9 0xb6f27144 in new_<mozilla::dmd::StackTrace, mozilla::dmd::StackTrace> (p1=...) at ../../../../gecko/memory/replace/dmd/DMD.cpp:148
#10 mozilla::dmd::StackTrace::Get (aT=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:903

So it's possible that grabbing a stack trace corrupts memory and the next allocation then blows up. The other possibility is that memory corruption happens earlier on and somehow consistently lands in the run just before the run that serves sizeof(dmd::StackTrace) allocations (I think it ends up being 112).
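
As a rough illustration of where the 112 comes from: frame #5 of the backtrace requests 100 bytes and frame #3 serves it from a 112-byte bin, which is consistent with rounding up to a 16-byte quantum. The sketch below assumes that quantum; |QUANTUM| and |quantum_ceiling| are hypothetical names for illustration, not mozjemalloc's own code.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed quantum of 16 bytes, picked to match the 100 -> 112 rounding
 * seen between frames #5 and #3 of the backtrace. */
#define QUANTUM ((size_t)16)

/* Hypothetical helper: round a small request up to the next multiple
 * of QUANTUM. This only models quantum-spaced small size classes. */
static size_t quantum_ceiling(size_t size)
{
    assert(size > 0 && size <= 512); /* range is an assumption of this sketch */
    return (size + QUANTUM - 1) & ~(QUANTUM - 1);
}
```

Under that assumption a 100-byte request and a 112-byte request land in the same bin, while the 96-byte realloc in frame #13 would be served from a different one.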

For background: jemalloc carves up large chunks of memory into page-sized runs; each run holds items of a single size.
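
To make the second hypothesis concrete, here is a toy model of two adjacent runs in one buffer. It is only a sketch under simplifying assumptions: the 16-byte header, the 4096-byte run size, the magic value, and all function names are made up for illustration and are not the real arena_run_t layout from jemalloc.c.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RUN_SIZE  4096u        /* assumed page-sized run */
#define HDR_SIZE  16u          /* toy header size, not the real one */
#define RUN_MAGIC 0xC0FFEE42u  /* placeholder, not the real ARENA_RUN_MAGIC */

/* Stamp the magic value at the start of run `idx` inside `chunk`. */
static void run_init(unsigned char *chunk, unsigned idx)
{
    uint32_t magic = RUN_MAGIC;
    memcpy(chunk + (size_t)idx * RUN_SIZE, &magic, sizeof magic);
}

/* Toy version of the run->magic == ARENA_RUN_MAGIC check. */
static int run_is_valid(const unsigned char *chunk, unsigned idx)
{
    uint32_t magic;
    memcpy(&magic, chunk + (size_t)idx * RUN_SIZE, sizeof magic);
    return magic == RUN_MAGIC;
}

/* Address of region `n` (each `reg_size` bytes) inside run `idx`. */
static unsigned char *run_region(unsigned char *chunk, unsigned idx,
                                 unsigned reg_size, unsigned n)
{
    /* The region must fit entirely inside its own run. */
    assert(HDR_SIZE + (n + 1) * reg_size <= RUN_SIZE);
    return chunk + (size_t)idx * RUN_SIZE + HDR_SIZE + (size_t)n * reg_size;
}
```

In this model, writing past the end of the last region of run 0 by more than the leftover slack lands on the header of run 1. Nothing fails at the time of the bad write; the magic check only trips later, on the next allocation from run 1, which would match a crash that shows up in an innocent-looking allocation.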

Next steps:
- Disable all stack walking, see if it still reproduces
- Figure out what size bin the previous run serves and see if it's consistently the same each time we try to reproduce. If so, set a breakpoint for allocations of that size, check the stack for allocations towards the end of the run, and trace down the memory.
- If we can get jemalloc to play nice with valgrind, that might help us trace things down (bug 977067, comment 29).

I'm open to other suggestions of course!

Eric Rahm [:erahm]

Assignee

Comment 5

•

10 years ago

I disabled the NS_StackWalk call in DMD and am still seeing crashes, so that's probably not it. I disabled the AllocCallback and no longer saw crashes, so at least it's not inherently DMD injecting itself; somehow we're doing something wrong in the stack tracking portion.

Of further interest would probably be looking at the tables we're using to store blocks, and really any allocs or deallocs within DMD. There's an area where we "GC" the stack traces, so it's possible that's doing something bad as well.

Eric Rahm [:erahm]

Assignee

Comment 6

•

10 years ago

Disabling stack trace GC had no effect. With stack trace measurement, allocation, and insertion into the stack trace table enabled but insertion into the block table disabled, there is no crash. So it looks like something bad is happening with the block table.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Updated

•

10 years ago

blocking-b2g: 2.0? → 2.0+

Kevin Grandon :kgrandon

Updated

•

10 years ago

Blocks: 1029902

Eric Rahm [:erahm]

Assignee

Comment 7

•

10 years ago

It looks like if I disable shrinking in JS::HashTable the crash goes away (although b2g ends up hosed in some other way).

Eric Rahm [:erahm]

Assignee

Comment 8

•

10 years ago

I've tracked down the real issue to qcom's hwcomposer stomping memory in some debug config code.

Eric Rahm [:erahm]

Assignee

Updated

•

10 years ago

Depends on: 1034146

Eric Rahm [:erahm]

Assignee

Comment 9

•

10 years ago

Bug 1034146 provides further details and a patch that fixes the issue. |git apply| the patch to |hardware/qcom/display|, run |./build.sh && ./flash.sh| and you should be good to go.

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Updated

•

10 years ago

Whiteboard: [MemShrink] → [MemShrink] [CR 1019634]

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Updated

•

10 years ago

Whiteboard: [MemShrink] [CR 1019634] → [MemShrink] [CR 689431]

cafbot (PoC: ggrisco)

Updated

•

10 years ago

Whiteboard: [MemShrink] [CR 689431] → [caf priority: p2][MemShrink] [CR 689431]

bhavana bajaj [:bajaj]

Comment 10

•

10 years ago

Tapas, can you confirm comment #9 is working for you?

Flags: needinfo?(tkundu)

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 11

•

10 years ago

(In reply to bhavana bajaj [:bajaj] [NOT reading Bugmail, needInfo please] from comment #10)
> Tapas, can you confirm comment #9 is working for you?

This patch works fine and we are now seeing DMD reports.

Flags: needinfo?(tkundu)

Eric Rahm [:erahm]

Assignee

Comment 12

•

10 years ago

What else needs to be done to get this fix live?

Flags: needinfo?(mwu)

Whiteboard: [caf priority: p2][MemShrink] [CR 689431] → [caf priority: p2][MemShrink:P1] [CR 689431]

Tapas[:tkundu on #b2g/gaia/memshrink/gfx] (always NI me)

Comment 13

•

10 years ago

(In reply to Eric Rahm [:erahm] from comment #12)
> What else needs to be done to get this fix live?

It has already landed. See bug 1034146 comment 3.

Michael Wu [:mwu]

Comment 14

•

10 years ago

If you're testing on 2.0, we'll need to update our manifest to pick up the updates.

Flags: needinfo?(mwu)

Eric Rahm [:erahm]

Assignee

Comment 15

•

10 years ago

We need this for 1.4 and 2.0 to help with memory regression analysis.

Michael Wu [:mwu]

Comment 16

•

10 years ago

Oh, actually that won't work because the fix from bug 1034146 comment 3 only landed on the KK branch. Our Flames are still on JB, so that won't help us until we get KK images for them.

Inder

Updated

•

10 years ago

No longer blocks: CAF-v2.0-FC-metabug

Eric Rahm [:erahm]

Assignee

Comment 17

•

10 years ago

Tapas, is there a way to get this landed on the JB branch?

Flags: needinfo?(tkundu)

Inder

Comment 18

•

10 years ago

Eric -- Sushil is working on getting the fix available on JB.

Sushil -- please update here once we have the fix on codeaurora.org.

Flags: needinfo?(tkundu) → needinfo?(sushilchauhan)

cafbot (PoC: ggrisco)

Updated

•

10 years ago

Blocks: CAF-v2.0-FC-metabug

Eric Rahm [:erahm]

Assignee

Comment 19

•

10 years ago

Any updates here?

Sushil

Comment 20

•

10 years ago

The fix has landed on CAF. Here is the link:
https://www.codeaurora.org/cgit/quic/la/platform/hardware/qcom/display/commit/?h=b2g_jb_3.2&id=3f499aa8c4af9ae053dbce91eec60b37d0d6a26d

Flags: needinfo?(sushilchauhan)

Eric Rahm [:erahm]

Assignee

Comment 21

•

10 years ago

Now that this has landed on the JB branch, what's needed to get the manifests for 1.4, 2.0, m-c updated?

Flags: needinfo?(mwu)

Michael Wu [:mwu]

Comment 22

•

10 years ago

I can update the 2.0 manifest since this has 2.0 blocking.

Flags: needinfo?(mwu)

Eric Rahm [:erahm]

Assignee

Comment 23

•

10 years ago

FWIW, bug 1034146 (the bug that blocks this one, and the one whose patch fixes this issue) is 1.4+.

Michael Wu [:mwu]

Comment 24

•

10 years ago

2.0 manifest: https://github.com/mozilla-b2g/b2g-manifest/commit/d2babab58743c696f46d614e84fdb9f2a0dd75d7

Michael Wu [:mwu]

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → DUPLICATE