Closed
Bug 1019634
Opened 10 years ago
Closed 10 years ago
Enabling DMD on flame puts phone in reboot loop
Categories
(Core :: DMD, defect)
Tracking
()
RESOLVED
DUPLICATE
of bug 1034146
blocking-b2g | 2.0+ |
People
(Reporter: kats, Assigned: erahm)
References
Details
(Whiteboard: [caf priority: p2][MemShrink:P1] [CR 689431])
Attachments
(1 file: deleted, text/plain)
I tried enabling DMD on Flame (doing a clean gecko build with export MOZ_DMD=1 in my .userconfig) on recent trunk code and when I flashed it on the device it just continuously rebooted. The reboot happens really early in startup so I couldn't get a debugger attached. Logcat snippet attached.
Comment 1•10 years ago (Assignee)
I've reproduced this with the following backtrace:
#0 0xb6e82ace in jemalloc_crash () at ../../../gecko/memory/mozjemalloc/jemalloc.c:1574
#1 0xb6e82eda in arena_bin_malloc_easy (bin=<optimized out>, run=<optimized out>, arena=Unhandled dwarf expression opcode 0xfa ) at ../../../gecko/memory/mozjemalloc/jemalloc.c:3870
#2 0xb6e8497a in arena_bin_malloc_hard (bin=<optimized out>, arena=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:3891
#3 arena_malloc_small (zero=false, size=112, arena=0xb6ba3040) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4076
#4 arena_malloc (arena=0xb6ba3040, size=<optimized out>, zero=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4150
#5 0xb6e84efe in imalloc (size=100) at ../../../gecko/memory/mozjemalloc/jemalloc.c:4162
#6 imalloc (size=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:6192
#7 je_malloc (size=<optimized out>) at ../../../gecko/memory/mozjemalloc/jemalloc.c:6216
#8 0xb6f2618c in mozilla::dmd::InfallibleAllocPolicy::malloc_ (aSize=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:95
#9 0xb6f27144 in new_<mozilla::dmd::StackTrace, mozilla::dmd::StackTrace> (p1=...) at ../../../../gecko/memory/replace/dmd/DMD.cpp:148
#10 mozilla::dmd::StackTrace::Get (aT=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:903
#11 0xb6f27c56 in AllocCallback (aT=0xb6a02090, aReqSize=96, aPtr=0xb6ae2820) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1178
#12 mozilla::dmd::AllocCallback (aPtr=0xb6ae2820, aReqSize=96, aT=0xb6a02090) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1155
#13 0xb6f2901a in replace_realloc (aOldPtr=0xb6ac75f0, aSize=96) at ../../../../gecko/memory/replace/dmd/DMD.cpp:1299
#14 0xb6cde70c in android::Parcel::continueWrite (this=0xbedf38d0, desired=96) at frameworks/native/libs/binder/Parcel.cpp:1529
#15 0xb6cde7e6 in android::Parcel::writeInplace (this=0xbedf38d0, len=54) at frameworks/native/libs/binder/Parcel.cpp:610
#16 0xb6cdf08c in android::Parcel::writeString16 (this=0xbedf38d0, str=0xb6ad8110 u"android.os.IServiceManager", len=52) at frameworks/native/libs/binder/Parcel.cpp:685
#17 0xb6cdc328 in android::BpServiceManager::checkService (this=0xb6a04fa0, name=...) at frameworks/native/libs/binder/IServiceManager.cpp:148
#18 0xb6cdc7fc in android::BpServiceManager::getService (this=0xb6a04fa0, name=...) at frameworks/native/libs/binder/IServiceManager.cpp:137
#19 0xb4394b96 in ?? ()
#20 0xb4394b96 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
Whiteboard: [MemShrink]
Updated•10 years ago
Blocks: CAF-v2.0-FC-metabug
blocking-b2g: --- → 2.0?
Updated•10 years ago (Assignee)
Assignee: nobody → erahm
Comment 3•10 years ago (Assignee)
It would appear a run header is being overwritten. The asserting line is:
at ../../../gecko/memory/mozjemalloc/jemalloc.c:3870
3870 RELEASE_ASSERT(run->magic == ARENA_RUN_MAGIC);
Comment 4•10 years ago (Assignee)
I've attempted a FORTIFY build (it caught some unrelated compile-time issues; I'll file bugs for those) and a stackprotect build, neither of which caught the issue.
After testing on a debug build I've narrowed this down to always happening when we're taking a stack trace for an allocation in DMD. It's not the first measurement; it seems to happen after the 5th or so. Basically this part:
#9 0xb6f27144 in new_<mozilla::dmd::StackTrace, mozilla::dmd::StackTrace> (p1=...) at ../../../../gecko/memory/replace/dmd/DMD.cpp:148
#10 mozilla::dmd::StackTrace::Get (aT=<optimized out>) at ../../../../gecko/memory/replace/dmd/DMD.cpp:903
So it's possible that when we grab a stack trace there's some sort of memory corruption and then the next allocation blows up. The other possibility is that the memory corruption happens earlier on and somehow consistently lands in the run just prior to the sizeof(dmd::StackTrace) run (I think it ends up being 112).
For background: jemalloc carves up large chunks of memory into page-sized runs; each run holds items all of the same size.
Next steps:
- Disable all stack walking, see if it still reproduces.
- Figure out what size bin the previous run serves and see if it's consistently the same each time we try to reproduce. If so, set a breakpoint for allocations of that size, check the stack for allocations towards the end of the run and trace down the memory.
- If we can get jemalloc to play nice w/ valgrind that might help us trace things down (bug 977067, comment 29).
I'm open to other suggestions of course!
Comment 5•10 years ago (Assignee)
I disabled the NS_StackWalk call in DMD and am still seeing crashes, so that's probably not it. I disabled the AllocCallback and no longer saw crashes, so it's not inherently DMD injecting itself at least, but somehow we're doing something wrong in the stack tracking portion. Of further interest would probably be looking at the tables we're using to store blocks, and really any allocs or deallocs within DMD. There's an area where we "GC" the stack traces, so it's possible that's doing something bad as well.
Comment 6•10 years ago (Assignee)
Disabling stack trace GC had no effect. Allowing stack trace measurement, allocation and insertion into the stack trace table, but disabling insertion into the block table does not crash. So it looks like something bad is happening with the block table.
blocking-b2g: 2.0? → 2.0+
Comment 7•10 years ago (Assignee)
It looks like if I disable shrinking in JS::HashTable the crash goes away (although b2g ends up hosed in some other way).
Comment 8•10 years ago (Assignee)
I've tracked down the real issue to qcom's hwcomposer stomping memory in some debug config code.
Comment 9•10 years ago (Assignee)
Bug 1034146 provides further details and a patch that fixes the issue. |git apply| the patch to |hardware/qcom/display|, run |./build.sh && ./flash.sh| and you should be good to go.
Updated•10 years ago
Whiteboard: [MemShrink] → [MemShrink] [CR 1019634]
Updated•10 years ago
Whiteboard: [MemShrink] [CR 1019634] → [MemShrink] [CR 689431]
Updated•10 years ago
Whiteboard: [MemShrink] [CR 689431] → [caf priority: p2][MemShrink] [CR 689431]
Comment 10•10 years ago
Tapas, can you confirm comment #9 is working for you?
Flags: needinfo?(tkundu)
(In reply to bhavana bajaj [:bajaj] [NOT reading Bugmail, needInfo please] from comment #10)
> Tapas, can you confirm comment #9 is working for you?
This patch works fine and we are seeing the DMD report now.
Flags: needinfo?(tkundu)
Comment 12•10 years ago (Assignee)
What else needs to be done to get this fix live?
Flags: needinfo?(mwu)
Whiteboard: [caf priority: p2][MemShrink] [CR 689431] → [caf priority: p2][MemShrink:P1] [CR 689431]
(In reply to Eric Rahm [:erahm] from comment #12)
> What else needs to be done to get this fix live?
It has already landed. See bug 1034146 comment 3.
Comment 14•10 years ago
If you're testing on 2.0, we'll need to update our manifest to pick up the updates.
Flags: needinfo?(mwu)
Comment 15•10 years ago (Assignee)
We need this for 1.4 and 2.0 to help with memory regression analysis.
Comment 16•10 years ago
Oh, actually that won't work because the fix from bug 1034146 comment 3 only landed on the KK branch. Our flames are still on JB, so that won't help us until we get KK images for our Flames.
No longer blocks: CAF-v2.0-FC-metabug
Comment 17•10 years ago (Assignee)
Tapas, is there a way to get this landed on the JB branch?
Flags: needinfo?(tkundu)
Comment 18•10 years ago
Eric -- Sushil is working on getting the fix available on JB. Sushil -- please update here once we have the fix on codeaurora.org.
Flags: needinfo?(tkundu) → needinfo?(sushilchauhan)
Updated•10 years ago
Blocks: CAF-v2.0-FC-metabug
Comment 19•10 years ago (Assignee)
Any updates here?
Comment 20•10 years ago
The fix has landed on CAF. Here is the link: https://www.codeaurora.org/cgit/quic/la/platform/hardware/qcom/display/commit/?h=b2g_jb_3.2&id=3f499aa8c4af9ae053dbce91eec60b37d0d6a26d
Flags: needinfo?(sushilchauhan)
Comment 21•10 years ago (Assignee)
Now that this has landed on the JB branch, what's needed to get the manifests for 1.4, 2.0, m-c updated?
Flags: needinfo?(mwu)
Comment 22•10 years ago
I can update the 2.0 manifest since this has 2.0 blocking.
Flags: needinfo?(mwu)
Comment 23•10 years ago (Assignee)
FWIW bug 1034146 (the bug that blocks this) is 1.4+ which is what this issue fixes.
Comment 24•10 years ago
2.0 manifest: https://github.com/mozilla-b2g/b2g-manifest/commit/d2babab58743c696f46d614e84fdb9f2a0dd75d7
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → DUPLICATE