Closed
Bug 1206485
Opened 9 years ago
Closed 9 years ago
Boot loop after first boot on some devices (Xperia M2, ...)
Categories
(Firefox OS Graveyard :: GonkIntegration, defect)
Tracking
(firefox44 fixed)
RESOLVED
FIXED
FxOS-S8 (02Oct)
Release | Tracking | Status |
---|---|---|
firefox44 | --- | fixed |
People
(Reporter: gerard-majax, Assigned: jonco)
References
Details
(Keywords: regression, Whiteboard: [dogfood-blocker])
Attachments
(2 files)
After the first boot completes properly, the next reboot ends in a segfault. Reproduced on at least Xperia Eagle and Tianchi.
STR:
0. Build and flash everything (including userdata)
1. Boot, complete FTU
2. Reboot
Expected:
Device boots properly
Actual:
Device dies with segfault.
I have dug a little, and the regression appears to have landed during the week: gecko 9c01eed3d4e41157ce25c14fd52a7af98d0d13dc does not expose the issue.
Reporter
Comment 1•9 years ago
Flags: needinfo?(nicolas.b.pierron)
Reporter
Comment 2•9 years ago
Maybe a long shot, but it's in the regression range AND it touches the js/src/gc/Heap.h file. So I'll try a local revert of 5da7dbdb733e4c9e96945b7aa74dd8654da2f3d1:
$ git log 9c01eed3d4e41157ce25c14fd52a7af98d0d13dc..mozillaorg/master js/src/gc/Heap.h
commit 5da7dbdb733e4c9e96945b7aa74dd8654da2f3d1
Author: Terrence Cole <terrence@mozilla.com>
Date:   Wed Sep 16 11:19:44 2015 -0700

    Bug 1205054 - Remove isNullLike and other imprecise null checks; r=sfink

commit bdd0fc968bc607fe892538d97047202694b82485
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:13:51 2015 +0800

    Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>

commit 00a45d37f4e08418790dbd48cf64f4557eac5ffb
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:05:08 2015 +0800

    Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
Reporter
Comment 3•9 years ago
After testing, it still reproduces with bug 1205054 reverted. Maybe it's bug 1123237: it's a big one that also landed recently, and that already got backed out :)
Reporter
Comment 4•9 years ago
The last commit before bug 1123237 (5d8728423441575dc81c6c38de69fbc7ca35f163) is fine. Now checking the one that includes this set of patches (16cae3c6c6b37c2580f05f4ee415b18fff635c83).
I can't reproduce the issue on some devices (Z3c, Flame), but last night I saw reports from people on Flame whose devices have been going into a boot loop for a couple of hours/days. So this might be worse than it looks.
Reporter
Comment 5•9 years ago
(In reply to Alexandre LISSY :gerard-majax from comment #4)
> Latest commit just before bug 1123237
> (5d8728423441575dc81c6c38de69fbc7ca35f163) is fine. Checking the one that
> includes this set of patches (16cae3c6c6b37c2580f05f4ee415b18fff635c83).
>
> I don't reproduce the issue on some devices (Z3c, Flame), but I saw reports
> yesterday night of people on Flame with device going into boot loop since a
> couple of hours/days. So this might be worse than it looks.
Right, so I can confirm the regression is within the range 5d8728423441575dc81c6c38de69fbc7ca35f163..16cae3c6c6b37c2580f05f4ee415b18fff635c83:
$ git log --oneline 5d8728423441575dc81c6c38de69fbc7ca35f163..16cae3c6c6b37c2580f05f4ee415b18fff635c83
16cae3c Bug 1123237 - Part 12. Fix GC hazards. r=terrence
4efb6ef Bug 1123237 - Part 11. Don't use STL in memory-profiler. r=BenWa,cervantes
28c419d Bug 1123237 - Part 10. Expose SwapElements from nsBaseHashtable. r=nfroyd
c2a6c6f Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb
5c8dec0 Bug 1123237 - Part 8. Tracking the memory events. r=BenWa,terrence
008c01b Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug
b4ef7f3 Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange
22e4adb Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence
75ad5ad Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sfink
bdd0fc9 Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence
00a45d3 Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence
Can we back out only one of those, or do we absolutely need to keep them all together?
Sorry for the mass needinfo :)
Reporter
Comment 6•9 years ago
008c01bd6c9cf6ed6831d9ad3663f54b7b427484 is the first bad commit
commit 008c01bd6c9cf6ed6831d9ad3663f54b7b427484
Author: Kan-Ru Chen <kanru@kanru.info>
Date: Fri May 8 11:22:38 2015 +0800
Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug
Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
:040000 040000 ed6a258195e2c7f780ec8b06e751df38a2a69c65 2cde680a8dad5555679bd7e28d132871636d9dbe M b2g
:040000 040000 dd426c61bf5f8225bca94c9f0d922e448648806a 33f8e1fce98ce0bf539df2065f5bb53e9ee7bd7e M browser
:040000 040000 c0f1c5b2d67565f4d5fc816a698487d4afb8ae41 f648cbd70baee3cf41267d6f73805b1aa515aeec M mobile
:040000 040000 0474479facca6017da670b573a9fe99511795e3e 3152a76d0d8e002889a25b13b822f2d2699c7ec1 M toolkit
:040000 040000 8093074ae509c6d8b843079815d27facc1accc0a a8e8a6319e4fdb7d9bf198d32e04a01d647a8c2d M tools
Comment 7•9 years ago
I'm not sure what the above has to do with this but from my experiments after reverting to 5d8728423441575dc81c6c38de69fbc7ca35f163:
Experiment 1:
Boot, go through FTU with connecting to WiFi
Open browser, open settings
reboot --> Boot loop.
Experiment 2:
Boot, go through FTU without connecting to WiFi
Open browser, open settings
reboot --> Boot normal.
Experiment 3:
Disconnect WiFi router from Internet
Boot, go through FTU with connecting to WiFi
Open browser, open settings
reboot --> Boot normal.
Experiment 4:
Disconnect WiFi router from Internet
Boot, go through FTU with connecting to WiFi
Open browser, open settings
reboot --> Boot normal.
Reconnect WiFi router to Internet
Open browser, open settings
reboot --> Boot normal.
Open browser, open settings
reboot --> Boot loop.
What is common to Experiment 1 and Experiment 4? In both cases there is an eventual boot to the home screen with a functioning WiFi connection. This triggers a check for updates. It is this check that causes the boot loop; probably a database is being corrupted.
Updated•9 years ago
Summary: Boot loot after first boot on some devices (Xperia M2, ...) → Boot loop after first boot on some devices (Xperia M2, ...)
Reporter
Comment 8•9 years ago
Ok, comment 6 might be wrong but the range is still the proper one. I have checked out Part 6 and pushed that Gecko on a device already in a bad state. Device is still boot looping.
Reporter
Comment 9•9 years ago
Comment 10•9 years ago
(In reply to Alexandre LISSY :gerard-majax from comment #5)
> Can we backout only one of those or do we absolutely need to keep them
> alltogether?
If needed, they have to be backed out together. Note that I can't reproduce this on Flame, and I'm not sure how a disabled feature could affect booting. Is the segfault always at the same place?
Flags: needinfo?(kchen)
Reporter
Comment 11•9 years ago
(In reply to Kan-Ru Chen [:kanru] from comment #10)
> (In reply to Alexandre LISSY :gerard-majax from comment #5)
> > Can we backout only one of those or do we absolutely need to keep them
> > alltogether?
>
> If needed they have to be backed out together. Note I can't reproduce this
> on Flame and I'm not sure how could a disabled feature affect booting. Is
> the segfault always at the same place?
Always. Before the regression range, nothing. After, constantly under the described conditions.
Reporter
Comment 12•9 years ago
Right, now after a repo sync of Gecko and Gaia, I don't have the issue anymore.
The only odd thing I noticed was a crash report on the very first boot, before I began FTU. And the homescreen seemed to be broken after finishing FTU.
It's all okay after a reboot.
Reporter
Comment 13•9 years ago
(In reply to Alexandre LISSY :gerard-majax from comment #12)
> Right, now after a repo sync of Gecko and Gaia, I don't have the issue
> anymore.
>
> Only spurious thing I could notice was a crash report on the very first
> boot, before I begin FTU. And homescreen seemed to be broken after finishing
> FTU.
>
> It's all okay after a reboot.
False hope: after a shutdown and a startup, it is crashing again.
Comment 14•9 years ago
I've noticed that too, _sometimes_ it doesn't start crashing, which made my bisect attempt a fruitless waste of time.
Comment 15•9 years ago
(In reply to Alexandre LISSY :gerard-majax from comment #1)
> Created attachment 8663388 [details]
> gdb backtrace on Xperia M2
I cannot spot anything obvious in the backtrace, but I am no expert in the GC. None of the patches from Bug 1123237 modify the parser, so I guess this issue might be related to some of the nursery patches.
Terrence might know better, but I think this kind of bug is hard to investigate, especially on devices, and our best hope might be to wait until the fuzzers find a similar signature.
Flags: needinfo?(nicolas.b.pierron)
Comment 16•9 years ago
Crashes in the GC are usually heap corruption of some sort, rather than a direct consequence of GC changes. Generally the only way to track this sort of problem down is to bisect, which it seems you have done.
In this particular case, it looks like either the arena list or the freespan head points into unmapped addresses. I'm not entirely sure how that squares with the bisection results, but that patch does add some members to the relevant structs, which could be bumping the addresses off by one if not everything is compiled with the same #defines?
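The #define concern can be illustrated with a small sketch. The struct, member, and flag names below (ToyArenaListsPlain, ToyArenaListsProfiled, profilerHook, TOY_MEMORY_PROFILER) are hypothetical and are not the real SpiderMonkey definitions. The sketch only shows that a member guarded by a build flag shifts the offset of every field declared after it, so two translation units that disagree about the flag would read different words for the same field name.

```cpp
#include <cassert>
#include <cstddef>

// Layout as seen by code compiled WITHOUT the hypothetical
// TOY_MEMORY_PROFILER flag.
struct ToyArenaListsPlain {
    void* runtime;
    void* freeListHead;
};

// Layout as seen by code compiled WITH the flag: one extra pointer-sized
// member sits in front of freeListHead.
struct ToyArenaListsProfiled {
    void* runtime;
    void* profilerHook;  // hypothetical member that only exists under the flag
    void* freeListHead;
};

// Byte offset at which each build expects to find freeListHead.
std::size_t freeListOffset(bool profilerCompiledIn) {
    return profilerCompiledIn ? offsetof(ToyArenaListsProfiled, freeListHead)
                              : offsetof(ToyArenaListsPlain, freeListHead);
}

// True when the two builds disagree about where freeListHead lives.
bool layoutsDisagree() {
    return freeListOffset(true) != freeListOffset(false);
}

// Sanity check for the sketch itself.
void selfCheck() {
    assert(layoutsDisagree());
}
```

If something like this were happening, code built without the flag would read the word that the other build uses for the extra member, which would be one way to end up with a free-list pointer into unmapped memory.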
Flags: needinfo?(terrence)
Flags: needinfo?(nhirata.bugzilla)
Comment 17•9 years ago
I'm getting a boot loop on Z3C - don't know if it's related, but I bisected:
2d0398ffa709b2af2e5a1e588086a874479c67e6 is the first bad commit
commit 2d0398ffa709b2af2e5a1e588086a874479c67e6
Author: Josh Matthews <josh@joshmatthews.net>
Date: Sun Sep 20 05:57:15 2015 -0400
Bug 885982 - Part 4: Remove all traces of JS implementation. r=asuth
:040000 040000 e3e092e3fc55443ecb1ff1f635dbc68633ee90f6 87637d13278226cea38d380d14f5933d1d9bb5b3 M b2g
:040000 040000 580eb5cdb448408ad501c32ab3f895417d87000e f6cf9fd2663e05213aca86f893deede1d813f519 M browser
:040000 040000 67c3d9dd8c9afcff5cea25fbc997cbd9df99b9d2 feb0b53d300320ccacea09d06281670b0e11475e M dom
:040000 040000 4d5d2838010e6d1ebe5e036f831cccfa19f41199 56c90d1b6513a0038c87c8f793e1aaa3704f14d9 M mobile
Comment 18•9 years ago
fwiw, I'm facing the same issue on a z3c, with the same stack trace as the one initially reported
Comment 20•9 years ago
FTR, I have a profile that always crashes when compiling ContactDB.jsm
Comment 21•9 years ago
Reverting the commit I mention in comment #17 on top of master fixes the boot loop for me.
Comment 22•9 years ago
(gdb) f
#2 js::TenuringTracer::moveToTenured (this=this@entry=0xbed9b248, src=0xb2033180) at /home/ting/w/fx/os/aries-kk/gecko/js/src/gc/Marking.cpp:2059
2059 TenuredCell* t = zone->arenas.allocateFromFreeList(dstKind, Arena::thingSize(dstKind));
(gdb) p zone->runtime
There is no member or method named runtime.
(gdb) p zone->runtime_
$9 = (JSRuntime * const) 0x904ff0e9
(gdb) p zone->runtime_ == zone->arenas.runtime_
$10 = false
(gdb) p zone->arenas.runtime_
$11 = (JSRuntime *) 0x1cf8cd93
(gdb) p *zone->arenas.runtime_
Cannot access memory at address 0x1cf8cd93
(gdb)
I think zone->runtime_ != zone->arenas.runtime_ is impossible
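A minimal sketch of why this mismatch points at memory corruption, assuming (as the comment implies) that both fields are set from the same runtime pointer when the zone is created. ToyRuntime, ToyArenaLists, and ToyZone are hypothetical stand-ins, not the real classes.

```cpp
#include <cassert>

struct ToyRuntime {};

struct ToyArenaLists {
    ToyRuntime* runtime_;
    explicit ToyArenaLists(ToyRuntime* rt) : runtime_(rt) {}
};

struct ToyZone {
    ToyRuntime* runtime_;
    ToyArenaLists arenas;
    // Both copies of the pointer come from the same constructor argument,
    // so they can only differ if something overwrites one of them later.
    explicit ToyZone(ToyRuntime* rt) : runtime_(rt), arenas(rt) {}
};

// The invariant checked by hand in the gdb session above.
bool runtimePointersAgree(const ToyZone& zone) {
    return zone.runtime_ == zone.arenas.runtime_;
}

// Stand-in for a stray write: overwrite one copy and re-check the invariant.
bool agreeAfterStrayWrite() {
    ToyRuntime rt;
    ToyRuntime other;
    ToyZone zone(&rt);
    assert(runtimePointersAgree(zone));
    zone.arenas.runtime_ = &other;  // simulated corruption
    return runtimePointersAgree(zone);
}
```

Under that assumption, seeing the two values differ (and one of them unreadable) suggests the zone's memory was overwritten, or that the zone pointer itself is not pointing at a real zone.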
Comment 23•9 years ago
Setting javascript.options.ion to false prevents the crash.
Assignee
Comment 24•9 years ago
The crash occurs during minor GC when we are marking the store buffers. With kanru's help debugging, we found that we are marking what appears to be a FunctionBox object where we expect to see a nursery-allocated JSObject.
Assignee
Comment 25•9 years ago
I wasn't able to reproduce the crash, but I found something that could cause it.
JSFunction has a union containing a JSObject* and a FunctionBox*. To make barriers work on the object pointer, when assigning to this we cast its address to a HeapPtrObject*. This will create a store buffer entry in the right circumstances (JSFunction allocated in the tenured heap, JSObject allocated in the nursery).
While parsing a function we swap out this object pointer and set the function box pointer instead. We don't do anything to remove the store buffer entry though.
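The aliasing described above can be modelled with a toy sketch. Everything in it is hypothetical (ToyFunction, ToyStoreBuffer, setEnvironment, staleEntryAfterParse); it is not the SpiderMonkey code and not the attached patch. The function is assumed to be tenured, the store buffer is reduced to a set of slot addresses that a minor GC would later read as object pointers, and dropping the entry before the swap is shown only as one possible way to avoid the stale entry.

```cpp
#include <cassert>
#include <unordered_set>

struct ToyObject {};
struct ToyFunctionBox {};

// One word of storage that is read either as an object pointer or as a
// function-box pointer, depending on what the function is doing.
struct ToyFunction {
    union {
        ToyObject* env;
        ToyFunctionBox* box;
    } u;
};

// Toy store buffer: addresses of slots that a minor GC will visit and
// interpret as ToyObject*.
struct ToyStoreBuffer {
    std::unordered_set<const void*> slots;
    void put(const void* slot) { slots.insert(slot); }
    void unput(const void* slot) { slots.erase(slot); }
    bool contains(const void* slot) const { return slots.count(slot) != 0; }
};

// Barriered write of the object pointer. The function is assumed to be
// tenured, so a nursery-allocated target records the slot.
void setEnvironment(ToyFunction& fun, ToyObject* obj, bool objInNursery,
                    ToyStoreBuffer& sb) {
    fun.u.env = obj;
    if (objInNursery) sb.put(&fun.u);
}

// Swaps the slot over to a function box, as parsing does, and reports
// whether the store buffer still points at the slot afterwards.
bool staleEntryAfterParse(bool dropEntryFirst) {
    ToyObject nurseryObj;
    ToyFunctionBox box;
    ToyFunction fun;
    ToyStoreBuffer sb;
    setEnvironment(fun, &nurseryObj, /* objInNursery = */ true, sb);
    assert(sb.contains(&fun.u));
    if (dropEntryFirst) sb.unput(&fun.u);
    fun.u.box = &box;  // the slot now holds a box pointer, not an object
    return sb.contains(&fun.u);
}
```

In this model, staleEntryAfterParse(false) leaves an entry behind, so a later minor GC would treat the box pointer as a nursery object, which matches the symptom in comment 24.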
Attachment #8664187 -
Flags: review?(terrence)
Reporter
Comment 26•9 years ago
Comment on attachment 8664187 [details] [diff] [review]
bug1206485-function-box-aliasing
Ship it! It looks perfect!
Tested on Xperia M2:
- pushing a gecko with the fix, doing ~8-10 reboots: no problem
- pushing a gecko without the fix: crash after one or two reboots, crash report at the first boot during FTU, homescreen broken
- pushing a gecko with the fix on top of a broken profile: revived
Thanks for the quick patch!
Attachment #8664187 -
Flags: feedback+
Comment 27•9 years ago
Comment on attachment 8664187 [details] [diff] [review]
bug1206485-function-box-aliasing
This looks like it fixes the problem. I've done several reboots on both devices I had problems with, and neither went into a boot loop.
Comment 28•9 years ago
This fixes it for me too :)
Comment 30•9 years ago
That also fixed it on my z3c. SHIP IT!
Comment 31•9 years ago
Comment on attachment 8664187 [details] [diff] [review]
bug1206485-function-box-aliasing
Review of attachment 8664187 [details] [diff] [review]:
-----------------------------------------------------------------
Great find!
Attachment #8664187 -
Flags: review?(terrence) → review+
Comment 32•9 years ago
Comment 33•9 years ago
(In reply to [:fabrice] Fabrice Desré from comment #30)
> That also fixed it on my z3c. SHIP IT!
Shipped to m-i.
Comment 34•9 years ago
Nice!
Assignee: kchen → jcoppeard
Flags: needinfo?(nhirata.bugzilla)
Flags: needinfo?(cyu)
Whiteboard: [dogfood-blocker]
Comment 35•9 years ago
The patch is confirmed to work on Flame.
PS. I'm the author of Bug 1207213.
Comment 36•9 years ago
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox44:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → FxOS-S8 (02Oct)