Closed Bug 1206485 Opened 9 years ago Closed 9 years ago

Boot loop after first boot on some devices (Xperia M2, ...)

Categories

(Firefox OS Graveyard :: GonkIntegration, defect)

ARM
Gonk (Firefox OS)
defect
Not set
blocker

Tracking

(firefox44 fixed)

RESOLVED FIXED
FxOS-S8 (02Oct)
Tracking Status
firefox44 --- fixed

People

(Reporter: gerard-majax, Assigned: jonco)

References

Details

(Keywords: regression, Whiteboard: [dogfood-blocker])

Attachments

(2 files)

After a first boot that completes properly, the next reboot ends in a segfault. Reproduced on Xperia Eagle and Tianchi at least.

STR:
0. Build and flash everything (including userdata)
1. Boot, complete FTU
2. Reboot

Expected: Device boots properly
Actual: Device dies with segfault.

I have dug a little and the regression appears to have landed during the week: gecko 9c01eed3d4e41157ce25c14fd52a7af98d0d13dc does not expose the issue.
Attached file: gdb backtrace on Xperia M2 (deleted)
Flags: needinfo?(nicolas.b.pierron)
Maybe a long shot, but it's in the range of regression AND it matches the js/src/gc/Heap.h file. So I'll try a local revert of 5da7dbdb733e4c9e96945b7aa74dd8654da2f3d1:

$ git log 9c01eed3d4e41157ce25c14fd52a7af98d0d13dc..mozillaorg/master js/src/gc/Heap.h
commit 5da7dbdb733e4c9e96945b7aa74dd8654da2f3d1
Author: Terrence Cole <terrence@mozilla.com>
Date:   Wed Sep 16 11:19:44 2015 -0700

    Bug 1205054 - Remove isNullLike and other imprecise null checks; r=sfink

commit bdd0fc968bc607fe892538d97047202694b82485
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:13:51 2015 +0800

    Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>

commit 00a45d37f4e08418790dbd48cf64f4557eac5ffb
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:05:08 2015 +0800

    Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>
So after testing, still reproducing with bug 1205054 reverted. Maybe it's bug 1123237: it's a big one that landed recently also, and that already got backed out :)
Latest commit just before bug 1123237 (5d8728423441575dc81c6c38de69fbc7ca35f163) is fine. Checking the one that includes this set of patches (16cae3c6c6b37c2580f05f4ee415b18fff635c83).

I don't reproduce the issue on some devices (Z3c, Flame), but I saw reports yesterday night of people on Flame with device going into boot loop for a couple of hours/days. So this might be worse than it looks.
(In reply to Alexandre LISSY :gerard-majax from comment #4)
> Latest commit just before bug 1123237
> (5d8728423441575dc81c6c38de69fbc7ca35f163) is fine. Checking the one that
> includes this set of patches (16cae3c6c6b37c2580f05f4ee415b18fff635c83).
>
> I don't reproduce the issue on some devices (Z3c, Flame), but I saw reports
> yesterday night of people on Flame with device going into boot loop for a
> couple of hours/days. So this might be worse than it looks.

Right, so I can confirm the regression is within the range 5d8728423441575dc81c6c38de69fbc7ca35f163..16cae3c6c6b37c2580f05f4ee415b18fff635c83:

$ git log --oneline 5d8728423441575dc81c6c38de69fbc7ca35f163..16cae3c6c6b37c2580f05f4ee415b18fff635c83
16cae3c Bug 1123237 - Part 12. Fix GC hazards. r=terrence
4efb6ef Bug 1123237 - Part 11. Don't use STL in memory-profiler. r=BenWa,cervantes
28c419d Bug 1123237 - Part 10. Expose SwapElements from nsBaseHashtable. r=nfroyd
c2a6c6f Bug 1123237 - Part 9. Interface to memory-profiler add-ons. r=jimb
5c8dec0 Bug 1123237 - Part 8. Tracking the memory events. r=BenWa,terrence
008c01b Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug
b4ef7f3 Bug 1123237 - Part 6. A new API to get backtrace without allocating memory in profiler. r=mstange
22e4adb Bug 1123237 - Part 5. Don't emit inline allocation when memory profiler enabled. r=terrence
75ad5ad Bug 1123237 - Part 4. Monitoring allocations and frees for ArrayBuffer. r=terrence,sfink
bdd0fc9 Bug 1123237 - Part 3. Monitoring allocation and gc events in nursery and tenured heaps. r=terrence
00a45d3 Bug 1123237 - Part 2. MemoryProfiler hooks in js engine. r=terrence

Can we backout only one of those or do we absolutely need to keep them all together?

Sorry for the mass needinfo :)
Blocks: 1123237
Flags: needinfo?(terrence)
Flags: needinfo?(kchen)
Flags: needinfo?(cyu)
008c01bd6c9cf6ed6831d9ad3663f54b7b427484 is the first bad commit
commit 008c01bd6c9cf6ed6831d9ad3663f54b7b427484
Author: Kan-Ru Chen <kanru@kanru.info>
Date:   Fri May 8 11:22:38 2015 +0800

    Bug 1123237 - Part 7. XPCOM interface for memory profiler. r=smaug

    Based on patch from Ting-Yuan Huang <laszio.bugzilla@gmail.com>

:040000 040000 ed6a258195e2c7f780ec8b06e751df38a2a69c65 2cde680a8dad5555679bd7e28d132871636d9dbe M b2g
:040000 040000 dd426c61bf5f8225bca94c9f0d922e448648806a 33f8e1fce98ce0bf539df2065f5bb53e9ee7bd7e M browser
:040000 040000 c0f1c5b2d67565f4d5fc816a698487d4afb8ae41 f648cbd70baee3cf41267d6f73805b1aa515aeec M mobile
:040000 040000 0474479facca6017da670b573a9fe99511795e3e 3152a76d0d8e002889a25b13b822f2d2699c7ec1 M toolkit
:040000 040000 8093074ae509c6d8b843079815d27facc1accc0a a8e8a6319e4fdb7d9bf198d32e04a01d647a8c2d M tools
I'm not sure what the above has to do with this, but here are my experiments after reverting to 5d8728423441575dc81c6c38de69fbc7ca35f163:

Experiment 1:
Boot, go through FTU, connecting to WiFi
Open browser, open settings
reboot --> Boot loop.

Experiment 2:
Boot, go through FTU, do not connect to WiFi
Open browser, open settings
reboot --> Boot normal.

Experiment 3:
Disconnect WiFi router from Internet
Boot, go through FTU, connecting to WiFi
Open browser, open settings
reboot --> Boot normal.

Experiment 4:
Disconnect WiFi router from Internet
Boot, go through FTU, connecting to WiFi
Open browser, open settings
reboot --> Boot normal.
Reconnect WiFi router to Internet
Open browser, open settings
reboot --> Boot normal.
Open browser, open settings
reboot --> Boot loop.

What do Experiment 1 and Experiment 4 have in common? In both cases there is eventually a boot to the home screen with a functioning WiFi connection. This triggers a check for updates. It is this check that is causing the boot loop, probably because a database is being corrupted.
Summary: Boot loot after first boot on some devices (Xperia M2, ...) → Boot loop after first boot on some devices (Xperia M2, ...)
Ok, comment 6 might be wrong but the range is still the proper one. I have checked out Part 6 and pushed that Gecko on a device already in a bad state. Device is still boot looping.
(In reply to Alexandre LISSY :gerard-majax from comment #5)
> Can we backout only one of those or do we absolutely need to keep them
> all together?

If needed they have to be backed out together. Note I can't reproduce this on Flame and I'm not sure how a disabled feature could affect booting. Is the segfault always at the same place?
Flags: needinfo?(kchen)
(In reply to Kan-Ru Chen [:kanru] from comment #10)
> (In reply to Alexandre LISSY :gerard-majax from comment #5)
> > Can we backout only one of those or do we absolutely need to keep them
> > all together?
>
> If needed they have to be backed out together. Note I can't reproduce this
> on Flame and I'm not sure how a disabled feature could affect booting. Is
> the segfault always at the same place?

Always. Before the regression range, nothing. After, constantly under the described conditions.
Right, now after a repo sync of Gecko and Gaia, I don't have the issue anymore. Only spurious thing I could notice was a crash report on the very first boot, before I begin FTU. And homescreen seemed to be broken after finishing FTU. It's all okay after a reboot.
(In reply to Alexandre LISSY :gerard-majax from comment #12)
> Right, now after a repo sync of Gecko and Gaia, I don't have the issue
> anymore.
>
> Only spurious thing I could notice was a crash report on the very first
> boot, before I begin FTU. And homescreen seemed to be broken after finishing
> FTU.
>
> It's all okay after a reboot.

False hope: after a shutdown and a startup, it is crashing again.
I've noticed that too, _sometimes_ it doesn't start crashing, which made my bisect attempt a fruitless waste of time.
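As a rough illustration of why an intermittent crash defeats a plain bisect, the sketch below repeats the boot test several times before declaring a commit good. It is a toy model only: `bisectFlaky`, `crashes`, and `attempts` are hypothetical names, not part of git or any Mozilla tooling, and it assumes that good commits never crash and that the last commit in the range is known bad.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical helper: finds the first bad commit in [0, n) when a bad commit
// only crashes on some runs.
// Assumptions: n >= 1, commit n-1 is known bad, good commits never crash.
// `crashes(i)` stands for one flash-and-reboot test of commit i.
std::size_t bisectFlaky(std::size_t n,
                        const std::function<bool(std::size_t)>& crashes,
                        int attempts) {
    std::size_t lo = 0;      // every commit before lo is believed good
    std::size_t hi = n - 1;  // commit hi is known bad
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        bool sawCrash = false;
        // A single crash is conclusive; a clean run is not, so repeat it.
        for (int i = 0; i < attempts && !sawCrash; ++i) {
            sawCrash = crashes(mid);
        }
        if (sawCrash) {
            hi = mid;
        } else {
            lo = mid + 1;
        }
    }
    return lo;
}
```

With too few attempts per commit, a bad commit can be marked good and the search converges on the wrong revision, which is consistent with the confusing bisect results reported in this bug.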
(In reply to Alexandre LISSY :gerard-majax from comment #1)
> Created attachment 8663388 [details]
> gdb backtrace on Xperia M2

I cannot spot anything obvious from the backtrace, but I am no expert in the GC. None of the patches from Bug 1123237 modify the parser, so I guess this issue might be related to some of the nursery patches. Terrence might know better, but I think this kind of bug is hard to investigate, especially on devices, and our best hope might be to wait until fuzzers find a similar signature.
Flags: needinfo?(nicolas.b.pierron)
Crashes in the GC are usually heap corruption of some sort, rather than a direct consequence of GC changes. Generally the only way to track this sort of problem down is to bisect, which it seems you have done. In this particular case, it looks like either the arena list or freespan head points into unmapped addresses. I'm not entirely sure how that squares with the bisection results, but that patch does add some members to the relevant structs, which could be bumping the addresses off-by-one if not everything is compiled with the same #defines?
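To make the "#defines" hypothesis concrete, here is a minimal sketch of how a conditionally compiled member shifts the offset of every field after it. The two structs are hypothetical stand-ins for one struct seen by two translation units built with different flags; they are not the real SpiderMonkey layouts, and `profilerState` is an invented member.

```cpp
#include <cstddef>
#include <cstdint>

// Layout as seen by code compiled WITHOUT the (hypothetical) feature flag.
struct ListsBuiltWithoutFlag {
    void* runtime;
    void* freeListHead;
};

// Layout as seen by code compiled WITH the flag: one extra member.
struct ListsBuiltWithFlag {
    void* runtime;
    std::uint32_t profilerState;  // hypothetical member guarded by a #define
    void* freeListHead;
};

constexpr std::size_t headOffsetWithoutFlag() {
    return offsetof(ListsBuiltWithoutFlag, freeListHead);
}

constexpr std::size_t headOffsetWithFlag() {
    return offsetof(ListsBuiltWithFlag, freeListHead);
}

// True only if both builds would read freeListHead from the same offset.
constexpr bool layoutsAgree() {
    return headOffsetWithoutFlag() == headOffsetWithFlag();
}
```

If the two layouts disagree, code built without the flag reads `freeListHead` from where the other build stored something else, so the pointer it gets can easily land in unmapped memory. The later comments show that this turned out not to be the cause here.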
Flags: needinfo?(terrence)
Flags: needinfo?(nhirata.bugzilla)
I'm getting a boot loop on Z3C - don't know if it's related, but I bisected:

2d0398ffa709b2af2e5a1e588086a874479c67e6 is the first bad commit
commit 2d0398ffa709b2af2e5a1e588086a874479c67e6
Author: Josh Matthews <josh@joshmatthews.net>
Date:   Sun Sep 20 05:57:15 2015 -0400

    Bug 885982 - Part 4: Remove all traces of JS implementation. r=asuth

:040000 040000 e3e092e3fc55443ecb1ff1f635dbc68633ee90f6 87637d13278226cea38d380d14f5933d1d9bb5b3 M b2g
:040000 040000 580eb5cdb448408ad501c32ab3f895417d87000e f6cf9fd2663e05213aca86f893deede1d813f519 M browser
:040000 040000 67c3d9dd8c9afcff5cea25fbc997cbd9df99b9d2 feb0b53d300320ccacea09d06281670b0e11475e M dom
:040000 040000 4d5d2838010e6d1ebe5e036f831cccfa19f41199 56c90d1b6513a0038c87c8f793e1aaa3704f14d9 M mobile
fwiw, I'm facing the same issue on a z3c, with the same stack trace as the one initially reported
I'm investigating this.
Assignee: administration → kchen
ftr, I got a profile that always crashes when compiling ContactDB.jsm
Reverting the commit I mention in comment #17 on top of master fixes the boot loop for me.
(gdb) f
#2  js::TenuringTracer::moveToTenured (this=this@entry=0xbed9b248, src=0xb2033180) at /home/ting/w/fx/os/aries-kk/gecko/js/src/gc/Marking.cpp:2059
2059        TenuredCell* t = zone->arenas.allocateFromFreeList(dstKind, Arena::thingSize(dstKind));
(gdb) p zone->runtime
There is no member or method named runtime.
(gdb) p zone->runtime_
$9 = (JSRuntime * const) 0x904ff0e9
(gdb) p zone->runtime_ == zone->arenas.runtime_
$10 = false
(gdb) p zone->arenas.runtime_
$11 = (JSRuntime *) 0x1cf8cd93
(gdb) p *zone->arenas.runtime_
Cannot access memory at address 0x1cf8cd93
(gdb)

I think zone->runtime_ != zone->arenas.runtime_ is impossible
Setting javascript.options.ion to false prevents the crash.
The crash occurs during minor GC when we are marking the store buffers. With kanru's help debugging we found that we are marking what appears to be a FunctionBox object where we expect to see a nursery allocated JSObject.
I wasn't able to reproduce the crash, but I found something that could cause it. JSFunction has a union containing a JSObject* and a FunctionBox*. To make barriers work on the object pointer, when assigning to this we cast its address to a HeapPtrObject*. This will create a store buffer entry in the right circumstances (JSFunction allocated in the tenured heap, JSObject allocated in the nursery). While parsing a function we swap out this object pointer and set the function box pointer instead. We don't do anything to remove the store buffer entry though.
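The aliasing problem described above can be modelled with a toy store buffer. This is only a sketch under stated assumptions: every type and function here (`ToyFunction`, `ToyStoreBuffer`, `setEnv`, `setFunBoxBuggy`, `setFunBoxFixed`, `hasStaleEdge`) is hypothetical and much simpler than the real SpiderMonkey classes, the function is assumed to live in the tenured heap, and the "fixed" variant shows one plausible remedy rather than the contents of the attached patch.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Toy stand-ins; none of these are SpiderMonkey's real types.
struct ToyObject { bool inNursery; };
struct ToyFunctionBox { std::uint32_t flags; };

struct ToyFunction {  // assumed to be tenured
    union {
        ToyObject* env;          // meaningful while holdsFunBox is false
        ToyFunctionBox* funbox;  // meaningful while holdsFunBox is true
    } u;
    bool holdsFunBox;
    ToyFunction() : holdsFunBox(false) { u.env = nullptr; }
};

// Remembers addresses of tenured slots that may point into the nursery.
struct ToyStoreBuffer {
    std::vector<ToyObject**> edges;
    void put(ToyObject** slot) { edges.push_back(slot); }
    void unput(ToyObject** slot) {
        edges.erase(std::remove(edges.begin(), edges.end(), slot), edges.end());
    }
    bool contains(ToyObject** slot) const {
        return std::find(edges.begin(), edges.end(), slot) != edges.end();
    }
};

// Barriered write: record the slot when it receives a nursery pointer.
void setEnv(ToyFunction& fun, ToyObject* obj, ToyStoreBuffer& sb) {
    fun.u.env = obj;
    fun.holdsFunBox = false;
    if (obj && obj->inNursery) sb.put(&fun.u.env);
}

// Buggy switch: reuses the storage but leaves the recorded edge behind.
void setFunBoxBuggy(ToyFunction& fun, ToyFunctionBox* box) {
    fun.u.funbox = box;
    fun.holdsFunBox = true;
}

// One possible remedy: drop the edge before reusing the storage.
void setFunBoxFixed(ToyFunction& fun, ToyFunctionBox* box, ToyStoreBuffer& sb) {
    sb.unput(&fun.u.env);
    fun.u.funbox = box;
    fun.holdsFunBox = true;
}

// True if a minor GC would visit this slot while it holds a function box.
bool hasStaleEdge(ToyFunction& fun, const ToyStoreBuffer& sb) {
    return fun.holdsFunBox && sb.contains(&fun.u.env);
}
```

In this model a minor GC that walks `edges` after `setFunBoxBuggy` would treat the function box pointer as a nursery object, which matches the symptom of marking a FunctionBox where a nursery-allocated JSObject was expected.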
Attachment #8664187 - Flags: review?(terrence)
Comment on attachment 8664187 [details] [diff] [review]
bug1206485-function-box-aliasing

Ship it! It looks perfect! Tested on Xperia M2:
- pushing a gecko with the fix, doing ~8-10 reboots, no problem
- pushing a gecko without fix, crash after one or two reboots, crash report at the first boot during FTU, homescreen broken
- pushing a gecko with the fix on top of a broken profile, revived

Thanks for the quick patch!
Attachment #8664187 - Flags: feedback+
Comment on attachment 8664187 [details] [diff] [review]
bug1206485-function-box-aliasing

This looks like it fixes the problem. I've done several reboots on both devices I had problems with, and both rebooted several times without a boot loop.
This fixes it for me too :)
That also fixed it on my z3c. SHIP IT!
Comment on attachment 8664187 [details] [diff] [review]
bug1206485-function-box-aliasing

Review of attachment 8664187 [details] [diff] [review]:
-----------------------------------------------------------------

Great find!
Attachment #8664187 - Flags: review?(terrence) → review+
(In reply to [:fabrice] Fabrice Desré from comment #30)
> That also fixed it on my z3c. SHIP IT!

Shipped to m-i.
Nice!
Assignee: kchen → jcoppeard
Flags: needinfo?(nhirata.bugzilla)
Flags: needinfo?(cyu)
Whiteboard: [dogfood-blocker]
The patch is confirmed to work on Flame. PS. I'm the author of Bug 1207213.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Target Milestone: --- → FxOS-S8 (02Oct)