Closed Bug 1033618 Opened 10 years ago Closed 10 years ago

Applications (e.g. homescreen) can fail to start if the preallocated process is in the process of dying

Categories

(Core :: DOM: Content Processes, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

RESOLVED FIXED
2.0 S6 (18july)
blocking-b2g 2.0+
Tracking Status
firefox31 --- wontfix
firefox32 --- fixed
firefox33 --- fixed
b2g-v2.0 --- verified
b2g-v2.1 --- verified

People

(Reporter: tkundu, Assigned: khuey)

References

(Depends on 1 open bug, )

Details

(Keywords: crash, Whiteboard: [b2g-crash][caf-crash 263][caf priority: p1][CR 686636][systemsfe])

Attachments

(5 files, 4 obsolete files)

Reference from #comment 47 of bug 1023796.


I found that homescreen main thread is waiting in 

IPDL::PContent::RecvFlushMemory 
gfxContext::Paint() [1] 
nsCycleCollector::Collect() [2] 

Homescreen spawned another IPC thread which is stuck in
MessagePump::Wait()[3] 

Easy STR to reproduce this issue:
1) Flash build on device which is memory restricted to use only 256MB.
2) Copy lots of contents to sdcard and launch gallery.
3) You will see gallery is scanning images and now try to zoom in/out an image. This will force to use more memory and device will become slower
4) Minimize gallery and launch camcorder and try to record video
5) Stop recording video and Minimize camcorder.
6) Homescreen will not show any icons

You need to follow step #3 properly to make sure that you are using lots of memory.


[1] http://dxr.allizom.org/mozilla-central/source/gfx/thebes/gfxContext.cpp#1532
[2] http://dxr.mozilla.org/mozilla-central/source/xpcom/base/nsCycleCollector.cpp#4127
[3] http://dxr.mozilla.org/mozilla-central/source/ipc/chromium/src/base/message_pump_default.cc#56


Please note that homescreen does not contains any icon or text when this happens. And it is reproducible with testpatch [2] from bug 1023796.

Device used: msm8610 256MB
OS: FFOS 2.0

Logs: https://bugzilla.mozilla.org/show_bug.cgi?id=1023796#c46


[2] https://bugzilla.mozilla.org/show_bug.cgi?id=1023796#c45
blocking-b2g: --- → 2.0?
Whiteboard: [CR 686636]
Could you please help me with https://bugzilla.mozilla.org/show_bug.cgi?id=1023796#c51 

That will be helpful for this bug.
Flags: needinfo?(21)
Blocks: 1015336
blocking-b2g: 2.0? → 2.0+
Component: General → Gaia::Homescreen
Whiteboard: [CR 686636] → [CR 686636][systemsfe]
QA Whiteboard: [VH-FL-blocking-][VH-FC-blocking+]
If anyone has problem with reproducing this issue then I can provide remote gdb access to device & help anybody to jointly debug this issue with me on a live device.
Whiteboard: [CR 686636][systemsfe] → [caf priority: p1][CR 686636][systemsfe]
Observed on: 

Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.022
Moz BuildID: 20140630000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=c0c8ad187c0466285f2580531e09f8322996f561
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=ef0781044f85ee7f364236c074ab0b7644c38ff8
Discussed offline with Kevin, for now, :kgrandon is going to help start the investigation to get started here..
Assignee: nobody → kgrandon
Target Milestone: --- → 2.1 S5 (26sep)
Currently following up with Tapas offline to schedule a debugging session with a few Mozilla folks.
What bug are we trying to fix here?

If we don't have enough memory to honor the proper rendering then this is working as expected. If a large allocation fails then we will recover by skipping rendering for that layer of the page. I believe if the memory clears up then we should eventually render it when the next thing triggers a paint.

I guess if there memory is held by another process we can make sure that we run the OOM killer to kill the background app to pick up that memory to render the homescreen.

Is the point of the fix to make sure that we have the right mechanism in place to claim the graphic memory we need via the OOM killer?
Attached file Full logcat (deleted) —
Here's a full logcat. This is me starting the camera, letting it run for a few seconds. After a few seconds the camera process crashes and there's no homescreen in the background.
Simpler STR:
* Switch phone to use 273 MB
 $ adb reboot bootloader
 $ fastboot oem mem 273
 $ fastboo reboot
* Start the camera app, switch to video recording
* Record until the camera app crashes

You're left in a state where the Camera and Homescreen are both OOM and the Homescreen is not restarted.
Attached patch Patch - Add logging to homescreen launcher (obsolete) (deleted) — Splinter Review
More logs.
Attachment #8451794 - Attachment is obsolete: true
Attached file crash with the above log patch (deleted) —
Kevin and I spent some time with Tapas and Benoit this morning and then dived into this in depth later.  Here is our understanding of what is going on.

The camera app is killed.  The homescreen has been dead for sometime, so this causes Gaia to try to restart the home screen.

The preallocated app process has been killed, but we haven't yet processed ActorDestroy for it.  Presumably that is pending in the main thread event loop somewhere.

Gaia attempts to restart the home screen.  That goes into the ContentParent logic which grabs the preallocated process and attempts to construct a TabParent for it.  SendPBrowserConstructor fails because the other side is gone.

That triggers an ActorDestroy call on our TabParent with why == FailedConstructor.  TabParent::ActorDestroy fails to handle this, only checking for the AbnormalShutdown value to do oop-frameloader-crashed.  This is bug #1.

Once that is fixed, we fire oop-frameloader-crashed, but that does not trigger a mozbrowsererror event.  It appears that the BrowserElementParent for this frame may not have been constructed yet.  That's consistent with the ordering of things in nsFrameLoader::ShowRemoteFrame, where we TryRemoteFrame() at the top of the function and then fire remote-browser-pending later which is what sets up the BrowserElementParent.  This is bug #2.

Fixing #1 is easy enough but I have no idea what to do about #2.  The only thing I can think of is to make ContentParent::CreateBrowserOrApp "try again" if it takes the preallocated process and SendPBrowserConstructor fails.  That doesn't actually fix #2 though, just fixes the only way we know it can happen.  Olli, any better ideas?
Component: Gaia::Homescreen → DOM: Content Processes
Flags: needinfo?(21) → needinfo?(bugs)
Product: Firefox OS → Core
Summary: Vertical homescreen without icons in applications → Applications (e.g. homescreen) can fail to start if the preallocated process is in the process of dying
(We did something recently that made this much more common, but I suspect the underlying issue goes back to 1.0.)
Attachment #8451801 - Attachment is obsolete: true
#1 from comment 13. We can land this if we want, or someone can feel free to steal it.
Observed on: 

Device: msm8610
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.026
Moz BuildID: 20140707000200
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=ef67af27dff3130d41a9139d6ae7ed640c34d922
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=d3eae03cdf4e6944e646d05938a22dc1380a0d95
Attachment #8452068 - Attachment description: Part 1 - Patch, oop-frameloader-crashed oop-frameloader-crashed event is not sent during FailedConstructor → Part 1 - Patch, oop-frameloader-crashed event is not sent during FailedConstructor
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #13)
> Fixing #1 is easy enough but I have no idea what to do about #2.  The only
> thing I can think of is to make ContentParent::CreateBrowserOrApp "try
> again" if it takes the preallocated process and SendPBrowserConstructor
> fails.  That doesn't actually fix #2 though, just fixes the only way we know
> it can happen.  Olli, any better ideas?
Either we need ipc layer to tell asap that child process has died, or do try-again.
And since child process might die at any point, try-again sounds like the way to go.

Does SendPBrowserConstructor return non-null? I guess it could and we'd return TabParent 
which doesn't really have TabChild working. But we should be able to deal with that case already.
Just getting error later and need to restart process.
If SendPBrowserConstructor fails synchronously, we could indeed just try creating a non-preallocated process.
Flags: needinfo?(bugs)
This bug has just been pointed out to me and from your described in comment 13 there's something else that sounds wrong to me:

(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #13)
> The preallocated app process has been killed, but we haven't yet processed
> ActorDestroy for it.  Presumably that is pending in the main thread event
> loop somewhere.

The preallocated process is not supposed to be killed by the LMK unless it's the only process left besides the b2g process so I would expect the camera process to be killed first and then the preallocated process afterwards. I haven't had time to look at the logs attached here yet but if the preallocated process has been killed *before* the other apps then we've got a bug in the process priority manager too.

One possibility is that the main process has already tried to restart the homescreen which was then killed and thus there's no preallocated priority process around because it hasn't been restarted just yet, not because it's been killed.
The preallocated process is killed immediately after the camera app process.
Assignee: kgrandon → khuey
Target Milestone: 2.1 S5 (26sep) → 2.0 S6 (18july)
Attached patch Patch (obsolete) (deleted) — Splinter Review
Untested, but I think this should work.

https://tbpl.mozilla.org/?tree=Try&rev=61f39e20b8b6
Attachment #8452068 - Attachment is obsolete: true
Attachment #8452509 - Flags: review?(bugs)
Attachment #8452509 - Flags: feedback?(kgrandon)
Comment on attachment 8452509 [details] [diff] [review]
Patch

Could you add a comment why the new ContentParent::CreateBrowserOrApp
call wouldn't lead to endless recursion.
(It shouldn't since we Take() the existing preallocated process)
Attachment #8452509 - Flags: review?(bugs) → review+
Attachment #8452509 - Attachment is obsolete: true
Attachment #8452509 - Flags: feedback?(kgrandon)
Attached patch Patch (deleted) — Splinter Review
Attachment #8452669 - Flags: review?(bugs)
Attachment #8452669 - Flags: feedback?(kgrandon)
There were two problems with the previous patch.  The first is that we store ContentParents in a hashtable so we can reuse them for the same app.  Obviously reusing the broken ContentParent isn't helpful, and that's exactly what was happening in the "try again path".

The second is that Nuwa wants the frame loader to "try again later" if there is no preallocated process.  The path we're coming through in the frame loader isn't set up to do that though ... and I think the path that does claim to handle it properly is also wrong.  This patch removes that code which means we launch a plugin-container that's not forked from Nuwa, but that's better than not launching a plugin-container at all.

We really need to figure out what nsFrameLoader's handling of this should look like.
Comment on attachment 8452669 [details] [diff] [review]
Patch

Review of attachment 8452669 [details] [diff] [review]:
-----------------------------------------------------------------

The patch seems to solve the problem for me. Thanks!
Attachment #8452669 - Flags: feedback?(kgrandon) → feedback+
Attachment #8452669 - Flags: review?(bugs) → review+
I think this should apply cleanly to 2.0, but if it doesn't I'll fix it up tomorrow morning.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #30)
> Trivial fix.
> 
> https://hg.mozilla.org/releases/mozilla-aurora/rev/4cf4838bd5e1


This solves the problem of homescreen failed to launch. I cannot reproduce original issue[1] by following #comment 10 

But I can still reproduce original issue[1] by following #comment 0. 

Now I can see homescreen is getting killed when I am in step 5 of #comment 0
and it is launched again. but It becomes icon less in step #6 of #comment 0

Please let me know if you are unable to reproduce this issue on your 256MB flame device.


[1] "homescreen stuck without icon"
Status: RESOLVED → REOPENED
Flags: needinfo?(kgrandon)
Resolution: FIXED → ---
(In reply to Tapas Kumar Kundu from comment #31)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #30)
> > Trivial fix.
> > 
> > https://hg.mozilla.org/releases/mozilla-aurora/rev/4cf4838bd5e1
> 
> 
> This solves the problem of homescreen failed to launch. I cannot reproduce
> original issue[1] by following #comment 10 
> 
> But I can still reproduce original issue[1] by following #comment 0. 
> 
> Now I can see homescreen is getting killed when I am in step 5 of #comment 0
> and it is launched again. but It becomes icon less in step #6 of #comment 0

We will try to reproduce. Do you have a screenshot of the device in the state without icons? Also have you verified that the homescreen is really launched and that the process is running?
Flags: needinfo?(kgrandon)
(In reply to Kevin Grandon :kgrandon from comment #32)

> We will try to reproduce. Do you have a screenshot of the device in the
> state without icons? 

Yes. I attached it here.

> Also have you verified that the homescreen is really
> launched and that the process is running?

Yes. I can see homescreen pid has changed from 1096 to 1311 when it is stucked. 
I verified it by running |adb shell b2g-procrank|
There are some steps to reproduce for a very similar sounding issue in Bug 1036680.
suspected highly related with recent MTBF crashes. Set direct blocker to MTBF-B2G meta b2g.
Blocks: MTBF-B2G
This doesn't cause a crash.  File a new bug.
Tapas - would you happen to have a logcat when this happens? I am currently trying to reproduce, but not having much luck. Thanks!
Flags: needinfo?(tkundu)
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.028
Moz BuildID: 20140709000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=1dd043857399c713e3b509c0ed31bdf20326f08b
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e535af683023b392a2a89e96d67273266d836327
The patch here changed content process behavior on emulator mochitest: the content process that runs mochitest may or may not forked from Nuwa, depends on the timing that mochitest launched. We should have an environment to test processes that forked from Nuwa.
(In reply to Kevin Grandon :kgrandon from comment #37)
> Tapas - would you happen to have a logcat when this happens? I am currently
> trying to reproduce, but not having much luck. Thanks!

I tried with two builds:

(1) gaia : https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gaia/commit/?h=mozilla/v2.0&id=1bd6e8957ccf310b2f75ba5695b058a2e284df3a
   gecko : https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/commit/?h=mozilla/v2.0&id=7a31c320cc1ea4ca6393634a17cfd575d17b035d

(2) gaia : https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gaia/commit/?h=mozilla/v2.0&id=1dd043857399c713e3b509c0ed31bdf20326f08b
   gecko : https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/commit/?h=mozilla/v2.0&id=e535af683023b392a2a89e96d67273266d836327
   + your fix from comment 31 


Build (1) has already your patch and I applied your patch manually in Build (2)
I also see that there is lots of patches (related to graphich/paint fixes) has landed between (1) and (2).


I am unable to reproduce this issue anymore on build(1) but I can reproduce this issue (see comment 31) on build (2) with some efforts (see comment 0)

Homescreen stucks without icon for 2-3 secs in build(1) once we are in step4 in comment 0. But, homescreen is able to display icon again after 2-3 secs in build(1).


So  my conclusion is that this patch (comment 31) alone does not solve homescreen issue but it may hide it temporarily. 

We are asking out internal team for more testing on tip build to confirm that this issue has gone. 

Meanwhile, could you please confirm if there is any other patch (other than fix from this bug 1033618) which has landed in build(1) specifically for homescreen hang issue ?

I just want to double check that we are solving this issue fully instead of hiding it temporarily .
Flags: needinfo?(tkundu) → needinfo?(kgrandon)
As far as I know there have been no specific efforts landed in the last few days to address this bug. We have landed a few vertical homescreen patches, and something that could possibly help is this one: https://github.com/mozilla-b2g/gaia/commit/d422c2f353816b0781fca219d13f7789205d589c

This patch should make the vertical homescreen use significantly less CPU and memory on startup. It's just a theory, but it could be posible that having the homescreen use more memory on startup before could have been causing problems when recovering from an OOM.

Tapas - one option might be to revert that patch to see if it has any impact. The only other alternative I can think of would be to perform a bisection over the last few days. Please let us know what your internal team finds out on this issue.
Flags: needinfo?(kgrandon)
(In reply to Kevin Grandon :kgrandon from comment #41)
> Tapas - one option might be to revert that patch to see if it has any
> impact. 

I reverted patch from this bug 1033618 and I found that homescreen still does not hang. This means that this patch is not making any impact on this hang issue. Something else has hided it or fixed it in tip (see comment 40)

I am bisecting git now. I will update soon.
Flags: needinfo?(tkundu)
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.030
Moz BuildID: 20140711000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=18c44a1bc31b374ba00a069904465a8d07971a60
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=22368827423484e09c3840ebc785de09b821a039
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.031
Moz BuildID: 20140712000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=ca022f811bcbbda0f89086094a9e92bb220fea18
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=bf1bffa293379d6c7338dad0d2df64db7724ca4b
My understand is that this was fixed by a combination of the patch in this bug and by https://github.com/mozilla-b2g/gaia/commit/d422c2f353816b0781fca219d13f7789205d589c.
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Flags: needinfo?(tkundu)
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.033
Moz BuildID: 20140713000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=ca022f811bcbbda0f89086094a9e92bb220fea18
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=be6908fec84d3e39453275da96c031336f58f23d
Observed on: 

Device: msm8226
Gonk Version: AU_LINUX_GECKO_B2G_KK_2.0.01.04.00.114.022
Moz BuildID: 20140715000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=2c6c413ed729d465c52d6c2d5d458e2eee79e956
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=323e7d8e8884be8da8341c0e2687e81962a774f4
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_KK_3.6.01.04.00.000.038
Moz BuildID: 20140716000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=5f8b1b8a2da9e3b531eee817a669f57fa4d9b9c6
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=e00f7e464333689fcf54edb4945ece94f97f930b
Observed on: 

Device: 
Gonk Version: AU_LINUX_GECKO_B2G_KK_2.0.01.04.00.114.028
Moz BuildID: 20140721000201
B2G Version: 2.0
Gecko Version: 32.0a2
Gaia:  http://git.mozilla.org/?p=releases/gaia.git;a=commit;h=8cb1a949f2e9650bb2c5598e78a6f24a58bbaf97
Gecko: http://git.mozilla.org/?p=releases/gecko.git;a=commit;h=5f27d3ee3ccf01ac91a3efacb5e3e22ea62fd73c
Whiteboard: [caf priority: p1][CR 686636][systemsfe] → [b2g-crash][caf-crash 263][caf priority: p1][CR 686636][systemsfe]
Keywords: crash
Flags: in-moztrap?(ychung)
New test case needs to be added. There is no existing test case.
QA Whiteboard: [VH-FL-blocking-][VH-FC-blocking+] → [VH-FL-blocking-][VH-FC-blocking+][QAnalyst-Triage?]
Flags: needinfo?(ktucker)
Test case has been added in moztrap:

https://moztrap.mozilla.org/manage/case/14321/
QA Whiteboard: [VH-FL-blocking-][VH-FC-blocking+][QAnalyst-Triage?] → [VH-FL-blocking-][VH-FC-blocking+][QAnalyst-Triage+]
Flags: needinfo?(ktucker)
Flags: in-moztrap?(ychung)
Flags: in-moztrap+
This issue has been successfully verified on Flame v2.1&2.0.
See attachment: verified_v2.1.mp4.
Reproduce rate: 0/6

STR:
1. Set device's memory as 256MB.
2. Copy lots of contents to sdcard and launch Gallery.
3. When gallery is scanning images,try to zoom in/out an image. 
**It will become slower.
4. Launch Camera at lower left corner and slide to camcorder,and then try to record video.
5. Stop recording video and tap Home button.
**Homescreen will show all icons normally.
Note:Sometimes,it will automatically go back Homescreen when record video(after step 4),and show all icons normally on homescreen.

Flame 2.1 build:
Gaia-Rev        38e17b0219cbc50a4ad6f51101898f89e513a552
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g34_v2_1/rev/8b92c4b8f59a
Build-ID        20141205001201
Version         34.0
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20141205.035305
FW-Date         Fri Dec  5 03:53:16 EST 2014
Bootloader      L1TC00011880

Flame 2.0 build:
Gaia-Rev        856863962362030174bae4e03d59c3ebbc182473
Gecko-Rev       https://hg.mozilla.org/releases/mozilla-b2g32_v2_0/rev/e40fe21e37f1
Build-ID        20141208000206
Version         32.0
Device-Name     flame
FW-Release      4.4.2
FW-Incremental  eng.cltbld.20141208.035628
FW-Date         Mon Dec  8 03:56:38 EST 2014
Bootloader      L1TC00011880
Attached video verified_v2.1.MP4 (deleted) —
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: