Closed Bug 968604 Opened 11 years ago Closed 11 years ago

[Tarako] The browser process needs to be forked from Nuwa

Categories

(Core :: IPC, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla31
tracking-b2g backlog
Tracking Status
b2g-v1.3T --- fixed

People

(Reporter: llin, Assigned: cyu)

References

Details

Attachments

(5 files, 1 obsolete file)

After smoking test of Tarako, it was found at final memory usage check. Reproduce steps are still wanted. The screenshot of b2g-info is attached.
Browser processes are the ones used by browser tabs. So it can be normal to have more than one.
The potential issue is that they are forked from the main b2g process, and not from a preallocated one.
Was the Nuwa process killed and re-born? Lawrence, could you provide logcat or information of creation/killing of processes?
Thinker, I just did a quick check:

Before opening a page in the browser:
fabrice@fabrice-x240:~/dev/tarako$ b2g-info 
                         |     megabytes     |
           NAME PID PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER   
            b2g  86    1   28.4    0 40.9 44.8 52.5 147.0       0 root   
         (Nuwa) 321   86    0.9    0  0.8  3.4  9.5  49.5       0 root   
     Homescreen 338  321    5.5    1 10.2 14.6 23.6  64.7       2 app_338
(Preallocated a 390  321    0.9   18  4.5  7.6 15.0  57.5      10 root   

And after opening a page:

fabrice@fabrice-x240:~/dev/tarako$ b2g-info 
                    |     megabytes     |
      NAME PID PPID CPU(s) NICE  USS  PSS  RSS VSIZE OOM_ADJ USER   
       b2g  86    1   38.7    0 33.9 37.0 42.5 150.0       0 root   
    (Nuwa) 321   86    1.1    0  0.3  1.1  3.4  49.5       0 root   
Homescreen 338  321    5.9   18  6.8  8.6 12.8  63.7       8 app_338
   Browser 438   86    4.8    1 13.2 16.2 21.8  71.2       2 app_438

So we don't reuse the preallocated process, but just fork from b2g. The USS of the browser app is then pretty high (in this test I could not even get a resource because I have no network access...)
In the smoke test of Tarako, there is no browser tab related cases. 
It means, during the test, no additional tab will be created.

The issue is that I cannot see or kill the additional broswer process via the application manager, and what the browser process being killed is the latest created. 
Even the application manager shows "No recent apps", the browser process (in the case, PID is 1024) is still there.
Have we been got any reported zombie processes?
Possible reproduce steps:

1. Load reference workload (light) on the target phone
2. After reboot, launch Music and play a song
3. Launch Contacts
4. Launch Messages
5. Launch Gallery
6. Call the target phone from another phone
7. The dialer(Communications) is not launched and the call is redirected to the voice mail directly
8. Then, the instance of zombie browser is on the process list
I think there is a bug in creating the browser process so it forks from b2g. I will take a look.
Assignee: nobody → cyu
Summary: [Tarako] There might be two existing instances of Browser → [Tarako] The browser process needs to be forked from Nuwa
Attached patch For the browser process from Nuwa (obsolete) (deleted) — Splinter Review
* Perform ContentChild::InitProcessAttributes() when PBrowser is first allocated in the child process. This is to reset child-side attributes for browser.
* Add ContentParent::TransformPreallocatedIntoBrowser() to reset ContentParent's member fields and set OS privileges of the child process.
Attachment #8375423 - Flags: review?(khuey)
Comment on attachment 8375423 [details] [diff] [review]
For the browser process from Nuwa

Review of attachment 8375423 [details] [diff] [review]:
-----------------------------------------------------------------

very nice. r=me
Attachment #8375423 - Flags: review?(khuey) → review+
This patch causes dom/browser-element/mochitest/priority/test_Simple.html to fail though ... are we not setting the priority of the browser process?
We are.

ProcessPriorityManager::SetProcessPriority(this,                                           PROCESS_PRIORITY_FOREGROUND);

should do this. I will take a look why it fails.
blocking-b2g: --- → 1.3T?
blocking-b2g: 1.3T? → 1.3T+
The test case passes when run alone, but fails when run in in the suite.

This seems to be relevant to the test case failure.

 1:03.11 [Parent 15327] WARNING: Ignoring duplicate observer.: file /home/cervantes/hg/mozilla-central/modules/libpref/src/nsPrefBranch.cpp, line 622
 1:03.12 [Parent 15327] WARNING: Adding an observer twice!: '!mEventObservers.Contains(observer)', file /home/cervantes/hg/mozilla-central/xpcom/threads/nsThread.cpp, line 752

Still checking why there is a duplicate observer.
The failure is a false alarm not indicating to a bug in this fix. The problem is that (very unfortunately) test_Preallocated.html runs before test_Simple.html. The preallocated process created in test_Preallocated.html is used by test_Simple.html, then "process-priority-manager:TEST-ONLY:process-created" will not be fired because it already was in test_Preallocated.html.

We need to shutdown the preallocated process in test_Preallocated.html so it's not affecting the test case running after it.
This fixes the test case that the preallocated process is not actually shut down in the cleanup() function:

- PreallocatedProcessManagerImpl::AllocateNow() creates the ContentParent instance.
- ContentParent notifies that the process is created
- The test case calls cleanup() to flip the preference back to false and then PreallocatedProcessManagerImpl::Disable() is called.
- PreallocatedProcessManagerImpl::Disable() won't call mPreallocatedAppProcess->Close() because the assignment in PreallocatedProcessManagerImpl::AllocateNow() hasn't been done.

This fixes the test case by making cleanup() async.
Attachment #8381368 - Flags: review?(khuey)
(In reply to Cervantes Yu from comment #15)
> Tests: https://tbpl.mozilla.org/?tree=Try&rev=b966175e43ff

Mochitest group 2 on Linux Debug and Linux X64 Debug still fail in a line. Seems not intermittent. I need to look at it.
Cervantes, do you need help fixing these tests issues?
Flags: needinfo?(cyu)
(In reply to Fabrice Desré [:fabrice] from comment #17)
> Cervantes, do you need help fixing these tests issues?

That would be great. :)
Flags: needinfo?(cyu)
From the log it looks like it completed all the test cases in mochitest-2, then it get stuck in shutting down. The test runner sends SIGABRT to force it to shut down. I still don't make the test get stuck. Is there any way to bisect the offending test case in mochitest-2?
ni? Fabrice to see if any help can be provided regarding comment 19
Flags: needinfo?(fabrice)
For more information, I ran mochitest-2 locally with the patches applied. The failure is intermittent. I still have no clue why it is permfail on try.
Component: General → IPC
Product: Firefox OS → Core
From the crash stack in the log, the failure indicates that it is performing PlatformThread::join(), called from ShutdownXPCOM() to delete sIOThread. It's still unclear how the change can make this pthread_join() to block until we got killed by the test runner.
Still wrestling with mochitest-2 failures. With https://bugzilla.mozilla.org/attachment.cgi?id=8381368 fixing the inherent bug in the test case, we saw consecutive mochitest-2 failure. It's hard to say that this is intermittent. I'll try to reorder or disabling this test case to see if we can pass the suite. If yes, can we unblock the patch to 1.3T and reenable the test case once we fix it. Fabrice, how does this sound?
When can we land this patch?
Flags: needinfo?(styang)
(In reply to Cervantes Yu from comment #23)
> Still wrestling with mochitest-2 failures. With
> https://bugzilla.mozilla.org/attachment.cgi?id=8381368 fixing the inherent
> bug in the test case, we saw consecutive mochitest-2 failure. It's hard to
> say that this is intermittent. I'll try to reorder or disabling this test
> case to see if we can pass the suite. If yes, can we unblock the patch to
> 1.3T and reenable the test case once we fix it. Fabrice, how does this sound?

Last resort we'll disable the test temporarily. Are you sure it's a test issue and not a bug in the patch itself?

(In reply to James Zhang from comment #24)
> When can we land this patch?

Worst case we'll land Tuesday with the test disabled, sooner if Cervantes finds what's going wrong.
Flags: needinfo?(fabrice)
(In reply to Fabrice Desré [:fabrice] from comment #25)
> (In reply to Cervantes Yu from comment #23)
> Last resort we'll disable the test temporarily. Are you sure it's a test
> issue and not a bug in the patch itself?
> 
Seriously, I am not sure. But from the log the test cases finished, but the suite cannot finish and get killed by the test runner after no output for 330 seconds. mochitest-2 on Linux Debug and Linux 64 Debug are the only 2 suites that failed during shutdown.

3 more tests:

1. https://tbpl.mozilla.org/?tree=Try&rev=69bea5abc0f8
Only with test case fix. 1 passed, 2 failed. This should always pass in theory.

2. https://tbpl.mozilla.org/?tree=Try&rev=d96de9c1e324
Code and test fixes, plus a little change to not use preallocated process for the browser. All 3 failed.

3. 2. with test_Preallocated.html disable. 1 green (for the 1st time...), 2 more running.

Because of 1. I tend to believe there is something wrong in how we run the tests. I need to dig into the tests further to fix that.
I guess the main thread fails to pthread_join() the IO thread because it is stuck in the following call:

Thread 3
 0  libpthread-2.15.so + 0xf88d
 1  libxul.so!ChildReaper::WaitForChildExit [process_watcher_posix_sigchld.cc:69bea5abc0f8 : 68 + 0xa]
 2  libxul.so!ChildLaxReaper::WillDestroyCurrentMessageLoop [process_watcher_posix_sigchld.cc:69bea5abc0f8 : 158 + 0x4]
 3  libxul.so!MessageLoop::~MessageLoop() [message_loop.cc:69bea5abc0f8 : 155 + 0x5]
 4  libxul.so!base::Thread::ThreadMain() [thread.cc:69bea5abc0f8 : 174 + 0xb]
 5  libxul.so!ThreadFunc [platform_thread_posix.cc:69bea5abc0f8 : 39 + 0x2]
 6  libpthread-2.15.so + 0x7e99

There is a loop in HANDLE_EINTR(waitpid()). It could be an infinite loop so the main thread can never shut down.

Since this is not reproduced locally, I pushed a test with more debugging instruments. Let's see if it's really what I guess. If so then I think we have another bug.
https://tbpl.mozilla.org/?tree=Try&rev=5d82ded68f39
waitpid() for the prealloc'd process doesn't return but instead block indefinitely. I think PreallocatedProcessManagerImpl::Disable() doesn't close the prealloc'd process correctly.
Flags: needinfo?(styang)
(In reply to Fabrice Desré [:fabrice] from comment #25)
> (In reply to James Zhang from comment #24)
> > When can we land this patch?
> 
> Worst case we'll land Tuesday with the test disabled, sooner if Cervantes
> finds what's going wrong.

It seems to be difficult to debug the test case failure on try server, can we land first and follow up to fix the test case? thanks
Flags: needinfo?(fabrice)
Flags: needinfo?(cyu)
Updated patch for 1.3t:
* Check for dom.ipc.processPrelaunch.enabled before returning null from ContentParent::GetNewOrUsed() so it will work correctly for NUWA + no prelaunch config.
* Don't #ifdef in ContentChild::RecvPBrowserConstructor() so it will work correctly for NUWA + no prelaunch config.
Attachment #8375423 - Attachment is obsolete: true
Attachment #8396307 - Flags: review+
(In reply to Joe Cheng [:jcheng] from comment #30)
> It seems to be difficult to debug the test case failure on try server, can
> we land first and follow up to fix the test case? thanks

Fabrice, if you agree, we can land this on 1.3t with test_Preallocated.html disabled temporarily (the test is never correctly cleaned up, see comment #14). I will fix it and reenable the test in bug 987164.
Flags: needinfo?(cyu)
Attached patch Disable test_Preallocated.html (deleted) — Splinter Review
Attachment #8396314 - Flags: review?(fabrice)
Depends on: 987164
Comment on attachment 8396314 [details] [diff] [review]
Disable test_Preallocated.html

Review of attachment 8396314 [details] [diff] [review]:
-----------------------------------------------------------------

ok, let's land.
Attachment #8396314 - Flags: review?(fabrice) → review+
Flags: needinfo?(fabrice)
Ying, are you okay to uplift it? thanks!

--
Keven
Flags: needinfo?(ying.xu)
https://hg.mozilla.org/mozilla-central/rev/0c3c8215ef73
https://hg.mozilla.org/mozilla-central/rev/e3ddffe9d504
https://hg.mozilla.org/mozilla-central/rev/65092435a027
Status: NEW → RESOLVED
Closed: 11 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Cervantes, can you rebase and push to 1.3t? thanks!
Flags: needinfo?(cyu)
Flags: needinfo?(ying.xu)
Depends on: 992853
Nom for 1.3? because we will need this to help fix bug 987556 which is a 1.3+ blocker.
blocking-b2g: 1.3T+ → 1.3?
What's the regression risk analysis on this patch?
Flags: needinfo?(cyu)
This is relatively high risk, IMO.
Although we're going to have to fix any issues that come up for Tarako since we took this on 1.3T, so ...
Ok - in that case, we're probably going to need to punt this then. We can't take anything risky at this point. I'll go back & discuss the blocking bug here with Vance here to see if we can get a waiver for it.
Flags: needinfo?(cyu)
bkelly has a new theory about the blocking bug, fwiw
Yes, let me see if I can get browsermark to run without this.  Its a bit up in the air at the moment.

That sad truth is our baseline memory usage went up from b2g18 to b2g26.  We more or less compensated with NUWA.

That being said, the major issue with the blocking bug is with script source compression being disabled.  I just need to verify if I can get browsermark to run with compression or another fix, but without NUWA.
You don't have to really test on Tarako. For a quick verification we can test it on m-c, ./edit-prefs.sh to disable "dom.ipc.processPrelaunch.enabled" and retest.
Triage decided to move this to backlog but reevaluate if this is deemed necessary to fix bug 987556.
blocking-b2g: 1.3? → backlog
blocking-b2g: backlog → ---
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: