Closed Bug 1094010 Opened 10 years ago Closed 8 years ago

[meta] Backlog for profiling on quad core device

Categories

(Firefox OS Graveyard :: Performance, defect, P2)

ARM
Gonk (Firefox OS)
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: sinker, Unassigned)

References

Details

Before we start to improve B2G for quad core devices, we need to do profiling to learn behavior of current B2G.  This bug here to
 1. define profiling use cases and scenarios,
 2. do profiling,
 3. investigate profiling data.
Summary: Profile B2G for Quad core platforms as baseline of improvements → Profile B2G for Quad core platforms as the baseline of improvements
As the requirement was raised from SMS app launch time, currently we think the test cases could be:

1. App launch time (template, sms, etc.)
2. UI benchmark which includes DOM manipulation and causes layout changes

Any other ideas?
We need to watch on start-up time, FPS of some animations, and response time of actions.  For example,
 - start-up time of pre-installed apps,
 - FPS of scrolling of FB, twitter, and swipe of homescreen,
 - response time of keyboard.

Not just SPS profiler, we may need to run TaskTracer and other tools to get different views of behavior.
I just talked to Jerry, cpu scheduling systrace captured does not relate to profiler labels, and it can record cpu frequency, loading, and idle so should be good to understand how busy cpu is on each core.
Assignee: nobody → kchen
15:53 <kanru`> TODO: define app launch end
15:57 <kanru`> TODO: profile template app
16:00 <kanru`> TODO: mark each stage of web page rendering: network kickoff, frame construction, layer construction/paint, etc
16:10 <kanru`> TOOD: find out if JAR file is partially unzip
Profile Messages app launch time which has reference-workload-medium installed.

Configuration:
  version: 2.2 (trunk)
  gecko: c44a46f049c3fda5bbf2c4371364e154942c250c
  gaia: 2f077d05105f227839dbecb22cb5324f1321b934

Commands:
  sps: $ ./profile.sh start -p b2g && ./profile.sh start -p Homescreen && ./profile.sh start -p [preallocated pid]; sleep 5; ./profile.sh capture
  systrace: $ systrace.py --time=5 --buf-size=10240 -o mynewtrace.html sched load freq idle

Steps:
  1) Exeucte sps and systrace commands from two terminals simutaneoulsy
  2) Touch Messages app icon

Results:
  Nexus-4
    sps      http://people.mozilla.org/~bgirard/cleopatra/#report=4bd449e91e5592dd8181d26f5dd04b81d0702fa6
    systrace http://www.googledrive.com/host/0B0yKxuGl8-XATld5SkdmQ2FPZms
  Nexus-5
    sps      http://people.mozilla.org/~bgirard/cleopatra/#report=9e925256793041c656146df19ad688cb54c3881d
    systrace http://www.googledrive.com/host/0B0yKxuGl8-XANWZ2bklhQmExajA

Notes:
  I added profiler label to OnStartRequest() and OnStopRequest() of nsJARChannel.
(In reply to Ting-Yu Chou [:ting] from comment #5)
>   Nexus-5
>     sps     http://people.mozilla.org/~bgirard/cleopatra/#report=9e925256793041c656146df19ad688cb54c3881d

I checked the 1.5s after the first DispatchEvent.

Synchronous IPCs block SMS main thread because B2G main is busy:

1. 46ms [1524,1571] RecvShow
1a. SendPRenderFrameConstructor (bug 1085655)
1b. SendPLayerTransactionConstructor (bug 1085655)
1c. BrowserElementChild.js SendSyncMessage (bug 1003848)
2. 65ms [1583,1649] RecvUpdateDimensions, note this is not observed in bug 1074783.
2a. SendGetDPI
2c. SendGetDefaultScale

Long run task on B2G main thread:

1. 68ms [1463,1532] awf_launch() @ app_window_factory.js
2. 17ms [1533,1551] nsRefreshDriver::Tick
3. 12ms [1555,1568] awm_switchApp() @ app_window_manager.js
4. 71ms [1570,1642] nsRefereshDriver::Tick
5. 94ms [2309.2404] aw__handle_mozbrowserloadend @ app_window.js:883

Long run task on SMS main thread:

1. 238ms [1823,2062] s() @ gaia-header.js:1 (bug 1089920)
2. 67ms [2491,2559] nsRefreshDriver::Tick (Refresh8), delays PMobileMessageCursor::RecvNotifyResult
3. 95ms [2601,2697] nsRefreshDriver::Tick (Refresh9)

Some thoughts:

1. Can RecvLoadURL be the first IPC content process received, and is it possible to RecvCacheFileDescriptor when RecvLoadURL?
2. It seems nsJARChannel is unarchiving on main thread, if it is, can we move it out for the changing of 1)?
3. The time difference between nsViewManager::DispatchEvent for the touch down/up is ~110ms, is it real or because of latency?
4. Is it possible to move system app to a worker?
5. Any chances to lower the impact of nsRefreshDriver::Tick()?
Ideally content process should start working as early as possible and never block by synchronous IPC, here're what I think we can do (reference to Nexus-4 SPS profile of comment 5):

1. Content process to RecvLoadURL() in the beginning
   Currently B2G process knows what app to launch in Webapps.jsm doLaunch(), but SendLoadURL() is done in nsFrameLoader::ReallyStartLoadingInternal() when system app set iframe's src, which is 44ms later. This is a waste. I have traced nsJARChannel, the inflate is usually done off main thread, and parsing read data is done on the main thread, so we will make at least 3 cores busy from the beginning.

2. Move system app to a worker
   Content process RecvLoadURL() 67ms later after B2G SendLoadURL(), since RecvShow() comes first and is blocked because B2G main thread is busy on system app and the painting. As system app could run on the main thread for a while, like [1603,1654] and [2854,2966], move it to a worker will have better responsiveness of B2G main thread.

3. Minimize the overhead of app launch animation
   Actually there're two animations, one for closing homescreen and one for splash, and there're something else especially around status bar. [1702,1782] is a long painting for the splash.

4. Check why Refresh1 takes ~100ms, is there anything not related to Messages app and can be improved?

Any thoughts?
About 4., I think this is due to gaia header, I'm investigating using a simpler header for our first panel in bug 1089920.
I profiled also launching Contacts, Gallery and Music on Nexus-4 with the same configuration as comment 5, basically the same problems like Messages app:

Contacts: http://people.mozilla.org/~bgirard/cleopatra/#report=3b8e661b1c9ac31e48faac98993a4abde60b2a15
  - runFontFit() [1565,1737].

Gallery: http://people.mozilla.org/~bgirard/cleopatra/#report=728451ecd3cd5f2cc93e4f5d494028c95947caf6
  - B2G process main thread has a very long repaint [1177,1364] and [1381,1491] after system app launch Gallery.
  - runFontFit() [1740,1912].

Music: http://people.mozilla.org/~bgirard/cleopatra/#report=cb3075d5d82466f883376f3c7bbb8cdf4c80f036
  - PBackgroundIDBCursor::RecvResponse() in Music calls SendGetVolumes() [2460,2609] and is blocked since B2G is busy painting and handling mozbrowserloadend in system app.
  - runFontFit() [2148,2260].
Ting-Yu,  Can you file bugs for tangible improvements identified above.
Flags: needinfo?(tchou)
Ravi, Thinker let not only me but also Ting-Yuan and Kanru to collect different profiles, I'll file bugs later once we reach a concensus.
Flags: needinfo?(tchou)
I got similar results to comment #6 with Communications/contacts on a nexus 4.

http://people.mozilla.org/~bgirard/cleopatra/#report=9ec5296becf5a8c8d06f8f6cf7be5715e60a0e07

There are 958 samples (1ms) out of 1311ms, first touch event to loadstart, in the contacts app. Not sure why there are 353 samples missed in SPS. The first ~170ms are spent on waiting b2g, which was preparing and rendering something before informing the pre-allocated app:

AppWindow() -> aw_render() -> aw__render() -> publish('rendered') ->
                                                                     IPDL::PBrowser::RecvShow

After that the content process spent ~130ms wait b2g in IPDL::PBrowser::RecvShow, mainly in IPDL::PBrowser::SendSyncMessage (59ms) and IPDL::PBrowser::SendPRenderFrameConstructor (55ms).

Not sure if the execution orders can be adjusted to save the 170ms or even 300ms launch time.


When counting from first touch event to loadend, it costs:

382ms on waiting b2g: initial wait (171ms), IPDL::XXX::RecvYYY (211ms)
193ms on nsRefreshDriver::Tick
150ms on gaia-header.js, mostly RunFontFit
212ms on nsInputStreamPump::OnInputStreamReady

It seems to me that it's worth investigation if the 382ms can be hidden by parallelization.
For what it's worth, here is the perf result, which provides a low-level, system-wise aspect of view. There is no outstanding function.

Samples: 3K of event 'cpu-clock:HG', Event count (approx.): 3150                                                                
  30.03%           946          swapper  [kernel.kallsyms]    [k] msm_cpuidle_enter
   0.92%            29              b2g  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   0.51%            16              b2g  [kernel.kallsyms]    [k] _raw_spin_unlock_irq
   0.41%            13   Communications  libmozglue.so        [.] arena_malloc
   0.41%            13              b2g  libmozglue.so        [.] arena_dalloc
   0.41%            13             perf  [kernel.kallsyms]    [k] format_decode
   0.38%            12              b2g  libc.so              [.] memset
   0.35%            11   Communications  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   0.35%            11   Communications  libxul.so            [.] xpc::JSXrayTraits::resolveOwnProperty(JSContext*, js::Wrapper
   0.35%            11             perf  [kernel.kallsyms]    [k] vsnprintf
   0.32%            10   Communications  libc.so              [.] memset
   0.32%            10   Communications  libxul.so            [.] JSCompartment::wrap(JSContext*, JS::MutableHandle<JSObject*>,
   0.32%            10   Communications  libxul.so            [.] js::Shape::search(js::ExclusiveContext*, js::Shape*, jsid, js:
   0.32%            10   Communications  libxul.so            [.] js::StringEqualsAscii(JSLinearString*, char const*)
   0.32%            10              b2g  [kernel.kallsyms]    [k] do_page_fault
   0.29%             9   Communications  libmozglue.so        [.] arena_dalloc
   0.25%             8  (Preallocated a  [kernel.kallsyms]    [k] _raw_spin_unlock_irqrestore
   0.25%             8   Communications  libc.so              [.] __memcpy_base
   0.25%             8   Communications  libxul.so            [.] Interpret(JSContext*, js::RunState&)
   0.25%             8              b2g  libmozglue.so        [.] arena_malloc
(In reply to Ting-Yuan Huang from comment #14)
> and IPDL::PBrowser::SendPRenderFrameConstructor (55ms).
> 

There is a patch to improve SendPRenderFrameConstructor() that had been landed days ago.
(In reply to Ting-Yuan Huang from comment #14)
> I got similar results to comment #6 with Communications/contacts on a nexus
> 4.
> 
> http://people.mozilla.org/~bgirard/cleopatra/
> #report=9ec5296becf5a8c8d06f8f6cf7be5715e60a0e07

If you look into the profile, you will find 2461ms ~ 2478ms is waiting for SendGetDPI().  I think it should be part of RecvUpdateDimensions.  Then, we don't need another sync IPC.
 
Between 3746ms ~ 3951ms and more periods of 100+ms after then, they spend a lot of time at ContactManager to convert data.  There is a potential improvement.

And, I find there are a lot of missed samples.  It means CPU is busy at these periods.  Some one should investigate these periods, maybe there is a chance to move these tasks to the periods that is less busy.
Gregor, could you give some comment on ContactManager?
Flags: needinfo?(anygregor)
Here's another profile, scrolling Pinterest on a Nexus-4 with the same gecko/gaia as comment 5:

Steps:
  1. Open http://www.pinterest.com/all/ from Browser app, wait till it's loaded
  2. Execute sps and systrace commands from two terminals simutaneously
  3. Start scrolling down

Results:
  sps: http://people.mozilla.org/~bgirard/cleopatra/#report=14d65851f3aeacc41f20b43aa860b82228b7aef0
  systrace: http://www.googledrive.com/host/0B0yKxuGl8-XAMEd5a0JRNTNkZHc

Analysis:
  1. Browser process is quite busy in the for loop of FrameLayerBuilder::PaintItems(), which I think could be parallelized on multi core. The 4th core is almost idle from systrace. ClientTiledLayerBuffer::ProgressiveUpdate() can also be a candidate of parallelization.
  2. [3980,5840] browser process is running the site's javascript to load more tiles.
This is the profile of scrolling Messages with medium workload

systrace: http://people.mozilla.org/~kchen/bug1094010/message-scroll-systrace.html

CPU 2 and 3 are mostly idle. While the time is mostly spent on DoReflow and painting we could still see a lot of blurred text and blank on the screen.

I think it might be that the Message app has some optimization that aggressively throw away the the DOM node that is not visible.
(In reply to Kan-Ru Chen [:kanru] from comment #20)
> I think it might be that the Message app has some optimization that
> aggressively throw away the the DOM node that is not visible.

Please check this in further.
(In reply to Thinker Li [:sinker] from comment #21)
> (In reply to Kan-Ru Chen [:kanru] from comment #20)
> > I think it might be that the Message app has some optimization that
> > aggressively throw away the the DOM node that is not visible.
> 
> Please check this in further.

Only the photo is removed by img.src = "";
(In reply to Kan-Ru Chen [:kanru] from comment #22)
> Only the photo is removed by img.src = "";

This seems an useless optimization since gecko would check if it is visible to user.  Right?!
Try to remove this optimization.
(In reply to Thinker Li [:sinker] from comment #18)
> Gregor, could you give some comment on ContactManager?

What is the workload here? It depends on the number of contacts you have in your DB.
Flags: needinfo?(anygregor) → needinfo?(tlee)
Could you answer Gregor's question?
Flags: needinfo?(tlee) → needinfo?(thuang)
(In reply to Thinker Li [:sinker] from comment #23)
> (In reply to Kan-Ru Chen [:kanru] from comment #22)
> > Only the photo is removed by img.src = "";
> 
> This seems an useless optimization since gecko would check if it is visible
> to user.  Right?!

Actually, not really :)
You can ask :tn for more information but basically on B2G it doesn't work like this.


(In reply to Kan-Ru Chen [:kanru] from comment #22)
> Only the photo is removed by img.src = "";

Where do you see this? The only place I see this is when we resize a picture, the image is not in the DOM. (utils.js)


(In reply to Kan-Ru Chen [:kanru] from comment #20)
> This is the profile of scrolling Messages with medium workload
> 
> systrace:
> http://people.mozilla.org/~kchen/bug1094010/message-scroll-systrace.html
> 
> CPU 2 and 3 are mostly idle. While the time is mostly spent on DoReflow and
> painting we could still see a lot of blurred text and blank on the screen.
> 
> I think it might be that the Message app has some optimization that
> aggressively throw away the the DOM node that is not visible.

Nope we don't do this at all right now. We only have a plain DOM that's not changing once it's loaded and no message is sent/received.

However, for threads belonging to known contacts, we display an image which can be costly to display (and it's also using :nth-child CSS selectors BTW). Maybe this is the culprit here?
Flags: needinfo?(tnikkel)
We generally try to keep around decoded versions of visible images and those that are close to being visible by way of scrolling them into view. Setting img.src to "" will probably also discard the compressed source of the image from memory.
Flags: needinfo?(tnikkel)
I misread the code. The setImageURL in contacts/js/views/list.js only set img.dataset.src to ''. I'm not sure how the image is rendered. I think it will be reflected to img.src at some point.

Anyway, I think it only looks slow because the photo is not rendered until the scroll stops.
(In reply to Thinker Li [:sinker] from comment #17)
> If you look into the profile, you will find 2461ms ~ 2478ms is waiting for
> SendGetDPI().  I think it should be part of RecvUpdateDimensions.  Then, we
> don't need another sync IPC.

Bug 835679 is out there to fix this problem.
(In reply to Kan-Ru Chen [:kanru] from comment #28)
> I misread the code. The setImageURL in contacts/js/views/list.js only set
> img.dataset.src to ''. I'm not sure how the image is rendered. I think it
> will be reflected to img.src at some point.
> 
> Anyway, I think it only looks slow because the photo is not rendered until
> the scroll stops.

The blank saw while scrolling is due to the lazy rendering of the DOM nodes. Only the first screenful of contacts is rendered. All other contacts are rendered when "onscreen".
(In reply to Kan-Ru Chen [:kanru] from comment #28)
> I misread the code. The setImageURL in contacts/js/views/list.js only set
> img.dataset.src to ''. I'm not sure how the image is rendered. I think it
> will be reflected to img.src at some point.
> 
> Anyway, I think it only looks slow because the photo is not rendered until
> the scroll stops.

Ah, you're talking about the Contacts app here, not the Messages app. But comment 20 is about the Messages app.


(sorry Timothy, I didn't mean to ask NI, I just wanted to know your alias. But thanks for the answer!)
(In reply to Gregor Wagner [:gwagner] from comment #24)
> What is the workload here? It depends on the number of contacts you have in
> your DB.

It's "reference-workload-light" and there are 200 contacts.
Flags: needinfo?(thuang)
Flags: needinfo?(anygregor)
(In reply to Thinker Li [:sinker] from comment #17)
> (In reply to Ting-Yuan Huang from comment #14)
> > I got similar results to comment #6 with Communications/contacts on a nexus
> > 4.
> > 
> > http://people.mozilla.org/~bgirard/cleopatra/
> > #report=9ec5296becf5a8c8d06f8f6cf7be5715e60a0e07
> 
> If you look into the profile, you will find 2461ms ~ 2478ms is waiting for
> SendGetDPI().  I think it should be part of RecvUpdateDimensions.  Then, we
> don't need another sync IPC.
>  
> Between 3746ms ~ 3951ms and more periods of 100+ms after then, they spend a
> lot of time at ContactManager to convert data.  There is a potential
> improvement.
> 

I assume thats not the very first start of the app where we have to upgrade the contactsDB after pushing the workload.

4sec is a long time for converting. We do have a cursor so we don't pause the main thread for too long. We also have to copy all data from the parent to the child and with the cursor we have many roundtrips.
In the child we have to create objects that we can hand out to content, which is also quite slow (see bug 783057)
Flags: needinfo?(anygregor)
(In reply to Ting-Yu Chou [:ting] from comment #19)
>   1. Browser process is quite busy in the for loop of
> FrameLayerBuilder::PaintItems(), which I think could be parallelized on
> multi core. The 4th core is almost idle from systrace.
> ClientTiledLayerBuffer::ProgressiveUpdate() can also be a candidate of
> parallelization.

I've just talked to BenWa about this, he reminded me that parallelize probably doesn't help if the paintings are memory bound, especially on device.
Depends on: 1107259
No longer depends on: 1107259
Blocks: AppStartup
(In reply to Ting-Yu Chou [:ting] from comment #8)
> 2. Move system app to a worker
>    Content process RecvLoadURL() 67ms later after B2G SendLoadURL(), since
> RecvShow() comes first and is blocked because B2G main thread is busy on
> system app and the painting. As system app could run on the main thread for
> a while, like [1603,1654] and [2854,2966], move it to a worker will have
> better responsiveness of B2G main thread.

Alive & Kevin, how do you think about this? Is this reasonable?
Flags: needinfo?(kgrandon)
Flags: needinfo?(alive)
(In reply to Ting-Yu Chou [:ting] from comment #35)
> (In reply to Ting-Yu Chou [:ting] from comment #8)
> > 2. Move system app to a worker
> >    Content process RecvLoadURL() 67ms later after B2G SendLoadURL(), since
> > RecvShow() comes first and is blocked because B2G main thread is busy on
> > system app and the painting. As system app could run on the main thread for
> > a while, like [1603,1654] and [2854,2966], move it to a worker will have
> > better responsiveness of B2G main thread.
> 
> Alive & Kevin, how do you think about this? Is this reasonable?

It might be.. but much part of system app is still dealing with DOM manipulation. We cannot move the entire system app to worker at all.
Flags: needinfo?(alive)
We are talking about web workers here, correct? I'm not sure how you would move the entire system to a web worker, but you may be able to move a few pieces. Sorry, I'm lacking some context/knowledge here, so apologize for the lack of ideas/suggestions on this subject.
Flags: needinfo?(kgrandon)
No longer blocks: AppStartup
I just went through the Gaia polish bugs [1], here're the ones I think we could profile:

bug 802208,
bug 1077169,
bug 1077516,
bug 1090571,
bug 1112581,
bug 1112590,
bug 1115537,
bug 1131050,
bug 1137613,
bug 1139303

[1] https://wiki.mozilla.org/FirefoxOS/polish
Summary: Profile B2G for Quad core platforms as the baseline of improvements → [meta] Profile on quad core device
(In reply to Ting-Yu Chou [:ting] from comment #38)
> I just went through the Gaia polish bugs [1], here're the ones I think we
> could profile:
> 
> bug 1115537,

Scratch this one, I misplaced it.
Summary: [meta] Profile on quad core device → [meta] Backlog for profiling on quad core device
No longer depends on: 1131050
No longer depends on: 1137613
No longer depends on: 1112581
Depends on: 1091654
Can not reproduce bug 802208, the slide works smoothly.
No longer depends on: 802208
Copy from bug 1093564 comment 9.

1. While scrolling in one direction, it pauses for a second if you scroll reversely.
2. It takes some time to show the homescreen when you press home key at card view.
3. Unlock screen does not show homescreen smoothly.
4. Screen does not scroll smoothly with your continuous swipe.
5. Keyboard does not come up right away when you start input.
6. It's janky when you drag at Contacts app to search people by the first character.
7. Press button ">" or "<" to switch forward/backward to screens ususally not happen instantly, this can be seen easily from Settings.
8. Switching between alphabet/symbol keybards is slow.
9. Keyboard seems a bit late than my fingers, you can see the characters come up later if you random press some keys fast and then stop.
10. Dragging web page from Browser app is janky.
11. The animation of moving app out from card view is not smooth
12. Moving icon around while editing homescreen is janky.
Depends on: 1157648
No longer depends on: 1157648
tracking-b2g: --- → +
OS: Linux → Gonk (Firefox OS)
Hardware: x86_64 → ARM
No longer depends on: 1077169
No longer depends on: 1112590
Depends on: 1180168
No longer depends on: 1180168
[Tracking Requested - why for this release]:
Priority: -- → P2
Assignee: kchen → nobody
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.