Closed Bug 975680 Opened 11 years ago Closed 6 years ago

[Contacts] Implement lazy layout for contact lists

Categories

(Firefox OS Graveyard :: Gaia::Contacts, defect, P1)

defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: kgrandon, Assigned: kgrandon)

References

Details

(Keywords: perf, Whiteboard: [c=handeye p=5 s= u=])

Attachments

(2 files)

:roc has prototyped a recycled dom list using HTML which we may consider implementing in contacts. This is unlikely to solve all performance problems (things like jump to end), but this may solve our scrolling problem for the time being.

Bug 909935 is open to track implementing a streaming cursor based list, which is probably an ideal long-term solution still, but that relies on lots of IDB enhancements which may take a long time. While the recycled dom approach will probably not work too well for variable height lists, it could be good enough for contacts for now.
Attached file Pull Request (deleted) —
WIP - this pull request actually works for contact lists, but needs lots of improvements still. No jump links or headers yet, but I'll keep plugging away. If anyone wants to send me some patches, feel free to :)
Whiteboard: [c= p=5 s= u=] → [c=handeye p=5 s= u=]
I think the key problem to solve here is how to take the index provided to populate() and find the correct contact.

Using a local IDB with an index on a normalized name field is probably our best bet.  Then we can use IDBCursor.advance() to convert the index to an item.

I will see if I can get that working based on the PR today.
I think Francisco said he would have time to help with this next week as well.

My other major area of concern is search.  Right now it depends on the DOM of the main list being fully defined, so it will likely break when we change it.  I don't see a good way to provide equivalent search ability using IDB since IDB only provides prefix searching on a single field at a time.
I think it might help to create an integration branch for this effort since we are likely going to have multiple working on it.
Actually, before proceeding down the IDB route, I'd like to understand why we still have checkerboarding with this solution.

Here's a profile:

  http://people.mozilla.org/~bgirard/cleopatra/#report=a57c2c6b251a8c6d50431c55090a768f786699a1

With FPS overlay enabled I see a huge amount of overpainting with values of 700+.
(In reply to Ben Kelly [:bkelly] from comment #4)
> I think Francisco said he would have time to help with this next week as
> well.

> I think it might help to create an integration branch for this effort since
> we are likely going to have multiple working on it.

Sounds good - that's definitely something we can do. I was not aware other people were interested in working on it. The more the merrier.

I've just squashed the branch on my github (bug_975680_contacts_lazy_layout) - and we can use that as a starting point for a branch. You guys can also submit pull requests to my branch - or I can just give you commit access (unless there is a reason for us creating a mozilla-b2g branch).

(In reply to Ben Kelly [:bkelly] from comment #2)
> Using a local IDB with an index on a normalized name field is probably our
> best bet.  Then we can use IDBCursor.advance() to convert the index to an
> item.

Honestly my thoughts were that we should ignore IDB stuff for now. I think this could be a useful bridge between truly streaming lists, and what we have today. I was thinking we could use what we have now to asynchronously populate the lists backing dataset - and allow the user to potentially scroll it before we have all of the data. We'll have to play with this to see if the user interaction still works.
I think the most important thing to do here though is to ensure that this will actually remove checkerboarding. There's still a bit of dom manipulation that we need to throw into the populate() method, so that will just slow things down even further. My pull request has the default display port multiplier set at 8x, perhaps I'm going to try reducing that to see if that helps.
Here's my branch tracking the KevinGrandon/bug_975680_contacts_lazy_layout branch.  GitHub won't let me open a pull request to your repo for some reason, though.

  https://github.com/wanderview/gaia/tree/contacts-abs-pos-work

I added one commit to call revokeObjectURL() on the photo URLs when the nodes are recycled.  This seems to help a little bit.  (Does revokeObjectURL() avoid cycle collects??)
Well, I think there are going to be some painful things to implement with an in-memory array.

For example, user modifies a contact such that it changes sort order.  Are we going to reload the array from contacts API or manually resort the array?  If there is an insertion we need to grow and resort.  All of that needs to be complete before we can go back to the Recyclist to refresh the display.

If we are using IDB we just update the record and the index automatically handles resorting, insertions, etc.  We then just tell Recyclist to repopulate the existing nodes and we're done.

Also, the current in-memory solution creates a long pause before above-the-fold.  That is probably solvable, but will be more complex than if we just had an IDB.

We've used DataStore for FB now, so its not a completely unknown quantity.  I think we should try to get it in if we can as it makes it easier to deal with the absolute positioning in a dynamic list.

Also, if we have 3 or more people working on this, one person can focus on DataStore -> IDB, one person can focus on layout, etc.

Just a thought.

Of course all of that is dependent on getting the basic implementation working without checkerboarding.
(In reply to Ben Kelly [:bkelly] from comment #8)
> Here's my branch tracking the KevinGrandon/bug_975680_contacts_lazy_layout
> branch.  GitHub won't let me open a pull request to your repo for some
> reason, though.
> 
>   https://github.com/wanderview/gaia/tree/contacts-abs-pos-work
> 
> I added one commit to call revokeObjectURL() on the photo URLs when the
> nodes are recycled.  This seems to help a little bit.  (Does
> revokeObjectURL() avoid cycle collects??)

Nice - that was on my todo list :) I did pretty much most of this on an airplane, so there's lots of stuff to fix/polish.

You should be able to submit a pr, you may need to change the fork and type in kevingrandon/gaia though. I've also added you as a contributor to my gaia fork, so you can just push there now if you want. (I hope it did not auto-subscribe you to notifications).

I've cherry-picked your fix in.
Adding display:none to the various details view, form view, etc has significantly cut over painting.  The FPS overlay value dropped from ~720 to ~400.  If those are translated off screen, I wonder why we are triggering paints for them.
Interesting, checkerboarding is also much worse on my buri with hwc enabled. I see these in the log with hwc enabled when significant checkerboarding appears:

  D/HwcUtils( 3459): Skip layer

Benoit, Sotaro, Kats, do you know whats going on here with hwc and checkerboarding?

This is on a buri with v1.2-device.cfg OEM firmware and gecko rev 1238ef12b996.
Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(bugmail.mozilla)
Flags: needinfo?(bgirard)
I've pushed some commits to the branch that do the following:

* hack out other contacts panels using display:none to avoid overpainting for now
* render photos asynchronously using requestAnimationFrame()
* reverse the order of the Recyclist populate calls when scrolling backwards
* use a separate multiplier for recycling to avoid having the synchronous generate(1) call recycle items just populated by an async generate(8) call.

While these help, there is still some checkerboarding.  This profile suggests its still the photos:

http://people.mozilla.org/~bgirard/cleopatra/#report=d56bff21943d3838fcadfd1d65608a213cff291c

We spend a lot of time decoding images for the background-image CSS styles.  Is there a better way to asynchronously load/decode those?
(In reply to Ben Kelly [:bkelly] from comment #13)
> We spend a lot of time decoding images for the background-image CSS styles. 
> Is there a better way to asynchronously load/decode those?

Let me think about this. I wonder if we can further throttle images, or delay them based on some scroll heuristic. E.g., don't load any images until after 75ms after the last scroll event (and stop loading them if the user starts to scroll again).
I think it's possible that the clientHeight issue was causing some weirdness and potentially lots of over painting. I've fixed this now by adding some CSS, but it seems we've regressed elsewhere as items stop appearing once you do the initial few viewport scrolls. I will take a look tomorrow.
Ok, scroll issue should now be fixed. A few things I'm thinking about here:

1 - I do think we're overprocessing items, and could be smarter about what we process based on scroll direction.

2 - I also think we could probably be smarter about the buffer integers that we pass to the generate function. (I don't really see a need to pass two different numbers into it).

3 - I think our method for changing the order of recycled items is wrong, because populateItem calls recyclableItems.pop(). I think we may want to change the sort function instead on line 108.
Yea, the clientHeight thing basically meant everything was being rendered upfront for the list.  So none of my changes were executing while scrolling.  Let me see what I broke.
I pushed a change to not append duplicate name elements to each row item.  This caused the items to get corrupted when recycling a node.

Can you explain why it matter what order we pull from the recycledItems list?  I would think once they are no longer attached to the DOM we could manipulate them in whatever order we choose.

I was trying to render the new items to the screen such that the ones closest to the "soon to be scrolled into view" are rendered first.  I have some more code to improve this further, but its not done yet.
I think we might be falling prey to bug 957668 now.  I'm seeing lots of:

E/memalloc(  137): /dev/pmem: No more pmem available
W/memalloc(  137): Falling back to ashmem

Which I think often happens when APZ wants to allocate a graphics buffer for a large display port.  Now that we have correct heights our list container is quite tall.  For reference-workload-medium its 3000px.  For reference-workload-heavy its 6000px, etc.

I see a lot slowdown now too which I'm guessing is this giant layer getting repainted as we recycle nodes.

I'll try to profile some tomorrow to verify if this is what is happening.
(In reply to Ben Kelly [:bkelly] from comment #12)
> Interesting, checkerboarding is also much worse on my buri with hwc enabled.
> I see these in the log with hwc enabled when significant checkerboarding
> appears:
> 
>   D/HwcUtils( 3459): Skip layer
> 
> Benoit, Sotaro, Kats, do you know whats going on here with hwc and
> checkerboarding?

Sotaro can provide details on the "skip layer" message. From what I understand it is switching to a different type of memory.

(In reply to Ben Kelly [:bkelly] from comment #19)
> I think we might be falling prey to bug 957668 now.  I'm seeing lots of:
> 
> E/memalloc(  137): /dev/pmem: No more pmem available
> W/memalloc(  137): Falling back to ashmem
> 
> Which I think often happens when APZ wants to allocate a graphics buffer for
> a large display port.  Now that we have correct heights our list container
> is quite tall.  For reference-workload-medium its 3000px.  For
> reference-workload-heavy its 6000px, etc.

You can verify the size of the displayport by checking the values getting passed in to SetDisplayPortForElement - that value will be in CSS pixels and presumably you will only ever have a 1.0 resolution in the contacts app so it should map directly to layer pixels.

I noticed that Kevin in comment 7 said that the displayport multiplier was set to 8x which is definitely too big.
Flags: needinfo?(bugmail.mozilla)
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #20)
> I noticed that Kevin in comment 7 said that the displayport multiplier was
> set to 8x which is definitely too big.

Unfortunately I get these even with a multiplier of 2 (so 5 screen heights worth rendered).  If we make these small, we can't render enough to account for scroll velocity between scroll events as well.
While scrolling with multiplier set to 2:

  I/Gecko   ( 1339): ### ### SetDisplayPortForElement: x:0.000000 y:-134.097 width:330.000000 height:1025
  E/memalloc( 1120): /dev/pmem: No more pmem available
  W/memalloc( 1120): Falling back to ashmem

Looking at /proc/meminfo it appears we are using almost 4MB of ashmem:

  Mlocked:            3960 kB

This is on mozilla-central at 1238ef12b996 with hwc disabled.

Kats, does this sound correct?
Flags: needinfo?(bugmail.mozilla)
(In reply to Ben Kelly [:bkelly] from comment #18)
> Can you explain why it matter what order we pull from the recycledItems
> list?  I would think once they are no longer attached to the DOM we could
> manipulate them in whatever order we choose.

After looking closer I understand now.  The "recyclable" items are still in the DOM until we actually reuse them.  We want to pull from the items furthest from the viewport to keep a consecutive block of rendered content.

I think maybe we should remove the forget() callback I added as its probably prematurely revoking photo URLs for items which may not be recycled.

Regardless, it appears roc's original virtual-list-demo.html is checkerboarding as well.  We should probably see what is going there before trying to port contacts any further.
I opened bug 975831 to investigate the issues with the demo.
(In reply to Ben Kelly [:bkelly] from comment #22)
> Looking at /proc/meminfo it appears we are using almost 4MB of ashmem:
> 
>   Mlocked:            3960 kB
> 
> This is on mozilla-central at 1238ef12b996 with hwc disabled.
> 
> Kats, does this sound correct?

Honestly I don't know. There might be other things that are using ashmem too. But a displayport size of 330x1025 sounds reasonable at least, and we should be able to handle that fine.
Flags: needinfo?(bugmail.mozilla)
Made some tweaks in the PR. Gave items a translateZ() until we have will-change to reduce painting. Also backed out one of your patches due to it causing additional reflows (we were recycling too much I think on the initial synchronous generate() call.
What's the main objective of this bug? What's the performance improvement you are aiming at? Why do we need this? and last but not least there should be a commitment that all the regressions introduced if this ever lands will be fixed by the author and if not backout will be done immediately. I'm a bit sick of fixing regressions that other people introduce with "magical performance hacky patches" in the contacts app and specially in the Facebook integration.

thanks, best
(In reply to Jose M. Cantera from comment #27)
> What's the main objective of this bug? What's the performance improvement
> you are aiming at? Why do we need this? and last but not least there should
> be a commitment that all the regressions introduced if this ever lands will
> be fixed by the author and if not backout will be done immediately. I'm a
> bit sick of fixing regressions that other people introduce with "magical
> performance hacky patches" in the contacts app and specially in the Facebook
> integration.
> 
> thanks, best

The main objective of this bug is to see if it is possible to create a performant scrolling contact list. Right now we don't have one. As you scroll the contacts list today there is lots of checkerboarding, we need to remove this. We are in prototyping phase and there are lots of bugs/regressions still.

The patch authors will definitely be responsible for any regressions should we go down this path. If I am the final author of this patch, I will definitely try to test as much as I can (including Facebook), and follow any fallout regressions of this patch. Jose, if you or Francisco want to take this on as contacts developers, feel free to assign this bug to yourself. I'm looking forward to collaborating with you in whatever you think is the most efficient/regression free way.
(In reply to Jose M. Cantera from comment #27)
> What's the main objective of this bug? What's the performance improvement
> you are aiming at? Why do we need this? and last but not least there should
> be a commitment that all the regressions introduced if this ever lands will
> be fixed by the author and if not backout will be done immediately. I'm a
> bit sick of fixing regressions that other people introduce with "magical
> performance hacky patches" in the contacts app and specially in the Facebook
> integration.
> 
> thanks, best

Hi folks, we are all in the same boat, I totally trust the people that have been working on performance, specially when we have a contacts peer on that.

As peers is our task to help them with their patches, give ideas and try to come with a solution that we all are happy.

IMHO, is not about regressions or pointing at people, it's  about collaboration and create the best experience ever.

Cheers,
F.
(In reply to Jose M. Cantera from comment #27)
> What's the main objective of this bug?

Short answer is "scroll large contacts list without checkerboarding when APZ is enabled without regressing other features, launch times, etc".

See bug 967884 for some of the challenges we are faced with.  Alternative, less risky suggestions welcome!
Skip layer just means that the content isn't suitable to use HWC. This is expected for scrolling content (however this must not happen for video playback). There's an Developer option to turn off HWC if you beleive that is causing problem. The PMEM is warning is ok as long as we're not preventing video playback from using HWC.
Flags: needinfo?(bgirard)
Currently I think this is our best option in the near future for scroll performance. Going to mark as blocking 967884.
Blocks: 967884
This is a team effort, but assigning to myself for the time being so we can track this better in scrumbugs.
Assignee: nobody → kgrandon
Status: NEW → ASSIGNED
\o/ thanks folks for the great work here
Flags: needinfo?(sotaro.ikeda.g)
(In reply to Francisco Jordano [:arcturus] from comment #34)
> \o/ thanks folks for the great work here

Thank you in advance for reviews, and potential regressions :-p Also if you or Jose have some cycles to do some early investigation/bug fixing on the branch, it would be greatly appreciated. Though if you are busy we completely understand.
JSHint would be *really* nice to have for this refactoring, and I've created bug 981076 to do so. To not delay this too much, and assuming contacts peers are on-board, I would just start with doing jshint for the files that this bug needs to touch.
José: This new approach will also greatly improve load time and peak memory consumption.

Kevin, Ben: One thing I mentioned in the original blog post was my concern about accessibility. Now that we're trying to land this solution, have you checked with our accessibility experts if that solution is ok or can be turned into one?
Flags: needinfo?(kgrandon)
Flags: needinfo?(bkelly)
This can absolutely be an accessible solution because as you move to items with the screen reader, the element will scroll and new items will be accessible (current/old items are only removed when they are out of the displayport). I tested the screen reader on the branch and it works fine, though there are some problems with other panels which is a different problem. In any case we can loop the accessibility guys in once we are closer with the patch, but I don't see any difficulties here. Thanks for the reminder.
Flags: needinfo?(kgrandon)
Flags: needinfo?(bkelly)
The current scroll handler causes a sync reflow by accessing scrollTop on the scrolled element. I profiled and I saw that the sync reflow was taking around 25% of the time during scroll events. Bug 655228 can address this by passing in the position or delta in the scroll event. This bug should probably depend on bug 655228, but I want to make sure that one is actionable/landable first, and also validate profiles before/after.
Depends on: 982152
I am learning that this bug is way too massive to do in a single pass, so I'm going to see what kinds of chunks we can split out, and work on those first. Basically anything that hooks into the main list dom needs to be refactored in order to support this, or alternative approaches. Smaller patches should also be better for testing/stability. I've filed bug 982152 for the search functionality, and will look into other areas that need to be decoupled as well.

Anyone can feel free to jump in and steal a dependent bug.
This is another pull request which contains changes that would be required if bug 655228 lands.
Here are two profiles of scrolling a heavy workload a few times.

Master: http://people.mozilla.org/~bgirard/cleopatra/#report=ce1805a052dcd663b3f1e200b0e879eacbe9c972

Lazy Layout: http://people.mozilla.org/~bgirard/cleopatra/#report=97f84707d5304dbdcdd6fdeac9b428f57ae10af8

Ben, Mason - is there anything you can glean from these profiles? Any suggestions as to further improvements? Thanks!
Flags: needinfo?(mchang)
Flags: needinfo?(bkelly)
From the profiles, it looks like the one on master is actually performing better. Were the URLs mixed up?

When I was looking at the long rasterizes, i see that they are usually followed by long reflows. The 116ms rasterize is preceded by a ~45ms reflow. Are those all related to visibility monitor reflows? 

Is the app still checkerboarding quite a bit even with tiling?
Flags: needinfo?(mchang)
(In reply to Mason Chang [:mchang] from comment #43)
> From the profiles, it looks like the one on master is actually performing
> better. Were the URLs mixed up?

The links are correct - but what's making you say that? 

For master we're seeing on average of 19-25ms composite times, with the 116ms rasterize time. With the lazy layout approach the composite time is generally much less, and appears to be between 13-18ms.


> When I was looking at the long rasterizes, i see that they are usually
> followed by long reflows. The 116ms rasterize is preceded by a ~45ms reflow.
> Are those all related to visibility monitor reflows? 

Yes, it looks like this is from visibility monitor (which we are trying to fix with this approach).


> Is the app still checkerboarding quite a bit even with tiling?

I think it's better, but I'm still seeing checkerboarding.
The paint markers in those profilers are totally messed up.  I think cleopatra or gfx might be confused about the markers now that tiling as landed.

Lazy layout looks better to me.

On master reflows are taking 50ms to 70ms and paints are taking 60ms to 150ms.

In lazy layout reflows are taking about 17ms and paints are taking 50ms to 80ms.

Kevin, are you still seeing checkerboarding with lazy loading?
Flags: needinfo?(bkelly)
(In reply to Kevin Grandon :kgrandon from comment #44)
> For master we're seeing on average of 19-25ms composite times, with the
> 116ms rasterize time. With the lazy layout approach the composite time is
> generally much less, and appears to be between 13-18ms.

Incidentally, I think the composite time is less important for what you are optimizing here.  With APZC the composites are essentially running asynchronously with the content work.  Ultimately we need layout and rasterization to fast enough to keep up, so we want those smaller.

Composites may get slow with too much content work, but thats simply because the child process is interfering with the scheduling of the CPU for the compositor.  (I think.)
Kevin, what CSS are using?  Is it the full contacts CSS?  I wonder what the profiles would look if you dumped it all and only had a minimal CSS to layout the list.  I think there is something in there that is killing our text rendering performance.

Are you rendering the "org" field yet?
I am still seeing checkerboarding, but is getting better. I'm using the full css including rendering the org and bolded first/last names depending on setting. I will try again today, stripping down the CSS to the bare minimum. If I have any luck I will upload a profile with the changes.
Priority: -- → P1
No longer blocks: 967884
Firefox OS is not being worked on
Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: