Open Bug 986694 Opened 11 years ago Updated 2 years ago

Browser app scrolling shows white screen occasionally on Desktop Sites

Categories

(Core :: Graphics: Layers, defect, P3)

30 Branch
ARM
Gonk (Firefox OS)
defect

Tracking

()

blocking-b2g -

People

(Reporter: tkundu, Unassigned)

References

Details

(Keywords: perf, Whiteboard: [c=handeye p= s= u=])

Attachments

(5 files)

Reference Bug: https://bugzilla.mozilla.org/show_bug.cgi?id=942750#c0
STR:
1) Flash a v1.4 FFOS build on the device.
2) Make sure that APZ is turned on for all Gaia apps on your device.
3) Start the Browser app and launch www.cnbc.com. Give it 1 min to load the full website. Try to scroll as fast as possible on an 800x480 display device.
4) You will see white screens occasionally.
The 10MB maximum size is too small for a video attachment. I can still compress and upload a video if needed.
Gaia: https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gaia/commit/?h=mozilla/v1.4&id=ee89ad8ce3dbaa27b372affb7121a429ffe18f7a
Gecko: https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/commit/?h=mozilla/v1.4&id=176fc2ed072055ac33a174d4e92169705874a4b9
blocking-b2g: --- → 1.4?
Can you upload to youtube?
blocking-b2g: 1.4? → 1.4+
Whiteboard: [systemsfe]
Keywords: perf
Target Milestone: --- → 1.4 S4 (28mar)
triage: Can someone on the browser team please grab a profile? It will help us diagnose the issue. Thanks! Profiling guide - https://developer.mozilla.org/en-US/docs/Performance/Profiling_with_the_Built-in_Profiler
Flags: needinfo?(anygregor)
Whiteboard: [systemsfe] → [systemsfe][c=handeye p= s= u=]
I can do that
Assignee: nobody → dale
Flags: needinfo?(anygregor)
Whiteboard: [systemsfe][c=handeye p= s= u=] → [systemsfe][c=handeye p= s= u=1.4]
Attached file Profile of scrolling cnbc.com (deleted) —
Using this I saw some white screens, particularly when scrolling back to the top of the page (graphics heavy) from the bottom. There were a few crashes. This is a desktop site we are being sent (separate evangelism bug?) and it's very heavy, so I'm not sure the white is entirely unexpected. This was a fresh Gecko build and Gaia profile, gecko: d75fda4229c7f297f2aa15ae61520dbbb07160f2
And deassigning: the white screen is a graphics issue, and I don't believe there's anything we'll be able to do on the Gaia side.
Assignee: dale → nobody
I did a profile when scrolling naver.com that seems different from Dale's profile, so I would like to share it. I used a Nexus 4 for that. In this profile, the child spends a long time in Paint because it is waiting for the main process to allocate the buffer (SendGrallocBufferConstructor). As I understand it, TextureClientPool should avoid that allocation by keeping some buffers in its pool. However, the number of active buffers (mOutstandingClients) is bigger than the maximum number of texture clients managed by this pool (sMaxTextureClients), so no textures are kept in the pool, causing the child to request a new buffer almost every time. I did a small test, increasing sMaxTextureClients from 50 to 200, and the white screen appeared less frequently.
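Editor's sketch of the effect described in the comment above. This is a simplified model and not Gecko's TextureClientPool: the class `ToyTexturePool`, the helper `AllocationsForScroll`, and the retention rule (keep a returned texture only while outstanding plus pooled textures stay within the cap) are assumptions made for illustration. It only shows why a cap smaller than the number of live tiles leaves the pool empty, so every newly needed tile costs a fresh allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tile texture; not a Gecko type.
struct Texture { int id; };

// Toy model of a capped recycling pool (assumed logic, for illustration).
class ToyTexturePool {
 public:
  explicit ToyTexturePool(std::size_t maxClients) : mMaxClients(maxClients) {}

  // Hand out a pooled texture if one exists, otherwise "allocate" a new one.
  Texture Get() {
    ++mOutstanding;
    if (!mFree.empty()) {
      Texture t = mFree.back();
      mFree.pop_back();
      return t;
    }
    ++mAllocations;  // models the slow round-trip to the parent process
    return Texture{mNextId++};
  }

  // Keep the returned texture only while outstanding + pooled stays
  // within the cap; otherwise it is dropped.
  void Return(Texture t) {
    assert(mOutstanding > 0);
    --mOutstanding;
    if (mOutstanding + mFree.size() < mMaxClients) {
      mFree.push_back(t);
    }
  }

  std::size_t Allocations() const { return mAllocations; }

 private:
  std::size_t mMaxClients;
  std::size_t mOutstanding = 0;
  std::size_t mAllocations = 0;
  int mNextId = 0;
  std::vector<Texture> mFree;
};

// Keep `live` tiles in use; on each scroll step release `churn` tiles and
// request `churn` new ones. Returns the total number of fresh allocations.
std::size_t AllocationsForScroll(std::size_t maxClients, std::size_t live,
                                 std::size_t churn, std::size_t steps) {
  ToyTexturePool pool(maxClients);
  std::vector<Texture> onScreen;
  for (std::size_t i = 0; i < live; ++i) onScreen.push_back(pool.Get());
  for (std::size_t s = 0; s < steps; ++s) {
    for (std::size_t i = 0; i < churn && !onScreen.empty(); ++i) {
      pool.Return(onScreen.back());
      onScreen.pop_back();
    }
    for (std::size_t i = 0; i < churn; ++i) onScreen.push_back(pool.Get());
  }
  return pool.Allocations();
}
```

With an illustrative 110 live tiles and 10 tiles recycled per step, five steps cost 160 allocations under a cap of 50 (every recycled tile is dropped) but only the initial 110 under a cap of 200.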
I should have mentioned, mine was on a hamachi device, apologies
Component: Gaia::Browser → Graphics: Layers
Product: Firefox OS → Core
Version: unspecified → 30 Branch
Whiteboard: [systemsfe][c=handeye p= s= u=1.4] → [c=handeye p= s= u=1.4]
I don't know what we can do here. Increasing the number of maximum clients means a better chance of OOM. At some point, you're trying to draw more than this device can cache or keep up with. Especially given that we're hitting the desktop site as mentioned above.
(In reply to Milan Sreckovic [:milan] from comment #8)
> I don't know what we can do here. Increasing the number of maximum clients
> means a better chance of OOM. At some point, you're trying to draw more
> than this device can cache or keep up with. Especially given that we're
> hitting the desktop site as mentioned above.
I think we should get a more realistic test case here. Desktop sites will always have problems, as they aren't optimized for mobile devices. I'm renoming this because I don't think it's realistic to set the no checkerboarding requirement on non-optimized sites for mobile.
blocking-b2g: 1.4+ → 1.4?
Tapas, can you please help check whether this is happening on a site optimized for the phone?
Flags: needinfo?(tkundu)
(In reply to Preeti Raghunath(:Preeti) from comment #10)
> Tapas,
>
> Can you please help check if this happening on an optimized site for the
> phone?
It does not come on optimized site for the phone. I tested with youtube, yahoo and cnbc mobile websites.
Flags: needinfo?(tkundu)
(In reply to Tapas Kumar Kundu from comment #11)
> (In reply to Preeti Raghunath(:Preeti) from comment #10)
> > Tapas,
> >
> > Can you please help check if this happening on an optimized site for the
> > phone?
>
> It does not come on optimized site for the phone. I tested with youtube,
> yahoo and cnbc mobile websites.
Are you saying you can or can't reproduce this with these mobile optimized sites? I can't tell by your comment here.
Flags: needinfo?(tkundu)
(In reply to Jason Smith [:jsmith] from comment #12)
> (In reply to Tapas Kumar Kundu from comment #11)
> Are you saying you can or can't reproduce this with these mobile optimized
> sites? I can't tell by your comment here.
I CANNOT reproduce this issue with mobile optimized web sites.
Flags: needinfo?(tkundu)
Summary: Browser app scrolling shows white screen occasionally → Browser app scrolling shows white screen occasionally on Desktop Sites
Inder, since this is not seen on mobile sites, we wouldn't block on this. Please assess and let me know.
Flags: needinfo?(ikumar)
Moving the ni to Vikram, who can assess the requirement from a perf perspective.
Flags: needinfo?(ikumar) → needinfo?(mvikram)
The problem is that an end user will not know whether he is hitting a mobile or a desktop site. Also, as everyone knows, not all websites direct mobiles to a mobile-friendly site. Is this a regression from when APZC was introduced? (I'm not sure, because I thought the browser always supported APZC.) Can we quantify the memory increase from increasing the buffer pool? We could consider making this a pref value, as some devices may not be that memory constrained.
Flags: needinfo?(mvikram)
During triage we were wondering if adding checkerboarding over the background color to indicate motion would be an acceptable interim solution for this bug.
Flags: needinfo?(milan)
That suggestion has certainly been put forward before, including variations such as having a different pattern for different applications, or only doing it in applications and not the browser. Something like that is probably doable in the 2.0 timeframe if we decide to prioritize it.
Flags: needinfo?(milan)
Milan, please respond to comment 16.
Flags: needinfo?(milan)
Maybe we can try FF for Android using the same Gecko version on a QRD device to better level-set the issue. Might as well try Chrome too. If they do no better, then I think we should reconsider spending further v1.4 time on this.
Flags: needinfo?(tkundu)
(In reply to Mandyam Vikram from comment #16)
> ...
> Was this a regression since when APZC was introduced(I'm not sure because I
> thought the browser always supported APZC). Can we quantify the memory
> increase by increasing the buffer pool? We could consider making this a pref
> value as some devices may not be that memory constrained.
Yes, the browser has supported APZ since the start. I don't know if we have devices that run both 1.0 and 1.4 in order to compare whether this should be marked as a regression.
Flags: needinfo?(milan)
(In reply to Michael Vines [:m1] [:evilmachines] from comment #20)
> Maybe we can try FF for Android using the same Gecko version on a QRD device
> to better level-set the issue. Might as well try Chrome too. If they do no
> better than I think we should reconsider spending further v1.4 time on this.
I tested Firefox Aurora [1] on msm8x26 with Android KitKat. The browser scrolls fine on www.cnbc.com in Android and does not show any white screen when scrolling fast. The same is observed with Chrome. So if we want to make v1.4 FFOS as good as Firefox for Android, then we should fix this issue in v1.4.
[1] https://www.mozilla.org/en-US/mobile/aurora/
Flags: needinfo?(tkundu)
BenWa, let's see if there is something obvious here.
Assignee: nobody → bgirard
blocking-b2g: 1.4? → 1.4+
(In reply to Andre Graziani (:graziani) from comment #6)
> Created attachment 8397502 [details]
> Screen shot of profile while scrolling naver.com
>
> I did a profile when scrolling naver.com, that seems different from Dale's
> profile, so I would like to share it. I used nexus-4 for that.
>
> In this profile, the child spends long time on Paint, because it is waiting
> the main process to allocate the buffer (SendGrallocBufferConstructor).
The pool is meant to make this a bit better, but in general bug 959089 should be a better solution.
(In reply to Mandyam Vikram from comment #16)
> Can we quantify the memory
> increase by increasing the buffer pool? We could consider making this a pref
> value as some devices may not be that memory constrained.
Yes, sMaxTextureClients * 256 * 256 * 4 will give you an upper bound. Adding a preference for this is a good idea, but we should discuss it in a different bug (a clone of this bug is fine).
(In reply to Tapas Kumar Kundu from comment #22)
> So if want to make v1.4 FFOS as good as 'firefox for android' then we should
> fix this issue in v1.4.
On Firefox for Android we just use GL textures. They are slower than gralloc overall, but they can be faster if there are a lot of GPU tile allocations and the compositor thread is busy and therefore slow to service the incoming gralloc allocations. Bug 959089 will hopefully close this gap. From the profile in comment 6, this is what we're seeing, which would explain why Firefox for Android is faster. My suggestion here is to divert all effort to bug 959089.
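Editor's worked example of the upper bound quoted in the comment above (sMaxTextureClients * 256 * 256 * 4). The function names are hypothetical, and reading the constants as 256x256-pixel tiles at 4 bytes per pixel, with the result in bytes, is an assumption; the thread gives only the product.

```cpp
#include <cassert>
#include <cstdint>

// Upper bound on pool memory, following the formula from the comment.
// Defaults assume 256x256 tiles at 4 bytes per pixel (an interpretation).
std::uint64_t PoolUpperBoundBytes(std::uint64_t maxTextureClients,
                                  std::uint64_t tileEdge = 256,
                                  std::uint64_t bytesPerPixel = 4) {
  assert(tileEdge > 0 && bytesPerPixel > 0);
  return maxTextureClients * tileEdge * tileEdge * bytesPerPixel;
}

// Convert a byte count to mebibytes for easier reading.
double BytesToMiB(std::uint64_t bytes) {
  return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

Under that reading, the default of 50 texture clients bounds the pool at 13,107,200 bytes (12.5 MiB), and the experimental value of 200 from comment 6 raises the bound to 52,428,800 bytes (50 MiB).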
Depends on: 959089
Depends on: 996458
(In reply to Benoit Girard (:BenWa) from comment #24)
> Adding a preference for this is a good idea but we should discuss this in a
> different bug (clone of this bug is fine).
Opened bug 996458
(In reply to Benoit Girard (:BenWa) from comment #24)
> ...
>
> My suggestion here is to divert all effort to bug 959089.
That may be the only practical thing to do now. It would probably disqualify this bug from being 1.4: there are a lot of changes in that bug, and it depends on more work that needs to be done. We have our explanation as to why we're slower. At this point, I would prefer that we reconsider this as a blocker. Re-sending to triage.
blocking-b2g: 1.4+ → 1.4?
Inder, Based on risk, we'd like to move this to 2.0
Flags: needinfo?(ikumar)
Preeti -- ok. fine by me. Moving ni to Vikram to get his input as well.
Flags: needinfo?(ikumar) → needinfo?(mvikram)
Whiteboard: [c=handeye p= s= u=1.4] → [c=handeye p= s= u=]
Status: NEW → ASSIGNED
Ok. I guess we don't have too many options.
Flags: needinfo?(mvikram)
Moving this to 2.0.
blocking-b2g: 1.4? → 2.0?
Minusing from 2.0 since in past releases desktop sites have not been expected to be problem-free when viewing in the fxOS browser. If that understanding has changed and involved teams are committing to supporting full desktop sites feel free to renom.
blocking-b2g: 2.0? → -
Priority: P1 → P3
Target Milestone: 1.4 S4 (28mar) → ---
(In reply to Benoit Girard (:BenWa) from comment #24)
> My suggestion here is to divert all effort to bug 959089.
Bug 959089 has landed on master, and visually it seems that checkerboarding was reduced when scrolling desktop pages. But it still happens a lot. This is the new profile I got after the patch from bug 959089, scrolling in the same conditions as reported in comment 6: http://people.mozilla.org/~bgirard/cleopatra/#report=5c319354028788c710d0c381cda655405b153635 The time spent by the child process waiting for buffer allocation is about 30% of the total.
Thanks for capturing this. Can you please enable some debug info in SimpleTextureClientPool (at the top of the file) to see why we aren't reusing tiles more efficiently?
Attached file texture recycle log for simpletiles (deleted) —
It is important to note that I had to enable "Simple Tiling" in the developer menu to get this log. By default the SimpleTextureClientPool path is not taken; TextureClientPool is used instead. With Simple Tiling enabled, the logs show that about 25% of the textures requested are newly allocated and 75% are recycled.
Hey Andre. Sorry about misdirecting you. We want to use the default tile pool. Looking at TextureClientPool, I don't think we are running into sMaxTextureClients. Could you add a printf into the GetTextureClient path to verify? More likely we shrink down to sMinCacheSize, which is 0, and then have to allocate our way back up. Having said that, there is another bug here. We aren't flushing the pool on a low-memory notification. We only do that after the timer. That needs to be fixed as well. Can you add the printf and see what mOutstandingClients looks like during your tests? We can whip up a patch that keeps more tiles around as long as there is no memory pressure.
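Editor's sketch contrasting the two retention policies discussed in the comment above: shrinking the pool to a minimum cache size when a timer fires, versus keeping tiles until a low-memory notification arrives. The class `ToyPoolRetention` and its methods are hypothetical and do not reflect Gecko's actual TextureClientPool code; the model only counts tiles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Toy model of how many recycled tiles survive between scroll bursts.
class ToyPoolRetention {
 public:
  ToyPoolRetention(std::size_t minCacheSize, bool shrinkOnTimer)
      : mMinCacheSize(minCacheSize), mShrinkOnTimer(shrinkOnTimer) {}

  // Tiles handed back to the pool after use.
  void ReturnTiles(std::size_t n) { mPooled += n; }

  // Timer-driven shrink: trims the pool down to the minimum cache size.
  void OnShrinkTimer() {
    if (mShrinkOnTimer) mPooled = std::min(mPooled, mMinCacheSize);
  }

  // Low-memory notification: always release everything.
  void OnMemoryPressure() { mPooled = 0; }

  // Request n tiles; returns how many had to be freshly allocated.
  std::size_t AcquireTiles(std::size_t n) {
    const std::size_t reused = std::min(n, mPooled);
    assert(reused <= mPooled);
    mPooled -= reused;
    return n - reused;
  }

  std::size_t Pooled() const { return mPooled; }

 private:
  std::size_t mMinCacheSize;
  bool mShrinkOnTimer;
  std::size_t mPooled = 0;
};
```

With a minimum cache size of 0 and timer shrinking enabled, 20 returned tiles are gone once the timer fires, so the next 20 requests are all fresh allocations. With timer shrinking disabled, the same 20 requests are served from the pool until a memory-pressure event empties it.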
Attached patch Only clear pool on low memory. (deleted) — Splinter Review
Andre, want to try this patch? After a few initial allocations we should never wait, except if we run low on memory. Also, please log how many outstanding texture clients we have with a printf so we know what we are dealing with here. Tapas: "The time spent by the child process waiting for buffer allocation is about 30% of the total." Are you guys allocating without MAP_POPULATE again in your vendor code? 30% of the time spent in allocation sounds a lot like memset exercising the kernel's page fault handler. It would be good if you could measure where this time is spent. If it's in the kernel, it's probably something we should fix in the vendor code as well (the caching stuff above is a separate bug that avoids the parent process round-trip).
Flags: needinfo?(tkundu)
Attached file Log number of outstanding clients (deleted) —
Andreas, I think line 80 was deleted by mistake in your patch; without it the pool will always be empty. So I inserted it back for my tests. This log contains the number of outstanding clients and the number of textures in the pool.
(In reply to Andreas Gal :gal from comment #37)
> Tapas: "The time spent by the child process waiting for buffer allocation is
> about 30% of the total." Are you guys allocating without MAP_POPULATE again
> in your vendor code? 30% time spent in allocation sounds a lot like memset
> exercising the page fault handler of the kernel. Would be good if you can
> measure where this time is spent. If its in the kernel, its probably
> something we should fix in the vendor code as well (the caching stuff above
> is a separate bug, that avoids the parent process round-trip).
Thanks for pointing me to this. I am looking into it and will update ASAP.
(In reply to Andreas Gal :gal from comment #37)
> Tapas: "The time spent by the child process waiting for buffer allocation is
> about 30% of the total." Are you guys allocating without MAP_POPULATE again
> in your vendor code? 30% time spent in allocation sounds a lot like memset
> exercising the page fault handler of the kernel. Would be good if you can
> measure where this time is spent. If its in the kernel, its probably
> something we should fix in the vendor code as well (the caching stuff above
> is a separate bug, that avoids the parent process round-trip).
Can you please point me to the exact Gecko function/line number where you are seeing this delay during buffer allocation? Thanks a lot for your help.
Flags: needinfo?(gal)
Flags: needinfo?(tkundu)
Flags: needinfo?(andre.graziani)
Hi Tapas, I am going to rewrite my statement to make it clearer: "The time spent by the child process waiting for the parent process to allocate a buffer is about 30% of the total." I got this from the profile in comment 32. Here is the exact same profile, but with a better range: http://people.mozilla.org/~bgirard/cleopatra/#report=f4ef254bbb5a77a213007d24f05d4b1a3bbe73d5 If you keep expanding the methods with the highest running time, you may end up here: http://dxr.mozilla.org/mozilla-central/source/gfx/layers/ipc/ISurfaceAllocator.cpp#318 where 30% of the time is spent. However, AFAIK, the allocation happens in the parent process, so you may need to dig deeper on the parent side.
Flags: needinfo?(andre.graziani)
The parent handles this in SharedBufferManagerParent::RecvAllocateGrallocBuffer, which does:
    sp<GraphicBuffer> outgoingBuffer = new GraphicBuffer(aSize.width, aSize.height, aFormat, aUsage);
In previous versions of your silicon's gonk we have seen an mmap in the gralloc/ion code that doesn't pre-map the entire buffer and then does a memset, which for larger buffers causes hundreds of page faults, which is slow.
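Editor's sketch of the MAP_POPULATE point raised in the comments above. It uses a plain anonymous Linux `mmap`, which is an assumption chosen for illustration; the vendor gralloc/ion allocation path is not shown in this thread and will differ. `MapBuffer` and `UnmapBuffer` are hypothetical helper names.

```cpp
#include <sys/mman.h>

#include <cassert>
#include <cstddef>
#include <cstring>

// Map an anonymous read/write buffer. With prefault=true, MAP_POPULATE asks
// the kernel to fault the pages in during mmap() instead of on first touch.
// Returns nullptr on failure.
void* MapBuffer(std::size_t bytes, bool prefault) {
  assert(bytes > 0);
  int flags = MAP_PRIVATE | MAP_ANONYMOUS;
  if (prefault) {
    flags |= MAP_POPULATE;
  }
  void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, flags, -1, 0);
  return p == MAP_FAILED ? nullptr : p;
}

// Release a buffer obtained from MapBuffer.
void UnmapBuffer(void* p, std::size_t bytes) {
  if (p != nullptr) {
    munmap(p, bytes);
  }
}
```

Without MAP_POPULATE, a memset over the fresh mapping takes a page fault the first time each page is touched; with it, that work is done up front inside the mmap call. Comparing the process's minor-fault count (for example via getrusage) around the memset is one way to measure the difference on a given device.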
Flags: needinfo?(gal)
110 is a lot of textures. Why are we doing that?
(In reply to Andreas Gal :gal from comment #42)
> The parent handles this in
> SharedBufferManagerParent::RecvAllocateGrallocBuffer, which does:
>
> sp<GraphicBuffer> outgoingBuffer = new GraphicBuffer(aSize.width,
> aSize.height, aFormat, aUsage);
>
> In previous versions of your silicon gonk we have seen an mmap in the
> gralloc/ion code that doesn't pre-map the entire buffer and then does an
> memset, which for larger buffers causes hundreds of segfaults, which is slow.
I just confirmed that SharedBufferManagerParent::RecvAllocateGrallocBuffer() takes negligible time to create the gralloc buffer, but there is a big IPC delay between ISurfaceAllocator::AllocGrallocBuffer() and SharedBufferManagerParent::RecvAllocateGrallocBuffer(). This delay happens randomly while scrolling www.cnbc.com. I used the following gaia/gecko for profiling:
gaia: https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gaia/commit/?h=mozilla/master&id=ed5d408dc1120b035ebce9a809499c30fbfb4582
gecko: https://www.codeaurora.org/cgit/quic/lf/b2g/mozilla/gecko/commit/?h=mozilla/master&id=45c69de2af9d2504a8baac39ca759403931d5158
Please set needinfo on me for a faster response.
Flags: needinfo?(andre.graziani)
Flags: needinfo?(gal)
Do we peg both CPUs at 100% during this time? Any idea why there is a scheduling delay?
Flags: needinfo?(gal)
I tested again today with the current master, and I was surprised by how much the progressive-paint/low-precision-buffer features have reduced the checkerboarding. I can still see checkerboarding in some extreme cases, but most of the time it doesn't appear.
Flags: needinfo?(andre.graziani)
Assignee: bgirard → nobody
Status: ASSIGNED → NEW
Severity: normal → S3