Closed Bug 1019693 Opened 10 years ago Closed 10 years ago

[Flame] e-mail with pictures crashes after repeated pinch zoom / scrolling

Categories

(Core :: Layout, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

()

VERIFIED FIXED
2.1 S2 (15aug)
blocking-b2g 2.0+
Tracking Status
firefox32 --- wontfix
firefox33 --- wontfix
firefox34 --- fixed
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: njpark, Assigned: mattwoodrow)

References

Details

(4 keywords, Whiteboard: [MemShrink:P2][c=memory p= s= u=2.0])

Attachments

(13 files)

(deleted), message/rfc822
Details
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), application/x-zip
Details
(deleted), application/x-zip
Details
(deleted), text/plain
Details
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
asuth
: feedback-
Details | Diff | Splinter Review
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
tnikkel
: review+
Details | Diff | Splinter Review
(deleted), image/png
Details
(deleted), image/png
Details
STR:
Open an HTML email with pictures (attached e-mail)
Pinch zoom in/out, and scroll.  Repeat these actions.

Expected:
the image zooms in/out and scrolls normally
Actual:
the image freezes, and eventually the e-mail app exits suddenly.

Video location:
https://www.dropbox.com/s/lyl1w3ij5h37uvj/VID_20140603_111520.mp4


Version Info:
β”‚ Gaia      61cd07a8b5fa017777db6d345e00afb4fb8789b7                         β”‚
β”‚ Gecko     https://hg.mozilla.org/mozilla-central/rev/f28005b84ed0          β”‚
β”‚ BuildID   20140603040210                                                   β”‚
β”‚ Version   32.0a1                                                           β”‚
β”‚ ro.build.version.incremental=94                                            β”‚
β”‚ ro.build.date=Tue May 20 09:29:20 CST 2014
Is there a log and a crash report?  Regression?
Attached file emailcrash.log (deleted) β€”
logcat of the crash
The device is throttled to have 512MB of RAM
blocking-b2g: --- → 2.0?
Looks like Email got killed due to OOM, looking at the logcat:
E/OomLogger(  291): [Kill]: send sigkill to 1141 (E-Mail), adj 134, size 8473
er. LMK.
:njpark, is 1.4 affected ?
Flags: needinfo?(npark)
Switching to qawanted here to determine if this reproduces on 1.4.
(In reply to bhavana bajaj [:bajaj] from comment #6)
> :njpark, is 1.4 affected ?

No this does not reproduce on 1.4  The scroll /pinch zoom actions are more responsive on 1.4 Flame device.
Flags: needinfo?(npark)
Whiteboard: [MemShrink]
Keywords: footprint
Component: Gaia::E-Mail → Panning and Zooming
Product: Firefox OS → Core
blocking-b2g: 2.0? → 2.0+
Keywords: reproducible
Just to be clear, this is 2.0+ because it's a regression from 1.4, and this is a critical workflow?
Attachment #8433555 - Attachment mime type: text/x-log → text/plain
(In reply to No-Jun Park [:njpark] from comment #3)
> The device is throttled to have 512MB of RAM

Do we expect to ship devices with a Flame-like screen size with 512 MB of RAM?

The experience on Flame panning/zooming this email is indeed quite janky and slow with 512 MB but is much better with 1024 MB. I haven't yet reproduced the OOM crash though.
Attached file Email OOM kill log (deleted) β€”
After some more panning around I reproduced the crash. I had various logging enabled (attached) and as far as I can tell the APZ code and displayport stuff is all working as expected. None of the sizes there are obscenely large, so it's not an obvious problem.

Specifically the last layer dump before the email app is killed shows two layers with a displayport set; the first one is 320x545 and the second one is 512x1365 - both of these are pretty reasonable. I haven't yet looked into the impact of the CSS transform that's used to "zoom" the email in the email app, that might be relevant.

In the worst case we can do something like make the displayport heuristics more memory-dependent, or disable low-res painting to save some memory on devices where the available memory is small compared to the screen size.
(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #10)
> (In reply to No-Jun Park [:njpark] from comment #3)
> > The device is throttled to have 512MB of RAM
> 
> Do we expect to ship devices with a Flame-like screen size with 512 MB of
> RAM?
> 
> The experience on Flame panning/zooming this email is indeed quite janky and
> slow with 512 MB but is much better with 1024 MB. I haven't yet reproduced
> the OOM crash though.

tchung suggested that we should throttle the RAM of Flame device to 512, because apparently that would be the spec of the production device.
QA Contact: pcheng
I will need confirmation on how to proceed to finding the window for this bug.

The bug reproduces on today's 2.0 Flame, with and without APZ turned on.
The bug reproduces on earliest Flame central tinderbox engineering build (4/17), but ONLY with APZ turned ON. It does NOT occur with APZ turned OFF.

The bug reproduces on today's Buri 2.0 with and without APZ turned on.

Would you like the window to be found in Flame, with APZ turned OFF? Or in Buri, with APZ turned ON?

Since I see discussions about tweaking Flame memory, I wasn't sure if you guys want this bug exclusive to Flame. (I'm on v10g-2 base image and I've never tweaked my Flame's memory)
Flags: needinfo?(jsmith)
I'd do the window on Buri since we've got more builds there.
Flags: needinfo?(jsmith)
Can we get a memory report for this? Running this against a DMD build would be even better as I'm guessing there will be a fair amount of |heap-unclassified|.

To retrieve a memory report:
  MOZ_IGNORE_NUWA_PROCESS=1 tools/get_about_memory.py
Flags: needinfo?(npark)
Attached file about-memory-0.zip (deleted) β€”
This is the memory report when e-mail app was not responding during UI actions
Flags: needinfo?(npark)
Attached file about-memory-1.zip (deleted) β€”
This is the memory report right after the app crashed
I'm not sure why LMK would kick in here, the memory usage in about-memory-0 shows plenty of free memory:

404.56 MB (100.0%) -- mem
β”œβ”€β”€230.73 MB (57.03%) ── free
β”œβ”€β”€100.21 MB (24.77%) ── other
└───73.62 MB (18.20%) -- processes
    β”œβ”€β”€40.20 MB (09.94%) ++ process(/system/b2g/b2g, pid=309)
    β”œβ”€β”€24.95 MB (06.17%) ++ process(/system/b2g/plugin-container, pid=1124) <-- email
    β”œβ”€β”€β”€6.58 MB (01.63%) ++ process(/system/b2g/plugin-container, pid=1387)
    └───1.89 MB (00.47%) ++ (27 tiny)

In about-memory-1 it looks like the 'Vertical' process was started, did you load this or have any idea what it is?
'Vertical' is probably the vertical homescreen which is what would be the active app after the email app crashes.
Whiteboard: [MemShrink] → [MemShrink:P2]
On Buri with APZ turned on, the regression window is the same as https://bugzilla.mozilla.org/show_bug.cgi?id=996991#c12
QA Whiteboard: [QAnalyst-Triage?]
Thinker, do you or someone on your team know if zram is somehow factored into LMK? We noticed that we have ~230MB listed as free in the system memory report, but also saw that the zram (uncompressed) disk size is set to 192MB and were wondering if this might be causing LMK to trigger too soon.

From the memory report:

404.56 MB (100.0%) -- mem
β”œβ”€β”€230.73 MB (57.03%) ── free

238,751 (100.0%) -- zram-accesses
└──238,751 (100.0%) -- zram0
   β”œβ”€β”€144,149 (60.38%) ── writes
   └───94,602 (39.62%) ── reads

9.60 MB (100.0%) -- zram-compr-data-size
└──9.60 MB (100.0%) ── zram0

192.00 MB (100.0%) -- zram-disksize
└──192.00 MB (100.0%) -- zram0
   β”œβ”€β”€154.73 MB (80.59%) ── unused
   └───37.27 MB (19.41%) ── used
Flags: needinfo?(tlee)
Severity: normal → blocker
Priority: -- → P1
Whiteboard: [MemShrink:P2] → [MemShrink:P2][c=memory p= s= u=2.0]
Eric,

Assigning this to you since you seem to be leading the investigation here and it's important to have owners on blockers. If someone else is more appropriate please reassign to them.

Thanks,
FxOS Per Triage
Assignee: nobody → erahm
Status: NEW → ASSIGNED
QA Whiteboard: [QAnalyst-Triage?] → [QAnalyst-Triage?][lead-review+]
(In reply to Eric Rahm [:erahm] from comment #21)
> Thinker, do you or someone on your team know if zram is somehow factored
> into LMK? We noticed that we have ~230MB listed as free in the system memory
> report, but also saw that the zram (uncompressed) disk size is set to 192MB
> and were wondering if this might be causing LMK to trigger too soon.

zram allocates and releases memory on-the-fly.  Even its maximum is 192MB, it doesn't allocate 192MB of memory at beginning.  According the log provided by you, it uses only 9.6MB of memory for compressed data that is 37MB in uncompressed.
Flags: needinfo?(tlee)
Also noticed now that after the app crash, the homescreen does not resume and shows blank background (buttons all work though).  raised Bug 1024536 for it.
(In reply to Thinker Li [:sinker] from comment #23)

> zram allocates and releases memory on-the-fly.  Even its maximum is 192MB,
> it doesn't allocate 192MB of memory at beginning.  According the log
> provided by you, it uses only 9.6MB of memory for compressed data that is
> 37MB in uncompressed.

I'm wondering if the _uncompressed_ "disk size" is used in LMK. If I look under /proc/swaps[1] I can see a size of 196604, is it possible that value is factored in, or does it only factor in the compressed amount?

[1] $ adb shell cat /proc/swaps
Filename				Type		Size	Used	Priority
/dev/block/zram0                partition	196604	0	-1
Flags: needinfo?(tlee)
Thinker, do you have an answer to question in comment 25?
(In reply to Milan Sreckovic [:milan] from comment #26)
> Thinker, do you have an answer to question in comment 25?

I don't have an answer since the partner has change a lot on their LMK.   In the standard implementation, it is not in factors.
Flags: needinfo?(tlee)
(In reply to Thinker Li [:sinker] from comment #27)
> (In reply to Milan Sreckovic [:milan] from comment #26)
> > Thinker, do you have an answer to question in comment 25?
> 
> I don't have an answer since the partner has change a lot on their LMK.   In
> the standard implementation, it is not in factors.

I don't quite understand, sorry. So, are we expecting to get more information from partners? Or, it actually doesn't matter? Is there something we can help? Thanks.
Flags: needinfo?(tlee)
(In reply to Kevin Hu [:khu] from comment #28)
> I don't quite understand, sorry. So, are we expecting to get more
> information from partners? Or, it actually doesn't matter? Is there
> something we can help? Thanks.

The question of Eric should be answered by the partner since they change LMK a lot.
Flags: needinfo?(tlee)
No-Jun, does this happen when Flame is configured at 256mb equivalent?
Flags: needinfo?(npark)
I wasn't able to boot with 256mb of memory on flame, so I set it to 300mb on flame, and was able to reproduce the issue on following 2.0 build:

Gaia      e935f4ff190b76c70d9b2af8856c542a6e4a7546
Gecko     https://hg.mozilla.org/releases/mozilla-aurora/rev/3f9d7a3a0b7b
BuildID   20140708000322
Version   32.0a2
ro.build.version.incremental=109
ro.build.date=Mon Jun 16 16:51:29 CST 2014

Originally this issue was found in 256mb Buri.  On latest flame build, it looks like email viewer performance on handling HTML emails have gotten a bit worse.  It is easy to cause the screen to freeze momentarily and/or disregard user input.
Flags: needinfo?(npark)
(In reply to Thinker Li [:sinker] from comment #29)
> The question of Eric should be answered by the partner since they change LMK
> a lot.

If the partner or the chip vendor does not change LMK, as the same as one of AOSP, disk size of zram is not accounted.
OK so we've ruled out LMK being confused (it reproduces on a buri and Thinker confirmed zram shouldn't be a factor on Flame).

At this point it sounds like we should look into some of the options kats suggested, primarily investigating the CSS transform further and coming up with low-memory heuristics:

From comment 11:
> I haven't yet looked into the impact of the CSS transform that's used to "zoom"
> the email in the email app, that might be relevant.
> 
> In the worst case we can do something like make the displayport heuristics
> more memory-dependent, or disable low-res painting to save some memory on
> devices where the available memory is small compared to the screen size.
Flags: needinfo?(bugmail.mozilla)
OK, sounds like this is a general "OOM" scenario.  2.0 just takes more memory than 1.4, but I don't see that being actionable here on the platform side - do we know that e-mail is not using more memory than before?  Based on comment 31, do we first need to fix the e-mail app?
blocking-b2g: 2.0+ → 2.0?
Component: Panning and Zooming → Gaia::E-Mail
Flags: needinfo?(bugmail.mozilla)
Product: Core → Firefox OS
=== What's Changed In Email

The only change to the email app's HTML display logic since v1.2 was for bug 1007560 (landed on v2.0 only) to have us inject "word-wrap: break-word" into the CSS of the HTML document.  This will tend to cause the size of the HTML document to end up conforming more to the screen's width than it did previously, although that's probably only really going to happen with ridiculous pathological test emails where someone wrote a reallllllly long word without any opportunities for breaking.

In v2.0 we landed our visual refresh; the most notable aspect of this as it relates to graphics was bug 940132 changing our folder picker from being that "tray" thing on the left side of the screen with the gutter.  Now it's basically integrated into the message list card and it comes down from the top.  There's a fair amount of clipping and such done there to allow for some animation tricks, so it may affect our layer count/etc.  However, the message_list card should be off-screen when the message_reader card is on the screen, so it should ideally not be affecting things.

The primary other memory usage change for us is our virtual scroll implementation *that is also on v1.4* done on bug 796474.  It causes us to cache all of the mail headers from the top of the mailbox through whatever is currently displayed.  However, headers have relatively small memory footprints (they're just the envelope data) and this memory increase was offset by massive improvements in our caching logic for message bodies.

One notable change, as I understand it, is that the Flame has more pixels than the other stuff we've been testing on.


=== Thoughts / Differences in Test Cases

My guess, especially looking at the video, but ignoring the fact that the amazing :kats has already looked into this, is that this is still a result of our CSS transform stuff.  For bug 984460 and bug 996991 which was duped to it, the graphics team really pulled a rabbit out of a hat by making things not crash horribly.

The test cases on bug 984460 and bug 996991 were extremely different than this test case.  The one from bug 984460 was just a really long string that resulted in a very wide document.  (The kind fixed by bug 1007560.)  The one on bug 996991 was a long HTML document primarily composed of text; there were some images but *they were never downloaded or displayed in the repro*.

In this bug, the test HTML document's visibility rectangle ends up including many sliced up images where the visible area ends up being 600 px wide * 2240 px high.  (It's 2027px to the bottom of the corn, then that chicken that's partially panned onto the screen is another 213px.)  Assuming 32bpp, that's 5.13 megs of images.  And then we go and zoom on some of them!

It'd be good for us to make sure that there isn't something wacky happening that :kats debugging didn't look into, like the zooming causing the image decoding libraries to decode the images at multiple different scale-factors as the zoom occurred.  But...

...while I'd love for the graphics team to pull some more rabbits out of their hats, I am very open to mitigations.  (It is my understanding that being able to use native APZ on our iframe is still a ways out.)  This HTML mail has explicit sizing information on both its img tags and its table tags, making it something we can analyze before letting the images be displayed.  We can then use this to trigger some type of protective behaviour like disabling pinch-to-zoom and only showing entirely zoomed out or entirely zoomed in to 100%, capping the scale ranges, quantizing the scale ranges (entirely zoomed out/100%), etc.  I may have proposed some other mitigations on bug 996991 too.
(In reply to Milan Sreckovic [:milan] from comment #34)
> OK, sounds like this is a general "OOM" scenario.  2.0 just takes more
> memory than 1.4, but I don't see that being actionable here on the platform
> side - do we know that e-mail is not using more memory than before?  Based
> on comment 31, do we first need to fix the e-mail app?

Not sure about the renom, since if we can reproduce a crash and email becomes unusable, so leaving the blocking.
blocking-b2g: 2.0? → 2.0+
Eric, are you still doing investigation on this one? Can you provide an update and/or unassign if this is no longer on your radar?
Flags: needinfo?(erahm)
Okay, so I ran on a just-made v2.0 build using the email in question (with external images told to display).  According to firewatch, when I zoom in, we experience short-lived transient memory spikes taking us from ~35 MB to peaks as high as 220 MB, although it's more common to see the peak jump to 120 MB.  These peaks last only for one sample, so it's tremendously surprising that about:memory did not manage to pick up what was going on.
(In reply to Andrew Sutherland [:asuth] from comment #38)
> These peaks last only for one sample, so it's tremendously
> surprising that about:memory did not manage to pick up what was going on.

I accidentally left out an important 'not' there.  These peaks are very short-lived, I expect firewatch's sampling isn't always capturing the total height of the peak.  about:memory likewise needs to be triggered at the exact right instant, which is obviously a hard thing to do, so not about:memory's fault.
I'll abdicate this bug, it sounds like someone from the e-mail team should be driving. One thing we might do is trigger a memory report if the memory usage goes over an arbitrary value (maybe 200MB in this case). 

I'm unfamiliar with firewatch, if it's getting the memory usage externally (via b2g-procrank or the like) we could just call out to |get_about_memory.py| when it hits the threshold. If it's internal it should be pretty straightforward to trigger a report as well.
Assignee: erahm → nobody
Status: ASSIGNED → NEW
(In reply to Andrew Sutherland [:asuth] from comment #38)
> Okay, so I ran on a just-made v2.0 build using the email in question (with
> external images told to display).  According to firewatch, when I zoom in,
> we experience short-lived transient memory spikes taking us from ~35 MB to
> peaks as high as 220 MB...

David, more image decoding related (as in, smarter image decoding would help us?)
Flags: needinfo?(dflanagan)
It's worth noting that these are our memory states, roughly:
-  18 MB: email app freshly opened, showing the message in the message list (non-inbox folder)
-  20 MB: email app displaying the messages without remote images fetched/enabled
-  29 MB: "show external images" hit, no scrolling performed or anything.  we continue to be at default zoom.
-  29 MB: I scroll all the way down through the message; all the images are already there.
- 165 MB+: transient from when I trigger a pinch.  This results in us tapping iframe.style.transform 6 times.  (I added some extra logging to get this.)  Note that the highest peak was for a single 500ms sample point, so it's not the result of all of these, probably just the initial burst.
07-17 16:37:32.811 I/GeckoDump( 8100): LOG: scale change from 0.44991789819376027 to 0.45804008714741107
07-17 16:37:32.821 I/GeckoDump( 8100): LOG: scale change from 0.45804008714741107 to 0.5677727475718227
07-17 16:37:33.231 I/GeckoDump( 8100): LOG: scale change from 0.5677727475718227 to 1.3305062696204473
07-17 16:37:34.731 I/GeckoDump( 8100): LOG: scale change from 1.3305062696204473 to 1.329306091511926
07-17 16:37:35.461 I/GeckoDump( 8100): LOG: scale change from 1.329306091511926 to 1.4155637249570396
07-17 16:37:35.471 I/GeckoDump( 8100): LOG: scale change from 1.4155637249570396 to 1.4429900807087592
-  29 MB: Spike over, we return to normalcy
-  20 MB: The screen turns off because I am typing this bug.  lock-screen is not enabled because adb.
-  68 MB: Spike from turning the screen back on.
-  29 MB: Return to normalcy
-  72 MB: Spike from the screen turning off (?!)
-  20 MB: Return to low memory after the spike; screen is off.

I'll try using #-moz-samplesize=8 from bug 854795 to see what that does.  There are obviously things we could do about tapping iframe.style.transform a lot less.
Flags: needinfo?(erahm)
Andrew: let me know if I can help with any #moz-samplesize stuff.  In particular, consider using the shared/js/media/downsample.js utility to wrap it.  If you use #-moz-samplesize=8 you'll get an image that has only 1/64th of the pixels as the original.

I don't understand this bug, though...  If you've already decoded the image, then the memory damage is already done. I don't see how a pinch to zoom in would cause any more images to be decoded.
Flags: needinfo?(dflanagan)
Target Milestone: --- → 2.1 S1 (1aug)
Assignee: nobody → bugmail
As :djf surmised image decoding seems to have little to do with the problem.  I instrumented image/src/DiscardTracker.cpp and even without downsampling the image buffer swing only ever gets as high as 7,952,208 bytes.

After it turned out --enable-trace-malloc doesn't seem to work for flame on v2.0, I tried instrumenting some other graphics memory implementation allocation points by finding where the memory reporter entry points are, but other than suspecting that the gralloc memory reporter has a bug in it (SharedBufferManagerParent::RecvDropGrallocBuffer never seems to be called?) I haven't found smoking bugs quite yet.  (ISurfaceAllocator looked promising but didn't seem to fire the memory-reporter code-path.)

The most useful thing I have found is that if I turn on paint flashing, I find that in general everything I see has the cool layer tiling boundaries.  This includes the HTML mail.  However, when I pinch/zoom, the entire HTML mail is painted in its entirety with a single overlay color.  Well, multiple different colors as we keep changing the scale() factor.  It is subsequently fixed up to be tile-paint-flashed.

:kats, do you have an idea where I could instrument to catch a smoking gun allocation point if the problem is that the scale() transform is resulting in the entirety of the HTML mail iframe being rendered at 1:1 and then scaled down after the fact?  (I was trying to find where the paint flashing was coming from but screwed up and keyed off the widget flashing pref instead; I have since found where the actual paint flashing comes from but it's a work week and we're off to do team-buildy things.)
Flags: needinfo?(bugmail.mozilla)
(In reply to Andrew Sutherland [:asuth] from comment #44)
> The most useful thing I have found is that if I turn on paint flashing, I
> find that in general everything I see has the cool layer tiling boundaries. 
> This includes the HTML mail.  However, when I pinch/zoom, the entire HTML
> mail is painted in its entirety with a single overlay color.  Well, multiple
> different colors as we keep changing the scale() factor.  It is subsequently
> fixed up to be tile-paint-flashed.

If I understand your description correctly I think this is expected behaviour. When you change the scale whatever is visible should get repainted, so you'll see a single overlay "paint flash" color. And then when you scroll around it goes back to painting in tiles.

> :kats, do you have an idea where I could instrument to catch a smoking gun
> allocation point if the problem is that the scale() transform is resulting
> in the entirety of the HTML mail iframe being rendered at 1:1 and then
> scaled down after the fact?

To see what resolution the thing is getting painted at I usually check the value at [1] but I don't know if that will trigger in this situation. I also don't know if that will answer the question you have as to whether things are getting drawn at the wrong scale. Somebody from layout would be better able to answer that question.

[1] http://mxr.mozilla.org/mozilla-central/source/layout/base/nsDisplayList.cpp?rev=bd9304ad42cb#623
Flags: needinfo?(bugmail.mozilla)
hi guys, any movement on this?
Attached file large shmem allocations (deleted) β€”
More instrumentation shows that if I am good at pinching/zooming I can trigger some huge shmem allocations.  The shmem allocations aren't reported via GfxMemoryImageReporter::DidAlloc (I guess it's a system memory reporter?).

Here's an edited excerpt from the log during which I think I got our memory usage up to 270 megs.  You will note that we allocate 2 ~82MiB buffers, each for an 1805x11845 image.  The first one lasts ~5 seconds, the second one lasts ~3 seconds and they're deallocated at the same time.  Note that I was pinching and zooming quite a lot at that point so the system was pretty backed up.

07-29 17:14:24.753 I/Gecko   (25151): ShmemTextureClient::Allocate 85520928 0xb14b7e80
07-29 17:14:24.753 I/Gecko   (25151): BufferTextureClient::AllocateForSurface width: 1805 height: 11845
07-29 17:14:26.133 I/Gecko   (25151): ShmemTextureClient::~ShmemTextureClient no dealloc 0xb14b7160
07-29 17:14:26.213 I/Gecko   (25151): ShmemTextureClient::~ShmemTextureClient no dealloc 0xb1415dc0
07-29 17:14:26.263 I/Gecko   (25151): ShmemTextureClient::Allocate 85520928 0xb14b7f40
07-29 17:14:26.263 I/Gecko   (25151): BufferTextureClient::AllocateForSurface width: 1805 height: 11845
07-29 17:14:29.583 I/Gecko   (25151): ShmemTextureClient::~ShmemTextureClient no dealloc 0xb14b7e80
07-29 17:14:29.583 I/Gecko   (25151): ShmemTextureClient::~ShmemTextureClient no dealloc 0xb14b7f40
Attached patch gecko-gfx-instrum-v1.diff (deleted) β€” β€” Splinter Review
This is the instrumentation I am running with.  It includes the r+'ed fix on bug 989403.  (In simple firewatch testing the fix made no difference.)  I am running with the patch applied to v2.0 branch gecko hash
902adb7850609db2fb371a2c2c5375e36668b457

The email app instrumentation is just:
diff --git a/apps/email/js/iframe_shims.js b/apps/email/js/iframe_shims.js
index d6363d9..35d1b46 100644
--- a/apps/email/js/iframe_shims.js
+++ b/apps/email/js/iframe_shims.js
@@ -257,6 +257,7 @@ function createAndInsertIframeForContent(htmlStr, scrollContainer,
       iframe.style.height = '';
       scrollHeight = iframeBody.scrollHeight;
     }
+    console.log('iframe.style.transform = scale(' + scale + ')');
     if (scale !== 1)
       iframe.style.transform = 'scale(' + scale + ')';
     else

Is this something that should be moved back to graphics, or does this seem like a layout issue?
Flags: needinfo?(milan)
Hmm, Shmem, rather than Gralloc TextureClient, I would at least like to understand where we're using those 1805x11485 images.  Jeff, any thoughts?
Flags: needinfo?(milan) → needinfo?(jmuizelaar)
Layout developers, any idea why messing with "transform: scale(blah)" (where blah varied over the range [0.45, 1.75] in testing) on an iframe whose contents are 609x4278 could result in layout/graphics deciding that a buffer of size 1805 x 11845 is needed?  (roughly 2.96 times as wide, 2.77 times as high.)  The hard limits the email app implements for its scale() are max of 2x, and a min of whatever it takes to scale the iframe so its horizontal contents are entirely visible on the screen (given 1.5rem left/right margins) which ends up being 290px on the v2.0 flame.

Perhaps more significantly, is there anything we can do to get the relevant subsystems clipping somewhat instead of doing all that rendering?  Is it possible the iframe is not aware of the big 'ole displayport/viewport of its owner?
Component: Gaia::E-Mail → Layout
Product: Firefox OS → Core
Jet, is there someone on the layout team who can help triage/investigate the layout aspects of this?

I'm also going to clear ownership of the bug for now since it's possible the bug being assigned is impacting triage/dashboards in an unhelpful way.  NI'ing myself to make super sure this doesn't fall off my radar, however.
Assignee: bugmail → nobody
Flags: needinfo?(bugs)
Flags: needinfo?(bugmail)
I can take a look at this.

Firstly, adding an 'active' transform around the content is probably taking us out of tiling mode and we're just using a single buffer for the region (tiling is only used for layers we've detected as being scrollable).

We're also likely hitting our 'prerendering' optimization [1] where we render areas of transformed content that aren't currently visible under the assumption that they will become visible soon, which lets us do animations without painting (and ideally asynchronously). This heuristic is probably pretty useless for the zooming in use case here, but it should help for zooming out.

Lastly we also try to render transformed content at the resolution it will appear on screen (rounded up to the nearest power of 2 for actively changing transforms) to keep the quality high. We should be restricting this to within the max texture size though [2], which I'd expect to be much lower than 11845.

Could someone please check the code at [2] and see if they can figure out how we're not restricting the resolution to something smaller? If the max texture size really is that big, then maybe we also need to restrict to 2x the screen size or something.

It would also be nice to improve our heuristics about when to prerender and when to draw at device resolution so that they work better for this use-case. Detecting the type of transform changes is somewhat hard when it's just being updated by JS code (rather than a declarative animation).


[1] http://mxr.mozilla.org/mozilla-central/source/layout/base/nsDisplayList.cpp#4715
[2] http://mxr.mozilla.org/mozilla-central/source/layout/base/FrameLayerBuilder.cpp#3664
Flags: needinfo?(bugs)
I'm beginning to wonder if having pre-rendering for transforms that aren't animations is actually useful. It's certainly costing us a lot here, and I can't think of any b2g use cases that would be negatively affected.

I can make a patch for that fairly easy, but I'm worried about performance regressions that won't get picked up by automation.
For non-trunk where we probably want to be more conservative for regression reasons, could the email app do something like add "will-change: content" to defeat the mechanism that's being too aggressive?

The email app is additionally a certified app and could also avail itself of mechanisms that we don't want escaping into the web, but it would be nice to avoid those.  The whiteout.io people are porting/adapting their HTML5-based mail app to Firefox OS and I'd like to avoid doing something they can't imitate.
Target Milestone: 2.1 S1 (1aug) → 2.1 S2 (15aug)
Attached patch Disable prerendering (deleted) β€” β€” Splinter Review
Does this fix the issue?
Attachment #8468226 - Flags: feedback?(bugmail)
Comment on attachment 8468226 [details] [diff] [review]
Disable prerendering

I manually applied your patch to my v2.0 build because patch -p1 was unhappy for some reason that was not immediately clear to me.

Note that it appears that on my device many of the external images no longer seem to want to load (I guess the mailer-people are cheap and don't want to leave out-of-date pictures around), that doesn't seem to matter for memory use.  (Which makes sense since we already ruled out the images themselves as the problem.)

In my testing, I was still able to get firewatch to see us jumping up by ~210 megs to ~240 megs.  Here's a log excerpt thing:
08-06 01:42:17.676 I/Gecko   ( 1636): ShmemTextureClient::Allocate 75085760 0xb33ea940
08-06 01:42:17.676 I/Gecko   ( 1636): BufferTextureClient::AllocateForSurface width: 1802 height: 10417

While I'm about done for today (and lazy ;), if you'd like, tomorrow I can try and create a reduced/quasi-automated test-case web-page or app where clicking a button makes us tap transform(scale) a bunch on a big iframe which hopefully replicates the problem.
Attachment #8468226 - Flags: feedback?(bugmail) → feedback-
Flags: needinfo?(bugmail)
Triage: From comment 52, assigning to mattwoodrow. Please unassign or find a new owner if this is incorrect. Thanks!
Assignee: nobody → matt.woodrow
Status: NEW → ASSIGNED
Are you sure the patch applied and was rebuilt correctly? I can't think of anything else that could let us get such large layers.
Yes, the build passed, but to be double-sure I added a logcat statement inside the modified if and it was definitely triggering.

I suspect that this call-stack is too late, but I added a LOG statement to BufferTextureClient::AllocateForSurface that triggers when aSize.height > 6000 and then set a breakpoint on it.

Here's the concise bt followed by a "bt full" in the case that helps.  (The stlport rb-tree won't, but the other stuff might :)
It's probably worth confirming that we don't ever take the other branch in that function (the one that returns true).

Are you sure the element doesn't have animation or transition properties set?
So it seems like bt full might have been boring, Here's 3 layers of p *child for the ClientContainerLayers at different layers from a different breakpoint triggering.  For the skimmers, here's the mVisibleRegions as we go 'down'
- frame 12: 0,0,480,818
- frame 11: -315,-2141,746,2979
- frame 10: 0,0,1827,12829
(In reply to Matt Woodrow (:mattwoodrow) from comment #60)
> It's probably worth confirming that we don't ever take the other branch in
> that function (the one that returns true).

I added some logging code that will definitely be logged if we don't take the "return false" path out.  I have never seen it print anything inside the email app.

> Are you sure the element doesn't have animation or transition properties set?

This is what the WebIDE says about our iframe and its enclosing div via "Copy outer HTML":

<div style="padding: 0px; border-width: 0px; margin: 0px; position: relative; overflow: hidden; width: 836.227px; height: 5868.22px;">
  <iframe style="padding: 0px; border-width: 0px; margin: 0px; transform-origin: left top 0px; pointer-events: none; width: 609px; transform: scale(1.37311); height: 5868.22px;" sandbox="allow-same-origin">
  </iframe>
</div>

Per the rules display, they do not have additional classes applied to them beyond the (inherited from div):
  white-space: pre-wrap;
  font-size: 1.6rem;
  line-height: 2rem;
  word-wrap: break-word;


They *do* live inside the email app's animated card stuff.  None of that should be active, but for example, a containing div has:
element {
    z-index: 10;
}
.card.center {
    transform: translateX(0px);
}
.card {
    position: absolute;
    top: 0px;
    width: 100%;
    height: 100%;
    background-color: #FFF;
    transition: transform 0.3s ease 0s;
}


However, that's the extent of containing things with animationy stuff on them.
Ok, thanks for that.

The main thing that sticks out for now is that the intermediate ContainerLayer has mFrameMetrics.mDisplayPort.height = 3413.33325. That's quite huge to begin with, is the mail app setting that explicitly?

I suspect something to do with the combination of the display-port and transform is causing things to get further out of control, need to go through all the values more carefully.
That display port is in CSS pixels, and with a device-pixels-per-css-pixel ratio of 1.5, we end up with 5120. That matches the height of the middle ContainerLayer (y2-y1).

For the child drawn at 1.16 scale, I'd expect 4414 pixels of it to be visible (5120/1.16). We also round the scale factors up the nearest power of 2 in preparation for the scale factor to change in the future. That would give us a 8818 pixel high buffer.

The actual buffer is almost 50% bigger again though, so it looks like we're computing something wrong.

I wonder if this is a regression from bug 1022612.
I'm not sure how we'd set the displayport directly.

There are 2 phases to our HTML mail display:

1) We perform a layout unconstrained by the screen size to see how wide the content wants to be on its own.

We do this by having our DOM tree look like:
  - ("viewport") div: overflow: hidden; position: relative; width: 100%;
    - iframe: position: absolute; border-width: 0; overflow: hidden;

We then check if the scrollWidth is wider than the 'viewport' width (our parent's offsetWidth less scrollbar padding).  If it is, we decide we want to be in "newsletter" mode which is where we set the default zoom so you can see everything and once and enable pinch/zoom.


2) We set things to be the way they are in comment 62.  Our goal is to have the iframe behave in a "seamless" fashion so that it is fully displayed at all times and scrolls "with" the page rather than scrolling independently, etc.  The "viewport" div clips the iframe down to its effective post-transform size.  (Since transform is paint-only we need the iframe to think it is its full scroll size DOM-wise to get fully painted.  But then we don't want all the wasted white-space if it's scaled down/etc.)


The code for all that lives at:
https://github.com/mozilla-b2g/gaia/blob/master/apps/email/js/iframe_shims.js

Note that it's an evolved hodge-podge of hacks/etc. and all horribleness is disclaimed.
Note that the attachments in comment 59 and comment 62 are from different runs and likely have different buffer sizes.  I think in comment 62 I had fully zoomed all the way in so things should have been huge.
(In reply to Matt Woodrow (:mattwoodrow) from comment #64)
> I wonder if this is a regression from bug 1022612.

To be clear, I'm using a v2.0 FxOS build because that's what the regression is filed against.  Per my understanding from https://wiki.mozilla.org/Release_Management/B2G_Landing, that's Gecko 32, and bug 1022612 was never present on Gecko 32.  It looks like it landed when Gecko 33 was trunk but then was backed out of Gecko 33 and now only lives on Gecko 34/trunk.
Kats, how do we compute the display port size for apps? 3.4k CSS pixels sounds really excessive to me.
I can do a full derivation of the 3.4k value if you like, but the 3.4k CSS pixels is the height of the low-res displayport, so the buffer allocation for the layer should only be a quarter of that (layers.low-precision-resolution is 0.25). If we're painting 3.4k pixels at full-resolution then that's bad. The critical displayport is 853.33 CSS pixels tall and that's the amount that should be painted at full resolution.

I notice from the stack in comment 59 there's no ClientTiledThebesLayer anywhere in the stack. I would have expected the ClientThebesLayer in frame 7 and frame 6 to be a ClientTiledThebesLayer. I don't know if any code outside of ClientTiledThebesLayer/TiledContentClient knows how to deal with the low-res displayport properly.

For some reason ClientLayerManager::CreateThebesLayerWithHint must be getting a non-SCROLLABLE hint, so it doesn't create a tiled layer.
Flags: needinfo?(jmuizelaar)
Ah, that makes a lot of sense. There's an active transform layer between the scrollable ContainerLayer and the content, so we must be losing the SCROLLABLE hint in the process. But we're keep the low-res display port, and drawing it at full-res.
Looks like we decide the SCROLLABLE hint based on whether the animated geometry root for the content is a scroll frame. The animated geometry root for transformed content will be the transform frame, which is why we're not getting it.
Attached patch low-res-display-port (deleted) β€” β€” Splinter Review
This is rapidly leaving code that I understand, but I think this should help. This makes sure that we always get tiled layers when we have a low-res display port, and those know how to only draw at low-resolution.
Attachment #8469581 - Flags: feedback?(bugmail)
Comment on attachment 8469581 [details] [diff] [review]
low-res-display-port

The patch seems to shift the problem.  Specifically, with this patch (and regardless of whether the "Disable prerendering" patch attachment 8468226 [details] [diff] [review] is applied or the fix from bug 989403 that does not seem to be on the v2.0 branch yet):
- Memory now only spikes from our ~35 baseline to ~115.  We no longer are getting up into the mid 200's.
- A different buffer allocation path seems to be happening now.  I am no longer seeing ShmemTextureClient::Allocate or BufferTextureClient::AllocateForSurface.  I think this may explain why our peaks are lower since it seemed like it was possible for us to have 2 of those Shmem allocations live at the same time.
Attachment #8469581 - Flags: feedback?(bugmail)
(In reply to Andrew Sutherland [:asuth] from comment #73)
> - Memory now only spikes from our ~35 baseline to ~115.  We no longer are
> getting up into the mid 200's.

That sounds like a good win to me!

> - A different buffer allocation path seems to be happening now.  I am no
> longer seeing ShmemTextureClient::Allocate or
> BufferTextureClient::AllocateForSurface.  I think this may explain why our
> peaks are lower since it seemed like it was possible for us to have 2 of
> those Shmem allocations live at the same time.

I'm not sure but we might also be taking different buffer allocation paths now because we are using a tiled layer than one giant flat layer.
Attachment #8469581 - Flags: review?(roc)
Attachment #8469581 - Flags: review?(roc) → review?(tnikkel)
Sounds like a pretty huge win.

I'd expect us to be seeing allocations through GrallocTextureClientOGL::AllocateForSurface now, and it to be a lot of small allocations (for each tile) instead of one huge one.

It's possible that low-res still isn't working, dumping the content of mPaintData after ClientTiledThebesLayer::BeginPaint might show what's happening there.
Attachment #8469581 - Flags: review?(tnikkel) → review+
Attached image graphical artifacts example 1 (deleted) β€”
(In reply to Matt Woodrow (:mattwoodrow) from comment #75)
> Sounds like a pretty huge win.

Oh, yes, definitely a win!  Both on the using less memory front and the progress towards absolute fix front.  (Since it sounds like this fix now puts things on the right code-path so there's some other small (diff-wise, not impressiveness-of-fix-wise! :))

> I'd expect us to be seeing allocations through
> GrallocTextureClientOGL::AllocateForSurface now, and it to be a lot of small
> allocations (for each tile) instead of one huge one.

Yes, seeings tons of
  GrallocTextureClientOGL::AllocateForSurface width: 256 height: 256
followed by round-off ones like
  GrallocTextureClientOGL::AllocateForSurface width: 198 height: 32

Count-wise, many times I'm seeing on the order of ~157 of these between the LOG() calls that I had added to RecordFrameMetrics.  I'm not sure if we expect that to be sufficient demarcation or what, but 256 * 256 * 4 * 157 is roughly 39.25 megs.


There is also an interesting new wrinkle. *sometimes* there is *no memory spike problem* whatsoever.  In these cases, there are artifacts.  I've seen:
- some of the tiles just don't paint.  I end up with white squares on the screen.  Like it seems like the screen is made up of 4 tiles, and I only see the ones of the left-hand side, but not the ones on the right.
- partially stale display, semi-persistent.  In this case, the tiles on the left seem to be full-resolution tiles and the tiles on the right seem like they were from a different scale-factor.  I've attached a screenshot of an instance of this happening.

Key things to note:
- When we aren't experiencing memory spikes, it seems a bit like graphics memory is actually being leaked.  Like the firewatch memory usage keeps going up and up as I pinch/zoom.  Memory is reclaimed when I turn off the display.
- It seems like we may be able to switch from "no spikes but maybe leaks" to "spike time" by both turning the display off then on again or also killing the process and re-creating it.  I've definitely seen it after killing the email app, and believe the other thing has happened too.

I'll see if I can get an about:memory dump from the leaky situation.  Also, one more graphical corruption attachment coming.
Attached image other graphical artifacts, same run (deleted) β€”
And by semi-persistent I mean that if I scroll around, the tiles aren't re-rendered or anything.  They stay like that.  Thus far I've only done vertical scrolling, I haven't gone off to see if there are corrupted things where I can't see them.
Attachment #8470614 - Attachment description: 2014-08-10-23-50-45.png → graphical artifacts example 1
Those definitely look like bugs with our low-resolution display port rendering.

I don't have a flame, and I think I've about reached the limit of what I can diagnose from reading code.

Kats, are you the right person to look at low-res bugs on device? If not, could you please transfer the ni? appropriately.
Assignee: matt.woodrow → nobody
Flags: needinfo?(bugmail.mozilla)
I'm having trouble getting the non-spikey-memory thing to happen again.  Given that the corruption may also involve things that are fixed on trunk/FxOSv2.1 but not on gecko32/FxOSv2.0, I'll switch to a v2.1 build with the patch and see how that fares.
Yeah i can look at the low-res drawing issues. However it would probably be a good idea to land this patch and close this bug to fix the OOM issue and file a new bug to track the rendering glitches.
Flags: needinfo?(bugmail.mozilla)
Agreed on landing and doing a follow-up.  Notes from running with the patch on v2.1:

* Spikey transient memory allocations seem to now be limited to a swing of ~30 megs.  Like we only go from ~35 to ~65.  Which is pretty awesome.  It would be nice if we could uplift whatever improved that too.

* I am still seeing that some reboots I get the "spikey scenario" and others the "non-spikey graphical corruptiony scenario."  In 4 runs just now (1 after ./flash.sh gecko, then 3 reboots), I got 2 non-spikey and 2 spikey.  Specifically [(flash) non-spikey, (reboots:) spikey, spikey, non-spikey]

* In non-spikey I don't see any indications of any type of actual memory leak.  All process growth seems attributable to limited jemalloc fragmentation / heap overhead / accumulated JIT byproducts / strings not yet garbage collected, etc.

* In non-spikey, if I run an about:memory with the screen off after doing pinch/zoom, then run it again after turning the screen back on, the about:memory *diff* has the follow interesting stuff in it.  And by interesting I mean largely boring.
** E-mail proc:
8.13 MB (100.0%) -- explicit
β”œβ”€β”€7.65 MB (94.09%) -- images/content/raster/used

** Main proc:
29.01 MB (100.0%) -- gralloc
β”œβ”€β”€25.72 MB (88.65%) -- pid(1135) (email!)
β”‚  β”œβ”€β”€22.50 MB (77.57%) ── buffer(width=256, height=256, bpp=4, stride=288) [80] [+]
β”‚  β”œβ”€β”€β”€2.96 MB (10.21%) ── buffer(width=480, height=809, bpp=4, stride=480) [2] [+]
β”‚  └───0.25 MB (00.87%) ── buffer(width=480, height=69, bpp=4, stride=480) [2] [+]
└───3.29 MB (11.35%) -- pid(298) (Main)
    β”œβ”€β”€3.13 MB (10.78%) ── buffer(width=480, height=854, bpp=4, stride=480) [4]
    └──0.16 MB (00.57%) ── buffer(width=480, height=45, bpp=4, stride=480) [2] [+]


So yeah, it'd be great if this can get landed and I agree the graphical corruption issues seem like they want to be in a separate follow-up.  And the patch definitely was an improvement on v2.0, but we might want to understand what the graphical corruption issues are before uplifting.

Thanks very much graphics and layout peoples!
Great, that sounds promising.

https://tbpl.mozilla.org/?tree=Try&rev=62844d84d463
Assignee: nobody → matt.woodrow
https://hg.mozilla.org/integration/mozilla-inbound/rev/046dc36c93b1
https://hg.mozilla.org/mozilla-central/rev/046dc36c93b1
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
https://hg.mozilla.org/releases/mozilla-b2g32_v2_0/rev/c643486eee65
Just checked out the patch in 2.0.  the image loads a lot faster now, and looks good!
Blocks: 1058831
Status: RESOLVED → VERIFIED
Depends on: 1058900
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: