Closed Bug 963917 Opened 11 years ago Closed 11 years ago

Gallery app high memory usage during 1st launch

Categories

(Firefox OS Graveyard :: Gaia::Gallery, defect, P1)

ARM
Gonk (Firefox OS)
defect

Tracking

(blocking-b2g:1.3+, b2g-v1.3 fixed, b2g-v1.3T fixed, b2g-v1.4 fixed)

RESOLVED FIXED
1.4 S1 (14feb)

People

(Reporter: tkundu, Assigned: djf)


Details

(Keywords: memory-footprint, perf, Whiteboard: [caf priority: p2][c=memory p= s= u=1.3] [CR 611568][ETA: n/a])

Attachments

(11 files, 7 obsolete files)

Gallery app takes a varying amount of memory during 1st launch

Steps to reproduce:
1) Flash boot, system and userdata images to a 256MB configuration device. Please make sure that you flash the userdata partition; this forces the gallery app to scan and load all images from the sdcard. You will not see the varying memory usage without this step.

2) Put 1000 images and 100 videos on the sdcard.
3) Put the sdcard in the device and start the gallery app.
4) Run |reset && while true; do adb shell b2g-procrank ; done| in a bash shell.
5) You will see the Gallery app's memory usage vary a lot while it takes > 1 min to scan all images the 1st time.
6) Keep scrolling the Gallery app while it is scanning images.

You can see the Gallery app crash or get killed at some point.


This varying memory usage happens only during the 1st launch of the gallery app, i.e. only when it scans the sdcard for new images.

Please note that the Gallery app's memory usage is lower in the above scenario if you use a smaller number of pics/videos.
blocking-b2g: --- → 1.3?
Please note that the above test case was performed with APZ enabled for all apps.
Duplicate of bug 854783?  Also, I was assuming that all of gallery is already disabling SkiaGL (see bug 939962), but it may be worth setting:
pref("gfx.canvas.azure.accelerated", false);
just to see if it makes a difference in this crash.

I'm also assuming these are 1000 random images, not actually taken by the phone?  I would expect those to take more memory in gallery, as we may not have the thumbnails embedded.  Is that the case?
ni Tapas for answering Milan's questions
Flags: needinfo?(tkundu)
Attached image IMG_0002.jpg (deleted) —
(In reply to Milan Sreckovic [:milan] from comment #2)
> Duplicate of bug 854783?  Also, I was assuming that all of gallery is
> already disabling SkiaGL (see bug 939962), but it may be worth setting:
> pref("gfx.canvas.azure.accelerated", false);
> just to see if it makes a difference in this crash.

I will try and let you know in a few hours.

> I'm also assuming these are random 1000 images, not actually taken by the
> phone?  I would expect those to take more memory in gallery, as we may not
> have the thumbnails embedded.  Is that the case?

I used |cd gaia && make reference-workload-heavy| to produce content on the device. This means that I have >1000 copies of this image and >100 videos generated by this command. I attached that image here.
(In reply to Milan Sreckovic [:milan] from comment #2)
> Duplicate of bug 854783?  Also, I was assuming that all of gallery is
> already disabling SkiaGL (see bug 939962), but it may be worth setting:
> pref("gfx.canvas.azure.accelerated", false);
> just to see if it makes a difference in this crash.
> 
Setting this pref prevents the crash. But my guess is that performance will be slower without acceleration. Can we get a better workaround for this?
Flags: needinfo?(tkundu)
If we set pref("gfx.canvas.azure.accelerated", false); and put 2500 pictures & 100 videos on the card, then the crash happens again. I also observed that gallery is using >80MB of USS during the 1st launch (i.e. when it scans all images on an sdcard holding >2500 images).

Shouldn't we minimize the memory usage of the gallery app during first launch? Memory usage should not increase with the number of images or videos, right? Please advise.
Flags: needinfo?(milan)
This suggests that bug 939962 is only covering the edit image aspect of the gallery, not the first display as well; SkiaGL is meant to be disabled internally already, and setting the other preference shouldn't really make a difference (unless there is some other canvas used somewhere, or we're just running into the SkiaGL caching.)  David, would you expect this as well?  Is there another place where gallery uses canvas that should be forced out of SkiaGL?

To answer the "shouldn't we minimize memory usage..." - sure, but we need bug 854783 for that, and that's a long term project and not appropriate for 1.3.

Now, other than the graphics memory usage for image decoding and display, is there other memory overuse that is happening here?
Depends on: 854795
Flags: needinfo?(milan) → needinfo?(dflanagan)
Keywords: perf
Whiteboard: [c=memory p= s= u=]
Keywords: footprint
Inder,

Please help us understand what the upper limit on the number of pictures is.

This is not a regression, so we need to understand what the spec requires.
Flags: needinfo?(ikumar)
(In reply to Preeti Raghunath(:Preeti) from comment #8)
>
> This is not a regression issue so need understanding on spec.

Gallery app should not let itself get killed by the kernel regardless of the SD card contents.
But that aside, 1000 images and 100 videos on sdcard seems like a fairly normal use case.
(In reply to Michael Vines [:m1] [:evilmachines] from comment #9)
> (In reply to Preeti Raghunath(:Preeti) from comment #8)
> >
> > This is not a regression issue so need understanding on spec.
> 
> Gallery app should not let itself get killed by the kernel regardless of the
> SD card contents.
> But that aside, 1000 images and 100 videos on sdcard seems like a fairly
> normal use case.

I don't think so. 100 videos in the gallery app can only happen if a user decides to record 100 videos and then launches the gallery app for the first time. That's not going to happen in any end-user use case.
Is this a regression?  If not, it's way too late for 1.3.
(In reply to Jason Smith [:jsmith] from comment #10)
> I don't think so. 100 videos in the gallery app can only happen if a user
> decides to record 100 videos & then launch the gallery app for a first
> launch. That's absolutely not possible to happen in any end-user use case.

SD cards are removable.
:tk, can you please try your same SD card on a QRD7x27a loaded with the latest v1.3 and v1.2 and report the results here.
Flags: needinfo?(tkundu)
Please make sure that the images in your reference workload all contain EXIF preview images that are at least as large as the screen (measured in device pixels).  That is the fast case, and we try to enforce it for the camera.  

Until we have a fix for bug 854795, scanning images without big EXIF previews is just not going to work well, no matter what.
Flags: needinfo?(dflanagan)
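For clarity, the fast-path decision described above amounts to something like this (an illustrative sketch only, not the actual gallery code):

  // Use the embedded EXIF preview only if it covers the screen in device
  // pixels; otherwise Gallery must decode and downscale the full image.
  function canUseExifPreview(preview, screen) {
    var dpr = window.devicePixelRatio;
    return preview &&
           preview.width >= screen.width * dpr &&
           preview.height >= screen.height * dpr;
  }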
After reading the comments, it seems like this problem only arises when there is a large number of photos with large EXIF preview images.  If that's true, then my question is, how large are the EXIF previews created by the phone when taking photos?  If the phone doesn't create images that cause this bug, I don't think this bug should block 1.3.

Am I missing something?
Flags: needinfo?(mvines)
Potentially.  Let's see what we get from comment 13.  Also :tk, please clarify the origin of the images on the SD card.
Flags: needinfo?(mvines)
(In reply to Dave Huseby [:huseby] from comment #15)
> After reading the comments, it seems like this problem only arises when
> there is a large number of photos with large EXIF preview images.  If that's
> true, then my question is, how large are the EXIF previews created by the
> phone when taking photos?  If the phone doesn't create images that cause
> this bug, I don't think this bug should block 1.3.
> 
> Am I missing something?

It's more about large images without EXIF previews: we have to decode the full image, then scale it down to display it.
(In reply to Tapas Kumar Kundu from comment #0)
> Please note that Gallery app memory usage gets reduced in above scenerio if
> you use less number of pics/videos .

Is this explainable by the gallery app decoding images to generate a preview?  I would expect memory usage during import to be somewhat independent of total number of images.  I assume we are not trying to decode them all at once.
Michael provided the info. Clearing ni.
Flags: needinfo?(ikumar)
This contains b2g-procrank usage logs for the Gallery app while it scans all images on the sdcard. Please note that the sdcard has 2000 images and 100 videos. All images are simply copies of the attached image [1]
https://bugzilla.mozilla.org/attachment.cgi?id=8366621&action=edit

From the dmesg log, I can see the following during the first Gallery launch:

<6>[   40.509489] send sigkill to 1003 (Homescreen), adj 534, size 4588
<6>[   50.540774] send sigkill to 2143 ((Preallocated a), adj 667, size 3607
<6>[   75.946431] send sigkill to 1123 (Gallery), adj 134, size 16731
Attachment #8365503 - Attachment is obsolete: true
Memory usage for comment 20. It uses the same sdcard as mentioned in comment 20. Sorry for re-uploading it. You can see Gallery is using 86068KB of USS while scanning all images on the FFOS 1.3 JB port.
Attachment #8367577 - Attachment is obsolete: true
Flags: needinfo?(tkundu)
This contains the gallery app's memory usage for ICS Strawberry FFOS 1.3.

It uses the same content as comment 20. It is running on a 320x480 resolution device. It is not killed, and memory usage is also significantly lower (<55MB) than the memory usage in comment 20.

Please note that comment 20 is for the FFOS 1.3 JB port running on an 800x480 resolution device.
Memory usage for the gallery app running on an ICS Strawberry FFOS 1.2 320x480 resolution device. Highest USS memory usage is <55MB for the gallery app scanning the same content.
Attached video VID_0003.3gp (deleted) —
In summary,

FFOS 1.2 ICS Strawberry and FFOS 1.3 ICS Strawberry run on a 320x480 device, and gallery memory usage is <55MB. We also didn't observe any crash there.

But the FFOS 1.3 JB port runs on an 800x480 device, and gallery app memory usage is >80MB at some point. We always observed a crash in this use case.

Please note that the crash happened only when the gallery app scans for images on the sdcard. I used 2000 copies of the same image:
https://bugzilla.mozilla.org/attachment.cgi?id=8366621&action=edit and 100 copies of the same video (attached here).
Flags: needinfo?(dhuseby)
Thanks for the deep analysis in comment 24 - that confirms that this is a regression in memory when moving over to JB for 1.3. Agreed on the blocking decision now.

Removing dependency also - the problem is likely JB-specific, so the root cause is likely different than what the dependency is implying.
blocking-b2g: 1.3? → 1.3+
No longer depends on: 854795
Keywords: regression
JB on 1.3 was only done on the Nexus 4.  I imagine that has more memory and doesn't run into this problem.  I wouldn't call that a regression, as we never had this working on the JellyBean phone we're talking about (which one are we talking about, by the way?), but I suppose the fact that it works on ICS is what would have us call it a regression.
(In reply to Milan Sreckovic [:milan] from comment #26)
> JB on 1.3 was only done on Nexus 4.  I imagine that has more memory and
> doesn't run into this problem.  I wouldn't call that a regression, as we
> never had this work on a JellyBean phone we're talking about (which one are
> we talking about, by the way), but I suppose the fact that it works on ICS
> is what would have us call it a regression.

Fair enough, let's pull the keyword.
Keywords: regression
Please see comment 14.  The EXIF preview in the attached image is not very large, and I don't think it will cover a 480x800 screen.

Therefore, the gallery app will decode the full-size image 1000 times, using much more memory.

If the exif preview is big enough to cover the 320x480 screen, then the gallery app will just use the exif preview which uses much less memory.

Because of this screen-size dependency, it is not fair to use the same reference workload for the two phones.  We ensure in the Camera app that photos produced on the phone have large enough EXIF previews to take the fast path in Gallery.

However, see also bug 942199, where we may have to relax the "big enough" preview-to-screen-size comparison for some hardware.
Mike: flagging you on this because I see now that Tapas is using the standard reference workloads for photos. As described above, I think that might not be a fair test for large-screen devices.
Flags: needinfo?(mlee)
Comments 5 and 6 indicate that this isn't just about decoding large images, but about the number of images.  If it doesn't crash after 100 duplicate images, I would expect the gallery to have reached a steady state.  But those comments describe a crash that happens somewhere between 1000 and 2500 images, so I wonder if we have to look for a memory leak somewhere.

Gallery used to display all of its thumbnails, all of the time.  Then we introduced shared/js/visibility_monitor.js so that thumbnails would not be loaded when not visible, so that we wouldn't run out of memory when the number of thumbnails got large.

It might be worth investigating whether the visibility monitor is working as it should during the thumbnail scanning process.

The patch that added date headers to the thumbnail list changed the version of the visibility monitor we're using.  It would be interesting to try backing that patch out to see if this can still be reproduced. That was bug 925179.
Looks like bug 925179 backs out mostly cleanly, with only a trivial manual fix to index.html to resolve by hand.

I did this:

git checkout -b revert925179 v1.3
git revert -m1 14ed05fcf07b43ec8bed744c3b2e3e2977ebb40f

I haven't actually tried to reproduce the crash yet myself, so I haven't investigated whether backing this out prevents the crash.

Note that I am not suggesting we back out that patch to fix this bug.  But if backing it out fixes it, then that is a huge clue about what we need to do to fix it.
Flags: needinfo?(tkundu)
David, this is seriously useful information and triage, thanks!
It would also be very helpful to know whether scrolling while scanning is actually necessary to reproduce the crash.  If we crash on 2000 images without any scrolling, that should simplify the diagnosis somewhat.
Whiteboard: [c=memory p= s= u=] → [c=memory p= s= u=1.3]
Running on master, I can reproduce this crash with 1000 copies of the test image on the sdcard.  It appears to have crashed after ~700 or so.

I instrumented the onscreen and offscreen callbacks (that are supposed to be called by the visibility monitor).  The onscreen callback was called 659 times before crashing, and the offscreen callback was only called once.

This probably means that we were keeping 659 thumbnails in memory. At 120x120 px, that is 120x120x4x659 = ~36.2MB.

The thumbnails would be bigger on a 480x800 device, so the problem would be worse there.

The visibility monitor was supposed to be tuned so that it kept 300 thumbnails in memory, so I think something is going wrong.
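The instrumentation itself was just counters wrapped around the two callbacks, roughly like this (a sketch with simplified names, not the exact code):

  // Count visibility monitor callbacks to check that offscreen thumbnails
  // are actually being released while a scan is in progress.
  var onscreenCount = 0, offscreenCount = 0;
  function thumbnailOnscreen(thumbnail) {
    onscreenCount += 1;
    console.log('onscreen:', onscreenCount, 'offscreen:', offscreenCount);
    // ...then set the thumbnail's background image as usual
  }
  function thumbnailOffscreen(thumbnail) {
    offscreenCount += 1;
    // ...then clear the thumbnail's background image as usual
  }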
Interesting.  I would have thought unloading images would be less of an issue as well given that we discard images that are off screen.  See bug 689623 now.

Also, late in v1.2 we enabled the pref for bug 847223 to avoid decoding images that are offscreen:

  pref("layout.imagevisibility.enabled", true);

I wonder if that pref could be interacting badly with the mutation code or something.
Maybe gallery is described by bug 689623 comment 165?
Ben,

And independently, I just saw bug 943248 which also appears to set that pref. This is the first I've heard about it, so maybe we don't need visibility monitor anymore.

And yes, bug 689623 comment 165 seems to describe Gallery. But that comment is really old and the STR on this bug involve scrolling, so it might not be relevant.
(In reply to David Flanagan [:djf] from comment #31)
> Looks like bug 925179 backs out mostly cleanly with only a trivial manual
> fix to the index.html to resolve by hand.  I used this line:
> 

I reverted that change, but we still see a crash during startup on the 256MB 480x800 device (running the FFOS 1.3 JB port).

>> It would also be very helpful to know whether scrolling while scanning is actually necessary to reproduce the crash.
Not necessary. It will automatically crash at some point during scanning.
Flags: needinfo?(tkundu)
djf: I am assigning this to you since you have been actively investigating this today (please feel free to ping punam to help with additional testing as you suggested, I will also talk to her about this one in the morning).

Thanks
Hema
Assignee: nobody → dflanagan
(In reply to Tapas Kumar Kundu from comment #38)
> (In reply to David Flanagan [:djf] from comment #31)
> > Looks like bug 925179 backs out mostly cleanly with only a trivial manual
> > fix to the index.html to resolve by hand.  I used this line:
> > 
> 
> I reverted that change but we still see crash during startup for 256MB
> 480x800 device (running FFOS 1.3 JB port)
> 
> >> It would also be very helpful to know whether scrolling while scanning is actually necessary to reproduce the crash.
> Not necessary. It will automatically crash at some point during scanning.

Thanks for this information, Tapas.  How did you test with bug 925179 reverted? When I try reverting that patch, Gallery ends up hanging with IndexedDB errors, so I haven't been able to test that yet.
Flags: needinfo?(tkundu)
Clearing the needinfo flag for Tapas. I flashed a nightly from before bug 925179 landed and still get the crash trying to scan 1000 copies of the test image on my hamachi.
Flags: needinfo?(tkundu)
Following up on comment 34, running master on a hamachi, with 900 copies of the test image.

Yesterday Gallery crashed on the first run, with the visibility monitor calling the onscreen callback 659 times and offscreen only once. When I ran it again, it finished scanning.

Today, I ran it again with all of the images already scanned. This time it loads and displays all of the thumbnails, and the onscreen callback is only called ~300 times, which is the expected number. So there is something about the scanning process.

I tried the scanning test again today with today's master build.  900 photos. This time the scan completed without an OOM.  But the onscreen callback was called 905 times and the offscreen callback only once.  I could see a steady increase in the max memory usage as scanning continued, so I think we did indeed have all of the decoded thumbnails in memory at the same time.

Next, I'm going to try the test again with all of the visibility monitor code removed, since per the discussion in comments 35 to 37, it shouldn't actually be needed anymore.
Running the scan again without the visibility monitor worked with 900 photos.  Memory usage dropped when the screen blanked and then spiked when I turned the screen back on again.  Trying again with the screen set to stay on always.  Memory usage sometimes seems alarmingly high (spikes in the 160's) but it hasn't crashed yet.

Next, I'm going to try working with different images that have suitable EXIF previews. That will reduce the variation in memory usage, since we'll never have to decode the full image.  It should also make scanning much faster, so it will be easier to try with larger numbers of photos.  If the underlying cause is that we just have too many decoded thumbnails in memory, we ought to be able to trigger the bug even with photos with previews.
Now I've put 2187 copies of a photo from my hamachi camera on the sdcard and am running gallery without the visibility monitor.  It scans much more quickly, and does not OOM.

Once the photos are scanned, I can scroll quickly through them without causing an OOM.  Memory usage does go up as I scroll, and the frame rate is not particularly good, but we had that problem with Gallery before.

Next, I'll try this same test with the faster scanning photos but with the visibility monitor running.
My 2000+ images also scanned fine with the visibility monitor running. And this time, the onscreen callback was only called 215 times.

Memory usage while scanning seemed about the same as without it.  Scrolling performance after scanning is worse with the visibility monitor installed, as expected.

One thing that is different this time is that I have the screen always on. I wonder if the visibility monitor was malfunctioning when the screen went blank.  I'll try that next.
When the screen blanks while we're scanning, we OOM.
This is on master, and with the visibility monitor running.
Nice find!  Do the mutation events stop coming or something?
Same thing without the visibility monitor: we OOM if we keep scanning while the screen is blank.
(In reply to Ben Kelly [:bkelly] from comment #47)
> Nice find!  Do the mutation events stop coming or something?

I'm wondering if the LMK just kills us more easily when the screen blanks.  Are we still the foreground app in that case?  I've been testing by manually tapping the sleep button. 

Maybe I have to hold a screen lock while scanning?
Next I'll try this on 1.3. It could be that I'm just seeing some new buggy behavior on master.
Here are my findings testing on Hamachi 1.3 with 1000 big-EXIF-preview images. The image used was taken using the Hamachi camera, and the 1000 replicated images were copied using APPS="gallery" make reference-workload-x-heavy.

No crash was observed while scanning; time taken to scan and load images was 4 mins for 1000 images, and memory usage stayed around 150000 KB.

On Hamachi 1.3, a similar scan and load of 1000 copies of the test image attached to the bug (https://bug963917.bugzilla.mozilla.org/attachment.cgi?id=8366621) took 31 mins, with memory usage varying between 160000-180000 KB.
Same result in the latest 1.3 nightly.  If I use the default 'blank screen after 1 minute' setting, then the gallery app will die during scanning.

If I do Settings->Display->Screen timeout->Never, then I can scan 2187 photos without an OOM.

Tapas: do you find the same behavior?  If you force the screen to stay on, does the OOM go away?
(In reply to David Flanagan [:djf] from comment #49)
> I'm wondering if the LMK just kills us more easily when the screen blanks. 
> Are we still the foreground app in that case?  I've been testing by manually
> tapping the sleep button. 
> 
> Maybe I have to hold a screen lock while scanning?

Yea, I think you are the background process at that point and will OOM more easily.  Can you see what b2g-info says in terms of memory usage?  Is it roughly the same whether you let it sleep or not?
Calling navigator.requestWakeLock('screen') in the onscanstart callback, and then unlocking it in onscanend, allows the scan to complete.  And a minute after the scan completes, the screen blanks successfully.
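In code form, the fix is roughly this (a sketch; 'photodb' is the gallery's MediaDB instance, and the actual patch is the attachment below):

  // Hold a screen wake lock for the duration of the media scan so the
  // screen timeout can't put the app in the background mid-scan.
  var scanWakeLock = null;
  photodb.onscanstart = function() {
    scanWakeLock = navigator.requestWakeLock('screen');
  };
  photodb.onscanend = function() {
    if (scanWakeLock) {
      scanWakeLock.unlock();
      scanWakeLock = null;
    }
  };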
Attached file link to patch on github (obsolete) (deleted) —
Tapas,

Does this patch solve the problem for you?
Attachment #8368974 - Flags: review?(tkundu)
Note that even with this patch, if I push the power button to manually blank the screen the app will go to the background and OOM.  But the screen lock at least keeps it awake by default so that the scan can complete.

I suppose we could argue about whether an OOM while sleeping is a bug or not. I'm not sure that it is.  Something to discuss, perhaps, but not something to block on.

Alive and Tim: can either of you comment on what happens to the foreground app when the phone goes to sleep?  Does it become a background app and easier for the LMK to kill, or is there a bug here? Is there an optimization we can make here so that if an app has a screen lock and the user sleeps the phone we keep that app as the foreground app or something?  Basically, are there followup bugs we should file here?
Flags: needinfo?(timdream)
Flags: needinfo?(alive)
I wonder if a cpu lock would work here instead of a screen lock... I don't actually know what a cpu lock is supposed to do.
With this patch applied, I just scanned 3600 photos (each one with an EXIF preview of large enough size for the 320x480 hamachi screen).  When scanning completed, the screen blanked correctly, but it did not blank during scanning.
Changing from screen lock to cpu lock does not fix the bug.
(In reply to David Flanagan [:djf] from comment #56)
> Note that even with this patch, if I push the power button to manually blank
> the screen the app will go to the background and OOM.  But the screen lock
> at least keeps it awake by default so that the scan can complete.
> 
> I suppose we could argue about whether an OOM while sleeping is a bug or
> not. I'm not sure that it is.  Something to discuss, perhaps, but not
> something to block on.
> 
> Alive and Tim: can either of you comment on what happens to the foreground
> app when the phone goes to sleep?  Does it become a background app and
> easier for the LMK to kill, or is there a bug here? Is there an optimization
> we can make here so that if an app has a screen lock and the user sleeps the
> phone we keep that app as the foreground app or something?  Basically, are
> there followup bugs we should file here?

Quick answer: when the screen is locked, all apps are in the background.
An app is not able to distinguish between being in the background and the screen being off.

Because page visibility is inheritable, and all apps are embedded under the system app, when the screen is turned off the visibility of the system app's iframe is changed to false, and thus all apps, including the foreground app, get a visibilitychange event.

I don't think we should keep apps in the foreground when the screen is off; that would result in power drain if the app doesn't know the screen is off and keeps doing work.
A screen lock does prevent the screen from being automatically turned off by the screen manager in the system app, and maybe we could ignore the power button while the foreground app holds a screen lock.
Flags: needinfo?(alive)
Tested on Hamachi v1.3 with the default 'blank screen after 1 minute' screen timeout setting: a Gallery crash was seen after 4 mins of scanning 1800 big-EXIF-preview images.

With the setting changed so the screen never blanks, it scanned 1800 images in 7 mins, and max memory usage reached around 158000 KB.
Setting needinfo for Tapas in case he is not monitoring his review requests.

Tapas: does the attached patch solve the bug you were seeing?
Flags: needinfo?(tkundu)
(In reply to David Flanagan [:djf] from comment #55)
> Created attachment 8368974 [details]
> link to patch on github
> 
> Tapas,
> 
> Does this patch solve the problem for you?

No. Didn't work :(
Flags: needinfo?(tkundu)
Did the patch prevent the screen from blanking while you ran the test? Did it allow your test to run longer than it did before?

If you switch to a more realistic workload (with images that have bigger EXIF previews), then maybe this patch in conjunction with a patch for bug 942199 will be enough.

Other than that, I don't know what to suggest. If we got a fix for bug 854795 and could uplift it to 1.3 that would make a big difference, but I don't think there is any chance of that for 1.3.

You could reduce the size of the photos that the Camera app is taking.  Though I'm guessing that an OOM for a completely unrealistic workload is easier to live with than compromising the photo size of the device.
Flags: needinfo?(tkundu)
I'm tempted to unassign myself from this bug, since I won't be able to figure out why my 256MB device (hamachi) can scan the photos and the QC reference device cannot.

I suppose we should try one more time with exactly the workload that Tapas is using, however, before I give up completely.

Punam: if you have time, would you try reproducing this bug with the 1000 photos and 100 videos workload that the reporter is using?  I think the make reference-workload targets don't put enough photos and videos on the phone, so you'll have to use more.  Please use the default reference workload image even though it has a too-small preview and takes a really long time to scan.
Assignee: dflanagan → nobody
Flags: needinfo?(pdahiya)
Tested on below environment

Device: Hamachi
OS Version 1.3
Platform 28.0a2
Build Id: 20140131004001

Steps

1. Applied the attached patch on the latest gaia v1.3 and installed it on the device using make reset-gaia

2. Loaded 1000 images on the device by changing the makeReferenceWorkload.sh script to push 1000 default images: https://github.com/mozilla-b2g/gaia/blob/master/test_media/reference-workload/makeReferenceWorkload.sh#L62

3. Ran APPS="gallery" make reference-workload-x-heavy

4. Loaded 100 copies of the attached video https://bug963917.bugzilla.mozilla.org/attachment.cgi?id=8367619 onto the sdcard

5. Open gallery app

The gallery app took 22 mins to load 1000 images, and the screen didn't time out while scanning. The screen successfully timed out and turned black after scanning finished. No crash observed.
Flags: needinfo?(pdahiya)
Attached file gaia and gecko commit hash (deleted) —
(In reply to David Flanagan [:djf] from comment #64)
> Did the patch prevent the screen from blanking while you ran the test? Did
> it allow your test to run longer than it did before?
> 
It crashed after 10 mins, and I tested with 2000 copies of the sample image (which I attached to this bug). I am not sure whether this patch prevented 'screen blanking' for me or not. I changed the screen timeout to NEVER in Settings > Display > Screen timeout, and I also applied your patch [1] on top of my repo. I still saw a crash exactly 10 mins in, and my display was not blank even after gallery crashed.

I attached my gaia/gecko repo's last two commit hashes. Can you please tell me if I am missing some other changes in gaia/gecko?

It would be helpful if you could point me at a way to debug whether gecko is activating the screen timeout after 10 mins or not.

I am assuming that I am hitting some other bug with the screen timeout, as it is working for others in the 256MB configuration. I am re-checking with the latest gaia/gecko, and I will update soon with your patch on top of the latest gaia/gecko on FFOS 1.3.

[1] https://bugzilla.mozilla.org/attachment.cgi?id=8368974&action=edit
Flags: needinfo?(dflanagan)
Do you have calendar or clock alarms that might be triggering another app launch after 10 minutes?  That shouldn't kill the foreground app, but maybe that is triggering some bug.
(In reply to Punam Dahiya from comment #66)
> Tested on below environment
> 
> Device: Hamachi
> OS Version 1.3
> Platform 28.0a2
> Build Id: 20140131004001
> 
Did you test it on the JB port of FFOS 1.3? I guess that it is ICS Strawberry-based FFOS 1.3. Please also confirm the total memory available on this device. Thanks a lot for the help.
Flags: needinfo?(pdahiya)
Tapas,

Yes, the Hamachi is an ICS device. It has 256MB of memory, but only a 320x480 screen, which means that the thumbnail images and the preview images being generated are smaller than they would be on your device.

The only JB device that I know of is the Nexus 4, and although it has the larger screen size, it obviously has much more memory than your device.

I don't know if anyone at mozilla has a device that can replicate your test.

Punam tried with 1000 images (Thanks, Punam!).  I'll try with the 2000 that you tried.  I'm eager to hear what you experience with the latest version of Mozilla's 1.3 branch.

If the crash is always at the 10 minute mark that does seem like an important clue. I don't know anything about b2g process management, however, so don't know what to make of that.
Flags: needinfo?(dflanagan)
(In reply to David Flanagan [:djf] from comment #70)
> Tapas,
> 
> Yes, the Hamachi is an ICS device. It has 256mb of memory, but only has a
> 320x480 screen, which means that the thumbnail images and the preview images
> that are being generated are smaller than they would be on your device. 
> 
> The only JB device that I know of is the nexus 4, and although it has the
> larger screen size obviously it has much more memory than your device.  
> 
> I don't know if anyone at mozilla has a device that can replicate your test.

I already confirmed in comment 24 that an ICS device running FFOS 1.3 has no issue even with 2000 images.
Please note that this issue is present only on the FFOS 1.3 JB device with the 256MB memory configuration.

>> Punam tried with 1000 images (Thanks, Punam!).  I'll try with the 2000 that you tried. 
Looks like it will not help much, as you don't have a proper device to test this use case.

>> If the crash is always at the 10 minute mark that does seem like an important clue. I don't know anything about b2g process management, however, so don't know what to make of that.

Do you want us to re-adjust the LMK values for this gallery app problem? I guess that we should not allow the gallery app to consume ~80MB of memory during loading. If you look at comment 20, you can see that this is a memory usage regression of the gallery app on the 800x480 JB 256MB device. It would be great if we can find a solution for the gallery app's memory usage instead of adjusting LMK or screen timeout settings.

Please advise.
Flags: needinfo?(tkundu)
Flags: needinfo?(pdahiya)
(In reply to David Flanagan [:djf] from comment #64)
> Did the patch prevent the screen from blanking while you ran the test? Did
> it allow your test to run longer than it did before?
> 
> If you switch to a more realistic workload (with images that have bigger
> EXIF previews), then maybe this patch in conjunction with a patch for bug
> 942199 will be enough.
> 
> Other than that, I don't know what to suggest. If we got a fix for bug
> 854795 and could uplift it to 1.3 that would make a big difference, but I
> don't think there is any chance of that for 1.3.
>
It seems to me that we are not trying to solve the memory usage regression (comment 24) for the gallery app on the 800x480 device running FFOS 1.3 (JB port) now.
(In reply to Tapas Kumar Kundu from comment #71)
> (In reply to David Flanagan [:djf] from comment #70)
> > Tapas,
> > 
> > Yes, the Hamachi is an ICS device. It has 256mb of memory, but only has a
> > 320x480 screen, which means that the thumbnail images and the preview images
> > that are being generated are smaller than they would be on your device. 
> > 
> > The only JB device that I know of is the nexus 4, and although it has the
> > larger screen size obviously it has much more memory than your device.  
> > 
> > I don't know if anyone at mozilla has a device that can replicate your test.
> 
> I already confirmed that ICS decice running FFOS 1.3 has no issue even with
> 2000 images in #comment 24.
> Please know that this issue is present only in FFOS 1.3 JB device with 256
> Mem configuration. 
> 

I guess I didn't read it carefully enough, and since I was able to observe an OOM when the screen blanked, I thought maybe I was seeing the same bug as you.

Do you actually know that this is related to ICS vs JB, or could it be related to the screen size?

> >> Punam tried with 1000 images (Thanks, Punam!).  I'll try with the 2000 that you tried. 
> Looks like it will not help much as you don't have proper device to test
> this use case. 
> 
> >> If the crash is always at the 10 minute mark that does seem like an important clue. I don't know anything about b2g process management, however, so don't know what to make of that.
> 
> Do you want us to re-adjust LMK values for this galley app problem ? 

I know nothing at all about the LMK, so have no opinion on what to do with it.

In addition to tweaking the LMK parameters, I think there are other knobs that can be adjusted that have to do with how gecko manages image memory. I don't understand those, either.

> I guess
> that we should not allow gallery app to consume ~80MB memory during loading.

If the memory is free, I don't care if the app uses it. If gecko wants to hang on to decoded images, that should be fine as long as it releases them on memory pressure....


> If you look into #comment 20 then you can see that this is memory usage
> regression of gallery app for 800x480 JB 256MB device. It would be great if
> we can find a solution for memory usage of galley app instead of adjusting
> LMK or screen timeout settings. 
> 

There is no regression in the gallery app itself. Gallery's metadata parsing code is careful to serialize the parsing process so that only one image is being decoded at a time. We had all kinds of memory related bugs in previous releases, and I really do not think that there are any further memory savings to be achieved at the JavaScript level.  (Until we have bug 854795 and can downsample and decode at the same time.)

> please suggest

It seems clear at this point that this is not a Gaia regression. I don't know if it is gecko or gonk, though, and really have no clue.

I'll try following Milan's suggestion in comment #7 and will attach a patch, but I'm not particularly hopeful about that since I don't think that has anything to do with ICS vs JB. I don't know what SkiaGL even is, but I'm told that setting this new flag might reduce some memory usage, so maybe it will help.

After that I'm going to unassign myself, since I'm out of ideas and it seems clear that we won't solve this bug with JavaScript.
Attached file link to patch on github (obsolete) (deleted) —
Tapas,

This patch implements Milan's suggestion from early in the history of this bug. Please give this a try and see if it makes any difference.
Looks like I actually unassigned myself from this bug up at comment 65.

For a regression caused by the switch from ICS to JB, I don't even have a guess who the new assignee should be. Milan, is this something someone on your team could work on?
Flags: needinfo?(milan)
Sounds good - let's find out if your patch makes a difference.  Otherwise, I imagine that the memory increase isn't so much from the switch from ICS to JB (although I suppose it could be), but more because of the size of the screen, and needing a larger image than we have in the thumbnail.  At that point, we need to decode the full-size image (at least until bug 854795 gets fixed), which is larger than before.
NI Tapas to test David's patch from comment 74
Flags: needinfo?(tkundu)
(In reply to Hema Koka [:hema] from comment #77)
> NI Tapas to test David's patch from comment 74

Here is my observation:
1) With 2000 copies of the sample image (I attached that image to this bug), I can see the device scanning for 20 mins and then gallery crashes. This is observed with the patch from comment 74. You can see b2g-procrank and kernel logs from when it crashes.

2) But if I don't use the patch from comment 74, then gallery crashes within ~5 min of scanning. So I can say that this patch (comment 74) improves memory usage a lot. But we still need to save more memory if we want to allow end users to use their existing memory card (or a memory card from another smartphone) with FFOS.

3) If we change device orientation, then we see gallery crash during scanning. It happens almost immediately. Please advise on this problem.

4) If we don't use the attached sample image and only use pictures captured by the camera, then we don't see any crash during the initial scan of 2000 images. I think this is explained in comment 14. But if we open a picture during scanning (in this use case), then I see a crash immediately. Do we have a bug for it?

5) I don't think this is an ICS/JB issue. It seems to me that it is related to the additional memory usage mentioned in comment 14, and it is present only on the 800x480 resolution device with 256MB memory; it can be reproduced very easily with 2000 copies of the same image (attached to this bug).
Flags: needinfo?(tkundu)
Comment on attachment 8369887 [details]
link to patch on github

It saves memory, but we are still crashing after 20 mins for 2000 images. We need to save more memory during scanning. But it is a big improvement. Please see comment 78 for more details.
Attachment #8369887 - Flags: feedback+
Comment on attachment 8368974 [details]
link to patch on github

It didn't help on JB-based FFOS 1.3. I can see that gallery scans fine even if I force the screen to blank by pressing the power button on the device.
Attachment #8368974 - Flags: review?(tkundu) → review-
Whiteboard: [c=memory p= s= u=1.3] → [c=memory p= s= u=1.3] [CR 611568]
Comment 60 is correct. If you think visibility state and memory priority should be decoupled for occasions like this, we should ask Gecko developers to fix that in another bug.
Flags: needinfo?(timdream)
(In reply to Tapas Kumar Kundu from comment #80)
> Comment on attachment 8368974 [details]
> link to patch on github
> 
> It didn't help in JB based FFOS 1.3 . I can see that gallery is scanning
> fine even if I force blanking screen by pressing power button on device.

Screen lock or CPU lock has nothing to do with visibility state. The user is always allowed to press home to hide the app and press the hardware sleep/power key to turn off the screen.

A screen lock simply stops the system app from automatically turning off the screen, e.g. for a video/monitoring app.
A CPU lock simply stops the phone from suspending JavaScript execution when the screen is asleep.

Maybe we do need to fix comment 81 in this bug.....
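To spell out the difference (illustrative only):

  // A screen lock only blocks the automatic screen timeout:
  var screenLock = navigator.requestWakeLock('screen');
  // A cpu lock keeps JavaScript running while the screen is off;
  // neither lock keeps the app in the foreground:
  var cpuLock = navigator.requestWakeLock('cpu');
  // Both must be released when no longer needed:
  screenLock.unlock();
  cpuLock.unlock();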
I used about:memory to get the memory usage. I found that we have a very different usage for "canvas-2d-pixels."
We need to find out why the "canvas-2d-pixels" is so different. Maybe the visibility monitor is not working for the first launch.


1. First launch with 400 256x256 png images:
 60.73 MB ── canvas-2d-pixels
  2.92 MB ── gfx-surface-image
  0.00 MB ── gfx-textures
        0 ── ghost-windows
  0.00 MB ── gralloc
 16.57 MB ── heap-allocated
 18.54 MB ── heap-committed
   11.91% ── heap-overhead-ratio
      400 ── host-object-urls
  0.00 MB ── imagelib-surface-cache
  0.29 MB ── js-main-runtime-temporary-peak
        0 ── page-faults-hard
  336,203 ── page-faults-soft
 42.88 MB ── resident
 23.89 MB ── resident-unique
118.36 MB ── vsize

2. Second launch with the same 400 256x256 png images:
  2.76 MB ── gfx-surface-image
  0.00 MB ── gfx-textures
        0 ── ghost-windows
  0.00 MB ── gralloc
 15.64 MB ── heap-allocated
 16.36 MB ── heap-committed
    4.61% ── heap-overhead-ratio
    2,004 ── host-object-urls
  0.00 MB ── imagelib-surface-cache
  0.29 MB ── js-main-runtime-temporary-peak
        2 ── page-faults-hard
   14,350 ── page-faults-soft
 37.73 MB ── resident
 22.17 MB ── resident-unique
117.26 MB ── vsize
Sorry, ignore the "Maybe the visibility monitor is not working for the first launch" in comment 83.

Maybe we have some problem releasing the canvas 2d buffer.
So it seems to me there is no back-pressure in the scanning process.  It reads files, generates thumbnails/previews (which creates canvases), and then saves them to disk as blobs.  The canvases are not discarded until the canvas.toBlob() callback is fired.

This seems problematic to me because it's probably faster to read files off disk than it is to re-encode images and save them to disk.  This means we could be building up a large number of canvases in flight, waiting for things to flush.

David, Gabriele, what do you think about adding a callback from the metadata parser infrastructure through to MediaDB so that it can throttle when to process the next newly scanned file?  We could then limit the number of outstanding canvases to a known quantity.
Flags: needinfo?(gsvelto)
Flags: needinfo?(dflanagan)
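A minimal sketch of the throttling idea from comment 85 (hypothetical helpers; MediaDB has no such callback today):

  // Cap the number of thumbnail/preview canvases in flight: the scanner
  // must acquire a slot before parsing the next file, and slots are only
  // released once the generated blobs have been flushed to disk.
  var MAX_IN_FLIGHT = 2;
  var inFlight = 0;
  var waiting = [];

  function acquireSlot(next) {
    if (inFlight < MAX_IN_FLIGHT) {
      inFlight++;
      next();
    } else {
      waiting.push(next);
    }
  }

  // Call this from the toBlob()/save completion callback.
  function releaseSlot() {
    inFlight--;
    if (waiting.length > 0) {
      inFlight++;
      waiting.shift()();
    }
  }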
Oops.  I thought Gabriele was working on this bug, but I was thinking of bug 917193.  Sorry for the bug spam.
Flags: needinfo?(gsvelto)
Jerry is looking at this, so assigning to him.  I don't know how long this will take because I don't know what success looks like.  We were failing with 1000+100, and that was deemed a reasonable case.  We now have a patch that lets us go beyond that, but we fail at 2500+100.  Is there another target short of "never OOM"?
Assignee: nobody → hshih
Flags: needinfo?(milan)
Whiteboard: [c=memory p= s= u=1.3] [CR 611568] → [c=memory p= s= u=1.3] [CR 611568][ETA: n/a]
(In reply to Ben Kelly [:bkelly] from comment #85)
> So it seems to me there is no back-pressure in the scanning process.  It
> reads files, generates thumbnails/previews which creates canvases, and then
> saves them to disk as a blob.  The canvases are not discarded until the
> canvas.toBlob() callback is fired.
> 
> This seems problematic to me because its probably faster to read files off
> disk than it is to re-encode images and save to disk.  This means we could
> be building up a large amount of canvases in flight waiting for things to
> flush.
> 
> David, Gabriele, what do you think about adding a callback from the metadata
> parser infrastructure through to MediaDB so that it can throttle when to
> process the next newly scanned file?  We could then limit the number of
> outstanding canvases to a known quantity.

Keep in mind that I've been through all this before. Gallery's memory usage has been optimized and optimized and optimized already in the 1.0 and 1.1 release cycles.

MediaDB has lots of code to ensure that metadata parsing, although done asynchronously, is strictly serialized. There is only one image being processed at a time, and nothing happens until toBlob() is done.  We do not wait until the DB writes and file saves are complete, however.  If thumbnails are showing up on the screen while the scan is in progress, that indicates that the DB writes are happening. I suppose it is possible that the writes of the generated preview images to /sdcard/.gallery/previews/ are backing up, though that seems unlikely.  Probably easy enough to check that by adding an onsuccess callback with some console logging in the save() function at apps/gallery/js/MetadataParser.js:371

Though I don't know that it will do any of us any good without the QRD device to try it out on.

Does anyone at Mozilla have the ability to work on this bug and have access to a device that can reproduce the crash?
(In reply to Milan Sreckovic [:milan] from comment #87)
> Jerry is looking at this, so assigning to him.  I don't know how long this
> will take because I don't know what success looks like.  We were failing
> with 1000+100, and that was deemed a reasonable case.  We now have a patch
> that lets us go beyond that, but we fail at 2500+100.  Is there another
> target short of "never OOM"?

I agree that at some point we may have to just say "good enough; we don't have to block" anymore.  But there is a mystery memory leak here.  If memory usage increases linearly with the number of photos that need to be scanned, something is going wrong.  And it is something that does not seem to happen on the Hamachi, which can scan 2000 previewless images (it takes over 40 minutes, but there is no crash).
Flags: needinfo?(dflanagan)
An interesting experiment with a hamachi might be to modify the metadata parser code to force it to create 400x800 preview images and 200x200 thumbnails (instead of 320x480 and 120x120). This would simulate the image sizes being used on the QRD device.

If we had a QRD device, I would like to know what happens if we didn't generate the preview image. Note, though, that because of bug 942199 photos from that device do not have a big enough preview, so we'd have to use a manually created test image, or just turn off the preview generation code in MetadataParser.js.  This might help us narrow down the leak.  If we crash with only thumbnails being created, then we know that the leak probably has something to do with those thumbnails.

Jerry's finding in comment #83 does seem like an interesting lead.  Jerry, are you testing this on the QRD device? In terms of freeing up canvases, there are two calls to toBlob() in MetadataParser.js.  In one, the callback explicitly sets the canvas size to 0 once the blob is available.  In the other it does not, and we rely on ordinary garbage collection to discard that 200x200 canvas.  We could try patching that by changing MetadataParser.js:127 to be:

  canvas.toBlob(function(b) { canvas.width = canvas.height = 0; callback(b); }, 'image/jpeg');
(In reply to David Flanagan [:djf] from comment #90)
> If we had a QRD device,

QRD8x28s are available.  See bug 961246.  We could probably force the RAM down to simulate a 256MB config.  Not an exact match but would be closer than a Hamachi.
Attached patch gallery.patch (obsolete) (deleted) — Splinter Review
Release the canvas buffer explicitly. The "canvas-2d-pixels" becomes zero.
I think the gallery can load more pictures.

--
If we want to handle 5000 or even 10000 pictures, I think we should consider the number of DOM elements. Currently, if one picture is out of the visibility window, we still have a DOM element for that picture; we just set its backgroundImage to null. The DOM element might consume some memory.


First launch with 4000 256x256 png files:
  0.00 MB ── canvas-2d-pixels
  2.92 MB ── gfx-surface-image
  0.00 MB ── gfx-textures
        0 ── ghost-windows
  0.00 MB ── gralloc
 50.46 MB ── heap-allocated
 52.59 MB ── heap-committed
    4.22% ── heap-overhead-ratio
    4,000 ── host-object-urls
  0.00 MB ── imagelib-surface-cache
  0.29 MB ── js-main-runtime-temporary-peak
        0 ── page-faults-hard
2,707,857 ── page-faults-soft
 78.34 MB ── resident
 59.12 MB ── resident-unique
144.42 MB ── vsize

Second launch time
 2.76 MB ── gfx-surface-image
 0.00 MB ── gfx-textures
       0 ── ghost-windows
 0.00 MB ── gralloc
13.00 MB ── heap-allocated
13.63 MB ── heap-committed
   4.79% ── heap-overhead-ratio
     945 ── host-object-urls
 0.00 MB ── imagelib-surface-cache
 0.29 MB ── js-main-runtime-temporary-peak
       2 ── page-faults-hard
  13,378 ── page-faults-soft
35.21 MB ── resident
19.82 MB ── resident-unique
92.37 MB ── vsize
Attachment #8371256 - Flags: feedback?(tkundu)
Attachment #8371256 - Flags: feedback?(dflanagan)
My test device is a helix; it has 384MB of memory.
So far I don't get an OOM when I launch gallery.
Comment on attachment 8371256 [details] [diff] [review]
gallery.patch

Jerry,

Thanks for trying this approach. Since you observe a difference in the about:memory results, I'm hoping that Tapas will see the difference, too.  Unfortunately he's the only one I know of who can actually test our patches for us.  Setting needinfo for Tapas in case he doesn't follow the feedback? flag as closely.

Is there a reason that you didn't leave in the willReadFrequently flags?  I tried that at Milan's suggestion, and Tapas seemed to think that it made some difference.  But I don't really understand what it does, so I'm not sure whether it is necessary here or not. In any case, you'll want to modify your patch to uncomment those lines or to remove the commented lines.
Attachment #8371256 - Flags: feedback?(dflanagan) → feedback+
Flags: needinfo?(tkundu)
(In reply to David Flanagan [:djf] from comment #94)
> Is there a reason that you didn't leave in the willReadFrequently flags?  I
> tried that at Milan's suggestion, and Tapas seemed to think that it made
> some difference.  But I don't really understand what it does, so I'm not
> sure whether it is necessary here or not. In any case, you'll want to modify
> your patch to uncomment those lines or to remove the commented lines.

I just wanted to test the worst case. We should use willReadFrequently for the final release.

(In reply to Jerry Shih[:jerry] from comment #92)
> If we want to handle 5000 even 10000 picture, I think we should consider the
> DOM element number. Currently, If one picture is out of the visibility
> window, we still have a DOM element for this picture. We just set the
> backgroundImage to null. The DOM element might consume some memory.

Is it easy for the gallery to remove/attach the preview DOM node, rather than only setting the backgroundImage to null?
Flags: needinfo?(dflanagan)
(In reply to Jerry Shih[:jerry] from comment #95)
> (In reply to David Flanagan [:djf] from comment #94)
> > Is there a reason that you didn't leave in the willReadFrequently flags?  I
> > tried that at Milan's suggestion, and Tapas seemed to think that it made
> > some difference.  But I don't really understand what it does, so I'm not
> > sure whether it is necessary here or not. In any case, you'll want to modify
> > your patch to uncomment those lines or to remove the commented lines.
> 
> I just want to test the worse case. We should use willReadFrequently for
> final release.

Great.

> 
> (In reply to Jerry Shih[:jerry] from comment #92)
> > If we want to handle 5000 even 10000 picture, I think we should consider the
> > DOM element number. Currently, If one picture is out of the visibility
> > window, we still have a DOM element for this picture. We just set the
> > backgroundImage to null. The DOM element might consume some memory.
> 
> Does the gallery easy to remove/attach the preview DOM node? Not only set
> the backgroundImage to null

That would be too big a change for a last minute 1.3 blocker.

I've actually filed a 1.4 bug to try to remove visibility_monitor entirely, because I don't think we need it for image visibility any more.  

If we also want to add and remove dom elements, we should consider some other infinite scrolling solution, I think. I suspect that trying to do that with visibility monitor would cause problems with too many layouts.  Also, the way we now segment thumbnails by month complicates matters.
Flags: needinfo?(dflanagan)
(In reply to David Flanagan [:djf] from comment #94)
> But I don't really understand what it does, so I'm not
> sure whether it is necessary here or not.

Setting |willReadFrequently| to true forces the canvas to be backed by a software-only surface as opposed to one accelerated via SkiaGL. This uses less memory as we don't allocate a GL context and the associated backing buffer in graphics memory.
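Concretely, the flag is just a context attribute passed to getContext() (sketch):

  // Ask for a software-backed 2d context so this canvas never allocates
  // a SkiaGL context or a graphics-memory backing buffer.
  var ctx = canvas.getContext('2d', { willReadFrequently: true });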
Comment on attachment 8371256 [details] [diff] [review]
gallery.patch

This patch does not apply cleanly on FFOS 1.3 gaia/gecko. It seems that you have [1] in your repo, but it has not come to FFOS 1.3 gaia/gecko yet.

[1] https://bugzilla.mozilla.org/show_bug.cgi?id=958681 

I tried this patch by resolving a minor conflict manually. I found that:

1) Gallery is killed after 10 mins with this patch
2) Gallery is killed within 5 mins without this patch

This patch looks nice to me, as it is saving some memory. But I am interested to know what is causing memory growth over time for 2000 copies of the same image in gallery. Is there any memory leak issue with gallery?
Attachment #8371256 - Flags: review+
Attachment #8371256 - Flags: feedback?(tkundu)
Attachment #8371256 - Flags: feedback+
Flags: needinfo?(tkundu)
Memory growth of the gallery app with the patch from comment 98. Looks like either gallery is leaking some memory, or it is using more memory over time to scan images (2000 copies of the same image, which I have attached to this bug).

Can anyone please explain why we are seeing memory growth over time in gallery?
Tapas,

Would you test again with all three patches applied (or at least the last two)?  This most recent patch did not include the previous patch, and both improved things for you. So it would be good to see how they do together.

As for the memory growth, if anyone could explain why were seeing that we would know how to fix the bug!

Your attachments of b2g-procrank output are helpful, but the fact that gallery's memory goes up and down so much makes the output hard to interpret.  I'd rather have data from runs that don't crash, taken after scanning has finished and memory usage has stabilized.

Here's some data I'd be interested in seeing:

Put 100 photos on your device, let gallery scan them to completion (assuming no crash with this number). Wait a minute without scrolling, to let memory stabilize, and then use procrank to see the final memory usage.

Then repeat with the same photos so no scanning is required.

Then erase the db (or reset gaia) and repeat with 200, 300, 400, 500, 600, etc.

For small numbers of photos, I'd expect to see a linear relationship between photos and memory usage, because more photos means more decoded thumbnails.  Each 200x200 thumbnail requires 160,000 bytes to hold the decoded image, so adding 100 more thumbnails should make memory use go up by 16,000,000 bytes.

At some point, however, gecko and/or the visibility monitor code in Gallery should stop decoding thumbnail images when they are too far offscreen. So as long as we don't scroll, we should see the total memory size flatten out and no longer increase very much as the # of photos goes up.

If we see this pattern for the runs where the scanning is already complete, but do not see the memory usage tapering off for the runs where scanning is being done, then we know that there is something wrong with the scanning process.

From some preliminary testing I did earlier in this bug, it seems as if the visibility monitor code is failing during the scan and is allowing too many images to decode, so that is something we could investigate. 

(Ah, here's another idea: try commenting out the body of thumbnailOnscreen() at lines 754 and 755 of gallery.js. This will mean that thumbnails will never be displayed and you'll just see a bunch of gray boxes.  If you still get an OOM during scanning, then we know that the bug is not related to the thumbnails.)
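For reference, that callback looks roughly like the following (reconstructed from memory, so treat the exact shape as an assumption); commenting out the body leaves only the gray placeholder boxes:

  function thumbnailOnscreen(thumbnail) {
    // Comment out these two lines so thumbnails are never displayed:
    if (thumbnail.dataset.backgroundImage)
      thumbnail.style.backgroundImage = thumbnail.dataset.backgroundImage;
  }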

Unfortunately, Tapas, you are the only person I know of who can do any testing on this bug.
Flags: needinfo?(tkundu)
Hi David,

If we have a lot of pictures, the gallery loads all the thumbnails into memory.
https://github.com/mozilla-b2g/gaia/blob/master/apps/gallery/js/gallery.js#L446
After enumerating, we have all the thumbnails in memory.
I think that is why memory usage increases with more and more pictures.

Can we remove the thumbnail from the fileinfo object? If the thumbnail is in the visible window, we can load the thumbnail blob on demand.
Flags: needinfo?(dflanagan)
(In reply to Jerry Shih[:jerry] from comment #83)
> I use about:memory to get the memory usage. I found that we have a very
> different usage for "canvas-2d-pixels."
> We need to find why the "canvas-2d-pixels" is so different. Maybe the
> visibility monitor is not working for the first launch.
> 
> 
> 1. first time launch for size(256x256) x 400pic png image.
>  60.73 MB ── canvas-2d-pixels
>   2.92 MB ── gfx-surface-image
>   0.00 MB ── gfx-textures

David, the "canvas-2d-pixels" memory above is created by too many local canvas objects[1] during thumbnail creation.
Since the width/height of the thumbnail canvas are always the same[2], could we use a single canvas object to generate all the thumbnails (see the sketch after the links below)? I think it would help reduce the total memory usage during first launch, and we wouldn't need to wait for GC to clean up the unused canvas objects.

[1] https://github.com/mozilla-b2g/gaia/blob/master/apps/gallery/js/MetadataParser.js#L44
[2] https://github.com/mozilla-b2g/gaia/blob/master/apps/gallery/js/MetadataParser.js#L45
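For illustration, a sketch of the reuse idea (a hypothetical rewrite, not the current MetadataParser.js code; THUMBNAIL_WIDTH and THUMBNAIL_HEIGHT stand in for whatever fixed thumbnail size the app uses):

  // One module-level canvas, allocated once and reused for every thumbnail.
  var sharedCanvas = document.createElement('canvas');
  sharedCanvas.width = THUMBNAIL_WIDTH;
  sharedCanvas.height = THUMBNAIL_HEIGHT;
  var sharedContext =
    sharedCanvas.getContext('2d', { willReadFrequently: true });

  function createThumbnailFromImage(image, callback) {
    sharedContext.drawImage(image, 0, 0, THUMBNAIL_WIDTH, THUMBNAIL_HEIGHT);
    sharedCanvas.toBlob(callback, 'image/jpeg');
  }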
Hi David,

I saw some interesting memory usage in about:memory.
We see the "memory-file-data/large/file" at 1st launch, but we don't see this at 2nd launch.
It costs a lot.

1st launch
29.70 MB (100.0%) -- explicit
├──15.63 MB (52.62%) -- dom
│  ├──15.63 MB (52.62%) ── memory-file-data/large/file
│  └───0.00 MB (00.00%) ── event-listener-managers-hash


This "memory-file-data" is a blob object in gecko, and it was generated after calling toBlob() in createThumbnailFromElement().
https://github.com/mozilla-b2g/gaia/blob/master/apps/gallery/js/MetadataParser.js#L134

Is this blob object freed after saving to MediaDB? There are too many callbacks and closures; it's hard for me to find the right callback function. :(
Hi jerry shih/David Flanagan,

Can we chat on IRC? I want to debug this issue together on my device.

Please let me know when you are free. We are seeing the same issue with <100 pictures at a higher resolution (not attached to this bug).

I think we should debug together so we can reach a conclusion quickly.
Flags: needinfo?(tkundu)
Flags: needinfo?(hshih)
(In reply to Jerry Shih[:jerry] from comment #101)
> Hi David,
> 
> If we have a lot of picture, the gallery will load all thumbnail into memory.
> https://github.com/mozilla-b2g/gaia/blob/master/apps/gallery/js/gallery.
> js#L446
> After enumerating, we have all thumbnail in memory.
> I think it is the reason why the memory usage will increase with more and
> more picture.
> 
> Can we remove the thumbnail from fileinfo object? If the thumbnail is in the
> visible window, we load the thumbnail blob object on demand.

That code is fine for this bug. This is only about the first run case where we start with no images in the db and are scanning them all.  When a blob is stored in indexeddb, it is stored in an external file, and when queried, the Blob object is just a reference to that file.  So enumerating all the images does not actually (unless something has changed in gecko) read the encoded images into memory.
Flags: needinfo?(dflanagan)
(In reply to peter chang[:pchang][:peter] from comment #102)

> David, about memory of "canvas-2d-pixels" above are created by too many
> local canvas objects[1] for thumbnail creation.
> Since the width/height of the thumbnail canvas are the same[2], could we use
> only one canvas object to generate all the thumbnail. I think it will help
> to reduce the total memory usage during first launch and we don't need to
> wait GC to cleanup the unused canvas objects.
> 

Peter: reusing that canvas is an interesting idea.  Note, however, that Jerry's patch attached to this bug solves the same problem by explicitly setting the canvas size to 0 after use.  Reusing the canvas might be more efficient, but presumably won't save us any more memory than we've already saved with Jerry's patch, so I don't think that will help us.
(In reply to Jerry Shih[:jerry] from comment #103)
> Hi David,
> 
> I saw some interesting memory usage in about:memory.
> We see the "memory-file-data/large/file" at 1st launch, but we don't see
> this at 2nd launch.
> It costs a lot.
> 
> 1st launch
> 29.70 MB (100.0%) -- explicit
> ├──15.63 MB (52.62%) -- dom
> │  ├──15.63 MB (52.62%) ── memory-file-data/large/file
> │  └───0.00 MB (00.00%) ── event-listener-managers-hash
> 
> 
> This "memory-file-data" is a blob object in gecko, and it was generated
> after calling toBlob() in createThumbnailFromElement().
> https://github.com/mozilla-b2g/gaia/blob/master/apps/gallery/js/
> MetadataParser.js#L134
> 
> Does this blob object be freed after saving to mediaDB? There is too many
> callback and closure. It's hard to me to find the callback function. :(

I don't know of any way to explicitly free a blob, do you? If we are somehow retaining a reference to these blobs and not allowing them to be garbage collected, at least they'll be jpeg-encoded images, which means they're much, much smaller than the decoded images.

Jerry: how many images did you scan before you saw that 15.6mb of large files? Does the number go up linearly with the number of images scanned?  That will help us know if this is a memory leak or whether you just did an about:memory dump before gc had been performed.
(In reply to Tapas Kumar Kundu from comment #104)
> Hi jerry shih/David Flanagan,
> 
> can we have chat in IRC ? I want to debug this issue together on my device. 
> 
> Please let me know when you are free. We are seeing same issue <100 pic for
> higher resolution image (not attached in this bugid) . 
> 
> I think that we should debug together to conclude quickly.

Tapas: my IRC handle is djf, and I'm usually in #gaia. I'm on Pacific time. Jerry is in Taipei, so it will be difficult to find a time that works for both him and me. You can feel free to ping me, but I'm at home sick today and will probably be useless for debugging.

I'll spend some time re-reading the code that Jerry points to above to see if I can identify a memory leak that would cause the memory-file-data/large/file.

Maybe I'll modify the code so that it creates 480x800 previews even though my hamachi screen size is 320x480.  If I can reproduce your crash that way, then we'll know that this has more to do with the screen size than with the ICS/JB distinction.
I've modified gallery to always create 200x200 thumbnails and 480x800 previews, and was then able to process 1000 images without OOM.  (With the display set to never blank.) So maybe this does have more to do with the version of android than with the screen size.  My test did not include either of the memory saving patches attached to this bug.  I did it with the latest 1.3 nightly for my hamachi device.

So we can keep trying to chase memory leaks, but I fear that we won't be able to get to the bottom of this without a 256mb JB device to test on.  And I wonder if we need someone who knows about memory management at the OS level, rather than continuing to chase this at the Gaia level.
After killing and restarting the app, it scanned some more; it had run to completion without crashing previously, but somehow missed some images on the first scan. The second scan was short, but I think it did process some images. Killing it a second time and starting it a third time, it launched and did not scan any new images.

After the first run (when it scanned for 20+ minutes) the stable-state memory usage (reported by b2g-ps) was 135M.  On the third run, where it didn't scan at all, the stable-state memory usage was 91M.

So maybe I'll go back to trying to figure out if and why the visibility monitor is failing during the scanning.
Going back to Jerry's comments #101 and #103...

When we enumerate the database (and no scanning is involved) the thumbnail blobs are files on disk.

But when we scan, we're using in-memory files and we're holding on to those.  That might account for the large difference in memory usage for the scanning case vs the not scanning case.

When the gallery is notified of a new file, maybe it should query the mediadb to get the record out of the db so that the blob is on disk rather than in memory.  If that solves this problem, I should possibly modify mediadb to have an option to do that automatically.
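A rough sketch of that re-query idea (getFileInfo matches the name the eventual patch adds to MediaDB, but treat the exact signature, and the addThumbnail helper, as assumptions):

  // On a new-file notification, go back to the db so the thumbnail blob
  // we keep is backed by the on-disk file rather than by memory.
  photodb.getFileInfo(fileinfo.name, function(dbrecord) {
    // dbrecord.metadata.thumbnail is now a file-backed blob
    addThumbnail(dbrecord);
  });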
While scanning photos on my 1.3 ICS hamachi, I notice that every 20 to 30 seconds, all of the thumbnails go blank and revert to a gray background color.  If I watch the b2g-ps output every second, this happens when VSIZE gets up to about 220M, and after the blank, VSIZE drops down to 150M or so.

It seems clear that this is some sort of garbage collection going on. I wonder if on the QRD device the app is getting killed before that gc pass can complete.  Tapas: do you ever see those repeated flashes of gray screen while testing on your device?

Meanwhile, on this test run, I'm not seeing errors with the visibility monitor.  The onscreen callback was called for the first ~150 thumbnails, and then it stopped getting called.  (It is getting called twice for each thumbnail, which seems like an error, but harmless.)  My previous tests included photos in many different directories. This time I'm using a single directory, and scanning seems to be happening from most recent to least recent, so all thumbnails are being added at the bottom.  In the previous case we were probably doing insertions at the top, so onscreen had to be called more often. (There was a bug where offscreen was not being called, I suppose.)
Scanning has just completed without any OOM.  Of the 1000 photos on the sdcard, only 969 actually got scanned, which is concerning.

Memory use when scanning was complete was stable at 127mb. That is lower than the 135mb I saw before, but much higher than I'd expect for the case where no scanning took place.  But if each thumbnail blob was 7kb and there were 1000 of them, then I'd expect my patch, which uses file-backed thumbnail blobs instead of memory-backed blobs, to save only about 7mb.
Running the app again, I find that some or all of the remaining files are scanned, 157 thumbnails get their onscreen callback called (about as expected), and memory usage settles at 85mb.  But if I scroll even once, then memory goes up to ~120M and stays at that level.

The thumbnail blobs turn out to be 12kb each, so I would expect to see more memory savings if my patch to force file-backed blobs had worked.
This patch is my attempt to force the gallery to discard the memory-backed thumbnail blob it is passed during the scanning process and instead explicitly query the media db to get a file-backed blob.  With the test image we're using and 1000 200x200 thumbnails, this should result in saving something like 12mb.

The patch does slow down scanning. On my hamachi it now takes 1.4s for each image.  (It won't be this bad in practice assuming that EXIF preview images are big enough.)
Tapas,

The patch attached above seems to resolve the "big file" memory issue that Jerry noticed.  With this patch applied, I can scan 1000 images (creating big previews and thumbnails for them) without an OOM. And when I'm done, the gallery memory usage is stable at 103mb instead of at 127 or 135mb as I'd seen before. Looks like I'm getting substantially better memory savings from it than the 12mb I expected.

(Unfortunately, I think the tradeoff is significantly worse scanning time. There is probably a batching solution that could help with that, though.)

Please check whether the patch above makes any difference for you.

The patch also includes my code to force 200x200 thumbnails and 480x800 preview images.  Please confirm that your device is 480x800 and that devicePixelRatio is 1.  (That is, that 1 css pixel == 1 device pixel). If your devicePixelRatio is > 1 then your device uses even bigger images than I realized.

If you want to continue investigating the possibility of memory leaks in Gallery, let's first try to narrow things down.  We need to find out whether it is the thumbnail images or the preview images that are causing problems.  If you modify the "is it big enough" test in the previewsuccess function in MetadataParser.js you can force Gallery to just accept the embedded EXIF preview image as big enough and not try to generate its own.  This will make scanning much, much quicker.  It would be interesting to know if you crash in that case. 

Next, I'd try commenting out the thumbnailOnscreen callback in gallery.js.  This will prevent any thumbnails from being decoded and displayed.  If you still crash with that, then we know that it is not the memory consumed by decoding the thumbnails that is at fault.  But if you don't crash, then maybe you can fix things by just tuning the parameters passed to the VisibilityMonitor.

But again, the fact that the scan always completes without OOM on my 256mb hamachi, even when I simulate a big screen, makes me think that the problem is probably not in Gallery (although we've found room for improvement in Gallery) but in the lower levels of the OS. That is an area that I know nothing about and cannot help you with.
Flags: needinfo?(tkundu)
Jerry,

Could you also try my most recent patch to see if it solves the "big file" memory issue you identified?  I tested it but just looked at the b2g-ps output to assess memory impact.  Your analysis of an about:memory dump will be more authoritative.
Thanks, David!
The large memory-file-data entry is gone after the scanning process;
I no longer find it in the explicit memory section.
But the "vsize" is still huge. I will check how we collect this value.

---
1st
19.34 MB (100.0%) -- explicit
├───4.60 MB (23.78%) ++ window-objects/top
├───4.22 MB (21.81%) ++ images
├───3.01 MB (15.54%) ++ js-non-window
├───2.89 MB (14.92%) ── heap-unclassified
├───2.19 MB (11.33%) ++ heap-overhead
├───0.87 MB (04.48%) ── xpti-working-set
├───0.30 MB (01.58%) ++ (8 tiny)
├───0.29 MB (01.48%) ── freetype
├───0.26 MB (01.36%) ++ xpconnect
├───0.26 MB (01.35%) ── layout/style-sheet-cache
├───0.25 MB (01.30%) ++ xpcom
└───0.21 MB (01.07%) ── atom-tables

  0.00 MB ── canvas-2d-pixels
  2.92 MB ── gfx-surface-image
  0.00 MB ── gfx-textures
        0 ── ghost-windows
  0.00 MB ── gralloc
 14.92 MB ── heap-allocated
 17.12 MB ── heap-committed
   14.68% ── heap-overhead-ratio
    2,000 ── host-object-urls
  0.00 MB ── imagelib-surface-cache
  0.29 MB ── js-main-runtime-temporary-peak
        0 ── page-faults-hard
  239,903 ── page-faults-soft
 39.23 MB ── resident
 22.44 MB ── resident-unique
162.10 MB ── vsize


2nd
19.23 MB (100.0%) -- explicit
├───4.43 MB (23.06%) ++ window-objects/top
├───4.05 MB (21.04%) ++ images
├───2.94 MB (15.30%) ++ js-non-window
├───2.77 MB (14.39%) ── heap-unclassified
├───2.62 MB (13.63%) ++ heap-overhead
├───0.87 MB (04.51%) ── xpti-working-set
├───0.29 MB (01.50%) ++ (8 tiny)
├───0.28 MB (01.48%) ── freetype
├───0.26 MB (01.36%) ++ xpconnect
├───0.26 MB (01.36%) ── layout/style-sheet-cache
├───0.25 MB (01.30%) ++ xpcom
└───0.21 MB (01.08%) ── atom-tables

  2.76 MB ── gfx-surface-image
  0.00 MB ── gfx-textures
        0 ── ghost-windows
  0.00 MB ── gralloc
 14.00 MB ── heap-allocated
 16.62 MB ── heap-committed
   18.72% ── heap-overhead-ratio
    2,000 ── host-object-urls
  0.00 MB ── imagelib-surface-cache
  0.29 MB ── js-main-runtime-temporary-peak
        2 ── page-faults-hard
   19,999 ── page-faults-soft
 39.10 MB ── resident
 22.59 MB ── resident-unique
104.36 MB ── vsize
Here are another interesting parts.

9.92 MB (100.0%) -- blob-urls
└──9.92 MB (100.0%) ++ owner(app://gallery.gaiamobile.org/index.html)

    2,000 ── host-object-urls

1.
I decreased the visibility window size, but the blob-urls size did not change.
Maybe we should call revokeObjectURL() to purge blob objects that are no longer used (see the sketch below).
I will check when the blob-urls are created.
2.
We have 2,000 URL objects, the same as the total number of pictures.
If we only created URL objects for the pictures within the visibility window, maybe we could save some memory.
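For illustration, a sketch of what on-demand URLs might look like (hypothetical callbacks with an assumed per-element url property; not the current gallery.js code):

  function showThumbnail(thumbnail, blob) {
    thumbnail.url = URL.createObjectURL(blob);
    thumbnail.style.backgroundImage = 'url(' + thumbnail.url + ')';
  }

  function hideThumbnail(thumbnail) {
    thumbnail.style.backgroundImage = null;
    URL.revokeObjectURL(thumbnail.url); // releases the blob reference
    thumbnail.url = null;
  }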
Hi Tapas and David,
You can try the about:memory tool to dump the memory usage.
It's easy to use.

In the B2G/tools folder you will find get_about_memory.py.
Just run get_about_memory.py; it will connect to the device and pull the memory report back.

You can also run:
./get_about_memory.py -m
to make it trigger a GC before dumping the memory report.
(In reply to Tapas Kumar Kundu from comment #104)
> Hi jerry shih/David Flanagan,
> 
> can we have chat in IRC ? I want to debug this issue together on my device. 
> 
> Please let me know when you are free. We are seeing same issue <100 pic for
> higher resolution image (not attached in this bugid) . 
> 
> I think that we should debug together to conclude quickly.

My nickname is Jerry or Jerry_ in IRC. I also join the #gaia.
Taipei is GMT+8, so it's not easy to find a time that works for both David and me.

I suggest finding the maximum number of high-resolution images that doesn't trigger an OOM,
then using get_about_memory.py to dump the memory report.
Maybe there is a different problem in the high-resolution image case.
Flags: needinfo?(hshih)
I don't have time to read this whole bug, but if you need help figuring out what is going on on the Gecko side, I am in Taipei this week.  It looks like Jerry is on to the cause in comment 119, though.  If I can help, please come find me, or have Jerry come find me.
(In reply to David Flanagan [:djf] from comment #112)
> It seems clear that this is some sort of garbage collection going on.

That's probably caused by the memory pressure monitor. When we detect that available memory drops below a certain threshold we dispatch 'memory-pressure' events to all applications and all subsystems. This not only runs the GC/CC but also flushes caches and other temporary structures. The flash might be caused by this mechanism purging the image cache.

Note that this mechanism is device-dependent; the unagi devices didn't have it, for example, which led to more frequent OOMs when a single application took lots of memory for itself. AFAIK all other devices should have it, but if you want to verify, check with adb whether the following path exists:

/sys/module/lowmemorykiller/parameters/notify_trigger
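For example, something like this from a host shell (a hedged example; exact behavior varies by kernel):

  adb shell cat /sys/module/lowmemorykiller/parameters/notify_trigger

It should print the trigger value if the mechanism is present, and an error like "No such file or directory" otherwise.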
Gabriele: thanks for this useful information.  Tapas: does  your device have the notify_trigger patch? (And do you see the gray flash during scanning when the thumbnails disappear and then come back?)

Kyle: thank you for your offer to help. I'd guess that Jerry would love some help with this.

Jerry: Yes, one blob url per thumbnail, held for the lifetime of the app. I was assured (by bent, probably) that blob urls were inexpensive.  So that requirement is pretty fundamental to the current architecture. Not something we can change in 1.3. In any case, I think we're spending too much time at the Gaia level looking for memory savings.  We've found some good ones. But remember that the app runs fine on every other device we can test it on. Tapas doesn't see the crash on ICS-based devices. And I don't see the crash even when I try to simulate a 480x800 device on my hamachi. I suspect that we now need to be looking at OS-level memory management stuff.  I have no idea how to do that, which is why I unassigned myself from this bug.

In comment #71, Tapas asks whether we should adjust the LMK parameters.  I don't know anything about that, but maybe we should. Or we should at least find out if they are different on Tapas's device than they are on other devices.  Gabriele: do you know about the LMK? If so (and if Jerry does not), then perhaps you are a better person to work on this bug than Jerry is.

The tricky thing about this bug is that the crash only occurs on a qualcomm reference device that no one at mozilla has.  There are apparently some similar QRD devices available at mozilla, but they have more memory than the one that crashes.

The device that crashes has 256mb of memory, a 480x800 screen, and runs JB.  I've tried making the gallery app think it has a 480x800 screen so it uses up more memory, but I cannot reproduce the OOM on my 256mb hamachi.  (Note that the hamachi will OOM if the screen goes to sleep during the scanning, but that is not the issue here.)

In any case, I think the ball is in QC's court right now. We need Tapas to try out the latest patch (Tapas: Jerry's combination patch in comment #118 is probably a good one to try). And we want to know whether his device has the notify_trigger patch and whether he sees the thumbnails going blank and then coming back during the scanning process when memory gets low.

It does not make sense to have 4 Mozilla engineers (Jerry, Kyle, Gabriele and myself) spending time on a bug when none of us has access to the device that exhibits the crash.  But when we hear again from Tapas, perhaps we can figure out which of us is best qualified to continue working with him on it.
Flags: needinfo?(khuey)
Flags: needinfo?(hshih)
Flags: needinfo?(gsvelto)
Using a blob URL will keep the blob alive until it is revoked.  It's cheaper than a data: URI certainly, but it's not cheap.  As long as your blobs are all backed by the disk it's not a big deal (each blob object takes up a few tens of bytes), but if they're in memory I would expect that to be a pretty big problem.
Sorry for the delay. It looks like Jerry's combination patch in comment #118 is a big improvement. We also have NUWA enabled in the latest FFOS 1.3, which saves more memory now. I am still running scripts to collect logs; I will update here soon. Thanks a lot for your work.
Flags: needinfo?(tkundu)
Flags: needinfo?(tkundu)
(In reply to David Flanagan [:djf] from comment #106)
> (In reply to peter chang[:pchang][:peter] from comment #102)
> 
> > David, about memory of "canvas-2d-pixels" above are created by too many
> > local canvas objects[1] for thumbnail creation.
> > Since the width/height of the thumbnail canvas are the same[2], could we use
> > only one canvas object to generate all the thumbnail. I think it will help
> > to reduce the total memory usage during first launch and we don't need to
> > wait GC to cleanup the unused canvas objects.
> > 
> 
> Peter: reusing that canvas is an interesting idea.  Note, however, that
> Jerry's patch attached to this bug solves the same problem by explicitly
> setting the canvas size to 0 after use.  Reusing the cavans might be more
> efficient, but presumably won't save us any more memory than we've already
> saved with Jerry's patch, so I don't think that will help us.

David, I agree that reusing the canvas has memory usage similar to zeroing the canvas size. But I think reusing the canvas could also save the buffer allocations inside gecko, which involve some IPC calls.
(In reply to Jerry Shih[:jerry] from comment #119)
> Here are another interesting parts.
> 
> 9.92 MB (100.0%) -- blob-urls
> └──9.92 MB (100.0%) ++ owner(app://gallery.gaiamobile.org/index.html)
> 
>     2,000 ── host-object-urls
> 

I'm sure that we load all the thumbnails into these blob-urls (both on 1st and 2nd launch).
We can unroll the blob-urls item to see the details.

If I have 2000 images, I can see 2000 blob-urls entries in the memory report. The total blob-urls size becomes large when we have a lot of thumbnails.

9.92 MB (100.0%) -- blob-urls
└──9.92 MB (100.0%) ++ owner(app://gallery.gaiamobile.org/index.html)
   ├──0.00 MB (00.05%) ── blob:0027984b-4209-4198-9cc6-b1aa3954c2d5   #1
   ├──0.00 MB (00.05%) ── blob:0070362d-3c69-4762-91a9-0c1d2424db60
   ...............
   ├──0.00 MB (00.05%) ── blob:03f49825-b5ee-4e6f-9b9b-f686e2ea0373   #2000



(In reply to David Flanagan [:djf] from comment #124)
> Jerry: Yes, one blob url per thumbnail, held for the lifetime of the app. I
> was assured (by bent, probably) that blob urls were inexpensive.  So that
> requirement is pretty fundamental to the current architecture. Not something
> we can change in 1.3. In any case, I think we're spending too much time at
> the Gaia level looking for memory savings.  We've found some good ones. But
> remember that the app runs find on every other device we can test it on.
> Tapas doesn't see the crash on ICS-based devices. And I don't see the crash
> even when I try to simulate a 480x800 device on my hamachi. I suspect that
> we now need to be looking at OS-level memory management stuff.  I have no
> idea how to do that, which is why I unassigned myself from this bug. 

I have no idea about the differences in OS-level memory management between JB and ICS-based platforms.
I can just help find a better way for gaia to improve its memory usage.
Flags: needinfo?(hshih) → needinfo?(dflanagan)
Comment on attachment 8372737 [details] [diff] [review]
mix willReadFrequently, resize canvas and file-base blob

Review of attachment 8372737 [details] [diff] [review]:
-----------------------------------------------------------------

This patch helps to load 2000 images and 100 videos properly. I can also put the gallery app in the background during scanning.
It looks like it saves a good amount of memory. Please also note that NUWA has landed in the latest FFOS 1.3 build, which I tested; NUWA saves some additional memory as well.
But I noticed that the gallery app shows a blank screen if I scroll. The blank screen appears only in the 2000-image case; I tested with 100 images and scrolling was fine.
Please let me know if you are still interested in memory reports/procrank logs for this use case.
Attachment #8372737 - Flags: review+
Attachment #8372737 - Flags: feedback+
Flags: needinfo?(tkundu)
Tapas: What about larger image sizes? You said earlier that larger images caused an almost immediate crash. You've been testing with 2mp images, but I'm assuming that your camera app will be configured to produce 5mp images. Do those work with this patch applied?

Also, as you've pointed out, this bug suggests significant memory management differences between ICS and JB builds.  The Gallery app is the most memory hungry of the core Gaia apps, and in the past it has pushed us to find and fix memory management bugs.  I'm reluctant to close this bug based only on gaia level fixes, because I think we've only masked an underlying issue.  Both Jerry and I have worked on this bug only at the Gaia level. Do you understand what is going on at the LMK and notify_trigger level? Or do you have a colleague who can investigate this bug at that level?  Even if we land this patch and close this bug, I suggest you open a new one to get to the bottom of the underlying issue.

Jerry: before we land your combination patch, we should measure the performance impact on scanning time. The piece I did to ensure that scanning produces file-backed blobs instead of memory-backed blobs probably slows down the first scan time enough to be considered a regression. I may be able to optimize that with some batching. Tapas gave an unsolicited r+ to the patch, but please don't land until we know more about performance.
Flags: needinfo?(dflanagan)
Flags: needinfo?(tkundu)
Flags: needinfo?(hshih)
Comment on attachment 8372737 [details] [diff] [review]
mix willReadFrequently, resize canvas and file-base blob

Clearing the r+ on this patch because we need to check on performance and clean up the code before landing it.
Attachment #8372737 - Flags: review+ → review-
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #125)
> Using a blob URL will keep the blob alive until it is revoked.  It's cheaper
> than a data: URI certainly, but it's not cheap.  As long as your blobs are
> all backed by the disk it's not a big deal (each blob object takes up a few
> tens of bytes), but if they're in memory I would expect that to be a pretty
> big problem.

Kyle: I've patched the bug that was causing them to be in-memory blobs. But Jerry's memory dumps are still showing that 2000 blob urls take up 10mb of memory.  See comment #119.

That is not something we're going to fix here in a 1.3 blocker, but I wonder if it is something worth filing a followup bug about. To either reduce memory usage in Gecko or modify the app so that it doesn't hold so many blob urls.
With no patches applied, my hamachi running 1.3, can scan 200 of the sample photos in 207 seconds. 1.035 seconds per photo.

When I apply the willReadFrequently patch, I can scan 200 photos in 182 seconds or 0.906 seconds per photo.
That's a pleasant surprise: willReadFrequently forces use of a software canvas instead of hardware, so I assumed it saved memory at the expense of time, but I guess it saves us time in this case, too.

Applying the next patch to set the canvas size to 0 after calling toBlob() makes basically no change in time: .905 seconds per photo for 200 photos.

I'll create a rollup patch with all of these fixes applied.  But I still want to investigate whether we need to keep the screen on during scanning, or whether there is some other way to prevent the app from going to the background.

Modifying that patch so that it uses a single shared canvas reduces the time to .897s per photo. This comes at the cost of a fixed size canvas that is never freed.  On the target 480x800 device, I think this will be 160kb of memory (for a 200x200 thumbnail).  That's < 1% improvement in time, so I'm going to go with the simpler patch that allocates and discards a fresh canvas for each thumbnail.

When I add the patch that ensures that we get a file-backed thumbnail blob instead of a memory backed blob, the time per photo goes up only to 0.922s.  This is much, much better than I expected, and is faster than what we started with. 

Tweaking the mediadb batch timeout to 1500ms gets the scan time down to about .88s per photo, but this allows a full 15 photos per batch, which makes the scan seem very unresponsive.  A 500ms timeout gives us batches of 1, and 1000ms often gives us batches that are too big.  If I force batches of 3 (as sketched below), I get an average time of .882s per photo for 201 photos.  That saves 40ms per photo and seems worth doing.
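For illustration, this is the kind of knob being tuned (the option names follow MediaDB's batching options, but treat them as assumptions and check mediadb.js for the real ones):

  // Hold db notifications briefly and cap the batch size so the UI gets
  // small, frequent updates instead of big, laggy ones.
  var photodb = new MediaDB('pictures', metadataParserWrapper, {
    batchHoldTime: 350, // ms to collect notifications into one batch
    batchSize: 3        // flush a batch once it reaches 3 files
  });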
(In reply to David Flanagan [:djf] from comment #130)
> Tapas: What about with larger image sizes? You said earlier that larger
> images caused an almost immediate crash. You've been testing with 2mp
> images, but I'm assuming that your camera app will be configured to produce
> 5mp images. Do those work this this patch applied?

Yes it works fine.

> 
> Also, as you've pointed out, this bug suggests significant memory management
> differences between ICS and JB builds. 
Yes, JB has ZRAM, KSM, etc.

> Do you
> understand what is going on at the LMK and notify_trigger level? 

It does not seem to me that it's a good idea to allow an app to consume >100MB of memory on a 256MB device. I also don't like the idea of tweaking LMK values for only one app ('gallery'). Still, I will look into it and update here asap.

> 
> Jerry: before we land your combination patch, we should measure the
> performance impact on scanning time.

Please also check why we are seeing a blank screen during scrolling with 2000 images with this patch. I can give you access to my device if you want. Please let me know.

Thanks a lot.
(In reply to Tapas Kumar Kundu from comment #134)

> 
> > Do you
> > understand what is going on at the LMK and notify_trigger level? 
> 
> It does not seem to me that its a good idea to allow an app to consume
> >100MB memory on 256MB device. 

Wait, are you saying that your device is configured to kill an app that exceeds 100MB? If so, I assume that is the underlying issue right there.  Is the notify trigger stuff set up to send memory pressure notifications when this artificial 100mb limit is neared?  Maybe that is what was going wrong: you're setting a memory limit that is defeating the memory pressure notifications?

> I also don't like the idea of tweaking LMK
> values for only one app 'gallery'. Still I will look in it and update asap
> here.

Do we have gallery-specific LMK tweaks on ICS builds? I honestly know nothing at all about this.

> 
> Please also check why we are seeing blank screen during scroll for 2000
> images with this patch. 

You would have seen that without this patch had you been able to scan the images. Our visibility monitor code does not set the backgroundImage of thumbnails that are far off screen. Then it sets backgroundImage as they get closer to scrolling onscreen.  It is a time/memory tradeoff.  If you keep more thumbnails swapped in, it takes more memory but you see fewer blank squares.  You've got a bigger screen on this device so your device has to work harder to load and decode the thumbnails when they're needed.  

There are parameters you can tweak if you want.  Look for the visibility monitor in gallery.js. You can change one of the parameters to swap images in more responsively, at the cost of jankier scrolling; see the sketch below.  Currently we're tuned for 320x480 devices, so you'll possibly want to adjust those values.
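For illustration only, the setup being tuned looks something like this (the function and parameter names are reconstructed from memory and should be treated as assumptions; check gallery.js and the shared visibility monitor code for the real ones):

  // A larger margin keeps more offscreen thumbnails decoded: more memory,
  // fewer gray squares. A smaller margin saves memory but shows more
  // blanks during fast scrolling.
  var visibilityMonitor = monitorChildVisibility(
    thumbnails,          // container of the thumbnail elements
    360,                 // margin in pixels above/below the visible area
    thumbnailOnscreen,   // decode and display a thumbnail
    thumbnailOffscreen); // release its decoded image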

But this is completely unrelated to this bug. You can file a new bug for it if you want, but I don't think it is something we can block on, because there's really nothing we can do: it is a fundamental memory vs. time tradeoff.
Attached file patch for the 1.3 branch (deleted) —
Jerry,

This patch is a cleaned up version of the patch you attached. Note that most of the changes in gallery.js are just an indentation level change.

If you are not comfortable reviewing the mediadb.js changes, please ask dkuo if he can do that part.

Note that with these patches applied, I was no longer seeing the issue of the gallery being killed when the screen had blanked, so I have not included the patch to force the screen to stay on.

I'll produce a separate version of the patch for the master branch.
Assignee: hshih → dflanagan
Attachment #8368974 - Attachment is obsolete: true
Attachment #8369887 - Attachment is obsolete: true
Attachment #8371256 - Attachment is obsolete: true
Attachment #8372692 - Attachment is obsolete: true
Attachment #8372737 - Attachment is obsolete: true
Attachment #8373686 - Flags: review?(hshih)
Attached file patch for master (deleted) —
Jerry: this is the same patch as the other one, but with a couple of merge conflicts manually resolved.
Attachment #8373696 - Flags: review?(hshih)
>> Is the notify trigger stuff set up to send memory pressure notifications when this artificial 100mb limit is neared?  Maybe that is what was going wrong

It is not the case. LMK sees only available memory, and it will kill any app based on priority. LMK does not seem to be an issue for any other apps. I will file a separate bug if I see a problem with this later.

>>Look for visibility monitor in gallery.js. You can change one of the parameters to swap images in more responsively in a tradeoff for jankier scrolling.  Currently we're tuned for 320x480 devices, so possibly you'll want to adjust those values.

Thanks for helping. I will file a separate bug if needed.

(In reply to David Flanagan [:djf] from comment #130)
> Jerry: before we land your combination patch, we should measure the
> performance impact on scanning time. The piece I did to ensure that scanning
> produces file-backed blobs instead of memory-backed blobs probably slows
> down the first scan time enough to be considered a regression. I may be able
> to optimize that with some batching.

@mvikram: please suggest
Flags: needinfo?(tkundu) → needinfo?(mvikram)
Tapas, please run a test to determine if the first time scanning of the images has regressed. Let's decide on next steps after.
Flags: needinfo?(mvikram)
Hi all,

At Jerry's request, I changed the code to revoke the URLs for memory testing. A WIP can be found here (without David's patch):

https://github.com/huchengtw-moz/gaia/commit/e8e60896bb4f9ad055a9875a934305daca9d985d

The following report was generated with both this WIP and David's patch applied: the blob-urls count becomes 187 and the blob-urls memory size drops. That means the blob's buffer is freed when we revoke the URL, and the buffer is loaded back into memory when we recreate the URL and set it as the background image.

Note: this WIP patch breaks some of the gallery app's unit tests and regression tests.

0.68 MB (100.0%) -- blob-urls
└──0.68 MB (100.0%) -- owner(app://gallery.gaiamobile.org/index.html)
   ├──0.00 MB (00.55%) ── blob:02c41822-f958-44bb-adfb-84b792909deb (1)
   ├──0.00 MB (00.55%) ── blob:03a03b34-322b-4bcc-88d1-c16385535c75 (2)
   ├──0.00 MB (00.55%) ── blob:03f8edae-8894-4aad-8afe-5ea395b7a62c (3)
   ...
   ├──0.00 MB (00.55%) ── blob:09a9f20f-1efe-4f81-b728-b4d53bef5284 (187)


 0.00 MB ── canvas-2d-pixels
 0.00 MB ── gfx-surface-image
 0.00 MB ── gfx-textures
       0 ── ghost-windows
 0.00 MB ── gralloc
 9.21 MB ── heap-allocated
10.34 MB ── heap-committed
  12.25% ── heap-overhead-ratio
     187 ── host-object-urls ######
 0.00 MB ── imagelib-surface-cache
 0.29 MB ── js-main-runtime-temporary-peak
       1 ── page-faults-hard
 539,809 ── page-faults-soft
31.93 MB ── resident
15.21 MB ── resident-unique
96.86 MB ── vsize
Tapas: see comment 133. I did the timing tests, and it turns out that this patch actually improves scanning speed as well as reducing memory usage.
(In reply to David Flanagan [:djf] from comment #133)
> When I apply the willReadFrequently patch, I can scan 200 photos in 182
> seconds or 0.906 seconds per photo.
> That's a pleasant surprise: willReadFrequently forces use of a software
> canvas instead of hardware, so I assumed it saved memory at the expense of
> time, but I guess it saves us time in this case, too.

With SkiaGL, read operations may spend a lot of time loading data back from the GPU (from a GL texture).
The GPU is good for rendering, but not necessarily good for reading data back.
Flags: needinfo?(hshih)
Comment on attachment 8373686 [details]
patch for the 1.3 branch

Hi Dominic,
Could you review this patch?
I'm not a js expert.
Thanks!
Attachment #8373686 - Flags: review?(hshih) → review?(dkuo)
Attachment #8373696 - Flags: review?(hshih) → review?(dkuo)
John and Jerry,

Please file a separate bug for investigation of the blob urls.  I feel that fixing that is too risky for a 1.3 patch.

John: what do you mean when you say "blob's buffer"?  These are file-backed blobs, not memory-backed blobs, and creating a URL that references them should not be loading them into memory.  Do you have any idea why each blob URL takes 3.4kb?  Is that actually the size of the thumbnail blobs? Gecko engineers tell me that creating a blob URL is cheap and that referencing a file backed blob with a blob url will not bring it into memory.  If so, why are they so expensive here?  I wonder if there is a gecko bug here.

Also, before we go too far down the road of eliminating the blob urls, I also want to consider the possibility of eliminating the visibility monitor and converting the thumbnails to <img> elements, because I've been told that gecko will now do the image decoding and freeing based on visibility for us.  But if we did that, then we would have to have blob urls for each thumbnail.  So I don't want to eliminate them right away before investigating why they are not cheaper.

Please don't reply here, though. Let's start a new bug for this.
John: I'd be happy with your review, too, if you're free and Dominic is not, especially since you've already started looking at this.
Move the blob-url object problem to bug 970801
(In reply to Tapas Kumar Kundu from comment #134)
> It does not seem to me that its a good idea to allow an app to consume
> >100MB memory on 256MB device.

Actually we put a great deal of work into the v1-train timeframe ensuring that a large, foreground application could stay alive on 256MB devices, and I don't really see any downsides to this. So I don't think we should adjust LMK parameters in this case, and certainly not to catch memory regressions: I'd be far more keen on adding additional testing (maybe marionette-based?) that reproduces known high-memory-usage scenarios and automatically extracts memory profiles to ensure we're not regressing.
Flags: needinfo?(gsvelto)
Target Milestone: --- → 1.4 S1 (14feb)
(In reply to Jerry Shih[:jerry] from comment #143)
> Comment on attachment 8373686 [details]
> patch for the 1.3 branch
> 
> Hi Dominic,
> Could you review this patch?
> I'm not a js expert.
> Thanks!

No problem. I just finished reading all the comments, so now I know what's going on here; I will continue to review and test the patches later.
Comment on attachment 8373696 [details]
patch for master

David,

I was not involved in this bug thread at the beginning, but hopefully I am on the right track and reviewing your patch correctly. In the patch I see four major changes:

1. The |willReadFrequently| flag of canvas.getContext() is set to true, which, as Gabriele explained in comment 97, saves memory, and per your test results it saves time as well.
2. Set the width and height of the canvas to 0, then drop the context and canvas, to force the memory to be released after we have created the blobs. This is what Jerry found in comment 83 and fixed in comment 92.
3. Add getFileInfo() to MediaDB so that MediaDB can return file-backed blobs instead of memory-backed blobs when adding the thumbnail to the DOM tree. This is because you found that memory-backed blobs cost too much memory; we should use file-backed blobs from the disk.
4. Adjust the batch size and timeout to save time parsing each image. This is because the flow of parsing images changed a little, so gallery has to adapt by tweaking the batch size and timeout.

My configuration is different from everyone else's: it's an inari, but that should be similar to a hamachi. After applying David's patch, I did see the memory usage drop (using b2g-procrank) at first launch, though I didn't get a crash either before or after applying the patch, probably because I only tested a few times.

So if my understanding is correct then this patch looks good to me! David, please correct me if I am wrong with the description above, thanks. And everyone rocks!
Attachment #8373696 - Flags: review?(dkuo) → review+
Comment on attachment 8373686 [details]
patch for the 1.3 branch

The patch for 1.3 is also r+ because it is the same patch with the merge conflicts resolved.
Attachment #8373686 - Flags: review?(dkuo) → review+
Tapas, please try out this patch and verify if it solves the issue on our end.
Flags: needinfo?(tkundu)
Landed on master: https://github.com/mozilla-b2g/gaia/commit/fdc7a7e422470492e2ea107f8d48651bb5d4ebf2
Landed on 1.3: https://github.com/mozilla-b2g/gaia/commit/b1a249ba6e9db3c1d0f91530e3c3a91ce9ee091f

These are solid memory wins for the gallery app, so I'm going to resolve this bug.  Tapas, if you find that this does not resolve the issue for you, let's open a new bug.
Status: NEW → UNCONFIRMED
Ever confirmed: false
Set the status wrong. Marking resolved now.
Status: UNCONFIRMED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
We almost certainly want to be sure this gets on the 1.3T branch since it is memory-related. I don't know who handles those uplifts. I would guess that the 1.3 patch would uplift cleanly if anyone wants it.
Flags: needinfo?(dhuseby)
Flags: needinfo?(tkundu)
(In reply to Inder from comment #151)
> Tapas, please try out this patch and verify if it solves the issue on our
> end.

works as expected
Joe, see David's comment 154 re: 1.3T inclusion.
Flags: needinfo?(mlee) → needinfo?(jcheng)
Thanks
1.3T? to get this in Tarako triage
blocking-b2g: 1.3+ → 1.3T?
Flags: needinfo?(jcheng)
(In reply to Joe Cheng [:jcheng] from comment #157)
> Thanks
> 1.3T? to get this in Tarako triage

This is a CS blocker - this stays as a 1.3+. You need to request a tree manager to uplift this, not flip the blocking flag.
blocking-b2g: 1.3T? → 1.3+
No worries, we merge 1.3 to 1.3t daily.
Flags: in-moztrap?
Flags: in-moztrap? → in-moztrap+
Whiteboard: [c=memory p= s= u=1.3] [CR 611568][ETA: n/a] → [caf priority: p2][c=memory p= s= u=1.3] [CR 611568][ETA: n/a]