Closed
Bug 919864
Opened 11 years ago
Closed 11 years ago
Leaking screenshot blobs in system app via blob URIs
Categories
(Firefox OS Graveyard :: Gaia::System, defect, P1)
Tracking
(blocking-b2g:koi+, b2g-v1.2 fixed)
Tracking | Status | |
---|---|---|
b2g-v1.2 | --- | fixed |
People
(Reporter: m1, Assigned: alive)
References
Details
(Keywords: perf, Whiteboard: [MemShrink:P1] [c=memory p= s= u=1.2])
Attachments
(3 files, 1 obsolete file)
For the past 3-4 days we've consistently observed abnormal USS growth in the b2g process.
STR:
1) Run orangutan for 8 hours while logging |adb shell busybox watch b2g-procrank|.
Initial memory usage is usually something like:
APPLICATION PID Vss Rss Pss Uss cmdline
b2g 125 104368K 67620K 58470K 50860K /system/b2g/b2g
and then ~8 hours later:
APPLICATION PID Vss Rss Pss Uss cmdline
b2g 125 189144K 152396K 143676K 142460K /system/b2g/b2g
Above was run on a 512MB device, but on a 256MB device the UI would likely be useable at this point. The expectation is that b2g memory usage remain well under 100MB at all times.
Reporter | ||
Comment 1•11 years ago
|
||
(I've filed this bug /early/ to get more eyes on it. I'm still gathering data and will share as it becomes available. It would be very good for Moz folks to try to reproduce as well, the STR are pretty simple. This bug has been reproduced here on a build without any commercial XPCOM extensions as well FYI.)
blocking-b2g: --- → koi+
Updated•11 years ago
|
Whiteboard: [MemShrink]
Comment 2•11 years ago
|
||
Not sure if it's related, but I'm trying to diagnose an OoM problem in bug 851626. In that bug, I can't reproduce the problem with a b2g18 build, so it's something we've introduced since then.
Reporter | ||
Comment 3•11 years ago
|
||
Partial about:memory from an occurrence of this bug overnight, looks like we have a big pile of dark matter :(
105.31 MB (100.0%) -- explicit
├───60.50 MB (57.45%) ── heap-unclassified
├────7.39 MB (07.02%) -- window-objects
├────6.10 MB (05.79%) -- images
├────4.74 MB (04.51%) -- heap-overhead
├────3.24 MB (03.07%) ++ (15 tiny)
├────2.78 MB (02.64%) -- dom
├────2.45 MB (02.32%) -- workers/workers()
└────1.35 MB (01.28%) ── freetype
Reporter | ||
Comment 4•11 years ago
|
||
USS=104144K. Top DMD entry is 52.25% of the heap, with next entry is is 1.91% and then the rest is noise (not shown).
Comment 5•11 years ago
|
||
Mike, is there someone on your team who could take a look at this?
Flags: needinfo?(mlee)
Comment 6•11 years ago
|
||
Jed, need you to work with Kyle to troubleshoot and if possible, fix this.
Assignee: nobody → jld
Status: NEW → ASSIGNED
Flags: needinfo?(mlee) → needinfo?(jld)
Keywords: perf
Priority: -- → P1
Whiteboard: [MemShrink] → [MemShrink] [c=memory p= s= u=1.2]
Looks like we're leaking blobs here or something.
So the obvious step forward here is to get GC/CC logs and see if we can find the blobs and what is holding them alive.
If we can't do that then we can dive in with gdb if we can reproduce.
mvines can you get the GC/CC logs for jld?
Flags: needinfo?(mvines)
Reporter | ||
Comment 9•11 years ago
|
||
In this run we saw USS of b2g main process start at 48508K, finish at 148640K.
gecko: 72011db807e6a5e46462676974cd45e208e1a3e3
gaia: 85c4af3f48a91878d565f518ba0eed68f0628e21
(recent v1.2 bits)
Flags: needinfo?(mvines)
Comment 10•11 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #8)
> So the obvious step forward here is to get GC/CC logs and see if we can find
> the blobs and what is holding them alive.
Maybe I'm missing something, but Blobs don't look either cycle collected nor related to the GC heap, so I don't think the CC/GC logs will be very useful. At least, there doesn't seem to be much related to blobs in the CC log.
If they're being used by JS they'll have JS reflections and we might be able to trace them back that way. But yes, they probably won't appear in the CC log.
So I actually don't see anything that looks like a blob in the GC log :-(
However I do see 218 unique blob URI strings in it. I wonder if we're leaking blob URIs.
Depends on: 924696
Does orangutan use the cards view at all? I see some stuff there that looks a little sketchy.
Flags: needinfo?(mvines)
Comment 14•11 years ago
|
||
I think there are 7 Blobs in the GC log, but I can't tell if they are the leaky ones or not. jld noticed that from the DMD log the average size of the blob string things is 128k.
Comment 15•11 years ago
|
||
A little less than 128k — 42857254 / 337 = 127172 = 124 * 1024 + 196. (Or, for the number with shoes on: 43495424 / 337 = 129066 = 126 * 1024 + 42.)
So probably most of them are 128k and some are smaller. Or half of them are 256k. Or... something.
I don't suppose there's any chance b2g can pretend to be a reasonable platform and give me a core dump and debug info? Or will I have to reproduce this locally and live-debug it? Because at this point the next thing I'd do is walk the jemalloc metadata.
Flags: needinfo?(jld)
Comment 16•11 years ago
|
||
If we're leaking in the parent process, this should all be reproducible in the desktop builds where you will get dumps and debug info.
Reporter | ||
Comment 17•11 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #13)
> Does orangutan use the cards view at all? I see some stuff there that looks
> a little sketchy.
Yeah, every N events it sends a long home key to bring up the cards view.
Flags: needinfo?(mvines)
Comment 18•11 years ago
|
||
So, going into IPCMessageUtils.h and making ParamTraits<nsACString>::Read log whenever the string is >4KiB gets me messages like this if I enter the window manager from an app:
I/Gecko ( 3720): Read large nsACString: 23642
I/Gecko ( 3720): Read large nsACString: 32378
I/Gecko ( 3720): Read large nsACString: 32378
If I go to the homescreen instead, it's like this:
I/Gecko ( 3720): Read large nsACString: 32518
I/Gecko ( 3720): Read large nsACString: 32518
I suspect that my hamachi's screen is smaller than the one used to discover this bug.
Reporter | ||
Comment 19•11 years ago
|
||
(In reply to Jed Davis [:jld] from comment #18)
> I suspect that my hamachi's screen is smaller than the one used to discover
> this bug.
Perhaps physically, but the screen resolution is identical with the QRD device that we're running here.
Comment 20•11 years ago
|
||
I tried using DMD. It crashed on startup, and gdb doesn't want to do anything symbolic with the result. However:
Program received signal SIGSEGV, Segmentation fault.
0x400129b0 in ?? ()
(gdb) x/7i $pc
=> 0x400129b0: ldr r3, [r0, #0]
0x400129b2: lsls r2, r3, #1
0x400129b4: ite mi
0x400129b6: orrmi.w r3, r3, #2147483648 ; 0x80000000
0x400129ba: bicpl.w r3, r3, #2147483648 ; 0x80000000
0x400129be: adds r0, r0, r3
0x400129c0: bx lr
This is an implementation of prel31 resolution from the ARM exception handling spec. The address in question:
(gdb) p/x $r0
$1 = 0x42c6c43c
But also, just because gdb is confused doesn't mean I am.
4000d000-40015000 r-xp 00000000 1f:01 932 /system/b2g/libdmd.so
$ addr2line -ie objdir-gecko/dist/bin/libdmd.so 0x59b0
/tmp/android-build-bb7e003d31d08f72cabc269a652912b7/src/build/../gcc/gcc-4.4.3/libgcc/../gcc/config/arm/unwind-arm.c:535
As for the would-be exception index entry:
42c65000-42c6b000 r-xp 00000000 1f:01 724 /system/lib/hw/sensors.default.so
42c6b000-42c6c000 rwxp 00006000 1f:01 724 /system/lib/hw/sensors.default.so
42c85000-42e47000 rwxs 00e67000 00:0b 2587 /dev/graphics/fb0
So maybe it's sensors.default.so. Let's try `readelf -u`:
[...]
0x4790: 0x1 [cantunwind]
0x2368: 0x80b0b0b0
[...]
It violates the ABI by having exception index entries out of order; this is bug in ICS's version of gold, but I don't know yet if it's why the unwinder is going out of bounds.
Of course, none of this gets me any closer to finding the original memory leak.
Comment 21•11 years ago
|
||
I got some help on blobs and createObjectURL from :khuey on IRC, and we found a leak in window_manager.js. (Also, a simpler way to check for this kind of leak: break into gdb and print gDataTable->mTable.entryCount.)
I've attached a patch against v1.2 gaia which should avoid this leak, and (in case we still need more info) logs a message when it does so. There's also a less serious issue in that we hold on to a screenshot even after an app is closed; I haven't tried to address that.
This patch conflicts with bug 920890, and won't apply to master, and also won't apply to v1.2 after that bug is uplifted (it's currently waiting on merge resolution).
All that said: Does this fix the original leak?
Flags: needinfo?(mvines)
Updated•11 years ago
|
Nice work guys!
Comment 23•11 years ago
|
||
Bug 920890 is uplifted now, so I need to see what's going on post-that....
Flags: needinfo?(jld)
Comment 24•11 years ago
|
||
Screenshot leaks I've found on master:
1. With an app open, repeat: long-press the home button, then touch the card to return to that app.
2. Open an app, go to card view, kill app by swiping it upwards; repeat with the same app.
3. Open more apps than will fit in memory, then go to card view; repeat with the same sequence of apps.
I've been testing leaks by attatching with gdb and looking at gDataTable.mTable.entryCount; if I can do a sequence of actions repeatedly and the count keeps going up, I figure it's probably a leak.
#1 is fixable by revoking _screenshotURL before reassigning it. #2 is fixable by revoking the screenshot during removeFrame in window_manager.js. #3 I don't understand what's going on. So I should probably hand off the bug, I think.
Assignee: jld → alive
Flags: needinfo?(mvines)
Flags: needinfo?(jld)
Updated•11 years ago
|
Attachment #815173 -
Attachment is obsolete: true
I have some ideas on how we can make this better. I'll touch base with alive next week.
Assignee | ||
Comment 26•11 years ago
|
||
Just noticed I am the assignee :)
Assignee | ||
Comment 27•11 years ago
|
||
I do some improvement about screenshot in https://bugzilla.mozilla.org/show_bug.cgi?id=925609
not sure it satisfies all cases in comment 24.
Assignee | ||
Comment 28•11 years ago
|
||
For the screenshot doesn't revoked when app is terminated:
Because currently the kill function is living in window manager instead of app window, so the revokeObjectURL in appWindow is never called. We could improve this by moving mozbrowsererror event handler into appWindow.
Assignee | ||
Comment 29•11 years ago
|
||
Does this make sense?: Keep the blob instead of created ObjectURL and create a new ObjectURL each time we want to display the screenshot, revoking it immediately we show the screenshot.
(In reply to Alive Kuo [:alive] from comment #29)
> Does this make sense?: Keep the blob instead of created ObjectURL and create
> a new ObjectURL each time we want to display the screenshot, revoking it
> immediately we show the screenshot.
Yes. This is sort of what I was thinking in comment 25. There is even a provision in the spec for blob URIs that revoke themselves after being used once, but I don't know that we implemented that yet.
Also we still have to make sure our references to the blob are cleaned up, but we can rely on the GC to do that if we get rid of the objects the blobs are stored on.
Assignee | ||
Comment 32•11 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #30)
> Yes. This is sort of what I was thinking in comment 25. There is even a
> provision in the spec for blob URIs that revoke themselves after being used
> once, but I don't know that we implemented that yet.
If it's automatically revoked already then currently we should have some flicking when we from an app go to cardview :/
Anyway I am going to have a patch for
1. Be sure to revoke objectURL when app is killed
2. Store blob instead objectURL in appWindow and create/revoke objectURL via stored blob each time we need to display image.
My question is how do I know I have already fixed the leak problem? No idea how to confirm.
(In reply to Alive Kuo [:alive] from comment #32)
> My question is how do I know I have already fixed the leak problem? No idea
> how to confirm.
We can get mvines to have his people do some QA on the patch. jld also managed to reproduce several different issues in comments 24, perhaps he can comment on that.
Flags: needinfo?(jld)
Comment 34•11 years ago
|
||
I tried to see if I could automate the cases I found, so other people could use them more easily.
Orangutan needs a certain amount of manual makefile editing to get it built with a b2g tree and pushed to a b2g device, and then I couldn't get it to actually generate any kind of effect.
I also tried Marionette, but it looks like I'd need to spend some time reading through the system app to figure out how to use it automate these actions.
It might also help if we exposed the number / total size of blobs — in about:memory, to sufficiently privileged script, that kind of thing. That might be bug 924696.
Flags: needinfo?(jld)
Reporter | ||
Comment 35•11 years ago
|
||
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #33)
> We can get mvines to have his people do some QA on the patch.
Yes, we can certainly try out a v1.2 patch in our setup here pretty easily. Which patch did you want our monkeys to kick and scream at?
Updated•11 years ago
|
Target Milestone: --- → 1.2 C3(Oct25)
Assignee | ||
Comment 36•11 years ago
|
||
(In reply to Michael Vines [:m1] [:evilmachines] from comment #35)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #33)
> > We can get mvines to have his people do some QA on the patch.
>
> Yes, we can certainly try out a v1.2 patch in our setup here pretty easily.
> Which patch did you want our monkeys to kick and scream at?
Working in Progress, stay tuned :)
Assignee | ||
Comment 37•11 years ago
|
||
Better blob and objectURL for screenshot management.
Michael, please note this is based on https://bugzilla.mozilla.org/show_bug.cgi?id=925609
Attachment #816926 -
Flags: review?(timdream)
Attachment #816926 -
Flags: feedback?(mvines)
Updated•11 years ago
|
Reporter | ||
Comment 38•11 years ago
|
||
(In reply to Alive Kuo [:alive] from comment #37)
> Created attachment 816926 [details]
> https://github.com/mozilla-b2g/gaia/pull/12842
This patch doesn't apply on gaia/v1.2, could you please rebase it.
Updated•11 years ago
|
Whiteboard: [MemShrink] [c=memory p= s= u=1.2] → [MemShrink:P1] [c=memory p= s= u=1.2]
Assignee | ||
Comment 39•11 years ago
|
||
(In reply to Michael Vines [:m1] [:evilmachines] from comment #38)
> (In reply to Alive Kuo [:alive] from comment #37)
> > Created attachment 816926 [details]
> > https://github.com/mozilla-b2g/gaia/pull/12842
>
> This patch doesn't apply on gaia/v1.2, could you please rebase it.
Don't need rebase, it's because 925609 isn't uplifted.
https://github.com/mozilla-b2g/gaia/pull/12872 here for v1.2 including 925609 commit.
Component: General → Gaia::System
Summary: b2g process memory leak on v1.2 → Leaking screenshot blobs in system app via blob URIs
Reporter | ||
Comment 40•11 years ago
|
||
Thanks. Running monkeys with these two patches now, will report back with data in a day or two.
Flags: needinfo?(mvines)
Updated•11 years ago
|
Attachment #816926 -
Flags: review?(timdream) → review+
Reporter | ||
Comment 41•11 years ago
|
||
USS of the b25 process looked pretty good overnight, maxing out at 73mb-ish. So far so good, I think we should land the fix.
Flags: needinfo?(mvines)
Reporter | ||
Updated•11 years ago
|
Attachment #816926 -
Flags: feedback?(mvines) → feedback+
Great work jed, alive.
Assignee | ||
Comment 43•11 years ago
|
||
Travis red is unrelated. Merged
https://github.com/mozilla-b2g/gaia/commit/56ce4212d45941536ffd814b1c92167af91e0efe
Thank everyone in the bug: Kyle, Jed, Michael.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Comment 44•11 years ago
|
||
Uplifted 56ce4212d45941536ffd814b1c92167af91e0efe to:
v1.2: a884560d00563def6f2fda20ab1b6a53c855ed4a
status-b2g-v1.2:
--- → fixed
You need to log in
before you can comment on or make changes to this bug.
Description
•