Closed
Bug 985193
Opened 11 years ago
Closed 10 years ago
High heap unclassified/display driver crash during slideshow
Categories
(Core :: Graphics, defect)
Tracking
()
RESOLVED
FIXED
mozilla38
People
(Reporter: ferongr, Assigned: seth)
References
(Blocks 1 open bug, )
Details
(Whiteboard: [MemShrink:P2])
Attachments
(6 files, 4 obsolete files)
(deleted),
application/gzip
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
application/gzip
|
Details | |
(deleted),
image/svg+xml
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
lmandel
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20100101 Firefox/31.0 ID:20140318030202 CSet: 082761b7bc54 2014-03-18 Nightly build.
STR
1. Follow URL
2. Enable auto-next at the bottom left panel and set it to a low value (for faster reproduction, the crash eventually happens regardless of speed).
3. Wait for a minute or so
4. Observe about:memory
5. Wait for 5 minutes or so
6a. With layers.offmainthreadcomposition.enabled;false: Display driver/DWM crash (resolution switches to 640x480, theme switches to Aero Basic, Windows recovers after a few seconds). Nightly becomes inoperable (window shrinks completely showing only a small titlebar), requiring to restart it.
6b. With layers.offmainthreadcomposition.enabled;true: The browser crashes. Example crash ID: 203e780f-5466-400a-9132-e02802140318
Without OMTC, the heap-unclassified leaf node keeps growing until the driver crashes. With OMTC enabled it's the GFX -> heap-textures leaf node that keeps growing.
Forcing a GC/CC without OMTC does lower the memory used (heap-unclassified goes to normal levels). With OMTC though the memory stays around until the browser is restarted.
GPU: AMD HD6850 with 13.12 Catalyst.
Blocks: DarkMatter
Updated•11 years ago
|
Component: General → Graphics
Whiteboard: [MemShrink]
Comment 3•11 years ago
|
||
Bad behaviour with and without OMTC, yikes.
Whiteboard: [MemShrink] → [MemShrink:P2]
Comment 4•11 years ago
|
||
In my testing images uncompressed heap never went up with OMTC on or off, so I'm guessing the presence of significant images uncompressed heap in the OMTC off report attachment is due to other activity in the session?
There was lots of memory usage in heap-textures and/or heap-unclassified though.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Comment 5•11 years ago
|
||
On mac (with omtc on, the default) heap-textures grows to 1 GB in about a minute. So this seems bad.
No, I don't think so. The snapshot with OMTC off is with a brand-new profile and a session a few seconds old. The OMTC snapshot is from my normal profile but I will upload a new snapshot with a fresh profile and OMTC enable in a bit just to be sure.
Attachment #8393206 -
Attachment is obsolete: true
Comment 8•11 years ago
|
||
(In reply to ferongr from comment #6)
> No, I don't think so. The snapshot with OMTC off is with a brand-new profile
> and a session a few seconds old. The OMTC snapshot is from my normal profile
> but I will upload a new snapshot with a fresh profile and OMTC enable in a
> bit just to be sure.
Interesting. I tested on a Windows machine (previous results were for mac) and I still never get any values on images uncompressed over 1 MB. I wish I could reproduce what you are seeing.
Either way those numbers are but a sideshow to heap-textures and heap-unclassified.
Comment 9•11 years ago
|
||
Ah, so I can get large images uncompressed-heap numbers in the release channel. I'm guessing that bug 962670 moved that to heap-textures or unclassified?
The site users background-image to show the images, for which we don't have any smarts about keeping around decoded data like we do for img elements.
Reporter | ||
Comment 10•11 years ago
|
||
In my case, it's the layers.offmainthreadcomposition.enabled (that's false by default on Windows Nightly) pref that makes the high number go from heap-unclassified to heap-textures.
Comment 11•11 years ago
|
||
We appear to be discarding images fine. The difference between the number of RasterImage::DecodingComplete calls and RasterImage::Discard calls (on non-chrome images) is never more than 15 even after my laptop start going into an unusable state due to memory use. And I traced one Discard call down to the vm_deallocate call, so that appears to be working fine. So the problem must be somewhere else.
Comment 12•11 years ago
|
||
The images are getting layerized and the image container that RasterImage uses seems to be the source of the problem. We don't touch it in ::Discard, so if proper disposal of it is happening must be done by the layers subsystem.
Comment 13•10 years ago
|
||
Looks like this is still an issue. With a late Nightly 37 on Windows, after about an hour I see:
463.56 MB (100.0%) -- explicit
├──273.78 MB (59.06%) ── heap-unclassified
├──117.12 MB (25.27%) -- images
5,468.42 MB ── gpu-committed
367.62 MB ── gpu-dedicated
122.00 MB ── gpu-shared
422.81 MB ── heap-allocated
473 ── heap-chunks
1.00 MB ── heap-chunksize
426.46 MB ── heap-committed
473.00 MB ── heap-mapped
0.86% ── heap-overhead-ratio
2.42 MB ── imagelib-surface-cache-estimated-locked
2.42 MB ── imagelib-surface-cache-estimated-total
0.34 MB ── js-main-runtime-temporary-peak
0 ── low-commit-space-events
734.10 MB ── private
139.82 MB ── resident
3,906.76 MB ── vsize
17.00 MB ── vsize-max-contiguous
VMMap says I have about 2700MB of "Shareable" write-combine regions. (Is the gpu-committed number double-counting?)
A sample WinDbg !address dump on one of those regions looks like:
Usage: MappedFile
Base Address: dab00000
End Address: db600000
Region Size: 00b00000
State: 00001000 MEM_COMMIT
Protect: 00000404 PAGE_READWRITE|PAGE_WRITECOMBINE
Type: 00040000 MEM_MAPPED
Allocation Base: dab00000
Allocation Protect: 00000404 PAGE_READWRITE|PAGE_WRITECOMBINE
Mapped file name: PageFile
I can't tell where those regions are coming from. AFAICT it's not through the VirtualAlloc or MapViewOfFile families.
On the bright side, those regions do get cleaned up if I navigate away and Minimize Memory.
Comment 14•10 years ago
|
||
So I had a look at gpuview trace of this situation. We definitely seem to be accumulating gpu allocations. I've attached a graph of gpu allocations from David's gpuview trace.
Comment 15•10 years ago
|
||
Looking more closely, it appears as though we are accumulating textures. I do not yet know why or how.
Comment 16•10 years ago
|
||
I can reproduce this locally and it looks like the gpuview logs contain allocation stacks so I should be able to figure out what's going on tomorrow.
Comment 17•10 years ago
|
||
Here's a sample deallocation from when everything gets freed.
Comment 18•10 years ago
|
||
It looks like imagelib is keeping all of images on the page in the image cache. This is keeping a bunch of the gpu stuff alive.
How are images supposed to be removed from the cache? What's keeping them there on this page?
Comment 19•10 years ago
|
||
A few more details.
mCache is growing in size. The cacheQueue does not have very many items in it. It seems like the imageRequestProxy destructor is not being called much.
Flags: needinfo?(seth)
Updated•10 years ago
|
Flags: needinfo?(tnikkel)
Comment 20•10 years ago
|
||
I think the problem is that there is nothing to release mImageContainer on RasterImage. So as long as the image is kept alive (the imgCache is only for keeping the source data around) the image container sticks around. Instead it should be treated the same as the rest of the decoded data for images.
I think all we should need is nulling out mImageContainer in RasterImage::Discard.
Flags: needinfo?(tnikkel)
Comment 21•10 years ago
|
||
(In reply to Timothy Nikkel (:tn) from comment #20)
> I think the problem is that there is nothing to release mImageContainer on
> RasterImage. So as long as the image is kept alive (the imgCache is only for
> keeping the source data around) the image container sticks around. Instead
> it should be treated the same as the rest of the decoded data for images.
>
> I think all we should need is nulling out mImageContainer in
> RasterImage::Discard.
This seems to have worked.
Assignee: nobody → tnikkel
Assignee | ||
Comment 22•10 years ago
|
||
(In reply to Timothy Nikkel (:tn) from comment #20)
> I think all we should need is nulling out mImageContainer in
> RasterImage::Discard.
Looking at this problem, I'm wonder if what we should really do is get rid of RasterImage::mImageContainer completely and keep only RasterImage::mImageContainerCache. In other words, we should keep only a weak reference to the ImageContainer and not a strong reference.
Is there any reason we *should* keep a strong reference? I suppose whether that will hurt us depends on how the layer system behaves.
Flags: needinfo?(seth)
Assignee | ||
Comment 23•10 years ago
|
||
Here are some further comments from IRC:
seth: tn: the comment in GetImageContainer says "We only need to be careful
about holding on to the image when it is discardable by the OS"
seth: tn: that makes zero sense to me
seth: tn: i think the goal here is that we keep the ImageContainer alive,
which keeps its reference the imgFrame's SourceSurface alive, which keeps the
VolatileBuffer underneath alive
seth: tn: but if the layer system drops its reference to the ImageContainer,
and the image is unlocked, and nothing is calling Draw, why *should* we keep the
VolatileBuffer alive?
I've gone ahead and written a patch that removes the strong reference to
RasterImage's ImageContainer, and keeps only a weak reference. In my local
testing it seems to work as expected and have no negative effects - the weak
reference stays alive as long as the image stays layerized, and gets dropped
when it stops being layerized.
I wasn't 100% sure who should review this; Matt, if you think it should be
someone else, please forward the r?.
Attachment #8549297 -
Flags: review?(matt.woodrow)
Assignee | ||
Updated•10 years ago
|
Assignee: tnikkel → seth
Status: NEW → ASSIGNED
Comment 24•10 years ago
|
||
Comment on attachment 8549297 [details] [diff] [review]
Stop holding a strong reference to RasterImage's ImageContainer
Review of attachment 8549297 [details] [diff] [review]:
-----------------------------------------------------------------
::: image/src/RasterImage.cpp
@@ +850,5 @@
>
> + // We need a new ImageContainer, so create one.
> + container = LayerManager::CreateImageContainer();
> +
> + nsRefPtr<layers::Image> image = GetCurrentImage(container);
Add a comment saying that the Image contains a SourceSurface which is holding the lock on the VolatileBuffer, preventing it from getting freed.
Attachment #8549297 -
Flags: review?(matt.woodrow) → review+
Assignee | ||
Comment 25•10 years ago
|
||
Thanks for the quick review! I'll make that change.
Assignee | ||
Comment 26•10 years ago
|
||
Addressed review comments.
Assignee | ||
Updated•10 years ago
|
Attachment #8549297 -
Attachment is obsolete: true
Assignee | ||
Comment 27•10 years ago
|
||
Here's a try job. I'm going to be paranoid about this one and run this on all platforms:
https://tbpl.mozilla.org/?tree=Try&rev=03ea59b4a02f
Comment 28•10 years ago
|
||
It seems like we still accumulate image cache entries for this page without bound. Is that intended?
Comment 29•10 years ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #28)
> It seems like we still accumulate image cache entries for this page without
> bound. Is that intended?
So it seems like this is intended by this particular page. It keeps a cache of all the image elements that it has ever loaded, which explains the behaviour I was seeing.
Assignee | ||
Comment 30•10 years ago
|
||
So it looks like this actually fixed several tests that were marked fails-if on
Android, although I don't really understand why. I've updated the patch to
remove those fails-if annotations.
Assignee | ||
Updated•10 years ago
|
Attachment #8549307 -
Attachment is obsolete: true
Comment 31•10 years ago
|
||
It would be valuable to know when this problem started occurring.
Keywords: regressionwindow-wanted
Assignee | ||
Comment 32•10 years ago
|
||
(In reply to Jeff Muizelaar [:jrmuizel] from comment #31)
> It would be valuable to know when this problem started occurring.
I'd bet that it came from the bugs that added support for volatile buffers.
Assignee | ||
Comment 33•10 years ago
|
||
I pushed a new Android-only try job to make sure that the new test annotations are good:
https://tbpl.mozilla.org/?tree=Try&rev=36d55de4d291
Assignee | ||
Comment 34•10 years ago
|
||
(In reply to Seth Fowler [:seth] from comment #32)
> I'd bet that it came from the bugs that added support for volatile buffers.
I take it back! My guess was based on the comments in GetImageContainer, but Timothy investigated and it appears that the volatile buffers patches added the weak pointer, not the strong pointer. We were in an even worse position before those patches.
Assignee | ||
Comment 35•10 years ago
|
||
Sigh.. I appear to have neglected to request tests on that try push.
https://tbpl.mozilla.org/?tree=Try&rev=e927a3d40324
Assignee | ||
Comment 36•10 years ago
|
||
Wow, I appear to be hitting a bug in 'hg trychooser'. Good to know about. =\
At any rate, this new try run is now useless, because those same test annotations got changed by someone else.
I'm just going to go ahead and pushed.
Here's the patch without the new test annotations, and rebased.
Assignee | ||
Updated•10 years ago
|
Attachment #8549851 -
Attachment is obsolete: true
Assignee | ||
Comment 37•10 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
Assignee | ||
Comment 39•10 years ago
|
||
Comment on attachment 8550545 [details] [diff] [review]
Stop holding a strong reference to RasterImage's ImageContainer
Approval Request Comment
[Feature/regressing bug #]: Part of fix for bug 1125272 and bug 1119938.
[User impact if declined]: Already approved.
[Describe test coverage new/current, TreeHerder]: In 38 for a long time.
[Risks and why]: Low risk; in 38 for a long time.
[String/UUID change made/needed]: None.
Attachment #8550545 -
Flags: approval-mozilla-aurora?
Assignee | ||
Comment 40•10 years ago
|
||
Pushed to Aurora:
https://hg.mozilla.org/releases/mozilla-aurora/rev/79cd28190706
status-firefox37:
--- → fixed
status-firefox38:
--- → fixed
Comment 41•10 years ago
|
||
Comment on attachment 8550545 [details] [diff] [review]
Stop holding a strong reference to RasterImage's ImageContainer
Aurora approval previously granted for a collection of ImageLib related bugs that Seth worked on and QE verified before uplift.
Attachment #8550545 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Updated•10 years ago
|
tracking-firefox37:
--- → +
Updated•9 years ago
|
Keywords: regressionwindow-wanted
You need to log in
before you can comment on or make changes to this bug.
Description
•