Closed Bug 1830753 Opened 1 year ago Closed 1 year ago

Continuous memory leak, with animated theme and at least one visible and one occluded window

Categories

(Core :: Graphics: WebRender, defect)

Firefox 112
defect

Tracking

()

RESOLVED FIXED
116 Branch
Tracking Status
firefox-esr102 --- wontfix
firefox113 + fixed
firefox114 + fixed
firefox115 blocking fixed
firefox116 blocking fixed

People

(Reporter: n6fpblsjcva7, Assigned: tnikkel)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(2 files, 2 obsolete files)

Attached file memory-report.json.gz (deleted) —

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/112.0

Steps to reproduce:

Updated to latest version

Actual results:

Every ~4 seconds, Firefox's memory usage increases by 0.02–0.4 GB. This continues until the system is out of memory.

Expected results:

Memory usage should stay roughly the same.

The Bugbug bot thinks this bug should belong to the 'Core::Performance' component, and is moving the bug to that component. Please correct it if you think the bot is wrong.

Component: Untriaged → Performance
Product: Firefox → Core

This bug was moved into the Performance component.

:n6fpblsjcva7, could you make sure the following information is on this bug?

  • For slowness or high CPU usage, capture a profile with http://profiler.firefox.com/, upload it and share the link here.
  • ✅ For memory usage issues, capture a memory dump from about:memory and attach it to this bug.
  • Troubleshooting information: Go to about:support, click "Copy raw data to clipboard", paste it into a file, save it, and attach the file here.

If the requested information is already in the bug, please confirm it is recent.

Thank you.

Flags: needinfo?(n6fpblsjcva7)

That's a lot of shmem.

1,446.02 MB ── resident
2,286.36 MB ── resident-peak
16,480.96 MB ── resident-phys-footprint
1,163.64 MB ── resident-unique
14,639.53 MB ── shmem-allocated
14,703.05 MB ── shmem-mapped

I'm going to move this to Graphics under the assumption that it is the most likely cause of a continuous leak like that.

Do you have any windows minimized? I see that you have the Dark Space addon. Do you have a lightweight theme with an animation?

Maybe this is another instance of bug 1820841, which seemed to be a variation of bug 1828587 showing up on macOS.

Component: Performance → Graphics: WebRender
Summary: Firefox has a memory leak and eventually leads to maxing out my memory consistently. → Continuous memory leak, with animated theme and very high shmem-allocated and shmem-mapped

While this likely represents a bug in Firefox that we need to fix, you might be able to avoid this leak by disabling the animated theme.

:ahale, this sounds similar to recent issues you've worked on WRT animated themes causing problems. Is this a dupe, maybe?

Flags: needinfo?(ahale)
Blocks: gfx-triage
Severity: -- → S2

(In reply to Erich Gubler [:ErichDonGubler] from comment #5)

:ahale, this sounds similar to recent issues you've worked on WRT animated themes causing problems. Is this a dupe, maybe?

It sure does look that way; I can't see anything to distinguish this bug from that one other than this being macOS rather than Windows.

I'm not sure whether this report is from after the fix landed; the reporter's version is 112.0, so it may predate the fix. If it is after the fix, that would imply the mThrottled check in https://hg.mozilla.org/mozilla-central/rev/6f2d406b412b#l1.102 is not working?

:bradwerth, would you mind attempting to reproduce this on macOS? It looks easy to reproduce, and we need to take action if the bug is still around. The repro steps in https://bugzilla.mozilla.org/show_bug.cgi?id=1828587#c71 should still work.

Flags: needinfo?(ahale) → needinfo?(bwerth)

It's pretty easy to reproduce by using the Dark Space theme, opening a bunch of windows, and then minimizing them (we don't know that the reporter of this bug did that, but it's quite possible). This is already known, as Andrew commented in bug 1820841, comment 9. It reproduces in nightlies at least as far back as 110, so it predates Brad's work in bug 1768495, which landed during the 112 cycle.

So I guess minimized windows on mac aren't throttled, and aren't painting either.

Status: UNCONFIRMED → NEW
Ever confirmed: true

The bug is marked as tracked for firefox113 (release), tracked for firefox114 (beta) and tracked for firefox115 (nightly). However, the bug still isn't assigned.

:bhood, could you please find an assignee for this tracked bug? If you disagree with the tracking decision, please talk with the release managers.

For more information, please visit BugBot documentation.

Flags: needinfo?(bhood)

(In reply to Timothy Nikkel (:tnikkel) from comment #7)

So I guess minimized windows on mac aren't throttled, and aren't painting either.

They should be, though...

Brad, can you poke at this and see if you can come up with any ideas?

Assignee: nobody → bwerth
Flags: needinfo?(bhood)

Tim, does this reproduce in an even older nightly (82)? macOS already effectively paused the compositor before that; see bug 1580117.

Flags: needinfo?(tnikkel)

I'm able to reproduce using Tim's approach in comment 7. I'll try to figure out what's going on.

Flags: needinfo?(bwerth)

(In reply to Emilio Cobos Álvarez (:emilio) from comment #11)

Tim, does this reproduce in an even older nightly (82)? macOS already effectively paused the compositor before that; see bug 1580117.

Yeah, I can reproduce using whatever nightly mozregression launches when I ask for 82.

Flags: needinfo?(tnikkel)

I just checked with printfs in a local build: we do throttle the browser.xhtml parent-process and content-process refresh drivers in windows that are minimized on macOS. So there must be something more going on here.

When I reproduce and diff the before-and-after memory reports, the growth is in textures/render-texture-hosts.

With an animated theme active, this only reproduces when I have at least one window minimized and one window not minimized. When that is true, calls to RenderThread::RegisterExternalImage continue but there are no corresponding calls to RenderThread::UnregisterExternalImage. This is what's building up the memory reported as textures/render-texture-hosts.

We keep hitting this case
https://searchfox.org/mozilla-central/rev/e77d89c414eac63a295a2124600045775a6f4715/image/imgFrame.cpp#279
when we are trying to recycle a frame from the animated image. It seems like an unexpected interaction with the animated image frame recycling code.

Flags: needinfo?(n6fpblsjcva7)
No longer blocks: gfx-triage

I don't completely understand this yet, but here's what I've discovered about the behavior with an animated theme when there is at least one visible window and at least one occluded window:

  1. While decoding an animated GIF, we sometimes reuse an external image id. When we do that, we create a ScheduleSharedSurfaceRelease for the old image, which gets posted as a transaction to the Renderer. We also (somewhere) create a transaction to register the new image and, I believe, increment a use count for the external image. I believe the register transaction gets posted to all Renderers using that image, including those associated with occluded windows. Through some call chain that I don't understand yet, this transaction is executed on all those Renderers.
  2. The ScheduleSharedSurfaceRelease transaction created in Step 1 will only be executed on the Renderer instances when Renderer.update is called and the AppendNotificationRequests message is processed. But Renderer.update has a long causation chain that leads back to WebRenderBridgeParent::MaybeGenerateFrame. In that function, we exit early if the compositor is paused.
  3. Since the register transaction from Step 1 is executed on all Renderers and the release transaction is only executed on unoccluded Renderers, the use counts are never decremented to 0, and the external image is never released until the occluded Renderer no longer has a paused compositor (it becomes unoccluded) and processes all of its pending transactions.

The ideal fix, I think, would be to prevent the register transaction from being posted (or the use counter from being incremented) for occluded windows, but I don't yet understand that call chain. If somebody does, please post to this bug how that causation works. An alternative fix is to change the behavior of MaybeGenerateFrame so that, when it detects a paused compositor, it calls the Update method on the Renderer before the early exit. Update, unlike UpdateAndRender, should process the pending transactions without generating a frame. I'll try to build a fix that does this.
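
A rough sketch of what I mean (made-up names and simplified types, not the real WebRenderBridgeParent/Renderer code):

    #include <functional>
    #include <queue>

    struct Renderer {
      std::queue<std::function<void()>> pendingTransactions;

      // Process queued transactions (e.g. shared-surface releases) without
      // producing a frame.
      void Update() {
        while (!pendingTransactions.empty()) {
          pendingTransactions.front()();
          pendingTransactions.pop();
        }
      }

      // Process queued transactions, then draw a frame.
      void UpdateAndRender() {
        Update();
        /* ... generate and present a frame ... */
      }
    };

    struct CompositorBridge {
      Renderer renderer;
      bool paused = false;

      void MaybeGenerateFrame() {
        if (paused) {
          // Proposed change: drain pending transactions even though we will
          // not render, so queued releases are not stuck behind the pause.
          renderer.Update();
          return;  // current behaviour simply exits early here
        }
        renderer.UpdateAndRender();
      }
    };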

I also don't understand why this is an issue on macOS and not on other platforms. It's possible that the image decode step is handled differently and the register transaction is somehow permitted on macOS and not on other platforms.

(In reply to Brad Werth [:bradwerth] from comment #19)

I also don't understand why this is an issue on macOS and not on other platforms. It's possible that the image decode step is handled differently and the register transaction is somehow permitted on macOS and not on other platforms.

Isn't this the same issue as reported here? https://bugzilla.mozilla.org/show_bug.cgi?id=1828587

I also don't understand why this is an issue on macOS and not on other platforms.

(I'm not a developer and not sure.)
Don't Mac and Wayland have a different kind of vsync than Windows and X11? bug 1829493 (114), bug 1797978 (108: sounds similar), bug 1786247 (107: sounds similar), bug 1765399 (102), bug 1759234 (100), bug 1645528 (86: "widget-local vsync source"), bug 1542808 (72)

Only Mac and Linux have gfx.canvas.accelerated=true:
bug 1831548 (async RemoteTexture=Early Beta or earlier) makes Linux run out of file descriptors, which sometimes crashes Firefox.

Summary: Continuous memory leak, with animated theme and very high shmem-allocated and shmem-mapped → Continuous memory leak, with animated theme and at least one visible and one occluded window
Attachment #9335023 - Attachment description: WIP: Bug 1830753 Part 1: Make RenderBackend wake up instead of rendering when compositor is paused. → Bug 1830753: Make renderer wake up instead of rendering when compositor is paused.

A little bit of context on what I've figured out, in case it helps. There is one RasterImage object for the animated image in the theme; it is shared by all of the windows. So when the refresh driver ticks in the one visible window, we advance the frame of that one image object, which those minimized windows are pointing to as well. We share the image frame with the GPU process so it can paint it. Then the GPU process is supposed to send a message back to the process that sent it the animated image frame after it is done with it. The GPU process doesn't send that "finished with frame n" message. I'm fuzzy on the exact mechanism, but it has something to do with the GPU process knowing those minimized windows contain the same animated image and still need to present that image frame (and the logic doesn't take paused compositors into account in some way).
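
A rough self-contained illustration of that lifecycle (made-up types, not the real image/IPC code): each window that shows the shared frame takes a use on it, and the frame can only be released once every window has presented it; a minimized (paused) window never presents, so the count never reaches zero:

    #include <cstdio>

    struct SharedFrame {
      int uses = 0;
      void AddUse() { ++uses; }
      // Returns true once the last user is done and the frame can be freed.
      bool ReleaseUse() { return --uses == 0; }
    };

    int main() {
      SharedFrame frame;
      const int kWindows = 3;  // every window shows the same theme image
      for (int i = 0; i < kWindows; ++i) {
        frame.AddUse();
      }

      // Only the visible window presents and releases its use; the minimized
      // windows never do, so this frame (and each new frame after it) leaks.
      bool freed = frame.ReleaseUse();
      std::printf("freed=%d, remaining uses=%d\n", freed, frame.uses);
      return 0;
    }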

FYI our last beta builds tomorrow and ships on Friday. We build our RC on Monday, I would take mitigation patches in the RC.

Some insight into why we're allocating a frame in this case. When we have an animated theme and at least one window visible and one occluded, we always choose a recycled frame that is the restore frame. I'm not sure why -- not familiar with this logic. This causes us to allocate a new frame for each frame, thus the memory leak. I'll try to figure out what we can do in this case. Tim, do you understand why we are always hitting this code path in this case?

Flags: needinfo?(tnikkel)

I think we should probably back out bug 1768495 again from beta. It's unfortunate, but it's way less risky than adding a whole new code path to mitigate this IMO, especially this close to RC.

(In reply to Brad Werth [:bradwerth] from comment #25)

Some insight into why we're allocating a frame in this case. When we have an animated theme and at least one window visible and one occluded, we always choose a recycled frame that is the restore frame. I'm not sure why -- not familiar with this logic. This causes us to allocate a new frame for each frame, thus the memory leak. I'll try to figure out what we can do in this case. Tim, do you understand why we are always hitting this code path in this case?

Hmm, that's not quite the behaviour that I'm observing. In the code you linked, blocked is false for me and we go into InitForDecoderRecycle. However, InitForDecoderRecycle fails, so blocked does end up true and we still go on to allocate a new frame.

InitForDecoderRecycle fails because we can't recycle a frame if it is still being held onto by someone. Similar to comment 17, at this point

https://searchfox.org/mozilla-central/rev/e77d89c414eac63a295a2124600045775a6f4715/image/imgFrame.cpp#279

we have one ref for each open window (since each open window points to the same image), so we can't recycle the old frame because those minimized windows still have a reference to the frame.
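
A small sketch of why the recycle attempt fails in that situation (a made-up helper, not the actual imgFrame/Decoder code): the old frame can only be reused when nothing else still holds it, and here each open window still holds a reference, so we fall back to allocating a new frame every time:

    #include <memory>

    struct Frame { /* pixel data */ };

    // Reuse the candidate only if nothing else still references it; otherwise
    // allocate a fresh frame (which is what keeps happening in this bug).
    std::shared_ptr<Frame> AllocateFrame(std::shared_ptr<Frame>& recycleCandidate) {
      if (recycleCandidate && recycleCandidate.use_count() == 1) {
        return recycleCandidate;
      }
      return std::make_shared<Frame>();
    }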

Recycling of frames is not really the cause, because I can still reproduce with the pref image.animated.decode-on-demand.recycle set to false, which should disable recycling.

Flags: needinfo?(tnikkel)

And we allocate a new frame because the image in question is large enough that we don't keep around every frame of the animated image. The code that handles this operates on the assumption that old frames will get dropped when we are finished with them, so it doesn't expect to wrap around and start decoding new copies of frames that are still being held somewhere.

Thanks for correcting me on the code path through Decoder::AllocateFrameInternal. Indeed, we are failing there for the reason you detail in comment 28, and not because we matched the restore frame, as I proposed in comment 25.

(In reply to Timothy Nikkel (:tnikkel) from comment #27)

Recycling of frames is not really the cause because I can still reproduce with setting the pref image.animated.decode-on-demand.recycle to false, which should disable recycling.

If this pref is off, it seems that the correct fix is D178551. Since that fix also handles the case when the pref is on, I think we really should land it, and treat the behavior in comment 28 as expected and correct.

(In reply to Emilio Cobos Álvarez (:emilio) from comment #26)

I think we should probably back out bug 1768495 again from beta. It's unfortunate, but it's way less risky than adding a whole new code path to mitigate this IMO, especially this close to RC.

(In reply to Pascal Chevrel:pascalc from comment #24)

FYI our last beta builds tomorrow and ships on Friday. We build our RC on Monday, I would take mitigation patches in the RC.

Pascal, let's back out Bug 1768495 from beta, as Emilio proposes. That will give us more time to find the right fix.

Flags: needinfo?(pascalc)

I'm moving Pascal's needinfo to bug 1834246, which we will use to handle the backout from beta. This bug can continue to be used to track a long-term fix.

Flags: needinfo?(pascalc)
Attachment #9335023 - Attachment is obsolete: true

Shared surfaces are given a refcount of 1 at the point of creation, and
then that refcount is increased for each window that displays the surface.
This change ensures that if the compositor associated with a window is
paused, we don't increase the refcount, because that paused compositor
will never display the surface and therefore never decrease the refcount.
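
A minimal sketch of the guard this describes (hypothetical names, not the actual gfx code):

    struct SharedSurface {
      int refCount = 1;  // 1 at creation
    };

    struct WindowCompositor {
      bool paused = false;

      void AddSurfaceUse(SharedSurface& aSurface) {
        if (paused) {
          // A paused compositor will never display the surface, so it must
          // not take a reference it can never give back.
          return;
        }
        ++aSurface.refCount;
        // ... later, once the surface has been displayed: --aSurface.refCount;
      }
    };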

Depends on: 1835127

I think D179121 solves the problem correctly, but Bug 1835127 is confusing the issue since it's very easy to crash the browser while testing it.

Tim, is D179393 meant to be landed in conjunction with the earlier patch D179121, or instead of it?

Flags: needinfo?(tnikkel)

D179393 solves the memory use increase by itself in my testing. D179393 stops the updates from being sent from the content process, so we avoid any work in the gpu process.

D179121 could prevent any updates that do slip through and reach the compositor through different paths (any that exist currently that we aren't aware of, or that are created in the future) from causing problems.
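
Conceptually (made-up types; the real change goes through the image and IPC layers), D179393 amounts to skipping paused compositors when forwarding the animated-image update from the content process:

    #include <vector>

    struct ImageUpdate { int frameIndex; };

    struct CompositorTarget {
      bool paused = false;
      void SendImageUpdate(const ImageUpdate&) { /* IPC to the GPU process */ }
    };

    void ForwardAnimatedImageUpdate(std::vector<CompositorTarget*>& aTargets,
                                    const ImageUpdate& aUpdate) {
      for (CompositorTarget* target : aTargets) {
        if (target->paused) {
          continue;  // occluded windows get no update, and so no queued work
        }
        target->SendImageUpdate(aUpdate);
      }
    }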

Flags: needinfo?(tnikkel)

Thank you, let's land them both. I'll take them out of WIP when Bug 1835127 is resolved.

:bradwerth, 115 goes to beta next week (Monday 2023-06-05).
Any concerns about the timing for landing the fix in Bug 1835127 and then here in Bug 1830753?

The regressor, Bug 1768495, introduced in Fx112, was backed out of Fx113 beta and Fx114 beta. I was wondering if we'll need to do the same for Fx115.

Flags: needinfo?(bwerth)

To hasten this along, I'm going to remove Bug 1835127 as a blocker. It's a related issue that I believe has no effect outside of debug builds. Let's try to get the patches in this Bug approved and landed.

Flags: needinfo?(bwerth)
Attachment #9335953 - Attachment description: WIP: Bug 1830753: Prevent increases in shared surface refcount when the compositor is paused. → Bug 1830753: Prevent increases in shared surface refcount when the compositor is paused.

Setting 115 to fixed; the initial patch that introduced the regression was backed out of beta by uplifting Bug 1836699.
Setting 116 to affected and tracked as blocking; the patches from Bug 1768495 are still in central.

As mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=1836699#c3, at this point we should also consider backing this out of central until everything needed to land the fixes is in place.

Attachment #9336454 - Attachment description: WIP: Bug 1830753. Don't send animated image updates to paused compositors. → Bug 1830753. Don't send animated image updates to paused compositors.
Attachment #9335953 - Attachment is obsolete: true
Attachment #9336454 - Attachment description: Bug 1830753. Don't send animated image updates to paused compositors. → Bug 1830753. Don't send animated image updates to paused compositors. r?#gfx-reviewers

Tim has the last surviving patch for this bug, so it should be assigned to Tim.

Assignee: bwerth → tnikkel
Pushed by tnikkel@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1b140d3a2770 Don't send animated image updates to paused compositors. r=bradwerth,gfx-reviewers

I filed bug 1838111 to explore a more general solution.

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 116 Branch

I think we want QA to do another round of testing on 116 now that this patch has landed.

Flags: qe-verify+
Blocks: 1839109
Flags: qe-verify+
Blocks: 1845413