Closed Bug 1749298 Opened 3 years ago Closed 3 years ago

.NET 6 Blazor wasm page leaked after reloading

Categories

(Core :: JavaScript: GC, defect, P2)

Firefox 95
defect

Tracking


RESOLVED FIXED
99 Branch
Tracking Status
firefox99 --- fixed

People

(Reporter: anders.frantz, Assigned: jonco)

References

(Blocks 2 open bugs)

Details

Attachments

(4 files)

Steps to reproduce:

Reload a published .NET 6 compiled Blazor Wasm application page multiple times.
Different test setups:

  1. Windows 10. Our own application with 2 iframes, each hosting 2 Blazor WASM apps within a Blazor application, published with AOT compilation. Crash after 3-4 reloads.
  2. Windows 10. Our own Blazor application published without AOT compilation. Crash after 10-12 reloads.
  3. Windows 10. To eliminate possible developer errors I ran the Visual Studio 2022 out-of-the-box self-hosted Blazor client template project, untouched and published with AOT. Crash after 32 reloads.
  4. Ubuntu LTS, Firefox 95.0.1, 3 GB memory. "Your tab just crashed" after 2 reloads.
  5. Same scenario as 1 but on Chrome. No problems and no noticeable increase in memory consumption after 70 reloads.

Also reported this issue on the dotnet/aspnetcore GitHub: https://github.com/dotnet/aspnetcore/issues/39305

Actual results:

Crash with some different log messages:
Scenario 1 above:
The following was logged:
info: Streaming compilation failed. Falling back to ArrayBuffer instantiation. out of memory blazor.webassembly.js:1:36697

err: TypeError: Response.arrayBuffer: Body has already been consumed. blazor.webassembly.js:1:35461
printErr https://localhost/pg2/_framework/blazor.webassembly.js:1
Module https://localhost/pg2/_framework/blazor.webassembly.js:1

err: Uncaught (in promise) TypeError: Response.arrayBuffer: Body has already been consumed.
n https://localhost/pg2/_framework/blazor.webassembly.js:1

Scenario 3:
warn: WebAssembly module validated with warning: failed to allocate executable memory for module

err: Uncaught (in promise) TypeError: Response.arrayBuffer: Body has already been consumed.
n https://localhost:5001/_framework/blazor.webassembly.js:1

Firefox memory in task manager has at this point grown to about 4 GB.
Below is an excerpt from about:memory (forcing GC by pressing the 'Minimize memory usage' button makes no noticeable difference):

Explicit Allocations

20,118.30 MB (100.0%) -- explicit
β”œβ”€β”€20,068.68 MB (99.75%) -- window-objects/top(https://localhost:5001/, id=37)
β”‚ β”œβ”€β”€20,053.02 MB (99.68%) -- active/window(https://localhost:5001/)
β”‚ β”‚ β”œβ”€β”€20,049.23 MB (99.66%) -- js-realm(https://localhost:5001/)
β”‚ β”‚ β”‚ β”œβ”€β”€20,042.39 MB (99.62%) -- classes
β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€16,385.64 MB (81.45%) -- class(ArrayBuffer)/objects
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€16,384.00 MB (81.44%) ── non-heap/elements/wasm [32]
β”‚ β”‚ β”‚ β”‚ β”‚ └───────1.64 MB (00.01%) ++ (2 tiny)
β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€β”€3,638.22 MB (18.08%) -- class(WebAssembly.Instance)/objects
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€1,966.21 MB (09.77%) ── malloc-heap/misc [32]
β”‚ β”‚ β”‚ β”‚ β”‚ β”œβ”€β”€1,672.00 MB (08.31%) ── non-heap/code/wasm [32]
β”‚ β”‚ β”‚ β”‚ β”‚ └──────0.00 MB (00.00%) ── gc-heap [32]
β”‚ β”‚ β”‚ β”‚ └──────18.53 MB (00.09%) ++ (6 tiny)
β”‚ β”‚ β”‚ └───────6.84 MB (00.03%) ++ (2 tiny)
β”‚ β”‚ └───────3.79 MB (00.02%) ++ (3 tiny)
β”‚ └──────15.66 MB (00.08%) ++ (32 tiny)
└──────49.61 MB (00.25%) ++ (24 tiny)

Other Measurements

134,217,727.94 MB (100.0%) -- address-space
β”œβ”€β”€131,915,122.37 MB (98.28%) ── free(segments=202)
β”œβ”€β”€β”€β”€2,282,128.48 MB (01.70%) -- reserved
β”‚ β”œβ”€β”€2,097,096.57 MB (01.56%) ── mapped(segments=13)
β”‚ └────185,031.91 MB (00.14%) ── private(segments=1331)
└───────20,477.09 MB (00.02%) ++ commit

5.87 MB (100.0%) -- decommitted
└──5.87 MB (100.0%) ── js-non-window/gc-heap/decommitted-pages

81 (100.0%) -- event-counts
└──81 (100.0%) -- window-objects/top(https://localhost:5001/, id=37)/active/window(https://localhost:5001/)/dom
β”œβ”€β”€77 (95.06%) ── event-listeners [32]
└───4 (04.94%) ── event-targets

Expected results:

Reloading the page should not cause memory consumption/leakage that eventually crashes the application.

The Bugbug bot thinks this bug should belong to the 'Core::JavaScript: GC' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → JavaScript: GC
Product: Firefox → Core

Thanks for opening this bug.

Would you have a static copy of the AOT compilation case which reproduces this issue, such that we can easily reproduce it without setting up .NET / Blazor locally?

Lars, would you have any idea what might be causing this memory retention?

Severity: -- → S3
Flags: needinfo?(lhansen)
Flags: needinfo?(anders.frantz)
Priority: -- → P2

This is just speculation, but I seem to remember from past experience that Blazor apps tend to be fairly large / have a lot of code. On our side, we compile each wasm module into one object code blob, while as a security measure SpiderMonkey has a single per-process code memory area that is restricted to 2GB on 64-bit systems. Even if unloading works as it should and there's no actual leak, the code area can become fragmented, and that might prevent new, large, contiguous code memory blobs from being allocated; that could in turn account for the "failed to allocate executable memory" error, which I expect is the key here. But there could also be a plain old leak causing that error.
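For illustration only, a sketch of the reload pattern being described (hypothetical page code, not the actual Blazor loader); every reload compiles the large module again, and each compilation needs a fresh allocation from that shared per-process executable code area:

// Hypothetical loader code; the module path and empty import object are
// placeholders. On every page load this compiles the module anew, and the
// generated machine code lands in SpiderMonkey's single per-process
// executable code area. If old instances stay reachable across reloads,
// their code is never released and the area fills up or fragments.
async function loadApp() {
  const response = await fetch("_framework/app.wasm");
  const { instance } = await WebAssembly.instantiateStreaming(response, {});
  return instance;
}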

I agree that having access to the test case would be extremely helpful here. We could arrange for it not to be attached to a public bug, if that's desirable.

(In reply to Nicolas B. Pierron [:nbp] from comment #2)

Thanks for opening this bug.

Would you have a static copy of the AOT compilation case which reproduces this issue, such that we can easily reproduce it without setting up .NET / Blazor locally?

Lars, would you have any idea what might be causing this memory retention?

The repro zip file was too large to attach. I've shared it here https://www.dropbox.com/s/v31101q3w7gj4b8/Bug_1749298_repro.zip?dl=0. Let me know if it's not accessible.

Flags: needinfo?(anders.frantz)

Thanks, I grabbed a copy.

Flags: needinfo?(lhansen)

Firefox release-debug, mozilla-central from Jan 13. 8-core/16-thread Xeon workstation, 32GB RAM, Fedora 35.
Numbers from htop sorted by size, filtered by "contentproc", numbers in GB. Developer console closed.

Load empty browser with default start tab, wait for stabilization. Virt=2.6, Res=0.18
Load blazor app once, wait for stabilization. Virt=9.0, Res=0.33
Load again, wait. Virt=15.2, Res=0.49
Again, wait. Virt=21.4, Res=0.66
Again, wait. Virt=27.7, Res=0.85
Again, wait. Virt=33.9, Res=1.0
Again, wait. Virt=40.2, Res=1.2
Again, wait. Virt=46.4, Res=1.3
Again, wait. Virt=52.6, Res=1.5
Again, wait. Virt=58.9, Res=1.7
Again, wait. Virt=65.1, Res=1.8
Again, wait. Virt=71.3, Res=2.0
Again, wait. Virt=77.6, Res=2.2
Five fairly rapid reloads (waiting ca 1.5s between each). Virt=106, Res=3.2

At this point, about:memory says:
11GB explicit allocations
8.7GB in ArrayBuffers (this is committed but most of it will be zero and so not resident)
2.0GB in wasm instance-related (code+data)
We have 17 memories (assuming one memory per reload), @ 6GB per memory, 17*6=102
So far it looks like we leak the wasm instance with its compiled code and memory when we reload. Eventually available virtual memory, or the 2GB code memory area, will be exhausted and we'll OOM.

Minimize memory usage. Htop numbers drop a little to 106G/3.0G. about:memory says:
(no change from the above)

Close the blazor tab. Htop shows that we drop back down to the baseline within about 2s.

Disable wasm caching (Nightly-only feature), repeat experiment, no difference.

No other wasm features should matter for this. (For grins I disabled tiered compilation but other than resident memory growing more slowly because we only have one code blob, there was no difference.)

Taking this for now, for further experiments.

Assignee: nobody → lhansen
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

A couple more data points:

  • on my system, Chromium does indeed not exhibit the leak
  • I've yet to find other wasm properties that exhibit similar leaks in Firefox, though mostly these are emscripten-based and not .NET-based

At this point we probably have to identify the leaked instances in a running process and try to find out who keeps them alive.

Problem goes away completely when running with --disable-e10s.

Jon, any ideas? There's a shutdown crash in release-debug too, "Extra shutdown CC".

Assignee: lhansen → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(jcoppeard)
No longer blocks: wasm-lang

I tested locally and can reproduce. I can't immediately see why this is happening though.

Hi, is there any progress on this issue? Is there anything we can do on our end to help resolve it?

(In reply to anders.frantz from comment #12)
I've spent a while looking into this but not come to any definite conclusions so far.

One thing that is suspicious is that there are about 100 BodyStreamHolder objects in the CC log when this happens. These don't seem to have any incoming edges and are holding alive the old windows that should have been collected. However, there are a few other things (LoadedScripts and orphaned FragmentOrElements) that are also holding the windows alive, which might also be to blame. I'll continue looking into this tomorrow.

Andrew, have you seen anything like that before? I don't know anything about how these holders work.

Flags: needinfo?(continuation)

While there may be some kind of leak here, the cycle collector is not well-equipped for this kind of "hit reload a bunch of times on a tab containing a big web page" scenario, which is what the original report seems to describe.

At least at some point, there was some kind of mechanism Luke added that would cause these large buffers to get dumped on memory pressure, though I'm not finding it exactly right now. Maybe the memory pressure stuff in the large allocation failure callback is supposed to do that?

Flags: needinfo?(continuation)

I can try looking at why we are leaking the page here from the DOM side, but it would also be nice to know if we can get whatever other mechanism there is for dumping the huge buffers of pages that aren't currently being displayed to work.

(In reply to Andrew McCreight [:mccr8] from comment #15)
Was it this perhaps?

https://searchfox.org/mozilla-central/source/js/src/vm/ArrayBufferObject.cpp#83-136

This only triggers GC though, and the problem here is that these buffers are still live from the GC point of view.

How do I load this test case? I downloaded the file in comment 4, and tried loading wwwroot/index.html but it just says "Loading... An unhandled error has occurred."

(In reply to Andrew McCreight [:mccr8] from comment #17)
I ran 'python3 -m http.server' from the wwwroot directory.

(In reply to Andrew McCreight [:mccr8] from comment #14)

the cycle collector is not well-equipped for this kind of "hit reload a bunch of times on a tab containing a big web page" scenario

We can do this outside the context of this bug, but I'd like to understand why this is.

(In reply to Jon Coppeard (:jonco) from comment #16)

(In reply to Andrew McCreight [:mccr8] from comment #15)
Was it this perhaps?

Looking at some ancient bugs, it seems like I was thinking of bug 936236, which sounds like the same thing. I thought we did something more magical, but that got WONTFIXed (bug 936128).

Minimize memory usage doesn't free the leaked page (which runs sync GC/CC/GC/CC). However, getting a verbose CC log does free the page. (For Jon, it doesn't, which is odd, but I'm going to ignore that for now.) That suggests that the problem is with a cycle collector CanSkip optimization, and not something due to incremental CC.

In the concise log, the document's JS reflector is marked black, and doesn't seem to be held alive via any non-browser root, so maybe the CC optimizations are causing the reflector to get marked black. In the verbose log, the document's JS reflector is marked grey.

It appears to be the "forceGC" part of the AllTraces behavior that avoids the leak. If you change the bool forceGC = line in nsCycleCollector.cpp to be bool forceGC = true; then we don't leak the page.

Blocks: GhostWindows
Summary: Repeatedly reloading of .NET 6 Blazor wasm page in Firefox crash of memory → .NET 6 Blazor wasm page leaked after reloading

I ripped out most or all of the ForgetSkippable stuff, and the document reflector is still getting marked black somehow. Maybe it is getting marked black via exposeToActiveJS() somehow.

I'm still confused about what the difference is between doing a GC via "memory-pressure" "heap-minimize" followed by a CC, and doing a GC within the CC itself. I noticed that the "heap-minimize" GC is a shrinking one, while the forceGC one is not, but sadly making the forceGC into a shrinking GC doesn't somehow make it stop collecting the leaked page.

(In reply to Andrew McCreight [:mccr8] from comment #21)

However, getting a verbose CC log does free the page. (For Jon, it doesn't, which is odd, but I'm going to ignore that for now.)

I rechecked this. I don't see a memory reduction immediately when I save verbose logs, but if I do minimize memory after that then everything is released.

I'm still confused about what the difference is between doing a GC via "memory-pressure" "heap-minimize" followed by a CC, and doing a GC within the CC itself. I noticed that the "heap-minimize" GC is a shrinking one, while the forceGC one is not, but sadly making the forceGC into a shrinking GC doesn't somehow make it stop collecting the leaked page.

I suspect the difference is that if you set forceGC it uses JS::GCReason::SHUTDOWN_CC. The GC treats shutdown reasons specially and does extra cleanup and some things that only make sense during shutdown:

https://searchfox.org/mozilla-central/search?q=symbol:_ZN2js2gc16IsShutdownReasonEN2JS8GCReasonE&redirect=false

Not all of this is valid to do outside shutdown. In particular it will affect use of FinalizationRegistry and WeakRef, which suggests these could be involved in the leak. We should probably make this behaviour explicit rather than depending on the reason in this way.

Flags: needinfo?(jcoppeard)

This seems to be caused by how FinalizationRegistry is implemented. When registering a target and held value with a finalization registry we create a record object containing the held value (and various other things). These records are marked as roots and are used to queue the finalization callback when the target is found to be dead. One design constraint is that we must not entrain the target, since that would make it uncollectable. Unfortunately record objects (and the other objects involved) are normal JS objects and reference the global, so if the target is reachable from the global then we trigger this problem and make the target, and the global, live forever.
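A minimal sketch of that pattern, using hypothetical page code (only the standard FinalizationRegistry API is real; the names and values are made up):

// The registry, the held value and the internal record created by register()
// all belong to the page's global. As described above, the record is traced
// as a root and references the global, so a target that is itself reachable
// from the global keeps the global, and therefore the whole window, alive.
const registry = new FinalizationRegistry(held => {
  console.log("finalized:", held);
});

// Stand-in for something large, e.g. the wasm instance's exports, that the
// page also keeps reachable through a global variable.
globalThis.appExports = { exports: "large wasm instance" };

// Registering a target that is reachable from the global: the rooted record
// keeps the global alive, the global keeps the target alive, and neither is
// ever collected, even after the page is reloaded.
registry.register(globalThis.appExports, "appExports collected");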

Assignee: nobody → jcoppeard

(In reply to Jon Coppeard (:jonco) from comment #25)

I suspect the difference is that if you set forceGC it uses JS::GCReason::SHUTDOWN_CC. The GC treats shutdown reasons specially and does extra cleanup and some things that only make sense during shutdown:

Huh. Good point. The piece of code that decides the GC reason is aForceGC ? JS::GCReason::SHUTDOWN_CC : JS::GCReason::CC_FORCED, so you are right that when forceGC is true the reason ends up as SHUTDOWN_CC. For some reason I wasn't actually reading the code and was assuming it would only do a SHUTDOWN_CC on shutdown, so when I was manually inserting GarbageCollect() calls I was using CC_FORCED instead, which explains why it wasn't doing the cleanup. The current behavior is incorrect in the case of a non-shutdown CC, so I should file a bug on that. Though in this case it did help reveal the issue, oddly enough.

There's quite a lot of data that's stored on the zone and this bug will add
more. We can encapsulate this inside a per-zone object that is created on
demand.

To mark finalization records from the registry's global we need the registry
and its targets to be swept in the same sweep group, otherwise the global may
not be marked in time to mark the CCWs to the records.

The patch adds a count of cross-zone wrappers in the record map and uses this
to add sweep group edges in both directions.

We have to be careful when nuking wrappers that we update this correctly. The
patch adds additional state to a finalization record to indicate whether it is
currently in the Zone's record map. This is different to the previous 'active'
state, here renamed to 'registered' as it reflects the caller's use of
register() and unregister() methods.

Depends on D138436

This adds a set of record object wrappers to the global data and traces them as
part of global tracing.

These may be cross-compartment edges. They are not in the CCW map but are
handled by the extra sweep group edges added by the previous patch.

Depends on D138437

Depends on D138438

Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/bdc9d39e0886
Part 1: Factor out FinalizationRegistry data from Zone into its own class r=sfink
https://hg.mozilla.org/integration/autoland/rev/c6e0088728db
Part 2: Sweep FinalizationRegistries in the same sweep group as their targets r=sfink
https://hg.mozilla.org/integration/autoland/rev/ca41fc23ea6a
Part 3: Only trace finalization records if the registry's global is alive r=sfink
https://hg.mozilla.org/integration/autoland/rev/e9a80289cc3b
Part 4: Test that global not kept alive when a finalization registry target is reachable from it r=sfink

Hi Anders, can you check whether the memory leak is still happening in the latest nightly?

Flags: needinfo?(anders.frantz)
Regressions: 1755257

(In reply to Jon Coppeard (:jonco) from comment #35)

Hi Anders, can you check whether the memory leak is still happening in the latest nightly?

Hi,
Thanks for the effort!
I've made some quick tests with a nightly build and it seems to be working better now.

I've only looked in the task manager, no rigorous memory analysis.
Memory seems to double up temporarily on reload, but then it's reduced to about the initial size, which is quite large by the way, but that's another question.

Flags: needinfo?(anders.frantz)

(In reply to anders.frantz from comment #36)
Great, good to hear it's working better now. I'll mark this as fixed.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Keywords: leave-open
Target Milestone: --- → 99 Branch
Regressions: 1756567
Regressions: 1813556