Closed Bug 1586236 Opened 5 years ago Closed 4 years ago

Experiment detecting low memory scenarios using memory resource notifications in Windows

Categories

(Core :: Memory Allocator, enhancement, P3)

Tracking

Status: RESOLVED FIXED
Target Milestone: 89 Branch
Fission Milestone: M8
Tracking Status: firefox89 --- fixed

People

(Reporter: gsvelto, Assigned: toshi)

References

(Blocks 1 open bug)

Attachments

(3 files, 2 obsolete files)

The available memory tracker currently uses a polling mechanism to detect low-memory scenarios. Windows offers a signal-based system to detect low/high memory scenarios called memory resource notifications. The thresholds used by this system are not documented and it's unclear what it's measuring against (commit space? physical memory?), so we should experiment with it to establish whether it could be a useful replacement for our current implementation.
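
For reference, the underlying Windows API is small; a minimal stand-alone sketch (not Firefox code, using only the documented kernel32 calls) looks roughly like this:

  #include <windows.h>
  #include <cstdio>

  int main() {
    // Ask the kernel for a "low physical memory" notification object. The
    // threshold at which it becomes signaled is chosen by Windows and is not
    // documented, which is what this bug wants to measure.
    HANDLE lowMemory =
        CreateMemoryResourceNotification(LowMemoryResourceNotification);
    if (!lowMemory) {
      return 1;
    }

    // Non-blocking query of the current state; the handle can also be waited
    // on like any other synchronization object.
    BOOL isLow = FALSE;
    if (QueryMemoryResourceNotification(lowMemory, &isLow)) {
      printf("low physical memory: %s\n", isLow ? "yes" : "no");
    }

    CloseHandle(lowMemory);
    return 0;
  }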

I'm going to work on this right away.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Summary: Experiment detecting low memory scenarios using resource notifications in Windows → Experiment detecting low memory scenarios using memory resource notifications in Windows

My plan here is to let this stay in nightly for 2-3 weeks to gather information about the thresholds. If enough OOM crashes have the annotations and the thresholds look reasonable we can remove the polling mechanism entirely on 64-bit builds and keep it only for the 32-bit ones. We will also be able to remove the IPC needed to pass the messages between the main process and the content processes. Without polling there's nothing preventing us from using this mechanism directly within the content processes.

Blocks: 1587762
Attachment #9099829 - Attachment is obsolete: true

Quick update: I'm completely rewriting the patch to use the memory resource notifications to replace all our current machinery. I also found a good way to test this without having to use a VM. It's possible to adjust the maximum amount of memory that Windows will use at boot time:

https://helpdeskgeek.com/windows-7/fix-the-maximum-amount-of-memory-usable-by-windows-7-64-bit/

Putting the swap file on a regular HDD instead of the SSD on which I have Windows installed is the final touch to simulate slow hardware on which Firefox might be running.

As a side note I found one more reason to abandon the current code: I got confirmation that recent kernels adjust the maximum amount of commit-space at runtime so measuring it is pointless. I saw it happening only once during testing and thought it was a fluke but by toying with the boot parameters I can reproduce it all the time.
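
For context, the system commit total and commit limit can be watched with GetPerformanceInfo(); a minimal sketch (not the code being removed here, and the helper name is made up) that logs them — calling it repeatedly while the kernel grows the page file shows the limit moving:

  #include <windows.h>
  #include <psapi.h>  // link with psapi.lib
  #include <cstdio>

  // Logs the system-wide commit charge and commit limit in MiB. On kernels
  // that resize the page file at runtime the limit can change between calls.
  void PrintCommitLimit() {
    PERFORMANCE_INFORMATION info = {};
    info.cb = sizeof(info);
    if (GetPerformanceInfo(&info, sizeof(info))) {
      unsigned long long totalMiB =
          (unsigned long long)info.CommitTotal * info.PageSize / (1024 * 1024);
      unsigned long long limitMiB =
          (unsigned long long)info.CommitLimit * info.PageSize / (1024 * 1024);
      printf("commit: %llu / %llu MiB\n", totalMiB, limitMiB);
    }
  }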

David, if you want to have a look this is a WIP of the revised patch. I've dropped all the old code since it wouldn't be useful anymore and left only low physical memory detection via the memory resource notifications. The core logic works like this (pending small changes):

  • We listen for a low-memory resource notification using `RegisterWaitForSingleObject()` (a rough sketch of the registration step follows this list)
  • When we hit a low-memory notification we send the relevant event and start polling the memory status every 10 seconds. Once the low-memory condition is over we go back to listening
  • If the user becomes inactive we stop polling, but only when the timer ticks; when the user becomes active again we resume polling
  • The reason why we don't cancel polling every time the user is inactive is that it could cause the timer to restart far too often and, depending on the user input, it might not even get a chance to tick for a long time
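
A rough sketch of the registration step under those assumptions; the callback and handle names are made up and this is not the actual patch:

  #include <windows.h>

  static HANDLE gLowMemoryHandle = nullptr;  // hypothetical globals
  static HANDLE gWaitHandle = nullptr;

  // Called by the thread-pool wait when the low-memory object is signaled;
  // this is where the memory-pressure event would be sent and polling started.
  static VOID CALLBACK OnLowMemory(PVOID /*aContext*/, BOOLEAN /*aTimedOut*/) {}

  bool StartListeningForLowMemory() {
    gLowMemoryHandle =
        CreateMemoryResourceNotification(LowMemoryResourceNotification);
    if (!gLowMemoryHandle) {
      return false;
    }
    // WT_EXECUTEONLYONCE: fire once, then re-register after the low-memory
    // condition clears, matching the listen -> poll -> listen cycle above.
    return RegisterWaitForSingleObject(&gWaitHandle, gLowMemoryHandle,
                                       OnLowMemory, nullptr, INFINITE,
                                       WT_EXECUTEONLYONCE) != FALSE;
  }
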
Attachment #9103934 - Attachment description: Bug 1586236 - Add annotations to measure the thresholds of memory resource notifications on Windows; r=dmajor → Bug 1586236 - Use memory resource notifications to detect low memory scenarios on Windows; r=dmajor

[Tracking Requested - why for this release]: We might want to keep an eye on this as it relates to our recent increase in OOM crashes.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/60cccadbeaf0 Use memory resource notifications to detect low memory scenarios on Windows; r=dmajor
Backout by rgurzau@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/87443484a46a Backed out changeset 60cccadbeaf0 for failures at test_memoryReporters.xul on a CLOSED TREE.

I forgot to re-run the tests after having added the changes to the memory reporting machinery, silly me.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9fccb38453a4 Use memory resource notifications to detect low memory scenarios on Windows; r=dmajor
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Target Milestone: mozilla72 → ---

Since with this patch we're sending memory-pressure events more frequently on platforms with little memory (like our 32-bit Windows 7 runners) I did expect some issues to crop up, but not this fast or this frequently. The stack traces are not very helpful; it seems that a call to nsIObserverService.notifyObservers() is throwing an out-of-memory exception. It is very unlikely that the nsIObserverService implementation is responsible for this; it's possible that one of the memory-pressure listeners is throwing it and we're just bubbling it up. Why on earth a memory-pressure listener would throw that exception is beyond me. I'll try to reproduce locally with a 32-bit build to see what might be causing it.

Flags: needinfo?(gsvelto)

OK, I'm finally back working on this one. I hope to have a fix for the tests by tomorrow.

I've been trying to figure out why the tests fail for a while now, but to no avail. Somehow I can't manage to reproduce the issue locally, even when artificially bombarding the running tests with memory-pressure notifications. This is annoying :-(

There's an r+ patch which didn't land and there has been no activity in this bug for 2 weeks.
:gsvelto, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)

This is still undergoing tests which is why it hasn't landed yet.

Flags: needinfo?(gsvelto)

I managed to reproduce some of the failures here: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=280444150&revision=76e9a422edeef3371506d20b0d1f0822064e92d8

It's tricky business. In some cases we're hitting actual out-of-memory errors, in others we're timing out possibly because the memory-pressure notifications are slowing us down. There's some message in the log that might suggest we're sending more than one notification at the same time even though that should not be possible. I fear this investigation is going to take a while.

I've manually instrumented a build to figure out what's going on during the failures but I still don't get anything readable from it. There are out-of-memory errors being thrown but I don't know from where. The upside is that they seem to be happening while the memory-pressure notification is being dispatched, so it's probably some component misbehaving during the handling of that notification.

(In reply to Gabriele Svelto [:gsvelto] from comment #20)

I managed to reproduce some of the failures here: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=280444150&revision=76e9a422edeef3371506d20b0d1f0822064e92d8

It's tricky business. In some cases we're hitting actual out-of-memory errors, in others we're timing out possibly because the memory-pressure notifications are slowing us down. There's some message in the log that might suggest we're sending more than one notification at the same time even though that should not be possible. I fear this investigation is going to take a while.

(In reply to Gabriele Svelto [:gsvelto] from comment #21)

I've manually instrumented a build to figure out what's going on during the failures but I still don't get anything readable from it. There are out-of-memory errors being thrown but I don't know from where. The upside is that they seem to be happening while the memory-pressure notification is being dispatched, so it's probably some component misbehaving during the handling of that notification.

Any updates on this work?

Flags: needinfo?(gsvelto)

No, sorry, this has stalled.

Flags: needinfo?(gsvelto)

gcp said Toshihito will work on the low memory detection for Windows, so I'm leaving an NI here.

Assignee: gsvelto → nobody
Flags: needinfo?(tkikuchi)

Yes, I can take this.

Assignee: nobody → tkikuchi
Flags: needinfo?(tkikuchi)
Attached file bug1586236-datareview-request (deleted) —
Attachment #9212125 - Flags: data-review?(chutten)

Comment on attachment 9212125 [details]
bug1586236-datareview-request

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Toshihito Kikuchi is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9212125 - Flags: data-review?(chutten) → data-review+
Fission Milestone: --- → M8
Priority: -- → P3

This patch uses the low memory resource notification facility to detect
scenarios where physical memory is running low without polling. This is a
significant change compared to the previous behavior which measured both
available virtual memory (only on 32-bit builds) and available commit space.

Since we're no longer trying to avoid OOMs we no longer save memory reports
when hitting a low-memory condition.

Attachment #9103934 - Attachment is obsolete: true
Pushed by tkikuchi@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bba32db4da1e Use memory resource notifications to detect low memory scenarios on Windows. r=gsvelto
Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch

As described in Microsoft's documentation, I confirmed that a memory resource notification object is signaled when physical memory is running low, not when commit space is low.

The latest data shows the new annotation LowPhysicalMemoryEvents is included in OOM crashes more often than the old annotation LowCommitSpaceEvents (60% vs 50%).

That's very good news, it means that we're catching those conditions more often without having to poll.

I updated the query to compare v89beta vs v89nightly vs v88. Unfortunately the ratio of OOM crashes with low_physical_memory_events annotation from v89b is about 40%, less than v88.

I also made histograms to analyze OOM crashes: https://iodide.telemetry.mozilla.org/notebooks/531/?viewMode=report. These charts are based on the last 10,000 OOM crashes from the x64 Firefox release version on Windows. This data tells us OOM crashes are caused by a low page file (= low commit space), not low physical memory.

Lastly I also did some manual experiments with the memory resource notification objects and saw that the notification was signaled when physical memory usage reached about 93%, regardless of the total size of physical memory. That explains the ratio of OOM crashes with the new annotation, because the ratio of OOM crashes with ~10% physical memory available is 50% or less.
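
For reference, roughly how such a manual experiment could be scripted; this sketch (not the actual experiment) polls the notification state next to the physical memory load so the flip point shows up in the output — the ~93% figure comes from the observation above, not from this code:

  #include <windows.h>
  #include <cstdio>

  // Polls the low-memory notification state alongside the physical memory
  // load and logs every transition, so the percentage at which the object
  // flips to signaled shows up in the output.
  int main() {
    HANDLE lowMemory =
        CreateMemoryResourceNotification(LowMemoryResourceNotification);
    if (!lowMemory) {
      return 1;
    }
    BOOL wasLow = FALSE;
    for (;;) {
      BOOL isLow = FALSE;
      MEMORYSTATUSEX status = {};
      status.dwLength = sizeof(status);
      if (QueryMemoryResourceNotification(lowMemory, &isLow) &&
          GlobalMemoryStatusEx(&status) && isLow != wasLow) {
        printf("state -> %s at %lu%% physical memory load\n",
               isLow ? "LOW" : "ok", status.dwMemoryLoad);
        wasLow = isLow;
      }
      Sleep(1000);
    }
  }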

(In reply to Toshihito Kikuchi [:toshi] from comment #34)

I updated the query to compare v89beta vs v89nightly vs v88. Unfortunately the ratio of OOM crashes with low_physical_memory_events annotation from v89b is about 40%, less than v88.

Yes, that's expected. We went from detecting low commit-space to low physical memory so it was bound to happen. This means we're probably offering a better experience to the user (less swapping) but we're not reducing the number of OOM crashes. BTW did you look at a breakdown of main process vs content process crashes? Last time I checked the vast majority of our crashes were in the main process (obviously) but I wonder if detecting low physical memory altered it (it shouldn't).

I also made histograms to analyze OOM crashes: https://iodide.telemetry.mozilla.org/notebooks/531/?viewMode=report. These charts are based on the last 10,000 OOM crashes from the x64 Firefox release version on Windows. This data tells us OOM crashes are caused by a low page file (= low commit space), not low physical memory.

Lastly I also did some manual experiments with the memory resource notification objects and saw that the notification was signaled when physical memory usage reached about 93%, regardless of the total size of physical memory. That explains the ratio of OOM crashes with the new annotation, because the ratio of OOM crashes with ~10% physical memory available is 50% or less.

I've looked at the OOM ratio in nightly and it didn't go up from what I can tell. This means that previously while we did detect low commit-space scenarios we weren't able to prevent processes from crashing anyway. Tab unloading should change that. Also we really need to make progress in bug 1475518. Unless we know where our commit-space is going we won't be able to counter it.

BTW did you look at a breakdown of main process vs content process crashes? Last time I checked the vast majority of our crashes were in the main process (obviously) but I wonder if detecting low physical memory altered it (it shouldn't).

I created https://sql.telemetry.mozilla.org/queries/79145#198220 for per-process OOM crashes on Nightly. On 64-bit OSes, the biggest group is the main process and the second group is the content process. On 32-bit OSes (excluding wow64 crashes), on the other hand, the biggest group is the content process.

The number of 64-bit main-process crashes has increased since Apr 4, but I'm not sure it's caused by this patch (landed on Mar 31).

We went from detecting low commit-space to low physical memory so it was bound to happen. This means we're probably offering a better experience to the user (less swapping) but we're not reducing the number of OOM crashes.

I'm wondering what the main downside of the old polling logic was. I thought the new notification-based logic was better because polling has a performance cost and a hardcoded threshold is not flexible. However, these charts suggest the new detection logic may cause lots of false negatives (low page file but physical memory still available = we don't trigger tab unloading when we want to) and false positives (low physical memory but page file still available = we trigger tab unloading when we shouldn't), so I'm leaning towards thinking the polling is better.

I've looked at the OOM ratio in nightly and it didn't go up from what I can tell. This means that previously while we did detect low commit-space scenarios we weren't able to prevent processes from crashing anyway. Tab unloading should change that. Also we really need to make progress in bug 1475518. Unless we know where our commit-space is going we won't be able to counter it.

It may not be available in the crash ping yet, but we should check how much commit space is consumed by Firefox itself.
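
One way to look at our own share would be PrivateUsage from GetProcessMemoryInfo(), which is the commit charge of the current process; a minimal sketch (the helper name is made up, and this is not what the crash ping collects today):

  #include <windows.h>
  #include <psapi.h>  // link with psapi.lib
  #include <cstdio>

  // Reports the commit charge (private bytes) of the current process, i.e.
  // the part of the system-wide commit total this process is responsible for.
  void PrintOwnCommitCharge() {
    PROCESS_MEMORY_COUNTERS_EX counters = {};
    counters.cb = sizeof(counters);
    if (GetProcessMemoryInfo(
            GetCurrentProcess(),
            reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&counters),
            sizeof(counters))) {
      printf("private commit: %llu MiB\n",
             (unsigned long long)counters.PrivateUsage / (1024 * 1024));
    }
  }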

This change caused some noticeable changes to the GC telemetry when it landed. The number of non-incremental GCs went from 15% to 19% which suggests we're getting a lot more low memory notifications now. Also we're spending more time compacting, probably because this is being triggered by the memory pressure GCs. This is not necessarily a problem but worth noting.

Regressions: 1712084

(In reply to Jon Coppeard (:jonco) from comment #38)

This change caused some noticeable changes to the GC telemetry when it landed. The number of non-incremental GCs went from 15% to 19% which suggests we're getting a lot more low memory notifications now. Also we're spending more time compacting, probably because this is being triggered by the memory pressure GCs. This is not necessarily a problem but worth noting.

We're still figuring out the best way to detect a low memory situation on Windows (and other platforms, too). This patch was efficient from the performance perspective, but probably not accurate enough at detecting users who are likely to hit an OOM crash. We need more analysis and experiments to improve the detection logic. In the meantime, I landed the patch for bug 1711610 to avoid triggering unnecessary memory pressures. It should bring the number of non-incremental GCs back to 15% or even lower.

(In reply to Toshihito Kikuchi [:toshi] from comment #39)
Great, I like the empirical approach in that bug. Let's see what difference that makes to the telemetry.
