Closed Bug 1586236 Opened 5 years ago Closed 4 years ago

Experiment detecting low memory scenarios using memory resource notifications in Windows

Categories

(Core :: Memory Allocator, enhancement, P3)

Tracking

Status: RESOLVED FIXED
Target Milestone: 89 Branch
Fission Milestone: M8
Tracking Status: firefox89 --- fixed

People

(Reporter: gsvelto, Assigned: toshi)

References

(Blocks 1 open bug)

Attachments

(3 files, 2 obsolete files)

The available memory tracker currently uses a polling mechanism to detect low-memory scenarios. Windows offers a signal-based system to detect low/high memory scenarios called memory resource notifications. The thresholds used by this system are not documented and it's unclear what it's measuring against (commit space? physical memory?), so we should experiment with it to establish whether it could be a useful replacement for our current implementation.
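
For reference, the underlying Windows API is small; a minimal stand-alone sketch (not Firefox code, using only the documented kernel32 calls) looks roughly like this:

  #include <windows.h>
  #include <cstdio>

  int main() {
    // Ask the kernel for a "low physical memory" notification object. The
    // threshold at which it becomes signaled is chosen by Windows and is not
    // documented, which is what this bug wants to measure.
    HANDLE lowMemory =
        CreateMemoryResourceNotification(LowMemoryResourceNotification);
    if (!lowMemory) {
      return 1;
    }

    // Non-blocking query of the current state; the handle can also be waited
    // on like any other synchronization object.
    BOOL isLow = FALSE;
    if (QueryMemoryResourceNotification(lowMemory, &isLow)) {
      printf("low physical memory: %s\n", isLow ? "yes" : "no");
    }

    CloseHandle(lowMemory);
    return 0;
  }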

I'm going to work on this right away.

Assignee: nobody → gsvelto
Status: NEW → ASSIGNED
Summary: Experiment detecting low memory scenarios using resource notifications in Windows → Experiment detecting low memory scenarios using memory resource notifications in Windows

My plan here is to let this stay in nightly for 2-3 weeks to gather information about the thresholds. If enough OOM crashes have the annotations and the thresholds look reasonable we can remove the polling mechanism entirely on 64-bit builds and keep it only for the 32-bit ones. We will also be able to remove the IPC needed to pass the messages between the main process and the content processes. Without polling there's nothing preventing us from using this mechanism directly within the content processes.

Blocks: 1587762
Attachment #9099829 - Attachment is obsolete: true

Quick update: I'm completely rewriting the patch to use the memory resource notifications to replace all our current machinery. I also found a good way to test this without having to use a VM. It's possible to adjust the maximum amount of memory that Windows will use at boot time:

https://helpdeskgeek.com/windows-7/fix-the-maximum-amount-of-memory-usable-by-windows-7-64-bit/

Putting the swap file on a regular HDD instead of the SSD on which I have Windows installed is the final touch to simulate slow hardware on which Firefox might be running.

As a side note I found one more reason to abandon the current code: I got confirmation that recent kernels adjust the maximum amount of commit-space at runtime so measuring it is pointless. I saw it happening only once during testing and thought it was a fluke but by toying with the boot parameters I can reproduce it all the time.
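
For context, the system commit total and commit limit can be watched with GetPerformanceInfo(); a minimal sketch (not the code being removed here, and the helper name is made up) that logs them — calling it repeatedly while the kernel grows the page file shows the limit moving:

  #include <windows.h>
  #include <psapi.h>  // link with psapi.lib
  #include <cstdio>

  // Logs the system-wide commit charge and commit limit in MiB. On kernels
  // that resize the page file at runtime the limit can change between calls.
  void PrintCommitLimit() {
    PERFORMANCE_INFORMATION info = {};
    info.cb = sizeof(info);
    if (GetPerformanceInfo(&info, sizeof(info))) {
      unsigned long long totalMiB =
          (unsigned long long)info.CommitTotal * info.PageSize / (1024 * 1024);
      unsigned long long limitMiB =
          (unsigned long long)info.CommitLimit * info.PageSize / (1024 * 1024);
      printf("commit: %llu / %llu MiB\n", totalMiB, limitMiB);
    }
  }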

David, if you want to have a look this is a WIP of the revised patch. I've dropped all the old code since it wouldn't be useful anymore and left only low physical memory detection via the memory resource notifications. The core logic works like this (pending small changes):

  • We listen for a low-memory resource notification using `RegisterWaitForSingleObject()` (a rough sketch of the registration step follows this list)
  • When we hit a low-memory notification we send the relevant event and start polling the memory status every 10 seconds. Once the low-memory condition is over we go back to listening
  • If the user becomes inactive we stop polling, but only when the timer ticks; when the user becomes active again we resume polling
  • The reason why we don't cancel polling every time the user is inactive is that it could cause the timer to restart far too often and, depending on the user input, it might not even get a chance to tick for a long time
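
A rough sketch of the registration step under those assumptions; the callback and handle names are made up and this is not the actual patch:

  #include <windows.h>

  static HANDLE gLowMemoryHandle = nullptr;  // hypothetical globals
  static HANDLE gWaitHandle = nullptr;

  // Called by the thread-pool wait when the low-memory object is signaled;
  // this is where the memory-pressure event would be sent and polling started.
  static VOID CALLBACK OnLowMemory(PVOID /*aContext*/, BOOLEAN /*aTimedOut*/) {}

  bool StartListeningForLowMemory() {
    gLowMemoryHandle =
        CreateMemoryResourceNotification(LowMemoryResourceNotification);
    if (!gLowMemoryHandle) {
      return false;
    }
    // WT_EXECUTEONLYONCE: fire once, then re-register after the low-memory
    // condition clears, matching the listen -> poll -> listen cycle above.
    return RegisterWaitForSingleObject(&gWaitHandle, gLowMemoryHandle,
                                       OnLowMemory, nullptr, INFINITE,
                                       WT_EXECUTEONLYONCE) != FALSE;
  }
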
Attachment #9103934 - Attachment description: Bug 1586236 - Add annotations to measure the thresholds of memory resource notifications on Windows; r=dmajor → Bug 1586236 - Use memory resource notifications to detect low memory scenarios on Windows; r=dmajor

[Tracking Requested - why for this release]: We might want to keep an eye on this as it relates to our recent increase in OOM crashes.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/60cccadbeaf0 Use memory resource notifications to detect low memory scenarios on Windows; r=dmajor
Backout by rgurzau@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/87443484a46a Backed out changeset 60cccadbeaf0 for failures at test_memoryReporters.xul on a CLOSED TREE.

I forgot to re-run the tests after having added the changes to the memory reporting machinery, silly me.

Pushed by gsvelto@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9fccb38453a4 Use memory resource notifications to detect low memory scenarios on Windows; r=dmajor
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla72
Target Milestone: mozilla72 → ---

Since with this patch we're sending memory-pressure events more frequently on platforms with little memory (like our 32-bit Windows 7 runners) I did expect some issues to crop up, but not this fast or this frequently. The stack traces are not very helpful; it seems that a call to nsIObserverService.notifyObservers() is throwing an out-of-memory exception. It is very unlikely that the nsIObserverService implementation is responsible for this; it's possible that one of the memory-pressure listeners is throwing it and we're just bubbling it up. Why on earth a memory-pressure listener would throw that exception is beyond me. I'll try to reproduce locally with a 32-bit build to see what might be causing it.

Flags: needinfo?(gsvelto)

OK, I'm finally back working on this one. I hope to have a fix for the tests by tomorrow.

I've been trying to figure out why the tests fail for a while now, but to no avail. Somehow I can't manage to reproduce the issue locally, even when artificially bombarding the running tests with memory-pressure notifications. This is annoying :-(

There's an r+ patch which didn't land and there has been no activity in this bug for 2 weeks.
:gsvelto, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(gsvelto)

This is still undergoing tests which is why it hasn't landed yet.

Flags: needinfo?(gsvelto)

I managed to reproduce some of the failures here: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=280444150&revision=76e9a422edeef3371506d20b0d1f0822064e92d8

It's tricky business. In some cases we're hitting actual out-of-memory errors, in others we're timing out possibly because the memory-pressure notifications are slowing us down. There's some message in the log that might suggest we're sending more than one notification at the same time even though that should not be possible. I fear this investigation is going to take a while.

I've manually instrumented a build to figure out what's going on during the failures but I still don't get anything readable from it. There are out-of-memory errors being thrown but I don't know from where. The upside is that they seem to be happening while the memory-pressure notification is being dispatched, so it's probably some component misbehaving during the handling of that notification.

(In reply to Gabriele Svelto [:gsvelto] from comment #20)

I managed to reproduce some of the failures here: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=280444150&revision=76e9a422edeef3371506d20b0d1f0822064e92d8

It's tricky business. In some cases we're hitting actual out-of-memory errors, in others we're timing out possibly because the memory-pressure notifications are slowing us down. There's some message in the log that might suggest we're sending more than one notification at the same time even though that should not be possible. I fear this investigation is going to take a while.

(In reply to Gabriele Svelto [:gsvelto] from comment #21)

I've manually instrumented a build to figure out what's going on during the failures but I still don't get anything readable from it. There are out-of-memory errors being thrown but I don't know from where. The upside is that they seem to be happening while the memory-pressure notification is being dispatched, so it's probably some component misbehaving during the handling of that notification.

Any updates on this work?

Flags: needinfo?(gsvelto)

No, sorry, this has stalled.

Flags: needinfo?(gsvelto)

gcp said Toshihito will work on the low memory detection for Windows, so I'm leaving an NI here.

Assignee: gsvelto → nobody
Flags: needinfo?(tkikuchi)

Yes, I can take this.

Assignee: nobody → tkikuchi
Flags: needinfo?(tkikuchi)
Attached file bug1586236-datareview-request (deleted) —
Attachment #9212125 - Flags: data-review?(chutten)

Comment on attachment 9212125 [details]
bug1586236-datareview-request

DATA COLLECTION REVIEW RESPONSE:

Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?

Yes.

Is there a control mechanism that allows the user to turn the data collection on and off?

Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.

If the request is for permanent data collection, is there someone who will monitor the data over time?

Yes, Toshihito Kikuchi is responsible.

Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?

Category 1, Technical.

Is the data collection request for default-on or default-off?

Default on for all channels.

Does the instrumentation include the addition of any new identifiers?

No.

Is the data collection covered by the existing Firefox privacy notice?

Yes.

Does there need to be a check-in in the future to determine whether to renew the data?

No. This collection is permanent.


Result: datareview+

Attachment #9212125 - Flags: data-review?(chutten) → data-review+
Fission Milestone: --- → M8
Priority: -- → P3

This patch uses the low memory resource notification facility to detect
scenarios where physical memory is running low without polling. This is a
significant change compared to the previous behavior which measured both
available virtual memory (only on 32-bit builds) and available commit space.

Since we're no longer trying to avoid OOMs we no longer save memory reports
when hitting a low-memory condition.

Attachment #9103934 - Attachment is obsolete: true
Pushed by tkikuchi@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bba32db4da1e Use memory resource notifications to detect low memory scenarios on Windows. r=gsvelto
Status: REOPENED → RESOLVED
Closed: 5 years ago → 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → 89 Branch

As described in Microsoft's documentation, I confirmed that a memory resource notification object is signaled when physical memory is running low, not when commit space is low.

The latest data shows the new annotation LowPhysicalMemoryEvents is included in OOM crashes more often than the old annotation LowCommitSpaceEvents (60% vs 50%).

That's very good news, it means that we're catching those conditions more often without having to poll.

I updated the query to compare v89beta vs v89nightly vs v88. Unfortunately the ratio of OOM crashes with low_physical_memory_events annotation from v89b is about 40%, less than v88.

I also made histograms to analyze OOM crashes: https://iodide.telemetry.mozilla.org/notebooks/531/?viewMode=report. These charts are based on the last 10,000 OOM crashes from the x64 Firefox release version on Windows. This data tells us OOM crashes are caused by a low page file (= low commit space), not low physical memory.

Lastly I also did some manual experiments with the memory resource notification objects and saw that the notification was signaled when physical memory usage reached about 93%, regardless of the total size of physical memory. That explains the ratio of OOM crashes with the new annotation, because the ratio of OOM crashes with ~10% physical memory available is 50% or less.
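
For reference, roughly how such a manual experiment could be scripted; this sketch (not the actual experiment) polls the notification state next to the physical memory load so the flip point shows up in the output — the ~93% figure comes from the observation above, not from this code:

  #include <windows.h>
  #include <cstdio>

  // Polls the low-memory notification state alongside the physical memory
  // load and logs every transition, so the percentage at which the object
  // flips to signaled shows up in the output.
  int main() {
    HANDLE lowMemory =
        CreateMemoryResourceNotification(LowMemoryResourceNotification);
    if (!lowMemory) {
      return 1;
    }
    BOOL wasLow = FALSE;
    for (;;) {
      BOOL isLow = FALSE;
      MEMORYSTATUSEX status = {};
      status.dwLength = sizeof(status);
      if (QueryMemoryResourceNotification(lowMemory, &isLow) &&
          GlobalMemoryStatusEx(&status) && isLow != wasLow) {
        printf("state -> %s at %lu%% physical memory load\n",
               isLow ? "LOW" : "ok", status.dwMemoryLoad);
        wasLow = isLow;
      }
      Sleep(1000);
    }
  }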

(In reply to Toshihito Kikuchi [:toshi] from comment #34)

I updated the query to compare v89beta vs v89nightly vs v88. Unfortunately the ratio of OOM crashes with low_physical_memory_events annotation from v89b is about 40%, less than v88.

Yes, that's expected. We went from detecting low commit-space to low physical memory so it was bound to happen. This means we're probably offering a better experience to the user (less swapping) but we're not reducing the number of OOM crashes. BTW did you look at a breakdown of main process vs content process crashes? Last time I checked the vast majority of our crashes were in the main process (obviously) but I wonder if detecting low physical memory altered it (it shouldn't).

I also made histograms to analyze OOM crashes: https://iodide.telemetry.mozilla.org/notebooks/531/?viewMode=report. These charts are based on the last 10,000 OOM crashes from the x64 Firefox release version on Windows. This data tells us OOM crashes are caused by a low page file (= low commit space), not low physical memory.

Lastly I also did some manual experiments with the memory resource notification objects and saw that the notification was signaled when physical memory usage reached about 93%, regardless of the total size of physical memory. That explains the ratio of OOM crashes with the new annotation, because the ratio of OOM crashes with ~10% physical memory available is 50% or less.

I've looked at the OOM ratio in nightly and it didn't go up from what I can tell. This means that previously while we did detect low commit-space scenarios we weren't able to prevent processes from crashing anyway. Tab unloading should change that. Also we really need to make progress in bug 1475518. Unless we know where our commit-space is going we won't be able to counter it.

BTW did you look at a breakdown of main process vs content process crashes? Last time I checked the vast majority of our crashes were in the main process (obviously) but I wonder if detecting low physical memory altered it (it shouldn't).

I created https://sql.telemetry.mozilla.org/queries/79145#198220 for per-process OOM crashes on Nightly. On 64-bit OSes, the biggest group is the main process and the second group is the content process. On 32-bit OSes (excluding wow64 crashes), on the other hand, the biggest group is the content process.

The number of 64-bit main-process crashes has increased since Apr 4, but I'm not sure it's caused by this patch (landed on Mar 31).

We went from detecting low commit-space to low physical memory so it was bound to happen. This means we're probably offering a better experience to the user (less swapping) but we're not reducing the number of OOM crashes.

I'm wondering what the main downside of the old polling logic was. I thought the new notification-based logic was better because polling has a performance cost and a hardcoded threshold is not flexible. However, these charts suggest the new detection logic may cause lots of false negatives (low page file but physical memory still available = we don't trigger tab unloading when we want to) and false positives (low physical memory but page file still available = we trigger tab unloading when we shouldn't), so I'm leaning towards thinking the polling is better.

I've looked at the OOM ratio in nightly and it didn't go up from what I can tell. This means that previously while we did detect low commit-space scenarios we weren't able to prevent processes from crashing anyway. Tab unloading should change that. Also we really need to make progress in bug 1475518. Unless we know where our commit-space is going we won't be able to counter it.

It may not be available in the crash ping yet, but we should check how much commit space is consumed by Firefox itself.
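
One way to look at our own share would be PrivateUsage from GetProcessMemoryInfo(), which is the commit charge of the current process; a minimal sketch (the helper name is made up, and this is not what the crash ping collects today):

  #include <windows.h>
  #include <psapi.h>  // link with psapi.lib
  #include <cstdio>

  // Reports the commit charge (private bytes) of the current process, i.e.
  // the part of the system-wide commit total this process is responsible for.
  void PrintOwnCommitCharge() {
    PROCESS_MEMORY_COUNTERS_EX counters = {};
    counters.cb = sizeof(counters);
    if (GetProcessMemoryInfo(
            GetCurrentProcess(),
            reinterpret_cast<PROCESS_MEMORY_COUNTERS*>(&counters),
            sizeof(counters))) {
      printf("private commit: %llu MiB\n",
             (unsigned long long)counters.PrivateUsage / (1024 * 1024));
    }
  }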

This change caused some noticeable changes to the GC telemetry when it landed. The number of non-incremental GCs went from 15% to 19% which suggests we're getting a lot more low memory notifications now. Also we're spending more time compacting, probably because this is being triggered by the memory pressure GCs. This is not necessarily a problem but worth noting.

Regressions: 1712084

(In reply to Jon Coppeard (:jonco) from comment #38)

This change caused some noticeable changes to the GC telemetry when it landed. The number of non-incremental GCs went from 15% to 19% which suggests we're getting a lot more low memory notifications now. Also we're spending more time compacting, probably because this is being triggered by the memory pressure GCs. This is not necessarily a problem but worth noting.

We're still figuring out the best way to detect a low memory situation on Windows (and other platforms, too). This patch was efficient from the performance perspective, but probably not accurate enough at detecting users who are likely to hit an OOM crash. We need more analysis and experiments to improve the detection logic. In the meantime, I landed the patch for bug 1711610 to avoid triggering unnecessary memory pressures. It should bring the number of non-incremental GCs back to 15% or even lower.

(In reply to Toshihito Kikuchi [:toshi] from comment #39)
Great, I like the empirical approach in that bug. Let's see what difference that makes to the telemetry.
