Experiment detecting low memory scenarios using memory resource notifications in Windows
Categories
(Core :: Memory Allocator, enhancement, P3)
Tracking
Release | Tracking | Status
---|---|---
firefox89 | --- | fixed
People
(Reporter: gsvelto, Assigned: toshi)
References
(Blocks 1 open bug)
Details
Attachments
(3 files, 2 obsolete files)
(deleted), text/plain | chutten: data-review+ | Details
Bug 1586236 - Use memory resource notifications to detect low memory scenarios on Windows. r=gsvelto (deleted), text/x-phabricator-request | Details
(deleted), text/x-github-pull-request | Details
The available memory tracker currently uses a polling mechanism to detect low-memory scenarios. Windows offers a signal-based system to detect low/high memory scenarios called memory resource notifications. The thresholds used by this system are not documented and it's unclear what it's measuring against (commit space? physical memory?) so we should experiment with it to establish if it could be a useful replacement for our current implementation.
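For reference, a minimal sketch of the Win32 memory resource notification API discussed here (this is illustrative standalone code, not Firefox's implementation):

```cpp
#include <windows.h>
#include <stdio.h>

int main() {
  // Create a notification object that the kernel signals when it considers
  // physical memory to be low; the threshold is undocumented.
  HANDLE lowMem =
      CreateMemoryResourceNotification(LowMemoryResourceNotification);
  if (!lowMem) return 1;

  // Non-blocking query of the current state.
  BOOL low = FALSE;
  if (QueryMemoryResourceNotification(lowMem, &low)) {
    printf("Low-memory condition currently %s\n", low ? "present" : "absent");
  }

  // Alternatively, block (here up to 1 s) until the condition occurs;
  // the handle stays signaled for as long as the condition holds.
  if (WaitForSingleObject(lowMem, 1000) == WAIT_OBJECT_0) {
    printf("Low-memory notification signaled\n");
  }

  CloseHandle(lowMem);
  return 0;
}
```

This is Windows-only code and requires linking against the default Win32 libraries.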
Reporter | Comment 1 • 5 years ago
I'm going to work on this right away.
Reporter | Updated • 5 years ago
Reporter | Comment 2 • 5 years ago
Reporter | Comment 3 • 5 years ago
My plan here is to let this stay in nightly for 2-3 weeks to gather information about the thresholds. If enough OOM crashes have the annotations and the thresholds look reasonable we can remove the polling mechanism entirely on 64-bit builds and keep it only for the 32-bit ones. We will also be able to remove the IPC needed to pass the messages between the main process and the content processes. Without polling there's nothing preventing us from using this mechanism directly within the content processes.
Updated • 5 years ago
Reporter | Comment 4 • 5 years ago
Quick update: I'm completely rewriting the patch to use the memory resource notifications to replace all our current machinery. I also found a good way to test this w/o having to use a VM. It's possible to adjust the maximum amount of memory that Windows will use at boot time:
https://helpdeskgeek.com/windows-7/fix-the-maximum-amount-of-memory-usable-by-windows-7-64-bit/
Putting the swap file on a regular HDD instead of the SSD on which I have Windows installed is the final touch to simulate slow hardware on which Firefox might be running.
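The linked article describes capping memory via msconfig; as an alternative sketch (untested here, values illustrative), the same cap can be set from an elevated command prompt with BCDEdit's truncatememory element:

```shell
rem Cap usable physical memory at 2 GiB (address given in bytes) so that
rem low-memory code paths trigger easily, then reboot to apply.
bcdedit /set {current} truncatememory 0x80000000
rem After testing, restore the full amount and reboot again.
bcdedit /deletevalue {current} truncatememory
```
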
As a side note I found one more reason to abandon the current code: I got confirmation that recent kernels adjust the maximum amount of commit-space at runtime so measuring it is pointless. I saw it happening only once during testing and thought it was a fluke but by toying with the boot parameters I can reproduce it all the time.
Reporter | Comment 5 • 5 years ago
Reporter | Comment 6 • 5 years ago
David, if you want to have a look this is a WIP of the revised patch. I've dropped all the old code since it wouldn't be useful anymore and left only low physical memory detection via the memory resource notifications. The core logic works like this (pending small changes):
- We listen to a low-memory resource notification using `RegisterWaitForSingleObject()`
- When we hit a low-memory notification we send the relevant event and start polling the memory status every 10 seconds. Once the low-memory condition is over we go back to listening
- If the user becomes inactive we stop polling, but only when the timer ticks; when the user becomes active again we resume polling
- The reason we don't cancel polling every time the user goes inactive is that doing so could restart the timer far too often, and depending on the user's input the timer might not get a chance to tick for a long time
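The listen-then-poll scheme above can be sketched roughly as follows. This is a simplified standalone illustration under my own assumptions (names, intervals, and the lack of user-activity handling are all illustrative), not the actual patch:

```cpp
#include <windows.h>
#include <stdio.h>

static HANDLE gLowMemNotification = nullptr;
static HANDLE gWaitHandle = nullptr;

static void StartListening();

// Runs on a thread-pool thread when the notification object is signaled.
// WT_EXECUTEONLYONCE ensures we get exactly one callback per registration,
// which matters because the handle stays signaled while memory is low.
static VOID CALLBACK OnLowMemory(PVOID, BOOLEAN) {
  printf("low memory: dispatch memory-pressure event\n");

  // Poll every 10 seconds until the low-memory condition clears.
  BOOL low = TRUE;
  while (QueryMemoryResourceNotification(gLowMemNotification, &low) && low) {
    Sleep(10 * 1000);
  }

  // Condition is over: go back to signal-based waiting. Production code
  // would also clean up the previous wait with UnregisterWaitEx().
  StartListening();
}

static void StartListening() {
  RegisterWaitForSingleObject(&gWaitHandle, gLowMemNotification, OnLowMemory,
                              nullptr, INFINITE, WT_EXECUTEONLYONCE);
}

int main() {
  gLowMemNotification =
      CreateMemoryResourceNotification(LowMemoryResourceNotification);
  StartListening();
  Sleep(60 * 1000);  // keep the process alive for the demo
  return 0;
}
```

This is Windows-only code; the real implementation additionally pauses polling while the user is inactive, as described above.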
Updated • 5 years ago
[Tracking Requested - why for this release]: Might want to keep an eye on this as it relates to our recent increase in OOM crashes
Reporter | Comment 10 • 5 years ago
I forgot to re-run the tests after having added the changes to the memory reporting machinery, silly me.
Comment 11 • 5 years ago
Comment 12 • 5 years ago
bugherder
Comment 13 • 5 years ago
Backed out changeset 9fccb38453a4 (Bug 1586236) for causing mass devtools failures, all on Windows
Failure log examples:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274255280&repo=autoland&lineNumber=4486
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274257831&repo=autoland&lineNumber=4649
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274257897&repo=autoland&lineNumber=4558
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=274257802&repo=autoland&lineNumber=4397
Push range showing the failures started with this bug: https://treeherder.mozilla.org/#/jobs?repo=autoland&resultStatus=pending%2Crunning%2Csuccess%2Ctestfailed%2Cbusted%2Cexception&searchStr=fd4dd1f5d4a2b9cb7b653314d7370d081e86acff&fromchange=8aa8ed80ea4325031cd613b5f85a3d694231c2ba&tochange=3e02f4745b0d6fb8eb395cbb81d4bc9098d1626d
Updated • 5 years ago
Comment 14 • 5 years ago
Backout merged https://hg.mozilla.org/mozilla-central/rev/3e02f4745b0d
Reporter | Comment 15 • 5 years ago
Since with this patch we're sending memory-pressure events more frequently on platforms with little memory (like our 32-bit Windows 7 runners) I expected some issues to crop up, but not this fast or this frequently. The stack traces are not very helpful; it seems that a call to `nsIObserverService.notifyObservers()` is throwing an out-of-memory exception. It is very unlikely that the `nsIObserverService` implementation itself is responsible; it's possible that one of the memory-pressure listeners is throwing it and we're just bubbling it up. Why a memory-pressure listener would throw that exception is beyond me. I'll try to reproduce this locally with a 32-bit build to see what might be causing it.
Reporter | Comment 16 • 5 years ago
OK, I'm finally back working on this one. I hope to have a fix for the tests by tomorrow.
Updated • 5 years ago
Reporter | Comment 17 • 5 years ago
I've been trying to figure out why the tests fail for a while now, but to no avail. Somehow I can't reproduce the issue locally, even when artificially bombarding the running tests with memory-pressure notifications. This is annoying :-(
Comment 18 • 5 years ago
There's a r+ patch which didn't land and no activity in this bug for 2 weeks.
:gsvelto, could you have a look please?
For more information, please visit auto_nag documentation.
Reporter | Comment 19 • 5 years ago
This is still undergoing tests which is why it hasn't landed yet.
Reporter | Comment 20 • 5 years ago
I managed to reproduce some of the failures here: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=280444150&revision=76e9a422edeef3371506d20b0d1f0822064e92d8
It's tricky business. In some cases we're hitting actual out-of-memory errors, in others we're timing out possibly because the memory-pressure notifications are slowing us down. There's some message in the log that might suggest we're sending more than one notification at the same time even though that should not be possible. I fear this investigation is going to take a while.
Reporter | Comment 21 • 5 years ago
I've manually instrumented a build to figure out what's going on during the failures but I still don't get anything readable from it. There are out-of-memory errors being thrown but I don't know from where. The upside is that they seem to be happening while the memory-pressure notification is being dispatched, so it's probably some component misbehaving during the handling of that notification.
Comment 22 • 4 years ago
(In reply to Gabriele Svelto [:gsvelto] from comment #20)
> I managed to reproduce some of the failures here: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&selectedJob=280444150&revision=76e9a422edeef3371506d20b0d1f0822064e92d8
> It's tricky business. In some cases we're hitting actual out-of-memory errors, in others we're timing out possibly because the memory-pressure notifications are slowing us down. There's some message in the log that might suggest we're sending more than one notification at the same time even though that should not be possible. I fear this investigation is going to take a while.
(In reply to Gabriele Svelto [:gsvelto] from comment #21)
> I've manually instrumented a build to figure out what's going on during the failures but I still don't get anything readable from it. There's out-of-memory errors being thrown but I don't know from where. The upside is that they seem to be happening while the memory-pressure notification is being dispatched so it's probably some component misbehaving during the handling of that notification.
Any updates on this work?
Updated • 4 years ago
Comment 24 • 4 years ago
gcp said Toshihito will work on the low memory detection for Windows so leaving a NI here.
Assignee | Comment 25 • 4 years ago
Yes, I can take this.
Assignee | Comment 26 • 4 years ago
Comment 27 • 4 years ago
Comment on attachment 9212125 [details]
bug1586236-datareview-request
DATA COLLECTION REVIEW RESPONSE:
Is there or will there be documentation that describes the schema for the ultimate data set available publicly, complete and accurate?
Yes.
Is there a control mechanism that allows the user to turn the data collection on and off?
Yes. This collection is Telemetry so can be controlled through Firefox's Preferences.
If the request is for permanent data collection, is there someone who will monitor the data over time?
Yes, Toshihito Kikuchi is responsible.
Using the category system of data types on the Mozilla wiki, what collection type of data do the requested measurements fall under?
Category 1, Technical.
Is the data collection request for default-on or default-off?
Default on for all channels.
Does the instrumentation include the addition of any new identifiers?
No.
Is the data collection covered by the existing Firefox privacy notice?
Yes.
Does there need to be a check-in in the future to determine whether to renew the data?
No. This collection is permanent.
Result: datareview+
Updated • 4 years ago
Assignee | Comment 28 • 4 years ago
This patch uses the low memory resource notification facility to detect
scenarios where physical memory is running low without polling. This is a
significant change compared to the previous behavior which measured both
available virtual memory (only on 32-bit builds) and available commit space.
Since we're no longer trying to avoid OOMs, we don't save memory reports
when hitting a low-memory condition.
Assignee | Updated • 4 years ago
Comment 29 • 4 years ago
Comment 30 • 4 years ago
bugherder
Comment 31 • 4 years ago
Assignee | Comment 32 • 4 years ago
As described in Microsoft's documentation, I confirmed that a memory resource notification object is signaled when physical memory is running low, not when commit space is low.
The latest data shows the new annotation `LowPhysicalMemoryEvents` is included in OOM crashes more often than the old annotation `LowCommitSpaceEvents` (60% vs 50%).
Reporter | Comment 33 • 4 years ago
That's very good news, it means that we're catching those conditions more often without having to poll.
Assignee | Comment 34 • 4 years ago
I updated the query to compare v89beta vs v89nightly vs v88. Unfortunately the ratio of OOM crashes with the low_physical_memory_events annotation in v89b is about 40%, less than in v88.
I also made histograms to analyze OOM crashes: https://iodide.telemetry.mozilla.org/notebooks/531/?viewMode=report. These charts are based on the last 10,000 OOM crashes from the x64 Firefox release version on Windows. This data tells us that OOM crashes are caused by a low page file (= low commit space), not by low physical memory.
Lastly, I also did some manual experiments with the memory resource notification objects and saw the notification was signaled when physical memory usage reached about 93%, regardless of the total amount of physical memory. That explains the ratio of OOM crashes with the new annotation, because the ratio of OOM crashes with ~10% free physical memory is 50% or less.
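A manual experiment like the one above can be sketched as a small standalone monitor that compares the system-wide memory load from `GlobalMemoryStatusEx()` against the notification state (Windows-only illustrative code, not the tool actually used here):

```cpp
#include <windows.h>
#include <stdio.h>

int main() {
  HANDLE lowMem =
      CreateMemoryResourceNotification(LowMemoryResourceNotification);
  if (!lowMem) return 1;

  // Sample once per second for a minute; allocate memory elsewhere in the
  // system and watch at which dwMemoryLoad the notification flips to signaled.
  for (int i = 0; i < 60; ++i) {
    MEMORYSTATUSEX ms = { sizeof(ms) };  // dwLength must be initialized
    GlobalMemoryStatusEx(&ms);

    BOOL low = FALSE;
    QueryMemoryResourceNotification(lowMem, &low);

    printf("physical memory load: %lu%%, low-memory signaled: %s\n",
           ms.dwMemoryLoad, low ? "yes" : "no");
    Sleep(1000);
  }

  CloseHandle(lowMem);
  return 0;
}
```
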
Reporter | Comment 35 • 4 years ago
(In reply to Toshihito Kikuchi [:toshi] from comment #34)
> I updated the query to compare v89beta vs v89nightly vs v88. Unfortunately the ratio of OOM crashes with low_physical_memory_events annotation from v89b is about 40%, less than v88.
Yes, that's expected. We went from detecting low commit-space to low physical memory so it was bound to happen. This means we're probably offering a better experience to the user (less swapping) but we're not mitigating the amount of OOM crashes. BTW did you look at a breakdown of main process vs content process crashes? Last time I checked the vast majority of our crashes were in the main process (obviously) but I wonder if detecting low physical memory altered it (it shouldn't).
> I also made histograms to analyze OOM crashes: https://iodide.telemetry.mozilla.org/notebooks/531/?viewMode=report. These charts are based on the last 10,000 OOM crashes from x64 Firefox release version on Windows. This data tells us OOM crash is caused by low page file (= low commit space), not physical memory.
> Lastly I also did some manual experiments of the memory resource objects and saw the notification was signaled when the usage of physical memory reached about 93% regardless of the total size of physical memory. That explains the ratio of OOM crashes with the new annotation because the ratio of OOM crashes with ~10% physical memory is 50% or less.
I've looked at the OOM ratio in nightly and it didn't go up from what I can tell. This means that previously, while we did detect low commit-space scenarios, we weren't able to prevent processes from crashing anyway. Tab unloading should change that. Also, we really need to make progress in bug 1475518: unless we know where our commit space is going we won't be able to counter it.
Assignee | Comment 36 • 4 years ago
> BTW did you look at a breakdown of main process vs content process crashes? Last time I checked the vast majority of our crashes were in the main process (obviously) but I wonder if detecting low physical memory altered it (it shouldn't).
I created https://sql.telemetry.mozilla.org/queries/79145#198220 for per-process OOM crashes on Nightly. On 64-bit OSes, the biggest group is the main process and the second biggest is the content process. On 32-bit OSes (excluding WOW64 crashes), on the other hand, the biggest group is the content process.
The number of 64-bit main-process crashes has increased since Apr-4, but I'm not sure that's caused by this patch (which landed on Mar-31).
Assignee | Comment 37 • 4 years ago
> We went from detecting low commit-space to low physical memory so it was bound to happen. This means we're probably offering a better experience to the user (less swapping) but we're not mitigating the amount of OOM crashes.
I'm wondering what the main downside of the old polling logic was. I thought the new notification-based logic was better because polling has a performance cost and a hardcoded threshold is not flexible. However, these charts suggest the new detection logic may produce many false negatives (low page file but physical memory still available = tab unloading not triggered when we want it) and false positives (low physical memory but page file still available = tab unloading triggered when we should not), so I'm leaning towards thinking the polling is better.
> I've looked at the OOM ratio in nightly and it didn't go up from what I can tell. This means that previously while we did detect low commit-space scenarios we weren't able to prevent processes from crashing anyway. Tab unloading should change that. Also we really need to make progress in bug 1475518. Unless we know where our commit-space is going we won't be able to counter it.
It may not be available in the crash ping now, but we should check how much commit space we ourselves consumed.
Comment 38 • 4 years ago
This change caused some noticeable changes to the GC telemetry when it landed. The number of non-incremental GCs went from 15% to 19% which suggests we're getting a lot more low memory notifications now. Also we're spending more time compacting, probably because this is being triggered by the memory pressure GCs. This is not necessarily a problem but worth noting.
Assignee | Updated • 3 years ago
Assignee | Comment 39 • 3 years ago
(In reply to Jon Coppeard (:jonco) from comment #38)
> This change caused some noticeable changes to the GC telemetry when it landed. The number of non-incremental GCs went from 15% to 19% which suggests we're getting a lot more low memory notifications now. Also we're spending more time compacting, probably because this is being triggered by the memory pressure GCs. This is not necessarily a problem but worth noting.
We're still figuring out the best way to detect a low-memory situation on Windows (and other platforms, too). This patch was efficient from a performance perspective, but probably not accurate enough at detecting users who are likely to hit an OOM crash. We need more analysis and experiments to improve the detection logic. In the meantime, I landed the patch for bug 1711610 to avoid triggering unnecessary memory-pressure events. It should bring the number of non-incremental GCs back to 15% or even lower.
Comment 40 • 3 years ago
(In reply to Toshihito Kikuchi [:toshi] from comment #39)
Great, I like the empirical approach in that bug. Let's see what difference that makes to the telemetry.