Add more size classes between 512B and the page size
Categories
(Core :: Memory Allocator, enhancement, P2)
People
(Reporter: pbone, Assigned: pbone)
References
(Blocks 1 open bug)
Details
(Keywords: memory-footprint, Whiteboard: [MemShrink] fission-memory [fission:m95])
Attachments
(4 files, 3 obsolete files)
In a normal configuration (4KiB pages) jemalloc uses power-of-two size classes between 512 bytes and 4KiB. Thanks to Bug 1640309, for a process loading example.com we can see:
bin (bytes) | slop (bytes) | used (bytes) | slop % |
---|---|---|---|
496 | 136 | 9920 | 1% |
512 | 528 | 98304 | 1% |
1024 | 156226 | 634880 | 25% |
2048 | 178876 | 643072 | 28% |
large | 385725 | 3338240 | 12% |
The table shows slop as a fraction of allocated memory per size class. In the classes 1024, 2048 and (not shown) 4096 the slop is much higher.
(Comment edited to mark up the table properly)
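To make the percent column concrete, here is a minimal sketch (C++, not code from the bug or from jemalloc) that reproduces it as slop divided by the memory used in each bin:

```cpp
#include <cstdio>

// Recomputes the "percent" column from the table above: slop as a fraction of
// the memory used in each bin, rounded to the nearest percent.
struct BinRow {
  const char* bin;
  double slopBytes;
  double usedBytes;
};

int main() {
  const BinRow rows[] = {
      {"496", 136, 9920},        {"512", 528, 98304},
      {"1024", 156226, 634880},  {"2048", 178876, 643072},
      {"large", 385725, 3338240},
  };
  for (const BinRow& r : rows) {
    std::printf("%-5s %3.0f%%\n", r.bin, 100.0 * r.slopBytes / r.usedBytes);
  }
  return 0;
}
```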
Comment 1•4 years ago (Assignee)
I tried a few different size-class spacings; a size class every 256 bytes is the best. But I think we can do even better than this by reconsidering all the size classes below 4KiB (I'm happy to do that in a follow-up bug). I measured this with the memory replay tool.
Num bins | Quantum | Allocated (KB) | Waste (KB) | Dirty (KB) | Bookkeep (KB) | Committed (KB) | Bin unused (KB) | Slop (KB) | Unused+slop (KB) | Unused+slop diff | Committed diff |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | pow-of-2 | 7,317 | 516 | 592 | 178 | 9,156 | 553 | 847 | 1,400 | 0.0% | 0.0% |
32 | 128 | 6,841 | 516 | 500 | 184 | 8,964 | 923 | 371 | 1,294 | 7.6% | 2.1% |
16 | 256 | 6,892 | 516 | 436 | 180 | 8,756 | 732 | 422 | 1,154 | 17.6% | 4.4% |
8 | 512 | 7,060 | 516 | 560 | 184 | 8,992 | 671 | 590 | 1,261 | 9.9% | 1.8% |
The first row is the current configuration of using power-of-2 bin sizes in this range (1024, 2048, 4096).
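As an illustration of why the 256-byte spacing wins on slop (the helper names below are invented, not taken from the patch), compare the slop of a few sub-page requests under power-of-two classes and under classes spaced every 256 bytes:

```cpp
#include <cstddef>
#include <cstdio>
#include <initializer_list>

// Round a sub-page request up to its size class under the two schemes being
// compared: power-of-two classes versus a 256-byte quantum.
static size_t RoundUpPow2(size_t aSize) {
  size_t cls = 512;
  while (cls < aSize) {
    cls <<= 1;
  }
  return cls;
}

static size_t RoundUpQuantum(size_t aSize, size_t aQuantum) {
  return (aSize + aQuantum - 1) & ~(aQuantum - 1);
}

int main() {
  for (size_t request : {600u, 1100u, 1500u, 2100u, 3100u}) {
    std::printf("request=%4zu  pow2 slop=%4zu  256B slop=%3zu\n", request,
                RoundUpPow2(request) - request,
                RoundUpQuantum(request, 256) - request);
  }
  return 0;
}
```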
Comment 2•4 years ago
Are you taking into account the fact that for one allocation of each size, the minimum size actually used becomes the page size? Counter-intuitively, this could have the complete reverse effect than expected.
It might be better to see why we're allocating these odd sizes between 512 bytes and 4KiB.
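A back-of-the-envelope sketch of this concern, using the bin counts from the table in comment 1 and assuming a 4KiB page size (the numbers are illustrative, not measurements):

```cpp
#include <cstddef>
#include <cstdio>

int main() {
  // Each bin that holds at least one live allocation keeps at least one page
  // committed, so adding sub-page bins can cost up to a page per extra bin in
  // a process that touches every bin but holds very few allocations.
  const size_t kPageSize = 4096;
  const size_t kPow2Bins = 3;         // 1024, 2048, 4096
  const size_t kQuantum256Bins = 16;  // "Num bins" for the 256-byte quantum row
  std::printf("worst-case extra committed: %zu KiB\n",
              (kQuantum256Bins - kPow2Bins) * kPageSize / 1024);
  return 0;
}
```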
Comment 3•4 years ago
Also, this has the potential to increase fragmentation substantially (and thus RSS).
Comment 4•4 years ago
The fact that the "Allocated" numbers vary so much in comment 1 makes me doubt the entire table. It would also be necessary to see the full picture with more sites, after loading/unloading, multiple tabs, etc. An AWSY run would be a good start for that, but until those per-bin stats are exposed to about:memory, AWSY is not going to give that information... I'd suggest doing a local AWSY run with logging enabled and replaying those logs.
Comment 5•4 years ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #2)
> Are you taking into account the fact that for one allocation of each size, the minimum size actually used becomes the page size? Counter-intuitively, this could have the complete reverse effect than expected.
Right, for a process with very few live allocations this would make things worse. We need to balance slop against bin-unused, which is why I wanted to measure both (see the sketch at the end of this comment).
> It might be better to see why we're allocating these odd sizes between 512 bytes and 4KiB.
I've had a look at slop in DMD and found one thing I've fixed and another that I'm investigating (Bug 1662345); the remaining ones are things that are a dynamic size anyway, like JS script code (bytecode, I think), rather than constant-sized allocations. However:
- I should filter for this range when looking at the DMD output.
- I should try a larger process. I've been "optimising for" small processes, since that's the case we're interested in for Fission, but I can look at larger ones too.
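One way to read the comment 1 numbers in light of that balance (values copied from the table there, in KiB; this is just a sketch, not the replay tool's own scoring):

```cpp
#include <cstdio>

// Neither slop nor bin-unused alone picks the winner; their sum is the
// "unused+slop" column in the comment 1 table, and the 256-byte quantum
// minimises it.
struct Config {
  const char* quantum;
  double binUnusedKiB;
  double slopKiB;
};

int main() {
  const Config configs[] = {
      {"pow-of-2", 553, 847},
      {"128", 923, 371},
      {"256", 732, 422},
      {"512", 671, 590},
  };
  for (const Config& c : configs) {
    std::printf("%-8s unused+slop = %.0f KiB\n", c.quantum,
                c.binUnusedKiB + c.slopKiB);
  }
  return 0;
}
```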
Comment 6•4 years ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #3)
> Also, this has the potential to increase fragmentation substantially (and thus RSS).
Good point. I'll test with longer-running processes that do a few navigations.
Comment 7•4 years ago (Assignee)
Apologies, I want to break up your paragraph to make my reply clearer.
(In reply to Mike Hommey [:glandium] from comment #4)
> The fact that the "Allocated" numbers vary so much in comment 1 makes me doubt the entire table.
I generated this table by replaying the same log file of allocations.
Allocated can vary since it measures the full cell size, not the requested size. From the table in comment 0, the sum of the slop column is at least 327KiB, and that doesn't include the allocations that get rounded up to 4096, since I excluded that row. So I think Allocated can vary that much.
> It would also be necessary to see the full picture with more sites, after loading/unloading, multiple tabs, etc. An AWSY run would be a good start for that, but until those per-bin stats are exposed to about:memory, AWSY is not going to give that information... I'd suggest doing a local AWSY run with logging enabled and replaying those logs.
Yes, I haven't tested what happens for a longer-lived process. Running AWSY locally is a good idea. Thanks.
I ran AWSY in try here:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=b37bbae0a0c0c017f4eceabee3b8e52132d8b04c&newProject=try&newRevision=01a730fb4d3d464566f1410effee400919778190&framework=4
My next step is to tidy up the patch so you can see it.
Comment 8•4 years ago
BTW, considering Apple Silicon Macs are going to have 16KB pages, you'll also want to measure how this goes with 16KB pages (you can fake that by setting the static page size to 16KB on whatever OS you're using).
Comment 9•4 years ago (Assignee)
Comment 10•4 years ago (Assignee)
Comment 11•4 years ago (Assignee)
Comment 12•4 years ago (Assignee)
Comment 13•4 years ago (Assignee)
My list of things to test with this change that haven't been tested/answered yet is:
- Test on long-running processes:
  - Try processes that do a new pageload each navigation (eg, browsing a news site/reddit).
  - Try single page apps.
  - JS games / something that does a lot of processing in JS? (I'm thinking of object churn, but it may be more small objects here.)
- Test on large processes:
  - facebook?
  - google docs (many tabs for different documents?)
  - Some kind of large document? A very detailed SVG file?
- Test with 16KiB pages.
Any other ideas, sites w/ patterns we've thought of in the past for these cases?
Comment 14•4 years ago
I'm not sure. Long running processes (beyond AWSY) is something we've never tested very well.
Comment 15•4 years ago (Assignee)
Comment 16•4 years ago (Assignee)
My patch had a limitation preventing 16KiB and other page sizes from working. With that fixed the results look like this:
I recorded a process browsing Wikipedia, following several links until the gzipped log file was 100MB, then captured a memory report and killed the browser. Here is the amount of committed memory for each configuration (I have the other results too, but committed is the fairest):
 | 4KiB pages | 16KiB pages |
---|---|---|
Without patch | 68,732 KiB | 83,312 KiB |
With patch | 67,004 KiB | 80,320 KiB |
The patch is a 3% win in this test with 4KiB pages and a 4% win with 16KiB pages.
Just switching from 4KiB pages to 16KiB pages is a memory regression of 21%.
Comment 17•4 years ago (Assignee)
With more testing for different sizes and lifetimes of processes, here is the jemalloc committed memory for each process:
Process | Before (MiB) | After (MiB) | Improvement |
---|---|---|---|
Wikipedia | 67.12 | 65.43 | 2.51% |
Facebook | 492.13 | 486.31 | 1.18% |
example.com | 8.85 | 8.57 | 3.18% |
slack | 113.66 | 110.94 | 2.40% |
google | 545.00 | 530.98 | 2.57% |
parent | 257.27 | 253.37 | 1.52% |
prealloc | 6.39 | 5.86 | 8.31% |
socket | 2.16 | 2.25 | -4.15% |
privileged | 22.41 | 22.41 | 0.02% |
Mean | | | 1.95% |
Mean (excluding singletons) | | | 3.36% |
- Wikipedia: Follow about 10 links to different articles, each new article causes a new pageload.
- Facebook: Scroll the news feed, react to some posts, leave idle for some time, reload the news feed and scroll & react again.
- example.com: Load the page
- slack: Read "All Unread", clicking on some threads to read them.
- google: Label and archive some e-mails and open 3 google docs, edit one of these docs.
The total amount of reduced memory (even though 'Facebook' was captured from a different browser session) is 1.91%. The average saving per process is 1.95%. If we exclude the "singleton" processes, those that the browser has only one of (like socket, main and privilegedabout), then the average is 3.36%.
I'm confident that this is a clear win for memory saving.
Comment 19•4 years ago (Assignee)
(In reply to Mike Hommey [:glandium] from comment #18)
> Can you check actual RSS rather than committed?
Here is the same data using the RSS of the logalloc-replay tool, excluding logalloc-replay's own dynamic memory.
Process | Before (MiB) | After (MiB) | Delta (MiB) | Percent |
---|---|---|---|---|
Wikipedia | 69.82 | 67.77 | 2.05 | 2.94% |
Facebook | 494.34 | 488.88 | 5.46 | 1.10% |
example.com | 11.52 | 11.28 | 0.24 | 2.10% |
slack | 117.18 | 113.46 | 3.72 | 3.17% |
google | 541.25 | 526.38 | 14.87 | 2.75% |
parent | 259.61 | 254.79 | 4.82 | 1.86% |
prealloc | 9.07 | 8.53 | 0.54 | 5.99% |
socket | 4.27 | 4.25 | 0.02 | 0.37% |
privileged | 25.11 | 25.08 | 0.03 | 0.11% |
Mean | | | | 2.26% |
Mean (excluding singletons) | | | | 3.01% |
Comment 20•4 years ago
I'm kind of surprised you're getting such large RSSes at all, considering (and I had forgotten) that logalloc-replay doesn't memset() the allocated memory (since bug 1423000), so allocated memory is never actually committed unless you enable zero or junk.
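A generic illustration of that point (plain POSIX, not logalloc-replay code): memory that is mapped but never written stays out of RSS; it only becomes resident once something, such as a memset or jemalloc's zero/junk filling, touches the pages.

```cpp
#include <sys/mman.h>
#include <cstddef>
#include <cstdio>
#include <cstring>

int main() {
  const size_t kSize = 64 * 1024 * 1024;
  void* p = mmap(nullptr, kSize, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) {
    return 1;
  }
  // At this point RSS has barely moved: the pages are reserved, not resident.
  std::memset(p, 0xab, kSize);  // Writing an arbitrary fill byte commits them.
  std::printf("mapped and touched %zu MiB\n", kSize / (1024 * 1024));
  munmap(p, kSize);
  return 0;
}
```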
Comment 21•4 years ago (Assignee)
We need Bug 1671114 to measure the benefit of this change.
Comment 22•4 years ago (Assignee)
Here's the updated RSS data
Process | RSS before (KiB) | RSS after (KiB) | Delta (KiB) | Percent |
---|---|---|---|---|
example | 8,508 | 8,220 | 288 | 3.39% |
extension | 15,396 | 15,524 | -128 | -0.83% |
Fb-parent | 231,156 | 220,344 | 10,812 | 4.68% |
Fb | 503,512 | 497,400 | 6,112 | 1.21% |
 | 557,944 | 543,100 | 14,844 | 2.66% |
parent | 263,788 | 259,032 | 4,756 | 1.80% |
prealloc | 5,964 | 5,428 | 536 | 8.99% |
slack | 116,624 | 112,908 | 3,716 | 3.19% |
socket | 1,624 | 1,544 | 80 | 4.93% |
wiki | 68,244 | 66,004 | 2,240 | 3.28% |
This is almost always a benefit, sometimes saving up to 5% of a process's memory usage.
Comment 23•4 years ago (Assignee)
The next thing to test is a long browser session (eg 4 hours) to see if this negatively impacts fragmentation drastically.
Comment 24•3 years ago (Assignee)
I have tested a longer browser session (typical evening Firefox usage for me) to check that there's no regression with these changes due to fragmentation - none of the processes showed any regression in RSS or bin-unused+swap (not shown).
These are the results from memory-replay at the end of the session for each of the processes.
Process | RSS (mc) | RSS (patches) | Delta | Percent |
---|---|---|---|---|
parent | 1,589,548 | 1,574,680 | 14,868 | 0.94% |
discord | 140,788 | 137,112 | 3,676 | 2.61% |
 | 103,804 | 102,980 | 824 | 0.79% |
github | 75,960 | 69,660 | 6,300 | 8.29% |
todoist | 63,336 | 61,832 | 1,504 | 2.37% |
 | 151,728 | 146,480 | 5,248 | 3.46% |
youtube | 377,444 | 374,280 | 3,164 | 0.84% |
total | 2,502,608 | 2,467,024 | 35,584 | 1.42% |
These are the results after pressing "Minimise memory usage" at the end of the session; extra usage due to fragmentation would have shown up here, if anywhere.
Process | RSS (mc) | RSS (patches) | Delta | Percent |
---|---|---|---|---|
parent | 1,015,704 | 999,272 | 16,432 | 1.62% |
discord | 113,580 | 111,100 | 2,480 | 2.18% |
 | 94,564 | 93,784 | 780 | 0.82% |
github | 46,604 | 45,196 | 1,408 | 3.02% |
 | 143,912 | 139,444 | 4,468 | 3.10% |
youtube | 367,224 | 363,912 | 3,312 | 0.90% |
Comment 25•3 years ago (Assignee)
I ran some AWSY tests:
Explicit base content memory has increased, but that's the trade-off these patches make: they increase fragmentation in order to decrease slop, while reducing resident memory overall. We can verify that this is fragmentation because the memory reports for base explicit memory show an increase in bin-unused.
Some of the tests show a regression for resident memory, such as on Windows, but viewing their subtests (https://treeherder.mozilla.org/perfherder/comparesubtest?originalProject=try&newProject=try&newRevision=68ab6bb4f439f4751497ba695c6a1514d2d65716&originalSignature=2240017&newSignature=2240017&framework=4&originalRevision=9122dd221e6801f1db419147af8d981b71829b31) shows that the "tabs closed" subtest is bringing the score down. Although that's a big difference, after the forced GC the regression disappears and it returns to a win. In most of the other tests the regression disappears even without the forced GC (needing only the 30 seconds). This may mean that the browser doesn't return memory quickly after closing tabs, but on the whole it uses less memory. This is also a symptom of fragmentation, since a single allocation can keep a chunk allocated.
Comment 26•3 years ago (Assignee)
The case I wanted to optimise is a content process running example.com, but AWSY doesn't test this. I tested it above with logalloc-replay and it showed a 288KB improvement, but when I test by comparing memory reports it's 100KB worse. That's not the result I'd hoped for; still, there's a lot of improvement for larger content processes.
Comment 27•3 years ago (Assignee)
Any further thoughts/reviews on this, glandium?
Thanks.
Comment 28•3 years ago
This bug is a soft blocker for Fission MVP. We'd like to fix it before our Release channel rollout, but we won't delay the rollout waiting for it.
Comment 29•3 years ago (Assignee)
Revert the SubPage size class to its original power-of-two sizing and make the 512-4KiB range a second Quantum-spaced size class.
All the ranges defined for the size classes are now inclusive of their upper bound to make them consistent.
Depends on D92729
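For illustration, a hedged sketch of the resulting size-class scheme (the function and constant names are invented, and the 256-byte second quantum is an assumption based on comment 1, not a quote of the actual mozjemalloc code): a 16-byte quantum up to 512 bytes, a coarser quantum from 512 bytes up to the 4KiB page size, and power-of-two classes above that, with each range inclusive of its upper bound.

```cpp
#include <cstddef>
#include <cstdio>
#include <initializer_list>

static size_t SizeClassFor(size_t aSize) {
  const size_t kQuantum = 16;       // spacing below 512 bytes
  const size_t kQuantumWide = 256;  // assumed spacing for the 512B..4KiB range
  const size_t kQuantumMax = 512;
  const size_t kPageSize = 4096;
  if (aSize <= kQuantumMax) {
    return ((aSize + kQuantum - 1) / kQuantum) * kQuantum;
  }
  if (aSize <= kPageSize) {
    return ((aSize + kQuantumWide - 1) / kQuantumWide) * kQuantumWide;
  }
  size_t cls = kPageSize;
  while (cls < aSize) {
    cls <<= 1;  // power-of-two classes above the page size
  }
  return cls;
}

int main() {
  for (size_t request : {100u, 513u, 1500u, 3100u, 5000u}) {
    std::printf("request=%4zu -> class=%zu\n", request, SizeClassFor(request));
  }
  return 0;
}
```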
Comment 30•3 years ago (Assignee)
The latest results are here: https://docs.google.com/spreadsheets/d/1uvyu1JOxyydd2GcXdBQT7Q_Kk1m0_ink8l-YzhaYnXk/edit#gid=0
Comment 31•3 years ago
Setting status-firefox94=wontfix. Since the Nightly 94 code freeze is this week, Paul plans to wait until Nightly 95 to land these malloc changes.
Comment 32•3 years ago (Assignee)
jemalloc_stats takes an array for its second argument. It expects this array to have enough space for all the bins; previously the maximum was set as a magic number. To make it dependent on the configured bins, this patch replaces the compile-time constant with a function.
Depends on D92729
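A hedged sketch of what a caller might look like after this change; the header path, the jemalloc_stats_num_bins() helper, and the jemalloc_bin_stats_t type are assumptions based on the commit message rather than verified declarations.

```cpp
#include <cstdio>
#include <vector>

#include "mozmemory.h"  // assumed in-tree header exposing the stats API

void PrintBinStats() {
  jemalloc_stats_t stats;
  // Size the per-bin array from the allocator's configuration at runtime
  // instead of relying on a hard-coded maximum bin count.
  std::vector<jemalloc_bin_stats_t> bins(jemalloc_stats_num_bins());
  jemalloc_stats(&stats, bins.data());
  std::printf("jemalloc reports %zu bins\n", bins.size());
}
```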
Comment 33•3 years ago
Comment on attachment 9244690 [details]
Bug 1669392 - Provide a less-magic array size for jemalloc_stats r=glandium
Revision D127761 was moved to bug 1735250. Setting attachment 9244690 [details] to obsolete.
Comment 34•3 years ago
Comment 35•3 years ago
Backed out changeset 9de9bd47c061 (Bug 1669392) for causing build bustages.
Backout link
Push with failures - B
Failure Log
Comment 36•3 years ago (Assignee)
oh, I need to move some code between patches.
Comment 37•3 years ago
Comment 38•3 years ago
bugherder
Comment 39•3 years ago
== Change summary for alert #31892 (as of Fri, 15 Oct 2021 09:51:13 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
14% | perf_reftest_singletons link-style-cache-1.html | macosx1014-64-shippable-qr | e10s fission stylo webrender | 1,026.34 -> 879.62 |
13% | perf_reftest_singletons link-style-cache-1.html | macosx1014-64-shippable-qr | e10s stylo webrender | 1,020.05 -> 886.26 |
8% | perf_reftest_singletons link-style-cache-1.html | linux1804-64-shippable-qr | e10s fission stylo webrender | 472.13 -> 434.69 |
7% | perf_reftest_singletons inline-style-cache-1.html | macosx1014-64-shippable-qr | e10s stylo webrender | 1,721.18 -> 1,599.22 |
7% | perf_reftest_singletons inline-style-cache-1.html | macosx1014-64-shippable-qr | e10s fission stylo webrender | 1,714.59 -> 1,601.56 |
6% | perf_reftest_singletons link-style-cache-1.html | linux1804-64-shippable-qr | e10s fission stylo webrender | 475.06 -> 446.11 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=31892
Comment hidden (obsolete)