Tune minimum nursery size using a more realistic benchmark
Categories
(Core :: JavaScript: GC, enhancement, P3)
Tracking
Version | Tracking | Status |
---|---|---|
firefox68 | --- | fixed |
People
(Reporter: pbone, Assigned: pbone)
References
(Regressed 1 open bug)
Details
(Keywords: perf:resource-use)
Attachments
(4 files)
In Bug 1525983, Jon found that collecting the nursery more frequently (4x as frequently) causes 0.5x more time to be spent in minor GCs. One reason might be that the current minimum of 192K is not appropriate for that benchmark. Is it appropriate for typical use or JS games?
Nursery collection is already well under 100us for these workloads, so this change won't affect responsiveness and maybe shouldn't be a QF bug. However, it will affect throughput.
Comment 1 (Assignee) • 6 years ago
Hi Jean,
To tune this parameter, we (Jon and I) think it'd be best to use Quantum Flow reference hardware. Should I have some hardware shipped here (Australia) for what is probably a once-off use? Or can I direct someone with this hardware to run some tests for me?
Thanks.
Comment 2 (Assignee) • 6 years ago
I was discussing this with jonco and we think it'd also be interesting to use a microbenchmark to determine where the theoretical limits of the nursery & cache are. I'm working on that now.
Comment 3 (Assignee) • 5 years ago
Hi Jonco, here are the results I got from this test. The optimal nursery size for this test is 288KB, which maximises throughput. Unfortunately I didn't measure the throughput of the mutator directly, so I'm using allocation rate as a proxy.
There are two things happening with allocation rate here. If the GC runs less (as a percentage of total runtime), then the mutator can run more and therefore throughput improves. That's the first allocation rate column, which is how much gets allocated over the entire 20s duration.
The second column shows the allocation rate per second, but only counts time while the mutator is running. That gives an idea of how the GC can affect the performance of the mutator. In this case I think the effect is due to cache: allocations in the nursery touch memory, and more memory needs to be brought into the 256KB L2 cache (my laptop; I don't have reference hardware). So we see this figure begin to drop after 288KB. My best guess about why it didn't drop before 256KB is that the hardware prefetching is able to keep up with it for a while. We then see the allocation rate drop and begin to pick up again; I'm not sure why it picks up again so easily.
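As a rough illustration of the two allocation-rate columns described above, here is a minimal Python sketch. The `Run` fields, the function names, and the sample numbers are all hypothetical (they are not taken from the attached data), and the first column is expressed here as a rate over the whole run rather than a total.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """One benchmark run at a fixed nursery size (all fields hypothetical)."""
    bytes_allocated: float  # total bytes the mutator allocated
    wall_time_s: float      # whole run, GC included (20 s in the test above)
    gc_time_s: float        # time spent inside nursery collections


def overall_rate(run: Run) -> float:
    """First column: allocation averaged over the entire run."""
    return run.bytes_allocated / run.wall_time_s


def mutator_rate(run: Run) -> float:
    """Second column: allocation per second of mutator-only time."""
    mutator_time_s = run.wall_time_s - run.gc_time_s
    if mutator_time_s <= 0:
        raise ValueError("GC time must be smaller than the wall time")
    return run.bytes_allocated / mutator_time_s


# Made-up numbers for illustration only, not measurements from this bug.
example = Run(bytes_allocated=2.0e9, wall_time_s=20.0, gc_time_s=4.0)
print(f"overall: {overall_rate(example):.3e} B/s")
print(f"mutator: {mutator_rate(example):.3e} B/s")
```

The first figure falls whenever the GC takes a larger share of the run. The second only moves when the mutator itself gets slower or faster, which is why it is the one that can show cache effects.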
Maybe we should raise the minimum nursery size to 256KB. That should improve this case (though not by much, of course) and still keeps baseline memory usage lower. Later, when we know whether a tab is in the foreground or background, we can lower it for background tabs and maybe even raise it quite a lot for foreground tabs.
The elephant in the room in this data is the time taken to mark the roots. Mostly that's the mkRntm (mark runtime) column from the raw data. That is quite large compared to the time actually taken to do the collection. But even when we discount it (I tried subtracting it from the total GC time), the GC still spends more time GCing at small sizes. I'm not sure which phase dominates things then, but that doesn't help us today anyway.
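The subtraction mentioned above can be sketched as follows. This is only an illustration: the function names are hypothetical and the per-size timings are placeholder values, not figures from the raw data.

```python
def gc_time_without_root_marking(total_gc_us: float, mark_runtime_us: float) -> float:
    """Total GC time with the mkRntm (mark runtime) phase subtracted."""
    if mark_runtime_us > total_gc_us:
        raise ValueError("root marking cannot exceed the total GC time")
    return total_gc_us - mark_runtime_us


def root_marking_share(total_gc_us: float, mark_runtime_us: float) -> float:
    """Fraction of the total GC time spent marking roots."""
    return mark_runtime_us / total_gc_us


# Placeholder timings: nursery size in KB -> (total GC us, mkRntm us).
hypothetical_rows = {128: (900.0, 600.0), 256: (500.0, 300.0)}

for size_kb, (total_us, mark_us) in hypothetical_rows.items():
    rest_us = gc_time_without_root_marking(total_us, mark_us)
    share = root_marking_share(total_us, mark_us)
    print(f"{size_kb}KB: {rest_us:.0f}us outside root marking ({share:.0%} roots)")
```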
Comment 4 (Assignee) • 5 years ago
Comment 7 • 5 years ago
(In reply to Paul Bone [:pbone] from comment #3)
Thanks for doing this testing.
> Maybe we should raise the minimum nursery size to 256KB, that should improve this case, but not much of course and still keeps baseline memory usage lower.
Yes that sounds good to me.
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81d3a25685dc
Use correct units in a preference name r=jonco
https://hg.mozilla.org/integration/autoland/rev/fbf69131cbda
Add a pref for the minimum nursery size r=mccr8
https://hg.mozilla.org/integration/autoland/rev/2509defe2779
Set minimum nursery size to 256KB r=jonco
Comment 9 (Assignee) • 5 years ago
Hi perfherders,
This change may regress memory usage a little bit, but that's okay.
Thanks.
Comment 10 (Assignee) • 5 years ago
I did the testing/tuning on my X1 Carbon - not reference hardware. This computer has a 256KB L2 cache.
Comment 11 • 5 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/81d3a25685dc
https://hg.mozilla.org/mozilla-central/rev/fbf69131cbda
https://hg.mozilla.org/mozilla-central/rev/2509defe2779
Comment 12 • 5 years ago
And the expected regression:
== Change summary for alert #20826 (as of Mon, 06 May 2019 17:21:34 GMT) ==
Regressions:
3% Base Content JS windows7-32-shippable opt 3,174,032.00 -> 3,255,554.67
2% Base Content JS linux64-shippable-qr opt 4,014,694.67 -> 4,077,597.33
2% Base Content JS osx-10-10-shippable opt 4,015,438.67 -> 4,078,466.67
2% Base Content JS linux64-shippable opt 4,014,724.00 -> 4,077,464.00
2% Base Content JS windows10-64-shippable-qr opt 4,078,922.67 -> 4,142,024.00
2% Base Content JS windows10-64-shippable opt 4,078,969.33 -> 4,142,170.67
For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20826
:pbone please confirm that this is accepted
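For reference, the rounded percentages in the alert can be recomputed from the before/after values. The sketch below copies the six rows from the summary above; the helper name `percent_change` is just for illustration.

```python
# (platform, before, after) copied from the alert summary above.
ALERT_ROWS = [
    ("windows7-32-shippable", 3_174_032.00, 3_255_554.67),
    ("linux64-shippable-qr", 4_014_694.67, 4_077_597.33),
    ("osx-10-10-shippable", 4_015_438.67, 4_078_466.67),
    ("linux64-shippable", 4_014_724.00, 4_077_464.00),
    ("windows10-64-shippable-qr", 4_078_922.67, 4_142_024.00),
    ("windows10-64-shippable", 4_078_969.33, 4_142_170.67),
]


def percent_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


for platform, before, after in ALERT_ROWS:
    delta = after - before
    pct = percent_change(before, after)
    print(f"{platform:26s} +{delta:>10,.2f}  ({pct:.2f}%)")
```

Rounding each computed percentage to the nearest integer gives the 3% and 2% figures shown in the alert.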
Comment 13 (Assignee) • 5 years ago
Thanks Bebe, yeah that's the amount of regression we were expecting. Thanks.