Closed Bug 1532838 Opened 6 years ago Closed 5 years ago

Tune minimum nursery size using a more realistic benchmark

Categories

(Core :: JavaScript: GC, enhancement, P3)

enhancement

Tracking


RESOLVED FIXED
mozilla68
Performance Impact low
Tracking Status
firefox68 --- fixed

People

(Reporter: pbone, Assigned: pbone)

References

(Regressed 1 open bug)

Details

(Keywords: perf:resource-use)

Attachments

(4 files)

In Bug 1525983 Jon found that collecting the nursery more frequently (4x as frequently) causes 0.5x more time to be spent in minor GCs. One reason might be that the current 192KB minimum is not appropriate for that benchmark. Is it appropriate for typical use / JS games?

Nursery collection is already well under 100µs for these workloads, so this change won't affect responsiveness and maybe shouldn't be a QF bug. However, it will affect throughput.

Whiteboard: [qf] → [qf:p3:resource]

Hi Jean,

To tune this parameter we (Jon and I) think it'd be best to use quantum flow reference hardware. Should I have some hardware shipped here (Australia) for what is probably one-off use? Or can I direct someone who has this hardware to run some tests for me?

Thanks.

Flags: needinfo?(jgong)

I was discussing this with jonco and we think it'd also be interesting to use a microbenchmark to determine the theoretical limits of the nursery and cache. I'm working on that now.

Assignee: nobody → pbone
Status: NEW → ASSIGNED
Depends on: 1544648
Depends on: 1544651
Attached file test.pdf (deleted) β€”

Hi Jonco, here are the results I got from this test. The optimal nursery size for this test is 288KB; that maximises throughput. Unfortunately I didn't measure the mutator's throughput directly, so I'm using allocation rate as a proxy.

Two things affect allocation rate here. If the GC runs less (as a percentage of total runtime) then the mutator can run more, and therefore throughput improves; that's the first allocation-rate column, which shows how much gets allocated over the entire 20s run.

The second column shows allocation rate per second counted only while the mutator is running, which gives an idea of how the GC can affect the mutator's performance. In this case I think the effect is due to the cache: allocations in the nursery touch memory, and more memory needs to be brought into the 256KB L2 cache (on my laptop; I don't have reference hardware). So we see this figure begin to drop after 288KB. My best guess for why it didn't drop before 256KB is that hardware prefetching is able to keep up for a while. We then see the allocation rate drop and begin to pick up again; I'm not sure why it recovers so easily.

Maybe we should raise the minimum nursery size to 256KB. That should improve this case (though not by much, of course) while still keeping baseline memory usage low. Later, once we know whether a tab is in the foreground or background, we can lower the minimum for background tabs and maybe even raise it quite a lot for foreground tabs.

The elephant in the room in this data is the time taken to mark the roots; mostly that's the mkRntm (mark runtime) column in the raw data. It is quite large compared to the time taken to do the collection itself. But even when we discount it (I tried subtracting it from the total GC time), the GC still spends more time collecting at small sizes. I'm not sure which phase dominates then, but that doesn't help us today anyway.

Depends on D29814

Depends on D29815

(In reply to Paul Bone [:pbone] from comment #3)

Thanks for doing this testing.

Maybe we should raise the minimum nursery size to 256KB. That should improve this case (though not by much, of course) while still keeping baseline memory usage low.

Yes that sounds good to me.

Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/81d3a25685dc
Use correct units in a preference name r=jonco
https://hg.mozilla.org/integration/autoland/rev/fbf69131cbda
Add a pref for the minimum nursery size r=mccr8
https://hg.mozilla.org/integration/autoland/rev/2509defe2779
Set minimum nursery size to 256KB r=jonco
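For anyone wanting to experiment locally, the pref added by these patches can be overridden in a `prefs.js` / `user.js` fragment along these lines (the pref name below is inferred from the patch descriptions; confirm it against `modules/libpref/init/all.js` in your tree):

```javascript
// Hypothetical user.js fragment: override the minimum nursery size (in KB).
user_pref("javascript.options.mem.nursery.min_kb", 256);
```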

Hi perfherders,

This change may regress memory usage a little bit, but that's okay.

Thanks.

Flags: needinfo?(igoldan)
Flags: needinfo?(fstrugariu)

I did the testing/tuning on my X1 Carbon, not reference hardware. This computer has a 256KB L2 cache.

Flags: needinfo?(jgong)
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68

And the expected regression:

== Change summary for alert #20826 (as of Mon, 06 May 2019 17:21:34 GMT) ==

Regressions:

3% Base Content JS windows7-32-shippable opt 3,174,032.00 -> 3,255,554.67
2% Base Content JS linux64-shippable-qr opt 4,014,694.67 -> 4,077,597.33
2% Base Content JS osx-10-10-shippable opt 4,015,438.67 -> 4,078,466.67
2% Base Content JS linux64-shippable opt 4,014,724.00 -> 4,077,464.00
2% Base Content JS windows10-64-shippable-qr opt 4,078,922.67 -> 4,142,024.00
2% Base Content JS windows10-64-shippable opt 4,078,969.33 -> 4,142,170.67

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20826

:pbone please confirm that this is accepted

Flags: needinfo?(pbone)
Flags: needinfo?(igoldan)
Flags: needinfo?(fstrugariu)
Regressions: 1533762

Thanks Bebe, yeah, that's the amount of regression we were expecting.

Flags: needinfo?(pbone)
Blocks: 1550382
Performance Impact: --- → P3
Whiteboard: [qf:p3:resource]
