Closed Bug 1375566 Opened 7 years ago Closed 3 years ago

We can blow our slice budgets due to arena finalization taking too long

Categories: Core :: JavaScript: GC, enhancement, P3
Version: 53 Branch

Tracking

Status: RESOLVED FIXED
Target Milestone: 92 Branch
Performance Impact: medium
Tracking Status: firefox92 --- fixed

People

Reporter: bzbarsky, Assigned: jonco

Details

Keywords: perf

Attachments

(2 files)

From a memory log on https://hsivonen.com/test/moz/encoding_bench_web/english-only.html (after clicking the "Run" button):

[content-98761] GC Slice 0 - Pause: 16.151ms of 5ms budget (@ 0.000ms); Reason: TOO_MUCH_MALLOC; Reset: no; Times: Wait Background Thread: 0.211ms, Mark: 1.262ms, Other: 1.149ms, Mark Roots: 0.112ms, Sweep: 14.128ms, Other: 12.340ms, Mark During Sweeping: 0.984ms, Mark Gray: 0.952ms, Sweep Compartments: 0.614ms, Other: 0.199ms, Sweep JIT Data: 0.131ms, Sweep type information: 0.237ms, Other: 0.236ms, Minor GCs to Evict Nursery: 0.453ms, Other: 0.426ms

[content-98761] GC Slice 0 - Pause: 16.216ms of 10ms budget (@ 0.000ms); Reason: TOO_MUCH_MALLOC; Reset: no; Times: Wait Background Thread: 0.266ms, Mark: 1.569ms, Other: 1.460ms, Mark Roots: 0.109ms, Sweep: 14.138ms, Other: 12.072ms, Mark During Sweeping: 1.116ms, Mark Gray: 1.081ms, Sweep Compartments: 0.770ms, Other: 0.185ms, Sweep JIT Data: 0.251ms, Sweep type information: 0.297ms, Other: 0.294ms, Minor GCs to Evict Nursery: 0.159ms, Other: 0.130ms

etc.  Note that we're blowing our 5/10ms budgets because Sweep is taking forever, and in particular Sweep:Other.

What we're doing under there (on Mac) is mostly madvise/munmap from FinalizeArenas, called by gc::ArenaLists::foregroundFinalize.  I wonder whether we could make this more incremental somehow...
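For illustration, here is a minimal sketch of what budget-checked, resumable arena finalization could look like. The names here (Arena, finalizeCells, TimeBudget) are placeholders, not the real SpiderMonkey API:

    #include <chrono>
    #include <cstddef>
    #include <vector>

    // Placeholder arena type; finalizeCells stands in for running each
    // dead cell's finalizer in the arena.
    struct Arena {
        void finalizeCells() { /* run per-cell finalizers; placeholder */ }
    };

    struct TimeBudget {
        std::chrono::steady_clock::time_point deadline;
        explicit TimeBudget(std::chrono::milliseconds ms)
            : deadline(std::chrono::steady_clock::now() + ms) {}
        bool isOverBudget() const {
            return std::chrono::steady_clock::now() >= deadline;
        }
    };

    // Finalize arenas until the budget runs out; return the index of the
    // first arena not yet finalized so the next slice can resume there.
    size_t finalizeArenasIncrementally(std::vector<Arena>& arenas, size_t start,
                                       const TimeBudget& budget) {
        size_t i = start;
        for (; i < arenas.size(); i++) {
            arenas[i].finalizeCells();
            if (budget.isOverBudget())
                return i + 1;  // yield; remaining arenas handled next slice
        }
        return i;
    }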

See https://perf-html.io/public/f6a0a4a61edcd784b461d17ea3879c30e03ee7fb/calltree/?implementation=cpp&search=gcIfRequested&thread=2 for a profile.
The mysterious thing here is that our stack is:

    huge_dalloc(void*)
    unsigned long js::gc::Arena::finalize<JSExternalString>(js::FreeOp*, js::gc::AllocKind, unsigned long)

From what I can see, finalize doesn't call free() itself. If the arena is unused, it gets appended to a free list. It does call t->finalize() on each cell, which in this case would end up invoking the external string finalizer, which *would* call free(). But that frame doesn't show up in the profile, and would require inlining nsStringBuffer::Release() into XPCStringConvert::FinalizeDOMString into JSExternalString::finalize into js::gc::Arena::finalize, and the FinalizeDOMString into JSExternalString::finalize step is not very plausible because it's through a function pointer. (I also skipped some intermediate steps.)
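To make the expected chain concrete, here is a simplified stand-in for what that path looks like; these are illustrative types, not the real Gecko/SpiderMonkey classes:

    #include <atomic>
    #include <cstdint>
    #include <cstdlib>

    // Stand-in for a refcounted string buffer. Dropping the last reference
    // frees the underlying allocation, which is where the huge_dalloc frames
    // in the profile would come from.
    struct StringBuffer {
        std::atomic<uint32_t> refCount{1};
        void Release() {
            if (--refCount == 0)
                free(this);  // assumes the buffer was malloc'd
        }
    };

    // The external string finalizer is reached through a function pointer,
    // which is why inlining it all the way into Arena::finalize seems
    // implausible.
    using ExternalStringFinalizer = void (*)(StringBuffer*);

    void FinalizeDOMStringSketch(StringBuffer* buffer) {
        buffer->Release();
    }

    struct ExternalStringCell {
        StringBuffer* buffer;
        ExternalStringFinalizer finalizer;
        void finalize() { finalizer(buffer); }  // called per cell during arena finalization
    };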

But I never trust stacks, so I'm not sure what to think.
It looks like the only string finalizer we use in Gecko involves a release of an atomic variable. Maybe we could do this on a background thread? (If we're not already...)
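The idea would be something like the sketch below: the foreground finalizer only queues the buffers, and a helper thread frees them later. This is just an illustration of the approach, not the actual implementation:

    #include <cstdlib>
    #include <mutex>
    #include <vector>

    class BackgroundFreeQueue {
        std::mutex lock;
        std::vector<void*> pending;
    public:
        // Called from the main-thread finalizer: cheap, no free() here.
        void enqueue(void* buffer) {
            std::lock_guard<std::mutex> guard(lock);
            pending.push_back(buffer);
        }
        // Called on a background thread: the expensive allocator work
        // (munmap/madvise) happens here instead of inside the GC slice.
        void drain() {
            std::vector<void*> work;
            {
                std::lock_guard<std::mutex> guard(lock);
                work.swap(pending);
            }
            for (void* buffer : work)
                free(buffer);
        }
    };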
Whiteboard: [qf]
> Maybe we could do this on a background thread?

That's bug 627220.
I tried stepping through this in a debugger, and in an opt build it really does claim that JSExternalString::finalize directly calls free.  Maybe we end up with some sort of tail-call business going on or something?

In a debug build, we're definitely calling through XPCStringConvert::FinalizeDOMString, as expected, and doing huge_dalloc there.

I guess the real point of this bug is that whatever budgeting we're doing for finalization is off.  The other bugs track doing this finalization off-thread and whatnot.
Steve, I believe you are doing work in the area already.
Assignee: nobody → sfink
Whiteboard: [qf] → [qf:p1]
Wow, that email address is still in the system? I guess it's disabled, at least. (It's from my Cal grad school days.)

I'm not sure there's anything to do on this bug? Given that it appears to be straightforward to do this in the background, I don't see much point in doing it incrementally unless backgrounding it introduces a bunch of contention and thread thrashing. Which is certainly possible, mind, but until we try it we won't know. I guess I'll mark it as depending on that one.
Assignee: sfink → sphink
Depends on: 627220
I think what bothers me is that we clearly have _some_ sort of finalization budget, because all the finalizations were about 14ms.  That was just bigger than our actual gc slice budget.  This bug is about trying to align the two, if that makes sense.

I agree that for the specific case of external strings we should just background them.
Ok, that makes sense. My guess is that this is an artifact, though -- looking at the profile, the GCs are marching along like a row of identical penguins. I think we're running through a homogeneous workload. We allocate the same amount between each GC, and end up doing the same amount of finalization for each one, so it takes about the same amount of time.

I dumped out the sweep times for all of these slices (one line per GCSlice marker, value is the time spent sweeping in that slice):

108/timings/times/sweep = 19.022
111/timings/times/sweep = 17.459
114/timings/times/sweep = 18.557
125/timings/times/sweep = 16.378
128/timings/times/sweep = 17.331
131/timings/times/sweep = 17.069
134/timings/times/sweep = 17.433
137/timings/times/sweep = 17.71
140/timings/times/sweep = 18.905
143/timings/times/sweep = 17.506
146/timings/times/sweep = 18.013
149/timings/times/sweep = 17.758
152/timings/times/sweep = 16.382
167/timings/times/sweep = 16.661
181/timings/times/sweep = 18.573
184/timings/times/sweep = 17.755
187/timings/times/sweep = 17.227
190/timings/times/sweep = 16.573
193/timings/times/sweep = 15.796
196/timings/times/sweep = 16.377
199/timings/times/sweep = 18.309
202/timings/times/sweep = 16.835
220/timings/times/sweep = 16.977
231/timings/times/sweep = 16.488
234/timings/times/sweep = 16.347
237/timings/times/sweep = 16.683
240/timings/times/sweep = 17.183
243/timings/times/sweep = 15.885
246/timings/times/sweep = 17.935
249/timings/times/sweep = 17.177
252/timings/times/sweep = 16.948
267/timings/times/sweep = 18.216
270/timings/times/sweep = 17.001
273/timings/times/sweep = 16.702
285/timings/times/sweep = 16.141
288/timings/times/sweep = 17.679
293/timings/times/sweep = 17.356
306/timings/times/sweep = 17.702
309/timings/times/sweep = 18.695
322/timings/times/sweep = 17.577
346/timings/times/sweep = 18.077
349/timings/times/sweep = 16.904
362/timings/times/sweep = 17.77
377/timings/times/sweep = 15.593
380/timings/times/sweep = 14.612
383/timings/times/sweep = 15.426
397/timings/times/sweep = 15.941
400/timings/times/sweep = 15.119
403/timings/times/sweep = 15.675
406/timings/times/sweep = 16.172
420/timings/times/sweep = 18.932
435/timings/times/sweep = 17.703
438/timings/times/sweep = 16.227
441/timings/times/sweep = 15.887
444/timings/times/sweep = 17.348
447/timings/times/sweep = 17.025
450/timings/times/sweep = 16.278
453/timings/times/sweep = 17.261
456/timings/times/sweep = 17.559
459/timings/times/sweep = 16.029
462/timings/times/sweep = 21.821
465/timings/times/sweep = 25.236
468/timings/times/sweep = 17.515
471/timings/times/sweep = 16.811
474/timings/times/sweep = 16.263
485/timings/times/sweep = 17.362
491/timings/times/sweep = 19.643
494/timings/times/sweep = 18.532
497/timings/times/sweep = 16.907
500/timings/times/sweep = 16.519
503/timings/times/sweep = 17.253
524/timings/times/sweep = 12.574
527/timings/times/sweep = 2.632

You're right, they cluster, though I'm seeing about 16ms.

The amount of GC things allocated is about the same (now it's one line per GCMajor marker, value is megabytes of GC data):

/threads/2/markers/data> cat */timings/allocated
109/timings/allocated = 5
112/timings/allocated = 5
115/timings/allocated = 5
126/timings/allocated = 5
129/timings/allocated = 5
132/timings/allocated = 5
135/timings/allocated = 5
138/timings/allocated = 5
141/timings/allocated = 5
144/timings/allocated = 5
147/timings/allocated = 5
150/timings/allocated = 5
153/timings/allocated = 5
168/timings/allocated = 5
182/timings/allocated = 5
185/timings/allocated = 5
188/timings/allocated = 5
191/timings/allocated = 5
194/timings/allocated = 5
197/timings/allocated = 5
200/timings/allocated = 5
203/timings/allocated = 5
221/timings/allocated = 5
232/timings/allocated = 5
235/timings/allocated = 5
238/timings/allocated = 5
241/timings/allocated = 5
244/timings/allocated = 5
247/timings/allocated = 5
250/timings/allocated = 5
253/timings/allocated = 5
268/timings/allocated = 5
271/timings/allocated = 5
274/timings/allocated = 5
286/timings/allocated = 5
289/timings/allocated = 5
294/timings/allocated = 5
307/timings/allocated = 5
310/timings/allocated = 5
323/timings/allocated = 5
347/timings/allocated = 5
350/timings/allocated = 5
363/timings/allocated = 5
378/timings/allocated = 5
381/timings/allocated = 4
384/timings/allocated = 4
398/timings/allocated = 4
401/timings/allocated = 4
404/timings/allocated = 4
407/timings/allocated = 4
421/timings/allocated = 4
436/timings/allocated = 4
439/timings/allocated = 4
442/timings/allocated = 4
445/timings/allocated = 4
448/timings/allocated = 4
451/timings/allocated = 5
454/timings/allocated = 5
457/timings/allocated = 4
460/timings/allocated = 5
463/timings/allocated = 5
466/timings/allocated = 4
469/timings/allocated = 5
472/timings/allocated = 5
475/timings/allocated = 5
486/timings/allocated = 5
492/timings/allocated = 5
495/timings/allocated = 5
498/timings/allocated = 5
501/timings/allocated = 5
504/timings/allocated = 5
525/timings/allocated = 5

Sadly, I don't have good data on malloc memory usage, and every one of these was triggered (nonincrementally, even) with nonincremental_reason=MallocBytesTrigger (and reason=TOO_MUCH_MALLOC).

This scenario is pretty awful, though. Maybe it's a result of an extremely slow madvise on OS X, but it lands us in one of the worst-case scheduling behaviors: all GCs are nonincremental. But would it help here to change the triggers so that these are incremental? We're not keeping up with the allocation rate. But we are returning to the event loop, so we're not just buried in run-to-completion encoding_rs work or something. So maybe it would help.
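A hypothetical sketch of that two-level trigger idea: start an incremental GC once malloc bytes pass some fraction of the threshold, and only fall back to a nonincremental GC when the hard limit is hit. The constants and names are illustrative, not SpiderMonkey's actual tuning:

    #include <cstddef>

    enum class GCRequest { None, Incremental, NonIncremental };

    struct MallocTrigger {
        size_t thresholdBytes;        // hard limit (MallocBytesTrigger)
        double startFraction = 0.85;  // begin an incremental GC before the limit

        GCRequest check(size_t mallocBytes) const {
            if (mallocBytes >= thresholdBytes)
                return GCRequest::NonIncremental;  // can't keep up: collect now
            if (mallocBytes >= size_t(thresholdBytes * startFraction))
                return GCRequest::Incremental;     // spread the work over slices
            return GCRequest::None;
        }
    };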

Still, I'm hopeful that either background external string finalization will make this go away, or that these strings are short-lived enough that nursery allocation of external strings will incrementalize this nicely.

Note that this is similar to bug 1374797 observed on Facebook, except that that one is GCBytesTrigger instead of MallocBytesTrigger.
OK, so we've now landed the background finalization bits in bug 627220, and the too-long slices look like this:

[content-87052] GC Slice 0 - Pause: 15.372ms of 10ms budget (@ 0.000ms); Reason: TOO_MUCH_MALLOC; Reset: no; Times: Wait Background Thread: 12.881ms, Mark: 1.711ms, Other: 1.582ms, Mark Roots: 0.130ms, Sweep: 0.677ms, Sweep Compartments: 0.415ms, Other: 0.118ms, Sweep type information: 0.168ms, Other: 0.165ms

Lots less sweep, lots more "Wait Background Thread", not much change in the gc pause.

That might also explain why the wall-clock times for the benchmark did not improve for me...
Flags: needinfo?(sphink)
Bug 1384049 would help here as long as the allocation rate wasn't too high.  We could also investigate using multiple threads for background sweeping.  I filed bug 1385919 for this.
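Roughly, the multi-threaded background sweeping idea would look something like this sketch, which partitions the arenas to sweep across N worker threads. Arena and sweepArena are placeholders, not the real types:

    #include <cstddef>
    #include <thread>
    #include <vector>

    struct Arena { /* dead cells to finalize */ };

    void sweepArena(Arena* arena) { /* free dead cells; placeholder */ }

    void sweepInParallel(std::vector<Arena*>& arenas, size_t threadCount) {
        std::vector<std::thread> workers;
        for (size_t t = 0; t < threadCount; t++) {
            workers.emplace_back([&arenas, t, threadCount] {
                // Each worker handles a strided subset of the arenas.
                for (size_t i = t; i < arenas.size(); i += threadCount)
                    sweepArena(arenas[i]);
            });
        }
        for (std::thread& worker : workers)
            worker.join();
    }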
Whiteboard: [qf:p1] → [qf:p2]
Priority: -- → P3
Keywords: perf
Whiteboard: [qf:p2] → [qf:f63][qf:p2]
Whiteboard: [qf:f63][qf:p2] → [qf:f63][qf:p1]
Whiteboard: [qf:f63][qf:p1] → [qf:f64][qf:p1]
Whiteboard: [qf:f64][qf:p1] → [qf:p1:f64]
Another thing we could investigate here is using more than one malloc heap for JS, to try and reduce lock contention between allocation on the main thread and freeing on the background thread.
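For illustration, a sketch of that idea: spread JS allocations over several independently locked heaps so that background freeing only contends with a fraction of main-thread allocation. LockedHeap and MultiHeap are stand-ins; a real allocator (e.g. jemalloc arenas) would provide this mechanism instead:

    #include <atomic>
    #include <cstdlib>
    #include <mutex>
    #include <utility>
    #include <vector>

    class LockedHeap {
        std::mutex lock;
    public:
        void* allocate(size_t bytes) {
            std::lock_guard<std::mutex> guard(lock);
            return malloc(bytes);
        }
        void release(void* ptr) {
            std::lock_guard<std::mutex> guard(lock);
            free(ptr);
        }
    };

    class MultiHeap {
        std::vector<LockedHeap> heaps;
        std::atomic<size_t> next{0};
    public:
        explicit MultiHeap(size_t count) : heaps(count) {}

        // Main thread: round-robin over heaps, remembering which heap owns
        // the allocation so the matching release goes back to it.
        std::pair<void*, size_t> allocate(size_t bytes) {
            size_t index = next++ % heaps.size();
            return {heaps[index].allocate(bytes), index};
        }

        // Background sweep thread: only locks the owning heap, so it blocks
        // at most 1/count of concurrent main-thread allocations.
        void release(void* ptr, size_t index) {
            heaps[index].release(ptr);
        }
    };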
Whiteboard: [qf:p1:f64] → [qf:p2]

I'm not planning on working on this anytime soon. I don't know if it's still an issue, either. jonco's recent changes to stall the mutator might be relevant here.

Assignee: sphink → nobody
Flags: needinfo?(sphink)

(In reply to Steve Fink [:sfink] [:s:] from comment #12)
I took a profile of this: https://share.firefox.dev/3ynRtfp

It's doing very frequent GC (every ~35ms), although the GCs themselves aren't that bad (max pause ~6.5ms) and the proportion of GC time is around 15%.

We can improve this by changing the way we calculate our malloc triggers. For the GC bytes threshold we increase the heap growth factor when in 'high frequency GC mode', but this doesn't currently happen for the malloc threshold.

Extending this to cover malloc triggers reduces the frequency of GC (at the cost of increasing the threshold, but only to ~115MB). With this change we now see only 4% of time spent in GC:

https://share.firefox.dev/3fmSKM8

This doesn't help the benchmark result much but is still a useful change.
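As a rough sketch of the change described above: compute the malloc trigger threshold with the same growth-factor logic used for the GC-bytes threshold, where the factor is larger in high-frequency GC mode so back-to-back collections grow the threshold instead of immediately re-triggering. The numbers below are illustrative, not SpiderMonkey's actual tuning:

    #include <algorithm>
    #include <cstddef>

    double heapGrowthFactor(bool highFrequencyGC) {
        // Larger factor when GCs are happening back to back.
        return highFrequencyGC ? 3.0 : 1.5;
    }

    size_t computeMallocThreshold(size_t bytesAtLastGC, size_t baseThreshold,
                                  bool highFrequencyGC) {
        double grown = bytesAtLastGC * heapGrowthFactor(highFrequencyGC);
        return std::max(baseThreshold, size_t(grown));
    }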

Assignee: nobody → jcoppeard
Status: NEW → ASSIGNED
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/68165a3d99c8
Use the same heap growth factor calculation for both GC cell and malloc heaps r=sfink
https://hg.mozilla.org/integration/autoland/rev/82ee3e97047f
Remove the now unused malloc growth factor parameter r=sfink
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch
Performance Impact: --- → P2
Whiteboard: [qf:p2]