We can blow our slice budgets due to arena finalization taking too long
Categories
(Core :: JavaScript: GC, enhancement, P3)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox92 | --- | fixed |
People
(Reporter: bzbarsky, Assigned: jonco)
Details
(Keywords: perf)
Attachments
(2 files)
From a memory log on https://hsivonen.com/test/moz/encoding_bench_web/english-only.html (after clicking the "Run" button):

[content-98761] GC Slice 0 - Pause: 16.151ms of 5ms budget (@ 0.000ms); Reason: TOO_MUCH_MALLOC; Reset: no; Times: Wait Background Thread: 0.211ms, Mark: 1.262ms, Other: 1.149ms, Mark Roots: 0.112ms, Sweep: 14.128ms, Other: 12.340ms, Mark During Sweeping: 0.984ms, Mark Gray: 0.952ms, Sweep Compartments: 0.614ms, Other: 0.199ms, Sweep JIT Data: 0.131ms, Sweep type information: 0.237ms, Other: 0.236ms, Minor GCs to Evict Nursery: 0.453ms, Other: 0.426ms

[content-98761] GC Slice 0 - Pause: 16.216ms of 10ms budget (@ 0.000ms); Reason: TOO_MUCH_MALLOC; Reset: no; Times: Wait Background Thread: 0.266ms, Mark: 1.569ms, Other: 1.460ms, Mark Roots: 0.109ms, Sweep: 14.138ms, Other: 12.072ms, Mark During Sweeping: 1.116ms, Mark Gray: 1.081ms, Sweep Compartments: 0.770ms, Other: 0.185ms, Sweep JIT Data: 0.251ms, Sweep type information: 0.297ms, Other: 0.294ms, Minor GCs to Evict Nursery: 0.159ms, Other: 0.130ms

etc.

Note that we're blowing our 5ms/10ms budgets because Sweep is taking forever, and in particular Sweep: Other. What we're doing under there (on Mac) is mostly madvise/munmap from FinalizeArenas, called by gc::ArenaLists::foregroundFinalize. I wonder whether we could make this more incremental somehow.

See https://perf-html.io/public/f6a0a4a61edcd784b461d17ea3879c30e03ee7fb/calltree/?implementation=cpp&search=gcIfRequested&thread=2 for a profile.
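To make the "more incremental" idea concrete, here is a minimal standalone sketch (not SpiderMonkey code): check a slice budget between arenas and resume from where we left off in the next slice. The `Arena`, `Cell`, and `SliceBudget` types below are simplified stand-ins for the real GC classes, purely for illustration.

```cpp
// Minimal standalone sketch of budget-checked ("incremental") arena
// finalization. All types are simplified stand-ins, not the real GC classes.
#include <chrono>
#include <cstddef>
#include <vector>

struct Cell {
  void finalize() { /* e.g. release an external string's buffer */ }
};

struct Arena {
  std::vector<Cell> cells;
  void finalizeCells() {
    for (Cell& cell : cells) {
      cell.finalize();
    }
  }
};

class SliceBudget {
 public:
  explicit SliceBudget(std::chrono::milliseconds ms)
      : deadline_(std::chrono::steady_clock::now() + ms) {}
  bool isOverBudget() const {
    return std::chrono::steady_clock::now() >= deadline_;
  }

 private:
  std::chrono::steady_clock::time_point deadline_;
};

// Finalize arenas starting at `start`, stopping once the budget is exhausted.
// Returns the index to resume from in the next slice.
size_t finalizeArenasIncrementally(std::vector<Arena>& arenas, size_t start,
                                   const SliceBudget& budget) {
  size_t i = start;
  while (i < arenas.size()) {
    arenas[i].finalizeCells();
    ++i;
    if (budget.isOverBudget()) {
      break;  // yield back to the mutator; finish in a later slice
    }
  }
  return i;
}
```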
Comment 1•7 years ago
The mysterious thing here is that our stack is:

  huge_dalloc(void*)
  unsigned long js::gc::Arena::finalize<JSExternalString>(js::FreeOp*, js::gc::AllocKind, unsigned long)

From what I can see, finalize doesn't call free() itself. If the arena is unused, it gets appended to a free list. It does call t->finalize() on each cell, which in this case would end up invoking the external string finalizer, which *would* call free(). But that frame doesn't show up in the profile, and it would require inlining nsStringBuffer::Release() into XPCStringConvert::FinalizeDOMString into JSExternalString::finalize into js::gc::Arena::finalize, and the FinalizeDOMString into JSExternalString::finalize step is not very plausible because it's through a function pointer. (I also skipped some intermediate steps.) But I never trust stacks, so I'm not sure what to think.
Comment 2•7 years ago
It looks like the only string finalizer we use in Gecko involves a release of an atomic variable. Maybe we could do this on a background thread? (If we're not already...)
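A rough sketch of that idea, with hypothetical types standing in for nsStringBuffer and the GC helper-thread machinery: the finalizer only queues the buffer during the slice, and a background thread performs the atomic release (and any resulting free) later.

```cpp
// Hypothetical sketch: the external string finalizer only queues the buffer;
// a GC helper thread performs the atomic release later. RefCountedBuffer and
// BackgroundReleaseQueue are illustrative stand-ins, not Gecko classes.
#include <atomic>
#include <mutex>
#include <vector>

struct RefCountedBuffer {
  std::atomic<int> refCount{1};
  void Release() {
    if (refCount.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;  // the expensive free()/madvise now happens off the main thread
    }
  }
};

class BackgroundReleaseQueue {
 public:
  // Called from the finalizer during the GC slice: O(1), no free() here.
  void enqueue(RefCountedBuffer* buffer) {
    std::lock_guard<std::mutex> guard(lock_);
    pending_.push_back(buffer);
  }

  // Called from a helper thread once the slice is over.
  void drain() {
    std::vector<RefCountedBuffer*> batch;
    {
      std::lock_guard<std::mutex> guard(lock_);
      batch.swap(pending_);
    }
    for (RefCountedBuffer* buffer : batch) {
      buffer->Release();
    }
  }

 private:
  std::mutex lock_;
  std::vector<RefCountedBuffer*> pending_;
};
```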
Comment 3•7 years ago (Reporter)
> Maybe we could do this on a background thread?

That's bug 627220.
Comment 4•7 years ago (Reporter)
I tried stepping through this in a debugger, and in an opt build it really does claim that JSExternalString::finalize directly calls free. Maybe we end up with some sort of tail-call business going on or something? In a debug build, we're definitely calling through XPCStringConvert::FinalizeDOMString, as expected, and doing huge_dalloc there.

I guess the real point of this bug is that whatever budgeting we're doing for finalization is off... The other bugs track doing this finalization off-thread and whatnot.
Comment 5•7 years ago
Steve, I believe you are already doing work in this area.
Comment 6•7 years ago
Wow, that email address is still in the system? I guess it's disabled, at least. (It's from my Cal grad school days.)

I'm not sure there's anything to do on this bug? Given that it appears to be straightforward to do this in the background, I don't see much point in doing it incrementally unless backgrounding it introduces a bunch of contention and thread thrashing. Which is certainly possible, mind, but until we try it we won't know. I guess I'll mark it as depending on that one.
Comment 7•7 years ago (Reporter)
I think what bothers me is that we clearly have _some_ sort of finalization budget, because all the finalizations were about 14ms. That was just bigger than our actual gc slice budget. This bug is about trying to align the two, if that makes sense. I agree that for the specific case of external strings we should just background them.
Comment 8•7 years ago
Ok, that makes sense. My guess is that this is an artifact, though -- looking at the profile, the GCs are marching along like a row of identical penguins. I think we're running through a homogeneous workload: we allocate the same amount between each GC, and end up doing the same amount of finalization for each one, so it takes about the same amount of time.

I dumped out the sweep times for all of these slices (one line per GCSlice marker, value is the time spent sweeping in that slice):

108/timings/times/sweep = 19.022
111/timings/times/sweep = 17.459
114/timings/times/sweep = 18.557
125/timings/times/sweep = 16.378
128/timings/times/sweep = 17.331
131/timings/times/sweep = 17.069
134/timings/times/sweep = 17.433
137/timings/times/sweep = 17.71
140/timings/times/sweep = 18.905
143/timings/times/sweep = 17.506
146/timings/times/sweep = 18.013
149/timings/times/sweep = 17.758
152/timings/times/sweep = 16.382
167/timings/times/sweep = 16.661
181/timings/times/sweep = 18.573
184/timings/times/sweep = 17.755
187/timings/times/sweep = 17.227
190/timings/times/sweep = 16.573
193/timings/times/sweep = 15.796
196/timings/times/sweep = 16.377
199/timings/times/sweep = 18.309
202/timings/times/sweep = 16.835
220/timings/times/sweep = 16.977
231/timings/times/sweep = 16.488
234/timings/times/sweep = 16.347
237/timings/times/sweep = 16.683
240/timings/times/sweep = 17.183
243/timings/times/sweep = 15.885
246/timings/times/sweep = 17.935
249/timings/times/sweep = 17.177
252/timings/times/sweep = 16.948
267/timings/times/sweep = 18.216
270/timings/times/sweep = 17.001
273/timings/times/sweep = 16.702
285/timings/times/sweep = 16.141
288/timings/times/sweep = 17.679
293/timings/times/sweep = 17.356
306/timings/times/sweep = 17.702
309/timings/times/sweep = 18.695
322/timings/times/sweep = 17.577
346/timings/times/sweep = 18.077
349/timings/times/sweep = 16.904
362/timings/times/sweep = 17.77
377/timings/times/sweep = 15.593
380/timings/times/sweep = 14.612
383/timings/times/sweep = 15.426
397/timings/times/sweep = 15.941
400/timings/times/sweep = 15.119
403/timings/times/sweep = 15.675
406/timings/times/sweep = 16.172
420/timings/times/sweep = 18.932
435/timings/times/sweep = 17.703
438/timings/times/sweep = 16.227
441/timings/times/sweep = 15.887
444/timings/times/sweep = 17.348
447/timings/times/sweep = 17.025
450/timings/times/sweep = 16.278
453/timings/times/sweep = 17.261
456/timings/times/sweep = 17.559
459/timings/times/sweep = 16.029
462/timings/times/sweep = 21.821
465/timings/times/sweep = 25.236
468/timings/times/sweep = 17.515
471/timings/times/sweep = 16.811
474/timings/times/sweep = 16.263
485/timings/times/sweep = 17.362
491/timings/times/sweep = 19.643
494/timings/times/sweep = 18.532
497/timings/times/sweep = 16.907
500/timings/times/sweep = 16.519
503/timings/times/sweep = 17.253
524/timings/times/sweep = 12.574
527/timings/times/sweep = 2.632

You're right, they cluster, though I'm seeing about 16ms.
The amount of GC things allocated is about the same (now it's one line per GCMajor marker, value is megabytes of GC data):

/threads/2/markers/data> cat */timings/allocated
109/timings/allocated = 5
112/timings/allocated = 5
115/timings/allocated = 5
126/timings/allocated = 5
129/timings/allocated = 5
132/timings/allocated = 5
135/timings/allocated = 5
138/timings/allocated = 5
141/timings/allocated = 5
144/timings/allocated = 5
147/timings/allocated = 5
150/timings/allocated = 5
153/timings/allocated = 5
168/timings/allocated = 5
182/timings/allocated = 5
185/timings/allocated = 5
188/timings/allocated = 5
191/timings/allocated = 5
194/timings/allocated = 5
197/timings/allocated = 5
200/timings/allocated = 5
203/timings/allocated = 5
221/timings/allocated = 5
232/timings/allocated = 5
235/timings/allocated = 5
238/timings/allocated = 5
241/timings/allocated = 5
244/timings/allocated = 5
247/timings/allocated = 5
250/timings/allocated = 5
253/timings/allocated = 5
268/timings/allocated = 5
271/timings/allocated = 5
274/timings/allocated = 5
286/timings/allocated = 5
289/timings/allocated = 5
294/timings/allocated = 5
307/timings/allocated = 5
310/timings/allocated = 5
323/timings/allocated = 5
347/timings/allocated = 5
350/timings/allocated = 5
363/timings/allocated = 5
378/timings/allocated = 5
381/timings/allocated = 4
384/timings/allocated = 4
398/timings/allocated = 4
401/timings/allocated = 4
404/timings/allocated = 4
407/timings/allocated = 4
421/timings/allocated = 4
436/timings/allocated = 4
439/timings/allocated = 4
442/timings/allocated = 4
445/timings/allocated = 4
448/timings/allocated = 4
451/timings/allocated = 5
454/timings/allocated = 5
457/timings/allocated = 4
460/timings/allocated = 5
463/timings/allocated = 5
466/timings/allocated = 4
469/timings/allocated = 5
472/timings/allocated = 5
475/timings/allocated = 5
486/timings/allocated = 5
492/timings/allocated = 5
495/timings/allocated = 5
498/timings/allocated = 5
501/timings/allocated = 5
504/timings/allocated = 5
525/timings/allocated = 5

Sadly, I don't have good data on malloc memory usage, and every one of these was triggered (nonincrementally, even) with nonincremental_reason=MallocBytesTrigger (and reason=TOO_MUCH_MALLOC).

This scenario is pretty awful, though. Maybe it's a result of an extremely slow madvise on osx, but it lands us in one of the worst-case scheduling behaviors: all GCs are nonincremental. But would it help here to change the triggers so that these are incremental? We're not keeping up with the allocation rate. But we are returning to the event loop, so we're not just buried in run-to-completion encoding_rs work or something. So maybe it would help.

Still, I'm hopeful that either background external string finalization will make this go away, or that these strings are short-lived enough that nursery allocation of external strings will incrementalize this nicely.

Note that this is similar to bug 1374797 observed on Facebook, except that that one is GCBytesTrigger instead of MallocBytesTrigger.
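For what "change the triggers so that these are incremental" might mean, here is an illustrative sketch (not the actual SpiderMonkey heuristics): start an incremental GC at a lower threshold so slices can keep up with allocation, and keep the hard malloc-bytes limit only as a nonincremental backstop. The 80% start ratio is a made-up number.

```cpp
// Illustrative sketch of a two-level malloc trigger; thresholds and the 80%
// ratio are made up, not SpiderMonkey's real tuning.
#include <cstddef>

enum class GCRequest { None, Incremental, NonIncremental };

GCRequest checkMallocTrigger(size_t mallocBytes, size_t triggerBytes) {
  // Begin collecting incrementally well before the hard limit.
  const size_t startThreshold = triggerBytes * 8 / 10;

  if (mallocBytes >= triggerBytes) {
    return GCRequest::NonIncremental;  // we fell too far behind; collect now
  }
  if (mallocBytes >= startThreshold) {
    return GCRequest::Incremental;  // slices keep up with the allocation rate
  }
  return GCRequest::None;
}
```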
Comment 9•7 years ago (Reporter)
OK, so now we landed the background finalization bits in bug 627220 and the too-long slices now look like this:

[content-87052] GC Slice 0 - Pause: 15.372ms of 10ms budget (@ 0.000ms); Reason: TOO_MUCH_MALLOC; Reset: no; Times: Wait Background Thread: 12.881ms, Mark: 1.711ms, Other: 1.582ms, Mark Roots: 0.130ms, Sweep: 0.677ms, Sweep Compartments: 0.415ms, Other: 0.118ms, Sweep type information: 0.168ms, Other: 0.165ms

Lots less sweep, lots more "Wait Background Thread", not much change in the gc pause. That might also explain why the wall-clock times for the benchmark did not improve for me...
Comment 10•7 years ago (Assignee)
Bug 1384049 would help here as long as the allocation rate wasn't too high. We could also investigate using multiple threads for background sweeping. I filed bug 1385919 for this.
Comment 11•6 years ago (Assignee)
Another thing we could investigate here is using more than one malloc heap for JS, to try and reduce lock contention between allocation on the main thread and freeing on the background thread.
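An illustrative sketch of that idea, not Gecko's allocator: shard malloc across several independently-locked heaps so main-thread allocation and background-thread freeing usually take different locks. A real implementation (e.g. jemalloc arenas) would find the owning arena from chunk metadata rather than the per-allocation header used here.

```cpp
// Illustrative sketch of multiple independently-locked malloc heaps.
// The per-allocation Header is a simplification for demonstration only.
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <mutex>

class ShardedHeap {
  static constexpr uint32_t kHeaps = 4;

  struct alignas(alignof(std::max_align_t)) Header {
    uint32_t heapIndex;
  };

 public:
  void* allocate(size_t bytes, uint32_t heapIndex) {
    heapIndex %= kHeaps;
    std::lock_guard<std::mutex> guard(locks_[heapIndex]);
    auto* header = static_cast<Header*>(std::malloc(sizeof(Header) + bytes));
    if (!header) {
      return nullptr;
    }
    header->heapIndex = heapIndex;
    return header + 1;  // payload follows the header
  }

  void deallocate(void* payload) {
    if (!payload) {
      return;
    }
    Header* header = static_cast<Header*>(payload) - 1;
    // Only the owning heap's lock is taken, so a background sweeper freeing
    // older allocations rarely blocks the main thread's new allocations.
    std::lock_guard<std::mutex> guard(locks_[header->heapIndex]);
    std::free(header);
  }

 private:
  std::array<std::mutex, kHeaps> locks_;
};
```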
Comment 12•3 years ago
I'm not planning on working on this anytime soon. I don't know if it's still an issue, either. jonco's recent changes to stall the mutator might be relevant here.
Comment 13•3 years ago (Assignee)
(In reply to Steve Fink [:sfink] [:s:] from comment #12)
I took a profile of this: https://share.firefox.dev/3ynRtfp
It's doing very frequent GC (every ~35ms), although the GCs themselves aren't that bad (max pause ~6.5ms) and the proportion of GC time is around 15%.
Comment 14•3 years ago (Assignee)
We can improve this by changing the way we calculate our malloc triggers. For the GC bytes threshold we increase the heap growth factor when in 'high frequency GC mode', but this doesn't currently happen for the malloc threshold.
Extending this to cover malloc triggers reduces the frequency of GC (at the cost of increasing the threshold, but only to ~115MB). With this change we now see only 4% of time spent in GC:
https://share.firefox.dev/3fmSKM8
This doesn't help the benchmark result much but is still a useful change.
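As a sketch of the kind of calculation involved (the growth factors and minimum below are illustrative, not the values the patch uses): apply a larger growth factor to the malloc-bytes threshold while in high-frequency GC mode, mirroring what is already done for the GC-bytes threshold.

```cpp
// Sketch of the threshold calculation being described; the factors and the
// minimum are illustrative numbers, not the real tuning values.
#include <algorithm>
#include <cstddef>

struct GCSchedulingState {
  // True when recent GCs have been close together ("high frequency GC mode").
  bool inHighFrequencyGCMode = false;
};

size_t computeMallocTriggerBytes(size_t bytesRetainedAfterGC,
                                 const GCSchedulingState& state) {
  constexpr size_t minTriggerBytes = 16 * 1024 * 1024;

  // Use a larger growth factor when GCs are frequent, mirroring the existing
  // GC-bytes heuristic, so we stop re-triggering immediately after each GC.
  const double growthFactor = state.inHighFrequencyGCMode ? 3.0 : 1.5;

  return std::max(minTriggerBytes,
                  static_cast<size_t>(bytesRetainedAfterGC * growthFactor));
}
```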
Comment 15•3 years ago (Assignee)
Comment 16•3 years ago (Assignee)
Depends on D121625
Comment 17•3 years ago
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/68165a3d99c8
Use the same heap growth factor calculation for both GC cell and malloc heaps r=sfink
https://hg.mozilla.org/integration/autoland/rev/82ee3e97047f
Remove the now unused malloc growth factor parameter r=sfink
Comment 18•3 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/68165a3d99c8
https://hg.mozilla.org/mozilla-central/rev/82ee3e97047f