Closed Bug 1408375 Opened 7 years ago Closed 7 years ago

Telemetry regressions following malloc accounting changes

Categories

(Core :: JavaScript: GC, enhancement, P3)

Version: 55 Branch

Tracking


RESOLVED FIXED
Tracking Status
firefox58 --- fix-optional

People

(Reporter: jonco, Assigned: jonco)

References

Details

Attachments

(5 files, 1 obsolete file)

Following bug 1405274 and bug 1384049, which improved our malloc accounting, we've seen regressions in GC telemetry: specifically, an increase in the proportion of non-incremental GCs and a rise in the max pause time.

I'm going to try to address these problems in this bug.

(Previously I tried backing out one of the patches in the first bug, but the backout caused test timeouts and the patch was relanded.)
Attached patch bug1408375-split-zone-runtime-counters (deleted) - Splinter Review
Further to the description, the telemetry shows the reason for the non-incremental GCs is the malloc bytes trigger.

I think the reason we're seeing this is that bug 1405274 made all malloc allocations to the zone also increment the runtime malloc counter.  (Previously this happened for some allocations but not others, which was pretty confusing, so I made the change for simplicity and understandability.)  Since we don't clear the runtime counter unless we do a full GC, this leads to an increase in the number of GCs (obvious in hindsight).

I'm not sure why these are non-incremental GCs now that we have an incremental malloc bytes trigger, but it's possible that this isn't kicking in soon enough.

Whatever the reason, I'd like to try the opposite approach: split the runtime and zone counters and make them completely independent.  This should certainly reduce the number of GCs, and hopefully the number of non-incremental GCs.  It also simplifies the runtime counter updating behaviour, which was pretty unintuitive, TBH.
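The difference between the coupled and split schemes could be sketched roughly like this (a minimal illustration with invented names, not the actual Zone/GCRuntime code):

```cpp
#include <cstddef>

// Before the split: every zone malloc also bumped the runtime counter, so the
// runtime counter could only be cleared by a full GC and grew with all zones.
struct CoupledCounters {
    size_t runtimeBytes = 0;
    size_t zoneBytes = 0;
    void updateZoneMalloc(size_t nbytes) {
        zoneBytes += nbytes;
        runtimeBytes += nbytes;   // coupled: zone allocs feed the runtime counter
    }
};

// After the split: the counters are fully independent, so zone GCs alone keep
// the zone counters in check and the runtime counter only tracks runtime allocs.
struct SplitCounters {
    size_t runtimeBytes = 0;
    size_t zoneBytes = 0;
    void updateZoneMalloc(size_t nbytes) { zoneBytes += nbytes; }
    void updateRuntimeMalloc(size_t nbytes) { runtimeBytes += nbytes; }
};
```

In the split scheme a zone allocation no longer moves the runtime counter toward its trigger threshold, which is why this should reduce the number of GCs.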
Attachment #8919209 - Flags: review?(sphink)
Comment on attachment 8919209 [details] [diff] [review]
bug1408375-split-zone-runtime-counters

Review of attachment 8919209 [details] [diff] [review]:
-----------------------------------------------------------------

First, to clarify: I recommended backing out only the backout of the 3rd patch purely as a tactical thing -- the tree was closed, and the choices were to back out the backout or to back out the other two patches in this bug. I went for the lowest-risk approach, which was to revert to the most recently known-green state (even though it was known to have telemetry regressions). Backing out the whole bug would probably have worked fine, but the tree closure was getting lengthy and I just wanted to reopen it, thinking you could decide later whether to back out the whole thing or not. (I probably should have fired off a try build of the whole-bug backout at the same time.)

As for the patch, I'm fine with it as an experiment. It's a little weird, though. But probably less weird than what we had?

It makes the runtime memory into sort of another independent zone, rather than something associated with the total memory usage. We lose the total memory usage entirely with this patch. So the obvious pathological situation is where you have lots of medium-large zones but no GCs, leading to OOMs. If we actually had size measurements, it seems like it would be preferable to have an independent runtime pile *plus* an overall memory size, and use a higher threshold on the latter. But we don't have size measurements, we have allocation activity measurements, so this new scheme seems like it's worth a try.

The alternative would be keeping the previous setup (before this patch) and just bumping up the runtime threshold (possibly by a lot). But I'm fine with doing the experiment here and keeping an eye on OOM rates.
Attachment #8919209 - Flags: review?(sphink) → review+
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8b0c016826e0
Split zone and runtime malloc counters r=sfink
Let's leave this open until we see whether this has fixed the problem.
Keywords: leave-open
https://hg.mozilla.org/mozilla-central/rev/8b0c016826e0
Priority: -- → P3
The previous patch didn't fix the problem.
Attached patch bug1408375-update-jit-counter (deleted) - Splinter Review
One problem is that I didn't clear the zone's JIT counters on GC.
Attachment #8921098 - Flags: review?(sphink)
Attached patch bug1408375-remove-unnecessary-weirdness (deleted) - Splinter Review
Something else I overlooked: the threshold is still stored as a ptrdiff_t, which isn't necessary anymore.
Attachment #8921099 - Flags: review?(sphink)
Comment on attachment 8921098 [details] [diff] [review]
bug1408375-update-jit-counter

Review of attachment 8921098 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +4344,5 @@
>  
>      // Update the malloc counters for any zones we are collecting.
>      for (ZonesIter zone(rt, WithAtoms); !zone.done(); zone.next()) {
>          if (zone->isCollecting())
> +            zone->updateAllMallocBytesOnGC(lock);

In my current tree, I see no other callers of Zone::updateGCMallocBytesOnGC. Does it need to exist?
Attachment #8921098 - Flags: review?(sphink) → review+
(In reply to Jon Coppeard (:jonco) from comment #7)
> Created attachment 8921098 [details] [diff] [review]
> bug1408375-update-jit-counter
> 
> One problem is that I didn't clear the zone's JIT counters on GC.

I guess you can't tell whether that was triggering extra GCs.
Attachment #8921099 - Flags: review?(sphink) → review+
(In reply to Steve Fink [:sfink] [:s:] from comment #9)
> In my current tree, I see no other callers of Zone::updateGCMallocBytesOnGC.
> Does it need to exist?

Good point, I'll remove it.
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c4e07342e386
Also update JIT memory counters on GC r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab8d11c73647
Remove unnecessary treatment of malloc bytes count now this is no longer stored as a ptrdiff_t r=sfink
Attached patch bug1408375-fix-triggers (obsolete) (deleted) - Splinter Review
There are two further problems with the current setup:
 - the incremental GC trigger for malloc memory is checked on arena allocation, but we can allocate malloc memory without allocating GC things so we may miss the incremental trigger and go straight to a non-incremental GC
 - we reset the bytes count at the start of a possibly-incremental GC, so we will not trigger a full GC if we already triggered an incremental GC for a zone / the runtime

Here's a patch to address these by checking the malloc thresholds in the malloc path and by keeping track of which type of GC has been triggered.  I think this makes more sense than what we have currently.

Note that although the fix for the first issue should reduce the number of non-incremental GCs, the fix for the second may increase it.  However in local testing I'm not seeing a lot of non-incremental GCs with this patch applied.
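The two fixes could be sketched like this (a hypothetical illustration with invented names and thresholds, not the actual patch): the thresholds are checked on the malloc path itself, and the counter remembers which kind of GC it has already triggered so that crossing the incremental threshold doesn't suppress a later non-incremental trigger.

```cpp
#include <cstddef>

enum TriggerKind { NoTrigger, IncrementalTrigger, NonIncrementalTrigger };

class MallocCounter {
    size_t bytes_ = 0;
    size_t maxBytes_;            // high threshold: non-incremental GC
    double lowFactor_;           // fraction of maxBytes_ for the incremental trigger
    TriggerKind triggered_ = NoTrigger;

  public:
    MallocCounter(size_t maxBytes, double lowFactor)
      : maxBytes_(maxBytes), lowFactor_(lowFactor) {}

    // Called on every malloc update (not just on arena allocation), so
    // malloc-heavy workloads that allocate no GC things still hit the
    // incremental trigger before the non-incremental one.
    TriggerKind update(size_t nbytes) {
        bytes_ += nbytes;
        if (bytes_ >= maxBytes_ && triggered_ < NonIncrementalTrigger) {
            triggered_ = NonIncrementalTrigger;
            return NonIncrementalTrigger;
        }
        if (bytes_ >= size_t(maxBytes_ * lowFactor_) && triggered_ < IncrementalTrigger) {
            triggered_ = IncrementalTrigger;
            return IncrementalTrigger;
        }
        return NoTrigger;   // under threshold, or already triggered at this level
    }

    void updateOnGCEnd() {
        bytes_ = 0;
        triggered_ = NoTrigger;
    }
};
```

Because `triggered_` is only reset at the end of a GC, an incremental trigger for a zone doesn't prevent escalation to the non-incremental trigger if allocation keeps growing, which is the second fix described above.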
Attachment #8921476 - Flags: review?(sphink)
Comment on attachment 8921476 [details] [diff] [review]
bug1408375-fix-triggers

Review of attachment 8921476 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/gc/GCRuntime.h
@@ +663,4 @@
>  template <typename T>
>  class MemoryCounter
>  {
> +    const GCSchedulingTunables& tunables_;

I'm feeling like this MemoryCounter class is getting a bit hairy. It started out basically encapsulating a counter. Now it has all kinds of state and history for the counter, it knows about starts and ends of GCs, it has a reference to a GCSchedulingTunables, and it has detailed logic on trigger handling (including state management).

All of that is good functionality, and most of it is probably ok to put into a helper class like this, but I still feel like the class is getting too smart -- it doesn't really have an identifiable restricted set of concerns.

Hm... I'm still trying to work out what I'd like better. Maybe we should just rename it to MallocGCState or MallocGCStatefulTrigger or something and call it good.

If we *were* to trim it down, what would be the boundary? I'll sketch something out here just to see if it might make sense:

 - Instead of holding a ref to a tunables, hold just a thresholdFactor. If it changes, update it through an API. Hm... actually, maybe this needs to be passed in during update anyway, if we are to maintain the AllocThresholdFactorAvoidInterrupt distinction?

 - Pull the owner->triggerGCForTooMuchMalloc() into the caller of update() (thus removing the template parameter entirely, which probably should have been called OwnerT or something, since it's unclear what it is right now). Something like

    TriggerKind trigger = counter.update(nbytes);
    if (trigger != NoTrigger && triggerGCForTooMuchMalloc())
        counter.triggered(counterTrigger); // recordTrigger()?

I'm imagining update() would return NoTrigger if you started in the between-threshold range and are still there.

 - Be more explicit about this being a tristate. We have a counter, and it can either be under a low threshold, in the mid range, or over the high threshold. Rename shouldTrigger() to triggerState() or computeState() or something.

I guess it's a little weird because the low threshold is a varying percentage of the high threshold.

 - I'd sort of like to divorce it from GC knowledge (almost) entirely, in which case the enum values would be renamed to something like UnderLowThreshold, WithinThresholdRange, OverHighThreshold. Then updateOnGCEnd() would be decrement() or decrementByAmountWhenTriggered() or something. But given that we have a GCLockData in here, maybe there's no point. It just seems weird to have a *counter* declaring "we should do an incremental GC!" I'd rather see that logic in the caller based on which threshold has been reached.

 - Right now, it's a little weird that we have a low and high threshold, but isTooMuchMalloc() only refers to the high threshold. isOverHighThreshold()? It also bothers me that isTooMuchMalloc and shouldTrigger are kind of computing the same thing; if everything was in terms of thresholds, then isTooMuchMalloc could go away. (I haven't looked at the external callers of that; maybe it can go away already?)

Anyway, just thinking out loud. I'm ok with this change as an incremental fix, though I think we probably ought to file a bug for cleaning this up. The growth/shrink factors ought to be passed in, too.
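The trimmed-down, caller-driven shape sketched in the bullets above might look like this (all names hypothetical; the counter only classifies its state, and the caller owns the GC decision):

```cpp
#include <cstddef>

// Tristate classification divorced from GC knowledge, as proposed above.
enum ThresholdState { UnderLowThreshold, WithinThresholdRange, OverHighThreshold };

struct Counter {
    size_t bytes;
    size_t highThreshold;
    double lowFactor;   // low threshold as a varying fraction of the high one

    // update() only reports which threshold range the counter is in; it does
    // not decide whether to trigger a GC or know what kind of GC that would be.
    ThresholdState update(size_t nbytes) {
        bytes += nbytes;
        if (bytes >= highThreshold)
            return OverHighThreshold;
        if (bytes >= size_t(highThreshold * lowFactor))
            return WithinThresholdRange;
        return UnderLowThreshold;
    }
};

// The caller would then hold the GC logic, along the lines of the snippet in
// the review comment above (recordTrigger() is an invented name):
//
//     ThresholdState state = counter.update(nbytes);
//     if (state != UnderLowThreshold && triggerGCForTooMuchMalloc(state))
//         counter.recordTrigger(state);
```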

@@ +703,5 @@
>              maxBytes_ *= 2;
>          else
>              maxBytes_ = std::max(initialMaxBytes_.ref(), size_t(maxBytes_ * 0.9));
> +        bytes_ -= bytesAtStartOfGC_;
> +        triggered_ = NoTrigger;

Hm. So we might do this update, and end up in NoTrigger but still be over the low and maybe even high threshold. Is that what we want? I could imagine doing a huge allocation, then doing a nonincremental GC for the next log(n) tiny allocations.

@@ +735,5 @@
>  
>    private:
> +    TriggerKind shouldTrigger() const {
> +        if (bytes_ < maxBytes_ * tunables_.allocThresholdFactor())
> +            return NoTrigger;

Doesn't this lose the "be more reluctant to GC if it would reset an ongoing incremental GC" behavior?
Attachment #8921476 - Flags: review?(sphink) → review+
(In reply to Steve Fink [:sfink] [:s:] from comment #14)
These are all good points.  I'll rework this.
Attached patch bug1408375-fix-triggers v2 (deleted) - Splinter Review
I removed the template as suggested and moved the GC triggering behaviour into the callers.  I think this is preferable, even though there is a bunch of almost-duplicated code in GCRuntime and Zone -- it's not quite close enough to actually factor out.

I left the GC references in the counter class.  I think this is fine because it is specifically for GC, and because I think that introducing generic names for things like the thresholds makes it less clear what's going on.

I also fixed the missing delay before a zone GC is allowed to interrupt a currently running incremental GC.
Attachment #8921476 - Attachment is obsolete: true
Attachment #8921953 - Flags: review?(sphink)
Attached patch bug1408375-equal-zone-malloc-thresholds (deleted) - Splinter Review
We used to set the zone malloc threshold to 0.9 times the runtime threshold.  This made sense when the runtime counter was (supposed to be) the sum of all the zone counters.  It isn't any more, though, so we should just make all the thresholds the same.  This lets us move the parameter to GCSchedulingTunables.
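The change amounts to this (a trivial sketch with invented function names):

```cpp
#include <cstddef>

// Old scheme: zones triggered at 0.9x the runtime threshold, because the
// runtime counter was meant to be the sum of all the zone counters.
size_t zoneThresholdOld(size_t maxMallocBytes) {
    return size_t(maxMallocBytes * 0.9);
}

// New scheme: the counters are independent, so every counter shares the same
// threshold, which can live in a single tunable on GCSchedulingTunables.
size_t zoneThresholdNew(size_t maxMallocBytes) {
    return maxMallocBytes;
}
```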
Attachment #8921992 - Flags: review?(sphink)
Comment on attachment 8921992 [details] [diff] [review]
bug1408375-equal-zone-malloc-thresholds

Review of attachment 8921992 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, nice.
Attachment #8921992 - Flags: review?(sphink) → review+
Comment on attachment 8921953 [details] [diff] [review]
bug1408375-fix-triggers v2

Review of attachment 8921953 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, I find this much easier to read.

::: js/src/gc/GCRuntime.h
@@ +694,5 @@
> +
> +    void adopt(MemoryCounter& other);
> +
> +    TriggerKind shouldTriggerGC(const GCSchedulingTunables& tunables,
> +                                bool wouldInterruptGC = false) const

wouldInterruptGC is unused. (I assume this was from a previous attempt at handling the reset.)
Attachment #8921953 - Flags: review?(sphink) → review+
(In reply to Steve Fink [:sfink] [:s:] from comment #20)
> wouldInterruptGC is unused. (I assume this was from a previous attempt at
> handling the reset.)

Cheers, yes it was.
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5192e1ad48e7
Move malloc threshold check to malloc allocation r=sfink
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c88d7b89e27
Move max malloc bytes parameter to GCSchedulingTunables r=sfink
Telemetry is looking much better since the 25th of October.  The patches that landed in comment 12 seem to have fixed this.

The current proportion of non-incremental GCs is now around 3%, compared to 4% originally and 15-20% after bug 1405274 landed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open