1314828 - Analyze GC telemetry data

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Description

8 years ago

My first attempt to analyze this data is as follows:

1. Look at some random users and see what the most common problems are.
2. Write a function to automatically group GCs into buckets based on what appears to be slowest about that GC. The buckets I'm using are:
  PageFaults: there are more than 5000 faults in the slowest slice
  COMPARTMENT_REVIVED: first slice reason is COMPARTMENT_REVIVED
  CC_FORCED1: only one slice, and the reason is CC_FORCED
  CC_FORCED+: more than one slice, and the last slice reason is CC_FORCED
  KeepAtomsSet: nonincremental_reason is KeepAtomsSet
  GCBytesTrigger: nonincremental_reason is GCBytesTrigger
  MallocBytesTrigger: nonincremental_reason is MallocBytesTrigger
  Compact: slowest slice is >=75% in compaction
  Sweep: slowest slice is >=75% in sweeping
  MinorGCsToEvictNursery: slowest slice is >=75% in minor_gcs_to_evict_nursery
  Other: everything else

3. Throw away GCs where the max_pause is >30s. The machine probably went to sleep during the GC or something (I hope).
4. See how many GCs are in each bucket.
5. See how many GCs with a max pause >= 50ms (or 500ms or 5s) are in each bucket.
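A minimal sketch of steps 2 to 5, for illustration only; it is not the code from the gist. The record layout is an assumption: every field name here (`slices`, `reason`, `pause_ms`, `page_faults`, `phase_ms`, `nonincremental_reason`, `max_pause_ms`) is a hypothetical stand-in for whatever the telemetry payload actually uses, and the rules are assumed to be checked in the order the buckets are listed above.

```python
from collections import Counter

# Hypothetical record layout (an assumption, not the real telemetry schema):
#   record = {"slices": [{"reason": str, "pause_ms": float,
#                         "page_faults": int, "phase_ms": {phase: ms}}, ...],
#             "nonincremental_reason": str, "max_pause_ms": float}

NONINCREMENTAL_BUCKETS = ("KeepAtomsSet", "GCBytesTrigger", "MallocBytesTrigger")

# (bucket name, hypothetical phase key inside a slice's phase_ms dict)
PHASE_BUCKETS = (
    ("Compact", "compact"),
    ("Sweep", "sweep"),
    ("MinorGCsToEvictNursery", "minor_gcs_to_evict_nursery"),
)


def classify(record):
    """Return the first bucket whose rule matches, in the listed order."""
    slices = record["slices"]
    slowest = max(slices, key=lambda s: s["pause_ms"])

    if slowest["page_faults"] > 5000:
        return "PageFaults"
    if slices[0]["reason"] == "COMPARTMENT_REVIVED":
        return "COMPARTMENT_REVIVED"
    if slices[-1]["reason"] == "CC_FORCED":
        # One slice: CC_FORCED1. Several slices ending in CC_FORCED: CC_FORCED+.
        return "CC_FORCED1" if len(slices) == 1 else "CC_FORCED+"
    if record.get("nonincremental_reason") in NONINCREMENTAL_BUCKETS:
        return record["nonincremental_reason"]

    pause = slowest["pause_ms"]
    for bucket, phase in PHASE_BUCKETS:
        # The slowest slice spent at least 75% of its time in this phase.
        if pause > 0 and slowest["phase_ms"].get(phase, 0.0) >= 0.75 * pause:
            return bucket
    return "Other"


def summarize(records, min_pause_ms=0):
    """Count GCs per bucket, dropping any GC whose max pause exceeds 30s."""
    counts = Counter()
    for record in records:
        if record["max_pause_ms"] > 30_000:
            continue  # step 3: machine probably slept during the GC
        if record["max_pause_ms"] >= min_pause_ms:
            counts[classify(record)] += 1
    return counts
```

Calling `summarize(records)` corresponds to step 4, and `summarize(records, 50)`, `summarize(records, 500)` and `summarize(records, 5000)` to step 5.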

You can see the results here:
https://gist.github.com/bill-mccloskey/f94c25ad00e851698680586b42399d00

Let's say we are interested in GCs longer than 500ms. Then the sizes of the buckets are as follows:

[('Compact', 153),
 ('GCBytesTrigger', 86),
 ('PageFaults', 2747),
 ('KeepAtomsSet', 1258),
 ('CC_FORCED1', 52),
 ('MallocBytesTrigger', 97),
 ('Sweep', 1182),
 ('COMPARTMENT_REVIVED', 2373),
 ('MinorGCsToEvictNursery', 17),
 ('Other', 269),
 ('CC_FORCED+', 154)]

From this perspective, I think we can get the most bang for our buck by addressing COMPARTMENT_REVIVED and KeepAtomsSet. These are pretty big buckets and I suspect they would be easier to fix than Sweep or PageFaults (which we can dig into after the others are fixed).
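To make the relative sizes easier to compare, the counts above can be ranked and turned into shares of the total. This is only arithmetic on the numbers already listed; the variable names are illustrative.

```python
# Bucket sizes for GCs longer than 500ms, copied from the list above.
buckets = [
    ("Compact", 153),
    ("GCBytesTrigger", 86),
    ("PageFaults", 2747),
    ("KeepAtomsSet", 1258),
    ("CC_FORCED1", 52),
    ("MallocBytesTrigger", 97),
    ("Sweep", 1182),
    ("COMPARTMENT_REVIVED", 2373),
    ("MinorGCsToEvictNursery", 17),
    ("Other", 269),
    ("CC_FORCED+", 154),
]

total = sum(count for _, count in buckets)
ranked = sorted(buckets, key=lambda b: b[1], reverse=True)

# Print each bucket with its share of all GCs longer than 500ms.
for name, count in ranked:
    print(f"{name:<24}{count:>6}{100 * count / total:>7.1f}%")

# Combined size of the two buckets proposed as the first targets.
sizes = dict(buckets)
targeted = sizes["COMPARTMENT_REVIVED"] + sizes["KeepAtomsSet"]
print(f"targeted: {targeted} of {total}")
```

The total is 8388 GCs, and COMPARTMENT_REVIVED plus KeepAtomsSet account for 3631 of them, roughly 43%.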

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

8 years ago

Depends on: 1213977

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Updated

8 years ago

Depends on: 1311734

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 1

8 years ago

I'd like to look at this again now that bug 1318384 has landed.

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 2

8 years ago

It looks like bug 1318384 was totally successful! Normally it would be a little difficult to make a direct comparison since we get different amounts of data on different days. But the difference is so stark that it doesn't really matter.

Data from Nov. 1:
Across "worst" GCs reported from the content process (those with the worst max pause), 32399 were caused by COMPARTMENT_REVIVED.

Data from Nov. 25 and 26:
Across "worst" GCs reported from the content process (those with the worst max pause), 11 were caused by COMPARTMENT_REVIVED.

And keep in mind that the total number of GCs recorded in the second data set was much higher (more than twice as high for some reason--maybe the US holiday).

Great job, Jon! Based on the new data, bug 1213977 is the next easiest target.
https://gist.github.com/bill-mccloskey/2fe74101cb4e807e31c0f4215d3be2b9

Flags: needinfo?(wmccloskey)

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 3

8 years ago

I ran another analysis now that bug 1213977 is fixed. The KeepAtomsSet bucket totally disappeared. Hopefully we'll be able to see a difference in GC_MAX_PAUSE_MS in the telemetry evolution view. Unfortunately, it seems to be over a week behind, so we'll need to wait.

The two biggest categories are now "PageFaults" and "Sweep". We can try to fix the page faults category by avoiding GCs of infrequently used zones. That way we won't be paging in data that we're probably not going to use soon. This isn't something we can make quick short-term progress on though.

I broke down the Sweep category to try to understand what parts of sweeping are slow. The two big sub-components are "Mark During Sweeping" and "Sweep Miscellaneous". For "Mark During Sweeping", I saw a fairly even mix of "Mark Weak", "Mark Gray", and "Mark Gray and Weak" (although "Mark Gray" was somewhat larger than the other two).
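A breakdown like this can be sketched as summing the time spent in each sub-phase across the slowest slices of the Sweep bucket and ranking the totals. The input shape is an assumption (each slice is a dict mapping a parent phase to a dict of sub-phase times in ms), `rank_subphases` is a hypothetical helper, and the numbers in the usage lines are made up purely to show the call; they are not telemetry results.

```python
from collections import defaultdict


def rank_subphases(slowest_slices, parent):
    """Sum time per sub-phase of `parent` over all slices, largest first."""
    totals = defaultdict(float)
    for phases in slowest_slices:
        for subphase, ms in phases.get(parent, {}).items():
            totals[subphase] += ms
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Made-up example input: two slowest slices from the Sweep bucket.
example = [
    {"Sweep": {"Mark During Sweeping": 120.0, "Sweep Miscellaneous": 80.0}},
    {"Sweep": {"Mark During Sweeping": 40.0, "Sweep Miscellaneous": 95.0}},
]
for subphase, ms in rank_subphases(example, "Sweep"):
    print(f"{subphase}: {ms:.0f}ms")
```

The same helper could be applied one level down, with "Mark During Sweeping" as the parent, to compare "Mark Weak", "Mark Gray", and "Mark Gray and Weak".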

I filed bug 1323078 to break down the "misc" category into finer-grained phases.

Bug 1167452 already covers weak marking.

Gray marking will be tricky. The easier approach is probably bug 1323087. That will mark more objects black and fewer objects gray. I also filed bug 1323083 to incrementalize gray marking. We should implement bug 1323087 first and see how much it helps. If it's not enough, we can try to do bug 1323083.

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 4

8 years ago

Here's the latest gist:
https://gist.github.com/bill-mccloskey/7d61e025c3c66f5fbfc19067fad941f7
I wish I knew how to make it public, but at least I can see it...

Bill McCloskey [inactive unless it's an emergency] (:billm)

Assignee

Comment 5

8 years ago

I filed bug 1323306 based on some more analysis of weak marking.

Firefox Bug Husbandry Bot

Updated

7 years ago

Keywords: triage-deferred

Priority: -- → P3

Jon Coppeard (:jonco) (PTO until 14th September)

Comment 6

3 years ago

There's always more work to be done here, but that will continue elsewhere.

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → FIXED

Categories

(Core :: JavaScript: GC, defect, P3)

People

(Reporter: billm, Assigned: billm)

Details

(Keywords: triage-deferred)

Security

(public)
