Closed Bug 1508327 Opened 6 years ago Closed 4 years ago

Clearing mark bitmaps can take a long time

Categories

(Core :: JavaScript: GC, enhancement, P3)

Version: 61 Branch

Tracking

RESOLVED DUPLICATE of bug 1677765
Performance Impact low
Tracking Status
firefox66 --- fix-optional

People

(Reporter: jonco, Unassigned)

References

(Depends on 1 open bug)

Details

(Keywords: perf:responsiveness)

At the start of a GC we clear all the mark bits for collecting zones. This shows up as something that often causes budget overruns in our telemetry, which is plausible since it's linear in the size of the heap. Currently we do this during the first slice, in parallel with a bunch of other work.

One approach to improve this situation might be to parallelise this further and use multiple threads to do it during the first slice. This is fairly simple, but I'm not sure how much benefit we can get, since there is already other work going on in this slice and we are limited by the number of cores and by memory bandwidth.

Another approach might be to kick this off in the first slice and have it run concurrently with the mutator, only starting the GC for real in the second slice. This will add at least one extra slice to every GC, but is probably preferable. There might be other preparatory work we can do in advance too.

A third approach would be to move to an epoch-based scheme where 'black' for the previous collection is treated as 'white' for the current collection. However the fact that we also have the extra gray colour makes this more difficult, as does the fact that the epoch would have to be per-zone, because we don't collect all zones in every collection. This would remove the clearing work completely but could impact marking performance if we have to test the mark state against per-zone information every time we want to mark something.
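To make the third approach concrete, here is a minimal sketch of epoch-based mark checking. All names here (Zone::gcEpoch, CellMarkState) are invented for illustration, this is not the actual SpiderMonkey layout, and it ignores the gray colour:

#include <cstdint>

// Invented types for illustration only; not SpiderMonkey's real layout.
struct Zone {
    uint32_t gcEpoch = 0;  // Bumped each time this zone starts being collected.
};

struct CellMarkState {
    uint32_t epoch = 0;    // Epoch in which markBits was last written.
    uint8_t markBits = 0;  // Only meaningful if epoch matches the zone's.
};

// The cost in question: every query now loads the zone's epoch and compares
// it before the bits mean anything.
inline bool isMarkedBlack(const Zone* zone, const CellMarkState* state) {
    return state->epoch == zone->gcEpoch && (state->markBits & 0x1);
}

inline void markBlack(Zone* zone, CellMarkState* state) {
    if (state->epoch != zone->gcEpoch) {
        state->markBits = 0;  // The previous collection's 'black' becomes 'white'.
        state->epoch = zone->gcEpoch;
    }
    state->markBits |= 0x1;
}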
Is it something that could be done as a background low-priority housekeeping task? The task would be started directly after the previous GC finishes, and if we have to GC before the clearing is done then the GC's first slice must provide an assist for it, but it seems likely that that won't be common.
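For illustration, the suggested pattern might look like the following sketch. ClearMarkBitsTask and beginFirstSlice are hypothetical names, not real SpiderMonkey API, and synchronisation with a task that is already mid-run is elided:

#include <atomic>

// Hypothetical low-priority task that clears the mark bitmaps after a GC
// finishes; invented for illustration.
struct ClearMarkBitsTask {
    std::atomic<bool> done{false};

    void run() {
        // ... clear the mark bitmaps for all zones ...
        done.store(true, std::memory_order_release);
    }
};

void beginFirstSlice(ClearMarkBitsTask& task) {
    if (!task.done.load(std::memory_order_acquire)) {
        task.run();  // Assist: finish the clearing before marking starts.
    }
    // ... continue with the rest of the first slice ...
}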
(In reply to Jon Coppeard (:jonco) from comment #0)
> A third approach would be to move to an epoch-based scheme where 'black'
> for the previous collection is treated as 'white' for the current
> collection. However the fact that we also have the extra gray colour makes
> this more difficult, as does the fact that the epoch would have to be
> per-zone because we don't collect all zones in every collection. It would
> remove this work completely but could impact marking performance if we have
> to test the mark state against per-zone information every time we want to
> mark something.

I've had this idea in the back of my mind, trying to work out how to do it with these concerns. It ought to be possible.
(In reply to Lars T Hansen [:lth] from comment #1)
> Is it something that could be done as a background low-priority
> housekeeping task? The task would be started directly after the previous GC
> finishes and if we have to GC before the clearing is done then the GC's
> first slice must provide an assist for it, but it seems likely that that
> won't be common.

The cycle collector uses these bits in between GCs.
I had an idea for another approach to this:

- Mark state is moved out of the chunk trailer and is instead located via a pointer in the arena header which points to the mark bits for all the cells in that arena. This can be null, meaning all cells are unmarked.
- Mark state is bump allocated on demand from per-zone blocks.
- All mark state is freed for each collected zone at once when we start a GC, and the pointers in arena headers are set to null.

This has the following advantages:

- Allocation of mark state is relatively cheap, and mostly happens during marking, which is incremental.
- Clearing the mark bits means overwriting a pointer per arena rather than memset()ing the bits themselves, a 16 times reduction in memory writes.
- External mark state makes it much easier to change its format later, e.g. to move to one byte per cell.

This has the following disadvantages:

- It adds an extra load and compare when accessing mark bits.
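A minimal sketch of this layout, with invented names and sizes (ArenaHeader, MarkBytesPerArena, Zone::allocMarkBits are all illustrative):

#include <cstddef>
#include <cstdint>
#include <cstring>

static constexpr size_t MarkBytesPerArena = 32;  // Illustrative size only.

struct Zone {
    // Trivial stand-in for the per-zone bump allocator of mark-bit blocks.
    uint8_t* allocMarkBits() {
        uint8_t* bits = new uint8_t[MarkBytesPerArena];
        std::memset(bits, 0, MarkBytesPerArena);  // Fresh state: all unmarked.
        return bits;
    }
};

struct ArenaHeader {
    uint8_t* markBits = nullptr;  // Null means every cell here is unmarked.
};

// The listed disadvantage: an extra load and null check on every access.
inline bool isMarked(const ArenaHeader* arena, size_t cell) {
    return arena->markBits && (arena->markBits[cell / 8] & (1 << (cell % 8)));
}

inline void mark(Zone* zone, ArenaHeader* arena, size_t cell) {
    if (!arena->markBits) {
        arena->markBits = zone->allocMarkBits();  // Allocated on first mark.
    }
    arena->markBits[cell / 8] |= 1 << (cell % 8);
}

Clearing a collected zone's mark state is then one pointer write per arena plus resetting the bump allocator, rather than a memset() of every bitmap.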
(In reply to Jon Coppeard (:jonco) from comment #4)

Paul had a better idea, which was to clear the mark bits lazily by having a flag in the arena header to indicate that all cells should be treated as unmarked. At the start of the GC we set the flag for all arenas, and we only clear the mark state when we need to mark a cell in an arena that has the flag set. We can do this immediately with the current mark bit layout.
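As a sketch, with illustrative names and sizes, the lazy clearing could look like:

#include <cstddef>
#include <cstdint>
#include <cstring>

struct Arena {
    bool allUnmarked = false;   // Set for every arena at the start of a GC.
    uint8_t markBits[32] = {};  // In-place bitmap, as in the current layout.
};

inline bool isMarked(const Arena* arena, size_t cell) {
    if (arena->allUnmarked) {
        return false;  // The flag overrides whatever stale bits are present.
    }
    return arena->markBits[cell / 8] & (1 << (cell % 8));
}

inline void markCell(Arena* arena, size_t cell) {
    if (arena->allUnmarked) {
        // First mark into this arena this GC: clear the stale bits for real.
        std::memset(arena->markBits, 0, sizeof(arena->markBits));
        arena->allUnmarked = false;
    }
    arena->markBits[cell / 8] |= 1 << (cell % 8);
}

This turns the up-front clearing of the whole heap's bitmaps into one flag write per arena, paying the memset only for arenas that are actually marked into.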
Priority: -- → P3

(In reply to Jon Coppeard (:jonco) from comment #5)
> (In reply to Jon Coppeard (:jonco) from comment #4)
> Paul had a better idea, which was to clear the mark bits lazily by having a
> flag in the arena header to indicate that all cells should be treated as
> unmarked. At the start of the GC we set the flag for all arenas and we only
> clear the mark state when we need to mark a cell in an arena that has the
> flag set. We can do this immediately with the current mark bit layout.

I've been thinking more about this.

I propose having two flags in each arena: one for "Everything is clear" and one for "Everything is marked", aka "This arena was allocated during collection". The second one will remove the cost of premarking (but not of fixing the mark bits after sweeping). So there are three states we need to represent:

  • Everything is unmarked
  • Everything is marked
  • Check individual mark bits.

We can also do these separately. Except there's currently only one spare bit in the arena header, not two! Whichever we do second (probably premarking) will involve changing the usable size of an arena. A sketch of the two-flag encoding follows.
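As a sketch with invented names, the three states fold naturally into a two-bit enum, which is exactly the problem, since only one spare bit exists:

#include <cstddef>
#include <cstdint>

enum class ArenaMarkState : uint8_t {
    AllUnmarked,  // Set for collected arenas at the start of a GC.
    AllMarked,    // Arena allocated during collection: everything is live.
    Individual,   // Consult the per-cell bits.
};

struct Arena {
    ArenaMarkState state = ArenaMarkState::Individual;  // Needs two bits.
    uint8_t markBits[32] = {};
};

inline bool isMarked(const Arena* arena, size_t cell) {
    switch (arena->state) {
        case ArenaMarkState::AllUnmarked:
            return false;
        case ArenaMarkState::AllMarked:
            return true;
        default:
            return arena->markBits[cell / 8] & (1 << (cell % 8));
    }
}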

Or we can use a single bit to defer to other data (sketched below):

  • Check the first cell's mark bit; that's the state of all the other mark bits.
  • Check individual mark bits.

(if we really care about saving this space).
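A sketch of that single-bit variant, again with invented names:

#include <cstddef>
#include <cstdint>

struct Arena {
    bool uniform = false;       // The one spare bit in the arena header.
    uint8_t markBits[32] = {};  // First cell's bit doubles as the summary.
};

inline bool isMarked(const Arena* arena, size_t cell) {
    if (arena->uniform) {
        // All cells share the first cell's state: all marked or all unmarked.
        return arena->markBits[0] & 1;
    }
    return arena->markBits[cell / 8] & (1 << (cell % 8));
}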

I chatted with Jon and I'll take this bug soon.

Priority: P3 → P2
Whiteboard: [qf?]
Depends on: 566578
Whiteboard: [qf?] → [qf:p3:responsiveness]
Assignee: nobody → jcoppeard

I've tried several different approaches, but haven't found any that give clear benefits. Abandoning this for now.

Assignee: jcoppeard → nobody
Priority: P2 → P3

We parallelised this in bug 1677765.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → DUPLICATE
Performance Impact: --- → P3
Whiteboard: [qf:p3:responsiveness]