Closed Bug 1661872 Opened 4 years ago Closed 4 years ago

beginMarkPhase can block on background tasks

Categories

(Core :: JavaScript: GC, defect)

Tracking

RESOLVED WONTFIX

People

(Reporter: pbone, Unassigned)

References

(Blocks 1 open bug)

Details

GCRuntime::beginMarkPhase starts several background tasks to do unmarking and similar, but it can block on these tasks and blow its slice budget. See https://perfht.ml/2XmQBId "Web Content (7/8)".

This is probably pathological - the system is very busy doing GC in multiple processes. But ideally this process should yield from the GC if the background work would cause it to blow its budget.
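
Roughly the sort of thing I have in mind, as a sketch only - SliceResult, SliceBudget and joinPrepareTasks below are made-up names, not the actual GCRuntime code:

  // Sketch: poll the prepare tasks against the slice budget instead of
  // joining them unconditionally, and yield if we would blow the budget.
  #include <chrono>
  #include <future>
  #include <vector>

  enum class SliceResult { Finished, Yielded };

  struct SliceBudget {
    std::chrono::steady_clock::time_point deadline;
    bool isOverBudget() const {
      return std::chrono::steady_clock::now() >= deadline;
    }
  };

  SliceResult joinPrepareTasks(std::vector<std::future<void>>& tasks,
                               const SliceBudget& budget) {
    for (auto& task : tasks) {
      // Wait in short increments so we notice when the budget runs out,
      // rather than blocking until the background task completes.
      while (task.wait_for(std::chrono::milliseconds(1)) !=
             std::future_status::ready) {
        if (budget.isOverBudget()) {
          return SliceResult::Yielded;  // resume the join in a later slice
        }
      }
    }
    return SliceResult::Finished;
  }

Yielding here would only help if the rest of the prepare work can be picked up again in a later slice.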

Blocks: 1629064

Most of the work that happens in beginMarkPhase needs to happen in the first slice of GC, before we yield to the mutator, so this is not possible.

We could run startBackgroundFreeAfterMinorGC() later, which might help in some circumstances, but it's not likely to make much difference.

Mm true, I didn't think it through.

I'd like to know which background task is taking the most time here (if one stands out). Maybe there's some specific way we could apply effort, or learn why this case is pathological.

(In reply to Paul Bone [:pbone] from comment #2)

I'd like to know which background task is taking the most time here (if one stands out).

GC_SLOW_TASK telemetry will give you some information:

https://telemetry.mozilla.org/new-pipeline/dist.html#!cumulative=0&end_date=2020-09-09&include_spill=0&keys=__none__!__none__!__none__&max_channel_version=nightly%252F82&measure=GC_SLOW_TASK&min_channel_version=nightly%252F79&processType=*&product=Firefox&sanitize=1&sort_by_value=0&sort_keys=submissions&start_date=2020-08-24&table=0&trim=1&use_submission_date=0

There's nothing particularly actionable here so I'm going to WONTFIX this.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX

When I look at that profile, it seems like something (actionable) is going wrong here. Specifically, this is a 3.3sec main-thread stall because of an initial GC slice, all because it blocked for 3.2sec waiting for background tasks. Was this GC slice really that critical that it needed to happen right away? It seems like we should either wait for the background tasks first, with a timeout after which we abort preparation and just don't start the GC -- or we do it in the current order, but again we abort the preparation and re-do whatever is necessary if it takes too long.

That might not fix this particular problem, if it's going to delay the GC for too long (so we hit another trigger that makes it synchronous, or blows out RAM so that we should have another trigger, or whatever). But allowing a 49ms budget GC slice to disappear for 3.3sec because it's waiting on a background thread doesn't seem great.

I wonder... with XPCOM threads, is there a chance of getting some priority inheritance to assist here? It's possible that the background thread took so long because the system was busy and it was lower priority. But that's rank speculation.

(In reply to Steve Fink [:sfink] [:s:] from comment #4)

It seems like we should either wait for the background tasks first

To be clear, we're waiting for preparation work that we started in this slice. This is not waiting for outstanding background tasks to finish.

Was this GC slice really that critical that it needed to happen right away?

This is a good question, but we don't have any information in the GC to answer that. We should run the initial slice in idle time, if this doesn't happen already.

On our side, one thing we could do is to kick off the unmarking work in the first slice and have that run in parallel with the mutator, and then run everything else in the second slice. That would help, but we still need to buffer the gray roots from the browser and mark the roots, all of which must happen in a single slice.
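
Very roughly, the split could look something like this (runPrepareSlice and the helper functions are hypothetical stand-ins, not the actual slice state machine):

  // Sketch: slice 1 starts unmarking concurrently and yields to the mutator;
  // slice 2 joins it and then does the work that can't be split up.
  #include <future>

  // Stand-ins for the real work; assumptions, not actual SpiderMonkey calls.
  void unmarkAllZones() {}
  void bufferGrayRoots() {}
  void markRoots() {}

  struct GCSession {
    std::future<void> unmarkTask;
    bool unmarkStarted = false;
  };

  // Returns true when preparation is complete.
  bool runPrepareSlice(GCSession& session) {
    if (!session.unmarkStarted) {
      // Slice 1: kick off unmarking and return to the mutator immediately.
      session.unmarkTask = std::async(std::launch::async, unmarkAllZones);
      session.unmarkStarted = true;
      return false;
    }
    // Slice 2: unmarking must be finished before we buffer and mark roots.
    session.unmarkTask.wait();
    bufferGrayRoots();  // these still have to happen within a single slice
    markRoots();
    return true;
  }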

But allowing a 49ms budget GC slice to disappear for 3.3sec because it's waiting on a background thread doesn't seem great.

Yeah this is not great. Hopefully this doesn't happen too often.

(In reply to Jon Coppeard (:jonco) from comment #5)

(In reply to Steve Fink [:sfink] [:s:] from comment #4)

It seems like we should either wait for the background tasks first

To be clear, we're waiting for preparation work that we started in this slice. This is not waiting for outstanding background tasks to finish.

Oh! Thank you, that was the part I was missing.

Was this GC slice really that critical that it needed to happen right away?

This is a good question, but we don't have any information in the GC to answer that. We should run the initial slice in idle time, if this doesn't happen already.

Right, this question is different from what I was thinking given that the first slice triggered the work.

On our side, one thing we could do is to kick off the unmarking work in the first slice and have that run in parallel with the mutator, and then run everything else in the second slice. That would help, but we still need to buffer the gray roots from the browser and mark the roots, all of which must happen in a single slice.

Yes, given my updated understanding that's what I would want to do.

But allowing a 49ms budget GC slice to disappear for 3.3sec because it's waiting on a background thread doesn't seem great.

Yeah this is not great. Hopefully this doesn't happen too often.

I tried checking telemetry. Here's the max pause histogram for Nightly, but with anything 50ms or under dropped (I only want to look at long pauses):

57-67ms	23.1%
68-80ms	15.3%
81-95ms	12.0%
96-113ms	17.8%
114-134ms	6.7%
135-159ms	5.0%
160-189ms	4.0%
190-225ms	3.0%
226-267ms	2.3%
268-317ms	1.9%
318-377ms	1.5%
378-448ms	1.3%
449-532ms	1.0%
533-632ms	0.8%
633-751ms	0.7%
752-893ms	0.6%
894-1061ms	0.5%
1062-1261ms	0.4%
1262-1499ms	0.4%
1500-1781ms	0.3%
1782-2116ms	0.2%
2117-2515ms	0.2%
2516-2989ms	0.2%
2990-3552ms	0.1%
3553-4221ms	0.1%
4222-5016ms	0.1%
5017-5960ms	0.1%
5961-7082ms	0.1%
7083-8415ms	0.1%
8416-9999ms	0.0%
10000-99999999ms	0.2%

The percentages are of this trimmed data, not of the total telemetry reports. This shows that we have a long tail, but I don't think it really answers the question. (I'm trying to decide whether this case happens often enough to be worth bothering with.)

GC_SLOW_PHASE shows PREPARE as being the slow phase 17% of the time, pretty much tied for most common with EVICT_NURSERY_FOR_MAJOR_GC and MARK. (Why would MARK run long? That seems odd. We should be checking the budget frequently enough to not go over there, I thought.)

So there's some evidence that it might be a problem, at least.

GC_SLOW_TASK shows mostly BUFFER_GRAY_ROOTS (49%) and UNMARK (36%), with some COMPACT_UPDATE_CELLS (11%). So again, it looks like it might help some.

Could both gray buffering and unmarking be made incremental?

Gray buffering: you would need to barrier the gray roots. But if we had a flag that said whether an involved data structure had been buffered yet (or maybe even the limit up to which we've buffered), then we could re-scan everything every root marking slice and skip whatever is known to have already been buffered. If something were too hard to barrier, we could just redo it every time.

We'd probably need to check if we're making any progress, or enough progress, and at some point ignore the budget but just for this slice.
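
Roughly what I'm picturing, with made-up names (this is not the actual root-buffering code):

  // Sketch: each holder remembers whether its roots were already buffered
  // this GC, so a budgeted re-scan can skip the ones that are known done.
  #include <cstddef>
  #include <vector>

  struct HeapEdge {};  // stand-in for an edge into the JS heap

  struct Holder {
    bool buffered = false;        // already scanned in an earlier slice?
    std::vector<HeapEdge> edges;  // edges this holder contributes
  };

  struct Budget {
    std::size_t remaining;
    bool isOverBudget() const { return remaining == 0; }
    void step() { if (remaining) { remaining--; } }
  };

  // Returns true once every holder has been buffered, false if we ran out
  // of budget and need to re-scan in another root-marking slice.
  bool bufferGrayRootsIncrementally(std::vector<Holder>& holders,
                                    std::vector<HeapEdge>& grayRoots,
                                    Budget& budget) {
    for (auto& holder : holders) {
      if (holder.buffered) {
        continue;  // known to be buffered already; skip on the re-scan
      }
      if (budget.isOverBudget()) {
        return false;  // yield; pick up the remaining holders next slice
      }
      for (const auto& edge : holder.edges) {
        grayRoots.push_back(edge);
        budget.step();
      }
      holder.buffered = true;
    }
    return true;
  }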

Unmarking: would it break anything if unmarking were a "rolling" unmark, so that not everything is cleared in the same slice? Before you moved onto the marking phase, every arena would be unmarked, but there might be some mutator slices in between. Would that be ok? Any barriers that mark things ought to be pushing onto the mark stack.

I've taken a look at what it would require to rerun MarkRoots for a budgeted amount of time each slice, and it seems messy but doable. It just depends on whether it makes sense to make either, or very preferably both, of the above operations incremental.

(In reply to Steve Fink [:sfink] [:s:] from comment #7)

Unmarking: would it break anything if unmarking were a "rolling" unmark, so that not everything is cleared in the same slice? Before you moved onto the marking phase, every arena would be unmarked, but there might be some mutator slices in between. Would that be ok? Any barriers that mark things ought to be pushing onto the mark stack.

I've taken a look at what it would require to rerun MarkRoots for a budgeted amount of time each slice, and it seems messy but doable. It just depends on whether it makes sense to make either, or very preferably both, of the above operations incremental.

How about,

  1. During sweeping, after sweeping each arena, mark the arena as having a "dirty mark bitmap".
  2. When allocating a new arena, mark it as dirty; if it's during a GC, mark it as "everything here is marked" (aka fully-marked). We already do the latter.

Then when we begin a collection:

  1. All arenas are either dirty or fully-marked.
  2. Begin our mark phase; when we mark a cell, check its arena: if the arena is dirty, unmark that arena and clear the dirty bit. Let this work count towards our incremental budget.

There's no distinct unmarking phase: it's distributed with the rest of marking and is "lazy", in that a completely free arena is never unmarked.
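
As a sketch of what the mark path might look like (Arena, Budget and markCell here are simplified stand-ins, not the real arena or marking code):

  // Sketch: arenas carry a "dirty mark bitmap" bit; the first time we mark
  // a cell in a dirty arena we unmark the whole arena and charge that work
  // to the incremental budget.
  #include <bitset>
  #include <cstddef>

  struct Arena {
    static constexpr std::size_t CellsPerArena = 256;
    std::bitset<CellsPerArena> markBits;
    bool dirty = true;  // set while sweeping, or when allocated outside a GC

    void unmarkAll() {
      markBits.reset();
      dirty = false;
    }
  };

  struct Budget {
    std::size_t remaining;
    // Deduct work done; returns false once the budget is used up.
    bool consume(std::size_t amount) {
      remaining = (amount >= remaining) ? 0 : remaining - amount;
      return remaining > 0;
    }
  };

  // Mark one cell, lazily unmarking its arena if this is the first time the
  // arena has been touched this collection.  Returns false when the slice
  // should yield.
  bool markCell(Arena& arena, std::size_t cellIndex, Budget& budget) {
    if (arena.dirty) {
      arena.unmarkAll();                     // lazy, per-arena unmark
      budget.consume(Arena::CellsPerArena);  // count it towards the budget
    }
    arena.markBits.set(cellIndex);
    return budget.consume(1);
  }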

(In reply to Steve Fink [:sfink] [:s:] from comment #7)

Gray buffering: you would need to barrier the gray roots.

The gray roots come mostly from CycleCollectedJSRuntime's map of JS holders, which are cycle collected objects with pointers into the JS heap. To get the gray roots we call a Trace method on each holder which tells us the edges. These edges are used as the gray roots for GC purposes. Mostly these edges are JS::Heap<>s so it should be possible to put a barrier on them (in fact we used to have a write barrier on JS::Heap but removed it because it wasn't necessary in the current setup).

So, in theory we could have a write barrier on JS::Heap that marked the previous contents gray during marking. Then we could move tracing all of these to the GC slice where we start gray marking. We would still have to buffer other gray roots that were not in JS::Heap, or find another way of barriering them. And we'd need a barrier on holder removal.
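
A minimal sketch of that barrier, assuming a JS::Heap-like wrapper (HeapObjectPtr and the helpers are stand-ins, not the real JS::Heap implementation):

  // Sketch of a pre-write barrier that grays the previous target of an edge
  // while marking is in progress, so an edge overwritten before we trace the
  // holders isn't lost.
  struct HeapObject {};  // stand-in for a cell in the JS heap

  // Stand-ins for GC state and the gray marker (assumptions for the sketch).
  static bool gcIsMarking = false;
  bool markingInProgress() { return gcIsMarking; }
  void markGray(HeapObject*) { /* push onto the gray mark stack */ }

  class HeapObjectPtr {
   public:
    HeapObjectPtr& operator=(HeapObject* next) {
      // Pre-barrier: overwriting an edge during marking could hide a gray
      // root we haven't traced yet, so mark the previous target gray now.
      if (ptr_ && markingInProgress()) {
        markGray(ptr_);
      }
      ptr_ = next;
      return *this;
    }
    HeapObject* get() const { return ptr_; }

   private:
    HeapObject* ptr_ = nullptr;
  };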

There are probably more details to be worked out but it sounds like that would work. It would be a great improvement to not have to barrier all those roots - there are often a large number of them.

Unmarking: would it break anything if unmarking were a "rolling" unmark, so that not everything is cleared in the same slice?

Well, you could just not run any GC slices until unmark is finished, or have them return immediately. It's not possible to do any GC work before unmark is finished (unless you use lazy unmarking, but see next comment).

(In reply to Paul Bone [:pbone] from comment #8)

How about,

  1. During sweeping, after sweeping each arena, mark the arena as having a "dirty mark bitmap".
  2. When allocating a new arena, mark it as dirty; if it's during a GC, mark it as "everything here is marked" (aka fully-marked). We already do the latter.

Then when we begin a collection:

  1. All arenas are either dirty or fully-marked.
  2. Begin our mark phase; when we mark a cell, check its arena: if the arena is dirty, unmark that arena and clear the dirty bit. Let this work count towards our incremental budget.

There's no distinct unmarking phase: it's distributed with the rest of marking and is "lazy", in that a completely free arena is never unmarked.

The cycle collector needs the gray marking information after we finish sweeping, so we can't mark arenas dirty during sweeping (I'm taking "dirty" above to mean "all cells are unmarked"). Changing marking state during sweeping is also problematic because most sweeping is done off main thread, and so this requires synchronisation with gray unmarking, which happens on the main thread.

I implemented something similar to this earlier this year (it copied the gray marking state somewhere else) but it ended up slower than what we have already. This was partly due to the extra work of copying gray marking state and partly because it makes any operation involving the mark bits much more complicated.

(In reply to Jon Coppeard (:jonco) from comment #9)

(In reply to Steve Fink [:sfink] [:s:] from comment #7)

Gray buffering: you would need to barrier the gray roots.

The gray roots come mostly from CycleCollectedJSRuntime's map of JS holders, which are cycle collected objects with pointers into the JS heap. To get the gray roots we call a Trace method on each holder which tells us the edges. These edges are used as the gray roots for GC purposes. Mostly these edges are JS::Heap<>s so it should be possible to put a barrier on them (in fact we used to have a write barrier on JS::Heap but removed it because it wasn't necessary in the current setup).

So, in theory we could have a write barrier on JS::Heap that marked the previous contents gray during marking.

Oh wait, so you would mark all JS::Heap gray if they are written to during marking? Is overmarking gray not a problem?... wait, no, it isn't. Something unnecessarily marked gray will only stay gray if it is never blackened (duh!), so it should be fine. Somehow I was worried about finding a gray cycle that wouldn't otherwise be found, but that's not a problem.

Then we could move tracing all of these to the GC slice where we start gray marking. We would still have to buffer other gray roots that were not in JS::Heap, or find another way of barriering them.

Right. I was thinking we'd probably buffer all of these the way we do now, as long as they're usually a small percentage of the total.

And we'd need a barrier on holder removal.

Removing a holder, or removing an object from a holder?

There are probably more details to be worked out but it sounds like that would work. It would be a great improvement to not have to barrier all those roots - there are often a large number of them.

Sorry, did you mean not buffer all those roots (synchronously)? (Because this would be more barriering, not less!)

Unmarking: would it break anything if unmarking were a "rolling" unmark, so that not everything is cleared in the same slice?

Well, you could just not run any GC slices until unmark is finished, or have them return immediately. It's not possible to do any GC work before unmark is finished (unless you use lazy unmarking, but see next comment).

Right, that's what I'm going for -- unmarking happens during RootMarking, and I wanted to do the unmarking incrementally in multiple budgeted slices. As soon as everything is unmarked, we would proceed to Marking. Nothing fancy; I wasn't looking to interleave unmarking with the rest of GC.

But from telemetry it looks like unmarking and buffering gray roots are slow in similar percentages of collections, so I was hoping to make them both incremental so that they would be running in parallel budgeted slices. They're already running in parallel (with potentially multiple unmarking threads and multiple gray buffering threads). This would budget them as well. I had not considered avoiding the buffering of JS holder gray roots entirely, but you're right, that's what it ought to be. (No point to incrementalize buffering of things that don't need to be buffered anymore.)

Ah. As I see from bug 1677765, none of this budgeting stuff is needed because the unmarking can be done concurrently with the mutator, and the gray roots mostly don't need to be buffered if the bulk of them are barriered.

(In reply to Jon Coppeard (:jonco) from comment #10)

(In reply to Paul Bone [:pbone] from comment #8)

How about,

  1. During sweeping, after sweeping each arena, mark the arena as having a "dirty mark bitmap".
  2. When allocating a new arena, mark it as dirty; if it's during a GC, mark it as "everything here is marked" (aka fully-marked). We already do the latter.

Then when we begin a collection:

  1. All arenas are either dirty or fully-marked.
  2. Begin our mark phase; when we mark a cell, check its arena: if the arena is dirty, unmark that arena and clear the dirty bit. Let this work count towards our incremental budget.

There's no distinct unmarking phase: it's distributed with the rest of marking and is "lazy", in that a completely free arena is never unmarked.

The cycle collector needs the gray marking information after we finish sweeping, so we can't mark arenas dirty during sweeping (I'm taking "dirty" above to mean "all cells are unmarked"). Changing marking state during sweeping is also problematic because most sweeping is done off main thread, and so this requires synchronisation with gray unmarking, which happens on the main thread.

If the "dirty" bit just means "This has not been unmarked for the next/current GC". Before a GC begins the cycle collector can still read the marking information. Regarding synchronisation with gray unmarking - I'm not sure, I guess it can be done but I haven't read that code (lately).

I implemented something similar to this earlier this year (it copied the gray marking state somewhere else) but it ended up slower than what we have already. This was partly due to the extra work of copying gray marking state and partly because it makes any operation involving the mark bits much more complicated.

In this case I'd be worried about the extra branch during marking to check the arena's "dirty" bit before checking a cell's mark bits.
