1772806 - Lots of wake-ups triggered by garbage collection in a background tab

Reporter

Description

•

2 years ago

here another >sitting in the background for days< profile, this time twitch taking apparently 2GB, maybe useful? https://share.firefox.dev/3aFWitv

I'm surprised that we have so many GCSlide of only a few microseconds at the beginning. Is there something we can do about it?

It's also unfortunate that the machine appears to be under heavy load (out of a 599ms awake markers, we only get 40ms of actual CPU time), but we kept this background process at the "User Interactive" Quality of Service. I guess that's bug 1394714.

Flags: needinfo?(pbone)

Nicolas B. Pierron [:nbp]

Comment 1

•

2 years ago

Forwarding the needinfo to Steve, as Paul is no longer working on GCs.

Blocks: GC.performance

Severity: -- → S4

Flags: needinfo?(pbone) → needinfo?(sphink)

Priority: -- → P3

Paul Bone [:pbone]

Comment 2

•

2 years ago

Thanks Florian,

It looks like maybe it starts to do a CC then bails out of a GC, but whatever told it it wants to do a CC retriggers. I'll have to spend some time reading the code to understand.

It might also be helpful to have markers about the management of idle time.

Paul Bone [:pbone]

Updated

•

2 years ago

Depends on: 1394714, 1773198

Florian Quèze [:florian]

Reporter

Comment 3

•

2 years ago

I just noticed that there's a TaskController thread stuck in js::gc::BackgroundUnmarkTask::run (a 13s Runnable) and attempting to use CPU there, and the very short repeated GCSlice markers on the main thread stop when the TaskController runnable has finished. Should the main thread code more explicitly wait for that runnable to finish, and avoid triggering EnsureCCRunner::CCRunnerFired timers until it is finished?

matthias koplenig [:metasieben]

Comment 4

•

2 years ago

Some more info about the profile:
It was generated via about:processes after switching to the backgrounded twitch-page.

STR(using Mac mini M1 with 8GB RAM):

open a vod https://www.twitch.tv/videos/1486195733
play the video
pause the video
switch to another tab
leave the browser running for a few days

Sadly the STR aren't more specific and probably not easily reproducible.

Jon Coppeard (:jonco) (PTO until 14th September)

Comment 5

•

2 years ago

This is interesting. We're waiting for the unmarking background task to finish, so idle time GC slices are returning immediately. Normally this task doesn't take so long to complete.

Steve Fink [:sfink] [:s:]

Comment 6

•

2 years ago

The background unmarking task is effectively a memset(ptr, 0, size) loop over various memory regions. It looks up some pointers and writes zeroes to them. You can see in the profile that the compiler inlines it all. It's not even much memory usually (the size of each is fairly small since it's a bitmap). And yet, the profile shows 2 full seconds just doing this unmarking. Which makes me think that either (1) the machine is swapping heavily, or (2) the thread is deprioritized behind other things happening so that it's not running (it's schedulable so it doesn't show up as idle)? The CPU graph shows that it's mostly around 15% CPU utilization, peaking at about 30%.

Flags: needinfo?(sphink)

Steve Fink [:sfink] [:s:]

Comment 7

•

2 years ago

@mstange tracked down the code that is being executed:


0000000001ac44c4         stur       xzr, [x13, #-0xa0]
0000000001ac44c8         add        x13, x20, #0x38
0000000001ac44cc         ldar       w13, [x13]
0000000001ac44d0         cbnz       w13, loc_1ac44dc

0000000001ac44d4         ldr        x12, [x12, #0x10] ; <---  this is where all the samples spend their time
0000000001ac44d8         cbnz       x12, loc_1ac4498

                     loc_1ac44dc:
0000000001ac44dc         add        x10, x10, #0x1
0000000001ac44e0         cmp        x10, #0x23

ldr x12, [x12, #0x10] looks very much like the call to arena.next() since the next field of js::gc::Arena is at 0x10. That gave me pause initially because I thought it was already accessing the Arena and so any paging would happen on an earlier access. But actually, the mark bits that are being cleared here are stored in the chunk, so this very well could be the first access of this address.

So it looks like page faults of some sort. If it were thread scheduling, I would expect the sampled addresses to be more spread out. (But I could be full of crap, too.)

Steve Fink [:sfink] [:s:]

Comment 8

•

2 years ago

If you get it into this state again, could you collect a report from about:memory?

jrmuizel suggested that a profile from Instruments with kernel stacks on would confirm if the unmark thread is going so slowly because of swapping. I don't have an OSX machine so I don't know how to do that, but that would be useful if you happen to know how.

Anything else that could tell whether it's swapping would confirm it to. free on linux would give a hint, or top or htop or whatever. Or the machine just being miserably slow at everything and spinning the fans fast enough to take off.

Flags: needinfo?(mozilla)

matthias koplenig [:metasieben]

Comment 9

•

2 years ago

I filed bug 1773944 last week which has a memory report attached, since this bug seemed
to be more directed at the GC wakeups.

I'm unsure how related the memory report in the other bug is though.
The symptoms look similar, background twitch.tv tab using lots of memory(1GB) while being paused.

As for creating a profile with Instruments, I will try to create one the next time.

Flags: needinfo?(mozilla)

Gabriele Svelto [:gsvelto]

Comment 10

•

2 years ago

FYI macOS goes to extreme lengths to prevent applications from going OOM, I've seen it use a swap file twice the size of the machine's memory. That's to say don't discount swapping as the cause of the slowness because macOS is very happy to do it.

Paul Bone [:pbone]

Comment 11

•

2 years ago

sfink, Do you think it's worth pausing the CC and GC timers while background unmarking is happening. Doing this could be arkward, especially re-enabling the timers once the thread has finished. And in most cases unmarking is fast so it's possibly not worth it. Maybe it's easier to backoff these timers so they execute less often until unmarking is done.

Flags: needinfo?(sphink)

Steve Fink [:sfink] [:s:]

Comment 12

•

2 years ago

(In reply to Paul Bone [:pbone] from comment #11)

sfink, Do you think it's worth pausing the CC and GC timers while background unmarking is happening. Doing this could be arkward, especially re-enabling the timers once the thread has finished. And in most cases unmarking is fast so it's possibly not worth it. Maybe it's easier to backoff these timers so they execute less often until unmarking is done.

It's a good idea, but I guess I kind of feel like I'd want more evidence that it would do some good. If the cause turns out to be that the machine was swapping, then I doubt the do-nothing GC slices are adding any additional wakeups or other overhead. Stalling off the slices would make it so the GC wouldn't be blamed, but unless there's something else going on it wouldn't improve actual performance or battery usage. Right now, the report is consistent with the hypothesis that the machine was swapping, everything was painfully slow, and yet it managed to gather a profile and the scheduled slices are the obvious thing that stand out. It would be really nice to have swap activity (or some other correlated metric) in the profiles...

I guess we kind of do; as florian said: "the machine appears to be under heavy load (out of a 599ms awake markers, we only get 40ms of actual CPU time)". I think whatever is causing that is the true culprit here. It's probably some variant of a memory leak. Twitch should not be using 2GB when it's backgrounded. It could be a site leak, or it could be something like unread messages piling up. I'm looking for anything in bug 1749366 that could be related.

Flags: needinfo?(sphink)

Comment hidden (offtopic)

Paul Bone [:pbone]

Updated

•

2 years ago

Depends on: 1801575

Bugzilla

Lots of wake-ups triggered by garbage collection in a background tab

Categories

(Core :: JavaScript: GC, defect, P3)

Tracking

()

People

(Reporter: florian, Unassigned)

References

(Depends on 3 open bugs, Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Comment 1

Comment 2

Updated

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Comment 12

Comment 13

Comment 14

Comment 15

Updated

Attachment

General

Description

File Name

Content Type