Closed Bug 495516 Opened 16 years ago Closed 13 years ago

cycle collector causing pauses every few seconds

Categories

(Core :: XPCOM, defect)

Platform: x86
OS: Windows Vista
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED INCOMPLETE
Tracking Status
status1.9.1 --- wanted

People

(Reporter: vlad, Unassigned)

Details

(Whiteboard: [Snappy])

Attachments

(1 file)

Attached file xperf profile output (deleted) —
Build identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1pre) Gecko/20090521 Firefox/3.5b5pre (.NET CLR 3.5.30729)

Every 10 seconds or so, I get a pretty noticeable pause (1-2 seconds). I can capture a trace with xperf, and during these periods CPU activity spikes to 100%. The profile of one of these periods is attached, albeit a little hard to read. The relevant entries, with the % pulled to the front:

33.08%  xul.dll                                            962.149360  33.08  962
 7.74%    PL_DHashTableOperate                             225.027400   7.74  225
 5.64%    nsGenericElement::cycleCollection::Traverse      164.037640   5.64  164
 2.48%    ChangeTable                                       72.012720   2.48   72
 2.30%    nsCycleCollector::MarkRoots                       67.009880   2.30   67
 1.96%    GraphWalker::DoWalk                               57.003800   1.96   57
 1.86%    GCGraphBuilder::NoteXPCOMChild                    54.001880   1.86   54
 1.13%    nsGenericDOMDataNode::cycleCollection::Traverse   33.000120   1.13   33
(and then a bunch more)

10.56%  js3250.dll                                         307.046040  10.56  307
 4.06%    js_TraceObject                                   118.017080   4.06  118
 2.10%    JS_TraceChildren                                  61.007800   2.10   61
 0.86%    array_trace                                       25.003680   0.86   25
 0.65%    JS_CallTracer                                     19.003200   0.65   19

Looks to me like this is all in cycle collection, right? Any idea what would cause it to take so long, and on every run?
(In reply to comment #0)
> 10.56%  js3250.dll          307.046040  10.56  307
>  4.06%    js_TraceObject    118.017080   4.06  118
>  2.10%    JS_TraceChildren   61.007800   2.10   61
>  0.86%    array_trace        25.003680   0.86   25
>  0.65%    JS_CallTracer      19.003200   0.65   19

I've seen these symptoms as well (periodic hangs), and my Shark profiles show signs of pretty deep recursion with those functions involved. My windows typically have 20-100 tabs loaded (mostly Bugzilla pages + Gmail + Zimbra).
I can get in this state pretty often; is there any useful data that I can gather? Not sure if there's a page or anything that's causing it; I have two zimbras open (mozilla + personal), and then about 20-30 other tabs.
Flags: wanted1.9.1.x?
Flags: blocking1.9.2?
Are you using GMail? I see something like this with GMail running every second or so and triggering a cycle collection each time. There are basically only a couple of things we can easily do AFAIK: run idle cycle collections less often, and make them faster. But they have to scan the whole heap AFAIK so we can't make them arbitrarily fast.
Nope, no gmail, but 2x zimbra instances are open. Can we do some kind of generational collection here, if we don't already? I get the big pauses even if I haven't really changed much of anything, just typing in an input field. It seems like we should be able to just ignore things we've already checked in the last iteration pretty often.
Generational might be hard. But it seems to me that the opportunistic "GC on idle" (as opposed to "GC on OOM") should only run after a certain lower limit of GC-managed memory allocations have occurred. There's no point in GC'ing on idle if we've only allocated 10 objects since the last GC.
(I assume Zimbra, GMail etc aren't actually allocating a bazillion objects every few seconds.)
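As a rough illustration of that allocation-threshold idea, here is a minimal C++ sketch. All the names in it (sAllocsSinceLastGC, kIdleGCAllocThreshold, NoteGCThingAllocated, MaybeGCOnIdle, RunGC) are invented for this comment, not actual Gecko symbols; the point is just that the idle timer skips the collection unless enough GC-managed allocation has happened since the last one.

// Hedged sketch only: none of these names are real Gecko symbols.
#include <cstddef>

static size_t sAllocsSinceLastGC = 0;
static const size_t kIdleGCAllocThreshold = 10000;  // the "certain lower limit"

static void RunGC() { /* stand-in for the real collection */ }

// Called whenever a GC-managed object is allocated.
static void NoteGCThingAllocated() {
  ++sAllocsSinceLastGC;
}

// Called from the idle timer: only collect if enough has actually been
// allocated since the last GC; collecting "10 objects" is all pause, no payoff.
static void MaybeGCOnIdle() {
  if (sAllocsSinceLastGC < kIdleGCAllocThreshold) {
    return;
  }
  RunGC();
  sAllocsSinceLastGC = 0;
}

The threshold value would need tuning against real workloads, of course; the sketch just shows where the gate would sit.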
I don't understand how you guys have enough memory for 100 tabs. But whatever.

Keep in mind that this is not a garbage collector in the sense that it starts from live roots. We're not searching for live memory, we're searching for *dead* memory. We're starting from a set of possibly-dead nodes and, for each, trying to prove that the transitive closure of its children forms a really-dead group, by accounting for all the refcounts in the group through interior links.

I *think* this fact lends itself to a different form of incrementalism than you get in a "starting from live" GC, because our "suspects" are mostly innocent and mostly take themselves off our radar as they age (unlike the conventional generational hypothesis). So I think we can run the existing algorithm on a bounded subset of the purple buffer, and/or for a bounded number of steps. Then, as now, we can free any decidedly-white nodes we discover; we just have to put all the grey nodes (or at least the initially-purple roots that didn't turn white) *back* in the purple buffer if we stop due to timeout, rather than what we do now, where we run to exhaustion and color all those still grey as black (and forget about them).

So then... we ought to *eventually* process the purple buffer in its entirety. But maybe nibbling at it regularly would free smaller cycles in the meantime without having to traverse the whole program graph. Depends how "wide" the typical cycles are. Spooky heuristic city.

(As an example of the weirdness of the heuristics: note that the generational hypothesis is half-backwards here. Younger purple roots are simultaneously more-likely garbage, since young objects don't live long, and also *less*-likely garbage, since the longer you stay in the purple buffer the more likely you are to be dead. Maybe the key is to differentiate age-in-purple-buffer from time-since-object-was-originally-allocated? Which we don't track...)

Anyway, I think there's probably still lower-hanging fruit at this point. It's not clear to me that all the collector structures are optimal; dbaron was working on a more efficient purple buffer recently, but as far as I know we're still hammering on PLDHash during traversal, and that's really not a good pointer-set abstraction at all. Maybe brucehoult would like to dig in here? It's also not clear to me that we have perfect avoidance of overlapping sub-scans. I think so, but I'm not certain.

Finally -- or perhaps "initially!" -- I'd also wager roc's right and some simpler heuristic tuning on the activation and duration will help.
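To make the bounded-slice idea above concrete, here's a minimal C++ sketch. Everything in it (CCNode, TryProveDead, FreeGarbage, CollectSlice, the budget) is a made-up stand-in, not the real nsCycleCollector code: the only point is that each slice frees whatever it can prove dead and re-suspects the rest, instead of running the whole purple buffer to exhaustion in one pause.

// Hedged sketch only: these are illustrative stand-ins, not real CC types.
#include <cstddef>
#include <vector>

struct CCNode {
  bool provablyDead = false;  // stand-in for the result of the white/grey walk
};

// Stand-in for running the existing algorithm from a single purple root:
// returns true if the whole group rooted here was proven dead (white).
static bool TryProveDead(CCNode* aRoot) {
  return aRoot->provablyDead;
}

static void FreeGarbage(CCNode* aRoot) {
  delete aRoot;
}

// Process at most aBudget purple-buffer roots in this slice. Roots we could
// not prove dead go *back* into the buffer instead of being colored black and
// forgotten, so later slices eventually get to everything.
static void CollectSlice(std::vector<CCNode*>& aPurpleBuffer, size_t aBudget) {
  std::vector<CCNode*> carryOver;
  size_t processed = 0;
  while (!aPurpleBuffer.empty() && processed < aBudget) {
    CCNode* root = aPurpleBuffer.back();
    aPurpleBuffer.pop_back();
    ++processed;
    if (TryProveDead(root)) {
      FreeGarbage(root);          // decidedly white: free it right away
    } else {
      carryOver.push_back(root);  // still suspected: re-purple it for next time
    }
  }
  // Anything we didn't get to, plus the re-suspected roots, waits for the next slice.
  aPurpleBuffer.insert(aPurpleBuffer.end(), carryOver.begin(), carryOver.end());
}

A real implementation would presumably drop roots whose groups were fully proven alive rather than re-suspecting them forever; this sketch conservatively keeps anything it couldn't prove dead within the budget.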
One thing that I'm wondering is if the actions performed by cc in whatever case I'm hitting are actually valid -- that we're not hitting some weird bug or something.
I've seen this too, but only a few times... not sure we can block on it without additional data.
What kind of data needs to be gathered?
A particular site that triggers it would be good, or better yet a testcase. Otherwise it's just something randomly bad happening, right? I'm happy to catch it in a debugger next time it happens to me on Windows, but I'm not sure what to be looking for. It's also concerning that this happens every few seconds... does that mean the JS engine thinks it's always out of memory and is triggering GC?
This bug reminds me of bug 373462 (which I think I still encounter sometimes, though I'm not sure; the slowdown might also be caused by something else).
Flags: wanted1.9.1.x?
Flags: blocking1.9.2? → blocking1.9.2-
Bug 515287 may have made a difference here.
I've seen this too, and when profiling it is JS_CallTracer which takes most of the time.
Is it _in_ JS_CallTracer, or is it _under_ JS_CallTracer? Some attached profiles would probably help, since we have wildly different options depending on whether we're mostly in the JS heap or mostly in the cycle collector's world. Luke Wagner's hash table joy might make a difference here too, though I suspect a proportionately small one.

Do we know why the CC is being triggered every few seconds? I can't tell from the comments if there's one reason or many for us to trigger a CC, so we may be looking at multiple bugs.
(In reply to comment #15)
> Do we know why the CC is being triggered every few seconds?

Bug 515287 may have helped here.
Whiteboard: [Snappy]
I'm going to close this. The cycle collector has changed a lot in the last two years, and there's not really anything specifically actionable in here.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INCOMPLETE
So, which bug is now responsible for the multi-second pauses caused by the CC? Bug 377787 seems outdated.
Probably the best thing to do is to file a new bug with [Snappy] in the whiteboard. CCing me would also be helpful. Generally, having a giant bug where a lot of people post similar symptoms that likely have different underlying causes is not very useful, except that it is a little easier for users to deal with than filing a new bug that may end up getting put in the wrong component and lost.
It's already filed, but I haven't gotten any response at https://bugzilla.mozilla.org/show_bug.cgi?id=608954#c50. I've added [Snappy] to the whiteboard; we'll see if it helps...
Sorry, I forgot to follow up there. I've done that now.