Closed Bug 495516 Opened 16 years ago Closed 13 years ago

cycle collector causing pauses every few seconds

Categories

(Core :: XPCOM, defect)

Platform: x86
OS: Windows Vista
Type: defect
Priority: Not set
Severity: normal

Tracking


RESOLVED INCOMPLETE
Tracking Status
status1.9.1 --- wanted

People

(Reporter: vlad, Unassigned)

Details

(Whiteboard: [Snappy])

Attachments

(1 file)

Attached file xperf profile output (deleted) —
Build identifier: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1pre) Gecko/20090521 Firefox/3.5b5pre (.NET CLR 3.5.30729)

Every 10 seconds or so, I get a pretty noticeable pause (1-2 seconds). I can capture a trace with xperf, and during these periods CPU activity spikes to 100%. The profile of one of these periods is attached, albeit a little hard to read. The relevant entries, with the % pulled to the front:

33.08%  xul.dll                                            962.149360  33.08  962
 7.74%    PL_DHashTableOperate                             225.027400   7.74  225
 5.64%    nsGenericElement::cycleCollection::Traverse      164.037640   5.64  164
 2.48%    ChangeTable                                       72.012720   2.48   72
 2.30%    nsCycleCollector::MarkRoots                       67.009880   2.30   67
 1.96%    GraphWalker::DoWalk                               57.003800   1.96   57
 1.86%    GCGraphBuilder::NoteXPCOMChild                    54.001880   1.86   54
 1.13%    nsGenericDOMDataNode::cycleCollection::Traverse   33.000120   1.13   33
(and then a bunch more)

10.56%  js3250.dll                                         307.046040  10.56  307
 4.06%    js_TraceObject                                   118.017080   4.06  118
 2.10%    JS_TraceChildren                                  61.007800   2.10   61
 0.86%    array_trace                                       25.003680   0.86   25
 0.65%    JS_CallTracer                                     19.003200   0.65   19

Looks to me like this is all in cycle collection, right? Any idea what would cause it to take so long, and on every run?
(In reply to comment #0)
> 10.56%  js3250.dll          307.046040  10.56  307
>  4.06%    js_TraceObject    118.017080   4.06  118
>  2.10%    JS_TraceChildren   61.007800   2.10   61
>  0.86%    array_trace        25.003680   0.86   25
>  0.65%    JS_CallTracer      19.003200   0.65   19

I've seen these symptoms as well (periodic hangs), and my Shark profiles show signs of pretty deep recursion with those functions involved. My windows typically have 20-100 tabs loaded (mostly Bugzilla pages + Gmail + Zimbra).
I can get in this state pretty often; is there any useful data that I can gather? Not sure if there's a page or anything that's causing it; I have two zimbras open (mozilla + personal), and then about 20-30 other tabs.
Flags: wanted1.9.1.x?
Flags: blocking1.9.2?
Are you using GMail? I see something like this with GMail running every second or so and triggering a cycle collection each time. There are basically only a couple of things we can easily do AFAIK: run idle cycle collections less often, and make them faster. But they have to scan the whole heap AFAIK so we can't make them arbitrarily fast.
Nope, no gmail, but 2x zimbra instances are open. Can we do some kind of generational collection here, if we don't already? I get the big pauses even if I haven't really changed much of anything, just typing in an input field. It seems like we should be able to just ignore things we've already checked in the last iteration pretty often.
Generational might be hard. But it seems to me that the opportunistic "GC on idle" (as opposed to "GC on OOM") should only run after a certain lower limit of GC-managed memory allocations have occurred. There's no point in GC'ing on idle if we've only allocated 10 objects since the last GC.
(I assume Zimbra, GMail etc aren't actually allocating a bazillion objects every few seconds.)
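As a rough illustration of that allocation-threshold idea, here is a minimal C++ sketch. All the names in it (sAllocsSinceLastGC, kIdleGCAllocThreshold, NoteGCThingAllocated, MaybeGCOnIdle, RunGC) are invented for this comment, not actual Gecko symbols; the point is just that the idle timer skips the collection unless enough GC-managed allocation has happened since the last one.

// Hedged sketch only: none of these names are real Gecko symbols.
#include <cstddef>

static size_t sAllocsSinceLastGC = 0;
static const size_t kIdleGCAllocThreshold = 10000;  // the "certain lower limit"

static void RunGC() { /* stand-in for the real collection */ }

// Called whenever a GC-managed object is allocated.
static void NoteGCThingAllocated() {
  ++sAllocsSinceLastGC;
}

// Called from the idle timer: only collect if enough has actually been
// allocated since the last GC; collecting "10 objects" is all pause, no payoff.
static void MaybeGCOnIdle() {
  if (sAllocsSinceLastGC < kIdleGCAllocThreshold) {
    return;
  }
  RunGC();
  sAllocsSinceLastGC = 0;
}

The threshold value would need tuning against real workloads, of course; the sketch just shows where the gate would sit.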
I don't understand how you guys have enough memory for 100 tabs. But whatever.

Keep in mind that this is not a garbage collector in the sense that it starts from live roots. We're not searching for live memory, we're searching for *dead* memory. We're starting from a set of possibly-dead nodes and, for each, trying to prove that the transitive closure of its children forms a really-dead group, by accounting for all the refcounts in the group through interior links.

I *think* this fact lends itself to a different form of incrementalism than you get in a "starting from live" GC, because our "suspects" are mostly innocent and mostly take themselves off our radar as they age (unlike the conventional generational hypothesis). So I think we can run the existing algorithm on a bounded subset of the purple buffer, and/or for a bounded number of steps. Then, as now, we can free any decidedly-white nodes we discover; we just have to put all the grey nodes (or at least the initially-purple roots that didn't turn white) *back* in the purple buffer if we stop due to timeout, rather than what we do now, where we run to exhaustion and color all those still grey as black (and forget about them).

So then... we ought to *eventually* process the purple buffer in its entirety. But maybe nibbling at it regularly would free smaller cycles in the meantime without having to traverse the whole program graph. Depends how "wide" the typical cycles are. Spooky heuristic city.

(As an example of the weirdness of the heuristics: note that the generational hypothesis is half-backwards here. Younger purple roots are simultaneously more-likely garbage, since young objects don't live long, and also *less*-likely garbage, since the longer you stay in the purple buffer the more likely you are to be dead. Maybe the key is to differentiate age-in-purple-buffer from time-since-object-was-originally-allocated? Which we don't track...)

Anyway, I think there's probably still lower-hanging fruit at this point. It's not clear to me that all the collector structures are optimal; dbaron was working on a more efficient purple buffer recently, but as far as I know we're still hammering on PLDHash during traversal, and that's really not a good pointer-set abstraction at all. Maybe brucehoult would like to dig in here? It's also not clear to me that we have perfect avoidance of overlapping sub-scans. I think so, but I'm not certain.

Finally -- or perhaps "initially!" -- I'd also wager roc's right and some simpler heuristic tuning on the activation and duration will help.
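To make the bounded-slice idea above concrete, here's a minimal C++ sketch. Everything in it (CCNode, TryProveDead, FreeGarbage, CollectSlice, the budget) is a made-up stand-in, not the real nsCycleCollector code: the only point is that each slice frees whatever it can prove dead and re-suspects the rest, instead of running the whole purple buffer to exhaustion in one pause.

// Hedged sketch only: these are illustrative stand-ins, not real CC types.
#include <cstddef>
#include <vector>

struct CCNode {
  bool provablyDead = false;  // stand-in for the result of the white/grey walk
};

// Stand-in for running the existing algorithm from a single purple root:
// returns true if the whole group rooted here was proven dead (white).
static bool TryProveDead(CCNode* aRoot) {
  return aRoot->provablyDead;
}

static void FreeGarbage(CCNode* aRoot) {
  delete aRoot;
}

// Process at most aBudget purple-buffer roots in this slice. Roots we could
// not prove dead go *back* into the buffer instead of being colored black and
// forgotten, so later slices eventually get to everything.
static void CollectSlice(std::vector<CCNode*>& aPurpleBuffer, size_t aBudget) {
  std::vector<CCNode*> carryOver;
  size_t processed = 0;
  while (!aPurpleBuffer.empty() && processed < aBudget) {
    CCNode* root = aPurpleBuffer.back();
    aPurpleBuffer.pop_back();
    ++processed;
    if (TryProveDead(root)) {
      FreeGarbage(root);          // decidedly white: free it right away
    } else {
      carryOver.push_back(root);  // still suspected: re-purple it for next time
    }
  }
  // Anything we didn't get to, plus the re-suspected roots, waits for the next slice.
  aPurpleBuffer.insert(aPurpleBuffer.end(), carryOver.begin(), carryOver.end());
}

A real implementation would presumably drop roots whose groups were fully proven alive rather than re-suspecting them forever; this sketch conservatively keeps anything it couldn't prove dead within the budget.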
One thing that I'm wondering is if the actions performed by cc in whatever case I'm hitting are actually valid -- that we're not hitting some weird bug or something.
I've seen this too, but only a few times... not sure we can block on it without additional data.
What kind of data needs to be gathered?
A particular site that triggers it would be good, or better yet a testcase. Otherwise it's just something randomly bad happening, right? I'm happy to catch it in a debugger next time it happens to me on Windows, but I'm not sure what to be looking for. It's also concerning that this happens every few seconds... does that mean the JS engine thinks it's always out of memory and is triggering GC?
This bug reminds me of bug 373462 (which I think I still encounter sometimes, though I'm not sure; the slowdown might also be caused by something else).
Flags: wanted1.9.1.x?
Flags: blocking1.9.2? → blocking1.9.2-
Bug 515287 may have made a difference here.
I've seen this too, and when profiling it is JS_CallTracer which takes most of the time.
Is it _in_ JS_CallTracer, or is it _under_ JS_CallTracer? Some attached profiles would probably help, since we have wildly different options depending on whether we're mostly in the JS heap or mostly in the cycle collector's world. Luke Wagner's hash table joy might make a difference here too, though I suspect a proportionately small one.

Do we know why the CC is being triggered every few seconds? I can't tell from the comments if there's one reason or many for us to trigger a CC, so we may be looking at multiple bugs.
(In reply to comment #15)
> Do we know why the CC is being triggered every few seconds?

Bug 515287 may have helped here.
Whiteboard: [Snappy]
I'm going to close this. The cycle collector has changed a lot in the last two years, and there's not really anything specifically actionable in here.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INCOMPLETE
So, which bug is now responsible for the multi-second pauses caused by the CC? Bug 377787 seems outdated.
Probably the best thing to do is to file a new bug with [Snappy] in the whiteboard. CCing me would also be helpful. Generally, having a giant bug where a lot of people post similar symptoms that likely have different underlying causes is not very useful, except that it is a little easier for users to deal with than filing a new bug that may end up getting put in the wrong component and lost.
It's already filed, but I haven't gotten any response at https://bugzilla.mozilla.org/show_bug.cgi?id=608954#c50. I've added [Snappy] to the whiteboard; we'll see if it helps...
Sorry, I forgot to follow up there. I've done that now.