Closed Bug 479553 Opened 16 years ago Closed 16 years ago

Minimize forking of property tree

Categories

(Core :: JavaScript Engine, defect, P1)

Hardware: x86
OS: All

Tracking

()

VERIFIED FIXED
mozilla1.9.2a1
Tracking Status
status1.9.2 --- beta1-fixed

People

(Reporter: fehe, Assigned: graydon)

References

()

Details

(5 keywords, Whiteboard: fixed-in-tracemonkey)

Attachments

(9 files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090220 Firefox/3.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090220 Firefox/3.2a1pre

Minefield experiences high system resource consumption when clicking a module in the module list at the linked URL. The hang is caused by a rather scary-looking JavaScript file, http://drupal.org/misc/jquery.js (the version at the time of this bug report is attached).

Reproducible: Always

Steps to Reproduce:
1. Visit the linked URL.
2. Click one of the module links (e.g. "Administration").
3. Observe the high system resource consumption.
Component: General → JavaScript Engine
Keywords: hang
OS: Windows XP → All
Product: Firefox → Core
Version: unspecified → Trunk
I too am experiencing this.

Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2a1pre) Gecko/20090220 Minefield/3.2a1pre (.NET CLR 3.5.30729)
ID:20090220102957
http://pastebin.com/f7d83bc4b
That's the decoded version of the JavaScript causing the problem (jquery.js); it should help whoever decides to work on fixing this. (Thanks to Apoc for decoding it.)
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee: nobody → general
QA Contact: general → general
More observations:

1. If one middle-clicks to load one of the project (module) links in a new tab, it loads without issue.
2. When either closing the tab loaded with the project (module) list page (http://drupal.org/node/206666) or closing the browser after loading that page, the high system resource consumption and hang occur.

So it turns out the problem actually occurs only when transitioning away from the page at http://drupal.org/node/206666. It is then that the script kicks in and causes the problems. I will shortly attach a copy of the script (it seems basically the same as the jquery.js attached earlier, but with some additional code) responsible for the hang when I click to close a tab loaded with http://drupal.org/node/206666.
Flags: blocking1.9.2?
I am now finding this bug difficult to avoid, because I am doing website work with Drupal. Last night it took about 10 minutes for Firefox to shut down, because memory kept cycling up and down into the hundreds of megabytes.

There are many functions that rely on jQuery.js, and it is also being used on various other sites. Though I filtered out jQuery as a workaround, I cannot do so completely, because some sites rely on it; and besides, whatever the underlying issue is, it likely also affects other scripts. So a fix for this would be greatly appreciated. Thanks
Keywords: footprint, perf
What is your build id?
This sure seems like a 1.9.1 blocker to me. /be
Flags: blocking1.9.1?
Does it happen on the branch? All the reports so far are of 1.9.2-stream builds.
Attached file test case (deleted) —
To reproduce:
1. Open Windows Task Manager.
2. After decompressing the archive, launch 206666.htm.
3. Wait for the page to load completely, then close the tab.
4. Notice the firefox.exe process activity in Task Manager.

For shutdown, the more open tabs there are with pages using jQuery.js, the greater the problem.

Note: jQuery.js is the file b0f4a4d77baac5b6ea14727108f1e722.js.
(In reply to comment #7)
> What is your build id?

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090323 Minefield/3.2a1pre ID:20090323044646

(In reply to comment #9)
> Does it happen on the branch? All the reports so far are of 1.9.2-stream
> builds.

Just verified that the latest Shiretoko branch is affected.
I confirmed that CPU activity goes to 100% for 10-30 seconds with JIT on or off. It seems that it does eventually terminate. The problem doesn't seem to occur with Safari, but I don't know if that's because Safari lacks a bug or just doesn't run the code at all.

I don't have time to work on this for the rest of the week as we have a paper deadline on Sunday. I did prepare a version of the JS file that has the uncompressed jQuery, so you can get line numbers. The bug still occurs. To use this file, copy it over the file of the same name from comment 10.

For me, the "stop script" dialog came up on line 719. I think the next step would be to use a JS debugger or something to get a full stack trace at that point. That should make it fairly clear what code is running that takes so long and where it gets activated from.
Flags: blocking1.9.1? → blocking1.9.1+
Attached file Slightly reduced testcase (deleted) —
The problem seems to occur when a bunch of event listeners are removed on unload. I've stripped out some cruft and added a button which triggers the freeze by attempting to remove all the event listeners.
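For orientation, the trigger presumably looks something like this at the JS level; this is a hand-written sketch rather than the attachment itself, and it assumes the page has loaded the jquery.js attached to this bug with handlers bound on many elements:

    function removeAllListeners() {
        // Walks every element, touching each one's entry in jQuery's data cache.
        jQuery("*").unbind();
    }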
(In reply to comment #14)
> Created an attachment (id=369168) [details]
> Slightly reduced testcase
>
> The problem seems to occur when a bunch of event listeners are removed on
> unload. I've stripped out some cruft and added a button which triggers the
> freeze by attempting to remove all the event listeners.

Drive-by comment: could the problem be something involving (a) modifying a list while iterating over it, or (b) code that adds new event listeners during the loop that tries to remove them all?
tinderboxpushlog has been slow-script-hanging for a while on page unload; I speculate this might have something to do with it, given that the hanging script is jquery.
Keywords: dogfood
Assignee: general → graydon
Some preliminary notes on this bug:

- The bug doesn't entirely "lock up" in the sense of being fatal to the browser; the JS engine gets involved doing so much work that it never yields, until the operation callback fires and the script timeout dialog comes forth. It's not like it's unique in the sense of being able to do that, so I'm not sure this has to count as super-high-priority blocking.

- That said, it seems to be doing something we handle particularly poorly, so I'm still trying to work out what that is.

- Memory use peaks dramatically within an instant of performing the attempted removal; from 60MB to 400MB in under a second.

- The interpreter then enters a last-ditch GC.

- Oprofile measures the "lockup" as centering in property-tree-management code:

    samples   %        symbol name
    452914    47.1923  js_SweepScopeProperties
    306923    31.9804  InsertPropertyTreeChild()
    12168      1.2679  SearchTable()

- The property-tree statistics I get from this are interesting. The first histogram shows 8,000 nodes:

    [ 0]:          2348 ************
    [ 1]:          6246 *************
    [ 2]:            82 *******
    [ 3, 4]:         18 *****
    [ 5, 8]:          6 ***
    [ 9, 16]:         3 **
    [ 17, 32]:        2 *
    [ 33, 64]:        1
    [ 65, 128]:       2 *
    [ 129, 256]:      2 *
    [ 257, 512]+      4 **

  immediately followed by a histogram with 12 million nodes!

    [ 0]:         17690 *****
    [ 1]:      12524958 ********
    [ 2]:          2721 ****
    [ 3, 4]:         66 **
    [ 5, 8]:         13 **
    [ 9, 16]:        14 **
    [ 17, 32]:        6 *
    [ 33, 64]:        1
    [ 65, 128]:       3 *
    [ 129, 256]:      3 *
    [ 257, 512]+     15 **

  and then immediately back down to 47,000 or so:

    [ 0]:          5004 *************
    [ 1]:         42354 ****************
    [ 2]:           123 *******
    [ 3, 4]:         44 ******
    [ 5, 8]:         14 ****
    [ 9, 16]:         5 ***
    [ 17, 32]:        5 ***
    [ 33, 64]:        2 *
    [ 65, 128]:       2 *
    [ 129, 256]:      4 **
    [ 257, 512]+      9 ****

  from then on it gradually settles down to about 27,000.

- Reducing the number of 'a' nodes in the document naturally causes the problem to get less bad. The above histogram was with some 2500 'a' nodes (the rest commented out). If you reduce the number down to about 1000 'a' nodes, the same-shaped histogram shows up, but the browser manages to survive the massive flood of a mere *two* million newly created (and shortly thereafter, destroyed) property-tree nodes in enough time to avoid appearing totally locked up.

- Given those counts, it sounds like something we're doing is causing more or less 1000 new property-tree nodes to be added per 'a' node in the document. I'll try to produce some kind of trace of that process for a single 'a' node next.
Hm. It appears that jquery adds a UUID value (keyed by its own improbable-property-name) to every object it wants to store information about, and it does so in passing while *fetching* information-it-might-have-stored about an object; thus each of our 2500 objects has a UUID tacked onto it while we're unbinding, and the jquery cache has an object manufactured under each such UUID to store the key 'events' it is probing for. See the data: method in the JQuery object.

That explains a certain amount of churn during unbinding, but I'm not sure it explains many-thousands of new properties per node yet. Out of time for today, will pick up tomorrow.

(Also, why did I think 1000 per node? We have 1000 HTML nodes ~ 2m prop nodes, and 2500 HTML nodes ~ 12m prop nodes. So it's more like 2n^2 prop nodes. An algorithmic bug in jquery? I wonder if it's calling 'each' twice, or something silly...)
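For reference, the idiom being described looks roughly like this; it is a hand-written sketch of the jQuery data()/cache pattern as characterized above, not the actual jquery.js source, and the names (cache, expando, data, removeData) are only illustrative:

    // Reading data for an element lazily tags it with a UUID-keyed property
    // and manufactures a per-element cache object to probe.
    var cache = {}, uuid = 0, expando = "jQuery" + (new Date).getTime();

    function data(elem, name) {
        var id = elem[expando];
        if (!id)
            id = elem[expando] = ++uuid;   // fetching *assigns* the UUID property
        if (!cache[id])
            cache[id] = {};                // one cache object per element
        return cache[id][name];            // e.g. probing for 'events' while unbinding
    }

    function removeData(elem, name) {
        var id = elem[expando];
        if (id && cache[id])
            delete cache[id][name];        // the delete that later analysis points at
    }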
This sounds like bug 367425. Please dup and take that one if so. /be
(In reply to comment #19)
> This sounds like bug 367425. Please dup and take that one if so.

I'm not certain, but I believe this one has to do with property-tree forking due to middle-deletions. I can't quite tell what 367425 has to do with; it might be the same. Your property-tree intuitions (or even wild guesses, you have a decade more of accuracy in your hunches about this data structure than I do) will be valuable here.

Fwiw, if I comment out the statement:

    delete jQuery.cache[ id ][ name ];

on line 694 of the jquery.js file in the testcase, everything's peachy. I'm trying to produce a synthetic reduced script that does the same thing, presently.
Attached file minimal synthetic testcase (deleted) —
The attached case demonstrates the misbehavior somewhat more concisely: run in the shell, it allocates 500mb RSS or so and (if the histograms are correct) produces a transient property-tree with about 33 million nodes (and counting 49 million searches). Now the fun part? I'm going to be staring at the property-tree code a long time before I know what to change to make it *not* do this. Help from old hands requested?
(note that if you move the single assignment from the first loop to the first line of the second loop, it executes instantly and uses almost no memory.)
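The attachment itself isn't included above, but the kind of access pattern it exercises, per the analysis in the next comment, can be sketched like this (an illustrative sketch only; the object, property names, and counts are made up):

    // Bounce between fresh inserts and middle deletes on one object. The
    // delete is a "middle" delete because the just-added property is last;
    // on the first iteration it is simply a no-op.
    var o = { anchor: 0 };             // keep at least one older property around
    for (var i = 0; i < 20000; i++) {
        o["p" + i] = i;                // fresh insert
        delete o["p" + (i - 1)];       // middle delete
    }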
Ok. I've spent some time looking at js_AddScopeProperty and js_RemoveScopeProperty and reading the big explanatory comment in jsscope.h, and I think I have a picture of what's going on with our friendly SCOPE_MIDDLE_DELETE bit.

The comment in jsscope.h says this:

 * What if we add Y again? X->Y->Z->Y is wrong and we'll enumerate Y twice.
 * Therefore we must fork in such a case, if not earlier. Because delete is
 * "bursty", we should not fork eagerly. Delaying a fork till we are at risk
 * of adding Y after it was deleted already requires a flag in the JSScope, to
 * wit, SCOPE_MIDDLE_DELETE.

The key phrase here is "if not earlier". The testcase, recall, is not actually inserting duplicates; so it sort of feels like we're forking earlier than necessary under the documented intent.

If we look at js_AddScopeProperty, in the loop that starts around line 1120 of jsscope.cpp, there is this encouraging comment:

    /*
     * If properties were deleted from the middle of the list starting at
     * scope->lastProp, we may need to fork the property tree and squeeze
     * all deleted properties out of scope's ancestor line. Otherwise we
     * risk adding a node with the same id as a "middle" node, violating
     * the rule that properties along an ancestor line have distinct ids.
     */

That sounds like it fulfills the plan of the header file, but then we get to the innermost loop itself:

    /*
     * Now loop forward through spvec, forking the property tree
     * whenever we see a "parent gap" due to deletions from scope.
     * NB: sprop is null on first entry to the loop body.
     */

Sure enough, looking at the code it seems there's no conditioning on the presence of duplicates at all, just gaps. IOW it seems -- if I'm reading correctly -- that we fork quite a bit "earlier" than we need to: we fork at *every* middle-delete we encounter while walking the temp-vector spvec. So any property-access pattern that bounces back and forth between middle-deletes and fresh inserts is going to cause such a blow-up. I'm surprised we're not seeing it more often.

There is a story buried in here I'm sure, and I still don't know how to cough up code to fix it precisely or easily. I imagine the current state of affairs flows from the "stinking ECMA-mandated duplicate formal parameters" described in the other comments in this loop.

Is there hope? It seems it might get tricky..
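For readers following along, the scenario in the quoted jsscope.h comment looks like this at the JS level (an illustrative sketch, not one of the bug's attachments):

    var obj = { x: 1, y: 2, z: 3 };    // ancestor line: x -> y -> z
    delete obj.y;                      // middle delete: y is gone, z is still last
    obj.y = 2;                         // re-adding y is exactly the X->Y->Z->Y case,
                                       // so the engine must fork the ancestor line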
It would be interesting to know how the JavaScript engines of the other browsers achieve what you've just stated, since none of them, including IE 6, have these problems. Too bad this is all over my head.
(In reply to comment #24)
> It would be interesting to know how the JavaScript engines of the other
> browsers achieve what you've just stated, since none of them, including IE 6,
> have these problems. Too bad this is all over my head.

Oh, other engines use entirely different representations of properties. This is just a bug in a very specific algorithm in our very-unique representation; there's no reason to imagine it'd hurt other engines. Indeed, it's not even *supposed* to be something that hurts us (as the initial comment states), it just turns out our code isn't quite doing what the earlier comments wish.

I'm sure it's fixable one way or another. By "is there hope?" I was just wondering if there's hope of an *easy* fix. There's usually a hard fix to be found :)
(In reply to comment #23)
> There is a story buried in here I'm sure, and I still don't know how to cough
> up code to fix it precisely or easily. I imagine the current state of affairs
> flows from the "stinking ECMA-mandated duplicate formal parameters" described
> in the other comments in this loop.

Igor fixed the stinking SpiderMonkey (originally, 1995 Mocha runtime) bug of using function object properties to represent formal parameters and local vars, back in bug 399544. But he left these comments. I'm eliminating them in the patch I hope to attach tonight.

> Is there hope? It seems it might get tricky..

I think so. Could you run with the patch that I'll attach? I'm finishing upvar2 and busy with other stuff (on the road atm).

/be
(In reply to comment #26)
> > Is there hope? It seems it might get tricky..
>
> I think so.

I was responding to both "hope" and "tricky" there ;-). Indeed a few things are tricky, but generally the code gets simpler as a result.

Thanks for reading all the comments of this child of mine from 2002 (bug 62164), and double thanks for pointing out the bug. I must have grown tired of deferring middle-delete clearing, but it's always better to avoid forking the property tree as long as possible. Hashed access means the potentially long ancestor line won't hurt scope performance.

We could have some ancestor-line height limit, if an over-tall line hurt. But I don't see the need. Thoughts?

/be
(In reply to comment #26)
> I think so. Could you run with the patch that I'll attach? I'm finishing upvar2
> and busy with other stuff (on the road atm).

Yeah, just fishing for a verbal sketch of how you think the correct solution would look. I'm happy to carry such a sketch through to completion; I just don't know which of the various strategies I can imagine smells the way you'd prefer.
Attached patch proposed fix (deleted) — Splinter Review
Attachment #369837 - Flags: review?(graydon)
Sorry, can't get to irc and didn't know you were working today -- I did more than a sketch. Still, testing and landing help much appreciated -- and of course this means you are now a knight of the property tree! /be
The comments 'splain why a very tall ancestor line is not an issue. Once again GC of property tree nodes not marked through scopes saves the day. /be
The middle delete code in js_AddScopeProperty dealt with an empty scope (net of the property being overwritten by the Add -- this was the splen == 0 case). But this was possible only because js_RemoveScopeProperty (for no good reason) left SCOPE_MIDDLE_DELETE set after removing the last and only entry from a scope that had middle deletes. The fix to clear the scope flag is in the last hunk in the patch. /be
Priority: -- → P1
Comment on attachment 369837 [details] [diff] [review]
proposed fix

I think I'm not quite happy with this code, since it still permits the pathology, just makes it less likely with *this* testcase: it's still pretty easy to trigger on a busy hashtable. E.g. if you do 1000 middle-delete / nonconflicting-add cycles on a hashtable, then a single conflicting add (say, re-add a key you previously deleted), you'll still get a fork-storm of a million nodes. Not cool.

But I get the approach you have in mind, and it looks doable. I'll take a swing at it presently.

(Also not sure why you think it's safe to unconditionally assert splen == 0...)
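At the JS level, the scenario being described is roughly the following (a hand-rolled sketch; names and counts are arbitrary):

    var o = { anchor: 0 };
    for (var i = 0; i < 1000; i++) {
        o["k" + i] = i;                // non-conflicting fresh add
        delete o["k" + (i - 1)];       // middle delete (a no-op on the first pass)
    }
    o.k0 = 0;                          // the single conflicting add: k0 was deleted earlier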
Attachment #369837 - Flags: review?(graydon) → review-
(In reply to comment #33)
> (From update of attachment 369837 [details] [diff] [review])
> I think I'm not quite happy with this code, since it still permits the
> pathology, just makes it less likely with *this* testcase: it's still
> pretty easy to trigger on a busy hashtable. E.g. if you do 1000
> middle-delete / nonconflicting-add cycles on a hashtable, then a
> single conflicting add (say, re-add a key you previously deleted),
> you'll still get a fork-storm of a million nodes. Not cool.

Yes, we should not go crazy. But in the scenario you describe, we will have initially:

    a, b
    middle-delete:    _, b
    non-conflicting:  _, b, c
    repeat:           _, _, c
    etc.:             _, _, c, d

When we finally get a conflict, we'll build spvec to list only the non-_ nodes (the _ is not in the hashtable, but it is on the property tree ancestor line). If the _ nodes are all adjacent we won't fork more than once, to make:

    final state: c, d

True, we could end up with little islands of one or two _ nodes separated by live nodes mapped by the scope's hash. I'm not sure what you could do there other than the idea mentioned above of not exceeding a threshold. And the GC will help, in due course (node allocation increases memory pressure). What am I missing?

> (Also not sure why you think it's safe to unconditionally assert splen==0...)

Comment 32 tried to give the reason. The code in jsscope.cpp is small-world enough that we can verify anything that would take the scope to zero entryCount should clear the middle-delete flag. I think the patch arranges for that in the place that I missed long ago, when I added the compensating splen == 0 if.

/be
Talked at length on irc, looking forward to patch -- graydon is head knight! /be
Attached patch Proposed, lazier fix. (deleted) — Splinter Review
I think I misunderstood whether your variant was going to cause a storm; it didn't *quite* do so, but was still dragging its ass a bit because it had to probe the hashtable for every element of the ancestor line. It's faster to scan the line looking for an id conflict and skip the entire process -- allocation and probing -- if there are none.

Anyway, this is a bit more minimal, mostly follows your lead (and the existing code), fixes the testcase as near as I can tell, both the minimal and the original, and a few variants I've tried here. I can't provoke bad behavior from it anymore.

(I also did away with using alloca, since when and if you *do* fork under this code, it can be on very large ancestor lines -- large enough to smash the stack segment and crash the program using alloca. One example I tried did so.)
Attachment #370101 - Flags: review?(brendan)
Comment on attachment 370101 [details] [diff] [review]
Proposed, lazier fix.

>@@ -1130,9 +1129,14 @@
>         CHECK_ANCESTOR_LINE(scope, JS_TRUE);
>
>         splen = scope->entryCount;

Seems you could move this down (and the later assertion that this value of splen == 0) into the if (conflicts) {...} then clause.

>+        conflicts = 0;
>+        for (sprop = SCOPE_LAST_PROP(scope); sprop; sprop = sprop->parent) {
>+            if (sprop->id == id)
>+                conflicts++;

Should break here, so conflicts could be bool, to save walking up past the first conflict looking for more, since we need only test whether conflicts != 0.

>+        }
>+        JS_ASSERT(splen != 0);

This assertion moves too.

r=me with these.

/be
Attachment #370101 - Flags: review?(brendan) → review+
Summary: 100 % CPU + high memory consumption with javascript at drupal.org [drupal.org/misc/jquery.js] → Minimize forking of property tree
Whiteboard: fixed-in-tracemonkey
Yeah! Thanks for the fix. Using the TraceMonkey build, I can confirm there is no more memory spike, and CPU utilization is at around 98 % for only about 15 s (933 MHz P3) before I get the unresponsive script alert.
(In reply to comment #39)
> Yeah! Thanks for the fix. Using the TraceMonkey build, I can confirm there is
> no more memory spike, and CPU utilization is at around 98 % for only about 15 s
> (933 MHz P3) before I get the unresponsive script alert.

Hmm. I'd have hoped for a bit better. My desktop completes the test entirely in about 3s -- no slow-script dialogue -- but it's a rather faster machine.

But even that, really, seems a bit much: intuitively, it feels like we ought to be able to do away with the remaining O(N) scan cost on inserts in lineages with middle-deletes.

Now that I think of it in this light, the patch may have made matters *worse* for some sites: you'll *avoid* paying the rebuild cost -- so avoid a fork -- until the last moment when a conflict is inserted, but in doing so you'll *delay* compacting the lineage and clearing the middle-delete bit; this means you'll carry on paying scan costs on each insert *until* that last moment. So one middle-delete early in your insert sequence can cost you a million pointless lineage scans.

Really we want some kind of ... balanced cost heuristic between compaction/forking and scanning? A simple heuristic could be a fixed maximum length for an un-compacted lineage with middle deletes: once beyond the maximum, *any* insert would force the compaction (and bit-clearing), even if not conflicting. Alternatively we could seek a way to make the scan cost on un-compacted lineages really low. Like near-constant. Hmm. Keeping a hash set of all the deleted keys (until compaction) is possible I suppose, or even just a Bloom filter of the deleted keys, to exclude some subset of pointless scans?

I think I'm going to retract the 'fixed-in-tracemonkey' status on this until tomorrow. It's still not quite right. Opinions on which of these strategies is best? The max-size one is probably easiest to write up.
Whiteboard: fixed-in-tracemonkey
Height threshold seems best at this point. /be
This is a bit sneaky, but after enough poking around with fixed thresholds I think it's the only way.

If we try to pick a fixed maximum size, any scope with *live* size larger than the maximum recreates the fork-storm situation that provoked the initial patch, because compaction doesn't actually make it duck down beneath the maximum size, so it keeps trying (pointlessly) to *recompact* on each new addition. Moreover, anything based on counting the uncompacted deletions and considering them as a proportion of the total is also no good, since we want to catch even outlandish cases like "1 old middle-deletion followed by a million inserts".

The heuristic I settled on is "compact if we have enough uncompacted deletions to exceed our distance to the next-highest power of two", which is a slightly (and I think desirably) more general way of hitting the minimal condition "any uncompacted deletion at all, but only check when our size is passing through a power of two". This will eventually compact any straggling deletions (soon enough to amortize out, cost-wise) while not compacting too eagerly. I think. In practice, it seems to work.

Look ok?
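Roughly, the arithmetic of that heuristic can be illustrated as follows; this is an illustrative JS sketch based on the description above, not the engine code, and the real check in js_AddScopeProperty may differ in detail:

    function nextPowerOfTwo(n) {
        var p = 1;
        while (p < n)
            p *= 2;
        return p;
    }

    // Compact (and clear the middle-delete flag) once the uncompacted deletions
    // outnumber the distance from the live entry count to the next power of two;
    // when the live count sits exactly on a power of two, any straggler triggers it.
    function shouldCompact(liveEntries, uncompactedDeletions) {
        return uncompactedDeletions > nextPowerOfTwo(liveEntries) - liveEntries;
    }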
Attachment #370157 - Flags: review?(brendan)
Comment on attachment 370157 [details] [diff] [review]
A followup to enhance the compaction heuristic.

>     JSBool conflicts = JS_FALSE;

Nit: use bool/true/false in new code unless there's an API/ABI reason to use JSBool (you can still use true/false in this case).

>+    uint32 count = 0;
>+    uint32 threshold = (1 << JS_CeilingLog2(scope->entryCount));

Nit: use JS_BIT(...) rather than (1 << ...). Could move threshold down to right before first use; maybe this way is nicer to read tho -- your call.

Good solution, r=me with nits picked.

/be
Attachment #370157 - Flags: review?(brendan) → review+
Whiteboard: fixed-in-tracemonkey
Performance is a bit better. The unresponsive script message comes up in about 12 sec, and things settle within 2-3 sec if I choose to continue script execution. Still no memory spike. That's probably about as much as can be squeezed out of my CPU. Great job!
For comparison purposes:

IE 6: 11 sec
IE 8: 8 sec

So it's definitely slower, but still a huge improvement from what it was before.
This would appear to explain my horrible shutdown hangs when restarting for updates lately, as I always have tinderboxpushlog open and that uses JQuery. I finally debugged it and the stack looks just like this. I look forward to the trunk landing!
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Flags: in-testsuite?
Status: RESOLVED → VERIFIED
verified FIXED on Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b4pre) Gecko/20090421 Shiretoko/3.5b4pre ID:20090421030848
Flags: blocking1.9.2? → blocking1.9.2+
fixed before we branched 1.9.2
Target Milestone: --- → mozilla1.9.2a1