Closed Bug 363529 (jsbcopt) Opened 18 years ago Closed 13 years ago

Optimize JS bytecodes

Categories

(Core :: JavaScript Engine, defect, P2)


Tracking


RESOLVED WONTFIX

People

(Reporter: brendan, Assigned: dmandelin)

References

Details

(Keywords: meta)

Attachments

(12 files, 22 obsolete files)

(deleted), image/png | Details
(deleted), text/plain | Details
(deleted), image/png | Details
(deleted), text/plain | Details
(deleted), image/png | Details
(deleted), text/plain | Details
(deleted), patch | brendan: review+ | Details | Diff | Splinter Review
(deleted), patch | brendan: review+ | Details | Diff | Splinter Review
(deleted), patch | shaver: review+ | Details | Diff | Splinter Review
(deleted), image/png | Details
(deleted), text/plain | Details
(deleted), text/plain | Details
JS bytecode interpretation has not been optimized based on profiles and direct instrumentation since 1997 or so. This bug will track progress improving bytecode efficiencies, matching bytecodes to common cliches, consolidating common sequences and reducing interpreter overhead. /be
Depends on: 312354
Depends on: 363530
This filters edges accounting for < 1% of the grand total bytecode count, to avoid making an over-connected graph. /be
Attached image bytecode graph for perfect.js (obsolete) (deleted) —
Attached image startup, load gmail, load first message, quit (obsolete) (deleted) —
Depends on: 363534
Depends on: 363536
Big-picture analysis and comments welcome here. Also help fixing dependent bugs. /be
Status: NEW → ASSIGNED
Priority: -- → P1
Summary: optimize JS bytecodes → Optimize JS bytecodes
Attached image filtered graph for perfect.js (obsolete) (deleted) —
Cutoff is .5% now for more detail, yet still visualizable. /be
Attachment #248353 - Attachment is obsolete: true
More detailed graph from gmail mini-session: http://wiki.mozilla.org/Image:Filtered_gmail_bytecode_graph.png /be
How about putting the actual percentages themselves in the graph?
Attached file dot file for filtered graph of gmail mini-session (obsolete) (deleted) —
I haven't figured out how to get the weights to show up, even as thinner or thicker edges. Anyone with more GraphViz skilz, please help. /be
Attached image With weights labeled on edges (obsolete) (deleted) —
s/weight=/label=/ does the trick.
Attachment #248363 - Attachment is obsolete: true
Attachment #248403 - Attachment description: dot file for filtered graph of gmail mini-sessionn → dot file for filtered graph of gmail mini-session
s/weight/label/g
Attachment #248403 - Attachment is obsolete: true
Note that the labels are fractions of outgoing bytecode dispatch counts, not incoming. So you'll see them sum > 1 if you sum across the node into which several come. Perhaps it would be better to label the absolute counts, so we could see what's hot and what's not. /be
Oh yes, agreed. I had misunderstood that. (should probably filter by incoming too?)
With labels telling counts of successors to bytecodes dispatched. /be
Attachment #248415 - Attachment is obsolete: true
Brian, how'd you get around bugzilla's 300K limit? I uploaded to the wiki again: http://wiki.mozilla.org/Image:Filtered_gmail_bytecode_graph.png /be
(In reply to comment #14)
> Oh yes, agreed. I had misunderstood that. (should probably filter by incoming
> too?)

No, because the outgoing counts from op1 to all op2 that succeed it will sum in all cases to the total count for op2 provided op1 is not a branch -- i.e., if op1 and op2 are in a basic block, or op1 always flows into the leader op2 of the successor basic block.

The important edges are the high count ones flowing from a non-branch. So it may be worth filtering branches, but I thought I'd leave things maximally visible (and the code simple) for now.

/be
(In reply to comment #17)
> (In reply to comment #14)
> > Oh yes, agreed. I had misunderstood that. (should probably filter by incoming
> > too?)
>
> No, because the outgoing counts from op1 to all op2 that succeed it will sum in
> all cases to the total count for op2 provided op1 is not a branch -- i.e., if
> op1 and op2 are in a basic block, or op1 always flows into the leader op2 of
> the successor basic block.

I ignored the filtering effect -- we leave out edges with count < .5% of the grand total of all bytecodes dispatched.

Also, above in "the outgoing counts from op1 to all op2", I should have written more clearly. What I mean is: for any op2, the sum of the labels on all edges from any op1 into that op2 should be the total count of op2 executions, ignoring filtering.

/be
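Since the .5% cutoff and the meaning of the edge labels come up repeatedly above, here is a minimal, self-contained sketch of how such a filtered .dot file could be emitted from a pair-count matrix. This is not the actual metering patch: NOPS, the opname table, and the function name are made up for illustration.

    #include <stdio.h>
    #include <stdint.h>

    #define NOPS 4  /* tiny stand-in for the real opcode count */

    static const char *opname[NOPS] = { "getvar", "uint16", "lt", "ifeq" };

    /* counts[i][j] = number of times opcode j was dispatched right after i. */
    static void
    DumpFilteredDot(FILE *fp, const uint32_t counts[NOPS][NOPS])
    {
        uint64_t grand = 0, out[NOPS] = { 0 };
        int i, j;

        for (i = 0; i < NOPS; i++) {
            for (j = 0; j < NOPS; j++) {
                out[i] += counts[i][j];
                grand += counts[i][j];
            }
        }

        fprintf(fp, "digraph ops {\n");
        for (i = 0; i < NOPS; i++) {
            for (j = 0; j < NOPS; j++) {
                uint32_t c = counts[i][j];

                /* Drop edges under .5% of the grand total, as discussed above. */
                if (c == 0 || (uint64_t)c * 200 < grand)
                    continue;

                /* The label is this edge's share of i's outgoing dispatches,
                   so labels on the edges into a given op can sum to more
                   than 1, as noted in comment 13. */
                fprintf(fp, "  %s -> %s [label=\"%.2f\"];\n",
                        opname[i], opname[j], (double)c / (double)out[i]);
            }
        }
        fprintf(fp, "}\n");
    }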
Dotted, dashed, solid, bold, by segmenting the edges into quarters after sorting (an experiment in readability). A Perl hack that does it is attached next. I'm not having trouble with the 300 KB limit because, for whatever reason, my PNGs are coming out -way- smaller than yours. This one is 27 KB-ish?
In case anyone else is interested.
Attachment #248427 - Attachment is obsolete: true
Attachment #249120 - Attachment is obsolete: true
My brain hurts. I don't see how your patch can work (edges not used, nonzeroedges not used where it seems it might need to be). What's more, I can't see how to avoid sorting along each row to get busy columns to the front, then sorting along rows, and finally making a sorted row map. If this is not minimal, please show me the light. /be
Attachment #249131 - Attachment is obsolete: true
Attachment #249133 - Flags: review?(crowder)
Tidy a bit. /be
Attachment #249133 - Attachment is obsolete: true
Attachment #249135 - Flags: review?(crowder)
Attachment #249133 - Flags: review?(crowder)
Attached image perfect.png (obsolete) (deleted) —
The GraphViz-generated PNG for my gmail mini-session is too big, as usual, so I've updated http://wiki.mozilla.org/images/e/e5/Filtered_gmail_bytecode_graph.png. Anyone know how to get OS X GraphViz to make smaller PNGs? /be
Attachment #248361 - Attachment is obsolete: true
Attached file perfect.dot (obsolete) (deleted) —
Attachment #248357 - Attachment is obsolete: true
Attachment #248358 - Attachment is obsolete: true
Attachment #248408 - Attachment is obsolete: true
Attachment #248414 - Attachment is obsolete: true
Attached file gmail.dot (obsolete) (deleted) —
Attachment #248421 - Attachment is obsolete: true
Attachment #249137 - Attachment description: filtered graph for perfect.js → perfect.png
Comment on attachment 249135 [details] [diff] [review]
C only approach, seems to require sort x sort | sort

Yeah, that's better. Might be cool to tidy this up a bit (maybe guard with preprocessor flag (DEBUG?)?) and make it generatable/clearable from some shell native func?
Attachment #249135 - Flags: review?(crowder) → review+
(In reply to comment #23)
> Created an attachment (id=249133) [edit]
> C only approach, seems to require sort x sort | sort
>
> My brain hurts. I don't see how your patch can work (edges not used,
> nonzeroedges not used where it seems it might need to be). What's more, I
> can't see how to avoid sorting along each row to get busy columns to the front,
> then sorting along rows, and finally making a sorted row map.
>
> If this is not minimal, please show me the light.
>
> /be

Yeah, I stopped using the edges and went with count/total to segment by quarters (not mathematically the same), instead of using the actual sort position. Your newer patch is better.
This simpler patch avoids keeping the extra bookkeeping that I thought might be handy (total stored per predecessor opcode, length of the sorted non-zero-count successor list for each predecessor), and loses sort stability, so the data looks slightly different. That's because related edges have the same count, but the quartiling has to cut them into separate buckets differently depending on sort instability. Too bad; a bug? Comments welcome.

I added #ifdef JS_OPMETER bracketing and otherwise cleaned up.

/be
Attachment #249153 - Flags: review?(crowder)
To shorten my last comment into a question: perhaps all edges with the same count should go in the same quartile? /be
Comment on attachment 249153 [details] [diff] [review]
avoid sort x sort, losing stability w.r.t. previous data

(most of the following are style questions for my own edification, not critiques of the patch)

Would be nice if there were a #defined value for the max opcodes for building the static array "succeeds". Any chance you'll get bit by JSOP_GETPROP2 and JSOP_GETELEM2? I didn't look very closely. Is using the magic number 256 in your loops more appropriate than js_NumCodeSpecs?

> +        if (count != 0 && SIGNIFICANT(count, total)) {

isn't (count != 0 && SIGNIFICANT(...)) redundant? Should nedges or total ever be 0 after the first loop? Worth asserting about (or closing fp and bailing)?

Looks good.
Attachment #249153 - Flags: review?(crowder) → review+
(In reply to comment #31)
> To shorten my last comment into a question: perhaps all edges with the same
> count should go in the same quartile?
>
> /be

That's one of the things I considered when I just went for count/total in my original C version. It ends up being a comparison by value instead of by position in a not-stably sorted array. Not real quartiles, then, though.
(In reply to comment #32)
> (From update of attachment 249153 [details] [diff] [review] [edit])
> (most of the following are style questions for my own edification, not
> critiques of the patch)
>
> Would be nice if there were a #defined value for the max opcodes for building
> the static array "succeeds".

There is: JSOP_LIMIT -- thanks for reminding me. But see the next patch for why it's not used as the "columns" dimension.

> Any chance you'll get bit by JSOP_GETPROP2 and JSOP_GETELEM2?

No, the decompiler just uses these internally; they're never stored.

> Is using the magic number 256 in
> your loops more appropriate than js_NumCodeSpecs?

No (no point in testing zeroed cells), but JSOP_LIMIT is faster and better than js_NumCodeSpecs.

> > +        if (count != 0 && SIGNIFICANT(count, total)) {
>
> isn't (count != 0 && SIGNIFICANT(...)) redundant? Should nedges or total ever
> be 0 after the first loop? Worth asserting about (or closing fp and bailing)?

No, nedges is an overestimate, because we haven't computed total before the first loop and we don't want to use it during the loop, before it is in fact the grand total.

/be
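For readers following along, here is a bare-bones sketch of the pair-metering idea under discussion. The real code is in the attached patches; the matrix dimension, the function name, and the macro-free shape here are assumptions for illustration only.

    #include <stdint.h>

    #define NOPS 256  /* stand-in for JSOP_LIMIT; the real patch sizes the rows differently */

    /* succeeds[prev][cur] counts how often opcode cur was dispatched right after prev. */
    static uint32_t succeeds[NOPS][NOPS];

    /* In this sketch, called at the top of the interpreter's dispatch loop. */
    static void
    MeterOpPair(unsigned *prevop, unsigned op)
    {
        succeeds[*prevop][op]++;
        *prevop = op;
    }

The SIGNIFICANT-style filtering discussed above would apply at dump time, not on the hot dispatch path.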
Attached patch improved thanks to review comments (obsolete) (deleted) — Splinter Review
Quartiles are distorted so that edges with the same cost as any edge that falls in the upper quartile are included in that quartile, in order to avoid splitting edges with the same label into two different styles. Checking in shortly. /be
Attachment #249135 - Attachment is obsolete: true
Attachment #249153 - Attachment is obsolete: true
Attachment #249176 - Flags: review+
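To illustrate the quartile-with-ties rule just described (comment 31 and the note above), here is a rough sketch under assumptions: the struct, the names, and the boundary arithmetic are mine, not the landed patch's.

    #include <stdint.h>
    #include <stdlib.h>

    struct Edge { int from, to; uint32_t count; };

    /* qsort comparator: descending by count (hottest edges first). */
    static int
    CompareEdgeCounts(const void *a, const void *b)
    {
        uint32_t ca = ((const struct Edge *)a)->count;
        uint32_t cb = ((const struct Edge *)b)->count;
        return (ca < cb) - (ca > cb);
    }

    /* Style for edges[i], given edges[] already sorted with CompareEdgeCounts.
       An edge whose count ties the last edge of a hotter quartile is promoted
       into that quartile, so equal-count edges never get different styles. */
    static const char *
    EdgeStyle(const struct Edge *edges, size_t n, size_t i)
    {
        static const char *styles[4] = { "bold", "solid", "dashed", "dotted" };
        size_t q = i * 4 / n;

        while (q > 0 && edges[i].count == edges[(q * n + 3) / 4 - 1].count)
            q--;
        return styles[q];
    }

Usage would be qsort(edges, n, sizeof *edges, CompareEdgeCounts) followed by a dump loop that passes EdgeStyle's result as the GraphViz style attribute for each edge.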
Attached file Y.dot (deleted) —
Attached image Y.png (deleted) —
Attached file perfect.dot (deleted) —
Attachment #249139 - Attachment is obsolete: true
Attached image perfect.png (deleted) —
Attachment #249137 - Attachment is obsolete: true
Attached file gmail.dot (deleted) —
Attachment #249140 - Attachment is obsolete: true
Y.dot and Y.png come from a run of Y.js, the applicative order Y combinator demo. Y.js is of course included with SpiderMonkey, but short:

// The Y combinator, applied to the factorial function
function factorial(proc) {
    return function (n) {
        return (n <= 1) ? 1 : n * proc(n-1);
    }
}

function Y(outer) {
    function inner(proc) {
        function apply(arg) {
            return proc(proc)(arg);
        }
        return outer(apply);
    }
    return inner(inner);
}

print("5! is " + Y(factorial)(5));

The low call counts help to understand the algorithm a bit, if you know JS bytecode already. More JSOP_NAME -> JSOP_PUSHOBJ evidence for bug 363530, and some JSOP_RETURN -> JSOP_PUSHOBJ higher-order functional goodness.

/be
We should reorganize jsopcode.tbl with a global review after the dependent bugs are fixed. We can try to reclaim bytecodes as we go, e.g. JSOP_PUSHOBJ and JSOP_GETMETHOD, which are to be replaced by a patch for bug 363530, adding as few as possible but leaving related bytecodes widely separated. The global cleanup afterward should try to put related bytecodes together and look for other opportunities to speed up decoding.

It would be interesting, given the PUSHOBJ -> CALL edges I'm seeing, to histogram the number of actual arguments passed when running Firefox from startup through a gmail mini-session, e.g.

/be
Oh, forgot to remind myself (and anyone else who needs it) to change JSXDR_BYTECODE_VERSION when making backward-incompatible changes to jsopcode.tbl. Adding a new bytecode at the end should not require this, but it would be nice to change it for such cases too.

I hope I have my backward and forward directions correct:

backward-compatible: new code runs on old profile with saved XUL.mfasl
forward-compatible: old code runs on new profile with saved XUL.mfasl

We don't worry about forward compatibility across major releases, or at least we haven't, since profile data formats may change incompatibly. This is not a good policy, and perhaps now is the time to change it (or Mozilla 2, if 1.9 is already too far gone).

/be
Because fastload is a generated cache, we really don't need to worry about forward or backwards incompatibility: we blow away the fastload cache whenever the buildID changes. It's not like a data-storage format.
(In reply to comment #44)
> Because fastload is a generated cache, we really don't need to worry about
> forward or backwards incompatibility: we blow away the fastload cache whenever
> the buildID changes. It's not like a data-storage format.

Cool, I wondered about that. Lots of JS embeddings use XDR, though, so we need to keep JSXDR_BYTECODE_VERSION moving forward when we "ship". Not sure what "ship" means -- not nightly Firefox builds of course.

/be
Attached patch final patch for checkin (obsolete) (deleted) — Splinter Review
Important change: the always-true #ifdef METER_OP_PAIR guard around op = JSOP_STOP in js_Interpret becomes #ifdef JS_OPMETER. /be
Attachment #249176 - Attachment is obsolete: true
Attachment #249219 - Flags: review+
Attached patch really final patch for checkin (deleted) — Splinter Review
Good thing I tested without JS_OPMETER defined :-/. /be
Attachment #249219 - Attachment is obsolete: true
Attachment #249221 - Flags: review+
Checking in jsapi.c;
/cvsroot/mozilla/js/src/jsapi.c,v  <--  jsapi.c
new revision: 3.296; previous revision: 3.295
done
Checking in jsinterp.c;
/cvsroot/mozilla/js/src/jsinterp.c,v  <--  jsinterp.c
new revision: 3.310; previous revision: 3.309
done

/be
It would be interesting to know what percentage of getvar, setvar, and getarg access the first two or four local vars / arguments. Specialized opcodes like getarg1 or setvar0 would save two byte reads to fetch the index; if they're used often enough, it could be worth the effort. For comparison, CIL (a.k.a. MSIL) has special opcodes to load the first four arguments and to load or store the first four local variables.
(In reply to comment #49)
> It would be interesting to know what percentage of getvar, setvar and getarg
> access the first two or four local vars / arguments. Specialized opcodes like
> getarg1 or setvar0 would save two byte reads to fetch the index, if they're
> used often enough it could be worth the effort.
> For comparison, CIL (a.k.a. MSIL) has special opcodes to load the first four
> arguments and to load or store the first four local variables.

Lots of VMs do this, it's a good idea. Here are some stats gathered using the next patch, from another startup/load-gmail/click-on-a-message:

bytecode   slot 0   slot 1   slot 2   slot 3   slot 4   slot 5   slot 6   slot 7
========  =======  =======  =======  =======  =======  =======  =======  =======
getarg      49015    17388     4797     3574      917      425       23       20
setarg       1164      112        4       42        0        0        0        0
getvar      43795    22463    14960    11942     6049     4646     1007     1621
setvar      11808     8506     5120     4106     2588     1434      557      481
incvar         30      211      396       10      857      161        0        4
decvar          0        0        6        0        2        0        0        0
varinc       4364     2675      640      693       31      641       56      115
argdec          0        0        0        1        0        0        0        0
vardec        158        0        0        0        0        0        0        0
getgvar        14        0        4        6        2        0        7        0
setgvar         2        0        1        1        1        1        1        1

Interesting that we have some busy functions with many args/vars, but the low four of each dominate most rows. I'll file a dependent bug on this idea in a minute.

/be
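To make the specialized-opcode idea from comment 49 concrete, a toy dispatch loop is sketched below. The opcode names, the two-byte operand encoding, and the stack handling are illustrative assumptions, not SpiderMonkey's actual bytecode format.

    #include <stdint.h>

    enum { OP_GETARG, OP_GETARG0, OP_GETARG1, OP_STOP };

    /* Toy interpreter: push argument values, return the top of stack at OP_STOP. */
    static int32_t
    Run(const uint8_t *pc, const int32_t *args)
    {
        int32_t stack[16], *sp = stack;

        for (;;) {
            switch (*pc++) {
              case OP_GETARG: {
                /* Generic form: fetch a two-byte slot index from the bytecode. */
                unsigned slot = (unsigned)((pc[0] << 8) | pc[1]);
                pc += 2;
                *sp++ = args[slot];
                break;
              }
              case OP_GETARG0:      /* slot baked into the opcode: no operand fetch */
                *sp++ = args[0];
                break;
              case OP_GETARG1:
                *sp++ = args[1];
                break;
              case OP_STOP:
                return sp[-1];
            }
        }
    }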
Attached patch meter slot ops for slots 0-7 (obsolete) (deleted) — Splinter Review
Attachment #249411 - Flags: review?(crowder)
Alias: jsbcopt
Depends on: 364683
Comment on attachment 249411 [details] [diff] [review]
meter slot ops for slots 0-7

+        for (j = 0; ; j++) {
+            if (j == HIST_NSLOTS)
+                break;

Why not just

+        for (j = 0; j < HIST_NSLOTS; j++) {

+            for (j = 0; j < HIST_NSLOTS; j++)
+                fprintf(fp, " %7lu", (unsigned long)slot_ops[i][j]);

Is it your intention to reuse j here, even though it is the loop counter for the enclosing loop? This section looks odd in general to me...
Attachment #249411 - Flags: review?(crowder) → review+
(In reply to comment #52)
> (From update of attachment 249411 [details] [diff] [review] [edit])
> +        for (j = 0; ; j++) {
> +            if (j == HIST_NSLOTS)
> +                break;
>
> Why not just
>
> +        for (j = 0; j < HIST_NSLOTS; j++) {

It seemed clearer to make the loop test parallel to the other case that breaks, the one that reuses j:

> +            for (j = 0; j < HIST_NSLOTS; j++)
> +                fprintf(fp, " %7lu", (unsigned long)slot_ops[i][j]);
>
> Is it your intention to reuse j here, even though it is the loop counter for
> the enclosing loop? This section looks odd in general to me...

This reuse of j is within a then clause ending in a break. It saves writing the first (outer j loop) as a hunt for non-zero counts, breaking early when it finds such a count; then testing whether j reached HIST_NSLOTS and continuing the outer i loop if so (all counters for this i-op were zero); and only then writing the final (innermost, in the patch) j loop.

I'll make the outer j loop canonical, since the parallel if-then-break scheme obviously didn't help 'splain any of this ;-).

/be
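For clarity, here is a sketch of the "canonical" shape agreed on above: skip rows whose counters are all zero, then print each row. The table layout, the names, and the NSLOTOPS constant are placeholders, not the checked-in code.

    #include <stdio.h>
    #include <stdint.h>

    #define NSLOTOPS    11
    #define HIST_NSLOTS 8

    static uint32_t slot_ops[NSLOTOPS][HIST_NSLOTS];
    static const char *slot_op_name[NSLOTOPS] = {
        "getarg", "setarg", "getvar", "setvar", "incvar", "decvar",
        "varinc", "argdec", "vardec", "getgvar", "setgvar"
    };

    static void
    DumpSlotOps(FILE *fp)
    {
        for (unsigned i = 0; i < NSLOTOPS; i++) {
            unsigned j;

            /* Hunt for a non-zero counter; reuse j in the print loop below. */
            for (j = 0; j < HIST_NSLOTS; j++) {
                if (slot_ops[i][j] != 0)
                    break;
            }
            if (j == HIST_NSLOTS)
                continue;                 /* every counter for this op was zero */

            fprintf(fp, "%-8s", slot_op_name[i]);
            for (j = 0; j < HIST_NSLOTS; j++)
                fprintf(fp, " %7lu", (unsigned long)slot_ops[i][j]);
            fputc('\n', fp);
        }
    }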
Landed:

Checking in jsapi.c;
/cvsroot/mozilla/js/src/jsapi.c,v  <--  jsapi.c
new revision: 3.299; previous revision: 3.298
done
Checking in jsinterp.c;
/cvsroot/mozilla/js/src/jsinterp.c,v  <--  jsinterp.c
new revision: 3.314; previous revision: 3.313
done

/be
Attachment #249411 - Attachment is obsolete: true
Attachment #249455 - Flags: review+
Depends on: js-propcache
Depends on: 371802
Attached patch un-bitrot JS_OPMETER (deleted) — Splinter Review
This is just to get back to where we were. More JS_OPMETER ifdef'ed work to come, but I'd like to land this ASAP. /be
Attachment #305953 - Flags: review?(shaver)
The .5% cutoff sure leaves a lot of ops out, and focuses the mind on some very common ops that ought to be close together. That cutoff is good for keeping the graph small, but for the further instrumentation I don't want to leave out any ops. More tomorrow. /be
Attachment #305955 - Attachment is patch: false
Attachment #305955 - Attachment mime type: text/plain → image/png
Comment on attachment 305953 [details] [diff] [review]
un-bitrot JS_OPMETER

r=shaver
Attachment #305953 - Flags: review?(shaver) → review+
Going to check in some #ifdef'ed off by default in debug and release builds metering, no one panic. /be
Keywords: meta
Attachment #305953 - Flags: approval1.9b4?
Attachment #305953 - Flags: approval1.9+
Comment on attachment 305953 [details] [diff] [review]
un-bitrot JS_OPMETER

a1.9b4=beltzner
Attachment #305953 - Flags: approval1.9b4? → approval1.9b4+
JS_OPMETER unrotted:

js/src/jsinterp.c 3.449
js/src/jsopcode.c 3.296
js/src/jsopcode.h 3.60

/be
Beltzner: meta, this bug is a keeper. Any reason it got on your radar? /be
Target Milestone: mozilla1.9alpha1 → ---
Probably because it has an approval1.9b4+ patch, but isn't resolved.
Comment on attachment 305953 [details] [diff] [review]
un-bitrot JS_OPMETER

Clearing approval flags, since this has landed but the bug remains open.
Attachment #305953 - Flags: approval1.9b4+
Attachment #305953 - Flags: approval1.9+
Attached file Measurements from Sunspider (deleted) —
I did some new measurements with SunSpider with a different output format. My format shows the pairs ranked by frequency. The first number is the fraction of all pairs represented by this line. The second number is the running total through this line.

There are only 938 pairs that appear. Given that amount, could we consider just implementing all of them as superops?

Of course, there is also the issue of triplets and longer sequences. If we want to do triplets, and also if we don't want all 938 pairs, I think the best approach is to record the entire opcode stream and use standard AI search algorithms to find a superop set that minimizes some metric. Ultimately we'd like it to be the total time taken, but we probably want to start the search with something faster to compute, like

  (q1 * # of instructions executed) - (q2 * # of superops encoded)

where q1 and q2 are magic constants. We could even base this on a performance model derived from some initial tests.

I guess there must be prior work in this area. I found one paper [1] in a couple minutes. Do we have docs or code for the TT superinstruction selection procedure?

[1] http://www.cs.nuim.ie/~jpower/Research/Papers/2004/esa04.pdf

Also, what other test suites should I be using?
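As a concrete starting point for the metric above, here is a tiny sketch; the names and the per-pair savings model are my assumptions, and a real search would evaluate this inside a greedy or beam search over candidate sets.

    #include <stddef.h>
    #include <stdint.h>

    struct Candidate {
        const char *pair;          /* e.g. "getvar int8" */
        uint64_t dynamicCount;     /* dynamic occurrences in the recorded stream */
    };

    /* Evaluate the suggested metric for one candidate superop set:
       q1 * (# of instructions executed) - q2 * (# of superops encoded).
       Fusing a pair turns two dispatches into one per dynamic occurrence, so
       the instruction count shrinks by each chosen pair's dynamic count.
       (This ignores overlapping pairs, which a real search would handle.) */
    static double
    SuperopMetric(uint64_t baselineDispatches,
                  const struct Candidate *chosen, size_t n,
                  double q1, double q2)
    {
        uint64_t executed = baselineDispatches;
        size_t i;

        for (i = 0; i < n; i++)
            executed -= chosen[i].dynamicCount;

        return q1 * (double)executed - q2 * (double)n;
    }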
The graph attached in comment 56 gives some idea of common sequences of various short lengths. You can see the benchmark-ish getvar;uint16;lt;ifeq group for the (i < CONSTANT) condition of a hot loop or loops, probably a for loop -- which should jump to a test at the bottom and so use ifne instead of ifeq (bug 371802; I should split this out into a new bug and close that one) -- also the lt and ifeq should be fused into iflt (bug 363534). The ifeq could be due to a hot if (i < CONSTANT) ... statement, too.

It would be interesting to get pure SunSpider results, but I think we also want some kind of aggregate for real-world browsing -- hence my running in-browser SS + gmail + opening bugzilla links from gmail.

We almost certainly don't want hundreds of super-ops. We might do away with some "sub-ops" that never occur alone. The prime example is pop after set (bug 312354).

Graydon or Steven knows about TT superword selection.

/be
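A toy sketch of the lt+ifeq fusion mentioned above (bug 363534): the opcode name, the stack discipline, and the jump encoding here are invented for illustration, not SpiderMonkey's.

    #include <stdint.h>

    /* Toy jump encoding: a signed 16-bit offset, relative to the operand bytes. */
    #define GET_JUMP_OFFSET(pc) ((int16_t)(((pc)[0] << 8) | (pc)[1]))
    #define JUMP_OFFSET_LEN 2

    /* One fused dispatch does the compare and the conditional branch that the
       separate lt and ifeq ops would otherwise split across two dispatches. */
    static const uint8_t *
    DoIfLt(const uint8_t *pc, double **spp)
    {
        double *sp = *spp;
        double rval = *--sp;
        double lval = *--sp;

        *spp = sp;
        return (lval < rval) ? pc + GET_JUMP_OFFSET(pc) : pc + JUMP_OFFSET_LEN;
    }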
Adding Vladimir Rainish (recent addition to Tamarin team at Adobe), who is starting an investigation into improving TT superword selection via profile-guided feedback.
There is a good section on superoperator selection in this paper: http://citeseer.ist.psu.edu/hoogerbrugge99code.html. Also see the many works by M. Anton Ertl on static and dynamic superinstructions for gforth and vmgen.
Depends on: 441416
Depends on: 441479
Here is a list of the top JSOP sequences for saving dispatches on SunSpider. The attachment contains an explanation of the numbers.

I should also note that in SunSpider there are some very common, very long operation sequences. For example, replacing each of the 418240 dynamic occurrences of "getvar int8 rsh dup2 getelem one getvar int8 bitand lsh bitnot bitand setelem pop getvar getvar add setvar pop goto" with a single op would save 5% of dispatches. I don't imagine we'd want to do such a thing, but this information may bear on some decision, like inline threading. And it bodes well for tracing.
Dave, maybe you should take this tracker bug. Feel free, if you agree. /be
Priority: P1 → P2
Only two bugs left, unless someone gets ambitious. I'm gifting this to dmandelin as suggested in comment 70. Nick, see Dave's attachment 327040 [details] and prior. This relates to the peephole optimizer idea, maybe. /be
Assignee: brendan → dmandelin
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX