Closed Bug 121414 Opened 23 years ago Closed 16 years ago

JS indirect threaded interpreter

Categories

(Core :: JavaScript Engine, defect)

Not set
normal

Tracking


RESOLVED FIXED

People

(Reporter: boullet.marc, Assigned: brendan)

References

Details

(Keywords: perf, testcase)

Attachments

(13 files, 7 obsolete files)

(deleted), text/html
Details
(deleted), text/html
Details
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
rginda
: review+
jband_mozilla
: superreview+
Details | Diff | Splinter Review
(deleted), text/plain
Details
(deleted), application/x-javascript
Details
(deleted), patch
brendan
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
mrbkap
: review+
Details | Diff | Splinter Review
(deleted), patch
igor
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
I used a relatively slow PC (300 MHz) under NT4. The JS tests (attached) don't involve any DOM stuff. I just re-coded in JS some C programs I wrote a long time ago to find the best way to get the power of two greater than or equal to a given number. The tests involve only a for loop, some assignments, a while loop, and some logical and shift operations. The comparison with IE5.5 is shown in the following table:

+---------------------+-------------------+-------------------+--------+
| Test                | Mozilla (20020121)| IE 5.5            | Ratio  |
|---------------------|-------------------|-------------------|--------|
| Test 1              | 4516 ms           | 2814 ms           | 1.60   |
|---------------------|-------------------|-------------------|--------|
| Test 2              | 4086 ms           | 2594 ms           | 1.57   |
|---------------------|-------------------|-------------------|--------|
| Test 3              | 4766 ms           | 3876 ms           | 1.22   |
+---------------------+-------------------+-------------------+--------+

The ratio of 1.6 is what triggered this bug. I decided to separate this bug from bug #117611 since that one is already huge and now focused on sort and concat. Note that IE5.5 doesn't seem to perform as well at shifting (test 3), though it is still faster than Mozilla. Hope it can help.
Attached file page with the tests (deleted) —
Keywords: perf
Confirming.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Here are my timings on WinNT 4.0 (SP6), 500 MHz CPU, 128M RAM. This was Mozilla vs. IE6. The ratios are the same as Marc's above:

+---------------------+-------------------+-------------------+--------+
| Test                | Mozilla (20020116)| IE 6              | Ratio  |
|---------------------|-------------------|-------------------|--------|
| Test 1              | 2640 ms           | 1625 ms           | 1.62   |
|---------------------|-------------------|-------------------|--------|
| Test 2              | 2390 ms           | 1484 ms           | 1.61   |
|---------------------|-------------------|-------------------|--------|
| Test 3              | 2781 ms           | 2265 ms           | 1.23   |
+---------------------+-------------------+-------------------+--------+
Here are Marc's tests:

var max = Math.pow(2,17);

function test1()
{
    var i, powerOfTwo, j;
    var start, end;

    document.forms[0].testResult1.value = '';
    start = new Date();
    for (i = 1; i <= max; i++) {
        powerOfTwo = i;
        j = powerOfTwo;
        while (j &= j ^ -j)
            powerOfTwo = j << 1;
    }
    end = new Date();
    document.forms[0].testResult1.value = end - start + ' ms';
}

function test2()
{
    var i, powerOfTwo, j, k;
    var start, end;

    document.forms[0].testResult2.value = '';
    start = new Date();
    for (i = 1; i <= max; i++) {
        powerOfTwo = i;
        j = powerOfTwo;
        do
            k = j;
        while (j &= j ^ -j);
        if (k != powerOfTwo)
            powerOfTwo = k << 1;
    }
    end = new Date();
    document.forms[0].testResult2.value = end - start + ' ms';
}

function test3()
{
    var i, powerOfTwo, k;
    var start, end;

    document.forms[0].testResult3.value = '';
    start = new Date();
    for (i = 1; i <= max; i++) {
        powerOfTwo = i;
        k = 0;
        while (powerOfTwo >>= 1)
            k++;
        powerOfTwo = 1 << k;
        if (powerOfTwo != i)
            powerOfTwo <<= 1;
    }
    end = new Date();
    document.forms[0].testResult3.value = end - start + ' ms';
}
Assignee: rogerl → khanson
Blocks: js-perf
Hi, first of all, this bug should probably be OS => All and perhaps also Platform => All. And here are my test results:

Machine: Athlon/500, 416 Megs of RAM, Windows XP Pro
Mozilla: 2002-01-22-09 Win32, SVG/MathML-enabled
IE:      6.0 XP-bundled release

+---------------------+-------------------+-------------------+--------+
| Test                | Mozilla           | IE 6.0            | Ratio  |
|---------------------|-------------------|-------------------|--------|
| Test 1              | 2703 ms           | 2153 ms           | 1.25   |
|---------------------|-------------------|-------------------|--------|
| Test 2              | 2414 ms           | 1933 ms           | 1.25   |
|---------------------|-------------------|-------------------|--------|
| Test 3              | 2884 ms           | 2954 ms           | 0.98   | <== !
+---------------------+-------------------+-------------------+--------+
As suggested, changing OS: WinNT ---> All, Platform: WinNT --> All Also, tentatively changing summary from "Yet another benchmark comparing JS engines speeds" to "Bitwise operators should be faster"
OS: Windows NT → All
Hardware: PC → All
Summary: Yet another benchmark comparing JS engines speeds → Bitwise operators should be faster
The claim is that these tests don't use the DOM, and that's true, almost. The loop termination variable "max" is a variable with global scope, meaning that the DOM has to search through its nodes each time the variable is accessed, to see if some HTML element named 'max' has changed. (Or so I assume, anyway, not having looked too deeply at how the DOM is connected to JS.) A better test would be to pass 'max' in as the argument to each of the functions: that way, the loop termination condition is outside of that global object. (Any reference to a global object is slow, when JS is connected to a browser...)
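The suggested fix can be sketched as follows: a version of the power-of-two kernel where `max` is an argument, so the loop bound lives in a fast argument slot rather than on the global object. The function names here are hypothetical, not from the attached testcase, and this assumes a DOM-free environment such as the JS shell:

```javascript
// Hypothetical rework of Marc's test1: `max` is a parameter, so the
// engine can use its fast arg/local accesses instead of a global name
// lookup on every loop iteration.
function nextPowerOfTwo(n) {
  var p = n, j = n;
  // j &= j ^ -j clears the lowest set bit each pass; the last nonzero j
  // is the highest set bit of n, so p ends up as the power of two >= n.
  while (j &= j ^ -j)
    p = j << 1;
  return p;
}

function test1(max) {
  var start = new Date();
  for (var i = 1; i <= max; i++)
    nextPowerOfTwo(i);
  return new Date() - start;  // elapsed milliseconds
}
```

For example, `nextPowerOfTwo(5)` yields 8 and `nextPowerOfTwo(8)` stays 8, matching the original algorithm.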
I modified Marc's test to run in the standalone JS shell, where there is no DOM. The global object is that of the JS shell itself. The timings I got are roughly comparable to those in Comment #3: Test1: 2750 ms Test2: 2485 ms Test3: 2750 ms
Concerning timings in Comment #5: Mozilla numbers (with a similarly clocked machine to Phil's) are roughly comparable to those in comment #3. It seems that IE6 under XP is much slower than under NT4, but that's MS's problem, not Mozilla's. Also making the changes suggested by Phil in comment #6.
No longer blocks: js-perf
Though I don't see what the DOM has to do with global variables, I re-ran the modified tests passing |max| as a parameter to the test functions. Here are the results (same machine as in comment #2):

+---------------------+-------------------+-------------------+--------+
|Test w/ max as a par.| Mozilla (20020121)| IE 5.5            | Ratio  |
|---------------------|-------------------|-------------------|--------|
| Test 1              | 4276 ms           | 2754 ms           | 1.55   |
|---------------------|-------------------|-------------------|--------|
| Test 2              | 3836 ms           | 2484 ms           | 1.54   |
|---------------------|-------------------|-------------------|--------|
| Test 3              | 4506 ms           | 3705 ms           | 1.22   |
+---------------------+-------------------+-------------------+--------+

The first two tests show a 5% decrease in the ratio. This means that Mozilla doesn't perform so well with global variables (although it could also mean that Mozilla performs better than IE on function parameters, which is unlikely).
Another thought occurs to me: Can we run a loop with precisely the same number of iterations and loop logic, but with no body? Then subtract that time from the other times to determine whether this slowdown is because bitwise ops are slow or loop ops are slow. Looking at the tests, this may be somewhat difficult. But that sort of just says that the tests ought to be simplified to directly test what they mean to test. (I.e.: we ought to have a "left-shift" test, a "xor" test, an "and" test, and so on---not the "interesting" algorithms we have now that combine all the ops together.) Just a thought... -scole
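A minimal sketch of such a decomposed benchmark, with the empty-loop baseline subtracted out (a hypothetical harness, not part of the attached testcase; the per-call closure overhead here is its own confounder, so a serious version would inline the bodies, but the subtraction idea is the same):

```javascript
// Time one operator in isolation and subtract the empty-loop baseline,
// so loop overhead is not charged to the operator being measured.
function timeLoop(body, iterations) {
  var x = 1, start = new Date();
  for (var i = 0; i < iterations; i++)
    x = body(x, i);
  return new Date() - start;
}

// Returns per-operator times (ms) with the baseline removed; values can
// be small or even negative at low iteration counts due to timer noise.
function benchOps(iterations) {
  var baseline = timeLoop(function (x) { return x; }, iterations);
  return {
    xor: timeLoop(function (x, i) { return x ^ i; }, iterations) - baseline,
    and: timeLoop(function (x, i) { return x & i; }, iterations) - baseline,
    lsh: timeLoop(function (x) { return x << 1; }, iterations) - baseline
  };
}
```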
Blocks: js-perf
Note: since bug 117611 is "[META] JavaScript Performance Issues", I have added the current bug to the dependency list of bug 117611. We hope to solve each individual performance problem in individual bugs, and track them all from the [META] bug.
Target Milestone: --- → Future
My test results show that the problem may not be just the bitwise operators. Even empty loops take nearly twice as long as they do in IE. I modified the test script to compare empty loops with loops containing some assignment statements. The modified test1() is shown below; it tests the empty loop. Remove the comments in the function to test loops with some assignments.

function test1()
{
    var i, powerOfTwo, j;
    var start, end;

    document.forms[0].testResult1.value = '';
    start = new Date();
    for (i = 1; i <= max; i++) {
        //junk = i;
        //junk2 = i;
        //junk3 = i;
    }
    end = new Date();
    document.forms[0].testResult1.value = end - start + ' ms';
}

tests (ms)            Mozilla   IE5   Ratio
empty loop:               79     46    1.71
loop with 1 assign:      140     78    1.79
loop with 2 assign:      203     93    2.18

And my system configuration is: P4 1.8GHz CPU, 256M memory.

C:\>uname -a
Windows_NT SETPOINT 5 00 586
I am exploring a technique called direct threaded code. The idea is to use an array of code addresses as the program to execute. The motivation is that fetching an opcode and switching on it is very inefficient: 'switch' is very costly, as it translates to about 10 machine instructions. Below is the general way to interpret code:

void engine()
{
    static Inst program[] = { inst1 /* ... */ };
    Inst *ip;

    for (;;)
        switch (*ip++) {
          case inst1:
            ...
            break;
          ...
        }
}

Direct threaded code uses the form below:

void engine()
{
    static Inst program[] = { &&inst1 /* ... */ };  /* gcc labels-as-values */
    Inst *ip;

    goto *ip++;
  inst1:
    ...
    goto *ip++;
    ...
}

The details of direct threaded code can be found at http://www.complang.tuwien.ac.at/forth/threaded-code.html

The good news is that it did show about a 10% improvement for this case (SPARC 500MHz, Solaris 8). The bad news is (:-():
1. The modification is vast, and I am still debugging it.
2. I do not know whether it will cause regressions. Although good for this case, I do not know if it is OK for the general case.
3. The technique needs a special C language extension called 'labels as values'. The extension is supported by gcc. This may make the method unportable. See http://www.freebsd.org/info/gcc/gcc.info.Labels_as_Values.htm

Any suggestion or criticism is welcome.

York
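The dispatch tradeoff being discussed can be illustrated without gcc's computed goto: replace a switch on opcode numbers with a table of handler references, the closest portable analogue of threaded dispatch. This is a toy stack machine with invented opcodes, not SpiderMonkey's interpreter:

```javascript
// Toy threaded-dispatch VM: the "program" is an array of handler
// functions (plus inline operands), so dispatch is an indirect call
// instead of a range-checked switch on an opcode.
var PUSH = function (vm) { vm.stack.push(vm.code[vm.pc++]); };  // operand follows
var ADD  = function (vm) { vm.stack.push(vm.stack.pop() + vm.stack.pop()); };
var HALT = null;  // sentinel that ends the run loop

function run(code) {
  var vm = { code: code, pc: 0, stack: [] };
  var op;
  while ((op = vm.code[vm.pc++]) !== HALT)
    op(vm);  // indirect dispatch; handlers consume their own operands
  return vm.stack.pop();
}
```

For example, `run([PUSH, 2, PUSH, 3, ADD, HALT])` evaluates 2 + 3 and returns 5.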
After 2 weeks of struggle, I have to abandon the direct threaded code approach. After fixing some bugs in my code, it finally does not show any performance improvement. :-(
Attached patch Patch for review (obsolete) (deleted) — Splinter Review
With the patch, there is a large improvement on Solaris and a small improvement on Windows. The patch tries to improve the interpretation of JSOP_NAME and JSOP_NAMEINC. When the name is a simple native variable, like an int variable 'i', the processing is inefficient. Because this case obviously has a high usage frequency, it should be optimized. I will describe the parts of the patch for the interpretation of JSOP_NAME and JSOP_NAMEINC respectively below. At the end I give the test data for Solaris and Windows.

1) Interpretation of JSOP_NAME

To get the value of the name, the code first stores the value from obj->slot into rval. If the name has a getter, it calls the getter, stores the returned value in rval, and stores rval back into obj->slot. That store of rval back into obj->slot is needless if the name does not have a getter at all, as with an int variable 'i'. So we can just skip that part when it is a native object without a getter. See the code below:

    case JSOP_NAME:
        ...
        ok = js_FindProperty(cx, id, &obj, &obj2, &prop);
        ...
        rval = (slot != SPROP_INVALID_SLOT)
               ? LOCKED_OBJ_GET_SLOT(obj2, slot)
               : JSVAL_VOID;
        JS_UNLOCK_OBJ(cx, obj2);
        ok = SPROP_GET(cx, sprop, obj, obj2, &rval);
        JS_LOCK_OBJ(cx, obj2);
        ...
        if (SPROP_HAS_VALID_SLOT(sprop, OBJ_SCOPE(obj2)))
            LOCKED_OBJ_SET_SLOT(obj2, slot, rval);
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ useless for an int variable 'i'
        ...

    #define SPROP_GET(cx,sprop,obj,obj2,vp)                                       \
        (((sprop)->attrs & JSPROP_GETTER)                                         \
         ? js_InternalCall(cx, obj, OBJECT_TO_JSVAL((sprop)->getter), 0, 0, vp)   \
         : SPROP_CALL_GETTER(cx, sprop, (sprop)->getter, obj, obj2, vp))

Thus if

    (!(sprop->attrs & JSPROP_GETTER) && !(sprop->getter))

is true, we know that we would just uselessly re-assign rval to the slot.

2) JSOP_NAMEINC

When processing JSOP_NAMEINC, the code first gets the property of obj[id] via js_FindProperty(), then goes to do_incop, and in do_incop it calls CACHED_GET(OBJ_GET_PROPERTY()) to get the value of the property.
In js_FindProperty(), the engine searches the cache on rt->propertyCache to find the corresponding scope property. Later it calls CACHED_GET(), which does the cache test again. Quite inefficient. Another inefficient point is that it first calls OBJECT_TO_JSVAL(obj) and later calls VALUE_TO_OBJECT(), which is just a kind of waste. Actually, if the name is a native object and sprop is returned by js_FindProperty, we can just get the value of the property as in the processing of JSOP_NAME: that is, first get the scope property, and then get the value by calling SPROP_GET(). See the code below:

    case JSOP_INCNAME:
        ...
    case JSOP_NAMEINC:
    case JSOP_NAMEDEC:
        atom = GET_ATOM(cx, script, pc);
        id = (jsid)atom;
        SAVE_SP(fp);
        ok = js_FindProperty(cx, id, &obj, &obj2, &prop);
        OBJ_DROP_PROPERTY(cx, obj2, prop);
        ...
        lval = OBJECT_TO_JSVAL(obj);
        goto do_incop;
        ...
    do_incop:
        VALUE_TO_OBJECT(cx, lval, obj);

        /* The operand must contain a number. */
        SAVE_SP(fp);
        CACHED_GET(OBJ_GET_PROPERTY(cx, obj, id, &rval));

    #define CACHED_GET(call)                                                  \
        JS_BEGIN_MACRO                                                        \
            if (!OBJ_IS_NATIVE(obj)) {                                        \
                ok = call;                                                    \
            } else {                                                          \
                JS_LOCK_OBJ(cx, obj);                                         \
                PROPERTY_CACHE_TEST(&rt->propertyCache, obj, id, sprop);      \
                if (sprop) {                                                  \
                    ...                                                       \
                } else {                                                      \
                    JS_UNLOCK_OBJ(cx, obj);                                   \
                    ok = call;                                                \
                    /* No fill here: js_GetProperty fills the cache. */       \
                }                                                             \
            }                                                                 \
        JS_END_MACRO

3) Test data

Using the attached testcase and ibench to test the patch on Solaris 8, 500MHz CPU (average of 3 test runs):

                   test1(ms)  test2(ms)  test3(ms)  ibench(s)
without the patch     2837       2574       3589      28.02
with the patch        2591       2467       3025      27.88
ratio                 1.09       1.04       1.18      1.005

On the Windows platform, data from 5 repeated runs:

                   test1  test2  test3
without the patch   1946   1853   1981
with the patch      1931   1821   1949
Keywords: patch, review
cc'ing more reviewers for this patch -
Comment on attachment 92205 [details] [diff] [review] Patch for review First of all, I don't think this increase in source complexity and compiled code space is necessarily worth the slight performance gain in a handful of synthetic benchmarks. I'm glad the "threaded" interpreter approach was dropped after it proved not to matter; I hope you didn't spend too much time on it. It seems to me best to *propose and discuss* approaches before hacking patches. I'm sorry if I missed some e-mail on this, and failed to say "don't go there". This bug blamed bitwise operators, but they probably weren't dominating the critical path any more than name operations or increment operations were. Can someone jprof or quantify the benchmarks used here and show hierarchical cycle profile results? One thing disturbs me: I don't see how JSOP_NAME and JSOP_NAMEINC matter here, because the test cases use local variables, except for the max variable tested in the loop condition which, as scole points out, is global. Local variables are already specialized to avoid name gets and sets. I think it is incumbent on anyone writing combinatorial code that must perform well to use local variables in functions. Please fix the testcases to make max an *argument* to the functions, so its accesses are as optimized as are accesses of local variables. (Phil, can you fix the tests and attach them to this bug? Thanks.) While I'm here, let me point out some bad bugs and pick some nits, to help you work on a final patch if we agree that's the best way to spend effort on this bug -- or just for your information when you write another patch for js/src. 
/be

>diff -u -r3.106 jsinterp.c
>--- jsinterp.c  12 Jun 2002 08:31:43 -0000  3.106
>+++ jsinterp.c  22 Jul 2002 06:32:41 -0000
>@@ -2445,6 +2445,28 @@
>         }
>
>         OBJ_DROP_PROPERTY(cx, obj2, prop);
>+        /*optimized for native object*/

Please put an extra newline before such comments, put a space after "/*" and before "*/", and use a proper sentence (capitalized, with a period at the end).

>+        if (prop && OBJ_IS_NATIVE(obj) && OBJ_IS_NATIVE(obj2)) {
>+            JSScope *scope_ = OBJ_SCOPE(obj);

Non-nits, i.e., bad bugs, here: you are not worrying about thread-safety via object scope locking, and you are using prop after it has been dropped by the call to OBJ_DROP_PROPERTY just above. You must call JS_LOCK_OBJ(cx, obj) before fetching OBJ_SCOPE(obj), and you must call js_GetMutableScope and re-lookup sprop when setting (as ++ does) as opposed to getting, in case the property is inherited from a prototype. See js_SetProperty in jsobj.c.

Nit: this block-local is not a variable declared in a macro, so no need to name it scope_ -- scope will do.

>+            sprop = (JSScopeProperty *)prop;
>+            slot = (uintN)sprop->slot;
>+            rval = (slot != SPROP_INVALID_SLOT)
>+                   ? LOCKED_OBJ_GET_SLOT(obj, slot)

Non-nit: Sorry, sprop is not in obj's scope -- it is in obj2's scope, and no object is locked at this point, so you can't use LOCKED_OBJ_GET_SLOT.

>+                   : JSVAL_VOID;
>+
>+            if (!(sprop->attrs & JSPROP_GETTER) && !(sprop->getter))

Nit: don't overparenthesize the !sprop->getter clause.

>+                goto inc_rval;
>+
>+            JS_UNLOCK_SCOPE(cx, scope_);

Non-nit: scope_ wasn't locked, see above.

>+            ok = SPROP_GET(cx, sprop, obj, obj, &rval);
>+            JS_LOCK_SCOPE(cx, scope_);
>+            if (ok && SPROP_HAS_VALID_SLOT(sprop, scope_))
>+                LOCKED_OBJ_SET_SLOT(obj, slot, rval);
>+            JS_UNLOCK_SCOPE(cx, scope_);
>+            if (!ok)
>+                goto out;
>+            goto inc_rval;
>+        }
>         lval = OBJECT_TO_JSVAL(obj);
>         goto do_incop;
>
>@@ -2473,6 +2495,7 @@
>         if (!ok)
>             goto out;
>
>+    inc_rval:
>         /* The expression result goes in rtmp, the updated value in rval.
         */
>         if (JSVAL_IS_INT(rval) &&
>             rval != INT_TO_JSVAL(JSVAL_INT_MIN) &&
>@@ -2832,6 +2855,10 @@
>             rval = (slot != SPROP_INVALID_SLOT)
>                    ? LOCKED_OBJ_GET_SLOT(obj2, slot)
>                    : JSVAL_VOID;
>+            /*optimized for the common case*/

Nit: see above.

>+            if (!(sprop->attrs & JSPROP_GETTER) && !(sprop->getter))

Nit: see above.

>+                goto save_the_opnd;
>+
>             JS_UNLOCK_OBJ(cx, obj2);
>             ok = SPROP_GET(cx, sprop, obj, obj2, &rval);
>             JS_LOCK_OBJ(cx, obj2);
>@@ -2841,6 +2868,7 @@
>             }
>             if (SPROP_HAS_VALID_SLOT(sprop, OBJ_SCOPE(obj2)))
>                 LOCKED_OBJ_SET_SLOT(obj2, slot, rval);
>+        save_the_opnd:
>             OBJ_DROP_PROPERTY(cx, obj2, prop);
>             PUSH_OPND(rval);
>             break;

At this point, I really think more study is required before attempting so low-level a patch. Object locking, prototype vs. object-owned scope (js_GetMutableScope), and the fundamental lack of a need(*) to optimize JSOP_NAME, JSOP_NAMEINC, etc. given the fact that no sound benchmark or real JS function should use global variables in combinatorially intensive code, make me say: let's not hack more without lots of measurement, followed by analysis, followed by design of more sound optimizations. /be

(* It may be that real-world JS abuses global variables too much, but I believe that we should measure before optimizing. This bug claims that bitwise operators are too slow, but provides no evidence; its tests are not reduced, inviting confounders. We need to rehabilitate this bug with good profiling data, or close it as invalid.)
Attachment #92205 - Flags: needs-work+
Results with testcase in comment #19, build 2002072304:

+---------------------+-------------------+-------------------+--------+
| Test                | Mozilla (20020121)| IE 5.5            | Ratio  |
|---------------------|-------------------|-------------------|--------|
| Test 1              | 4567 ms           | 2814 ms           | 1.62   |
|---------------------|-------------------|-------------------|--------|
| Test 2              | 4146 ms           | 2594 ms           | 1.60   |
|---------------------|-------------------|-------------------|--------|
| Test 3              | 4847 ms           | 3876 ms           | 1.25   |
+---------------------+-------------------+-------------------+--------+
Thanks -- one thing you might do to reduce noise, at the price of using an ECMA extension that's only in Mozilla: change those new Date constructions to use Date.now() instead. This may not matter much; it would be interesting to know the savings. With or without that change, can someone run the JS profiler that's in venkman, if not a machine cycle (hierarchical, we need to spread blame along callers of common subroutines, a la gprof) profiler? Maybe rjesup is able to help. /be
Another thing to try: run the tests, modified so they don't depend on the DOM, in the JS shell (built on Unix from js/src/Makefile.ref, on Windows from js.mak, and on Mac from macbuild/JSRef.mcp). Also perhaps try the xpcshell, built when you build Mozilla and installed into dist/bin. If the numbers get dramatically better, then the DOM is slowing down some (few) global accesses too much, still or again. /be
Attached file venkman profile (deleted) —
Here is the result of profiling a version of the script with the DOM and timing calls removed, and |max| passed as a parameter. I ran this on my 850Mhz laptop, running RH Linux 7.1. Of course, these numbers can't really be compared to any others, so they're probably not very useful.
I've tried Date.now(). Times are basically the same as those in comment #20
Here are my timings on WinNT 4.0 (SP6), 500 MHz CPU, 128M RAM. Using an optimized JS shell built today, and Mozilla trunk 2002072308. Using the latest tests, with |max| passed as a parameter. All times in milliseconds:

        JS Shell   XPCShell   Moz    IE6
TEST1:    2625       2625     2703   1578
TEST2:    2360       2360     2453   1422
TEST3:    2672       2703     2875   2203
I have quantified test1 on Solaris, and it shows:

    71.59%  js_Interpret
     3.30%  js_ValueBoolean
     1.84%  nsJSContext::DOMBranchCallback
     1.07%  JS_GetParent

The JS engine consumed most of the time. Maybe we need a finer-granularity timing method, so we can tell which part of the JS engine consumes the most CPU.
We need line level or basic block level profiling, not function level -- quantify should be able to do that, at least on Windows. /be
Are you sure? I collected the data with granularity=line, but cannot find an option to view the line-level data.
Here are the build options:

> quantify_what_options ./mozilla-bin.quantify
Build time options: "-collection-granularity=line"
Command line options (run time defaults): "-collection-granularity=line"
-quantify-home=/export/home/yddu/rational/releases/quantify.sol.2002a.06.00
-threads=yes -use-internal-locks=yes

But in the exported data file, I cannot find any line-level data, just function data. Am I missing something?
cc'ing jband in case he knows the answer to these Quantify questions -
Attached file quantify result on solaris (deleted) —
quantified on solaris
Quantified with the new testcase that removes the global variable max. CPU is mostly consumed by js_Interpret(), so consider only the statistics inside that function:

    main loop, reinit of local vars   29.4%
    JS debugger support               19.22%
    JSOP_GETVAR                        8.77%
    JSOP_BITXOR                        6.88%
    JSOP_LSH                           6.26%
    JSOP_POPV                          5.7%
    JSOP_IFEQ                          5.5%
    JSOP_NEG                           5.4%
    JSOP_GOTO                          4.53%
    JSOP_SETVAR                        4%

The JS debugger support seems to be a pain. I don't know why it eats so much CPU.
YueDong Du: I think (but I'm not sure, since it's the first time I have looked at such a Quantify "beast") that the 12.28% is for the switch statement just below the "}". If that is true, JS debugger support should be decreased by this value and the 12.28% should be added to the loop time.
Thanks for the qfy output. Can you try putting #if 0 and #endif around that entire debugger rt->interruptHandler block statement, and re-quantify, so we can tell what it is contributing? I have some ideas on how to get rid of all that overhead; more in a bit. /be
This patch loads rt->interruptHandler into a local before the interpreter loop, and reloads it only after returns from js_Invoke (which might call a native function). It may not help, if optimizing compilers and target architectures don't lead to the interruptHandler local being kept in a register. /be
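The same hoisting pattern can be sketched at the JavaScript level (illustrative only, not the C patch itself; `rt` and `interpret` are hypothetical names):

```javascript
// Analogue of the patch: instead of reading rt.interruptHandler on
// every iteration, copy it into a local once before the loop.
function interpret(rt, steps) {
  var handler = rt.interruptHandler;  // hoisted load, read once
  var count = 0;
  for (var i = 0; i < steps; i++) {
    if (handler)
      handler(i);
    count++;
    // After anything that can run arbitrary code (js_Invoke in the C
    // patch), the cached value would have to be refreshed:
    //   handler = rt.interruptHandler;
  }
  return count;
}
```

The design risk the comment mentions applies here too: the hoist only pays off if the local actually stays in a register, and correctness requires reloading after any call that could replace the handler.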
The patch is valuable, at least on Solaris. With testcase test1 (Sun Blade 100, 500MHz, Solaris 8, average of 3 runs):

with the patch:    2251 ms
without the patch: 2595 ms
quantified as /be requested.
So 13.19% of js_Interpret execution time is spent in the single statement:

    switch (op)

Since "threaded code" does not seem to give any improvement, I have no idea how we could reduce the execution time here. Maybe disassemble and look at what the compiler generates.
And Brendan's patch is obviously beneficial, so can we let it go through the review process? We have shown that this bug has nothing to do with the bitwise operators. Maybe switch is too heavy, but there is no obvious way to reduce that overhead, so the patch is probably the only fruit of this bug.
Attachment #92205 - Attachment is obsolete: true
Taking. /be
Assignee: khanson → brendan
Keywords: mozilla1.2
Target Milestone: Future → mozilla1.2alpha
I'll try to find some time to look at the function and what it compiles to (I used to alpha- and beta-test compilers, mostly from an optimization point-of-view).
Randell, as far as I know, a switch consists of a range check, a table lookup, and a branch to the handler. All are necessary. Even if we implemented a table-driven interpreter ourselves, we would need to do all 3 steps, needless to say that the table-driven method may need some unportable language feature. Even if we find that switch is bad, what should we do? I have spent a lot of time on it already, but it is up to you.
I'm resummarizing and possibly morphing (but I don't think I am -- I'm going to make the testcase benchmarks here scream) based on a partial evaluation design I'm pursuing that should improve JS performance quite a bit. /be
Status: NEW → ASSIGNED
Summary: Bitwise operators should be faster → JS needs partial evaluation
Comment on attachment 92768 [details] [diff] [review] patch to try that localizes rt->interruptHandler sr=jband. Sure, why not? I'm curious if it really measures as a win.
Attachment #92768 - Flags: superreview+
Comment on attachment 92768 [details] [diff] [review] patch to try that localizes rt->interruptHandler r=rginda Doesn't seem to cause any debugger regressions, afaict.
Attachment #92768 - Flags: review+
jband: see comment #36 for YueDong Du's measurement with the patch vs. without. I'll check the patch in when the 1.2alpha trunk opens, but keep the bug open for bigger improvements. /be
The latest patch passes the JS testsuite on WinNT in both the debug and optimized shells. No test regressions occurred -
I checked attachment 92768 [details] [diff] [review] into the 1.2alpha trunk. /be
Whiteboard: branchOEM
Whiteboard: branchOEM → branchOEM+
In OEM branch.
Whiteboard: branchOEM+ → branchOEM+, fixedOEM
Moving out, some of these may move to 1.3alpha shortly. /be
Target Milestone: mozilla1.2alpha → mozilla1.2beta
Not beta material, not making this beta. /be
Target Milestone: mozilla1.2beta → mozilla1.3alpha
Whiteboard: branchOEM+, fixedOEM → [92768:fixedOEM]
Whiteboard: [92768:fixedOEM]
Next alpha. /be
Target Milestone: mozilla1.3alpha → mozilla1.4alpha
After a running start, my grand plans for jit'ing and partial evaluation stalled. I don't know when I'll get to this. I think the memory optimization bugs on my JS list are more important, along with some correctness ones on my and khanson's lists. So I'm going to future this for now. /be
Target Milestone: mozilla1.4alpha → Future
Keywords: perf, testcase
Attached file Testcase for js shell (deleted) —
Attached patch Making test run faster by 10% (deleted) — Splinter Review
The main idea behind the patch is that the results of <<, >>, ^, etc. are always integers, while the current code in jsinterp.c uses the STORE_NUMBER macro, which first calls JSDOUBLE_IS_INT. Thus the patch adds STORE_INT and STORE_UINT macros that are essentially STORE_NUMBER without JSDOUBLE_IS_INT, and adjusts the code accordingly. The patch also removes an unnecessary check for false ok in JSDOUBLE_IS_INT, as this check is already done in FETCH_INT. It not only makes SM run faster on the test cases but also shrinks the code size.
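A toy model of the tagging logic may make this clearer. This is not SpiderMonkey's real macro set, just a sketch under the assumption that jsvals tag 31-bit ints inline: STORE_NUMBER must test whether a double result happens to be such an int, whereas a bitwise result is an int32 by construction, so only the range check remains.

```javascript
// Toy jsval tagging: tagged ints are (i << 1) | 1, mirroring the
// JSVAL_INT tag bit. Real SpiderMonkey macros differ in detail.
function INT_FITS_IN_JSVAL(i) {
  // One bit is lost to the tag, so only 31-bit ints fit.
  return i >= -(1 << 30) && i < (1 << 30);
}
function INT_TO_JSVAL(i) { return (i << 1) | 1; }
function JSVAL_TO_INT(v) { return v >> 1; }

// STORE_NUMBER path: must first check that the double is an integer.
function storeNumber(d) {
  var isInt = Math.floor(d) === d && INT_FITS_IN_JSVAL(d);  // JSDOUBLE_IS_INT
  return isInt ? INT_TO_JSVAL(d) : { double: d };  // boxed-double fallback
}

// STORE_INT path for bitwise results: x ^ y, x << y, etc. are int32 by
// definition, so the is-it-an-integer test can be skipped entirely.
function storeInt(i) {
  return INT_FITS_IN_JSVAL(i) ? INT_TO_JSVAL(i) : { double: i };
}
```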
Data on my Linux box (Athlon 1.2GHz / Fedora Core 1 / GCC 3.3.2) for an optimized build of the js shell:

Before the patch:
test 1: 795 ms
test 2: 731 ms
test 3: 827 ms

After the patch:
test 1: 660 ms
test 2: 627 ms
test 3: 729 ms

which indicates a speedup of at least 15%. For comparison, here are the results for Rhino with JDK 1.4.2:

test 1: 760 ms
test 2: 719 ms
test 3: 602 ms

which also shows that the JDK JIT can do an amazing job with JVM bytecode even for a typeless language.
Comment on attachment 145602 [details] [diff] [review] Making test run faster by 10% Great! Thanks, good catch on the bogus extra ok test. I'll get this into 1.8a. I doubt this is the end of the story, though. Maybe I should unmorph this bug and let it be about shift ops again. /be
Attachment #145602 - Flags: review+
Re: comment 57, second paragraph: Yes, JITs are the way to go. .NET's IL is a typeless stack bytecode, precisely so that JITs, which model operand types based on declarations, inferences, or runtime code that must be generated in cases where the type wasn't declared and can't be inferred, don't have to deal with a bunch of load-int, load-double, etc. opcodes. The interesting thing about the testcase is that types can be inferred, and in fact partial evaluation would make the test go even faster. So I do want a bug on that topic, but maybe it shouldn't be this one. /be
If I extend the patch to treat JSVAL_IS_INT specially for binary/unary +/-, since JSDOUBLE_IS_INT always holds for the result (and for unary minus the result always fits in a jsval), then the total speedup reaches almost 25%. It is interesting that in the Rhino case, a patch to optimize double-to-int32 applied a couple of years ago made the test run 2.5 times faster in both interpreted and compiled modes, and when I switched off the patch for the compiled mode, it started to run slower than interpreted mode. This tells us that in Rhino's case, the type inference that the compiled mode uses to deduce purely numerical variables benefited the test case less than a simple 3-line optimization in frequently executed code. So I would not be surprised if SM with partial evaluation would benefit the test case less than numerical tricks.
The patch extends attachment 145602 [details] [diff] [review] with the special treatment of tagged ints in JSOP_NEG and JSOP_POS. Again, it not only makes things faster but also shrinks code size.
Updated results for the new patch, for the configuration from comment #57:

Before the patch:
test 1: 795 ms
test 2: 731 ms
test 3: 827 ms
size of js executable: 374660

After the patch:
test 1: 599 ms
test 2: 567 ms
test 3: 726 ms
size of js executable: 373700
Attached patch cleaned up version (deleted) — Splinter Review
Main substantive change is to use js_NewNumberValue instead of its inline expansion in JSOP_POS. That should save even more code size. Neither unary + nor - is all that common in real world scripts, so I'm not concerned about the extra call overhead in the JSVAL_IS_INT case. I don't want to tune for this bug's benchmark only. Although it may hurt with some compilers, most interpreter switch cases use the common operand temporaries, rval, etc., so I made JSOP_NEG and JSOP_POS use rval too, instead of a new block-local v. Also compressed DEBUG ifdef into JS_ASSERT and used interpreter temporary i. Thanks again, Igor. /be
Attachment #145738 - Attachment is obsolete: true
Re: comment 60, second paragraph: The idea with partial evaluation is not simply to deduce numerical type when compiling (really, when specializing compiled code based on constant arguments and other invariants). It's to deduce integer type, loop upper bounds, degenerate tests, etc. It's true that using the ALU instead of the FPU will get you something like a 3X speedup on most architectures. Even better would be to get rid of the runtime tests that have to check whether JSDOUBLE_IS_INT && INT_FITS_IN_JSVAL. Thanks to Igor's work here, I think I will just keep this bug open. Encore! /be
This version extends the previous patch with an explicit optimization for the addition of two tagged ints:

case JSOP_ADD:
      ...
+     /* Optimize for two int-tagged operands as the sum fits in int32. */
+     if ((lval & rval) & JSVAL_INT) {
+         if (lval != JSVAL_VOID && rval != JSVAL_VOID) {
+             i = JSVAL_TO_INT(lval) + JSVAL_TO_INT(rval);
+             JS_ASSERT((jsdouble)i == (jsdouble)JSVAL_TO_INT(lval) +
+                                      (jsdouble)JSVAL_TO_INT(rval));
+             if (INT_FITS_IN_JSVAL(i)) {
+                 rval = INT_TO_JSVAL(i);
+                 sp--;
+                 STORE_OPND(-1, rval);
+                 break;
+             }
+         }
+     }

and similar code for subtraction:

case JSOP_SUB:
      ...
+     /* Optimize for two int-tagged operands as the difference fits in int32. */
+     if ((lval & rval) & JSVAL_INT) {
+         if (lval != JSVAL_VOID && rval != JSVAL_VOID) {
+             i = JSVAL_TO_INT(lval) - JSVAL_TO_INT(rval);
+             JS_ASSERT((jsdouble)i == (jsdouble)JSVAL_TO_INT(lval) -
+                                      (jsdouble)JSVAL_TO_INT(rval));
+             if (INT_FITS_IN_JSVAL(i)) {
+                 rval = INT_TO_JSVAL(i);
+                 sp--;
+                 STORE_OPND(-1, rval);
+                 break;
+             }
+         }
+     }

Compared with the previous cases, the new optimizations add more code and the small overhead of the "if ((lval & rval) & JSVAL_INT)" test when the operands of binary +/- are not tagged ints, but they reduce by 15% the running time of code like:

function NestedForLoopBenchmark_run() {
    var sum, i, j;
    sum = 0;
    for (i = 1000; i > 0; i--) {
        for (j = 100; j > 0; j--) {
            sum = sum + 1;
        }
    }
    if (sum != 100000) {
        this.error("NestedForLoopBenchmark");
    }
}

which is used in benchmarks/BenchPress.js. The total time of benchmarks/BenchPress.js is reduced by 4%.
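The JS_ASSERT in the snippet above rests on an overflow argument worth making explicit: jsval ints are 31-bit, so the int32 sum or difference of two in-range values can never overflow, and only the INT_FITS_IN_JSVAL recheck is needed to decide between the fast path and the generic double path. A minimal sketch of that reasoning, using the historical 31-bit range (the constants mirror old SpiderMonkey; the helper function is invented for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* 31-bit tagged-int range of old SpiderMonkey jsvals; -2^30 itself is
 * reserved for JSVAL_VOID, hence the != JSVAL_VOID tests in the patch. */
#define JSVAL_INT_MIN         (1 - (1 << 30))
#define JSVAL_INT_MAX         ((1 << 30) - 1)
#define INT_FITS_IN_JSVAL(i)  ((i) >= JSVAL_INT_MIN && (i) <= JSVAL_INT_MAX)

/* Worst case is 2 * (2^30 - 1) < 2^31 - 1, so the int32 addition itself
 * can never overflow; only the jsval range check can fail, sending us
 * to the slow path that boxes the result as a double. */
static int sum_fits(int32_t a, int32_t b, int32_t *out)
{
    assert(INT_FITS_IN_JSVAL(a) && INT_FITS_IN_JSVAL(b));
    *out = a + b;
    return INT_FITS_IN_JSVAL(*out);
}
```

The same bound holds for subtraction, which is why both cases in the patch share the shape "compute in int32, assert exactness, recheck the jsval range".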
Bogus benchmark alarms go off about now: sum = sum + 1; is not something to optimize. ++sum or sum++ is already optimized for the int case. But ok, it could have been sum += 42 or sum += SOME_INT_CONST. Still, I'd rather see the compiler notice, without having to build an SSA graph, that variables initialized to 0 or other consts are subject only to certain operations, and constrained by loop control, so that they can be optimized to use int-only JSOPs. I'll patch that way soon. /be
(In reply to comment #66)
> Still, I'd rather see the compiler notice, without having to build an SSA graph,
> that variables initialized to 0 or other consts are subject only to certain
> operations, and constrained by loop control, so that they can be optimized to
> use int-only JSOPs.

But this is only applicable to numerical loops without any function calls in between, right? Since SM supports indirect eval, any function can potentially change a local variable's value, since any function can be an alias for eval.

BTW, is eval a reserved keyword in JS2?
> But this can only be applicable to numerical loops without any function
> calls in between, right? Since SM supports indirect eval any function can
> potentially change local variable value since any function can be an alias
> to eval.

Not so. First, per ECMA, we are free to start throwing EvalError. Second, the only thing per ECMA-262 that eval (indirect or direct) can do to change the set of variables visible to the compiler of the function calling eval is to add to that set. The only thing an eval can do to the set of variables visible to the compiler of eval code, where eval is called from a function, is to delete only those members of the set that were added via an earlier eval (they lack DontDelete, per 10.2.2).

An ECMA-262-based compiler can therefore optimize all permanent properties of the activation object that have slots in the JSStackFrame.vars array.

If the function is lightweight (no direct eval call), and the implementation throws EvalError for indirect calls, then the compiler can optimize all declared vars. I'd like to enable this mode in SpiderMonkey, but I bet there are web pages that call eval via obj.eval for some object obj. Still, the name by which eval is called in that case is 'eval', so we can probably make an exemption.

If the function is heavyweight, or if there is an indirect eval call that can't be detected at compile time, the compiler of the function can still optimize declared vars, unless the var name collides with a closure (conditional function declaration) name. SpiderMonkey does not work properly for this hard case in the indirect-eval case today. I'll fix that while I'm patching this bug.

> BTW, is eval a reserved keyword in JS2?

No. See http://www.mozilla.org/js/language/js20/core/lexer.html#reserved-word. /be
(In reply to comment #68)
> Not so. First, per ECMA, we are free to start throwing EvalError.
...
> If the function is lightweight (no direct eval call), and the implementation
> throws EvalError for indirect calls, then the compiler can optimize all declared
> vars. I'd like to enable this mode in SpiderMonkey, but I bet there are web
> pages that call eval via obj.eval for some object obj. Still, the name by which
> eval is called in that case is 'eval', so we can probably make an exemption.

Rhino throws EvalError for indirect eval calls exactly to enable such optimization, and around the year 2000 that bit a colleague of mine at a previous job, when he had to support the window.eval construction in a Java browser. The quick fix was to add a kind of prescan to literally replace window.eval with eval (and do a few other quirks like replacing --> with //, which Rhino did not support at that time). We expected that at some point this would have to be extended, but that was never necessary. So indeed, the indirect calls are still spelled eval.

But my point in the previous comment was that with support for indirect eval the compiler cannot assume that any variable keeps its original value after an arbitrary function call, so an optimization to track integer-only operations cannot be sustained across function calls unless indirect eval is disabled.
...
> > BTW, is eval a reserved keyword in JS2?
>
> No. See http://www.mozilla.org/js/language/js20/core/lexer.html#reserved-word.

Is it possible to submit a suggestion to change that, or at least to require that any indirect eval must throw an exception? Or is it too late?
> But my point in the previous comment was that with support for indirect eval
> the compiler cannot assume that any variable keeps its original value after
> an arbitrary function call, so an optimization to track integer-only
> operations cannot be sustained across function calls unless indirect eval
> is disabled.

You can still get a win by retesting int-ness only after the function that might be eval returns, and bailing out of the specialized code if the variable has morphed into a non-int. This is complex, but it has been done (Anamorphic, acquired by Sun for their HotSpot runtime, which was originally for Self). But you're right, it's much better to forbid indirect eval. I'm all in favor of that, provided we can treat foo.eval like eval in web-ish contexts and say we're done. De-optimizing such old web-JS scripts will not matter.

> Is it possible to submit a suggestion to change that or at least to require
> that any indirect eval must throw an exception? Or is it too late?

Probably too late, in part because ECMA has made compatibility important, to avoid diverging too much and designing-by-committee. Also to ease migration. I'm trying to get the Mozilla Foundation to join ECMA as an NFP organization, so we can contribute. ECMA Edition 4 is stalled at the moment, and I would not want to try to unstall it by suggesting eval become a keyword. We need compatibility and short time-to-market here, more than polishing. /be
Priority: -- → P5
Flags: testcase?
Patch to thread (in the Forth sense, not in the concurrency sense) js_Interpret as a prolog to work here coming next. /be
Priority: P5 → P1
Summary: JS needs partial evaluation → JS could use some focused partial evaluation
Target Milestone: Future → mozilla1.9alpha
Attached patch indirect threading, passes testsuite (obsolete) (deleted) — Splinter Review
Hmm, I need to force threading off and verify I didn't break the #else clauses. Anyway, this is the first step. /be
Attachment #203044 - Flags: superreview?(shaver)
Attachment #203044 - Flags: review?(mrbkap)
And to fuse the common op = *pc; DO_OP(). It would be cool if someone would test Venkman with this patch applied on top of the big patch. /be
Comment on attachment 203044 [details] [diff] [review] indirect threading, passes testsuite Oops, broke the non-threaded case. New complete patch in a bit. /be
Attachment #203044 - Attachment is obsolete: true
Attachment #203044 - Flags: superreview?(shaver)
Attachment #203044 - Flags: review?(mrbkap)
So I tested that patch on the testcases at http://www.24fun.com/downloadcenter/benchjs/benchjs.html and http://www.speich.net/computer/3d.php (big cube in separate window). Test 6 on the former is particularly interesting to me, in general.

Looks like about a 4% win on test 6, and about neutral elsewhere, including the wireframe cube. For comparison, Opera is about 30-50% faster than we are on test 6 and the wireframe... But a lot of those testcases are not in the JS engine, of course.
Attached patch patch, v2 (obsolete) (deleted) — Splinter Review
This is more like it. I'd like to land this to move on to the next stage. Docs on the grand plan coming soon; I'll hack on those while waiting for review ;-). /be
Attachment #203073 - Attachment is obsolete: true
Attachment #203082 - Flags: superreview?
Attachment #203082 - Flags: review?(mrbkap)
Attached patch interdiff against last patch to compile on MSVC (obsolete) (deleted) — Splinter Review
GCC let me get away with uintN flags in a statement context! Oh well, I took the opportunity to fix a comment glitch too. Reviewers take note. /be
Comment on attachment 203082 [details] [diff] [review] patch, v2 Bob found a case (JSOP_LITOPX) that needs to be #ifdef JS_THREADED_INTERP. Roll-up patch next. /be
Attachment #203082 - Attachment is obsolete: true
Attachment #203082 - Flags: superreview?
Attachment #203082 - Flags: review?(mrbkap)
Attached patch patch, v3 (obsolete) (deleted) — Splinter Review
JSOP_LITOPX case must advance pc by its length (5) minus the length hardcoded into the DO_NEXT_OP code at the end of the case it jumps into, but only when threading. /be
Attachment #203085 - Attachment is obsolete: true
Attachment #203087 - Flags: superreview?(shaver)
Attachment #203087 - Flags: review?(mrbkap)
(In reply to comment #79) No regressions with this latest version of the patch in opt/dbg shell tests on winxp.
Branch prediction hardware indexes by instruction address to memoize branch targets, so the single indirect jump of the switch in the old interpreter loop almost never predicts correctly, while distributed jumps at the end of each case do better, depending on how well one bytecode correlates with the next in compiled scripts. The indirect threading in the current patch is a win, but not huge for real-world DHTML benchmarks because (as usual) JS is not on the critical path enough. But this is just the first step.

Inspired by http://citeseer.ist.psu.edu/berndl05context.html, OCaml (which I'd looked at a while ago, and then again lately after keynoting the 2005 ACM ICFP), and great stuff from the past such as Self and Synthesis, I'm finally hot to optimize good old SpiderMonkey to pieces.

It will be tricky retaining API compatibility, including debugger API compat, but that's the real win here. SpiderMonkey is > 9 years old and has a ton of active embeddings. We should bear the price of compatibility, even as we add new and entirely different, better APIs, until clients can afford to migrate. We should not break the world, and I believe we won't have to, just to optimize performance. Sure, it's harder than clean-slating, but that's nothing new to Mozilla or the Web. Compatibility is king. /be
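The dispatch difference described above can be shown with a toy VM. This is an illustrative sketch, not SpiderMonkey code: the opcode names and the jumpTable/DO_NEXT_OP shape are invented for the toy, though they echo the patch's JS_THREADED_INTERP mode. It relies on the GCC/Clang labels-as-values extension; on compilers without computed goto, the patch falls back to the classic loop-around-a-switch.

```c
#include <stdint.h>

enum { OP_PUSH1, OP_ADD, OP_HALT };

/* Indirect-threaded dispatch: each opcode body ends with its own
 * computed goto, giving the branch predictor one indirect jump per
 * opcode site instead of a single shared, hard-to-predict switch. */
static int32_t interpret(const uint8_t *pc)
{
    static void *jumpTable[] = { &&do_push1, &&do_add, &&do_halt };
    int32_t stack[16], *sp = stack;

#define DO_NEXT_OP() goto *jumpTable[*pc++]

    DO_NEXT_OP();

  do_push1:
    *sp++ = 1;
    DO_NEXT_OP();             /* distributed jump, one per opcode */

  do_add:
    sp--;
    sp[-1] += sp[0];
    DO_NEXT_OP();

  do_halt:
#undef DO_NEXT_OP
    return sp[-1];
}
```

A bytecode sequence like {OP_PUSH1, OP_PUSH1, OP_ADD, OP_HALT} then executes with no central dispatch loop at all.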
sr=shaver (couldn't put it in the bug, because bugzilla pukes on the inactive flag type, or something -- whatever)
Comment on attachment 203087 [details] [diff] [review] patch, v3 Trying to sr=shaver again.
Attachment #203087 - Flags: superreview?(shaver) → superreview+
Attachment #203087 - Attachment is obsolete: true
Attachment #203087 - Flags: review?(mrbkap)
Attached patch patch I checked in (deleted) — Splinter Review
Blake will review, I have no doubt ;-). /be
Attachment #203610 - Flags: review?(mrbkap)
On the first testcase, here's how my scores changed on my Athlon64 X2 system:

+---------+----------+----------+---------+
| Test    | Before   | After    | Speedup |
|---------|----------|----------|---------|
| Test 1  | 344 ms   | 312 ms   | 9.3%    |
|---------|----------|----------|---------|
| Test 2  | 312 ms   | 297 ms   | 4.8%    |
|---------|----------|----------|---------|
| Test 3  | 390 ms   | 375 ms   | 3.8%    |
+---------+----------+----------+---------+

Not a bad speed win at all :-)
This patch appears to have increased Tp on btek from 918ms to 924ms, and on luna from 863ms to 870ms.
Blocks: 316879
(In reply to comment #86) > This patch appears to have increased Tp on btek from 918ms to 924ms, and on > luna from 863ms to 870ms. What is the value of __GNUC__ in the old egcs version used on btek? How about on luna? I don't believe the threaded interpreter code was enabled there. We need new benches (machines) to mark, those old ones are not representative. /be
Results from running the tests in an optimized build of the JS shell on a 6-year-old Pentium II 360 MHz laptop with Fedora Core 4 and GCC 4.0.1:

+---------+----------+----------+---------+
| Test    | Before   | After    | Speedup |
|---------|----------|----------|---------|
| Test 1  | 3452 ms  | 2946 ms  | 17%     |
|---------|----------|----------|---------|
| Test 2  | 3099 ms  | 2598 ms  | 19%     |
|---------|----------|----------|---------|
| Test 3  | 3509 ms  | 2890 ms  | 21%     |
+---------+----------+----------+---------+

This is a really nice speedup for a single-threaded embedding, which shows the benefits of the patch on old hardware with a good compiler.
(In reply to comment #88)
> does gcc 2 not support computed gotos?
> http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc_4.html#SEC64
> http://computing.ee.ethz.ch/sepp/egcs-1.1.2-to/gcc.html#SEC61

It does, but when I first checked in (and broke Windows due to #else brain damage), btek slowed down. So I tried __GNUC__ >= 3 instead, hoping to put btek and its ilk in the (fixed) #else clause that used a loop around a switch, as before. But if btek slowed down with that test, then I don't know why -- either it is still getting the threaded interpreter code, or somehow the #else clause code is slower than before. Clues welcome.

Igor's data is important: it shows that we should not be catering to old compilers, only to old computers. We have the ability to recompile, since this is a new patch and no one supports the broken old gcc/egcs compiler (AFAIK). /be
luna is gcc 3.2 and it too slowed down a bit, which makes me wonder if this change is a real world win (assuming our Tp test is still representative of the web world out there). Btw, I'm all for giving btek a new compiler (and maybe OS upgrade?) to play with.
For the record, I do my builds with Visual C++ 2003.
(In reply to comment #91) > luna is gcc 3.2 and it too slowed down a bit, which makes me wonder if this > change is a real world win (assuming our Tp test is still representative of the > web world out there). What gcc version has the tree-ssa stuff? Could it be that is required to see the gain? Or is this not a compiler issue so much as a machine (icache, e.g.) issue? /be
(In reply to comment #93)
> Or is this not a compiler issue so much as a machine (icache, e.g.)
> issue?

The last form of the patch does not remove the switch statement for the threaded case. It remains in the code with all the related case labels but is no longer used. Perhaps this is what confuses older versions of GCC, which cannot eliminate the dead code?
(In reply to comment #95)
> (In reply to comment #93)
> > Or is this not a compiler issue so much as a machine (icache, e.g.)
> > issue?
>
> The last form of the patch does not remove the switch statement for the
> threaded cases. It remains in the code with all related case labels but is no
> longer used. Perhaps this is what confuses older versions of GCC where it can
> not eliminate the dead code?

It can't eliminate the dead code, but it shouldn't execute anything to do with the switch. The slightly larger icache footprint from any such switch code should not be enough to hurt perf. Something else is going wrong.

But first, who here sees a perf *win* with the current patch using GCC? Igor does. What is different about his machine and GCC compared to btek's or luna's? /be
(In reply to comment #96)
> The slightly larger icache footprint from any such switch code
> should not be enough to hurt perf.

But why is the dead code there? I.e., why is it necessary to keep the switch/case for the threaded case?
(In reply to comment #97)
> But why is the dead code there? I.e., why is it necessary to keep the
> switch/case for the threaded case?

You're right, it's not. At one point I was going to leave the while/switch loop for the debugger API, but then I thought of the interruptJumpTable hack. Patch to remove any switch (and any doubt that this is causing some perf problem) in a trice! /be
Attached patch no switch in JS_THREADED_INTERP (deleted) — Splinter Review
Attachment #203867 - Flags: review?(igor.bukanov)
Attachment #203867 - Flags: review?(igor.bukanov) → review+
To whoever's interested, I've taken a copy of the jsinterp.c file before these checkins, applied both patches from this bug, and manually removed the threaded code and replaced the macros to make it easier to see what changes were made for the non-threaded case. Hopefully this will allow someone to see where things slowed down.
Comment on attachment 203610 [details] [diff] [review] patch I checked in >Index: jsinterp.c >+#define SAVE_SP_AND_PC(fp) (SAVE_SP(fp), (fp)->pc = pc) Does this want a more general name (such as UPDATE_FP_STATE) so if there is more deferred storage to fp, we don't need to change the macro name each time? r=mrbkap, either way.
Attachment #203610 - Flags: review?(mrbkap) → review+
(In reply to comment #101) > (From update of attachment 203610 [details] [diff] [review] [edit]) > >Index: jsinterp.c > >+#define SAVE_SP_AND_PC(fp) (SAVE_SP(fp), (fp)->pc = pc) > > Does this want a more general name (such as UPDATE_FP_STATE) so if there is > more deferred storage to fp, we don't need to change the macro name each time? FP sounds like an acronym, Floating Point. UPDATE connotes other things than what is going on here. This is low-level code, and it should not use much abstraction. The abstraction-leaks will only hurt worse, the more abstract it gets. But more to the point, I don't expect or want any additional mutable-over-script-execution VM registers being homed in a stack frame. Jag: do you know of other machines than btek and luna for which this patch seems to have regressed performance? Cc'ing dbaron. /be
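The SAVE_SP_AND_PC macro discussed above reflects a standard interpreter pattern: the VM registers sp and pc live in C locals (ideally machine registers) on the fast path, and are flushed to the heap-visible frame only at points where other code might inspect it. A minimal sketch under that assumption follows; ToyFrame and sync_demo are invented names, not the real JSStackFrame or jsinterp.c code.

```c
#include <stdint.h>

/* Stand-in for the frame slots the debugger and callees can see. */
typedef struct ToyFrame {
    int32_t *sp;
    const uint8_t *pc;
} ToyFrame;

/* Same shape as the patch's macros: flush cached locals to the frame
 * only when deferred state must become observable. */
#define SAVE_SP(fp)         ((fp)->sp = sp)
#define SAVE_SP_AND_PC(fp)  (SAVE_SP(fp), (fp)->pc = pc)
#define RESTORE_SP(fp)      (sp = (fp)->sp)

static void sync_demo(ToyFrame *fp, int32_t *stackBase, const uint8_t *code)
{
    int32_t *sp = stackBase;        /* cached VM registers */
    const uint8_t *pc = code;

    *sp++ = 42;                     /* fast path touches locals only */
    pc += 1;

    SAVE_SP_AND_PC(fp);             /* flush before anything observable:
                                       a call, a debugger hook, a throw */
}
```

The design tension in the comments above is exactly about this macro's scope: a narrow name ties it to the two registers it homes, while a general name would invite homing more mutable VM state in the frame, which the fast path wants to avoid.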
The other tinderboxen I looked at didn't show a (clear) slow-down. On btek there's a peak covering the first time you checked in and consequently backed out, and it goes up again the second time you checked in. luna shows a similar peak, but it could also just be noise. The slow-down is < 1%, so not a big deal (IMO), but interesting, since it shouldn't be there, and ideally there'd be a speed-up.
Flags: testcase? → testcase-
QA Contact: pschwartau → general
Priority: P1 → --
Target Milestone: mozilla1.9alpha1 → ---
Summary: JS could use some focused partial evaluation → JS indirect threaded interpreter
This bug is patched and I should not morph it any more. The summary was grandiose before I fixed it just now to reflect what landed. I've learned my lesson about one patch per bug (several times over). /be
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED