Closed Bug 121414 Opened 23 years ago Closed 16 years ago

JS indirect threaded interpreter

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED

People

(Reporter: boullet.marc, Assigned: brendan)

References

Details

(Keywords: perf, testcase)

Attachments

(13 files, 7 obsolete files)

(deleted), text/html
Details
(deleted), text/html
Details
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
rginda
: review+
jband_mozilla
: superreview+
Details | Diff | Splinter Review
(deleted), text/plain
Details
(deleted), application/x-javascript
Details
(deleted), patch
brendan
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
mrbkap
: review+
Details | Diff | Splinter Review
(deleted), patch
igor
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
I used a relatively slow PC (300 MHz) under NT4.

The JS tests (attached) don't involve any DOM stuff. I just re-coded in
JS some C programs I wrote a long, long time ago to find the
best way to get the power of two greater than or equal to a given number.
They involve only a for loop, some assignments, a while loop, and some
logical and shift operations.

The comparison with IE5.5 is shown in the following table:

+---------------------+-------------------+-------------------+--------+
|       Test          | Mozilla (20020121)|      IE 5.5       |  Ratio |
|---------------------|-------------------|-------------------|--------|
|Test 1               |     4516 ms       |      2814 ms      |  1.60  |
|---------------------|-------------------|-------------------|--------|
|Test 2               |     4086 ms       |      2594 ms      |  1.57  |
|---------------------|-------------------|-------------------|--------|
|Test 3               |     4766 ms       |      3876 ms      |  1.22  |
+---------------------+-------------------+-------------------+--------+

The ratio of 1.6 is what triggered this bug. I decided to separate
this bug from bug #117611, since that one is already huge and now focused on sort
and concat.

Note that IE5.5 doesn't seem to perform so well at shifting (test 3), though
it is still faster than Mozilla.

Hope this helps.
Attached file page with the tests (deleted) —
Keywords: perf
Confirming.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Here are my timings on WinNT 4.0 (SP6), 500 MHz CPU, 128M RAM.
This was Mozilla vs. IE6. The ratios are the same as Marc's above:

+---------------------+-------------------+-------------------+--------+
|       Test          | Mozilla (20020116)|      IE 6         |  Ratio |
|---------------------|-------------------|-------------------|--------|
|Test 1               |     2640 ms       |      1625 ms      |  1.62  |
|---------------------|-------------------|-------------------|--------|
|Test 2               |     2390 ms       |      1484 ms      |  1.61  |
|---------------------|-------------------|-------------------|--------|
|Test 3               |     2781 ms       |      2265 ms      |  1.23  |
+---------------------+-------------------+-------------------+--------+
Here are Marc's tests:

var max = Math.pow(2,17);

function test1() {
  var i,powerOfTwo,j;
  var start,end;
  document.forms[0].testResult1.value = '';
  start = new Date();
  for (i = 1; i <= max; i++) {
    powerOfTwo = i;
    j = powerOfTwo;
    while (j &= j^-j) powerOfTwo = j << 1;
  }
  end = new Date();
  document.forms[0].testResult1.value = end-start + ' ms';
}


function test2() {
  var i,powerOfTwo,j,k;
  var start,end;
  document.forms[0].testResult2.value = '';
  start = new Date();
  for (i = 1; i <= max; i++) {
    powerOfTwo = i;
    j = powerOfTwo;
    do k = j; while (j &= j^-j);
    if (k != powerOfTwo) powerOfTwo = k << 1;
  }
  end = new Date();
  document.forms[0].testResult2.value = end-start + ' ms';
}


function test3() {
  var i,powerOfTwo,k;
  var start,end;
  document.forms[0].testResult3.value = '';
  start = new Date();
  for (i = 1; i <= max; i++) {
    powerOfTwo = i; k = 0;
    while (powerOfTwo >>= 1) k++;
    powerOfTwo = 1 << k;
    if (powerOfTwo != i) powerOfTwo <<= 1;
  }
  end = new Date();
  document.forms[0].testResult3.value = end-start + ' ms';
}
Assignee: rogerl → khanson
Blocks: js-perf
Hi,

first of all, this bug should probably be OS => All and perhaps also Platform =>
All.

And here are my test results:

Machine: Athlon/500, 416 Megs of RAM, Windows XP Pro
Mozilla: 2002-01-22-09 Win32, SVG/MathML-enabled
IE     : 6.0 XP-bundled release

+---------------------+-------------------+-------------------+--------+
|       Test          |     Mozilla       |      IE 6.0       |  Ratio |
|---------------------|-------------------|-------------------|--------|
|Test 1               |     2703 ms       |      2153 ms      |  1.25  |
|---------------------|-------------------|-------------------|--------|
|Test 2               |     2414 ms       |      1933 ms      |  1.25  |
|---------------------|-------------------|-------------------|--------|
|Test 3               |     2884 ms       |      2954 ms      |  0.98  | <== !
+---------------------+-------------------+-------------------+--------+
As suggested, changing OS: WinNT ---> All, Platform: WinNT --> All
Also, tentatively changing summary from

       "Yet another benchmark comparing JS engines speeds"
to
       "Bitwise operators should be faster"
OS: Windows NT → All
Hardware: PC → All
Summary: Yet another benchmark comparing JS engines speeds → Bitwise operators should be faster
The claim is that these tests don't use the DOM, and that's almost true.

The loop termination variable "max" has global scope, meaning
that the DOM has to search through its nodes each time the variable is accessed,
to see if some HTML element named 'max' has changed.  (Or so I assume, anyway,
not having looked too deeply at how the DOM is connected to JS.)

A better test would be to pass 'max' in as an argument to each of the
functions: that way, the loop termination condition is outside of the global
object.

(Any reference to a global object is slow when JS is connected to a browser...)
I modified Marc's test to run in the standalone JS shell, where
there is no DOM. The global object is that of the JS shell itself.
The timings I got are roughly comparable to those in Comment #3:

Test1:  2750 ms
Test2:  2485 ms
Test3:  2750 ms
Concerning the timings in Comment #5:
The Mozilla numbers (on a machine clocked similarly to Phil's) are
roughly comparable to those in comment #3.
It seems that IE6 under XP is much slower than under NT4, but that's
MS's problem, not Moz's.

Also making changes suggested by Phil in comment #6.
No longer blocks: js-perf
Though I don't see what the DOM has to do with global
variables, I reran the modified tests, passing |max| as a parameter
to the test functions.

Here are the results (same machine as in comment #2):

+---------------------+-------------------+-------------------+--------+
|Test w/ max as a par.| Mozilla (20020121)|      IE 5.5       |  Ratio |
|---------------------|-------------------|-------------------|--------|
|Test 1               |     4276 ms       |      2754 ms      |  1.55  |
|---------------------|-------------------|-------------------|--------|
|Test 2               |     3836 ms       |      2484 ms      |  1.54  |
|---------------------|-------------------|-------------------|--------|
|Test 3               |     4506 ms       |      3705 ms      |  1.22  |
+---------------------+-------------------+-------------------+--------+

The first two tests show a 5% decrease in the ratio.
This means that Mozilla doesn't perform so well with global variables (although
it could also mean that Moz performs better than IE on function parameters, which
is unlikely).
Another thought occurs to me: Can we run a loop with precisely the same number
of iterations and loop logic, but with no body?  Then subtract that time from
the other times to determine whether this slowdown is because bitwise ops are
slow or loop ops are slow.

Looking at the tests, this may be somewhat difficult.  But that sort of just
says that the tests ought to be simplified to directly test what they mean to
test.  (I.e.: we ought to have a "left-shift" test, an "xor" test, an "and" test,
and so on, not the "interesting" algorithms we have now that combine all the
ops together.)

Just a thought...

-scole
Blocks: js-perf
Note: since bug 117611 is "[META] JavaScript Performance Issues",
I have added the current bug to the dependency list of bug 117611. 

We hope to solve each individual performance problem in individual bugs,
and track them all from the [META] bug.
Target Milestone: --- → Future
My test results show that the problem may not be just a bitwise
operator problem. Even empty loops take nearly twice
as long as they do in IE.

I modified the test script just to compare empty loops
against loops with some assignment statements.

The modified test1() is shown below; it is used to test
the empty loop. Remove the comments in the function to test loops
with some assignments.

function test1() {
  var i,powerOfTwo,j;
  var start,end;
  document.forms[0].testResult1.value = '';
  start = new Date();
  for (i = 1; i <= max; i++) {
      //junk = i;
      //junk2 = i;
      //junk3 = i;
  }
  end = new Date();
  document.forms[0].testResult1.value = end-start + ' ms';
}


tests(ms)                  Mozilla      IE5     Ratio
empty loop:                  79          46     1.71
loop with 1 assign:          140         78     1.79
loop with 2 assign:          203         93     2.18


And my system configuration is:

P4 1.8GHZ cpu, 256M memory.
C:\>uname -a
Windows_NT SETPOINT 5 00 586
I am doing some exploration of a technique called direct threaded code.
The idea is to use an array of code addresses as the program to execute.

The basic premise is that fetching an opcode and switching on it is very
inefficient, because 'switch' is very costly: it translates to about 10 machine
instructions.

Below is the general way to interpret code:
void engine()
{
  static Inst program[] = { inst1 /* ... */ };

  Inst *ip = program;

  for (;;)
    switch (*ip++) {
    case inst1:
      ...
      break;
    ...
    }
}

And direct threaded code uses the form below:
void engine()
{
  static Inst program[] = { &&inst1 /* ... */ };

  Inst *ip = program;

  goto **ip++;

  inst1: ... goto **ip++;
  ...
}

The details of direct threaded code can be found at
http://www.complang.tuwien.ac.at/forth/threaded-code.html


The good news is that it did show about a 10% improvement for this case (SPARC
500MHz, Solaris 8).

The bad news (:-(): 1. The modification is vast, and I am still debugging it.
2. I do not know if it will cause regressions: although it is good for this case,
I do not know if it is OK in the general case.
3. The technique needs a special C language extension called 'labels as values';
see the sketch below. The extension is supported by gcc, which may make the
method unportable.
See http://www.freebsd.org/info/gcc/gcc.info.Labels_as_Values.htm
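
For concreteness, here is a minimal self-contained sketch of this kind of
dispatch (toy opcodes and program of my own invention, not SpiderMonkey code;
it needs gcc because of the labels-as-values extension). This is the indirect
variant, where the bytecode indexes a table of label addresses:

#include <stdio.h>

enum { OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };

static void run(const unsigned char *pc)
{
    /* One label per opcode; gcc's && operator yields a label's address. */
    static void *dispatch[] = { &&do_push1, &&do_add, &&do_print, &&do_halt };
    int stack[16], *sp = stack;

/* Each handler ends in its own indirect jump; no loop, no switch. */
#define NEXT() goto *dispatch[*pc++]

    NEXT();
  do_push1: *sp++ = 1;               NEXT();
  do_add:   --sp; sp[-1] += sp[0];   NEXT();
  do_print: printf("%d\n", sp[-1]);  NEXT();
  do_halt:  return;
#undef NEXT
}

int main(void)
{
    static const unsigned char prog[] =
        { OP_PUSH1, OP_PUSH1, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);  /* prints 2 */
    return 0;
}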

Any suggestion or criticism is welcome.

York
After 2 weeks of struggle, I have to abandon the so-called direct
threaded code method. After fixing some bugs in my code, it finally showed no
performance improvement. :-(


Attached patch Patch for review (obsolete) (deleted) — Splinter Review
With the patch, there is a large improvement on Solaris and a small improvement
on Windows. The patch tries to improve the interpretation of JSOP_NAME and
JSOP_NAMEINC. When the name is a simple native variable, like an int variable 'i',
the processing is inefficient. Because this case is obviously of high usage
frequency, it should be optimized.

I will describe the parts of the patch for the interpretation of JSOP_NAME and
JSOP_NAMEINC respectively below. At the end I give the test data for Solaris
and Windows.

1) Interpretation of JSOP_NAME
To get the value of the name, the interpreter first stores in rval the value in
obj->slot. But if the name has a getter, it calls the getter, stores the returned
value in rval, and stores rval back to obj->slot. Storing the returned value rval
in obj->slot is needless if the name does not have a getter at all, as for an int
variable 'i'.

So just skip that part when it is a native object without a getter.

See the code below,

	case JSOP_NAME:
	    ... ...
	    ok = js_FindProperty(cx,id,&obj,&obj2,&prop);
	    ... ...
	    rval = (slot != SPROP_INVALID_SLOT)
		   ? LOCKED_OBJ_GET_SLOT(obj2, slot)
		   : JSVAL_VOID;
	    JS_UNLOCK_OBJ(cx, obj2);
	    ok = SPROP_GET(cx, sprop, obj, obj2, &rval);
	    JS_LOCK_OBJ(cx, obj2);
	    ... ...		 
	    if (SPROP_HAS_VALID_SLOT(sprop, OBJ_SCOPE(obj2)))
		LOCKED_OBJ_SET_SLOT(obj2, slot, rval);
	    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^useless for 'int variable i'
	    ... ...

 #define SPROP_GET(cx,sprop,obj,obj2,vp)                                      \
     (((sprop)->attrs & JSPROP_GETTER)                                        \
      ? js_InternalCall(cx, obj, OBJECT_TO_JSVAL((sprop)->getter), 0, 0, vp)  \
      : SPROP_CALL_GETTER(cx, sprop, (sprop)->getter, obj, obj2, vp))

Thus if (!(sprop->attrs & JSPROP_GETTER) && !sprop->getter) is true, we know
that we would just uselessly re-assign rval to the slot.

2) JSOP_NAMEINC
When processing JSOP_NAMEINC, the interpreter first gets the property of obj[id]
via js_FindProperty(), then goes to do_incop, and in do_incop it calls
CACHED_GET(OBJ_GET_PROPERTY()) to get the value of the property.

js_FindProperty() searches the cache on rt->propertyCache to find the
corresponding scope property. Later, CACHED_GET() does the cache test again.
Quite inefficient.

Another inefficiency is that it first calls OBJECT_TO_JSVAL(obj) and later calls
VALUE_TO_OBJECT(). That is just wasted work.

Actually, if the name is a native object and sprop is returned by
js_FindProperty, we can get the value of the property just as in the processing
of JSOP_NAME: first get the scope property, and then get the value by calling
SPROP_GET().



See the code below,
case JSOP_INCNAME:
	  ... ...
	  case JSOP_NAMEINC:
	  case JSOP_NAMEDEC:
	    atom = GET_ATOM(cx, script, pc);
	    id	 = (jsid)atom;

	    SAVE_SP(fp);
	    ok = js_FindProperty(cx, id, &obj, &obj2, &prop);
		       OBJ_DROP_PROPERTY(cx, obj2, prop);
	    ... ...
	    lval = OBJECT_TO_JSVAL(obj);
	    goto do_incop;
	    ... ...
	  do_incop:
	    VALUE_TO_OBJECT(cx, lval, obj);

	    /* The operand must contain a number. */
	    SAVE_SP(fp);
	    CACHED_GET(OBJ_GET_PROPERTY(cx, obj, id, &rval));
 

#define CACHED_GET(call)                                                      \
    JS_BEGIN_MACRO                                                            \
        if (!OBJ_IS_NATIVE(obj)) {                                            \
            ok = call;                                                        \
        } else {                                                              \
            JS_LOCK_OBJ(cx, obj);                                             \
            PROPERTY_CACHE_TEST(&rt->propertyCache, obj, id, sprop);          \
            if (sprop) {                                                      \
                ... ...                                                       \
            } else {                                                          \
                JS_UNLOCK_OBJ(cx, obj);                                       \
                ok = call;                                                    \
                /* No fill here: js_GetProperty fills the cache. */           \
            }                                                                 \
        }                                                                     \
    JS_END_MACRO




3) Test data
Using the attached testcase and iBench to test the patch on Solaris 8
(500MHz CPU), average of 3 test runs:

                        test1(ms)   test2(ms)   test3(ms)   ibench(s)
without the patch       2837        2574        3589        28.02
with the patch          2591        2467        3025        27.88
ratio                   1.09        1.04        1.18        1.005

On Windows, here is the data for 5 repetitions of the tests:

                        test1   test2   test3
without the patch       1946    1853    1981
with the patch          1931    1821    1949
Keywords: patch, review
cc'ing more reviewers for this patch -
Comment on attachment 92205 [details] [diff] [review]
Patch for review

First of all, I don't think this increase in source complexity and compiled
code space is necessarily worth the slight performance gain in a handful of
synthetic benchmarks.  I'm glad the "threaded" interpreter approach was dropped
after it proved not to matter; I hope you didn't spend too much time on it.  It
seems to me best to *propose and discuss* approaches before hacking patches. 
I'm sorry if I missed some e-mail on this, and failed to say "don't go there".

This bug blamed bitwise operators, but they probably weren't dominating the
critical path any more than name operations or increment operations were.  Can
someone jprof or quantify the benchmarks used here and show hierarchical cycle
profile results?

One thing disturbs me: I don't see how JSOP_NAME and JSOP_NAMEINC matter here,
because the test cases use local variables, except for the max variable tested
in the loop condition which, as scole points out, is global.  Local variables
are already specialized to avoid name gets and sets.  I think it is incumbent
on anyone writing combinatorial code that must perform well to use local
variables in functions.  Please fix the testcases to make max an *argument* to
the functions, so its accesses are as optimized as are accesses of local
variables.

(Phil, can you fix the tests and attach them to this bug?  Thanks.)

While I'm here, let me point out some bad bugs and pick some nits, to help you
work on a final patch if we agree that's the best way to spend effort on this
bug -- or just for your information when you write another patch for js/src.

/be

>diff -u -r3.106 jsinterp.c
>--- jsinterp.c	12 Jun 2002 08:31:43 -0000	3.106
>+++ jsinterp.c	22 Jul 2002 06:32:41 -0000
>@@ -2445,6 +2445,28 @@
>             }
> 
>             OBJ_DROP_PROPERTY(cx, obj2, prop);
>+            /*optimized for native object*/

Please put an extra newline before such comments, put a space after "/*" and
before "*/", and use a proper sentence (capitalized, with a period at the end).

>+            if (prop && OBJ_IS_NATIVE(obj) && OBJ_IS_NATIVE(obj2)) {
>+                JSScope *scope_ = OBJ_SCOPE(obj);                             

Non-nits, i.e., bad bugs, here: you are not worrying about thread-safety via
object scope locking, and you are using prop after it has been dropped by the
call to OBJ_DROP_PROPERTY just above.  You must call JS_LOCK_OBJ(cx, obj)
before fetching OBJ_SCOPE(obj), and you must call js_GetMutableScope and
re-lookup sprop when setting (as ++ does) as opposed to getting, in case the
property is inherited from a prototype.  See js_SetProperty in jsobj.c.

Nit: this block-local is not a variable declared in a macro, so no need to name
it scope_ -- scope will do.

>+                sprop = (JSScopeProperty *)prop;
>+                slot = (uintN)sprop->slot;                                    
>+                rval = (slot != SPROP_INVALID_SLOT)                           
>+                       ? LOCKED_OBJ_GET_SLOT(obj, slot)                       

Non-nit: Sorry, sprop is not in obj's scope -- it is in obj2's scope, and no
object is locked at this point, so you can't use LOCKED_OBJ_GET_SLOT.

>+                       : JSVAL_VOID;                                          
>+
>+                if (!(sprop->attrs & JSPROP_GETTER) && !(sprop->getter))

Nit: don't overparenthesize the !sprop->getter clause.

>+                    goto inc_rval;
>+
>+                JS_UNLOCK_SCOPE(cx, scope_);                                  

Non-nit: scope_ wasn't locked, see above.

>+                ok = SPROP_GET(cx, sprop, obj, obj, &rval);                   
>+                JS_LOCK_SCOPE(cx, scope_);                                    
>+                if (ok && SPROP_HAS_VALID_SLOT(sprop, scope_))                
>+                    LOCKED_OBJ_SET_SLOT(obj, slot, rval);                     
>+                JS_UNLOCK_SCOPE(cx, scope_);                                  
>+                if (!ok)
>+                    goto out;
>+                goto inc_rval;
>+            }
>             lval = OBJECT_TO_JSVAL(obj);
>             goto do_incop;
> 
>@@ -2473,6 +2495,7 @@
>             if (!ok)
>                 goto out;
> 
>+          inc_rval:
>             /* The expression result goes in rtmp, the updated value in rval. */
>             if (JSVAL_IS_INT(rval) &&
>                 rval != INT_TO_JSVAL(JSVAL_INT_MIN) &&
>@@ -2832,6 +2855,10 @@
>             rval = (slot != SPROP_INVALID_SLOT)
>                    ? LOCKED_OBJ_GET_SLOT(obj2, slot)
>                    : JSVAL_VOID;
>+            /*optimized for the common case*/

Nit: see above.

>+            if (!(sprop->attrs & JSPROP_GETTER) && !(sprop->getter))

Nit: see above.

>+                goto save_the_opnd;
>+
>             JS_UNLOCK_OBJ(cx, obj2);
>             ok = SPROP_GET(cx, sprop, obj, obj2, &rval);
>             JS_LOCK_OBJ(cx, obj2);
>@@ -2841,6 +2868,7 @@
>             }
>             if (SPROP_HAS_VALID_SLOT(sprop, OBJ_SCOPE(obj2)))
>                 LOCKED_OBJ_SET_SLOT(obj2, slot, rval);
>+         save_the_opnd:
>             OBJ_DROP_PROPERTY(cx, obj2, prop);
>             PUSH_OPND(rval);
>             break;

At this point, I really think more study is required before attempting so
low-level a patch.  Object locking, prototype vs. object-owned scope
(js_GetMutableScope), and the fundamental lack of a need(*) to optimize
JSOP_NAME, JSOP_NAMEINC, etc. given the fact that no sound benchmark or real JS
function should use global variables in combinatorially intensive code, make me
say: let's not hack more without lots of measurement, followed by analysis,
followed by design of more sound optimizations.

/be

(* It may be that real-world JS abuses global variables too much, but I believe
that we should measure before optimizing.  This bug claims that bitwise
operators are too slow, but provides no evidence; its tests are not reduced,
inviting confounders.  We need to rehabilitate this bug with good profiling
data, or close it as invalid.)
Attachment #92205 - Flags: needs-work+
Results with the testcase in comment #19, build 2002072304:
+---------------------+-------------------+-------------------+--------+
|       Test          | Mozilla (20020121)|      IE 5.5       |  Ratio |
|---------------------|-------------------|-------------------|--------|
|Test 1               |     4567 ms       |      2814 ms      |  1.62  |
|---------------------|-------------------|-------------------|--------|
|Test 2               |     4146 ms       |      2594 ms      |  1.60  |
|---------------------|-------------------|-------------------|--------|
|Test 3               |     4847 ms       |      3876 ms      |  1.25  |
+---------------------+-------------------+-------------------+--------+
Thanks -- one thing you might do to reduce noise, at the price of using an ECMA
extension that's only in Mozilla: change those new Date constructions to use
Date.now() instead.  This may not matter much; it would be interesting to know
the savings.

With or without that change, can someone run the JS profiler that's in venkman,
if not a machine cycle (hierarchical, we need to spread blame along callers of
common subroutines, a la gprof) profiler?  Maybe rjesup is able to help.

/be
Another thing to try: run the tests, modified so they don't depend on the DOM,
in the JS shell (built on Unix from js/src/Makefile.ref, on Windows from js.mak,
and on Mac from macbuild/JSRef.mcp).  Also perhaps try the xpcshell, built when
you build Mozilla and installed into dist/bin.  If the numbers get dramatically
better, then the DOM is slowing down some (few) global accesses too much, still
or again.

/be
Attached file venkman profile (deleted) —
Here is the result of profiling a version of the script with the DOM and timing
calls removed, and |max| passed as a parameter.  I ran this on my 850MHz
laptop, running RH Linux 7.1.  Of course, these numbers can't really be
compared to any others, so they're probably not very useful.
I've tried Date.now().
The times are basically the same as those in comment #20.
Here are my timings on WinNT 4.0(SP6), 500 MHz CPU, 128M RAM.
Using optimized JS shell built today, and Mozilla trunk 2002072308.

Using the latest tests, with |max| passed as a parameter.
All times in milliseconds:

              JS Shell    XPCShell     Moz       IE6            
TEST1:          2625        2625       2703      1578          
TEST2:          2360        2360       2453      1422          
TEST3:          2672        2703       2875      2203         
I have quantified test1 on Solaris, and it shows:
71.59%  js_Interpret
 3.30%  js_ValueBoolean
 1.84%  nsJSContext::DOMBranchCallback
 1.07%  JS_GetParent

The JS engine consumed most of the time.
Maybe we need a finer-granularity timing method, so we can tell which part of
the JS engine consumes most of the CPU.
We need line level or basic block level profiling, not function level --
quantify should be able to do that, at least on Windows.

/be
Are you sure? I collected the data with granularity=line, but cannot find an
option to view the line-level data.
Here are the build options:

> quantify_what_options ./mozilla-bin.quantify
Build time options: "-collection-granularity=line"
Command line options (run time defaults): "-collection-granularity=line"
-quantify-home=/export/home/yddu/rational/releases/quantify.sol.2002a.06.00
-threads=yes -use-internal-locks=yes

But in the exported data file, I cannot find any line data, just function data.
Am I missing something?
cc'ing jband in case he knows the answer to these Quantify questions -
Attached file quantify result on solaris (deleted) —
quantified on solaris
Quantified with the new testcase that removes the global variable max.

CPU is mostly consumed by js_Interpret(), so consider just the statistics inside
that function:

main loop, reinit of local vars  29.4%
JS debugger support             19.22%
JSOP_GETVAR                      8.77%
JSOP_BITXOR                      6.88%
JSOP_LSH                         6.26%
JSOP_POPV                        5.7%
JSOP_IFEQ                        5.5%
JSOP_NEG                         5.4%
JSOP_GOTO                        4.53%
JSOP_SETVAR                      4%

The JS debugger support seems to be a pain. I do not know why it eats so much CPU.
YueDong Du: I think (but I'm not sure, since this is the first time
I have looked at such a quantify "beast") that the 12.28% is for the
switch statement just below the "}".
If that is true, the JS debugger support figure should be decreased by this value
and the 12.28% added to the loop time.
Thanks for the qfy output.  Can you try putting #if 0 and #endif around that
entire debugger rt->interruptHandler block statement, and re-quantify, so we can
tell what it is contributing?  I have some ideas on how to get rid of all that
overhead; more in a bit.

/be
This patch loads rt->interruptHandler into a local before the interpreter loop,
and reloads it only after returns from js_Invoke (which might call a native
function).  It may not help if optimizing compilers and target architectures
don't lead to the interruptHandler local being kept in a register.
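
A toy standalone sketch of the shape of that change (simplified names of my
own; the real diff is the attached patch):

#include <stddef.h>

typedef int (*Handler)(void);           /* toy stand-in for the real hook type */
struct Runtime { Handler interruptHandler; };

static int interpret(struct Runtime *rt, const unsigned char *pc, size_t len)
{
    /* Cache the hook in a local (ideally a register) before the loop. */
    Handler interruptHandler = rt->interruptHandler;
    size_t i;

    for (i = 0; i != len; i++) {
        /* The hot per-opcode test no longer reloads rt->interruptHandler. */
        if (interruptHandler && !interruptHandler())
            return 0;
        switch (pc[i]) {
          case 0:
            /* A "call" op: a native callee may install or clear the hook,
               so this is the one place the cached copy is refreshed. */
            interruptHandler = rt->interruptHandler;
            break;
          default:
            break;
        }
    }
    return 1;
}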

/be
The patch is valuable, at least on Solaris.
With testcase test1 (Sun Blade 100, 500MHz, Solaris 8, average of 3 runs):

with the patch      : 2251 ms
without the patch   : 2595 ms
Quantified as /be requested.
So 13.19% of the js_Interpret execution time is spent in the
single statement:
       switch(op)

Since "threaded code" does not seem to give any improvement, I
have no idea how we could reduce the execution time here.

Maybe disassemble and look at what the compiler generates.
And Brendan's patch is obviously beneficial. So can we let the patch go through
the review process?

We have proved that the bug has nothing to do with bitwise operators. Maybe
switch is too heavy, but there is no obvious way to relieve the overhead. So the
patch is probably the only fruit of this bug.
Attachment #92205 - Attachment is obsolete: true
Taking.

/be
Assignee: khanson → brendan
Keywords: mozilla1.2
Target Milestone: Future → mozilla1.2alpha
I'll try to find some time to look at the function and what it compiles to (I
used to alpha- and beta-test compilers, mostly from an optimization point of view).
Randel, as far as I know, a switch consists of a range check, a table lookup,
and a branch to the routine. All are necessary. Even if we implement a
table-driven interpreter ourselves, we need to do all 3 steps, never mind
that the table-driven method may need an unportable language feature.


Even if we find that switch is bad, what should we do? I have wasted a
lot of time on it. But it is up to you.

I'm resummarizing and possibly morphing (but I don't think I am -- I'm going to
make the testcase benchmarks here scream) based on a partial evaluation design
I'm pursuing that should improve JS performance quite a bit.

/be
Status: NEW → ASSIGNED
Summary: Bitwise operators should be faster → JS needs partial evaluation
Comment on attachment 92768 [details] [diff] [review]
patch to try that localizes rt->interruptHandler

sr=jband. Sure, why not? I'm curious if it really measures as a win.
Attachment #92768 - Flags: superreview+
Comment on attachment 92768 [details] [diff] [review]
patch to try that localizes rt->interruptHandler

r=rginda
Doesn't seem to cause any debugger regressions, afaict.
Attachment #92768 - Flags: review+
jband: see comment #36 for YueDong Du's measurement with the patch vs. without.

I'll check the patch in when the 1.2alpha trunk opens, but keep the bug open for
bigger improvements.

/be
The latest patch passes the JS testsuite on WinNT in both the
debug and optimized shells. No test regressions occurred -
I checked attachment 92768 [details] [diff] [review] into the 1.2alpha trunk.

/be
Whiteboard: branchOEM
Whiteboard: branchOEM → branchOEM+
In OEM branch.
Whiteboard: branchOEM+ → branchOEM+, fixedOEM
Moving out, some of these may move to 1.3alpha shortly.

/be
Target Milestone: mozilla1.2alpha → mozilla1.2beta
Not beta material, not making this beta.

/be
Target Milestone: mozilla1.2beta → mozilla1.3alpha
Whiteboard: branchOEM+, fixedOEM → [92768:fixedOEM]
I split attachment 92768 [details] [diff] [review]'s life story into bug 169278.
Whiteboard: [92768:fixedOEM]
Next alpha.

/be
Target Milestone: mozilla1.3alpha → mozilla1.4alpha
After a running start, my grand plans for jit'ing and partial evaluation
stalled. I don't know when I'll get to this.  I think the memory optimization
bugs on my JS list are more important, along with some correctness ones on my
and khanson's lists.  So I'm going to future this for now.

/be
Target Milestone: mozilla1.4alpha → Future
Keywords: perf, testcase
Attached file Testcase for js shell (deleted) —
Attached patch Making test run faster by 10% (deleted) — Splinter Review
The main idea behind the patch is that the results of <<, >>, ^, etc. are always
integers, while the current code in jsinterp.c uses the STORE_NUMBER macro, which
first calls JSDOUBLE_IS_INT.

Thus the patch adds STORE_INT and STORE_UINT macros that are essentially
STORE_NUMBER without JSDOUBLE_IS_INT, and adjusts the code accordingly. The patch
also removes an unnecessary check for a false ok around JSDOUBLE_IS_INT, as this
check is already done in FETCH_INT.

It not only makes SM run faster on the test cases but also shrinks the code
size.
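
To make the distinction concrete, here is a standalone toy model of the two
stores (an illustration with simplified tagging, not the real jsinterp.c
macros):

#include <stdint.h>

typedef intptr_t jsval;                 /* toy jsval: low bit tags an int */
#define JSVAL_INT             1
#define INT_TO_JSVAL(i)       (((jsval)(i) << 1) | JSVAL_INT)
/* Toy 31-bit payload range, mirroring the tagged-int limits. */
#define INT_FITS_IN_JSVAL(i)  ((i) >= -(1L << 30) && (i) < (1L << 30))

/* STORE_NUMBER's shape: first prove the double really is an int (the
   JSDOUBLE_IS_INT step; the -0.0 test keeps negative zero on the double
   path, since a tagged int can't represent it). */
int store_number(double d, jsval *vp)
{
    int32_t i = (int32_t)d;
    if ((double)i == d && !(i == 0 && 1 / d < 0) && INT_FITS_IN_JSVAL(i)) {
        *vp = INT_TO_JSVAL(i);
        return 1;
    }
    return 0;   /* slow path: would allocate a heap double */
}

/* STORE_INT's shape: <<, >>, ^, & produce an int32 by construction, so the
   is-it-an-int test above is pure overhead and can be skipped. */
int store_int(int32_t i, jsval *vp)
{
    if (INT_FITS_IN_JSVAL(i)) {
        *vp = INT_TO_JSVAL(i);
        return 1;
    }
    return 0;   /* value exceeds the 31-bit payload: box as a double */
}
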
Data on my Linux box (Athlon 1.2GHz / Fedora Core 1 / GCC 3.3.2) for optimized
build of js shell:

Before the patch:

test 1: 795 ms
test 2: 731 ms
test 3: 827 ms

After the patch:

test 1: 660 ms
test 2: 627 ms
test 3: 729 ms

which indicates a speedup of at least 15%.

For comparison, here are the results for Rhino with JDK 1.4.2:

test 1: 760 ms
test 2: 719 ms
test 3: 602 ms

which also shows that a JDK JIT can do an amazing job with JVM bytecode even for
a typeless language.
Comment on attachment 145602 [details] [diff] [review]
Making test run faster by 10%

Great!	Thanks, good catch on the bogus extra ok test.

I'll get this into 1.8a.  I doubt this is the end of the story, though.  Maybe
I should unmorph this bug and let it be about shift ops again.

/be
Attachment #145602 - Flags: review+
Re: comment 57, second paragraph:

Yes, JITs are the way to go.  .NET's IL is a typeless stack bytecode, precisely
so that JITs, which model operand types based on declarations, inferences, or
runtime code that must be generated in cases where the type wasn't declared and
can't be inferred, don't have to deal with a bunch of load-int, load-double,
etc. opcodes.

The interesting thing about the testcase is that types can be inferred, and in
fact partial evaluation would make the test go even faster.  So I do want a bug
on that topic, but maybe it shouldn't be this one.

/be
If I extend the patch to treat JSVAL_IS_INT specially for binary/unary +/-, since
JSDOUBLE_IS_INT always holds for the result (and for unary minus the result
always fits in a jsval), then the total speedup reaches almost 25%.

It is interesting that in the Rhino case, a patch to optimize double-to-int32
conversion applied a couple of years ago made the test run 2.5 times faster in
both interpreted and compiled modes, and when I switched off the patch for the
compiled mode, it started to run slower than interpreted mode. Which shows that
in Rhino's case, the type inference the compiled mode uses to deduce purely
numerical variables benefited the test case less than a simple 3-line
optimization in frequently executed code.

So I would not be surprised if SM with partial evaluation benefited the
test case less than numerical tricks.
The patch extends attachment 145602 [details] [diff] [review] with the special
treatment of tagged ints in JSOP_NEG and JSOP_POS.  Again, it not only makes
things faster but also shrinks the code size.
Updated results for the new patch, for the configuration from comment #57:

Before the patch:
test 1: 795 ms
test 2: 731 ms
test 3: 827 ms
size of js executable: 374660

After the patch:
test 1: 599 ms
test 2: 567 ms
test 3: 726 ms
size of js executable: 373700


Attached patch cleaned up version (deleted) — Splinter Review
Main substantive change is to use js_NewNumberValue instead of its inline
expansion in JSOP_POS.	That should save even more code size.  Neither unary +
nor - is all that common in real world scripts, so I'm not concerned about the
extra call overhead in the JSVAL_IS_INT case.  I don't want to tune for this
bug's benchmark only.

Although it may hurt with some compilers, most interpreter switch cases use the
common operand temporaries, rval, etc., so I made JSOP_NEG and JSOP_POS use
rval too, instead of a new block-local v.  Also compressed DEBUG ifdef into
JS_ASSERT and used interpreter temporary i.
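
For the record, the int fast path for unary minus has roughly this shape
(pseudocode in the interpreter's idiom, not the exact diff; note that -0 must
stay on the double path because a tagged int can't represent it):

          case JSOP_NEG:
            rval = FETCH_OPND(-1);
            if (JSVAL_IS_INT(rval) &&
                (i = JSVAL_TO_INT(rval)) != 0 &&    /* -0 needs a double */
                INT_FITS_IN_JSVAL(-i)) {            /* negating can overflow */
                STORE_OPND(-1, INT_TO_JSVAL(-i));
            } else {
                /* general path: to-number, negate, STORE_NUMBER */
            }
            break;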

Thanks again, Igor.

/be
Attachment #145738 - Attachment is obsolete: true
Re: comment 60, second paragraph:

The idea with partial evaluation is not simply to deduce numerical type when
compiling (really, when specializing compiled code based on constant arguments
and other invariants).  It's to deduce integer type, loop upper bounds,
degenerate tests, etc.

It's true that using the ALU instead of the FPU will get you something like a 3X
speedup on most architectures.  Even better would be to get rid of the runtime
tests that have to check whether JSDOUBLE_IS_INT && INT_FITS_IN_JSVAL.  Thanks
to Igor's work here, I think I will just keep this bug open.  Encore!

/be
This version extends the previous patch with the explicit optimization of
addition of 2 tagged ints:

	   case JSOP_ADD:
...
+	     /* Optimize for two int-tagged operands as sum fits int32. */
+	     if ((lval & rval) & JSVAL_INT) {
+		 if (lval != JSVAL_VOID && rval != JSVAL_VOID) {
+		     i = JSVAL_TO_INT(lval) + JSVAL_TO_INT(rval);
+		     JS_ASSERT((jsdouble)i == (jsdouble)JSVAL_TO_INT(lval)
+					      + (jsdouble)JSVAL_TO_INT(rval));
+		     if (INT_FITS_IN_JSVAL(i)) {
+			 rval = INT_TO_JSVAL(i);
+			 sp--;
+			 STORE_OPND(-1, rval);
+			 break;
+		     }
+		 }
+	     }

and similar code for the subtraction:

	   case JSOP_SUB:
...
+	     /* Optimize for two int-tagged operands as difference fits int32. */
+	     if ((lval & rval) & JSVAL_INT) {
+		 if (lval != JSVAL_VOID && rval != JSVAL_VOID) {
+		     i = JSVAL_TO_INT(lval) - JSVAL_TO_INT(rval);
+		     JS_ASSERT((jsdouble)i == (jsdouble)JSVAL_TO_INT(lval)
+					      - (jsdouble)JSVAL_TO_INT(rval));
+		     if (INT_FITS_IN_JSVAL(i)) {
+			 rval = INT_TO_JSVAL(i);
+			 sp--;
+			 STORE_OPND(-1, rval);
+			 break;
+		     }
+		 }
+	     }


Compared with the previous cases, the new optimization adds more code and the
small overhead of doing "if ((lval & rval) & JSVAL_INT)" when the operands of
binary +/- are not tagged ints, but it helps to reduce by 15% the running time
of code like:

function NestedForLoopBenchmark_run()
{
  var sum, i, j;

  sum = 0;
  for (i = 1000; i > 0; i--) {
    for (j = 100; j > 0; j--) {
      sum = sum + 1;
    }
  }
  if (sum != 100000) {
    this.error("NestedForLoopBenchmark");
  }
}

which is used in benchmarks/BenchPress.js. The total time of
benchmarks/BenchPress.js is reduced by 4%.
Bogus benchmark alarms go off about now:

       sum = sum + 1;

is not something to optimize.  ++sum or sum++ is already optimized for the int
case.  But ok, it could have been sum += 42 or sum += SOME_INT_CONST.

Still, I'd rather see the compiler notice, without having to build an SSA graph,
that variables initialized to 0 or other consts are subject only to certain
operations, and constrained by loop control, so that they can be optimized to
use int-only JSOPs.  I'll patch that way soon.

/be
(In reply to comment #66)
> Still, I'd rather see the compiler notice, without having to build an SSA graph,
> that variables initialized to 0 or other consts are subject only to certain
> operations, and constrained by loop control, so that they can be optimized to
> use int-only JSOPs.

But this can only apply to numerical loops without any function calls in
between, right? Since SM supports indirect eval, any function can potentially
change a local variable's value, since any function can be an alias for eval.

BTW, is eval a reserved keyword in JS2?
> But this can only be applicable to numerical loops without any function 
> calls in between, right? Since SM supports indirect eval any function can
> potentially change local variable value since any function can be an alias
> to eval.

Not so.  First, per ECMA, we are free to start throwing EvalError.

Second, the only thing per ECMA-262 that eval (indirect or direct) can do to
change the set of variables visible to the compiler of the function calling eval
is to add to that set.  The only thing an eval can do to the set of variables
visible to the compiler of eval code, where eval is called from a function, is
to delete only those members of the set that were added via an earlier eval
(they lack DontDelete, per 10.2.2).

An ECMA-262-based compiler can therefore optimize all permanent properties of
the activation object that have slots in the JSStackFrame.vars array.

If the function is lightweight (no direct eval call), and the implementation
throws EvalError for indirect calls, then the compiler can optimize all declared
vars.  I'd like to enable this mode in SpiderMonkey, but I bet there are web
pages that call eval via obj.eval for some object obj.  Still, the name by which
eval is called in that case is 'eval', so we can probably make an exemption.

If the function is heavyweight, or if there is an indirect eval call that can't
be detected at compile time, the compiler of the function can still optimize
declared vars, unless the var name collides with a closure (conditional function
declaration) name.  SpiderMonkey does not work properly for this hard case in
the indirect-eval case today.  I'll fix that while I'm patching this bug.

> BTW, is eval a reserved keyword in JS2?

No.  See http://www.mozilla.org/js/language/js20/core/lexer.html#reserved-word.

/be
(In reply to comment #68)
> Not so.  First, per ECMA, we are free to start throwing EvalError.
...
> If the function is lightweight (no direct eval call), and the implementation
> throws EvalError for indirect calls, then the compiler can optimize all declared
> vars.  I'd like to enable this mode in SpiderMonkey, but I bet there are web
> pages that call eval via obj.eval for some object obj.  Still, the name by which
> eval is called in that case is 'eval', so we can probably make an exemption.

Rhino throws EvalError for indirect eval calls exactly to perform such
optimization, and around the year 2000 that bit a colleague of mine at my
previous job, when he had to support the window.eval construction in a Java
browser. The quick fix was to add a kind of prescan to literally replace
window.eval with eval (and do a few other quirks, like replacing --> with //,
which Rhino did not support at that time). We expected that at some point this
would have to be extended, but that was never necessary. So the indirect calls
are indeed still named eval.

But my point in the previous comment was that with support for indirect eval,
the compiler cannot assume that any variable keeps its original value after an
arbitrary function call, so an optimization to track integer-only operations
cannot be maintained across function calls unless indirect eval is disabled.

...
> > BTW, is eval a reserved keyword in JS2?
> 
> No.  See http://www.mozilla.org/js/language/js20/core/lexer.html#reserved-word.
> 

Is it possible to submit a suggestion to change that, or at least to require that
any indirect eval must throw an exception? Or is it too late?
> But my point in the previous comment was that with support for indirect eval
> compiler can not assume that any variable keeps its original value after
> arbitrary function call so optimization to track integer-only operations can
> not be supported accross function calls unless indirect eval is disabled.

You can still get a win by retesting int-ness only after the function that might
be eval returns, and if the variable has morphed into a non-int, bail out of the
specialized code.  This is complex, but it has been done (Anamorphic, acquired
by Sun for their HotSpot runtime, which was originally for Self).

But you're right, it's much better to forbid indirect eval.  I'm all in favor of
that provided we can treat foo.eval like eval in web-ish contexts and say we're
done.  De-optimizing such old web-JS scripts will not matter.

> Is it possible to submit a suggestion to change that or at least to require
> that any indirect eval must throw an exception? Or is it too late?

Probably too late, in part because ECMA has made compatibility important, to
avoid diverging too much and designing-by-committee.  Also to ease migration.

I'm trying to get Mozilla Foundation to join ECMA as a NFP organization, so we
can contribute.  ECMA Edition 4 is stalled at the moment, and I would not want
to try to unstall it by suggesting eval become a keyword.  We need compatibility
and short time-to-market here, more than polishing.

/be
Priority: -- → P5
Flags: testcase?
Patch to thread (in the Forth sense, not in the concurrency sense) js_Interpret as a prolog to work here coming next.

/be
Priority: P5 → P1
Summary: JS needs partial evaluation → JS could use some focused partial evaluation
Target Milestone: Future → mozilla1.9alpha
Attached patch indirect threading, passes testsuite (obsolete) (deleted) — Splinter Review
Hmm, I need to force threading off and verify I didn't break the #else clauses.

Anyway, this is the first step.

/be
Attachment #203044 - Flags: superreview?(shaver)
Attachment #203044 - Flags: review?(mrbkap)
And to fuse the common op = *pc; DO_OP().

It would be cool if someone would test Venkman with this patch applied on top of the big patch.

/be
Comment on attachment 203044 [details] [diff] [review]
indirect threading, passes testsuite

Oops, broke the non-threaded case.  New complete patch in a bit.

/be
Attachment #203044 - Attachment is obsolete: true
Attachment #203044 - Flags: superreview?(shaver)
Attachment #203044 - Flags: review?(mrbkap)
So I tested that patch on the testcases at http://www.24fun.com/downloadcenter/benchjs/benchjs.html and http://www.speich.net/computer/3d.php (big cube in separate window).  Test 6 on the former is particularly interesting to me, in general.  Looks like about a 4% win on test 6, and about neutral elsewhere, including the wireframe cube.

For comparison, Opera is about 30-50% faster than we are on test 6 and the wireframe...  But a lot of those testcases are not in the JS engine, of course.
Attached patch patch, v2 (obsolete) (deleted) — Splinter Review
This is more like it.  I'd like to land this to move on to the next stage.  Docs on the grand plan coming soon; I'll hack on those while waiting for review ;-).

/be
Attachment #203073 - Attachment is obsolete: true
Attachment #203082 - Flags: superreview?
Attachment #203082 - Flags: review?(mrbkap)
Attached patch interdiff against last patch to compile on MSVC (obsolete) (deleted) — Splinter Review
GCC let me get away with uintN flags in a statement context!  Oh well, I took the opportunity to fix a comment glitch too.  Reviewers take note.

/be
Comment on attachment 203082 [details] [diff] [review]
patch, v2

Bob found a case (JSOP_LITOPX) that needs to be #ifdef JS_THREADED_INTERP.  Roll-up patch next.

/be
Attachment #203082 - Attachment is obsolete: true
Attachment #203082 - Flags: superreview?
Attachment #203082 - Flags: review?(mrbkap)
Attached patch patch, v3 (obsolete) (deleted) — Splinter Review
JSOP_LITOPX case must advance pc by its length (5) minus the length hardcoded into the DO_NEXT_OP code at the end of the case it jumps into, but only when threading.

/be
Attachment #203085 - Attachment is obsolete: true
Attachment #203087 - Flags: superreview?(shaver)
Attachment #203087 - Flags: review?(mrbkap)
(In reply to comment #79)
No regressions with this latest version of the patch in opt/dbg shell tests on WinXP.
Branch prediction hardware indexes by instruction address to memoize the branch target, so the switch in the old interpreter loop almost never predicts well, but the distributed jumps at the end of each case do better, depending on the correlation of one bytecode to the next in compiled scripts.  The indirect threading in the current patch is a win, but not a huge one for real-world DHTML benchmarks because (as usual) JS is not on the critical path enough.

But this is just the first step.  Inspired by http://citeseer.ist.psu.edu/berndl05context.html, OCaml (which I'd looked at a while ago, and then lately after keynoting the 2005 ACM ICFP), and great stuff from the past such as Self and Synthesis, I'm finally hot to optimize good old SpiderMonkey to pieces.

It will be tricky retaining API compatibility, including debugger API compat, but that's the real win here.  SpiderMonkey is > 9 years old and has a ton of active embeddings.  We should bear the price of compatibility, even as we add new and entirely different, better APIs, until clients can afford to migrate.  We should not break the world, and I believe we won't have to, just to optimize performance.  Sure, it's harder than clean-slating, but that's nothing new to Mozilla or the Web.  Compatibility is king.
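
For reference, the dispatch macros in the landed patch have roughly this shape
(a simplified sketch; see jsinterp.c for the real definitions):

#ifdef JS_THREADED_INTERP
/* Each opcode's case ends in its own indirect jump through the table, giving
   the branch predictor one site per opcode instead of one shared switch
   branch. */
# define DO_OP()        goto *jumpTable[op]
# define DO_NEXT_OP(n)  do { op = (JSOp) *(pc += (n)); DO_OP(); } while (0)
# define BEGIN_CASE(OP) L_##OP:
# define END_CASE(OP)   DO_NEXT_OP(OP##_LENGTH);
#else
# define DO_NEXT_OP(n)  do { pc += (n); goto advance_pc; } while (0)
# define BEGIN_CASE(OP) case OP:
# define END_CASE(OP)   break;
#endif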

/be
sr=shaver (couldn't put it in the bug, because bugzilla pukes on the inactive flag type, or something -- whatever)
Comment on attachment 203087 [details] [diff] [review]
patch, v3

Trying to sr=shaver again.
Attachment #203087 - Flags: superreview?(shaver) → superreview+
Attachment #203087 - Attachment is obsolete: true
Attachment #203087 - Flags: review?(mrbkap)
Attached patch patch I checked in (deleted) — Splinter Review
Blake will review, I have no doubt ;-).

/be
Attachment #203610 - Flags: review?(mrbkap)
On the first testcase, here's how my scores changed on my Athlon64 X2 system:
+-------------+-------------+-------------+---------+
|    Test     |   Before    |    After    | Speedup |
|-------------|-------------|-------------|---------|
|   Test 1    |   344 ms    |   312 ms    |   9.3%  |
|-------------|-------------|-------------|---------|
|   Test 2    |   312 ms    |   297 ms    |   4.8%  |
|-------------|-------------|-------------|---------|
|   Test 3    |   390 ms    |   375 ms    |   3.8%  |
+-------------+-------------+-------------+---------+

Not bad speed win at all :-)
This patch appears to have increased Tp on btek from 918ms to 924ms, and on luna from 863ms to 870ms.
Blocks: 316879
(In reply to comment #86)
> This patch appears to have increased Tp on btek from 918ms to 924ms, and on
> luna from 863ms to 870ms.

What is the value of __GNUC__ in the old egcs version used on btek?  How about on luna?  I don't believe the threaded interpreter code was enabled there.

We need new benches (machines) to mark, those old ones are not representative.

/be

Results for running the tests in an optimized build of the JS shell on a 6-year-old Pentium II 360 MHz laptop with Fedora Core 4 and GCC 4.0.1:

+-------------+-------------+-------------+---------+
|    Test     |   Before    |    After    | Speedup |
|-------------|-------------|-------------|---------|
|   Test 1    |   3452 ms   |   2946 ms   |   17%   |
|-------------|-------------|-------------|---------|
|   Test 2    |   3099 ms   |   2598 ms   |   19%   |
|-------------|-------------|-------------|---------|
|   Test 3    |   3509 ms   |   2890 ms   |   21%   |
+-------------+-------------+-------------+---------+

This is a really nice speedup for a single-threaded embedding, and it shows the benefits of the patch on old hardware with a good compiler.
(In reply to comment #88)
> does gcc 2 not support computed gotos? 
> http://gcc.gnu.org/onlinedocs/gcc-2.95.3/gcc_4.html#SEC64
> http://computing.ee.ethz.ch/sepp/egcs-1.1.2-to/gcc.html#SEC61

It does, but when I first checked in (and broke Windows due to #else brain damage) btek slowed down.  So I tried __GNUC__ >= 3 instead, hoping to put btek and its ilk in the (fixed) #else clause that used a loop around a switch, as before.  But if btek slowed down with that test, then I don't know why -- either it is still getting the threaded interpreter code, or somehow the #else clause code is slower than before.

Clues welcome.  Igor's data is important, it shows that we should not be catering to old compilers, only to old computers.  We have the ability to recompile, since this is a new patch and no one supports the broken old gcc/egcs compiler (AFAIK).

/be
luna is gcc 3.2 and it too slowed down a bit, which makes me wonder if this change is a real world win (assuming our Tp test is still representative of the web world out there).

Btw, I'm all for giving btek a new compiler (and maybe OS upgrade?) to play with.
For the record, I do my builds with Visual C++ 2003.
(In reply to comment #91)
> luna is gcc 3.2 and it too slowed down a bit, which makes me wonder if this
> change is a real world win (assuming our Tp test is still representative of the
> web world out there).

What gcc version has the tree-ssa stuff?  Could it be that that's required to see the gain?  Or is this not a compiler issue so much as a machine (icache, e.g.) issue?

/be
(In reply to comment #93)
> Or is this not a compiler issue so much as a machine (icache, e.g.)
> issue?

The last form of the patch does not remove the switch statement for the threaded case. It remains in the code with all related case labels but is no longer used. Perhaps this is what confuses older versions of GCC, which cannot eliminate the dead code?
(In reply to comment #95)
> (In reply to comment #93)
> > Or is this not a compiler issue so much as a machine (icache, e.g.)
> > issue?
> 
> The last form of the patch does not remove the switch statement for the
> threaded cases. It remains in the code with all related case labels but is no
> longer used. Perhaps this is what confuses older versions of GCC where it can
> not eliminate the dead code?

It can't eliminate the dead code, but it shouldn't execute anything to do with the switch.  The slightly larger icache footprint from any such switch code should not be enough to hurt perf.  Something else is going wrong.  But first, who here sees a perf *win* with the current patch using GCC?  Igor does.  What is different about his machine and GCC compared to btek's or luna's?

/be
(In reply to comment #96)
> The slightly larger icache footprint from any such switch code
> should not be enough to hurt perf.

But why is the dead code there? I.e., why is it necessary to keep the switch/case for the threaded case?


(In reply to comment #97)
> (In reply to comment #96)
> > The slightly larger icache footprint from any such switch code
> > should not be enough to hurt perf.
> 
> But why the dead is there? I.e. why it is necessary to keep switch/case for the
> threaded case?

You're right, it's not.  At one point I was going to leave the while/switch loop for the debugger API, but then I thought of the interruptJumpTable hack.  Patch to remove any switch (and any doubt that this is causing some perf problem) in a trice!
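
The shape of that hack, sketched with toy opcodes (this is my reading of the
idea; the actual change is in the patch below):

static void run(const unsigned char *pc, int (*hook)(void))
{
    static void *normalTable[]    = { &&op_nop, &&op_halt };
    static void *interruptTable[] = { &&interrupt, &&interrupt };
    /* Installing a hook just swaps the table pointer; the normal path then
       pays no per-opcode test at all. */
    void **jumpTable = hook ? interruptTable : normalTable;

#define NEXT() goto *jumpTable[*pc]

    NEXT();
  interrupt:
    if (hook && !hook())
        return;                  /* the hook asked us to abort */
    goto *normalTable[*pc];      /* resume at the real handler for this op */
  op_nop:
    pc++;
    NEXT();
  op_halt:
    return;
#undef NEXT
}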

/be
Attached patch no switch in JS_THREADED_INTERP (deleted) — Splinter Review
Attachment #203867 - Flags: review?(igor.bukanov)
Attachment #203867 - Flags: review?(igor.bukanov) → review+
To whoever's interested, I've taken a copy of the jsinterp.c file from before these checkins, applied both patches from this bug, and manually removed the threaded code and replaced the macros, to make it easier to see what changes were made for the non-threaded case. Hopefully this will let someone see where things slowed down.
Comment on attachment 203610 [details] [diff] [review]
patch I checked in

>Index: jsinterp.c
>+#define SAVE_SP_AND_PC(fp) (SAVE_SP(fp), (fp)->pc = pc)

Does this want a more general name (such as UPDATE_FP_STATE) so if there is more deferred storage to fp, we don't need to change the macro name each time?

r=mrbkap, either way.
Attachment #203610 - Flags: review?(mrbkap) → review+
(In reply to comment #101)
> (From update of attachment 203610 [details] [diff] [review] [edit])
> >Index: jsinterp.c
> >+#define SAVE_SP_AND_PC(fp) (SAVE_SP(fp), (fp)->pc = pc)
> 
> Does this want a more general name (such as UPDATE_FP_STATE) so if there is
> more deferred storage to fp, we don't need to change the macro name each time?

FP sounds like an acronym, Floating Point.  UPDATE connotes other things than what is going on here.  This is low-level code, and it should not use much abstraction.  The abstraction leaks will only hurt worse the more abstract it gets.  But more to the point, I don't expect or want any additional mutable-over-script-execution VM registers being homed in a stack frame.

Jag: do you know of other machines than btek and luna for which this patch seems to have regressed performance?  Cc'ing dbaron.

/be
The other tinderboxen I looked at didn't show a (clear) slowdown. On btek there's a peak covering the first time you checked in and subsequently backed out, and it goes up again the second time you checked in. luna shows a similar peak, but it could also just be noise. The slowdown is < 1%, so not a big deal (IMO), but interesting, since it shouldn't be there, and ideally there'd be a speedup.
Flags: testcase? → testcase-
QA Contact: pschwartau → general
Priority: P1 → --
Target Milestone: mozilla1.9alpha1 → ---
Summary: JS could use some focused partial evaluation → JS indirect threaded interpreter
This bug is patched and I should not morph it any more. The summary was grandiose before I fixed it just now to reflect what landed. I've learned my lesson about one patch per bug (several times over).

/be
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED