Closed Bug 581893 Opened 14 years ago Closed 14 years ago

build a js::Invoke Gatling gun for str_replace / array_sort / array_extra

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: luke, Assigned: luke)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 11 obsolete files)

When doing an array_sort, str_replace, or array_extra, js::Invoke is called with the same callee/this on the same number of arguments many many times.  We already avoid the overhead of pushing the arguments for each invocation using the pattern:

pushInvokeArgs
for (...) {
  set args
  js::Invoke
}
popInvokeArgs

We could take this much farther by factoring more work out of the loop.  That is, in preparing to call js::Invoke many times, we can indicate the callee and number of actual arguments and dynamically choose a specialization of js::Invoke to take advantage of the dynamic properties of the callee.  Kinda like a PIC.  Furthermore, a single stack frame can be pushed and reused for each js::Invoke call.  In the best case, Invoke could be reduced to little more than a ptr-to-fun call (to the dynamically-chosen specialization) and a single memcpy to initialize the stack from a prototype built during the preparation step.

Shark shows all of SS spending 1% (~5ms) in Invoke + InvokeCommon called from str_replace and array_sort.  Looks like dmandelin already identified us as 2x slower doing Array.sort in bug 578226.  5ms is not big, so not a high priority, but, as bug 578226 points out, when its a hot path we are 2x slower than JSC.  Of course, JaegerCalls will help the situation too.

V8 has non-trivial Invoke overhead too, but of the single-shot mostly from apply/call, so no wins there.
fwiw, JM jits the callback function, so the impact of the invoke overhead is proportionally more there than on TM....
Even in case of TM we will invoke the JM-ied callback function eventually.
Assignee: general → apierce
Status: NEW → ASSIGNED
Attached patch WIP Patch (obsolete) (deleted) — Splinter Review
This passes all the tests, but doesn't optimize cases like native functions, and the fast invoke call can probably be made a lot faster. It's about a 1% SS win, and will hopefully be more by the time the patch is done.
Attached file Sunspider comparison (obsolete) (deleted) —
Awesome!  In addition to measuring progress on SS, it might be useful to measure progress on a few micro-benchmarks.
Attached patch Patch, v1 (obsolete) (deleted) — Splinter Review
This patch is for TM (I also have a version for JM, if it's better to apply the patch there). It passes trace tests and jstests. I don't have try server access anymore, so I wasn't able to test that. A few notes:

The location of the code is a little weird. BatchInvokeSession's declaration is in jscntxt.h,  since it needs InvokeArgsGuard and InvokeFrameGuard, while the function definitions are in jsinterp.cpp, since I think it's better to have all invoke code in jsinterp.cpp.

I tried with and without a memcpy to set the stack frame entries, and it's faster without the memcpy.

My main concern with this patch is that between the calls to prepare() and invoke(), the state of the stack/registers is weird, since we've pushed a stack frame, but aren't immediately going to use it. In this state, array_extra calls GetArrayElement, which can eventually call js_InferFlags (called from CallResolveOp), which uses cx->regs->pc, so we can't have an invalid pc. However, we also can't just temporarily give cx->fp and cx->regs their previous values because in some places (e.g. ValueToNumber in sort_compare) we need to do a JS function call while a BatchInvokeSession is prepared, which would cause us to clobber the arguments and the frame (this was causing an assert, although there might be a way to get around it). My solution was to have the prepare() create a full stack frame just like is done before every invoke. This would mess up js_InferFlags, I think, since we use a different cx->regs-pc than before, but I'm not sure if its correctness was necessary in these cases.

Perf updates in a soon-to-follow commment.
Attachment #464257 - Attachment is obsolete: true
Attachment #466690 - Flags: review?(lw)
Attached file Sunspider comparison (obsolete) (deleted) —
After I rebased on top of YARR and some stack frame changes, some other parts of unpack-code seemed to get slower from my patch, so the total win on unpack-code wasn't as much. I spent a while trying to figure it out, with not much success. I've seen unpack-code have weird fluctuations like this before, so it may just be little differences in the optimizer or cache behavior. So, the win on unpack-code may or may not be more than what's shown here. Also, the wins in places other than tagcloud and unpack-code can probably be ignored.


In addition to the sunspider results attached, here's the running time on a few microbenchmarks:

Calling replace on a string of length 2^20, with the identity function as a lambda:

Before: 167ms
After:  149ms


Sorting an initially-reversed list with 1000000 elements, specifying a regular integer comparison lambda:

Before: 925ms
After:  739ms

So, the patch doesn't hugely speed up either string replace or sort, but it's enough to be significant.
Attachment #464258 - Attachment is obsolete: true
I'm sorry for the delay.  I've been finishing some Invoke-changing patches (bug 581263 and bug 539144) which will both simplify this patch and change its performance impact.
Looks like I'm going to be taking this and rebasing it on top of the argv removal patch if there are no objections.
Assignee: alangpierce → cdleary
bhackett said he would like to take this while I bang my head against regexp stuff.
Assignee: cdleary → bhackett1024
Stealing back with permission.
Assignee: bhackett1024 → lw
Attached patch WIP (obsolete) (deleted) — Splinter Review
Rebased and updated.  I was able to rip a bit more out of the hot invoke path and there may be an ounce or two of perf left still -- I did a few things for code simplicity that I could gross for speed.  Micro-benchmarks of forEach/sort/replace show a 20-45% speedup.  SS shows nice speedup (-j -m) for the two benchmarks that actually Invoke a lot:

  string:              1.044x as fast    105.8ms +/- 0.5%   101.3ms +/- 0.5%
    base64:            -                   5.7ms +/- 4.0%     5.4ms +/- 4.3%
    fasta:             -                  21.1ms +/- 1.1%    20.9ms +/- 1.0%
    tagcloud:          1.027x as fast     27.0ms +/- 0.6%    26.3ms +/- 1.0%
    unpack-code:       1.091x as fast     41.8ms +/- 0.6%    38.3ms +/- 1.0%
    validate-input:    ??                 10.3ms +/- 2.2%    10.4ms +/- 2.3%

For an overall 1.8% (6.5ms) speedup.

V8 is mostly unaffected (no hot Invoke), but v8-raytrace is showing a 10% regression, which is weird, since that shouldn't be affected at all, so that needs investigation.
Attachment #466690 - Attachment is obsolete: true
Attachment #466692 - Attachment is obsolete: true
Attachment #466690 - Flags: review?(lw)
(In reply to comment #12)
> V8 is mostly unaffected (no hot Invoke), but v8-raytrace is showing a 10%
> regression, which is weird, since that shouldn't be affected at all, so that
> needs investigation.

Ah, it was a poltergeist.  No measurable change now.  tagcloud/unpack-code speedup remains.
Attached patch patch (obsolete) (deleted) — Splinter Review
(With bug 587707 landed) I was able to avoid clearing rval on each iteration which shaves off maybe .5ms on unpack-code.
Attachment #479212 - Attachment is obsolete: true
Attachment #479485 - Flags: review?(jwalden+bmo)
Attached patch patch (obsolete) (deleted) — Splinter Review
Temporarily put back the rval-setting to avoid doing a 10% job of what bug 593882 is going to do properly; once that lands it will be trivial to remove.
Attachment #479485 - Attachment is obsolete: true
Attachment #479850 - Flags: review?(jwalden+bmo)
Attachment #479485 - Flags: review?(jwalden+bmo)
Attached patch faster (obsolete) (deleted) — Splinter Review
This patch optimizes access of session arguments to not go through fp->canonicalActualArgs() which gives a 7% boost on a str_replace micro-benchmark.
Attachment #479850 - Attachment is obsolete: true
Attachment #479862 - Flags: review?(jwalden+bmo)
Attachment #479850 - Flags: review?(jwalden+bmo)
Attached patch fix assert (obsolete) (deleted) — Splinter Review
Fixed silly assert.  Looking green on try server.
Attachment #479862 - Attachment is obsolete: true
Attachment #479953 - Flags: review?(jwalden+bmo)
Attachment #479862 - Flags: review?(jwalden+bmo)
Gary, Jesse: the code paths added by this patch are only exercised by String.replace (with a function replace arg), Array.sort (with a function comparator), and the Array extras (filter, forEach, every, map, some, reduce, reduceRight).  Do you suppose there is any way to get some fuzzing that pounds on usage of these?
Comment on attachment 479953 [details] [diff] [review]
fix assert

># HG changeset patch
># User Luke Wagner <lw@mozilla.com>
># Date 1285712623 25200
># Node ID ed71a2c842fbd8e99c8f9a331ef739417b669d02
># Parent  a307ab9638b45ce02a4b08ed5ab034c5527a9ec9
>try: -b do -p linux,macosx64 -m none -u all -t none

I think your commit message needs work.  :-P


>diff --git a/js/src/jsarray.cpp b/js/src/jsarray.cpp

>+    InvokeSessionGuard &session = ca->session;
>+    session[0] = *av;
>+    session[1] = *bv;

Range-checking of argument initialization?  BRILLIANT!


>diff --git a/js/src/jsinterp.cpp b/js/src/jsinterp.cpp

> JS_REQUIRES_STACK bool
> Invoke(JSContext *cx, const CallArgs &argsRef, uint32 flags)
> {
>+    /* N.B. Must be kept in sync with InvokeSessionGuard::start/invoke */

This is going to be an ongoing nightmare.  :-(


> bool
>+InvokeSessionGuard::start(JSContext *cx, const Value &calleev, JSObject *thisp, uintN argc)

I will admit my eyes glazed over somewhat reading this.  It could be reasonable; it does roughly seem to follow what Invoke does.  But I see one issue that, in concert with a nearby comment, makes me wonder: is where this-computation happens correct?  Is it after get/pushInvokeFrame or before?  I *thought* that was observable in terms of scope chain for the thisObject hook, but maybe I'm just wrong and am misunderstanding the "Currently, this must happen after the frame is pushed and fp->scopeChain is correct because the thisObject hook may call JS_GetScopeChain." comment in my tree.


>+        /*
>+         * Use the optimized invoke path.
>+         *
>+         * Set (callee, this) once for the session.
>+         * N.B. set before getInvokeFrame does any arg overflow duplication.
>+         */
>+        args_.callee() = calleev;
>+        // FIXME bug 592992: hoist thisObject hook to ExternalInvoke
>+        if (thisp) {
>+            thisp = thisp->thisObject(cx);
>+            if (!thisp)
>+                return false;
>+            JS_ASSERT(IsSaneThisObject(*thisp));
>+            args_.thisv().setObject(*thisp);
>+        } else {
>+            args_.thisv().setNull();
>+        }
>+
>+        /* Push the stack frame once for the session. */
>+        uint32 flags = 0;
>+        if (!stack.getInvokeFrame(cx, args_, fun, script, &flags, &frame_))
>+            return false;
>+
>+        JSStackFrame *fp = frame_.fp();
>+        fp->initCallFrame(cx, calleev.toObject(), fun, argc, flags);
>+
>+        stack.pushInvokeFrame(cx, args_, &frame_);


>diff --git a/js/src/jsinterp.h b/js/src/jsinterp.h

> /*
>+ * Natives like sort/forEach/replace call Invoke repeatedly with the same
>+ * callee, this, and number of arguments. To optimize this, such natives can
>+ * start an "invoke session" to factor out much of the dynamic setup logic
>+ * required by a normal Invoke. Usage is:
>+ *
>+ *   InvokeSessionGuard session(cx);
>+ *   if (!session.start(cx, callee, thisp, argc, &session))
>+ *     ...
>+ *
>+ *   while (...) {
>+ *     // write actual args (not callee, this)
>+ *     session[0] = ...
>+ *     ...
>+ *     session[argc-1] = ...

Spaces around -.


Address the thisObject-calling-location question, and I think you'll be good to go.
Attachment #479953 - Flags: review?(jwalden+bmo) → review-
Attached patch patch (obsolete) (deleted) — Splinter Review
Good catch Waldo.  I originally had that in the right place (after the push), but then I forgot and moved it before because args.thisv() is the non-canonical-this when fp->hasOverflowArgs().

The mjit calling interface changed a bit so I changed things around a bit too.  I also moved a few other bits out of the 'invoke' path and onto the 'start' path.
Attachment #479953 - Attachment is obsolete: true
Attachment #481383 - Flags: review?(jwalden+bmo)
Attached patch with lazy allocation (obsolete) (deleted) — Splinter Review
Attachment #481383 - Attachment is obsolete: true
Attachment #481731 - Flags: review?(jwalden+bmo)
Attachment #481383 - Flags: review?(jwalden+bmo)
Attached patch this time, the gatling patch (deleted) — Splinter Review
Ignore that last patch, wrong bug.
Attachment #481731 - Attachment is obsolete: true
Attachment #481732 - Flags: review?(jwalden+bmo)
Attachment #481731 - Flags: review?(jwalden+bmo)
Comment on attachment 481732 [details] [diff] [review]
this time, the gatling patch

One other minor change: don't pass JSObject* thisp, pass const Value& in anticipation of the strict-this patch landing.  (Or, probably better, wait until after it lands and then you'll be forced to make the change.  Do note that these functions should have |undefined| passed as their |this| value, if no other such value has been specified.)
Attachment #481732 - Flags: review?(jwalden+bmo) → review+
(In reply to comment #24)
Thanks, changed.

http://hg.mozilla.org/tracemonkey/rev/0caecf667343
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/0caecf667343
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: