Closed Bug 1211795 Opened 9 years ago Closed 6 years ago

Calling another compartment's Array constructor is too slow.

Categories

(Core :: JavaScript Engine: JIT, defect, P3)


RESOLVED FIXED

People

(Reporter: daniel, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36

Steps to reproduce:

Subclassed an Array for an in-production module. 

Subclassing method: iframe injection

// Create an iframe to obtain a fresh Array constructor.
var subClasser = document.createElement("iframe");
document.body.appendChild(subClasser);
var currentFrame = frames.length - 1;

// Write a script that passes the iframe's own Array type to a function in
// the parent's global scope. Within that function you can safely extend the
// prototype of the passed-in Array without polluting the global Array.
frames[currentFrame].document.write(
    "<script>parent.InjectorFunction(Array);<\/script>"
);

// Remove the iframe now that its Array constructor has been captured.
subClasser.parentNode.removeChild(subClasser);
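The parent-page InjectorFunction that the injected script calls is not shown in the report. A minimal sketch of what such a function might look like (the body and the `last` helper are assumptions; only the name comes from the snippet above):

```javascript
// Hypothetical receiver for the iframe's Array constructor. Extending this
// prototype only affects instances created from the passed-in constructor,
// not the parent page's own global Array (when the argument really is the
// iframe's Array).
function InjectorFunction(ForeignArray) {
  ForeignArray.prototype.last = function () {
    return this[this.length - 1];
  };
  // Return the augmented constructor so the caller can keep a reference
  // after the iframe is removed.
  return ForeignArray;
}
```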




Actual results:

Instantiating the subclassed array is much slower in Firefox than expected
(compared with IE 11, Chrome, and Safari).

See the jsPerf test: http://jsperf.com/array-performance-test-sub


Expected results:

Speedy laser shark infested awesomeness
What the jsPerf test does is something like the following:

  // preparation
  var subClasser = document.createElement("iframe");
  document.body.appendChild(subClasser);
  var SubArray = subClasser.contentWindow.Array;

  // test
  var myArray = new SubArray();

So, SubArray is an Array constructor function in another compartment, and the issue is that calling such function is too slow.
Component: Untriaged → JavaScript Engine
Product: Firefox → Core
Summary: Performance issue when Instantiating subclassed array. → Calling another compartment's Array constructor is too slow.
Version: 41 Branch → Trunk
Status: UNCONFIRMED → NEW
Ever confirmed: true
Right, because this is going through a cross-compartment wrapper, which is much slower than a normal function call.

We really need better JIT support for transparent cross-compartment wrappers.  Though note that in this case there's a limit on how fast this can be compared to the same-compartment case, since it needs to allocate two objects: the new array and a new cross-compartment wrapper for it.  It will also need to do at least the work of computing that cross-compartment wrapper, though we may be able to optimize _that_ a bit since we know we're creating a new object...  Compared to all that, the normal |new Array| allocation path is actually inlined in the JIT, so is _extremely_ fast.
Component: JavaScript Engine → JavaScript Engine: JIT
Depends on: 907369
I agree we should look into ways to make the most common operations on CCWs not totally suck. I think they'll always be slower but as long as it's not totally horrible that's probably fine. Maybe we could use a specialized VM function or an IC to get rid of some of the layers; we should investigate that.
That said, I tried profiling this just to see what it looked like and compared to the case of just allocating an Array and setting its first element (to keep the Array allocation from being entirely optimized out).  The subarray case is about 150x slower, just for scale, so 1% of the subarray case is about the entirety of the Array case.

Some interesting things:

1)  In the SubArray case, about 1/3 of the time is under InterruptCheck doing js::Nursery::collect.  This is about half GenericBuffer::trace calling WrapperMapRef::trace, about 1/4 Nursery::sweep calling _platform_bzero (via memset, I expect), and about 1/4 MonoTypeBuffer<ValueEdge>::trace calling TenuringTracer::moveToTenured.  In the Array case, I see no GC activity at all, at least over the 5s I ran the test.  Terrence, what's up with that?

2)  The remaining 2/3 of the time is jit::InvokeFunction calling, eventually, proxy_Construct (via js::Construct, InternalConstruct).  Getting there takes a few percent of the time.  Then another percent or two to get to js::CrossCompartmentWrapper::construct.  The actual DirectProxyHandler::construct case is about 1/6 of the total time; the rest (about 1/3 of the total time) is JSCompartment::wrap.

The CrossCompartmentWrapper::construct lands us right back in js::Construct but now for the proxy target, which goes and calls ArrayConstructor, which ends up doing js::Allocate and all that jazz.  We end up with _more_ gc under js::Allocate, which is where a lot of the time here is spent.

JSCompartment::wrap spends a bunch of time under JSCompartment::putWrapper and even more under Wrapper::New; no obvious GC here, at least....
Oh, the test I used is:

  // preparation
  var subClasser = document.createElement("iframe");
  document.body.appendChild(subClasser);
  var SubArray = subClasser.contentWindow.Array;

  // test
  function test() {
    var start = new Date;
    var count = 1000000;
    for (var i = 0; i < count; ++i) {
      var myArray = new SubArray();
      // var myArray = new Array();
      // myArray[0] = 1;
    }
    var stop = new Date;
    alert((stop - start) / count * 1e6);                    
  }
  test();
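The Date-based loop above can be factored into a reusable helper; this is a sketch assuming performance.now() is available (it is in browsers and in modern Node), which gives sub-millisecond resolution:

```javascript
// Run fn count times and report the mean cost per call in nanoseconds.
// performance.now() returns milliseconds, so ms * 1e6 = ns.
function benchmark(fn, count) {
  const start = performance.now();
  for (let i = 0; i < count; ++i) fn();
  const stop = performance.now();
  return ((stop - start) / count) * 1e6;
}
```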
Depends on: 1211944
Thanks for the extremely fast response! It's much appreciated.
Clicking through bugs, I see:

(In reply to Boris Zbarsky [:bz] from comment #4)
> That said, I tried profiling this just to see what it looked like and
> compared to the case of just allocating an Array and setting its first
> element (to keep the Array allocation from being entirely optimized out). 
> The subarray case is about 150x slower, just for scale, so 1% of the
> subarray case is about the entirety of the Array case.
> 
> Some interesting things:
> 
> 1)  In the SubArray case, about 1/3 of the time is under InterruptCheck
> doing js::Nursery::collect.  This is about half GenericBuffer::trace calling
> WrapperMapRef::trace, about 1/4 Nursery::sweep calling _platform_bzero (via
> memset, I expect), and about 1/4 MonoTypeBuffer<ValueEdge>::trace calling
> TenuringTracer::moveToTenured.  In the Array case, I see no GC activity at
> all, at least over the 5s I ran the test.  Terrence, what's up with that?

ni? to Terrence; maybe this has already been discussed someplace else, but it'd be nice to have it in the bug?
Flags: needinfo?(terrence)
(In reply to Boris Zbarsky [:bz] from comment #4)
> That said, I tried profiling this just to see what it looked like and
> compared to the case of just allocating an Array and setting its first
> element (to keep the Array allocation from being entirely optimized out). 
> The subarray case is about 150x slower, just for scale, so 1% of the
> subarray case is about the entirety of the Array case.
> 
> Some interesting things:
> 
> 1)  In the SubArray case, about 1/3 of the time is under InterruptCheck
> doing js::Nursery::collect.

Ouch! In theory we shouldn't be doing more nursery collection as these create the same number of nursery objects. In practice, the CCW map is going to be creating store buffer edges that are going to result in FULL_STORE_BUFFER collections. In general, the nursery wants to treat all edges as strong so that collection work is not proportional to the number of dead objects. There are exceptions though: wrapper cache and typed arrays. It sounds like we need a third, which means we need to grow a generic mechanism.

> This is about half GenericBuffer::trace calling
> WrapperMapRef::trace,

Yes. The generic buffer is slow. Converting the wrapper map to two-level hashing should fix this (or at least move the performance cliff elsewhere). That's currently blocked on making CrossCompartmentKey into a mozilla::Variant. This is blocked on us having generic lambdas; otherwise we need to add ~5 new functors that are almost-but-not-quite the same as an existing set of functors for DispatchTyped usage. This is blocked on us getting MSVC2015, which finally adds support for those. And to a certain extent on GCC 4.9, which IIRC was crashing if we used a certain decltype that forces the Variant match functor to use a slightly different calling convention than the other functors. That's all in the works now though, finally.

> about 1/4 Nursery::sweep calling _platform_bzero (via
> memset, I expect),

This should not be happening. We only overwrite the nursery when poisoning and poison bytes are non-zero. I'm not sure what's going on there. Do you have a stack? 

> and about 1/4 MonoTypeBuffer<ValueEdge>::trace calling
> TenuringTracer::moveToTenured.

I'd guess the cross-compartment proto edge is causing the wrapper map to hold the entire minor heap live. I'm kinda surprised that this isn't a larger percentage of the time.

> In the Array case, I see no GC activity at
> all, at least over the 5s I ran the test.  Terrence, what's up with that?

This doesn't seem plausible. The nursery maxes out at 16MiB (and needs 5 collections to get there in the first place). On 32-bit, an Object4 is 48 bytes, which means we'd be using a minimum of 48MiB over the course of the test. If there was system activity before, the nursery could already be expanded, but that's still over the limit.

Maybe the elements are COW? I thought that was only for larger element vectors, but if that's the case then the objects would only be 16MiB. That still isn't going to fit because of heap overhead, but at least is closer to plausible.

The only other possibility is that these arrays are getting allocated Tenured. That doesn't seem that likely though if we're running in the jit the entire time.

> 2)  The remaining 2/3 of the time is jit::InvokeFunction calling,
> eventually, proxy_Construct (via js::Construct, InternalConstruct).  Getting
> there takes a few percent of the time.  Then another percent or two to get
> to js::CrossCompartmentWrapper::construct.  The actual
> DirectProxyHandler::construct case is about 1/6 of the total time; the rest
> (about 1/3 of the total time) is JSCompartment::wrap.
> 
> The CrossCompartmentWrapper::construct lands us right back in js::Construct
> but now for the proxy target, which goes and calls ArrayConstructor, which
> ends up doing js::Allocate and all that jazz.  We end up with _more_ gc
> under js::Allocate, which is where a lot of the time here is spent.

I think all proxies have a finalizer right now, so this is going to be tenured allocations, which are considerably slower than nursery allocations.

> JSCompartment::wrap spends a bunch of time under JSCompartment::putWrapper
> and even more under Wrapper::New; no obvious GC here, at least....

This might be map overhead.
Flags: needinfo?(terrence)
ni? bz for questions in comment 8.
Flags: needinfo?(bzbarsky)
> Do you have a stack? 

This is the only question in comment 8, right?  The stack is pretty simple:

 js::jit::InvokeFunction
 js::jit::InterruptCheck
 JSRuntime::handleInterrupt
 js::gc::GCRuntime::gcIfRequested
 js::gc::GCRuntime::minorGC
 js::Nursery::collect
 js::Nursery::sweep
 _platform_bzero$VARIANT$Unknown

In this case, the bzero call takes 350ms out of a total running time of 5800ms.

What I don't have are line numbers.  Could this be cellsWithUid_.clear()?  Or the initChunk calls?

I tried breakpointing in a debugger, and if it's not lying to me there are bzero calls coming from js::NewObjectCache::clearNurseryObjects, which is called from js::Nursery::collect.  So it's possible the profiler is mis-blaming those hits and they actually belong in collect?  Doing this in opt builds is no fun.
Flags: needinfo?(bzbarsky)
Depends on: 1245965
Blocks: 1245974
Depends on: 1237058
Priority: -- → P3

OK, so with the CCW overhead eliminated, this is about 6x faster, but still much slower than creating an Array from the same global.

I created profiles for the Array case (https://perfht.ml/2SBI9DW) and the SubArray case (https://perfht.ml/2Sx4sKD) using the testcase in comment 5. The latter shows we're calling js::ArrayConstructor and then doing a bunch of work with ObjectGroup, etc, etc. The former shows us calling js::jit::NewArrayWithGroup directly from jitcode.

Presumably there's some same-realm assumption baked in there (e.g. we'd need to change Realm to call NewArrayWithGroup from jitcode in the SubArray case). Worth a separate bug blocking this one?

Flags: needinfo?(jdemooij)
Depends on: 1527822

(In reply to Boris Zbarsky [:bzbarsky, bz on IRC] from comment #11)

OK, so with the CCW overhead eliminated, this is about 6x faster, but still much slower than creating an Array from the same global.

I filed bug 1527822.

Flags: needinfo?(jdemooij)
Depends on: 1527843

Can we close this now?

Flags: needinfo?(bzbarsky)

We should at least fix bug 1527822 and then remeasure. And decide whether we care at all about the remaining cross-compartment case.

Flags: needinfo?(bzbarsky)

Ah, I missed the other bug - sorry about that.

Now I get:

Browser     Time
Firefox 65  513
Safari      57
Chrome      21
Nightly     19

Without the myArray[0] = 1:

Browser     Time
Firefox 65  427
Safari      32
Chrome      20
Nightly     2

(Yes we're likely optimizing away the unused allocation.)

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED