Open Bug 1836679 Opened 1 year ago Updated 1 year ago

Object.keys can be 30x slower in SM

Categories

(Core :: JavaScript Engine: JIT, defect, P2)

Tracking

ASSIGNED

People

(Reporter: jrmuizel, Assigned: mgaudet)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [sp3])

Attachments

(2 files, 2 obsolete files)

Object.keys performance shows up in the heavily used shallowEqual from React. This is a microbenchmark that attempts to explore the performance characteristics in different engines.
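For context, React's shallowEqual is roughly the following (paraphrased sketch, not the exact upstream source): it calls Object.keys on both arguments on every comparison, which is why Object.keys cost dominates for the small plain objects compared on each render.

```javascript
// Paraphrased sketch of React's shallowEqual. Object.keys is called on
// both operands, so its cost is paid twice per comparison.
function shallowEqual(objA, objB) {
  if (Object.is(objA, objB)) return true;
  if (typeof objA !== 'object' || objA === null ||
      typeof objB !== 'object' || objB === null) {
    return false;
  }
  const keysA = Object.keys(objA);
  const keysB = Object.keys(objB);
  if (keysA.length !== keysB.length) return false;
  for (let i = 0; i < keysA.length; i++) {
    if (!Object.prototype.hasOwnProperty.call(objB, keysA[i]) ||
        !Object.is(objA[keysA[i]], objB[keysA[i]])) {
      return false;
    }
  }
  return true;
}
```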

if (!globalThis.performance) {
    globalThis.performance = { now: () => preciseTime()*1000 }
}

function f(o) {
    return Object.keys(o)
}

function bench(o) {
    for (var i = 0; i<1000000; i++) {
        f(o);
    }
}

o = {about: 5,
    ballisitic: 6,
    cakebread: 8,
    dalespeople: 9,
    evilproof: 20,
    fairgoing: 30,
    gargoylish: 2,
    harmonici: 1,
    jinniwink: 12,
    kaleidoscopical: 2,
    labellum: 1,
    macadamization: 4,
    neutrino: 1,
    observership: 0,
    quadratomandibular: 9,
    rachicentesis: 1,
    saltcat: 0,
    trousseau: 1,
    view: 10,
    wheelbox: 2,
    xerography: 1,
    yez: 3
}
var start = performance.now();
bench(o)
var end = performance.now();
print(`took ${end - start}ms`);
$ js --ion-inlining=off object-keys.js
took 305.216064453125ms
$ jsc  --maximumInliningDepth=1 ~/tools/jsbench/object-keys.js
took 10.8701171875ms
$ d8 --no-turbo-inlining object-keys.js
took 26.507ms

Increasing the number of properties in o makes SpiderMonkey and V8 slower but does not seem to have an impact on JSC.

Blocks: 1827268

I suspect https://bugs.webkit.org/show_bug.cgi?id=190047 is responsible for most of JSC's speed advantage here.

Whiteboard: [sp3]
Flags: needinfo?(jdemooij)
Severity: -- → S3
Priority: -- → P2

Profiling the state of this right now, a surprising amount of time is spent growing the properties vector: https://share.firefox.dev/3P7usbM

If I arbitrarily reserve 100 elements before running the benchmark, I can speed it up by 37%:

$  hyperfine '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'PREALLOC=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'
Benchmark 1: /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     351.2 ms ±   4.3 ms    [User: 352.6 ms, System: 9.8 ms]
  Range (min … max):   346.0 ms … 359.0 ms    10 runs
 
Benchmark 2: PREALLOC=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     255.8 ms ±   1.2 ms    [User: 255.8 ms, System: 9.6 ms]
  Range (min … max):   253.8 ms … 257.3 ms    11 runs
 
Summary
  'PREALLOC=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    1.37 ± 0.02 times faster than '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

AFAICT accurately pre-counting the properties is slightly grosser -- pre-reserving still seems like a worthwhile patch regardless, but I'll put it in another bug (where we can discuss accurate pre-counting vs. just using a magic constant).

Depends on: 1838529

It's worth checking how common it is to have more than 8 keys (the number of elements the vector can store inline I think). I added some logging a few weeks ago and IIRC the vast majority of calls had < 8 keys, so this benchmark might not be super representative of most of the actual calls.

Flags: needinfo?(jdemooij)

(That said, increasing from 8 to 16 is an easy improvement and probably worth it!)

(In reply to Jan de Mooij [:jandem] from comment #3)

It's worth checking how common it is to have more than 8 keys (the number of elements the vector can store inline I think). I added some logging a few weeks ago and IIRC the vast majority of calls had < 8 keys, so this benchmark might not be super representative of most of the actual calls.

Yeah, the main purpose of this benchmark is to show that JSC caches the results of Object.keys() and takes O(1) time on objects with a cached result, instead of O(number of keys) as SpiderMonkey does.

Oh yeah, that's totally understood -- it was just a notable problem when looking at a profile, and a quick thing to potentially fix.

So, one hypothesized road forward here is the consumption of cached native iterators created by for-in.

I have built a very preliminary prototype patch that does exactly this to evaluate the potential win we could see here. It's built on top of the patch in Bug 1838529 to avoid the reallocation issues we'd seen elsewhere.

In this prototype patch we directly consume the for-in cache if it exists; so the microbenchmark on this bug was tweaked to walk the object once with 'for-in' to prime that cached iterator.
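The tweak amounts to something like the following (the function name is illustrative): walk the object once with for-in so the engine builds and caches a native iterator for its shape before the timed Object.keys loop runs.

```javascript
// Prime the for-in cache for o's shape: the loop body doesn't matter,
// the enumeration itself is what materializes the cached iterator.
function primeForInCache(o) {
  let n = 0;
  for (const k in o) {
    n++;
  }
  return n; // number of enumerated keys, handy for sanity-checking
}
```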

The prototype patch is not sound, since we don't handle prototype properties at all; it merely says "if there's a pre-existing for-in iterator, put those properties in, then create the keys array." In this sense, it is almost the best-case scenario we can hope for performance-wise.

The results are... underwhelming.

$ hyperfine '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 
Benchmark 1: /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     263.8 ms ±   1.0 ms    [User: 265.3 ms, System: 7.6 ms]
  Range (min … max):   262.4 ms … 265.6 ms    11 runs
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     221.7 ms ±   2.9 ms    [User: 221.3 ms, System: 8.0 ms]
  Range (min … max):   217.7 ms … 228.9 ms    13 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    1.19 ± 0.02 times faster than '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

So a 20% improvement on an already synthetic microbenchmark is all this gives us. Furthermore, we have more work to do in order to filter out prototype properties (fortunately, I think this is comparatively easy -- we can divide the range that comes out of our enumerators into 'own properties' and 'proto properties', so with a single index/offset we can disambiguate own vs proto properties).
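The own-vs-proto partitioning idea can be modeled in JS (purely illustrative; the real enumerator lives in C++, and the names here are made up): one flat list of enumerated property names plus a single index marking where prototype properties begin, so the own keys come back with a single slice.

```javascript
// Toy model of the proposed split: own enumerable names first, then
// proto-chain names, with ownEnd as the single index/offset that
// disambiguates own vs. proto properties.
function buildEnumList(obj) {
  const names = Object.keys(obj);   // own enumerable properties first
  const ownEnd = names.length;      // the single index/offset
  for (let p = Object.getPrototypeOf(obj); p !== null;
       p = Object.getPrototypeOf(p)) {
    for (const k of Object.keys(p)) {
      if (!names.includes(k)) names.push(k); // proto props, shadowed ones skipped
    }
  }
  return { names, ownEnd };
}

// Object.keys only needs the prefix before ownEnd.
function ownKeysFromEnumList(list) {
  return list.names.slice(0, list.ownEnd);
}
```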

Edit: If anyone is interested, here's a profile of the optimized case

Note that most objects don't have enumerable prototype properties. If the iterator's indicesState is Valid or AvailableOnRequest, then we don't have to filter anything out.

I would guess that a lot of the remaining time is being spent allocating the array and running GC. We could maybe do scalar replacement, but that's not entirely trivial. Maybe we turn Object.keys into MObjectToIterator and then beef up the set of consumers that can do something with an iterator?

Actually, one thing I notice is that this microbenchmark doesn't actually use the result of Object.keys. It might be the case that V8 and JSC are doing some sort of dead code elimination here.

Edit: wait, scratch that, this is being tested with no inlining. So the result escapes, and can't be optimized out. I guess JSC has a fast-path for allocation?

So... if we essentially copy JSC's optimization, called out by Jeff in Comment 1, we can get a lot faster...

Benchmark 1: /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     422.8 ms ±   1.3 ms    [User: 339.0 ms, System: 95.0 ms]
  Range (min … max):   421.6 ms … 425.9 ms    10 runs
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):      57.0 ms ±   0.3 ms    [User: 53.9 ms, System: 4.3 ms]
  Range (min … max):    56.3 ms …  57.9 ms    51 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    7.41 ± 0.05 times faster than '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

This patch is -extremely- not production quality, but the general gist is that it rides on the already-existing ShapeCachePtr, storing the previously computed set of properties and birthing a new array from those values on a cache hit (which also suggests that array allocation isn't ultimately the long pole here).

If we wanted to productize the patch, there's a bunch of work we'd need to do:

  1. Adding a new cache type to the ShapeCachePtr means that when one cache is active, others don't attach. This conflict likely needs to be resolved in some fashion. If I modify my benchmark to bring back the 'for-in' activating loop mentioned in Comment 7, this patch provides zero speedup, because we can't attach the cache: the iterator cache is occupying those bits. (Perhaps the ShapeCachePtr needs to instead key into a data table -- IIRC that was JSC's approach.)
  2. This would need thorough evaluation of performance and memory usage; the above makes it very easy to get wrong. The flush-cache-on-GC behavior seems sensible though, and would limit the potential harm a bit.
  3. The code itself is definitely not production quality (yee-haw mallocing and almost certainly leaking memory).

Edit: oh, and I forgot the most important part: this needs to be made sound! We need to figure out how to handle indexed properties being added, and ensure we flush the cache when that happens.
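For intuition, the shape-keyed cache can be modeled in JS (a toy stand-in only: the real version hangs off the engine's ShapeCachePtr, and a real engine gets the shape for free rather than recomputing the property list as this model does).

```javascript
// Toy model of a per-shape Object.keys cache. The "shape" is approximated
// here by the ordered own-property names joined with a separator.
const shapeKeysCache = new Map();

function cachedObjectKeys(obj) {
  const shape = Object.getOwnPropertyNames(obj).join('\u0000');
  let keys = shapeKeysCache.get(shape);
  if (keys === undefined) {
    keys = Object.keys(obj);       // slow path: compute and cache
    shapeKeysCache.set(shape, keys);
  }
  return keys.slice();             // hit path: birth a fresh array
}

// The soundness hazard noted above: adding indexed or deleting properties
// changes the shape. In this toy model the lookup key simply changes; a
// real engine must detect the transition and flush the cache explicitly.
```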

Returning to the approach of using the for-in cache, except this time specializing array construction to consume the property names directly, I've got it up and running and passing jstests and jit-tests.

$ hyperfine 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'
Benchmark 1: DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     256.6 ms ±   1.3 ms    [User: 255.4 ms, System: 10.6 ms]
  Range (min … max):   254.9 ms … 259.5 ms    11 runs
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):      91.7 ms ±   0.8 ms    [User: 88.9 ms, System: 5.7 ms]
  Range (min … max):    90.3 ms …  93.8 ms    32 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    2.80 ± 0.03 times faster than 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

So now we're 2.8x faster rather than the previous 20% faster.

I'm going to get some SP3 results from try before cleaning these patches up.

Local testing of the shell SP3 TodoMVC ReactRedux and React benchmarks showed no benefit from this optimization.

We'll see what a try dispatch shows.

On my machine, the patchset that you pushed to try makes shallowEqual faster by 11% (1716 -> 1529 samples), for an overall improvement of 0.6% on the TodoMVC-React-Redux subtest.
Safari spends only 500 samples in this function; matching Safari's speed would improve the subtest by 3.8%.

Just for future reference: this optimization, attempted on resources/todomvc/architecture-examples/react-redux/index.js revision 9c41c606c4, has the following execution stats:

attempts 40414
successes 39803
rate 98.488148%

My next step is to try to get back the rest of the performance by modifying NativeIterator to store tagged values; then I'll put a patch stack up for review.

So in Comment 10 I suggested a 7x speedup was achievable with a more invasive caching system. That actually was a mistake on my part -- I neglected to notice that in that experiment the -disabled- state also got slower, which created an illusory speedup.

A more accurate measure of what is possible would be this:

$ hyperfine 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'ENABLED=1 /home/matthew/unified/obj-cache-list/dist/bin/js keys-benchmark.js' 
Benchmark 1: DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     282.3 ms ±  22.6 ms    [User: 284.3 ms, System: 7.5 ms]
  Range (min … max):   272.0 ms … 346.0 ms    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-cache-list/dist/bin/js keys-benchmark.js
  Time (mean ± σ):      67.2 ms ±   1.0 ms    [User: 64.1 ms, System: 4.3 ms]
  Range (min … max):    65.7 ms …  70.0 ms    43 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-cache-list/dist/bin/js keys-benchmark.js' ran
    4.20 ± 0.34 times faster than 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

So without being as invasive, we can get to ~2.7x.
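The correction falls out directly from the hyperfine means above:

```javascript
// Comment 10's ratio was taken against a regressed baseline (422.8 ms);
// re-measuring against a healthy baseline (282.3 ms) shrinks it.
const illusory = 422.8 / 57.0;   // what Comment 10 reported, ~7.4x
const corrected = 282.3 / 67.2;  // against a healthy baseline, ~4.2x
```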


In Comment 13 I said I'd attempt to get the NativeIterator to store tagged values instead of untagged pointers. I did that; however, it effectively didn't change the performance of the optimization.

I plan on cleaning up my version of this optimization that doesn't change the storage format of NativeIterator and submit that for review. Further optimization can come later.

Assignee: nobody → mgaudet
Status: NEW → ASSIGNED

Could the patches be attached to a different bug? They only address part of the performance difference so it would be better to keep this bug open until we've closed most of the gap.

Will do

Depends on: 1839437

Comment on attachment 9339901 [details]
Bug 1836679 - Allow creating and Array from an array of JSLinearStrings r?jandem

Revision D181411 was moved to bug 1839437. Setting attachment 9339901 [details] to obsolete.

Attachment #9339901 - Attachment is obsolete: true

Comment on attachment 9339902 [details]
Bug 1836679 - Use the for-in cache to improve performance on Object.keys r?jandem

Revision D181412 was moved to bug 1839437. Setting attachment 9339902 [details] to obsolete.

Attachment #9339902 - Attachment is obsolete: true

So, with the patches in Bug 1839437 landing, I thought it would be good to leave behind a profile after those patches land:

https://share.firefox.dev/3Nhc91c

In that profile, 78% of the time inside TryEnumerableOwnPropertiesNative is spent inside js::NewDenseCopiedArray.

What are the next steps here?

Flags: needinfo?(mgaudet)
Attached file keys-benchmark.js (deleted) —

At this point it needs to be re-triaged in terms of priority and impact; we've gone from 30x worse to ~3x worse on Linux, and 2.3x worse on Mac; measurements made using the tweaked version of the benchmark from Comment #1 attached here.

The remaining fruit here may be hard to get.

I'm going to open one followup bug: we really ought to be able to fill the cache for this situation without requiring a for-in over the object first; fundamentally, this version of the test case exists to show the case where the cache hits. Empirically it does hit quite a lot in some of the targeted benchmarks, but this might also help places where for-in wasn't used.

Flags: needinfo?(mgaudet)

Side note: even on the original benchmark, where the cache added by the patches in bug 1839437 isn't used, measuring the same way as Comment 1 I see a 4.8x difference locally, not 30x. Presumably other things (like Jon's work on C++ nursery allocation) have substantially improved there as well.

Depends on: 1844394
Attached patch CacheStats.patch (deleted) — Splinter Review

A patch to observe the hit rate of the for-in cache use in Object.keys

Do we know whether react-todoMVC hits the for-in cache?

Flags: needinfo?(mgaudet)

For react, (running Speedometer/resources/todomvc/architecture-examples/react/index.js) I see

  • attempts 39802 successes 39800 rate 99.994975

For react-redux, on the shell branch something has changed which has dropped the success rate (running Speedometer/resources/todomvc/architecture-examples/react-redux/index.js)

For commit 9c41c606 (June 16):

  • attempts 40414 successes 39801 rate 98.483199

For commit f339d7b1 (July 17):

  • attempts 40414 successes 1 rate 0.002474

Curiously though, if I run react (not -redux) from that earlier commit, I see zero impact: attempts 39802 successes 0 rate 0.000000.

This instability really does raise the priority of Bug 1844394, as this is clearly sensitive to changes in the benchmark.

Flags: needinfo?(mgaudet)