Open Bug 1836679 Opened 1 year ago Updated 1 year ago

Object.keys can be 30x slower in SM

Categories

(Core :: JavaScript Engine: JIT, defect, P2)

Tracking

ASSIGNED

People

(Reporter: jrmuizel, Assigned: mgaudet)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [sp3])

Attachments

(2 files, 2 obsolete files)

Object.keys performance shows up in the heavily used shallowEqual from React. This is a microbenchmark that attempts to explore the performance characteristics in different engines.
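For context, React's shallowEqual is roughly the following (paraphrased sketch, not the exact upstream source): it calls Object.keys on both arguments on every comparison, which is why Object.keys cost dominates for the small plain objects compared on each render.

```javascript
// Paraphrased sketch of React's shallowEqual. Object.keys is called on
// both operands, so its cost is paid twice per comparison.
function shallowEqual(objA, objB) {
  if (Object.is(objA, objB)) return true;
  if (typeof objA !== 'object' || objA === null ||
      typeof objB !== 'object' || objB === null) {
    return false;
  }
  const keysA = Object.keys(objA);
  const keysB = Object.keys(objB);
  if (keysA.length !== keysB.length) return false;
  for (let i = 0; i < keysA.length; i++) {
    if (!Object.prototype.hasOwnProperty.call(objB, keysA[i]) ||
        !Object.is(objA[keysA[i]], objB[keysA[i]])) {
      return false;
    }
  }
  return true;
}
```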

if (!globalThis.performance) {
    globalThis.performance = { now: () => preciseTime()*1000 }
}

function f(o) {
    return Object.keys(o)
}

function bench(o) {
    for (var i = 0; i<1000000; i++) {
        f(o);
    }
}

o = {about: 5,
    ballisitic: 6,
    cakebread: 8,
    dalespeople: 9,
    evilproof: 20,
    fairgoing: 30,
    gargoylish: 2,
    harmonici: 1,
    jinniwink: 12,
    kaleidoscopical: 2,
    labellum: 1,
    macadamization: 4,
    neutrino: 1,
    observership: 0,
    quadratomandibular: 9,
    rachicentesis: 1,
    saltcat: 0,
    trousseau: 1,
    view: 10,
    wheelbox: 2,
    xerography: 1,
    yez: 3
}
var start = performance.now();
bench(o)
var end = performance.now();
print(`took ${end - start}ms`);
$ js --ion-inlining=off object-keys.js
took 305.216064453125ms
$ jsc  --maximumInliningDepth=1 ~/tools/jsbench/object-keys.js
took 10.8701171875ms
$ d8 --no-turbo-inlining object-keys.js
took 26.507ms

Increasing the number of properties in o makes SpiderMonkey and V8 slower but does not seem to have an impact on JSC.

Blocks: 1827268

I suspect https://bugs.webkit.org/show_bug.cgi?id=190047 is responsible for most of JSC's speed advantage here.

Whiteboard: [sp3]
Flags: needinfo?(jdemooij)
Severity: -- → S3
Priority: -- → P2

Profiling the state of this right now, a surprising amount of time is spent growing the properties vector: https://share.firefox.dev/3P7usbM

If I arbitrarily reserve 100 elements before running the benchmark, I can speed it up by 37%:

$  hyperfine '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'PREALLOC=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'
Benchmark 1: /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     351.2 ms ±   4.3 ms    [User: 352.6 ms, System: 9.8 ms]
  Range (min … max):   346.0 ms … 359.0 ms    10 runs
 
Benchmark 2: PREALLOC=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     255.8 ms ±   1.2 ms    [User: 255.8 ms, System: 9.6 ms]
  Range (min … max):   253.8 ms … 257.3 ms    11 runs
 
Summary
  'PREALLOC=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    1.37 ± 0.02 times faster than '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

AFAICT accurately pre-counting the properties is slightly grosser -- pre-reserving still seems like a worthwhile patch regardless, but I'll put it in another bug (where we can discuss accurate pre-counting vs. just using a magic constant).

Depends on: 1838529

It's worth checking how common it is to have more than 8 keys (the number of elements the vector can store inline I think). I added some logging a few weeks ago and IIRC the vast majority of calls had < 8 keys, so this benchmark might not be super representative of most of the actual calls.

Flags: needinfo?(jdemooij)

(That said, increasing from 8 to 16 is an easy improvement and probably worth it!)

(In reply to Jan de Mooij [:jandem] from comment #3)

It's worth checking how common it is to have more than 8 keys (the number of elements the vector can store inline I think). I added some logging a few weeks ago and IIRC the vast majority of calls had < 8 keys, so this benchmark might not be super representative of most of the actual calls.

Yeah, the main purpose of this benchmark is to show that JSC caches the results of Object.keys() and takes O(1) time on objects with a cached result, instead of O(number of keys) as SpiderMonkey does.

Oh yeah, that's totally understood -- it was just a notable problem when looking at a profile, and a quick thing to potentially fix.

So, one hypothesized road forward here is the consumption of cached native iterators created by for-in.

I have built a very preliminary prototype patch that does exactly this to evaluate the potential win we could see here. It's built on top of the patch in Bug 1838529 to avoid the reallocation issues we'd seen elsewhere.

In this prototype patch we directly consume the for-in cache if it exists; so the microbenchmark on this bug was tweaked to walk the object once with 'for-in' to prime that cached iterator.
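The tweak amounts to something like the following (the function name is illustrative): walk the object once with for-in so the engine builds and caches a native iterator for its shape before the timed Object.keys loop runs.

```javascript
// Prime the for-in cache for o's shape: the loop body doesn't matter,
// the enumeration itself is what materializes the cached iterator.
function primeForInCache(o) {
  let n = 0;
  for (const k in o) {
    n++;
  }
  return n; // number of enumerated keys, handy for sanity-checking
}
```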

The prototype patch is not sound, since we don't handle prototype properties at all; it merely says "if there's a pre-existing for-in iterator, put those properties in, then create the keys array." In this sense, it is almost the best-case scenario we can hope for performance-wise.

The results are... underwhelming.

$ hyperfine '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 
Benchmark 1: /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     263.8 ms ±   1.0 ms    [User: 265.3 ms, System: 7.6 ms]
  Range (min … max):   262.4 ms … 265.6 ms    11 runs
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     221.7 ms ±   2.9 ms    [User: 221.3 ms, System: 8.0 ms]
  Range (min … max):   217.7 ms … 228.9 ms    13 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    1.19 ± 0.02 times faster than '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

So a 20% improvement on an already synthetic microbenchmark is all this gives us. Furthermore, we have more work to do in order to filter out prototype properties (fortunately, I think this is comparatively easy -- we can divide the range that comes out of our enumerators into 'own properties' and 'proto properties', so with a single index/offset we can disambiguate own vs proto properties).
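The own-vs-proto partitioning idea can be modeled in JS (purely illustrative; the real enumerator lives in C++, and the names here are made up): one flat list of enumerated property names plus a single index marking where prototype properties begin, so the own keys come back with a single slice.

```javascript
// Toy model of the proposed split: own enumerable names first, then
// proto-chain names, with ownEnd as the single index/offset that
// disambiguates own vs. proto properties.
function buildEnumList(obj) {
  const names = Object.keys(obj);   // own enumerable properties first
  const ownEnd = names.length;      // the single index/offset
  for (let p = Object.getPrototypeOf(obj); p !== null;
       p = Object.getPrototypeOf(p)) {
    for (const k of Object.keys(p)) {
      if (!names.includes(k)) names.push(k); // proto props, shadowed ones skipped
    }
  }
  return { names, ownEnd };
}

// Object.keys only needs the prefix before ownEnd.
function ownKeysFromEnumList(list) {
  return list.names.slice(0, list.ownEnd);
}
```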

Edit: If anyone is interested, here's a profile of the optimized case

Note that most objects don't have enumerable prototype properties. If the iterator's indicesState is Valid or AvailableOnRequest, then we don't have to filter anything out.

I would guess that a lot of the remaining time is being spent allocating the array and running GC. We could maybe do scalar replacement, but that's not entirely trivial. Maybe we turn Object.keys into MObjectToIterator and then beef up the set of consumers that can do something with an iterator?

Actually, one thing I notice is that this microbenchmark doesn't actually use the result of Object.keys. It might be the case that V8 and JSC are doing some sort of dead code elimination here.

Edit: wait, scratch that, this is being tested with no inlining. So the result escapes, and can't be optimized out. I guess JSC has a fast-path for allocation?

So... if we essentially copy JSC's optimization, called out by Jeff in Comment 1, we can get a lot faster...

Benchmark 1: /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     422.8 ms ±   1.3 ms    [User: 339.0 ms, System: 95.0 ms]
  Range (min … max):   421.6 ms … 425.9 ms    10 runs
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):      57.0 ms ±   0.3 ms    [User: 53.9 ms, System: 4.3 ms]
  Range (min … max):    56.3 ms …  57.9 ms    51 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    7.41 ± 0.05 times faster than '/home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

This patch is -extremely- not production quality, but the general gist is that it rides on the already-existing ShapeCachePtr, storing the previously computed set of properties and birthing a new array from those values on a cache hit (which also suggests that array allocation isn't ultimately the long pole here).

If we wanted to productize the patch, there's a bunch of work we'd need to do:

  1. Adding a new cache type to the ShapeCachePtr means that when one cache is active, others don't attach. This conflict likely needs to be resolved in some fashion. If I modify my benchmark to bring back the 'for-in' activating loop mentioned in Comment 7, this patch provides zero speedup, because we can't attach the cache: the iterator cache is occupying those bits. (Perhaps the ShapeCachePtr needs to instead key into a data table -- IIRC that was JSC's approach.)
  2. This would need thorough evaluation of performance and memory usage; the above makes it very easy to get wrong. The flush-cache-on-GC behavior seems sensible though, and would limit the potential harm a bit.
  3. The code itself is definitely not production quality (yee-haw mallocing and almost certainly leaking memory).

Edit: oh, and I forgot the most important part: this needs to be made sound! We need to figure out how to handle indexed properties being added, and ensure we flush the cache when that happens.
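For intuition, the shape-keyed cache can be modeled in JS (a toy stand-in only: the real version hangs off the engine's ShapeCachePtr, and a real engine gets the shape for free rather than recomputing the property list as this model does).

```javascript
// Toy model of a per-shape Object.keys cache. The "shape" is approximated
// here by the ordered own-property names joined with a separator.
const shapeKeysCache = new Map();

function cachedObjectKeys(obj) {
  const shape = Object.getOwnPropertyNames(obj).join('\u0000');
  let keys = shapeKeysCache.get(shape);
  if (keys === undefined) {
    keys = Object.keys(obj);       // slow path: compute and cache
    shapeKeysCache.set(shape, keys);
  }
  return keys.slice();             // hit path: birth a fresh array
}

// The soundness hazard noted above: adding indexed or deleting properties
// changes the shape. In this toy model the lookup key simply changes; a
// real engine must detect the transition and flush the cache explicitly.
```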

Returning to the approach of using the for-in cache, except this time specializing array construction to consume the property names directly, I've got it up and running and passing jstests and jit-tests.

$ hyperfine 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'
Benchmark 1: DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     256.6 ms ±   1.3 ms    [User: 255.4 ms, System: 10.6 ms]
  Range (min … max):   254.9 ms … 259.5 ms    11 runs
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):      91.7 ms ±   0.8 ms    [User: 88.9 ms, System: 5.7 ms]
  Range (min … max):    90.3 ms …  93.8 ms    32 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' ran
    2.80 ± 0.03 times faster than 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

So now we're 2.8x faster rather than the previous 20% faster.

I'm going to get some SP3 results from try before cleaning these patches up.

Local testing of the shell SP3 TodoMVC ReactRedux and React benchmarks showed no benefit from this optimization.

We'll see what a try dispatch shows.

On my machine, the patchset that you pushed to try makes shallowEqual faster by 11% (1716 -> 1529 samples), for an overall improvement of 0.6% on the TodoMVC-React-Redux subtest.
Safari spends only 500 samples in this function; matching Safari's speed would improve the subtest by 3.8%.

Just for future reference: this optimization, attempted on resources/todomvc/architecture-examples/react-redux/index.js revision 9c41c606c4, has the following execution stats:

attempts 40414
successes 39803
rate 98.488148%

My next step is to try to get back the rest of the performance by modifying NativeIterator to store tagged values; then I'll put a patch stack up for review.

So in Comment 10 I suggested a 7x speedup was achievable with a more invasive caching system. That actually was a mistake on my part -- I neglected to notice that in that experiment the -disabled- state also got slower, which created an illusory speedup.

A more accurate measure of what is possible would be this:

$ hyperfine 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js' 'ENABLED=1 /home/matthew/unified/obj-cache-list/dist/bin/js keys-benchmark.js' 
Benchmark 1: DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js
  Time (mean ± σ):     282.3 ms ±  22.6 ms    [User: 284.3 ms, System: 7.5 ms]
  Range (min … max):   272.0 ms … 346.0 ms    10 runs
 
  Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet PC without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
 
Benchmark 2: ENABLED=1 /home/matthew/unified/obj-cache-list/dist/bin/js keys-benchmark.js
  Time (mean ± σ):      67.2 ms ±   1.0 ms    [User: 64.1 ms, System: 4.3 ms]
  Range (min … max):    65.7 ms …  70.0 ms    43 runs
 
Summary
  'ENABLED=1 /home/matthew/unified/obj-cache-list/dist/bin/js keys-benchmark.js' ran
    4.20 ± 0.34 times faster than 'DISABLED=1 /home/matthew/unified/obj-opt-shell-nodebug-x86_64-pc-linux-gnu/dist/bin/js keys-benchmark.js'

So without being as invasive, we can get to ~2.7x.
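The correction falls out directly from the hyperfine means above:

```javascript
// Comment 10's ratio was taken against a regressed baseline (422.8 ms);
// re-measuring against a healthy baseline (282.3 ms) shrinks it.
const illusory = 422.8 / 57.0;   // what Comment 10 reported, ~7.4x
const corrected = 282.3 / 67.2;  // against a healthy baseline, ~4.2x
```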


In Comment 13 I said I'd attempt to get the NativeIterator to store tagged values instead of untagged pointers. I did that; however, it effectively didn't change the performance of the optimization.

I plan on cleaning up my version of this optimization that doesn't change the storage format of NativeIterator and submit that for review. Further optimization can come later.

Assignee: nobody → mgaudet
Status: NEW → ASSIGNED

Could the patches be attached to a different bug? They only address part of the performance difference so it would be better to keep this bug open until we've closed most of the gap.

Will do

Depends on: 1839437

Comment on attachment 9339901 [details]
Bug 1836679 - Allow creating and Array from an array of JSLinearStrings r?jandem

Revision D181411 was moved to bug 1839437. Setting attachment 9339901 [details] to obsolete.

Attachment #9339901 - Attachment is obsolete: true

Comment on attachment 9339902 [details]
Bug 1836679 - Use the for-in cache to improve performance on Object.keys r?jandem

Revision D181412 was moved to bug 1839437. Setting attachment 9339902 [details] to obsolete.

Attachment #9339902 - Attachment is obsolete: true

So, with the patches in Bug 1839437 landing, I thought it would be good to leave behind a profile after those patches land:

https://share.firefox.dev/3Nhc91c

In that profile, 78% of the time inside TryEnumerableOwnPropertiesNative is spent inside js::NewDenseCopiedArray.

What are the next steps here?

Flags: needinfo?(mgaudet)
Attached file keys-benchmark.js (deleted) —

At this point it needs to be re-triaged in terms of priority and impact; we've gone from 30x worse to ~3x worse on Linux, and 2.3x worse on Mac; measurements made using the tweaked version of the benchmark from Comment #1 attached here.

The remaining fruit here may be hard to get.

I'm going to open one followup bug: we really ought to be able to fill the cache for this situation without requiring a for-in over the object first; fundamentally, this version of the test case exists to show the case where the cache hits. Empirically it does hit quite a lot in some of the targeted benchmarks, but this might also help places where for-in wasn't used.

Flags: needinfo?(mgaudet)

Side note: even on the original benchmark, where the cache added by the patches in bug 1839437 isn't used, measuring the same way as Comment 1 I see a 4.8x difference locally, not 30x. Presumably other things (like Jon's work on C++ nursery allocation) have substantially improved there as well.

Depends on: 1844394
Attached patch CacheStats.patch (deleted) — Splinter Review

A patch to observe the hit rate of the for-in cache use in Object.keys

Do we know whether react-todoMVC hits the for-in cache?

Flags: needinfo?(mgaudet)

For react, (running Speedometer/resources/todomvc/architecture-examples/react/index.js) I see

  • attempts 39802 successes 39800 rate 99.994975

For react-redux, on the shell branch something has changed which has dropped the success rate (running Speedometer/resources/todomvc/architecture-examples/react-redux/index.js)

For commit 9c41c606 (June 16):

  • attempts 40414 successes 39801 rate 98.483199

For commit f339d7b1 (July 17):

  • attempts 40414 successes 1 rate 0.002474

Curiously though, if I run react (not -redux) from that earlier commit, I see zero impact: attempts 39802 successes 0 rate 0.000000.

This instability really does raise the priority of Bug 1844394, as this is clearly sensitive to changes in the benchmark.

Flags: needinfo?(mgaudet)