WebKit megamorphic-load benchmark 1.6x slower in FIrefox vs Chrome
Categories
(Core :: JavaScript Engine: JIT, defect, P2)
Tracking
()
People
(Reporter: denispal, Unassigned)
References
(Blocks 2 open bugs)
Details
Firefox behaves quite poorly on the microbenchmark megamorphic-load.js from WebKit which can be found here https://raw.githubusercontent.com/WebKit/WebKit/main/JSTests/microbenchmarks/megamorphic-load.js, but I've changed it slightly to run a bit longer:
start = Date.now();
(function() {
var array = [];
for (var i = 0; i < 1000; ++i) {
var o = {};
o["i" + i] = i;
o.f = 42;
array.push(o);
}
for (var j = 0 ; j < 100 ; j++) {
for (var i = 0; i < 1000000; ++i) {
var result = array[i % array.length].f;
if (result != 42)
throw "Error: bad result: " + result;
}
}
})();
end = Date.now();
console.log(end-start);
Spidermonkey: 1320 ms
V8: 854 ms
The profile from perf looks like this which looks similar to a lot of other profiles from react, matrix, slack, reddit, etc:
Samples: 5K of event 'instructions', Event count (approx.): 15871908413
Overhead Command Shared Object Symbol
+ 67.21% js jitted-1355365-25.so [.] Ion /home/denis/src/WebKit/JSTests/microbenchmarks/megamorphic-load.js:3:9
+ 24.84% js js [.] js::jit::GetNativeDataPropertyPure
+ 7.41% js jitted-1355365-24.so [.] Ion /home/denis/src/WebKit/JSTests/microbenchmarks/megamorphic-load.js:3:9
+ 0.06% js js [.] js::frontend::TokenStreamSpecific<mozilla::Utf8Unit, js::frontend::ParserAnyCharsAccess<js::frontend::GeneralParser<js::frontend::FullParseHandler, mozilla::Utf8Unit> > >::getTokenInternal
+ 0.03% js jitted-1355365-22.so [.] IC: MegamorphicNativeSlot
+ 0.03% js js [.] js::frontend::GeneralParser<js::frontend::FullParseHandler, mozilla::Utf8Unit>::primaryExpr
+ 0.02% js js [.] js::frontend::EmitterScope::lookupInCache
+ 0.02% js jitted-1355365-18.so [.] IC: BinaryArith.Int32.Mod
Comparing performance counters, it seems we execute almost 2x as many instructions as Chome:
Firefox:
6,073,506,603 cycles
15,879,944,293 instructions # 2.61 insn per cycle
2,912,475,009 branch-instructions
489,436 cache-misses
1,829,848 icache_64b.iftag_miss
1.341984818 seconds time elapsed
Chrome:
3,749,745,343 cycles
9,195,343,474 instructions # 2.45 insn per cycle
2,141,285,555 branch-instructions
944,924 cache-misses
3,753,871 icache_64b.iftag_miss
0.812395097 seconds time elapsed
A large number of instructions are part of this sequence of code generated by Ion:
│ ModI ▒
28624373 │ mov %rbx,%rax ▒
│ test %ecx,%ecx ▒
│ ↓ je 337 ▒
410882702 │ test %ebx,%ebx ▒
│ ↓ js 198 ▒
2401639 │ mov %rcx,%rdx ▒
43494055 │ sub $0x1,%edx ▒
│ test %edx,%ecx ▒
│ ↓ jne 18f ▒
│ and %ebx,%edx ▒
│ ↓ jmp 1af ▒
480812223 │18f: xor %edx,%edx ▒
2959710 │ idiv %ecx ▒
1596561001 │ ↓ jmp 1af ▒
│198: cmp $0x80000000,%ebx ▒
│ ↓ je 345 ▒
│1a4: cltd ▒
│ idiv %ecx ▒
│ test %edx,%edx ▒
│ ↓ je 33e ▒
│ BoundsCheck ▒
│1af: cmp %edx,-0x28(%rbp) ▒
│ ↓ jbe 353 ▒
│ SpectreMaskIndex ▒
43915670 │ xor %ecx,%ecx ▒
2896413 │ cmp -0x28(%rbp),%edx ▒
281633276 │ cmovb %edx,%ecx ▒
│ MoveGroup ▒
17590813 │ mov -0x38(%rbp),%rdx ▒
│ LoadElementAndUnbox ▒
8754710 │ movabs $0xfffe000000000000,%r11 ▒
│ xor (%rdx,%rcx,8),%r11 ◆
761616959 │ mov %r11,%rsi ▒
│ shr $0x2f,%r11 ▒
114574485 │ ↓ jne 35a ▒
│ MegamorphicLoadSlot ▒
95856216 │ mov (%rsi),%rax ▒
997275975 │ mov (%rax),%rax ▒
1164739541 │ mov (%rax),%rax ▒
1040131403 │ testl $0x40000,0x8(%rax) ▒
1236461039 │ ↓ jne 361 ▒
130014040 │ movabs $0xfff9800000000000,%r11 ▒
│ push %r11 ▒
5805394 │ mov %rsp,%rbx ▒
152958533 │ movabs $0x7f62b3b25500,%rax ▒
│ movabs $0xa5cf504d00,%rdi ▒
│ sub $0x8,%rsp ▒
│ mov %rbx,%rcx ▒
129553533 │ mov %rdi,%rdx ▒
│ mov %rax,%rdi ▒
8711696 │ → call Ion /home/denis/src/WebKit/JSTests/microbenchmarks/megamorphic-load.js:3:9 ▒
│ add $0x8,%rsp ▒
352224081 │ pop %rcx ▒
17398055 │ test $0xff,%al ▒
│ ↓ je 361
Reporter | ||
Updated•2 years ago
|
Comment 1•2 years ago
|
||
Copying my Matrix comments, for posterity:
- The fixed-size cache doesn't handle 1000 different shapes very well, resulting in a cache hit rate that's less than 1%. On most actual websites, the hit rate is around 80-95% last time I measured.
- All objects have just two own properties and we're looking up one of them. This wouldn't benefit a lot from the cache even in the best case. The cache is most effective for missing properties, objects with many properties, or properties living on the proto chain.
- We could experiment with a different table/hashing strategy (secondary table, different table size?).
- Making the
JSObject/Shape/BaseShape
header words non-atomic might help a bit too. - The
i % array.length
is pretty slow (1024 entries is much faster than 1000). We should check if V8 is also using anidiv
instruction for this.
Comment 2•2 years ago
|
||
Looking only at the generated assembly:
-
The major issue comes from the pointer chasing done within js::jit::MacroAssembler::branchIfNonNativeObj , and js::jit::MacroAssembler::loadObjClassUnsafe. Unfortunately there does not seems to be any way to improve by spreading the instructions as the data dependency is tight.
-
One way to improve this code would be to see that the Array object does not escape despite being manipulated with MArrayPush and assume that the length is a constant of 1000 after the first loop. This way we could replace the "division by a variable" by a "division by a constant". Removing the zero-check, and the sign check.
-
Another aspect is the register allocation chosen is not optimal. Ideally we should not have any MoveGroup being resolved for making this unconditional
callWithABI
. The Lowering phase should be updated to select a specific set of registers. -
The register allocator only seems to restore the array in
rdx
using aMoveGroup
instruction. Most likely due to the register flush required by theMegamorphicLoadSlot
. However, loading it this late, probably cause misses in the loading of the value out of the array. The value could have been loaded memory prior theSpecterMaskIndex
, using a different temporary register. -
Why are
BoundCheck
/SpecterMaskIndex
even necessary, when we are doing a modulo with the size of the array?
Updated•2 years ago
|
Comment 3•2 years ago
|
||
We're basically on par with Chrome for this microbenchmark now. We can beat Chrome handily by increasing our megamorphic cache size, but I don't think that's representative of real life. Closing this out, but feel free to reopen.
Description
•