High cache miss count for subtest React-TodoMVC in GrandPrix
Categories: Core :: JavaScript Engine, task, P3
People: Reporter: denispal, Unassigned
References: Blocks 1 open bug
Details: Whiteboard: [sp3-react-todomvc]
The React-TodoMVC subtest, located at https://grandprixbench.netlify.app/?suites=React-TodoMVC, has noticeably worse IPC in Firefox than in Chromium (1.15 vs. 1.41 instructions per cycle), and it seems largely related to cache misses, especially in the icache.
Chromium:
Performance counter stats for process id '1329513':
114,250,901,598 cycles
160,562,154,374 instructions # 1.41 insn per cycle
32,684,497,884 branch-instructions
811,569,390 cache-misses
2,500,193,250 icache_64b.iftag_miss
42.170563185 seconds time elapsed
Firefox:
Performance counter stats for process id '1330136':
128,673,347,788 cycles
148,228,288,424 instructions # 1.15 insn per cycle
28,009,990,401 branch-instructions
1,246,536,046 cache-misses
4,082,621,693 icache_64b.iftag_miss
42.165150893 seconds time elapsed
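To put the two runs on a common footing, the raw counters can be normalized. The sketch below recomputes IPC and derives icache misses per thousand instructions (MPKI) from the numbers above. The dictionary layout and helper names are illustrative only, and MPKI is a normalization added here rather than something perf printed.

```python
# Counter values copied from the two perf stat runs above.
# The dictionary layout and helper names are hypothetical, for illustration.
COUNTERS = {
    "Chromium": {
        "cycles": 114_250_901_598,
        "instructions": 160_562_154_374,
        "icache_miss": 2_500_193_250,  # icache_64b.iftag_miss
    },
    "Firefox": {
        "cycles": 128_673_347_788,
        "instructions": 148_228_288_424,
        "icache_miss": 4_082_621_693,  # icache_64b.iftag_miss
    },
}


def ipc(c):
    """Instructions retired per cycle."""
    return c["instructions"] / c["cycles"]


def icache_mpki(c):
    """Instruction-cache tag misses per 1000 instructions."""
    return 1000 * c["icache_miss"] / c["instructions"]


for name, c in COUNTERS.items():
    print(f"{name}: IPC={ipc(c):.2f}, icache MPKI={icache_mpki(c):.1f}")
```

Normalizing by instruction count matters here because Firefox retires fewer instructions than Chromium in the same wall-clock time, yet takes more icache misses while doing so.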
Here is a surface profile of the cache-misses for this test. Most of them seem to be coming from the GC and jumping into Trampolines.
Samples: 64K of event 'cache-misses', Event count (approx.): 762002296
Overhead Command Shared Object Symbol
+ 4.56% Isolated Web Co libxul.so [.] js::GCMarker::doMarking<0u>
+ 2.17% Isolated Web Co jitted-1331115-1.so [.] Trampolines
+ 1.49% Isolated Web Co libxul.so [.] js::jit::InitRestParameter
+ 1.37% Isolated Web Co [unknown] [k] 0xffffffff99149667
+ 1.30% Isolated Web Co libxul.so [.] js::TraceManuallyBarrieredGCCellPtr
+ 1.25% Isolated Web Co libxul.so [.] js::AtomizeString
+ 1.24% Isolated Web Co libxul.so [.] nsPurpleBuffer::VisitEntries<SnowWhiteKiller>
+ 1.12% Isolated Web Co libxul.so [.] js::jit::GetNativeDataPropertyByValuePure
+ 1.00% Isolated Web Co libxul.so [.] js::TenuringTracer::traceObject
+ 0.93% Isolated Web Co libxul.so [.] js::NativeObject::addProperty
+ 0.87% Isolated Web Co libxul.so [.] js::jit::SetElementMegamorphic
+ 0.78% Isolated Web Co libxul.so [.] js::jit::ICEntry::trace
+ 0.62% Isolated Web Co libxul.so [.] js::GenericTracerImpl<js::gc::MarkingTracerT<0u> >::onJitCodeEdge
+ 0.61% Isolated Web Co libc.so.6 [.] __strncpy_sse2_unaligned
+ 0.61% Isolated Web Co libc.so.6 [.] __stpncpy_sse2_unaligned
+ 0.56% Isolated Web Co jitted-1331115-782.so [.] RegExp
+ 0.56% Isolated Web Co libxul.so [.] js::jit::GetNativeDataPropertyPure
+ 0.53% Isolated Web Co firefox [.] Allocator<MozJemallocBase>::free
Same thing but for icache misses:
Samples: 20K of event 'icache_64b.iftag_miss', Event count (approx.): 4048060720
Overhead Command Shared Object Symbol
+ 1.61% Isolated Web Co libxul.so [.] js::jit::GetNativeDataPropertyByValuePure
+ 1.22% Isolated Web Co libxul.so [.] js::jit::SetElementMegamorphic
+ 1.14% Isolated Web Co jitted-1331115-4.so [.] BaselineInterpreter
+ 1.04% Isolated Web Co libxul.so [.] js::AtomizeString
+ 1.02% Isolated Web Co jitted-1331115-1.so [.] Trampolines
+ 0.78% Isolated Web Co libxul.so [.] js::GetIterator
+ 0.72% Isolated Web Co libc.so.6 [.] __strncpy_sse2_unaligned
+ 0.63% Isolated Web Co libxul.so [.] js::NativeObject::addProperty
+ 0.55% Isolated Web Co libxul.so [.] Interpret
Comment 1 (Reporter) • 2 years ago
For comparison, Chromium spends very little time outside of JIT code for this benchmark:
Samples: 83K of event 'cycles', Event count (approx.): 101000369374, Thread: chrome
Overhead Comman Shared Object Symbol
+ 3.59% chrome chrome [.] Builtins_KeyedStoreIC_Megamorphic
+ 2.79% chrome chrome [.] Builtins_LoadIC
+ 2.55% chrome chrome [.] Builtins_KeyedLoadIC_Megamorphic
+ 2.23% chrome chrome [.] v8::internal::StringTable::TryStringToIndexOrLookupExisting
+ 2.17% chrome chrome [.] Builtins_ObjectPrototypeHasOwnProperty
+ 1.85% chrome chrome [.] std::Cr::__introsort<std::Cr::_ClassicAlgPolicy, v8::internal::EnumIndexComparator<v8::internal::NameDictionary>&, v8::internal::AtomicSlot>
+ 1.62% chrome chrome [.] Builtins_LoadIC_Megamorphic
+ 1.04% chrome chrome [.] v8::internal::FastKeyAccumulator::GetKeys
+ 0.98% chrome chrome [.] blink::HTMLCollection::length
+ 0.75% chrome chrome [.] allocator_shim::internal::PartitionMalloc
+ 0.75% chrome chrome [.] Builtins_ForInFilter
+ 0.66% chrome chrome [.] Builtins_StoreIC
+ 0.64% chrome chrome [.] Builtins_CallFunction_ReceiverIsNotNullOrUndefined
+ 0.61% chrome chrome [.] Builtins_BaselineOutOfLinePrologue
+ 0.52% chrome chrome [.] Builtins_StrictEqual_WithFeedback
0.49% chrome chrome [.] Builtins_StrictEqual_Baseline
0.47% chrome chrome [.] Builtins_ToBooleanForBaselineJump
0.45% chrome chrome [.] v8::internal::BaseNameDictionary<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::Add
0.45% chrome chrome [.] Builtins_KeyedStoreIC
0.43% chrome chrome [.] v8::internal::Runtime_RegExpExecMultiple
0.39% chrome chrome [.] allocator_shim::internal::PartitionFree
0.37% chrome chrome [.] Builtins_SetDataProperties
0.36% chrome chrome [.] Builtins_CallFunction_ReceiverIsNullOrUndefined
0.36% chrome chrome [.] v8::internal::LookupIterator::Start<false>
0.32% chrome chrome [.] v8::internal::RegExpGlobalCache::FetchNext
0.31% chrome chrome [.] v8::internal::RegExpGlobalCache::RegExpGlobalCache
0.29% chrome chrome [.] v8::internal::String::WriteToFlat<unsigned char>
0.28% chrome chrome [.] Builtins_RecordWriteSaveFP
0.26% chrome chrome [.] Builtins_BaselineLeaveFrame
0.24% chrome chrome [.] Builtins_KeyedLoadIC
0.24% chrome chrome [.] Builtins_LoadICTrampoline_Megamorphic
0.22% chrome chrome [.] Builtins_Call_ReceiverIsNotNullOrUndefined_Baseline_Compact
0.22% chrome chrome [.] Builtins_Call_ReceiverIsNotNullOrUndefined
0.21% chrome chrome [.] Builtins_ArrayFilter
0.20% chrome chrome [.] Builtins_StringAdd_CheckNone
0.20% chrome chrome [.] Builtins_RegExpReplace
Comment 2•2 years ago
It seems like icache_64b.iftag_miss < cache-misses. Is cache-misses L2 misses?
Is it possible to get the cache miss profiles from Chrome as well?
Comment 3•2 years ago
InitRestParameter showing up so high on the dcache miss list is interesting. I don't think we'd previously identified it as a hotspot.
Comment 4 (Reporter) • 2 years ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #2)
It seems like icache_64b.iftag_miss < cache-misses. Is cache-misses L2 misses?
It isn't entirely clear what exactly "cache-misses" counts, but the perf_event_open documentation states: "PERF_COUNT_HW_CACHE_MISSES: Cache misses. Usually this indicates Last Level Cache misses; this is intended to be used in conjunction with the PERF_COUNT_HW_CACHE_REFERENCES event to calculate cache miss rates." It seems like it must also include prefetch requests, since it's much higher than the LLC-*-misses.
I added some additional cache counters for further comparison below, but I think the icache misses may be the bigger issue and maybe the overall number of references we make. I wonder if this is because we are jumping into C++ to do GetNativeDataPropertyByValuePure, SetElementMegamorphic, addProperty, etc, while Chrome seems to do these things with an IC for the most part.
Chromium:
Performance counter stats for process id '1345316':
6,332,227,906 cache-references (36.31%)
436,142,674 cache-misses # 6.888 % of all cache refs (36.63%)
1,907,876,036 L1-dcache-load-misses # 4.38% of all L1-dcache accesses (36.63%)
43,511,411,953 L1-dcache-loads (36.79%)
17,005,365,797 L1-dcache-stores (36.41%)
2,516,458,663 L1-icache-load-misses (36.23%)
18,363,927 LLC-load-misses # 3.09% of all LL-cache accesses (36.51%)
594,196,920 LLC-loads (36.44%)
40,397,238 LLC-store-misses (17.91%)
135,787,475 LLC-stores (18.10%)
2,471,436,366 icache_64b.iftag_miss (26.94%)
31.129881019 seconds time elapsed
Firefox:
Performance counter stats for process id '1344844':
8,598,874,325 cache-references (35.84%)
713,126,462 cache-misses # 8.293 % of all cache refs (36.08%)
2,230,684,392 L1-dcache-load-misses # 5.73% of all L1-dcache accesses (36.34%)
38,948,140,714 L1-dcache-loads (36.74%)
18,422,283,077 L1-dcache-stores (36.70%)
3,588,594,526 L1-icache-load-misses (36.85%)
38,634,359 LLC-load-misses # 5.00% of all LL-cache accesses (36.83%)
772,720,459 LLC-loads (36.66%)
59,490,735 LLC-store-misses (17.87%)
238,620,687 LLC-stores (17.80%)
3,598,953,522 icache_64b.iftag_miss (26.60%)
39.157739853 seconds time elapsed
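As a rough way to compare the two runs, the sketch below parses perf stat lines like the ones above and computes the cache-misses / cache-references rate along with the Firefox-to-Chromium ratio of L1 icache misses. The regex, function names, and the trimmed sample text are illustrative assumptions, not part of perf. Only a few lines from each run are reproduced, and the trailing percentages (the share of time each multiplexed counter was actually running) are ignored.

```python
import re

# Matches lines such as "8,598,874,325 cache-references (35.84%)".
# Lines like "39.157739853 seconds time elapsed" do not match because
# the digits are followed by "." rather than whitespace.
LINE = re.compile(r"^\s*([\d,]+)\s+([A-Za-z0-9_.\-]+)")


def parse_perf_stat(text):
    """Return {event_name: count} for every counter line in the text."""
    counters = {}
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            counters[m.group(2)] = int(m.group(1).replace(",", ""))
    return counters


def miss_rate(counters):
    """cache-misses as a fraction of cache-references."""
    return counters["cache-misses"] / counters["cache-references"]


# Trimmed copies of the perf stat output above.
CHROMIUM = """\
6,332,227,906 cache-references (36.31%)
436,142,674 cache-misses # 6.888 % of all cache refs (36.63%)
2,516,458,663 L1-icache-load-misses (36.23%)
31.129881019 seconds time elapsed
"""
FIREFOX = """\
8,598,874,325 cache-references (35.84%)
713,126,462 cache-misses # 8.293 % of all cache refs (36.08%)
3,588,594,526 L1-icache-load-misses (36.85%)
39.157739853 seconds time elapsed
"""

chromium = parse_perf_stat(CHROMIUM)
firefox = parse_perf_stat(FIREFOX)

print(f"Chromium miss rate: {100 * miss_rate(chromium):.3f}%")
print(f"Firefox miss rate:  {100 * miss_rate(firefox):.3f}%")
ratio = firefox["L1-icache-load-misses"] / chromium["L1-icache-load-misses"]
print(f"L1 icache misses, Firefox / Chromium: {ratio:.2f}x")
```

Because every counter in these runs was multiplexed, the absolute values are scaled estimates, so ratios like these are best read as approximate.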
Is it possible to get the cache miss profiles from Chrome as well?
Samples: 42K of event 'cache-misses', Event count (approx.): 450303745, Thread: chrome
Overhead Comman Shared Object Symbol
+ 2.12% chrome chrome [.] Builtins_KeyedStoreIC_Megamorphic
+ 1.91% chrome chrome [.] Builtins_LoadIC_Megamorphic
+ 1.35% chrome chrome [.] v8::internal::RegExpGlobalCache::RegExpGlobalCache
+ 1.17% chrome chrome [.] Builtins_ObjectPrototypeHasOwnProperty
+ 1.14% chrome chrome [.] blink::HTMLCollection::length
+ 1.03% chrome chrome [.] allocator_shim::internal::PartitionMalloc
+ 1.01% chrome chrome [.] v8::internal::Runtime_RegExpExecMultiple
+ 0.86% chrome chrome [.] v8::internal::StringTable::TryStringToIndexOrLookupExisting
+ 0.85% chrome chrome [.] Builtins_LoadIC
+ 0.62% chrome chrome [.] v8::internal::Scavenger::ScavengePage
+ 0.57% chrome chrome [.] allocator_shim::internal::PartitionFree
+ 0.54% chrome chrome [.] v8::internal::TracedHandles::ComputeWeaknessForYoungObjects
+ 0.54% chrome chrome [.] Builtins_KeyedLoadIC_Megamorphic
0.50% chrome chrome [.] Builtins_RegExpReplace
and
Samples: 12K of event 'icache_64b.iftag_miss', Event count (approx.): 2504437566, Thread: chrome
Overhead Comman Shared Object Symbol
+ 3.45% chrome chrome [.] Builtins_LoadIC
+ 1.96% chrome chrome [.] Builtins_KeyedLoadIC_Megamorphic
+ 1.63% chrome chrome [.] Builtins_CallFunction_ReceiverIsNotNullOrUndefined
+ 1.31% chrome chrome [.] v8::internal::StringTable::TryStringToIndexOrLookupExisting
+ 1.17% chrome chrome [.] Builtins_KeyedStoreIC_Megamorphic
+ 1.10% chrome chrome [.] Builtins_StoreIC
+ 0.92% chrome chrome [.] blink::Element::RecalcStyle
+ 0.86% chrome chrome [.] blink::StyleResolver::ApplyBaseStyleNoCache
+ 0.84% chrome chrome [.] Builtins_LoadIC_Megamorphic
+ 0.72% chrome chrome [.] v8::internal::Map::TransitionToDataProperty
+ 0.70% chrome chrome [.] v8::internal::Builtin_HandleApiCall
+ 0.67% chrome chrome [.] Builtins_ObjectPrototypeHasOwnProperty
+ 0.65% chrome chrome [.] Builtins_BaselineOutOfLinePrologue
+ 0.62% chrome chrome [.] v8::internal::LookupIterator::Start<false>
+ 0.61% chrome chrome [.] Builtins_CallFunction_ReceiverIsNullOrUndefined
+ 0.57% chrome chrome [.] Builtins_BaselineLeaveFrame
+ 0.56% chrome chrome [.] blink::EventDispatcher::Dispatch
+ 0.54% chrome chrome [.] blink::StyleResolver::ResolveStyle
+ 0.53% chrome chrome [.] Builtins_StrictEqual_Baseline
+ 0.53% chrome chrome [.] allocator_shim::internal::PartitionMalloc
+ 0.50% chrome chrome [.] blink::EventTarget::FireEventListeners
Comment 5•2 years ago
Do we know how much GC V8 is doing? GC will typically have a lot of cache misses.
(You said it spends little time outside the JIT, so presumably not as much.)
Comment 6 (Reporter) • 2 years ago
(In reply to Jon Coppeard (:jonco) from comment #5)
Do we know how much GC V8 is doing? GC will typically have a lot of cache misses.
(You said it spends little time outside the JIT, so presumably not as much.)
I actually can't find Chrome spending any significant time in the GC. I do see some call stacks for the GC in the perf script output, but not in the report. That being said, I think our GC activity is also quite low when looking at overall cycles, so I think the bulk of the slowdown is probably coming from icache misses, which seem to line up more closely with the report for cycles.
Comment 7•2 years ago
(In reply to Denis Palmeiro [:denispal] from comment #4)
I added some additional cache counters for further comparison below, but I think the icache misses may be the bigger issue and maybe the overall number of references we make. I wonder if this is because we are jumping into C++ to do GetNativeDataPropertyByValuePure, SetElementMegamorphic, addProperty, etc, while Chrome seems to do these things with an IC for the most part.
For the purpose of understanding icache behaviour, you probably need to dig in a bit more into which of these builtins are cloned vs. shared and whether they are laid out in memory in interesting ways. Another consideration might be the types of calls being used (such as direct vs. indirect). Translating C++ code to equivalent MASM would not change icache behaviour, so I think we need a better theory about what would be structurally different (e.g. a certain hot path running less code; avoiding trampoline frames more often; cloning builtins to use short jumps; etc.).
Another thing I notice about the data is that one of our hot icache frames is the BaselineInterpreter, while in Chrome there are many Builtins that appear to be pieces of their equivalent. I suspect that Chrome's profile simply appears better because its fine-grained symbols divide the cost up.
Comment 8•2 years ago
[triage note] I will set this bug as a P3 task in the meantime.
My understanding is that this bug is still in the investigation stage, until we figure out the root cause of the icache misses. Looking at icache misses at the function level gives only a fuzzy description of the problem. A better report should probably look at the instruction level and map these instructions back to the logic they implement.
As soon as we have identified what is going on and can pinpoint the issue, we can create new bugs to fix each issue, as defects.