The throbber animation heavily regresses performance on a main-thread-heavy micro-benchmark
Categories
(Core :: Graphics: WebRender, defect, P3)
People
(Reporter: emilio, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(1 file)
(deleted), text/html
STR:
- Enable WebRender.
- Open the test-case attached to this bug.
- Open attachment 9057328 from bug 1537903, which has the same test-case but without spinning the throbber.
Expected Results:
- Comparable performance.
Actual results:
- The number I get on my GPU (Mesa DRI Intel(R) HD Graphics P530 (Skylake GT2)) is much higher if the throbber is spinning. Performance is comparable in both test-cases if WebRender is disabled.
Diff between the two test-cases is just:
-const DELAY = true;
+const DELAY = false;
window.onload = function() {
DELAY ? setTimeout(runTests, 0) : runTests();
}
This is with integrated Graphics on Linux.
Comment 1•6 years ago
I get roughly the same number on both attachments on Mac: 4041 vs 3963. What numbers do you get?
Comment 3•6 years ago (Reporter)
I'll get a pair of profiles.
Comment 4•6 years ago (Reporter)
WebRender disabled, throbber test-case: 6912, then 7275
WebRender disabled, no-throbber test-case: 7038, then 7055
WebRender enabled, throbber test-case: 20037, then 19430
- https://perfht.ml/2U6YTOZ (note that this only has the second run because it went over the 30s profiler time window)
WebRender enabled, no-throbber test-case: 7713, then 7850.
Note that WebRender's profiles are profiling a bunch more threads, so maybe not
directly comparable with non-WR.
Also, this is all on a pristine profile, just created and installed the profiler add-on.
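To make the size of the regression easier to read off, here is a minimal sketch that averages the two runs from each line above and compares the throbber score against the no-throbber score. The variable and function names are illustrative only, and it assumes a higher score is worse, as the description's "Actual results" implies.

```javascript
// Scores copied from comment 4 (two runs each).
// Assumption: a higher score means worse performance.
const runs = {
  wrOff: { throbber: [6912, 7275], noThrobber: [7038, 7055] },
  wrOn: { throbber: [20037, 19430], noThrobber: [7713, 7850] },
};

// Arithmetic mean of a list of scores.
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Ratio of the mean throbber score to the mean no-throbber score.
function throbberSlowdown(config) {
  return mean(config.throbber) / mean(config.noThrobber);
}

const slowdownWrOff = throbberSlowdown(runs.wrOff);
const slowdownWrOn = throbberSlowdown(runs.wrOn);
console.log(`WebRender disabled: ${slowdownWrOff.toFixed(2)}x`);
console.log(`WebRender enabled: ${slowdownWrOn.toFixed(2)}x`);
```

With these numbers the ratio is about 1.01x with WebRender disabled and about 2.54x with WebRender enabled.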
Comment 5•6 years ago
Can you reproduce the problem if you make the window as small as possible?
Comment 6•6 years ago (Reporter)
Reducing the window size does definitely help a lot, yeah: 5522 vs. 5644.
I've verified that the throbber is still visible while at it (just ensuring it doesn't get culled away).
Comment 7•6 years ago
I'm guessing this is memory bandwidth starvation. What resolution are you running at?
Comment 8•6 years ago (Reporter)
3840x2160
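As a rough back-of-the-envelope illustration of why resolution matters for the bandwidth theory, the sketch below estimates the memory traffic of writing one full-screen buffer per frame at this resolution. The 4 bytes per pixel and the 60 Hz refresh rate are assumptions, not figures from this bug, and a real compositor may touch each pixel more than once.

```javascript
// Resolution from comment 8.
const WIDTH = 3840;
const HEIGHT = 2160;
// Assumptions (not stated in the thread): 32-bit pixels, 60 Hz refresh.
const BYTES_PER_PIXEL = 4;
const REFRESH_HZ = 60;

// Bytes written when every pixel of one full-screen buffer is touched once.
const bytesPerFrame = WIDTH * HEIGHT * BYTES_PER_PIXEL;
// Sustained traffic if that happens on every refresh.
const bytesPerSecond = bytesPerFrame * REFRESH_HZ;

console.log(bytesPerFrame); // 33177600
console.log((bytesPerSecond / 1e9).toFixed(2) + " GB/s");
```

Under these assumptions a single full-screen pass costs 33177600 bytes, or roughly 1.99 GB/s at 60 Hz. The same 33177600 figure appears in the benchmark output posted in comment 10.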
Comment 9•6 years ago
Can you run the benchmark at https://github.com/jrmuizel/memset-bench and post the results?
Comment 10•6 years ago (Reporter)
171239584
253290568 1 33177600 4.996732
292047110 2 16588800 4.333633
213503808 4 8294400 5.927880
76897571 8 4147200 16.458582
50981800 16 2073600 24.825036
46075951 32 1036800 27.468234
50430943 64 518400 25.096199
53592175 128 259200 23.615854
49455212 256 129600 25.591337
51532981 512 64800 24.559515
35984658 1024 32400 35.171239
31182065 2048 16200 40.588236
38009982 4096 8100 33.297174
32214595 8192 4050 39.287317
31485070 16384 2025 40.197624
40086547 32768 1012 31.572313
51698017 65536 506 24.481113
45728613 131072 253 27.676873
Comment 11•6 years ago
Can you also try running the benchmark with OpenGL layers instead of basic?
Comment 12•6 years ago (Reporter)
5067 vs. 7135 on a maximized window, 5031 vs. 5188 on a minimally-sized window.
Comment 13•6 years ago
And can you try running the benchmark while running cargo run mem --release on https://github.com/jrmuizel/jrmuizel-membench?
Comment 14•6 years ago (Reporter)
WR enabled: 7174 vs. 14101
OpenGL layers: 7619 vs. 12291
Basic layers: 7053 vs. 7981
This is on a maximized window; let me know if you also want numbers with a minimally-sized window.
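For comparison across backends, a small sketch that turns each pair above into a slowdown ratio and sorts the backends by it. It assumes the first number in each pair is the baseline and the second is the score while jrmuizel-membench is running, which is how comment 16 reads the basic layers pair. The field names are illustrative.

```javascript
// Pairs copied from comment 14, all on a maximized window.
// Assumption: first value = baseline, second = with jrmuizel-membench running.
const results = [
  { backend: "WebRender", baseline: 7174, contended: 14101 },
  { backend: "OpenGL layers", baseline: 7619, contended: 12291 },
  { backend: "Basic layers", baseline: 7053, contended: 7981 },
];

// Slowdown ratio per backend, largest first.
const slowdowns = results
  .map((r) => ({ backend: r.backend, ratio: r.contended / r.baseline }))
  .sort((a, b) => b.ratio - a.ratio);

for (const { backend, ratio } of slowdowns) {
  console.log(`${backend}: ${ratio.toFixed(2)}x`);
}
```

Under that reading, WebRender degrades the most (about 1.97x), OpenGL layers next (about 1.61x), and basic layers the least (about 1.13x).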
Comment 15•6 years ago
Setting P1 because I assume we want this fixed before shipping on Intel?
Comment 16•6 years ago
So, to confirm: with basic layers, it only gets a little bit worse (7053 vs 7981) when running jrmuizel-membench?
Comment 19•6 years ago (Reporter)
It's a Lenovo P50 with 8 logical and 4 physical cores:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 39 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz
Stepping: 3
CPU MHz: 800.032
CPU max MHz: 3700.0000
CPU min MHz: 800.0000
BogoMIPS: 5616.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 8192K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
Comment 20•5 years ago
Sotaro, can you try running this on Windows 10 on your P50 with gfx.webrender.compositor on and off?
Comment 21•5 years ago
I tested the STR from comment 0 on a P50 on Win10 with a maximized window and did not see such a difference, though the values varied from run to run.
- With DC compositor: 4761 vs 4365
- Without DC compositor: 4250 vs 4059
Comment 22•4 years ago
WONTFIXing this one because we're having a hard time reproducing it, and there are so many actionable items in the queue that I'd like to get this one off the perf triage list.