Closed
Bug 1435209
Opened 7 years ago
Closed 7 years ago
Use CMOVcc instead of index masking
Categories
(Core :: JavaScript Engine: JIT, defect)
Core
JavaScript Engine: JIT
Tracking
()
RESOLVED
FIXED
mozilla60
People
(Reporter: jandem, Assigned: jandem)
References
(Blocks 1 open bug)
Details
Attachments
(2 files)
(deleted),
text/html
|
Details | |
(deleted),
patch
|
luke
:
review+
RyanVM
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
Using CMOVcc to zero the index register if OOB has a number of benefits:
* Performance
On x64 CMOVcc is a bit faster than the index masking, but it's significantly faster on x86 (since we use the slower masking scheme there). On Kraken I get the following on 32-bit:
no mitigations: 832 ms
index masking: 1011 ms (21.5% slower)
CMOVcc: 911 ms (9.5% slower)
Furthermore, this is the naive way - in many cases (masm.boundsCheck32ForLoad, wasm bounds checks maybe?) we can fold the CMP with the CMP we do for the bounds check and that will result in more speedups.
* simplicity: it's much easier to reason about.
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Assignee | ||
Comment 1•7 years ago
|
||
Does anyone have an older machine, to benchmark a patch for this?
My MBP has an i7-4980HQ CPU but it would be great to verify these results on older CPUs. I'm also curious about AMD.
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(lhansen)
Assignee | ||
Comment 2•7 years ago
|
||
Newer architectures (Broadwell, Skylake) would be interesting too...
Comment 3•7 years ago
|
||
I have access to various ancient hardware, including older AMD and Intel chips (AMD FX-4100 and a couple of Macs that probably run Core-2 Duo, and some older i7 - at least). Until Monday I also have my mom's consumer Lenovo laptop in the house for servicing :-P so I can even run something there...
My new Linux system is some Xeon Sandy Bridge thing, if that's interesting.
Flags: needinfo?(lhansen)
Assignee | ||
Comment 4•7 years ago
|
||
Great! I will work on a patch and do a Try push so we have builds we can compare.
Assignee | ||
Comment 5•7 years ago
|
||
First, here's a silly micro-benchmark that runs in the browser (but it's easy to run it in the shell too, of course).
In the shell I got the following, best of 5 runs:
x64 no mitigations: 333 ms
x64 index masking: 323 ms
x64 CMOVcc: 316 ms
x86 no mitigations: 327 ms
x86 index masking: 420 ms
x86 CMOVcc: 314 ms
So:
(1) Yes, on this micro-benchmark we're slower without mitigations. Probably some weird cache effect but I'm not complaining.
(2) This also confirms CMOVcc is a small win on 64-bit on my machine, but on 32-bit the improvement is much bigger.
Assignee | ||
Comment 6•7 years ago
|
||
I have builds to compare index masking vs CMOVcc. Mitigations are enabled by default with these builds, so you don't need to change any prefs or use shell flags.
If anyone reading this could run Kraken [0] (and probably the micro-benchmark I attached) in a browser/shell build and report the results, that would be great. Ideally both 32-bit and 64-bit, but just one of them would help too.
Index masking: https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3f4c434f82e4f0c7df9f3404da2d93b073ac5
CMOVcc: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4efd393f6f3058858304268a6804134dd090bb0
[0] https://krakenbenchmark.mozilla.org/
Assignee | ||
Comment 7•7 years ago
|
||
Talos agrees this is an improvement. Linux64 Talos machines are Skylakes now (Xeon E3-1585Lv5) and it's a consistent 1-3% improvement there on Kraken compared to index masking. As expected, this is bigger on 32-bit: 8.83% on Win32.
So it looks like CMOV performs well on (newer) Intel CPUs. Still very interested in AMD results and older Intels.
Comment 8•7 years ago
|
||
AMD FX-6300 on Win7 (x64)
Kraken, three runs for each configuration (chronology order).
x64 58.0.1: [1678.6ms +/- 4.8%, 1609.9ms +/- 1.4%, 1651.1ms +/- 2.5%]
x64 index masking: [1914.6ms +/- 5.6%, 1839.5ms +/- 1.3%, 1855.2ms +/- 1.5%]
x64 CMOVcc: [1840.5ms +/- 4.0%, 1818.6ms +/- 2.1%, 1782.1ms +/- 3.8%]
x86 index masking: [2104.8ms +/- 2.8%, 2117.3ms +/- 2.6%, 2091.4ms +/- 3.3%]
x86 CMOVcc: [1949.2ms +/- 2.9%, 1968.9ms +/- 2.3%, 1932.7ms +/- 3.8%]
Shell test case:
x64 tip@094c6dbe48a3: 214ms
x64 index masking: 370ms
x64 CMOVcc: 319ms
x86 index masking: 559ms
x86 CMOVcc: 318ms
Assignee | ||
Comment 9•7 years ago
|
||
(In reply to André Bargull [:anba] from comment #8)
> AMD FX-6300 on Win7 (x64)
Thanks a lot! I'm very happy to see CMOVcc is faster on AMD too - I was worried it would be slow.
Comment 10•7 years ago
|
||
I don't know if this is old enough to be interesting, but here are results for an Intel Core i7-3770K (Ivy Bridge, 2012) on Windows 10 x64.
Kraken, average result first, individual results in parentheses:
x64 Nightly: 1145.5ms (1152.7ms 1144.3ms 1166.4ms 1135.1ms 1144.7ms 1129.8ms)
x64 index masking: 1293.8ms (1314.9ms 1284.4ms 1294.6ms 1303.6ms 1280.4ms 1284.7ms)
x64 CMOVcc: 1310.2ms (1293.2ms 1310.3ms 1332.2ms 1298.7ms 1315.1ms 1311.6ms)
x86 Nightly: 1196.6ms (1201.9ms 1195.0ms 1186.8ms 1211.0ms 1189.0ms 1195.7ms)
x86 index masking: 1428.3ms (1404.9ms 1424.1ms 1452.8ms 1421.8ms 1444.0ms 1422.3ms)
x86 CMOVcc: 1359.7ms (1379.6ms 1366.7ms 1359.6ms 1324.1ms 1388.0ms 1340.0ms)
So on x64 it looks like a wash or a (very) slight regression, on x86 CMOVcc is a definite win.
Comment 11•7 years ago
|
||
Results for the microbenchmark (all builds compiled locally):
median min mean max sd
x64 m-c tip: 304 300 305 321 3
x64 index masking: 316 300 318 359 14
x64 CMOVcc: 319 299 320 356 14
x86 m-c tip: 303 300 303 309 2
x86 index masking: 443 436 443 457 4
x86 CMOVcc: 315 299 318 349 13
Note that the difference between index masking and CMOVcc on x64 is not statistically significant after 200 runs.
Assignee | ||
Comment 12•7 years ago
|
||
Thanks! It's interesting the difference is smaller on that machine on x64. That makes me wonder about even older architectures like Core 2, maybe Lars can help us with that. But considering the other benefits, better perf on 32-bit and newer CPUs, we could take a perf regression there, as long as it's not too horrible. And we can still improve perf more by eliminating the second CMP in some cases.
Comment 13•7 years ago
|
||
x64 nightly:
1035.7ms +/- 2.6%
1052.2ms +/- 2.6%
1089.0ms +/- 3.7%
Index Masking
x64:
1233.8ms +/- 4.0%
1182.7ms +/- 1.7%
1247.9ms +/- 3.5%
x86:
1503.0ms +/- 4.2%
1411.1ms +/- 2.0%
1448.0ms +/- 5.9%
CMOVcc
x64:
1155.6ms +/- 1.7%
1168.4ms +/- 5.3%
1157.7ms +/- 2.2%
x86:
1283.2ms +/- 3.1%
1315.2ms +/- 3.5%
1264.7ms +/- 2.1%
Comment 14•7 years ago
|
||
Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz 2.27GHz
Nightly x86: 3668.5ms +/- 5.3%
https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B175,180,193,175,184,200,174,176,192,197%5D,%22audio-beat-detection%22:%5B427,313,326,325,323,313,314,311,326,320%5D,%22audio-dft%22:%5B430,421,417,386,589,432,397,444,428,422%5D,%22audio-fft%22:%5B250,221,226,215,220,213,223,214,259,225%5D,%22audio-oscillator%22:%5B187,369,169,177,175,175,187,261,293,171%5D,%22imaging-gaussian-blur%22:%5B277,266,271,270,279,288,270,276,267,278%5D,%22imaging-darkroom%22:%5B358,348,372,346,358,340,347,340,351,351%5D,%22imaging-desaturate%22:%5B311,276,288,328,275,255,258,275,453,275%5D,%22json-parse-financial%22:%5B137,144,133,128,137,168,130,133,188,142%5D,%22json-stringify-tinderbox%22:%5B83,82,89,92,97,92,93,285,106,88%5D,%22stanford-crypto-aes%22:%5B167,167,177,175,196,172,164,175,167,171%5D,%22stanford-crypto-ccm%22:%5B286,272,242,268,306,267,257,309,413,252%5D,%22stanford-crypto-pbkdf2%22:%5B494,508,443,415,428,418,441,497,693,419%5D,%22stanford-crypto-sha256-iterative%22:%5B164,164,152,158,151,154,161,154,180,154%5D%7D
Index masking: 4550.1ms +/- 2.4%
https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B274,272,271,300,286,269,288,259,419,265%5D,%22audio-beat-detection%22:%5B469,432,362,351,362,387,438,366,451,352%5D,%22audio-dft%22:%5B565,597,616,651,622,543,593,573,582,599%5D,%22audio-fft%22:%5B242,347,307,319,261,280,315,284,248,247%5D,%22audio-oscillator%22:%5B226,433,288,339,297,549,226,293,222,250%5D,%22imaging-gaussian-blur%22:%5B472,503,444,440,482,449,446,437,447,450%5D,%22imaging-darkroom%22:%5B387,386,376,365,374,370,400,373,364,360%5D,%22imaging-desaturate%22:%5B398,348,338,358,392,323,360,361,376,335%5D,%22json-parse-financial%22:%5B200,141,136,138,173,154,151,138,167,152%5D,%22json-stringify-tinderbox%22:%5B119,105,138,139,115,104,103,113,109,125%5D,%22stanford-crypto-aes%22:%5B219,222,204,243,201,239,207,251,213,224%5D,%22stanford-crypto-ccm%22:%5B277,289,268,280,296,362,329,273,369,287%5D,%22stanford-crypto-pbkdf2%22:%5B445,516,480,502,470,446,454,532,440,444%5D,%22stanford-crypto-sha256-iterative%22:%5B187,166,168,187,183,200,440,192,169,206%5D%7D
CMOVcc: 4254.8ms +/- 5.6%
https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B247,193,202,194,192,212,220,196,187,187%5D,%22audio-beat-detection%22:%5B432,607,384,384,342,343,425,345,360,345%5D,%22audio-dft%22:%5B510,656,538,538,479,494,621,482,515,486%5D,%22audio-fft%22:%5B264,359,235,314,286,252,276,245,234,238%5D,%22audio-oscillator%22:%5B184,389,184,358,269,232,192,188,209,229%5D,%22imaging-gaussian-blur%22:%5B360,416,366,367,375,367,365,377,382,358%5D,%22imaging-darkroom%22:%5B380,373,372,372,398,366,382,376,394,381%5D,%22imaging-desaturate%22:%5B326,389,308,285,355,286,315,282,308,310%5D,%22json-parse-financial%22:%5B190,149,162,134,231,187,157,143,412,132%5D,%22json-stringify-tinderbox%22:%5B143,122,111,119,115,113,124,110,109,106%5D,%22stanford-crypto-aes%22:%5B249,196,196,202,279,193,209,200,202,197%5D,%22stanford-crypto-ccm%22:%5B535,270,292,268,469,275,275,279,269,271%5D,%22stanford-crypto-pbkdf2%22:%5B664,462,432,523,435,448,520,488,476,430%5D,%22stanford-crypto-sha256-iterative%22:%5B287,260,198,182,168,259,191,175,179,232%5D%7D
Comment 15•7 years ago
|
||
Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz 2.27GHz
micro-benchmark jsshell x86
no mitigations: 491
index masking: 1279
CMOVcc: 850
Assignee | ||
Comment 16•7 years ago
|
||
Thanks a lot people, very helpful.
Ekanan, just curious, is that 32-bit Windows? If 64-bit, it would be very interesting to have data for Firefox 64-bit on Core 2 :)
Comment 17•7 years ago
|
||
It's windows 10 32-bit :)
Comment 18•7 years ago
|
||
I forgot to mention my build specs in #13
Win10 Pro x64
i7 6700HQ
Comment 19•7 years ago
|
||
Huh, turns out both my Core 2 Duo systems run Mac OS X 32-bit, not really a current or interesting platform... Anyway one of them is a model with the CPU designation T7600 @ 2.33GHz (MacBook Pro ca 2006).
Assignee | ||
Comment 20•7 years ago
|
||
It turns out that this version:
CMP index, length
CMOVB index, output
is a lot faster than this one:
CMP length, index
CMOVA index, output
I don't understand why, but it repros on multiple benchmarks (index and length are both in registers and the code size is exactly the same).
Anyway, I replaced the first version with the second, then had to revert it because the first is faster. Oh well.
Attachment #8948819 -
Flags: review?(luke)
Comment 21•7 years ago
|
||
Comment on attachment 8948819 [details] [diff] [review]
Patch
Review of attachment 8948819 [details] [diff] [review]:
-----------------------------------------------------------------
Faster, shorter, simpler; pick any three.
Attachment #8948819 -
Flags: review?(luke) → review+
Comment 22•7 years ago
|
||
Pushed by jandemooij@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0bc556c6e060
Use CMOVcc instead of index masking. r=luke
Comment 23•7 years ago
|
||
bugherder |
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox60:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Assignee | ||
Comment 24•7 years ago
|
||
Attachment #8948819 -
Flags: approval-mozilla-beta?
Comment 25•7 years ago
|
||
Comment on attachment 8948819 [details] [diff] [review]
Patch
Spectre-related fix. Taking for 59b8.
Attachment #8948819 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 26•7 years ago
|
||
bugherder uplift |
status-firefox59:
--- → fixed
Comment 27•7 years ago
|
||
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz
Ubuntu 16.04
Micro benchmark:
CMOV: 511
Index Masking: 522
==================== Kraken ===================
--------------------- CMOV --------------------
-----------------------------------------------
Total: 2368.4ms +/- 12.7%
-----------------------------------------------
ai: 132.7ms +/- 17.0%
astar: 132.7ms +/- 17.0%
audio: 757.9ms +/- 9.9%
beat-detection: 171.8ms +/- 15.8%
dft: 354.9ms +/- 9.9%
fft: 109.3ms +/- 16.0%
oscillator: 121.9ms +/- 19.2%
imaging: 610.1ms +/- 17.7%
gaussian-blur: 197.8ms +/- 21.3%
darkroom: 222.3ms +/- 18.6%
desaturate: 190.0ms +/- 13.3%
json: 146.7ms +/- 26.8%
parse-financial: 88.8ms +/- 36.5%
stringify-tinderbox: 57.9ms +/- 21.6%
stanford: 721.0ms +/- 17.6%
crypto-aes: 134.2ms +/- 17.8%
crypto-ccm: 221.5ms +/- 20.8%
crypto-pbkdf2: 256.1ms +/- 20.5%
crypto-sha256-iterative: 109.2ms +/- 30.7%
-------------------- Index Masking ------------
-----------------------------------------------
Total: 2574.1ms +/- 11.7%
-----------------------------------------------
ai: 139.9ms +/- 16.6%
astar: 139.9ms +/- 16.6%
audio: 887.8ms +/- 12.9%
beat-detection: 190.9ms +/- 16.3%
dft: 422.5ms +/- 13.2%
fft: 140.6ms +/- 28.9%
oscillator: 133.8ms +/- 17.0%
imaging: 626.9ms +/- 18.0%
gaussian-blur: 210.6ms +/- 22.9%
darkroom: 229.7ms +/- 18.0%
desaturate: 186.6ms +/- 19.1%
json: 146.1ms +/- 23.5%
parse-financial: 75.3ms +/- 21.6%
stringify-tinderbox: 70.8ms +/- 39.8%
stanford: 773.4ms +/- 20.3%
crypto-aes: 144.6ms +/- 19.0%
crypto-ccm: 271.0ms +/- 39.1%
crypto-pbkdf2: 262.2ms +/- 17.9%
crypto-sha256-iterative: 95.6ms +/- 20.6%
============================== Octane ==========================
-------------------------------- CMOV ---------------------------
Octane 2.0
Octane Score: 17463
Richards
19691 Core language features
Deltablue
23683
Core language features
Crypto
16758
Bit & Math operations
Raytrace
57218
Core language features
EarleyBoyer
15394
Memory & GC
Regexp
2898
Strings & arrays
Splay
13470
Memory & GC
SplayLatency
13934
GC latency
NavierStokes
26284
Strings & arrays
pdf.js
11430
Strings & arrays
Mandreel
12715
Virtual machine
MandreelLatency
11785
Compiler latency
GB Emulator
33341
Virtual machine
CodeLoad
14284
Loading & Parsing
Box2DWeb
21731
Bit & Math operations
zlib
53209
asm.js
Typescript
14073
Virtual machine & GC
------------------------------ Index Masking -------------------
Octane 2.0
Octane Score: 16929
Richards
20606 Core language features
Deltablue
21323
Core language features
Crypto
16531
Bit & Math operations
Raytrace
53057
Core language features
EarleyBoyer
15761
Memory & GC
Regexp
2286
Strings & arrays
Splay
13438
Memory & GC
SplayLatency
15719
GC latency
NavierStokes
26389
Strings & arrays
pdf.js
7344
Strings & arrays
Mandreel
14224
Virtual machine
MandreelLatency
17250
Compiler latency
GB Emulator
37916
Virtual machine
CodeLoad
13920
Loading & Parsing
Box2DWeb
18199
Bit & Math operations
zlib
35333
asm.js
Typescript
16272
Virtual machine & GC
Comment 28•7 years ago
|
||
Ooops! Please, ignore my comment #27.
The correct results are:
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz
Ubuntu 16.04
MICRO BENCHMARK
---------------
CMOV: 520
INDEX MASKING: 559
KRAKEN
-------
CMOV: 2394.7ms +/- 16.6%
INDEX MASKING: 2609.1ms +/- 10.4%
CMOV - KRAKEN - details
================================
Kraken JavaScript Benchmark Results
Content Version: kraken-1.1
Run Again
(You can bookmark this results URL for later comparison.)
To compare to another run, paste a saved result URL in the text field below and press enter:
===============================================
RESULTS (means and 95% confidence intervals)
-----------------------------------------------
Total: 2394.7ms +/- 16.6%
-----------------------------------------------
ai: 126.9ms +/- 16.3%
astar: 126.9ms +/- 16.3%
audio: 896.9ms +/- 25.1%
beat-detection: 174.1ms +/- 18.5%
dft: 457.1ms +/- 39.5%
fft: 134.2ms +/- 30.5%
oscillator: 131.5ms +/- 19.9%
imaging: 592.7ms +/- 20.3%
gaussian-blur: 190.1ms +/- 21.9%
darkroom: 219.9ms +/- 21.7%
desaturate: 182.7ms +/- 18.1%
json: 127.8ms +/- 17.9%
parse-financial: 77.2ms +/- 20.1%
stringify-tinderbox: 50.6ms +/- 19.2%
stanford: 650.4ms +/- 11.4%
crypto-aes: 123.2ms +/- 13.3%
crypto-ccm: 211.5ms +/- 34.2%
crypto-pbkdf2: 220.1ms +/- 7.8%
crypto-sha256-iterative: 95.6ms +/- 15.1%
INDEX MASKING - KRAKEN - details
=================================
Kraken JavaScript Benchmark Results
Content Version: kraken-1.1
Run Again
(You can bookmark this results URL for later comparison.)
To compare to another run, paste a saved result URL in the text field below and press enter:
===============================================
RESULTS (means and 95% confidence intervals)
-----------------------------------------------
Total: 2609.1ms +/- 10.4%
-----------------------------------------------
ai: 160.9ms +/- 16.1%
astar: 160.9ms +/- 16.1%
audio: 1028.7ms +/- 24.7%
beat-detection: 250.7ms +/- 31.2%
dft: 518.2ms +/- 32.4%
fft: 133.9ms +/- 35.8%
oscillator: 125.9ms +/- 20.5%
imaging: 568.9ms +/- 12.9%
gaussian-blur: 195.2ms +/- 16.0%
darkroom: 205.7ms +/- 15.6%
desaturate: 168.0ms +/- 11.8%
json: 138.3ms +/- 23.4%
parse-financial: 82.5ms +/- 29.0%
stringify-tinderbox: 55.8ms +/- 37.5%
stanford: 712.3ms +/- 15.6%
crypto-aes: 156.6ms +/- 44.3%
crypto-ccm: 192.8ms +/- 22.6%
crypto-pbkdf2: 249.5ms +/- 15.3%
crypto-sha256-iterative: 113.4ms +/- 22.4%
Assignee | ||
Comment 29•7 years ago
|
||
(In reply to tomaspartl from comment #28)
> The correct results are:
Thank you for these results!
Updated•6 years ago
|
Flags: needinfo?(nicolas.b.pierron)
You need to log in
before you can comment on or make changes to this bug.
Description
•