Closed Bug 1435209 Opened 7 years ago Closed 7 years ago

Use CMOVcc instead of index masking

Categories

(Core :: JavaScript Engine: JIT, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox59 --- fixed
firefox60 --- fixed

People

(Reporter: jandem, Assigned: jandem)

References

(Blocks 1 open bug)

Details

Attachments

(2 files)

Using CMOVcc to zero the index register if OOB has a number of benefits: * Performance On x64 CMOVcc is a bit faster than the index masking, but it's significantly faster on x86 (since we use the slower masking scheme there). On Kraken I get the following on 32-bit: no mitigations: 832 ms index masking: 1011 ms (21.5% slower) CMOVcc: 911 ms (9.5% slower) Furthermore, this is the naive way - in many cases (masm.boundsCheck32ForLoad, wasm bounds checks maybe?) we can fold the CMP with the CMP we do for the bounds check and that will result in more speedups. * simplicity: it's much easier to reason about.
Depends on: 1435249
Assignee: nobody → jdemooij
Status: NEW → ASSIGNED
Blocks: 1435266
Does anyone have an older machine, to benchmark a patch for this? My MBP has an i7-4980HQ CPU but it would be great to verify these results on older CPUs. I'm also curious about AMD.
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(lhansen)
Newer architectures (Broadwell, Skylake) would be interesting too...
I have access to various ancient hardware, including older AMD and Intel chips (AMD FX-4100 and a couple of Macs that probably run Core-2 Duo, and some older i7 - at least). Until Monday I also have my mom's consumer Lenovo laptop in the house for servicing :-P so I can even run something there... My new Linux system is some Xeon Sandy Bridge thing, if that's interesting.
Flags: needinfo?(lhansen)
Great! I will work on a patch and do a Try push so we have builds we can compare.
Attached file Micro-benchmark (deleted) —
First, here's a silly micro-benchmark that runs in the browser (but it's easy to run it in the shell too, of course). In the shell I got the following, best of 5 runs: x64 no mitigations: 333 ms x64 index masking: 323 ms x64 CMOVcc: 316 ms x86 no mitigations: 327 ms x86 index masking: 420 ms x86 CMOVcc: 314 ms So: (1) Yes, on this micro-benchmark we're slower without mitigations. Probably some weird cache effect but I'm not complaining. (2) This also confirms CMOVcc is a small win on 64-bit on my machine, but on 32-bit the improvement is much bigger.
I have builds to compare index masking vs CMOVcc. Mitigations are enabled by default with these builds, so you don't need to change any prefs or use shell flags. If anyone reading this could run Kraken [0] (and probably the micro-benchmark I attached) in a browser/shell build and report the results, that would be great. Ideally both 32-bit and 64-bit, but just one of them would help too. Index masking: https://treeherder.mozilla.org/#/jobs?repo=try&revision=55f3f4c434f82e4f0c7df9f3404da2d93b073ac5 CMOVcc: https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4efd393f6f3058858304268a6804134dd090bb0 [0] https://krakenbenchmark.mozilla.org/
Talos agrees this is an improvement. Linux64 Talos machines are Skylakes now (Xeon E3-1585Lv5) and it's a consistent 1-3% improvement there on Kraken compared to index masking. As expected, this is bigger on 32-bit: 8.83% on Win32. So it looks like CMOV performs well on (newer) Intel CPUs. Still very interested in AMD results and older Intels.
AMD FX-6300 on Win7 (x64) Kraken, three runs for each configuration (chronology order). x64 58.0.1: [1678.6ms +/- 4.8%, 1609.9ms +/- 1.4%, 1651.1ms +/- 2.5%] x64 index masking: [1914.6ms +/- 5.6%, 1839.5ms +/- 1.3%, 1855.2ms +/- 1.5%] x64 CMOVcc: [1840.5ms +/- 4.0%, 1818.6ms +/- 2.1%, 1782.1ms +/- 3.8%] x86 index masking: [2104.8ms +/- 2.8%, 2117.3ms +/- 2.6%, 2091.4ms +/- 3.3%] x86 CMOVcc: [1949.2ms +/- 2.9%, 1968.9ms +/- 2.3%, 1932.7ms +/- 3.8%] Shell test case: x64 tip@094c6dbe48a3: 214ms x64 index masking: 370ms x64 CMOVcc: 319ms x86 index masking: 559ms x86 CMOVcc: 318ms
(In reply to André Bargull [:anba] from comment #8) > AMD FX-6300 on Win7 (x64) Thanks a lot! I'm very happy to see CMOVcc is faster on AMD too - I was worried it would be slow.
I don't know if this is old enough to be interesting, but here are results for an Intel Core i7-3770K (Ivy Bridge, 2012) on Windows 10 x64. Kraken, average result first, individual results in parentheses: x64 Nightly: 1145.5ms (1152.7ms 1144.3ms 1166.4ms 1135.1ms 1144.7ms 1129.8ms) x64 index masking: 1293.8ms (1314.9ms 1284.4ms 1294.6ms 1303.6ms 1280.4ms 1284.7ms) x64 CMOVcc: 1310.2ms (1293.2ms 1310.3ms 1332.2ms 1298.7ms 1315.1ms 1311.6ms) x86 Nightly: 1196.6ms (1201.9ms 1195.0ms 1186.8ms 1211.0ms 1189.0ms 1195.7ms) x86 index masking: 1428.3ms (1404.9ms 1424.1ms 1452.8ms 1421.8ms 1444.0ms 1422.3ms) x86 CMOVcc: 1359.7ms (1379.6ms 1366.7ms 1359.6ms 1324.1ms 1388.0ms 1340.0ms) So on x64 it looks like a wash or a (very) slight regression, on x86 CMOVcc is a definite win.
Results for the microbenchmark (all builds compiled locally): median min mean max sd x64 m-c tip: 304 300 305 321 3 x64 index masking: 316 300 318 359 14 x64 CMOVcc: 319 299 320 356 14 x86 m-c tip: 303 300 303 309 2 x86 index masking: 443 436 443 457 4 x86 CMOVcc: 315 299 318 349 13 Note that the difference between index masking and CMOVcc on x64 is not statistically significant after 200 runs.
Thanks! It's interesting the difference is smaller on that machine on x64. That makes me wonder about even older architectures like Core 2, maybe Lars can help us with that. But considering the other benefits, better perf on 32-bit and newer CPUs, we could take a perf regression there, as long as it's not too horrible. And we can still improve perf more by eliminating the second CMP in some cases.
x64 nightly: 1035.7ms +/- 2.6% 1052.2ms +/- 2.6% 1089.0ms +/- 3.7% Index Masking x64: 1233.8ms +/- 4.0% 1182.7ms +/- 1.7% 1247.9ms +/- 3.5% x86: 1503.0ms +/- 4.2% 1411.1ms +/- 2.0% 1448.0ms +/- 5.9% CMOVcc x64: 1155.6ms +/- 1.7% 1168.4ms +/- 5.3% 1157.7ms +/- 2.2% x86: 1283.2ms +/- 3.1% 1315.2ms +/- 3.5% 1264.7ms +/- 2.1%
Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz 2.27GHz Nightly x86: 3668.5ms +/- 5.3% https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B175,180,193,175,184,200,174,176,192,197%5D,%22audio-beat-detection%22:%5B427,313,326,325,323,313,314,311,326,320%5D,%22audio-dft%22:%5B430,421,417,386,589,432,397,444,428,422%5D,%22audio-fft%22:%5B250,221,226,215,220,213,223,214,259,225%5D,%22audio-oscillator%22:%5B187,369,169,177,175,175,187,261,293,171%5D,%22imaging-gaussian-blur%22:%5B277,266,271,270,279,288,270,276,267,278%5D,%22imaging-darkroom%22:%5B358,348,372,346,358,340,347,340,351,351%5D,%22imaging-desaturate%22:%5B311,276,288,328,275,255,258,275,453,275%5D,%22json-parse-financial%22:%5B137,144,133,128,137,168,130,133,188,142%5D,%22json-stringify-tinderbox%22:%5B83,82,89,92,97,92,93,285,106,88%5D,%22stanford-crypto-aes%22:%5B167,167,177,175,196,172,164,175,167,171%5D,%22stanford-crypto-ccm%22:%5B286,272,242,268,306,267,257,309,413,252%5D,%22stanford-crypto-pbkdf2%22:%5B494,508,443,415,428,418,441,497,693,419%5D,%22stanford-crypto-sha256-iterative%22:%5B164,164,152,158,151,154,161,154,180,154%5D%7D Index masking: 4550.1ms +/- 2.4% https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B274,272,271,300,286,269,288,259,419,265%5D,%22audio-beat-detection%22:%5B469,432,362,351,362,387,438,366,451,352%5D,%22audio-dft%22:%5B565,597,616,651,622,543,593,573,582,599%5D,%22audio-fft%22:%5B242,347,307,319,261,280,315,284,248,247%5D,%22audio-oscillator%22:%5B226,433,288,339,297,549,226,293,222,250%5D,%22imaging-gaussian-blur%22:%5B472,503,444,440,482,449,446,437,447,450%5D,%22imaging-darkroom%22:%5B387,386,376,365,374,370,400,373,364,360%5D,%22imaging-desaturate%22:%5B398,348,338,358,392,323,360,361,376,335%5D,%22json-parse-financial%22:%5B200,141,136,138,173,154,151,138,167,152%5D,%22json-stringify-tinderbox%22:%5B119,105,138,139,115,104,103,113,109,125%5D,%22stanford-crypto-aes%22:%5B219,222,204,243,201,239,207,251,213,224%5D,%22stanford-crypto-ccm%22:%5B277,289,268,280,296,362,329,273,369,287%5D,%22stanford-crypto-pbkdf2%22:%5B445,516,480,502,470,446,454,532,440,444%5D,%22stanford-crypto-sha256-iterative%22:%5B187,166,168,187,183,200,440,192,169,206%5D%7D CMOVcc: 4254.8ms +/- 5.6% https://krakenbenchmark.mozilla.org/kraken-1.1/results.html?%7B%22v%22:%20%22kraken-1.1%22,%20%22ai-astar%22:%5B247,193,202,194,192,212,220,196,187,187%5D,%22audio-beat-detection%22:%5B432,607,384,384,342,343,425,345,360,345%5D,%22audio-dft%22:%5B510,656,538,538,479,494,621,482,515,486%5D,%22audio-fft%22:%5B264,359,235,314,286,252,276,245,234,238%5D,%22audio-oscillator%22:%5B184,389,184,358,269,232,192,188,209,229%5D,%22imaging-gaussian-blur%22:%5B360,416,366,367,375,367,365,377,382,358%5D,%22imaging-darkroom%22:%5B380,373,372,372,398,366,382,376,394,381%5D,%22imaging-desaturate%22:%5B326,389,308,285,355,286,315,282,308,310%5D,%22json-parse-financial%22:%5B190,149,162,134,231,187,157,143,412,132%5D,%22json-stringify-tinderbox%22:%5B143,122,111,119,115,113,124,110,109,106%5D,%22stanford-crypto-aes%22:%5B249,196,196,202,279,193,209,200,202,197%5D,%22stanford-crypto-ccm%22:%5B535,270,292,268,469,275,275,279,269,271%5D,%22stanford-crypto-pbkdf2%22:%5B664,462,432,523,435,448,520,488,476,430%5D,%22stanford-crypto-sha256-iterative%22:%5B287,260,198,182,168,259,191,175,179,232%5D%7D
Intel(R) Core(TM)2 Duo CPU P8400 @ 2.26GHz 2.27GHz micro-benchmark jsshell x86 no mitigations: 491 index masking: 1279 CMOVcc: 850
Thanks a lot people, very helpful. Ekanan, just curious, is that 32-bit Windows? If 64-bit, it would be very interesting to have data for Firefox 64-bit on Core 2 :)
It's windows 10 32-bit :)
I forgot to mention my build specs in #13 Win10 Pro x64 i7 6700HQ
Huh, turns out both my Core 2 Duo systems run Mac OS X 32-bit, not really a current or interesting platform... Anyway one of them is a model with the CPU designation T7600 @ 2.33GHz (MacBook Pro ca 2006).
Attached patch Patch (deleted) — Splinter Review
It turns out that this version: CMP index, length CMOVB index, output is a lot faster than this one: CMP length, index CMOVA index, output I don't understand why, but it repros on multiple benchmarks (index and length are both in registers and the code size is exactly the same). Anyway, I replaced the first version with the second, then had to revert it because the first is faster. Oh well.
Attachment #8948819 - Flags: review?(luke)
Comment on attachment 8948819 [details] [diff] [review] Patch Review of attachment 8948819 [details] [diff] [review]: ----------------------------------------------------------------- Faster, shorter, simpler; pick any three.
Attachment #8948819 - Flags: review?(luke) → review+
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
Attachment #8948819 - Flags: approval-mozilla-beta?
Comment on attachment 8948819 [details] [diff] [review] Patch Spectre-related fix. Taking for 59b8.
Attachment #8948819 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz Ubuntu 16.04 Micro benchmark: CMOV: 511 Index Masking: 522 ==================== Kraken =================== --------------------- CMOV -------------------- ----------------------------------------------- Total: 2368.4ms +/- 12.7% ----------------------------------------------- ai: 132.7ms +/- 17.0% astar: 132.7ms +/- 17.0% audio: 757.9ms +/- 9.9% beat-detection: 171.8ms +/- 15.8% dft: 354.9ms +/- 9.9% fft: 109.3ms +/- 16.0% oscillator: 121.9ms +/- 19.2% imaging: 610.1ms +/- 17.7% gaussian-blur: 197.8ms +/- 21.3% darkroom: 222.3ms +/- 18.6% desaturate: 190.0ms +/- 13.3% json: 146.7ms +/- 26.8% parse-financial: 88.8ms +/- 36.5% stringify-tinderbox: 57.9ms +/- 21.6% stanford: 721.0ms +/- 17.6% crypto-aes: 134.2ms +/- 17.8% crypto-ccm: 221.5ms +/- 20.8% crypto-pbkdf2: 256.1ms +/- 20.5% crypto-sha256-iterative: 109.2ms +/- 30.7% -------------------- Index Masking ------------ ----------------------------------------------- Total: 2574.1ms +/- 11.7% ----------------------------------------------- ai: 139.9ms +/- 16.6% astar: 139.9ms +/- 16.6% audio: 887.8ms +/- 12.9% beat-detection: 190.9ms +/- 16.3% dft: 422.5ms +/- 13.2% fft: 140.6ms +/- 28.9% oscillator: 133.8ms +/- 17.0% imaging: 626.9ms +/- 18.0% gaussian-blur: 210.6ms +/- 22.9% darkroom: 229.7ms +/- 18.0% desaturate: 186.6ms +/- 19.1% json: 146.1ms +/- 23.5% parse-financial: 75.3ms +/- 21.6% stringify-tinderbox: 70.8ms +/- 39.8% stanford: 773.4ms +/- 20.3% crypto-aes: 144.6ms +/- 19.0% crypto-ccm: 271.0ms +/- 39.1% crypto-pbkdf2: 262.2ms +/- 17.9% crypto-sha256-iterative: 95.6ms +/- 20.6% ============================== Octane ========================== -------------------------------- CMOV --------------------------- Octane 2.0 Octane Score: 17463 Richards 19691 Core language features Deltablue 23683 Core language features Crypto 16758 Bit & Math operations Raytrace 57218 Core language features EarleyBoyer 15394 Memory & GC Regexp 2898 Strings & arrays Splay 13470 Memory & GC SplayLatency 13934 GC latency NavierStokes 26284 Strings & arrays pdf.js 11430 Strings & arrays Mandreel 12715 Virtual machine MandreelLatency 11785 Compiler latency GB Emulator 33341 Virtual machine CodeLoad 14284 Loading & Parsing Box2DWeb 21731 Bit & Math operations zlib 53209 asm.js Typescript 14073 Virtual machine & GC ------------------------------ Index Masking ------------------- Octane 2.0 Octane Score: 16929 Richards 20606 Core language features Deltablue 21323 Core language features Crypto 16531 Bit & Math operations Raytrace 53057 Core language features EarleyBoyer 15761 Memory & GC Regexp 2286 Strings & arrays Splay 13438 Memory & GC SplayLatency 15719 GC latency NavierStokes 26389 Strings & arrays pdf.js 7344 Strings & arrays Mandreel 14224 Virtual machine MandreelLatency 17250 Compiler latency GB Emulator 37916 Virtual machine CodeLoad 13920 Loading & Parsing Box2DWeb 18199 Bit & Math operations zlib 35333 asm.js Typescript 16272 Virtual machine & GC
Ooops! Please, ignore my comment #27. The correct results are: Intel(R) Core(TM) i5-2430M CPU @ 2.40GHz Ubuntu 16.04 MICRO BENCHMARK --------------- CMOV: 520 INDEX MASKING: 559 KRAKEN ------- CMOV: 2394.7ms +/- 16.6% INDEX MASKING: 2609.1ms +/- 10.4% CMOV - KRAKEN - details ================================ Kraken JavaScript Benchmark Results Content Version: kraken-1.1 Run Again (You can bookmark this results URL for later comparison.) To compare to another run, paste a saved result URL in the text field below and press enter: =============================================== RESULTS (means and 95% confidence intervals) ----------------------------------------------- Total: 2394.7ms +/- 16.6% ----------------------------------------------- ai: 126.9ms +/- 16.3% astar: 126.9ms +/- 16.3% audio: 896.9ms +/- 25.1% beat-detection: 174.1ms +/- 18.5% dft: 457.1ms +/- 39.5% fft: 134.2ms +/- 30.5% oscillator: 131.5ms +/- 19.9% imaging: 592.7ms +/- 20.3% gaussian-blur: 190.1ms +/- 21.9% darkroom: 219.9ms +/- 21.7% desaturate: 182.7ms +/- 18.1% json: 127.8ms +/- 17.9% parse-financial: 77.2ms +/- 20.1% stringify-tinderbox: 50.6ms +/- 19.2% stanford: 650.4ms +/- 11.4% crypto-aes: 123.2ms +/- 13.3% crypto-ccm: 211.5ms +/- 34.2% crypto-pbkdf2: 220.1ms +/- 7.8% crypto-sha256-iterative: 95.6ms +/- 15.1% INDEX MASKING - KRAKEN - details ================================= Kraken JavaScript Benchmark Results Content Version: kraken-1.1 Run Again (You can bookmark this results URL for later comparison.) To compare to another run, paste a saved result URL in the text field below and press enter: =============================================== RESULTS (means and 95% confidence intervals) ----------------------------------------------- Total: 2609.1ms +/- 10.4% ----------------------------------------------- ai: 160.9ms +/- 16.1% astar: 160.9ms +/- 16.1% audio: 1028.7ms +/- 24.7% beat-detection: 250.7ms +/- 31.2% dft: 518.2ms +/- 32.4% fft: 133.9ms +/- 35.8% oscillator: 125.9ms +/- 20.5% imaging: 568.9ms +/- 12.9% gaussian-blur: 195.2ms +/- 16.0% darkroom: 205.7ms +/- 15.6% desaturate: 168.0ms +/- 11.8% json: 138.3ms +/- 23.4% parse-financial: 82.5ms +/- 29.0% stringify-tinderbox: 55.8ms +/- 37.5% stanford: 712.3ms +/- 15.6% crypto-aes: 156.6ms +/- 44.3% crypto-ccm: 192.8ms +/- 22.6% crypto-pbkdf2: 249.5ms +/- 15.3% crypto-sha256-iterative: 113.4ms +/- 22.4%
(In reply to tomaspartl from comment #28) > The correct results are: Thank you for these results!
Flags: needinfo?(nicolas.b.pierron)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: