Closed
Bug 1316803
Opened 8 years ago
Closed 8 years ago
Wasm baseline: fold constant rhs into integer operations
Categories
(Core :: JavaScript Engine: JIT, defect, P3)
Core
JavaScript Engine: JIT
Tracking
()
RESOLVED
FIXED
People
(Reporter: lth, Assigned: dannas, Mentored)
References
Details
Attachments
(11 files, 11 obsolete files)
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
application/javascript
|
Details | |
(deleted),
application/x-shellscript
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
lth
:
review+
|
Details | Diff | Splinter Review |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
lth
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
application/x-javascript
|
Details |
There are many opportunities for cheaply folding in a constant rhs to arithmetic operations, we do this already for I32 add and shift operators, this reduces register pressure and instruction count. Most places are already tagged with TODO in the code.
(Note, folding in constant rhs for conditionals is covered by bug 1286816.)
Reporter | ||
Comment 1•8 years ago
|
||
There are some special cases that are probably worth adressing:
- Bug 1276009, x ^ -1 is the way wasm expresses ~x, mbebinita may have other cases like this
- Multiplication by certain easy constants
- Divisions by constant-power-of-two if the lhs is nonnegative, maybe worth open-coding the sign check
and specializing the path
The multiply and divide cases are of unclear value to me - it depends on what the front-end does, for whatever source language we care about... probably worth investigating in more depth.
Assignee | ||
Comment 2•8 years ago
|
||
I'll take a stab at this.
The TODO's in the code can be found with this search http://searchfox.org/mozilla-central/search?q=1316803&case=false®exp=false&path=WasmBaselineCompile and the an example of folding for i32 adds can be found at http://searchfox.org/mozilla-central/source/js/src/wasm/WasmBaselineCompile.cpp#3748
Reporter | ||
Comment 3•8 years ago
|
||
That's great. If it were me I would do add/sub/shift/rotate first (I now see the subtract primitives are not annotated with the bug numbers but they probably should be) and then worry about multiply, quotient, and remainder in subsequent rounds.
Assignee | ||
Comment 4•8 years ago
|
||
Some SpiderMonkey newbie questions:
* Why do we need fold optimizations in the BaseCompiler? Won't emscripten et.al. do those optimizations for us already?
* What is the relation between IonMonkey and wasm::BaseCompiler? AFAICT, for regular javascript, the BaselineCompiler is a transition step between interpreting and full-JIT compilation. But in http://searchfox.org/mozilla-central/source/js/src/wasm/WasmIonCompile.cpp#3793 it looks like it's either Ion or Baseline. (And I don't understand how the Ion code can be used for both javascript and wasm; the MIR and LIR bytecodes for javascript is not the same as the bytecode for wasm).
I interpret the wasm::BaseCompiler code as: "generate assembly code by calling the MacroAssembler". How can I inspect the generated assembly code for an wasm function? https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips has some suggestions on how to use gdb for inspecting the odin and ion generated code. How do it for the BaseCompiler?
Assignee | ||
Comment 5•8 years ago
|
||
A script using the gdb python api for inspecting the assembly generated by the BaseCompiler. An attempt at codifying the manual instructions at https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips#Printing_asm.jswasm_generated_assembly_code_(from_gdb)
Works ok but with two hiccups:
* There are 32 bytes at the beginning of each asm listing, that gdb is not able to disassemble. Why?
* How can I read the length of the assembly? Right now I'm decoding 64 bytes, but it would be neat if I could show only the relevant instructions.
Assignee | ||
Comment 6•8 years ago
|
||
Assignee | ||
Comment 7•8 years ago
|
||
Attachment #8812606 -
Attachment is obsolete: true
Attachment #8812607 -
Attachment is obsolete: true
Reporter | ||
Comment 8•8 years ago
|
||
FWIW, the simplest ways I know how to look at the generated machine code:
(1) look at the codegen output, this is what I do most often:
$ IONFLAGS=codegen dist/bin/js --wasm-always-baseline whatever.js > foo.txt 2>&1
$ <whatever command you like for viewing a text file> foo.txt
search for "wasm baseline" in the file
After that it's pretty self-explanatory, if you know a little bit about how the compiler generates the function prologue code so that you're not confused by the jumping around.
(2) emit a breakpoint in your code and examine the code from gdb:
In the code, call masm.breakpoint() to emit a breakpoint at the start of a function, say, or after the prologue.
Run under gdb.
When gdb stops, x/10i shows you the next 10 instructions.
Use 'stepi' to step past the breakpoint.
Assignee | ||
Comment 9•8 years ago
|
||
Thanks for the suggestions to use IONFLAGS=codegen and setting a breakpoint from the MacroAssembler API.
Attaching my annotations on a disassembly of a Wast function.
Sorry for spamming this issue. Hopefully, that's enough lateral movements, now I can inspect the assembly and understand (sort of) the call flow. Time to move forward and do part 1: add/sub/shift/rotate.
Reporter | ||
Comment 10•8 years ago
|
||
(In reply to Daniel Näslund from comment #9)
> Created attachment 8812902 [details]
> annotated disassembly
A minor correction: Following the retq that follows .returnLabel, the instructions up to and including the next retq comprise the profiling epilogue. After that, the jmpq + int3 and callq + int3 sequences are indeed traps.
Assignee | ||
Comment 11•8 years ago
|
||
Assignee | ||
Comment 12•8 years ago
|
||
Need to ponder a bit more on the shift and rotate instructions. Just need a little more time I guess.
The f32.add and f64.add instructions were marked with this bug number. But floating point add instructions don't provide immediate operands AFAICT. I asked sunfish the leading question: "why do hardware floating point op instr don't provide immediate operands". He suggested that floating point constants aren't as common and that arm/mips has a limited number of bits available.
The MacroAssembler does not expose an API that has immediate operands for adding floating point values either.
Is there a way to provide floating point immediate add operations? Or was the comment a copy-paste slip?
I'll start thinking about part 2. I wonder how we can get a feel for what kind of optimizations are worthwhile? Some sort of instrumentation to determine how often we run mult/div/remainder instructions with a constant operand.
Replace multiplications of factor two by shifts is one thing that can be easily done. Guess its' time to pick out Henry Warrens book Hackers Delight from the shelf and read up on divisions by constants.
Assignee | ||
Updated•8 years ago
|
Assignee: nobody → dannas
Reporter | ||
Comment 13•8 years ago
|
||
(In reply to Daniel Näslund from comment #12)
> Created attachment 8813867 [details] [diff] [review]
> wip-1316803-part1.patch
>
> Is there a way to provide floating point immediate add operations? Or was
> the comment a copy-paste slip?
It wasn't a slip, but that the comment is there is no commitment to actually implement an optimization, it's just where the optimization would go if it were to be implemented.
On all platforms there are plenty of FP registers and it's almost certainly less important to implement this optimization for FP than for integer.
> Replace multiplications of factor two by shifts is one thing that can be
> easily done.
And if a small number of bits is set in the rhs it may be profitable to replace multiplication by a series of shifts and adds. It seems probable that Ion already has some heuristics here for how few bits can be set but I haven't looked.
We have benchmarks, you know: https://github.com/kripken/embenchen/tree/master/asm_v_wasm
Assignee | ||
Comment 14•8 years ago
|
||
(In reply to Lars T Hansen [:lth] from comment #13)
> (In reply to Daniel Näslund from comment #12)
> > Replace multiplications of factor two by shifts is one thing that can be
> > easily done.
>
> And if a small number of bits is set in the rhs it may be profitable to
> replace multiplication by a series of shifts and adds. It seems probable
> that Ion already has some heuristics here for how few bits can be set but I
> haven't looked.
I'll investigate that, though IonMonkey appears to only deal with -1, 0, 1, and powers of two: http://searchfox.org/mozilla-central/source/js/src/jit/x86-shared/CodeGenerator-x86-shared.cpp#944
Agner Fog says that instruction latencies for imul on recent x86 is 3-4. If that's correct, then there wouldn't be much use in replacing imul with shift and adds. But when I checked the gcc.godbolt.org output for icc and gcc, they did most of these transformations: http://www.xarg.org/2010/04/optimizing-integer-multiplication/. I guess maybe the shift and add can use different ports.
Anyway. I'll try a few approaches: only imul; replace -1, 0, 1 and powers of two; do shift-add transformations up to some cutoff, and compare the benchmark results.
IonMonkey does division optimizations for non-powers-of-two http://searchfox.org/mozilla-central/source/js/src/jit/shared/CodeGenerator-shared.cpp#1671. That looks fiddly and complicated for a BaseCompiler that is meant to just do the minimal amount of optimization.
I'll start with just replacing powers-of-two and see what that gives me in speedup.
(The IonMonkey links are for x86 only. Haven't investigate if we doe something radically different for arm and mips).
> We have benchmarks, you know:
> https://github.com/kripken/embenchen/tree/master/asm_v_wasm
Ah, great! I wasn't aware of them.
Reporter | ||
Comment 15•8 years ago
|
||
BTW, there's no reason not to finish + review + land the patch for add/sub/shift/rotate/bitwise-ops now, we don't have to have a single patch for this bug. In fact I'd probably prefer to review that patch separately from any multiply/divide work, since the latter is probably somewhat more complex and less obviously a benefit.
Assignee | ||
Comment 16•8 years ago
|
||
Attachment #8813859 -
Attachment is obsolete: true
Assignee | ||
Comment 17•8 years ago
|
||
Assignee | ||
Comment 18•8 years ago
|
||
Assignee | ||
Comment 19•8 years ago
|
||
See assembly_optimized.txt for the assembly output for each operation.
This is supposed to be a performance improvement, but so far I haven't proved that it speed things up. I tried running the embenchen benchmark but saw no performance improvement. Thought, my testing was ad-hoc. I'll try again tomorrow and make sure to have a lightly loaded system, pin the test to one cpu and do a larger number of samples.
Do you have a script for running the embenchen benchmarks from the shell and automate measurements? I've ran the benchmarks in benches/ one by one and checked user cpu time with time(8) so far.
I'll write a few artifical benchmarks for myself as well (e.g. perform N add-operations with constant rhs) to see what to expect in the best-case.
Are you ok with dropping constant folding for floating point? Should I give it more thought?
Attachment #8814718 -
Flags: review?(lhansen)
Reporter | ||
Comment 20•8 years ago
|
||
(In reply to Daniel Näslund from comment #19)
> Do you have a script for running the embenchen benchmarks from the shell
> and automate measurements?
I do now, because I needed it myself. Clone https://github.com/lars-t-hansen/embenchen. Then cd to asm_v_wasm/ (not benches/) and run wasm_bench.sh.
Reporter | ||
Comment 21•8 years ago
|
||
Comment on attachment 8814718 [details] [diff] [review]
bug_1316803_part_1_fold_constant_rhs_for_add_sub_shift_rotate_bitwise_ops.patch
Review of attachment 8814718 [details] [diff] [review]:
-----------------------------------------------------------------
Thank you! Very nice, but we'll do one more round because of the problems noted below for rotate which require fixes + test cases.
As for data showing this is an optimization, we should run the benchmark suite but I expect the results to be mostly in the noise because the benchmark suite is pretty narrow. If we had some crypto or some graphics code, then maybe...
Obviously contributions to the benchmark suite or the benchmark runner that alleviate that problem are very welcome, but I'm perfectly happy to take this patch without that.
::: js/src/wasm/WasmBaselineCompile.cpp
@@ +4463,5 @@
> {
> + int32_t c;
> + if (popConstI32(c)) {
> + RegI32 r = popI32();
> + masm.rotateRight(Imm32(c), r, r);
"c & 31". Apparently we do not have a test case that exercises that case, so it would be lovely if you could create one and include it with this patch. We should also have a test for the case where c is zero, which the MacroAssembler allows but has to handle specially on some architectures.
@@ +4480,5 @@
> {
> + int64_t c;
> + if (popConstI64(c)) {
> + RegI64 r = popI64();
> + masm.rotateRight64(Imm32(c), r, r, InvalidReg);
"c & 63", ditto.
@@ +4497,5 @@
> {
> + int32_t c;
> + if (popConstI32(c)) {
> + RegI32 r = popI32();
> + masm.rotateLeft(Imm32(c), r, r);
"c & 31", ditto.
@@ +4514,5 @@
> {
> + int64_t c;
> + if (popConstI64(c)) {
> + RegI64 r = popI64();
> + masm.rotateLeft64(Imm32(c), r, r, InvalidReg);
"c & 63", ditto.
Attachment #8814718 -
Flags: review?(lhansen)
Reporter | ||
Comment 22•8 years ago
|
||
(In reply to Daniel Näslund from comment #19)
>
> Are you ok with dropping constant folding for floating point?
Yes. Let's drop that (and remove the comments, as you've done) and let that be a data-driven item, whenever we get around to cataloguing instruction sequences.
Assignee | ||
Comment 23•8 years ago
|
||
Fixed range of rotate count.
I've added a test for rotating a too large i64 number. There already exists such a test for i32 (a count &= 0x1f statement in the x64 MacroAssembler caused that test not to trigger)
AFAICT, there already exists tests for rotate counts of zero. The i32 case can be found here: http://searchfox.org/mozilla-central/source/js/src/jit-test/tests/wasm/integer.js#130
The i64 test for 0 rotate count is here: http://searchfox.org/mozilla-central/source/js/src/jit-test/tests/wasm/integer.js#209
To me, it was a surprise to learn that different hardware platforms treat too large shift and rotate amounts differently.
Attachment #8813867 -
Attachment is obsolete: true
Attachment #8814718 -
Attachment is obsolete: true
Attachment #8815042 -
Flags: review?(lhansen)
Assignee | ||
Comment 24•8 years ago
|
||
Benchmark results from running lth's wasm_bench.sh script for the embenchen benchmarks. The results are inconclusive, but atleast there's no major slowdown.
Comment 25•8 years ago
|
||
(In reply to Daniel Näslund from comment #23)
> To me, it was a surprise to learn that different hardware platforms treat
> too large shift and rotate amounts differently.
FWIW, this goes for C++ as well -- and the rationale for C++ doing so is exactly that different hardware behaves differently on it, and picking any one semantics penalizes code compiled on whatever hardware implemented different semantics.
Reporter | ||
Comment 26•8 years ago
|
||
(In reply to Daniel Näslund from comment #24)
> Created attachment 8815064 [details]
> benchmark_results_part_1.txt
>
> Benchmark results from running lth's wasm_bench.sh script for the embenchen
> benchmarks. The results are inconclusive, but atleast there's no major
> slowdown.
Actually, for my benchmark runner the first two columns are such that lower-is-better, but for the ratio in the third column higher-is-better (this is probably dumb but there you have it, I should print a proper legend, WILLFIX). These numbers look pretty noisy; I'm going to do a local run as well, since I have some stable numbers from yesterday.
Reporter | ||
Comment 27•8 years ago
|
||
Compared to a stable run from yesterday (7 runs). Most changes are extremely minor, to get a ++ below you only need to improve the ratio by .02 or more.
box2d 969 1675 .578 +
bullet 1191 2085 .571 -
conditionals 287 960 .298 +
copy 591 2935 .201 +
corrections 961 1467 .655 +
fannkuch 2186 4628 .472 +
fasta 739 2113 .349 -
ifs 197 294 .670 ++
linpack 8276 10401 .795 +
lua_binarytrees 3916 5941 .659 ++
lua_scimark 4147 7102 .583 +
memops 889 3510 .253 +
primes 1141 1155 .987 -
skinning 824 2187 .376 +
zlib 1144 1955 .585 +
I'd say this is just fine.
Reporter | ||
Updated•8 years ago
|
Attachment #8815042 -
Flags: review?(lhansen) → review+
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 28•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d62d5b78e234
Fold constant rhs for add/sub/shift/rotate/bitwise-ops. r=lth
Keywords: checkin-needed
I had to back this out for failures like https://treeherder.mozilla.org/logviewer.html#?job_id=40066600&repo=mozilla-inbound
https://hg.mozilla.org/integration/mozilla-inbound/rev/754af8c55a9d
Flags: needinfo?(dannas)
Assignee | ||
Comment 30•8 years ago
|
||
My apologies. I wasn't aware of the distinction between mozilla-central and mozilla-inbound.
I've applied for try server access, and will make sure in the future that my patches apply cleanly against mozilla-inbound.
I'll also make sure to use the "r?somebody" pattern in my commit messages to ensure that a push will fail-fast for missing r= stanzas.
Again, sorry for the inconvenience.
Flags: needinfo?(dannas)
Reporter | ||
Updated•8 years ago
|
Keywords: leave-open
Assignee | ||
Comment 31•8 years ago
|
||
When building for arm64 in treeherder, I get these build failures for part1:
/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:3990:31: error: use of deleted function 'void js::jit::MacroAssembler::add64(js::jit::Imm64, js::jit::Register64)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:769:17: error: declared here
/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:4044:31: error: use of deleted function 'void js::jit::MacroAssembler::sub64(js::jit::Imm64, js::jit::Register64)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:788:17: error: declared here
/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:4649:59: error: use of deleted function 'void js::jit::MacroAssembler::rotateRight64(js::jit::Imm32, js::jit::Register64, js::jit::Register64, js::jit::Register)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:911:17: error: declared here
/home/worker/workspace/build/src/js/src/wasm/WasmBaselineCompile.cpp:4683:58: error: use of deleted function 'void js::jit::MacroAssembler::rotateLeft64(js::jit::Imm32, js::jit::Register64, js::jit::Register64, js::jit::Register)'
/home/worker/workspace/build/src/js/src/jit/MacroAssembler.h:902:17: error: declared here
make[3]: *** [Unified_cpp_js_src37.o] Error 1
make[3]: *** Waiting for unfinished jobs....
make[2]: *** [js/src/target] Error 2
make[1]: *** [compile] Error 2
make: *** [default] Error 2
subprocess.CalledProcessError: Command 'make -s -w -j6' returned non-zero exit status 2
A few operations with immediate operands are not defined for arm64. What to do now?
1. Add #ifndef JS_CODEGEN_ARM64 around the operations using constants. Gets a bit fiddly, but doable. Then create an issue for the missing methods. Or..
2. Add the missing methods to jit/arm64/MacroAssembler-inl.h. But then I'd have to test the functionality and I don't have access to a computer equipped with an arm64 processor. The arm simulator only supports 32 bits https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Hacking_Tips#Using_the_ARM_simulator. Can I use qemu? Some amazon VM?
Flags: needinfo?(lhansen)
Comment 32•8 years ago
|
||
(In reply to Daniel Näslund from comment #31)
> 2. Add the missing methods to jit/arm64/MacroAssembler-inl.h. But then I'd
> have to test the functionality and I don't have access to a computer
> equipped with an arm64 processor. The arm simulator only supports 32 bits
> https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/
> Hacking_Tips#Using_the_ARM_simulator. Can I use qemu? Some amazon VM?
ARM64 has a separate simulator. On a 64-bit host system, you can build it by passing --enable-simulator=arm64 to configure.
The ARM64 IonMonkey JIT doesn't work, so it's hard to test these things. You could either implement them correctly, or implement them as MOZ_CRASH("NYI"), with an open bug.
The missing functions look easy enough to implement that I would suggest just implementing them. I would be happy to review that part of the patch.
Flags: needinfo?(lhansen)
Reporter | ||
Comment 33•8 years ago
|
||
Daniel, extra points to you if you implement these functions. For a while I was doing that myself, but really, lately Arm64 has gotten in the way of getting work done and I've started just stubbing out everything with a MOZ_CRASH("NYI"). There must be dozens of those by now. A few more won't make much difference.
Assignee | ||
Comment 34•8 years ago
|
||
Assignee | ||
Comment 35•8 years ago
|
||
I've verified that this patch now builds for arm64 by building with configure --enable-simulator=arm64.
I wrote an implementation for Add64; reasoned that by following the pattern for the other overloaded Add methods I'd do ok.
For sub and rotate there were no examples available. So I stubbed them out. The work of implementing the stubs for arm64 is tracked since before in https://bugzilla.mozilla.org/show_bug.cgi?id=1313336
Attachment #8815042 -
Attachment is obsolete: true
Attachment #8817216 -
Flags: review?(lhansen)
Reporter | ||
Comment 36•8 years ago
|
||
Comment on attachment 8817216 [details] [diff] [review]
bug_1316803_v3_fold_constant_rhs_for_add_sub_shift_rotate_bitwise_ops.patch
Review of attachment 8817216 [details] [diff] [review]:
-----------------------------------------------------------------
Thank you! Nice work.
Attachment #8817216 -
Flags: review?(lhansen) → review+
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 37•8 years ago
|
||
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/3bfbdec4ea08
part 1 - fold constant rhs for add/sub/shift/rotate/bitwise-ops r=lth
Keywords: checkin-needed
Comment 38•8 years ago
|
||
Sorry, I had to back this out for asserting in wasm.js (low 32 bits don't match) on Windows XP 32-bit:
https://hg.mozilla.org/integration/mozilla-inbound/rev/dffe4787171445ea01f99a41514d262f7882d120
Push with failure: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3bfbdec4ea08c9fb7fb0a4ee624c1607e85a4013
Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=40379278&repo=mozilla-inbound
TEST-PASS | js\src\jit-test\tests\wasm\import-export.js | Success (code 0, args "--wasm-always-baseline")
TEST-PASS | js\src\jit-test\tests\wasm\import-gc.js | Success (code 0, args "--no-baseline --baseline-eager")
TEST-PASS | js\src\jit-test\tests\wasm\import-gc.js | Success (code 0, args "--no-baseline --wasm-always-baseline")
TEST-PASS | js\src\jit-test\tests\wasm\import-gc.js | Success (code 0, args "--no-baseline --no-baseline --no-ion")
TEST-PASS | js\src\jit-test\tests\wasm\js-reexport.js | Success (code 0, args "")
c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9 Error: Assertion failed: got 1898136935, expected -2128394905: low 32 bits don't match
Stack:
assertEqI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9
_wasmFullPassInternal@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:125:5
wasmFullPassI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:147:70
testBinary64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:32:5
@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:206:5
Exit code: 3
FAIL - wasm\integer.js
TEST-UNEXPECTED-FAIL | js\src\jit-test\tests\wasm\integer.js | c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9 Error: Assertion failed: got 1898136935, expected -2128394905: low 32 bits don't match (code 3, args "--wasm-always-baseline")
INFO exit-status : 3
INFO timed-out : False
INFO stderr 2> c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9 Error: Assertion failed: got 1898136935, expected -2128394905: low 32 bits don't match
INFO stderr 2> Stack:
INFO stderr 2> assertEqI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:67:9
INFO stderr 2> _wasmFullPassInternal@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:125:5
INFO stderr 2> wasmFullPassI64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\lib\wasm.js:147:70
INFO stderr 2> testBinary64@c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:32:5
INFO stderr 2> @c:\builds\moz2_slave\m-in_w32_sm-plain-000000000000\src\js\src\jit-test\tests\wasm\integer.js:206:5
Please fix the issue and submit a new patch. Thank you.
Flags: needinfo?(dannas)
Assignee | ||
Comment 39•8 years ago
|
||
Assignee | ||
Comment 40•8 years ago
|
||
Assignee | ||
Comment 41•8 years ago
|
||
Changes v3 => v4
* Allocate valid temp register on x86 for rotateI64.
Attachment #8817216 -
Attachment is obsolete: true
Flags: needinfo?(dannas)
Attachment #8817919 -
Flags: review?(lhansen)
Reporter | ||
Updated•8 years ago
|
Attachment #8817919 -
Flags: review?(lhansen) → review+
Assignee | ||
Comment 42•8 years ago
|
||
https://treeherder.mozilla.org/#/jobs?repo=try&revision=ec3372c74e8ccb03e4edd87d5456bfd292ffe948
Three "intermittent orange" (jandem helped me classify them):
tc(V): Vague valgrind error. Green when I re-ran the test.
tc(B): gcc segfault. Green when I re-ran.
SM(cgc): Known intermittent error.
All other tests pass.
Keywords: checkin-needed
Comment 43•8 years ago
|
||
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/8f771233d340
b=Bug 1316803 part 1 - fold constant rhs for add/sub/shift/rotate/bitwise-ops. r=lth
Keywords: checkin-needed
Comment 44•8 years ago
|
||
bugherder |
Assignee | ||
Comment 45•8 years ago
|
||
The TODO comment said: optimize for lhs >= 0 && c is power of two. But we can replace division with shifts even for negative lhs. I did that.
The div/rem emit-methods for i64 are different for 32-bit platforms and 64-bit platforms. I've only modified the 64bit platforms. The 32 bit code was a little too entangled. I can do that as well, if we think that some improvements can be gained there (I recently saw numbers from some mozilla metrics page suggesting that 90% of all firefox users are on 32 bit platforms, so maybe we should consider that case as well).
I'll attach some embenchen benchmarks numbers. I saw some improvements.
Attachment #8820830 -
Flags: review?(lhansen)
Assignee | ||
Comment 46•8 years ago
|
||
Reporter | ||
Comment 47•8 years ago
|
||
Comment on attachment 8820830 [details] [diff] [review]
bug1316803_part2_optimize_division_for_constant_rhs.patch
Review of attachment 8820830 [details] [diff] [review]:
-----------------------------------------------------------------
Sorry this took so long to review; Hawaii encourages laziness...
This is looking good but I'd like you to do another round with more test cases and with refactorings as suggested below.
I think not handling the 64-bit case on 32-bit platforms is fine. The situation with many 32-bit users is temporary: The majority of Firefox users are on Windows, and we only recently started shipping 64-bit Windows builds. Since 32-bit Firefox does not auto-upgrade to 64-bit the users must be persuaded to switch over; this is taking time. I expect that we will soon have few desktop users that run 32-bit. Mobile is a different matter but let's pick low-hanging fruit for now. I'm much more interested in diverting your attention to the multiply-by-constant case than in fast 64-bit division :-)
::: js/src/jit-test/tests/wasm/integer.js
@@ +109,5 @@
> testBinary32('div_s', -40, 2, -20);
> testBinary32('div_u', -40, 2, 2147483628);
> testBinary32('rem_s', 40, -3, 1);
> testBinary32('rem_s', 0, -3, 0);
> +testBinary32('rem_s', 5, 2, 1);
I'd like to see a test that does not divide by 2, but by a higher power of two, since that is more interesting. The same goes for div_s, div_u, and rem_u (which appears not to have a test for constant power-of-two rhs at all).
Since there are two code paths for div_s and rem_s depending on whether the lhs is negative or not, we want test cases here for both paths.
::: js/src/wasm/WasmBaselineCompile.cpp
@@ +4155,5 @@
> void
> BaseCompiler::emitQuotientI32()
> {
> + int32_t c;
> + if (popConstI32(c)) {
Here I'd like to see a different pattern: instead of what you do here, introduce popConstPositivePowerOfTwoI32(uint32_t& c, uint8_fast_t& power) and use that instead - the second argument is the exponent.
(And since we know it is a power of two presumably the exponent can be computed quickly in that routine by counting leading or trailing zeroes, we may not need the full power / expense of FloorLog2?)
Then the code for the known-power-of-two case becomes the "then" branch for that condition, and the existing general code becomes the "else" branch, as so many other places in this compiler. That gets rid of the try-to-pop-but-undo pattern used here.
Ditto for the 64-bit case.
Attachment #8820830 -
Flags: review?(lhansen)
Assignee | ||
Comment 48•8 years ago
|
||
Assignee | ||
Comment 49•8 years ago
|
||
Assignee | ||
Comment 50•8 years ago
|
||
Changes v1 => v2
* Introduced popConstPositivePowerOfTwoIXX(). One quirk: I'm not allowing 1 as a valid power-of-two, to avoid problems with ARM shift instructions not accepting zero shift widths. The alternative was to add if statements for the shift-calls. This looks cleaner to me.
* You suggested that maybe I should replace FloorLog2. I pasted the gcc mozilla::FloorLog2 code to https://godbolt.org/g/5zR0Fr; it generates a bsr instruction and a subtract. So I guess FloorLog2 is good enough.
* Added testcases for larger power-of-two divisors and negative dividends.
Attachment #8820830 -
Attachment is obsolete: true
Attachment #8822264 -
Flags: review?(lhansen)
Reporter | ||
Comment 51•8 years ago
|
||
Comment on attachment 8822264 [details] [diff] [review]
bug1316803_part2_v2_optimize_division_for_constant_rhs.patch
Review of attachment 8822264 [details] [diff] [review]:
-----------------------------------------------------------------
Nice work.
::: js/src/wasm/WasmBaselineCompile.cpp
@@ +1701,5 @@
> stk_.popBack();
> return true;
> }
>
> + MOZ_MUST_USE bool popConstPositivePowerOfTwoI32(int32_t& c, uint_fast8_t &power) {
House style is "uint8_fast_t& power", note location of the space character. Also below.
@@ +1708,5 @@
> + return false;
> + c = v.i32val();
> + uint32_t u = static_cast<uint32_t>(c);
> + // c=1 is a power-of-two, but we use |power| for shift amounts, and
> + // some platforms don't support shifts by zero.
I can live with this and you can land it if you like, but consider the following.
For quotient, it would probably be better to test c <= 0 here and special-case c==1 in the code below (where the entire body of the constant-rhs case becomes wrapped by "if (power != 0) { ... }"), as this does not penalize simple code generators that emit e/1, which I can see as somewhat likely.
For remainder, e%1 is zero always, surely, so perhaps again the code below can guard on that and always push zero if power == 0? This is not an important optimization but just a fallout of changing this routine here to benefit quotient. The extra complexity is regrettable.
If it were me, I would probably pass the cutoff value for c as a parameter and pass '0' from quotient and '1' from remainder, just to keep things manageable.
Anyway, if you do decide on something fancier here I'd appreciate another look before you land.
@@ +1709,5 @@
> + c = v.i32val();
> + uint32_t u = static_cast<uint32_t>(c);
> + // c=1 is a power-of-two, but we use |power| for shift amounts, and
> + // some platforms don't support shifts by zero.
> + if (c <= 1 || !IsPowerOfTwo(u))
Now I'm nitpicking but since u has a single use, why not expand the rhs of the definition for u into the argument of IsPowerOfTwo? Indeed, doesn't the language provide int32_t -> uint32_t casting for us? (Ditto below.)
@@ +1722,5 @@
> + if (v.kind() != Stk::ConstI64)
> + return false;
> + c = v.i64val();
> + uint64_t u = static_cast<uint64_t>(c);
> + // See comment for popConstPositivePowerOfTwoI32
Might as well copy that comment down here, as it is quite short.
Attachment #8822264 -
Flags: review?(lhansen) → review+
Assignee | ||
Comment 52•8 years ago
|
||
Assignee | ||
Comment 53•8 years ago
|
||
Changes v2 => v3
* Adjust whitespace and line length according to style guide.
* Handle power=0 by providing a cutoff and special if-statement for quotient
* Remove temp var for arg to IsPowerOfTwo; but it's a template function so I can't rely on implicit casting - need to provide an exact match.
Attachment #8822264 -
Attachment is obsolete: true
Attachment #8822620 -
Flags: review?(lhansen)
Reporter | ||
Comment 54•8 years ago
|
||
Comment on attachment 8822620 [details] [diff] [review]
bug1316803_part2_v3_optimize_division_for_constant_rhs.patch
Review of attachment 8822620 [details] [diff] [review]:
-----------------------------------------------------------------
Very nice. Just a couple of nits to clean up here, I think we're otherwise ready to ship.
::: js/src/wasm/WasmBaselineCompile.cpp
@@ +1703,5 @@
> }
>
> + MOZ_MUST_USE bool popConstPositivePowerOfTwoI32(int32_t& c,
> + uint_fast8_t& power,
> + int32_t cutoff = 0) {
Opening brace goes on next line when signature takes up more than one line. Also below. And I strongly prefer for the cutoff argument not to have a default value.
@@ +4205,5 @@
> {
> + int32_t c;
> + uint_fast8_t power;
> + if (popConstPositivePowerOfTwoI32(c, power)) {
> + RegI32 r = popI32();
You can move pop/push into the if, since pushing what you just popped is a no-op -- except that it may also force the value into a register, so it's really best to just leave it alone. Also below, several places.
Attachment #8822620 -
Flags: review?(lhansen) → review+
Reporter | ||
Comment 55•8 years ago
|
||
(In reply to Lars T Hansen [:lth] from comment #54)
> Comment on attachment 8822620 [details] [diff] [review]
> bug1316803_part2_v3_optimize_division_for_constant_rhs.patch
>
> @@ +4205,5 @@
> > {
> > + int32_t c;
> > + uint_fast8_t power;
> > + if (popConstPositivePowerOfTwoI32(c, power)) {
> > + RegI32 r = popI32();
>
> You can move pop/push into the if, since pushing what you just popped is a
> no-op -- except that it may also force the value into a register, so it's
> really best to just leave it alone.
Poor English in my part - what I mean is, it really is best to move pop/push into the if(), leaving the value on the stack alone if the power is zero. It won't matter in practice but it's more logical code.
Assignee | ||
Comment 56•8 years ago
|
||
Changes v3 => v4
* Fix brace placement
* Replace default param with explicit param
* Move push/pop inside the if statement.
Attachment #8822620 -
Attachment is obsolete: true
Attachment #8823138 -
Flags: review?(lhansen)
Assignee | ||
Comment 57•8 years ago
|
||
First attempt at optimizing multiply-by-constant.
I replace near-powers-of-two rhs with shl+add, shl or shl+sub. Ran the embenchen benchmark. No speedups.
Alternatives:
* Replace all rhs that has just one or two bits set, with shl+shl+add. E.g. replace 1010, but not 1110. mbx listed the most common multipliers for some wast source files here: https://bugzilla.mozilla.org/show_bug.cgi?id=1275442#c6 (I wonder if there's an already existing tool for analyzing wast/wasm source files like that or if he wrote an ad-hoc script). Many of the most common multipliers there have two bits set.
* Optimize each case for values below some threshold, say 10 or 20, individually.
* Add optimizations for negative multipliers as well, a neg instruction after the shl's+add's.
FWIW, here's a godbolt link where I list what optimizations gcc/clang/icc do https://godbolt.org/g/chLZtN
Reporter | ||
Comment 58•8 years ago
|
||
Comment on attachment 8823138 [details] [diff] [review]
bug1316803_part2_v4_optimize_division_for_constant_rhs.patch
Review of attachment 8823138 [details] [diff] [review]:
-----------------------------------------------------------------
Yeah, ship it.
As a matter of process: once you've got r+ with a request to fix a few straggling issues, it is usually not necessary to ask for review again. Mostly people just push the patch, or update the patch while setting r+ themselves with a comment saying "Carrying r+", if they are allowed to do that.
Attachment #8823138 -
Flags: review?(lhansen) → review+
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 59•8 years ago
|
||
Pushed by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7410f0a2600e
part 2 - optimize division for constant rhs. r=lth
Keywords: checkin-needed
Comment 60•8 years ago
|
||
bugherder |
Assignee | ||
Comment 61•8 years ago
|
||
I've added optimizations for rhs with popcount == 2 in addition to rhs close to powers-of-two.
No speedups when running the embenchen benchmarks
So, either we're not calling the mult-with-constant code or that code is not significantly faster than the original.
I'll try lth's patch for annotating optimizations with counters to see how often we do run the mult-with-constant path.
I'll try to write an artifical wast benchmark program as well that does lots of multiplications with constants to see if I do get a speedup when the new code is called often.
Attachment #8823139 -
Attachment is obsolete: true
Reporter | ||
Comment 62•8 years ago
|
||
(In reply to Daniel Näslund from comment #61)
> Created attachment 8825516 [details] [diff] [review]
> bug1316803_part3_v1_optimize_mult_for_constant_rhs.patch
>
> No speedups when running the embenchen benchmarks
>
> So, either we're not calling the mult-with-constant code or that code is not
> significantly faster than the original.
CPUs are getting really good at multiplication, probably a good thing.
I have an ARM dev board; my range check optimization, which yields little speedup on x86, is much more effective on the simpler ARM. If I can find some time today I'll try your patch there to see if the same holds for multiplication. (I saw a couple of big speedups after your division patch.)
It's quite possible that multiplications by constants are not common in current wasm code, the C++ compiler already having strength-reduced them. That may change once we get other and less sophisticated producers of wasm code. On balance, let's do the obvious things here -- popcount==2 and near-power-of-two are good -- but let's not overdo it.
--lars
Assignee | ||
Comment 63•8 years ago
|
||
Comment on attachment 8825516 [details] [diff] [review]
bug1316803_part3_v1_optimize_mult_for_constant_rhs.patch
Adding the r? flag to make it easier to find in your review queue.
As I mentioned earlier, I haven't been able to detect any speed ups with this patch on my Skylake i7 laptop. So I guess there are three options:
1. Find some other, less capable, hardware and verify that this patch speeds up things there. Then ship the patch.
2. Assume that embenchen does a good job of strength reducing multiplication by constant, but that future wasm generators may be less capable. So, ship the patch.
3. Drop the patch.
Attachment #8825516 -
Flags: review?(lhansen)
Reporter | ||
Comment 64•8 years ago
|
||
Simple microbenchmark.
Jetson TK1, release build:
Without the optimization: 562ms
With the optimization: 952ms
I'm pretty sure I got this right, so the optimization represents a significant performance regression. I don't know why that is, I haven't had time to look any deeper. The code in the patch looks all right though.
I've experimented briefly with other operands (not just 1*2 as here) without seeing any slowdown from the multiply instruction.
(The optimization does not change the running times of any of the benchmarks on ARM.)
Reporter | ||
Comment 65•8 years ago
|
||
Comment on attachment 8825516 [details] [diff] [review]
bug1316803_part3_v1_optimize_mult_for_constant_rhs.patch
Review of attachment 8825516 [details] [diff] [review]:
-----------------------------------------------------------------
With a fix to the benchmark to avoid the problem I've commented on below, things are getting more interesting.
- the case where we can transform a multiply to a shift is marginally faster than a multiply (520ms vs 560ms)
- the case where the multiply turns into a mov, a shift, and an add is almost 2x slower than multiply (950ms vs 560ms)
In both cases the lhs operands were smallish constants. It's possible larger constants on both lhs and rhs will change this (no early exits) but I don't know.
Presumably these ratios are massively hardware-dependent, but the Tegra on the Jetson TK1 is not brand new, and our target is fairly high-end hardware, so let's assume it's representative.
I'll clear the r? flag for now. I think that (a) we can probably land a fix for power-of-two only since that seems like a clear win and (b) benchmark more if we want to do near-power-of-two.
::: js/src/wasm/WasmBaselineCompile.cpp
@@ +1785,5 @@
> + return false;
> + uint64_t u = static_cast<uint64_t>(c);
> + if (IsPowerOfTwo(u - 1))
> + offset = -1;
> + else if (IsPowerOfTwo(u))
We need to test IsPowerOfTwo(u) first because of the presumably common case '2', since 2-1==1 is also a power of 2, leading to pretty silly code for that case.
Attachment #8825516 -
Flags: review?(lhansen)
Reporter | ||
Comment 66•8 years ago
|
||
That said, given how marginal the win is for the shift, I don't think this particular patch is high priority, in the event you are torn about what to spend time on.
Reporter | ||
Comment 67•8 years ago
|
||
In order to do much better on multiply (notably, better than a multiply instruction) we'll need to descend to the platform level to use LEA on x86 and add-shifted-operand on ARM, to keep the instructions counts down. If we're going to do that kind of instruction selection we have a long list of opportunities, but that's a different and very low priority bug (which I will open, just to track it).
I think we're done here.
Reporter | ||
Comment 68•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/f91c4c151a403db894c086e1b1b0a4c9513182e6
Bug 1316803 - Remove obsolete comments. r=me DONTBUILD
Comment 69•8 years ago
|
||
bugherder |
You need to log in
before you can comment on or make changes to this bug.
Description
•