Closed Bug 1021716 Opened 10 years ago Closed 10 years ago

SIMD backend: implement Shuffle

Categories

(Core :: JavaScript Engine: JIT, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla36

People

(Reporter: bbouvier, Assigned: bbouvier)

References

Details

Attachments

(7 files, 23 obsolete files)

(deleted), patch
sunfish
: review+
Details | Diff | Splinter Review
(deleted), patch
bbouvier
: review+
Details | Diff | Splinter Review
(deleted), patch
bbouvier
: review+
Details | Diff | Splinter Review
(deleted), patch
sunfish
: review+
Details | Diff | Splinter Review
(deleted), patch
sunfish
: review+
Details | Diff | Splinter Review
(deleted), patch
sunfish
: review+
Details | Diff | Splinter Review
(deleted), patch
sunfish
: review+
Details | Diff | Splinter Review
No description provided.
Attached patch bug1021716.patch (obsolete) (deleted) — Splinter Review
The patch would actually need to be more split up (split the IonTypes.h parts from the rest), but this is an instance of an proper implementation of a SIMD operator, which shall be a good example of SIMD operators impl.
Attachment #8435808 - Flags: feedback?(sunfish)
Comment on attachment 8435808 [details] [diff] [review] bug1021716.patch Review of attachment 8435808 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/assembler/assembler/X86Assembler.h @@ +2581,5 @@ > m_formatter.twoByteOp(OP2_MOVD_VdEd, (RegisterID)dst, src); > } > > + void pshufd_rr(uint32_t lane, XMMRegisterID src, XMMRegisterID dst) > + { This could assert lane < 256. Also, calling this variable "lane" is a little confusing, since it's not a lane index in pshufd. How about "mask"? @@ +2582,5 @@ > } > > + void pshufd_rr(uint32_t lane, XMMRegisterID src, XMMRegisterID dst) > + { > + spew("pshufd %d, %s, %s", %x like shufps uses, would be nicer than %d here. @@ +2589,5 @@ > + m_formatter.twoByteOp(OP2_PSHUFD_VdqWdq, (RegisterID)dst, (RegisterID)src); > + m_formatter.immediate8(lane); > + } > + void shufps_rr(uint32_t lane, XMMRegisterID src, XMMRegisterID dst) > + { The comments on pshufd apply here too. ::: js/src/jit/IonTypes.h @@ +263,5 @@ > + return type == MIRType_Int32x4 || type == MIRType_Float32x4; > +}; > + > +static inline unsigned > +SIMDTypeToArity(MIRType type) "Arity" to describe the number of vector elements seems a little uncommon, but I think it fits, and is less clunky than "NumElements". It would be nice to have a comment here confirming that that's what this is. ::: js/src/jit/LIR-Common.h @@ +127,5 @@ > return moves_[i]; > } > }; > > +// Retrieve element from a SIMD int32x4 Many people call this operation "extract" so it would be good to use that word somewhere in the comment. @@ +146,5 @@ > + return lane_; > + } > +}; > + > +// Retrieve element from a SIMD float32x4: Ditto about "extract". ::: js/src/jit/MIR.h @@ +1106,5 @@ > > bool canProduceFloat32() const; > }; > > +class MSimdGetLane : public MUnaryInstruction Please add a brief comment above this class describing it, and including the word "extract" somewhere. ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2096,5 @@ > + uint32_t lane = ins->lane(); > + if (lane == 0 && input != output) > + masm.moveLowFloat32(input, output); > + else > + masm.shuffleFloat32(lane, input, output); The lane variable here appears to represent a SIMD element index, but shufps (called by shuffleFloat32) needs a full shuffle mask value. You'll need to convert from lane index to shuffle mask somewhere. Is shuffleFloat32 intended to be an interface that other platforms will also provide? If so, we should figure out some common mask representation that can be shared.
Attachment #8435808 - Flags: feedback?(sunfish)
Blocks: 1023404
Attached patch bug1021716.patch (obsolete) (deleted) — Splinter Review
Splitted patch
Attachment #8435808 - Attachment is obsolete: true
(In reply to Dan Gohman [:sunfish] from comment #2) > Comment on attachment 8435808 [details] [diff] [review] > bug1021716.patch > > Review of attachment 8435808 [details] [diff] [review]: > ----------------------------------------------------------------- > > > This could assert lane < 256. Also, calling this variable "lane" is a little > confusing, since it's not a lane index in pshufd. How about "mask"? Done. Also used a uint8_t rather than a uint32_t, making lane < 256 by design (and fixing a bug I was seeing locally \o/). > > %x like shufps uses, would be nicer than %d here. Done. I also added the comments you asked for. > > ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp > @@ +2096,5 @@ > > + uint32_t lane = ins->lane(); > > + if (lane == 0 && input != output) > > + masm.moveLowFloat32(input, output); > > + else > > + masm.shuffleFloat32(lane, input, output); > > The lane variable here appears to represent a SIMD element index, but shufps > (called by shuffleFloat32) needs a full shuffle mask value. You'll need to > convert from lane index to shuffle mask somewhere. Is shuffleFloat32 > intended to be an interface that other platforms will also provide? If so, > we should figure out some common mask representation that can be shared. Coincidentally, on x86, for extracting an element from any word to the low word, the lane *is* the mask. I will add a comment for that. shuffleFloat32 is intended to be a common interface in the future, so we might need this function to convert from lane index to shuffle mask, if this property doesn't hold on ARM.
Attached patch bug1021716.patch (obsolete) (deleted) — Splinter Review
Modifications: - all remarks addressed - moved methods from jit/shared/Lowering-x86-shared to jit/Lowering - dummy implementation in ARM, so that it can compile - made a Lane enum, so that we can abstract the x86 equivalence between lane and mask (and prepare for ARM?).
Attachment #8437782 - Attachment is obsolete: true
Attachment #8439353 - Flags: feedback?(sunfish)
I am actually wondering whether this MIR node shouldn't even be called MSimdExtractElement, as this will make it clearer to the ground of what it's achieving. I'll probably add an implementation of MSimdShuffle here too, which is just a generalization of this first MIR node after all. Or even just replace it.
Comment on attachment 8439353 [details] [diff] [review] bug1021716.patch Review of attachment 8439353 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/assembler/assembler/X86Assembler.h @@ +2580,5 @@ > m_formatter.prefix(PRE_SSE_66); > m_formatter.twoByteOp(OP2_MOVD_VdEd, (RegisterID)dst, src); > } > > + void pshufd_rr(uint8_t mask, XMMRegisterID src, XMMRegisterID dst) Using uint8_t here is neat, but unfortunately both GCC and clang will silently truncate if a wider integer is passed in. There's no warning even with -Wall -Wextra in either (there is a warning with -Wconversion, but you have to ask for it specifically). Consequently, I think this code would be more strict in practice if it accepts an int32_t or whatever and checks that it's in range with an assert. @@ +2588,5 @@ > + m_formatter.prefix(PRE_SSE_66); > + m_formatter.twoByteOp(OP2_PSHUFD_VdqWdq, (RegisterID)dst, (RegisterID)src); > + m_formatter.immediate8(mask); > + } > + void shufps_rr(uint8_t mask, XMMRegisterID src, XMMRegisterID dst) Ditto about uint8_t. ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2097,5 @@ > + SIMDLane lane = ins->lane(); > + if (lane == LaneX) { > + // The value we want to extract is in the low double-word > + if (input != output) > + masm.moveLowFloat32(input, output); Quoting for reference below... ::: js/src/jit/shared/MacroAssembler-x86-shared.h @@ +507,5 @@ > + > + void shuffleFloat32(uint8_t mask, FloatRegister src, FloatRegister dest) { > + shufps(mask, src, dest); > + } > + void moveLowFloat32(FloatRegister src, FloatRegister dest) { I see that this method is consistent with moveLowInt32, but it's a little confusing because x86 has an instruction which moves just the low float32, movss_rr, and this is not that. Since we are assuming that all SIMD values use FloatRegister everywhere else, we could just eliminate this function and have the code which uses it use moveFloat32 instead.
Attachment #8439353 - Flags: feedback?(sunfish) → feedback+
(In reply to Benjamin Bouvier [:bbouvier] from comment #6) > I am actually wondering whether this MIR node shouldn't even be called > MSimdExtractElement, as this will make it clearer to the ground of what it's > achieving. I'll probably add an implementation of MSimdShuffle here too, > which is just a generalization of this first MIR node after all. Or even > just replace it. Sounds reasonable.
Attached patch bug1021716-extractElement.patch (obsolete) (deleted) — Splinter Review
Diff: - Renames GetLane into ExtractElement everywhere. - Fixes an issue with shufps, which actually extract values from the dest operand to the two low words, and values from the src operand to the two high words (why?!). I just imitated the way Haitao implemented it, by clobbering the output register with the input register before doing the shufps. - Renamed the enumeration members in X86Assembler.h so that it actually follows the Intel doc.
Attachment #8439353 - Attachment is obsolete: true
Attachment #8441390 - Flags: feedback?(sunfish)
Attached patch bug1021716-shuffle.patch (obsolete) (deleted) — Splinter Review
This is the generalization of MSimdExtractElement into MSimdShuffle. I think it makes sense to only have one single MInstruction, as these two operations are the same at the semantic level, the extract being just a special case of shuffle. nbp disagrees with that, as in the case of the extract, the move happening after the shupfs/pshufd is hidden in the instruction. I've hidden the difference between the two variants in the ctor helpers: NewSimdShuffle creates an actual shuffle operation with all lanes while NewSimdExtractElement just creates the equivalent of the previous SimdExtractElement operation. As the only difference happens in regalloc (where shuffle's output must be a vector register and extract element's output must be a "scalar" register), different LIR instructions are created.
Attachment #8441393 - Flags: feedback?(sunfish)
Attached patch bug1021716-shuffle.patch (obsolete) (deleted) — Splinter Review
And another patch that actually makes GVN works for SimdShuffle (only change is in congruentTo, which didn't take lanes into account).
Attachment #8441393 - Attachment is obsolete: true
Attachment #8441393 - Flags: feedback?(sunfish)
Attachment #8441405 - Flags: feedback?(sunfish)
Comment on attachment 8441390 [details] [diff] [review] bug1021716-extractElement.patch Review of attachment 8441390 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/shared/MacroAssembler-x86-shared.h @@ +489,5 @@ > movaps(src, dest); > } > > + static uint32_t ComputeShuffleMask(SIMDLane x, SIMDLane y = LaneX, > + SIMDLane z = LaneX, SIMDLane w = LaneX) Ok, ExactElement is implemented here as broadcasting the extracted element to all lanes. This is interesting because ordinarily in JIT code, a scalar float32 value will have all high bits of an xmm register set to zero, and now we will have an operator that produces scalar float32 values with the high bits set to Other Things. This is what we want, because zeroing them adds overhead, but we will need to watch out for any code elsewhere in the jit which expects the bits to be zero or non-NaN. Another note here; there are some optimizations possible here, though we should file a separate bug for them. First, extracting element X should be a no-op (since we don't care about the high bits). That'll be important since extracting element X is common. Second, extracting element Z of a float32x4 can be done with movhlps, which unlike shufps doesn't require an extra move (and is also one byte shorter than shufps). To do these, I guess we'll need to generalize the concept of a mask somewhat to include a selector for which shuffle instruction to use. @@ +507,5 @@ > + movd(src, dest); > + } > + > + void shuffleFloat32(uint32_t mask, FloatRegister src, FloatRegister dest) { > + // The shuffle instruction on x86 is such that if moves 2 words from Typo "if moves" -> "it moves"
Attachment #8441390 - Flags: feedback?(sunfish) → feedback+
Comment on attachment 8441405 [details] [diff] [review] bug1021716-shuffle.patch Review of attachment 8441405 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Benjamin Bouvier [:bbouvier] from comment #10) > Created attachment 8441393 [details] [diff] [review] > bug1021716-shuffle.patch > > This is the generalization of MSimdExtractElement into MSimdShuffle. I think > it makes sense to only have one single MInstruction, as these two operations > are the same at the semantic level, the extract being just a special case of > shuffle. nbp disagrees with that, as in the case of the extract, the move > happening after the shupfs/pshufd is hidden in the instruction. I'm inclined to agree with nbp here. extract returns a scalar, while shuffle returns a vector, and the fact that their combination also requires a dedicated flag to tell it which one it is suggests that they should be separate.
Attachment #8441405 - Flags: feedback?(sunfish)
Attached patch bug1021716-shuffle.patch (obsolete) (deleted) — Splinter Review
(In reply to Dan Gohman [:sunfish] from comment #13) > I'm inclined to agree with nbp here. extract returns a scalar, while shuffle > returns a vector, and the fact that their combination also requires a > dedicated flag to tell it which one it is suggests that they should be > separate. Alrighty then, I don't have a strong opinion here. Actually splitting it was fairly easy.
Attachment #8441405 - Attachment is obsolete: true
Attachment #8442363 - Flags: feedback?(sunfish)
Comment on attachment 8442363 [details] [diff] [review] bug1021716-shuffle.patch Review of attachment 8442363 [details] [diff] [review]: ----------------------------------------------------------------- LGTM. Note that the clang __builtin_shufflevector intrinsic takes two input vectors, rather than one, and the mask can select elements from either. We don't need that just now, but it is something to keep in mind.
Attachment #8442363 - Flags: feedback?(sunfish) → feedback+
For the record, these patches are waiting for 1) proper testing of MSimdShuffle in the Odin SIMD patch 2) a version of MSimdShuffle with two inputs, as there actually is a SIMD.int32x4.shuffle(vec1, vec2, mask) variant in the interpreter (as guessed in comment 15). 3) patches in parent bugs and move / regalloc bugs to be finished.
Attached patch 1 Extract element (obsolete) (deleted) — Splinter Review
Flipping to r? now that it got tested in the asm.js patch.
Attachment #8441390 - Attachment is obsolete: true
Attachment #8469344 - Flags: review?(sunfish)
Comment on attachment 8469344 [details] [diff] [review] 1 Extract element Review of attachment 8469344 [details] [diff] [review]: ----------------------------------------------------------------- Looks good! ::: js/src/assembler/assembler/X86Assembler.h @@ +2596,5 @@ > + mask, nameFPReg(src), nameFPReg(dst)); > + m_formatter.prefix(PRE_SSE_66); > + m_formatter.twoByteOp(OP2_PSHUFD_VdqWdqIb, (RegisterID)dst, (RegisterID)src); > + m_formatter.immediate8(uint8_t(mask)); > + } Style nit: Add a blank line between these two functions, for consistency with the rest of the file. ::: js/src/jit/MIR.h @@ +1172,5 @@ > }; > > +// Extracts a lane element from a given vector type, given by its symbol: > +// X for the first lane, Y for the second, Z for the third (if any), W for the > +// fourth (if any). I think a better place to comment about what X, Y, Z, W represent is at the SimdLane enum definition itself. ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2079,5 @@ > + > + SimdLane lane = ins->lane(); > + if (lane == LaneX) { > + // The value we want to extract is in the low double-word > + if (input != output) Yay for opportunistic coalescing :-). ::: js/src/jit/shared/MacroAssembler-x86-shared.h @@ +522,5 @@ > + // the dest and 2 words from the src operands. To simplify things, just > + // clobber the output with the input and apply the instruction > + // afterwards. > + if (src != dest) > + moveAlignedFloat32x4(src, dest); I was initially worried that this isn't useAtStart-safe, but it turns out to be ok because src isn't read afterwards. Please add a comment mentioning that this is indeed safe :). @@ +523,5 @@ > + // clobber the output with the input and apply the instruction > + // afterwards. > + if (src != dest) > + moveAlignedFloat32x4(src, dest); > + shufps(mask, dest, dest); Extract-element Z can be done with movhlps, which is the same speed as shufps but one byte smaller. Not a big win, and not worth making this code more complicated at this time, but it would be good to note it in a comment.
Attachment #8469344 - Flags: review?(sunfish) → review+
Attached patch Bonus: movhlps (deleted) — Splinter Review
Tested on the odin simd patch, works like a charm! (I made sure it got called by setting breakpoints in AssemblerX86Shared::movhlps()).
Attachment #8469981 - Flags: review?(sunfish)
Comment on attachment 8469981 [details] [diff] [review] Bonus: movhlps Review of attachment 8469981 [details] [diff] [review]: ----------------------------------------------------------------- Yay, bonus :-). ::: js/src/jit/shared/MacroAssembler-x86-shared.h @@ +516,5 @@ > void moveLowInt32(FloatRegister src, Register dest) { > movd(src, dest); > } > > + void moveHighToLowFloat32(FloatRegister src, FloatRegister dest) { I appreciate trying to translate the x86 opcode into a name here, but this is a little ambiguous because it's moving Z to the X slot. How about moveZWtoXY? Or moveHighPairToLowPairFloat32?
Attachment #8469981 - Flags: review?(sunfish) → review+
Attached patch 1 Extract element (deleted) — Splinter Review
Carrying forward r+ from sunfish. Two remarks: 1) I've added a canonicalizeFloat() in codegen of ExtractElementF. As a lane extraction is the border between SIMD and scalar world, that seems like the right place to put it. Does it sound fine to you? 2) I don't understand why reading src wouldn't be useAtStart safe: if we go through the branch that uses shuffleFloat32, it is the first time we read src, and it isn't used afterwards, so we're fine. What kind of issues could that have raised? Could that have happened with shuffleInt32 as well?
Attachment #8469344 - Attachment is obsolete: true
Attachment #8470859 - Flags: review+
Flags: needinfo?(sunfish)
(In reply to Benjamin Bouvier [:bbouvier] from comment #21) > Created attachment 8470859 [details] [diff] [review] > 1 Extract element > > Carrying forward r+ from sunfish. > > Two remarks: > 1) I've added a canonicalizeFloat() in codegen of ExtractElementF. As a lane > extraction is the border between SIMD and scalar world, that seems like the > right place to put it. Does it sound fine to you? Yes, that sounds reasonable. > 2) I don't understand why reading src wouldn't be useAtStart safe: if we go > through the branch that uses shuffleFloat32, it is the first time we read > src, and it isn't used afterwards, so we're fine. What kind of issues could > that have raised? Could that have happened with shuffleInt32 as well? Oh, I guess this code is safe for multiple reasons. useAtStart safety is about whether the emitted instructions write to the output(s) before reading any usedAtStart input(s). A common pattern is a binary operator where the instruction sequence starts by copying one of the inputs into the output register before reading the other input. If the other input is usedAtStart, that may clobber it. When I read shuffleFloat32's CodeGenerator expansion, I saw an input being copied to the output at the top, and it tripped my yellow-alert flag. However, shuffleFloat32 is a unary operator, and it doesn't try to read src after writing dst, so it's good.
Flags: needinfo?(sunfish)
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/0cab1edaa959 remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/228b3416dace Thanks for the explanation. I've added a comment to make it clear. This bug needs to stay open because we want shuffleMix here as well.
Keywords: leave-open
Attached patch SIMD x86x-64: Implement MSimdShuffle. (obsolete) (deleted) — Splinter Review
Rebased. Could this be ready to review and land?
Attachment #8442363 - Attachment is obsolete: true
Attachment #8479675 - Flags: feedback?(benj)
Comment on attachment 8479675 [details] [diff] [review] SIMD x86x-64: Implement MSimdShuffle. Review of attachment 8479675 [details] [diff] [review]: ----------------------------------------------------------------- Yes, looks good. Will upload shuffleMix for consistency in a bit
Attachment #8479675 - Flags: feedback?(benj) → review?(sunfish)
Attached patch 3 - ShuffleMix (obsolete) (deleted) — Splinter Review
ShuffleMix(a, b, mask) takes two vectors and a mask, applies the shuffle operation to a's 2 first lanes and b's 2 last lanes.
Attachment #8479995 - Flags: review?(sunfish)
Blocks: 1059408
Comment on attachment 8479675 [details] [diff] [review] SIMD x86x-64: Implement MSimdShuffle. Review of attachment 8479675 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/LIR-Common.h @@ +190,5 @@ > +{ > + SimdLane laneX_; > + SimdLane laneY_; > + SimdLane laneZ_; > + SimdLane laneW_; Instead of storing each of the lane indices here in the LSimdShuffleI, we can add a mir() function like other LInstructions which returns the MSimdShuffle, and then laneX() can be "return mir()->laneX()". It looks like we forgot to do this for LSimdExtractElementI and LSimdExtractElementF, but it be nice to do the same for them too. It saves memory, and it avoids duplicating information. @@ +216,5 @@ > +{ > + SimdLane laneX_; > + SimdLane laneY_; > + SimdLane laneZ_; > + SimdLane laneW_; The comment in LSimdShuffleI applies here too. ::: js/src/jit/MIR.h @@ +1359,5 @@ > } > }; > > +// Applies a shuffle operation to the input, putting the input lanes as > +// indicated in the output register's lanes. In some circles, "shuffle" means an operation with two vector inputs. To help avoid confusion, I suggest renaming the class here to MSimdSwizzle, or perhaps MSimdUnaryShuffle if you prefer (and renaming associated things accordingly).
Attachment #8479675 - Flags: review?(sunfish) → review+
Comment on attachment 8479995 [details] [diff] [review] 3 - ShuffleMix Review of attachment 8479995 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. With my comments on the previous patch, this patch will need some updating, but I believe it's straight-forward enough that another round of review isn't necessary. ::: js/src/jit/MIR.h @@ +1396,5 @@ > +}; > + > +// Applies a shuffle operation to the input, putting the input lanes as > +// indicated in the output register's lanes. > +class MSimdShuffle : public MUnaryInstruction, public MSimdShuffleBase I think things will work out a little nicer if we make MSimdShuffleBase inherit from MUnaryInstruction rather than using multiple inheritance here. Then LSimdShuffleI etc. can have a mir() function with MSimdShuffleBase* as the return type. If we add a binary shuffle in the future, we'll have to reorganize some stuff here, but I think we're ok for now. @@ +1431,5 @@ > + return sameLanes(other) && congruentIfOperandsEqual(other); > + } > +}; > + > +class MSimdShuffleMix : public MBinaryInstruction, public MSimdShuffleBase Please add a brief comment for this class explaining what a "shuffle mix" is. I think it's the right name here, but it's a name that may not be known to people coming form other domains. ::: js/src/jit/shared/MacroAssembler-x86-shared.h @@ +548,5 @@ > moveAlignedFloat32x4(src, dest); > shufps(mask, dest, dest); > } > + void shuffleMix(uint32_t mask, const Operand &src, FloatRegister dest) { > + // Note this uses shufps, which will create a false dependency for It's a domain crossing penalty (on CPUs where this applies), rather than a false dependency.
Attachment #8479995 - Flags: review?(sunfish) → review+
Attached patch 2. SIMD x86x-64: Implement MSimdShuffle. (obsolete) (deleted) — Splinter Review
Just rebase after the bitwise ops patch.
Attachment #8479675 - Attachment is obsolete: true
Attachment #8480420 - Flags: review+
Attached patch 3 - ShuffleMix (obsolete) (deleted) — Splinter Review
Just rebasing. Still to integrate the reviewers feedback.
Attachment #8479995 - Attachment is obsolete: true
Attachment #8480425 - Flags: review+
Attached patch 2. Implement MSimdShuffle. (obsolete) (deleted) — Splinter Review
Rebase. Carry forward r+.
Attachment #8480420 - Attachment is obsolete: true
Attachment #8481218 - Flags: review+
Attached patch 3 - ShuffleMix (obsolete) (deleted) — Splinter Review
Rebase. Carry forward r+.
Attachment #8480425 - Attachment is obsolete: true
Attachment #8481221 - Flags: review+
Attached patch simd-shuffle.patch (obsolete) (deleted) — Splinter Review
factored out the code for retrieving lanes in LIR-Common. If you feel this needs reviewing again, don't hesitate to do so ;)
Attachment #8481218 - Attachment is obsolete: true
Attachment #8481417 - Flags: review+
Attached patch 3 - ShuffleMix (obsolete) (deleted) — Splinter Review
Reflagging for review, as ShuffleMix is *actually* the binary version of shuffle you're mentionning in your review comment. I renamed it to MSimdBinaryShuffle to make it clearer. I now think that making simd.int32x4.splat would be trivial, by using shuffle with a mask of 0x00. I'll do it next week in another patch, unless somebody beats me to it.
Attachment #8481221 - Attachment is obsolete: true
Attachment #8481419 - Flags: review?(sunfish)
Comment on attachment 8481419 [details] [diff] [review] 3 - ShuffleMix Review of attachment 8481419 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Benjamin Bouvier [:bbouvier] from comment #35) > Created attachment 8481419 [details] [diff] [review] > 3 - ShuffleMix > > Reflagging for review, as ShuffleMix is *actually* the binary version of > shuffle you're mentionning in your review comment. I renamed it to > MSimdBinaryShuffle to make it clearer. Yeah, I see now that this is the language in the current spec. I may see if we can change the spec, but for now it's fine :-). Please do address (or respond, if you disagree) with my other review comments on this patch. Thanks!
Attachment #8481419 - Flags: review?(sunfish) → review+
Attached patch 2) Shuffle (aka Swizzle) (obsolete) (deleted) — Splinter Review
Rebased, carrying forward r+.
Attachment #8481417 - Attachment is obsolete: true
Attachment #8502543 - Flags: review+
Attached patch 3) ShuffleMix (obsolete) (deleted) — Splinter Review
Rebased + carrying forward r+. Not landing, as spec has changed.
Attachment #8481419 - Attachment is obsolete: true
Attachment #8502544 - Flags: review+
Attached patch 2. Swizzle (deleted) — Splinter Review
MSimdUnaryShuffle renamed to MSimdSwizzle. Feel free to review again, only the name and a ALLOW_CLONE have changed.
Attachment #8502543 - Attachment is obsolete: true
Attachment #8504853 - Flags: review+
Attached patch 3. SimdShuffle (deleted) — Splinter Review
This brings on parity with what ShuffleMix was doing, so it's not yet implementing the general cases. However, it's very useful for testing (locally) in Odin.
Attachment #8502544 - Attachment is obsolete: true
Attachment #8504854 - Flags: review?(sunfish)
Attached patch 4. General 2 operand shuffle (obsolete) (deleted) — Splinter Review
So I've used the API shuffle(lhs, rhs, x, y, z, w) where all x, y, z, w are integers in [[0, 7]]. We then partition the subspace of [[0,7]] ** 4 like this: - when constructing the instruction, we ensure that LHS contains the majority of lanes that will be present in the final result. That eliminates a lot of cases by symmetry. - cases when all lanes come from LHS correspond to a swizzle and are trivial. - cases when there are 3 lanes coming from LHS and 1 lane from RHS: each possible position of the lane from RHS has to be considered. Codegen could be improved, and the codegen could be factored (but that would complicate it even more), so this is just a rough idea. If you have ideas for better ways to do this, I'm very open to suggestions. - cases when there are 2 lanes from LHS and 2 lanes from RHS; we can use two shufps by 1) putting desired values in a vector and 2) reordering them. - all other cases are obtained by symmetry.
Attachment #8504862 - Flags: feedback?(sunfish)
Easy optimization (next ones will include using movhlps, movlhps, unpcklps and unpckhps).
Attachment #8504863 - Flags: review?(sunfish)
Attached patch 6. Factor cases in codegen (obsolete) (deleted) — Splinter Review
This is the code refactor I talk about in comment 47, but it seems to make things even harder to understand. Thoughts?
Attachment #8504864 - Flags: feedback?(sunfish)
(In reply to Benjamin Bouvier [:bbouvier] from comment #43) > Created attachment 8504864 [details] [diff] [review] > 6. Factor cases in codegen > > This is the code refactor I talk about in comment 47, but it seems to make > things even harder to understand. Thoughts? And that was comment 41, actually.
As I suggested in person, there is a better way to handle swapping indexes and counting the number of lanes which are coming from lhs / rhs. By packing the 4 indexes in the bottom of one number gives you the ability to do: - Swaping argument is done as follow: indexes = ~indexes; mozilla:Swap(lhs, rhs); - All lanes are coming from "lhs": (special case) (indexes & 04444) == 0 - Number of lanes which are coming from "rhs": (indexes & 04444) % 7 == (4 * 4) % 7 // 4 lanes are coming from "rhs" (indexes & 04444) % 7 == (3 * 4) % 7 // 3 lanes are coming from "rhs" (indexes & 04444) % 7 == (2 * 4) % 7 // 2 lanes are coming from "rhs" (indexes & 04444) % 7 == (1 * 4) % 7 // 1 lane is coming from "rhs" (indexes & 04444) % 7 == (0 * 4) % 7 // 4 lanes are coming from "lhs" - check if all right-most lanes are coming from "rhs": (indexes & 00044) % 7 == (2 * 4) % 7 - check if all left-most lanes are coming from "lhs": (indexes & 04400) % 7 == (0 * 4) % 7 Note: "% 7" is used to sum all 3-bit components together, and with the mask we only sum multiples of 4.
Comment on attachment 8504862 [details] [diff] [review] 4. General 2 operand shuffle Review of attachment 8504862 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Benjamin Bouvier [:bbouvier] from comment #41) > Created attachment 8504862 [details] [diff] [review] > 4. General 2 operand shuffle > > So I've used the API shuffle(lhs, rhs, x, y, z, w) where all x, y, z, w are > integers in [[0, 7]]. Sounds good. There's a decent chance the JS-level API will end up looking like this too. > We then partition the subspace of [[0,7]] ** 4 like this: > - when constructing the instruction, we ensure that LHS contains the > majority of lanes that will be present in the final result. That eliminates > a lot of cases by symmetry. > - cases when all lanes come from LHS correspond to a swizzle and are trivial. > - cases when there are 3 lanes coming from LHS and 1 lane from RHS: each > possible position of the lane from RHS has to be considered. Codegen could > be improved, and the codegen could be factored (but that would complicate it > even more), so this is just a rough idea. If you have ideas for better ways > to do this, I'm very open to suggestions. It's always possible to convert from a 3-and-1 shuffle to a 2-and-2 shuffle with a single shufps: shufps the 1 element from the rhs into lane 0 of the output, and the element from the lhs which is to be in the same half of the final output into lane 1 of the output. Then you can do the 2-and-2 shuffle pattern with the output and the remaining 2 elements from lhs. > - cases when there are 2 lanes from LHS and 2 lanes from RHS; we can use two > shufps by 1) putting desired values in a vector and 2) reordering them. > - all other cases are obtained by symmetry. This sounds like a great start. Of course there are lots of optimizations possible, but it'll be good to have a complete implementation first. ::: js/src/jit/MIR.h @@ +1669,5 @@ > + // Swap operands so that new lanes come from LHS in majority. > + if ((laneX >= 4 && laneY >= 4 && laneZ >= 4) || > + (laneX >= 4 && laneY >= 4 && laneW >= 4) || > + (laneX >= 4 && laneZ >= 4 && laneW >= 4) || > + (laneY >= 4 && laneZ >= 4 && laneW >= 4)) I'd find this code easier to read like this: if ((laneX >= 4) + (laneY >= 4) + (laneZ >= 4) + (laneW >= 4) > 2) @@ +1674,5 @@ > + { > + laneX += laneX >= 4 ? -4 : 4; > + laneY += laneY >= 4 ? -4 : 4; > + laneZ += laneZ >= 4 ? -4 : 4; > + laneW += laneW >= 4 ? -4 : 4; laneX = (laneX + 4) % 8, etc.
Attachment #8504862 - Flags: feedback?(sunfish) → feedback+
Comment on attachment 8504854 [details] [diff] [review] 3. SimdShuffle Review of attachment 8504854 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/shared/BaseAssembler-x86-shared.h @@ +2966,5 @@ > + } > + > + void shufps_imr(uint32_t mask, const void* address, XMMRegisterID dst) > + { > + spew("shufps %x, %p, %s", nit: there's one fewer space here than in the function above. It's nice to keep all the instructions indented consistently so that everything lines up in the debug output. ::: js/src/jit/shared/MacroAssembler-x86-shared.h @@ +605,5 @@ > + static uint32_t ComputeShuffleMask(SimdLane x, SimdLane y = LaneX, SimdLane z = LaneX, > + SimdLane w = LaneX) > + { > + return ComputeShuffleMask(uint32_t(x), uint32_t(y), uint32_t(z), uint32_t(w)); > + } Enum values implicitly convert to uint32_t, so with the main ComputeShuffleMask using uint32_t, it's not necessary to have an overload just to use SimdLane. It looks like the general trend is going to be to use plain integers rather than SimdLane for things, which will likely continue as we add int16x8 and int8x16.
Attachment #8504854 - Flags: review?(sunfish) → review+
Depends on: 1083238
Comment on attachment 8504863 [details] [diff] [review] 5. Fold Shuffle into Swizzle when all lanes come from the same operand Review of attachment 8504863 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2430,5 @@ > // (left) operand. > > + // If all lanes come from a single vector, we should have folded the > + // instruction to a MSimdSwizzle. > + MOZ_ASSERT(!(x < 4 && y < 4 && z < 4 && w < 4)); Asserting this isn't desirable with the patch as is, since we don't want the code generator to depend on foldsTo methods having been called. GVN can be disabled, and theoretically nodes can get created after GVN is run. However, another way to do this would be to have MSimdShuffle::New create an MSimdSwizzle when it can, instead of doing this in a foldsTo method. Since we have MSimdSwizzle and MSimdShuffle defined with immediate constants [0], there's no optimization which is going to make this optimization more effective, so we might as well just do it up front. Then we can always count on it having been done, and we can keep this assert. We could even swap the lhs() and rhs() up front too just to get that out of the way. [0] It's looking like the JS API will be supporting non-constant shuffle values too (but only because we can't prohibit them), however I expect we'll want a separate MIR class for the non-constant case anyway, as it's going to have a completely different implementation. MSimdDynamicShuffle (or whatever we call it) can have a folds to which creates an MSimdShuffle, and it'll use MSimdShuffle::New to do it, so it'll get fully folded too.
Attachment #8504863 - Flags: review?(sunfish) → review-
Attached patch 4. General 2 operand shuffle (deleted) — Splinter Review
(In reply to Dan Gohman [:sunfish] from comment #46) > Comment on attachment 8504862 [details] [diff] [review] > 4. General 2 operand shuffle > > Review of attachment 8504862 [details] [diff] [review]: > ----------------------------------------------------------------- > It's always possible to convert from a 3-and-1 shuffle to a 2-and-2 shuffle > with a single shufps: shufps the 1 element from the rhs into lane 0 of the > output, and the element from the lhs which is to be in the same half of the > final output into lane 1 of the output. Then you can do the 2-and-2 shuffle > pattern with the output and the remaining 2 elements from lhs. Is that really possible? In the first shufps you're describing, we cannot put an element of rhs in the same half as an element of lhs (otherwise, we could do the entire thing in one shufps?). If you meant "lane 2 of the output", that's what I do in my patch. Thanks for the (laneX >= 4) + (laneY >= 4) + ... snippet, my eyes were so closed to this code that i couldn't see this. That helped a lot factoring the codegen cases as well, this + new comments make it understandable. Switching to r? (no optimized cases using any other instructions yet).
Attachment #8504862 - Attachment is obsolete: true
Attachment #8504864 - Attachment is obsolete: true
Attachment #8504864 - Flags: feedback?(sunfish)
Attachment #8505814 - Flags: review?(sunfish)
Duh, I always forget that one can deactivate GVN :( Doing it at construction time sounds indeed better.
Attachment #8504863 - Attachment is obsolete: true
Attachment #8505817 - Flags: review?(sunfish)
Comment on attachment 8505814 [details] [diff] [review] 4. General 2 operand shuffle Review of attachment 8505814 [details] [diff] [review]: ----------------------------------------------------------------- Cool, nice job navigating the shufps landscape! > > It's always possible to convert from a 3-and-1 shuffle to a 2-and-2 shuffle > > with a single shufps: shufps the 1 element from the rhs into lane 0 of the > > output, and the element from the lhs which is to be in the same half of the > > final output into lane 1 of the output. Then you can do the 2-and-2 shuffle > > pattern with the output and the remaining 2 elements from lhs. > > Is that really possible? In the first shufps you're describing, we cannot > put an element of rhs in the same half as an element of lhs (otherwise, we > could do the entire thing in one shufps?). If you meant "lane 2 of the > output", that's what I do in my patch. Oh, yes, I meant lane 2, and I see now that this is what the patch is already doing. ::: js/src/jit/Lowering.cpp @@ +3840,5 @@ > + > + // See codegen for requirements details. > + LDefinition temp = (lanesFromLHS == 3 || (lanesFromLHS == 2 && zFromLHS && wFromLHS)) > + ? tempCopy(ins->rhs(), 1) > + : LDefinition::BogusTemp(); Quoting for reference below. ::: js/src/jit/MIR.h @@ +1673,5 @@ > MIRType type, int32_t laneX, int32_t laneY, int32_t laneZ, > int32_t laneW) > { > + // Swap operands so that new lanes come from LHS in majority. > + if ((laneX >= 4) + (laneY >= 4) + (laneZ >= 4) + (laneW >= 4) > 2) { I think there's one further useful thing we can do here. In the case of exactly 2 elements from each input, we should swap the operands such that the x lane comes from the lhs. I think that'll be a little easier to read -- I find (0,4,1,5) slightly nicer than (4,0,5,1) -- and it also happens to be convenient for x86: we can avoid the tricky tempCopy in the lanesFromLHS == 2 case above and make the z < 4 && w < 4 case with the extra movaps below unnecessary. Another advantage of swapping in the 2-and-2 case is that it'll allow GVN to see that e.g. (4,0,5,1) is really (0,1,4,5) in case there are happens to be another (0,1,4,5) with the same inputs present that can be made redundant. ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2512,5 @@ > + FloatRegister rhsCopy = ToFloatRegister(ins->temp()); > + masm.shufps(mask, lhs, rhsCopy); > + masm.movaps(rhsCopy, lhs); > + return true; > + } Quoting for reference above.
Attachment #8505814 - Flags: review?(sunfish) → review+
(In reply to Dan Gohman [:sunfish] from comment #51) > Another advantage of swapping in the 2-and-2 case is that it'll allow GVN to > see that e.g. (4,0,5,1) is really (0,1,4,5) in case there are happens to be > another (0,1,4,5) with the same inputs present that can be made redundant. Oops, I meant (0,4,1,5) rather than (0,1,4,5) here.
Comment on attachment 8505817 [details] [diff] [review] 5. Fold Shuffle into Swizzle when all lanes come from the same operand Review of attachment 8505817 [details] [diff] [review]: ----------------------------------------------------------------- Looks good.
Attachment #8505817 - Flags: review?(sunfish) → review+
Attached patch 6) Special cases for shuffling (obsolete) (deleted) — Splinter Review
Some mask pattern matching for interesting cases. I think I covered all the ones you've mentioned in the spec link. After implementing them and figuring out which masks they match (doh), I've put away movhps / movlps as their basic form requires a float register and a memory operand (wat, x86?). The AVX form is able to have registers, though, so that will help when we'll have it! Regarding the XXX comments, we can: 1) either say "yeah, whatever" [0] 2) or add a swap, architecture specific, when creating the MSimdShuffle, after the first swap (to cover more cases). When we'll implement NEON instructions, we can probably make this second swap architecture specific, in a Assembler:: static function. [0] https://www.youtube.com/watch?v=ejt8DbNBrDo
Attachment #8506260 - Flags: review?(sunfish)
This implements 2) in the previous comment. Not sure that my C++ swap-referenced-by-address in SwapOperandsAndLanes is very safe...
Attachment #8506263 - Flags: review?(sunfish)
I think we should be good at this point if r+ and this try is green: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=ba79c9b80e69
Comment on attachment 8506260 [details] [diff] [review] 6) Special cases for shuffling Review of attachment 8506260 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Benjamin Bouvier [:bbouvier] from comment #54) > Created attachment 8506260 [details] [diff] [review] > 6) Special cases for shuffling > > Some mask pattern matching for interesting cases. I think I covered all the > ones you've mentioned in the spec link. Looks good! Except for the macro ;-). > After implementing them and figuring out which masks they match (doh), I've > put away movhps / movlps as their basic form requires a float register and a > memory operand (wat, x86?). Makes sense. > The AVX form is able to have registers, though, > so that will help when we'll have it! Yay. There are also several goodies in SSE3 and SSE4.1 too. Fortunately, shufps is pretty fast on modern machines, so it's not urgent. > Regarding the XXX comments, we can: > 1) either say "yeah, whatever" [0] > 2) or add a swap, architecture specific, when creating the MSimdShuffle, > after the first swap (to cover more cases). When we'll implement NEON > instructions, we can probably make this second swap architecture specific, > in a Assembler:: static function. > > [0] https://www.youtube.com/watch?v=ejt8DbNBrDo I'm inclined to say "yeah, whatever, for now ;-)". See my comment below about another thing we could do, but we can do it later. ::: js/src/jit/shared/CodeGenerator-x86-shared.cpp @@ +2387,5 @@ > masm.movmskps(input, output); > return true; > } > > +#define MASK_MATCH(X, Y, Z, W) (ins->laneMask() == (X | (Y << 3) | (Z << 6) | (W << 9))) I think we can do this without a macro, like ins->lanesMatch(X, Y, Z, W). That way we can also keep the bit packing of the lanes better encapsulated in the MSimdSwizzle class. @@ +2423,5 @@ > + > + if (MASK_MATCH(0, 1, 2, 3)) { > + masm.movaps(input, output); > + return true; > + } We don't need this case because it's just an optimization, and MSimdSwizzle::foldsTo already does it (in this patch :-)). @@ +2426,5 @@ > + return true; > + } > + > + if (MASK_MATCH(2, 3, 2, 3)) { > + masm.movaps(input, output); You can put an XXX comment here too; if we made lowering do defineReuseInput for these cases, we could avoid these copies. @@ +2546,5 @@ > + masm.movaps(rhs, ScratchSimdReg); > + masm.movhlps(lhs, ScratchSimdReg); > + masm.movaps(ScratchSimdReg, out); > + return true; > + } I think the best way to optimize this will be to make lowering do defineReuseInput with rhs as the reused inputs. We can do this later though; I'm fine leaving TODO comments for these (TODO seems more idiomatic in this codebase than XXX comments for this purpose). @@ +2552,5 @@ > + // XXX unreachable, see comment above > + if (MASK_MATCH(6, 7, 2, 3)) { > + masm.movhlps(rhs, lhs); > + return true; > + } I'd prefer to leave the unreachable cases out for now. @@ +2585,5 @@ > + masm.movaps(ScratchSimdReg, out); > + return true; > + } > +#undef MASK_MATCH > + One additional shuffle that's easy to do is register-register movss, which weirdly leaves lanes y, z, and w unmodified, so it's (4, 1, 2, 3).
Attachment #8506260 - Flags: review?(sunfish)
Comment on attachment 8506263 [details] [diff] [review] 7. Architecture specific swap in SimdShuffle::NewAsmJS Review of attachment 8506263 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/MIR.h @@ +1675,5 @@ > + { > + *x = (*x + 4) % 8; > + *y = (*y + 4) % 8; > + *z = (*z + 4) % 8; > + *w = (*w + 4) % 8; Random aside: It'd be fun if we used uint32_t instead of int32_t for the indices, because then '% 8" is 1 instruction, rather than about 5, but this probably doesn't make any difference in practice. @@ +1696,5 @@ > + > +#define MASK_MATCH(X, Y, Z, W) (laneX == X && laneY == Y && laneZ == Z && laneW == W) > + if (MASK_MATCH(2, 3, 6, 7) || MASK_MATCH(4, 0, 5, 1) || MASK_MATCH(6, 2, 7, 3)) > + SwapOperandsAndLanes(&laneX, &laneY, &laneZ, &laneW, &lhs, &rhs); > +#undef MASK_MATCH MSimdShuffle::New's main goal here should be to have a simple and set of rules that produces a unique canonical form for each possible shuffle. If we can make the canonical form convenient to optimize in codegen, that's great, but we don't want to start matching for individual shuffle operations at this level. Your first rule of "If one input contributes more lanes than the other, put it in the lhs" is nice and simple. If we add "For a tie, put the input contributing lane 0 in the lhs", then we have enough to get canonical forms for all possible inputs, and it's all still pretty simple. I think a better approach for the x86 codegen problem is to have lowering do defineReuseInput with the rhs instead of the lhs when appropriate, so that we don't need to be swapping things around to accomodate it.
Attachment #8506263 - Flags: review?(sunfish) → review-
Attached patch 6. Special cases for shuffling (deleted) — Splinter Review
That sounds right. Having MSimdShuffleBase::lanesMatch() makes more sense, and these arch specific swaps should go into lowering (opened bug 1084404 for that, as I'd like to move on). I've let the pattern matching (0,1,2,3) in swizzle, as i seem to recall somebody telling me once that people could deactivate gvn :} Also, i would be fine with leaving this case out, as it seems reasonable to punish these people.
Attachment #8506260 - Attachment is obsolete: true
Attachment #8506263 - Attachment is obsolete: true
Attachment #8506953 - Flags: review?(sunfish)
Forgot to say that this patch also cleans up int32_t / uint32_t interface. Everything should explicitly be uint32_t from now, in this patch.
Comment on attachment 8506953 [details] [diff] [review] 6. Special cases for shuffling Review of attachment 8506953 [details] [diff] [review]: ----------------------------------------------------------------- (In reply to Benjamin Bouvier [:bbouvier] from comment #59) > Created attachment 8506953 [details] [diff] [review] > 6. Special cases for shuffling > > That sounds right. Having MSimdShuffleBase::lanesMatch() makes more sense, > and these arch specific swaps should go into lowering (opened bug 1084404 > for that, as I'd like to move on). Cool. > I've let the pattern matching (0,1,2,3) in swizzle, as i seem to recall > somebody telling me once that people could deactivate gvn :} Also, i would > be fine with leaving this case out, as it seems reasonable to punish these > people. Ha! But yeah, we don't want assertion failures when we deactivate GVN, but it's totally ok to miss some optimizations, because, well, they asked for it :). (In reply to Benjamin Bouvier [:bbouvier] from comment #60) > Forgot to say that this patch also cleans up int32_t / uint32_t interface. > Everything should explicitly be uint32_t from now, in this patch. Yay. ::: js/src/jit/LIR-Common.h @@ +246,5 @@ > + uint32_t laneY() const { return mir_->toSimdSwizzle()->laneY(); } > + uint32_t laneZ() const { return mir_->toSimdSwizzle()->laneZ(); } > + uint32_t laneW() const { return mir_->toSimdSwizzle()->laneW(); } > + > + bool lanesMatch(int32_t x, int32_t y, int32_t z, int32_t w) const { Since we changed some things to uint32_t, we should use it here too. @@ +295,5 @@ > + uint32_t laneY() const { return mir_->toSimdShuffle()->laneY(); } > + uint32_t laneZ() const { return mir_->toSimdShuffle()->laneZ(); } > + uint32_t laneW() const { return mir_->toSimdShuffle()->laneW(); } > + > + bool lanesMatch(int32_t x, int32_t y, int32_t z, int32_t w) const { and here :) ::: js/src/jit/MIR.cpp @@ +821,5 @@ > > +MDefinition * > +MSimdSwizzle::foldsTo(TempAllocator &alloc) > +{ > + if (laneX() == 0 && laneY() == 1 && laneZ() == 2 && laneW() == 3) This can use the new lanesMatch function.
Attachment #8506953 - Flags: review?(sunfish) → review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: