Bug 1313336 (Closed): Implement stubbed-out 64-bit and 32-bit operations on ARM64
Opened 8 years ago • Closed 7 years ago
Categories: Core :: JavaScript Engine: JIT, defect, P3
Tracking
Status: RESOLVED FIXED
Target Milestone: mozilla60
| Tracking | Status
---|---|---
firefox60 | --- | fixed
People (Reporter: lth, Assigned: lth)
Attachments (7 files, 9 obsolete files)
See bug 1313043. The second patch on that bug is a set of stubs for the ARM64 back-end where it lacked common MacroAssembler functionality. Almost all of these are 64-bit functions; in addition there is popcnt32.
Updated • 7 years ago
Blocks: Rabaldr-ARM64
Comment 2 • 7 years ago (Assignee)
I have implemented most of these (for wasm, anyway); a patch is coming once I'm passing all tests.
Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Priority: P5 → P3
Hardware: Other → ARM64
Comment 4 • 7 years ago (Assignee)
Implement stubbed-out masm integer instructions.
Comment 5 • 7 years ago (Assignee)
Implement stubbed-out masm floating-point instructions.
Comment 6 • 7 years ago (Assignee)
Implement stubbed-out masm truncate-float-to-int instructions and their ool emitters.
Comment 7 • 7 years ago (Assignee)
Implement stubbed-out masm register-to-register moves.
Comment 8 • 7 years ago (Assignee)
Implement wasmLoad and wasmStore masm instructions.
Comment 9 • 7 years ago (Assignee)
Implement stubbed-out masm atomic instructions.
We can probably optimize these a little more by using the new LDAR/STLR instructions plus a single barrier, but I haven't investigated that yet; it's not urgent to do so.
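For the record, the kind of shape I have in mind for, say, a SeqCst load; a minimal sketch only, assuming vixl's Ldar, and not measured:

  // Load-acquire supplies the ordering that the leading DMB provides today.
  Ldar(ARMRegister(output, 32), MemOperand(ARMRegister(ptr, 64)));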
Comment 10 • 7 years ago (Assignee)
Implement miscellaneous stubbed-out masm instructions, notably a lot of patchable control flow. Annotate the remaining stubs that are not used by wasm, or not used at all.
Comment 11 • 7 years ago
Comment on attachment 8949661 [details] [diff] [review]
bug1313336-arm64-masm-integer.patch
Review of attachment 8949661 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/jit/arm64/MacroAssembler-arm64-inl.h
@@ +547,5 @@
> + Udiv(scratch, ARMRegister(srcDest, 32), ARMRegister(rhs, 32));
> + else
> + Sdiv(scratch, ARMRegister(srcDest, 32), ARMRegister(rhs, 32));
> + Mul(scratch, scratch, ARMRegister(rhs, 32));
> + Sub(ARMRegister(srcDest, 32), ARMRegister(srcDest, 32), scratch);
It would be good to add a comment to this function noting that it doesn't have to deal with x % 0 and INT_MIN % -1.
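Something like this, say (adjusted to whatever the actual contract is):

  // Note: callers exclude x % 0 and INT_MIN % -1 before reaching this
  // function, so the Udiv/Sdiv + Mul + Sub sequence need not check for them.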
@@ +783,5 @@
> {
> + vixl::UseScratchRegisterScope temps(this);
> + const ARMRegister scratch = temps.AcquireW();
> + Mov(scratch, 32);
> + Sub(scratch, scratch, ARMRegister(count, 32));
You can compute `-count` in a single instruction, and ror only looks at the low bits.
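Something along these lines (a sketch; operand names are illustrative):

  Neg(scratch, ARMRegister(count, 32));  // rotl(x, n) == rotr(x, -n)
  Ror(ARMRegister(dest, 32), ARMRegister(input, 32), scratch);

ROR takes the shift amount modulo the register width, so the negated count is enough.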
@@ +808,5 @@
> + vixl::UseScratchRegisterScope temps(this);
> + const ARMRegister scratch = temps.AcquireX();
> + Mov(scratch, 64);
> + Sub(scratch, scratch, ARMRegister(count, 64));
> + Ror(ARMRegister(dest.reg, 64), ARMRegister(input.reg, 64), scratch);
Same: ror(-count)
@@ +889,5 @@
> + Add(dest, dest, Operand(dest, vixl::LSR, 4));
> + And(dest, dest, 0x0F0F0F0F);
> + Add(dest, dest, Operand(dest, vixl::LSL, 8));
> + Add(dest, dest, Operand(dest, vixl::LSL, 16));
> + Lsr(dest, dest, 24);
Are you aware of the CNT instruction? It's a bytewise popcount across a SIMD register. You can follow with ADDV to implement a full popcount.
This requires a temporary SIMD register, of course.
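A sketch of the SIMD variant, with an illustrative name for the FP temp (untested):

  Fmov(temp.S(), ARMRegister(src, 32));   // GPR -> SIMD bank
  Cnt(temp.V8B(), temp.V8B());            // popcount each byte
  Addv(temp.B(), temp.V8B());             // sum the eight byte counts
  Fmov(ARMRegister(dest, 32), temp.S());  // SIMD -> GPR bank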
Attachment #8949661 - Flags: review+
Updated • 7 years ago
Attachment #8949663 - Flags: review+
Comment 12 • 7 years ago
Comment on attachment 8949664 [details] [diff] [review]
bug1313336-arm64-masm-truncate.patch
Review of attachment 8949664 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +897,5 @@
> + Fcvtzu(output, input);
> + Cmp(output, 0);
> + B(oolEntry, Assembler::Equal);
> + Cmp(output, -1);
> + B(oolEntry, Assembler::Equal);
You can use CMP+CCMP to chain the two compares and save a branch here.
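That is, roughly (a sketch):

  Fcvtzu(output, input);
  Cmp(output, 0);
  Ccmp(output, -1, vixl::ZFlag, Assembler::NotEqual);
  B(oolEntry, Assembler::Equal);

The Ccmp performs the second compare only when the first one was NotEqual, and forces the Z flag otherwise, so a single Equal branch covers both boundary values.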
@@ +923,5 @@
> + B(oolEntry, Assembler::Equal);
> + Cmp(output, INT32_MAX);
> + B(oolEntry, Assembler::Equal);
> + Cmp(output, INT32_MIN);
> + B(oolEntry, Assembler::Equal);
Same: CMP+CCMP+CCMP.
Also, I think INT32_MAX and INT32_MIN both have to be materialized in a scratch register, so it might be better to check unsigned(x-INT32_MAX) < 2.
@@ +1067,5 @@
> + bind(¬NaN);
> +
> + Label isOverflow;
> + if (isUnsigned) {
> + loadConstantFloat32(float(double(UINT64_MAX) + 1), ScratchFloat32Reg);
This expression looks a little sketchy: double(UINT64_MAX) rounds to 0x1.0p64, and then adding 1 does nothing. So you end up with 0x1.0p64 as you wanted, but it's not really clear that was intended.
Are we allowed to use hex float literals yet?
Alternatively:
const float c = -float(INT64_MIN);
use +c, -c, 2*c for your constants. No rounding anywhere.
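For example, the unsigned upper bound then becomes (sketch):

  const float c = -float(INT64_MIN);              // exactly 2^63, a power of two
  loadConstantFloat32(2 * c, ScratchFloat32Reg);  // exactly 2^64, still no rounding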
@@ +1102,5 @@
> + } else {
> + loadConstantDouble(double(INT64_MAX) + 1, ScratchDoubleReg);
> + branchDouble(Assembler::DoubleGreaterThanOrEqual, input, ScratchDoubleReg, &isOverflow);
> + // INT64_MIN is the closest value to the true limit value INT64_MIN-1 in
> + // the double representation (53-bit mantissa).
This comment applies to the positive case too.
Updated • 7 years ago
Attachment #8949664 - Flags: review?(jolesen)
Updated • 7 years ago
Attachment #8949665 - Flags: review+
Comment 13 • 7 years ago
Comment on attachment 8949666 [details] [diff] [review]
bug1313336-arm64-masm-wasm-load-store.patch
Review of attachment 8949666 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +248,5 @@
> + if (sixtyfour == Register64::Invalid())
> + return ARMRegister(any.gpr(), 32);
> + else
> + return ARMRegister(sixtyfour.reg, size);
> +}
I don't find this code entirely self-documenting ;-)
@@ +281,5 @@
> + case Scalar::Int8:
> + {
> + ARMRegister dest(SelectGPReg(outany, out64));
> + Ldrb(dest, srcAddr);
> + Sxtb(dest, dest);
See LDRSB
@@ +306,5 @@
> + {
> + ARMRegister dest32(SelectGPReg(outany, out64, 32));
> + Ldr(dest32, srcAddr);
> + if (out64 != Register64::Invalid())
> + Sxtw(ARMRegister(out64.reg, 64), dest32);
And there's a LDRSW for a sign-extending 32->64 load.
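That is, both loads collapse to a single instruction (sketch):

  Ldrsb(dest, srcAddr);                        // sign-extending byte load
  Ldrsw(ARMRegister(out64.reg, 64), srcAddr);  // sign-extending 32->64 load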
@@ +375,5 @@
> + case Scalar::Uint32:
> + Str(SelectGPReg(valany, val64), dstAddr);
> + break;
> + case Scalar::Int64:
> + Str(SelectGPReg(valany, val64), dstAddr);
This is identical to the 32-bit code. Is that really right?
Attachment #8949666 - Flags: review?(jolesen)
Comment 14 • 7 years ago
Comment on attachment 8949667 [details] [diff] [review]
bug1313336-arm64-masm-atomic.patch
Review of attachment 8949667 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/jit/arm64/MacroAssembler-arm64-inl.h
@@ +1874,5 @@
> + Dmb(vixl::InnerShareable, vixl::BarrierWrites);
> + else if (barrier == MembarLoadLoad)
> + Dmb(vixl::InnerShareable, vixl::BarrierReads);
> + else if (barrier)
> + Dmb(vixl::InnerShareable, vixl::BarrierAll);
I'll have to trust you understand the details here ;-)
::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +1498,5 @@
> + masm.Cmp(R(output, targetWidth), R(scratch, targetWidth));
> + masm.B(&done, MacroAssembler::NotEqual);
> + StoreExclusive(masm, type, scratch, newval, ptr);
> + masm.Cmp(W(scratch), 1);
> + masm.B(&again, MacroAssembler::Equal);
CBNZ should be ok here since STXR returns 0/1 only.
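That is (sketch):

  StoreExclusive(masm, type, scratch, newval, ptr);
  masm.Cbnz(W(scratch), &again);  // status is 0 on success, 1 on failure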
Attachment #8949667 - Flags: review+
Updated • 7 years ago
Attachment #8949668 - Flags: review+
Comment 15 • 7 years ago (Assignee)
(In reply to Jakob Stoklund Olesen [:jolesen] from comment #11)
> Comment on attachment 8949661 [details] [diff] [review]
> bug1313336-arm64-masm-integer.patch
>
> Review of attachment 8949661 [details] [diff] [review]:
> -----------------------------------------------------------------
> @@ +783,5 @@
> > {
> > + vixl::UseScratchRegisterScope temps(this);
> > + const ARMRegister scratch = temps.AcquireW();
> > + Mov(scratch, 32);
> > + Sub(scratch, scratch, ARMRegister(count, 32));
>
> You can compute `-count` in a single instruction, and ror only looks at the
> low bits.
Oh, nice.
> @@ +889,5 @@
> > + Add(dest, dest, Operand(dest, vixl::LSR, 4));
> > + And(dest, dest, 0x0F0F0F0F);
> > + Add(dest, dest, Operand(dest, vixl::LSL, 8));
> > + Add(dest, dest, Operand(dest, vixl::LSL, 16));
> > + Lsr(dest, dest, 24);
>
> Are you aware of the CNT instruction? It's a bytewise popcount across a SIMD
> register. You can follow with ADDV to implement a full popcount.
>
> This requires a temporary SIMD register, of course.
Yes, I am aware of CNT. There seemed to be some dispute on the net (but when is there not?) about whether it was actually a win or not to use it so I decided to just keep it simple and port working code. I'll note the issue on my worksheet and later I'll file a bug to investigate more fully.
Comment 16 • 7 years ago (Assignee)
(In reply to Jakob Stoklund Olesen [:jolesen] from comment #12)
> Comment on attachment 8949664 [details] [diff] [review]
> bug1313336-arm64-masm-truncate.patch
>
> Review of attachment 8949664 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: js/src/jit/arm64/MacroAssembler-arm64.cpp
> @@ +897,5 @@
> > + Fcvtzu(output, input);
> > + Cmp(output, 0);
> > + B(oolEntry, Assembler::Equal);
> > + Cmp(output, -1);
> > + B(oolEntry, Assembler::Equal);
>
> You can use CMP+CCMP to chain the two compares and save a branch here.
Excellent.
> @@ +923,5 @@
> > + B(oolEntry, Assembler::Equal);
> > + Cmp(output, INT32_MAX);
> > + B(oolEntry, Assembler::Equal);
> > + Cmp(output, INT32_MIN);
> > + B(oolEntry, Assembler::Equal);
>
> Same: CMP+CCMP+CCMP.
>
> Also, I think INT32_MAX and INT32_MIN both have to be materialized in a
> scratch register, so it might be better to check unsigned(x-INT32_MAX) < 2.
We can materialize INT32_MIN in one MOVN and INT32_MAX in two MOVZ for a total of eight instructions for the CMP+CCMP+CCMP sequence. I think what you're suggesting is something like this:
Fcvtzs(output, input);
Sub(scratch, output, INT32_MAX);
Cmp(scratch, 1);
Ccmp(output, 0, vixl::ZFlag, Assembler::Above);
B(oolEntry, Assembler::Equal);
which should be only seven. (The scratch register does not impact regalloc so apart from microarchitectural effects it is free.)
> @@ +1067,5 @@
> > + bind(&notNaN);
> > +
> > + Label isOverflow;
> > + if (isUnsigned) {
> > + loadConstantFloat32(float(double(UINT64_MAX) + 1), ScratchFloat32Reg);
>
> This expression looks a little sketchy: double(UINT64_MAX) rounds to
> 0x1.0p64, and then adding 1 does nothing. So you end up with 0x1.0p64 as you
> wanted, but it's not really clear that was intended.
Nor was it, I botched that.
> Are we allowed to use hex float literals yet?
>
> Alternatively:
>
> const float c = -float(INT64_MIN);
> use +c, -c, 2*c for your constants. No rounding anywhere.
Righto.
>
> @@ +1102,5 @@
> > + } else {
> > + loadConstantDouble(double(INT64_MAX) + 1, ScratchDoubleReg);
> > + branchDouble(Assembler::DoubleGreaterThanOrEqual, input, ScratchDoubleReg, &isOverflow);
> > + // INT64_MIN is the closest value to the true limit value INT64_MIN-1 in
> > + // the double representation (53-bit mantissa).
>
> This comment applies to the positive case too.
Well, in the positive case we were supposed to have INT64_MAX+1 which /is/ exactly representable, but I'm going to go over this code again.
Comment 17 • 7 years ago (Assignee)
(In reply to Jakob Stoklund Olesen [:jolesen] from comment #13)
> Comment on attachment 8949666 [details] [diff] [review]
> bug1313336-arm64-masm-wasm-load-store.patch
>
> Review of attachment 8949666 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> ::: js/src/jit/arm64/MacroAssembler-arm64.cpp
> @@ +248,5 @@
> > + if (sixtyfour == Register64::Invalid())
> > + return ARMRegister(any.gpr(), 32);
> > + else
> > + return ARMRegister(sixtyfour.reg, size);
> > +}
>
> I don't find this code entirely self-documenting ;-)
Partly because it predates the cleanup of AnyRegister that I made as a result of writing this code... I'll add some comments and an assertion that is now possible.
> @@ +281,5 @@
> > + case Scalar::Int8:
> > + {
> > + ARMRegister dest(SelectGPReg(outany, out64));
> > + Ldrb(dest, srcAddr);
> > + Sxtb(dest, dest);
>
> See LDRSB
Ah, bad oversight on my part to miss these.
> @@ +375,5 @@
> > + case Scalar::Uint32:
> > + Str(SelectGPReg(valany, val64), dstAddr);
> > + break;
> > + case Scalar::Int64:
> > + Str(SelectGPReg(valany, val64), dstAddr);
>
> This is identical to the 32-bit code. Is that really right?
They are not identical, because SelectGPReg() will return a 32-bit register in the first case and a 64-bit register in the second case, given how this function is used.
Comment 18 • 7 years ago
(In reply to Lars T Hansen [:lth] from comment #16)
> We can materialize INT32_MIN in one MOVN and INT32_MAX in two MOVZ for a
> total of eight instructions for the CMP+CCMP+CCMP sequence. I think what
> you're suggesting is something like this:
>
> Fcvtzs(output, input);
> Sub(scratch, output, INT32_MAX);
> Cmp(scratch, 1);
> Ccmp(output, 0, vixl::ZFlag, Assembler::Above);
> B(oolEntry, Assembler::Equal);
>
> which should be only seven. (The scratch register does not impact regalloc
> so apart from microarchitectural effects it is free.)
That is what I had in mind. But then I looked at Sub() which calls MacroAssembler::AddSubMacro():
UseScratchRegisterScope temps(this);
Register temp = temps.AcquireSameSizeAs(rn);
// VIXL can acquire temp registers. Assert that the caller is aware.
VIXL_ASSERT(!temp.Is(rd) && !temp.Is(rn));
So it might be risky to allocate scratch registers at two different levels of the call graph? I think Sean spent some time cleaning this stuff up in the ARM32 backend.
I'm not sure how to best handle this. Maybe it's better to use your original sequence (plus chained compares) since it doesn't allocate scratch registers?
> > @@ +1102,5 @@
> > > + } else {
> > > + loadConstantDouble(double(INT64_MAX) + 1, ScratchDoubleReg);
> > > + branchDouble(Assembler::DoubleGreaterThanOrEqual, input, ScratchDoubleReg, &isOverflow);
> > > + // INT64_MIN is the closest value to the true limit value INT64_MIN-1 in
> > > + // the double representation (53-bit mantissa).
> >
> > This comment applies to the positive case too.
>
> Well, in the positive case we were supposed to have INT64_MAX+1 which /is/
> exactly representable, but I'm going to go over this code again.
Oh, right, I missed that.
Comment 19 • 7 years ago
(In reply to Lars T Hansen [:lth] from comment #15)
> > Are you aware of the CNT instruction? It's a bytewise popcount across a SIMD
> > register. You can follow with ADDV to implement a full popcount.
> >
> > This requires a temporary SIMD register, of course.
>
> Yes, I am aware of CNT. There seemed to be some dispute on the net (but
> when is there not?) about whether it was actually a win or not to use it so
> I decided to just keep it simple and port working code. I'll note the issue
> on my worksheet and later I'll file a bug to investigate more fully.
Makes sense. There is typically a latency comparable to a load when copying between the register banks, and ARM's designs tend to have pretty long latencies on SIMD instructions too.
Cortex-A72 is 3 cycles for the CNT and 6 for the ADDV.8b, and the two cross-bank VMOVs are 5 cycles each. That adds up to 19 cycles of latency. I think it comes down to a throughput vs latency tradeoff which is probably too hard to reason about for the baseline compiler.
Comment 20 • 7 years ago (Assignee)
This is the truncation patch, updated as requested (comment 6, comment 12, comment 18). I have decided to forgo the clever subtraction optimization here and just use chained compares.
Attachment #8949664 - Attachment is obsolete: true
Attachment #8949664 - Flags: review?(jolesen)
Attachment #8950332 - Flags: review?(jolesen)
Comment 21 • 7 years ago (Assignee)
This is the load/store patch, updated per comment 8, comment 13, and comment 17. LMK if the register selection function is still inscrutable and I'll beef up the commentary further.
Attachment #8949666 - Attachment is obsolete: true
Attachment #8949666 - Flags: review?(jolesen)
Attachment #8950335 - Flags: review?(jolesen)
Updated • 7 years ago
Attachment #8950332 - Flags: review?(jolesen) → review+
Comment 22 • 7 years ago
Comment on attachment 8950335 [details] [diff] [review]
bug1313336-arm64-masm-wasm-load-store-v2.patch
Review of attachment 8950335 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/jit/arm64/MacroAssembler-arm64.cpp
@@ +242,5 @@
> Brk((code++) & 0xffff);
> }
>
> +// Either `any` is valid or `sixtyfour` is valid. Return a 32-bit ARMRegister
> +// in the first case and an ARMRegister of the desired size in the latter case.
That works. Thanks!
Attachment #8950335 - Flags: review?(jolesen) → review+
Comment 23 • 7 years ago (Assignee)
Rebased, etc.; ready to land. Carrying r+.
Attachment #8949661 - Attachment is obsolete: true
Attachment #8954014 - Flags: review+
Comment 24 • 7 years ago (Assignee)
Rebased, etc.; ready to land. Carrying r+.
Attachment #8949663 - Attachment is obsolete: true
Attachment #8954016 - Flags: review+
Comment 25 • 7 years ago (Assignee)
Rebased, etc.; ready to land. Carrying r+.
Attachment #8950332 - Attachment is obsolete: true
Attachment #8954017 - Flags: review+
Comment 26 • 7 years ago (Assignee)
Rebased, etc.; ready to land. Carrying r+.
Attachment #8949665 - Attachment is obsolete: true
Attachment #8954018 - Flags: review+
Comment 27 • 7 years ago (Assignee)
Rebased, etc.; ready to land. Carrying r+.
Attachment #8950335 - Attachment is obsolete: true
Attachment #8954020 - Flags: review+
Comment 28 • 7 years ago (Assignee)
Rebased, etc.; ready to land. Carrying r+.
Attachment #8949667 - Attachment is obsolete: true
Attachment #8954021 - Flags: review+
Comment 29 • 7 years ago (Assignee)
Rebased, etc.; ready to land. Carrying r+.
Attachment #8949668 - Attachment is obsolete: true
Attachment #8954022 - Flags: review+
Comment 30 • 7 years ago
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a37627e154f3
ARM64 integer masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/48452a36fc14
ARM64 floating point masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/4ba21b2798ca
ARM64 truncate-floating-point-to-int masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/9e7eddae76df
ARM64 register-register move masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/065a605a7909
ARM64 wasmLoad and wasmStore masm instructions. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/5b036f401b18
ARM64 atomic masm operations. r=jolesen
https://hg.mozilla.org/integration/mozilla-inbound/rev/885e8ce07968
ARM64 miscellaneous masm instructions. r=jolesen
Comment 31 • 7 years ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/a37627e154f3
https://hg.mozilla.org/mozilla-central/rev/48452a36fc14
https://hg.mozilla.org/mozilla-central/rev/4ba21b2798ca
https://hg.mozilla.org/mozilla-central/rev/9e7eddae76df
https://hg.mozilla.org/mozilla-central/rev/065a605a7909
https://hg.mozilla.org/mozilla-central/rev/5b036f401b18
https://hg.mozilla.org/mozilla-central/rev/885e8ce07968
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
status-firefox60: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla60