Closed Bug 548664 Opened 15 years ago Closed 5 years ago

NEON support for optimizing UTF-8 to UTF-16 conversion

Categories

(Core :: XPCOM, defect)

Hardware: ARM
OS: All
Severity: normal

Tracking

RESOLVED WORKSFORME

People

(Reporter: dougt, Assigned: m_kato)

References

Details

Attachments

(1 file, 2 obsolete files)

In bug 506430, we added SSE support for the UTF-8 -> UTF-16 conversion.  We should try using NEON (which has 128-bit registers).

NEON may not be very suitable for this operation: branching based on a value in a NEON register is extremely slow (~20 cycles of penalty).

So the following approaches can be tried:
1. A pure ARMv6 implementation using the UXTB16 instruction, which can convert a pair of 8-bit values to 16-bit values.
2. Mixed ARM and NEON, with ARM used for testing the high bits and for breaking out of the loop, and NEON used just for the 8-bit -> 16-bit widening.

ARMv6 and newer ARM cores support unaligned memory accesses (at least on Linux), which is also handy.

Also, for this particular use case (conversion of a huge buffer of ASCII text, which does not even fit in the L2 cache), memory bandwidth is going to be the limiting factor, so explicit prefetching is required (Cortex-A8 and older ARM cores do not have a hardware prefetcher).
When I tested approach 2, it was about 20% faster than the current code on an i.MX515.
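
For illustration, here is a minimal sketch of approach 2 using NEON intrinsics instead of inline assembly; the helper name, the 8-bytes-per-iteration chunking, and the memcpy-based loads are assumptions for the example, not code from this bug.

#include <arm_neon.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Widen ASCII-only bytes to UTF-16 code units; returns the number of bytes
 * consumed before the first non-ASCII byte (or before fewer than 8 bytes
 * remain).  The high-bit test runs in general-purpose registers, so no
 * branch ever depends on a NEON register. */
static size_t ascii_to_utf16_neon(const uint8_t *src, uint16_t *dst, size_t len)
{
  size_t i = 0;
  while (i + 8 <= len) {
    uint32_t lo, hi;
    memcpy(&lo, src + i, 4);
    memcpy(&hi, src + i + 4, 4);
    if ((lo | hi) & 0x80808080U)
      break;                                 /* non-ASCII byte: stop here */
    /* NEON only does the cheap 8-bit -> 16-bit zero extension and store. */
    vst1q_u16(dst + i, vmovl_u8(vld1_u8(src + i)));
    i += 8;
  }
  return i;
}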

Also, it is no good to use a 128-bit register, because the comparison cost and vmov.32 are too expensive.
Attached patch test patch (obsolete) (deleted) — Splinter Review
A 64-bit compare is expensive on Cortex-A8.  When the string length is 8 or 16, the 64-bit compare version with a 64-bit -> 128-bit NEON widening performs well.  But in other situations, such as a length of 21, the performance is not good...

So we may want to use the 32-bit compare version, as in this patch.
__ARM_NEON__ is unfortunately not useful, because we make those decisions at runtime instead of at compile time; but the patch looks good in any case, just need to figure out how to get it picking the right code at runtime..
(In reply to comment #5)
> __ARM_NEON__ is unfortunately not useful, because we make those decisions at
> runtime instead of at compile time; but the patch looks good in any case, just
> need to figure out how to get it picking the right code at runtime..

I agree.  We should add NEON detection to SSE.h first...
Depends on: 583958
Comment on attachment 458627 [details] [diff] [review]
test patch

>+    __asm__ __volatile__(
>+      "vdup.32   d0, %0\n\t"
>+      "vmovl.u8  q1, d0\n\t"
>+      "vst1.8    {d2}, [%1]!\n\t"
>+      : "+r"(in), "+r"(dst32)
>+      :
>+      : "%d0", "%d2", "%q1");

Seeing as you're modifying %1 (i.e. dst32) with the vst1.8, shouldn't you add it to the list of written registers?  If you did that, the asm might not have to be volatile.
"vst1.8 {d2}, [%1]!" means "vst1.8 {d2}, [%1]" and %1 register (dst32) += 8.  So dst32 is overwritten.
Right.  My comment is just on the list of registers defined after the assembly.  Looking up the syntax again, I see that +r indicates an input/output register.  Since |in| is only read, perhaps it should be:

__asm__ (
      ...
      : "+r"(dst32)                     /* output */
      : "=r"(in)                        /* input */
      : "%d0", "%d2", "%q1", "memory"); /* clobber */

Of course, you'd have to switch %0 and %1, too.  Or maybe you could do

__asm__ (
      "vdup.32   d0, %[in]\n\t"
      "vmovl.u8  q1, d0\n\t"
      "vst1.8    {d2}, %[dst]!\n\t"
      : [dst] "+r"(dst32)
      : [in]  "=r"(in)
      : "%d0", "%d2", "%q1", "memory");

http://www.ethernut.de/en/documents/arm-inline-asm.html

IIRC gcc doesn't understand that clobbering q1 implies clobbering d2 and d3, so it might be more correct to add "%d3" to the clobber list too.  But I don't think that matters here.
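
Putting the pieces of this review together, a compilable sketch of the named-operand form might look like the following; the wrapper function is hypothetical, and d3 is listed in the clobbers because vmovl.u8 writes all of q1.

#include <stdint.h>

/* Convert 4 ASCII bytes (packed in |in|) to 4 UTF-16 code units at *dstp,
 * advancing the destination pointer by 8 bytes. */
static void widen4(uint32_t in, uint8_t **dstp)
{
  uint8_t *dst = *dstp;
  __asm__(
      "vdup.32   d0, %[in]\n\t"        /* replicate the 4 bytes into d0      */
      "vmovl.u8  q1, d0\n\t"           /* zero-extend 8-bit lanes to 16 bits */
      "vst1.8    {d2}, [%[dst]]!\n\t"  /* store the low 4 lanes, post-inc    */
      : [dst] "+r"(dst)
      : [in]  "r"(in)
      : "d0", "d2", "d3", "memory");
  *dstp = dst;
}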
Assignee: nobody → m_kato
Status: NEW → ASSIGNED
OS: Mac OS X → All
Hardware: x86 → ARM
Attached patch fix (obsolete) (deleted) — Splinter Review
Attachment #458627 - Attachment is obsolete: true
Attachment #533542 - Flags: review?(justin.lebar+bug)
>+    while (len > 4) {
>+      PRUint32 in = *src32++;
>+      if (in & 0x80808080U) {
>+        src32--;
>+        break;
>+      }
>+
>+      __asm__(
>+        ".fpu neon\n"
>+        "vdup.32   d0, %1\n"
>+        "vmovl.u8  q1, d0\n"
>+        "vst1.8    {d2}, [%0]!\n"
>+        : "+r"(dst32)
>+        : "r"(in)
>+        : "d0", "d2", "q1", "memory");
>+      len -= 4;
>+    }

Do you know if this is any faster than the original code?

Would it be faster to do 64 bits at a time?

  PRUint32 in1 = *src32++;
  PRUint32 in2 = *src32++;

  if (in1 & 0x80808080U || in2 & 0x80808080U) {
    src32 -= 2;
    break;
  }

  vmov d0, in1, in2
  vmovl.u8 q1, d0
  vst1.8 q1, [%0]!

You could even try loading into d0 first, then moving into general purpose registers to do the if.  Something like:

  Before the loop, splat 0x80 into each byte of d31.

  Load [src32] into d0.
  d1 = d0 & d31
  vmov r0, r1, d1
  check that r0 == r1 == 0
  finish up

I suspect this will be slower than loading into the general-purpose registers first, but it might be worth a shot.
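
A sketch of the 64-bits-at-a-time variant in inline-asm form, under the same assumptions (the wrapper is hypothetical; the store spells out the {d2, d3} register list, which is the concrete form of storing q1):

#include <stdint.h>

/* Convert 8 ASCII bytes (in1 = low word, in2 = high word) to 8 UTF-16
 * code units at *dstp, advancing the destination by 16 bytes. */
static void widen8(uint32_t in1, uint32_t in2, uint8_t **dstp)
{
  uint8_t *dst = *dstp;
  __asm__(
      "vmov      d0, %[in1], %[in2]\n\t"   /* pack the two words into d0   */
      "vmovl.u8  q1, d0\n\t"               /* widen 8 bytes to 8 halfwords */
      "vst1.8    {d2, d3}, [%[dst]]!\n\t"
      : [dst] "+r"(dst)
      : [in1] "r"(in1), [in2] "r"(in2)
      : "d0", "d2", "d3", "memory");
  *dstp = dst;
}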
(In reply to comment #11)

> Do you know if this is any faster than the original code?

It is 20% faster than the original code on a Cortex-A8 board.

> Would it be faster to do 64 bits at a time?
> 
>   PRUint32 in1 = *src32++;
>   PRUint32 in2 = *src32++;
> 
>   if (in1 & 0x80808080U || in2 & 0x80808080U) {
>     src32 -= 2;
>     break;
>   }
> 
>   vmov d0, in1, in2
>   vmovl.u8 q1, d0
>   vst1.8 q1, [%0]!

Oh.  This is another 10% faster than my fix on both Cortex-A8 and Cortex-A9.

 
> You could even try loading into d0 first, then moving into general purpose
> registers to do the if.  Something like:
> 
>   Before loop, load 0x808080 into d31.
> 
>   Load [src32] into d0.
>   d1 = d0 & d31
>   vmov r0, r1, d0
>   check that r0 == r1 == 0
>   finish up

This was slower than the original.
Attachment #533542 - Flags: review?(justin.lebar+bug)
You may get an additional speedup by adding prefetch instructions.
Attached patch fix v2 (deleted) — Splinter Review
- use aligned access to get better performance (an aligned copy is 1 cycle vs. 2 or 3 cycles for an unaligned copy)
- use pld for prefetch
Attachment #533542 - Attachment is obsolete: true
Attachment #559426 - Flags: review?(justin.lebar+bug)
Comment on attachment 559426 [details] [diff] [review]
fix v2

> +    // Although ARMv7 can support unaligned access, We use aligned acess
> +    // for distnation to get more performance

"access", not "acess".  "destination", not "distnation".  Please end the sentence with a period, and please don't capitalize "We" in the middle of the sentence.

In fact, I'd reword it to say:

  Although ARMv7 supports unaligned loads and stores, aligning the src and dst pointers here gives us a speedup. 

> +    while (((NS_PTR_TO_UINT32(dst) & 15) || (NS_PTR_TO_UINT32(src) & 3))

Suppose src = 0x1002, dst = 0x100.  Then the following are the values that src and dst will take on as we loop.  I'll put a * next to the src values aligned to 4 bytes and the dst values aligned to 16 bytes.  We need *s next to both numbers to exit the loop.

   src     dst
0x1002   0x100*
0x1004*  0x101
0x1006   0x102
0x1008*  0x103
...
0x1032*  0x10E
0x1034   0x110*

In other words, src and dst are out of phase with one another.  src is aligned on every odd iteration (i=1, i=3, etc), and dst is aligned when the iteration count is a multiple of 16 (i=0, i=16, etc.).  We'll never exit this loop and run the NEON code.

This is what the current code is checking for when it says "with some alignments, we'd never break out of the slow loop."

I think the right thing to do is to check whether src and dst are in phase or not.  If they're in phase, run the code in this patch.  If not, run the code in your earlier version, which assumed only that src was aligned.

(You could just do what the current version does and fall back to a slow path when the data is misaligned.  But looking at how Convert_ascii_run is used, it seems to me that we'll run the out-of-phase version often.)

> +        "pld       [%1, #8]\n"

Did you try different prefetch distances?  You may get a bigger speedup by using a larger offset than #8.

Sorry to send this back for another round.
Attachment #559426 - Flags: review?(justin.lebar+bug) → review-
> But looking at how Convert_ascii_run is used, it seems to me that 
> we'll run the out-of-phase version often.

... because if there's a Japanese character in the UTF-8 string, for instance, we'll move the UTF-8 pointer forward 3 bytes but move the UTF-16 pointer forward only 2 bytes.  This will change the two pointers' relative phase.
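
A sketch of that phase test, assuming each scalar iteration advances src (char *) by 1 byte and dst (PRUnichar *, shown here as uint16_t *) by 2 bytes, and that the fast path wants src 4-byte aligned and dst 16-byte aligned; the helper name and the exact condition are worked out for this example, not taken from a patch.

#include <stdint.h>
#include <stdbool.h>

/* src + k can be 4-byte aligned and dst + 2*k 16-byte aligned for the same
 * k only if dst == 2*src (mod 8); otherwise the alignment loop never exits. */
static bool src_dst_in_phase(const char *src, const uint16_t *dst)
{
  return (((uintptr_t)dst - 2 * (uintptr_t)src) & 7) == 0;
}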
(In reply to Justin Lebar [:jlebar] from comment #15)
> Comment on attachment 559426 [details] [diff] [review]
> fix v2
> 
> > +    // Although ARMv7 can support unaligned access, We use aligned acess
> > +    // for distnation to get more performance
> 
> "access", not "acess".  "destination", not "distnation".  Please end the
> sentence with a period, and please don't capitalize "We" in the middle of
> the sentence.
> 
> In fact, I'd reword it to say:
> 
>   Although ARMv7 supports unaligned loads and stores, aligning the src and
> dst pointers here gives us a speedup. 

Thanks.

 
> > +    while (((NS_PTR_TO_UINT32(dst) & 15) || (NS_PTR_TO_UINT32(src) & 3))
> 
> Suppose src = 0x1002, dst=0x100.  Then the following are the values that src
> and dst will take on as we loop.  I'll put a * next to the src values
> aligned to 4 bytes and the dst values aligned to 16 bytes.  We need *s next to
> both numbers to exit the loop.
> 
>    src     dst
> 0x1002   0x100*
> 0x1004*  0x101
> 0x1006   0x102
> 0x1008*  0x103
> ...
> 0x1032*  0x10E
> 0x1034   0x110*

???  src is a char* and dst is a PRUnichar*, so the dst column is incorrect (dst advances 2 bytes per iteration).

> In other words, src and dst are out of phase with one another.  src is
> aligned on every odd iteration (i=1, i=3, etc), and dst is aligned when the
> iteration count is a multiple of 16 (i=0, i=16, etc.).  We'll never exit
> this loop and run the NEON code.

I should add more optimization like the following...

while (src & 3) {
  // not using NEON
}

while (dst & 15) {
  // using unaligned NEON (src grows 4 bytes per loop)
}

// using aligned NEON

> > +        "pld       [%1, #8]\n"
> 
> Did you try different prefetch distances?  You may get a bigger speedup by
> using a larger offset than #8.

If we grow the chunk of src processed per loop iteration, it is faster.  But my concern is the cost of the comparison.
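
A rough sketch of that three-stage structure, using intrinsics and __builtin_prefetch for readability; the function name echoes Convert_ascii_run, but everything here, including the prefetch distance, is illustrative, and intrinsics cannot express the aligned-vs-unaligned store distinction the real asm would make.

#include <arm_neon.h>
#include <stdint.h>
#include <string.h>

static void convert_ascii_run_sketch(const unsigned char **srcp,
                                     uint16_t **dstp, int32_t *lenp)
{
  const unsigned char *src = *srcp;
  uint16_t *dst = *dstp;
  int32_t len = *lenp;

  /* 1. Scalar loop until src is 4-byte aligned (or a non-ASCII byte stops us). */
  while (len > 0 && ((uintptr_t)src & 3)) {
    if (*src & 0x80)
      goto done;
    *dst++ = *src++;
    len--;
  }

  /* 2./3. NEON loop, 8 bytes per iteration.  The explicit prefetch matters
   * on cores without a hardware prefetcher; the distance is a tuning knob. */
  while (len >= 8) {
    uint32_t lo, hi;
    memcpy(&lo, src, 4);
    memcpy(&hi, src + 4, 4);
    if ((lo | hi) & 0x80808080U)
      break;
    __builtin_prefetch(src + 64);
    vst1q_u16(dst, vmovl_u8(vld1_u8(src)));
    src += 8;
    dst += 8;
    len -= 8;
  }

done:
  /* Any leftover bytes (len < 8 or a non-ASCII byte) are left to the caller. */
  *srcp = src;
  *dstp = dst;
  *lenp = len;
}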
> I should add more optimization like the following.

Yes, that's what I had in mind.

> If we grow the chunk of src processed per loop iteration, it is faster.  But my concern is the cost of the comparison.

Sorry, I don't understand what you mean.  If you changed the 8 to 16 in the pld, what would be the extra cost?

String conversion uses encoding_rs now, which uses SIMD.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME
Component: String → XPCOM