Closed Bug 885496 Opened 11 years ago Closed 11 years ago

Optimize KissFFT with NEON instructions

Categories

(Core :: Web Audio, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: ehsan.akhgari, Assigned: jwwang)

References

Details

(Keywords: perf, Whiteboard: [blocking-webaudio-])

Attachments

(4 files, 2 obsolete files)

Depends on: 882543
No longer depends on: 884253
Is there a conventional mozilla way to apply such a foreign patch? Should I create a patch by extracting useful parts and discarding those which we don't need (ex: README.swe and kissfft.gyp)? Should we also add NEON impl. of kf_bfly3() and kf_bfly5()?
What we usually end up doing is to keep the patch at the root of the directory containing the library, and add a line to the 'update.sh' script to apply the patch after having pulled from upstream, see [1] for an example of a library where we keep a patch on top of upstream. Not sure about the kf_bfly3() and kf_bfly5() implementation, probably not worth it if we don't call them. [1]: http://mxr.mozilla.org/mozilla-central/source/media/libsoundtouch/
Attachment #773039 - Flags: review?(ehsan)
Attached patch [Part 2] Update update.sh. (obsolete) (deleted) — Splinter Review
It is kinda mind-bending to have a patch (part 2) which will create another patch which shares a common hunk (modification to kiss_fft.c) with the previous patch (part 1).
Attachment #773041 - Flags: review?(ehsan)
Comment on attachment 773039 [details] [diff] [review] [Part 1] Optimize KissFFT with NEON instructions. Review of attachment 773039 [details] [diff] [review]: ----------------------------------------------------------------- Looks good, although I'm not really familiar with NEON myself. Tim, do you mind having a quick look over the NEON code to see if there is anything wrong there?
Attachment #773039 - Flags: review?(ehsan) → review?(tterribe)
Attachment #773041 - Flags: review?(ehsan) → review+
(In reply to jwwang from comment #4) > Created attachment 773041 [details] [diff] [review] > [Part 2] Update update.sh. > > It is kinda mind-bending to have a patch (part 2) which will create another > patch which shares a common hunk (modification to kiss_fft.c) with the > previous patch (part 1). It's true. :-)
Assignee: nobody → jwwang
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #5) > Looks good, although I'm not really familiar with NEON myself. Tim, do you > mind having a quick look over the NEON code to see if there is anything > wrong there? I don't mind, but I've got a bunch of reviews in my queue, so it will likely be a few days before I can get to this.
(In reply to comment #7) > (In reply to :Ehsan Akhgari (needinfo? me!) from comment #5) > > Looks good, although I'm not really familiar with NEON myself. Tim, do you > > mind having a quick look over the NEON code to see if there is anything > > wrong there? > > I don't mind, but I've got a bunch of reviews in my queue, so it will likely be > a few days before I can get to this. Tim: ping? :)
Comment on attachment 773039 [details] [diff] [review] [Part 1] Optimize KissFFT with NEON instructions. Review of attachment 773039 [details] [diff] [review]: ----------------------------------------------------------------- r- for the lack of runtime detection. The actual asm could also certainly be a lot better. ::: media/kiss_fft/kiss_fft.c @@ +17,5 @@ > /* The guts header contains all the multiplication and addition macros that are defined for > fixed or floating point complex numbers. It also delares the kf_ internal functions. > */ > > +#ifdef HAVE_ARM_NEON We need run-time NEON detection, even on ARMv7 devices. ::: media/kiss_fft/kiss_fft_bfly2_neon.s @@ +1,2 @@ > +/* > +* Copyright (c) 2013, The Linux Foundation. All rights reserved. Has anyone pinged licensing@mozilla.org to make sure this is okay? I know BSD code is generally fine, but we may need to update about:license, etc. @@ +29,5 @@ > + > +@ NEON optimized assembly routine of kf_bfly2() > + > + .text > + .fpu neon You need to set .arch to something with NEON, and .object_arch to some lowest-common-denominator thing to, e.g., make sure this works when building an ARMv6 binary that might also be run on a device with NEON. See gfx/ycbcr/yuv_row_arm.s for an example. @@ +32,5 @@ > + .text > + .fpu neon > + .align 4 > + .global kf_bfly2 > + .func kf_bfly2 Need a .fnstart to generate an unwind table entry. @@ +35,5 @@ > + .global kf_bfly2 > + .func kf_bfly2 > + > +kf_bfly2: > + stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr} You should always push an even number of registers to keep the stack pointer 8-byte aligned. In this case, r9, sl, fp, and lr are unused. You need to use lr to be able to pop the return address into pc. You aren't use r12, which is caller-saved. Ditch r9, sl, and fp, replace r7 with r12 (which you don't need to save here), and r8 with lr. Then you're saving just 4 registers (r4, r5, r6, lr). You can get rid of r6 by following the advice below. You could also get rid of r4 by decrementing r3 by 4 at a time instead. @@ +36,5 @@ > + .func kf_bfly2 > + > +kf_bfly2: > + stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr} > +@ vstmdb sp!, {d8-d15} If we don't need this code, just delete it. @@ +41,5 @@ > + @ r0 - Fout| r1 - fstride | r2 - st | r3 - m > + pld [r0, #0] > + mov r8, r3, asl #3 @ convert m into bytes count (m*8) > + add r5, r0, r8 @ Fout2 = Fout + m; > + add r6, r2, #264 @ tw1 = st->twiddles This is very fragile. You should just pass st->twiddles directly. @@ +59,5 @@ > + vld2.32 {d16-d19}, [r6], r7 @ load *tw1; tw1 += (fstride*4); > + pld [r6, #0] @ preload next tw1 > + vmul.f32 q2, q10, q12 @ C_MUL (t, *Fout2 , *tw1); > + vmul.f32 q3, q11, q13 > + vsub.f32 q8, q2, q3 This is what VMLS was made for. @@ +62,5 @@ > + vmul.f32 q3, q11, q13 > + vsub.f32 q8, q2, q3 > + vmul.f32 q2, q10, q13 > + vmul.f32 q3, q11, q12 > + vadd.f32 q9, q2, q3 This is what VMLA was made for. @@ +64,5 @@ > + vmul.f32 q2, q10, q13 > + vmul.f32 q3, q11, q12 > + vadd.f32 q9, q2, q3 > + > + vld2.32 {d0-d3}, [r0] @ load *Fout; You should start this load earlier. As it is, it is guaranteed to cause a 2-cycle stall, at a minimum. Putting it after the first mul is probably best. @@ +82,5 @@ > +@.kf_bfly2_process_remaining: > + asr r8, r3, #31 > + lsr r7, r8, #30 > + add r4, r7, r3 > + ands r3, r4, #3 @ if (k % 4 == 0) This whole thing can just be replaced by ands r3, r3, #3. m is always non-negative. @@ +83,5 @@ > + asr r8, r3, #31 > + lsr r7, r8, #30 > + add r4, r7, r3 > + ands r3, r4, #3 @ if (k % 4 == 0) > + beq .kf_bfly2_done Instead of branching, just ldmeqfd here. @@ +92,5 @@ > + @ float32x4x2_t t; d4 {s8,s9} > + > + > +.bfly2_do_while1: @ do { //process 1 sample per iteration > + vld1.32 {d2}, [r5] @ load *Fout2;{s16,s17} The comment is wrong. @@ +93,5 @@ > + > + > +.bfly2_do_while1: @ do { //process 1 sample per iteration > + vld1.32 {d2}, [r5] @ load *Fout2;{s16,s17} > + vld1.32 {d3}, [r6], r1 @ load *tw1; tw1 += (fstride);{s24,s25} The comment is wrong. @@ +99,5 @@ > + vmul.f32 d1, d2, d3 @ @ C_MUL (t, *Fout2 , *tw1); > + vsub.f32 s8, s2, s3 > + vmul.f32 s2, s4, s7 > + vmul.f32 s3, s5, s6 > + vadd.f32 s9, s2, s3 VMLA @@ +101,5 @@ > + vmul.f32 s2, s4, s7 > + vmul.f32 s3, s5, s6 > + vadd.f32 s9, s2, s3 > + > + vld1.32 {d0}, [r0] @ load *Fout; See above about moving this up. @@ +114,5 @@ > + subs r3, r3, #1 @ }while(--k); > + bne .bfly2_do_while1 > + > +.kf_bfly2_done: > +@ vldmia sp!, {d8-d15} If we don't need this code, just delete it. @@ +116,5 @@ > + > +.kf_bfly2_done: > +@ vldmia sp!, {d8-d15} > + ldmia sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc} > + nop What's the nop for? @@ +118,5 @@ > +@ vldmia sp!, {d8-d15} > + ldmia sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc} > + nop > + > + .endfunc Need a .fnend to generate an unwind table entry. Also, you should provide a .size entry so that debuggers can identify this function in crash reports, etc. ::: media/kiss_fft/kiss_fft_bfly4_neon.s @@ +1,1 @@ > +/* Assume unless otherwise specified that every comment on kf_bfly2 applies to this file as well. @@ +35,5 @@ > + .global kf_bfly4 > + .func kf_bfly4 > + > +kf_bfly4: > + stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr} You aren't using lr. Replace r11 with lr and don't save r12. Two fewer registers to save/restore. You can eliminate r4 and r6 as above, as well. @@ +39,5 @@ > + stmdb sp!, {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr} > +@ vstmdb sp!, {d8-d15} > + @ r0 - Fout| r1 - fstride | r2 - st | r3 - m > + pld [r0, #0] > + mov r5, r3 Don't make a copy and then destructively modify the source. Swap the roles of r3 and r5 instead. @@ +45,5 @@ > + add r6, r2, #264 @ tw1 = st->twiddles > + pld [r6, #0] > + mov r7, r6 @ tw2 = st->twiddles > + mov r8, r7 @ tw3 = st->twiddles > + ldr r2, [r2, #4] @ st->inverse You should not be branching in an inner loop based on a value fixed outside that loop. Make a separate function for the inverse. That also means you can avoid passing in st without using more than 4 parameters. Assemble it with macros if you're worried about code duplication. @@ +60,5 @@ > + beq .kf_bfly4_do_while1 @ if(k==0) > + > +.kf_bfly4_do_while4: @ do { //process 4 samples per iteration > + add r11, r0, r3 @ fom = Fout+m; > + mov r12, r11 Instead of making this copy, you can recompute the value below just before you need it, which allows you to keep it in r11 instead of r12, saving you another register. @@ +66,5 @@ > + vld1.32 {d20}, [r6], r1 @ rtwd1 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vld1.32 {d21}, [r6], r1 @ rtwd2 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vld1.32 {d22}, [r6], r1 @ rtwd3 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vld1.32 {d23}, [r6], r1 @ rtwd4 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); Loading into d20, d22, d21, then d23 and using vtrn instead of vuzp will save you a cycle. The same applies for the other twiddles below. @@ +67,5 @@ > + vld1.32 {d21}, [r6], r1 @ rtwd2 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vld1.32 {d22}, [r6], r1 @ rtwd3 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vld1.32 {d23}, [r6], r1 @ rtwd4 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); > + vld2.32 {d0-d3}, [r11], r3 @ rfout = vld2q_f32((const float32_t*)(fom1)); fom2 = Fout+m2; Do this load before the VUZP (VTRN). @@ +70,5 @@ > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); > + vld2.32 {d0-d3}, [r11], r3 @ rfout = vld2q_f32((const float32_t*)(fom1)); fom2 = Fout+m2; > + vmul.f32 q2, q0, q10 @ C_MUL_NEON(scratch0, rfout, scratch3); > + vmul.f32 q3, q1, q11 > + vsub.f32 q12, q2, q3 VMLS @@ +73,5 @@ > + vmul.f32 q3, q1, q11 > + vsub.f32 q12, q2, q3 > + vmul.f32 q2, q0, q11 > + vmul.f32 q3, q1, q10 > + vadd.f32 q13, q2, q3 VMLA @@ +80,5 @@ > + vld1.32 {d20}, [r7], r9 @ rtwd1 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2; > + vld1.32 {d21}, [r7], r9 @ rtwd2 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2; > + vld1.32 {d22}, [r7], r9 @ rtwd3 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2; > + vld1.32 {d23}, [r7], r9 @ trtwd4 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2; > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); You'd benefit a lot from figuring out how to interleave these instructions with prior arithmetic on A8-class hardware. It can share the first and last cycle of a load/store/permute instruction with that of an ALU instruction, potentially saving you 10 cycles here (since all of these instructions are multi-cycle... though you'll "save" fewer when you convert to using VMLS/VMLA since there will be fewer instructions to share with). You have to switch to using Q8/Q9 instead of Q10/Q11, though, or you'll run into data dependency problems. @@ +81,5 @@ > + vld1.32 {d21}, [r7], r9 @ rtwd2 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2; > + vld1.32 {d22}, [r7], r9 @ rtwd3 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2; > + vld1.32 {d23}, [r7], r9 @ trtwd4 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2; > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); > + vld2.32 {d0-d3}, [r11], r3 @ rfout = vld2q_f32((const float32_t*)(fom2)); fom3 = Fout+m3; See above. @@ +84,5 @@ > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); > + vld2.32 {d0-d3}, [r11], r3 @ rfout = vld2q_f32((const float32_t*)(fom2)); fom3 = Fout+m3; > + vmul.f32 q2, q0, q10 @ C_MUL_NEON(scratch1, rfout, scratch3); > + vmul.f32 q3, q1, q11 > + vsub.f32 q14, q2, q3 VMLS @@ +87,5 @@ > + vmul.f32 q3, q1, q11 > + vsub.f32 q14, q2, q3 > + vmul.f32 q2, q0, q11 > + vmul.f32 q3, q1, q10 > + vadd.f32 q15, q2, q3 VMLA @@ +95,5 @@ > + vld1.32 {d21}, [r8], r10 @ rtwd2 = vld1_f32((const float32_t*)tw3); tw3 += fstride*3; > + vld1.32 {d22}, [r8], r10 @ rtwd3 = vld1_f32((const float32_t*)tw3); tw3 += fstride*3; > + vld1.32 {d23}, [r8], r10 @ rtwd4 = vld1_f32((const float32_t*)tw3); tw3 += fstride*3; > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); > + vld2.32 {d0-d3}, [r11] @ rfout = vld2q_f32((const float32_t*)(fom3)); See above. @@ +98,5 @@ > + vuzp.32 q10, q11 @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4)); > + vld2.32 {d0-d3}, [r11] @ rfout = vld2q_f32((const float32_t*)(fom3)); > + vmul.f32 q2, q0, q10 @ C_MUL_NEON(scratch2, rfout, scratch3); > + vmul.f32 q3, q1, q11 > + vsub.f32 q8, q2, q3 VMLS @@ +101,5 @@ > + vmul.f32 q3, q1, q11 > + vsub.f32 q8, q2, q3 > + vmul.f32 q2, q0, q11 > + vmul.f32 q3, q1, q10 > + vadd.f32 q9, q2, q3 VMLA @@ +139,5 @@ > + @ } > +.c_end4: > + vst2.32 {d20-d23}, [r12], r3 @ vst2q_f32((float32_t*)(fom), scratch3); fom2 = Fout+m2; > + vst2.32 {d16-d19}, [r12], r3 @ vst2q_f32((float32_t*)fom2, scratch2); fom3 = Fout+m3; > + vst2.32 {d28-d31}, [r12] @ vst2q_f32((float32_t*)(fom3), scratch1); You can do much better by interleaving the last 8 arithmetic ops with the stores. To maximize dual issuing and minimize pipeline stalls, you want something like (for the not_inverse4 case) vadd.f32 q8, q0, q10 vadd.f32 q9, q1, q11 vadd.f32 q14, q2, q13 vadd.f32 q15, q3, q12 vst2.32 {d16-d19}, [r0]! vsub.f32 q0, q0, q10 vsub.f32 q1, q1, q11 vst2.32 {d28-d31}, [r12], r3 vsub.f32 q8, q2, q13 vst2.32 {d0-d3}, [r12], r3 vadd.f32 q9, q3, q12 vst2.32 {d16-d19}, [r12] There's still a 2-cycle stall while executing the last store, but there's no more computations to stick in there. Overall this saves 32-26 = 6 cycles. @@ +153,5 @@ > + add r4, r4, r5 > + ands r5, r4, #3 @ if (k%4 == 0) > + beq .kf_bfly4_done > + > +.kf_bfly4_do_while1: @ do { //process 1 sample per iteration Most of the comments for the above loop apply here as well. @@ +156,5 @@ > + > +.kf_bfly4_do_while1: @ do { //process 1 sample per iteration > + pld [r7, #0] > + vld1.32 {d18}, [r6], r1 @ rtwd1 = vld1_f32((const float32_t*)tw1); tw1 += fstride; > + vuzp.32 d18, d19 @ scratch3 = vuzp_f32(rtwd1, rtwd2); //d11 is empty The comment is wrong (here and everywhere that mentions d11). @@ +186,5 @@ > + vmul.f32 q1, q0, q9 @ C_MUL_NEON(scratch2, rfout, scratch3); > + vsub.f32 d16, d2, d3 > + vmul.f32 d2, d0, d19 > + vmul.f32 d3, d1, d18 > + vadd.f32 d17, d2, d3 This stuff is misguided. The Q version of VMUL is twice as slow as the D version, so it doesn't actually save anything to VUZP everything. Really, what you want is to load all three twiddles and all three coefficients into adjacent registers, and then do 3 simultaneous complex multiplies the same way you do 4 of them in the first loop. @@ +200,5 @@ > + vsub.f32 q2, q2, q8 @ C_SUB_NEON(scratch0, scratch0, scratch2); > + > + vsub.f32 q8, q0, q9 @ C_SUB_NEON(scratch2, rfout, scratch3); > + > + vadd.f32 q0, q0, q9 @ C_ADD_NEON(rfout, rfout, scratch3); Similarly, here you want the real/imaginary values packed together. The Q version of VADD/VSUB is also twice as slow as the D version, so you're halving your throughput.
Attachment #773039 - Flags: review?(tterribe) → review-
Whiteboard: [blocking-webaudio-]
Comment on attachment 773039 [details] [diff] [review] [Part 1] Optimize KissFFT with NEON instructions. Review of attachment 773039 [details] [diff] [review]: ----------------------------------------------------------------- ::: media/kiss_fft/kiss_fft.c @@ +17,5 @@ > /* The guts header contains all the multiplication and addition macros that are defined for > fixed or floating point complex numbers. It also delares the kf_ internal functions. > */ > > +#ifdef HAVE_ARM_NEON mozilla::supports_abc() is only usable in CPP files. Is there other run-time detection for C files? ::: media/kiss_fft/kiss_fft_bfly2_neon.s @@ +35,5 @@ > + .global kf_bfly2 > + .func kf_bfly2 > + > +kf_bfly2: > + stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr} What's the purpose to keep stack pointer 8-byte aligned? I see no stack operations in the following instructions.
jwwang, see how I worked around it in [1] and [2] (I use it at [3]). Essentially, you wrap the C++ functions you need in C functions, and include the C header where you need to do the runtime detection. Glandium explained to me all the build system magic we had to do in order to make this work, so it should be painless for you this time. Don't hesitate to ask me if it does not work. That said, if this is going to be common, maybe we should expose a C API to SSE.h. [1]: http://mxr.mozilla.org/mozilla-central/source/media/libspeex_resampler/src/sse_detect.h [2]: http://mxr.mozilla.org/mozilla-central/source/media/libspeex_resampler/src/sse_detect.cpp [3]: http://mxr.mozilla.org/mozilla-central/source/media/libspeex_resampler/src/resample.c#359
I see. Thanks. I would prefer a clean cut in exposing C APIs in SSE.h and arm.h
Rewrite kiss_fft_bfly4_neon.s.
Attachment #773039 - Attachment is obsolete: true
Attachment #802161 - Flags: review?(tterribe)
Attached patch [Part 2] Update update.sh. (deleted) — Splinter Review
Attachment #773041 - Attachment is obsolete: true
Attachment #802162 - Flags: review+
Attachment #802159 - Flags: review?(ehsan) → review+
Blocks: 923319
Whiteboard: [blocking-webaudio-] → [blocking-webaudio-][webaudio-perf]
Keywords: perf
Whiteboard: [blocking-webaudio-][webaudio-perf] → [blocking-webaudio-]
Do we still need this bug once Bug 926838 is landed?
I don't think so (that is why I've held off spending the time to review it).
Let's make that explicit!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WONTFIX
Attachment #802161 - Flags: review?(tterribe)
Attached file kiss_fft_bfly5_neon.s (deleted) —
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: