Bug 406580 - Faster copying of RGB pixel data
Closed (opened 17 years ago; closed 17 years ago)
Categories: Core :: Graphics, defect, P1
Tracking: RESOLVED FIXED, target mozilla1.9beta3
People
(Reporter: swsnyder, Assigned: swsnyder)
References
Details
(Keywords: perf)
Attachments
(3 files, 6 obsolete files)
(deleted), patch | pavlov: review+, approval1.9+ | Details | Diff | Splinter Review
(deleted), patch | pavlov: review+, approval1.9+ | Details | Diff | Splinter Review
(deleted), patch | pavlov: review+ | Details | Diff | Splinter Review
A significant percentage of the time spent in graphics processing is copying the pixel data from the buffer where a given image is decoded to the display buffer for output. Most of this slow processing is due to having to construct each pixel from 4 individual values (A,R,G,B).
In the case where the alpha value is 0xFF, no manipulation of the R,G,B values is needed, yet the pixel components are still handled as 4 separate values: the R, G, and B values are read from contiguous byte-sized memory locations and then reassembled, with 0xFF, into a single 32-bit value.
Processing would be faster if those pixel components were kept contiguous, not read individually from the source buffer and re-assembled.
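In C terms, the difference being described looks roughly like this (an illustrative sketch, not the actual tree code; `pack_contiguous` assumes a little-endian host and that reading one byte past the last RGB triple is safe, as the patches discussed later also do):

```cpp
#include <cstdint>
#include <cstring>

// Per-byte packing, roughly what GFX_PACKED_PIXEL(0xFF, r, g, b) does:
// three separate byte loads, then shifts and ORs.
static uint32_t pack_per_byte(const uint8_t* p) {
    return 0xFF000000u | (uint32_t(p[0]) << 16) |
           (uint32_t(p[1]) << 8) | uint32_t(p[2]);
}

// Contiguous packing: one 32-bit load (R,G,B plus one stray byte),
// then register-only shuffling. memcpy sidesteps alignment/aliasing.
static uint32_t pack_contiguous(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);  // little-endian: x = X<<24 | B<<16 | G<<8 | R
    return 0xFF000000u | ((x & 0xFFu) << 16) | (x & 0xFF00u) | ((x >> 16) & 0xFFu);
}
```

Both produce the same 0xFFRRGGBB pixel; the second does it with a single data read per pixel.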
Assignee
Comment 1•17 years ago
Reduces the time it takes to copy pixels on x86 platforms by 20%-25%.
The performance improvement comes from reducing the number of memory accesses needed per pixel. The RGB bytes are fetched in a single 32-bit read from the source buffer, reordered within registers, and written to the destination buffer as a single 32-bit value. On x86 platforms, hardware byte reordering (the bswap instruction) is used, as provided by Microsoft's and GNU's compilers.
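A minimal sketch of that fast path, with `__builtin_bswap32` standing in for the compiler-specific swap wrappers the patch selects (the real macro names and guards differ):

```cpp
#include <cstdint>
#include <cstring>

// One 32-bit read covering R,G,B and the first byte of the next pixel,
// then a hardware byte swap. On a little-endian host the load yields
// X<<24|B<<16|G<<8|R; the swap gives R<<24|G<<16|B<<8|X; >>8 drops the
// stray byte X; the final OR sets alpha to 0xFF.
static uint32_t ff_pixel_from_bptr(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
#if defined(__GNUC__) || defined(__clang__)
    x = __builtin_bswap32(x);  // compiles to a single bswap on x86
#else
    x = ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
        ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
#endif
    return (x >> 8) | 0xFF000000u;
}
```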
Assignee
Comment 2•17 years ago
The short version:
Original code:
4.00 reads from memory per pixel
2.00 writes to memory per pixel
Modified, using generic C byte-reordering:
2.25 reads from memory per pixel
1.75 writes to memory per pixel
Modified, using x86 bswap byte-reordering:
1.25 reads from memory per pixel
1.50 writes to memory per pixel
Details:
// original code from trunk: form pixel from individual bytes
// instruction bytes in loop: 53
// reads from memory per loop: 3 (data) + 1 (var)
//-> reads from memory per pixel: 4.00 = (4 / 1)
// writes to memory per loop: 1 (pixels) + 1 (vars)
//-> writes to memory per pixel: 2.00 = (2 / 1)
// times to copy all scanlines, rdtsc():
// cosmos.jpg: 367277, 376664, 373971; avg: 372637
// sk7.jpg: 104953, 106780, 106857; avg: 106193
// -------------------------------------------------
for (PRUint32 i=mInfo.output_width; i>0; --i) {
0050300B mov ecx,dword ptr [esi+74h]
0050300E test ecx,ecx
00503010 jbe nsJPEGDecoder::OutputScanlines+152h (503042h)
*imageRow++ = GFX_PACKED_PIXEL(0xFF, sampleRow[0], sampleRow[1], sampleRow[2]);
00503012 movzx edx,byte ptr [eax]
00503015 movzx ebp,byte ptr [eax+1]
00503019 movzx eax,byte ptr [eax+2]
0050301D or edx,0FFFFFF00h
00503023 shl edx,8
00503026 or edx,ebp
00503028 shl edx,8
0050302B or edx,eax
0050302D mov dword ptr [edi],edx
sampleRow += 3;
0050302F mov eax,dword ptr [esp+0Ch]
00503033 add eax,3
00503036 add edi,4
00503039 sub ecx,1
0050303C mov dword ptr [esp+0Ch],eax
00503040 jne nsJPEGDecoder::OutputScanlines+122h (503012h)
// mod: read contiguous bytes, generic shift/or byte re-ordering
// instruction bytes in loop: 172
// reads from memory per loop: 8 (data) + 1 (var)
//-> reads from memory per pixel: 2.25 = (9 / 4)
// writes to memory per loop: 4 (pixels) + 3 (vars)
//-> writes to memory per pixel: 1.75 = (7 / 4)
// times to copy all scanlines, rdtsc():
// cosmos.jpg: not tested
// sk7.jpg: not tested
// -------------------------------------------------
PRUint32 idx = mInfo.output_width;
0050305B mov edx,dword ptr [esi+74h]
// bulk copy of pixels
while (idx >= 4) {
0050305E cmp edx,4
00503061 mov dword ptr [esp+14h],edx
00503065 jb nsJPEGDecoder::OutputScanlines+1D8h (503118h)
0050306B shr edx,2
0050306E mov dword ptr [esp+18h],edx
PRUint32 p0, p1, p2, p3; // to avoid back-to-back stalls
p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00503072 mov edx,dword ptr [eax+3]
00503075 movzx esi,byte ptr [eax+5]
p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00503079 movzx ebx,byte ptr [eax+8]
p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
0050307D movzx ebp,byte ptr [eax+0Bh]
00503081 mov ecx,edx
00503083 and edx,0FF00h
00503089 or ecx,0FFFFFF00h
0050308F shl ecx,10h
00503092 or ecx,esi
00503094 mov esi,dword ptr [eax+6]
00503097 or ecx,edx
00503099 mov edx,esi
0050309B or edx,0FFFFFF00h
005030A1 shl edx,10h
005030A4 or edx,ebx
005030A6 mov ebx,dword ptr [eax+9]
005030A9 and esi,0FF00h
005030AF or edx,esi
005030B1 mov esi,ebx
005030B3 or esi,0FFFFFF00h
005030B9 shl esi,10h
005030BC or esi,ebp
005030BE and ebx,0FF00h
005030C4 or esi,ebx
005030C6 mov ebx,dword ptr [eax]
imageRow[0] = p0; imageRow[1] = p1;
005030C8 movzx eax,byte ptr [eax+2]
005030CC mov ebp,ebx
005030CE or ebp,0FFFFFF00h
005030D4 shl ebp,10h
005030D7 or ebp,eax
005030D9 and ebx,0FF00h
imageRow[2] = p2; imageRow[3] = p3;
005030DF mov dword ptr [edi+8],edx
idx -= 4;
005030E2 mov edx,dword ptr [esp+14h]
005030E6 or ebp,ebx
005030E8 mov dword ptr [edi],ebp
005030EA mov dword ptr [edi+4],ecx
005030ED mov dword ptr [edi+0Ch],esi
sampleRow += 12;
005030F0 mov eax,dword ptr [esp+10h]
005030F4 sub edx,4
005030F7 add eax,0Ch
imageRow += 4;
005030FA add edi,10h
005030FD sub dword ptr [esp+18h],1
00503102 mov dword ptr [esp+14h],edx
00503106 mov dword ptr [esp+10h],eax
0050310A jne nsJPEGDecoder::OutputScanlines+132h (503072h)
// mod: read contiguous bytes, use x86 bswap byte re-ordering
// instruction bytes in loop: 105
// reads from memory per loop: 4 (data) + 1 (var)
//-> reads from memory per pixel: 1.25 = (5 / 4)
// writes to memory per loop: 4 (pixels) + 2 (vars)
//-> writes to memory per pixel: 1.50 = (6 / 4)
// times to copy all scanlines, rdtsc():
// cosmos.jpg: 294457, 299895, 301546; avg: 298632
// sk7.jpg: 81303, 81471, 80976; avg: 81250
// -------------------------------------------------
PRUint32 idx = mInfo.output_width;
0050300B mov ebp,dword ptr [ebx+74h]
// bulk copy of pixels
while (idx >= 4) {
0050300E cmp ebp,4
00503011 jb nsJPEGDecoder::OutputScanlines+18Dh (50307Dh)
00503013 mov ecx,ebp
00503015 shr ecx,2
00503018 mov dword ptr [esp+14h],ecx
0050301C lea esp,[esp]
PRUint32 p0, p1, p2, p3; // to avoid back-to-back stalls
p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00503020 mov ecx,dword ptr [eax+3]
p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00503023 mov edx,dword ptr [eax+6]
p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
00503026 mov edi,dword ptr [eax+9]
00503029 mov eax,dword ptr [eax]
0050302B bswap eax
0050302D bswap ecx
0050302F bswap edx
00503031 bswap edi
00503033 shr eax,8
00503036 shr ecx,8
00503039 shr edx,8
0050303C shr edi,8
0050303F or eax,0FF000000h
00503044 or ecx,0FF000000h
0050304A or edx,0FF000000h
00503050 or edi,0FF000000h
imageRow[0] = p0; imageRow[1] = p1;
00503056 mov dword ptr [esi],eax
00503058 mov dword ptr [esi+4],ecx
imageRow[2] = p2; imageRow[3] = p3;
0050305B mov dword ptr [esi+8],edx
0050305E mov dword ptr [esi+0Ch],edi
idx -= 4;
sampleRow += 12;
00503061 mov eax,dword ptr [esp+10h]
00503065 add eax,0Ch
00503068 sub ebp,4
imageRow += 4;
0050306B add esi,10h
0050306E sub dword ptr [esp+14h],1
00503073 mov dword ptr [esp+10h],eax
00503077 jne nsJPEGDecoder::OutputScanlines+130h (503020h)
Updated•17 years ago
Assignee: swsnyder → nobody
Component: General → GFX: Thebes
Product: Mozilla Application Suite → Core
QA Contact: general → thebes
Updated•17 years ago
Attachment #291204 - Flags: review?(pavlov)
Updated•17 years ago
Attachment #291204 - Flags: review?(pavlov)
Attachment #291204 - Flags: review+
Attachment #291204 - Flags: approval1.9+
Updated•17 years ago
Keywords: checkin-needed
Comment 3•17 years ago
The patch needs to get an extra
&& !defined(XP_OS2)
at the end of the GCC version test line for the checkin. (For some reason the byte swapping stuff was never implemented in the OS/2 GCC...)
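The kind of guarded macro block being discussed looks roughly like this (an approximation of the patch's shape, not a verbatim quote):

```cpp
#include <cstdint>

#if defined(_MSC_VER)
# include <stdlib.h>
# define GFX_BYTESWAP16(x) _byteswap_ushort(x)
# define GFX_BYTESWAP32(x) _byteswap_ulong(x)
#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_OS2)
# include <byteswap.h>  // glibc-specific header, hence the platform exclusions
# define GFX_BYTESWAP16(x) bswap_16(x)
# define GFX_BYTESWAP32(x) bswap_32(x)
#else
// Generic fallback for toolchains without a usable byte-swap wrapper.
# define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
# define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16((x) >> 16) )
#endif
```

All three arms produce the same result; the guards only pick the fastest implementation available.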
Assignee
Comment 4•17 years ago
(In reply to comment #3)
> The patch needs to get an extra
> && !defined(XP_OS2)
> at the end of the GCC version test line for the checkin. (For some reason the
> byte swapping stuff was never implemented in the OS/2 GCC...)
In attempting to confirm this I find that the link to the OS/2 build of GCC is dead. At http://developer.mozilla.org/en/docs/OS/2_Build_Prerequisites#Compiler is a link to an FTP site (ftp.netlabs.org) that no longer has the GCC build.
FYI.
Updated•17 years ago
Assignee: nobody → swsnyder
Comment 5•17 years ago
I just updated that page a few days ago and thought that I had verified that link. Obviously not, sorry.
ftp://ftp.netlabs.org/pub/gcc/GCC-3.3.5-csd3.zip is the package that we use on OS/2. It does define some bswap stuff (in endian.h) but those don't seem to be backed up by symbols in the library...
Comment 6•17 years ago
I added the extra !defined(XP_OS2), as per Peter.
Checking in gfx/thebes/public/gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v <-- gfxColor.h
new revision: 1.14; previous revision: 1.13
done
Checking in modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp,v <-- nsGIFDecoder2.cpp
new revision: 1.90; previous revision: 1.89
done
Checking in modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp,v <-- nsJPEGDecoder.cpp
new revision: 1.82; previous revision: 1.81
done
Checking in modules/libpr0n/decoders/png/nsPNGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/png/nsPNGDecoder.cpp,v <-- nsPNGDecoder.cpp
new revision: 1.77; previous revision: 1.76
done
Status: NEW → RESOLVED
Closed: 17 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M11
Comment 7•17 years ago
The GCC check also doesn't compile on Mac, so I had to add a !defined(XP_MACOSX) to that check to fix bustage.
Comment 8•17 years ago
This fixes the build on FreeBSD... It will also be broken on any other GCC platform, since byteswap.h is a Linux only header.
-Jeremy
Attachment #294278 - Flags: review?(pavlov)
Updated•17 years ago
Attachment #294278 - Flags: review?(pavlov) → review+
Updated•17 years ago
Attachment #294278 - Flags: approval1.9?
Updated•17 years ago
Attachment #294278 - Flags: approval1.9? → approval1.9+
Updated•17 years ago
Keywords: checkin-needed
Comment 9•17 years ago
Checking in gfx/thebes/public/gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v <-- gfxColor.h
new revision: 1.16; previous revision: 1.15
done
Keywords: checkin-needed
Comment 10•17 years ago
I'm fairly certain that this change has broken GIF rendering on PPC Macs. See bug 409551.
Comment 11•17 years ago
Confirmed that reverting to nsGIFDecoder2.cpp, v.1.89 fixes the problem. Suggest backing out patches related to this until successfully tested on both little and big endian machines.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 12•17 years ago
See Bug 409381 (patch 294371) for fix.
This fixes the GIF, JPG, and PNG color distortion seen on PowerPC platforms.
Comment 13•17 years ago
It's being dealt with in bug 409381. Re-resolving.
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED
Comment 14•17 years ago
(In reply to comment #8)
> This fixes the build on FreeBSD... It will also be broken on any other GCC
> platform, since byteswap.h is a Linux only header.
So, this would be why my Win XP build can't find byteswap.h? Or is that a cygwin/MinGW thing?
Comment 15•17 years ago
Comment on attachment 291204 [details] [diff] [review]
Where Alpha=0xFF, keep RGB values contiguous in pixel copying
>+#else
>+# define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
>+# define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16(x >> 16) )
>+#endif
Won't this actually be slower than GFX_PACKED_PIXEL on those platforms that don't give you access to the native byteswap opcodes?
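Speed aside, the two formulations can at least be checked for agreement; a small sketch (the reference packer is written here for the comparison, and a little-endian 32-bit load is assumed):

```cpp
#include <cstdint>
#include <cstring>

// The generic double-swap macros quoted above.
#define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
#define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16((x) >> 16) )

// Generic path: swap the whole dword, drop the stray fourth byte, set alpha.
static uint32_t generic_path(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
    return (GFX_BYTESWAP32(x) >> 8) | 0xFF000000u;
}

// Reference: GFX_PACKED_PIXEL-style per-byte pack.
static uint32_t packed_pixel(const uint8_t* p) {
    return 0xFF000000u | (uint32_t(p[0]) << 16) | (uint32_t(p[1]) << 8) | p[2];
}
```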
Comment 16•17 years ago
(In reply to comment #15)
>(From update of attachment 291204 [details] [diff] [review])
>>+#else
>>+# define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
>>+# define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16(x >> 16) )
>>+#endif
>Won't this actually be slower than GFX_PACKED_PIXEL on those platforms that
>don't give you access to the native byteswap opcodes?
OK, so maybe it isn't, but how about
#define GFX_BYTESWAP24FF(x)
((0xff << 24) | ((x) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff))
#define GFX_0XFF_PPIXEL_FROM_BPTR(pbptr) \
(GFX_BYTESWAP24FF(*((PRUint32 *)(pbptr))))
Assignee
Comment 17•17 years ago
(In reply to comment #16)
> OK, so maybe it isn't, but how about
> #define GFX_BYTESWAP24FF(x)
> ((0xff << 24) | ((x) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff))
> #define GFX_0XFF_PPIXEL_FROM_BPTR(pbptr) \
> (GFX_BYTESWAP24FF(*((PRUint32 *)(pbptr))))
That does look faster than the generic byteswap, as it performs fewer operations (3 vs. 5 shifts) on the 32-bit value read from memory.
The important thing is that the contiguous RGB values are fetched in a single read. Reducing the operations on the value once it's in a register is icing on the cake.
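A quick sanity check that the suggested macro really produces 0xFFRRGGBB (the loader below is an illustration; a little-endian host is assumed):

```cpp
#include <cstdint>
#include <cstring>

// The suggested macro: x is a little-endian dword load of R,G,B plus one
// stray byte, i.e. x = X<<24 | B<<16 | G<<8 | R. The 0xFF alpha absorbs
// the G bits that (x) << 16 pushes into the top byte.
#define GFX_BYTESWAP24FF(x) \
    ( (0xFFu << 24) | ((x) << 16) | ((x) & 0xff00u) | (((x) >> 16) & 0xffu) )

static uint32_t ff_pixel(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
    return GFX_BYTESWAP24FF(x);
}
```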
Assignee
Comment 18•17 years ago
FWIW, here are numbers from a Pentium 3/850MHz (Win2k/SP4) machine, comparing the GFX_BYTESWAP24FF(x) suggested above against the generic byteswap (not bswap):
// 24FF jenniferconnelly.jpg: 0 = initial load; 1,2 = reload
(0) Total time to copy all scanlines: 133126
(1) Total time to copy all scanlines: 131543
(2) Total time to copy all scanlines: 132112 <-- avg: 132,260
// 32FF jenniferconnelly.jpg: 0 = initial load; 1,2 = reload
(0) Total time to copy all scanlines: 136260
(1) Total time to copy all scanlines: 137576
(2) Total time to copy all scanlines: 133686 <-- avg: 135,840
So on a P3, GFX_BYTESWAP24FF(x) is 2.7% faster on an x86 platform where the compiler doesn't support the bswap instruction.
Comment 19•17 years ago
I count only two effective shifts. The first shift is on a constant and the compiler should be able to turn that into 0xff000000, leaving just the shifts of R (<< 16) and B (>> 16).
So you go from 5 shifts, 5 &s, and 2 |s to 2 shifts, 2 &s, and 3 |s. Not bad.
However, looking at the assembly for one XBGR -> FRGB operation of the generic double swapping code:
00503072 mov edx,dword ptr [eax+3]
00503081 mov ecx,edx
00503083 and edx,0FF00h
00503089 or ecx,0FFFFFF00h
0050308F shl ecx,10h
00503075 movzx esi,byte ptr [eax+5]
00503092 or ecx,esi
00503097 or ecx,edx
which is effectively 2 shifts, 2 &s and 3 |s:
(((x) | 0xffffff00) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff)
(it looks like the compiler replaced the last >> plus & with a load-byte op)
What's interesting is that this looks a lot like Neil's suggestion, except that the first two terms (0xff << 24 | (x) << 16) are combined: (((x) | 0xffffff00) << 16). Which makes me wonder, what does Neil's suggestion compile to to make it 2.7% faster?
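For reference, the compiler's combined form can be verified against a plain per-byte pack (illustrative sketch):

```cpp
#include <cstdint>

// The compiler-combined form seen in the disassembly:
// (((x) | 0xffffff00) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff)
static uint32_t combined(uint32_t x) {
    return ((x | 0xFFFFFF00u) << 16) | (x & 0xFF00u) | ((x >> 16) & 0xFFu);
}

// Plain per-byte pack of the same dword (x = X<<24 | B<<16 | G<<8 | R).
static uint32_t reference_pack(uint32_t x) {
    uint32_t r = x & 0xFFu, g = (x >> 8) & 0xFFu, b = (x >> 16) & 0xFFu;
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}
```

The OR with 0xffffff00 before the shift is what lets the 0xFF alpha and the R byte land in one operation.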
Comment 20•17 years ago
Something like this would avoid the extra memory access with two more instructions:
mov edx,dword ptr [eax+3]
mov ecx,edx ; these 3 instructions replace movzx esi,byte ptr [eax+5]
and ecx,0FF00h ;
mov esi,ecx ;
mov ecx,edx
shr edx,10h
and edx,0FFh
or ecx,0FFFFFF00h
shl ecx,10h
or ecx,esi
or ecx,edx
Not nearly as nice as
mov ecx,dword ptr [eax+3]
bswap ecx
shr ecx,8
or ecx,0FF000000h
Comment 21•17 years ago
Hrm, and looking around on the net, it looks like for Mac OS X you can use:
#include <libkern/OSByteOrder.h>
#define GFX_BYTESWAP16(x) OSSwapInt16(x)
#define GFX_BYTESWAP32(x) OSSwapInt32(x)
Comment 22•17 years ago
With the above includes and defines I get this on Intel Mac (-O2, make nsPNGDecoder.s)
L299:
movl -112(%ebp), %eax
movl (%eax), %eax
movl %eax, -156(%ebp)
bswap %eax
movl %eax, -156(%ebp)
movl -112(%ebp), %ecx
movl 3(%ecx), %edx
bswap %edx
movl %ecx, %esi
movl 6(%ecx), %ecx
bswap %ecx
movl 9(%esi), %esi
bswap %esi
shrl $8, -156(%ebp)
orl $-16777216, -156(%ebp)
movl -156(%ebp), %eax
movl %eax, (%edi)
shrl $8, %edx
orl $-16777216, %edx
movl %edx, 4(%edi)
shrl $8, %ecx
orl $-16777216, %ecx
movl %ecx, 8(%edi)
shrl $8, %esi
orl $-16777216, %esi
movl %esi, 12(%edi)
subl $4, -96(%ebp)
addl $12, -112(%ebp)
addl $16, %edi
cmpl $4, -96(%ebp)
ja L299
jmp L300
which is pretty much the same as the bswap version in comment 2, except that GCC 4.0.1 isn't quite as good with registers.
The generic version boils down to the equivalent of:
(((((((x) & 0xFF) << 8) | (((x) >> 8) & 0xFF)) << 16) | (((x) >> 8) & 0xFF00)) >> 8) | 0xFF000000
From the .s file:
movzbl %dl,%eax ; - (x) & 0xFF ...R
sall $8, %eax ..R.
movzbl %dh, %ecx ; - ((x) >> 8) & 0xFF ...G
orl %ecx, %eax ..RG
sall $16, %eax RG..
shrl $8, %edx .ABG
andl $65280, %edx ..B.
orl %eax, %edx RGB.
shrl $8, %edx .RGB
orl $-16777216, %edx FRGB
Repeat twice (once %esi, once %edi) with the first three lines replaced with:
movl %esi,%eax ; \
andl $255,%eax ; - (x) & 0xFF
sall $8, %eax
movl %esi, %edx ; - ((x) >> 8) & 0xFF
movzbl %dh, %ecx ; /
And repeat once more with:
movzbl -124(%ebp),%eax ; - (x) & 0xFF
sall $8, %eax
movl -124(%ebp), %edx ; - ((x) >> 8) & 0xFF
movzbl %dh, %ecx ; /
While not being smart with registers, these variants avoid doing a second memory access.
Neil's version (slightly reordered):
((x) << 16) | ((x) & 0xFF00) | (((x) >> 16) & 0xFF) | 0xFF000000
From the .s file:
movl %edx, %eax
sall $16, %eax ; (x) << 16 GR..
movl %edx, %ecx
andl $65280, %ecx ; (x) & 0xFF00 ..G.
orl %ecx, %eax GRG.
shrl $16, %edx ; (x) >> 16 ..AB
andl $255, %edx ; & 0xFF ...B
orl %edx, %eax GRGB
orl $-16777216, %eax FRGB
Repeat 3 more times with %edx in the first and third line replaced with %edi, %esi, and -124(%ebp) respectively.
So on Intel Mac Neil's version is slightly more efficient. It's pretty much what I had suggested in comment 20 (if I hadn't been a doofus and would just have done "mov esi,edx ; and esi,0FF00h"). Reorder things slightly et voila:
mov ecx,edx
shl ecx,10h
mov esi,edx
and esi,0FF00h
or ecx,esi
shr edx,10h
and edx,0FFh
or ecx,edx
or ecx,0FF000000h
My guess is that not doing the second memory access gives Neil's version 2.7% over the generic version.
Comment 23•17 years ago
Hrm, I guess my As above should really have been Xs (well, B from the next pixel, but X for clarity)
Anyway, inspiration struck:
# define GFX_BYTESWAP24FF(x) ( ((((x) << 16) | ((x) >> 16)) & 0xffff00ff) | ((x) & 0xff00) | (0xff << 24) )
becomes:
movl %ecx, %eax ; XBGR
rorl $16, %eax ; GRXB
xorb %ah, %ah ; GR.B
andl $65280, %ecx ; ..G.
orl %ecx, %eax ; GRGB
orl $-16777216, %eax ; FRGB
And it's a damn shame the compiler didn't do:
movl %ecx, %eax ; XBGR
rorl $16, %eax ; GRXB
movb %ch, %ah ; GRGB
orl $-16777216, %eax ; FRGB
(prepend a "mov %esi, %ecx" for %esi, same for %edi)
Ah, but how about:
# define GFX_BYTESWAP24FF(x) ( ((((x) << 16) | ((x) >> 16)) | 0xff00ff00) & ((x) | 0xffff00ff) )
becomes:
movl %ecx, %eax ; XBGR
rorl $16, %eax ; GRXB
orl $-16711936, %eax ; FRFB
orl $-65281, %ecx ; FFGF
andl %ecx, %eax ; FRGB
with no messing around for %edi and %esi.
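Both rotate-based candidates can be checked against the straightforward pack (a quick test sketch; `u` suffixes added to keep the shifts well-defined):

```cpp
#include <cstdint>

// First candidate: rotate by 16, clear the stray byte, merge G, set alpha.
#define SWAP24FF_A(x) \
    ( ((((x) << 16) | ((x) >> 16)) & 0xFFFF00FFu) | ((x) & 0xFF00u) | (0xFFu << 24) )
// Second candidate: rotate by 16, then two masks fold in G and the alpha.
#define SWAP24FF_B(x) \
    ( ((((x) << 16) | ((x) >> 16)) | 0xFF00FF00u) & ((x) | 0xFFFF00FFu) )

// Plain pack of x = X<<24 | B<<16 | G<<8 | R for comparison.
static uint32_t reference_pack(uint32_t x) {
    return 0xFF000000u | ((x & 0xFFu) << 16) | (x & 0xFF00u) | ((x >> 16) & 0xFFu);
}
```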
Comment 24•17 years ago
Wheeee! Not sure about the GCC range there, let me know if that needs to change.
Attachment #294884 - Flags: review?
Updated•17 years ago
Attachment #294884 - Flags: review? → review?(pavlov)
Comment 25•17 years ago
One thing I'm wondering though. Do we need to worry about the cost of non-aligned memory access? Since we're already taking 4 * 3 bytes in the main loop, could we process this in 3 chunks of 4, and make sure that the pre-loop aligns us?
Assignee
Comment 26•17 years ago
Re comment #25:
For x86, memory write alignment is more important for performance than read alignment. Of course, optimal performance would come from having all memory accesses aligned to their data type.
In the case of the pixel copy, the destination is aligned 100% of the time and the source 25% so it's not too bad. Optimal would be a scheme in which 12 RGB source bytes were gotten in a 32-bit-aligned read and 4 ARGB pixels were emitted. Some sort of SIMD (SSE2, etc) sequence would do it, as would some careful ASM code, but either would be deemed too platform-specific for Mozilla.
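The pre-loop alignment idea from the previous comment reduces to peeling off a few scalar pixels until the destination reaches SIMD alignment; a hypothetical helper, not from any patch here:

```cpp
#include <cstddef>
#include <cstdint>

// How many pixels to copy one at a time before a 16-byte-aligned SIMD
// main loop can take over on the destination (4-byte ARGB pixels).
// Assumes imageRow is at least 4-byte aligned, as pixel buffers are.
static size_t leadin_pixels(const uint32_t* imageRow, size_t width) {
    uintptr_t addr = reinterpret_cast<uintptr_t>(imageRow);
    size_t n = ((16 - (addr & 15u)) / 4) % 4;  // 0..3 lead-in pixels
    return n < width ? n : width;
}
```

Note that with the source advancing 12 bytes and the destination 16 bytes per iteration, only one of the two pointers can be kept 16-byte aligned throughout; aligning the destination is the better trade per the comment above.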
Comment 27•17 years ago
Any way to do what you're suggesting with intrinsics?
Assignee
Comment 28•17 years ago
(In reply to comment #27)
> Any way to do what you're suggesting with intrinsics?
You mean SIMD intrinsics? Certainly. In my limited experience SIMD intrinsics are *better* than ASM code in that they are common to the Microsoft/GNU/Intel compilers and avoid the differing Intel/AT&T ASM formats (don't get me started!).
My SSE2 experience is too weak to provide off-the-cuff code, but a suitable sequence would be roughly:
pre: copy pixel data as bytes until 128-bit alignment is reached for both source and destination pointers.
1. read 16 bytes of RGB data from the 128-bit-aligned source into an XMM register
2. shuffle the 4 3-byte sequences into 32-bit position, discarding the high 4 bytes read
3. ADD or OR 4 0xFF bytes into the appropriate positions
4. write XMM register value (4 ARGB pixels) to 128-bit-aligned destination
post: copy remaining (n mod 4) pixels as bytes
The downsides:
1. requires SSE2+ instruction set
2. might actually be slower for short scanlines due to set-up costs
3. so different from the integer code that you'd have to actually write the final code to know the relative performance
Comment 29•17 years ago
Comment on attachment 294884 [details] [diff] [review]
Add Mac OS X bswap support, speed up generic GFX_0XFF_PPIXEL_FROM_BPTR
> #elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_MACOSX) && !defined(XP_OS2)
> # include <byteswap.h>
> # define GFX_BYTESWAP16(x) bswap_16(x)
> # define GFX_BYTESWAP32(x) bswap_32(x)
> # define _GFX_USE_INTRIN_BYTESWAP_
>+#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && defined(XP_MACOSX)
>+# include <libkern/OSByteOrder.h>
>+# define GFX_BYTESWAP16(x) OSSwapInt16(x)
>+# define GFX_BYTESWAP32(x) OSSwapInt32(x)
>+# define _GFX_USE_INTRIN_BYTESWAP_
Can't you move the XP_MACOSX one before the one above it so we can drop the !defined(XP_MACOSX) check in the elif before it? Is there an XP_* define for just normal Linux?
Comment 30•17 years ago
reed: if the compiler requirements are the same for these two blocks, then yes. I considered it, but since I'm not quite sure if I'm doing the right version check, or need it at all, I just have it as a place-holder for now.
Comment 31•17 years ago
Steve: btw, why are you storing the results in local variables and then writing them all to the destination? That seems to require more opcodes than writing it out as you go. Compare grouped write vs. write-as-you-go:
L299: |L299:
movl -112(%ebp), %eax | movl -108(%ebp), %ecx
movl (%eax), %eax | movl (%ecx), %eax
* movl %eax, -156(%ebp) | bswap %eax
bswap %eax | shrl $8, %eax
movl %eax, -156(%ebp) | orl $-16777216, %eax
movl -112(%ebp), %ecx | movl %eax, (%edi)
movl 3(%ecx), %edx | movl 3(%ecx), %eax
bswap %edx | bswap %eax
movl %ecx, %esi | shrl $8, %eax
movl 6(%ecx), %ecx | orl $-16777216, %eax
bswap %ecx | movl %eax, 4(%edi)
movl 9(%esi), %esi | movl 6(%ecx), %eax
bswap %esi | bswap %eax
shrl $8, -156(%ebp) | shrl $8, %eax
orl $-16777216, -156(%ebp) | orl $-16777216, %eax
movl -156(%ebp), %eax | movl %eax, 8(%edi)
movl %eax, (%edi) | movl 9(%ecx), %eax
shrl $8, %edx | bswap %eax
orl $-16777216, %edx | shrl $8, %eax
movl %edx, 4(%edi) | orl $-16777216, %eax
shrl $8, %ecx | movl %eax, 12(%edi)
orl $-16777216, %ecx | subl $4, %edx
movl %ecx, 8(%edi) | addl $12, %ecx
shrl $8, %esi | movl %ecx, -108(%ebp)
orl $-16777216, %esi | addl $16, %edi
movl %esi, 12(%edi) | cmpl $4, %edx
subl $4, -96(%ebp) | ja L299
addl $12, -112(%ebp) | jmp L300
addl $16, %edi |
cmpl $4, -96(%ebp) |
ja L299 |
jmp L300 |
(*) compiler bug?
Of course, seeing how the assembly in comment 2 looks like it's ordered rather nicely, I guess I should go find a more recent compiler to play with :-)
Comment 32•17 years ago
BTW, how come in the PNG decoder RGB and BGR get the same pixel-copying treatment?
Assignee
Comment 33•17 years ago
(In reply to comment #31)
> Steve: btw, why are you storing the results in local variables and then writing
> them all to the destination? That seems to require more opcodes than writing it
> out as you go. Compare grouped write vs. write-as-you-go:
[snip]
The local variables give the compiler a little wiggle room, although the
declared variables are never actually generated by the compiler. The purpose
is to avoid exactly the kind of back-to-back register stalls shown in your
snippets.
Example:
movl 3(%ecx), %eax
bswap %eax
shrl $8, %eax
orl $-16777216, %eax
Note how the source for 3 operations in a row is the destination of the prior
instruction. There will be a stall on each of the last 3 instructions because
the prior instruction has not yet completed. Remember that multiple
instructions are in various stages of execution. (Sadly, my Pentium4 has a
24-stage pipeline. The last P4 models released had 30 stages!)
Then look at the ASM code I posted above:
00503020 mov ecx,dword ptr [eax+3]
00503023 mov edx,dword ptr [eax+6]
00503026 mov edi,dword ptr [eax+9]
00503029 mov eax,dword ptr [eax]
0050302B bswap eax
0050302D bswap ecx
0050302F bswap edx
00503031 bswap edi
00503033 shr eax,8
00503036 shr ecx,8
00503039 shr edx,8
0050303C shr edi,8
0050303F or eax,0FF000000h
00503044 or ecx,0FF000000h
0050304A or edx,0FF000000h
00503050 or edi,0FF000000h
00503056 mov dword ptr [esi],eax
00503058 mov dword ptr [esi+4],ecx
0050305B mov dword ptr [esi+8],edx
0050305E mov dword ptr [esi+0Ch],edi
I have to say that this is damned good code generation on Microsoft's part.
There is only a single back-to-back use of a register (mov/bswap eax) in that
entire loop.
Presumably those local variables would actually be allocated and used if
compiler optimization was completely disabled. As it is, the compiler uses
registers to contain the intermediate values.
Re GCC: The major change in GCC v4.0 was a lot of back-end enhancements, which
at the time were said to enable future optimizations. They seem to be actually
implementing those optimizations now as the v4.1.2 I'm using (in Fedora7)
generates pretty good x86 code. FYI.
Comment 34•17 years ago
Ah, ok, that's what I guessed you were talking about. The code in the left column is slightly better in that regard, but still nowhere near as pretty as Microsoft's.
I'm curious what Microsoft does with Neil's generic macro (comment 16) and with either of mine (comment 23).
Also, why did you pick __GNUC__ >= 2?
Assignee
Comment 35•17 years ago
Re comment #34:
I confess that I punted on the version. I knew that bswap_xx() predated GCC v3.0, but not exactly in which version it was introduced. I guess I could have just used v3 as a minimum since
http://developer.mozilla.org/en/docs/Linux_Build_Prerequisites#Build_Tools
says that GCC v3.2+ is "recommended".
Assignee
Comment 36•17 years ago
Peter:
Pulled the trunk and built in Linux today. This is what was generated for the inner loop of nsJPEGDecoder::OutputScanlines:
.L50:
subl $4, 24(%esp)
movl (%edi), %eax
movl 3(%edi), %edx
movl 6(%edi), %ecx
movl 9(%edi), %esi
addl $12, %edi
bswap %eax
bswap %edx
bswap %ecx
bswap %esi
shrl $8, %eax
shrl $8, %edx
orl $-16777216, %eax ; 0xFF000000
shrl $8, %ecx
orl $-16777216, %edx ; 0xFF000000
shrl $8, %esi
orl $-16777216, %ecx ; 0xFF000000
orl $-16777216, %esi ; 0xFF000000
movl %eax, (%ebp)
movl %edx, 4(%ebp)
movl %ecx, 8(%ebp)
movl %esi, 12(%ebp)
addl $16, %ebp
cmpl $4, 24(%esp)
ja .L50
Again, this is with GCC v4.1.2, built with -Os optimization. Not too shabby.
(The "-Os" was Red Hat's choice. For some reason they always build Mozilla products at -Os instead of their usual -O2.)
FYI.
Updated•17 years ago
Attachment #294884 - Flags: review?(pavlov) → review+
Comment 37•17 years ago
Updated•17 years ago
Attachment #294884 - Attachment is obsolete: true
Comment 38•17 years ago
To sorta summarize this post-fixed flurry of comments and patches, this patch (like the previous patch) gets us to where we use |bswap| on Intel Mac with GCC, and adds a faster generic implementation of GFX_0XFF_PPIXEL_FROM_BPTR().
Updated•17 years ago
Attachment #295008 - Flags: review?(pavlov)
Attachment #295008 - Flags: approval1.9?
Updated•17 years ago
Comment 39•17 years ago
Comment on attachment 295008 [details] [diff] [review]
Shuffle things around a bit, no functional changes.
>+#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_OS2)
In order to stop having to append !defined() things to the end of this line, could it be changed to this:
#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && defined(XP_UNIX)
Would that work to deal with problems like what have been mentioned above, plus things like bug 410439?
Comment 40•17 years ago
Comment on attachment 295008 [details] [diff] [review]
Shuffle things around a bit, no functional changes.
Approvals after reviews, please. Also, I'd really prefer to have this be attached to a new bug, referencing this original, but whatever :)
Attachment #295008 - Flags: approval1.9?
Comment 41•17 years ago
Given the fact that these defines are conditioned on __i386__, and people have reported problems on PPC, is the use of bswapXX right, or should this be using ntohl(), which will take care of machine endianness issues? ntohl() is also much more likely to be optimized on all architectures, since it's critical to network performance. Plus it's already in <prinet.h>, so all of the platform junk could be skipped...
Comment 42•17 years ago
Forest, trees, and way too much fun. Thanks, Jeremy.
This should do the right thing everywhere.
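The reason `ntohl()` does the right thing everywhere: it is a no-op on big-endian hosts and a byte swap on little-endian ones, so the same expression produces an 0xFFRRGGBB pixel either way. A sketch of the idea (illustrative, not the patch verbatim):

```cpp
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>  // ntohl; NSPR's prinet.h plays this role in the tree

// Read R,G,B plus one stray byte, normalize to network (big-endian)
// order, drop the stray byte, set alpha. After ntohl() the value is
// R<<24 | G<<16 | B<<8 | X regardless of host endianness: on big-endian
// hosts ntohl() is a no-op, on little-endian hosts it byte-swaps.
static uint32_t ff_pixel_from_bptr(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
    return (ntohl(x) >> 8) | 0xFF000000u;
}
```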
Attachment #295008 - Attachment is obsolete: true
Attachment #295486 - Flags: review?
Attachment #295008 - Flags: review?(pavlov)
Updated•17 years ago
Attachment #295486 - Flags: review? → review?(pavlov)
Comment 43•17 years ago
Comment on attachment 295486 [details] [diff] [review]
Boring version (I like it!)
this is fine as long as ntohl is a macro everywhere and this doesn't turn into a function call. Have you verified that?
Comment 44•17 years ago
ntohl should exist on all platforms supported by NSPR. I would be very surprised if it reduces to a function call (instead of some inlined assembly) on any serious platform.
Is it good enough if we verify Windows, Mac, Linux and some BSD?
I have verified that on Intel Mac ntohl gets inlined to bswap, and on a pretty standard FC2 Linux box (note: _old_) the ntohl call gets inlined to bswap where available, and rorw 8 + rorl 16 + rorw 8 where not. I imagine more up-to-date Linux distributions will do just as well.
Steve, can you check on Windows?
Jeremy, can you take FreeBSD?
Comment 45•17 years ago
(In reply to comment #44)
> Jeremy, can you take FreeBSD?
On FreeBSD, yes. Unless optimization is turned off.
The mingw problems are because prinet.h doesn't include any headers, because windows.h includes winsock2.h, which defines ntohl, but the PNG encoder doesn't include windows.h... This is a bug in NSPR - it's breaking the contract that prinet.h makes - that ntohl will be defined if it is included.
Updated•17 years ago
Attachment #295486 - Flags: review?(pavlov) → review+
Updated•17 years ago
Attachment #295486 - Flags: approval1.9?
Updated•17 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 46•17 years ago
/me wonders what FreeBSD generates with optimizations turned off.
Is there a bug filed against NSPR on prinet.h not providing ntohl in a MinGW environment?
Comment 47•17 years ago
|
||
Comment on attachment 295486 [details] [diff] [review]
Boring version (I like it!)
a+ - thanx this is awesome
Attachment #295486 -
Flags: approval1.9? → approval1.9+
Comment 48•17 years ago
|
||
So before I can check this in we'll need some code for mingw in NSPR or update this patch to make an exception for that case for now, and I would still like some confirmation that we end up using bswap on Windows/VC++.
Comment 49•17 years ago
|
||
Ok, so since bug 411055 exists and this is already broken (bug 410439) I'm not going to wait for mingw (sorry) but I would still like confirmation that we generate bswap on Windows (very unlikely we don't, but ...)
Comment 50•17 years ago
|
||
With SSE2, a potential solution would be:
__declspec(align(16)) unsigned char ff[] = {255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
movd xmm0, dword ptr source RGB
pxor xmm1, xmm1 0 vector
movdqa xmm2, xmmword ptr ff FF, 0, 0, 0, 0, etc.
punpcklbw xmm0, xmm1 R0G0B0
pshuflw xmm0, xmm0, encoding 00B0G0R0
(encoding is a byte which specifies how words are to be shuffled; in this case, we'd reorder RGB to BGR and shift it one byte)
packuswb xmm0, xmm0 0BGR
por xmm0, xmm2 FBGR
movd dword ptr target, xmm0
To do one pixel at a time in a macro.
Doing twelve bytes at a time (four pixels) is hard with SSE2 as SSE2 doesn't handle objects well that aren't sized at powers of two. And SSE2 doesn't have a packed byte shuffle which is why the above code converts the bytes to words, does a word shuffle and then converts the words back to bytes. To process multiple three-byte pixels, mask, shift and merge sequences would have to be done and it would be tight on registers to try to do four at a time. It would be an interesting exercise and it would have to be done in a piece of streaming code for efficiency. The above code does a vector read and a vector zero every time which would normally be factored out of streaming code but there shouldn't be much of a cost as the instructions without dependencies can run in parallel given enough execution resources in the CPU.
I don't know if it is faster or not but you wouldn't put it in a macro as you'd have to test for SSE2 availability every time and then fork to the appropriate code. With streaming code, you'd only do the test once.
However, with SSSE3 (Core 2 Duo and on), there is a packed shuffle instruction which would allow you to move any byte to any other location in the vector in one instruction. Latency is 3 on Core 2 Duo and 1 on Penryn. So in one instruction, you can do:
RGBRGBRGBRGBRGBR -> BGR BGR BGR BGR
Do a vector or with F F F F and you get
FBGRFBGRFBGRFBGR with two instructions and a latency of four or five.
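For anyone without the manuals handy, PSHUFB's semantics are simple enough to model in portable C. Below is a scalar sketch (not SIMD; all names here are invented) of the single-shuffle RGB-to-BGR0 rearrangement described above, with the alpha OR folded in afterwards:

```c
#include <stdint.h>

/* Scalar model of SSSE3 PSHUFB: each output byte i is 0 if mask[i] has
 * its high bit set, otherwise src[mask[i] & 0x0F]. */
static void pshufb_model(const unsigned char src[16],
                         const unsigned char mask[16],
                         unsigned char out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Turn 12 RGB source bytes (4 pixels, plus 4 bytes of padding read but
 * ignored) into four 0xFFRRGGBB pixels, stored little-endian as B,G,R,FF. */
static void rgb12_to_ffbgr16(const unsigned char src[16],
                             unsigned char out[16])
{
    /* one shuffle: B,G,R per pixel, alpha slot zeroed by the 0x80 flag... */
    static const unsigned char shuf[16] = {
        2, 1, 0, 0x80,  5, 4, 3, 0x80,  8, 7, 6, 0x80,  11, 10, 9, 0x80
    };
    pshufb_model(src, shuf, out);
    /* ...then OR in the 0xFF alpha, as the comment describes */
    for (int i = 3; i < 16; i += 4)
        out[i] = 0xFF;
}
```

The real instruction does all 16 byte moves in one shot; the mask table above is exactly the control vector an SSSE3 implementation would feed to it.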
In SSE4.1, there's an instruction called PBLENDVB (Variable Blend Packed Bytes). Arguments are the destination register, a source register/memory, and a per-byte mask that selects, for each position, whether the byte is kept from the destination or copied from the source.
As you can imagine, a code loop to do 12 bytes at a time would be a very short piece of code.
The SSSE3 instruction PSHUFB has a huge amount of potential for improving byte-twiddling operations involving pixels, whether the size is three, four or more bytes. The downsides are:
- AMD doesn't support it in Barcelona. I expect them to support it in SSE5 but my guess is that they'll need two years for their next architecture.
- MSVC doesn't support SSSE3 and SSE4 instructions until VS2008. VS2008 is out and I have the free trial installed somewhere to play with SSSE3 but I haven't yet.
- I think that the current versions of GCC don't support SSSE3 and SSE4 yet for Mac OSX. I'm not sure where Linux is on these instruction sets.
On the other hand, ICC 10.1 supports SSSE3 and SSE4 as you'd expect. I have a free trial of ICC 10.1 and have built Firefox 2.0.0.11, but with any compiler optimizations enabled the build is unusable (left click doesn't work and gmail runs like a dog). I was planning to work on getting optimized builds using ICC before the free trial expires but other stuff came up.
I'm not sure what the inputs to this macro are and where it is used but it might be possible to optimize the input phase with SSE2. I have SSE2 code in progress that outputs three-byte pixels and the output can be in RGB or BGR order. The output of the color code is actually in four-byte pixels with one byte as a zero so I have to mask and merge to get the output to three-byte pixels. This takes 30 instructions. Outputting four-byte pixels would take only six. Of course an SSSE3 implementation would chop that down to about six as well. The SSE2 is still 25% faster than the scalar code because it replaces scalar table lookups with parallel (8 operations per instruction) arithmetic operations.
Comment 51•17 years ago
|
||
> However, with SSSE3 (Core 2 Duo and on), there is a packed shuffle instruction
> which would allow you to move any byte to any other location in the vector in
> one instruction. Latency is 3 on Core 2 Duo and 1 on Penryn. So in one
> instruction, you can do:
http://en.wikipedia.org/wiki/SSE3
Says a whole lot more processors have SSE3 - looks like anything that's dual core (including the Athlon 64 X2). So I'd hazard a bet that a large portion of our users have SSE3 on their machines.
> I'm not sure what the inputs to this macro are and where it is used but it
> might be possible to optimize the input phase with SSE2. I have SSE2 code in
> progress that outputs three-byte pixels and the output can be in RGB or BGR
> order. The output of the color code is actually in four-byte pixels with one
> byte as a zero so I have to mask and merge to get the output to three-byte
> pixels. This takes 30 instructions. Outputting four-byte instructions would
> take only six. Of course an SSSE3 implementation would chop that down to about
> six as well. The SSE2 is still 25% faster than the scalar code because it
> replaces scalar table lookups with parallel (8 operations per instruction)
> arithmetic operations.
>
That would be awesome (SSE2 version)
Comment 52•17 years ago
|
||
SSE3 is not the same as SSSE3. The closeness of the acronyms can be confusing.
Comment 53•17 years ago
|
||
Alright, so #include-ing winsock.h in gfxColor.h is painful since that gets included directly and indirectly in a lot of places in the tree, some of which seem rather unhappy to now have to deal with winsock.h. If I were a Windows hacker I'm sure I could figure out all the magic incantations to make it all work. Instead, I'm moving the macros out to a new header file which will only need to be included by the various decoders that need the macros. Shaves a bit off build time, and per TryServer this actually compiles.
Attachment #295486 -
Attachment is obsolete: true
Attachment #296902 -
Flags: review?(pavlov)
Comment 54•17 years ago
|
||
Comment on attachment 296902 [details] [diff] [review]
Boring version moved out
i'd prefer not to add new files. especially for mingw
Attachment #296902 -
Flags: review?(pavlov) → review-
Comment 55•17 years ago
|
||
Ah, but you see, I'm not adding new files for MinGW, I'm adding a new file for Windows so we won't end up #include-ing winsock.h in quite a few places.
Comment 56•17 years ago
|
||
Alternatively I could just stuff the #include in the decoders. A little uglier, but not too horrid.
Comment 57•17 years ago
|
||
(In reply to comment #50)
> With SSE2, a potential solution would be:
>
> __declspec(align(16)) unsigned char ff[] = {255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0};
>
> movd xmm0, dword ptr source RGB
> pxor xmm1, xmm1 0 vector
> movdqa xmm2, xmmword ptr ff FF, 0, 0, 0, 0, etc.
> punpcklbw xmm0, xmm1 R0G0B0
> pshuflw xmm0, xmm0, encoding (encoding is a byte which specifies how
> words are to
> be shuffled. In this case, we'd
> reorder RGB to BGR
> and shift it one byte)
> 00B0G0R0
> packuswb xmm0, xmm0 0BGR
> por xmm0, xmm2 FBGR
> movd dword ptr target, xmm1
>
Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona, as well as movd instruction. I.e. if the code needs to be the same for all platforms it's better to use bswap, which is known to work fast on all platforms and does not require SSE2.
Comment 58•17 years ago
|
||
Comment 59•17 years ago
|
||
> Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona,
> as well as movd instruction. I.e. if the code needs to be the same for
> all platforms it's better to use bswap, which is known to work fast on
> all platforms and does not require SSE2.
pshufd is expensive at four cycles. pshuflw is only two cycles. Throughput
of the instruction is one instruction per clock which is the same as BSWAP
on K8. The thing about using SIMD instructions is that almost all of them
don't affect the status flags so they can run in some degree, in parallel
if there are no instruction dependencies.
The above code wouldn't be competitive with an unrolled bswap solution. But
perhaps something doing 48 bytes at a time using SSE2 would be where you are
working on different parts of the string in parallel. I have some other ideas
on this but am pretty busy with other work right now.
Comment 60•17 years ago
|
||
Don't really need it in XBM and BMP. Also, move the winsock include to the top to make MinGW happy.
Attachment #296957 -
Attachment is obsolete: true
Comment 61•17 years ago
|
||
(In reply to comment #59)
> > Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona,
> > as well as movd instruction. I.e. if the code needs to be the same for
> > all platforms it's better to use bswap, which is known to work fast on
> > all platforms and does not require SSE2.
>
> pshufd is expensive at four cycles. pshuflw is only two cycles. Throughput
> of the instruction is one instruction per clock which is the same as BSWAP
> on K8.
The main problem is not clock cycles those insns take, nor their throughput. The
main problem is they are vector path, causing stalls in decoder and bubbles in
ROB. I.e. this in fact affects possibilities for out of order execution,
decreasing the parallelism possibilities.
Comment 62•17 years ago
|
||
> The main problem is not clock cycles those insns take, nor their
> throughput. The main problem is they are vector path, causing stalls
> in decoder and bubbles in ROB. I.e. this in fact affects possibilities
> for out of order execution, decreasing the parallelism possibilities.
pshufd is VectorPath. pshuflw and pshufhw are DirectPath Double.
I have a link to a local copy of the K8 SOG on my homepage and it only takes me half a minute to look this stuff up if I don't already have it memorized.
A VectorPath instruction isn't the end of the world when considering the selection of instructions to use. It should be avoided when there are other
alternatives but not so if those alternatives are considerably more expensive.
Assignee | ||
Comment 63•17 years ago
|
||
After applying attachment 296978 [details] [diff] [review] I get subroutines, not inline instructions. Ouch! With the value to be swapped passed on the stack. Ouch again!
[ Below is a verbatim display from the debugger, but those calls actually refer to ntohl(). ]
From nsJPEGDecoder::OutputScanlines(), built with VS2005/SP1 on a Win2K machine:
// bulk copy of pixels.
while (idx > 4) { // >4 to avoid last 3 bytes in buffer
00542302 cmp ecx,4
00542305 mov dword ptr [esp+10h],ecx
00542309 jbe nsJPEGDecoder::OutputScanlines+1DDh (5423ADh)
0054230F add ecx,0FFFFFFFBh
00542312 shr ecx,2
00542315 add ecx,1
00542318 mov dword ptr [esp+14h],ecx
0054231C lea esp,[esp]
PRUint32 p0, p1, p2, p3; // to avoid back-to-back register stalls
p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
00542320 mov eax,dword ptr [eax]
00542322 push eax
00542323 call imgRequest::NotifyProxyListener+254h (54D794h)
p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00542328 mov ecx,dword ptr [esp+0Ch]
0054232C mov edx,dword ptr [ecx+3]
0054232F mov edi,eax
00542331 shr edi,8
00542334 push edx
00542335 or edi,0FF000000h
0054233B call imgRequest::NotifyProxyListener+23Ch (54D77Ch)
00542340 mov ebx,eax
p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00542342 mov eax,dword ptr [esp+0Ch]
00542346 mov ecx,dword ptr [eax+6]
00542349 shr ebx,8
0054234C push ecx
0054234D or ebx,0FF000000h
00542353 call imgRequest::NotifyProxyListener+224h (54D764h)
p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
00542358 mov edx,dword ptr [esp+0Ch]
0054235C shr eax,8
0054235F or eax,0FF000000h
00542364 mov dword ptr [esp+20h],eax
00542368 mov eax,dword ptr [edx+9]
0054236B push eax
0054236C call imgRequest::NotifyProxyListener+20Bh (54D74Bh)
imageRow[0] = p0; imageRow[1] = p1;
imageRow[2] = p2; imageRow[3] = p3;
00542371 mov ecx,dword ptr [esp+20h]
idx -= 4;
00542375 sub dword ptr [esp+10h],4
0054237A shr eax,8
0054237D or eax,0FF000000h
00542382 mov dword ptr [esi+0Ch],eax
00542385 mov dword ptr [esi],edi
00542387 mov dword ptr [esi+4],ebx
0054238A mov dword ptr [esi+8],ecx
sampleRow += 12;
0054238D mov eax,dword ptr [esp+0Ch]
00542391 add eax,0Ch
imageRow += 4;
00542394 add esi,10h
00542397 sub dword ptr [esp+14h],1
0054239C mov dword ptr [esp+0Ch],eax
005423A0 jne nsJPEGDecoder::OutputScanlines+150h (542320h)
005423A6 mov edi,dword ptr [this]
005423A9 mov ecx,dword ptr [esp+10h]
}
Assignee | ||
Comment 64•17 years ago
|
||
[Continuation from above.]
Those calls to ntohl() execute this code:
75033304 mov ecx,dword ptr [esp+4]
75033308 mov eax,ecx
7503330A mov edx,ecx
7503330C shl eax,10h
7503330F and edx,0FF00h
75033315 or eax,edx
75033317 mov edx,ecx
75033319 shr edx,10h
7503331C and ecx,0FF0000h
75033322 or edx,ecx
75033324 shl eax,8
75033327 shr edx,8
7503332A or eax,edx
7503332C ret 4
So even after taking the pain of the stack handling we still don't get the bswap instruction, just generic C code. The only way this could be worse would be to individually arrange each of the 32 bits in the pixel.
Comment 65•17 years ago
|
||
Is this an optimized build? Or does MS' ntohl() really suck that hard?
Assignee | ||
Comment 66•17 years ago
|
||
(In reply to comment #65)
> Is this an optimized build? Or does MS' ntohl() really suck that hard?
Yes, it is an optimized build.
As for ntohl() sucking... I don't know. It doesn't make sense to me that it would be this bad, but I can't dispute the code that I'm seeing.
Comment 67•17 years ago
|
||
(In reply to comment #57)
> Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona, as
> well as movd instruction.
I checked the movd instruction and only the register versions are VectorPath
and slow. The memory versions are DirectPath Double and latencies of 2 and 4 depending on the direction.
That said, I took another stab at an SSE2 solution. The clock starts are listed
on the right. This looks to be about 1.5 clocks per byte on the K8 architecture.
I think that doing 24 bytes at a time would get it down to 1 clock per byte. The
ideal number is 48 bytes as there is no read waste (48 bytes=3 vector reads and
16 pixels). The scalar instructions for updating source and target pointers, and doing the compare for loop end can be intermingled with the vector instructions and run in parallel.
__declspec(align(16)) const unsigned char ff000000[16] =
{255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0};
/*
movdqa xmm1, xmmword ptr source ! 1
pxor xmm0, xmm0 ! 1
movdqa xmm2, xmmword ptr ff000000 ! 2
movdqa xmm3, xmm1 ! 3
pshuflw xmm1, [0,1,1,2] ! 3
psrldq xmm3, 6 ! 5
punpcklbw xmm1, xmm0 ! 5
pshuflw xmm3, [0,1,1,2] ! 7
pshuflw xmm1, [0,2,1,0] ! 7
pshufhw xmm1, [0,3,2,1] ! 8
punpcklbw xmm3, xmm0 ! 9
pshuflw xmm3, [0,2,1,0] ! 11
pshufhw xmm3, [0,3,2,1] ! 11
packuswb xmm1, xmm3 ! 13
por xmm1, xmm2 ! 15
movlps qword ptr target, xmm1 ! 17
movhps qword ptr target+8, xmm1 ! 17
*/
Comment 68•17 years ago
|
||
This is from jdcolor.c for the most common type of jpeg. You can change the
byte order output by swapping the RGB_RED and RGB_BLUE tags. The same change
would have to be made for ycck_cmyk_convert. A small but different change would
have to be made for null_convert. gray_rgb_convert doesn't need to be changed.
The only routine that I'm not sure of is grayscale_convert, but nobody has ever complained about it in the three years it has been in my builds. I don't know how developers feel about making this kind of change in JPEG as it changes the functionality.
Some color output code also exists in jdmerge.c but I don't think that Mozilla uses this code.
Even more efficient would be to create four-byte pixel buffers which jdcolor could write into with ffBGR pixels. This would be cheap to do on the JPEG side;
much more expensive in nsJPEGDecoder.cpp.
ycc_rgb_convert (j_decompress_ptr cinfo,
                 JSAMPIMAGE input_buf, JDIMENSION input_row,
                 JSAMPARRAY output_buf, int num_rows)
{
  my_cconvert_ptr cconvert = (my_cconvert_ptr) cinfo->cconvert;
  register int y, cb, cr;
  register JSAMPROW outptr;
  register JSAMPROW inptr0, inptr1, inptr2;
  register JDIMENSION col;
  JDIMENSION num_cols = cinfo->output_width;
  /* copy these pointers into registers if possible */
  register JSAMPLE * range_limit = cinfo->sample_range_limit;
  register int * Crrtab = cconvert->Cr_r_tab;
  register int * Cbbtab = cconvert->Cb_b_tab;
  register INT32 * Crgtab = cconvert->Cr_g_tab;
  register INT32 * Cbgtab = cconvert->Cb_g_tab;
  SHIFT_TEMPS

  while (--num_rows >= 0) {
    inptr0 = input_buf[0][input_row];
    inptr1 = input_buf[1][input_row];
    inptr2 = input_buf[2][input_row];
    input_row++;
    outptr = *output_buf++;
    for (col = 0; col < num_cols; col++) {
      y  = GETJSAMPLE(inptr0[col]);
      cb = GETJSAMPLE(inptr1[col]);
      cr = GETJSAMPLE(inptr2[col]);
      /* Range-limiting is essential due to noise introduced by DCT losses. */
      outptr[RGB_RED]   = range_limit[y + Crrtab[cr]];
      outptr[RGB_GREEN] = range_limit[y +
                          ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
                                             SCALEBITS))];
      outptr[RGB_BLUE]  = range_limit[y + Cbbtab[cb]];
      outptr += RGB_PIXELSIZE;
    }
  }
}
Comment 69•17 years ago
|
||
(In reply to comment #67)
> (In reply to comment #57)
> That said, I took another stab at an SSE2 solution.
> movlps qword ptr target, xmm1 ! 17
> movhps qword ptr target+8, xmm1 ! 17
> */
That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture because of partial xmm register write stall. At the same time use of movdqa (or even movdqu) is slower on K8. However, not as slow as loading pair on Intel.
Comment 70•17 years ago
|
||
(In reply to comment #69)
> That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture
> because of partial xmm register write stall. At the same time use of movdqa (or
> even movdqu) is slower on K8. However, not as slow as loading pair on Intel.
I've only spent some time in the Core 2 Duo SOG so I'm not familiar with many nuances in SIMD optimization but, in practice, I've found that optimizing for the K8 architecture usually provides for good or better results in Core 2 Duo processors. Or it's okay to give back a percent or two if you're improving performance by ten times that using SIMD.
I'm not familiar with the partial xmm register write stall. Could you point me to where it is in the Intel Core 2 Duo documentation?
Also, given your expertise in this area, do you have any interest in adding SSE2 optimization in the new PNG code? The old mmx code was removed due to maintenance headaches so the current code is now all scalar. The MMX code provided about a 25% performance improvement on my PNG benchmark suite. SSE2 should be able to do a little better.
Comment 71•17 years ago
|
||
(In reply to comment #70)
> (In reply to comment #69)
> > That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture
> > because of partial xmm register write stall. At the same time use of movdqa (or
> > even movdqu) is slower on K8. However, not as slow as loading pair on Intel.
>
> I've only spent some time in the Core 2 Duo SOG so I'm not familiar with many
> nuances in SIMD optimization but, in practice, I've found that optimizing for
> the K8 architecture usually provides for good or better results in Core 2 Duo
> processors. Or it's okay to give back a percent or two if you're improving
> performance by ten times that using SIMD.
>
> I'm not familiar with the partial xmm register write stall. Could you point me
> to where it is in the Intel Core 2 Duo documentation?
You can find it here:
http://www.intel.com/design/processor/manuals/248966.pdf
Chapter 3.5.2.4 Partial XMM Register Stalls
That's the first document Google gave me.
In practice I had many FP intensive applications suffering up to 50% because of movlp*/movhp*.
> Also, given your expertise in this area, do you have any interest in adding
> SSE2 optimization in the new PNG code? The old mmx code was removed due to
> maintenance headaches so the current code is now all scalar. The MMX code
> provided about a 25% performance improvement on my PNG benchmark suite. SSE2
> should be able to do a little better.
I would do that if only (or when) have enough time.
Comment 72•17 years ago
|
||
(In reply to comment #71)
> You can find it here:
>
> http://www.intel.com/design/processor/manuals/248966.pdf
>
> Chapter 3.5.2.4 Partial XMM Register Stalls
>
> That's the first document Google gave me.
>
> In practice I had many FP intensive applications suffering up to 50% because of
> movlp*/movhp*.
The documentation there only describes a problem with XMM WRITE stalls or SIMD loads. The code above does SIMD stores and I see no problems described in the Intel documentation with SIMD partial register stores.
Did you see application performance problems using movlps/hps xmm, qword ptr source or movlps/hps qword ptr target, xmm? If the latter, then perhaps Intel needs to update their SOG. If only the former, then my code should be fine, right?
Comment 73•17 years ago
|
||
Right ;) Sorry for the confusion. That is the big issue for loads; must be the AT&T vs. Intel syntax mess in my head. I've just looked up the final decision we made some time ago: the optimal approach we found for blended code is single loads and split stores. So yes, the code should be good for an environment not known beforehand.
Comment 74•17 years ago
|
||
Since ntohl() on Windows seems to suck, I'm switching back to _byteswap_ulong() for Windows, with a fallback to the fast generic macro for older VC++ and MinGW. The rest still uses ntohl(). Oh, and BIG_ENDIAN platforms just do a shift+or.
TryServer said this compiles on Linux, Mac, and Windows. Martijn let me know it compiles on MinGW (yay!), and per prinet.h ntohl() is available on OS/2 and Unix platforms.
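For reference, the hybrid dispatch described here boils down to something like the following. This is a sketch, not the patch itself: the macro name is invented, and __builtin_bswap32 is shown as the GCC-side spelling even though the actual patch keeps ntohl on non-Windows platforms:

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <stdlib.h>
#define PIXEL_BSWAP32(x) _byteswap_ulong(x)   /* compiles to a bswap insn */
#elif defined(__GNUC__)
#define PIXEL_BSWAP32(x) __builtin_bswap32(x) /* likewise */
#else
/* Fast generic fallback, equivalent to the patch's portable macro. */
#define PIXEL_BSWAP32(x)                 \
    ((((x) & 0x000000FFu) << 24) |       \
     (((x) & 0x0000FF00u) <<  8) |       \
     (((x) & 0x00FF0000u) >>  8) |       \
     (((x) & 0xFF000000u) >> 24))
#endif
```

On big-endian targets the swap is unnecessary in the first place, which is why the patch assembles the pixel there with plain shifts and ORs: the bytes are already in R,G,B order after the 32-bit read.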
Attachment #296902 -
Attachment is obsolete: true
Attachment #296978 -
Attachment is obsolete: true
Attachment #297255 -
Flags: review?
Updated•17 years ago
|
Attachment #297255 -
Flags: review? → review?(pavlov)
Updated•17 years ago
|
Attachment #297255 -
Flags: review?(pavlov) → review+
Updated•17 years ago
|
Attachment #297255 -
Flags: approval1.9?
Comment 75•17 years ago
|
||
Moving to blocking list so you are good to land.
Flags: blocking1.9+
Priority: -- → P1
Updated•17 years ago
|
Attachment #297255 -
Flags: approval1.9?
Comment 76•17 years ago
|
||
Comment on attachment 297255 [details] [diff] [review]
Hybrid approach
I'll land this when I get home tonight.
Attachment #297255 -
Flags: approval1.9?
Comment 77•17 years ago
|
||
Checking in gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v <-- gfxColor.h
new revision: 1.18; previous revision: 1.17
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED