Bug 406580 - Faster copying of RGB pixel data
Closed (opened 17 years ago; closed 17 years ago)
Categories: Core :: Graphics, defect, P1
Tracking: RESOLVED FIXED, target mozilla1.9beta3
People
(Reporter: swsnyder, Assigned: swsnyder)
References
Details
(Keywords: perf)
Attachments
(3 files, 6 obsolete files)
(deleted), patch | pavlov: review+, approval1.9+ | Details | Diff | Splinter Review
(deleted), patch | pavlov: review+, approval1.9+ | Details | Diff | Splinter Review
(deleted), patch | pavlov: review+ | Details | Diff | Splinter Review
A significant percentage of the time spent in graphics processing is copying the pixel data from the buffer where a given image is decoded to the display buffer for output. Most of this slow processing is due to having to construct each pixel from 4 individual values (A,R,G,B).
In the case where the alpha value is 0xFF, no manipulation of the R,G,B values is needed, yet the pixel components are still handled as 4 separate values: the R, G, and B values are read from contiguous byte-sized memory locations and then reassembled, with 0xFF, into a single 32-bit value.
Processing would be faster if those pixel components were kept contiguous, not read individually from the source buffer and re-assembled.
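In C terms, the difference being described looks roughly like this (an illustrative sketch, not the actual tree code; `pack_contiguous` assumes a little-endian host and that reading one byte past the last RGB triple is safe, as the patches discussed later also do):

```cpp
#include <cstdint>
#include <cstring>

// Per-byte packing, roughly what GFX_PACKED_PIXEL(0xFF, r, g, b) does:
// three separate byte loads, then shifts and ORs.
static uint32_t pack_per_byte(const uint8_t* p) {
    return 0xFF000000u | (uint32_t(p[0]) << 16) |
           (uint32_t(p[1]) << 8) | uint32_t(p[2]);
}

// Contiguous packing: one 32-bit load (R,G,B plus one stray byte),
// then register-only shuffling. memcpy sidesteps alignment/aliasing.
static uint32_t pack_contiguous(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);  // little-endian: x = X<<24 | B<<16 | G<<8 | R
    return 0xFF000000u | ((x & 0xFFu) << 16) | (x & 0xFF00u) | ((x >> 16) & 0xFFu);
}
```

Both produce the same 0xFFRRGGBB pixel; the second does it with a single data read per pixel.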
Assignee
Comment 1•17 years ago
Reduces the time it takes to copy pixels on x86 platforms by 20%-25%.
The performance improvement comes from reducing the number of memory accesses needed per pixel. The RGB bytes are fetched in a single 32-bit read from the source buffer, reordered within registers, and written to the destination buffer as a single 32-bit value. On x86 platforms, hardware byte reordering (the bswap instruction) is used, as provided by Microsoft's and GNU's compilers.
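A minimal sketch of that fast path, with `__builtin_bswap32` standing in for the compiler-specific swap wrappers the patch selects (the real macro names and guards differ):

```cpp
#include <cstdint>
#include <cstring>

// One 32-bit read covering R,G,B and the first byte of the next pixel,
// then a hardware byte swap. On a little-endian host the load yields
// X<<24|B<<16|G<<8|R; the swap gives R<<24|G<<16|B<<8|X; >>8 drops the
// stray byte X; the final OR sets alpha to 0xFF.
static uint32_t ff_pixel_from_bptr(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
#if defined(__GNUC__) || defined(__clang__)
    x = __builtin_bswap32(x);  // compiles to a single bswap on x86
#else
    x = ((x & 0x000000FFu) << 24) | ((x & 0x0000FF00u) << 8) |
        ((x & 0x00FF0000u) >> 8)  | ((x & 0xFF000000u) >> 24);
#endif
    return (x >> 8) | 0xFF000000u;
}
```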
Assignee
Comment 2•17 years ago
The short version:
Original code:
4.00 reads from memory per pixel
2.00 writes to memory per pixel
Modified, using generic C byte-reordering:
2.25 reads from memory per pixel
1.75 writes to memory per pixel
Modified, using x86 bswap byte-reordering:
1.25 reads from memory per pixel
1.50 writes to memory per pixel
Details:
// original code from trunk: form pixel from individual bytes
// instruction bytes in loop: 53
// reads from memory per loop: 3 (data) + 1 (var)
//-> reads from memory per pixel: 4.00 = (4 / 1)
// writes to memory per loop: 1 (pixels) + 1 (vars)
//-> writes to memory per pixel: 2.00 = (2 / 1)
// times to copy all scanlines, rdtsc():
// cosmos.jpg: 367277, 376664, 373971; avg: 372637
// sk7.jpg: 104953, 106780, 106857; avg: 106193
// -------------------------------------------------
for (PRUint32 i=mInfo.output_width; i>0; --i) {
0050300B mov ecx,dword ptr [esi+74h]
0050300E test ecx,ecx
00503010 jbe nsJPEGDecoder::OutputScanlines+152h (503042h)
*imageRow++ = GFX_PACKED_PIXEL(0xFF, sampleRow[0], sampleRow[1], sampleRow[2]);
00503012 movzx edx,byte ptr [eax]
00503015 movzx ebp,byte ptr [eax+1]
00503019 movzx eax,byte ptr [eax+2]
0050301D or edx,0FFFFFF00h
00503023 shl edx,8
00503026 or edx,ebp
00503028 shl edx,8
0050302B or edx,eax
0050302D mov dword ptr [edi],edx
sampleRow += 3;
0050302F mov eax,dword ptr [esp+0Ch]
00503033 add eax,3
00503036 add edi,4
00503039 sub ecx,1
0050303C mov dword ptr [esp+0Ch],eax
00503040 jne nsJPEGDecoder::OutputScanlines+122h (503012h)
// mod: read contiguous bytes, generic shift/or byte re-ordering
// instruction bytes in loop: 172
// reads from memory per loop: 8 (data) + 1 (var)
//-> reads from memory per pixel: 2.25 = (9 / 4)
// writes to memory per loop: 4 (pixels) + 3 (vars)
//-> writes to memory per pixel: 1.75 = (7 / 4)
// times to copy all scanlines, rdtsc():
// cosmos.jpg: not tested
// sk7.jpg: not tested
// -------------------------------------------------
PRUint32 idx = mInfo.output_width;
0050305B mov edx,dword ptr [esi+74h]
// bulk copy of pixels
while (idx >= 4) {
0050305E cmp edx,4
00503061 mov dword ptr [esp+14h],edx
00503065 jb nsJPEGDecoder::OutputScanlines+1D8h (503118h)
0050306B shr edx,2
0050306E mov dword ptr [esp+18h],edx
PRUint32 p0, p1, p2, p3; // to avoid back-to-back stalls
p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00503072 mov edx,dword ptr [eax+3]
00503075 movzx esi,byte ptr [eax+5]
p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00503079 movzx ebx,byte ptr [eax+8]
p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
0050307D movzx ebp,byte ptr [eax+0Bh]
00503081 mov ecx,edx
00503083 and edx,0FF00h
00503089 or ecx,0FFFFFF00h
0050308F shl ecx,10h
00503092 or ecx,esi
00503094 mov esi,dword ptr [eax+6]
00503097 or ecx,edx
00503099 mov edx,esi
0050309B or edx,0FFFFFF00h
005030A1 shl edx,10h
005030A4 or edx,ebx
005030A6 mov ebx,dword ptr [eax+9]
005030A9 and esi,0FF00h
005030AF or edx,esi
005030B1 mov esi,ebx
005030B3 or esi,0FFFFFF00h
005030B9 shl esi,10h
005030BC or esi,ebp
005030BE and ebx,0FF00h
005030C4 or esi,ebx
005030C6 mov ebx,dword ptr [eax]
imageRow[0] = p0; imageRow[1] = p1;
005030C8 movzx eax,byte ptr [eax+2]
005030CC mov ebp,ebx
005030CE or ebp,0FFFFFF00h
005030D4 shl ebp,10h
005030D7 or ebp,eax
005030D9 and ebx,0FF00h
imageRow[2] = p2; imageRow[3] = p3;
005030DF mov dword ptr [edi+8],edx
idx -= 4;
005030E2 mov edx,dword ptr [esp+14h]
005030E6 or ebp,ebx
005030E8 mov dword ptr [edi],ebp
005030EA mov dword ptr [edi+4],ecx
005030ED mov dword ptr [edi+0Ch],esi
sampleRow += 12;
005030F0 mov eax,dword ptr [esp+10h]
005030F4 sub edx,4
005030F7 add eax,0Ch
imageRow += 4;
005030FA add edi,10h
005030FD sub dword ptr [esp+18h],1
00503102 mov dword ptr [esp+14h],edx
00503106 mov dword ptr [esp+10h],eax
0050310A jne nsJPEGDecoder::OutputScanlines+132h (503072h)
// mod: read contiguous bytes, use x86 bswap byte re-ordering
// instruction bytes in loop: 105
// reads from memory per loop: 4 (data) + 1 (var)
//-> reads from memory per pixel: 1.25 = (5 / 4)
// writes to memory per loop: 4 (pixels) + 2 (vars)
//-> writes to memory per pixel: 1.50 = (6 / 4)
// times to copy all scanlines, rdtsc():
// cosmos.jpg: 294457, 299895, 301546; avg: 298632
// sk7.jpg: 81303, 81471, 80976; avg: 81250
// -------------------------------------------------
PRUint32 idx = mInfo.output_width;
0050300B mov ebp,dword ptr [ebx+74h]
// bulk copy of pixels
while (idx >= 4) {
0050300E cmp ebp,4
00503011 jb nsJPEGDecoder::OutputScanlines+18Dh (50307Dh)
00503013 mov ecx,ebp
00503015 shr ecx,2
00503018 mov dword ptr [esp+14h],ecx
0050301C lea esp,[esp]
PRUint32 p0, p1, p2, p3; // to avoid back-to-back stalls
p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00503020 mov ecx,dword ptr [eax+3]
p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00503023 mov edx,dword ptr [eax+6]
p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
00503026 mov edi,dword ptr [eax+9]
00503029 mov eax,dword ptr [eax]
0050302B bswap eax
0050302D bswap ecx
0050302F bswap edx
00503031 bswap edi
00503033 shr eax,8
00503036 shr ecx,8
00503039 shr edx,8
0050303C shr edi,8
0050303F or eax,0FF000000h
00503044 or ecx,0FF000000h
0050304A or edx,0FF000000h
00503050 or edi,0FF000000h
imageRow[0] = p0; imageRow[1] = p1;
00503056 mov dword ptr [esi],eax
00503058 mov dword ptr [esi+4],ecx
imageRow[2] = p2; imageRow[3] = p3;
0050305B mov dword ptr [esi+8],edx
0050305E mov dword ptr [esi+0Ch],edi
idx -= 4;
sampleRow += 12;
00503061 mov eax,dword ptr [esp+10h]
00503065 add eax,0Ch
00503068 sub ebp,4
imageRow += 4;
0050306B add esi,10h
0050306E sub dword ptr [esp+14h],1
00503073 mov dword ptr [esp+10h],eax
00503077 jne nsJPEGDecoder::OutputScanlines+130h (503020h)
Updated•17 years ago
Assignee: swsnyder → nobody
Component: General → GFX: Thebes
Product: Mozilla Application Suite → Core
QA Contact: general → thebes
Updated•17 years ago
Attachment #291204 - Flags: review?(pavlov)
Updated•17 years ago
Attachment #291204 - Flags: review?(pavlov)
Attachment #291204 - Flags: review+
Attachment #291204 - Flags: approval1.9+
Updated•17 years ago
Keywords: checkin-needed
Comment 3•17 years ago
The patch needs to get an extra
&& !defined(XP_OS2)
at the end of the GCC version test line for the checkin. (For some reason the byte swapping stuff was never implemented in the OS/2 GCC...)
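The kind of guarded macro block being discussed looks roughly like this (an approximation of the patch's shape, not a verbatim quote):

```cpp
#include <cstdint>

#if defined(_MSC_VER)
# include <stdlib.h>
# define GFX_BYTESWAP16(x) _byteswap_ushort(x)
# define GFX_BYTESWAP32(x) _byteswap_ulong(x)
#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_OS2)
# include <byteswap.h>  // glibc-specific header, hence the platform exclusions
# define GFX_BYTESWAP16(x) bswap_16(x)
# define GFX_BYTESWAP32(x) bswap_32(x)
#else
// Generic fallback for toolchains without a usable byte-swap wrapper.
# define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
# define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16((x) >> 16) )
#endif
```

All three arms produce the same result; the guards only pick the fastest implementation available.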
Assignee
Comment 4•17 years ago
(In reply to comment #3)
> The patch needs to get an extra
> && !defined(XP_OS2)
> at the end of the GCC version test line for the checkin. (For some reason the
> byte swapping stuff was never implemented in the OS/2 GCC...)
In attempting to confirm this I find that the link to the OS/2 build of GCC is dead. At http://developer.mozilla.org/en/docs/OS/2_Build_Prerequisites#Compiler is a link to an FTP site (ftp.netlabs.org) that no longer has the GCC build.
FYI.
Updated•17 years ago
Assignee: nobody → swsnyder
Comment 5•17 years ago
I just updated that page a few days ago and thought that I had verified that link. Obviously not, sorry.
ftp://ftp.netlabs.org/pub/gcc/GCC-3.3.5-csd3.zip is the package that we use on OS/2. It does define some bswap stuff (in endian.h) but those don't seem to be backed up by symbols in the library...
Comment 6•17 years ago
I added the extra !defined(XP_OS2), as per Peter.
Checking in gfx/thebes/public/gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v <-- gfxColor.h
new revision: 1.14; previous revision: 1.13
done
Checking in modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/gif/nsGIFDecoder2.cpp,v <-- nsGIFDecoder2.cpp
new revision: 1.90; previous revision: 1.89
done
Checking in modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/jpeg/nsJPEGDecoder.cpp,v <-- nsJPEGDecoder.cpp
new revision: 1.82; previous revision: 1.81
done
Checking in modules/libpr0n/decoders/png/nsPNGDecoder.cpp;
/cvsroot/mozilla/modules/libpr0n/decoders/png/nsPNGDecoder.cpp,v <-- nsPNGDecoder.cpp
new revision: 1.77; previous revision: 1.76
done
Status: NEW → RESOLVED
Closed: 17 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9 M11
Comment 7•17 years ago
The GCC check also doesn't compile on Mac, so I had to add a !defined(XP_MACOSX) to that check to fix bustage.
Comment 8•17 years ago
This fixes the build on FreeBSD... It will also be broken on any other GCC platform, since byteswap.h is a Linux only header.
-Jeremy
Attachment #294278 - Flags: review?(pavlov)
Updated•17 years ago
Attachment #294278 - Flags: review?(pavlov) → review+
Updated•17 years ago
Attachment #294278 - Flags: approval1.9?
Updated•17 years ago
Attachment #294278 - Flags: approval1.9? → approval1.9+
Updated•17 years ago
Keywords: checkin-needed
Comment 9•17 years ago
Checking in gfx/thebes/public/gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v <-- gfxColor.h
new revision: 1.16; previous revision: 1.15
done
Keywords: checkin-needed
Comment 10•17 years ago
I'm fairly certain that this change has broken GIF rendering on PPC Macs. See bug 409551.
Comment 11•17 years ago
Confirmed that reverting to nsGIFDecoder2.cpp, v.1.89 fixes the problem. Suggest backing out patches related to this until successfully tested on both little and big endian machines.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 12•17 years ago
See Bug 409381 (patch 294371) for fix.
This fixes the GIF, JPG, and PNG color distortion seen on PowerPC platforms.
Comment 13•17 years ago
It's being dealt with in bug 409381. Re-resolving.
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED
Comment 14•17 years ago
(In reply to comment #8)
> This fixes the build on FreeBSD... It will also be broken on any other GCC
> platform, since byteswap.h is a Linux only header.
So, this would be why my Win XP build can't find byteswap.h? Or is that a cygwin/MinGW thing?
Comment 15•17 years ago
Comment on attachment 291204 [details] [diff] [review]
Where Alpha=0xFF, keep RGB values contiguous in pixel copying
>+#else
>+# define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
>+# define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16(x >> 16) )
>+#endif
Won't this actually be slower than GFX_PACKED_PIXEL on those platforms that don't give you access to the native byteswap opcodes?
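Speed aside, the two formulations can at least be checked for agreement; a small sketch (the reference packer is written here for the comparison, and a little-endian 32-bit load is assumed):

```cpp
#include <cstdint>
#include <cstring>

// The generic double-swap macros quoted above.
#define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
#define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16((x) >> 16) )

// Generic path: swap the whole dword, drop the stray fourth byte, set alpha.
static uint32_t generic_path(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
    return (GFX_BYTESWAP32(x) >> 8) | 0xFF000000u;
}

// Reference: GFX_PACKED_PIXEL-style per-byte pack.
static uint32_t packed_pixel(const uint8_t* p) {
    return 0xFF000000u | (uint32_t(p[0]) << 16) | (uint32_t(p[1]) << 8) | p[2];
}
```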
Comment 16•17 years ago
(In reply to comment #15)
>(From update of attachment 291204 [details] [diff] [review])
>>+#else
>>+# define GFX_BYTESWAP16(x) ( (((x) & 0xff) << 8) | (((x) >> 8) & 0xff) )
>>+# define GFX_BYTESWAP32(x) ( (GFX_BYTESWAP16((x) & 0xffff) << 16) | GFX_BYTESWAP16(x >> 16) )
>>+#endif
>Won't this actually be slower than GFX_PACKED_PIXEL on those platforms that
>don't give you access to the native byteswap opcodes?
OK, so maybe it isn't, but how about
#define GFX_BYTESWAP24FF(x)
((0xff << 24) | ((x) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff))
#define GFX_0XFF_PPIXEL_FROM_BPTR(pbptr) \
(GFX_BYTESWAP24FF(*((PRUint32 *)(pbptr))))
Assignee
Comment 17•17 years ago
(In reply to comment #16)
> OK, so maybe it isn't, but how about
> #define GFX_BYTESWAP24FF(x)
> ((0xff << 24) | ((x) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff))
> #define GFX_0XFF_PPIXEL_FROM_BPTR(pbptr) \
> (GFX_BYTESWAP24FF(*((PRUint32 *)(pbptr))))
That does look faster than the generic byteswap, as it performs fewer operations (3 vs. 5 shifts) on the 32-bit value read from memory.
The important thing is that the contiguous RGB values are fetched in a single read. Reducing the operations on the value once it's in a register is icing on the cake.
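A quick sanity check that the suggested macro really produces 0xFFRRGGBB (the loader below is an illustration; a little-endian host is assumed):

```cpp
#include <cstdint>
#include <cstring>

// The suggested macro: x is a little-endian dword load of R,G,B plus one
// stray byte, i.e. x = X<<24 | B<<16 | G<<8 | R. The 0xFF alpha absorbs
// the G bits that (x) << 16 pushes into the top byte.
#define GFX_BYTESWAP24FF(x) \
    ( (0xFFu << 24) | ((x) << 16) | ((x) & 0xff00u) | (((x) >> 16) & 0xffu) )

static uint32_t ff_pixel(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
    return GFX_BYTESWAP24FF(x);
}
```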
Assignee
Comment 18•17 years ago
FWIW, here are numbers from a Pentium 3/850MHz (Win2k/SP4) machine, comparing the GFX_BYTESWAP24FF(x) suggested above against the generic byteswap (not bswap):
// 24FF jenniferconnelly.jpg: 0 = initial load; 1,2 = reload
(0) Total time to copy all scanlines: 133126
(1) Total time to copy all scanlines: 131543
(2) Total time to copy all scanlines: 132112 <-- avg: 132,260
// 32FF jenniferconnelly.jpg: 0 = initial load; 1,2 = reload
(0) Total time to copy all scanlines: 136260
(1) Total time to copy all scanlines: 137576
(2) Total time to copy all scanlines: 133686 <-- avg: 135,840
So on a P3, GFX_BYTESWAP24FF(x) is 2.7% faster on an x86 platform where the compiler doesn't support the bswap instruction.
Comment 19•17 years ago
I count only two effective shifts. The first shift is on a constant and the compiler should be able to turn that into 0xff000000, leaving just the shifts of R (<< 16) and B (>> 16).
So you go from 5 shifts, 5 &s, and 2 |s to 2 shifts, 2 &s, and 3 |s. Not bad.
However, looking at the assembly for one XBGR -> FRGB operation of the generic double swapping code:
00503072 mov edx,dword ptr [eax+3]
00503081 mov ecx,edx
00503083 and edx,0FF00h
00503089 or ecx,0FFFFFF00h
0050308F shl ecx,10h
00503075 movzx esi,byte ptr [eax+5]
00503092 or ecx,esi
00503097 or ecx,edx
which is effectively 2 shifts, 2 &s and 3 |s:
(((x) | 0xffffff00) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff)
(it looks like the compiler replaced the last >> plus & with a load-byte op)
What's interesting is that this looks a lot like Neil's suggestion, except that the first two terms (0xff << 24 | (x) << 16) are combined: (((x) | 0xffffff00) << 16). Which makes me wonder, what does Neil's suggestion compile to to make it 2.7% faster?
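For reference, the compiler's combined form can be verified against a plain per-byte pack (illustrative sketch):

```cpp
#include <cstdint>

// The compiler-combined form seen in the disassembly:
// (((x) | 0xffffff00) << 16) | ((x) & 0xff00) | (((x) >> 16) & 0xff)
static uint32_t combined(uint32_t x) {
    return ((x | 0xFFFFFF00u) << 16) | (x & 0xFF00u) | ((x >> 16) & 0xFFu);
}

// Plain per-byte pack of the same dword (x = X<<24 | B<<16 | G<<8 | R).
static uint32_t reference_pack(uint32_t x) {
    uint32_t r = x & 0xFFu, g = (x >> 8) & 0xFFu, b = (x >> 16) & 0xFFu;
    return 0xFF000000u | (r << 16) | (g << 8) | b;
}
```

The OR with 0xffffff00 before the shift is what lets the 0xFF alpha and the R byte land in one operation.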
Comment 20•17 years ago
Something like this would avoid the extra memory access with two more instructions:
mov edx,dword ptr [eax+3]
mov ecx,edx ; these 3 instructions replace movzx esi,byte ptr [eax+5]
and ecx,0FF00h ;
mov esi,ecx ;
mov ecx,edx
shr edx,10h
and edx,0FFh
or ecx,0FFFFFF00h
shl ecx,10h
or ecx,esi
or ecx,edx
Not nearly as nice as
mov ecx,dword ptr [eax+3]
bswap ecx
shr ecx,8
or ecx,0FF000000h
Comment 21•17 years ago
Hrm, and looking around on the net, it looks like for Mac OS X you can use:
#include <libkern/OSByteOrder.h>
#define GFX_BYTESWAP16(x) OSSwapInt16(x)
#define GFX_BYTESWAP32(x) OSSwapInt32(x)
Comment 22•17 years ago
With the above includes and defines I get this on Intel Mac (-O2, make nsPNGDecoder.s)
L299:
movl -112(%ebp), %eax
movl (%eax), %eax
movl %eax, -156(%ebp)
bswap %eax
movl %eax, -156(%ebp)
movl -112(%ebp), %ecx
movl 3(%ecx), %edx
bswap %edx
movl %ecx, %esi
movl 6(%ecx), %ecx
bswap %ecx
movl 9(%esi), %esi
bswap %esi
shrl $8, -156(%ebp)
orl $-16777216, -156(%ebp)
movl -156(%ebp), %eax
movl %eax, (%edi)
shrl $8, %edx
orl $-16777216, %edx
movl %edx, 4(%edi)
shrl $8, %ecx
orl $-16777216, %ecx
movl %ecx, 8(%edi)
shrl $8, %esi
orl $-16777216, %esi
movl %esi, 12(%edi)
subl $4, -96(%ebp)
addl $12, -112(%ebp)
addl $16, %edi
cmpl $4, -96(%ebp)
ja L299
jmp L300
which is pretty much the same as the bswap version in comment 2, except that GCC 4.0.1 isn't quite as good with registers.
The generic version boils down to the equivalent of:
(((((((x) & 0xFF) << 8) | (((x) >> 8) & 0xFF)) << 16) | (((x) >> 8) & 0xFF00)) >> 8) | 0xFF000000
From the .s file:
movzbl %dl,%eax ; - (x) & 0xFF ...R
sall $8, %eax ..R.
movzbl %dh, %ecx ; - ((x) >> 8) & 0xFF ...G
orl %ecx, %eax ..RG
sall $16, %eax RG..
shrl $8, %edx .ABG
andl $65280, %edx ..B.
orl %eax, %edx RGB.
shrl $8, %edx .RGB
orl $-16777216, %edx FRGB
Repeat twice (once %esi, once %edi) with the first three lines replaced with:
movl %esi,%eax ; \
andl $255,%eax ; - (x) & 0xFF
sall $8, %eax
movl %esi, %edx ; - ((x) >> 8) & 0xFF
movzbl %dh, %ecx ; /
And repeat once more with:
movzbl -124(%ebp),%eax ; - (x) & 0xFF
sall $8, %eax
movl -124(%ebp), %edx ; - ((x) >> 8) & 0xFF
movzbl %dh, %ecx ; /
While not being smart with registers, these variants avoid doing a second memory access.
Neil's version (slightly reordered):
((x) << 16) | ((x) & 0xFF00) | (((x) >> 16) & 0xFF) | 0xFF000000
From the .s file:
movl %edx, %eax
sall $16, %eax ; (x) << 16 GR..
movl %edx, %ecx
andl $65280, %ecx ; (x) & 0xFF00 ..G.
orl %ecx, %eax GRG.
shrl $16, %edx ; (x) >> 16 ..AB
andl $255, %edx ; & 0xFF ...B
orl %edx, %eax GRGB
orl $-16777216, %eax FRGB
Repeat 3 more times with %edx in the first and third line replaced with %edi, %esi, and -124(%ebp) respectively.
So on Intel Mac Neil's version is slightly more efficient. It's pretty much what I had suggested in comment 20 (if I hadn't been a doofus and would just have done "mov esi,edx ; and esi,0FF00h"). Reorder things slightly et voila:
mov ecx,edx
shl ecx,10h
mov esi,edx
and esi,0FF00h
or ecx,esi
shr edx,10h
and edx,0FFh
or ecx,edx
or ecx,0FF000000h
My guess is that not doing the second memory access gives Neil's version 2.7% over the generic version.
Comment 23•17 years ago
Hrm, I guess my As above should really have been Xs (well, B from the next pixel, but X for clarity)
Anyway, inspiration struck:
# define GFX_BYTESWAP24FF(x) ( ((((x) << 16) | ((x) >> 16)) & 0xffff00ff) | ((x) & 0xff00) | (0xff << 24) )
becomes:
movl %ecx, %eax ; XBGR
rorl $16, %eax ; GRXB
xorb %ah, %ah ; GR.B
andl $65280, %ecx ; ..G.
orl %ecx, %eax ; GRGB
orl $-16777216, %eax ; FRGB
And it's a damn shame the compiler didn't do:
movl %ecx, %eax ; XBGR
rorl $16, %eax ; GRXB
movb %ch, %ah ; GRGB
orl $-16777216, %eax ; FRGB
(prepend a "mov %esi, %ecx" for %esi, same for %edi)
Ah, but how about:
# define GFX_BYTESWAP24FF(x) ( ((((x) << 16) | ((x) >> 16)) | 0xff00ff00) & ((x) | 0xffff00ff) )
becomes:
movl %ecx, %eax ; XBGR
rorl $16, %eax ; GRXB
orl $-16711936, %eax ; FRFB
orl $-65281, %ecx ; FFGF
andl %ecx, %eax ; FRGB
with no messing around for %edi and %esi.
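Both rotate-based candidates can be checked against the straightforward pack (a quick test sketch; `u` suffixes added to keep the shifts well-defined):

```cpp
#include <cstdint>

// First candidate: rotate by 16, clear the stray byte, merge G, set alpha.
#define SWAP24FF_A(x) \
    ( ((((x) << 16) | ((x) >> 16)) & 0xFFFF00FFu) | ((x) & 0xFF00u) | (0xFFu << 24) )
// Second candidate: rotate by 16, then two masks fold in G and the alpha.
#define SWAP24FF_B(x) \
    ( ((((x) << 16) | ((x) >> 16)) | 0xFF00FF00u) & ((x) | 0xFFFF00FFu) )

// Plain pack of x = X<<24 | B<<16 | G<<8 | R for comparison.
static uint32_t reference_pack(uint32_t x) {
    return 0xFF000000u | ((x & 0xFFu) << 16) | (x & 0xFF00u) | ((x >> 16) & 0xFFu);
}
```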
Comment 24•17 years ago
Wheeee! Not sure about the GCC range there, let me know if that needs to change.
Attachment #294884 - Flags: review?
Updated•17 years ago
Attachment #294884 - Flags: review? → review?(pavlov)
Comment 25•17 years ago
One thing I'm wondering though. Do we need to worry about the cost of non-aligned memory access? Since we're already taking 4 * 3 bytes in the main loop, could we process this in 3 chunks of 4, and make sure that the pre-loop aligns us?
Assignee
Comment 26•17 years ago
Re comment #25:
For x86, memory write alignment is more important for performance than read alignment. Of course, optimal performance would come from having all memory accesses aligned to their data type.
In the case of the pixel copy, the destination is aligned 100% of the time and the source 25% so it's not too bad. Optimal would be a scheme in which 12 RGB source bytes were gotten in a 32-bit-aligned read and 4 ARGB pixels were emitted. Some sort of SIMD (SSE2, etc) sequence would do it, as would some careful ASM code, but either would be deemed too platform-specific for Mozilla.
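The pre-loop alignment idea from the previous comment reduces to peeling off a few scalar pixels until the destination reaches SIMD alignment; a hypothetical helper, not from any patch here:

```cpp
#include <cstddef>
#include <cstdint>

// How many pixels to copy one at a time before a 16-byte-aligned SIMD
// main loop can take over on the destination (4-byte ARGB pixels).
// Assumes imageRow is at least 4-byte aligned, as pixel buffers are.
static size_t leadin_pixels(const uint32_t* imageRow, size_t width) {
    uintptr_t addr = reinterpret_cast<uintptr_t>(imageRow);
    size_t n = ((16 - (addr & 15u)) / 4) % 4;  // 0..3 lead-in pixels
    return n < width ? n : width;
}
```

Note that with the source advancing 12 bytes and the destination 16 bytes per iteration, only one of the two pointers can be kept 16-byte aligned throughout; aligning the destination is the better trade per the comment above.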
Comment 27•17 years ago
Any way to do what you're suggesting with intrinsics?
Assignee
Comment 28•17 years ago
(In reply to comment #27)
> Any way to do what you're suggesting with intrinsics?
You mean SIMD intrinsics? Certainly. In my limited experience SIMD intrinsics are *better* than ASM code in that they are common to the Microsoft/GNU/Intel compilers and avoid the differing Intel/AT&T ASM formats (don't get me started!).
My SSE2 experience is too weak to provide off-the-cuff code, but a suitable sequence would be roughly:
pre: copy pixel data as bytes until 128-bit alignment is reached for both source and destination pointers.
1. read 16 bytes of RGB data from the 128-bit-aligned source into an XMM register
2. shuffle the 4 3-byte sequences into 32-bit position, discarding the high 4 bytes read
3. ADD or OR 4 0xFF bytes into the appropriate positions
4. write XMM register value (4 ARGB pixels) to 128-bit-aligned destination
post: copy remaining (n mod 4) pixels as bytes
The downsides:
1. requires SSE2+ instruction set
2. might actually be slower for short scanlines due to set-up costs
3. so different from the integer code that you'd have to actually write the final code to know the relative performance
Comment 29•17 years ago
Comment on attachment 294884 [details] [diff] [review]
Add Mac OS X bswap support, speed up generic GFX_0XFF_PPIXEL_FROM_BPTR
> #elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_MACOSX) && !defined(XP_OS2)
> # include <byteswap.h>
> # define GFX_BYTESWAP16(x) bswap_16(x)
> # define GFX_BYTESWAP32(x) bswap_32(x)
> # define _GFX_USE_INTRIN_BYTESWAP_
>+#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && defined(XP_MACOSX)
>+# include <libkern/OSByteOrder.h>
>+# define GFX_BYTESWAP16(x) OSSwapInt16(x)
>+# define GFX_BYTESWAP32(x) OSSwapInt32(x)
>+# define _GFX_USE_INTRIN_BYTESWAP_
Can't you move the XP_MACOSX one before the one above it so we can drop the !defined(XP_MACOSX) check in the elif before it? Is there an XP_* define for just normal Linux?
Comment 30•17 years ago
reed: if the compiler requirements are the same for these two blocks, then yes. I considered it, but since I'm not quite sure if I'm doing the right version check, or need it at all, I just have it as a place-holder for now.
Comment 31•17 years ago
Steve: btw, why are you storing the results in local variables and then writing them all to the destination? That seems to require more opcodes than writing it out as you go. Compare grouped write vs. write-as-you-go:
L299: |L299:
movl -112(%ebp), %eax | movl -108(%ebp), %ecx
movl (%eax), %eax | movl (%ecx), %eax
* movl %eax, -156(%ebp) | bswap %eax
bswap %eax | shrl $8, %eax
movl %eax, -156(%ebp) | orl $-16777216, %eax
movl -112(%ebp), %ecx | movl %eax, (%edi)
movl 3(%ecx), %edx | movl 3(%ecx), %eax
bswap %edx | bswap %eax
movl %ecx, %esi | shrl $8, %eax
movl 6(%ecx), %ecx | orl $-16777216, %eax
bswap %ecx | movl %eax, 4(%edi)
movl 9(%esi), %esi | movl 6(%ecx), %eax
bswap %esi | bswap %eax
shrl $8, -156(%ebp) | shrl $8, %eax
orl $-16777216, -156(%ebp) | orl $-16777216, %eax
movl -156(%ebp), %eax | movl %eax, 8(%edi)
movl %eax, (%edi) | movl 9(%ecx), %eax
shrl $8, %edx | bswap %eax
orl $-16777216, %edx | shrl $8, %eax
movl %edx, 4(%edi) | orl $-16777216, %eax
shrl $8, %ecx | movl %eax, 12(%edi)
orl $-16777216, %ecx | subl $4, %edx
movl %ecx, 8(%edi) | addl $12, %ecx
shrl $8, %esi | movl %ecx, -108(%ebp)
orl $-16777216, %esi | addl $16, %edi
movl %esi, 12(%edi) | cmpl $4, %edx
subl $4, -96(%ebp) | ja L299
addl $12, -112(%ebp) | jmp L300
addl $16, %edi |
cmpl $4, -96(%ebp) |
ja L299 |
jmp L300 |
(*) compiler bug?
Of course, seeing how the assembly in comment 2 looks like it's ordered rather nicely, I guess I should go find a more recent compiler to play with :-)
Comment 32•17 years ago
BTW, how come in the PNG decoder RGB and BGR get the same pixel-copying treatment?
Assignee
Comment 33•17 years ago
(In reply to comment #31)
> Steve: btw, why are you storing the results in local variables and then writing
> them all to the destination? That seems to require more opcodes than writing it
> out as you go. Compare grouped write vs. write-as-you-go:
[snip]
The local variables give the compiler a little wiggle room, although the
declared variables are never actually generated by the compiler. The purpose
is to avoid exactly the kind of back-to-back register stalls shown in your
snippets.
Example:
movl 3(%ecx), %eax
bswap %eax
shrl $8, %eax
orl $-16777216, %eax
Note how the source for 3 operations in a row is the destination of the prior
instruction. There will be a stall on each of the last 3 instructions because
the prior instruction has not yet completed. Remember that multiple
instructions are in various stages of execution. (Sadly, my Pentium4 has a
24-stage pipeline. The last P4 models released had 30 stages!)
Then look at the ASM code I posted above:
00503020 mov ecx,dword ptr [eax+3]
00503023 mov edx,dword ptr [eax+6]
00503026 mov edi,dword ptr [eax+9]
00503029 mov eax,dword ptr [eax]
0050302B bswap eax
0050302D bswap ecx
0050302F bswap edx
00503031 bswap edi
00503033 shr eax,8
00503036 shr ecx,8
00503039 shr edx,8
0050303C shr edi,8
0050303F or eax,0FF000000h
00503044 or ecx,0FF000000h
0050304A or edx,0FF000000h
00503050 or edi,0FF000000h
00503056 mov dword ptr [esi],eax
00503058 mov dword ptr [esi+4],ecx
0050305B mov dword ptr [esi+8],edx
0050305E mov dword ptr [esi+0Ch],edi
I have to say that this is damned good code generation on Microsoft's part.
There is only a single back-to-back use of a register (mov/bswap eax) in that
entire loop.
Presumably those local variables would actually be allocated and used if
compiler optimization was completely disabled. As it is, the compiler uses
registers to contain the intermediate values.
Re GCC: The major change in GCC v4.0 was a lot of back-end enhancements, which
at the time were said to enable future optimizations. They seem to be actually
implementing those optimizations now as the v4.1.2 I'm using (in Fedora7)
generates pretty good x86 code. FYI.
Comment 34•17 years ago
Ah, ok, that's what I guessed you were talking about. The code in the left column is slightly better in that regard, but still nowhere near as pretty as Microsoft's.
I'm curious what Microsoft does with Neil's generic macro (comment 16) and with either of mine (comment 23).
Also, why did you pick __GNUC__ >= 2?
Assignee
Comment 35•17 years ago
Re comment #34:
I confess that I punted on the version. I knew that bswap_xx() predated GCC v3.0, but not exactly in which version it was introduced. I guess I could have just used v3 as a minimum since
http://developer.mozilla.org/en/docs/Linux_Build_Prerequisites#Build_Tools
says that GCC v3.2+ is "recommended".
Assignee
Comment 36•17 years ago
Peter:
Pulled the trunk and built in Linux today. This is what was generated for the inner loop of nsJPEGDecoder::OutputScanlines:
.L50:
subl $4, 24(%esp)
movl (%edi), %eax
movl 3(%edi), %edx
movl 6(%edi), %ecx
movl 9(%edi), %esi
addl $12, %edi
bswap %eax
bswap %edx
bswap %ecx
bswap %esi
shrl $8, %eax
shrl $8, %edx
orl $-16777216, %eax ; 0xFF000000
shrl $8, %ecx
orl $-16777216, %edx ; 0xFF000000
shrl $8, %esi
orl $-16777216, %ecx ; 0xFF000000
orl $-16777216, %esi ; 0xFF000000
movl %eax, (%ebp)
movl %edx, 4(%ebp)
movl %ecx, 8(%ebp)
movl %esi, 12(%ebp)
addl $16, %ebp
cmpl $4, 24(%esp)
ja .L50
Again, this is with GCC v4.1.2, built with -Os optimization. Not too shabby.
(The "-Os" was Red Hat's choice. For some reason they always build Mozilla products at -Os instead of their usual -O2.)
FYI.
Updated•17 years ago
Attachment #294884 - Flags: review?(pavlov) → review+
Comment 37•17 years ago
Updated•17 years ago
Attachment #294884 - Attachment is obsolete: true
Comment 38•17 years ago
To sorta summarize this post-fixed flurry of comments and patches, this patch (like the previous patch) gets us to where we use |bswap| on Intel Mac with GCC, and adds a faster generic implementation of GFX_0XFF_PPIXEL_FROM_BPTR().
Updated•17 years ago
Attachment #295008 - Flags: review?(pavlov)
Attachment #295008 - Flags: approval1.9?
Updated•17 years ago
Comment 39•17 years ago
Comment on attachment 295008 [details] [diff] [review]
Shuffle things around a bit, no functional changes.
>+#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && !defined(XP_OS2)
In order to stop having to append !defined() things to the end of this line, could it be changed to this:
#elif defined(__GNUC__) && (__GNUC__ >= 2) && defined(__i386__) && defined(XP_UNIX)
Would that work to deal with problems like what have been mentioned above, plus things like bug 410439?
Comment 40•17 years ago
Comment on attachment 295008 [details] [diff] [review]
Shuffle things around a bit, no functional changes.
Approvals after reviews, please. Also, I'd really prefer to have this be attached to a new bug, referencing this original, but whatever :)
Attachment #295008 - Flags: approval1.9?
Comment 41•17 years ago
Given the fact that these defines are conditioned on __i386__, and people have reported problems on PPC, is the use of bswapXX right, or should this be using ntohl(), which will take care of machine endianness issues? ntohl() is also much more likely to be optimized on all architectures, since it's critical to network performance. Plus it's already in <prinet.h>, so all of the platform junk could be skipped...
Comment 42•17 years ago
Forest, trees, and way too much fun. Thanks, Jeremy.
This should do the right thing everywhere.
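The reason `ntohl()` does the right thing everywhere: it is a no-op on big-endian hosts and a byte swap on little-endian ones, so the same expression produces an 0xFFRRGGBB pixel either way. A sketch of the idea (illustrative, not the patch verbatim):

```cpp
#include <cstdint>
#include <cstring>
#include <arpa/inet.h>  // ntohl; NSPR's prinet.h plays this role in the tree

// Read R,G,B plus one stray byte, normalize to network (big-endian)
// order, drop the stray byte, set alpha. After ntohl() the value is
// R<<24 | G<<16 | B<<8 | X regardless of host endianness: on big-endian
// hosts ntohl() is a no-op, on little-endian hosts it byte-swaps.
static uint32_t ff_pixel_from_bptr(const uint8_t* p) {
    uint32_t x;
    std::memcpy(&x, p, 4);
    return (ntohl(x) >> 8) | 0xFF000000u;
}
```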
Attachment #295008 - Attachment is obsolete: true
Attachment #295486 - Flags: review?
Attachment #295008 - Flags: review?(pavlov)
Updated•17 years ago
Attachment #295486 - Flags: review? → review?(pavlov)
Comment 43•17 years ago
Comment on attachment 295486 [details] [diff] [review]
Boring version (I like it!)
this is fine as long as ntohl is a macro everywhere and this doesn't turn into a function call. Have you verified that?
Comment 44•17 years ago
ntohl should exist on all platforms supported by NSPR. I would be very surprised if it reduces to a function call (instead of some inlined assembly) on any serious platform.
Is it good enough if we verify Windows, Mac, Linux and some BSD?
I have verified that on Intel Mac ntohl gets inlined to bswap, and on a pretty standard FC2 Linux box (note: _old_) the ntohl call gets inlined to bswap where available, and rorw 8 + rorl 16 + rorw 8 where not. I imagine more up-to-date Linux distributions will do just as well.
Steve, can you check on Windows?
Jeremy, can you take FreeBSD?
Comment 45•17 years ago
(In reply to comment #44)
> Jeremy, can you take FreeBSD?
On FreeBSD, yes. Unless optimization is turned off.
The mingw problems are because prinet.h doesn't include any headers, because windows.h includes winsock2.h, which defines ntohl, but the PNG encoder doesn't include windows.h... This is a bug in NSPR - it's breaking the contract that prinet.h makes - that ntohl will be defined if it is included.
Updated•17 years ago
Attachment #295486 - Flags: review?(pavlov) → review+
Updated•17 years ago
Attachment #295486 - Flags: approval1.9?
Updated•17 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 46•17 years ago
/me wonders what FreeBSD generates with optimizations turned off.
Is there a bug filed against NSPR on prinet.h not providing ntohl in a MinGW environment?
Comment 47•17 years ago
|
||
Comment on attachment 295486 [details] [diff] [review]
Boring version (I like it!)
a+ - thanx this is awesome
Attachment #295486 -
Flags: approval1.9? → approval1.9+
Comment 48•17 years ago
|
||
So before I can check this in we'll need some code for mingw in NSPR or update this patch to make an exception for that case for now, and I would still like some confirmation that we end up using bswap on Windows/VC++.
Comment 49•17 years ago
|
||
Ok, so since bug 411055 exists and this is already broken (bug 410439) I'm not going to wait for mingw (sorry) but I would still like confirmation that we generate bswap on Windows (very unlikely we don't, but ...)
Comment 50•17 years ago
|
||
With SSE2, a potential solution would be:
__declspec(align(16)) unsigned char ff[] = {255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0};
movd xmm0, dword ptr source RGB
pxor xmm1, xmm1 0 vector
movdqa xmm2, xmmword ptr ff FF, 0, 0, 0, 0, etc.
punpcklbw xmm0, xmm1 R0G0B0
pshuflw xmm0, xmm0, encoding 00B0G0R0
(encoding is a byte which specifies how words are to be shuffled; in this case, we'd reorder RGB to BGR and shift it one byte)
packuswb xmm0, xmm0 0BGR
por xmm0, xmm2 FBGR
movd dword ptr target, xmm0
To do one pixel at a time in a macro.
Doing twelve bytes at a time (four pixels) is hard with SSE2 as SSE2 doesn't handle objects well that aren't sized at powers of two. And SSE2 doesn't have a packed byte shuffle which is why the above code converts the bytes to words, does a word shuffle and then converts the words back to bytes. To process multiple three-byte pixels, mask, shift and merge sequences would have to be done and it would be tight on registers to try to do four at a time. It would be an interesting exercise and it would have to be done in a piece of streaming code for efficiency. The above code does a vector read and a vector zero every time which would normally be factored out of streaming code but there shouldn't be much of a cost as the instructions without dependencies can run in parallel given enough execution resources in the CPU.
I don't know if it is faster or not but you wouldn't put it in a macro as you'd have to test for SSE2 availability every time and then fork to the appropriate code. With streaming code, you'd only do the test once.
However, with SSSE3 (Core 2 Duo and on), there is a packed shuffle instruction which would allow you to move any byte to any other location in the vector in one instruction. Latency is 3 on Core 2 Duo and 1 on Penryn. So in one instruction, you can do:
RGBRGBRGBRGBRGBR -> BGR BGR BGR BGR
Do a vector or with F F F F and you get
FBGRFBGRFBGRFBGR with two instructions and a latency of four or five.
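For anyone without the manuals handy, PSHUFB's semantics are simple enough to model in portable C. Below is a scalar sketch (not SIMD; all names here are invented) of the single-shuffle RGB-to-BGR0 rearrangement described above, with the alpha OR folded in afterwards:

```c
#include <stdint.h>

/* Scalar model of SSSE3 PSHUFB: each output byte i is 0 if mask[i] has
 * its high bit set, otherwise src[mask[i] & 0x0F]. */
static void pshufb_model(const unsigned char src[16],
                         const unsigned char mask[16],
                         unsigned char out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (mask[i] & 0x80) ? 0 : src[mask[i] & 0x0F];
}

/* Turn 12 RGB source bytes (4 pixels, plus 4 bytes of padding read but
 * ignored) into four 0xFFRRGGBB pixels, stored little-endian as B,G,R,FF. */
static void rgb12_to_ffbgr16(const unsigned char src[16],
                             unsigned char out[16])
{
    /* one shuffle: B,G,R per pixel, alpha slot zeroed by the 0x80 flag... */
    static const unsigned char shuf[16] = {
        2, 1, 0, 0x80,  5, 4, 3, 0x80,  8, 7, 6, 0x80,  11, 10, 9, 0x80
    };
    pshufb_model(src, shuf, out);
    /* ...then OR in the 0xFF alpha, as the comment describes */
    for (int i = 3; i < 16; i += 4)
        out[i] = 0xFF;
}
```

The real instruction does all 16 byte moves in one shot; the mask table above is exactly the control vector an SSSE3 implementation would feed to it.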
In SSE4.1, there's an instruction called PBLENDVB (Variable Blend Packed Bytes). Arguments are the destination register, a source register/memory, and a per-byte mask that selects, for each position, whether the byte is kept from the destination or copied from the source.
As you can imagine, a code loop to do 12 bytes at a time would be a very short piece of code.
The SSSE3 instruction PSHUFB has a huge amount of potential for improving byte-twiddling operations involving pixels, whether the size is three, four or more bytes. The downsides are:
- AMD doesn't support it in Barcelona. I expect them to support it in SSE5 but my guess is that they'll need two years for their next architecture.
- MSVC doesn't support SSSE3 and SSE4 instructions until VS2008. VS2008 is out and I have the free trial installed somewhere to play with SSSE3 but I haven't yet.
- I think that the current versions of GCC don't support SSSE3 and SSE4 yet for Mac OSX. I'm not sure where Linux is on these instruction sets.
On the other hand, ICC 10.1 supports SSSE3 and SSE4 as you'd expect. I have a free trial of ICC 10.1 and have built Firefox 2.0.0.11, but with any compiler optimizations enabled the build is unusable (left click doesn't work and gmail runs like a dog). I was planning to work on getting optimized builds using ICC before the free trial expires but other stuff came up.
I'm not sure what the inputs to this macro are and where it is used but it might be possible to optimize the input phase with SSE2. I have SSE2 code in progress that outputs three-byte pixels and the output can be in RGB or BGR order. The output of the color code is actually in four-byte pixels with one byte as a zero so I have to mask and merge to get the output to three-byte pixels. This takes 30 instructions. Outputting four-byte pixels would take only six. Of course an SSSE3 implementation would chop that down to about six as well. The SSE2 is still 25% faster than the scalar code because it replaces scalar table lookups with parallel (8 operations per instruction) arithmetic operations.
Comment 51•17 years ago
|
||
> However, with SSSE3 (Core 2 Duo and on), there is a packed shuffle instruction
> which would allow you to move any byte to any other location in the vector in
> one instruction. Latency is 3 on Core 2 Duo and 1 on Penryn. So in one
> instruction, you can do:
http://en.wikipedia.org/wiki/SSE3
Says a whole lot more processors have SSE3 - looks like anything that's dual core (including the Athlon 64 X2). So I'd hazard a bet that a large portion of our users have SSE3 on their machines.
> I'm not sure what the inputs to this macro are and where it is used but it
> might be possible to optimize the input phase with SSE2. I have SSE2 code in
> progress that outputs three-byte pixels and the output can be in RGB or BGR
> order. The output of the color code is actually in four-byte pixels with one
> byte as a zero so I have to mask and merge to get the output to three-byte
> pixels. This takes 30 instructions. Outputting four-byte instructions would
> take only six. Of course an SSSE3 implementation would chop that down to about
> six as well. The SSE2 is still 25% faster than the scalar code because it
> replaces scalar table lookups with parallel (8 operations per instruction)
> arithmetic operations.
>
That would be awesome (SSE2 version)
Comment 52•17 years ago
|
||
SSE3 is not the same as SSSE3. The closeness of the acronyms can be confusing.
Comment 53•17 years ago
|
||
Alright, so #include-ing winsock.h in gfxColor.h is painful since that gets included directly and indirectly in a lot of places in the tree, some of which seem rather unhappy to now have to deal with winsock.h. If I were a Windows hacker I'm sure I could figure out all the magic incantations to make it all work. Instead, I'm moving the macros out to a new header file which will only need to be included by the various decoders that need the macros. Shaves a bit off build time, and per TryServer this actually compiles.
Attachment #295486 -
Attachment is obsolete: true
Attachment #296902 -
Flags: review?(pavlov)
Comment 54•17 years ago
|
||
Comment on attachment 296902 [details] [diff] [review]
Boring version moved out
i'd prefer not to add new files. especially for mingw
Attachment #296902 -
Flags: review?(pavlov) → review-
Comment 55•17 years ago
|
||
Ah, but you see, I'm not adding new files for MinGW, I'm adding a new file for Windows so we won't end up #include-ing winsock.h in quite a few places.
Comment 56•17 years ago
|
||
Alternatively I could just stuff the #include in the decoders. A little uglier, but not too horrid.
Comment 57•17 years ago
|
||
(In reply to comment #50)
> With SSE2, a potential solution would be:
>
> __declspec(align(16)) unsigned char ff[] = {255, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
> 0, 0, 0, 0, 0};
>
> movd xmm0, dword ptr source RGB
> pxor xmm1, xmm1 0 vector
> movdqa xmm2, xmmword ptr ff FF, 0, 0, 0, 0, etc.
> punpcklbw xmm0, xmm1 R0G0B0
> pshuflw xmm0, xmm0, encoding (encoding is a byte which specifies how
> words are to
> be shuffled. In this case, we'd
> reorder RGB to BGR
> and shift it one byte)
> 00B0G0R0
> packuswb xmm0, xmm0 0BGR
> por xmm0, xmm2 FBGR
> movd dword ptr target, xmm1
>
Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona, as well as movd instruction. I.e. if the code needs to be the same for all platforms it's better to use bswap, which is known to work fast on all platforms and does not require SSE2.
Comment 58•17 years ago
|
||
Comment 59•17 years ago
|
||
> Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona,
> as well as movd instruction. I.e. if the code needs to be the same for
> all platforms it's better to use bswap, which is known to work fast on
> all platforms and does not require SSE2.
pshufd is expensive at four cycles. pshuflw is only two cycles. Throughput
of the instruction is one instruction per clock which is the same as BSWAP
on K8. The thing about using SIMD instructions is that almost all of them
don't affect the status flags so they can run in some degree, in parallel
if there are no instruction dependencies.
The above code wouldn't be competitive with an unrolled bswap solution. But
perhaps something doing 48 bytes at a time using SSE2 would be where you are
working on different parts of the string in parallel. I have some other ideas
on this but am pretty busy with other work right now.
Comment 60•17 years ago
|
||
Don't really need it in XBM and BMP. Also, move the winsock include to the top to make MinGW happy.
Attachment #296957 -
Attachment is obsolete: true
Comment 61•17 years ago
|
||
(In reply to comment #59)
> > Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona,
> > as well as movd instruction. I.e. if the code needs to be the same for
> > all platforms it's better to use bswap, which is known to work fast on
> > all platforms and does not require SSE2.
>
> pshufd is expensive at four cycles. pshuflw is only two cycles. Throughput
> of the instruction is one instruction per clock which is the same as BSWAP
> on K8.
The main problem is not clock cycles those insns take, nor their throughput. The
main problem is they are vector path, causing stalls in decoder and bubbles in
ROB. I.e. this in fact affects possibilities for out of order execution,
decreasing the parallelism possibilities.
Comment 62•17 years ago
|
||
> The main problem is not clock cycles those insns take, nor their
> throughput. The main problem is they are vector path, causing stalls
> in decoder and bubbles in ROB. I.e. this in fact affects possibilities
> for out of order execution, decreasing the parallelism possibilities.
pshufd is VectorPath. pshuflw and pshufhw are DirectPath Double.
I have a link to a local copy of the K8 SOG on my homepage and it only takes me half a minute to look this stuff up if I don't already have it memorized.
A VectorPath instruction isn't the end of the world when considering the selection of instructions to use. It should be avoided when there are other
alternatives but not so if those alternatives are considerably more expensive.
Assignee | ||
Comment 63•17 years ago
|
||
After applying attachment 296978 [details] [diff] [review] I get subroutines, not inline instructions. Ouch! With the value to be swapped passed on the stack. Ouch again!
[ Below is a verbatim display from the debugger, but those calls actually refer to ntohl(). ]
From nsJPEGDecoder::OutputScanlines(), built with VS2005/SP1 on a Win2K machine:
// bulk copy of pixels.
while (idx > 4) { // >4 to avoid last 3 bytes in buffer
00542302 cmp ecx,4
00542305 mov dword ptr [esp+10h],ecx
00542309 jbe nsJPEGDecoder::OutputScanlines+1DDh (5423ADh)
0054230F add ecx,0FFFFFFFBh
00542312 shr ecx,2
00542315 add ecx,1
00542318 mov dword ptr [esp+14h],ecx
0054231C lea esp,[esp]
PRUint32 p0, p1, p2, p3; // to avoid back-to-back register stalls
p0 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+0);
00542320 mov eax,dword ptr [eax]
00542322 push eax
00542323 call imgRequest::NotifyProxyListener+254h (54D794h)
p1 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+3);
00542328 mov ecx,dword ptr [esp+0Ch]
0054232C mov edx,dword ptr [ecx+3]
0054232F mov edi,eax
00542331 shr edi,8
00542334 push edx
00542335 or edi,0FF000000h
0054233B call imgRequest::NotifyProxyListener+23Ch (54D77Ch)
00542340 mov ebx,eax
p2 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+6);
00542342 mov eax,dword ptr [esp+0Ch]
00542346 mov ecx,dword ptr [eax+6]
00542349 shr ebx,8
0054234C push ecx
0054234D or ebx,0FF000000h
00542353 call imgRequest::NotifyProxyListener+224h (54D764h)
p3 = GFX_0XFF_PPIXEL_FROM_BPTR(sampleRow+9);
00542358 mov edx,dword ptr [esp+0Ch]
0054235C shr eax,8
0054235F or eax,0FF000000h
00542364 mov dword ptr [esp+20h],eax
00542368 mov eax,dword ptr [edx+9]
0054236B push eax
0054236C call imgRequest::NotifyProxyListener+20Bh (54D74Bh)
imageRow[0] = p0; imageRow[1] = p1;
imageRow[2] = p2; imageRow[3] = p3;
00542371 mov ecx,dword ptr [esp+20h]
idx -= 4;
00542375 sub dword ptr [esp+10h],4
0054237A shr eax,8
0054237D or eax,0FF000000h
00542382 mov dword ptr [esi+0Ch],eax
00542385 mov dword ptr [esi],edi
00542387 mov dword ptr [esi+4],ebx
0054238A mov dword ptr [esi+8],ecx
sampleRow += 12;
0054238D mov eax,dword ptr [esp+0Ch]
00542391 add eax,0Ch
imageRow += 4;
00542394 add esi,10h
00542397 sub dword ptr [esp+14h],1
0054239C mov dword ptr [esp+0Ch],eax
005423A0 jne nsJPEGDecoder::OutputScanlines+150h (542320h)
005423A6 mov edi,dword ptr [this]
005423A9 mov ecx,dword ptr [esp+10h]
}
Assignee | ||
Comment 64•17 years ago
|
||
[Continuation from above.]
Those calls to ntohl() execute this code:
75033304 mov ecx,dword ptr [esp+4]
75033308 mov eax,ecx
7503330A mov edx,ecx
7503330C shl eax,10h
7503330F and edx,0FF00h
75033315 or eax,edx
75033317 mov edx,ecx
75033319 shr edx,10h
7503331C and ecx,0FF0000h
75033322 or edx,ecx
75033324 shl eax,8
75033327 shr edx,8
7503332A or eax,edx
7503332C ret 4
So even after taking the pain of the stack handling we still don't get the bswap instruction, just generic C code. The only way this could be worse would be to individually arrange each of the 32 bits in the pixel.
Comment 65•17 years ago
|
||
Is this an optimized build? Or does MS' ntohl() really suck that hard?
Assignee | ||
Comment 66•17 years ago
|
||
(In reply to comment #65)
> Is this an optimized build? Or does MS' ntohl() really suck that hard?
Yes, it is an optimized build.
As for ntohl() sucking... I don't know. It doesn't make sense to me that it would be this bad, but I can't dispute the code that I'm seeing.
Comment 67•17 years ago
|
||
(In reply to comment #57)
> Shufls are very expensive with AMD Opteron/Athlon HW prior to Barcelona, as
> well as movd instruction.
I checked the movd instruction and only the register versions are VectorPath
and slow. The memory versions are DirectPath Double and latencies of 2 and 4 depending on the direction.
That said, I took another stab at an SSE2 solution. The clock starts are listed
on the right. This looks to be about 1.5 clocks per byte on the K8 architecture.
I think that doing 24 bytes at a time would get it down to 1 clock per byte. The
ideal number is 48 bytes as there is no read waste (48 bytes=3 vector reads and
16 pixels). The scalar instructions for updating source and target pointers, and doing the compare for loop end can be intermingled with the vector instructions and run in parallel.
__declspec(align(16)) const unsigned char ff000000[16] =
{255,0,0,0,255,0,0,0,255,0,0,0,255,0,0,0};
/*
movdqa xmm1, xmmword ptr source ! 1
pxor xmm0, xmm0 ! 1
movdqa xmm2, xmmword ptr ff000000 ! 2
movdqa xmm3, xmm1 ! 3
pshuflw xmm1, [0,1,1,2] ! 3
psrldq xmm3, 6 ! 5
punpcklbw xmm1, xmm0 ! 5
pshuflw xmm3, [0,1,1,2] ! 7
pshuflw xmm1, [0,2,1,0] ! 7
pshufhw xmm1, [0,3,2,1] ! 8
punpcklbw xmm3, xmm0 ! 9
pshuflw xmm3, [0,2,1,0] ! 11
pshufhw xmm3, [0,3,2,1] ! 11
packuswb xmm1, xmm3 ! 13
por xmm1, xmm2 ! 15
movlps qword ptr target, xmm1 ! 17
movhps qword ptr target+8, xmm1 ! 17
*/
Comment 68•17 years ago
|
||
This is from jdcolor.c for the most common type of jpeg. You can change the
byte order output by swapping the RGB_RED and RGB_BLUE tags. The same change
would have to be made for ycck_cmyk_convert. A small but different change would
have to be made for null_convert. gray_rgb_convert doesn't need to be changed.
The only routine that I'm not sure of is grayscale_convert, but nobody has ever complained about it in the three years it has been in my builds. I don't know how developers feel about making this kind of change in JPEG as it changes the functionality.
Some color output code also exists in jdmerge.c but I don't think that Mozilla uses this code.
Even more efficient would be to create four-byte pixel buffers which jdcolor could write into with ffBGR pixels. This would be cheap to do on the JPEG side;
much more expensive in nsJPEGDecoder.cpp.
ycc_rgb_convert (j_decompress_ptr cinfo,
                 JSAMPIMAGE input_buf, JDIMENSION input_row,
                 JSAMPARRAY output_buf, int num_rows)
{
  my_cconvert_ptr cconvert = (my_cconvert_ptr) cinfo->cconvert;
  register int y, cb, cr;
  register JSAMPROW outptr;
  register JSAMPROW inptr0, inptr1, inptr2;
  register JDIMENSION col;
  JDIMENSION num_cols = cinfo->output_width;
  /* copy these pointers into registers if possible */
  register JSAMPLE * range_limit = cinfo->sample_range_limit;
  register int * Crrtab = cconvert->Cr_r_tab;
  register int * Cbbtab = cconvert->Cb_b_tab;
  register INT32 * Crgtab = cconvert->Cr_g_tab;
  register INT32 * Cbgtab = cconvert->Cb_g_tab;
  SHIFT_TEMPS

  while (--num_rows >= 0) {
    inptr0 = input_buf[0][input_row];
    inptr1 = input_buf[1][input_row];
    inptr2 = input_buf[2][input_row];
    input_row++;
    outptr = *output_buf++;
    for (col = 0; col < num_cols; col++) {
      y  = GETJSAMPLE(inptr0[col]);
      cb = GETJSAMPLE(inptr1[col]);
      cr = GETJSAMPLE(inptr2[col]);
      /* Range-limiting is essential due to noise introduced by DCT losses. */
      outptr[RGB_RED]   = range_limit[y + Crrtab[cr]];
      outptr[RGB_GREEN] = range_limit[y +
                          ((int) RIGHT_SHIFT(Cbgtab[cb] + Crgtab[cr],
                                             SCALEBITS))];
      outptr[RGB_BLUE]  = range_limit[y + Cbbtab[cb]];
      outptr += RGB_PIXELSIZE;
    }
  }
}
Comment 69•17 years ago
|
||
(In reply to comment #67)
> (In reply to comment #57)
> That said, I took another stab at an SSE2 solution.
> movlps qword ptr target, xmm1 ! 17
> movhps qword ptr target+8, xmm1 ! 17
> */
That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture because of partial xmm register write stall. At the same time use of movdqa (or even movdqu) is slower on K8. However, not as slow as loading pair on Intel.
Comment 70•17 years ago
|
||
(In reply to comment #69)
> That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture
> because of partial xmm register write stall. At the same time use of movdqa (or
> even movdqu) is slower on K8. However, not as slow as loading pair on Intel.
I've only spent some time in the Core 2 Duo SOG so I'm not familiar with many nuances in SIMD optimization but, in practice, I've found that optimizing for the K8 architecture usually provides for good or better results in Core 2 Duo processors. Or it's okay to give back a percent or two if you're improving performance by ten times that using SIMD.
I'm not familiar with the partial xmm register write stall. Could you point me to where it is in the Intel Core 2 Duo documentation?
Also, given your expertise in this area, do you have any interest in adding SSE2 optimization in the new PNG code? The old mmx code was removed due to maintenance headaches so the current code is now all scalar. The MMX code provided about a 25% performance improvement on my PNG benchmark suite. SSE2 should be able to do a little better.
Comment 71•17 years ago
|
||
(In reply to comment #70)
> (In reply to comment #69)
> > That's nice, but now movlps/movhps pair will suffer a lot on Core2 architecture
> > because of partial xmm register write stall. At the same time use of movdqa (or
> > even movdqu) is slower on K8. However, not as slow as loading pair on Intel.
>
> I've only spent some time in the Core 2 Duo SOG so I'm not familiar with many
> nuances in SIMD optimization but, in practice, I've found that optimizing for
> the K8 architecture usually provides for good or better results in Core 2 Duo
> processors. Or it's okay to give back a percent or two if you're improving
> performance by ten times that using SIMD.
>
> I'm not familiar with the partial xmm register write stall. Could you point me
> to where it is in the Intel Core 2 Duo documentation?
You can find it here:
http://www.intel.com/design/processor/manuals/248966.pdf
Chapter 3.5.2.4 Partial XMM Register Stalls
That's the first document Google gave me.
In practice I had many FP intensive applications suffering up to 50% because of movlp*/movhp*.
> Also, given your expertise in this area, do you have any interest in adding
> SSE2 optimization in the new PNG code? The old mmx code was removed due to
> maintenance headaches so the current code is now all scalar. The MMX code
> provided about a 25% performance improvement on my PNG benchmark suite. SSE2
> should be able to do a little better.
I would do that if only (or when) have enough time.
Comment 72•17 years ago
|
||
(In reply to comment #71)
> You can find it here:
>
> http://www.intel.com/design/processor/manuals/248966.pdf
>
> Chapter 3.5.2.4 Partial XMM Register Stalls
>
> That's the first document Google gave me.
>
> In practice I had many FP intensive applications suffering up to 50% because of
> movlp*/movhp*.
The documentation there only describes a problem with XMM WRITE stalls or SIMD loads. The code above does SIMD stores and I see no problems described in the Intel documentation with SIMD partial register stores.
Did you see application performance problems using movlps/hps xmm, qword ptr source or movlps/hps qword ptr target, xmm? If the latter, then perhaps Intel needs to update their SOG. If only the former, then my code should be fine, right?
Comment 73•17 years ago
|
||
Right ;) Sorry for the confusion. That is the big issue for loads; must be the AT&T vs. Intel syntax mess in my head. I've just looked up the final decision we made some time ago: the optimal approach we found for blended code is single loads and split stores. So yes, the code should be good for an environment not known beforehand.
Comment 74•17 years ago
|
||
Since ntohl() on Windows seems to suck, I'm switching back to _byteswap_ulong() for Windows, with a fallback to the fast generic macro for older VC++ and MinGW. The rest still uses ntohl(). Oh, and BIG_ENDIAN platforms just do a shift+or.
TryServer said this compiles on Linux, Mac, and Windows. Martijn let me know it compiles on MinGW (yay!), and per prinet.h ntohl() is available on OS/2 and Unix platforms.
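For reference, the hybrid dispatch described here boils down to something like the following. This is a sketch, not the patch itself: the macro name is invented, and __builtin_bswap32 is shown as the GCC-side spelling even though the actual patch keeps ntohl on non-Windows platforms:

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <stdlib.h>
#define PIXEL_BSWAP32(x) _byteswap_ulong(x)   /* compiles to a bswap insn */
#elif defined(__GNUC__)
#define PIXEL_BSWAP32(x) __builtin_bswap32(x) /* likewise */
#else
/* Fast generic fallback, equivalent to the patch's portable macro. */
#define PIXEL_BSWAP32(x)                 \
    ((((x) & 0x000000FFu) << 24) |       \
     (((x) & 0x0000FF00u) <<  8) |       \
     (((x) & 0x00FF0000u) >>  8) |       \
     (((x) & 0xFF000000u) >> 24))
#endif
```

On big-endian targets the swap is unnecessary in the first place, which is why the patch assembles the pixel there with plain shifts and ORs: the bytes are already in R,G,B order after the 32-bit read.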
Attachment #296902 -
Attachment is obsolete: true
Attachment #296978 -
Attachment is obsolete: true
Attachment #297255 -
Flags: review?
Updated•17 years ago
|
Attachment #297255 -
Flags: review? → review?(pavlov)
Updated•17 years ago
|
Attachment #297255 -
Flags: review?(pavlov) → review+
Updated•17 years ago
|
Attachment #297255 -
Flags: approval1.9?
Comment 75•17 years ago
|
||
Moving to blocking list so you are good to land.
Flags: blocking1.9+
Priority: -- → P1
Updated•17 years ago
|
Attachment #297255 -
Flags: approval1.9?
Comment 76•17 years ago
|
||
Comment on attachment 297255 [details] [diff] [review]
Hybrid approach
I'll land this when I get home tonight.
Attachment #297255 -
Flags: approval1.9?
Comment 77•17 years ago
|
||
Checking in gfxColor.h;
/cvsroot/mozilla/gfx/thebes/public/gfxColor.h,v <-- gfxColor.h
new revision: 1.18; previous revision: 1.17
Status: REOPENED → RESOLVED
Closed: 17 years ago → 17 years ago
Resolution: --- → FIXED