Closed
Bug 411379
Opened 17 years ago
Closed 17 years ago
Add SSE2 processing for JPEG color, use static instead of dynamic arrays
Categories
(Core :: Graphics: ImageLib, enhancement, P1)
Tracking
()
RESOLVED
FIXED
mozilla1.9beta3
People
(Reporter: mmoy, Assigned: mmoy)
References
Details
(Keywords: perf)
Attachments
(4 files, 6 obsolete files)
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
text/html
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
pavlov
:
review+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20080101 BonEcho/2.0.0.11 (mmoy CE K8C-X03)
Build Identifier: Trunk
Add SSE2 optimization for ycc_rgb_convert and turn five dynamic tables into static tables. ycc_rgb_convert is a common color conversion and using SSE2 code instead of scalar code results in about a 25% performance improvement in my testing.
There are four tables that are generated in the jdcolor.c module every time an image is processed. The change here uses a static table instead of going through malloc, computation and free each time.
There is a table called range_limit that's created for every image and this is done in jdmaster.c. The change here changes that to a static table instead of creating it for every image.
The SSE2 code and the table code are only varianted on Win32 builds built with MSVC. The SSE2 code will only run on processors that support it but that should be most processors in the last six years.
Reproducible: Always
Steps to Reproduce:
1.
2.
3.
This code piggybacks on the HAVE_SSE2_INTEL_MNEMONICS define variable and the SSE2Available global that's in the existing SSE2 JPEG code for the Inverse Discrete Cosign Transform.
This code was ported from my Mac OSX code by changing the alignment declarations from GNU style to MSVC style. Most of the changes in this patch should work on Mac OSX Intel and Linux with minor modifications.
Assignee | ||
Comment 1•17 years ago
|
||
Changes are to jdcolor.c and jdmaster.c.
Attachment #296054 -
Flags: approval1.9?
Assignee | ||
Comment 2•17 years ago
|
||
I've done a fair amount of testing on this code and it's been around for about two years. I haven't tried building with this particular patch on OSX or Linux though I did do builds with various switches turned on and off. I also eyeballed the image results on the edges and corners and where images meet other images.
I have a bunch of other JPEG optimization but this is probably the safest with the
biggest return that should provide enough time for comments and testing before Beta 3.
Version: unspecified → Trunk
Comment 3•17 years ago
|
||
Comment on attachment 296054 [details] [diff] [review]
JPEG Color SSE2 and static tables changes
mmoy, you need to get review+ before seeking approval.
Maybe ask akayser?
Attachment #296054 -
Flags: approval1.9?
Updated•17 years ago
|
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: blocking1.9?
Assignee | ||
Comment 4•17 years ago
|
||
I need to make a few small changes to the patch after looking at the dif. Will take me 30 minutes.
Updated•17 years ago
|
Attachment #296054 -
Flags: superreview?(pavlov)
Attachment #296054 -
Flags: review?(pavlov)
Assignee | ||
Comment 5•17 years ago
|
||
Changes to jdcolor.c and jdmaster.c
Attachment #296054 -
Attachment is obsolete: true
Attachment #296062 -
Flags: approval1.9?
Attachment #296054 -
Flags: superreview?(pavlov)
Attachment #296054 -
Flags: review?(pavlov)
Comment 6•17 years ago
|
||
Comment on attachment 296062 [details] [diff] [review]
SSE2 JPEG Color optimization and use static instead of dynamic tables
You need review before approval.
Attachment #296062 -
Flags: superreview?(pavlov)
Attachment #296062 -
Flags: review?(pavlov)
Attachment #296062 -
Flags: approval1.9?
Assignee | ||
Comment 7•17 years ago
|
||
I just sent email to Alfred at his gmail account requesting a review.
Updated•17 years ago
|
Attachment #296062 -
Flags: review?(alfredkayser)
Updated•17 years ago
|
Assignee: nobody → mmoy
Comment 8•17 years ago
|
||
Michael - this is um, awesome.
Can we make sure to use the SSE2 detection code already in:
http://lxr.mozilla.org/mozilla/source/jpeg/jdapimin.c#481
So this doesn't break on non SSE2 machines?
Anyone else want to take this to GCC linux/mac?
Comment 9•17 years ago
|
||
Just a note that adapting some of bug 386651 to cover JPEG might be a great way to get automated test coverage for these sorts of changes. Dolske may take a look.
Comment 10•17 years ago
|
||
(In reply to comment #8)
> Michael - this is um, awesome.
>
> Can we make sure to use the SSE2 detection code already in:
>
> http://lxr.mozilla.org/mozilla/source/jpeg/jdapimin.c#481
>
> So this doesn't break on non SSE2 machines?
>
Um, nevermind. You've already got that covered :-) Sorry for the noise.
Comment 11•17 years ago
|
||
Comment on attachment 296062 [details] [diff] [review]
SSE2 JPEG Color optimization and use static instead of dynamic tables
After I have R'ed the (updated) patch, then pavlov only needs to do the SR.
Attachment #296062 -
Flags: review?(pavlov)
Attachment #296062 -
Flags: review?(alfredkayser)
Attachment #296062 -
Flags: review-
Comment 12•17 years ago
|
||
A few minor comments:
1. Is __declspec(align(16)) really needed? It may cause issues with other compilers...
2. No tabs please (that is why the diff view show no indenting...)
3. Instead of:
#ifndef HAVE_SSE2_INTEL_MNEMONICS
oldcode
#else
if (SSE2Available==1)
ssecode
else
oldcode
#endif
do
#ifdef HAVE_SSE2_INTEL_MNEMONICS
if (SSE2Available==1)
ssecode
else
#else
oldcode
#endif
to prevent code duplication
4. The tables should be made 'const' so that they can be placed in shared/readonly memory.
5. The static table optimization is valid for all platforms, not just when SSE2 is enabled.
6. In jpeg_calc_output_dimensions replace:
JSAMPLE * table;
int i;
#ifdef HAVE_SSE2_INTEL_MNEMONICS
/* Use a static table for Win32/MSVC */
table = (JSAMPLE *) static_range_table;
table += (MAXJSAMPLE+1); /* allow negative subscripts of simple table */
cinfo->sample_range_limit = table;
#else
...
with
#ifdef HAVE_SSE2_INTEL_MNEMONICS
/* Use a static table for Win32/MSVC */
/* allow negative subscripts of simple table */
cinfo->sample_range_limit = ((JSAMPLE *) static_range_table) + MAXJSAMPLE+1;
#else
JSAMPLE * table;
int i;
...
As, the 'i' will be warned as 'unused', and 'table' is not really used in the static code variant.
7. Last but not least: When static tables are used 'cconvert' is now now needed anymore, which saves another alloc.
Comment 13•17 years ago
|
||
Just for grins, I added this patch and turned on HAVE_SSE2_INTEL_MNEMONICS for OS X, but that drags in some exisiting inline asm, which needs -fasm-blocks and some other tweaking just to get to compile. Not clear if it's worth enabling only this patch for OS X, or if we should try to get all the SSE2 bits going.
Assignee | ||
Comment 14•17 years ago
|
||
The Windows patch above was derived from this Mac OSX patch which applies against 2.0.0.6 and has the idct, color, sample and static table code in it. The builds are at http://www.vector64.com/elliott.html
I believe that the dct_method has to be changed from JDCT_ISLOW to JDCT_IFAST in nsJPEGDecoder.cpp and you need to do it by hand. I don't know whether or not the patch will apply to FF 3.0.
Assignee | ||
Comment 15•17 years ago
|
||
This html file should be placed in a directory with jpeg files and modified to point to those jpeg files. To run the benchmark, drag the file to the browser. Exit the browser, reopen it and then drag it again. The first time will load the files into system disk cache so that the benchmark will better reflect image rendering time. You need to get enough images to be able to see a difference easily and the reasonable size will depend on your system. I like to shutdown as many background processes as possible when I run benchmarks.
Google Images, a digital camera, digital scanners or other websites can provide images. One should try to get a variety of images, maybe some black and whites.
This test doesn't work on Opera and Safari as they apparently indicate that the page is done after the first page is displayed.
Comment 16•17 years ago
|
||
I'd like to get this for b3..
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
Comment 17•17 years ago
|
||
I've got another blocker to spank first, then can help get this going on Linux and Mac if needed.
Assignee | ||
Comment 18•17 years ago
|
||
Thanks to Alfred for doing the review.
I cleaned up the code today and added support for Mac OSX for the color routines and tested those out and they look good from a quick perusal. I need to do a little more testing on images that I normally test JPEG changes with - I just don't store my stock images on the Mac.
The Mac support provides a template for adding Linux support. Someone just needs to do the code for turning on intrinsics based on the combination of hardware, OS and compiler and adding code to check the processor type to determine SSE2 support and set SSE2Available accordingly.
I found a few surprising things on the Mac OSX platform related to performance using a bunch of JPEGS.
mmoy 2.0.0.6 build: 9 seconds
2.0.0.11 Official: 18 seconds
3.0b2 37 seconds
Trunk build 37 seconds
Trunk with SSE2 code 34.5 seconds
So it appears to me that there has been a major regression from the 2.0.0.* line to 3.0.
So I ran some tests on Windows (the image files are different):
Trunk with SSE2: 3.6 seconds
mmoy 2.0.0.11: 2.35 seconds
3.0b2: 5.29 seconds
Trunk: 4.79 seconds (improvements between beta2 and today?)
Official 2.0.0.11: 6.02 seconds
The Windows results look reasonable. The Mac OSX results look very unattractive.
Assignee | ||
Comment 19•17 years ago
|
||
A few minor comments:
1. Is __declspec(align(16)) really needed? It may cause issues with
other compilers...
It isn't needed in the tables. It is needed in the constants as
there are instructions used which require alignment. Note that
it is currently used in jidctint.c.
I've removed it from the tables in jdmaster.c and jdcolor.c
I put some code in jconfig.h to make the declarations support
the alignment attributes in MSVC and gcc for the two datatypes
that I used.
2. No tabs please (that is why the diff view show no indenting...)
I think that the modules are clean now. I used M-x untabify to get
rid of the tabs.
3. Instead of:
#ifndef HAVE_SSE2_INTEL_MNEMONICS
oldcode
#else
if (SSE2Available==1)
ssecode
else
oldcode
#endif
do
#ifdef HAVE_SSE2_INTEL_MNEMONICS
if (SSE2Available==1)
ssecode
else
#else
oldcode
#endif
to prevent code duplication
The oldcodes weren't the same. I added a small optimization to the
second version to factor out an addition so that the addition
operation is only done once instead of three times.
But I made the modifications and added the range_scale_y optimization.
And took out the register declarations. In Windows, they actually
generate slower code as variables are needlessly pushed to the stack
to keep the latest variables in registers.
Fixed in jdcolor.c
4. The tables should be made 'const' so that they can be placed in
shared/readonly memory.
Fixed in jdcolor.c.
5. The static table optimization is valid for all platforms, not just
when SSE2 is enabled.
Fixed in jdcolor.c and jdmaster.c.
6. In jpeg_calc_output_dimensions replace:
JSAMPLE * table;
int i;
#ifdef HAVE_SSE2_INTEL_MNEMONICS
/* Use a static table for Win32/MSVC */
table = (JSAMPLE *) static_range_table;
table += (MAXJSAMPLE+1); /* allow negative subscripts of simple table */
cinfo->sample_range_limit = table;
#else
...
with
#ifdef HAVE_SSE2_INTEL_MNEMONICS
/* Use a static table for Win32/MSVC */
/* allow negative subscripts of simple table */
cinfo->sample_range_limit = ((JSAMPLE *) static_range_table) +
MAXJSAMPLE+1;
#else
JSAMPLE * table;
int i;
...
As, the 'i' will be warned as 'unused', and 'table' is not really used in the
static code variant.
Fixed in jdmaster.c
7. Last but not least: When static tables are used 'cconvert' is now
now needed anymore, which saves another alloc.
There is at least one other field in cconvert that are still used. This is
in routine jinit_color_deconverter (jdcolor.c):
cconvert->pub.start_pass = start_pass_dcolor
-------------
- Added #ifdefs around #include emmtrin.h in jdcolor.c.
- Added HAVE_SSE2_INTRINSICS. HAVE_SSE2_INTEL_MNEMONICS currently means
MSVC SSE2 support.
- Added Mac OSX support.
- Protected SSE2Available in jddctmgr.c (not sure why this compiled on non-MSVC
platforms before as it's only defined in jdapimin for MSVC).
- Added templates for other platforms in jdapimin.c and jmorecfg.h.
Attachment #296062 -
Attachment is obsolete: true
Attachment #296062 -
Flags: superreview?(pavlov)
Assignee | ||
Updated•17 years ago
|
Attachment #296901 -
Flags: review?(alfredkayser)
Updated•17 years ago
|
Attachment #296901 -
Flags: superreview?(pavlov)
Assignee | ||
Comment 20•17 years ago
|
||
Out of curiosity, does the superreview build the code changes or do others build and test changes or does code generally go in after review?
Comment 21•17 years ago
|
||
(In reply to comment #18)
> So I ran some tests on Windows (the image files are different):
>
> Trunk with SSE2: 3.6 seconds
> mmoy 2.0.0.11: 2.35 seconds
> 3.0b2: 5.29 seconds
> Trunk: 4.79 seconds (improvements between beta2 and today?)
> Official 2.0.0.11: 6.02 seconds
>
> The Windows results look reasonable. The Mac OSX results look very
> unattractive.
>
Bug 411718 related?
What else in your build accounts for the difference between this patch and mmoy builds?
Comment 22•17 years ago
|
||
Super-reviewers, and reviewers for that matter, are not obligated to apply, build or test the patch. You should ask if you want that to happen as part of the code review process.
/be
Comment 23•17 years ago
|
||
http://www.mozilla.org/hacking/reviewers.html gives a description of how the super-review process works.
Assignee | ||
Comment 24•17 years ago
|
||
> What else in your build accounts for the difference between this patch
> and mmoy builds?
On the build side, there is -O2, -GL and PGO. I think that I can make the case that we should be building with -O2 on Windows though perhaps that should be done elsewhere. My guess is that -GL provides a 10% boost and that PGO provides about an 8% boost. This is at the cost of image size. In 2.0, I always do static builds instead of shared builds but the addition of libxul results in a build error when I try to static builds in 3.0 so I'm doing shared builds. Static builds allow more whole program optimization (-GL).
On the code side, there are a few other things though my memory is a little foggy as the performance code was done before 2007. Reducing the copying around
of data certainly helps. I had an optimization to get rid of the division operations in GFXImageFrame. I may have had a small Huffman optimization - not sure if that was Win32 or x64. And I think that I'm using optimized memset, memcpy and memmove routines for large memory operations. The optimized memcpy/memmove routine is about 1,200 lines of assembler. I think that I inlined some of the Mozilla string functions or switched to inline Microsoft string functions and improved some locking code in the PL area. I'm not sure if this affected JPEG rendering.
Microsoft's VS2005 SSE2 memcpy/memmove implementation is pretty poor. There's a test for SSE2 and if it is available, there are tests to see whether the source and destination are both 16-byte aligned. If they are, then a MOVDQA copy loop is run to do the copy with some scalar code to do the remainder. If not, then the regular scalar code is used (sliding doubleworld copy I think). If you don't align structures, at worst, you'd use the SSE2 copy one out of 256 times. Most structures are probabily aligned on four bytes so that's one out of 16 times.
Optimzed mem* routines typically look at the size of what you're copying and, if large, use MMX or SSE register copies. I've read that some even use f87 for Pentium 1s. One can take advantage of SIMD instruction parallelism, non-temporal writes, data prefetching and data alignment to get the best performance in the mem* routines.
One of the reasons why I want to play with ICC is that I think that they have heavily optimized c-runtime libraries, at least for mem*.
Comment 25•17 years ago
|
||
(In reply to comment #18)
37 seconds
> Trunk build 37 seconds
> Trunk: 4.79 seconds (improvements between beta2 and today?)
> Official 2.0.0.11: 6.02 seconds
>
> The Windows results look reasonable. The Mac OSX results look very
> unattractive.
>
I can confirm a big disparity between Windows/Mac. On same hardware using today's trunk running the same test I get:
Vista: 3s
Mac: 24s
You can watch the mac draw the JPEG's - so something is very very wrong. I've got a series of test images that I'll upload once my machine is back up and running later today. I'll file a separate bug on that.
Assignee | ||
Comment 26•17 years ago
|
||
(In reply to comment #25)
> (In reply to comment #18)
> 37 seconds
> > Trunk build 37 seconds
> > Trunk: 4.79 seconds (improvements between beta2 and today?)
> > Official 2.0.0.11: 6.02 seconds
> >
> > The Windows results look reasonable. The Mac OSX results look very
> > unattractive.
> >
>
> I can confirm a big disparity between Windows/Mac. On same hardware using
> today's trunk running the same test I get:
>
> Vista: 3s
> Mac: 24s
>
> You can watch the mac draw the JPEG's - so something is very very wrong. I've
> got a series of test images that I'll upload once my machine is back up and
> running later today. I'll file a separate bug on that.
The images that I used in my Windows tests weren't the same as on the Mac tests. What I wanted to show is that Mac Trunk is taking twice as long as Mac 2.0.0.11 while Windows Trunk is faster than Windows 2.0.0.11. I'm pretty sure that jpeg rendering on Mac is quite a bit slower than Windows with both on 2.0.0.11. But those using the Mac want to see something faster than what they have now. Getting something that takes quite as long would be considered a step backwards.
Comment 27•17 years ago
|
||
You want to check bug 411718, where another optimization is in the patch review process for JPEG decoding.
Yet another optimization is to make the JPEG decoder in libjpeg itself convert the colorplanes directly to Cairo format (RGB24, which is actually a DWORD with (B | G>>8 | R>>16). Currently libjpeg converts from colorpanes to packed RGB (3 bytes per pixel) after which the decoder converts packed RGB to Cairo's RGB24.
This can be done adding a new 'color converter' into libjpeg (and that one could then also be SSE2 optimized...)
Comment 28•17 years ago
|
||
Comment on attachment 296901 [details] [diff] [review]
Changes after Alfred's comments
A few more minor review-nits:
1. Alignment is in multiple places not correct: e.g jdapimin.c line 53 is indented too much. Indenting according to the overall style in libjpeg is 2 space per indent.
2. For a patch with so many 'whitespace only change' it would be good to also add a whitespace ignorant diff (-ws) for the review of real changes.
3. In jdcolor.c, in my_color_deconverter the four table pointers are no longer required (use #if 0), otherwise space will still be allocated for them.
4. Use the reference to the static table directly instead of via a variable:
const int * Crrtab = Cr_r_tab; // This is not needed...
With that addressed, R+
Attachment #296901 -
Flags: review?(alfredkayser) → review+
Comment 29•17 years ago
|
||
I've been comparing Mac versions (pre-this-patch), and am getting 4 secs on fx2 vs 8.5 secs on recent trunk, so consistent wth Michael's numbers. The same test on Safari reports a tiny fraction of a second, I suspect that they are rendering asynchronously to JS execution, but even so the page is coming up in considerably less wall clock time, even from a cold start. (A second run comes up instantaneously, they must cache CGImages in memory or something.)
Assignee | ||
Comment 30•17 years ago
|
||
(In reply to comment #29)
> I've been comparing Mac versions (pre-this-patch), and am getting 4 secs on fx2
> vs 8.5 secs on recent trunk, so consistent wth Michael's numbers. The same test
> on Safari reports a tiny fraction of a second, I suspect that they are
> rendering asynchronously to JS execution, but even so the page is coming up in
> considerably less wall clock time, even from a cold start. (A second run comes
> up instantaneously, they must cache CGImages in memory or something.)
The Javascript timer doesn't work on Safari and Opera. I believe that they report "done" when the screen is finished painting which obviously is only a part of the rendering of images.
Comment 31•17 years ago
|
||
(In reply to comment #30)
> (In reply to comment #29)
> > I've been comparing Mac versions (pre-this-patch), and am getting 4 secs on fx2
> > vs 8.5 secs on recent trunk, so consistent wth Michael's numbers. The same test
> > on Safari reports a tiny fraction of a second, I suspect that they are
> > rendering asynchronously to JS execution, but even so the page is coming up in
> > considerably less wall clock time, even from a cold start. (A second run comes
> > up instantaneously, they must cache CGImages in memory or something.)
>
> The Javascript timer doesn't work on Safari and Opera. I believe that they
> report "done" when the screen is finished painting which obviously is only a
> part of the rendering of images.
>
The onload handler does fire on Safari before the images are decoded/painted. The scoll event I put in the testcase (in the URL I provided) should force Safari to paint (and thus decode) but I wouldn't worry to much about those #'s. The 2->3 FF numbers are the ones to watch.
Comment 32•17 years ago
|
||
So the good news is that the patch applies cleanly to my trunk build, and seems to work correctly on OS X 10.4. The bad news is that I'm not seeing any noticeable speedup. Out of paranoia, I ran it under the debugger and confirmed that indeed it was considering SSE2 to be available, executing the intrinsics code in ycc_rgb_convert, and that the code generation looked reasonable.
Comment 33•17 years ago
|
||
Random comment: The imgITools helpers I recently added might be useful for testing here; the decodeImageData() routine basically works directly with the image decoders.
For example, see:
http://mxr.mozilla.org/seamonkey/source/modules/libpr0n/test/unit/test_imgtools.js#174
Assignee | ||
Comment 34•17 years ago
|
||
(In reply to comment #32)
> So the good news is that the patch applies cleanly to my trunk build, and seems
> to work correctly on OS X 10.4. The bad news is that I'm not seeing any
> noticeable speedup. Out of paranoia, I ran it under the debugger and confirmed
> that indeed it was considering SSE2 to be available, executing the intrinsics
> code in ycc_rgb_convert, and that the code generation looked reasonable.
What are you using for a benchmark? I'm seeing the improvement greatly reduced with the other Mac OSX issue around. I'll try the patch on 2.0.0.11 to see if it applies. It should be easier to see the difference without the other noise.
Comment 35•17 years ago
|
||
I've got a stash of 25 500K-1M plant photos (read: noisy), in your benchmark framework. I should try dialing it up to 100, in case the perf difference is too small to show clearly - I am getting some random jumping around between 2nd, 3rd, 4th, etc runs, so maybe my macbook is being churned by something else.
Comment 36•17 years ago
|
||
(In reply to comment #35)
> I've got a stash of 25 500K-1M plant photos (read: noisy), in your benchmark
> framework. I should try dialing it up to 100, in case the perf difference is
> too small to show clearly - I am getting some random jumping around between
> 2nd, 3rd, 4th, etc runs, so maybe my macbook is being churned by something
> else.
>
Given the profile in the dependent bug:
40.3% CoreGraphics argb32_image_mark_rgb32
24.7% CoreGraphics resample_byte_h_3cpp_vector
8.5% CoreGraphics decode_byte_8bpc_3
6.0% CoreGraphics decode_data
5.1% XUL jpeg_idct_islow
4.1% XUL ycc_rgb_convert
2.8% XUL decode_mcu
1.1% XUL h2v1_fancy_upsample
I'm not surprised you are seeing no advantages on mac. I'd suggest we fix that bug first, then see what this does.
Assignee | ||
Comment 37•17 years ago
|
||
There is clearly a very wide performance variation in this bug. The 3.0 patch didn't apply against 2.0.0.11 as there were two failures. I have limited time on the Mac in the evenings so I didn't press for the machine.
If you want to see the difference in performance on the Mac, I can only point you to http://www.vector64.com/elliott.html where you can compare the 2.0.0.6 build to the official 2.0.0.* build. That build has the kitchen sink in it including jdcolor.
Comment 38•17 years ago
|
||
Michael, possibly this patch is better split into two: static tables and SSE2 optimization.
Even more, check out bug 412753 where I moved this YCrCb conversion to the JPEG decoder (to allow to generate directly into Ciaro format). It would be even better to apply SSE2 tricks that direct conversion.
Assignee | ||
Comment 39•17 years ago
|
||
I think that it would make more sense to optimize the color code in nsJPEGDecoder.cpp than it would here as the routine in jdcolor.c will never get called. Perhaps it would be better to change this bug into just doing the static tables and putting in the plumbing for portable SSE2.
What I wanted to do was to get in the infrastructure and color code and then do portable idct and maybe jdsample in other transactions. And follow that up by moving the SSE2 infrastructure out of JPEG and into a common area of code that could be used for other areas of code such as PNG.
Regarding libjpeg, when that option is used, are the pointers changed to use different routine names for the jpeg routines or does the mozilla copy of jpeg not get compiled? If the latter, then we'd need to put the SSE2 infrastucture somewhere else right away.
Comment 40•17 years ago
|
||
Having the optimized color conversion code in nsJPEGDecoder makes it so that that code will be used in both scenarios: compiled/linkedin libjpeg v. a system linked libjpeg. (note as far as I know the 'system linked libjpeg' is only done on Linux).
Assignee | ||
Comment 41•17 years ago
|
||
This patch contains Alfred's changes from Bug 411718 and Bug 412753 and SSE2 optimization of the latter bug. It doesn't have platform stuff though which I'm working on next. This code will likely go into another Bug. Overall performance improvement from Alfred's two bugs, the static table stuff here and the SSE2 optimization of nsJPEGDecoder is 30% using my old benchmark.
Comment 42•17 years ago
|
||
Comment on attachment 296901 [details] [diff] [review]
Changes after Alfred's comments
is it possible to get this patch without all the tab removal changes? it is kind of a pain to review with them. (do we want them at all?)
Comment 43•17 years ago
|
||
also, should we use the cpuid intrinsic with msvc2005 rather than the inline assembly? probably doesn't matter.
Assignee | ||
Comment 44•17 years ago
|
||
(In reply to comment #42)
> (From update of attachment 296901 [details] [diff] [review])
> is it possible to get this patch without all the tab removal changes? it is
> kind of a pain to review with them. (do we want them at all?)
I was taking them all out this morning before leaving for the office. I was going to ask Alfred to give me a pass on the tab/space stuff given that this was an external contribution and it appears that changes to the code since then haven't been untabified.
The code change will be a lot simpler without the color code as that is moving to nsJPEGDecoder.cpp. This change is just going to be the static tables and intrinsics infrastructure. Unfortunately I will have to replicate that in nsJPEGDecoder (or somewhere else if that's more appropriate).
I'll look into the CPUID intrinsic. It would certainly make the code a lot neater. As long as there are no objections to not running on compilers below
VS2005.
Assignee | ||
Comment 45•17 years ago
|
||
I removed the space changes (converting tabs to spaces) and left the original spacing as intact as I could and what's left is infrastructure for intrinsics and using static tables instead of dynamic tables. I can redo the spacing stuff if it's required to get this in.
Attachment #296901 -
Attachment is obsolete: true
Attachment #296901 -
Flags: superreview?(pavlov)
Assignee | ||
Comment 46•17 years ago
|
||
This patch includes Alfred's code to do certain color/cairo conversion in nsJPEGDecoder.cpp.
Attachment #297746 -
Attachment is obsolete: true
Assignee | ||
Comment 47•17 years ago
|
||
Does Alfred need to review this again or does it go to someone else now?
Updated•17 years ago
|
Attachment #298019 -
Flags: review?(pavlov)
Comment 48•17 years ago
|
||
(In reply to comment #47)
> Does Alfred need to review this again or does it go to someone else now?
>
Let's have pav take the final review..
Comment 49•17 years ago
|
||
Comment on attachment 298019 [details] [diff] [review]
nsJPEGDecoder.cpp SSE2 jdcolor (cairo) optimization
rather than ycc_rgb_convert_cairo and mUseCairo, use argb/ARGB instead of 'cairo'?
this patch looks fine, but it is hard to tell what is in this patch vs what is in alfred's patch in another bug. Would be great to split this patch out.
Assignee | ||
Comment 50•17 years ago
|
||
Sorry, the patch to review was "SSE2 Intrinsics Support and Static Tables" which is the second to last patch. The last one has the changes that include Alfred's patches which I was just saving here until Alfred's two patches show up. The "SSE2 Intrinsics Support and Static Tables" patch is orthogonal to Alfred's two changes and the changes in the last patch
Updated•17 years ago
|
Attachment #298019 -
Flags: review?(pavlov)
Comment 51•17 years ago
|
||
Comment on attachment 298016 [details] [diff] [review]
SSE2 Intrinsics Support and Static Tables
Adjusting review flags
Attachment #298016 -
Flags: review?(pavlov)
Comment 52•17 years ago
|
||
Comment on attachment 298016 [details] [diff] [review]
SSE2 Intrinsics Support and Static Tables
can you add some comments around the if 0s or perhaps remove them entirely? aside from that this looks fine.
Attachment #298016 -
Flags: review?(pavlov) → review+
Comment 53•17 years ago
|
||
Is the output from this code expected to be 100% pixel identical to the existing code? Or might there be +/- 1 rounding differences? I have a set of JPEGs to use for tests (bug 411626), but they'll be compared against reference PNGs.
Assignee | ||
Comment 54•17 years ago
|
||
(In reply to comment #53)
> Is the output from this code expected to be 100% pixel identical to the
> existing code? Or might there be +/- 1 rounding differences? I have a set of
> JPEGs to use for tests (bug 411626), but they'll be compared against reference
> PNGs.
This code is expected to be 100% pixel identical to the existing code but this code doesn't do that much. It just takes the dynamically generated tables (which are the same for every image) and puts them into a static array. The size and type of the static array is the same as that of the dynamically generated array as the static array was generated from the output of the dynamically generated array. The rest of the changes just adds intrinsics infrastructure.
The color changes to go in later may have +/- 1 rounding differences as they use 16-bit intermediate calculations instead of 32-bit intermediate calculations for addition and subtraction. Multiplication uses 32-bit intermediate calculations but results are stored in 16-bits. This is how you get the parallelism of doing eight operations at the same time. This kind of processing is already in the Windows code for systems supporting SSE2 in the form of the current Inverse Discrete Cosine Transform code.
Assignee | ||
Comment 55•17 years ago
|
||
(In reply to comment #52)
> (From update of attachment 298016 [details] [diff] [review])
> can you add some comments around the if 0s or perhaps remove them entirely?
> aside from that this looks fine.
Will do.
Assignee | ||
Comment 56•17 years ago
|
||
There are three #if 0 statements that I added and one didn't have a comment so I added a comment there. On one of the others, I moved the comment to before the #if 0.
Attachment #298016 -
Attachment is obsolete: true
Updated•17 years ago
|
Keywords: checkin-needed
Updated•17 years ago
|
Attachment #298856 -
Flags: superreview+
Attachment #298856 -
Flags: review+
Comment 57•17 years ago
|
||
Some more minor comments:
1. in jdcolor.c:
/* These fields are needed anymore as these are now static tables */
--> not needed anymore
2. The comment on line 241 is no longer valid:
241 * We assume build_ycc_rgb_table has been called.
3. In ycck_cmyk_convert:
convert:
outptr[0] = range_limit[MAXJSAMPLE - (y + Cr_r_tab[cr])]
etc...
also to:
range_limit_y = range_limit + MAXJSAMPLE - y;
outptr[0] = range_limit_y + Cr_r_tab[cr];
etc...
4. In prepare_range_limit_table (j_decompress_ptr cinfo)
cinfo->sample_range_limit = (JSAMPLE *) static_range_table + (MAXJSAMPLE+1);;
Remove the second ;
Assignee | ||
Comment 58•17 years ago
|
||
The four items have been addressed. I wasn't able to find an image to test the ycck_cmyk_convert change as it seems to be pretty rare on the web.
Attachment #298856 -
Attachment is obsolete: true
Comment 59•17 years ago
|
||
Comment on attachment 298929 [details] [diff] [review]
Fixes to Alfred's latest comments (4)
Final Review..
Attachment #298929 -
Flags: review?(pavlov)
Updated•17 years ago
|
Attachment #298929 -
Flags: review?(pavlov) → review+
Updated•17 years ago
|
Flags: in-testsuite?
Comment 60•17 years ago
|
||
Leaving open for the second part of the fix. Please address all nits and request review.
Checking in jpeg/jconfig.h;
/cvsroot/mozilla/jpeg/jconfig.h,v <-- jconfig.h
new revision: 3.12; previous revision: 3.11
done
Checking in jpeg/jdapimin.c;
/cvsroot/mozilla/jpeg/jdapimin.c,v <-- jdapimin.c
new revision: 3.8; previous revision: 3.7
done
Checking in jpeg/jdcolor.c;
/cvsroot/mozilla/jpeg/jdcolor.c,v <-- jdcolor.c
new revision: 3.3; previous revision: 3.2
done
Checking in jpeg/jddctmgr.c;
/cvsroot/mozilla/jpeg/jddctmgr.c,v <-- jddctmgr.c
new revision: 3.4; previous revision: 3.3
done
Checking in jpeg/jdmaster.c;
/cvsroot/mozilla/jpeg/jdmaster.c,v <-- jdmaster.c
new revision: 3.3; previous revision: 3.2
done
Checking in jpeg/jmorecfg.h;
/cvsroot/mozilla/jpeg/jmorecfg.h,v <-- jmorecfg.h
new revision: 3.20; previous revision: 3.19
done
Comment 61•17 years ago
|
||
(In reply to comment #60)
> Leaving open for the second part of the fix. Please address all nits and
> request review.
>
What second part?
Assignee | ||
Comment 62•17 years ago
|
||
The original patch did SSE2 color processing, added general SSE2 intrinsics support and put in static tables instead of dynamic tables. Alfred has a change pending that moves the color processing and byteswap into nsJPEGDecoder.cpp so I changed this patch to just do SSE2 intrinsics support and static tables - I took out the color SSE2 processing. After Alfred's change goes through (improve performance by 10%), I will add the SSE2 color processing to Alfred's combined code in nsJPEGDecoder.cpp.
I was planning to do that in another bug but can do it here.
Comment 63•17 years ago
|
||
Is there a bug for the Mac jpeg performance regression?
Comment 64•17 years ago
|
||
Comment 65•17 years ago
|
||
(In reply to comment #62)
> I was planning to do that in another bug but can do it here.
Actually, another bug is fine. I'll resolve this bug then, and you can open a new bug.
Status: ASSIGNED → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Comment 66•17 years ago
|
||
the recent addition of this line:
jdapimin.c:
#include "intrin.h"
Is *busting* my build. My system does not have this include file ...
I had to add an:
#undef HAVE_MMX_INTEL_MNEMONICS
to work around ...
Please adviese ...
Comment 67•17 years ago
|
||
I think we have to move the SSE2 detection code to somewhere else, as the mozilla/jpeg code is not used when compiling with system-jpeg.
A good place would be to put the SSE2 detection is in gfxPlatform.h/cpp.
All image/rendering code include gfxPlatform.
See also bug 414947.
Assignee | ||
Comment 68•17 years ago
|
||
Should we have it in the JPEG library as well as gfxPlatform.h? It would be nice to call it from browser initialization so that it is called before anything else uses it and so that we don't have to test whether the init was called each time.
Comment 69•17 years ago
|
||
For code inside libjpeg that does SSE2, the check is needed inside libjpeg.
For code inside image decoders and libraries we need to have this check in gfxPlatform.
See also bug 415054.
Note, Michael, it looks as if the code in the block copy/convert in bug 414947 is something that could be SSE2 enabled (as a followup bug to that one).
Assignee | ||
Comment 70•17 years ago
|
||
Is there initialization code for GFX that gets called on browser startup where we can put the CPUID check? I would rather do it once instead of checking to see that we sniffed cpu capabilities for every use of SSE2 in GFX.
gfxPlatform
Comment 72•17 years ago
|
||
(In reply to comment #15)
> Created an attachment (id=296230) [details]
> Image benchmark framework
Appropo of nothing:
I populated this HTML file with 2600 JPEG images, pulled from my web proxy logs after 10 days of casual browsing. After several runs on SeaMonkey v1.1.8/Linux, this is what the high-level system profiling looks like:
samples % app name symbol name
62231 24.1899 libnecko.so nsMemoryCacheDevice::EvictEntriesIfNecessary()
23650 9.1930 libjpeg.so.62.1.0 decode_mcu
16191 6.2936 libgdk-x11-2.0.so.0.1000.14 (no symbols)
15001 5.8311 libfb.so (no symbols)
9709 3.7740 libc-2.6.so memcpy
9013 3.5035 libjpeg.so.62.1.0 ..@16.bs
7583 2.9476 libjpeg.so.62.1.0 jpeg_idct_islow_sse2.column_end
6376 2.4784 libjpeg.so.62.1.0 jpeg_idct_islow_sse2.columnDCT
6301 2.4493 Xorg (no symbols)
3614 1.4048 nvidia_drv.so (no symbols)
3374 1.3115 libgklayout.so nsLineBox::IndexOf(nsIFrame*) const
3147 1.2233 libqt-mt.so.3.3.8 (no symbols)
2823 1.0973 nvidia (no symbols)
2795 1.0864 libc-2.6.so _int_malloc
I just mention this because I'm intrigued by the amount of time spent in disk cache maintainance, nearly as much time as is spent overall in the JPEG library. Hmmm...
Comment 73•17 years ago
|
||
(In reply to comment #72)
> (In reply to comment #15)
> > Created an attachment (id=296230) [details] [details]
> > Image benchmark framework
>
> Appropo of nothing:
>
> I populated this HTML file with 2600 JPEG images, pulled from my web proxy logs
> after 10 days of casual browsing. After several runs on SeaMonkey
> v1.1.8/Linux, this is what the high-level system profiling looks like:
>
> samples % app name symbol name
> 62231 24.1899 libnecko.so
> nsMemoryCacheDevice::EvictEntriesIfNecessary()
> I just mention this because I'm intrigued by the amount of time spent in disk
> cache maintainance, nearly as much time as is spent overall in the JPEG
> library. Hmmm...
>
Filed bug 417154 to capture this.
You need to log in
before you can comment on or make changes to this bug.
Description
•