477728 - JPEG image decoding bug (sse idct)

Assignee

Description

•

16 years ago

This image should render as gray with repeating dct blocks of 66, 122, and 128. It shows up with slight lines when rendered. The dct blocks look like: 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 65 65 65 65 65 65 65 65 65 65 65 65 65 65 65 65 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 66 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 121 121 121 121 121 121 121 121 121 121 121 121 121 121 121 121 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 122 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 128 We need to figure out the cause of the differences in the results for rows 4 and 5 to see if we can fix them. This bug was initially noted in Bugzilla 247437 and was reported by Marat Tanalin. Further references are seen at http://support.mozilla.com/tiki-view_forum_thread.php?comments_parentId=160061&thread_sort_mode=commentDate_asc&forumId=1

Michael Moy

Assignee

Comment 1

•

16 years ago

With this input, -500 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 We're getting this on scaler: -2000 -2000 -2000 -2000 -2000 -2000 -2000 -2000 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 And this with SIMD: -1984 -1984 -1984 -1984 -1984 -1984 -1984 -1984 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 The -1984 to -2000 is most likely okay as we ultimately only have 8 bits of data coming out but those 1's shouldn't be there. It's likely some kind of rounding issue which I'm trying to track down. The scaler routine does columns then rows while the SIMD routine does it the other way around. I transposed the display of the scaler dct matrix to make it easier to compare with the SIMD routine.

Michael Moy

Assignee

Comment 2

•

16 years ago

The rounding code is causing this problem. This is one-half of the row code and values should be zero for row 1 as the entire row is zero but we're adding a rounding correction factor of 5208 and then shifting 12 bits or dividing by 4096 which results in 1. __asm pshuflw xmm0, xmm0, 11011101b (0) __asm pshufhw xmm1, xmm1, 10001000b (0) __asm pshufhw xmm0, xmm0, 11011101b (0) __asm movdqa xmm2, XMMWORD PTR [TABLE] (22725, 29692, 22725, 12299, etc) __asm pmaddwd xmm2, xmm1 (0) __asm movdqa xmm3, XMMWORD PTR [TABLE + 32] (31521, 26722, 26722, -6270, etc) __asm pmaddwd xmm3, xmm0 (0) __asm pmaddwd xmm1, XMMWORD PTR [TABLE + 16] (0) __asm pmaddwd xmm0, XMMWORD PTR [TABLE + 48] (0) __asm pshufd xmm1, xmm1, 01001110b (0) __asm pshufd xmm0, xmm0, 01001110b (0) __asm paddd xmm2, XMMWORD PTR [ROUND1] (5802, 0, 5802, 0) __asm paddd xmm3, xmm0 (0) __asm paddd xmm1, xmm2 (5802, 0, 5802, 0) __asm movdqa xmm2, xmm1 (5802, 0, 5802, 0) __asm psubd xmm2, xmm3 (5802, 0, 5802, 0) __asm psrad xmm2, SHIFT_INV_ROW (1, 0, 1, 0) __asm paddd xmm1, xmm3 (5802, 0, 5802, 0) __asm psrad xmm1, SHIFT_INV_ROW (1, 0, 1, 0)

Michael Moy

Assignee

Comment 3

•

16 years ago

The SSE2 implementation uses an algorithm from Rao and Yip and includes rounding code. It seems to me that the rounding code is wrong (an all-zero row should result in an all-zero row). I don't know whether the rounding code is part of the algorithm or if it was added by Intel. I just ordered Rao and Yip and should have it in a few days and will be able to see for myself. The Scalar implementation (IJG) does not do the rounding. It uses a routine from Loeffler, Ligtenberg and Moschytz. The IFAST implementation (jidctfst.c) uses an algorithm from Arai, Agui, and Nakajima. The MMX implementation looks like a straight port from the scalar code. I have my own SSE2 implementation of IFAST that doesn't exhibit the problem in this bug. I'm going to try a straight port of the ISLOW scalar code to SSE2 which should do a better job. The dct book should arrive in a few days and I should be able to see what the Intel SSE2 ISLOW code is trying to do.

Michael Moy

Assignee

Comment 4

•

16 years ago

Attached patch Replace Intel ISLOW code with port of scalar code (deleted) — Details — Splinter Review

This patch is against an older code base but I thought that I'd save the work here. I need to get a newer version of the code. This should work against 3.0 though.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Updated

•

16 years ago

Flags: blocking1.9.2?

Michael Moy

Assignee

Updated

•

16 years ago

Attachment #362430 - Flags: review?(pavlov)

Joe Drew (not getting mail)

Comment 5

•

16 years ago

Michael, is that patch still targeted only at 3.0?

Michael Moy

Assignee

Comment 6

•

16 years ago

I believe that jidctint.c is the same for 3.0 and up so it should apply to 3.0, 3.1 and 3.2.

Michael Moy

Assignee

Comment 7

•

16 years ago

One other note about the patch: it should eventually run on other x86 platforms with some platform varianting code in other modules. Everything here is in intrinsics which is common to GCC and MSVC and I've removed the MMX code so it should work in MSVC x64.

Alfred Kayser

Updated

•

16 years ago

Blocks: 242145

Joe Drew (not getting mail)

Updated

•

15 years ago

Attachment #362430 - Flags: review?(pavlov) → review?(joe)

Joe Drew (not getting mail)

Updated

•

15 years ago

Attachment #362430 - Flags: review?(joe) → review?(jmuizelaar)

Jeff Muizelaar [:jrmuizel]

Comment 8

•

15 years ago

What's the performance difference with this patch?

Michael Moy

Assignee

Comment 9

•

15 years ago

When I tested it the last time, it was somewhat faster than the Intel (current code). I thought that I posted the performance differences but it isn't in the bug so I guess I forgot to. I will run performance comparisons tonight and report the results. My main goal in the patch was to fix the problem and also not regress performance.

Michael Moy

Assignee

Comment 10

•

15 years ago

I ran five tests with the patch and five tests without the patch on a 2.5 Ghz Mobile Penryn (model number T9300) running Windows XP SP3 using a recent Firefox 3.0 source code pull. The median numbers showed a 1.96% reduction in time to render with the patch compared to the Intel SSE2 code.

Jeff Muizelaar [:jrmuizel]

Comment 11

•

15 years ago

Sorry this process is taking so long. Here are a couple of comments: At the beginning you write: + * The scalar code carries 13 bits of precision for multiplication in the + * constants. It passes two extra bits of precision from column processing + * into the row processing by descaling two fewer bits than it was scaled by + * (13). We do the same but we do not do the rounding because the SSE + * multiply chops off the bits before we can see them. We have the two extra + * bits, though, which should be good enough. I don't really understand the implications of this. Are the results of this implementation identical to the c version? It looks like this new code doesn't support 12 bit samples, so the code should only be compiled when BITS_IN_JSAMPLE == 8 It would probably be good to include the even part/odd part comments from the c version in the sse2 version. It makes following matching the sse2 code to the c code easier. It would be good if we could add a reftest that tests this problem. I think we should be able to have a jpg and png and make sure they decode the same. This will be even easier if the sse2 code produces exactly the same results as the c code. I tried out compiling the code on OS X with gcc-4.0 and gcc-4.2. From my quick look, neither seemed to do a tremendous job, as the resulting code had what looked to be a lot of data movement. I wonder how much could be gained from a hand-coded implementation...

Marat Tanalin

Comment 12

•

15 years ago

(In reply to comment #11) > It would be good if we could add a reftest that tests this problem. I think we > should be able to have a jpg and png and make sure they decode the same. Original full testcase page with JPEG and PNG images included: http://tanalin.com/_experimentz/firefox_bugs/2009/jpeg-decoding-errors/ It was initially posted in my comment to the related bug: https://bugzilla.mozilla.org/show_bug.cgi?id=247437#c85

Michael Moy

Assignee

Comment 13

•

15 years ago

> I don't really understand the implications of this. Are the results of > this implementation identical to the c version? I do not know. Perhaps I should run a few tests. The problem is that the Intel code ISN'T identical to the c version and that's what we are using now. Marat Tanalin provided an example which shows a significant difference to the point where I thought that the problem should be fixed. The Intel code was derived from a rather obscure textbook and, in my experience, works fine 99.999% of the time. Marat just found a case where it breaks down. So this patch is my attempt at a straight port of the C code to use SSE2. The most efficient port processes 8 16-bit words at a time so 16-bit arithmetic is used. The eventual output form is 8-bit words. The C code uses 32-bit words for intermediate calculations. > It looks like this new code doesn't support 12 bit samples, so the > code should only be compiled when BITS_IN_JSAMPLE == 8 BITS_IN_JSAMPLE is set to 8 in jmorecfg.h. The current Intel code that is executed does not have a check for BITS_IN_JSAMPLE. If you want me to put one in, I can. > It would probably be good to include the even part/odd part comments > from the c version in the sse2 version. It makes following matching > the sse2 code to the c code easier. I will add the comments. > It would be good if we could add a reftest that tests this problem. I > think we should be able to have a jpg and png and make sure they > decode the same. This will be even easier if the sse2 code produces > exactly the same results as the c code. I will do some manual testing in this area with a variety of JPEGs. I do not know how the current regression test system works so I might need some help with that. > I tried out compiling the code on OS X with gcc-4.0 and gcc-4.2. From > my quick look, neither seemed to do a tremendous job, as the resulting > code had what looked to be a lot of data movement. I wonder how much > could be gained from a hand-coded implementation... This code only runs on Windows with MSVC at the moment. The amount of work to get it running on Mac OS X is relatively small but I wanted to get this fix in first to address the issue of incorrect results. I was planning on adding Mac OS X support in a separate bug. I believe that it would be very easy to add Linux support as well but I do not have such a machine available to build and test. The performance improvement on Mac OS X should be somewhere in the 20% area from past experiments.

Jeff Muizelaar [:jrmuizel]

Comment 14

•

15 years ago

(In reply to comment #13) > > I don't really understand the implications of this. Are the results of > > this implementation identical to the c version? > > I do not know. Perhaps I should run a few tests. > > The problem is that the Intel code ISN'T identical to the c version > and that's what we are using now. Marat Tanalin provided an example > which shows a significant difference to the point where I thought > that the problem should be fixed. The Intel code was derived from > a rather obscure textbook and, in my experience, works fine 99.999% > of the time. Marat just found a case where it breaks down. So this > patch is my attempt at a straight port of the C code to use SSE2. Right, I certainly agree that the Intel code that we currently have is broken. However, since we're rewriting it, it would be nice to have code that produces identical results to make testing across multiple platforms easier. (like in bug 411626). I assume then, that when you say: "We have the two extra bits, though, which should be good enough." that you believe that the code should produce identical results but aren't certain. > The most efficient port processes 8 16-bit words at a time so 16-bit > arithmetic is used. The eventual output form is 8-bit words. The C > code uses 32-bit words for intermediate calculations. > > > It looks like this new code doesn't support 12 bit samples, so the > > code should only be compiled when BITS_IN_JSAMPLE == 8 > > BITS_IN_JSAMPLE is set to 8 in jmorecfg.h. The current Intel code > that is executed does not have a check for BITS_IN_JSAMPLE. If you > want me to put one in, I can. Might as well. It will need to be added anyways if we ever want to upstream it. > > It would probably be good to include the even part/odd part comments > > from the c version in the sse2 version. It makes following matching > > the sse2 code to the c code easier. > > I will add the comments. Thanks. > > It would be good if we could add a reftest that tests this problem. I > > think we should be able to have a jpg and png and make sure they > > decode the same. This will be even easier if the sse2 code produces > > exactly the same results as the c code. > > I will do some manual testing in this area with a variety of JPEGs. > I do not know how the current regression test system works so I might > need some help with that. I'll take care of adding the regression test. > > I tried out compiling the code on OS X with gcc-4.0 and gcc-4.2. From > > my quick look, neither seemed to do a tremendous job, as the resulting > > code had what looked to be a lot of data movement. I wonder how much > > could be gained from a hand-coded implementation... > > This code only runs on Windows with MSVC at the moment. The amount of > work to get it running on Mac OS X is relatively small but I wanted to > get this fix in first to address the issue of incorrect results. I was > planning on adding Mac OS X support in a separate bug. I believe that > it would be very easy to add Linux support as well but I do not have > such a machine available to build and test. > > The performance improvement on Mac OS X should be somewhere in the 20% > area from past experiments. This sounds good to me.

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

15 years ago

Flags: blocking1.9.2? → blocking1.9.2-

Marat Tanalin

Comment 15

•

15 years ago

> What |Removed |Added > Flag |blocking1.9.2? |blocking1.9.2- It's a shame since the bug is quite critical for rendering precision/quality and there is working and (as far as I can see) acceptable patch.

Jeff Muizelaar [:jrmuizel]

Comment 16

•

15 years ago

(In reply to comment #15) > > What |Removed |Added > > Flag |blocking1.9.2? |blocking1.9.2- > > It's a shame since the bug is quite critical for rendering precision/quality > and there is working and (as far as I can see) acceptable patch. Sure, but we've already shipped a bunch of releases with this bug so it doesn't seem reasonable to wait for this bug to be fixed before releasing again.

Marat Tanalin

Comment 17

•

15 years ago

(In reply to comment #16) Any feature or bugfix have no immediate effect, but this is not reason to delay or abandon it at all. Fixes for such serious bugs are always subject of "the sooner the better" principle.

Jeff Muizelaar [:jrmuizel]

Comment 18

•

15 years ago

Any progress here?

Jeff Muizelaar [:jrmuizel]

Comment 19

•

15 years ago

Comment on attachment 362430 [details] [diff] [review] Replace Intel ISLOW code with port of scalar code I need more information before I can review this.

Attachment #362430 - Flags: review?(jmuizelaar)

Michael Moy

Assignee

Comment 20

•

15 years ago

The last time I looked at it I needed a verification mechanism to determine if the results from this code are as good or better than the results from the scalar code. I tried to build the verification image but found that there were several pieces missing. I did a little looking around and found those pieces in other products but I never got around to building the program. I will work on doing that.

Makoto Kato [:m_kato]

Updated

•

15 years ago

Blocks: 475225

Marat Tanalin

Comment 21

•

14 years ago

Any progress, guys?..

Jeff Muizelaar [:jrmuizel]

Comment 22

•

14 years ago

I've found a project called jpeg-simd (http://cetus.sakura.ne.jp/softlab/jpeg-x86simd/jpegsimd.html) it has sse2 and mmx versions of the islow idct, that appear to be bit identical to the C version. The idct in this patch does not appear to be bit identical.

Makoto Kato [:m_kato]

Comment 23

•

14 years ago

(In reply to comment #22) > I've found a project called jpeg-simd > (http://cetus.sakura.ne.jp/softlab/jpeg-x86simd/jpegsimd.html) it has sse2 and > mmx versions of the islow idct, that appear to be bit identical to the C > version. The idct in this patch does not appear to be bit identical. Ccing Ichimaru-san who has a patch for Firefox that using custom build.

atzaus

Updated

•

14 years ago

Depends on: 573948

Marat Tanalin

Comment 24

•

14 years ago

Do we going to still have this exciting bug in Firefox 4?..

Justin Lebar (not reading bugmail)

Comment 25

•

14 years ago

This is invalid now that we've switched to libjpeg-turbo.

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → INVALID

Marat Tanalin

Comment 26

•

14 years ago

(In reply to comment #25) > This is invalid now that we've switched to libjpeg-turbo. Indeed, the bug does not appear anymore in Firefox nightlies as of 2011-03-30. Great news. (Though it's sad that the fix was not included in Firefox 4 while the bug is fixed in just one (!) week after Firefox 4 has been released.) By the way, the bug is not invalid, its just fixed--similar, for example, to those bugs that are fixed (not invalid) by switching to completely different HTML5 parser: https://bugzilla.mozilla.org/buglist.cgi?status_whiteboard_type=substring&status_whiteboard=[fixed%20by%20the%20HTML5%20parser]&resolution=FIXED

Ryan VanderMeulen [:RyanVM]

Comment 27

•

14 years ago

Firefox 4 was released a week ago, but it was frozen for new features many months ago. Since libjpeg-turbo was way more than just a simple bugfix, it had to miss the boat. The reason it landed one week after Firefox 4 shipped was precisely because they opened the tree again to new code like this. But at least you won't have to wait long for Firefox 5 :)

Joe Drew (not getting mail)

Updated

•

14 years ago

Resolution: INVALID → WORKSFORME