Closed
Bug 97861
Opened 23 years ago
Closed 14 years ago
Rounding error in nsTransform2D
Categories
(Core :: Layout, defect)
Tracking
()
RESOLVED
FIXED
Future
People
(Reporter: jhpedemonte, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: testcase)
Attachments
(4 files, 3 obsolete files)
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
netscape
:
review+
jesup
:
approval+
|
Details | Diff | Splinter Review |
On OS/2, there are several off-by-one pixel errors where images or text will shift by one pixel when they are clicked. I tracked this down to an issue in gfx/src/nsTransform2D.cpp, where in some cases m20 = 241.4999985 (and got rounded to 241.5), and in other cases m20 = 241.5000000 (and got rounded to 242, a difference of one pixel). This is due to the way the values of m20 and m21 are calculated. They are stored as floats, but the CPU does calculations using doubles. So while all the variables are floats, when doing float arithmetic, everything gets converted to double, then mutltiplied/added/etc, and then saved to a float. This last step causes some information to be truncated. Take this example, where: From the code in nsTransform2D::AddTranslate, where [ m20 += ptX * m00 ]: First time through: m00 = 8.33333358e-02 ptX = 2.76000000e+02 m20 = 0 => When multiplying [ ptX * m00 ], these two float values get changed to doubles in order to the the mutliplication, and then added to (double)m20, which is 0.0f. The result is 2.3000000685453415e+01. This then gets stored in a float, so the result gets truncated to 2.30000000e+01. Now lets look at the second pass. Second time through: m00 = 8.33333358e-02 ptX = -2.76000000e+02 m20 = 2.30000000e+01 => Again, in order to the mathematical operations, the floats get converted to doubles, with the result of [ ptX * m00 ] being -2.3000000685453415e+01. This then gets added to double-converted m20, which is 2.3000000000000000e+01. Therefore, the result of their addition is -6.85453415e-07, although we would expect zero. The problem here is that we are not properly considering of the most significant bits. In the second pass, we should have cast the result of [ ptX * m00 ] to a float, in order to properly truncate the irrelevant data at the end of the number. By saving [ ptX * m00 ] to a temporary float variable, and then adding to m20, we get the correct result of zero.
Reporter | ||
Comment 1•23 years ago
|
||
Reporter | ||
Comment 2•23 years ago
|
||
The above test case should compile on any platform, and should display the error on any intel based platform (os/2, windows, linux, etc). As you can see, saving to temporary float variable removes the errors from the mutliplication. From what I can tell, there are two solutions to this problem. One is to change the internal members of the nsTranform2D class (m00, m01, m10, m11, m20, m21, as well as the temporary variables used throughout) to double. This stores the 'errors' from the mutliplication and produces the correct results. I have tried this on OS/2, and it fixes the one-pixel problems I was seeing. The better solution, though, is to make sure that when adding or subtracting we store the results of mutliplications in temporary float variables, in order to truncate the 'error' bits.
Comment 3•23 years ago
|
||
dcone: This appears similar to http://bugzilla.mozilla.org/show_bug.cgi?id=19601 What do you think of this?
Comment 4•23 years ago
|
||
I also have looked at some of these problems and have found the following case. When you scrolled a page in the Y direction the element looked like it was jiggling because the width of this element would vary by a pixel and this was caused from the calculation of the starting position and the error introduced there. For example.. if the starting y position was 5.3, and the width is 5.4, the right side of the element would be 10.7, rounded you get 11, for a width of 11-5 or... 6. If the starting position was 5 and the width is 5.4 the pixel that the right would land on would be 10.4, rounded would be 10 and the width of the element would be 10-5 or... 5. So the width of this element would jiggle depending on how and were this starting position was calculated. The solution I used to fix this.. was cache this error and use it for the calculation of the width.. to keep the right side from moving, or calculate the width at the time your calculating the left side. We sometimes calculate the right side, then the left side and I have tried to hunt those down and fix them. Using double will not fix this problem.. but I am not sure this is contributing to your bug.
Reporter | ||
Comment 5•23 years ago
|
||
Reporter | ||
Comment 6•23 years ago
|
||
I have attached one a solution here, that makes sure to change the results of any multiplication to float before doing the addition. In this case, I've only made the changes when calculating m20 and m21. If you add printfs to output m20 at the end of the changed functions, you can see how this fix makes things better: the instances of m20 ending with .4999997 have disappeared.
I think this is the wrong approach. The fundamental issue is converting from floating point to integer. That's where the problems actually occur and that's where the problems should be addressed. Fiddling with floating point is just masking the issue. In a certain sense the problem is too much precision in the calculations. The example here shows a loss of two decimal digits of precision (between 6 and 7 bits) but even a ieee float has 23 bits so that leaves a lot of bits. I think a better solution is to retain some extra bits when converting to integers and do the final rounding in the integer domain. Something like this: #define Real float #define EXTRABITS 8 #define N (2<<EXTRABITS) #define NR ((Real)N) int toInt(Real f) { int i = NR*f + (Real)0.5; return (i + N/2) >> EXTRABITS; } This is a clear candidate for inlining or macrozing. Just a suggestion.
Comment 8•23 years ago
|
||
That is right.. the problem comes from to much presision.. as I mentioned above. If you calculate a position of an two different elements.. relative to a point, let say is 0,0, the error of rounding to a pixel location will not be consistent from one element to another. See my comments up above and in the related bug to understand more deeply. The only way to consistently place elements is to always use the error from previously calculated elements and add in those error.. a virtual screen if you will, but these errors can not just be thrown away.. or rounded a different way to solve this problem.
Comment 9•23 years ago
|
||
->layout
Assignee: trudelle → attinasi
Component: XP Toolkit/Widgets → Layout
QA Contact: jrgm → petersen
Reporter | ||
Comment 11•23 years ago
|
||
dcone: then is this solved by making m20 and m21 doubles? This way, they 'save' the error across function calls, and it gets factored in when doing the calculations.
Comment 12•23 years ago
|
||
*** Bug 102391 has been marked as a duplicate of this bug. ***
Comment 13•23 years ago
|
||
This is also causing our bug where scrolling some images causes them to have white lines across them.
Comment 14•23 years ago
|
||
I think modifying cross-platform code is a bad idea. All the ideas, so far, make implicit assumptions about how the compiler works and that's bound to cause trouble. On Linux, gcc works differently than the compiler you're using. For the original example code, gcc reproduces the supposed errors without optimization but, with optimization, gives "correct" results. The *toFloat helper functions in the proposed patch are quite small so they could well be prime candidates for inlining by an optimizing compiler. If that happens, then you must be concerned that the compiler might notice that all it's doing is popping an fpu stack value into a temporary register and pushing that value back onto the fpu stack so it might decide that's a waste of time and eliminate it. Gcc does inline the functions but does not eliminate them. However, I haven't tried all possible combinations of command-line options so I certainly wouldn't want to depend on this behavior. Changing all the floats to doubles doesn't fix anything; it just makes the problem less likely. On Linux, the fpu generally runs in extended precision so even doubles are prone to this sort of round-off error. I have some examples where gcc gives surprising results for seemingly simple double precision calculations. Also, doubles are twice as big as floats so you have to consider the possibility of stack overflows. If you want a cross-platform solution, you almost certainly need to examine the users of these classes. It might be faster to just change platform-specific stuff instead.
Reporter | ||
Comment 15•23 years ago
|
||
You are correct about the doubles; i spoke without thinking. However, we keep seeing more and more problems in our build are related to this issue, so we would really like for this to be fixed. The ToFloat fix was one solution that I came up with. I have tried to implement dcone's suggestion (saving the error across function calls), but I have not been successful. As an aside, I need some clarification on how nsTransform2D is used in the code. Basically, the original problem stems from using two different nsTransform2Ds (which are calculated differently, but should end up with the same values) when dealing with the same image. Why is this? Shouldn't the same transform be used every time we deal with that image?
Reporter | ||
Comment 16•23 years ago
|
||
This problem can be recreated on windows. If you hardcode mTwipsToPixels in nsDeviceContext to be 1/12.0 (what it is set to on OS/2, while windows is normally set to 1/15.0), you will see instances of text and images moving one pixel when you click or highlight it. Also, I don't think saving the error across function calls will work. In this case, the error would just get compounded over each iteration. The problem here is one of Least Significant Digit. When we do the multiplication of the floats, the result is a double (until it gets saved to a float member var). We then add this double to a float. However, this is not the correct thing to do. We are introducing error into the equation. I coded ToFloat in order to deal with this issue. We have two floats being multiplied, so the end result (before any addition/subtraction) should be a float. The 'error' should be immediately discarded before any further calculation.
Comment 17•23 years ago
|
||
The problem described is similar to the one-dimensional problem of maintaining a checking account balance in float or double. You can add and subtract dollars and cents but the balance is only an approximation because pennies (0.01) are not exactly representable in float or double. You can add and subtract many transactions with a net sum of zero, but if you examine the final balance it can be less than zero, equal to zero, or greater than zero due to rounding approximations at each transaction for a given path. How does one resolve this problem? The simplest and most direct solution is to maintain the balance in pennies stored in an integer. The balance is always precise and correct. How does this relate to the problem at hand? Adding and subtracting x and y translations that are imprecise into an accumulator that is imprecise will cause the balance to be imprecise. If the positional translation is in an integer type e.g., milli pixels and the translations occur in milli pixels, the balance will always be exact. One will return to the origin with moves that sum to zero. A similar strategy could also be adopted for rotation (milli degrees) and absolute magnification. The transformations will be a bit more complex, but the inconsistencies observed will go away. No amount of extra precision or rounding strategies will solve the problem presented in the first paragraph.
Reporter | ||
Comment 18•23 years ago
|
||
Reporter | ||
Comment 19•23 years ago
|
||
I tried to come up with a patch based on khanson's post above. I made the private member variables all integers. m20 and m21 now represent hundredths. The others are multiplied/divided by 60; I chose this number since it is evenly divisible by 12 and 15 (the twips-to-pixels values for OS/2 and win32, respectively). It has fixed most of the issues I was seeing, although the printf's have shown that m20 can still end with .48 or .49 (just not as often as before). I am still looking into this. Obviously, this is a work in progess. Let me know what you guys think.
Comment 20•23 years ago
|
||
I wrote this next bit before the latest comments. Integer arithmetic is attractive and it may well solve a number of problems but it has issues of its own. C/C++ only permits integer arithmetic modulo some power of two. Overflows are silently truncated and the language does not tell you when this happens. Since losing the most significant bits is a bad idea here, it is necessary to limit the size of the input operands so overflows cannot happen. For an underlying n-bit integer, the inputs must be n/2-1 bit integers or smaller. If you internally scale the inputs to help control round-off errors then the inputs must be even smaller. Furthermore, avoiding overflow imposes stringent conditions on the type and order of operations that are allowed. For example, a*(b+c)/d might be allowed but a*(b+c+d)/d might not. Obviously this has been done before so it's not impossible. It is, however, more difficult that just changing floats to integers. Note also that all the testing and whatnot may be slower that using floats. As for the latest patch, I hate to say it but the fact that using integers doesn't fix the round-off problems shows that neither the problem nor the solution lie here. I think we all need to look at who uses these functions and how they use them.
Reporter | ||
Comment 21•23 years ago
|
||
Actually, I found the last remaining issue, which seems to fix the remaining rounding problems. m20 & m21 may need to represent sixths, so I changed the factor involved with them from 100 to 60; so now all the member vars are multiplied/divided by 60, simplifying some of the calculations. Yes, I agree that overflow may be a problem in this case, and we will continue to look into it. As for the calling functions, many of the x and y and dimension coordinates come in as float, although for all functions but one they are actaully integers cast as floats and then passed in. So maybe those should all in fact be changed to integers. We have an issue on OS/2, though (and possibly other platforms), that causes visual inconsitencies that we would like to fix as soon as possible. If you are talking about changing all the calling functions, then maybe that should be opened up as a different bug that is more longterm, while using this bug to come up with a shortterm solution to fix the OS/2 problems.
Reporter | ||
Comment 22•23 years ago
|
||
Comment 23•23 years ago
|
||
I agree there seem to be two issues. The idea of caching errors seems odd. Floats are perfectly fine containers for this sort of data. That most likely should be addressed higher up. I renew my objections to fixing OS/2 problems in cross-platform code. Can you do it in OS/2-specific code instead?
Comment 24•23 years ago
|
||
This isn't an OS/2 specific problem. As Javier mentioned, it is just a coincidence that only OS/2 sees it. If Windows were 120 DPI instead of 96 DPI at the OS level, it would see this problem as well. You can show this by making the change Javier mentioned. The bug IS cross platform. Only OS/2 is visually showing the bug.
Reporter | ||
Comment 25•23 years ago
|
||
Reporter | ||
Updated•23 years ago
|
Attachment #52971 -
Attachment is obsolete: true
Reporter | ||
Updated•23 years ago
|
Attachment #53123 -
Attachment is obsolete: true
Reporter | ||
Comment 26•23 years ago
|
||
Reporter | ||
Comment 27•23 years ago
|
||
Added a patch to test for overflow in every multiplication and addition. Ran with this patch for a while, and also ran some of the browser buster automated tests. Did not run into any of the overflow assertions during that time.
Comment 28•23 years ago
|
||
I believe dcone is on sabbatical. I'm reassigning the bug to pavlov, cc'ing roc and waterson for expertise and super-review. I don't see how anyone can argue that the code does not have problems on any platform, given the right DPI and other constraints. I think we should try to fix this soon, and if necessary, file a long-term fix sequel bug. Pav, I'll stop by your cube. /be
Assignee: dcone → pavlov
Keywords: mozilla0.9.6
Comment 29•23 years ago
|
||
It appears that most of the consumers of this class are using the integer *Coord methods, e.g. TransformCoord and all these methods depend on nsToCoordRound in xpcom/ds/nsUnitConversion.h. The coordinate functions there are rather simplistic. I suspect that the quickest fix would be to replace these coordinate functions with something more sophisticated.
Comment 30•23 years ago
|
||
mkaply: I seem to recall you mentioning that changing the member variables to doubles also solved the problem. Is that still the case?
Comment 31•23 years ago
|
||
Changing to doubles just hides the problem temporarily, it doesn't fix it. What really needs to happen here is to use integers for these types of computations.
Comment 32•23 years ago
|
||
BTW, you can't assume anything other than float precision in the floating point because mozilla neither checks nor sets the fpu control word.
Comment 33•23 years ago
|
||
tenthumbs: eh? http://lxr.mozilla.org/mozilla/source/js/src/jsnum.c#436 /be
Comment 34•23 years ago
|
||
Brendan, I already knew about the FIX_FPU macro but it's a nullity on Linux and other platforms and we're talking cross-platform code here. I tend to think there's a js bug here. There's also the possibility that a third-party library could play with the control word.
Comment 35•23 years ago
|
||
tenthumbs: can you provide a patch? We've never had troubles in 5+ years outside of Windows (wherefore FIX_FPU). /be
Comment 36•23 years ago
|
||
Brendan, I filed bug 109286 on the js issue.
Comment 37•23 years ago
|
||
More information. We found an image painting error that also only happens when DPI is 1/12 - it doesn't happen with 1/15. I don't know what to do with these problems. Can someone tell me what mTwipsToPixels is set to on other platforms (Mac, Linux)
Comment 38•23 years ago
|
||
hinting back and forth between this and bug 104544
Comment 39•23 years ago
|
||
Is this the right layer to worry about this? If the transform classes are defined to operate over floats, then the users of those classes must take the appropriate care when converting to/from floats. In specific, in the context of bug 104544, when float coordinates are used for purposes of refresh, the rendering code should be *pessimistic* and round outward. For other purposes, rounding downward may be more appropriate. My point is that the decision over conversion policy may best be decided by the user of these transform classes, and not handled as an internal matter.
Comment 40•23 years ago
|
||
Could this bug instead be fixed by a patch like attachment 2508 [details] [diff] [review] on bug 16200?
Comment 41•23 years ago
|
||
... and regarding that patch (and applying it to nsDeviceContextOS2::CommonInit), I believe that change is necessary in order to fix all rounding errors. See my explanation in bug 16200 comment 16. (What I'm not sure of is whether the rounding errors you're seeing are necessarily caused by that problem.)
Comment 42•23 years ago
|
||
Hmm. I'm going to read these in more detail, but on first flush this sounds similar to some issues where moving the mouse over or using UI (XUL) and sometimes page elements would cause them to shift position by a pixel. I've had that on and off on FreeBSD for a year or more; I think I reported it at least once.
Comment 43•23 years ago
|
||
Take a look at the definitions of CEIL_CONST_FLOAT and ROUND_EXCLUSIVE_CONST_FLOAT in mozilla/xpcom/ds/nsUnitConversion.h, particularly the comments. The constants are far too precise for single precision and, on Linux, gcc compiles them to exactly 1.0 and 0.5 so many of the functions in the file are seriously broken. I redefined them to this #define CEIL_CONST_FLOAT (1.0f - FLT_EPSILON) #define ROUND_EXCLUSIVE_CONST_FLOAT (0.5f*CEIL_CONST_FLOAT) and ran some tests on the functions and there are a _lot_ of differences. If these things are busted, nothing else can possibly work right.
Comment 44•23 years ago
|
||
brendan, sorry to let you down, but I don't have a lot to add here. I notice that my p2t is 19. (Which seems to mean that we'd need to change the patch to multiply by 1140 instead of 60 if I'm to be happy? And lord knows how we'll please dbaron, who has a p2t of 15 1/6.)
Comment 45•23 years ago
|
||
This bug probably depends on bug 118117 or bug 120690. At the very least they're related.
Comment 46•23 years ago
|
||
i'm seeing something exactly like this on macOSX with both gcc and codewarrior. UI elements are shifted by a pixel on hover and click, but only _sometimes_. It's driving me crazy, and i've isolated it to issues in the transform2D with m20 and m21. Smells a lot like this.
Reporter | ||
Comment 47•23 years ago
|
||
Yes, I have noticed the same issues on MacOSX. Unfortunately, as pointed out earlier, the patch above which uses 60 as a common multiple only works for the default cases of windows and OS/2. It will not work correctly for DPI for other platforms or 'custom' dpi settings. We need someone who is familiar with this code to come up with a better solution. I tried, but I don't know it well enough.
Comment 48•23 years ago
|
||
i applied the patch and it didn't fix my problems on osx. i still see jiggling.
Reporter | ||
Comment 49•23 years ago
|
||
What is the PixelToTwips value in OSX? On windows, it is 15. On OS/2, it is 12. Both divide evenly into 60. That's why I chose that number. Take the PixelToTwips value on OSX and multiply it by 60. Change all the '60's to this number in both the cpp and h file. Does this fix the jiggling?
Comment 50•23 years ago
|
||
on mac, the pixelsToTwips value is <drum roll> 17.1458266 gotta love it ;)
Comment 51•23 years ago
|
||
pinkerton: If it's not an integer, the problems you're seeing are likely just bug 16200, and you should probably port the fix to that bug to the mac.
Comment 52•23 years ago
|
||
dbaron, ok, doing that fixed some of my issues (jiggling checkboxes) but not all (i still have jiggling scrollbars). Any ideas?
Comment 53•23 years ago
|
||
so making sure mPixelsToTwips helped on the cfm mac builds, where it was 14.99999 and is now 15.0. However, on the mach-o builds it was 17.xxx and is now 17.0. 17 isn't a friendly multiple, so we still get the same behavior of jiggling. I'll try the 17*60 trick with the patch in a few.
Reporter | ||
Comment 54•23 years ago
|
||
*** Bug 123235 has been marked as a duplicate of this bug. ***
Updated•23 years ago
|
Target Milestone: --- → Future
Comment 55•23 years ago
|
||
Is this the underlying cause of bug 83289 ? It sure sounds like it. Just a side note: On Windows with Large Fonts my Pixel to Twips = 15.0 and with Small fonts = 12.0. It can be different values on the same OS. I can also confirm that I get lines in images/text during scrolling on particular pages with p2t = 15.0, but not at p2t = 12.0.
Updated•23 years ago
|
Attachment #48908 -
Attachment is obsolete: true
Comment 56•23 years ago
|
||
Given that this problem is not going to be fixed any time soon (we probably need to remove twips from Mozilla completely) and given that it makes the nightly OS/2 builds look like crap, I'd like to propose that we create an nsTransform2DOS2.cpp that is only built on OS/2 so we can put in the fixes we have that make OS/2 better. What do people think of that? Note it would have to live in gfx/src, since gfx/src and gfx/src/os2 are two different DLLs (shared libs)
Updated•23 years ago
|
Keywords: mozilla0.9.6,
mozilla0.9.9
Comment 57•23 years ago
|
||
I have tried changing my p2t to 16, and that seems to help. (it was 20.) I think the reason is that t2p is now 1/16, which can be written exactly in binary. (0.0001). I am still seeing some problems, but not nearly as much. My testcases for bug 83289 are fixed (without the patch in that bug). This is on linux, build from cvs, around 20020318.
Comment 58•23 years ago
|
||
OK, so this bug won't be fixed anytime soon and Mozilla 1.0 is approaching fast. I want to create an nsTransform2DOS2.cpp and use it only for OS/2. I am sick and tired of the fact that OS/2 nightlies suck eggs because of this problem. Looking for review.
Comment 59•23 years ago
|
||
I am certainly in favour of anything that allows me not to wonder whether or not I'm wearing my contact lenses. I don't think any change could make it much worse, so I think the suggestion is a goer.
Comment 60•23 years ago
|
||
Comment on attachment 81544 [details] [diff] [review] Temporary fix for OS/2 r=cls
Attachment #81544 -
Flags: review+
Comment 61•23 years ago
|
||
Comment on attachment 81544 [details] [diff] [review] Temporary fix for OS/2 a=rjesup@wgate.com
Attachment #81544 -
Flags: approval+
Comment 62•23 years ago
|
||
This appears to be an OS/2 only fix. Adding adt1.0.0+ for 1.0 checkin.
Keywords: adt1.0.0+
Comment 63•23 years ago
|
||
I've checked the OS/2 only fix onto branch and trunk.
Keywords: adt1.0.0+ → fixed1.0.0
Comment 64•22 years ago
|
||
I don't access to a 0S/2 machine to verify this issue. Anyone willing to check this on trunk and branch ?
Updated•17 years ago
|
Assignee: pavlov → nobody
QA Contact: chrispetersen → layout
Comment 65•17 years ago
|
||
Could someone verify this now after the unit fix and the border rewrite landed? I think it should be gone now.
Comment 66•17 years ago
|
||
https://bugzilla.mozilla.org/show_bug.cgi?id=173051 also seems to have fixed an nsTransform2D rounding error. Is this still a bug?
Comment 67•14 years ago
|
||
Changing the OS to make this bug appears in the Warpzilla searches, getting it visible for those who can verify it. And who still care about OS/2.
OS: All → OS/2
Hardware: All → x86
Comment 68•14 years ago
|
||
nsTransform2DOS2 was checked in on 2002-05-15 20:42 and was again removed in bug 156402 / a checkin on 2003-01-15 15:15 (see http://bonsai.mozilla.org/cvslog.cgi?file=mozilla/gfx/src/nsTransform2DOS2.cpp). Whatever happened this temporary thing fixed the problem before a more thorough patch was created. I don't see a reason to leave this open. --> FIXED in 2002
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•