Closed Bug 416287 Opened 17 years ago Closed 6 years ago

performance improvement opportunity with isNaN

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

Hardware: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX
Future

People

(Reporter: mohammad.r.haghighat, Assigned: wmaddox)

References

Details

(Whiteboard: PACMAN)

Attachments

(2 files)

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8)
Build Identifier: avmplus shell 1.0 build cyclone_dschaffe_2007-05-11_15-05

The implementation of the math function isNaN() can be significantly improved. Currently, it is defined as:

    static bool isNaN(double value) { return value != value; }

On x86, C++ compilers use x87 instructions to implement this code. The generated code compares the argument value against itself and checks the flags:

    fld    QWORD PTR [esp+04h]
    fcomp  QWORD PTR [esp+04h]
    fnstsw ax
    test   ah, 0x44h

When the value is actually Not a Number (NaN), hardware exceptions occur and HW microcode handles them. This has a severe performance penalty. There is a much more efficient way of doing this by avoiding the x87 FPU and using integer instructions (even without any need for SSE2). The performance improvement of the primitive is huge when handling NaNs - something like a 35x improvement. When the argument is not a NaN value, the performance of the two approaches is comparable.

isNaN() is one of the hot Tamarin helper functions on the JSBench benchmarks (MolDyn), where ~29% of the total execution time is spent in isNaN(). By using the right code (fully expressible in C), I get about a 20% overall improvement in that benchmark. Here's my proposed solution:

    #define I64(f) (*(long long int *)&f)

    static bool isNaN(double value)
    {
        unsigned long long int jvalue = (I64(value) & ~0x8000000000000000uLL);
        return (jvalue > 0x7ff0000000000000uLL);
    }

It would be a good idea to have an assertion (preferably compile-time) that sizeof(long long int) == 8.

Reproducible: Always

Steps to Reproduce:
1. Run avmplus on the JavaScriptGrande2 MolDyn benchmark.

Actual Results:
No functionality problem.

Expected Results:
~20% overall performance improvement on the benchmark.
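For what it's worth, here is a minimal sketch of the suggested compile-time size check, together with a strict-aliasing-safe variant of the same bit test; the memcpy form is an assumption about how one might avoid the type-punning in the I64 macro, not part of the proposal:

    #include <string.h>

    /* Compile-time check that long long is 8 bytes: the array size becomes
       negative, and the build fails, if the condition does not hold. */
    typedef char assert_long_long_is_8_bytes[(sizeof(long long int) == 8) ? 1 : -1];

    /* Same bit test as above, but via memcpy so there is no pointer
       type-punning; compilers typically reduce this to a single move. */
    static bool isNaNBits(double value)
    {
        unsigned long long bits;
        memcpy(&bits, &value, sizeof(bits));
        bits &= ~0x8000000000000000uLL;          /* clear the sign bit              */
        return bits > 0x7ff0000000000000uLL;     /* above +Infinity encoding => NaN */
    }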
Summary: performance improvemet opportunity with isNaN → performance improvement opportunity with isNaN
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → DUPLICATE
I'm going to leave both open; one is for TC, the other is for TT. Specifically, 416398 is about removing many of TT's isNaN checks that aren't even necessary due to how the FPU preserves canonical NaNs.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
This may be a dumb question, but: I'm surprised that the floating-point comparison instruction in that sequence is fcomp; that instruction is supposed to raise a floating-point exception if either operand is a NaN. Could the large overhead of that comparison be due, not to processor microcode overhead, but to kernel exception-handling overhead? Would it make sense to try using a fucomp instruction here instead, using explicit assembly? (For what it's worth, GCC generates fucomp instructions for these comparisons.)
I agree, fucomp is the preferred instruction. I don't know if Moh was quoting a real piece of code or going from memory, but I'm pretty sure he was talking about FPU penalties and not OS exceptions. fucomp would still incur penalties when one value was NaN.
MSVC (both 2005 and 2008) under /Os (used in Tamarin) generates fcomp. The penalties I mentioned were due to the overhead in microcode, not the OS. fucomp does not cause exceptions for QNaNs, but still causes traps for SNaNs.
Priority: -- → P5
Target Milestone: --- → flash10.1
Target Milestone: flash10.1 → flash10.2
Whiteboard: PACMAN
Looks like we have this code for all platforms besides UNIX, which has a slightly different integer-based variant. I don't know why UNIX has a different implementation. One optimization for isNaN is to remove Toplevel::isNaN and have the JIT call MathUtils::isNaN directly, or to inline the isNaN code in Toplevel. Cutting out this helper function boosts the asmicro/isNaN-1.as performance test case by 40%.
Yeah, why there is a slightly different UNIX variant has mystified me too; it predates my work on this code. I'd naively guess that a single int64-based implementation, inlined everywhere, would be the best bet on most platforms. (For that matter, it seems like inlining the test in LIR directly would be doable?)
If the isNaN argument type is known and we're fully optimizing, then you can avoid the call completely:

    isNaN(x:Number) => inline LIR_eq(LIR_eqd(x,x), LIR_immi(0))        // !(x==x)
    isNaN(x:int, uint, bool, String, Namespace) => inline LIR_immi(0)  // false

The constant-false case occurs for any T where we know T is final and doesn't have a valueOf() function that could return NaN. The main thing when inlining is to ensure we don't break the debugger's model of what's going on. Can a user put a breakpoint on isNaN? If so, then we shouldn't inline it (the same goes for the other inline cases, really, although we might already be breaking the debugger in those cases).
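Roughly what that specialization could look like, sketched in C++; the helper names (emitEqD, emitEqI, emitImmI) and the overall shape are placeholders, not the actual CodegenLIR/nanojit entry points:

    // Hypothetical sketch of the type-driven isNaN specialization described above.
    LIns* inlineIsNaN(LIns* arg, Traits* argType)
    {
        if (haveDebugger)
            return NULL;                     // keep the real call so breakpoints on isNaN still work

        if (argType == NUMBER_TYPE) {
            LIns* eq = emitEqD(arg, arg);    // x == x is false only when x is NaN
            return emitEqI(eq, emitImmI(0)); // !(x == x)
        }
        if (argType == INT_TYPE || argType == UINT_TYPE || argType == BOOLEAN_TYPE)
            return emitImmI(0);              // these types can never be NaN: constant false

        return NULL;                         // anything else: emit the normal helper call
    }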
Adding dependency on 588922 for isNaN optimization.
Depends on: 588922
Priority: P5 → P3
Assignee: nobody → wsharp
My SSE inlining test case is about 5x faster with the inlined version. The non-SSE x86 version is actually slower (see previous comments about hardware microcode issues), so this is only enabled for SSE on x86 (plus all other architectures). The function is set up to expand to other built-in functions related to the work in bug 588922. A micro benchmark already exists: asmicro/isNaN-1.as.
Attachment #470527 - Flags: superreview?(edwsmith)
Attachment #470527 - Flags: review?(stejohns)
Attachment #470527 - Flags: review?(stejohns) → review+
Comment on attachment 470527 [details] [diff] [review]
separate isNaN part from bug 588922

Functionality looks correct. R- for the debugger hazard only. If I'm not mistaken, this will break if someone tries to put a breakpoint on isNaN in the debugger, so we should guard these kinds of inlines with if (!haveDebugger).

nits:
- House style in the JIT is { on the same line as control-flow statements.
- SSE2_ONLY(if (core->use_sse2()) does not strip the IF for SSE2-always builds (mac x86, all x64).
- Elsewhere in CodegenLIR we have SSE2_ONLY(if(config.i386_sse2)), but here we have if (core->use_sse2()). I slightly prefer directly using njconfig, but either way let's be consistent.

I have a question unrelated to this patch: should we care about the performance when the argument is NaN? In that case, in comment #1, Moh suggests using softfloat logic instead, to avoid a severe FPU penalty.

Also, I've heard rumors of people unhappy with the performance of isNaN for non-double arguments. E.g., if the argument is int, uint, or bool, then we convert to double, and end up doing eqi(eqd(i2d(x), i2d(x)), immi(0)). Will the existing optimizations combine, erase all that, and return immi(0)?
Attachment #470527 - Flags: superreview?(edwsmith) → superreview-
Attached patch updated patch after feedback (deleted) — Splinter Review
1. Added haveDebugger check.
2. Uglified the bracket style to match most of codegenlir.cpp. Yeah, I'm not a huge fan of the same-line { syntax, and codegenlir needs some cleanup if we want true consistency.
3. Added an ifdef so the SSE check is only compiled for builds that need it.
4. Using core->config.njconfig to get at the SSE flag.
5. Added an i/ui2d optimization to just emit a constant 0 result.

The performance when the argument is NaN is very good for SSE-enabled machines. It is only with x87 instructions that performance is very slow (2.5x slower than the int64 code version), which is why I added the SSE flag check.

Approximate numbers for a 100-million-iteration loop:

    isNaN(NaN)     - 3000 original, 150-500 SSE (depending on loop alignment), 8000 if we use inline x87 code
    isNaN(not NaN) - 3000 original, 150-500 SSE, 250-500 x87 inlined

So we are 2.5x slower if we inline the x87 FPU compare vs. calling out to a Math helper routine using int64 math, but 6-12x faster when we are not checking a NaN value.
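For context, a standalone C++ loop in the spirit of the measurement above; the actual numbers came from the asmicro ActionScript test, so this harness and the isNaNBits helper are illustrative only:

    #include <cstdio>
    #include <cstring>
    #include <ctime>
    #include <limits>

    // Bit-pattern NaN test, standing in for the implementation under test.
    static bool isNaNBits(double v)
    {
        unsigned long long bits;
        std::memcpy(&bits, &v, sizeof(bits));
        return (bits & ~0x8000000000000000uLL) > 0x7ff0000000000000uLL;
    }

    volatile int sink;   // keeps the result live so the loop isn't optimized away

    static void bench(const char* label, double v)
    {
        std::clock_t start = std::clock();
        int hits = 0;
        for (long i = 0; i < 100000000L; ++i)
            hits += isNaNBits(v) ? 1 : 0;
        sink = hits;
        std::printf("%s: %.0f ms\n", label,
                    1000.0 * (std::clock() - start) / CLOCKS_PER_SEC);
    }

    int main()
    {
        bench("isNaN(NaN)",     std::numeric_limits<double>::quiet_NaN());
        bench("isNaN(not NaN)", 1.5);
        return 0;
    }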
Attachment #470790 - Flags: superreview?(edwsmith)
Attachment #470790 - Flags: review?(stejohns)
inline isNaN(NaN) is still fast on SSE2? sweet. Can you add the microbenchmarks for NaN and non-double cases as well? separate patch is fine.
Yes, SSE ucomisd just sets flags regardless of the argument. For microbenchmarks, you want an AS file to add to the performance directory? Can I just put it in the asmicro directory, or are there guidelines posted about adding new test cases?
Comment on attachment 470790 [details] [diff] [review] updated patch after feedback Probably should add a "break" to the end of the case statement; doesn't matter currently but if/when another case is ever added the break would prevent a dumb error.
Attachment #470790 - Flags: review?(stejohns) → review+
Comment on attachment 470790 [details] [diff] [review]
updated patch after feedback

nit: you don't need #ifdef DEBUGGER around if (haveDebugger), because haveDebugger is a const false in non-DEBUGGER configs (see CodegenLIR.h). R+ with this and Steven's suggestion (add a break statement) fixed.
Attachment #470790 - Flags: superreview?(edwsmith) → superreview+
Pushed the inlining code: http://hg.mozilla.org/tamarin-redux/rev/9de9c686abc2

Leaving this open. Our current MathUtils::isNaN implementation is written specifically to avoid big slowdowns on x86 architectures. My investigation with this patch shows that SSE compares (NaN == NaN via UCOMISD) do not have the slowdown, so it would be faster to have isNaN just be (val != val). What about ARM, PowerPC, and SPARC? Perhaps only x86 has the slowdown when comparing NaN to itself via floating-point instructions. If so, x86 should use the int-compare technique while all others should use (val != val).
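One way that per-architecture split could be spelled, as a sketch only (the configuration macros here are illustrative, not the ones actually defined in the tree):

    #if defined(VMCFG_IA32) && !defined(VMCFG_SSE2_ALWAYS)   // illustrative macro names
    // x87 compares of NaN are slow, so use the integer bit test on plain IA32.
    inline bool isNaN(double v)
    {
        union { double d; unsigned long long u; } bits;
        bits.d = v;
        return (bits.u & ~0x8000000000000000uLL) > 0x7ff0000000000000uLL;
    }
    #else
    // SSE2 (ucomisd) and the other FPUs measured so far have no NaN penalty.
    inline bool isNaN(double v)
    {
        return v != v;
    }
    #endif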
(In reply to comment #17) > about ARM, PowerPC, Sparc? Perhaps only x86 has the slowdown with comparing > NaN to itself via floating point instructions. If so, x86 should use the int > compare technique while all others should use (val != val). Non-SSE2 machines are a vanishingly small percent of hardware we care about anymore, so frankly, optimizing for those is not worth our time. If the SSE2 comparison is a win on x86, leave it at that and move on...
I'm talking about the MathUtils::isNaN routine in C++ which on Windows will use the x87 instructions. Mac does not have the problem but Windows/Linux wont automatically use SSE.
Ah, got it. How many places (outside of the JIT) do we call this on a critical path? Perhaps we could call it via a function pointer specialized at startup time (to a generic vs. SSE2 version)? Of course, an indirect call might overwhelm any perf advantage of the inlined integer-hack version.
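Sketch of the function-pointer idea, with hypothetical names throughout; the real CPU-feature query isn't shown, and as noted the indirect call may cost more than it saves:

    typedef bool (*IsNaNFn)(double);

    static bool isNaN_int64(double v)        // generic: integer bit test
    {
        union { double d; unsigned long long u; } bits;
        bits.d = v;
        return (bits.u & ~0x8000000000000000uLL) > 0x7ff0000000000000uLL;
    }

    static bool isNaN_fpcmp(double v)        // SSE2 builds: plain FP compare
    {
        return v != v;
    }

    static IsNaNFn g_isNaN = isNaN_int64;    // safe default until startup runs

    void initIsNaNDispatch(bool haveSSE2)    // called once at VM startup
    {
        g_isNaN = haveSSE2 ? isNaN_fpcmp : isNaN_int64;
    }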
There are 61 hits in the source tree. No idea if any of them are on a critical path. The beauty of the (f != f) variant is that it can be inlined and is only a couple of instructions. Going back to my old code (pre-inlining of isNaN) and running the perf test, allowing MathUtils::isNaN to inline in Toplevel::isNaN basically doubles the performance of Toplevel.isNaN. I'm sure it would help performance in the various other places that call it as well. I don't think we should inline the int64 variant (too much code), though. But at the very least, x64/SSE-always builds should just use an inline version of return (f != f).
> I don't think we should inline the int64 variant (too much code) though. But
> at the very least, x64/SSE-always builds should just use an inline version of
> return (f != f).

I have a vague memory that the semantics of C/C++ have a loophole about NaN, and that under high optimization settings an optimizer is allowed to assume (f != f) is always false (which is wrong for NaN under IEEE-754 semantics). So we just have to be sure we've got good unit test coverage and do the right thing as compilers dictate. This isn't a concern for LIR_eqd(f, f), since all nanojit backends must conform to IEEE-754 semantics.
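A minimal test of the kind suggested, assuming it is compiled with the same flags as the shell; isNaNUnderTest here is a stand-in for whichever implementation is being checked:

    #include <cassert>
    #include <limits>

    // Stand-in for the implementation under test; replace with the real one.
    static bool isNaNUnderTest(double v) { return v != v; }

    int main()
    {
        volatile double nan  = std::numeric_limits<double>::quiet_NaN();
        volatile double one  = 1.0;
        volatile double zero = 0.0;
        assert(isNaNUnderTest(nan));          // catches an optimizer folding (f != f) to false
        assert(!isNaNUnderTest(one));
        assert(!isNaNUnderTest(one / zero));  // +Infinity is not NaN
        return 0;
    }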
(In reply to comment #22) > So, we > just have to be sure we've got good unit test coverage, and do the right thing > as compilers dictate. Recent discussion with Steven suggests that systematic testing on non-SSE2 hardware is spotty or worse - so even having good unit tests may not be good enough. (That could have been VM testing or player testing - the TC in question was for the player.) cc'ing Trevor in case there are action items to consider.
(In reply to comment #23) > Recent discussion with Steven suggests that systematic testing on non-SSE2 > hardware is spotty or worse - so even having good unit tests may not be good > enough. (That could have been VM testing or player testing - the TC in > question was for the player.) cc'ing Trevor in case there are action items to > consider. The particular issue turned out to be not non-SSE2 per se, but non-CMOV machines. (non-SSE2 testing is less extensive than I'd like, however, given that we are still supporting such machines.)
As part of testing, under a designated flag, one can make the CPUID routine fake a system with SSE* to look like a non-SSE* machine. Then, by having assertions in the codegen routines for SSE*, MMX, and CMOV instructions, we can break if those instructions are ever used. This way we would not be totally dependent on the availability of those old systems in the test farm. Re CMOV: back in mid-2009, at some point for a short time, TraceMonkey was erroneously generating "conditional moves" unconditionally (i.e., CMOV without the proper CPU check). Note that CMOV was first introduced in the Pentium Pro. So, a "Pentium processor with MMX™ technology" has MMX but not CMOV (or SSE2). This was captured in the bug https://bugzilla.mozilla.org/show_bug.cgi?id=500277
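The shape of that testing hook might be something like the following; every name here (the flag, the query, the emitter) is hypothetical, not real VM code:

    #include <cassert>

    extern bool queryCpuidForSSE2();        // the real CPUID probe (assumed to exist elsewhere)

    static bool g_pretendNoSSE2 = false;    // set from a test-only command-line flag

    bool cpuHasSSE2()
    {
        if (g_pretendNoSSE2)
            return false;                   // report a pre-SSE2 CPU to all callers
        return queryCpuidForSSE2();
    }

    void emitSSE2Compare(/* operands */)
    {
        // With the fake flag set, reaching any SSE2 emitter indicates a missing CPU check.
        assert(cpuHasSSE2() && "SSE2 instruction emitted on (simulated) non-SSE2 CPU");
        // ... encode ucomisd ...
    }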
From an IM chat with Ed: approximately 1/3 of the isNaN calls in 1000 brightspot SWFs are doing isNaN(number(arg)), i.e. converting an atom into a double and then calling isNaN. The alternative would be to add isNaNAtom(Atom a) and combine the two calls into one. isNaN is the #1 call according to bug 588922. Ed brought up the concern that the number() call would not necessarily get stripped by the verifier/codegen, though, and we don't want two calls to Object.valueOf. (See the duplicate charCodeAt discussion in bug 593383.)
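A possible shape for the combined helper; isNaNAtom does not exist in the tree, and the use of number() for the atom-to-double coercion is an assumption about the surrounding API:

    // Hypothetical combined helper: one call instead of isNaN(number(arg)).
    bool Toplevel::isNaNAtom(Atom a)
    {
        // Coerce the atom to a double exactly once, so any valueOf() side
        // effect happens exactly once, then apply the usual NaN test.
        double d = core()->number(a);
        return MathUtils::isNaN(d);
    }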
Blocks: 588922
No longer depends on: 588922
Flags: flashplayer-qrb?
Assignee: wsharp → nobody
Assignee: nobody → wmaddox
Flags: flashplayer-injection-
Whiteboard: PACMAN → PACMAN must-fix-candidate
Depends on: Andre
Flags: flashplayer-qrb?
Flags: flashplayer-qrb+
Flags: flashplayer-bug-
OS: Windows XP → All
Whiteboard: PACMAN must-fix-candidate → PACMAN
Depends on: 645018
The discussion above covers a lot of ground. To summarize what remains open:

1) MathUtils::isNaNInline() uses the integer-based test on IA32 unless we know SSE will always be available (i.e., Mac OS X). MathUtils::isNaN() always uses an integer-based test, on all architectures. This discrepancy is unmotivated. The integer-based test is known to be faster on non-SSE2 IA32 platforms, and the floating-point test faster on SSE2 IA32, but we have not examined which is preferable on non-x86 platforms. It is also unclear why we do not use the inlined form everywhere, as it is quite small.

2) On Unix (Solaris only?) we use a different integer-based test. It is not clear whether this is necessary or results in superior performance.

3) Toplevel::isNaN() has been measured to be faster when it invokes an inlined version of MathUtils::isNaN().

4) Math.isNaN() is specialized to an inline floating-point test. At one point, it was proposed that this be done on IA32 only for SSE2-enabled processors, but that was dropped in the patch that landed (attachment 527712 [details] [diff] [review] for bug 606561).

We need to rationalize the usage of the integer vs. floating-point NaN test across all of these contexts, and on non-x86 platforms as well.
From Bill:

It is a very old bug, in which several different aspects of isNaN, both static and JIT-ed, are discussed. Werner did quite a bit of work on inlining isNaN in JIT-ed code and doing some context-dependent specialization. That has landed. There is the possibility of improving isNaN tests in the static code, and it's unclear whether the assumptions made in Werner's patch have been adequately validated on non-x86 platforms. This is a performance study and tuning exercise, however, with no "fix" in the pipeline, or any claim of a performance regression. I don't expect to do any work on it prior to ZBB, so we should kick it down the road. Moving to Brannan.
Target Milestone: Q3 11 - Serrano → Q1 12 - Brannan
Retargeting further work to "Future" release. If benchmarking points to this as a problem, send to QRB.
Priority: P3 → --
Target Milestone: Q1 12 - Brannan → Future
As Dan assigned it to future, removing Andre blocker.
No longer depends on: Andre
Tamarin is a dead project now. Mass WONTFIX.
Status: REOPENED → RESOLVED
Closed: 17 years ago → 6 years ago
Resolution: --- → WONTFIX
Tamarin isn't maintained anymore. WONTFIX remaining bugs.