Closed Bug 416287 Opened 17 years ago Closed 6 years ago

performance improvement opportunity with isNaN

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

Hardware: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX
Future

People

(Reporter: mohammad.r.haghighat, Assigned: wmaddox)

References

Details

(Whiteboard: PACMAN)

Attachments

(2 files)

User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1; MS-RTC LM 8)
Build Identifier: avmplus shell 1.0 build cyclone_dschaffe_2007-05-11_15-05

The implementation of the math function isNaN() can be significantly improved. Currently, it is defined as:

    static bool isNaN(double value) { return value != value; }

On x86, C++ compilers use x87 instructions to implement this code. The generated code compares the argument value against itself and checks the flags:

    fld    QWORD PTR [esp+04h]
    fcomp  QWORD PTR [esp+04h]
    fnstsw ax
    test   ah, 0x44h

When the value is actually Not a Number (NaN), hardware exceptions occur and HW microcode handles them. This has a severe performance penalty. There is a much more efficient way of doing this by avoiding the x87 FPU and using integer instructions (even without any need for SSE2). The performance improvement of the primitive is huge when handling NaNs - something like a 35x improvement. When the argument is not a NaN value, the performance of the two approaches is comparable.

isNaN() is one of the hot Tamarin helper functions on the JSBench benchmarks (MolDyn), where ~29% of the total execution time is spent in isNaN(). By using the right code (fully expressible in C), I get about a 20% overall improvement in that benchmark. Here's my proposed solution:

    #define I64(f) (*(long long int *)&f)

    static bool isNaN(double value)
    {
        unsigned long long int jvalue = (I64(value) & ~0x8000000000000000uLL);
        return (jvalue > 0x7ff0000000000000uLL);
    }

It would be a good idea to have an assertion (preferably compile-time) that sizeof(long long int) == 8.

Reproducible: Always

Steps to Reproduce:
1. Run avmplus on the JavaScriptGrande2 MolDyn benchmark.

Actual Results:
No functionality problem.

Expected Results:
~20% overall performance improvement on the benchmark.
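For what it's worth, here is a minimal sketch of the suggested compile-time size check, together with a strict-aliasing-safe variant of the same bit test; the memcpy form is an assumption about how one might avoid the type-punning in the I64 macro, not part of the proposal:

    #include <string.h>

    /* Compile-time check that long long is 8 bytes: the array size becomes
       negative, and the build fails, if the condition does not hold. */
    typedef char assert_long_long_is_8_bytes[(sizeof(long long int) == 8) ? 1 : -1];

    /* Same bit test as above, but via memcpy so there is no pointer
       type-punning; compilers typically reduce this to a single move. */
    static bool isNaNBits(double value)
    {
        unsigned long long bits;
        memcpy(&bits, &value, sizeof(bits));
        bits &= ~0x8000000000000000uLL;          /* clear the sign bit              */
        return bits > 0x7ff0000000000000uLL;     /* above +Infinity encoding => NaN */
    }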
Summary: performance improvemet opportunity with isNaN → performance improvement opportunity with isNaN
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → DUPLICATE
I'm going to leave both open; one is for TC, the other is for TT. Specifically, 416398 is about removing many of TT's isNaN checks that aren't even necessary due to how the FPU preserves canonical NaNs.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
This may be a dumb question, but: I'm surprised that the floating-point comparison instruction in that sequence is fcomp; that instruction is supposed to raise a floating-point exception if either operand is a NaN. Could the large overhead of that comparison be due, not to processor microcode overhead, but to kernel exception-handling overhead? Would it make sense to try using a fucomp instruction here instead, using explicit assembly? (For what it's worth, GCC generates fucomp instructions for these comparisons.)
I agree, fucomp is the preferred instruction. I don't know if Moh was quoting a real piece of code or going from memory, but I'm pretty sure he was talking about FPU penalties and not OS exceptions. fucomp would still incur penalties when one value was NaN.
MSVC (both 2005 and 2008) under /Os (used in Tamarin) generates fcomp. The penalties I mentioned were due to the overhead in microcode, not the OS. fucomp does not cause exceptions for QNaNs, but still causes traps for SNaNs.
Priority: -- → P5
Target Milestone: --- → flash10.1
Target Milestone: flash10.1 → flash10.2
Whiteboard: PACMAN
Looks like we have this code for all platforms besides UNIX, which has a slightly different integer-based variant. I don't know why UNIX has a different implementation. One optimization for isNaN is to remove Toplevel::isNaN and have the JIT call MathUtils::isNaN directly, or to inline the isNaN code in Toplevel. Cutting out this helper function boosts the asmicro/isNaN-1.as performance test case by 40%.
Yeah, why there is a slightly different UNIX variant has mystified me too; it predates my work on this code. I'd naively guess that a single int64-based implementation, inlined everywhere, would be the best bet on most platforms. (For that matter, it seems like inlining the test in LIR directly would be doable?)
If the isNaN argument type is known and we're fully optimizing, then you can avoid the call completely:

    isNaN(x:Number) => inline LIR_eq(LIR_eqd(x,x), LIR_immi(0))        // !(x==x)
    isNaN(x:int, uint, bool, String, Namespace) => inline LIR_immi(0)  // false

The constant-false case occurs for any T where we know T is final and doesn't have a valueOf() function that could return NaN. The main thing when inlining is to ensure we don't break the debugger's model of what's going on. Can a user put a breakpoint on isNaN? If so, then we shouldn't inline it (the same goes for the other inline cases, really, although we might already be breaking the debugger in those cases).
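Roughly what that specialization could look like, sketched in C++; the helper names (emitEqD, emitEqI, emitImmI) and the overall shape are placeholders, not the actual CodegenLIR/nanojit entry points:

    // Hypothetical sketch of the type-driven isNaN specialization described above.
    LIns* inlineIsNaN(LIns* arg, Traits* argType)
    {
        if (haveDebugger)
            return NULL;                     // keep the real call so breakpoints on isNaN still work

        if (argType == NUMBER_TYPE) {
            LIns* eq = emitEqD(arg, arg);    // x == x is false only when x is NaN
            return emitEqI(eq, emitImmI(0)); // !(x == x)
        }
        if (argType == INT_TYPE || argType == UINT_TYPE || argType == BOOLEAN_TYPE)
            return emitImmI(0);              // these types can never be NaN: constant false

        return NULL;                         // anything else: emit the normal helper call
    }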
Adding dependency on 588922 for isNaN optimization.
Depends on: 588922
Priority: P5 → P3
Assignee: nobody → wsharp
My SSE inlining test case is about 5x faster with the inlined version. The non-SSE x86 version is actually slower (see previous comments about hardware microcode issues), so this is only enabled for SSE on x86 (plus all other architectures). The function is set up to expand to other built-in functions related to the work in bug 588922. A micro benchmark already exists: asmicro/isNaN-1.as.
Attachment #470527 - Flags: superreview?(edwsmith)
Attachment #470527 - Flags: review?(stejohns)
Attachment #470527 - Flags: review?(stejohns) → review+
Comment on attachment 470527 [details] [diff] [review]
separate isNaN part from bug 588922

Functionality looks correct. R- for the debugger hazard only. If I'm not mistaken, this will break if someone tries to put a breakpoint on isNaN in the debugger, so we should guard these kinds of inlines with if (!haveDebugger).

nits:
- House style in the JIT is { on the same line as control-flow statements.
- SSE2_ONLY(if (core->use_sse2()) does not strip the IF for SSE2-always builds (mac x86, all x64).
- Elsewhere in CodegenLIR we have SSE2_ONLY(if(config.i386_sse2)), but here we have if (core->use_sse2()). I slightly prefer directly using njconfig, but either way let's be consistent.

I have a question unrelated to this patch: should we care about the performance when the argument is NaN? In that case, in comment #1, Moh suggests using softfloat logic instead, to avoid a severe FPU penalty.

Also, I've heard rumors of people unhappy with the performance of isNaN for non-double arguments. E.g., if the argument is int, uint, or bool, then we convert to double, and end up doing eqi(eqd(i2d(x), i2d(x)), immi(0)). Will the existing optimizations combine, erase all that, and return immi(0)?
Attachment #470527 - Flags: superreview?(edwsmith) → superreview-
Attached patch updated patch after feedback (deleted) — Splinter Review
1. Added haveDebugger check.
2. Uglified the bracket style to match most of codegenlir.cpp. Yeah, I'm not a huge fan of the same-line { syntax, and codegenlir needs some cleanup if we want true consistency.
3. Added an ifdef so the SSE check is only compiled for builds that need it.
4. Using core->config.njconfig to get at the SSE flag.
5. Added an i/ui2d optimization to just emit a constant 0 result.

The performance when the argument is NaN is very good for SSE-enabled machines. It is only with x87 instructions that performance is very slow (2.5x slower than the int64 code version), which is why I added the SSE flag check.

Approximate numbers for a 100-million-iteration loop:

    isNaN(NaN)     - 3000 original, 150-500 SSE (depending on loop alignment), 8000 if we use inline x87 code
    isNaN(not NaN) - 3000 original, 150-500 SSE, 250-500 x87 inlined

So we are 2.5x slower if we inline the x87 FPU compare vs. calling out to a Math helper routine using int64 math, but 6-12x faster when we are not checking a NaN value.
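For context, a standalone C++ loop in the spirit of the measurement above; the actual numbers came from the asmicro ActionScript test, so this harness and the isNaNBits helper are illustrative only:

    #include <cstdio>
    #include <cstring>
    #include <ctime>
    #include <limits>

    // Bit-pattern NaN test, standing in for the implementation under test.
    static bool isNaNBits(double v)
    {
        unsigned long long bits;
        std::memcpy(&bits, &v, sizeof(bits));
        return (bits & ~0x8000000000000000uLL) > 0x7ff0000000000000uLL;
    }

    volatile int sink;   // keeps the result live so the loop isn't optimized away

    static void bench(const char* label, double v)
    {
        std::clock_t start = std::clock();
        int hits = 0;
        for (long i = 0; i < 100000000L; ++i)
            hits += isNaNBits(v) ? 1 : 0;
        sink = hits;
        std::printf("%s: %.0f ms\n", label,
                    1000.0 * (std::clock() - start) / CLOCKS_PER_SEC);
    }

    int main()
    {
        bench("isNaN(NaN)",     std::numeric_limits<double>::quiet_NaN());
        bench("isNaN(not NaN)", 1.5);
        return 0;
    }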
Attachment #470790 - Flags: superreview?(edwsmith)
Attachment #470790 - Flags: review?(stejohns)
inline isNaN(NaN) is still fast on SSE2? sweet. Can you add the microbenchmarks for NaN and non-double cases as well? separate patch is fine.
Yes, SSE ucomisd just sets flags regardless of the argument. For microbenchmarks, you want an AS file to add to the performance directory? Can I just put it in the asmicro directory, or are there guidelines posted about adding new test cases?
Comment on attachment 470790 [details] [diff] [review] updated patch after feedback Probably should add a "break" to the end of the case statement; doesn't matter currently but if/when another case is ever added the break would prevent a dumb error.
Attachment #470790 - Flags: review?(stejohns) → review+
Comment on attachment 470790 [details] [diff] [review]
updated patch after feedback

nit: you don't need #ifdef DEBUGGER around if (haveDebugger), because haveDebugger is a const false in non-DEBUGGER configs (see CodegenLIR.h). R+ with this and Steven's suggestion (add a break statement) fixed.
Attachment #470790 - Flags: superreview?(edwsmith) → superreview+
Pushed the inlining code: http://hg.mozilla.org/tamarin-redux/rev/9de9c686abc2

Leaving this open. Our current MathUtils::isNaN implementation is written specifically to avoid big slowdowns on x86 architectures. My investigation with this patch shows that SSE compares (NaN == NaN via UCOMISD) do not have the slowdown, so it would be faster to have isNaN just be (val != val). What about ARM, PowerPC, and SPARC? Perhaps only x86 has the slowdown when comparing NaN to itself via floating-point instructions. If so, x86 should use the int-compare technique while all others should use (val != val).
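One way that per-architecture split could be spelled, as a sketch only (the configuration macros here are illustrative, not the ones actually defined in the tree):

    #if defined(VMCFG_IA32) && !defined(VMCFG_SSE2_ALWAYS)   // illustrative macro names
    // x87 compares of NaN are slow, so use the integer bit test on plain IA32.
    inline bool isNaN(double v)
    {
        union { double d; unsigned long long u; } bits;
        bits.d = v;
        return (bits.u & ~0x8000000000000000uLL) > 0x7ff0000000000000uLL;
    }
    #else
    // SSE2 (ucomisd) and the other FPUs measured so far have no NaN penalty.
    inline bool isNaN(double v)
    {
        return v != v;
    }
    #endif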
(In reply to comment #17) > about ARM, PowerPC, Sparc? Perhaps only x86 has the slowdown with comparing > NaN to itself via floating point instructions. If so, x86 should use the int > compare technique while all others should use (val != val). Non-SSE2 machines are a vanishingly small percent of hardware we care about anymore, so frankly, optimizing for those is not worth our time. If the SSE2 comparison is a win on x86, leave it at that and move on...
I'm talking about the MathUtils::isNaN routine in C++ which on Windows will use the x87 instructions. Mac does not have the problem but Windows/Linux wont automatically use SSE.
Ah, got it. How many places (outside of the JIT) do we call this on a critical path? Perhaps we could call it via a function pointer specialized at startup time (to a generic vs. SSE2 version)? Of course, an indirect call might overwhelm any perf advantage of the inlined integer-hack version.
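Sketch of the function-pointer idea, with hypothetical names throughout; the real CPU-feature query isn't shown, and as noted the indirect call may cost more than it saves:

    typedef bool (*IsNaNFn)(double);

    static bool isNaN_int64(double v)        // generic: integer bit test
    {
        union { double d; unsigned long long u; } bits;
        bits.d = v;
        return (bits.u & ~0x8000000000000000uLL) > 0x7ff0000000000000uLL;
    }

    static bool isNaN_fpcmp(double v)        // SSE2 builds: plain FP compare
    {
        return v != v;
    }

    static IsNaNFn g_isNaN = isNaN_int64;    // safe default until startup runs

    void initIsNaNDispatch(bool haveSSE2)    // called once at VM startup
    {
        g_isNaN = haveSSE2 ? isNaN_fpcmp : isNaN_int64;
    }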
There are 61 hits in the source tree. No idea if any of them are on a critical path. The beauty of the (f != f) variant is that it can be inlined and is only a couple of instructions. Going back to my old code (pre-inlining of isNaN) and running the perf test, allowing MathUtils::isNaN to inline in Toplevel::isNaN basically doubles the performance of Toplevel.isNaN. I'm sure it would help performance in the various other places that call it as well. I don't think we should inline the int64 variant (too much code), though. But at the very least, x64/SSE-always builds should just use an inline version of return (f != f).
> I don't think we should inline the int64 variant (too much code) though. But
> at the very least, x64/SSE-always builds should just use an inline version of
> return (f != f).

I have a vague memory that the semantics of C/C++ have a loophole about NaN, and that under high optimization settings an optimizer is allowed to assume (f != f) is always false (which is wrong for NaN under IEEE-754 semantics). So we just have to be sure we've got good unit test coverage and do the right thing as compilers dictate. This isn't a concern for LIR_eqd(f, f), since all nanojit backends must conform to IEEE-754 semantics.
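A minimal test of the kind suggested, assuming it is compiled with the same flags as the shell; isNaNUnderTest here is a stand-in for whichever implementation is being checked:

    #include <cassert>
    #include <limits>

    // Stand-in for the implementation under test; replace with the real one.
    static bool isNaNUnderTest(double v) { return v != v; }

    int main()
    {
        volatile double nan  = std::numeric_limits<double>::quiet_NaN();
        volatile double one  = 1.0;
        volatile double zero = 0.0;
        assert(isNaNUnderTest(nan));          // catches an optimizer folding (f != f) to false
        assert(!isNaNUnderTest(one));
        assert(!isNaNUnderTest(one / zero));  // +Infinity is not NaN
        return 0;
    }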
(In reply to comment #22) > So, we > just have to be sure we've got good unit test coverage, and do the right thing > as compilers dictate. Recent discussion with Steven suggests that systematic testing on non-SSE2 hardware is spotty or worse - so even having good unit tests may not be good enough. (That could have been VM testing or player testing - the TC in question was for the player.) cc'ing Trevor in case there are action items to consider.
(In reply to comment #23) > Recent discussion with Steven suggests that systematic testing on non-SSE2 > hardware is spotty or worse - so even having good unit tests may not be good > enough. (That could have been VM testing or player testing - the TC in > question was for the player.) cc'ing Trevor in case there are action items to > consider. The particular issue turned out to be not non-SSE2 per se, but non-CMOV machines. (non-SSE2 testing is less extensive than I'd like, however, given that we are still supporting such machines.)
As part of testing, under a designated flag, one can make the CPUID routine fake a system with SSE* to look like a non-SSE* machine. Then, by having assertions in the codegen routines for SSE*, MMX, and CMOV instructions, we can break if those instructions are ever used. This way we would not be totally dependent on the availability of those old systems in the test farm. Re CMOV: back in mid-2009, at some point for a short time, TraceMonkey was erroneously generating "conditional moves" unconditionally (i.e., CMOV without the proper CPU check). Note that CMOV was first introduced in the Pentium Pro. So, a "Pentium processor with MMX™ technology" has MMX but not CMOV (or SSE2). This was captured in the bug https://bugzilla.mozilla.org/show_bug.cgi?id=500277
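The shape of that testing hook might be something like the following; every name here (the flag, the query, the emitter) is hypothetical, not real VM code:

    #include <cassert>

    extern bool queryCpuidForSSE2();        // the real CPUID probe (assumed to exist elsewhere)

    static bool g_pretendNoSSE2 = false;    // set from a test-only command-line flag

    bool cpuHasSSE2()
    {
        if (g_pretendNoSSE2)
            return false;                   // report a pre-SSE2 CPU to all callers
        return queryCpuidForSSE2();
    }

    void emitSSE2Compare(/* operands */)
    {
        // With the fake flag set, reaching any SSE2 emitter indicates a missing CPU check.
        assert(cpuHasSSE2() && "SSE2 instruction emitted on (simulated) non-SSE2 CPU");
        // ... encode ucomisd ...
    }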
From an IM chat with Ed: approximately 1/3 of the isNaN calls in 1000 brightspot SWFs are doing isNaN(number(arg)), i.e. converting an atom into a double and then calling isNaN. The alternative would be to add isNaNAtom(Atom a) and combine the two calls into one. isNaN is the #1 call according to bug 588922. Ed brought up the concern that the number() call would not necessarily get stripped by the verifier/codegen, though, and we don't want two calls to Object.valueOf. (See the duplicate charCodeAt discussion in bug 593383.)
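A possible shape for the combined helper; isNaNAtom does not exist in the tree, and the use of number() for the atom-to-double coercion is an assumption about the surrounding API:

    // Hypothetical combined helper: one call instead of isNaN(number(arg)).
    bool Toplevel::isNaNAtom(Atom a)
    {
        // Coerce the atom to a double exactly once, so any valueOf() side
        // effect happens exactly once, then apply the usual NaN test.
        double d = core()->number(a);
        return MathUtils::isNaN(d);
    }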
Blocks: 588922
No longer depends on: 588922
Flags: flashplayer-qrb?
Assignee: wsharp → nobody
Assignee: nobody → wmaddox
Flags: flashplayer-injection-
Whiteboard: PACMAN → PACMAN must-fix-candidate
Depends on: Andre
Flags: flashplayer-qrb?
Flags: flashplayer-qrb+
Flags: flashplayer-bug-
OS: Windows XP → All
Whiteboard: PACMAN must-fix-candidate → PACMAN
Depends on: 645018
The discussion above covers a lot of ground. To summarize what remains open:

1) MathUtils::isNaNInline() uses the integer-based test on IA32 unless we know SSE will always be available (i.e., Mac OS X). MathUtils::isNaN() always uses an integer-based test, on all architectures. This discrepancy is unmotivated. The integer-based test is known to be faster on non-SSE2 IA32 platforms, and the floating-point test faster on SSE2 IA32, but we have not examined which is preferable on non-x86 platforms. It is also unclear why we do not use the inlined form everywhere, as it is quite small.

2) On Unix (Solaris only?) we use a different integer-based test. It is not clear whether this is necessary or results in superior performance.

3) Toplevel::isNaN() has been measured to be faster when it invokes an inlined version of MathUtils::isNaN().

4) Math.isNaN() is specialized to an inline floating-point test. At one point, it was proposed that this be done on IA32 only for SSE2-enabled processors, but that was dropped in the patch that landed (attachment 527712 [details] [diff] [review] for bug 606561).

We need to rationalize the usage of the integer vs. floating-point NaN test across all of these contexts, and on non-x86 platforms as well.
From Bill:

It is a very old bug, in which several different aspects of isNaN, both static and JIT-ed, are discussed. Werner did quite a bit of work on inlining isNaN in JIT-ed code and doing some context-dependent specialization. That has landed. There is the possibility of improving isNaN tests in the static code, and it's unclear whether the assumptions made in Werner's patch have been adequately validated on non-x86 platforms. This is a performance study and tuning exercise, however, with no "fix" in the pipeline, or any claim of a performance regression. I don't expect to do any work on it prior to ZBB, so we should kick it down the road. Moving to Brannan.
Target Milestone: Q3 11 - Serrano → Q1 12 - Brannan
Retargeting further work to "Future" release. If benchmarking points to this as a problem, send to QRB.
Priority: P3 → --
Target Milestone: Q1 12 - Brannan → Future
As Dan assigned it to future, removing Andre blocker.
No longer depends on: Andre
Tamarin is a dead project now. Mass WONTFIX.
Status: REOPENED → RESOLVED
Closed: 17 years ago → 6 years ago
Resolution: --- → WONTFIX
Tamarin isn't maintained anymore. WONTFIX remaining bugs.