Closed Bug 578517 Opened 14 years ago Closed 14 years ago

JM: Make |double| >> |int| fast

Categories

(Core :: JavaScript Engine, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking


Status: RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: cdleary)

References

Details

(Whiteboard: fixed-in-jaegermonkey)

Attachments

(2 files, 4 obsolete files)

We take 300K slow calls in math-cordic for right-shifting a double, plus maybe 100K from somewhere else (using old repr profiling stats -- should rerun after fatvals lands). JSC is much faster here because it has an out-of-line fast path for this case. This is worth about 8 ms on cordic.

JM wanted SS win: 10 ms
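To make the shape concrete, here is a minimal sketch of the pattern that hits the slow path (hypothetical code, not the actual math-cordic source): the left operand of >> is a genuine double at runtime, not an int in disguise.

// Hypothetical reduction of the hot pattern; not the actual benchmark source.
function kernel(n) {
    var sum = 0;
    for (var i = 0; i < n; i++) {
        var x = i + 0.5;   // a genuine double, not an integer in disguise
        sum += x >> 2;     // double >> int: the shape taking the slow calls
    }
    return sum;
}
kernel(300000);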
The tracer has the same issue. Are the numbers being shifted there actual doubles or just integers in disguise?
(In reply to comment #1)
> The tracer has the same issue. Are the numbers being shifted there actual
> doubles or just integers in disguise?

They are actual doubles.
Multiple fast-paths per op are not possible to implement satisfactorily until FrameState branch()/merge() are implemented. Setting dependency.
Depends on: 574930
FWIW, I just had a discussion on IRC with dvander, sstangl, adrake about all these "JM: make X fast" bugs. My point was that we shouldn't optimise something that is slow in SunSpider on JM unless we are certain that it will also be slow in SunSpider on JM+TM. In other words, TM may cover up some of JM's bad spots, so optimising those bad spots prematurely is not a good use of manpower.
(In reply to comment #4)
> FWIW, I just had a discussion on IRC with dvander, sstangl, adrake about all
> these "JM: make X fast" bugs. My point was that we shouldn't optimise
> something that is slow in SunSpider on JM unless we are certain that it will
> also be slow in SunSpider on JM+TM.

In my opinion, that's the wrong quantifier: we *should* optimize things in JM *unless* we are fairly certain they will be fast with JM+TM. We can plan to be sure to be fast; we won't need luck.

> In other words, TM may cover up some of JM's bad spots, so optimising those
> bad spots prematurely is not a good use of manpower.

That is a reasonable point. I looked into cordic, even though TM is fast on math, because I wanted to know why JM was slower there, and I did discover some good stuff that will definitely help us on other benchmarks where TM is not the fastest. And the TM equivalent of this bug should give us a little boost on TM math perf, and on our total score. It is true, though, that this particular bug should not be a high priority, because it will only get us <=2 ms of win outside of cordic.

More broadly, I don't think we need to worry too much about the misallocation you are worried about, because TM only beats JSC on 5 SunSpider benchmarks, for a total of 15 ms of win. So if we just make JM fast on everything, we cannot go too far astray. And we want good speed for other workloads with similar features that don't trace for whatever reason. But yes, point taken: improvements that help only JM on only 3bit, bitwise-and, and math-* should be assigned a lower priority at this point.
Blocks: JaegerPunyWins
No longer blocks: JaegerSpeed
No objections to #5, just a clarification: if neither JM nor TM beats JSC on a particular test case right now, that doesn't mean the blended score of the two is guaranteed not to beat JSC either. If JM makes the less loopy code fast but underperforms on the hot kernel, and TM makes the kernel fast but sucks at the rest, the blended results should be quite exciting. We will see soon whether this theory holds.
Yes to comment #5 - win where we can, but we know TM wins against all comers on a few benchmarks. bitwise-and isn't the priority, but we can get general wins by finding these slowdowns.

We don't need fancy FrameState for this case. We can do multiple fast-paths per op - not perfect, but something. I'm inclined not to block anything other than branch/loop regalloc on bug 574930. Schedule-wise, it's starting to feel like a post-4.0 thing, and the overhead of the algorithm will almost certainly be high.
No longer depends on: 574930
Andreas has a big TM win brewing. Labor expended on TM to win at SS can help too. Not sure how this trades off against JM, but it's clear that the status quo balance of TM vs. JM in comment 5, next-to-last paragraph, does not tell us where to invest. My point is not to freeze TM as-is, since we know of patches making it faster at SS. If a TM win is big enough (this is important) and easy to engineer, and the control flow already traces, we should do it.

/be
I've got a nearly-completed patch in the works.
Assignee: general → cdleary
Status: NEW → ASSIGNED
Slow call changes on math-cordic:

JSOP_GT:  +300000
JSOP_RSH: -300000
Net: 0

Looking into why the GTs are slowing down...
Updated against tip, which dvander pointed out has the GT fastpaths. I'll post benchmark results from my wobbly machine, but cordic goes stably from 15.3ms to 11.0ms.
Attachment #460721 - Attachment is obsolete: true
Attachment #460723 - Flags: review?(dvander)
Comment on attachment 460723 [details] [diff] [review]
Right shift fastpath with rsh(double, int) support.

>+    RegisterID jsop_rsh_rhs_reg(FrameEntry *rhs);

This wants a better name.

>+        if (!rhs->isConstant())
>+            frame.pwnDataIntoReg(rhs, reg);

frame.copyDataIntoReg(rhs, reg)

>+void
>+mjit::Compiler::jsop_rsh_const_const(FrameEntry *lhs, FrameEntry *rhs)

Add a case to JSOpBinaryTryConstantFold, then call it at the top of jsop_rsh(); see jsop_binary for its use.

>+mjit::Compiler::jsop_rsh_int_const(FrameEntry *lhs, FrameEntry *rhs)
>+{
>+    RegisterID result = frame.copyDataIntoReg(lhs);

Move this below...

>+    int32 shiftAmount = rhs->getValue().toInt32();
>+
>+    if (!shiftAmount) {
>+        frame.popn(2);
>+        frame.pushTypedPayload(JSVAL_TYPE_INT32, result);
>+        return;
>+    }

... this. If the shiftAmount is 0, you can just frame.pop().

>+void
>+mjit::Compiler::jsop_rsh_unknown_const(FrameEntry *lhs, FrameEntry *rhs)
>+{
>+    int32 shiftAmount = rhs->getValue().toInt32();
>+
>+    if (!shiftAmount) {
>+        RegisterID lhsData = frame.copyDataIntoReg(lhs);
>+        frame.popn(2);
>+        frame.pushTypedPayload(JSVAL_TYPE_INT32, lhsData);
>+        return;
>+    }

If you don't know the type and you get a constant 0 shift, you want:

frame.pop();
jsop_pos();

Otherwise, you're applying a payload to an unknown type:

js> function f(x) { print(x >> 0); }
js> f("2")
2010096

Will review the rest in a bit...
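(For reference, the coercion at issue above is standard ECMAScript semantics, not anything patch-specific: >> applies ToInt32 to its left operand, so even a shift by 0 must convert non-int32 values. A few shell-testable data points:)

// Standard >> semantics: ToInt32 on the lhs, so x >> 0 coerces; it is not a no-op.
assertEq("2" >> 0, 2);                     // string -> number -> int32
assertEq(3.7 >> 0, 3);                     // double truncates toward zero
assertEq(-3.7 >> 0, -3);
assertEq(2147483648 >> 0, -2147483648);    // wraps modulo 2^32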
Attachment #460723 - Flags: review?(dvander)
Updated with those changes. The parser looks like it folds rsh(const, const) in all cases, so I just omitted that. The rsh(x, 0) was a silly mistake, but I'm not sure about jsop_pos -- the type is known to be INT32 after the rsh, but jsop_pos looks like it converts sp[-1] to a number. Also, what does JSOP_POS even stand for?
Attachment #460723 - Attachment is obsolete: true
Attachment #460751 - Flags: review?(dvander)
Piece Of S417? Kidding. POSitive, opposite of JSOP_NEGative. /be
(In reply to comment #15)
> POSitive, opposite of JSOP_NEGative.

Added: https://developer.mozilla.org/En/SpiderMonkey/Internals/Bytecode
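(For anyone else wondering what jsop_pos does to the top of the stack, unary plus is just ToNumber; a quick sketch of the standard behavior:)

// JSOP_POS is unary '+': it applies ToNumber to sp[-1] and nothing else.
assertEq(+"2", 2);      // string coerced to number
assertEq(+true, 1);     // boolean coerced to number
assertEq(+3.5, 3.5);    // already a number, so it is the identity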
(In reply to comment #14)
Thanks, yeah, JSOP_POS is wrong. You want to guard on the type first, and if you get a zero, elide the shift. We can't rely on the parser because we propagate constants, e.g.:

var x = 5;
var y = 6;
print(x >> y);

The parser won't fold this, but we will. Getting late, so I will look at the rest first thing in the morning :)

Btw - this is a 10ms win on my machine, so: awesome.
(In reply to comment #17)
> The parser won't fold this, but we will.

Er... right. Even if it did happen to fold that, I don't think it folds as many expressions as we do, because we're cool and the parser isn't, so it's worth sticking in there. Coincidentally, nothing hits my NYI assertion, so we should make a test for that. Constant prop makes for some interesting untested state space...
Why can the parser not fold this?
(In reply to comment #19)
> Why can the parser not fold this?

I think it actually does at the moment, but, as I was saying in comment 18, one can imagine a sufficiently complex expression where abstract stack interpretation will do it but the parser gives up.
The parser recursively builds the bytecode you interpret; how could the bytecode ever be too complex for the parser?
(In reply to comment #21)
> The parser recursively builds the bytecode you interpret; how could the
> bytecode ever be too complex for the parser?

The folding actually occurs in its own phase, after the parsing/node-creation phase in the parser. Looking at the code, it's hard to say with confidence that it'll catch every case the abstract stack will. But really, there's no *need* for it to handle every case, especially when constant folding falls out of abstract stack interpretation so easily. If neither tracing nor JM relies on constant prop at parse time, it seems like it could be eliminated.
(In reply to comment #22)
> If neither tracing nor JM relies on constant prop at parse time, it seems
> like it could be eliminated.

Thought about that for about two minutes and realized it's probably unrealistic. Programs with stuff like 1 + 2 + 3 + 4 or other ridiculous things would all have to have their bytecodes traced during recording. Sorry, just had a fleeting dream of eliminating some code...
(In reply to comment #20)
> I think it actually does at the moment

function f() {
    var x = 5, y = 6;
    return x >> y;
}

00001:  int8 5
00003:  setlocal 0
00006:  pop
00007:  int8 6
00009:  setlocal 1
00012:  pop
00013:  getlocal 0
00016:  getlocal 1
00019:  rsh
00020:  return

JM emits:

[jaeger] Insns     movl $0xffff0001, 0x2c(%ebx)
[jaeger] Insns     movl $0x0, 0x28(%ebx)
... return stuff
So why don't we fix the parser? Less code to analyze, trace, carry around.
(In reply to comment #25)
> So why don't we fix the parser? Less code to analyze, trace, carry around.

The current issue is that we don't fold numbers into name nodes (jsparse.cpp:FoldType), I'm guessing because we wanted to avoid checking the use/decl chains.
The folder predates def-use chains, but you have to be careful about control flow in ways the upvar optimizations do not have to be. We need SSA in one pass for the parser to do the folding here, and maybe for the var elimination too (the debugger shouldn't see uninitialized vars; it should see no vars).

Righteous JM work, Chris!

/be
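(A hedged illustration of the control-flow hazard described above: when a name has more than one reaching definition, folding it at parse time without SSA-style analysis would be wrong.)

// 'x' has two reaching definitions, so folding x >> 1 to a constant
// at parse time would be incorrect; this is what SSA would sort out.
function k(b) {
    var x = 5;
    if (b)
        x = 9;
    return x >> 1;   // 2 or 4, depending on b
}
assertEq(k(false), 2);
assertEq(k(true), 4);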
(In reply to comment #27)
> We need SSA in one pass for the parser to do the folding here, and maybe
> for the var elimination too

I don't think we need it for these simple cases. If we can track a use back to a decl/initialization and that link doesn't cross control-flow statements, we can propagate the constants.

/me goes to look at our statement traversal code...
Yes, of course you can special-case the def-use chain, but why? Why waste time on partial front-end folding for these synthetic cases? No one writes such dumb code in any benchmark I know of (please correct me if I'm wrong). /be
Comment on attachment 460751 [details] [diff] [review]
Right shift fastpath with rsh(double, int) support.

>+    if (!shiftAmount) {
>+        /* Need to coerce lhs to int. */
>+        frame.pop();
>+        jsop_pos();
>+        return;
>+    }

Remove, and

>+    masm.rshift32(Imm32(shiftAmount), lhsData);

... just avoid the shift here if shiftAmount is 0. ARM doesn't like 0 shift immediates for some reason.

>+
>+void
>+mjit::Compiler::jsop_rsh_const_unknown(FrameEntry *lhs, FrameEntry *rhs)
>+{
>+    RegisterID rhsData = jsop_rsh_rhs_reg(rhs);

Seriously, rightRegForShift or something :)

>+#define HAS_KNOWN_TYPE(fe, type) (fe->isTypeKnown() && fe->getKnownType() == type)

No macro; use fe->isType(type)

>+#define SLOW_CALL()                                   \
>+    JS_BEGIN_MACRO                                    \
>+        prepareStubCall(Uses(2));                     \
>+        stubCall(stubs::Rsh);                         \
>+        frame.popn(2);                                \
>+        frame.pushSyncedType(JSVAL_TYPE_INT32);       \
>+        return;                                       \
>+    JS_END_MACRO

And this can go away after the second use is removed, since we must constant fold.

> mjit::Compiler::jsop_bitop(JSOp op)
> {
>+    if (op == JSOP_RSH) {
>+        jsop_rsh();
>+        return;
>+    }

More explicit to call jsop_rsh() directly from Compiler.cpp instead.

>     /*
>      * Just pop RHS - leave LHS. ARM can't shift by 0.
>      * Type of LHS should be learned already.
>      */
>     frame.popn(2);
>     frame.pushTypedPayload(JSVAL_TYPE_INT32, reg);
>-    if (stubNeeded)
>-        stubcc.rejoin(Changes(1));

This can't be removed, since if stubNeeded==true there will be no jump from the OOL path to the inline path.

>+assertEq(rsh(1024.0, 2.0), 256)

1024.0 and 2.0 will be integers, so this doesn't test doubles. Also might want to add a test for >> 0.
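(A sketch of tests along the lines the review asks for; the exact cases are hypothetical additions, but the literals must be non-integral so they are actually stored as doubles:)

// Hypothetical test additions: non-integral literals stay doubles,
// and a zero shift still has to coerce the double lhs.
function rsh(x, y) { return x >> y; }
assertEq(rsh(1024.5, 2), 256);     // double lhs: ToInt32(1024.5) == 1024
assertEq(rsh(1024.5, 0), 1024);    // double >> 0 still coerces
assertEq(rsh(-1.5, 0), -1);        // truncation toward zero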
Attachment #460751 - Flags: review?(dvander)
Comment on attachment 460751 [details] [diff] [review]
Right shift fastpath with rsh(double, int) support.

>+    if ((lhs->isTypeKnown() && lhs->getKnownType() != JSVAL_TYPE_INT32) ||
>+        (rhs->isTypeKnown() && rhs->getKnownType() != JSVAL_TYPE_INT32)) {
>+        SLOW_CALL();
>+    }

Something tagged as a double will take a slow path here. An easy way to test this is with a constant double. Btw, this pattern is pretty common, so dmandelin added fe->isNotType().
Attachment #460727 - Attachment is patch: false
With review changes.
Attachment #460751 - Attachment is obsolete: true
Attachment #461047 - Flags: review?(dvander)
Whoops, had attached an old patch.
Attachment #461047 - Attachment is obsolete: true
Attachment #461049 - Flags: review?(dvander)
Attachment #461047 - Flags: review?(dvander)
Comment on attachment 461049 [details] [diff] [review]
Right shift fastpath with rsh(double, int) support.

nice
Attachment #461049 - Flags: review?(dvander) → review+
9.5ms on the graphs - cordic score now matches TM.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED