Closed Bug 536630 Opened 15 years ago Closed 14 years ago

TM: awful LIR generated for bitwise-and.js (and everything else)

Categories: Core :: JavaScript Engine (defect)
Severity: normal
Status: RESOLVED FIXED
People: Reporter: n.nethercote; Assigned: n.nethercote

Yesterday I was trying to gauge the quality of Nanojit's LIR-to-native translations. Nothing leapt out at me; I think there may be some tweaks to do, but the quality is fairly good. This isn't so surprising because LIR is pretty close to machine code. What did leap out at me was the LIR itself, which to my eyes looks, well, awful.

For example, bitops-bitwise-and.js looks like this:

  bitwiseAndValue = 4294967296;
  for (var i = 0; i < 600000; i++)
    bitwiseAndValue = bitwiseAndValue & i;

The bytecode looks like this:

  00022:  27  trace
  00023:  28  bindname "bitwiseAndValue"
  00026:  28  name "bitwiseAndValue"
  00029:  28  getgvar "i"
  00032:  28  bitand
  00033:  28  setname "bitwiseAndValue"
  00037:  27  gvarinc "i"
  00040:  27  pop
  00041:  27  getgvar "i"
  00044:  27  uint24 600000
  00048:  27  lt

The LIR looks like this (instructions indented to the right are optimised away by StackFilter or dead-code elimination):

  00: start
  01: ebx = iparam 0 ebx
  02: esi = iparam 1 esi
  03: edi = iparam 2 edi
  04: state = iparam 0 ecx
  05: label1:
  06: sp = ld state[8]
  07: rp = ld state[24]
  08: cx = ld state[0]
  09: eos = ld state[12]
  10: eor = ld state[28]
  11: ld1 = ld cx[0]
  12: 0 = int 0
  13: eq1 = eq ld1, 0
  14: xf1: xf eq1 -> pc=0x30cba6 imacpc=0x0 sp+0 rp+0 (GuardID=001)
  15: obj = int 2961408
  16: sti sp[0] = obj
  17: ld2 = ld cx[148]
  18: ld3 = ld ld2[56]
  19: map = ld ld3[0]
  20: ld4 = ld eos[664]
  21: $global0 = i2f ld4
  22: sti sp[8] = ld4
  23: ld5 = ld eos[656]
  24: $global1 = i2f ld5
  25: sti sp[16] = ld5
  26: and1 = and ld4, ld5
  27: i2f1 = i2f and1
  28: sti sp[8] = and1
  29: map = ld obj[0]
  30: sti eos[664] = and1
  31: 1 = float 1
  32: 1 = int 1
  33: add1 = add ld5, 1
  34: ov1 = ov add1
  35: xt1: xt ov1 -> pc=0x30cbb5 imacpc=0x0 sp+0 rp+0 (GuardID=002)
  36: i2f2 = i2f add1
  37: sti sp[0] = ld5
  38: sti eos[656] = add1
  39: sti sp[0] = add1
  40: 600000 = float 600000
  41: 600000 = int 600000
  42: sti sp[8] = 600000
  43: flt1 = flt i2f2, 600000
  44: xf2: xf flt1 -> pc=0x30cbc0 imacpc=0x0 sp+16 rp+0 (GuardID=003)
  45: sti eos[664] = and1
  46: sti eos[656] = add1
  47: j -> label1
  48: live state
  49: x1: x -> pc=0x30cba6 imacpc=0x0 sp+0 rp+0 (GuardID=004)

I sure was surprised to see so much LIR generated for such a small loop. Some thoughts I had about this:

- I wonder if a quick optimisation of the bytecode would be a win. The gvarinc "i" / pop / getgvar "i" sequence seems sub-optimal. That could also help with interpretation speed.

- There sure is a lot of memory traffic via 'eos' and 'sp'. Some of the latter cases are removed by StackFilter, but it still seems like this could be improved, perhaps in the bytecode-to-LIR translation itself.

- I imagine some of the write-backs are necessary because of the guards; maybe these could be moved to exit stubs?

- Instruction 45 seems redundant w.r.t. instruction 30; maybe applying StackFilter to eos would be beneficial? Likewise instructions 38 and 46.

- Writing 600000 to sp[8] is pointless.

- I also saw this code sequence for bitops-bits-in-byte.js:

    44: ld1 = ld sp[-8]
    46: sti sp[-8] = ld1

  And ld1 had no other uses. There was an instruction between them that was dead and so removed, but still, yuk.

- It's depressing to see overflow checks on the i++. That for-loop pattern must be really common.

- i is an integer for the bitwise-and and the increment, but is a float for the loop limit test, necessitating the i2f (and FP comparisons generate much worse machine code on i386/x64 than integer comparisons).

- The operation callback stuff (the first guard) is also depressing, but I don't know much about it.

- It's a shame that the bytecode-to-LIR translation is printed one basic block at a time by TMFLAGS=recorder; it would be easier to understand the translation if it were one bytecode instruction at a time.

I realise I'm wading into an area of the implementation that I know very little about, but I wonder if there is some low-hanging fruit here that I should pursue before eking out another couple of percent from Nanojit codegen. Suggestions are welcome!
> - Writing 600000 to sp[8] is pointless.

If the guard fails, this value has to be on the interpreter stack. But yeah, in general all these stores to the stack on the fast path are just awful. We should be sinking stores to side exits or figuring out how to spill to only one stack and not two.

> - It's depressing to see overflow checks on the i++

Language semantics :(

> - i is an integer for the bitwise-and and the increment, but is a float for
>   the loop limit test

That's worth looking into; it might have something to do with the global var being float and there being an & operator.

> - The operation callback stuff (the first guard) is also depressing, but I
>   don't know much about it.

This is for the slow script dialog in the browser. WebKit (according to Luke) pins the timeout to a fixed register. We check a variable in the JSContext.
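As a concrete illustration of the "language semantics" point: all JS numbers are IEEE-754 doubles, so when the tracer types a counter as int32, every ++ needs a guard in case the result no longer fits in an int32. A minimal shell example:

```javascript
// JS has a single double-precision number type; a traced int32
// increment needs an overflow guard because the result of ++ may
// no longer be representable as an int32.
var i = 2147483647;   // INT32_MAX: still representable as int32
i = i + 1;            // 2147483648: exceeds int32, must become a double
```

On trace this is exactly the `ov`/`xt` pair (or, in the newer LIR, `addxovi`) after the add.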
(In reply to comment #1)
> > - The operation callback stuff (the first guard) is also depressing, but I
> > don't know much about it.
>
> This is for the slow script dialog in the browser. WebKit (according to Luke)
> pins the timeout to a fixed register. We check a variable in the JSContext.

I just tried taking it out; the changes to SunSpider were in the noise. This matches what dmandelin told me. (But if we got rid of all the other crap, I wonder if its significance would increase.)
Operation callback is used for gc too, not just the slow script dialog, iirc...

We generally have plans for unifying the interp and jit stacks, right?

Applying StackFilter to eos might be a pretty simple quick-and-dirty thing to try and measure...
Oh, and fixing the gvarinc/pop/getgvar thing would basically involve optimizing our bytecode a bit right after generating it, right? I wonder how often such patterns come up...
(In reply to comment #4)
> Oh, and fixing the gvarinc/pop/getgvar thing would basically involve optimizing
> our bytecode a bit right after generating it, right? I wonder how often such
> patterns come up...

All the time for C-style

  for (i = 0; i < N; i++) do_something();

loops, because we compile them like so:

  i = 0;
  goto L2;
  L1: do_something(); i++;
  L2: if (i < N) goto L1;

We need a peephole optimizer. I dimly recall a bug being on file already. Can anyone spot it?

On the overflow check: language semantics requires it in general, but not for this case where we can see i start at 0 and be manipulated only by i++. True it is a global variable, so aliased as a property -- but if we can rule out a getter or setter for bitwiseAndValue that could mutate i to be on the edge of overflowing to a double, then we could avoid the overflow check.

This calls for a separate bug: cheap static analysis of loop variables to avoid overflow guards. Please file or cite if already filed.

/be
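To make the peephole idea concrete, here is a rough sketch in JavaScript over a hypothetical bytecode representation (the op objects and the fused gvarinc_keep op are invented for illustration, not real SpiderMonkey opcodes): it rewrites the gvarinc / pop / getgvar triple into one op that leaves the incremented value on the stack.

```javascript
// Sketch of a one-window peephole pass: collapse
//   gvarinc x ; pop ; getgvar x
// into a single hypothetical fused op that keeps the
// incremented value on the stack instead of reloading it.
function peephole(ops) {
  var out = [];
  for (var i = 0; i < ops.length; i++) {
    var a = ops[i], b = ops[i + 1], c = ops[i + 2];
    if (a && b && c &&
        a.op === "gvarinc" && b.op === "pop" &&
        c.op === "getgvar" && c.arg === a.arg) {
      out.push({ op: "gvarinc_keep", arg: a.arg });  // fused op (invented)
      i += 2;  // skip the pop and the getgvar
    } else {
      out.push(a);
    }
  }
  return out;
}
```

The real pass would also have to fix up jump offsets and source notes, which is most of the actual work; this only shows the pattern match.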
I already commented on these things on IRC but I figured I should say something again here for posterity.

> - I wonder if a quick optimisation of the bytecode would be a win. The
>   gvarinc "i" / pop / getgvar "i" sequence seems sub-optimal. That could
>   also help with interpretation speed.

Peephole optimization of bytecode? Seems reasonable, although I don't have any real evidence yet that it would be a big win. It also seems to me that some of those kinds of optimizations could be expressed in the tracer or the baseline-compiler-to-be.

> - There sure is a lot of memory traffic via 'eos' and 'sp'. Some of the
>   latter cases are removed by StackFilter but it still seems like this could
>   be improved, perhaps in the bytecode-to-LIR translation itself.
> - I imagine some of the write-backs are necessary because of the guards,
>   maybe these could be moved to exit stubs?
> - Instruction 45 seems redundant w.r.t. instruction 30, maybe applying
>   StackFilter to eos would be beneficial? Likewise instructions 38 and 46.
> - Writing 600000 to sp[8] is pointless.
> - I also saw this code sequence for bitops-bits-in-byte.js:
>
>     44: ld1 = ld sp[-8]
>     46: sti sp[-8] = ld1
>
>   And ld1 had no other uses. There was an instruction between them that
>   was dead and so removed, but still, yuk.

I agree there's way too many stack loads/stores. But I'm not sure which you are concerned about: (a) more LIR -> longer compilation time, or (b) loads/stores getting through to the native code and making the generated code bigger and slower. Either way, some basic approaches for fixing this:

(1) My personal favorite is to dispense with the stack loads/stores from the LIR, and instead maintain on the side (say in TraceRecorder) a representation of what logically is on the stack, i.e., a mapping from stack location to LIR instruction. Presumably we already have this in the tracker. The state of this map needs to be snapshotted at every side exit. Then, when we are generating the code, every time we do a side exit, we generate stores for the current stack state either on the main trace or in a side exit compensation block. When we do generate stores on trace, we can track that so that we don't generate them again next exit if unchanged. What I like about this is that it minimizes the intermediate info that needs to be tracked, and doesn't need a high-quality general-purpose LIR optimizer to get rid of the stores.

(2) Alternatively, use a high-quality dead store elimination. :-) This doesn't help shrink the LIR, though.

> - It's depressing to see overflow checks on the i++. That for-loop pattern
>   must be really common.

It seems to me they should be pretty predictable, so they may not hurt us that much. We should measure. It might also be possible to mitigate by scheduling 1 or 2 non-side-effecting ops that would come after the oflow branch to run before it instead, or something like that.

Otherwise, I think we should demote less aggressively, for this reason and so that we generate fewer traces. But in this particular test we would probably want to demote anyway, because we are doing bitops. I've been recommending that we try out some of these codes with various demotion decisions for a while now, so that we have a clearer idea of what the effect really is. The static analysis of loop bounds idea is worth pursuing, *if* measurements suggest it will give a significant boost.

> - i is an integer for the bitwise-and and the increment, but is a float for
>   the loop limit test, necessitating the i2f (and FP comparisons generate
>   much worse machine code on i386/x64 than integer comparisons).

Why is the loop limit test done with floats? That seems clearly wrong. IMO we should strive for "connected" variables on a trace to use the same representation ("connected" meaning they are operands to the same operation).

> - It's a shame that the bytecode-to-LIR translation is printed one basic
>   block at a time by TMFLAGS=recorder, it would be easier to understand
>   the translation if it were one bytecode instruction at a time.

Agreed.
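A toy sketch of approach (1) above, with invented names (Recorder, set, sideExit) standing in for whatever TraceRecorder would actually do: stores are never emitted eagerly; a per-slot map is kept on the side and flushed at each side exit, and a slot that was already flushed and is unchanged since the last exit is not stored again.

```javascript
// Hypothetical model of sinking stack stores to side exits.
// stackMap: slot offset -> current LIR value for that slot.
// flushed:  slots whose store has been emitted and is still current.
function Recorder() {
  this.stackMap = {};
  this.flushed = {};
  this.lir = [];      // emitted LIR, as strings for illustration
}
Recorder.prototype.set = function (slot, value) {
  this.stackMap[slot] = value;  // no store emitted here
  delete this.flushed[slot];    // slot is dirty again
};
Recorder.prototype.sideExit = function (guard) {
  // Before a guard, flush only the dirty slots; slots flushed at an
  // earlier exit and unchanged since need no new store.
  for (var slot in this.stackMap) {
    if (!this.flushed[slot]) {
      this.lir.push("sti sp[" + slot + "] = " + this.stackMap[slot]);
      this.flushed[slot] = true;
    }
  }
  this.lir.push(guard);
};
```

Under this model, the repeated eos/sp write-backs in comment 0 (e.g. instructions 30 and 45) would collapse to one store per slot per change, at whichever guard first needs it.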
Depends on: 536641
Depends on: 536642
(In reply to comment #5)
> On the overflow check: language semantics requires it in general, but not for
> this case where we can see i start at 0 and be manipulated only by i++.

"and stopping at 600000", I meant to add.

> True it
> is a global variable, so aliased as a property -- but if we can rule out a
> getter or setter for bitwiseAndValue that could mutate i to be on the edge of
> overflowing to a double, then we could avoid the overflow check.

We don't have to do compile-time property vs. global alias analysis, thanks to js_LeaveTraceIfGlobalObject. This all seems doable, even the canonical for-loop analysis, given our def-use chains through the AST. No CFG full of BBs, and no peephole optimization pass overhead required.

/be
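A back-of-the-envelope version of the analysis being proposed, over an invented loop-summary structure (the real thing would walk def-use chains through the AST as described above): if the counter starts at a known int32, is mutated only by ++, and the loop test bounds it strictly below INT32_MAX, then i + 1 can never leave int32 range and the overflow guard can be dropped.

```javascript
// Hypothetical loop summary: { init, step, onlyMutatedByStep, bound }
// where the loop is: for (i = init; i < bound; i++) ...
// Returns whether the traced i++ still needs an overflow guard.
function overflowGuardNeeded(loop) {
  var INT32_MAX = 2147483647;
  return !(typeof loop.init === "number" &&
           loop.init === (loop.init | 0) &&  // init is a known int32
           loop.step === "++" &&             // only mutation is ++
           loop.onlyMutatedByStep &&         // no other writes (incl. getters/setters)
           loop.bound < INT32_MAX);          // i < bound keeps i + 1 in int32 range
}
```

For the bitwise-and.js loop (init 0, bound 600000) this says the guard is unnecessary, which is the point of the separate bug being requested.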
(In reply to comment #5)
> We need a peephole optimizer. I dimly recall a bug being on file already. Can
> anyone spot it?

I couldn't find any open JS bugs involving the word "peephole", so I filed bug 536642.

> This calls for a separate bug: cheap static analysis of loop variables to avoid
> overflow guards. Please file or cite if already filed.

Bug 536641.

> Peephole optimization of bytecode? Seems reasonable, although I don't have any
> real evidence yet that it would be a big win. It also seems to me that some of
> those kinds of optimizations could be expressed in the tracer or the
> baseline-compiler-to-be.

For this one, I don't see how we'll get the evidence without implementing a peephole optimizer. Fortunately, they are pretty easy things to implement.

> I agree there's way too many stack loads/stores. But I'm not sure which you are
> concerned about: (a) more LIR -> longer compilation time, or (b) loads/stores
> getting through to the native code and making the generated code bigger and
> slower.

I imagine (b) is more important, but it doesn't really matter, because we'll get both (a) and (b) if we can improve things.

> (2) Alternatively, use a high-quality dead store elimination. :-) This
> doesn't help shrink the LIR, though.

bz was talking about something similar, but I don't see how NJ can do much better than StackFilter. I'd be happy to be enlightened.

> > - i is an integer for the bitwise-and and the increment, but is a float for
> > the loop limit test, necessitating the i2f (and FP comparisons generate
> > much worse machine code on i386/x64 than integer comparisons).
>
> Why is the loop limit test done with floats? That seems clearly wrong. IMO we
> should strive for "connected" variables on a trace to use the same
> representation ("connected" meaning they are operands to the same operation).

I filed bug 536643 for this.
(In reply to comment #8)
> (In reply to comment #5)
> > (2) Alternatively, use a high-quality dead store elimination. :-) This
> > doesn't help shrink the LIR, though.
>
> bz was talking about something similar, but I don't see how NJ can do much
> better than StackFilter. I'd be happy to be enlightened.

Hmmm, I think I forgot half of what I meant to say. Does NJ do full store-after-store dead store elimination already? That was one piece of what I was talking about.

The other thing that would be needed that I totally forgot about is a store-sinking optimization. It would have to know that stack stores have no effect until you take a side exit and use that to push them into the side exit.
(In reply to comment #9)
> Hmmm, I think I forgot half of what I meant to say. Does NJ do full
> store-after-store dead store elimination already? That was one piece of what I
> was talking about.

StackFilter does this for sp-relative and rp-relative stores within a basic block. (Although bz recently found an edge case of interest that it misses, bug 536458.) Doing it also for eos might be worthwhile, although I think in the example above that alone won't help, because all the repeats are in different basic blocks.

> The other thing that would be needed that I totally forgot about is a
> store-sinking optimization. It would have to know that stack stores have no
> effect until you take a side exit and use that to push them into the side exit.

Sounds like what dvander said in comment 1. Should I file a bug for that as well?
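For reference, a simplified model of the store-after-store elimination being discussed (a sketch over a toy instruction list, not StackFilter's actual code): walking a block backwards, a store to a slot is dead if a later store hits the same slot with no intervening guard, and every guard acts as a barrier because its side exit may read the whole stack.

```javascript
// Toy dead-store filter: ins is {op:"sti", slot, v}, {op:"guard"},
// or anything else. Backward pass; 'live' records slots that are
// overwritten later in the block with no guard in between.
function filterStores(block) {
  var live = {};
  var kept = [];
  for (var i = block.length - 1; i >= 0; i--) {
    var ins = block[i];
    if (ins.op === "sti") {
      if (live[ins.slot]) continue;  // overwritten later: dead, drop it
      live[ins.slot] = true;
      kept.unshift(ins);
    } else if (ins.op === "guard") {
      live = {};                     // exit may read every slot: barrier
      kept.unshift(ins);
    } else {
      kept.unshift(ins);
    }
  }
  return kept;
}
```

The barrier on every guard is exactly why this alone cannot remove the cross-block repeats (instructions 30 vs. 45 in comment 0); that needs the store-sinking idea instead.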
(In reply to comment #10)
> > The other thing that would be needed that I totally forgot about is a
> > store-sinking optimization. It would have to know that stack stores have no
> > effect until you take a side exit and use that to push them into the side exit.
>
> Sounds like what dvander said in comment 1. Should I file a bug for that as
> well?

Please do. The idea has come up several times, but we haven't done anything with it yet. I think the argument against it is that it will increase code size. Whether the perf gains are worth the increase is something we probably can't figure out until we actually do it.
(In reply to comment #11)
> > > The other thing that would be needed that I totally forgot about is a
> > > store-sinking optimization.
> >
> > Should I file a bug for that as well?
>
> Please do.

Bug 537842.
Depends on: 536643, 537842
OS: Mac OS X → All
Hardware: x86 → All
(In reply to comment #0)
> - Instruction 45 seems redundant w.r.t. instruction 30, maybe applying
>   StackFilter to eos would be beneficial? Likewise instructions 38 and 46.
> - I also saw this code sequence for bitops-bits-in-byte.js:
>
>     44: ld1 = ld sp[-8]
>     46: sti sp[-8] = ld1
>
>   And ld1 had no other uses. There was an instruction between them that
>   was dead and so removed, but still, yuk.

Bug 551885 fixes these cases.
Depends on: 551885
FWIW, the LIR for bitwise-and.js is substantially better now:

  ebx = parami 0 ebx
  esi = parami 1 esi
  edi = parami 2 edi
  state = parami 0 ecx
  label1:
  sp = ldi.o state[8]
  cx = ldi.o state[0]
  eos = ldi.o state[12]
  ldi3 = ldi.csro cx[0]
  immi3 = immi 0
  eqi1 = eqi ldi3, immi3/*0*/
  xf2: xf eqi1 -> pc=0x90add56 imacpc=(nil) sp+0 rp+0 (GuardID=001)
  ldi2 = ldi.o eos[1384]
  ldi1 = ldi.o eos[1376]
  andi1 = andi ldi2, ldi1
  sti.o eos[1384] = andi1
  immi2 = immi 1
  addxovi1 = addxovi ldi1, immi2/*1*/ -> pc=0x90add65 imacpc=(nil) sp+0 rp+0 (GuardID=002)
  sti.o eos[1376] = addxovi1
  sti.s sp[0] = addxovi1
  immi1 = immi 600000
  sti.s sp[8] = immi1/*600000*/
  lti1 = lti addxovi1, immi1/*600000*/
  xf1: xf lti1 -> pc=0x90add70 imacpc=(nil) sp+16 rp+0 (GuardID=003)
  j -> label1
  livei state
  x1: x -> pc=0x90add56 imacpc=(nil) sp+0 rp+0 (GuardID=004)

There's nothing egregiously stupid (except for the dead guard at the end, but that doesn't matter much).
This tracking bug has served its purpose, even though some blocking bugs are still open.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED