Bug 584987 (Closed): Opened 14 years ago, closed 14 years ago

JM/TM: static analysis to remove guards for overflow and negative zero

Categories

Product/Component: Core :: JavaScript Engine
Type: enhancement
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED WONTFIX

People

Reporter: jandem
Assignee: Unassigned

User-Agent:       Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.8) Gecko/20100723 Ubuntu/10.04 (lucid) Firefox/3.6.8
Build Identifier: 

For the mod operator, JM and TM emit a guard: if the result is 0 and the LHS is negative, the true result is negative zero (which cannot be represented as an int32). V8 does some static analysis to prove that negative zero does not matter in some cases, for example:

- while(x % y) {}
- a[x % y] = 0;
- (x % 2) == 0

The same applies to div, mul and probably other operations. This reduces code size and is probably a small win on some benchmarks.
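For illustration (plain JavaScript, runnable anywhere), negative zero is unobservable in exactly those contexts:

  // -0 is falsy, just like +0, so a loop or branch condition cannot tell them apart:
  Boolean(-0);      // false, same as Boolean(0)

  // Property keys go through ToString, and String(-0) === "0", so a[-0] is a[0]:
  var a = [];
  a[-0] = 1;
  a[0];             // 1

  // Both == and === treat -0 and +0 as equal:
  -0 == 0;          // true
  -0 === 0;         // true

  // The sign only becomes observable through something like division:
  1 / -0;           // -Infinity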

V8 has a similar optimization for overflow checks. For (a + b) & 0xff they don't check for overflow, as the result is always converted to an integer.
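A quick check of why that's safe (plain JavaScript): & applies ToInt32, which reduces its operand modulo 2^32, so masking the exact double sum and masking the wrapped int32 sum give the same bits:

  var a = 0x7fffffff, b = 1;     // a + b overflows int32 (exact sum is 2147483648)

  (a + b) & 0xff;                // 0: ToInt32(2147483648) wraps to -2147483648 before masking
  ((a + b) | 0) & 0xff;          // 0: same answer if the add itself had wrapped to int32

So as long as every use of the sum is a bitwise op, the overflow check is dead weight.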

Reproducible: Always
Status: UNCONFIRMED → NEW
Ever confirmed: true
(In reply to comment #0)
> 
> V8 has a similar optimization for overflow checks. For (a + b) & 0xff they
> don't check for overflow, as the result is always converted to an integer.

We have bug 575529 for this.
We could annotate all bytecodes (perhaps in BytecodeAnalyzer?) that push a value with the bytecode that pops it, like this:

0: JSOP_ZERO  (JSOP_ADD, 1)
1: JSOP_ONE   (JSOP_ADD, 0)
2: JSOP_ADD   (JSOP_SUB, 1)
3: JSOP_ZERO  (JSOP_SUB, 0)
4: JSOP_SUB   (JSOP_POP, 0)
5: JSOP_POP   -

(Or just some flags, like used_as_int, can_ignore_neg_zero, etc.)
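As a toy model (plain JavaScript, not actual SpiderMonkey code; field names invented for illustration), the annotation for the snippet above could be stored roughly like this:

  // For each op that pushes a value: the op that pops it and the operand index
  // (numbers mirror the table above), plus whatever flags a later pass wants.
  var annotations = [
    { pc: 0, op: "JSOP_ZERO", use: { op: "JSOP_ADD", operand: 1 } },
    { pc: 1, op: "JSOP_ONE",  use: { op: "JSOP_ADD", operand: 0 } },
    { pc: 2, op: "JSOP_ADD",  use: { op: "JSOP_SUB", operand: 1 } },
    { pc: 3, op: "JSOP_ZERO", use: { op: "JSOP_SUB", operand: 0 } },
    { pc: 4, op: "JSOP_SUB",  use: { op: "JSOP_POP", operand: 0 } },
    { pc: 5, op: "JSOP_POP",  use: null }
  ];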

This will allow us to add at least four optimizations:

1) Remove negative zero checks. Consider (x * y) == 0. In JSOP_MUL we know the result is popped by JSOP_EQ, so we can skip the check (see the sketch after this list).

2) Remove overflow checks in JM: for (a + b) & 0xff we know in JSOP_ADD that JSOP_BITAND will convert to int32, so ignore overflow.

3) Optimize register allocation. Shift operations on x86 want lhs in ecx. So for patterns like (m&0x7fff)<<15, we have to move the result of the JSOP_BITAND to ecx. With this optimization, for JSOP_BITAND we can try to allocate ecx as result register, eliminating a spill and move.

4) Other optimizations. For example, for arr[x / 2] we could perhaps use integer division.
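Sketch for optimization 1, using those annotations (illustrative JavaScript only; the helper and the opcode set are made up, not JM's actual API):

  // Consumers that cannot distinguish -0 from +0, so the producing op
  // (JSOP_MUL, JSOP_MOD, ...) can skip its negative-zero check.
  var NEG_ZERO_INSENSITIVE_USES = {
    JSOP_EQ: true, JSOP_NE: true,              // -0 == 0 is true
    JSOP_STRICTEQ: true, JSOP_STRICTNE: true,  // -0 === 0 is true
    JSOP_POP: true                             // result is discarded
  };

  function canSkipNegZeroCheck(annotation) {
    return annotation.use !== null &&
           NEG_ZERO_INSENSITIVE_USES[annotation.use.op] === true;
  }

  // For (x * y) == 0 the JSOP_MUL annotation records JSOP_EQ as its use,
  // so canSkipNegZeroCheck(...) is true and the guard is not emitted.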

Does this make any sense, would it get accepted, or is there a better approach? I don't want to write a patch and then hear there is a much better alternative :)
Jan, this sounds really good. #1 probably won't affect register allocation, but it could eliminate some branches. #2 seems potentially valuable, because then the upcoming BITAND does not need to load the type tag into a register. I like the other use cases as well.

Even a peephole analysis might work just to see what kind of gains you'd get.
(In reply to comment #2)

Yes, good ideas. I have some further comments below about how to conceptualize this.

> We could annotate all bytecodes (perhaps in BytecodeAnalyzer?) that push a
> value with the bytecode that pops it, like this:
> 
> 0: JSOP_ZERO  (JSOP_ADD, 1)
> 1: JSOP_ONE   (JSOP_ADD, 0)
> 2: JSOP_ADD   (JSOP_SUB, 1)
> 3: JSOP_ZERO  (JSOP_SUB, 0)
> 4: JSOP_SUB   (JSOP_POP, 0)
> 5: JSOP_POP   -

That's basically def-use 'chains' on temporary variables, a classic and very useful compiler annotation. It's particularly nice and simple for stack temporary variables, because they can have only one use. One way of encoding them would be to put the address of the use for each op in the ByteCodeAnalyzer result.

But it might also be useful to 'push' the def-use chains through local variables and certain other constructs. For example, with JSOP_DUP, the JSOP_DUP isn't really the "consumer" of the value--it's really copying the value so that one or two other ops can actually use it. So it might be interesting to make that distinction. If we do that, then there can be multiple uses.

Don't let me get too excited, though--it's probably better to keep things simple and get the small wins right now, and let the expanded ideas wait for later.

> (Or just some flags, like used_as_int, can_ignore_neg_zero, etc.)

Those would correspond to the optimization properties compilers compute from the def-use results. And we should have those too--it will make the following optimization stages easier to implement.

> This will allow us to add at least four optimizations:
> 
> 1) Remove negative zero checks. Consider (x * y) == 0. In JSOP_MUL we know it's
> popped by JSOP_EQ, so we can skip the check.

Makes sense. The perf win is probably small, but the code size improvement might be nice. Are there other ops where negative zero doesn't count? What's the key property that establishes that?

> 2) Remove overflow checks in JM: for (a + b) & 0xff we know in JSOP_ADD that
> JSOP_BITAND will convert to int32, so ignore overflow.

This might help some in fast int kernels, like crypto stuff. Is the rule that if all uses of a value convert it to int32 or uint32, then we can ignore overflow and instead do standard int32 arithmetic (assuming the inputs are ints)?
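That seems right, at least for the bitwise ops: >>> applies ToUint32 and & applies ToInt32, and both reduce modulo 2^32, so the wrapped int32 sum converts to the same value as the exact double sum (plain JavaScript):

  var a = 0x7fffffff, b = 1;
  (a + b) >>> 0;            // 2147483648
  ((a + b) | 0) >>> 0;      // 2147483648: the wrapped int32 sum converts to the same uint32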

> 3) Optimize register allocation. Shift operations on x86 want lhs in ecx. So
> for patterns like (m&0x7fff)<<15, we have to move the result of the JSOP_BITAND
> to ecx. With this optimization, for JSOP_BITAND we can try to allocate ecx as
> result register, eliminating a spill and move.

David, doesn't TM do things like this? It seems to me that if an operand must be in ecx, then we can't lose by allocating ecx for the result that feeds it. I guess the other aspect is that we'd prefer *not* to allocate ecx for other values that are coming up, but we'd need some additional algorithm to do that.

> 4) Some other optimizations. Like for arr[x / 2] we can perhaps use integer
> division.

There are probably others. But I think array indices are not truncated, so I don't think there's anything we can do with a[x/2].
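A quick demonstration (plain JavaScript) that the fractional index is observable, so truncating would change behavior:

  var arr = [10, 20, 30];
  arr[5 / 2];          // undefined: the lookup uses the property key "2.5"
  arr[5 / 2] = 99;
  arr["2.5"];          // 99, stored as an ordinary named property, not an element
  arr[2];              // still 30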

> Does this make any sense, would it get accepted, or is there a better approach?

I think it makes sense and would be accepted. I think the most important thing is to make the approach reasonably clear at the conceptual level, so that people can understand how it works, and it can grow to encompass more and more optimizations. I think you are on the right track with that.

> I don't want to write a patch and then hear there is a much better alternative
> :)

I certainly don't have a better alternative sitting around. Actually I think your idea is really great. Since you asked, I do want to kick out a brief forecast of some complementary ideas we might want to get into someday, so you can keep them in mind:

- Liveness analysis seems like it might be able to give us a further boost. If the bytecode analyzer (suitably beefed up) can prove that a certain op's result will never get used, then we can do less work for the op, or skip it entirely. Trivial example:

  x = 2;
  x = 3;

The first store to x is dead, so we can wipe it out. A dead value is basically one that no bytecode uses. So it might be possible to use the same analysis. Note that liveness is usually computed as a backward pass. Right now we have only one forward pass in the bytecode analyzer. Since it runs on all code, and it's in a JIT, it needs to be really fast, so it's not clear whether we can afford a full backward pass right now. But I think the def-use chains you want can reasonably be computed forwards anyway.
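As a minimal sketch (illustrative JavaScript over a mock instruction list, not the real bytecode analyzer; straight-line code only, no control flow or escaping variables), the backward pass is roughly:

  // Mock straight-line "bytecode": each instruction sets or gets a local.
  var code = [
    { op: "set", name: "x", value: 2 },   // dead: overwritten before any use
    { op: "set", name: "x", value: 3 },
    { op: "get", name: "x" }
  ];

  // Walk backward tracking which locals are live (i.e. read later);
  // a "set" whose target is not live at that point is a dead store.
  function findDeadStores(code) {
    var live = Object.create(null);
    var dead = [];
    for (var i = code.length - 1; i >= 0; i--) {
      var insn = code[i];
      if (insn.op === "get") {
        live[insn.name] = true;
      } else if (insn.op === "set") {
        if (!live[insn.name]) dead.push(i);
        live[insn.name] = false;          // the store kills liveness above it
      }
    }
    return dead.reverse();
  }

  findDeadStores(code);                   // [0]: the first store to x is dead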

- Similarly, it seems like we might want a more advanced register allocator someday, with the ability to do things like see ahead that ecx will be required for a certain op, and thus to not use ecx now. Clearly, we can't use a slow-running highly optimizing register allocator, so it would either use a simple algorithm or simple tricks. That would be a ways out in the future, though.
Lots of overlap here with the information computed in the initial passes by the JS type inference (bug 557407).  Basic thought: adding simple information to the ByteCodeAnalyzer is great, but if you want to do something more complicated it's probably better to just rip the analysis structures out of the inference and use those (they subsume this and everything else in the ByteCodeAnalyzer).

The cost of doing passes isn't much.  The total cost of the initial forward pass (connect stacks together, find must-execute code, etc.), second forward pass (copy/constant propagation) and third backward pass (register allocation) is 5ms on my machine on SunSpider.  Constructing the data structures to store the information is more expensive (not sure how much) but would go down if type information wasn't being constructed/stored.
Thanks for your comments, very useful. 

@dmandelin: interesting ideas, and indeed, in the long term #3 can/should be fixed with a better register allocation algorithm. Also good to know the name of things I propose :)

@Brian: Ah of course, if we can get the same info from your type inference algorithm, then I will probably just wait for that, as I don't expect dramatic performance wins here. Are there any blockers for bug 557407? If so, I should probably help with that instead. I will also try to understand your type-inference algorithm -- it's truly amazing that it can determine most types in a few ms.

Later this week I will add some hacks and see how much we can win. If that's just a few ms, I will wait for bug 557407 to land.
Marking WONTFIX; let's just wait for bug 557407.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX