Closed
Bug 600143
Opened 14 years ago
Closed 13 years ago
TM: more implicit constant propagation after guards
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: n.nethercote, Assigned: n.nethercote)
References
Details
Attachments
(1 file)
(deleted),
patch
|
edwsmith
:
feedback+
wmaddox
:
feedback+
|
Details | Diff | Splinter Review |
Continuing on from the example in bug 600127 comment 0, the code for
"a[i] += 3":
#------------------------------------------------------------------------
# getelem
#------------------------------------------------------------------------
ldi5 = ldi.o $stack5[4]
clasp = immi 0x2ee060
guard(class is Array) = eqi ldi5, clasp/*0x2ee060*/
xf4: xf guard(class is Array) -> pc=0x613532 imacpc=0x0 sp+32 rp+0
(GuardID=002)
capacity = ldi.o $stack5[40]
ltui1 = ltui ldi1, capacity
xf3: xf ltui1 -> pc=0x613532 imacpc=0x0 sp+32 rp+0 (GuardID=003)
dslots = ldi.o $stack5[24]
immi5 = immi 3
lshi2 = lshi ldi1, immi5/*3*/
addi2 = addi dslots, lshi2
ldi4 = ldi.o addi2[4]
JSVAL_TAG_INT32 = immi 0xffff0001
* eqi4 = eqi ldi4, JSVAL_TAG_INT32/*0xffff0001*/
xf2: xf eqi4 -> pc=0x613532 imacpc=0x0 sp+32 rp+0 (GuardID=004)
ldi3 = ldi.o addi2[0]
sti.s sp[16] = ldi3
#--------------------------------------------------------------------
sti.s sp[24] = immi5/*3*/
addxovi2 = addxovi ldi3, immi5/*3*/ -> pc=0x613535 imacpc=0x0 sp+32 rp+0
(GuardID=005)
sti.s sp[16] = addxovi2
#------------------------------------------------------------------------
# setelem
#------------------------------------------------------------------------
JSVAL_TAG_MAGIC = immi 0xffff0004
** eqi2 = eqi ldi4, JSVAL_TAG_MAGIC/*0xffff0004*/
** jf eqi2 -> label2
** hasNoIndexedProperties = calli.all #js_Array_dense_setelem_hole ( cx
$stack5 ldi1 )
** immi3 = immi 0
** eqi1 = eqi hasNoIndexedProperties, immi3/*0*/
** xt1: xt eqi1 -> pc=0x613536 imacpc=0x0 sp+24 rp+0 (GuardID=008)
** label2:
JSVAL_TAG_INT32_2 = immi 0xffff0001
sti.o addi2[4] = JSVAL_TAG_INT32_2/*0xffff0001*/
sti.o addi2[0] = addxovi2
#------------------------------------------------------------------------
We can see that ldi4 must be equal to JSVAL_TAG_INT32 after xf2 (marked with
a '*'). Therefore eqi2 must be false. Therefore all the lines marked '**'
can be removed. All this leaves for the SETELEM is writing the value and
tag. (The tag write is redundant but that's harder to determine and fodder
for a follow-up bug.) Also, by removing the js_Array_dense_setelem_hole()
it subsequently makes life easier for CSE.
Attached is a half-working patch, but one that doesn't correctly handle
non-linear control flow. It might be better to do this in CseFilter.
Kraken's imaging-desaturate.js looks to benefit hugely from this. It's main
loop looks like this:
while (p--) {
data[pix-=4] = data[pix1=pix+1] = data[pix2=[pix+2] = (data[pix]*0.3 + data[pix1]*0.59 + data[pix2]*0.11);
ie. it does three GETELEMS followed by three SETELEMS of the same elements;
this patch will allow the SETELEMS to be done almost for free; however, bug
584279 will need to land as well to make CSE strong enough for all this to
fall into place. I suspect other Kraken benchmarks will benefit a lot too,
eg. audio-fft and audio-beat-detection are both dominated by a single loop
where some array elements are get and then set.
Assignee | ||
Comment 1•14 years ago
|
||
Kraken results:
---------------------------------------------------------------
| millions of instructions executed |
| total | on-trace (may overestimate) |
---------------------------------------------------------------
| 4433.097 4433.098 (------) | 4142.906 4142.906 (------) | ai-astar
| 2934.580 2864.798 (1.024x) | 1391.178 1320.148 (1.054x) | audio-beat-det
| 1307.400 1307.410 (------) | 1140.652 1140.652 (------) | audio-dft
| 2683.524 2612.184 (1.027x) | 1347.583 1276.516 (1.056x) | audio-fft
| 2665.113 2665.127 (------) | 1856.255 1856.255 (------) | audio-oscillat
| 6950.825 6950.836 (------) | 4784.012 4784.013 (------) | imaging-gaussi
| 3315.591 3305.341 (1.003x) | 748.440 738.383 (1.014x) | imaging-darkro
| 5995.455 5952.050 (1.007x) | 3836.520 3793.159 (1.011x) | imaging-desatu
| 681.309 681.310 (------) | 10.073 10.073 (------) | json-parse-fin
| 497.739 497.739 (------) | 5.954 5.954 (------) | json-stringify
| 1272.720 1272.851 (------) | 748.719 748.719 (------) | stanford-crypt
| 710.550 710.096 (1.001x) | 362.991 362.697 (1.001x) | stanford-crypt
| 1152.121 1137.421 (1.013x) | 585.029 571.448 (1.024x) | stanford-crypt
| 503.629 497.694 (1.012x) | 199.421 195.863 (1.018x) | stanford-crypt
-------
| 35103.659 34887.962 (1.006x) | 21159.740 20946.793 (1.010x) | all
imaging-desaturate didn't improve much, turns out there's enough slots aliasing in the GETELEM/GETELEM/GETELEM/SETELEM/SETELEM/SETELEM sequence that this optimization doesn't hit for the most part, which is annoying.
I'm ambivalent about this patch. It's a small Kraken speed-up, but it puts some quite TM-specific code into NJ.
Assignee | ||
Comment 2•14 years ago
|
||
Comment on attachment 489387 [details] [diff] [review]
patch (against TM 57064:805c1a5d5cc6)
Ed, do you think this too gross for words?
Attachment #489387 -
Flags: feedback?(edwsmith)
Comment 3•14 years ago
|
||
Sort of pushing the envelope on CSE a bit, but the cure could be worse than the disease -- trying to factor that stuff out into a subclass would introduce some pretty weird hooks.
I see where you're going with it; at the point of a use of an expression, we know more information than at the point of a def, because of predicates in the surrounding control flow. So, you're building up tables of information about values at certain points, that get cleared when CSE resets.
In TR we have a similar hashtable for tracking the possible-null-ness of a bunch of pointer values (see VarTracker in CodegenLIR). Managing the hashtable is similar to managing the CSE tables, except we work harder to merge states at control-flow joins and not blow it away at labels with only one predecessor (so either branch of an if is handled well... not applicable for guards).
As long as the tables stay fairly small, it is manageable. But when they get big because of long sequences of code, I wish there were a cleaner way to do it, maybe with pseudo- LIR instructions, rather than a side table. But, we'd be inserting pseudo-instructions at the exact points we're inserting constants, using similar side tables.
SSI representation aims to solve this exact problem, by inserting special instructions at control-flow splits, which allow more exact information to be stored with the instruction. (like what value it must have). but its too heavweight. Jeremy Singer's dissertation is a great read on the topic.
http://www.dcs.gla.ac.uk/~jsinger/phd.html
Updated•14 years ago
|
Attachment #489387 -
Flags: feedback?(edwsmith) → feedback+
Updated•14 years ago
|
Attachment #489387 -
Flags: feedback?(wmaddox)
Assignee | ||
Comment 4•14 years ago
|
||
(In reply to comment #3)
>
> SSI representation aims to solve this exact problem, by inserting special
> instructions at control-flow splits, which allow more exact information to be
> stored with the instruction. (like what value it must have). but its too
> heavweight. Jeremy Singer's dissertation is a great read on the topic.
> http://www.dcs.gla.ac.uk/~jsinger/phd.html
Ha, I know Jeremy well, I did my PhD at the same time as him, he was in the office next door and we'd play cricket in the hallways :) He used to go on about SSI all the time, though I never absorbed much of the details.
Comment 5•14 years ago
|
||
Comment on attachment 489387 [details] [diff] [review]
patch (against TM 57064:805c1a5d5cc6)
Since this is in a similar vein to the existing knownCmpValues machinery, I won't raise an objection here, though it would be nice if there were some way to factor out these sorts of specialized optimizations so they are included only with the front-end that needs them.
GCC has a pass that does a fairly general optimization of this sort, decomposing the condition in an if-statement to generate a set of conditions known to be true or false in each branch. Since matching of candidate expressions against these is done by CSE-style hash table lookup rather than a backward-chaining inference engine, all of the possible expressions that will be eligible for this optimization have to be generated up-front (forward-chaining). This allows interesting optimizations for comparisons not involving a constant, for example. I can't imagine this approach is very fast, nor represents a good cost/benefit tradeoff for Nanojit, however, since CSE is already a bottleneck for us.
Attachment #489387 -
Flags: feedback?(wmaddox) → feedback+
Assignee | ||
Comment 6•14 years ago
|
||
(In reply to comment #5)
>
> Since this is in a similar vein to the existing knownCmpValues machinery, I
> won't raise an objection here, though it would be nice if there were some way
> to factor out these sorts of specialized optimizations so they are included
> only with the front-end that needs them.
I agree, it's just that this fits perfectly into the CSE pass :/ Either way, I don't plan on landing it any time soon as it's not enough of a win to be worth it.
Assignee | ||
Comment 7•13 years ago
|
||
TM's days are numbered: WONTFIX.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•