Closed Bug 599099 Opened 14 years ago Closed 13 years ago

investigate inline Vector.getNative*Property/setNative*Property calls

Categories

(Tamarin Graveyard :: Virtual Machine, defect)

Platform: x86
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: wsharp, Assigned: lhansen)

References

Details

(Whiteboard: PACMAN)

Attachments

(2 files, 6 obsolete files)

Attached file work in progress patch (obsolete) (deleted) —
Attached file work in progress patch (obsolete) (deleted) —
Profiling v8.5/optimized/crypto shows nearly all of our C++ time in Vector.getNativeIntProperty/setNativeIntProperty (though 70% or so of the total is in VM code). An experimental patch that inlines these calls directly shows a good performance improvement: 4655 to 8618 (higher is better).

Other highlights:

jsbench/typed/Crypt - 20.6% faster
HeapSort - 19.7% faster
SparseMatmult - 34.6% faster

The downside is added jitted code size.
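For orientation, here is roughly what each element access pays for today when it goes through the C++ helper; a minimal standalone sketch, not the actual Tamarin source (names, layout, and throwRangeError are illustrative):

    // Per-read cost through the helper: a call, a bounds check, and only
    // then the element load.  Inlining moves the check and the load into
    // jitted code, leaving the call only on the out-of-range path.
    #include <stdint.h>

    struct ListData {
        int32_t len;
        int32_t entries[1];   // variable-length in practice
    };

    extern void throwRangeError(uint32_t index, int32_t length);  // hypothetical stand-in

    int32_t getNativeIntProperty_sketch(const ListData* data, uint32_t index) {
        if (index >= uint32_t(data->len)) {
            throwRangeError(index, data->len);  // rare slow path; does not return in the VM
            return 0;                           // unreachable, keeps the sketch well-defined
        }
        return data->entries[index];            // the work worth inlining
    }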
Whiteboard: PACMAN
(In reply to comment #0)
> The downside is added jitted code size.

Is the increase out of line with the other inlining patches? If not, the real root problem is inlining without regard for profitability: it increases compile time and code size on all code, but only matters for hot code. OSR will address that.
The code size increase is in the same ballpark as our other patches. If OSR will mitigate the risks of bloating our JITed code, then this change would be fine.
Code snippet that works with Steven's latest Vector code that wraps an avmplus list object:

    LIns* listDataPtr = loadIns(LIR_ldp, offsetof(IntVectorObject, m_list) + offsetof(DataList<int32_t>, m_data),
                                localGetp(sp-1), ACCSET_OTHER, LOAD_CONST);
    LIns* arrayLength = loadIns(LIR_ldi, offsetof(ListData<int32_t>, len), listDataPtr, ACCSET_OTHER, LOAD_CONST);
    LIns* c = binaryIns(LIR_ltui, arrayLength, index);
    CodegenLabel& begin_label = createLabel("arrayoutofbounds");
    branchToLabel(LIR_jf, c, begin_label);
    // !!@ could be
    // toplevel()->throwRangeError(kOutOfRangeError, core()->intToString(index), core()->uintToString(m_length));
    callIns((objType == VECTORINT_TYPE ? FUNCTIONID(IntVectorObject_getNativeIntProperty)
                                       : FUNCTIONID(UIntVectorObject_getNativeIntProperty)),
            2, localGetp(sp-1), index);
    emitLabel(begin_label);
    LIns* rawArrayPtr = binaryIns(LIR_addp, listDataPtr, InsConst(offsetof(ListData<int32_t>, entries)));
    LIns* valOffset = binaryIns(LIR_addp, rawArrayPtr, ui2p(binaryIns(LIR_lshi, index, InsConst(2))));
    value = loadIns(LIR_ldi, 0, valOffset, ACCSET_OTHER, LOAD_CONST);
I would really like this change in the VM for Molehill games content. Many Molehill APIs (vertex buffer, index buffer, constants upload) take Vectors as input, and therefore the AS3 driving them uses Vector heavily. Profiling new game content shows _setNativeIntProperty always in the top spot, with the get/uint variations adding up to a full 10% (!) of CPU time spent only there.
Is the molehill code you are referring to built into the player or in user code or runtime loaded libraries?
(In reply to comment #5)
> Is the molehill code you are referring to built into the player or in user
> code or runtime loaded libraries?

It is coming entirely from the user AS3 code. We use block locks (*VectorAccessor) on the player side.
Assignee: nobody → lhansen
These are probably the most desirable cases to handle:

  index value is known to be int or uint
  and target location for load has same type as vector element
  and source and target location is local
  and ((operation is read and type is int, uint, double, float, float4, object)
       or (operation is write and type is int, uint, double, float, float4))

There's less sense in optimizing loads into heap storage, as the write barrier on store will dominate. There's no sense in handling writes to object vectors, for the same reason: the write barrier on store will dominate.

The read code should look like Werner's code. The write code will look very similar, but with a twist: the trap may do something, like extend the vector, and so may return. Based on Werner's code this looks like 7-10 additional static instructions (perhaps a little more for float4); the trap should have the same static cost as the current call.

The preconditions on the optimization are intended to restrict the optimization to the cases where it could possibly matter; OSR would be needed to determine whether it actually does matter.

It will be relatively tricky for the JIT to cache the length of the vector and hoist it out of a loop, because the length can be updated (a weakness in the design), so I think the only short-term benefit we will get here is avoiding the call overhead in the common case. However, several of the other values are unvarying (as Werner's code shows, if I read it right), and those are hoistable. Thus for loops we may see additional effects.
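To make the "twist" concrete, the write fast path would have roughly this control flow; a plain C++ sketch, not LIR and not the eventual patch (helper names and the trap signature are illustrative):

    // In-bounds writes store directly; the trap keeps today's out-of-line
    // behavior, which may grow the vector and then return normally.
    #include <stdint.h>

    struct ListData {
        int32_t len;
        int32_t entries[1];
    };

    // Assumed stand-in for IntVectorObject::setNativeIntProperty.
    extern void setNativeIntProperty_trap(ListData*& data, uint32_t index, int32_t value);

    void inlineWrite_sketch(ListData*& data, uint32_t index, int32_t value) {
        if (index < uint32_t(data->len))
            data->entries[index] = value;                   // 7-10 static instructions when jitted
        else
            setNativeIntProperty_trap(data, index, value);  // may extend the vector, then returns
    }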
Status: NEW → ASSIGNED
Ah, in my previous comment the restriction "target location is local" applies only to object reads, of course. I reiterate that the point of these restrictions is to reduce possible code bloat by choosing only instances where the optimization will be maximally visible. It would be useful to do some code size measurements without those restrictions; if the code size increase is very minor then there's no sense in worrying about it. Anyway, trying again, for the record:

  index value of vector access is known to be int or uint
  and source and target locations have the same type
  and ((operation is read or write and type is int, uint, double, float, float4)
       or (operation is read and target location is a local slot))
Attached patch Work in progress (updated, partly) (obsolete) (deleted) — Splinter Review
With the proviso that I am hardly a LIR expert: this update of Werner's work reduces the time of the Vector read benchmark (bug #678952) from 354ms to 86ms (a speedup of 4.1) on Object vectors with int or uint indices. (Lightly loaded Mac Pro, 2x2.93GHz Xeon, acres of memory and cache.)
Data from the benchmark and code that will be attached. The "old" code is TR 6533. The "new" code incorporates optimizations from this bug for reading int, uint, Number, and object values from vectors, and for writing int, uint, and Number to vectors, provided that certain types match and the index type is int or uint. "SP" = speedup, old/new. The Array timings are included for comparison. This is a single run, use with caution (but times are pretty stable; MacPro as in comment #9).

             Array           Vector
             Rd      Wr      Rd      Wr
  Objects
    Old      501     2138    357     1850
    New                      85      1850
    SP                       4.2     1.0
  int
    Old      501     1753    351     697
    New                      103     117
    SP                       3.4     6.0
  Number
    Old      502     1778    365     732
    New                      92      118
    SP                       4.0     6.2
Attached file Benchmark (obsolete) (deleted) —
Attached patch Work in progress (obsolete) (deleted) — Splinter Review
This still needs additional testing, and obviously it needs careful review.
Attachment #478040 - Attachment is obsolete: true
Attachment #554389 - Attachment is obsolete: true
Those numbers look great! When can we have it? :)
(In reply to Sebastian Marketsmueller from comment #13)
> Those numbers look great! When can we have it? :)

Well, it's dependent on approval of course, but that said ... my intent is to get this into Anza along with several other vector optimizations that matter (at least bug #678952, bug #678969). Serrano is too far gone to be useful as a target.

The remaining blight will be the performance of writing objects to vectors (obvious from the numbers above). In-lining is unlikely to be helpful; most of the cost is probably in the write barrier and reference counting. It's possible that doing something about bug #601817 will be helpful.
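For context on why the object-write case resists inlining: the store itself is a small fraction of the work, and the rest is GC bookkeeping that must run whether or not the call overhead is removed. A conceptual sketch (writeBarrierRC is a hypothetical stand-in for MMgc's barrier, not its real API):

    // Even a perfectly inlined object store must still run the write
    // barrier and reference-count updates, which dominate the call cost.
    #include <stdint.h>

    // Stand-in: records the mutation for the GC, adjusts reference counts
    // on the old and new values, and installs the new pointer.
    extern void writeBarrierRC(void** slot, void* value);

    void writeObjectElement_sketch(void** entries, uint32_t index, void* value) {
        writeBarrierRC(&entries[index], value);  // all of the cost is here
    }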
(In reply to Lars T Hansen from comment #14)
> The remaining blight will be the performance of writing objects to vectors
> (obvious from the numbers above). In-lining is unlikely to be helpful; most
> of the cost is probably in the write barrier and reference counting. It's
> possible that doing something about bug #601817 will be helpful.

Work on bug #601817 suggests we can shave about 1/3 off that time with modest effort, so that bug is becoming attractive too.
Attached patch Patch (obsolete) (deleted) — Splinter Review
This is a basic patch that covers some common cases:

- Reading Vector.<T> with int or uint index, for T = int, uint, Number, Object; for the number types the destination slot must have type T.
- Writing Vector.<T> with int or uint index, for T = int, uint, Number; the value provided must have type T.

I think there are additional common cases for reading from Vector.<T> into a slot of type U, where T = int and U = uint or vice versa, or T = int or uint and U = Number. That will be follow-up work, and I consider those cases less important for now.

Test cases and benchmarks for this optimization (WIP) will be offered as a separate subsequent patch; nothing will land until that has been approved as well.

Obviously I'm a JIT novice, so the generated code needs close inspection here. I'm particularly worried about my use of LOAD_CONST/LOAD_VOLATILE, whose semantics I don't understand well.
Attachment #554421 - Attachment is obsolete: true
Attachment #555052 - Flags: review?(edwsmith)
Attached patch Test case and benchmarks (deleted) — Splinter Review
There's one broad test case that goes into test/acceptance/as3/Vector, and a ton of benchmarks for vector and array performance that go into test/performance/asmicro. Arguably there's a little scope creep here, as they cover Array as well as some optimizations that come in bug #601817, but why wait?
Attachment #554420 - Attachment is obsolete: true
Attachment #555347 - Flags: review?(edwsmith)
System: Mac Pro 2x2.93GHz Xeon, acres of RAM. "avm" is tamarin before this change, "avm2" after the change. 5 iterations. Improvements from bug 678952 and bug 678969 have landed already and are included in both results.

32-bit results:

                         avm             avm2
test                     best    avg     best    avg    %dBst  %dAvg
Metric: iterations/second
Dir: asmicro/
  vector-read-C-1        276.7   276.1   911.2   911.1  229.3  230.0 ++
  vector-read-C-2        276.9   276.7   911.1   909.9  229.0  228.9 ++
  vector-read-Number-1   222.6   222.0   911.2   911.1  309.4  310.3 ++
  vector-read-Number-2   237.3   237.1   910.1   909.7  283.5  283.6 ++
  vector-read-int-1      284.9   284.7   913.1   912.5  220.5  220.5 ++
  vector-read-int-2      285.1   284.8   913.1   911.3  220.2  220.0 ++
  vector-read-int-3      271.5   271.2   822.2   821.1  202.9  202.7 ++
  vector-read-int-4      285.1   284.8   914.1   913.1  220.6  220.6 ++
  vector-read-uint-1     285.1   285.0   912.1   911.1  219.9  219.7 ++
  vector-read-uint-2     285.1   284.9   914.1   912.9  220.6  220.4 ++
  vector-read-uint-3     224.6   224.2   490.5   490.4  118.4  118.8 ++
  vector-read-uint-4     285.1   284.8   913.1   911.9  220.2  220.2 ++
  vector-write-C-1        54.1    54.0    54.0    54.0   -0.1    0.0
  vector-write-C-2        54.0    54.0    54.0    54.0   -0.0   -0.1
  vector-write-Number-1  135.9   135.7   847.2   846.4  523.5  523.7 ++
  vector-write-Number-2  135.9   135.8   847.2   846.6  523.5  523.5 ++
  vector-write-Object-1   54.0    53.9    53.7    53.7   -0.5   -0.4
  vector-write-Object-2   53.8    53.8    53.9    53.8    0.2    0.1
  vector-write-int-1     142.3   142.1   846.3   846.2  494.8  495.4 ++
  vector-write-int-2     143.9   143.7   846.2   846.0  488.2  488.5 ++
  vector-write-uint-1    144.0   143.8   877.1   876.5  509.1  509.6 ++
  vector-write-uint-2    144.9   144.8   875.2   875.0  504.2  504.4 ++

64-bit results:

                         avm             avm2
test                     best    avg     best    avg    %dBst  %dAvg
Metric: iterations/second
Dir: asmicro/
  vector-read-C-1        280.9   280.4   749.3   748.9  166.8  167.1 ++
  vector-read-C-2        278.7   278.2   748.3   747.4  168.5  168.6 ++
  vector-read-Number-1   282.2   281.8   749.3   748.5  165.5  165.6 ++
  vector-read-Number-2   283.2   282.2   784.2   783.2  177.0  177.5 ++
  vector-read-int-1      291.4   291.1   810.2   810.0  178.0  178.3 ++
  vector-read-int-2      290.4   289.9   805.4   805.0  177.3  177.7 ++
  vector-read-int-3      276.7   276.4   660.3   659.7  138.6  138.7 ++
  vector-read-int-4      292.7   292.4   821.2   820.6  180.5  180.7 ++
  vector-read-uint-1     296.1   295.6   821.2   820.5  177.3  177.5 ++
  vector-read-uint-2     295.4   295.1   784.2   783.5  165.5  165.5 ++
  vector-read-uint-3     273.7   273.3   660.3   660.0  141.2  141.5 ++
  vector-read-uint-4     293.1   292.4   811.2   810.4  176.7  177.2 ++
  vector-write-C-1        72.9    72.9    72.9    72.9   -0.0   -0.0
  vector-write-C-2        72.3    72.2    72.3    72.2   -0.0   -0.0
  vector-write-Number-1  213.8   213.4   746.3   745.7  249.1  249.4 ++
  vector-write-Number-2  210.4   210.2   695.3   694.9  230.5  230.7 ++
  vector-write-Object-1   73.0    72.9    73.0    72.9   -0.0    0.1
  vector-write-Object-2   72.3    72.2    72.3    72.3   -0.0    0.1
  vector-write-int-1     214.1   213.8   860.1   859.6  301.7  302.0 ++
  vector-write-int-2     210.4   210.2   828.2   827.4  293.7  293.7 ++
  vector-write-uint-1    211.4   211.1   849.2   847.2  301.7  301.3 ++
  vector-write-uint-2    208.8   208.7   849.2   848.8  306.7  306.8 ++

Notably, the baseline results for 64-bit are better than for 32-bit, but the optimized results are generally worse. That the baseline is better is possibly due to lower function-calling costs and a better architecture for C++; that the optimized results are worse may then be due to nanojit being worse at optimizing the code for x86-64. (I'm just speculating now.)
Full disclosure / interesting problem / etc:

64-bit -Ojit:

                         avm             avm2
test                     best    avg     best    avg    %dBst  %dAvg
Metric: iterations/second
Dir: asmicro/
  string-indexOf-1       360.6   360.2   376.6   376.3    4.4    4.5 +
  string-indexOf-2       197.8   197.0   146.9   146.6  -25.8  -25.5 --
  string-indexOf-3       142.6   142.3   144.9   144.7    1.6    1.6 +

64-bit -Dinterp:

test                     best    avg     best    avg    %dBst  %dAvg
Metric: iterations/second
Dir: asmicro/
  string-indexOf-1        74.8    67.2    77.8    77.5    4.0   15.3
  string-indexOf-2        52.5    52.3    54.3    54.0    3.4    3.4 +
  string-indexOf-3        47.6    47.4    54.1    53.9   13.6   13.7 ++

Note that the patch is a JIT-only change (it alters only CodegenLIR, not the verifier or the builtin C++ code), so there should be no impact on -Dinterp results. Also there are no vectors in these tests, and there's no impact from the previous patch (bug 678952), which did change the verifier. I conclude that what I'm looking at is some alignment or cache issue; I need to test another system too, but I cannot for the life of me see how this patch could cause those changes other than by accidental side effects like alignment or cache.
Windows 7, Lenovo T61p (Core2 Duo T7700 @ 2.40GHz), tamarin-redux 6538:d6fe8486bf99.
AVM: baseline. AVM2: patch 555052 applied. 10 iterations.

Windows 32-bit -Ojit:

                         avm             avm2
test                     best    avg     best    avg    %dBst  %dAvg
Metric: iterations/second
Dir: asmicro/
  string-indexOf-1       185.1   182.4   185.9   183.7    0.4    0.7
  string-indexOf-2       157.1   155.2   154.1   152.7   -1.9   -1.6
  string-indexOf-3        89.1    88.5    90.1    88.9    1.1    0.5

Windows 32-bit -Dinterp:

                         avm             avm2
test                     best    avg     best    avg    %dBst  %dAvg
Metric: iterations/second
Dir: asmicro/
  string-indexOf-1        45.1    44.5    49.5    48.9    9.8    9.9 ++
  string-indexOf-2        39.5    39.1    39.7    39.3    0.5    0.6
  string-indexOf-3        30.4    30.1    34.4    34.0   13.1   13.1 ++
Comment on attachment 555052 [details] [diff] [review]
Patch

Patch currently fails to compile for win64:

c:/hg/tamarin-redux/core/CodegenLIR.cpp(4501) : error C2220: warning treated as error - no 'object' file generated
c:/hg/tamarin-redux/core/CodegenLIR.cpp(4501) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
c:/hg/tamarin-redux/core/CodegenLIR.cpp(4502) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
c:/hg/tamarin-redux/core/CodegenLIR.cpp(4508) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
c:/hg/tamarin-redux/core/CodegenLIR.cpp(4655) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
c:/hg/tamarin-redux/core/CodegenLIR.cpp(4656) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
c:/hg/tamarin-redux/core/CodegenLIR.cpp(4663) : warning C4267: 'argument' : conversion from 'size_t' to 'int32_t', possible loss of data
make: *** [core/CodegenLIR.obj] Error 2
Attached patch Patch (obsolete) (deleted) — Splinter Review
Updated to fix Win64 compilation problems (only).
Attachment #555052 - Attachment is obsolete: true
Attachment #555392 - Flags: review?(edwsmith)
Attachment #555052 - Flags: review?(edwsmith)
Attachment #555347 - Flags: review?(edwsmith) → review+
Comment on attachment 555392 [details] [diff] [review]
Patch

> LIns *arrayLen = loadIns(LIR_ldi, int32_t(lenOffset), arrayData, ACCSET_OTHER, LOAD_CONST);

LOAD_CONST means the load is subject to CSE ignoring stores. Possible failure case:

  x = a[i]
  a[a.length+1] = c   // length changes, maybe realloc's
  x = a[j]

If the second arrayLen or arrayData is eliminated, then: bug? LOAD_NORMAL won't be eliminated unless the result is unused. ACCSET_OTHER is o.k.

Background on LoadQual and AccSet is in comments in LIR.h. We have the ability to create AccSet enums for additional coarse disjoint regions (e.g. ACCSET_ARRAYDATA could be a named region that only aliases with itself).
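Concretely, the fix in the builder code is to give the length load the LOAD_NORMAL qualifier; a minimal sketch reusing the names from the quoted line (illustrative, not the reviewed patch):

    // With LOAD_NORMAL, nanojit must assume an intervening store may have
    // changed the length, so the second load is re-issued rather than CSE'd.
    LIns* len1 = loadIns(LIR_ldi, int32_t(lenOffset), arrayData, ACCSET_OTHER, LOAD_NORMAL);
    // ... emit the store that may grow or realloc the vector ...
    LIns* len2 = loadIns(LIR_ldi, int32_t(lenOffset), arrayData, ACCSET_OTHER, LOAD_NORMAL);
    // Had both loads been LOAD_CONST, len2 could silently fold into len1.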
Attachment #555392 - Flags: review?(wmaddox)
Attachment #555392 - Flags: review?(edwsmith)
Attachment #555392 - Flags: feedback+
Comment on attachment 555392 [details] [diff] [review]
Patch

Assume the five instances of LOAD_CONST and LOAD_VOLATILE in this patch will be changed to LOAD_NORMAL. ACCSET_ issues will be filed as a separate follow-up bug (it's a more general theme).
Attached patch Patch (deleted) — Splinter Review
Patch updated to correct the LOAD_ attributes as described in the previous comment.
Attachment #555392 - Attachment is obsolete: true
Attachment #555665 - Flags: review?(wmaddox)
Attachment #555392 - Flags: review?(wmaddox)
Blocks: 681888
changeset: 6542:62283c51bdf2
user:      Lars T Hansen <lhansen@adobe.com>
summary:   For 599099 - investigate inline Vector.getNative*Property/setNative*Property calls: test cases (r=edwsmith)

http://hg.mozilla.org/tamarin-redux/rev/62283c51bdf2
Comment on attachment 555665 [details] [diff] [review]
Patch

Review of attachment 555665 [details] [diff] [review]:
-----------------------------------------------------------------

R+. Looks good, but see comments below. Nice performance results!

It might appear that the scaled offset of a vector element could exceed the range of a 32-bit value; thus the computation of valOffset should do the scaling in the intptr_t domain, using lshp rather than lshi. We are already relying on the 4GB object size limit, but all of this seems a bit fragile, e.g., should we change MMgc in the future. Perhaps a bit more commentary is appropriate here.

NITS

1) This comment is incorrect:

     // value = *(arrayData + index*scale + entriesOffset) ; value = array->data->entries[index]

   Scale is actually represented as log_2 of the intended scale factor, and is used as a shift count in the code.

2) Write LIns* foo, not LIns *foo.

3) Indentation of code generation calls, with the 'LIns* foo =' set off in a separate column, does not follow the prevailing style and makes the lines unnecessarily long.

4) emitInlineVector[Read,Write] take a long list of arguments that are constants related to the vector element type. Could these be packaged up in a single struct for each type? It's not clear whether this would leave the code tidier or not, though. It's something I would have tried, to see how it worked out.
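For reference, the pointer-domain scaling the review asks for would look roughly like this; a sketch against the snippet in comment #3 (names reused from there, the final patch may differ):

    // Widen the index to intptr_t before shifting, so the byte offset
    // cannot wrap in 32 bits on a 64-bit target.
    LIns* byteOffset = binaryIns(LIR_lshp, ui2p(index), InsConst(2));
    LIns* valOffset  = binaryIns(LIR_addp, rawArrayPtr, byteOffset);
    value = loadIns(LIR_ldi, 0, valOffset, ACCSET_OTHER, LOAD_NORMAL);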
Attachment #555665 - Flags: review?(wmaddox) → review+
changeset: 6560:31354fe284a8
user:      Lars T Hansen <lhansen@adobe.com>
summary:   Fix 599099 - investigate inline Vector.getNative*Property/setNative*Property calls (r=wmaddox)

http://hg.mozilla.org/tamarin-redux/rev/31354fe284a8
Addressed the review comments, though I did not play with packaging up the arguments to emitInlineVector[Read,Write], since all the calls are very close to the definitions; it did not seem worth it.

Arguably there is one outstanding issue here. In the VMCFG_FASTPATH_ADD_INLINE case for OP_getproperty and OP_setproperty, emit{Get,Set}IndexedProperty are called twice, once for the INT case and once for the DOUBLE case. There may then be in-line expansion of those cases from the changes in the present bug. It seems to me, without having dug much into it, that there could be room for some tightening of the emitted code in that case; e.g., it would be possible to share some of the calls into the support code. This would be a code-size issue, not a performance issue. I will file a separate bug for that.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED