Closed Bug 578761 Opened 14 years ago Closed 14 years ago

JM: PIC for |o[e]| (JSOP_GETELEM)

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: cdleary)

References

Details

(Whiteboard: fixed-in-jaegermonkey)

Attachments

(5 files, 10 obsolete files)

(deleted), patch
Details | Diff | Splinter Review
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), application/javascript
Details
(deleted), patch
dvander
: review+
Details | Diff | Splinter Review
We need a pic for getelem expressions of the form o[e]. It is the main cause of slowdown on fasta. Previous measurements suggest that we don't have too worry too much about megamorphism: a simple solution, like going to a slow call after 16 stubs, will probably do. I suspect that most getelems we see will either use integer indexes, thus skipping this entirely, or will go over a relatively small number of property names. JM wanted SS win: 20 ms
Ooh, I'll take this one.
Assignee: general → cdleary
Status: NEW → ASSIGNED
For string-fasta.js's two for-in loops, we could do even better, but maybe it is overkill: the def-use chain for the loops has the def in the loop head dominating all (few) uses, so we could try a bit more static analysis and predict the next value (property name). /be
What about the idea mentioned a few times of iterating over JSScopeProperties? It'd be pretty cool if we could just read the last sprop out of the iterator and avoid the PIC.
Other than the sprop linkage being up the property tree, so in reverse for-in order, it's great. This isn't a big deal given the fastiterators bug's work, but you'd need to tweak that code to build an array of sprops and know how to get the next one in for-in enumeration order from it. Or something like that. /be
Attached patch WIP patch. (obsolete) (deleted) — Splinter Review
Very WIP: needs cleanup and the integer fast path added back in.
Attached patch iterator getelem patch (obsolete) (deleted) — Splinter Review
FWIW here's the patch from last week that keeps the last-seen value around which would avoid a MIC/PIC in the common case. Hi, patch attached which saves 8-9ms on string-fasta for TM and JM. 'for in' loops keep an array of (id,slot) instead of id, which is inspected before doing an arbitrary name lookup (not fast-pathed yet for JM, just non-as-slow-pathed). Doesn't handle property deletes (don't know how those work). Should this go on 560989 or a new bug? Brian
(In reply to comment #6) > Should this go on 560989 or a new bug? I think this belongs in bug 578756. I tried to apply it to check out the perf gain, but it needs rebasing. We are hoping to get 23 ms total from improving that particular loop of fasta, so if we aren't getting all of it, we need to know why to decide if the current patches are going to take care of it or if we need to do more.
Summary: JM: PIC for |o[e]| → JM: PIC for |o[e]| (JSOP_GETELEM)
Attached patch WIP: Passes trace tests. (obsolete) (deleted) — Splinter Review
Shoehorning the int path back in...
Attachment #457736 - Attachment is obsolete: true
Attached patch rebased iterators patch (deleted) — Splinter Review
Iterator patch rebased to current JM. I still get a 9ms speedup on string-fasta with this (35ms -> 26ms).
Attachment #457885 - Attachment is obsolete: true
Chris, what speedup are you seeing on string-fasta with your patch (can you test it yet)? Looking at this some more, unless I'm missing something I don't see why using a PIC makes sense for the workloads seen on SS. There are 330k slow paths taken for getelem on SS. 175k are from one site in string-fasta, 80k are from four sites in string-unpack-code, leaving 75k (iirc 30k are gets on string elements, 8k are getting holes, remainder are misc). If you PIC on (shape x id) (you need to do this to determine the slot, right?), then for the one site in string-fasta there are 19 cases that get hit (the total number of properties in the two objects that are repeatedly iterated over), and for the sites in string-unpack-code there are between 600-2000 cases that get hit (all 1-2 character strings). The iterator patch works on the 175k in string-fasta but not the ones in string-unpack-code; I think these require old fashioned hashtable lookups.
I've only tested on string-fasta (because that has no dense array hot paths): 13.5% == 4ms speedup.
Attached patch WIP: Adds int32 inline path. (obsolete) (deleted) — Splinter Review
About to test on a machine whose benchmark results I trust...
Attachment #458032 - Attachment is obsolete: true
Attached file Perf comparison. (obsolete) (deleted) —
4ms on SunSpider, wash on V8. I'll see if I can do anything to cut down on extraneous spillage to make these numbers better.
How many times does the PIC miss on SS, and how many times does it hit?
General concern: how reliable are these benchmark comparisons? Comparison from comment 13 shows 'significant' differences in tests completely unrelated to these PICs (binary-trees, bitwise-and, plenty that only have integer array accesses). string-fasta is .9ms faster, and string-unpack-code is .5ms faster; if you repeat the comparison, how much do things change?
SunSpider overstates significance all the time; a while ago dmandelin thought he saw a flaw in its pstats but then I think thought it was ok, but it "feels" all wrongly overstated, way too often. Of course SS has ND in it too (Date, Math.random). Benchmarketing at its best! /be
(In reply to comment #15) > General concern: how reliable are these benchmark comparisons? It's reproducible. There are a lot of little effects from adding these PICs that can have emergent behavior, though most of the ones I can think of are bad: self-modifying code causes I$ flushes, PIC guards alias other branches in the BHTs, there's more I$ pressure because of the inline PIC code and out-of-line slow call code that gets tacked on to the inline buffer. I look a quick look at cachegrind diff results yesterday, but I'll look into it more today. @dvander: I'll investigate hit counts.
I wrote a little script to compare stub call counts using our stub call profiling mode and I got the following: sunspider-0.9.1 aggregate JSOP_NOP: +999 JSOP_PROPINC: +1 JSOP_EXCEPTION: -1 JSOP_CONCATN: -1 JSOP_CALL: -3 JSOP_REGEXP: -999 JSOP_GETELEM: -24425 Total: -24429 I think that the stub profiling data (or my script) may be a little buggy because running the untouched shell against itself I got: crypto-aes.js JSOP_CONCATN: -12 JSOP_CALL: -36 Total: -48 sunspider-0.9.1 aggregate JSOP_CONCATN: -12 JSOP_CALL: -36 Total: -48 Investigating...
Yeah, there's definitely nondeterminism in the stub profiling, although it's always for the same opcodes. Still looking, but CCing dmandelin -- maybe this is a known issue.
In any case, this is the bit that matters to us: string-tagcloud.js JSOP_GETELEM: -4998 date-format-xparb.js JSOP_GETELEM: -3999 string-fasta.js JSOP_GETELEM: -15428 sunspider-0.9.1 aggregate JSOP_GETELEM: -24425 Those numbers should represent the PIC hits.
Is this nondeterminism in the stub profiling or nondeterminism in SS itself (see the nonce variable in AESEncryptCtr in crypto-aes.js)? As Brendan pointed out, SS doesn't do the same thing every time, including differing numbers of calls (bitten me before).
(In reply to comment #21) > (see the nonce variable in AESEncryptCtr in crypto-aes.js) That makes a lot of sense for crypto-aes, didn't see the call to Date until you mentioned it. :-) Still not sure how exception is becoming propinc or regexp is becoming jsop_nop -- we don't even have a slow call for anything noppish, so that'll be what I check out next.
(In reply to comment #20) What about PIC misses? Trying to see what percentage of PIC'able callsites we're missing. And how many times do we hit max stubs?
Attached patch WIP: Fixes bug w/megamorphism on shape change. (obsolete) (deleted) — Splinter Review
Excellent suggestion by dvander -- rooted out a bug. Re-analyzing with that fix.
Attachment #458158 - Attachment is obsolete: true
Attached file GETELEM PIC results on SunSpider. (deleted) —
1 disabled due to failed lookup at date-format-xparb.js:19. 4000 iterations where the first iteration is used in a "try to look it up, and if it turned out it wasn't in there, put it in there" fashion. 3 disabled due to abnormal object ops (2 in regexp-dna.js:1707, 1 in string-tagcloud.js:185) TODO: investigate what these objects actually are. 1 disabled in string-fasta.js:68 disabled due to megamorphism, could up the maximum a little bit to catch all of these. 4 disabled in string-unpack-code.js (:16, :30, :51, :66) due to megamorphism, but this one is definitely correct -- the strings that are being used for indexing are dynamically generated, so the atom comparison fails even when the characters are the same, i.e. "L" shows up three times with different string pointers. In summary, I'll try to treating failed lookups as nops for the benefit of xparb and bump the stub cap for the (small) benefit of fasta. Will rerun the benchmarks on my macbook tonight and give V8 a look-over tomorrow.
Attached file Perf comparison. (deleted) —
16.8ms win on SunSpider without any of the modifications I mentioned at the end of comment 25, but looks like it regresses V8 by 80ms (1%) in its current form. Will investigate tomorrow.
Attachment #458162 - Attachment is obsolete: true
Attached patch WIP: Fixes bug w/megamorphism on shape change. (obsolete) (deleted) — Splinter Review
#if out some debug stuff I accidentally left in there. (I removed it before profiling in the above comment.)
Attachment #458282 - Attachment is obsolete: true
Re: comment 25: non-native object ops in SS must be dense arrays. ITM we want a different approach here from a PIC. We want prediction based on likelihood or (better) a bit of static analysis, not prediction based on what just missed the cache and forced us to fill it with a key that won't be matched again all too often. /be
On my machine, with the patch from comment #27, I don't see any regressions. string-fasta speeds up by 34% (6.5ms), and that's the only clearly-affected test.
I get a 9ms speedup (35ms -> 26ms; how are you computing the 34%?) with the patch from comment 27, the same as for the iterators patch. That other patch could get a few ms more by fast-pathing the cache lookup (the calls to GetElem and associated spills/tests are still performed) and maybe by using an array of JSScopeProperty*. Both patches are attacking string-fasta; how to proceed?
Attached patch PIC GETELEM (obsolete) (deleted) — Splinter Review
Discussed offline with dvander, dmandelin, and bhackett. We're going to go with the GETELEM PIC over the iterator patch. The GETELEM patch shows a little more speedup on microbenchmarks because the iterator patch is lacking a fast path in its current implementation.
Attachment #458289 - Attachment is obsolete: true
Attachment #458497 - Flags: review?(dvander)
Could you post a summary of the discussion?
(In reply to comment #32) > Could you post a summary of the discussion? Surely! The patches do the same thing (make getelem faster) -- the GETELEM PICs are a bit more generic and have a larger speedup ATM, so it doesn't seem worthwhile to spend more engineering effort duplicating something we already have the win for. Since GETELEM PICs work and have a slightly larger speedup, into the tree they go.
Sorry if I missed this: What PIC hit rates do you get for the string-fasta loops? /be
(In reply to comment #33) > the GETELEM PICs > are a bit more generic To clarify, when I say "a bit more generic" I mean that they work on a larger set of cases than indexing into the innermost iterator.
fasta does a property walk over an object. why does a PIC outperform an optimized iterator data structure?
(In reply to comment #36) > fasta does a property walk over an object. why does a PIC outperform an > optimized iterator data structure? As I mentioned in comment 33, because we have a fast path implemented for it. The current iterator patch performs a slow call on each iteration.
Attached patch PIC GETELEM, max count bumped (obsolete) (deleted) — Splinter Review
Bumps the megamorphic limit to 20, so we don't disable at all on fasta.
Attachment #458497 - Attachment is obsolete: true
Attachment #458505 - Flags: review?(dvander)
Attachment #458497 - Flags: review?(dvander)
Your summary above is completely misleading. So we are choosing a less optimal algorithm that we implemented a fast path for over a better algorithm that we didn't finish implementing? This doesn't seem like a good approach to me.
Tuning for SunSpider is one thing (20 may be overkill -- isn't the magic number for IUB 15?). Giving the win to the patch that has had some fast-path love, even though it uses caching which is in general not effective on real for-in loops, instead of giving the same advantage of fast-path love to the iterator path, and then judging, is a path-dependent mistake. Apples to apples, if the fast-path'ed iterator wins, then it should go in. /be
We should do both, they're not mutually exclusive, and this is ready now. We fast-path FORLOCAL and MOREITER, we can have a separate bug for noticing where FORLOCAL results flow into.
Ok, so let's do both -- not make this an either-or on bogus grounds :-/. Also, be clear this is tuning for SunSpider. We can take it, but it should be described as "more general". Is the other bug on file? /be
(In reply to comment #34) > Sorry if I missed this: What PIC hit rates do you get for the string-fasta > loops? Not sure that I understand the metric you're looking for, but I'll take a shot. The patch elides 147331 slow calls to GETELEM versus TIP, so with 52 slow calls for PIC updates, that's a hit rate of 99.9647%.
er, "should *not* be described as 'more general'." /be
(In reply to comment #43) > (In reply to comment #34) > > Sorry if I missed this: What PIC hit rates do you get for the string-fasta > > loops? > > Not sure that I understand the metric you're looking for, A PIC is a cache. Caches are tested, which either hits or misses (if miss, then fill according to a replacement policy). The hit rate is the ratio of hits to tests. What's the hit rate? > but I'll take a shot. > The patch elides 147331 slow calls to GETELEM versus TIP, so with 52 slow calls > for PIC updates, that's a hit rate of 99.9647%. That's pretty awesome. Thanks! /be
Chris, what happens if you shrink the PIC size to 16? Does the hit rate drop a whole lot? /be
(In reply to comment #42) > Is the other bug on file? dmandelin mentioned in comment 7 that the iterator work is living in bug 578756. This bug happened to become a de-facto string-fasta optimization bug, which was the cause for the either-or confusion.
(In reply to comment #45) > > Not sure that I understand the metric you're looking for, > > A PIC is a cache. I may have misrepresented my level of cluelessness... You were right, 16 is okay. PIC Size Hit Rate -------- -------- 2: 49322 / 147384 = 33.471% 4: 64493 / 147384 = 43.758% 8: 79014 / 147384 = 53.611% 15: 122866 / 147384 = 83.365% 16: 147331 / 147384 = 99.964% 32: 147331 / 147384 = 99.964% Just so happens only 17 of the 19 shape/atom combos flow through that GETELEM. In light of this pathetic analysis, I would like to apologize for the impression I gave that PICing is more general in any serious way.
Attached file regression test case (deleted) —
On the mac-mini: fasta: 1.40x as fast 41.9ms +/- 0.6% 29.8ms +/- 0.9% significant The only consistent regression I see across multiple computers is this: fannkuch: *1.052x as slow* 20.8ms +/- 1.0% 21.8ms +/- 1.0% significant When there's only a single GETELEM site, things are fine. As more get added the performance degrades. To demonstrate see the attached script. It runs in 2500ms before this patch, and 3000ms after. It's only 1ms on the benchmark, but my gut feeling is we should investigate this.
Is it still the case that JM does not handle megamorphism explicitly? I.e. too much PIC thrashing -> recompile a generic stub call? /be
(In reply to comment #50) Yeah, though this problem case is the reverse - lots of call sites, each monomorphic.
Comment on attachment 458505 [details] [diff] [review] PIC GETELEM, max count bumped >+ /* FIXME: we should factor this out. Easy to miss new members. */ Forgetting to set a member will light up pretty quickly. Omit this or explain it better (maybe even turn it into documentation). >+ script->pics[i].idReg = pics[i].idReg; This should be in the |u.get| union struct of ic::PICInfo. >+ pic.hotPathBegin = masm.label(); >+ /* Guard on shape. */ Newline before comment. >+ pic.shapeGuard = masm.label(); >+ Jump shapeGuard = masm.branch32(Assembler::NotEqual, shapeReg, >+ Imm32(int32(JSObjectMap::INVALID_SHAPE))); >+ /* Guard on id identity. */ Ditto. >+ static const int32 BOGUS_ATOM = 0xdeadbeef; >+ Jump idGuard = masm.branch32(Assembler::NotEqual, idReg, Imm32(BOGUS_ATOM)); branchPtr, ImmPtr - or better yet, put a // :FIXME: X64 in there somewhere. >+ stubcc.linkExit(idGuard, Uses(2)); >+ stubcc.linkExitDirect(shapeGuard, pic.slowPathStart); This must be linkExit(), not linkExitDirect(), because if the branch is taken registers won't be synced. >+ void jsop_getelem_dense(FrameEntry *obj, FrameEntry *id, RegisterID objReg, >+ MaybeRegisterID &idReg, RegisterID shapeReg); > inline JSC::MacroAssembler::RegisterID >+FrameState::tryRegForType(FrameEntry *fe, RegisterID fallback) More on this later. >+ if (obj->isDenseArray()) { >+ start = masm.label(); >+ atomGuard = masm.branchPtr(Assembler::NotEqual, pic.idReg, ImmPtr(id)); >+ shapeGuard = masm.branchPtr(Assembler::NotEqual, >+ Address(pic.objReg, offsetof(JSObject, clasp)), >+ ImmPtr(obj->getClass())); Dense arrays are non-native, so an earlier check made this all dead. You can just remove it. >+ masm.loadData32(proto, pic.objReg); loadPtr() - looks like an existing bug that got copy pasta'd. proto used to be a jsval. >+ if (obj->isDenseArray()) >+ disable("dense array"); Can remove this too. > // This struct comes out to 61 bits. 65 now (70 when idReg moves in). >+ } >+ /* Postcondition: type should be in tmpReg, data should be in objReg. */ >+ /* Note: linkExits will be hooked up to a leave() after this method completes. */ Newline before comment, merge comments or use // >+ if (id->isTypeKnown() && id->getKnownType() == JSVAL_TYPE_INT32 && id->isConstant() && >+ id->getValue().toInt32() < 0) { >+ jsop_getelem_slow(); >+ return; >+ } Can this check be moved into the jsop_getelem_dense path? >+ >+ if (id->isTypeKnown() && id->getKnownType() == JSVAL_TYPE_STRING && id->isConstant()) { >+ /* Never happens, or I'd optimize it. */ >+ jsop_getelem_slow(); >+ return; >+ } Similarly, into the jsop_getelem_pic path. >+ RegisterID tmpReg = frame.allocReg(); >+ if (!obj->isTypeKnown()) { >+ RegisterID typeReg = frame.tryRegForType(obj, tmpReg); >+ Jump objGuard = masm.testObject(Assembler::NotEqual, typeReg); >+ stubcc.linkExit(objGuard, Uses(2)); >+ } I'd prefer playing along with the register allocator. Use copyTypeIntoReg, and allocReg on the other side of the |if|. Right now, you might be spilling and then re-loading the type. >+ if (!id->isConstant()) >+ idReg.setReg(frame.copyDataIntoReg(id)); >+ /* Meat. */ Delicious, but more savory with a preceding newline. >+ jsop_getelem_dense(obj, id, objReg, idReg, tmpReg); >+ stubcc.leave(); >+ stubcc.call(stubs::GetElem); >+ /* Epilogue. */ Ditto. >+ RegisterID objReg = frame.copyDataIntoReg(obj); >+ RegisterID idReg = frame.copyDataIntoReg(id); >+ /* Meat. */ >+ jsop_getelem_pic(obj, id, objReg, idReg, tmpReg); >+ /* Epilogue. */ Ditto * 2. >+#ifdef JS_POLYIC >+ JS_ASSERT(!id->isTypeKnown()); Taking another look, this function is *really* long. Would be nice to separate into smaller pieces (getelem_known_id_type, getelem_with_pic, getelem_without_pic), but up to you. >+ RegisterID typeReg = frame.tryRegForType(id, tmpReg); >+ Jump intGuard = masm.testInt32(Assembler::NotEqual, typeReg); Nice, this is a good use case for tryRegForType, because anything else will be less optimal. Let's overload it to "tempRegForType" and give it a "hint" parameter. Assert that the hint reg is owned. >+ Jump performedDense = masm.jump(); Hrm... I think we could do better by keeping one case OOL, or PIC'ing everything. For example, if the type isn't known, we'd always go to a PIC. If it saw a string id, it'd guard on that and build string PICs. If it saw an int32 id, it'd guard on that, and build a dense elem path. We most likely don't need or want both in one path, for locality, branch prediction, or code size. This should be a follow-up bug though. Nice work so far, looking forward to new patch.
(In reply to comment #52) > >+ stubcc.linkExit(idGuard, Uses(2)); > >+ stubcc.linkExitDirect(shapeGuard, pic.slowPathStart); > > This must be linkExit(), not linkExitDirect(), because if the branch is taken > registers won't be synced. I think the same set of registers is dirty on the idGuard taken edge and the shapeGuard taken edge. > >+ if (obj->isDenseArray()) { > >+ start = masm.label(); > >+ atomGuard = masm.branchPtr(Assembler::NotEqual, pic.idReg, ImmPtr(id)); > >+ shapeGuard = masm.branchPtr(Assembler::NotEqual, > >+ Address(pic.objReg, offsetof(JSObject, clasp)), > >+ ImmPtr(obj->getClass())); > > Dense arrays are non-native, so an earlier check made this all dead. You can > just remove it. We checked in the caller that the *holder* had a native JSObjectOps, not obj. Is it possible for a dense array to have a prototype with native object ops? I'm somewhat confused by the terminology here, may have to discuss offline. > Would be nice to separate into smaller pieces I'd like that as well, will break it up some. > >+ if (id->isTypeKnown() && id->getKnownType() == JSVAL_TYPE_INT32 && id->isConstant() && > >+ id->getValue().toInt32() < 0) { > >+ jsop_getelem_slow(); > >+ return; > >+ } > > Can this check be moved into the jsop_getelem_dense path? It's easier to emit all the shortcut-to-slow stuff before the shared isObject guard. I can duplicate that guard a few times if you hate the way it is now. > Let's overload it to "tempRegForType" and give it a "hint" > parameter. Assert that the hint reg is owned. So this just means rename tryRegForType to tempRegForType? tryRegForType already asserted that the fallback register was owned.
(In reply to comment #53) > I think the same set of registers is dirty on the idGuard taken edge and the shapeGuard taken edge. Whoops, yup, you're right. Can you leave a comment in there to avoid accidental breakage? > We checked in the caller that the *holder* had a native JSObjectOps, not obj. Okay, yeah, we need to check |obj|, as well as everything in that protochain walk. Another existing bug, we can fix the other cases separately. > I can duplicate that guard a few times if you hate the way it is now. If it's a big deal then nevermind. > So this just means rename tryRegForType to tempRegForType? Yes.
(In reply to comment #53) > > Dense arrays are non-native, so an earlier check made this all dead. You can > > just remove it. > > We checked in the caller that the *holder* had a native JSObjectOps, not obj. > Is it possible for a dense array to have a prototype with native object ops? > I'm somewhat confused by the terminology here, may have to discuss offline. David pointed out js::PropertyCache::fill over IRC, filling a PIC needs to have the same logic about non-native followed by native on proto chain. Fortunately this is rare apart from dense arrays, and for dense arrays we special-case all proto chain identifier-based (atom literal) tests/lookups to skip the direct (dense array) object, since it cannot have named properties. GETELEM could be accessing any computed property name, though, so if the id at runtime is not an index, but the direct object is a dense array, then you should be able to skip up one level on the proto chain. /be
Attached patch PIC GETELEM, take 2. (obsolete) (deleted) — Splinter Review
Attachment #458505 - Attachment is obsolete: true
Attachment #458799 - Flags: review?(dvander)
Attachment #458505 - Flags: review?(dvander)
Shouldn't have marked for re-review, fixing native logic now...
Attached patch PIC GETELEM, take 2, for real. (deleted) — Splinter Review
Attachment #458799 - Attachment is obsolete: true
Attachment #458846 - Flags: review?(dvander)
Attachment #458799 - Flags: review?(dvander)
Comment on attachment 458846 [details] [diff] [review] PIC GETELEM, take 2, for real. >+ /* >+ * Check all nodes along proto chain are JS-native; otherwise, >+ * obj->shape() is undefined. >+ */ >+ for (JSObject *protoNode = obj; protoNode; protoNode = protoNode->getProto()) { As discussed IRL the isNative check should be in the proto chain walk in generateStub(), don't need need two loops. r=me with that. The regression to fannkuch is still there, but the attached test case is somewhat better (2800ms from 2500ms). I'll file a follow-up bug to investigate this.
Attachment #458846 - Flags: review?(dvander) → review+
Canonical variable names are important in house style, so s/protoNode/pobj/ (it's not a "node" :_P). Drive-by nit, hope you can get it in before it's forgotten. /be
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Depends on: 583873
Depends on: 583942
Depends on: 584466
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: