Closed Bug 482787 Opened 16 years ago Closed 14 years ago

We code gen init code for an object to initialize slots but walk them in Traits::destroyInstance

Tracking

(Not tracked)

Status:

VERIFIED WONTFIX

Milestone:

Q3 11 - Serrano

People

(Reporter: treilly, Assigned: stejohns)

References

Details

(Whiteboard: PACMAN, has-patch)

Attachments

(6 files, 1 obsolete file)

Rough performance test 14 years ago Steven Johnson (deleted), text/plain	lhansen : feedback+	Details
Remove the BitSet walking 14 years ago Steven Johnson (deleted), patch	treilly : feedback+ edwsmith : feedback+	Details \| Diff \| Splinter Review
Encapsulate isDictionary flag 14 years ago Steven Johnson (deleted), patch	edwsmith : review+	Details \| Diff \| Splinter Review
Specialize destroyInstance via function ptr 14 years ago Steven Johnson (deleted), patch	edwsmith : feedback+ treilly : feedback+	Details \| Diff \| Splinter Review
First cut at jitting dtors 14 years ago Steven Johnson (deleted), patch	edwsmith : feedback+	Details \| Diff \| Splinter Review
Rearrange slot order to put pointer-slots first 14 years ago Steven Johnson (deleted), patch	edwsmith : review+ chrisb : feedback+	Details \| Diff \| Splinter Review
replace Traits::m_totalSize with Traits::m_extraSize 14 years ago Steven Johnson (deleted), patch	edwsmith : review+	Details \| Diff \| Splinter Review

Tommy Reilly

Reporter

Description

•

16 years ago

End result slow Finalize which could lead to uncool pause times.

Lars T Hansen

Updated

•

15 years ago

Priority: -- → P4

Target Milestone: --- → flash10.1

Lars T Hansen

Comment 1

•

15 years ago

I've observed this too, in some of the programs in test/performance/mmgc - destroyInstance walks a bit vector, and that can be quite slow.

Edwin Smith

Updated

•

15 years ago

Hardware: x86 → All

Lars T Hansen

Comment 2

•

15 years ago

destroyInstance appears to add considerable cost even to objects that should require minimal destruction, see numbers in bug #531250.

Blocks: 531250

Dan Smith

Updated

•

15 years ago

Target Milestone: flash10.1 → flash10.2

Lars T Hansen

Updated

•

15 years ago

Whiteboard: PACMAN

Steven Johnson

Assignee

Updated

•

14 years ago

Assignee: nobody → stejohns

Steven Johnson

Assignee

Comment 3

•

14 years ago

Looking at this, I realized that we can also considerably simplify how we walk them; as a result of changes to the slot layout by Chris Brichford, the pointer slots are now always contiguous (in the middle of the object), so we don't need a bitset to find them, just an offset and count. (If we rearranged their order just a count could suffice.) Looks for a series of patches...

Steven Johnson

Assignee

Comment 4

•

14 years ago

On this note, do we have any good existing benchmarks for timing this? (My initial assumption is to allocate a few million objects, then force a collection...)

Edwin Smith

Comment 5

•

14 years ago

given this inheritance tree: class A { int int_a // int32 Array array_a // ScriptObject* } class B { int int_b // int32 Array array_b // ScriptObject* } Won't the layout be this: (assume a 32-bit cpu) { /* ScriptObject stuff */ int32_t int_a ScriptObject* array_a int32_t int_b ScriptObject* array_b }

Steven Johnson

Assignee

Comment 6

•

14 years ago

Yep, but within each subclass they are a contiguous range.

Steven Johnson

Assignee

Comment 7

•

14 years ago

Attached file Rough performance test (deleted) — Details

Initial perf benchmark to try to measure the cost of destroyInstance: basically, allocate a few million of various sorts of objects, force a full collection, measure the time.

Attachment #441586 - Flags: feedback?(lhansen)

Steven Johnson

Assignee

Comment 8

•

14 years ago

Attached patch Remove the BitSet walking (deleted) — Details — Splinter Review

Take advantage of the slot layout to get rid of FixedBitSet entirely. From my testing (MacTel32), perf is roughly a wash (no gain, no loss), but code is substantially smaller and simpler, and should pave the way for some subsequent patches. (MacTel64 is actually measurably faster for the most complex testcase with lots of objects, by about 5%.) Tradeoffs to look at: -- We formerly used a FixedBitSet, which would fit into a pointer-sized field for small objects (good) but dynamically allocate for larger ones (bad). Now we just need an offset-and-count pair... for small objects on 32-bit systems this is more space (2 int32 vs one intptr), but for all other configs it's at least as small, and (trivially) reduces gc pressure. (Note that if we restructured the order of slot layout we might be able to get by with just a count, not an offset.) -- the old code took great pains to walk thru the object in low-to-high-addr order, zeroing as it went, presumably to maximize cache performance. the new code doesn't bother with this: it focuses on just the section(s) with pointers, then finishes with a single memset of the whole thing. (furthermore, it walks the pointer section in reverse order, so as to avoid recursion.) In theory, this is probably suboptimal, but since I plan on adding jitted dtors (which will go in low-to-high order) in a subsequent patch, this is probably an acceptable tradeoff.

Attachment #441590 - Flags: feedback?(treilly)

Steven Johnson

Assignee

Comment 9

•

14 years ago

Attached patch Encapsulate isDictionary flag (obsolete) (deleted) — Details — Splinter Review

Refactoring patch to pave the way for subsequent patch; pulled out separately to make reviewing easier

Attachment #441615 - Flags: review?(edwsmith)

Steven Johnson

Assignee

Updated

•

14 years ago

Attachment #441590 - Flags: review?(edwsmith)

Steven Johnson

Assignee

Comment 10

•

14 years ago

Comment on attachment 441590 [details] [diff] [review] Remove the BitSet walking changing to feedback until I get better perf numbers measured from the patch set.

Attachment #441590 - Flags: review?(edwsmith)

Attachment #441590 - Flags: review?

Attachment #441590 - Flags: feedback?(edwsmith)

Steven Johnson

Assignee

Comment 11

•

14 years ago

Attached patch Specialize destroyInstance via function ptr (deleted) — Details — Splinter Review

Add a function pointer for destroyInstance, and specialize two cases that are common and easily detectable: no-pointer-slots-and-no-hashtable (aka Plain Old Data), and no-pointer-slots-with-hashtable (aka dynamic objects). This change (plus previous patches) runs the benchmark at about 91% of total time (vs TR baseline) on MacTel32, and 95% of total time on MacTel64. The obvious downside to this is that Traits gets another function pointer added to it, so there's a modest memory size increase.

Attachment #441619 - Flags: feedback?(edwsmith)

Steven Johnson

Assignee

Updated

•

14 years ago

Attachment #441619 - Flags: feedback?(treilly)

Steven Johnson

Assignee

Comment 12

•

14 years ago

Attached patch First cut at jitting dtors (deleted) — Details — Splinter Review

First cut. Definitely not ready for review, because: (1) disappointingly, it's not much faster than the previous patch... close to the noise range (2) the code generated is kinda large (3) it tends to hang on acceptance tests; I haven't diagnosed the problem, but wonder if our CodeMgr is getting destroyed at avm-teardown-time ahead of things that still need the custom dtor functions (which would be bad)... So, no review, but feedback welcomed.

Attachment #441672 - Flags: feedback?(edwsmith)

Edwin Smith

Comment 13

•

14 years ago

Comment on attachment 441672 [details] [diff] [review] First cut at jitting dtors initial reactions to jitter * when I wrote InvokerCompiler::assemble(), I was hoping to generalize it and move it to LirHelper... might be worth a try at that before the cut+pasting gets out of control. * I can hear Moh from intel muttering: "we can specalize that using rep mov p, 0" (fast inline x86 specific memclear)... doubt it matters, yet. * this has all the markings of an AVMTWEAK rather than AVMFEATURE * I think I'd be happier if all the compiler code was in the compiler class, instead of spread between Traits.cpp and CodegenLIR.cpp. but if you have a reason you did it that way, add a comment. * the handling of jit failure (fall through to a second null check) is too subtle. comment?

Edwin Smith

Updated

•

14 years ago

Attachment #441672 - Flags: feedback?(edwsmith) → feedback+

Lars T Hansen

Comment 14

•

14 years ago

(In reply to comment #13) > (From update of attachment 441672 [details] [diff] [review]) > * this has all the markings of an AVMTWEAK rather than AVMFEATURE No opinion on the patch, but a general observation: Tweaks and features are intended for configuration points that will stay in the code indefinitely, as various platforms may prefer various settings and it's meaningful to expose those configuration points to embedders. Configuration points that are for experimentation and will go away eventually when the numbers are in should stay out of the avmfeatures altogether, and be ad-hoc ifdefs in avmplus.h (typically), which are eventually removed. (I guess we can have a discussion about whether internal-only configuration points can meaningfully stay in the code forever for reasons of continuous benchmarking etc, but as we do no such benchmarking today allow me to express skepticism at that.) (There are possibly settings in avmfeatures that already don't belong there, by that definition (and certainly features that should be demoted to tweaks)).

Edwin Smith

Updated

•

14 years ago

Attachment #441615 - Flags: review?(edwsmith) → review+

Edwin Smith

Updated

•

14 years ago

Attachment #441590 - Flags: review?

Steven Johnson

Assignee

Comment 15

•

14 years ago

(In reply to comment #14) > Configuration > points that are for experimentation and will go away eventually when the > numbers are in should stay out of the avmfeatures altogether, and be ad-hoc > ifdefs in avmplus.h (typically), which are eventually removed. Agreed -- if this pans out, it will likely simply be tied to nanojit; the feature-itis is for my ongoing dev work. If it makes reviewing harder, I'll switch to ad-hoc for subsequent patches.

Steven Johnson

Assignee

Comment 16

•

14 years ago

(In reply to comment #13) > * when I wrote InvokerCompiler::assemble(), I was hoping to generalize it and > move it to LirHelper... might be worth a try at that before the cut+pasting > gets out of control. Can do. > * I think I'd be happier if all the compiler code was in the compiler class, > instead of spread between Traits.cpp and CodegenLIR.cpp. but if you have a > reason you did it that way, add a comment. Hm, IMHO it is all in the compiler class -- Traits simply drives it to tell the offsets to generate. The alternative is to have CodegenLIR have intimate knowledge about Traits slot layout, which is currently (thankfully) pretty much concentrated in Traits. > * the handling of jit failure (fall through to a second null check) is too > subtle. comment? Comments are good, I'll add some

Edwin Smith

Comment 17

•

14 years ago

Comment on attachment 441590 [details] [diff] [review] Remove the BitSet walking + Whats not to like about deleting code? - Needs a situational comment explaining pointer/non-pointer slot layout, and how we're taking advantage of it. (or reference to a doc on slot layout) +/- needs data to motivate the size/speed tradeoff. even some measurements or back-of-envelope calculations on a hypothetical app, would be useful.

Attachment #441590 - Flags: feedback?(edwsmith) → feedback+

Steven Johnson

Assignee

Comment 18

•

14 years ago

(In reply to comment #17) > how we're taking advantage of it. (or reference to a doc on slot layout) Fully documented in doc/slots.txt; I'll reference that > +/- needs data to motivate the size/speed tradeoff. even some measurements or > back-of-envelope calculations on a hypothetical app, would be useful. Agreed

Edwin Smith

Comment 19

•

14 years ago

Comment on attachment 441619 [details] [diff] [review] Specialize destroyInstance via function ptr hasPointerSlotsThatNeedDestruction() could avoid traversing up the inheritance chain by inspecting the base class's m_destroyProc field. Doubt its important here, but its hard to pass up a chance point out an O(1) alternative to an O(N) function. As in the previous patch, a policy comment would be helpful. Best place is probably near setDestroyProc(). I would be interested in some statistics that show how frequently each destroyInstance variant is used.

Attachment #441619 - Flags: feedback?(edwsmith) → feedback+

Tommy Reilly

Reporter

Updated

•

14 years ago

Attachment #441590 - Flags: feedback?(treilly) → feedback+

Tommy Reilly

Reporter

Comment 20

•

14 years ago

Comment on attachment 441619 [details] [diff] [review] Specialize destroyInstance via function ptr if (t->m_offsetofSlots > sizeof(ScriptObject)) Would this be obvious if I read the new slots layout doc? Maybe a t->IsBuiltin() would be clearer?

Tommy Reilly

Reporter

Comment 21

•

14 years ago

you might get more accurate benchmark data if your test tried to not measure allocation/init time. I was thinking you could to this by using from fragile hacks that I won't mention or you could add: System.nogc = true ... do alllocations ... now = new Date() System.forceFullCollection(); Control of nogc from AS3 would probably come in handy elsewhere

Steven Johnson

Assignee

Comment 22

•

14 years ago

(In reply to comment #20) > (From update of attachment 441619 [details] [diff] [review]) > if (t->m_offsetofSlots > sizeof(ScriptObject)) > > Would this be obvious if I read the new slots layout doc? Maybe a > t->IsBuiltin() would be clearer? Not obvious enough, it took me a bit to figure it out. I'll expand on it with more comments (or maybe tweak the slot layout a bit more)

Tommy Reilly

Reporter

Comment 23

•

14 years ago

A cheap an easy way to just measure destruction time would be to run the benchmark with -Dnogc and change forceFullCollection override nogc to be true

Rick Reitmaier

Comment 24

•

14 years ago

Comment on attachment 441619 [details] [diff] [review] Specialize destroyInstance via function ptr Would it make sense to wrap up ' this->m_hashTableOffset != 0 && !this->isDictionary() ' in a method call? And it will be interesting to explore both the performance of the jit'd dtors along with the memory usage.

Steven Johnson

Assignee

Comment 25

•

14 years ago

Comment on attachment 441615 [details] [diff] [review] Encapsulate isDictionary flag http://hg.mozilla.org/tamarin-redux/rev/6b18bf00491e

Attachment #441615 - Attachment is obsolete: true

Steven Johnson

Assignee

Comment 26

•

14 years ago

(In reply to comment #21) > you might get more accurate benchmark data if your test tried to not measure > allocation/init time. Yeah, but... is that really useful? In actual usage, we're not going to have destruction without allocation. (I can always use Shark to measure the destruction time itself)

Steven Johnson

Assignee

Comment 27

•

14 years ago

(In reply to comment #17) > +/- needs data to motivate the size/speed tradeoff. even some measurements or > back-of-envelope calculations on a hypothetical app, would be useful. OK, I did some rough profiling... the existing FixedBitSet works out to be slightly smaller if most Traits have < 31 slots (as it doesn't do out of line allocs then). Typically this is true of "simple" (ie non-Flex) apps. Running ATS AS3 cases I ended up with only a tiny % of cases needed to do an out of line alloc, which ended up meaning the existing data structure is nearly half the space. Testing on a handful of Flex apps showed a smaller advantage, but still an advantage. (All for 32-bit systems, btw.) However, this can probably be mitigated by rearranging the ordering of the slots; I'll do a quick experiment with that to see if it's practical.

Steven Johnson

Assignee

Comment 28

•

14 years ago

Attached patch Rearrange slot order to put pointer-slots first (deleted) — Details — Splinter Review

New patch that rearranges the way we layout slots; formerly, we always did 4-byte slots (possible padding) pointer-sized slots (possible padding) 8-byte slots with this patch, we now do (possible padding) pointer-sized slots 4-byte slots (possible padding) 8-byte slots On average, this should be the same amount of space (it may vary slightly on 64-bit systems depending on padding) but allows us to assume the pointer-slots always start at m_offsetofSlots; this paves the way for subsequent patches to avoid having to store a separate offset to the start of the pointer slots.

Attachment #442209 - Flags: review?(edwsmith)

Attachment #442209 - Flags: feedback?(chrisb)

Christopher A. Brichford

Comment 29

•

14 years ago

(In reply to comment #28) > Created an attachment (id=442209) [details] > Rearrange slot order to put pointer-slots first > > ... > > with this patch, we now do > > (possible padding) > pointer-sized slots > 4-byte slots > (possible padding) > 8-byte slots > > On average, this should be the same amount of space (it may vary slightly on > 64-bit systems depending on padding) but allows us to assume the pointer-slots > always start at m_offsetofSlots; this paves the way for subsequent patches to > avoid having to store a separate offset to the start of the pointer slots. I think the new layout you are propsing is fine. The main thing I was trying to do was to make the generated struct for the slots in glue classes not need to have #ifdef's for 64 bit vs. 32 bit and minimize the amount of padding in the struct. In the old layout you would never have more than one int32_t for padding. In the new layout you might have two. If the benefits of having the pointers at the front out weigh the cost of the potential extra padding on 64 bits then I think you are doing the right thing.

Steven Johnson

Assignee

Comment 30

•

14 years ago

(In reply to comment #29) > (In reply to comment #28) > In the old layout you would never have more than one int32_t for > padding. In the new layout you might have two. When would that happen (for 32-bit systems)? The first "possible padding" (before the pointer-sized slots) should only ever be possible for 64-bit systems. (FWIW, some additional rearrangement of fields in Traits will be necessary for this to be truly useful, so look for a subsequent patch first)

Steven Johnson

Assignee

Comment 31

•

14 years ago

Attached patch replace Traits::m_totalSize with Traits::m_extraSize (deleted) — Details — Splinter Review

A truly trivial optimization, but arguably worth capturing: we use m_extraSize at far more call points (every object constructor) than we do m_totalSize (hardly ever).

Attachment #442557 - Flags: review?(edwsmith)

Christopher A. Brichford

Updated

•

14 years ago

Attachment #442209 - Flags: feedback?(chrisb) → feedback+

Steven Johnson

Assignee

Comment 32

•

14 years ago

So after some fiddling around, I'm not convinced that jitting the dtors is likely to pan out, so I think I'm going to mothball this unless someone has a suggestion... (1) The patches that remove BitSet provide a small-but-measurable improvement in some cases, but I don't really move the needle any further. (2) The jitted code is crashy, because at GC teardown time, the CodeMgr can (and is) destroyed before all the jitted dtors. To fix this we'd need to create a separate CodeMgr that manages the dtors and is guaranteed to live longer (which is doable but probably not worth the effort, see (1)). (3) If RCObjects/ZCT goes away entirely (as we're hoping to do for this cycle), this whole effort becomes moot since none of this DecrementRef nonsense will be necessary. Thoughts? (I could land the first two patches; they would cause a modest memory increase and a modest perf enhancement...)

Edwin Smith

Updated

•

14 years ago

Attachment #442557 - Flags: review?(edwsmith) → review+

Edwin Smith

Comment 33

•

14 years ago

if anything we should continue to test & refine the non-jit code ; only when you have a test that more strongly depends on destructor speed, should we worry about the jit version. I think the only blocker for landing the first two patches, is getting data on a wider (perhaps more relevant) set of tests. (2) doesn't seem like a decision factor for the jit-patch landing, to me, its just a bit more work. CodeMgr is free of GC dependencies, and internally supports freeing individual code hunks, with free space coalescing. Traits needs code to explicitly free destructors so that when a pool is unloaded, the destructors for types defined in that pool get freed.

Lars T Hansen

Updated

•

14 years ago

Attachment #441586 - Flags: feedback?(lhansen) → feedback+

Lars T Hansen

Comment 34

•

14 years ago

Comment on attachment 441586 [details] Rough performance test Most of the objects will be destroyed by reference counting, not GC, so the forceFullCollection is probably mostly redundant but hardly wrong. (You need circular object references to avoid that.)

Lars T Hansen

Comment 35

•

14 years ago

(In reply to comment #26) > (In reply to comment #21) > > you might get more accurate benchmark data if your test tried to not measure > > allocation/init time. > > Yeah, but... is that really useful? In actual usage, we're not going to have > destruction without allocation. (I can always use Shark to measure the > destruction time itself) I agree with Steven, since we don't really care about percentages, only the actual time reduced; looking at percentages is not the right approach for PACMAN, we should always be looking to absolute improvements - most of the areas we are looking at improving probably take less than 2% of the running time in many cases.

Tommy Reilly

Reporter

Comment 36

•

14 years ago

Another thing to think about in this area would be to look at using loop unrolling and/or SSE2 instructions to look more than one pointer at a time. I was thinking about this for decrementAtomRegion which showed up as a hotspot in another benchmark. They might even be able to share the same code.

Edwin Smith

Updated

•

14 years ago

Attachment #442209 - Flags: review?(edwsmith) → review+

Tommy Reilly

Reporter

Updated

•

14 years ago

Attachment #441619 - Flags: feedback?(treilly) → feedback+

Edwin Smith

Updated

•

14 years ago

Whiteboard: PACMAN → PACMAN, has-patch

Steven Johnson

Assignee

Comment 37

•

14 years ago

Note that this bug will be made obsolete if/when https://bugzilla.mozilla.org/show_bug.cgi?id=538943 lands, as Traits::destroyInstance won't be necessary anymore (in any form).

Steven Johnson

Assignee

Comment 38

•

14 years ago

Closing this as WONTFIX as bug 538943 is almost certain to happen before we get around to this.

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → WONTFIX

Trevor Baker

Comment 39

•

14 years ago

bulk verifying resolved !fixed issues

Status: RESOLVED → VERIFIED

You need to log in before you can comment on or make changes to this bug.