Closed Bug 482787 Opened 16 years ago Closed 14 years ago

We code gen init code for an object to initialize slots but walk them in Traits::destroyInstance

Categories

(Tamarin Graveyard :: Virtual Machine, enhancement, P4)

enhancement

Tracking

(Not tracked)

VERIFIED WONTFIX
Q3 11 - Serrano

People

(Reporter: treilly, Assigned: stejohns)

References

Details

(Whiteboard: PACMAN, has-patch)

Attachments

(6 files, 1 obsolete file)

End result slow Finalize which could lead to uncool pause times.
Priority: -- → P4
Target Milestone: --- → flash10.1
I've observed this too, in some of the programs in test/performance/mmgc - destroyInstance walks a bit vector, and that can be quite slow.
Hardware: x86 → All
destroyInstance appears to add considerable cost even to objects that should require minimal destruction, see numbers in bug #531250.
Blocks: 531250
Target Milestone: flash10.1 → flash10.2
Whiteboard: PACMAN
Assignee: nobody → stejohns
Looking at this, I realized that we can also considerably simplify how we walk them; as a result of changes to the slot layout by Chris Brichford, the pointer slots are now always contiguous (in the middle of the object), so we don't need a bitset to find them, just an offset and count. (If we rearranged their order just a count could suffice.) Looks for a series of patches...
On this note, do we have any good existing benchmarks for timing this? (My initial assumption is to allocate a few million objects, then force a collection...)
given this inheritance tree: class A { int int_a // int32 Array array_a // ScriptObject* } class B { int int_b // int32 Array array_b // ScriptObject* } Won't the layout be this: (assume a 32-bit cpu) { /* ScriptObject stuff */ int32_t int_a ScriptObject* array_a int32_t int_b ScriptObject* array_b }
Yep, but within each subclass they are a contiguous range.
Attached file Rough performance test (deleted) —
Initial perf benchmark to try to measure the cost of destroyInstance: basically, allocate a few million of various sorts of objects, force a full collection, measure the time.
Attachment #441586 - Flags: feedback?(lhansen)
Attached patch Remove the BitSet walking (deleted) — Splinter Review
Take advantage of the slot layout to get rid of FixedBitSet entirely. From my testing (MacTel32), perf is roughly a wash (no gain, no loss), but code is substantially smaller and simpler, and should pave the way for some subsequent patches. (MacTel64 is actually measurably faster for the most complex testcase with lots of objects, by about 5%.) Tradeoffs to look at: -- We formerly used a FixedBitSet, which would fit into a pointer-sized field for small objects (good) but dynamically allocate for larger ones (bad). Now we just need an offset-and-count pair... for small objects on 32-bit systems this is more space (2 int32 vs one intptr), but for all other configs it's at least as small, and (trivially) reduces gc pressure. (Note that if we restructured the order of slot layout we might be able to get by with just a count, not an offset.) -- the old code took great pains to walk thru the object in low-to-high-addr order, zeroing as it went, presumably to maximize cache performance. the new code doesn't bother with this: it focuses on just the section(s) with pointers, then finishes with a single memset of the whole thing. (furthermore, it walks the pointer section in reverse order, so as to avoid recursion.) In theory, this is probably suboptimal, but since I plan on adding jitted dtors (which will go in low-to-high order) in a subsequent patch, this is probably an acceptable tradeoff.
Attachment #441590 - Flags: feedback?(treilly)
Attached patch Encapsulate isDictionary flag (obsolete) (deleted) — Splinter Review
Refactoring patch to pave the way for subsequent patch; pulled out separately to make reviewing easier
Attachment #441615 - Flags: review?(edwsmith)
Attachment #441590 - Flags: review?(edwsmith)
Comment on attachment 441590 [details] [diff] [review] Remove the BitSet walking changing to feedback until I get better perf numbers measured from the patch set.
Attachment #441590 - Flags: review?(edwsmith)
Attachment #441590 - Flags: review?
Attachment #441590 - Flags: feedback?(edwsmith)
Add a function pointer for destroyInstance, and specialize two cases that are common and easily detectable: no-pointer-slots-and-no-hashtable (aka Plain Old Data), and no-pointer-slots-with-hashtable (aka dynamic objects). This change (plus previous patches) runs the benchmark at about 91% of total time (vs TR baseline) on MacTel32, and 95% of total time on MacTel64. The obvious downside to this is that Traits gets another function pointer added to it, so there's a modest memory size increase.
Attachment #441619 - Flags: feedback?(edwsmith)
Attachment #441619 - Flags: feedback?(treilly)
Attached patch First cut at jitting dtors (deleted) — Splinter Review
First cut. Definitely not ready for review, because: (1) disappointingly, it's not much faster than the previous patch... close to the noise range (2) the code generated is kinda large (3) it tends to hang on acceptance tests; I haven't diagnosed the problem, but wonder if our CodeMgr is getting destroyed at avm-teardown-time ahead of things that still need the custom dtor functions (which would be bad)... So, no review, but feedback welcomed.
Attachment #441672 - Flags: feedback?(edwsmith)
Comment on attachment 441672 [details] [diff] [review] First cut at jitting dtors initial reactions to jitter * when I wrote InvokerCompiler::assemble(), I was hoping to generalize it and move it to LirHelper... might be worth a try at that before the cut+pasting gets out of control. * I can hear Moh from intel muttering: "we can specalize that using rep mov p, 0" (fast inline x86 specific memclear)... doubt it matters, yet. * this has all the markings of an AVMTWEAK rather than AVMFEATURE * I think I'd be happier if all the compiler code was in the compiler class, instead of spread between Traits.cpp and CodegenLIR.cpp. but if you have a reason you did it that way, add a comment. * the handling of jit failure (fall through to a second null check) is too subtle. comment?
Attachment #441672 - Flags: feedback?(edwsmith) → feedback+
(In reply to comment #13) > (From update of attachment 441672 [details] [diff] [review]) > * this has all the markings of an AVMTWEAK rather than AVMFEATURE No opinion on the patch, but a general observation: Tweaks and features are intended for configuration points that will stay in the code indefinitely, as various platforms may prefer various settings and it's meaningful to expose those configuration points to embedders. Configuration points that are for experimentation and will go away eventually when the numbers are in should stay out of the avmfeatures altogether, and be ad-hoc ifdefs in avmplus.h (typically), which are eventually removed. (I guess we can have a discussion about whether internal-only configuration points can meaningfully stay in the code forever for reasons of continuous benchmarking etc, but as we do no such benchmarking today allow me to express skepticism at that.) (There are possibly settings in avmfeatures that already don't belong there, by that definition (and certainly features that should be demoted to tweaks)).
Attachment #441615 - Flags: review?(edwsmith) → review+
Attachment #441590 - Flags: review?
(In reply to comment #14) > Configuration > points that are for experimentation and will go away eventually when the > numbers are in should stay out of the avmfeatures altogether, and be ad-hoc > ifdefs in avmplus.h (typically), which are eventually removed. Agreed -- if this pans out, it will likely simply be tied to nanojit; the feature-itis is for my ongoing dev work. If it makes reviewing harder, I'll switch to ad-hoc for subsequent patches.
(In reply to comment #13) > * when I wrote InvokerCompiler::assemble(), I was hoping to generalize it and > move it to LirHelper... might be worth a try at that before the cut+pasting > gets out of control. Can do. > * I think I'd be happier if all the compiler code was in the compiler class, > instead of spread between Traits.cpp and CodegenLIR.cpp. but if you have a > reason you did it that way, add a comment. Hm, IMHO it is all in the compiler class -- Traits simply drives it to tell the offsets to generate. The alternative is to have CodegenLIR have intimate knowledge about Traits slot layout, which is currently (thankfully) pretty much concentrated in Traits. > * the handling of jit failure (fall through to a second null check) is too > subtle. comment? Comments are good, I'll add some
Comment on attachment 441590 [details] [diff] [review] Remove the BitSet walking + Whats not to like about deleting code? - Needs a situational comment explaining pointer/non-pointer slot layout, and how we're taking advantage of it. (or reference to a doc on slot layout) +/- needs data to motivate the size/speed tradeoff. even some measurements or back-of-envelope calculations on a hypothetical app, would be useful.
Attachment #441590 - Flags: feedback?(edwsmith) → feedback+
(In reply to comment #17) > how we're taking advantage of it. (or reference to a doc on slot layout) Fully documented in doc/slots.txt; I'll reference that > +/- needs data to motivate the size/speed tradeoff. even some measurements or > back-of-envelope calculations on a hypothetical app, would be useful. Agreed
Comment on attachment 441619 [details] [diff] [review] Specialize destroyInstance via function ptr hasPointerSlotsThatNeedDestruction() could avoid traversing up the inheritance chain by inspecting the base class's m_destroyProc field. Doubt its important here, but its hard to pass up a chance point out an O(1) alternative to an O(N) function. As in the previous patch, a policy comment would be helpful. Best place is probably near setDestroyProc(). I would be interested in some statistics that show how frequently each destroyInstance variant is used.
Attachment #441619 - Flags: feedback?(edwsmith) → feedback+
Attachment #441590 - Flags: feedback?(treilly) → feedback+
Comment on attachment 441619 [details] [diff] [review] Specialize destroyInstance via function ptr if (t->m_offsetofSlots > sizeof(ScriptObject)) Would this be obvious if I read the new slots layout doc? Maybe a t->IsBuiltin() would be clearer?
you might get more accurate benchmark data if your test tried to not measure allocation/init time. I was thinking you could to this by using from fragile hacks that I won't mention or you could add: System.nogc = true ... do alllocations ... now = new Date() System.forceFullCollection(); Control of nogc from AS3 would probably come in handy elsewhere
(In reply to comment #20) > (From update of attachment 441619 [details] [diff] [review]) > if (t->m_offsetofSlots > sizeof(ScriptObject)) > > Would this be obvious if I read the new slots layout doc? Maybe a > t->IsBuiltin() would be clearer? Not obvious enough, it took me a bit to figure it out. I'll expand on it with more comments (or maybe tweak the slot layout a bit more)
A cheap an easy way to just measure destruction time would be to run the benchmark with -Dnogc and change forceFullCollection override nogc to be true
Comment on attachment 441619 [details] [diff] [review] Specialize destroyInstance via function ptr Would it make sense to wrap up ' this->m_hashTableOffset != 0 && !this->isDictionary() ' in a method call? And it will be interesting to explore both the performance of the jit'd dtors along with the memory usage.
Attachment #441615 - Attachment is obsolete: true
(In reply to comment #21) > you might get more accurate benchmark data if your test tried to not measure > allocation/init time. Yeah, but... is that really useful? In actual usage, we're not going to have destruction without allocation. (I can always use Shark to measure the destruction time itself)
(In reply to comment #17) > +/- needs data to motivate the size/speed tradeoff. even some measurements or > back-of-envelope calculations on a hypothetical app, would be useful. OK, I did some rough profiling... the existing FixedBitSet works out to be slightly smaller if most Traits have < 31 slots (as it doesn't do out of line allocs then). Typically this is true of "simple" (ie non-Flex) apps. Running ATS AS3 cases I ended up with only a tiny % of cases needed to do an out of line alloc, which ended up meaning the existing data structure is nearly half the space. Testing on a handful of Flex apps showed a smaller advantage, but still an advantage. (All for 32-bit systems, btw.) However, this can probably be mitigated by rearranging the ordering of the slots; I'll do a quick experiment with that to see if it's practical.
New patch that rearranges the way we layout slots; formerly, we always did 4-byte slots (possible padding) pointer-sized slots (possible padding) 8-byte slots with this patch, we now do (possible padding) pointer-sized slots 4-byte slots (possible padding) 8-byte slots On average, this should be the same amount of space (it may vary slightly on 64-bit systems depending on padding) but allows us to assume the pointer-slots always start at m_offsetofSlots; this paves the way for subsequent patches to avoid having to store a separate offset to the start of the pointer slots.
Attachment #442209 - Flags: review?(edwsmith)
Attachment #442209 - Flags: feedback?(chrisb)
(In reply to comment #28) > Created an attachment (id=442209) [details] > Rearrange slot order to put pointer-slots first > > ... > > with this patch, we now do > > (possible padding) > pointer-sized slots > 4-byte slots > (possible padding) > 8-byte slots > > On average, this should be the same amount of space (it may vary slightly on > 64-bit systems depending on padding) but allows us to assume the pointer-slots > always start at m_offsetofSlots; this paves the way for subsequent patches to > avoid having to store a separate offset to the start of the pointer slots. I think the new layout you are propsing is fine. The main thing I was trying to do was to make the generated struct for the slots in glue classes not need to have #ifdef's for 64 bit vs. 32 bit and minimize the amount of padding in the struct. In the old layout you would never have more than one int32_t for padding. In the new layout you might have two. If the benefits of having the pointers at the front out weigh the cost of the potential extra padding on 64 bits then I think you are doing the right thing.
(In reply to comment #29) > (In reply to comment #28) > In the old layout you would never have more than one int32_t for > padding. In the new layout you might have two. When would that happen (for 32-bit systems)? The first "possible padding" (before the pointer-sized slots) should only ever be possible for 64-bit systems. (FWIW, some additional rearrangement of fields in Traits will be necessary for this to be truly useful, so look for a subsequent patch first)
A truly trivial optimization, but arguably worth capturing: we use m_extraSize at far more call points (every object constructor) than we do m_totalSize (hardly ever).
Attachment #442557 - Flags: review?(edwsmith)
Attachment #442209 - Flags: feedback?(chrisb) → feedback+
So after some fiddling around, I'm not convinced that jitting the dtors is likely to pan out, so I think I'm going to mothball this unless someone has a suggestion... (1) The patches that remove BitSet provide a small-but-measurable improvement in some cases, but I don't really move the needle any further. (2) The jitted code is crashy, because at GC teardown time, the CodeMgr can (and is) destroyed before all the jitted dtors. To fix this we'd need to create a separate CodeMgr that manages the dtors and is guaranteed to live longer (which is doable but probably not worth the effort, see (1)). (3) If RCObjects/ZCT goes away entirely (as we're hoping to do for this cycle), this whole effort becomes moot since none of this DecrementRef nonsense will be necessary. Thoughts? (I could land the first two patches; they would cause a modest memory increase and a modest perf enhancement...)
Attachment #442557 - Flags: review?(edwsmith) → review+
if anything we should continue to test & refine the non-jit code ; only when you have a test that more strongly depends on destructor speed, should we worry about the jit version. I think the only blocker for landing the first two patches, is getting data on a wider (perhaps more relevant) set of tests. (2) doesn't seem like a decision factor for the jit-patch landing, to me, its just a bit more work. CodeMgr is free of GC dependencies, and internally supports freeing individual code hunks, with free space coalescing. Traits needs code to explicitly free destructors so that when a pool is unloaded, the destructors for types defined in that pool get freed.
Attachment #441586 - Flags: feedback?(lhansen) → feedback+
Comment on attachment 441586 [details] Rough performance test Most of the objects will be destroyed by reference counting, not GC, so the forceFullCollection is probably mostly redundant but hardly wrong. (You need circular object references to avoid that.)
(In reply to comment #26) > (In reply to comment #21) > > you might get more accurate benchmark data if your test tried to not measure > > allocation/init time. > > Yeah, but... is that really useful? In actual usage, we're not going to have > destruction without allocation. (I can always use Shark to measure the > destruction time itself) I agree with Steven, since we don't really care about percentages, only the actual time reduced; looking at percentages is not the right approach for PACMAN, we should always be looking to absolute improvements - most of the areas we are looking at improving probably take less than 2% of the running time in many cases.
Another thing to think about in this area would be to look at using loop unrolling and/or SSE2 instructions to look more than one pointer at a time. I was thinking about this for decrementAtomRegion which showed up as a hotspot in another benchmark. They might even be able to share the same code.
Attachment #442209 - Flags: review?(edwsmith) → review+
Attachment #441619 - Flags: feedback?(treilly) → feedback+
Whiteboard: PACMAN → PACMAN, has-patch
Note that this bug will be made obsolete if/when https://bugzilla.mozilla.org/show_bug.cgi?id=538943 lands, as Traits::destroyInstance won't be necessary anymore (in any form).
Closing this as WONTFIX as bug 538943 is almost certain to happen before we get around to this.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
bulk verifying resolved !fixed issues
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: