Closed Bug 575316 Opened 14 years ago Closed 14 years ago

Reengineer per-object bits to provide for more bits

Categories

(Tamarin Graveyard :: Garbage Collection (mmGC), defect, P3)


Tracking

(Not tracked)

RESOLVED FIXED
Q3 11 - Serrano

People

(Reporter: lhansen, Assigned: lhansen)

References

Details

(Whiteboard: has-patch)

Attachments

(4 files, 4 obsolete files)

(Spun off from bug #516156.) A number of useful projects - exact marking, generations among them - depend on having more per-object bits available.
Attached patch Preliminary patch (obsolete) (deleted) — Splinter Review
Rescued from the parent bug, painstakingly rebased to redux 4882, tested on MacOS, test-compiled on Windows. Confidence level == medium; I still need to run all benchmarks (for acceptance testing) and run acceptance on more platforms. I also need to measure performance and memory consumption, although frankly, the memory consumption will have to be what it is - we need more bits.

There are really two changes rolled into one here. The big change is a general refactoring that changes the representation of object bits and, as a result, changes many APIs internal to the GC; also as a result, the GC bits per object now need to occupy an integral number of bytes. This is what you get if MMGC_FASTBITS is disabled in MMgc.h.

The second change is MMGC_FASTBITS - a new representation for the object bit /array/, which allows the array to have holes that go unused. The purpose of the representation is to make it quicker to compute the address of the mark bits corresponding to a particular object. (If desirable I can break this out as a separate patch.)

Apart from more testing and benchmarking, I need to go through and add some documentation here and there, at least. (For the purpose of making the review easier I've neglected to re-indent some code in GCAlloc.cpp whose controlling statements were coalesced.)
Attached patch Refactor mark bits, make more space available (obsolete) (deleted) — Splinter Review
See description of previous patch. This one comes without the MMGC_FASTBITS changes, though it anticipates them in that it makes a distinction between the offset of an object within its block and the offset of that object's bits within the block's bit vector. Without MMGC_FASTBITS that distinction would not have been very interesting. In sum, this patch increases the number of mark bits per object from four to eight, and requires further increases to be in multiples of eight bits. It implements a number of simplifications that fall out from that change.
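To make the shape of that change concrete, here is a minimal before/after sketch; the names are illustrative, not MMgc's actual API:

    #include <stdint.h>

    typedef uint8_t gcbits_t;

    // Before: four bits per object, two objects packed per byte, so
    // every access needs an index split plus shift/mask work.
    static uint8_t GetBits4(const uint8_t* bits, int index)
    {
        uint8_t b = bits[index >> 1];
        return (index & 1) ? uint8_t(b >> 4) : uint8_t(b & 0x0F);
    }

    // After: a whole byte per object. The bits can be handed out by
    // reference and updated with ordinary loads and stores, and any
    // further growth comes in whole-byte (multiple-of-eight) steps.
    static gcbits_t& GetBits8(gcbits_t* bits, int index)
    {
        return bits[index];
    }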
Attachment #454579 - Attachment is obsolete: true
Attachment #455455 - Flags: review?(treilly)
Attachment #455455 - Flags: review?(fklockii)
Attached patch Implement MMGC_FASTBITS (deleted) — Splinter Review
This implements MMGC_FASTBITS on top of the previous patch. The comment in GCAlloc.cpp describes the change in detail, but effectively we allow holes in the bit vector in order to compute the offset of the object's mark bits within the bit vector more efficiently. It's unclear how important this change is, as of yet. I won't insist on landing it, but I would like to have it reviewed.
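For intuition, here is a sketch of the lookup this enables; the field names (bits, bitsShift) and the block size are assumptions based on the description above, not taken from the patch itself:

    #include <stdint.h>

    typedef uint8_t gcbits_t;
    const uintptr_t kBlockSize = 4096;  // assumed block size

    struct GCBlock
    {
        gcbits_t* bits;       // mark byte array for this block
        uint32_t  bitsShift;  // floor(log2(object size)), per block
    };

    // With bitsShift = floor(log2(size)), objects that are size bytes
    // apart always map to distinct byte indices, but sizes that are
    // not a power of two leave unused holes in the array (at worst
    // half of it). In exchange, the divide-by-object-size of the
    // obvious scheme becomes a single shift:
    static gcbits_t* GetGCBits(const void* obj, const GCBlock* block)
    {
        uintptr_t offsetInBlock = uintptr_t(obj) & (kBlockSize - 1);
        return block->bits + (offsetInBlock >> block->bitsShift);
    }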
Attachment #455456 - Flags: review?(treilly)
Attachment #455456 - Flags: review?(fklockii)
Blocks: 576307
Whiteboard: has-patch
Comment on attachment 455455 [details] [diff] [review]
Refactor mark bits, make more space available

An update to this patch will be posted later. The difference between the patch and the new patch is that four instances of

    b->bits[...] |= kFreelist;

will change to

    b->bits[...] = kFreelist;

(Those are all the instances of the former pattern in the MMgc code.) Please review with that in mind.
(In reply to comment #4)
> An update to this patch will be posted later. The difference between the
> patch and the new patch is that four instances of
>
>     b->bits[...] |= kFreelist;
>
> will change to
>
>     b->bits[...] = kFreelist;
>
> (Those are all the instances of the former pattern in the MMgc code.)
> Please review with that in mind.

Just to be clear: the former |= variant was keeping in the spirit of a refactoring of the existing code (which invoked SetBit), while the latter = variant is a (sound) change to the behavior, correct? Is this fixing a bug? Or was the code previously clearing the other bits elsewhere in the control flow, and this is more of a local performance fix?
(In reply to comment #5)
> Just to be clear: the former |= variant was keeping in the spirit of a
> refactoring of the existing code (which invoked SetBit), while the latter =
> variant is a (sound) change to the behavior, correct?

A necessary change.

> Is this fixing a bug?

Yes. We want to make sure that when something moves onto the free list, all other bits are reset so that they do not linger, because we do not always properly clear bits when we pick an object off the free list. The invariant is that objects on the free list have kFreelist set, no other bits. CreateChunk establishes the invariant.
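Schematically (the flag names here are illustrative stand-ins for MMgc's real constants), the difference is:

    #include <stdint.h>

    enum GCFlags { kFreelist = 1 << 0, kFinalizable = 1 << 1, kHasWeakRef = 1 << 2 };

    static void FreeObject(uint8_t& bits)
    {
        // Buggy: OR-ing preserves flags from the object's previous
        // life (e.g. kFinalizable), and allocation does not always
        // clear them again:
        //   bits |= kFreelist;
        // Fixed: plain assignment resets every other flag, so an
        // object coming off the free list starts from a known state:
        bits = kFreelist;
    }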
Depends on: 576576
Incorporates the dependency on bug #576576, folds in the fixes I mentioned earlier, and has one more change in GCAlloc::AllocFromQuickList, where the flags on the object are cleared and the kFinalizable flag is set if appropriate. Previously it would only clear the low four bits, but that's not correct: the object is fresh and should have all flags cleared.
Attachment #455455 - Attachment is obsolete: true
Attachment #456265 - Flags: review?(treilly)
Attachment #456265 - Flags: review?(fklockii)
Attachment #455455 - Flags: review?(treilly)
Attachment #455455 - Flags: review?(fklockii)
(In reply to comment #6)
> The invariant is that objects on the free list have kFreelist set, no other
> bits.

The correct invariant is that objects on the free list have kFreelist set. CreateChunk, GCAlloc::Free, and the garbage collector all maintain that invariant. Objects on the free list that were freed explicitly or by DRC may have some other bits set additionally; the additional bits are cleared in GCAlloc::AllocFromQuickList.
Comment on attachment 456265 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

How's performance? Isn't working with machine word sized pointers faster than byte pointers (gcbits&) in general? Or is that old dumb compiler wisdom?
Comment on attachment 456265 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

Random thought: could we make GCAlloc::GetBitsIndex return 0 for LargeObjects and fold more large/small code together? items has to go into GCBlockHeader, so that would make it bigger for GCLarge. items is a pretty simple calculation; we could get rid of it (gain 4 bytes for an extra read and some simple math).
Attachment #456265 - Flags: review?(treilly) → review+
(In reply to comment #10)
> How's performance? Isn't working with machine word sized pointers faster
> than byte pointers (gcbits&) in general? Or is that old dumb compiler
> wisdom?

Hmm; I was worried about potential lossage due to moving from register-allocatable values to C++ references [1], but I had not thought of the word/byte size issue.

[1] Lars and I chatted about this in the asteam chat room; he suggested exposing separate accessors for value or reference semantics. http://asteam.corp.adobe.com/irc/log/asteam/asteam/20100702
Comment on attachment 456265 [details] [diff] [review] Refactor mark bits, make more space available (v2) I bit the bullet and did a tiny bit of comparison of disassembled routines. My sample was very small; just two routines: GC::ClearUnmarkedWeakRefs and GC::WriteBarrierHit. Skimming them was enough to convince me that there are not any obvious problems with using C++ references. (The ordering of quantifiers is important there.) I'll attach the relevant portions of my transcript, for completeness.
Attachment #456265 - Flags: review?(fklockii) → review+
Attached file baseline disassembly (deleted) —
baseline disassembly; contrast against disassembly post-refactor via attachment 456265 [details] [diff] [review] (to be posted next).
Attached file refactor disassembly (deleted) —
disassembly post-refactor via attachment 456265 [details] [diff] [review]; contrast against baseline disassembly (attachment 462752 [details])
Comment on attachment 455456 [details] [diff] [review] Implement MMGC_FASTBITS (reviewing)
Rebase of original patch to the tip, to save Lars and Tommy the trouble. (I've already gone through this rebasing process twice, so I thought I'd put the rebased version up to avoid duplicating that effort again.) Lars: diffing the patches themselves should indicate that I did not change the patch itself.
Attachment #456265 - Attachment is obsolete: true
(In reply to comment #17)
> Lars: diffing the patches themselves should indicate that I did not change
> the patch itself.

(Diffing them will tell you that I did accidentally drop Lars's header describing the patch; it said: "New mark bits structure".)
Comment on attachment 465657 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

Marking the rebase of the patch as self-reviewed. Requesting super-review from Lars mostly as a formality. (Note that the previous patch, originally authored by Lars, had been reviewed by Tommy and myself.)
Attachment #465657 - Flags: superreview?(lhansen)
Attachment #465657 - Flags: review+
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

Any idea why MMGC_FASTBITS improves memory usage on jsbench/FFT.as? (By 17-35% (!)?) (I thought it might be an artifact of time improvements changing GC behavior, but after turning off incremental GC I still see a big improvement in memory usage on FFT.)

Here are the relevant details:

avm: ../../../tamarin-redux2/objdir-selftest32/shell/avmshell -Dnoincgc
version: cyclone
avm2: ../../../tamarin-redux/objdir-selftest32/shell/avmshell -Dnoincgc
version: cyclone
iterations: 5

                          avm:cyclone       avm2:cyclone
test                      best     avg      best     avg      %dBst   %dAvg
Metric: memory
...
jsbench/FFT.as            41.2M    31.6M    26.7M    26.0M    35.19   17.65 +
...
jsbench/typed/FFT.as      3.1M     3.1M     3.1M     3.1M     0       -0.65
...

avm: ../../../tamarin-redux2/objdir-selftest32/shell/avmshell
version: cyclone
avm2: ../../../tamarin-redux/objdir-selftest32/shell/avmshell
version: cyclone
iterations: 5

                          avm:cyclone       avm2:cyclone
test                      best     avg      best     avg      %dBst   %dAvg
Metric: time
...
jsbench/FFT.as            12579    12611.4  12010    12144.8  4.52    3.70 +
...
jsbench/typed/FFT.as      7919     7934.6   7594     7616.4   4.10    4.01 +
...
(In reply to comment #20)
> Any idea why MMGC_FASTBITS improves memory usage on jsbench/FFT.as? (By
> 17-35% (!)?)

Clarification: when running in incremental GC mode, the memory usage improvement is "only" 9-22%; the above 17-35% figure is just from the -Dnoincgc run.

avm: ../../../tamarin-redux2/objdir-selftest32/shell/avmshell
version: cyclone
avm2: ../../../tamarin-redux/objdir-selftest32/shell/avmshell
version: cyclone
iterations: 5

                          avm:cyclone       avm2:cyclone
test                      best     avg      best     avg      %dBst   %dAvg
Metric: memory
...
jsbench/FFT.as            39.5M    36.0M    35.9M    28.0M    9.11    22.12
...
Comment on attachment 465657 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

Rubber stamp.
Attachment #465657 - Flags: superreview?(lhansen) → superreview+
(In reply to comment #20)
> Any idea why MMGC_FASTBITS improves memory usage on jsbench/FFT.as?
> (By 17-35% (!)?)

No.

(In reply to comment #21)
> Clarification: when running in incremental GC mode, the memory usage
> improvement is "only" 9-22%; the above 17-35% figure is just from the
> -Dnoincgc run.

No, even so :-)

I assume the comparison here is between new mark bits on the one hand and new mark bits plus mmgc_fastbits on the other? Or is the baseline an unmodified redux?

Anyway, some speculation: MMGC_FASTBITS will definitely change memory dynamics, because fewer bitmaps will be allocated on-page - the new mark bits by themselves require twice the memory, and with MMGC_FASTBITS we require more still. If there is a lot of floating-point data on the stack, it is conceivable that the changed dynamics result in different false-retention patterns. Since memory management is essentially chaotic and programs like FFT will allocate some arrays, a small change could have a large effect.

It's not a great explanation, because jsbench/FFT.as uses mainly untyped variables and data, so everything should be boxed, and floating-point boxes are handled as non-pointer-containing.
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

Could we use FindOneBit on size instead of adding a new field? Do all architectures have a log2n instruction like x86 (I think so)?
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

Why does MMGC_FASTBITS have its own GetGCBits? Seems like the one that calls GetBitsIndex should work fine, no?
(In reply to comment #24)
> Could we use FindOneBit on size instead of adding a new field? Do all
> architectures have a log2n instruction like x86 (I think so)?

Of all the platforms we support so far, the only one without a find-bit instruction seems to be SH4. The gcc intrinsic is __builtin_ctz[ll], and the MSVC intrinsic is _BitScanForward[64].

I'm not into the code enough to argue whether a load is faster/slower than using the processor intrinsic, however.
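For reference, a portable wrapper along these lines might look like the following (a sketch; MMgc's actual FindOneBit may differ):

    #include <stdint.h>
    #if defined(_MSC_VER)
    #include <intrin.h>
    #endif

    // Trailing-zero count; equals log2(x) when x is a power of two.
    // x must be nonzero.
    static inline uint32_t TrailingZeroes(uint32_t x)
    {
    #if defined(__GNUC__) || defined(__clang__)
        return (uint32_t)__builtin_ctz(x);
    #elif defined(_MSC_VER)
        unsigned long index;
        _BitScanForward(&index, x);
        return (uint32_t)index;
    #else
        // Fallback for targets without a find-bit instruction (e.g. SH4).
        uint32_t n = 0;
        while ((x & 1) == 0) { x >>= 1; n++; }
        return n;
    #endif
    }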
And rebased again to account for recent GC changes. The MMGC_FASTBITS patch is still good.
Attachment #465657 - Attachment is obsolete: true
Some data (first of several chunks). Conclusions will be posted after all the data. The system is my macpro, not doing much. avm1 is the baseline, TR 5067; avm2 is baseline+refactoring. Compiled with --enable-shell, otherwise defaults.

Metric: time
jsbench/Crypt.as                  3348    3330     0.5    0.5
jsbench/Euler.as                  6587    6396     2.9    2.9
jsbench/FFT.as                    6481    6682    -3.1   -3.1
jsbench/HeapSort.as               2789    2784     0.2    0.2
jsbench/LUFact.as                 7488    7600    -1.5   -1.5
jsbench/Moldyn.as                 8612    8480     1.5    1.5
jsbench/RayTracer.as              6877    6895    -0.3   -0.3
jsbench/Series.as                 8055    7860     2.4    2.4
jsbench/SOR.as                    30949   29360    5.1    5.1
jsbench/SparseMatmult.as          11808   11746    0.5    0.5
jsbench/typed/Crypt.as            815     822     -0.9   -0.9
jsbench/typed/Euler.as            8310    8053     3.1    3.1
jsbench/typed/FFT.as              4060    3861     4.9    4.9
jsbench/typed/HeapSort.as         1380    1434    -3.9   -3.9
jsbench/typed/LUFact.as           8062    7550     6.4    6.4
jsbench/typed/Moldyn.as           3692    3553     3.8    3.8
jsbench/typed/RayTracer.as        1039    1029     1.0    1.0
jsbench/typed/Series.as           7007    6804     2.9    2.9
jsbench/typed/SOR.as              22808   21176    7.2    7.2
jsbench/typed/SparseMatmult.as    2221    2598   -17.0  -17.0

I've run this twice; the second run was with the binaries reversed, to see if that mattered. (It didn't, mostly.) The general summary is that the overall trend is positive, there are minor slowdowns (typed/SparseMatmult dropped to -0.4% on the second run), and there is a persistent 3.5-4% slowdown on typed/HeapSort.as.
More data, same system.

Metric: time
mmgc/gcbench.as    2652   2271   14.4   14.4
mmgc/ofib-rc.as     213    206    3.3    3.3
mmgc/ofib.as        987    967    2.0    2.0
mmgc/sfib.as        333    321    3.6    3.6

I did three runs. The 14% speedup on gcbench is persistent, as is the speedup on sfib. ofib-rc was always positive; ofib fluctuated from -0.4 to +2.0.
More data, same system.

Metric: v8 custom - v8 normalized metric (hardcoded in the test)
v8.5/js/crypto.as                  437    445     1.8    1.8
v8.5/js/deltablue.as               387    390     0.8    0.8
v8.5/js/earley-boyer.as            1219   1244    2.1    2.1
v8.5/js/raytrace.as                781    813     4.1    4.1
v8.5/js/regexp.as                  107    111     3.7    3.7
v8.5/js/richards.as                319    316    -0.9   -0.9
v8.5/js/splay.as                   836    986    17.9   17.9
v8.5/optimized/crypto.as           4549   4510   -0.9   -0.9
v8.5/optimized/deltablue.as        3798   3883    2.2    2.2
v8.5/optimized/earley-boyer.as     1213   1233    1.6    1.6
v8.5/optimized/raytrace.as         9038   9587    6.1    6.1
v8.5/optimized/regexp.as           108    112     3.7    3.7
v8.5/optimized/richards.as         4455   4664    4.7    4.7
v8.5/optimized/splay.as            6105   6691    9.6    9.6
v8.5/typed/crypto.as               2654   2706    2.0    2.0
v8.5/typed/deltablue.as            4122   4207    2.1    2.1
v8.5/typed/earley-boyer.as         1218   1246    2.3    2.3
v8.5/typed/raytrace.as             9047   9579    5.9    5.9
v8.5/typed/regexp.as               108    112     3.7    3.7
v8.5/typed/richards.as             4455   4658    4.6    4.6
v8.5/typed/splay.as                1249   1289    3.2    3.2
v8.5/untyped/crypto.as             482    476    -1.2   -1.2
v8.5/untyped/deltablue.as          1942   1966    1.2    1.2
v8.5/untyped/earley-boyer.as       1216   1256    3.3    3.3
v8.5/untyped/raytrace.as           3544   3705    4.5    4.5
v8.5/untyped/regexp.as             108    111     2.8    2.8
v8.5/untyped/richards.as           501    506     1.0    1.0
v8.5/untyped/splay.as              1188   1206    1.5    1.5

Two runs; the data appear stable. A clear positive trend, some significant wins; slowdowns are all minor.
More data, same system. I've purged differences in the range -2.5..2.5.

Metric: v8 custom - v8 normalized metric (hardcoded in the test)
asmicro/alloc-1.as                 44     48      9.1    9.1
asmicro/alloc-10.as                16     17      6.2    6.2
asmicro/alloc-11.as                16     17      6.2    6.2
asmicro/alloc-13.as                95     106    11.6   11.6
asmicro/alloc-14.as                82     90      9.8    9.8
asmicro/alloc-2.as                 22     23      4.5    4.5
asmicro/alloc-3.as                 18     19      5.6    5.6
asmicro/alloc-4.as                 51     58     13.7   13.7
asmicro/alloc-5.as                 35     38      8.6    8.6
asmicro/alloc-6.as                 78     83      6.4    6.4
asmicro/alloc-7.as                 39     44     12.8   12.8
asmicro/alloc-8.as                 18     19      5.6    5.6
asmicro/alloc-9.as                 18     19      5.6    5.6
asmicro/array-2.as                 772    724    -6.2   -6.2
asmicro/array-shift-1.as           148    157     6.1    6.1
asmicro/array-slice-1.as           21     22      4.8    4.8
asmicro/array-sort-1.as            29     30      3.4    3.4
asmicro/array-sort-3.as            22     23      4.5    4.5
asmicro/array-unshift-1.as         185    179    -3.2   -3.2
asmicro/closedvar-write-2.as       5208   5999   15.2   15.2
asmicro/funcall-4.as               119    128     7.6    7.6
asmicro/globalvar-read-1.as        4984   4779   -4.1   -4.1
asmicro/globalvar-write-1.as       6526   4739  -27.4  -27.4
asmicro/isNaN-1.as                 421    574    36.3   36.3
asmicro/number-toString-2.as       72     78      8.3    8.3
asmicro/parseInt-1.as              208    194    -6.7   -6.7
asmicro/regex-exec-1.as            67     71      6.0    6.0
asmicro/regex-exec-2.as            79     85      7.6    7.6
asmicro/regex-exec-3.as            137    143     4.4    4.4
asmicro/regex-exec-4.as            306    324     5.9    5.9
asmicro/restarg-1.as               883    855    -3.2   -3.2
asmicro/restarg-2.as               491    519     5.7    5.7
asmicro/restarg-3.as               45     48      6.7    6.7
asmicro/restarg-4.as               30     34     13.3   13.3
asmicro/string-casechange-1.as     26     29     11.5   11.5
asmicro/string-casechange-2.as     26     29     11.5   11.5
asmicro/string-charAt-2.as         73     80      9.6    9.6
asmicro/string-fromCharCode-2.as   64     72     12.5   12.5
asmicro/string-indexOf-1.as        215    220     2.3    2.3
asmicro/string-slice-1.as          124    139    12.1   12.1
asmicro/string-substring-1.as      130    146    12.3   12.3
asmicro/switch-2.as                47     49      4.3    4.3
asmicro/try-1.as                   203    255    25.6   25.6
asmicro/try-2.as                   16     18     12.5   12.5
asmicro/try-3.as                   57     64     12.3   12.3
asmicro/vector-push-1.as           47     44     -6.4   -6.4

These benchmarks are inordinately sensitive to small perturbations (alignment on the stack and in the cache), so use with care, I guess. Again a very clear positive trend.
Based on the above data it seems pretty clear that the change does not cause a slowdown in any systematic way. Individual tests may be affected, but we don't know whether that's because we're using more memory (we are) or because of alignment or other issues. Since the general trend is a speedup across the board, it appears that the new code paths are beneficial. Since the patch was reviewed and approved, and the numbers look good, I'm inclined to land, and will do so tomorrow if I don't hear otherwise. Objections need to be registered today :-)
(In reply to comment #32)
> Since the patch was reviewed and approved, and the numbers look good, I'm
> inclined to land, and will do so tomorrow if I don't hear otherwise.
> Objections need to be registered today :-)

You landing both, or just the refactoring? (I won't object to landing both. I had not yet r+'ed MMGC_FASTBITS because I wanted to better understand GC and GCAlloc before doing so, but that should not yet hold things up.)
(In reply to comment #33)
> You landing both, or just the refactoring?

The latter, since the former has not been reviewed yet and there's a discussion still ongoing. (Performance data for fastbits coming up.)
Second set of data (first chunk of several). Conclusions will be posted after all the data. The system is my macpro, not doing much. avm1 is the baseline, TR 6067+refactoring; avm2 is baseline+fastbits. Compiled with --enable-shell, otherwise defaults.

Metric: time
jsbench/Crypt.as                  3333    3346    -0.4   -0.4
jsbench/Euler.as                  6393    6246     2.3    2.3
jsbench/FFT.as                    6509    6464     0.7    0.7
jsbench/HeapSort.as               2782    2789    -0.3   -0.3
jsbench/LUFact.as                 7509    7370     1.9    1.9
jsbench/Moldyn.as                 8445    8211     2.8    2.8
jsbench/RayTracer.as              6864    6520     5.0    5.0
jsbench/Series.as                 7803    7609     2.5    2.5
jsbench/SOR.as                    29189   28256    3.2    3.2
jsbench/SparseMatmult.as          11800   11638    1.4    1.4
jsbench/typed/Crypt.as            822     822      0.0    0.0
jsbench/typed/Euler.as            8069    7789     3.5    3.5
jsbench/typed/FFT.as              3867    3770     2.5    2.5
jsbench/typed/HeapSort.as         1432    1383     3.4    3.4
jsbench/typed/LUFact.as           7558    7272     3.8    3.8
jsbench/typed/Moldyn.as           3555    3617    -1.7   -1.7
jsbench/typed/RayTracer.as        1028    1026     0.2    0.2
jsbench/typed/Series.as           6788    6591     2.9    2.9
jsbench/typed/SOR.as              21018   19925    5.2    5.2
jsbench/typed/SparseMatmult.as    2172    2174    -0.1   -0.1

Generally a positive trend, but few numbers outside the 2.5% range, so it's hard to say that the change makes a big difference.
More data: v8.5, five iterations.

                             avm:cyclone       avm2:cyclone
test                         best     avg      best     avg      %dBst   %dAvg
Metric: v8 custom - v8 normalized metric (hardcoded in the test)
js/crypto.as                 448      447.8    458      458       2.23    2.28 +
js/deltablue.as              393      392.6    393      392.6     0       0
js/earley-boyer.as           1278     1271.4   1303     1297.4    1.96    2.04 +
js/raytrace.as               817      815.4    817      812       0      -0.42
js/regexp.as                 112      111.2    106      105.6    -5.36   -5.04 --
js/richards.as               316      315.8    321      320.8     1.58    1.58 +
js/splay.as                  989      985.2    993      987.8     0.40    0.26
optimized/crypto.as          4523     4520.6   4570     4564.4    1.04    0.97 +
optimized/deltablue.as       3937     3917.8   3972     3962      0.89    1.13 +
optimized/earley-boyer.as    1246     1239     1285     1281.4    3.13    3.42 +
optimized/raytrace.as        9607     9595.4   9784     9760.6    1.84    1.72 +
optimized/regexp.as          112      112      107      106.6    -4.46   -4.82 -
optimized/richards.as        4670     4666.4   4753     4747.4    1.78    1.74 +
optimized/splay.as           6705     6677.4   6881     6860      2.62    2.73 +
typed/crypto.as              2708     2706.2   2723     2720.8    0.55    0.54 +
typed/deltablue.as           4240     4221.8   4289     4273.2    1.16    1.22 +
typed/earley-boyer.as        1246     1242.8   1289     1286      3.45    3.48 +
typed/raytrace.as            9607     9589.8   9774     9752.8    1.74    1.70 +
typed/regexp.as              112      111.8    107      106.2    -4.46   -5.01 -
typed/richards.as            4670     4666.4   4746     4744.8    1.63    1.68 +
typed/splay.as               1288     1277.4   1239     1234.6   -3.80   -3.35 -
untyped/crypto.as            477      475.4    500      499.2     4.82    5.01 +
untyped/deltablue.as         1968     1962.2   1902     1894.8   -3.35   -3.43 -
untyped/earley-boyer.as      1255     1245.8   1285     1281      2.39    2.83 +
untyped/raytrace.as          3694     3683     3767     3754.8    1.98    1.95 +
untyped/regexp.as            111      111      106      105.8    -4.50   -4.68 -
untyped/richards.as          510      508.2    513      512.4     0.59    0.83 +
untyped/splay.as             1210     1203.6   1193     1191.8   -1.40   -0.98 -

Generally the trend is positive, though the regexp benchmark has a strong negative trend.
According to Shark the regexp benchmark spends 30% of its time in regular expression matching. That code is not affected by the GC but is a tight loop that is sensitive to small perturbations (as when a function is aligned differently in a cache line or on a page). That's a credible explanation for the negative trend on that benchmark.
Another interesting data point, maybe not pertinent here, is that the regexp benchmark really hammers on FixedMalloc - actually about 33% of the time goes to malloc and free, called from the regexp code.
(In reply to comment #26)
> (In reply to comment #24)
> > Could we use FindOneBit on size instead of adding a new field? Do all
> > architectures have a log2n instruction like x86 (I think so)?
>
> Of all the platforms we support so far, the only one without a find-bit
> instruction seems to be SH4. The gcc intrinsic is __builtin_ctz[ll], and the
> MSVC intrinsic is _BitScanForward[64].
>
> I'm not into the code enough to argue whether a load is faster/slower than
> using the processor intrinsic, however.

I'm not sure what the benefit would be. Now we have:

    load a value (bitsShift)
    use that to variable-shift another value

If we were to use a log instruction, we'd have:

    load a value (size)
    log
    use that to variable-shift another value

Ergo more work where we want less. I also don't know if shifting with the log of the size would make sense for large objects. I agree that it would be good to reduce the amount of per-block data, but the goal of the patch is to make it faster to find the bits for an object, and I'm willing to spend a little to make that possible. Am I missing something?
(In reply to comment #25)
> Why does MMGC_FASTBITS have its own GetGCBits? Seems like the one that calls
> GetBitsIndex should work fine, no?

I don't understand the question.
(In reply to comment #39)
> Am I missing something?

I'm operating on the hypothesis that reading the memory is gonna be many times more expensive than the log instruction, so they would be roughly equivalent. Another factor is that we have 4 bytes to play with on 32-bit systems, since the block header is currently 44 bytes. I was also thinking that the real savings here probably come from not having to read the shift and multiply from the GCAlloc instance, and that we may get similar speedups by just moving those to the block instead.
(In reply to comment #40)
> (In reply to comment #25)
> > Why does MMGC_FASTBITS have its own GetGCBits? Seems like the one that
> > calls GetBitsIndex should work fine, no?
>
> I don't understand the question.

Seems like unnecessary code duplication; GetGCBits could just call GetBitsIndex instead of inlining it by hand.
Comment on attachment 466991 [details] [diff] [review]
Refactor mark bits, make more space available (v2)

tamarin-redux changeset: 5068:8bd75a9d9735 (patch)
tamarin-redux changeset: 5084:a750dcce0c16 (merge)
(In reply to comment #41)
> (In reply to comment #39)
> > Am I missing something?
>
> I'm operating on the hypothesis that reading the memory is gonna be many
> times more expensive than the log instruction, so they would be roughly
> equivalent.

I still don't get it. If we were to compute the log of the size every time we'd still have to load the size field, so we'd just be adding the log instruction, and not getting rid of any instructions. Can you please be a lot more specific about what you have in mind?

> Another factor is that we have 4 bytes to play with on 32-bit systems, since
> the block header is currently 44 bytes. I was also thinking that the real
> savings here probably come from not having to read the shift and multiply
> from the GCAlloc instance, and that we may get similar speedups by just
> moving those to the block instead.

Again, can you please be a lot more specific? I don't understand what code you're referring to.
(In reply to comment #42)
> Seems like unnecessary code duplication; GetGCBits could just call
> GetBitsIndex instead of inlining it by hand.

I see. Could be a consequence of refactoring; I'll consider it.
> I still don't get it. If we were to compute the log of the size every time
> we'd still have to load the size field, so we'd just be adding the log
> instruction, and not getting rid of any instructions. Can you please be a
> lot more specific about what you have in mind?

If it costs 100 cycles to load the memory and 1 cycle to do the log, then there's not much difference between 101 and 100 cycles.

> > Another factor is that we have 4 bytes to play with on 32-bit systems,
> > since the block header is currently 44 bytes. I was also thinking that the
> > real savings here probably come from not having to read the shift and
> > multiply from the GCAlloc instance, and that we may get similar speedups
> > by just moving those to the block instead.
>
> Again, can you please be a lot more specific? I don't understand what code
> you're referring to.

sizeof(GCAlloc::GCBlock) == 44 bytes, and GC objects are 8-byte aligned, so there's a 4-byte pocket to play with; adding the shift to the header didn't cost us anything. But going on my earlier theory that loading memory is the true cost, maybe the shift/multiply approach would see the same speedup if we moved those values into the block. I'm basically trying to prove to myself that the (minor) memory wastage we're introducing is actually worth it. I'll run some benchmarks on refactor+fastbits vs. refactor+move-shift/multiply-into-block-header and report back.
Results of refactor vs. fastbits, followed by refactor vs. move-multiply/shift-to-block, followed by fastbits vs. move-multiply/shift-to-block.

avm: /tmp/avmshell-refactor
version: cyclone
avm2: /tmp/avmshell-fastbits
version: cyclone
iterations: 1

test                              avm:cyclone  avm2:cyclone  %diff
Metric: time
jsbench/Crypt.as                  5915    5944    -0.5   -0.5
jsbench/Euler.as                  11240   10733    4.5    4.5
jsbench/FFT.as                    11279   10609    5.9    5.9
jsbench/HeapSort.as               4491    4557    -1.5   -1.5
jsbench/LUFact.as                 11522   11355    1.4    1.4
jsbench/Moldyn.as                 15906   14839    6.7    6.7
jsbench/RayTracer.as              11514   10818    6.0    6.0
jsbench/Series.as                 12947   12759    1.5    1.5
jsbench/SOR.as                    51644   49095    4.9    4.9
jsbench/SparseMatmult.as          17896   17771    0.7    0.7
jsbench/typed/Crypt.as            1151    1145     0.5    0.5
jsbench/typed/Euler.as            13826   12771    7.6    7.6
jsbench/typed/FFT.as              6162    5827     5.4    5.4
jsbench/typed/HeapSort.as         2191    2273    -3.7   -3.7
jsbench/typed/LUFact.as           11919   11446    4.0    4.0
jsbench/typed/Moldyn.as           5613    5306     5.5    5.5
jsbench/typed/RayTracer.as        1767    1720     2.7    2.7
jsbench/typed/Series.as           11929   10945    8.2    8.2
jsbench/typed/SOR.as              38646   34911    9.7    9.7
jsbench/typed/SparseMatmult.as    2730    2735    -0.2   -0.2

treilly-MacBookPro:performance treilly$ ./runtests.py --avm /tmp/avmshell-refactor --avm2 /tmp/avmshell-ms-move-to-block jsbench/
Tamarin tests started: 2010-08-20 11:57:01.101088
Executing 21 test(s)

avm: /tmp/avmshell-refactor
version: cyclone
avm2: /tmp/avmshell-ms-move-to-block
version: cyclone
iterations: 1

test                              avm:cyclone  avm2:cyclone  %diff
Metric: time
jsbench/Crypt.as                  5941    5913     0.5    0.5
jsbench/Euler.as                  11237   11067    1.5    1.5
jsbench/FFT.as                    11227   11062    1.5    1.5
jsbench/HeapSort.as               4491    4584    -2.1   -2.1
jsbench/LUFact.as                 11494   11484    0.1    0.1
jsbench/Moldyn.as                 15926   15286    4.0    4.0
jsbench/RayTracer.as              11475   11323    1.3    1.3
jsbench/Series.as                 12917   13211   -2.3   -2.3
jsbench/SOR.as                    51615   52021   -0.8   -0.8
jsbench/SparseMatmult.as          17837   17754    0.5    0.5
jsbench/typed/Crypt.as            1160    1157     0.3    0.3
^C Keyboard Interupt

treilly-MacBookPro:performance treilly$ ./runtests.py --avm /tmp/avmshell-fastbits --avm2 /tmp/avmshell-ms-move-to-block jsbench/typed/
Tamarin tests started: 2010-08-20 12:02:50.487311
Executing 10 test(s)

avm: /tmp/avmshell-fastbits
version: cyclone
avm2: /tmp/avmshell-ms-move-to-block
version: cyclone
iterations: 1

test                              avm:cyclone  avm2:cyclone  %diff
Metric: time
jsbench/typed/Crypt.as            1146    1158    -1.0   -1.0
jsbench/typed/Euler.as            12784   13723   -7.3   -7.3
jsbench/typed/FFT.as              5829    6014    -3.2   -3.2
jsbench/typed/HeapSort.as         2275    2234     1.8    1.8
jsbench/typed/LUFact.as           11503   11616   -1.0   -1.0
jsbench/typed/Moldyn.as           5345    5658    -5.9   -5.9
jsbench/typed/RayTracer.as        1731    1751    -1.2   -1.2
jsbench/typed/Series.as           10929   11654   -6.6   -6.6
jsbench/typed/SOR.as              35012   35275   -0.8   -0.8
jsbench/typed/SparseMatmult.as    2733    2733     0.0    0.0
Results show:
- an even greater speedup due to fastbits than Lars reported (10% in one case)
- negligible speedup from moving shift/multiply to the block
Attachment #455456 - Flags: review?(treilly) → review+
Comment on attachment 455456 [details] [diff] [review]
Implement MMGC_FASTBITS

We should consider opening a bug to replace the calls to log2 with an intrinsic where available.
Attachment #455456 - Flags: review?(fklockii) → review+
(In reply to comment #49)
> We should consider opening a bug to replace the calls to log2 with an
> intrinsic where available.

I would only recommend taking on the complexity of intrinsics for critical paths. That said, we do use log2n in avmplus::Hashtable, and if we wanted to push the impl down to the VMPI layer and share it, that would be nice.
(In reply to comment #48)
> Results show:
> - an even greater speedup due to fastbits than Lars reported (10% in one
>   case)
> - negligible speedup from moving shift/multiply to the block

Thanks for investigating!

The 44 (actually 48) bytes of overhead in the block seems like a lot, actually - it's nearly 12%. So it seems like it would be a good idea to investigate ways of reducing it. (For one thing, the items pointer can go away if we align the user objects with the start of the block and place the metainformation at the end of the block; it'll still be trivial to find it: block + 4K - sizeof(meta), for example. But I don't know yet what the consequences of removing 'items' will be elsewhere.) I'll create a work item.
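A sketch of that end-of-block layout, with hypothetical names (BlockMeta, kBlockSize); this is the idea, not what the code does today:

    #include <stdint.h>

    const uintptr_t kBlockSize = 4096;  // assumed 4K blocks

    struct BlockMeta
    {
        // mark bits pointer, allocator back-pointer, etc. would live here
        uint32_t placeholder;
    };

    // With user objects aligned to the block base and the
    // metainformation at the end, no per-block 'items' pointer is
    // needed; both addresses are computable from any object address:
    static BlockMeta* GetMeta(const void* obj)
    {
        uintptr_t base = uintptr_t(obj) & ~(kBlockSize - 1);
        return (BlockMeta*)(base + kBlockSize - sizeof(BlockMeta));
    }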
I think you mean 1.2%? It's double that for 64 bits.
(In reply to comment #52)
> I think you mean 1.2%? It's double that for 64 bits.

You're right - I'm off by a factor of 10 again. Cancel red alert.
Memory investigation for the fastbits change. This is jsbench:

                          avm:cyclone      avm2:cyclone
test                      best    avg      best    avg      %dBst    %dAvg
Metric: memory
Crypt.as                  42.9M   42.9M    42.9M   42.9M    0        0
Euler.as                  4.7M    4.4M     5.3M    4.4M     -12.77   -0.91
FFT.as                    31.3M   28.0M    30.6M   23.0M    2.24     17.93
HeapSort.as               4.7M    4.7M     4.7M    4.7M     0        0
LUFact.as                 32.0M   24.5M    32.0M   20.7M    0        15.50
Moldyn.as                 1.2M    962K     904K    899K     26.43    6.58 +
RayTracer.as              1.2M    1008K    1.1M    989K     8.33     1.95
SOR.as                    57.6M   55.4M    56.5M   56.4M    1.91     -1.70
Series.as                 11.2M   10.6M    11.2M   10.5M    0        0.75
SparseMatmult.as          13.9M   13.4M    13.9M   13.4M    0        0.15
typed/Crypt.as            34.7M   34.7M    34.7M   34.7M    0        0
typed/Euler.as            936K    889K     1.7M    1.2M     -85.98   -41.13 --
typed/FFT.as              468K    468K     468K    428K     0        8.55
typed/HeapSort.as         4.1M    4.1M     4.1M    4.1M     0        0
typed/LUFact.as           2.1M    2.1M     2.1M    2.1M     0        0.96
typed/Moldyn.as           1.3M    1.0M     908K    908K     31.79    14.56 ++
typed/RayTracer.as        368K    352K     360K    349K     2.17     0.91
typed/SOR.as              8.5M    8.3M     8.5M    8.3M     0        0
typed/Series.as           992K    924K     784K    760K     20.97    17.82 ++
typed/SparseMatmult.as    4.7M    4.5M     4.7M    4.7M     0        -4.44

The consistently worse results on Euler are worrisome; other than that it looks good. (The use of "best" in the headings is misleading: it is really "worst", or better, "largest". But that's a quibble.)
Memory investigation continues. This is v8.5:

                             avm:cyclone      avm2:cyclone
test                         best    avg      best    avg      %dBst    %dAvg
Metric: memory
js/crypto.as                 1020K   857K     1.1M    894K     -10.43   -4.35
js/deltablue.as              552K    496K     604K    553K     -9.42    -11.43
js/earley-boyer.as           4.6M    4.5M     4.6M    4.5M     0        0
js/raytrace.as               340K    340K     340K    340K     0        0
js/regexp.as                 1.4M    849K     896K    776K     37.50    8.70
js/richards.as               340K    340K     340K    340K     0        0
js/splay.as                  95.8M   95.8M    96.2M   96.2M    -0.42    -0.42
optimized/crypto.as          340K    340K     340K    340K     0        0
optimized/deltablue.as       964K    788K     904K    856K     6.22     -8.62
optimized/earley-boyer.as    4.6M    4.5M     4.7M    4.6M     -2.17    -3.11
optimized/raytrace.as        340K    340K     340K    340K     0        0
optimized/regexp.as          1.4M    838K     872K    801K     39.17    4.43
optimized/richards.as        340K    340K     340K    340K     0        0
optimized/splay.as           40.2M   40.2M    40.7M   40.7M    -1.24    -1.19 -
typed/crypto.as              664K    616K     732K    650K     -10.24   -5.58
typed/deltablue.as           996K    835K     1.0M    937K     -2.81    -12.26
typed/earley-boyer.as        4.8M    4.6M     4.6M    4.6M     4.17     0.43 +
typed/raytrace.as            340K    340K     340K    340K     0        0
typed/regexp.as              736K    670K     820K    768K     -11.41   -14.56
typed/richards.as            340K    340K     340K    340K     0        0
typed/splay.as               67.0M   67.0M    67.6M   67.6M    -0.90    -0.90
untyped/crypto.as            1.2M    867K     1.1M    770K     8.33     11.22
untyped/deltablue.as         1008K   880K     804K    694K     20.24    21.09 +
untyped/earley-boyer.as      4.5M    4.5M     4.7M    4.5M     -4.44    0.44 -
untyped/raytrace.as          1.2M    971K     1.3M    1.0M     -8.33    -5.65
untyped/regexp.as            796K    694K     760K    739K     4.52     -6.45
untyped/richards.as          340K    340K     340K    340K     0        0
untyped/splay.as             67.7M   67.7M    68.2M   68.2M    -0.74    -0.74

Apart from Earley-Boyer and Splay, the heap sizes here are all puny and the results are probably not terribly interesting. For those two benchmarks the new code is more or less a wash, memory-wise.
(In reply to comment #51)
> For one thing, the items pointer can go away if we align the user objects
> with the start of the block and place the metainformation at the end of the
> block; it'll still be trivial to find it: block + 4K - sizeof(meta), for
> example.

Another idea would be to remove items and replace it with an offset in GCAlloc, since items is always at the same offset for each size class, e.g.: items = b + b->alloc->itemsOffset.
On the mmgc tests, there were improvements on ofib and sfib but otherwise nothing to report.

Summarizing:
- there's a negative trend on v8.5
- there's a positive trend on jsbench
- there's a positive trend on mmgc
- we have one significant regression on jsbench, typed/Euler, but many big wins on the positive side

Generally the memory data are inconclusive. According to the comments in the code, the worst efficiency for the bits array is 50% (half the bytes are wasted). Suppose this happens for the smallest object size (8 bytes). Then mark bytes account for two bytes out of every ten, or 20% of the heap, not figuring other overhead. Suppose all the mark bits arrays previously fit on-block and now have to go off-block (not realistic). Then a 20% increase in space is the worst we should see anywhere, for a heap consisting entirely of 8-byte objects. Ergo the 41% growth in average heap size with fastbits relative to no fastbits is at least half due to sheer dynamics in the heap.
In an effort to get to a thumbs-up or -down on the fastbits work, I rebuilt the binaries from today's tip and reran jsbench/typed:

                   avm              avm2
test               best    avg      best    avg      %dBst   %dAvg
Metric: memory
Dir: jsbench/typed/
Crypt              34.7M   34.7M    34.7M   34.7M    0       0
Euler              1.4M    1012K    1.5M    976K     -7.1    3.5
FFT                468K    468K     468K    460K     0       1.5
HeapSort           4.1M    4.1M     4.1M    4.1M     0       0
LUFact             2.2M    2.2M     2.3M    2.2M     -4.5    0
Moldyn             964K    898K     1.3M    1.0M     -38.1   -14.2 -
RayTracer          348K    341K     340K    340K     2.3     0.5 +
SOR                8.5M    7.1M     8.5M    8.4M     0       -17.4
Series             1.2M    1.0M     1.2M    987K     0       7.3
SparseMatmult      4.7M    4.7M     4.7M    4.3M     0       8.5

Now Euler is a wash, and Moldyn, which was the poster child for the memory-reducing benefits of this patch, is suddenly the black sheep. I'm going to conclude that the memory numbers are subject to chaos effects and that there will probably be a slight real increase due to the increased use of memory by the bit vectors. However, that effect is not systematically visible. Furthermore, there are performance benefits documented a couple of different ways. I'll land the patch.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Flags: flashplayer-bug-