Closed Bug 584275 Opened 14 years ago Closed 14 years ago

nanojit: preparation for adding many more access regions

Categories

(Core Graveyard :: Nanojit, defect)

Platform: x86 macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)

Attachments

(4 files)

This patch adds some optimisations to avoid slow-downs caused by using more access regions.

- In CseFilter, only clear a hash table if it is non-empty. This avoids a good number of memset() calls.

- Add a function msbSet() which uses machine instructions to find the most-significant set bit in a uint32_t. I haven't yet tested msbSet() on Windows, but it's cribbed from jsbit.h in TM so hopefully it'll work. I'll put it through the try server shortly.

- Use msbSet() in CseFilter::insLoad() to optimise the invalidation loop, so there's only one iteration for each bit set in storesSinceLastLoad. Also use it to speed up compressAccSet(), along with the new isSingletonSet() function.

Also:

- Change the way embedder-specific values are passed into ValidateWriter -- use an array, so any number of values can be passed in.

- Add 'disp' to checkAccSet().
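For readers unfamiliar with the trick, here is a minimal sketch of the kind of bit-scan helper described above. The intrinsics chosen, the exact name and signature, and the non-zero precondition are assumptions for illustration; this is not the actual patch.

    // Sketch only: index of the most-significant set bit of a non-zero word.
    // Assumes <stdint.h>, plus <intrin.h> on MSVC for _BitScanReverse.
    static inline int msbSet(uint32_t x)    // precondition: x != 0
    {
    #if defined(_MSC_VER)
        unsigned long idx;
        _BitScanReverse(&idx, x);           // BSR instruction
        return (int)idx;
    #elif defined(__GNUC__)
        return 31 - __builtin_clz(x);       // count-leading-zeros intrinsic
    #else
        int r = 0;                          // portable fallback
        if (x & 0xffff0000) { r += 16; x >>= 16; }
        if (x & 0x0000ff00) { r += 8;  x >>= 8;  }
        if (x & 0x000000f0) { r += 4;  x >>= 4;  }
        if (x & 0x0000000c) { r += 2;  x >>= 2;  }
        if (x & 0x00000002) { r += 1; }
        return r;
    #endif
    }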
Attachment #462670 - Flags: review?(edwsmith)
Update TM for the NJ patches.
Attachment #462671 - Flags: review?(gal)
Blocks: 584279
Attachment #462671 - Flags: review?(gal) → review+
Cool -- msbSet()'s active ingredient is the same instruction as nRegisterAllocFromSet (more or less; BSF vs. BSR). I also noticed this patch's use of msbSet() would work equally well on architectures that only have an "lsb" equivalent instruction. See NativeARM.cpp for more crib-able code for ARM.

It occurs to me we could also use this to speed up iterating through various register sets. A common pattern is:

    for (Register r = FirstReg; r <= LastReg; r = nextreg(r))
        if (/* mask expression of rmask(r) */ && something)
            ...

It could be replaced by:

    RegisterMask mask = /* equivalent mask expression */;
    for (Register r = msbSet(mask); mask != 0; r = msbSet(mask)) {
        if (/* something */)
            ...
        mask &= ~rmask(r);
    }

I remember a while back you were looking at speeding up those iteration loops, but I don't know if you tried something based on the BSR/BSF (or equivalent) machine instruction.
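To make that loop shape concrete, here is a tiny self-contained illustration using plain integers instead of Nanojit's Register/RegisterMask types; msbSet32() and the other names below are illustrative stand-ins, not Nanojit code.

    // Illustration only: visit the set bits of a 32-bit mask from high to low,
    // mirroring the BSR-iteration pattern sketched above.
    #include <cstdint>
    #include <cstdio>

    static int msbSet32(uint32_t x) {       // precondition: x != 0
        int r = 0;
        while (x >>= 1) r++;                // slow but portable stand-in
        return r;
    }

    int main() {
        uint32_t mask = 0x8204;             // bits 15, 9 and 2 set
        while (mask != 0) {
            int r = msbSet32(mask);         // highest remaining bit
            printf("reg %d\n", r);          // prints 15, then 9, then 2
            mask &= ~(1u << r);             // analogous to mask &= ~rmask(r)
        }
        return 0;
    }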
Comment on attachment 462670 [details] [diff] [review]
NJ patch (against 48641:8cae6d5cd69c)

Looks correct. It would be good to factor out the bit-scanning code from each backend, but it's not something that must be done on the critical path.
Attachment #462670 - Flags: review?(edwsmith) → review+
(In reply to comment #2)
> It occurs to me we could also use this to speed up iterating through various
> register sets. A common pattern is:
>
>     for (Register r = FirstReg; r <= LastReg; r = nextreg(r))
>         if (/* mask expression of rmask(r) */ && something)
>             ...
>
> It could be replaced by:
>
>     RegisterMask mask = /* equivalent mask expression */;
>     for (Register r = msbSet(mask); mask != 0; r = msbSet(mask)) {
>         if (/* something */)
>             ...
>         mask &= ~rmask(r);
>     }
>
> I remember a while back you were looking at speeding up those iteration loops,
> but I don't know if you tried something based on the BSR/BSF (or equivalent)
> machine instruction.

I didn't, but it's a good idea for the ones where the pattern applies. There are still several of the form:

    for (Register r = FirstReg; r <= LastReg; r = nextreg(r)) {
        LIns* ins = _allocator.getActive(r);
        if (ins) { ... }
    }

Is there a fast way to do the same thing at the level of words? I guess we could have a bit-set alongside 'active' to enable bit-scanning, though that's less ideal.
Attachment #462941 - Flags: review?(edwsmith)
(In reply to comment #4)
> Is there a fast way to do the same thing at the level of words? I guess we
> could have a bit-set alongside 'active' to enable bit-scanning, though that's
> less ideal.

Two ideas:

1. active = ~free & managed, then use BSR/BSF. One ALU op to compute the active bitmask (free is maintained in RegAlloc, and "managed" is constant). The loop in unionRegisterState would look like this:

       mask = (~current.free | ~saved.free) & managed
       for (... bsf-iteration of mask)

2. Replace active[] with a Briggs/Torczon style sparse set: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.30.7319

   The sparse set maintains two arrays of [MaxReg]: sparse[] and dense[]. sparse[] is indexed by register, like _active is, but contains an index into dense[]. Add, get, and delete are O(1), and iteration is O(N), where N is the number of active entries (vs. a bitset, which is O(U)). Using BSF/BSR on a 1-word bitmask ought to be just as fast, though, since it can skip 1..32 bits in 1 machine cycle, making iteration O(N) instead of O(U). The sparse set starts to beat the bitmap when U/32 >> N, i.e. when the set is *so* big and sparse that we start losing the benefit of the 1-cycle BSF instruction and 32-bit parallel ALU ops.

Oh, and idea #3: use SSE instructions to slurp 2 or 4 words at a time, or more. This doesn't seem worth the bother.
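For completeness, here is a minimal sketch of the Briggs/Torczon-style sparse set described in idea #2. The class name and the MaxReg bound are illustrative assumptions, not Nanojit code.

    // Sketch of a Briggs/Torczon sparse set over register numbers 0..MaxReg-1.
    #include <stdint.h>

    static const uint32_t MaxReg = 32;      // assumed bound, for illustration

    class SparseRegSet {
        uint32_t dense[MaxReg];    // dense[0..n-1] holds the members, unordered
        uint32_t sparse[MaxReg];   // sparse[r] = index of r within dense[]
        uint32_t n;                // number of members
    public:
        SparseRegSet() : n(0) {}   // arrays may stay uninitialised (see contains)

        // No false positives even with garbage in sparse[]: a stale index either
        // falls outside dense[0..n-1] or points at a slot holding a different reg.
        bool contains(uint32_t r) const {
            return sparse[r] < n && dense[sparse[r]] == r;
        }
        void add(uint32_t r) {               // O(1)
            if (!contains(r)) {
                dense[n] = r;
                sparse[r] = n;
                n++;
            }
        }
        void remove(uint32_t r) {            // O(1): move last member into r's slot
            if (contains(r)) {
                uint32_t last = dense[--n];
                dense[sparse[r]] = last;
                sparse[last] = sparse[r];
            }
        }
        void clear()          { n = 0; }     // O(1), no memset
        uint32_t size() const { return n; }
        uint32_t member(uint32_t i) const { return dense[i]; } // i = 0..size()-1
    };

Iteration is just a walk over dense[0..size()-1], which is where the O(N) behaviour comes from.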
Attachment #462941 - Flags: review?(edwsmith) → review+
I prototyped idea #1 in Bug 584935; it's showing about a 5% speedup in pure JIT speed (a large input set of thousands of methods, no JIT-code execution) on x64. Expect that percentage to be diluted in the real world, of course.
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey
This snuck in because TR's implementation of Allocator::allocChunk() calls malloc(), while IIRC TM's calls calloc().
Attachment #464461 - Flags: review?(gal)
Attachment #464461 - Flags: review?(gal) → review+
Blocks: 587735
No longer blocks: 587735
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard