Closed Bug 688641 Opened 13 years ago Closed 13 years ago

Allocate GC chunks using helper thread.

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking


RESOLVED FIXED
mozilla10

People

(Reporter: igor, Assigned: igor)

References

Details

Attachments

(1 file, 5 obsolete files)

Using smaller GC chunks, as suggested in bug 671702 as a way to improve memory utilization, requires a faster way to allocate GC chunks. Otherwise, at least on OSX, we take too big a performance hit, as mmap is rather expensive on that platform. One idea is to use a background thread for chunk allocation, the same thread that currently releases GC chunks back to the OS. This way the chunks will be allocated and released by the same kernel thread, improving cache utilization on multi-core CPUs. Also, we can release empty chunks earlier and keep their number small, minimizing the JS heap memory usage. The question is whether this improves the overall allocation speed or whether the extra locking/signaling would offset any wins.
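A minimal self-contained sketch of this scheme, using C++11 threads as a stand-in for the NSPR primitives the engine actually uses; the names here (BackgroundChunkPool, kPoolTarget) are illustrative, not from the tree, and malloc/free stand in for the mmap/munmap calls:

    #include <condition_variable>
    #include <cstdlib>
    #include <mutex>
    #include <thread>
    #include <vector>

    static const size_t kChunkSize = 1 << 20; // 1 MiB GC chunks
    static const size_t kPoolTarget = 4;      // chunks to keep pre-allocated

    class BackgroundChunkPool {
        std::vector<void *> chunks_;
        std::mutex lock_;
        std::condition_variable wakeHelper_;
        bool shutdown_;
        std::thread helper_;

      public:
        BackgroundChunkPool() : shutdown_(false), helper_([this] { helperLoop(); }) {}

        ~BackgroundChunkPool() {
            {
                std::lock_guard<std::mutex> guard(lock_);
                shutdown_ = true;
            }
            wakeHelper_.notify_one();
            helper_.join();
            for (size_t i = 0; i < chunks_.size(); i++)
                free(chunks_[i]); // stand-in for releasing the chunk to the OS
        }

        // Called on the allocation thread. Never blocks on the helper: if the
        // pool is empty it falls back to allocating inline.
        void *getChunk() {
            void *chunk = 0;
            {
                std::lock_guard<std::mutex> guard(lock_);
                if (!chunks_.empty()) {
                    chunk = chunks_.back();
                    chunks_.pop_back();
                }
            }
            wakeHelper_.notify_one(); // ask the helper to replenish the pool
            return chunk ? chunk : malloc(kChunkSize); // stand-in for mmap
        }

      private:
        void helperLoop() {
            std::unique_lock<std::mutex> guard(lock_);
            for (;;) {
                // Replenish the pool, dropping the lock around the expensive
                // allocation so the allocation thread is never blocked on it.
                while (!shutdown_ && chunks_.size() < kPoolTarget) {
                    guard.unlock();
                    void *chunk = malloc(kChunkSize);
                    guard.lock();
                    if (!chunk)
                        break;
                    chunks_.push_back(chunk);
                }
                if (shutdown_)
                    return;
                wakeHelper_.wait(guard, [this] {
                    return shutdown_ || chunks_.size() < kPoolTarget;
                });
            }
        }
    };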
Attached patch v1 (obsolete) (deleted) — Splinter Review
The patch uses a small 4-chunk pool to allocate chunks in the background. When the number of chunks in the pool is 2 or less, the background thread starts and replenishes the pool. The pool effectively replaces the previous empty chunk list and the corresponding chunk aging code. On a fast 4-core i5-2400 3.1 GHz under Linux I have for http://v8.googlecode.com/svn/data/benchmarks/v6/run.html :

Before the patch      After the patch
Score: 7810           Score: 7941
Richards: 9825        Richards: 9962
DeltaBlue: 12258      DeltaBlue: 12562
Crypto: 16898         Crypto: 16730
RayTrace: 4387        RayTrace: 4427
EarleyBoyer: 9327     EarleyBoyer: 9305
RegExp: 1907          RegExp: 1917
Splay: 11164          Splay: 12044

This shows a 7% improvement in Splay. I will test this more.
When I test the patch in the js shell using js/src/v8/run.js, I observe the speedup only sporadically. That is, 3 of 4 runs show no change from the base in the total score or in Splay, but 1 run in 4 shows a 6-7% speedup.
On a dual-core AMD Athlon(tm) II Neo K325 1.3 GHz CPU (AMD's answer to the Intel Atom) under 32-bit Linux I see an over 10% speedup in Splay when run under the js shell:

Splay: 3928 -> Splay: 4572

The rest of the benchmarks are unaffected.
To test how the patch behaves on a single-core CPU I disabled one of the two cores on that Neo K325 1.3 GHz CPU. With the patch I see no change in the Splay benchmark; the score stays around 3400. However, DeltaBlue and RayTrace regress badly:

DeltaBlue: 3914 -> DeltaBlue: 3362 , an 18% slowdown
RayTrace: 1498 -> RayTrace: 1367 , a 10% slowdown
(In reply to Igor Bukanov from comment #4)
> To test how the patch behaves on a single-core CPU I disabled one of the two
> cores on that Neo K325 1.3 GHz CPU. With the patch I see no change in the
> Splay benchmark; the score stays around 3400. However, DeltaBlue and
> RayTrace regress badly:
>
> DeltaBlue: 3914 -> DeltaBlue: 3362 , an 18% slowdown
> RayTrace: 1498 -> RayTrace: 1367 , a 10% slowdown

So you do only allocation on the background thread, and if there is no chunk left we wait for the background thread to run?
(In reply to Gregor Wagner from comment #5)
> So you do only allocation on the background thread, and if there is no
> chunk left we wait for the background thread to run?

No - the patch never waits for the background thread. If it has not managed to get a chunk in time before the allocation thread emptied the pool, the allocation thread always allocates a chunk on its own. But perhaps this is wrong on a single-core CPU due to potentially bad contention when 2 threads try to mmap 1MB chunks simultaneously. So I will update the patch so the allocation thread waits if it sees that the background thread is trying to allocate a chunk. If that does not help with the regression, I will add a runtime switch to the patch to fall back to the old behavior. The switch can be activated on single-core CPUs.
Attached patch v2 (obsolete) (deleted) — Splinter Review
I was not able to recover the regression in v1 on a single-core CPU via changes in the scheduling policy. mmap has an inherent cost, and with one core the only way to hide it is to avoid mmap. So in v2 the GC initialization code queries the number of CPUs in the system and switches background allocation on only on multi-core systems. On single-core CPUs the patch continues to use the current chunk aging code.
Attachment #562127 - Attachment is obsolete: true
Attached patch v3 (obsolete) (deleted) — Splinter Review
According to http://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man3/sysconf.3.html sysconf(_SC_NPROCESSORS_ONLN) is available on OSX starting from 10.4, so there is no need for a Mac-specific version of GetCPUCount() that uses sysctl with CTL_HW/HW_AVAILCPU/HW_NCPU.
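A sketch of what a portable GetCPUCount() along these lines could look like; the Windows branch is an assumption added for completeness, not taken from the patch:

    #if defined(_WIN32)
    # include <windows.h>
    #else
    # include <unistd.h>
    #endif

    unsigned
    GetCPUCount()
    {
    #if defined(_WIN32)
        SYSTEM_INFO info;
        GetSystemInfo(&info);
        return unsigned(info.dwNumberOfProcessors);
    #else
        // Covers Linux and OSX 10.4+ alike, per the man page above.
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        return n > 0 ? unsigned(n) : 1;
    #endif
    }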
Attachment #562248 - Attachment is obsolete: true
Attached patch v4 (obsolete) (deleted) — Splinter Review
The patch implements background chunk allocation when the number of CPUs is at least 2. In that case, if the number of available empty chunks drops below ChunkPool::BACKGROND_CHUNK_MIN, the helper thread is asked to get more chunks. It does that until there are at least ChunkPool::BACKGROND_CHUNK_MAX chunks.

In the patch I have chosen 3 and 6 as the values for min/max after testing on that dual-core 1.3 GHz Neo netbook. On a 4-core desktop even 0 and 1 would work just as well. I suppose on that desktop there is always a core to spare to run the helper thread, so the thread always manages to allocate one more chunk before the last one is exhausted. But on the netbook a small queue is necessary to mitigate sporadic busy states of the system.

The allocation thread never waits for the background allocation to finish. If there are no empty chunks, it allocates a chunk on its own even if the helper thread is trying to do the same. This not only simplifies the code but also avoids introducing a bottleneck. On a busy system with many cores and many allocating threads, the helper thread may not be able to keep up with chunk demand. In that case any wait would slow down the system.

The GC does wait for the background allocation to finish. In principle there is no need to do that before the finalization starts; only at that stage is the chunk pool accessed by the GC thread without taking the GC lock. But doing the wait for the allocation together with the wait for the sweeping leads to simpler code.

Another place where the code waits for the background allocation to finish is when retrying after an OOM error. There I also added a call to empty the chunk pool. It should also help when the background allocation is disabled.

The drawback of the patch is that on my netbook the V8 scores for Splay and, to a lesser extent, DeltaBlue became more volatile, even though Splay improved by 15% on that system.
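As a compact sketch of the two-watermark policy just described; the predicate shapes are assumptions for illustration (the patch's actual constants are ChunkPool::BACKGROND_CHUNK_MIN and ChunkPool::BACKGROND_CHUNK_MAX):

    #include <cstddef>

    static const size_t BACKGROUND_CHUNK_MIN = 3; // wake the helper below this
    static const size_t BACKGROUND_CHUNK_MAX = 6; // helper refills up to this

    // Checked on the allocation thread after it takes an empty chunk.
    inline bool
    wantBackgroundAllocation(size_t emptyChunks, unsigned cpuCount)
    {
        return cpuCount >= 2 && emptyChunks < BACKGROUND_CHUNK_MIN;
    }

    // Checked on the helper thread after each chunk it adds to the pool.
    inline bool
    helperShouldContinue(size_t emptyChunks)
    {
        return emptyChunks < BACKGROUND_CHUNK_MAX;
    }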
Attachment #562253 - Attachment is obsolete: true
Attachment #562329 - Flags: review?(anygregor)
Comment on attachment 562329 [details] [diff] [review]
v4

This version of the patch allocates 6 chunks in the background even if the runtime only ever needs one, like the web workers runtime or a shell running a small script. I will fix this in a new patch.
Attachment #562329 - Attachment is obsolete: true
Attachment #562329 - Flags: review?(anygregor)
Attached patch v5 (obsolete) (deleted) — Splinter Review
Besides the problem with wasting too much memory, I observed regressions with v4 in benchmarks other than Splay. The Linux perf tools showed that this came from a significant increase in cache misses. The problem is that the patch aggressively released chunks during the GC. I assumed that the background allocation would allow chunks to be gotten back quickly, so there was no point in keeping those chunks around. However, chunk release and allocation invalidate caches, including the caches for the most recently freed chunk. That was enough to slow things down. It is possible to fix this by releasing the least recently used chunk, but for now I decided to take a simpler approach and keep the chunk aging code as is.

The new patch also starts the background allocation only when there are no empty chunks. It turned out that even my 1.3GHz dual-core Neo netbook can get a new chunk in time for a new allocation, even on a relatively busy system, as long as it does not race in the kernel with another thread doing mmap (another problem in v4).

The new patch preserves on the netbook that 15% win in V8 Splay and a small but noticeable win in DeltaBlue. On a desktop with a 3.1 GHz 4-core Intel i5-2400 CPU the win is within the noise. Again, the perf tool shows that on the netbook the CPU utilization with the patch increased from 107% to 130%, while on the desktop it changed from 101% to 101.5%. Apparently on the desktop mmap is fast enough not to worry about its performance implications.
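For contrast with the v4 watermarks, the v5 trigger described above reduces to something like this; the exact condition is an assumption pieced together from this comment and the wantBackgroundAllocation fragment quoted in the review below:

    #include <cstddef>

    // Start background allocation only once the empty-chunk pool is
    // exhausted, and only for heaps that already hold a few chunks (the
    // count >= 4 test is visible in the review excerpt below).
    inline bool
    wantBackgroundAllocation(size_t emptyChunks, size_t heapChunks)
    {
        return emptyChunks == 0 && heapChunks >= 4;
    }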
Keywords: footprint
Attached patch v6 (deleted) — Splinter Review
The new version adds reporting on the ratio of chunks allocated on the background thread vs. the allocation thread. Bill, could you review this, as Gregor does not have time?
Attachment #562746 - Attachment is obsolete: true
Attachment #563031 - Flags: review?(wmccloskey)
Comment on attachment 563031 [details] [diff] [review]
v6

Review of attachment 563031 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, Igor. I saw a speedup on splay under Linux with this patch.

::: js/src/jscntxt.cpp
@@ +1489,5 @@
>  JS_FRIEND_API(void *)
>  JSRuntime::onOutOfMemory(void *p, size_t nbytes, JSContext *cx)
>  {
> +    /*
> +     * Retry when we done with the background sweeping and stopped all the

"when we done" -> "when we are done"
"and stopped" -> "and have stopped"

::: js/src/jsgc.cpp
@@ +428,5 @@
> +           rt->gcChunkSet.count() >= 4;
> +}
> +#endif
> +
> +/* Must be called with the GC lock taken. */

Can we assert this? (And in other places?)

@@ +455,5 @@
> +#ifdef JS_THREADSAFE
> +    /*
> +     * We schedule the background allocation after a potential allocate call
> +     * above to avoid getting chunks from two threads in parallel and races in
> +     * the OS.

Are you saying here that you don't want to start the background allocation earlier because then two threads may try to allocate a chunk simultaneously, which would increase lock contention in the OS?

@@ +2084,5 @@
>
> +inline JSRuntime *
> +GCHelperThread::runtime()
> +{
> +    return reinterpret_cast<JSRuntime *>(reinterpret_cast<uintptr_t>(this) - offsetof(JSRuntime, gcHelperThread));

Uh, is this really necessary? If you want to eliminate the argument, why not just add a field to store the runtime?

@@ +2160,5 @@
> +        if (rt->gcChunkPool.wantBackgroundAllocation(rt))
> +            break;
> +    }
> +    if (state == ALLOCATING)
> +        state = IDLE;

Could you try to simplify the control-flow here? Something like this seems better to me:

  case ALLOCATING:
    do {
        Chunk *chunk;
        {
            AutoUnlockGC unlock(rt);
            chunk = Chunk::allocate();
        }
        if (!chunk)
            break;
        rt->gcChunkPool.put(rt, chunk);
    } while (state == ALLOCATING && rt->gcChunkPool.wantBackgroundAllocation(rt));
    if (state == ALLOCATING)
        state = IDLE;

@@ +2213,5 @@
> +}
> +
> +/* Must be called with the GC lock taken. */
> +inline void
> +GCHelperThread::startBackgroundAllocationIfIdle() {

Brace should be on a new line.

@@ +2250,5 @@
> +     * Expire the chunks released during the GC so they will be available to
> +     * the rest of the system immediately.
> +     */
> +    JSRuntime *rt = runtime();
> +    rt->gcChunkPool.expire(rt, shouldShrink());

I understand why you want to release early. However, if this is a shrinking GC, isn't it possible that more chunks will be freed below and we will fail to free them immediately? Could you put a second expiration after the backgroundFinalize calls?

Also, is it important that this expiration happens before the unlock? Does it reduce contention or something?

@@ -2723,5 @@
>
>      {
> -#ifdef JS_THREADSAFE
> -        rt->gcHelperThread.waitBackgroundSweepEnd(rt);
> -#endif

This call never made any sense to me. Do you know why it was there in the first place?

::: js/src/jslock.h
@@ +197,5 @@
>
>  #define JS_ATOMIC_SET_MASK(w, mask) js_AtomicSetMask(w, mask)
>  #define JS_ATOMIC_CLEAR_MASK(w, mask) js_AtomicClearMask(w, mask)
>
> +extern unsigned

Two spaces here.
Attachment #563031 - Flags: review?(wmccloskey) → review+
(In reply to Bill McCloskey (:billm) from comment #13)
> > +/* Must be called with the GC lock taken. */
>
> Can we assert this? (And in other places?)

I will file a separate bug about that.

> > +inline JSRuntime *
> > +GCHelperThread::runtime()
> > +{
> > +    return reinterpret_cast<JSRuntime *>(reinterpret_cast<uintptr_t>(this) - offsetof(JSRuntime, gcHelperThread));
>
> Uh, is this really necessary? If you want to eliminate the argument, why not
> just add a field to store the runtime?

Then it would be necessary to add #ifdef JS_THREADSAFE to the JSRuntime constructor around gcHelperThread(this). That is even uglier IMO, so I went with that offsetof hack, which also eliminates an extra load.

> > +    rt->gcChunkPool.expire(rt, shouldShrink());
>
> I understand why you want to release early. However, if this is a shrinking
> GC, isn't it possible that more chunks will be freed below and we will fail
> to free them immediately? Could you put a second expiration after the
> backgroundFinalize calls?

During the background sweep the empty chunks are released immediately, so there is no need for the second expire call.

> Also, is it important that this expiration happens before the unlock? Does
> it reduce contention or something?

I should file another bug about this. To reduce the risk of regressions I have not changed the present logic of releasing chunks under the lock.

> > @@ -2723,5 @@
> >
> >      {
> > -#ifdef JS_THREADSAFE
> > -        rt->gcHelperThread.waitBackgroundSweepEnd(rt);
> > -#endif
>
> This call never made any sense to me. Do you know why it was there in the
> first place?

I do not know.
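The "offsetof hack" under discussion is the classic container-of pattern: recover the enclosing object from a pointer to one of its members instead of storing a back-pointer. A stand-alone illustration with hypothetical types, not the patch code:

    #include <cstddef>
    #include <cstdint>

    struct Runtime;

    struct HelperThread {
        int state;
        Runtime *runtime(); // no stored back-pointer needed
    };

    struct Runtime {
        int gcBytes;
        HelperThread gcHelperThread;
    };

    inline Runtime *
    HelperThread::runtime()
    {
        // The helper is embedded in Runtime at a fixed offset, so
        // subtracting that offset from 'this' yields the owning Runtime.
        return reinterpret_cast<Runtime *>(
            reinterpret_cast<uintptr_t>(this) - offsetof(Runtime, gcHelperThread));
    }

As the next comment notes, the landed version stores the JSRuntime * in a field instead, trading one word of memory for simpler code that does not depend on the object layout.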
http://hg.mozilla.org/integration/mozilla-inbound/rev/f8fa25d9551c - landed with a JSRuntime * stored as an extra field to avoid the offsetof hack above. With bug 647390 fixed, adding the gcHelperThread(rt) initializer to JSRuntime is no longer ugly.
(In reply to Igor Bukanov from comment #15)
> http://hg.mozilla.org/integration/mozilla-inbound/rev/f8fa25d9551c - landed
> with a JSRuntime * stored as an extra field to avoid the offsetof hack
> above. With bug 647390 fixed, adding the gcHelperThread(rt) initializer to
> JSRuntime is no longer ugly.

It looks like GCHelperThread::runtime() is still there.
(In reply to Bill McCloskey (:billm) from comment #16)
> (In reply to Igor Bukanov from comment #15)
> > http://hg.mozilla.org/integration/mozilla-inbound/rev/f8fa25d9551c - landed
> > with a JSRuntime * stored as an extra field to avoid the offsetof hack
> > above. With bug 647390 fixed, adding the gcHelperThread(rt) initializer to
> > JSRuntime is no longer ugly.
>
> It looks like GCHelperThread::runtime() is still there.

Sorry, I forgot to run hg refresh before exporting the patch. I will land the extra changes separately.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla10
Depends on: 694883