Closed Bug 638660 Opened 14 years ago Closed 3 years ago

Use background thread for parallel marking.

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking


RESOLVED INACTIVE
blocking-kilimanjaro -

People

(Reporter: gwagner, Unassigned)

References

(Depends on 1 open bug)

Details

(Whiteboard: [Snappy:P3])

Attachments

(1 file, 10 obsolete files)

I implemented a first draft of parallel marking. The design uses two GCMarkers: one for the main thread that performs the GC and one for the background thread. Each GCMarker has a deque that acts as a marking stack: a GCMarker pushes and pops from one end, and the other GCMarker can steal objects from the opposite end. The main thread executes MarkRuntime and initially only pushes roots onto the deque, while the background thread starts stealing objects from the other side. Once all initial roots are pushed, the main thread starts marking as well.
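A rough sketch of the two-ended deque described above: the owning GCMarker pushes and pops at the bottom, while the other GCMarker steals from the top with a CAS. All names here (MarkDeque, the fixed capacity) are illustrative, not the patch's actual code, and pop() is deliberately left without the extra CAS race against steal(), so it is only safe when no thief is active — exactly the subtlety discussed later in this bug.

```cpp
#include <atomic>
#include <cstddef>
#include <cassert>

// Illustrative work-stealing deque sketch; not the patch's real code.
template <size_t N>
class MarkDeque {
    void *items[N];
    std::atomic<size_t> top{0};     // steal end
    std::atomic<size_t> bottom{0};  // owner end

  public:
    bool push(void *v) {            // owner thread only
        size_t b = bottom.load(std::memory_order_relaxed);
        if (b - top.load(std::memory_order_acquire) >= N)
            return false;           // full; caller must drain first
        items[b % N] = v;
        bottom.store(b + 1, std::memory_order_release);
        return true;
    }
    void *pop() {                   // owner thread only; simplified:
        // the full ABP algorithm needs a CAS race with steal() here,
        // otherwise pop() and steal() can return the same element.
        size_t b = bottom.load(std::memory_order_relaxed);
        if (b == top.load(std::memory_order_acquire))
            return nullptr;         // empty
        bottom.store(b - 1, std::memory_order_seq_cst);
        return items[(b - 1) % N];
    }
    void *steal() {                 // the other GCMarker
        size_t t = top.load(std::memory_order_acquire);
        if (t >= bottom.load(std::memory_order_acquire))
            return nullptr;         // empty
        void *v = items[t % N];
        // CAS claims the slot; on failure another pop/steal won the race.
        if (!top.compare_exchange_strong(t, t + 1))
            return nullptr;
        return v;
    }
};
```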
Assignee: general → anygregor
Attached patch very first draft (obsolete) (deleted) — Splinter Review
This is an early draft with lots of debug code in it. I tried the realtime raytracer with the new approach and the marking time goes down from 200ms to 85-90ms! A typical GC looks like this (times in ms, Total / Marking / Sweeping):
Trunk: 300.3 / 205.0 / 94.5
Patch: 180.5 / 85.1 / 94.6
Very promising results!
Depends on: 616666
Attached patch patch (obsolete) (deleted) — Splinter Review
Updated version based on the new marking stack. The marking time reduction is around 20-40% across the board. Since we don't spend much time sweeping any more, the marking time reduction equals the whole GC pause-time reduction.
Attachment #517021 - Attachment is obsolete: true
Attached patch patch (obsolete) (deleted) — Splinter Review
Tryserver is green. The gcHelperThread can steal work from the main mark stack. Large objects are split into half-sized largeMarkStackItems and pushed onto the largeMarkingStack, so the gcHelperThread can now steal the second half. This is definitely important for the realtime raytracer since it only has one large object with about 1 million children. I also changed OSAtomicCompareAndSwapPtrBarrier on Mac to __sync_bool_compare_and_swap because it is about 10% faster.
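The large-object splitting described above can be sketched as follows: an object's children are represented as a [start, end) pointer range, and a push splits the range in half so the helper thread can steal the remainder. LargeMarkItem, SplitLargeItem, and the chunk size are illustrative assumptions, not the patch's actual identifiers.

```cpp
#include <cstddef>
#include <cassert>

// Illustrative stand-in for a range of a large object's children.
struct LargeMarkItem {
    void **start;
    void **end;
};

static const size_t LARGE_OBJECT_CHUNK_SIZE = 64;  // assumed threshold

// Split an item into a half to scan now and a half left for stealing.
// Returns true if the item was split; the stealable half is written
// back into `item`, the half this thread scans goes into `toScan`.
inline bool
SplitLargeItem(LargeMarkItem &item, LargeMarkItem &toScan)
{
    size_t len = item.end - item.start;
    if (len <= LARGE_OBJECT_CHUNK_SIZE)
        return false;               // small enough: scan it whole
    void **mid = item.start + len / 2;
    toScan = { item.start, mid };   // scanned by this thread
    item = { mid, item.end };       // remains on the stack, stealable
    return true;
}
```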
Attachment #527426 - Attachment is obsolete: true
Attached patch patch (obsolete) (deleted) — Splinter Review
update
Attachment #528326 - Attachment is obsolete: true
Attached patch patch (obsolete) (deleted) — Splinter Review
Attachment #528697 - Attachment is obsolete: true
Attachment #528857 - Flags: review?(gal)
Attached patch patch (obsolete) (deleted) — Splinter Review
Don't use CAS for single-threaded marking.
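The idea of skipping CAS for single-threaded marking can be sketched like this: the mark bitmap only needs atomic updates while the helper thread is marking in parallel, so a flag selects between a plain read-modify-write and a CAS loop. The function name and the `parallel` flag are assumptions for illustration; the real bitmap code differs.

```cpp
#include <atomic>
#include <cstdint>
#include <cassert>

// Returns true if this caller set the bit (i.e. "won" the mark).
inline bool
MarkBit(std::atomic<uintptr_t> *word, uintptr_t mask, bool parallel)
{
    if (!parallel) {
        // Single-threaded phase: a plain read-modify-write suffices.
        uintptr_t w = word->load(std::memory_order_relaxed);
        if (w & mask)
            return false;           // already marked
        word->store(w | mask, std::memory_order_relaxed);
        return true;
    }
    // Parallel phase: CAS so two markers never both "win" the bit.
    uintptr_t old = word->load(std::memory_order_relaxed);
    do {
        if (old & mask)
            return false;
    } while (!word->compare_exchange_weak(old, old | mask));
    return true;
}
```

This matches the later discussion in the bug about disabling the CAS path on single-core machines, where the CAS would be pure overhead.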
Attachment #528857 - Attachment is obsolete: true
Attachment #528857 - Flags: review?(gal)
Attachment #528986 - Flags: review?(gal)
Attached patch patch (obsolete) (deleted) — Splinter Review
merge
Attachment #528986 - Attachment is obsolete: true
Attachment #528986 - Flags: review?(gal)
Attachment #529235 - Flags: review?(gal)
It looks like the usage of rt->marking has a problem similar to bug 652416. That is, it assumes that !rt->marking in GCHelperThread::doMark implies that it can query the state of variables written on the GC thread. This is unsound even on x86.
Other problems: delayMarkingChildren and drainMarkStack() are not modified for safe parallel marking. I suppose the safest solution would be to abandon parallel marking if we need delayed marking. GCMarker::helpOtherMarker asserts that tmp->isMarked(); I am not sure that we can do that. I assume that, as part of the queue implementation, some memory barriers would be issued, but I do not see that those would be enough to guarantee the read ordering here. I guess we can just remove the assert and add comments here.
The queue algorithm assumes that the maximum queue size on a 32-bit CPU is 2^21, so the remaining 11 bits can be used to implement unique tagging. I am not sure that this is enough to ensure that the tag never wraps in practice. So what about changing the bit split to use just 16 bits for the queue size, as we do not use bigger queues?
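The top-word tagging discussed above packs the queue index and a wrap-around tag into one word so a single CAS updates both — the classic defense against ABA. Below is a hedged sketch of one possible encoding using the suggested 16-bit index; the TaggedTop name follows the patch, but the exact bit layout and when the tag is bumped are assumptions.

```cpp
#include <cstdint>
#include <cassert>

static const unsigned INDEX_BITS = 16;  // suggested split; patch used 21/11
static const uintptr_t INDEX_MASK = (uintptr_t(1) << INDEX_BITS) - 1;

struct TaggedTop {
    uintptr_t word;

    uintptr_t index() const { return word & INDEX_MASK; }
    uintptr_t tag() const { return word >> INDEX_BITS; }

    // Advancing the index also bumps the tag, so a stale CAS from
    // another thread cannot succeed after the index wraps (ABA).
    TaggedTop next() const {
        return TaggedTop{ ((tag() + 1) << INDEX_BITS) |
                          ((index() + 1) & INDEX_MASK) };
    }
};
```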
The patch adds a second signal to the background thread, intended to signal the start of the marking phase. As implemented, that could lead to a deadlock in GCHelperThread::finish as that thread sends the wakeup signal. If the background thread is waiting for the marking signal at that moment, it may never wake up. This could be fixed by keeping just one signal and using an enum to describe the reason for the wakeup, like MARK, SWEEP, SHUTDOWN.
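The single-signal protocol proposed above can be sketched as one condition variable plus an enum recording why the helper was woken. This is a minimal sketch using C++ std primitives standing in for the NSPR PR_Lock/PR_CondVar the engine actually uses; all names are illustrative.

```cpp
#include <condition_variable>
#include <mutex>
#include <cassert>

enum class HelperWakeup { NONE, MARK, SWEEP, SHUTDOWN };

class HelperThreadState {
    std::mutex lock;
    std::condition_variable wakeup;
    HelperWakeup reason = HelperWakeup::NONE;

  public:
    // Main thread: record the reason and wake the helper. Because the
    // reason survives until consumed, a signal sent before the helper
    // blocks is not lost -- the deadlock described above cannot occur.
    void signal(HelperWakeup why) {
        std::lock_guard<std::mutex> guard(lock);
        reason = why;
        wakeup.notify_one();
    }
    // Helper thread loop: block until a reason is posted, consume it.
    HelperWakeup wait() {
        std::unique_lock<std::mutex> guard(lock);
        wakeup.wait(guard, [this] { return reason != HelperWakeup::NONE; });
        HelperWakeup why = reason;
        reason = HelperWakeup::NONE;
        return why;
    }
};
```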
(In reply to comment #8)
> It looks like the usage of rt->marking has a problem similar to bug 652416.
> That is, it assumes that !rt->marking in GCHelperThread::doMark implies that
> it can query the state of variables written on the GC thread. This is unsound
> even on x86.

Thx for looking over it! rt->marking should be surrounded by locks (start/end BackgroundMark). What control flow do you see where it doesn't work?

For delayMarkingChildren and drainMarkStack(): I see what you mean; we can definitely use a lock there and don't need to make it more complicated.

Reordering in the CPU can lead to a situation where both GCMarkers get the same object and therefore call ScanObject on it, as you mentioned in an email, but I don't see a problem there.
(In reply to comment #12)
> rt->marking should be surrounded by locks (start/end BackgroundMark). What
> control flow do you see where it doesn't work?

Hm, I do not see how rt->marking is protected by the locks. In all cases AFAICS it is read/modified outside the lock. Fortunately the ordering here does not matter, as the usage of it in doMark does not depend on it. But it is rather unclear what the purpose of that field is. Is it to ensure that the background marking does not stop prematurely? That is, the code prefers that thread to busy-loop if the main thread has not managed to generate enough work, right? In any case, the patch needs comments explaining this.

> For delayMarkingChildren and drainMarkStack(): I see what you mean; we can
> definitely use a lock there and don't need to make it more complicated.

Yes, the lock seems to be the simplest solution.

> Reordering in the CPU can lead to a situation where both GCMarkers get the
> same object and therefore call ScanObject on it, as you mentioned in an
> email, but I don't see a problem there.

The problem is with any code that uses IS_GC_MARKING_TRACER to modify data structures during the real marking. Currently these do not tolerate parallel marking. So we either fix the queue implementation to ensure that two threads cannot pop the same object, or fix the users of IS_GC_MARKING_TRACER.

BTW, regarding the implementation, do you know if the use of the deque in V8 in fact tolerates popping of the same object? Also it would be interesting to know whether, at least on x86, it is impossible for a thread A to read an older value of some variable that was updated by compare-and-swap on thread B.
http://www.rdrop.com/users/paulmck/scalability/paper/whymb.2009.04.05a.pdf (table on page 16) mentions that on x86 atomic operations and loads/stores are never reordered, but I do not see whether this applies across CPUs, or whether it means that an atomic operation on x86 acts as a read-write barrier only for the CPU that issues it and has no influence on other CPUs.
For reference, according to the Intel manual, http://download.intel.com/design/processor/manuals/253668.pdf , page 329:

  Reads may be reordered with older writes to different locations but not with older writes to the same location.

The example on page 334 illustrates that explicitly. Hence the Deque implementation from the patch is not safe even on x86. In popBottom() the assignment "bot = currBot;" can happen after the code reads oldTaggedTop.getTop() in "currBot > oldTaggedTop.getTop()". If that condition is true and, after this read but before the reordered assignment "bot = currBot;", another CPU is in popTop(), then for it the condition "currBot <= oldTaggedTop.getTop()" will be false, so it proceeds with the compare-and-swap to update the top. That will succeed, as popBottom() does not change oldTaggedTop in the fast pop case. This results in popBottom() and popTop() returning the same element.
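The fix this analysis implies is a StoreLoad barrier: popBottom() must make its store to the bottom index globally visible before it reads the top. A hedged fragment (variable names are illustrative, not the patch's):

```cpp
#include <atomic>
#include <cstddef>
#include <cassert>

std::atomic<size_t> bottomIdx{0}, topIdx{0};

// Prologue of a popBottom(): publish the decremented bottom, then read
// the top to decide whether the element is contested.
size_t popBottomPrologue()
{
    size_t b = bottomIdx.load(std::memory_order_relaxed) - 1;
    bottomIdx.store(b, std::memory_order_relaxed);
    // Without a full fence here, x86 may satisfy the load of topIdx
    // below before other CPUs observe the store above -- exactly the
    // popBottom()/popTop() double-pop race described in this comment.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return topIdx.load(std::memory_order_relaxed);  // compare with b
}
```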
(In reply to comment #14)
> Hence the Deque implementation from the patch is not safe even on x86. In
> popBottom() the assignment "bot = currBot;" can happen after the code reads
> oldTaggedTop.getTop() in "currBot > oldTaggedTop.getTop()". [...] This
> results in popBottom() and popTop() returning the same element.

We don't ensure exclusive pushing either, so we can always end up with a reference to the same object in both GCMarkers. We might want to add this for large objects, but I don't see this as a real problem now.
(In reply to comment #15)
> We don't ensure exclusive pushing either, so we can always end up with a
> reference to the same object in both GCMarkers. We might want to add this
> for large objects, but I don't see this as a real problem now.

See comment 13 above for why scanning the same object from two threads can be bad. Now, AFAICS, among the JS_IsGCMarkingTracer/IS_GC_MARKING_TRACER users only 2 places are not safe: js_TraceXML and WeakMap::mark. The rest just set a flag.
(In reply to comment #16)
> See comment 13 above for why scanning the same object from two threads can
> be bad. Now, AFAICS, among the JS_IsGCMarkingTracer/IS_GC_MARKING_TRACER
> users only 2 places are not safe: js_TraceXML and WeakMap::mark. The rest
> just set a flag.

I think we also need to worry about shape-regenerating GCs. The code for these happens in ScanShape/ScanObject, where we already know that IS_GC_MARKING_TRACER holds.
(In reply to comment #17)
> I think we also need to worry about shape-regenerating GCs. The code for
> these happens in ScanShape/ScanObject, where we already know that
> IS_GC_MARKING_TRACER holds.

Right now I don't do parallel marking for shape-regenerating GCs.
Do we have any data regarding the patch's impact on single-core CPUs without or with hyperthreading, like recent Atoms? Also, IIRC on ARM CompareAndSwap is relatively expensive, and using it for the marking bitmap may offset the gains from going to a 2-core CPU. I think it is important to collect those data now, as we may need an API that not only disables the background thread but also switches the marking bitmap to use the marking code without CompareAndSwap.
(In reply to comment #19)
> Do we have any data regarding the patch's impact on single-core CPUs without
> or with hyperthreading, like recent Atoms? [...] we may need an API that not
> only disables the background thread but also switches the marking bitmap to
> use the marking code without CompareAndSwap.

I would love to test it on different settings but I don't have any equipment for that. The CAS implementation/library is also an important performance factor. Is mobile currently using the background thread?
(In reply to comment #20)
> I would love to test it on different settings but I don't have any equipment
> for that.

OK, I will test this on a single-core Atom and on a dual-core Neo K345 where I can disable the second core. I suppose it would be sufficient to measure the marking-phase timing with some synthetic benchmarks.

> Is mobile currently using the background thread?

Background finalization benefits even a single-core CPU, as it reduces the GC pause even there at the price of slower performance later, and I saw the benefits on an Atom CPU clearly last year. This is very different from parallel marking, which would definitely make things slower on a single-core CPU. It may even harm hyperthreading, as the latter effectively reduces per-thread cache by 50%. For these reasons I assume we would need an API that selectively disables parallel marking without affecting background finalization.
(In reply to comment #21)
> OK, I will test this on a single-core Atom and on a dual-core Neo K345 where
> I can disable the second core. [...] For these reasons I assume we would
> need an API that selectively disables parallel marking without affecting
> background finalization.

I think we should focus on the 20-40% improvement for the more common architectures and not spend too much time measuring synthetic benchmarks on rare or outdated environments. Let's fix the problems we have here and then we can do the performance tests on real web workloads and other benchmarks. Avoiding regressions for special environments shouldn't be a problem once we are aware of them.
(In reply to comment #22)
> I think we should focus on the 20-40% improvement for the more common
> architectures and not spend too much time measuring synthetic benchmarks on
> rare or outdated environments.

Netbook shipments were at least 10% of all PCs for the last quarter. Typically they do not have a dual-core CPU.
Dual-core Atom netbooks have been on sale for 3 quarters, and typically have a small price premium (~$50 last year, maybe less now). I don't think we should be designing for single-core anywhere, especially since mobile is headed that way too. We should spend the time on other things, like helping with the generational GC work, rather than on this or on sub-percent gains on benchmarks.
I tested with my new netbook (single Atom core with hyperthreading). I don't see much difference for the v8 benchmark in the browser. I usually get better scores with this patch, but the GCTimer results are noisy.
(In reply to comment #25)
> I tested with my new netbook (single Atom core with hyperthreading).
> I don't see much difference for the v8 benchmark in the browser. I usually
> get better scores with this patch, but the GCTimer results are noisy.

Could you disable hyperthreading in the BIOS and try with that?
(In reply to comment #26)
> Could you disable hyperthreading in the BIOS and try with that?

There is no option to disable it. Wiki says that all Atom cores come with HT now.
Attached patch patch (obsolete) (deleted) — Splinter Review
Update with most of the comments addressed. I added locks to the delayed marking and XML marking, removed the second cond var, and added an enum that describes the state. rt->marking is used by the main marker to signal the background thread that no more marking work has to be done; afterwards the main marker waits for the background marker to finish. It's a 10% performance increase for the web-based v8 benchmark on my single-core netbook.
Attachment #529235 - Attachment is obsolete: true
Attachment #529235 - Flags: review?(gal)
Comment on attachment 532109 [details] [diff] [review]
patch

Review of attachment 532109 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.h
@@ +380,2 @@
> *word |= mask;
> +#endif

The non-black color is used only when marking the extra XPCOM roots, and the patch does not mark them in parallel. So we do not need compare-and-swap after the color != BLACK check. Also, why does the code use while + goto rather than if + goto?
(In reply to comment #29)
> The non-black color is used only when marking the extra XPCOM roots, and the
> patch does not mark them in parallel. So we do not need compare-and-swap
> after the color != BLACK check. Also, why does the code use while + goto
> rather than if + goto?

That's true. I will update. Another problem is in JSString::mark(JSTracer *trc) for rope strings. Is it safe to mark them in parallel? It's also possible that we mark the same strings in both threads. I guess I also need a CAS for the TAG and UNTAG, right? Or is it enough to just remove the asserts in there? Any rope-string experts here?
Hey, for when this lands, what's the plan to be able to pref it off? Ideally there would be at least a compile-time switch controlling whether we do parallel marking.
(In reply to comment #31)
> Hey, for when this lands, what's the plan to be able to pref it off? Ideally
> there would be at least a compile-time switch controlling whether we do
> parallel marking.

There is a patch in bug 653488 that enables/disables the GCHelperThread, but maybe we want something more fine-grained.
I was talking to Luke about concurrent marking of rope strings today. We can't use the current implementation because of race conditions. There are two solutions: 1) change it to stack-based marking, or 2) add a mark stack for ropes that is processed by only one thread but that both threads can insert into.
Depends on: 658041
Attached patch patch (obsolete) (deleted) — Splinter Review
Update based on bug 658041.
Attachment #532109 - Attachment is obsolete: true
Attachment #535269 - Flags: review?(igor)
(In reply to comment #34)
> Created attachment 535269 [details] [diff] [review]
> patch

The Deque from the patch is thread-unsafe, but the chance that we will see the bug is low, so we would not have good coverage for parallel scanning of the same objects. So we need either to fix this, or to remove from the implementation any compare-and-swap and related complexity and just say that any object must allow parallel scanning and any marking code must not use any state.

As for the fix, the queue implementation can try to use an extra compare-and-swap to ensure consistency. Alternatively we can consider implementing a thread-safe queue. For example, the queue can be implemented as a vector of blocks of, say, 64 elements. Another thread can only take a full block of elements. This way the thread that owns the queue would need to take the lock only once per 64 elements.
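The block-granularity queue suggested here could look roughly like the sketch below: work is handed off in fixed blocks of 64 elements, so the owning thread takes the lock only once per block instead of once per element. The class and method names are assumptions; a real version would hold GC things rather than void pointers.

```cpp
#include <mutex>
#include <vector>
#include <cstddef>
#include <cassert>

class BlockQueue {
  public:
    static const size_t BLOCK = 64;

  private:
    std::mutex lock;
    std::vector<std::vector<void *>> blocks;  // full blocks, stealable
    std::vector<void *> current;              // owner-private tail

  public:
    void push(void *v) {            // owner thread only
        current.push_back(v);
        if (current.size() == BLOCK) {
            // Lock taken once per 64 pushes, not once per element.
            std::lock_guard<std::mutex> g(lock);
            blocks.push_back(std::move(current));
            current.clear();
        }
    }
    // Either thread: take a whole 64-element block, or nothing.
    bool stealBlock(std::vector<void *> &out) {
        std::lock_guard<std::mutex> g(lock);
        if (blocks.empty())
            return false;
        out = std::move(blocks.back());
        blocks.pop_back();
        return true;
    }
};
```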
(In reply to comment #35)
> The Deque from the patch is thread-unsafe, but the chance that we will see
> the bug is low, so we would not have good coverage for parallel scanning of
> the same objects. [...]

So you mean we need test coverage for scanning objects in parallel? I don't want to make the common case more complex. Somebody could write a testcase where locking in marking XML objects would hurt, but I haven't seen any regression on real web workload or benchmarks so far. Locking in state-changing code looks fine to me, but we can try to get rid of it if it's really a problem.
Hm, the implementations of the JSClass::trace hooks in XPC definitely use state. For example, XPC_WN_Helper_Trace, http://mxr.mozilla.org/mozilla-central/source/js/src/xpconnect/src/xpcwrappednativejsops.cpp#1099 , calls GetWrappedNativeOfJSObject, uses reference counting, and may even allocate objects. How are we going to ensure that it is safe to do parallel marking of XPConnect things? Perhaps it would be simpler to make sure that they are marked only from the main thread?
(In reply to comment #36)
> So you mean we need test coverage for scanning objects in parallel?

I mean that we either:

Ensure that we never scan the same object in parallel from two threads,

or

Explicitly state that we are going to do so and make the Deque implementation as fast as possible by removing the compare-and-swap instructions there. This way at least we get better coverage from testing for any potential threading bugs and do not pay a performance price for pseudo-safety.
(In reply to comment #37)
> Hm, the implementations of the JSClass::trace hooks in XPC definitely use
> state. [...] Perhaps it would be simpler to make sure that they are marked
> only from the main thread?

Ok, that looks pretty scary, and it doesn't look like we can fix XPConnect easily.
Maybe we can use another color to make sure that we leave the JS engine only once for an object?
(In reply to comment #40)
> Maybe we can use another color to make sure that we leave the JS engine only
> once for an object?

That requires one more compare-and-swap operation when taking the object from the queue. But then one could use compare-and-swap to set to NULL the object taken from the queue; IMO that would be simpler. In any case, that would not solve the XPConnect problem. The code there expects that it will be called from the main thread and simply fails if not, in XPCCallContext::Init. So we must fix that before we can proceed.
Ok, let me see if I understand this piece of code correctly. We create an XPCCallContext in nsXPConnect::Collect() and set some thread-private data that we rely on during clasp->trace() callbacks? This results in a situation where we have to call the tracing callback from the main thread?
(In reply to comment #42)
> Ok, let me see if I understand this piece of code correctly.
> We create an XPCCallContext in nsXPConnect::Collect() and set some
> thread-private data that we rely on during clasp->trace() callbacks?

XPCCallContext::Init requires that the current thread has a valid context associated with it in its private data. We should really rewrite this to avoid any XPCCallContext instances, but I have no idea what it would involve. mrbkap should know better. Blake, could you comment on why nsXPConnect::GetJSObjectOfWrapper needs an XPCCallContext? It is not used in the function.
Andreas said it's ok to call the callback on another thread as long as the main thread is either stopped or also marking.
(In reply to comment #44)
> Andreas said it's ok to call the callback

Which callback?
(In reply to comment #45)
> > Andreas said it's ok to call the callback
>
> Which callback?

Oh, I mean the mark functions in XPConnect. No callbacks...
I don't know how useful this information is but the try-server is green with this patch. We definitely call the mark functions in XPConnect also from the background thread. So it doesn't fail right away.
Igor is right, we can't safely make XPCCallContexts on random threads, but as long as we just call the mark function and the main thread is stalled (or marking as well) while we do so, we should be good. Igor, do you agree?
(In reply to comment #48)
> We can't safely make XPCCallContexts on random threads, but as long as we
> just call the mark function and the main thread is stalled (or marking as
> well) while we do so, we should be good.

We need extra work - see comment 43.
(In reply to comment #47)
> We definitely call the mark functions in XPConnect also from the background
> thread. So it doesn't fail right away.

Effectively the code in XPConnect returns early, AFAICS, without marking anything. As typically the stuff that should be marked under those functions is reached via multiple paths, we would need very specific test cases to trigger the problem. In any case the code there is a mess. We should not be calling all those functions just to call JS_TRACE on GC things stored in those wrappers. So let's fix that in a different bug and then land this one with the knowledge that parallel tracing is really possible in all cases.

On a different subject, dom/src/threads contains asserts that insist on the main thread during marking. Those should be fixed. Searching mxr.mozilla.org for JSTracer shows those places.
GetJSObjectOfWrapper should not take a ccx. Creating a ccx during GC is kinda bad anyway since we might be in an OOM squeeze. Igor, want to take a look (or I can do it Tuesday).
Depends on: 660430
(In reply to comment #51)
> GetJSObjectOfWrapper should not take a ccx. Creating a ccx during GC is
> kinda bad anyway since we might be in an OOM squeeze. Igor, want to take a
> look (or I can do it Tuesday).

I filed bug 660430 about this. I'll try to look at it over the weekend.
Depends on: 660783
It seems like most XPConnect issues are resolved. Thanks for the fast fix, Igor! Is there anything left on the XPConnect side?
(In reply to comment #53)
> Is there anything left on the XPConnect side?

I hope not, but it is easy to miss stuff there.
Blocks: 505308
Attached patch patch (obsolete) (deleted) — Splinter Review
Merging again. There was a problem that popping the same LargeMarkItem from one deque caused a double free. I introduced popBottomLocked and popTopLocked, which use a GC lock, only for the largeMarkStack.
Attachment #535269 - Attachment is obsolete: true
Attachment #535269 - Flags: review?(igor)
Comment on attachment 538418 [details] [diff] [review] patch Still the same performance wins. Between 20% and 45% overall GC pause time reduction.
Attachment #538418 - Flags: review?(igor)
Depends on: 653488
I will review this on Tuesday or Wednesday, when I am back from some traveling.
Comment on attachment 538418 [details] [diff] [review]
patch

Review of attachment 538418 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.h
@@ +1267,5 @@
> + * Multiprocessors" by Arora et al.
> + * The implementation was inspired by Vyacheslav Egorov from the Chromium
> + * project.
> + */
> +#ifdef JS_THREADSAFE

As I wrote, this is NOT a thread-safe deque even on x86. So update the comments accordingly, stating that we tolerate races, and simplify the implementation by at least removing the TaggedTop class and the encoding/decoding logic.

::: js/src/jsgcmark.cpp
@@ +691,1 @@
> }

pushBottom does not use synchronization. As such, it is not guaranteed that a write to item->start will propagate to another core when another thread calls popTopLocked for the item on the stack. On x86 this does not happen, as there it is guaranteed that a write is not reordered with any locked operation, and popTopLocked does use locks. But on ARM this is possible, and when it happens we would get a double free.

Suggestion: as we use locks in any case, implement a separate class with an interface:

LargeItemWork {
    bool add(const LargeItem &item);
    bool get(LargeItem &item); /* Any better name? */
    bool getFromNonOwnerThread(LargeItem &item);
}

A non-owner thread can only call getFromNonOwnerThread. Initially all methods take a lock, and "get" can be implemented by calling getFromNonOwnerThread. But we can optimize "add" and even "get" for x86 to take advantage of its stronger coherency guarantees. The "get" method internally, under the lock, checks for start + LARGE_OBJECT_CHUNK_SIZE <= end and either pops the item from the stack or updates the stack top. The method returns the range to scan so the caller just scans [start, end).
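A minimal lock-based implementation of the LargeItemWork interface sketched in the review could look like this. The LargeItem fields and method names follow the review text; everything else (std::mutex standing in for the GC lock, the vector storage, the chunk logic being omitted) is an assumption for illustration.

```cpp
#include <mutex>
#include <vector>
#include <cassert>

struct LargeItem {
    void **start;
    void **end;
};

class LargeItemWork {
    std::mutex lock;                  // stand-in for the GC lock
    std::vector<LargeItem> items;

  public:
    bool add(const LargeItem &item) {
        std::lock_guard<std::mutex> g(lock);
        items.push_back(item);
        return true;
    }
    bool getFromNonOwnerThread(LargeItem &item) {
        std::lock_guard<std::mutex> g(lock);
        if (items.empty())
            return false;
        item = items.back();
        items.pop_back();
        return true;
    }
    // Initially shares the locked path; the owner could later
    // specialize this for x86's stronger coherency, as the review
    // suggests (and add the LARGE_OBJECT_CHUNK_SIZE splitting there).
    bool get(LargeItem &item) { return getFromNonOwnerThread(item); }
};
```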
Attachment #538418 - Flags: review?(igor) → review-
Attached patch patch (deleted) — Splinter Review
I fixed the comment and the largeMarkStack issue; it always uses a lock now. So this should be a working version with the same 20-40% performance wins. I don't have time for the other MarkStack changes right now because I am not in MV over the summer. We can either take this and optimize it later, wait until fall and I will finish it, or somebody else can steal it.
Attachment #538418 - Attachment is obsolete: true
(In reply to comment #59)
> Created attachment 540193 [details] [diff] [review]
> patch

On Monday during the day, European time, I will try to test the patch's impact on a laptop where I can disable hyperthreading and the second core. If the patch does not noticeably regress such a configuration, then I will do a review for nits etc. so we can land this. If the regression is substantial, we would need a workaround to enable parallel marking via a preference and/or only in debug builds. This way we can still land this and test for bugs while later implementing code that detects single-core, non-hyperthreading CPUs.
Thanks Igor! A strange crash happened during a try run: http://tinderbox.mozilla.org/showlog.cgi?log=Try/1308371835.1308373055.21282.gz#err0 The crashing thread is not involved with the GC, but the crash happened during a GC.
That try run may not be a crash, but a hang. According to bug 563012 comment 5, that's what crashinjectdll.dll is for: it kills a hung test. It's odd that thread 0 was doing a PR_Unlock call, though. Could there be some kind of livelock?
We don't use work-stealing for XML objects, so maybe we can avoid some locking. Igor, are JSXMLArrays like xml_namespaces shared between XML objects?
(In reply to comment #63)
> We don't use work-stealing for XML objects, so maybe we can avoid some
> locking.

True!

> Igor, are JSXMLArrays like xml_namespaces shared between XML objects?

The arrays that are shrunk during the GC are not shared.
Comment on attachment 540193 [details] [diff] [review]
patch

Review of attachment 540193 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsweakmap.cpp
@@ +246,5 @@
> if (IS_GC_MARKING_TRACER(trc) && map->empty()) {
>     trc->context->delete_(map);
>     obj->setPrivate(NULL);
> } else {
>     map->trace(trc);

This will hold the lock when marking/tracing a non-empty map, leading to a deadlock.

Gal: could you comment on why it is necessary to delete an empty map during the marking phase of the GC, and not when the code removes all elements from it? It would be nice if we could replace the body of the method with:

if (ObjectValueMap *map = GetObjectMap(obj))
    map->trace(trc);
That was meant to be just a shortcut: if it's empty, delete it. We can delete it later; that should be fine as long as you are sure that code gets called when the map is empty.
Comment on attachment 540193 [details] [diff] [review]
patch

Review of attachment 540193 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jsgc.cpp
@@ +1869,5 @@
>
>     if (IS_GC_MARKING_TRACER(trc)) {
>         GCMarker *gcmarker = static_cast<GCMarker *>(trc);
>         MarkWeakReferences(gcmarker);
>     }

Here is another bug: we should mark the weak references after stopping parallel marking, or we can miss some weak maps. Also, we need an extra lock at http://hg.mozilla.org/tracemonkey/file/0428dbdf3d58/js/src/jsweakmap.h#l108, which was recently changed to use IS_MARKING_TRACER.
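The ordering bug comment 66 points out can be shown with a small standalone model. This is only an illustration of the principle, not SpiderMonkey code: a weak map entry's value is kept alive only if its key is already marked, so scanning the map while entries are still sitting on (or being stolen from) a mark stack misses live values. All names here are hypothetical.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Toy model: objects are ints; the weak map keeps value 9 alive only
// while key 2 is marked. Draining the mark stack stands in for joining
// the parallel markers. If the weak scan runs before draining, key 2 is
// not yet marked and value 9 is wrongly dropped.
using MarkedSet = std::set<int>;

MarkedSet runGC(bool weakScanTooEarly) {
    std::vector<int> markStack = {1, 2, 3};  // roots not yet processed
    std::map<int, int> weakMap = {{2, 9}};   // key 2 -> value 9
    MarkedSet marked;

    auto scanWeak = [&] {
        for (auto& [key, value] : weakMap)
            if (marked.count(key))           // key live => mark value
                marked.insert(value);
    };

    if (weakScanTooEarly)
        scanWeak();                          // races with ongoing marking
    while (!markStack.empty()) {             // drain all remaining work
        marked.insert(markStack.back());
        markStack.pop_back();
    }
    if (!weakScanTooEarly)
        scanWeak();                          // correct: after marking done
    return marked;
}
```

With the correct ordering, `runGC(false)` marks value 9; with the premature scan, `runGC(true)` loses it even though key 2 ends up marked.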
Depends on: 666549
What happened to this bug? The blocking bugs all seem to be fixed, and the last one claims to be landed after this.
(In reply to comment #68)
> What happened to this bug? The blocking bugs all seem to be fixed, and the
> last one claims to be landed after this.

The patch should be 97% ready, but I don't have the time to finish it now. I already tested it on the try server, various benchmarks, and real workloads on different platforms, but that was about 2 months ago. The results were pretty good so far! Another issue is that we currently have some GC instrumentation patches in the engine that make it impossible to perform the last tunings here. I also don't want to make big changes during crash investigation. Finding GC crashes is _very_ important and Bill should take his time there. Maybe I will rebase the patch for FF9, but otherwise I will wait until October to finish it.
Another paper describing parallel marking: "Parallel garbage collection for shared memory multiprocessors", http://www.math.tau.ac.il/~shanir/publications/dfsz2001.pdf
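The two-ended mark stack this bug describes (the owning GCMarker pushes and pops at one end while the helper thread steals from the other) can be sketched roughly as below. This is a deliberately simplified, mutex-guarded illustration of the idea, not the lock-free compare-and-swap version the patch uses, and none of the names (`MarkDeque`, `Cell`) are SpiderMonkey APIs.

```cpp
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical stand-in for a GC thing on the mark stack.
struct Cell { int id; };

// Owner pops LIFO from the back (good cache locality on fresh work);
// the stealing thread takes FIFO from the front (oldest entries, which
// tend to root the largest remaining subgraphs).
class MarkDeque {
    std::deque<Cell*> items;
    std::mutex lock;
public:
    void push(Cell* c) {                 // owner end
        std::lock_guard<std::mutex> g(lock);
        items.push_back(c);
    }
    std::optional<Cell*> pop() {         // owner end
        std::lock_guard<std::mutex> g(lock);
        if (items.empty()) return std::nullopt;
        Cell* c = items.back();
        items.pop_back();
        return c;
    }
    std::optional<Cell*> steal() {       // thief end
        std::lock_guard<std::mutex> g(lock);
        if (items.empty()) return std::nullopt;
        Cell* c = items.front();
        items.pop_front();
        return c;
    }
};
```

In the scheme the bug outlines, the main thread would push roots via `push()` while the background marker loops on `steal()`; once MarkRuntime finishes pushing, the main thread drains its own end with `pop()`.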
(In reply to Gregor Wagner from comment #69)
> Maybe I will rebase it for FF9 but otherwise I will wait until October to
> finish it.

So is this finished now?
(In reply to Cyko_01 from comment #71)
> (In reply to Gregor Wagner from comment #69)
> > Maybe I will rebase it for FF9 but otherwise I will wait until October to
> > finish it.
>
> so is this finished now?

Nope. Working on some other stuff now :(
Whiteboard: [Snappy]
Snappy:P3 because incremental GC will probably solve most GC-related snappiness problems on sites without a really heavy GC load.
Whiteboard: [Snappy] → [Snappy:P3]
blocking-kilimanjaro: --- → ?
David, why do you think this should block K9O, particularly given comment 73?
(In reply to Justin Lebar [:jlebar] from comment #74)
> David, why do you think this should block K9O, particularly given comment 73?

I don't positively think this should block k9o. I just went out and nominated every major piece of GC-related work I could remember or easily find, and this came up. But, as you may have guessed, I didn't review comment 73 :-). It's a good observation and I'm not sure what the answer is yet. IGC will probably get us most of the way there, but I think that if we were left with 16ms pauses or so, cutting them in half could still be of value in reducing frame skips.
I don't completely agree with comment 73 :) On my low-end netbook I see a fast and a slow mode with IGC enabled. For example, without IGC I see a 100 msec pause, and with IGC I see a 500 msec period where animations are very slow. So I don't think we can say everything is solved there. Right now I also see a huge pause at the end, where we probably reset the GC because we run OOM, but I hope that's just a tuning issue. Also, on my regular laptop I see a huge memory regression when IGC is enabled. Cutting the time we spend in IGC in half (or even more with more threads) might help with the memory regression as well.
What sites are you talking about, Gregor?
(In reply to Bill McCloskey (:billm) from comment #77)
> What sites are you talking about, Gregor?

Is there a dedicated bug for tracking IGC performance? For the realtime raytracer I see a memory consumption of 770MB with IGC disabled vs. 1.3GB with it enabled on my MBP. The animation is definitely smoother with IGC enabled and no slow part is noticeable. On my low-end netbook the animation during IGC becomes very slow. The same goes for the spinning balls demo from V8: there I see a few seconds of high frame rate and then a second or so of slow frame rate.
blocking-kilimanjaro: ? → -
taking as per irc discussion with billm.
Assignee: anygregor → tschneidereit+bmo
I might or might not work on this in the short to medium term. For now, I'm putting it back into the pond so as not to deter anyone else from working on it.
Assignee: tschneidereit+bmo → general
Is this bug still valid?
Assignee: general → nobody
Blocks: 1262321

We're not currently planning on implementing parallel marking. Some work on concurrent marking was happening under bug 1693542.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INACTIVE