Closed Bug 538702 Opened 15 years ago Closed 13 years ago

Thread synchronization for conservative stack scanning

Categories

(Core :: JavaScript Engine, defect)

x86
All
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: gwagner, Assigned: gwagner)

Details

Attachments

(1 file, 8 obsolete files)

Since the the debugging process for bug 516832 became too complicated because of stack-scan crashes and synchronization crashes, I removed all the stack-scan code. This bug is just for the thread synchronization.
Assignee: general → anygregor
Attached patch WIP (obsolete) (deleted) — Splinter Review
Work in progress. Works on Mac, random xpcshell crashes on Win.
Attached patch patch (obsolete) (deleted) — Splinter Review
Passes tryserver for all platforms. Thread synchronization is enabled in DEBUG mode.
Attachment #420847 - Attachment is obsolete: true
Attachment #420900 - Flags: review?(gal)
Comment on attachment 420900 [details] [diff] [review] patch >+#if defined(XP_UNIX) && !defined(__APPLE__) >+static struct timespec onemillisec = {0, 1000000L}; >+#endif OneMilliSec. Also maybe move closer to where do you need it. >+#if defined(XP_UNIX) && !defined(__APPLE__) >+ >+static pthread_key_t key; >+static pthread_once_t key_once = PTHREAD_ONCE_INIT; >+ SuspendSignalHandler >+static void suspendSignalHandler(int signo, siginfo_t *info, void *extra) >+{ >+ JSThread *thread = (JSThread*)pthread_getspecific(key); >+ JS_ASSERT(thread != 0); >+ JS_ASSERT((thread->suspended & JSTHREAD_SUSPENDED) == 0); >+ >+ thread->suspended |= JSTHREAD_SUSPENDED; >+ pthread_cond_signal(&thread->suspendResumeCV); >+ while (thread->suspended & JSTHREAD_SUSPENDED) >+ { >+ pthread_yield(); >+ } Don't over-brace single line consequence. >+ thread->suspended |= JSTHREAD_RESUMED; >+ pthread_cond_signal(&thread->suspendResumeCV); >+} >+ >+static void >+initGCSupport() InitGCSupport, but thats a cheesy name too. InstallSuspendSignalHandler? >+{ >+ struct sigaction sigactUsr2; >+ >+ sigactUsr2.sa_flags = SA_SIGINFO; >+ sigactUsr2.sa_sigaction = suspendSignalHandler; >+ sigaction (SIGUSR2, &sigactUsr2, NULL); >+} Use short names inside functions, sigactUsr2 is too long, just sa or something. >+ >+static void >+makeKey() >+{ >+ (void) pthread_key_create(&key, NULL); >+} Better name? And MakeKey. >+ >+#endif >+ > static JSThread * > NewThread(jsword id) > { > JS_ASSERT(js_CurrentThreadId() == id); > JSThread *thread = (JSThread *) js_calloc(sizeof(JSThread)); > if (!thread) > return NULL; > JS_INIT_CLIST(&thread->contextList); > thread->id = id; >+#ifdef DEBUG >+# ifdef __APPLE__ >+ thread->handle = pthread_mach_thread_np(pthread_self()); >+# elif WIN32 >+ HANDLE realHandle = 0; "handle" ? >+ DuplicateHandle(GetCurrentProcess(), GetCurrentThread(), >+ GetCurrentProcess(), &realHandle, 0, TRUE, DUPLICATE_SAME_ACCESS); Align GetCurrentProcesses here, indent further. >+ thread->handle = HANDLE2WORD(realHandle); >+# elif defined(XP_UNIX) && !defined(__APPLE__) >+ thread->handle = pthread_self(); >+ (void) pthread_once(&key_once, makeKey); >+ (void) pthread_setspecific(key, thread); >+ initGCSupport(); >+ pthread_mutex_init(&thread->suspendResumeMutex,NULL); >+ pthread_cond_init(&thread->suspendResumeCV,NULL); >+# else >+# error Unsupported platform. >+# endif >+#endif > thread->data.init(); > return thread; > } > > static void > DestroyThread(JSThread *thread) > { > /* The thread must have zero contexts. */ > JS_ASSERT(JS_CLIST_IS_EMPTY(&thread->contextList)); > JS_ASSERT(!thread->titleToShare); > thread->data.finish(); >+#ifdef DEBUG >+# ifdef WIN32 >+ CloseHandle(WORD2HANDLE(thread->handle)); >+# endif >+#endif > js_free(thread); > } > >+#if defined(XP_UNIX) && !defined(__APPLE__) >+typedef struct user_regs_struct user_regs_struct; >+#endif >+ >+void >+JSThread::suspend() >+{ >+#ifdef __APPLE__ >+ thread_suspend(mach_port_t(handle)); >+#elif WIN32 >+ SuspendThread(WORD2HANDLE(handle)); >+#elif defined(XP_UNIX) && !defined(__APPLE__) >+ JS_ASSERT((suspended & JSTHREAD_SUSPENDED) == 0); >+ pthread_kill(handle, SIGUSR2); >+ suspendTest(); Test? >+#else >+ #error Unsupported platform. >+#endif >+} >+ >+void >+JSThread::resume() >+{ >+#ifdef __APPLE__ >+ thread_resume(mach_port_t(handle)); >+#elif WIN32 >+ ResumeThread(WORD2HANDLE(handle)); >+#elif defined(XP_UNIX) && !defined(__APPLE__) >+ JS_ASSERT(suspended & JSTHREAD_SUSPENDED); >+ suspended &= ~JSTHREAD_SUSPENDED; >+ resumeTest(); Test? >+#else >+#error Unsupported platform. >+#endif >+} >+ >+#if defined(XP_UNIX) && !defined(__APPLE__) >+void >+JSThread::suspendTest() >+{ >+ pthread_mutex_lock(&suspendResumeMutex); >+ while ((suspended & JSTHREAD_SUSPENDED) == 0) >+ { Don't overbrace. Even if, it would be in the line one up. >+ pthread_cond_timedwait(&suspendResumeCV, &suspendResumeMutex, &onemillisec); >+ } >+ pthread_mutex_unlock(&suspendResumeMutex); >+} >+ >+void >+JSThread::resumeTest() Do you mean resumeWait? >+{ >+ pthread_mutex_lock(&suspendResumeMutex); >+ while ((suspended & JSTHREAD_RESUMED) == 0) >+ { >+ pthread_cond_timedwait(&suspendResumeCV, &suspendResumeMutex, &onemillisec); >+ } >+ pthread_mutex_unlock(&suspendResumeMutex); >+ suspended &= ~JSTHREAD_RESUMED; >+} >+#endif >+ > JSThread * > js_CurrentThread(JSRuntime *rt) > { > jsword id = js_CurrentThreadId(); > JS_LOCK_GC(rt); > > /* > * We must not race with a GC that accesses cx->thread for JSContext >@@ -303,17 +467,29 @@ thread_purger(JSDHashTable *table, JSDHa > return JS_DHASH_NEXT; > } > > static JSDHashOperator > thread_marker(JSDHashTable *table, JSDHashEntryHdr *hdr, uint32 /* index */, > void *arg) > { > JSThread *thread = ((JSThreadsHashEntry *) hdr)->thread; >- thread->data.mark((JSTracer *) arg); >+ JSTracer *trc = (JSTracer *)arg; >+ if (!trc->context->runtime->gcReleaseRoots) { >+#ifdef DEBUG >+ if (trc->context->thread != thread) >+ thread->suspend(); >+#endif >+ thread->data.mark(trc); >+#ifdef DEBUG >+ if (trc->context->thread != thread) >+ thread->resume(); >+#endif >+ } >+ > return JS_DHASH_NEXT; > } > > #endif /* JS_THREADSAFE */ > > JSThreadData * > js_CurrentThreadData(JSRuntime *rt) > { >diff -r d0ad088ca59f js/src/jscntxt.h >--- a/js/src/jscntxt.h Fri Jan 08 18:16:40 2010 -0800 >+++ b/js/src/jscntxt.h Sat Jan 09 09:43:00 2010 -0800 >@@ -430,35 +430,51 @@ struct JSThreadData { > void finish(); > void mark(JSTracer *trc); > void purge(JSContext *cx); > void purgeGCFreeLists(); > }; > > #ifdef JS_THREADSAFE > >+#define JSTHREAD_SUSPENDED 0x40 /* thread has been suspended */ >+#define JSTHREAD_RESUMED 0x80 /* thread has been resumed */ >+ I tink we can move the defines into the CPP file and make const int? > /* > * Structure uniquely representing a thread. It holds thread-private data > * that can be accessed without a global lock. > */ > struct JSThread { > /* Linked list of all contexts in use on this thread. */ > JSCList contextList; > > /* Opaque thread-id, from NSPR's PR_GetCurrentThread(). */ > jsword id; >+ jsword handle; >+ volatile jsuword suspended; >+#if defined(XP_UNIX) && !defined(__APPLE__) >+ pthread_mutex_t suspendResumeMutex; >+ pthread_cond_t suspendResumeCV; Seriously, indentation =) >+#endif > > /* Indicates that the thread is waiting in ClaimTitle from jslock.cpp. */ > JSTitle *titleToShare; > > /* > * Thread-local version of JSRuntime.gcMallocBytes to avoid taking > * locks on each JS_malloc. > */ > ptrdiff_t gcThreadMallocBytes; >+ >+ void suspend(); >+ void resume(); >+#if defined(XP_UNIX) && !defined(__APPLE__) >+ void resumeTest(); >+ void suspendTest(); >+#endif > > /* > * Deallocator task for this thread. > */ > JSFreePointerListTask *deallocatorTask; > > /* Factored out of JSThread for !JS_THREADSAFE embedding in JSRuntime. */ > JSThreadData data; >@@ -567,16 +583,17 @@ struct JSRuntime { > uint32 gcEmptyArenaPoolLifespan; > uint32 gcLevel; > uint32 gcNumber; > JSTracer *gcMarkingTracer; > uint32 gcTriggerFactor; > size_t gcTriggerBytes; > volatile JSBool gcIsNeeded; > volatile JSBool gcFlushCodeCaches; >+ volatile JSBool gcReleaseRoots; > > /* > * NB: do not pack another flag here by claiming gcPadding unless the new > * flag is written only by the GC thread. Atomic updates to packed bytes > * are not guaranteed, so stores issued by one thread may be lost due to > * unsynchronized read-modify-write cycles on other threads. > */ > JSPackedBool gcPoke; >diff -r d0ad088ca59f js/src/jsgc.cpp >--- a/js/src/jsgc.cpp Fri Jan 08 18:16:40 2010 -0800 >+++ b/js/src/jsgc.cpp Sat Jan 09 09:43:00 2010 -0800 >@@ -3100,16 +3100,19 @@ js_GC(JSContext *cx, JSGCInvocationKind > #endif > > /* > * Mark phase. > */ > JS_TRACER_INIT(&trc, cx, NULL); > rt->gcMarkingTracer = &trc; > JS_ASSERT(IS_GC_MARKING_TRACER(&trc)); >+ >+ if (gckind == GC_LAST_CONTEXT) >+ rt->gcReleaseRoots = true; > > #ifdef DEBUG > for (a = rt->gcDoubleArenaList.head; a; a = a->prev) > JS_ASSERT(!a->hasMarkedDoubles); > #endif > > js_TraceRuntime(&trc, keepAtoms); > js_MarkScriptFilenames(rt, keepAtoms);
Looks pretty cool otherwise. Can you do the changes and then we should probably consult with igor.
Attached patch patch (obsolete) (deleted) — Splinter Review
Attachment #420900 - Attachment is obsolete: true
Attachment #420900 - Flags: review?(gal)
My concern with the current patch is that it may suspend a thread that is inside the GC lock. Then, if any marking code would directly or indirectly call code that calls JS_LOCK_GC (which is allowed by the current GC model), we would get a deadlock. A similar problem may happen with XPConnect which have a few locks on its own that IIRC is also taken during the marking phase of the GC. Also, the problem with the original approach (that just checked threads inside JS requests) was due to JS_SuspendRequest. The stack for such threads has to be scanned yet they are not suspended. But such threads must not produce more GC roots. Moreover, an all relevant platforms AFAIK OS never releases CPU pages allocated for the stack until the thread finishes. Thus taking the stack pointer at an arbitrary moment and scanning for roots the space between the thread base and that pointer is safe.
Igor, any ideas how to address the locking/deadlock issue?
(In reply to comment #7) > Igor, any ideas how to address the locking/deadlock issue? The question is why we need to suspend a thread to mark its stack? If the thread is running during the GC, then it is either outside the request, when it should not have any roots, or it called JS_SuspendRequest. In the latter case the thread cannot generate new GC roots to scan or it would violate the JS request model. Thus, for such thread, its SP can be snapshot at an arbitrary moment, rounded to the nearest CPU page boundary and the space between the stack base and that SP is scanned. This is safe since OS never release native stack CPU pages and only that space may contain any GC roots. Thus at worst the root scanner would need to check unused pages of the native stack. It would not miss any roots.
If the above idea of not suspending threads is unsound, the the deadlock can be avoided if the GC would not mark GC things discovered on the native stack. Rather it would set them aside so the marking can be done later after the GC resume all the native threads. That "set aside" can be trivially implemented using the same mechanism that the GC uses to avoid using too much native stack during recursive scanning of the GC thing graph.
I was also thinking of a "marking queue" or stack as an optimization for conservative stack scanning but I guess we need it earlier. We should discuss this in the stack scanning bug or even in a new bug. Are there any other problems with just the thread synchronization part?
Attached patch patch (obsolete) (deleted) — Splinter Review
fix mutex and condvar destruction.
Attachment #420916 - Attachment is obsolete: true
Attached patch patch (obsolete) (deleted) — Splinter Review
refresh
Attachment #421527 - Attachment is obsolete: true
Attached patch update (deleted) — Splinter Review
Attachment #428291 - Attachment is obsolete: true
Passes tryserver with unrelated? orange on Win unit test. http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1266977080.1266989310.32680.gz
Attachment #428596 - Flags: review?(igor)
Comment on attachment 428596 [details] [diff] [review] update >diff -r 1b5ca6cc5ce8 js/src/jscntxt.cpp >+static void >+SuspendSignalHandler(int signo, siginfo_t *info, void *extra) >+{ ... >+ while (thread->suspended & JSTHREAD_SUSPENDED) >+ pthread_yield(); This does not put truly suspend the thread. According to the manual for pthread_yield "the thread is placed at the end of the run queue". So there is chance that, when the stack is scanned, the thread will wake up. I suppose a better alternative here would be to use a true waiting mutex. > static JSThread * > NewThread(jsword id) > { ... >+# elif defined(XP_UNIX) && !defined(__APPLE__) >+ thread->handle = pthread_self(); >+ (void) pthread_once(&key_once, CreateThreadLookupKey); >+ (void) pthread_setspecific(key, thread); >+ InstallSuspendSignalHandler(); The signal handler is per process. So InstallSuspendSignalHandler should not be called for each thread. Also, using a user signal is a nice hack, but it may interfere with other code that use the same signal. Granted, now we have out-of-process plugins. That removes a lot of potential other users of user signals. Still non-browser embeddings or addons using some native libraries may hit it. As a workaround for that I suggest to install the signal only temporary when we about to suspend non-GC threads and then restore the handler back. >+# else >+# error Unsupported platform. That should be #error "Unsupported platform." Fix that here and everywhere. >+void >+JSThread::suspendWait() >+{ >+ while ((suspended & JSTHREAD_SUSPENDED) == 0) >+ pthread_cond_timedwait(&suspendResumeCV, &suspendResumeMutex, &onemillisec); Comment here why it is necessary to use a loop with 1ms wait instead of just unconditional wait. > static JSDHashOperator > thread_marker(JSDHashTable *table, JSDHashEntryHdr *hdr, uint32 /* index */, > void *arg) > { > JSThread *thread = ((JSThreadsHashEntry *) hdr)->thread; >- thread->data.mark((JSTracer *) arg); thread->data.mark must not be called when the thread is suspended. The suspend is only necessary for the conservative stack scanning. Thus keep thread->data.mark as is and do only conservative scanning inside suspend/resume. Also, we do not need to suspend threads in requests. They will be waiting in any case. Only threads outside requests that call JSSuspendRequest should be suspended. This will help to mitigate that user-signal problem above. But in general the signal hack on Linux feels like just a hack. It is fine to use it temporary but we should try to avoid it since Linux dos not support suspending threads naively. A simple way to do that would be for any thread that calls JSSuspendRequest to scan its stack from base until the end of the CPU page that contains JSSuspendRequest call. A better way would be to add an alternative to JSSuspendRequest that takes a callback that will be called outside the JS request. This way we can accurately record the stack position.
Attachment #428596 - Flags: review?(igor) → review-
> The signal handler is per process. So InstallSuspendSignalHandler should not be > called for each thread. Also, using a user signal is a nice hack, but it may > interfere with other code that use the same signal. Granted, now we have > out-of-process plugins. That removes a lot of potential other users of user > signals. Still non-browser embeddings or addons using some native libraries may > hit it. As a workaround for that I suggest to install the signal only temporary > when we about to suspend non-GC threads and then restore the handler back. We are the platform. We set the environment. If we decide we use SIGUSR1, then we use SIGUSR1. If an embedding tries to interfere, we die. I think that's acceptable. Same goes for plugins. We make a lot of other changes to the environment as well. We set the FPU into a specific mode w.r.t. to rounding and NaN reporting. If a plugin changes that we used to crash, now we just flip those bits back. We redirect SIGFPE for that.
Signals are bad medicine, the SIGFPU story proves this. We could try to claim SIGUSR1 but better if we don't use signals at all, or if we arm and handle one only temporarily. Why are we not using any js/src/threads (bug 507718) or NSPR veneer at all? At the least we should minimize #ifdefs where there's no platform-specific advantage and the APIs are essentially cross-platform. /be
You mean PR_Interrupt? It has to be doing the same thing internally. Linux does not support asynchronous thread suspending.
Brendan what do you mean with "NSPR veneer"? I don't see how PR_Interrupt could help here. Is there a way to register our own interrupt handler? The linux implementation is based on the nspr thread synchronization. Fixing the suggestions with the signal handler activation for each GC and clean up the signal handler code shouldn't be too hard.
Summoning Wan-Teh for better NSPR advice than I can render from old memories. /be
Brendan: what is the question?
(In reply to comment #21) > Brendan: what is the question? Sorry, the question is how to suspend and scan (in the conservative garbage collection sense) a thread's stack, short of hacking platform-dependent code (e.g., sending SIGUSR1 to the thread after arming a handler for that signal). /be
NSPR used to support that when Netscape ran its port of Sun JVM over NSPR. The functions are PR_SuspendAll and PR_ResumeAll. They are implemented using SIGUSR2 (and possibly SIGUSR1 also). I'm not sure if the code still works. Pthreads doesn't have a function that suspends a thread. Some implementations of pthreads do have such a function as an extension.
Attached patch WiP (obsolete) (deleted) — Splinter Review
Inspired by bug 507718 I modified this patch (still not done and not tested on Windows) I also tried to address previous comments with this patch: - install signal handler on linux only for GC - send signal to suspended thread instead of pthread_yield I am already using the approach Wan Teh is mentioning.
Attached patch update (obsolete) (deleted) — Splinter Review
Attachment #429859 - Attachment is obsolete: true
Attached patch patch (obsolete) (deleted) — Splinter Review
Attachment #429896 - Attachment is obsolete: true
Attachment #429914 - Flags: review?(igor)
(In reply to comment #26) > Created an attachment (id=429914) [details] > patch I am not sure that proliferation of files with platform-specific implementation is a right thing here. IMO it makes it harder to maintain code since updating several files requires more efforts than updating one containing platform-specific code. Plus such layers create a false sense of having a working thread suspend. I would prefer to see the Linux deficiency clear and loud in the code so other parts would not even think about calling thread->suspend(). Other opinions? That aside the patch should not suspend threads outside the requests since they cannot contain JS values on the stack. They also can be about to be destroyed at the moment patch tries to suspend them. This could be problematic especially on Linux. Plus the threads inside the request are already suspended waiting for the GC to finish. So we only need to suspend threads that called JS_SuspendRequest from the inside the GC request.
(In reply to comment #27) > (In reply to comment #26) > > Created an attachment (id=429914) [details] [details] > > patch > > I am not sure that proliferation of files with platform-specific implementation > is a right thing here. IMO it makes it harder to maintain code since updating > several files requires more efforts than updating one containing > platform-specific code. Plus such layers create a false sense of having a > working thread suspend. I would prefer to see the Linux deficiency clear and > loud in the code so other parts would not even think about calling > thread->suspend(). Other opinions? I don't think the code gets cleaner if we add a few hundred LOC to an already huge jsgc.cpp file. To quote Brendan: "It would be great to avoid ifdef hell." But that can be changed... > > That aside the patch should not suspend threads outside the requests since they > cannot contain JS values on the stack. They also can be about to be destroyed > at the moment patch tries to suspend them. This could be problematic especially > on Linux. Plus the threads inside the request are already suspended waiting for > the GC to finish. So we only need to suspend threads that called > JS_SuspendRequest from the inside the GC request. Is it safe to use one of the outstandingRequests or requestDepth variables to determine if a thread has to be suspended or do I have to add a new one for this?
(In reply to comment #28) > > I am not sure that proliferation of files with platform-specific implementation > > is a right thing here. IMO it makes it harder to maintain code since updating > > several files requires more efforts than updating one containing > > platform-specific code. Note for later this point. > > Plus such layers create a false sense of having a > > working thread suspend. I would prefer to see the Linux deficiency clear and > > loud in the code so other parts would not even think about calling > > thread->suspend(). Other opinions? > > I don't think the code gets cleaner if we add a few hundred LOC to an already > huge jsgc.cpp file. To quote Brendan: "It would be great to avoid ifdef hell." > But that can be changed... I agree with Igor we do not want a facade of an internal API that is just going to fall over if used more widely. Attractive nuisance. #ifdef hell is no fun either, but these are evils competing for "lesser" status. Why not put the platform-specific code in a separate file but label/name it so as to warn kids away from the nuisance. You could even do it via inline methods in a .h file that is not exported. > Is it safe to use one of the outstandingRequests or requestDepth variables to > determine if a thread has to be suspended or do I have to add a new one for > this? JS_SuspendRequest clearly uses outstandingRequests precisely for this purpose, no more counters needed. It was added as a bug fix, see bug 401687. /be
I suggest not to fix this bug. Rather for a thread that have a suspended request the stack scanner should just scan until the SP on the moment of JS_SuspendRequest call. This cannot miss any possible roots. The only problem is that if JS_SuspendRequest would be called through a series of helpers so the scanner may end up scanning more memory than necessary. But it would just increase the chance of false positives, it would not add memory-safety worries since the OSes that we target never release the CPU pages that the stack uses. And with the bug 551680 and its followups fixed the extra false positives would not exist for the Firefox where they would inevitably exist when using the SP of the suspended thread. Gregor, I can update your patch 516832 with the necessary changes.
That's a deja vu. We removed the thread synchronization code and realized that we might miss roots from threads that don't get suspended. So I put it in again. I am happy if the stack scanning works without the suspend-resume code. I couldn't follow the discussion about the new SuspendRequest during the last week(s). What changed that it works now for stack scanning? Igor, feel free to update the patch.
(In reply to comment #31) > That's a deja vu. We removed the thread synchronization code and realized that > we might miss roots from threads that don't get suspended. So I put it in > again. I am happy if the stack scanning works without the suspend-resume code. > I couldn't follow the discussion about the new SuspendRequest during the last > week(s). What changed that it works now for stack scanning? IIRC we have never added the native stack pointer capturing code to JS_SuspendRequest.
(In reply to Igor Bukanov from comment #30) > I suggest not to fix this bug. WONTFIX then? :-)
(In reply to Ryan VanderMeulen from comment #40) > (In reply to Igor Bukanov from comment #30) > > I suggest not to fix this bug. > > WONTFIX then? :-) This
Summary: TM: Thread synchronization for conservative stack scanning → Thread synchronization for conservative stack scanning
Attachment #429914 - Attachment is obsolete: true
Attachment #429914 - Flags: review?(igor)
This was addressed in another bug
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: