Closed Bug 721025 Opened 13 years ago Closed 8 years ago

Lots of crashes in GetJSContext from single-threaded runtime release assert on main thread, and about:jank addon

Categories

(Core :: XPConnect, defect)

x86
macOS
defect
Not set
critical

RESOLVED INCOMPLETE

People

(Reporter: jrmuizel, Unassigned)

References

Details

(Keywords: crash)

Crash Data

Assignee: general → nobody
Component: JavaScript Engine → XPConnect
QA Contact: general → xpconnect
Fwiw, bug 715713 prevents these from properly showing up in crash stats.
Depends on: 715713
Severity: normal → critical
Crash Signature: [@ CrashInJS]
Keywords: crash
Oh, this is the single-threaded runtime release assert. If I read the buildids correctly, the crashes were before the landing of bug 675078 yesterday. Also a weird thing about these stacks is that the aborts are happening on the main thread. That would imply perhaps some funny business being played with threads? I wonder how this hasn't been hit until recently since the assert has been in for several months. I also see that this is all on the nightly-profiling branch; perhaps something is particular to this branch?
I just got this crash: https://crash-stats.mozilla.com/report/index/bp-e1d430be-c981-41ab-99eb-111f42120127 Please let me know if I can provide any helpful information. Thanks!
I think this crash is caused by about:jank/profiler. Has anyone who sees the crash not installed either of those addons?
Uh, yeah I installed that some minutes before the crash.
Same same.
Ah, clues! So does anyone know if the about:jank profiler does any tricks with threads or thread ids? Basically, JSRuntime stores the result of PR_GetCurrentThread() when it is created and asserts that it is equal to PR_GetCurrentThread() any time JS_AbortIfWrongThread is called (e.g. in GetJSContext). There is also a little stupid dance where we temporarily change threads (in a single-threaded manner) during cycle collection, in case that is relevant...
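For readers unfamiliar with the check described above, here is a minimal sketch of the idea, assuming standard C++ threads in place of NSPR. OwnedRuntime, onOwnerThread, clearOwner and setOwner are hypothetical names chosen for illustration, and a default-constructed std::thread::id stands in for the "no owner" state; this is not the JS engine's actual code.

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for a runtime that may only be used by one thread.
class OwnedRuntime {
public:
    // Record the creating thread as the owner.
    OwnedRuntime() : owner_(std::this_thread::get_id()) {}

    // Analogue of the wrong-thread check: true only on the owning thread.
    bool onOwnerThread() const {
        return owner_ == std::this_thread::get_id();
    }

    // Give up ownership before handing the runtime to another thread.
    // A default-constructed id compares unequal to every running thread.
    void clearOwner() { owner_ = std::thread::id(); }

    // Claim ownership for the calling thread after a handoff.
    void setOwner() { owner_ = std::this_thread::get_id(); }

private:
    std::thread::id owner_;
};
```

If a clear is ever left without its matching set, every later check on the original thread fails in this sketch, even though that thread created the runtime.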
(In reply to Tim Taubert [:ttaubert] from comment #8)
> Uh, yeah I installed that some minutes before the crash.
Same here.
Taras blogged about:jank.
(In reply to Axel Hecht from comment #12)
> Taras blogged about:jank.
Jeff wrote about:jank! ;-)
(In reply to Luke Wagner [:luke] from comment #10)
> Ah, clues! So does anyone know if the about:jank profiler does any tricks
> with threads or thread ids? Basically, JSRuntime stores the result of
> PR_GetCurrentThread() when it is created and asserts that it is equal to
> PR_GetCurrentThread() any time JS_AbortIfWrongThread is called (e.g. in
> GetJSContext). There is also a little stupid dance where we temporarily
> change threads (in a single-threaded manner) during cycle collection, in
> case that is relevant...
We're building with frame pointers. While the extension is active, we send a signal every 10 ms to the main thread, which pushes a signal handler, performs a backtrace, and resumes.
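As a rough illustration of the signal-based sampling described above, here is a minimal POSIX sketch. It is not the profiler's code: g_ticks, onTick and installTickHandler are hypothetical names, the handler only bumps a counter instead of walking the stack, and the choice of signal and the 10 ms timer are left to the caller.

```cpp
#include <cassert>
#include <signal.h>

// Counter touched from the handler; sig_atomic_t keeps the access signal-safe.
static volatile sig_atomic_t g_ticks = 0;

// Hypothetical sample handler. It interrupts whatever the thread was doing,
// runs, and then the interrupted code resumes.
extern "C" void onTick(int) {
    g_ticks = g_ticks + 1;
}

// Install onTick for the given signal; returns true on success.
bool installTickHandler(int signo) {
    struct sigaction sa;
    sa.sa_handler = onTick;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;  // ask the OS to restart interrupted system calls
    return sigaction(signo, &sa, nullptr) == 0;
}
```

A caller could install the handler with installTickHandler(SIGUSR1) and deliver one sample to itself with raise(SIGUSR1). Anything the handler interrupts, including a thread blocked in a wait, has to tolerate that interruption.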
Does the signal handler possibly touch JS or XPConnect?
That would be 'TableTicker::Tick'. The only thing we use from gecko is TimeStamps. There's no JS or XPConnect.
So my best guess here is that perhaps something with TLS is getting screwed up.
I was able to catch this in a debugger. mJSContext->runtime->ownerThread_ is equal to 0xc1ea12 when it should be equal to 0x100336220 (the value returned by PR_GetCurrentThread() and pthread_getspecific(261))
That would be caused by JSRuntime::clearOwnerThread: http://mxr.mozilla.org/mozilla-central/source/js/src/jsapi.cpp#891
Also, for this runtime, suspendCount = 1 and requestDepth = 0. I'm not sure whether that's expected.
(In reply to Josh Matthews [:jdm] from comment #19)
> That would be caused by JSRuntime::clearOwnerThread:
> http://mxr.mozilla.org/mozilla-central/source/js/src/jsapi.cpp#891
It looks like the likely caller of that is nsXPConnect::NotifyLeaveMainThread(), which is called in one place by the cycle collector. Further, I don't see how we could avoid calling NotifyEnterMainThread(), which should reset it to the proper value.
Hmm, all hints seem to point toward something happening with cycle collection. One random idea is that perhaps a bug in NSPR is causing the cycle collector thread to get notified unexpectedly (so it would run a cycle collection concurrently with the main thread (bad!) and leave rt->ownerThread_ in the 'clear' state). First, I'd try running a debug build (which may catch things a lot earlier). Second, I'd put a printf (with flush, obviously) in JS_ClearRuntimeThread, JS_SetRuntimeThread, nsCycleCollector::BeginCollection, and the signal handler and see if there are any weird interleavings.
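The printf-with-flush instrumentation suggested above could look roughly like the following sketch. traceEvent is a hypothetical helper, not an existing Gecko function, and the hashed thread id is only a convenient printable tag. Note that fprintf is not async-signal-safe, so a helper like this is an assumption that fits the ordinary call sites better than the signal handler.

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>

// Hypothetical trace helper: print which thread reached which call site.
// Returns the number of characters written (negative on error).
inline int traceEvent(const char* where) {
    // Hash the thread id to get something printable with a plain format string.
    std::size_t tid =
        std::hash<std::thread::id>()(std::this_thread::get_id());
    int written = std::fprintf(stderr, "[thread %zx] %s\n", tid, where);
    // Flush immediately so the line is not lost if the process aborts next.
    std::fflush(stderr);
    return written;
}
```

Calling traceEvent("JS_ClearRuntimeThread") and traceEvent("JS_SetRuntimeThread") at the top of those functions would show whether a clear from one thread is ever followed by activity on another before the matching set.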
(In reply to Luke Wagner [:luke] from comment #22)
> First, I'd try running a debug build (which may catch things a lot earlier).
> Second, I'd put a printf (with flush, obviously) in JS_ClearRuntimeThread,
> JS_SetRuntimeThread, nsCycleCollector::BeginCollection, and the signal
> handler and see if there are any weird interleaving.
That sounds like a good idea. If someone can come up with a good way of reproducing this that would be great. I've only been able to see it after about a day of regular use in a browser.
I was able to get a core dump of this assertion failing:
  JS_ASSERT(ownerThread_ == (void *)0xc1ea12);
during nsXPConnect::NotifyEnterCycleCollectionThread().
Here's some state when this happens:
  cycle collector thread - 0x117801000
  main thread - 0x7fff70f6fcc0
  rt->ownerThread - 0x7fff70f6fcc0
  condition variable lock owner - 0x117801000
The main thread is not waiting in nsCycleCollectorRunner::Collect.
I don't yet have any theories as to what's going wrong.
(In reply to Jeff Muizelaar [:jrmuizel] from comment #24)
The main thread not waiting is a big red flag. This sounds like exactly what comment 22 para 1 is guessing. Perhaps there is some bug involving signal handlers (or the backtrace facility) and NSPR's condition variables? Another question: I only see crash reports from OS X above. Does anyone know of anyone using about:jank on Windows and having (or not having) problems?
One thing to note is that NSPR, and specifically PR_WaitCondVar (which is the underlying implementation of the CondVar class), does not protect against spurious wakeups or interrupted threads. CondVar::Wait actually returns an nsresult, so an NS_FAILED along with PR_GetError would be useful to see if NSPR is being fussy here.
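To make the spurious-wakeup concern concrete, here is a minimal sketch of the usual defence, written with std::condition_variable rather than NSPR. Request, pending, waitForRequest and post are hypothetical names; this is not the cycle collector's actual code. The waiter re-checks an explicit flag after every wakeup, so a wakeup that arrives without a real request sends it back to sleep.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical request channel guarded by a mutex.
struct Request {
    std::mutex lock;
    std::condition_variable cv;
    bool pending = false;

    // Block until a request has really been posted, then consume it.
    void waitForRequest() {
        std::unique_lock<std::mutex> guard(lock);
        while (!pending) {   // loop: a bare wakeup does not count as a request
            cv.wait(guard);
        }
        pending = false;
    }

    // Post a request and wake the waiter.
    void post() {
        {
            std::lock_guard<std::mutex> guard(lock);
            pending = true;
        }
        cv.notify_one();
    }
};
```

With this shape, a wait that returns early for any reason (spurious wakeup or signal interruption) cannot start work on its own, because the flag is still false.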
I added the following code:
  nsresult result = mRequest.Wait();
  if (result != NS_OK) {
    printf("%x\n", result);
    assert(result == NS_OK);
  }
and I still get the crash without hitting the assert.
Perhaps the bug is in Wait()?
Wait is just a wrapper around PR_WaitCondVar, with some extra machinery for the deadlock detector in debug builds.
Ok, then perhaps PR_WaitCondVar (on OS X) has a bug.
Crash Signature: [@ CrashInJS] → [@ CrashInJS] [@ CrashInJS | XPCCallContext::GetJSContext] [@ CrashInJS | XPCCallContext::GetJSContext(JSContext**)]
This is topcrash #3 in 11.0b3 now: https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/11.0b3 Do we have any clue how to fix this?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #31)
> This is topcrash #3 in 11.0b3 now:
> https://crash-stats.mozilla.com/topcrasher/byversion/Firefox/11.0b3
>
> Do we have any clue how to fix this?
Sorry, it's actually bug 715757.
Crash Signature: [@ CrashInJS] [@ CrashInJS | XPCCallContext::GetJSContext] [@ CrashInJS | XPCCallContext::GetJSContext(JSContext**)] → [@ CrashInJS] [@ CrashInJS | XPCCallContext::GetJSContext ] [@ CrashInJS | XPCCallContext::GetJSContext(JSContext**) ]
Summary: Lots of crashes in GetJSContext → Lots of crashes in GetJSContext from single-threaded runtime release assert on main thread, and about:jank addon