Closed
Bug 937220
Opened 11 years ago
Closed 11 years ago
crash in mozalloc_abort(char const* const) | NS_DebugBreak (###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)
Categories
(Core :: DOM: Workers, defect)
Tracking
()
RESOLVED
DUPLICATE
of bug 956284
Version | Tracking | Status |
---|---|---|
firefox27 | --- | unaffected |
firefox28 | --- | affected |
firefox29 | --- | affected |
firefox-esr24 | --- | unaffected |
People
(Reporter: tracy, Assigned: khuey)
References
Details
(4 keywords, Whiteboard: [most likely fixed by bug 956284])
Crash Data
Attachments
(2 files)
(deleted), patch | Details | Diff | Splinter Review
(deleted), application/x-javascript | Details
This bug was filed from the Socorro interface and is
report bp-321d7a93-a8f5-41da-a53a-d275b2131111.
=============================================================
This increased in volume on 20131107 and has been ramping up in volume since then.
0 mozalloc.dll mozalloc_abort(char const * const) memory/mozalloc/mozalloc_abort.cpp
1 xul.dll NS_DebugBreak xpcom/base/nsDebugImpl.cpp
Comment 1•11 years ago
Are there reports with any information further up the stack? The first two frames are really generic.
Reporter
Comment 2•11 years ago
No, I clicked into a couple dozen random reports and they all look the same.
Comment 3•11 years ago
What we do have is an abort message in the App Notes:
xpcom_runtime_abort(###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)
Summary: crash in mozalloc_abort(char const* const) | NS_DebugBreak → crash in mozalloc_abort(char const* const) | NS_DebugBreak (###!!! ABORT: cycle collector fault: file e:/builds/moz2_slave/m-cen-w64-ntly-000000000000000/build/xpcom/base/nsCycleCollector.cpp, line 1054)
Comment 4•11 years ago
That's quite peculiar. I'm not sure why it isn't printing out a more useful message.
Comment 5•11 years ago
Given that cycle collector fault and that it rose up on Nightly at the same time as bug 937191, is this another regression from bug 928312?
Comment 6•11 years ago
Does the file path mean this is a Win64 crash, or is this just built on a Win64 machine or something?
Comment 7•11 years ago
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #5)
> Given that cycle collector fault and that it rose up on Nightly at the same
> time as bug 937191, is this another regression from bug 928312?
That sounds plausible, but we don't have much to go on here.
Comment 8•11 years ago
If I'm reading the crash-stats report correctly, 100% of these 233 crashes are on AMD, which is unusual.
Comment 9•11 years ago
Also of note, I looked at half a dozen crashes and none of them were on the main thread, which is evidence in favor of comment 5.
Comment 10•11 years ago
This is a win64 crash: you can tell by looking for "Build Architecture" in the crash report.
I'm looking into whether this is win64-only.
Comment 11•11 years ago
https://crash-stats.mozilla.com/search/?app_notes=cycle&app_notes=collector&app_notes=fault&version=28.0a1&_facets=version&_facets=signature&_facets=cpu_name&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=cpu_info&_columns=cpu_name shows that the recent spike is almost entirely win64, but it is spread across multiple computers (not a single installation crashing repeatedly). Cycle collector aborts in other versions are distributed normally.
Comment 12•11 years ago
I'm seeing both Intel and AMD CPUIDs in here, so it doesn't look manufacturer-specific. The "amd64" field just means Win64. What's misleading is that on 64-bit processes, we don't get a "GenuineIntel" or "AuthenticAMD" label in front of the Family/Model/Stepping notes. This seems to be a general problem, not specific to our code; windbg's !cpuid extension seems to suffer from the same problem.
I wonder if this is actually a Win64-only crash or if the stackwalking just happened to make it so that only Win64 builds hit this particular variant of this signature. There appear to be some odd frames between abort and NS_DebugBreak.
Comment 13•11 years ago
> I wonder if this is actually a Win64-only crash or if the stackwalking just
> happened to make it so that only Win64 builds hit this particular variant of
> this signature
I'm pretty certain that this is win64-only. The crash-stats search I linked wasn't searching by signature, but was searching for "cycle collector fault" in the abort message, which should catch all the variants.
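The difference between searching by signature and searching the abort message can be sketched as below. This is only an illustration: `CrashReport` and `CountByArch` are hypothetical names, and the struct is a drastically simplified stand-in for a real crash-stats record, not the Socorro data model.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified crash record (assumption: real reports
// carry many more fields than these three).
struct CrashReport {
  std::string signature;
  std::string appNotes;   // where the "###!!! ABORT: ..." text ends up
  std::string buildArch;  // e.g. "x86" or "amd64"
};

// Count reports whose app notes contain `needle`, keyed by build
// architecture. Matching on appNotes instead of signature means every
// signature variant of the same abort is counted.
std::map<std::string, int> CountByArch(const std::vector<CrashReport>& reports,
                                       const std::string& needle) {
  std::map<std::string, int> counts;
  for (const CrashReport& r : reports) {
    if (r.appNotes.find(needle) != std::string::npos) {
      ++counts[r.buildArch];
    }
  }
  return counts;
}
```

Calling `CountByArch(reports, "cycle collector fault")` would then tally the aborts per architecture regardless of which frames the stack walker put into the signature.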
Comment 14•11 years ago
Pointing the debugger at a local copy of the xul image somehow lets it figure out the stack better. It's an overflowing refcount in DescribeRefCountedNode.
xul!NS_DebugBreak
xul!Fault
xul!GCGraphBuilder::DescribeRefCountedNode
xul!nsDOMEventTargetHelper::cycleCollection::Traverse
xul!mozilla::dom::workers::WorkerPrivateParent<mozilla::dom::workers::WorkerPrivate>::cycleCollection::Traverse
xul!mozilla::dom::workers::XMLHttpRequest::cycleCollection::Traverse
xul!GCGraphBuilder::Traverse
xul!nsCycleCollector::MarkRoots
xul!nsCycleCollector::BeginCollection
xul!nsCycleCollector::Collect
xul!nsCycleCollector_collect
xul!`anonymous namespace'::WorkerJSRuntime::CustomGCCallback
mozjs!Collect
mozjs!js::GC
mozjs!js::DestroyContext
xul!`anonymous namespace'::WorkerThreadRunnable::Run
xul!nsThread::ProcessNextEvent
xul!NS_ProcessNextEvent
xul!nsThread::ThreadFunc
nss3!_PR_NativeRunThread
nss3!pr_root
msvcr100!_callthreadstartex
Comment 15•11 years ago
I also see the same stack as comment 14 in the signatures containing js::frontend::Parser<js::frontend::FullParseHandler>::noteNameUse. noteNameUse disappears from the stack after the debugger gets to see the full image.
Comment 16•11 years ago
Thanks for the stack!
The faults in that method are:
if (refCount == 0)
  Fault("zero refcount", mCurrPi);
if (refCount == UINT32_MAX)
  Fault("overflowing refcount", mCurrPi);
Presumably it is the first one.
Blocks: 928312
Comment 17•11 years ago
(In reply to Andrew McCreight [:mccr8] from comment #16)
> Thanks for the stack!
>
> The faults in that method are:
> if (refCount == 0)
>   Fault("zero refcount", mCurrPi);
> if (refCount == UINT32_MAX)
>   Fault("overflowing refcount", mCurrPi);
>
> Presumably it is the first one.
Based on the return address and the string pushed, it looks like the second one (overflow).
Comment 18•11 years ago
Weird.
Assignee
Comment 20•11 years ago
So it sounds like our refcount is underflowing. Fun.
I had an idea of how to rewrite WorkerPrivate to have less crazy ownership. Maybe I should just do that.
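A minimal sketch of why an underflow would trip the "overflowing refcount" check rather than the "zero refcount" one, assuming the count is stored as an unsigned 32-bit integer (which the comparison against UINT32_MAX in comment 16 suggests). `ClassifyRefCount` and `ReleaseOnce` are hypothetical names used only for illustration; they are not Gecko functions.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper mirroring the two checks quoted in comment 16.
// Returns the fault message that would be reported, or nullptr if the
// refcount passes both checks.
const char* ClassifyRefCount(uint32_t refCount) {
  if (refCount == 0)
    return "zero refcount";
  if (refCount == UINT32_MAX)
    return "overflowing refcount";
  return nullptr;
}

// Hypothetical unchecked release. Unsigned arithmetic is modular, so
// releasing an object whose count is already 0 wraps to UINT32_MAX.
uint32_t ReleaseOnce(uint32_t refCount) {
  return refCount - 1;
}
```

Under that assumption, one release too many on a count of 0 yields UINT32_MAX, so `ClassifyRefCount(ReleaseOnce(0))` reports "overflowing refcount" even though the underlying bug is an underflow.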
Reporter
Updated•11 years ago
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak] → [@ mozalloc_abort(char const* const) | NS_DebugBreak]
[@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)]
Comment 22•11 years ago
This continues to be a topcrash on trunk, even if now we get a somewhat random and pretty bogus third frame in the signature (the most-common signature of those has been added here, a few others are floating around as well).
tracking-firefox28:
--- → ?
Assignee
Comment 24•11 years ago
Yeah, I will dive in next week.
Assignee: nobody → khuey
Flags: needinfo?(khuey)
Reporter
Updated•11 years ago
Crash Signature: [@ mozalloc_abort(char const* const) | NS_DebugBreak]
[@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)] → [@ mozalloc_abort(char const* const) | NS_DebugBreak]
[@ mozalloc_abort(char const* const) | NS_DebugBreak | js::AtomizeChars(js::ExclusiveContext*, wchar_t const*, unsigned __int64, js::InternBehavior)]
[@ mozalloc_abort(char const*) | Abort | NS_Debug…
status-firefox28:
--- → affected
Updated•11 years ago
Flags: needinfo?(khuey)
Reporter
Comment 25•11 years ago
Macs are crashing with:
###!!! ABORT: cycle collector fault: file ../../../../xpcom/base/nsCycleCollector.cpp, line 1155)
Crash Reason EXC_BAD_ACCESS / KERN_INVALID_ADDRESS
[@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | GCGraphBuilder::DescribeRefCountedNode(unsigned int, char const*) ]
Assignee
Updated•11 years ago
Component: XPCOM → DOM: Workers
Flags: needinfo?(khuey)
Updated•11 years ago
status-firefox27:
--- → ?
status-firefox29:
--- → affected
status-firefox-esr24:
--- → ?
tracking-firefox29:
--- → +
Assignee
Comment 27•11 years ago
This is a regression from bug 928312. It doesn't affect anything before 28.
Comment 28•11 years ago
It's been over a week, can we get a status update on this topcrash?
Flags: needinfo?(khuey)
Assignee
Comment 29•11 years ago
Well I've been hoping we would get a testcase in bug 959562.
Flags: needinfo?(khuey)
Updated•11 years ago
Comment 30•11 years ago
smacleod has a test case for this, it sounds like. Hurray!
Comment 31•11 years ago
<smacleod> mccr8: I'm having an issue when I start a worker up at onquitapplication. The browser crashes with Fault in cycle collector: overflowing refcount (ptr: 0x10af2e360)
Comment 32•11 years ago
I can reproduce with a pretty high frequency on my OSX 10.9.1 machine. I've attached the patch I'm using which causes the error at shutdown.
STR:
- Create a profile with session store set to automatically restore.
- Apply the patch and build (currently applied on top of commit 03070649278e65e31fe9452a259730c7370d177b in the gecko-dev repository)
- Run the browser
- Close the browser by cmd+q hotkey
- Browser will print timestamp from js inside worker
- Browser will do one of four things:
1. Close and exit process (Usually without properly writing sessionstore.js)
2. Window will close, but process locks up without printing
3. Window will close, but process locks up and prints:
###!!! [Child][DispatchAsyncMessage] Error: Route error: message sent to unknown actor ID
4. Process prints to console and crashes:
Fault in cycle collector: overflowing refcount (ptr: 0x10b63d800)
[95490] ###!!! ABORT: cycle collector fault: file /Users/smacleod/src/moz/gecko-dev/xpcom/base/nsCycleCollector.cpp, line 1254
[95490] ###!!! ABORT: cycle collector fault: file /Users/smacleod/src/moz/gecko-dev/xpcom/base/nsCycleCollector.cpp, line 1254
The frequency of the fourth case seems to depend on the sessionstore.js file, which is what the worker is writing when things crash. Some sessions cause the frequency to go up. I will attach a session which causes it occasionally, but not as often as I've previously seen. I suspect there might be a sweet spot in the size of the string sent to the worker to be written.
It should be noted that if you allow 15s to elapse after starting the browser, the worker will start before shutdown, and there will be no crash when shutdown occurs.
Comment 33•11 years ago
Comment 34•11 years ago
With Steven's STR, the refcount seems to always overflow for an nsDOMEventTargetHelper. Not sure if that's helpful information.
Comment 35•11 years ago
Just a note here: we are experiencing a lot of those crashes in our Mozmill automation for Firefox 28 and 29 throughout the day. See bug 959562 for more details.
I'm a bit worried that this will have a larger impact on QA once the merge from Aurora to Beta has happened. If tests are aborted due to those crashes, there will be a delay in the sign-off process. So is there anything we can do to get this crasher fixed by next Monday?
Comment 36•11 years ago
This should be fixed by bug 956284.
Updated•11 years ago
Comment 37•11 years ago
Steven, could you please test today's upcoming Nightly build to see if you can still reproduce this problem? If not, bug 956284 really fixed it.
Flags: needinfo?(smacleod)
Comment 38•11 years ago
(In reply to Henrik Skupin (:whimboo) from comment #37)
> Steven, could you please test the upcoming Nightly build from today if you
> can still reproduce this problem? If not bug 956284 really fixed it.
I'm unable to reproduce the cycle collector crash now.
I still get lockups occasionally in this situation, but I think that's just related to Bug 964531
Flags: needinfo?(smacleod)
Comment 39•11 years ago
(In reply to Steven MacLeod [:smacleod] from comment #38)
> I still get lockups occasionally in this situation, but I think that's just
> related to Bug 964531
The lockups are being investigated in bug 965309.
Updated•11 years ago
Comment 41•11 years ago
Thanks again for the test case, Steven!
Comment 42•11 years ago
Andrew, is the testcase something we could add to one of our test suites?
Flags: in-testsuite?
Comment 43•11 years ago
Bug 959562 suggests you have a working test case already ;) But seriously, it's a rather intermittent issue caused by spawning workers on shutdown. It might be hard or impossible to reproduce reliably in a test.
Comment 44•11 years ago
We haven't created a minimized testcase for bug 959562. It also failed intermittently. So I was hoping the testcase on this bug would make it easier to reproduce the crash reliably.
Comment 45•11 years ago
Yeah, I don't know if there's a test. Presumably once Steven's stuff lands, things will fail if we regress this, so that's something.
Updated•11 years ago
tracking-firefox28:
+ → ---
tracking-firefox29:
+ → ---
Updated•9 years ago
Group: core-security → core-security-release
Updated•8 years ago
Group: core-security-release