Closed
Bug 997908
Opened 11 years ago
Closed 10 years ago
crash in ReleaseSliceNow(unsigned int, void*) accessing memory at 0x5a5a5a5a5a5a5a5a
Categories
(Core :: XPCOM, defect)
Tracking
()
People
(Reporter: justdave, Unassigned)
References
(Blocks 1 open bug)
Details
(Keywords: crash, topcrash-mac)
Crash Data
Attachments
(10 files, 8 obsolete files)
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
text/plain
|
Details | |
(deleted),
patch
|
froydnj
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
patch
|
mccr8
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
mccr8
:
review-
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-a1afae6a-7b03-4d08-a793-df4ee2140416.
=============================================================
I'm going out on a limb and guessing this is related to the plugin code because I get a separate trapped crash from the OS in the plugin-container at the same time every time this happens.
the crash report from the OS which corresponds to this crash is attached.
Reporter | ||
Comment 1•11 years ago
|
||
There's actually two separate crash reports logged by the OS at the same time, here's the other one.
Reporter | ||
Updated•11 years ago
|
Version: 1.9.0 Branch → 30 Branch
Comment 2•11 years ago
|
||
Flash 13.0.0.182 triggered a huge number of crashes on OS X on older Macs (at different addresses). See bug 996637. That's the version you have.
So I suggest you upgrade to the current version (13.0.0.201) and see if these crashes still happen.
Reporter | ||
Comment 3•11 years ago
|
||
This one was actually Silverlight, not Flash (according to the dialog box from OS X - I don't see it actually in the report :( )
Comment 4•11 years ago
|
||
Your second crash stack (from comment #1) is for Silverlight 5.1.20913.0. So you have two more-or-less simultaneous crashes in two different plugins.
Congratulations! I've never seen this before :-)
Comment 5•11 years ago
|
||
Your first crash stack (from comment #0) was definitely for the Flash plugin.
Comment 6•11 years ago
|
||
Actually, thinking more about this I suspect the real problem is the CC crash in ReleaseSliceNow(). The plugin crashes are probably just side effects.
There have been a lot of these crashes lately -- mostly on Windows, but with a disproportionate number of them on the Mac:
https://crash-stats.mozilla.com/report/list?signature=ReleaseSliceNow%28unsigned+int%2C+void%2A%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&hang_type=any&date=2014-04-17+19%3A00%3A00&range_value=4#reports
These crashes have also been responsible for a bunch of intermittent test failures, of which the following are still open: bug 816064, bug 981139.
Comment 8•11 years ago
|
||
If by any chance you can reproduce these crashes, *please* tell us how.
Updated•11 years ago
|
Comment 9•11 years ago
|
||
Though this bug happens on all platforms, I'm leaving the platform set to "Mac OS X" because it happens there disproportionately.
Reporter | ||
Comment 10•11 years ago
|
||
I have a membership management app that I use for Boy Scouts that is built on Silverlight, and I'm in it frequently enough that I tend to leave it open in a tab. It times out after 30 minutes of inactivity and kicks you out, and the login screen isn't Silverlight, which means the app would be frequently getting unloaded and reloaded. I don't know if that's related, but I seem to get this crash every day or two. Said app is usually in a background tab when it crashes.
Comment 11•11 years ago
|
||
For what it's worth, these are currently the #1 Mac topcrasher on the 31 branch and #3 on the 30 branch. They're down in the 20s and 30s on the 29 and 28 branches, and not even in the top 100 on the 27 branch.
So though this problem has existed for a long time, we appear to have done something recently to make it much worse -- especially on the Mac.
status-firefox30:
--- → affected
status-firefox31:
--- → affected
tracking-firefox30:
--- → ?
tracking-firefox31:
--- → ?
Keywords: topcrash-mac
Comment 12•11 years ago
|
||
Could some kind QA person please find a list of URLs most closely associated with these crashes?
Whiteboard: [QA wanted]
Comment 13•11 years ago
|
||
Most of the Mac crashes are null-dereferences, so this might work as a bandaid patch.
Kyle, I'm asking you to review because you have hg blame for the code I'm changing.
I've started an all-platform tryserver build, whose results should eventually be available here:
https://tbpl.mozilla.org/?tree=Try&rev=45714bba0168
Attachment #8409191 -
Flags: review?(khuey)
Comment 14•11 years ago
|
||
Comment on attachment 8409191 [details] [diff] [review]
Potential bandaid patch
But apparently Kyle Huey is away, so I'll try Andrew McCreight.
Attachment #8409191 -
Flags: review?(khuey) → review?(continuation)
Comment 15•11 years ago
|
||
Forgot to mention that this is where these crashes happen:
http://hg.mozilla.org/mozilla-central/annotate/7fe3ee0cf8be/xpcom/base/CycleCollectedJSRuntime.cpp#l1006
Comment on attachment 8409191 [details] [diff] [review]
Potential bandaid patch
Review of attachment 8409191 [details] [diff] [review]:
-----------------------------------------------------------------
My vacation starts in 10 minutes, so you should request review from mccr8 again.
::: xpcom/base/CycleCollectedJSRuntime.cpp
@@ +1003,5 @@
>
> nsISupports* wrapper = items->ElementAt(lastItemIdx);
> + if (!wrapper) {
> + continue;
> + }
It seems quite surprising to me that we would get a null pointer in here, but if we do, we should still remove it from the array to avoid having to memcpy as we destroy the rest of the array. So you should use NS_IF_RELEASE rather than adding an early continue.
Attachment #8409191 -
Flags: review?(continuation) → review-
Comment 17•11 years ago
|
||
Thanks, Kyle, for your suggestion.
Attachment #8409191 -
Attachment is obsolete: true
Attachment #8409205 -
Flags: review?(continuation)
Comment 18•11 years ago
|
||
For what it's worth I've started another tryserver build:
https://tbpl.mozilla.org/?tree=Try&rev=76e2a3694db2
Steven, I'll give it a try a bit later today!
Keywords: qawanted
Comment 20•11 years ago
|
||
Comment on attachment 8409205 [details] [diff] [review]
Bandaid patch rev1
Hmm. It is probably better to filter these out when we're adding things to the array. I'll look up where that would happen, but until then, maybe Olli has more thoughts here.
Attachment #8409205 -
Flags: review?(continuation) → review?(bugs)
Here's some urls:
https://www.facebook.com/
https://twitter.com/
http://www.odnoklassniki.ru/
http://www.youtube.com/
https://www.google.co.in/
http://mwap.me.uk/mwap-rx6.html
http://www.yandex.ru/
https://www.google.de/
http://www.ynet.co.il/home/0,7340,L-8,00.html
http://dicel.fr/selection-produit/
I tried quite a few, on MacBook Air running 10.8.5 and Firefox 31.0a1 (2014-04-21) but wasn't able to reproduce the crash.
Reporter | ||
Comment 22•11 years ago
|
||
I reproduce it pretty reliably on https://lodgemaster.oa-bsa.org/lodge/client but that requires a login.
FWIW it typically happens after I let the page idle for several hours in a background tab after logging in.
Comment 24•11 years ago
|
||
(In reply to comment #22)
Dave, could you try the following tryserver build for a day or two?
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-76e2a3694db2/try-macosx64/firefox-31.0a1.en-US.mac.dmg
It was made using my rev1 patch.
Reporter | ||
Comment 25•11 years ago
|
||
(In reply to Steven Michaud from comment #26)
> Dave, could you try the following tryserver build for a day or two?
OK, running it now, got the above page opened, and I'll let you know what happens.
Reporter | ||
Comment 26•11 years ago
|
||
Possibly of note, I was previously running Aurora, and this is a trunk build, so hopefully this crash was happening on Nightly, too.
Comment 27•11 years ago
|
||
> so hopefully this crash was happening on Nightly, too.
It certainly seems so. This is a Mac topcrasher on the 30 and 31 branches.
It's my hunch that the recent Mac null-dereferences have a different origin from the other crashes -- those that go back many FF versions, and many of which aren't null-dereferences. Your results should help us tell whether or not I'm right.
If you don't see any of these crashes with my tryserver build, I'm likely to be right. But if you do see even one of *these* crashes (in ReleaseSliceNow()), I'm likely to be wrong.
Comment 28•11 years ago
|
||
Comment on attachment 8409205 [details] [diff] [review]
Bandaid patch rev1
We can try this, but I'll look at the real fix too.
Doesn't sound like a CC bug, but bug in something using DeferredFinalize
Attachment #8409205 -
Flags: review?(bugs) → review+
QA Contact: lhenry
Comment 29•11 years ago
|
||
Comment on attachment 8409205 [details] [diff] [review]
Bandaid patch rev1
Landed on mozilla-inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/bace819903bb
Comment 30•11 years ago
|
||
Assignee: nobody → smichaud
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
Reporter | ||
Comment 31•11 years ago
|
||
(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #27)
> (In reply to Steven Michaud from comment #26)
> > Dave, could you try the following tryserver build for a day or two?
>
> OK, running it now, got the above page opened, and I'll let you know what
> happens.
And it just now crashed, been running continuously since I made the above comment though, so it took that long to trigger it.
The crash signature is different, but it took out Silverlight with it just like before. (separate crash dialog from the OS for that one).
Not sure which one is which, it logged all three of these at once:
bp-ef1df32e-c458-4f4f-9220-f25092140425
bp-1ed72539-b256-4e0f-adc2-38d672140425
bp-hr-20140425-59a24ac5-41b4-4b32-b5e1-8ade8b66df53
Comment 32•11 years ago
|
||
> bp-hr-20140425-59a24ac5-41b4-4b32-b5e1-8ade8b66df53
I can't find this one.
> bp-ef1df32e-c458-4f4f-9220-f25092140425
> bp-1ed72539-b256-4e0f-adc2-38d672140425
And it's a real pain that neither of these has reliable symbols, even for XUL stuff. Let me download a copy of the tryserver build and try atos on it.
Comment 33•11 years ago
|
||
> bp-ef1df32e-c458-4f4f-9220-f25092140425
Here are the top few lines of this stack (for the main process) from atos:
ReleaseSliceNow(unsigned int, void*) (in XUL) + 164
mozilla::IncrementalFinalizeRunnable::ReleaseNow(bool) (in XUL) + 140
mozilla::IncrementalFinalizeRunnable::Run() (in XUL) + 38
...
And, from looking at the machine code for ReleaseSliceNow(), it appears the crash is at the following line:
NS_IF_RELEASE(wrapper);
But if so the crash can't be a null deference, as that crash stack says it is!
So the Mac crashes that Socorro reports as null dereferences may in fact not be null deferences!
If so all bets are off. But we should still watch the crash stats to see if my bandaid patch makes any difference in these crashes' frequency. The first m-c nightly it will appear in is tomorrow's.
Comment 34•11 years ago
|
||
> NS_IF_RELEASE(wrapper);
000000000002e41f 4885DB test rbx, rbx
000000000002e422 7409 je 0x2e42d
000000000002e424 488B03 mov rax, qword [ds:rbx]
000000000002e427 4889DF mov rdi, rbx
000000000002e42a FF5010 call qword [ds:rax+0x10]
Here's the assembly code for this line. The crash is supposed to take place at address 0x2e424. As you can see it can't possibly be a null dereference.
Comment 35•11 years ago
|
||
If we're operating on dead memory another thread could be nulling it between the test and the mov. Given the symptoms that seems reasonably likely.
Comment 36•11 years ago
|
||
Actually I suspect Socorro is using a 32-bit unsigned integer to store the "Crash Address", and that this is the cause of the trouble. Next week I'll open a bug on this.
Comment 37•11 years ago
|
||
I've opened bug 1002564 on the possible Breakpad/Socorro bug.
Comment 38•11 years ago
|
||
A bunch of (supposedly) null-dereference crashes have happened in the Mac m-c nightlies for 2014-04-26 and 2014-04-27. My bandaid patch didn't fix them :-(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [QA wanted]
Steven let me know if there's anything you need. It still looks like a top crashing signature for Firefox 31.0a1.
Comment 40•11 years ago
|
||
Right now I'm waiting for my patch for bug 1002564 to get into trunk nightlies (starting with tomorrow's). It won't fix this bug. But I think it'll show that the relatively large number of recent Mac crashes, which Socorro currently reports as null-dereference crashes, aren't that at all. With luck it'll also provide some clues about the actual cause.
Comment 41•11 years ago
|
||
Dropping qawanted keyword from this bug as it doesn't look like there's much QA can do to help here.
From bug 1002564 comment #13 by Steven Michaud:
> ...it doesn't seem to have fixed the issue with the Socorro stacks for bug 997908:
> All the recent Mac crashes (those in builds containing the patch) are still reported as
> NULL-dereferences, even though they can't possibly be (as best I can tell).
>
> I've now thought of more tests I can run to try to get to the bottom of this
> (for example to find out if this is an Apple bug). Let me do them and get
> back before we decide either way.
I think we should block this bug until bug 1002564 is addressed.
Comment 42•11 years ago
|
||
justdave: Are you still seeing this bug? And if so are you willing to try another tryserver build for a day or two?
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-b93830179c78/try-macosx64/firefox-32.0a1.en-US.mac.dmg
This build contains my latest patch for bug 1002564. It doesn't fix this bug, but it *does* partly fix the problem with Breakpad falsely reporting some crashes as null-deferences. My hope is that if/when you see this crash again with this build, you'll be able to tell us the actual crash address.
Comment 43•11 years ago
|
||
Steven: I just hit this signature running the beta. The browser crashed when Firefox was idle and I believe only one tab was open at the time: https://crash-stats.mozilla.com/report/index/df40a1d3-18f0-4fa1-b3bb-a7ab02140520 (Note: this wasn't just a plugin crash - it took the whole browser down).
Comment 44•11 years ago
|
||
Marcia: Since comment #6 I haven't thought this was a plugin bug, so your report doesn't surprise me.
Do you see this bug often? If so, could you try out my tryserver build from comment #44 for a couple of days?
Reporter | ||
Comment 45•11 years ago
|
||
(In reply to Steven Michaud from comment #44)
> justdave: Are you still seeing this bug? And if so are you willing to try
> another tryserver build for a day or two?
Yes, frequently (well, every few days still). Sure, I'll give it a run.
Reporter | ||
Comment 46•11 years ago
|
||
(In reply to Steven Michaud from comment #44)
> http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-b93830179c78/try-macosx64/firefox-32.0a1.en-US.mac.dmg
https://app.work.com renders as a blank page using this build. Is that specific to this build or shall I file a bug against nightly?
Comment 47•11 years ago
|
||
(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #48)
> https://app.work.com renders as a blank page using this build. Is that
> specific to this build or shall I file a bug against nightly?
That sounds vaguely like a cache loading bug that has been very recently fixed. Try doing force reload (shift click on the reload) and see if that helps.
Comment 48•11 years ago
|
||
> https://app.work.com renders as a blank page using this build.
I just tried loading this page in my tryserver build (on OS X 10.8.5), and had no trouble.
Reporter | ||
Comment 49•11 years ago
|
||
Force reload didn't help (that was the first thing I tried). Were you logged in when you visited work.com? Their site is one of those ones that behaves very differently if you're logged in. I'm on OS X 10.9.3 also. I'm just running Safari for that site for the time being, so I'm okay so far. I'm told on IRC that Asa said there's a ton of sites busted on Nightly right now (should be fixed in a couple days) but that's the only one I've run into so far.
Comment 50•11 years ago
|
||
> Were you logged in when you visited work.com?
No, and I don't have an account there.
Try the 2014-05-19 and 2014-05-20 m-c nightlies. My tryserver build was made on top of trunk code "between" those two nightlies.
It's *very* unlikely that my patch is responsible for the behavior you're seeing ... unless (say) one of your plugins is crashing when you visit https://app.work.com.
Comment 51•11 years ago
|
||
> I don't have an account there
Actually I *do* have an account there -- one of my Mozilla accounts. And I *can* reproduce what you report. But I also see it with the 2014-05-19 m-c nightly and not with the 2014-05-20 m-c nightly -- so the problem is unrelated and seems already to have been fixed.
Comment 52•11 years ago
|
||
justdave: Do you want me to make another tryserver build (based on current m-c code)? Or is what you have tolerable enough to live with for the next day or two?
Reporter | ||
Comment 53•11 years ago
|
||
yeah, after talking with other people on IRC I'm pretty sure it's not you. This range of nightlies seems whacked out right now. Surveygizmo has issues, too (it just flipped out on me in the middle of taking a survey), and so does Yammer. So far I'm making due with Safari on the weird sites until I get the ReleaseSlice crash again. Asa says this is bug 1010783, which is now fixed, so a new try build against current m-c would probably be helpful.
Comment 54•11 years ago
|
||
I started another tryserver build at bug 1002564 comment #28.
Reporter | ||
Comment 55•10 years ago
|
||
Haven't managed to figure out how to download that try build yet, but I just triggered the crash again on your previous one. :-)
bp-95006357-f6b5-4e2d-aa1e-f059b2140521
Comment 56•10 years ago
|
||
Thanks, Dave, for the report!
But it contains bad news -- the crash address is still NULL. So the true crash address must be > 0x7fffffffffff (an Apple bug my current patch for bug 1002564 doesn't (yet) fix or work around).
As I said at bug 1002564 comment #26, there's a chance I'll be able to work around this bug in Breakpad (even though it's not a Breakpad bug). But there a bunch of other things currently eating my time, and I'm not sure when I'll be able to get back to bug 1002564 (or whether I'll be successful when I do).
I think we need to get someone else than me to start working on this bug very soon -- someone who knows CC code (the code where the crashes happen). We can't keep waiting for bug 1002564.
Assignee: smichaud → nobody
Updated•10 years ago
|
status-firefox32:
--- → affected
I hit this yesterday: https://crash-stats.mozilla.com/report/index/aa5066b6-ff91-45f3-aae4-b7f842140520
The URL given was https://www.irccloud.com/#!/ircs://irc.mozilla.org:6697/%23mozwebqa, and I *think* I noticed an IRCCloud notification at the top of my cloud-IRC session saying something about "WebSockets having issues; you might try restarting your browser, if a reload doesn't fix it.) Hope that helps!
Comment 58•10 years ago
|
||
(In reply to Steven Michaud from comment #58)
> I think we need to get someone else than me to start working on this bug
> very soon -- someone who knows CC code (the code where the crashes happen).
> We can't keep waiting for bug 1002564.
Sylvestre, could you please escalate this?
Flags: needinfo?(sledru)
Comment 59•10 years ago
|
||
FYI, this has risen to #8 in the Firefox 31 topcrash report and #1 in the Firefox 31 Mac OS X topcrash report.
Comment 60•10 years ago
|
||
Benjamin, are you the right person to look to on this for a deeper dive into the CC code? If not, can you recommend someone?
Flags: needinfo?(sledru) → needinfo?(benjamin)
Comment 61•10 years ago
|
||
Not me, no. If this were a CC bug, probably mccr8, but without steps we really don't know whether the bug is actually in the CC or elsewhere in the codebase. Other than asking justdave to do an ASAN/valgrind run and see if it finds something interesting, I don't think we have clear next steps here.
Flags: needinfo?(benjamin)
Comment 62•10 years ago
|
||
As Olli said in comment 30, it seems like the most likely cause is something misusing DeferredFinalize.
Though the fact that we're hitting a null deref immediately after a null check is mysterious...
Comment 63•10 years ago
|
||
The null isn't real, it's a crash reporting artifact. Assume that the pointer is non-null.
Comment 64•10 years ago
|
||
I'll try to think about something we can do.
Assignee: nobody → continuation
Comment 65•10 years ago
|
||
For what it's worth, I'm doing a Mac opt universal ASan build for justdave.
Comment 66•10 years ago
|
||
I was unable to do a universal build (for some reason the clang I need to do ASan builds won't do 32-bit builds). So here's a 64-bit only build:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
Please try it, justdave, and let us know your results.
Some of your plugins (notably Silverlight) won't work. And it may turn out that, for arbitrary reasons, you can't reproduce the ReleaseSliceNow() crashes with this build. But it seems like you're our best hope for moving forward on this bug :-)
Comment 67•10 years ago
|
||
> http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
On second thought don't bother. For some stupid reason this build includes a dependency specific to the version of clang I used. I'll see if I can repair it (to add in all the dependencies it needs). If so I'll repost a new copy at the same URL.
Comment 68•10 years ago
|
||
(Following up comment #69)
On second thought that will take too much time. Can someone else do a "portable" ASan build for justdave?
Reporter | ||
Comment 69•10 years ago
|
||
I have the capability to build my own, too, if someone can point me at instructions. I've built nightlies before but it's been a while.
Comment 70•10 years ago
|
||
https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Firefox_and_Address_Sanitizer#ManualBuild
Here are the instructions I followed to get a 64-bit build. They worked for me.
Comment 71•10 years ago
|
||
And here's the mozconfig I used.
Reporter | ||
Comment 72•10 years ago
|
||
OK, apparently I get the same results. When I try to make it build universal the build crashes. 64bit build runs, but Silverlight doesn't work, and I can't run without it (I use it constantly - and it's probably at least a catalyst to triggering this crash - that's probably why I can reproduce it so often).
Comment 73•10 years ago
|
||
Dave, I think I can see my way to creating a patch for bug 1002564 that logs to the console the "thread state" at the time of a crash -- basically the contents of all the user-level registers. Unfortunately we'll probably never get this into Breakpad, but it *might* let us discover the actual address you're crashing at, if you run it yourself for a day or two.
I'll post it once I have it ... if I do.
As for getting universal builds to work (and making them portable), I'm sure it's just a matter of building clang "correctly". But it may take me a day or two to figure that out -- time which I won't have before next week.
Comment 74•10 years ago
|
||
Steven, does this bandaid patch work well enough (and is low risk enough) for consideration to uplift to Beta? Today's Beta is the last we can feasibly take this on and get enough feedback to be sure it can stick.
Flags: needinfo?(smichaud)
Comment 75•10 years ago
|
||
The bandaid patch doesn't work at all :-(
The reason is that these aren't null-dereference crashes.
Flags: needinfo?(smichaud)
Comment 76•10 years ago
|
||
In that case we will need to leave this to be fixed (hopefully) in FF31 as we're too late to take on any speculative work in FF30 and this has no current, verified fix.
tracking-firefox32:
--- → +
Comment 77•10 years ago
|
||
Dave, I've got another tryserver build for you to try, if you're willing:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-6b8002b98fd5/try-macosx64/firefox-32.0a1.en-US.mac.dmg
Once again, this won't fix your crashes. But it does log extra information (to the console and to /var/log/system.log) when a crash is reported to Breakpad -- the "thread state" (the values, as of the crash, of all the user-level registers in the crashing thread). This information does *not* get included in a minidump or reported to Socorro. So when a crash happens, you'll need both to look at the Socorro crash stack and to use the Console app to find the "thread state".
You may actually see multiple "thread states" -- some for various instances of plugin-container. What we need, though, is the "thread state" for Firefox. Hopefully one or more of the thread state's registers will contain this bug's actual crash "address".
Comment 78•10 years ago
|
||
The reason why universal Mac ASan builds don't work (why they won't build) is that 32-bit ASan builds won't build.
At least for the time being, I've given trying to fix this problem. The main reason is that I found the following quote at https://bugzilla.mozilla.org/show_bug.cgi?id=753148#c6 from Christian Holler (decoder):
"OSes we would like to target now are Linux 64 and Mac OSX 64. 32 bit has less priority right now because there are several problems with stack space that prevent it from working properly on all tests."
This leads me to suspect that, even if we can fix the build problem, 32-bit ASan builds will be unusable.
Am I right, Christian?
Flags: needinfo?(choller)
Comment 79•10 years ago
|
||
> Hopefully one or more of the thread state's registers will contain this bug's actual crash "address".
Comment #36 suggests that the register will be rbx.
Reporter | ||
Comment 80•10 years ago
|
||
(In reply to Steven Michaud from comment #79)
> Dave, I've got another tryserver build for you to try, if you're willing:
Running it now, will let you know.
Comment 81•10 years ago
|
||
(In reply to Steven Michaud from comment #80)
>
> This leads me to suspect that, even if we can fix the build problem, 32-bit
> ASan builds will be unusable.
>
> Am I right, Christian?
Yes, I think so. The reduced stack space on 32 bit previously caused startup errors for Firefox (e.g. "Too much recursion" from the JS engine because stack space was exhausted too quickly). This typically happens in the parser because our parsing works recursively and stack frames grow really large with Clang Inlining+ASan.
We should still try to fix the build errors (or just create 64-bit only ASan builds). There is another problem on Mac though that would have to be solved too, that is bug 923916. It has to do with the fact that on Mac, ASan uses a dylib instead of statically linking the ASan runtime to the target binary.
Flags: needinfo?(choller)
Reporter | ||
Comment 82•10 years ago
|
||
Crash triggered. Syslog data attached.
https://crash-analysis.mozilla.com/hang-reports/2014/05-29/hr-20140529-880016b7-3d3b-48ff-b0bc-c28b93eb95f1.html
bp-347f536c-345b-4c80-9a0d-81aff2140529
Reporter | ||
Comment 83•10 years ago
|
||
FWIW, the event that appears to have triggered the crash was clicking a URL to a PNG file on dropbox in my IRC client, which brought Firefox to the front, hung for a minute or so, then crashed.
Comment 84•10 years ago
|
||
(In reply to comment #84)
> bp-347f536c-345b-4c80-9a0d-81aff2140529
This isn't a ReleaseSliceNow() crash. Also Firefox is running in 32-bit mode (which you weren't doing previously). And from the timestamp (and other considerations) it doesn't match what you copied from your syslog.
Even the syslog Firefox crash isn't a ReleaseSliceNow() crash -- it has a non-zero crash address. And both plugin-container processes are 64-bit (though Silverlight is 32-bit only).
But one of your plugin-container crashes does contain this as part of the thread state of the "crashing" thread:
rbx: 0x5a5a5a5a5a5a5a5a
And the Firefox thread state contains two registers whose contents is 0x5a5a5a5a.
Comment 85•10 years ago
|
||
Dave, could you keep trying for a ReleaseSliceNow() crash? :-)
Reporter | ||
Comment 86•10 years ago
|
||
(In reply to Steven Michaud from comment #86)
> This isn't a ReleaseSliceNow() crash. Also Firefox is running in 32-bit
> mode (which you weren't doing previously).
The symptoms matched, so I assumed... And I'm not running in 32bit, it's set to launch native, and Activity Monitor confirms it's currently running in 64bit mode (granted, I didn't think to look before it crashed).
Reporter | ||
Comment 87•10 years ago
|
||
Yep, I'll keep trying.
Comment 88•10 years ago
|
||
Also, could you post Breakpad crash IDs for both of the plugin-container crashes?
Comment 89•10 years ago
|
||
>> bp-347f536c-345b-4c80-9a0d-81aff2140529
>
> This isn't a ReleaseSliceNow() crash. Also Firefox is running in 32-bit mode (which
> you weren't doing previously).
This crash ID is actually for the Silverlight plugin.
But (apparently) you were also running two other 64-bit plugins. Its crash IDs for those that I want to see -- particularly the one whose rbx == 0x5a5a5a5a5a5a5a5a.
Also the Firefox crash id, if you can find it. It *might* have a ReleaseSliceNow() crash.
Reporter | ||
Comment 90•10 years ago
|
||
Those are the only breakpad crashes that were dropped. The next one in the list by date is from the 25th.
Comment 91•10 years ago
|
||
(In reply to Steven Michaud from comment #86)
> rbx: 0x5a5a5a5a5a5a5a5a
Hmm, that's the value we use for poisoning on freeing memory.
Comment 92•10 years ago
|
||
As per bug 1002564 comment #39, the latest minidump-stackwalk will display all the user-level registers for the top frame of the crashing thread. And as minidumps have contained this information all along, it just occurred to me to download a few corresponding to recent instance of these crashes. They all have the following:
rbx = 0x5a5a5a5a5a5a5a5a
So this is the crash address for these crashes.
You need special permissions to download minidumps from http://crash-stats.mozilla.com. Here's how to download and build the latest minidump-stackwalk:
1) Install a reasonably recent XCode (I have 4.5.2) and the latest XCode commandline tools.
2) Visit https://github.com/mozilla/socorro and click "download zip".
3) CC=clang CXX=clang++ make breakpad
4) CC="clang -Wno-switch" CXX="clang++ -Wno-switch" make stackwalker
Summary: crash in ReleaseSliceNow(unsigned int, void*) → crash in ReleaseSliceNow(unsigned int, void*) at 0x5a5a5a5a5a5a5a5a
Comment 93•10 years ago
|
||
Something else also occurred to me: It may be possible to do a universal build that combines a 64-bit ASan build with a 32-bit "regular" build. I'll try that on Monday.
Comment 94•10 years ago
|
||
Dave, you don't have to keep waiting for a ReleaseSliceNow crash to happen in the build I gave you. But if I can manage to do a special ASan universal build as per comment #95, I'll post the instructions here.
Updated•10 years ago
|
Summary: crash in ReleaseSliceNow(unsigned int, void*) at 0x5a5a5a5a5a5a5a5a → crash in ReleaseSliceNow(unsigned int, void*) accessing memory at 0x5a5a5a5a5a5a5a5a
Comment 95•10 years ago
|
||
I *was* able to create a universal ASan build, with only the 64-bit component actually using ASan. I used this mozconfig file.
You need to build the clang (and llvm) you use to do the 64-bit part of the build as per https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Firefox_and_Address_Sanitizer#ManualBuild. But for the 32-bit part you should use the clang that comes with XCode's command line tools.
After the build is finished, run "make package" in obdir/i386 or objdir/x86_64, then copy the Nightly app bundle from the resulting DMG to somewhere on your hard drive to run it. It will only run on the machine where you built it, because it (the 64-bit part of it) depends on a clang runtime that's located where you built llvm.
Please try this, Dave, and let us know your results.
Attachment #8427219 -
Attachment is obsolete: true
Comment 96•10 years ago
|
||
For what it's worth, I've created a genuinely portable Mac ASan universal build and posted it here:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
You can use this, Dave, if you'd rather not build it yourself.
I hacked it up using hdiutil and install_name_tool.
Comment 97•10 years ago
|
||
Dave has told me (on irc) that my ASan build crashes on startup. He's running on OS X 10.9.3. I just tried it on 10.9.3 and got the same crash. It's almost certainly unrelated to the ReleaseSliceNow() crashes, but I'll check to make sure.
Thanks for your help so far, Dave! For now at least, there's nothing more we can reasonably ask you to do.
So it looks like we're not going to get new information here anytime soon.
Comment 98•10 years ago
|
||
The ASan crashes you get on OS X 10.9.3 are due to a bug (or bugs) in Apple Gestalt code.
Comment 99•10 years ago
|
||
I get this crash fairly consistently:
bp-49abfea0-d578-4566-ba16-3ed852140605
bp-cf3017b1-331c-4ac8-8527-dedb42140530
bp-109d1ddc-40ec-41df-8112-798662140529
bp-450e80b0-3e1f-4a72-b33a-7e8d52140524
Steps to reproduce:
1. I have a separate window full of youtube videos (15 tabs). I have opted into their HTML5 version (not sure if this matters), but I'm playing a Flash-only video.
2. Play a video halfway.
3. Work on your main Firefox window (250+ tabs), the usual surfing and all that, for an unknown period of time.
4. Go back to the window full of youtube videos. Resume playing that video.
I'll then crash occasionally, but only once or twice daily, of this STR that is done fairly often.
Comment 100•10 years ago
|
||
Gary, what version of OS X are you running? If it's not 10.9.3, try my ASan build from comment #98.
Comment 101•10 years ago
|
||
(In reply to Steven Michaud from comment #102)
> Gary, what version of OS X are you running? If it's not 10.9.3, try my ASan
> build from comment #98.
It is indeed 10.9.3, unfortunately.
Comment 102•10 years ago
|
||
> The ASan crashes you get on OS X 10.9.3 are due to a bug (or bugs) in Apple Gestalt code.
Actually they're an unacknowledged and Mac-specific bug in LLVM:
http://lists.cs.uiuc.edu/pipermail/llvmbugs/2013-June/028916.html
I fixed this in my local copy of LLVM, then redid my build and copied it over the one I originally posted, here:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
Please try this, Dave and Gary, and let us know your results.
Comment 103•10 years ago
|
||
(In reply to Steven Michaud from comment #104)
> Please try this, Dave and Gary, and let us know your results.
I hit bug 1021612 which may or may not be related, but which is blocking further testing.
Comment 104•10 years ago
|
||
I just updated my ASan build to current m-c code, which contains the fix for bug 1021612.
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
Please try it out, Gary and Dave!
Comment 105•10 years ago
|
||
(In reply to Steven Michaud from comment #106)
> I just updated my ASan build to current m-c code, which contains the fix for
> bug 1021612.
Thanks Steven! I am now using this build.
Comment 106•10 years ago
|
||
I've been testing this build, so far I may have found another issue:
2014-06-11 00:23:13.825 plugin-container[32796:d07] CoreText performance note: Client called CTFontCreateWithName() using name "Times Roman" and got font with PostScript name "Times-Roman". For best performance, only use PostScript names when calling this API.
2014-06-11 00:23:31.501 plugin-container[32796:d07] CoreText performance note: Client called CTFontCreateWithName() using name "Lucida Grande" and got font with PostScript name "LucidaGrande". For best performance, only use PostScript names when calling this API.
=================================================================
==14071==ERROR: AddressSanitizer: heap-use-after-free on address 0x607003486e60 at pc 0x10915ae92 bp 0x7fff5fbfbb90 sp 0x7fff5fbfbb88
READ of size 8 at 0x607003486e60 thread T0
==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6)
==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6)
==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6)
==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6)
==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6)
(the above line is repeated countless times)
Because this stack is useless, I'm not able to tell how useful this is.
Comment 107•10 years ago
|
||
(In reply to Gary Kwong [:gkw] [:nth10sd] from comment #107)
> (In reply to Steven Michaud from comment #106)
> > I just updated my ASan build to current m-c code, which contains the fix for
> > bug 1021612.
>
> Thanks Steven! I am now using this build.
Just minutes later after hitting comment 108, I hit bug 1023758. (again, which may or may not be related)
Comment 108•10 years ago
|
||
(In reply to comment #108)
I wonder if you'd get a reasonable (though non-symbolized) stack if you didn't use llvm-symbolizer. I think it's possible to symbolize such stacks after the fact.
And by the way, how are you running the ASan builds? Are you running them from the command line, first setting the ASAN_SYMBOLIZER_PATH environment variable?
And thanks, by the way, for your work with my Mac ASan builds! I'm going to start using them myself, as my daily browser. Seems like there are lots of new bugs to be found.
Comment 109•10 years ago
|
||
(In reply to Steven Michaud from comment #110)
> And by the way, how are you running the ASan builds? Are you running them
> from the command line, first setting the ASAN_SYMBOLIZER_PATH environment
> variable?
Example command line:
ASAN_SYMBOLIZER_PATH=/Users/skywalker/llvm/build/bin/llvm-symbolizer /Applications/Nightly.app/Contents/MacOS/firefox-bin 2>&1 | tee ~/Downloads/asanLog1.txt
Comment 110•10 years ago
|
||
Dave, are you still seeing these crashes in current trunk nightlies?
If so please try turning off ICC (incremental cycle collection), and see if that makes them go away. To do this, set dom.cycle_collector.incremental to false in about:config.
Flags: needinfo?(justdave)
Comment 111•10 years ago
|
||
The (ICC-related) patch for bug 1023758 hasn't stopped these crashes from happening on trunk (the 33 branch). And though ICC has been turned off on the 32 (Aurora) branch as of the 2017-07-03 nightly (bug 911246 comment #20), these crashes are also still happening there.
So, though mozilla::IncrementalFinalizeRunnable::Run() is on the stack, these crashes aren't (directly) related to incremental cycle collection.
Flags: needinfo?(justdave)
Comment 112•10 years ago
|
||
IncrementalFinalizeRunnable has nothing to do with incremental cycle collector.
IncrementalFinalizeRunnable is about GC calling Release on refcounted object after
GC has collected the relevant JS wrappers.
Reporter | ||
Comment 113•10 years ago
|
||
(In reply to Steven Michaud from comment #112)
> Dave, are you still seeing these crashes in current trunk nightlies?
I don't use Nightly on a day-to-day basis, I primarily use Aurora. I am still seeing this crash on Aurora.
Comment 114•10 years ago
|
||
I keep updating my Mac ASan build about once a week. I've just done so again:
http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg
http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt
These are based on current trunk code. But could you use one for a few days, Dave, to see if you can reproduce one of these crashes in it?
To launch it, you'd typically do something like this (at a Terminal prompt):
1) cd ~/Downloads
2) ASAN_SYMBOLIZER_PATH=~/Desktop/Nightly\ ASan.app/Contents/MacOS/llvm-symbolizer nohup ~/Desktop/Nightly\ ASan.app/Contents/MacOS/firefox 2>&1 | tee firefox-asan.log &
When/if you crash, you'll see an ASan log (fully symbolized) in firefox-asan.log.
Comment 115•10 years ago
|
||
Same as 30, wontfix for 31.
Reporter | ||
Comment 116•10 years ago
|
||
I finally got a good breaking point to restart my browser and try this out...
It dies at launch with:
dyld: lazy symbol binding failed: Symbol not found: ___asan_init_v3
Referenced from: /Applications/Nightly.app/Contents/MacOS/libmozglue.dylib
Expected in: flat namespace
dyld: Symbol not found: ___asan_init_v3
Referenced from: /Applications/Nightly.app/Contents/MacOS/libmozglue.dylib
Expected in: flat namespace
And it leaves an OS X crash dump (which I can upload if you want it). The llvm-symbolizer is what crashes, the app itself continues running and works fine (using it to type this comment).
Reporter | ||
Comment 117•10 years ago
|
||
Apparently llvm-symbolizer crashes every time I open a new page.
Comment 118•10 years ago
|
||
I see that error too -- though only when I load a plugin. As far as I can tell it's benign -- you can continue using the ASan build, and if it crashes a symbolized crash log will get written to firefox-asan.log.
I've already looked a bit into the error, but I haven't yet been able to figure it out. I don't consider it high priority, though -- since it doesn't prevent the ASan builds from doing what they're supposed to.
Comment 119•10 years ago
|
||
> dyld: lazy symbol binding failed: Symbol not found: ___asan_init_v3
> it leaves an OS X crash dump (which I can upload if you want it).
In my case, that error *isn't* associated with a crash dump. I suspect it isn't in your case, either. But I would like to see the crash dump. Please attach it here.
Reporter | ||
Comment 120•10 years ago
|
||
llvm-symbolizer crash dump
Comment 121•10 years ago
|
||
Oops, I misunderstood. I was expecting output from llvm-symbolizer, but this is an Apple crash report. And (as you said) it reports a crash in llvm-symbolizer itself. And now that I look, I see I have similar crash reports (generated whenever I start a plugin).
So apparently llvm-symbolizer crashes every time you open a plugin-container process -- though the plugin-container process itself runs just fine.
But llvm-symbolizer still works fine with the main process, which is what we're interested in here.
Reporter | ||
Comment 122•10 years ago
|
||
I've typically had to be using Firefox about a day and a half to trigger this crash, and I can't keep this running long enough to crash it. It leaks like a sieve, and is using 14 GB of physical RAM right now (I only have 16 GB) and I've only been running it for 24 hours. As soon as I'm done typing this comment I'm restarting it to clear up the memory leak, and that may also reset my chances of reproducing the crash :(
Comment 123•10 years ago
|
||
> It leaks like a sieve
ASan builds do use *lots* of RAM. But I don't know that they leak any worse than Firefox itself (or than the plugins/extensions that it uses).
If you decide to keep running my ASan builds, please keep track of how the RAM usage increases over time, and how this compares to non-ASan builds.
By the way, I update them about once a week, and I've just done so again.
Reporter | ||
Comment 124•10 years ago
|
||
I had 40 GB of VM in use, and after I quit Firefox there was 22 GB of VM in use.
After restarting it, it's using 3GB of RAM, which is more normal (I usually see Firefox in the 3 to 6 GB range in my normal usage without ASan)
Is anything in the ASan log useful even if it didn't crash? I saved the one it made when I restarted it.
Comment 125•10 years ago
|
||
> Is anything in the ASan log useful even if it didn't crash?
I don't think so. But please attach it just in case. I'll mark it obsolete if it's not relevant.
Comment 126•10 years ago
|
||
(In reply to Steven Michaud from comment #125)
> > It leaks like a sieve
>
> ASan builds do use *lots* of RAM. But I don't know that they leak any worse
> than Firefox itself (or than the plugins/extensions that it uses).
>
> If you decide to keep running my ASan builds, please keep track of how the
> RAM usage increases over time, and how this compares to non-ASan builds.
fwiw, I've been using a build from 2 weeks ago constantly and haven't hit anything new thus far. I did seem to notice things getting slower (I have 200+ tabs) after using the browser (opening/closing tabs) for a day or 2, and have had to restart the browser to have a reasonable surfing experience again.
I also have 16Gb RAM, but didn't measure the memory usage.
Comment 127•10 years ago
|
||
> fwiw, I've been using a build from 2 weeks ago constantly and haven't hit anything new thus far.
Do you still see ReleaseSliceNow() crashes in "normal" builds?
Comment 128•10 years ago
|
||
(In reply to Steven Michaud from comment #129)
> Do you still see ReleaseSliceNow() crashes in "normal" builds?
Not tried yet - will switch back to "normal" Nightlies.
Reporter | ||
Comment 129•10 years ago
|
||
I have not yet been able to reproduce the crash in the ASan builds (I'm running it again now - had it going about 12 hours this time and had to restart it because it hung). I had switched back to Aurora for a couple days earlier this week because the ASan build was so slow, and managed to trip the crash twice on Aurora during that time.
Comment 130•10 years ago
|
||
Let's wait until we hear back from Gary. But at this point I strongly suspect it's not possible to reproduce these crashes in my ASan builds.
I get these crashes now about once a day in a normal Nightly build:
bp-79d1fa25-2273-4433-8b55-379212140801
bp-ac5888fe-ffb2-4f01-a0be-869942140729
Haven't isolated any STR yet. Let me know if there's something I can do to help diagnose this.
Comment 132•10 years ago
|
||
(In reply to Steven Michaud from comment #132)
> Let's wait until we hear back from Gary. But at this point I strongly
> suspect it's not possible to reproduce these crashes in my ASan builds.
I haven't hit the crash again in normal Nightlies, but my recent browser usage declined and changed recently (due to various flights around the world). Will keep trying.
Comment 133•10 years ago
|
||
I just hit this crash, while I was using Firefox Sync to sync my bookmarks and I was just having a bookmarks folder open in my bookmarks toolbar.
https://crash-stats.mozilla.com/report/index/fe8843b2-4953-442c-b330-180802140807
Comment 134•10 years ago
|
||
(In reply to Gary Kwong [:gkw] [:nth10sd] slower than usual Aug 04-07, PTO Aug 08 from comment #134)
> I haven't hit the crash again in normal Nightlies, but my recent browser
> usage declined and changed recently (due to various flights around the
> world). Will keep trying.
Hit this now:
https://crash-stats.mozilla.com/report/index/c5becd92-e7e1-4e27-958d-5f2442140809
I was accessing page 3 from page 1 of this ComputerWorld article:
http://www.computerworld.com/s/article/9250266/Microsoft_slashes_IE_support_sets_huge_edict_for_Jan._2016?taxonomyId=125&pageNumber=3
Comment 135•10 years ago
|
||
Andrew - You're listed as the owner although from the recent comments it looks like Steven may actually be driving this one. I'm don't have a great deal of confidence right now that this will be resolved in 32. Do we have next steps here?
Flags: needinfo?(smichaud)
Flags: needinfo?(continuation)
Comment 136•10 years ago
|
||
Not really. There's just not any information to go on.
Flags: needinfo?(continuation)
Comment 137•10 years ago
|
||
This bug is currently stuck hard in the mud.
My last hope was to get someone to reproduce these crashes (or their precursor) in an ASan build. But apparently that's not going to work.
Flags: needinfo?(smichaud)
Comment 138•10 years ago
|
||
I'm marking this bug as won't fix for 32. I would like to see us make a decision about whether we can make any progress on this bug in the 33 cycle.
status-firefox34:
--- → affected
Comment 139•10 years ago
|
||
I just hit this while watching a documentary on YouTube. Had a few simple pages of API documentation open in other tabs.
https://crash-stats.mozilla.com/report/index/d2b66014-b827-4152-89d6-1a3482140823
First time I've had release Firefox crash on me in a long time.
Comment 140•10 years ago
|
||
I just hit this bug in 32 while browsing Google Maps.
https://crash-stats.mozilla.com/report/index/fa607675-76f2-4791-9e01-903de2140901
Unfortunately it does not seem to be reproducible, but I still wanted to mention it here just in case...
Comment 142•10 years ago
|
||
Hi, thought I might mention that FF has crashed for me every couple of days for the last two months with this bug. It's not reproducible (in the sense that it seems to happen on random pages), but it does reliably crash on my machine.
Comment 143•10 years ago
|
||
32.0.1 crash bp-00e81aca-f9cd-4948-bb8b-e3a452140915
Comment 144•10 years ago
|
||
(In reply to Steven Michaud from comment #139)
> This bug is currently stuck hard in the mud.
>
> My last hope was to get someone to reproduce these crashes (or their
> precursor) in an ASan build. But apparently that's not going to work.
Maybe we could work this out if we get more eyes on this?
I just hit this again on a 2014-09-16 nightly, surfing http://www.espnfc.com/ and getting bp-d053e21e-d37f-4d4c-bc12-a8e062140919
Comment 145•10 years ago
|
||
I've been working on a test build that should crash "earlier", and give more informative stacks. But so far I haven't been able to get it to build on the tryservers (the builds consistently fail with the same bizarre and inexplicable errors). It builds just fine locally. Ultimately I may just post a link to one of my local builds.
Comment 146•10 years ago
|
||
I finally got a patch that builds on the tryservers!
Looking again at this bug a few days ago, it seemed likely that the invalid nsISupports object that causes these crashes is already invalid when it's queued for deletion at CycleCollectedJSRuntime::DeferredFinalize(). If we crash there, we're likely to get better (more informative) stacks.
So whoever sees these crashes at least once a day, please use the following build for a day or two, hoping for crashes. When/if you do crash, please post your crash ids here.
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-32ba3bb2e233/try-macosx64/firefox-35.0a1.en-US.mac.dmg
Comment 147•10 years ago
|
||
Steven, if you want to land that, feel free to do so with r=me. We should back it out at some point if it doesn't turn up anything, but it should be relatively harmless.
Comment 148•10 years ago
|
||
Ideally that will crash even with non-ASan builds thanks to poisoning, if the object is in fact bad there.
Comment 149•10 years ago
|
||
> Steven, if you want to land that, feel free to do so with r=me.
Will do. Thanks! Crash reports for tryserver builds don't get properly symbolized, so you have to symbolize them by hand (using atos). So having this in m-c nightlies will be *much* more convenient.
I'll change the comment to indicate that the patch should come out at some point -- either when this bug gets fixed, or if the patch doesn't help fix it.
Comment 150•10 years ago
|
||
What I'll land on m-i.
Attachment #8492208 -
Attachment is obsolete: true
Comment 151•10 years ago
|
||
Comment on attachment 8492240 [details] [diff] [review]
Patch for testing (*not* a fix) -- should make crash stacks more informative
Review of attachment 8492240 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks!
Attachment #8492240 -
Flags: review+
Comment 152•10 years ago
|
||
Comment on attachment 8492240 [details] [diff] [review]
Patch for testing (*not* a fix) -- should make crash stacks more informative
Landed on mozilla-inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/317c684efd2d
Updated•10 years ago
|
Whiteboard: [leave open]
Comment 153•10 years ago
|
||
Backed out for e10s mochitest-bc leaks.
https://hg.mozilla.org/integration/mozilla-inbound/rev/7e06f31a3f01
https://tbpl.mozilla.org/php/getParsedLog.php?id=48476728&tree=Mozilla-Inbound
Comment 154•10 years ago
|
||
Also test_closeOnGC failures:
https://tbpl.mozilla.org/php/getParsedLog.php?id=48477237&tree=Mozilla-Inbound
Comment 155•10 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #155)
> Backed out for e10s mochitest-bc leaks.
s/e10s mochitest-bc leaks/blowing up the world
Please run this through Try before attempting to push it again :)
Comment 156•10 years ago
|
||
Really weird! I have no idea what happened there.
Comment 157•10 years ago
|
||
Those errors were the same as the "bizarre and inexplicable errors" I reported in comment #147.
I'm convinced this is a build system problem. But I'm not willing to put in the work to track it down -- not for a test patch.
I'll ifdef my patch to make it Mac-only (since these crashes are only topcrashers on the Mac), and do a set of Mac tryserver builds and tests.
If all goes well, I'll ask you to review it again, Andrew, and try landing it again.
Comment 158•10 years ago
|
||
https://tbpl.mozilla.org/?tree=Try&rev=a31e02acc6af
If the rats keep biting, I'll just post my own local build where people can download it and test with it.
Comment 159•10 years ago
|
||
For good measure, let's try this patch on all platforms, even though it's ifdefed to only have effect on the Mac. Let's see how far into "bizarre and inexplicable" we truly get:
https://tbpl.mozilla.org/?tree=Try&rev=4a90c03aefd5
Comment 160•10 years ago
|
||
The rats are still biting, so let's forget the tryservers.
Here's a local build of my patch. I've disabled Breakpad and ensured that symbols aren't stripped. So when you crash you'll see an Apple crash report. If you see any crashes in CycleCollectedJSRuntime::DeferredFinalize(), look for a copy of the Apple crash report in ~/Library/Logs/DiagnosticReports/. Attach the report here (don't paste it into a comment).
http://people.mozilla.org/~stmichaud/bmo/firefox-35.0a1.en-US.mac-bugzilla997908-test-DeferredFinalize.dmg
Comment 161•10 years ago
|
||
hey Steve, I've been running your build for a couple of days now but unfortunately did not get the crash (apart from one which took down my whole machine but was probably unrelated). But I can't continue testing it because it performs so badly overall, including leaking 100+Mb of RAM an hour. I would be happy to test a patched build from the current release branch though, if it's possible to build one.
Comment 162•10 years ago
|
||
> because it performs so badly overall, including leaking 100+Mb of RAM an hour
I expect this means that at least some of the test failures were genuine (and not just a build system artifact). So it looks like I'll have to figure out some other way to test in CycleCollectedJSRuntime::DeferredFinalize().
Thanks, though, for trying my test build.
Comment 163•10 years ago
|
||
With an ASan build, you could write code that just touches the object, in some way that it won't be optimized out, hopefully. That won't produce useful stacks outside of an ASan build, though. Any other method on ISupports we could call will cause an addref and release, and thus cause the same leaks.
Comment 164•10 years ago
|
||
We've long known that these crashes take place here:
https://hg.mozilla.org/mozilla-central/annotate/790f41c631cc/xpcom/base/CycleCollectedJSRuntime.cpp#l1081. The assembly code from comment #36 leads me to believe that 'wrapper' itself has been "poisoned" -- that when these crashes happen it equals 0x5a5a5a5a5a5a5a5a. If this is true, then it's likely that the element which becomes 'wrapper' was already poisoned when it was queued for deletion in a call to CycleCollectedJSRuntime::DeferredFinalize().
This patch is as ugly as sin. But it should trigger crashes with much better (more informative) stacks when/if CycleCollectedJSRuntime::DeferredFinalize() is called with a poisoned aSupports. And it should otherwise be safe (not crash).
Andrew, are you willing to have this on the trunk for a week or two, to see if we start getting more interesting crash stacks?
I've started an all-platform tryserver build, which so far seems to be progressing smoothly:
https://tbpl.mozilla.org/?tree=Try&rev=65d03e6f30af
Attachment #8492240 -
Attachment is obsolete: true
Attachment #8493487 -
Flags: review?(continuation)
Comment 165•10 years ago
|
||
(In reply to comment #165)
I don't think there's much point in continuing to test with ASan builds. It's pretty clear that these crashes don't happen with them.
Comment 166•10 years ago
|
||
Comment on attachment 8493487 [details] [diff] [review]
Patch for testing (*not* a fix)
I'd be happy to put this patch in an #ifdef XP_MACOSX block, if the need is felt.
Comment 167•10 years ago
|
||
Is that load really not just going to be optimized out?
Comment 168•10 years ago
|
||
> Is that load really not just going to be optimized out?
I don't think so. But I can check ... however that will take a while. I need to load XUL from one of my tryserver builds into the Hopper Disassembler, and that takes *hours* (since Hopper needs to resolve all the internal cross-references).
In the meantime, can you suggest a way for me to rewrite my patch to ensure that it *doesn't* get optimized out?
Comment 169•10 years ago
|
||
Maybe write a getter method for mInstanceVariable and mark it MOZ_NEVER_INLINE, then use that?
Comment 170•10 years ago
|
||
Andrew: I finally got XUL from my OS X 10.8 opt build to load into Hopper Disassembler ... and you were right -- my patch got optimized away.
I will attempt to deal with this, but probably not until next week. I'm currently very busy with other, more urgent things.
I have a hunch it won't be optimized away if I somehow create a side effect outside of the CycleCollectedJSRuntime::DeferredFinalize() function's scope -- for example by setting a global static variable to the results of whatever->mInstanceVariable. When I have time, I'll try that and see what happens.
Updated•10 years ago
|
Attachment #8493487 -
Flags: review?(continuation)
Comment 171•10 years ago
|
||
Like previous releases, wontfix for 33
This is still the #10 topcrash on Firefox 34.0a2, though it isn't very high volume, with 64/3985 crashes in the last 7 days.
Comment 173•10 years ago
|
||
Let me try my hand at this again.
This patch is *very* simple and *very* ugly, but I think it'll work -- that it won't get optimized away and will crash when it needs to (giving us better stacks).
I'll run this through the tryservers before landing it. But they're currently closed.
Attachment #8493487 -
Attachment is obsolete: true
Attachment #8513044 -
Flags: review?(continuation)
Comment 174•10 years ago
|
||
Comment on attachment 8513044 [details] [diff] [review]
Another patch for testing (*not* a fix)
Review of attachment 8513044 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for being persistent on this.
Attachment #8513044 -
Flags: review?(continuation) → review+
Comment 175•10 years ago
|
||
Comment on attachment 8513044 [details] [diff] [review]
Another patch for testing (*not* a fix)
I've started my tryserver builds:
https://tbpl.mozilla.org/?tree=Try&rev=f3e383000027
These are all platform because some of the B2G builds are done on OS X.
Comment 176•10 years ago
|
||
Comment on attachment 8513044 [details] [diff] [review]
Another patch for testing (*not* a fix)
Overnight I loaded XUL from the opt tryserver build into Hopper Disassembler ... and discovered my debugging code was gone. Then I looked at the build log, and discovered (in a warning) that "optimize" isn't recognized by clang as an attribute.
Sigh.
I'll try again, this time with inline assembly.
Attachment #8513044 -
Attachment is obsolete: true
Comment 177•10 years ago
|
||
Comment 178•10 years ago
|
||
Comment on attachment 8513584 [details] [diff] [review]
Patch for testing (*not* a fix)
And here's another set of tryserver builds, since the first has weird and seemingly unrelated persistent errors.
https://tbpl.mozilla.org/?tree=Try&rev=e325f0b6ac0a
Comment 179•10 years ago
|
||
Comment on attachment 8513584 [details] [diff] [review]
Patch for testing (*not* a fix)
> It doesn't matter if we zap %rax.
Apparently it *does* matter. I'll try again.
Attachment #8513584 -
Attachment is obsolete: true
Comment 180•10 years ago
|
||
(In reply to Steven Michaud from comment #181)
> Comment on attachment 8513584 [details] [diff] [review]
> Patch for testing (*not* a fix)
>
> > It doesn't matter if we zap %rax.
>
> Apparently it *does* matter. I'll try again.
%r11 might be a safe register to clobber (not an argument register, a callee-clobbered register, and one that is probably further down on the register allocator's list of registers).
If that fails, there's always push/mov/pop.
Comment 181•10 years ago
|
||
Here we go again.
https://tbpl.mozilla.org/?tree=Try&rev=653dd4432e5f
Comment 182•10 years ago
|
||
Comment on attachment 8513821 [details] [diff] [review]
Patch for testing (*not* a fix), rev7
My previous patch was wrong about aSupports being in the %rsi register. Which is very peculiar -- there must be some strange things going on on the compiler. This gets around the problem, and also makes sure we don't zap any registers.
Comment 183•10 years ago
|
||
Comment on attachment 8513821 [details] [diff] [review]
Patch for testing (*not* a fix), rev7
The tryserver tests are going well. And I've confirmed (in gdb, using its disassemble command) that is actually present in the opt build.
Attachment #8513821 -
Flags: review?(continuation)
Comment 184•10 years ago
|
||
> that is
that my test code is
Comment 185•10 years ago
|
||
Comment on attachment 8513821 [details] [diff] [review]
Patch for testing (*not* a fix), rev7
I think Nathan is probably a better reviewer for this, as I have no idea what is going on with the assembly.
Attachment #8513821 -
Flags: review?(continuation) → review?(nfroyd)
Comment 186•10 years ago
|
||
Comment on attachment 8513821 [details] [diff] [review]
Patch for testing (*not* a fix), rev7
Review of attachment 8513821 [details] [diff] [review]:
-----------------------------------------------------------------
Inline assembly patches are the best kind of patches.
::: xpcom/base/CycleCollectedJSRuntime.cpp
@@ +1063,5 @@
> +#if defined(XP_MACOSX) && defined(__LP64__)
> + // We'll crash here if aSupports is poisoned (== 0x5a5a5a5a5a5a5a5a). This
> + // is better (more informative) than crashing in ReleaseSliceNow(). See
> + // bug 997908. This patch should get backed out when bug 997908 gets fixed,
> + // or if it doesn't actually help fix that bug.
Nit: maybe "or if it doesn't actually help diagnose that bug"
Attachment #8513821 -
Flags: review?(nfroyd) → review+
Updated•10 years ago
|
Keywords: leave-open
Whiteboard: [leave open]
Comment 187•10 years ago
|
||
Comment on attachment 8513821 [details] [diff] [review]
Patch for testing (*not* a fix), rev7
Landed on mozilla-inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b56b9aa70d22
> Inline assembly patches are the best kind of patches.
They work when all else fails :-)
Comment 188•10 years ago
|
||
Is this patch worth uplifting to get more crash stats, or is the nightly population high enough that this will start showing up in significant volume?
Comment 189•10 years ago
|
||
Comment 190•10 years ago
|
||
(In reply to comment #190)
There were 53 of these crashes in the last week on the 36 branch (so they're still the #5 Mac topcrasher on that branch). I don't think we need to uplift the patch. In a few days we should have enough new crashes (in DeferredFinalize(nsISupports*)) to play with.
Updated•10 years ago
|
Assignee: continuation → nobody
Comment 191•10 years ago
|
||
Since my test patch landed, there have been several more of this bug's crashes, but no crashes at address 0x5a5a5a5a5a5a5a5a in mozilla::cyclecollector::DeferredFinalize(nsISupports*). So it seems my hunch (from comment #166) is wrong, and that this bug happens after the call to mozilla::cyclecollector::DeferredFinalize(nsISupports*).
I don't know where to go from here. Let's leave my patch on the trunk for another week, to see if anything else turns up.
bp-d70bccbf-4266-408b-bc90-7d12f2141103
bp-e1800a3e-bfc6-4ad3-9497-194522141102
bp-2f6cd17a-a6f4-42a9-a725-417362141103
bp-81db71bf-bf36-4558-8425-19e812141103
Comment 192•10 years ago
|
||
> bp-81db71bf-bf36-4558-8425-19e812141103
Just noticed that this crash *isn't* at 0x5a5a5a5a5a5a5a5a. This may be significant ... though I don't currently know how.
Comment 193•10 years ago
|
||
Is it possible to do something horrible like copy out the method pointer from the vtable of the object in deferred finalize, then use that later at the point where we are crashing with the actual object? Then we'd crash in the method itself, rather then when trying to call the method, so maybe the class name would show up in the crash stack. Maybe that's hard to do without rewriting a bunch of code, though.
Comment 194•10 years ago
|
||
Note to myself, for when I come back to this bug:
I wonder if it, like bug 1091801, is come kind of clang bug? Comment #194 indicates it may not be a use-after-free bug.
Comment 195•10 years ago
|
||
(In reply to Andrew McCreight [:mccr8] from comment #195)
> Is it possible to do something horrible like copy out the method pointer
> from the vtable of the object in deferred finalize, then use that later at
> the point where we are crashing with the actual object? Then we'd crash in
> the method itself, rather then when trying to call the method, so maybe the
> class name would show up in the crash stack. Maybe that's hard to do
> without rewriting a bunch of code, though.
Pulling out the vtable of the object is not a problem.
Comment 196•10 years ago
|
||
(In reply to comment #195)
Feel free to try :-)
As for myself, when I come back to this bug I'm going to try to follow my hunch that this is a clang bug. If it *is* a clang bug, it is (I think) mostly likely happening in the call to mozilla::cyclecollector::DeferredFinalize(nsISupports*) itself. While debugging bug 1091801, I noticed that (even in opt builds) there are two distinct binary implementations for mozilla::cyclecollector::DeferredFinalize(nsISupports*) and CycleCollectedJSRuntime::DeferredFinalize(nsISupports*). I haven't yet dug into the tree to figure out why. But this is weird, and what's weird might be broken.
Comment 197•10 years ago
|
||
> there are two distinct binary implementations for
> mozilla::cyclecollector::DeferredFinalize(nsISupports*) and
> CycleCollectedJSRuntime::DeferredFinalize(nsISupports*)
Actually this isn't so wierd. These are two different methods, at two
different places in the tree:
https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/nsCycleCollector.cpp#l3987
https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1047
The first calls the second.
But, in opt builds, the second is inlined inside the first. It's
conceivable this is being done incorrectly. I'll take a closer look
later.
Comment 198•10 years ago
|
||
> But, in opt builds, the second is inlined inside the first. It's conceivable
> this is being done incorrectly. I'll take a closer look later.
I did check, and I didn't see any problems.
In XUL from a recent m-c nightly (loaded into Hopper Disassembler), I compared
the machine code for mozilla::cyclecollector::DeferredFinalize(nsISupports*)
with an inline copy of CycleCollectedJSRuntime::DeferredFinalize(nsISupports*)
to the machine code for CycleCollectedJSRuntime::DeferredFinalize(nsISupports*).
I didn't see any significant differences.
I'm currently out of ideas here.
Comment 199•10 years ago
|
||
> I'm currently out of ideas here.
Not for long :-)
DeferredFinalize(nsISupports*) is often called on one or more secondary threads. It basically just calls mDeferredSupports.AppendElement(aSupports). But CycleCollectedJSRuntime::FinalizeDeferredThings() is, as far as I know, called on the main thread, and also access mDeferredSupports, without (as far as I can tell) any locking mechanism.
Could that be what's causing this bug?
CycleCollectedJSRuntime is a per thread singleton, so each thread should have its own object.
Comment 201•10 years ago
|
||
Even so, CycleCollectedJSRuntime.mDeferredSupports can apparently be accessed (written to and copied from) on two different threads.
When I have time I'll dig further into this, and with luck write a test patch.
Comment 202•10 years ago
|
||
Just for the record, here's where things now stand:
We crash here:
https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1102
'wrapper' is almost always 0x5a5a5a5a5a5a5a5a. The only possible explanation for this is that at least one of the "items" in 'items' is set to this value.
In almost all cases (or perhaps all of them), ReleaseSliceNow() is called with 'aSlice' == 100 and 'aData' == a valid nsTArray<nsISupports*>*. By looking at the machine code for ReleaseSliceNow() and the register values at-crash (in the raw dumps of the crash logs), the crash rarely (if ever) happens the first time through the "for (uint32_t i = length; i > length - aSlice; --i) {}" loop.
'items' and 'aData' in ReleaseSliceNow() come from 'mSupports' and 'aSupports', here:
https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1132
These in turn come from CycleCollectedJSRuntime.mDeferredSupports, here:
https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1214
CycleCollectedJSRuntime.mDeferredSupports is populated by calls to DeferredFinalize(), here:
https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1061
We know from the results of my test patch that 'aSupports' is *never* == 0x5a5a5a5a5a5a5a5a in calls to DeferredFinalize(). So how did the 'items'/'aData' array in ReleaseSliceNow() acquire an 'item' that == 0x5a5a5a5a5a5a5a5a? I think this is only possible as the result of some kind of "corruption" -- for example a clang bug or thread contention.
Comment 203•10 years ago
|
||
Oh the value we're pulling out of the array is 0x5a? I guess that makes more sense. For some reason, I thought that we were pulling a valid value out, but then we hit the poison value when we tried to call Release() on it. So it is the array that is not being held alive properly, not the things the contents of the array point to.
Yeah, I could imagine that some GC background thread is incorrectly calling deferred release. I wrote a patch just now that checks any time we access one of these arrays if we're on the main thread. This will check if we're accessing this array only on the main thread, for the main thread. For workers, it won't check anything useful, but hopefully these are mostly main thread crashes. (The GC doesn't use a background thread for workers IIRC so that shouldn't be an issue.)
Comment 204•10 years ago
|
||
Here are the values at-crash for some of the user-level registers, at least in the 2014-11-02-03-02-04-mozilla-central nightly:
r12 == wrapper
r14 == aData
r15 == length - aSlice
rax == i
rbx == lastItemIdx
Here are some crash reports for that nightly:
bp-d70bccbf-4266-408b-bc90-7d12f2141103
bp-7931d847-2e43-41db-9378-4bdef2141104
bp-53fd5fec-b055-4de4-b1a2-d06fd2141104
bp-0be3ecf3-cc61-4ecf-bda1-c51ce2141105
Comment 205•10 years ago
|
||
This didn't turn up anything in local testing. Maybe a try push will turn up something, or we could try landing a version that crashes in non-debug builds.
Comment 206•10 years ago
|
||
> we could try landing a version that crashes in non-debug builds
I'd say try that on the tryservers. But if it causes lots of problems it's not the answer -- probably because it's not specific enough. After all, these crashes are very rare, and normally never show up in any of our automated tests.
Comment 207•10 years ago
|
||
Note that the one crash reported at bug 1091801 (in mozilla::cyclecollector::DeferredFinalize(nsISupports*)) is on a secondary thread. So, if I remember right, were all the crashes I saw as I was debugging that bug. But I haven't had a chance to retest.
Comment 208•10 years ago
|
||
(In reply to Steven Michaud from comment #209)
> Note that the one crash reported at bug 1091801 (in
> mozilla::cyclecollector::DeferredFinalize(nsISupports*)) is on a secondary
> thread. So, if I remember right, were all the crashes I saw as I was
> debugging that bug. But I haven't had a chance to retest.
The stack in that bug looks like it is on the "main thread" for a worker, so that's fine. It should be using a separate CC runtime than the one on the main thread.
Comment 209•10 years ago
|
||
My thread checking try run didn't turn up anything: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=d8ccc5fe7779
Comment 210•10 years ago
|
||
I get this crash occasionally (once a week maybe). If there is anything I can do to help let me know.
Comment 211•10 years ago
|
||
I spent some time today trying out trunk on Valgrind on Yosemite,
but didn't see anything that seemed directly relevant.
In two runs Valgrind did crash with symptoms that are usually
associated with the program-under-test trashing its heap. This
is a bit surprising given that I almost never see that on runs
on Linux. I have stack traces of all the threads at the points
of the crash if anybody wants to see. It may well be a red
herring though, due to general flakyness of Valgrind on Yosemite.
The crashes are not reproducible.
Overall I'd say the thread-race theory is most plausible. Can we
get the TSan folks on the case? Do we have TSan on MacOS ?
Comment 212•10 years ago
|
||
> Do we have TSan on MacOS?
Not that I'm aware of.
I do have my own ASan builds, as mentioned above in many comments. At some point I'll also try to do Mac TSan builds ... but I won't have time for a while.
Comment 213•10 years ago
|
||
Given that we're very late in the 34 cycle, the overall crash rate for 34 is in a good range, the crash reported in this bug is very rare (comment 208), and that we don't have a patch - I'm going to mark this as wontfix in 34.
Comment 214•10 years ago
|
||
In case it helps with root cause isolation, I got this stack about 15 minutes into a Loop/Hello call. I've not otherwise seen this crash, so I do kind of suspect something about having a long-running WebRTC call as being an exacerbating factor. Could be a red herring.
https://crash-stats.mozilla.com/report/index/61a66f0c-cbac-45fa-a112-590f72141119
Comment 215•10 years ago
|
||
(In reply to Timothy Nikkel (:tn) from comment #212)
> I get this crash occasionally (once a week maybe). If there is anything I
> can do to help let me know.
Ditto. 3 this week. Feels like they happen more often recently (my crash history backs up this feeling, sadly).
Comment 216•10 years ago
|
||
Just had this when watching a ted talk. The video stopped and then it crashed. Crash stats url: https://crash-stats.mozilla.com/report/index/27eb2e26-757a-4136-a54b-296a32141204
Comment 217•10 years ago
|
||
This may not be the best place for this post, but at least it will be seen by someone who might know what to do with it.
I have been using FF since v1.0 on Windows. A couple of years ago I switched to an MacBook Air running Lion and continued using FF without any problems; the occasional crash here and there, but nothing too problematic. Then a few months ago I got a new MacBook Pro running Mavericks and FF has been crashing almost daily ever since. The crashes are mostly the ReleaseSliceNow one, but there are also lots of other random looking crashes (nsHtml5TreeBuilder::appendCharacters is the latest one for me), almost always with bad memory access. Here's my about:crashes for reference (https://gist.github.com/anonymous/4701c1a0e1534b9e5577)
So what changed? Was it the new hardware that FF doesn't like? Not likely. Was it the switch from Lion to Mavericks? Was it a change in the compiler toolchain perhaps? I doubt it was a sudden slew of bugs committed to FF. Perhaps it is some root cause bug corrupting memory that is then randomly manifested throughout the code. But Mac only? Why would such a bug not manifest on Windows? It suggests toolchain issues, but I don't know enough about FF dev to further comment on that. Perhaps I'm in the minority with these problems, which might suggest something unique about my environment, a pinned webpage perhaps that no-one else uses?
I'm sad after having used FF for so many years without problems to suddenly have an extremely unstable browser, but I remain hopeful that this can be rectified. And I'm available to run test builds if that can help. Good luck devs!
Comment 218•10 years ago
|
||
Olli had a theory that maybe we're spinning the event loop in ReleaseSliceNow or something, and we end up re-running the runnable while we're running the runnable. We could try adding some kind of reentrance check to at least detect that particular problem.
Comment 219•10 years ago
|
||
For what it's worth, I've been looking into doing Mac TSan builds. But for now I'm giving up.
In principle, TSan builds are made very similarly to ASan builds. But even the latest version of llvm that I tried (rev 224161, from a few days ago) claims to not be able to support TSan builds on Darwin.
There are some hints that Google was able to do Mac TSan builds a long time ago:
https://code.google.com/p/chromium/issues/detail?id=106197
https://code.google.com/p/chromium/issues/detail?id=120136
But I don't know enough about llvm to feel confident about trying to follow these hints up.
Comment 220•10 years ago
|
||
Comment 221•10 years ago
|
||
I've pored over this again, and done some more debug logging (in current trunk code). I've convinced myself this bug can't be the result of thread contention: At least in current code, calls to ReleaseSliceNow() can happen on the main thread and any of several "DOM Thread" threads. But the associated calls to CycleCollectedJSRuntime::DeferredFinalize() and CycleCollectedJSRuntime::FinalizeDeferredThings() also all happen on the same thread. So there's never a case where the same nsTArray<nsISupports*> object is accessed/changed on different threads.
So what on earth is the cause? I'm beginning to wonder if there are problems with one or more of the nsTArray methods that we use: AppendElement() in CycleCollectedJSRuntime::DeferredFinalize(), SwapElements() in IncrementalFinalizeRunnable::IncrementalFinalizeRunnable() and ElementAt() in ReleaseSliceNow().
Of these I suppose the most likely to cause trouble is AppendElement(). In my next test patch I'll try to come up with some trivial tests that AppendElement() worked correctly in CycleCollectedJSRuntime::DeferredFinalize().
By the way, it's interesting that we only see these crashes on the main thread of the main process. And, from my own debug logging, there seem to be a lot more objects garbage-collected there than on other threads or in other processes. Also, from my notes in comment 206 you can see that the 'items' array in ReleaseSliceNow() can get very large: r12 in one of those raw dumps is 0xc3f9, indicating an array with 50169 items (at that time). So maybe we only see these crashes when CycleCollectedJSRuntime::mDeferredSupports gets very large.
Comment 222•10 years ago
|
||
(In reply to Steven Michaud from comment #223)
> By the way, it's interesting that we only see these crashes on the main
> thread of the main process.
The main thread is just where most of these kinds of objects live. This is mostly used for things like the DOM tree. Workers have a bit of support, but not for as many things.
Some nsTArray problem does seem like the most suspicious thing here, esp. given your earlier investigation that they were happening, but that would be a little surprising, given how much we use nsTArray everywhere.
Comment 223•10 years ago
|
||
Here's my next test/debug patch. Fortunately it doesn't require inline assembly. I removed my previous test patch, since it's already served its purpose.
I've started a full set of tryserver builds:
https://tbpl.mozilla.org/?tree=Try&rev=0824af4068dc
Attachment #8539512 -
Flags: review?(continuation)
Comment 224•10 years ago
|
||
Comment on attachment 8539512 [details] [diff] [review]
Another test/debug patch (*not* a fix)
Oops, sorry. I already noticed a problem. New patch (and new tryserver tests) coming up.
Attachment #8539512 -
Attachment is obsolete: true
Attachment #8539512 -
Flags: review?(continuation)
Comment 225•10 years ago
|
||
Let's try this again.
Here's another set of tryserver builds:
https://tbpl.mozilla.org/?tree=Try&rev=adb35338f548
Attachment #8539518 -
Flags: review?(continuation)
Comment 226•10 years ago
|
||
Comment on attachment 8539518 [details] [diff] [review]
Another test/debug patch (*not* a fix)
Review of attachment 8539518 [details] [diff] [review]:
-----------------------------------------------------------------
::: xpcom/base/CycleCollectedJSRuntime.cpp
@@ +1050,5 @@
> + // intermittent failures here in nsTArray::AppendElement(). So if we see
> + // any failures, deliberately crash and include diagnostic information in
> + // the crash report.
> + size_t oldLength = mDeferredSupports.Length();
> + nsISupports **itemPtr = mDeferredSupports.AppendElement(aSupports);
nit: * to the left, here and below.
@@ +1053,5 @@
> + size_t oldLength = mDeferredSupports.Length();
> + nsISupports **itemPtr = mDeferredSupports.AppendElement(aSupports);
> + size_t newLength = mDeferredSupports.Length();
> + nsISupports *item = mDeferredSupports.ElementAt(newLength - 1);
> + if ((newLength - oldLength != 1) || !itemPtr ||
Why don't you do this checking, sans CrashReporter note adding, in non crashreporter builds/
Attachment #8539518 -
Flags: review?(continuation) → review+
Comment 227•10 years ago
|
||
> nit: * to the left, here and below.
OK.
> Why don't you do this checking, sans CrashReporter note adding, in
> non crashreporter builds?
Because I don't think it's worth crashing here if I can't log any
diagnostic information.
Comment 228•10 years ago
|
||
Comment on attachment 8539518 [details] [diff] [review]
Another test/debug patch (*not* a fix)
Landed (with Andrew's changes) on fx-team:
https://hg.mozilla.org/integration/fx-team/rev/29299d5117e9
Comment 229•10 years ago
|
||
Comment 230•10 years ago
|
||
It's a bit too early to be sure, but it looks like my latest test patch hasn't revealed any problems with nsTArray::AppendElement() (or for that matter with nsTArray::ElementAt()). Since it landed there've been no Socorro bug reports with "app notes" containing "997908". But there have been two crashes at ReleaseSliceNow():
bp-b8eff680-d455-4911-9964-585242141221
bp-f5fa94c4-c422-4fda-adc4-2d35b2141222
Comment 231•10 years ago
|
||
I have a patch in progress that stops us from using mDeferredSupports entirely, and instead moves the code to the other way we do deferred finalize. The code looks a little sketchy to me, so hopefully using some other code that doesn't seem to have weird crashes will help. I'll file a separate bug for that blocking this at some point.
Comment 232•10 years ago
|
||
Given what I said in comment #232, I started poking around in nsTArray code for whatever clues I might be able to find. I'm pretty sure nsTArray::SwapElements() isn't giving us trouble -- for non-auto arrays (as we're using), it just swaps pointers. But then I noticed that nsTArray code tries to use constructors and destructors as it adds and removes elements, or moves them around.
Already we've seen with my first (abortive) test patch (attachment 8492208 [details] [diff] [review]) that we aren't dealing here with "ordinary" nsISupports objects -- just adding the following two lines to CycleCollectedJSRuntime::DeferredFinalize() caused all kinds of trouble:
+ NS_IF_ADDREF(aSupports);
+ NS_IF_RELEASE(aSupports);
So I don't really think we want nsTArray to be calling constructors and destructors on the objects in CycleCollectedJSRuntime::mDeferredSupports and IncrementalFinalizeRunnable::mSupports. So why not change these nsTArrays from nsTArray<nsISupports*> to nsTArray<void*>?
That's what this patch does.
I've started a full set of tryserver builds, here:
https://tbpl.mozilla.org/?tree=Try&rev=c5a99eb900c9
I'm leaving tomorrow on my Christmas vacation. So even if you really like this patch, Andrew, I think it's probably best to wait to land it on the trunk until I get back (in late December or January).
Attachment #8540413 -
Flags: review?(continuation)
Comment 233•10 years ago
|
||
That's an interesting idea, but I don't think anything should happen to the nsISupports pointers in the array.
Comment 234•10 years ago
|
||
Comment on attachment 8540413 [details] [diff] [review]
Possible fix
Review of attachment 8540413 [details] [diff] [review]:
-----------------------------------------------------------------
I don't think this should change the behavior of destroying the array.
I'm going to try my patch to eliminate this array entirely.
Attachment #8540413 -
Flags: review?(continuation) → review-
Comment 235•10 years ago
|
||
Wontfixing again for 35 given the time remaining in the cycle. Since we've been shipping this for many releases I'm not tracking any further, looks like we'll have a fix on trunk soon and an uplift nomination with risk assessment would be welcome. We can take it as high as timing/risk allows.
Comment 236•10 years ago
|
||
I got one today in 35 during Firefox Hello call while sitting back and chatting. Lots of tabs in the background.
https://crash-stats.mozilla.com/report/index/f72e5bc6-7c63-487a-890c-2df6e2150117
Updated•10 years ago
|
Comment 237•10 years ago
|
||
As of the patch for bug 1114804 (which landed on trunk on 2015-03-10 and first appeared in the 2015-03-11 mozilla-central nightly), the code where these crashes happen has disappeared. As best I can tell, that code has been replaced by DeferredFinalizerImpl::DeferredFinalize(), here:
https://hg.mozilla.org/mozilla-central/annotate/cf1060d8ce9f/dom/bindings/BindingUtils.h#l2910
The big difference is that the "new" code uses nsTArray::RemoveElementsAt() once instead of nsTArray::RemoveElementAt() repeatedly.
And that seems to have been enough to fix this bug. I see no crashes at all at DeferredFinalizerImpl::DeferredFinalize() in the last week on the 39 branch.
Updated•10 years ago
|
Status: REOPENED → RESOLVED
Closed: 11 years ago → 10 years ago
Resolution: --- → FIXED
Comment 239•10 years ago
|
||
(Following up comment #239)
> The big difference is that the "new" code uses
> nsTArray::RemoveElementsAt() once instead of
> nsTArray::RemoveElementAt() repeatedly.
Actually another (and likely more important) difference is that
ReleaseSliceNow() called Release() on each element it removed, but
DeferredFinalizerImpl::DeferredFinalize() never calls Release().
Could the calls to Release() in ReleaseSliceNow() have been redundant?
And could that have been the cause of bug 997908?
Comment 240•10 years ago
|
||
The old code was an array of raw pointers, while the new code is an array of smart pointers, so it will be automatically released when things are removed.
Updated•9 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•