Closed Bug 997908 Opened 11 years ago Closed 10 years ago

crash in ReleaseSliceNow(unsigned int, void*) accessing memory at 0x5a5a5a5a5a5a5a5a

Categories

(Core :: XPCOM, defect)

30 Branch
All
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla31
Tracking Status
firefox30 + wontfix
firefox31 + wontfix
firefox32 + wontfix
firefox33 + wontfix
firefox34 + wontfix
firefox35 + wontfix

People

(Reporter: justdave, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: crash, topcrash-mac)

Crash Data

Attachments

(10 files, 8 obsolete files)

(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
smaug
: review+
Details | Diff | Splinter Review
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), text/plain
Details
(deleted), patch
froydnj
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
mccr8
: review+
Details | Diff | Splinter Review
(deleted), patch
mccr8
: review-
Details | Diff | Splinter Review
This bug was filed from the Socorro interface and is report bp-a1afae6a-7b03-4d08-a793-df4ee2140416. ============================================================= I'm going out on a limb and guessing this is related to the plugin code because I get a separate trapped crash from the OS in the plugin-container at the same time every time this happens. the crash report from the OS which corresponds to this crash is attached.
There's actually two separate crash reports logged by the OS at the same time, here's the other one.
Version: 1.9.0 Branch → 30 Branch
Flash 13.0.0.182 triggered a huge number of crashes on OS X on older Macs (at different addresses). See bug 996637. That's the version you have. So I suggest you upgrade to the current version (13.0.0.201) and see if these crashes still happen.
This one was actually Silverlight, not Flash (according to the dialog box from OS X - I don't see it actually in the report :( )
Your second crash stack (from comment #1) is for Silverlight 5.1.20913.0. So you have two more-or-less simultaneous crashes in two different plugins. Congratulations! I've never seen this before :-)
Your first crash stack (from comment #0) was definitely for the Flash plugin.
Actually, thinking more about this I suspect the real problem is the CC crash in ReleaseSliceNow(). The plugin crashes are probably just side effects. There have been a lot of these crashes lately -- mostly on Windows, but with a disproportionate number of them on the Mac: https://crash-stats.mozilla.com/report/list?signature=ReleaseSliceNow%28unsigned+int%2C+void%2A%29&product=Firefox&query_type=contains&range_unit=weeks&process_type=any&hang_type=any&date=2014-04-17+19%3A00%3A00&range_value=4#reports These crashes have also been responsible for a bunch of intermittent test failures, of which the following are still open: bug 816064, bug 981139.
This is probably a cycle collector bug.
Component: Plug-ins → XPCOM
If by any chance you can reproduce these crashes, *please* tell us how.
Though this bug happens on all platforms, I'm leaving the platform set to "Mac OS X" because it happens there disproportionately.
I have a membership management app that I use for Boy Scouts that is built on Silverlight, and I'm in it frequently enough that I tend to leave it open in a tab. It times out after 30 minutes of inactivity and kicks you out, and the login screen isn't Silverlight, which means the app would be frequently getting unloaded and reloaded. I don't know if that's related, but I seem to get this crash every day or two. Said app is usually in a background tab when it crashes.
For what it's worth, these are currently the #1 Mac topcrasher on the 31 branch and #3 on the 30 branch. They're down in the 20s and 30s on the 29 and 28 branches, and not even in the top 100 on the 27 branch. So though this problem has existed for a long time, we appear to have done something recently to make it much worse -- especially on the Mac.
Could some kind QA person please find a list of URLs most closely associated with these crashes?
Whiteboard: [QA wanted]
Keywords: needURLs
Attached patch Potential bandaid patch (obsolete) (deleted) — Splinter Review
Most of the Mac crashes are null-dereferences, so this might work as a bandaid patch. Kyle, I'm asking you to review because you have hg blame for the code I'm changing. I've started an all-platform tryserver build, whose results should eventually be available here: https://tbpl.mozilla.org/?tree=Try&rev=45714bba0168
Attachment #8409191 - Flags: review?(khuey)
Comment on attachment 8409191 [details] [diff] [review] Potential bandaid patch But apparently Kyle Huey is away, so I'll try Andrew McCreight.
Attachment #8409191 - Flags: review?(khuey) → review?(continuation)
Comment on attachment 8409191 [details] [diff] [review] Potential bandaid patch Review of attachment 8409191 [details] [diff] [review]: ----------------------------------------------------------------- My vacation starts in 10 minutes, so you should request review from mccr8 again. ::: xpcom/base/CycleCollectedJSRuntime.cpp @@ +1003,5 @@ > > nsISupports* wrapper = items->ElementAt(lastItemIdx); > + if (!wrapper) { > + continue; > + } It seems quite surprising to me that we would get a null pointer in here, but if we do, we should still remove it from the array to avoid having to memcpy as we destroy the rest of the array. So you should use NS_IF_RELEASE rather than adding an early continue.
Attachment #8409191 - Flags: review?(continuation) → review-
Attached patch Bandaid patch rev1 (deleted) — Splinter Review
Thanks, Kyle, for your suggestion.
Attachment #8409191 - Attachment is obsolete: true
Attachment #8409205 - Flags: review?(continuation)
For what it's worth I've started another tryserver build: https://tbpl.mozilla.org/?tree=Try&rev=76e2a3694db2
Steven, I'll give it a try a bit later today!
Keywords: qawanted
Comment on attachment 8409205 [details] [diff] [review] Bandaid patch rev1 Hmm. It is probably better to filter these out when we're adding things to the array. I'll look up where that would happen, but until then, maybe Olli has more thoughts here.
Attachment #8409205 - Flags: review?(continuation) → review?(bugs)
I reproduce it pretty reliably on https://lodgemaster.oa-bsa.org/lodge/client but that requires a login. FWIW it typically happens after I let the page idle for several hours in a background tab after logging in.
Tracking for now due to topcrash.
(In reply to comment #22) Dave, could you try the following tryserver build for a day or two? http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-76e2a3694db2/try-macosx64/firefox-31.0a1.en-US.mac.dmg It was made using my rev1 patch.
(In reply to Steven Michaud from comment #26) > Dave, could you try the following tryserver build for a day or two? OK, running it now, got the above page opened, and I'll let you know what happens.
Possibly of note, I was previously running Aurora, and this is a trunk build, so hopefully this crash was happening on Nightly, too.
> so hopefully this crash was happening on Nightly, too. It certainly seems so. This is a Mac topcrasher on the 30 and 31 branches. It's my hunch that the recent Mac null-dereferences have a different origin from the other crashes -- those that go back many FF versions, and many of which aren't null-dereferences. Your results should help us tell whether or not I'm right. If you don't see any of these crashes with my tryserver build, I'm likely to be right. But if you do see even one of *these* crashes (in ReleaseSliceNow()), I'm likely to be wrong.
Comment on attachment 8409205 [details] [diff] [review] Bandaid patch rev1 We can try this, but I'll look at the real fix too. Doesn't sound like a CC bug, but bug in something using DeferredFinalize
Attachment #8409205 - Flags: review?(bugs) → review+
Assignee: nobody → smichaud
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #27) > (In reply to Steven Michaud from comment #26) > > Dave, could you try the following tryserver build for a day or two? > > OK, running it now, got the above page opened, and I'll let you know what > happens. And it just now crashed, been running continuously since I made the above comment though, so it took that long to trigger it. The crash signature is different, but it took out Silverlight with it just like before. (separate crash dialog from the OS for that one). Not sure which one is which, it logged all three of these at once: bp-ef1df32e-c458-4f4f-9220-f25092140425 bp-1ed72539-b256-4e0f-adc2-38d672140425 bp-hr-20140425-59a24ac5-41b4-4b32-b5e1-8ade8b66df53
> bp-hr-20140425-59a24ac5-41b4-4b32-b5e1-8ade8b66df53 I can't find this one. > bp-ef1df32e-c458-4f4f-9220-f25092140425 > bp-1ed72539-b256-4e0f-adc2-38d672140425 And it's a real pain that neither of these has reliable symbols, even for XUL stuff. Let me download a copy of the tryserver build and try atos on it.
> bp-ef1df32e-c458-4f4f-9220-f25092140425 Here are the top few lines of this stack (for the main process) from atos: ReleaseSliceNow(unsigned int, void*) (in XUL) + 164 mozilla::IncrementalFinalizeRunnable::ReleaseNow(bool) (in XUL) + 140 mozilla::IncrementalFinalizeRunnable::Run() (in XUL) + 38 ... And, from looking at the machine code for ReleaseSliceNow(), it appears the crash is at the following line: NS_IF_RELEASE(wrapper); But if so the crash can't be a null deference, as that crash stack says it is! So the Mac crashes that Socorro reports as null dereferences may in fact not be null deferences! If so all bets are off. But we should still watch the crash stats to see if my bandaid patch makes any difference in these crashes' frequency. The first m-c nightly it will appear in is tomorrow's.
> NS_IF_RELEASE(wrapper); 000000000002e41f 4885DB test rbx, rbx 000000000002e422 7409 je 0x2e42d 000000000002e424 488B03 mov rax, qword [ds:rbx] 000000000002e427 4889DF mov rdi, rbx 000000000002e42a FF5010 call qword [ds:rax+0x10] Here's the assembly code for this line. The crash is supposed to take place at address 0x2e424. As you can see it can't possibly be a null dereference.
If we're operating on dead memory another thread could be nulling it between the test and the mov. Given the symptoms that seems reasonably likely.
Actually I suspect Socorro is using a 32-bit unsigned integer to store the "Crash Address", and that this is the cause of the trouble. Next week I'll open a bug on this.
I've opened bug 1002564 on the possible Breakpad/Socorro bug.
A bunch of (supposedly) null-dereference crashes have happened in the Mac m-c nightlies for 2014-04-26 and 2014-04-27. My bandaid patch didn't fix them :-(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Steven let me know if there's anything you need. It still looks like a top crashing signature for Firefox 31.0a1.
Right now I'm waiting for my patch for bug 1002564 to get into trunk nightlies (starting with tomorrow's). It won't fix this bug. But I think it'll show that the relatively large number of recent Mac crashes, which Socorro currently reports as null-dereference crashes, aren't that at all. With luck it'll also provide some clues about the actual cause.
Dropping qawanted keyword from this bug as it doesn't look like there's much QA can do to help here. From bug 1002564 comment #13 by Steven Michaud: > ...it doesn't seem to have fixed the issue with the Socorro stacks for bug 997908: > All the recent Mac crashes (those in builds containing the patch) are still reported as > NULL-dereferences, even though they can't possibly be (as best I can tell). > > I've now thought of more tests I can run to try to get to the bottom of this > (for example to find out if this is an Apple bug). Let me do them and get > back before we decide either way. I think we should block this bug until bug 1002564 is addressed.
Depends on: 1002564
Keywords: qawanted
justdave: Are you still seeing this bug? And if so are you willing to try another tryserver build for a day or two? http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-b93830179c78/try-macosx64/firefox-32.0a1.en-US.mac.dmg This build contains my latest patch for bug 1002564. It doesn't fix this bug, but it *does* partly fix the problem with Breakpad falsely reporting some crashes as null-deferences. My hope is that if/when you see this crash again with this build, you'll be able to tell us the actual crash address.
Steven: I just hit this signature running the beta. The browser crashed when Firefox was idle and I believe only one tab was open at the time: https://crash-stats.mozilla.com/report/index/df40a1d3-18f0-4fa1-b3bb-a7ab02140520 (Note: this wasn't just a plugin crash - it took the whole browser down).
Marcia: Since comment #6 I haven't thought this was a plugin bug, so your report doesn't surprise me. Do you see this bug often? If so, could you try out my tryserver build from comment #44 for a couple of days?
(In reply to Steven Michaud from comment #44) > justdave: Are you still seeing this bug? And if so are you willing to try > another tryserver build for a day or two? Yes, frequently (well, every few days still). Sure, I'll give it a run.
(In reply to Steven Michaud from comment #44) > http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-b93830179c78/try-macosx64/firefox-32.0a1.en-US.mac.dmg https://app.work.com renders as a blank page using this build. Is that specific to this build or shall I file a bug against nightly?
(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #48) > https://app.work.com renders as a blank page using this build. Is that > specific to this build or shall I file a bug against nightly? That sounds vaguely like a cache loading bug that has been very recently fixed. Try doing force reload (shift click on the reload) and see if that helps.
> https://app.work.com renders as a blank page using this build. I just tried loading this page in my tryserver build (on OS X 10.8.5), and had no trouble.
Force reload didn't help (that was the first thing I tried). Were you logged in when you visited work.com? Their site is one of those ones that behaves very differently if you're logged in. I'm on OS X 10.9.3 also. I'm just running Safari for that site for the time being, so I'm okay so far. I'm told on IRC that Asa said there's a ton of sites busted on Nightly right now (should be fixed in a couple days) but that's the only one I've run into so far.
> Were you logged in when you visited work.com? No, and I don't have an account there. Try the 2014-05-19 and 2014-05-20 m-c nightlies. My tryserver build was made on top of trunk code "between" those two nightlies. It's *very* unlikely that my patch is responsible for the behavior you're seeing ... unless (say) one of your plugins is crashing when you visit https://app.work.com.
> I don't have an account there Actually I *do* have an account there -- one of my Mozilla accounts. And I *can* reproduce what you report. But I also see it with the 2014-05-19 m-c nightly and not with the 2014-05-20 m-c nightly -- so the problem is unrelated and seems already to have been fixed.
justdave: Do you want me to make another tryserver build (based on current m-c code)? Or is what you have tolerable enough to live with for the next day or two?
yeah, after talking with other people on IRC I'm pretty sure it's not you. This range of nightlies seems whacked out right now. Surveygizmo has issues, too (it just flipped out on me in the middle of taking a survey), and so does Yammer. So far I'm making due with Safari on the weird sites until I get the ReleaseSlice crash again. Asa says this is bug 1010783, which is now fixed, so a new try build against current m-c would probably be helpful.
I started another tryserver build at bug 1002564 comment #28.
Haven't managed to figure out how to download that try build yet, but I just triggered the crash again on your previous one. :-) bp-95006357-f6b5-4e2d-aa1e-f059b2140521
Thanks, Dave, for the report! But it contains bad news -- the crash address is still NULL. So the true crash address must be > 0x7fffffffffff (an Apple bug my current patch for bug 1002564 doesn't (yet) fix or work around). As I said at bug 1002564 comment #26, there's a chance I'll be able to work around this bug in Breakpad (even though it's not a Breakpad bug). But there a bunch of other things currently eating my time, and I'm not sure when I'll be able to get back to bug 1002564 (or whether I'll be successful when I do). I think we need to get someone else than me to start working on this bug very soon -- someone who knows CC code (the code where the crashes happen). We can't keep waiting for bug 1002564.
Assignee: smichaud → nobody
No longer depends on: 1002564
I hit this yesterday: https://crash-stats.mozilla.com/report/index/aa5066b6-ff91-45f3-aae4-b7f842140520 The URL given was https://www.irccloud.com/#!/ircs://irc.mozilla.org:6697/%23mozwebqa, and I *think* I noticed an IRCCloud notification at the top of my cloud-IRC session saying something about "WebSockets having issues; you might try restarting your browser, if a reload doesn't fix it.) Hope that helps!
(In reply to Steven Michaud from comment #58) > I think we need to get someone else than me to start working on this bug > very soon -- someone who knows CC code (the code where the crashes happen). > We can't keep waiting for bug 1002564. Sylvestre, could you please escalate this?
Flags: needinfo?(sledru)
FYI, this has risen to #8 in the Firefox 31 topcrash report and #1 in the Firefox 31 Mac OS X topcrash report.
Benjamin, are you the right person to look to on this for a deeper dive into the CC code? If not, can you recommend someone?
Flags: needinfo?(sledru) → needinfo?(benjamin)
Not me, no. If this were a CC bug, probably mccr8, but without steps we really don't know whether the bug is actually in the CC or elsewhere in the codebase. Other than asking justdave to do an ASAN/valgrind run and see if it finds something interesting, I don't think we have clear next steps here.
Flags: needinfo?(benjamin)
As Olli said in comment 30, it seems like the most likely cause is something misusing DeferredFinalize. Though the fact that we're hitting a null deref immediately after a null check is mysterious...
The null isn't real, it's a crash reporting artifact. Assume that the pointer is non-null.
I'll try to think about something we can do.
Assignee: nobody → continuation
For what it's worth, I'm doing a Mac opt universal ASan build for justdave.
I was unable to do a universal build (for some reason the clang I need to do ASan builds won't do 32-bit builds). So here's a 64-bit only build: http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg Please try it, justdave, and let us know your results. Some of your plugins (notably Silverlight) won't work. And it may turn out that, for arbitrary reasons, you can't reproduce the ReleaseSliceNow() crashes with this build. But it seems like you're our best hope for moving forward on this bug :-)
> http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg On second thought don't bother. For some stupid reason this build includes a dependency specific to the version of clang I used. I'll see if I can repair it (to add in all the dependencies it needs). If so I'll repost a new copy at the same URL.
(Following up comment #69) On second thought that will take too much time. Can someone else do a "portable" ASan build for justdave?
I have the capability to build my own, too, if someone can point me at instructions. I've built nightlies before but it's been a while.
https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Firefox_and_Address_Sanitizer#ManualBuild Here are the instructions I followed to get a 64-bit build. They worked for me.
Attached file Mozconfig for 64-bit ASan build (obsolete) (deleted) —
And here's the mozconfig I used.
Keywords: needURLs
OK, apparently I get the same results. When I try to make it build universal the build crashes. 64bit build runs, but Silverlight doesn't work, and I can't run without it (I use it constantly - and it's probably at least a catalyst to triggering this crash - that's probably why I can reproduce it so often).
Dave, I think I can see my way to creating a patch for bug 1002564 that logs to the console the "thread state" at the time of a crash -- basically the contents of all the user-level registers. Unfortunately we'll probably never get this into Breakpad, but it *might* let us discover the actual address you're crashing at, if you run it yourself for a day or two. I'll post it once I have it ... if I do. As for getting universal builds to work (and making them portable), I'm sure it's just a matter of building clang "correctly". But it may take me a day or two to figure that out -- time which I won't have before next week.
Steven, does this bandaid patch work well enough (and is low risk enough) for consideration to uplift to Beta? Today's Beta is the last we can feasibly take this on and get enough feedback to be sure it can stick.
Flags: needinfo?(smichaud)
The bandaid patch doesn't work at all :-( The reason is that these aren't null-dereference crashes.
Flags: needinfo?(smichaud)
In that case we will need to leave this to be fixed (hopefully) in FF31 as we're too late to take on any speculative work in FF30 and this has no current, verified fix.
Dave, I've got another tryserver build for you to try, if you're willing: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-6b8002b98fd5/try-macosx64/firefox-32.0a1.en-US.mac.dmg Once again, this won't fix your crashes. But it does log extra information (to the console and to /var/log/system.log) when a crash is reported to Breakpad -- the "thread state" (the values, as of the crash, of all the user-level registers in the crashing thread). This information does *not* get included in a minidump or reported to Socorro. So when a crash happens, you'll need both to look at the Socorro crash stack and to use the Console app to find the "thread state". You may actually see multiple "thread states" -- some for various instances of plugin-container. What we need, though, is the "thread state" for Firefox. Hopefully one or more of the thread state's registers will contain this bug's actual crash "address".
The reason why universal Mac ASan builds don't work (why they won't build) is that 32-bit ASan builds won't build. At least for the time being, I've given trying to fix this problem. The main reason is that I found the following quote at https://bugzilla.mozilla.org/show_bug.cgi?id=753148#c6 from Christian Holler (decoder): "OSes we would like to target now are Linux 64 and Mac OSX 64. 32 bit has less priority right now because there are several problems with stack space that prevent it from working properly on all tests." This leads me to suspect that, even if we can fix the build problem, 32-bit ASan builds will be unusable. Am I right, Christian?
Flags: needinfo?(choller)
> Hopefully one or more of the thread state's registers will contain this bug's actual crash "address". Comment #36 suggests that the register will be rbx.
(In reply to Steven Michaud from comment #79) > Dave, I've got another tryserver build for you to try, if you're willing: Running it now, will let you know.
(In reply to Steven Michaud from comment #80) > > This leads me to suspect that, even if we can fix the build problem, 32-bit > ASan builds will be unusable. > > Am I right, Christian? Yes, I think so. The reduced stack space on 32 bit previously caused startup errors for Firefox (e.g. "Too much recursion" from the JS engine because stack space was exhausted too quickly). This typically happens in the parser because our parsing works recursively and stack frames grow really large with Clang Inlining+ASan. We should still try to fix the build errors (or just create 64-bit only ASan builds). There is another problem on Mac though that would have to be solved too, that is bug 923916. It has to do with the fact that on Mac, ASan uses a dylib instead of statically linking the ASan runtime to the target binary.
Flags: needinfo?(choller)
FWIW, the event that appears to have triggered the crash was clicking a URL to a PNG file on dropbox in my IRC client, which brought Firefox to the front, hung for a minute or so, then crashed.
(In reply to comment #84) > bp-347f536c-345b-4c80-9a0d-81aff2140529 This isn't a ReleaseSliceNow() crash. Also Firefox is running in 32-bit mode (which you weren't doing previously). And from the timestamp (and other considerations) it doesn't match what you copied from your syslog. Even the syslog Firefox crash isn't a ReleaseSliceNow() crash -- it has a non-zero crash address. And both plugin-container processes are 64-bit (though Silverlight is 32-bit only). But one of your plugin-container crashes does contain this as part of the thread state of the "crashing" thread: rbx: 0x5a5a5a5a5a5a5a5a And the Firefox thread state contains two registers whose contents is 0x5a5a5a5a.
Dave, could you keep trying for a ReleaseSliceNow() crash? :-)
(In reply to Steven Michaud from comment #86) > This isn't a ReleaseSliceNow() crash. Also Firefox is running in 32-bit > mode (which you weren't doing previously). The symptoms matched, so I assumed... And I'm not running in 32bit, it's set to launch native, and Activity Monitor confirms it's currently running in 64bit mode (granted, I didn't think to look before it crashed).
Yep, I'll keep trying.
Also, could you post Breakpad crash IDs for both of the plugin-container crashes?
>> bp-347f536c-345b-4c80-9a0d-81aff2140529 > > This isn't a ReleaseSliceNow() crash. Also Firefox is running in 32-bit mode (which > you weren't doing previously). This crash ID is actually for the Silverlight plugin. But (apparently) you were also running two other 64-bit plugins. Its crash IDs for those that I want to see -- particularly the one whose rbx == 0x5a5a5a5a5a5a5a5a. Also the Firefox crash id, if you can find it. It *might* have a ReleaseSliceNow() crash.
Those are the only breakpad crashes that were dropped. The next one in the list by date is from the 25th.
(In reply to Steven Michaud from comment #86) > rbx: 0x5a5a5a5a5a5a5a5a Hmm, that's the value we use for poisoning on freeing memory.
As per bug 1002564 comment #39, the latest minidump-stackwalk will display all the user-level registers for the top frame of the crashing thread. And as minidumps have contained this information all along, it just occurred to me to download a few corresponding to recent instance of these crashes. They all have the following: rbx = 0x5a5a5a5a5a5a5a5a So this is the crash address for these crashes. You need special permissions to download minidumps from http://crash-stats.mozilla.com. Here's how to download and build the latest minidump-stackwalk: 1) Install a reasonably recent XCode (I have 4.5.2) and the latest XCode commandline tools. 2) Visit https://github.com/mozilla/socorro and click "download zip". 3) CC=clang CXX=clang++ make breakpad 4) CC="clang -Wno-switch" CXX="clang++ -Wno-switch" make stackwalker
Summary: crash in ReleaseSliceNow(unsigned int, void*) → crash in ReleaseSliceNow(unsigned int, void*) at 0x5a5a5a5a5a5a5a5a
Something else also occurred to me: It may be possible to do a universal build that combines a 64-bit ASan build with a 32-bit "regular" build. I'll try that on Monday.
Dave, you don't have to keep waiting for a ReleaseSliceNow crash to happen in the build I gave you. But if I can manage to do a special ASan universal build as per comment #95, I'll post the instructions here.
Summary: crash in ReleaseSliceNow(unsigned int, void*) at 0x5a5a5a5a5a5a5a5a → crash in ReleaseSliceNow(unsigned int, void*) accessing memory at 0x5a5a5a5a5a5a5a5a
Attached file Mozconfig for universal ASan build (deleted) —
I *was* able to create a universal ASan build, with only the 64-bit component actually using ASan. I used this mozconfig file. You need to build the clang (and llvm) you use to do the 64-bit part of the build as per https://developer.mozilla.org/en-US/docs/Mozilla/Testing/Firefox_and_Address_Sanitizer#ManualBuild. But for the 32-bit part you should use the clang that comes with XCode's command line tools. After the build is finished, run "make package" in obdir/i386 or objdir/x86_64, then copy the Nightly app bundle from the resulting DMG to somewhere on your hard drive to run it. It will only run on the machine where you built it, because it (the 64-bit part of it) depends on a clang runtime that's located where you built llvm. Please try this, Dave, and let us know your results.
Attachment #8427219 - Attachment is obsolete: true
For what it's worth, I've created a genuinely portable Mac ASan universal build and posted it here: http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg You can use this, Dave, if you'd rather not build it yourself. I hacked it up using hdiutil and install_name_tool.
Dave has told me (on irc) that my ASan build crashes on startup. He's running on OS X 10.9.3. I just tried it on 10.9.3 and got the same crash. It's almost certainly unrelated to the ReleaseSliceNow() crashes, but I'll check to make sure. Thanks for your help so far, Dave! For now at least, there's nothing more we can reasonably ask you to do. So it looks like we're not going to get new information here anytime soon.
The ASan crashes you get on OS X 10.9.3 are due to a bug (or bugs) in Apple Gestalt code.
I get this crash fairly consistently: bp-49abfea0-d578-4566-ba16-3ed852140605 bp-cf3017b1-331c-4ac8-8527-dedb42140530 bp-109d1ddc-40ec-41df-8112-798662140529 bp-450e80b0-3e1f-4a72-b33a-7e8d52140524 Steps to reproduce: 1. I have a separate window full of youtube videos (15 tabs). I have opted into their HTML5 version (not sure if this matters), but I'm playing a Flash-only video. 2. Play a video halfway. 3. Work on your main Firefox window (250+ tabs), the usual surfing and all that, for an unknown period of time. 4. Go back to the window full of youtube videos. Resume playing that video. I'll then crash occasionally, but only once or twice daily, of this STR that is done fairly often.
Gary, what version of OS X are you running? If it's not 10.9.3, try my ASan build from comment #98.
(In reply to Steven Michaud from comment #102) > Gary, what version of OS X are you running? If it's not 10.9.3, try my ASan > build from comment #98. It is indeed 10.9.3, unfortunately.
> The ASan crashes you get on OS X 10.9.3 are due to a bug (or bugs) in Apple Gestalt code. Actually they're an unacknowledged and Mac-specific bug in LLVM: http://lists.cs.uiuc.edu/pipermail/llvmbugs/2013-June/028916.html I fixed this in my local copy of LLVM, then redid my build and copied it over the one I originally posted, here: http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg Please try this, Dave and Gary, and let us know your results.
(In reply to Steven Michaud from comment #104) > Please try this, Dave and Gary, and let us know your results. I hit bug 1021612 which may or may not be related, but which is blocking further testing.
I just updated my ASan build to current m-c code, which contains the fix for bug 1021612. http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg Please try it out, Gary and Dave!
(In reply to Steven Michaud from comment #106) > I just updated my ASan build to current m-c code, which contains the fix for > bug 1021612. Thanks Steven! I am now using this build.
I've been testing this build, so far I may have found another issue: 2014-06-11 00:23:13.825 plugin-container[32796:d07] CoreText performance note: Client called CTFontCreateWithName() using name "Times Roman" and got font with PostScript name "Times-Roman". For best performance, only use PostScript names when calling this API. 2014-06-11 00:23:31.501 plugin-container[32796:d07] CoreText performance note: Client called CTFontCreateWithName() using name "Lucida Grande" and got font with PostScript name "LucidaGrande". For best performance, only use PostScript names when calling this API. ================================================================= ==14071==ERROR: AddressSanitizer: heap-use-after-free on address 0x607003486e60 at pc 0x10915ae92 bp 0x7fff5fbfbb90 sp 0x7fff5fbfbb88 READ of size 8 at 0x607003486e60 thread T0 ==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6) ==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6) ==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6) ==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6) ==14071==AddressSanitizer CHECK failed: /usr/local/src/llvm/llvm-185949/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:36 "((n_ranges_)) < ((kMaxNumberOfAddressRanges))" (0x6, 0x6) (the above line is repeated countless times) Because this stack is useless, I'm not able to tell how useful this is.
(In reply to Gary Kwong [:gkw] [:nth10sd] from comment #107) > (In reply to Steven Michaud from comment #106) > > I just updated my ASan build to current m-c code, which contains the fix for > > bug 1021612. > > Thanks Steven! I am now using this build. Just minutes later after hitting comment 108, I hit bug 1023758. (again, which may or may not be related)
(In reply to comment #108) I wonder if you'd get a reasonable (though non-symbolized) stack if you didn't use llvm-symbolizer. I think it's possible to symbolize such stacks after the fact. And by the way, how are you running the ASan builds? Are you running them from the command line, first setting the ASAN_SYMBOLIZER_PATH environment variable? And thanks, by the way, for your work with my Mac ASan builds! I'm going to start using them myself, as my daily browser. Seems like there are lots of new bugs to be found.
(In reply to Steven Michaud from comment #110) > And by the way, how are you running the ASan builds? Are you running them > from the command line, first setting the ASAN_SYMBOLIZER_PATH environment > variable? Example command line: ASAN_SYMBOLIZER_PATH=/Users/skywalker/llvm/build/bin/llvm-symbolizer /Applications/Nightly.app/Contents/MacOS/firefox-bin 2>&1 | tee ~/Downloads/asanLog1.txt
Dave, are you still seeing these crashes in current trunk nightlies? If so please try turning off ICC (incremental cycle collection), and see if that makes them go away. To do this, set dom.cycle_collector.incremental to false in about:config.
Flags: needinfo?(justdave)
The (ICC-related) patch for bug 1023758 hasn't stopped these crashes from happening on trunk (the 33 branch). And though ICC has been turned off on the 32 (Aurora) branch as of the 2017-07-03 nightly (bug 911246 comment #20), these crashes are also still happening there. So, though mozilla::IncrementalFinalizeRunnable::Run() is on the stack, these crashes aren't (directly) related to incremental cycle collection.
Flags: needinfo?(justdave)
IncrementalFinalizeRunnable has nothing to do with incremental cycle collector. IncrementalFinalizeRunnable is about GC calling Release on refcounted object after GC has collected the relevant JS wrappers.
(In reply to Steven Michaud from comment #112) > Dave, are you still seeing these crashes in current trunk nightlies? I don't use Nightly on a day-to-day basis, I primarily use Aurora. I am still seeing this crash on Aurora.
I keep updating my Mac ASan build about once a week. I've just done so again: http://people.mozilla.org/~stmichaud/bmo/firefox-asan.dmg http://people.mozilla.org/~stmichaud/bmo/firefox-asan-howto.txt These are based on current trunk code. But could you use one for a few days, Dave, to see if you can reproduce one of these crashes in it? To launch it, you'd typically do something like this (at a Terminal prompt): 1) cd ~/Downloads 2) ASAN_SYMBOLIZER_PATH=~/Desktop/Nightly\ ASan.app/Contents/MacOS/llvm-symbolizer nohup ~/Desktop/Nightly\ ASan.app/Contents/MacOS/firefox 2>&1 | tee firefox-asan.log & When/if you crash, you'll see an ASan log (fully symbolized) in firefox-asan.log.
Same as 30, wontfix for 31.
I finally got a good breaking point to restart my browser and try this out... It dies at launch with: dyld: lazy symbol binding failed: Symbol not found: ___asan_init_v3 Referenced from: /Applications/Nightly.app/Contents/MacOS/libmozglue.dylib Expected in: flat namespace dyld: Symbol not found: ___asan_init_v3 Referenced from: /Applications/Nightly.app/Contents/MacOS/libmozglue.dylib Expected in: flat namespace And it leaves an OS X crash dump (which I can upload if you want it). The llvm-symbolizer is what crashes, the app itself continues running and works fine (using it to type this comment).
Apparently llvm-symbolizer crashes every time I open a new page.
I see that error too -- though only when I load a plugin. As far as I can tell it's benign -- you can continue using the ASan build, and if it crashes a symbolized crash log will get written to firefox-asan.log. I've already looked a bit into the error, but I haven't yet been able to figure it out. I don't consider it high priority, though -- since it doesn't prevent the ASan builds from doing what they're supposed to.
> dyld: lazy symbol binding failed: Symbol not found: ___asan_init_v3 > it leaves an OS X crash dump (which I can upload if you want it). In my case, that error *isn't* associated with a crash dump. I suspect it isn't in your case, either. But I would like to see the crash dump. Please attach it here.
Attached file llvm-dump.txt (deleted) —
llvm-symbolizer crash dump
Oops, I misunderstood. I was expecting output from llvm-symbolizer, but this is an Apple crash report. And (as you said) it reports a crash in llvm-symbolizer itself. And now that I look, I see I have similar crash reports (generated whenever I start a plugin). So apparently llvm-symbolizer crashes every time you open a plugin-container process -- though the plugin-container process itself runs just fine. But llvm-symbolizer still works fine with the main process, which is what we're interested in here.
I've typically had to be using Firefox about a day and a half to trigger this crash, and I can't keep this running long enough to crash it. It leaks like a sieve, and is using 14 GB of physical RAM right now (I only have 16 GB) and I've only been running it for 24 hours. As soon as I'm done typing this comment I'm restarting it to clear up the memory leak, and that may also reset my chances of reproducing the crash :(
> It leaks like a sieve ASan builds do use *lots* of RAM. But I don't know that they leak any worse than Firefox itself (or than the plugins/extensions that it uses). If you decide to keep running my ASan builds, please keep track of how the RAM usage increases over time, and how this compares to non-ASan builds. By the way, I update them about once a week, and I've just done so again.
I had 40 GB of VM in use, and after I quit Firefox there was 22 GB of VM in use. After restarting it, it's using 3GB of RAM, which is more normal (I usually see Firefox in the 3 to 6 GB range in my normal usage without ASan) Is anything in the ASan log useful even if it didn't crash? I saved the one it made when I restarted it.
> Is anything in the ASan log useful even if it didn't crash? I don't think so. But please attach it just in case. I'll mark it obsolete if it's not relevant.
(In reply to Steven Michaud from comment #125) > > It leaks like a sieve > > ASan builds do use *lots* of RAM. But I don't know that they leak any worse > than Firefox itself (or than the plugins/extensions that it uses). > > If you decide to keep running my ASan builds, please keep track of how the > RAM usage increases over time, and how this compares to non-ASan builds. fwiw, I've been using a build from 2 weeks ago constantly and haven't hit anything new thus far. I did seem to notice things getting slower (I have 200+ tabs) after using the browser (opening/closing tabs) for a day or 2, and have had to restart the browser to have a reasonable surfing experience again. I also have 16Gb RAM, but didn't measure the memory usage.
> fwiw, I've been using a build from 2 weeks ago constantly and haven't hit anything new thus far. Do you still see ReleaseSliceNow() crashes in "normal" builds?
(In reply to Steven Michaud from comment #129) > Do you still see ReleaseSliceNow() crashes in "normal" builds? Not tried yet - will switch back to "normal" Nightlies.
I have not yet been able to reproduce the crash in the ASan builds (I'm running it again now - had it going about 12 hours this time and had to restart it because it hung). I had switched back to Aurora for a couple days earlier this week because the ASan build was so slow, and managed to trip the crash twice on Aurora during that time.
Let's wait until we hear back from Gary. But at this point I strongly suspect it's not possible to reproduce these crashes in my ASan builds.
I get these crashes now about once a day in a normal Nightly build: bp-79d1fa25-2273-4433-8b55-379212140801 bp-ac5888fe-ffb2-4f01-a0be-869942140729 Haven't isolated any STR yet. Let me know if there's something I can do to help diagnose this.
(In reply to Steven Michaud from comment #132) > Let's wait until we hear back from Gary. But at this point I strongly > suspect it's not possible to reproduce these crashes in my ASan builds. I haven't hit the crash again in normal Nightlies, but my recent browser usage declined and changed recently (due to various flights around the world). Will keep trying.
I just hit this crash, while I was using Firefox Sync to sync my bookmarks and I was just having a bookmarks folder open in my bookmarks toolbar. https://crash-stats.mozilla.com/report/index/fe8843b2-4953-442c-b330-180802140807
(In reply to Gary Kwong [:gkw] [:nth10sd] slower than usual Aug 04-07, PTO Aug 08 from comment #134) > I haven't hit the crash again in normal Nightlies, but my recent browser > usage declined and changed recently (due to various flights around the > world). Will keep trying. Hit this now: https://crash-stats.mozilla.com/report/index/c5becd92-e7e1-4e27-958d-5f2442140809 I was accessing page 3 from page 1 of this ComputerWorld article: http://www.computerworld.com/s/article/9250266/Microsoft_slashes_IE_support_sets_huge_edict_for_Jan._2016?taxonomyId=125&pageNumber=3
Andrew - You're listed as the owner although from the recent comments it looks like Steven may actually be driving this one. I'm don't have a great deal of confidence right now that this will be resolved in 32. Do we have next steps here?
Flags: needinfo?(smichaud)
Flags: needinfo?(continuation)
Not really. There's just not any information to go on.
Flags: needinfo?(continuation)
This bug is currently stuck hard in the mud. My last hope was to get someone to reproduce these crashes (or their precursor) in an ASan build. But apparently that's not going to work.
Flags: needinfo?(smichaud)
I'm marking this bug as won't fix for 32. I would like to see us make a decision about whether we can make any progress on this bug in the 33 cycle.
I just hit this while watching a documentary on YouTube. Had a few simple pages of API documentation open in other tabs. https://crash-stats.mozilla.com/report/index/d2b66014-b827-4152-89d6-1a3482140823 First time I've had release Firefox crash on me in a long time.
I just hit this bug in 32 while browsing Google Maps. https://crash-stats.mozilla.com/report/index/fa607675-76f2-4791-9e01-903de2140901 Unfortunately it does not seem to be reproducible, but I still wanted to mention it here just in case...
Hi, thought I might mention that FF has crashed for me every couple of days for the last two months with this bug. It's not reproducible (in the sense that it seems to happen on random pages), but it does reliably crash on my machine.
(In reply to Steven Michaud from comment #139) > This bug is currently stuck hard in the mud. > > My last hope was to get someone to reproduce these crashes (or their > precursor) in an ASan build. But apparently that's not going to work. Maybe we could work this out if we get more eyes on this? I just hit this again on a 2014-09-16 nightly, surfing http://www.espnfc.com/ and getting bp-d053e21e-d37f-4d4c-bc12-a8e062140919
I've been working on a test build that should crash "earlier", and give more informative stacks. But so far I haven't been able to get it to build on the tryservers (the builds consistently fail with the same bizarre and inexplicable errors). It builds just fine locally. Ultimately I may just post a link to one of my local builds.
Attached patch Patch for testing (*not* a fix) (obsolete) (deleted) — Splinter Review
I finally got a patch that builds on the tryservers! Looking again at this bug a few days ago, it seemed likely that the invalid nsISupports object that causes these crashes is already invalid when it's queued for deletion at CycleCollectedJSRuntime::DeferredFinalize(). If we crash there, we're likely to get better (more informative) stacks. So whoever sees these crashes at least once a day, please use the following build for a day or two, hoping for crashes. When/if you do crash, please post your crash ids here. http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/smichaud@pobox.com-32ba3bb2e233/try-macosx64/firefox-35.0a1.en-US.mac.dmg
Steven, if you want to land that, feel free to do so with r=me. We should back it out at some point if it doesn't turn up anything, but it should be relatively harmless.
Ideally that will crash even with non-ASan builds thanks to poisoning, if the object is in fact bad there.
> Steven, if you want to land that, feel free to do so with r=me. Will do. Thanks! Crash reports for tryserver builds don't get properly symbolized, so you have to symbolize them by hand (using atos). So having this in m-c nightlies will be *much* more convenient. I'll change the comment to indicate that the patch should come out at some point -- either when this bug gets fixed, or if the patch doesn't help fix it.
What I'll land on m-i.
Attachment #8492208 - Attachment is obsolete: true
Comment on attachment 8492240 [details] [diff] [review] Patch for testing (*not* a fix) -- should make crash stacks more informative Review of attachment 8492240 [details] [diff] [review]: ----------------------------------------------------------------- Thanks!
Attachment #8492240 - Flags: review+
Comment on attachment 8492240 [details] [diff] [review] Patch for testing (*not* a fix) -- should make crash stacks more informative Landed on mozilla-inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/317c684efd2d
Whiteboard: [leave open]
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #155) > Backed out for e10s mochitest-bc leaks. s/e10s mochitest-bc leaks/blowing up the world Please run this through Try before attempting to push it again :)
Really weird! I have no idea what happened there.
Those errors were the same as the "bizarre and inexplicable errors" I reported in comment #147. I'm convinced this is a build system problem. But I'm not willing to put in the work to track it down -- not for a test patch. I'll ifdef my patch to make it Mac-only (since these crashes are only topcrashers on the Mac), and do a set of Mac tryserver builds and tests. If all goes well, I'll ask you to review it again, Andrew, and try landing it again.
https://tbpl.mozilla.org/?tree=Try&rev=a31e02acc6af If the rats keep biting, I'll just post my own local build where people can download it and test with it.
For good measure, let's try this patch on all platforms, even though it's ifdefed to only have effect on the Mac. Let's see how far into "bizarre and inexplicable" we truly get: https://tbpl.mozilla.org/?tree=Try&rev=4a90c03aefd5
The rats are still biting, so let's forget the tryservers. Here's a local build of my patch. I've disabled Breakpad and ensured that symbols aren't stripped. So when you crash you'll see an Apple crash report. If you see any crashes in CycleCollectedJSRuntime::DeferredFinalize(), look for a copy of the Apple crash report in ~/Library/Logs/DiagnosticReports/. Attach the report here (don't paste it into a comment). http://people.mozilla.org/~stmichaud/bmo/firefox-35.0a1.en-US.mac-bugzilla997908-test-DeferredFinalize.dmg
hey Steve, I've been running your build for a couple of days now but unfortunately did not get the crash (apart from one which took down my whole machine but was probably unrelated). But I can't continue testing it because it performs so badly overall, including leaking 100+Mb of RAM an hour. I would be happy to test a patched build from the current release branch though, if it's possible to build one.
> because it performs so badly overall, including leaking 100+Mb of RAM an hour I expect this means that at least some of the test failures were genuine (and not just a build system artifact). So it looks like I'll have to figure out some other way to test in CycleCollectedJSRuntime::DeferredFinalize(). Thanks, though, for trying my test build.
With an ASan build, you could write code that just touches the object, in some way that it won't be optimized out, hopefully. That won't produce useful stacks outside of an ASan build, though. Any other method on ISupports we could call will cause an addref and release, and thus cause the same leaks.
Attached patch Patch for testing (*not* a fix) (obsolete) (deleted) — Splinter Review
We've long known that these crashes take place here: https://hg.mozilla.org/mozilla-central/annotate/790f41c631cc/xpcom/base/CycleCollectedJSRuntime.cpp#l1081. The assembly code from comment #36 leads me to believe that 'wrapper' itself has been "poisoned" -- that when these crashes happen it equals 0x5a5a5a5a5a5a5a5a. If this is true, then it's likely that the element which becomes 'wrapper' was already poisoned when it was queued for deletion in a call to CycleCollectedJSRuntime::DeferredFinalize(). This patch is as ugly as sin. But it should trigger crashes with much better (more informative) stacks when/if CycleCollectedJSRuntime::DeferredFinalize() is called with a poisoned aSupports. And it should otherwise be safe (not crash). Andrew, are you willing to have this on the trunk for a week or two, to see if we start getting more interesting crash stacks? I've started an all-platform tryserver build, which so far seems to be progressing smoothly: https://tbpl.mozilla.org/?tree=Try&rev=65d03e6f30af
Attachment #8492240 - Attachment is obsolete: true
Attachment #8493487 - Flags: review?(continuation)
(In reply to comment #165) I don't think there's much point in continuing to test with ASan builds. It's pretty clear that these crashes don't happen with them.
Comment on attachment 8493487 [details] [diff] [review] Patch for testing (*not* a fix) I'd be happy to put this patch in an #ifdef XP_MACOSX block, if the need is felt.
Is that load really not just going to be optimized out?
> Is that load really not just going to be optimized out? I don't think so. But I can check ... however that will take a while. I need to load XUL from one of my tryserver builds into the Hopper Disassembler, and that takes *hours* (since Hopper needs to resolve all the internal cross-references). In the meantime, can you suggest a way for me to rewrite my patch to ensure that it *doesn't* get optimized out?
Maybe write a getter method for mInstanceVariable and mark it MOZ_NEVER_INLINE, then use that?
Andrew: I finally got XUL from my OS X 10.8 opt build to load into Hopper Disassembler ... and you were right -- my patch got optimized away. I will attempt to deal with this, but probably not until next week. I'm currently very busy with other, more urgent things. I have a hunch it won't be optimized away if I somehow create a side effect outside of the CycleCollectedJSRuntime::DeferredFinalize() function's scope -- for example by setting a global static variable to the results of whatever->mInstanceVariable. When I have time, I'll try that and see what happens.
Attachment #8493487 - Flags: review?(continuation)
Like previous releases, wontfix for 33
This is still the #10 topcrash on Firefox 34.0a2, though it isn't very high volume, with 64/3985 crashes in the last 7 days.
Attached patch Another patch for testing (*not* a fix) (obsolete) (deleted) — Splinter Review
Let me try my hand at this again. This patch is *very* simple and *very* ugly, but I think it'll work -- that it won't get optimized away and will crash when it needs to (giving us better stacks). I'll run this through the tryservers before landing it. But they're currently closed.
Attachment #8493487 - Attachment is obsolete: true
Attachment #8513044 - Flags: review?(continuation)
Comment on attachment 8513044 [details] [diff] [review] Another patch for testing (*not* a fix) Review of attachment 8513044 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for being persistent on this.
Attachment #8513044 - Flags: review?(continuation) → review+
Comment on attachment 8513044 [details] [diff] [review] Another patch for testing (*not* a fix) I've started my tryserver builds: https://tbpl.mozilla.org/?tree=Try&rev=f3e383000027 These are all platform because some of the B2G builds are done on OS X.
Comment on attachment 8513044 [details] [diff] [review] Another patch for testing (*not* a fix) Overnight I loaded XUL from the opt tryserver build into Hopper Disassembler ... and discovered my debugging code was gone. Then I looked at the build log, and discovered (in a warning) that "optimize" isn't recognized by clang as an attribute. Sigh. I'll try again, this time with inline assembly.
Attachment #8513044 - Attachment is obsolete: true
Comment on attachment 8513584 [details] [diff] [review] Patch for testing (*not* a fix) And here's another set of tryserver builds, since the first has weird and seemingly unrelated persistent errors. https://tbpl.mozilla.org/?tree=Try&rev=e325f0b6ac0a
Comment on attachment 8513584 [details] [diff] [review] Patch for testing (*not* a fix) > It doesn't matter if we zap %rax. Apparently it *does* matter. I'll try again.
Attachment #8513584 - Attachment is obsolete: true
(In reply to Steven Michaud from comment #181) > Comment on attachment 8513584 [details] [diff] [review] > Patch for testing (*not* a fix) > > > It doesn't matter if we zap %rax. > > Apparently it *does* matter. I'll try again. %r11 might be a safe register to clobber (not an argument register, a callee-clobbered register, and one that is probably further down on the register allocator's list of registers). If that fails, there's always push/mov/pop.
Comment on attachment 8513821 [details] [diff] [review] Patch for testing (*not* a fix), rev7 My previous patch was wrong about aSupports being in the %rsi register. Which is very peculiar -- there must be some strange things going on on the compiler. This gets around the problem, and also makes sure we don't zap any registers.
Comment on attachment 8513821 [details] [diff] [review] Patch for testing (*not* a fix), rev7 The tryserver tests are going well. And I've confirmed (in gdb, using its disassemble command) that is actually present in the opt build.
Attachment #8513821 - Flags: review?(continuation)
> that is that my test code is
Comment on attachment 8513821 [details] [diff] [review] Patch for testing (*not* a fix), rev7 I think Nathan is probably a better reviewer for this, as I have no idea what is going on with the assembly.
Attachment #8513821 - Flags: review?(continuation) → review?(nfroyd)
Comment on attachment 8513821 [details] [diff] [review] Patch for testing (*not* a fix), rev7 Review of attachment 8513821 [details] [diff] [review]: ----------------------------------------------------------------- Inline assembly patches are the best kind of patches. ::: xpcom/base/CycleCollectedJSRuntime.cpp @@ +1063,5 @@ > +#if defined(XP_MACOSX) && defined(__LP64__) > + // We'll crash here if aSupports is poisoned (== 0x5a5a5a5a5a5a5a5a). This > + // is better (more informative) than crashing in ReleaseSliceNow(). See > + // bug 997908. This patch should get backed out when bug 997908 gets fixed, > + // or if it doesn't actually help fix that bug. Nit: maybe "or if it doesn't actually help diagnose that bug"
Attachment #8513821 - Flags: review?(nfroyd) → review+
Keywords: leave-open
Whiteboard: [leave open]
Comment on attachment 8513821 [details] [diff] [review] Patch for testing (*not* a fix), rev7 Landed on mozilla-inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/b56b9aa70d22 > Inline assembly patches are the best kind of patches. They work when all else fails :-)
Is this patch worth uplifting to get more crash stats, or is the nightly population high enough that this will start showing up in significant volume?
(In reply to comment #190) There were 53 of these crashes in the last week on the 36 branch (so they're still the #5 Mac topcrasher on that branch). I don't think we need to uplift the patch. In a few days we should have enough new crashes (in DeferredFinalize(nsISupports*)) to play with.
Depends on: 1091801
Assignee: continuation → nobody
Since my test patch landed, there have been several more of this bug's crashes, but no crashes at address 0x5a5a5a5a5a5a5a5a in mozilla::cyclecollector::DeferredFinalize(nsISupports*). So it seems my hunch (from comment #166) is wrong, and that this bug happens after the call to mozilla::cyclecollector::DeferredFinalize(nsISupports*). I don't know where to go from here. Let's leave my patch on the trunk for another week, to see if anything else turns up. bp-d70bccbf-4266-408b-bc90-7d12f2141103 bp-e1800a3e-bfc6-4ad3-9497-194522141102 bp-2f6cd17a-a6f4-42a9-a725-417362141103 bp-81db71bf-bf36-4558-8425-19e812141103
> bp-81db71bf-bf36-4558-8425-19e812141103 Just noticed that this crash *isn't* at 0x5a5a5a5a5a5a5a5a. This may be significant ... though I don't currently know how.
Is it possible to do something horrible like copy out the method pointer from the vtable of the object in deferred finalize, then use that later at the point where we are crashing with the actual object? Then we'd crash in the method itself, rather then when trying to call the method, so maybe the class name would show up in the crash stack. Maybe that's hard to do without rewriting a bunch of code, though.
Note to myself, for when I come back to this bug: I wonder if it, like bug 1091801, is come kind of clang bug? Comment #194 indicates it may not be a use-after-free bug.
(In reply to Andrew McCreight [:mccr8] from comment #195) > Is it possible to do something horrible like copy out the method pointer > from the vtable of the object in deferred finalize, then use that later at > the point where we are crashing with the actual object? Then we'd crash in > the method itself, rather then when trying to call the method, so maybe the > class name would show up in the crash stack. Maybe that's hard to do > without rewriting a bunch of code, though. Pulling out the vtable of the object is not a problem.
(In reply to comment #195) Feel free to try :-) As for myself, when I come back to this bug I'm going to try to follow my hunch that this is a clang bug. If it *is* a clang bug, it is (I think) mostly likely happening in the call to mozilla::cyclecollector::DeferredFinalize(nsISupports*) itself. While debugging bug 1091801, I noticed that (even in opt builds) there are two distinct binary implementations for mozilla::cyclecollector::DeferredFinalize(nsISupports*) and CycleCollectedJSRuntime::DeferredFinalize(nsISupports*). I haven't yet dug into the tree to figure out why. But this is weird, and what's weird might be broken.
> there are two distinct binary implementations for > mozilla::cyclecollector::DeferredFinalize(nsISupports*) and > CycleCollectedJSRuntime::DeferredFinalize(nsISupports*) Actually this isn't so wierd. These are two different methods, at two different places in the tree: https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/nsCycleCollector.cpp#l3987 https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1047 The first calls the second. But, in opt builds, the second is inlined inside the first. It's conceivable this is being done incorrectly. I'll take a closer look later.
> But, in opt builds, the second is inlined inside the first. It's conceivable > this is being done incorrectly. I'll take a closer look later. I did check, and I didn't see any problems. In XUL from a recent m-c nightly (loaded into Hopper Disassembler), I compared the machine code for mozilla::cyclecollector::DeferredFinalize(nsISupports*) with an inline copy of CycleCollectedJSRuntime::DeferredFinalize(nsISupports*) to the machine code for CycleCollectedJSRuntime::DeferredFinalize(nsISupports*). I didn't see any significant differences. I'm currently out of ideas here.
> I'm currently out of ideas here. Not for long :-) DeferredFinalize(nsISupports*) is often called on one or more secondary threads. It basically just calls mDeferredSupports.AppendElement(aSupports). But CycleCollectedJSRuntime::FinalizeDeferredThings() is, as far as I know, called on the main thread, and also access mDeferredSupports, without (as far as I can tell) any locking mechanism. Could that be what's causing this bug?
CycleCollectedJSRuntime is a per thread singleton, so each thread should have its own object.
Even so, CycleCollectedJSRuntime.mDeferredSupports can apparently be accessed (written to and copied from) on two different threads. When I have time I'll dig further into this, and with luck write a test patch.
Just for the record, here's where things now stand: We crash here: https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1102 'wrapper' is almost always 0x5a5a5a5a5a5a5a5a. The only possible explanation for this is that at least one of the "items" in 'items' is set to this value. In almost all cases (or perhaps all of them), ReleaseSliceNow() is called with 'aSlice' == 100 and 'aData' == a valid nsTArray<nsISupports*>*. By looking at the machine code for ReleaseSliceNow() and the register values at-crash (in the raw dumps of the crash logs), the crash rarely (if ever) happens the first time through the "for (uint32_t i = length; i > length - aSlice; --i) {}" loop. 'items' and 'aData' in ReleaseSliceNow() come from 'mSupports' and 'aSupports', here: https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1132 These in turn come from CycleCollectedJSRuntime.mDeferredSupports, here: https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1214 CycleCollectedJSRuntime.mDeferredSupports is populated by calls to DeferredFinalize(), here: https://hg.mozilla.org/mozilla-central/annotate/5999e92e89ff/xpcom/base/CycleCollectedJSRuntime.cpp#l1061 We know from the results of my test patch that 'aSupports' is *never* == 0x5a5a5a5a5a5a5a5a in calls to DeferredFinalize(). So how did the 'items'/'aData' array in ReleaseSliceNow() acquire an 'item' that == 0x5a5a5a5a5a5a5a5a? I think this is only possible as the result of some kind of "corruption" -- for example a clang bug or thread contention.
Oh the value we're pulling out of the array is 0x5a? I guess that makes more sense. For some reason, I thought that we were pulling a valid value out, but then we hit the poison value when we tried to call Release() on it. So it is the array that is not being held alive properly, not the things the contents of the array point to. Yeah, I could imagine that some GC background thread is incorrectly calling deferred release. I wrote a patch just now that checks any time we access one of these arrays if we're on the main thread. This will check if we're accessing this array only on the main thread, for the main thread. For workers, it won't check anything useful, but hopefully these are mostly main thread crashes. (The GC doesn't use a background thread for workers IIRC so that shouldn't be an issue.)
Here are the values at-crash for some of the user-level registers, at least in the 2014-11-02-03-02-04-mozilla-central nightly: r12 == wrapper r14 == aData r15 == length - aSlice rax == i rbx == lastItemIdx Here are some crash reports for that nightly: bp-d70bccbf-4266-408b-bc90-7d12f2141103 bp-7931d847-2e43-41db-9378-4bdef2141104 bp-53fd5fec-b055-4de4-b1a2-d06fd2141104 bp-0be3ecf3-cc61-4ecf-bda1-c51ce2141105
This didn't turn up anything in local testing. Maybe a try push will turn up something, or we could try landing a version that crashes in non-debug builds.
> we could try landing a version that crashes in non-debug builds I'd say try that on the tryservers. But if it causes lots of problems it's not the answer -- probably because it's not specific enough. After all, these crashes are very rare, and normally never show up in any of our automated tests.
Note that the one crash reported at bug 1091801 (in mozilla::cyclecollector::DeferredFinalize(nsISupports*)) is on a secondary thread. So, if I remember right, were all the crashes I saw as I was debugging that bug. But I haven't had a chance to retest.
(In reply to Steven Michaud from comment #209) > Note that the one crash reported at bug 1091801 (in > mozilla::cyclecollector::DeferredFinalize(nsISupports*)) is on a secondary > thread. So, if I remember right, were all the crashes I saw as I was > debugging that bug. But I haven't had a chance to retest. The stack in that bug looks like it is on the "main thread" for a worker, so that's fine. It should be using a separate CC runtime than the one on the main thread.
I get this crash occasionally (once a week maybe). If there is anything I can do to help let me know.
I spent some time today trying out trunk on Valgrind on Yosemite, but didn't see anything that seemed directly relevant. In two runs Valgrind did crash with symptoms that are usually associated with the program-under-test trashing its heap. This is a bit surprising given that I almost never see that on runs on Linux. I have stack traces of all the threads at the points of the crash if anybody wants to see. It may well be a red herring though, due to general flakyness of Valgrind on Yosemite. The crashes are not reproducible. Overall I'd say the thread-race theory is most plausible. Can we get the TSan folks on the case? Do we have TSan on MacOS ?
> Do we have TSan on MacOS? Not that I'm aware of. I do have my own ASan builds, as mentioned above in many comments. At some point I'll also try to do Mac TSan builds ... but I won't have time for a while.
Given that we're very late in the 34 cycle, the overall crash rate for 34 is in a good range, the crash reported in this bug is very rare (comment 208), and that we don't have a patch - I'm going to mark this as wontfix in 34.
In case it helps with root cause isolation, I got this stack about 15 minutes into a Loop/Hello call. I've not otherwise seen this crash, so I do kind of suspect something about having a long-running WebRTC call as being an exacerbating factor. Could be a red herring. https://crash-stats.mozilla.com/report/index/61a66f0c-cbac-45fa-a112-590f72141119
(In reply to Timothy Nikkel (:tn) from comment #212) > I get this crash occasionally (once a week maybe). If there is anything I > can do to help let me know. Ditto. 3 this week. Feels like they happen more often recently (my crash history backs up this feeling, sadly).
Just had this when watching a ted talk. The video stopped and then it crashed. Crash stats url: https://crash-stats.mozilla.com/report/index/27eb2e26-757a-4136-a54b-296a32141204
This may not be the best place for this post, but at least it will be seen by someone who might know what to do with it. I have been using FF since v1.0 on Windows. A couple of years ago I switched to an MacBook Air running Lion and continued using FF without any problems; the occasional crash here and there, but nothing too problematic. Then a few months ago I got a new MacBook Pro running Mavericks and FF has been crashing almost daily ever since. The crashes are mostly the ReleaseSliceNow one, but there are also lots of other random looking crashes (nsHtml5TreeBuilder::appendCharacters is the latest one for me), almost always with bad memory access. Here's my about:crashes for reference (https://gist.github.com/anonymous/4701c1a0e1534b9e5577) So what changed? Was it the new hardware that FF doesn't like? Not likely. Was it the switch from Lion to Mavericks? Was it a change in the compiler toolchain perhaps? I doubt it was a sudden slew of bugs committed to FF. Perhaps it is some root cause bug corrupting memory that is then randomly manifested throughout the code. But Mac only? Why would such a bug not manifest on Windows? It suggests toolchain issues, but I don't know enough about FF dev to further comment on that. Perhaps I'm in the minority with these problems, which might suggest something unique about my environment, a pinned webpage perhaps that no-one else uses? I'm sad after having used FF for so many years without problems to suddenly have an extremely unstable browser, but I remain hopeful that this can be rectified. And I'm available to run test builds if that can help. Good luck devs!
Olli had a theory that maybe we're spinning the event loop in ReleaseSliceNow or something, and we end up re-running the runnable while we're running the runnable. We could try adding some kind of reentrance check to at least detect that particular problem.
For what it's worth, I've been looking into doing Mac TSan builds. But for now I'm giving up. In principle, TSan builds are made very similarly to ASan builds. But even the latest version of llvm that I tried (rev 224161, from a few days ago) claims to not be able to support TSan builds on Darwin. There are some hints that Google was able to do Mac TSan builds a long time ago: https://code.google.com/p/chromium/issues/detail?id=106197 https://code.google.com/p/chromium/issues/detail?id=120136 But I don't know enough about llvm to feel confident about trying to follow these hints up.
I've pored over this again, and done some more debug logging (in current trunk code). I've convinced myself this bug can't be the result of thread contention: At least in current code, calls to ReleaseSliceNow() can happen on the main thread and any of several "DOM Thread" threads. But the associated calls to CycleCollectedJSRuntime::DeferredFinalize() and CycleCollectedJSRuntime::FinalizeDeferredThings() also all happen on the same thread. So there's never a case where the same nsTArray<nsISupports*> object is accessed/changed on different threads. So what on earth is the cause? I'm beginning to wonder if there are problems with one or more of the nsTArray methods that we use: AppendElement() in CycleCollectedJSRuntime::DeferredFinalize(), SwapElements() in IncrementalFinalizeRunnable::IncrementalFinalizeRunnable() and ElementAt() in ReleaseSliceNow(). Of these I suppose the most likely to cause trouble is AppendElement(). In my next test patch I'll try to come up with some trivial tests that AppendElement() worked correctly in CycleCollectedJSRuntime::DeferredFinalize(). By the way, it's interesting that we only see these crashes on the main thread of the main process. And, from my own debug logging, there seem to be a lot more objects garbage-collected there than on other threads or in other processes. Also, from my notes in comment 206 you can see that the 'items' array in ReleaseSliceNow() can get very large: r12 in one of those raw dumps is 0xc3f9, indicating an array with 50169 items (at that time). So maybe we only see these crashes when CycleCollectedJSRuntime::mDeferredSupports gets very large.
(In reply to Steven Michaud from comment #223) > By the way, it's interesting that we only see these crashes on the main > thread of the main process. The main thread is just where most of these kinds of objects live. This is mostly used for things like the DOM tree. Workers have a bit of support, but not for as many things. Some nsTArray problem does seem like the most suspicious thing here, esp. given your earlier investigation that they were happening, but that would be a little surprising, given how much we use nsTArray everywhere.
Attached patch Another test/debug patch (*not* a fix) (obsolete) (deleted) — Splinter Review
Here's my next test/debug patch. Fortunately it doesn't require inline assembly. I removed my previous test patch, since it's already served its purpose. I've started a full set of tryserver builds: https://tbpl.mozilla.org/?tree=Try&rev=0824af4068dc
Attachment #8539512 - Flags: review?(continuation)
Comment on attachment 8539512 [details] [diff] [review] Another test/debug patch (*not* a fix) Oops, sorry. I already noticed a problem. New patch (and new tryserver tests) coming up.
Attachment #8539512 - Attachment is obsolete: true
Attachment #8539512 - Flags: review?(continuation)
Let's try this again. Here's another set of tryserver builds: https://tbpl.mozilla.org/?tree=Try&rev=adb35338f548
Attachment #8539518 - Flags: review?(continuation)
Comment on attachment 8539518 [details] [diff] [review] Another test/debug patch (*not* a fix) Review of attachment 8539518 [details] [diff] [review]: ----------------------------------------------------------------- ::: xpcom/base/CycleCollectedJSRuntime.cpp @@ +1050,5 @@ > + // intermittent failures here in nsTArray::AppendElement(). So if we see > + // any failures, deliberately crash and include diagnostic information in > + // the crash report. > + size_t oldLength = mDeferredSupports.Length(); > + nsISupports **itemPtr = mDeferredSupports.AppendElement(aSupports); nit: * to the left, here and below. @@ +1053,5 @@ > + size_t oldLength = mDeferredSupports.Length(); > + nsISupports **itemPtr = mDeferredSupports.AppendElement(aSupports); > + size_t newLength = mDeferredSupports.Length(); > + nsISupports *item = mDeferredSupports.ElementAt(newLength - 1); > + if ((newLength - oldLength != 1) || !itemPtr || Why don't you do this checking, sans CrashReporter note adding, in non crashreporter builds/
Attachment #8539518 - Flags: review?(continuation) → review+
> nit: * to the left, here and below. OK. > Why don't you do this checking, sans CrashReporter note adding, in > non crashreporter builds? Because I don't think it's worth crashing here if I can't log any diagnostic information.
Comment on attachment 8539518 [details] [diff] [review] Another test/debug patch (*not* a fix) Landed (with Andrew's changes) on fx-team: https://hg.mozilla.org/integration/fx-team/rev/29299d5117e9
It's a bit too early to be sure, but it looks like my latest test patch hasn't revealed any problems with nsTArray::AppendElement() (or for that matter with nsTArray::ElementAt()). Since it landed there've been no Socorro bug reports with "app notes" containing "997908". But there have been two crashes at ReleaseSliceNow(): bp-b8eff680-d455-4911-9964-585242141221 bp-f5fa94c4-c422-4fda-adc4-2d35b2141222
I have a patch in progress that stops us from using mDeferredSupports entirely, and instead moves the code to the other way we do deferred finalize. The code looks a little sketchy to me, so hopefully using some other code that doesn't seem to have weird crashes will help. I'll file a separate bug for that blocking this at some point.
Attached patch Possible fix (deleted) — Splinter Review
Given what I said in comment #232, I started poking around in nsTArray code for whatever clues I might be able to find. I'm pretty sure nsTArray::SwapElements() isn't giving us trouble -- for non-auto arrays (as we're using), it just swaps pointers. But then I noticed that nsTArray code tries to use constructors and destructors as it adds and removes elements, or moves them around. Already we've seen with my first (abortive) test patch (attachment 8492208 [details] [diff] [review]) that we aren't dealing here with "ordinary" nsISupports objects -- just adding the following two lines to CycleCollectedJSRuntime::DeferredFinalize() caused all kinds of trouble: + NS_IF_ADDREF(aSupports); + NS_IF_RELEASE(aSupports); So I don't really think we want nsTArray to be calling constructors and destructors on the objects in CycleCollectedJSRuntime::mDeferredSupports and IncrementalFinalizeRunnable::mSupports. So why not change these nsTArrays from nsTArray<nsISupports*> to nsTArray<void*>? That's what this patch does. I've started a full set of tryserver builds, here: https://tbpl.mozilla.org/?tree=Try&rev=c5a99eb900c9 I'm leaving tomorrow on my Christmas vacation. So even if you really like this patch, Andrew, I think it's probably best to wait to land it on the trunk until I get back (in late December or January).
Attachment #8540413 - Flags: review?(continuation)
Depends on: 1114804
That's an interesting idea, but I don't think anything should happen to the nsISupports pointers in the array.
Comment on attachment 8540413 [details] [diff] [review] Possible fix Review of attachment 8540413 [details] [diff] [review]: ----------------------------------------------------------------- I don't think this should change the behavior of destroying the array. I'm going to try my patch to eliminate this array entirely.
Attachment #8540413 - Flags: review?(continuation) → review-
Wontfixing again for 35 given the time remaining in the cycle. Since we've been shipping this for many releases I'm not tracking any further, looks like we'll have a fix on trunk soon and an uplift nomination with risk assessment would be welcome. We can take it as high as timing/risk allows.
I got one today in 35 during Firefox Hello call while sitting back and chatting. Lots of tabs in the background. https://crash-stats.mozilla.com/report/index/f72e5bc6-7c63-487a-890c-2df6e2150117
As of the patch for bug 1114804 (which landed on trunk on 2015-03-10 and first appeared in the 2015-03-11 mozilla-central nightly), the code where these crashes happen has disappeared. As best I can tell, that code has been replaced by DeferredFinalizerImpl::DeferredFinalize(), here: https://hg.mozilla.org/mozilla-central/annotate/cf1060d8ce9f/dom/bindings/BindingUtils.h#l2910 The big difference is that the "new" code uses nsTArray::RemoveElementsAt() once instead of nsTArray::RemoveElementAt() repeatedly. And that seems to have been enough to fix this bug. I see no crashes at all at DeferredFinalizerImpl::DeferredFinalize() in the last week on the 39 branch.
Status: REOPENED → RESOLVED
Closed: 11 years ago10 years ago
Resolution: --- → FIXED
Great! Thanks for looking through the crash stats.
Keywords: leave-open
(Following up comment #239) > The big difference is that the "new" code uses > nsTArray::RemoveElementsAt() once instead of > nsTArray::RemoveElementAt() repeatedly. Actually another (and likely more important) difference is that ReleaseSliceNow() called Release() on each element it removed, but DeferredFinalizerImpl::DeferredFinalize() never calls Release(). Could the calls to Release() in ReleaseSliceNow() have been redundant? And could that have been the cause of bug 997908?
The old code was an array of raw pointers, while the new code is an array of smart pointers, so it will be automatically released when things are removed.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: