Closed Bug 747055 Opened 13 years ago Closed 12 years ago

crash in RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::WaitForNotify

Categories

(Core Graveyard :: Plug-ins, defect)

13 Branch
x86
Windows 7
defect
Not set
critical

Tracking

(firefox13+ wontfix, firefox14+ fixed, firefox15 verified)

RESOLVED FIXED
mozilla16
Tracking Status
firefox13 + wontfix
firefox14 + fixed
firefox15 --- verified

People

(Reporter: kairo, Assigned: gfritzsche)

References

Details

(4 keywords)

Crash Data

Attachments

(4 files, 3 obsolete files)

This bug was filed from the Socorro interface and is report bp-3028b5dc-0643-42a1-b083-d884e2120419 . ============================================================= Top 10 Frames: 0 ntdll.dll RtlEnterCriticalSection 1 nspr4.dll PR_Lock nsprpub/pr/src/threads/combined/prulock.c:233 2 xul.dll mozilla::ipc::RPCChannel::WaitForNotify ipc/glue/WindowsMessageLoop.cpp:910 3 xul.dll mozilla::ipc::RPCChannel::Call ipc/glue/RPCChannel.cpp:202 4 xul.dll mozilla::plugins::PluginModuleParent::NPP_New dom/plugins/ipc/PluginModuleParent.cpp:904 5 xul.dll nsNPAPIPluginInstance::InitializePlugin dom/plugins/base/nsNPAPIPluginInstance.cpp:432 6 xul.dll nsNPAPIPluginInstance::Initialize dom/plugins/base/nsNPAPIPluginInstance.cpp:159 7 xul.dll nsPluginHost::TrySetUpPluginInstance dom/plugins/base/nsPluginHost.cpp:1341 8 xul.dll nsPluginHost::SetUpPluginInstance dom/plugins/base/nsPluginHost.cpp:1221 9 xul.dll nsPluginHost::InstantiateFullPagePlugin dom/plugins/base/nsPluginHost.cpp:1149 10 xul.dll nsObjectLoadingContent::InstantiatePluginInstance content/base/src/nsObjectLoadingContent.cpp:635 More signatures are at https://crash-stats.mozilla.com/report/list?signature=RtlEnterCriticalSection%20|%20PR_Lock%20|%20mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3AWaitForNotify%28%29
It's #36 top browser crasher in 13.0a2 and #209 in 14.0a1. The spike started in 13.0a2/20120327042012. The regression range for the spike is: http://hg.mozilla.org/releases/mozilla-aurora/pushloghtml?fromchange=e14e81432a54&tochange=5c4189019bc1
Summary: crash in RtlEnterCriticalSection → crash in RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::WaitForNotify
Version: Trunk → 13 Branch
Keywords: regression
Top crasher, so tracking for release. Benjamin - can you take a look since IPC appears to be in your wheelhouse?
Assignee: nobody → benjamin
Nothing to do with IPC, this is a plugin bug where we instantiate a plugin which, while it is being instantiated, pops up some kind of modal dialog. This causes us to spin the event loop which ends up re-instantiating a plugin (I think it's the same one). A regression from bug 726734, apparently, although that surprises me. I would have expected bug 90268.
Blocks: 726734
Component: IPC → Plug-ins
QA Contact: ipc → plugins
Assignee: benjamin → georg.fritzsche
Status: NEW → ASSIGNED
This is apparently an issue with full-page instantiations of the Adobe Reader plugin. I haven't been able to repro so far - tried bringing up modal dialogs from different phases in a plugins initialization, from primary and secondary thread, windowed and non-windowed. Possibly related: I have got my old Reader plugin (9.0) in a state where it shows an empty messagebox and afterwards at least drawing appears to be broken. Apparently the messagebox handling is custom though, with what i assume to be a modal message loop on a secondary thread (see thread 6 in callstack_msgbox_pdf.txt).
We actually have different cases here, with the most common (28 out of 41 samples) being full-page instantiations and apparently non-recursive, e.g.: https://crash-stats.mozilla.com/report/index/29f8f73f-2b69-4644-a2c1-435c12120516 Crash-addresses are usually in the range 0x20 - 0x30. Other cases are: Recursive plugin-instantiation, e.g.: https://crash-stats.mozilla.com/report/index/cc7dfa29-4ef9-4d98-9d4c-b1eab2120516 Crash on non-recursive, embedded plugin instantiation, e.g.: https://crash-stats.mozilla.com/report/index/ee1895e4-6704-40d8-8d11-7a0522120516 Crash on plugin event-handling, e.g.: https://crash-stats.mozilla.com/report/index/3b860c4c-6780-409d-b56a-1496d2120516
Scoobidiver/Georg - Do we know what version of Adobe reader to target? We could add the qawanted keyword to try to find STR if so. Any other tips (like looking at a full page PDF) for trying to reproduct would be helpful.
The version is unknown so far. The majority are full-page instantiations and the comments point to this happening with Adobe Reader.
(In reply to Georg Fritzsche [:gfritzsche] from comment #8) > The version is unknown so far. The majority are full-page instantiations and > the comments point to this happening with Adobe Reader. Adding qawanted to test opening PDFs in the browser with the latest version of Adobe Reader on Win7, since we don't yet know affected versions.
Keywords: qawanted
(In reply to Alex Keybl [:akeybl] from comment #7) > Scoobidiver/Georg - Do we know what version of Adobe reader to target? Unfortunately, those crashes are on the browser process and we don't load the plugin into that one so we don't see its version there.
1. Firefox 13.0b4 on Windows 7 64-bit. 2. Did a search on Google for "pdf" 3. Opened the first 20 pdf documents in new tabs 4. Randomly switched display modes, scrolling, otherwise playing around with documents in quick succession 5. Restarted Firefox and triggered session restore to load all at once Adobe Reader 10.1.3: would hang once in a while but never for more than 30 seconds, and never crashed. Adobe Reader 9.5.0: did not experience any performance degradation, nor crashes. Is there something more specific I can do to test? Please advise.
> Is there something more specific I can do to test? Please advise. Nothing more specific standing out from the crash-stats so far, sorry.
Removing QAWANTED as per comment 12. Please re-add if there is something more QA can do to help.
Keywords: qawanted
Crash-dumps show that for the most common crash-path the lock (from AsyncChannel::mMonitor) is null, so apparently we have a tear-down occuring from the event-loop spinning. Looking through the call-path however the tear-down protection seems proper (PluginDestructionGuard present and apparently checked against in the relevant places, reference to nsPluginInstanceOwner being held).
We're not close enough to resolution here to land a fix in FF13. We'll continue the investigation for FF14.
Found STR for the less common crash-path involving plugin event handling with FF13.0b4 & Flash 11.2.202.235: 1. go to http://practicefusion.com/ 2. choose free signup 3. view license agreement 4. click print 5. click print 6. choose print to file 7. click ok 8. choose filepath 9. click ok Results: (1) FF hangs completely for some time, crash-reporter comes up -> https://crash-stats.mozilla.com/report/index/bp-282dab94-2c2f-48c3-93cd-5385f2120525 (2) FF hangs completely for some time, crash-reporter submits a hang-report for Flash: -> https://crash-stats.mozilla.com/report/index/bp-f9cea563-dce4-475c-84ab-11b582120525
(In reply to Georg Fritzsche [:gfritzsche] from comment #16) > Found STR for the less common crash-path involving plugin event handling > with FF13.0b4 & Flash 11.2.202.235 This is not reproducible for me using Windows 7 64-bit, Flash 11.2.202.235 64-bit, and Firefox 13.0b4 or 13.0b5. There was a hang for a couple of seconds in Beta 4, no hang at all in Beta 5. Granted, I tested this on a new profile and the only print-to-file printer I have installed is the Microsoft XPS writer. Is there any more details you can give Georg? Is your testcase 100% reproducible for you?
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #17) > Is there any more details you can give Georg? Is your testcase 100% > reproducible for you? I just checked and found that i don't get the hang/crash with the XPS document writer either. I left the default choice on my HP Deskjet D2400 and enabled the checkbox "print to file". With that the hang is 100% reproducible for me with 13.0bX, Window 7 x64 SP1 on 2 different machines - either the plugin-hang-detector kills Flash or the browser crashes. A user comment in the reports mentions having this a "Brother Fax 2.1 sender": https://crash-stats.mozilla.com/report/index/54f259cf-46dc-4f55-9361-d7e142120519
Repeated the steps in comment 16 with the latest Nightly, a new profile, and an HP Deskjet D2400 installed. I could not reproduce this crash. Doing the same with Firefox 13.0b5 resulted in a hang for about 30 seconds, but no crash. While some set of external users can trigger this crash, it is not internally reproducible. Any other suggestions for a reproducible case are greatly appreciated.
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #19) > Doing the > same with Firefox 13.0b5 resulted in a hang for about 30 seconds, but no > crash. Well, if your machine is slower for some reason, then this might just be it. IIRC, this is the plugin process hanging, and if it hangs longer than 45 seconds, we kill it and send a hang report with the crash reporter. That could be the (2) case of comment #16.
Attached file Callstack for PluginInstanceParent destruction (obsolete) (deleted) —
Managed to reproduce with the aforementioned steps on a custom build, though less reliably. Destruction of the PluginInstanceParent is happening from a processed message (see callstack), afterwards it is crashing on accessing the destroyed AsyncChannel::mMonitor here: http://mxr.mozilla.org/mozilla-central/source/ipc/glue/WindowsMessageLoop.cpp#877
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #20) > IIRC, this is the plugin process hanging, and if it hangs longer than 45 > seconds, we kill it and send a hang report with the crash reporter. That > could be the (2) case of comment #16. Yes, apparently the plugin is killed due to hanging in both cases - either killing it like intended in case (2) or hitting the crash bug in (1).
Attached patch fix v1.0 (obsolete) (deleted) — Splinter Review
AsyncChannel::Close() & ::NotifyMaybeChannelError() may get called from a task that's processed in the nested event loop in RPCChannel::WaitForNotify() while being in a sync plugin IPC call. This leads to AsyncChannel::Clear() clearing out AsyncChannel::mMonitor and ::mListener. The patch should assure that the RPCChannel methods in the crash-paths can be properly exited after such a task occured. It fixes the issue for me locally on trunk. Possible other issue that is not handled by this patch: PluginInstanceParent gets deleted by the aforementioned task, so other issues could occur when the call-paths unwind to it's methods.
(In reply to Anthony Hughes, Mozilla QA (irc: ashughes) from comment #19) > Repeated the steps in comment 16 with the latest Nightly, a new profile, and > an HP Deskjet D2400 installed. I could not reproduce this crash. Doing the > same with Firefox 13.0b5 resulted in a hang for about 30 seconds, but no > crash. While some set of external users can trigger this crash, it is not > internally reproducible. Any other suggestions for a reproducible case are > greatly appreciated. How about the following: * Reduce dom.ipc.plugins.timeoutSecs until the plugin regularly gets killed for hanging * keep clicking into the text area of the plugin (increased the crash-reliability for me)
I was able to reproduce a plug-in hang/crash by reducing the timeout to 5s and clicking Print, printing the document, and repeatedly clicking and dragging in the text area. Though I had to repeat the process about 10 times before it crashed (I'm guessing it's an issue that builds up over time). In Firefox 13.0b5 it manifested as a plugin crash/hang https://crash-stats.mozilla.com/report/index/bp-955eeff6-b685-4dd5-9232-598782120531 https://crash-stats.mozilla.com/report/index/bp-36fe12ce-3d1a-49d1-b7e4-dd0f32120531 In Firefox 13.0b6 it manifested as a browser crash https://crash-stats.mozilla.com/report/index/bp-358d2286-cf1f-4f77-b8dc-b5a3e2120531 In Firefox 15.0a1 2012-05-31 it manifested as a plugin crash/hang https://crash-stats.mozilla.com/report/pending/7ac98e9f-9978-4ef8-8091-679ee2120531 https://crash-stats.mozilla.com/report/index/f4698b2d-7318-4d6e-9e76-414f52120531
Keywords: reproducible
Patch which delays the channel error notification after a channel timeout if we are still in an IPC call. While AsyncChannel::OnNotifyMaybeChannelError() was already protected against this, the following call-path was not: * PluginModuleParent::CleanupFromTimeout() * PPluginModuleParent::Close() * AsyncChannel::Close() The task which invokes this is created here: http://mxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#205 I did not have success on a mochitest for this so far.
Attachment #628282 - Attachment is obsolete: true
Attachment #629195 - Flags: feedback?(jones.chris.g)
Comment on attachment 629195 [details] [diff] [review] Fix #2 - delay NotifyMaybeChannelError() if still in IPC call From what I can tell this is a logic error in PluginModuleParent. It can't Close() the channel with C++ code on the stack. "Spinning" the event loop like this isn't acceptable for *Channel code. I would suggest writing a test first. Find the sequence of events/npapi calls that reproduce this bug, then add some gross code to our testplugin to reproduce whatever the plugin is doing here. There's code in testplugin already that does this, so you can use that as an example. Then call that code from a mochitest. MOZ_IPC_MESSAGE_LOG can be helpful in seeing what the plugin is doing.
Attachment #629195 - Flags: feedback?(jones.chris.g) → feedback-
The sequence leading up to this for me is starting with a plugin mouse event, from which a modal dialog is being brought up (leading to spinning the internal event loop), which subsequently hangs. In the internal event loop, further mouse events are processed (nested), which trigger the hang-after-timeout-cleanup and thus the CleanupFromTimeout() task: PluginModuleParent::ShouldContinueFromReplyTimeout() SyncChannel::ShouldContinueFromTimeout() RPCChannel::Call() PPluginInstanceParent::CallNPP_HandleEvent() PluginInstanceParent::NPP_HandleEvent() PluginModuleParent::NPP_HandleEvent() nsNPAPIPluginInstance::HandleEvent() nsPluginInstanceOwner::ProcessEvent() [...] RPCChannel::SpinInternalEventLoop() RPCChannel::WaitForNotify() RPCChannel::Call() PPluginInstanceParent::CallNPP_HandleEvent() PluginInstanceParent::NPP_HandleEvent() PluginModuleParent::NPP_HandleEvent() nsNPAPIPluginInstance::HandleEvent() nsPluginInstanceOwner::ProcessEvent() ... which may be processed from the internal event loop (timing-sensitive?): AsyncChannel::Close() PluginModuleParent::CleanUpFromTimeout() MessageLoop::RunTask() [...] RPCChannel::SpinInternalEventLoop() RPCChannel::WaitForNotify() RPCChannel::Call() PPluginInstanceParent::CallNPP_HandleEvent() ... which then leads to a crash when exiting the HandleEvent call-path. This still isn't reproducible for me in a mochitest, either due to the timing sensitivity or the event loop not being flooded with mouse events the same way when synthesized from JS.
Jet, anyone we can cc in from the reader group? Not sure what this dialog is for, it never has a chance to display any information. https://twitter.com/GeorgFritzsche/status/190522573963014145/photo/1
Attached file WIP test-case (deleted) —
WIP test-case which has the same basic behaviour i'm seeing in the reproducible case, but so far fails to reproduce the crash due to stepping out of the hanging RPC call before the PluginModuleParent::CleanUpFromTimeout() task is run.
As per yesterdays discussion i tested posting the CleanupFromTimeout() task via PostNonNestableTask() instead in ShouldContinueFromReplyTimeout(): http://mxr.mozilla.org/mozilla-central/source/dom/plugins/ipc/PluginModuleParent.cpp#205 That however doesn't fix the issue for me locally, with tracing showing that CleanupFromTimeout() is still being processed from SpinInternalEventLoop() (see callstack). This is because the methods involved here: MessageLoop::RunTask() MessageLoop::DeferOrRunPendingTask() MessageLoop::DoWork() ... are not actually increasing the run depth of the message loop (MessageLoop::state::run_depth).
Attachment #628209 - Attachment is obsolete: true
Do we have any alternatives for posting non-nestable tasks? Or should a call-path from ipc::DoWorkRunnable::Run() possibly increase the run depth?
Attached patch Possible fix (deleted) — Splinter Review
Georg, does this patch fix the crash?
(In reply to Chris Jones [:cjones] [:warhammer] from comment #33) > Created attachment 631609 [details] [diff] [review] > Possible fix > > Georg, does this patch fix the crash? It fixes the crash, but instead doesn't do any cleanup in the previously crashing case. Reposting the task from CleanupFromTimeout() seems to fix that, i can test that more tomorrow.
The error notification should still be delivered after all the plugin code is off the C++ stack, and that should kick off cleanup.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #35) > The error notification should still be delivered after all the plugin code > is off the C++ stack, and that should kick off cleanup. Tracing shows that AsyncChannel::PostErrorNotifyTask(), and hence AsyncChannel::OnNotifyMaybeChannelError(), are not being called in this case.
I can't reproduce the missing error notification anymore - can we push the patch to try so QA can verify independently to exclude possible quirks on my end?
Adding qawanted for validation of the try build: * http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/cjones@mozilla.com-665e54d54c40/try-win32/ * Test-URL: https://static.practicefusion.com/apps/ehr/?signup=true * STR as per comment 25 / above comments * Expected: Either plugin is killed for hanging or functions normally * Not expected & should be fixed: browser crash The "print to file" feature is broken for me in Flash 11.3, but works in 11.2. Older Flash versions here: http://helpx.adobe.com/flash-player/kb/archived-flash-player-versions.html Uninstalling newer Flash versions: http://helpx.adobe.com/flash-player/kb/uninstall-flash-player-windows.html
Keywords: qawanted
Print-to-File works fine for me in Flash 11.3. With the Try build I get a plugin crash, repeating my steps in comment 25. Firefox is no longer taken down by this bug it would seem.
It's #35 top browser crasher in 13.0.1, #39 in 14.0b7 and #67 in 15.0a2.
Keywords: topcrash
Comment on attachment 631609 [details] [diff] [review] Possible fix Requesting review for Chris fix.
Attachment #631609 - Flags: review?(benjamin)
Attachment #629195 - Attachment is obsolete: true
Comment on attachment 631609 [details] [diff] [review] Possible fix It is clear that this patch will make things better than they are currently. I am a little worried about the fact that we're just not calling Close() any more. Does anyone remember why we needed to call Close() in the first place? Since the call to Close() may be racing with the actual killing of the child process, is it possible that we'd end up with a different `why` at http://hg.mozilla.org/mozilla-central/annotate/3df1de4c9592/dom/plugins/ipc/PluginModuleParent.cpp#l226 and we'll end up with double minidumps (one from the hang and another from ActorDestroy)? Should we be reposting the message as we do lower at http://hg.mozilla.org/mozilla-central/annotate/3df1de4c9592/dom/plugins/ipc/PluginModuleParent.cpp#l272 ?
Attachment #631609 - Flags: review?(benjamin) → review+
Keywords: qawantedcheckin-needed
Assignee: georg.fritzsche → jones.chris.g
Flags: in-testsuite?
Keywords: checkin-needed
Target Milestone: --- → mozilla16
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
(In reply to Ryan VanderMeulen from comment #44) > Should this have a test? So far i failed to reproduce that in a test (possibly due to different timing or event queuing). Even the STR above are not fully consistent.
Reopening as we want to track this on Nightly for a week to see wether it fixes all or at least a subset of the crashes with this signature.
Assignee: jones.chris.g → georg.fritzsche
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Standard practice is to keep this closed while verifying and then mark VERIFIED later. This is pretty low volume on trunk, so it's going to be hard to verify based on trunk crash-stats. I think we should begin uplift to Aurora ASAP.
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Comment on attachment 631609 [details] [diff] [review] Possible fix [Approval Request Comment] Bug caused by (feature/regressing bug #): Timing issue leading to premature teardown of plugin handling code. User impact if declined: High browser-crash volume for a subset of FF users. Testing completed (on m-c, etc.): QA verified a try-build for the patch. Risk to taking this patch (and alternatives if risky): Possible risk of moving crashes to another location instead of fixing them. String or UUID changes made by this patch: None.
Attachment #631609 - Flags: approval-mozilla-aurora?
Comment on attachment 631609 [details] [diff] [review] Possible fix [Triage Comment] Approving for Aurora 15, with the possibility of uplifting if it does resolve the crash.
Attachment #631609 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
With combined signatures, it's #19 top browser crasher in 13.0.1 and #30 in 14.0b8.
Crash Signature: [@ RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::WaitForNotify()] → [@ RtlEnterCriticalSection | PR_Lock | mozilla::ipc::RPCChannel::WaitForNotify()] [@ mozilla::ipc::RPCChannel::WaitForNotify()]
In the 7-day view for FF15, there are no crash-signatures containing mozilla::ipc::RPCChannel::WaitForNotify() in the top-crashers - only in the 14-day view.
Comment on attachment 631609 [details] [diff] [review] Possible fix [Approval Request Comment] Bug caused by (feature/regressing bug #): Timing issue leading to premature teardown of plugin handling code. User impact if declined: High browser-crash volume for a subset of FF users (top-crasher). Testing completed (on m-c, etc.): QA verified a try-build for the patch. Watched crash-signature on Aurora and Nightly; no new crashes with this signature in builds after the checkin. Risk to taking this patch (and alternatives if risky): Possible risk of moving crashes to another location instead of fixing them. String or UUID changes made by this patch: None.
Attachment #631609 - Flags: approval-mozilla-beta?
Comment on attachment 631609 [details] [diff] [review] Possible fix [Triage Comment] Top crash fix with only risk of moving the signature around. Approved for Beta 14.
Attachment #631609 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Keywords: checkin-needed
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: