Closed Bug 1098057 Opened 10 years ago Closed 9 years ago

Sync SendMessage calls from chrome may deadlock in the new plugin model

Categories

(Core Graveyard :: Plug-ins, defect)

All
Windows 7
defect
Not set
normal

Tracking

(e10s+)

RESOLVED WONTFIX
Tracking Status
e10s + ---

People

(Reporter: jimm, Assigned: jimm)

References

Details

and the reverse of bug 1098055, going the other way, chrome -> sync sendmessage -> plugin -> sync ipc -> content -> sync ipc -> chrome I'm not sure how common this is.
we also, afaict, have very few options for avoiding this. in the old model this would have been: chrome -> sync sendmessage -> plugin -> sync ipc -> chrome I don't see how we're avoiding this. ni'ing bsmedberg, curious if you've seen hangs like this?
Flags: needinfo?(benjamin)
Blocks: e10s-plugins
We used to, but then we fixed a few things like UpdateWindow to stop sending sync messages and haven't seen reports of this recently. But none of our hang reporters would catch this, so we wouldn't have data on it if it were happening.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #2) > We used to, but then we fixed a few things like UpdateWindow to stop sending > sync messages and haven't seen reports of this recently. > > But none of our hang reporters would catch this, so we wouldn't have data on > it if it were happening. Why wouldn't the general chrome hang detector thread pick these up? The app shell dispatch loop wouldn't be pumping, which should trigger that detector.
Assignee: nobody → jmathies
I wonder about code like this - http://mxr.mozilla.org/mozilla-central/source/widget/windows/nsWindow.cpp#6404 I'm not sure if enumerating all the child windows in a sandboxed child flash process can trigger sync sends to the procedure in that process, but it certainly seems possible.
Just curious: why doesn't REQUIRE_DEFERRED_MESSAGE_PROTECTION protect against this?
(In reply to Bill McCloskey (:billm) from comment #5) > Just curious: why doesn't REQUIRE_DEFERRED_MESSAGE_PROTECTION protect > against this? which connection are you curious about?
In comment 0, when the plug-in sends a sync message to content, would it process/defer the Windows message while waiting and thus process the message from chrome? Similarly, in comment 1, when the plugin sends a sync message to chrome, would it process/defer the Windows message from chrome?
(In reply to Bill McCloskey (:billm) from comment #7) > In comment 0, when the plug-in sends a sync message to content, would it > process/defer the Windows message while waiting and thus process the message > from chrome? If the Windows message hit the plugin process message procedure while the plugin process was in the sync ipc send to content, the plugin <-> content channel deferral trap would catch it. However in the case where the native sent message triggers the sync ipc call to content, I think the system is vulnerable. > Similarly, in comment 1, when the plugin sends a sync message to chrome, > would it process/defer the Windows message from chrome? I think this is the same situation as above.
I see. I did not realize we could send sync messages while dispatching these Windows messages. I think I understand why Benjamin said that our hang detector wouldn't catch this. The plugin hang detector just looks for places where a sync IPC takes too long. In the case of comment 1, the too-long IPC is from plugin to chrome. The hang detector does support these sorts of hangs, but it was disabled in bug 683967 (by you!). It looks like we were thinking of enabling it again when bug 679238 was fixed, but that got WONTFIXed without much explanation. The situation is comment 0 is similar, although we'd probably need more work to get crash reports for the right processes. It seems like crash reports could be pretty helpful in sorting this situation out, although we would need support for paired minidumps on the crashstats website, which we need for other stuff as well.
(In reply to Bill McCloskey (:billm) from comment #9) > I see. I did not realize we could send sync messages while dispatching these > Windows messages. > > I think I understand why Benjamin said that our hang detector wouldn't catch > this. The plugin hang detector just looks for places where a sync IPC takes > too long. In the case of comment 1, the too-long IPC is from plugin to > chrome. The hang detector does support these sorts of hangs, but it was > disabled in bug 683967 (by you!). It looks like we were thinking of enabling > it again when bug 679238 was fixed, but that got WONTFIXed without much > explanation. We could revisit self aborting in the content process when the browser isn't responding. I'm really not a fan of long timeouts when chrome is unresponsive, so I wonder if we can come up with some way of blocking this situation vs. hanging the browser and waiting for ~45 seconds, or whatever the current plugin hang timeout is.
chrome -> sync sendmessage -> plugin -> sync ipc -> content -> sync ipc -> chrome For example, I think we might be able to detect that the plugin process main thread is in SendMessage before we sync ipc send from plugin to content here, and just fail that call.
(In reply to Jim Mathies [:jimm] from comment #10) > We could revisit self aborting in the content process when the browser isn't > responding. I'm really not a fan of long timeouts when chrome is > unresponsive, so I wonder if we can come up with some way of blocking this > situation vs. hanging the browser and waiting for ~45 seconds, or whatever > the current plugin hang timeout is. I was thinking we could use crash reports to figure out all the places where a Windows messages triggers a sync IPC. Then hopefully we could eliminate those cases somehow, perhaps by deferring the Windows message to the next turn of the event loop. It sounds like that's what Benjamin was suggesting we did in comment 2. I agree that such long timeouts are bad, but if we just use them as a way to gather information, I don't think it's so much of a problem.
Kicking this out for re-triage - this bug is about a hypothetical deadlock, we have no idea if this actually occurs in the wild. This bug encompasses instrumenting using the crash reporter and a timeout to capture hung chrome processes. This seems more like a "want" vs. "must have", at least until we see evidence of this type of hang in the wild.
(In reply to Bill McCloskey (:billm) from comment #12) > (In reply to Jim Mathies [:jimm] from comment #10) > > We could revisit self aborting in the content process when the browser isn't > > responding. I'm really not a fan of long timeouts when chrome is > > unresponsive, so I wonder if we can come up with some way of blocking this > > situation vs. hanging the browser and waiting for ~45 seconds, or whatever > > the current plugin hang timeout is. > > I was thinking we could use crash reports to figure out all the places where > a Windows messages triggers a sync IPC. Then hopefully we could eliminate > those cases somehow, perhaps by deferring the Windows message to the next > turn of the event loop. It sounds like that's what Benjamin was suggesting > we did in comment 2. You're talking about all the places flash might call out from handling a native windowing event, right? So isn't your list basically the entire plugin api?
Flags: needinfo?(wmccloskey)
(In reply to Jim Mathies [:jimm] from comment #14) > (In reply to Bill McCloskey (:billm) from comment #12) > > (In reply to Jim Mathies [:jimm] from comment #10) > > > We could revisit self aborting in the content process when the browser isn't > > > responding. I'm really not a fan of long timeouts when chrome is > > > unresponsive, so I wonder if we can come up with some way of blocking this > > > situation vs. hanging the browser and waiting for ~45 seconds, or whatever > > > the current plugin hang timeout is. > > > > I was thinking we could use crash reports to figure out all the places where > > a Windows messages triggers a sync IPC. Then hopefully we could eliminate > > those cases somehow, perhaps by deferring the Windows message to the next > > turn of the event loop. It sounds like that's what Benjamin was suggesting > > we did in comment 2. > > You're talking about all the places flash might call out from handling a > native windowing event, right? So isn't your list basically the entire > plugin api? Or alternatively on the chrome side the list would include any win32 api that potentially could process native events. This list is long and hard to identify. I think either way, we've been attacking this problem from this kind of angle for a while now and it doesn't seem to help. Regardless, if we had a nice way of detecting the and triggering a three way abort, we could collect some stats. I'd rather try experimenting with caching or failing plugin calls on the plugin side when we know we're in SendMessage in the plugin process, or maybe pass some information on to content about this such that content can return cached values or do chrome related things async.
Depends on: 1160807
(In reply to Jim Mathies [:jimm] from comment #15) > (In reply to Jim Mathies [:jimm] from comment #14) > > (In reply to Bill McCloskey (:billm) from comment #12) > > > (In reply to Jim Mathies [:jimm] from comment #10) > > > > We could revisit self aborting in the content process when the browser isn't > > > > responding. I'm really not a fan of long timeouts when chrome is > > > > unresponsive, so I wonder if we can come up with some way of blocking this > > > > situation vs. hanging the browser and waiting for ~45 seconds, or whatever > > > > the current plugin hang timeout is. > > > > > > I was thinking we could use crash reports to figure out all the places where > > > a Windows messages triggers a sync IPC. Then hopefully we could eliminate > > > those cases somehow, perhaps by deferring the Windows message to the next > > > turn of the event loop. It sounds like that's what Benjamin was suggesting > > > we did in comment 2. > > > > You're talking about all the places flash might call out from handling a > > native windowing event, right? So isn't your list basically the entire > > plugin api? Yes, that's what I was suggesting. > I think either way, we've been attacking this problem from this kind of > angle for a while now and it doesn't seem to help. I think you might be overstating the case here, since we don't know whether this particular problem happens in practice, but I certainly won't argue against better hang detection. > Regardless, if we had a > nice way of detecting the and triggering a three way abort, we could collect > some stats. Yes, I agree. Perhaps each sync IPC call could include a list of processes that are transitively waiting for the result. If we send a sync IPC call to process P, and the list already includes process P, then we can crash everyone in the list.
Flags: needinfo?(wmccloskey)
Blocks: e10s-hangs
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → WONTFIX
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.