Closed
Bug 1170231
Opened 9 years ago
Closed 9 years ago
Interrupt race resolution may cause deadlocks
Categories
(Core :: IPC, defect)
Tracking
()
RESOLVED
FIXED
mozilla42
People
(Reporter: kairo, Assigned: bugzilla)
References
Details
(Keywords: crash, topcrash)
Crash Data
Attachments
(2 files, 3 obsolete files)
(deleted),
patch
|
dvander
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
dvander
:
review+
kglazko
:
approval-mozilla-aurora+
lmandel
:
approval-mozilla-beta+
|
Details | Diff | Splinter Review |
[Tracking Requested - why for this release]:
This bug was filed from the Socorro interface and is
report bp-41e8cc89-ca30-47df-a7b0-110a32150601.
=============================================================
Top Frames:
0 ntdll.dll ZwWaitForMultipleObjects
1 kernelbase.dll WaitForMultipleObjectsEx
2 kernel32.dll WaitForMultipleObjectsExImplementation
3 user32.dll RealMsgWaitForMultipleObjectsEx
4 user32.dll MsgWaitForMultipleObjects
5 xul.dll mozilla::ipc::MessageChannel::WaitForInterruptNotify() ipc/glue/WindowsMessageLoop.cpp
6 xul.dll mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*) ipc/glue/MessageChannel.cpp
7 xul.dll mozilla::plugins::PPluginInstanceChild::CallNPN_GetValue_NPNVdocumentOrigin(nsCString*, short*) obj-firefox/ipc/ipdl/PPluginInstanceChild.cpp
8 xul.dll mozilla::plugins::PluginInstanceChild::NPN_GetValue(NPNVariable, void*) dom/plugins/ipc/PluginInstanceChild.cpp
9 xul.dll mozilla::plugins::child::_getvalue dom/plugins/ipc/PluginModuleChild.cpp
10 npswf32_17_0_0_169.dll F21225463__________________________________________________________________________________________________________ F209679109___________________________________________________________________________________:1234
[...]
This hang is spiking with 39.0b1 which is the first beta to include async plugin init.
Find more of those crashes in 39.0b1 at https://crash-stats.mozilla.com/report/list?product=Firefox&signature=hang+|+WaitForMultipleObjectsEx+|+RealMsgWaitForMultipleObjectsEx+|+MsgWaitForMultipleObjects+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3AWaitForInterruptNotify%28%29+|+mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ACall%28IPC%3A%3AMessage*%2C+IPC%3A%3AMessage*%29&version=Firefox%3A39.0b1#tab-reports
Aaron says he knows what the problem is and he's looking into a fix so assigning to him right away.
Assignee | ||
Comment 1•9 years ago
|
||
This is a problem with interrupt race resolution in our MessageChannel code. When a particular sequence of interrupt calls race between parent and child processes, it is possible to deadlock both of them.
Assignee | ||
Updated•9 years ago
|
Summary: Flash hang mozilla::ipc::MessageChannel::Call with async plugin init → Interrupt race resolution may cause deadlocks
Assignee | ||
Updated•9 years ago
|
Blocks: asyncplugininit
Assignee | ||
Comment 4•9 years ago
|
||
This test exhibits the problem - it hangs permanently. Note that your build needs to be configured with --enable-ipdl-tests to try this.
Assignee | ||
Comment 5•9 years ago
|
||
Whoops, messed up the mercurial export. Here's the real patch :-)
Attachment #8615589 -
Attachment is obsolete: true
Assignee | ||
Comment 6•9 years ago
|
||
Attachment #8615590 -
Attachment is obsolete: true
Assignee | ||
Comment 7•9 years ago
|
||
Assignee | ||
Comment 8•9 years ago
|
||
Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix
The existing predicate is fine for determining when a race has occurred and when to defer a message. The problem with _undeferring_ an incall with this predicate is that in some cases (particularly the one covered by the test in this bug), interrupt_remote_stack_depth_guess will never be correct as long as the outcall that the incall raced with is on the stack. In this case, this results in the message sitting in mDeferred in perpetuity.
This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run.
I've run this against all existing IPDL regression tests without any regressions. I've noticed that this does cause a bit of "hot potato" passing back and forth of some messages between mPending and mDeferred. This ends up having the effect of allowing the pending messages that need to run to propagate through the mPending queue and eventually break the deadlock.
Here's a try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=de071b34a7a3
Attachment #8616532 -
Flags: review?(dvander)
Assignee | ||
Updated•9 years ago
|
Attachment #8616531 -
Flags: review?(dvander)
Sorry Aaron, I'm looking at this now and should have it reviewed by the end of the day. This code is pretty obscure, cjones would be a better reviewer but it's probably best if someone else sucks it up and understands the race detection stuff as well.
Flags: needinfo?(dvander)
Okay, I think I see what is going wrong here now:
1) Parent sends RPC1 to child. Its remote stack guess is 0, local is 1.
2) Child sends RPC2 to parent. Its remote stack guess is 0, local is 1.
3) Child wakes up for RPC1, detects a race. Child defers and remote stack guess is 1.
4) Parent sends RPC3. Its remote stack guess is still 0, local is now 2.
5) Parent wakes up for RPC2, detects a race. Parent defers and remote stack guess is 1.
6) Child wakes up for RPC3, detects a race. Child defers and remote stack guess is 2.
In this sequence of events, the remote stack guess will always be < RemoteViewOfStackDepth(), and we'll never wake up. So clearly that condition can't be right.
The logic for race detection is roughly: "the other process thought I had N calls on the stack, but actually I've already sent another and it just hasn't processed it yet." The check in MaybeUndeferIncall was attempting to duplicate that so it wouldn't undefer a message that would just be deferred again.
You said "This modified predicate should be true when the remote stack depth matches the caller's stack depth at the time of the match, implying that it is safe to run." But MayebUndeferIncall doesn't run anything, DispatchInterruptMessage does, and it has a different check. In theory MaybeUndeferIncall could have *no* predicate and always undefer, and we'd just be stressing the race condition code. That's what we're probably seeing with the message bouncing.
Comment on attachment 8616532 [details] [diff] [review]
Part 1 - Possible Fix
I think the new predicate needs a very clear explanation of what its relation to the check in DispatchInterruptMessage() is. Unless I'm mistaken and they're unrelated, in which case we should have an explanation of why the new predicate is any better than having none at all.
I'm also yet not sure what the best condition would be, to both not cause deadlocks and to not wake up the thread just to defer again. I'll keep thinking about it though.
Sorry again for the late review, I think I understand this well enough to have a much faster turnaround time now though.
Attachment #8616532 -
Flags: review?(dvander)
Comment on attachment 8616531 [details] [diff] [review]
Part 2 - Unit Test
Review of attachment 8616531 [details] [diff] [review]:
-----------------------------------------------------------------
Thanks for writing this test case, it does reproduce for me. I do see that part 1 fixes it, but I want to make sure that's because MaybeUndeferIncall is correct rather than just letting the race detection logic run until it succeeds.
Attachment #8616531 -
Flags: review?(dvander) → review+
We turned off async plugin init in 39.0b5. So we will likely still need this for 40+.
Wontfix for 39 though.
status-firefox40:
--- → affected
status-firefox41:
--- → affected
tracking-firefox40:
--- → +
tracking-firefox41:
--- → +
Comment 16•9 years ago
|
||
Aaron, any reason why the patch didn't land? Thanks
Flags: needinfo?(aklotz)
Assignee | ||
Comment 17•9 years ago
|
||
Part 2 is just the test. Part 1 needs more work.
Flags: needinfo?(aklotz)
Assignee | ||
Comment 18•9 years ago
|
||
Attachment #8616532 -
Attachment is obsolete: true
Reporter | ||
Comment 19•9 years ago
|
||
Aaron, the hangs including the signatures in this bug are high again in 40 beta 1 (of course, as no fix landed here) - are you doing more work on this or did you miss to request review for the fix?
Flags: needinfo?(aklotz)
Assignee | ||
Comment 20•9 years ago
|
||
dvander and I talked about this at Whistler. We have a path going forward here. The revised patch is written but I am running this through a bunch of tests.
Flags: needinfo?(aklotz)
Assignee | ||
Comment 21•9 years ago
|
||
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)
This patch removes the condition outright. try and ipdl regression tests look good.
Attachment #8626828 -
Flags: review?(dvander)
Reporter | ||
Updated•9 years ago
|
Crash Signature: [@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify() | mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*)] → [@ hang | WaitForMultipleObjectsEx | RealMsgWaitForMultipleObjectsEx | MsgWaitForMultipleObjects | mozilla::ipc::MessageChannel::WaitForInterruptNotify() | mozilla::ipc::MessageChannel::Call(IPC::Message*, IPC::Message*)]
[@ hang | CleanupPerAppKey | Msg…
Attachment #8626828 -
Flags: review?(dvander) → review+
Comment 22•9 years ago
|
||
Assignee | ||
Comment 23•9 years ago
|
||
Nominations for uplift will be done gradually depending on how things look on treeherder.
https://hg.mozilla.org/mozilla-central/rev/3586d497bde2
https://hg.mozilla.org/mozilla-central/rev/61a5b8d678f3
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox42:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Comment 25•9 years ago
|
||
Hi Aaron, can you update us on how things are going in treeherder? We are interested in getting an uplift for 40/41.
Flags: needinfo?(aklotz)
Assignee | ||
Comment 26•9 years ago
|
||
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)
I'm OK with uplifting to Aurora at this time. I think we should let it bake there for a bit. I'm not concerned with Treeherder so much as I am with crashstats.
Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly.
[String/UUID change made/needed]: None
Flags: needinfo?(aklotz)
Attachment #8626828 -
Flags: approval-mozilla-aurora?
Comment 27•9 years ago
|
||
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)
Looking good on m-c, IPDL test suite.
Attachment #8626828 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 28•9 years ago
|
||
https://hg.mozilla.org/releases/mozilla-aurora/rev/19937e5a2904
https://hg.mozilla.org/releases/mozilla-aurora/rev/ff1058741c2c
Flags: in-testsuite+
Comment 29•9 years ago
|
||
Aaron, how many days of baking do you need for this before the uplift to beta? Thanks
Flags: needinfo?(aklotz)
Assignee | ||
Comment 30•9 years ago
|
||
We should reevaluate when I get back from PTO on Monday the 20th. We need sufficient crashstats data.
Flags: needinfo?(aklotz)
Assignee | ||
Comment 31•9 years ago
|
||
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)
Approval Request Comment
[Feature/regressing bug #]: Plugin IPC
[User impact if declined]: Possible plugin hangs
[Describe test coverage new/current, TreeHerder]: IPDL test suite, has been on Aurora for over a week.
[Risks and why]: Low, small change in IPC behavior. Looks good so far on Nightly and Aurora. Simple backout + shut off asyncInit if necessary to revert.
[String/UUID change made/needed]: None
Attachment #8626828 -
Flags: approval-mozilla-beta?
Comment 32•9 years ago
|
||
Comment on attachment 8626828 [details] [diff] [review]
Possible Fix (r2)
Let's take this small change and measure the impact in beta6. Beta+
Attachment #8626828 -
Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 33•9 years ago
|
||
Comment 34•9 years ago
|
||
This is affecting our plugin as well in 40+. I had to disable dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the plugin side to avoid the problem?
Assignee | ||
Comment 35•9 years ago
|
||
(In reply to akorchemniy from comment #34)
> This is affecting our plugin as well in 40+. I had to disable
> dom.ipc.plugins.asyncInit. Is there guidance on what not to do from the
> plugin side to avoid the problem?
If you needed to disable dom.ipc.plugins.asyncInit in 40 then this bug is not your problem. We've had quite a few compatibility issues roll in with async plugin init over the weekend. Those bugs are being tracked as dependencies of bug 1116806. We're probably going to push out a hotfix to disable asyncInit in 40 until the problem is fixed. Ideally async plugin init should just work seamlessly with all plugins (provided that they are correctly following the API contracts).
You need to log in
before you can comment on or make changes to this bug.
Description
•