Closed
Bug 1152048
Opened 10 years ago
Closed 9 years ago
nsSocketTransportService::Dispatch may deadlock during shutdown
Categories
(Core :: Networking, defect)
Tracking
()
RESOLVED
FIXED
mozilla42
People
(Reporter: mayhemer, Assigned: u408661)
References
Details
Attachments
(3 files, 1 obsolete file)
(deleted),
patch
|
mayhemer
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
bagder
:
review+
mcmanus
:
feedback+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
bagder
:
review+
ritu
:
approval-mozilla-aurora+
|
Details | Diff | Splinter Review |
https://crash-stats.mozilla.com/report/index/27ae8618-8676-436d-8056-c6db72150406#allthreads, thread 5
https://crash-stats.mozilla.com/report/index/bb6d4bd5-75b5-475d-a926-4207d2150404#allthreads, thread 4
PR_Lock
nsSocketTransportService::Dispatch(nsIRunnable*, unsigned int) [1]
nsSocketTransportService::Run() [2]
nsThread::ProcessNextEvent(bool, bool*)
NS_ProcessNextEvent(nsIThread*, bool)
mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)
MessageLoop::RunHandler()
MessageLoop::Run()
nsThread::ThreadFunc(void*)
_PR_NativeRunThread
pr_root
_callthreadstartex
[1] http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l128
[2] http://hg.mozilla.org/releases/mozilla-aurora/annotate/239508ee646f/netwerk/base/nsSocketTransportService2.cpp#l815
Comment 1•10 years ago
|
||
Our theory here is that we might be locking a lock twice (and getting it into some sort of inconsistent state as a result). We signed up Valentin to try to look into that, by adding some code that will assert if this particular lock gets locked reentrantly.
Assignee: nobody → valentin.gosu
Honza, I believe you said you could test this to see where we're acquiring the lock twice on the same thread (assuming that's the issue). I've got a try run at https://treeherder.mozilla.org/#/jobs?repo=try&revision=cf46564546e9 for windows builds with the results at https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/hurley@todesschaf.org-cf46564546e9 Can you give these a try and see if you either still reproduce the deadlock, or end up asserting when we acquire the lock in the wrong place?
Flags: needinfo?(honzab.moz)
Reporter | ||
Comment 4•10 years ago
|
||
I never said I was able to reproduce this exact deadlock, sorry. We have to let this go out and collect crash data.
Flags: needinfo?(honzab.moz)
This patch should crash when we want it to, even on release builds. Let's
land this and see what happens :) (We'll back out as soon as we get some
data on the deadlock, and certainly before this goes out to a significant
number of users.)
Attachment #8612467 -
Flags: review?(honzab.moz)
Reporter | ||
Comment 6•9 years ago
|
||
Comment on attachment 8612467 [details] [diff] [review]
debugging patch
Review of attachment 8612467 [details] [diff] [review]:
-----------------------------------------------------------------
::: netwerk/base/nsSocketTransportService2.cpp
@@ +66,5 @@
> + :mLock(&mutex)
> +{
> + PRThread *currentThread = PR_GetCurrentThread();
> + MOZ_RELEASE_ASSERT(sDebugOwningThread != currentThread);
> + sDebugOwningThread = currentThread;
do this assignment after you acquired the lock!
Attachment #8612467 -
Flags: review?(honzab.moz) → review+
> do this assignment after you acquired the lock!
*facepalm* thanks! Fixed in the landed version :)
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/91a3fb327007
Keywords: leave-open
Backed out and re-landed DONTBUILD to fix the bug number in the commit message
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/71a80e43615c
remote: https://hg.mozilla.org/integration/mozilla-inbound/rev/85e4bd6b0baa
Comment 10•9 years ago
|
||
Should this be uplifted? What versions did/does it affect? I'm following up on this since I'm tracking the bug that it's blocking (for shutdown hangs)
Flags: needinfo?(hurley)
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 12•9 years ago
|
||
No, this patch should not be uplifted, as it is not a fix - it's merely a patch designed to cause crashes so we can see where, exactly, things are going wrong (which, so far, hasn't succeeded - I've seen no crash reports around my assertion). This patch will be backed out before hitting release, possibly (probably?) before beta.
Flags: needinfo?(hurley)
Flags: needinfo?(honzab.moz)
Assignee | ||
Comment 13•9 years ago
|
||
Looks like we (finally) got a single hit on the assertion: https://crash-stats.mozilla.org/report/index/5dbc5816-6b99-409f-9286-089bb2150704
Assignee | ||
Comment 14•9 years ago
|
||
Note: I haven't dug into it yet, this is just so I don't have to dig up that crash-stats URL again :)
Reporter | ||
Comment 15•9 years ago
|
||
(In reply to Nicholas Hurley [:hurley, :nwgh] from comment #13)
> Looks like we (finally) got a single hit on the assertion:
> https://crash-stats.mozilla.org/report/index/5dbc5816-6b99-409f-9286-
> 089bb2150704
Looks like we crash inside nsSocketTransportService::Dispatch (up the stack) what winds up in CallWindowProcCrashProtected that calls nsWindow::ProcessMessage that processes another message that enters the lock again. Really interesting :D
Reporter | ||
Comment 16•9 years ago
|
||
Err.. we are not crashing. The name is CallWindowProcCrash_Protected_ which just do try/catch. Event more interesting :D
Reporter | ||
Comment 17•9 years ago
|
||
The Bdprovider.dll called by send (an LSP) seems to belong to bitdefender.
Assignee | ||
Comment 18•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #16)
> Err.. we are not crashing. The name is CallWindowProcCrash_Protected_ which
> just do try/catch. Event more interesting :D
Right, this looks like a suspend/resume state (note PostSleepWakeNotification in the crash stack). It seems like the machine goes to sleep while the lock is held (looks like on the main thread) by nsSocketTransportService::OnDispatchedEvent. Later, when the machine wakes up, the network link change bits take hold and try to do their cleanup. nsHttpConnectionMgr::PruneDeadConnections is called, which posts an event, causing nsSocketTransportService::Dispatch to be called, which tries to acquire the lock re-entrantly, which would cause our issues.
I'm not convinced that bitdefender has anything to do with the crash, as this is a race that could happen at any time, with or without bitdefender - though I could certainly see bitdefender making the race more likely to be lost.
Right now, I'm mostly at a loss for how to fix this, except to say that these accesses all occur on the main thread, which means any reads of nsSocketTransportService::mThread and nsSocketTransportService::mThreadEvent don't strictly need to be protected by the lock - we could just have these main-thread reads be unlocked (writes of course still would need to be locked); but down that path lies the madness of poor locking, so it's less of a real suggestion and more of a "I'm gonna throw my hands up for now and mull it over for a while" kind of thing.
Comment 19•9 years ago
|
||
That's quite a stack, nick :) (comment 13)
the information that is new to me is that NS_WIDGET_WAKE_OBSERVER_TOPIC is delievered synchronously after a windows message pre-empts execution. (the latter makes sense for sleep/wake, the synchronous part of the necko delivery surprised me). The IO Service code hooks that observer and changes it into the link_changed events as apropos.
The lock assertion is helpful, without it we would just have corruption of the socket thread event queue (which was partially changed before going to sleep and is now being changed again as part of the observer handling).
It seems odd that NS_WIDGET_WAKE_OBSERVER_TOPIC observers need to be fully re-entrant against this in ways that other observers are not.
Looking at that stack I am concerned about the integrity of the timer code too.
ni a couple folks that seem to have been involved in the wake-observer code.
I think we can fix the necko issue by having the ioService handler just post a fully async event back to the main thread to initiate the work, rather than doing it synchronously.
Flags: needinfo?(netzen)
Flags: needinfo?(jmathies)
Comment 20•9 years ago
|
||
in general, this unexpectedly re-entrant condition could cause all manner of hard to identify corruption (maybe the write segments bug?) beyond the case caught here so we should prioritize a fix.
Comment 21•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #19)
> That's quite a stack, nick :) (comment 13)
>
> the information that is new to me is that NS_WIDGET_WAKE_OBSERVER_TOPIC is
> delievered synchronously after a windows message pre-empts execution. (the
> latter makes sense for sleep/wake, the synchronous part of the necko
> delivery surprised me). The IO Service code hooks that observer and changes
> it into the link_changed events as apropos.
>
> The lock assertion is helpful, without it we would just have corruption of
> the socket thread event queue (which was partially changed before going to
> sleep and is now being changed again as part of the observer handling).
>
> It seems odd that NS_WIDGET_WAKE_OBSERVER_TOPIC observers need to be fully
> re-entrant against this in ways that other observers are not.
>
> Looking at that stack I am concerned about the integrity of the timer code
> too.
>
> ni a couple folks that seem to have been involved in the wake-observer code.
>
> I think we can fix the necko issue by having the ioService handler just post
> a fully async event back to the main thread to initiate the work, rather
> than doing it synchronously.
I'm confused by all this, what are you suggesting we do differently here?
Comment 22•9 years ago
|
||
(In reply to Jim Mathies [:jimm] from comment #21)
> I'm confused by all this, what are you suggesting we do differently here?
can the NS_WIDGET_WAKE_OBSERVER_TOPIC observers be notified from a non re-entrant stack?
Comment 24•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #22)
> (In reply to Jim Mathies [:jimm] from comment #21)
>
> > I'm confused by all this, what are you suggesting we do differently here?
>
> can the NS_WIDGET_WAKE_OBSERVER_TOPIC observers be notified from a non
> re-entrant stack?
Yeah I think it'd be safe to fire these off via async runnables independent from the windows notification. From what I remember we had some problems with timers not getting reset right after a long sleep. I don't see that regressing due to async notifications.
Flags: needinfo?(jmathies)
Comment 25•9 years ago
|
||
nick, given that bagder is on PTO for a while, would you be interested in making the IOService change (assuming you think it helps too)? We could get that backported easily enough I would think.
We could separately make the async runnable change for generating NS_WIDGET_WAKE_OBSERVER_TOPIC too (comment 24) but I have less confidence in what that might impact, so it would be better to do it on just nightly imo.
Assignee | ||
Comment 26•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #25)
> nick, given that bagder is on PTO for a while, would you be interested in
> making the IOService change (assuming you think it helps too)? We could get
> that backported easily enough I would think.
I'm already working on it, believe it or not :)
Assignee | ||
Comment 27•9 years ago
|
||
This reverts the debugging changes I made (we will definitely want to uplift this one). A patch with the fix comes next.
Attachment #8636087 -
Flags: review?(mcmanus)
Assignee | ||
Comment 28•9 years ago
|
||
And here's the one that should fix the actual issue.
Attachment #8636088 -
Flags: review?(mcmanus)
Assignee | ||
Comment 29•9 years ago
|
||
Comment 30•9 years ago
|
||
Comment on attachment 8636087 [details] [diff] [review]
revert lock change
Review of attachment 8636087 [details] [diff] [review]:
-----------------------------------------------------------------
if we revert we can't confirm the bug is fixed. how about MOZ_DIAGNOSTIC_ASSERT ?
Comment 31•9 years ago
|
||
Comment on attachment 8636088 [details] [diff] [review]
Async wakeup notifications
Review of attachment 8636088 [details] [diff] [review]:
-----------------------------------------------------------------
sgtm; daniel is back this week right?
Attachment #8636088 -
Flags: review?(mcmanus)
Attachment #8636088 -
Flags: review?(daniel)
Attachment #8636088 -
Flags: feedback+
Comment 32•9 years ago
|
||
and is there a bug filed for making a similar change to NS_WIDGET_WAKE_OBSERVER_TOPIC observers ala comment 24?
Updated•9 years ago
|
Attachment #8636087 -
Flags: review?(mcmanus)
Reporter | ||
Comment 33•9 years ago
|
||
BTW, have you checked how many shutdown deadlock have happened when this patch was in?
Flags: needinfo?(hurley)
Assignee | ||
Comment 34•9 years ago
|
||
(In reply to Honza Bambas (:mayhemer) from comment #33)
> BTW, have you checked how many shutdown deadlock have happened when this
> patch was in?
There are two - both of which seem to have the same stack involved. So it looks like this is being hit maybe once a month (give or take).
Flags: needinfo?(hurley)
Assignee | ||
Comment 35•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #30)
> Comment on attachment 8636087 [details] [diff] [review]
> revert lock change
>
> Review of attachment 8636087 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> if we revert we can't confirm the bug is fixed. how about
> MOZ_DIAGNOSTIC_ASSERT ?
That's doable. Though with the frequency (or lack thereof) of this deadlock/crash, confirmation is going to be hard to come by, anyway :)
Assignee | ||
Comment 36•9 years ago
|
||
(In reply to Patrick McManus [:mcmanus] from comment #32)
> and is there a bug filed for making a similar change to
> NS_WIDGET_WAKE_OBSERVER_TOPIC observers ala comment 24?
Just filed https://bugzilla.mozilla.org/show_bug.cgi?id=1186074
Assignee | ||
Comment 37•9 years ago
|
||
New lock-related patch making it a diagnostic assert instead of a release assert, so we can (hopefully) verify this as fixed at some point in the future.
Attachment #8636087 -
Attachment is obsolete: true
Attachment #8636658 -
Flags: review?(daniel)
Comment 38•9 years ago
|
||
(In reply to Nicholas Hurley [:hurley, :nwgh] from comment #34)
> (In reply to Honza Bambas (:mayhemer) from comment #33)
> > BTW, have you checked how many shutdown deadlock have happened when this
> > patch was in?
>
> There are two - both of which seem to have the same stack involved. So it
> looks like this is being hit maybe once a month (give or take).
this deadlock is infrequent, but I have high hopes that the patch fixes a lot of other weird behavior. the root cause here is re-entrancy and this deadlock is just one of a billion ways that could go poorly. (I'm wondering about the famous writesegments bug)
Updated•9 years ago
|
Comment 39•9 years ago
|
||
Comment on attachment 8636658 [details] [diff] [review]
diagnostic (not release) assert for the lock
Review of attachment 8636658 [details] [diff] [review]:
-----------------------------------------------------------------
I approve!
Attachment #8636658 -
Flags: review?(daniel) → review+
Comment 40•9 years ago
|
||
Comment on attachment 8636088 [details] [diff] [review]
Async wakeup notifications
Review of attachment 8636088 [details] [diff] [review]:
-----------------------------------------------------------------
LGTM
Attachment #8636088 -
Flags: review?(daniel) → review+
Assignee | ||
Comment 41•9 years ago
|
||
url: https://hg.mozilla.org/integration/mozilla-inbound/rev/cd1d76c2f219987ad981e1f3e9ac76845a9962a2
changeset: cd1d76c2f219987ad981e1f3e9ac76845a9962a2
user: Nicholas Hurley <hurley@todesschaf.org>
date: Fri Jul 31 13:33:48 2015 -0700
description:
Bug 1152048 - Make the RELEASE_ASSERT on the sts lock a DIAGNOSTIC_ASSERT. r=bagder
url: https://hg.mozilla.org/integration/mozilla-inbound/rev/a0582ac1d92e3484975e8dd6144cf42f8ad27d4b
changeset: a0582ac1d92e3484975e8dd6144cf42f8ad27d4b
user: Nicholas Hurley <hurley@todesschaf.org>
date: Fri Jul 31 13:39:48 2015 -0700
description:
Bug 1152048 - Send wakeup notification asynchronously to avoid reentrancy issues. r=bagder
Comment 42•9 years ago
|
||
Assignee | ||
Comment 43•9 years ago
|
||
Comment on attachment 8636658 [details] [diff] [review]
diagnostic (not release) assert for the lock
Approval Request Comment
[Feature/regressing bug #]: this bug
[User impact if declined]: could cause crashes we don't want (instead of deadlocks)
[Describe test coverage new/current, TreeHerder]: on m-c
[Risks and why]: none (just changes assert type to not fire on release builds)
[String/UUID change made/needed]: none
Attachment #8636658 -
Flags: approval-mozilla-aurora?
Keywords: leave-open
Status: NEW → RESOLVED
Closed: 9 years ago
status-firefox41:
--- → affected
status-firefox42:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Comment on attachment 8636658 [details] [diff] [review]
diagnostic (not release) assert for the lock
Seems safe for uplift to Aurora.
Attachment #8636658 -
Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment 46•9 years ago
|
||
Oh nevermind, just the assert patch is supposed to be getting uplifted.
Flags: needinfo?(hurley)
Comment 47•9 years ago
|
||
You need to log in
before you can comment on or make changes to this bug.
Description
•