Closed Bug 645263 Opened 14 years ago Closed 14 years ago

dangerous deadlock detection data divergence

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: cjones)

References

Details

Attachments

(7 files, 3 obsolete files)

part 1: Migrate content/media to mozilla:: sync primitives 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch		Details \| Diff \| Splinter Review
part 2: Migrate modules/plugin to mozilla:: sync primitives 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	benjamin : superreview+	Details \| Diff \| Splinter Review
part 3: Migrate nsComponentManagerImpl to mozilla:: sync primitives 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	benjamin : superreview+	Details \| Diff \| Splinter Review
part 4: Migrate everything else to mozilla:: sync primitives 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch		Details \| Diff \| Splinter Review
part 5: Remove nsAutoLock.* 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	benjamin : review+	Details \| Diff \| Splinter Review
part 4: Migrate everything else to mozilla:: sync primitives, v2 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	dbaron : review+	Details \| Diff \| Splinter Review
part 1: Migrate content/media to mozilla:: sync primitives 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	cajbir : review+	Details \| Diff \| Splinter Review
part 6: Make editor test be nicer to deadlock detector 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	ehsan.akhgari : review+	Details \| Diff \| Splinter Review
part 7: Disable tracemalloc backtraces for xpcshell tests 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch		Details \| Diff \| Splinter Review
part 7: Disable tracemalloc backtraces for xpcshell tests, v1.1 14 years ago Chris Jones [:cjones] inactive; ni?/f?/r? if you need me (deleted), patch	dbaron : review+	Details \| Diff \| Splinter Review

David Baron :dbaron:

Reporter

Description

•

14 years ago

After investigating bug 594666, I've noticed a bunch of problems with our deadlock detection code (start from bug 594666 comment 365) for the relevant history. In particular: * we have separate deadlock detection code in BlockingResourceBase.cpp and in nsAutoLock.cpp; they don't detect deadlocks between the different types of locks * each of these files lives in xpcom/glue, which means that there's a separate copy of their data structures inside and outside of xpcom glue. This means that we don't detect deadlocks between a lock used by one user of xpcom glue and a lock used by another (e.g., a lock used in the firefox binary and a lock used in libxul, if my memory of glue usage is correct) We should: * move the data structure out of xpcom glue and maintain one copy per process * either unify the structure or rapidly migrate nsAutoLock/nsAutoMonitor users (and raw PRLock/PRMonitor users) to mozilla::Mutex

(not currently active) Ted Mielczarek

Updated

•

14 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=493080, https://bugzilla.mozilla.org/show_bug.cgi?id=493146

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 1

•

14 years ago

The first part of this work is basically what bug 493080 covers. (The old deadlock detector has other problems too that made us want to kill it off, like not actually being thread safe.) The second part isn't filed; that problem was known when we created the new deadlock detector, but not considered especially serious. The work following bug 642381 would allow us to fix that, and to additionally check for deadlocks across SM/Gecko. The work from bug 493080 obviously fizzled out. There's some momentum now from bug 594666, so let's try to use this new hotter bug to finally kill off nsAutoLock.

Depends on: 594666

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 9

•

14 years ago

Attached patch part 1: Migrate content/media to mozilla:: sync primitives (obsolete) (deleted) — Details — Splinter Review

This patch was made slightly-not-quite-entirely-trivial by the custom AutoExit helper.

Assignee: nobody → jones.chris.g

Attachment #522608 - Flags: review?(chris.double)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 10

•

14 years ago

Attached patch part 2: Migrate modules/plugin to mozilla:: sync primitives (deleted) — Details — Splinter Review

This patch was made slightly-not-quite-entirely-trivial by the AutoAsyncCallLock helper.

Attachment #522609 - Flags: superreview?(benjamin)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 11

•

14 years ago

Attached patch part 3: Migrate nsComponentManagerImpl to mozilla:: sync primitives (deleted) — Details — Splinter Review

This patch was made slightly-not-quite-entirely-trivial by the nsAutoMonitor-compatibility-mode helper.

Attachment #522610 - Flags: superreview?(benjamin)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 12

•

14 years ago

Attached patch part 4: Migrate everything else to mozilla:: sync primitives (obsolete) (deleted) — Details — Splinter Review

Aaand the rest is straight rewrite :).

Attachment #522611 - Flags: review?(dbaron)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 13

•

14 years ago

Attached patch part 5: Remove nsAutoLock.* (deleted) — Details — Splinter Review

Not sure if this is semi-officially part of our SDK, but this would feel really good.

Attachment #522612 - Flags: review?(benjamin)

Benjamin Smedberg

Updated

•

14 years ago

Attachment #522609 - Flags: superreview?(benjamin) → superreview+

Benjamin Smedberg

Updated

•

14 years ago

Attachment #522610 - Flags: superreview?(benjamin) → superreview+

Benjamin Smedberg

Updated

•

14 years ago

Attachment #522612 - Flags: review?(benjamin) → review+

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 14

•

14 years ago

Comment on attachment 522611 [details] [diff] [review] part 4: Migrate everything else to mozilla:: sync primitives Not quite ready yet.

Attachment #522611 - Flags: review?(dbaron)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Updated

•

14 years ago

Depends on: 646259

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 15

•

14 years ago

Attached patch part 4: Migrate everything else to mozilla:: sync primitives, v2 (deleted) — Details — Splinter Review

Fixed some windows- and mac-only callers.

Attachment #522611 - Attachment is obsolete: true

Attachment #522915 - Flags: review?(dbaron)

David Baron :dbaron:

Reporter

Comment 16

•

14 years ago

You should also fix up the one user of nsAutoCMonitor at some point-- probably to just use nsAutoMonitor.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 17

•

14 years ago

Hmm, looks like an unused test. I wonder if we should just remove the test.

David Baron :dbaron:

Reporter

Comment 18

•

14 years ago

Comment on attachment 522915 [details] [diff] [review] part 4: Migrate everything else to mozilla:: sync primitives, v2 XPCAutoLock should probably become a typedef for MonitorAutoEnter at some point. Doesn't have to be now, though. Though it'd be nice. In nsJar::Open and OpenInner, I think you should use another mechanism for enforcing the already-open check. Add a boolean if you need one. >+ mMonitor("nsCacheService.mMoniton"), s/n"/r"/ nsHostResolver.h: >- PRLock *mLock; >- PRCondVar *mIdleThreadCV; // non-null if idle thread >+ Mutex mLock; >+ CondVar mIdleThreadCV; // non-null if idle thread comment is no longer true nsHttpActivityDistributor::Init should be removed, and it should just use NS_GENERIC_FACTORY_CONSTRUCTOR (no _INIT). Maybe leave the blank lines before and after the first (new) block in nsKeygenThread::Run? In nsHTTPListener::FreeLoadGroup you should leave the block that you removed; you're changing the scope of the autolock. >+ //Free the lock and return - no error - just implies nothing to schedule Drop the "Free the lock and " Also, you don't need to introduce a new scope and reindent in this method, since you unlock at the very end. (Having the return inside the lock scope is a sign you don't need it!) In nsNSSShutdownList you should probably either remove the indent in those methods or make an explicit scope to match what the indent used to show. Same for nsNSSActivityState (where you already did make the explicit scope in some places). In nsProtectedAuthThread::GetSlot, don't introduce extra spaces before =. >+ mQueueCond.Wait(PR_MillisecondsToInterval(10000)_); You've got an extra underscore, which says you didn't compile on Android. In nsExceptionService, don't get too cute -- use a variable name with letters in it. You shouldn't remove FreeAutoLockStatics calls in this patch. >+ mLock("nsRecyclingAllocatior.mLock"), s/tior/tor/ nsPipe::GetInputStream and GetOutputStream need another mechanism of testing for initialization. >+ : mCounterLock("ProxyTex.mCounterLock") >+ , mEvilMonitor("ProxyTex.mEvilMonitor") s/Tex/Test/ Please add MOZ_COUNT_CTOR and MOZ_COUNT_DTOR macros to mozilla::Mutex, mozilla::Monitor, and mozilla::CondVar to ensure you're not leaking locks anywhere. Make sure you get green try runs before landing. r=dbaron with that

Attachment #522915 - Flags: review?(dbaron) → review+

David Baron :dbaron:

Reporter

Comment 19

•

14 years ago

(Removing FreeAutoLockStatics in patch 5 is great, though.)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 20

•

14 years ago

Thanks much. This can't have been an enjoyable review :). (In reply to comment #18) > Comment on attachment 522915 [details] [diff] [review] > part 4: Migrate everything else to mozilla:: sync primitives, v2 > > XPCAutoLock should probably become a typedef for MonitorAutoEnter > at some point. Doesn't have to be now, though. Though it'd be nice. Yes, I didn't understand the fork there; maybe it predated nsAutoMonitor? Bug 556214 might be a good starting point for cleaning this up, since the re-entrancy semantics needs to be made clear there. > In nsJar::Open and OpenInner, I think you should use another mechanism for > enforcing the already-open check. Add a boolean if you need one. Done. I added a bool because I wasn't sure how to generally check already-open in OpenInner. > >+ mMonitor("nsCacheService.mMoniton"), > > s/n"/r"/ Fixed in bug 646259, which will be folded into this before pushing. > nsHostResolver.h: > >- PRLock *mLock; > >- PRCondVar *mIdleThreadCV; // non-null if idle thread > >+ Mutex mLock; > >+ CondVar mIdleThreadCV; // non-null if idle thread > > comment is no longer true D'oh, I went to check whether that comment was true in the old PRCondVar impl before nuking (it wasn't), then I forgot to nuke. Fixed. > nsHttpActivityDistributor::Init should be removed, and it should > just use NS_GENERIC_FACTORY_CONSTRUCTOR (no _INIT). Done. > Maybe leave the blank lines before and after the first (new) block > in nsKeygenThread::Run? OK. > In nsHTTPListener::FreeLoadGroup you should leave the block that you > removed; you're changing the scope of the autolock. > > >+ //Free the lock and return - no error - just implies nothing to schedule > > Drop the "Free the lock and " Done. > Also, you don't need to introduce a new scope and reindent in this > method, since you unlock at the very end. (Having the return inside the > lock scope is a sign you don't need it!) Hm, I'm not sure there's a good solution here. I tried to only add new locking scopes where needed, and I was inconsistent here :S. In v2, I reindented the sites I was 99.9% sure were strangely formatted just because of explicit lock/unlock, not because, say, pushing a value onto the stack while the mutex was held was found to cause perf issues (i.e., none). > In nsNSSShutdownList you should probably either remove the indent > in those methods or make an explicit scope to match what the indent > used to show. Same for nsNSSActivityState (where you already > did make the explicit scope in some places). Changed per above. > In nsProtectedAuthThread::GetSlot, don't introduce extra spaces before > =. Fixed. > >+ mQueueCond.Wait(PR_MillisecondsToInterval(10000)_); > > You've got an extra underscore, which says you didn't compile on > Android. I did, locally, but only opt, which is also only what tinderbox does :S (bug 581290). Debug builds were broken for months until recently. > In nsExceptionService, don't get too cute -- use a variable name > with letters in it. The problem here was that the static variable is named "lock", the same as our (apparent) convention for RAII auto-locks. Renamed static variable to sLock to match style guide. > You shouldn't remove FreeAutoLockStatics calls in this patch. Hm yeah, I had the kill-nsAutoLock patch before this in series so that users wouldn't compile. Will rejigger and fix. (I'm going to fold all these patches before pushing, but that change will help rebasing if a backout is needed.) > >+ mLock("nsRecyclingAllocatior.mLock"), > > s/tior/tor/ Whups. > nsPipe::GetInputStream and GetOutputStream need another mechanism > of testing for initialization. Nice catch! Tryserver found that too :). Another place where I forgot to change the is-init check after looking at the .cpp. Fixed with another new bool member introduced to match existing semantics. > >+ : mCounterLock("ProxyTex.mCounterLock") > >+ , mEvilMonitor("ProxyTex.mEvilMonitor") > > s/Tex/Test/ Fixed. > Please add MOZ_COUNT_CTOR and MOZ_COUNT_DTOR macros to > mozilla::Mutex, mozilla::Monitor, and mozilla::CondVar to ensure > you're not leaking locks anywhere. > > Make sure you get green try runs before landing. Yep!

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 21

•

14 years ago

Comment on attachment 522612 [details] [diff] [review] part 5: Remove nsAutoLock.* >diff --git a/xpcom/glue/Makefile.in b/xpcom/glue/Makefile.in >--- a/xpcom/glue/Makefile.in >+++ b/xpcom/glue/Makefile.in >@@ -65,17 +65,16 @@ CPPSRCS = \ > nsStringAPI.cpp \ > GenericModule.cpp \ > $(NULL) > > # TODO nsAutoLock.h should die soon Kill this line too.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 22

•

14 years ago

Attached patch part 1: Migrate content/media to mozilla:: sync primitives (deleted) — Details — Splinter Review

Rebased, smaller now.

Attachment #522608 - Attachment is obsolete: true

Attachment #522608 - Flags: review?(chris.double)

Attachment #523088 - Flags: review?(chris.double)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 23

•

14 years ago

Attached patch part 6: Make editor test be nicer to deadlock detector (deleted) — Details — Splinter Review

Something about these tests is making for very slow. The new deadlock detector was explicitly designed to fix scalability problems in the old one, and in testing was much better, so there's something very wrong here. Not sure whether the problem lies in the detector, the tests, or the code being tested, but I'd rather fix whichever off the critical path to landing this.

Attachment #523091 - Flags: review?(ehsan)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 24

•

14 years ago

Attached patch part 7: Disable tracemalloc backtraces for xpcshell tests (obsolete) (deleted) — Details — Splinter Review

These patches are looking good on tryserver, except the xpcshell tests didn't finish within 2 hours on windows :/. Most likely due to capturing backtraces being slow.

Attachment #523180 - Flags: review?(dbaron)

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 25

•

14 years ago

Part 7 seems to fix timeout on tryserver. Looks like this is ready to go as soon as the reviews come in.

(no longer active)

Comment 26

•

14 years ago

Comment on attachment 523091 [details] [diff] [review] part 6: Make editor test be nicer to deadlock detector These tests are stupid, and I will kill them without mercy in the future, but in the meantime, you're free to do whatever you need to them in order to get this landed. :-)

Attachment #523091 - Flags: review?(ehsan) → review+

cajbir (:cajbir)

Updated

•

14 years ago

Attachment #523088 - Flags: review?(chris.double) → review+

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 27

•

14 years ago

(In reply to comment #23) > Created attachment 523091 [details] [diff] [review] > part 6: Make editor test be nicer to deadlock detector > > Not sure > whether the problem lies in the detector, the tests, or the code being tested, > but I'd rather fix whichever off the critical path to landing this. The main problem here is that - the test is external API - it defines XPCOM classes - by the nature of the tests, there are a gerjillion AddRef/Release of the COM-y things defined under the external API - AddRef/Release check NS_IsCycleCollectorThread() in debug builds - ... aaand NS_IsCycleCollectorThread() is hahahaomglollerskates slow for external-API code: http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsCycleCollectorUtils.cpp#53 Most of the time in my profile was in tracemalloc. Running before/after with NS_TRACE_MALLOC_DISABLE_STACKS=1 took about the same amount of time; new detector was ~10% slower. The test was *much* slower with the patches here and stacks enabled; too slow for my patience. The old deadlock detector only checked lock ordering in the nsAutoLock/nsAutoMonitor constructors (which is a bug), and the new detector checks on all Mutex::Lock/Monitor::Enter calls. My guess is that something was using raw PR_* operations or nsAuto*::lock/Enter, which weren't getting ordering checks, and they're now being checked (which incurs grabbing a backtrace). Didn't confirm this. Anyway, the takeaway is - NS_IsCycleCollectorThread() is ridiculous for external-API code. Not sure how much anyone cares. - TestTXMgr is chewing a ton of CPU cycles in debug builds; not sure what we gain from that.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 28

•

14 years ago

Attached patch part 7: Disable tracemalloc backtraces for xpcshell tests, v1.1 (deleted) — Details — Splinter Review

Fixed a style inconsistency and a grammar-o.

Attachment #523180 - Attachment is obsolete: true

Attachment #523180 - Flags: review?(dbaron)

Attachment #523237 - Flags: review?(dbaron)

David Baron :dbaron:

Reporter

Comment 29

•

14 years ago

Comment on attachment 523237 [details] [diff] [review] part 7: Disable tracemalloc backtraces for xpcshell tests, v1.1 r=dbaron I should have done this in bug 549561.

Attachment #523237 - Flags: review?(dbaron) → review+

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 30

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/4beec31b9ea9

Status: NEW → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

neil@parkwaycc.co.uk

Comment 31

•

14 years ago

Was this announced anywhere, particularly to existing nsAutoLock consumers?

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 32

•

14 years ago

Increased frequency of intermittent orange bug 618052, backed out http://hg.mozilla.org/mozilla-central/rev/e03c3a6df3cb Sigh.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 33

•

14 years ago

The first round of OS X debug tests was green on tryserver, with parts 1-6. I pushed a second time with part 7 added, and just ran on win/OS X 10.6, and saw bug 618052 on 10.6. That's very puzzling.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 34

•

14 years ago

I wonder if this is relevant: http://mxr.mozilla.org/mozilla-central/source/build/automation.py.in#590. Siiigh.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 35

•

14 years ago

(In reply to comment #31) > Was this announced anywhere, particularly to existing nsAutoLock consumers? It was deprecated 2 years ago [1], but this bug wasn't announced beforehand. There was an effort to kill off nsAutoLock that began about 2 years ago and fizzled out because of review neglect, so I tried to move fast here. [1] http://blog.mozilla.com/cjones/2009/05/04/farewell-prlock-i-hardly-knew-thee/

(no longer active)

Comment 36

•

14 years ago

(In reply to comment #27) > - TestTXMgr is chewing a ton of CPU cycles in debug builds; not sure what we > gain from that. Not much. => bug 647064.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 37

•

14 years ago

Sadly, - self.env["NS_TRACE_MALLOC_DISABLE_STACKS"] = "1" + if not (sys.platform == 'osx' or sys.platform == "darwin"): + # XXX automation.py has this odd special case; without it, bug + # 618052 seems to be exacerbated + self.env["NS_TRACE_MALLOC_DISABLE_STACKS"] = "1" made this pass both OS X debug xpcshell runs on tryserver. Ah well; guess it's good to be consistent with automation.py until the two are unified.

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 38

•

14 years ago

Relanded with, ahem, correction to make automation.py and runxpcshelltests.py agree. http://hg.mozilla.org/mozilla-central/rev/1a89509e25e4

Status: REOPENED → RESOLVED

Closed: 14 years ago → 14 years ago

Resolution: --- → FIXED

Ginn Chen

Updated

•

14 years ago

Depends on: 647150

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Comment 39

•

14 years ago

Amazingly and perhaps frighteningly, this got an all-green.

Ian Neal

Updated

•

14 years ago

Blocks: 647000

neil@parkwaycc.co.uk

Comment 40

•

14 years ago

(In reply to comment #35) > (In reply to comment #31) > > Was this announced anywhere, particularly to existing nsAutoLock consumers? > It was deprecated 2 years ago I guess that constitutes fair warning.

Ian Neal

Updated

•

14 years ago

Blocks: 647003

Chris Jones [:cjones] inactive; ni?/f?/r? if you need me

Assignee

Updated

•

14 years ago

Depends on: 647600

You need to log in before you can comment on or make changes to this bug.