Closed
Bug 171220
Opened 22 years ago
Closed 22 years ago
Profile switching network teardown race condition with NSS shutdown
Categories
(Core Graveyard :: Profile: BackEnd, defect, P1)
Core Graveyard
Profile: BackEnd
Tracking
(Not tracked)
VERIFIED
FIXED
mozilla1.0.2
People
(Reporter: KaiE, Assigned: KaiE)
References
Details
(Keywords: topembed+, Whiteboard: [adt2] [ETA 10/15])
Attachments
(1 file)
(deleted),
patch
|
ccarlen
:
review+
darin.moz
:
superreview+
asa
:
approval+
|
Details | Diff | Splinter Review |
cc'ing everybody I know is involved in profile switching / turbo mode
We currently on switching profiles in embedding apps after doing some crypto
operations, becasue of a race condition between network shutdown and NSS shutdown.
Intro:
In order to switch profiles, it is necessary to prepare a clean environment, to
release all usage of NSS code, to allow NSS to shut itself down and re-init cleanly.
In the past we prepared this by switching Necko of offline state, and profile
manager sends out events to prepare that clean environment.
Three events are used:
- profile-change-net-teardown
- profile-change-teardown
- profile-before-change
I believe the plan was to bring down all network activity during event
profile-change-net-teardown.
Bug:
In file nsHttpHandler.cpp there is code to drop all http connections when events
are observed.
However, the code listens for "profile-before-change" (string at two places in
the file).
But the PSM/NSS code that assumes network is already down is also being called
when the same event is received!
In my test case, the NSS shutdown code was actually being reached before http
dropped the connections.
Proposal:
In order to fix the race condition, I propose to change nsHttpHandler to drop
connections on even "profile-change-net-teardown".
(When I talked to Darin yesterday, he reminded that a clean solution would be if
NSS itself took care, that it ignored any open handles after it has shut down
itself. But I'm unable to provide this approach at the current time. The NSS
team is still looking into a "hammer" shutdown, but still, I think we should
make sure that we avoid anything that could eventually mean stale resources
being still accessed/referenced.)
Assignee | ||
Comment 1•22 years ago
|
||
Comment 2•22 years ago
|
||
The hammer solution is fraugh with problems:
1) It would probably result in a later crash. This crash, the result of an app
keeping an NSS ref around that has become invalid because of the NSS shutdown.
This crash would still happen in NSS (because the data would be passed to an NSS
function), and the support calls on NSS would be even higher.
NSS cannot "ignore any open handle". The typical scenario would be: app keeps a
certificate around. NSS hammer shuts down. NSS starts up. app call
NSS_func(leaked_cert); How does NSS_func() know that leaked_cert is not valid.
There are methods in which applications keep NSS data as Aliases (i.e., the
member can be set to null by NSS, because the data is an AliasTarget) and apps
always test for null before using the data. However, Alias and AliasTargets have
to be used carefully in multithreaded environment (i.e., alias usage and
aliastarget cleanup of it's alias must occur on the same thread or else it must
be protected), and the application changes would be very large.
2) It's just telling application developers: don't worry about making sure that
you release all NSS data when you need to.
If the requirement is that NSS be shutdown and restarted, and no crash can
happen in the future because of leaked reference to NSS data, then the hammer
solution must be ruled out.
Comment 3•22 years ago
|
||
stephane:
i was only talking about the layered PRFileDesc... it seems like NSS could
protect itself at shutdown by going through and marking all of its open file
descriptors as invalid/broken (think: EPIPE). having this in place would make
NSS less susceptible to coding errors in the client. it doesn't mean that
clients shouldn't shutdown in the proper order; it just makes the system more
robust :-/
Comment 4•22 years ago
|
||
you assume NSS has access to all the open file descriptors. NSS is just a layer
on the stack, and has no knowledge about what file descriptors happenned to be
openned. All the state is stored in the file descriptor itself. There is no list
of open file descriptors in NSS -- that's up to the application to manage.
bob
Comment 5•22 years ago
|
||
ic, well, wishful thinking i guess :-/
Assignee | ||
Comment 6•22 years ago
|
||
Can you please review the patch?
Comment 7•22 years ago
|
||
Comment on attachment 100892 [details] [diff] [review]
Patch v1
sr=darin
Attachment #100892 -
Flags: superreview+
Comment 8•22 years ago
|
||
bryner: you might be interested in this patch as well.
Comment 9•22 years ago
|
||
I'm not sure why we need to observe "profile-change-net-teardown" in the http
handler as well.
It gets used here
http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsIOService.cpp#707
After the IOService has gone offline due to this notification (which I thought
happened synchronously), what's still hanging around?
Comment 10•22 years ago
|
||
yeah, socket transport service shutdown does happen synchronously (i.e., the UI
thread joins with the socket thread).
Assignee | ||
Comment 11•22 years ago
|
||
I can't confirm to your assumption.
The event observation code in HttpHandler is not a no-op.
When the event is observed in HttpHandler, it finds alive objects and cleans
them up.
If I remember correctly, it was happening in the call to:
DropConnections(mIdleConnections);
The cleanup code results in some http objects getting freed, and in case of
https connections, SSL cleanup functions are reached.
Assignee | ||
Comment 12•22 years ago
|
||
Note: The problem was seen on recent MOZILLA_1_0_BRANCH code.
Comment 13•22 years ago
|
||
Comment on attachment 100892 [details] [diff] [review]
Patch v1
r=ccarlen
Attachment #100892 -
Flags: review+
Assignee | ||
Updated•22 years ago
|
Keywords: adt1.0.2,
mozilla1.0.2
Comment 14•22 years ago
|
||
Comment on attachment 100892 [details] [diff] [review]
Patch v1
a=asa for checkin to 1.2beta (on behalf of drivers)
Attachment #100892 -
Flags: approval+
Assignee | ||
Comment 15•22 years ago
|
||
Fixed on trunk.
Darin cc'ed me on bug 174131, and if I understand correctly, that other bug
might be the real cause of the problem we are trying to fix here. While in
theory, this patch might no longer be required, I checked it in anyway, because
we all agreed it is safe - and it fixes the race condition.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Comment 16•22 years ago
|
||
yeah, i agree that this patch is still the right thing. thanks kai!
Updated•22 years ago
|
QA Contact: ktrina → gbush
Comment 17•22 years ago
|
||
a=rjesup@wgate.com for 1.0 branch. Change mozilla1.0.2+ to fixed1.0.2 when
checked in.
Keywords: mozilla1.0.2 → mozilla1.0.2+
Comment 18•22 years ago
|
||
Marking as topembed+, as this is needed by a major embedding customer.
Comment 19•22 years ago
|
||
Plussing for adt. Need checkin by cob 10/16. Please make sure to get all
required approvals (drivers) before checkin if this is not already done.
Comment 21•22 years ago
|
||
verifying code fix for branch
marking verified1.0.2
Keywords: fixed1.0.2 → verified1.0.2
Updated•9 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•