Closed Bug 1366187 Opened 8 years ago Closed 7 years ago

Crash in nssToken_AddRef | find_certs_from_nickname

Categories

(NSS :: Libraries, defect, P2)

3.31
x86
Windows 8
defect

Tracking

(firefox-esr52 unaffected, firefox53 unaffected, firefox54 unaffected, firefox55 wontfix, firefox56 wontfix, firefox57 fix-optional)

RESOLVED WONTFIX
Tracking Status
firefox-esr52 --- unaffected
firefox53 --- unaffected
firefox54 --- unaffected
firefox55 --- wontfix
firefox56 --- wontfix
firefox57 --- fix-optional

People

(Reporter: calixte, Assigned: ueno, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: crash, regression, Whiteboard: [clouseau])

Crash Data

Attachments

(2 files, 1 obsolete file)

This bug was filed from the Socorro interface and is report bp-b04f6cc6-4443-4e97-bcc3-d56f60170519. ============================================================= There is 1 crash in nightly 55 with buildid 20170518030213. In analyzing the backtrace the regression may have been introduced by patch [1] to fix bug 1345368. [1] https://hg.mozilla.org/mozilla-central/rev?node=0751b01e0e656787bd66f0289af2c8d47c64ad85
Flags: needinfo?(franziskuskiefer)
Looks like the recent PK11 changes introduced this. Kai or Daiki can you have a look at this? We'll have to keep an eye on the crash frequency. If it's too high we have to back out those changes if we can't fix them.
Flags: needinfo?(kaie)
Flags: needinfo?(franziskuskiefer)
Flags: needinfo?(dueno)
I suppose the relevant change is this: https://hg.mozilla.org/projects/nss/rev/497ec29ce188 which was originally authored by David and I am not familiar with. David, could you shed some light on this? Is it possible that PK11Slot_GetNSSToken(slot) returns NULL in find_certs_from_nickname()?
Flags: needinfo?(dueno) → needinfo?(dwmw2)
Attached patch nss-find-certs-from-nickname-null.patch (obsolete) (deleted) — Splinter Review
(In reply to Daiki Ueno [:ueno] from comment #2) > Is it possible that PK11Slot_GetNSSToken(slot) returns NULL in find_certs_from_nickname()? In any case, I am attaching a patch that adds a NULL check there.
Attachment #8869397 - Flags: review?(kaie)
Comment on attachment 8869397 [details] [diff] [review] nss-find-certs-from-nickname-null.patch It seems the "goto" isn't necessary, because all of the remaining code (before "loser") is wrapped inside an if(token). So you could alternatively write it as: token = PK11Slot_GetNSSToken(slot); if (token) { nssToken_AddRef(token); } } The previous block sets PORT_SetError(SEC_ERROR_NO_TOKEN) if no token was found, maybe you could do this in the else scenario? if (token) { nssToken_AddRef(token); } else { PORT_SetError(SEC_ERROR_NO_TOKEN); } Looking at the surrounding code, especially the block before that code (if the if condition from line 754 is true), a different approach is used for adding a reference. A reference to a slot is used instead (not to the token). I wonder if this code should be doublechecked if both code pathes make sense. Unfortunately I don't know anything about the referencing internals of this code.
Flags: needinfo?(kaie)
Attachment #8869397 - Flags: review?(kaie) → review+
Assignee: nobody → dueno
(In reply to Kai Engert (:kaie:) from comment #4) > The previous block sets PORT_SetError(SEC_ERROR_NO_TOKEN) > if no token was found, maybe you could do this in the else scenario? > if (token) { > nssToken_AddRef(token); > } else { > PORT_SetError(SEC_ERROR_NO_TOKEN); > } Thank you, this looks nicer. I have updated the patch. > Looking at the surrounding code, especially the block before that code (if > the if condition from line 754 is true), a different approach is used for > adding a reference. A reference to a slot is used instead (not to the > token). If I understand correctly, NSSTrustDomain_FindTokenByName() in the above block also adds a ref to a token, so the ref-counts of both the slot and the token are incremented.
Attachment #8869397 - Attachment is obsolete: true
Attachment #8869406 - Flags: review?(kaie)
Comment on attachment 8869406 [details] [diff] [review] nss-find-certs-from-nickname-null-v2.patch thanks, r=kaie I'll check it in.
Attachment #8869406 - Flags: review?(kaie) → review+
Status: NEW → RESOLVED
Closed: 8 years ago
Component: Security: PSM → Libraries
Product: Core → NSS
Resolution: --- → FIXED
Target Milestone: --- → 3.31
Version: 55 Branch → 3.31
Kai: I'm having trouble determining whether or not this fix is correct, or whether it's indicative of an underlying problem. In Comment #2, Daiki asked "Is it possible that PK11Slot_GetNSSToken(slot) returns NULL in find_certs_from_nickname()?" Recently, we've been investigating a recent Chrome crash (only on our leading edge), which relates to PK11Slot_GetNSSToken(slot) returning NULL, so I've spent the better part of today staring at the code and looking through it. Our crash is related to a call to nssToken_IsPresent, called on the resulting NSSToken from that function. The only time a PK11Slot should have a NULL NSSToken, as far as I can tell, is when STAN_RemoveModuleFromDefaultTrustDomain has been called (which explicitly NULLs it out by walking the module's token list and resetting the NSSToken). Otherwise, a PK11Token should always have an NSSToken, by virtue of STAN_InitTokenForSlotInfo always setting the token for the slot, and STAN_LoadDefaultNSS3TrustDomain / STAN_AddModuleToDefaultTrustDomain always calling that function. My attempts at investigating this have suggested that, especially because it's only a very recent regression in ChromeOS (and we recently had some changes to module initialization), it's entirely possible that there is some unsafe module usage going on that tickled this bug, perhaps in the original code, because the assumption is "No, PK11Slot_GetNSSToken should not return null as long as the module is loaded" (and the corollary, you should not have a module being accessed before you've loaded it or after you've unloaded it, which is the only way you should have the PK11SlotInfo) This also seems consistent that if you look at most of the call sites for PK11Slot_GetNSSToken, they consistently assume that its result will be non-null. * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11cert.c#247 * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11cert.c#2125 * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11cert.c#2182 * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11cert.c#2255 * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11nobj.c#418 * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11nobj.c#491 * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11nobj.c#527 * https://dxr.mozilla.org/nss/source/nss/lib/pki/pki3hack.c#1029 * https://dxr.mozilla.org/nss/source/nss/lib/pki/pki3hack.c#1221 Now, there are some places that explicitly check: * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11cert.c#1044 * https://dxr.mozilla.org/nss/source/nss/lib/pk11wrap/pk11cert.c#2325 (but only indirectly, simply because it doesn't access |tok|, nor does nssToken_FindCertificateByEncodedCertificate, and nssToken_FindObjectsByTemplate does parameter validation check and returns an error) * https://dxr.mozilla.org/nss/source/nss/lib/pki/pki3hack.c#185 (although understandable, because it's in a shutdown path) * https://dxr.mozilla.org/nss/source/nss/lib/pki/pki3hack.c#1221 (but only indirectly, simply because nssToken_ImportCertificate does a parameter validation check and returns an error) My conclusion from all of this is that I believe the checking is likely indicative of an error, perhaps around an unsafe reference counting/usage of a token beyond its module. This, combined with the fact that it's a recent and visible ChromeOS regression, in which we recently introduced new code related to our module lifetimes, seems to support this theory (although we've not fully bisected the error) I think it's worth re-opening this bug, because I'm concerned that https://hg.mozilla.org/projects/nss/rev/6b1d7e8bf127 may be papering over some form of unsafe usage / memory management, rather than resolving the underlying issue.
Flags: needinfo?(rrelyea)
Hi Ryan, thanks for bringing this up, and thanks a lot for the time you're spending investigating this, I really appreciate it. It's midnight over here, and I just saw your comment, so I might not be much of a help until monday, but I wanted to at least give you some immediate feedback. The crashes that you saw, were they new crashes, that started after the code from bug 1162897 had landed? Do your new regressions go away, if you back out all the recent commits from bug 1162897 comment 89? If yes, then I'd agree it might be necessary to back out. Let's investigate more deeply on monday.
reopening, we should investigate if there are side effects of these fixes
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
kaie: Our crashes were unrelated to Bug 1162897, as we are running an older version of NSS (I'm trying to upgrade today). From today's update on the bug investigation, it looks like it was a misbehaving PKCS#11 module library related to how we interact with the TPM, and thus, by not complying with the PKCS#11 interfaces, was resulting in the null token, through causing an improper shutdown sequence. To me, that bolsters the argument that the code is right for assuming a non-null NSSToken, and that if a NULL nsstoken is found, it's likely something else violating that invariant. It may be that the one Firefox crash was caused by a similarly broken module. Based on the extensions reported, it's clear that Firefox crash wasn't a Googler running ChromeOS ;) I don't see any obvious third-party modules (either PKCS#11 tokens or shady malware) in the report in https://crash-stats.mozilla.com/report/index/b04f6cc6-4443-4e97-bcc3-d56f60170519#tab-modules , but I haven't looked up the provenance of all of them (and am spoiled by Chrome crash reporting flagging known malware). The fact that our problems started without any changes in NSS, and solely related to Chrome-side changes, including at least one known bug with our TPM module loading/unloading, further suggest to me we may be papering over something with the null check :)
Did the crash go away after we fixed this bug? I'm trying to summarize what we know (which helps me to think about it): (a) with well behaved modules, after init is completed, and we aren't in shutdown yet, then PK11Slot_GetNSSToken(slot) should never return NULL (b) we don't know if there is a general bug in NSS that causes PK11Slot_GetNSSToken(slot) to occasionally return NULL (c) with a buggy module, Ryan recently saw that PK11Slot_GetNSSToken(slot) returned NULL (d) it's possible that the single crash reported in this bug might also have been caused by a Firefox user using an unidentified buggy module (e) Ryan is worried that the NULL-check introduced in this bug might result in (b) become invisible (f) Ryan has identified several other places in the code that don't use NULL-checks (g) Because Ryan's use of (c) didn't have the code from bug 1162897, it means the crash must have been in one of the places from (f). We could ask: If we had crashed because of (d), why did we crash in code that was recently changed in bug 1162897, not in one of the other places (f)? Is the code path fixed in this bug executed more often than the other ones from (f) ? What happens if the user removes a token during Firefox execution, while we're iterating lists - could it trigger the crash reported in this bug? I understand that Ryan's primary worry was, potentially, the fix for this bug could hide a more general problem. Could we argue, if there's indeed a general problem in the code, the other places (f) without NULL-check should still be sufficient to trigger systematic crashes, and are sufficient to not hide such a general issue? Here's my suggestion, lacking other good ideas: - keep the NULL check that we added - add a comment, that says the NULL check shouldn't be necessary - add an assertion, which is only triggered in debug builds, to at least enable developers to notice unexpected NULL scenarios during development
Attached patch 1366187-add-assert.patch (deleted) — Splinter Review
Attachment #8869992 - Flags: review?(ryan.sleevi)
> Did the crash go away after we fixed this bug? This didn't land in Firefox yet. I wanted to wait for the outcome of the re-opening. It also looks like the crash happened only once.
(In reply to Kai Engert (:kaie:) from comment #12) > Did the crash go away after we fixed this bug? Again, nothing changed in NSS to cause the bug to go away. Your issue was not at all the root cause of my issue. My issue, however, had me spend a lot of time looking at the same code your issue was touching. Code was introduced to Chrome that caused modules to be explicitly misbehaving - and caused crashes with identical stack traces. The code in Chrome was fixed - and the crashes went away. > (f) Ryan has identified several other places in the code that don't use > NULL-checks > > (g) Because Ryan's use of (c) didn't have the code from bug 1162897, it > means the crash must have been in one of the places from (f). > Not quite. We also saw crashes from this very path as well with the buggy module. > If we had crashed because of (d), why did we crash in code that was recently > changed in bug 1162897, not in one of the other places (f)? Is the code path > fixed in this bug executed more often than the other ones from (f) ? Yes, relative to some of the other callers. > What happens if the user removes a token during Firefox execution, while > we're iterating lists - could it trigger the crash reported in this bug? Define "remove". There is the physical act of removal of a card from a slot, but there's also the logical unload of a module using the API. The latter has greater opportunity to trigger this, particularly when using worker threads, which ties back to the above crashing opportunities. > Could we argue, if there's indeed a general problem in the code, the other > places (f) without NULL-check should still be sufficient to trigger > systematic crashes, and are sufficient to not hide such a general issue? > That would be a terrible argument, but it certainly can be made. That is to accept ignorance and the status quo rather than trying to dive in. > > Here's my suggestion, lacking other good ideas: > > - keep the NULL check that we added > - add a comment, that says the NULL check shouldn't be necessary > - add an assertion, which is only triggered in debug builds, > to at least enable developers to notice unexpected NULL scenarios during > development That is certainly a solution you can do. I think it is intellectually lazy, because it replaces understanding with superstition. The original patch was not based on a principled examination of the code or its patterns, and instead based on an "I don't know, and I won't bother to find out." I don't deny that it hides, for now, your other crash, and thus for purposes of time to dedicate to this issue, may represent an appropriate tradeoff. I think that it is unwise, because it adds to the burgeoning complexity and inconsistency of NSS on the basis of a single crash, but that may be the tradeoff NSS is willing to make. My ideal outcome: - Remove the NULL check added - Add oss-fuzz fuzzers that cover the new URL parsing code - Continue monitoring for new FF crashes. Examine the client IDs and loaded modules to determine if there is a pattern of perhaps buggy users. - Only add the NULL check if a path can be determined. Add it consistently if so. - Because the nssToken setting is guarded by the module list lock, the triggering of this may represent a point in which NSS is being used in a thread unsafe manner as well, if the crashes are legit. Fix those issues in Firefox.
Depends on: 1162897
So there are a couple of things I want to comment on. 1) If the crash is related to PK11Slot_GetNSSToken() returning NULL, then the issue is not related to the new patch (the new patch only exposes the issue). I think we should land Daiki's code since we need the functionality. 2) I think the evaluation that PK11Slot_GetNSSToken() should never return NULL is correct (though I'd really have to go back and look at the code). The fact that it does means we do have another issue to look at. Couple that with the fact Ryan is running into this separately means we should drill down as to why. I do remember that the 'Stan' semantics was such that nssslot objects would stay around, but nsstoken objects would disappear if the token is removed. If that semantic is actually implemented, then the missing token object should return a 'token removed' message. It seems weird that we would be getting this unless we have actual removable tokens in the test case that's crashing. Adding Kai's ASSERT means we could continue to push on this and get a reproducible case where the PK11Slot_GetNSSToken does return NULL and determine why it does so. bob
Flags: needinfo?(rrelyea)
Reading the comments further, I realize Ryan was suggesting that the patch itself was causing the underlying invariant. I find that unlikely since the patch only ever fetches slot and token information. It doesn't add or remove tokens from the trust domain, nor add or remove slots. Instead we are likely running into a corner case where the existing code is incorrectly returning NULL (or we are running into a case where the token could be removed and thus NULL, assuming that is the semantic). Looking more closely, this looks like it's a crash report from outside, not one of our tests. In that case we could indeed be running into an issue where a physical token is removed. I'm also clearing the need info for david. At this point we are dealing with NSS internals, not really the PK11URI stuff. bob I
Flags: needinfo?(dwmw2)
Bob, if we understand correctly, you're recommending, that we keep the code as checked in, and add the assertion patch, correct? If yes, please give r+ the attached patch. If not, can you please suggest what to do? Thanks
Flags: needinfo?(rrelyea)
Attachment #8869992 - Flags: review?(ryan.sleevi) → review?(rrelyea)
Comment on attachment 8869992 [details] [diff] [review] 1366187-add-assert.patch Review of attachment 8869992 [details] [diff] [review]: ----------------------------------------------------------------- yes, I think this is a good idea. I've looked at the code more. There is one case I found where we get NULL PK11Slot_GetNSSToken() and that's if we fail to allocate a new token structure. This can happen for a number of reasons (usually dubious issues like malloc failing), but one reason could be that the slot as been disabled. Disabled slots can happen either because the PKCS #11 module fails to provide some function we need, or because the user has explicitly disabled the slot. I've seen a log more check for null in the code then Ryan noted, but most of them are for direct access to the nssToken field in the slot rather than from the wrapper function.
So, in Bob's opinion, this cannot be a regression from bug 1162897. Based on that, I think we should consider bug 1162897 done, and include it in the next release. Nevertheless, this bug should probably be kept open for a while, until we no longer get any crash reports with that call stack. Because other code paths can suffer from the same underlying issue, we should still be able to see the eventual crash, just less frequently. This should partially address Ryan's suggestion to watch out for more crash reports, although less frequently. Ryan suggested to do some fuzzing, but for that to happen, we'd require a volunteer. Is there anyone in the group of NSS developers who has experience with the type of fuzzing that Ryan suggested? It seems we don't have clarity yet, if NULL checks should be added to all places that call PK11Slot_GetNSSToken(). Based on Ryan's argument to watch out for more crashes, we'd omit the NULL checks. Based on Bob's analysis, that there are some legitimate scenarios, adding NULL checks everywhere might be reasonable. But on the other hand, Bob agreed that the ASSERT is a good idea, which seems to suggest, that Bob is still uncertain, and is interested to learn more. Here's my suggestion for now: - check in the assert, I'll treat Bob's comment as an r+ - omit the NULL checks elsewhere, wait for new crashes - hope for a volunteer to help us with fuzzing If more crashes come in, we should analyze them. Ryan suggested to use client IDs and loaded modules. I don't know how much help from Firefox we'd help with that, or if crash logs already contain all the information we require to investigate and compare future crashes.
Comment on attachment 8869992 [details] [diff] [review] 1366187-add-assert.patch (In reply to Robert Relyea from comment #16) > > Adding Kai's ASSERT means we could continue to push on this and get a > reproducible case where the PK11Slot_GetNSSToken does return NULL and > determine why it does so. I just realize that there might be a misunderstanding here. The PR_ASSERT that I had suggested is "debug-build-only". If Bob says, he wants a reproducable case, does that mean he is asking for an assert-always, which even crashes in optimized builds? This would require us to use PR_Assert (note the difference in uppercase/lowercase). But if it's supposed to crash always in optimized mode, we don't really need a PR_Assert, we'd just have to backout the NULL check, because the original code, without NULL check, had exactly that effect, a crash. Bob, please clarify: Do you want this code to crash in optimized builds, or do you want to limit the assert to debug builds?
Debug only.is fine. I thought this tripped on a regression test, in which case we could track down the actual path. If the bug is a rare crash in the field, then it's very likely the case where someone has a token that's disabled (either it's broken or it's been explicitly disabled). We probably need to write a second bug to deal with the other places where we return nsstoken. I think those are coming from lists that can only be enabled tokens, which is why they aren't crashing now.
Flags: needinfo?(rrelyea)
Very low volume, setting fix-optional for 55. We could still take an nss update with this fix in beta.
Attachment #8869992 - Flags: review?(rrelyea) → review+
status?
Flags: needinfo?(rrelyea)
Flags: needinfo?(kaie)
Flags: needinfo?(dueno)
I'm OK with included both reviewed patches. (my preference). I'm also OK with closing the bug wontfix if kai and Daiki don't want to push their patches.
Flags: needinfo?(rrelyea)
Priority: -- → P2
Not seeing any crashes with this signature in the last month. Maybe this was fixed in another bug. Or, should these patches still be included (maybe in 58?)
Flags: needinfo?(ttaubert)
If we see no crashes anymore and Bob would be okay with closing this as wontfix then I'd say these patches here are optional. So not something we'd push to have in 58. But of course still up to Kai and Daiki to land them or not.
Flags: needinfo?(ttaubert)
Status: REOPENED → RESOLVED
Closed: 8 years ago7 years ago
Resolution: --- → WONTFIX
Flags: needinfo?(kaie)
QA Contact: jjones
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: