Closed Bug 262192 Opened 20 years ago Closed 20 years ago

SIGSEGV in pk11_forceTokenAttribute in softoken

Categories

(NSS :: Libraries, defect, P1)

x86
SunOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: julien.pierre, Assigned: julien.pierre)

Details

Attachments

(1 file, 4 obsolete files)

I was running an SSL session reuse stress test on AMD64 solaris and got a core file after about 48hrs. Unfortunately, this was with optimized bits, and I can't narrow down the crash to a particular line of code. The stack on the crashing thread appears corrupt - dbx only shows a single stack frame. I'm trying to reproduce the crash with debug bits.
Keywords: sun-orion3
Priority: -- → P1
Target Milestone: --- → 3.9.3
Do you think this bug is only on Solaris 10 AMD64 or could affect any other plateform ?
Any idea what object, attribute, and value was being forced? Might give us some clues
Christophe, In my previous SSL stress tests, I never saw a core file on any other platform, so I'm not sure that they would be affected. It's possible this has to do with the compiler optimizations . I was using the highest level of optimizations in studio9, with the -xO5 and -fast options, in order to get the best performance. I'm now running a stress test with debug bits to find if I can reproduce the crash. I will leave it running until monday. If it doesn't crash by then, I may restart a test with the optimized bits to find if I can reproduce it again and narrow it down to compiler optimizations ... Nelson : unfortunately, the optimized core file really does not say much at all. I can't get the value of arguments displayed. (dbx) threads t@1 a l@1 ?() sleep on 0x53dfb8 in __lwp_park() t@2 a l@2 _pt_root() LWP suspended in mutex_trylock_adaptive() t@3 a l@3 _pt_root() LWP suspended in __lwp_park() o> t@4 a l@4 _pt_root() signal SIGSEGV in pk11_forceTokenAttribute() t@5 a l@5 _pt_root() LWP suspended in __lwp_park() t@6 a l@6 _pt_root() LWP suspended in __pollsys() t@7 a l@7 _pt_root() LWP suspended in _close() t@8 a l@8 _pt_root() LWP suspended in __pollsys() t@9 a l@9 _pt_root() LWP suspended in __lwp_park() t@10 a l@10 _pt_root() sleep on 0x5a89f8 in __lwp_park() (dbx) thread t@4 t@4 (l@4) stopped in pk11_forceTokenAttribute at 0x7ffffface53e 0x00007ffffface53e: pk11_forceTokenAttribute+0x00be: movq 0x0000000000000030(%r12),%r10 (dbx) where current thread: t@4 =>[1] pk11_forceTokenAttribute(), at 0x7ffffface53e (dbx)
Julien, thanks for emailing me the stack contents, disassembly and register values. The stack does not appear to me to be clobbered. The crash occurred at line 1714 in pk11_forceTokenAttribute 1713 attribute=pk11_FindAttribute(object,type); 1714 if ((type != CKA_LABEL) && (attribute->attrib.ulValueLen == len) && 1715 PORT_Memcmp(attribute->attrib.pValue,value,len) == 0) { the value of "attribute" returned by pk11_FindAttribute was NULL, and line 1714 doesn't check for NULL before dereferencing it. I found that the compiler had in-lined the functions pk11_narrowToTokenObject and pk11_FindAttribute, which effectively caused the pk11_FindAttribute call to be turned into a call to pk11_FindTokenAttribute. Pretty impressive optimization. The fact that pk11_FindTokenAttribute returned NULL may or may not be indicative of another error, perhaps due to optimization. But the immediate cause of the crash is the failure to check the attribute pointer value for NULL.
Thanks, Nelson ! I think it would be useful to add a NULL check for the attribute here, but I'm not certain what CKR error code to return. Most other parts of the code return CKR_TEMPLATE_INCOMPLETE for this case. But pk11_findAttribute doesn't say why it didn't find the attribute. The proper return code might be CKR_ATTRIBUTE_TYPE_INVALID, or CKR_OBJECT_HANDLE_INVALID . It's hard to tell. However, I just found out that the tree that got the core file is a tip tree, not 3.9 branch, which contained a patch made by Ian to enhance performance with PKCS#11 contexts . This may very well be the cause of this crash. I will attach the patch in question. Unfortunately it is one more variable if I have to reproduce the problem.
Target Milestone: 3.9.3 → ---
Version: 3.9.2 → unspecified
Attached patch Ian's patch for the tip (obsolete) (deleted) — Splinter Review
The attribute type being forced was CKA_NEVER_EXTRACTABLE. That attribute is forced in these places: /security/nss/lib/softoken/pkcs11.c, line 1306 /security/nss/lib/softoken/pkcs11.c, line 1413 /security/nss/lib/softoken/pkcs11c.c, line 2926 /security/nss/lib/softoken/pkcs11c.c, line 3404 /security/nss/lib/softoken/pkcs11c.c, line 3408
The places that attempt to set CKA_NEVER_EXTRACTABLE are all in key (pair) generation / unwrapping / deriving. Ian's patch seems to affect digesting only, so it's not at all obvious that his patch would affect the behavior of the operations relevant to this crash.
Hmm what is the object type that we are trying to force this attribute on? The correct CKR value should CKR_ATTRIBUTE_TYPE_INVALID or CKR_HOST_MEMORY, depending on why it couldn't be found. The Token object calls should return a static attribute for this value, unless object type isn't what we thought it was. (indicating a trash somewhere else). Adding a null check here is a good sanity check anyway, and should be included in the patch. bob
Julien sent me a dump of the memory to which the "object" pointer points. The content of that memory is NOT a valid PK11Object. That explains why pk11_FindTokenAttribute returned NULL. But it also says that the NULL pointer dereference was only a side-effect of the invalid object pointer, and the real cause of the crash (the code responsible for the invalid object pointer, or which trashed the PK11 Object) has not yet been identified. :( Since the stack appears intact, we should be able to walk down the stack until we find the source of the invalid pointer. That value appears several times in the stack, so we should be able to find the origin.
FYI, my test with debug bits is still running and the crash has not been reproduced.
Julien, you should try to do an optimized build with debug symbols. You can try adding -g after the -xO5 and -fast options. The compiler will tell you whether you can use -g or some other variant of -g with that compiler optimization level.
I made an "optimized debug build" after struggling a little bit with the compiler. I restarted my test. I hope I can reproduce the crash and get more info out of the new core ...
I haven't been able to reproduce this bug in over a week of trying, unfortunately. I can't debug my old core file either because the specific build has been overwritten, and I can't reproduce it as the particular compiler has gone away too. We are just going to have make the defensive fix in the function where the crash occured, and speculate about the root cause above it.
Attached patch fix crash symptoms (obsolete) (deleted) — Splinter Review
This patch checks for NULL pointers in the function where the crash was observed. As Nelson noted, the content of the PK11Object structure is corrupt, so I attempt to check if slot and refCount are valid. If there are any other assumptions I can make to validate the content of this structure, I'd like to add more checks as well.
Attached patch add missing assertions (obsolete) (deleted) — Splinter Review
Attachment #162015 - Attachment is obsolete: true
Attachment #162016 - Flags: superreview?(rrelyea0264)
Attachment #162016 - Flags: review?(saul.edwards.bugs)
Comment on attachment 162016 [details] [diff] [review] add missing assertions R+.. though I think the first return after the first if should probably be CKR_DEVICE_ERROR. It can only happen is there is some sort of internal corruption. CKR_ARGUMENTS_BAD means the user of the PKCS #11 module passed in bad arguments. That would already be checked by the code tha converts object id into PK11Object *.
Attachment #162016 - Flags: superreview?(rrelyea0264) → superreview+
Comment on attachment 160799 [details] [diff] [review] Ian's patch for the tip FYI, Ian's patch causes problems with the SSL3 cipher suite RSA with RC4 128 and SHA1 . The problem is that the server becomes non-interoperable with standard implementations (MAC mismatch). The RC4 128/ MD5 suite isn't affected. This may not have anything to do with the cause of the crash, but I thought I would mention it.
Attachment #160799 - Attachment is obsolete: true
Julien, your observation in comment 18 is more than a little frightening. Whiel this info is still fresh in your mind, please file a separate bug about it and cc me on that bug. Please be specific about the RCS version number and file name that contains the flaw, and how/why it adversely affects MACs. Thanks!
Nelson, I'm reluctant to file a bug about a patch that hasn't been checked in anywhere yet. To observe the problem, you can apply Ian's patch to an otherwise virgin tip NSS tree, run selfserv with the "-c n" option to select the RSA RC4 128 SHA1 cipher suite, then connect with mozilla or other tools using pre-existing NSS libs. They will not be able to connect. But if you use a client that is using the same broken libs, such as occurs in all.sh, the client will work . Given that Ian's patch deals with digests and attempts to optimize the use of PKCS#11 context, if there is something wrong with it, it is not completely surprising that it could affect MAC computation with SSL, as it does here for SHA1. But two wrongs seem to make one right in this case, as I suppose both the client and server perform the incorrect SHA1 computation with the same bad result . I think this uncovers a critical flaw in our test coverage. Our tests should be able to figure out that SSL interoperability has been broken by a code change such as this one.
Actually I need to move the checks in the patch one level up the stack or it won't do any good, since the same corruption in the PK11Object would crash pk11_forceAttribute, the sole caller of pk11_forceTokenAttribute . One more thing, about Ian's patch. I found that it just breaks SSL3 cipher suites in general, not specifically RSA 128/SHA1. But all TLS cipher suites still work with the patch. To reproduce the problem with the patch, run selfserv (from the tip) with -S and -T options, to disable SSL2 and TLS respectively. Then, the browser won't be able to connect.
With Nelson's help, I figured out that Ian's patch missed one place to flush the new digest buffer, in PK11_DigestKey . Fixing that also fixed SSL3. Unfortunately, the review of the patch shows it's very unlikely to have caused the memory corruption in the PK11Object, so it's probably unrelated to the crash in this bug.
Julien - I opened bug 265620 for my patch. Can you post your updated patch (with Nelson's suggested fix) there? Thanks.
Comment on attachment 162016 [details] [diff] [review] add missing assertions The assertions added in this patch, e.g. >+ PORT_Assert(object); >+ PORT_Assert(object->refCount); >+ PORT_Assert(object->slot); >+ if (!object || >+ !object->refCount || >+ !object->slot) { >+ return CKR_ARGUMENTS_BAD; >+ } >+ >+ to = pk11_narrowToTokenObject(object); > PORT_Assert(to); > if (to == NULL) { > return CKR_DEVICE_ERROR; > } are good, as far as they go. But they don't really assert that the object pointer really points to a valid PKCS11Object. They merely assert that two memory values at various offsets in the object are non-zero, which IMO has some significant probability of being true even if the pointer is pointing at random memory. So, given that the object is a private struct to softoken, I propose the following additional check be implemented. I propose adding a new const char * pointer to struct PK11ObjectStr, immediately after next and prev. Also, there will be a global const character string, global to the softoken only, that is the value "PKCS11Object". Whenever a PKCS11Object is created/initialized, the new pointer member must be set to point to that one string. Thereafter, at any time that we want to check that we have a valid PCKS11Object pointer, we merely have to check that the value of that pointer equals the address of that string. No need to do a string comparison. We could use any magic value, and could be an int rather than a char *, but having it be a char * has value in debugging. With the string, you don't have to memorize magic values in order to ask "what kind of object is this"? You just look at the string and it tells you. I suggest that if we assert that the object pointer points to a real PKCS11Object, in all the functions that are likely to have been in the crash's call stack, we will find the offending code much sooner.
Nelson, Thanks for the ideas. Either a string or a magic value is fine with me (we can use 0xDEADBEEF). I think we should implement this on the tip . For the short-term 3.9.4 patch release, I would like to use the shorter patch just submitted.
Attachment #163079 - Attachment description: move assertions one level up → move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later.
Attachment #163079 - Flags: review?(nelson)
Comment on attachment 163079 [details] [diff] [review] move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later. Saul, please review for 3.9.4 check in. Thanks.
Attachment #163079 - Flags: superreview?(saul.edwards.bugs)
I have checked in the patch to NSS_3_9_BRANCH due to our imminent 3.9.4 build. Checking in pkcs11u.c; /cvsroot/mozilla/security/nss/lib/softoken/pkcs11u.c,v <-- pkcs11u.c new revision: 1.54.2.2; previous revision: 1.54.2.1 done I would still like the reviews, though. Thanks.
Comment on attachment 163079 [details] [diff] [review] move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later. This patch has 2 "hunks". The first hunk is fine. The second one asserts that a pointer is not NULL, immediately AFTER dereferencing it. e.g. > if (pk11_isToken(object->handle)) { >+ PORT_Assert(object); >+ PORT_Assert(object->refCount); >+ PORT_Assert(object->slot); >+ if (!object || >+ !object->refCount || >+ !object->slot) { >+ return CKR_DEVICE_ERROR; >+ } > return pk11_forceTokenAttribute(object,type,value,len); > } Seems to me that these 3 assertions and the code that returns CKR_DEVICE_ERROR should take place BEFORE we dereference object. Also, I don't understand what is gained by moving these assertions up out of pk11_forceTokenAttribute into a code path that only occurs immediately priori to calling that function. I thought the point of moving the assertions up was so that they would also apply to the non-token cases. No?
Attachment #163079 - Flags: review?(nelson) → review-
Comment on attachment 163079 [details] [diff] [review] move assertions one level up. Patch for NSS_3_9_BRANCH and tip for now; new patch for tip to follow later. Julien, Please correct this before shipping 3.9.4
Attachment #163079 - Flags: superreview?(saul.edwards.bugs)
Nelson, You are right, the third patch was not functionally different from the second one. I moved the assertions before the pk11_isToken call . Checking in pkcs11u.c; /cvsroot/mozilla/security/nss/lib/softoken/pkcs11u.c,v <-- pkcs11u.c new revision: 1.54.2.3; previous revision: 1.54.2.2 done
Attached patch updated patch (deleted) — Splinter Review
move assertions a few lines up. This is a cumulative diff This patch has been checked in to both the NSS_3_9_BRANCH and the tip : Checking in pkcs11u.c; /cvsroot/mozilla/security/nss/lib/softoken/pkcs11u.c,v <-- pkcs11u.c new revision: 1.58; previous revision: 1.57 done A more extensive patch will follow later to incorporate the magic logic and Nelson's other ideas for the tip, so I'm keeping this bug open.
Attachment #163079 - Attachment is obsolete: true
Attachment #162016 - Flags: review?(saul.edwards.bugs)
I have not seen this specific problem occur again. Marking fixed.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Julien, please set the target milestone so we know in which NSS release this bug was fixed.
Target Milestone: --- → 3.9.4
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: