Closed Bug 28952 Opened 25 years ago Closed 23 years ago

nsCRT::strtok broken for characters > 0x7f

Categories

(Core :: XPCOM, defect, P3)

defect

Tracking

()

RESOLVED FIXED

People

(Reporter: Bienvenu, Assigned: scc)

References

Details

Attachments

(1 file)

nsCRT::strtok gives false positives when the string being searched contains characters > 0x7f - for example, when searching for '/', it thinks that ç is '/'. It's also very inefficient for the simple case of looking for a single delimiter, since it initializes two 32 byte arrays every time. I've stopped using it in the place that was causing me trouble, but I think we should fix this since it can cause hard to reproduce bugs. I would try to fix it myself, but it's not clear to me why it works (or doesn't work).
I went looking for where I had copied this strtok routine from and couldn't find it. But I did find this one in the 4.x codebase in lib/xp/xp_str.c: 1.31 (kmcentee 15-Jan-97): /************************************************* 1.31 (kmcentee 15-Jan-97): The following functions are used to implement 1.31 (kmcentee 15-Jan-97): a thread safe strtok 1.31 (kmcentee 15-Jan-97): *************************************************/ 1.31 (kmcentee 15-Jan-97): /* 1.31 (kmcentee 15-Jan-97): * Get next token from string *stringp, where tokens are (possibly empty) 1.31 (kmcentee 15-Jan-97): * strings separated by characters from delim. Tokens are separated 1.31 (kmcentee 15-Jan-97): * by exactly one delimiter iff the skip parameter is false; otherwise 1.31 (kmcentee 15-Jan-97): * they are separated by runs of characters from delim, because we 1.31 (kmcentee 15-Jan-97): * skip over any initial `delim' characters. 1.31 (kmcentee 15-Jan-97): * 1.31 (kmcentee 15-Jan-97): * Writes NULs into the string at *stringp to end tokens. 1.31 (kmcentee 15-Jan-97): * delim will usually, but need not, remain CONSTant from call to call. 1.31 (kmcentee 15-Jan-97): * On return, *stringp points past the last NUL written (if there might 1.31 (kmcentee 15-Jan-97): * be further tokens), or is NULL (if there are definitely no more tokens). 1.31 (kmcentee 15-Jan-97): * 1.31 (kmcentee 15-Jan-97): * If *stringp is NULL, strtoken returns NULL. 1.31 (kmcentee 15-Jan-97): */ 1.31 (kmcentee 15-Jan-97): static 1.31 (kmcentee 15-Jan-97): char *strtoken_r(char ** stringp, const char *delim, int skip) 1.31 (kmcentee 15-Jan-97): { 1.31 (kmcentee 15-Jan-97): char *s; 1.31 (kmcentee 15-Jan-97): const char *spanp; 1.31 (kmcentee 15-Jan-97): int c, sc; 1.31 (kmcentee 15-Jan-97): char *tok; 1.31 (kmcentee 15-Jan-97): 1.31 (kmcentee 15-Jan-97): if ((s = *stringp) == NULL) 1.31 (kmcentee 15-Jan-97): return (NULL); 1.31 (kmcentee 15-Jan-97): 1.31 (kmcentee 15-Jan-97): if (skip) { 1.31 (kmcentee 15-Jan-97): /* 1.31 (kmcentee 15-Jan-97): * Skip (span) leading delimiters (s += strspn(s, delim)). 1.31 (kmcentee 15-Jan-97): */ 1.31 (kmcentee 15-Jan-97): cont: 1.31 (kmcentee 15-Jan-97): c = *s; 1.31 (kmcentee 15-Jan-97): for (spanp = delim; (sc = *spanp++) != 0;) { 1.31 (kmcentee 15-Jan-97): if (c == sc) { 1.31 (kmcentee 15-Jan-97): s++; 1.31 (kmcentee 15-Jan-97): goto cont; 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): if (c == 0) { /* no token found */ 1.31 (kmcentee 15-Jan-97): *stringp = NULL; 1.31 (kmcentee 15-Jan-97): return (NULL); 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): 1.31 (kmcentee 15-Jan-97): /* 1.31 (kmcentee 15-Jan-97): * Scan token (scan for delimiters: s += strcspn(s, delim), sort of). 1.31 (kmcentee 15-Jan-97): * Note that delim must have one NUL; we stop if we see that, too. 1.31 (kmcentee 15-Jan-97): */ 1.31 (kmcentee 15-Jan-97): for (tok = s;;) { 1.31 (kmcentee 15-Jan-97): c = *s++; 1.31 (kmcentee 15-Jan-97): spanp = delim; 1.31 (kmcentee 15-Jan-97): do { 1.31 (kmcentee 15-Jan-97): if ((sc = *spanp++) == c) { 1.31 (kmcentee 15-Jan-97): if (c == 0) 1.31 (kmcentee 15-Jan-97): s = NULL; 1.31 (kmcentee 15-Jan-97): else 1.31 (kmcentee 15-Jan-97): s[-1] = 0; 1.31 (kmcentee 15-Jan-97): *stringp = s; 1.31 (kmcentee 15-Jan-97): return( (char *) tok ); 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): } while (sc != 0); 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): /* NOTREACHED */ 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): 1.31 (kmcentee 15-Jan-97): 1.31 (kmcentee 15-Jan-97): char *XP_STRTOK_R(char *s1, const char *s2, char **lasts) 1.31 (kmcentee 15-Jan-97): { 1.31 (kmcentee 15-Jan-97): if (s1) 1.31 (kmcentee 15-Jan-97): *lasts = s1; 1.31 (kmcentee 15-Jan-97): return (strtoken_r(lasts, s2, 1)); 1.31 (kmcentee 15-Jan-97): } 1.31 (kmcentee 15-Jan-97): 1.31 (kmcentee 15-Jan-97):
Target Milestone: M15
Scott: Can you own this problem along with all the other string/intl cleanup? Maybe this bug should just become "eliminate nsCRT::strtok". Thanks,
Assignee: warren → scc
Status: NEW → ASSIGNED
Target Milestone: M15 → M16
Target Milestone: M16 → M18
mass re-assigning to my new bugzilla account
Assignee: scc → scc
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Blocks: 5365
Component: XP Miscellany → String
OS: Windows NT → All
Hardware: PC → All
Target Milestone: M18 → ---
Target Milestone: --- → mozilla0.9.1
Depends on: 70090
No longer depends on: 70090
Blocks: 70090
Target Milestone: mozilla0.9.1 → mozilla0.9
Target Milestone: mozilla0.9 → mozilla0.9.1
re-targeting milestones, starting from a clean slate
Target Milestone: mozilla0.9.1 → ---
The function was indeed very broken. Not only the 8-bit cleanliness (resulting from using a character from a string as an index, think about the sign of 'char'!) there was also some bit ro tin the parsing of the delims string. Why the length of delims should have anything to do with DELIM_TABLE_SIZE is beyond me. Anyhow, I'll attach a patch which I've tested (I'm writing this in the resulting browser).
Attached patch fix for nsCRT::strtok (deleted) — Splinter Review
We need to make sure this is well tested for side effects. Not so much for broken functionality but because of places that might depend on this broken functionality. This is used throughout the code. Also, why is the table size even in there in the first place? I noticed that Ulrich has removed it in his patch.
What changed are two things: - on systems with signed char type the array access does not leave the bounds of the array. If code relies on the old behavior it is broken since the compiler is free to place the array variable where it wants and therefore the memory outside the array is completely undefined - the removal of the i < DELIM_TABLE_SIZE test can potentially have consequences if some call to nsCRT::strtok is using more then 32 delimiter characters. This is highly unlikely, though. If you want to be safe place an assert after the loop checking that i is < DELIM_TABLE_SIZE. I've looked through a great deal of the strtok uses and haven't found a place where this can be a problem. There are in total no more than two or three dozen uses of this function so an extensive review is possible.
Actually, thinking more about my last comment, there cannot be any problem at all. If there would be a delim string > 32 chars this would already have been caught. Which leaves only people relying on uninitialized memory which is highly unlikely to work (to say the least) in a cross-platform product like Mozilla.
Ulrich: jst and I have reviewed your patch and |nsCRT::strtok| in detail. We think there are other good things that can be done to make |nsCRT::strtok| better, but your patch is a good start, so I'll check it in immediately. r=jst, sr=scc. Things we think are bad about |nsCRT::strtok|: initialization of the delimiter table could be done with a static string, e.g., char delim[ DELIM_TABLE_SIZE ] = "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"; We think more comments are needed. It took us a bit to figure out that the table was a bit map of all of ASCII. The macros need commenting to explain this, as do the magic numbers |DELIM_TABLE_SIZE|, 3, and 7. The assertion is nonsense, since the routine actually supports having up to 256 different delimiters (though that wouldn't be useful). The existing code uses post-increment in places where it shouldn't. The code is tricky, and that's not a good thing, especially when the tricks aren't commented.
Thanks for your help, Ulrich. Your fix is checked in. I'm closing this bug. If anyone thinks the other comments about |nsCRT::strtok| in this bug are worth fixing, we can file a new bug.
Status: ASSIGNED → RESOLVED
Closed: 23 years ago
Resolution: --- → FIXED
> Things we think are bad about |nsCRT::strtok|: initialization of the delimiter > table could be done with a static string, e.g., > char delim[ DELIM_TABLE_SIZE ] = > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"; Be careful. You'll have to look at the generated code on several platforms. Initializers like the above are generally a problem on some of them. gcc on SPARC had/has? problem, means, generates bad code. If the compiler is not generating an inlined memset() call for a construct like that above it will implement it as single register moves to fill the stack slots which is worse. If you take a look at the implementation I have in glibc you'd see that you'll see that I haven't taken the effort to make a lot of optimizations. I have a special version coded in asm for some platforms but it generally seems not to be necessary. Unless very nicely tuned the startup costs associated with creating the bitmap are higher than the gains. You can perhaps try to keep track of the calls to this function and see how much work is actually done in each calls (how much the input pointer is advanced). So my advice would be to not do anything except for using the system's strtok_r implementation if it exists. The LDAP c-sdk code already does this (and it works after applying the patch I filed).
If you're concerned about initialization costs, here's a data-structure that has a lower initialization cost than the bitmap, and I think an equivalent lookup cost (though I'd have to look at the disassembly to know for sure) PRUint8 set_size = 0; // number of delimiters in |set| unsigned char set[256]; // set[0..set_size-1] contains the delimiters // for which we are searching PRUint8 map[256]; // array of indexes into |set| // test if a delimiter |d| is in the set pos_in_set = map[PRUint8(d)] pos_in_set < set_size && set[pos_in_set] == d // add a delimiter |d| to the set (if it's not already there) set[map[PRUint8(d)] = set_size++] = d The initiliazation cost in this scheme for zero delimiters is only setting |set_size|. Neither array requires any initialization. Removal from the set is easy as well, but no need to show it here since it we have no use for it in this case.
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: