Closed Bug 191541 Opened 22 years ago Closed 22 years ago

we need IsUTF8 /CouldBeUTF8 /CannotBeUTF8

Categories

(Core :: XPCOM, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.4alpha

People

(Reporter: jshin1987, Assigned: jag+mozilla)

References

Details

(Keywords: intl)

Attachments

(1 file, 2 obsolete files)

This is a spin-off of bug 57011 and bug 162765. (see also http://bugzilla.mozilla.org/buglist.cgi?query_format=&short_desc_type=allwordssubstr&short_desc=&long_desc_type=substring&long_desc=IsUTF8&bug_file_loc_type=allwordssubstr&bug_file_loc=&status_whiteboard_type=allwordssubstr&status_whiteboard=&keywords_type=allwords&keywords=&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&emailassigned_to1=1&emailtype1=exact&email1=&emailassigned_to2=1&emailreporter2=1&emailqa_contact2=1&emailtype2=exact&email2=&bugidtype=include&bug_id=&votes=&changedin=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Bug+Number&field0-0-0=noop&type0-0-0=noop&value0-0-0= ) In several places in Mozilla code, we need to check if a given CString is a valid UTF-8 string. At the moment, something like the following cstr.Equals(NS_ConvertUCS2toUTF8(NS_ConvertUTF8toUCS2(cstr))) is used. (see, for instance, http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsSmtpService.cpp#107) As Conrad suggested in bug 57011 and implemented in attachment 17866 [details] [diff] [review], it'd be nice to have IsUTF8(). If we're worried that non-UTF8 strings are mistaken to be in UTF-8 (which I think rarely happens in *real-life*, as opposed to strings concocted to fool IsUTF8(), non-UTF-8 strings encoded in legacy encodings), we can name it CannotBeUTF8() and invert the return value (or just name it CouldBeUTF8()). FYI, glib has g_utf8_validate()(http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-validate) I also found that |nsCRT| class also has |IsAscii|. Is it an accidental duplication or was it done on purpose? In short, I think it's good to add IsUTF8/CannotBeUTF8/CouldBeUTF8 in attachment 17866 [details] [diff] [review] by Conrad.
adding some i18n people and 'intl' keyword.
Keywords: intl
There's a bug on this already, with a patch and stuff. I can't remember who was working on it, though... Neil, maybe?
Whiteboard: DUPEME
bz: not one of mine
Boris, attachment 17866 [details] [diff] [review] also includes a patch. Have you found out which is another bug on this? I couldn't find with Bugzilla.
I haven't; in the end it does not matter as long as we have this functionality and actually use it in the dozens of places it's needed....
I ran a RE search on the entire tree and found only a few places where UTF8ness check is done. However, I think it's a good idea to check in attachment 17866 [details] [diff] [review] (with a possible name change) because I'm pretty sure it'll find more customers once in the tree(see bug 16276). If not for any other reasons, I'd love to see this checked in to avoid being bombarded with tens of pop-up windows with warning when I tried a debug build for MS Windows.
Attached patch a patch (obsolete) (deleted) — Splinter Review
I modified attachment 17866 [details] [diff] [review] to filter out overlong UTF-8 sequences ( e.g. 0xC0 0x80 for U+0000) 'UTF-8 representation of surrogate codepoints' and non-characters ( 0xFFFE, 0xFFFF in all planes). With this change, this patch is equivalent to |str.Equals(NS_ConvertUCS2toUTF8(NS_ConvertUTF8toUCS2(str)))| in terms of 'features' and is more likely to detect non-UTF-8 string. In the patch, I only included one customer of this method(nsTextToSubURI.cpp), but there are a couple of more (see bug 162765).
Comment on attachment 117690 [details] [diff] [review] a patch asking r/sr to move this forward.
Attachment #117690 - Flags: superreview?(bzbarsky)
Attachment #117690 - Flags: review?(nhotta)
I won't be able to sr this for at least 2 weeks....
Comment on attachment 117690 [details] [diff] [review] a patch there are several experts on CC. Pls, pick this up and review it if you can.
Attachment #117690 - Flags: superreview?(bzbarsky) → superreview?(alecf)
Comment on attachment 117690 [details] [diff] [review] a patch Jungshik, please ask Simon for a review. He was working on this issue recently.
Attachment #117690 - Flags: review?(nhotta) → review?(smontagu)
Comment on attachment 117690 [details] [diff] [review] a patch some comments: + // for each chunk of |aString|... this should have a bit less indentation + for ( aString.BeginReading(iter); iter != done_reading; iter.advance( PRInt32(fragmentLength) ) ) + { if your style of spaces after/before parens and indented { on the next line really the current style of the file? +#if 0 why is this #if 0?
Naoki, thank you for transfering review. > + // for each chunk of |aString|... > this should have a bit less indentation > is your style of spaces after/before parens and indented { on the next line > really the current style of the file? Both indentation styles are consistently used in the file. See, for instance, http://lxr.mozilla.org/seamonkey/source/string/src/nsReadableUtils.cpp#570 BTW, I'll replace C++-style comment before |IsUTF8| with C-style comment as is done elsewhere in the file. > +#if 0 > why is this #if 0? 5-6 bytes long UTF-8 sequences are not supposed to exist in practically all cases (Unicode has only 20.1bit codespaces - upto 0x10FFFF - and ISO JTC1/SC2/WG2 committed itself not to assign any character beyond that limit - plane 16. A new IETF RFC has been in preparation to reflect this. see http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt ). Still, there's a possibility that some people produce them (for what? ask them :-)) so that I wasn't 100% sure whether to take that out or not. Excluding them not only speeds things up a bit but also makes |IsUTF8| less likely to mistake non-UTF8 for UTF-8.
Attached patch a new patch (obsolete) (deleted) — Splinter Review
Skimming over rfc2279-rev draft, I realized that I forgot to filter out 4byte UTF-8 sequences for codepoints above 0x10FFFF. I also replaced bitwise-and and comparison with comparison (which is possible because |c| is now PRUint8). Besides, I removed #if 0'd code for 5-6 bytes-long seq.
Attachment #117690 - Attachment is obsolete: true
Attachment #117784 - Flags: superreview?(alecf)
Attachment #117784 - Flags: review?(smontagu)
Comment on attachment 117690 [details] [diff] [review] a patch sorry for spamming
Attachment #117690 - Flags: superreview?(alecf)
Attachment #117690 - Flags: review?(smontagu)
I'm not sure about excluding the ..FFFE and ..FFFF non-characters. They aren't legal codepoints, but are they illegal UTF-8? There was discussion of this question in bug 172701.
Thank you for the ref. to bug 172701. It reminded me of additional non-characters that could be rather hard to throw away in the current scheme. I'm not sure of excluding non-chars, either. (FYI, a successor-to-be of RFC 2279 mentioned in my prev. comment doesn't regard non-chars as invalid.). I filtered them out because I wanted to preserve the behavior of |str.Equals(nsConvertUCS2toUTF8(nsConvertUTF8toUCS2(str)))| for which |IsUTF8| is a replacement. The context for bug 172701 appears different from the one for this bug. The former is about how to and where to validate UTF-8 (or content encoded in it). |IsUTF8| (or |CouldBeUTF8| ) is not a generic bullet-proof validator/invalidator (so that we don't have to worry about Unicode compliance as much) but rather a heuristic 'device' to tell relatively short non-UTF-8 strings from UTF-8 strings that we resort to when other options are exhausted (no external tag present, etc). |IsUTF8| is useful for URI, filename, mail header, http header, ftp list, etc. When |IsUTF8| returns false, the input string is interpreted as a fallback encoding(e.g. parent document encoding, locale charset, etc). Given this, it'd be extremly rare that non-chars are included in UTF-8 strings for cases listed above. On the other hand, someone with malicious intent may come up with a string valid in both a legacy encoding and UTF-8 (if we allow non-chars in UTF-8) that has some security implication if interpreted as the legacy encoding while benign if interpreted as UTF-8. However, there may as well the other way around (benign in the legacy encoding while security-risk in UTF-8). So, I'm really not sure, but I'm a bit inclined to preserve the behavior of what |IsUTF8| intends to replace.
Comment on attachment 117784 [details] [diff] [review] a new patch >+ if ( c <= 0xBF ) // [80-BF] when it's not expected. >+ return PR_FALSE; Why not make this test if ( c <= 0xC1 ) and then you can save the overlong test from the 2 bytes case: >+ if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF] >+ return PR_FALSE; >+ else if ( c <= 0xDF ) // 2 bytes UTF8 Nit: remove the tabs from this line r=smontagu with those changes
Attachment #117784 - Flags: review?(smontagu) → review+
Thank you for r, Simon. Alec, can I get your sr? Marking this as blocking bug 162765. It can go in w/o this, but |IsUTF8| would be much nicer than converting back and forth bet. UTF-8 and UTF-16.
Blocks: 162765
Whiteboard: DUPEME
Target Milestone: --- → mozilla1.4alpha
Comment on attachment 117784 [details] [diff] [review] a new patch + if ( c <= 0xBF ) // [80-BF] when it's not expected. + return PR_FALSE; + else if ( c <= 0xDF ) // 2 bytes UTF8 + { + state = 1; + if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF] + return PR_FALSE; + } + else if ( c <= 0xEF ) // 3 bytes UTF8 + please use UTF8Traits::is3byte() and so forth..
Attachment #117784 - Flags: superreview?(alecf) → superreview-
Attached patch a new patch with UTF8traits:: (deleted) — Splinter Review
Uses UTF8traits::isXXX where approrpiate per Alec's comment. BTW, it seems like UTF8traits, |UTF8InputStream::CountValidUTF8Bytes| and |CalculateUTF8Length| need to be updated.
Attachment #117784 - Attachment is obsolete: true
Comment on attachment 118366 [details] [diff] [review] a new patch with UTF8traits:: assuming Simon's r is carried over...
Attachment #118366 - Flags: superreview?(alecf)
Comment on attachment 118366 [details] [diff] [review] a new patch with UTF8traits:: what about this line: + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7) sr=alecf with that line fixed (or explain why you can't use it in a comment) how should each of those things be updated? to reflect all this stuff with surrogates and such? I will gladly review any additional changes to UTF8Traits if you want to include them here.
Attachment #118366 - Flags: superreview?(alecf) → superreview+
Comment on attachment 118366 [details] [diff] [review] a new patch with UTF8traits:: what about this line: + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7) sr=alecf with that line fixed (or explain why you can't use it in a comment) how should each of those things be updated? to reflect all this stuff with surrogates and such? I will gladly review any additional changes to UTF8Traits if you want to include them here.
> + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7) > sr=alecf with that line fixed (or explain why you can't use it in a commen Thank you for sr. Currently, UTF8traits::Is4byte(c) includes [0xF5-0xF7] as well as [0xF0-0xF4], which is why I commented that 'note the bound ......'. [0xF5-0xF7] begins 4byte seq. for codepoints beyond 10FFFFF. I'll make it more clear. > how should each of those things be updated? to reflect all this stuff with > surrogates and such? I think UTF8traits::Is[56]byte have to be removed and Is4byte has to be revised to filter out [0xF5-0xF7]. Besides, Is2byte may have to be revised to exclude [0xC0-0xC1] (only used for illegal overlong UTF-8 seq.). Because UTF8traits only deals with a single 'char' at a time, it's not possible to filter out all overlong/surrogate cases. UTF8traits::isXXX are used in three places (I didn't count ConvertUTF8toUCS2 in nsString2.h in my prev. comment) and they can be updated accordingly. In case of ConvertUTF8toUCS2, it has its own overlong/non-char/surrogate/plane-17 or higher detectors. In two other places, not detecting all those cases wouldn't be critical. > I will gladly review any additional changes to UTF8Traits if you want to > include them here. How about taking care of them in a separate bug? Alternatively, we can just keep this open to deal with the issue after landing attachment 118366 [details] [diff] [review]. (expecting Is4byte() to be updated as proposed above, I'll comment that |c <= 0xF4| in attachment 118366 [details] [diff] [review] should be replaced by Is4byte when UTF8traits is updated.)
ah, I am beginning to understand can you file a new bug on the issues we just discussed? (I don't think I understand the issues well enough yet to file it myself) with the comments on the 4-byte sequences, can you reference the new bug that you file, so that people reading the code can understand the issues? thanks! it sounds like this is ready to go (With smontagu's review, and the comment update)
Thank you, Alec. Patch was just checked in. I filed bug 199090 and added a comment referring to it in the patch.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Component: String → XPCOM
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: