191541 - we need IsUTF8 /CouldBeUTF8 /CannotBeUTF8

Boris, attachment 17866 [details] [diff] [review] also includes a patch. Have you found out which is another bug on this? I couldn't find with Bugzilla.

Boris Zbarsky [:bzbarsky]

Comment 5

•

22 years ago

I haven't; in the end it does not matter as long as we have this functionality and actually use it in the dozens of places it's needed....

Jungshik Shin

Reporter

Comment 6

•

22 years ago

I ran a RE search on the entire tree and found only a few places where UTF8ness check is done. However, I think it's a good idea to check in attachment 17866 [details] [diff] [review] (with a possible name change) because I'm pretty sure it'll find more customers once in the tree(see bug 16276). If not for any other reasons, I'd love to see this checked in to avoid being bombarded with tens of pop-up windows with warning when I tried a debug build for MS Windows.

Jungshik Shin

Reporter

Comment 7

•

22 years ago

Attached patch a patch (obsolete) (deleted) — Details — Splinter Review

I modified attachment 17866 [details] [diff] [review] to filter out overlong UTF-8 sequences ( e.g. 0xC0 0x80 for U+0000) 'UTF-8 representation of surrogate codepoints' and non-characters ( 0xFFFE, 0xFFFF in all planes). With this change, this patch is equivalent to |str.Equals(NS_ConvertUCS2toUTF8(NS_ConvertUTF8toUCS2(str)))| in terms of 'features' and is more likely to detect non-UTF-8 string. In the patch, I only included one customer of this method(nsTextToSubURI.cpp), but there are a couple of more (see bug 162765).

Jungshik Shin

Reporter

Comment 8

•

22 years ago

Comment on attachment 117690 [details] [diff] [review] a patch asking r/sr to move this forward.

Attachment #117690 - Flags: superreview?(bzbarsky)

Attachment #117690 - Flags: review?(nhotta)

Boris Zbarsky [:bzbarsky]

Comment 9

•

22 years ago

I won't be able to sr this for at least 2 weeks....

Jungshik Shin

Reporter

Comment 10

•

22 years ago

Comment on attachment 117690 [details] [diff] [review] a patch there are several experts on CC. Pls, pick this up and review it if you can.

Attachment #117690 - Flags: superreview?(bzbarsky) → superreview?(alecf)

nhottanscp

Comment 11

•

22 years ago

Comment on attachment 117690 [details] [diff] [review] a patch Jungshik, please ask Simon for a review. He was working on this issue recently.

Attachment #117690 - Flags: review?(nhotta) → review?(smontagu)

Christian :Biesinger (don't email me, ping me on IRC)

Comment 12

•

22 years ago

Comment on attachment 117690 [details] [diff] [review] a patch some comments: + // for each chunk of |aString|... this should have a bit less indentation + for ( aString.BeginReading(iter); iter != done_reading; iter.advance( PRInt32(fragmentLength) ) ) + { if your style of spaces after/before parens and indented { on the next line really the current style of the file? +#if 0 why is this #if 0?

Jungshik Shin

Reporter

Comment 13

•

22 years ago

Naoki, thank you for transfering review. > + // for each chunk of |aString|... > this should have a bit less indentation > is your style of spaces after/before parens and indented { on the next line > really the current style of the file? Both indentation styles are consistently used in the file. See, for instance, http://lxr.mozilla.org/seamonkey/source/string/src/nsReadableUtils.cpp#570 BTW, I'll replace C++-style comment before |IsUTF8| with C-style comment as is done elsewhere in the file. > +#if 0 > why is this #if 0? 5-6 bytes long UTF-8 sequences are not supposed to exist in practically all cases (Unicode has only 20.1bit codespaces - upto 0x10FFFF - and ISO JTC1/SC2/WG2 committed itself not to assign any character beyond that limit - plane 16. A new IETF RFC has been in preparation to reflect this. see http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt ). Still, there's a possibility that some people produce them (for what? ask them :-)) so that I wasn't 100% sure whether to take that out or not. Excluding them not only speeds things up a bit but also makes |IsUTF8| less likely to mistake non-UTF8 for UTF-8.

Jungshik Shin

Reporter

Comment 14

•

22 years ago

Attached patch a new patch (obsolete) (deleted) — Details — Splinter Review

Skimming over rfc2279-rev draft, I realized that I forgot to filter out 4byte UTF-8 sequences for codepoints above 0x10FFFF. I also replaced bitwise-and and comparison with comparison (which is possible because |c| is now PRUint8). Besides, I removed #if 0'd code for 5-6 bytes-long seq.

Attachment #117690 - Attachment is obsolete: true

Jungshik Shin

Reporter

Updated

•

22 years ago

Attachment #117784 - Flags: superreview?(alecf)

Attachment #117784 - Flags: review?(smontagu)

Jungshik Shin

Reporter

Comment 15

•

22 years ago

Comment on attachment 117690 [details] [diff] [review] a patch sorry for spamming

Attachment #117690 - Flags: superreview?(alecf)

Attachment #117690 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 16

•

22 years ago

I'm not sure about excluding the ..FFFE and ..FFFF non-characters. They aren't legal codepoints, but are they illegal UTF-8? There was discussion of this question in bug 172701.

Jungshik Shin

Reporter

Comment 17

•

22 years ago

Thank you for the ref. to bug 172701. It reminded me of additional non-characters that could be rather hard to throw away in the current scheme. I'm not sure of excluding non-chars, either. (FYI, a successor-to-be of RFC 2279 mentioned in my prev. comment doesn't regard non-chars as invalid.). I filtered them out because I wanted to preserve the behavior of |str.Equals(nsConvertUCS2toUTF8(nsConvertUTF8toUCS2(str)))| for which |IsUTF8| is a replacement. The context for bug 172701 appears different from the one for this bug. The former is about how to and where to validate UTF-8 (or content encoded in it). |IsUTF8| (or |CouldBeUTF8| ) is not a generic bullet-proof validator/invalidator (so that we don't have to worry about Unicode compliance as much) but rather a heuristic 'device' to tell relatively short non-UTF-8 strings from UTF-8 strings that we resort to when other options are exhausted (no external tag present, etc). |IsUTF8| is useful for URI, filename, mail header, http header, ftp list, etc. When |IsUTF8| returns false, the input string is interpreted as a fallback encoding(e.g. parent document encoding, locale charset, etc). Given this, it'd be extremly rare that non-chars are included in UTF-8 strings for cases listed above. On the other hand, someone with malicious intent may come up with a string valid in both a legacy encoding and UTF-8 (if we allow non-chars in UTF-8) that has some security implication if interpreted as the legacy encoding while benign if interpreted as UTF-8. However, there may as well the other way around (benign in the legacy encoding while security-risk in UTF-8). So, I'm really not sure, but I'm a bit inclined to preserve the behavior of what |IsUTF8| intends to replace.

Simon Montagu :smontagu

Comment 18

•

22 years ago

Comment on attachment 117784 [details] [diff] [review] a new patch >+ if ( c <= 0xBF ) // [80-BF] when it's not expected. >+ return PR_FALSE; Why not make this test if ( c <= 0xC1 ) and then you can save the overlong test from the 2 bytes case: >+ if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF] >+ return PR_FALSE; >+ else if ( c <= 0xDF ) // 2 bytes UTF8 Nit: remove the tabs from this line r=smontagu with those changes

Attachment #117784 - Flags: review?(smontagu) → review+

Jungshik Shin

Reporter

Comment 19

•

22 years ago

Thank you for r, Simon. Alec, can I get your sr? Marking this as blocking bug 162765. It can go in w/o this, but |IsUTF8| would be much nicer than converting back and forth bet. UTF-8 and UTF-16.

Blocks: 162765

Whiteboard: DUPEME

Target Milestone: --- → mozilla1.4alpha

Alec Flett

Comment 20

•

22 years ago

Comment on attachment 117784 [details] [diff] [review] a new patch + if ( c <= 0xBF ) // [80-BF] when it's not expected. + return PR_FALSE; + else if ( c <= 0xDF ) // 2 bytes UTF8 + { + state = 1; + if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF] + return PR_FALSE; + } + else if ( c <= 0xEF ) // 3 bytes UTF8 + please use UTF8Traits::is3byte() and so forth..

Attachment #117784 - Flags: superreview?(alecf) → superreview-

Jungshik Shin

Reporter

Comment 21

•

22 years ago

Attached patch a new patch with UTF8traits:: (deleted) — Details — Splinter Review

Uses UTF8traits::isXXX where approrpiate per Alec's comment. BTW, it seems like UTF8traits, |UTF8InputStream::CountValidUTF8Bytes| and |CalculateUTF8Length| need to be updated.

Attachment #117784 - Attachment is obsolete: true

Jungshik Shin

Reporter

Comment 22

•

22 years ago

Comment on attachment 118366 [details] [diff] [review] a new patch with UTF8traits:: assuming Simon's r is carried over...

Attachment #118366 - Flags: superreview?(alecf)

Alec Flett

Comment 23

•

22 years ago

Comment on attachment 118366 [details] [diff] [review] a new patch with UTF8traits:: what about this line: + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7) sr=alecf with that line fixed (or explain why you can't use it in a comment) how should each of those things be updated? to reflect all this stuff with surrogates and such? I will gladly review any additional changes to UTF8Traits if you want to include them here.

Attachment #118366 - Flags: superreview?(alecf) → superreview+

Alec Flett

Comment 24

•

22 years ago

Comment on attachment 118366 [details] [diff] [review] a new patch with UTF8traits:: what about this line: + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7) sr=alecf with that line fixed (or explain why you can't use it in a comment) how should each of those things be updated? to reflect all this stuff with surrogates and such? I will gladly review any additional changes to UTF8Traits if you want to include them here.

Jungshik Shin

Reporter

Comment 25

•

22 years ago

> + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7) > sr=alecf with that line fixed (or explain why you can't use it in a commen Thank you for sr. Currently, UTF8traits::Is4byte(c) includes [0xF5-0xF7] as well as [0xF0-0xF4], which is why I commented that 'note the bound ......'. [0xF5-0xF7] begins 4byte seq. for codepoints beyond 10FFFFF. I'll make it more clear. > how should each of those things be updated? to reflect all this stuff with > surrogates and such? I think UTF8traits::Is[56]byte have to be removed and Is4byte has to be revised to filter out [0xF5-0xF7]. Besides, Is2byte may have to be revised to exclude [0xC0-0xC1] (only used for illegal overlong UTF-8 seq.). Because UTF8traits only deals with a single 'char' at a time, it's not possible to filter out all overlong/surrogate cases. UTF8traits::isXXX are used in three places (I didn't count ConvertUTF8toUCS2 in nsString2.h in my prev. comment) and they can be updated accordingly. In case of ConvertUTF8toUCS2, it has its own overlong/non-char/surrogate/plane-17 or higher detectors. In two other places, not detecting all those cases wouldn't be critical. > I will gladly review any additional changes to UTF8Traits if you want to > include them here. How about taking care of them in a separate bug? Alternatively, we can just keep this open to deal with the issue after landing attachment 118366 [details] [diff] [review]. (expecting Is4byte() to be updated as proposed above, I'll comment that |c <= 0xF4| in attachment 118366 [details] [diff] [review] should be replaced by Is4byte when UTF8traits is updated.)

Alec Flett

Comment 26

•

22 years ago

ah, I am beginning to understand can you file a new bug on the issues we just discussed? (I don't think I understand the issues well enough yet to file it myself) with the comments on the 4-byte sequences, can you reference the new bug that you file, so that people reading the code can understand the issues? thanks! it sounds like this is ready to go (With smontagu's review, and the comment update)

Jungshik Shin

Reporter

Comment 27

•

22 years ago

Thank you, Alec. Patch was just checked in. I filed bug 199090 and added a comment referring to it in the patch.

Status: NEW → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

4 years ago

Component: String → XPCOM

a patch 22 years ago Jungshik Shin (deleted), patch		Details \| Diff \| Splinter Review
a new patch 22 years ago Jungshik Shin (deleted), patch	smontagu : review+ alecf : superreview-	Details \| Diff \| Splinter Review
a new patch with UTF8traits:: 22 years ago Jungshik Shin (deleted), patch	alecf : superreview+	Details \| Diff \| Splinter Review