Closed
Bug 191541
Opened 22 years ago
Closed 22 years ago
we need IsUTF8 /CouldBeUTF8 /CannotBeUTF8
Categories
(Core :: XPCOM, defect)
Core
XPCOM
Tracking
()
RESOLVED
FIXED
mozilla1.4alpha
People
(Reporter: jshin1987, Assigned: jag+mozilla)
References
Details
(Keywords: intl)
Attachments
(1 file, 2 obsolete files)
(deleted),
patch
|
alecf
:
superreview+
|
Details | Diff | Splinter Review |
This is a spin-off of bug 57011 and bug 162765.
(see also
http://bugzilla.mozilla.org/buglist.cgi?query_format=&short_desc_type=allwordssubstr&short_desc=&long_desc_type=substring&long_desc=IsUTF8&bug_file_loc_type=allwordssubstr&bug_file_loc=&status_whiteboard_type=allwordssubstr&status_whiteboard=&keywords_type=allwords&keywords=&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&bug_status=RESOLVED&bug_status=VERIFIED&bug_status=CLOSED&emailassigned_to1=1&emailtype1=exact&email1=&emailassigned_to2=1&emailreporter2=1&emailqa_contact2=1&emailtype2=exact&email2=&bugidtype=include&bug_id=&votes=&changedin=&chfieldfrom=&chfieldto=Now&chfieldvalue=&cmdtype=doit&order=Bug+Number&field0-0-0=noop&type0-0-0=noop&value0-0-0=
)
In several places in Mozilla code, we need to check
if a given CString is a valid UTF-8 string. At the moment,
something like the following
cstr.Equals(NS_ConvertUCS2toUTF8(NS_ConvertUTF8toUCS2(cstr)))
is used. (see, for instance,
http://lxr.mozilla.org/seamonkey/source/mailnews/compose/src/nsSmtpService.cpp#107)
As Conrad suggested in bug 57011 and implemented
in attachment 17866 [details] [diff] [review], it'd be nice to have IsUTF8().
If we're worried that non-UTF8 strings are mistaken
to be in UTF-8 (which I think rarely happens in
*real-life*, as opposed to strings concocted
to fool IsUTF8(), non-UTF-8 strings encoded in legacy
encodings), we can name it CannotBeUTF8() and
invert the return value (or just name it
CouldBeUTF8()).
FYI, glib has
g_utf8_validate()(http://developer.gnome.org/doc/API/2.0/glib/glib-Unicode-Manipulation.html#g-utf8-validate)
I also found that |nsCRT| class also has |IsAscii|. Is it an accidental
duplication or was it done on purpose?
In short, I think it's good to add IsUTF8/CannotBeUTF8/CouldBeUTF8
in attachment 17866 [details] [diff] [review] by Conrad.
Comment 2•22 years ago
|
||
There's a bug on this already, with a patch and stuff. I can't remember who was
working on it, though... Neil, maybe?
Whiteboard: DUPEME
Comment 3•22 years ago
|
||
bz: not one of mine
Reporter | ||
Comment 4•22 years ago
|
||
Boris,
attachment 17866 [details] [diff] [review] also includes a patch. Have you found out
which is another bug on this? I couldn't find with Bugzilla.
Comment 5•22 years ago
|
||
I haven't; in the end it does not matter as long as we have this functionality
and actually use it in the dozens of places it's needed....
Reporter | ||
Comment 6•22 years ago
|
||
I ran a RE search on the entire tree and found only a few places
where UTF8ness check is done. However, I think it's a good
idea to check in attachment 17866 [details] [diff] [review] (with a possible name change)
because I'm pretty sure it'll find more customers once in
the tree(see bug 16276). If not for any other reasons, I'd love to
see this checked in to avoid being bombarded with
tens of pop-up windows with warning when I tried a debug build
for MS Windows.
Reporter | ||
Comment 7•22 years ago
|
||
I modified attachment 17866 [details] [diff] [review] to filter out
overlong UTF-8 sequences ( e.g. 0xC0 0x80 for U+0000)
'UTF-8 representation of surrogate codepoints' and
non-characters ( 0xFFFE, 0xFFFF in all planes).
With this change, this patch is equivalent
to |str.Equals(NS_ConvertUCS2toUTF8(NS_ConvertUTF8toUCS2(str)))|
in terms of 'features' and is more likely to
detect non-UTF-8 string.
In the patch, I only included one customer of this
method(nsTextToSubURI.cpp), but there are a couple
of more (see bug 162765).
Reporter | ||
Comment 8•22 years ago
|
||
Comment on attachment 117690 [details] [diff] [review]
a patch
asking r/sr to move this forward.
Attachment #117690 -
Flags: superreview?(bzbarsky)
Attachment #117690 -
Flags: review?(nhotta)
Comment 9•22 years ago
|
||
I won't be able to sr this for at least 2 weeks....
Reporter | ||
Comment 10•22 years ago
|
||
Comment on attachment 117690 [details] [diff] [review]
a patch
there are several experts on CC. Pls, pick this up and review it if you can.
Attachment #117690 -
Flags: superreview?(bzbarsky) → superreview?(alecf)
Comment 11•22 years ago
|
||
Comment on attachment 117690 [details] [diff] [review]
a patch
Jungshik, please ask Simon for a review. He was working on this issue recently.
Attachment #117690 -
Flags: review?(nhotta) → review?(smontagu)
Comment 12•22 years ago
|
||
Comment on attachment 117690 [details] [diff] [review]
a patch
some comments:
+ // for each chunk of |aString|...
this should have a bit less indentation
+ for ( aString.BeginReading(iter); iter != done_reading; iter.advance(
PRInt32(fragmentLength) ) )
+ {
if your style of spaces after/before parens and indented { on the next line
really the current style of the file?
+#if 0
why is this #if 0?
Reporter | ||
Comment 13•22 years ago
|
||
Naoki, thank you for transfering review.
> + // for each chunk of |aString|...
> this should have a bit less indentation
> is your style of spaces after/before parens and indented { on the next line
> really the current style of the file?
Both indentation styles are consistently used in the file. See, for instance,
http://lxr.mozilla.org/seamonkey/source/string/src/nsReadableUtils.cpp#570
BTW, I'll replace C++-style comment before |IsUTF8| with C-style comment
as is done elsewhere in the file.
> +#if 0
> why is this #if 0?
5-6 bytes long UTF-8 sequences are not supposed to exist in practically all
cases (Unicode has only 20.1bit codespaces - upto 0x10FFFF - and
ISO JTC1/SC2/WG2 committed itself not to assign any character beyond
that limit - plane 16. A new IETF RFC has been in preparation to reflect
this. see http://www.ietf.org/internet-drafts/draft-yergeau-rfc2279bis-04.txt
). Still, there's a possibility that some people
produce them (for what? ask them :-)) so that I wasn't 100% sure whether
to take that out or not. Excluding them not only speeds things up a bit
but also makes |IsUTF8| less likely to mistake non-UTF8 for UTF-8.
Reporter | ||
Comment 14•22 years ago
|
||
Skimming over rfc2279-rev draft,
I realized that I forgot to filter out 4byte UTF-8
sequences for codepoints above 0x10FFFF. I also
replaced bitwise-and and comparison with comparison
(which is possible because |c| is now PRUint8).
Besides, I removed #if 0'd code for 5-6 bytes-long seq.
Attachment #117690 -
Attachment is obsolete: true
Reporter | ||
Updated•22 years ago
|
Attachment #117784 -
Flags: superreview?(alecf)
Attachment #117784 -
Flags: review?(smontagu)
Reporter | ||
Comment 15•22 years ago
|
||
Comment on attachment 117690 [details] [diff] [review]
a patch
sorry for spamming
Attachment #117690 -
Flags: superreview?(alecf)
Attachment #117690 -
Flags: review?(smontagu)
Comment 16•22 years ago
|
||
I'm not sure about excluding the ..FFFE and ..FFFF non-characters. They aren't
legal codepoints, but are they illegal UTF-8? There was discussion of this
question in bug 172701.
Reporter | ||
Comment 17•22 years ago
|
||
Thank you for the ref. to bug 172701. It reminded me of additional
non-characters that could be rather hard to throw away in the
current scheme.
I'm not sure of excluding non-chars, either.
(FYI, a successor-to-be of RFC 2279 mentioned
in my prev. comment doesn't regard
non-chars as invalid.). I filtered them
out because I wanted to preserve the behavior of
|str.Equals(nsConvertUCS2toUTF8(nsConvertUTF8toUCS2(str)))|
for which |IsUTF8| is a replacement.
The context for bug 172701 appears different from the one for this bug.
The former is about how to and where to validate UTF-8 (or content
encoded in it). |IsUTF8| (or |CouldBeUTF8| ) is not a generic
bullet-proof validator/invalidator (so that we don't
have to worry about Unicode compliance as much) but rather
a heuristic 'device' to tell relatively short non-UTF-8 strings
from UTF-8 strings that we resort to when other options are exhausted
(no external tag present, etc). |IsUTF8| is useful
for URI, filename, mail header, http header, ftp list, etc.
When |IsUTF8| returns false, the input string is interpreted
as a fallback encoding(e.g. parent document encoding, locale
charset, etc).
Given this, it'd be extremly rare that non-chars are included
in UTF-8 strings for cases listed above. On the other hand,
someone with malicious intent may come up with a string
valid in both a legacy encoding and UTF-8
(if we allow non-chars in UTF-8) that has some security
implication if interpreted as the legacy encoding
while benign if interpreted as UTF-8. However,
there may as well the other way around (benign
in the legacy encoding while security-risk in
UTF-8).
So, I'm really not sure, but I'm a bit inclined to
preserve the behavior of what |IsUTF8| intends
to replace.
Comment 18•22 years ago
|
||
Comment on attachment 117784 [details] [diff] [review]
a new patch
>+ if ( c <= 0xBF ) // [80-BF] when it's not expected.
>+ return PR_FALSE;
Why not make this test
if ( c <= 0xC1 )
and then you can save the overlong test from the 2 bytes case:
>+ if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF]
>+ return PR_FALSE;
>+ else if ( c <= 0xDF ) // 2 bytes UTF8
Nit: remove the tabs from this line
r=smontagu with those changes
Attachment #117784 -
Flags: review?(smontagu) → review+
Reporter | ||
Comment 19•22 years ago
|
||
Thank you for r, Simon. Alec, can I get your sr?
Marking this as blocking bug 162765. It can go in
w/o this, but |IsUTF8| would be much nicer than
converting back and forth bet. UTF-8 and UTF-16.
Comment 20•22 years ago
|
||
Comment on attachment 117784 [details] [diff] [review]
a new patch
+ if ( c <= 0xBF ) // [80-BF] when it's not expected.
+ return PR_FALSE;
+ else if ( c <= 0xDF ) // 2 bytes UTF8
+ {
+ state = 1;
+ if ( 0 == (0x1e & c) ) // overlong : [C0-C1][80-BF]
+ return PR_FALSE;
+ }
+ else if ( c <= 0xEF ) // 3 bytes UTF8
+
please use UTF8Traits::is3byte() and so forth..
Attachment #117784 -
Flags: superreview?(alecf) → superreview-
Reporter | ||
Comment 21•22 years ago
|
||
Uses UTF8traits::isXXX where approrpiate per Alec's comment.
BTW, it seems like UTF8traits, |UTF8InputStream::CountValidUTF8Bytes| and
|CalculateUTF8Length| need to be updated.
Attachment #117784 -
Attachment is obsolete: true
Reporter | ||
Comment 22•22 years ago
|
||
Comment on attachment 118366 [details] [diff] [review]
a new patch with UTF8traits::
assuming Simon's r is
carried over...
Attachment #118366 -
Flags: superreview?(alecf)
Comment 23•22 years ago
|
||
Comment on attachment 118366 [details] [diff] [review]
a new patch with UTF8traits::
what about this line:
+ else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not
0xF7)
sr=alecf with that line fixed (or explain why you can't use it in a comment)
how should each of those things be updated? to reflect all this stuff with
surrogates and such?
I will gladly review any additional changes to UTF8Traits if you want to
include them here.
Attachment #118366 -
Flags: superreview?(alecf) → superreview+
Comment 24•22 years ago
|
||
Comment on attachment 118366 [details] [diff] [review]
a new patch with UTF8traits::
what about this line:
+ else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not
0xF7)
sr=alecf with that line fixed (or explain why you can't use it in a comment)
how should each of those things be updated? to reflect all this stuff with
surrogates and such?
I will gladly review any additional changes to UTF8Traits if you want to
include them here.
Reporter | ||
Comment 25•22 years ago
|
||
> + else if ( c <= 0xF4 ) // 4 bytes UTF8 (note the bound is not 0xF7)
> sr=alecf with that line fixed (or explain why you can't use it in a commen
Thank you for sr. Currently, UTF8traits::Is4byte(c) includes [0xF5-0xF7] as well
as [0xF0-0xF4], which is why I commented that 'note the bound ......'.
[0xF5-0xF7] begins 4byte seq. for codepoints beyond 10FFFFF. I'll make
it more clear.
> how should each of those things be updated? to reflect all this stuff with
> surrogates and such?
I think UTF8traits::Is[56]byte have to be removed and Is4byte
has to be revised to filter out [0xF5-0xF7]. Besides, Is2byte
may have to be revised to exclude [0xC0-0xC1] (only
used for illegal overlong UTF-8 seq.). Because UTF8traits
only deals with a single 'char' at a time, it's not possible to filter out
all overlong/surrogate cases. UTF8traits::isXXX are used
in three places (I didn't count ConvertUTF8toUCS2 in nsString2.h in
my prev. comment) and they can be updated accordingly. In case of
ConvertUTF8toUCS2, it has its own overlong/non-char/surrogate/plane-17 or higher
detectors. In two other places, not detecting all those cases wouldn't be critical.
> I will gladly review any additional changes to UTF8Traits if you want to
> include them here.
How about taking care of them in a separate bug? Alternatively,
we can just keep this open to deal with the issue after landing attachment
118366 [details] [diff] [review]. (expecting Is4byte() to be updated as proposed above, I'll comment that
|c <= 0xF4| in attachment 118366 [details] [diff] [review] should be replaced by Is4byte when
UTF8traits is updated.)
Comment 26•22 years ago
|
||
ah, I am beginning to understand
can you file a new bug on the issues we just discussed? (I don't think I
understand the issues well enough yet to file it myself)
with the comments on the 4-byte sequences, can you reference the new bug that
you file, so that people reading the code can understand the issues? thanks!
it sounds like this is ready to go (With smontagu's review, and the comment update)
Reporter | ||
Comment 27•22 years ago
|
||
Thank you, Alec. Patch was just checked in.
I filed bug 199090 and added a comment referring to it in the patch.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Updated•4 years ago
|
Component: String → XPCOM
You need to log in
before you can comment on or make changes to this bug.
Description
•