Closed Bug 231782 Opened 21 years ago Closed 7 years ago

need nsCaseInsensitiveUTF8StringComparator

Categories: Core :: XPCOM (defect)
Priority: Not set
Severity: normal
Status: RESOLVED WONTFIX
People: Reporter: sicking; Assigned: dbaron
Keywords: intl

When comparing class names case-insensitively in nsGenericHTMLElement::HasClass
we don't do so in a manner that works for all UTF-8 strings.

This is a regression from bug 195262; see bug 195262 comments 55 and 56.

Also see bug 195350 comment 37
Also note the id compare at
http://lxr.mozilla.org/seamonkey/source/content/html/style/src/nsCSSStyleSheet.cpp#3741
-- this has the same problem.  I guess that's sorta covered by bug 195262 comment 55
Oh, for the record: IMHO we should revert the part of bug 195262 that made us
store atoms as UTF8 strings. The performance regression it caused was never
fixed; see comment 66 and onward.

We would still be able to do static atoms on platforms that support short
wide-chars. Other platforms could just go back to what they did before, i.e.
heap-allocate them.

Even though 99% of our atoms are ASCII strings, we end up doing a lot of
interaction with other strings that are stored as UCS2. Every time that happens
we end up having to convert between UTF8 and UCS2. Also, the atoms that we put
in static tables won't be heap-allocated, so their size isn't as important.
I strongly disagree.  We should move towards UTF8, not away from it.
My main objection is mixing encodings: having different pieces of code using
different encodings is bound to hurt performance, as well as increase code size
in trying to reduce that. This is exactly what happened when bug 195262 landed.

Whether we're using UTF8 or UCS2 doesn't matter much to me, but we should pick
one and stick with it. And so far I can't say that I've seen a real effort to
move to UTF8.
I think that for Mozilla 2.0 (yeah, I know: what's that?  Consider these
prophetic warnings, in advance of the coming of the spec) we should favor UTF-8.
So we should start moving, coherently, in concrete steps.  If we can do so with
a fix to this bug, let's go.

/be
If we make a well-thought-out decision to do so, then I'm definitely fine with
keeping atoms as they are now.

However, don't UTF-8 strings have some performance problems? For example, to
strip whitespace, won't you have to write some sort of UTF-8-aware loop that
keeps track of UTF-8 "control characters" rather than just calling
nsCRT::IsAsciiSpace on each character? Basically any find operation will have to
use fairly advanced iterators, AFAICT.
Actually, ASCII (7-bit-clean) bytes in a UTF-8 string _always_ correspond to
ASCII chars.  Every single byte in a multibyte sequence has the high bit set.
So that particular example (stripping whitespace) is completely trivial in UTF-8.
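
To illustrate the point, here is a minimal sketch (hypothetical helper names,
not Mozilla code): because every byte of a multibyte UTF-8 sequence has its
high bit set, stripping ASCII whitespace can walk the buffer byte by byte
without ever splitting a multibyte character.

#include <stddef.h>

// Hypothetical sketch: strip ASCII whitespace from a UTF-8 buffer in place.
// Bytes with the high bit set belong to multibyte sequences and are copied
// through untouched, so no UTF-8 decoding is needed.
static bool IsAsciiSpaceByte(unsigned char b) {
  return b == ' ' || b == '\t' || b == '\n' || b == '\r' || b == '\f';
}

size_t StripAsciiWhitespaceUTF8(char* buf, size_t len) {
  size_t out = 0;
  for (size_t i = 0; i < len; ++i) {
    unsigned char b = (unsigned char)buf[i];
    if (b >= 0x80 || !IsAsciiSpaceByte(b))
      buf[out++] = buf[i];
  }
  return out;  // new length; the buffer is not null-terminated here
}
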
Hmm, that does make me a lot more convinced that UTF-8 is a good way to go
(though I dread the road to get there).

So that means that stuff like parsing HTML or XML should be trivial from a UTF-8
string? Since we should convert our decoders to decode to UTF-8 rather than
UCS2, right?
bz: depends on definition of space.  JS defines the following code points to be
spaces: 9, 10, 11, 12, 13, 32, 160, 8192, 8193, 8194, 8195, 8196, 8197, 8198,
8199, 8200, 8201, 8202, 8203, 8232, 8233, 12288 (these numbers courtesy

javascript:s='';for(i=0;i<65536;i++)if(/^\s/.test(String.fromCharCode(i)))s=s?','+i:i;s

or something close -- session history doesn't keep javascript: URLs, darnit).

/be
Make that

javascript:s='';for(i=0;i<65536;i++)if(/^\s/.test(String.fromCharCode(i)))s=s?s+','+i:i;s

/be
Hmm, actually we'd still be in trouble when it comes to parsing XML, though I
guess Expat will handle that for us.

What about stuff like the DOM, though? Can we really make DOM methods return
UTF-8 strings? Doesn't JavaScript need UCS2 strings to avoid converting every
time a DOM call is made?
OK, here are the problems with the current string API that prevent me from doing
this:
 * ns[C]StringComparator, frozen, has an operator()(char_type, char_type), which
it shouldn't.  I don't think anybody uses this, but we have a bunch of
implementations of it.
 * ns[C]StringComparator::operator()(const char_type*, const char_type*,
PRUint32) only takes a single length parameter -- i.e., it assumes that the
comparison is one char_type unit at a time.
 * nsAString::Equals checks length equality before calling the comparator (same
bad assumption).

Perhaps we need nsAUTF8String after all.
Summary: case-insensitive compare for class isn't utf8-friendly → need nsCaseInsensitiveUTF8StringComparator
Oh, never mind about it being unused; FindInReadable_Impl does use it.
Oh, another thing that I realized will be a problem is string length. JS, as
well as XPath, needs to be able to get the Unicode length of strings.

Basically what it comes down to is that programming environments (like JS and
XPath) need to at least expose a Unicode API to the user. Which sounds like it's
going to force a lot of conversions if we're using UTF8 everywhere else in the
code.
String length is no different in UTF8 than it is in UTF16; we just don't run
into it with UTF16 yet. Both encodings need to look at every single character
when calculating the length of a string -- Unicode is all about 32-bit
characters, even if that's most of the time not talked about. So any code that
returns the length of a UTF* string should return how many Unicode characters
that string represents, not how many 16-bit units it represents (as our current
UTF16 code does).
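
A minimal sketch of that claim (hypothetical helpers, not Mozilla code):
counting Unicode characters requires a scan in both encodings, skipping UTF-8
continuation bytes in one case and UTF-16 trailing surrogates in the other
(well-formed input assumed).

#include <stddef.h>

// Count Unicode code points in a UTF-8 buffer: continuation bytes look like
// 10xxxxxx and are not counted.
size_t CountCodePointsUTF8(const unsigned char* s, size_t len) {
  size_t count = 0;
  for (size_t i = 0; i < len; ++i)
    if ((s[i] & 0xC0) != 0x80)
      ++count;
  return count;
}

// Count Unicode code points in a UTF-16 buffer: a low (trailing) surrogate in
// 0xDC00-0xDFFF pairs with the preceding high surrogate and is not counted
// separately.
size_t CountCodePointsUTF16(const unsigned short* s, size_t len) {
  size_t count = 0;
  for (size_t i = 0; i < len; ++i)
    if (s[i] < 0xDC00 || s[i] > 0xDFFF)
      ++count;
  return count;
}
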
How often do you really need to know how many characters you have?  Don't you
usually just need to know how much storage the string takes up?
Re #15:
True, but we haven't converted our string classes to actually do UTF16 yet, so
we haven't seen any performance impact of actually truly doing UTF16.

My argument is that doing anything other than what we're doing now risks having
a big negative performance impact.

Re #16:
JavaScript doesn't care about storage space. When the user calls .charAt()
or .length on a string the engine needs to treat the string as Unicode. As I
said in comment 14, of course we could do that when the user actually calls
those methods, and store the strings internally as UTF8 otherwise, but it might
lead to a lot of conversions.
Brendan would of course be the best person to ask about whether JavaScript can
store its strings internally as UTF8.


I'm all for looking over our string strategies, but we need to actually analyze
what any change will do before starting to advocate it and changing code. This
has been done way too much in the past, which has led us to the situation we are
in right now.
> Re #16:
> JavaScript doesn't care about storage space. When the user calls .charAt()
> or .length on a string the engine needs to treat the string as Unicode.

Not so, see ECMA-262 Edition 3 (http://www.mozilla.org/js/language/E262-3.pdf,
errata nearby), specifically sections 6 and 8.4.  JS string length counts 16-bit
UTF-16 storage units.

> As I said
> in comment 14, of course we could do that when the user actually calls those
> methods, and store the strings internally as UTF8 otherwise, but it might
> lead to a lot of conversions.

It might, but maybe not.

> Brendan would of course be the best person to ask about whether JavaScript
> can store its strings internally as UTF8.

JS could be taught about UTF-8 with some work.  I'll write more about this
tomorrow if I have time.

/be
I don't understand the advantage of using UTF-8 over UTF-16 (or vice versa for
that matter) but both require similar worries over multibyteness, so I don't
really see a performance problem either way (UTF-8 requires less space for,
e.g., stylesheets, and I would guess more space for, e.g., CJK HTML documents;
both require complicated length code, both can compare for US-ASCII characters
pretty trivially, etc).

Note that the DOM officially says that DOMString must be UTF-16.
   http://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMString
The debate isn't so much between UTF16 and UTF8 as between "what we have now"
and UTF8. At least as far as I'm concerned :)
Can "what we have now" do characters outside the BMP?
What we have now is UTF-16, although it was changed from UCS2 with no discussion
whatsoever.  (See bug 118000 and another bug that I can't find where I said
roughly the same thing more strongly.)
I'm not really convinced that we have 'true' UTF-16; pretty much all our code
considers each PRUnichar as a separate character, with no regard to
multi-doublebyte characters. This is probably fine for most of the code, though,
since it only deals with characters that appear as a single double-byte. I'm
assuming here that UTF16 has the same properties as UTF8, i.e. having the
high bit set for every multi-doublebyte character.

Assuming that assertion is true, then:

(Pretty much) only code that is concerned about characters that don't fit in a
single (double)byte when UTF encoded needs to treat a string as if it is UTF
encoded.

Apparently we don't have much code that needs to concern itself with characters
that are represented by multi-doublebytes in UTF-16, which means that we don't
take a perf hit from UTF-16 encoding.

I'm not as convinced that the same is true for UTF-8. Basically every string
operation that uses non-ASCII characters would have to be UTF-aware. Has anyone
looked into how much of our code uses non-ASCII? IMHO that would need to be
looked over before we start advocating UTF-8.

Another thing that I'm very interested in is whether anyone has looked into how
much memory we would actually save by switching to UTF-8. Saving memory is nice
and all, but switching to UTF-8 in all of Mozilla is a *huge* effort. My gut
feeling is that if the same effort were put elsewhere we could get a bigger
gain. It's not like there aren't other parts of Mozilla that need
rearchitecting. Rewriting the HTML parser would be one example.

> I'm assuming here that UTF16 has the same properties as UTF8, i.e. having
> the high bit set for every multi-doublebyte character.

That assumption is completely wrong.

UTF16 encodes characters in planes 1 through 16 of Unicode using a set of
reserved characters, the high surrogates (U+D800 - U+DBFF) and low surrogates
(U+DC00 - U+DFFF).  They must occur in high+low pairs, and a character c in
planes 1 through 16 is encoded in two 16-bit units as follows:
< (((c - 0x10000) & 0xffc00) >> 10) | 0xd800 , ((c - 0x10000) & 0x3ff) | 0xdc00 >.
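
For concreteness, a small sketch of that encoding (hypothetical helper, not
Mozilla code); the key detail is that 0x10000 is subtracted first, and the
remaining 20 bits are then split across the two surrogates.

// Encode a supplementary-plane code point c (0x10000 <= c <= 0x10FFFF) as a
// UTF-16 surrogate pair.
void EncodeSurrogatePair(unsigned int c, unsigned short* high, unsigned short* low) {
  unsigned int v = c - 0x10000;                    // 20 bits remain
  *high = (unsigned short)(0xD800 | (v >> 10));    // top 10 bits
  *low  = (unsigned short)(0xDC00 | (v & 0x3FF));  // bottom 10 bits
}
// Example: U+1D11E encodes as the pair <0xD834, 0xDD1E>.
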
Advantages of UTF-8:
 * all compilers can generate literals for the ASCII subset of UTF-8
 * less storage for "code" type things (such as element and attribute names)
 * less storage for Western users

vs.

Advantages of UTF-16 and UCS2:
 * strings cannot be equal under case-insensitive comparison without being equal
in length (although I don't know if the Unicode consortium promises to preserve
that for the former)

==================

Advantages of UTF-8 and UTF-16:
 * can represent all planes of Unicode

vs.

Advantages of UCS2:
 * constant-size character units
> > I'm assuming here that UTF16 has the same properties as UTF8, i.e. having
> > the high bit set for every multi-doublebyte character.

> That assumption is completely wrong.
...
> A character c in
> planes 1 through 16 is encoded in two 16-bit units as follows:
> < (((c - 0x10000) & 0xffc00) >> 10) | 0xd800 , ((c - 0x10000) & 0x3ff) | 0xdc00 >.

Both 0xd800 and 0xdc00 have the high bit set, so I don't see how my assumption
was false?


Another advantage of UCS2 (and possibly UTF-16, depending on the above assumption):

Can search for, and parse out, many (all* that we care about?) characters in a
string without having to UTF-decode or use a UTF-aware iterator.


* disregarding a few places such as the code touched in bug 118000
You can parse all the US-ASCII parts of a UTF-8 string without a UTF-8-aware
decoder. Do we ever have to parse characters outside that range, other than in
the initial reading step to check for data integrity (no overlong sequences, and
so forth, required for security reasons)?
As far as multibyte characters go, only code points U+D800 through U+DBFF and
U+DC00 through U+DFFF are surrogates. Plenty of words with their high bit set
are not part of UTF-16 multibyte sequences. US-ASCII characters, as UTF-16
units, always have their 9 most significant bits unset.

I don't really understand what you were getting at though. Why is this different
or important or whatever?
My first worry is that advocating UTF-8 is going to lead to a codebase with
mixed string strategies, leading to a lot of conversions (this has already
happened with atoms). See comment 2.

If we want to do UTF8 then we should do it everywhere. However, my worry is that
if we switch from "what we have now" (which is UTF-16 treated as UCS2 by pretty
much all code) to UTF-8, we will take a performance hit.

One problem is places where we do need to be able to treat strings as UTF-16:

The first place that springs to mind is the JS engine. According to comment 18
it needs to expose strings as UTF-16 to the user. Can it store strings
internally as UTF-8 anyway?

Another place is the XML parser. It needs to check non-ASCII names; can it deal
with data in UTF-8 format? (My guess is that it can, though.)

Then there's the DOM. Comment 19 says it needs to expose strings as UTF-16. Can
we ignore that and use UTF-8 anyway?

I bet there are other places, though I can't think of any core ones right now.

Then dbaron pointed out that case-insensitive comparison can no longer
early-return if string lengths are different. This might be an additional perf
hit.

On top of this there's of course the argument of frozen APIs. Until we decide
that we can break the frozen interfaces that contain string arguments, I don't
see that we have an option at all.
What we have now is UTF16, rather incompletely implemented.

> Apparently we don't have much code that needs to concern itself with
> characters that are represented by multi-doublebytes in UTF-16

Everywhere UTF8-string code would have to worry about multibyte UTF8 characters,
UTF16-string code would also have to worry about multibyte characters. I think
in general there aren't as many of those places as you expect.

I think the main issue with going to UTF8 is those pesky APIs that pass around
string positions and lengths in UTF16 units. Thankfully we don't do that very
often internally. One way to help with those APIs would be to cache a bit on
each JS/DOM string that is set to TRUE when we have determined that the string
is entirely ASCII. (DOM strings effectively already have this.) On those strings
we can still do constant-time indexing and length. So we only lose some
performance when we have scripts manipulating non-ASCII strings.
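
A rough sketch of that caching idea (hypothetical names, not actual Mozilla or
JS engine code): the string carries an "all ASCII" flag so that length and
indexing stay constant-time in the common case, falling back to a scan only for
non-ASCII content.

#include <stddef.h>

struct FlaggedString {
  const unsigned char* bytes;  // UTF-8 storage
  size_t byteLength;
  bool isAscii;                // set once when the string is created/scanned

  size_t CharLength() const {
    if (isAscii)
      return byteLength;       // one byte per character, O(1)
    size_t count = 0;          // otherwise count code points, O(n)
    for (size_t i = 0; i < byteLength; ++i)
      if ((bytes[i] & 0xC0) != 0x80)
        ++count;
    return count;
  }
};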

From Hixie:
> (UTF-8 requires less space for, e.g., stylesheets, and I would guess more
> space for, e.g., CJK HTML documents;

I believe UTF8 would require much less space for CJK documents. Here's a page
from my wife's favourite site:
http://www.mingpaonews.com/20040121/index.htm
If you View Source, you'll see that the HTML markup swamps the actual CJK text.
And that's not counting stylesheets or JavaScript.

The internal string APIs dbaron identified are unfortunate but we can live with
them.
> Can it store strings internally as UTF-8 anyway?

Yes.

> Then there's the DOM. Comment 19 says it needs to expose strings as UTF-16. Can
> we ignore that and use UTF-8 anyway?

Yes. bz and I had a thread about this on .i18n a while back.
By the way, I believe that most string comparisons are to constant strings. And
I have suggested a patch in bug 226439 that would reduce footprint significantly
by specializing comparisons to ASCII strings ... the new Equals(const char*)
does not do the length optimization.
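
For illustration, a sketch of that kind of specialization (hypothetical code,
not the actual bug 226439 patch): comparing a UTF-16 string directly against an
ASCII literal needs no conversion, no copy and no up-front length check.

#include <stddef.h>

// Compare a UTF-16 buffer against a null-terminated 7-bit ASCII string.
bool EqualsASCII(const unsigned short* str16, size_t len16, const char* ascii) {
  size_t i = 0;
  for (; i < len16 && ascii[i]; ++i)
    if (str16[i] != (unsigned char)ascii[i])
      return false;
  // Equal only if both strings were consumed completely.
  return i == len16 && ascii[i] == '\0';
}
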
Well, one more issue with moving to UTF8 is that for Windows we'll have to do
UTF8->UTF16 conversions when passing data to Win32 APIs.
Sicking, since the assumption on which comment 23 hangs was wrong, do you
retract that comment? ;-)

> The first place that springs to mind is the JS engine. According to comment 18
> it needs to expose strings as UTF-16 to the user.

Not exactly -- more like as "UCS2".  Again, JS string.length counts uint16
storage units, not characters.  Indexing or slicing a JS string reckons by
counting 16-bit storage units.  This will remain true in JS2/ES4.

> Can it store strings internally as UTF-8 anyway?

Sure, it's just a small matter of programming.  Or not, if conversions are rare
between the JS-internal UCS2 world and the JS-external mostly-UTF-8 world.

I still have more to say, but not right now.

/be

> Another place is the XML parser. It needs to check non-ASCII names; can it deal
> with data in UTF-8 format? (My guess is that it can, though.)

It can.  See http://lxr.mozilla.org/seamonkey/source/expat/xmltok/xmldef.h#43
and the nice comments around
http://lxr.mozilla.org/seamonkey/source/expat/xmlparse/xmlparse.h#85
Is my assumption in comment 23 really wrong? dbaron's comment 24 seems to
indicate it's right rather than wrong. Anyhow, the actual bits don't matter.
What matters is how much of the code we would have to make do UTF decoding, and
how critical the code paths that code lives on are.

Anyway, this discussion is way too theoretical. Until both the advantages and
disadvantages of going to UTF8 have been investigated, IMHO the discussion is
just pointless guesswork.

The advantages being: how much memory would we actually save? Much of the DOM
stores strings as atoms -- attribute/element names, attribute values in XUL --
which means that the string data isn't duplicated (on top of this, atoms are
already UTF8). Text nodes store their string data as ASCII when possible.

The disadvantages are: Are we prepared to break many of our frozen APIs? Which
code paths would need to UTF-decode or in other ways slow down? Are any of them
critical?
Can the JS engine be changed to use UTF-8? And when I say "can" I don't mean
just theoretically. Of course it's possible, but would such a change degrade
performance too much, and would it be an acceptable change to the JS APIs?

I'm not trying to dictate what should be done with our string classes or what
the official string strategy should be. However, I am asking how much
investigation and discussion has gone into such a decision as "We want to move
towards UTF8"?
There has been a lot of discussion over the years.

At some point a number of people agreed that rebadging the CString classes as
containing UTF8 would be a good way forward.
Except it turned out that the parser uses CStrings as byte buffers with various
encodings, or something like that.
>Except it turned out that the parser uses CStrings as byte buffers with various
>encodings, or something like that.

This applies to more than just the parser.  In fact, it typically applies
whenever we use nsCString with "narrow" filesystem paths, environment variable
strings, etc.
I say we should have a separate string class for "we don't know the encoding"
strings, because most normal string operations can't be properly implemented for
such "strings". (Well, they're not really strings; they're more like byte
arrays.)
>I say we should have a separate string class for "we don't know the encoding"

I wish, but...

While it would be very nice to have encoding-aware string classes, I think it'd
be difficult to implement.  The fact that C++ doesn't give you much help when
dealing with |const char*| and |const unsigned short*| makes it easy for
programmers to abuse the system.  Programmers need to be conscientious of
character encodings (I think that's just what it boils down to).
No, I mean we should have a string class where the only operations we provide
are ones we can do if we don't know the encoding ... I guess that's
concatenation and (everything-sensitive) equality testing.
Does concatenation even fit in that set?  Two unknown-encoding strings might not
be same-unknown-encoding, in which case you're unlikely to get a "string" as a
result of concatenating them, I think.
Perhaps we can trust the caller to ensure that the two strings are in the same
encoding, even if we don't know what it is.
>No, I mean we should have a string class where the only operations we provide
>are ones we can do if we don't know the encoding ... I guess that's
>concatenation and (everything-sensitive) equality testing.

Isn't this mostly solved by the nsStringComparator "interface"?  Currently, you
have to pick the right comparator when doing "string" comparisons.  (There is a
UTF-16 case-insensitive string comparator provided by unicharutil, for example.)

There are definitely other issues, such as real character iteration.  We could
probably introduce different iterator types (nsReadingUTF8Iterator) that
subclass nsReadingIterator<char> such that you could pass an instance to
BeginReading, but then request the next UCS4 character from the iterator.  This
could be done on top of the existing string API.
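
A standalone sketch of the decode step such an iterator would perform
(hypothetical code; nsReadingUTF8Iterator is only a proposed name here): read
one UTF-8 sequence and return it as a UCS4 value, advancing the pointer past
however many bytes the sequence occupied. Validation of continuation bytes is
omitted for brevity.

// Decode the UTF-8 sequence at *p into a UCS4 code point and advance p.
unsigned int NextUCS4(const unsigned char*& p) {
  unsigned char b = *p++;
  if (b < 0x80)                                  // 1-byte (ASCII)
    return b;
  if ((b & 0xE0) == 0xC0)                        // 2-byte sequence
    return ((b & 0x1F) << 6) | (*p++ & 0x3F);
  if ((b & 0xF0) == 0xE0) {                      // 3-byte sequence
    unsigned int c = (b & 0x0F) << 12;
    c |= (*p++ & 0x3F) << 6;
    c |= (*p++ & 0x3F);
    return c;
  }
  unsigned int c = (b & 0x07) << 18;             // 4-byte sequence
  c |= (*p++ & 0x3F) << 12;
  c |= (*p++ & 0x3F) << 6;
  c |= (*p++ & 0x3F);
  return c;
}
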
Re: comment 36:

First, you wrote:

"I'm assuming here that UTF16 has the same properties as UTF8, i.e. having the
highbit set for every multi-doublebyte character."

and you're right that every surrogate pair has the high bit in each 16-bit
storage unit set, but the reverse isn't true: finding a high bit set does not
imply that consecutive multiple storage units comprise one character.  So, as
roc said in comment 30, everywhere you have to worry about UTF-8 multi-byte, you
have to worry about UTF-16 multi-double-byte.

It's true, if stripping Unicode white space, that we'll have to do a bit more
work with UTF-8 than with UTF-16 -- but I believe there are space characters
defined in plane 1 and above, so we have to look in either case.

Second: Mozilla 2.0 is an opportunity to break compatibility.  Whether we'll
want to break a bunch of compatibility remains to be seen; I think we need to
break some, for our own long-term health.  I'll be working on a list of APIs to
consider deprecating before 2.0, so we can obsolete them in 2.0 -- and what's
more important, many of us cc'd on this bug will be working on better APIs to
replace the obsoleted ones.  Strings are a prime target.

/be
http://www.unicode.org/notes/tn12/ is probably relevant to this discussion.
Although I like UTF-8 a lot as an external storage format (on Unix), I have some
(actually quite a lot of) reservations about switching to UTF-8 as 'the'
_internal_ encoding. My favorite was UTF-32/UCS-4 back in 1998 and still is, but
if that's not an option, I would go with UTF-16 (see bug 183048).  The fact that
UTF-8 is ASCII-compatible is a blessing in some aspects but has also been a
cause of numerous bugs in Mozilla. One (side-)benefit of going to UTF-16/UTF-32
is that we can 'force' developers to think about the encoding. As pointed out by
the author of UTN #12 (referred to by smontagu), only a few libraries/OSes use
UTF-8 for internal processing (Gtk2/Gnome/Pango, Perl and BeOS are all that I
know of) while there are many more libraries/OSes that use UTF-16 internally,
for which there must be a reason. Besides, Gtk2/Gnome/Pango might switch to
UTF-16/UTF-32 sometime in the future (according to the lead developer of Pango).
Keywords: intl
FYI, UTN #12 is wrong to say that Python uses UTF-16. It uses either UCS-2 (not
UTF-16) or UTF-32 (on Red Hat and Fedora Linux, Python is compiled in such a way
that UTF-32 is used as the internal character representation format). See
http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf, which is a nice
read by itself for the issue at hand.
> The fact that UTF-8 is ASCII-compatible is a blessing in some aspects but has
> also been a cause of numerous bugs in Mozilla. One (side-)benefit of going to
> UTF-16/UTF-32 is that we can 'force' developers to think about the encoding.

UTF16 and UTF8 are susceptible to exactly the same kind of programmer errors:
treating each unit (short or byte) as an individual character. One reason I like
UTF8 is that UTF8 will expose those errors as soon as you test with non-ASCII
characters. UTF16 will not expose errors until you test with non-BMP characters,
which are very rarely encountered in the wild.

> there are many more libraries/OSes that use UTF-16 internally, for which there
> must be a reason.

I claim the reason is nothing technical at all --- rather, that when Unicode
support was first being designed into modern operating systems, programming
languages and applications, it was not widely known that more than 2^16
characters would be required, so people assumed that UCS2 would always be
sufficient, so they standardized on UCS2. When it became clear that non-BMP
characters would be required, it was far too late to rewrite everything so
everyone did what Mozilla has done --- replace 'UCS2' with 'UTF16' in
documentation everywhere, fix some bugs and hope for the best.

The other systems, those written since non-BMP was recognized as important, have
been constrained by platform APIs they depend on. Indeed I agree that platform
API compatibility (UTN#12's argument #2) is a powerful argument in favour of
UTF16. Maybe it's a bit less powerful for Mozilla than for other systems because
we have our own cross-platform layer.

UTN#12's argument #1, that Unicode is "optimized for" BMP characters, doesn't
convince me. Given the choice I'd rather optimize for ASCII given that we
largely deal with markup and programs.
(In reply to comment #50)
> UTN#12's argument #1, that Unicode is "optimized for" BMP characters, doesn't
> convince me. Given the choice I'd rather optimize for ASCII given that we
> largely deal with markup and programs.

Is this really true? Our *input* is markup, but after the initial parsing, we
largely deal with the text in the DOM, not the markup.
It depends on what you mean by "dealing with", which is about as fuzzy a concept
as "optimized for". In some sense we don't "deal with" DOM text very much. I
think most of the Mozilla code that's manipulating strings is not manipulating
real DOM text. Rather, there's endless mucking about with values and attributes
and URIs and names and headers and protocols etc etc etc.

Note that currently our content model stores actual DOM textnodes (and attribute
values??) in either UTF16 or "UTF16 compressed to 8 bits per character because
all high bytes are zero", which seems like something we definitely DON'T want to
change.

We desperately need some actual numbers. I should try to instrument the string
classes some more.
Attribute values are stored as UTF-16 strings for HTML and XML, though HTML
often stores the value as a 'parsed' value, i.e. a PRInt32, a double or an
nsISupports.

XUL does the same except that it stores them as an atom when the length is less
than 12 characters (the idea being that at less than 12 characters it's most
likely a recurring value like "true" or "horizontal").

HTML and XML will soon do the same.
re: comment #51 and #52: I'm with Simon on this point. Considering the recent
trend toward making everything I18Nizable (markup languages, domain names,
URIs/IRIs, email addresses, identifiers in programming languages), I don't think
it makes as much sense as before to assume that we have to optimize for ASCII.

re: comment #50. 
> UTF16 and UTF8 are susceptible to exactly the same kind of programmer errors:
> treating each unit (short or byte) as an individual character.

  That's not the kind of error I was talking about. Both UTF-8 and legacy
encodings are stored in byte arrays while UTF-16 is stored in 'PRUnichar
arrays'. This difference in storage class makes it easier to catch potential
bugs if our APIs are written for UTF-16 or UTF-32 than for UTF-8.

  Besides, we already have a lot of APIs written for UCS-2, some of which have
already been converted to work with UTF-16. Although it may not be a lot of work
(writing a UTF-8 iterator is not hard, but do we have to bother?), I think it's
still a distraction to rewrite them all for UTF-8.

> that when Unicode
> support was first being designed into modern operating systems, programming
> languages and applications, it was not widely known that more than 2^16

  Well, Unicode began with 16 bits in the late 1980s and was later expanded to
20.1 bits when it was merged with ISO 10646 (which began as a 31-bit character
set). Anyway, I've made a similar argument about UTF-16 [1]. However, I wouldn't
use it to favor UTF-8 over UTF-16 [2]. I always used it to justify my preference
for UTF-32 (glibc and Python use UTF-32).


[1] ICU developers' choice of UTF-16 cannot be accounted for by that, though,
because the ICU project was started after Unicode became 20.1 bits.

[2] http://mail.nl.linux.org/linux-utf8/2003-07/msg00114.html In this post, I
was 'defending' UTF-8 and arguing that it's 'possible' to write APIs in UTF-8.
Then as well as now, however, I would choose UTF-16 APIs if I had to choose
between UTF-8 and UTF-16.
> Considering the recent trend toward making everything I18Nizable (markup
> languages, domain names, URIs/IRIs, email addresses, identifiers in
> programming languages), I don't think it makes as much sense as before to
> assume that we have to optimize for ASCII.

Just because something's internationalizable doesn't mean we'll actually see
significant quantities of I18N text there anytime soon, or ever... Java source
and XML have been Unicode from the beginning but there's not a lot of non-ASCII
Java source or XML vocabularies out there, that I've seen.

> This difference in storage class makes it easier to catch potential bugs

Can you give an example of the kind of bugs you're talking about? Can they still
happen considering we're using string classes, not raw arrays?
Ugh, I've read about 75% of this conversation, and what it sounds like is a
debate between carpenters: When we build this house [which in this case is
already built] should we be using nails or screws?

I feel pretty strongly that there are appropriate uses for both UTF8 and UTF16.
We could debate this encoding or that, but there is never going to be a
universally best encoding. Furthermore, by enforcing some kind of universal
encoding, we will NOT avoid programmer error. We're not going to be able to
enforce it syntactically and we're not going to prevent bugs by posting "We use
UTF-16 now" to a newsgroup.

I also don't believe that the string classes are or even should be containers
for anything other than const char* and const short* buffers. I also believe, as
darin points out, that it would be impractical to try to decide and enforce a
singular encoding for any set of string classes, since they are so tightly
coupled to their C++ type counterparts const char* and const short*. The
"enforcement" would involve so much abstraction on TOP of the string APIs that
people would run away from this project screaming because they couldn't append
the letter 'a' to their string without jumping through hoops.

But going back to the basic issue of what universal encoding we use, I think we
MUST deal with a mishmash of encodings if we want to be a fast, lightweight
browser. Maybe that sounds contradictory but here's my perspective: Some data is
simply going to be ASCII at least 99% of the time, and for that data we should
be using UTF-8 or even ASCII itself (when that percentage is 100%). Data that
comes to mind in this case includes:
- internal string tags
- HTML markup tags
- HTML markup attribute names
- HTML entity names
- most HTML markup attribute values
- CSS keywords
- CSS selector text
- ASCII-based protocol keywords (i.e. header names in HTTP)
- most atoms (Which are used for most of the above)
- .property file key names (i.e. the "foo" in foo=bar)
- filenames

Now, there are plenty of uses for UTF16 as well. Namely:
- CDATA
- HTML entity values
- .property values

Both lists can be expanded, but the key is that there are uses for both. And
when we need something like what this bug requests,
nsCaseInsensitiveUTF8StringComparator, we need to look at why we need it and
whether the data we need it for is stored in an appropriate encoding. In the
cases originally mentioned in this bug, it seems very appropriate, and IMHO
nsCaseInsensitiveUTF8StringComparator has been long overdue.

(In reply to comment #51)
> Is this really true? Our *input* is markup, but after the initial parsing, we
> largely deal with the text in the DOM, not the markup.

I wanted to address this too - or at least back up what roc said. It's true that
we deal with much of the parsed text, like HTML tags, attributes, etc., in the
DOM. However, ask yourself how many times we do a string comparison on a tag
name to resolve some CSS, vs. the few times we might access that same tag or
attribute via the DOM. CSS resolution happens MUCH more on the average web page,
because every web page has to be laid out, period. That's what we want to
optimize. Not every tag or attribute has to be accessed via the DOM (and most
are not on most web pages). So why would we optimize for the case which is used
the least?
The problem with mixed encodings is that it's very easy to end up doing tons of
conversions. For example, when atoms were converted to UTF8 this gave a
performance hit that was never addressed (not saying that the patch as a whole
was bad, but the fact remains that it gave a perf hit). It also gave even bigger
perf hits in some extreme cases: we have a real-world XSLT stylesheet that got
on the order of 50%-100% slower since it performed a lot of compares with node
names.

I agree that in some cases different encodings make sense; however, we need to
do it very carefully.
I would bet that the reasons for slowdowns of that magnitude are not that we now
have more strings of different encodings, but rather that our tools for working
with these encodings are less than ideal.

Just to state one example of this, right now if you want to compare a UTF8
string to a UTF16 string, you're required to convert and copy either string to
the other encoding to get matching strings, and then compare, when all you
should really need to do is to call a method that takes a UTF8 string and a
UTF16 string and do the compare w/o all the suck.

I seriously doubt that using UTF8 over UTF16 has any serious performance
problems if we've got the right tools to work with here. Clearly we do not have
those at this point, so let's work on that rather than fight over what way to
lean here, ok? :-)
Yay! jst has hit the nail on the head - it's the tools and manner of string
management, not the fact that we have to do the conversions themselves. The atom
conversion was obviously an incomplete job - and I'm sorry about that. I should
have followed it up more quickly with some UTF8 string cleanup, but life got in
the way :)

Anyhow, let's move forward with this string comparator. We also need an actual
implementation of CopyUTF8toUTF16 and so forth.
(In reply to comment #60)
> Anyhow, let's move forward with this string comparator. We also need an actual
> implementation of CopyUTF8toUTF16 and so forth.

It's a little hard to move forward without breaking frozen APIs.  See comment 12.

We've had an implementation of CopyUTF8toUTF16 for ages.
Hmm... we can definitely write UTF-8 and UTF-16 iterators that produce UTF-32
characters with each iteration.  We could use those iterators to write functions
like:

PRInt32 CompareUTF8toUTF16(utf8String, utf16String);

It's not as nice as subclassing a comparator interface that would plug into
Equals and other methods, but it is a solution :-/
Yeah, such a thing would get us some mileage on its own. And as a nice
optimization we'd only need to revert to extracting UTF-32 characters out of the
strings when non-ASCII characters are encountered in either string, which would
make the character iteration trivial and really fast in 99% of the cases. I.e.
something like:

PRInt32
CompareUTF8toUTF16(const nsAFlatCString& utf8String,
                   const nsAFlatString& utf16String)
{
  nsAFlatCString::const_iterator iter8, end8;
  nsAFlatString::const_iterator iter16, end16;

  utf8String.BeginReading(iter8);
  utf8String.EndReading(end8);
  utf16String.BeginReading(iter16);
  utf16String.EndReading(end16);

  while (iter8 != end8 && iter16 != end16) {
    const char ch = *iter8;

    if (IsAscii(ch)) {
      // Fast path: an ASCII byte can be compared directly against the
      // 16-bit unit.
      if (ch != *iter16) {
        return ch < *iter16 ? -1 : 1;
      }
    } else {
      // Slow path: extract full UTF-32 characters from iter8 and iter16,
      // compare those, and leave both iterators on the last unit of their
      // sequences (the increments below complete the step).
    }

    ++iter8;
    ++iter16;
  }

  // If one string is a prefix of the other, the shorter one sorts first.
  if (iter8 == end8 && iter16 == end16)
    return 0;
  return iter8 == end8 ? -1 : 1;
}

If we had that, and our atom code used it (and it knew its string length to
avoid looking for the null terminator all the time), atom to string compares
might not be very slow at all.
Unfortunately this wouldn't help us in the XSLT case, since we're essentially a
programming environment, so when someone asks for a node name we have no idea if
they'll later use substring, concat or equality operations on it. So we'll
always have to simply convert to a string.

I realize that XSLT and XPath aren't high priorities, but the same thing applies
in JS. Which is why I'm very curious to see whether it's practically possible to
make JS use UTF8 internally without too big a performance degradation.
I seriously doubt the string conversion overhead is anything to worry about.
Compared to the string copy that always happens when you access strings through
the DOM today, the added conversion will hardly make a huge dent in the app
performance.

But if you're comparing string sharing against copy+convert, then the
copy+convert will naturally lose, but I don't see that being the common case.

I also doubt that it matters at all (or enough that it shows up on perf tests)
if JS stores strings internally as UTF-8 or UTF-16 (or UCS2).

IMO any place that already does a copy won't significantly degrade in
performance with the added conversion, if done right. Currently we don't do it
right, so it hurts in a lot of places, but that doesn't mean it can't be done right.
Status: NEW → ASSIGNED
Hm, the bug's summary seems to be a duplicate of bug 145975, but the discussion
seems to have moved away from that.
Blocks: 145975
Blocks: 308100
Component: Style System (CSS) → String
QA Contact: ian → string
I'm closing this WONTFIX since I don't think we have any non-ASCII case-insensitive matching left in the platform.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Component: String → XPCOM