Closed
Bug 231782
Opened 21 years ago
Closed 7 years ago
need nsCaseInsensitiveUTF8StringComparator
Categories
(Core :: XPCOM, defect)
Core
XPCOM
RESOLVED
WONTFIX
People
(Reporter: sicking, Assigned: dbaron)
References
Details
(Keywords: intl)
When comparing classnames case-insensitively in nsGenericHTMLElement::HasClass we don't do it in a manner that works for all UTF-8 strings. This is a regression from bug 195262, see bug 195262 comment 55 and 56. Also see bug 195350 comment 37.
Comment 1•21 years ago
Also note the id compare at http://lxr.mozilla.org/seamonkey/source/content/html/style/src/nsCSSStyleSheet.cpp#3741 -- this has the same problem. I guess that's sorta covered by bug 195262 comment 55
Reporter
Comment 2•21 years ago
Oh, for the record: IMHO we should revert the part of bug 195262 that made us store atoms as UTF-8 strings. The performance regression it caused was never fixed, see comment 66 and on. We would still be able to do static atoms on platforms that support short wide chars. Other platforms could just go back to what they did before, i.e. heap-allocate them. Even though 99% of our atoms are ASCII strings, we end up doing a lot of communication with other strings that are stored as UCS2. Every time that happens we end up having to convert between UTF-8 and UCS2. Also, the atoms that we put in static tables won't be heap-allocated, so their size isn't as important.
Assignee
Comment 3•21 years ago
I strongly disagree. We should move towards UTF8, not away from it.
Reporter
Comment 4•21 years ago
My main objection is mixing encodings; having different pieces of code use different encodings is bound to hurt performance, as well as increase codesize trying to reduce that. This is exactly what happened when bug 195262 landed. Whether we're using UTF-8 or UCS2 doesn't matter much to me, but we should pick one and stick with it. And so far I can't say that I've seen a real effort to move to UTF-8.
Comment 5•21 years ago
I think Mozilla 2.0 (yeah, I know: what's that? Consider these prophetic warnings, in advance of the coming of the spec) should favor UTF-8. So we should start moving, coherently, in concrete steps. If we can do so with a fix to this bug, let's go. /be
Reporter
Comment 6•21 years ago
If we make a well-thought-out decision to do so then I'm definitely fine with keeping atoms as they are now. However, don't UTF-8 strings have some performance problems? For example, to strip whitespace, won't you have to write some sort of UTF-8-aware loop that keeps track of UTF-8 "control characters" rather than just calling nsCRT::IsAsciiSpace on each character? Basically any find operation will have to use fairly advanced iterators, afaict.
Comment 7•21 years ago
Actually, ASCII (7-bit-clean) bytes in a UTF-8 string _always_ correspond to ASCII chars. Every single byte in a multibyte sequence has the high bit set. So that particular example (stripping whitespace) is completely trivial in UTF-8.
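That property can be sketched in a few lines of C++. This is an illustrative helper, not Mozilla API: because every byte of a UTF-8 multibyte sequence has its high bit set, ASCII whitespace can be stripped with a plain byte loop, no decoding needed.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true for the ASCII whitespace bytes.
static bool IsAsciiSpace(unsigned char c) {
  return c == ' ' || c == '\t' || c == '\n' || c == '\r' || c == '\f';
}

// Strip ASCII whitespace from a UTF-8 byte string. Bytes >= 0x80 belong to
// multibyte sequences and are passed through untouched, so the result is
// still valid UTF-8.
std::string StripAsciiWhitespace(const std::string& utf8) {
  std::string out;
  out.reserve(utf8.size());
  for (unsigned char c : utf8) {
    if (c >= 0x80 || !IsAsciiSpace(c)) {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

A find or strip operation over the ASCII subset never needs a UTF-8-aware iterator, which is the point being made above.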
Reporter
Comment 8•21 years ago
Hmm, that does make me a lot more convinced that UTF-8 is a good way to go (though I dread the road to get there). So that means that stuff like parsing HTML or XML should be trivial from a UTF-8 string? Since we should convert our decoders to decode to UTF-8 rather than UCS2, right?
Comment 9•21 years ago
bz: depends on definition of space. JS defines the following code points to be spaces: 9, 10, 11, 12, 13, 32, 160, 8192, 8193, 8194, 8195, 8196, 8197, 8198, 8199, 8200, 8201, 8202, 8203, 8232, 8233, 12288 (these numbers courtesy javascript:s='';for(i=0;i<65536;i++)if(/^\s/.test(String.fromCharCode(i)))s=s?', '+i:i;s or something close -- session history doesn't keep javascript: URLs, darnit). /be
Comment 10•21 years ago
Make that javascript:s='';for(i=0;i<65536;i++)if(/^\s/.test(String.fromCharCode(i)))s=s?s+', '+i:i;s /be
Reporter
Comment 11•21 years ago
Hmm, actually we'd still be in trouble when it comes to parsing XML, though I guess expat will handle that for us. What about stuff like the DOM though? Can we really make DOM methods return UTF-8 strings? Doesn't JavaScript need UCS2 strings to avoid converting every time a DOM call is made?
Assignee
Comment 12•21 years ago
OK, here are the problems with the current string API that prevent me from doing this:

* ns[C]StringComparator, frozen, has an operator()(char_type, char_type), which it shouldn't. I don't think anybody uses this, but we have a bunch of implementations of it.
* ns[C]StringComparator::operator()(const char_type*, const char_type*, PRUint32) only takes a single length parameter -- i.e., it assumes that the comparison is one char_type unit at a time.
* nsAString::Equals checks length equality before calling the comparator (same bad assumption).

Perhaps we need nsAUTF8String after all.
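For illustration only, a comparator shaped the way the list above calls for -- taking a separate length for each side, so encodings where equal strings can differ in unit count remain expressible -- might look like this. The class names here are invented for the sketch; they are not the frozen nsStringComparator API:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical interface: note the two independent byte lengths, unlike the
// frozen comparator's single length parameter criticized above.
class nsAUTF8StringComparator {
 public:
  virtual ~nsAUTF8StringComparator() = default;
  // Returns <0, 0 or >0, like strcmp.
  virtual int32_t Compare(const char* aLhs, uint32_t aLhsLen,
                          const char* aRhs, uint32_t aRhsLen) const = 0;
};

// Trivial byte-wise implementation, just to show the interface in use.
class nsDefaultUTF8Comparator : public nsAUTF8StringComparator {
 public:
  int32_t Compare(const char* aLhs, uint32_t aLhsLen,
                  const char* aRhs, uint32_t aRhsLen) const override {
    uint32_t n = std::min(aLhsLen, aRhsLen);
    int r = std::memcmp(aLhs, aRhs, n);
    if (r != 0) return r;
    return static_cast<int32_t>(aLhsLen) - static_cast<int32_t>(aRhsLen);
  }
};
```

A case-insensitive UTF-8 implementation would plug into the same two-length signature; the key design point is that neither Equals nor the comparator may assume equal lengths imply a one-unit-at-a-time comparison.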
Assignee
Updated•21 years ago
Summary: case-insensitive compare for class isn't utf8-friendly → need nsCaseInsensitiveUTF8StringComparator
Reporter
Comment 14•21 years ago
Oh, another thing that I realized will be a problem is string length. JS, as well as XPath, needs to be able to get the Unicode length of strings. Basically what it comes down to is that programming environments (like JS and XPath) need to at least expose a Unicode API to the user. Which sounds like it's going to force a lot of conversions if we're using UTF-8 everywhere else in the code.
Comment 15•21 years ago
String length is no different in UTF-8 than it is in UTF-16, we just don't run into it with UTF-16 yet. Both encodings need to look at every single unit when calculating the character length of a string; Unicode is all about 32-bit characters, even if that's not talked about most of the time. So any code that returns the length of a UTF-* string should return how many Unicode characters that string represents, not how many 16-bit units it represents (as our current UTF-16 code does).
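The symmetry claimed above can be sketched with two hypothetical helpers (not Mozilla string API): both encodings need a linear scan to count code points, skipping continuation units.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Count Unicode code points in a UTF-8 buffer: continuation bytes match
// the bit pattern 10xxxxxx, so count every byte that is NOT a continuation.
size_t CodePointLengthUTF8(const unsigned char* s, size_t bytes) {
  size_t count = 0;
  for (size_t i = 0; i < bytes; ++i) {
    if ((s[i] & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Count Unicode code points in a UTF-16 buffer: a low surrogate
// (0xDC00-0xDFFF) is the second half of a pair, so skip those units.
size_t CodePointLengthUTF16(const uint16_t* s, size_t units) {
  size_t count = 0;
  for (size_t i = 0; i < units; ++i) {
    if (s[i] < 0xDC00 || s[i] > 0xDFFF) ++count;
  }
  return count;
}
```

Neither count is constant-time; only the unit count (bytes, or 16-bit units) is.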
Assignee
Comment 16•21 years ago
How often do you really need to know how many characters you have? Don't you usually just need to know how much storage the string takes up?
Reporter
Comment 17•21 years ago
Re comment 15: True, but we haven't converted our string classes to actually do UTF-16 yet, so we haven't seen any performance impact of truly doing UTF-16. My argument is that doing anything other than what we're doing now risks a big negative performance impact.

Re comment 16: JavaScript doesn't care about storage space. When the user calls .charAt() or .length on a string the engine needs to treat the string as Unicode. As I said in comment 14, of course we could do that when the user actually calls those methods, and store the strings internally as UTF-8 otherwise, but it might lead to a lot of conversions. Brendan would of course be the best person to ask about whether JavaScript can store its strings internally as UTF-8.

I'm all for looking over our string strategies, but we need to actually analyze what any change will do before starting to advocate it and changing code. This has been done way too much in the past, which has led us to the situation we are in right now.
Comment 18•21 years ago
> Re #16: javascript doesn't care about storagespace. When the user call
> .charAt() or .length on a string the engine needs treat the string as unicode.

Not so, see ECMA-262 Edition 3 (http://www.mozilla.org/js/language/E262-3.pdf, errata nearby), specifically sections 6 and 8.4. JS string length counts 16-bit UTF-16 storage units.

> As I said in comment 14, of course we could do that when the user actually
> calls those methods, and store the strings internally as utf8 otherwise, but
> it might lead to a lot of conversions.

It might, but maybe not.

> Brendan would of course be the best person to ask about weather javascript
> can store its strings internally as utf8.

JS could be taught about UTF-8 with some work. I'll write more about this tomorrow if I have time. /be
Comment 19•21 years ago
I don't understand the advantage of using UTF-8 over UTF-16 (or vice versa for that matter); both require similar worries over multibyteness, so I don't really see a performance problem either way (UTF-8 requires less space for, e.g., stylesheets, and I would guess more space for, e.g., CJK HTML documents; both require complicated length code, both can compare for US-ASCII characters pretty trivially, etc). Note that the DOM officially says that DOMString must be UTF-16. http://www.w3.org/TR/DOM-Level-3-Core/core.html#DOMString
Reporter
Comment 20•21 years ago
The debate isn't so much between UTF-16 and UTF-8 as between "what we have now" and UTF-8. At least as far as I'm concerned :)
Assignee
Comment 22•21 years ago
What we have now is UTF-16, although it was changed from UCS2 with no discussion whatsoever. (See bug 118000 and another bug that I can't find where I said roughly the same thing more strongly.)
Reporter
Comment 23•21 years ago
I'm not really convinced that we have 'true' UTF-16; pretty much all our code considers each PRUnichar a separate character, with no regard to multi-doublebyte characters. This is probably fine for most of the code, though, since it only deals with characters that appear as a single doublebyte.

I'm assuming here that UTF-16 has the same properties as UTF-8, i.e. having the high bit set for every multi-doublebyte character. Assuming that assertion is true then: (pretty much) only code that is concerned about characters that don't fit in a single (double)byte when UTF encoded needs to treat a string as if it is UTF encoded. Apparently we don't have much code that needs to concern itself with characters that are represented by multiple doublebytes in UTF-16, which means that we don't take a perf hit from UTF-16 encoding.

I'm not as convinced that the same is true for UTF-8. Basically every string operation that uses non-ASCII characters would have to be UTF aware. Has anyone looked into how much of our code uses non-ASCII? IMHO that would need to be looked over before we start advocating UTF-8.

Another thing I'm very interested in is whether anyone has looked into how much memory we would actually save by switching to UTF-8. Saving memory is nice and all, but switching to UTF-8 in all of Mozilla is a *huge* effort. My gut feeling is that if the same effort were put elsewhere we could get a bigger gain. It's not like there aren't other parts of Mozilla that need rearchitecting. Rewriting the HTML parser would be one example.
Assignee
Comment 24•21 years ago
> I'm assuming here that UTF16 has the same properties as UTF8, i.e. having
> the highbit set for every multi-doublebyte character.

That assumption is completely wrong.

UTF-16 encodes characters in planes 1 through 16 of Unicode using a set of reserved code units, the high surrogates (U+D800 - U+DBFF) and low surrogates (U+DC00 - U+DFFF). They must occur in high+low pairs, and a character c in planes 1 through 16 is encoded in two 16-bit units (with c' = c - 0x10000) as follows:

< (c' >> 10) | 0xd800 , (c' & 0x3ff) | 0xdc00 >.
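That encoding can be sketched directly in C++ (a standalone illustration, not Mozilla code). Note that the code point is first reduced by 0x10000 before being split into the two surrogate halves:

```cpp
#include <cassert>
#include <cstdint>

// Encode a supplementary-plane code point (U+10000..U+10FFFF) as a UTF-16
// surrogate pair: subtract 0x10000, then the top 10 bits go into the high
// surrogate and the low 10 bits into the low surrogate.
void EncodeSurrogatePair(uint32_t c, uint16_t* high, uint16_t* low) {
  uint32_t v = c - 0x10000;  // caller must ensure c >= 0x10000
  *high = static_cast<uint16_t>(0xD800 | (v >> 10));
  *low  = static_cast<uint16_t>(0xDC00 | (v & 0x3FF));
}
```

For example, U+1D11E (musical symbol G clef) encodes as the pair 0xD834 0xDD1E.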
Assignee
Comment 25•21 years ago
Advantages of UTF-8:
* all compilers can generate literals for the ASCII subset of UTF-8
* less storage for "code" type things (such as element and attribute names)
* less storage for Western users

vs. advantages of UTF-16 and UCS2:
* strings cannot be equal under case-insensitive comparison without being equal in length (although I don't know if the Unicode consortium promises to preserve that for the former)

==================

Advantages of UTF-8 and UTF-16:
* can represent all planes of Unicode

vs. advantages of UCS2:
* constant-width character units
Reporter
Comment 26•21 years ago
> > I'm assuming here that UTF16 has the same properties as UTF8, i.e. having
> > the highbit set for every multi-doublebyte character.
> That assumption is completely wrong.
...
> a character c in planes 1 through 16 is encoded in two 16-bit units as follows:
> < ((c & 0xffc00) >> 10) | 0xd800 , (c & 0x3ff) | 0xdc00 >.

Both 0xd800 and 0xdc00 have the high bit set, so I don't see how my assumption was false?

Another advantage of UCS2 (and possibly UTF-16, depending on the above assumption): you can search for, and parse out, many (all* that we care about?) characters in a string without having to UTF-decode or use a UTF-aware iterator.

* disregarding a few places such as the code touched in bug 118000
Comment 27•21 years ago
You can parse all the US-ASCII parts of a UTF-8 string without a UTF-8-aware decoder. Do we ever have to parse characters outside that range, other than in the initial reading step to check for data integrity (no overlong sequences, and so forth, required for security reasons)?
Comment 28•21 years ago
As far as multibyte characters go, only code units U+D800 through U+DBFF and U+DC00 through U+DFFF are surrogates. Plenty of 16-bit units with their high bit set are not part of UTF-16 multibyte sequences, while US-ASCII characters stored as 16-bit units always have their 9 most significant bits unset. I don't really understand what you were getting at, though. Why is this different or important or whatever?
Reporter
Comment 29•21 years ago
My first worry is that advocating UTF-8 is going to lead to a codebase with mixed string strategies, leading to a lot of conversions (this has already happened with atoms). See comment 2. If we want to do UTF-8 then we should do it everywhere.

However, my worry about doing that is that if we switch from "what we have now" (which is UTF-16 treated as UCS2 by pretty much all code) to UTF-8, we will take a performance hit. One problem is places where we do need to be able to treat strings as UTF-16. The first place that springs to mind is the JS engine. According to comment 18 it needs to expose strings as UTF-16 to the user. Can it store strings internally as UTF-8 anyway? Another place is the XML parser. It needs to check non-ASCII names; can it deal with data in UTF-8 format? (My guess is that it can.) Then there's the DOM. Comment 19 says it needs to expose strings as UTF-16. Can we ignore that and use UTF-8 anyway? I bet there are other places, though I can't think of any core ones right now.

Then dbaron pointed out that case-insensitive comparison can no longer early-return if string lengths are different. This might be an additional perf hit.

On top of this there's of course the argument of frozen APIs. Until we decide that we can break the frozen interfaces that contain string arguments, I don't see that we have an option at all.
What we have now is UTF-16, rather incompletely implemented.

> Apparenlty we don't have much code that needs to concern itself with
> characters that are represented by multi-doublebyte in UTF-16

Everywhere UTF-8 string code would have to worry about multibyte UTF-8 characters, UTF-16 string code would also have to worry about multibyte characters. I think in general there aren't as many of those places as you expect. I think the main issue with going to UTF-8 is those pesky APIs that pass around string positions and lengths in UTF-16 units. Thankfully we don't do that very often internally.

One way to help with those APIs would be to cache a bit on each JS/DOM string that is set to TRUE when we have determined that the string is entirely ASCII. (DOM strings effectively already have this.) On those strings we can still do constant-time indexing and length, so we only lose some performance when we have scripts manipulating non-ASCII strings.

From Hixie:
> (UTF-8 requires less space for, e.g., stylesheets, and I would guess more
> space for, e.g., CJK HTML documents;

I believe UTF-8 would require much less space for CJK documents. Here's a page from my wife's favourite site: http://www.mingpaonews.com/20040121/index.htm If you View Source, you'll see that the HTML markup swamps the actual CJK text. And that's not counting stylesheets or Javascript.

The internal string APIs dbaron identified are unfortunate, but we can live with them.
> Can it store strings internally as UTF-8 anyway?

Yes.

> Then there's the DOM. Comment 19 says it needs to expose strings as UTF-16.
> Can we ignore that and use UTF-8 anyway?

Yes. bz and I had a thread about this on .i18n a while back.
By the way, I believe that most string comparisons are to constant strings. And I have suggested a patch in bug 226439 that would reduce footprint significantly by specializing comparisons to ASCII strings ... the new Equals(const char*) does not do the length optimization.
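As an illustration of that kind of specialization (a hypothetical helper, not the actual bug 226439 patch), comparing a UTF-16 string against an ASCII literal needs neither a conversion nor an up-front length computation for the literal:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Compare a UTF-16 string (s, len units) against a NUL-terminated ASCII
// literal, unit by unit. No UTF-16 -> ASCII conversion and no strlen of the
// literal is needed: we walk both in lockstep and require both to end at
// the same position.
bool EqualsASCII(const uint16_t* s, size_t len, const char* ascii) {
  size_t i = 0;
  for (; i < len && ascii[i]; ++i) {
    if (s[i] != static_cast<unsigned char>(ascii[i])) return false;
  }
  return i == len && ascii[i] == '\0';
}
```

Since most comparisons are against constant ASCII strings, a specialization like this avoids both the temporary conversion buffer and the general comparator machinery.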
Well, one more issue with moving to UTF8 is that for Windows we'll have to do UTF8->UTF16 conversions when passing data to Win32 APIs.
Comment 34•21 years ago
Sicking, since the assumption on which comment 23 hangs was wrong, do you retract that comment? ;-)

> The first place that springs to mind is the js-engine. According to comment 18
> it needs to expose strings as UTF-16 to the user.

Not exactly -- more like as "UCS2". Again, JS string.length counts uint16 storage units, not characters. Indexing or slicing a JS string reckons by counting 16-bit storage units. This will remain true in JS2/ES4.

> Can it store strings internally as UTF-8 anyway?

Sure, it's just a small matter of programming. Or not, if conversions are rare between the JS-internal UCS2 world and the JS-external mostly-UTF-8 world.

I still have more to say, but not right now. /be
Comment 35•21 years ago
> Another place is the xmlparser. It needs to check non-ascii names, can it deal
> with data in UTF-8 format. (my guess is that it can though).

It can. See http://lxr.mozilla.org/seamonkey/source/expat/xmltok/xmldef.h#43 and the nice comments around http://lxr.mozilla.org/seamonkey/source/expat/xmlparse/xmlparse.h#85
Reporter
Comment 36•21 years ago
Is my assumption in comment 23 really wrong? dbaron's comment 24 seems to indicate it's right rather than wrong. Anyhow, the actual bits don't matter. What matters is how much of the code we would have to make do UTF decoding, and how critical the codepaths are that that code lives on.

Anyway, this discussion is way too theoretical. Until both the advantages and disadvantages of going to UTF-8 have been investigated, IMHO the discussion is just pointless guesswork. Advantages being: how much memory would we actually save? Much of the DOM stores strings as atoms -- attribute/element names, attribute values in XUL -- which means that the string data isn't duplicated (on top of this, atoms are already UTF-8). Text nodes store their string data as ASCII when possible.

The disadvantages are: Are we prepared to break many of our frozen APIs? Which codepaths would need to UTF decode or in other ways slow down? Are any of them critical? Can the JS engine be changed to use UTF-8? And when I say "can" I don't mean just theoretically; of course it's possible, but would such a change degrade performance too much, and would it be an acceptable change to the JS APIs?

I'm not trying to dictate what should be done with our string classes or what the official string strategy should be. However, I am asking how much investigation and discussion has gone into a decision like "we want to move towards UTF-8"?
There has been a lot of discussion over the years. At some point a number of people agreed that rebadging the CString classes as containing UTF8 would be a good way forward.
Except it turned out that the parser uses CStrings as bytebuffers with various encodings, or something like that.
Comment 39•21 years ago
>Except it turned out that the parser uses CStrings as bytebuffers with various
>encodings, or something like that.
this applies to more than just the parser. in fact, it typically applies
whenever we use nsCString with "narrow" filesystem paths, environment variable
strings, etc.
I say we should have a separate string class for "we don't know the encoding" strings, because most normal string operations can't be properly implemented for such "strings". (Well, they're not really strings, they're more like byte arrays.)
Comment 41•21 years ago
>I say we should have a seperate string class for "we don't know the encoding"
i wish, but...
while it would be very nice to have encoding aware string classes, i think it'd
be difficult to implement. the fact that C++ doesn't give you much help when
dealing with |const char*| and |const unsigned short*| makes it easy for
programmers to abuse the system. programmers need to be conscientious of
character encodings (i think that's just what it boils down to).
No, I mean we should have a string class where the only operations we provide are ones we can do if we don't know the encoding ... I guess that's concatenation and (everything-sensitive) equality testing.
Does concatenation even fit in that set? Two unknown-encoding strings might not be same-unknown-encoding, in which case you're unlikely to get a "string" as a result of concatenating them, I think.
Perhaps we can trust the caller to ensure that the two strings are in the same encoding, even if we don't know what it is.
Comment 45•21 years ago
>No, I mean we should have a string class where the only operations we provide
>are ones we can do if we don't know the encoding ... I guess that's
>concatenation and (everything-sensitive) equality testing.
isn't this mostly solved by the nsStringComparator "interface"? currently, you
have to pick the right comparator when doing "string" comparisons. (there is a
UTF-16 case-insensitive string comparator provided by unicharutil, for example.)
there are definitely other issues such as real character iteration. we could
probably introduce different iterator types (nsReadingUTF8Iterator) that
subclass nsReadingIterator<char> such that you could pass an instance to
BeginReading, but then request the next USC4 character from the iterator. this
could be done on top of the existing string API.
Comment 46•21 years ago
Re: comment 36:

First, you wrote: "I'm assuming here that UTF16 has the same properties as UTF8, i.e. having the highbit set for every multi-doublebyte character." And you're right that every surrogate pair has the high bit in each 16-bit storage unit set, but the reverse isn't true: finding a high bit set does not imply that consecutive multiple storage units comprise one character. So, as roc said in comment 30, everywhere you have to worry about UTF-8 multi-byte, you have to worry about UTF-16 multi-double-byte. It's true, if stripping Unicode white space, that we'll have to do a bit more work with UTF-8 than with UTF-16 -- but I believe there are space characters defined in plane 1 and above, so we have to look in either case.

Second: Mozilla 2.0 is an opportunity to break compatibility. Whether we'll want to break a bunch of compatibility remains to be seen; I think we need to break some, for our own long-term health. I'll be working on a list of APIs to consider deprecating before 2.0, so we can obsolete them in 2.0 -- and what's more important, many of us cc'd on this bug will be working on better APIs to replace the obsoleted ones. Strings are a prime target. /be
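The distinction drawn above can be sketched with a few hypothetical predicates: in UTF-16, only the two surrogate ranges signal a multi-unit character; a set high bit by itself does not.

```cpp
#include <cassert>
#include <cstdint>

// High surrogates lead a pair; low surrogates trail it.
bool IsHighSurrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool IsLowSurrogate(uint16_t u)  { return u >= 0xDC00 && u <= 0xDFFF; }

// Any non-surrogate unit is a complete character on its own, even with the
// high bit set -- e.g. U+4E2D (a CJK ideograph) is one unit.
bool IsSingleUnitChar(uint16_t u) {
  return !IsHighSurrogate(u) && !IsLowSurrogate(u);
}
```

So a UTF-16 scanner only has to special-case units in 0xD800-0xDFFF, exactly as a UTF-8 scanner only has to special-case bytes with the high bit set.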
Comment 47•21 years ago
http://www.unicode.org/notes/tn12/ is probably relevant to this discussion.
Comment 48•21 years ago
Although I like UTF-8 a lot as an external storage format (on Unix), I have quite a lot of reservations about switching to UTF-8 as 'the' _internal_ encoding. My favorite was UTF-32/UCS-4 back in 1998 and still is, but if that's not an option, I would go with UTF-16 (see bug 183048).

The fact that UTF-8 is ASCII compatible is a blessing in some aspects but has also been a cause of numerous bugs in Mozilla. One (side-)benefit of going to UTF-16/UTF-32 is that we can 'enforce' developers to think about the encoding. As pointed out by the author of UTN #12 (referred to by smontagu), only a few libraries/OSes use UTF-8 for internal processing (Gtk2/Gnome/Pango, Perl and BeOS are all that I know of), while there are many more libraries/OSes that use UTF-16 internally, for which there must be a reason. Besides, Gtk2/Gnome/Pango might switch to UTF-16/UTF-32 sometime in the future (according to the lead developer of Pango).
Keywords: intl
Comment 49•21 years ago
FYI, UTN #12 is wrong to say that Python uses UTF-16. It uses either UCS-2 (not UTF-16) or UTF-32 (on RedHat and Fedora Linux, Python is compiled in such a way that UTF-32 is used as the internal character representation format). See http://www.egenix.com/files/python/Unicode-EPC2002-Talk.pdf, which is a nice read by itself for the issue at hand.
> The fact that UTF-8 is ASCII compatible is a blessing in some aspects but also
> has been a cause for numerous bugs in Mozilla. One (side-)benefit of going to
> UTF-16/UTF-32 is that we can 'enforce' developers to think about the encoding.

UTF-16 and UTF-8 are susceptible to exactly the same kind of programmer errors: treating each unit (short or byte) as an individual character. One reason I like UTF-8 is that it will expose those errors as soon as you test with non-ASCII characters. UTF-16 will not expose errors until you test with non-BMP characters, which are very rarely encountered in the wild.

> there are many more libraries/OS that use UTF-16 internally, for which there
> must be a reason.

I claim the reason is nothing technical at all --- rather, that when Unicode support was first being designed into modern operating systems, programming languages and applications, it was not widely known that more than 2^16 characters would be required, so people assumed that UCS2 would always be sufficient and standardized on it. When it became clear that non-BMP characters would be required, it was far too late to rewrite everything, so everyone did what Mozilla has done --- replace 'UCS2' with 'UTF16' in documentation everywhere, fix some bugs and hope for the best. The other systems, those written since non-BMP was recognized as important, have been constrained by the platform APIs they depend on.

Indeed I agree that platform API compatibility (UTN#12's argument #2) is a powerful argument in favour of UTF-16. Maybe it's a bit less powerful for Mozilla than for other systems because we have our own cross-platform layer. UTN#12's argument #1, that Unicode is "optimized for" BMP characters, doesn't convince me. Given the choice I'd rather optimize for ASCII, given that we largely deal with markup and programs.
Comment 51•21 years ago
(In reply to comment #50)
> UTN#12's argument #1, that Unicode is "optimized for" BMP characters, doesn't
> convince me. Given the choice I'd rather optimize for ASCII given that we
> largely deal with markup and programs.

Is this really true? Our *input* is markup, but after the initial parsing, we largely deal with the text in the DOM, not the markup.
It depends on what you mean by "dealing with", which is about as fuzzy a concept as "optimized for". In some sense we don't "deal with" DOM text very much. I think most of the Mozilla code that's manipulating strings is not manipulating real DOM text. Rather, there's endless mucking about with values and attributes and URIs and names and headers and protocols etc etc etc.

Note that currently our content model stores actual DOM text nodes (and attribute values??) in either UTF-16 or "UTF-16 compressed to 8 bits per character because all high bytes are zero", which seems like something we definitely DON'T want to change.

We desperately need some actual numbers. I should try to instrument the string classes some more.
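The 8-bit compression mentioned above hinges on a simple check, sketched here as a hypothetical helper: a UTF-16 buffer can be stored one byte per unit exactly when every unit's high byte is zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when every 16-bit unit fits in one byte, i.e. the whole string is in
// the Latin-1 range and the high bytes carry no information.
bool CanStoreAs8Bit(const uint16_t* s, size_t units) {
  for (size_t i = 0; i < units; ++i) {
    if (s[i] > 0xFF) return false;
  }
  return true;
}
```

Note that this is a Latin-1 compression of UTF-16 storage, not a change of encoding: reconstituting the UTF-16 string is just widening each byte back to 16 bits.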
Reporter
Comment 53•21 years ago
Attribute values are stored as UTF-16 strings for HTML and XML, though HTML often stores the value as a 'parsed' value, i.e. a PRInt32, a double or an nsISupports. XUL does the same, except that it stores them as an atom when the length is less than 12 characters (the idea being that at less than 12 characters it's most likely a recurring value like "true" or "horizontal"). HTML and XML will soon do the same.
Comment 54•21 years ago
Re: comments 51 and 52: I'm with Simon on this point. Considering the recent trend toward making everything I18Nizable (markup languages, domain names, URIs/IRIs, email addresses, identifiers in programming languages), I don't think it makes as much sense as before to assume that we have to optimize for ASCII.

Re: comment 50:
> UTF16 and UTF8 are suspceptible to exactly the same kind of programmer errors:
> treating each unit (short or byte) as an individual character.

That's not the kind of error I was talking about. Both UTF-8 and legacy encodings are stored in byte arrays, while UTF-16 is stored in 'PRUnichar arrays'. This difference in storage class makes it easier to catch potential bugs if our APIs are written for UTF-16 or UTF-32 than for UTF-8. Besides, we already have a lot of APIs written for UCS-2, some of which have already been converted to work with UTF-16. Although it may not be a lot of work (writing a UTF-8 iterator is not hard, but do we have to bother?), I think it's still a distraction to rewrite them all for UTF-8.

> that when Unicode
> support was first being designed into modern operating systems, programming
> languages and applications, it was not widely known that more than 2^16

Well, Unicode began with 16 bits in the late 1980s and was later expanded to 20.1 bits when it was merged with ISO 10646 (which began as a 31-bit character set). Anyway, I've made a similar argument about UTF-16 [1]. However, I wouldn't use it to favor UTF-8 over UTF-16 [2]. I always used it to justify my preference for UTF-32 (glibc and Python use UTF-32).

[1] ICU developers' choice of UTF-16 cannot be accounted for by that, though, because the ICU project was started after Unicode became 20.1 bits.
[2] http://mail.nl.linux.org/linux-utf8/2003-07/msg00114.html In this post, I was 'defending' UTF-8 and arguing that it's 'possible' to write APIs in UTF-8. Then as well as now, however, I would choose UTF-16 APIs if I had to choose between UTF-8 and UTF-16.
> Considering recent trend toward making everything I18Nizable (markup
> languages, domain names, URIs/IRIs, email address, identifiers in programming
> languages), I don't think it makes as much sense as before to assume that we
> have to optimize for ASCII.

Just because something's internationalizable doesn't mean we'll actually see significant quantities of I18N text there anytime soon, or ever... Java source and XML have been Unicode from the beginning, but there's not a lot of non-ASCII Java source or XML vocabularies out there that I've seen.

> This difference in storage class makes it easier to catch potential bugs

Can you give an example of the kind of bugs you're talking about? Can they still happen considering we're using string classes, not raw arrays?
Comment 56•21 years ago
Ugh, I've read about 75% of this conversation, and what it sounds like is a debate between carpenters: when we build this house [which in this case is already built], should we be using nails or screws?

I feel pretty strongly that there are appropriate uses for both UTF-8 and UTF-16. We could debate this encoding or that, but there is never going to be a universally best encoding. Furthermore, by enforcing some kind of universal encoding, we will NOT avoid programmer error. We're not going to be able to enforce it syntactically, and we're not going to prevent bugs by posting "We use UTF-16 now" to a newsgroup.

I also don't believe that the string classes are or even should be containers for anything other than const char* and const short* buffers. I also believe, as darin points out, that it would be impractical to try to decide and enforce a singular encoding for any set of string classes, since they are so tightly coupled to their C++ type counterparts const char* and const short*. The "enforcement" would involve so much abstraction on TOP of the string APIs that people would run away from this project screaming because they couldn't append the letter 'a' to their string without jumping through hoops.

But going back to the basic issue of what universal encoding we use, I think we MUST deal with a mishmash of encodings if we want to be a fast, lightweight browser. Maybe that sounds contradictory, but here's my perspective: some data is simply going to be ASCII at least 99% of the time, and for that data we should be using UTF-8 or even ASCII itself (when that percentage is 100%). Data that comes to mind in this case:
- internal string tags
- HTML markup tags
- HTML markup attribute names
- HTML entity names
- most HTML markup attribute values
- CSS keywords
- CSS selector text
- ASCII-based protocol keywords (i.e. header names in HTTP)
- most atoms (which are used for most of the above)
- .property file key names (i.e. the "foo" in foo=bar)
- filenames

Now, there are plenty of uses for UTF-16 as well. Namely:
- CDATA
- HTML entity values
- .property values

Both lists can be expanded, but the key is that there are uses for both. And when we need something like this bug requests, nsCaseInsensitiveUTF8StringComparator, we need to look at why we need it and whether the data we need it for is stored in an appropriate encoding. In the cases originally mentioned in this bug it seems very appropriate, and IMHO nsCaseInsensitiveUTF8StringComparator has been long overdue.
Comment 57•21 years ago
(In reply to comment #51)

> Is this really true? Our *input* is markup, but after the initial parsing, we
> largely deal with the text in the DOM, not the markup.

I wanted to address this too - or at least back up what roc said. It's true that we deal with much of the parsed text, like HTML tags and attributes, in the DOM. However, ask yourself how many times we do a string comparison on a tag name to resolve some CSS, vs. the few times we might access that same tag or attribute via the DOM. CSS resolution happens MUCH more on the average web page, because every web page has to be laid out, period. That's what we want to optimize. Not every tag or attribute has to be accessed via the DOM (and most are not, on most web pages). So why would we optimize for the case which is used the least?
Reporter
Comment 58•21 years ago
The problem with mixed encodings is that it's very easy to end up doing tons of conversions. For example, when atoms were converted to UTF8, this gave a performance hit that was never addressed (not saying that the patch as a whole was bad, but the fact remains that it gave a perf hit). It also gave even bigger perf hits in some extreme cases: we have a real-world XSLT stylesheet that got on the order of 50%-100% slower since it performed a lot of compares with node names. I agree that in some cases it makes sense to use different encodings, but we need to do it very carefully.
Comment 59•21 years ago
I would bet that the reasons for slowdowns of that magnitude are not that we now have more strings of different encodings, but rather that our tools for working with these encodings are less than ideal. Just to state one example: right now, if you want to compare a UTF8 string to a UTF16 string, you're required to convert and copy either string to the other encoding to get matching strings, and then compare, when all you should really need to do is call a method that takes a UTF8 string and a UTF16 string and does the compare w/o all the suck. I seriously doubt that using UTF8 over UTF16 has any serious performance problems if we've got the right tools to work with here. Clearly we do not have those at this point, so let's work on that rather than fight over which way to lean here, ok? :-)
Comment 60•21 years ago
Yay! jst has hit the nail on the head - it's the tools and manner of string management, not the fact that we have to do the conversions themselves. The atom conversion was obviously an incomplete job - and I'm sorry about that. I should have followed it up more quickly with some UTF8 string cleanup, but life got in the way :) Anyhow, let's move forward with this string comparator. We also need an actual implementation of CopyUTF8toUTF16 and so forth.
Assignee
Comment 61•21 years ago
(In reply to comment #60)

> Anyhow, lets move forward with this string comparator. We also need an actual
> implementation of CopyUTF8toUTF16 and so forth.

It's a little hard to move forward without breaking frozen APIs. See comment 12. We've had an implementation of CopyUTF8toUTF16 for ages.
Comment 62•21 years ago
hmm... we can definitely write UTF-8 and UTF-16 iterators that produce UTF-32 characters with each iteration. we could use those iterators to write functions like:

  PRInt32 CompareUTF8toUTF16(utf8String, utf16String);

it's not as nice as subclassing a comparator interface that would plug into Equals and other methods, but it is a solution :-/
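[Editorial note] The iterator idea above can be sketched in standard C++. This is a hypothetical illustration, not the Mozilla implementation: std::string stands in for the flat UTF-8 string classes, the function name is made up, and the decoder assumes well-formed input with no validation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical sketch of one step of a UTF-8 iterator: decode the
// sequence starting at `i` into a single UTF-32 code point, advancing
// `i` past it. Assumes well-formed UTF-8; real code would validate.
uint32_t NextUTF32FromUTF8(std::string::const_iterator& i)
{
    uint32_t c = static_cast<unsigned char>(*i++);
    int trailing = 0;
    if (c >= 0xF0)      { c &= 0x07; trailing = 3; }  // 4-byte sequence
    else if (c >= 0xE0) { c &= 0x0F; trailing = 2; }  // 3-byte sequence
    else if (c >= 0xC0) { c &= 0x1F; trailing = 1; }  // 2-byte sequence
    // ASCII bytes (< 0x80) fall through with trailing == 0.
    while (trailing--)
        c = (c << 6) | (static_cast<unsigned char>(*i++) & 0x3F);
    return c;
}
```

For example, the two-byte sequence 0xC3 0xA9 ("é") decodes to the single code point U+00E9. A matching UTF-16 iterator would do the analogous surrogate-pair combination.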
Comment 63•21 years ago
Yeah, such a thing would get us some mileage on its own. And as a nice optimization, we'd only need to revert to extracting UTF-32 characters out of the strings when non-ASCII characters are encountered in either string, which would make the character iteration trivial and really fast in 99% of the cases. I.e. something like:

PRInt32
CompareUTF8toUTF16(const nsAFlatCString& utf8String,
                   const nsAFlatString& utf16String)
{
  nsAFlatCString::const_iterator iter8, end8;
  nsAFlatString::const_iterator iter16, end16;
  utf8String.BeginReading(iter8);
  utf8String.EndReading(end8);
  utf16String.BeginReading(iter16);
  utf16String.EndReading(end16);
  while (iter8 != end8 && iter16 != end16) {
    const char ch = *iter8;
    if (IsAscii(ch)) {
      if (ch != *iter16) {
        return ch < *iter16 ? -1 : 1;
      }
    } else {
      // extract UTF-32 characters from iter8 and iter16 and compare.
      // Advance both iterators appropriately.
    }
    ++iter8;
    ++iter16;
  }
  return 0;
}

If we had that, and our atom code used it (and it knew its string length to avoid looking for the null terminator all the time), atom-to-string compares might not be very slow at all.
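[Editorial note] For completeness, here is a hedged sketch in standard C++ of what the elided non-ASCII branch involves, including surrogate-pair decoding on the UTF-16 side. std::string and std::u16string stand in for the Mozilla string classes, the helper names are made up for illustration, and both decoders assume well-formed input.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Decode one code point from well-formed UTF-8, advancing the iterator.
static uint32_t TakeUTF8(std::string::const_iterator& i)
{
    uint32_t c = static_cast<unsigned char>(*i++);
    int n = 0;
    if (c >= 0xF0)      { c &= 0x07; n = 3; }
    else if (c >= 0xE0) { c &= 0x0F; n = 2; }
    else if (c >= 0xC0) { c &= 0x1F; n = 1; }
    while (n--)
        c = (c << 6) | (static_cast<unsigned char>(*i++) & 0x3F);
    return c;
}

// Decode one code point from well-formed UTF-16, combining a high/low
// surrogate pair into a supplementary-plane code point.
static uint32_t TakeUTF16(std::u16string::const_iterator& i)
{
    uint32_t c = *i++;
    if (c >= 0xD800 && c <= 0xDBFF)
        c = 0x10000 + ((c - 0xD800) << 10)
                    + (static_cast<uint32_t>(*i++) - 0xDC00);
    return c;
}

// Code-point-wise three-way compare with an ASCII fast path, in the
// spirit of the sketch in comment 63.
int CompareUTF8toUTF16(const std::string& s8, const std::u16string& s16)
{
    auto i8 = s8.begin();
    auto i16 = s16.begin();
    while (i8 != s8.end() && i16 != s16.end()) {
        unsigned char ch = static_cast<unsigned char>(*i8);
        uint32_t c8, c16;
        if (ch < 0x80) {          // ASCII: one byte, one UTF-16 unit
            c8 = ch; ++i8;
            c16 = *i16; ++i16;
        } else {                  // slow path: extract full code points
            c8 = TakeUTF8(i8);
            c16 = TakeUTF16(i16);
        }
        if (c8 != c16)
            return c8 < c16 ? -1 : 1;
    }
    if (i8 == s8.end() && i16 == s16.end())
        return 0;
    return i8 == s8.end() ? -1 : 1;  // shorter string sorts first
}
```

Note this orders by code point, so a UTF-8 string and a UTF-16 string compare equal exactly when they encode the same sequence of Unicode characters, with no intermediate copy or conversion.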
Reporter
Comment 64•21 years ago
Unfortunately this wouldn't help us in the XSLT case, since we're essentially a programming environment: when someone asks for a nodename, we have no idea if they'll later use substring, concat, or equality operations on it. So we'll always have to simply convert to a string. I realize that XSLT and XPath aren't high priorities, but the same thing applies in js, which is why I'm very curious to see if it's practically possible to make js use UTF8 internally without too big a performance degradation.
Comment 65•21 years ago
I seriously doubt the string conversion overhead is anything to worry about. Compared to the string copy that always happens when you access strings through the DOM today, the added conversion will hardly make a huge dent in app performance. If you're comparing string sharing against copy+convert, then the copy+convert will naturally lose, but I don't see that being the common case. I also doubt that it matters at all (or enough that it shows up on perf tests) whether JS stores strings internally as UTF-8 or UTF-16 (or UCS2). IMO any place that already does a copy won't significantly degrade in performance with the added conversion, if done right. Currently we don't do it right, so it hurts in a lot of places, but that doesn't mean it can't be done right.
Status: NEW → ASSIGNED
Comment 66•19 years ago
hm, the bug's summary seems to be a duplicate of bug 145975, but the discussion seems to have moved away from that
Updated•19 years ago
Component: Style System (CSS) → String
Updated•15 years ago
QA Contact: ian → string
Comment 67•7 years ago
I'm closing this WONTFIX since I don't think we have any non-ASCII case-insensitive matching left in the platform.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
Updated•4 years ago
Component: String → XPCOM