Closed Bug 805081 Opened 12 years ago Closed 9 years ago

Add UTF-8 handling to SpiderMonkey

Categories: Core :: JavaScript Engine, defect
Priority: Not set
Severity: normal
Status: RESOLVED FIXED
Reporter: terrence
Assignee: Unassigned
There is a potentially large amount of memory to be saved by storing JSStrings internally as UTF-8 or ASCII when they are given to us in that format. As an example, see https://bugs.webkit.org/show_bug.cgi?id=66161. Any spec-conforming access would have to convert the string to UCS2, but we would have done that conversion anyway, so any case where we get to delay it is a potential 2x memory win. Moreover, the Intl module will require us to link ICU, so we will be able to do this without adding a significant amount of new code.
To wit: another proposal we've been entertaining is for strings to have either an ASCII or a UCS2 representation. This preserves O(1) random access and should retain most of the memory savings. The danger/challenge with any of these approaches is what we do at API boundaries that expect UCS2.
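To make the dual-representation idea concrete, here is a minimal sketch, assuming hypothetical names and std::string/std::u16string storage rather than SpiderMonkey's actual JSString layout: a string that stays ASCII until some API boundary demands two-byte data.

```cpp
#include <cstddef>
#include <string>
#include <variant>

// Hypothetical dual-representation string: ASCII when the source data
// allows it, widened to two-byte code units only on demand.
class DualString {
    std::variant<std::string, std::u16string> rep_;

  public:
    explicit DualString(std::string ascii) : rep_(std::move(ascii)) {}

    // O(1) random access is preserved in both representations.
    char16_t charAt(size_t i) const {
        if (auto* a = std::get_if<std::string>(&rep_))
            return char16_t(static_cast<unsigned char>((*a)[i]));
        return std::get<std::u16string>(rep_)[i];
    }

    // API boundaries that require UCS2 data pay a one-time widening;
    // after that, the string stays two-byte. Delaying this conversion
    // is the potential 2x memory win for ASCII-heavy content.
    const char16_t* ensureTwoByte() {
        if (auto* a = std::get_if<std::string>(&rep_))
            rep_ = std::u16string(a->begin(), a->end());
        return std::get<std::u16string>(rep_).c_str();
    }
};
```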
And then there's the discussion at https://groups.google.com/forum/#!msg/mozilla.dev.servo/aMyWB7p6MaY/TCitUarYdyQJ. From Servo's point of view, a split "ASCII or UCS-2" representation is strictly worse than "UTF-8", I suspect...
Please don't use "UCS-2" anymore. UCS-2 (which didn't allow for supplementary characters) is thoroughly obsolete, and ES6 won't reference it anymore. Where ES6 interprets strings or source code, it's as UTF-16 (with a few compatibility warts).
Depends on: 805080
We removed the MemShrink tag because the gains to be had are so unclear at the moment, but we can add it back if more measurements are done.
Whiteboard: [memshrink]
(In reply to Norbert Lindenberg from comment #3)
> Please don't use "UCS-2" anymore. UCS-2 (which didn't allow for
> supplementary characters) is thoroughly obsolete, and ES6 won't reference it
> anymore. Where ES6 interprets strings or source code, it's as UTF-16 (with a
> few compatibility warts).

Okay, I am a bit confused here, so let me try to clarify my understanding.

Currently, the character representation we expose to the web is neither UCS-2 nor UTF-16, but a 2-bytes-at-a-time view of a UTF-16 encoded string. I think for the immediate future we will need to keep exposing such a representation even if we change it internally. Not coincidentally, this is exactly how we represent our characters internally in SpiderMonkey (anything of type jschar*).

From a C++ perspective, I don't think we want to call this representation UTF-16: any routine we pass these chars to in the engine is not going to treat them as UTF-16. My thinking is that we should call the extant type something explicitly vague like "TwoByteChars". Later, as we bring up correct UTF-16 processing, we can add a UTF16 type, which these routines will take and which will automatically upcast from TwoByteChars.

Does this make sense, or am I missing something important?
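One way to picture that type segregation in code, with invented names (TwoByteChars/UTF16Chars here are illustrative stand-ins, not the engine's actual API, and deriving one from the other is just one reading of the "upcast" remark):

```cpp
#include <cstddef>

using jschar = char16_t;

// Deliberately vague: a range of 16-bit units with no claim about
// UTF-16 validity.
class TwoByteChars {
    jschar* start_;
    size_t length_;

  public:
    TwoByteChars(jschar* start, size_t length)
      : start_(start), length_(length) {}

    jschar* begin() const { return start_; }
    size_t length() const { return length_; }
};

// "Is-a" two-byte range, plus the promise that the data is valid
// UTF-16; code-point-aware helpers would live here.
class UTF16Chars : public TwoByteChars {
  public:
    using TwoByteChars::TwoByteChars;
};

// Routines that merely shuffle 16-bit units accept the weak type...
size_t countUnits(TwoByteChars chars) { return chars.length(); }

// ...so validated text flows in via the implicit derived-to-base
// conversion, while the reverse direction requires explicit validation.
size_t unitsInText(UTF16Chars text) { return countUnits(text); }
```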
Blocks: 495790
Depends on: 811060
(In reply to Terrence Cole [:terrence] from comment #5)
> Currently, the character representation we expose to the web is neither
> UCS-2 nor UTF-16, but a 2-bytes-at-a-time view of a UTF-16 encoded string.

Actually, ECMAScript strings are simply sequences of 16-bit unsigned integers; most of ECMAScript doesn't even care whether they're text or some other data type.

UCS-2 or UTF-16 only come into play where strings are interpreted as text: UTF-16 means that supplementary characters are supported; UCS-2 means that they're not. ES 5 uses both terms, and some functionality is clearly specified as not supporting supplementary characters; in ES 6 we're fixing that as much as possible.

> I think for the immediate future we will need to keep exposing such a
> representation even if we change it internally.

Correct; ECMAScript requires this.

> Not coincidentally, this is exactly how we represent our characters
> internally in SpiderMonkey (anything of type jschar*).

Internally, you're free to use whatever representation you like, as long as clients see their sequences of 16-bit units.

> From a C++ perspective, I don't think we want to call this representation
> UTF-16: any routine we pass these chars to in the engine is not going to
> treat them as UTF-16.

Correct.

> My thinking is that we should call the extant type something explicitly
> vague like "TwoByteChars".

jschar seems fine. Languages like C/C++/Java have unfortunately broken any connection between char and character a long time ago. I'd avoid "TwoByte" because (a) it unnecessarily raises the question of byte order, (b) it's long been abused for a different kind of encoding. If you really need something different from jschar, CodeUnit would work.

> Later, as we bring up correct UTF-16 processing, we can add a UTF16 type,
> which these routines will take and which will automatically upcast from
> TwoByteChars.

How would that type differ from jschar*? If that's where you add methods for code point handling, or other UTF-16 specific functionality, it might be worthwhile.
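As a concrete illustration of the code-unit/code-point distinction drawn above, here is a small standalone sketch (not engine code; the names are invented for the example) showing how a UTF-16 interpretation pairs two surrogate code units into one supplementary code point, where a UCS-2 interpretation would see two unrelated 16-bit units:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Decode the code point starting at index i of a 16-bit unit sequence,
// using UTF-16 rules: a lead surrogate followed by a trail surrogate
// forms one supplementary code point.
uint32_t codePointAt(const char16_t* units, size_t length, size_t i) {
    char16_t lead = units[i];
    if (lead >= 0xD800 && lead <= 0xDBFF && i + 1 < length) {
        char16_t trail = units[i + 1];
        if (trail >= 0xDC00 && trail <= 0xDFFF)
            return 0x10000 + ((uint32_t(lead) - 0xD800) << 10) +
                   (uint32_t(trail) - 0xDC00);
    }
    return lead;  // BMP character, or an unpaired surrogate left as-is.
}

int main() {
    // U+1F600 is stored as two code units, so the JS string has length 2.
    const char16_t emoji[] = { 0xD83D, 0xDE00 };
    printf("U+%X\n", (unsigned)codePointAt(emoji, 2, 0));  // prints U+1F600
}
```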
(In reply to Norbert Lindenberg from comment #6)
> (In reply to Terrence Cole [:terrence] from comment #5)
>
> > Currently, the character representation we expose to the web is neither
> > UCS-2 nor UTF-16, but a 2-bytes-at-a-time view of a UTF-16 encoded string.
>
> Actually, ECMAScript strings are simply sequences of 16-bit unsigned
> integers; most of ECMAScript doesn't even care whether they're text or some
> other data type.
>
> UCS-2 or UTF-16 only come into play where strings are interpreted as text:
> UTF-16 means that supplementary characters are supported; UCS-2 means that
> they're not. ES 5 uses both terms, and some functionality is clearly
> specified as not supporting supplementary characters; in ES 6 we're fixing
> that as much as possible.

Thanks for chiming in! These comments are incredibly helpful.

> > My thinking is that we should call the extant type something explicitly
> > vague like "TwoByteChars".
>
> jschar seems fine. Languages like C/C++/Java have unfortunately broken any
> connection between char and character a long time ago. I'd avoid "TwoByte"
> because (a) it unnecessarily raises the question of byte order, (b) it's
> long been abused for a different kind of encoding. If you really need
> something different from jschar, CodeUnit would work.

You are right that the name TwoByteChars is extremely awkward. The reason we want to change this type is to make changing the internal representation easier. A C++ type gives us a clean way to segregate the code base into places that understand the new internal representation, places that still use the old representation, and places that need whatever representation ES6 eventually settles on. We could stick with jschar* for one of the types, but in C++ it is easier to make a pointer-like class behave correctly than a raw pointer.

Specifically, we want to transition to mozilla::Range and mozilla::RangedPtr for all of our character types. Range carries the length for us so we don't have to pass it manually, and RangedPtr performs extensive bounds checking for safety and error detection. Not coincidentally, this nicely partitions the code between places that take a null-terminated pointer and those that take a pointer+length.

As to the specific name to use for our existing jschar* type: CodeUnit seems a bit too generic. Perhaps ES5Chars would be better?

> > Later, as we bring up correct UTF-16 processing, we can add a UTF16 type,
> > which these routines will take and which will automatically upcast from
> > TwoByteChars.
>
> How would that type differ from jschar*? If that's where you add methods for
> code point handling, or other UTF-16 specific functionality, it might be
> worthwhile.

Yes, that is basically what I mean. In addition to providing characters and strings to the web, we have to interface with Gecko, which /is/ fully UTF-16 aware. It's not 100% clear what that means, but the status quo of sending broken ASCII error messages to the web console is clearly sub-optimal.
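An illustrative, heavily simplified stand-in for that Range/RangedPtr shape (the real classes live in mfbt/Range.h and mfbt/RangedPtr.h and do considerably more; this sketch only shows the carried length and the bounds checks):

```cpp
#include <cassert>
#include <cstddef>

// Pointer that remembers the range it came from and asserts on every
// movement and dereference that it is still inside it.
template <typename T>
class RangedPtr {
    T* ptr_;
    T* rangeStart_;
    T* rangeEnd_;

    void checkSanity() const {
        assert(rangeStart_ <= ptr_ && ptr_ <= rangeEnd_);
    }

  public:
    RangedPtr(T* ptr, T* start, T* end)
      : ptr_(ptr), rangeStart_(start), rangeEnd_(end) { checkSanity(); }

    T& operator*() const {
        assert(ptr_ < rangeEnd_);  // may point at, but not read, the end
        return *ptr_;
    }

    RangedPtr& operator++() { ++ptr_; checkSanity(); return *this; }
};

// Pointer and length travel together, so callers stop passing a separate
// length argument (and null-terminated APIs stand out by contrast).
template <typename T>
class Range {
    RangedPtr<T> start_;
    size_t length_;

  public:
    Range(T* p, size_t length)
      : start_(p, p, p + length), length_(length) {}

    RangedPtr<T> begin() const { return start_; }
    size_t length() const { return length_; }
};
```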
Whiteboard: [js:t]
Depends on: 724533
Assignee: general → nobody
I think we're totally fixed here, certainly since Jan did the jschar/latin1 split.
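(For later readers: after that split, a string's characters are stored either as one-byte Latin1 units or as two-byte units, and JSAPI consumers branch on the representation. A rough sketch of that consumer-side pattern, using the JSAPI accessors of that era from memory; error handling is elided and exact headers may differ, so treat it as an approximation.)

```cpp
#include "jsapi.h"  // string accessors lived here around that time

// After the Latin1/two-byte split, ask which representation the string
// uses, then fetch the matching character pointer.
static size_t CountSpaces(JSContext* cx, JSString* str) {
    size_t length = 0, spaces = 0;
    JS::AutoCheckCannotGC nogc;
    if (JS_StringHasLatin1Chars(str)) {
        const JS::Latin1Char* chars =
            JS_GetLatin1StringCharsAndLength(cx, nogc, str, &length);
        for (size_t i = 0; i < length; i++)
            spaces += (chars[i] == ' ');
    } else {
        const char16_t* chars =
            JS_GetTwoByteStringCharsAndLength(cx, nogc, str, &length);
        for (size_t i = 0; i < length; i++)
            spaces += (chars[i] == u' ');
    }
    return spaces;
}
```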
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED