Closed Bug 1449861 Opened 7 years ago Closed 5 years ago

Add a UTF-8 WebIDL string type

Categories

(Core :: DOM: Bindings (WebIDL), enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
mozilla73
Tracking Status
firefox73 --- fixed

People

(Reporter: hsivonen, Assigned: emilio)

References

(Blocks 2 open bugs)

Details

Attachments

(4 files, 1 obsolete file)

This should probably be built on bug 1449849. Let's add a WebIDL string type (to be used where specs say USVString or where the specs say DOMString but we're willing to break unpaired surrogates) that shows up as UTF-8 on the C++ side. Use cases: * CSSOM * Maybe TextEncoder depending on details * Probably a lot more over time Details: Details: * When a string is passed from the JS engine to the DOM implementation, the DOM implementation should receive an nsACString containing guaranteed-valid UTF-8. When the JS string has Latin1 storage and ASCII content, there should be no copy. If the JS string is backed by an nsStringBuffer, the nsACString should be backed by the same buffer. If the JS string isn't backed by nsStringBuffer but has Latin1 storage and ASCII content, the nsACString should be a nsDependentString. Otherwise, it should be a conversion using encoding_rs. Specifically, if the storage of the JS string is Latin1, the conversion should be Latin1 to UTF-8 without pivoting via UTF-16. * When a string is passed from the DOM to the JS engine, an all-ASCII nsStringBuffer should be adopted without a copy. If the DOM-side string is in the Latin1 range, UTF-8 to Latin1 conversion should happen using encoding_rs without a pivot via UTF-16. * When a string is passed from the DOM to the JS engine, the DOM should be able to pass an nsStringBuffer, it's filled length and an indication of whether it has Latin1 or UTF-16 semantics.
Priority: -- → P3
This could help quite a bit with OOM-smalls like bug 1471027.
Blocks: 1471027
This would also potentially benefit Rust WASM-support. Currently when wasm-bindgen passes string through the boundary of JS-WASM, it uses TextEncoder and TextDecoder to do the conversion. If they can avoid allocations and reuse the same underlying string buffer when possible, the performance there may be improved as well.
Flags: needinfo?(bzbarsky)
Depends on: 1561567
No longer depends on: 1449849
Flags: needinfo?(emilio)

I'm giving this a shot.

Assignee: nobody → emilio
Flags: needinfo?(emilio)

Tests in the next patch.

Depends on D58628

So as to avoid the heap allocation for small strings.

Depends on D58629

In particular, the ones where we transcode unconditionally atm (property names
and such).

There are others like cssText getters and setters which are a bit harder,
because I either need to rewrite all our serialization code to work with UTF8
(which is fine, but a lot of work), or teach webidl to have a setter that takes
UTF8String as input but returns DOMString as output (which is at best hacky).

Depends on D58630

Blocks: 1606957
Blocks: 1606958
Attachment #9118545 - Attachment is obsolete: true
Blocks: 1606995
Pushed by ealvarez@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f92bcbfef7aa Add UTF8String to the WebIDL parser. r=bzbarsky https://hg.mozilla.org/integration/autoland/rev/7cdbfc327a81 Add codegen support for UTF8String. r=bzbarsky https://hg.mozilla.org/integration/autoland/rev/6e17f3a19fd8 Use UTF8String for some CSSOM APIs. r=bzbarsky
Blocks: 1607006
Blocks: 1607080
Blocks: 1607083

Is there anything we could do here to ease reviewing changes to .webidl, since now we'll have more webidl not following what the specs have?

(In reply to Olli Pettay [:smaug] from comment #14)

Is there anything we could do here to ease reviewing changes to .webidl, since now we'll have more webidl not following what the specs have?

Observables should be effectively the same as USVString, or DOMString + conversion to UTF-8... I don't know, I guess I could change this so that this is something like an annotation on DOMString / USVString, if you think that'd be easier to review... Not sure.

If UTF8String does not match USVString, bug 1607083 will violate the spec. If UTF8String does not match DOMString, bug 1607080 will violate the spec. Is it possible for UTF8String to match both?

(In reply to Masatoshi Kimura [:emk] from comment #16)

If UTF8String does not match USVString, bug 1607083 will violate the spec. If UTF8String does not match DOMString, bug 1607080 will violate the spec. Is it possible for UTF8String to match both?

All our CSSOM / CSS parsing code converts to UTF-8, so the DOMString in bug 1607080 is a lie. Semantics are the same as USVString already. This is ok per https://github.com/w3c/csswg-drafts/issues/1217.

So I support USVString + annotation. At least CSSOM APIs should use typedef UTF8String CSSOMString.

(In reply to Masatoshi Kimura [:emk] from comment #18)

At least CSSOM APIs should use typedef UTF8String CSSOMString.

That's the ideal final state, yeah... Though it is not quite as easy, see bug 1606995

TLDR returned strings are utf-16 as of now, and I need at least to do a fair bit of profiling to prove that the UTF8String -> JSString conversion step is not worse than what we're doing. And if it's worse, then I have to optimize it somehow.

So I support USVString + annotation.

I'm torn about this. The problem with that approach is that then you need to find all places USVString is handled and make sure the annotation is handled right. With a separate type there's less chance of people messing things up in the codegen. that said, there's definitely value in having the IDL say "USVString semantics, but C++ will see UTF-8, not UTF-16", and the right way to do that is in fact via an annotation on the type. If we did that, then we could in fact have the string be UTF-8 coming in from JS and UTF-16 going out to JS if we wanted. Pretty straightforward to do...

In either case, I agree we should:

  1. typedef UTF8String CSSOMString (or whatever syntax we decide on for the UTF8 case, e.g. [UTF8] USVString)
  2. File issues on specs like https://drafts.fxtf.org/geometry/#dommatrixreadonly that should really be using CSSOMString as defined at https://drafts.csswg.org/cssom/#cssomstring-type but are not.
  3. Use CSSOMString in our own IDL for cases where it would make us match the CSSOM specs.
Flags: needinfo?(bzbarsky)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: