Open Bug 1355101 Opened 7 years ago Updated 2 years ago

stylo: Add CSSOMString in WebIDL, use it in CSSOM, make it USVString for Stylo

Categories

(Core :: CSS Parsing and Computation, enhancement, P4)

enhancement

Tracking

()

Tracking Status
firefox55 --- affected
firefox57 --- wontfix

People

(Reporter: SimonSapin, Unassigned)

References

Details

Responding to https://bugzilla.mozilla.org/show_bug.cgi?id=1354566#c4

> That said, does rust-cssparser actually expect UTF-8, or WTF-8?
> That is, what happens for surrogate code points in the input?
> Should bindings be outputting WTF-8 given unpaired surrogates
> in the WTF-16 input, or converting unpaired surrogate
> to replacement char, or something else?

cssparser itself is agnostic to surrogates, or indeed anything non-ASCII: they come out unchanged as part of an identifier-like or string token. However the Rust &str type is defined as UTF-8, without surrogates. Unsafe code can make a &str with surrogates in it and my guess is that nothing special would happen, but that is technically undefined behavior: other unsafe code is allowed to assume that this does not happen.

Currently we use NS_ConvertUTF16toUTF8 which replaces unpaired surrogates with U+FFFD as it should. This yields a different CSSOM behavior from (as far as I know) every other browser engine: they use WTF-16 internally and thus preserve unpaired surrogates "by default". (When they occurs in places where CSS syntax allows arbitrary input, for example the 'font-family' or 'content' properties.)

For WTF-16 engines, replacing unpaired surrogate with U+FFFD would require actively looking for them with some additional code and runtime cost. For this reason CSSWG rejected my proposal to have the CSSOM spec mandate this, last time I asked: https://lists.w3.org/Archives/Public/www-style/2014Jun/0060.html

I like to think that replacing with U+FFFD is probably OK, even if not every engine does it. I argue that unpaired surrogates are typically an error case that occurs accidentally. So, while I’m filing this bug to keep a record of this difference with non-Stylo and have a place to discuss it, I think we should not change this behavior unless we find it causes significant compatibility problem with real content. (Acid 3 doesn’t count.) Only in that case, we’ll consider extending rust-cssparser to support WTF-8 or WTF-16. (This implies changing every &str or String Rust type that can contain CSS syntax to some other types.)
I don't think we should be shipping something that violates the spec and deviates from currently interoperable browser behavior without _very_ strong reasons.  And I'm not convinced that choice of string type for our parser constitutes such a reason.  Especially when I assume we knew this was a design constraint when we made that choice....
That said, reading the linked discussion it's not clear to me that it actually considered the CSSOM case we're talking about here, since they're talking about "aren't using the CSS parser" which is very much NOT the case here.
I think we shouldn't support the preservation of unpaired surrogates. Although some accidental artifacts of the Web Platform become required features, I'm not aware of any evidence that the preservation of unpaired surrogates is needed for Web compat. Absent overwhelming evidence, preservation of unpaired surrogates in UTF-8/byte-oriented code just leads to excessive complication and inability to use code designed for Rust &str semantics.

When I asked Hixie to make document.write() not to turn unpaired surrogates into the REPLACEMENT CHARACTER, it was because I didn't want to *add* processing when all browsers were UTF-16 internally. Now we have good reasons to use UTF-8 internally, so now what constitutes an added burden is flipped.

So unless there's overwhelming evidence that the preservation of unpaired surrogates in CSSOM is a huge Web compat issue, I think we shouldn't preserve them. The same goes for document.write() if we switch to html5ever in the future.

It's also worth noting that since 2014 USVString has become a thing in WebIDL. I'd be happy for us to push s/DOMString/USVString/ more broadly in APIs. (Even if in some case it makes sense to downgrade them to DOMStrings in our own copies of the WebIDL definitions because a non-WebIDL layer deals. My "replace uconv with encoding_rs" patch, among many other things, changes TextEncoder to take DOMString, because encoding_rs internally behaves as if DOMString to USVString conversion had been applied to UTF-16 input.)
> I assume we knew this was a design constraint when we made that choice

rust-cssparser is the very first bit of Rust code that I wrote beyond Hello World, before I joined Mozilla in 2013. I believe I didn’t know about surrogates at the time.

I’ve known of this specific issue since at least 2014 (when I brought it up in CSSWG), but Servo had bigger web-compat fish to fry.

> they're talking about "aren't using the CSS parser" which is very much NOT the case here

I assume that either Tab misspoke or the minutes are inaccurate, and that he meant that CSSOM parameters are not going through the part of the CSS Syntax spec (which he wrote) that says how a byte stream is decoded into an Unicode stream: https://drafts.csswg.org/css-syntax/#determine-the-fallback-encoding

Later bits of minutes also mention "an unpaired from JS" .
> unless there's overwhelming evidence that the preservation of unpaired surrogates in CSSOM
> is a huge Web compat issue

I think given the risks we already have with stylo, this is backwards.  We should not be making behavior changes unless we have overwhelming evidence that they are NOT a compat issue.  We're going to have enough compat issues organically (due to bugs) without going looking for them...
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #5)
> > unless there's overwhelming evidence that the preservation of unpaired surrogates in CSSOM
> > is a huge Web compat issue
> 
> I think given the risks we already have with stylo, this is backwards.  We
> should not be making behavior changes unless we have overwhelming evidence
> that they are NOT a compat issue.  We're going to have enough compat issues
> organically (due to bugs) without going looking for them...

If that's the way burden of proof goes, we should get telemetry in place ASAP looking for unpaired surrogates in the old CSSOM implementation.

(I still think we should go for no-preservation unless there's overwhelming evidence to the contrary. Not being able to use code written for &str because of doing WTF-8 just in case would be bad and sad. Having to get telemetry for what a priori look like (very) fringe issues slows things down. I'm worried about over-prudence getting in the way of getting stuff landed and moving on. For example, I had an isindex patch ready to go, but because of the requirement to get telemetry evidence despite the other three engines already having removed isindex, the patch is rotting, since I got busy with other stuff and haven't gotten around to adding the telemetry even though it would be trivial to add. So there's both the issue of telemetry even when executed perfectly adding a delay and the requirement to add telemetry being enough of a speed bump that things get stalled in context switches.)
> We're going to have enough compat issues organically (due to bugs) without going looking for them...

You make it sound like we’re actively adding risk here, but &str in cssparser and servo/components/style is the status-quo and has been since before Stylo started. If this were an existing switch to flip and all other things were equal I’d agree we should default to the conservative approach, but an addition to being a sad outcome implementing support for WTF-8 or WTF-16 would take engineering time that could be spent on something else.
> You make it sound like we’re actively adding risk here

No, we're just adding risk compared to the other option.

I agree that this needs to be weighed against the cost of the engineering time needed to eliminate this risk.  I really wish we hadn't painted ourselves into this corner.  :(
My gut feeling is that we're probably not going to do anything about this until proven otherwise. We should reevaluate before shipping, marking p3.
Assignee: nobody → simon.sapin
Priority: -- → P3
Prior discussion on Servo’s issue tracker: https://github.com/servo/servo/issues/6564


I’ve filed https://github.com/w3c/csswg-drafts/issues/1217 to propose changing CSS specifications to align with Stylo, and will discuss it at CSSWG’s face-to-face meeting next week.
https://github.com/w3c/csswg-drafts/issues/1217 is resolved with https://github.com/w3c/csswg-drafts/pull/1266: the spec introduces a `CSSOMString` in WebIDL, a typedef for either `DOMString` or `USVString`, chosen by implementations. Choosing `USVString` effectively means replacing surrogates with U+FFFD.  This type is used for almost everything in CSSOM.

Re-purposing this bug to move or IDL definitions to align with the new spec.
Summary: stylo: CSSOM replaces unpaired surrogates with U+FFFD (which might be OK?) → stylo: Add CSSOMString in WebIDL, use it in CSSOM, make it USVString for Stylo
Blocks: 1355106
Assignee: simon.sapin → nobody
Priority: P3 → --
Boris, do we believe this will give us a perf win on CSSOM? Just trying to figure out the priority.
Flags: needinfo?(bzbarsky)
> Boris, do we believe this will give us a perf win on CSSOM?

It depends.  It will save us some copying with stylo in some cases.  But it will slow down things when the Gecko style system is in use, by requiring more copying there.

The conversion from ISO-8859-1 to UTF-8 will still be there in the common case.  The requirement to validate for unpaired surrogates will still be there, though maybe we can avoid it in the ISO-8859-1 case.

I really doubt we should block 57 for this, especially because it will be a pure performance regression for our chrome, for Fennec, etc...
Flags: needinfo?(bzbarsky)
Also, it's not clear to me what the scope of this bug is.  Is it to implement CSSOMString and then make it do direct to-UTF8 conversion?  Or is it really to implement it and make it an alias for USVString?  The latter won't be a performance win for stylo either, compared to what we have now.
Ok.
Priority: -- → P4
I'm developing SIMD-ASCII-optimized Latin1 to UTF-8 and UTF-8 to Latin1 conversions as part of bug 1402247.
Depends on: 1402247
status-firefox57=wontfix unless someone thinks this bug should block 57
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.