Closed Bug 1434429 Opened 7 years ago Closed 7 years ago

More rounds of TokenStream adjustments for UTF-8 tokenizing/parsing

Categories

(Core :: JavaScript Engine, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox60 --- fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(7 files)

(deleted), patch
till
: review+
Details | Diff | Splinter Review
(deleted), patch
arai
: review+
Details | Diff | Splinter Review
(deleted), patch
arai
: review+
Details | Diff | Splinter Review
(deleted), patch
arai
: review+
Details | Diff | Splinter Review
(deleted), patch
arai
: review+
Details | Diff | Splinter Review
(deleted), patch
arai
: review+
Details | Diff | Splinter Review
(deleted), patch
arai
: review+
Details | Diff | Splinter Review
No description provided.
I find myself wanting TokenStream::error to actually use the current immediate offset soon, and changing how these places work is necessary to have that. Plus, these error messages are mildly improvable.
Attachment #8946815 - Flags: review?(till)
The inheritance structure of TokenStream is coming increasingly to roughly mirror that of Parser. Unsurprisingly, given the two need to be merged pretty much (but not now).
Attachment #8946817 - Flags: review?(arai.unmht)
There are no tests for this because there are no behavior changes. It *happens* that right now, badchar only happens for something that's wrong at the *start* of a token, so the ungets here just revert to exactly the same position. But, this is clearer, uses the instantaneous offset (as one would expect), and it provides a new primitive that will be useful as other places come to use better report locations as they shift away from reportError.
Attachment #8946824 - Flags: review?(arai.unmht)
Ostensible UTF-16 source text in JS *can* have mispaired surrogates and other junk in it, per ECMAScript rules. But ECMAScript doesn't directly process UTF-8, only UTF-16 -- so the implicit process of conversion from malformed UTF-8 to this UTF-16-alike input should produce an error earlier in the compilation chain. So, when tokenizing UTF-8 and we get something malformed, technically we're in a place that ECMAScript says nothing about. So *not* allowing malformed code point sequences and goo is the most desirable thing to do. That requires changing the existing multi-unit code point consumption signature to allow errors, even if no errors are reported in the case of processing UTF-16. Note that bz says that Gecko will only hand the JS engine properly encoded data (because it does that validation step internally, as it *should*). So none of this should be web-observable, once the browser starts invoking UTF-8-specific parsing entrypoints.
Attachment #8946830 - Flags: review?(arai.unmht)
It doesn't seem impossible to me that in the long run, we can write fully code-unit-length-generic code and never have to think about this goo anywhere in tokenizing. But we'll see.
Attachment #8946832 - Flags: review?(arai.unmht)
Attachment #8946817 - Flags: review?(arai.unmht) → review+
Attachment #8946821 - Flags: review?(arai.unmht) → review+
Comment on attachment 8946825 [details] [diff] [review] Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf Review of attachment 8946825 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/frontend/TokenStream.cpp @@ +1325,5 @@ > > + auto restoreNextRawCharAddress = > + MakeScopeExit([this, originalAddress]() { > + this->userbuf.setAddressOfNextRawChar(originalAddress); > + }); oh, this class is nice :) I think we have some other places that this can simplify.
Attachment #8946825 - Flags: review?(arai.unmht) → review+
Attachment #8946824 - Flags: review?(arai.unmht) → review+
Attachment #8946830 - Flags: review?(arai.unmht) → review+
Comment on attachment 8946832 [details] [diff] [review] Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness Review of attachment 8946832 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/frontend/TokenStream.cpp @@ +1304,5 @@ > + > + if (!tokenbuf.append(units[0])) > + return false; > + > + return numUnits == 1 || tokenbuf.append(units[1]); I prefer separating `numUnits == 1` and `tokenbuf.append(units[1])` into 2 statements. since "true" has different meaning for them.
Attachment #8946832 - Flags: review?(arai.unmht) → review+
Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/024f70aeba70 Use MakeScopeExit to handle resetting userbuf offset in TokenStreamSpecific::putIdentInTokenbuf. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/112eaf00632c Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/45c9102825ab Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/21c79fd76133 Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it). r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/95e07a79f4b2 Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point. r=arai
Extricated several of these for landing, still a couple to go after the last review.
Keywords: leave-open
Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/b190c5a3dab2 Followup bustage fix (?) for gcc (and maybe other?) compiler bustage. Worked in recent clang... r=boogstage in a CLOSED TREE
Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/1e0faca62e86 Comment out an assertion whose condition is disliked by certain versions of gcc for entirely mysterious reasons, to be debugged later. r=bustage in a still-CLOSED TREE
Backout by csabou@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/63fe40a76a25 Backed out 7 changesets for build bustages on regress-618572.js and TokenStream.h on a CLOSED TREE
Comment on attachment 8946815 [details] [diff] [review] Give better errors for unterminated/EOF'd string/template literals Review of attachment 8946815 [details] [diff] [review]: ----------------------------------------------------------------- r=me
Attachment #8946815 - Flags: review?(till) → review+
Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/049cc7799b53 Use MakeScopeExit to reset userbuf offset in TokenStreamSpecific::putIdentInTokenbuf. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/a9cb684274da Move TokenStreamSpecific::ungetChar into a new GeneralTokenStreamChars<CharT, AnyCharsAccess> inserted between TokenStreamCharsBase<CharT> and TokenStreamChars<CharT, AnyCharsAccess> in the token stream inheritance hierarchy. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/d353d4a61cb0 Move TokenStreamSpecific::ungetCharIgnoreEOL into TokenStreamCharsBase. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/1bf0b12b992f Implement a TokenStreamChars::ungetCodePointIgnoreEOL and use it to report errors at precise locations, rather than blindly at the beginning of the token (which happens to be the same thing, just not nearly as clear about it). r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/229b3ea6530f Implement TokenStreamChars::matchMultiUnitCodePoint as a better nailing-down of behavior when processing a multi-code unit code point. r=arai
(In reply to Tooru Fujisawa [:arai] from comment #9) > I prefer separating `numUnits == 1` and `tokenbuf.append(units[1])` into 2 > statements. > since "true" has different meaning for them. Yeah, I dithered on that before posting. Will change when landing.
Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/4b03e86fb95a Use the current offset, not the offset of the start of the current token, when reporting errors for unterminated string/template literals. r=till https://hg.mozilla.org/integration/mozilla-inbound/rev/dcde99a8fa6c Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness. r=arai
Blocks: 1433909
Priority: -- → P2
Depends on: 1436150
Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/11ead800137e Use the current offset, not the offset of the start of the current token, when reporting errors for unterminated string/template literals. r=till https://hg.mozilla.org/integration/mozilla-inbound/rev/e4213d1706e9 Change TokenStreamCharsBase::appendMultiUnitCodepointToTokenbuf to TokenStreamCharsBase::appendCodePointToTokenbuf so that the idea doesn't require knowledge of multi-unitness. r=arai
Did all patches for this issue land? (I'm waiting for this bug to be closed so I can proceed with bug 1433909).
Flags: needinfo?(jwalden+bmo)
Oh, whoops. Yes, it all landed.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(jwalden+bmo)
Keywords: leave-open
Resolution: --- → FIXED
(In reply to Jeff Walden [:Waldo] from comment #24) > Oh, whoops. Yes, it all landed. Thanks! :-)
Target Milestone: --- → mozilla60
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: