Closed Bug 1467336 Opened 6 years ago Closed 6 years ago

Move away from having tokenizer code use "char" in function names to using "code unit" or "code point" as appropriate

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla63

Tracking Flags:

Tracking

Status

firefox63

---

fixed

People

(Reporter: Waldo, Assigned: Waldo)

References

Details

Attachments

(24 files, 1 obsolete file)

Rename getCharIgnoreEOL to getCodeUnit to better indicate that's all it does. 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Rename ungetCharIgnoreEOL to ungetCodeUnit 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Add TokenStreamChars::getCodePoint and use it in one trivial place 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Rename TokenStreamSpecific::matchChar to TokenStreamCharsBase::matchCodeUnit to clarify its unit-matching nature 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Implement mozilla::IsAscii to detect pure ASCII characters 6 years ago Jeff Walden [:Waldo] (deleted), patch	froydnj : review+	Details \| Diff \| Splinter Review
Introduce code to bifurcate handling of arbitrary-code-point getting in the tokenizer to distinctly and separately handle ASCII and non-ASCII code points 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Process the code units comprising regular expression literals in a manner that optimizes the pure-ASCII case 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Move regular expression literal tokenizing into its own function 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Rename the \|c\| variables in TokenStreamSpecific::getStringOrTemplateToken to \|unit\| 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Make much of TokenStreamSpecific::getStringOrTemplateToken process by code unit, not by code-unit-generally-assumed-to-be-code-point 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Rename TokenStreamSpecific::peekChars to SourceUnits::peekCodeUnits and optimize its code in now-obvious ways 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Invert SourceUnits::hasRawChars's meaning and rename it to SourceUnits::atEnd 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Change SourceUnits::atStart to indicate being at the start of the source units of code, not at the start of the containing source 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Move skipChars to SourceUnits::skipCodeUnits, and make it efficient in the obvious way 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Convert TokenStreamSpecific::getDirective act on code units, and remove TokenStreamSpecific::peekChar 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Make TokenStreamSpecific::decimalNumber work by consuming code units 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Move Unicode escape processing from TokenStreamSpecific to GeneralTokenStreamChars 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Refactor Unicode-escape identifier matching to work in terms of code units 6 years ago Jeff Walden [:Waldo] (deleted), patch		Details \| Diff \| Splinter Review
Make TokenStreamSpecific::identifierName consume in terms of code units 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Make TokenStreamSpecific::putIdentInTokenbuf read identifier characters a code unit at a time 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Make TokenStreamSpecific::getTokenInternal itself deal only in code units 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Move a couple small functions to be defined inline in TokenStream class bodies, not out-of-line in TokenStream.cpp, for readability 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Refactor Unicode-escape identifier matching to work in terms of code units 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Remove GeneralTokenStreamChars::{,un}getChar now that they're unused 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Change a bunch of 'character' nomenclature in token stream code to 'code unit', completing the transition from tokenizing by UTF-16 'character' to tokenizinng by UTF-8/16 code unit 6 years ago Jeff Walden [:Waldo] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review

Jeff Walden [:Waldo]

Assignee

Description

•

6 years ago

This has sort of been ongoing for awhile as a broader goal, but lots of function names still use "char" in them with inconsistent meanings applied. Ultimately it would be nice to use, instead of int32_t to represent a code-unit-or-EOF, some sort of class abstracting that better. But I don't want to try touching those aspects of APIs until there's better unit/point handling in place.

Jeff Walden [:Waldo]

Assignee

Comment 1

•

6 years ago

Attached patch Rename getCharIgnoreEOL to getCodeUnit to better indicate that's all it does. (deleted) — Details — Splinter Review

Nicely, the "unit" nomenclature fairly clearly conveys that "ignore EOL" is part of the idea of it, IMO. And when a series of patches I'm doing now is done, we'll start with getCodeUnit, proceed to test isAscii, then handle ASCII in one place and non-ASCII (including multi-unit) in the other, with relatively clean separation of the two.

Attachment #8984001 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 2

•

6 years ago

Attached patch Rename ungetCharIgnoreEOL to ungetCodeUnit (deleted) — Details — Splinter Review

Attachment #8984002 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 3

•

6 years ago

Attached patch Add TokenStreamChars::getCodePoint and use it in one trivial place (deleted) — Details — Splinter Review

The getChar function is used a bunch of places for a bunch of things, and many places IMO should be rewritten to somewhat different manners that first try a getCodeUnit approach, then only proceed to handle non-ASCII in a slower path. But this particular case *does* want precisely the semantics of getChar. So, rename the function, then add the bad-old-name back as effectively an alias.

Attachment #8984006 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 4

•

6 years ago

Attached patch Rename TokenStreamSpecific::matchChar to TokenStreamCharsBase::matchCodeUnit to clarify its unit-matching nature (deleted) — Details — Splinter Review

Attachment #8984007 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 5

•

6 years ago

Attached patch Implement mozilla::IsAscii to detect pure ASCII characters (deleted) — Details — Splinter Review

< 0x80 or <= 0x7F or something is pretty tiresome and repetitive and harder to read than a named function.

Attachment #8984008 - Flags: review?(nfroyd)

Jeff Walden [:Waldo]

Assignee

Comment 6

•

6 years ago

Attached patch Introduce code to bifurcate handling of arbitrary-code-point getting in the tokenizer to distinctly and separately handle ASCII and non-ASCII code points (deleted) — Details — Splinter Review

This replaces one use of getChar with replacements that bifurcate unit-treatment into ASCII and non-ASCII cases. It's a simpler case with less cases to consider, so I thought it made a plausible place to simply demonstrate how the new approach works. Previously you would do int32_t c; if (!getChar(&c)) // includes LineTerminator normalizing, in UTF-8 will require multi-unit handling return false; if (c == EOF) { ...handle usually as bad... } ...cases for ASCII special characters, often special handling for LineTerminator... ...handle non-special stuff... So, here's how things now will generally look, if you're getting a code point in some place where particular ASCII code points have special semantics: int32_t unit = getCodeUnit(); // EOF, or the code unit -- no LineTerminator norming if (unit == EOF) { ...handle usually as bad... } if (isAsciiCodePoint(unit)) { ...special handling of units that are special... // Otherwise we don't have a special unit, so we have to handle the possibility // of LineTerminator sequences -- or not, if the special handling above means no // LineTerminator encountered.. int32_t codePoint; if (!getFullAsciiCodePoint(unit, &codePoint)) // requires a not-'\r'/'\n' test return false; ...do whatever to handle an ASCII code point, including '\n'... } else { int32_t codePoint; if (!getNonAsciiCodePoint(unit, &codePoint)) return false; ...do whatever to handle a possibly non-ASCII code point... } The isAsciiCodePoint function is a renamed wrapper around MOZ_LIKELY(IsAscii(unit)). Hint should pay off for ASCII sorts of text, including JS libraries in common use, and stuff outside that can hit the slow case. Special characters can either be specifically picked off inside the isAscii path, or they can be picked off in advance of it. There are perhaps arguable tradeoffs to each approach. There is a somewhat alternative approach that could be taken, that privileges *all* one-unit code points (even in UTF-16). I originally started writing it up this way -- instead of an IsAscii(unit) test, you'd just do !unicode::IsLeadSurrogate -- but I ran into two mini-issues. First, good function names for the various APIs are harder to find, IMO. You could do isSingleUnit/getFullSingleUnitCodePoint/getMultiUnitCodePoint maybe. It's a bit of a mouthful. ASCII is pretty clean to name. Second, and I would guess maybe more important, formulation using ASCII means that you only have two line break characters to consider in the fast path, always. If you instead do single-unit, then all the UTF-16 variants have to handle line/paragraph separators in all cases. That's fairly stupid given these code points are rare, and it's slower than just excluding non-ASCII. But, this decision could be revisited in the future if we wanted, without horrid amounts of extra effort.

Attachment #8984017 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 7

•

6 years ago

Attached patch Process the code units comprising regular expression literals in a manner that optimizes the pure-ASCII case (deleted) — Details — Splinter Review

Attachment #8984018 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 8

•

6 years ago

Attached patch Move regular expression literal tokenizing into its own function (deleted) — Details — Splinter Review

Possibly should have happened before the previous patch, but doesn't really make a difference the order.

Attachment #8984019 - Flags: review?(arai.unmht)

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984001 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984002 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984006 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984007 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 9

•

6 years ago

Comment on attachment 8984017 [details] [diff] [review] Introduce code to bifurcate handling of arbitrary-code-point getting in the tokenizer to distinctly and separately handle ASCII and non-ASCII code points Review of attachment 8984017 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/frontend/TokenStream.cpp @@ +538,5 @@ > +{ > +#ifdef DEBUG > + { > + MOZ_ASSERT(!IsAscii(lead), > + "ASCII code unit/point must be handled separately"); s/must be/must be/ @@ +542,5 @@ > + "ASCII code unit/point must be handled separately"); > + > + sourceUnits.ungetCodeUnit(); > + > + char16_t current = sourceUnits.getCodeUnit(); can we add dedicated DEBUG-only SourceUnits method which returns the "current" code unit, which clearly explains it's actually no-op for sourceUnits itself in certain situation? (or maybe add some comment here) calling those methods only in DEBUG looks like it behaves differently between DEBUG and non-DEBUG, even if it actually doesn't. ::: js/src/frontend/TokenStream.h @@ +1071,5 @@ > + /** > + * Determine whether a code unit is ASCII (and thus can be handled quickly > + * when tokenizing either UTF-8 or UTF-16 source code). > + */ > + static MOZ_MUST_USE MOZ_ALWAYS_INLINE bool isAsciiCodePoint(CharT unit) { function name, parameter, and comment have both "code point" and "code unit", which is a bit confusing what to pass to this function. maybe be better saying that this checks if the given code unit is the leading code unit of a sequence of code units which consists an ascii code point, in comment (and maybe function name) or perhaps I'm misunderstanding what "code point" means in term of CR-LF sequence normalization, inside Tokenizer? to my understanding from those comments around newly added functions, CR-LF is a special single code point here, isn't it? @@ +1072,5 @@ > + * Determine whether a code unit is ASCII (and thus can be handled quickly > + * when tokenizing either UTF-8 or UTF-16 source code). > + */ > + static MOZ_MUST_USE MOZ_ALWAYS_INLINE bool isAsciiCodePoint(CharT unit) { > + return MOZ_LIKELY(mozilla::IsAscii(unit)); this might need some comment why MOZ_LIKELY is here instead of callsite, while this function name itself doesn't explain the probability for true/false. of course, it's clear that it's likely, if this is used while common regular JS files, tho @@ +1298,5 @@ > + * If a LineTerminator sequence was consumed, also update line/column > + * information. > + */ > + MOZ_MUST_USE bool getFullAsciiCodePoint(char16_t lead, int32_t* codePoint) { > + MOZ_ASSERT(mozilla::IsAscii(lead)); this assertion should use isAsciiCodePoint, given that the consumer of this function should use it. (which contains MOZ_LIKELY, instead of only bare isAscii) same for the assertion in getNonAsciiCodePoint @@ +1303,5 @@ > + > + if (MOZ_UNLIKELY(lead == '\r')) { > + if (MOZ_LIKELY(sourceUnits.hasRawChars())) > + sourceUnits.matchCodeUnit('\n'); > + } else if (MOZ_LIKELY(lead != '\n')) { can we check not-LF case first? so that the flow is clearer. if there's any reason not to do so (maybe optimization reason?) it would be nice to mention it here. @@ +1318,5 @@ > + * leading code unit. Normalizes LineTerminators to '\n'. Store the > + * normalized code point in |*cp| and return true on success, otherwise > + * return false and leave |*cp| undefined on failure. > + * > + * This may have the effect of changing the current |sourceUnits| offset. this comment also applies to getFullAsciiCodePoint, right? (when consuming LF after CR)

Attachment #8984017 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984018 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984019 - Flags: review?(arai.unmht) → review+

Nathan Froyd [:froydnj]

Comment 10

•

6 years ago

Comment on attachment 8984008 [details] [diff] [review] Implement mozilla::IsAscii to detect pure ASCII characters Review of attachment 8984008 [details] [diff] [review]: ----------------------------------------------------------------- ::: mfbt/tests/TestTextUtils.cpp @@ +17,5 @@ > > static void > +TestIsAscii() > +{ > + // char Do you think it's worth tossing in `signed char` tests, just to exercise things more on systems where `char` is unsigned by default? (Do any of our systems actually care about this? I think PPC is the opposite of x86 on this front, but I can't remember which is which.)

Attachment #8984008 - Flags: review?(nfroyd) → review+

Jeff Walden [:Waldo]

Assignee

Comment 11

•

6 years ago

Web searches suggest Android is the oddball with char having the same representation as unsigned char, everything else having char being signed char. I certainly wouldn't object to splitting the existing char tests for everything into explicitly signed and unsigned char tests, but at least for the simplicity of these things now, it seems like a bunch of work for not much gain (and given Android counters the other platforms, a gain that only appears if *one* platform is somehow internally buggy, seems like). Doesn't seem valuable enough to just *do*, so I'm leaving this alone.

Jeff Walden [:Waldo]

Assignee

Comment 12

•

6 years ago

(In reply to Tooru Fujisawa [:arai] from comment #9) > can we add dedicated DEBUG-only SourceUnits method which returns the > "current" code unit, which clearly explains it's actually no-op for > sourceUnits itself in certain situation? Good idea -- added a previousCodeUnit function. > function name, parameter, and comment have both "code point" and "code > unit", which is a bit confusing what to pass to this function. The idea is that we're testing whether this code *unit*, which in the general case is but a portion of a code point, also constitutes an entire ASCII code point: anything [0, 128). Both '\r' and '\n' are code units that are ASCII code points. This function doesn't care about LineTerminatorSequence: it acts purely at the encoding level. > maybe be better saying that this checks if the given code unit is the > leading code unit of a sequence of code units which consists an ascii code > point, in comment (and maybe function name) There isn't really any "sequence" meaning to this at all. Sequence-handling is a secondary step after this function's return value is examined. > or perhaps I'm misunderstanding what "code point" means in term of CR-LF > sequence normalization, inside Tokenizer? > to my understanding from those comments around newly added functions, CR-LF > is a special single code point here, isn't it? As a general rule in JS, LineTerminatorSequence is any of (examined greedily) '\r', '\n', "\r\n" (i.e. '\r' demands it not be followed by '\n'), '\u2028', and '\u2029'. Outside of string and template literals, these are all normalized to the single ASCII code point '\n'. (Inside string and template literals, '\u2028' and '\u2029' are used literally.) getFullAsciiCodePoint, because it knows |lead| is ASCII, for anything but '\r' and '\n', returns it exactly. '\n' is also returned exactly, after line-info is updated. For '\r', a subsequent '\n' is also consumed -- then either "\r\n" or '\r' is normalized to '\n'. In the end, I went with /** * Determine whether a code unit constitutes a complete ASCII code point. * (The code point's exact identity might not end up being used, however, * if the code point is part of a LineTerminator sequence.) */ as the comment. (Tho looking at this, I probably should change that to LineTerminatorSequence.) > this might need some comment why MOZ_LIKELY is here instead of callsite, Mm, fair point -- MOZ_LIKELY and such should go at call sites. Moved that out of here and into them. > this assertion should use isAsciiCodePoint, given that the consumer of this > function should use it. > > same for the assertion in getNonAsciiCodePoint Good point, changed. > @@ +1303,5 @@ > > + > > + if (MOZ_UNLIKELY(lead == '\r')) { > > + if (MOZ_LIKELY(sourceUnits.hasRawChars())) > > + sourceUnits.matchCodeUnit('\n'); > > + } else if (MOZ_LIKELY(lead != '\n')) { > > can we check not-LF case first? so that the flow is clearer. > if there's any reason not to do so (maybe optimization reason?) it would be > nice to mention it here. Don't think so? The not-'\n' case has to treat '\r' as a line break, so you have to pile up two conditions before you can just return the code unit directly. This maps straightforwardly onto the ASCII subset of LineTerminatorSequence, so I'm not sure a comment does much, and I haven't added one. > this comment also applies to getFullAsciiCodePoint, right? (when consuming > LF after CR) Yeah -- changed both comments to be more similar in the final version.

Tooru Fujisawa [:arai]

Comment 13

•

6 years ago

(In reply to Jeff Walden [:Waldo] from comment #12) > In the end, I went with > > /** > * Determine whether a code unit constitutes a complete ASCII code point. > * (The code point's exact identity might not end up being used, however, > * if the code point is part of a LineTerminator sequence.) > */ > > as the comment. (Tho looking at this, I probably should change that to > LineTerminatorSequence.) Thank you for explanation and the comment :) > > @@ +1303,5 @@ > > > + > > > + if (MOZ_UNLIKELY(lead == '\r')) { > > > + if (MOZ_LIKELY(sourceUnits.hasRawChars())) > > > + sourceUnits.matchCodeUnit('\n'); > > > + } else if (MOZ_LIKELY(lead != '\n')) { > > > > can we check not-LF case first? so that the flow is clearer. > > if there's any reason not to do so (maybe optimization reason?) it would be > > nice to mention it here. > > Don't think so? The not-'\n' case has to treat '\r' as a line break, so you > have to pile up two conditions before you can just return the code unit > directly. This maps straightforwardly onto the ASCII subset of > LineTerminatorSequence, so I'm not sure a comment does much, and I haven't > added one. okay, makes sense. the current one should be better.

Pulsebot

Comment 14

•

6 years ago

Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/b6821b80af87 Rename getCharIgnoreEOL to getCodeUnit to better indicate that's all it does. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/e90575cf96ce Rename ungetCharIgnoreEOL to ungetCodeUnit. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/7d80671bf849 Add TokenStreamChars::getCodePoint and use it in one trivial place. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/b2b3d152d7ea Rename TokenStreamSpecific::matchChar to TokenStreamCharsBase::matchCodeUnit to clarify its unit-matching nature. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/2dac13bda723 Implement mozilla::IsAscii to detect pure ASCII characters. r=froydnj https://hg.mozilla.org/integration/mozilla-inbound/rev/564254cf34aa Introduce code to bifurcate handling of arbitrary-code-point getting in the tokenizer to distinctly and separately handle ASCII and non-ASCII code points. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/5a295d6c1688 Process the code units comprising regular expression literals in a manner that optimizes the pure-ASCII case. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/afcc2d65b6fb Move regular expression literal tokenizing into its own function. r=arai

Jeff Walden [:Waldo]

Assignee

Comment 15

•

6 years ago

More patches coming along these lines -- I am somewhat likely to run this bug to ground with many patches, not spin off a new one each time I change something.

Keywords: leave-open

Arthur Iakab [arthur_iakab]

Comment 16

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/b6821b80af87 https://hg.mozilla.org/mozilla-central/rev/e90575cf96ce https://hg.mozilla.org/mozilla-central/rev/7d80671bf849 https://hg.mozilla.org/mozilla-central/rev/b2b3d152d7ea https://hg.mozilla.org/mozilla-central/rev/2dac13bda723 https://hg.mozilla.org/mozilla-central/rev/564254cf34aa https://hg.mozilla.org/mozilla-central/rev/5a295d6c1688 https://hg.mozilla.org/mozilla-central/rev/afcc2d65b6fb

Jeff Walden [:Waldo]

Assignee

Comment 17

•

6 years ago

Attached patch Rename the |c| variables in TokenStreamSpecific::getStringOrTemplateToken to |unit| (deleted) — Details — Splinter Review

They aren't all used as code units yet, but an alpha-rename hopefully simplifies future changes to make that so.

Attachment #8984543 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 18

•

6 years ago

Attached patch Make much of TokenStreamSpecific::getStringOrTemplateToken process by code unit, not by code-unit-generally-assumed-to-be-code-point (deleted) — Details — Splinter Review

The one place that doesn't process in this manner, should be suitably flagged in the changes. :-) I will confess, however, that the escape-handling goo I am largely relying on tests to verify correctness, combined with trying to just not disturb it too much. (And I seriously question whether strings and template tokens should be parsed by the same function, given what a mess shoehorning in template-parsing, or alternatively having to have all these silly conditions guarding everything, makes to both algorithms.)

Attachment #8984544 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 19

•

6 years ago

Attached patch Rename TokenStreamSpecific::peekChars to SourceUnits::peekCodeUnits and optimize its code in now-obvious ways (deleted) — Details — Splinter Review

It's almost unbelievably silly that the current algorithm wasn't written to work the way this new algorithm clearly is.

Attachment #8984545 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 20

•

6 years ago

Attached patch Invert SourceUnits::hasRawChars's meaning and rename it to SourceUnits::atEnd (deleted) — Details — Splinter Review

I paired the name up with atStart -- except that atStart doesn't actually seem to indicate what it seems like it does. Notably, since this revision it appears atStart has stopped meaning pointing-at-the-original-character-of-source: https://searchfox.org/mozilla-central/diff/29234315debd45187859a4c7ea8dc8630db0d9a3/js/src/frontend/TokenStream.h#681 Every user of atStart really wants the old meaning (borne out by a try-push), so I'll fix it in a followup patch.

Attachment #8984547 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 21

•

6 years ago

Attached patch Change SourceUnits::atStart to indicate being at the start of the source units of code, not at the start of the *containing* source (deleted) — Details — Splinter Review

Attachment #8984548 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 22

•

6 years ago

Attached patch Move skipChars to SourceUnits::skipCodeUnits, and make it efficient in the obvious way (deleted) — Details — Splinter Review

Attachment #8984549 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 23

•

6 years ago

Attached patch Convert TokenStreamSpecific::getDirective act on code units, and remove TokenStreamSpecific::peekChar (deleted) — Details — Splinter Review

Attachment #8984706 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 24

•

6 years ago

Attached patch Make TokenStreamSpecific::decimalNumber work by consuming code units (deleted) — Details — Splinter Review

Attachment #8984707 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 25

•

6 years ago

Attached patch Move Unicode escape processing from TokenStreamSpecific to GeneralTokenStreamChars (deleted) — Details — Splinter Review

Attachment #8984708 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 26

•

6 years ago

Attached patch Refactor Unicode-escape identifier matching to work in terms of code units (obsolete) (deleted) — Details — Splinter Review

This also eliminates ungets-then-skips that cancel out when a valid escape is matched (as it usually will be, 'cause people don't write invalid code).

Attachment #8984709 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 27

•

6 years ago

Attached patch Make TokenStreamSpecific::identifierName consume in terms of code units (deleted) — Details — Splinter Review

Attachment #8984710 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 28

•

6 years ago

Attached patch Make TokenStreamSpecific::putIdentInTokenbuf read identifier characters a code unit at a time (deleted) — Details — Splinter Review

I note in passing that this seems to use IsIdentifierPart even for the first code point. I guess/hope IsIdentifierStart must be a proper subset, because otherwise this seems buggy.

Attachment #8984711 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 29

•

6 years ago

Attached patch Make TokenStreamSpecific::getTokenInternal itself deal only in code units (deleted) — Details — Splinter Review

Attachment #8984712 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 30

•

6 years ago

Attached patch Move a couple small functions to be defined inline in TokenStream class bodies, not out-of-line in TokenStream.cpp, for readability (deleted) — Details — Splinter Review

Attachment #8984713 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 31

•

6 years ago

The remaining functions requiring modification along these lines is pretty small. Maybe even zero, at a quick skim. Will regroup and examine fully Monday.

Jeff Walden [:Waldo]

Assignee

Comment 32

•

6 years ago

Attached patch Refactor Unicode-escape identifier matching to work in terms of code units (deleted) — Details — Splinter Review

Attachment #8984719 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Updated

•

6 years ago

Attachment #8984709 - Attachment is obsolete: true

Attachment #8984709 - Flags: review?(arai.unmht)

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984543 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 33

•

6 years ago

Comment on attachment 8984544 [details] [diff] [review] Make much of TokenStreamSpecific::getStringOrTemplateToken process by code unit, not by code-unit-generally-assumed-to-be-code-point Review of attachment 8984544 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/frontend/TokenStream.cpp @@ +2332,5 @@ > + MOZ_ASSERT(!this->sourceUnits.hasRawChars() || > + this->sourceUnits.peekCodeUnit() == '\r' || > + this->sourceUnits.peekCodeUnit() == '\n', > + "must be parked at EOF or EOL to call this function"); > + const char delimiters[] = { untilChar, untilChar, '\0' }; pre-existing tho, can you add some comment about those 2-untilChar's, which explains they're used as beginning quotation mark and ending quotation mark in the error message? (it's a bit surprising/confusing that we're passing 2 untilChar's, given that now `errnum` is also variable) @@ +2395,5 @@ > + if (!getNonAsciiCodePoint(unit, &codePoint)) > + return false; > + > + // LineContinuation represents no code points. > + if (codePoint != '\n') { might be better explicitly saying they're LINE_SEPARATOR and PARA_SEPARATOR (of course LineContinuation implies them tho), so that it's clear that the '\n' matches to `!isAsciiCodePoint` condition above. @@ +2509,5 @@ > + if (!appendCodePointToTokenbuf(code)) > + return false; > + > + continue; > + } // (c2 == '{') // delimited Unicode escape this comment is confusing, in term of which part this is for, whether the above block (corresponds to namespace's one) or the following block (implicit else condition). IMO this is not necessary.

Attachment #8984544 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984545 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984547 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984548 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984549 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984706 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984707 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 34

•

6 years ago

Comment on attachment 8984708 [details] [diff] [review] Move Unicode escape processing from TokenStreamSpecific to GeneralTokenStreamChars Review of attachment 8984708 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/frontend/TokenStream.cpp @@ +1086,2 @@ > { > + MOZ_ASSERT(sourceUnits.previousCodeUnit() == '{'); nice that this can now be an assert :)

Attachment #8984708 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984719 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 35

•

6 years ago

Comment on attachment 8984710 [details] [diff] [review] Make TokenStreamSpecific::identifierName consume in terms of code units Review of attachment 8984710 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/frontend/TokenStream.cpp @@ +1480,5 @@ > + > + if (!unicode::IsIdentifierPart(uint32_t(codePoint))) { > + ungetCodePointIgnoreEOL(codePoint); > + if (codePoint == unicode::LINE_SEPARATOR || codePoint == unicode::PARA_SEPARATOR) > + anyCharsAccess().undoInternalUpdateLineInfoForEOL(); those 3 lines appear multiple times. can you add a method for them?

Attachment #8984710 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984711 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Comment 36

•

6 years ago

Comment on attachment 8984712 [details] [diff] [review] Make TokenStreamSpecific::getTokenInternal itself deal only in code units Review of attachment 8984712 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/frontend/TokenStream.cpp @@ +1858,5 @@ > > + if (MOZ_UNLIKELY(!isAsciiCodePoint(unit))) { > + // Non-ASCII code points can only be identifiers or whitespace. > + // It would be nice to compute these *after* discarding whitespace, > + // but ɪɴ ᴀ ᴡᴏʀʟᴅ where |unicode::IsSpaceOrBOM2| requires consuming hmm... it's not a good idea to use these code points in the code comment, for search-ability and accessibility (code search doesn't match for ASCII query "IN A WORLD", and screen-reader cannot pronounce them properly).

Attachment #8984712 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8984713 - Flags: review?(arai.unmht) → review+

Jeff Walden [:Waldo]

Assignee

Comment 37

•

6 years ago

(In reply to Tooru Fujisawa [:arai] from comment #33) > > + } // (c2 == '{') // delimited Unicode escape > > this comment is confusing, in term of which part this is for, whether the > above block (corresponds to namespace's one) or the following block > (implicit else condition). > > IMO this is not necessary. Mm. This isn't the biggest block of interminable code, but I found it very difficult as I was working on this to remember precisely what this block of code was handling, as I was scrolling around vertically within this code. Basically the same sort of issue motivating the "// defined(XP_WIN)"-style comments after #endifs and such. I changed the comment to } // end of delimited Unicode escape handling which hopefully is more readable as prose.

Jeff Walden [:Waldo]

Assignee

Comment 38

•

6 years ago

(In reply to Tooru Fujisawa [:arai] from comment #36) > > + // but ɪɴ ᴀ ᴡᴏʀʟᴅ where |unicode::IsSpaceOrBOM2| requires consuming > > hmm... it's not a good idea to use these code points in the code comment, > for search-ability and accessibility (code search doesn't match for ASCII > query "IN A WORLD", and screen-reader cannot pronounce them properly). Boooooooooooooooooooooooooooooooooo you're wrong but also right. :-( :-) (But that's totes a screen reader bug IMO.)

Pulsebot

Comment 39

•

6 years ago

Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/103a82585566 Rename the |c| variables in TokenStreamSpecific::getStringOrTemplateToken to |unit|. They aren't all used as code units yet, but an alpha-rename hopefully simplifies future changes to make that so. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/82729f89bc61 Make much of TokenStreamSpecific::getStringOrTemplateToken process by code unit, not by code-unit-generally-assumed-to-be-code-point. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/3d10297c4384 Rename TokenStreamSpecific::peekChars to SourceUnits::peekCodeUnits and optimize its code in now-obvious ways. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/03e60a811919 Invert SourceUnits::hasRawChars's meaning and rename it to SourceUnits::atEnd. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/673d93f7c030 Change SourceUnits::atStart to indicate being at the start of the source units of code, not at the start of the *containing* source. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/2b5d5815e7ef Move skipChars to SourceUnits::skipCodeUnits, and make it efficient in the obvious way. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/18c217828e1b Convert TokenStreamSpecific::getDirective act on code units, and remove TokenStreamSpecific::peekChar. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/b04275fa34e6 Make TokenStreamSpecific::decimalNumber work by consuming code units. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/96bab59f48e0 Move Unicode escape processing from TokenStreamSpecific to GeneralTokenStreamChars. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/96438f0ed68a Refactor Unicode-escape identifier matching to work in terms of code units. This also eliminates ungets-then-skips that cancel out when a valid escape is matched. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/57a1edba27d7 Make TokenStreamSpecific::identifierName consume in terms of code units. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/dba0ac1d0a0f Make TokenStreamSpecific::putIdentInTokenbuf read identifier characters a code unit at a time. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/36f3b5eb7294 Make TokenStreamSpecific::getTokenInternal itself deal only in code units. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/8949f1ff42ac Move a couple small functions to be defined inline in TokenStream class bodies, not out-of-line in TokenStream.cpp, for readability. r=arai

Jeff Walden [:Waldo]

Assignee

Comment 40

•

6 years ago

Not yet ready to conclude this work is all done, so continuing to leave open for now.

Raul Gurzau (:RaulG)

Comment 41

•

6 years ago

bugherder

Jeff Walden [:Waldo]

Assignee

Updated

•

6 years ago

Depends on: 1471465

Jeff Walden [:Waldo]

Assignee

Comment 42

•

6 years ago

Attached patch Remove GeneralTokenStreamChars::{,un}getChar now that they're unused (deleted) — Details — Splinter Review

Attachment #8988435 - Flags: review?(arai.unmht)

Jeff Walden [:Waldo]

Assignee

Comment 43

•

6 years ago

Attached patch Change a bunch of 'character' nomenclature in token stream code to 'code unit', completing the transition from tokenizing by UTF-16 'character' to tokenizinng by UTF-8/16 code unit (deleted) — Details — Splinter Review

With this patch I will declare this bug's challenge complete. Other bugs will follow for various mop-up needed to increasingly instantiate TokenStream base classes for UTF-8 as well as UTF-16.

Attachment #8988436 - Flags: review?(arai.unmht)

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8988435 - Flags: review?(arai.unmht) → review+

Tooru Fujisawa [:arai]

Updated

•

6 years ago

Attachment #8988436 - Flags: review?(arai.unmht) → review+

Nicolas B. Pierron [:nbp]

Updated

•

6 years ago

status-firefox63: --- → fix-optional

Priority: -- → P1

Pulsebot

Comment 44

•

6 years ago

Pushed by jwalden@mit.edu: https://hg.mozilla.org/integration/mozilla-inbound/rev/5d9c420a912b Remove GeneralTokenStreamChars::{,un}getChar now that they're unused. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/4396a31f09b5 Change a bunch of 'character' nomenclature in token stream code to 'code unit', completing the transition from tokenizing by UTF-16 'character' to tokenizinng by UTF-8/16 code unit. (There are straggling places where algorithms will need to be specialized for UTF-8, or functions will need to move within the TokenStream* hierarchy to permit such; but what is in the tree now universally acts on code units first, full code points second.) r=arai

Jeff Walden [:Waldo]

Assignee

Updated

•

6 years ago

Keywords: leave-open

Bogdan Tara[:bogdan_tara | bogdant]

Comment 45

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5d9c420a912b https://hg.mozilla.org/mozilla-central/rev/4396a31f09b5

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox63: fix-optional → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla63

Jeff Walden [:Waldo]

Assignee

Updated

•

6 years ago

Depends on: 1476409

You need to log in before you can comment on or make changes to this bug.