Closed Bug 1344152 Opened 8 years ago Closed 7 years ago

Find out how many scripts are purely in ASCII

Categories: Core :: JavaScript Engine, enhancement
Status: RESOLVED WORKSFORME
Performance Impact: none
People: Reporter: Yoric; Assigned: Yoric
Attachments: 1 file, 1 obsolete file

There is a suspicion that we could considerably specialize and optimize the tokenizer for Latin-1 scripts. Let's see how often this would work.
Summary: Find out how many scripts are purely in latin 1 → Find out how many scripts are purely in ASCII
I wrote a quick & dirty checker by patching `TokenStream::getTokenInternal` to report if we ever have `c >= 128`. Much to my surprise, the code of both Firefox and Facebook parses without triggering this (although the code of http://liberation.fr does trigger it). This is surprising because Facebook's page contains e.g. emojis. Here is a sample taken from Facebook's test page:

emojiChoices:["✌","<U+1F602>","<U+1F61D>"," <U+1F601>","<U+1F631>","<U+1F449>","<U+1F64C>","<U+1F37B>","<U+1F525>","<U+1F308>","☀","<U+1F388>","<U+1F339>","<U+1F484>","<U+1F380>","<U+26BD>","<U+1F3BE>","<U+1F3C1>","<U+1F621>","<U+1F47F>","<U+1F43B>"," <U+1F436>","<U+1F42C>","<U+1F41F>","<U+1F340>","<U+1F440>","<U+1F697>","<U+1F34E>","<U+1F49D>","<U+1F499>","<U+1F44C>","❤","<U+1F60D>","<U+1F609>","<U+1F613>","<U+1F633>","<U+1F4AA>","<U+1F4A9>","<U+1F378>"," <U+1F511>","<U+1F496>","<U+1F31F>","<U+1F389>","<U+1F33A>","<U+1F3B6>","<U+1F460>","<U+1F3C8>","<U+26BE>","<U+1F3C6>","<U+1F47D>","<U+1F480>","<U+1F435>","<U+1F42E>","<U+1F429>","<U+1F40E>","<U+1F4A3>"," <U+1F443>","<U+1F442>","<U+1F353>","<U+1F498>","<U+1F49C>","<U+1F44A>","<U+1F48B>","<U+1F618>","<U+1F61C>","<U+1F635>","<U+1F64F>","<U+1F44B>","<U+1F6BD>","<U+1F483>","<U+1F48E>","<U+1F680>","<U+1F319>"," <U+1F381>","<U+26C4>","<U+1F30A>","<U+26F5>","<U+1F3C0>","<U+1F3B1>","<U+1F4B0>","<U+1F476>","<U+1F478>","<U+1F430>","<U+1F437>","<U+1F40D>","<U+1F42B>","<U+1F52B>","<U+1F444>","<U+1F6B2>","<U+1F349>","<U+1F49B> ","<U+1F49A>"]

So either I'm measuring in the wrong place, or something inside Necko or Gecko is somehow converting these surrogate pairs to something that doesn't trigger `c >= 128`.
Or, alternatively, the tokenizer has an "I'm in a string" mode that I haven't found yet.
Ok, found the "I'm in a string" mode at `getStringOrTemplateToken`. As far as I can tell, there is no special treatment for non-ASCII characters in strings, so I don't think that there is anything to be gained by specializing that method, so I'll assume that we don't care about non-ASCII in strings for the moment.
It's weird: in my experiment, we parse chrome://global/content/bindings/general.xml 56 times during just a few seconds of Firefox execution.
Assignee: nobody → dteller
Attaching a WIP patch for f?. I'm not sure if there is a better way to report findings through Telemetry.
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #3)
> As far as I can tell, there is no special treatment for non-ASCII characters in
> strings, so I don't think that there is anything to be gained by
> specializing that method, so I'll assume that we don't care about non-ASCII
> in strings for the moment.

If we want to specialize the tokenizer for ASCII/Latin1 source code (this would be great for performance!), we care about *any* non-ASCII character no matter where it occurs, right? Because I doubt we would want to use the ASCII/Latin1 mode when the input contains TwoByte chars anywhere.
We would want to know if one exists anywhere in the provided source code. (That said, there's a decent chance I'll be making a single-byte-char tokenizer that handles UTF-8 input, as that's likely increasingly how scripts are shipped these days, not ASCII or Latin-1.)
Comment on attachment 8843734 [details]
Bug 1344152 - Find out how many scripts are in ASCII;

https://reviewboard.mozilla.org/r/117304/#review119532

I guess this is a quick and dirty version, given the debug printfs. Do you want to clean it up to try to land? Given that it may incur some slight overhead in the tokenizer, we should make sure it doesn't regress performance first.

::: js/src/frontend/Parser.cpp:672 (Diff revision 1)
> abortedSyntaxParse(false),
> isUnexpectedEOF_(false),
> handler(cx, *alloc, tokenStream, syntaxParser, lazyOuterFunction)
> {
> +
> +

Extra newlines.

::: js/src/frontend/Parser.cpp:705 (Diff revision 1)
>
> template <typename ParseHandler>
> Parser<ParseHandler>::~Parser()
> {
> +
> +

Extra newlines.

::: js/src/frontend/Parser.cpp:728 (Diff revision 1)
> template <typename ParseHandler>
> ObjectBox*
> Parser<ParseHandler>::newObjectBox(JSObject* obj)
> {
> +
> +

Extra newlines.

::: js/src/frontend/Parser.cpp:790 (Diff revision 1)
> Directives inheritedDirectives,
> GeneratorKind generatorKind,
> JSObject* enclosingStaticScope)
> {
> +
> +

Extra newlines.

::: js/src/frontend/Parser.cpp:833 (Diff revision 1)
> template <typename ParseHandler>
> ModuleBox*
> Parser<ParseHandler>::newModuleBox(Node pn, HandleModuleObject module)
> {
> +
> +

Extra newlines.

::: js/src/frontend/Parser.cpp:886 (Diff revision 1)
> template <typename ParseHandler>
> typename ParseHandler::Node
> Parser<ParseHandler>::parse()
> {
> MOZ_ASSERT(checkOptionsCalled);
> +

Extra newline.

::: js/src/frontend/Parser.cpp:920 (Diff revision 1)
> if (foldConstants) {
> if (!FoldConstants(context, &pn, this))
> return null();
> }
> }
> +

No need to add a newline here.

::: js/src/frontend/Parser.cpp:1013 (Diff revision 1)
> + addEncodingTelemetry();
> MOZ_ASSERT(mn->pn_modulebox == modulebox);
> return mn;
> }
>
> +template<typename ParseHandler>

Nit: space after template

::: js/src/frontend/Parser.cpp:1026 (Diff revision 1)
> + tokenStream.isASCIISoFar() ? "ASCII": "_",
> + getFilename()
> + );
> + }
> + // FIXME: Need to review that this is an appropriate use of
> + // `runtimeFromAnyThread`.

Probably not, since the telemetry callback is an opaque function pointer and we have no idea if it's thread safe.

::: js/src/frontend/TokenStream.cpp:1121 (Diff revision 1)
> // early allows subsequent checking to be faster.
> if (MOZ_UNLIKELY(c >= 128)) {
> + if (isASCIISoFar_) {
> + fprintf(stderr, "Yoric: Not ASCII anymore %s\n", filename);
> + isASCIISoFar_ = false;
> + }

No need to check for isASCIISoFar_, unconditionally assigning to isASCIISoFar_ is fine.
Attachment #8843734 - Flags: review?(shu)
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #8)
> We would want to know if one exists anywhere in the provided source code.
> (That said, there's a decent chance I'll be making a single-byte-char
> tokenizer that handles UTF-8 input, as that's likely increasingly how
> scripts are shipped these days, not ASCII or Latin-1.)

What's the difference between ASCII and single-byte UTF-8? I thought the only range UTF-8 can represent in a single byte was ASCII.
Comment on attachment 8843734 [details]
Bug 1344152 - Find out how many scripts are in ASCII;

https://reviewboard.mozilla.org/r/117304/#review119532

Yes, this is the Q&D version to get early feedback. I'll clean it up.

> No need to check for isASCIISoFar_, unconditionally assigning to isASCIISoFar_ is fine.

Ah, yeah, that's leftover debug code, thanks.
(In reply to Jan de Mooij [:jandem] from comment #7)
> If we want to specialize the tokenizer for ASCII/Latin1 source code (this
> would be great for performance!), we care about *any* non-ASCII character no
> matter where it occurs right? Because I doubt we would want to use the
> ASCII/Latin1 mode when the input contains TwoByte chars anywhere.

As far as I understand, our Tokenizer already expects the code to have been transcoded to UTF-8 by Necko. If I read the code correctly, the only place where it would benefit from an ASCII optimization would be in `TokenStream::getTokenInternal`'s handling of identifiers. So I'm only measuring the use of non-ASCII chars in identifiers, and expecting that any optimization would only affect this method.
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #12)
> As far as I understand, our Tokenizer already expects the code to have been
> transcoded to UTF-8 by Necko.

No the tokenizer works on char16_t data.

> If I read the code correctly, the only place
> where it would benefit from an ASCII optimization would be in
> `TokenStream::getTokenInternal`'s handling of identifiers. So I'm only
> measuring the use of non-ASCII chars in identifiers, and expecting that any
> optimization would only affect this method.

As I understand it, the basic idea would be to templatize TokenStream to work on either char16_t or Latin1Char chars (or UTF8 as Waldo suggested).
(In reply to Jan de Mooij [:jandem] from comment #13)
> No the tokenizer works on char16_t data.

Sorry, typo. You're right, it's char16_t.

> > If I read the code correctly, the only place
> > where it would benefit from an ASCII optimization would be in
> > `TokenStream::getTokenInternal`'s handling of identifiers. So I'm only
> > measuring the use of non-ASCII chars in identifiers, and expecting that any
> > optimization would only affect this method.
>
> As I understand it, the basic idea would be to templatize TokenStream to
> work on either char16_t or Latin1Char chars (or UTF8 as Waldo suggested).

How would that work for, well, identifiers, strings and other atoms? Would we need to transcode them after reading them?

Also, what other changes in Gecko would be necessary to actually feed latin-1 or UTF-8 to the TokenStream?
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #14)
> Also, what other changes in Gecko would be necessary to actually feed
> latin-1 or UTF-8 to the TokenStream?

CC'ing Servo folks who also might be interested in this, as Servo uses UTF-8 internally.
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #14)
> How would that work for, well, identifiers, strings and other atoms? Would
> we need to transcode them after reading them?

The nice thing is that we have Latin1 strings now, so if we did ASCII or Latin1 parsing we could just call AtomizeString etc with Latin1Char* instead of char16_t*. Then we also don't have to check "can we store this char16_t* string as Latin1?" when we have to allocate a new atom.

> Also, what other changes in Gecko would be necessary to actually feed
> latin-1 or UTF-8 to the TokenStream?

Gecko will often receive UTF8 data so if we added APIs for that we could potentially skip the UTF8 -> char16_t conversion (see also Waldo's comment 8). Other options are to keep working with char16_t* data initially (it should still be faster when you can assume all chars are ASCII/Latin1) or to add a fixed-size ASCII/Latin1 buffer and (incrementally) put chars into that as TokenStream needs them. The latter is what V8 does (or at least did a few years ago, maybe it changed).
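For illustration, a minimal sketch of the templating idea discussed above. This is not SpiderMonkey's actual TokenStream code; `isAsciiIdentPart` and `scanAsciiIdentifier` are hypothetical helpers. The point is that the `Latin1Char` instantiation only ever performs one-byte loads, and whatever it scans can be handed to the single-byte atomization path mentioned above.

```cpp
// Illustration only: a scanner templated on the character type, in the spirit
// of specializing TokenStream. Latin1Char mirrors SpiderMonkey's alias for
// unsigned char; isAsciiIdentPart is a hypothetical helper, not an engine API.
#include <cstddef>

using Latin1Char = unsigned char;

static bool isAsciiIdentPart(char c)
{
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
           (c >= '0' && c <= '9') || c == '_' || c == '$';
}

// Returns the offset one past the last code unit of an ASCII identifier
// starting at |pos|. Instantiated with CharT = Latin1Char it only ever does
// one-byte loads; with CharT = char16_t it is the two-byte path.
template <typename CharT>
static size_t scanAsciiIdentifier(const CharT* chars, size_t length, size_t pos)
{
    while (pos < length && chars[pos] < 128 &&
           isAsciiIdentPart(static_cast<char>(chars[pos])))
    {
        pos++;
    }
    return pos; // A real tokenizer would handle non-ASCII identifier parts here.
}
```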
Ok, so I'm probably measuring the wrong thing. Do I now understand correctly that the main optimization would be when creating strings/atoms? If so, I need to measure something entirely different. Or what am I missing?
Flags: needinfo?(till)
Flags: needinfo?(shu)
Flags: needinfo?(jdemooij)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #17)
> Do I now understand correctly that the main optimization would be when
> creating strings/atoms?

That's part of it, but it's also just faster to parse Latin1/ASCII/UTF8 than char16_t*, for instance because more data will fit in CPU caches. When I enabled Latin1 strings we saw a 10-15% or so improvement on the Kraken JSON tests.

For what it's worth, JSC templatizes their lexer on their equivalent of our Latin1Char/char16_t, and Chakra's lexer can work on UTF8 data. I expect both of these to be faster than what we do now, and I don't care too much what we do :)
Flags: needinfo?(jdemooij)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #17)
> Ok, so I'm probably measuring the wrong thing.
>
> Do I now understand correctly that the main optimization would be when
> creating strings/atoms? If so, I need to measure something entirely
> different.

I imagine both strings and easier handling of identifiers would be impactful optimizations. We should measure if there are *any* non-latin1/ascii characters in a buffer. (Maybe this should happen in a different layer in Gecko?) For a completely latin1 buffer we can then have simpler everything, the most important of which are, as you've seen, identifiers and strings.
Flags: needinfo?(shu)
In addition to what jandem and shu said, being able to consume utf8 would allow us to just hand in utf8 strings unmodified and have SM take ownership of them without any copying or decoding. This'd be nice for Gecko if the network resource is utf8 - we'd need to ensure that it's valid utf8 or be able to handle invalid encodings, though - and even nicer for Servo, where all strings are utf8.
Flags: needinfo?(till)
Ok, I'm changing strategy and measuring directly in the script loader. Finding out if we restrict ourselves to the ASCII subset of Unicode is easy. Finding out if we only use Latin 1, though, doesn't seem to entirely make sense. If I understand correctly, and despite what I thought, Latin 1 is not a subset of UTF-8, unless you decide to interpret some invalid byte sequences as Latin 1. So I'm back to not being sure what we want to measure here. Note that I have opened bug 1345437 to determine the original encoding of scripts, but I'm starting to think that this is orthogonal.
Blocks: 1345703
So what's our definition of "Latin 1" in this context? UTF-8 that only uses the first byte?
Flags: needinfo?(jwalden+bmo)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #23)
> So what's our definition of "Latin 1" in this context? UTF-8 that only uses
> the first byte?

We consider a char16_t buffer that doesn't have any char16_t values > 0xff to be Latin1: http://searchfox.org/mozilla-central/source/js/src/vm/String.cpp#1148
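For illustration, a scalar sketch of that definition (the linked String.cpp code is the real thing; this is not it):

```cpp
// Scalar sketch of the "Latin1" definition above: a char16_t buffer counts as
// Latin1 when no code unit exceeds 0xFF. Illustration only.
#include <cstddef>

static bool IsLatin1(const char16_t* chars, size_t length)
{
    for (size_t i = 0; i < length; i++) {
        if (chars[i] > 0xFF) {
            return false;
        }
    }
    return true;
}
```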
Flags: needinfo?(jwalden+bmo)
Attachment #8843734 - Attachment is obsolete: true
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #25)
> Since finding out if a script is within [0x00, 0xFF[ can take time,
> this patch limits collection to sample randomly once every 10k loads.

Shouldn't it be possible to check this during decoding? Most scripts will be sent as utf8, so we'll decode them anyway. If during decoding we detect that they're really latin1, could we just discard the decoding result and use the original source?
As far as I recall, we have multiple decoders. Do you suggest I patch them all? Or just the utf8 decoder?
Flags: needinfo?(till)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #27)
> As far as I recall, we have multiple decoders. Do you suggest I patch them
> all? Or just the utf8 decoder?

No, I'd assume the one or two most used by Gecko should be enough. I just checked, however, and I think that maybe the best place to patch is nsUTF8ToUnicode::Convert. If you add an additional bool* aIsLatin1 outparam to that, it should catch the vast majority of scripts cheaply. That'd be much more invasive, however, so perhaps for just measuring the optimization potential it might be too much.
Flags: needinfo?(till)
Ah, got it. Yeah, it's a bit invasive, especially with the ongoing plans to change Gecko's transcoding. I figured my patch would be sufficient for a quick experiment.
Attachment #8845865 - Flags: feedback?(hsivonen)
Comment on attachment 8845865 [details]
Bug 1344152 - Determine how many scripts are in "Latin 1";f?hsivonen

https://reviewboard.mozilla.org/r/119010/#review121460

First, I'd like to understand how this knowledge could be put to use.

We have external scripts and internal scripts. In the case of internal scripts, the DOM can store the data that's logically UTF-16 with the leading zeros omitted, so that the storage is actually ISO-8859-1 (in the ISO sense, not in the WHATWG sense). However, it seems that this patch doesn't check for that.

As for external scripts, which is what this patch is about, even if we knew how many scripts only use code points up to and including U+00FF, what would we be able to do with that information? Even if we knew from telemetry that a sizeable proportion of scripts didn't use code points beyond U+00FF, how would we make use of a parsing code path optimized for that case? For an incoming script that we haven't iterated over yet, how would we know that it's only going to contain code points up to and including U+00FF? The Web platform offers no way for the site to promise that. (In particular, charset=ISO-8859-1 does *not* promise that, because on the Web that's just an alias for charset=windows-1252. Moreover, we want everyone to move to UTF-8, so we shouldn't do anything that would result in us recommending to authors that they should use something other than UTF-8 for interchange.)

Furthermore, if we wanted to gather telemetry for this (and per the above paragraph, I don't see what we could do with the knowledge acquired via telemetry), it would probably make performance sense to use SSE2 for checking that the code points stay at most at U+00FF.
Attachment #8845865 - Flags: review-
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #8)
> (That said, there's a decent chance I'll be making a single-byte-char
> tokenizer that handles UTF-8 input, as that's likely increasingly how
> scripts are shipped these days, not ASCII or Latin-1.)

For external scripts, having a UTF-8 JS parser code path makes a lot more sense than having a Latin1 one. (A Latin1 one might make sense for internal scripts.)

Bug 1261841 will enable Gecko to decode directly to UTF-8 (without going via UTF-16) and doing it in such a way that there's no copy if the input is in UTF-8 and valid[1], if the input is in an ASCII-compatible encoding and the input is all ASCII, or if the input is in ISO-2022-JP and the input stays entirely in the ASCII state of ISO-2022-JP.

[1] The Cow<'a, str> Rust API can avoid a copy even if there's a UTF-8 BOM. The nsACString C++ API will have to copy if there's a UTF-8 BOM, because nsACString buffer sharing works only with entire strings and not with substrings.
(In reply to Henri Sivonen (:hsivonen) from comment #30) > Comment on attachment 8845865 [details] > Bug 1344152 - Determine how many scripts are in "Latin 1";f?hsivonen > > https://reviewboard.mozilla.org/r/119010/#review121460 > > First, I'd like to understand how this knowledge could be put to use. Shu and/or Waldo are running experiments in parallel to find out how much faster a specialized single-byte-UTF-8 parser would be. As you point out, we can only run this parser if we already know that a script is using single-byte-UTF-8, so we would need to first iterate through the code to check whether it actually is single-byte-UTF-8. We win if: 1. a vast majority of scripts would benefit from the optimization; 2. initial iteration + specialized parsing is actually meaningfully faster than unspecialized parsing. 3. initial iteration + unspecialized parsing is not too slow. So this patch checks that: 1. there are enough scripts in the wild to make this optimization useful in the first place; 2. how quickly we can iterate through the source to check whether it is actually single-byte-UTF-8. > We > have external scripts and internal scripts. In the case of internal scripts, > the DOM can store the data that's logically UTF-16 with the leading zeros > omitted so that the storage is actually ISO-8859-1 (in the ISO sense, not in > the WHAWWG sense). However, it seems that this patch doesn't check for that. Yes, I'm new to this part of the tree, so I figured I would start with something small. I'm quite happy to add further measures for the internal scripts if you can point me in the right direction. > As for extenal scripts, which is what this patch is about, even if we knew > how many scripts only use code points up to and including U+00FF, what would > be able to do with that information? Even if we knew from telemetry that a > sizeable proportion of scripts didn't use code points beyond U+00FF, how > would we make use of a parsing code path optimized for that case? For an > incoming script that we haven't iterated over yet, how would we know that > it's only going to contain code points up to and including U+00FF? The Web > platform offers no way for the site to promise that. (In particular, > charset=ISO-8859-1 does *not* promise that, because on the Web that's just > an alias for charset=windows-1252. Moreover, we want everyone to move to > UTF-8, so we shouldn't do anything that would result in us recommending to > authors that they should use something other than UTF-8 for interchange.) (Most points answered above) In this case, I'm trying to measure a subcase of UTF-8, which wouldn't need developers to change anything, just to find out if we're in a sweet spot. I'm pretty new at UTF-8, so it's entirely possible that I'm measuring the wrong thing (or that I misunderstood Waldo or Shu), but I *think* that this is what my code does. > Furthermore, if we wanted to gather telemetry for this (and per the above > paragraph, I don't see what we could do with the knowledge acquired via > telemetry) it would probably make performance sense to use SSE2 for checking > that the code points stay at most at U+00FF. Good idea. Unfortunately, I know next to nothing about SSE2 programming, do we have something like this in the code already? (In reply to Henri Sivonen (:hsivonen) from comment #31) > For external scripts, having a UTF-8 JS parser code path makes a lot more > sense than having a Latin1 one. (A Latin1 one might make sense for internal > scripts.) 
> > Bug 1261841 will enable Gecko to decode directly to UTF-8 (without going via > UTF-16) and doing it in such a way that there's no copy if the input is in > UTF-8 and valid[1], if the input is in an ASCII-compatible encoding and the > input is all ASCII, or if the input is in ISO-2022-JP and the input stays > entirely in the ASCII state of ISO-2022-JP. Yes, I kind of expect that we're heading towards a UTF-8 parser, with possible special-cases for e.g. single-byte UTF-8.
Flags: needinfo?(hsivonen)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #32)
> (In reply to Henri Sivonen (:hsivonen) from comment #30)
> > Comment on attachment 8845865 [details]
> > Bug 1344152 - Determine how many scripts are in "Latin 1";f?hsivonen
> >
> > https://reviewboard.mozilla.org/r/119010/#review121460
> >
> > First, I'd like to understand how this knowledge could be put to use.
>
> Shu and/or Waldo are running experiments in parallel to find out how much
> faster a specialized single-byte-UTF-8 parser would be. As you point out, we
> can only run this parser if we already know that a script is using
> single-byte-UTF-8, so we would need to first iterate through the code to
> check whether it actually is single-byte-UTF-8.

Single-byte UTF-8 is ASCII, that is < 0x80, but that's not what this patch checks for.

What reason is there to believe that iterating over the input to see if it's all-ASCII and then running an ASCII-only parser would be more efficient than running an ASCII-biased UTF-8 parser? I'm not saying that it is impossible for an ASCIIness check followed by an ASCII-only parser to be faster than an ASCII-biased UTF-8 parser, but it seems remarkably counterintuitive.

Do we have some data suggesting that there is a performance case to be made for an ASCII-only parser compared to an ASCII-biased UTF-8 parser? It seems rather premature to gather data with the assumption that we might want to do the counter-intuitive thing unless we already have some data that shows that, against intuition, having "else" branches for non-ASCII in a byte-oriented UTF-8 parser is so expensive that pursuing an ASCII-only parser really makes sense.

> We win if:
> 1. a vast majority of scripts would benefit from the optimization;
> 2. initial iteration + specialized parsing is actually meaningfully faster
> than unspecialized parsing.
> 3. initial iteration + unspecialized parsing is not too slow.
>
> So this patch checks that:
> 1. there are enough scripts in the wild to make this optimization useful in
> the first place;
> 2. how quickly we can iterate through the source to check whether it is
> actually single-byte-UTF-8.

This patch checks < 0xFF. Single-byte UTF-8 is < 0x80.

> > We
> > have external scripts and internal scripts. In the case of internal scripts,
> > the DOM can store the data that's logically UTF-16 with the leading zeros
> > omitted so that the storage is actually ISO-8859-1 (in the ISO sense, not in
> > the WHATWG sense). However, it seems that this patch doesn't check for that.
>
> Yes, I'm new to this part of the tree, so I figured I would start with
> something small. I'm quite happy to add further measures for the internal
> scripts if you can point me in the right direction.

You need to walk all the descendant (of the script element) text nodes to see if mState.mIs2b is false for all of them.

> > Furthermore, if we wanted to gather telemetry for this (and per the above
> > paragraph, I don't see what we could do with the knowledge acquired via
> > telemetry) it would probably make performance sense to use SSE2 for checking
> > that the code points stay at most at U+00FF.
>
> Good idea. Unfortunately, I know next to nothing about SSE2 programming, do
> we have something like this in the code already?

Not yet. (We will once we get encoding_rs and SIMD in Rust.) The general idea is:

0) Splat 0x7F to all 8 16-bit SSE2 lanes. Let's call this The Constant.
1) Loop:
2) Load 8 UTF-16 code units with _mm_loadu_si128.
3) Use _mm_cmpgt_epi16 with The Constant and the loaded SSE2 vector.
4) If _mm_movemask_epi8 on the result is non-zero, you found non-ASCII. Break.

But it's still unclear to me that this is worth checking.
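For illustration, a sketch along the lines of the steps above; it is not Gecko or SpiderMonkey code. One assumed deviation: `_mm_cmpgt_epi16` compares lanes as signed 16-bit values, so code units >= 0x8000 would slip past a direct greater-than compare; the sketch instead masks the bits above 0x7F and tests for leftovers.

```cpp
// Sketch of the SSE2 preflight described above: scan char16_t data for any
// code unit > 0x7F, 8 code units (16 bytes) per iteration. A mask-and-test is
// used instead of the signed _mm_cmpgt_epi16 compare so that code units
// >= 0x8000 are classified correctly. Illustration only.
#include <emmintrin.h> // SSE2 intrinsics
#include <cstddef>

static bool ContainsNonAscii(const char16_t* chars, size_t length)
{
    const __m128i highBits = _mm_set1_epi16(static_cast<short>(0xFF80));
    const __m128i zero = _mm_setzero_si128();

    size_t i = 0;
    for (; i + 8 <= length; i += 8) {
        __m128i chunk =
            _mm_loadu_si128(reinterpret_cast<const __m128i*>(chars + i));
        __m128i masked = _mm_and_si128(chunk, highBits);
        // Every 16-bit lane must be zero after masking for all 8 code units
        // to be ASCII; otherwise we found non-ASCII.
        if (_mm_movemask_epi8(_mm_cmpeq_epi16(masked, zero)) != 0xFFFF) {
            return true;
        }
    }
    // Scalar tail for the last < 8 code units.
    for (; i < length; i++) {
        if (chars[i] > 0x7F) {
            return true;
        }
    }
    return false;
}
```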
Flags: needinfo?(hsivonen)
Oh, and just about every popular JS lib is ASCII-only, so we don't need telemetry to tell us that the ASCII-only case is very common.
(In reply to Henri Sivonen (:hsivonen) from comment #33)
> > Shu and/or Waldo are running experiments in parallel to find out how much
> > faster a specialized single-byte-UTF-8 parser would be. As you point out, we
> > can only run this parser if we already know that a script is using
> > single-byte-UTF-8, so we would need to first iterate through the code to
> > check whether it actually is single-byte-UTF-8.
>
> Single-byte UTF-8 is ASCII, that is < 0x80, but that's not what this patch
> checks for.

Good point. I believe that several people (including me) are pretty confused on encodings. At this stage, my revised understanding is that:

1. We know that some competing VMs are iterating through the source first, to find out whether they can use an optimized tokenizer, and it provides them with performance benefits.
2. If the source is Latin 1, we know that we can speed up tokenizing, in particular interning strings.
3. If the source is ASCII (even if specified as UTF-8), well, it's a subset of Latin 1, so we can have this speed up, plus removing an additional `if` branch in a hot function.
4. If we know that the source is Latin 1, and if we have a specialized Latin 1 parser, we probably want to feed the Latin 1 directly from the DOM to the specialized JS parser, without transcoding.
5. Likewise, if we know that the source is ASCII, we probably want to feed the ASCII directly to the specialized JS parser.
6. Doing the above two things would require some rewrite in the parser and encoding_rs in Gecko.

At this stage, I don't know if everybody uses the same notion of "Latin 1", which I understand could be ISO-8859-1, ISO-8859-15 or Windows-1252.

Did I get things right?
Flags: needinfo?(shu)
Flags: needinfo?(jwalden+bmo)
Flags: needinfo?(hsivonen)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #35)
> 2. If the source is Latin 1, we know that we can speed up tokenizing, in
> particular interning strings.

For the definition of Latin 1 where each code point is < U+0100, I imagine the mapping to SpiderMonkey's single-byte strings would be straight-forward. On the Gecko side, you could get such input from internal scripts (if all the text nodes [typically there's just one] of the script have mState.mIs2b as false). The Web Platform provides no way for an external script to be declared as having code points only below U+0100.

> 3. If the source is ASCII (even if specified as UTF-8), well, it's a subset
> of Latin 1, so we can have this speed up, plus removing an additional `if`
> branch in a hot function.

Is the branch hot, though? Most JS code out there is ASCII and most of the syntax is ASCII. For example, if in a UTF-8 parser you are scanning for whitespace, a comma or a closing parenthesis, and you arrange to check for ASCII whitespace first, then the comma, then the closing parenthesis, chances are that the remaining check for non-ASCII whitespace (which, sadly, exists in JS) is not hot. That is, you wouldn't perform UTF-8 checks or math until you failed to find one of the various ASCII things you are scanning for, and chances are that the remaining non-ASCII "else" isn't ever hot. On the surface, the same kind of arrangement would seem to fit many if not all parts of JS syntax. (See the sketch after this comment.)

(Even if the branch actually was hot, just moving it out of the parser but keeping it as a per-byte branch wouldn't make sense. It would make sense to use SSE2 to make the branch per 16 bytes.)

> 4. If we know that the source is Latin 1, and if we have a specialized Latin
> 1 parser, we probably want to feed the Latin 1 directly from the DOM to the
> specialized JS parser, without transcoding.

Yes.

> 5. Likewise, if we know that the source is ASCII, we probably want to feed
> the ASCII directly to the specialized JS parser.

I still have a hard time believing we'd want to have a specialized ASCII parser as opposed to having a byte-oriented UTF-8 parser that performs UTF-8 checks/math only if the byte didn't match the ASCII options in the grammar at a given grammar production. In any case, we already know that a lot of JS uses only ASCII code points (regardless of the nominal encoding), so the data we need isn't telemetry but an experiment in parser writing.

> 6. Doing the above two things would require some rewrite in the parser and
> encoding_rs in Gecko.

The internal script Latin1 opportunity doesn't require encoding_rs. For taking external bytes and converting them to known-valid UTF-8 without copying if possible, it makes sense to wait for encoding_rs.

> At this stage, I don't know if everybody uses the same notion of "Latin 1",
> which I understand could be ISO-8859-1, ISO-8859-15 or Windows-1252.

I'd hope ISO-8859-15 isn't part of the confusion. The confusion is that on one hand there's ISO-8859-1 as defined by ISO, which matches the single-byte storage in SpiderMonkey and in Gecko's DOM single-byte text nodes, but on the other hand the label "ISO-8859-1" on the wire doesn't mean ISO-8859-1 as defined by ISO but actually means windows-1252.
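For illustration, a sketch of the ASCII-first ordering described in the comment above. It is hypothetical, not SpiderMonkey's tokenizer, and for brevity the non-ASCII slow path only recognizes U+00A0.

```cpp
// Sketch of an ASCII-biased scanner over UTF-8 bytes looking for whitespace,
// a comma or a closing parenthesis: the ASCII possibilities are tested first,
// so only lead bytes >= 0x80 ever reach the non-ASCII branch. Illustration
// only; matchNonAsciiSpace handles just U+00A0 here.
#include <cstddef>
#include <cstdint>

enum class Sep { Space, Comma, CloseParen, Other };

static bool matchNonAsciiSpace(const uint8_t* bytes, size_t length, size_t* pos)
{
    // U+00A0 NO-BREAK SPACE is 0xC2 0xA0 in UTF-8 and counts as JS WhiteSpace.
    if (*pos + 1 < length && bytes[*pos] == 0xC2 && bytes[*pos + 1] == 0xA0) {
        *pos += 2;
        return true;
    }
    return false; // A real lexer would handle the other non-ASCII space code points.
}

// Precondition: *pos < length.
static Sep classifyNext(const uint8_t* bytes, size_t length, size_t* pos)
{
    uint8_t b = bytes[*pos];
    // Hot path: plain ASCII separators, checked before anything else.
    if (b == ' ' || b == '\t' || b == '\n' || b == '\r') {
        (*pos)++;
        return Sep::Space;
    }
    if (b == ',') {
        (*pos)++;
        return Sep::Comma;
    }
    if (b == ')') {
        (*pos)++;
        return Sep::CloseParen;
    }
    // Cold path: UTF-8 work happens only for bytes >= 0x80.
    if (b >= 0x80 && matchNonAsciiSpace(bytes, length, pos)) {
        return Sep::Space;
    }
    return Sep::Other;
}
```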
Flags: needinfo?(hsivonen)
Just to avoid confusion: I realize that my patch is incorrect; I'm just discussing what we should do now.

> Is the branch hot, though?

Well, finding this out is what we need Telemetry for. Or do I misunderstand your question?

> For example, if in a UTF-8 parser you are scanning for whitespace [...] and chances are that the remaining non-ASCII "else" isn't ever hot.

Sure, that's pretty much what the current UTF-16 tokenizer is doing. Now, if we specialize our tokenizer for Latin 1, we already pay the code complexity of specialization. It's worth checking whether we can improve further by having one more tokenizer that's basically the same one, minus one branch. I don't know that this will make things faster, but it's worth trying.

> (Even if the branch actually was hot, just moving it out of the parser but keeping it as a per-byte branch wouldn't make sense. It would make sense to use SSE2 to make the branch per 16 bytes.)

Not entirely sure I understand how SSE fits at this stage, at least if we assume a UTF-8 tokenizer. The tokenizer handles one token at a time, so in most cases, one char == one byte. If we look at 16 bytes, we need to store the result somewhere until the next 15 bytes are needed, which I suspect defeats any SSE optimization.

> In any case, we already know that a lot of JS uses only ASCII code points (regardless of the nominal encoding), so the data we need isn't telemetry but an experiment in parser writing.

I have trouble following this specific line. How do we *know* it if we never measure it?

> The internal script Latin1 opportunity doesn't require encoding_rs. For taking external bytes and converting them to known-valid UTF-8 without copying if possible, it makes sense to wait for encoding_rs.

Ok. Is there an ETA for encoding_rs?
Flags: needinfo?(hsivonen)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #37)
> > Is the branch hot, though?
>
> Well, finding this out is what we need Telemetry for. Or do I misunderstand
> your question?

Telemetry that looks at the ASCIIness of the source as a whole doesn't tell us if it's bad to have an ASCII-biased UTF-8 parser compared to an ASCII-only parser plus preflight. For JS syntax excluding string and regexp literals, the ASCII-biased UTF-8 parser would run the same branches first that an ASCII-only parser would run anyway. Handling the non-ASCII case as a potentially hotter branch in string and regexp literals would still be cheaper than checking everything for ASCIIness up front.

If there's something to measure, it should be something like:
* Write an ASCII-biased UTF-8 parser.
* Make a copy with the non-ASCII "else" branches ripped out.
* Parse e.g. minified jQuery with the UTF-8 parser and with the ASCII only version with an ASCII check preflight and see what the difference is.

This isn't a telemetry issue.

> > For example, if in a UTF-8 parser you are scanning for whitespace [...] and chances are that the remaining non-ASCII "else" isn't ever hot.
>
> Sure, that's pretty much what the current UTF-16 tokenizer is doing.

Except it has to burn through twice the memory/cache for the stuff that's actually in the ASCII range.

> Not entirely sure I understand how SSE fits at this stage, at least if we
> assume a UTF-8 tokenizer.

If we end up with any kind of "is the whole thing ASCII" preflight (and I expect we don't really need one), SSE2 is perfect for that kind of preflight.

> > In any case, we already know that a lot of JS uses only ASCII code points (regardless of the nominal encoding), so the data we need isn't telemetry but an experiment in parser writing.
>
> I have trouble following this specific line. How do we *know* it if we never
> measure it?

See above. It's not something we can measure from telemetry. It needs to be measured with actual parser candidates, and that can be done offline.

> > The internal script Latin1 opportunity doesn't require encoding_rs. For taking external bytes and converting them to known-valid UTF-8 without copying if possible, it makes sense to wait for encoding_rs.
>
> Ok. Is there an ETA for encoding_rs?

The code will exist Real Soon Now. encoding_rs already exists, the glue code almost completely exists, and I'm now writing the megapatch to change all the callers. However, it's unclear when I get to land it.

In order not to regress the perf of the UTF-8 to UTF-16 case (which is the most important case for HTML/XML and, for the time being, for JS until the JS parser switches to UTF-8), encoding_rs needs to use SSE2 (since uconv uses SSE2 for that case). Explicit SSE2 currently requires nightly Rust but we use release-channel Rust for Firefox. So the status of SSE2 in Rust poses a hard-to-estimate delay.
Flags: needinfo?(hsivonen)
(In reply to Henri Sivonen (:hsivonen) from comment #38)
> If there's something to measure, it should be something like:
> * Write an ASCII-biased UTF-8 parser.
> * Make a copy with the non-ASCII "else" branches ripped out.
> * Parse e.g. minified jQuery with the UTF-8 parser and with the ASCII only
> version with an ASCII check preflight and see what the difference is.

Additionally, even if the ASCII-only parser plus preflight was a bit faster, it would still make sense to use an ASCII-biased UTF-8 parser.

One copyright sign or emoji pessimizes the entire script in the case of an ASCII-only parser but not in the case of an ASCII-biased UTF-8 parser.
Henri and Yoric, there's been some confusion on ASCII vs "Latin1" here. Here's the plan in my head:

"Latin1" is used inside the JS engine, perhaps incorrectly, to mean if a UTF-16 buffer has no char16_t c > 0xFF (or |c & ~0xFF| more cheaply). If this is not what Latin1 technically is, let's call it "8-bit" or something and not dwell on the incorrect use of encoding name here. This is larger than ASCII i.e. 1-byte UTF-8, which is 7-bit.

If we know we can parse a source with an 8bit tokenizer, we win on at least the following:
- Being able to figure out if a character is an identifier by indexing into a precomputed 255-character table.
- Things just generally faster on char* instead of char16_t* (caches, vectorization, whatever).
- Directly atomizing 8bit strings instead of 16bit strings, which SpiderMonkey already has the ability to do.

(See the sketch after this comment for the table idea.)

I think it's uncontroversial that having a specialized 8bit tokenizer would be a net win all around. The contentions here are as I understand them:

1. Should the 8bit lexer be Latin1-biased or be hardcoded to only deal with byte values? From the engine's POV, I think its input buffer should be known ahead of time to be consumable as 8bit or not. An 8bit-biased parser would require pre-inflated input, since 1-byte UTF-8 is 7-bit, and one of the goals here is to process char*.

2. How do we get an 8-bit input buffer? If Gecko knows something is consumable as 8-bit, then great, I hope we can get that straightforwardly. If not, one thing we're gonna try and measure is this preflight thing.

I agree more or less that telemetry for this "8bit" thing isn't too valuable, in that I also bet 99% of Western sites would fall into it. That said, it isn't ASCII and I don't see how the info can hurt.
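For illustration, a sketch of the byte-indexed identifier table from the first bullet above (256 entries in practice, one per possible byte value). The ASCII-only fill is an assumption for brevity; a real Latin1 table would also mark the identifier characters in U+0080..U+00FF, and this is not SpiderMonkey's actual table.

```cpp
// Illustration of the precomputed table idea above: classifying an
// identifier-part byte becomes a single indexed load.
#include <cstdint>

namespace {

struct IdentPartTable {
    bool isIdentPart[256];
    IdentPartTable() {
        for (int c = 0; c < 256; c++) {
            // ASCII-only fill for brevity; U+0080..U+00FF left unmarked.
            isIdentPart[c] = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                             (c >= '0' && c <= '9') || c == '_' || c == '$';
        }
    }
};

const IdentPartTable identPartTable;

inline bool IsIdentPartLatin1(uint8_t byte) {
    return identPartTable.isIdentPart[byte];
}

} // namespace
```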
Flags: needinfo?(shu)
(In reply to Henri Sivonen (:hsivonen) from comment #39)
> Additionally, even if the ASCII-only parser plus preflight was a bit faster,
> it would still make sense to use an ASCII-biased UTF-8 parser.
>
> One copyright sign or emoji pessimizes the entire script in the case of an
> ASCII-only parser but not in the case of an ASCII-biased UTF-8 parser.

Fwiw, I believe that everybody agrees that the fallback parser should be an ASCII-biased UTF-8 parser.
Henri, if instead of talking about an ASCII preflight we were talking about a "Latin 1 as defined by <0x100" preflight, would you still have the same objections?
Flags: needinfo?(hsivonen)
I chatted with Waldo in person about this. Recap below to more concretely pin down a plan:

1. Most of the engineering work is in refactoring so TokenStream can be specialized.
   a. Waldo's working on refactoring TokenStream.
   b. I or someone else can look at separating frontend error reporting from TokenStream.
2. Once 1 is done, we can straightforwardly prototype both approaches:
   a. 7-bit (ASCII-biased) UTF-8 tokenizing.
   b. 8-bit preflight + 8-bit tokenizing.

Yoric, for telemetry, the only interesting question is "how many non-7-bit (i.e. ASCII) files are 8-bit?" That is, do most non-ASCII files in the wild end up being able to be single-byte parsed if we did an 8-bit preflight? Or do most non-ASCII files in the wild end up requiring multi-byte lexing anyways?

I hope that clears everything up!
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #42)
> Henri, if instead of talking about an ASCII preflight we were talking about
> a "Latin 1 as defined by <0x100" preflight, would you still have the same
> objections?

Mostly the same.

It's unclear what you propose the preflight would be on.

For internal scripts, the <0x100 preflight doesn't need to iterate over data. It just needs to iterate over the text nodes (normally only one) and see if mState.mIs2b is false for all of them. While optimizing that case would likely make it faster, overall internal scripts contain less JS source than external scripts, so it would make more sense to pursue optimizing external scripts.

If we switched to decoding external scripts to UTF-8, a <0x100 preflight wouldn't make sense. If we continue to decode external scripts to UTF-16, chances are that a <0x100 preflight would give as little benefit as a <0x7F preflight would for UTF-8 per comment 36. Even if the <0x100 preflight showed that the UTF-16 source was actually Latin1, the parser would still have to touch twice the memory/cache compared to the UTF-8 case.

(In reply to Shu-yu Guo [:shu] from comment #40)
> Henri and Yoric, there's been some confusion on ASCII vs "Latin1" here.
> Here's the plan in my head:
>
> "Latin1" is used inside the JS engine, perhaps incorrectly, to mean if a
> UTF-16 buffer has no char16_t c > 0xFF (or |c & ~0xFF| more cheaply).

I'm aware.

> If
> this is not what Latin1 technically is, let's call it "8-bit" or something
> and not dwell on the incorrect use of encoding name here. This is larger
> than ASCII i.e. 1-byte UTF-8, which is 7-bit.

I think calling it 8-bit at the stage when it's stored as UTF-16 (the current case prior to parsing) is even more confusing than calling it Latin1.

> If we know we can parse a source with an 8bit tokenizer, we win on at least
> the following:
> - Being able to figure out if a character is an identifier by indexing into
> a precomputed 255-character table.
> - Things just generally faster on char* instead of char16_t* (caches,
> vectorization, whatever).
> - Directly atomizing 8bit strings instead of 16bit strings, which
> SpiderMonkey already has the ability to do.
>
> I think it's uncontroversial that having a specialized 8bit tokenizer would
> be a net win all around.

But I think it is controversial!

The benefits you outline above apply to parsing internal scripts that are already stored as 8 bits per code point in the DOM (mState.mIs2b is false), and the internal scripts total to much less JS source than the external scripts.

For external scripts, we don't have the option to decode directly into 8-bit values that range from 0 to 0xFF:

1. There exists no way in the Web Platform to flag an incoming script as having its bytes directly memcpyable to such a form. (charset=ISO-8859-1 or charset=latin1 don't actually mean that.)
2. We don't have (either currently with uconv or in the future with encoding_rs) decoder infrastructure for that for arbitrarily-labeled inputs.
3. If we decode to UTF-16 first (possible with uconv or encoding_rs), we've already messed up the cache-friendliness by exploding the data to twice the size. I.e. every code point takes 16 bits even if only the lower 8 are in use.
4. If we decode to UTF-8 first (possible with encoding_rs), the range that can be one-byte-optimized is <0x80, not <0x100.

Additionally, if the input is UTF-8, the "decoding" is just copyless validation, and if the input is presumptively something else (except UTF-16, but UTF-16 on the wire is *rare*) but actually contains just ASCII, the "decoding" is also just copyless validation.

Furthermore, an ASCII-biased UTF-8 parser would be superior to (what I hear) V8 does. A single emoji somewhere has a tiny local effect on an ASCII-biased UTF-8 parser. With a V8-style single-byte parser, a single emoji puts the entire script file on the slow path. (Note that Chromium doesn't have the decoder infra to decode directly to UTF-8. Neither do we today, but we will soon with encoding_rs.)

As I see it, since external scripts are larger than internal scripts, we should optimize external ones first. For external ones, it's pretty clear that an ASCII-biased UTF-8 parser is the way to go.
Flags: needinfo?(hsivonen)
(In reply to Henri Sivonen (:hsivonen) from comment #44)
> (In reply to David Teller [:Yoric] (please use "needinfo") from comment #42)
> > Henri, if instead of talking about an ASCII preflight we were talking about
> > a "Latin 1 as defined by <0x100" preflight, would you still have the same
> > objections?
>
> Mostly the same.
>
> It's unclear what you propose the preflight would be on.

So far, I'm mostly interested in external scripts, as I figure that's the biggest cost.

[...]

> If we switched to decoding external scripts to UTF-8, a <0x100 preflight
> wouldn't make sense.

Switching to UTF-8 is mostly planned, eventually. So what would make sense in that context?

> > I think it's uncontroversial that having a specialized 8bit tokenizer would
> > be a net win all around.
>
> But I think it is controversial!
>
> The benefits you outline above apply to parsing internal scripts that are
> already stored as 8 bits per code point in the DOM (mState.mIs2b is false),
> and the internal scripts total to much less JS source than the external
> scripts.

I believe everybody agrees that internal scripts are secondary at this stage.

> For external scripts, we don't have the option to decode directly into 8-bit
> values that range from 0 to 0xFF.

[...]

> 4. If we decode to UTF-8 first (possible with encoding_rs),

I believe that this was the plan.

> the range that
> can be one-byte-optimized is <0x80, not <0x100.

Ah, interesting, that's probably the source of the initial misunderstanding.

> Additionally, if the input
> is UTF-8, the "decoding" is just copyless validation, and if the input is
> presumptively something else (except UTF-16, but UTF-16 on the wire is
> *rare*) but actually contains just ASCII, the "decoding" is also just
> copyless validation.
>
> Furthermore, an ASCII-biased UTF-8 parser would be superior to (what I hear)
> V8 does. A single emoji somewhere has a tiny local effect on an ASCII-biased
> UTF-8 parser. With a V8-style single-byte parser, a single emoji puts the
> entire script file on the slow path. (Note that Chromium doesn't have the
> decoder infra to decode directly to UTF-8. Neither do we today, but we will
> soon with encoding_rs.)
>
> As I see it, since external scripts are larger than internal scripts, we
> should optimize external ones first. For external ones, it's pretty clear
> that an ASCII-biased UTF-8 parser is the way to go.

I'm ok with starting with that, then finding out if we can further optimize.
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #45)
> > If we switched to decoding external scripts to UTF-8, a <0x100 preflight
> > wouldn't make sense.
>
> Switching to UTF-8 is mostly planned, eventually. So what would make sense
> in that context?

Having the parser operate on a byte iterator (as opposed to a code point iterator) and checking for the ASCII parts (<0x80) of each grammar production first when parsing, so that the non-ASCII handling branches for a given grammar production execute only in the rare event when the ASCII part of a grammar production didn't match. As noted, the great thing is that this doesn't require a preflight, because the effects of non-ASCII are local and don't need to pessimize the overall parsing approach for the entire script.

> > As I see it, since external scripts are larger than internal scripts, we
> > should optimize external ones first. For external ones, it's pretty clear
> > that an ASCII-biased UTF-8 parser is the way to go.
>
> I'm ok with starting with that, then finding out if we can further optimize.

Cool. Then we don't need telemetry at this time.
If all goes well with Rust SSE2 support getting into release, when does encoding_rs land?
(In reply to Shu-yu Guo [:shu] from comment #47)
> If all goes well with Rust SSE2 support getting into release, when does
> encoding_rs land?

My guess is that I'll be done with the megapatch for m-c in early Q2, which I estimate to be before explicit SIMD gets into the Rust stable channel, so my best current guess is "when explicit SIMD gets into the Rust stable channel". That's not a very useful answer, I know. Once I have the megapatch up and running and can get Talos numbers for startup time with encoding_rs without SSE2, I intend to start a discussion on alternatives to waiting for explicit SIMD on the Rust stable channel.

If the new UTF-8 parser for JS is ready sooner, IsASCII() and IsUTF8() (in nsReadableUtils) can be used to approximate (best-case scenarios only) the upcoming encoding_rs functionality as a temporary band-aid in nsScriptLoader.
Whiteboard: [qf]
Whiteboard: [qf] → [qf:p1]
Henri: I'm somewhat confused. There seem to be two distinct cases here.

The first one, and most high-value case, is the situation where an external script is encoded in UTF-8, and gecko passes that script directly to a SpiderMonkey UTF-8 parser without inflating it to UTF-16 first.

The second one, and lower-value case, is the situation where we have a JS string that may be "<0x100" only, and we pass that to a SpiderMonkey "Latin1" parser. This is the case where a preflight check would be required to confirm that the entire string is 8bit, right?

Am I understanding this correctly?
Flags: needinfo?(hsivonen)
(In reply to Kannan Vijayan [:djvj] from comment #49)
> There seem to be two distinct cases here. The first one, and most
> high-value case, is the situation where an external script is encoded in
> UTF-8, and gecko passes that script directly to a SpiderMonkey UTF-8 parser
> without inflating it to UTF-16 first.

(This case also applies when the external script is in a non-UTF-8 ASCII-compatible encoding and happens to only contain ASCII. encoding_rs handles this automatically when requesting conversion to UTF-8. In the present pre-encoding_rs situation, the behavior can be emulated with IsASCII() from nsReadableUtils.)

> The second one, and lower-value case, is the situation where we have a JS
> string that may be "<0x100" only, and we pass that to a SpiderMonkey
> "Latin1" parser. This is the case where a preflight check would be required
> to confirm that the entire string is 8bit, right?

Most of the preflight happens when the DOM text nodes are created. If all code points in a text node are < 0x100, the leading zeros are omitted at text node creation time (only the low 8 bits per code unit are stored). The preflight needed at the time of script execution is iterating over all the text nodes that form the internal script and checking that mState.mIs2b is false for all of them. (Usually, of course, there's only one text node.)
Flags: needinfo?(hsivonen)
Clearing needinfos to the extent possible now; don't look to me for further insightful comments here. :-)

(In reply to David Teller [:Yoric] (please use "needinfo") from comment #35)
> 1. We know that some competing VMs are iterating through the source first,
> to find out whether they can use an optimized tokenizer, and it provides
> them with performance benefits.

I'm unclear exactly how much they've checked this, and also not sure about UTF-8 tokenizing versus what we call Latin-1 or its ASCII subset. But of course we need to do our own measurement regardless.

> 2. If the source is Latin 1, we know that we can speed up tokenizing, in
> particular interning strings.

It doesn't seem like it should be inordinately difficult to have optimal atomizing even of UTF-8. You could easily preflight the whole string and then have single-byte/UTF-8 versions selected based on that, dynamically, and it adds up to effectively the same whole-script pass being posed as a possibility in #1 (just because it's a smaller range it possibly is less well optimized/cached in the CPU -- but I'd bet the true extent of the effect is smallish; we'd have to profile to say for sure).

> 3. If the source is ASCII (even if specified as UTF-8), well, it's a subset
> of Latin 1, so we can have this speed up, plus removing an additional `if`
> branch in a hot function.

I think everyone else has covered the "this branch should be in cold code" response.

> 4. If we know that the source is Latin 1, and if we have a specialized Latin
> 1 parser, we probably want to feed the Latin 1 directly from the DOM to the
> specialized JS parser, without transcoding.

I didn't know the ISO-8859-1 really labels Windows-1252 thing and am now Sad.
Flags: needinfo?(jwalden+bmo)
The main single-byte lexer bug will remain qf:p1; this was just an investigation.
Whiteboard: [qf:p1] → [qf]
Whiteboard: [qf] → [qf-]
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #37)
> Is there an ETA for encoding_rs?

It landed.
-> bug 1351107 for ongoing work
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Performance Impact: --- → -
Whiteboard: [qf-]