Closed
Bug 1289003
Opened 8 years ago
Closed 8 years ago
Add API to create JSString from UTF-8 string efficiently.
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla51
People
(Reporter: arai, Assigned: arai)
References
Details
Attachments
(3 files, 6 obsolete files)
(deleted),
patch
|
arai
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Waldo
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
arai
:
review+
|
Details | Diff | Splinter Review |
We're going to change the error reporting API and internal to use UTF-8 in bug 1283710.
So we'd need API to create JSString from UTF-8 string (const char*).
JS_NewStringCopyFromUTF8, and maybe JS_NewStringCopyFromUTF8Lossy or something if needed.
We already have an API to convert JSString to UTF-8 (JS_EncodeStringToUTF8), so only API for UTF-8 to JSString is needed.
I have a draft patch, that was originally for bug 1283710.
will post it shortly.
Assignee | ||
Comment 1•8 years ago
|
||
looks like similar API (JS_NewStringCopyUTF8Z and JS_NewStringCopyUTF8N) are going to be added in bug 987069.
hopefully I can improve the efficiency :)
Assignee | ||
Updated•8 years ago
|
Assignee | ||
Comment 2•8 years ago
|
||
Added UTF8CharsToNewLatin1CharsZ and LossyUTF8CharsToNewLatin1CharsZ,
that converts UTF-8 encoded string to Latin1.
Those function suppose the input UTF-8 encoded string *can* be encoded in Latin1, and doesn't handle overflow.
That is because those are supposed to be used after PossibleMinumumEncodingForUTF8 function, that is added in Part 2, that returns possible minimum encoding for the given UTF-8 string.
To make those APIs, added CharT typedef for {,Const}{Latin1,UTF8,TwoBytes}Chars{,Z} classes, to use it from InflateUTF8StringHelper, that is now templatized for return type, that could now be TwoByteCharsZ or Latin1CharsZ.
Assignee | ||
Comment 3•8 years ago
|
||
As mentioned above, PossibleMinumumEncodingForUTF8 function returns possible minimum encoding for the given UTF-8 string, from one of ASCII, Latin1, UTF16.
ASCII for char <= 0x7f
Latin1 for char <= 0xff
UTF16 for other case
Assignee | ||
Comment 4•8 years ago
|
||
JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z functions use those above functions.
Those funtions first check the possible minimum encoding,
and then use NewStringCopyN for ASCII, UTF8CharsToNewLatin1CharsZ+NewString for Latin1, UTF8CharsToNewTwoByteCharsZ+NewString for UTF-16.
This is done to reduce the number of copy or allocation.
for ASCII case, we can just copy the string.
for other cases, we need to convert UTF-8 encoded string to Latin1 or UTF-16 first, and then allocate JSString.
by adding and using PossibleMinumumEncodingForUTF8 and UTF8CharsToNewLatin1CharsZ, we don't need temporary space for any case.
Assignee | ||
Comment 5•8 years ago
|
||
I'll ask review after bug 1294940.
Assignee | ||
Comment 6•8 years ago
|
||
rebased
Attachment #8781384 -
Attachment is obsolete: true
Attachment #8786525 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 7•8 years ago
|
||
Attachment #8781385 -
Attachment is obsolete: true
Attachment #8786526 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 8•8 years ago
|
||
Attachment #8781387 -
Attachment is obsolete: true
Attachment #8786527 -
Flags: review?(jwalden+bmo)
Comment 9•8 years ago
|
||
Comment on attachment 8786526 [details] [diff] [review]
Part 2: Add PossibleMinumumEncodingForUTF8.
Review of attachment 8786526 [details] [diff] [review]:
-----------------------------------------------------------------
With these few comments, enough changes that I should just punt for a closer look at something much closer to the final deal.
::: js/public/CharacterEncoding.h
@@ +288,5 @@
> DeflateStringToUTF8Buffer(JSFlatString* src, mozilla::RangedPtr<char> dst,
> size_t* dstlenp = nullptr, size_t* numcharsp = nullptr);
>
> /*
> + * Returns the possible minimum encoding for the given UTF8 string.
The enum shouldn't include "minimum" in it -- that's a function of the API function, not the enum. Suggest just "Encoding" for a name, and this as comment:
"""
JSAPI character encodings.
"""
@@ +296,5 @@
> + Latin1,
> + UTF16
> +};
> +JS_PUBLIC_API(PossibleMinumumEncoding)
> +PossibleMinumumEncodingForUTF8(UTF8Chars utf8);
I'd say
"""
Returns the most restricted encoding possible for the given string: if all codepoints are <128 then ASCII, otherwise if all codepoints are <256 Latin-1, else UTF16.
"""
and name the function MostRestrictedEncoding. "ForUTF8" is implicit in the function argument, doesn't need to be spelled out.
::: js/src/vm/CharacterEncoding.cpp
@@ +262,5 @@
> // LossyConvertUTF8toUTF16() in dom/wifi/WifiUtils.cpp
> template <InflateUTF8Action Action, typename CharT>
> static bool
> InflateUTF8StringToBuffer(JSContext* cx, const UTF8Chars src, CharT* dst, size_t* dstlenp,
> + bool* isAsciip, bool* isLatin1)
Instead of multiple bool*, why not |Encoding* mostRestrictedEncoding|? (The 'p' at end doesn't seem necessary to me, not least because we're not consistent about doing it.) Doesn't admit nonsense like |*isAsciip == true && *isLatin1 == false|, and users get told exactly what encoding is best, without having to reconstruct it from bools. That also means you could not add a new InflateUTF8Action -- just replace all |*isAsciip| assignments with assigments to mRE.
@@ +332,5 @@
>
> // Check the continuation bytes.
> for (uint32_t m = 1; m < n; m++)
> if ((src[i + m] & 0xC0) != 0x80)
> INVALID(ReportInvalidCharacter, i, m);
Put braces around the for-loop while you're here.
@@ +394,5 @@
> ReportOutOfMemory(cx);
> return CharsT();
> }
>
> if (isAscii) {
In a most-restricted-encoding outparam world
@@ +444,5 @@
> + utf8,
> + /* dst = */ nullptr,
> + /* dstlen = */ nullptr,
> + &isAscii,
> + &isLatin1)));
With just one outparam this becomes easy -- just return that outparam, none of this if-stuff.
Attachment #8786526 -
Flags: review?(jwalden+bmo) → review-
Comment 10•8 years ago
|
||
Comment on attachment 8786525 [details] [diff] [review]
Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ.
Review of attachment 8786525 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/public/CharacterEncoding.h
@@ +290,5 @@
>
> +/*
> + * Inflate bytes in UTF-8 encoding to Latin1 string.
> + * The caller must check if the given UTF-8 string can be encoded with Latin1
> + * before calling this.
Wait, that's not how this works. As stated, this makes Latin-1 encoding of the string a *precondition*, and if you don't do it, it's undefined what happens. This sentence really should be something like, "If the string contains non-Latin-1 characters, reports an error." It's really perfectly fine to call this with a non-Latin-1 string, as long as you understand the consequences. :-)
@@ +301,5 @@
> +
> +/*
> + * The same as UTF8CharsToNewLatin1CharsZ(), except that any malformed UTF-8
> + * characters will be replaced by "?". No exception will be thrown for
> + * malformed UTF-8 input.
I wouldn't document by reference with alterations here, just say things separately -- there's too little complexity here to not just spit it out, because document-by-reference increases the burden on anyone reading the comment. So I'd have these two doc-comments:
"""
Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|. Fail and report an error if the string contains non-Latin-1 codepoints. Returns Latin1CharsZ() on failure.
"""
"""
Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|. Non-Latin-1 codepoints are replaced by '?'. Returns Latin1CharsZ() on failure.
"""
::: js/src/vm/CharacterEncoding.cpp
@@ +6,5 @@
>
> #include "js/CharacterEncoding.h"
>
> #include "mozilla/Range.h"
> +#include "mozilla/TypeTraits.h"
These days you can use <type_traits> -- someone's converting things over, might as well not add new uses when we're not interplaying with existing code.
@@ +254,5 @@
> Copy
> };
>
> static const uint32_t REPLACE_UTF8 = 0xFFFD;
> +static const uint32_t REPLACE_UTF8_LATIN1 = '?';
Make these char16_t and Latin1Char typed.
@@ +266,5 @@
> {
> if (Action != AssertNoInvalids)
> *isAsciip = true;
>
> + // Count how many characters need to be in the inflated string.
s/characters/code units/ so there's no confusion about "characters" and what that might mean.
@@ +280,5 @@
>
> } else {
> // Non-ASCII code unit. Determine its length in bytes (n).
> if (Action != AssertNoInvalids)
> *isAsciip = false;
Along with the Encoding* outparam thing, I might add lambdas like so:
auto RequireLatin1 = [&mostRestrictedEncoding]{
mostRestrictedEncoding = std::max(Latin1, mostRestrictedEncoding);
};
auto RequireUTF16 = [&mostRestrictedEncoding]{
mostRestrictedEncoding = UTF16;
};
and then call those in the right places. The Latin-1 bit is the tricky one, because it has to *conditionally* upgrade, and not overwrite once UTF16 is encountered.
Although, hm. This does bring to mind that *if* we have a CheckEncoding enum as I suggested removing in the last review, *if* we have that we can just return immediately once a non-Latin-1 character is encountered. That probably motivates having CheckEncoding after all.
@@ +294,5 @@
> } else if (Action == AssertNoInvalids) { \
> MOZ_CRASH("invalid UTF-8 string: " # report); \
> } else { \
> + if (Action == Copy) { \
> + if (mozilla::IsSame<CharT, Latin1Char>::value) \
std::is_same<decltype(dst[0]), Latin1Char>::value
so that we don't have to wonder, however briefly, if CharT is |*dst|'s type or not.
@@ +371,3 @@
> InflateUTF8StringHelper(JSContext* cx, const UTF8Chars src, size_t* outlen)
> {
> + typedef typename CharsT::CharT CharT;
using CharT = CharsT::CharT;
maybe with a typename in there. Basically at this point, never typedef, always using. :-)
Attachment #8786525 -
Flags: review?(jwalden+bmo) → review+
Comment 11•8 years ago
|
||
Comment on attachment 8786526 [details] [diff] [review]
Part 2: Add PossibleMinumumEncodingForUTF8.
Review of attachment 8786526 [details] [diff] [review]:
-----------------------------------------------------------------
Okay, previous r- more or less stands, just a couple slightly different takes on some things.
::: js/public/CharacterEncoding.h
@@ +296,5 @@
> + Latin1,
> + UTF16
> +};
> +JS_PUBLIC_API(PossibleMinumumEncoding)
> +PossibleMinumumEncodingForUTF8(UTF8Chars utf8);
FindSmallestEncoding is a better name for this, on second thought. And I guess then SmallestEncoding is a better name than PossibleMinimumEncoding, with the doc-comment "The smallest character encoding capable of fully representing a particular string."
::: js/src/vm/CharacterEncoding.cpp
@@ +251,5 @@
> CountAndReportInvalids,
> CountAndIgnoreInvalids,
> AssertNoInvalids,
> + Copy,
> + CheckEncoding
FindSmallestEncoding seems better than "check", which implies the input might not be valid
@@ +338,5 @@
> // Determine the code unit's length in CharT and act accordingly.
> v = JS::Utf8ToOneUcs4Char((uint8_t*)&src[i], n);
> + if (Action == CheckEncoding) {
> + if (v > 0xff)
> + *isLatin1 = false;
We could/should just assert here that |dst| is null and immediately return SmallestEncoding::UTF16.
Comment 12•8 years ago
|
||
Comment on attachment 8786527 [details] [diff] [review]
Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z.
Review of attachment 8786527 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/vm/String.cpp
@@ +1317,5 @@
> + Latin1Char* latin1 = UTF8CharsToNewLatin1CharsZ(cx, utf8, &length).get();
> + if (!latin1)
> + return nullptr;
> +
> + Rooted<JSFlatString*> result(cx, NewString<allowGC>(cx, latin1, length));
Can just use JSFlatString* for this type (and for the other |result| below) -- the Rooted doesn't do anything.
Attachment #8786527 -
Flags: review?(jwalden+bmo) → review+
Assignee | ||
Comment 13•8 years ago
|
||
addressed review comments.
RequireLatin1/RequireUTF16 are in Part 2.
Attachment #8786525 -
Attachment is obsolete: true
Attachment #8787434 -
Flags: review+
Assignee | ||
Comment 14•8 years ago
|
||
For now, used "FindEncoding" instead of "FindSmallestEncoding" for enum value, as it conflicts with FindSmallestEncoding function, especially inside it.
Do you have any other idea?
(we might be able to use enum class, but it will make template parameter long...)
Attachment #8786526 -
Attachment is obsolete: true
Attachment #8787437 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 15•8 years ago
|
||
Attachment #8786527 -
Attachment is obsolete: true
Attachment #8787439 -
Flags: review+
Comment 16•8 years ago
|
||
Comment on attachment 8787437 [details] [diff] [review]
Part 2: Add PossibleMinumumEncodingForUTF8.
Review of attachment 8787437 [details] [diff] [review]:
-----------------------------------------------------------------
::: js/src/vm/CharacterEncoding.cpp
@@ +268,3 @@
> {
> + auto RequireLatin1 = [&smallestEncoding]{
> + *smallestEncoding = std::max(JS::SmallestEncoding::Latin1, *smallestEncoding);
#include <algorithm>
(I think?) at top of the file somewhere, if it's not there.
@@ +443,5 @@
> +JS::SmallestEncoding
> +JS::FindSmallestEncoding(UTF8Chars utf8)
> +{
> + JS::SmallestEncoding encoding;
> + JS_ALWAYS_TRUE((InflateUTF8StringToBuffer<FindEncoding, char16_t>(
Use MOZ_ALWAYS_TRUE for everything new you're adding here, we need to delete the other.
Attachment #8787437 -
Flags: review?(jwalden+bmo) → review+
Assignee | ||
Comment 17•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/b4c7d481bf10a39a99cb41b89bc677434d80dc0c
Bug 1289003 - Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden
https://hg.mozilla.org/integration/mozilla-inbound/rev/1bbdfed5b149b231a27692542b6cee5b9f5138a8
Bug 1289003 - Part 2: Add FindSmallestEncoding. r=jwalden
https://hg.mozilla.org/integration/mozilla-inbound/rev/3abe6a1579f9bc79816ab781869b5232a1b3e483
Bug 1289003 - Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden
Assignee | ||
Comment 18•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/586730aa152198ece8f016f3098ad2fbf055090b
Bug 1289003 followup - Initialize *smallestEncoding. r=bustage
Assignee | ||
Comment 19•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/f73442255a8308c776282d1133724899dc2c2a3f
Backed out changeset 586730aa1521 (bug 1289003)
https://hg.mozilla.org/integration/mozilla-inbound/rev/e1942a9b87424c15245bea5f86f7fcdc14c0df42
Backed out changeset 3abe6a1579f9 (bug 1289003)
https://hg.mozilla.org/integration/mozilla-inbound/rev/6135cb7ef5bda300672ce7290ccf905b2cab68ff
Backed out changeset 1bbdfed5b149 (bug 1289003)
https://hg.mozilla.org/integration/mozilla-inbound/rev/d18edfa7a8f9d0d9ed6b48c6038039f149c46d1d
Backed out changeset b4c7d481bf10 (bug 1289003)
Assignee | ||
Comment 20•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/bae24022813be38a0433b39469d4cd38ff288ce2
Bug 1289003 - Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden
https://hg.mozilla.org/integration/mozilla-inbound/rev/18b15a11adf0a6f5c297c48364993a231fecbf91
Bug 1289003 - Part 2: Add FindSmallestEncoding. r=jwalden
https://hg.mozilla.org/integration/mozilla-inbound/rev/9f509473b88714d6dbd7666b6896551d669a6fda
Bug 1289003 - Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden
Comment 21•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/bae24022813b
https://hg.mozilla.org/mozilla-central/rev/18b15a11adf0
https://hg.mozilla.org/mozilla-central/rev/9f509473b887
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox51:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla51
You need to log in
before you can comment on or make changes to this bug.
Description
•