Closed Bug 1289003 Opened 8 years ago Closed 8 years ago

Add API to create JSString from UTF-8 string efficiently.

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla51

Tracking Flags:

Tracking

Status

firefox50

---

affected

firefox51

---

fixed

People

(Reporter: arai, Assigned: arai)

References

Details

Attachments

(3 files, 6 obsolete files)

Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. 8 years ago Tooru Fujisawa [:arai] (deleted), patch		Details \| Diff \| Splinter Review
Part 2: Add PossibleMinumumEncodingForUTF8. 8 years ago Tooru Fujisawa [:arai] (deleted), patch		Details \| Diff \| Splinter Review
Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. 8 years ago Tooru Fujisawa [:arai] (deleted), patch		Details \| Diff \| Splinter Review
Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. 8 years ago Tooru Fujisawa [:arai] (deleted), patch	Waldo : review+	Details \| Diff \| Splinter Review
Part 2: Add PossibleMinumumEncodingForUTF8. 8 years ago Tooru Fujisawa [:arai] (deleted), patch	Waldo : review-	Details \| Diff \| Splinter Review
Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. 8 years ago Tooru Fujisawa [:arai] (deleted), patch	Waldo : review+	Details \| Diff \| Splinter Review
Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden 8 years ago Tooru Fujisawa [:arai] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review
Part 2: Add PossibleMinumumEncodingForUTF8. 8 years ago Tooru Fujisawa [:arai] (deleted), patch	Waldo : review+	Details \| Diff \| Splinter Review
Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden 8 years ago Tooru Fujisawa [:arai] (deleted), patch	arai : review+	Details \| Diff \| Splinter Review

Tooru Fujisawa [:arai]

Assignee

Description

•

8 years ago

We're going to change the error reporting API and internal to use UTF-8 in bug 1283710. So we'd need API to create JSString from UTF-8 string (const char*). JS_NewStringCopyFromUTF8, and maybe JS_NewStringCopyFromUTF8Lossy or something if needed. We already have an API to convert JSString to UTF-8 (JS_EncodeStringToUTF8), so only API for UTF-8 to JSString is needed. I have a draft patch, that was originally for bug 1283710. will post it shortly.

Tooru Fujisawa [:arai]

Assignee

Comment 1

•

8 years ago

looks like similar API (JS_NewStringCopyUTF8Z and JS_NewStringCopyUTF8N) are going to be added in bug 987069. hopefully I can improve the efficiency :)

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=987069

Tooru Fujisawa [:arai]

Assignee

Updated

•

8 years ago

Blocks: 1295017
No longer blocks: 1283710

Tooru Fujisawa [:arai]

Assignee

Updated

•

8 years ago

Depends on: 1294940

Tooru Fujisawa [:arai]

Assignee

Comment 2

•

8 years ago

Attached patch Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. (obsolete) (deleted) — Details — Splinter Review

Added UTF8CharsToNewLatin1CharsZ and LossyUTF8CharsToNewLatin1CharsZ, that converts UTF-8 encoded string to Latin1. Those function suppose the input UTF-8 encoded string *can* be encoded in Latin1, and doesn't handle overflow. That is because those are supposed to be used after PossibleMinumumEncodingForUTF8 function, that is added in Part 2, that returns possible minimum encoding for the given UTF-8 string. To make those APIs, added CharT typedef for {,Const}{Latin1,UTF8,TwoBytes}Chars{,Z} classes, to use it from InflateUTF8StringHelper, that is now templatized for return type, that could now be TwoByteCharsZ or Latin1CharsZ.

Tooru Fujisawa [:arai]

Assignee

Comment 3

•

8 years ago

Attached patch Part 2: Add PossibleMinumumEncodingForUTF8. (obsolete) (deleted) — Details — Splinter Review

As mentioned above, PossibleMinumumEncodingForUTF8 function returns possible minimum encoding for the given UTF-8 string, from one of ASCII, Latin1, UTF16. ASCII for char <= 0x7f Latin1 for char <= 0xff UTF16 for other case

Tooru Fujisawa [:arai]

Assignee

Comment 4

•

8 years ago

Attached patch Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. (obsolete) (deleted) — Details — Splinter Review

JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z functions use those above functions. Those funtions first check the possible minimum encoding, and then use NewStringCopyN for ASCII, UTF8CharsToNewLatin1CharsZ+NewString for Latin1, UTF8CharsToNewTwoByteCharsZ+NewString for UTF-16. This is done to reduce the number of copy or allocation. for ASCII case, we can just copy the string. for other cases, we need to convert UTF-8 encoded string to Latin1 or UTF-16 first, and then allocate JSString. by adding and using PossibleMinumumEncodingForUTF8 and UTF8CharsToNewLatin1CharsZ, we don't need temporary space for any case.

Tooru Fujisawa [:arai]

Assignee

Comment 5

•

8 years ago

I'll ask review after bug 1294940.

Tooru Fujisawa [:arai]

Assignee

Comment 6

•

8 years ago

Attached patch Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. (obsolete) (deleted) — Details — Splinter Review

rebased

Attachment #8781384 - Attachment is obsolete: true

Attachment #8786525 - Flags: review?(jwalden+bmo)

Tooru Fujisawa [:arai]

Assignee

Comment 7

•

8 years ago

Attached patch Part 2: Add PossibleMinumumEncodingForUTF8. (obsolete) (deleted) — Details — Splinter Review

Attachment #8781385 - Attachment is obsolete: true

Attachment #8786526 - Flags: review?(jwalden+bmo)

Tooru Fujisawa [:arai]

Assignee

Comment 8

•

8 years ago

Attached patch Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. (obsolete) (deleted) — Details — Splinter Review

Attachment #8781387 - Attachment is obsolete: true

Attachment #8786527 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 9

•

8 years ago

Comment on attachment 8786526 [details] [diff] [review] Part 2: Add PossibleMinumumEncodingForUTF8. Review of attachment 8786526 [details] [diff] [review]: ----------------------------------------------------------------- With these few comments, enough changes that I should just punt for a closer look at something much closer to the final deal. ::: js/public/CharacterEncoding.h @@ +288,5 @@ > DeflateStringToUTF8Buffer(JSFlatString* src, mozilla::RangedPtr<char> dst, > size_t* dstlenp = nullptr, size_t* numcharsp = nullptr); > > /* > + * Returns the possible minimum encoding for the given UTF8 string. The enum shouldn't include "minimum" in it -- that's a function of the API function, not the enum. Suggest just "Encoding" for a name, and this as comment: """ JSAPI character encodings. """ @@ +296,5 @@ > + Latin1, > + UTF16 > +}; > +JS_PUBLIC_API(PossibleMinumumEncoding) > +PossibleMinumumEncodingForUTF8(UTF8Chars utf8); I'd say """ Returns the most restricted encoding possible for the given string: if all codepoints are <128 then ASCII, otherwise if all codepoints are <256 Latin-1, else UTF16. """ and name the function MostRestrictedEncoding. "ForUTF8" is implicit in the function argument, doesn't need to be spelled out. ::: js/src/vm/CharacterEncoding.cpp @@ +262,5 @@ > // LossyConvertUTF8toUTF16() in dom/wifi/WifiUtils.cpp > template <InflateUTF8Action Action, typename CharT> > static bool > InflateUTF8StringToBuffer(JSContext* cx, const UTF8Chars src, CharT* dst, size_t* dstlenp, > + bool* isAsciip, bool* isLatin1) Instead of multiple bool*, why not |Encoding* mostRestrictedEncoding|? (The 'p' at end doesn't seem necessary to me, not least because we're not consistent about doing it.) Doesn't admit nonsense like |*isAsciip == true && *isLatin1 == false|, and users get told exactly what encoding is best, without having to reconstruct it from bools. That also means you could not add a new InflateUTF8Action -- just replace all |*isAsciip| assignments with assigments to mRE. @@ +332,5 @@ > > // Check the continuation bytes. > for (uint32_t m = 1; m < n; m++) > if ((src[i + m] & 0xC0) != 0x80) > INVALID(ReportInvalidCharacter, i, m); Put braces around the for-loop while you're here. @@ +394,5 @@ > ReportOutOfMemory(cx); > return CharsT(); > } > > if (isAscii) { In a most-restricted-encoding outparam world @@ +444,5 @@ > + utf8, > + /* dst = */ nullptr, > + /* dstlen = */ nullptr, > + &isAscii, > + &isLatin1))); With just one outparam this becomes easy -- just return that outparam, none of this if-stuff.

Attachment #8786526 - Flags: review?(jwalden+bmo) → review-

Jeff Walden [:Waldo]

Comment 10

•

8 years ago

Comment on attachment 8786525 [details] [diff] [review] Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. Review of attachment 8786525 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/public/CharacterEncoding.h @@ +290,5 @@ > > +/* > + * Inflate bytes in UTF-8 encoding to Latin1 string. > + * The caller must check if the given UTF-8 string can be encoded with Latin1 > + * before calling this. Wait, that's not how this works. As stated, this makes Latin-1 encoding of the string a *precondition*, and if you don't do it, it's undefined what happens. This sentence really should be something like, "If the string contains non-Latin-1 characters, reports an error." It's really perfectly fine to call this with a non-Latin-1 string, as long as you understand the consequences. :-) @@ +301,5 @@ > + > +/* > + * The same as UTF8CharsToNewLatin1CharsZ(), except that any malformed UTF-8 > + * characters will be replaced by "?". No exception will be thrown for > + * malformed UTF-8 input. I wouldn't document by reference with alterations here, just say things separately -- there's too little complexity here to not just spit it out, because document-by-reference increases the burden on anyone reading the comment. So I'd have these two doc-comments: """ Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|. Fail and report an error if the string contains non-Latin-1 codepoints. Returns Latin1CharsZ() on failure. """ """ Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|. Non-Latin-1 codepoints are replaced by '?'. Returns Latin1CharsZ() on failure. """ ::: js/src/vm/CharacterEncoding.cpp @@ +6,5 @@ > > #include "js/CharacterEncoding.h" > > #include "mozilla/Range.h" > +#include "mozilla/TypeTraits.h" These days you can use <type_traits> -- someone's converting things over, might as well not add new uses when we're not interplaying with existing code. @@ +254,5 @@ > Copy > }; > > static const uint32_t REPLACE_UTF8 = 0xFFFD; > +static const uint32_t REPLACE_UTF8_LATIN1 = '?'; Make these char16_t and Latin1Char typed. @@ +266,5 @@ > { > if (Action != AssertNoInvalids) > *isAsciip = true; > > + // Count how many characters need to be in the inflated string. s/characters/code units/ so there's no confusion about "characters" and what that might mean. @@ +280,5 @@ > > } else { > // Non-ASCII code unit. Determine its length in bytes (n). > if (Action != AssertNoInvalids) > *isAsciip = false; Along with the Encoding* outparam thing, I might add lambdas like so: auto RequireLatin1 = [&mostRestrictedEncoding]{ mostRestrictedEncoding = std::max(Latin1, mostRestrictedEncoding); }; auto RequireUTF16 = [&mostRestrictedEncoding]{ mostRestrictedEncoding = UTF16; }; and then call those in the right places. The Latin-1 bit is the tricky one, because it has to *conditionally* upgrade, and not overwrite once UTF16 is encountered. Although, hm. This does bring to mind that *if* we have a CheckEncoding enum as I suggested removing in the last review, *if* we have that we can just return immediately once a non-Latin-1 character is encountered. That probably motivates having CheckEncoding after all. @@ +294,5 @@ > } else if (Action == AssertNoInvalids) { \ > MOZ_CRASH("invalid UTF-8 string: " # report); \ > } else { \ > + if (Action == Copy) { \ > + if (mozilla::IsSame<CharT, Latin1Char>::value) \ std::is_same<decltype(dst[0]), Latin1Char>::value so that we don't have to wonder, however briefly, if CharT is |*dst|'s type or not. @@ +371,3 @@ > InflateUTF8StringHelper(JSContext* cx, const UTF8Chars src, size_t* outlen) > { > + typedef typename CharsT::CharT CharT; using CharT = CharsT::CharT; maybe with a typename in there. Basically at this point, never typedef, always using. :-)

Attachment #8786525 - Flags: review?(jwalden+bmo) → review+

Jeff Walden [:Waldo]

Comment 11

•

8 years ago

Comment on attachment 8786526 [details] [diff] [review] Part 2: Add PossibleMinumumEncodingForUTF8. Review of attachment 8786526 [details] [diff] [review]: ----------------------------------------------------------------- Okay, previous r- more or less stands, just a couple slightly different takes on some things. ::: js/public/CharacterEncoding.h @@ +296,5 @@ > + Latin1, > + UTF16 > +}; > +JS_PUBLIC_API(PossibleMinumumEncoding) > +PossibleMinumumEncodingForUTF8(UTF8Chars utf8); FindSmallestEncoding is a better name for this, on second thought. And I guess then SmallestEncoding is a better name than PossibleMinimumEncoding, with the doc-comment "The smallest character encoding capable of fully representing a particular string." ::: js/src/vm/CharacterEncoding.cpp @@ +251,5 @@ > CountAndReportInvalids, > CountAndIgnoreInvalids, > AssertNoInvalids, > + Copy, > + CheckEncoding FindSmallestEncoding seems better than "check", which implies the input might not be valid @@ +338,5 @@ > // Determine the code unit's length in CharT and act accordingly. > v = JS::Utf8ToOneUcs4Char((uint8_t*)&src[i], n); > + if (Action == CheckEncoding) { > + if (v > 0xff) > + *isLatin1 = false; We could/should just assert here that |dst| is null and immediately return SmallestEncoding::UTF16.

Jeff Walden [:Waldo]

Comment 12

•

8 years ago

Comment on attachment 8786527 [details] [diff] [review] Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. Review of attachment 8786527 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/vm/String.cpp @@ +1317,5 @@ > + Latin1Char* latin1 = UTF8CharsToNewLatin1CharsZ(cx, utf8, &length).get(); > + if (!latin1) > + return nullptr; > + > + Rooted<JSFlatString*> result(cx, NewString<allowGC>(cx, latin1, length)); Can just use JSFlatString* for this type (and for the other |result| below) -- the Rooted doesn't do anything.

Attachment #8786527 - Flags: review?(jwalden+bmo) → review+

Tooru Fujisawa [:arai]

Assignee

Comment 13

•

8 years ago

Attached patch Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden (deleted) — Details — Splinter Review

addressed review comments. RequireLatin1/RequireUTF16 are in Part 2.

Attachment #8786525 - Attachment is obsolete: true

Attachment #8787434 - Flags: review+

Tooru Fujisawa [:arai]

Assignee

Comment 14

•

8 years ago

Attached patch Part 2: Add PossibleMinumumEncodingForUTF8. (deleted) — Details — Splinter Review

For now, used "FindEncoding" instead of "FindSmallestEncoding" for enum value, as it conflicts with FindSmallestEncoding function, especially inside it. Do you have any other idea? (we might be able to use enum class, but it will make template parameter long...)

Attachment #8786526 - Attachment is obsolete: true

Attachment #8787437 - Flags: review?(jwalden+bmo)

Tooru Fujisawa [:arai]

Assignee

Comment 15

•

8 years ago

Attached patch Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden (deleted) — Details — Splinter Review

Attachment #8786527 - Attachment is obsolete: true

Attachment #8787439 - Flags: review+

Jeff Walden [:Waldo]

Comment 16

•

8 years ago

Comment on attachment 8787437 [details] [diff] [review] Part 2: Add PossibleMinumumEncodingForUTF8. Review of attachment 8787437 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/vm/CharacterEncoding.cpp @@ +268,3 @@ > { > + auto RequireLatin1 = [&smallestEncoding]{ > + *smallestEncoding = std::max(JS::SmallestEncoding::Latin1, *smallestEncoding); #include <algorithm> (I think?) at top of the file somewhere, if it's not there. @@ +443,5 @@ > +JS::SmallestEncoding > +JS::FindSmallestEncoding(UTF8Chars utf8) > +{ > + JS::SmallestEncoding encoding; > + JS_ALWAYS_TRUE((InflateUTF8StringToBuffer<FindEncoding, char16_t>( Use MOZ_ALWAYS_TRUE for everything new you're adding here, we need to delete the other.

Attachment #8787437 - Flags: review?(jwalden+bmo) → review+

Tooru Fujisawa [:arai]

Assignee

Comment 17

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/b4c7d481bf10a39a99cb41b89bc677434d80dc0c Bug 1289003 - Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden https://hg.mozilla.org/integration/mozilla-inbound/rev/1bbdfed5b149b231a27692542b6cee5b9f5138a8 Bug 1289003 - Part 2: Add FindSmallestEncoding. r=jwalden https://hg.mozilla.org/integration/mozilla-inbound/rev/3abe6a1579f9bc79816ab781869b5232a1b3e483 Bug 1289003 - Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden

Tooru Fujisawa [:arai]

Assignee

Comment 18

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/586730aa152198ece8f016f3098ad2fbf055090b Bug 1289003 followup - Initialize *smallestEncoding. r=bustage

Tooru Fujisawa [:arai]

Assignee

Comment 19

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/f73442255a8308c776282d1133724899dc2c2a3f Backed out changeset 586730aa1521 (bug 1289003) https://hg.mozilla.org/integration/mozilla-inbound/rev/e1942a9b87424c15245bea5f86f7fcdc14c0df42 Backed out changeset 3abe6a1579f9 (bug 1289003) https://hg.mozilla.org/integration/mozilla-inbound/rev/6135cb7ef5bda300672ce7290ccf905b2cab68ff Backed out changeset 1bbdfed5b149 (bug 1289003) https://hg.mozilla.org/integration/mozilla-inbound/rev/d18edfa7a8f9d0d9ed6b48c6038039f149c46d1d Backed out changeset b4c7d481bf10 (bug 1289003)

Tooru Fujisawa [:arai]

Assignee

Comment 20

•

8 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/bae24022813be38a0433b39469d4cd38ff288ce2 Bug 1289003 - Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. r=jwalden https://hg.mozilla.org/integration/mozilla-inbound/rev/18b15a11adf0a6f5c297c48364993a231fecbf91 Bug 1289003 - Part 2: Add FindSmallestEncoding. r=jwalden https://hg.mozilla.org/integration/mozilla-inbound/rev/9f509473b88714d6dbd7666b6896551d669a6fda Bug 1289003 - Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. r=jwalden

Ryan VanderMeulen [:RyanVM]

Comment 21

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/bae24022813b https://hg.mozilla.org/mozilla-central/rev/18b15a11adf0 https://hg.mozilla.org/mozilla-central/rev/9f509473b887

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox51: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla51

You need to log in before you can comment on or make changes to this bug.