Closed Bug 1289003 Opened 8 years ago Closed 8 years ago

Add API to create JSString from UTF-8 string efficiently.

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla51
Tracking Status
firefox50 --- affected
firefox51 --- fixed

People

(Reporter: arai, Assigned: arai)

References

Details

Attachments

(3 files, 6 obsolete files)

We're going to change the error reporting API and internal to use UTF-8 in bug 1283710. So we'd need API to create JSString from UTF-8 string (const char*). JS_NewStringCopyFromUTF8, and maybe JS_NewStringCopyFromUTF8Lossy or something if needed. We already have an API to convert JSString to UTF-8 (JS_EncodeStringToUTF8), so only API for UTF-8 to JSString is needed. I have a draft patch, that was originally for bug 1283710. will post it shortly.
looks like similar API (JS_NewStringCopyUTF8Z and JS_NewStringCopyUTF8N) are going to be added in bug 987069. hopefully I can improve the efficiency :)
Blocks: 1295017
No longer blocks: 1283710
Depends on: 1294940
Added UTF8CharsToNewLatin1CharsZ and LossyUTF8CharsToNewLatin1CharsZ, that converts UTF-8 encoded string to Latin1. Those function suppose the input UTF-8 encoded string *can* be encoded in Latin1, and doesn't handle overflow. That is because those are supposed to be used after PossibleMinumumEncodingForUTF8 function, that is added in Part 2, that returns possible minimum encoding for the given UTF-8 string. To make those APIs, added CharT typedef for {,Const}{Latin1,UTF8,TwoBytes}Chars{,Z} classes, to use it from InflateUTF8StringHelper, that is now templatized for return type, that could now be TwoByteCharsZ or Latin1CharsZ.
Attached patch Part 2: Add PossibleMinumumEncodingForUTF8. (obsolete) (deleted) — Splinter Review
As mentioned above, PossibleMinumumEncodingForUTF8 function returns possible minimum encoding for the given UTF-8 string, from one of ASCII, Latin1, UTF16. ASCII for char <= 0x7f Latin1 for char <= 0xff UTF16 for other case
JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z functions use those above functions. Those funtions first check the possible minimum encoding, and then use NewStringCopyN for ASCII, UTF8CharsToNewLatin1CharsZ+NewString for Latin1, UTF8CharsToNewTwoByteCharsZ+NewString for UTF-16. This is done to reduce the number of copy or allocation. for ASCII case, we can just copy the string. for other cases, we need to convert UTF-8 encoded string to Latin1 or UTF-16 first, and then allocate JSString. by adding and using PossibleMinumumEncodingForUTF8 and UTF8CharsToNewLatin1CharsZ, we don't need temporary space for any case.
I'll ask review after bug 1294940.
rebased
Attachment #8781384 - Attachment is obsolete: true
Attachment #8786525 - Flags: review?(jwalden+bmo)
Attached patch Part 2: Add PossibleMinumumEncodingForUTF8. (obsolete) (deleted) — Splinter Review
Attachment #8781385 - Attachment is obsolete: true
Attachment #8786526 - Flags: review?(jwalden+bmo)
Attachment #8781387 - Attachment is obsolete: true
Attachment #8786527 - Flags: review?(jwalden+bmo)
Comment on attachment 8786526 [details] [diff] [review] Part 2: Add PossibleMinumumEncodingForUTF8. Review of attachment 8786526 [details] [diff] [review]: ----------------------------------------------------------------- With these few comments, enough changes that I should just punt for a closer look at something much closer to the final deal. ::: js/public/CharacterEncoding.h @@ +288,5 @@ > DeflateStringToUTF8Buffer(JSFlatString* src, mozilla::RangedPtr<char> dst, > size_t* dstlenp = nullptr, size_t* numcharsp = nullptr); > > /* > + * Returns the possible minimum encoding for the given UTF8 string. The enum shouldn't include "minimum" in it -- that's a function of the API function, not the enum. Suggest just "Encoding" for a name, and this as comment: """ JSAPI character encodings. """ @@ +296,5 @@ > + Latin1, > + UTF16 > +}; > +JS_PUBLIC_API(PossibleMinumumEncoding) > +PossibleMinumumEncodingForUTF8(UTF8Chars utf8); I'd say """ Returns the most restricted encoding possible for the given string: if all codepoints are <128 then ASCII, otherwise if all codepoints are <256 Latin-1, else UTF16. """ and name the function MostRestrictedEncoding. "ForUTF8" is implicit in the function argument, doesn't need to be spelled out. ::: js/src/vm/CharacterEncoding.cpp @@ +262,5 @@ > // LossyConvertUTF8toUTF16() in dom/wifi/WifiUtils.cpp > template <InflateUTF8Action Action, typename CharT> > static bool > InflateUTF8StringToBuffer(JSContext* cx, const UTF8Chars src, CharT* dst, size_t* dstlenp, > + bool* isAsciip, bool* isLatin1) Instead of multiple bool*, why not |Encoding* mostRestrictedEncoding|? (The 'p' at end doesn't seem necessary to me, not least because we're not consistent about doing it.) Doesn't admit nonsense like |*isAsciip == true && *isLatin1 == false|, and users get told exactly what encoding is best, without having to reconstruct it from bools. That also means you could not add a new InflateUTF8Action -- just replace all |*isAsciip| assignments with assigments to mRE. @@ +332,5 @@ > > // Check the continuation bytes. > for (uint32_t m = 1; m < n; m++) > if ((src[i + m] & 0xC0) != 0x80) > INVALID(ReportInvalidCharacter, i, m); Put braces around the for-loop while you're here. @@ +394,5 @@ > ReportOutOfMemory(cx); > return CharsT(); > } > > if (isAscii) { In a most-restricted-encoding outparam world @@ +444,5 @@ > + utf8, > + /* dst = */ nullptr, > + /* dstlen = */ nullptr, > + &isAscii, > + &isLatin1))); With just one outparam this becomes easy -- just return that outparam, none of this if-stuff.
Attachment #8786526 - Flags: review?(jwalden+bmo) → review-
Comment on attachment 8786525 [details] [diff] [review] Part 1: Add UTF8CharsToNewLatin1CharsZ, LossyUTF8CharsToNewLatin1CharsZ. Review of attachment 8786525 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/public/CharacterEncoding.h @@ +290,5 @@ > > +/* > + * Inflate bytes in UTF-8 encoding to Latin1 string. > + * The caller must check if the given UTF-8 string can be encoded with Latin1 > + * before calling this. Wait, that's not how this works. As stated, this makes Latin-1 encoding of the string a *precondition*, and if you don't do it, it's undefined what happens. This sentence really should be something like, "If the string contains non-Latin-1 characters, reports an error." It's really perfectly fine to call this with a non-Latin-1 string, as long as you understand the consequences. :-) @@ +301,5 @@ > + > +/* > + * The same as UTF8CharsToNewLatin1CharsZ(), except that any malformed UTF-8 > + * characters will be replaced by "?". No exception will be thrown for > + * malformed UTF-8 input. I wouldn't document by reference with alterations here, just say things separately -- there's too little complexity here to not just spit it out, because document-by-reference increases the burden on anyone reading the comment. So I'd have these two doc-comments: """ Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|. Fail and report an error if the string contains non-Latin-1 codepoints. Returns Latin1CharsZ() on failure. """ """ Return a null-terminated Latin-1 string copied from the input string, storing its length (excluding null terminator) in |*outlen|. Non-Latin-1 codepoints are replaced by '?'. Returns Latin1CharsZ() on failure. """ ::: js/src/vm/CharacterEncoding.cpp @@ +6,5 @@ > > #include "js/CharacterEncoding.h" > > #include "mozilla/Range.h" > +#include "mozilla/TypeTraits.h" These days you can use <type_traits> -- someone's converting things over, might as well not add new uses when we're not interplaying with existing code. @@ +254,5 @@ > Copy > }; > > static const uint32_t REPLACE_UTF8 = 0xFFFD; > +static const uint32_t REPLACE_UTF8_LATIN1 = '?'; Make these char16_t and Latin1Char typed. @@ +266,5 @@ > { > if (Action != AssertNoInvalids) > *isAsciip = true; > > + // Count how many characters need to be in the inflated string. s/characters/code units/ so there's no confusion about "characters" and what that might mean. @@ +280,5 @@ > > } else { > // Non-ASCII code unit. Determine its length in bytes (n). > if (Action != AssertNoInvalids) > *isAsciip = false; Along with the Encoding* outparam thing, I might add lambdas like so: auto RequireLatin1 = [&mostRestrictedEncoding]{ mostRestrictedEncoding = std::max(Latin1, mostRestrictedEncoding); }; auto RequireUTF16 = [&mostRestrictedEncoding]{ mostRestrictedEncoding = UTF16; }; and then call those in the right places. The Latin-1 bit is the tricky one, because it has to *conditionally* upgrade, and not overwrite once UTF16 is encountered. Although, hm. This does bring to mind that *if* we have a CheckEncoding enum as I suggested removing in the last review, *if* we have that we can just return immediately once a non-Latin-1 character is encountered. That probably motivates having CheckEncoding after all. @@ +294,5 @@ > } else if (Action == AssertNoInvalids) { \ > MOZ_CRASH("invalid UTF-8 string: " # report); \ > } else { \ > + if (Action == Copy) { \ > + if (mozilla::IsSame<CharT, Latin1Char>::value) \ std::is_same<decltype(dst[0]), Latin1Char>::value so that we don't have to wonder, however briefly, if CharT is |*dst|'s type or not. @@ +371,3 @@ > InflateUTF8StringHelper(JSContext* cx, const UTF8Chars src, size_t* outlen) > { > + typedef typename CharsT::CharT CharT; using CharT = CharsT::CharT; maybe with a typename in there. Basically at this point, never typedef, always using. :-)
Attachment #8786525 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8786526 [details] [diff] [review] Part 2: Add PossibleMinumumEncodingForUTF8. Review of attachment 8786526 [details] [diff] [review]: ----------------------------------------------------------------- Okay, previous r- more or less stands, just a couple slightly different takes on some things. ::: js/public/CharacterEncoding.h @@ +296,5 @@ > + Latin1, > + UTF16 > +}; > +JS_PUBLIC_API(PossibleMinumumEncoding) > +PossibleMinumumEncodingForUTF8(UTF8Chars utf8); FindSmallestEncoding is a better name for this, on second thought. And I guess then SmallestEncoding is a better name than PossibleMinimumEncoding, with the doc-comment "The smallest character encoding capable of fully representing a particular string." ::: js/src/vm/CharacterEncoding.cpp @@ +251,5 @@ > CountAndReportInvalids, > CountAndIgnoreInvalids, > AssertNoInvalids, > + Copy, > + CheckEncoding FindSmallestEncoding seems better than "check", which implies the input might not be valid @@ +338,5 @@ > // Determine the code unit's length in CharT and act accordingly. > v = JS::Utf8ToOneUcs4Char((uint8_t*)&src[i], n); > + if (Action == CheckEncoding) { > + if (v > 0xff) > + *isLatin1 = false; We could/should just assert here that |dst| is null and immediately return SmallestEncoding::UTF16.
Comment on attachment 8786527 [details] [diff] [review] Part 3: Add JS_NewStringCopyUTF8N and JS_NewStringCopyUTF8Z. Review of attachment 8786527 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/vm/String.cpp @@ +1317,5 @@ > + Latin1Char* latin1 = UTF8CharsToNewLatin1CharsZ(cx, utf8, &length).get(); > + if (!latin1) > + return nullptr; > + > + Rooted<JSFlatString*> result(cx, NewString<allowGC>(cx, latin1, length)); Can just use JSFlatString* for this type (and for the other |result| below) -- the Rooted doesn't do anything.
Attachment #8786527 - Flags: review?(jwalden+bmo) → review+
addressed review comments. RequireLatin1/RequireUTF16 are in Part 2.
Attachment #8786525 - Attachment is obsolete: true
Attachment #8787434 - Flags: review+
For now, used "FindEncoding" instead of "FindSmallestEncoding" for enum value, as it conflicts with FindSmallestEncoding function, especially inside it. Do you have any other idea? (we might be able to use enum class, but it will make template parameter long...)
Attachment #8786526 - Attachment is obsolete: true
Attachment #8787437 - Flags: review?(jwalden+bmo)
Attachment #8786527 - Attachment is obsolete: true
Attachment #8787439 - Flags: review+
Comment on attachment 8787437 [details] [diff] [review] Part 2: Add PossibleMinumumEncodingForUTF8. Review of attachment 8787437 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/vm/CharacterEncoding.cpp @@ +268,3 @@ > { > + auto RequireLatin1 = [&smallestEncoding]{ > + *smallestEncoding = std::max(JS::SmallestEncoding::Latin1, *smallestEncoding); #include <algorithm> (I think?) at top of the file somewhere, if it's not there. @@ +443,5 @@ > +JS::SmallestEncoding > +JS::FindSmallestEncoding(UTF8Chars utf8) > +{ > + JS::SmallestEncoding encoding; > + JS_ALWAYS_TRUE((InflateUTF8StringToBuffer<FindEncoding, char16_t>( Use MOZ_ALWAYS_TRUE for everything new you're adding here, we need to delete the other.
Attachment #8787437 - Flags: review?(jwalden+bmo) → review+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: