Closed Bug 1094318 Opened 10 years ago Closed 10 years ago

nsTextToSubURI::convertURItoUnicode does not work correctly when aCharset=UTF-16

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED INVALID

People

(Reporter: MatsPalmgren_bugz, Assigned: MatsPalmgren_bugz)

Details

Attachments

(1 file, 1 obsolete file)

Followup from bug 415491 comment 22 / 23.
So to followup our example "a%00" from bug 415491: With 'esc_SkipControl' removed in UnEscapeURIForUI, and commenting out lines 159 .. 169 here: http://mxr.mozilla.org/mozilla-central/source/intl/uconv/nsTextToSubURI.cpp?rev=3c98330df76f#149 then I get the results: "a" (0x61 0x00) for "utf-16" and "utf-16le", and "愀" (0x00 0x61) for "utf-16be". So at least the latter part of convertURItoUnicode seems to do the right thing.
Attached patch wip (obsolete) (deleted) — Splinter Review
This passes tests on Try: https://tbpl.mozilla.org/?tree=Try&rev=90f3c625d81a UnEscapeURIForUI in pseudo code: 1. unescape everything, including control characters 2. convertURItoUnicode 3. are there any blacklisted characters? yes: reescape all non-ASCII, including control characters no: reescape only control characters
Still, the unescaping in 1) seems odd to me -- we unescape as if it's ASCII, but what if aCharset = UTF-16 ? Or does the given aCharset only describe what the result is *after* that ASCII-based unescaping ?
Attached patch fix (deleted) — Splinter Review
This adds an else-branch for non-ASCII compatible 'aCharset'. It does things in a different order: convertURItoUnicode is called first, then NS_UnescapeURL. I also made convertURItoUnicode check the charset before assuming it's ASCII/UTF-8. I also removed 'esc_AlwaysCopy' in both branches to avoid making a string copy in NS_UnescapeURL when there are no escape sequences. https://tbpl.mozilla.org/?tree=Try&rev=a3b260fd5a24
Assignee: nobody → mats
Attachment #8518247 - Attachment is obsolete: true
Attachment #8520334 - Flags: review?(bzbarsky)
Mats, can you find a different reviewer for this? I'm pretty swamped for a while. :(
Sure, no problem. I think Simon is back now, or will be soon.
Attachment #8520334 - Flags: review?(bzbarsky) → review?(smontagu)
Comment on attachment 8520334 [details] [diff] [review] fix Review of attachment 8520334 [details] [diff] [review]: ----------------------------------------------------------------- ::: intl/uconv/nsTextToSubURI.cpp @@ +192,3 @@ > > + if (aIRI) { > + const bool tryAsUTF8 = tryAsASCII; Why do you need two booleans with the same value? What is more, the statefulCharset() helper is only used right here. You could move the IsAsciiCompatible check inside statefulCharset() (maybe changing the name to asciiSafeCharset() or something) and simplify the whole thing to something like if (asciiSafeCharset(aCharset.get())) { if (IsASCII(aURI) { CopyASCIItoUTF16(aURI, _retval); return NS_OK; } if (aIRI && IsUTF8(aURI) { CopyUTF8toUTF16(aURI, _retval); return NS_OK; } } (the aIRI arg also looks ripe for removal IIANM, but I'll try not to get too far out of scope ;-) ) @@ +244,5 @@ > + // Test for != NS_OK rather than NS_FAILED, because incomplete multi-byte > + // sequences are also considered failure in this context > + if (convertURItoUnicode(charset, unescapedURI, true, _retval) != NS_OK) { > + // Assume UTF-8 instead of ASCII because hostname (IDN) may be in UTF-8. > + CopyUTF8toUTF16(aURIFragment, _retval); Nit: this would be a little clearer if you used flatURI instead of aURIFragment here too.
Attachment #8520334 - Flags: review?(smontagu)
This caused a test failure: https://tbpl.mozilla.org/?tree=Try&rev=1b216057a434 /html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html | Scheme javascript (getting <a>.href) - assert_true: expected substring %C3%A5 got javascript:"?x=å" expected true got false In nsTextToSubURI::UnEscapeNonAsciiURI we have aCharset="UTF-16BE" and aURIFragment="javascript:"?x=Ã¥" (the last two are: 0xC3 and 0xA5) so the string is already unescaped on entry and 'unescapedSpec' is the same. IsUTF8(unescapedSpec) is true, so with the current code we take the early return in convertURItoUnicode (CopyUTF8toUTF16). I think my premise that we would get strings encoded per aCharset in convertURItoUnicode is simply wrong and the bug invalid. I'm not really sure what aCharset is good for, perhaps it's only used for some "error correction" in the case the original string resulted in an invalid UTF-8 string, and that the "converter" path at the end of convertURItoUnicode somehow takes it into account. Sorry for wasting your time Simon. :-/
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
> the aIRI arg also looks ripe for removal IIANM Yep, I'll kill it in bug 1101625.
Would aCharset be the charset to use for things that get unescaped? So if there are no escapes to start with, it's not relevant...
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: