Closed
Bug 1094318
Opened 10 years ago
Closed 10 years ago
nsTextToSubURI::convertURItoUnicode does not work correctly when aCharset=UTF-16
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
INVALID
People
(Reporter: MatsPalmgren_bugz, Assigned: MatsPalmgren_bugz)
Details
Attachments
(1 file, 1 obsolete file)
(deleted),
patch
|
Details | Diff | Splinter Review |
Followup from bug 415491 comment 22 / 23.
Assignee | ||
Comment 1•10 years ago
|
||
So to followup our example "a%00" from bug 415491:
With 'esc_SkipControl' removed in UnEscapeURIForUI,
and commenting out lines 159 .. 169 here:
http://mxr.mozilla.org/mozilla-central/source/intl/uconv/nsTextToSubURI.cpp?rev=3c98330df76f#149
then I get the results: "a" (0x61 0x00) for "utf-16" and "utf-16le",
and "愀" (0x00 0x61) for "utf-16be". So at least the latter part of
convertURItoUnicode seems to do the right thing.
Assignee | ||
Comment 2•10 years ago
|
||
This passes tests on Try: https://tbpl.mozilla.org/?tree=Try&rev=90f3c625d81a
UnEscapeURIForUI in pseudo code:
1. unescape everything, including control characters
2. convertURItoUnicode
3. are there any blacklisted characters?
yes: reescape all non-ASCII, including control characters
no: reescape only control characters
Assignee | ||
Comment 3•10 years ago
|
||
Still, the unescaping in 1) seems odd to me -- we unescape as if it's ASCII, but what
if aCharset = UTF-16 ? Or does the given aCharset only describe what the result is
*after* that ASCII-based unescaping ?
Assignee | ||
Comment 4•10 years ago
|
||
This adds an else-branch for non-ASCII compatible 'aCharset'.
It does things in a different order: convertURItoUnicode is called first,
then NS_UnescapeURL. I also made convertURItoUnicode check the charset
before assuming it's ASCII/UTF-8.
I also removed 'esc_AlwaysCopy' in both branches to avoid making a string
copy in NS_UnescapeURL when there are no escape sequences.
https://tbpl.mozilla.org/?tree=Try&rev=a3b260fd5a24
Assignee: nobody → mats
Attachment #8518247 -
Attachment is obsolete: true
Attachment #8520334 -
Flags: review?(bzbarsky)
Assignee | ||
Comment 5•10 years ago
|
||
FYI, here's the IsAsciiCompatible I'm using:
http://mxr.mozilla.org/mozilla-central/source/dom/encoding/EncodingUtils.cpp#57
Comment 6•10 years ago
|
||
Mats, can you find a different reviewer for this? I'm pretty swamped for a while. :(
Assignee | ||
Comment 7•10 years ago
|
||
Sure, no problem. I think Simon is back now, or will be soon.
Assignee | ||
Updated•10 years ago
|
Attachment #8520334 -
Flags: review?(bzbarsky) → review?(smontagu)
Comment 8•10 years ago
|
||
Comment on attachment 8520334 [details] [diff] [review]
fix
Review of attachment 8520334 [details] [diff] [review]:
-----------------------------------------------------------------
::: intl/uconv/nsTextToSubURI.cpp
@@ +192,3 @@
>
> + if (aIRI) {
> + const bool tryAsUTF8 = tryAsASCII;
Why do you need two booleans with the same value?
What is more, the statefulCharset() helper is only used right here. You could move the IsAsciiCompatible check inside statefulCharset() (maybe changing the name to asciiSafeCharset() or something) and simplify the whole thing to something like
if (asciiSafeCharset(aCharset.get())) {
if (IsASCII(aURI) {
CopyASCIItoUTF16(aURI, _retval);
return NS_OK;
}
if (aIRI && IsUTF8(aURI) {
CopyUTF8toUTF16(aURI, _retval);
return NS_OK;
}
}
(the aIRI arg also looks ripe for removal IIANM, but I'll try not to get too far out of scope ;-) )
@@ +244,5 @@
> + // Test for != NS_OK rather than NS_FAILED, because incomplete multi-byte
> + // sequences are also considered failure in this context
> + if (convertURItoUnicode(charset, unescapedURI, true, _retval) != NS_OK) {
> + // Assume UTF-8 instead of ASCII because hostname (IDN) may be in UTF-8.
> + CopyUTF8toUTF16(aURIFragment, _retval);
Nit: this would be a little clearer if you used flatURI instead of aURIFragment here too.
Attachment #8520334 -
Flags: review?(smontagu)
Assignee | ||
Comment 9•10 years ago
|
||
This caused a test failure: https://tbpl.mozilla.org/?tree=Try&rev=1b216057a434
/html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html | Scheme javascript (getting <a>.href) - assert_true: expected substring %C3%A5 got javascript:"?x=å" expected true got false
In nsTextToSubURI::UnEscapeNonAsciiURI we have aCharset="UTF-16BE"
and aURIFragment="javascript:"?x=Ã¥" (the last two are: 0xC3 and 0xA5)
so the string is already unescaped on entry and 'unescapedSpec' is the same.
IsUTF8(unescapedSpec) is true, so with the current code we take the early
return in convertURItoUnicode (CopyUTF8toUTF16).
I think my premise that we would get strings encoded per aCharset in
convertURItoUnicode is simply wrong and the bug invalid.
I'm not really sure what aCharset is good for, perhaps it's only used
for some "error correction" in the case the original string resulted
in an invalid UTF-8 string, and that the "converter" path at the
end of convertURItoUnicode somehow takes it into account.
Sorry for wasting your time Simon. :-/
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID
Assignee | ||
Comment 10•10 years ago
|
||
> the aIRI arg also looks ripe for removal IIANM
Yep, I'll kill it in bug 1101625.
Comment 11•10 years ago
|
||
Would aCharset be the charset to use for things that get unescaped? So if there are no escapes to start with, it's not relevant...
You need to log in
before you can comment on or make changes to this bug.
Description
•