Closed Bug 1094318 Opened 10 years ago Closed 10 years ago

nsTextToSubURI::convertURItoUnicode does not work correctly when aCharset=UTF-16

Tracking

()

Status:

RESOLVED INVALID

People

(Reporter: MatsPalmgren_bugz, Assigned: MatsPalmgren_bugz)

Details

Attachments

(1 file, 1 obsolete file)

wip 10 years ago Mats Palmgren (inactive) (deleted), patch		Details \| Diff \| Splinter Review
fix 10 years ago Mats Palmgren (inactive) (deleted), patch		Details \| Diff \| Splinter Review

Mats Palmgren (inactive)

Assignee

Description

•

10 years ago

Followup from bug 415491 comment 22 / 23.

Mats Palmgren (inactive)

Assignee

Comment 1

•

10 years ago

So to followup our example "a%00" from bug 415491: With 'esc_SkipControl' removed in UnEscapeURIForUI, and commenting out lines 159 .. 169 here: http://mxr.mozilla.org/mozilla-central/source/intl/uconv/nsTextToSubURI.cpp?rev=3c98330df76f#149 then I get the results: "a" (0x61 0x00) for "utf-16" and "utf-16le", and "愀" (0x00 0x61) for "utf-16be". So at least the latter part of convertURItoUnicode seems to do the right thing.

Mats Palmgren (inactive)

Assignee

Comment 2

•

10 years ago

Attached patch wip (obsolete) (deleted) — Details — Splinter Review

This passes tests on Try: https://tbpl.mozilla.org/?tree=Try&rev=90f3c625d81a UnEscapeURIForUI in pseudo code: 1. unescape everything, including control characters 2. convertURItoUnicode 3. are there any blacklisted characters? yes: reescape all non-ASCII, including control characters no: reescape only control characters

Mats Palmgren (inactive)

Assignee

Comment 3

•

10 years ago

Still, the unescaping in 1) seems odd to me -- we unescape as if it's ASCII, but what if aCharset = UTF-16 ? Or does the given aCharset only describe what the result is *after* that ASCII-based unescaping ?

Mats Palmgren (inactive)

Assignee

Comment 4

•

10 years ago

Attached patch fix (deleted) — Details — Splinter Review

This adds an else-branch for non-ASCII compatible 'aCharset'. It does things in a different order: convertURItoUnicode is called first, then NS_UnescapeURL. I also made convertURItoUnicode check the charset before assuming it's ASCII/UTF-8. I also removed 'esc_AlwaysCopy' in both branches to avoid making a string copy in NS_UnescapeURL when there are no escape sequences. https://tbpl.mozilla.org/?tree=Try&rev=a3b260fd5a24

Assignee: nobody → mats

Attachment #8518247 - Attachment is obsolete: true

Attachment #8520334 - Flags: review?(bzbarsky)

Mats Palmgren (inactive)

Assignee

Comment 5

•

10 years ago

FYI, here's the IsAsciiCompatible I'm using: http://mxr.mozilla.org/mozilla-central/source/dom/encoding/EncodingUtils.cpp#57

Boris Zbarsky [:bzbarsky]

Comment 6

•

10 years ago

Mats, can you find a different reviewer for this? I'm pretty swamped for a while. :(

Mats Palmgren (inactive)

Assignee

Comment 7

•

10 years ago

Sure, no problem. I think Simon is back now, or will be soon.

Mats Palmgren (inactive)

Assignee

Updated

•

10 years ago

Attachment #8520334 - Flags: review?(bzbarsky) → review?(smontagu)

Simon Montagu :smontagu

Comment 8

•

10 years ago

Comment on attachment 8520334 [details] [diff] [review] fix Review of attachment 8520334 [details] [diff] [review]: ----------------------------------------------------------------- ::: intl/uconv/nsTextToSubURI.cpp @@ +192,3 @@ > > + if (aIRI) { > + const bool tryAsUTF8 = tryAsASCII; Why do you need two booleans with the same value? What is more, the statefulCharset() helper is only used right here. You could move the IsAsciiCompatible check inside statefulCharset() (maybe changing the name to asciiSafeCharset() or something) and simplify the whole thing to something like if (asciiSafeCharset(aCharset.get())) { if (IsASCII(aURI) { CopyASCIItoUTF16(aURI, _retval); return NS_OK; } if (aIRI && IsUTF8(aURI) { CopyUTF8toUTF16(aURI, _retval); return NS_OK; } } (the aIRI arg also looks ripe for removal IIANM, but I'll try not to get too far out of scope ;-) ) @@ +244,5 @@ > + // Test for != NS_OK rather than NS_FAILED, because incomplete multi-byte > + // sequences are also considered failure in this context > + if (convertURItoUnicode(charset, unescapedURI, true, _retval) != NS_OK) { > + // Assume UTF-8 instead of ASCII because hostname (IDN) may be in UTF-8. > + CopyUTF8toUTF16(aURIFragment, _retval); Nit: this would be a little clearer if you used flatURI instead of aURIFragment here too.

Attachment #8520334 - Flags: review?(smontagu)

Mats Palmgren (inactive)

Assignee

Comment 9

•

10 years ago

This caused a test failure: https://tbpl.mozilla.org/?tree=Try&rev=1b216057a434 /html/infrastructure/urls/resolving-urls/query-encoding/utf-16be.html | Scheme javascript (getting <a>.href) - assert_true: expected substring %C3%A5 got javascript:"?x=å" expected true got false In nsTextToSubURI::UnEscapeNonAsciiURI we have aCharset="UTF-16BE" and aURIFragment="javascript:"?x=Ã¥" (the last two are: 0xC3 and 0xA5) so the string is already unescaped on entry and 'unescapedSpec' is the same. IsUTF8(unescapedSpec) is true, so with the current code we take the early return in convertURItoUnicode (CopyUTF8toUTF16). I think my premise that we would get strings encoded per aCharset in convertURItoUnicode is simply wrong and the bug invalid. I'm not really sure what aCharset is good for, perhaps it's only used for some "error correction" in the case the original string resulted in an invalid UTF-8 string, and that the "converter" path at the end of convertURItoUnicode somehow takes it into account. Sorry for wasting your time Simon. :-/

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → INVALID

Mats Palmgren (inactive)

Assignee

Comment 10

•

10 years ago

> the aIRI arg also looks ripe for removal IIANM Yep, I'll kill it in bug 1101625.

Boris Zbarsky [:bzbarsky]

Comment 11

•

10 years ago

Would aCharset be the charset to use for things that get unescaped? So if there are no escapes to start with, it's not relevant...

You need to log in before you can comment on or make changes to this bug.

Bugzilla

nsTextToSubURI::convertURItoUnicode does not work correctly when aCharset=UTF-16

Categories

(Core :: Internationalization, defect)

Tracking

()

People

(Reporter: MatsPalmgren_bugz, Assigned: MatsPalmgren_bugz)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file, 1 obsolete file)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Updated

Comment 8

Comment 9

Comment 10

Comment 11

Attachment

General

Description

File Name

Content Type