Closed Bug 181117 Opened 22 years ago Closed 22 years ago

Non-ASCII (Plane 1 surrogates) link not correctly parsed in UTF-16

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: nikd, Assigned: shanjian)

References

(Depends on 1 open bug,
URL
)

Details

(Keywords: intl)

Attachments

(1 file, 1 obsolete file)

patch 22 years ago Shanjian Li (deleted), patch	darin.moz : review-	Details \| Diff \| Splinter Review
new patch 22 years ago Shanjian Li (deleted), patch	nhottanscp : review+ darin.moz : superreview+	Details \| Diff \| Splinter Review

Niklas Dougherty

Reporter

Description

•

22 years ago

User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.2b) Gecko/20021031 Build Identifier: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.2b) Gecko/20021031 Go to http://bugzilla.mozilla.org/attachment.cgi?id=106911&action=view and copy the link displayed (it doesn't go to any existing file). Paste the copied link into the URL field. Result: http://bugzilla.mozilla.org/%D8.xhtml Desired result: http://bugzilla.mozilla.org/%F0%90%8D%85%F0%90%8C%BF%F0%90%8C%BB%F0%90%8D%86%F0%90%8C%B9%F0%90%8C%BB%F0%90%8C%B0.xhtml The link consists of Gothic characters in Unicode Plane 1, so the parsing method used for URI:s seemingly doesn't consider surrogates. Reproducible: Always Steps to Reproduce: 1. 2. 3.

Niklas Dougherty

Reporter

Comment 1

•

22 years ago

Somehow parsing does work, since you can view the source, copy the link there and then paste it into the URL field. It just doesn't do it right when copying from link (or clicking the link in normal fashion).

Boris Zbarsky [:bzbarsky]

Comment 2

•

22 years ago

intl. Happens on PC Linux too.

Assignee: asa → smontagu

Status: UNCONFIRMED → NEW

Component: Browser-General → Internationalization

Ever confirmed: true

OS: MacOS X → All

QA Contact: asa → ylong

Hardware: Macintosh → All

Simon Montagu :smontagu

Comment 3

•

22 years ago

To Shanjian. I notice some other problems: When hovering over the link in the attachment, the URL is displayed incorrectly in the status bar (as CJK characters) When hovering over the URL-encoded version in comment 1, the URL is displayed in a different incorrect form in the status bar.

Assignee: smontagu → shanjian

Yuying Long

Comment 4

•

22 years ago

Seems there is a difference between branch and trunk build with the problem in original report - when I right click to copy the link , I got: http://bugzilla.mozilla.org/%FF%FD%FF%FD%FF%FD%FF%FD%FF%FD%FF%FD%FF%FD.xhtml on WinXP-SimpChinese with 1.02 branch build. while in 11-20 trunk build I have: http://bugzilla.mozilla.org/%D8.xhtml

Keywords: intl

Yuying Long

Updated

•

22 years ago

Blocks: 157673

nhottanscp

Comment 5

•

22 years ago

Is this working on IE?

Simon Montagu :smontagu

Updated

•

22 years ago

Depends on: 169388

Yuying Long

Comment 6

•

22 years ago

On my WinXP-SC, IE doesn't display properly when load page: http://bugzilla.mozilla.org/attachment.cgi?id=106911&action=view When click on the link to open a page in URL bar: http://bugzilla.mozilla.org/<squares>.xhtml but when I copy those string from IE URL bar to mozilla, then will display find with those (hebrew looking?) characters (not escaped).

Niklas Dougherty

Reporter

Comment 7

•

22 years ago

Yes, those Gothic characters look a bit like Hebrew. Same kind of rounded style. Bug 168382 handles garbage in status bar for UTF-16 pages... thought this was fixed, but that fix also doesn't consider surrogates in the link. Must be the same routines involved somehow.

nhottanscp

Comment 8

•

22 years ago

I talked to Shanjian about how do we want to handle UTF-16 URI. One way is just follow RFC 2396 and escape whatever necessary. http://www.ietf.org/rfc/rfc2396.txt Other way is to generate UTF-8 URI. I think that is good if UTF-16 is not really used currently. UTF-8 URI is proposed in IRI draft. http://www.ietf.org/internet-drafts/draft-duerst-iri-02.txt cc to drain

Niklas Dougherty

Reporter

Comment 9

•

22 years ago

http://www.student.lu.se/~kin02ndo/gothic/ works as intended in UTF-8.

Summary: Non-ASCII (Plane 1 surrogates) link not correctly parsed → Non-ASCII (Plane 1 surrogates) link not correctly parsed in UTF-16

Shanjian Li

Assignee

Comment 10

•

22 years ago

If we want to handle utf-16 encoding as it is, many code need to be changed. Some string assignment assume the string is null terminated, which does not work for utf-16. Some part of the code do no conversion for ascii char. Some code assuming scheme part (and probably other parts) can only be encoded in ascii. All those issues need to be cleared if we go that approach.

Niklas Dougherty

Reporter

Comment 11

•

22 years ago

UTF-16 URI:s work as they should for BMP (plane 0): http://www.student.lu.se/~kin02ndo/%E4%B8%AD%E5%9B%BD%E8%AF%97/ It is just when you get surrogates that things go wrong. (With reservation for what is causing the crash in bug 180263.)

Darin Fisher

Comment 12

•

22 years ago

UTF-8 can represent all of the UCS-4, right? so, isn't it possible to always perform a non-lossy conversion from UTF-16 to UTF-8? if so, then why don't we just do that for URLs. i suppose we could run into trouble if code converts UTF-8 to UCS-2... is it possible to UTF-16 everywhere that UCS-2 is currently used? are windows wide-APIs UCS-2 or UTF-16?

nhottanscp

Comment 13

•

22 years ago

Let me clarify a couple of things. 1) Internally, we store URI as UTF-8 with the document charset (as originCharset). Exception is if the URI is already escaped in the document then that is stored as escaped. 2) When we send to the server, UTF-8 to originChrset conversion may be applied depends on the protocol. So, we are talking about #2 issue, whether we want to send UTF-8 or UTF-16 to the server. Is that correct? Or is there dataloss internally for the UTF-16 case? about darin's comment, >UTF-8 can represent all of the UCS-4, right? that is right >i suppose we could run into trouble if code converts >UTF-8 to UCS-2 I think the conversion function we have (e.g. ConvertUTF8toUCS2) already takes care UTF-16. Shanjian, is that correct?

Shanjian Li

Assignee

Comment 14

•

22 years ago

>> UTF-16 URI:s work as they should for BMP (plane 0): >> http://www.student.lu.se/~kin02ndo/%E4%B8%AD%E5%9B%BD%E8%AF%97/ >> It is just when you get surrogates that things go wrong. (With >> reservation for >> what is causing the crash in bug 180263.) Not really. First, the url you referenced is not really encoded in UTF-16, if so thing like "http" should be encoded as "%00h%00t%00t%00p"... Sencond, the string you mentioned does not contains the byte 0x00. It is 0x00 caused most of the problem, not surrogates. >>I think the conversion function we have (e.g. ConvertUTF8toUCS2) >>already takes >>care UTF-16. Shanjian, is that correct? Our conversion function between utf8 and utf-16 works as expected. The problem is in other places where we have some wrong assumption. (like ascii remains ascii, not true for utf-16).

Status: NEW → ASSIGNED

nhottanscp

Comment 15

•

22 years ago

>The problem >is in other places where we have some wrong assumption. (like ascii remains >ascii, not true for utf-16). But as we store URI as UTF-8, we are not supposed to have UTF-16 URI (AUTF8String is used in nsIURI.idl). Are you talking about the code before the URI is stored in nsIURI?

Shanjian Li

Assignee

Comment 16

•

22 years ago

I just read rfc2396, it didn't mention encoding uri in utf-16. The rfc also has the same assumption that ascii is a subset. In fact, I found this is a necessary assumption. One just can't encode uri in utf-16, because the escape method won't work for utf-16. '%', '0', '1' ... are all ascii. So I guess we have to use utf-8 in this situation.

Shanjian Li

Assignee

Comment 17

•

22 years ago

Attached patch patch (obsolete) (deleted) — Details — Splinter Review

Darin Fisher

Comment 18

•

22 years ago

Comment on attachment 107085 [details] [diff] [review] patch >+ if (originCharset.Equals(NS_LITERAL_STRING("UTF-16BE")) || >+ originCharset.Equals(NS_LITERAL_STRING("UTF-16LE")) ) >+ originCharset = NS_LITERAL_STRING("UTF-8"); >+ > rv = nsHTMLUtils::IOService->NewURI(NS_ConvertUCS2toUTF8(aSpec), > NS_LossyConvertUCS2toASCII(originCharset).get(), this looks like two buffer copies to me. can you do something like this instead: if (originCharset.Equals(NS_LITERAL_STRING("UTF-16BE")) || originCharset.Equals(NS_LITERAL_STRING("UTF-16LE")) ) originCharset.Truncate(); // empty charset implies UTF-8 rv = nsHTMLUtils::IOService->NewURI(NS_ConvertUCS2toUTF8(aSpec), NS_LossyConvertUCS2toASCII(originCharset).get(), an empty string implies UTF-8.

Attachment #107085 - Flags: review-

nhottanscp

Comment 19

•

22 years ago

I think we want to add UTF-16, UTF-32, UTF32-BE, UTF-32LE which have the same issue.

Shanjian Li

Assignee

Comment 20

•

22 years ago

Attached patch new patch (deleted) — Details — Splinter Review

Since we know of all UTF encodings, only UTF-8 is NULL safe. I will speed the comparison by only checking prefix utf-.

Attachment #107085 - Attachment is obsolete: true

nhottanscp

Comment 21

•

22 years ago

Is the charset name canonicalized at that point and no need to check lower cases?

Shanjian Li

Assignee

Comment 22

•

22 years ago

The charset name should have been normalized till this stage. Whenever we get a charset name from outside world, it should be normalized in the first place. If not, that is a bug. For normal sources like meta, http header, I am sure that has been done.

nhottanscp

Updated

•

22 years ago

Attachment #107159 - Flags: review+

Shanjian Li

Assignee

Comment 23

•

22 years ago

darin, could you sr?

Darin Fisher

Updated

•

22 years ago

Attachment #107159 - Flags: superreview+

Shanjian Li

Assignee

Comment 24

•

22 years ago

fix checked in.

Status: ASSIGNED → RESOLVED

Closed: 22 years ago

Resolution: --- → FIXED

Boris Zbarsky [:bzbarsky]

Comment 25

•

22 years ago

Out of curiousity, would it have made more sense to put this check inside NS_NewURI so other callers could benefit too?

Darin Fisher

Comment 26

•

22 years ago

bz: hmm... yeah. that does sound like a good idea (nsIIOService::newURI you mean?). shanjian, nhotta: should we just assume that UTF* as an origin charset always means UTF-8 across the entire application? we could make the change to nsStandardURL::Init to do this correction.

Shanjian Li

Assignee

Comment 27

•

22 years ago

It should be safe to assume all utf* should be encoded by utf8. I will take a look into this later.

Frank Tang

Updated

•

22 years ago

No longer blocks: 157673

Frank Tang

Updated

•

22 years ago

Depends on: 180372

You need to log in before you can comment on or make changes to this bug.