Closed Bug 261929 Opened 20 years ago Closed 19 years ago

Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)

Categories

(Core :: Networking, enhancement, P4)

enhancement

Tracking

()

RESOLVED FIXED
mozilla1.9alpha1

People

(Reporter: c00lyman, Assigned: emk)

References

(Depends on 1 open bug, )

Details

(Keywords: intl, jp-critical)

Attachments

(1 file, 6 obsolete files)

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910 There are german Umlaute in the image link addresses and it looks like mozilla can't figure out the right request address for the images. If you try the same page with IE you will see the images ... Reproducible: Always Steps to Reproduce: 1.load the page 2. 3. Actual Results: broken link images - picture is not loadad Expected Results: You should see a picture
Joerg: is "Always send URLs in UTF-8" enabled in your MSIE settings?
Unchecking the "Always send URLs as UTF-8" makes it not work in MSIE6.
Setting network.standard-url.encode-utf8 makes it work in Mozilla. If this pref defaults to true in MSIE, we should probably do the same.
Hey, great timing - you guys make a good job!
We have a dillemma here. It seems like MS IIS has a built-in support for translating url-escaped UTF-8 in URL to whatever is used on the file system (otherwise, MS IE wouldn't have set 'always send .. UTF-8' on by default, I guess). Our trouble is that a significant portion of Apache servers don't have a module (that does the same thing) on and are installed on Unix boxes where the local file system character encoding is not UTF-8. (this has been changing with recent Linux distributions using UTF-8 by default). In addition, the module is only available for apache 2.x which is lot less widely used than apache 1.x. Perhaps, we have to turn it on by default. In the past, I saw many web sites asking their visitors to turn off 'Always send URLs in UTF-8' in MSIE. These days, I rarely see it. However, my sample may be skewed. What do others think about this?
Status: UNCONFIRMED → ASSIGNED
Component: Browser-General → Networking
Ever confirmed: true
Keywords: intl
Summary: You can't see the images on the page due to german ä in the image link You can't see the images on the page due to german ä in the image link → Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed) Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)
I still prefer a failover-based solution (though it would be non-trivial to implement), but maybe the proliferation of IE means that we can assume UTF-8 as a defacto standard?
I am all for making UTF-8 ubiquitous, especially in URIs.
I also for encoding urls as utf-8 by default. But we must not change the behavior with unescaped urls and form content on the page. They must always use the page charset.
The IRI spec (currently at http://www.w3.org/International/iri-edit/draft-duerst-iri-11.txt) has been approved as an IETF Proposed Standard by the IESG (see http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00752.html) and will soon be issued as an RFC. In short, IRI spec defines how to convert non-ASCII characters in a Web address to a fully standard URI, using UTF-8 and %HH-encoding. Given that this is now approved, Mozilla code should move towards implementing it without delay. In particular, this means that any URIs that contain non-ASCII characters should be converted to %HH-form for resolution using UTF-8. Using a legacy encoding is (and always has been) a clearly totally non-standard way of resolving Web addresses with non-ASCII characters. So: 1) 'network.standard-url.encode-utf8' should be on by default (and eventually become unnecessary) 2) 'always use the page charset' for 'unescaped uris', as Nil Soffer asks for, is clearly against standards. 3) For form content generated on a page by the user (as opposed to the query part of an uri), the page encoding has to be used. Opera and Safari already have a good implementation of IRIs. IE does so, too, with the exception of IDNs. Mozilla has all the right code, it just has to be activated and deployed. A very simple test is available at http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html. I suggest changing 'OS' to 'All', because this is a general issue, not a Win XP issue.
Martin, Yes, we are aware of the IRI specification, but has anyone proposed a migration strategy that web browsers might implement? Clearly, much of the web does not conform to the IRI specification, so it would be a mistake for Mozilla to switch to IRI encoding by default. Perhaps servers should indicate IRI expectation somehow?
OS: Windows XP → All
Hardware: PC → All
Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at least it appears to be true on my system) means that we should be in the clear to do the same. I would like to study IE's behavior more before I agree that we can make this change, but for now I'll put it on my TODO list for Moz 1.8 / FFox 1.1.
Assignee: general → darin
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla1.8beta
Target Milestone: mozilla1.8beta → ---
Target Milestone: --- → mozilla1.8beta
Attached patch v1 patch (obsolete) (deleted) — Splinter Review
This patch flips the default value of the pref. This patch means that the URI object will discard the information it was given about the encoding of the page from which it was extracted. We might want to change that, and instead make the pref only affect the encoding and not affect the value of mOriginCharset. Or, perhaps that's unnecessary.
Attachment #169156 - Flags: review?(smontagu)
Darin Fisher asks about a migration strategy. If you really want that, then the best idea is probably to do the following: 1) Request a resource using %-encoding based on UTF-8 2) If you get a 404 (not found), do another request based on the legacy encoding of the page. For that, it would indeed make sense to keep the page encoding in the URI object. There should probably be an option to switch part 2 off, because hopefully in the future it won't be needed anymore. This approach has been proposed ages ago, but I'm not sure it has ever been implemented. But it should not be too difficult. It has the following advantages: - The official, standard, thing is done first, bugwards compatibility stuff follows only later. - Servers serving things with legacy encodings 'get the message', i.e. they'll see all the UTF-8 requests in their logs. It has the following issues: - double requests for broken links - Exceptional wrong positive (if the UTF-8 form of a path coincides with an existing path in legacy encoding; for actual preexisting pages (as opposed to query-like things that get constructed), the chance of this happening is very low. The 404 is an efficient way (in particular in terms of implementation!) for the server to indicate that it's not doing UTF-8 (at least not on the resource requested). For the server to use some additional thing (e.g. header) to indicate that it's actually doing the standard thing (i.e. IRIs) as such doesn't seem like a good idea. If you're going to upgrade the server anyway, the best way to do it is probably to use mod_fileiri (see http://www.w3.org/2003/06/mod_fileiri/), which lets you use a feature already in the HTTP protocol (permanent redirect) to tell the client that it's using UTF-8 now.
(In reply to comment #11) > Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at > least it appears to be true on my system) means that we should be in the clear > to do the same. Darin, see comment #5. IIS apparently has that. However, the majority of Apache servers in the wild don't have Martin's IRI extension module installed. His module is only available for Apache 2.x, but the share of Apache 1.x is far larger than that of Apache 2.x. However, due to IE's behavior, it seems that a lot of apache 1.x admins have added some ad-hock solutions because nowadays I rarely see a note that asks users to turn off 'always send urls in UTF-8' in MS IE (again comment #5). As for Martin's migration strategy, that has been considered before, but somehow it wasn't implemented (for perf. reason?). > object will discard the information it was given about the encoding of the page This should be kept, IMO.
Re. my fileiri module for Apache 1.3, I don't think it would be difficult to port it; I just haven't done it, nor have I heard from anybody else doing it. If you need it, please drop me a line. Re. Jungshik Shin's comment "nowadays I rarely see a note that asks users to turn off 'always send urls in UTF-8' in MS IE", if there is any place where non-UTF-8-based URIs are (or where) in wide use, it's Korea, and Jungshik is extremely familliar with the situation there.
Attached patch v2 patch (obsolete) (deleted) — Splinter Review
OK, this version preserves the value of nsIURI::originCharset, and makes the implementation simply ignore it when the encode-as-utf8 preference is set.
Attachment #169156 - Attachment is obsolete: true
Attachment #169156 - Flags: review?(smontagu)
Attachment #169230 - Flags: review?(jshin)
I've changed access to http://www.w3.org/2003/06/mod_fileiri/. It should be visible for everybody now. Sorry, I thought this was already visible.
Comment on attachment 169230 [details] [diff] [review] v2 patch This looks fine as long as references to remote files are concerned. I just came across another complication. See bug 263570 comment #17. I believe that the problem described in bug 263570 comment #17 only happens when this option is set true. We may just go ahead here and then have to come up with a way to special-case(?) local file access.
Attachment #169230 - Flags: review?(jshin) → review+
Comment on attachment 169230 [details] [diff] [review] v2 patch If we want to display IRIs unescaped in the address bar, what do we need to do? I'm not clear whether the pref "network.standard-url.escape-utf8" affects only the display or also the form in which the URI/IRI is sent to the server.
(In reply to comment #20) > If we want to display IRIs unescaped in the address bar, what do we need to do? > I'm not clear whether the pref "network.standard-url.escape-utf8" affects only > the display or also the form in which the URI/IRI is sent to the server. It's only for what's sent to the server. As for the display in the addressbar, I guess we have to use |nsITextToSubURI|
yeah, what jshin said.
Attachment #169230 - Flags: superreview?(bzbarsky)
Comment on attachment 169230 [details] [diff] [review] v2 patch sr=bzbarsky
Attachment #169230 - Flags: superreview?(bzbarsky) → superreview+
> We may just go ahead here and then have to come up with a way to > special-case(?) local file access. Hmm... yeah, local file access is going to be an interesting issue. We could ignore this preference when nsStandardURL::mSupportsFileURL is set the true. That may be best. Thoughts? (bz: sorry for the possibly premature sr request)
Hmm... yeah, if this breaks us on Windows then we may want to condition on the file url flag. But really, it would break us any time the HTML charset doesn't match the native charset, no? So it's just a problem no matter what...
if we're redesigning this, i have a consumer who wants the ability to send unencoded urls :).
> But really, it would break us any time the HTML charset doesn't match the native > charset, no? So it's just a problem no matter what... boris, well... if everything a user does is Shift-JIS (documents on their harddrive, documents downloaded from websites, etc.) then it would be a problem to switch suddenly to UTF-8 for local files. but you make a good point when considering that the average user probably does encounter documents in a variety of character encodings.
> if we're redesigning this, i have a consumer who wants the ability to send > unencoded urls :). timeless: we're not redesigning this per se. we're just talking about when to enable an existing preference (and some details related to it). you might want to file a new bug for that feature.
(In reply to comment #25) > But really, it would break us any time the HTML charset doesn't match the native > charset, no? So it's just a problem no matter what... How about changing |net_GetFileFromURLSpec| so that it works with url-escaped UTF-8's and call |InitWithPath| with a UTF-16 path instead of |InitWithNativePath|? If its ripple effect is so huge that we need to keep the backward compatibility (i.e. have to make so many changes throughout the code), we may consider checking the UTF8ness of an escaped URLspec and branching depending on the result. http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsURLHelperWin.cpp#99
Making net_GetFileFromURLSpec handle UTF-8 encoded file:// URLs is probably a good idea, and it will probably be necessary to support UTF-8 (and native encoding as a fallback) when we start supporting Unicode file paths under windows.
OK, I went ahead and checked this patch in. I want to make sure we get as much feedback as possible on this change. We need to fix the way we handle non-ASCII file paths anyways, so that need not block this patch. Marking FIXED.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Thanks. I've just come across a page (actually, a Korean firefox 1.0 user reported that he couldn't see images in the page) in EUC-KR (one at one of major portals in Korea) where I had to turn on this because the server only understands IRI. The page address is http://www.cdpkorea.com/zboard4/zboard.php?id=pdsboard&page=1&page_num=20&select_arrange=headnum&desc=&sn=off&ss=on&sc=on&keyword=&no=43865&category= With this option on, four photos of a CD player will show up. Otherwise, they're replaced with an image for 'not found files'.
Blocks: iri
*** Bug 283051 has been marked as a duplicate of this bug. ***
This change was reverted because it UTF-8'ed some things that IE and Opera don't UTF-8, breaking some web sites. See bug 284474.
reopening this bug since the fix was basically reverted.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Severity: normal → enhancement
Target Milestone: mozilla1.8beta1 → mozilla1.9alpha
Two comments I'm quoting were taken from bug 284474, but I'm adding my comment here because this bug is where we need to discuss what's gotta be done. (In reply to comment #33) > (In reply to comment #30) > > My understanding is that for the mid-range future (after Firefox 1.5) it > > would be very good to move to what's called the IE/Opera behavior, i.e. > > using UTF-8 for the path/file part and the page encoding for the query > > part. While this solution is not optimal, it would definitely be a big > > step forward. > > However, that behavior is in violation of RFC 3987 you wrote, or did I misread > it? I haven't yet confirmed it myself (I have yet to find a Windows box where I can safely test IE 7 beta preview2??. It doesn't work on my Win 2k box), but it seems like IE 7 beta preview2 behaves exactly the same way as firefox with the backed-out patch in place (judging from a few test reports I got in the Korean mozilla community). I don't know whether that's a just oversight on the part of IE7 team or done on purpose to get IE7 in compliance with the letters of RFC 3987 at the risk of breaking some web sites. The former is more likely than the latter, but who knows....
Priority: -- → P4
Attached patch patch v3 (obsolete) (deleted) — Splinter Review
This will make Mozilla compatible with IE and Opera. When network.standard-url.encode-utf8 is true, URLs except query part will be encoded in UTF-8 (similar to IE6 "Always encode in UTF-8"). I added a new pref "network.standard-url.encode-query-utf8". The query part will be encoded in UTF-8 only if it's true (similar to "Send UTF-8 query strings" introduced in IE7). This patch also contains a fix for bug 284474.
Assignee: darin → VYV03354
Status: REOPENED → ASSIGNED
Attachment #215656 - Flags: superreview?(darin)
Attachment #215656 - Flags: review?(jshin1987)
If we are inputting non-ASCII URI to location bar directly, the sending request is not compatible at IE6(default settings). + jp-critical
Keywords: jp-critical
Comment on attachment 215656 [details] [diff] [review] patch v3 >Index: netwerk/base/src/nsStandardURL.cpp >+ nsCAutoString loweredSpec(spec); >+ ToLowerCase(loweredSpec); >+ GET_SEGMENT_ENCODER_FROM_SPEC(encoder, loweredSpec.get()); I don't understand why it is necessary to lowercase the entire URL spec.
Attached patch patch v3.1 (obsolete) (deleted) — Splinter Review
using PL_strncasecmp instead of stricmp + ToLowerCase.
Attachment #169230 - Attachment is obsolete: true
Attachment #215656 - Attachment is obsolete: true
Attachment #215810 - Flags: superreview?(darin)
Attachment #215810 - Flags: review?(jshin1987)
Attachment #215656 - Flags: superreview?(darin)
Attachment #215656 - Flags: review?(jshin1987)
Attached patch patch v3.2 (obsolete) (deleted) — Splinter Review
Removed file protocol hack since bug 278161 was fixed.
Attachment #215810 - Attachment is obsolete: true
Attachment #216628 - Flags: review?(jshin1987)
Attachment #215810 - Flags: superreview?(darin)
Attachment #215810 - Flags: review?(jshin1987)
Comment on attachment 216628 [details] [diff] [review] patch v3.2 Looks good, but let biesi take a look. >Index: netwerk/base/src/nsStandardURL.cpp > > #define NS_NET_PREF_ALWAYSENCODEINUTF8 "network.standard-url.encode-utf8" >+#define NS_NET_PREF_ENCODEQUERYINUTF8 "network.standard-url.encode-query-utf8" Can you align the opening quote of a new line with that of the line above it? Even with that, it's only 79 chars. long. >+#define GET_QUERY_ENCODER(name) \ >+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \ >+ gEncodeQueryInUTF8) Perhaps, it's better this way: GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \ gEncodeQueryInUTF8)
Attachment #216628 - Flags: review?(jshin1987) → review?(cbiesinger)
Attached patch resolved points of comment #42 (obsolete) (deleted) — Splinter Review
Attachment #216628 - Attachment is obsolete: true
Attachment #216631 - Flags: review?(cbiesinger)
Attachment #216628 - Flags: review?(cbiesinger)
Comment on attachment 216631 [details] [diff] [review] resolved points of comment #42 unfortunate that we have to do this, oh well :( what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't that need to be changed to deal with URIs that are partially in UTF-8 and partially in the origin charset?
Attachment #216631 - Flags: review?(cbiesinger) → review+
Comment on attachment 216631 [details] [diff] [review] resolved points of comment #42 > what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't > that need to be changed to deal with URIs that are partially in UTF-8 and > partially in the origin charset? That's bug 320807.
Attachment #216631 - Flags: superreview?(darin)
Comment on attachment 216631 [details] [diff] [review] resolved points of comment #42 >Index: netwerk/base/src/nsStandardURL.cpp > #define GET_SEGMENT_ENCODER(name) \ >+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8) >+ >+#define GET_QUERY_ENCODER(name) \ >+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \ >+ gEncodeQueryInUTF8) >+ >+#define GET_SEGMENT_ENCODER_INTERNAL(name, useUTF8) \ >+ nsSegmentEncoder name(useUTF8 ? nsnull : mOriginCharset.get()) It's usually nice to define a macro above where it is first used. sr=darin
Attachment #216631 - Flags: superreview?(darin) → superreview+
Attached patch resolved darin's comment (deleted) — Splinter Review
Thank you. Carring over r+sr.
Attachment #216631 - Attachment is obsolete: true
Attachment #218073 - Flags: superreview+
Attachment #218073 - Flags: review+
checked-in. Don't we need this for 1.8.1 if there are no regressions?
Status: ASSIGNED → RESOLVED
Closed: 20 years ago19 years ago
Resolution: --- → FIXED
I really want to land this on branch, but we will have to wait for bug 278161.
Depends on: 278161
I've filed bug 333859 to followup the query part issue.
Don't we need the fix for bug 320807 anywhere where we land this?
Depends on: 320807
(In reply to comment #51) > Don't we need the fix for bug 320807 anywhere where we land this? At least, we need to handle the "mixed-charset" url (comment #45).
*** Bug 340376 has been marked as a duplicate of this bug. ***
(In reply to comment #49) > I really want to land this on branch, but we will have to wait for bug 278161. > Bug 278161 did land on the branch on May 1st. Can we check this in too ? (there's no blocking-flag yet)
Comment on attachment 218073 [details] [diff] [review] resolved darin's comment Bug 278161 is fixed now. And I didn't hear any regressions.
Attachment #218073 - Flags: approval-branch-1.8.1?(cbiesinger)
(In reply to comment #55) > (From update of attachment 218073 [details] [diff] [review] [edit]) > Bug 278161 is fixed now. > And I didn't hear any regressions. What about comment 51?
Bug 320807 will not be suitable for landing on the branch because it's large and contains interface change. Probably we will need a subset patch to handle only mixed-charset case.
Comment on attachment 218073 [details] [diff] [review] resolved darin's comment Clearing request until bug 320807 is solved
Attachment #218073 - Flags: approval-branch-1.8.1?(cbiesinger)
*** Bug 349671 has been marked as a duplicate of this bug. ***
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: