Closed
Bug 261929
Opened 20 years ago
Closed 19 years ago
Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)
Categories
(Core :: Networking, enhancement, P4)
Core
Networking
Tracking
()
RESOLVED
FIXED
mozilla1.9alpha1
People
(Reporter: c00lyman, Assigned: emk)
References
(Depends on 1 open bug, )
Details
(Keywords: intl, jp-critical)
Attachments
(1 file, 6 obsolete files)
(deleted),
patch
|
emk
:
review+
emk
:
superreview+
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.7.3) Gecko/20040910
There are german Umlaute in the image link addresses and it looks like mozilla
can't figure out the right request address for the images. If you try the same
page with IE you will see the images ...
Reproducible: Always
Steps to Reproduce:
1.load the page
2.
3.
Actual Results:
broken link images - picture is not loadad
Expected Results:
You should see a picture
Comment 1•20 years ago
|
||
Joerg: is "Always send URLs in UTF-8" enabled in your MSIE settings?
Unchecking the "Always send URLs as UTF-8" makes it not work in MSIE6.
Setting network.standard-url.encode-utf8 makes it work in Mozilla. If this pref
defaults to true in MSIE, we should probably do the same.
Reporter | ||
Comment 4•20 years ago
|
||
Hey, great timing - you guys make a good job!
Comment 5•20 years ago
|
||
We have a dillemma here. It seems like MS IIS has a built-in support for
translating url-escaped UTF-8 in URL to whatever is used on the file system
(otherwise, MS IE wouldn't have set 'always send .. UTF-8' on by default, I
guess). Our trouble is that a significant portion of Apache servers don't have a
module (that does the same thing) on and are installed on Unix boxes where the
local file system character encoding is not UTF-8. (this has been changing with
recent Linux distributions using UTF-8 by default). In addition, the module is
only available for apache 2.x which is lot less widely used than apache 1.x.
Perhaps, we have to turn it on by default. In the past, I saw many web sites
asking their visitors to turn off 'Always send URLs in UTF-8' in MSIE. These
days, I rarely see it. However, my sample may be skewed. What do others think
about this?
Status: UNCONFIRMED → ASSIGNED
Component: Browser-General → Networking
Ever confirmed: true
Keywords: intl
Summary: You can't see the images on the page due to german ä in the image link You can't see the images on the page due to german ä in the image link → Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed) Consider sending urls in UTF-8 by default (images/links with non-ASCII chacters not displayed)
Comment 6•20 years ago
|
||
I still prefer a failover-based solution (though it would be non-trivial to
implement), but maybe the proliferation of IE means that we can assume UTF-8 as
a defacto standard?
Comment 7•20 years ago
|
||
I am all for making UTF-8 ubiquitous, especially in URIs.
Comment 8•20 years ago
|
||
I also for encoding urls as utf-8 by default.
But we must not change the behavior with unescaped urls and form content on the page. They must
always use the page charset.
Comment 9•20 years ago
|
||
The IRI spec (currently at
http://www.w3.org/International/iri-edit/draft-duerst-iri-11.txt)
has been approved as an IETF Proposed Standard by the IESG (see
http://www1.ietf.org/mail-archive/web/ietf-announce/current/msg00752.html)
and will soon be issued as an RFC.
In short, IRI spec defines how to convert non-ASCII characters in a
Web address to a fully standard URI, using UTF-8 and %HH-encoding.
Given that this is now approved, Mozilla code should move towards
implementing it without delay. In particular, this means that any
URIs that contain non-ASCII characters should be converted to
%HH-form for resolution using UTF-8. Using a legacy encoding is
(and always has been) a clearly totally non-standard way of
resolving Web addresses with non-ASCII characters.
So:
1) 'network.standard-url.encode-utf8' should be on by default
(and eventually become unnecessary)
2) 'always use the page charset' for 'unescaped uris', as
Nil Soffer asks for, is clearly against standards.
3) For form content generated on a page by the user
(as opposed to the query part of an uri), the page
encoding has to be used.
Opera and Safari already have a good implementation of IRIs.
IE does so, too, with the exception of IDNs. Mozilla has
all the right code, it just has to be activated and deployed.
A very simple test is available at
http://www.w3.org/2001/08/iri-test/resumeHtmlImgSrcBase.html.
I suggest changing 'OS' to 'All', because this is a general issue,
not a Win XP issue.
Comment 10•20 years ago
|
||
Martin,
Yes, we are aware of the IRI specification, but has anyone proposed a migration
strategy that web browsers might implement? Clearly, much of the web does not
conform to the IRI specification, so it would be a mistake for Mozilla to switch
to IRI encoding by default. Perhaps servers should indicate IRI expectation
somehow?
OS: Windows XP → All
Hardware: PC → All
Comment 11•20 years ago
|
||
Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at
least it appears to be true on my system) means that we should be in the clear
to do the same.
I would like to study IE's behavior more before I agree that we can make this
change, but for now I'll put it on my TODO list for Moz 1.8 / FFox 1.1.
Assignee: general → darin
Status: ASSIGNED → NEW
Target Milestone: --- → mozilla1.8beta
Updated•20 years ago
|
Target Milestone: mozilla1.8beta → ---
Updated•20 years ago
|
Target Milestone: --- → mozilla1.8beta
Comment 12•20 years ago
|
||
This patch flips the default value of the pref. This patch means that the URI
object will discard the information it was given about the encoding of the page
from which it was extracted. We might want to change that, and instead make
the pref only affect the encoding and not affect the value of mOriginCharset.
Or, perhaps that's unnecessary.
Attachment #169156 -
Flags: review?(smontagu)
Comment 13•20 years ago
|
||
Darin Fisher asks about a migration strategy.
If you really want that, then the best idea is
probably to do the following:
1) Request a resource using %-encoding based on UTF-8
2) If you get a 404 (not found), do another request
based on the legacy encoding of the page.
For that, it would indeed make sense to keep the
page encoding in the URI object. There should
probably be an option to switch part 2 off, because
hopefully in the future it won't be needed anymore.
This approach has been proposed ages ago, but I'm
not sure it has ever been implemented. But it should
not be too difficult.
It has the following advantages:
- The official, standard, thing is done first,
bugwards compatibility stuff follows only later.
- Servers serving things with legacy encodings
'get the message', i.e. they'll see all the
UTF-8 requests in their logs.
It has the following issues:
- double requests for broken links
- Exceptional wrong positive (if the UTF-8 form
of a path coincides with an existing path in
legacy encoding; for actual preexisting pages
(as opposed to query-like things that get
constructed), the chance of this happening
is very low.
The 404 is an efficient way (in particular in terms of
implementation!) for the server to indicate that it's
not doing UTF-8 (at least not on the resource requested).
For the server to use some additional thing (e.g. header)
to indicate that it's actually doing the standard thing
(i.e. IRIs) as such doesn't seem like a good idea.
If you're going to upgrade the server anyway, the best
way to do it is probably to use mod_fileiri
(see http://www.w3.org/2003/06/mod_fileiri/), which lets
you use a feature already in the HTTP protocol
(permanent redirect) to tell the client that it's using
UTF-8 now.
Comment 14•20 years ago
|
||
(In reply to comment #11)
> Hmm... on second thought, the fact that IE encodes URLs as UTF-8 by default (at
> least it appears to be true on my system) means that we should be in the clear
> to do the same.
Darin, see comment #5. IIS apparently has that. However, the majority of Apache
servers in the wild don't have Martin's IRI extension module installed. His
module is only available for Apache 2.x, but the share of Apache 1.x is far
larger than that of Apache 2.x. However, due to IE's behavior, it seems that a
lot of apache 1.x admins have added some ad-hock solutions because nowadays I
rarely see a note that asks users to turn off 'always send urls in UTF-8' in MS
IE (again comment #5).
As for Martin's migration strategy, that has been considered before, but somehow
it wasn't implemented (for perf. reason?).
> object will discard the information it was given about the encoding of the page
This should be kept, IMO.
Comment 15•20 years ago
|
||
Re. my fileiri module for Apache 1.3, I don't think it would be
difficult to port it; I just haven't done it, nor have I heard
from anybody else doing it. If you need it, please drop me a line.
Re. Jungshik Shin's comment "nowadays I rarely see a note that
asks users to turn off 'always send urls in UTF-8' in MS IE",
if there is any place where non-UTF-8-based URIs are (or where)
in wide use, it's Korea, and Jungshik is extremely familliar
with the situation there.
Comment 16•20 years ago
|
||
> (see http://www.w3.org/2003/06/mod_fileiri/)
that seems to require authentication...
Comment 17•20 years ago
|
||
OK, this version preserves the value of nsIURI::originCharset, and makes the
implementation simply ignore it when the encode-as-utf8 preference is set.
Attachment #169156 -
Attachment is obsolete: true
Updated•20 years ago
|
Attachment #169156 -
Flags: review?(smontagu)
Updated•20 years ago
|
Attachment #169230 -
Flags: review?(jshin)
Comment 18•20 years ago
|
||
I've changed access to http://www.w3.org/2003/06/mod_fileiri/.
It should be visible for everybody now.
Sorry, I thought this was already visible.
Comment 19•20 years ago
|
||
Comment on attachment 169230 [details] [diff] [review]
v2 patch
This looks fine as long as references to remote files are concerned. I just
came across another complication. See bug 263570 comment #17. I believe that
the problem described in bug 263570 comment #17 only happens when this option
is set true. We may just go ahead here and then have to come up with a way to
special-case(?) local file access.
Attachment #169230 -
Flags: review?(jshin) → review+
Comment 20•20 years ago
|
||
Comment on attachment 169230 [details] [diff] [review]
v2 patch
If we want to display IRIs unescaped in the address bar, what do we need to do?
I'm not clear whether the pref "network.standard-url.escape-utf8" affects only
the display or also the form in which the URI/IRI is sent to the server.
Comment 21•20 years ago
|
||
(In reply to comment #20)
> If we want to display IRIs unescaped in the address bar, what do we need to do?
> I'm not clear whether the pref "network.standard-url.escape-utf8" affects only
> the display or also the form in which the URI/IRI is sent to the server.
It's only for what's sent to the server. As for the display in the addressbar,
I guess we have to use |nsITextToSubURI|
Comment 22•20 years ago
|
||
yeah, what jshin said.
Updated•20 years ago
|
Attachment #169230 -
Flags: superreview?(bzbarsky)
Comment 23•20 years ago
|
||
Comment on attachment 169230 [details] [diff] [review]
v2 patch
sr=bzbarsky
Attachment #169230 -
Flags: superreview?(bzbarsky) → superreview+
Comment 24•20 years ago
|
||
> We may just go ahead here and then have to come up with a way to
> special-case(?) local file access.
Hmm... yeah, local file access is going to be an interesting issue. We could
ignore this preference when nsStandardURL::mSupportsFileURL is set the true.
That may be best. Thoughts? (bz: sorry for the possibly premature sr request)
Comment 25•20 years ago
|
||
Hmm... yeah, if this breaks us on Windows then we may want to condition on the
file url flag.
But really, it would break us any time the HTML charset doesn't match the native
charset, no? So it's just a problem no matter what...
Comment 26•20 years ago
|
||
if we're redesigning this, i have a consumer who wants the ability to send
unencoded urls :).
Comment 27•20 years ago
|
||
> But really, it would break us any time the HTML charset doesn't match the native
> charset, no? So it's just a problem no matter what...
boris, well... if everything a user does is Shift-JIS (documents on their
harddrive, documents downloaded from websites, etc.) then it would be a problem
to switch suddenly to UTF-8 for local files. but you make a good point when
considering that the average user probably does encounter documents in a variety
of character encodings.
Comment 28•20 years ago
|
||
> if we're redesigning this, i have a consumer who wants the ability to send
> unencoded urls :).
timeless: we're not redesigning this per se. we're just talking about when to
enable an existing preference (and some details related to it). you might want
to file a new bug for that feature.
Comment 29•20 years ago
|
||
(In reply to comment #25)
> But really, it would break us any time the HTML charset doesn't match the native
> charset, no? So it's just a problem no matter what...
How about changing |net_GetFileFromURLSpec| so that it works with url-escaped
UTF-8's and call |InitWithPath| with a UTF-16 path instead of
|InitWithNativePath|? If its ripple effect is so huge that we need to keep the
backward compatibility (i.e. have to make so many changes throughout the code),
we may consider checking the UTF8ness of an escaped URLspec and branching
depending on the result.
http://lxr.mozilla.org/seamonkey/source/netwerk/base/src/nsURLHelperWin.cpp#99
Comment 30•20 years ago
|
||
Making net_GetFileFromURLSpec handle UTF-8 encoded file:// URLs is probably a
good idea, and it will probably be necessary to support UTF-8 (and native
encoding as a fallback) when we start supporting Unicode file paths under windows.
Comment 31•20 years ago
|
||
OK, I went ahead and checked this patch in. I want to make sure we get as much
feedback as possible on this change. We need to fix the way we handle non-ASCII
file paths anyways, so that need not block this patch.
Marking FIXED.
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Comment 32•20 years ago
|
||
Thanks. I've just come across a page (actually, a Korean firefox 1.0 user
reported that he couldn't see images in the page) in EUC-KR (one at one of
major portals in Korea) where I had to turn on this because the server only
understands IRI.
The page address is
http://www.cdpkorea.com/zboard4/zboard.php?id=pdsboard&page=1&page_num=20&select_arrange=headnum&desc=&sn=off&ss=on&sc=on&keyword=&no=43865&category=
With this option on, four photos of a CD player will show up. Otherwise, they're
replaced with an image for 'not found files'.
Comment 33•20 years ago
|
||
*** Bug 283051 has been marked as a duplicate of this bug. ***
Comment 34•19 years ago
|
||
This change was reverted because it UTF-8'ed some things that IE and Opera don't
UTF-8, breaking some web sites. See bug 284474.
Comment 35•19 years ago
|
||
reopening this bug since the fix was basically reverted.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Updated•19 years ago
|
Severity: normal → enhancement
Target Milestone: mozilla1.8beta1 → mozilla1.9alpha
Comment 36•19 years ago
|
||
Two comments I'm quoting were taken from bug 284474, but I'm adding my comment here because this bug is where we need to discuss what's gotta be done.
(In reply to comment #33)
> (In reply to comment #30)
> > My understanding is that for the mid-range future (after Firefox 1.5) it
> > would be very good to move to what's called the IE/Opera behavior, i.e.
> > using UTF-8 for the path/file part and the page encoding for the query
> > part. While this solution is not optimal, it would definitely be a big
> > step forward.
>
> However, that behavior is in violation of RFC 3987 you wrote, or did I misread
> it?
I haven't yet confirmed it myself (I have yet to find a Windows box where I can safely test IE 7 beta preview2??. It doesn't work on my Win 2k box), but it seems like IE 7 beta preview2 behaves exactly the same way as firefox with the backed-out patch in place (judging from a few test reports I got in the Korean mozilla community). I don't know whether that's a just oversight on the part of IE7 team or done on purpose to get IE7 in compliance with the letters of RFC 3987 at the risk of breaking some web sites. The former is more likely than the latter, but who knows....
Updated•19 years ago
|
Priority: -- → P4
Assignee | ||
Comment 37•19 years ago
|
||
This will make Mozilla compatible with IE and Opera.
When network.standard-url.encode-utf8 is true, URLs except query part will be encoded in UTF-8 (similar to IE6 "Always encode in UTF-8").
I added a new pref "network.standard-url.encode-query-utf8". The query part will be encoded in UTF-8 only if it's true (similar to "Send UTF-8 query strings" introduced in IE7).
This patch also contains a fix for bug 284474.
Assignee: darin → VYV03354
Status: REOPENED → ASSIGNED
Attachment #215656 -
Flags: superreview?(darin)
Attachment #215656 -
Flags: review?(jshin1987)
Comment 38•19 years ago
|
||
If we are inputting non-ASCII URI to location bar directly, the sending request is not compatible at IE6(default settings).
+ jp-critical
Keywords: jp-critical
Comment 39•19 years ago
|
||
Comment on attachment 215656 [details] [diff] [review]
patch v3
>Index: netwerk/base/src/nsStandardURL.cpp
>+ nsCAutoString loweredSpec(spec);
>+ ToLowerCase(loweredSpec);
>+ GET_SEGMENT_ENCODER_FROM_SPEC(encoder, loweredSpec.get());
I don't understand why it is necessary to lowercase the entire URL spec.
Assignee | ||
Comment 40•19 years ago
|
||
using PL_strncasecmp instead of stricmp + ToLowerCase.
Attachment #169230 -
Attachment is obsolete: true
Attachment #215656 -
Attachment is obsolete: true
Attachment #215810 -
Flags: superreview?(darin)
Attachment #215810 -
Flags: review?(jshin1987)
Attachment #215656 -
Flags: superreview?(darin)
Attachment #215656 -
Flags: review?(jshin1987)
Assignee | ||
Comment 41•19 years ago
|
||
Removed file protocol hack since bug 278161 was fixed.
Attachment #215810 -
Attachment is obsolete: true
Attachment #216628 -
Flags: review?(jshin1987)
Attachment #215810 -
Flags: superreview?(darin)
Attachment #215810 -
Flags: review?(jshin1987)
Comment 42•19 years ago
|
||
Comment on attachment 216628 [details] [diff] [review]
patch v3.2
Looks good, but let biesi take a look.
>Index: netwerk/base/src/nsStandardURL.cpp
>
> #define NS_NET_PREF_ALWAYSENCODEINUTF8 "network.standard-url.encode-utf8"
>+#define NS_NET_PREF_ENCODEQUERYINUTF8 "network.standard-url.encode-query-utf8"
Can you align the opening quote of a new line with that of the line above it? Even with that, it's only 79 chars. long.
>+#define GET_QUERY_ENCODER(name) \
>+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
>+ gEncodeQueryInUTF8)
Perhaps, it's better this way:
GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
gEncodeQueryInUTF8)
Attachment #216628 -
Flags: review?(jshin1987) → review?(cbiesinger)
Assignee | ||
Comment 43•19 years ago
|
||
Attachment #216628 -
Attachment is obsolete: true
Attachment #216631 -
Flags: review?(cbiesinger)
Attachment #216628 -
Flags: review?(cbiesinger)
Comment 44•19 years ago
|
||
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42
unfortunate that we have to do this, oh well :(
what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't that need to be changed to deal with URIs that are partially in UTF-8 and partially in the origin charset?
Attachment #216631 -
Flags: review?(cbiesinger) → review+
Assignee | ||
Comment 45•19 years ago
|
||
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42
> what about mozilla code that unescapes URIs? like nsTextToSubURI.cpp? doesn't
> that need to be changed to deal with URIs that are partially in UTF-8 and
> partially in the origin charset?
That's bug 320807.
Attachment #216631 -
Flags: superreview?(darin)
Comment 46•19 years ago
|
||
Comment on attachment 216631 [details] [diff] [review]
resolved points of comment #42
>Index: netwerk/base/src/nsStandardURL.cpp
> #define GET_SEGMENT_ENCODER(name) \
>+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8)
>+
>+#define GET_QUERY_ENCODER(name) \
>+ GET_SEGMENT_ENCODER_INTERNAL(name, gAlwaysEncodeInUTF8 && \
>+ gEncodeQueryInUTF8)
>+
>+#define GET_SEGMENT_ENCODER_INTERNAL(name, useUTF8) \
>+ nsSegmentEncoder name(useUTF8 ? nsnull : mOriginCharset.get())
It's usually nice to define a macro above where it is first used.
sr=darin
Attachment #216631 -
Flags: superreview?(darin) → superreview+
Assignee | ||
Comment 47•19 years ago
|
||
Thank you.
Carring over r+sr.
Attachment #216631 -
Attachment is obsolete: true
Attachment #218073 -
Flags: superreview+
Attachment #218073 -
Flags: review+
Comment 48•19 years ago
|
||
checked-in.
Don't we need this for 1.8.1 if there are no regressions?
Status: ASSIGNED → RESOLVED
Closed: 20 years ago → 19 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 49•19 years ago
|
||
I really want to land this on branch, but we will have to wait for bug 278161.
Depends on: 278161
Assignee | ||
Comment 50•19 years ago
|
||
I've filed bug 333859 to followup the query part issue.
Comment 51•19 years ago
|
||
Don't we need the fix for bug 320807 anywhere where we land this?
Depends on: 320807
Assignee | ||
Comment 52•19 years ago
|
||
(In reply to comment #51)
> Don't we need the fix for bug 320807 anywhere where we land this?
At least, we need to handle the "mixed-charset" url (comment #45).
Comment 53•19 years ago
|
||
*** Bug 340376 has been marked as a duplicate of this bug. ***
Comment 54•19 years ago
|
||
(In reply to comment #49)
> I really want to land this on branch, but we will have to wait for bug 278161.
>
Bug 278161 did land on the branch on May 1st. Can we check this in too ? (there's no blocking-flag yet)
Assignee | ||
Comment 55•19 years ago
|
||
Comment on attachment 218073 [details] [diff] [review]
resolved darin's comment
Bug 278161 is fixed now.
And I didn't hear any regressions.
Attachment #218073 -
Flags: approval-branch-1.8.1?(cbiesinger)
Comment 56•19 years ago
|
||
(In reply to comment #55)
> (From update of attachment 218073 [details] [diff] [review] [edit])
> Bug 278161 is fixed now.
> And I didn't hear any regressions.
What about comment 51?
Assignee | ||
Comment 57•19 years ago
|
||
Bug 320807 will not be suitable for landing on the branch because it's large and contains interface change.
Probably we will need a subset patch to handle only mixed-charset case.
Assignee | ||
Comment 58•19 years ago
|
||
Comment on attachment 218073 [details] [diff] [review]
resolved darin's comment
Clearing request until bug 320807 is solved
Attachment #218073 -
Flags: approval-branch-1.8.1?(cbiesinger)
Comment 59•18 years ago
|
||
*** Bug 349671 has been marked as a duplicate of this bug. ***
You need to log in
before you can comment on or make changes to this bug.
Description
•