Closed
Bug 102984
Opened 23 years ago
Closed 13 years ago
URL search part is always encoded UTF8
Categories
(SeaMonkey :: General, defect)
Tracking
(Not tracked)
RESOLVED
INVALID
People
(Reporter: bugzilla.mozilla.org, Unassigned)
Details
(Keywords: intl, Whiteboard: [SmBugEvent])
In my bookmarks, I have defined "google" as follows:
Name: Google
Location: http://www.google.de/search?q=%s
Keywords: ?
Now if I enter '?' plus a word in the URL line, google is used
to search for that word.
But if I enter '? wüste', the resulting URL is
http://www.google.de/search?q=w%C3%BCste
I.e. the umlaut 'ü' is transformed to %C3%BC (UTF8).
But %FC (the ISO-8859-1 byte for 'ü') would be correct.
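For illustration, the two escapes discussed here can be reproduced with a short Python sketch. It uses the standard library's urllib.parse.quote and is not Mozilla's code; it only shows how the same word yields different percent-escapes depending on the charset chosen.

```python
from urllib.parse import quote

word = "wüste"

# Percent-encode the UTF-8 bytes of the word (what the browser sent).
utf8_form = quote(word, encoding="utf-8")

# Percent-encode the ISO-8859-1 bytes (what the reporter expected).
latin1_form = quote(word, encoding="iso-8859-1")

print(utf8_form)    # w%C3%BCste
print(latin1_form)  # w%FCste
```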
Comment 1•23 years ago
->necko (is this where the URL parser lives?)
Assignee: asa → neeti
Component: Browser-General → Networking
QA Contact: doronr → benc
Comment 3•23 years ago
shotgun cc:
Comment 4•23 years ago
no, this is not a necko issue, this is an i18n issue.
reassign to ftang for now
The problem is: how can we know that we want to encode to ISO-8859-1 instead of UTF-8 in
this case?
Assignee: neeti → ftang
Reporter
Comment 5•23 years ago
This is nearly the same problem as when entering the URL
http://www.mozilla.org/htdig-cgi/htsearch?words=müll
But I agree: you don't know whether you have to convert it to ISO8859-1 or
ISO8859-9 or whatever.
But to choose ISO8859-1 seems to make more sense than to choose UTF8 which is
never correct.
Comment 7•23 years ago
>But to choose ISO8859-1 seems to make more sense than to choose UTF8 which
>is never correct.
first, read http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2277.html
http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2718.html
http://www.cis.ohio-state.edu/cgi-bin/rfc/rfc2640.html
http://www.w3.org/International/2000/03/draft-masinter-url-i18n-05.txt
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
http://www.w3.org/TR/REC-xml#sec-external-ent
Reporter
Comment 8•23 years ago
Sorry, I was not correct. If a page is encoded in UTF8, the contents of forms
are of course sent as UTF8 encoded.
But pages encoded in UTF8 are the minority, I guess.
Comment 9•23 years ago
Yes, but there are many, many pages that do not use ISO-8859-1. For example,
Chinese pages use BIG5 or GB2312, and Japanese pages use Shift_JIS or EUC-JP. This is not
a binary decision between ISO-8859-1 and UTF-8. It is a multiple-choice
decision.
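To make the multiple-choice point concrete, here is a minimal Python sketch that percent-encodes one character under each charset named above. The function name escapes_per_charset and the sample character are chosen only for illustration; the codec names are the ones Python's standard library provides.

```python
from urllib.parse import quote

# Charsets named in the comment above, plus UTF-8 for comparison.
CHARSETS = ["utf-8", "big5", "gb2312", "shift_jis", "euc_jp"]

def escapes_per_charset(text, charsets=CHARSETS):
    """Map each charset name to the percent-encoded form of `text`."""
    return {name: quote(text, encoding=name) for name in charsets}

# The same character produces a different escape under every charset,
# so a server cannot tell which one was meant from the bytes alone.
results = escapes_per_charset("中")
for name, escaped in results.items():
    print(f"{name:10} {escaped}")
```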
Comment 10•23 years ago
How does IE handle this?
Should we build a default URL bar encoding into prefs and allow the user to change it
from prefs?
Comment 11•23 years ago
Can someone update the summary to be more descriptive?
Like "URL: encoding in bookmarks is <a> when it should be <b>"
Reporter
Updated•23 years ago
Summary: URL encoding not correct → URL search part is always encoded UTF8
Comment 12•23 years ago
I tried typing this in the location bar and hitting return:
http://www.google.de/search?charset=UTF-8&q=w%C3%BCste
Interestingly, google.de returns good results, but it mislabels the results page:
<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">
But if you override the above tag using the menu
View|Character Coding|More|Unicode (UTF-8)
then the results page looks fine and you see that google did find hits for
"wüste".
So, we could fix the client by adding the charset parameter to the query string.
Will this work for all servers? But we also need google (and others) to
correctly label the charset of the results page.
cc'ing Kat Momoi in evangelism.
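The mislabeling effect described above can be sketched in a few lines of Python: the same escaped bytes read correctly as UTF-8 but turn into garbage when interpreted under the ISO-8859-1 label. This is only an illustration using the standard library, not a reproduction of what the browser does internally.

```python
from urllib.parse import unquote_to_bytes

# Undo the percent-escapes to get the raw bytes that were sent.
raw = unquote_to_bytes("w%C3%BCste")

# Interpreting the bytes with the charset they were written in.
as_utf8 = raw.decode("utf-8")

# Interpreting the same bytes under the wrong label.
as_latin1 = raw.decode("iso-8859-1")

print(as_utf8)    # wüste
print(as_latin1)  # wÃ¼ste
```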
Reporter
Comment 13•23 years ago
>Interestingly, google.de returns good results
Google only returns pages encoded in UTF8. The result is not
the normal desired result! This is a google bug.
The "charset=" parameter does not have any influence since it
depends on the underlying CGI.
If a form is in a page with UTF8 encoding, the form generates
a UTF8-encoded search string.
If the form is in an ISO8859-1 encoded page, it generates an
ISO8859-1 encoded search string.
If you enter the search string manually and don't know the
encoding that is expected by the CGI, do not encode the search
string, just leave it as it is on that machine.
Comment 14•23 years ago
> How does IE handle this?
It seems to send all search requests from the Address Bar
to an MSN search engine. So, in essence, they control what should
be sent. I tried a Latin-1 string and it did not use the UTF-8 %-escaped
format. It was 8859-1 in %-escaped format.
> Should we build a default URL bar encoding into prefs and
> allow the user to change it from prefs?
IE has such an option to send URLs in UTF-8 (the default except in
Asian versions of IE).
A good solution to this problem needs to include the
following considerations:
1. An option to turn ON or OFF UTF-8 URL encoding when input
from URL bar. In case it is a search string as opposed to
normal URLs, provide fallback encoding in case there is no info
for target search engines/cgi's.
2. Easier way to choose encoding for different search engines.
Currently this is set via RDF files for each search engine.
This approach has limitations -- users are unable to change
values easily.
3. Possible support for emerging CNRP (Common Name Resolution
Protocol) -- which uses UTF-8, for keyword search.
http://www.ietf.org/ids.by.wg/cnrp.html
Are there other requirements?
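As a rough sketch of requirements 1 and 2, the selection logic could look like the Python below. Every name here (encode_search_term, send_utf8, engine_charset, fallback_charset) is hypothetical and does not correspond to any existing Mozilla pref or API; the default fallback of ISO-8859-1 is likewise an assumption.

```python
from urllib.parse import quote

def encode_search_term(term, send_utf8=True, engine_charset=None,
                       fallback_charset="iso-8859-1"):
    """Percent-encode a search term typed in the URL bar.

    All parameter names are hypothetical; they model the proposed
    on/off preference, a per-engine charset, and a fallback charset.
    """
    if send_utf8:
        charset = "utf-8"
    else:
        # Use the engine's charset when known, else the fallback.
        charset = engine_charset or fallback_charset
    try:
        return quote(term, encoding=charset, errors="strict")
    except UnicodeEncodeError:
        # The term cannot be represented in that charset; UTF-8 always can.
        return quote(term, encoding="utf-8")

print(encode_search_term("wüste"))                   # w%C3%BCste
print(encode_search_term("wüste", send_utf8=False))  # w%FCste
```

Falling back to UTF-8 when the chosen charset cannot represent the term is one possible design choice; the thread does not settle what should happen in that case.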
Updated•23 years ago
Status: NEW → ASSIGNED
Comment 15•23 years ago
Are German umlauts allowed in links anyway?
Comment 16•23 years ago
url related issue. give to nhotta.
Assignee: ftang → nhotta
Severity: minor → major
Status: ASSIGNED → NEW
Priority: -- → P3
Target Milestone: --- → mozilla0.9.7
Comment 17•23 years ago
Currently only ASCII is allowed in URLs. Characters above 127 have to be escaped.
I don't think there is any standard for which character set to use in a URL.
There is a similar bug 105909.
I think UTF-8 is the way to support as many characters as possible. But depending on
the situation, using other character sets might be preferred. In bug 105909, I
proposed specifying the character set as a pref. Another possibility is to provide
an option to use the current document charset.
Updated•23 years ago
Target Milestone: mozilla1.0 → mozilla1.2
Comment 18•23 years ago
*** Bug 135763 has been marked as a duplicate of this bug. ***
Comment 19•23 years ago
Google now supports ie=encoding (input encoding) and oe=encoding (output
encoding) fields in its search requests, so we can simply set those to UTF-8
(ie=UTF-8&oe=UTF-8) and be done with it.
For other search engines, we should introduce some way to define in the search
engine definition file which encoding the input should be encoded in.
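A minimal Python sketch of the Google suggestion above: build the query string with the ie and oe fields set to UTF-8 and encode the term as UTF-8 to match. The helper name google_search_url is made up for this example, and the sketch assumes the ie/oe parameters behave as described in this comment.

```python
from urllib.parse import urlencode

def google_search_url(term):
    """Build a search URL that declares UTF-8 for input and output."""
    params = {"q": term, "ie": "UTF-8", "oe": "UTF-8"}
    # The term is percent-encoded as UTF-8, matching the ie= declaration.
    return "http://www.google.de/search?" + urlencode(params, encoding="utf-8")

print(google_search_url("wüste"))
# http://www.google.de/search?q=w%C3%BCste&ie=UTF-8&oe=UTF-8
```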
Comment 20•23 years ago
->bookmarks
Actually, this is a specific issue for each search engine, internet keywords,
and bookmarks.
The original problem was about a custom keyword in a bookmark.
Other areas should be filed as separate bugs.
Assignee: nhotta → new-network-bugs
Status: ASSIGNED → NEW
Reporter
Comment 21•22 years ago
With RC1, at least under OS/2, characters above 128 are no longer encoded in
UTF8, but in CP437. But for this case, the ie= parameter from google is nevertheless
useful. Where is it documented?
Comment 22•22 years ago
It's not documented anywhere on Google's site, but it is a part of their SOAP
API. I still believe we should use it in Mozilla until someone from Google
confirms this argument is going to stay during further site improvements.
I mailed them (Google) a few days ago and they haven't answered me yet. If someone
has a shortcut to get a quicker response, that'd be nice :)
Reporter
Comment 23•22 years ago
Maybe http://groups.google.de/groups?hl=de&group=google.public.support.general
is the right place to ask.
Reporter
Comment 26•22 years ago
Why is the encoding no longer UTF8? (1.0 final)
Reporter
Comment 27•21 years ago
This has got nothing to do with bookmarks.
The problem also exists if I directly enter a search URL like the following:
http://phone.people.yahoo.com/py/psPhoneSearch.py?LastName=Müller
Component: Bookmarks → Browser-General
Updated•20 years ago
Product: Browser → Seamonkey
Comment 28•20 years ago
I tried to set all utf8 related options in "about:config" to be "false", but it
does not work.
In Firefox, it works fine.
Comment 29•20 years ago
The spec is very clear.
http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars
Put back/validate utf-8 encoding on:
-ANY form submitted via GET (maybe not POST, since those have a content-type of
their own, to which you could add charset=utf-8)
-ANY anchor/bookmark/addressbar URL.
This mistake breaks CLF log file analysis, stats and monitoring tools.
And let google fix their problems themselves.
Updated•18 years ago
Assignee: bugs → general
QA Contact: claudius → general
Updated•16 years ago
Priority: P3 → --
Target Milestone: mozilla1.2alpha → ---
Comment 30•13 years ago
The W3C specification linked in comment #29 is indeed very clear: percent-encoding in an anchor (or in the address bar) should be according to the UTF-8 representation of the text.
The "actual behaviour" is therefore the correct (desired) behaviour. => INVALID.
Updated•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Whiteboard: [SmBugEvent]