Closed Bug 102984 Opened 23 years ago Closed 13 years ago

URL search part is always encoded UTF8

Categories

(SeaMonkey :: General, defect)

Other
All
defect
Not set
major

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: bugzilla.mozilla.org, Unassigned)

Details

(Keywords: intl, Whiteboard: [SmBugEvent])

In my bookmarks, I have defined "google" as follows:

Name: Google
Location: http://www.google.de/search?q=%s
Keywords: ?

Now if I enter '?' plus a word in the URL line, Google is used to search for that word. But if I enter '? wüste', the resulting URL is

http://www.google.de/search?q=w%C3%BCste

I.e. the umlaut 'ü' is transformed to %C3%BC (UTF-8), but %FC would be correct.
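For reference, the two escapes can be reproduced with a short Python sketch. The `expand_keyword` helper is hypothetical and only illustrates `%s` substitution in a keyword bookmark; it is not how the browser implements it.

```python
from urllib.parse import quote

def expand_keyword(template: str, term: str, encoding: str) -> str:
    """Substitute a percent-encoded search term for %s in a bookmark URL."""
    # quote() encodes the text to bytes in the given charset, then escapes
    # every byte that is not an unreserved ASCII character as %HH.
    return template.replace("%s", quote(term, encoding=encoding))

template = "http://www.google.de/search?q=%s"
utf8_url = expand_keyword(template, "wüste", "utf-8")
latin1_url = expand_keyword(template, "wüste", "iso-8859-1")
print(utf8_url)    # http://www.google.de/search?q=w%C3%BCste
print(latin1_url)  # http://www.google.de/search?q=w%FCste
```

The same character yields two bytes in UTF-8 and one byte in ISO-8859-1, which is why the server sees a different query depending on the charset chosen by the client.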
->necko (is this where the URL parser lives?)
Assignee: asa → neeti
Component: Browser-General → Networking
QA Contact: doronr → benc
It's a start. Who can we add from i18n?
shotgun cc:
Blocks: 86948
No, this is not a necko issue, this is an i18n issue. Reassigning to ftang for now. The problem is: how can we know that we want to encode to ISO-8859-1 instead of UTF-8 in this case?
Assignee: neeti → ftang
This is nearly the same problem as when entering the URL http://www.mozilla.org/htdig-cgi/htsearch?words=müll But I agree: you don't know whether you have to convert it to ISO8859-1 or ISO8859-9 or whatever. But to choose ISO8859-1 seems to make more sense than to choose UTF8 which is never correct.
setting bug status to New
Status: UNCONFIRMED → NEW
Ever confirmed: true
Sorry, I was not correct. If a page is encoded in UTF8, the contents of forms are of course sent as UTF8 encoded. But pages encoded in UTF8 are the minority, I guess.
Yeah, but there are many pages which do not use ISO-8859-1; for example, Chinese pages use Big5 or GB2312, and Japanese pages use Shift_JIS or EUC-JP. This is not a binary decision between ISO-8859-1 and UTF-8. It is a multiple-choice decision.
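To make the multiple-choice point concrete, here is a small Python sketch that escapes one search term under several of the charsets mentioned. The sample term is an assumption chosen because it is representable in all of them.

```python
from urllib.parse import quote

# Sample term (an assumption for illustration) that exists in every charset below.
term = "日本"
charsets = ["utf-8", "big5", "gb2312", "shift_jis", "euc-jp"]

# Encode the same text under each charset, then percent-escape the bytes.
escaped = {cs: quote(term, encoding=cs) for cs in charsets}
for cs, value in escaped.items():
    print(f"{cs:10} {value}")
```

Every charset produces a different query string for the same text, so a server that expects one of them cannot interpret the others correctly.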
How does IE handle this? Should we build a default URL bar encoding into the prefs and allow the user to change it there?
Can someone update the summary to be more descriptive? Like "URL: encoding in bookmarks is <a> when it should be <b>"
Summary: URL encoding not correct → URL search part is always encoded UTF8
I tried typing this in the location bar and hitting return:

http://www.google.de/search?charset=UTF-8&q=w%C3%BCste

Interestingly, google.de returns good results, but it mislabels the results page:

<meta HTTP-EQUIV="content-type" CONTENT="text/html; charset=ISO-8859-1">

But if you override the above tag using the menu View|Character Coding|More|Unicode (UTF-8), then the results page looks fine and you see that Google did find hits for "wüste". So, we could fix the client by adding the charset parameter to the query string. Will this work for all servers? But we also need Google (and others) to correctly label the charset of the results page. cc'ing Kat Momoi in evangelism.
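The effect of the mislabeled page can be illustrated in Python by decoding the same escaped query value under both charsets. This is only a sketch of the byte-level mismatch, not of what the server actually does.

```python
from urllib.parse import unquote

query_value = "w%C3%BCste"

# Bytes C3 BC read as UTF-8 are one character; read as ISO-8859-1 they are two.
as_utf8 = unquote(query_value, encoding="utf-8")
as_latin1 = unquote(query_value, encoding="iso-8859-1")
print(as_utf8)    # wüste
print(as_latin1)  # wÃ¼ste
```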
>Interestingly, google.de returns good results

Google only returns pages encoded in UTF-8. The result is not the normally desired result! This is a Google bug. The "charset=" parameter does not have any influence, since that depends on the underlying CGI. If a form is in a page with UTF-8 encoding, the form generates a UTF-8 encoded search string. If the form is in an ISO-8859-1 encoded page, it generates an ISO-8859-1 encoded search string. If the search string is entered manually and the encoding expected by the CGI is not known, do not re-encode the search string; just leave it as it is on that machine.
> how do IE handle this ?

It seems to send all its search requests from the Address Bar to an MSN search engine. So, in essence, they control what should be sent. I tried a Latin-1 string and it did not use the UTF-8 %-escaped format. It was 8859-1 in %-escaped format.

> Should we build a default URL bar encoding in pref and
> allow user to change it from pref ?

IE has such an option to send URLs in UTF-8 (on by default except in Asian versions of IE). A good solution to this problem needs to include the following considerations:

1. An option to turn UTF-8 URL encoding ON or OFF for input from the URL bar. In case it is a search string as opposed to a normal URL, provide a fallback encoding in case there is no info for the target search engine/CGI.
2. An easier way to choose the encoding for different search engines. Currently this is set via RDF files for each search engine. This approach has limitations -- users are unable to change values easily.
3. Possible support for the emerging CNRP (Common Name Resolution Protocol) -- which uses UTF-8 -- for keyword search. http://www.ietf.org/ids.by.wg/cnrp.html

Are there other requirements?
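Requirements 1 and 2 could be sketched roughly as follows in Python. Everything here is hypothetical: the `ENGINE_CHARSETS` table, the host names, and the `send_utf8` and `fallback` parameters stand in for prefs that do not exist under these names.

```python
from urllib.parse import quote

# Hypothetical per-engine table; host and charset are illustrative only.
ENGINE_CHARSETS = {"search.example": "iso-8859-1"}

def encode_search_term(term, host, send_utf8=True, fallback="iso-8859-1"):
    """Pick a charset for a search term typed in the URL bar, then escape it."""
    if host in ENGINE_CHARSETS:      # requirement 2: per-engine setting wins
        charset = ENGINE_CHARSETS[host]
    elif send_utf8:                  # requirement 1: the UTF-8 on/off switch
        charset = "utf-8"
    else:                            # requirement 1: fallback encoding
        charset = fallback
    # errors="replace" substitutes "?" for characters the charset cannot hold.
    return quote(term, encoding=charset, errors="replace")

print(encode_search_term("wüste", "search.example"))                  # w%FCste
print(encode_search_term("wüste", "other.example"))                   # w%C3%BCste
print(encode_search_term("wüste", "other.example", send_utf8=False))  # w%FCste
```

The order of the checks is a design assumption: known engine information takes precedence, and the fallback is used only when UTF-8 is switched off and nothing is known about the target.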
Keywords: intl
Status: NEW → ASSIGNED
Are German umlauts allowed in links anyway?
url related issue. give to nhotta.
Assignee: ftang → nhotta
Severity: minor → major
Status: ASSIGNED → NEW
Priority: -- → P3
Target Milestone: --- → mozilla0.9.7
Currently only ASCII is allowed in URLs. Characters above 127 have to be escaped. I don't think there is any standard for which character set to use for URLs. There is a similar bug 105909. I think UTF-8 is the way to support as many characters as possible. But depending on the situation, using other character sets might be preferred. In bug 105909, I proposed specifying the character set as a pref. Another possibility is to provide an option to use the current document charset.
Status: NEW → ASSIGNED
Keywords: mozilla1.0
Target Milestone: mozilla0.9.7 → mozilla1.0
Target Milestone: mozilla1.0 → mozilla1.2
*** Bug 135763 has been marked as a duplicate of this bug. ***
Google now supports ie=encoding (input encoding) and oe=encoding (output encoding) fields in its search requests, so we can simply set those to UTF-8 (ie=UTF-8&oe=UTF-8) and be over with it. For other search engines, we should introduce some way to define in the search engine definition file which encoding the input should be encoded in.
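As a sketch of what such a request would look like, the query string can be built in Python like this. The `ie` and `oe` parameter names are taken from the comment above; the rest is illustrative.

```python
from urllib.parse import urlencode

# ie = input encoding, oe = output encoding, as described above.
params = {"q": "wüste", "ie": "UTF-8", "oe": "UTF-8"}
url = "http://www.google.de/search?" + urlencode(params, encoding="utf-8")
print(url)  # http://www.google.de/search?q=w%C3%BCste&ie=UTF-8&oe=UTF-8
```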
->bookmarks Actually, this is a specific issue for each of search engines, internet keywords, and bookmarks. The original problem was about a custom keyword in a bookmark. Other areas should be filed as separate bugs.
Assignee: nhotta → new-network-bugs
Status: ASSIGNED → NEW
With RC1, at least under OS/2, characters above 128 are no longer encoded in UTF8, but in CP437. But for this case, the ie= parm from google is nevertheless useful. Where is it documented?
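For comparison, this is what the same word looks like when escaped from CP437 bytes. This is a Python sketch of the encoding itself, not of the OS/2 build.

```python
from urllib.parse import quote

# In code page 437 the character 'ü' is the single byte 0x81.
cp437_escaped = quote("wüste", encoding="cp437")
print(cp437_escaped)  # w%81ste
```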
It's not documented anywhere on Google's site, but it is a part of their SOAP API. I still believe we should use it in Mozilla until someone from Google confirms this argument is going to stay during further site improvements. I mailed them (Google) a few days ago and they haven't answered yet. If someone has a shortcut to get a quicker response, that'll be nice :)
really moving to bookmarks.
Component: Networking → Bookmarks
okay.
Assignee: new-network-bugs → ben
QA Contact: benc → claudius
Why is the encoding no longer UTF8? (1.0 final)
This has got nothing to do with bookmarks. The problem also exists if I directly enter a search URL like the following: http://phone.people.yahoo.com/py/psPhoneSearch.py?LastName=Müller
Component: Bookmarks → Browser-General
Product: Browser → Seamonkey
I tried to set all utf8 related options in "about:config" to be "false", but it does not work. In firefox, it works fine.
The spec is very clear. http://www.w3.org/TR/html40/appendix/notes.html#non-ascii-chars

Put back/validate UTF-8 encoding on:
- ANY form submitted via GET (maybe not POST, since those have a content-type of their own, to which you could add charset=utf-8)
- ANY anchor/bookmark/address bar URL.

These mistakes break CLF log file analysis, stats and monitoring tools. And let Google fix their problems themselves.
Assignee: bugs → general
QA Contact: claudius → general
Priority: P3 → --
Target Milestone: mozilla1.2alpha → ---
The W3C specification linked in comment #29 is indeed very clear: percent-encoding in an anchor (or in the address bar) should be according to the UTF-8 representation of the text. The "actual behaviour" is therefore the correct (desired) behaviour. => INVALID.
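The procedure recommended in that appendix (represent each non-ASCII character in UTF-8, then escape each byte as %HH) can be written out by hand in Python. This is a minimal sketch with a hypothetical function name, and it leaves all ASCII characters untouched.

```python
def escape_non_ascii(uri: str) -> str:
    """Escape non-ASCII characters as the %HH form of their UTF-8 bytes."""
    out = []
    for ch in uri:
        if ord(ch) < 128:
            out.append(ch)  # ASCII characters pass through unchanged
        else:
            # One %HH escape per UTF-8 byte of the character.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
    return "".join(out)

escaped_uri = escape_non_ascii("http://www.mozilla.org/htdig-cgi/htsearch?words=müll")
print(escaped_uri)  # http://www.mozilla.org/htdig-cgi/htsearch?words=m%C3%BCll
```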
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → INVALID
Whiteboard: [SmBugEvent]