Closed
Bug 615595
Opened 14 years ago
Closed 13 years ago
Forms of UTF-16LE documents are encoded in UTF-8, although _charset_ declares UTF-16LE
Categories
(Core :: DOM: Core & HTML, defect)
Tracking
()
RESOLVED
FIXED
mozilla13
People
(Reporter: loic.etienne, Assigned: bzbarsky)
References
Details
Attachments
(3 files, 1 obsolete file)
User-Agent: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
Build Identifier: Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
The following HTML document should be converted to UTF-16LE, saved as form.utf16le.html, and then opened as a local file in Firefox:
<?xml version="1.0" encoding="UTF-16LE"?>
<html>
<head>
<title>encoding bug</title>
<meta http-equiv='Content-Type' content='text/html; charset=UTF-16LE' />
</head>
<body>
<form method='get'>
<input type='text' name='oops' value='aïe' />
<input type='hidden' name='_charset_' />
</form>
</body>
</html>
Submitting it (with the Return key) produces an inconsistent URI.
Reproducible: Always
Steps to Reproduce:
1. open file:///.../form.utf16le.html
2. submit the form (with the Return key)
Actual Results:
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-16LE
(but %C3%AF is the UTF-8 byte sequence for ï, and the whole query string is encoded in UTF-8).
Expected Results:
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
or
file:///.../form.utf16le.html?o%00o%00p%00s=a%00%EF%00e%00
or
file:///.../form.utf16le.html?o%00o%00p%00s=a%00%EF%00e%00&_charset_=UTF-16LE
or
? (open)
Note the logical difficulty of encoding the string value "UTF-16LE" in UTF-16LE itself; here I used US-ASCII to encode the _charset_ parameter, because otherwise it would be of little use. I am not sure that _charset_ makes sense for encodings that are not US-ASCII compatible.
While it is not clear what the best strategy is, whichever strategy is adopted should be documented.
This bug is related to #169575. I agree, it would be simpler to always use UTF-8 for query string parameters. Unfortunately, history cannot be changed... And Firefox still declares ISO-8859-1 as its preferred encoding!
In an older release, I observed _charset_=UTF-16 being sent instead of _charset_=UTF-16LE. While LE may be the default of the OS, BE is the default for network communication, so this is confusing. I suggest never using UTF-16 as the _charset_ value without an explicit LE or BE.
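For reference, the byte sequences in the actual and expected results above can be reproduced with a short Python sketch. It only illustrates the percent-encoding of the field value "aïe" in the two encodings; the variable names are illustrative and it does not model Firefox's form submission code.

```python
from urllib.parse import quote

value = "aïe"

# UTF-8: ï (U+00EF) becomes the two bytes C3 AF.
utf8_query = quote(value.encode("utf-8"))

# UTF-16LE: every code unit is two bytes, low byte first, so each
# ASCII letter is followed by a NUL byte that must be percent-encoded.
utf16le_query = quote(value.encode("utf-16-le"))

print(utf8_query)     # a%C3%AFe
print(utf16le_query)  # a%00%EF%00e%00
```

The first result matches the query string Firefox actually sends, which is why a _charset_ value of UTF-16LE next to it is inconsistent.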
Updated•13 years ago
Version: unspecified → 3.6 Branch
Comment 1•13 years ago
Reporter, Firefox 4.0.1 has been released, and it features significant improvements over previous releases. Can you please update to Firefox 4.0.1 or later, and retest your bug? Please also create a fresh profile (http://support.mozilla.com/kb/Managing+profiles), update your plugins (Flash, Java, Quicktime, Reader, etc) and update your graphics driver and Operating system to the latest versions available.
If you still continue to see this issue, please comment. If you do not, please close this bug as RESOLVED > WORKSFORME
filter: prefirefox4uncobugs
Comment 3•13 years ago
Confirming, our behavior violates the application/x-www-form-urlencoded encoding algorithm.
http://dev.w3.org/html5/spec/Overview.html#application-x-www-form-urlencoded-encoding-algorithm
file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
should be sent per spec.
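A simplified sketch of the behavior this comment expects is shown below. It assumes that encodings which are not ASCII-compatible (here just the UTF-16 family) fall back to UTF-8, and that a hidden field named _charset_ receives the name of the encoding actually used. The function names and the field tuple layout are hypothetical, and details such as "+" for spaces are left out.

```python
from urllib.parse import quote

# Assumption: these encodings are not ASCII-compatible, so the
# submission falls back to UTF-8 instead of using them directly.
NON_ASCII_COMPATIBLE = {"utf-16", "utf-16le", "utf-16be"}


def pick_submission_charset(document_charset: str) -> str:
    """Return the encoding used for the query string."""
    if document_charset.lower() in NON_ASCII_COMPATIBLE:
        return "UTF-8"
    return document_charset


def encode_form(fields, document_charset: str) -> str:
    """Encode (name, type, value) tuples into a query string."""
    charset = pick_submission_charset(document_charset)
    pairs = []
    for name, kind, value in fields:
        # A hidden _charset_ field reports the encoding actually used.
        if name == "_charset_" and kind == "hidden":
            value = charset
        pairs.append(
            quote(name.encode(charset)) + "=" + quote(value.encode(charset))
        )
    return "&".join(pairs)


fields = [("oops", "text", "aïe"), ("_charset_", "hidden", "")]
print(encode_form(fields, "UTF-16LE"))  # oops=a%C3%AFe&_charset_=UTF-8
```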
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated•13 years ago
Component: General → HTML: Form Submission
Product: Firefox → Core
QA Contact: general → form-submission
Version: 3.6 Branch → unspecified
Comment 4•13 years ago
Comment 5•13 years ago
> file:///.../form.utf16le.html?oops=a%C3%AFe&_charset_=UTF-8
> should be sent per spec.
Although the spec doesn't define the behavior for the file scheme, our behavior is invalid even for http(s).
http://dev.w3.org/html5/spec/Overview.html#form-submission-algorithm
Assignee | ||
Comment 6•13 years ago
Mmm... Gotta love black-hole components like "Firefox:General". ;)
Do we also want to send windows-1252 when we're using that instead of ISO-8859-1?
Assignee: nobody → bzbarsky
OS: Linux → All
Hardware: x86_64 → All
Whiteboard: [need review]
Assignee | ||
Comment 7•13 years ago
Jonas, please let me know if you think we should keep sending ISO-8859-1 instead of windows-1252?
Attachment #595772 - Flags: review?(jonas)
Assignee | ||
Comment 8•13 years ago
Note that the spec says nothing about ISO-8859-1 vs windows-1252 here...
Comment 9•13 years ago
Per the Encoding Standard, ISO-8859-1 is just an alias of windows-1252. But that is a more general problem than this bug.
Comment 10•13 years ago
> - if (charset.EqualsLiteral("ISO-8859-1")) {
> - charset.AssignLiteral("windows-1252");
> - }
I think we will have to continue to encode form contents as "windows-1252" for Web compat even though we currently treat "ISO-8859-1" as different from "windows-1252".
Assignee | ||
Comment 11•13 years ago
> I think we will have to continue to encode form contents as "windows-1252" for Web compat
Sure. The question is which string should go in the _charset_ in the URL. Should it be _charset_=windows-1252, or should it be _charset_=ISO-8859-1? My patch makes it be the former, but it's easy to do the latter...
Comment 12•13 years ago
(In reply to Boris Zbarsky (:bz) from comment #11)
> > I think we will have to continue to encode form contents as "windows-1252" for Web compat
> Sure.
Ah, I overlooked the code added in GetSubmissionFromForm.
> The question is which string should go in the _charset_ in the URL.
> Should it be _charset_=windows-1252, or should it be _charset_=ISO-8859-1?
> My patch makes it be the former, but it's easy to do the latter...
IMO it should be _charset_=windows-1252 to comply with the Encoding Standard until we implement different alias sets for browser and mail.
Comment 13•13 years ago
Comment 14•13 years ago
For the record,
IE9: ?oops=a%EFe%99&_charset_=iso-8859-1
Firefox without patch : ?oops=a%EFe%99&_charset_=ISO-8859-1
WebKit: ?oops=a%EFe%99&_charset_=
Opera, Firefox with patch: ?oops=a%EFe%99&_charset_=windows-1252
Hm, it may break compatibility with existing content (unless that content already accounts for Opera).
Attachment #595883 - Attachment is obsolete: true
Assignee | ||
Comment 15•13 years ago
Yeah, that was my worry... Thank you for the data-gathering!
Jonas, thoughts?
I think we should land with _charset_=windows-1252 to get an indication of whether it's feasible to start treating ISO-8859-1 as an alias for Windows-1252 (i.e. to see if anything important breaks). If we see breakage, we'll probably need to change Anne's spec to say something more complicated than treating ISO-8859-1 as a plain alias for Windows-1252.
Comment on attachment 595772
Proposed fix that changes both cases
I don't know enough about the windows-1252 issue to have an opinion.
The rest looks good though.
Attachment #595772 - Flags: review?(jonas) → review+
Assignee | ||
Comment 18•13 years ago
Simon, thoughts?
I'm not sure I want to try dealing with possible compat fallout here, honestly. :(
What do other browsers send in this situation?
Comment 19•13 years ago
I'm not sure why people suspect that changing ISO-8859-1 to windows-1252 in the URL is likely to make things break. In the cases from comment 14,
Firefox without patch : ?oops=a%EFe%99&_charset_=ISO-8859-1
Firefox with patch: ?oops=a%EFe%99&_charset_=windows-1252
"Firefox without patch" is already treating ISO-8859-1 as an alias for windows-1252, since 0x99 is ™ in windows-1252 but not in ISO-8859-1; but "Firefox with patch" is correct without any aliasing.
Or did I totally misunderstand the question?
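The point about byte 0x99 can be checked with Python's codecs, used here only as a stand-in for the two character sets: its iso-8859-1 codec is the strict ISO mapping, not the browser alias. The variable names are illustrative.

```python
trademark = "\u2122"  # the trade mark sign

# windows-1252 maps the trade mark sign to the single byte 0x99.
cp1252_bytes = trademark.encode("windows-1252")

# Strict ISO-8859-1 has no trade mark sign, so encoding fails.
try:
    trademark.encode("iso-8859-1")
    encodable_in_latin1 = True
except UnicodeEncodeError:
    encodable_in_latin1 = False

# Decoding 0x99 as strict ISO-8859-1 yields the C1 control U+0099.
decoded_as_latin1 = b"\x99".decode("iso-8859-1")
decoded_as_cp1252 = b"\x99".decode("windows-1252")

print(cp1252_bytes)             # b'\x99'
print(encodable_in_latin1)      # False
print(hex(ord(decoded_as_latin1)))  # 0x99
print(decoded_as_cp1252 == trademark)  # True
```

So a query string containing %99 for the trade mark sign can only have been produced with the windows-1252 mapping, whatever label is sent in _charset_.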
Assignee | ||
Comment 20•13 years ago
The question is what the server will do with the "Firefox with patch" query string. If it's explicitly checking for ISO-8859-1 it could break... and I wouldn't bet on servers not doing that. :(
Ignore my question about other browsers from comment 18; it's answered in comment 14...
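To make the compatibility worry concrete, here is a hypothetical server-side sketch. Neither handler is taken from any real server; the label table, function names, and field names are assumptions for illustration only. The brittle variant accepts only the exact label ISO-8859-1, while the tolerant variant normalizes known labels before decoding.

```python
from urllib.parse import unquote_to_bytes

# Hypothetical label table; a real server would need a fuller alias list.
LABELS = {"iso-8859-1": "cp1252", "windows-1252": "cp1252", "utf-8": "utf-8"}


def split_query(query: str) -> dict:
    """Split a query string into raw (still encoded) byte values."""
    pairs = [part.split("=", 1) for part in query.split("&")]
    return {name: unquote_to_bytes(value) for name, value in pairs}


def decode_brittle(query: str) -> str:
    """Accept only the exact label older browsers sent."""
    fields = split_query(query)
    if fields["_charset_"] != b"ISO-8859-1":
        raise ValueError("unexpected _charset_")
    return fields["oops"].decode("cp1252")


def decode_tolerant(query: str) -> str:
    """Normalize the label before choosing a decoder."""
    fields = split_query(query)
    label = fields["_charset_"].decode("ascii").lower()
    return fields["oops"].decode(LABELS[label])


old = "oops=a%EFe%99&_charset_=ISO-8859-1"
new = "oops=a%EFe%99&_charset_=windows-1252"

print(decode_brittle(old))   # aïe™
print(decode_tolerant(new))  # aïe™
# decode_brittle(new) raises ValueError: the breakage in question.
```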
Comment 21•13 years ago
So we're weighing the possibility of the server explicitly checking for ISO-8859-1 against the possibility of the server explicitly decoding ISO-8859-1 (which would be a problem with the status quo already)?
I say go for _charset_=ISO-8859-1 to maintain compatibility with IE and our old behaviour.
Assignee | ||
Comment 22•13 years ago
> So we're weighing the possibility of the server explicitly checking for ISO-8859-1
> against the possibility of the server explicitly decoding ISO-8859-1 (which would be a
> problem with the status quo already)?
Yes.
> I say go for _charset_=ISO-8859-1 to maintain compatibility with IE and our old behaviour.
OK. I'll file a followup for considering changing the ISO-8859-1 bit.
Assignee | ||
Comment 23•13 years ago
Comment 24•13 years ago
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Updated•6 years ago
Component: HTML: Form Submission → DOM: Core & HTML