Closed
Bug 687859
Opened 13 years ago
Closed 12 years ago
BOM should take precedence for charset determination for scripts
Categories
(Core :: DOM: Core & HTML, defect)
Tracking
()
RESOLVED
FIXED
mozilla19
People
(Reporter: dindog, Assigned: hsivonen)
References
(Blocks 1 open bug)
Details
Attachments
(2 files, 1 obsolete file)
(deleted),
application/octet-stream
|
Details | |
(deleted),
patch
|
smaug
:
review+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Windows NT 6.1; rv:7.0) Gecko/20100101 Firefox/7.0
Build ID: 20110918030911
Steps to reproduce:
We find out that when setting a wrong charset for external sources, Firefox is the only browser fail to switch to correct charset.
testcase is a GB2312 charset HTML, with src='foo.js' in utf-8 bom, wrong charset by intention
Chrome, IE, Opera is fine. Firefox is the only one will fail in this testcase. It should improve
Update the testcase and set the OS to all for I reproduce it in linxu too.
Attachment #561201 -
Attachment is obsolete: true
Status: UNCONFIRMED → NEW
Component: General → File Handling
Ever confirmed: true
OS: Windows 7 → All
Product: Firefox → Core
QA Contact: general → file-handling
Version: 7 Branch → unspecified
Comment 2•13 years ago
|
||
Wrong charset in what sense? What exact charset information is being provided? The "charset" attribute on the element and the BOM in the file, with no HTTP-level charset? Which charset is the "right" one?
The current HTML5 spec text about this says:
To obtain the Unicode string, the user agent run the following steps:
If the resource's Content Type metadata, if any, specifies a character encoding, and
the user agent supports that encoding, then let character encoding be that encoding,
and jump to the bottom step in this series of steps.
If the algorithm above set the script block's character encoding, then let character
encoding be that encoding, and jump to the bottom step in this series of steps.
For each of the rows in the following table, starting with the first one and going
down, if the file has as many or more bytes available than the number of bytes in the
first column, and the first bytes of the file match the bytes given in the first
column, then set character encoding to the encoding given in the cell in the second
column of that row, and jump to the bottom step in this series of steps:
Bytes in Hexadecimal Encoding
FE FF Big-endian UTF-16
FF FE Little-endian UTF-16
EF BB BF UTF-8
Note: This step looks for Unicode Byte Order Marks (BOMs).
So the charset attribute overrides the BOM, per spec. If you think the spec is wrong, please raise a spec issue.
(In reply to Boris Zbarsky (:bz) from comment #2)
> Wrong charset in what sense? What exact charset information is being
> provided? The "charset" attribute on the element and the BOM in the file,
> with no HTTP-level charset? Which charset is the "right" one?
>
> The current HTML5 spec text about this says:
>
> To obtain the Unicode string, the user agent run the following steps:
>
> If the resource's Content Type metadata, if any, specifies a character
> encoding, and
> the user agent supports that encoding, then let character encoding be
> that encoding,
> and jump to the bottom step in this series of steps.
>
> If the algorithm above set the script block's character encoding, then
> let character
> encoding be that encoding, and jump to the bottom step in this series of
> steps.
>
> For each of the rows in the following table, starting with the first one
> and going
> down, if the file has as many or more bytes available than the number of
> bytes in the
> first column, and the first bytes of the file match the bytes given in
> the first
> column, then set character encoding to the encoding given in the cell in
> the second
> column of that row, and jump to the bottom step in this series of steps:
> Bytes in Hexadecimal Encoding
> FE FF Big-endian UTF-16
> FF FE Little-endian UTF-16
> EF BB BF UTF-8
>
> Note: This step looks for Unicode Byte Order Marks (BOMs).
>
> So the charset attribute overrides the BOM, per spec. If you think the spec
> is wrong, please raise a spec issue.
I am not an developer, but here is the thing:
Our forum people found an website which Firefox acted different from other browser, and we found is the website's false, sort of. It return a wrong charset in HTTP header, and obviously I can't do that with a local testcase, but the issue remain if I declare a wrong charset in script tag.
Firefox is the only mainstream browsers will fail to run the script if the charset is incorrect, don't you think this should be fixed? The site's false indeed, but users won't think that way when they try every other browser work out fine, they will think it's Firefox's false
(In reply to Boris Zbarsky (:bz) from comment #2)
> Wrong charset in what sense? What exact charset information is being
> provided? The "charset" attribute on the element and the BOM in the file,
> with no HTTP-level charset? Which charset is the "right" one?
Simple answer is both. Wrong HTTP header charset or attribute on the element will go wrong. (and the testcase demonstrates latter)
Comment 5•13 years ago
|
||
Henri, do IE and Opera allow a BOM in a script to override HTTP headers and @charset?
Component: File Handling → DOM
QA Contact: file-handling → general
Any news about this?
I see a similar bug fixed recently: Bug 693806, according to the spec though.
I mean, now that Firefox is the only one fail to run the testcase, should we pay a little attention to it?
Updated•13 years ago
|
Summary: firefox should verify external sources' charset → BOM should take precedence for charset determination for scripts
Assignee | ||
Updated•12 years ago
|
Assignee | ||
Comment 7•12 years ago
|
||
Attachment #672774 -
Flags: review?(bugs)
Assignee | ||
Updated•12 years ago
|
Assignee | ||
Comment 8•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) from comment #5)
> Henri, do IE and Opera allow a BOM in a script to override HTTP headers and
> @charset?
I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251
Comment 9•12 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #8)
> (In reply to Boris Zbarsky (:bz) from comment #5)
> > Henri, do IE and Opera allow a BOM in a script to override HTTP headers and
> > @charset?
>
> I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all
> of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251
IE10 displayed 'โฌ' in all documents mode.
Assignee | ||
Comment 10•12 years ago
|
||
(In reply to Masatoshi Kimura [:emk] from comment #9)
> IE10 displayed 'โฌ' in all documents mode.
It strips the BOM and then decodes according to HTTP charset.
Assignee | ||
Comment 11•12 years ago
|
||
Filed bug 804113 about a potential pre-existing problem preserved by this patch.
Comment 12•12 years ago
|
||
Comment on attachment 672774 [details] [diff] [review]
Make the BOM take precedence
># HG changeset patch
># User Henri Sivonen <hsivonen@iki.fi>
># Date 1350558618 -10800
># Node ID 755369f4f020ee6983b3e312dd05e05c7f34ab8d
># Parent 89ff8e55638111c977538dbd22edf00b82f013ee
>Bug 687859 - Make the BOM take precedence when determining the character encoding of scripts. r=NOT_REVIEWED.
>
>diff --git a/content/base/src/nsScriptLoader.cpp b/content/base/src/nsScriptLoader.cpp
>--- a/content/base/src/nsScriptLoader.cpp
>+++ b/content/base/src/nsScriptLoader.cpp
>@@ -999,94 +999,103 @@ nsScriptLoader::ConvertToUTF16(nsIChanne
> uint32_t aLength, const nsAString& aHintCharset,
> nsIDocument* aDocument, nsString& aString)
> {
> if (!aLength) {
> aString.Truncate();
> return NS_OK;
> }
>
>- nsAutoCString characterSet;
>+ // The encoding info precedence is as follows from high to low:
>+ // The BOM
>+ // HTTP Content-Type (if name recognized)
>+ // charset attribute (if name recognized)
>+ // The encoding of the document
>
>- nsresult rv = NS_OK;
>- if (aChannel) {
>- rv = aChannel->GetContentCharset(characterSet);
>- }
>-
>- if (!aHintCharset.IsEmpty() && (NS_FAILED(rv) || characterSet.IsEmpty())) {
>- // charset name is always ASCII.
>- LossyCopyUTF16toASCII(aHintCharset, characterSet);
>- }
>-
>- if (NS_FAILED(rv) || characterSet.IsEmpty()) {
>- DetectByteOrderMark(aData, aLength, characterSet);
>- }
>-
>- if (characterSet.IsEmpty() && aDocument) {
>- // charset from document default
>- characterSet = aDocument->GetDocumentCharacterSet();
>- }
>-
>- if (characterSet.IsEmpty()) {
>- // fall back to ISO-8859-1, see bug 118404
>- characterSet.AssignLiteral("ISO-8859-1");
>- }
>+ nsAutoCString charset;
>
> nsCOMPtr<nsICharsetConverterManager> charsetConv =
>- do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv);
>+ do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID);
>
> nsCOMPtr<nsIUnicodeDecoder> unicodeDecoder;
>
>- if (NS_SUCCEEDED(rv) && charsetConv) {
>- rv = charsetConv->GetUnicodeDecoder(characterSet.get(),
>+ if (DetectByteOrderMark(aData, aLength, charset)) {
>+ // charset is now "UTF-8" or "UTF-16". The UTF-16 decoder will re-sniff
>+ // the BOM for endianness. Both the UTF-16 and the UTF-8 decoder will
>+ // take care of swallowing the BOM.
>+ charsetConv->GetUnicodeDecoderRaw(charset.get(),
> getter_AddRefs(unicodeDecoder));
Could you fix the indentation of the latter parameter.
>+ NS_SUCCEEDED(aChannel->GetContentCharset(charset))) {
>+ charsetConv->GetUnicodeDecoder(charset.get(),
>+ getter_AddRefs(unicodeDecoder));
ditto
>+ }
>
>- rv = unicodeDecoder->GetMaxLength(reinterpret_cast<const char*>(aData),
>- aLength, &unicodeLength);
>- if (NS_SUCCEEDED(rv)) {
>- if (!EnsureStringLength(aString, unicodeLength))
>- return NS_ERROR_OUT_OF_MEMORY;
>+ if (!unicodeDecoder) {
>+ CopyUTF16toUTF8(aHintCharset, charset);
>+ charsetConv->GetUnicodeDecoder(charset.get(),
>+ getter_AddRefs(unicodeDecoder));
ditto
>- PRUnichar *ustr = aString.BeginWriting();
>+ if (!unicodeDecoder && aDocument) {
>+ charset = aDocument->GetDocumentCharacterSet();
>+ charsetConv->GetUnicodeDecoderRaw(charset.get(),
>+ getter_AddRefs(unicodeDecoder));
>+ }
ditto
>+ if (!unicodeDecoder) {
>+ // Curiously, there are various callers that don't pass aDocument. The
>+ // fallback in the old code was ISO-8859-1, which behaved like
>+ // windows-1252. Saying windows-1252 for clarity and for compliance
>+ // with the Encoding Standard.
>+ charsetConv->GetUnicodeDecoderRaw("windows-1252",
>+ getter_AddRefs(unicodeDecoder));
ditto
Attachment #672774 -
Flags: review?(bugs) → review+
Assignee | ||
Comment 13•12 years ago
|
||
Thanks. Landed with indent fixed.
https://hg.mozilla.org/integration/mozilla-inbound/rev/369d723c52ee
Comment 14•12 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
Updated•6 years ago
|
Component: DOM → DOM: Core & HTML
You need to log in
before you can comment on or make changes to this bug.
Description
•