687859 - BOM should take precedence for charset determination for scripts

Reporter

Description

•

13 years ago

Attached file testcase.html & foo.js included (obsolete) (deleted) — Details

User Agent: Mozilla/5.0 (Windows NT 6.1; rv:7.0) Gecko/20100101 Firefox/7.0 Build ID: 20110918030911 Steps to reproduce: We find out that when setting a wrong charset for external sources, Firefox is the only browser fail to switch to correct charset. testcase is a GB2312 charset HTML, with src='foo.js' in utf-8 bom, wrong charset by intention Chrome, IE, Opera is fine. Firefox is the only one will fail in this testcase. It should improve

dindog

Reporter

Comment 1

•

13 years ago

Attached file testcase (deleted) — Details

Update the testcase and set the OS to all for I reproduce it in linxu too.

Attachment #561201 - Attachment is obsolete: true

dindog

Reporter

Updated

•

13 years ago

Status: UNCONFIRMED → NEW

Component: General → File Handling

Ever confirmed: true

OS: Windows 7 → All

Product: Firefox → Core

QA Contact: general → file-handling

Version: 7 Branch → unspecified

Boris Zbarsky [:bzbarsky]

Comment 2

•

13 years ago

Wrong charset in what sense? What exact charset information is being provided? The "charset" attribute on the element and the BOM in the file, with no HTTP-level charset? Which charset is the "right" one? The current HTML5 spec text about this says: To obtain the Unicode string, the user agent run the following steps: If the resource's Content Type metadata, if any, specifies a character encoding, and the user agent supports that encoding, then let character encoding be that encoding, and jump to the bottom step in this series of steps. If the algorithm above set the script block's character encoding, then let character encoding be that encoding, and jump to the bottom step in this series of steps. For each of the rows in the following table, starting with the first one and going down, if the file has as many or more bytes available than the number of bytes in the first column, and the first bytes of the file match the bytes given in the first column, then set character encoding to the encoding given in the cell in the second column of that row, and jump to the bottom step in this series of steps: Bytes in Hexadecimal Encoding FE FF Big-endian UTF-16 FF FE Little-endian UTF-16 EF BB BF UTF-8 Note: This step looks for Unicode Byte Order Marks (BOMs). So the charset attribute overrides the BOM, per spec. If you think the spec is wrong, please raise a spec issue.

dindog

Reporter

Comment 3

•

13 years ago

(In reply to Boris Zbarsky (:bz) from comment #2) > Wrong charset in what sense? What exact charset information is being > provided? The "charset" attribute on the element and the BOM in the file, > with no HTTP-level charset? Which charset is the "right" one? > > The current HTML5 spec text about this says: > > To obtain the Unicode string, the user agent run the following steps: > > If the resource's Content Type metadata, if any, specifies a character > encoding, and > the user agent supports that encoding, then let character encoding be > that encoding, > and jump to the bottom step in this series of steps. > > If the algorithm above set the script block's character encoding, then > let character > encoding be that encoding, and jump to the bottom step in this series of > steps. > > For each of the rows in the following table, starting with the first one > and going > down, if the file has as many or more bytes available than the number of > bytes in the > first column, and the first bytes of the file match the bytes given in > the first > column, then set character encoding to the encoding given in the cell in > the second > column of that row, and jump to the bottom step in this series of steps: > Bytes in Hexadecimal Encoding > FE FF Big-endian UTF-16 > FF FE Little-endian UTF-16 > EF BB BF UTF-8 > > Note: This step looks for Unicode Byte Order Marks (BOMs). > > So the charset attribute overrides the BOM, per spec. If you think the spec > is wrong, please raise a spec issue. I am not an developer, but here is the thing: Our forum people found an website which Firefox acted different from other browser, and we found is the website's false, sort of. It return a wrong charset in HTTP header, and obviously I can't do that with a local testcase, but the issue remain if I declare a wrong charset in script tag. Firefox is the only mainstream browsers will fail to run the script if the charset is incorrect, don't you think this should be fixed? The site's false indeed, but users won't think that way when they try every other browser work out fine, they will think it's Firefox's false

dindog

Reporter

Comment 4

•

13 years ago

(In reply to Boris Zbarsky (:bz) from comment #2) > Wrong charset in what sense? What exact charset information is being > provided? The "charset" attribute on the element and the BOM in the file, > with no HTTP-level charset? Which charset is the "right" one? Simple answer is both. Wrong HTTP header charset or attribute on the element will go wrong. (and the testcase demonstrates latter)

Boris Zbarsky [:bzbarsky]

Comment 5

•

13 years ago

Henri, do IE and Opera allow a BOM in a script to override HTTP headers and @charset?

Component: File Handling → DOM

QA Contact: file-handling → general

dindog

Reporter

Comment 6

•

13 years ago

Any news about this? I see a similar bug fixed recently: Bug 693806, according to the spec though. I mean, now that Firefox is the only one fail to run the testcase, should we pay a little attention to it?

Boris Zbarsky [:bzbarsky]

Updated

•

13 years ago

Summary: firefox should verify external sources' charset → BOM should take precedence for charset determination for scripts

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Updated

•

12 years ago

Assignee: nobody → hsivonen

Blocks: encoding

Status: NEW → ASSIGNED

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 7

•

12 years ago

Attached patch Make the BOM take precedence (deleted) — Details — Splinter Review

Attachment #672774 - Flags: review?(bugs)

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Updated

•

12 years ago

See Also: → https://www.w3.org/Bugs/Public/show_bug.cgi?id=19623

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 8

•

12 years ago

(In reply to Boris Zbarsky (:bz) from comment #5) > Henri, do IE and Opera allow a BOM in a script to override HTTP headers and > @charset? I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251

Masatoshi Kimura [:emk]

Comment 9

•

12 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #8) > (In reply to Boris Zbarsky (:bz) from comment #5) > > Henri, do IE and Opera allow a BOM in a script to override HTTP headers and > > @charset? > > I tested IE7, Opera.next and Chrome. BOM takes precedence over HTTP in all > of them: http://hsivonen.iki.fi/test/moz/bom/no-charset-attribute.html1251 IE10 displayed 'โฌ' in all documents mode.

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 10

•

12 years ago

(In reply to Masatoshi Kimura [:emk] from comment #9) > IE10 displayed 'โฌ' in all documents mode. It strips the BOM and then decodes according to HTTP charset.

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 11

•

12 years ago

Filed bug 804113 about a potential pre-existing problem preserved by this patch.

Olli Pettay [:smaug][bugs@pettay.fi]

Comment 12

•

12 years ago

Comment on attachment 672774 [details] [diff] [review] Make the BOM take precedence ># HG changeset patch ># User Henri Sivonen <hsivonen@iki.fi> ># Date 1350558618 -10800 ># Node ID 755369f4f020ee6983b3e312dd05e05c7f34ab8d ># Parent 89ff8e55638111c977538dbd22edf00b82f013ee >Bug 687859 - Make the BOM take precedence when determining the character encoding of scripts. r=NOT_REVIEWED. > >diff --git a/content/base/src/nsScriptLoader.cpp b/content/base/src/nsScriptLoader.cpp >--- a/content/base/src/nsScriptLoader.cpp >+++ b/content/base/src/nsScriptLoader.cpp >@@ -999,94 +999,103 @@ nsScriptLoader::ConvertToUTF16(nsIChanne > uint32_t aLength, const nsAString& aHintCharset, > nsIDocument* aDocument, nsString& aString) > { > if (!aLength) { > aString.Truncate(); > return NS_OK; > } > >- nsAutoCString characterSet; >+ // The encoding info precedence is as follows from high to low: >+ // The BOM >+ // HTTP Content-Type (if name recognized) >+ // charset attribute (if name recognized) >+ // The encoding of the document > >- nsresult rv = NS_OK; >- if (aChannel) { >- rv = aChannel->GetContentCharset(characterSet); >- } >- >- if (!aHintCharset.IsEmpty() && (NS_FAILED(rv) || characterSet.IsEmpty())) { >- // charset name is always ASCII. >- LossyCopyUTF16toASCII(aHintCharset, characterSet); >- } >- >- if (NS_FAILED(rv) || characterSet.IsEmpty()) { >- DetectByteOrderMark(aData, aLength, characterSet); >- } >- >- if (characterSet.IsEmpty() && aDocument) { >- // charset from document default >- characterSet = aDocument->GetDocumentCharacterSet(); >- } >- >- if (characterSet.IsEmpty()) { >- // fall back to ISO-8859-1, see bug 118404 >- characterSet.AssignLiteral("ISO-8859-1"); >- } >+ nsAutoCString charset; > > nsCOMPtr<nsICharsetConverterManager> charsetConv = >- do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID, &rv); >+ do_GetService(NS_CHARSETCONVERTERMANAGER_CONTRACTID); > > nsCOMPtr<nsIUnicodeDecoder> unicodeDecoder; > >- if (NS_SUCCEEDED(rv) && charsetConv) { >- rv = charsetConv->GetUnicodeDecoder(characterSet.get(), >+ if (DetectByteOrderMark(aData, aLength, charset)) { >+ // charset is now "UTF-8" or "UTF-16". The UTF-16 decoder will re-sniff >+ // the BOM for endianness. Both the UTF-16 and the UTF-8 decoder will >+ // take care of swallowing the BOM. >+ charsetConv->GetUnicodeDecoderRaw(charset.get(), > getter_AddRefs(unicodeDecoder)); Could you fix the indentation of the latter parameter. >+ NS_SUCCEEDED(aChannel->GetContentCharset(charset))) { >+ charsetConv->GetUnicodeDecoder(charset.get(), >+ getter_AddRefs(unicodeDecoder)); ditto >+ } > >- rv = unicodeDecoder->GetMaxLength(reinterpret_cast<const char*>(aData), >- aLength, &unicodeLength); >- if (NS_SUCCEEDED(rv)) { >- if (!EnsureStringLength(aString, unicodeLength)) >- return NS_ERROR_OUT_OF_MEMORY; >+ if (!unicodeDecoder) { >+ CopyUTF16toUTF8(aHintCharset, charset); >+ charsetConv->GetUnicodeDecoder(charset.get(), >+ getter_AddRefs(unicodeDecoder)); ditto >- PRUnichar *ustr = aString.BeginWriting(); >+ if (!unicodeDecoder && aDocument) { >+ charset = aDocument->GetDocumentCharacterSet(); >+ charsetConv->GetUnicodeDecoderRaw(charset.get(), >+ getter_AddRefs(unicodeDecoder)); >+ } ditto >+ if (!unicodeDecoder) { >+ // Curiously, there are various callers that don't pass aDocument. The >+ // fallback in the old code was ISO-8859-1, which behaved like >+ // windows-1252. Saying windows-1252 for clarity and for compliance >+ // with the Encoding Standard. >+ charsetConv->GetUnicodeDecoderRaw("windows-1252", >+ getter_AddRefs(unicodeDecoder)); ditto

Attachment #672774 - Flags: review?(bugs) → review+

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Assignee

Comment 13

•

12 years ago

Thanks. Landed with indent fixed. https://hg.mozilla.org/integration/mozilla-inbound/rev/369d723c52ee

Ryan VanderMeulen [:RyanVM]

Comment 14

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/369d723c52ee

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Flags: in-testsuite+

Resolution: --- → FIXED

Target Milestone: --- → mozilla19

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: DOM → DOM: Core & HTML

testcase.html & foo.js included 13 years ago dindog (deleted), application/octet-stream		Details
testcase 13 years ago dindog (deleted), application/octet-stream		Details
Make the BOM take precedence 12 years ago Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11) (deleted), patch	smaug : review+	Details \| Diff \| Splinter Review