1224239 - [e10s] view-source on a UTF-8 document with late meta charset that displays correctly shows the source in "Western" rather than UTF-8

Reporter

Description

•

9 years ago

Steps to reproduce: 1. load http://www.thsrc.com.tw/index.html?force=1 2. press Ctrl+U (or appropriate platform alternative) to view source Expected results: A. a good bit of chinese text B. character encoding menu (View -> Text Encoding) shows "Western" Actual results: A. misencoded garbage where the text should be B. character encoding menu (View -> Text Encoding) shows "Unicode", just like it does when viewing the page This reproduces in a clean profile on the 2015-11-12-03-02-38-mozilla-central Linux-64bit nightly, built on https://hg.mozilla.org/mozilla-central/rev/3cc3b1968524248450c465c4ea2ee5596ffa65f2

Simon Montagu :smontagu

Comment 1

•

9 years ago

regression range: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=58c5a3427997&tochange=2893f60d5903 Maybe bug 913806? The URL has a <meta charset> way down the page.

Keywords: regression

Simon Montagu :smontagu

Comment 2

•

9 years ago

The bug doesn't reproduce on http://moztw.org/community/, which has a meta charset declaration near the top.

Summary: view-source on a UTF-8 document that displays correctly shows the source in "Western" rather than UTF-8 → view-source on a UTF-8 document with late meta charset that displays correctly shows the source in "Western" rather than UTF-8

Doug Turner (:dougt)

Comment 3

•

9 years ago

jason, can you or honza look into this? Thanks.

Assignee: nobody → jduell.mcbugs

Mark Banner (:standard8)

Comment 4

•

8 years ago

This appears to be an e10s only bug. What I see is: ==== Steps to reproduce: 1. Run Firefox in e10s mode. 1. load http://www.thsrc.com.tw/index.html?force=1 2. press Ctrl+U (or appropriate platform alternative) to view source Expected results: A. a good bit of chinese text B. character encoding menu (View -> Text Encoding) shows "Unicode" Actual results: A. misencoded garbage where the text should be B. character encoding menu (View -> Text Encoding) shows "Western", just like it does when viewing the page ==== This is slightly different to David's STR in that the character encoding menu is the other way around. If I open a non-e10s window and run it there, then it works fine. Although the regression range doesn't match exactly, I think this might be related to bug 1025146 which might have landed around the same time. I did find one minor issue with forcedCharSet not being set correctly in `viewSource-content.js` function "viewSource" (I'll probably be fixing this in bug 1315951), but fixing that doesn't seem to fix this issue. Since Mike did bug 1025146, I'll needinfo him here and hopefully he'll have an idea.

Blocks: 1025146

Flags: needinfo?(mconley)

Summary: view-source on a UTF-8 document with late meta charset that displays correctly shows the source in "Western" rather than UTF-8 → [e10s] view-source on a UTF-8 document with late meta charset that displays correctly shows the source in "Western" rather than UTF-8

Mark Banner (:standard8)

Comment 5

•

8 years ago

I just noticed, changing: https://dxr.mozilla.org/mozilla-central/rev/f13e90d496cf1bc6dfc4fd398da33e4afe785bde/toolkit/components/viewsource/content/viewSource-content.js#240 from: ``` let forcedCharSet = utils.docCharsetIsForced ? doc.characterSet : null; ``` to: ``` forcedCharSet = doc.characterSet ? doc.characterSet : null; ``` appears to fix this. Though I don't know if it is the correct fix or not (and forceCharSet would need to be renamed etc).

Mike Conley (:mconley) (:⚙️)

Comment 6

•

8 years ago

I think the problem is at a deeper level. In the non-e10s (good) case, we return "UTF-8" when attempting to determine the charset from the channel here: http://searchfox.org/mozilla-central/rev/8562d3859b89ac89f46690b9ed2c473e0728d6c0/dom/html/nsHTMLDocument.cpp#701 Whereas in the e10s (bad) case, we return the empty string, which causes us to eventually resolve the charset to the fallback encoding, which happens to be windows-1252 (set here: http://searchfox.org/mozilla-central/source/dom/encoding/FallbackEncoding.cpp#100) I think the solution is to have the channel properly return the charset.

Flags: needinfo?(mconley)

Valentin Gosu [:valentin] (he/him)

Comment 7

•

6 years ago

It seems this bug still happens.
Mike, am I correct when saying that the charset from the meta tag would be set on the channel via HttpBaseChannel::SetContentCharset ? In which case I think that we're sending the response head to the childChannel during OnStartRequest, but the charset is set on the parent channel later than that, so the child never gets it.
I think we could probably add a new IPDL message to set the charset. Assuming the IPC connection is still up by the time we parse the meta tag :)

Assignee: jduell.mcbugs → nobody

Flags: needinfo?(mconley)

Priority: -- → P3

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Comment 8

•

6 years ago

(In reply to Mike Conley (:mconley) (:⚙️) from comment #6)

I think the problem is at a deeper level.

In the non-e10s (good) case, we return "UTF-8" when attempting to determine
the charset from the channel here:

http://searchfox.org/mozilla-central/rev/
8562d3859b89ac89f46690b9ed2c473e0728d6c0/dom/html/nsHTMLDocument.cpp#701

How does the encoding end up on the channel in that place instead of coming from the channel in the TryCacheCharset() step?

Mike Conley (:mconley) (:⚙️)

Comment 9

•

6 years ago

I completely forgot about this bug! Forgive me if my memory is a bit foggy on it.

Reading my comment again, I think I was basically reporting my findings, without really understanding how the underlying mechanisms of determining encoding work. Really, I was just comparing the working case (non-e10s) with the broken case (e10s), and finding what ultimately was different.

How things work under the hood from there is a bit of a mystery there - and I don't have much advice on how best to address it.

Flags: needinfo?(mconley)

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Comment 10

•

6 years ago

FWIW, the mechanism by which WebKit finds the meta is that WebKit doesn't actually fully honor the WebKit-based 1024-byte limit and scans past 1024 bytes if it hasn't seen a tag that is not allowed in head.

Henri Sivonen (:hsivonen) (away from Bugzilla until 2023-09-11)

Comment 11

•

3 years ago

The site has fixed the problem, but the original problem reported here is being fixed in bug 1701828.

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → DUPLICATE

Bugzilla

[e10s] view-source on a UTF-8 document with late meta charset that displays correctly shows the source in "Western" rather than UTF-8

Categories

(Core :: Internationalization, defect, P3)

Tracking

()

People

(Reporter: dbaron, Unassigned)

References

(
URL
)

Details

(Keywords: regression)

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11