Closed
Bug 799917
Opened 12 years ago
Closed 4 years ago
Make document.characterSet always lowercase
Categories
(Core :: DOM: HTML Parser, defect, P5)
Core
DOM: HTML Parser
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: ayg, Unassigned)
References
()
Details
Attachments
(2 files)
(deleted),
patch
|
bzbarsky
:
review+
hsivonen
:
feedback+
smontagu
:
feedback+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review |
This is what the specs require right now, and it seems to match IE/Opera. Gecko/WebKit use mixed-case, like "UTF-8" and "Shift_JIS" instead of "utf-8" and "shift_jis". Since IE does it, it's probably web-compatible, and Anne says he'd know if Opera ran into any issues. hsivonen said to ping smontagu and maybe bz, so consider yourselves pinged. I think this is worth trying, because there's no way we'll get interop here otherwise, but obviously we need to keep a close eye out for compat in case IE goes down different codepaths.
Reporter | ||
Comment 1•12 years ago
|
||
This causes us to fail tests in <http://w3c-test.org/webapps/DOMCore/tests/approved/Node-properties.html>, which should be imported into our repo relatively soon. They probably aren't comprehensive enough to mark this in-testsuite+, though.
Updated•12 years ago
|
Comment 2•12 years ago
|
||
I don't think I'm opposed to this particularly, as long as you check the uses of this property in our chrome and extensions.
Comment 3•12 years ago
|
||
And in our C++.
And in particular, if you make this change, watch out for consumers that treat the value returned by .characterSet as a canonical charset name.
Reporter | ||
Comment 4•12 years ago
|
||
Per the spec, the canonical names *are* all lowercase. I haven't looked at the code at all, but it seems simplest to just change our internal canonical names to lowercase, no? And grep for references to the old ones in the codebase, and lowercase those too.
Comment 5•12 years ago
|
||
> but it seems simplest to just change our internal canonical names to lowercase, no?
Maybe. You'd have to check with Simon on that. For one thing, it's likely to affect how things display in the UI, which may not be desirable.
In general, from the point of view of not causing regressions, that sounds like the most dangerous approach...
(In reply to Boris Zbarsky (:bz) from comment #5)
> For one thing, it's likely
> to affect how things display in the UI, which may not be desirable.
How? AFAICT, the labels we show in the UI are not the Gecko-canonical names. (UI says “Windows-1252” but Gecko-canonical is “windows-1252”.)
Comment 7•12 years ago
|
||
Ah, if the UI is not showing the Gecko-canonical ones, that makes things much simpler.
Again, I'm not opposed to this change. I just want it to be done in such a way that nothing breaks in the process.
Comment 8•12 years ago
|
||
The safest approach would be using EncodingUtils::FindEncodingForLabel from HTML parser. It will always return a lowercased encoding name.
Comment 9•12 years ago
|
||
And only C++ callers would be affected because GetUnicode(En|De)coderRaw is not scriptable.
Reporter | ||
Comment 10•12 years ago
|
||
So this is the easy way out -- it only changes .characterSet. This will also change .inputEncoding, which we apparently want to get rid of. And apparently nsIDocument::GetXMLDeclaration. Those are the only C++ callers, according to DXR. Should I change GetXMLDeclaration to use GetDocumentCharacterSet or something instead? I don't know what it's used for.
There are probably more tests than this that will break, since I looked pretty superficially. Try run (Linux64 debug only): https://tbpl.mozilla.org/?tree=Try&rev=cdd1d87a0d57 Anything else you want me to do here, or is this okay and we'll see how it goes?
Assignee: nobody → ayg
Status: NEW → ASSIGNED
Attachment #670369 -
Flags: review?(bzbarsky)
Attachment #670369 -
Flags: feedback?(hsivonen)
Comment 11•12 years ago
|
||
If we really have to do this, I *think* that the safest way would be to make our internal canonical names all-lowercase (and audit all callers of GetUnicodeEn/DecoderRaw), but the whole thing seems to me like a classic example of fixing something not broken.
Comment 6 is correct that encoding names displayed in the UI come from a localizable properties file and won't be affected.
(In reply to :Aryeh Gregor from comment #4)
> Per the spec, the canonical names *are* all lowercase.
Which spec? Our current canonical names come from http://www.iana.org/assignments/character-sets, which I consider *the* spec for these purposes.
Comment 12•12 years ago
|
||
(In reply to Simon Montagu from comment #11)
> Which spec? Our current canonical names come from
> http://www.iana.org/assignments/character-sets, which I consider *the* spec
> for these purposes.
Per Encoding Standard.
http://encoding.spec.whatwg.org/#concept-encoding-get
DOM is going to reduce the dependency on external specs as much as possible (especially when those specs are open-ended). All supported encodings are given by the fixed list and those names are lower cased.
Reporter | ||
Comment 13•12 years ago
|
||
(In reply to Simon Montagu from comment #11)
> If we really have to do this, I *think* that the safest way would be to make
> our internal canonical names all-lowercase (and audit all callers of
> GetUnicodeEn/DecoderRaw), but the whole thing seems to me like a classic
> example of fixing something not broken.
I'm fine with either as long as browsers are willing to align. Specifically, this change would make us match IE and Opera. If they want to change to match us, that's fine by me too.
> (In reply to :Aryeh Gregor from comment #4)
> > Per the spec, the canonical names *are* all lowercase.
>
> Which spec? Our current canonical names come from
> http://www.iana.org/assignments/character-sets, which I consider *the* spec
> for these purposes.
Well, this is a DOM attribute, so the spec is the DOM spec:
"The characterSet attribute must return the name of the encoding."
http://dom.spec.whatwg.org/#dom-document-characterset
"name" is a link to:
http://encoding.spec.whatwg.org/#name
The names in that spec are all lowercase. Anne thinks the IANA spec is "pretty useless"; I gather it's not web-compatible at all. His spec is intended to supersede it for all web purposes. He feels that all-lowercase makes more sense than mixed-case, so Gecko/WebKit should change instead of IE/Opera. The IANA spec's casing is certainly inconsistent -- e.g., "windows-1252" instead of "Windows-1252", and a few other odd ones that are lowercase like "us-dk" and "latin-lap". Lowercase is more predictable.
(Anne also points out that GBK and GB18030, at least, are uppercase in the IANA standard but lowercase in Gecko.)
Basically, I don't think either way is particularly better than the other, but browsers do need to be consistent. So either we and WebKit have to change, or IE and Opera. The IANA spec is apparently not close to sufficient for browser interop on encodings, so Anne's Encodings spec is the one to look at, and he went with IE/Opera on this point. I don't see compelling advantages to either way, so I'd be in favor of going with whatever Anne wants to spec.
I don't think we should push back just because it's more of a pain for us, or because we don't the other way is better than ours. *Someone* has to take the pain, and we're not going to get interop if everyone wants to push it off on everyone else.
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet
feedback+ if you change ToLowerCase() to nsContentUtils::ASCIIToLower().
Our current canonical names come from
> http://www.iana.org/assignments/character-sets, which I consider *the* spec
> for these purposes.
I really think we should treat http://encoding.spec.whatwg.org/ as *the* spec. The IANA does not even admit that ISO-8859-1 is an alias for Windows-1252.
Attachment #670369 -
Flags: feedback?(hsivonen) → feedback+
Comment 15•12 years ago
|
||
(In reply to Henri Sivonen (:hsivonen) from comment #14)
> I really think we should treat http://encoding.spec.whatwg.org/ as *the*
> spec. The IANA does not even admit that ISO-8859-1 is an alias for
> Windows-1252.
Our current decoders for ISO-8859-1 and windows-1252 are indeed different, although bug 562096 is present for them. I made EncodingUtils::FindEncodingForLabel as an interim workaround until bug 562096 is fixed.
Comment 16•12 years ago
|
||
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet
OK, so we're saved by the fact that necko uses GetUnicodeEncoder without Raw in URI code, right? Because that code certainly gets the output of .characterSet.
In general, I skimmed the users of GetUnicodeEncoderRaw and GetUnicodeDecoderRaw and none look like they should end up with values from here, but it's worth double-checking that.
GetXMLDeclaration is used when serializing XML documents. Having it lowercase is probably fine.
r=me if you do the above double-checking.
Attachment #670369 -
Flags: review?(bzbarsky) → review+
Reporter | ||
Comment 17•12 years ago
|
||
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet
I fixed a bunch of test failures and did a new try push, mochitest-only but on multiple platforms, because I'm nervous about the possibility of platform-specific code here. Maybe I'm being overly cautious:
https://tbpl.mozilla.org/?tree=Try&rev=003fa29f6e66
Simon, are you on board with the change at this point?
(In reply to Boris Zbarsky (:bz) from comment #16)
> In general, I skimmed the users of GetUnicodeEncoderRaw and
> GetUnicodeDecoderRaw and none look like they should end up with values from
> here, but it's worth double-checking that.
>
> GetXMLDeclaration is used when serializing XML documents. Having it
> lowercase is probably fine.
>
> r=me if you do the above double-checking.
I glanced at all of them, and none of them look like they could plausibly use .characterSet, but I'm not familiar enough with this code to make guarantees.
Attachment #670369 -
Flags: feedback?(smontagu)
Comment 18•12 years ago
|
||
Comment on attachment 670369 [details] [diff] [review]
Patch, just changes GetCharacterSet
I suppose I am "on board" in as much as I'm not going to be the sole hold-out, but I'm not happy with the current trend of rewriting specs to retroactively dictate interop on the behaviour of the implementers that are least likely to conform to the spec. Not much of an incentive for anyone to care about the spec, is it?
Attachment #670369 -
Flags: feedback?(smontagu) → feedback+
Reporter | ||
Comment 19•12 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/ba1ece534b74
(In reply to Simon Montagu from comment #18)
> I suppose I am "on board" in as much as I'm not going to be the sole
> hold-out, but I'm not happy with the current trend of rewriting specs to
> retroactively dictate interop on the behaviour of the implementers that are
> least likely to conform to the spec. Not much of an incentive for anyone to
> care about the spec, is it?
To the contrary -- it makes the spec more likely to match the consensus of implementations, which makes it more useful, and makes it more likely that we'll get interop. If the spec is more realistic, that only increases people's incentives to care about it, because there are practical reasons as well as principled ones.
It's annoying when it's not clear-cut whether the spec is right and we follow the spec and then it changes to match the people who didn't follow the spec, but it's better than not getting as much interop. If IE and/or WebKit aren't going to change to get interop, better that we do than no one does.
Flags: in-testsuite+
Comment 20•12 years ago
|
||
(In reply to :Aryeh Gregor from comment #19)
> https://hg.mozilla.org/integration/mozilla-inbound/rev/ba1ece534b74
You didn't change ToLowerCase() to nsContentUtils::ASCIIToLower().
> 1.12 + ToLowerCase(aCharacterSet);
Reporter | ||
Comment 21•12 years ago
|
||
Oops. Thanks for pointing that out. Henri, would you like me to push a followup?
Comment 22•12 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla19
(In reply to :Aryeh Gregor from comment #21)
> Oops. Thanks for pointing that out. Henri, would you like me to push a
> followup?
I guess ToLowerCase is good enough when the input to is guaranteed to be ASCII, which is the case here.
Comment 24•12 years ago
|
||
This broke the character set menu, which assumes that document.characterSet will be canonical.
Comment 25•12 years ago
|
||
Uh... didn't I ask for that to be tested in comment 5?
Aryeh, what's the plan for fixing the regressions?
Reporter | ||
Comment 26•12 years ago
|
||
(In reply to Boris Zbarsky (:bz) from comment #25)
> Uh... didn't I ask for that to be tested in comment 5?
I thought you were only talking about the approach of changing our canonical case, not the approach I took of just changing .characterSet.
> Aryeh, what's the plan for fixing the regressions?
I don't have the time -- I'm working only a few hours a week at present and have security bugs and crashers being assigned to me. If no one else has time either, this will have to be backed out.
Comment 27•12 years ago
|
||
In that case, I think we should back this out for now. This is very low priority, imo, so I have a hard time asking someone to drop anything else and work on the regressions.
Reporter | ||
Comment 28•12 years ago
|
||
Makes sense to me. Maybe when I next have a block of free time I can take this and tackle the regressions then.
Comment 29•12 years ago
|
||
I want this backout because it interferes with bug 801402.
Comment 30•12 years ago
|
||
This is a simple backout, so review should not be required.
Keywords: checkin-needed
Comment 31•12 years ago
|
||
Status: RESOLVED → REOPENED
Flags: in-testsuite+
Keywords: checkin-needed
Resolution: FIXED → ---
Target Milestone: mozilla19 → ---
Comment 32•12 years ago
|
||
Reporter | ||
Updated•12 years ago
|
Assignee: ayg → nobody
Comment 33•4 years ago
|
||
Bulk-downgrade of unassigned, untouched DOM/Storage bug's priority.
If you have reason to believe, this is wrong, please write a comment and ni :jstutte.
Severity: minor → S4
Priority: -- → P5
We decided that we don't want to do this and changed the spec accordingly. Not sure how the bug remained open.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 4 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•