Closed
Bug 8865
Opened 25 years ago
Closed 25 years ago
need API to convert UCS-2 HTML into a target encoding
Categories
(Core :: Internationalization, defect, P3)
Tracking
()
RESOLVED
FIXED
M11
People
(Reporter: bobj, Assigned: nhottanscp)
Provide an API which will convert UCS-2 HTML into a target encoding including
appropriate HTML entity substitution. The implementation will probably call
the standard Unicode converters APIs with additional code to handle the entity
substitutions. For a given character range, the conversion method may do one
of the following conversions (controlled via prefs):
1. convert to character code in target charset encoding
2. convert to HTML3 Named Entity
3. convert to various subsets of HTML4 Named Entities (e.g., math)
4. convert to decimal Numeric Character Reference (NCR)
5. convert to HTML4 hexadecimal NCR
Default behavior is still being debated. Factors to be considered include
backwards compatibility (e.g., decimal NCRs are supported by almost all clients
currently in use, but those clients may not support HTML4 hex NCRs). The biggest
debate concerns when to output character code values instead of entities.
UI work is needed to control the pref values.
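To make the five options above concrete, here is a minimal sketch in Python, not the Mozilla converter API. The function names and the `named_first` / `hex_ncr` parameters are made up for illustration and stand in for the prefs. It assumes the markup characters were already escaped elsewhere, so ASCII passes through untouched, and it uses Python's `html.entities.codepoint2name` table (HTML 4 names only, with no HTML3 or subset distinction).

```python
from html.entities import codepoint2name  # HTML 4 named entities


def encodable(ch, charset):
    """True if the single character has a mapping in the target charset."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def to_reference(ch, hex_ncr=False):
    """Named entity if one exists, otherwise a decimal or hexadecimal NCR."""
    name = codepoint2name.get(ord(ch))
    if name is not None:
        return f"&{name};"
    return f"&#x{ord(ch):X};" if hex_ncr else f"&#{ord(ch)};"


def convert_html(text, charset, named_first=False, hex_ncr=False):
    """Hypothetical helper: convert text to bytes in charset with entity fallback."""
    out = []
    for ch in text:
        is_ascii = ord(ch) < 0x80  # markup escaping is assumed to happen elsewhere
        has_name = ord(ch) in codepoint2name
        if is_ascii or (encodable(ch, charset) and not (named_first and has_name)):
            out.append(ch)  # option 1: character code in the target charset
        else:
            out.append(to_reference(ch, hex_ncr))  # options 2-5
    return "".join(out).encode(charset)
```

With ISO-8859-1 as the target, e-acute is written as a raw byte unless `named_first` is set, alpha falls back to a named entity, and a character with no name falls back to an NCR.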
Of course when converting from UCS-2, we need to always entity-ize:
"&lt;" represents the < sign
"&gt;" represents the > sign
"&amp;" represents the & sign
"&quot;" represents the " mark
but Ender should do that before the conversion, right?
Updated•25 years ago
Summary: need API to convert UCS-2 HTML into a target encoding
Comment 2•25 years ago
Do we always need to entity-ize " ? I'm not so sure of that one, or of
& though I agree about < and >.
Ender doesn't do anything like this; it just works with the content tree.
The XIF converter creates <entity> tags for (currently) &, <, and >. The
nsXIFDTD then lets the parser map them back to unicode chars. Finally, the
content sink (e.g. nsHTMLContentSinkStream.cpp) calls NS_UnicodeToEntity on each
unicode char in the stream to turn them into &foo; format.
Good question about ". Seems like we didn't handle that before. I just
tried it in the 4.5 composer. But can't quotes cause problems for things like:
<tag-name param-name="foobar">
For reference, http://www.w3.org/TR/REC-html40/charset.html#h-5.3.2
...
Four character entity references deserve special mention since they
are frequently used to escape special characters:
- "&lt;" represents the < sign.
- "&gt;" represents the > sign.
- "&amp;" represents the & sign.
- "&quot;" represents the " mark.
Authors wishing to put the "<" character in text should use "&lt;"
(ASCII decimal 60) to avoid possible confusion with the beginning
of a tag (start tag open delimiter). Similarly, authors should use
"&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems
with older user agents that incorrectly perceive this as the end
of a tag (tag close delimiter) when it appears in quoted attribute
values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to
avoid confusion with the beginning of a character reference
(entity reference open delimiter). Authors should also use
"&amp;" in attribute values since character references are
allowed within CDATA attribute values.
Some authors use the character entity reference "&quot;" to
encode instances of the double quote mark (") since that
character may be used to delimit attribute values.
...
For my strawman pref controls, I think we need (UE should rename/reword):
Editor Entity Output Preferences
  Named Entity Priority (radio buttons)
    o After valid character code values (default - but this is debatable)
    o Before valid character code values
  Additional (base set is HTML 3.2) Named Entities (check boxes)
    [ ] HTML 4.0 ISO 8859-1 (Latin-1) characters (default off)
    [ ] HTML 4.0 symbols, math symbols, and Greek letters (default off)
    [ ] HTML 4.0 markup-significant and i18n characters (default off)
  NCR format (radio buttons)
    o decimal (default)
    o hexadecimal (recommended by HTML4, but would cause backwards
      compatibility problems)
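One possible way to carry these strawman prefs around in code is sketched below in Python. All class, field, and function names are invented for illustration and do not correspond to any real pref names; the defaults mirror the ones proposed above.

```python
from dataclasses import dataclass
from enum import Enum


class NamedEntityPriority(Enum):
    AFTER_CHARSET = "after"    # strawman default: try the charset first
    BEFORE_CHARSET = "before"  # prefer a named entity when one exists


class NcrFormat(Enum):
    DECIMAL = "decimal"          # strawman default, widely supported
    HEXADECIMAL = "hexadecimal"  # HTML4 form, may break older clients


@dataclass(frozen=True)
class EntityOutputPrefs:
    """Hypothetical container for the editor entity output prefs."""
    named_entity_priority: NamedEntityPriority = NamedEntityPriority.AFTER_CHARSET
    html40_latin1: bool = False   # HTML 4.0 ISO 8859-1 (Latin-1) characters
    html40_symbols: bool = False  # symbols, math symbols, and Greek letters
    html40_special: bool = False  # markup-significant and i18n characters
    ncr_format: NcrFormat = NcrFormat.DECIMAL


def format_ncr(code_point: int, prefs: EntityOutputPrefs) -> str:
    """Render a numeric character reference in the format the prefs select."""
    if prefs.ncr_format is NcrFormat.HEXADECIMAL:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"
```

Keeping the prefs in one immutable value means a converter can be handed a single argument rather than reading each pref separately.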
Comment 5•25 years ago
I'm a bit confused by that list. Does it offer some way to encode &, < and > to
named entities but not to encode " ? I think a lot of people will be annoyed if
their quotes all become entities when there doesn't seem to be any pressing
reason for that.
There will be cases when we need to convert " to an entity. The easiest
implementation is probably to entity-ize all occurrences of the 4 special
characters. Otherwise Ender needs to have smarts about when it is OK or not OK to
use the raw character codes.
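The "smarts" in question boil down to knowing the output context. A minimal Python sketch follows, with hypothetical function names: text content only needs &, < and > escaped, while a double-quoted attribute value also needs the quote escaped.

```python
def escape_text(s):
    """Escape for element content; double quotes are left alone."""
    # "&" must be replaced first so the other references are not re-escaped.
    return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_attr(s):
    """Escape for a double-quoted attribute value; quotes must be escaped too."""
    return escape_text(s).replace('"', "&quot;")
```

Escaping all four characters everywhere corresponds to always calling `escape_attr`, which is simpler but turns every quote in body text into an entity.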
I tried a test using 4.5 Composer. In Notepad I created foo.html:
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Author" content="Robert &gt;&quot;Bob&quot;&lt; Jung">
<title>&gt;&quot;hello&quot;&lt;</title>
</head>
<body>
less-than: &lt;
<br>greater-than: &gt;
<br>double-quote: &quot;
<br>ampersand: &amp;
</body>
</html>
I opened foo.html with 4.5 Composer and saved it to foo2.html. It changed all
"&gt;" to > and "&quot;" to ". It even changed the "&lt;" in the
<meta author...> tag to <. Here is the resulting foo2.html:
<!doctype html public "-//w3c//dtd html 4.0 transitional//en">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="Author" content="Robert >"Bob"< Jung">
<meta name="GENERATOR" content="Mozilla/4.5 [en]C-NSCP (Win95; U)
[Netscape]">
<title>>"hello"<</title>
</head>
<body>
less-than: <
<br>greater-than: >
<br>double-quote: "
<br>ampersand: &
</body>
</html>
Comment 7•25 years ago
Putting that logic in the editor library would require some sort of big ugly
hack. Right now it's handled by the content sink stream on output, after ender
no longer has control over it. The content sink stream can have multiple modes;
but if every sink stream has to know all about which special character codes do
and do not get mapped to entities, then what's the point of having a service to
do the mapping?
Reporter
Comment 10•25 years ago
Do you still feel this has a dependency on bug 8865?
Comment 11•25 years ago
Tague, can we make this for M10? The scope of this bug is the converter itself
without the integration part, right?
Comment 12•25 years ago
i have an api and a shell on my machine at home. i stopped working on this
because i found that someone had already implemented some entity conversion
code.
right now i'm looking at the html save-as path to try to figure out what is going on
with entity conversion before i invest a lot of time in building this converter.
i doubt this is going to make m10.
Updated•25 years ago
Assignee: tague → nhotta
Status: ASSIGNED → NEW
Comment 13•25 years ago
reassign to naoki, tague, please give naoki a brain dump
Assignee
Updated•25 years ago
Status: NEW → ASSIGNED
Assignee
Updated•25 years ago
Status: ASSIGNED → NEW
Assignee
Comment 14•25 years ago
tague, are you able to check in whatever you have to the tree?
Then I can start working on this.
For M11, messenger can use it to generate entities before converting to the mail
charset.
Later if entities are generated elsewhere then it can be simply removed from
messenger.
Assignee
Updated•25 years ago
Status: NEW → ASSIGNED
Assignee
Comment 15•25 years ago
The plan is to call nsIEntityConverter::ConvertToEntity from message compose
before converting unicode to mail charset.
Assignee
Updated•25 years ago
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Assignee
Comment 16•25 years ago
Hooked up the entity converter to nsMsgSend.cpp.
If later editor generates entities then it can be removed from messenger.
Assignee
Updated•25 years ago
Status: RESOLVED → REOPENED
Assignee
Comment 17•25 years ago
Reopening the bug. We have not implemented everything we planned.
The original spec says:
1. convert to character code in target charset encoding
2. convert to HTML3 Named Entity
3. convert to various subsets of HTML4 Named Entities (e.g., math)
4. convert to decimal Numeric Character Reference (NCR)
5. convert to HTML4 hexadecimal NCR
We currently have 1, 2 and partially 3. We need a simple interface that
converts unicode to HTML (or plain text) using the charset converters and the
entity converter, and that also generates NCRs.
In the plain text case, instead of generating NEs and NCRs, either skip the
unconvertible chars or convert to UTF-8 plain text.
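The HTML and plain text fallbacks described above can be sketched with Python's built-in codec error handlers, used here as a stand-in for the Mozilla charset converters. The function names are hypothetical. Note that `xmlcharrefreplace` only produces decimal NCRs and never named entities.

```python
def save_as_charset(text, charset, is_html):
    """Hypothetical helper: HTML gets NCRs, plain text falls back to UTF-8."""
    if is_html:
        # Unmapped characters become decimal NCRs such as &#12354;
        return text.encode(charset, errors="xmlcharrefreplace")
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        # Plain text cannot carry entities, so switch the whole text to UTF-8.
        return text.encode("utf-8")


def save_plain_skipping(text, charset):
    """Alternative plain text policy: silently drop unconvertible characters."""
    return text.encode(charset, errors="ignore")
```

If the UTF-8 fallback is used, the caller has to label the output as UTF-8 instead of the requested charset, so a real interface would need to report which charset was actually written.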
Assignee
Updated•25 years ago
Status: REOPENED → ASSIGNED
Comment 18•25 years ago
Clearing fixed resolution since it's been reopened.
Assignee
Comment 19•25 years ago
Removing 6672 from the dependency list because it depended on nbsp entity
generation, which is done.
Assignee
Comment 20•25 years ago
Adding 15706 as a dependency. It is needed for the fallback case where the unicode
converter fails because there is no mapping. For now, I can work around it by doing the
conversion per character, but that is slow.
Assignee
Comment 21•25 years ago
I checked in my changes today. There are two changes.
1) Changed nsIEntityConverter to support the complete html40 entities. It supports
2&3 of the original spec.
2) Added a new interface nsISaveAsCharset. This is a superset of
nsIEntityConverter: it supports entities and NCRs and also does a charset conversion. It
also supports plain text input (see the idl file for details). This
interface supports all the requirements of the original spec.
Note: the nsISaveAsCharset implementation depends on the unicode encoder bug 15706;
for some charsets (e.g. ISO-8859-1) it doesn't work correctly until 15706 is fixed.
nsIEntityConverter does not have this dependency.
Assignee
Updated•25 years ago
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
No longer depends on: 15706
Resolution: --- → FIXED
Assignee
Comment 22•25 years ago
Removing 15706 from the dependency list. We agreed that encoders should include the
unmapped character in the consumed length, which is the current behavior of all
encoders except ISO-2022-JP. 15706 is now a specific problem for ISO-2022-JP.
I made a change to the callers of the unicode encoder. Marking this bug as FIXED.
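As an illustration of why the consumed length matters to callers, here is a sketch in Python rather than the Mozilla encoder API. Python's `UnicodeEncodeError` reports the start and end of the unmapped run, which lets the caller emit a fallback for it and resume right after, instead of converting one character at a time. The function names and the choice of a decimal NCR fallback are assumptions.

```python
def ncr(chars):
    """Decimal NCRs for a run of unmapped characters."""
    return "".join(f"&#{ord(c)};" for c in chars)


def encode_with_fallback(text, charset, fallback=ncr):
    """Hypothetical caller: encode in bulk, patching each unmapped run."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        try:
            out += text[pos:].encode(charset)
            break
        except UnicodeEncodeError as err:
            # err.start and err.end are offsets into the slice text[pos:].
            out += text[pos:pos + err.start].encode(charset)
            bad = text[pos + err.start:pos + err.end]
            out += fallback(bad).encode(charset)
            pos += err.end  # resume after the unmapped run
    return bytes(out)
```

The loop only makes progress because the reported end offset lies past the unmapped characters, which is the same property the comment above asks of the encoders.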