Closed Bug 8865 Opened 25 years ago Closed 25 years ago

need API to convert UCS-2 HTML into a target encoding

Categories

(Core :: Internationalization, defect, P3)

All
Other
defect

RESOLVED FIXED

People

(Reporter: bobj, Assigned: nhottanscp)

Provide an API which will convert UCS-2 HTML into a target encoding, including appropriate HTML entity substitution. The implementation will probably call the standard Unicode converter APIs with additional code to handle the entity substitutions. For given character ranges, it may do one of the following conversions (controlled via prefs):
1. convert to character code in target charset encoding
2. convert to HTML3 Named Entity
3. convert to various subsets of HTML4 Named Entities (e.g., math)
4. convert to decimal Numeric Character Reference (NCR)
5. convert to HTML4 hexadecimal NCR
Default behavior is still being debated. Factors to be considered include backwards compatibility (e.g., decimal NCRs are supported by almost all clients currently in use, but those clients may not support HTML4 hex NCRs). The biggest debate concerns when to output character code values instead of entities. UI work is needed to control the pref values.
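To make the five options above concrete, here is a minimal Python sketch, not the Mozilla implementation. The function name `to_charset_html` and the `fallback` values are hypothetical. It assumes a stateless, ASCII-compatible target charset (so it is not suitable for ISO-2022-JP as written), and it uses Python's HTML 4 entity table, so it does not distinguish the HTML3 set from the HTML4 subsets.

```python
from html.entities import codepoint2name  # HTML 4 named entity table


def to_charset_html(text, charset, fallback="decimal"):
    """Encode text to charset; unmappable characters become references.

    fallback is a hypothetical pref value: "named", "decimal" or "hex".
    Assumes a stateless, ASCII-compatible charset such as ISO-8859-1.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(charset)  # option 1: plain character code
            continue
        except UnicodeEncodeError:
            pass
        cp = ord(ch)
        if fallback == "named" and cp in codepoint2name:
            ref = "&%s;" % codepoint2name[cp]  # options 2/3: named entity
        elif fallback == "hex":
            ref = "&#x%X;" % cp  # option 5: hexadecimal NCR
        else:
            ref = "&#%d;" % cp  # option 4: decimal NCR
        out += ref.encode("ascii")
    return bytes(out)
```

With ISO-8859-1 as the target, the euro sign has no character code, so it comes out as `&euro;`, `&#8364;` or `&#x20AC;` depending on the chosen fallback, while characters such as é are written as their single-byte code.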
Of course when converting from UCS-2, we need to always entity-ize:
"&lt;" represents the < sign
"&gt;" represents the > sign
"&amp;" represents the & sign
"&quot;" represents the " mark
but Ender should do that before the conversion, right?
Summary: need API to convert UCS-2 HTML into a target encoding
Do we always need to entity-ize &quot; ? I'm not so sure of that one, or of &amp; though I agree about &lt; and &gt;. Ender doesn't do anything like this; it just works with the content tree. The XIF converter creates <entity> tags for (currently) &, <, and >. The nsXIFDTD then lets the parser map them back to unicode chars. Finally, the content sink (e.g. nsHTMLContentSinkStream.cpp) calls NS_UnicodeToEntity on each unicode char in the stream to turn them into &foo; format.
Target Milestone: M8
Good question about &quot;. Seems like we didn't handle that before. I just tried it in the 4.5 Composer. But can't quotes cause problems for things like:
  <tag-name param-name="foobar">
For reference, http://www.w3.org/TR/REC-html40/charset.html#h-5.3.2
...
Four character entity references deserve special mention since they are frequently used to escape special characters:
- "&lt;" represents the < sign.
- "&gt;" represents the > sign.
- "&amp;" represents the & sign.
- "&quot;" represents the " mark.
Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
...
For my strawman pref controls, I think we need (UE should rename/reword):
Editor Entity Output Preferences
  Named Entity Priority (radio buttons)
    o After valid character code values (default - but this is debatable)
    o Before valid character code values
  Additional (base set is HTML 3.2) Named Entities (check boxes)
    [ ] HTML 4.0 ISO 8859-1 (Latin-1) characters (default off)
    [ ] HTML 4.0 symbols, math symbols, and Greek letters (default off)
    [ ] HTML 4.0 markup-significant and i18n characters (default off)
  NCR format (radio buttons)
    o decimal (default)
    o hexadecimal (recommended by HTML4, but would cause backwards compatibility problems)
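A rough Python sketch of how the "Named Entity Priority" and "NCR format" controls could interact for a single character. The `EntityPrefs` class, its field names, and `render_char` are hypothetical, and the three "Additional Named Entities" check boxes are not modelled: Python's HTML 4 entity table is used as a whole. ASCII characters are left alone here, so markup-significant characters are out of scope for this sketch.

```python
from dataclasses import dataclass
from html.entities import codepoint2name  # HTML 4 named entity table


@dataclass
class EntityPrefs:
    """Hypothetical pref bundle mirroring the radio buttons above."""
    entities_first: bool = False  # False: "after valid character code values"
    hex_ncr: bool = False         # False: decimal NCR


def render_char(ch, charset, prefs):
    """Return the bytes to write for one character under the given prefs."""
    cp = ord(ch)
    # Only non-ASCII characters are candidates for named entities here.
    name = codepoint2name.get(cp) if cp > 0x7F else None
    try:
        raw = ch.encode(charset)
    except UnicodeEncodeError:
        raw = None  # no character code in the target charset
    if name and (prefs.entities_first or raw is None):
        return b"&" + name.encode("ascii") + b";"
    if raw is not None:
        return raw
    fmt = "&#x%X;" if prefs.hex_ncr else "&#%d;"
    return (fmt % cp).encode("ascii")
```

For é with an ISO-8859-1 target, the default ("after") writes the single byte 0xE9, while "before" writes `&eacute;`. A character with neither a character code nor a named entity always falls through to an NCR in the selected format.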
I'm a bit confused by that list. Does it offer some way to encode &, < and > to named entities but not to encode " ? I think a lot of people will be annoyed if their quotes all become entities when there doesn't seem to be any pressing reason for that.
Blocks: 5894
There will be cases when we need to convert " to an entity. The easiest implementation is probably to entity-ize all occurrences of the 4 special characters. Otherwise Ender needs to have smarts about when it is OK or not OK to use the raw character codes. I tried a test using 4.5 Composer. In Notepad I created foo.html:
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="Author" content="Robert &gt;&quot;Bob&quot;&lt; Jung">
  <title>&gt;&quot;hello&quot;&lt;</title>
  </head>
  <body>
  less-than: &lt;
  <br>greater-than: &gt;
  <br>double-quote: &quot;
  <br>ampersand: &amp;
  </body>
  </html>
I opened foo.html with 4.5 Composer and saved it to foo2.html. It changed all "&gt;" to > and "&quot;" to ". It even changed the "&lt;" in the <meta author...> tag to <. Here is the resulting foo2.html:
  <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="Author" content="Robert >"Bob"< Jung">
  <meta name="GENERATOR" content="Mozilla/4.5 [en]C-NSCP (Win95; U) [Netscape]">
  <title>>"hello"&lt;</title>
  </head>
  <body>
  less-than: &lt;
  <br>greater-than: >
  <br>double-quote: "
  <br>ampersand: &amp;
  </body>
  </html>
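The disagreement over &quot; comes down to context: in element text a raw " is harmless, but inside a quoted attribute value it ends the value early, as the Author attribute in foo2.html shows. A small Python sketch of that distinction, using the standard library's `html.escape`; the two wrapper names are hypothetical and this is not how Composer or the content sink does it. Note that `html.escape` with `quote=True` also escapes the single quote.

```python
from html import escape


def escape_text(s):
    """Escape for element content: only &, < and > are replaced."""
    return escape(s, quote=False)


def escape_attr(s):
    """Escape for a quoted attribute value: quotes are replaced as well."""
    return escape(s, quote=True)


author = 'Robert >"Bob"< Jung'
# In body text the quotes can stay raw.
body_line = "author: " + escape_text(author)
# In an attribute the quotes must be escaped or the value is cut short.
meta_tag = '<meta name="Author" content="%s">' % escape_attr(author)
```

Escaping everywhere is the simpler rule; escaping " only in attribute values avoids turning every quote in running text into an entity.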
Putting that logic in the editor library would require some sort of big ugly hack. Right now it's handled by the content sink stream on output, after ender no longer has control over it. The content sink stream can have multiple modes; but if every sink stream has to know all about which special character codes do and do not get mapped to entities, then what's the point of having a service to do the mapping?
Target Milestone: M8 → M9
this probably won't make it until m9, any issue with moving it?
Target Milestone: M9 → M10
not going to make m9 cutoff, moving to m10
Blocks: 6672
Blocks: 10940
Do you still feel this has a dependency on bug 8865?
Tague, can we make this for M10? The scope of this bug is the converter itself without the integration part, right?
i have an api and a shell on my machine at home. i stopped working on this because i found that someone had already implemented some entity conversion code. right now i'm looking at the html save-as path to try to figure out what is going on with entity conversion before i invest a lot of time in building this converter. i doubt this is going to make m10.
Target Milestone: M10 → M11
Blocks: 13401
Assignee: tague → nhotta
Status: ASSIGNED → NEW
reassign to naoki, tague, please give naoki a brain dump
No longer blocks: 10940
Status: NEW → ASSIGNED
Status: ASSIGNED → NEW
tague, are you able to check in whatever you have to the tree? Then I can start working on this. For M11, messenger can use it to generate entities before converting to the mail charset. Later, if entities are generated elsewhere, it can simply be removed from messenger.
Status: NEW → ASSIGNED
The plan is to call nsIEntityConverter::ConvertToEntity from message compose before converting unicode to mail charset.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Hooked up the entity converter to nsMsgSend.cpp. If later editor generates entities then it can be removed from messenger.
QA Contact: teruko → nhotta
Status: RESOLVED → REOPENED
Reopening the bug. We have not implemented everything we planned. The original spec says:
1. convert to character code in target charset encoding
2. convert to HTML3 Named Entity
3. convert to various subsets of HTML4 Named Entities (e.g., math)
4. convert to decimal Numeric Character Reference (NCR)
5. convert to HTML4 hexadecimal NCR
We currently have 1, 2 and partially 3. We need a simple interface that converts unicode to HTML (or plain text) using the charset converters and the entity converter, and that also generates NCRs. In the case of plain text, instead of generating named entities and NCRs, either skip the non-convertible chars or convert to UTF-8 plain text.
Status: REOPENED → ASSIGNED
Resolution: FIXED → ---
Clearing fixed resolution since it's been reopened.
No longer blocks: 5894
No longer blocks: 6672
Removing 6672 from the dependency list because it depended on nbsp entity generation, which is done.
Blocks: 12392
Blocks: 15674
Depends on: 15706
Adding 15706 as a dependency; it is needed for the fallback case when the unicode converter fails because a character has no mapping. For now, I can work around it by doing the conversion per character, but that is slow.
Blocks: 15475
Blocks: 16441
I checked in my changes today. There are two changes.
1) Changed nsIEntityConverter to support the complete html40 entities. It supports 2 & 3 of the original spec.
2) Added a new interface, nsISaveAsCharset. This is a superset of nsIEntityConverter: it supports entities and NCRs and also does a charset conversion. It also supports plain text input (see the idl file for details). This interface supports all the requirements of the original spec.
Note: the nsISaveAsCharset implementation depends on the unicode encoder bug 15706; for some charsets (e.g. ISO-8859-1) it doesn't work correctly until 15706 is fixed. nsIEntityConverter does not have this dependency.
Blocks: 16950
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
No longer depends on: 15706
Resolution: --- → FIXED
Removing 15706 from the dependency list. We agreed that encoders should include the unmapped character in the consumed length, which is the current behavior of all encoders except ISO-2022-JP. 15706 is now a problem specific to ISO-2022-JP. I made a change to the callers of the unicode encoder. Marking this bug as FIXED.
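The "consumed length includes the unmapped character" convention can be illustrated with Python's codec error handlers, which are an analogy only and not the Mozilla encoder API. The handler below is told which span failed, emits a decimal NCR for it, and returns the position just past that span so encoding resumes there. The handler name `ncr_fallback_demo` and both function names are hypothetical; Python's built-in `xmlcharrefreplace` handler behaves the same way.

```python
import codecs


def ncr_fallback(exc):
    """Replace the unmappable span with decimal NCRs and resume after it."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]  # the character(s) with no mapping
    refs = "".join("&#%d;" % ord(c) for c in bad)
    # Returning exc.end means the unmapped span counts as consumed input.
    return refs, exc.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("ncr_fallback_demo", ncr_fallback)


def encode_with_fallback(text, charset):
    """Encode the whole string in one call instead of per character."""
    return text.encode(charset, errors="ncr_fallback_demo")
```

Because the encoder reports where it stopped and the caller resumes from just past the failure, the whole string is handled in one pass, which avoids the slow per-character workaround mentioned earlier.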
Blocks: 17432
No longer blocks: 17432