Closed Bug 8865 Opened 25 years ago Closed 25 years ago

need API to convert UCS-2 HTML into a target encoding

Categories: Core :: Internationalization, defect, P3

Product: Core

Component: Internationalization

Platform: All, Other

Type: defect

Priority: P3

Severity: normal

Status: RESOLVED FIXED

Milestone: M11

People: Reporter: bobj, Assigned: nhottanscp

Reporter

Description

•

25 years ago

Provide an API which will convert UCS-2 HTML into a target encoding, including appropriate HTML entity substitution. The implementation will probably call the standard Unicode converter APIs, with additional code to handle the entity substitutions.

For given character ranges, the conversion method may do one of the following conversions (controlled via prefs):
1. convert to character code in target charset encoding
2. convert to HTML3 Named Entity
3. convert to various subsets of HTML4 Named Entities (e.g., math)
4. convert to decimal Numeric Character Reference (NCR)
5. convert to HTML4 hexadecimal NCR

Default behavior is still being debated. Factors to be considered include backwards compatibility (e.g., decimal NCRs are supported by almost all clients currently in use, but those clients may not support HTML4 hex NCRs). The biggest debate concerns when to output character code values instead of entities. UI work is needed to control the pref values.
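As a rough illustration of the five output forms listed above, here is a minimal Python sketch. It is not the Mozilla API: `candidate_forms` is a hypothetical helper, and it relies on Python's `html.entities.codepoint2name` table (the HTML 4.01 entity set), so it does not distinguish the HTML3 base set from the HTML4 subsets.

```python
from html.entities import codepoint2name


def candidate_forms(ch: str, charset: str) -> dict:
    """Return the possible output forms for one character (hypothetical helper)."""
    cp = ord(ch)
    forms = {}
    try:
        # 1. Character code in the target charset, when a mapping exists.
        forms["charset_bytes"] = ch.encode(charset)
    except UnicodeEncodeError:
        forms["charset_bytes"] = None
    # 2/3. Named entity, when one is defined (no HTML3/HTML4 subset split here).
    name = codepoint2name.get(cp)
    forms["named_entity"] = f"&{name};" if name else None
    # 4. Decimal numeric character reference.
    forms["decimal_ncr"] = f"&#{cp};"
    # 5. Hexadecimal numeric character reference (HTML4).
    forms["hex_ncr"] = f"&#x{cp:X};"
    return forms
```

Which of these forms is actually emitted is the policy question the prefs are meant to answer; the sketch only enumerates the candidates.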

Reporter

Comment 1

•

25 years ago

Of course when converting from UCS-2, we need to always entity-ize:
"&lt;" represents the < sign
"&gt;" represents the > sign
"&amp;" represents the & sign
"&quot;" represents the " mark
but Ender should do that before the conversion, right?

Updated

•

25 years ago

Summary: need API to convert UCS-2 HTML into a target encoding

Comment 2

•

25 years ago

Do we always need to entity-ize " ? I'm not so sure of that one, or of &, though I agree about < and >. Ender doesn't do anything like this; it just works with the content tree. The XIF converter creates <entity> tags for (currently) &, <, and >. The nsXIFDTD then lets the parser map them back to unicode chars. Finally, the content sink (e.g. nsHTMLContentSinkStream.cpp) calls NS_UnicodeToEntity on each unicode char in the stream to turn them into &foo; format.

Updated

•

25 years ago

Target Milestone: M8

Reporter

Comment 3

•

25 years ago

Good question about " Seems like we didn't handle that before. I just tried it in the 4.5 composer. But can't quotes cause problems for things like: <tag-name param-name="foobar">

For reference, http://www.w3.org/TR/REC-html40/charset.html#h-5.3.2
...
Four character entity references deserve special mention since they are frequently used to escape special characters:
- "&lt;" represents the < sign.
- "&gt;" represents the > sign.
- "&amp;" represents the & sign.
- "&quot;" represents the " mark.

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
...
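To make the escaping of the four special characters concrete, here is a small Python sketch. The function name `escape_markup` and the `escape_quote` switch are assumptions for illustration only; the switch mirrors the open question in this thread about whether " always needs to become an entity.

```python
def escape_markup(text: str, escape_quote: bool = True) -> str:
    """Escape markup-significant characters (hypothetical helper)."""
    # "&" must be replaced first; otherwise the "&" introduced by the
    # later replacements would itself be escaped a second time.
    out = text.replace("&", "&amp;")
    out = out.replace("<", "&lt;").replace(">", "&gt;")
    if escape_quote:
        # Only strictly required inside double-quoted attribute values.
        out = out.replace('"', "&quot;")
    return out
```

With `escape_quote=False` the quotes in running text are left alone, which is the behavior Comment 2 argues for; attribute values would still need the quoting form.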

Reporter

Comment 4

•

25 years ago

For my strawmen pref controls, I think we need (UE should rename/reword):

Editor Entity Output Preferences

Named Entity Priority (radio buttons)
  o After valid character code values (default - but this is debatable)
  o Before valid character code values

Additional (base set is HTML 3.2) Named Entities (check boxes)
  [ ] HTML 4.0 ISO 8859-1 (Latin-1) characters (default off)
  [ ] HTML 4.0 symbols, math symbols, and Greek letters (default off)
  [ ] HTML 4.0 markup-significant and i18n characters (default off)

NCR format (radio buttons)
  o decimal (default)
  o hexadecimal (recommended by HTML4, but would cause backwards compatibility problems)
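One way to picture how these prefs could drive the per-character choice is the sketch below. All names (`EntityOutputPrefs`, `encode_char`) are hypothetical, the three check-box subsets are not modeled (Python's `codepoint2name` table covers the whole HTML 4.01 set), and the target charset is assumed to be ASCII-compatible. ASCII characters, including the markup-significant ones, are passed through untouched and assumed to be handled elsewhere.

```python
from dataclasses import dataclass
from html.entities import codepoint2name


@dataclass
class EntityOutputPrefs:
    """Hypothetical pref set modeled on the strawman above."""
    entities_before_charset: bool = False  # "Named Entity Priority"
    hex_ncr: bool = False                  # "NCR format"


def encode_char(ch: str, charset: str, prefs: EntityOutputPrefs) -> bytes:
    cp = ord(ch)
    if cp < 0x80:
        # ASCII passes through; markup escaping is a separate step.
        return ch.encode("ascii")
    name = codepoint2name.get(cp)
    entity = f"&{name};".encode("ascii") if name else None
    try:
        raw = ch.encode(charset)
    except UnicodeEncodeError:
        raw = None
    ncr = (f"&#x{cp:X};" if prefs.hex_ncr else f"&#{cp};").encode("ascii")
    # The priority pref only changes the order in which candidates are tried.
    order = [entity, raw] if prefs.entities_before_charset else [raw, entity]
    for candidate in order:
        if candidate is not None:
            return candidate
    return ncr  # last resort: numeric character reference
```

The NCR is always available, so it serves as the final fallback regardless of the priority setting.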

Comment 5

•

25 years ago

I'm a bit confused by that list. Does it offer some way to encode &, < and > to named entities but not to encode " ? I think a lot of people will be annoyed if their quotes all become entities when there doesn't seem to be any pressing reason for that.

Assignee

Updated

•

25 years ago

Blocks: 5894

Reporter

Comment 6

•

25 years ago

There will be cases when we need to convert " to an entity. The easiest implementation is probably to entity-ize all occurrences of the 4 special characters. Otherwise Ender needs to have smarts about when it is OK or not OK to use the raw character codes.

I tried a test using 4.5 Composer. In Notepad I created foo.html:

  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="Author" content="Robert &gt;&quot;Bob&quot;&lt; Jung">
  <title>&gt;&quot;hello&quot;&lt;</title>
  </head>
  <body>
  less-than: &lt;
  <br>greater-than: &gt;
  <br>double-quote: &quot;
  <br>ampersand: &amp;
  </body>
  </html>

I opened foo.html with 4.5 Composer and saved it to foo2.html. It changed all "&gt;" to > and "&quot;" to ". It even changed the "&lt;" in the <meta author...> tag to <. Here is the resulting foo2.html:

  <!doctype html public "-//w3c//dtd html 4.0 transitional//en">
  <html>
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
  <meta name="Author" content="Robert >"Bob"< Jung">
  <meta name="GENERATOR" content="Mozilla/4.5 [en]C-NSCP (Win95; U) [Netscape]">
  <title>>"hello"&lt;</title>
  </head>
  <body>
  less-than: &lt;
  <br>greater-than: >
  <br>double-quote: "
  <br>ampersand: &amp;
  </body>
  </html>
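The attribute case in the test above is where raw quotes do real damage, since an unescaped " ends the attribute value early. The following Python sketch (with hypothetical names `MetaCollector` and `author_roundtrip`) writes a meta Author tag with the value entity-ized and then parses it back with the standard library's `html.parser` to check that the value survives the round trip.

```python
import html
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collects the content of <meta name="Author"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            fields = dict(attrs)
            if fields.get("name") == "Author":
                # The parser hands back attribute values already unescaped.
                self.content = fields.get("content")


def author_roundtrip(author: str):
    # quote=True also escapes the double quote, which the attribute needs.
    tag = '<meta name="Author" content="%s">' % html.escape(author, quote=True)
    parser = MetaCollector()
    parser.feed(tag)
    parser.close()
    return tag, parser.content
```

Calling `author_roundtrip('Robert >"Bob"< Jung')` yields a tag whose content attribute is fully entity-ized and a parsed value equal to the original string.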

Comment 7

•

25 years ago

Putting that logic in the editor library would require some sort of big ugly hack. Right now it's handled by the content sink stream on output, after ender no longer has control over it. The content sink stream can have multiple modes; but if every sink stream has to know all about which special character codes do and do not get mapped to entities, then what's the point of having a service to do the mapping?

Updated

•

25 years ago

Target Milestone: M8 → M9

Comment 8

•

25 years ago

this probably won't make it until m9, any issue with moving it?

Updated

•

25 years ago

Target Milestone: M9 → M10

Comment 9

•

25 years ago

not going to make m9 cutoff, moving to m10

Updated

•

25 years ago

Blocks: 6672

Jean-Francois Ducarroz

Updated

•

25 years ago

Blocks: 10940

Reporter

Comment 10

•

25 years ago

Do you still feel this has a dependency on bug 8865?

Comment 11

•

25 years ago

Tague, can we make this for M10? The scope of this bug is the converter itself without the integration part, right?

Comment 12

•

25 years ago

i have an api and a shell on my machine at home. i stopped working on this, because i found that someone had already implemented some entity conversion code. right now i'm looking at the html save as path to try to figure out what is going on with entity conversion before i invest a lot of time in building this converter. i doubt this is going to make m10.

Updated

•

25 years ago

Target Milestone: M10 → M11

Updated

•

25 years ago

Blocks: 13401

Updated

•

25 years ago

Assignee: tague → nhotta

Status: ASSIGNED → NEW

Comment 13

•

25 years ago

reassign to naoki, tague, please give naoki a brain dump

Assignee

Updated

•

25 years ago

No longer blocks: 10940

Assignee

Updated

•

25 years ago

Status: NEW → ASSIGNED

Assignee

Updated

•

25 years ago

Status: ASSIGNED → NEW

Assignee

Comment 14

•

25 years ago

tague, are you able to check in whatever you have to the tree? Then I can start working on this. For M11, messenger can use it to generate entities before converting to the mail charset. Later, if entities are generated elsewhere, then it can simply be removed from messenger.

Assignee

Updated

•

25 years ago

Status: NEW → ASSIGNED

Assignee

Comment 15

•

25 years ago

The plan is to call nsIEntityConverter::ConvertToEntity from message compose before converting unicode to mail charset.

Assignee

Updated

•

25 years ago

Status: ASSIGNED → RESOLVED

Closed: 25 years ago

Resolution: --- → FIXED

Assignee

Comment 16

•

25 years ago

Hooked up the entity converter to nsMsgSend.cpp. If later editor generates entities then it can be removed from messenger.

Updated

•

25 years ago

QA Contact: teruko → nhotta

Assignee

Updated

•

25 years ago

Status: RESOLVED → REOPENED

Assignee

Comment 17

•

25 years ago

Reopening the bug. We have not implemented everything we planned. The original spec says:
1. convert to character code in target charset encoding
2. convert to HTML3 Named Entity
3. convert to various subsets of HTML4 Named Entities (e.g., math)
4. convert to decimal Numeric Character Reference (NCR)
5. convert to HTML4 hexadecimal NCR

We currently have 1, 2 and partially 3. We need a simple interface that converts unicode to HTML (or plain text) by using the charset converters and the entity converter, and that also generates NCRs. In the case of plain text, instead of named entities and NCRs, either skip the unconvertible chars or convert to UTF-8 plain text.
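The combined behavior described above (charset conversion with entity and NCR fallback for HTML, and skip or UTF-8 fallback for plain text) can be sketched in Python using a custom codec error handler. This is only an illustration under stated assumptions: `unicode_to_target`, the `"entity_or_ncr"` handler name, and the `plain_fallback` parameter are invented here, and the target charset is assumed to be ASCII-compatible.

```python
import codecs
from html.entities import codepoint2name


def _entity_or_ncr(err):
    """Error handler: named entity if one exists, else a decimal NCR."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    parts = []
    for ch in err.object[err.start:err.end]:
        name = codepoint2name.get(ord(ch))
        parts.append(f"&{name};" if name else f"&#{ord(ch)};")
    # Return the replacement text and the position at which to resume.
    return "".join(parts), err.end


codecs.register_error("entity_or_ncr", _entity_or_ncr)


def unicode_to_target(text: str, charset: str, is_html: bool = True,
                      plain_fallback: str = "skip") -> bytes:
    if is_html:
        # Unmappable characters become entities or NCRs.
        return text.encode(charset, errors="entity_or_ncr")
    if plain_fallback == "skip":
        # Plain text, option A: drop characters with no mapping.
        return text.encode(charset, errors="ignore")
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        # Plain text, option B: give up on the charset and emit UTF-8.
        return text.encode("utf-8")
```

Note that in option B the caller would also have to label the output as UTF-8, which the sketch does not model.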

Assignee

Updated

•

25 years ago

Status: REOPENED → ASSIGNED

Updated

•

25 years ago

Resolution: FIXED → ---

Comment 18

•

25 years ago

Clearing fixed resolution since it's been reopened.

Assignee

Updated

•

25 years ago

No longer blocks: 5894

Assignee

Updated

•

25 years ago

No longer blocks: 6672

Assignee

Comment 19

•

25 years ago

Removing 6672 from the dependency list because it depended on nbsp entity generation, which is done.

Assignee

Updated

•

25 years ago

Blocks: 12392

Updated

•

25 years ago

Blocks: 15674

Assignee

Updated

•

25 years ago

Depends on: 15706

Assignee

Comment 20

•

25 years ago

Adding 15706 as a dependency; it is needed for the fallback case where the unicode converter fails because there is no mapping. For now, I can work around it by doing the conversion per character, but that is slow.

Assignee

Updated

•

25 years ago

Blocks: 15475

Updated

•

25 years ago

Blocks: 16441

Assignee

Comment 21

•

25 years ago

I checked in my changes today. There are two changes.
1) Changed nsIEntityConverter to support the complete html40 entities. It supports 2 & 3 of the original spec.
2) Added a new interface, nsISaveAsCharset. This is a superset of nsIEntityConverter: it supports entities and NCRs and also does a charset conversion. It also supports plain text input (see the idl file for details). This interface supports all the requirements of the original spec.

Note: the nsISaveAsCharset implementation depends on the unicode encoder bug 15706; for some charsets (e.g. ISO-8859-1) it doesn't work correctly until 15706 is fixed. nsIEntityConverter does not have this dependency.

Updated

•

25 years ago

Blocks: 16950

Assignee

Updated

•

25 years ago

Status: ASSIGNED → RESOLVED

Closed: 25 years ago → 25 years ago

No longer depends on: 15706

Resolution: --- → FIXED

Assignee

Comment 22

•

25 years ago

Removing 15706 from the dependency list. We agreed that encoders should include the unmapped character in the consumed length, which is the current behavior of all encoders except ISO-2022-JP. 15706 is now a problem specific to ISO-2022-JP. I made a change to the callers of the unicode encoder. Marking this bug as FIXED.
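The caller-side pattern implied by that agreement (encode a run, and on an unmapped character emit a fallback and resume after it) can be sketched in Python as follows. This is an analogy, not the Mozilla encoder interface: `encode_with_fallback` and its `fallback` callback are hypothetical, Python's `UnicodeEncodeError.start`/`.end` stand in for the consumed length, and a stateless target charset is assumed.

```python
def encode_with_fallback(text: str, charset: str, fallback) -> bytes:
    """Encode text, calling fallback(unmapped_str) -> bytes where no mapping exists."""
    out = bytearray()
    pos = 0
    while pos < len(text):
        try:
            out += text[pos:].encode(charset)
            break
        except UnicodeEncodeError as err:
            # err.start and err.end are offsets into the slice given to encode().
            out += text[pos:pos + err.start].encode(charset)
            out += fallback(text[pos + err.start:pos + err.end])
            # Resume after the unmapped character(s): the consumed length
            # includes them, so the caller never re-submits the same input.
            pos += err.end
    return bytes(out)


def decimal_ncr(chars: str) -> bytes:
    # Example fallback: decimal numeric character references.
    return "".join(f"&#{ord(c)};" for c in chars).encode("ascii")
```

Compared with converting one character at a time, this only pays the restart cost at each unmapped character, so text that is mostly representable in the target charset is encoded in a few large runs.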

Updated

•

25 years ago

Blocks: 17432

Updated

•

24 years ago

No longer blocks: 17432
