Closed Bug 74137 Opened 24 years ago Closed 24 years ago

Composer: anchor doesn't accept non-ascii name

Categories

(Core :: Internationalization, defect, P3)

defect

Tracking


VERIFIED FIXED
mozilla0.9.2

People

(Reporter: ji, Assigned: cmanske)

References

(Depends on 1 open bug)

Details

(Keywords: intl, regression)

Attachments

(6 files)

Please mark this bug as a dup if there is already one for this problem. The anchor can't accept non-ASCII data.

Steps to reproduce:
1. Launch Composer.
2. Enter a string containing non-ASCII characters in the composer, like ànchor.
3. Highlight the string and click on the Anchor button. The anchor name strips out the Latin-1 character; it shows up as "nchor". If the string is Japanese, no anchor name is picked up at all.
4. In the anchor name field, type non-ASCII data. The non-ASCII characters don't show after hitting Enter to commit.
Is this a new problem or same as 6.0 and 6.01?
Keywords: intl
6.01 doesn't have this problem. This is a regression.
Keywords: regression
I think Charley changed this recently.
OS: Windows 95 → All
I didn't see this when I checked it on 03-05 trunk build, so must be some change after that.
Reassign to cmanske, please check if this is related to your change.
Assignee: nhotta → cmanske
The current implementation allows only "CDATA" strings, according to the HTML 4.0 spec. I don't think it's supposed to allow foreign characters, is it? Can a URL have Japanese characters? If not, then the current implementation is correct, since the anchor name becomes part of a URL when referred to in a link.
"CDATA is a sequence of characters from the document character set and may include character entities." http://www.w3.org/TR/html4/types.html#type-cdata So non-ASCII is allowed. But another place in the spec recommends that UTF-8 be used for HREF. I implemented this in Mozilla after 6.0, so it is not supported in 6.01 and earlier. http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars I think not allowing non-ASCII is fine. We may want to have an option to allow non-ASCII and use UTF-8 in that case. Cc to momoi, he is responsible for HTML compatibility issues. So if the user selects only non-ASCII characters, should the anchor edit field come up empty? And what if the user types non-ASCII in the field? We need to alert the user that non-ASCII cannot be used there.
QA Contact: andreasb → ylong
setting to mozilla0.9.1
Priority: -- → P3
Target Milestone: --- → mozilla0.9.1
> I think not allowing non ASCII is fine. We may want to
> have an option to allow non ASCII and use UTF-8 for that
> case. Cc to momoi, he is responsible for HTML
> compatibility issues.

If we are not allowing non-ASCII characters in the 'name' attribute, how can a user input non-ASCII domain names? This will probably become a necessity not long from now. We can debate whether or not there should be a UI for the non-ASCII option in the anchor value, but there is a bug here: Composer should be able to handle non-ASCII anchor values in any case. That way, when iDN gets implemented, Composer will at least not mangle non-ASCII characters. We can discourage users from creating "#non-ASCII_string" type anchor names internal to the document, but I don't think we should withhold even basic support for it. I recommend fixing the non-ASCII handling bug regardless.
This is for the anchor name, so it is not really related to iDNS. Anchors can be non-ASCII, as I mentioned. Whether we allow it or not in the editor can be an option. When non-ASCII is allowed, HREF should take care of it:

a) Follow the HTML 4 spec and generate UTF-8, percent-encoded. I think older browsers (6.01 and earlier) do not recognize this.
b) Use raw 8-bit in HREF, which is not recommended in the HTML spec.
c) Use a percent-encoded name in the document charset.

I don't know whether 4.x generates anchors in b) or c) style. What I meant to propose in my last comment was to have an option to allow/disallow non-ASCII anchors, then use a), which is UTF-8. So if the user wants to support older browsers, the user has to avoid non-ASCII anchors. I prefer not allowing non-ASCII as the default, to avoid complication. When editing an existing document, non-ASCII characters in an anchor should not be removed unless the user edits the anchor in the Mozilla editor.
Thanks for the reply.

> I don't know 4.x generate anchors as b) or c) style.

Comm 4.x uses b).

> When editing an existing document, non ASCII characters in
> anchor should not be removed unless the user edit the
> anchor in mozilla editor.

With current builds, editing an existing non-ASCII anchor is not possible due to this bug.

> What I meant to propose in my last comment was to have an
> option to allow/disallow non ASCII anchors then use c)
> which is UTF-8. So if the user wants to support older
> browsers the user has to use non ASCII anchors. I prefer
> not allow non ASCII as a default to avoid complication.

Can this process be automatic, so that the user can load and edit an existing non-ASCII anchor without setting the pref? You can disallow creation of new non-ASCII anchor names.
So can someone help me here? How do I "support non-ascii characters"?
* We can have a bool pref for the non-ASCII character option in anchors. If non-ASCII is allowed, then the editor should not exclude non-ASCII characters when editing and generating anchors.
* For the URL in HREF: if the pref allows non-ASCII anchors, convert the string to UTF-8; otherwise convert it to the document charset. Apply URL escaping regardless of the pref value.

Answering Momoi-san's question: existing non-ASCII anchors will be preserved regardless of the pref value (in UTF-8 or the document charset).
Please, no more options/preferences. We should handle all characters accepted in the charset. Charley--sounds like you'll need to convert the string to the document charset and apply URL escaping (I think the latter is described in the HTML4 spec; %- escaping?).
If the editor has a pref to control standard vs. compatibility mode, then that could be used instead. I don't feel good about generating a format that is not recommended by the spec.
Here is an example document which uses non-ASCII anchor names (from bug 58819): http://www.fcenter.ru/wn/hn02082000.htm This document does not escape in either "<A NAME=" or "<A HREF=", so it uses raw 8-bit characters in the document charset (windows-1251). There is another example, also from bug 58819: http://bugzilla.mozilla.org/showattachment.cgi?attach_id=19353 It has HREF parts which use the document charset, URL escaped: "ja_anchors.html#%C6%FC%CB%DC%B8%EC" And HREF parts with UTF-8, URL escaped (as mentioned in the HTML spec): "ja_anchors.html#%E6%BC%A2%E5%AD%97" All of the above cases are supported on the trunk but not in 6.0/6.01 (that was bug 58819). I cannot find an example which escapes the "<A NAME=" part. I think that's because it doesn't have to be escaped (you can just use 8-bit in the document charset). Let me try to create one later and see if it's supported by our browsers.
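As a side note, the UTF-8-and-escaped form quoted from the second attachment can be reproduced with modern JavaScript's encodeURIComponent, which percent-encodes the UTF-8 bytes of each character (a sketch for illustration only, not the trunk serializer code):

```javascript
// encodeURIComponent percent-encodes the UTF-8 bytes of each character,
// matching the HTML4-recommended HREF form "%E6%BC%A2%E5%AD%97" above.
const utf8Escaped = encodeURIComponent('\u6F22\u5B57'); // 漢字

console.log(utf8Escaped); // "%E6%BC%A2%E5%AD%97"
```

The document-charset form ("%C6%FC%CB%DC%B8%EC") cannot be produced this way, since standard JS APIs only encode to UTF-8.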
I tested a document with escaped anchors (e.g. <A NAME="%E0">). Because an anchor name is not a URL, it is not supposed to be URL escaped. If it is escaped, it is treated as a distinct entity (e.g. the names "%E0" and "à" should be recognized as different entities). So what I told Charley this morning was incorrect: anchor names should not be unescaped. I checked both 4.x and 6.0; they do not unescape anchors.
Naoki understands this much better than me.
Assignee: cmanske → nhotta
adding nsbeta1 keyword.
Keywords: nsbeta1
Before allowing non-ASCII input, we should URL escape as Kathy mentioned. Which file should I look at to escape the URL in HREF?
Marina, thanks for help!
QA Contact: ylong → marina
re-assigning QA contact
QA Contact: marina → ylong
I think there are two possible places where we can apply URL encoding. The first place is after the user types (or selects) the link in the dialog. We could only use UTF-8 there, because the target charset is unknown at that point. The other issue is that we may have to do the URL encoding in multiple UI locations where the user can enter a URL (e.g. the image URL). The other possibility is to change nsDocumentEncoder. It might be possible to detect URIs in a document and then apply the URL encoding there. Currently, the code does named-entity conversion first and then converts to the charset. But for a URI, it has to be converted to the charset first and then URL encoded, with no entity conversion. Also, the current code seems to be doing the charset conversion per block, not per node. In any case, it would not be a simple change; I am going to ask jst whether the change is possible.
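To illustrate that ordering (convert to the target charset first, then URL escape, with no entity conversion), here is a hedged sketch in modern JavaScript. It assumes UTF-8 as the target charset, since standard JS APIs cannot encode to legacy charsets, and escapeURI is a hypothetical name, not the serializer's actual code:

```javascript
// Sketch of "convert to charset first, then URL escape":
// 1. TextEncoder converts the Unicode string to bytes (UTF-8 here;
//    the real serializer would use the document charset).
// 2. Bytes outside printable ASCII are percent-escaped.
// No entity conversion is applied to the URI.
function escapeURI(str) {
  const bytes = new TextEncoder().encode(str);
  let out = '';
  for (const b of bytes) {
    if (b >= 0x21 && b <= 0x7E) {
      out += String.fromCharCode(b); // keep printable ASCII as-is
    } else {
      out += '%' + b.toString(16).toUpperCase().padStart(2, '0');
    }
  }
  return out;
}

escapeURI('#\u6F22\u5B57'); // "#%E6%BC%A2%E5%AD%97"
```

Note that '%' is left alone here, so already-escaped input is not re-escaped; double-escaping is exactly the hazard that comes up later in this bug.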
Naoki--I notice in the EscapeURI method that you start out Append'ing (rather than truncating the string to 0 length or otherwise initializing it). Is that intentional, or do you intend that people might call this method to leverage the appending (such as adding parts to the URI)?
FYI, I had the same question as brade mentioned above.

In nsHTMLContentSerializer::EscapeURI(), I think that if we execute the code in the "if" statement, documentCharset will be pointing to a buffer that has been free'd, since the temporary created by NS_ConvertUCS2toUTF8() will have gone out of scope:

+  if (!uri.IsASCII()) {
+    textToSubURI = do_GetService(NS_ITEXTTOSUBURI_CONTRACTID, &rv);
+    NS_ENSURE_SUCCESS(rv, rv);
+    const PRUnichar *u;
+    mCharSet->GetUnicode(&u);
+    documentCharset = NS_ConvertUCS2toUTF8(u).get();
+  }

Also, I think that if we checked whether start >= aURI.Length() after we exit the while loop, we could avoid executing all that "// Escape the remaining part." code when processing URLs that end with '=', etc.

Not sure if this matters or not, but if this method gets called a lot, it may be faster to change the code at the end of nsHTMLContentSerializer::SerializeAttributes():

+  // Need to escape URI.
+  nsAutoString escapedURI;
+  if (NS_SUCCEEDED(EscapeURI(valueStr, escapedURI)))
+    valueStr = escapedURI;

to copy valueStr into a temp, pass the temp as the first param, and have EscapeURI write directly into valueStr. I say this because escaped URI strings are sometimes a lot longer than the unescaped version.
Thanks for your comments; I will update. On "write directly into valueStr": I am not doing this, to avoid creating an incomplete URI, which might happen on errors (e.g. a charset conversion failure). But I can restore the temp in case of an error; let me try that.
Status: NEW → ASSIGNED
Kathy, Kin, please review the new patch, thanks.
r=ftang
The code:

+  const char *documentCharset;
+  mCharSet->GetUnicode(&charset);
+  documentCharset = NS_ConvertUCS2toUTF8(charset).get();

is not good. The temporary NS_ConvertUCS2toUTF8() (NS_ConvertUCS2toUTF8 is actually a class, not a function) created here only lives long enough for the assignment to happen, but it's not guaranteed to live any longer than that, so this is a crash waiting to happen. To solve this, either make documentCharset an nsXPIDLCString (which would let you move the assignment into the "if (!uri.IsASCII())" block), or ensure that the lifetime of the temporary NS_ConvertUCS2toUTF8 is longer than documentCharset.

Shouldn't this code use the *base* URI (if available) as the base for NS_MakeAbsoluteURI()?

mCharSet really needs to be an nsCOMPtr, and you need null checks around accesses to mCharSet.

Other than that the changes look good; fix that and I'll have one more look.
NS_MakeAbsoluteURI() is used under the following condition, same as the current code:

if (mFlags & nsIDocumentEncoder::OutputAbsoluteLinks)
sr=jst, I filed a new bug on the base uri stuff which I still think is wrong (and was wrong before this change too), see bug 80081.
Checked in the patch, now I may enable 8 bit characters in the UI. But I don't know how. Charley, can that be done by flipping a switch or something?
Sorry, I don't have a clue! Maybe someone in XPFE group knows?
Charley--are you sure that we aren't restricting (with JS) the characters that can go in that field? That might be what is interfering with what Naoki has done.
There is a function ValidateData(),
http://lxr.mozilla.org/seamonkey/source/editor/ui/dialogs/content/EdNamedAnchorProps.js#119
that calls ConvertToCDATAString,
http://lxr.mozilla.org/seamonkey/source/editor/ui/dialogs/content/EdDialogCommon.js#331

329 // Replace whitespace with "_" and allow only HTML CDATA
330 // characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", and "."
331 function ConvertToCDATAString(string)
332 {
333   return string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:]+/g,'');
334 }

But I don't know where non-ASCII characters are filtered out.
Let me ask a different question. NS6 does not restrict input for characters greater than 127. What has changed around the anchor dialog since NS6?
The difference is that cmanske added some code to filter out characters that are not allowed in CDATA strings (the ConvertToCDATAString() replace call you mentioned above) as the user types. In order to make this work for you, I think we need to not filter while the user types at all (except for any CDATA requirement about what the first char must be?), since the URI escape code should take care of non-CDATA chars. The next thing we need to figure out is how our DOM implementation expects the 'name' attribute of the anchor node to be stored. Escaped? Unescaped? If it expects it to be escaped, then the modification made to the serializer should have been unnecessary, since the escaping should have been done by the anchor dialog code before it was stored in the anchor node.
URL escaping cannot be done while the data is in Unicode (unless we use UTF-8); it has to be done after we know the target charset.
The rest of the change will be inside the editor, and I am not familiar with it. Clearing the milestone and reassigning to the editor group. The remaining part is to not filter out characters above 127 in the anchor property dialog. I fixed the HTML serializer to escape those characters, so the UI need not filter them out.
Assignee: nhotta → beppe
Status: ASSIGNED → NEW
Target Milestone: mozilla0.9.1 → ---
over to charley
Assignee: beppe → cmanske
Target Milestone: --- → mozilla0.9.2
I think this is easy to do by including the characters in the range 128 to 'end' in the list of characters NOT to filter. What is the largest number in the range (i.e. what is 'end')?
Status: NEW → ASSIGNED
I think we can use 65535 (0xFFFF).
Can you please test the patch to editor's EdDialogCommon.js? The JS reference says to use "\xnn" for hex character, so I hope "\xFFFF" works for values above 255.
Okay, I will apply the patch and post the result after I test.
After the patch, I was able to input Latin-1 characters (128-255), but characters at 256 and above are still filtered out. You can test these characters in a US environment by pasting from "Character Map".
That's too bad! That would be a deficiency in JavaScript's RegExp character support, imho. Brendan: how can we detect characters above 255?
Setting status. Use the last patch unless we hear about another method to match chars > 255 in a regexp.
Keywords: patch, review
Whiteboard: FIX IN HAND need r=, sr=
I tried the last patch and it did not filter the Japanese characters I input. But I found the data is corrupted somewhere. I put a break point in the HTML content serializer, which expects UCS2, but what I got was a zero-extended string like 0x00E3 0x0081 0x0082 for one Japanese character.

0012A9E8 23 00 E3 00 81 00 82 00 E3 #.ã...?.ã
0012A9F1 00 81 00 84 00 E3 00 81 00 ...?.ã...
0012A9FA 86 00 00 00 00 00 58 AD 12 ?.....X­.

I think what happens here is something like this: somewhere in the code we have a string in UTF-8 (e.g. 0xE38182); instead of converting it to UCS2, the code uses AssignWithConversion, and the result is 0x00E3 0x0081 0x0082, just like we see in the dump above. Note this has nothing to do with the patch. But I didn't see this when I tested my serializer change using existing non-ASCII anchors, so it must be somewhere in the editor code.
The problem is in nsHTMLAnchorElement.cpp, not the editor. I filed bug 81090.
Depends on: 81090
Yikes! Cc'ing rogerl. JS uses Unicode, including in its regexp implementation. If you can show a bug where [^a-zA-Z0-9_\.\-\:]+ does not match code points >= 256, please file a bug on the JS engine. Cc'ing pschwartau for help reducing the testcase. The workaround of looping over each character is too expensive to contemplate. Please, let's fix this right, ASAP. /be
Ah, I should have read my bugmail or this bug's comments before commenting. Thanks to nhotta, this sounds like a bug in HTML content code, treating UTF-8 as ASCII. But it'd be cool to confirm that JS regexp character classes (negated ones, yet) work with two-byte code points. cmanske, the klunky fix seems spurious in light of nhotta's finding -- I hope you don't land it! /be
My comment on 2001-05-15 17:23:

> Note this is nothing to do with the patch.

Just to clarify: I meant that the patch for EdDialogCommon.js has nothing to do with the string corruption. The string corruption happens at href creation (that's bug 81090). The patch for EdDialogCommon.js is needed separately, to avoid filtering the user's input for non-ASCII anchor names.
nhotta: agreed, but the klunky fix (last patch) is not needed (besides being too slow and based on a fear of a JS regexp unicode bug that does not exist), once 81090 is fixed -- right? /be
It's needed. Bug 81090 fixes the data loss after the "Link Property" dialog; the .js patch in this bug avoids the data loss in the "Anchor Property" dialog. There are two separate data losses, so we need two different patches to fix both problems.
Brendan: So how hard would it be to extend RegExp to handle full range of characters ("\xnnnn" instead of "\xnn")?
nhotta: yes, but we don't need the last patch ("Klunky fix") once 81090 is fixed, I hope -- because JS regexps correctly handle UTF16 code points above 255. Agreed? cmanske: regexps already handle full Unicode. Use \uXXXX to spell Unicode code points in the UTF16-like encoding that JS uses. /be
Also see bug 72964: "String method for pattern matching failed for Chinese Simplified (GB2312)" There is a reduced HTML testcase attached there: http://bugzilla.mozilla.org/showattachment.cgi?attach_id=28553 The final section of the testcase uses Unicode escape sequences (Ex: str = '\u00BF\u00CD\u00BB\u00A7' ) and pattern-matching in JavaScript -
Brendan, do you mean the patch of 05/15/01 15:20 should work once bug 81090 is fixed? Let me try and I will post the result later.
I tested with the patch for bug 81090 and the patch of 05/15/01 15:20. Japanese characters are still filtered out. But it worked after I changed the .js file to use \u0080 and \uFFFF instead of \x80 and \xFFFF.
I checked in for a problem of nsHTMLAnchorElement::GetHref, removing the depend.
No longer depends on: 81090
Totally cool! I didn't know about "\uxxxx"! So r=cmanske on that part. Brendan: does this look good now? Can you please sr=?
Whiteboard: FIX IN HAND need r=, sr= → FIX IN HAND need sr=
nhotta: I don't understand your latest patch here -- the negated character class excludes all characters not listed after the [^ and before the closing ], so you don't want to add \u0080-\uFFFF -- that range is already excluded by cmanske's patch. /be
Blocks: 81090
No longer blocks: 81090
Depends on: 81090
05/15/01 15:20:
string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\x80-\xFFFF]+/g,'');

05/16/01 16:31:
string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\u0080-\uFFFF]+/g,'');

I just changed \x80 to \u0080 and \xFFFF to \uFFFF.
The patch works -- I tested it. The higher characters should be included in the list of negated characters. The filter does: "replace any character that is *not* one of the specific characters or a char >= 128 with the empty string", thus *allowing* the higher characters to be typed.
I checked in the 5/16 patch. So all related issues are resolved now, i.e., can we mark this fixed?
Whiteboard: FIX IN HAND need sr= → FIX IN HAND
yes, I think so.
fixed
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Keywords: review
Whiteboard: FIX IN HAND
Sorry, I misread the diff and added confusion. I also didn't know that chars > 127 were legal CDATA chars. New one on me. /be
changing QA contact
QA Contact: ylong → marina
Based on the 2001-05-17 build I am reopening this bug: not only are all non-ASCII chars stripped in the anchor dlgbox when highlighted, but I cannot input them at all. I can enter ASCII chars into the anchor dlgbox, but every time I attempt to enter non-ASCII they are escaped.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
2001-05-17 does not have the fix, please verify with 5/18 build or later, thanks.
Status: REOPENED → RESOLVED
Closed: 24 years ago24 years ago
Resolution: --- → FIXED
OK, I thought it was checked into the latest 17th build... today's build crashes for me... I'll verify on the next one then.
There is a regression caused by this (bug 81238): it double-escapes already-escaped URIs.
It works using the 2001-05-24-04 build; I am able to enter non-ASCII strings for anchors (Latin-1 and double-byte).
Status: RESOLVED → VERIFIED
Depends on: 1371010
Depends on: 1371559