Closed Bug 74137
Opened 24 years ago
Closed 24 years ago
Composer: anchor doesn't accept non-ascii name
Categories: Core :: Internationalization, defect, P3
Tracking: VERIFIED FIXED, target milestone mozilla0.9.2
People: Reporter: ji, Assigned: cmanske
References: Depends on 1 open bug
Keywords: intl, regression
Attachments: 6 patch files (all deleted)
Please mark this bug as a dup if there is already one for this problem.
The anchor can't accept non-ascii data.
Steps to reproduce:
1. Launch Composer.
2. Enter a string containing non-ASCII characters in the composer, like ànchor.
3. Highlight the string and click on the Anchor button.
The anchor name strips out the Latin-1 character; it shows up as "nchor".
If the string is Japanese, no anchor name is captured at all.
4. In the anchor name field, type non-ASCII data. The non-ASCII characters don't
show after hitting Enter to commit.
6.01 doesn't have this problem.
This is a regression.
Keywords: regression
Comment 4•24 years ago
I didn't see this when I checked on the 03-05 trunk build, so it must be some change
after that.
Comment 5•24 years ago
Reassign to cmanske, please check if this is related to your change.
Assignee: nhotta → cmanske
Assignee
Comment 6•24 years ago
The current implementation allows only "CDATA" strings, according to the HTML
4.0 spec. I don't think it's supposed to allow foreign characters, is it?
Can a URL have Japanese characters? If not, then the current implementation
is correct, since the anchor name becomes part of a URL when referred to in a link.
Comment 7•24 years ago
"CDATA is a sequence of characters from the document character set and may
include character entities."
http://www.w3.org/TR/html4/types.html#type-cdata
So non-ASCII is allowed. But another place in the spec recommends UTF-8 to be used
for HREF. I implemented this in mozilla after 6.0, so this is not supported in
6.01 and earlier.
http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
I think not allowing non-ASCII is fine. We may want to have an option to allow
non-ASCII and use UTF-8 for that case. Cc to momoi; he is responsible for HTML
compatibility issues.
So in case the user selects only non-ASCII characters, should the anchor edit field
come up as empty?
What if the user types non-ASCII in the field? We need to alert the user that
non-ASCII cannot be used there.
Updated•24 years ago
QA Contact: andreasb → ylong
Comment 8•24 years ago
setting to mozilla0.9.1
Priority: -- → P3
Target Milestone: --- → mozilla0.9.1
Comment 9•24 years ago
> I think not allowing non ASCII is fine. We may want to
> have an option to allow non ASCII and use UTF-8 for that
> case. Cc to momoi, he is responsible for HTML
> compatibility issues.
If we are not allowing non-ASCII characters in
the 'name' attribute, how can a user input non-ASCII
domain names? This will probably become a necessity
not so long from now.
We can debate whether or not there should be an
UI for the non-ASCII option in the anchor value
but there is a bug here. Composer should be able to
handle non-ASCII anchor values in any case. This way
when the iDN gets implemented, Composer will at least
not mangle non-ASCII characters.
We can discourage users from creating "#non-ASCII_string"
type anchor names internal to the document, but I don't
think we should withhold basic support for it.
I recommend fixing the non-ASCII handling bug regardless.
Comment 10•24 years ago
This is for the anchor name, so it is not really related to iDN.
So anchors can be non-ASCII, as I mentioned. Whether we allow it or not in the
editor can be an option. When non-ASCII is allowed, HREF should take care of it.
a) Follow the HTML 4 spec and generate UTF-8, percent encoded. I think older
browsers (6.01 and earlier) do not recognize this.
b) Use raw 8 bit in HREF, which is not recommended in the HTML spec.
c) Use a percent encoded name in the document charset.
I don't know whether 4.x generates anchors in b) or c) style.
What I meant to propose in my last comment was to have an option to
allow/disallow non-ASCII anchors, then use c) which is UTF-8. So if the user
wants to support older browsers, the user has to avoid non-ASCII anchors. I prefer
not allowing non-ASCII as a default, to avoid complication.
When editing an existing document, non-ASCII characters in an anchor should not be
removed unless the user edits the anchor in the Mozilla editor.
Comment 11•24 years ago
Thanks for the reply.
> I don't know 4.x generate anchors as b) or c) style.
Comm 4.x uses b).
> When editing an existing document, non ASCII characters in
> anchor should not be removed unless the user edit the
> anchor in mozilla editor.
With current builds, editing of an existing non-ASCII anchor
is not possible due to this bug.
> What I meant to propose in my last comment was to have an
> option to allow/disallow non ASCII anchors then use c)
> which is UTF-8. So if the user wants to support older
> browsers the user has to use non ASCII anchors. I prefer
> not allow non ASCII as a default to avoid complication.
Can this process be automatic so that the user can
load and edit an existing non-ASCII anchor without
setting the pref? You can disallow creation of new
non-ASCII anchor names.
Assignee
Comment 12•24 years ago
So can someone help me here? How do I "support non-ascii characters"?
Comment 13•24 years ago
* We can have a bool pref for a non-ASCII character option in anchors. If non-ASCII
is allowed, then the editor should not exclude non-ASCII characters when editing
and generating the anchor.
* For the URL in HREF: if the pref allows non-ASCII anchors, then convert the string
to UTF-8, otherwise convert to the document charset; apply URL escaping regardless of
the pref value.
Answering Momoi san's question, existing non-ASCII anchors will be preserved
regardless of the pref value (in UTF-8 or the document charset).
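As a rough sketch of the escaping step proposed above (illustrative only, not the
actual editor code; encodeURIComponent produces the UTF-8 form, while the
document-charset form would need a converter such as nsITextToSubURI):
  // "ànchor" from the original report, percent-encoded as UTF-8 for an HREF
  var name = "\u00E0nchor";
  var hrefUTF8 = "#" + encodeURIComponent(name);   // "#%C3%A0nchor"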
Comment 14•24 years ago
Please, no more options/preferences. We should handle all characters accepted in
the charset.
Charley--sounds like you'll need to convert the string to the document charset
and apply URL escaping (I think the latter is described in the HTML4 spec; %-
escaping?).
Comment 15•24 years ago
If the editor has a pref to control standard vs. compatibility mode, then that
could be used instead. I don't feel good about generating a format which is not
recommended by the spec.
Comment 16•24 years ago
Here is an example document which uses non ASCII anchor names (from bug 58819).
http://www.fcenter.ru/wn/hn02082000.htm
This document does not escape either the "<A NAME=" or the "<A HREF=" part, so it
uses raw 8 bit characters in the document charset (windows-1251).
There is another example, also from bug 58819.
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=19353
It has HREF parts which use the document charset, URL escaped:
"ja_anchors.html#%C6%FC%CB%DC%B8%EC"
and also HREF with UTF-8, URL escaped (as mentioned in the HTML spec):
"ja_anchors.html#%E6%BC%A2%E5%AD%97"
All of the above cases are supported in the trunk but not in 6.0/6.01 (and that was
bug 58819).
I cannot find an example which escapes the "<A NAME=" part. I think it's because
that doesn't have to be escaped (you can just use 8 bit in the document charset).
Let me try to create one later and see if that's supported by our browsers.
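For reference, the UTF-8 style HREF quoted above is exactly what percent-encoding
the UTF-8 bytes of those two characters produces (illustrative snippet, not from
the patch); the document-charset forms encode the page's own charset bytes instead,
which plain JavaScript cannot produce directly:
  encodeURIComponent("\u6F22\u5B57");   // "%E6%BC%A2%E5%AD%97"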
Comment 17•24 years ago
I tested a document with escaped anchors (e.g. <A NAME="%E0">).
Because an anchor name is not a URL, it is not supposed to be URL escaped. If it is
escaped, then it is treated as an individual entity (e.g. the names "%E0" and "à"
should be recognized as different entities).
So what I told Charley this morning was incorrect. Anchor names should not be
unescaped. I checked both 4.x and 6.0; they do not unescape anchors.
Assignee
Comment 18•24 years ago
Naoki understands this much better than me.
Assignee: cmanske → nhotta
Comment 20•24 years ago
adding nsbeta1 keyword.
Comment 21•24 years ago
Before allowing non-ASCII input, we should URL escape as Kathy mentioned.
Which file should I look at to escape the URL in HREF?
Comment 24•24 years ago
I think there are two possible places where we can apply URL encoding.
The first place is after the user types (or selects) the link in the dialog.
We can only use UTF-8 there because the target charset is unknown at that point. The
other issue is that we may have to do the URL encoding in multiple UIs where the
user can enter a URL (e.g. Image URL).
The other possibility is to change nsDocumentEncoder. It might be possible to
detect URIs in a document and then apply the URL encoding. Currently, the code does
the named entity conversion first and then converts to the charset. But for a URI, it
has to be converted to the charset first and then URL encoded, and no entity conversion
is needed. Also, the current code seems to be doing the charset conversion by block,
not on a per-node basis. Anyway, it would not be a simple change; I am going to
ask jst whether the change is possible.
Comment 25•24 years ago
Comment 26•24 years ago
Naoki--I notice that in the EscapeURI method you start out Append'ing (rather
than truncating the string to 0 length or otherwise initializing it). Is that
intentional -- do you expect callers to leverage the appending (such as adding
parts to the URI)?
Comment 27•24 years ago
FYI, I had the same question as brade mentioned above.
In nsHTMLContentSerializer::EscapeURI(), I think that if we execute the code in
the "if" statement, that documentCharset will be pointing to a buffer that has
been free'd since the temp created by NS_ConvertUCS2toUTF8() will have gone out
of scope:
+ if (!uri.IsASCII()) {
+ textToSubURI = do_GetService(NS_ITEXTTOSUBURI_CONTRACTID, &rv);
+ NS_ENSURE_SUCCESS(rv, rv);
+ const PRUnichar *u;
+ mCharSet->GetUnicode(&u);
+ documentCharset = NS_ConvertUCS2toUTF8(u).get();
+ }
Also, I think that if we checked if start >= aURI.Length() after we exit the
while loop, we could avoid executing all that "// Escape the remaining part."
code when processing URLs that end with '=', etc.
Not sure if this matters or not, but if this method gets called a lot, it may be
faster to change the code at the end of
nsHTMLContentSerializer::SerializeAttributes():
+ // Need to escape URI.
+ nsAutoString escapedURI;
+ if (NS_SUCCEEDED(EscapeURI(valueStr, escapedURI)))
+ valueStr = escapedURI;
to copy valueStr into a temp, pass the temp as the first param, and have
EscapeURI write directly into valueStr. I say this because escaped URI strings
are sometimes a lot longer than the unescaped version.
Comment 28•24 years ago
Thanks for your comments, I will update.
Regarding "write directly into valueStr": I am not doing this, to avoid creating an
incomplete URI, which might happen on errors (e.g. a charset conversion failure). But
I can restore from the temp in case of an error; let me try that.
Status: NEW → ASSIGNED
Comment 29•24 years ago
Comment 30•24 years ago
Kathy, Kin, please review the new patch, thanks.
Comment 31•24 years ago
r=ftang
Comment 32•24 years ago
The code:
+ const char *documentCharset;
+ mCharSet->GetUnicode(&charset);
+ documentCharset = NS_ConvertUCS2toUTF8(charset).get();
is not good; the temporary NS_ConvertUCS2toUTF8() (NS_ConvertUCS2toUTF8 is
actually a class, not a function) created here only lives long enough for the
assignment to happen, but it's not guaranteed to live any longer than that, so
this is a crash waiting to happen. To solve this, either make documentCharset an
nsXPIDLCString (which would let you move the assignment into the "if
(!uri.IsASCII())" block), or ensure that the lifetime of the temporary
NS_ConvertUCS2toUTF8 is longer than documentCharset.
Shouldn't this code use the *base* uri (if available) as the base to
NS_MakeAbsoluteURI()?
mCharSet really needs to be an nsCOMPtr, and you need null checks around access
to mCharSet.
Other than that the changes look good, fix that and I'll have one more look.
Comment 33•24 years ago
Comment 34•24 years ago
NS_MakeAbsoluteURI() is used under the following condition, the same as in the current code:
if (mFlags & nsIDocumentEncoder::OutputAbsoluteLinks)
Comment 35•24 years ago
sr=jst, I filed a new bug on the base uri stuff which I still think is wrong
(and was wrong before this change too), see bug 80081.
Comment 36•24 years ago
Checked in the patch, now I may enable 8 bit characters in the UI.
But I don't know how. Charley, can that be done by flipping a switch or
something?
Assignee
Comment 37•24 years ago
Sorry, I don't have a clue! Maybe someone in XPFE group knows?
Comment 38•24 years ago
Charley--are you sure that we aren't restricting (with JS) the characters that
can go in that field? That might be what is interfering with what Naoki has
done.
Comment 39•24 years ago
There is a function ValidateData(),
http://lxr.mozilla.org/seamonkey/source/editor/ui/dialogs/content/EdNamedAnchorProps.js#119
that calls ConvertToCDATAString,
http://lxr.mozilla.org/seamonkey/source/editor/ui/dialogs/content/EdDialogCommon.js#331
329 // Replace whitespace with "_" and allow only HTML CDATA
330 // characters: "a"-"z","A"-"Z","0"-"9", "_", ":", "-", and "."
331 function ConvertToCDATAString(string)
332 {
333 return string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:]+/g,'');
334 }
But I don't know where non ASCII characters are filtered out.
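For what it's worth, that replace call alone reproduces the symptom from the
original report (quick illustration):
  "\u00E0nchor".replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:]+/g,'');
  // => "nchor" -- à (U+00E0) is outside the allowed class, so it is stripped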
Comment 40•24 years ago
Let me ask a different question. NS6 does not restrict input for characters
greater than 127. What has changed around the anchor dialog since NS6?
Comment 41•24 years ago
The difference is that cmanske added some code to filter out characters that
were not allowed in CDATA strings (The ConvertToCDATAString() replace call you
mentioned above) as the user types.
In order to make this work for you, I think we need to make it so that we don't
filter while the user types at all (except for any CDATA requirements about what
the first char must be?) since the URI escape code should take care of non CDATA
chars.
The next thing we need to figure out is how our DOM implementation expects the
'name' attribute of the anchor node to be stored. Escaped? Unescaped?
If it expects it to be escaped, then the modification made to the serializer
should have been unnecessary, since the escaping should have been done by the
anchor dialog code before it was stored in the anchor node.
Comment 42•24 years ago
URL escaping cannot be done while the data is in Unicode (unless we use UTF-8); it
has to be done after we know the target charset.
Comment 43•24 years ago
The rest of the change will be inside the editor and I am not familiar with it.
Clearing the milestone and reassigning to the editor group.
The remaining part is to not filter out characters above 127 in the anchor
property dialog. I fixed the HTML serializer to escape those characters, so the UI
does not need to filter them out.
Assignee: nhotta → beppe
Status: ASSIGNED → NEW
Target Milestone: mozilla0.9.1 → ---
Comment 44•24 years ago
over to charley
Assignee: beppe → cmanske
Target Milestone: --- → mozilla0.9.2
Assignee
Comment 45•24 years ago
I think this is easy to do by including the characters in the range 128 to 'end'
in the list of characters to NOT filter. What is the largest number in the range
(i.e. what is 'end')?
Status: NEW → ASSIGNED
Comment 46•24 years ago
I think we can use 65535 (0xFFFF).
Assignee
Comment 47•24 years ago
Assignee
Comment 48•24 years ago
Can you please test the patch to editor's EdDialogCommon.js?
The JS reference says to use "\xnn" for a hex character, so I hope "\xFFFF" works
for values above 255.
Comment 49•24 years ago
Okay, I will apply the patch and post the result after I test.
Comment 50•24 years ago
After the patch, I was able to input Latin-1 characters (128 - 255), but characters
above 255 are still filtered out.
You can test these characters in a US environment by pasting from "Character Map".
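A likely explanation (illustrative, not from the bug): in a regexp, "\xFFFF" is
parsed as \xFF followed by the literal characters "FF", so the character class only
covers \x80-\xFF (plus "F"). Characters above U+00FF therefore stay outside the
allowed set and keep getting stripped, while "\uFFFF" names the code point correctly:
  /[\x80-\xFFFF]/.test("\u00E0");   // true  (à is within \x80-\xFF)
  /[\x80-\xFFFF]/.test("\u6F22");   // false (a kanji is above U+00FF)
  /[\x80-\uFFFF]/.test("\u6F22");   // true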
Assignee
Comment 51•24 years ago
That's too bad! That would be a deficiency in JavaScript's RegExp character
support, imho.
Brendan: How can we detect characters above 255?
Assignee
Comment 52•24 years ago
Assignee
Comment 53•24 years ago
Setting status. Use the last patch unless we hear about another method to match
chars > 255 in a regexp.
Comment 54•24 years ago
I tried the last patch and it did not filter the Japanese characters I input.
But I found the data is corrupted somewhere. I put a breakpoint in the HTML content
serializer, which expects UCS2, but what I got was a zero-extended string like
0x00E3 0x0081 0x0082 for one Japanese character.
0012A9E8 23 00 E3 00 81 00 82 00 E3 #.ã...?.ã
0012A9F1 00 81 00 84 00 E3 00 81 00 ...?.ã...
0012A9FA 86 00 00 00 00 00 58 AD 12 ?.....X.
I think what happens here is something like this. Somewhere in the code, we have
a string in UTF-8 (e.g. 0xE38182); instead of converting it to UCS2, the code uses
AssignWithConversion, and then the result is 0x00E3 0x0081 0x0082, just like we see
in the dump above.
Note this is nothing to do with the patch. But I didn't see that when I tested
my serializer change using existing non-ASCII anchors, so it must be somewhere
in the editor code.
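To make the corruption pattern concrete, a small illustration (not Mozilla code;
the byte values match the dump above): "あ" (U+3042) is E3 81 82 in UTF-8, and
widening each UTF-8 byte to 16 bits -- which is what a byte-to-PRUnichar
AssignWithConversion style copy does -- yields U+00E3 U+0081 U+0082 instead of U+3042:
  var utf8Bytes = [0xE3, 0x81, 0x82], widened = "";
  for (var i = 0; i < utf8Bytes.length; i++)
    widened += String.fromCharCode(utf8Bytes[i]);
  // widened === "\u00E3\u0081\u0082", not "\u3042"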
Comment 55•24 years ago
The problem is in nsHTMLAnchorElement.cpp, not the editor. I filed bug 81090.
Depends on: 81090
Comment 56•24 years ago
Yikes! Cc'ing rogerl. JS uses Unicode, including in its regexp implementation.
If you can show a bug where [^a-zA-Z0-9_\.\-\:]+ does not match code points >=
256, please file a bug on the JS engine. Cc'ing pschwartau for help reducing
the testcase.
The workaround of looping over each character is too expensive to contemplate.
Please, let's fix this right, ASAP.
/be
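A quick sanity check along those lines (illustrative): the negated class does match
code points above 255, so the engine is behaving as specified -- the characters are
stripped simply because they are outside the allowed set, not because of a regexp bug:
  /[^a-zA-Z0-9_\.\-\:]+/.test("\u6F22");   // true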
Comment 57•24 years ago
Ah, I should have read my bugmail or this bug's comments before commenting.
Thanks to nhotta, this sounds like a bug in HTML content code, treating UTF-8 as
ASCII. But it'd be cool to confirm that JS regexp character classes (negated
ones, yet) work with two-byte code points.
cmanske, the klunky fix seems spurious in light of nhotta's finding -- I hope
you don't land it!
/be
Comment 58•24 years ago
My comment on 2001-05-15 17:23,
>Note this is nothing to do with the patch.
Just to clarify: I meant that the patch for EdDialogCommon.js has nothing to do with
the string corruption. The string corruption happens at href creation (that's bug
81090). The patch for EdDialogCommon.js is needed separately in order to avoid
filtering the user's input for non-ASCII anchor names.
Comment 59•24 years ago
nhotta: agreed, but the klunky fix (last patch) is not needed (besides being too
slow and based on a fear of a JS regexp unicode bug that does not exist), once
81090 is fixed -- right?
/be
Comment 60•24 years ago
It's needed; bug 81090 fixes the data loss after the "Link Property"
dialog. The .js patch in this bug is to avoid the data loss in the "Anchor Property"
dialog. There are two separate data losses, so we need two different patches to
fix both problems.
Assignee
Comment 61•24 years ago
Brendan: So how hard would it be to extend RegExp to handle full range of
characters ("\xnnnn" instead of "\xnn")?
Comment 62•24 years ago
nhotta: yes, but we don't need the last patch ("Klunky fix") once 81090 is
fixed, I hope -- because JS regexps correctly handle UTF16 code points above
255. Agreed?
cmanske: regexps already handle full Unicode. Use \uXXXX to spell Unicode code
points in the UTF16-like encoding that JS uses.
/be
Comment 63•24 years ago
Also see bug 72964:
"String method for pattern matching failed for Chinese Simplified (GB2312)"
There is a reduced HTML testcase attached there:
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=28553
The final section of the testcase uses Unicode escape sequences
(Ex: str = '\u00BF\u00CD\u00BB\u00A7' ) and pattern-matching in JavaScript -
Comment 64•24 years ago
Brendan, do you mean the patch on 05/15/01 15:20 should work once bug 81090 is
fixed? Let me try it and I will post the result later.
Comment 65•24 years ago
I tested with the patch for bug 81090 and the patch on 05/15/01 15:20.
Japanese characters are still filtered out. But it worked after I changed the
.js file to use \u0080 and \uFFFF instead of \x80 and \xFFFF.
Comment 66•24 years ago
Comment 67•24 years ago
I checked in a fix for a problem in nsHTMLAnchorElement::GetHref, so I am removing the dependency.
No longer depends on: 81090
Assignee
Comment 68•24 years ago
Totally cool! I didn't know about "\uxxxx"! So r=cmanske on that part.
Brendan: does this look good now? Can you please sr=?
Whiteboard: FIX IN HAND need r=, sr= → FIX IN HAND need sr=
Comment 69•24 years ago
cmanske: yes, patch at
http://bugzilla.mozilla.org/showattachment.cgi?attach_id=34620 looks good to me,
sr=brendan@mozilla.org.
/be
Comment 70•24 years ago
nhotta: I don't understand your latest patch here -- the negated character class
excludes all characters not listed after the [^ and before the closing ], so you
don't want to add \u0080-\uFFFF -- that range is already excluded by cmanske's
patch.
/be
Comment 71•24 years ago
05/15/01 15:20
string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\x80-\xFFFF]+/g,'');
05/16/01 16:31
string.replace(/\s+/g,"_").replace(/[^a-zA-Z0-9_\.\-\:\u0080-\uFFFF]+/g,'');
I just changed \x80 to \u0080 and \xFFFF to \uFFFF.
Assignee
Comment 72•24 years ago
The patch works -- I tested it. The higher characters should be included in the
negated character class. The filter does: "replace any character that is *not*
one of the specific characters or a char above 127 with an empty string", thus
*allowing* the higher characters to be typed.
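A short check of the landed behaviour, assuming the 05/16 regexp quoted above
(illustrative):
  "\u65E5\u672C\u8A9E anchor"
    .replace(/\s+/g,"_")
    .replace(/[^a-zA-Z0-9_\.\-\:\u0080-\uFFFF]+/g,'');
  // => "日本語_anchor" -- the non-ASCII characters survive, whitespace becomes "_"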
Assignee
Comment 73•24 years ago
I checked in the 5/16 patch. So all related issues are resolved now, i.e., we
can mark this fixed?
Whiteboard: FIX IN HAND need sr= → FIX IN HAND
Comment 74•24 years ago
yes, I think so.
Comment 75•24 years ago
fixed
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → FIXED
Comment 76•24 years ago
Sorry, I misread the diff and added confusion. I also didn't know that chars >
127 were legal CDATA chars. New one on me.
/be
Comment 78•24 years ago
Based on the 2001-05-17 build I am reopening this bug: not only are all non-ASCII
chars stripped in the anchor dialog box when highlighted, but I cannot input
them at all... I can enter ASCII chars into the anchor dialog box, but every time I
attempt to enter non-ASCII they are escaped.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 79•24 years ago
2001-05-17 does not have the fix, please verify with 5/18 build or later, thanks.
Status: REOPENED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Comment 80•24 years ago
OK, I thought it was checked into the latest 17th build... today's build
crashes for me... I'll verify on the next one then.
Comment 81•24 years ago
There is a regression caused by this (bug 81238); it double-escapes already
escaped URIs.
Comment 82•24 years ago
It works using the 2001-05-24-04 build; I am able to enter non-ASCII strings for
the anchor (Latin-1 and double-byte).
Status: RESOLVED → VERIFIED