Closed
Bug 440926
Opened 16 years ago
Closed 15 years ago
Regular expression character sets that contain "\u0130" match "i" character
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
mozilla1.9.3a1
People
(Reporter: cacyclewp, Assigned: steveharper60)
References
Details
Attachments
(4 files, 6 obsolete files)
(deleted),
patch
|
Details | Diff | Splinter Review | |
(deleted),
application/javascript
|
Details | |
(deleted),
patch
|
dmandelin
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
Details | Diff | Splinter Review |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0
Regular expression character set that contain or include "\u0130" match the "i" character.
This does not happen for \u0130 outside a regExp character set.
Reproducible: Always
Steps to Reproduce:
Paste any of the following regExp into the address bar:
javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u0130])/gi, '#')
javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u007f-ffff])/gi, '#')
Actual Results:
Always: abcdefgh#jklmnopqrstuvwxyz (i is replaced)
This bug might be related to Bug 416933
Comment 2•16 years ago
|
||
I'm not sure I agree with you that a case-insensitive match of "Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130)" should _not_ match a lower-case "i". Can you provide some argument?
You might be right as this behaviour is seen for standard ASCII characters:
javascript:'aA'.replace(/[a]/gi, '#') results in '##'
javascript:'aA'.replace(/a/gi, '#') results in '##'
But:
javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'
javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in '###' as one would expect!
There is an discrepancy between normal characters and character ranges as mentioned above.
Severity: major → normal
Also, if you read the Latin Extended-A unicode block code tables you see that there are myriads of variants of ASCII characters named LATIN CAPITAL/SMALL LETTER X WITH Y.
I do not think that any of these characters should be treated as case variants of ASCII characters (if only to avoid a common source of totally counterintuitive and erratic behavior).
Also, why should LATIN CAPITAL LETTER I WITH DOT ABOVE be treated differently from LATIN CAPITAL LETTER G WITH DOT ABOVE?
Comment 5•16 years ago
|
||
x0'll appreciate this one. Yeah, we're probably wrong in the case-insensitive situation, for non-ASCII characters. I'll check this out more on Monday, if x0 hasn't replied by then.
I changed severity to major as this has the potential to block javascripts in an unpredictable way.
Severity: normal → major
Comment 7•16 years ago
|
||
Ah, the sins of the past! It seems this is still a remnant from the old hack for Turkish, which has given localization all around the world a lot of trouble, not just at Mozilla.
Comment 8•16 years ago
|
||
Brendan: Do you have any thoughts on this? It's a bit of unicode weirdness in regexp. Of the examples in comment #3, which are correct?
Brian: It is correct that "javascript:'iI\u0130'.replace(/\u0130/gi, '#')" results in "iI#". It is a bug that "javascript:'iI\u0130'.replace(/[\u0130]/gi, '#')" results in "###".
Character ranges in case insensitive regular expressions MUST NOT include any non-ASCII "case" variants.
Reporter | ||
Comment 10•15 years ago
|
||
Bug 378738 is related to this. It shows that non-ASCII Unicode characters should not be treated as ASCII character variants. It would be nice if this could be fixed, please note the high number of votes for this bug.
Comment 11•15 years ago
|
||
cc:ing dmandelin since he is closer to JS regexp these days than I am. Glad to help w/ any questions, though!
Comment 12•15 years ago
|
||
(In reply to comment #3)
ECMA-262 says that case-insensitive regular expression matching is done by converting both the pattern character and the text character to upper case, then comparing. So:
This one is wrong, because upper(U+0130) -> U+130, while upper('i') -> 'I', which are not equal:
> javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'
(I actually get '#I#' currently, but that's still wrong.)
This one is right:
> javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in
> '###' as one would expect!
So, do we want to go with EMCA-262 here, or does web compat require something different? (For comparison, SFX and V8 both give the EMCA-262-specified 'iI#' for both examples.)
Comment 13•15 years ago
|
||
Did you try IE? I'm guessing we should conform, and that they probably do, too. This is likely just a weird bug we have.
Comment 14•15 years ago
|
||
Yeah, try IE and Opera -- maybe we are lone wolf.
/be
Comment 15•15 years ago
|
||
IE8 returns '###' for both tests.
Reporter | ||
Comment 16•15 years ago
|
||
Google Chrome 2.0.172.37 returns the correct iI# for both tests.
I is hard to imagine that any existing script relies in any way on this bug - quite the opposite, it is VERY complicated to create workarounds (and those would not be affected by fixing this bug).
Therefore I do not see any reason to imitate IE when it is so obviously wrong and weird that even they might fix it soon...
Reporter | ||
Comment 17•15 years ago
|
||
Opera 9.64 returns the correct iI# for both tests.
Reporter | ||
Comment 19•15 years ago
|
||
Any chance to get this fixed anytime soon? This has been reported one and a half years ago and it seems like a real no-brainer.
Comment 20•15 years ago
|
||
Yes, this should be considered as a spec conformance fix. ES5 is a good occasion although this is not strictly an "ES5" bug. I did bring it up at the Redmond TC39 meeting in July and talk about it the Microsoft rep. No idea when or if they'll fix IE JScript. We should fix soon tho.
/be
Updated•15 years ago
|
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Assignee | ||
Comment 21•15 years ago
|
||
I have made some minor changes to the regular expression code in Spidermonkey inorder to ensure regular expressionos in firefox produce the same results as google chrome. It looks like the regular expression parser converts the input string and pattern to lower case before checking for matches. I have attached a patch to this post.
Comment 22•15 years ago
|
||
Mr. Harper: It looks like your patch contains tabs, do you mind uploading a new one without tabs for review? (otherwise, it looks pretty good to me) Does it fix this case in particular? What other tests are you using to compare results against Google Chrome?
Thanks!
Assignee | ||
Comment 23•15 years ago
|
||
Assignee | ||
Comment 24•15 years ago
|
||
I have tested the two regular expressions above:
javascript:'iI\u0130'.replace(/[\u0130]/gi, '#')
javascript:'iI\u0130'.replace(/\u0130/gi, '#')
both return "iI#" (The same result as google chrome)
I check to see if the unicode character has an ascii lowercase character. If it does then i just return the origional uppercase character causing matches against any ascii character to fail. Hope the tabs are ok, I use Visual Studio as my text editor. Im new to the project so im not sure how to run the automated unit tests at the code changes.
Updated•15 years ago
|
Attachment #416270 -
Attachment is patch: true
Attachment #416270 -
Attachment mime type: application/octet-stream → text/plain
Comment 25•15 years ago
|
||
This patch seems to have no spaces at all.... What I meant in comment 22 is that we need patches that contain correct white-spacing (indentation) without tabs.
Thanks!
Assignee | ||
Comment 26•15 years ago
|
||
Assignee | ||
Comment 27•15 years ago
|
||
Attachment #416235 -
Attachment is obsolete: true
Attachment #416270 -
Attachment is obsolete: true
Attachment #416277 -
Attachment is obsolete: true
Comment 28•15 years ago
|
||
Comment on attachment 416278 [details] [diff] [review]
Patch for Bug 440926 (without tabs)
mrbkap: Please redirect if you've got a better reviewer. This looks good to me, at a cursory glance.
Attachment #416278 -
Flags: review?(mrbkap)
Reporter | ||
Comment 29•15 years ago
|
||
(In reply to comment #24)
> I check to see if the unicode character has an ascii lowercase character. If it
> does then i just return the origional uppercase character causing matches
> against any ascii character to fail. Hope the tabs are ok, I use Visual Studio
> as my text editor.
Steve Harper: I have not checked the patched code, but want to make sure that above-ASCII characters are never ever be treated as case variants of ASCII characters (as discussed above).
Comment 30•15 years ago
|
||
(In reply to comment #29)
> (In reply to comment #24)
> > I check to see if the unicode character has an ascii lowercase character. If it
> > does then i just return the origional uppercase character causing matches
> > against any ascii character to fail. Hope the tabs are ok, I use Visual Studio
> > as my text editor.
>
> Steve Harper: I have not checked the patched code, but want to make sure that
> above-ASCII characters are never ever be treated as case variants of ASCII
> characters (as discussed above).
Looking at the patch (without knowing the full context of the code) it appears that that is both the purpose and the effect of the patch.
Assignee | ||
Comment 31•15 years ago
|
||
Cacycle: Nathan Tuggy is correct, the intention of the patch is to ensure that ASCII characters never match unicode character variants in a case insensitive regular expression.
Comment 32•15 years ago
|
||
Also relevant, the program:
print('iI\u0130'.replace(/[\u0130]/gi, '#'));
print('iI\u0130'.replace(/\u0130/gi, '#'));
prints:
#I# iI#
without JIT and:
#I# #I#
with JIT.
Assignee | ||
Comment 33•15 years ago
|
||
Luke: I have tested the trunk version of Firefox with the patch attached to this bug. Both regular expressions you have listed in comment 32 return the correct result: iI#. This is with both JIT enabled and disabled.
Comment 34•15 years ago
|
||
(In reply to comment #33)
Oh sorry, I didn't mean to imply that I got those outputs from running with the patch applied. That is without the patch; I was just pointing out the different behavior of with and without JIT.
Comment 35•15 years ago
|
||
I see that ECMA 262v3 15.10.2.8 describes case insensitive matching in terms of Canonicalize, which is reflected, modulo the IgnoreCase test in step 1, by upcase(). However, the current bug, as I understand it, with \u0130 matching 'i' is not that downcase(\u0130) == 'i' but rather that we don't check that upcase(downcase(\u0130)) == \u0130. The modifications in the previous patch seem to achieve this for the specific case of \u0130 but perhaps there are other cases where c is non-ASCII, downcase(c) is non-ASCII and upcase(downcase(c)) != c. I'm not at all a Unicode expert, so maybe I'm wrong here, but the attached patch does what I'm talking about and correctly produces "iI#" for the above tests.
Comment 36•15 years ago
|
||
There may be a perf cost to any approach that does a more up/downcasing, have you tested?
Comment 37•15 years ago
|
||
(In reply to comment #36)
> There may be a perf cost to any approach that does a more up/downcasing, have
> you tested?
The extra upcast() happens only when compiling the regex, so the overhead is dwarved by the cost of compilation. To be sure I tested on SS and a micro-bench.
Assignee | ||
Comment 38•15 years ago
|
||
I think I can follow what your trying to explain Luke. But wouldn't upcase(downcase(c)) == c if c was none ASCII and downcase(c) was none ASCII? The small patch I submitted just excludes any characters from the ASCII subset matching lower case unicode characters.
javascript:'i\u00e1\u00c1'.replace(/[\u00c1]/gi, '#') is an example that correctly matches 'LATIN CAPITAL LETTER A WITH ACUTE' to the lower case version.
Comment 39•15 years ago
|
||
(In reply to comment #38)
I agree that your patch works for non-ASCII lowercase and uppercase matching like \u00e1 and \u00c1; what I was thinking was that there might be two non-ASCII characters a and b where upcase(a) == a and downcase(a) == b but upcase(b) != a (thus, /[a]/i.test(b) should be false while the earlier patch would, I believe, return true).
I made the attached test using charCodeAt and toUpperCase/toLowerCase to find if there are any characters with this property and found U+10A0 through U+10C5.
Comment 40•15 years ago
|
||
Comment on attachment 416278 [details] [diff] [review]
Patch for Bug 440926 (without tabs)
I'm not really the best reviewer here. I think dmandelin would be better, but it seems like there's still some debate over the best way to write the patch, so I'll just remove this request for now.
Attachment #416278 -
Flags: review?(mrbkap)
Assignee | ||
Comment 41•15 years ago
|
||
Attachment #416278 -
Attachment is obsolete: true
Updated•15 years ago
|
Attachment #417193 -
Flags: review?(dmandelin)
Assignee | ||
Comment 42•15 years ago
|
||
The character set ranging from U+10A0 to U+10C5 corresponds to the Georgian characters(http://unicode.org/charts/PDF/U10A0.pdf). It appears that Capital letters in this character set can have lower case versions, but the lower case version cannot map to the upper case version. I have submitted a new patch to account for the fact that upcase(downcase(c)) != c when using characters in the Georgian character set.
The regular expressions:
javascript:'iI\u10a0'.replace(/[\u10d0]/gi, '#')
javascript:'iI\u10a0'.replace(/\u10d0/gi, '#')
javascript:'iI\u10d0'.replace(/[\u10a0]/gi, '#')
javascript:'iI\u10d0'.replace(/\u10a0/gi, '#')
Should now produce the same results as Google chrome.
Comment 43•15 years ago
|
||
(In reply to comment #42)
- return JS_TOLOWER(ch);
+ cu = JS_TOLOWER(ch);
+ return JS_TOLOWER(cu < 128) ? ch : cu;
Ignoring the typo with |JS_TOLOWER(cu < 128)|, I believe the change to downcase is unnecessary; it's subsumed by the upcase(downcase(c)) != c check.
Comment 44•15 years ago
|
||
http://www.iam.uni-bonn.de/~alt/html/unicode_34.html
"Khutsuri
This is the uppercase of the old ecclesiastical alphabet. The style shown in the code charts is known as Asomtavruli."
See also http://www.unicode.org/reports/tr21/tr21-4.3.html.
/be
Assignee | ||
Comment 45•15 years ago
|
||
Sorry Luke you are right. I must have forgot to revert back to the origional file before making the new changes. I will remove this and check i get the same results.
Assignee | ||
Comment 46•15 years ago
|
||
Brendan: Those two links look interesting. Thanks, I will take a look.
Assignee | ||
Comment 47•15 years ago
|
||
Attachment #417193 -
Attachment is obsolete: true
Attachment #417193 -
Flags: review?(dmandelin)
Assignee | ||
Updated•15 years ago
|
Attachment #417250 -
Flags: review?(mrbkap)
Comment 48•15 years ago
|
||
Comment on attachment 417250 [details] [diff] [review]
Patch for Bug 440926 V3 (Removed superfluous code in downcase function)
Moving review to dmandelin as mrbkap previously requested.
Attachment #417250 -
Flags: review?(mrbkap) → review?(dmandelin)
Comment 49•15 years ago
|
||
Comment on attachment 417250 [details] [diff] [review]
Patch for Bug 440926 V3 (Removed superfluous code in downcase function)
The logic looks good. I do have a couple of alternate suggestions for how to express this piece, which is part of a somewhat tricky bit of code (at least to me, and I think I wrote it ;-) ). See if you think either is clearer. r+ with the form you think is most readable (including original).
> if (cs->flags & JSREG_FOLD) {
> ch = JS_TOUPPER(ch);
> jschar lch = JS_TOLOWER(ch);
>+
>+ if (upcase(lch) != ch)
>+ lch = ch;
>
> if (ch != lch) {
1. The bottom two statements could be done shorter (also matching the way it's coded later in the patch):
if (ch != lch && upcase(lch) == ch) {
2. But I think it's still far from obvious what we're doing, which is computing the inverse image (in the mathematical sense) of |ch| for JS_TOUPPER, { x | JS_TOUPPER(x) == ch }.
|lch| is supposed to be a canonical element of the inverse image, namely the one such that |JS_TOLOWER(ch) == lch|, if that element is in the inverse image, or else |ch|. We could try to make this more obvious (heh) like this:
// Return the 'canonical' inverse upcase of |ch|. That is the character
// |lch| such that |upcase(lch) == ch| and (|lch| is the lower-case form
// of |ch| or is |ch|).
static inline jschar inverse_upcase(jschar ch)
{
jschar lch = JS_TOLOWER(ch);
return upcase(lch) == ch ? lch : ch;
}
Then we can say:
if (cs->flags & JSREG_FOLD) {
ch = JS_TOUPPER(ch);
jschar lch = inverse_upcase(ch);
if (ch != lch) {
and the same way in the character class computation.
Attachment #417250 -
Flags: review?(dmandelin) → review+
Assignee | ||
Comment 50•15 years ago
|
||
Hi David, thanks for taking time to review this patch. Im not sure what you meant by "r+ with the form you think is most readable (including original)".
I think that Using the inverse_upcase function with the comment above would make the patch intentions more ovbious. Should I submit a new patch, or would you like to work on this?
Comment 51•15 years ago
|
||
Steve, just FYI, r+ is short for review+ and means something like "passes my review"; often, it's coupled with conditions that usually involve submitting a new patch, and in this case, David would probably like you to submit a new patch with the inverse_upcase changes, since you both agree that that's the right move.
BTW, thanks for working on this. :)
Comment 52•15 years ago
|
||
(In reply to comment #51)
> Steve, just FYI, r+ is short for review+ and means something like "passes my
> review"; often, it's coupled with conditions that usually involve submitting a
> new patch, and in this case, David would probably like you to submit a new
> patch with the inverse_upcase changes, since you both agree that that's the
> right move.
Yes. I was also implying that a second review was not needed; simply consider my suggestions, choose what you think is best, and consider my r+ to apply to your final version.
Assignee | ||
Comment 53•15 years ago
|
||
Attachment #417250 -
Attachment is obsolete: true
Assignee | ||
Comment 54•15 years ago
|
||
Can we get this patch into the current trunk build?
Cheers,
Steve
Comment 55•15 years ago
|
||
Attachment #418379 -
Flags: review+
Comment 56•15 years ago
|
||
I added a test case.
http://hg.mozilla.org/mozilla-central/rev/5c102146cd95
Updated•15 years ago
|
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Updated•15 years ago
|
Assignee: general → steveharper60
Flags: in-testsuite+
Target Milestone: --- → mozilla1.9.3a1
Version: unspecified → Trunk
You need to log in
before you can comment on or make changes to this bug.
Description
•