Closed Bug 440926 Opened 16 years ago Closed 15 years ago

Regular expression character sets that contain "\u0130" match "i" character

Status:

RESOLVED FIXED

Milestone:

mozilla1.9.3a1

People

(Reporter: cacyclewp, Assigned: steveharper60)

Attachments

(4 files, 6 obsolete files)

Patch for Bug 440926, 15 years ago, Steve Harper (deleted), patch
Patch for Bug 440926 (without tabs), 15 years ago, Steve Harper (deleted), patch
Patch for Bug 440926 (without tabs), 15 years ago, Steve Harper (deleted), application/octet-stream
Patch for Bug 440926 (without tabs), 15 years ago, Steve Harper (deleted), patch
different strategy, 15 years ago, Luke Wagner [:luke] (deleted), patch
find Unicode characters where upcase(downcase(c)) != c, 15 years ago, Luke Wagner [:luke] (deleted), application/javascript
Patch for Bug 440926 V2, 15 years ago, Steve Harper (deleted), patch
Patch for Bug 440926 V3 (Removed superfluous code in downcase function), 15 years ago, Steve Harper (deleted), patch, dmandelin: review+
Patch for Bug 440926 V4, 15 years ago, Steve Harper (deleted), patch, dmandelin: review+
Patch as committed, 15 years ago, David Mandelin [:dmandelin] (deleted), patch

Cacycle

Reporter

Description

•

16 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0

Regular expression character sets that contain "\u0130" match the "i" character. This does not happen for \u0130 outside a RegExp character set.

Reproducible: Always

Steps to Reproduce:
Paste either of the following expressions into the address bar:
javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u0130])/gi, '#')
javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u007f-\uffff])/gi, '#')

Actual Results:
Always: abcdefgh#jklmnopqrstuvwxyz (i is replaced)

Cacycle

Reporter

Comment 1

•

16 years ago

This bug might be related to Bug 416933

Brian Crowder

Comment 2

•

16 years ago

I'm not sure I agree with you that a case-insensitive match of "Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130)" should _not_ match a lower-case "i". Can you provide some argument?

Cacycle

Reporter

Comment 3

•

16 years ago

You might be right, as this behaviour is seen for standard ASCII characters:

javascript:'aA'.replace(/[a]/gi, '#') results in '##'
javascript:'aA'.replace(/a/gi, '#') results in '##'

But:

javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'
javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in '###' as one would expect!

There is a discrepancy between normal characters and character ranges, as mentioned above.
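
The four cases above can be checked side by side with a small script. This is only a sketch: the helper name `compareForms` is made up for illustration, and it reports what the engine running it does, not what any particular browser did at the time of this report.

```javascript
// Hypothetical helper (not from the bug report): run the same replacement
// with the character written as a literal and inside a character class.
function compareForms(input, escapedChar) {
  const literal = new RegExp(escapedChar, 'gi');
  const charClass = new RegExp('[' + escapedChar + ']', 'gi');
  const literalResult = input.replace(literal, '#');
  const classResult = input.replace(charClass, '#');
  return { literalResult, classResult, agree: literalResult === classResult };
}

// ASCII control case, then the U+0130 case from this report.
console.log(compareForms('aA', 'a'));
console.log(compareForms('iI\u0130', '\\u0130'));
```

If `agree` is false for the second call, the engine shows the discrepancy described in this comment.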

Severity: major → normal

Cacycle

Reporter

Comment 4

•

16 years ago

Also, if you read the Latin Extended-A Unicode block code tables, you see that there are myriad variants of ASCII characters named LATIN CAPITAL/SMALL LETTER X WITH Y. I do not think that any of these characters should be treated as case variants of ASCII characters (if only to avoid a common source of totally counterintuitive and erratic behavior). Also, why should LATIN CAPITAL LETTER I WITH DOT ABOVE be treated differently from LATIN CAPITAL LETTER G WITH DOT ABOVE?
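
One way to look at this question is to ask the engine for the case mappings directly. The sketch below is illustrative only: the function name is an assumption, and the results depend on the Unicode data of the engine running it. It checks whether a character survives a lower-case then upper-case round trip.

```javascript
// Hypothetical helper: does ch come back unchanged after
// lower-casing and then upper-casing it?
function roundTripsThroughLowerCase(ch) {
  return ch.toLowerCase().toUpperCase() === ch;
}

// G WITH DOT ABOVE (U+0120) and I WITH DOT ABOVE (U+0130).
const samples = ['\u0120', '\u0130'];
for (const ch of samples) {
  const hex = ch.charCodeAt(0).toString(16).toUpperCase().padStart(4, '0');
  console.log('U+' + hex, roundTripsThroughLowerCase(ch));
}
```

In Unicode's case data, U+0120 pairs with U+0121, whereas lower-casing U+0130 leads to the ASCII letter "i" (with or without a combining dot above), so only U+0130 fails the round trip. That difference in the data is a plausible source of the special treatment.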

Brian Crowder

Comment 5

•

16 years ago

x0'll appreciate this one. Yeah, we're probably wrong in the case-insensitive situation, for non-ASCII characters. I'll check this out more on Monday, if x0 hasn't replied by then.

Cacycle

Reporter

Comment 6

•

16 years ago

I changed severity to major as this has the potential to block javascripts in an unpredictable way.

Severity: normal → major

Sebastian Helm

Comment 7

•

16 years ago

Ah, the sins of the past! It seems this is still a remnant from the old hack for Turkish, which has given localization all around the world a lot of trouble, not just at Mozilla.

Brian Crowder

Comment 8

•

16 years ago

Brendan: Do you have any thoughts on this? It's a bit of unicode weirdness in regexp. Of the examples in comment #3, which are correct?

Cacycle

Reporter

Comment 9

•

16 years ago

Brian: It is correct that "javascript:'iI\u0130'.replace(/\u0130/gi, '#')" results in "iI#". It is a bug that "javascript:'iI\u0130'.replace(/[\u0130]/gi, '#')" results in "###". Character ranges in case insensitive regular expressions MUST NOT include any non-ASCII "case" variants.
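
The rule stated above can be turned into a check. As a rough sketch (the function name and the choice of range are assumptions for illustration), the following lists the characters in a code unit range whose single-character class matches an ASCII letter under the `i` flag. On an engine that behaves as requested here, U+0130 would not appear in the list.

```javascript
const ASCII_LETTERS = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Hypothetical helper: returns the \uXXXX escapes in [from, to] for which
// /[\uXXXX]/i matches some ASCII letter. Code units below 0x80 are skipped.
function nonAsciiClassLeaks(from, to) {
  const leaks = [];
  for (let code = Math.max(from, 0x80); code <= to; code++) {
    const escape = '\\u' + code.toString(16).padStart(4, '0');
    if (new RegExp('[' + escape + ']', 'i').test(ASCII_LETTERS)) {
      leaks.push(escape);
    }
  }
  return leaks;
}

// Latin Extended-A block, which contains U+0130.
console.log(nonAsciiClassLeaks(0x0100, 0x017f));
```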

Cacycle

Reporter

Comment 10

•

15 years ago

Bug 378738 is related to this. It shows that non-ASCII Unicode characters should not be treated as ASCII character variants. It would be nice if this could be fixed; please note the high number of votes for this bug.

Brian Crowder

Comment 11

•

15 years ago

cc:ing dmandelin since he is closer to JS regexp these days than I am. Glad to help w/ any questions, though!

David Mandelin [:dmandelin]

Comment 12

•

15 years ago

(In reply to comment #3)

ECMA-262 says that case-insensitive regular expression matching is done by converting both the pattern character and the text character to upper case, then comparing. So:

This one is wrong, because upper(U+0130) -> U+0130, while upper('i') -> 'I', which are not equal:

> javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'

(I actually get '#I#' currently, but that's still wrong.)

This one is right:

> javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in
> '###' as one would expect!

So, do we want to go with ECMA-262 here, or does web compat require something different? (For comparison, SFX and V8 both give the ECMA-262-specified 'iI#' for both examples.)
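
The comparison rule described above can be sketched in a few lines. This is an illustration modelled on the Canonicalize step of ECMA-262 (5th edition, section 15.10.2.8), not the engine's actual code, and the function names are made up.

```javascript
// Sketch of ES5-style canonicalization for case-insensitive matching.
// ch is assumed to be a one-code-unit string.
function canonicalize(ch) {
  const upper = ch.toUpperCase();
  // If upper-casing produces more than one code unit, keep ch unchanged.
  if (upper.length !== 1) return ch;
  // A non-ASCII character must never canonicalize to an ASCII one.
  if (ch.charCodeAt(0) >= 128 && upper.charCodeAt(0) < 128) return ch;
  return upper;
}

// Two characters match under the i flag when their canonical forms are equal.
function matchesIgnoreCase(patternChar, textChar) {
  return canonicalize(patternChar) === canonicalize(textChar);
}

console.log(matchesIgnoreCase('\u0130', 'i')); // upper(U+0130) vs 'I'
console.log(matchesIgnoreCase('a', 'A'));
```

Under this rule the same answer applies whether the character appears as a literal or inside a character class, which is why both examples from comment #3 should give 'iI#'.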

Brian Crowder

Comment 13

•

15 years ago

Did you try IE? I'm guessing we should conform, and that they probably do, too. This is likely just a weird bug we have.

Brendan Eich [:brendan]

Comment 14

•

15 years ago

Yeah, try IE and Opera -- maybe we are lone wolf. /be

David Mandelin [:dmandelin]

Comment 15

•

15 years ago

IE8 returns '###' for both tests.

Cacycle

Reporter

Comment 16

•

15 years ago

Google Chrome 2.0.172.37 returns the correct iI# for both tests. It is hard to imagine that any existing script relies in any way on this bug - quite the opposite, it is VERY complicated to create workarounds (and those would not be affected by fixing this bug). Therefore I do not see any reason to imitate IE when it is so obviously wrong and weird that even they might fix it soon...

Cacycle

Reporter

Comment 17

•

15 years ago

Opera 9.64 returns the correct iI# for both tests.

Cacycle

Reporter

Comment 19

•

15 years ago

Any chance of getting this fixed anytime soon? This was reported one and a half years ago and it seems like a real no-brainer.

Brendan Eich [:brendan]

Comment 20

•

15 years ago

Yes, this should be considered as a spec conformance fix. ES5 is a good occasion although this is not strictly an "ES5" bug. I did bring it up at the Redmond TC39 meeting in July and talked about it with the Microsoft rep. No idea when or if they'll fix IE JScript. We should fix soon tho. /be

Blocks: es5

Status: UNCONFIRMED → NEW

Ever confirmed: true

John P Baker

Updated

•

15 years ago

Assignee: nobody → general

Component: General → JavaScript Engine

Product: Firefox → Core

QA Contact: general → general