Closed Bug 157534 Opened 22 years ago Closed 17 years ago

Edit->Find in Page found substring in Thai display cell, but it shouldn't be

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED WONTFIX

People

(Reporter: arthit, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: intl)

Attachments

(1 file)

From Bugzilla Helper: User-Agent: Mozilla/5.0 (X11; U; SunOS sun4u; en-US; rv:1.1a+) Gecko/20020714 BuildID: 202071422 in Thai writing system, there're 4 display-levels of character. from highest to lowest: top, above, base, below. several Thai characters with different display-levels can be composed together. we call the composed unit a "display cell". when do searching, this "display cell" should be treated as an atomic unit, searching any seperated "substring" within "display cell" is not allowed. for example, in "U+0E2A U+0E35 U+0E48" string, U+0E48 can't be seperated from U+0E2A U+0E35, all of them should be treated as only one single atom. so, the search of "U+0E2A U+0E35" or "U+0E35 U+0E48" or any one character alone is expected to be NOT FOUND. Reproducible: Always Steps to Reproduce: 1. open attached test case (findthai.html) in browser, we will see only one string, "Ъеш" (U+0E2A U+0E35 U+0E48) 2. select Edit -> Find in this page 3. in "Find text" field, type "Ъе" (U+0E2A U+0E35) and click "Find" Actual Results: found a result. it founds "Ъе" (U+0E2A U+0E35). Expected Results: found no result. "Ъе" (U+0E2A U+0E35) should not be found. fyi, from the testcase "Ъеш" (U+0E2A U+0E35 U+0E48), means "four" in Thai. "Ъе" (U+0E2A U+0E35), means "color" in Thai. they have different meaning. search for one, should not found another.
Attached file testcase#1: for reproduce the bug (deleted) —
note: attachment is tis-620 encoded html. tested and found it occurs with Mozilla 1.0, 1.1alpha, build 2002071422 NS7 PR1 on Windows, NS7.0_03 on Solaris
Blocks: thai
-->akkana
Assignee: sgehani → akkana
It sounds like solving this will require quite a bit of knowledge of Thai (as well as the ability to type it in order to test). I don't think my mozilla build (on linux) is even showing me the Thai characters -- in your examples I'm seeing one character per unicode number, and all the characters look the same, while from your description it sounds like they should probably each be one character, and each different. I can't see this getting fixed unless we make a push to get search (that would be me) and I18n people together, discuss the issues and procedures for how to test it, and come up with a design for how it should work. If anyone who understands the Thai issue wants to work on this, I'll be glad to help with it, explain how find works or some tricks for debugging it, or meet to discuss solutions. For now, I'm reassigning to I18n in the hope that they'll have a better handle on it. If that's not appropriate, send it back -- I'm willing to own it but it'll probably have to be Future/helpwanted.
Assignee: akkana → yokoyama
Component: Search → Internationalization
QA Contact: claudius → ruixu
Keywords: intl
QA Contact: ruixu → ylong
hi akkanna, Could you please explain how search works and tricks to debug? This is not just a Thai bug but also Arabic. Indian scripts will have the same problem. For all CTL languages searching needs to be by display cell or cluster. A simple code point based search will not work since characters are shaped when they are displayed. I am aware of Thai and Indic search issues but do not know how mozilla handles it. prabhat.
Here's a start. I'll also html-ify this description (probably with the references to this specific bug edited out) and put it up at mozilla.org, probably under editor. ----------------------- Understanding the Find algorithm The algorithm is located in embedding/components/find/src/nsFind.cpp, and the core of it is in nsFind::Find. I'm going to assume that these clusters are entirely located within one text node, so you shouldn't need to worrk about routines like NextNode or SkipNode -- those should just work. Find() searches forward or backward through the tree, looking at nodes (almost always a text node, though there's an RFE to be able to search in other types of nodes) and comparing character by character with our pattern string. In reverse, we compare characters starting from the end of the pattern string, not the beginning (that might be important for this bug). When we start a match (i.e. match one or more characters), we save the point where the match started then continue through the string. If we reach the end of the pattern string and we're still matching, we have succeeded; if we hit a non-match before the end of the pattern string, then we have to go back to the next character in the tree after the place where this match started, and reset our position in the pattern string to the beginning (or end, if backwards). ome important variables within Find(): mIterNode is the current node we're examining. mIterOffset is the current offset into the node: the place at which the current match started, or, if we're not in a partial match, the character we're comparing right now. pindex is the current offset into the pattern string, and findex is the current offset into the current node (so these two point to the characters we're currently comparing). You may have noticed that findex and mIterOffset seem to do the same thing. They basically do, but findex points to a character, whereas mIterOffset is a range offset that points between characters: see the note around line 680. Since a match may (and often does) span multiple nodes in the tree, matchAnchorNode and matchAnchorOffset store the node and offset (range offset) where the current partial match began. You can #define DEBUG_find 1 and get lots of spew on standard output about what the algorithm is doing; I initially put these in because gdb doesn't always work reliably, but it's also useful in cases where you have to search a lot of text before the first match, so stopping/continuing past a breakpoint would take a long time. Fragments of text nodes in the dom can be either single or double byte: see the use of the pointers t1b and t2b. This may be the right place to add support for languages which need more than two bytes -- perhaps we need a tMultiByte option everywhere we now deal with t1b and t2b. In fact, that might be all you need, assuming it's possible to iterate backwards through such a multibyte DOM text node fragment. If that's not possible, then you might have to define iterators which do what we currently do now in the single and double byte case, but do more work to implement forward and backward iteration in the multibyte case. Given this info, the comments in the code may be enough to take you through the algorithm. Most of it is fairly straightforward; some of the tricky logic has to do with matching whitespace (any space in the pattern string matches one or more spaces in the document, so for example a double space in the pattern string must match two or more characters of whitespace -- which includes space, dom newline (\n), and non-breaking space ( ). There's also some tricky logic regarding when to advance findex and pindex. With any luck you won't have to change any of that.
A slightly more general version of that last long comment is now up at: http://www.mozilla.org/editor/find-algorithm.html If you have other questions or anything needs clarifying, please ask!
I think this is an invalid bug >when do searching, this "display cell" should be treated as an atomic unit, Who said that? I don't think WTT2.0 specify that. Unicode 3.0 does not mention that also. I think that is your personal opinion.
Assignee: yokoyama → ftang
> I think this is an invalid bug No. And this is not just a bug for Thai but an i18n bug. >>when do searching, this "display cell" should be treated as an atomic unit, >Who said that? I don't think WTT2.0 specify that. Unicode 3.0 does not mention that also. I think that is your personal opinion. WTT2.0 didn't said that. It's mentioned in Unicode 3.0, in the section on text boundaries. It's said that everything must be done on "grapheme cluster" boundary, i.e. caret movement and string search MUST take cluster boundary into account. A simple example will make it easier to understand:- If you want to search for an 'A', and the text is "B~A^D" (where ~ and ^ are non-spacing diacritics). If you consider that 'A' (which is a part of 'A^') is a match and make selection around it, users can just delete that 'A' (by BS or DEL) and make 'B~^D', a bug. Or if you want to search for an 'A' and replace any occurence with 'A~', the result will be "B~A~^D", another kind of bug. So selection that doesn't fit in cluster boundaries must be forbidden. This is why the Unicode 3.0 standard says so. ---- Back to our bug, the idea is easy to implement :- 1) when a substring is found 2) see if the "end" of match is on cluster boundary 3) if no, consider it a non-match
>It's mentioned in Unicode 3.0, in the section on >text boundaries. It's said that everything must be done on "grapheme cluster" >boundary, i.e. caret movement and string search MUST take cluster boundary >into account. I don't think that is what the Unicode 3.0 say about. Read page 116 session 5.12 Editin and Selection. It mention the following: Three types of boundaries are gnerally useful in editing and selecting within words. Cluster Boundaries Stacked Boundaries Atomic Chacter Boundaries Linear Bundaries Nolinear Boundaries
Status: NEW → ASSIGNED
Right, I simply used the term (new in Unicode 3.2) ambiguously to define the default unit for caret movement. Unicode 3.0 say in page 118, section 5.13, heading "Other Processes" : "When searching string, remember to check for additional nonspacing marks in the target string that may affect the interpretation of the last matching character." Isn't this bug the exact case that the sentence was written to prevent?
Blocks: 283271
what a hack. I have not touch mozilla code for 2 years. I didn't read these bugs for 2 years. And they are still there. Just close them as won't fix to clean up.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → WONTFIX
Mass Bug Re-Open of bugs Frank Tang Closed with no good reason. Spam is his fault not my own
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Mass Re-assinging Frank Tangs old bugs that he closed won't fix and had to be re-open. Spam is his fault not my own
Assignee: ftang → nobody
Status: REOPENED → NEW
BugAThon Bangkok We (7-8 native Thais) discussed about this and all agreed that while searching substring is expected behavior, searching substring starting with non-base Thai character is a non-expected behavior. As a result, we should not allow user to type non-base Thai character at the start of string (e.g. search word, url) at the first place. i.e. we should do character sequence checking in find and address bar If we can strict that input, anomalies in Bug 157534 and Bug 421275 will no longer appear. So both will be closed. New Bug 425900 open "Should not allow non-base Thai character as first character in textfield/textarea".
Status: NEW → RESOLVED
Closed: 20 years ago17 years ago
Resolution: --- → WONTFIX
I think it's not a bug but it's a feature.
I use Firefox 4b6 on Windows 7. This bug has not fix.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: