Closed Bug 62656 Opened 24 years ago Closed 16 years ago

Form manager needs i18n-neutral test for alphanumerics

Categories

(Toolkit :: Form Manager, enhancement, P3)

RESOLVED WONTFIX

People

(Reporter: morse, Unassigned)

Details

(Keywords: intl)

Attachments

(2 files)

Form manager (wallet.cpp) needs to be able to strip non-alphanumerics from certain strings. Currently that is done using routines that perform the test based on the Latin-alphabet encodings of letters and digits. This needs to be generalized so that we can determine alphanumerics in other character sets as well. Otherwise the auto-fill ability will not work well with languages that use non-Latin characters. See bug 62407 for more detail as to why this is necessary.
Status: NEW → ASSIGNED
Whiteboard: [x]
In the absence of established standards, it seems to me that one way to deal with this is something like the following: 1. When on a form page encoded in encoding "X", remember the URL and associated form field names in that encoding "X" (converted to Mozilla's internal storage encoding, e.g. UTF-8). 2. In the wallet database, associate the field names, the data entries and the URL with that page. 3. When the user visits that page, compare the stored data entries for each field name that exists on that page and, if the field names match, fill in the stored data. There is no workable way to establish field names in other languages: there are too many, and they are not uniform. So this seems to be the only workable solution. Doing this should improve the "hit" rate dramatically.
Please note that this is not a bug report but a feature request. Added ftang and bstell to the cc list.
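The three steps proposed above can be sketched as a store keyed by URL and field name. This is only an illustration under stated assumptions: the class `UrlFieldStore` and its methods are hypothetical names, not part of wallet.cpp, and the strings are assumed to have already been converted to UTF-8.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical sketch, not the wallet.cpp data model: remembers the value
// entered for a given (page URL, field name) pair. Both strings are assumed
// to be UTF-8 already.
class UrlFieldStore {
 public:
  // Steps 1 and 2: associate the field name and its data entry with the URL.
  void Remember(const std::string& url, const std::string& field_name,
                const std::string& value) {
    entries_[std::make_pair(url, field_name)] = value;
  }

  // Step 3: on a later visit, look up the stored value for a field that
  // exists on the page. Returns false if nothing was stored for it.
  bool Lookup(const std::string& url, const std::string& field_name,
              std::string* value) const {
    auto it = entries_.find(std::make_pair(url, field_name));
    if (it == entries_.end()) return false;
    *value = it->second;
    return true;
  }

 private:
  std::map<std::pair<std::string, std::string>, std::string> entries_;
};
```

Because the key includes the URL, a field name only needs to match itself byte for byte, so no knowledge of the page's language is required.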
Severity: normal → enhancement
searching the text on a page is called SCREEN SCRAPING
Netscape Nav triage team: this is not a Netscape beta stopper. However, this is a feature that you *probably* will want, but we need your help! Adding Jaime to the cc list.
Keywords: intl, nsbeta1-
Adding msanz to cc: list.
Need info from QA. IQA, please look at this for us to help determine user impact.
changing platform/OS to all since this is a request for enhancement that should apply to all platforms and OSes
OS: Windows NT → All
Hardware: PC → All
Msanz - Can you assign someone to look at this one? We need to know if there is any real user or development impact because of this bug.
Whiteboard: [x]
Jaime, is this something that you think we should do for nsbeta1? It seems to me that without this, form-manager would be very ineffective for non-ASCII alphabets.
Steve - I would think so, but I'm waiting on a response from Msanz or Ji.
Jaime, I don't understand what the assignment for QA is. Can you clarify? As for when to implement this, I certainly agree it has to be in nsbeta1 so that we have some user feedback.
That depends on the spec for implementing this. Steve, do you have a list of to-do's for this bug? I was wondering if the tasks simply include collecting some standard/oft-used field names for each language or if they also include a kind of "scraping" for each form page as I suggested above.
Adding Teruko to cc: list. Teruko - This is the bug Montse and I were talking to you about. I'd like to get your input by the end of this week.
The spec for this is very simple. I want to take a string in any alphabet and strip it down to significant characters (namely alphanumerics) that I can then use for determining what the meaning of the field is. So if I have a string first_name, first-name, first#name, etc., I want them all to look the same (i.e., "firstname") so I can simply test against that. It's easy to do that in Latin alphabets: simply test for being in the range 0..9, A..Z, or a..z. And there are even standard routines that I can use for making such a test. I need to be able to do something similar in other alphabets. In other words, given a string of Japanese characters, how do I strip out the punctuation marks and other special symbols and arrive at some canonical form for the string? You needn't be concerned about the algorithm that I use for then determining the meaning of each field. That algorithm is language neutral, is already implemented, and works very satisfactorily.
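For reference, the Latin-only stripping described above can be sketched as follows. This is a minimal illustration, not the code in wallet.cpp: `IsAsciiAlnum` and `StripNonAlphanumerics` are hypothetical names, and the input is assumed to be a plain byte string.

```cpp
#include <string>

// Hypothetical helper: true only for 0..9, A..Z and a..z, the Latin-only
// test the comment above describes.
inline bool IsAsciiAlnum(char c) {
  return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
         (c >= 'a' && c <= 'z');
}

// Hypothetical helper: drops every character that fails the test, so that
// "first_name", "first-name" and "first#name" all become "firstname".
std::string StripNonAlphanumerics(const std::string& field_name) {
  std::string out;
  out.reserve(field_name.size());
  for (char c : field_name) {
    if (IsAsciiAlnum(c)) out.push_back(c);
  }
  return out;
}
```

Applied to a field name written in Japanese, this version removes every character and returns an empty string, which is the limitation this bug is about.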
Hi Jaime, Teruko won't be able to do this by the end of this week; best case is end of next week.
morse, are you committing to doing all the international versions of the name canonicalizing? Surely you don't plan to just do the American version and throw the problem over the wall to let someone else do the English, Scottish, Irish, Welsh, Dutch, German, French, Finnish, Swedish, Danish, Polish, Austrian, Swiss, Italian, (my fingers are getting tired) versions?
All this bug is asking for is for a method of determining if an arbitrary character is an alphanumeric. For every one of those versions that you just mentioned, the problem is already solved since they all use the latin alphabet. What I need is a test that will work for non-latin alphabets as well.
Target Milestone: --- → mozilla1.0
Frank, can you talk to Steve to help us break down this work further? Or Steve can you call Frank? We need to scope out who should own doing this. Also Teruko/Ylong - can you confirm how well Form Manager works for non-latin-1 languages?
QA Contact: tpreston → ylong
Jaime, I would like to know how much we support Form Manager for i18n. We had a discussion about this before 6.0 RTM, but the decision was not clear.
QA Contact: ylong → teruko
Now that the interview section has been removed, I would like to support the feature better internationally. The only portion I would not sign up for at this time is localizing the demonstration area for nsbeta1. For everything else, we should make our best effort to ensure it works well for the end user.
I tested the form manager with Japanese data and Japanese sites. When I use the sample files under Demonstration, Japanese data is captured the same as ASCII characters. However, the sample files were created for US users, and I cannot type Japanese data in some fields. Some fields are too small for Japanese data. For example, State has a length of 2 single-byte characters; to fill in a Japanese state, we need a length of at least 3 double-byte characters (6 single-byte characters). The lengths of Phone number and Zip are not applicable to Japanese data.
I went to the following Japanese websites to fill out forms:
https://www.rakuten.co.jp/cgi-bin/RRS?MIval=rem_regist
http://pages.ebayjapan.co.jp/services/registration/registration-show.html
http://babel/tests/Browser/bft/euc-jp/bft_test_case_index.html
After I went to the above URLs and typed data in each field, I selected the menu Edit|Save Form Data to capture the data. Then I looked at the saved data. Only a few entries were stored anywhere other than "Other Saved Information->URL-Specific", and the data in "URL-Specific" will keep growing.
I then went to a new form at
https://ecserve103.ecfactory.com/tms-ts/bin/userinfo.cgi?op=new&mode=b&key_store _seqid=400003&locale=ja-JP
and selected the menu Edit|Prefill Form. Only some fields were filled out.
Also, I found that some fields in the Form Manager dialog are too small for Japanese data.
Tested 2001-05-04-06 Win32 build
Teruko, yes, the samples are US-specific. They were taken from specific US sites and were not simply a demonstration of what is available. If you would like to add some Japanese sites to the set of samples, we can surely do that. The fact that your captured data from a Japanese site appeared only in the URL-specific section is understandable. It means that the wallet tables did not contain localized entries to allow form-manager to determine the intended schema. In that case, form-manager makes up its own schema name and makes it specific to that site, so that when you return to that particular site in the future it can be prefilled. The bug for adding localization entries to the wallet tables is bug 62841. The fact that some form fields are too small in the form-manager dialog is yet another bug. The form-manager dialog needs to be localized not just for particular languages, but for particular local conventions as well. For example, is the zipcode 5 digits (US) or 6 characters (Canada)? That is the topic of bug 62730.
Steve, I have a question about concatenation. How should a concatenation field in Form Manager work?
If a field is covered by a concatenation rule, then the individual values are concatenated together, separated by a single space. The concatenation rules are contained in one of the wallet tables. For a description of the wallet tables, see the document attached to bug 62730.
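The joining behaviour described above amounts to the following sketch. `ConcatenateFieldValues` is a hypothetical name, and how wallet.cpp treats empty values is not stated in this bug, so this version simply joins whatever it is given.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins the individual values with a single space
// between each pair. Empty values are not special-cased here; the bug does
// not say how the real code handles them.
std::string ConcatenateFieldValues(const std::vector<std::string>& values) {
  std::string result;
  for (std::size_t i = 0; i < values.size(); ++i) {
    if (i > 0) result += ' ';
    result += values[i];
  }
  return result;
}
```

For example, the values "John", "Q" and "Public" would produce "John Q Public".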
Moving to M0.9.2 for i18n Nav triage eval. Vishy - I need to know what this means in English.
Target Milestone: mozilla1.0 → mozilla0.9.2
Please tell me what you don't understand and I'll try to explain it.
nav intl triage: vishy, jaime, steve and ftang need to sit down and discuss this bug more in a meeting.
Assignee: morse → ftang
Status: ASSIGNED → NEW
Reassigning to ftang. morse, please indicate which file and which function calls perform the alphanumerics test in your code. Thanks
See extensions/wallet/src/wallet.cpp. In particular, search for the three places where IsAsciiAlpha(c) is used. The second one in particular has some commented-out code in which I tried to use an existing i18n api for doing this but it didn't work.
Are there any test cases that show the current code is not good enough?
Status: NEW → ASSIGNED
General test cases are in http://babel/tests/browser/form_manager/mojo-testcases-formmanager.html
For my test results, I went to the following Japanese websites to fill out forms:
https://www.rakuten.co.jp/cgi-bin/RRS?MIval=rem_regist
http://pages.ebayjapan.co.jp/services/registration/registration-show.html
http://babel/tests/Browser/bft/euc-jp/bft_test_case_index.html
After I went to the above URLs and typed data in each field, I selected the menu Edit|Save Form Data to capture the data. Then I looked at the saved data. Only a few entries were stored anywhere other than "Other Saved Information->URL-Specific", and the data in "URL-Specific" will keep growing. We do not have enough information about what the expected behavior is. Jaime, Frank, and I need to get together to talk about what the expected behavior is.
morse: I can answer your question. Every Unicode character has a character class; see http://www.unicode.org/Public/UNIDATA/UnicodeData.txt http://www.unicode.org/Public/UNIDATA/UnicodeData.html#General Category explains what the character classes are. You merely need to write code that checks each character to see if it's in one of the ranges called "Letter". Because I was experimenting with Unicode ROT-13 (don't ask) I actually have a) some Javascript which does this, and b) a Perl script which takes that data file and outputs all the character ranges which are "letters". Would either of these be useful? Gerv
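The range check described above could look roughly like this in C++. It is a sketch under stated assumptions: `CodePointRange`, `kLetterRanges` and `IsLetterInTable` are hypothetical names, and the table is a small hand-picked subset for illustration, not the full list that would be generated from UnicodeData.txt.

```cpp
#include <algorithm>
#include <iterator>

// Hypothetical type: one inclusive range of code points.
struct CodePointRange {
  char32_t first;
  char32_t last;
};

// Illustrative subset only, sorted ascending and non-overlapping. A real
// table would be generated from UnicodeData.txt and be much longer.
constexpr CodePointRange kLetterRanges[] = {
    {0x0041, 0x005A},  // Latin capital letters A..Z
    {0x0061, 0x007A},  // Latin small letters a..z
    {0x0391, 0x03A1},  // Greek capital letters Alpha..Rho
    {0x03B1, 0x03C9},  // Greek small letters alpha..omega
    {0x0410, 0x044F},  // Basic Cyrillic letters
    {0x3041, 0x3096},  // Hiragana letters
    {0x30A1, 0x30FA},  // Katakana letters
    {0x4E00, 0x9FA5},  // CJK unified ideographs (original block range)
};

// Binary search: find the first range that starts after cp, then check
// whether cp falls inside the range just before it.
bool IsLetterInTable(char32_t cp) {
  const CodePointRange* begin = std::begin(kLetterRanges);
  const CodePointRange* it = std::upper_bound(
      begin, std::end(kLetterRanges), cp,
      [](char32_t value, const CodePointRange& r) { return value < r.first; });
  if (it == begin) return false;
  --it;
  return cp <= it->last;
}
```

With a sorted table the lookup cost grows with the logarithm of the number of ranges, so a longer generated table would not make the per-character test much slower.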
This is exactly what I need. Why didn't someone tell me about this sooner. ;-) Please attach the unicode for reference so I can see how the class is obtained from the unicode value. I'll need to do it in c++, but once I see how you do it in javascript it should be easy for me to port it over. Thanks.
ftang, Don't we already have this?
http://lxr.mozilla.org/seamonkey/source/intl/unicharutil/src/ucdata.h#116
/*
 * This is the primary function for testing to see if a character has some set
 * of properties. The macros that test for various character properties all
 * call this function with some set of masks.
 */
extern int ucisprop __((unsigned long code, unsigned long mask1,
                        unsigned long mask2));
#define ucisalpha(cc) ucisprop(cc, UC_LU|UC_LL|UC_LM|UC_LO|UC_LT, 0)
#define ucisdigit(cc) ucisprop(cc, UC_ND, 0)
...
intl/unicharutil/src/ucdata.h is obsoleted. We no longer use it in the product.
> Please attach the unicode for reference so I can see how the class is obtained
> from the unicode value.
Unfortunately the only way to do this is to check whether it's in one of about forty ranges over the entire Unicode range. You also have the minor problem that further characters could be defined later; however, these will be for such irrelevant languages that it won't matter. I'm not going to attach the Unicode definitions file because it's huge - the URL is http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> I'll need to do it in c++, but once I see how you do it
> in javascript it should be easy for me to port it over.
I'll attach my Perl and JS at some point soon. I'm in the wrong OS at the moment.
Gerv
Important points to note about the above two attachments: The Perl script is not guaranteed correct, although it probably is, because I just rewrote it to remove a bug. You should probably get someone who understands Perl to check it for sanity. The Javascript code uses an old version of the Perl script's output, minus some ranges at the beginning and all single character ranges. Don't take the ranges given in it as the right ones! It's trivially simple, though. If I were you I'd do what I did, and special-case character values under 256, for speed reasons. If you need me to explain more, just shout :-) Gerv
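The suggestion to special-case values under 256 can be sketched as a precomputed lookup table with a fallback for everything else. This is not the attached JavaScript: `IsLetterFast`, `BuildLatin1LetterTable` and `SlowPathFn` are hypothetical names, and the slow path is passed in by the caller so the sketch stays self-contained.

```cpp
#include <array>

// Hypothetical signature for the full range-table test used above 255.
using SlowPathFn = bool (*)(char32_t);

// Marks the code points below 256 whose Unicode general category is a
// letter category: A..Z, a..z, U+00AA, U+00B5, U+00BA, and U+00C0..U+00FF
// except the multiplication sign (U+00D7) and division sign (U+00F7).
static std::array<bool, 256> BuildLatin1LetterTable() {
  std::array<bool, 256> table{};
  auto mark = [&table](unsigned lo, unsigned hi) {
    for (unsigned c = lo; c <= hi; ++c) table[c] = true;
  };
  mark(0x41, 0x5A);
  mark(0x61, 0x7A);
  mark(0xAA, 0xAA);
  mark(0xB5, 0xB5);
  mark(0xBA, 0xBA);
  mark(0xC0, 0xD6);
  mark(0xD8, 0xF6);
  mark(0xF8, 0xFF);
  return table;
}

// Fast path: one array index for code points below 256. Anything else is
// handed to the caller-supplied slow path (treated as "not a letter" when
// no slow path is given).
bool IsLetterFast(char32_t cp, SlowPathFn slow_path) {
  static const std::array<bool, 256> kLatin1 = BuildLatin1LetterTable();
  if (cp < 256) return kLatin1[cp];
  return slow_path != nullptr && slow_path(cp);
}
```

The table is built once on first use, so Latin-1 text never touches the larger range list.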
Gervase Markham - I am not sure your script is correct. There are some special cases in UnicodeData.txt for Han characters and surrogate pairs. It is hard for me to tell since you generate decimal instead of hex.
Move this to future.
Target Milestone: mozilla0.9.2 → Future
Frank: are you able to modify the script to generate hex, or do you need me to do that? Are you saying that some codepoints which are classified as "Letters" should not be letters in morse's algorithm? Gerv
What a hack. I have not touched Mozilla code for 2 years. I didn't read these bugs for 2 years. And they are still there. Just close them as won't fix to clean up.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → WONTFIX
Mass re-open of Frank Tang's won't-fix debacle. Spam is his responsibility, not my own.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Mass re-assigning Frank Tang's old bugs that he closed as won't fix and that had to be re-opened. Spam is his fault, not my own.
Assignee: ftang → nobody
Status: REOPENED → NEW
Filter on "Nobody_NScomTLD_20080620"
QA Contact: teruko → form-manager
This doesn't seem relevant to the modern form manager.
Status: NEW → RESOLVED
Closed: 20 years ago → 16 years ago
Resolution: --- → WONTFIX
Product: Core → Toolkit
QA Contact: form-manager → form.manager
Target Milestone: Future → ---
Version: Trunk → unspecified