Closed
Bug 62656
Opened 24 years ago
Closed 16 years ago
Form manager needs i18n-neutral test for alphanumerics
Categories
(Toolkit :: Form Manager, enhancement, P3)
RESOLVED
WONTFIX
People
(Reporter: morse, Unassigned)
Details
(Keywords: intl)
Attachments
(2 files)
Form manager (wallet.cpp) needs to be able to strip non-alphanumerics
from certain strings. Currently that is done using routines that perform the
test based on the Latin alphabet encodings of letters and digits. This needs to
be generalized so that we can determine alphanumerics in other character sets as
well. Otherwise the auto-fill ability will not work well with
non-Latin-character languages.
See bug 62407 for more detail as to why this is necessary.
Reporter
Updated•24 years ago
Status: NEW → ASSIGNED
Whiteboard: [x]
Comment 1•24 years ago
In the absence of established standards, it seems to me
that one way to deal with this is something like the following:
1. When on a form page encoded in encoding "X",
remember the URL and associated form field names
in that encoding "X" (converted to Mozilla internal
store encoding, e.g. UTF-8).
2. In the wallet database, associate the field names
and the data entries with the URL of that page.
3. When the user visits that page, compare the
stored data entries for each field name that exists
on that page and if the field names match, fill in
the stored data.
There is no workable way to establish field names in
other languages, i.e. there are too many, and they are not
uniform. So this seems to be the only workable
solution. Doing this should improve the "hit" rate
dramatically.
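The three steps above could be sketched roughly as follows. This is only an illustration: `SiteFieldStore`, `save` and `prefill` are hypothetical names that do not come from wallet.cpp, and the sketch assumes URLs and field names have already been converted to UTF-8.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical store; none of these names come from wallet.cpp.
// Keys are (page URL, field name), both assumed to be UTF-8 already.
class SiteFieldStore {
public:
    // Step 2: associate the field name and the entered data with the page URL.
    void save(const std::string& url, const std::string& field,
              const std::string& value) {
        entries_[std::make_pair(url, field)] = value;
    }

    // Step 3: on a later visit, return the stored value for every field name
    // on the page that matches exactly; fields with no match are skipped.
    std::map<std::string, std::string> prefill(
        const std::string& url,
        const std::vector<std::string>& fieldsOnPage) const {
        std::map<std::string, std::string> filled;
        for (const std::string& field : fieldsOnPage) {
            auto it = entries_.find(std::make_pair(url, field));
            if (it != entries_.end()) filled[field] = it->second;
        }
        return filled;
    }

private:
    std::map<std::pair<std::string, std::string>, std::string> entries_;
};
```

Because the key includes the URL, data saved on one site is never offered on another, which matches the "URL-specific" behaviour but does nothing for forms the user has not visited before.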
Comment 2•24 years ago
Please note that this is not a bug report but a feature request.
Added ftang and bstell to the cc list.
Severity: normal → enhancement
Comment 3•24 years ago
searching the text on a page is called SCREEN SCRAPING
Comment 4•24 years ago
Netscape Nav triage team: this is not a Netscape beta stopper. However, this is
a feature that you *probably* will want, but we need your help! Adding Jaime to
the cc list.
Comment 5•24 years ago
Adding msanz to cc: list.
Comment 6•24 years ago
Need info from QA. IQA, please look at this for us, to help determine user
impact.
Comment 7•24 years ago
changing platform/OS to all since this is a request for enhancement that
should apply to all platforms and OSes
OS: Windows NT → All
Hardware: PC → All
Comment 8•24 years ago
Msanz - Can you assign someone to look at this one? We need to know if there is
any real user or development impact because of this bug.
Reporter
Updated•24 years ago
Whiteboard: [x]
Reporter
Comment 9•24 years ago
Jaime, is this something that you think we should do for nsbeta1? It seems to
me that without this, form-manager would be very ineffective for non-ASCII
alphabets.
Comment 10•24 years ago
Steve - I would think so, but I'm waiting on a response from Msanz or Ji.
Comment 11•24 years ago
Jaime, I don't understand what the assignment for QA is. Can you clarify? As for
when to implement this, I certainly agree it has to be in nsbeta1 so that we
have some user feedback.
Comment 12•24 years ago
That depends on the spec for implementing this.
Steve, do you have a list of to-do's for this bug?
I was wondering if the tasks simply include collecting
some standard/oft-used field names for each language or if
they also include a kind of "scraping" for each form page
as I suggested above.
Comment 13•24 years ago
Adding Teruko to cc: list.
Teruko - This is the bug Montse and I were talking to you about. I'd like to get
your input by the end of this week.
Reporter
Comment 14•24 years ago
The spec for this is very simple. I want to take a string in any alphabet and
strip it down to significant characters (namely alphanumerics) that I can then
use for determining what the meaning of the field is. So if I have a string
first_name, first-name, first#name, etc, I want them all to look the same (i.e.,
"firstname") so I can simply test against that.
It's easy to do that in Latin alphabets. Simply test for being in the range
0..9, A..Z, or a..z. And there are even standard routines that I can use for
making such a test.
I need to be able to do something similar in other alphabets. In other words,
given a string of Japanese characters, how do I strip out the punctuation marks
and other special symbols and arrive at some canonical form for the string?
You needn't be concerned about the algorithm that I use for then determining the
meaning of each field. That algorithm is language neutral, is already
implemented, and works very satisfactorily.
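As a rough illustration of the spec above, the stripping step can be written so that the alphanumeric test is a parameter. The names `isAsciiAlnum` and `stripToSignificant` are hypothetical and are not taken from wallet.cpp; the sketch assumes the field name is available as UTF-16.

```cpp
#include <string>

// ASCII-only test: the 0..9, A..Z, a..z range check described above.
inline bool isAsciiAlnum(char16_t c) {
    return (c >= u'0' && c <= u'9') || (c >= u'A' && c <= u'Z') ||
           (c >= u'a' && c <= u'z');
}

// Hypothetical helper: keep only the characters accepted by isSignificant.
template <typename Pred>
std::u16string stripToSignificant(const std::u16string& in, Pred isSignificant) {
    std::u16string out;
    for (char16_t c : in) {
        if (isSignificant(c)) out.push_back(c);
    }
    return out;
}
```

With the ASCII predicate, first_name, first-name and first#name all reduce to "firstname", but a Japanese field name such as 名前 reduces to an empty string. Swapping in an i18n-aware predicate is the piece this bug asks for.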
Comment 15•24 years ago
Hi Jaime, Teruko won't be able to do this by the end of this week; best case is
end of next week.
Comment 16•24 years ago
morse,
Are you committing to doing all the international versions of the
name canonicalizing?
Surely you don't plan to just do the American version and throw
the problem over the wall to let someone else do the English,
Scottish, Irish, Welsh, Dutch, German, French, Finnish, Swedish,
Danish, Polish, Austrian, Swiss, Italian, (my fingers are getting
tired) versions?
Reporter
Comment 17•24 years ago
All this bug is asking for is for a method of determining if an arbitrary
character is an alphanumeric. For every one of those versions that you just
mentioned, the problem is already solved since they all use the Latin alphabet.
What I need is a test that will work for non-Latin alphabets as well.
Reporter
Updated•24 years ago
Target Milestone: --- → mozilla1.0
Comment 18•24 years ago
Frank, can you talk to Steve to help us break down this work further? Or Steve
can you call Frank? We need to scope out who should own doing this.
Also Teruko/Ylong - can you confirm how well Form Manager works for non-latin-1
languages?
QA Contact: tpreston → ylong
Comment 19•24 years ago
Jaime, I would like to know how much we support Form Manager for i18n. We had a
discussion about this before 6.0 RTM. The decision was not clear.
QA Contact: ylong → teruko
Comment 20•24 years ago
Now that the interview section has been removed, I would like to support the
feature better internationally. The only portion I would not sign up for at
this time is localizing the demonstration area for nsbeta1. For everything else, we
should make our best efforts to ensure it works best for the end-user.
Comment 21•24 years ago
I tested the form manager with Japanese data and Japanese sites.
When I use the sample files under Demonstration, Japanese data is captured the same
way as ASCII characters.
However, the sample files were created for US users. I cannot type Japanese data
in some fields. The format of some fields is too small for Japanese data.
For example,
State has a length of 2 single-byte characters. To fill in a Japanese state, we need at
least 3 double-byte characters (6 single-byte characters).
The lengths of Phone number and Zip are not applicable to Japanese data.
I tried going to Japanese websites to fill out the following forms.
https://www.rakuten.co.jp/cgi-bin/RRS?MIval=rem_regist
http://pages.ebayjapan.co.jp/services/registration/registration-show.html
http://babel/tests/Browser/bft/euc-jp/bft_test_case_index.html
After I went to each of the above URLs and typed data in each field, I selected the menu
Edit|Save Form Data to capture the data.
Then I looked at the saved data. Very little data was stored anywhere other than "Other
Saved Information->URL-Specific". The data in "URL-Specific" will keep growing.
I went to a new form as follows,
https://ecserve103.ecfactory.com/tms-ts/bin/userinfo.cgi?op=new&mode=b&key_store_seqid=400003&locale=ja-JP
then I selected the menu Edit|Prefill Form. Only some fields were filled out.
Also, I found that some fields in the Form Manager dialog are too
small for Japanese data.
Tested 2001-05-04-06 Win32 build
Reporter
Comment 22•24 years ago
Teruko,
Yes, the samples are US-specific. They were taken from specific US sites and
were not simply a demonstration of what is available. If you would like to add
some Japanese sites to the set of samples, we can surely do that.
The fact that your captured data from a Japanese site appeared only in the
URL-specific section is understandable. It means that the wallet tables did not
contain localized entries to allow form-manager to determine the intended
schema. In that case, form-manager makes up its own schema name and makes it
specific to that site so that when you return to that particular site in the
future it can be prefilled. The bug for adding localization entries to the
wallet tables is bug 62841.
The fact that some form fields are too small in the form-manager dialog is yet
another bug. The form-manager dialog needs to be localized not just for
particular languages, but for particular local conventions as well. For
example, is the zipcode 5 digits (US) or 6 characters (Canada)? That is the topic
of bug 62730.
Comment 23•24 years ago
Steve, I have a question about concatenation.
How should a concatenation field in Form Manager work?
Reporter
Comment 24•24 years ago
If a field is covered by a concatenation rule, then the individual values are
concatenated together, separated by a single space. The concatenation rules are
contained in one of the wallet tables.
For a description of the wallet tables, see the document attached to bug 62730.
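A minimal sketch of the joining step described above, assuming the individual values have already been looked up. `concatenateValues` is a hypothetical name, and skipping empty values is an assumption of this sketch; the thread does not say how missing values are handled.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: join the individual values with a single space.
// Empty values are skipped here (an assumption, not stated in the thread).
inline std::string concatenateValues(const std::vector<std::string>& values) {
    std::string out;
    for (const std::string& v : values) {
        if (v.empty()) continue;
        if (!out.empty()) out += ' ';
        out += v;
    }
    return out;
}
```

For example, the values "John", "Q" and "Public" would be joined as "John Q Public".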
Comment 25•23 years ago
Moving to M0.9.2 for i18n Nav triage eval.
Vishy - I need to know what this means in English.
Target Milestone: mozilla1.0 → mozilla0.9.2
Reporter
Comment 26•23 years ago
Please tell me what you don't understand and I'll try to explain it.
Comment 27•23 years ago
nav intl triage: vishy, jaime, steve and ftang need to sit down and discuss this
bug more in a meeting.
Updated•23 years ago
Assignee: morse → ftang
Status: ASSIGNED → NEW
Comment 28•23 years ago
Reassigning to ftang. morse, please indicate which file and which function call the
alphanumerics test in your code. Thanks.
Reporter
Comment 29•23 years ago
See extensions/wallet/src/wallet.cpp. In particular, search for the three
places where IsAsciiAlpha(c) is used. The second one in particular has some
commented-out code in which I tried to use an existing i18n api for doing this
but it didn't work.
Comment 30•23 years ago
Are there any test cases that show the current code is not good enough?
Status: NEW → ASSIGNED
Comment 31•23 years ago
General test cases are in
http://babel/tests/browser/form_manager/mojo-testcases-formmanager.html
In my test results,
I tried going to Japanese websites to fill out the following forms.
https://www.rakuten.co.jp/cgi-bin/RRS?MIval=rem_regist
http://pages.ebayjapan.co.jp/services/registration/registration-show.html
http://babel/tests/Browser/bft/euc-jp/bft_test_case_index.html
After I went to each of the above URLs and typed data in each field, I selected the menu
Edit|Save Form Data to capture the data.
Then I looked at the saved data. Very little data was stored anywhere other than "Other
Saved Information->URL-Specific". The data in "URL-Specific" will keep growing.
We do not have enough information about what the expected behavior is.
Jaime, Frank, and I need to get together to talk about what the expected
behavior is.
Comment 32•23 years ago
morse: I can answer your question. Every Unicode character has a character
class; see http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
http://www.unicode.org/Public/UNIDATA/UnicodeData.html#General Category
explains what the character classes are. You merely need to write code that
checks each character to see if it's in one of the ranges called "Letter".
Because I was experimenting with Unicode ROT-13 (don't ask) I actually have a)
some Javascript which does this, and b) a Perl script which takes that data file
and outputs all the character ranges which are "letters".
Would either of these be useful?
Gerv
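As an illustration of the approach described above: each line of UnicodeData.txt is a semicolon-separated record whose first field is the code point in hex and whose third field is the general category. The sketch below reads those two fields from one line. `UnicodeRecord`, `parseUnicodeDataLine` and `isLetterOrDigitCategory` are hypothetical names, and treating only "Nd" as a digit is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>

// One parsed record; the field layout is "code;name;category;...".
struct UnicodeRecord {
    std::uint32_t codePoint = 0;
    std::string category;  // e.g. "Lu", "Lo", "Nd", "Pc"
};

// Hypothetical parser for a single UnicodeData.txt line. Returns false if
// the line lacks three ';' separators or the first field is not hex.
inline bool parseUnicodeDataLine(const std::string& line, UnicodeRecord& out) {
    const std::size_t first = line.find(';');
    if (first == std::string::npos) return false;
    const std::size_t second = line.find(';', first + 1);
    if (second == std::string::npos) return false;
    const std::size_t third = line.find(';', second + 1);
    if (third == std::string::npos) return false;

    const std::string hex = line.substr(0, first);
    char* end = nullptr;
    const unsigned long value = std::strtoul(hex.c_str(), &end, 16);
    if (hex.empty() || *end != '\0') return false;

    out.codePoint = static_cast<std::uint32_t>(value);
    out.category = line.substr(second + 1, third - second - 1);
    return true;
}

// Letter categories all start with 'L' (Lu, Ll, Lt, Lm, Lo).
// Only "Nd" (decimal digit) is counted as a digit here, by assumption.
inline bool isLetterOrDigitCategory(const std::string& category) {
    return (!category.empty() && category[0] == 'L') || category == "Nd";
}
```

Running this over the whole file and merging consecutive accepted code points would yield the list of "letter" ranges mentioned above.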
Reporter
Comment 33•23 years ago
This is exactly what I need. Why didn't someone tell me about this sooner? ;-)
Please attach the unicode for reference so I can see how the class is obtained
from the unicode value. I'll need to do it in c++, but once I see how you do it
in javascript it should be easy for me to port it over.
Thanks.
Comment 34•23 years ago
ftang, Don't we already have this?
http://lxr.mozilla.org/seamonkey/source/intl/unicharutil/src/ucdata.h#116
116 /*
117 * This is the primary function for testing to see if a character has some set
118 * of properties. The macros that test for various character properties all
119 * call this function with some set of masks.
120 */
121 extern int ucisprop __((unsigned long code, unsigned long mask1,
122 unsigned long mask2));
123
124 #define ucisalpha(cc) ucisprop(cc, UC_LU|UC_LL|UC_LM|UC_LO|UC_LT, 0)
125 #define ucisdigit(cc) ucisprop(cc, UC_ND, 0)
...
Comment 35•23 years ago
intl/unicharutil/src/ucdata.h is obsolete. We no longer use it in the product.
Comment 36•23 years ago
> Please attach the unicode for reference so I can see how the class is obtained
> from the unicode value.
Unfortunately the only way to do this is to check whether it's in one of about
forty ranges over the entire Unicode range. You also have the minor problem that
further characters could be defined later; however, these will be for such
irrelevant languages that it won't matter.
I'm not going to attach the Unicode definitions file because it's huge - the URL
is http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
> I'll need to do it in c++, but once I see how you do it
> in javascript it should be easy for me to port it over.
I'll attach my Perl and JS at some point soon. I'm in the wrong OS at the
moment.
Gerv
Comment 37•23 years ago
Comment 38•23 years ago
Comment 39•23 years ago
Important points to note about the above two attachments:
The Perl script is not guaranteed correct, although it probably is, because I
just rewrote it to remove a bug. You should probably get someone who understands
Perl to check it for sanity.
The Javascript code uses an old version of the Perl script's output, minus some
ranges at the beginning and all single character ranges. Don't take the ranges
given in it as the right ones! It's trivially simple, though. If I were you I'd
do what I did, and special-case character values under 256, for speed reasons.
If you need me to explain more, just shout :-)
Gerv
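For reference, a C++ port of the range check with a fast path for values under 256 might look something like the sketch below. It is not the attached script. The range table is deliberately incomplete and only illustrative; a real table would be generated from UnicodeData.txt. All names (`CodeRange`, `kAlnumRanges`, `isAlnumBelow256`, `isAlphanumeric`) are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>

struct CodeRange {
    std::uint32_t first;
    std::uint32_t last;  // inclusive
};

// Illustrative, deliberately incomplete table, sorted by `first`.
static const CodeRange kAlnumRanges[] = {
    {0x0410, 0x044F},  // basic Cyrillic letters
    {0x0660, 0x0669},  // Arabic-Indic digits
    {0x3041, 0x3096},  // Hiragana
    {0x30A1, 0x30FA},  // Katakana
    {0x4E00, 0x9FA5},  // CJK Unified Ideographs (as assigned in Unicode 3.0)
    {0xAC00, 0xD7A3},  // Hangul syllables
    {0xFF10, 0xFF19},  // fullwidth digits
};

// Fast path for code points below 256 (ASCII plus Latin-1).
inline bool isAlnumBelow256(std::uint32_t c) {
    if (c < 0x80) {
        return (c >= 0x30 && c <= 0x39) || (c >= 0x41 && c <= 0x5A) ||
               (c >= 0x61 && c <= 0x7A);
    }
    // Latin-1 letters; U+00D7 and U+00F7 are math signs, not letters.
    return c == 0xAA || c == 0xB5 || c == 0xBA ||
           (c >= 0xC0 && c <= 0xFF && c != 0xD7 && c != 0xF7);
}

inline bool isAlphanumeric(std::uint32_t c) {
    if (c < 256) return isAlnumBelow256(c);
    // Binary search: find the first range starting after c, then step back.
    const CodeRange* begin = std::begin(kAlnumRanges);
    const CodeRange* end = std::end(kAlnumRanges);
    const CodeRange* it = std::upper_bound(
        begin, end, c,
        [](std::uint32_t value, const CodeRange& r) { return value < r.first; });
    if (it == begin) return false;
    --it;
    return c <= it->last;
}
```

The binary search keeps the lookup cheap even if the generated table grows to a few hundred ranges.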
Comment 40•23 years ago
Gervase Markham - I am not sure your script is correct. There are some special
cases in UnicodeData.txt for Han characters and surrogate pairs.
It is hard for me to tell since you generate decimal instead of hex.
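To make the two special cases concrete: UnicodeData.txt lists large blocks such as the CJK ideographs as a pair of lines named "<..., First>" and "<..., Last>" rather than one line per character, and characters above U+FFFF appear in UTF-16 text as surrogate pairs that must be combined before any range lookup. The sketch below shows one way to handle both. The function names are hypothetical, and passing a lone surrogate through unchanged is an arbitrary choice for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

inline bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
inline bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Decode UTF-16 into code points. A lone surrogate is passed through
// unchanged (an arbitrary choice for this sketch).
inline std::vector<std::uint32_t> decodeUtf16(const std::u16string& text) {
    std::vector<std::uint32_t> out;
    for (std::size_t i = 0; i < text.size(); ++i) {
        const char16_t c = text[i];
        if (isHighSurrogate(c) && i + 1 < text.size() &&
            isLowSurrogate(text[i + 1])) {
            const std::uint32_t hi = static_cast<std::uint32_t>(c) - 0xD800;
            const std::uint32_t lo =
                static_cast<std::uint32_t>(text[i + 1]) - 0xDC00;
            out.push_back(0x10000 + (hi << 10) + lo);
            ++i;  // the low surrogate has been consumed
        } else {
            out.push_back(c);
        }
    }
    return out;
}

// True for names like "<CJK Ideograph, First>", which open a First/Last
// range in UnicodeData.txt; the matching "Last" line closes it.
inline bool isRangeFirst(const std::string& name) {
    const std::string suffix = ", First>";
    return name.size() >= suffix.size() &&
           name.compare(name.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```

A generator script that treats the First and Last lines as two isolated characters would miss every code point in between, which is one way such a script could go wrong.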
Comment 42•23 years ago
Frank: are you able to modify the script to generate hex, or do you need me to
do that?
Are you saying that some codepoints which are classified as "Letters" should not
be letters in morse's algorithm?
Gerv
Comment 43•20 years ago
What a hack. I have not touched Mozilla code for 2 years. I didn't read these bugs
for 2 years. And they are still there. Just close them as won't fix to clean up.
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → WONTFIX
Comment 44•20 years ago
Mass re-open of Frank Tang's won't-fix debacle. Spam is his responsibility, not my own.
Status: RESOLVED → REOPENED
Resolution: WONTFIX → ---
Comment 45•20 years ago
Mass re-assigning Frank Tang's old bugs that he closed as won't fix and that had to be
re-opened. Spam is his fault, not my own.
Assignee: ftang → nobody
Status: REOPENED → NEW
Comment 47•16 years ago
This doesn't seem relevant to the modern form manager.
Status: NEW → RESOLVED
Closed: 20 years ago → 16 years ago
Resolution: --- → WONTFIX
Updated•16 years ago
Product: Core → Toolkit
QA Contact: form-manager → form.manager
Target Milestone: Future → ---
Version: Trunk → unspecified