Closed Bug 304951 Opened 19 years ago Closed 19 years ago

Hebrew detector too confident

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.8beta4

People

(Reporter: bugs.caleb, Assigned: shooshx)

References

Details

(Keywords: fixed1.8, regression, testcase)

Attachments

(2 files)

I've bumped into a bugzilla page where the reporter had 2 hebrew chars as part of his name, and that cuased the Universal detection to change encodings for the whole page.
Attached file the page (deleted) —
This "testcase" needs some major reduction, but it shows the bug pretty clearly.
(In reply to comment #0) > I've bumped into a bugzilla page where the reporter had 2 hebrew chars as part > of his name It should be noted that these are NOT really Hebrew characters. these are two Finnish characters (encoded in Latin-1) which were incorrectly identified as Hebrew characters.
I just ran the prober with debug printing on this page and got the following output: MBCS inactive: [UTF8] (confidence is too low). MBCS inactive: [SJIS] (confidence is too low). MBCS inactive: [EUCJP] (confidence is too low). MBCS inactive: [GB18030] (confidence is too low). MBCS inactive: [EUCKR] (confidence is too low). MBCS inactive: [Big5] (confidence is too low). MBCS inactive: [EUCTW] (confidence is too low). SBCS Group Prober --------begin status SBCS: 0.010 [windows-1251] SBCS: 0.010 [KOI8-R] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [x-mac-cyrillic] SBCS: 0.010 [IBM866] SBCS: 0.010 [IBM855] SBCS: 0.010 [ISO-8859-7] SBCS: 0.010 [windows-1253] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [windows-1251] HEB: 0 - 0 [Logical-Visual score] SBCS: 0.508 [windows-1255] SBCS: 0.508 [windows-1255] SBCS Group found best match [windows-1255] confidence 0.508128. Latin1Prober: 0.500 [windows-1252] Universal Charset Detector report charset windows-1255 . significant part of this is: SBCS: 0.508 [windows-1255] SBCS: 0.508 [windows-1255] SBCS Group found best match [windows-1255] confidence 0.508128. Latin1Prober: 0.500 [windows-1252] meaning only 0.8% difference between the hebrew and latin these 0.8% are exactly the two offending letters which are two 0xE4 in the name of the reporter of the bug and which correspond to the hebrew letter HEH in windows-1255. possible solutions I can see for this bug... - make MINIMUM_THRESHOLD in http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/nsUniversalDetector.cpp#99 higher. 0.20 is way too low in my opinion. increasing it to 0.55 for instance solves this bug but possibly causes other regressions for other languages - tell nsUniversalDetector and in general whoever makes a decision according to model conficence to not make the decision if the difference between two probers is lower then some threshold, say 1%. implemeted trivially, this is almost certain to cause problems to the two hebrew detectors which often both get 99% in hebrew pages. - a possible combination of the above two methods. tell the detector to not made the decision if the difference is lower the 1% AND both scores are low. it should be noted that this needs to be done generally, with any number of scores, not just two. these are my thoughs about it for the moment.
I don't really understand how charset-detection works, but there seems to be something wrong here. A page which is 99% ASCII is *much* more likely to be Latin1 than Hebrew (or any non-Latin alphabet). It seems like the detector is only looking at the non-ASCII chcracters, which I think is wrong. I don't know if this is a flaw with this particular detector or with the detection mechanism in general, but IMO it's something that should be fixed.
you have a very good point there. something IS fundumentally wrong. I have infact just noticed that there a serious mistake in the Hebrew Model in LangHebrewModel.cpp which nither me nor simon noticed. after a short debug session I've determined that the problem is infact not the letters themselves but the pair "/xE4/x20" which appears in the text the prober is given and is not ignored as it should be (since it contains a space). LangHebrewModel.cpp is the unchaged model simon produced with the model maker (model maker produced by the original author of this library). In the Language models already existing, I noticed that it is mentioned that the models have been slightly altered from what the tool produced. I believe this alteration is exactly what gives the hebrew model its problems. I haven't fully understood all of this alteration but I believe its base is a complete rewrite of 0x00-0x3f of the CharToOrderMap table. this rewrite, among other things assigns 0x20 (space) order 253 (close to last) instead of order 0 (first, most common) which is where the current model assigns him. this effectively excluded space from the pair check and saves the day for this test case. another evidance that there is infact a change on the original model from the tool is this comment from LangBulgarianModel.cpp: /**************************************************************** 255: Control characters that usually does not exist in any text 254: Carriage/Return 253: symbol (punctuation) that does not belong to word 252: 0 - 9 *****************************************************************/ as I understand it, this is infact a rudimentary mapping of the so called alteration to the original CharToOrderMap table. however I suspect that it doesn't describe the alretation fully. to test this theory I copied 0x00-0x3f from the bulgaian model to simons hebrew one and voila, it worked perfectly. space is now 253, last pair ignored and the hebrew model prober stays with 1% like the rest of the probers: MBCS inactive: [UTF8] (confidence is too low). MBCS inactive: [SJIS] (confidence is too low). MBCS inactive: [EUCJP] (confidence is too low). MBCS inactive: [GB18030] (confidence is too low). MBCS inactive: [EUCKR] (confidence is too low). MBCS inactive: [Big5] (confidence is too low). MBCS inactive: [EUCTW] (confidence is too low). SBCS Group Prober --------begin status SBCS: 0.010 [windows-1251] SBCS: 0.010 [KOI8-R] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [x-mac-cyrillic] SBCS: 0.010 [IBM866] SBCS: 0.010 [IBM855] SBCS: 0.010 [ISO-8859-7] SBCS: 0.010 [windows-1253] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [windows-1251] HEB: 0 - 0 [Logical-Visual score] SBCS: 0.010 [windows-1255] SBCS: 0.010 [windows-1255] SBCS Group found best match [windows-1251] confidence 0.010000. Latin1Prober: 0.500 [windows-1252] I would want to discuss this further with simon before making this into a patch.
Attached patch patch 1 (deleted) — Splinter Review
this patch fixes the bug by applying the appropriate modification needed in win1255_CharToOrderMap the chage contains: - rewrite of 0x00-0x40, 0x5b-0x60, 0x7b-0x7f (inclusive) to the values from any of the other old tables. erasing the original values from the model to 252-255. this makes the model engine disreguard these characters as it intentionally should work. - add a comment about values 252-255, this comment is copied from one of the other models as well. this solves the bug by making the model engine disreguard the space which gave the testcase a high hebrew score, higher then latin-1. testing: the original test case of this bug is not showing any signs of being recognized as hebrew. also, I've tested all of my previous test cases from 86999 and they work perfectly. the modification made to the old language models is not really documented anywhere and I had to use a bit of guess work and reverse engeneering to do it right. there's no doubt this should be documented somewhere proprly but I'm pretty sure this is not in the scope of this patch. it should be a part of the model-making tool, whenever it is checked in.
Attachment #193395 - Flags: review?(smontagu)
Comment on attachment 193395 [details] [diff] [review] patch 1 I should have spotted this when creating the language model. r=me.
Attachment #193395 - Flags: review?(smontagu) → review+
Assignee: smontagu → shoosh20012001
The middle line below seems wrong: +255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255, //10 ++253,253,253,253,253,253,253,253,253,253,253,253,253,253,253,253, //20 +252,252,252,252,252,252,252,252,252,252,253,253,253,253,253,253, //30
if you are referring to the odd "+" in the begining of the line then this is not a mistake. this is how it should be and how it is in all the other language models. I'm guessing that it's there to show to importance of this being what it is :)
Status: NEW → ASSIGNED
Attachment #193395 - Flags: superreview?(roc)
Attachment #193395 - Flags: superreview?(roc) → superreview+
We really need that documentation checked in somewhere!
Attachment #193395 - Flags: approval1.8b4?
That documentation by itself is useless since its context is the language model maker tool which produces these tables. Simon and I are the only active developers who currently have this code and in its current state it is not ready for checkin. current state being - only compiles on windows with VC6 and with very little documentation. In the coming weeks I would make my best effort to prepare this code for checkin and document it to my best of understanding. at any rate, this current bug is not the place to discuss this code. it should have a bug of its own.
Attachment #193395 - Flags: approval1.8b4? → approval1.8b4+
Trunk: Checking in src/LangHebrewModel.cpp; /cvsroot/mozilla/extensions/universalchardet/src/LangHebrewModel.cpp,v <-- LangHebrewModel.cpp new revision: 1.2; previous revision: 1.1 done 1.8 Branch: Checking in src/LangHebrewModel.cpp; /cvsroot/mozilla/extensions/universalchardet/src/LangHebrewModel.cpp,v <-- LangHebrewModel.cpp new revision: 1.2; previous revision: 1.1 done
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Keywords: fixed1.8
Resolution: --- → FIXED
OS: Windows XP → All
Hardware: PC → All
Target Milestone: --- → mozilla1.8beta4
Just a FYI: Now the page https://bugzilla.mozilla.org/show_bug.cgi?id=1156 is detected as Korean (EUC-KR) :/
That seems to be a return to the status quo at least, because it's detected as Korean in DPA2 also.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: