304951 - Hebrew detector too confident

Reporter

Description

•

19 years ago

I've bumped into a bugzilla page where the reporter had 2 hebrew chars as part of his name, and that cuased the Universal detection to change encodings for the whole page.

Caleb

Reporter

Comment 1

•

19 years ago

Attached file the page (deleted) — Details

This "testcase" needs some major reduction, but it shows the bug pretty clearly.

Uri Bernstein (Google)

Comment 2

•

19 years ago

(In reply to comment #0) > I've bumped into a bugzilla page where the reporter had 2 hebrew chars as part > of his name It should be noted that these are NOT really Hebrew characters. these are two Finnish characters (encoded in Latin-1) which were incorrectly identified as Hebrew characters.

Shy Shalom

Assignee

Comment 3

•

19 years ago

I just ran the prober with debug printing on this page and got the following output: MBCS inactive: [UTF8] (confidence is too low). MBCS inactive: [SJIS] (confidence is too low). MBCS inactive: [EUCJP] (confidence is too low). MBCS inactive: [GB18030] (confidence is too low). MBCS inactive: [EUCKR] (confidence is too low). MBCS inactive: [Big5] (confidence is too low). MBCS inactive: [EUCTW] (confidence is too low). SBCS Group Prober --------begin status SBCS: 0.010 [windows-1251] SBCS: 0.010 [KOI8-R] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [x-mac-cyrillic] SBCS: 0.010 [IBM866] SBCS: 0.010 [IBM855] SBCS: 0.010 [ISO-8859-7] SBCS: 0.010 [windows-1253] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [windows-1251] HEB: 0 - 0 [Logical-Visual score] SBCS: 0.508 [windows-1255] SBCS: 0.508 [windows-1255] SBCS Group found best match [windows-1255] confidence 0.508128. Latin1Prober: 0.500 [windows-1252] Universal Charset Detector report charset windows-1255 . significant part of this is: SBCS: 0.508 [windows-1255] SBCS: 0.508 [windows-1255] SBCS Group found best match [windows-1255] confidence 0.508128. Latin1Prober: 0.500 [windows-1252] meaning only 0.8% difference between the hebrew and latin these 0.8% are exactly the two offending letters which are two 0xE4 in the name of the reporter of the bug and which correspond to the hebrew letter HEH in windows-1255. possible solutions I can see for this bug... - make MINIMUM_THRESHOLD in http://lxr.mozilla.org/seamonkey/source/extensions/universalchardet/src/nsUniversalDetector.cpp#99 higher. 0.20 is way too low in my opinion. increasing it to 0.55 for instance solves this bug but possibly causes other regressions for other languages - tell nsUniversalDetector and in general whoever makes a decision according to model conficence to not make the decision if the difference between two probers is lower then some threshold, say 1%. implemeted trivially, this is almost certain to cause problems to the two hebrew detectors which often both get 99% in hebrew pages. - a possible combination of the above two methods. tell the detector to not made the decision if the difference is lower the 1% AND both scores are low. it should be noted that this needs to be done generally, with any number of scores, not just two. these are my thoughs about it for the moment.

Uri Bernstein (Google)

Comment 4

•

19 years ago

I don't really understand how charset-detection works, but there seems to be something wrong here. A page which is 99% ASCII is *much* more likely to be Latin1 than Hebrew (or any non-Latin alphabet). It seems like the detector is only looking at the non-ASCII chcracters, which I think is wrong. I don't know if this is a flaw with this particular detector or with the detection mechanism in general, but IMO it's something that should be fixed.

Shy Shalom

Assignee

Comment 5

•

19 years ago

you have a very good point there. something IS fundumentally wrong. I have infact just noticed that there a serious mistake in the Hebrew Model in LangHebrewModel.cpp which nither me nor simon noticed. after a short debug session I've determined that the problem is infact not the letters themselves but the pair "/xE4/x20" which appears in the text the prober is given and is not ignored as it should be (since it contains a space). LangHebrewModel.cpp is the unchaged model simon produced with the model maker (model maker produced by the original author of this library). In the Language models already existing, I noticed that it is mentioned that the models have been slightly altered from what the tool produced. I believe this alteration is exactly what gives the hebrew model its problems. I haven't fully understood all of this alteration but I believe its base is a complete rewrite of 0x00-0x3f of the CharToOrderMap table. this rewrite, among other things assigns 0x20 (space) order 253 (close to last) instead of order 0 (first, most common) which is where the current model assigns him. this effectively excluded space from the pair check and saves the day for this test case. another evidance that there is infact a change on the original model from the tool is this comment from LangBulgarianModel.cpp: /**************************************************************** 255: Control characters that usually does not exist in any text 254: Carriage/Return 253: symbol (punctuation) that does not belong to word 252: 0 - 9 *****************************************************************/ as I understand it, this is infact a rudimentary mapping of the so called alteration to the original CharToOrderMap table. however I suspect that it doesn't describe the alretation fully. to test this theory I copied 0x00-0x3f from the bulgaian model to simons hebrew one and voila, it worked perfectly. space is now 253, last pair ignored and the hebrew model prober stays with 1% like the rest of the probers: MBCS inactive: [UTF8] (confidence is too low). MBCS inactive: [SJIS] (confidence is too low). MBCS inactive: [EUCJP] (confidence is too low). MBCS inactive: [GB18030] (confidence is too low). MBCS inactive: [EUCKR] (confidence is too low). MBCS inactive: [Big5] (confidence is too low). MBCS inactive: [EUCTW] (confidence is too low). SBCS Group Prober --------begin status SBCS: 0.010 [windows-1251] SBCS: 0.010 [KOI8-R] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [x-mac-cyrillic] SBCS: 0.010 [IBM866] SBCS: 0.010 [IBM855] SBCS: 0.010 [ISO-8859-7] SBCS: 0.010 [windows-1253] SBCS: 0.010 [ISO-8859-5] SBCS: 0.010 [windows-1251] HEB: 0 - 0 [Logical-Visual score] SBCS: 0.010 [windows-1255] SBCS: 0.010 [windows-1255] SBCS Group found best match [windows-1251] confidence 0.010000. Latin1Prober: 0.500 [windows-1252] I would want to discuss this further with simon before making this into a patch.

Shy Shalom

Assignee

Comment 6

•

19 years ago

Attached patch patch 1 (deleted) — Details — Splinter Review

this patch fixes the bug by applying the appropriate modification needed in win1255_CharToOrderMap the chage contains: - rewrite of 0x00-0x40, 0x5b-0x60, 0x7b-0x7f (inclusive) to the values from any of the other old tables. erasing the original values from the model to 252-255. this makes the model engine disreguard these characters as it intentionally should work. - add a comment about values 252-255, this comment is copied from one of the other models as well. this solves the bug by making the model engine disreguard the space which gave the testcase a high hebrew score, higher then latin-1. testing: the original test case of this bug is not showing any signs of being recognized as hebrew. also, I've tested all of my previous test cases from 86999 and they work perfectly. the modification made to the old language models is not really documented anywhere and I had to use a bit of guess work and reverse engeneering to do it right. there's no doubt this should be documented somewhere proprly but I'm pretty sure this is not in the scope of this patch. it should be a part of the model-making tool, whenever it is checked in.

Attachment #193395 - Flags: review?(smontagu)

Simon Montagu :smontagu

Comment 7

•

19 years ago

Comment on attachment 193395 [details] [diff] [review] patch 1 I should have spotted this when creating the language model. r=me.

Attachment #193395 - Flags: review?(smontagu) → review+

Simon Montagu :smontagu

Updated

•

19 years ago

Assignee: smontagu → shoosh20012001

mmc

Comment 8

•

19 years ago

The middle line below seems wrong: +255,255,255,255,255,255,255,255,255,255,255,255,255,255,255,255, //10 ++253,253,253,253,253,253,253,253,253,253,253,253,253,253,253,253, //20 +252,252,252,252,252,252,252,252,252,252,253,253,253,253,253,253, //30

Shy Shalom

Assignee

Comment 9

•

19 years ago

if you are referring to the odd "+" in the begining of the line then this is not a mistake. this is how it should be and how it is in all the other language models. I'm guessing that it's there to show to importance of this being what it is :)

Status: NEW → ASSIGNED

Shy Shalom

Assignee

Updated

•

19 years ago

Attachment #193395 - Flags: superreview?(roc)

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

19 years ago

Attachment #193395 - Flags: superreview?(roc) → superreview+

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 10

•

19 years ago

We really need that documentation checked in somewhere!

Shy Shalom

Assignee

Updated

•

19 years ago

Attachment #193395 - Flags: approval1.8b4?

Shy Shalom

Assignee

Comment 11

•

19 years ago

That documentation by itself is useless since its context is the language model maker tool which produces these tables. Simon and I are the only active developers who currently have this code and in its current state it is not ready for checkin. current state being - only compiles on windows with VC6 and with very little documentation. In the coming weeks I would make my best effort to prepare this code for checkin and document it to my best of understanding. at any rate, this current bug is not the place to discuss this code. it should have a bug of its own.

Asa Dotzler [:asa]

Updated

•

19 years ago

Attachment #193395 - Flags: approval1.8b4? → approval1.8b4+

Asaf Romano (gone)

Comment 12

•

19 years ago

Trunk: Checking in src/LangHebrewModel.cpp; /cvsroot/mozilla/extensions/universalchardet/src/LangHebrewModel.cpp,v <-- LangHebrewModel.cpp new revision: 1.2; previous revision: 1.1 done 1.8 Branch: Checking in src/LangHebrewModel.cpp; /cvsroot/mozilla/extensions/universalchardet/src/LangHebrewModel.cpp,v <-- LangHebrewModel.cpp new revision: 1.2; previous revision: 1.1 done

Status: ASSIGNED → RESOLVED

Closed: 19 years ago

Keywords: fixed1.8

Resolution: --- → FIXED

Asaf Romano (gone)

Updated

•

19 years ago

OS: Windows XP → All

Hardware: PC → All

Target Milestone: --- → mozilla1.8beta4

Caleb

Reporter

Comment 13

•

19 years ago

Just a FYI: Now the page https://bugzilla.mozilla.org/show_bug.cgi?id=1156 is detected as Korean (EUC-KR) :/

Simon Montagu :smontagu

Comment 14

•

19 years ago

That seems to be a return to the status quo at least, because it's detected as Korean in DPA2 also.

the page 19 years ago Caleb (deleted), text/html		Details
patch 1 19 years ago Shy Shalom (deleted), patch	smontagu : review+ roc : superreview+ asa : approval1.8b4+	Details \| Diff \| Splinter Review