Closed Bug 803189 Opened 12 years ago Closed 11 years ago

[keyboard] don't suggest offensive words

Categories

(Firefox OS Graveyard :: Gaia::Keyboard, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

(blocking-b2g:koi+, blocking-basecamp:-, b2g-v1.2 verified)

RESOLVED FIXED
blocking-b2g koi+
blocking-basecamp -
Tracking Status
b2g-v1.2 --- verified

People

(Reporter: djf, Assigned: janjongboom)

References

Details

Attachments

(1 file)

The word suggestions provided by the keyboard use comprehensive wordlists from Android. One feature of these lists is that they include words flagged as "offensive". We currently ignore that flag, so our suggestion dictionaries will help users who want to swear like sailors. On the other hand, the nice grandmother who wants to send a text about her fuchsia plants might see unwanted suggestions after typing the first three letters of that word. I suspect the safe thing to do here is to filter out the offensive words so they never appear as suggestions. I'm going to nominate this as blocking, but will wait to hear from product managers before making a change. On the other hand, it appears that android's word lists give all offensive words a frequency value of 0. So if you type 'fuc' you'll see fuchsia, fucose, fuchsias and fucoid before you see any suggestions listed as offensive. In most cases, you'd have to type an offensive root word and then you'd see suggestions that include suffixes. But unless we actually remove the words from the dictionary, we can't be sure they won't show up unexpectedly as spelling corrections or when there just aren't any non-offensive words with the same initial letters. For v2, we could include the offensive words in the dictionary but include a setting to turn them on or off. That's not an option for v1, though. We just need to decide whether to sanitize the dictionaries or not. Cc'ing Andreas since he knows the dictionaries well. And cc'ing Chris Lee since this seems like a product decision that is his to make.
> On the other hand, it appears that android's word lists give all offensive > words a frequency value of 0. So if you type 'fuc' you'll see fuchsia, > fucose, fuchsias and fucoid before you see any suggestions listed as > offensive. In most cases, you'd have to type an offensive root word and > then you'd see suggestions that include suffixes. This seems like a reasonable compromise. Is this the easiest path forward for us to limit offensive suggestions from showing up, but still offer users the ability to correct to these words if a user chooses to?
This is the status quo: offensive words won't appear very easily, but we can't guarantee that they will never appear. Users who are easily offended by offensive words might be offended, but it probably won't happen often. If product management is okay with that, please mark the bug Resolved/wontfix
I am blocking-minus'ing this bug so we can keep track of it for a future release.
blocking-basecamp: --- → -
After some additional testing, I can confirm that it is very unlikely for offensive word suggestions to appear. Because the dictionaries assign a 0 frequency to all offensive words, we seem to get suggestions like "duck", "ducking", "ship" and so on. If we ever enable a true auto-correct, this may turn out to be annoying to users who actually do want to use those offensive words.
No longer blocks: 797155
I am already ducking kissed.
Component: Gaia → Gaia::System::Keyboard
Interesting bug. Interesting comments, too.(I guess I still don't get what "ducking kissed is"). So, is there any plan to complete an algorithm for this? Perhaps a hash table for offernsive words to avoid using them or to suggest them only if there is no other better options?
OS: Mac OS X → Gonk (Firefox OS)
Hardware: x86 → ARM
At the moment the situation is also not ideal, f.e. if I type fucking it will autocorrect to ducking, which doesn't make sense to me. It's fine with me if the words don't show up, but a gynaecologist is gonna have a lot of trouble using a FFOS device :p
Assignee: nobody → janjongboom
Attached file Patch (deleted) —
This patch adds words that are considered profane to the dictionaries in Gaia. To prevent profane words showing up in the suggestion list we'll only suggest a word if the input matches the output. This means that it will be possible to type a word like penis, whereas it currently gets changed into Peninsula. As to be seen in the code I do the checking right before calling addWord. This basically means that on profane words we'll have less than 3 suggestions. I think this would be the right thing to do because they already have a frequency of 0 so the words that would be suggested otherwise don't make any sense anyway. Also there are a nice bunch of tests. I needed to change the dictionary format because the marker byte at the beginning of each node is already used up completely. We could change frequency from 5 to 4 bits to save a byte per word but that would mean quite a big loss in resolution. I added a second marker byte that holds currently one bit of information.
Attachment #772029 - Flags: review?(rlu)
If you already add a byte, any way we can put the other 7 bits to use? (increase frequency resolution for example)
We could move the frequency to the second byte, then we would have 8 bit resolution, plus 4 bits left in the first byte. I'd rather have some feedback from :RudyL or :djf whether that would be useful.
Nominating for leo+, because without this patch perfectly fine words will get autocorrected (and we have 700 or something of 'em).
blocking-b2g: --- → leo?
This isn't a 1.0.1 regression or something that is new to 1.1 and given where we are the 1.1 cycle it is late to take this enhancement, hence moving this to koi ?
blocking-b2g: leo? → koi?
Attachment #772029 - Flags: review?(rlu) → review?(dflanagan)
Comment on attachment 772029 [details] Patch Adding a second byte to each node adds one byte per character, not one byte per word. It looks like this patch increases the dictionary sizes by almost 50%, and I'm giving the r- for that reason. My plan for this was to follow google's lead in the word lists and dedicate frequency 0 to profane words. So we'd still use 5 bits for the word frequency, but there would be 31 distinct frequency levels instead of 32. The value 0 would be special. I haven't tried out the patch, but I don't understand your comment about there only being two suggestions in some cases? Can you give an example of why that might occur? I'd like to avoid that if possible.
Attachment #772029 - Flags: review?(dflanagan) → review-
Comment on attachment 772029 [details] Patch I have played around with embedding the profanity flag into the frequency field, but I've failed. The problem is due to the suffix tree you can't assign '0' freq to all nodes because when searching the tree these nodes will be skipped regardless, f.e. expelling 'I' from results. So we need to maintain the original frequency to not screw up the search algorithm. If only setting the frequency for the profane words (but not any other leaf that leads to it) then it started suggesting profane words again (any input on that might be nice) which also happens when using 32 as the marker value. I think it's due to the algorithm thinking 'this leaf is good enough just pick all of them'; skipping profanity check. So the only easy way out here is lowering the freq resolution to 4 bits and using bit 4 of the first byte as a profanity flag. Which is what I've done in the PR for now. Please let me know if that's acceptable. Due to the frequency altering bug 890896 shows up in one of my tests as well now.
Attachment #772029 - Flags: review- → review?(dflanagan)
Comment on attachment 772029 [details] Patch I haven't looked at the patch, but I'm not willing to accept 4 bits of frequency data. The issue here is just the data encoding question of how you encode a boolean and a number into 5 bits. The algorithm itself should remain unchanged. If the binary encoding of a node in the dictionary file has the frequency of 0, then in readNode(), set the frequency to 1 and set the profanity flag true. If the binary encoding is 1 through 31, then use that number as the frequency and set the profanity flag false. Wouldn't that work?
Attachment #772029 - Flags: review?(dflanagan) → review-
No, this screws things up because the prefix trees are not used just for profanity things and the whole leaf would be ignored at one point, also removing a lot of proper suggestions. F.e. check out my branch, add predictions.js:823: ` if (node.profane) node.freq = 1; ` And run the worker_test.js. Words like `as` are replaced by `ASB` (probably because sharing prefix tree with `asshole`?), and a bunch more tests are failing... Perhaps we can discuss this on IRC, for a faster roundtrip?
Flags: needinfo?(dflanagan)
(In reply to Jan Jongboom [:janjongboom] from comment #17) > No, this screws things up because the prefix trees are not used just for > profanity things and the whole leaf would be ignored at one point, also > removing a lot of proper suggestions. F.e. check out my branch, add > predictions.js:823: > > ` > if (node.profane) > node.freq = 1; > ` > > And run the worker_test.js. Words like `as` are replaced by `ASB` (probably > because sharing prefix tree with `asshole`?), and a bunch more tests are > failing... > It sounds to me like there is some other bug you're seeing here. What happens if you add the profane words to the dictionary and give them a frequency of 1 and don't to anything else? What does 'as' get replaced with then? And if that works, what happens if you make all non-profane words have a frequency of >= 2? If that works, then it should be possible to special case frequency 1 words since that frequency will always mean profanity. > Perhaps we can discuss this on IRC, for a faster roundtrip? Sure, if you like.
Flags: needinfo?(dflanagan)
Alright, the problem lay in changing the frequency when *writing* the node list, that's not right because a lot of nodes that would lead to a profane word would also be marked profane at that point. I wrote a patch that changes the frequency in `end_element(name)` to mark any profane word as 0 and map every valid word between (1/32) and 1. Tah dah.
Attachment #772029 - Flags: review- → review?(dflanagan)
Comment on attachment 772029 [details] Patch Jan, Thank you for pursuing this. Overall I like this patch, but am giving r- because it maps all freqencies to the range 1-31 and does not make use of 0. We can get one more frequency level if we use 0, and I think we can do it with cleaner code, as noted on github. (On the JS side, we'll add 1 and get a range from 1 to 32) Also, the fix for the default layout page thing ought to be cleaner, too.
Attachment #772029 - Flags: review?(dflanagan) → review-
Comment on attachment 772029 [details] Patch Hi, I've added some replies on GH.
Attachment #772029 - Flags: review- → review?(dflanagan)
To keep the discussion here as well. (In reply to David Flanagan [:djf] from comment #20) > Comment on attachment 772029 [details] > Patch > > Jan, > > Thank you for pursuing this. Overall I like this patch, but am giving r- > because it maps all freqencies to the range 1-31 and does not make use of 0. It doesn't. It maps profane words to 0, others to 1-31 (as opposed to 0-31 before). > Also, the fix for the default layout page thing ought to be cleaner, too. It's not really a fix, just a workaround for global detection in mocha unfortunately...
ping
Flags: needinfo?(dflanagan)
Comment on attachment 772029 [details] Patch r- because you didn't make the changes I asked for on the last review.
Attachment #772029 - Flags: review?(dflanagan) → review-
I've responded on github.
Flags: needinfo?(dflanagan)
Hmm, I think I might have forgotten to push. Haha, sorry.
Oh, omit that last comment. It was something weird on GH view.
Comment on attachment 772029 [details] Patch Moved the frequency mapping to emitNode and created bug 915549 as a follow up to get rid of the global in latin.js. Other than that, see GH.
Attachment #772029 - Flags: review- → review?(dflanagan)
Comment on attachment 772029 [details] Patch r+ if you make the frequency calculation simplification I've listed on github and if you squash your commits before landing. But let's also think about what this patch accomplishes and what it does not, because I think we might want to file a followup. From testing it words like "penis" and "vagina" no longer get autocorrected. That is a good change, but it only works for these words because there are not similar words that are more common. But this patch does not do anything for users who want to use common profanity. "fuck" still auto-corrects to "duck", "shit" to "shot", "prick" to "prickly", "ass" to "ads". The reason is that profanity is given artificially low weight, so even when the user types it exactly right, we still weight typos of other words higher. It used to be that we had a rule where if the user's input was an actual word, then we would never auto-correct it. If that rule was in place, users could type "fuck" and it would not autocorrect to duck. (But when they typed duck we would not suggest fuck) But we can't do that because we really want words like "wont" and "cant" (both legal English words) to autocorrect to the much more common "won't" and "can't". I suspect that most users who type "fuck" really want that word, and it is not typo for "duck". So I wonder if we need to followup with another bug (or add to this bug, if you prefer) so that when the user types a profane word, and spells it exactly, it goes to the first position in the list even though it has low weight. Basically what we're saying is that if the user types profanity with no errors, then we increase the weight from 0 to the maximum. (Or maybe increasing the weight from 1 to something like 20, so that in languages where a profane word is one letter away from a really common word we do still autocorrect away from the profanity.) With the changes listed above you can go ahead and land this, mostly to get the profane words into the dictionaries. But I think that we still need more work in predictions.js to handle those profane words.
Attachment #772029 - Flags: review?(dflanagan) → review+
Whoa, that's not what should happen... At least I have tests in place for "fuck" and stuff. I'll investigate a bit further...
Attachment #772029 - Flags: review+ → review?(dflanagan)
Alright, I found out why there is difference in behavior between the tests and real life. I was sending the nearby key information stringified, which postMessage already does for me, predictions had no clue what to do with the string and omitted the info. Therefore in the tests we have results without nearbykey info which caused all the tests to pass but give different results in reality. So, with nearby keyinfo enabled we should indeed bump the frequency of profane words if a full match occurs. I've played around a bit with the value, and 15 seems to be good enough. I've also added test cases for `wont`, `cant` so we don't break it. Please let me know what you think.
Comment on attachment 772029 [details] Patch Please add a comment about why we have to increase the frequency of profane words, and define a symbolic constant instead of hardcoding 15. See github for details. Thanks for all your work on this, Jan!
Attachment #772029 - Flags: review?(dflanagan) → review+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Sorry, I had to revert this because this is preventing about 800 tests to run. (Going from ~5100 tests to ~4300 tests) https://github.com/mozilla-b2g/gaia/commit/e74c42eacef67acd50a6a903d7e42c4071b7a798 Our suspicion is that the worker mock is conflicting with how mocha runs the tests. ccing James and Julien to help look into this.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Alright, this is actually weird. The file name is the problem, `latin_suggest_test.js` prevents `latin_test.js` from running. Renaming it to `suggest_test.js` and all is fine... Relanded it in https://github.com/mozilla-b2g/gaia/commit/75b49cc72954463fbaee883c1d7c805d69c4ef65
Status: REOPENED → RESOLVED
Closed: 11 years ago11 years ago
Resolution: --- → FIXED
Fixed before branching.
blocking-b2g: koi? → koi+
Attachment mime type: text/plain → text/x-github-pull-request
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: