Closed
Bug 1160596
Opened 10 years ago
Closed 9 years ago
Create a general blacklist for negative adjacency
Categories
(Content Services Graveyard :: Tiles, defect)
Content Services Graveyard
Tiles
Tracking
(Not tracked)
RESOLVED
FIXED
Iteration:
41.1 - May 25
People
(Reporter: Mardak, Assigned: mruttley)
References
Details
(Whiteboard: .001)
Attachments
(1 file, 3 obsolete files)
(deleted), application/json
We need a blacklist to hardcode in bug 1159884. We have various sources of blacklist data that we want to combine. We probably also want to filter based on site ranking/popularity to reduce the total number of entries and lessen the impact on Firefox's disk/memory usage.
Assignee
Comment 1•9 years ago
Here's the most current blacklist: https://github.com/matthewruttley/contentfilter/blob/master/sites.json
I'm continually updating it. In terms of the bloom filter, we have to decide at a legal level what error rate is acceptable (perhaps none?). Maksik mentioned that he will investigate whether there is an existing bloom filter implementation in Firefox. We also have to decide which hashing function is best. My suggestion is that we use a popular function like MurmurHash: https://gist.github.com/raycmorgan/588423
Comment 2•9 years ago
mxr points to a C++ bloom filter implementation:
http://mxr.mozilla.org/mozilla-central/source/mfbt/BloomFilter.h
I am not sure how to surface it to a .jsm module.
Comment 3•9 years ago
When checking the false positive rate, please run the BloomFilter on the Alexa top 1m with adult sites removed and see how many actual false positives we get.
Reporter
Comment 4•9 years ago
There's a bloom filter implemented in bug 1138022 https://bugzilla.mozilla.org/attachment.cgi?id=8571008&action=diff
And I don't think it matters too much which exact bloom filter function is used. I expect it to be more of a calculation that can be confirmed with an actual implementation. E.g., given the Alexa top 1m domains, what's the collision/false-positive rate for a bloom filter with X bits?
Sure, different functions will generate different hashes, and which exact sites collide will differ, but in general I would think it's more a matter of how many bits the blacklist's size requires, and how many additional bits are needed to lessen false positives.
Reporter
Comment 5•9 years ago
There's this wikipedia article:
http://en.wikipedia.org/wiki/Bloom_filter#Probability_of_false_positives
Reporter
Comment 6•9 years ago
This article has a rule of thumb:
http://corte.si/%2Fposts/code/bloom-filter-rules-of-thumb/index.html
"One byte per item in the input set gives about a 2% false positive rate."
If we had 1024 blacklisted items, we would get 2% false positives with a 1KB bloom filter. If we want 100k items, we would need roughly a 100KB bloom filter for a 2% false positive rate. Increasing to a 10% false positive rate roughly halves the size.
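The rule of thumb above can be checked against the standard Bloom filter false-positive formula p ≈ (1 − e^(−kn/m))^k. A minimal sketch in Python (the numbers are illustrative, not from an actual Firefox implementation):

```python
import math

def bloom_false_positive_rate(n_items, m_bits, k_hashes):
    """Approximate false-positive probability for a Bloom filter
    with n_items inserted, m_bits of storage, and k_hashes hash functions."""
    return (1 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes

def optimal_k(n_items, m_bits):
    """Number of hash functions that minimizes the false-positive rate:
    k = (m/n) * ln 2, rounded to the nearest whole hash."""
    return max(1, round(m_bits / n_items * math.log(2)))

# One byte (8 bits) per item, as in the rule of thumb:
n = 100_000            # blacklisted domains
m = 8 * n              # ~100KB filter
k = optimal_k(n, m)    # 8 * ln 2 ~ 5.5, so 6 hash functions
print(k, bloom_false_positive_rate(n, m, k))  # ~0.021, i.e. about 2%
```

This agrees with the article's "one byte per item gives about 2%" claim independently of which hash function (Murmur or otherwise) is chosen.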
Assignee
Comment 7•9 years ago
Mardak: Are we doing domain, subdomain, or path level matching?
This is a great list we could integrate: http://dsi.ut-capitole.fr/blacklists/index_en.php but it requires a bit more than just domain matching to get at various Google Groups and live.com specifics.
Reporter
Comment 8•9 years ago
I believe we'll start with just the domain matching for now.
Assignee
Comment 9•9 years ago
I've integrated some more data from a comscore source and we now have over 2900 domains: https://github.com/matthewruttley/contentfilter/blob/master/sites.json
The counts per category are as follows:
- drugs: 19
- gambling: 154
- adult: 2743
- alcohol: 50
All domains are in the latest daily Alexa top 1m sites.
Reporter
Comment 10•9 years ago
mruttley, can you calculate the maxP of those blacklisted sites as one adgroup?
Assignee
Comment 11•9 years ago
The MaxP of blacklist groups is currently:
Drugs 0.114959
Gambling 0.094614
Adult 0.020494
Alcohol 0.074813
All Together 0.019026
Comment 12•9 years ago
Matthew, could you generate a JSON file of the form:
{"blacklist": [ base64(bad_site_1), base64(bad_site_2), .... , base64("example.com") ]}
Note that I included "example.com" in the list, so we can browser-test negative adjacency without actually going to pron sites. I am using a btoa() call to generate base64 on the client, so it would be advisable to use the same encoding for JSON file generation.
When the file is generated, please attach it to the bug.
Flags: needinfo?(mruttley)
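The requested file generation can be sketched as follows. For plain ASCII domain names, the client's btoa() and standard Base64 produce identical strings, so Python's base64 module is a safe stand-in (the domain names here are illustrative):

```python
import base64
import json

def encode_blacklist(domains):
    """Base64-encode each domain the same way the client's btoa() would;
    for ASCII input, btoa and standard Base64 agree byte for byte."""
    return {"blacklist": [base64.b64encode(d.encode("ascii")).decode("ascii")
                          for d in domains]}

# "example.com" is included so negative adjacency can be browser-tested
# without visiting a real blacklisted site.
payload = encode_blacklist(["bad-site-1.example", "example.com"])
print(json.dumps(payload))
# btoa("example.com") == "ZXhhbXBsZS5jb20="
```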
Assignee
Comment 13•9 years ago
maksik: Here is the file: https://github.com/matthewruttley/contentfilter/blob/master/sitesb64.json
Flags: needinfo?(mruttley)
Assignee
Comment 14•9 years ago
Updated with b64 encoding of example.com as requested by maksik
Reporter
Updated•9 years ago
Status: NEW → RESOLVED
Iteration: 40.3 - 11 May → 41.1 - May 25
Points: --- → 5
Closed: 9 years ago
Resolution: --- → FIXED
Comment 15•9 years ago
Why are we reinventing the wheel instead of using the Safe Browsing data format and server? It supports more than just domain matches and is already usable by JS in Firefox. I don't understand how one second it's argued that we need this ASAP, and on the other hand we're making extra work for ourselves.
Reporter
Comment 16•9 years ago
Do you know who can help with using the existing Safe Browsing data format and server? If it is indeed simple and low-risk to uplift to 39, then we could go that route.
Comment 17•9 years ago
:gcp or :mmc (not sure of availability) knows the client code in toolkit, and looking at the server code [1] that I linked to before, it would seem rtilder knows the server side. The client changes look low-risk for uplift IMO. I didn't realize you were trying to rush this into Beta, which doesn't seem like a good idea for any of these solutions, as there could be performance issues to handle (especially with a custom solution).
[1] https://github.com/mozilla-services/shavar
Assignee
Comment 18•9 years ago
md5 encoding of the blacklist
Assignee
Comment 19•9 years ago
Attachment #8604906 -
Attachment is obsolete: true
Attachment #8608241 -
Attachment is obsolete: true
Comment 20•9 years ago
We need to change the format of the JSON file: "blacklist" needs to be replaced with "domains". It should be
{
"domains": [
.....
]
}
Could you please make this correction?
Flags: needinfo?(mruttley)
Assignee
Comment 21•9 years ago
Also available here: https://github.com/matthewruttley/contentfilter/blob/master/md5_b64.json
Attachment #8608315 -
Attachment is obsolete: true
Flags: needinfo?(mruttley)
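The entries in md5_b64.json (and in the diff below) are 24-character Base64 strings ending in "==", which is consistent with Base64-encoded 16-byte MD5 digests. A sketch of that encoding follows; the exact pipeline is an assumption, not confirmed in the bug:

```python
import base64
import hashlib

def md5_b64(domain):
    """Hash the domain with MD5, then Base64-encode the 16-byte digest.
    A 16-byte digest always encodes to 24 Base64 characters ending '=='.
    This matches the shape of the md5_b64.json entries, but the actual
    generation script is an assumption."""
    digest = hashlib.md5(domain.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

print(md5_b64("example.com"))  # a 24-character Base64 string ending "=="
```

Hashing before encoding means the shipped list never contains the blacklisted domains in recoverable plaintext, only their digests.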
Reporter
Comment 22•9 years ago
What are the changes?
939a940,941
> "mNlYGAOPc6KIMW8ITyBzIg==",
> "bK045TkBlz+/3+6n6Qwvrg==",
1065d1066
< "t1O9jSNjg4DTIv/Za4NbtA==",
1163d1163
< "+k5lDb+QdNc9iZ01hL5yBg==",
1244d1243
< "2qK2ZEY9LgdKSTaLf6VnLA==",
1369a1369
> "hIJA+1QGuKEj+3ijniyBSQ==",
1528d1527
< "E2lvMXqHdTw0x+KCKVnblg==",
1973a1973
> "8DtgIyYiNFqDc5qVrpFUng==",
2070d2069
< "RzX2OfSFEd//LhZwRwzBVw==",
2211a2211
> "O7JiE0bbp583G6ZWRGBcfw==",
2925a2926
> "gYgCu/qUpXWryubJauuPNw==",
3010a3012
> "+YVxSyViJfrme/ENe1zA7A==",
3089a3092
> "VZX1FnyC8NS2k3W+RGQm4g==",
Reporter
Comment 23•9 years ago
Ah I see the commit:
https://github.com/matthewruttley/contentfilter/commit/048e1410da2408d4e79ef0ea8001691606fe9af4
The commit message doesn't quite explain the changes in the list though. Any reason for those changes?
Flags: needinfo?(mruttley)
Assignee
Comment 24•9 years ago
The changes reflect updates in how the contentfilter is created. The contentfilter is generated only from sites found in the Alexa top 1m, which changes every day; thus each time I regenerate the list, a few sites change.
I used backticks in the commit message, which seems to have eliminated most/all of it :/ TIL.
Flags: needinfo?(mruttley)
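The regeneration step described above can be sketched as a simple set intersection; the function name and the domains are illustrative, not the actual contentfilter script:

```python
def filter_by_alexa(blacklist, alexa_top_domains):
    """Keep only blacklisted domains present in the current Alexa top 1m.
    Because the Alexa list changes daily, regenerating can add or drop
    a few entries each run, which explains diffs like the one in comment 22."""
    alexa = set(alexa_top_domains)  # O(1) membership checks
    return sorted(d for d in blacklist if d in alexa)

kept = filter_by_alexa(["siteA.example", "siteB.example"],
                       ["siteB.example", "other.example"])
print(kept)  # ['siteB.example']
```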