Closed Bug 669407 Opened 13 years ago Closed 13 years ago

Reduce disk space requirements for Safe Browsing

Categories

(Toolkit :: Safe Browsing, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
Target Milestone: Firefox 13

People

(Reporter: gcp, Assigned: gcp)

Details

The current Safe Browsing implementation uses an SQLite database that is roughly 50 MB. That is too much storage for mobile devices, so Safe Browsing is currently disabled on Fennec.

Figure out a way to reduce the space requirement.

Currently under consideration is removing the SQLite database and replacing it with a prefix tree. This should reduce the requirement from about 50 MB to about 4.5 MB (2.2 MB of hash prefixes, plus likely another 2.2 MB of update information).

Some of the work needed is already available here:

http://hg.mozilla.org/users/mmmulani_uwaterloo.ca/safe-browsing/
http://hg.mozilla.org/users/gpascutto_mozilla.com/patches/
Assignee: nobody → gpascutto
tracking-fennec: --- → ?
Depends on: 669410
Bug 669410 tracks the first part of this, which is to construct the prefix tree and use it to reduce the accesses to the SQLite DB.
tracking-fennec: ? → ---
Blocks: 470876
No longer blocks: 650649
I'm not sure if this is the bug where you're tracking matching only on the prefixes and ignoring the hostname hashes.  But here's the comment I posted at [1] regarding this change.

If you drop the 32-bit hostname hash, then would you be matching only on the 32-bit prefixes? If so, I think you're setting yourself up for false positives.

Consider: With 600,000 hashes in the database, the chance of some other website colliding on a 32-bit hash is 600,000 * 2^(-32) = 0.00014. So you'd need to view 0.5 / 0.00014 ≈ 3,600 pages to have a 50% chance of a false positive. That's not a lot of pages.

Using (truncated) hashes is itself probabilistic. So if you tuned the bloom filter so the chance of a false-positive is much lower than the chance of a hash collision, it might be sensible.

[1] http://www.morbo.org/2011/07/bringing-safebrowsing-to-mobile-devies.html
Oops, the chance of a collision is actually

 1 - (1 - 0.00014) ^ y

where y is the number of websites you visit.  So if you want probability of collision = 0.5, you get

 y = log 0.5 / log(1 - 0.00014)

or about 5000 webpages.
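
(A quick back-of-the-envelope check of those numbers, as a Python sketch; purely illustrative, not part of any patch:)

  import math

  entries = 600000          # prefixes in the database
  p = entries / 2.0**32     # chance a single page collides: ~0.00014

  # Probability of at least one collision after y page views:
  #   1 - (1 - p)**y
  # Pages needed for a 50% chance of at least one collision:
  y = math.log(0.5) / math.log(1 - p)
  print(p, y)               # ~0.00014, ~5000 pages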
>I'm not sure if this is the bug where you're tracking matching only on the 
>prefixes and ignoring the hostname hashes.

For now it's an optional patch for bug 669410. 

>Using (truncated) hashes is itself probabilistic. So if you tuned the bloom 
>filter so the chance of a false-positive is much lower than the chance of a 
>hash collision, it might be sensible.

There are 2 aspects here: the first one is a collision on a prefix causing a lookup to the Google server. You're right this will now happen about once every few thousand pages instead of, well, almost never. This will increase the load on Google's server, though note that full prefix lookups are cached. I don't know if the resulting load increase is substantial or meaningful.

The second aspect is that if you want to handle updates, a "false positive" makes it impossible to safely remove an entry from the data structure. This isn't an issue with a Prefix Trie (which can handle duplicated entries), whereas it *is* an issue with a Bloom filter.
The persistent data structure must *support* deletions.

Lastly, a Bloom filter with a false positive rate of 1 out of every 5000 lookups and 600,000 prefixes takes 17.7 bits per entry, or 1.7 bits per entry more than the Prefix Trie. So it's a loss in every case.
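
(For anyone wanting to check the 17.7 figure, it follows from the standard formula for an optimally-sized Bloom filter; a Python sketch, purely illustrative:)

  import math

  p = 1.0 / 5000    # target false positive rate
  # Optimally-sized Bloom filter: bits per entry = -ln(p) / (ln 2)^2
  bits_per_entry = -math.log(p) / (math.log(2) ** 2)
  print(bits_per_entry)    # ~17.7 bits, vs ~16 for the Prefix Trie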
I see.  I didn't realize that when there's a hash collision, we send the full URL to Google for verification.  That's much more sensible.

Increasing the probability of sending the URL to Google to 0.00014 doesn't seem so bad.

Do you know if we wait to render the page until Google responds?  Then I suppose it might matter some.

It sounds like a prefix trie trumps a bloom filter here, although I wonder how Chrome deals with deletions if they also use a bloom filter...
>Do you know if we wait to render the page until Google responds?  Then I 
>suppose it might matter some.

Yes, everything blocks until the URL is confirmed (for security reasons). If the hostkeys are still saved on disk (which I'd do for desktop), the impact will be 0. If they aren't (mobile), we'll load 1 in 5000 pages slower.

>It sounds like a prefix trie tromps a bloom filter here, although I wonder how 
>Chrome deals with deletions if they also use a bloom filter...

They don't. They delete it and reconstruct it entirely from a (larger) disk-based database. We'd like to use Prefix Tries for both. Not storing hostkeys is an optimization for either technique.
(In reply to comment #6)
> >Do you know if we wait to render the page until Google responds?  Then I 
> >suppose it might matter some.
> 
> Yes, everything blocks until the URL is confirmed (for security reasons). If
> the hostkeys are still saved on disk (which I'd do for desktop), the impact
> will be 0. If they aren't (mobile), we'll load 1 in 5000 pages slower.

This seems particularly painful because (a) we're talking about slow connections and (b) when it slows down a page, it does so for all our users (which is bad for whoever owns that page), right?
As for (a), I'm not sure connection speed is relevant: the time spent on the extra query, compared to the total page load time, will be approximately the same independent of connection speed, and quite low (a tiny data query to a fast Google server). Is it perceived as more annoying by a user to wait 5 seconds for a page load instead of 4 seconds, compared to 250 ms instead of 200 ms? Can they even detect that 1 out of 5000 pages is a fraction slower?

As for (b), we can rehash the prefixes with a unique-per-browser hash to ensure this 1/5000th page collision is happening on a different page for every user, if we feel that is needed.

We cannot avoid false positives if we hash. If we want fewer of them, they have to be paid for in bits of flash storage.

(An interesting side consideration: in the current implementation, which has existed since Firefox 3, something like half the entries in the DB are full-domain blocks, where both hashes are equal (because hostkey == URL-key). This means that some (about 1 in 10,000) domain owners already suffer from this issue.)
> As for (b), we can rehash the prefixes with a unique-per-browser hash to ensure 
> this 1/5000th page collision is happening on a different page for every user, 
> if we feel that is needed.

How would this work, exactly?  You don't get the URL itself, just the prefix (a hash)?  So you generate a random number R and do

  for all prefixes, store sha32(prefix|R).

Then when you visit a page, you check whether

  sha32(sha32(url)|R)

is in the database?  But doesn't this collide if sha32(url) == prefix, which is exactly the condition you had before?
You get from Google: a domain prefix DH = sha32(domain) and a canonicalized-URL prefix PH = sha32(url).

for all prefixes, store sha32(DH|PH|R) 

When visiting a page, check whether sha32(sha32(domain)|sha32(url)|R) is in the database. Now a page where sha32(url) == PH but sha32(domain) != DH will only collide for some values of R.
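
(Roughly, as a Python sketch; SHA-256 truncated to 4 bytes stands in for sha32, and all names here are illustrative rather than the actual implementation:)

  import hashlib, os

  R = os.urandom(4)  # per-browser random salt, generated once

  def sha32(data):
      # 32-bit prefix: first 4 bytes of a SHA-256 digest (stand-in only)
      return hashlib.sha256(data).digest()[:4]

  def salted_key(DH, PH):
      # DH, PH are the 4-byte prefixes received from Google
      return sha32(DH + PH + R)

  # On update: store salted_key(DH, PH) for every (DH, PH) pair received.
  # On lookup: check salted_key(sha32(domain), sha32(canonical_url))
  # against the store; a prefix that matches PH but not DH only collides
  # for some values of R, so different users collide on different pages.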
Ah, I see.  I had to write it out to understand, so if it helps anyone else:

* sha32(PH) collides if two URLs have the same hash. (probability 2^-32)

* sha32(DH|PH) collides if
   * two URLs have the same hash and their domains have the same hash, or
   * two URLs' DH|PH values have the same hash.
 (probability 2^-64 + 2^-32)

* sha32(DH|PH|R) collides if
   * two URLs have the same hash and their domains have the same hash, or
   * two URLs' DH|PH|R values have the same hash.
  (probability 2^-64 + 2^-32)

So you're trading off a (tiny) increase in chance of collisions and getting back randomization on the 2^-32 term.
Depends on: 673470
Blocks: 549288
Component: General → Phishing Protection
OS: Android → All
Product: Fennec → Firefox
QA Contact: general → phishing.protection
Hardware: ARM → All
Bug 673470 landed, reducing the store from 40M to 5-6M.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 13
Product: Firefox → Toolkit