Closed Bug 889561 Opened 11 years ago Closed 8 years ago

moz_places_url_uniqueindex takes up to 1/3 of places.sqlite

Categories

(Toolkit :: Places, defect, P2)

Platform: x86_64 Windows 8

Points: 5

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla50
Tracking Status: firefox50 --- fixed

People

(Reporter: taras.mozilla, Assigned: mak)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [fxsearch])

Attachments

(1 file, 2 obsolete files)

No description provided.
So, we need the url index for two reasons:

1. uniqueness check (internal functionality relies on this precondition)
2. actual queries doing WHERE url = some_url (many APIs rely on this)

We don't exactly need a url index that replicates the whole url (which is what SQLite does); we just need the two points above. A good hashing algorithm (the mfbt one, CityHash, or MurmurHash) could easily provide the same benefits while taking just 64 bits per entry in the ideal case (to be verified). This could reduce the database size by 15-20%.

So the plan of action here may be:

1. Figure out a good hashing algo. We could add telemetry probes to the current method that adds URIs to the db and report collisions on Nightly/Aurora (this requires some changes to the db, but if we plan the change that's fine; we can just add the url_hash column earlier).
2. Remove moz_places_url_uniqueindex and instead add a unique index on the url_hash column. Measure the db size win. The hashing may either be done at the Places level or (better) implemented as a SQLite function.
3. Fix queries to compare hashes instead of plain urls.
note: I said 64 bits because we'd need a 32-bit integer column for the hash, plus an index replicating it.
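For concreteness, here is a minimal sketch of the schema change step 2 above describes, using the raw SQLite C API. The url_hash column name comes from the comment above; the index name and everything else are illustrative assumptions, not the actual patch. The index is left non-unique because, as discussed below, hash collisions are possible, and the collision rate is exactly what the telemetry step would measure:

```cpp
// Sketch only: add a hypothetical url_hash column to moz_places and index
// it in place of the full-url unique index. Names are illustrative.
#include <sqlite3.h>
#include <cstdio>

int main() {
  sqlite3* db = nullptr;
  if (sqlite3_open("places.sqlite", &db) != SQLITE_OK) return 1;

  char* err = nullptr;
  int rc = sqlite3_exec(db,
      // A 32-bit hash in an INTEGER column: the index then stores small
      // integers instead of copies of full url strings.
      "ALTER TABLE moz_places ADD COLUMN url_hash INTEGER NOT NULL DEFAULT 0;"
      "CREATE INDEX moz_places_url_hashindex ON moz_places(url_hash);",
      nullptr, nullptr, &err);
  if (rc != SQLITE_OK) {
    std::fprintf(stderr, "schema change failed: %s\n", err);
    sqlite3_free(err);
  }
  sqlite3_close(db);
  return rc == SQLITE_OK ? 0 : 1;
}
```

A lookup then becomes WHERE url_hash = hash(:url) AND url = :url, so the index narrows the search to a handful of candidate rows and the string compare only runs on those.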
(In reply to Marco Bonardo [:mak] from comment #2)
> The hashing may either be done at the Places level or (better) implemented
> as a SQLite function.
> note: I said 64 bits because we'd need a 32-bit integer column for the hash,
> plus an index replicating it.

We could also ask Dr. Hipp to implement function indexes... That might not be more complicated than changing all of the callsites to deal with a new column, and it would have the advantage of not wasting extra space.
(In reply to Taras Glek (:taras) from comment #3)
> We could also ask Dr. Hipp to implement function indexes... That might not
> be more complicated than changing all of the callsites to deal with a new
> column, and it would have the advantage of not wasting extra space.

See "Hash Indexes" on this page: http://www2.sqlite.org/cvstrac/wiki?p=ProposedIncompatibleChanges

That said, the reasoning given there is weak for us: we may not care about in-order retrieval. In Places we order by URI in only one very specific and not very interesting case (clicking the Library url column).
Though that may be a perfectly valid reason why a hash index can't become the default choice in a generic storage layer like IndexedDB.
I am under the impression we can't use 32 bits, considering the number of urls in the world (http://www.worldwidewebsize.com/ says Google indexes about 50 billion pages; we can ignore the rest for a rough estimate). With 64 bits we expect 1 collision every 4 billion hashes; to expect 1 collision only after 70 billion hashes we need at least 72 bits.
(In reply to Marco Bonardo [:mak] from comment #6)
> I am under the impression we can't use 32 bits, considering the number of
> urls in the world (http://www.worldwidewebsize.com/ says Google indexes
> about 50 billion pages; we can ignore the rest for a rough estimate). With
> 64 bits we expect 1 collision every 4 billion hashes; to expect 1 collision
> only after 70 billion hashes we need at least 72 bits.

We'll just need a little extra logic to compare the actual url. Based on my history (42K urls), a 16-bit index would do.
Comparing the urls is still one (or more, depending on how many collisions we get) additional string compare we could avoid, though. It's the usual size vs. performance tradeoff; we can go either way. Fwiw I have 100k urls, and my urls are likely different from yours: each user owns a subset of all the urls on the web, and it's hard to predict collisions within a subset. 16 bits would give us 1 collision every 256 urls, which doesn't sound good even for your 42k pages. Telemetry will definitely help us tell the collision rate; we could start with 32 bits (likely to collide once in an average db) and see.
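For reference, the collision estimates traded in this exchange follow from the birthday bound; a short worked version (the approximation assumes an ideal, uniformly distributed hash):

```latex
% Birthday bound for an ideal b-bit hash: the first collision is
% expected after roughly sqrt(2^b) = 2^(b/2) distinct inputs.
n_{\text{collision}} \approx \sqrt{2^{b}} = 2^{b/2}
% b = 16: 2^{8}  = 256                      (one collision per ~256 urls)
% b = 64: 2^{32} \approx 4.3 \times 10^{9}  (~4 billion hashes)
% b = 72: 2^{36} \approx 6.9 \times 10^{10} (~70 billion hashes)
```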
(In reply to Marco Bonardo [:mak] from comment #8)
> 16 bits would give us 1 collision every 256 urls, which doesn't sound good
> even for your 42k pages. Telemetry will definitely help us tell the
> collision rate; we could start with 32 bits (likely to collide once in an
> average db) and see.

I think you are worrying too much about collisions. Even with a 1/256 collision rate, we are doing 256x less IO. I think we should just pick a hashing algo without worrying too much about collision performance.
I'm worrying because on a collision we will have to compare urls; that in itself is fine, except that the search becomes linear at that point. If I simply calculate 100k (pages in my db) / 256 (one expected collision every 256 inserts) = ~390, that means in the worst case SQLite will have to do a linear scan and string compare across 390 entries, and it will have to actually fetch those 390 entries for the scan, versus fetching and scanning 2 or 3 entries.
Attached patch wip (obsolete) (deleted) — Splinter Review
this implements a hash() SQL function based on what we already have in mfbt (we use it for internal hash tables and atoms). I ran it on a couple of profiles; on mine (106k entries) it returns 2 hash conflicts, so it appears to be quite good, and we know it's fast per the analysis done in bug 729952. See also http://mxr.mozilla.org/mozilla-central/source/mfbt/HashFunctions.h I think it's a good starting point for an experimental patch.
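To show the shape of such a function, here is a minimal sketch that registers hash() through the raw SQLite C API, wrapping mfbt's 32-bit HashString. The actual patch is wired through mozIStorage, so treat this as an illustration of the technique rather than the implementation:

```cpp
#include <sqlite3.h>
#include "mozilla/HashFunctions.h"  // mozilla::HashString, 32-bit result

// SQL usage: SELECT ... WHERE url_hash = hash(:url)
// Hashes the single text argument with the same function Gecko uses for
// its internal hash tables.
static void hashFunc(sqlite3_context* ctx, int argc, sqlite3_value** argv) {
  (void)argc;  // registered as a one-argument function below
  const unsigned char* text = sqlite3_value_text(argv[0]);
  if (!text) {
    sqlite3_result_null(ctx);
    return;
  }
  // HashString returns a 32-bit mozilla::HashNumber; widen it for SQLite.
  sqlite3_result_int64(ctx,
      static_cast<sqlite3_int64>(
          mozilla::HashString(reinterpret_cast<const char*>(text))));
}

// Register on an open connection as a deterministic one-argument function
// (SQLITE_DETERMINISTIC needs SQLite 3.8.3 or later).
int RegisterHashFunction(sqlite3* db) {
  return sqlite3_create_function(db, "hash", 1,
                                 SQLITE_UTF8 | SQLITE_DETERMINISTIC,
                                 nullptr, hashFunc, nullptr, nullptr);
}
```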
Assignee: nobody → mak77
Keywords: perf
Whiteboard: [triage]
Whiteboard: p=0
I investigated using the new WITHOUT ROWID feature of SQLite. Unfortunately it's not usable for our use-case: while it makes selecting by url faster and avoids duplicating the url into its own index, every other index on moz_places ends up duplicating the primary key value, which would now be the url. So my 70MB database becomes a 150MB one :( The use-cases for WITHOUT ROWID look fewer than I expected.

We can still use WITHOUT ROWID to save space and increase speed in other tables, where we currently have compound indices. I will file bugs for specific cases once I have some numbers.

I guess at this point the only plausible solution for this specific issue is the hash. This will break some use-cases:

- uriIsPrefix queries, basically queries getting all pages beginning with a given string, will become extremely slow. The feature will have to be dropped (it would end up scanning hundreds of thousands of urls one by one).
- for basically the same reason, we won't be able to query entries by protocol with tricks like "url BETWEEN 'file://' AND 'file:/~'" or "url BETWEEN 'place:' AND 'place;'"; we'll need a further column to filter these. For the latter case we could make the hash be "protocol:hash", though it would take a bit more space.
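To illustrate why WITHOUT ROWID backfires here, consider a reduced, hypothetical version of the schema (a simplified column set, not the real moz_places definition):

```cpp
// In a normal rowid table, each secondary index entry is (key, rowid):
// the rowid is a small varint, a few bytes per row. In a WITHOUT ROWID
// table with PRIMARY KEY (url), each secondary index entry becomes
// (key, url): the full url string is copied into every index.
const char* kRowidSchema =
    "CREATE TABLE moz_places ("
    "  id INTEGER PRIMARY KEY,"  // alias for the rowid
    "  url TEXT UNIQUE,"         // the implicit unique index this bug is about
    "  frecency INTEGER"
    ");"
    "CREATE INDEX by_frecency ON moz_places(frecency);";  // stores (frecency, id)

const char* kWithoutRowidSchema =
    "CREATE TABLE moz_places ("
    "  url TEXT PRIMARY KEY,"    // url is now the row key
    "  frecency INTEGER"
    ") WITHOUT ROWID;"
    "CREATE INDEX by_frecency ON moz_places(frecency);";  // stores (frecency, url)
```

With several secondary indices on moz_places, each one would carry its own copy of every url, which is consistent with the observed 70MB to 150MB jump.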
Blocks: PlacesDiet
Blocks: fxdesktopbacklog
No longer blocks: fxdesktoptriage
No longer blocks: fxdesktopbacklog
Flags: firefox-backlog+
Summary: moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong
Whiteboard: p=0 → p=3
Whiteboard: p=3 → p=3 [qa-]
Points: --- → 3
Flags: qe-verify-
Whiteboard: p=3 [qa-]
likely more 5 points than 3.
Points: 3 → 5
Priority: -- → P1
Blocks: 1214723
Priority: P1 → P2
Whiteboard: [fxsearch]
Depends on: 1158511
These are the cases we need to address, since they won't work anymore in the new index-less world; or rather, they will keep working, but will be slow as hell.

/toolkit/components/places/nsNavHistory.cpp
line 1030 -- "WHEN url BETWEEN 'place:' AND 'place;' "
line 1824 -- "AND h.url BETWEEN 'file://' AND 'file:/~' "
line 3373 -- clause.Condition("h.url >= ").Param(":uri")
line 3374 -- .Condition("h.url <= ").Param(":uri_upper");
line 3464 -- clause.Condition("NOT h.url BETWEEN 'place:' AND 'place;'");

/toolkit/components/places/History.jsm
line 481 -- WHEN url BETWEEN 'place:' AND 'place;'

/toolkit/components/places/SQLFunctions.cpp
line 491 -- "(url > 'place:' AND url < 'place;') "

/dom/push/PushRecord.jsm
line 154 -- p.url >= :urlLowerBound AND p.url <= :urlUpperBound AND

To fix the place: and file: checks, I'll likely need to add a prefix to the hashes, but that should be fine since it's just a few entries. The other history checks are due to the uriIsPrefix query option, which we never actually used in the product, though it looks like the addon-sdk exposes it through createQuery: http://mxr.mozilla.org/mozilla-central/source/addon-sdk/source/lib/sdk/places/utils.js#143 I'll file a separate bug to remove uriIsPrefix and verify the impact there.

The most problematic one, because I'm not sure what it is doing, is dom/push/PushRecord.jsm. It looks like it's trying to query by origin; I think we can restrict the records using the rev_host column and then it should be fine. Again, this needs a separate bug.
Depends on: 1243778
Depends on: 1243779
Summary: Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → moz_places_url_uniqueindex takes up to 1/3 of places.sqlite
(In reply to Marco Bonardo [::mak] from comment #14)
> The most problematic one, because I'm not sure what it is doing, is
> dom/push/PushRecord.jsm. It looks like it's trying to query by origin; I
> think we can restrict the records using the rev_host column and then it
> should be fine. Again, this needs a separate bug.

Yup, it's querying the last visit to a given origin. The push code uses this to calculate the background message quota, which changes based on the number of days since the user last visited the site.

Would it be sufficient to replace `p.url` with `rev_host` and `PlacesUtils.getReversedHost`? Would we need to restrict the query to account for the scheme and port?

Thanks for filing the bug!
(In reply to Kit Cambridge [:kitcambridge] from comment #15)
> Yup, it's querying the last visit to a given origin. The push code uses this
> to calculate the background message quota, which changes based on the number
> of days since the user last visited the site.
>
> Would it be sufficient to replace `p.url` with `rev_host` and
> `PlacesUtils.getReversedHost`? Would we need to restrict the query to
> account for the scheme and port?

The things that will be slow in the "new world" are queries doing "url >", "url <", or "url BETWEEN", so I suppose that would do. Whether it should also be restricted by scheme and port depends on the push design requirements. Btw, let's keep the discussion in the specific Push bug; I will try to look into it soon and we can evaluate what to do.
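A sketch of the kind of rewrite being discussed here; the query text and parameter name are assumptions for illustration, not the actual Push patch (rev_host stores the host reversed with a trailing period, e.g. 'moc.elpmaxe.' for example.com):

```cpp
// Instead of a range scan on url (whose index is going away), restrict
// by rev_host, which keeps its own index; any remaining scheme/port
// filtering can then run over the few matching rows.
const char* kLastVisitToOrigin =
    "SELECT MAX(v.visit_date) "
    "FROM moz_places p "
    "JOIN moz_historyvisits v ON v.place_id = p.id "
    "WHERE p.rev_host = :rev_host";  // e.g. 'moc.elpmaxe.'
```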
Attachment #814619 - Attachment is obsolete: true
Comment on attachment 8749808 [details] Bug 889561 - moz_places_url_uniqueindex takes up too much database space Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/1-2/
Comment on attachment 8749808 [details] Bug 889561 - moz_places_url_uniqueindex takes up too much database space Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/2-3/
Attachment #8749808 - Attachment description: MozReview Request: Bug 889561 - moz_places_url_uniqueindex takes up too much database space → Bug 889561 - moz_places_url_uniqueindex takes up too much database space
Attachment #8749808 - Attachment is obsolete: true
The patch adds a hash() SQL function that accepts up to 2 arguments: the first is the string to hash, the second (optional) is a hashing mode (currently prefix_hi or prefix_lo, for querying schemes). Maintenance will take care of any misses caused by add-ons.

It also changes expiration to adapt better to each individual db, while trying to limit the db size to 60MiB (the previous 160MiB limit was unrealistic). We should take a close look at telemetry after the change. When we move favicons out of the db we can likely improve this further.
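Given that description, the scheme-restricted queries from comment 14 can presumably be rewritten as range checks over the hash column; a sketch of the idea (the exact mode semantics and the url_hash column name are inferred from this comment, not copied from the patch):

```cpp
// With a mode argument, hash() returns the lower/upper bound of the hash
// range shared by all urls with the given scheme, so a filter like
//   url BETWEEN 'place:' AND 'place;'
// can become an indexed range scan on the hash column. Since different
// urls may share a hash range, the url check is kept as a final filter
// over the few candidate rows.
const char* kPlaceQueriesByHash =
    "SELECT id, url FROM moz_places "
    "WHERE url_hash BETWEEN hash('place', 'prefix_lo') "
    "                   AND hash('place', 'prefix_hi') "
    "  AND url BETWEEN 'place:' AND 'place;'";
```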
Comment on attachment 8763819 [details] Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59938/diff/1-2/
Comment on attachment 8763819 [details] Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places. https://reviewboard.mozilla.org/r/59938/#review57468 This looks great, I don't see any problems. Nice work! ::: toolkit/components/places/SQLFunctions.cpp:882 (Diff revision 2) > + > + RefPtr<nsVariant> result = new nsVariant(); > + if (mode.IsEmpty()) { > + // URI-like strings (having a prefix before a colon), are handled specially, > + // as a 48 bit hash, where first 16 bits are the prefix hash, while the > + // other 48 are the string hash. Typo: should be "other 32" right?
Attachment #8763819 - Flags: review?(adw) → review+
(In reply to Drew Willcoxon :adw from comment #23)
> Typo: should be "other 32" right?

yep. I'll fix that as soon as I can recover a working tree; some recent pull broke it in a dom subfolder I never touched...
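For readers following the diff comment quoted above, the layout being described is presumably something like the following; a sketch inferred from that comment, not code from the patch:

```cpp
#include <cstdint>

// A 48-bit value: the 16-bit prefix (scheme) hash sits in the high bits
// and the 32-bit full-string hash in the low bits. This layout is what
// lets prefix_lo/prefix_hi bracket all hashes sharing a scheme.
uint64_t PrefixedUrlHash(uint32_t prefixHash, uint32_t stringHash) {
  return (uint64_t(prefixHash & 0xFFFF) << 32) | uint64_t(stringHash);
}
```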
Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/fx-team/rev/ceff61c9fc5a Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw
Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/fx-team/rev/07e1f152fe92 followup - Missing changes to Bookmarks.jsm. r=post-facto
This caused frequent failures on PGO builds, like https://treeherder.mozilla.org/logviewer.html#?job_id=10160078&repo=fx-team, so I had to back it out, sorry.
Flags: needinfo?(mak77)
Backout by cbook@mozilla.com: https://hg.mozilla.org/integration/fx-team/rev/3fa6d81f720a Backed out changeset 07e1f152fe92 https://hg.mozilla.org/integration/fx-team/rev/cda8c83f6a57 Backed out changeset ceff61c9fc5a for frequent testfailures on pgo in /bookmarks/test_
Testing a fix on Try. If I'm right, the problem is that history tries to start expiration very late during shutdown, and that now fails because at expiration startup we query the database, which is already closed at that point, so we bail out with an exception. This happens mostly in tests; it's unlikely to happen in reality. The fix only requires teaching our AsyncShutdown blockers to properly handle the exception we throw when we can't open the db because it's too late.
Flags: needinfo?(mak77)
Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/fx-team/rev/21585b3e4814 Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
After this change landed, an alert was raised for the telemetry probe PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS [1]. Looking at this patch, it seems the change to nsPlacesExpiration's DATABASE_MAX_SIZE resulted in a (presumably temporary) change in how long uris stay in the database before being expired?
(In reply to Chris H-C :chutten from comment #33)
> After this change landed, an alert was raised for the telemetry probe
> PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS [1]. Looking at this patch, it seems
> the change to nsPlacesExpiration's DATABASE_MAX_SIZE resulted in a
> (presumably temporary) change in how long uris stay in the database before
> being expired?

Yes, we are constantly watching that probe to find the right compromise for our performance work. Its purpose is to verify that the expiration changes are not causing large data loss. The only meaningful regression would be if the value became *much* lower than it is now, since then we would actually be losing a meaningful amount of the user's history. I'm sure there will be further changes in the coming months, since we are trying to make places.sqlite smaller.
Blocks: 1296958
Depends on: 1318650
Blocks: 977053