Closed Bug 889561 Opened 11 years ago Closed 8 years ago

moz_places_url_uniqueindex takes up to 1/3 of places.sqlite

Categories

(Toolkit :: Places, defect, P2)

Platform: x86_64 Windows 8

Points: 5

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla50
Tracking Status: firefox50 --- fixed

People

(Reporter: taras.mozilla, Assigned: mak)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [fxsearch])

Attachments

(1 file, 2 obsolete files)

No description provided.
So, we need the url index for two reasons:

1. uniqueness check (internal functionality relies on this precondition)
2. actual queries doing WHERE url = some_url (many APIs rely on this)

We don't exactly need a url index that replicates the whole url (which is what SQLite does); we just need the two points above. A good hashing algorithm (the mfbt one, CityHash, or MurmurHash) could easily provide the same benefits while taking just 64 bits per entry in the ideal case (to be verified). This could reduce the database size by 15-20%.

So the plan of action here may be:

1. Figure out a good hashing algo. We could add telemetry probes to the current method that adds URIs to the db and report collisions on Nightly/Aurora (this requires some changes to the db, but if we plan the change that's fine; we can just add the url_hash column earlier).
2. Remove moz_places_url_uniqueindex and instead add a unique index on the url_hash column. Measure the db size win. The hashing may either be done at the Places level or (better) implemented as a SQLite function.
3. Fix queries to compare hashes instead of plain urls.
note: I said 64 bits because we'd need a 32-bit integer column for the hash, plus an index replicating it.
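For concreteness, here is a minimal sketch of the schema change step 2 above describes, using the raw SQLite C API. The url_hash column name comes from the comment above; the index name and everything else are illustrative assumptions, not the actual patch. The index is left non-unique because, as discussed below, hash collisions are possible, and the collision rate is exactly what the telemetry step would measure:

```cpp
// Sketch only: add a hypothetical url_hash column to moz_places and index
// it in place of the full-url unique index. Names are illustrative.
#include <sqlite3.h>
#include <cstdio>

int main() {
  sqlite3* db = nullptr;
  if (sqlite3_open("places.sqlite", &db) != SQLITE_OK) return 1;

  char* err = nullptr;
  int rc = sqlite3_exec(db,
      // A 32-bit hash in an INTEGER column: the index then stores small
      // integers instead of copies of full url strings.
      "ALTER TABLE moz_places ADD COLUMN url_hash INTEGER NOT NULL DEFAULT 0;"
      "CREATE INDEX moz_places_url_hashindex ON moz_places(url_hash);",
      nullptr, nullptr, &err);
  if (rc != SQLITE_OK) {
    std::fprintf(stderr, "schema change failed: %s\n", err);
    sqlite3_free(err);
  }
  sqlite3_close(db);
  return rc == SQLITE_OK ? 0 : 1;
}
```

A lookup then becomes WHERE url_hash = hash(:url) AND url = :url, so the index narrows the search to a handful of candidate rows and the string compare only runs on those.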
(In reply to Marco Bonardo [:mak] from comment #2)
> The hashing may either be done at the Places level or (better) implemented
> as a SQLite function.
> note: I said 64 bits because we'd need a 32-bit integer column for the hash,
> plus an index replicating it.

We could also ask Dr. Hipp to implement function indexes... That might not be more complicated than changing all of the callsites to deal with a new column, and it would have the advantage of not wasting extra space.
(In reply to Taras Glek (:taras) from comment #3)
> We could also ask Dr. Hipp to implement function indexes... That might not
> be more complicated than changing all of the callsites to deal with a new
> column, and it would have the advantage of not wasting extra space.

See "Hash Indexes" on this page: http://www2.sqlite.org/cvstrac/wiki?p=ProposedIncompatibleChanges

That said, the reasoning given there is weak for us: we may not care about in-order retrieval. In Places we order by URI in only one very specific and not very interesting case (clicking the Library url column).
Though that may be a perfectly valid reason why a hash index can't become the default choice in a generic storage layer like IndexedDB.
I am under the impression we can't use 32 bits, considering the number of urls in the world (http://www.worldwidewebsize.com/ says Google indexes about 50 billion pages; we can ignore the rest for a rough estimate). With 64 bits we expect 1 collision every 4 billion hashes; to expect 1 collision only after 70 billion hashes we need at least 72 bits.
(In reply to Marco Bonardo [:mak] from comment #6)
> I am under the impression we can't use 32 bits, considering the number of
> urls in the world (http://www.worldwidewebsize.com/ says Google indexes
> about 50 billion pages; we can ignore the rest for a rough estimate). With
> 64 bits we expect 1 collision every 4 billion hashes; to expect 1 collision
> only after 70 billion hashes we need at least 72 bits.

We'll just need a little extra logic to compare the actual url. Based on my history (42K urls), a 16-bit index would do.
Comparing the urls is still one (or more, depending on how many collisions we get) additional string compare we could avoid, though. It's the usual size vs. performance tradeoff; we can go either way. Fwiw I have 100k urls, and my urls are likely different from yours: each user owns a subset of all the urls on the web, and it's hard to predict collisions within a subset. 16 bits would give us 1 collision every 256 urls, which doesn't sound good even for your 42k pages. Telemetry will definitely help us tell the collision rate; we could start with 32 bits (likely to collide once in an average db) and see.
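For reference, the collision estimates traded in this exchange follow from the birthday bound; a short worked version (the approximation assumes an ideal, uniformly distributed hash):

```latex
% Birthday bound for an ideal b-bit hash: the first collision is
% expected after roughly sqrt(2^b) = 2^(b/2) distinct inputs.
n_{\text{collision}} \approx \sqrt{2^{b}} = 2^{b/2}
% b = 16: 2^{8}  = 256                      (one collision per ~256 urls)
% b = 64: 2^{32} \approx 4.3 \times 10^{9}  (~4 billion hashes)
% b = 72: 2^{36} \approx 6.9 \times 10^{10} (~70 billion hashes)
```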
(In reply to Marco Bonardo [:mak] from comment #8)
> 16 bits would give us 1 collision every 256 urls, which doesn't sound good
> even for your 42k pages. Telemetry will definitely help us tell the
> collision rate; we could start with 32 bits (likely to collide once in an
> average db) and see.

I think you are worrying too much about collisions. Even with a 1/256 collision rate, we are doing 256x less IO. I think we should just pick a hashing algo without worrying too much about collision performance.
I'm worrying because on a collision we will have to compare urls; that in itself is fine, except that the search becomes linear at that point. If I simply calculate 100k (pages in my db) / 256 (one expected collision every 256 inserts) = ~390, that means in the worst case SQLite will have to do a linear scan and string compare across 390 entries, and it will have to actually fetch those 390 entries for the scan, versus fetching and scanning 2 or 3 entries.
Attached patch wip (obsolete) (deleted) — Splinter Review
this implements a hash() SQL function based on what we already have in mfbt (we use it for internal hash tables and atoms). I ran it on a couple of profiles; on mine (106k entries) it returns 2 hash conflicts, so it appears to be quite good, and we know it's fast per the analysis done in bug 729952. See also http://mxr.mozilla.org/mozilla-central/source/mfbt/HashFunctions.h I think it's a good starting point for an experimental patch.
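To show the shape of such a function, here is a minimal sketch that registers hash() through the raw SQLite C API, wrapping mfbt's 32-bit HashString. The actual patch is wired through mozIStorage, so treat this as an illustration of the technique rather than the implementation:

```cpp
#include <sqlite3.h>
#include "mozilla/HashFunctions.h"  // mozilla::HashString, 32-bit result

// SQL usage: SELECT ... WHERE url_hash = hash(:url)
// Hashes the single text argument with the same function Gecko uses for
// its internal hash tables.
static void hashFunc(sqlite3_context* ctx, int argc, sqlite3_value** argv) {
  (void)argc;  // registered as a one-argument function below
  const unsigned char* text = sqlite3_value_text(argv[0]);
  if (!text) {
    sqlite3_result_null(ctx);
    return;
  }
  // HashString returns a 32-bit mozilla::HashNumber; widen it for SQLite.
  sqlite3_result_int64(ctx,
      static_cast<sqlite3_int64>(
          mozilla::HashString(reinterpret_cast<const char*>(text))));
}

// Register on an open connection as a deterministic one-argument function
// (SQLITE_DETERMINISTIC needs SQLite 3.8.3 or later).
int RegisterHashFunction(sqlite3* db) {
  return sqlite3_create_function(db, "hash", 1,
                                 SQLITE_UTF8 | SQLITE_DETERMINISTIC,
                                 nullptr, hashFunc, nullptr, nullptr);
}
```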
Assignee: nobody → mak77
Keywords: perf
Whiteboard: [triage]
Whiteboard: p=0
I investigated using the new WITHOUT ROWID feature of SQLite. Unfortunately it's not usable for our use-case: while it makes selecting by url faster and avoids duplicating the url into its own index, every other index on moz_places ends up duplicating the primary key value, which would now be the url. So my 70MB database becomes a 150MB one :( The use-cases for WITHOUT ROWID look fewer than I expected.

We can still use WITHOUT ROWID to save space and increase speed in other tables, where we currently have compound indices. I will file bugs for specific cases once I have some numbers.

I guess at this point the only plausible solution for this specific issue is the hash. This will break some use-cases:

- uriIsPrefix queries, basically queries getting all pages beginning with a given string, will become extremely slow. The feature will have to be dropped (it would end up scanning hundreds of thousands of urls one by one).
- for basically the same reason, we won't be able to query entries by protocol with tricks like "url BETWEEN 'file://' AND 'file:/~'" or "url BETWEEN 'place:' AND 'place;'"; we'll need a further column to filter these. For the latter case we could make the hash be "protocol:hash", though it would take a bit more space.
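To illustrate why WITHOUT ROWID backfires here, consider a reduced, hypothetical version of the schema (a simplified column set, not the real moz_places definition):

```cpp
// In a normal rowid table, each secondary index entry is (key, rowid):
// the rowid is a small varint, a few bytes per row. In a WITHOUT ROWID
// table with PRIMARY KEY (url), each secondary index entry becomes
// (key, url): the full url string is copied into every index.
const char* kRowidSchema =
    "CREATE TABLE moz_places ("
    "  id INTEGER PRIMARY KEY,"  // alias for the rowid
    "  url TEXT UNIQUE,"         // the implicit unique index this bug is about
    "  frecency INTEGER"
    ");"
    "CREATE INDEX by_frecency ON moz_places(frecency);";  // stores (frecency, id)

const char* kWithoutRowidSchema =
    "CREATE TABLE moz_places ("
    "  url TEXT PRIMARY KEY,"    // url is now the row key
    "  frecency INTEGER"
    ") WITHOUT ROWID;"
    "CREATE INDEX by_frecency ON moz_places(frecency);";  // stores (frecency, url)
```

With several secondary indices on moz_places, each one would carry its own copy of every url, which is consistent with the observed 70MB to 150MB jump.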
Blocks: PlacesDiet
Blocks: fxdesktopbacklog
No longer blocks: fxdesktoptriage
No longer blocks: fxdesktopbacklog
Flags: firefox-backlog+
Summary: moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong
Whiteboard: p=0 → p=3
Whiteboard: p=3 → p=3 [qa-]
Points: --- → 3
Flags: qe-verify-
Whiteboard: p=3 [qa-]
likely more 5 points than 3.
Points: 3 → 5
Priority: -- → P1
Blocks: 1214723
Priority: P1 → P2
Whiteboard: [fxsearch]
Depends on: 1158511
These are the cases we need to address, since they won't work anymore in the new index-less world; or rather, they will keep working, but will be slow as hell.

/toolkit/components/places/nsNavHistory.cpp
line 1030 -- "WHEN url BETWEEN 'place:' AND 'place;' "
line 1824 -- "AND h.url BETWEEN 'file://' AND 'file:/~' "
line 3373 -- clause.Condition("h.url >= ").Param(":uri")
line 3374 -- .Condition("h.url <= ").Param(":uri_upper");
line 3464 -- clause.Condition("NOT h.url BETWEEN 'place:' AND 'place;'");

/toolkit/components/places/History.jsm
line 481 -- WHEN url BETWEEN 'place:' AND 'place;'

/toolkit/components/places/SQLFunctions.cpp
line 491 -- "(url > 'place:' AND url < 'place;') "

/dom/push/PushRecord.jsm
line 154 -- p.url >= :urlLowerBound AND p.url <= :urlUpperBound AND

To fix the place: and file: checks, I'll likely need to add a prefix to the hashes, but that should be fine since it's just a few entries. The other history checks are due to the uriIsPrefix query option, which we never actually used in the product, though it looks like the addon-sdk exposes it through createQuery: http://mxr.mozilla.org/mozilla-central/source/addon-sdk/source/lib/sdk/places/utils.js#143 I'll file a separate bug to remove uriIsPrefix and verify the impact there.

The most problematic one, because I'm not sure what it is doing, is dom/push/PushRecord.jsm. It looks like it's trying to query by origin; I think we can restrict the records using the rev_host column and then it should be fine. Again, this needs a separate bug.
Depends on: 1243778
Depends on: 1243779
Summary: Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → moz_places_url_uniqueindex takes up to 1/3 of places.sqlite
(In reply to Marco Bonardo [::mak] from comment #14)
> The most problematic one, because I'm not sure what it is doing, is
> dom/push/PushRecord.jsm. It looks like it's trying to query by origin; I
> think we can restrict the records using the rev_host column and then it
> should be fine. Again, this needs a separate bug.

Yup, it's querying the last visit to a given origin. The push code uses this to calculate the background message quota, which changes based on the number of days since the user last visited the site.

Would it be sufficient to replace `p.url` with `rev_host` and `PlacesUtils.getReversedHost`? Would we need to restrict the query to account for the scheme and port?

Thanks for filing the bug!
(In reply to Kit Cambridge [:kitcambridge] from comment #15)
> Yup, it's querying the last visit to a given origin. The push code uses this
> to calculate the background message quota, which changes based on the number
> of days since the user last visited the site.
>
> Would it be sufficient to replace `p.url` with `rev_host` and
> `PlacesUtils.getReversedHost`? Would we need to restrict the query to
> account for the scheme and port?

The things that will be slow in the "new world" are queries doing "url >", "url <", or "url BETWEEN", so I suppose that would do. Whether it should also be restricted by scheme and port depends on the push design requirements. Btw, let's keep the discussion in the specific Push bug; I will try to look into it soon and we can evaluate what to do.
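A sketch of the kind of rewrite being discussed here; the query text and parameter name are assumptions for illustration, not the actual Push patch (rev_host stores the host reversed with a trailing period, e.g. 'moc.elpmaxe.' for example.com):

```cpp
// Instead of a range scan on url (whose index is going away), restrict
// by rev_host, which keeps its own index; any remaining scheme/port
// filtering can then run over the few matching rows.
const char* kLastVisitToOrigin =
    "SELECT MAX(v.visit_date) "
    "FROM moz_places p "
    "JOIN moz_historyvisits v ON v.place_id = p.id "
    "WHERE p.rev_host = :rev_host";  // e.g. 'moc.elpmaxe.'
```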
Attachment #814619 - Attachment is obsolete: true
Comment on attachment 8749808 [details] Bug 889561 - moz_places_url_uniqueindex takes up too much database space Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/1-2/
Comment on attachment 8749808 [details] Bug 889561 - moz_places_url_uniqueindex takes up too much database space Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/2-3/
Attachment #8749808 - Attachment description: MozReview Request: Bug 889561 - moz_places_url_uniqueindex takes up too much database space → Bug 889561 - moz_places_url_uniqueindex takes up too much database space
Attachment #8749808 - Attachment is obsolete: true
The patch adds a hash() SQL function that accepts up to 2 arguments: the first is the string to hash, the second (optional) is a hashing mode (currently prefix_hi or prefix_lo, for querying schemes). Maintenance will take care of any misses caused by add-ons.

It also changes expiration to adapt better to each individual db, while trying to limit the db size to 60MiB (the previous 160MiB limit was unrealistic). We should take a close look at telemetry after the change. When we move favicons out of the db we can likely improve this further.
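Given that description, the scheme-restricted queries from comment 14 can presumably be rewritten as range checks over the hash column; a sketch of the idea (the exact mode semantics and the url_hash column name are inferred from this comment, not copied from the patch):

```cpp
// With a mode argument, hash() returns the lower/upper bound of the hash
// range shared by all urls with the given scheme, so a filter like
//   url BETWEEN 'place:' AND 'place;'
// can become an indexed range scan on the hash column. Since different
// urls may share a hash range, the url check is kept as a final filter
// over the few candidate rows.
const char* kPlaceQueriesByHash =
    "SELECT id, url FROM moz_places "
    "WHERE url_hash BETWEEN hash('place', 'prefix_lo') "
    "                   AND hash('place', 'prefix_hi') "
    "  AND url BETWEEN 'place:' AND 'place;'";
```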
Comment on attachment 8763819 [details] Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places. Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59938/diff/1-2/
Comment on attachment 8763819 [details] Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places. https://reviewboard.mozilla.org/r/59938/#review57468 This looks great, I don't see any problems. Nice work! ::: toolkit/components/places/SQLFunctions.cpp:882 (Diff revision 2) > + > + RefPtr<nsVariant> result = new nsVariant(); > + if (mode.IsEmpty()) { > + // URI-like strings (having a prefix before a colon), are handled specially, > + // as a 48 bit hash, where first 16 bits are the prefix hash, while the > + // other 48 are the string hash. Typo: should be "other 32" right?
Attachment #8763819 - Flags: review?(adw) → review+
(In reply to Drew Willcoxon :adw from comment #23)
> Typo: should be "other 32" right?

yep. I'll fix that as soon as I can recover a working tree; some recent pull broke it in a dom subfolder I never touched...
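For readers following the diff comment quoted above, the layout being described is presumably something like the following; a sketch inferred from that comment, not code from the patch:

```cpp
#include <cstdint>

// A 48-bit value: the 16-bit prefix (scheme) hash sits in the high bits
// and the 32-bit full-string hash in the low bits. This layout is what
// lets prefix_lo/prefix_hi bracket all hashes sharing a scheme.
uint64_t PrefixedUrlHash(uint32_t prefixHash, uint32_t stringHash) {
  return (uint64_t(prefixHash & 0xFFFF) << 32) | uint64_t(stringHash);
}
```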
Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/fx-team/rev/ceff61c9fc5a Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw
Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/fx-team/rev/07e1f152fe92 followup - Missing changes to Bookmarks.jsm. r=post-facto
This caused frequent failures on PGO builds, like https://treeherder.mozilla.org/logviewer.html#?job_id=10160078&repo=fx-team, so I had to back it out, sorry.
Flags: needinfo?(mak77)
Backout by cbook@mozilla.com: https://hg.mozilla.org/integration/fx-team/rev/3fa6d81f720a Backed out changeset 07e1f152fe92 https://hg.mozilla.org/integration/fx-team/rev/cda8c83f6a57 Backed out changeset ceff61c9fc5a for frequent testfailures on pgo in /bookmarks/test_
Testing a fix on Try. If I'm right, the problem is that history tries to start expiration very late during shutdown, and that now fails because at expiration startup we query the database, which is already closed at that point, so we bail out with an exception. This happens mostly in tests; it's unlikely to happen in reality. The fix only requires teaching our AsyncShutdown blockers to properly handle the exception we throw when we can't open the db because it's too late.
Flags: needinfo?(mak77)
Pushed by mak77@bonardo.net: https://hg.mozilla.org/integration/fx-team/rev/21585b3e4814 Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
After this change landed, an alert was raised for the telemetry probe PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS [1]. Looking at this patch, it seems the change to nsPlacesExpiration's DATABASE_MAX_SIZE resulted in a (presumably temporary) change in how long uris stay in the database before being expired?
(In reply to Chris H-C :chutten from comment #33)
> After this change landed, an alert was raised for the telemetry probe
> PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS [1]. Looking at this patch, it seems
> the change to nsPlacesExpiration's DATABASE_MAX_SIZE resulted in a
> (presumably temporary) change in how long uris stay in the database before
> being expired?

Yes, we are constantly watching that probe to find the right compromise for our performance work. Its purpose is to verify that the expiration changes are not causing large data loss. The only meaningful regression would be if the value became *much* lower than it is now, since then we would actually be losing a meaningful amount of the user's history. I'm sure there will be further changes in the coming months, since we are trying to make places.sqlite smaller.
Blocks: 1296958
Depends on: 1318650
Blocks: 977053