Closed
Bug 889561
Opened 11 years ago
Closed 8 years ago
moz_places_url_uniqueindex takes up to 1/3 of places.sqlite
Categories
(Toolkit :: Places, defect, P2)
Tracking
RESOLVED
FIXED
mozilla50
| | Tracking | Status |
|---|---|---|
| firefox50 | --- | fixed |
People
(Reporter: taras.mozilla, Assigned: mak)
References
(Blocks 1 open bug)
Details
(Keywords: perf, Whiteboard: [fxsearch])
Attachments
(1 file, 2 obsolete files)
No description provided.
Assignee
Comment 1•11 years ago
So, we need the url index for 2 reasons:
1. uniqueness check (internal functionality relies on this precondition)
2. actual queries doing WHERE url = some_url (many APIs rely on this)
Though we don't exactly need a URL index that replicates the whole URL (which is what SQLite does); we just need the 2 points above.
A good hashing algorithm (the mfbt one, CityHash, or MurmurHash) may easily provide the same benefits, taking just 64 bits per entry in the ideal case (to be verified).
This may reduce the database size by 15-20%.
So the plan of action here may be (see the sketch after this list):
1. Figure out a good hashing algorithm. We may add some telemetry probes to the current method that adds URIs to the db and report collisions in Nightly/Aurora (this requires some changes to the db, but if we plan the change that's fine; we can just add the url_hash column earlier).
2. remove the url_uniqueindex, add instead a unique index on the url_hash column. Measure the db size win. The hashing may either be done at the Places level or (better) implemented as a SQLite function.
3. Fix queries to actually compare hashes instead of plain urls.
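(A rough sketch in SQL of steps 2 and 3; the column and index names here are made up for illustration, and the hash() SQL function doesn't exist yet — it's what this bug would add:)

```sql
-- Step 2 (sketch): store a hash of the url and index that instead of the url.
-- As proposed above the index is UNIQUE; collision handling is debated below.
ALTER TABLE moz_places ADD COLUMN url_hash INTEGER NOT NULL DEFAULT 0;
UPDATE moz_places SET url_hash = hash(url);
DROP INDEX moz_places_url_uniqueindex;
CREATE UNIQUE INDEX moz_places_url_hashindex ON moz_places (url_hash);

-- Step 3 (sketch): look up by hash first, then confirm the url to rule out
-- a false match from a hash collision.
SELECT id FROM moz_places WHERE url_hash = hash(:url) AND url = :url;
```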
Assignee
Comment 2•11 years ago
Note: I said 64 bits because we'd need a 32-bit integer column for the hash plus an index replicating it, so 32 + 32 bits per entry.
Reporter
Comment 3•11 years ago
(In reply to Marco Bonardo [:mak] from comment #2)
>The hashing may either be done at the Places level or (better) implemented as a SQLite function.
> note: I said 64bits, cause we'd need a 32 bits integer column for the hash
> and an index replicating it.
We could also ask Dr. Hipp to implement function indexes... That might not be more complicated than changing all of the call sites to deal with a new column, and it would have the advantage of not wasting extra space.
Assignee
Comment 4•11 years ago
(In reply to Taras Glek (:taras) from comment #3)
> We could also ask Dr Hipp to implement function indexes...That might not be
> more complicated than changing all of the callsites to deal with a new
> column and have the advantage of not wasting extra space
See Hash Indexes on this page:
http://www2.sqlite.org/cvstrac/wiki?p=ProposedIncompatibleChanges
That said, the reasoning given there is weak for us, in the sense that we may not care about in-order retrieval; in Places we order by URI only in one very specific and not very interesting case (clicking the URL column in the Library).
Assignee
Comment 5•11 years ago
Though that may be a very valid reason why a hash index can't become the default choice in generic storage like IndexedDB.
Assignee
Comment 6•11 years ago
I am under the impression we can't use 32 bits, considering the number of URLs in the world (http://www.worldwidewebsize.com/ says Google indexes about 50 billion pages; we may ignore the rest for a rough estimate). With 64 bits we get 1 collision every 4 billion hashes; to get 1 collision out of 70 billion hashes we need at least 72 bits.
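(For context, these thresholds come from the standard birthday estimate; this derivation is added here and is not part of the original comment. For a b-bit hash over n items:)

```latex
E[\text{collisions}] \approx \frac{n^2}{2^{\,b+1}},
\qquad
n_{\text{likely}} \approx 2^{b/2}
```

So a first collision becomes likely around 2^32 ≈ 4.3 billion items for a 64-bit hash, and around 2^36 ≈ 69 billion for 72 bits, matching the figures above.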
Reporter
Comment 7•11 years ago
(In reply to Marco Bonardo [:mak] from comment #6)
> I am under the impression we can't use 32 bits, considered number of urls in
> the world (http://www.worldwidewebsize.com/ says about 50 billions indexed
> pages on Google, we may ignore the rest for a rough estimate), with 64 bits
> we get 1 collision every 4 billions hashes, to have 1 collision out of 70
> billions hashes we need at least 72 bits.
We'll just need a little extra logic to compare the actual URL. Based on my history (42K URLs), a 16-bit index would do.
Assignee
Comment 8•11 years ago
Comparing the URLs is still one additional string compare (or more, depending on how many collisions we have) that we could avoid, though.
It's the usual size vs. performance trade-off; we can decide either direction.
FWIW I have 100k URLs, and my URLs are likely different from yours; each user owns a subset of all the URLs on the web, and it's hard to predict collisions within a subset.
16 bits would give us 1 collision every 256 URLs, which doesn't sound too good even for your 40k pages. Telemetry will definitely help tell us the collision rate; we could start with 32 bits (likely to collide once in an average db) and see.
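(A quick back-of-the-envelope check of the 32-bit estimate using the birthday formula from comment 6; this arithmetic is added for illustration:)

```latex
E[\text{collisions}] \approx \frac{(10^5)^2}{2^{33}} \approx 1.16
```

i.e. a 100k-entry db with a 32-bit hash should indeed see roughly one collision.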
Reporter
Comment 9•11 years ago
(In reply to Marco Bonardo [:mak] from comment #8)
> 16 bits would give us 1 collision every 256 urls, doesn't sound too good
> even for your 40k pages. Definitely telemetry will help telling us
> collisions rate, we could start with 32 (likely to collide once in an
> average db) and see.
I think you are worrying too much about collisions. Even with a 1/256 collision rate, we are doing 256x less I/O.
I think we should just pick a hashing algorithm without worrying too much about collision performance.
Assignee
Comment 10•11 years ago
I'm worrying because on a collision we will have to compare URLs; that's fine, except that the search becomes linear at that point.
If I simply calculate 100k (pages in my db) / 256 (the collision figure above) ≈ 390, that means in the worst case SQLite will have to do a linear scan and string compare across 390 entries, and it will have to actually fetch those 390 entries for the linear scan, versus a linear scan and a fetch of 2 or 3 entries.
Assignee
Comment 11•11 years ago
This implements a hash() SQL function based on what we already have in mfbt (we use it for internal hash tables and atoms; iirc it's …). I ran it on a couple of profiles; on mine (106k entries) it returns 2 hash conflicts, so it appears to be quite good, and we know it's fast per the past analysis done in bug 729952.
See also http://mxr.mozilla.org/mozilla-central/source/mfbt/HashFunctions.h
I think it may be a good starting point to make an experimental patch.
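(For reference, a collision count like the one above can be measured with a query along these lines; this is a sketch that assumes the hash() SQL function from the experimental patch is registered on the connection:)

```sql
-- Number of entries whose hash is shared with at least one other url.
-- hash() is the custom SQL function from the patch, not built-in SQLite.
SELECT COUNT(*) - COUNT(DISTINCT hash(url)) AS colliding_entries
FROM moz_places;
```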
Assignee
Updated•11 years ago
Updated•11 years ago
Whiteboard: [triage]
Updated•11 years ago
Whiteboard: p=0
Assignee
Comment 12•11 years ago
I investigated using the new WITHOUT ROWID feature of SQLite. Unfortunately it's not usable for our use-case: while it makes selecting by URL faster and avoids duplicating the URL into its own index, every other index on moz_places ends up duplicating the primary key value, which is now the URL. So my 70MB database becomes a 150MB one :( Looks like the use-cases for WITHOUT ROWID are fewer than I expected.
Though, we can still use WITHOUT ROWID to save space and increase speed on other tables, where we currently have compound indices. I will file bugs for specific cases once I have some numbers.
I guess at this point the only plausible solution for this specific issue is the hash. This will break some use-cases:
- uriIsPrefix queries, basically queries getting all pages beginning with a given string, will become extremely slow. The feature will have to be dropped (it would end up scanning hundreds of thousands of URLs one by one).
- for basically the same reason, we won't be able to query entries based on the protocol with tricks like "url BETWEEN 'file://' AND 'file:/~'" or "url BETWEEN 'place:' AND 'place;'"; we'll need a further column to filter these.
For the latter case we could make the hash be "protocol:hash"; it would take a bit more space, though.
Assignee
Updated•11 years ago
Blocks: PlacesDiet
Updated•11 years ago
Updated•11 years ago
Updated•11 years ago
No longer blocks: fxdesktopbacklog
Flags: firefox-backlog+
Updated•11 years ago
Summary: moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong
Updated•11 years ago
Whiteboard: p=0 → p=3
Updated•11 years ago
Whiteboard: p=3 → p=3 [qa-]
Updated•10 years ago
Points: --- → 3
Flags: qe-verify-
Whiteboard: p=3 [qa-]
Assignee
Updated•9 years ago
Priority: -- → P1
Assignee
Updated•9 years ago
Priority: P1 → P2
Whiteboard: [fxsearch]
Assignee
Comment 14•9 years ago
These are the cases we need to address, since they won't work anymore in the new index-less world; or rather, they will keep working but will be slow as hell.
/toolkit/components/places/nsNavHistory.cpp
line 1030 -- "WHEN url BETWEEN 'place:' AND 'place;' "
line 1824 -- "AND h.url BETWEEN 'file://' AND 'file:/~' "
line 3373 -- clause.Condition("h.url >= ").Param(":uri")
line 3374 -- .Condition("h.url <= ").Param(":uri_upper");
line 3464 -- clause.Condition("NOT h.url BETWEEN 'place:' AND 'place;'");
/toolkit/components/places/History.jsm
line 481 -- WHEN url BETWEEN 'place:' AND 'place;'
/toolkit/components/places/SQLFunctions.cpp
line 491 -- "(url > 'place:' AND url < 'place;') "
/dom/push/PushRecord.jsm
line 154 -- p.url >= :urlLowerBound AND p.url <= :urlUpperBound AND
To fix the place: and file: checks, I'll likely need to add a prefix to the hashes, but that should be fine since it's just a few entries.
The other history checks are due to the uriIsPrefix query option, which we never actually used in the product. Though it looks like the addon-sdk exposes it through createQuery:
http://mxr.mozilla.org/mozilla-central/source/addon-sdk/source/lib/sdk/places/utils.js#143
I'll file a separate bug to remove uriIsPrefix and verify the impact there.
The most problematic one, because I'm not sure what it is doing, is dom/push/PushRecord.jsm. It looks like it's trying to query by origin; I think we can restrict the records using the rev_host column, and then it should be fine. Again, this needs a separate bug.
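(A hedged sketch of what a rev_host-based restriction could look like; the actual Push query, join, and parameter name here are assumptions, and the real change belongs in that separate bug:)

```sql
-- Instead of ranging over p.url (which needs the url index), restrict by the
-- reversed-host column, which has its own index in the Places schema.
SELECT MAX(v.visit_date) AS last_visit
FROM moz_places p
JOIN moz_historyvisits v ON v.place_id = p.id
WHERE p.rev_host = :reversedHost;
```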
Assignee
Updated•9 years ago
Summary: Investigation - moz_places_url_uniqueindex takes up 26% of my 30mb places db...that's wrong → moz_places_url_uniqueindex takes up to 1/3 of places.sqlite
Comment 15•9 years ago
(In reply to Marco Bonardo [::mak] from comment #14)
> The most problematic thing cause I'm not sure what is doing is
> dom/push/PushRecord.jsm. Looks like it's trying to query by origin, I think
> we can restrict the records using the rev_host column and then it should be
> fine. Again this needs a separate bug.
Yup, it's querying the last visit to a given origin. The push code uses this to calculate the background message quota, which changes based on the number of days the user last visited the site.
Would it be sufficient to replace `p.url` with `rev_host` and `PlacesUtils.getReversedHost`? Would we need to restrict the query to account for the scheme and port?
Thanks for filing the bug!
Assignee
Comment 16•9 years ago
(In reply to Kit Cambridge [:kitcambridge] from comment #15)
> Yup, it's querying the last visit to a given origin. The push code uses this
> to calculate the background message quota, which changes based on the number
> of days the user last visited the site.
>
> Would it be sufficient to replace `p.url` with `rev_host` and
> `PlacesUtils.getReversedHost`? Would we need to restrict the query to
> account for the scheme and port?
The things that will be slow in the "new world" are queries doing "url >", "url <", "url BETWEEN", so I suppose that would do. But whether it should be restricted by scheme and port depends on the Push design requirements. Btw, let's keep the discussion in the specific Push bug; I will try to look into it soon and we can evaluate what to do.
Assignee
Comment 17•9 years ago
Review commit: https://reviewboard.mozilla.org/r/51163/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/51163/
Assignee
Updated•9 years ago
Attachment #814619 - Attachment is obsolete: true
Assignee
Comment 18•9 years ago
Comment on attachment 8749808 [details]
Bug 889561 - moz_places_url_uniqueindex takes up too much database space
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/1-2/
Assignee
Comment 19•8 years ago
Comment on attachment 8749808 [details]
Bug 889561 - moz_places_url_uniqueindex takes up too much database space
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/51163/diff/2-3/
Attachment #8749808 - Attachment description: MozReview Request: Bug 889561 - moz_places_url_uniqueindex takes up too much database space → Bug 889561 - moz_places_url_uniqueindex takes up too much database space
Assignee
Comment 20•8 years ago
Review commit: https://reviewboard.mozilla.org/r/59938/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/59938/
Attachment #8763819 - Flags: review?(adw)
Assignee
Updated•8 years ago
Attachment #8749808 - Attachment is obsolete: true
Assignee
Comment 21•8 years ago
The patch adds a hash() SQL function that accepts up to 2 arguments: the first is the string to hash, the second (optional) is a hashing mode (currently prefix_hi or prefix_lo, for querying schemes). See the usage sketch below.
Maintenance will take care of any misses caused by add-ons.
It's also changing expiration to adapt better to each individual db, while trying to limit the db size to 60MiB (the previous 160MiB limit was unrealistic). We should take a close look at telemetry after the change. When we move favicons out of the db we can likely improve this further.
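(As an illustration of the prefix modes, a scheme-restricted query like the place: checks from comment 14 could range over the hash instead of the URL text; a sketch assuming the patch's url_hash column:)

```sql
-- All hashes of urls with scheme "place" fall between these two bounds,
-- so the small url_hash index can serve the query instead of a url scan.
SELECT id, url
FROM moz_places
WHERE url_hash BETWEEN hash('place', 'prefix_lo')
                   AND hash('place', 'prefix_hi');
```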
Assignee
Comment 22•8 years ago
Comment on attachment 8763819 [details]
Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places.
Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59938/diff/1-2/
Comment 23•8 years ago
Comment on attachment 8763819 [details]
Bug 889561 - Reduce the size of places.sqlite by removing the url unique index from moz_places.
https://reviewboard.mozilla.org/r/59938/#review57468
This looks great, I don't see any problems. Nice work!
::: toolkit/components/places/SQLFunctions.cpp:882
(Diff revision 2)
> +
> + RefPtr<nsVariant> result = new nsVariant();
> + if (mode.IsEmpty()) {
> + // URI-like strings (having a prefix before a colon), are handled specially,
> + // as a 48 bit hash, where first 16 bits are the prefix hash, while the
> + // other 48 are the string hash.
Typo: should be "other 32" right?
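(For context: with the typo corrected, the quoted comment describes a 16-bit prefix hash packed above a 32-bit string hash. The exact shift below is my inference from the 48-bit total; it is not stated in the diff:)

```latex
\mathrm{hash}_{48}(\mathit{url})
= \underbrace{\mathrm{hash}_{16}(\mathit{prefix})}_{\text{high 16 bits}} \ll 32
\;\mid\;
\underbrace{\mathrm{hash}_{32}(\mathit{url})}_{\text{low 32 bits}}
```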
Attachment #8763819 - Flags: review?(adw) → review+
Assignee
Comment 24•8 years ago
(In reply to Drew Willcoxon :adw from comment #23)
> Typo: should be "other 32" right?
Yep. I'll fix that as soon as I can recover a working tree; some recent pull has broken it in a dom subfolder I never touched...
Comment 25•8 years ago
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/fx-team/rev/ceff61c9fc5a
Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw
Comment 26•8 years ago
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/fx-team/rev/07e1f152fe92
followup - Missing changes to Bookmarks.jsm. r=post-facto
Comment 27•8 years ago
This caused frequent failures like https://treeherder.mozilla.org/logviewer.html#?job_id=10160078&repo=fx-team on PGO, so I had to back this out, sorry.
Updated•8 years ago
Flags: needinfo?(mak77)
Comment 28•8 years ago
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/fx-team/rev/3fa6d81f720a
Backed out changeset 07e1f152fe92
https://hg.mozilla.org/integration/fx-team/rev/cda8c83f6a57
Backed out changeset ceff61c9fc5a for frequent testfailures on pgo in /bookmarks/test_
Assignee
Comment 29•8 years ago
Testing a fix on Try.
If I'm right, the problem is that history tries to start expiration very late on shutdown, and now that fails because on expiration startup we try to query the database, which is already closed at that point, so we bail out with an exception.
This happens mostly in tests; it's unlikely to happen in reality.
The fix only requires fixing our AsyncShutdown blockers to properly handle the exception we throw when we can't open the db because it's too late.
Flags: needinfo?(mak77)
Assignee
Comment 30•8 years ago
Ok, looks like this did it https://treeherder.mozilla.org/#/jobs?repo=try&revision=da26210afb1365a81be917594ad05880927be774
Comment 31•8 years ago
Pushed by mak77@bonardo.net:
https://hg.mozilla.org/integration/fx-team/rev/21585b3e4814
Reduce the size of places.sqlite by removing the url unique index from moz_places. r=adw
Comment 32•8 years ago
bugherder
Status: NEW → RESOLVED
Closed: 8 years ago
status-firefox50:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment 33•8 years ago
After this change landed, an alert was raised for the telemetry probe PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS [1]. Looking at this patch, it seems as though the change in nsPlacesExpiration's DATABASE_MAX_SIZE may have resulted in a (presumably temporary) change in how long URIs stay in the database before being expired?
Assignee
Comment 34•8 years ago
(In reply to Chris H-C :chutten from comment #33)
> After this change landed, an alert for the telemetry probe
> PLACES_MOST_RECENT_EXPIRED_VISIT_DAYS was raised [1]. Looking at this patch,
> it seems as though maybe the change in nsPlacesExpiration's
> DATABASE_MAX_SIZE resulted in a (presumably-temporary) change in how long
> uris are in the database before being expired?
Yes, we are constantly watching that probe to find the right compromise for our performance work. The purpose of the probe is to verify that the expiration changes are not causing large data loss.
The only meaningful regression would be if the value became *much* lower than it is now; then we would actually be losing a meaningful amount of the user's history.
I'm sure there will be further changes in the coming months, since we are trying to make places.sqlite smaller.