Closed Bug 1387694 Opened 7 years ago Closed 7 years ago

Top frecent sites query should dedupe to the more frecent rev_nowww

Categories

(Firefox :: New Tab Page, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
Firefox 57
Tracking Status
firefox57 --- fixed

People

(Reporter: Mardak, Assigned: Mardak)

References

Details

Attachments

(1 file)

Looks like the query is missing an ORDER for a GROUP and picks the correct most frecent of a given rev_host but then the less frecent of rev_nowww then the most frecent of those hosts.
Updating patch. I thought this code seemed familiar... :( nanj fixed this in the Test Pilot version but after we had already exported the query to m-c: https://github.com/mozilla/activity-stream/pull/2372 As was suggested in the PR, I'll make the query index-fast and do custom logic in JS.
Yay for unit tests. And SQL can indeed be quite compact line-wise but not necessarily easily performant with complex logic.
Comment on attachment 8894063 [details] Bug 1387694 - Top frecent sites query should dedupe to the more frecent rev_nowww. https://reviewboard.mozilla.org/r/165162/#review171196 Overall looks good, just a couple questions in the query and with the regex. Also would be nice to know how all this post-processing stuff performs on a larger number of sites i.e when Object.keys(BlockedLinks.links).length returns a large number and therefore our limit is a large number. ::: toolkit/modules/NewTabUtils.jsm:946 (Diff revision 2) > - // Which is why subselect orders frecency and last_visit_date backwards. > - // In general the groupby behavior in the absence of aggregates is not > - // defined in SQL, hence we are relying on sqlite implementation that may > - // change in the future. > - > - const limit = Object.keys(BlockedLinks.links).length + TOP_SITES_LIMIT; > + const limit = Object.keys(BlockedLinks.links).length + TOP_SITES_LIMIT * 2 * 10; > + > + // Keep this query fast with frecency-indexed lookups (even with excess > + // rows) and shift the more complex logic to post-processing afterwards > + const sqlQuery = `SELECT > + moz_bookmarks.guid AS bookmarkGuid, Do you need this ```AS bookmarkGuid``` here? ::: toolkit/modules/NewTabUtils.jsm:949 (Diff revision 2) > - // change in the future. > - > - const limit = Object.keys(BlockedLinks.links).length + TOP_SITES_LIMIT; > - let sqlQuery = `/* do not warn (bug N/A): do not need index */ > - SELECT url, title, SUM(frecency) frecency, guid, bookmarkGuid, > - last_visit_date / 1000 as lastVisitDate, "history" as type > + // rows) and shift the more complex logic to post-processing afterwards > + const sqlQuery = `SELECT > + moz_bookmarks.guid AS bookmarkGuid, > + frecency, > + moz_places.guid, > + last_visit_date / 1000 AS lastVisitDate, Same here with ```AS lastVisitDate```. I'm asking because above you don't have ```moz_places.guid AS guid``` yet you're still extracting the guid as a listed column, I'm wondering if you need it here or if you forgot to add ```AS guid``` for ```moz_places.guid``` ::: toolkit/modules/NewTabUtils.jsm:1003 (Diff revision 2) > + } > + > + link.type = "history"; > + > + // First we want to find the best link for an exact host > + setBetterLink(exactHosts, link, url => url.match(/:\/\/([^\/]+)\//)); Why match on a trailing slash here? For example, this regex matches on 'http://foo.co.uk/' but wouldn't match on 'http://foo.co.uk', even though they would have the same host, right? So exactHosts wouldn't have the right values in the end. Unless this case would never happen? ::: toolkit/modules/NewTabUtils.jsm:1009 (Diff revision 2) > } > - links = links.slice(0, TOP_SITES_LIMIT); > + > + // Clean up exact hosts to dedupe as non-www hosts > + const hosts = new Map(); > + for (const link of exactHosts.values()) { > + setBetterLink(hosts, link, url => url.match(/:\/\/(?:www\.)?([^\/]+)\//), Same question here with the trailing slash; this regex matches on 'http://foo.com/' but not on 'http://www.foo.com', does that affect the result set?
Attachment #8894063 - Flags: review?(usarracini) → review+
Comment on attachment 8894063 [details] Bug 1387694 - Top frecent sites query should dedupe to the more frecent rev_nowww. https://reviewboard.mozilla.org/r/165162/#review171196 > Do you need this ```AS bookmarkGuid``` here? Yup. Otherwise "guid" conflicts with the one from moz_places > Same here with ```AS lastVisitDate```. I'm asking because above you don't have ```moz_places.guid AS guid``` yet you're still extracting the guid as a listed column, I'm wondering if you need it here or if you forgot to add ```AS guid``` for ```moz_places.guid``` Yup. Need this one to be able to select the computed value. Although, we could do the computation in JS instead but then that would probably require us to clean up the undesired column from the object. > Why match on a trailing slash here? > For example, this regex matches on 'http://foo.co.uk/' but wouldn't match on 'http://foo.co.uk', even though they would have the same host, right? So exactHosts wouldn't have the right values in the end. > Unless this case would never happen? It doesn't have to, but pretty sure normal places usage will have a trailing slash on top level urls. sqlite> SELECT COUNT(1) FROM moz_places; 118245 sqlite> SELECT url FROM moz_places WHERE url LIKE 'http%' AND INSTR(SUBSTR(url, 9), "/") = 0 LIMIT 10; From my own history, there's no pages that start with http that don't have a slash somewhere (other than for the protocol). I'll just update this to not require the trailing slash anyway.
Pushed by edilee@gmail.com: https://hg.mozilla.org/integration/autoland/rev/601590dc02a7 Top frecent sites query should dedupe to the more frecent rev_nowww. r=ursula
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
Blocks: 1394533
Component: Activity Streams: Newtab → New Tab Page
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: