Closed
Bug 1382332
Opened 7 years ago
Closed 7 years ago
Dedupe http(s) in highlights; dedupe multiple bookmarks for the same url in top sites
Categories
(Firefox for Android Graveyard :: General, enhancement, P1)
Firefox for Android Graveyard
General
Tracking
(firefox57 fixed)
RESOLVED
FIXED
Firefox 57
Tracking | Status | |
---|---|---|
firefox57 | --- | fixed |
People
(Reporter: sdaswani, Assigned: mcomella)
References
Details
(Whiteboard: [MobileAS] 1.27 [summary in comment 48 49])
Attachments
(6 files)
No description provided.
Blocks: as-android-blockers
Updated•7 years ago
|
Whiteboard: [MobileAS]
Comment 1•7 years ago
|
||
Susheel, can you click on the menu icons of the entries, copy the address and post them here? We are deduplicating highlights and content from the same domain. But this seems to fail here for some kind of reason.
Flags: needinfo?(sdaswani)
Sure for the two Feedly's in Highlights:
https://feedly.com/
https://www.feedly.com/
For the top sites:
https://news.google.com/
http://news.google.com/
Flags: needinfo?(sdaswani)
Assignee | ||
Comment 3•7 years ago
|
||
I have a few duplicates in Top Sites on my test device too:
https://gizmodo.com/us-tangles-with-the-world-over-climate-change-at-g20-1796741069
https://gizmodo.com/frozen-couple-unearthed-by-climate-change-and-a-ski-li-1797024985
http://gizmodo.com
https://www.rockpapershotgun.com/2017/07/13/civilization-6-post-mortem/
https://www.rockpapershotgun.com/
http://www.bbc.co.uk/
http://www.bbc.com/
http://www.bbc.com/news/world-us-canada-40662772 (different favicon from the other two)
---
It looks like it's partially because my articles are showing up as top sites which would presumably clear with more realistic browsing scenarios (but perhaps not – what if you only open external links?).
Assignee | ||
Comment 4•7 years ago
|
||
Here's what the duplicates look like for me (via bug 1377287 comment 17).
Assignee | ||
Comment 5•7 years ago
|
||
Another thought: in the event that there are duplicates and you want to pin one (e.g. I hate that I have 3 bbc's so I want to pin bbc.co.uk), it's impossible to get the url without 1) clicking it (which might reorder the top sites, argh!) or 2) copying the data from the context menu.
It'd be great if there was a way to see the actual URL of the page (and it's likely there are other use cases where you might want to see the URL too).
Comment 6•7 years ago
|
||
Those are two different things: (Android) Top sites can have (and always had) domain duplicates. Highlights tries to prevent that. Our top sites algorithm is unchanged (except for pins that work a little bit differently now).
Comment 7•7 years ago
|
||
We definitely avoid having dupes on desktop for Top Sites and we should follow suit in the mobile experiences as well. Right now we dedupe based on domain. Also, maybe related, is we weight .com's higher than pages with full paths (in this case, http://gizmodo.com would weight higher than https://gizmodo.com/us-tangles-with-the-world-over-climate-change-at-g20-1796741069). Weighting the .com higher AND deduping based on domain would give us the experience we're shooting for.
For more reference, here's a reference to the code for Top Sites on Desktop: http://searchfox.org/mozilla-central/source/toolkit/modules/NewTabUtils.jsm#935
Assignee | ||
Comment 8•7 years ago
|
||
Adding to triage so we can discuss the trade-offs: we could fix this on the engineering side to dedupe, we could fix it on the UI side (make titles more permissive), etc.
tracking-fennec: --- → ?
Assignee | ||
Comment 9•7 years ago
|
||
Let's compare this against desktop & iOS to see what they're doing and see if we can do something similar.
tracking-fennec: ? → ---
Priority: -- → P2
Assignee | ||
Updated•7 years ago
|
Iteration: --- → 1.27
Priority: P2 → P1
Whiteboard: [MobileAS] → [MobileAS] 1.27
Assignee | ||
Updated•7 years ago
|
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → michael.l.comella
Assignee | ||
Comment 10•7 years ago
|
||
From a UX perspective, this relates to bug 1386902: "TopSites tiles' titles do not accurately indicate site (from PageMetadata provider)".
Assignee | ||
Comment 11•7 years ago
|
||
I see a few more duplicates on my personal device:
kotaku.com
kotaku.com
rockpapershotgun.com
rockpapershotgun.com
(^ Both of these are bookmarked twice - I wonder if we don't dedupe that)
accounts.craigslist.org
sfbay.craigslist.org
accounts.craigslist.org/login/home
(^ All three of these share a title, "craigslist", and a favicon so they're indistinguishable. The first two can't be fixed in this bug but can be fixed with bug 1386902).
Assignee | ||
Comment 12•7 years ago
|
||
(In reply to Aaron Benson from comment #7)
> For more reference, here's a reference to the code for Top Sites on Desktop:
> http://searchfox.org/mozilla-central/source/toolkit/modules/NewTabUtils.
> jsm#935
And iOS:
- Parsing the results [1]
- Direct DB query [2] which is called by [3]
[1]: https://github.com/mozilla-mobile/firefox-ios/blob/b6d1b35a3c0aef82f1482748812a060b3547ce2a/Client/Frontend/Home/ActivityStreamPanel.swift#L483
[2]: https://github.com/mozilla-mobile/firefox-ios/blob/0d116764bc29156553b23298102ad3d839b96d05/Storage/SQL/SQLiteHistory.swift#L129
[3]: https://github.com/mozilla-mobile/firefox-ios/blob/0d116764bc29156553b23298102ad3d839b96d05/Storage/SQL/SQLiteHistory.swift#L311
Assignee | ||
Comment 13•7 years ago
|
||
In bug 1386902, we're going to change the naming scheme from provider -> subdomain.domain, which should fix the duplicates within top sites.
For this bug, I spoke to bbell about the vision for Top Sites/Highlights:
- Top Sites should favor domain names (kind of like "apps") & Highlights should favor paths (pages in the "apps").
- Top sites should be the pages you've visited absolutely the most (based on the frecency algorithm)
- Highlights should be constrained within some temporal parameters (e.g. past 72 hours)
These are different pools of pages so there should be little overlap and thus duplication.
Note: top sites can display pages if you visit that page frequently enough. Deduping should generally weight domains with no path higher though.
Assignee | ||
Comment 14•7 years ago
|
||
I created a document with all of the duplication examples so far in this bug and a list of how they can be fixed [1], the latter copied below:
Things to change
- Highlights: dedupe “www.domain” and “domain”
- Top sites: dedupe “http” & “https”
- Top sites: ensure multiple bookmarks don’t create duplicates
- Top sites: subdomain.domain (bug 1386902)
Awaiting UX
- Top sites: display path when two pages from same domain appear (how interact with plain domain?)
- Screenshots for pages?
Open questions:
- bbc.co.uk vs. bbc.com
---
Unless someone provides a counter-example I think this means the top sites and highlights algorithms might be working just fine right now - it's just the details that need to be fixed.
[1]: https://docs.google.com/document/d/1UvjKXC86azj5LptWRmU5D7jyY_B7Owig9mHrQhgmWzo/edit?usp=sharing
Assignee | ||
Comment 15•7 years ago
|
||
> - Highlights: dedupe “www.domain” and “domain”
> - Top sites: dedupe “http” & “https”
fwiw, I think these are not very likely to occur on real profiles and just happen to appear on our test profiles because we don't have many visits to outweigh redirects.
Assignee | ||
Comment 16•7 years ago
|
||
> - Highlights: dedupe “www.domain” and “domain”
Did this because it's likely to happen - we only have to visit a Highlight once for it to appear and if that visit was redirect from "www.", we'd get a dupe.
> - Top sites: dedupe “http” & “https”
This is annoying to do because we're passing a Cursor directly to the Adapter (e.g. create CursorWrapper-like thing, process dupes, and skip over dupe entries). This will happen with light use of the browser or a clean profile so it'll be slightly unpolished but I think it's unlilkely to happen with regular use of the browser and so, due to code complexity, I'm going to skip it.
> - Top sites: ensure multiple bookmarks don’t create duplicates
Looking into this.
> Awaiting UX
> - Top sites: display path when two pages from same domain appear (how interact with plain domain?)
Still awaiting UX - will probably be a new bug.
> - Screenshots for pages?
Moved to bug 1389259.
Assignee | ||
Comment 17•7 years ago
|
||
Bryan, a few questions:
How would we handle the case of "bbc.co.uk" and "bbc.com"? They both have the same favicon and would both appear as "bbc". Perhaps we should we include the TLD in this case?
To be explicit, were you still figuring out what the title should be for two top sites that have the same path? Or did we decide on "github/mozilla-mobile/focus-android" (in the case of https://github.com/mozilla-mobile/focus-android)? We could also remove parts of the path if they overlap, e.g. "github/focus-android" & "github/focus-ios" Note that I have the bug open for using screenshots for pages (bug 1389259).
Flags: needinfo?(bbell)
Comment 18•7 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #17)
> Bryan, a few questions:
>
> How would we handle the case of "bbc.co.uk" and "bbc.com"? They both have
> the same favicon and would both appear as "bbc". Perhaps we should we
> include the TLD in this case?
>
> To be explicit, were you still figuring out what the title should be for two
> top sites that have the same path? Or did we decide on
> "github/mozilla-mobile/focus-android" (in the case of
> https://github.com/mozilla-mobile/focus-android)? We could also remove parts
> of the path if they overlap, e.g. "github/focus-android" &
> "github/focus-ios" Note that I have the bug open for using screenshots for
> pages (bug 1389259).
Remove parts of the path if they overlap feels right to me. You might even consider going one level further, and drop the domain for sites with a duplicate and only show the parts of the URL that are unique.
e.g. "focus-android" & "focus-ios"
Flags: needinfo?(bbell)
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 22•7 years ago
|
||
Still TODO: ^ bbell's suggestion.
Bryan, how would we handle the case of "bbc.co.uk" and "bbc.com"? They both have the same favicon and would both appear as "bbc". Perhaps we should we include the TLD in this case?
Flags: needinfo?(bbell)
Assignee | ||
Comment 23•7 years ago
|
||
Bryan, I overrode the title with the unshared path (e.g. "focus-ios" & "focus-android") when two top sites shared the same host and it doesn't appear to be very useful:
- It's hard to understand which site we're looking at sometimes ("signin/v2/ide..." is accounts.google.com sign in but that URL may not have a URL)
- The unshared path isn't always useful, as in "2017/08/02/..." for RockPaperShotgun and "describecom..." for bugzilla.
- If we were to also include the domain, I think we're more likely to overflow than we are to actually show the difference in the paths
Any thoughts?
Caveat 1: this is a clean profile so 1) we see more duplicates than we're likely to see otherwise and 2) less commonly visited sites, which might not make top sites, are appearing like accounts.google.com sign-in.
Caveat 2: This patch isn't perfect: there are two support.google.com entries because their paths are the same but their query parameters are different and I only process paths at the moment.
Assignee | ||
Comment 24•7 years ago
|
||
For comparison, looking at my top sites on desktop, it looks like they're more strict about only showing one site per host, e.g. I saw http://github.com/mozilla-mobile/prox is in my top sites even though I'm sure I visit other github sites just as often like Prox's issues as well as only one bugzilla.mozilla, mail.google top site.
From a user's perspective, however, I'm not convinced this is better behavior (e.g. if I only use my browser for github, I'd want to see all of those sites).
Assignee | ||
Comment 25•7 years ago
|
||
To summarize the consequences of comment 23, I think there are two variables we can tweak are:
1) Revise the UI to help distinguish multiple sites with same host (e.g. allow titles to take up multiple lines only if there is another top site with the same host)
2) Change the top sites algorithm so we have fewer sites with the same host.
In my opinion, I think we should focus on #1: afaik, there's been no desire to modify our existing top sites query so it's probably pretty good (caveat: users used to be able to enter custom sites). The approach of limiting the number of same-host sites that appear in our top sites doesn't seem to be a better user experience to me (e.g. if I use my browser primarily to look at github; and how my top sites on desktop doesn't seem to accurately represent my browser usage).
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 34•7 years ago
|
||
FYI that reddit's subreddits is another (and probably common) use-case for having multiple pages:
(In reply to Cameron McCormack (:heycam) from bug 1388595 comment #3)
> To be clear, for me, all of similarly labelled tiles for me are to different
> sub-reddits, and I want to be able to distinguish/select them individually.
Assignee | ||
Comment 36•7 years ago
|
||
Another case:
(In reply to Paul from comment bug 1385651 #5)
> In this bug, the Top Sites are NOT duplicated. The screenshot shows four
> separate pages from the same weather site, all four of which I use very
> frequently. The problem is that all four icons look the same, despite
> belonging to legitimately different pages. That's why a fix of bug #1332621
> will not fix this bug.
Comment 37•7 years ago
|
||
mozreview-review |
Comment on attachment 8896070 [details]
Bug 1382332: Add MapUtils.putIfAbsent.
https://reviewboard.mozilla.org/r/167354/#review172978
Attachment #8896070 -
Flags: review?(liuche) → review+
Comment 38•7 years ago
|
||
mozreview-review |
Comment on attachment 8896071 [details]
Bug 1382332: Rm duplicate www. hosts from Highlights.
https://reviewboard.mozilla.org/r/167356/#review172982
Some confusing naming, but other than that this seems good.
::: mobile/android/base/java/org/mozilla/gecko/activitystream/ranking/HighlightsRanking.java:255
(Diff revision 3)
>
> /**
> - * Remove candidates that are pointing to the same host.
> + * Remove candidates that are pointing to the same host, with special restrictions for "www." hosts.
> */
> @VisibleForTesting static void dedupeSites(List<HighlightCandidate> candidates) {
> - final Set<String> knownHosts = new HashSet<String>();
> + final Map<String, HighlightCandidate> knownHostToCandidate = new HashMap<>();
I don't really like "candidate" in this name because I think it's a little confusing - maybe you can add a brief comment with key/value example, and possibly change it to be knownHostToHighlight or HighlightCandidate?
::: mobile/android/base/java/org/mozilla/gecko/activitystream/ranking/HighlightsRanking.java:256
(Diff revision 3)
> /**
> - * Remove candidates that are pointing to the same host.
> + * Remove candidates that are pointing to the same host, with special restrictions for "www." hosts.
> */
> @VisibleForTesting static void dedupeSites(List<HighlightCandidate> candidates) {
> - final Set<String> knownHosts = new HashSet<String>();
> + final Map<String, HighlightCandidate> knownHostToCandidate = new HashMap<>();
> + final List<HighlightCandidate> wwwHosts = new ArrayList<>();
Just to be consistent, shouldn't this be wwwCandidates? or wwwHighlightCandidates? because it literally is the candidate, not the host.
::: mobile/android/base/java/org/mozilla/gecko/activitystream/ranking/HighlightsRanking.java:275
(Diff revision 3)
> });
> +
> + // Include "www." hosts only if (see ifs below):
> + for (final HighlightCandidate wwwCandidate : wwwHosts) {
> + final String wwwCandidateHostNoWWW = wwwCandidate.getHost().substring(WWW.length()); // non-null b/c we check above.
> + final HighlightCandidate correspondingCandidate = knownHostToCandidate.get(wwwCandidateHostNoWWW);
Maybe name this knownCandidate? to match with the knownHost?
Attachment #8896071 -
Flags: review?(liuche) → review+
Updated•7 years ago
|
Iteration: 1.27 → 1.28
Comment 39•7 years ago
|
||
mozreview-review |
Comment on attachment 8896072 [details]
Bug 1382332: Remove duplicate URLs from AS top sites.
https://reviewboard.mozilla.org/r/167358/#review173684
OK, I'm assuming that the additional table scan (will it sort things, too?) won't be much of a problem since it'll run on a filtered data set.
Sidenote, we should clean up the top sites query(ies) for the post-positioned pins world. A lot of it seems to deal with that side of things.
Attachment #8896072 -
Flags: review?(gkruglov) → review+
Assignee | ||
Comment 40•7 years ago
|
||
To follow up on comment 25, spoke with bbell: we're going to try using "subdomain.domain" when the page does not have a path and use the page title when the page has a path.
fwiw, he mentioned nchapman has some code that will extract the more relevant parts out of a url (e.g. would pull out the Bug # for Bugzilla) but we haven't had a chance to talk to him and to me it sounds like a heuristic rabbit hole so we'll try ^ first.
Flags: needinfo?(bbell)
Assignee | ||
Comment 41•7 years ago
|
||
I'm going to land what we have and continue working.
Code from the future fixes will build on top of the code from bug 1386902.
Depends on: 1386902
Keywords: leave-open
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Comment hidden (mozreview-request) |
Assignee | ||
Comment 45•7 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #41)
> I'm going to land what we have and continue working.
>
> Code from the future fixes will build on top of the code from bug 1386902.
mozreview team said it's better to open new bugs because mozreview doesn't handle pushing two sets of commits to a bug so I'll file another one.
No longer depends on: 1386902
Keywords: leave-open
Comment 46•7 years ago
|
||
Pushed by michael.l.comella@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/5bb1644f7b89
Add MapUtils.putIfAbsent. r=liuche
https://hg.mozilla.org/integration/autoland/rev/c252481fa62a
Rm duplicate www. hosts from Highlights. r=liuche
https://hg.mozilla.org/integration/autoland/rev/cd7fce820293
Remove duplicate URLs from AS top sites. r=Grisha
Comment 47•7 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/5bb1644f7b89
https://hg.mozilla.org/mozilla-central/rev/c252481fa62a
https://hg.mozilla.org/mozilla-central/rev/cd7fce820293
Status: NEW → RESOLVED
Closed: 7 years ago
status-firefox57:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → Firefox 57
Updated•7 years ago
|
Assignee | ||
Comment 48•7 years ago
|
||
Changing the title to reflect what I've actually changed in this bug.
Things that are still a problem:
- Dedupe "http(s)" in top sites (this bug was highlights only) (bug 1332621)
- Dedupe "(www.)url" in top sites (bug 1332621)
- bbc.com & bbc.co.uk look identical in top sites (bug 1391479)
Summary: Both Highlights & Top Sites Have Duplicate Entries → Dedupe http(s) in highlights; dedupe multiple bookmarks for the same url in top sites
Assignee | ||
Comment 49•7 years ago
|
||
Also, to summarize the relevant labeling changes:
- Use subdomain.domain for top sites page titles (bug 1386902)
- Follow-up to ^: Show the page title for pages that have a path (bug 1390372)
- Screenshots could help distinguish sites (bug 1389259)
Between these two comments, it should summarize all the changes I've made or are considering making for these issues: distinguishing similar top sites entries and removing actual duplicates.
Whiteboard: [MobileAS] 1.27 → [MobileAS] 1.27 [summary in comment 48 49]
Assignee | ||
Comment 50•7 years ago
|
||
(In reply to Michael Comella (:mcomella) from comment #49)
> Also, to summarize the relevant labeling changes:
One more:
- Distributions should use page titles (bug 1391413)
Updated•4 years ago
|
Product: Firefox for Android → Firefox for Android Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•