Closed Bug 1323333 Opened 8 years ago Closed 7 years ago

Dodgy bookmark deduping

Categories

(Firefox :: Sync, defect, P2)

defect

Tracking

()

RESOLVED DUPLICATE of bug 1305563

People

(Reporter: lina, Assigned: lina)

References

Details

(Whiteboard: [data-integrity][sync:bookmarks][Sync Q3 OKR])

Attachments

(5 files, 1 obsolete file)

Our bookmark deduping logic is a bit strange. I started looking into it for bug 1305563, and there are a few things that puzzle me. Here's my understanding of how it works: * Before each sync, we build a GUID map in the form of `{ parentTitle: { key: new String("guid"), key: new String("guid"), ... }, ... }`. The key is based on the bookmark's type (title and URL for bookmarks, smart bookmark ID for queries, title for folders and livemarks). The value is a string object containing the GUID. * It's possible for a bookmark to have the same name, URL, and parent title. In that case, the GUID map will only record the last GUID it sees, but will set `hasDupe = true` on the string object. For example, let's say I have the following tree: Toolbar ("toolbar_____") > Searches ("guid11111111") > Google ("guid22222222") Menu ("menu________") > Searches ("guid33333333") > Google ("guid44444444"), Yahoo ("guid55555555") Part of my GUID map might look like this: `{ "Searches": { "bhttps://www.google.com:Google": new String("guid44444444", hasDupe = true), "bhttps://www.yahoo.com:Yahoo": new String("guid55555555", hasDupe = false) } }`. * Bookmark records *also* have a "hasDupe" property, which is synced to other devices! (http://searchfox.org/mozilla-central/rev/594937fec2e2fc45fa9308ba2fb964816631f017/services/sync/modules/engines/bookmarks.js#140). If an incoming (remote) bookmark has this property, we don't try to de-dupe using the (local) GUID map: http://searchfox.org/mozilla-central/rev/594937fec2e2fc45fa9308ba2fb964816631f017/services/sync/modules/engines/bookmarks.js#568-572. (In that function, `item` is the plaintext JSON, not a `CryptoWrapper`). If a bookmark doesn't have this property, we try to de-dupe using the map. If we find an existing local match, we change the local GUID to the remote GUID, and bump the local parent's change counter (bug 1276823, http://searchfox.org/mozilla-central/rev/594937fec2e2fc45fa9308ba2fb964816631f017/services/sync/modules/engines/bookmarks.js#606-607). We don't update the GUID map with the new remote GUID. * Inflating a WBO record sets the "hasDupe" property on it. This happens in `_createRecord` (http://searchfox.org/mozilla-central/rev/594937fec2e2fc45fa9308ba2fb964816631f017/services/sync/modules/engines/bookmarks.js#551-555), which we call in two places: when comparing two records for equality during download (http://searchfox.org/mozilla-central/rev/594937fec2e2fc45fa9308ba2fb964816631f017/services/sync/modules/engines.js#1480), and during upload (http://searchfox.org/mozilla-central/rev/594937fec2e2fc45fa9308ba2fb964816631f017/services/sync/modules/engines.js#1576). So, what does the "hasDupe" field tell us? I think it only answers "are there at least two local copies of this bookmark, with the same URL, title, and parent name anywhere in this tree?". Although, `_reconcile` calls `_findDupe` before `_createRecord`, so we might have already changed the GUID at this point. But then we sync this "hasDupe" property, and other devices use it to decide whether to look up the GUID in *their* local map. I'm confused if this is the right thing to do, and I wonder if the code is just as confused. * iOS knows about the "hasDupe" property (https://github.com/mozilla-mobile/firefox-ios/blob/54c89aa48812c35a5b6ffb32243bce544764f2cf/Storage/SQL/BrowserTable.swift#L272), but doesn't actually seem to do anything with it apart from carrying it forward. The Android Sync codebase doesn't mention "hasDupe" at all. For bug 1305563, I've been trying to express these semantics in SQL, and came up with https://gist.github.com/kitcambridge/55b64d07eb8906092cd33e13885c2705. That query is quite unwieldy, and only for detecting incoming dupes; I haven't figured out how to set "hasDupe" yet on outgoing records. But I'm wondering if the existing semantics are even correct, or valuable. Now that all our platforms use GUIDs—and backup and restore supports them (bug 967204)—do we still need this complicated code? It feels a bit too clever for its own good, it's hard to reason about, and increases our chances of doing the wrong thing. The only case I can think of where it might still apply is default bookmarks, since they don't have stable GUIDs. But we've also had reports where Sync resurrects them, so I suspect deduping is already broken for them. Some questions: * Am I understanding our deduping logic correctly? * Is it doing the right thing? * Do other clients rely on it? * Can we remove it entirely? * If not, can we simplify any of it?
There are several parts to this. 1. The value of content-based reconciling. 2. The guidMap implementation. 3. The tracking of hasDupe in the bookmark record format. Let's go in reverse order. 3. hasDupe is only necessary because guidMap doesn't have bucketing, and so without it desktop can't handle incoming duplicate records that it hasn't seen before. As you note, iOS doesn't really bother with it: iOS is tree-aware, so it doesn't use anything like guidMap for content-based reconciling at all. (This does mean we're a little lazy when uploading records, and can set a trap for desktop, but that won't occur until iOS allows the creation of bookmark structures.) Android is just lazy and wrong. My bad. Please file a bug when you're done with this discussion :D 2. guidMap is fundamentally flawed: the idea is to do subtree matching, but it's laughably naïve in only looking at depth=1, and it's very expensive. guidMap is cheaper than doing a string-based DB hit for N records when N is large, but we build it every time we see a new record. It's also wrong in other ways: Bug 747699. It would be fantastic to figure out how big N is, and kill guidMap altogether in most cases. 1. Some amount of content-based reconciling is necessary: if I have two desktops, both with a "Wishlist" folder on the toolbar, and I sync them, I expect those two folders to be merged. But note that a top-down tree merge algorithm can get this right without hasDupe: a "Wishlist" folder somewhere else in the tree isn't even a candidate. One might think that hasDupe could still perhaps be useful for duplicates in the same folder -- say I have a "Daily rotation" folder which starts and ends at Bugzilla -- but this is a much narrower concept, and I think it's possible to prove that a correctly implemented tree walker won't get this wrong: both duplicate records are present in the input tree, at known positions. iOS, for example, gets this right without using hasDupe. In summary: * We dupe more coarsely than we should. * Desktop is inefficient. * Android is wrong. * Tree-walking solves all ills. * Removing content-based duping altogether could still be harmful, but we don't know how much.
Flags: needinfo?(markh)
(In reply to Richard Newman [:rnewman] from comment #1) > There are several parts to this. > > 1. The value of content-based reconciling. > 2. The guidMap implementation. > 3. The tracking of hasDupe in the bookmark record format. More generally, there's the question of whether this approach is actually one that is worthwhile. > 3. hasDupe is only necessary because guidMap doesn't have bucketing, and so > without it desktop can't handle incoming duplicate records that it hasn't > seen before. We discussed this a little in IRC - IIRC, you explained the concept of "bucketing" could, in this context, be implemented by allowing the GUIDMap to use an array for each key rather than a single object. We'd expect almost every entry to have a single value in that array and in some edge cases have multiple. This would effectively mean the code would, in some cases, have to choose between multiple possible duplicates using some yet-to-be-defined heuristics. This seems better as it avoids the problem that hasDupe *really* means "at some point in the past there was a 'conflict' with this item, but that may or may not be true today". However, I'm still not clear on whether the concept of "find something similar anywhere else in the tree" is actually valuable today, and whether that behaviour is actually what the user expects. IOW, unless the user carefully inspects their entire bookmarks tree, they may conclude that this deduping actually *lost* bookmarks (as they are no longer where they expect and are unaware they are now somewhere else), which is potentially far worse than the user later discovering they have similar trees duplicated across their bookmarks. Assuming no GUID matches an incoming record, what are the downsides of always doing simple "content matching" with a very limited scope - eg, an item at the exact same place in the hierarchy? This isn't going to handle some cases the guidMap approach does, but it seems easy to argue this is a feature rather than a problem. Doing this efficiently might be a bit of a challenge in some cases, but I'm having trouble thinking of edge-cases we are likely to hit. But even then, ISTM there is scope for sticking with a cache that's built as we need it rather than up-front. The user *may*, in edge-cases I'm yet to understand, end up with some duplicates where they wouldn't today, but the user is probably going to think "oh right, that's because I did actually have duplicate trees in different locations - I'll just fix it once and the future will be good". That's a better outcome than "where are my bookmarks that I knew were right here before I did a sync" This could IIUC kill both hasDupe and the guidMap. It wouldn't give us the full benefits that tree-wide/2-phase application gives us, but would make the implementation of that tree-wide/2-phase bug much simpler in terms of the de-duping logic it would have to re-implement. IOW, simplifying this de-dupe before starting on 2-phase seems to make alot of sense. Richard, what am I missing here?
Flags: needinfo?(markh) → needinfo?(rnewman)
Structural reconciling -- that is, is there a very similar record at the same place in the merged tree? -- is the ideal way to go, and would eliminate the need for guidMap and hasDupe. The doubt I have is that doing structural reconciling correctly requires top-down processing: knowing which two sets of children should be compared might involve a sequence of GUID or structural merges from the root down. Without building tree merging we don't have that. Without top-down processing you're reliant on order of record application, and this deduping is mostly of value for full/fresh syncs, when all records are applied at once…
Flags: needinfo?(rnewman)
Blocks: 1305563
Marking as p3 to match the blocked bug 1305563
Priority: -- → P3
Some more half-formed thoughts... As Richard points out in comment 3, proper structural reconciliation needs a tree merge. With two-phase application, we can at least avoid duping records in different subtrees, but we still don't know the shared parent of the local and remote records. In addition to considering the item's ancestors, I thought we could also consider the position, but that seems like a non-starter. Let's say we have a local folder with children A, B, E, F, C, G, and a remote folder with children A, B, C, D, E, F, G. (Assume the children have different GUIDs, but the same titles and URLs, and that all records on the server have a newer timestamp). We want the final tree to be A, B, C, D, E, F, G. If we only de-duped items at the same position, we'd end up with something like A, B, E, C, F, D, C, E, G, F, G. Another question to consider is, how likely is it that two items will have the same parent and attributes (title, type, URL, and so on), but different GUIDs? (Or different children, if the items are folders). This used to be possible when restoring bookmarks (bug 967204); are there others? Of course, it's also possible to create a folder with the same name on two different devices, like the "Recipes" example. Though, in that case, it would be fairly simple to resolve the duplication manually. Large-scale duplication (a full sync, where most local GUIDs don't match remote GUIDs) is more concerning.
At our scale, everything is likely. Consider the full realm of horrifying things users do: - Bookmark the same thing in two places. - Copy bookmarks on different devices. - Implement their own "sync" by emailing bookmark exports or backups around. - Import old Firefox bookmarks into Chrome two years ago, modify over years, then sync via Chrome, then import from Chrome into Firefox on another machine, and set up Sync. - Use Xmarks and Firefox Sync at the same time. - Import old bookmark backups. - Sync, then turn off syncing for a long period of time, perhaps manually making identical changes in both places. (Some users want one-way or partial sync.) - Copy or move an entire bookmark hierarchy, then make changes to it. - Use Firefox Sync for years, manually fixing all of the corruption and duplication it causes, resulting in subtly divergent trees with very different GUIDs.
Mark pointed out another interesting case in bug 1177516, comment 21 where title-based bookmark de-duping will do the wrong thing.
(In reply to Richard Newman [:rnewman] from comment #6) > At our scale, everything is likely. > > Consider the full realm of horrifying things users do: Right, but we aren't suggesting we can do the right thing in those cases. The more of those edge-cases we try and handle, the more likely it is we end up with other, more likely to be hit edge-cases that we fail to handle. I still maintain it's better to miss some possibly-maybe-if-you-squint duplicates (forcing the user to manually cleanup, but probably not be too surprised) than to get too-smart-for-our-own-good and do something that truly baffles the user.
Mark found some interesting things in the logs for bug 1332751; capturing the IRC conversation here for posterity. From the logs: > 2017-01-24 04:09:12.727000 (00:04:36.816) Sync.Engine.Bookmarks DEBUG 3mJ4roUnAxdU mapped to rMMEtbL_FG8G > 2017-01-24 04:09:12.729000 (00:04:36.818) BookmarkSyncUtils DEBUG dedupeSyncBookmark: Switching local GUID rMMEtbL_FG8G to incoming GUID 3mJ4roUnAxdU > 2017-01-24 04:09:13.026000 (00:04:37.115) Sync.Engine.Bookmarks DEBUG Local item after duplication: age=405.8019998073578; modified=true; exists=true > 2017-01-24 04:09:13.034000 (00:04:37.123) Sync.Engine.Bookmarks WARN DATA LOSS: Both local and remote changes to record: 3mJ4roUnAxdU > 2017-01-24 04:09:13.037000 (00:04:37.126) Sync.Engine.Bookmarks DEBUG v_KLguzV7X3O mapped to rMMEtbL_FG8G > 2017-01-24 04:09:13.040000 (00:04:37.129) Sync.Engine.Bookmarks WARN Failed to reconcile incoming record v_KLguzV7X3O: Error: Local item rMMEtbL_FG8G does not exist ... There are two problems here: we're duping multiple local items to the same remote item ("3mJ4roUnAxdU" and "v_KLguzV7X3O" should not both map to "rMMEtbL_FG8G"), and we're seeing incoming records with a parent we've already duped (see bug 1293163, and Richard's comments on bug 1276823). We can fix (1) by updating the GUID map with the new GUID, and ensure we don't try to de-dupe records more than once, but (2) will still leave the server in an inconsistent state. To make things more interesting, our GUID map only stores the *last* GUID that matches a particular key and parent title. This means incoming children of a duped folder could also be dupes, or they might belong to an entirely different subtree. In the latter case, de-duping is the wrong thing to do. Bug 1276823, comment 11 is relevant. We're speculating this was caused by a test script that created the same (or similar) structure, but different GUIDs on multiple devices.
Attached patch t.patch (deleted) — Splinter Review
here's a patch that demonstrates the issue Kit discussed. The changes to bookmarks.js are disabled (ie, commented out), but I kept them so I don't lose them. The test itself creates 2 seemingly duplicate bookmark trees, and things go pear-shaped. From the logs in the failing test: > Sync.Engine.Bookmarks TRACE Mapped dupe: JKkZ_ATj_A4p" > Sync.Engine.Bookmarks DEBUG fake-guid-01 mapped to JKkZ_ATj_A4p" > Sync.Engine.Bookmarks TRACE Local item JKkZ_ATj_A4p is a duplicate for incoming item fake-guid-01" > BookmarkSyncUtils DEBUG dedupeSyncBookmark: Switching local GUID JKkZ_ATj_A4p to incoming GUID fake-guid-01" ... > Sync.Engine.Bookmarks DEBUG fake-guid-03 mapped to JKkZ_ATj_A4p" > Sync.Engine.Bookmarks TRACE Local item JKkZ_ATj_A4p is a duplicate for incoming item fake-guid-03" > Sync.Engine.Bookmarks WARN Failed to reconcile incoming record fake-guid-03: Error: Local item JKkZ_ATj_A4p does not exist (resource://gre/modules/PlacesSyncUtils.jsm:1620:11) ... > Sync.Engine.Bookmarks DEBUG Records that failed to apply: fake-guid-03" and the validator reports 3 parentChildMismatches. So I experimented with replacing the item in the guidmap with the new item after de-duping (ie, if you uncomment the change in _switchItemToDupe), which is a slight improvement (no failures to apply or rename) but still the validator reports issues - the *children* of those dupe folders still get the dupe treatment and fails to update the server with the end result of the de-dupe. I *think* this part of the patch is a strict improvement (we only get 2 parentChildMismatches instead of 3 and no failures - "just" a mismatch between the client and the server) So I *then* tried playing with "let's never dupe folders", but that ends up being the same as above - the fact the children of the folders get duped causes validation failures. That said though, I suspect we are looking at edge-cases rather than the common case. I think we should dig deeper into this, but as a lower priority than the repair work.
Some more scattered thoughts: * Once we have two-phase, we can compare trees, instead of deduping as we go. The left tree is all local bookmarks under the syncable roots; the right tree is the buffer overlayed onto the local tree. Join on level, type-specific attributes (title, parent title, URL, smart bookmark anno), and maybe position to make sure we only dedupe items in matching subtrees. * Using the position is tricky. Consider two devices, A and B, with identical structures but different GUIDs. Let's say we reposition an item in the subtree on device B, then sync. A will correctly dedupe all items before the repositioned one, but won't treat the shifted items as dupes. Making multiple structural changes like this while offline will increase the chances of Sync leaving dupes behind. * OTOH, using just the key is not sufficient, either, because it's possible to have two bookmarks with the same title, URL, etc. in the same folder, at different positions (bug 1213369). I suspect this is part of why `hasDupe` exists. The other reason is so that we don't indiscriminately dedupe to a completely different subtree; though, as this bug shows, it's still possible. So maybe we should account for the position, after all, and accept that Sync might create dupes of items in similar subtrees.
(In reply to Kit Cambridge [:kitcambridge] from comment #11) > Some more scattered thoughts: Thanks for continuing to think about this! > * Using the position is tricky. Consider two devices, A and B, with > identical structures but different GUIDs. How could this happen (other than a user explicitly creating the same structure manually on 2 different devices before connecting them)? I agree using the position is tricky though for reasons you state. > * OTOH, using just the key is not sufficient, either, because it's possible > to have two bookmarks with the same title, URL, etc. in the same folder, at > different positions (bug 1213369). I suspect this is part of why `hasDupe` > exists. The other reason is so that we don't indiscriminately dedupe to a > completely different subtree; though, as this bug shows, it's still > possible. So maybe we should account for the position, after all Yeah, it's tricky. Using the position means that similar, but not identical trees are not de-duped. I think we should generally assume that the user has manually created these duplicates on different devices - eg, bookmarked a few favourite news sites on both devices before enabling sync. It seems likely there would be trivial differences between these 2 trees - eg, the order may not be identical, or there may be one extra site at position zero the user forgot about on the second device. Ending up with dupes in that scenario seems wrong and difficult to diagnose when it goes wrong (ie, it would appear to users that sometimes we de-dupe and sometimes we don't, with no obvious rhyme or reason) > that Sync might create dupes of items in similar subtrees. In reality, I think this de-duping should be rare - it will happen in 2 cases: * first sync of a second device. * if a user manually creates the same bookmark on 2 sync connected devices. (but while relatively rare, it's likely to be the first impression they have of sync, so very important) I'm really skeptical that users will manually create identical (or even similar) sub-trees on 2 different devices, and if they did, that they'd be particularly upset/surprised to find they end up with 2 trees. Such users would be comfortable editing bookmarks and would take almost no time to manually resolve this situation. The 2-phase work should mean we can consider the server's parent record (it may not be be in the same batch, so we might need an additional get to fetch it), which should solve bug 1213369 - we should never de-dupe if the server's parent record already references the new item, which it would have in that bug. So I'd be inclined to suggest: * look at the state of the tree after considering incoming items. * any local items in a folder which are *not* referenced in the server's parent record are candidates for de-duping. * de-dupe these items considering only this folder, looking for identical attributes but ignoring position, and probably with a special case for separators (ie, de-dupe adjacent separators that don't appear in the parent record) * simply ignore folders (or possibly consider de-duping truly identical trees, but I believe that will be rate) Or is that too naive?
(In reply to Mark Hammond [:markh] from comment #13) > How could this happen (other than a user explicitly creating the same > structure manually on 2 different devices before connecting them)? Exporting or importing bookmarks in a Firefox before GUIDs were round-tripped through bookmark backups. Round-tripping bookmarks through HTML export, or through Chrome/Edge. Creating the original bookmark tree in a different browser, then importing it into Firefox on two different devices. There are a bunch of ways this could happen… > > * OTOH, using just the key is not sufficient, either, because it's possible > > to have two bookmarks with the same title, URL, etc. in the same folder, at > > different positions (bug 1213369). I suspect this is part of why `hasDupe` > > exists. And that you can have two bookmarks with the same values in two folders with the same title, and have those two folders be different. E.g., I might have 2012 Bugs Bug 12345… 2013 Bugs Bug 12345… and those two bookmarks will collide in the GUID map. > I'm really skeptical that users will manually create identical (or even > similar) sub-trees on 2 different devices, and if they did, that they'd be > particularly upset/surprised to find they end up with 2 trees. Such users > would be comfortable editing bookmarks and would take almost no time to > manually resolve this situation. Unless they notice on a device that makes bookmark management hard (e.g., Android). > * any local items in a folder which are *not* referenced in the server's > parent record are candidates for de-duping. To rephrase in iOS terms: the GUID isn't present in the mirror+buffer tree. Note that it's possible for a local item to not be in the server folder for other reasons: it could be remotely deleted, or it could be remotely moved. (Or we could have reconciled two folders together.) > * de-dupe these items considering only this folder, looking for identical > attributes but ignoring position, and probably with a special case for > separators (ie, de-dupe adjacent separators that don't appear in the parent > record) That's pretty much what iOS does: take the server child list, content-dupe any non-GUID-matches, preserving second+ duplicates, and append the local unreconciled child list to the end. Given a server record: [A B C D E] and a local tree (with lowercase meaning "merges to"): [E G(d) H(c) I J] we'd rename G to D, H to C, and produce: [A B C D E I J] > Or is that too naive? I think there's a risk of order-of-operation bugs when considering incoming changes and moves, unless you're talking about doing structured top-down application (not just two-phase, which only aims to guarantee that you don't miss a server record). If you're doing totally buffered application, with consistency checking, then you're pretty much at the point iOS is: walk the two trees top-down, doing GUID- and content-based merging. That gives you folder smushing with exactly the same code, and handles moves and value changes correctly.
(In reply to Richard Newman [:rnewman] from comment #14) > If you're doing totally buffered application, with consistency checking, > then you're pretty much at the point iOS is: walk the two trees top-down, > doing GUID- and content-based merging. That gives you folder smushing with > exactly the same code, and handles moves and value changes correctly. The intent for bug 1305563 is to apply incoming records in level order (bonus: we don't create intermediate orphans, and it's easier to detect missing items. But, unlike iOS, we reparent to unfiled and hope the missing record shows up later). Could you link me to the iOS content de-duping logic, please? You gave me an overview of iOS bookmark sync a couple months back, but I admit I haven't dug into the implementation.
Whiteboard: [data-integrity][sync:bookmarks]
Depends on: 1349409
Assignee: nobody → kit
Status: NEW → ASSIGNED
Priority: P3 → P1
It's worth thinking about the cases we want to solve with deduping, and what's OK to ignore. Overall, I think we should err on the side of leaving dupes in place, and letting folks dedupe manually if we can't figure out what to do. Tree walking would help substantially, but we still can't guarantee that the server tree is at all congruent with the client tree. The longer the time between syncs, the more likely it is for the client and server to diverge, especially if there's a node reassignment or automatic restore. For the first sync, there's no shared parent at all! IIUC, iOS and Android keep mobile bookmarks separate, and can avoid this complexity. Once both platforms have full bookmark management, and iOS has bidirectional sync, some kind of content-based duping seems unavoidable. So I agree we want a limited deduping scheme, instead of removing it from Desktop outright. Comment 6 is a great starting point for further thinking, so let's go through those: > - Bookmark the same thing in two places. > - Copy bookmarks on different devices. Agreed. If you've bookmarked the same URL, with the same title and folder, on two different devices, we should collapse them into one. If you create explicit dupes on another device by copying a bookmark, I think we should only dedupe if you've made the same number of copies locally. Otherwise, we don't know which to keep, and which to collapse. This is especially true for folders, which we dedupe based on the level and title, and not their children. > - Implement their own "sync" by emailing bookmark exports or backups around. > - Import old bookmark backups. Backups include GUIDs as of bug 967204, but I don't think we should worry too much about this case. As with explicit dupes, I think folks who are savvy enough to adopt their own syncing strategies can figure out how to dedupe. Importing a bookmark export into Places also doesn't dedupe, and collapsing the dupes on another device will make that tree inconsistent with the server. In general, we should never dedupe two bookmarks that are already on the server to each other, even if they are explicit dupes. (Our current GUID map will happily do this, and introduce corruption). > - Import old Firefox bookmarks into Chrome two years ago, modify over years, then sync via Chrome, then import from Chrome into Firefox on another machine, and set up Sync. > - Copy or move an entire bookmark hierarchy, then make changes to it. > - Sync, then turn off syncing for a long period of time, perhaps manually making identical changes in both places. (Some users want one-way or partial sync.) As with the first and second points, small changes are OK. More significant structural changes (duplicating an entire folder, moving items between subtrees) aren't safe to dedupe without user intervention, and we don't know the shared parent or any of the intermediate states. I think a generic content-based deduping algorithm that tries to merge two diverged trees, without any feedback, is going to get *something* wrong. It's a question of how much, and at what point we let it alone. > - Use Xmarks and Firefox Sync at the same time. Using two systems side-by-side is going to create so many more problems than dupes... :-( > - Use Firefox Sync for years, manually fixing all of the corruption and duplication it causes, resulting in subtly divergent trees with very different GUIDs. For these cases, it's easier to force a full reset instead of trying to develop a deduping strategy. We'll do this for repair, too.
Attached file new-dedupe.js (obsolete) (deleted) —
Here's a sketch I've been working on at the work week. I haven't started integrating it into the bookmarks engine yet, but does the approach seem reasonable? There's a block comment at the top that explains how it works.
Attachment #8862171 - Flags: feedback?(tchiovoloni)
Attachment #8862171 - Flags: feedback?(markh)
(In reply to Kit Cambridge (he/him) [:kitcambridge] (UTC-8) from comment #18) > IIUC, iOS and Android keep mobile bookmarks separate, and can avoid this > complexity. They keep them in Mobile Bookmarks, but both support whatever desktop does to that folder (including moving other folders into it). It's no different to desktop's Bookmarks Toolbar. Android is growing bookmark management Real Soon Now. > I agree we want a limited deduping scheme, instead of removing it from > Desktop outright. +:thumbsup: > > - Implement their own "sync" by emailing bookmark exports or backups around. > > - Import old bookmark backups. > > Backups include GUIDs as of bug 967204, but I don't think we should worry > too much about this case. As with explicit dupes, I think folks who are > savvy enough to adopt their own syncing strategies can figure out how to > dedupe. Importing a bookmark export into Places also doesn't dedupe, and > collapsing the dupes on another device will make that tree inconsistent with > the server. What I meant by this case was: a user could have copied a bookmark tree to another Firefox and -- prior to Bug 967204 -- that copy would be identical but for the GUIDs. That's much the same as building the same tree by hand, just with many more dupes! I agree that it's not worth thinking about trying to de-dupe when importing to an already synced tree: that's Places's job, if it chooses to tackle that. > In general, we should never dedupe two bookmarks that are already on the > server to each other, even if they are explicit dupes. (Our current GUID map > will happily do this, and introduce corruption). This is what hasDupe was meant to signify. A device uploading a dupe _should_ mark it as such. Naturally the implementation was flawed. > As with the first and second points, small changes are OK. More significant > structural changes (duplicating an entire folder, moving items between > subtrees) aren't safe to dedupe without user intervention, and we don't know > the shared parent or any of the intermediate states. Yep. > I think a generic content-based deduping algorithm that tries to merge two > diverged trees, without any feedback, is going to get *something* wrong. > It's a question of how much, and at what point we let it alone. I agree. This is why iOS's descends from the top, stitching two trees together. It'll never merge two items that don't share a duped or GUID-identical parent. > > - Use Xmarks and Firefox Sync at the same time. > > Using two systems side-by-side is going to create so many more problems than > dupes... :-( Yep; this is just an illustration of _yet another_ mechanism by which a user could have two Firefox profiles that start syncing that already share data. > > - Use Firefox Sync for years, manually fixing all of the corruption and duplication it causes, resulting in subtly divergent trees with very different GUIDs. > > For these cases, it's easier to force a full reset instead of trying to > develop a deduping strategy. We'll do this for repair, too. Even if you reset, this is a case where two clients can appear to have the same bookmark tree, but the GUIDs differ. (Consider incoming records being duped, and the user cleaning up by deleting different GUIDs on each profile.) All of these cases are simply supporting evidence for the need for content-based reconciling. It doesn't have to be that complex, it just needs to successfully stitch together two identical trees.
Attachment #8862171 - Attachment is obsolete: true
Attachment #8862171 - Flags: feedback?(tchiovoloni)
Attachment #8862171 - Flags: feedback?(markh)
Comment on attachment 8862650 [details] Bug 1323333 - Implement limited same-level bookmark deduping. Here's a first cut at same-level deduping. The comment above `PlacesSyncUtils.bookmarks.dedupe` explains what's going on. This needs a lot of tests, but I'd like to nail down the semantics first. PTAL and let me know what you think. The biggest difference is that we only consider bookmarks that exist at the same level in the tree. If you move a similar item into a different folder, and the local and remote trees have different GUIDs, we'll leave the dupes in place. The more the local and remote tree diverge, the less we can do without intervention. Deduping happens in two stages. First, we find congruent folders and smush their contents together. Then, we find and remove all dupes in these folders. In most cases, we expect the dupe group to contain two dupes (one local, one remote), but there can be multiple ones if the user has created explicit dupes. There's probably a cleaner way to represent the "dupe group" concept. I didn't want to implement the grouping logic twice, once in SQL and once in JS, so I settled on keeping it in SQL and using `GROUP_CONCAT`. SQLite lets us define custom aggregate functions, but mozStorage doesn't expose that yet. This patch also no longer uses `hasDupe` or `parentName`, but I'm not sure if we'll need to resurrect them.
Attachment #8862650 - Flags: feedback?(tchiovoloni)
Attachment #8862650 - Flags: feedback?(rnewman)
Attachment #8862650 - Flags: feedback?(markh)
Comment on attachment 8862650 [details] Bug 1323333 - Implement limited same-level bookmark deduping. https://reviewboard.mozilla.org/r/134508/#review137494 ::: services/sync/modules/engines/bookmarks.js:358 (Diff revision 1) > > - // Applies pending tombstones, sets folder child order, and updates the sync > - // status of all `NEW` bookmarks to `NORMAL`. > + // Pares duplicates, applies pending tombstones, sets folder child order, > + // and update the sync status of all `NEW` bookmarks to `NORMAL`. > async _postProcessIncoming() { > + let childIndices = this._store.getChildIndices(); > + let newChanges = await PlacesSyncUtils.bookmarks.dedupe(childIndices); This is an eyesore. I pass the child indices this way because we reorder after deduping. We could reorder before deduping, but, because we only know the server's order, we'd end up scrambling any local bookmarks that *aren't* dupes. (Similar issue as bug 1352233). ::: toolkit/components/places/PlacesSyncUtils.jsm:1959 (Diff revision 1) > + await PlacesUtils.bookmarks.remove(guid, { > + preventRemovalOfNonEmptyFolders: true, > + // TODO: Consider a different source. We shouldn't write > + // tombstones in any case, since the folder we're deleting > + // is `NEW`. > + source: PlacesUtils.bookmarks.SOURCES.DEFAULT, So the semantics of the new source, if we use one, would be the same as `NORMAL`: bump the change counter, write tombstones. ::: toolkit/components/places/PlacesSyncUtils.jsm:1993 (Diff revision 1) > + > +// Returns an array of rows containing folder dupe group keys. Instead of > +// selecting the duplicate rows and re-grouping in JS, we encode the dupe > +// info into a string "dupe key", and then abuse SQLite's `GROUP_CONCAT` to join > +// the dupe keys into a "dupe group key". The dupe key is `{level} | > +// {position} | {guid} | {syncStatus}`, and the dupe group key is multiple dupe This comment isn't accurate anymore; the dupe is now `{position} | {guid} | {syncStatus}`. Also, I've started calling them "dupe groups" and "dupes" instead of "dupe group keys" and "dupe keys". ::: toolkit/components/places/PlacesSyncUtils.jsm:2028 (Diff revision 1) > +} > + > +// Returns an array of rows containing smart bookmark, bookmark, livemark, and > +// separator dupe group keys. See `findDupeFolders` for an explanation of how > +// dupe keys and dupe group keys work. The dupe key for these items is just > +// `{guid} | {syncStatus}`; unlike folders, we don't need the level or position. Ditto. ::: toolkit/components/places/PlacesSyncUtils.jsm:2078 (Diff revision 1) > + n.name = :livemarksAnno > + GROUP BY b.parent, b.title > + HAVING COUNT(*) > 1 > + UNION ALL > + > + /* Separators: parent. TODO: Only dupe immediate siblings? */ As written, this will dedupe all siblings in the same folder.
Comment on attachment 8862650 [details] Bug 1323333 - Implement limited same-level bookmark deduping. https://reviewboard.mozilla.org/r/134508/#review137494 > This is an eyesore. I pass the child indices this way because we reorder after deduping. We could reorder before deduping, but, because we only know the server's order, we'd end up scrambling any local bookmarks that *aren't* dupes. (Similar issue as bug 1352233). Thom suggests this might not be necessary. I'm using the child indices to sort remote dupes, to make it more likely that we'll match up explicit dupes correctly. Consider two local items (A, AA) and two remote items (B, BB) that have the same title and level, but different children. Since we don't consider children when duping, it's possible for A *or* AA to dedupe to either B or BB. We'll still get it wrong if A should dedupe to B, and AA should dedupe to BB, but they're ordered as (BB, B) on the server. OTOH, maybe explicit dupes are unlikely enough (bug 1213369 notwithstanding) to where it's OK if we get this wrong. > As written, this will dedupe all siblings in the same folder. Um, I meant separators.
Comment on attachment 8862650 [details] Bug 1323333 - Implement limited same-level bookmark deduping. In general, I think this bug is taking the correct approach, but mapping what I *think* code does to what the code *actually* does is tricky, as we an experience with the current de-dupe code. I mentioned this last week, but TBH I think some explicit docs/tests are necessary, to help document exactly what is in-scope and what is out-of-scope (ie, to document the "what" before we finalize the "how"). I understand that's easier said than done, but I don't want us to end up in roughly the same position we are in now - fairly complicated code that we *think* does the right thing, but doesn't in practice. Hopefully by scaling back the de-dupe logic (which this does, and which I fully support) might make that easier than trying to do full-blown tree-wide fuzzy deduping, which is roughly what we have now.
Attachment #8862650 - Flags: feedback?(markh) → feedback+
(In reply to Mark Hammond [:markh] from comment #29) > I mentioned this last week, but TBH I think some explicit docs/tests are > necessary… Yup. Need to capture all the invariants: - Two bookmark trees with no bookmarks with the same URL will have the same number of bookmarks when merged. - If we're doing top-down, then two identical bookmark trees with different GUIDs will be completely merged. - (If we're not doing top-down, then the order of processing of records dictates which folders will be compared for duping!) - An empty tree and a non-empty tree: order of the non-empty tree is preserved. - Partly overlapping trees: the non-overlapping parts are order-preserved. - Duplicate items at different levels of the tree will only be duped if they're inside folders that are at some point in an identical subtree with a shared GUID at the top. - Add bookmark X remotely. Add two identical bookmarks, X1 and X2, locally. Do we end up with 3 (wrong), 2 (right), or 1 bookmark after duping? - etc.
Comment on attachment 8862650 [details] Bug 1323333 - Implement limited same-level bookmark deduping. https://reviewboard.mozilla.org/r/134508/#review138544 I didn't dig into the SQL or anything at this stage. My two substantive comments are: - I think doing this inside a live bookmark tree is dangerous. It means you're routinely creating up to N dupes, and then deleting up to N local records to match. That's a lot of churn (not good for Places performance!), but it's also not transactional, which means interruption at any point will create quite a mess. - I think you're using Places in order to impose a shadow tree: that is, contrary to my inline comment, you can trust the level and name because you're using the Places tree and GUID collisions to overlay incoming records. If so, I think that might be the better vocabulary to use, and I think you should consider whether there are ways to model what you're doing without mutating the local copy. ::: services/sync/modules/engines/bookmarks.js:358 (Diff revision 1) > > - // Applies pending tombstones, sets folder child order, and updates the sync > - // status of all `NEW` bookmarks to `NORMAL`. > + // Pares duplicates, applies pending tombstones, sets folder child order, > + // and update the sync status of all `NEW` bookmarks to `NORMAL`. > async _postProcessIncoming() { > + let childIndices = this._store.getChildIndices(); > + let newChanges = await PlacesSyncUtils.bookmarks.dedupe(childIndices); This is, unfortunately, really tricky. Imagine the user quits Firefox after downloading most of the server's bookmarks. At this moment their local bookmark tree is full of duplicates, and the sync is interrupted before `dedupe` is called. Any metadata kept in memory is now gone. If they relaunch Firefox, I think you're betting that the items will be redownloaded, the GUIDs will match, and so you won't get a third copy… but will the deduping work? ::: toolkit/components/places/PlacesSyncUtils.jsm:600 (Diff revision 1) > + * Before uploading, we scan the tree and collapse similar `NEW` and `UNKNOWN` > + * items. > + * > + * How it works: > + * > + * 1. Walk all syncable roots to find folders with the same name at the same Name is probably wrong. Level, sadly, is also probably wrong; it won't handle the trivial case of a parent folder having been moved. The only way to do this correctly is to stitch all the way back to a common root: in the case of a partial sync that's a remote GUID that's known locally but wasn't downloaded; in the worst case that's all the way up to a Places root.
Attachment #8862650 - Flags: feedback?(rnewman)
Assignee: kit → nobody
Status: ASSIGNED → NEW
Priority: P1 → P3
Attachment #8862650 - Flags: feedback?(tchiovoloni)
Priority: P3 → P2
Assignee: nobody → kit
Depends on: 1380474
Rolling this up into the two-phase work.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
Whiteboard: [data-integrity][sync:bookmarks] → [data-integrity][sync:bookmarks][Sync Q3 OKR]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: