Closed Bug 1317223 Opened 8 years ago Closed 8 years ago

Add desktop implementation of bookmark repair

Categories

(Firefox :: Sync, defect, P1)

defect

Tracking

()

RESOLVED FIXED
Firefox 54
Tracking Status
firefox54 --- fixed

People

(Reporter: markh, Assigned: markh)

References

(Blocks 1 open bug)

Details

(Whiteboard: [ios-sync])

Attachments

(8 files, 18 obsolete files)

(deleted), text/x-review-board-request
Details
(deleted), text/x-review-board-request
tcsc
: review+
Details
(deleted), text/x-review-board-request
tcsc
: review+
Details
(deleted), text/x-review-board-request
lina
: review+
Details
(deleted), text/x-review-board-request
tcsc
: review+
Details
(deleted), patch
markh
: review+
Details | Diff | Splinter Review
(deleted), text/x-review-board-request
markh
: review+
Details
(deleted), patch
rnewman
: review+
Details | Diff | Splinter Review
This is to create an implementation on desktop for a "repair requestor". We'll open another bug for the "repair responder" (even though I suspect they will be implemented in parallel) See https://docs.google.com/document/d/1fKEaNTn6VNiSHh4b3r7jIQVrEmG3F6MI0smYWis7ULA/ for a rough implementation plan, but the tl;dr is: * When we run the validator, build a list of IDs we think are missing. * Bounce around other clients and send a command requesting they upload the missing items. * At some point this repair process finishes (either with or without the missing IDs. * Re-run the validator. * Ensure lots of telemetry so we can analyze the performance. I'm about to upload a WIP, which still needs a fair bit of work, particularly integration with the validator.
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. https://reviewboard.mozilla.org/r/92650/#review92644 Note that this is very much a WIP to try and get a basic "shape" of the repair process in place. ::: services/sync/modules/bookmark_repair.js:66 (Diff revision 1) > + this.prefs = new Preferences(PREF_BRANCH); > + // allow service to be mocked in tests. > + this.service = service || Weave.Service; > + } > + > + startRepairs(flowID, validationInfo) { The idea is that this is the main entry-point which will be called after the validator runs. The flowID is passed in as I expect we will want the flowID allocated by the caller and reported as part of the validation (so we can tie the validation and the repairs together) ::: services/sync/modules/bookmark_repair.js:85 (Diff revision 1) > + /* Work out what state our current repair request is in, and whether it can > + proceed to a new state. > + Returns true if we could continue the repair - even if the state didn't > + actually move. Returns false if we aren't actually repairing. > + */ > + continueRepairs(responseInfo = null) { This is something that should be periodically called by someone at sometime, usually without a param, to see if we've timed out waiting for response. The client engine will also call it with an arg once it sees the `repairResponse` command come in. ::: services/sync/modules/bookmark_repair.js:123 (Diff revision 1) > + log.error("continueRepairs spun without getting a new state"); > + } > + if (newState == STATE_FINISHED || newState == STATE_ABORTED) { > + let object = newState == STATE_FINISHED ? "finished" : "aborted"; > + let extra = { flowID: this._flowID, numIDs: this._currentIDs.length }; > + this.service.recordTelemetryEvent("repair", object, undefined, extra); Note that somehow here we need to ensure something notices the state change to "finished" and arranges to somehow re-run the validator with the same flowID. An observer? But who listens for it? Resetting the state here also means we've lost the flowID etc - it may turn out we should have another public method purely for resetting the state etc, but it's not clear who would call that either. ::: services/sync/modules/bookmark_repair.js:325 (Diff revision 1) > + > + get _currentIDs() { > + let ids = this.prefs.get(PREF_REPAIR_MISSING_IDS, ""); > + return ids.length ? ids.split(" ") : []; > + } > + // XXX - setters? Most of the code above just sets the prefs directly, but having setters for them probably makes sense. ::: services/sync/modules/bookmark_repair.js:327 (Diff revision 1) > + let ids = this.prefs.get(PREF_REPAIR_MISSING_IDS, ""); > + return ids.length ? ids.split(" ") : []; > + } > + // XXX - setters? > + > + get _clientsUsed() { this is a bad name! ::: services/sync/modules/collection_repair.js:16 (Diff revision 1) > + "resource://services-sync/bookmark_repair.js"); > + > +this.EXPORTED_SYMBOLS = ["getRepairRequestor"]; > + > +// Should we maybe enforce the requestors being a singleton? > +function getRepairRequestor(collection) { haven't really thoughts this module through, but the idea is to avoid callers knowing what each implementation is. ::: services/sync/modules/collection_repair.js:25 (Diff revision 1) > + default: > + return null; > + } > +} > + > +// And maybe a sub-class that defines the "contract"??? (ie, defines the constructor, `startRepair` and `continueRepair` bool methods - these should be about the only things external callers need.
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. https://reviewboard.mozilla.org/r/92650/#review92670 ::: services/sync/modules/bookmark_repair.js:221 (Diff revision 1) > + if (state == STATE_SENT_REQUEST) { > + log.info(`previous request to client ${clientID} was removed - trying a second time`); > + state = STATE_SENT_SECOND_REQUEST; > + this._writeRequest(clientID); > + } else { > + // this was the second time around, so give up on this client I think we need to be smarter here too - I bet it is easy to come up with interrupted-sync scenarios where 2 client collection syncs happen before bookmarks actually completed (IOW, that we may give up on the other client while it's still trying to write the bookmarks and response)
Priority: -- → P1
Assignee: nobody → markh
Hi, a question: the "desktop" word in the bug description implies that this won't work on android devices? Thanks.
Correct. The Sync codebases on desktop, Android, and iOS are different: they're implemented in platform-appropriate languages for integration reasons.
Hi, So if I sync using only android clients then I'll never get the repair functionality? Or if I'll sync using both android and desktop then I'll risk errors on the android part, which then will be corrected when I sync on desktop? thanks.
(In reply to Stefano Caselli from comment #6) > So if I sync using only android clients then I'll never get the repair > functionality? You won't get it from this bug, no. There are six parts to this work: a repair 'responder' on each platform, to fix what they broke, and a repair 'requestor' on each platform, to detect and call out issues it found. This bug is the requestor for desktop. See Bug 1260900 for more.
ok thanks!
I'm about to upload my current WIP for this so there can be some collaboration between the team. The intention is that we will use the "elm" twig (as setup in bug 1324900) and the hg "evolve" extension to work on these patches with the intention of evolving them and landing them on mozilla-central. Using "evolve" will mean we can safely rebase and fold these patches with the intent of landing just 5 patches (but note that I'm still trying to work out how to arrange for elm to be marked as non-publishing, which IIUC is a requirement for this to work). To facilitate reviews, I suggest that we perform a review of all changes to each of the patches, and after review we fold the fixups into the patch, with the end result being that the 5 patches are always in a "fully reviewed" state, while any fixups that exist are in the process of getting review. As a result, we should get review on these 5 patches before landing them in elm. For example, the workflow might look something like: Initial landing on elm of these parts after review: * Part 1 - blah. r=foo * Part 2 - something else. r=bar Then after someone makes changes to one of the parts. * Part 1 - blah. r=foo * Part 2 - something else. r=bar * fixup part 2 - more tests. r?foo The author of the fixup then requests review, and after r+, the repo ends up looking like: * Part 1 - blah. r=foo * Part 2 - something else, with tests. r=bar, foo Then we rinse and repeat as necessary until it is ready to land, at which time we push the folded patches, all of which were fully reviewed, to central. Hopefully we can get started on this even before elm is ready (but note that bug 1289536 will need to be in central for everything to work and the patches to apply cleanly - which it hopefully will be in the next day or 2). I'll be uploading my WIP via reviewboard with detailed commit messages in each patch, including things I've identified which still need to be done in the patch (but note that bugzilla hides mozreview requests by default, so you may need to unhide them for context). I suspect some of the review comments will add to this list :) Regarding patch ownership - I don't really care :) I'm hoping that Edouard (part 1), Thom (part 2 and 3) and Kit (part 4) can take these patches over, and I'm happy to have their name as the patch authors - and as a result, I've asked for review from each of you for these parts. This will inevitably mean that (a) multiple people contributed to a single patch and (b) the person nominated as the patch author will also have done review of the same patch, but I think that's fine and the history should be clear via this bug. As such, you should all feel free to take ownership of the "style" of the patches too - feel free to make it look more like you wrote it in the first place in areas where you prefer a different style. I'm not strongly attached to any of this. rnewman - I only flagged you for review on the first patch, but I'd appreciate a quick look at as many of the other patches as you have time for.
Summary: Add desktop implementation of "repair requestor" for bookmarks. → Add desktop implementation of bookmark repair
Comment on attachment 8824876 [details] Bug 1317223 (part 3) - A bookmark repair requestor. https://reviewboard.mozilla.org/r/103192/#review104142 ::: services/sync/modules/bookmark_repair.js:37 (Diff revision 1) > + super(service); > + this.prefs = new Preferences(PREF_BRANCH); > + } > + > + // The preferences we use to hold our state. > + get PREF() { drive by of my own patch :) I wonder of these prefs should be hard-coded in the getters/setters? The literals would be only a fre lines from each other and it removes a level of indirection... (I think "state" below is OK - botj PREF and STATE started as module-level constants, but that seemed wrong with the responder also ending up in this module - maybe splitting into 2 modules is better?)
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. https://reviewboard.mozilla.org/r/92650/#review104260 ::: services/sync/modules/engines/clients.js:494 (Diff revision 2) > wipeAll: { args: 0, desc: "Delete all client data for all engines" }, > wipeEngine: { args: 1, desc: "Delete all client data for engine" }, > logout: { args: 0, desc: "Log out client" }, > displayURI: { args: 3, desc: "Instruct a client to display a URI" }, > + repairRequest: {args: 1, desc: "Instruct a client to initiate a repair"}, > + repairResponse: {args: 1, desc: "Instruct a client a repair request is complete"}, Bookmarks repair request? Or do you envision this carrying engine-specifying data? ::: services/sync/modules/engines/clients.js:634 (Diff revision 2) > - this._log.debug("Received an unknown command: " + command); > + this._log.warn("Received an unknown command: " + command); > break; > } > // Add the command to the "cleared" commands list > - this._addClientCommand(this.localID, rawCommand) > + if (shouldRemoveCommand) { > + this.removeLocalCommand(rawCommand); There are two places a command lives, right? There's commands.json on disk, and there's our record on the server. Writes to each of those are async and could fail. Your approach, IIUC, keeps those things the same — if any changes are made to commands.json, they're tracked and uploaded. If the upload fails, we'll eventually see the command again (because most commands aren't uniquely identified). If the local write fails, or is skipped, what happens? If the tracker change isn't persisted, what happens?
Attachment #8810302 - Flags: review?(rnewman)
Comment on attachment 8824875 [details] Bug 1317223 (part 2) - add 'doctor' concept and move bookmark validation logic to it. https://reviewboard.mozilla.org/r/103190/#review104280 As you mention, this is mostly just moving existing code into a new place, so looks fine with the issue addressed. ::: services/sync/modules/doctor.js:46 (Diff revision 1) > + > + // We are called at the end of a sync, which is a good time to periodically > + // check each repairer to see if it can advance. > + if (this._now() / 1000 - this.lastRepairAdvance > REPAIR_ADVANCE_PERIOD) { > + try { > + for (let [collection, requestor] of Object.entries(this._getAllRepairRequestors())) { Should we do this first and prioritize validating engines with repair requestors if the repair period is correct? E.g. if we're going to do a repair, it might make sense to validatate first no matter what. (The answer could be "no" or "this is handled in a different patch", certainly) ::: services/sync/modules/doctor.js:91 (Diff revision 1) > + } > + > + // Update the time now, even if we decline to actually perform a > + // validation. We don't want to check the rest of these more frequently > + // than once a day. > + Svc.Prefs.set("validation.lastTime", nowSeconds); Don't we want this to happen before we check the validation probability? If we do so here, we'll be substantially more likely to run validation every day, since we'll do the random check repeatedly until it passes.
Attachment #8824875 - Flags: review?(tchiovoloni) → review+
Comment on attachment 8824876 [details] Bug 1317223 (part 3) - A bookmark repair requestor. https://reviewboard.mozilla.org/r/103192/#review104284 Largely fine, but a few nits and questions. ::: services/sync/modules/bookmark_repair.js:37 (Diff revision 1) > + super(service); > + this.prefs = new Preferences(PREF_BRANCH); > + } > + > + // The preferences we use to hold our state. > + get PREF() { It seems more consistent with the rest of sync to hardcode the pref names in the accessors below. At the very least, it seems weird that this and state these return new objects each time, but I might be missing why it's done this and not an additional exported symbol (per enumeration). ::: services/sync/modules/bookmark_repair.js:89 (Diff revision 1) > + } > + // orphans are when the child exists but the parent doesn't. > + for (let { parent } of validationInfo.problems.orphans) { > + ids.add(parent); > + } > + // XXX - any others we should consider? There are several I'd add. Many of them are likely to be duplicates, but that shouldn't matter. - parentChildMismatches (both parent and child, since we don't know which is right) - deletedChildren (parent) - multipleParents (all parents) - deletedParents (child) - duplicateChildren Possibly serverMissing/serverUnexpected as well, although I'm not as sure about thise. I'd be fine with waiting until later to add these though. ::: services/sync/modules/bookmark_repair.js:110 (Diff revision 1) > + proceed to a new state. > + Returns true if we could continue the repair - even if the state didn't > + actually move. Returns false if we aren't actually repairing. > + */ > + continueRepairs(response = null) { > + if (this._currentState == this.STATE.NOT_REPAIRING) { What about the other inactive states, e.g. aborted/finished? Worth checking? ::: services/sync/modules/bookmark_repair.js:179 (Diff revision 1) > + } > + // Pull apart the response and see if it provided everything we asked for. > + let fixedIDs = response.IDs; > + let remainingIDs = []; > + for (let id of this._currentIDs) { > + if (fixedIDs.indexOf(id) == -1) { nit: I think you should use a Set (or an Object at the very least) to avoid quadratic complexity in the case where we got everything. ::: services/sync/modules/bookmark_repair.js:273 (Diff revision 1) > + break; // our caller will take the abort action. > + > + case this.STATE.FINISHED: > + break; > + > + case "": nit: should be `case this.STATE.NOT_REPAIRING:`, for consistency ::: services/sync/modules/bookmark_repair.js:359 (Diff revision 1) > + this.prefs.set(this.PREF.CURRENT_STATE, newState); > + } > + > + get _currentIDs() { > + let ids = this.prefs.get(this.PREF.REPAIR_MISSING_IDS, ""); > + return ids.length ? ids.split(" ") : []; Seems a bit dodgey to keep this in a pref... Seems like it could be large for very broken users. Is there a way we can get this value from the telemetry later?
Attachment #8824876 - Flags: review?(tchiovoloni)
Comment on attachment 8824878 [details] Bug 1317223 (part 6) - integration tests for the bookmark repair requestor and responder. https://reviewboard.mozilla.org/r/103196/#review104294 I have mixed feelings about using TPS for this, mainly since TPS never runs (although, if this could be an excuse to make that happen, I'd probably be for it). It's not completely trivial to make TPS work for this case. E.g. it's not just a matter of writing new TPS tests, we'd also have to extend TPS so that it can corrupt the server, for example. (We also *might* need to update the bookmark API functions it uses to the more modern variants, instead of the oldest and most deprecated possible versions, but we also might be able to get away with not doing that). After doing those, it would probably be easier to test this functionality, not to mention that it would be a much more realistic test without as many bizarre hacks, but it could end up being wasted effort depending on how thorough we want to be about these tests. Although, "we construct a dummy version of the clients engine with a manually assigned id in the repair requestor tests" sounds like the kind of edge case that's likely to make us miserable eventually, so... ::: services/sync/tests/unit/test_bookmark_repair.js:142 (Diff revision 1) > + _("Syncing so the outgoing client command is sent."); > + Service.sync(); > + equal(clientsEngine.getClientCommands(remoteID).length, 0); > + > + // so now let's take over the role of that other client! > + let remoteClientsEngine = Service.clientsEngine = new ClientEngine(Service); D:
Attachment #8824878 - Flags: review?(tchiovoloni)
(In reply to Richard Newman [:rnewman] from comment #16) > Comment on attachment 8810302 [details] > > + repairRequest: {args: 1, desc: "Instruct a client to initiate a repair"}, > > + repairResponse: {args: 1, desc: "Instruct a client a repair request is complete"}, > > Bookmarks repair request? Or do you envision this carrying engine-specifying > data? Yes - the request info includes the collection name being repaired and the implementation of these commands attempts to fetch the requestor/responder for the specific collection and handles the fact it might not exist. IOW, a future repair implementation for other collections should, in theory, not require a new command or any other changes to client.js > ::: services/sync/modules/engines/clients.js:634 > There are two places a command lives, right? There's commands.json on disk, > and there's our record on the server. Actually 3 since bug 1289287 - there is a commands-syncing.json which is designed to have the window where failures can cause problems to be as small as possible. commands-syncing.json is what we are *currently* trying to write back to the server, with the idea being that a startup while that file exists means we didn't actually manage to write it, so we recover by writing the contents of that before writing any future changes stored in commands.json. > > Writes to each of those are async and could fail. > > Your approach, IIUC, keeps those things the same — if any changes are made > to commands.json, they're tracked and uploaded. To be clear, IIUC this patch just changes these semantics for a single command - it's not clear from your comment if you are concerned about only this command, or with commands in general? > > If the upload fails, we'll eventually see the command again (because most > commands aren't uniquely identified). As above, the upload failing should be recovered from - we'll upload the saved changes. > If the local write fails, or is skipped, what happens? In the context of this patch, nothing, as this patch actively tries to avoid this write happening for that one command. In the general case, the write failing will mean the command isn't removed and will be executed next sync, as commands.json will not have had the command removed from it. > > If the tracker change isn't persisted, what happens? Failing to track the change for the local device will mean the changes only get uploaded on the next "periodic" update. I suspect I'm missing your point here though - all engines, including clients, are going to get somewhat upset of the tracker fails to be written (although bug 1319175 should make that less likely.) My biggest concern with the changes here are that some error will cause us to fail to hit the point where we post a response and remove the initial command, causing that error to repeat on every sync - but that doesn't sound like your concern?
(In reply to Thom Chiovoloni [:tcsc] from comment #18) > > + // The preferences we use to hold our state. > > + get PREF() { > > It seems more consistent with the rest of sync to hardcode the pref names in > the accessors below. Yeah, I think that's correct. > At the very least, it seems weird that this and state these return new > objects each time, but I might be missing why it's done this and not an > additional exported symbol (per enumeration). It's really done only because js class syntax doesn't support attributes, const or otherwise, and by making them global it wasn't as obvious they apply only to the requestor and not the responder. It might make more sense to split this file in 2 though, which would remove those concerns. (My initial version of the patch had them as non-exported globals, and the test just used the string values) > There are several I'd add. Many of them are likely to be duplicates, but > that shouldn't matter. > > - parentChildMismatches (both parent and child, since we don't know which is > right) > - deletedChildren (parent) > - multipleParents (all parents) > - deletedParents (child) > - duplicateChildren > > Possibly serverMissing/serverUnexpected as well, although I'm not as sure > about thise. > > I'd be fine with waiting until later to add these though. At this stage I was trying to limit the uploads to things we know aren't on the server (or things that are in the server that should be removed). I was vaguely trying to avoid otherwise changing an item that already exists on the server, but that was just an abundance of caution other than having concrete concerns. I guess there's no good reason not to do the "obvious" ones in the above list though, but as soon as it becomes non-obvious and non-trivial, I agree we should wait and let telemetry guide us. > ::: services/sync/modules/bookmark_repair.js:110 > (Diff revision 1) > > + proceed to a new state. > > + Returns true if we could continue the repair - even if the state didn't > > + actually move. Returns false if we aren't actually repairing. > > + */ > > + continueRepairs(response = null) { > > + if (this._currentState == this.STATE.NOT_REPAIRING) { > > What about the other inactive states, e.g. aborted/finished? Worth checking? I think as written these would be more like an "assertion" - it should be impossible for continueRepairs to return while in those states. I agree that's not clear and not particularly obvious (and also part of the reason I probably should have left the STATE constants "private".) Maybe just a comment would suffice? > ::: services/sync/modules/bookmark_repair.js:179 > (Diff revision 1) > > + } > > + // Pull apart the response and see if it provided everything we asked for. > > + let fixedIDs = response.IDs; > > + let remainingIDs = []; > > + for (let id of this._currentIDs) { > > + if (fixedIDs.indexOf(id) == -1) { > > nit: I think you should use a Set (or an Object at the very least) to avoid > quadratic complexity in the case where we got everything. Yeah, agreed, especially given Sets are used in other parts of the patch series. > nit: should be `case this.STATE.NOT_REPAIRING:`, for consistency Indeed :) > Seems a bit dodgey to keep this in a pref... Seems like it could be large > for very broken users. Is there a way we can get this value from the > telemetry later? Yeah, that's a good point for very large lists of IDs, although I'm not sure what you mean by getting it from telemetry later? To address this we could consider keeping all the state in a .json file on disk? Another alternative that might work for the first cut is to simply limit the list of IDs we allow, hoping that eventually we will end up getting them all even if it ends up taking a few repairRequests to proceed. Alternatively, if telemetry tells us there are users with such a huge list of IDs, maybe we should abandon this approach entirely and just have the remote client perform a full upload of everything - that's likely to be more efficient than using this code to individually repair every single item. What do you think?
(In reply to Thom Chiovoloni [:tcsc] from comment #19) > Although, "we construct a dummy version of the clients engine with a > manually assigned id in the repair requestor tests" sounds like the kind of > edge case that's likely to make us miserable eventually, so... Yeah, I'm torn too. > > + // so now let's take over the role of that other client! > > + let remoteClientsEngine = Service.clientsEngine = new ClientEngine(Service); > > D: Exactly :) On one hand, investing some more in TPS seems like it is in our long-term interests, and this would give us the incentive to harass Karl to finally get it running. OTOH, if that investment in TPS looks like delaying when this can land we should avoid it until after it lands and we are watching how it performs via telemetry. But writing that hacky xpcshell test was starting to get quite difficult - we not only want to pretend we have a different clients engine, we also need to pretend it has a completely different places database. Either way is messy - I'll have more of a play with that part of the patch and see where I end up.
(In reply to Mark Hammond [:markh] from comment #21) > ends up taking a few repairRequests to proceed. Alternatively, if telemetry > tells us there are users with such a huge list of IDs, maybe we should > abandon this approach entirely and just have the remote client perform a > full upload of everything To be clear, I mean that the requesting client abandons this approach, not that we abandon it everywhere. eg, something like "if < 20% of total bookmarks need repair, use this strategy, otherwise write a repair request command that doesn't list individual IDs and instead flags to perform a full upload"
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. https://reviewboard.mozilla.org/r/92650/#review104332 ::: services/sync/modules/collection_repair.js:23 (Diff revision 2) > +} > + > +const RESPONDERS = { > +} > + > +// Should we maybe enforce the requestors being a singleton? Maybe, though it looks like the bookmark repair requestor is cheap enough to construct multiple times. Another approach is to add a lazy getters to each engine that return the repair requestor for that collection, as in http://searchfox.org/mozilla-central/rev/225ab0637ed51b8b3f9f4ee2f9c339a37a65b626/services/sync/modules/engines.js#676-686. ::: services/sync/modules/collection_repair.js:25 (Diff revision 2) > +const RESPONDERS = { > +} > + > +// Should we maybe enforce the requestors being a singleton? > +function _getRepairConstructor(which, collection) { > + let [modname, symbolname] = which[collection]; Nit: Bail if `!(collection in which)`. ::: services/sync/modules/collection_repair.js:72 (Diff revision 2) > + @param validationInfo {Object} > + The validation info as returned by the collection's validator. > + > + @param flowID {String} > + A guid that uniquely identifies this repair process, and which > + should be sent to any requestors and reported in telemetry. Worth documenting that the flow ID is the same for all repaired collections? Or maybe I'm misunderstanding `Doctor::_runValidators` in part 2. ::: services/sync/modules/collection_repair.js:116 (Diff revision 2) > + > + @param rawCommand {Object} > + The command object as stored in the clients engine, and which > + will be automatically removed once a repair completes. > + */ > + repair(request, rawCommand) { Might make sense to mark this function as async, for clarity.
Comment on attachment 8824875 [details] Bug 1317223 (part 2) - add 'doctor' concept and move bookmark validation logic to it. https://reviewboard.mozilla.org/r/103190/#review104350 ::: services/sync/modules/doctor.js:113 (Diff revision 1) > + } > + > + if (Object.values(engineInfos).filter(i => i.maxRecords != -1).length != 0) { > + // at least some of the engines have maxRecord restrictions which require > + // us to ask the server for the counts. > + countInfo = this._fetchCollectionCounts(); Nit: Missing `let`. ::: services/sync/modules/doctor.js:153 (Diff revision 1) > + let result = await validator.validate(engine); > + Observers.notify("weave:engine:validate:finish", result, engine.name); > + // And see if we should repair. > + await this._maybeCure(engine, result, flowID); > + } catch (e) { > + if (Async.isShutdownException(ex)) { Nit: `ex` => `e`.
Comment on attachment 8824876 [details] Bug 1317223 (part 3) - A bookmark repair requestor. https://reviewboard.mozilla.org/r/103192/#review104358 ::: services/sync/modules/bookmark_repair.js:37 (Diff revision 1) > + super(service); > + this.prefs = new Preferences(PREF_BRANCH); > + } > + > + // The preferences we use to hold our state. > + get PREF() { +1 for hard-coding the names in the getters and setters; it makes grepping for the prefs easier. I'd prefer to see `STATE` as a module-level constant, too, but I'm not picky about either. ::: services/sync/modules/bookmark_repair.js:89 (Diff revision 1) > + } > + // orphans are when the child exists but the parent doesn't. > + for (let { parent } of validationInfo.problems.orphans) { > + ids.add(parent); > + } > + // XXX - any others we should consider? Maybe reupload the parent for `childrenOnNonFolder` and `parentNotFolder`, too, in addition to the ones Thom mentioned. ::: services/sync/modules/bookmark_repair.js:100 (Diff revision 1) > + log.info(`Starting a repair, looking for ${ids.size} missing item(s)`); > + // setup our prefs to indicate we are on our way. > + this._flowID = flowID; > + this._currentIDs = Array.from(ids); > + this._currentState = this.STATE.NEED_NEW_CLIENT; > + this.service.recordTelemetryEvent("repair", "started", undefined, { flowID, numIDs: ids.size.toString() }); Looks like `numIDs` is a number in tests, but always recorded as strings in the calls to `recordTelemetryEvent`. Is that intentional? ::: services/sync/modules/bookmark_repair.js:156 (Diff revision 1) > + } > + return true; > + } > + > + _continueRepairs(state, response = null) { > + switch (state) { This state machine makes the whole process so much easier to follow. :thumbsup: ::: services/sync/modules/bookmark_repair.js:176 (Diff revision 1) > + // an entirely different repair flow, or from a client we've since > + // given up on.) It doesn't mean we need to abort though... > + break; > + } > + // Pull apart the response and see if it provided everything we asked for. > + let fixedIDs = response.IDs; Looks like this is `ids` in part 4, but `IDs` here? Although I haven't fully read through the responder yet, so I might be missing something. ::: services/sync/modules/bookmark_repair.js:224 (Diff revision 1) > + let lastRequestSent = this.prefs.get(this.PREF.REPAIR_WHEN); > + let timeLeft = lastRequestSent + RESPONSE_INTERVAL_TIMEOUT - this._now() / 1000; > + if (timeLeft <= 0) { > + log.info(`previous request to client ${clientID} is pending, but has taken too long`); > + state = this.STATE.NEED_NEW_CLIENT; > + // XXX - should we remove the command? Since we're going to queue the command for another client, we might as well remove the old one. I don't think it'll hurt anything if we leave it, the old client eventually syncs, and reuploads those records, but it's extra work we can avoid. ::: services/sync/modules/bookmark_repair.js:241 (Diff revision 1) > + // The command isn't pending - if this was the first request, we give > + // it another go (as that client may have cleared the command but is yet > + // to complete the sync) > + // XXX - note that this is no longer true - the responders don't remove > + // their command until they have written a response. This might mean > + // we could drop the entire this.STATE.SENT_SECOND_REQUEST concept??? At first glance, if the responder keeps its command until it's finished, removing the second request seems like the right thing to do. ::: services/sync/modules/bookmark_repair.js:262 (Diff revision 1) > + state = this.STATE.FINISHED; > + break; > + } > + let newClientsUsed = this._clientsUsed; > + newClientsUsed.push(newClientID); > + this._clientsUsed = newClientsUsed; Does it make sense to update `_clientsUsed` after we send the command, so that shutdown doesn't cause us to potentially ignore clients after restart? Or is that {already handled, not an issue}? ::: services/sync/modules/bookmark_repair.js:274 (Diff revision 1) > + > + case this.STATE.FINISHED: > + break; > + > + case "": > + // No repair is in progress. This is a common case, so only log trace. Common because the doctor periodically calls `requestor.continueRepairs` at the end of the sync, right? ::: services/sync/modules/bookmark_repair.js:290 (Diff revision 1) > + } > + > + // Issue a repair request to a specific client. > + _writeRequest(clientID) { > + log.trace("writing repair request to client", clientID); > + let ids = this._currentIDs Nit: Trailing ; ::: services/sync/modules/bookmark_repair.js:365 (Diff revision 1) > + } > + set _currentIDs(arrayOfIDs) { > + this.prefs.set(this.PREF.REPAIR_MISSING_IDS, arrayOfIDs.join(" ")); > + } > + > + get _clientsUsed() { I have a slight preference for writing this and `_currentIDs` as methods, since they always return new arrays. The logic for adding a client to the list in the `NEED_NEW_CLIENT` state handling puzzled me on the first read, until I scrolled down and saw `_clientsUsed` is a getter. Your call, though.
Comment on attachment 8824877 [details] Bug 1317223 (part 5) - a bookmark repair responder. https://reviewboard.mozilla.org/r/103194/#review104622 Approach looks good! ::: services/sync/modules/bookmark_repair.js:391 (Diff revision 1) > } > } > + > +/* An object that responds to repair requests initiated by some other device. > +*/ > +class BookmarkRepairResponder extends CollectionRepairResponder { Half-formed thought: Is it possible for us to deadlock during repair, such that we're waiting on another client to supply our missing records, while it's waiting on us? We remove the command when we abort the repair, so IIUC the other client will time out and try again with a different client. But I wonder if this is worth pondering, or if it's not really a problem. ::: services/sync/modules/bookmark_repair.js:398 (Diff revision 1) > + if (request.request != "upload") { > + this._abortRepair(request, rawCommand, > + `Don't understand request type ${request.request}`); > + return; > + } > + if (this._currentState) { This can happen if multiple clients send us repair requests, right? Instead of aborting, I wonder if we can queue the requests, and send repair responses to all those clients once we're done. (We can also set a "repair in progress" flag earlier here, instead of after we've tracked everything). ::: services/sync/modules/bookmark_repair.js:410 (Diff revision 1) > + // handle them. Potential issues include: > + // * The item exists locally, but isn't in the tree of items we sync (eg, it > + // might be a left-pane item or similar.) > + // * The item exists locally as a folder - and the children of the folder > + // also don't exist on the server - just uploading the folder isn't going > + // to help. (Note that we assume the parents *do* exist, otherwise the It would be nice if we could delete all children of the folder on the server, too, in case we have parent-child disagreements. But resolving those might require the sender to specify the expected hierarchy, too, not just IDs. This might be prematurely overcomplicating things, too. If we still think missing items are the largest source of corruption, it makes sense to focus on those first. They're easier to fix than structural differences. ::: services/sync/modules/bookmark_repair.js:426 (Diff revision 1) > + log.debug(`repair request to upload item ${id} but it doesn't exist locally`); > + continue; > + } > + // We have it - should we actually upload it? Walk up to the root to find > + // out. > + // XXX - should we have PlacesSyncUtils help us out here? Totally. Let's add a method to get the ancestors of an item. ::: services/sync/modules/bookmark_repair.js:434 (Diff revision 1) > + parentid = (await PlacesSyncUtils.bookmarks.fetch(parentid)).parentSyncId; > + } > + if (!PlacesSyncUtils.bookmarks.ROOTS.includes(parentid)) { > + log.debug(`repair request to upload item ${id} but it isn't under a syncable root`); > + toDelete.add(id); > + // XXX - should we walk up to the root and flag all parents as deleted? Maybe, and I think we'll need to flag the ancestors and all their children as deleted. The caveat is, we might not have the complete tree, either (bug 1303679, comment 15). It could be that we upload these non-syncable items, and wait to delete them until everyone has a consistent (though wrong) view. That may be never, and we really don't want to proliferate those records. The best approach might be to take our chances and delete the subtree, and see if we end up orphaning items that we don't know about. We can then repair those separately. ::: services/sync/modules/bookmark_repair.js:449 (Diff revision 1) > + itemSource.ids = request.ids; > + log.trace(`checking the server for items`, request.ids); > + for (let remoteID of JSON.parse(itemSource.get())) { > + log.trace(`the server has "${remoteID}"`); > + existsRemotely.add(remoteID); > + // This item exists on the server, so remove it from toUpload. Or it exists on the server, but has structural differences that we need to fix. ::: services/sync/modules/bookmark_repair.js:463 (Diff revision 1) > + } > + // whew - now add these items to the tracker "weakly" (ie, they will not > + // persist in the case of a restart, but that's OK - we'll then end up here > + // again.) > + // XXX - not clear how to "track weakly" - something like this? > + let trackWeakly = (id, isTombstone = false) => { Sure, this is good. Another way is to track these in a separate object in `BookmarksChangeset` (like `weakChanges`, to match `changes`).
Attachment #8824877 - Flags: review?(kit)
Comment on attachment 8824878 [details] Bug 1317223 (part 6) - integration tests for the bookmark repair requestor and responder. https://reviewboard.mozilla.org/r/103196/#review104642 ::: services/sync/tests/unit/test_bookmark_repair.js:144 (Diff revision 1) > + equal(clientsEngine.getClientCommands(remoteID).length, 0); > + > + // so now let's take over the role of that other client! > + let remoteClientsEngine = Service.clientsEngine = new ClientEngine(Service); > + remoteClientsEngine.localID = remoteID; > + _("what could possibly go wrong?"); https://software.intel.com/sites/default/files/fb/22/race4.jpg
I wonder if we should consider a threshold where we wipe the server and force a full sync if there are too many errors. Spot repairs for missing bookmarks are OK, but deeper structural issues, like parent-child mismatches, or items appearing in multiple or deleted parents, are trickier. Could we inadvertently make the situation worse by trying to repair these? (Especially if we don't have a full picture of the tree). What if there are lots of errors? If we've consulted all clients, and the tree is still inconsistent, does it make sense to drop everything and start fresh?
(In reply to Kit Cambridge [:kitcambridge] from comment #29) > I wonder if we should consider a threshold where we wipe the server and > force a full sync if there are too many errors. Spot repairs for missing > bookmarks are OK, but deeper structural issues, like parent-child > mismatches, or items appearing in multiple or deleted parents, are trickier. > > Could we inadvertently make the situation worse by trying to repair these? > (Especially if we don't have a full picture of the tree). What if there are > lots of errors? If we've consulted all clients, and the tree is still > inconsistent, does it make sense to drop everything and start fresh? I like this idea but one concern would be items that are flagged as clientMissing -- I think doing this would make this repair tricky if not impossible in some cases.
Thom and I chatted a bit about the responder today. Mark, I'd like to get your thoughts on this, too. Not sure if this is overcomplicating things, or a real issue that we should consider. The case I'm worried about is if the requesting client is missing an entire subtree, or is requesting parts of a subtree that shouldn't be on the server at all. In short: "what if the requester doesn't know it needs an item yet?" A variation of this also comes up if the requester, responder, and server all have different pictures of the tree. For example, let's say client B has a folder that's an orphan, and is also missing some of that folder's children. It sends a repair request to A containing the folder's parent, and the missing children. But it turns out that folder has ancestors that also aren't on the server. And maybe some of those missing children are themselves folders, with descendants that also need to be uploaded. B didn't explicitly ask for those items, because it doesn't know about them yet! A can either: 1) Be helpful and upload everything. This implies that A would need to query the server before uploading, to figure out what else is missing. (This gets more involved for structural issues: the records may already exist; they're just wrong). It would then need to figure out the lowest common ancestors of all the requested items, and upload them and any missing descendants. Again, this gets complex for structural differences. The simplest approach would be to reupload all descendants, but, if B requests "unfiled", that's the equivalent of a full sync. At that point, we'd almost want B to give us some context about why it's requesting an item. But getting specific about why we're asking for records suggests we'd need to port Desktop's validator to iOS, and we don't want to go there. Or: 2) Only upload what B requested. After B syncs, it can revalidate, see that it still doesn't have a complete tree, and send out another repair request. This means more crosstalk: B realizes it's now missing grandparents and grandchildren, so it requests those. Rinse and repeat until the tree is complete, which can take several rounds. Also, if other clients disagree about the tree, and subsequent repair requests are serviced by different clients, the repair might end up creating yet another inconsistent tree. This is where the threshold comes in: at some point, it makes sense to abandon all hope and wipe the server. Thom also suggested having the responder run the validator as a last resort, and seeing if it can fix up the server. We wavered between (1) and (2). On the surface, (1) is attractive: we assume the requester is wrong, and aggressively upload whatever we need to make the server consistent. But, without more context about what we need to upload and why, we can get into pathological cases where a spot repair turns into a full sync. It's also significantly more complicated. So maybe (2) is better: we trust the requester, give it only what it asks for, have it revalidate, and figure out if it needs to run another round.
Flags: needinfo?(markh)
(In reply to Kit Cambridge [:kitcambridge] from comment #31) > Thom and I chatted a bit about the responder today. Mark, I'd like to get > your thoughts on this, too. Not sure if this is overcomplicating things, or > a real issue that we should consider. It's good to be thorough! > The case I'm worried about is if the requesting client is missing an entire > subtree, or is requesting parts of a subtree that shouldn't be on the server > at all. In short: "what if the requester doesn't know it needs an item yet?" > A variation of this also comes up if the requester, responder, and server > all have different pictures of the tree. We can think of this as a shared knowledge problem. That is, Client B is trying to get Client A to understand that it's missing some stuff, and what that stuff is. We know we can't communicate the whole set, because Client B doesn't even know the whole set! Without Client A or the server doing a lot more work, then, we must have an iterative solution. And if the server and A can disagree -- and they can -- we unavoidably have either a full resync or an iterative solution: eventually B will ask A to upload something that's on the server, fully synced according to A, but not the same on the server as on A. We cannot trust A's understanding of the server tree: what it _thinks_ it uploaded in the past is flawed. > 1) Be helpful and upload everything. This implies that A would need to query > the server before uploading, to figure out what else is missing. Or be speculative: upload the requested record and its parent. The problem with being speculative is that A might be missing data from another client C, and it'll screw up the parent. But that's another problem. > 2) Only upload what B requested. After B syncs, it can revalidate, see that > it still doesn't have a complete tree, and send out another repair request. > This means more crosstalk: B realizes it's now missing grandparents and > grandchildren, so it requests those. This is the correct answer, IMO. Most bookmark trees are shallow, and most structural corruption is likely to be a single link in the chain. Client B has an inconsistency around F3. It requests [F2, B4, B5], because it has F3 and is missing the parent F2 and two of F3's children. Perhaps it also re-requests F3 for completeness. A uploads F2 and B4. It signals that it doesn't have B5, which was deleted some time ago but never removed from F3. Client B downloads these, sees that it's missing another child of F2. It requests that and gets it. This process will eventually terminate: even in the case of the server and Client A having only a single overlapping record, B can, at most, request every record it has a reference to from the server, and every record in A (trees are fully connected). I believe this will happen in at most (server depth + A depth) iterations, though I haven't proved it. > Rinse and repeat until the tree is > complete, which can take several rounds. Also, if other clients disagree > about the tree, and subsequent repair requests are serviced by different > clients Remember that most users only have one other device. Can you make a repair 'session' sticky, only falling back to another client when the first session fails? I think the two-device case is safe and convergent.
(In reply to Richard Newman [:rnewman] from comment #32) > We know we can't communicate the whole set, because Client B doesn't even > know the whole set! > > Without Client A or the server doing a lot more work, then, we must have an > iterative solution. > > And if the server and A can disagree -- and they can -- we unavoidably have > either a full resync or an iterative solution: eventually B will ask A to > upload something that's on the server, fully synced according to A, but not > the same on the server as on A. We cannot trust A's understanding of the > server tree: what it _thinks_ it uploaded in the past is flawed. I should add to this: the reason why we have a remote-driven recovery process is that **Client B** knows the full contents of the server. It's trying to incrementally, via the server, discover what A knows that wasn't on the server in order to achieve consistency. Each upload from A either fills in a gap or invalidates an existing part of the tree. The process terminates in one of three cases: - All of the gaps have been filled. - All of the gaps have been eliminated by invalidating part of the server tree ("those children don't even exist!"). - No forward progress can be made. This should never happen: it would require A uploading a folder with a child that it isn't able to upload. Of course, it can happen if that child is more than 256KB, or if there's a bug. That's an explicit termination case, and it can be detected.
Comment on attachment 8824877 [details] Bug 1317223 (part 5) - a bookmark repair responder. https://reviewboard.mozilla.org/r/103194/#review108668 ::: services/sync/modules/bookmark_repair.js:434 (Diff revision 1) > + parentid = (await PlacesSyncUtils.bookmarks.fetch(parentid)).parentSyncId; > + } > + if (!PlacesSyncUtils.bookmarks.ROOTS.includes(parentid)) { > + log.debug(`repair request to upload item ${id} but it isn't under a syncable root`); > + toDelete.add(id); > + // XXX - should we walk up to the root and flag all parents as deleted? Now I'm not so sure. The caveat applies to writing tombstones for descendants, too; not just ancestors. Thinking about comment 33, the requesting client knows the state of the server. It's asking other clients to help with the repair, because only they can supply the missing records. This is slightly different, though: the records aren't missing, they just shouldn't have been uploaded in the first place. There are also cases where we don't want clients to remove items locally. Tags come to mind: they shouldn't be on the server, but we should DELETE them outright, not write tombstones. Pinned sites and reading list items are similar: it's OK to remove them from Desktop, but not Android. Left pane items are automatically recreated, and should be safe to remove locally and DELETE remotely. But these are edge cases. Since most trees are shallow, maybe we do this, then repair inconsistencies during the next round. Or, maybe the right approach is to upload all records, including non-syncable ones. Once the requesting client has a consistent tree, it can revalidate, detect the unexpected items, remove them locally, and DELETE them from the server.
(In reply to Kit Cambridge [:kitcambridge] from comment #34) > There are also cases where we don't want clients to remove items locally. > Tags come to mind: they shouldn't be on the server, but we should DELETE > them outright, not write tombstones. Be careful to do that only when the other client knows to throw that record away. A DELETE has no way of signaling removal to the other client; if this client is the only one who knows that a particular record should be ignored by others, it should upload a tombstone. (It needn't worry about redownloading that tombstone and deleting its tags folder: we know -- or should know! -- not to do so.) Almost by definition, a client will not be asking for help with a record if it knows to throw it away! The situation we're trying to avoid: - A bad record R1 was uploaded to the server by Client A. - Client B downloaded it, can't tell that it should ignore it, and buffers it. - R1 isn't rooted correctly. B asks A to fix R1, or R1's parent. - A decides that R1 should never have been on the server, and issues a DELETE, or does nothing. End result: B still has R1, and can't proceed. Correct behavior: upload a tombstone for R1, and ensure that Client A knows not to delete its local tags record if it sees that tombstone. > Since most trees are shallow, maybe we do this, then repair inconsistencies > during the next round. Or, maybe the right approach is to upload all > records, including non-syncable ones. Once the requesting client has a > consistent tree, it can revalidate, detect the unexpected items, remove them > locally, and DELETE them from the server. I would be wary about leaving junk on the server in the middle of a failable iterative process.
(In reply to Richard Newman [:rnewman] from comment #35) > (In reply to Kit Cambridge [:kitcambridge] from comment #34) > > > There are also cases where we don't want clients to remove items locally. > > Tags come to mind: they shouldn't be on the server, but we should DELETE > > them outright, not write tombstones. > > Be careful to do that only when the other client knows to throw that record > away. A DELETE has no way of signaling removal to the other client; if this > client is the only one who knows that a particular record should be ignored > by others, it should upload a tombstone. > > (It needn't worry about redownloading that tombstone and deleting its tags > folder: we know -- or should know! -- not to do so.) > > Almost by definition, a client will not be asking for help with a record if > it knows to throw it away! I don't think this is strictly true, although it would be in the iOS case. But for example, if a folder references a child that's been deleted, we request both at the moment (at least in what's on the elm twig), since while we expect that the folder is wrong and the child is right, it's possible that this is untrue (I think especially given how deletion had been handled recursively until recently). > > The situation we're trying to avoid: > > - A bad record R1 was uploaded to the server by Client A. > - Client B downloaded it, can't tell that it should ignore it, and buffers > it. > - R1 isn't rooted correctly. B asks A to fix R1, or R1's parent. > - A decides that R1 should never have been on the server, and issues a > DELETE, or does nothing. > > End result: B still has R1, and can't proceed. > > Correct behavior: upload a tombstone for R1, and ensure that Client A knows > not to delete its local tags record if it sees that tombstone. > > > Since most trees are shallow, maybe we do this, then repair inconsistencies > > during the next round. Or, maybe the right approach is to upload all > > records, including non-syncable ones. Once the requesting client has a > > consistent tree, it can revalidate, detect the unexpected items, remove them > > locally, and DELETE them from the server. > > I would be wary about leaving junk on the server in the middle of a failable > iterative process. This makes sense, and the "upload just what's requested and do repair iteratively" is probably still overall what I think the right choice is, but I'm a *bit* concerned about the case where now the server is made less consistent by doing this. Clients not participating in this repair could become more corrupt as a result... And they might even make the server state worse D:
(In reply to Thom Chiovoloni [:tcsc] from comment #36) > I'm a *bit* concerned about the case where now the server is made less > consistent by doing this. Bear in mind that the repair process only occurs when the server is already inconsistent; it's hard to make things worse at that point :) > Clients not participating in this repair could > become more corrupt as a result... To give a narrative explanation of how this might occur: Two desktop devices syncing with unreliable bookmark sync will diverge: e.g., A adds a bookmark, B only gets the bookmark but not the changed folder, putting the new bookmark in unsorted bookmarks and moving on with its day. That's how we got in this mess to begin with. That approach currently surfaces errors intermittently: when B adds a new unsorted bookmark, it'll reupload the Unsorted Bookmarks folder, causing A will move the first bookmark into Unsorted. "Hey, where did my toolbar bookmark go?!" Android Sync surfaces those errors immediately: given a conflict it'll reupload all of the corrected records, forcing A to match a cleaned-up version of what it put on the server. That's why users complained of bookmark corruption after pairing an Android device: it pushed desktop corruption back onto the source. Thom's concern here is well-founded: adding a remote-repairer to the diverged constellation of A and B will cause either A's data or B's data to be uploaded in sections to the server in order to achieve consistency, and that will have a side effect of forcing A and B to reach a new converged state. Often that new state will be the state of the device providing the fixes -- they converge on A or B -- but novel states could conceivably occur as a result of applying each new chunk. E.g., if B has a bookmark that A doesn't know about, and A uploads that bookmark's folder without it, it's implementation-defined what happens. B might upload the folder again but with the bookmark; it might move the bookmark somewhere else; who knows? The resulting state isn't A's, and it's not B's old state. The end state after an iterative process might not be the same state as if A simply reuploaded every bookmark and folder. (All this being said, the end state of B currently depends on the order of application of records, so we can't be as bad as 2014-era Sync!) I'm not very concerned about this risk of churn, for several reasons: - I expect most fixes to be shallow, because most bookmark trees will be shallow. We should add telemetry, but I'd be surprised if more than 30% of accounts needed more than one step. With one step this is equivalent to a wipe and reupload, but without the data loss and churn. - Most users have only one or two devices. There is no other device to perturb. - Most repair interactions are likely to be brief. - I think that a client uploading a repeating contradiction should be rare. (That is, A might put B in a different state, but it won't usually cause B to upload something that breaks A again.) - Repairs are targeted at points of inconsistency, so I expect that to fix more than it breaks: the server will quickly become more consistent than it was. And any change it causes on an older client is bound to happen at some point anyway -- the two machines will eventually exchange the conflicting records and have to reconcile their differences.
More thoughts as I read through the implementation: it's OK for a device to service multiple repair requests (upload the union of the requested records to the server, then send responses to each client)...but it probably shouldn't respond if it's currently repairing itself. Otherwise, it might spread its inconsistent view of the world to other clients. I wonder if that risks a deadlock, though, with two clients waiting on each other to supply missing records. Maybe it can respond, but only for subtrees that it's not currently repairing? Or respond with "I'm busy," and finish its repair cycle first? Or don't respond at all, and rely on timeouts...
(In reply to Kit Cambridge [:kitcambridge] from comment #38) > More thoughts as I read through the implementation: it's OK for a device to > service multiple repair requests (upload the union of the requested records > to the server, then send responses to each client)...but it probably > shouldn't respond if it's currently repairing itself. Otherwise, it might > spread its inconsistent view of the world to other clients. > > I wonder if that risks a deadlock, though, with two clients waiting on each > other to supply missing records. Maybe it can respond, but only for subtrees > that it's not currently repairing? Or respond with "I'm busy," and finish > its repair cycle first? Or don't respond at all, and rely on timeouts... AIUI we rely on timeouts for this in the requestor
(In reply to Kit Cambridge [:kitcambridge] from comment #38) > More thoughts as I read through the implementation: it's OK for a device to > service multiple repair requests (upload the union of the requested records > to the server, then send responses to each client)...but it probably > shouldn't respond if it's currently repairing itself. Otherwise, it might > spread its inconsistent view of the world to other clients. By definition, it can't: Sync clients can't upload records until they're done downloading.
Attached patch requestor-validate.patch (obsolete) (deleted) — Splinter Review
Pretty unsure about the content of this patch but it goes along with the discussion earlier in this bug. Essentially, it reruns validation after each step of the repair request, instead of assuming that the old validation data is valid. I wasn't sure what the right way to do this was, hg push review seemed to get very upset about stuff, so i just exported a patch.
Attachment #8830947 - Flags: feedback?(markh)
(In reply to Richard Newman [:rnewman] from comment #32) > > 2) Only upload what B requested. After B syncs, it can revalidate, see that > > it still doesn't have a complete tree, and send out another repair request. > > This means more crosstalk: B realizes it's now missing grandparents and > > grandchildren, so it requests those. > > This is the correct answer, IMO. Most bookmark trees are shallow, and most > structural corruption is likely to be a single link in the chain. I agree - with the caveat that is a missing item is a folder (which the requestor will not know), we should do "something" with the children to ensure the complete subtree under the missing item is consistent on the server (ie, children not already on the server are uploaded, and children which are on the server either (a) correctly point at this missing item or (b) are locally moved to make the server representation correct) (In reply to Richard Newman [:rnewman] from comment #40) > (In reply to Kit Cambridge [:kitcambridge] from comment #38) > > More thoughts as I read through the implementation: it's OK for a device to > > service multiple repair requests (upload the union of the requested records > > to the server, then send responses to each client)...but it probably > > shouldn't respond if it's currently repairing itself. Otherwise, it might > > spread its inconsistent view of the world to other clients. > > By definition, it can't: Sync clients can't upload records until they're > done downloading. I'm not sure what you are getting at here, but I believe it's possible for multiple repair requests to be in-flight at once unless we take special action - eg: * clientA initiates a repair by queueing a request to clientB. * the sync (to write the command) fails. * clientB initiates a repair, syncs. * clientA comes online, syncs. We then have 2 repairs in flight. ISTM that we probably need to have the requestor examine *all* client records for evidence of another client having a repair request in flight, and immediately abort (or decline to start) a repair initiated by this client, recording enough telemetry so we can see how often this actually happens.
Flags: needinfo?(markh)
Comment on attachment 8830947 [details] [diff] [review] requestor-validate.patch Review of attachment 8830947 [details] [diff] [review]: ----------------------------------------------------------------- TBH, I'm not too keen on this as I'm not clear how it makes things better: * I think we should consider aborting if there are > 5k problems to fix, and let telemetry guide us. ISTM that multiple iterations with each capped at 5k would be better served by having the client do a full bookmark sync. I've got a spark job currently running that should tell us the maximum number of problems ever seen by the validator, and I'll update this bug when it completes. * When we see the response, the bookmarks engine probably hasn't re-synced yet - meaning the validator is almost certainly going to be wrong (it's going to compare the server state with a client state that hasn't yet grabbed new server stuff). * It seems more complex than just letting the repair complete and considering starting another round. Indeed, in the (probably) common case of there being exactly 2 devices connected, this seems a more complex way of expressing exactly that. Given that desktop->desktop repair is really just a prototype (ie, there are no implications of the repair process aborting for various reasons) I'm in favor of keeping this as simple as possible in the first cut and let telemetry guide us on what additional complexity we should add. For example, if it turns out a naive repair strategy catches the vast majority of issues it may make sense to simply fall back to the "full sync" strategy instead of trying to guess what tweaks would make it more effective. ::: services/sync/modules/bookmark_repair.js @@ +53,5 @@ > SENT_SECOND_REQUEST: "repair.sent-again", > + > + // Some progress has been made and we know what we need to request, but have not > + // yet made the request. > + WILL_SEND_REQUEST: "repair.will-send", this new state seems unused (which is good - based on comment 42, it seems useful for it to be true that *some* client has a repair related command whenever a repair is in progress, which I think is the case now)
Attachment #8830947 - Flags: feedback?(markh)
(In reply to Mark Hammond [:markh] from comment #43) > * I think we should consider aborting if there are > 5k problems to fix, and > let telemetry guide us. Sorry, I skipped over earlier comments around this. (In reply to Thom Chiovoloni [:tcsc] from comment #30) > (In reply to Kit Cambridge [:kitcambridge] from comment #29) > > I wonder if we should consider a threshold where we wipe the server and > > force a full sync if there are too many errors. Yeah, that's exactly what I restated instead of quoting :) > > Could we inadvertently make the situation worse by trying to repair these? Agreed - a "full sync" should mean we end up with a consistent state between the server and at least one client, whereas an iterative spot-repair seems riskier (because it's so unpredictable without something more formal, especially if round-robin-ing between clients) > I like this idea but one concern would be items that are flagged as > clientMissing -- I think doing this would make this repair tricky if not > impossible in some cases. I'd expect the client to download these - a "full sync" is just re-reconciling every record on the server with the local state, just like on (re-)connection - probably better named described as "first sync". Bug 1332556 is an example of how that could be improved, but it still seems the right direction to be heading.
So I think https://gist.github.com/mhammond/c9a9a928e7634bdd9a586aac6bcc8732 is showing the max counts seen across all validations recorded in all sync pings - the max count for each individual problem: {u'clientCycles': 7, u'clientMissing': 84, u'deletedChildren': 14, u'deletedParents': 17, u'differences': 63, u'duplicateChildren': 1, u'missingChildren': 914, u'multipleParents': 17, u'orphans': 62, u'parentChildMismatches': 53, u'parentNotFolder': 17, u'rootOnServer': 1, u'sdiff:childGUIDs': 7, u'sdiff:parentid': 15, u'serverMissing': 62473, u'serverUnexpected': 48, u'structuralDifferences': 22, u'wrongParentName': 62} serverMissing is - err - "interesting" :) Keep in mind validation is still only on Nightly.
I think there are a lot of cases where we'd historically fail to upload a new record. * Bookmarking two bookmarks in a row. * Bookmarking near shutdown. * Bulk-bookmarking (I expect we didn't get correct notifications from Places). I'm not sure that'd get us to 62,473 :) Is this a count of missing records, or a count of users who are missing records? How many users in the sample?
(In reply to Mark Hammond [:markh] from comment #42) > I agree - with the caveat that is a missing item is a folder (which the > requestor will not know), we should do "something" with the children to > ensure the complete subtree under the missing item is consistent on the > server (ie, children not already on the server are uploaded, and children > which are on the server either (a) correctly point at this missing item or > (b) are locally moved to make the server representation correct) I think I'd rather see the responder be dumb in this case, at least to start: if the child records are already on the server and consistent, great; if not, the requestor will make incremental forward progress and know precisely which new records to request. (In the case of a disagreement between parent and child, expect the next repair request to be "OK, so give me the child again, because you just gave me the parent and one of the two is wrong"!) One of the tracker bugs resulted in a folder record that doesn't list as a child one of the bookmarks that claims to be in it, and so uploading just the folder is enough for a large swath of broken trees. > > By definition, it can't: Sync clients can't upload records until they're > > done downloading. > > I'm not sure what you are getting at here… I think we're saying the same thing: A client that has requested a repair can't respond to one, because it hasn't finished applying incoming records, and so it doesn't have a final consistent state to use for upload. If it had, it wouldn't be mid-repair; it'd be uploading its own changes. That position could allow for deadlocks, if clients don't have a mechanism for refusing a repair request. If refusals are possible (and tracked), then deadlock shouldn't be possible -- B will refuse A's repair request, because it needs repair itself, and so A will ask C. If no forward progress can be made (e.g., C refuses, because it's also unable to sync) then we must abandon the repair and take more drastic action like forcing a full re-sync. A simplification of this would be to allow only one client to be in repair at a time, but that seems a little dangerous: all we want to avoid is a cycle. A "spoked wheel" is totally safe.
(In reply to Richard Newman [:rnewman] from comment #47) > A client that has requested a repair can't respond to one, because it hasn't > finished applying incoming records, and so it doesn't have a final > consistent state to use for upload. If it had, it wouldn't be mid-repair; > it'd be uploading its own changes. Oh, I see what you're saying. My concern might be Desktop-specific, then, because our implementation doesn't block on repair. If I understand part 2 correctly, we wait for all engines to finish syncing before validating and sending the repair request. We'll also finish the sync once we've successfully sent the repair request, not once we've received a response and the missing records. Both would be untrue for iOS.
(In reply to Kit Cambridge [:kitcambridge] from comment #48) > Oh, I see what you're saying. My concern might be Desktop-specific, then, > because our implementation doesn't block on repair. That worries-slash-confuses me. You have a client that knows it has seen an inconsistent/incorrect remote state. It doesn't seem desirable to me to: - Have that client upload subsequent local changes, which perhaps were applied on top of inconsistent/partial remote state. - Interleave incoming repair responses with local changes. ("Hey, client A, could you give me the current Unsorted Bookmarks?" -- and at the same time the user hits the bookmark star.) - Potentially respond to repair requests without having reached consensus. This is like getting half-way through a Git rebase and forcing it to let you commit and push patches against a head that has never existed. I understand that the current implementation still uses Places as a scratchpad, but I don't think that's a reasonable jumping off point for then acting as an authoritative client (which a repair responder must be), and I don't think it's a good end state regardless. IMO the end goal should be validation *prior to* record application, application only taking place on a valid set of records, and repair being an interaction between a paused/inconsistent client and an active/in-sync client. > We'll also finish the sync once we've > successfully sent the repair request, not once we've received a response and > the missing records. Both would be untrue for iOS. And eventually Android, too, which will be buffering and validating bookmarks before application; the sync won't complete for that datatype until the buffer is valid and mergeable.
We talked about storing tombstones permanently (or, at least, some time past a successful sync) to fix bug 1332290, but this would be useful for the repair responder, too. Consider: if a client requests a deleted item, it won't know if it's deleted, or permanently missing until it hears back from every client. If we know an item has been deleted, we can short-circuit this and upload the tombstone immediately. If no client has uploaded anything at the end of the session, the requester can assume it's gone, reparent the descendants that it has, and fix up the server to match. OTOH, this is strictly an optimization. We expect most people to have two devices, anyway, so waiting until the end of the repair session to figure out what to do will amount to the same thing.
The huge serverMissing could happen due to having tons of items in e.g. unfiled, and having bookmark sync broken as a result (bug 1321021). Not clear what to do about this.
(In reply to Thom Chiovoloni [:tcsc] from comment #51) > The huge serverMissing could happen due to having tons of items in e.g. > unfiled, and having bookmark sync broken as a result (bug 1321021). Agreed. I really should have generated histograms for each of those problems so we can get a feel for what's somewhat normal vs one or 2 outliers. > Not clear what to do about this. Probably nothing from the POV of this bug (but yeah, we should certainly look at unblocking those users at some point as described in that bug.
This patch from Thom was uploaded to elm with a review request for me.
Attachment #8833851 - Flags: review+
Comment on attachment 8833853 [details] [diff] [review] Patch_3___Add_more_validation_problem_fields_to_the_set_of_ids_requested_by_the_the_bookmark_repair_requestor_r_markh.patch Review of attachment 8833853 [details] [diff] [review]: ----------------------------------------------------------------- Looks great, thanks - please fix the typos and fold this into the relevant base patch ::: services/sync/modules/bookmark_repair.js @@ +104,4 @@ > ids.add(parent); > } > + > + // Entries where the parent where we know for certain that the child was minor typo in this comment ("where the parent where") @@ +109,5 @@ > + for (let { parent } of validationInfo.problems.deletedChildren || []) { > + ids.add(parent); > + } > + > + // Entries where the child references a parent the we don't have, but we minor typo here - s/a parent the/a parent that/?
Attachment #8833853 - Flags: review?(markh) → review+
Comment on attachment 8833851 [details] [diff] [review] Address_patch_3_nit___Move_PREFS_and_STATE_to_top_level_constants_r_markh.patch I folded this into part 3
Attachment #8833851 - Attachment is obsolete: true
Comment on attachment 8833855 [details] [diff] [review] Patch_3___Use_a_Set_has_check_instead_of_an_Array_indexOf_to_avoid_quadratic_complexity_r_markh.patch I folded this into part 3
Attachment #8833855 - Attachment is obsolete: true
From Kit, requesting review from Thom
Attachment #8833856 - Flags: review?(tchiovoloni)
Comment on attachment 8833857 [details] [diff] [review] Part_4__Formalize_weak_tracking_in__BookmarksChangeset___r_markh.patch Review of attachment 8833857 [details] [diff] [review]: ----------------------------------------------------------------- LGTM, I've mostly naming nits. I also wonder if the "weak" portions of this patch should be its own "part 3.5" (ie, its own "part" and we end up renumbering the other parts), with only the changes specific to the repair folded into part 4? I don't really mind about that, but it might make final checking of the patches before landing that bit simpler. ::: services/sync/modules/bookmark_repair.js @@ +503,5 @@ > } > // whew - now add these items to the tracker "weakly" (ie, they will not > // persist in the case of a restart, but that's OK - we'll then end up here > // again.) > // XXX - not clear how to "track weakly" - something like this? We should remove this comment ::: services/sync/modules/engines.js @@ +916,5 @@ > record.collection = this.name; > return record; > }, > > + // Creates a tombstone Sync record with additional crypto fields. I don't understand the "additional crypto fields" part of this comment? "additional metadata" makes sense though @@ +1448,5 @@ > locallyModified = true; > localAge = this._tracker._now() - this._modified.getModifiedTimestamp(localDupeGUID); > remoteIsNewer = remoteAge < localAge; > > + this._modified.rename(localDupeGUID, item.id); I agree "swap" isn't a great name, but "rename" seems a wrong too - maybe "switch" (but that's still close to swap). Not a big deal either way if a good name doesn't pop into your head, although given the implementation, IMO "swap" is better than "rename" :) @@ +1830,5 @@ > Object.assign(this.changes, changes); > } > > + // Overwrites the existing set of tracked changes. > + adopt(changes) { "adopt" implies to me that the passed changes would be *added* rather than replacing the existing ones. ::: services/sync/modules/engines/bookmarks.js @@ +1152,5 @@ > + } > + return Number.NaN; > + } > + > + setWeak(id, { tombstone = false } = {}) { should we also override the base |insert| method to remove it from |weakChanges|? (ie, ISTM that this implementation takes care to not have the change in both sets, so if it's not important to override |insert|, do we really need this check here?)
Attachment #8833857 - Flags: review?(markh) → review+
Comment on attachment 8833866 [details] [diff] [review] Part_4__Implement__PlacesSyncUtils_bookmarks_fetchSyncIdsForRepair___r_markh.patch I was unable to fold this, presumably due to it depending on earlier patches.
Thanks Kit, LGTM - not folded yet, but please do so once the previous patches are themselves folded.
Attachment #8833867 - Flags: review+
Comment on attachment 8833868 [details] [diff] [review] Part_4__Store_and_upload_tombstones_for_known_deletions__f_markh.patch Review of attachment 8833868 [details] [diff] [review]: ----------------------------------------------------------------- I didn't look at this too closely, but I agree this is where we should end up. IIUC this patch isn't strictly necessary for the repairer, but instead would make both the repairer *and* a general Sync more reliable (ie, making the "old device + node reassignment resurrects deleted items" scenario go away), and that implementing the repairer without this functionality is possible, just less reliable in ways Sync is already unreliable? I suspect mak will have some reservations and it may take time to work out how to balance the tradeoffs correctly. If that's true, it would be ideal if we can consider splitting this into its own bug and not block the repairer landing on this new functionality. If I mistaken and the repair truly would be unreliable without this patch, we should split it into its own patch in this bug with review from mak (and in which case I'll have a closer look at this)
Attachment #8833868 - Flags: feedback?(markh)
Comment on attachment 8833856 [details] [diff] [review] Part_3__Use___ids__and____IDs__consistently__r_tcsc.patch Review of attachment 8833856 [details] [diff] [review]: ----------------------------------------------------------------- Looks good to me.
Attachment #8833856 - Flags: review?(tchiovoloni) → review+
Comment on attachment 8833857 [details] [diff] [review] Part_4__Formalize_weak_tracking_in__BookmarksChangeset___r_markh.patch Review of attachment 8833857 [details] [diff] [review]: ----------------------------------------------------------------- Moving the changeset...changes into a separate patch is a fine idea! ::: services/sync/modules/engines.js @@ +916,5 @@ > record.collection = this.name; > return record; > }, > > + // Creates a tombstone Sync record with additional crypto fields. I admit I copy-pasted this from the `_createRecord` comment. Changed to "metadata" in both places. @@ +1448,5 @@ > locallyModified = true; > localAge = this._tracker._now() - this._modified.getModifiedTimestamp(localDupeGUID); > remoteIsNewer = remoteAge < localAge; > > + this._modified.rename(localDupeGUID, item.id); "swap" seemed wrong because it implies we're moving `oldID` to `newID`, and `newID` to `oldID` (which should be `undefined`, unless `newID` is changed, too, and our de-duping logic decides to replace it)...but really, we're only doing the former. How about "changeID"? It matches "changeItemID". @@ +1830,5 @@ > Object.assign(this.changes, changes); > } > > + // Overwrites the existing set of tracked changes. > + adopt(changes) { Good feedback, thanks! I was using "adopt" in the `nsString.Adopt(char*)` sense, but `replace` makes it clear that it's replacing all changes. ::: services/sync/modules/engines/bookmarks.js @@ +1152,5 @@ > + } > + return Number.NaN; > + } > + > + setWeak(id, { tombstone = false } = {}) { In retrospect, I don't think it matters, so I removed this check. It's not important that the change appear only in one set. We'll always prefer the strong change, anyway.
Comment on attachment 8833868 [details] [diff] [review] Part_4__Store_and_upload_tombstones_for_known_deletions__f_markh.patch (In reply to Mark Hammond [:markh] from comment #66) > IIUC this patch isn't strictly necessary for the repairer, but instead > would make both the repairer *and* a general Sync more reliable (ie, making > the "old device + node reassignment resurrects deleted items" scenario go > away), and that implementing the repairer without this functionality is > possible, just less reliable in ways Sync is already unreliable? I think your assessment is totally accurate, and this patch is more of an optimization. The requester will still need to handle the case where none of the other devices have an item, but don't have a tombstone, either. Let's save this patch for a follow-up, and we can get mak's advice then.
Attachment #8833868 - Attachment is obsolete: true
Attached patch avoid-continuing-repair.patch (obsolete) (deleted) — Splinter Review
This probably needs better tests... Also using a different method of exporting the patch since the last one ended up being hell when updating, so hopefully this works fine.
Attachment #8830947 - Attachment is obsolete: true
Attachment #8835087 - Flags: review?(markh)
Depends on: 1338280
Comment on attachment 8835087 [details] [diff] [review] avoid-continuing-repair.patch As discussed with Thom, this patch is going to be mostly gutted.
Attachment #8835087 - Flags: review?(markh)
FTR, I pushed "unreviewed" (or maybe I should say r=me;) changes to elm to placate eslint. All of these changes were trivial, so didn't want to add overhead, and hopefully the changes will not cause conflicts for anyone. As part of this, I also pulled recent central and rebased - so your next pull will grab many changesets, including the eslint fixes. The only remaining eslint errors are in test_bookmark_repair.js, but as Kit is actively working on that (thus making the risk of conflicts high) I didn't touch that file. Kit, when you get another patch up, could you please ensure eslint doesn't complain about that file? Thanks
This patch has minor fixes to either correct obvious errors, and to log slightly better diagnostics. I'll split it into the correct parts after review.
Attachment #8837031 - Flags: review?(kit)
Attachment #8837031 - Flags: review?(kit) → review+
Blocks: 1339633
Comment on attachment 8837031 [details] [diff] [review] small_fixups_to_various_parts_to_allow_the_process_to_start__r_kitcambridge.patch Thanks - folded and re-pushed.
Attachment #8837031 - Attachment is obsolete: true
This patch fixes that the repair response command didn't send the flow ID (causing the requester to reject it) and fixes to work with recent changes that landed on central (ie, that the 4th arg to sendCommand is now an object rather than just the flow ID) With this, the repair process successfully round-trips \o/
Attachment #8837389 - Flags: review?(tchiovoloni)
Comment on attachment 8837389 [details] [diff] [review] fix_handling_of_flow_ID_in_commands_and_repair_response__r_tcsc.patch Review of attachment 8837389 [details] [diff] [review]: ----------------------------------------------------------------- Looks great!
Attachment #8837389 - Flags: review?(tchiovoloni) → review+
Attached patch skip-repair-if-changes.patch (deleted) — Splinter Review
Gutted version of avoid-continuing-repair.patch.
Attachment #8837819 - Flags: review?(markh)
Attached patch skip-repair-if-repairing.patch (obsolete) (deleted) — Splinter Review
Skips repair if other devices are already repairing. Probably should have done these two together, since they're related.
Attachment #8837821 - Flags: review?(markh)
Comment on attachment 8837819 [details] [diff] [review] skip-repair-if-changes.patch Review of attachment 8837819 [details] [diff] [review]: ----------------------------------------------------------------- Thanks! Please make whatever changes you decide and push it up (either folded into the other parts, or stand-alone and I'll fold it later) ::: services/sync/modules/bookmark_repair.js @@ +152,5 @@ > } > > + if (ids.size > MAX_REQUESTED_IDS) { > + log.info("Not starting a repair as there are over " + MAX_REQUESTED_IDS + " problems"); > + return false; While this should be rare, I think it's work recording telemetry for it. Adding, say: let extra = { flowID, reason: `too many problems: ${ids.size}` }; this.service.recordTelemetryEvent("repair", "abort", undefined, extra); should be enough and matches the other place we write an "abort" event. In this case there will not be a "started" event, but I think that's OK. @@ +422,5 @@ > let ids = this.prefs.get(PREF.REPAIR_MISSING_IDS, ""); > return ids.length ? ids.split(" ") : []; > } > set _currentIDs(arrayOfIDs) { > + if (arrayOfIDs.length > MAX_REQUESTED_IDS) { I don't think this is necessary. You already check above for this condition (which is good, as I don't think the caller of startRepair() is expecting an AbortRepairError) and the only other place this is set is when the new list must be <= the current list. (OTOH, an alternative would be to *only* have this, let it throw when startRepair is called, and have startRepair do the right thing on that exception, which would be OK too, but I'm not sure that's worth it) ::: services/sync/modules/doctor.js @@ +133,5 @@ > if (!validator) { > continue; > } > + if (!await validator.canValidate()) { > + continue; please add a log.debug here indicating we aren't validating as the validator says so :) ::: toolkit/components/places/PlacesSyncUtils.jsm @@ +290,5 @@ > + havePendingChanges() { > + // This could be optimized to use a more efficient query -- We don't need > + // grab all the records if all we care about is whether or not any exist. > + return PlacesUtils.withConnectionWrapper("BookmarkSyncUtils: havePendingChanges", > + db => pullSyncChanges(db, true).then(changes => Object.keys(changes).length > 0)); LGTM, but let's get a bug on file to add the optimized version at some point.
Attachment #8837819 - Flags: review?(markh) → review+
Comment on attachment 8837821 [details] [diff] [review] skip-repair-if-repairing.patch Review of attachment 8837821 [details] [diff] [review]: ----------------------------------------------------------------- I think we want to make this check before advancing a repair too, so just incase we do still get multiple repairs in-flight (eg, when 2 devices both simultaneously start a repair, no neither device has seen the other's command when this is called). However, that's slightly trickier as in that scenario there could well be a repairRequest in remote clients - one that *we* wrote ::: services/sync/modules/doctor.js @@ +30,5 @@ > this.REPAIR_ADVANCE_PERIOD = 86400; // 1 day > > this.Doctor = { > + isRepairInProgress(collection, clientsEngine = Service.engineManager.clientsEngine) { > + if (!clientsEngine) { do we really need this? If there's no clients engine things will be exploding all over the place, so I don't think this check adds value? I guess it's fine if it's just for tests (although it appears to be able to be mocked in tests, and the mock to make it be a noop seems trivial)
Attachment #8837821 - Flags: review?(markh)
Comment on attachment 8837389 [details] [diff] [review] fix_handling_of_flow_ID_in_commands_and_repair_response__r_tcsc.patch folded and pushed
Attachment #8837389 - Attachment is obsolete: true
Blocks: 1340325
See the commit message for an extended description.
Attachment #8838305 - Flags: review?(tchiovoloni)
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. Hi Richard, You made some comments in comment 16, to which I replied in comment 20, but it's not clear if you have any other concerns here? I'm hoping we can get this landed next week (although it will be stuck behind a pref to keep it from moving past Aurora), so I'm making sure all the ducks are lined up. Quack :)
Attachment #8810302 - Flags: review?(rnewman)
Comment on attachment 8838305 [details] [diff] [review] doctor_fixup___define_a_pref_specifically_for_repair_being_enabled__and_let_repair_and_validation_move_to_aurora.patch Review of attachment 8838305 [details] [diff] [review]: ----------------------------------------------------------------- Makes sense and seems good to me!
Attachment #8838305 - Flags: review?(tchiovoloni) → review+
I see fourteen parts on this bug, five of which are in ReviewBoard. It's not at all clear what the final proposed set of changes are. Could someone generate a single rolled-up patch for review and attach it to this bug, and flip the appropriate Obsolete flags?
Flags: needinfo?(markh)
Status: NEW → ASSIGNED
Comment on attachment 8824875 [details] Bug 1317223 (part 2) - add 'doctor' concept and move bookmark validation logic to it. https://reviewboard.mozilla.org/r/103190/#review104280 > Should we do this first and prioritize validating engines with repair requestors if the repair period is correct? > > E.g. if we're going to do a repair, it might make sense to validatate first no matter what. > > (The answer could be "no" or "this is handled in a different patch", certainly) I think we should consider that when we have another repairer > Don't we want this to happen before we check the validation probability? > > If we do so here, we'll be substantially more likely to run validation every day, since we'll do the random check repeatedly until it passes. Yeah, I've moved this
Comment on attachment 8824877 [details] Bug 1317223 (part 5) - a bookmark repair responder. https://reviewboard.mozilla.org/r/103194/#review104622 > Half-formed thought: Is it possible for us to deadlock during repair, such that we're waiting on another client to supply our missing records, while it's waiting on us? > > We remove the command when we abort the repair, so IIUC the other client will time out and try again with a different client. But I wonder if this is worth pondering, or if it's not really a problem. yes, this is a concern and Thom is working on a patch to ensure that we handle attempts at multiple clients trying to initiate a repair. I'll leave this issue open until that lands. > This can happen if multiple clients send us repair requests, right? Instead of aborting, I wonder if we can queue the requests, and send repair responses to all those clients once we're done. (We can also set a "repair in progress" flag earlier here, instead of after we've tracked everything). As above, we will soon abort ASAP in the case of multiple repairs being inflight - nothing good can come from that! > It would be nice if we could delete all children of the folder on the server, too, in case we have parent-child disagreements. But resolving those might require the sender to specify the expected hierarchy, too, not just IDs. > > This might be prematurely overcomplicating things, too. If we still think missing items are the largest source of corruption, it makes sense to focus on those first. They're easier to fix than structural differences. Yeah, let's do this in a followup if telemetry implies it has value. > Maybe, and I think we'll need to flag the ancestors and all their children as deleted. The caveat is, we might not have the complete tree, either (bug 1303679, comment 15). > > It could be that we upload these non-syncable items, and wait to delete them until everyone has a consistent (though wrong) view. That may be never, and we really don't want to proliferate those records. > > The best approach might be to take our chances and delete the subtree, and see if we end up orphaning items that we don't know about. We can then repair those separately. I believe your fixups have already addressed this, thanks! > Or it exists on the server, but has structural differences that we need to fix. yeah - I added a comment to that effect as something we can consider in a followup.
Comment on attachment 8833853 [details] [diff] [review] Patch_3___Add_more_validation_problem_fields_to_the_set_of_ids_requested_by_the_the_bookmark_repair_requestor_r_markh.patch This has been folded into the relevant patch
Attachment #8833853 - Attachment is obsolete: true
Flags: needinfo?(markh)
Comment on attachment 8833856 [details] [diff] [review] Part_3__Use___ids__and____IDs__consistently__r_tcsc.patch This has been folded into the relevant patch
Attachment #8833856 - Attachment is obsolete: true
Comment on attachment 8833857 [details] [diff] [review] Part_4__Formalize_weak_tracking_in__BookmarksChangeset___r_markh.patch This has been folded into the relevant patch
Attachment #8833857 - Attachment is obsolete: true
Comment on attachment 8833866 [details] [diff] [review] Part_4__Implement__PlacesSyncUtils_bookmarks_fetchSyncIdsForRepair___r_markh.patch This has been folded into the relevant patch
Attachment #8833866 - Attachment is obsolete: true
Comment on attachment 8833867 [details] [diff] [review] Part_4__Improve_repair_responder_tests__r_markh.patch This has been folded into the relevant patch
Attachment #8833867 - Attachment is obsolete: true
Comment on attachment 8835087 [details] [diff] [review] avoid-continuing-repair.patch A reworked version of this was folded in.
Attachment #8835087 - Attachment is obsolete: true
Comment on attachment 8837819 [details] [diff] [review] skip-repair-if-changes.patch Thom, can you please fixup the review nits and push this?
Flags: needinfo?(tchiovoloni)
Comment on attachment 8838305 [details] [diff] [review] doctor_fixup___define_a_pref_specifically_for_repair_being_enabled__and_let_repair_and_validation_move_to_aurora.patch This has been folded into the relevant patch
Attachment #8838305 - Attachment is obsolete: true
Done, sorry, not sure if you wanted me to fold it into the patch, but I didn't since it's not clear we can undo that if it wasn't something you wanted me to do.
Flags: needinfo?(tchiovoloni)
Attached patch Rollup patch, 16effd5b21ab..c0d1ffb87c33 (obsolete) (deleted) — Splinter Review
Why do I do this to myself?
Attachment #8839658 - Flags: review?(rnewman)
Comment on attachment 8839658 [details] [diff] [review] Rollup patch, 16effd5b21ab..c0d1ffb87c33 Review of attachment 8839658 [details] [diff] [review]: ----------------------------------------------------------------- ::: services/sync/modules/bookmark_repair.js @@ +1,4 @@ > +/* This Source Code Form is subject to the terms of the Mozilla Public > + * License, v. 2.0. If a copy of the MPL was not distributed with this > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > +"use strict"; Newline before. @@ +25,5 @@ > + > +// Some back of the napkin math says that 13k should be safe here -- similar > +// to how many children are safe in a folder, but it's worth noting that > +// getting the clients engine stuck is very bad, and this is put in the client > +// record with everything else, so it seems prudent to be extra safe here. If this system is iterative, I'd be inclined to make this number even smaller. @@ +38,5 @@ > + > +// The states we can be in. > +const STATE = Object.freeze({ > + NOT_REPAIRING: "", > + // We need to try and find another client to use. Nit: try to. I have a preference for newlines before each comment. @@ +44,5 @@ > + // We've sent the first request to a client. > + SENT_REQUEST: "repair.sent", > + // We've retried a request to a client. > + SENT_SECOND_REQUEST: "repair.sent-again", > + // There were no problems, but we've gone as far as we can Nit: closing period. @@ +52,5 @@ > +}); > + > +// The preferences we use to hold our state. > +const PREF = Object.freeze({ > + // If a repair is in progress, this is the generated GUID for the "flow ID" Nit: closing period. @@ +54,5 @@ > +// The preferences we use to hold our state. > +const PREF = Object.freeze({ > + // If a repair is in progress, this is the generated GUID for the "flow ID" > + REPAIR_ID: "flowID", > + // The IDs we are currently trying to obtain via the repair process, space sep'd. Nit: newlines, as before. @@ +58,5 @@ > + // The IDs we are currently trying to obtain via the repair process, space sep'd. > + REPAIR_MISSING_IDS: "ids", > + // The IDs of the clients we've tried to get the missing items from, space sep'd. > + // The final element in the array if the client we are currently expecting to > + // see stuff from. This comment doesn't make too much sense to me. Did two cases get smushed together? @@ +73,5 @@ > + this.prefs = new Preferences(PREF_BRANCH); > + } > + > + // Exposed incase another module needs to understand our state > + get STATE() { This is never used. But if you needed to do it, you'd use the BackstagePass to do so instead: ``` Cu.import("…/bookmark_repair.js").STATE; ``` @@ +78,5 @@ > + return STATE; > + } > + > + // Given the validation data, grab the set of IDs we want to request. > + getProblemIDs(validationInfo) { Have this method short-circuit at MAX_REQUESTED_IDS? No point building a set of 10,000 IDs if we're going to give up after 5,000. @@ +87,5 @@ > + > + // Note that we allow any of the validation problem fields to be missing so > + // that tests have a slightly easier time, hence the `|| []` in each loop. > + > + // missing children records when the parent exists but a child doesn't. Nit: sentences start with a capital letter (throughout). @@ +94,5 @@ > + } > + > + // orphans are when the child exists but the parent doesn't. > + // This could either be a problem in the child (it's wrong about the node > + // that should be it's parent), or the parent could simply be missing. Nit: it's -> its. @@ +134,5 @@ > + // XXX - any others we should consider? > + return ids; > + } > + > + /* See if the repairer is willing and able to begin a repair process given Nit: use a common commenting style throughout. @@ +147,5 @@ > + > + let ids = this.getProblemIDs(validationInfo); > + if (ids.size == 0) { > + log.info("Not starting a repair as there are no problems"); > + return false; This seems like good motivation for a return value enum, rather than a boolean -- there are different reasons why we might not start a repair, not all of which are errors. @@ +150,5 @@ > + log.info("Not starting a repair as there are no problems"); > + return false; > + } > + > + if (ids.size > MAX_REQUESTED_IDS) { I presume you have a bug on file to avoid spinning your wheels once we get in this state. @@ +173,5 @@ > + actually move. Returns false if we aren't actually repairing. > + */ > + continueRepairs(response = null) { > + // Note that "ABORTED" and "FINISHED" should never be current when this > + // function returns - the end up this function resets to NOT_REPAIRING "the end up this" -> this @@ +193,5 @@ > + if (!(ex instanceof AbortRepairError)) { > + throw ex; > + } > + log.info(`Repair has been aborted: ${ex.reason}`); > + newState = STATE.ABORTED; Still wanna do this if we're in shutdown? @@ +223,5 @@ > + > + _continueRepairs(state, response = null) { > + switch (state) { > + case STATE.SENT_REQUEST: > + case STATE.SENT_SECOND_REQUEST: Can we split the handling of responses in these states out into a separate method? @@ +229,5 @@ > + let clientsUsed = this._clientsUsed; > + if (!clientsUsed.length) { > + throw new AbortRepairError(`In state ${state} but have no client IDs listed`); > + } > + let clientID = clientsUsed[clientsUsed.length - 1]; This seems… fragile. Perhaps split this into `currentRemoteClient` and `previousRemoteClients`? That lets you express some invariants, too. @@ +244,5 @@ > + // Pull apart the response and see if it provided everything we asked for. > + let fixedIDs = new Set(response.ids); > + let remainingIDs = []; > + for (let id of this._currentIDs) { > + if (!fixedIDs.has(id)) { This is set difference. Does `CommonUtils.difference` help you here? If not, break out a helper that you can test and document. @@ +285,5 @@ > + // So the command we previously sent is still queued for the client > + // (ie, that client is yet to have synced). Let's see if we should > + // give up on that client. > + let lastRequestSent = this.prefs.get(PREF.REPAIR_WHEN); > + let timeLeft = lastRequestSent + RESPONSE_INTERVAL_TIMEOUT - this._now() / 1000; Please please please use server timestamps for this. Every step in this process gives you an X-Weave-Timestamp that you can use instead of the local clock. @@ +288,5 @@ > + let lastRequestSent = this.prefs.get(PREF.REPAIR_WHEN); > + let timeLeft = lastRequestSent + RESPONSE_INTERVAL_TIMEOUT - this._now() / 1000; > + if (timeLeft <= 0) { > + log.info(`previous request to client ${clientID} is pending, but has taken too long`); > + state = STATE.NEED_NEW_CLIENT; General observation: You're doing breadth-first repair. You compute a set of IDs that need fixing, and you ask _each client in turn_ if they have the records. You ask each client only once. The set only gets smaller until you reach FINISHED. If there are still outstanding IDs, I guess you try the whole process again, or just give up entirely. Each conversation with a client takes up to three days. There's another way to do this, which is to re-compute the set after each response, and iterate with a single responsive client until you've determined that you can no longer make progress _with that client_. Consider the situation where you have two IDs to fetch: a record whose parent is missing, and the parent ID. You fetch both because you don't know which one is wrong. If you request those two, and the remote device gives you the bookmark with a valid parent, the breadth-first approach will spend two weeks talking to other clients to find the old parent. A 'depth-first' approach will recompute the set of missing IDs, decide that the imaginary stale parent is no longer needed to get a consistent bookmark tree, and will stop there. In a non-trivial example where there are several sets of errors, we'd make additional requests of the original responder until no progress was made. Sticking with a single client for as long as possible would seem to provide the most consistent results, no? Am I misunderstanding this process? @@ +306,5 @@ > + // it another go (as that client may have cleared the command but is yet > + // to complete the sync) > + // XXX - note that this is no longer true - the responders don't remove > + // their command until they have written a response. This might mean > + // we could drop the entire STATE.SENT_SECOND_REQUEST concept??? If not, you need to be very careful that you don't cause duplicate work by sending a substantially similar request. (E.g., what happens if you try a second time, but with a larger or smaller set of IDs?) @@ +319,5 @@ > + } > + break; > + > + case STATE.NEED_NEW_CLIENT: > + // We need to try and find a new client to request. s/try and/try to @@ +322,5 @@ > + case STATE.NEED_NEW_CLIENT: > + // We need to try and find a new client to request. > + let newClientID = this._findNextClient(); > + if (!newClientID) { > + state = STATE.FINISHED; Isn't it worth distinguishing between "finished successfully" and "finished because we ran out of clients"? Or is that implied by a non-empty set of missing records in FINISHED? @@ +327,5 @@ > + break; > + } > + let newClientsUsed = this._clientsUsed; > + newClientsUsed.push(newClientID); > + this._clientsUsed = newClientsUsed; Is there value in not simply modifying the list in place? @@ +369,5 @@ > + ids, > + flowID, > + } > + this.service.clientsEngine.sendCommand("repairRequest", [request], clientID, { flowID }); > + this.prefs.set(PREF.REPAIR_WHEN, Math.floor(this._now() / 1000)); Same point about timestamps. @@ +381,5 @@ > + } > + > + _findNextClient() { > + let alreadyDone = this._clientsUsed; > + for (let client of this.service.clientsEngine.remoteClients) { Is this sorted by date last synced? Given the number of duplicates encountered by users, that seems like a wise idea. @@ +395,5 @@ > + // Is the passed client record suitable as a repair responder? > + _isSuitableClient(client) { > + // filter only desktop firefox running 53+ > + return (client.type == DEVICE_TYPE_DESKTOP && > + Services.vc.compare(client.version, 52) > 0); Is this really comparing for 53+, or is it checking 52+? I haven't looked at the source for vc.compare. @@ +465,5 @@ > + this._abortRepair(request, rawCommand, > + `A repair is already in progress`); > + return; > + } > + let engine = this.service.engineManager.get("bookmarks"); Newlines are cheap, here and above! @@ +506,5 @@ > + log.trace(`checking the server for items`, request.ids); > + for (let remoteID of JSON.parse(itemSource.get())) { > + log.trace(`the server has "${remoteID}"`); > + existsRemotely.add(remoteID); > + // This item exists on the server, so remove it from toUpload. Wha? Part of the point of repair is to fix crap on the server. That means replacing records that are on the server. It might involve doing so intelligently -- not creating new orphans -- but it definitely doesn't mean skipping records that exist but are wrong. I think 2/3 of the corruption in bookmarks is inconsistency between folders and records, so this isn't a small part of the work. Is this a follow-up? If so, leave a pointer to the bug here, and a big explanatory comment about why this work hasn't been addressed yet.
Attachment #8839658 - Flags: review?(rnewman) → feedback+
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. See comments on the megapatch I attached. Generally I have only two substantive/approach comments: 1. I'm not convinced that breadth-first repair requests make the most sense, but I might be misunderstanding your approach. 2. If you've accidentally omitted repair of inconsistent (not missing) items, I think that needs to be addressed. Otherwise, just the usual nits and fixes. Looks good.
Attachment #8810302 - Flags: review?(rnewman)
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. https://reviewboard.mozilla.org/r/92650/#review116124 ::: services/sync/modules/collection_repair.js:23 (Diff revision 4) > +} > + > +const RESPONDERS = { > +} > + > +// Should we maybe enforce the requestors being a singleton? After working on the integration test, I am categorically opposed to introducing any singletons. :-) But I'm definitely open to having my mind changed, if we think it's valuable.
(In reply to Richard Newman [:rnewman] from comment #110) > Part of the point of repair is to fix crap on the server. That means > replacing records that are on the server. It might involve doing so > intelligently -- not creating new orphans -- but it definitely doesn't mean > skipping records that exist but are wrong. > > I think 2/3 of the corruption in bookmarks is inconsistency between folders > and records, so this isn't a small part of the work. > > Is this a follow-up? If so, leave a pointer to the bug here, and a big > explanatory comment about why this work hasn't been addressed yet. AIUI, yes, it's a follow-up. We explicitly decided not to focus on structural issues in this bug, and only handle cases where items are missing. Mark elaborated in comment 93, and I agree it's important to document this with a comment.
Whiteboard: [ios-sync]
Blocks: 1341972
Attached patch skip-repair-if-repairing2.patch (obsolete) (deleted) — Splinter Review
Addresses feedback. We still check if the clients engine, well, exists since failing to do so causes a number of failures in unrelated tests (e.g. test_telemetry, test_corrupt_keys, etc.), and it seems easier to just make that check here than to update all those tests, which shouldn't even really be aware of the doctor.
Attachment #8837821 - Attachment is obsolete: true
Attachment #8840547 - Flags: review?(markh)
Ah, forgot to mention, I don't think this is complete. In particular, the case you called out where we are mid-repair and notice another happening at the same time, it's unclear to me what we should do. I'd rather not introduce a new concept of a repairAbort request, but maybe that's what we need? Alternatively, maybe everything would be fine if we only check on the response?
Comment on attachment 8840547 [details] [diff] [review] skip-repair-if-repairing2.patch Review of attachment 8840547 [details] [diff] [review]: ----------------------------------------------------------------- ::: services/sync/modules/doctor.js @@ +35,5 @@ > + log.info("Missing clients engine, assuming we're in test code"); > + return false; > + } > + return Service.engineManager.clientsEngine.isRepairInProgress(command => > + command.collection !== "collection"); As a nit, I think a better name that somehow reflects the fact we are checking *all* clients would be good - the name as written implies "is this device doing a repair". Maybe something like "anyClientsRepairing" or similar? I don't think you want the quotes around "collection" - which highlights the fact we need better tests :) Finally, I think we should also have continueRepairs abort if this is true - in which case I expect the callback will also want to check the flowID of the command (ie, only abort if there is a different flow ID in progress). However, in this latter case, we really want the current repair to be aborted, including reporting of telemetry) which means that it might be best to have these calls in bookmark_repair itself (otherwise we might need to make aborting part of the base class, which doesn't seem worthwhile at this late state). The flexibility of the callback doesn't seem worthwhile either - so I'm thinking something like: * clients.js has the isRepairInProgress(collectionName, ignoreFlowID) function. It returns true if there is a repair with the flowID in progress (and a null flowID would be passed by startRepairs implicitly means "any flow ID means we should return true) * more thorough tests for the above, probably in test_clients_engine (or maybe in test_doctor - don't really have an opinion there) - alternatively, consider moving isRepairInProgress to the doctor itself (ie, requiring no changes to the clients engine) which would make test_doctor the obvious place for tests) * startRepairs in bookmark requestor calls this with ("bookmarks", null) and declines to start a repair (and probably doesn't need telemetry?) * continueRepairs passes (bookmarks, this._flowID), throwing an AbortError. The exception handler for the AbortError should also pass a new "reason" in extra to telemetry.
Attachment #8840547 - Flags: review?(markh) → review-
Depends on: 1342851
Comment on attachment 8839658 [details] [diff] [review] Rollup patch, 16effd5b21ab..c0d1ffb87c33 Review of attachment 8839658 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for the review! I replied in splinter to try and keep context, so this reply might be best viewed there :) Any nits I didn't reply to are fixed. elm has commits with just the fixups: * https://hg.mozilla.org/projects/elm/rev/0b411c34cdcfb24b804d190f794a08704974f8a5 has the requestor (most of the nits etc) * https://hg.mozilla.org/projects/elm/rev/b0c4d66a3834ba28a3301933de4ea6f0a32c1dd3 has the responder (ie, the uploader) and I'll put another "rolled up" one here for new review. ::: services/sync/modules/bookmark_repair.js @@ +25,5 @@ > + > +// Some back of the napkin math says that 13k should be safe here -- similar > +// to how many children are safe in a folder, but it's worth noting that > +// getting the clients engine stuck is very bad, and this is put in the client > +// record with everything else, so it seems prudent to be extra safe here. I made it 1k - note however that the current process isn't "iterative" yet - it is "iterative" WRT clients, but not (yet?) "iterative" WRT repairs. This comes up below in your "breadth first" comment, but figured I'd get it out of the way early. A single "repair" is currently: * Perform a validation, determine a list of IDs we need. * Ask first client for these IDs. * After that client responds, if there are still IDs from that same initial set left, try the next client. * Eventually we get all the IDs we need or we run out of clients. As discussed with Thom (and broadly captured by 1340325), we may well decide to do this more iteratively in the future, but this becomes slightly trickier to implement the way the patch is, so to do this we'd probably just start a "new" repair tied to the original to fix subsequent validation problems (and as you mention below, probably start with the same client we used before) @@ +58,5 @@ > + // The IDs we are currently trying to obtain via the repair process, space sep'd. > + REPAIR_MISSING_IDS: "ids", > + // The IDs of the clients we've tried to get the missing items from, space sep'd. > + // The final element in the array if the client we are currently expecting to > + // see stuff from. s/if/is/ and /see stuff from/current client/ :) But as you suggested below, this is split into 2 - "current" and "previous" @@ +73,5 @@ > + this.prefs = new Preferences(PREF_BRANCH); > + } > + > + // Exposed incase another module needs to understand our state > + get STATE() { Apparently that only works for top-level "var", not "const" nor "let" :( I left it as it is. @@ +147,5 @@ > + > + let ids = this.getProblemIDs(validationInfo); > + if (ids.size == 0) { > + log.info("Not starting a repair as there are no problems"); > + return false; The caller currently doesn't care - apart from logging, which probably is also redundant, but I left that as it is. It's probably more important that telemetry knows why which is handled internally. @@ +150,5 @@ > + log.info("Not starting a repair as there are no problems"); > + return false; > + } > + > + if (ids.size > MAX_REQUESTED_IDS) { I opened bug 1341972. @@ +193,5 @@ > + if (!(ex instanceof AbortRepairError)) { > + throw ex; > + } > + log.info(`Repair has been aborted: ${ex.reason}`); > + newState = STATE.ABORTED; No, we don't want to abort a repair on shutdown - it was probably caused by the shutdown and we'll stop looking for repair responses. @@ +229,5 @@ > + let clientsUsed = this._clientsUsed; > + if (!clientsUsed.length) { > + throw new AbortRepairError(`In state ${state} but have no client IDs listed`); > + } > + let clientID = clientsUsed[clientsUsed.length - 1]; done @@ +244,5 @@ > + // Pull apart the response and see if it provided everything we asked for. > + let fixedIDs = new Set(response.ids); > + let remainingIDs = []; > + for (let id of this._currentIDs) { > + if (!fixedIDs.has(id)) { yeah, it did, thanks. Swapped a few other loops to them as well. @@ +285,5 @@ > + // So the command we previously sent is still queued for the client > + // (ie, that client is yet to have synced). Let's see if we should > + // give up on that client. > + let lastRequestSent = this.prefs.get(PREF.REPAIR_WHEN); > + let timeLeft = lastRequestSent + RESPONSE_INTERVAL_TIMEOUT - this._now() / 1000; TIL that AsyncResource.serverTime is a global and it a easy way to implement this - I just switched _now() to call that (and to throw if it's zero/undefined just to be safe) @@ +288,5 @@ > + let lastRequestSent = this.prefs.get(PREF.REPAIR_WHEN); > + let timeLeft = lastRequestSent + RESPONSE_INTERVAL_TIMEOUT - this._now() / 1000; > + if (timeLeft <= 0) { > + log.info(`previous request to client ${clientID} is pending, but has taken too long`); > + state = STATE.NEED_NEW_CLIENT; As above @@ +306,5 @@ > + // it another go (as that client may have cleared the command but is yet > + // to complete the sync) > + // XXX - note that this is no longer true - the responders don't remove > + // their command until they have written a response. This might mean > + // we could drop the entire STATE.SENT_SECOND_REQUEST concept??? That can't happen as it currently stands - the entire process is using one reducing set of IDs. @@ +322,5 @@ > + case STATE.NEED_NEW_CLIENT: > + // We need to try and find a new client to request. > + let newClientID = this._findNextClient(); > + if (!newClientID) { > + state = STATE.FINISHED; yeah, it's implied via the telemetry recording of how many IDs were remaining at the end. @@ +381,5 @@ > + } > + > + _findNextClient() { > + let alreadyDone = this._clientsUsed; > + for (let client of this.service.clientsEngine.remoteClients) { dupe device complaints tend to be FxA devices these days, but yeah, I opened bug 1342851, and elm has another patch up that sorts by that ID. @@ +395,5 @@ > + // Is the passed client record suitable as a repair responder? > + _isSuitableClient(client) { > + // filter only desktop firefox running 53+ > + return (client.type == DEVICE_TYPE_DESKTOP && > + Services.vc.compare(client.version, 52) > 0); I was surprised too, but apparently that's correct and there are specific tests for this. @@ +506,5 @@ > + log.trace(`checking the server for items`, request.ids); > + for (let remoteID of JSON.parse(itemSource.get())) { > + log.trace(`the server has "${remoteID}"`); > + existsRemotely.add(remoteID); > + // This item exists on the server, so remove it from toUpload. Yes, that's true, thanks. fetchSyncIdsForRepair() actually returns the children of folders, so I believe that code was 1/2 correct ;) I've reworked the code such that: * Any items requested are always flagged for upload if we have them, even when they are already on the server. * Any child items of requested folders are uploaded only if they don't exist on the server. So while we *might* still be uploading things that don't resolve structural issues, we avoid uploading records we *know* cause new ones.
Attachment #8839658 - Attachment is obsolete: true
Attachment #8841507 - Flags: review?(rnewman)
Comment on attachment 8841507 [details] [diff] [review] Rollup patch, ec04ef73ccab..b0c4d66a3834 Review of attachment 8841507 [details] [diff] [review]: ----------------------------------------------------------------- I think this is fine. I guess we'll see how it works in practice! ::: services/sync/modules/bookmark_repair.js @@ +86,5 @@ > + super(service); > + this.prefs = new Preferences(PREF_BRANCH); > + } > + > + /* Exposed incase another module needs to understand our state Nit: "in case"
Attachment #8841507 - Flags: review?(rnewman) → review+
Blocks: 1343093
Blocks: 1343101
Blocks: 1343103
Comment on attachment 8824877 [details] Bug 1317223 (part 5) - a bookmark repair responder. https://reviewboard.mozilla.org/r/103194/#review117290 ::: services/sync/modules/bookmark_repair.js:455 (Diff revision 3) > + } > + let engine = this.service.engineManager.get("bookmarks"); > + // Some items have been requested, but we need to be careful about how we > + // handle them. Potential issues include: > + // * The item exists locally, but isn't in the tree of items we sync (eg, it > + // might be a left-pane item or similar.) We'll write tombstones for them now. The "orphaned non-syncable item" case still exists, but is probably unlikely to happen in practice. I also filed bug 1343101 for repairing other kinds of issues. ::: services/sync/modules/bookmark_repair.js:456 (Diff revision 3) > + let engine = this.service.engineManager.get("bookmarks"); > + // Some items have been requested, but we need to be careful about how we > + // handle them. Potential issues include: > + // * The item exists locally, but isn't in the tree of items we sync (eg, it > + // might be a left-pane item or similar.) > + // * The item exists locally as a folder - and the children of the folder We'll upload all descendants of requested folders now. ::: services/sync/modules/bookmark_repair.js:461 (Diff revision 3) > + // * The item exists locally as a folder - and the children of the folder > + // also don't exist on the server - just uploading the folder isn't going > + // to help. (Note that we assume the parents *do* exist, otherwise the > + // device requesting the item be uploaded wouldn't be aware it exists) > + // XXX - is this OK? Maybe we should verify the parents are the same as > + // we have recorded? Rereading this comment, I'm not sure what "the parents are the same as we have recorded" means. The request contains a flat list of IDs, not the expected hierarchy. Is that what you had in mind, or something else? ::: services/sync/modules/bookmark_repair.js:578 (Diff revision 3) > + _abortRepair(request, rawCommand, why) { > + // XXX - probably want to try and use the same "error" handling as the > + // rest of our telemetry code? > + log.warn(`aborting repair request: ${why}`); > + this.service.clientsEngine.removeLocalCommand(rawCommand); > + // probably want to record telemetry and write a response??? Do we still want to do this? ::: services/sync/tests/unit/test_bookmark_repair_responder.js:195 (Diff revision 3) > +}); > + > +add_task(async function test_responder_tombstone() { > + let server = await setup(); > + > + // TODO: Request an item for which we have a tombstone locally. Decide if For now, we'll handle this identically to `test_responder_missing_items`. I filed bug 1343103 as a follow-up.
Attachment #8824877 - Flags: review?(kit) → review+
Attached patch skip-repair-if-repairing3.patch (obsolete) (deleted) — Splinter Review
Took comments from earlier. I did end up passing in the service to the Doctor.anyClientsRepairing call, since it allows more thorough testing (had I not done this, I would have missed some bugs in the implementation caught by the last test). It does seem a little strange for this to be on doctor given that Doctor doesn't use it now, but putting it on the bookmark repair requestor doesn't seem right either.
Attachment #8840547 - Attachment is obsolete: true
Attachment #8842200 - Flags: review?(markh)
Comment on attachment 8842200 [details] [diff] [review] skip-repair-if-repairing3.patch Review of attachment 8842200 [details] [diff] [review]: ----------------------------------------------------------------- LGTM, thanks! But I wouldn't mind another look with that exception handler in place. ::: services/sync/modules/bookmark_repair.js @@ +158,5 @@ > } > > + if (this.anyClientsRepairing()) { > + log.info("Can't start repair, since other clients are already repairing bookmarks"); > + let extra = { flowID, reason: `other clients repairing` }; probably want normal quotes here? @@ +236,5 @@ > > _continueRepairs(state, response = null) { > + if (this.anyClientsRepairing(this._flowID)) { > + throw new AbortRepairError("Already repairing bookmarks on other clients"); > + } I think we want the exception handler to write the "reason" to its telemetry too (which probably means the string passed to AbortRepairError here should be the same as you record above. ::: services/sync/modules/doctor.js @@ +36,5 @@ > + return false; > + } > + return service.engineManager.clientsEngine.remoteClients.some(client => > + client.commands && client.commands.some(command => { > + if (command.command !== "repairResponse" && command.command !== "repairRequest") { Our coding style calls for "normal" != rather than !== (even though we aren't really consistent about this) @@ +39,5 @@ > + client.commands && client.commands.some(command => { > + if (command.command !== "repairResponse" && command.command !== "repairRequest") { > + return false; > + } > + if (command.args[0].collection != collection) { I think here and below we should check command.args is true first (ie, ISTM there might be commands with a zero length args array) - or maybe short-circuit if command.args.length != 1? ::: services/sync/modules/engines/clients.js @@ -604,5 @@ > } > break; > } > case "repairRequest": { > - // Another device has sent us a request to make some repair. I don't see any reason for this comment to be removed
Attachment #8842200 - Flags: review?(markh)
Attached patch skip-repair-if-repairing4.patch (obsolete) (deleted) — Splinter Review
Nice catch on, well, most of it.
Attachment #8842200 - Attachment is obsolete: true
Attachment #8842567 - Flags: review?(markh)
Comment on attachment 8842567 [details] [diff] [review] skip-repair-if-repairing4.patch Review of attachment 8842567 [details] [diff] [review]: ----------------------------------------------------------------- Thanks!
Attachment #8842567 - Flags: review?(markh) → review+
Comment on attachment 8842567 [details] [diff] [review] skip-repair-if-repairing4.patch I had to make a couple of tweaks as clientEngine is directly on service rather than the engineManager, but I've merged and pushed this to elm, thanks!
Attachment #8842567 - Attachment is obsolete: true
Pushed by mhammond@skippinet.com.au: https://hg.mozilla.org/integration/mozilla-inbound/rev/091bb2d8a1d9 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. r=rnewman https://hg.mozilla.org/integration/mozilla-inbound/rev/705939ae18f0 (part 2) - add 'doctor' concept and move bookmark validation logic to it. r=tcsc https://hg.mozilla.org/integration/mozilla-inbound/rev/b7efe4ba02c2 (part 3) - a bookmark repair requestor. r=markh,tcsc https://hg.mozilla.org/integration/mozilla-inbound/rev/e91a6d043fc6 (part 4) - formalize weak tracking in `BookmarksChangeset`. r=markh https://hg.mozilla.org/integration/mozilla-inbound/rev/80fce16cfb57 (part 5) - a bookmark repair responder. r=kitcambridge https://hg.mozilla.org/integration/mozilla-inbound/rev/6878846abd90 (part 6) - integration tests for the bookmark repair requestor and responder. r=tcsc
Comment on attachment 8824878 [details] Bug 1317223 (part 6) - integration tests for the bookmark repair requestor and responder. https://reviewboard.mozilla.org/r/103196/#review118340 R=me, a little confused what went on here with reviews (Didn't I already R+ this?). Either way now that it's landing it's silly to have these open as r? still, although I probably should have been explicit about it earlier.
Attachment #8824878 - Flags: review?(tchiovoloni) → review+
Comment on attachment 8824876 [details] Bug 1317223 (part 3) - A bookmark repair requestor. https://reviewboard.mozilla.org/r/103192/#review118344 R=me with the caveat that I wrote part of this patch, and the version on elm is more up to date, so don't land this one, land that one. Same comment above at confusion (also confused about what the right course of action for me is given that i'm r? on this patch and I wrote part of it)
Attachment #8824876 - Flags: review?(tchiovoloni) → review+
I disabled test theses on asan builds and opened bug 1344085. Try at https://treeherder.mozilla.org/#/jobs?repo=try&revision=8178969bb31f56d327e8a4dbd82bccc2f1c65ee1&selectedJob=81306855 and tip of elm looks good, so relanding.
Pushed by mhammond@skippinet.com.au: https://hg.mozilla.org/integration/mozilla-inbound/rev/6ca43c00d509 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. r=rnewman https://hg.mozilla.org/integration/mozilla-inbound/rev/80ecb5941325 (part 2) - add 'doctor' concept and move bookmark validation logic to it. r=tcsc https://hg.mozilla.org/integration/mozilla-inbound/rev/9b717b114075 (part 3) - a bookmark repair requestor. r=markh,tcsc https://hg.mozilla.org/integration/mozilla-inbound/rev/19541a2cd74e (part 4) - formalize weak tracking in `BookmarksChangeset`. r=markh https://hg.mozilla.org/integration/mozilla-inbound/rev/78a48b9396ac (part 5) - a bookmark repair responder. r=kitcambridge https://hg.mozilla.org/integration/mozilla-inbound/rev/2e0603f1091f (part 6) - integration tests for the bookmark repair requestor and responder. r=tcsc https://hg.mozilla.org/integration/mozilla-inbound/rev/e62d1a669bba (part 7) - disable tests on ASAN due to bug 1344085. r=kitcambridge
Attachment #8824876 - Flags: review?(markh)
Attachment #8839005 - Flags: review?(markh) → review+
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. https://reviewboard.mozilla.org/r/92650/#review121248
Attachment #8810302 - Flags: review?(eoger)
Depends on: 1347373
Depends on: 1347805
Depends on: 1348743
Depends on: 1349709
Comment on attachment 8810302 [details] Bug 1317223 (part 1) - a collection_repair module (without any repairers) and integration with the clients engine. https://reviewboard.mozilla.org/r/92650/#review125666
QA Contact: mihai.ninu
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: