Closed Bug 1223226 Opened 9 years ago Closed 8 years ago

Investigate improvements to purge_ttl backend sync daemon.

Categories

(Cloud Services Graveyard :: Server: Sync, defect, P3)


Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: bobm, Assigned: bobm)

References

Details

Sync databases tend to get overwhelmed by the TTL expiration job from the purge_ttl.py daemon script after they accrue a certain number of records. Investigate running this job in a friendlier manner. Some options include:

* Use MariaDB query limits (see the sketch after this list). Note that not all limits are available in MariaDB 5.5.x. See: https://mariadb.com/kb/en/mariadb/query-limits-and-timeouts/
* Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table indexes before running TTL expiration queries.
* Limit purge TTL SQL jobs to lower traffic periods.
* Make use of purge TTL stats to improve queries. See: bug 976387
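A minimal sketch of the query-limits option above, assuming MariaDB 10.1+ (where SET STATEMENT max_statement_time exists, one reason not all limits are available on 5.5.x) and assuming the per-statement limit applies to this DELETE. The table/column names (bso, ttl) and raw SQL are illustrative assumptions, not the actual syncstorage query:

    import time

    # Time-box each purge batch so a runaway DELETE can't hold the server
    # for more than a few seconds (assumed 5-second cap per statement).
    PURGE_BATCH = """
        SET STATEMENT max_statement_time=5 FOR
            DELETE FROM bso WHERE ttl < %(now)s LIMIT %(maxitems)s
    """

    def purge_one_batch(conn, max_per_loop=1000):
        # conn is a DB-API connection (e.g. pymysql) to the sync database.
        with conn.cursor() as cur:
            cur.execute(PURGE_BATCH, {"now": int(time.time()), "maxitems": max_per_loop})
            deleted = cur.rowcount
        conn.commit()
        return deleted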
(In reply to Bob Micheletto [:bobm] from comment #0)
> * Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table
> indexes before running TTL expiration queries.

See: https://www.percona.com/blog/2015/02/11/tokudb-table-optimization-improvements/
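For illustration, a minimal sketch of what an occasional OPTIMIZE TABLE pass before the purge might look like. This assumes a TokuDB-style engine where OPTIMIZE TABLE runs online (per the Percona post above); the shard table names (bso0..bso15) are placeholders:

    def optimize_bso_shards(conn, num_shards=16):
        # Run an online OPTIMIZE TABLE on each assumed shard table before
        # kicking off the TTL expiration queries.
        with conn.cursor() as cur:
            for i in range(num_shards):
                cur.execute("OPTIMIZE TABLE bso%d" % i)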
We could also try mucking around with the transactions and isolation level to avoid the purge script taking too many locks.

I noticed that the purge script calls the backend method purge_expired_items, which (simplified) looks like this:

    @with_session
    def purge_expired_items(self, session, grace_period=0, max_per_loop=1000):
        num_iters = 0
        while num_iters < 100:
            session.query("PURGE_SOME_EXPIRED_ITEMS", {
                "maxitems": max_per_loop
            })
            num_iters += 1

This tries to delete items in small batches of 1000 so as not to tie up db resources. But because it's inside a @with_session context, I suspect that it may actually run all of its iterations inside a single long-running transaction. Refactoring it to create a new transaction for each batch might mean it takes longer to run but is kinder to concurrent queries, something like:

    def purge_expired_items(self, grace_period=0, max_per_loop=1000):
        num_iters = 0
        while num_iters < 100:
            with self._get_or_create_session() as session:
                session.query("PURGE_SOME_EXPIRED_ITEMS", {
                    "maxitems": max_per_loop
                })
            num_iters += 1

Adding an explicit `SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED` could further reduce contention; according to [1] it reduces the amount of index space locked by a DELETE statement. Dirty reads shouldn't be a problem since this script only purges old records.

[1] https://dev.mysql.com/doc/refman/5.6/en/set-transaction.html
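Pulling the two suggestions in that comment together, a hedged sketch of a per-batch purge that starts a fresh transaction for each DELETE and sets READ UNCOMMITTED first. The raw DB-API style and the bso/ttl names are illustrative assumptions; the real code would go through the syncstorage session layer:

    import time

    def purge_expired_items(conn, grace_period=0, max_per_loop=1000, max_iters=100):
        cutoff = int(time.time()) - grace_period
        for _ in range(max_iters):
            with conn.cursor() as cur:
                # Applies to the next transaction only, i.e. this batch's DELETE.
                cur.execute("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
                cur.execute(
                    "DELETE FROM bso WHERE ttl < %(cutoff)s LIMIT %(maxitems)s",
                    {"cutoff": cutoff, "maxitems": max_per_loop},
                )
                deleted = cur.rowcount
            # Committing ends the batch transaction, releasing locks between batches.
            conn.commit()
            if deleted == 0:
                break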
Blocks: 1250189
Flags: firefox-backlog+
Whiteboard: [sync-data-integrity]
bobm to work with the team on a strategy to fix this.
Assignee: nobody → bobm
Priority: -- → P2
Priority: P2 → P3
BTW, the stuff in comment 2 has been addressed as part of the batch-uploads branch; we'll need to link to it from here when that merges to master.
I've hacked a suggested fix into stage that removes the ORDER BY clause from the PURGE_SOME_EXPIRED_ITEMS SQL query. :kthiessen, stage is ready for testing.
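For context, a hedged sketch of what that change might look like. The actual PURGE_SOME_EXPIRED_ITEMS query isn't quoted in this bug, so the table/column names and clauses below are assumptions for illustration only:

    # Assumed "before" shape: the ORDER BY forces the server to sort/scan
    # candidate rows before it can delete the first batch.
    PURGE_SOME_EXPIRED_ITEMS_OLD = """
        DELETE FROM bso
        WHERE ttl < %(now)s
        ORDER BY ttl
        LIMIT %(maxitems)s
    """

    # Suggested fix: drop the ORDER BY; any expired rows will do, so the
    # DELETE can stop as soon as it has found maxitems matching rows.
    PURGE_SOME_EXPIRED_ITEMS = """
        DELETE FROM bso
        WHERE ttl < %(now)s
        LIMIT %(maxitems)s
    """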
Flags: needinfo?(kthiessen)
QA Contact: kthiessen
I will run a load test against Stage this evening or tomorrow morning PDT, to see if that fix holds up.
Flags: needinfo?(kthiessen)
Bob, are we waiting for the sharding fix in https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we load test this, or should I just go ahead and crank up some load?
Flags: needinfo?(bobm)
(In reply to Karl Thiessen [:kthiessen] from comment #7)
> Bob, are we waiting for the sharding fix in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we
> load test this, or should I just go ahead and crank up some load?

I think we can test this independently. If it doesn't spike immediately on load, I think we can verify this fix.
Flags: needinfo?(bobm)
Starting 90-minute load test now.
(In reply to Karl Thiessen [:kthiessen] from comment #9)
> Starting 90-minute load test now.

Load test results look good to me. There was no read IOPs spike.
OK. Taking several steps to mark this VERIFIED FIXED.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Status: RESOLVED → VERIFIED
Product: Cloud Services → Cloud Services Graveyard