Closed
Bug 1223226
Opened 9 years ago
Closed 8 years ago
Investigate improvements to purge_ttl backend sync daemon.
Categories
(Cloud Services Graveyard :: Server: Sync, defect, P3)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: bobm, Assigned: bobm)
References
Details
Sync databases tend to get overwhelmed by the TTL expiration job from the purge_ttl.py daemon script after accruing a certain number of records. Investigate running this job in a friendlier manner. Some options include:
* Use MariaDB query limits (see the sketch after this list). Note that not all limits are available in MariaDB 5.5.x. See: https://mariadb.com/kb/en/mariadb/query-limits-and-timeouts/
* Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table indexes before running TTL expiration queries.
* Limit purge TTL SQL jobs to lower traffic periods.
* Make use of purge TTL stats to improve queries. See: 976387
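For illustration, a rough sketch of the first option. It assumes MariaDB 10.1+ (max_statement_time is not available in 5.5.x) and uses pymysql with illustrative connection settings and table/column names (bso0, ttl) rather than the real production schema:

import pymysql

# Illustrative connection settings; substitute the real sync DB parameters.
conn = pymysql.connect(host="localhost", user="sync", password="...", db="weave0")
try:
    with conn.cursor() as cur:
        # Cap how long any single purge statement may run (seconds).
        # max_statement_time exists in MariaDB 10.1+, not in 5.5.x.
        cur.execute("SET SESSION max_statement_time = 30")
        # Delete expired rows in small batches so locks stay short-lived.
        while True:
            deleted = cur.execute(
                "DELETE FROM bso0 WHERE ttl < UNIX_TIMESTAMP() LIMIT 1000"
            )
            conn.commit()
            if deleted == 0:
                break
finally:
    conn.close()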
Assignee
Comment 1•9 years ago
(In reply to Bob Micheletto [:bobm] from comment #0)
> * Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table
> indexes before running TTL expiration queries.
See: https://www.percona.com/blog/2015/02/11/tokudb-table-optimization-improvements/
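For illustration, roughly what a periodic online optimize job could look like, assuming a TokuDB table named bso0 (illustrative, not the production schema) and the hot-optimize throttle variable described in the linked post; it would run during a low-traffic window, before the TTL expiration queries:

import pymysql

conn = pymysql.connect(host="localhost", user="sync", password="...", db="weave0")
try:
    with conn.cursor() as cur:
        # Throttle the optimization so it doesn't compete with live traffic
        # (session variable described in the linked Percona post).
        cur.execute("SET SESSION tokudb_optimize_throttle = 1000")
        # With recent TokuDB versions this runs as a hot/online operation.
        cur.execute("OPTIMIZE TABLE bso0")
finally:
    conn.close()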
Comment 2•9 years ago
We could also try mucking around with the transactions and isolation level to avoid the purge script taking too many locks.
I noticed that the purge script calls the backend method purge_expired_items, which (simplified) looks like this:
@with_session
def purge_expired_items(self, session, grace_period=0, max_per_loop=1000):
    num_iters = 0
    while num_iters < 100:
        session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
        num_iters += 1
This tries to delete items in small batches of 1000 so as not to tie up db resources. But because it's inside a @with_session context, I suspect that it may actually run all of its iterations inside a single long-running transaction.
Refactoring it to create a new transaction for each batch might make it take longer to run, but would be kinder to concurrent queries; something like:
def purge_expired_items(self, grace_period=0, max_per_loop=1000):
    num_iters = 0
    while num_iters < 100:
        with self._get_or_create_session() as session:
            session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
        num_iters += 1
Adding an explicit `SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED` could further reduce contention; according to [1], it reduces the amount of index space locked by a DELETE statement. Dirty reads shouldn't be a problem since this script only purges old records.
[1] https://dev.mysql.com/doc/refman/5.6/en/set-transaction.html
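Roughly how that could combine with the per-batch transactions above; whether the backend's session.query() will pass a raw statement like this through is an assumption, it may need to be issued on the engine/connection directly:

def purge_expired_items(self, grace_period=0, max_per_loop=1000):
    num_iters = 0
    while num_iters < 100:
        with self._get_or_create_session() as session:
            # Applies to the next transaction only; dirty reads are fine here
            # because we only ever delete rows that are already expired.
            session.query("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
            session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
        num_iters += 1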
Updated•8 years ago
Flags: firefox-backlog+
Updated•8 years ago
Whiteboard: [sync-data-integrity]
Comment 3•8 years ago
bobm to work with team for strategy to fix
Assignee: nobody → bobm
Priority: -- → P2
Updated•8 years ago
Priority: P2 → P3
Whiteboard: [sync-data-integrity]
Comment 4•8 years ago
BTW the stuff in Comment 2 has been addressed as part of the batch-uploads branch; we'll need to link to it from here when that merges to master.
Assignee
Comment 5•8 years ago
I've hacked a suggested fix into stage that removes the ORDER BY clause from the PURGE_SOME_EXPIRED_ITEMS SQL query. :kthiessen, stage is ready for testing.
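For reference, a rough before/after of the change. The actual PURGE_SOME_EXPIRED_ITEMS query isn't quoted in this bug, so the table, column, and parameter names below are illustrative:

# Before (illustrative): the ORDER BY forces a sort/index scan before deleting.
PURGE_SOME_EXPIRED_ITEMS_OLD = """
    DELETE FROM bso0
    WHERE ttl < :grace_ts
    ORDER BY ttl
    LIMIT :maxitems
"""

# After (illustrative): any expired rows will do, so drop the ORDER BY and let
# the server delete the first :maxitems matching rows it finds.
PURGE_SOME_EXPIRED_ITEMS = """
    DELETE FROM bso0
    WHERE ttl < :grace_ts
    LIMIT :maxitems
"""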
Flags: needinfo?(kthiessen)
QA Contact: kthiessen
Comment 6•8 years ago
I will run a load test against Stage this evening or tomorrow morning PDT, to see if that fix holds up.
Flags: needinfo?(kthiessen)
Comment 7•8 years ago
Bob, are we waiting for the sharding fix in https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we load test this, or should I just go ahead and crank up some load?
Flags: needinfo?(bobm)
Assignee
Comment 8•8 years ago
(In reply to Karl Thiessen [:kthiessen] from comment #7)
> Bob, are we waiting for the sharding fix in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we
> load test this, or should I just go ahead and crank up some load?
I think we can test this independently. If it doesn't spike immediately on load, I think we can verify this fix.
Flags: needinfo?(bobm)
Comment 9•8 years ago
Starting 90-minute load test now.
Assignee
Comment 10•8 years ago
(In reply to Karl Thiessen [:kthiessen] from comment #9)
> Starting 90-minute load test now.
Load test results look good to me. There was no read IOPs spike.
Comment 11•8 years ago
OK. Taking several steps to mark this VERIFIED FIXED.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•8 years ago
Status: RESOLVED → VERIFIED
Updated•2 years ago
Product: Cloud Services → Cloud Services Graveyard