Closed
Bug 1223226
Opened 9 years ago
Closed 8 years ago
Investigate improvements to purge_ttl backend sync daemon.
Categories
(Cloud Services Graveyard :: Server: Sync, defect, P3)
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: bobm, Assigned: bobm)
References
Details
Sync databases tend to get overwhelmed by the TTL expiration job from the purge_ttl.py daemon script after accruing a certain number of records. Investigate running this job in a friendlier manner. Some options include:
* Use MariaDB query limits (see the sketch after this list). Note that not all limits are available in MariaDB 5.5.x. See: https://mariadb.com/kb/en/mariadb/query-limits-and-timeouts/
* Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table indexes before running TTL expiration queries.
* Limit purge TTL SQL jobs to lower traffic periods.
* Make use of purge TTL stats to improve queries. See: 976387
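For illustration, a rough sketch of the first option. It assumes MariaDB 10.1+ (max_statement_time is not available in 5.5.x) and uses pymysql with illustrative connection settings and table/column names (bso0, ttl) rather than the real production schema:

import pymysql

# Illustrative connection settings; substitute the real sync DB parameters.
conn = pymysql.connect(host="localhost", user="sync", password="...", db="weave0")
try:
    with conn.cursor() as cur:
        # Cap how long any single purge statement may run (seconds).
        # max_statement_time exists in MariaDB 10.1+, not in 5.5.x.
        cur.execute("SET SESSION max_statement_time = 30")
        # Delete expired rows in small batches so locks stay short-lived.
        while True:
            deleted = cur.execute(
                "DELETE FROM bso0 WHERE ttl < UNIX_TIMESTAMP() LIMIT 1000"
            )
            conn.commit()
            if deleted == 0:
                break
finally:
    conn.close()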
Assignee
Comment 1•9 years ago
(In reply to Bob Micheletto [:bobm] from comment #0)
> * Make use of occasional ONLINE OPTIMIZE TABLE jobs to clean up the table
> indexes before running TTL expiration queries.
See: https://www.percona.com/blog/2015/02/11/tokudb-table-optimization-improvements/
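For illustration, roughly what a periodic online optimize job could look like, assuming a TokuDB table named bso0 (illustrative, not the production schema) and the hot-optimize throttle variable described in the linked post; it would run during a low-traffic window, before the TTL expiration queries:

import pymysql

conn = pymysql.connect(host="localhost", user="sync", password="...", db="weave0")
try:
    with conn.cursor() as cur:
        # Throttle the optimization so it doesn't compete with live traffic
        # (session variable described in the linked Percona post).
        cur.execute("SET SESSION tokudb_optimize_throttle = 1000")
        # With recent TokuDB versions this runs as a hot/online operation.
        cur.execute("OPTIMIZE TABLE bso0")
finally:
    conn.close()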
Comment 2•9 years ago
We could also try mucking around with the transactions and isolation level to avoid the purge script taking too many locks.
I noticed that the purge script calls the backend method purge_expired_items, which (simplified) looks like this:
@with_session
def purge_expired_items(self, session, grace_period=0, max_per_loop=1000):
    num_iters = 0
    while num_iters < 100:
        session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
        num_iters += 1
This tries to delete items in small batches of 1000 so as not to tie up db resources. But because it's inside a @with_session context, I suspect that it may actually run all of its iterations inside a single long-running transaction.
Refactoring it to create a new transaction for each batch might make it take longer to run, but would be kinder to concurrent queries; something like:
def purge_expired_items(self, grace_period=0, max_per_loop=1000):
    num_iters = 0
    while num_iters < 100:
        with self._get_or_create_session() as session:
            session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
        num_iters += 1
Adding an explicit `SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED` could further reduce contention; according to [1], it reduces the amount of index space locked by a DELETE statement. Dirty reads shouldn't be a problem since this script only purges old records.
[1] https://dev.mysql.com/doc/refman/5.6/en/set-transaction.html
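Roughly how that could combine with the per-batch transactions above; whether the backend's session.query() will pass a raw statement like this through is an assumption, it may need to be issued on the engine/connection directly:

def purge_expired_items(self, grace_period=0, max_per_loop=1000):
    num_iters = 0
    while num_iters < 100:
        with self._get_or_create_session() as session:
            # Applies to the next transaction only; dirty reads are fine here
            # because we only ever delete rows that are already expired.
            session.query("SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED")
            session.query("PURGE_SOME_EXPIRED_ITEMS", { "maxitems": max_per_loop })
        num_iters += 1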
Updated•8 years ago
Flags: firefox-backlog+
Updated•8 years ago
Whiteboard: [sync-data-integrity]
Comment 3•8 years ago
bobm to work with team for strategy to fix
Assignee: nobody → bobm
Priority: -- → P2
Updated•8 years ago
Priority: P2 → P3
Whiteboard: [sync-data-integrity]
Comment 4•8 years ago
BTW the stuff in Comment 2 has been addressed as part of the batch-uploads branch; we'll need to link to it from here when that merges to master.
Assignee
Comment 5•8 years ago
I've hacked a suggested fix into stage that removes the ORDER BY clause from the PURGE_SOME_EXPIRED_ITEMS SQL query. :kthiessen, stage is ready for testing.
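For reference, a rough before/after of the change. The actual PURGE_SOME_EXPIRED_ITEMS query isn't quoted in this bug, so the table, column, and parameter names below are illustrative:

# Before (illustrative): the ORDER BY forces a sort/index scan before deleting.
PURGE_SOME_EXPIRED_ITEMS_OLD = """
    DELETE FROM bso0
    WHERE ttl < :grace_ts
    ORDER BY ttl
    LIMIT :maxitems
"""

# After (illustrative): any expired rows will do, so drop the ORDER BY and let
# the server delete the first :maxitems matching rows it finds.
PURGE_SOME_EXPIRED_ITEMS = """
    DELETE FROM bso0
    WHERE ttl < :grace_ts
    LIMIT :maxitems
"""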
Flags: needinfo?(kthiessen)
QA Contact: kthiessen
Comment 6•8 years ago
I will run a load test against Stage this evening or tomorrow morning PDT, to see if that fix holds up.
Flags: needinfo?(kthiessen)
Comment 7•8 years ago
Bob, are we waiting for the sharding fix in https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we load test this, or should I just go ahead and crank up some load?
Flags: needinfo?(bobm)
Assignee
Comment 8•8 years ago
(In reply to Karl Thiessen [:kthiessen] from comment #7)
> Bob, are we waiting for the sharding fix in
> https://bugzilla.mozilla.org/show_bug.cgi?id=1273102#c18 to go in before we
> load test this, or should I just go ahead and crank up some load?
I think we can test this independently. If it doesn't spike immediately on load, I think we can verify this fix.
Flags: needinfo?(bobm)
Comment 9•8 years ago
Starting 90-minute load test now.
Assignee
Comment 10•8 years ago
(In reply to Karl Thiessen [:kthiessen] from comment #9)
> Starting 90-minute load test now.
Load test results look good to me. There was no read IOPs spike.
Comment 11•8 years ago
OK. Taking several steps to mark this VERIFIED FIXED.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Updated•8 years ago
Status: RESOLVED → VERIFIED
Updated•2 years ago
Product: Cloud Services → Cloud Services Graveyard