Closed
Bug 779290
Opened 12 years ago
Closed 12 years ago
regular tbpl purge/defrag
Categories
(Data & BI Services Team :: DB: MySQL, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: scabral, Assigned: scabral)
References
Details
we purged/defrag'd in https://bugzilla.mozilla.org/show_bug.cgi?id=703967 successfully. Now we need to set up a purge/defrag regular schedule. The tables are still large - runs_logs is still 44G - so we can't just do it in a scheduled task.
We need to figure out how frequently we should defrag, so I'd wait a month (end of August) and check to see the size of the tables.
There are 3 steps to the defrag:
0) Delete the values from the runs table:
DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH );
1) Delete the values from the builders table:
DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL;
2) Defrag the tables:
OPTIMIZE TABLE runs;
OPTIMIZE TABLE runs_notes;
OPTIMIZE TABLE runs_logs;
OPTIMIZE TABLE builders;
OPTIMIZE TABLE builders_history;
Assignee | ||
Comment 1•12 years ago
|
||
Current size of the tables as of Tue 31 July:
generic1
-rw-rw---- 1 mysql mysql 892M Jul 25 11:27 runs.ibd
-rw-rw---- 1 mysql mysql 15M Jul 25 11:27 runs_notes.ibd
-rw-rw---- 1 mysql mysql 9.0M Jul 25 22:03 builders_history.ibd
-rw-rw---- 1 mysql mysql 13M Jul 25 22:03 builders.ibd
-rw-rw---- 1 mysql mysql 44G Jul 29 13:37 runs_logs.ibd
generic2
-rw-rw---- 1 mysql mysql 9.0M Jul 30 15:17 builders_history.ibd
-rw-rw---- 1 mysql mysql 13M Jul 31 13:35 builders.ibd
-rw-rw---- 1 mysql mysql 15M Jul 31 13:38 runs_notes.ibd
-rw-rw---- 1 mysql mysql 50G Jul 31 13:38 runs_logs.ibd
-rw-rw---- 1 mysql mysql 1012M Jul 31 13:40 runs.ibd
Assignee | ||
Updated•12 years ago
|
Assignee: server-ops-database → scabral
Assignee | ||
Updated•12 years ago
|
Whiteboard: Check at end of Aug 2012
Assignee | ||
Comment 2•12 years ago
|
||
As of Fri Aug 31st, increases:
runs: about 700M (About 40%)
runs_notes: 2M (about 15%)
builders_history: 7M (350%)
builders: 1M (about 8%)
runs_logs: 23G (about 45%)
There's 100G still left on the machine, so it's not necessary to do monthly, but probably a good idea to try.
generic1:
[root@generic1.db.phx1 tbpl_mozilla_org]# ls -lh *.ibd
-rw-rw---- 1 mysql mysql 1.0M Aug 31 00:47 bugscache.ibd
-rw-rw---- 1 mysql mysql 9.0M Aug 31 02:06 builders_history.ibd
-rw-rw---- 1 mysql mysql 14M Aug 31 10:10 builders.ibd
-rw-rw---- 1 mysql mysql 1.5G Aug 31 10:10 runs.ibd
-rw-rw---- 1 mysql mysql 73G Aug 31 10:12 runs_logs.ibd
-rw-rw---- 1 mysql mysql 17M Aug 31 10:09 runs_notes.ibd
generic2:
-rw-rw---- 1 mysql mysql 112K Feb 21 2012 bugscache.ibd
-rw-rw---- 1 mysql mysql 9.0M Aug 30 21:48 builders_history.ibd
-rw-rw---- 1 mysql mysql 14M Aug 31 10:10 builders.ibd
-rw-rw---- 1 mysql mysql 1.5G Aug 31 10:10 runs.ibd
-rw-rw---- 1 mysql mysql 73G Aug 31 10:10 runs_logs.ibd
-rw-rw---- 1 mysql mysql 17M Aug 31 10:09 runs_notes.ibd
Assignee | ||
Updated•12 years ago
|
Whiteboard: Check at end of Aug 2012 → Check at end of Sep 2012
Comment 3•12 years ago
|
||
We now have a bugscache table, which has rows deleted during the 5 min cron, so would presumably benefit from an OPTIMIZE periodically.
Also, for builders_history I would like us to purge old history entries, even if the builder to which they apply still exists; so when we get a UI for the history entries, the contents is kept manageable.
This gives a revised checklist of:
0) Delete old runs:
DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH );
1) Delete rows from the builders table where the buildername is now redundant:
DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL;
2) Delete builders_history entries older than 6 months:
DELETE FROM builders_history WHERE date < ( NOW( ) - INTERVAL 6 MONTH );
3) Defrag the tables:
OPTIMIZE TABLE runs;
OPTIMIZE TABLE runs_notes;
OPTIMIZE TABLE runs_logs;
OPTIMIZE TABLE builders;
OPTIMIZE TABLE builders_history;
OPTIMIZE TABLE bugscache;
(And just to confirm, unlike the first purge, we definitely want subsequent ones to use 1 month at step #0 not 2 months)
Comment 4•12 years ago
|
||
My apologies, we've had a chat about the builders_history deletions, and decided not to do them after all.
We still would like to OPTIMIZE bugscache, per comment 3. (The only change now, since comment 0).
This gives the revised checklist as:
0) Delete old runs:
DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH );
1) Delete rows from the builders table where the buildername is now redundant:
DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL;
2) Defrag the tables:
OPTIMIZE TABLE runs;
OPTIMIZE TABLE runs_notes;
OPTIMIZE TABLE runs_logs;
OPTIMIZE TABLE builders;
OPTIMIZE TABLE builders_history;
OPTIMIZE TABLE bugscache;
Assignee | ||
Comment 5•12 years ago
|
||
OK, so should we do this monthly, then? I think the end of the month is a bad target because that would fall into "goals" territory every 3 months. Maybe on or about the 15th of every month?
Comment 6•12 years ago
|
||
Sounds good, thank you :-)
Comment 7•12 years ago
|
||
Just as a heads up:
Rate of DB growth will likely increase now that bug 718632 has landed (was on tbpl-dev as of ~2012-09-04 and prod as of ~2012-09-13), since we now prefetch logs for failures, whereas previously some would have not been opened and thus not fetched from ftp.mozilla.org and stuffed into the DB.
Assignee | ||
Comment 8•12 years ago
|
||
OK, great. I can check size on Friday and we can work on purging next week?
Comment 9•12 years ago
|
||
Sounds good :-)
Assignee | ||
Comment 10•12 years ago
|
||
acknowledging I'm a slacker. I may not get to this tomorrow, too. I can try to do this during next week's webops meetup in SF.
Comment 11•12 years ago
|
||
Would partitioning help here?
Comment 12•12 years ago
|
||
(In reply to Dustin J. Mitchell [:dustin] from comment #11)
> Would partitioning help here?
Bug 703967 comment 9 and bug 766524 comment 2 decided it wouldn't be viable due to the schema, aiui.
Assignee | ||
Comment 13•12 years ago
|
||
We should probably do this for the dev server, too, since tbpl_allizom_org is 64G. Please let me know if doing the purge on dev would pose a problem?
Comment 14•12 years ago
|
||
I don’t think so. Please go ahead.
Also due to the prefetching, the dev database will probably grow equally fast as the production one...
Assignee | ||
Comment 15•12 years ago
|
||
Awesome. We should definitely do both, then :D
I'll keep the bug updated.
Assignee | ||
Comment 16•12 years ago
|
||
Used comment 4 to populate https://mana.mozilla.org/wiki/display/SYSADMIN/How+to+purge+tbpl+data
I'll start this now for dev2.
Assignee | ||
Comment 17•12 years ago
|
||
On dev2, tbpl_dev_allizom_org:
mysql> use tbpl_dev_allizom_org
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Database changed
mysql> DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH );
Query OK, 2968963 rows affected (5 min 17.79 sec)
mysql>
mysql> DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.bldername IS NULL;
Query OK, 3284 rows affected (5.47 sec)
starting to optimize the tables now.
Assignee | ||
Comment 18•12 years ago
|
||
dev2 purge of tbpl_dev_allizom_org is done, tbpl_allizom_org doesn't have anything in it. I have put dev2 back into the load balancer, will do dev1 tomorrow.
Assignee | ||
Comment 19•12 years ago
|
||
dev1 purge is done, and dev1 is the master again (dev2 is a slave and set as read_only). I will work on generic2 tomorrow, and then we'll work on generic1.
Comment 20•12 years ago
|
||
Thank you :-)
Assignee | ||
Comment 21•12 years ago
|
||
Didn't get to generic1/2 today.
Assignee | ||
Comment 22•12 years ago
|
||
Due to foreign keys, it would be best to do the deletes first, which will replicate, and then take the servers offline one at a time to defrag.
I'm working on the deletes first - they'll take about an hour and a half, because we're doing this:
DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH limit 2500; do sleep(1);
Query OK, 2500 rows affected, 1 warning (4.69 sec)
(the warning is that the statement may not be safe for replication, because we're doing a delete with a limit, but generic2 has the same amount of data, so we're good.
Assignee | ||
Comment 23•12 years ago
|
||
Done this on the builders table too:
delete from builders where buildername not in
(select buildername from runs) limit 100;
(over and over until it was complete).
Assignee | ||
Comment 24•12 years ago
|
||
we should do another purge at the end of Oct.
Whiteboard: Check at end of Sep 2012 → Check at end of Oct 2012
Assignee | ||
Comment 25•12 years ago
|
||
Starting the purge process on the dev servers, by doing this over and over on the dev master:
DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH limit 2500;
Assignee | ||
Comment 26•12 years ago
|
||
Optimizing the tables on dev2, without replicating to dev1. dev1 is currently serving rw and ro hits, dev2 isn't serving any traffic.
Assignee | ||
Comment 27•12 years ago
|
||
dev2 is down to 44G, dev1 is out of the load balancer and optimizing tables as well (started out as 84G on dev1).
Assignee | ||
Comment 28•12 years ago
|
||
dev1 is purged and back as the read/write db, and dev2 is back as the read-only db. Both reduced in size to about half, from 84G to 44G.
I am deleting on tbpl_mozilla_org now, and after the deletes are done I will work on the defragmentation.
Assignee | ||
Comment 29•12 years ago
|
||
Defragmenting the production backup right now.
Will take generic2 out of the load balancer and start defrag'ing that too.
Assignee | ||
Comment 30•12 years ago
|
||
pre-defrag on generic2:
672K bugscache.ibd
9.0M builders_history.ibd
15M builders.ibd
2.0G runs.ibd
139G runs_logs.ibd
20M runs_notes.ibd
Assignee | ||
Comment 31•12 years ago
|
||
generic2 was finished on Friday, here's the post-defrag:
352K bugscache.ibd
9.0M builders_history.ibd
14M builders.ibd
529M runs.ibd
42G runs_logs.ibd
13M runs_notes.ibd
Freed up almost 100G.
Whiteboard: Check at end of Oct 2012 → Check at end of Jan 2012
Assignee | ||
Comment 32•12 years ago
|
||
generic1 pre-defrag:
mysql> \! df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg00-root
261G 165G 84G 67% /
tmpfs 5.9G 0 5.9G 0% /dev/shm
/dev/sda1 504M 88M 391M 19% /boot
Comment 33•12 years ago
|
||
Thank you for keeping on top of this :-)
Assignee | ||
Comment 34•12 years ago
|
||
Generic1 done, 98G freed up:
mysql> \! df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/vg00-root
261G 67G 182G 27% /
tmpfs 5.9G 0 5.9G 0% /dev/shm
/dev/sda1 504M 88M 391M 19% /boot
Assignee | ||
Updated•12 years ago
|
Whiteboard: Check at end of Jan 2012 → Check at end of Feb 2013
Assignee | ||
Comment 35•12 years ago
|
||
Working on a purge now, we should have done it about 2 weeks ago. We were reminded by disk space on production paging us.
Whiteboard: Check at end of Feb 2013 → Check at end of Mar 2013
Comment 36•12 years ago
|
||
TBPL is pretty slow at the moment & getting a few "service unavailables".
Presuming it's the data purge? Could we throttle it back a tiny bit? :-)
Assignee | ||
Comment 37•12 years ago
|
||
It was throttled back soon after this message, sorry I forgot to write about it. Things should be more than fine now. (the optimization took a few hours)
Comment 38•12 years ago
|
||
Thank you :-)
Assignee | ||
Comment 39•12 years ago
|
||
We will be failing over to generic2 now, the defrag has finished there, and we will defrag on generic1.
Assignee | ||
Comment 40•12 years ago
|
||
generic cluster has been failed over to use generic2 as the master now. Doing maintenance on generic1, will let you know when maintenance is done and we fail back.
Assignee | ||
Comment 41•12 years ago
|
||
The maintenance is over, waiting for the all-clear from SREs to switch back.
Assignee | ||
Comment 42•12 years ago
|
||
switching back now.
Assignee | ||
Comment 43•12 years ago
|
||
generic cluster is back to normal now, this defrag is over for March.
Comment 44•12 years ago
|
||
Thanks Sheeri :-)
Comment 45•12 years ago
|
||
We'll probably need another purge here soon, though presume it may now happen via the mechanism in bug 835608.
However, please may it wait until bug 868356 has been completed - thank you :-)
Assignee | ||
Comment 46•12 years ago
|
||
We did a defrag last week, and since then, we have an automated script to do nightly deletes. We don't need to reclaim space with defrags any more, because the space is re-used by MySQL, and it's only one day of space that is in flux at any given time.
Special thanks to Brandon Johnson (:cyborgshadow) who did the heavy lifting on the scripting.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: Check at end of Mar 2013
Comment 47•12 years ago
|
||
That's great - thank you :-)
Just to double check - the script runs at 3am pacific right?
Assignee | ||
Comment 48•12 years ago
|
||
It's set to run at 2 am local time, which is UTC. So it's at 7 pm Pacific (if my calculations are correct).
Would you like us to change it to 3 am Pacific? If so, that's 10 am UTC (http://www.worldtimebuddy.com/utc-to-pdt-converter)
Comment 49•12 years ago
|
||
I'm sure that time will be fine - I asked mainly so that if we saw any daily perf issues we knew whether they were related. Thank you :-)
Comment 50•12 years ago
|
||
For the record, in both production and dev, this runs in less than 5 seconds daily. Given the size of the data it's touching, I'd say that's pretty good, and wouldn't expect a performance hit from it. :)
Comment 51•12 years ago
|
||
Ah, worrying over nothing! :-)
Updated•10 years ago
|
Product: mozilla.org → Data & BI Services Team
You need to log in
before you can comment on or make changes to this bug.
Description
•