Closed Bug 779290 Opened 12 years ago Closed 12 years ago

regular tbpl purge/defrag

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: scabral, Assigned: scabral)

References

Details

we purged/defrag'd in https://bugzilla.mozilla.org/show_bug.cgi?id=703967 successfully. Now we need to set up a purge/defrag regular schedule. The tables are still large - runs_logs is still 44G - so we can't just do it in a scheduled task. We need to figure out how frequently we should defrag, so I'd wait a month (end of August) and check to see the size of the tables. There are 3 steps to the defrag: 0) Delete the values from the runs table: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); 1) Delete the values from the builders table: DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL; 2) Defrag the tables: OPTIMIZE TABLE runs; OPTIMIZE TABLE runs_notes; OPTIMIZE TABLE runs_logs; OPTIMIZE TABLE builders; OPTIMIZE TABLE builders_history;
Current size of the tables as of Tue 31 July: generic1 -rw-rw---- 1 mysql mysql 892M Jul 25 11:27 runs.ibd -rw-rw---- 1 mysql mysql 15M Jul 25 11:27 runs_notes.ibd -rw-rw---- 1 mysql mysql 9.0M Jul 25 22:03 builders_history.ibd -rw-rw---- 1 mysql mysql 13M Jul 25 22:03 builders.ibd -rw-rw---- 1 mysql mysql 44G Jul 29 13:37 runs_logs.ibd generic2 -rw-rw---- 1 mysql mysql 9.0M Jul 30 15:17 builders_history.ibd -rw-rw---- 1 mysql mysql 13M Jul 31 13:35 builders.ibd -rw-rw---- 1 mysql mysql 15M Jul 31 13:38 runs_notes.ibd -rw-rw---- 1 mysql mysql 50G Jul 31 13:38 runs_logs.ibd -rw-rw---- 1 mysql mysql 1012M Jul 31 13:40 runs.ibd
Assignee: server-ops-database → scabral
Whiteboard: Check at end of Aug 2012
As of Fri Aug 31st, increases: runs: about 700M (About 40%) runs_notes: 2M (about 15%) builders_history: 7M (350%) builders: 1M (about 8%) runs_logs: 23G (about 45%) There's 100G still left on the machine, so it's not necessary to do monthly, but probably a good idea to try. generic1: [root@generic1.db.phx1 tbpl_mozilla_org]# ls -lh *.ibd -rw-rw---- 1 mysql mysql 1.0M Aug 31 00:47 bugscache.ibd -rw-rw---- 1 mysql mysql 9.0M Aug 31 02:06 builders_history.ibd -rw-rw---- 1 mysql mysql 14M Aug 31 10:10 builders.ibd -rw-rw---- 1 mysql mysql 1.5G Aug 31 10:10 runs.ibd -rw-rw---- 1 mysql mysql 73G Aug 31 10:12 runs_logs.ibd -rw-rw---- 1 mysql mysql 17M Aug 31 10:09 runs_notes.ibd generic2: -rw-rw---- 1 mysql mysql 112K Feb 21 2012 bugscache.ibd -rw-rw---- 1 mysql mysql 9.0M Aug 30 21:48 builders_history.ibd -rw-rw---- 1 mysql mysql 14M Aug 31 10:10 builders.ibd -rw-rw---- 1 mysql mysql 1.5G Aug 31 10:10 runs.ibd -rw-rw---- 1 mysql mysql 73G Aug 31 10:10 runs_logs.ibd -rw-rw---- 1 mysql mysql 17M Aug 31 10:09 runs_notes.ibd
Whiteboard: Check at end of Aug 2012 → Check at end of Sep 2012
We now have a bugscache table, which has rows deleted during the 5 min cron, so would presumably benefit from an OPTIMIZE periodically. Also, for builders_history I would like us to purge old history entries, even if the builder to which they apply still exists; so when we get a UI for the history entries, the contents is kept manageable. This gives a revised checklist of: 0) Delete old runs: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); 1) Delete rows from the builders table where the buildername is now redundant: DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL; 2) Delete builders_history entries older than 6 months: DELETE FROM builders_history WHERE date < ( NOW( ) - INTERVAL 6 MONTH ); 3) Defrag the tables: OPTIMIZE TABLE runs; OPTIMIZE TABLE runs_notes; OPTIMIZE TABLE runs_logs; OPTIMIZE TABLE builders; OPTIMIZE TABLE builders_history; OPTIMIZE TABLE bugscache; (And just to confirm, unlike the first purge, we definitely want subsequent ones to use 1 month at step #0 not 2 months)
My apologies, we've had a chat about the builders_history deletions, and decided not to do them after all. We still would like to OPTIMIZE bugscache, per comment 3. (The only change now, since comment 0). This gives the revised checklist as: 0) Delete old runs: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); 1) Delete rows from the builders table where the buildername is now redundant: DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL; 2) Defrag the tables: OPTIMIZE TABLE runs; OPTIMIZE TABLE runs_notes; OPTIMIZE TABLE runs_logs; OPTIMIZE TABLE builders; OPTIMIZE TABLE builders_history; OPTIMIZE TABLE bugscache;
OK, so should we do this monthly, then? I think the end of the month is a bad target because that would fall into "goals" territory every 3 months. Maybe on or about the 15th of every month?
Sounds good, thank you :-)
Just as a heads up: Rate of DB growth will likely increase now that bug 718632 has landed (was on tbpl-dev as of ~2012-09-04 and prod as of ~2012-09-13), since we now prefetch logs for failures, whereas previously some would have not been opened and thus not fetched from ftp.mozilla.org and stuffed into the DB.
OK, great. I can check size on Friday and we can work on purging next week?
Sounds good :-)
acknowledging I'm a slacker. I may not get to this tomorrow, too. I can try to do this during next week's webops meetup in SF.
Would partitioning help here?
(In reply to Dustin J. Mitchell [:dustin] from comment #11) > Would partitioning help here? Bug 703967 comment 9 and bug 766524 comment 2 decided it wouldn't be viable due to the schema, aiui.
We should probably do this for the dev server, too, since tbpl_allizom_org is 64G. Please let me know if doing the purge on dev would pose a problem?
I don’t think so. Please go ahead. Also due to the prefetching, the dev database will probably grow equally fast as the production one...
Awesome. We should definitely do both, then :D I'll keep the bug updated.
On dev2, tbpl_dev_allizom_org: mysql> use tbpl_dev_allizom_org Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed mysql> DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); Query OK, 2968963 rows affected (5 min 17.79 sec) mysql> mysql> DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.bldername IS NULL; Query OK, 3284 rows affected (5.47 sec) starting to optimize the tables now.
dev2 purge of tbpl_dev_allizom_org is done, tbpl_allizom_org doesn't have anything in it. I have put dev2 back into the load balancer, will do dev1 tomorrow.
dev1 purge is done, and dev1 is the master again (dev2 is a slave and set as read_only). I will work on generic2 tomorrow, and then we'll work on generic1.
Thank you :-)
Didn't get to generic1/2 today.
Due to foreign keys, it would be best to do the deletes first, which will replicate, and then take the servers offline one at a time to defrag. I'm working on the deletes first - they'll take about an hour and a half, because we're doing this: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH limit 2500; do sleep(1); Query OK, 2500 rows affected, 1 warning (4.69 sec) (the warning is that the statement may not be safe for replication, because we're doing a delete with a limit, but generic2 has the same amount of data, so we're good.
Done this on the builders table too: delete from builders where buildername not in (select buildername from runs) limit 100; (over and over until it was complete).
we should do another purge at the end of Oct.
Whiteboard: Check at end of Sep 2012 → Check at end of Oct 2012
Starting the purge process on the dev servers, by doing this over and over on the dev master: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH limit 2500;
Optimizing the tables on dev2, without replicating to dev1. dev1 is currently serving rw and ro hits, dev2 isn't serving any traffic.
dev2 is down to 44G, dev1 is out of the load balancer and optimizing tables as well (started out as 84G on dev1).
dev1 is purged and back as the read/write db, and dev2 is back as the read-only db. Both reduced in size to about half, from 84G to 44G. I am deleting on tbpl_mozilla_org now, and after the deletes are done I will work on the defragmentation.
Defragmenting the production backup right now. Will take generic2 out of the load balancer and start defrag'ing that too.
pre-defrag on generic2: 672K bugscache.ibd 9.0M builders_history.ibd 15M builders.ibd 2.0G runs.ibd 139G runs_logs.ibd 20M runs_notes.ibd
generic2 was finished on Friday, here's the post-defrag: 352K bugscache.ibd 9.0M builders_history.ibd 14M builders.ibd 529M runs.ibd 42G runs_logs.ibd 13M runs_notes.ibd Freed up almost 100G.
Whiteboard: Check at end of Oct 2012 → Check at end of Jan 2012
generic1 pre-defrag: mysql> \! df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg00-root 261G 165G 84G 67% / tmpfs 5.9G 0 5.9G 0% /dev/shm /dev/sda1 504M 88M 391M 19% /boot
Thank you for keeping on top of this :-)
Generic1 done, 98G freed up: mysql> \! df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg00-root 261G 67G 182G 27% / tmpfs 5.9G 0 5.9G 0% /dev/shm /dev/sda1 504M 88M 391M 19% /boot
Blocks: 835608
Whiteboard: Check at end of Jan 2012 → Check at end of Feb 2013
Working on a purge now, we should have done it about 2 weeks ago. We were reminded by disk space on production paging us.
Whiteboard: Check at end of Feb 2013 → Check at end of Mar 2013
TBPL is pretty slow at the moment & getting a few "service unavailables". Presuming it's the data purge? Could we throttle it back a tiny bit? :-)
It was throttled back soon after this message, sorry I forgot to write about it. Things should be more than fine now. (the optimization took a few hours)
Thank you :-)
We will be failing over to generic2 now, the defrag has finished there, and we will defrag on generic1.
generic cluster has been failed over to use generic2 as the master now. Doing maintenance on generic1, will let you know when maintenance is done and we fail back.
The maintenance is over, waiting for the all-clear from SREs to switch back.
switching back now.
generic cluster is back to normal now, this defrag is over for March.
Thanks Sheeri :-)
We'll probably need another purge here soon, though presume it may now happen via the mechanism in bug 835608. However, please may it wait until bug 868356 has been completed - thank you :-)
We did a defrag last week, and since then, we have an automated script to do nightly deletes. We don't need to reclaim space with defrags any more, because the space is re-used by MySQL, and it's only one day of space that is in flux at any given time. Special thanks to Brandon Johnson (:cyborgshadow) who did the heavy lifting on the scripting.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Whiteboard: Check at end of Mar 2013
That's great - thank you :-) Just to double check - the script runs at 3am pacific right?
It's set to run at 2 am local time, which is UTC. So it's at 7 pm Pacific (if my calculations are correct). Would you like us to change it to 3 am Pacific? If so, that's 10 am UTC (http://www.worldtimebuddy.com/utc-to-pdt-converter)
I'm sure that time will be fine - I asked mainly so that if we saw any daily perf issues we knew whether they were related. Thank you :-)
For the record, in both production and dev, this runs in less than 5 seconds daily. Given the size of the data it's touching, I'd say that's pretty good, and wouldn't expect a performance hit from it. :)
Ah, worrying over nothing! :-)
Product: mozilla.org → Data & BI Services Team
You need to log in before you can comment on or make changes to this bug.