Closed Bug 779290 Opened 12 years ago Closed 12 years ago

regular tbpl purge/defrag

Categories

(Data & BI Services Team :: DB: MySQL, task)

Product:

Component:

Platform:

x86

macOS

Type:

task

Priority:

Not set

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: scabral, Assigned: scabral)

References

Details

Sheeri Cabral [:sheeri]

Assignee

Description

•

12 years ago

we purged/defrag'd in https://bugzilla.mozilla.org/show_bug.cgi?id=703967 successfully. Now we need to set up a purge/defrag regular schedule. The tables are still large - runs_logs is still 44G - so we can't just do it in a scheduled task. We need to figure out how frequently we should defrag, so I'd wait a month (end of August) and check to see the size of the tables. There are 3 steps to the defrag: 0) Delete the values from the runs table: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); 1) Delete the values from the builders table: DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL; 2) Defrag the tables: OPTIMIZE TABLE runs; OPTIMIZE TABLE runs_notes; OPTIMIZE TABLE runs_logs; OPTIMIZE TABLE builders; OPTIMIZE TABLE builders_history;

Sheeri Cabral [:sheeri]

Assignee

Comment 1

•

12 years ago

Current size of the tables as of Tue 31 July: generic1 -rw-rw---- 1 mysql mysql 892M Jul 25 11:27 runs.ibd -rw-rw---- 1 mysql mysql 15M Jul 25 11:27 runs_notes.ibd -rw-rw---- 1 mysql mysql 9.0M Jul 25 22:03 builders_history.ibd -rw-rw---- 1 mysql mysql 13M Jul 25 22:03 builders.ibd -rw-rw---- 1 mysql mysql 44G Jul 29 13:37 runs_logs.ibd generic2 -rw-rw---- 1 mysql mysql 9.0M Jul 30 15:17 builders_history.ibd -rw-rw---- 1 mysql mysql 13M Jul 31 13:35 builders.ibd -rw-rw---- 1 mysql mysql 15M Jul 31 13:38 runs_notes.ibd -rw-rw---- 1 mysql mysql 50G Jul 31 13:38 runs_logs.ibd -rw-rw---- 1 mysql mysql 1012M Jul 31 13:40 runs.ibd

Sheeri Cabral [:sheeri]

Assignee

Updated

•

12 years ago

Assignee: server-ops-database → scabral

Sheeri Cabral [:sheeri]

Assignee

Updated

•

12 years ago

Whiteboard: Check at end of Aug 2012

Sheeri Cabral [:sheeri]

Assignee

Comment 2

•

12 years ago

As of Fri Aug 31st, increases: runs: about 700M (About 40%) runs_notes: 2M (about 15%) builders_history: 7M (350%) builders: 1M (about 8%) runs_logs: 23G (about 45%) There's 100G still left on the machine, so it's not necessary to do monthly, but probably a good idea to try. generic1: [root@generic1.db.phx1 tbpl_mozilla_org]# ls -lh *.ibd -rw-rw---- 1 mysql mysql 1.0M Aug 31 00:47 bugscache.ibd -rw-rw---- 1 mysql mysql 9.0M Aug 31 02:06 builders_history.ibd -rw-rw---- 1 mysql mysql 14M Aug 31 10:10 builders.ibd -rw-rw---- 1 mysql mysql 1.5G Aug 31 10:10 runs.ibd -rw-rw---- 1 mysql mysql 73G Aug 31 10:12 runs_logs.ibd -rw-rw---- 1 mysql mysql 17M Aug 31 10:09 runs_notes.ibd generic2: -rw-rw---- 1 mysql mysql 112K Feb 21 2012 bugscache.ibd -rw-rw---- 1 mysql mysql 9.0M Aug 30 21:48 builders_history.ibd -rw-rw---- 1 mysql mysql 14M Aug 31 10:10 builders.ibd -rw-rw---- 1 mysql mysql 1.5G Aug 31 10:10 runs.ibd -rw-rw---- 1 mysql mysql 73G Aug 31 10:10 runs_logs.ibd -rw-rw---- 1 mysql mysql 17M Aug 31 10:09 runs_notes.ibd

Sheeri Cabral [:sheeri]

Assignee

Updated

•

12 years ago

Whiteboard: Check at end of Aug 2012 → Check at end of Sep 2012

Ed Morley [:emorley]

Comment 3

•

12 years ago

We now have a bugscache table, which has rows deleted during the 5 min cron, so would presumably benefit from an OPTIMIZE periodically. Also, for builders_history I would like us to purge old history entries, even if the builder to which they apply still exists; so when we get a UI for the history entries, the contents is kept manageable. This gives a revised checklist of: 0) Delete old runs: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); 1) Delete rows from the builders table where the buildername is now redundant: DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL; 2) Delete builders_history entries older than 6 months: DELETE FROM builders_history WHERE date < ( NOW( ) - INTERVAL 6 MONTH ); 3) Defrag the tables: OPTIMIZE TABLE runs; OPTIMIZE TABLE runs_notes; OPTIMIZE TABLE runs_logs; OPTIMIZE TABLE builders; OPTIMIZE TABLE builders_history; OPTIMIZE TABLE bugscache; (And just to confirm, unlike the first purge, we definitely want subsequent ones to use 1 month at step #0 not 2 months)

Ed Morley [:emorley]

Comment 4

•

12 years ago

My apologies, we've had a chat about the builders_history deletions, and decided not to do them after all. We still would like to OPTIMIZE bugscache, per comment 3. (The only change now, since comment 0). This gives the revised checklist as: 0) Delete old runs: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); 1) Delete rows from the builders table where the buildername is now redundant: DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.buildername IS NULL; 2) Defrag the tables: OPTIMIZE TABLE runs; OPTIMIZE TABLE runs_notes; OPTIMIZE TABLE runs_logs; OPTIMIZE TABLE builders; OPTIMIZE TABLE builders_history; OPTIMIZE TABLE bugscache;

Sheeri Cabral [:sheeri]

Assignee

Comment 5

•

12 years ago

OK, so should we do this monthly, then? I think the end of the month is a bad target because that would fall into "goals" territory every 3 months. Maybe on or about the 15th of every month?

Ed Morley [:emorley]

Comment 6

•

12 years ago

Sounds good, thank you :-)

Ed Morley [:emorley]

Comment 7

•

12 years ago

Just as a heads up: Rate of DB growth will likely increase now that bug 718632 has landed (was on tbpl-dev as of ~2012-09-04 and prod as of ~2012-09-13), since we now prefetch logs for failures, whereas previously some would have not been opened and thus not fetched from ftp.mozilla.org and stuffed into the DB.

Sheeri Cabral [:sheeri]

Assignee

Comment 8

•

12 years ago

OK, great. I can check size on Friday and we can work on purging next week?

Ed Morley [:emorley]

Comment 9

•

12 years ago

Sounds good :-)

Sheeri Cabral [:sheeri]

Assignee

Comment 10

•

12 years ago

acknowledging I'm a slacker. I may not get to this tomorrow, too. I can try to do this during next week's webops meetup in SF.

Dustin J. Mitchell [:dustin] (he/him)

Comment 11

•

12 years ago

Would partitioning help here?

Ed Morley [:emorley]

Comment 12

•

12 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #11) > Would partitioning help here? Bug 703967 comment 9 and bug 766524 comment 2 decided it wouldn't be viable due to the schema, aiui.

Sheeri Cabral [:sheeri]

Assignee

Comment 13

•

12 years ago

We should probably do this for the dev server, too, since tbpl_allizom_org is 64G. Please let me know if doing the purge on dev would pose a problem?

Arpad Borsos [:Swatinem]

Comment 14

•

12 years ago

I don’t think so. Please go ahead. Also due to the prefetching, the dev database will probably grow equally fast as the production one...

Sheeri Cabral [:sheeri]

Assignee

Comment 15

•

12 years ago

Awesome. We should definitely do both, then :D I'll keep the bug updated.

Sheeri Cabral [:sheeri]

Assignee

Comment 16

•

12 years ago

Used comment 4 to populate https://mana.mozilla.org/wiki/display/SYSADMIN/How+to+purge+tbpl+data I'll start this now for dev2.

Sheeri Cabral [:sheeri]

Assignee

Comment 17

•

12 years ago

On dev2, tbpl_dev_allizom_org: mysql> use tbpl_dev_allizom_org Reading table information for completion of table and column names You can turn off this feature to get a quicker startup with -A Database changed mysql> DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH ); Query OK, 2968963 rows affected (5 min 17.79 sec) mysql> mysql> DELETE builders.* from builders LEFT JOIN runs USING (buildername) WHERE runs.bldername IS NULL; Query OK, 3284 rows affected (5.47 sec) starting to optimize the tables now.

Sheeri Cabral [:sheeri]

Assignee

Comment 18

•

12 years ago

dev2 purge of tbpl_dev_allizom_org is done, tbpl_allizom_org doesn't have anything in it. I have put dev2 back into the load balancer, will do dev1 tomorrow.

Sheeri Cabral [:sheeri]

Assignee

Comment 19

•

12 years ago

dev1 purge is done, and dev1 is the master again (dev2 is a slave and set as read_only). I will work on generic2 tomorrow, and then we'll work on generic1.

Ed Morley [:emorley]

Comment 20

•

12 years ago

Thank you :-)

Sheeri Cabral [:sheeri]

Assignee

Comment 21

•

12 years ago

Didn't get to generic1/2 today.

Sheeri Cabral [:sheeri]

Assignee

Comment 22

•

12 years ago

Due to foreign keys, it would be best to do the deletes first, which will replicate, and then take the servers offline one at a time to defrag. I'm working on the deletes first - they'll take about an hour and a half, because we're doing this: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH limit 2500; do sleep(1); Query OK, 2500 rows affected, 1 warning (4.69 sec) (the warning is that the statement may not be safe for replication, because we're doing a delete with a limit, but generic2 has the same amount of data, so we're good.

Sheeri Cabral [:sheeri]

Assignee

Comment 23

•

12 years ago

Done this on the builders table too: delete from builders where buildername not in (select buildername from runs) limit 100; (over and over until it was complete).

Sheeri Cabral [:sheeri]

Assignee

Comment 24

•

12 years ago

we should do another purge at the end of Oct.

Whiteboard: Check at end of Sep 2012 → Check at end of Oct 2012

Sheeri Cabral [:sheeri]

Assignee

Comment 25

•

12 years ago

Starting the purge process on the dev servers, by doing this over and over on the dev master: DELETE FROM runs WHERE endtime < ( NOW( ) - INTERVAL 1 MONTH limit 2500;

Sheeri Cabral [:sheeri]

Assignee

Comment 26

•

12 years ago

Optimizing the tables on dev2, without replicating to dev1. dev1 is currently serving rw and ro hits, dev2 isn't serving any traffic.

Sheeri Cabral [:sheeri]

Assignee

Comment 27

•

12 years ago

dev2 is down to 44G, dev1 is out of the load balancer and optimizing tables as well (started out as 84G on dev1).

Sheeri Cabral [:sheeri]

Assignee

Comment 28

•

12 years ago

dev1 is purged and back as the read/write db, and dev2 is back as the read-only db. Both reduced in size to about half, from 84G to 44G. I am deleting on tbpl_mozilla_org now, and after the deletes are done I will work on the defragmentation.

Sheeri Cabral [:sheeri]

Assignee

Comment 29

•

12 years ago

Defragmenting the production backup right now. Will take generic2 out of the load balancer and start defrag'ing that too.

Sheeri Cabral [:sheeri]

Assignee

Comment 30

•

12 years ago

pre-defrag on generic2: 672K bugscache.ibd 9.0M builders_history.ibd 15M builders.ibd 2.0G runs.ibd 139G runs_logs.ibd 20M runs_notes.ibd

Sheeri Cabral [:sheeri]

Assignee

Comment 31

•

12 years ago

generic2 was finished on Friday, here's the post-defrag: 352K bugscache.ibd 9.0M builders_history.ibd 14M builders.ibd 529M runs.ibd 42G runs_logs.ibd 13M runs_notes.ibd Freed up almost 100G.

Whiteboard: Check at end of Oct 2012 → Check at end of Jan 2012

Sheeri Cabral [:sheeri]

Assignee

Comment 32

•

12 years ago

generic1 pre-defrag: mysql> \! df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg00-root 261G 165G 84G 67% / tmpfs 5.9G 0 5.9G 0% /dev/shm /dev/sda1 504M 88M 391M 19% /boot

Ed Morley [:emorley]

Comment 33

•

12 years ago

Thank you for keeping on top of this :-)

Sheeri Cabral [:sheeri]

Assignee

Comment 34

•

12 years ago

Generic1 done, 98G freed up: mysql> \! df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg00-root 261G 67G 182G 27% / tmpfs 5.9G 0 5.9G 0% /dev/shm /dev/sda1 504M 88M 391M 19% /boot

Ed Morley [:emorley]

Updated

•

12 years ago

Blocks: 835608

Sheeri Cabral [:sheeri]

Assignee

Updated

•

12 years ago

Whiteboard: Check at end of Jan 2012 → Check at end of Feb 2013

Sheeri Cabral [:sheeri]

Assignee

Comment 35

•

12 years ago

Working on a purge now, we should have done it about 2 weeks ago. We were reminded by disk space on production paging us.

Whiteboard: Check at end of Feb 2013 → Check at end of Mar 2013

Ed Morley [:emorley]

Comment 36

•

12 years ago

TBPL is pretty slow at the moment & getting a few "service unavailables". Presuming it's the data purge? Could we throttle it back a tiny bit? :-)

Sheeri Cabral [:sheeri]

Assignee

Comment 37

•

12 years ago

It was throttled back soon after this message, sorry I forgot to write about it. Things should be more than fine now. (the optimization took a few hours)

Ed Morley [:emorley]

Comment 38

•

12 years ago

Thank you :-)

Sheeri Cabral [:sheeri]

Assignee

Comment 39

•

12 years ago

We will be failing over to generic2 now, the defrag has finished there, and we will defrag on generic1.

Sheeri Cabral [:sheeri]

Assignee

Comment 40

•

12 years ago

generic cluster has been failed over to use generic2 as the master now. Doing maintenance on generic1, will let you know when maintenance is done and we fail back.

Sheeri Cabral [:sheeri]

Assignee

Comment 41

•

12 years ago

The maintenance is over, waiting for the all-clear from SREs to switch back.

Sheeri Cabral [:sheeri]

Assignee

Comment 42

•

12 years ago

switching back now.

Sheeri Cabral [:sheeri]

Assignee

Comment 43

•

12 years ago

generic cluster is back to normal now, this defrag is over for March.

Ed Morley [:emorley]

Comment 44

•

12 years ago

Thanks Sheeri :-)

Ed Morley [:emorley]

Comment 45

•

12 years ago

We'll probably need another purge here soon, though presume it may now happen via the mechanism in bug 835608. However, please may it wait until bug 868356 has been completed - thank you :-)

Sheeri Cabral [:sheeri]

Assignee

Comment 46

•

12 years ago

We did a defrag last week, and since then, we have an automated script to do nightly deletes. We don't need to reclaim space with defrags any more, because the space is re-used by MySQL, and it's only one day of space that is in flux at any given time. Special thanks to Brandon Johnson (:cyborgshadow) who did the heavy lifting on the scripting.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Whiteboard: Check at end of Mar 2013

Ed Morley [:emorley]

Comment 47

•

12 years ago

That's great - thank you :-) Just to double check - the script runs at 3am pacific right?

Sheeri Cabral [:sheeri]

Assignee

Comment 48

•

12 years ago

It's set to run at 2 am local time, which is UTC. So it's at 7 pm Pacific (if my calculations are correct). Would you like us to change it to 3 am Pacific? If so, that's 10 am UTC (http://www.worldtimebuddy.com/utc-to-pdt-converter)

Ed Morley [:emorley]

Comment 49

•

12 years ago

I'm sure that time will be fine - I asked mainly so that if we saw any daily perf issues we knew whether they were related. Thank you :-)

Brandon Johnson [:cyborgshadow]

Comment 50

•

12 years ago

For the record, in both production and dev, this runs in less than 5 seconds daily. Given the size of the data it's touching, I'd say that's pretty good, and wouldn't expect a performance hit from it. :)

Ed Morley [:emorley]

Comment 51

•

12 years ago

Ah, worrying over nothing! :-)

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → Data & BI Services Team

You need to log in before you can comment on or make changes to this bug.