838494 - Replication for the bugzilla_allizom_org database is not working

Reporter

Description

•

12 years ago

database replication for the bugzilla.allizom.org database is not working. i don't have visibility of the master database host or name; however, the slave hostname is db-bugs-stage-ro, database bugzilla_allizom_org.

Brandon Johnson [:cyborgshadow]

Assignee

Comment 1

•

12 years ago

What leads you to believe it's not replicating? I've checked the bugzilla staging servers and the "bugzilla_allizom_org" database is replicating successfully.

Assignee: server-ops-database → bjohnson

Sheeri Cabral [:sheeri]

Comment 2

•

12 years ago

A few more details: currently the hosts for the new bugzilla_allizom_org database (bugzilla1.stage.db.scl3.mozilla.com and bugzilla2) are replicating from each other. Is there some expectation that they'd be mirroring the current stage?

:glob ✱

Reporter

Comment 3

•

12 years ago

hrm, it's started working again now. sorry about the bug spam.

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → WORKSFORME

:glob ✱

Reporter

Comment 4

•

12 years ago

i'm seeing it again, reopening. this time i'm not going to change any settings, so you can see it in action :)

visit https://bugzilla.allizom.org/show_bug.cgi?id=832972

if you're logged in, the bug works fine. if you're not logged in, you're told that the bug number isn't valid.

so i asked ashish to run
| select * from bugs where bug_id=832972 |
on bugzilla1.stage.db.scl3.mozilla.com and bugzilla2.stage.db.scl3.mozilla.com.

bugzilla1 returned the row.
bugzilla2 returned zero rows.
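The check described in this comment (run the same lookup on both hosts and compare the results) can be sketched as below. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for bugzilla1 and bugzilla2, the `bugs` table is reduced to two columns, and the helper names `make_db` and `row_exists` and the sample rows are hypothetical.

```python
import sqlite3


def make_db(bug_ids):
    """Build an in-memory stand-in for one stage host (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, short_desc TEXT)")
    conn.executemany(
        "INSERT INTO bugs VALUES (?, ?)",
        [(bug_id, f"bug {bug_id}") for bug_id in bug_ids],
    )
    return conn


def row_exists(conn, bug_id):
    """Return True if the bugs table on this connection has the given bug_id."""
    row = conn.execute("SELECT 1 FROM bugs WHERE bug_id = ?", (bug_id,)).fetchone()
    return row is not None


# Assumption: stage1 has the bug, stage2 does not, mirroring the report.
stage1 = make_db([832971, 832972])
stage2 = make_db([832971])

presence = {
    name: row_exists(conn, 832972)
    for name, conn in [("bugzilla1", stage1), ("bugzilla2", stage2)]
}
print(presence)
```

A mismatch between the two values is the symptom reported here: the row is visible through one host but not the other.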

Status: RESOLVED → REOPENED

Resolution: WORKSFORME → ---

Shyam Mani [:fox2mike]

Updated

•

12 years ago

Summary: database replication for the allizom.bugzilla.org database is not working → Replication for the bugzilla_allizom_org database is not working

Brandon Johnson [:cyborgshadow]

Assignee

Comment 5

•

12 years ago

So, this failure is due to the fact that apparently bug ID 832972 doesn't exist in the 2nd stage server, but exists in the first. I checked all the binlogs for the last 10 days (on both servers), and it wasn't created (or deleted) within the last 10 days, which shows me that this disparity has existed for quite some time. A refresh should solve this problem. Want me to proceed, or is there something we should dig further into here?

Brandon Johnson [:cyborgshadow]

Assignee

Comment 6

•

12 years ago

Additionally, my LDAP account has no privileges to see your example from above glob.

Shyam Mani [:fox2mike]

Comment 7

•

12 years ago

(In reply to Brandon Johnson [:cyborgshadow] from comment #5)
> A refresh should solve this problem. Want me to proceed, or is there
> something we should dig further into here?

We should dig. This copy was created from current production data (aka bugzilla1.db.scl3). There should be no disparity in that data whatsoever or we have bigger issues.

Assuming that went on without an issue and if the bug is there on that server, how did it not make it to bugzilla2.stage.db.scl3 (which I assume was replicated off bugzilla1.stage.db.scl3 which in fact has the bug)?

Sheeri Cabral [:sheeri]

Comment 8

•

12 years ago

If Brandon already checked the binary logs, I'm not sure what more we can dig into. We can look at system logs on bugzilla2.stage to see if there were any disk errors or anything. IIRC there aren't any other slaves in staging (but if there are we should check to see if they're replicating fine). Let's also make sure we're running checksums on stage and paging for them, if we're not already doing so. This will catch any future errors closer to the time when we can actually debug what's going on.
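To illustrate the idea behind running checksums, here is a minimal sketch that hashes a table's rows in a fixed order on two databases and compares the digests. It is not the tooling used on these servers: in-memory SQLite databases stand in for the stage hosts, and `make_db`, `table_checksum`, and the sample rows are hypothetical. On real MySQL hosts a dedicated tool or statement would normally do this job.

```python
import hashlib
import sqlite3


def make_db(rows):
    """Build an in-memory stand-in for one host (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, short_desc TEXT)")
    conn.executemany("INSERT INTO bugs VALUES (?, ?)", rows)
    return conn


def table_checksum(conn, table, key):
    """Hash every row of `table`, ordered by `key`, into one hex digest.

    `table` and `key` are trusted identifiers in this sketch; they are
    interpolated into the SQL text and must not come from user input.
    """
    digest = hashlib.sha256()
    for row in conn.execute(f"SELECT * FROM {table} ORDER BY {key}"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


rows = [(1, "first"), (2, "second"), (3, "third")]
primary = make_db(rows)
replica = make_db(rows[:2])  # assumption: the replica is missing one row

in_sync_before = table_checksum(primary, "bugs", "bug_id") == table_checksum(
    replica, "bugs", "bug_id"
)

# Simulate a refresh that restores the missing row, then compare again.
replica.execute("INSERT INTO bugs VALUES (?, ?)", rows[2])
in_sync_after = table_checksum(primary, "bugs", "bug_id") == table_checksum(
    replica, "bugs", "bug_id"
)
print(in_sync_before, in_sync_after)
```

Running a comparison like this on a schedule, and alerting when the digests differ, is what would surface a divergence close to the time it happens.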

Brandon Johnson [:cyborgshadow]

Assignee

Comment 9

•

12 years ago

There's a fair bit more I can dig into. I've already found that there are 513 bugs in stage1 that don't exist in stage2. These bugs are all between the IDs of 832950 and 833505 and were created after January 21st. Digging more...
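A count like this could be produced in several ways; one possibility, assuming the `bug_id` lists from both hosts can be pulled into one place, is a plain set difference. The function name `missing_on_replica` and the small ID lists below are hypothetical and are not the data from the stage servers.

```python
def missing_on_replica(primary_ids, replica_ids):
    """Summarize the IDs present on the primary but absent on the replica."""
    missing = sorted(set(primary_ids) - set(replica_ids))
    if not missing:
        return {"count": 0, "first": None, "last": None}
    return {"count": len(missing), "first": missing[0], "last": missing[-1]}


# Hypothetical sample: the replica lacks IDs 103 through 108.
primary_ids = range(100, 111)
replica_ids = [100, 101, 102, 109, 110]

summary = missing_on_replica(primary_ids, replica_ids)
print(summary)
```

The first and last missing IDs give the bounds of the affected range, which is the kind of summary quoted in this comment.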

Brandon Johnson [:cyborgshadow]

Assignee

Comment 10

•

12 years ago

This appears to be a legit replication gap. What I see here:

bugzilla1.stage.db has no new bugs between 2013-01-22 11:24:29 and 2013-02-05 22:52:48. (creation_ts in bugs table from bug 833483 and 833484 demonstrate this)

bugzilla2.stage.db has a gap between '2013-01-21 07:08:38' and '2013-02-05 22:52:48' (aka bug_id 832971 and 833484).

I think the solution here is that BOTH stage servers need a refresh as they both have a rather large gap, 2's gap is just a day larger than 1.

Thanks!
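A gap of this kind can be spotted by walking the bugs in `bug_id` order and flagging consecutive rows whose `creation_ts` values are far apart. The sketch below assumes the `(bug_id, creation_ts)` pairs have already been fetched. The timestamps for bugs 833483 and 833484 are the ones quoted in this comment; the row for 833482, the function name `find_gaps`, and the one-day threshold are invented for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(rows, threshold):
    """Return (prev_id, next_id, delta) for consecutive rows further apart than threshold."""
    ordered = sorted(rows)  # tuples sort by bug_id first
    gaps = []
    for (prev_id, prev_ts), (next_id, next_ts) in zip(ordered, ordered[1:]):
        delta = next_ts - prev_ts
        if delta > threshold:
            gaps.append((prev_id, next_id, delta))
    return gaps


FMT = "%Y-%m-%d %H:%M:%S"
rows = [
    (833482, datetime.strptime("2013-01-22 11:20:00", FMT)),  # invented row
    (833483, datetime.strptime("2013-01-22 11:24:29", FMT)),  # from the comment
    (833484, datetime.strptime("2013-02-05 22:52:48", FMT)),  # from the comment
]

gaps = find_gaps(rows, timedelta(days=1))
for prev_id, next_id, delta in gaps:
    print(f"gap between bug {prev_id} and bug {next_id}: {delta}")
```

Running the same scan against each host and comparing where the gaps start is one way to see that the two servers diverge at different points.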

Shyam Mani [:fox2mike]

Comment 11

•

12 years ago

So as long as bugzilla1.db.scl3 is in sync with the master DBs in phx1, we _might_ be able to go ahead and refresh the DBs. Need an ack from glob as this might need a checksetup.pl run in the end...if we have to re-sync.

Flags: needinfo?(glob)

:glob ✱

Reporter

Comment 12

•

12 years ago

refreshing the DBs sounds like a good idea. yes, we'll need a checksetup.pl run after the 4.0 database is deployed. that's probably a good thing because we didn't get the end-to-end timings last time we upgraded the db. you can perform this at anytime.

Flags: needinfo?(glob)

Brandon Johnson [:cyborgshadow]

Assignee

Comment 13

•

12 years ago

After digging into this, the best course of action is just to hot copy stage1 over stage2. The gap from 01-22 to 02-05 is because the stage system wasn't in use at the time. The gap on stage2 from 01-21 to 01-22 is because apparently people were trying to use a system we had marked as down for maintenance (and were able to successfully use it). A hot copy refresh fixes the gap from 01-21 to 01-22 and resolves this issue. In progress now.

Brandon Johnson [:cyborgshadow]

Assignee

Comment 14

•

12 years ago

fox2mike said we need glob's approval. not starting now. waiting for approval.

Shyam Mani [:fox2mike]

Comment 15

•

12 years ago

(In reply to Brandon Johnson [:cyborgshadow] from comment #14)
> fox2mike said we need glob's approval. not starting now. waiting for
> approval.

He gave you the go ahead in comment #12 "you can perform this at anytime."

:glob ✱

Reporter

Comment 16

•

12 years ago

(In reply to Brandon Johnson [:cyborgshadow] from comment #13)
> The gap on stage 2 from 01-21 to 01-22 is because apparently people were
> trying to use a system we had marked as down for maintenance (and were able
> to successfully use it).

how are we able to determine if a system has been marked as down for maintenance?

Brandon Johnson [:cyborgshadow]

Assignee

Comment 17

•

12 years ago

You should be made aware of it via communication, and usually it's taken offline or the maintenance is made invisible. We've improved our communication recently to reach out to the necessary teams whenever databases are scheduled to be offline. In this scenario, we were moving a database and changing its name simultaneously (a rare occasion) and I should have taken extra measures to make sure there were no connections to it during the maintenance.

Brandon Johnson [:cyborgshadow]

Assignee

Comment 18

•

12 years ago

This finished out yesterday evening after I popped off. Please test and let me know if you have any questions or concerns. Everything should be in sync now.

Shyam Mani [:fox2mike]

Comment 19

•

12 years ago

(In reply to Brandon Johnson [:cyborgshadow] from comment #18)
> this finished out yesterday evening after I popped off. Please test and let
> me know if you have any questions or concerns. Everything should be in sync
> now.

Is this a fresh copy from production?

Brandon Johnson [:cyborgshadow]

Assignee

Comment 20

•

12 years ago

No...It's just a hot copy of stage1 to stage2, aka a fix of what was missing on stage2.

Shyam Mani [:fox2mike]

Comment 21

•

12 years ago

(In reply to Brandon Johnson [:cyborgshadow] from comment #20)
> No...It's just a hot copy of stage1 to stage2, aka a fix of what was missing
> on stage2.

Cool, thanks!

glob, can you verify everything is fine and close this bug? We won't be doing checksetup.pl again after all, since this is just a copy of 1 (rw) to 2 (ro).

Thanks!

Flags: needinfo?(glob)

:glob ✱

Reporter

Comment 22

•

12 years ago

all looks good now, thanks!

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Flags: needinfo?(glob)

Resolution: --- → FIXED

Nobody; OK to take it and work on it

Updated

•

10 years ago

Product: mozilla.org → Data & BI Services Team

Categories

(Data & BI Services Team :: DB: MySQL, task)

Tracking

(Not tracked)

People

(Reporter: glob, Assigned: bjohnson)

Security

(public)
