Closed Bug 838494 Opened 12 years ago Closed 12 years ago

Replication for the bugzilla_allizom_org database is not working

Categories

(Data & BI Services Team :: DB: MySQL, task)

x86
macOS
task
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glob, Assigned: bjohnson)

database replication for the allizom.bugzilla.org database is not working. i don't have visibility of the master database host or name; however, the slave hostname is db-bugs-stage-ro and the database is bugzilla_allizom_org.
What leads you to believe it's not replicating? I've checked the bugzilla staging servers and the "bugzilla_allizom_org" database is replicating successfully.
Assignee: server-ops-database → bjohnson
A few more details: currently the hosts for the new bugzilla_allizom_org database (bugzilla1.stage.db.scl3.mozilla.com and bugzilla2) are replicating each other. Is there some expectation that they'd be mirroring the current stage?
hrm, it's started working again now. sorry about the bug spam.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
i'm seeing it again, reopening. this time i'm not going to change any settings, so you can see it in action :) visit https://bugzilla.allizom.org/show_bug.cgi?id=832972. if you're logged in, the bug works fine. if you're not logged in, you're told that the bug number isn't valid. so i asked ashish to run | select * from bugs where bug_id=832972 | on bugzilla1.stage.db.scl3.mozilla.com and bugzilla2.stage.db.scl3.mozilla.com. bugzilla1 returned the row. bugzilla2 returned zero rows.
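The check described above (the same query run against both hosts) could be scripted roughly as follows. This is only a sketch under stated assumptions: in-memory SQLite databases stand in for bugzilla1 and bugzilla2, the helper names `make_db` and `row_exists` are hypothetical, and the neighbouring bug IDs are illustrative. Against the real hosts a MySQL client connection would replace `sqlite3`.

```python
import sqlite3

def make_db(bug_ids):
    """Build an in-memory stand-in for one database host (assumption: not real MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO bugs (bug_id) VALUES (?)", [(b,) for b in bug_ids])
    return conn

def row_exists(conn, bug_id):
    """Return True if the bugs table on this connection has the given bug_id."""
    cur = conn.execute("SELECT 1 FROM bugs WHERE bug_id = ?", (bug_id,))
    return cur.fetchone() is not None

# Illustrative data: the first host has bug 832972, the second does not.
hosts = {
    "bugzilla1": make_db([832970, 832971, 832972]),
    "bugzilla2": make_db([832970, 832971]),
}

result = {name: row_exists(conn, 832972) for name, conn in hosts.items()}
print(result)
```

A mismatch between the two values is the symptom reported in this comment: the row is present on one host and absent on the other.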
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Summary: database replication for the allizom.bugzilla.org database is not working → Replication for the bugzilla_allizom_org database is not working
So, this failure is due to the fact that bug ID 832972 apparently doesn't exist on the 2nd stage server, but does exist on the first. I checked all the binlogs for the last 10 days (on both servers), and it wasn't created (or deleted) within that period, which tells me that this disparity has existed for quite some time. A refresh should solve this problem. Want me to proceed, or is there something we should dig further into here?
Additionally, my LDAP account has no privileges to see your example from above, glob.
(In reply to Brandon Johnson [:cyborgshadow] from comment #5)
> A refresh should solve this problem. Want me to proceed, or is there
> something we should dig further into here?
We should dig. This copy was created from current production data (aka bugzilla1.db.scl3). There should be no disparity in that data whatsoever or we have bigger issues. Assuming that went on without an issue and if the bug is there on that server, how did it not make it to bugzilla2.stage.db.scl3 (which I assume was replicated off bugzilla1.stage.db.scl3 which in fact has the bug)?
If Brandon already checked the binary logs, I'm not sure what more we can dig into. We can look at system logs on bugzilla2.stage to see if there were any disk errors or anything. IIRC there aren't any other slaves in staging (but if there are we should check to see if they're replicating fine). Let's also make sure we're running checksums on stage and paging for them, if we're not already doing so. This will catch any future errors closer to the time when we can actually debug what's going on.
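The thread does not say which checksum tooling was meant here (MySQL has a `CHECKSUM TABLE` statement, for example). As a rough illustration of the idea only, the sketch below hashes every row of a table in primary-key order on each side and compares the digests. It assumes in-memory SQLite stand-ins for the hosts; the helper names `make_db` and `table_checksum` are hypothetical and the row contents are made up.

```python
import hashlib
import sqlite3

def make_db(rows):
    """Build an in-memory stand-in for one database host (assumption: not real MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, short_desc TEXT)")
    conn.executemany("INSERT INTO bugs VALUES (?, ?)", rows)
    return conn

def table_checksum(conn):
    """Hash every row in primary-key order so identical data yields identical digests."""
    digest = hashlib.sha256()
    for row in conn.execute("SELECT bug_id, short_desc FROM bugs ORDER BY bug_id"):
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()

# Illustrative rows only; the descriptions are invented for this example.
primary = make_db([(1, "first"), (2, "second"), (3, "third")])
replica_ok = make_db([(1, "first"), (2, "second"), (3, "third")])
replica_stale = make_db([(1, "first"), (2, "second")])

in_sync = table_checksum(primary) == table_checksum(replica_ok)
drifted = table_checksum(primary) != table_checksum(replica_stale)
print(in_sync, drifted)
```

Run on a schedule and wired to paging, a comparison like this would flag a missing row soon after it goes missing rather than weeks later.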
There's a fair bit more I can dig into. I've already found that there are 513 bugs in stage1 that don't exist in stage2. These bugs are all between the IDs of 832950 and 833505 and were created after January 21st. Digging more...
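A count like the one above can be obtained by comparing the sets of bug IDs on the two servers. The sketch below is an assumption about the approach, not the query that was actually run: it uses in-memory SQLite stand-ins, a hypothetical helper `missing_on_replica`, and a handful of illustrative IDs rather than the real 513.

```python
import sqlite3

def make_db(bug_ids):
    """Build an in-memory stand-in for one stage server (assumption: not real MySQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO bugs (bug_id) VALUES (?)", [(b,) for b in bug_ids])
    return conn

def missing_on_replica(primary, replica):
    """Return the sorted bug IDs present on the primary but absent on the replica."""
    primary_ids = {row[0] for row in primary.execute("SELECT bug_id FROM bugs")}
    replica_ids = {row[0] for row in replica.execute("SELECT bug_id FROM bugs")}
    return sorted(primary_ids - replica_ids)

# Illustrative ID ranges: stage2 stops three bugs short of stage1.
stage1 = make_db(range(832968, 832975))
stage2 = make_db(range(832968, 832972))

missing = missing_on_replica(stage1, stage2)
print(len(missing), missing[0], missing[-1])
```

Printing the count together with the lowest and highest missing ID gives the same kind of summary as this comment: how many bugs are absent and the ID range they fall in.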
This appears to be a legit replication gap. What I see here: bugzilla1.stage.db has no new bugs between 2013-01-22 11:24:29 and 2013-02-05 22:52:48 (the creation_ts values in the bugs table for bugs 833483 and 833484 demonstrate this). bugzilla2.stage.db has a gap between '2013-01-21 07:08:38' and '2013-02-05 22:52:48' (aka bug_id 832971 and 833484). I think the solution here is that BOTH stage servers need a refresh, as they both have a rather large gap; stage 2's gap is just a day larger than stage 1's. Thanks!
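The gap comparison above can be reproduced from the creation_ts values alone. The sketch below assumes only the three boundary timestamps quoted in this comment (and assumes the 2013-01-21 one is also present on stage1); the helper name `largest_gap` is hypothetical.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def largest_gap(timestamps):
    """Return (start, end) of the widest interval between consecutive timestamps.

    Expects at least two timestamps.
    """
    ordered = sorted(datetime.strptime(t, FMT) for t in timestamps)
    pairs = zip(ordered, ordered[1:])
    return max(pairs, key=lambda p: p[1] - p[0])

# Boundary creation_ts values taken from the comment; real tables hold many more rows.
stage1_ts = ["2013-01-21 07:08:38", "2013-01-22 11:24:29", "2013-02-05 22:52:48"]
stage2_ts = ["2013-01-21 07:08:38", "2013-02-05 22:52:48"]

s1_start, s1_end = largest_gap(stage1_ts)
s2_start, s2_end = largest_gap(stage2_ts)

# How much longer the stage2 gap is than the stage1 gap.
extra = (s2_end - s2_start) - (s1_end - s1_start)
print(s1_start, s2_start, extra)
```

With these inputs the stage2 gap starts at the 01-21 timestamp and the stage1 gap at the 01-22 one, which is the roughly one-day difference described above.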
So as long as bugzilla1.db.scl3 is in sync with the master DBs in phx1, we _might_ be able to go ahead and refresh the DBs. Need an ack from glob as this might need a checksetup.pl run in the end...if we have to re-sync.
Flags: needinfo?(glob)
refreshing the DBs sounds like a good idea. yes, we'll need a checksetup.pl run after the 4.0 database is deployed. that's probably a good thing because we didn't get the end-to-end timings last time we upgraded the db. you can perform this at anytime.
Flags: needinfo?(glob)
After digging into this, the best course of action is just to hot copy stage1 over stage2. The gap from 01-22 to 02-05 is because the stage system wasn't in use at the time. The gap on stage 2 from 01-21 to 01-22 is because apparently people were trying to use a system we had marked as down for maintenance (and were able to successfully use it). Hot copy refresh fixes the gap from 01-21 to 01-22 and resolves this issue. In progress now.
fox2mike said we need glob's approval. not starting now. waiting for approval.
(In reply to Brandon Johnson [:cyborgshadow] from comment #14)
> fox2mike said we need glob's approval. not starting now. waiting for
> approval.
He gave you the go ahead in comment #12: "you can perform this at anytime."
(In reply to Brandon Johnson [:cyborgshadow] from comment #13)
> The gap on stage 2 from 01-21 to 01-22 is because apparently people were
> trying to use a system we had marked as down for maintenance (and were able
> to successfully use it).
how are we able to determine if a system has been marked as down for maintenance?
You should be made aware of it via communication, and usually it's taken offline or the maintenance is made invisible. We've improved our communication recently to reach out to the necessary teams whenever databases are scheduled to be offline. In this scenario, we were moving a database and changing its name simultaneously (a rare occasion), and I should have taken extra measures to make sure there were no connections to it during the maintenance.
this finished out yesterday evening after I popped off. Please test and let me know if you have any questions or concerns. Everything should be in sync now.
(In reply to Brandon Johnson [:cyborgshadow] from comment #18)
> this finished out yesterday evening after I popped off. Please test and let
> me know if you have any questions or concerns. Everything should be in sync
> now.
Is this a fresh copy from production?
No...It's just a hot copy of stage1 to stage2, aka a fix of what was missing on stage2.
(In reply to Brandon Johnson [:cyborgshadow] from comment #20)
> No...It's just a hot copy of stage1 to stage2, aka a fix of what was missing
> on stage2.
Cool, thanks! glob, can you verify everything is fine and close this bug? We won't be doing checksetup.pl again after all, since this is just a copy of 1 (rw) to 2 (ro). Thanks!
Flags: needinfo?(glob)
all looks good now, thanks!
Status: REOPENED → RESOLVED
Closed: 12 years ago
Flags: needinfo?(glob)
Resolution: --- → FIXED
Product: mozilla.org → Data & BI Services Team