Closed Bug 896812 Opened 11 years ago Closed 11 years ago

Need monitoring of links from releng machines to key data centers

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: ashish)

References

Details

(Whiteboard: [reit-nagios])

Every data center with RelEng machinery (including AWS regions) needs to notify effectively when a link to a key data center goes down. Key data centers are those which house the single instance of a resource needed for normal operations. I.e. loss of all connectivity to a key data center will close the trees. Examples are SCL3 (buildbot scheduler), SCL1 (slavealloc).
Update: PHX1 is also a key data center for releases (including nightlies & betas), as symbols and snippets are pushed there.
Assignee: server-ops → ashish
So to start with, we'd need a good way of knowing:

1) SCL3 <=> PHX1 health
2) SCL3 <=> SCL1 health
3) SCL3 <=> releng AWS health
(In reply to Shyam Mani [:fox2mike] from comment #2)
> So to start with, we'd need a good way of knowing :
>
> 3) SCL3 <=> releng AWS health

This is actually 2 independent regions that can't talk to each other, so:
3) SCL3 <=> releng.use1
4) SCL3 <=> releng.usw2
(In reply to Hal Wine [:hwine] from comment #3)
> (In reply to Shyam Mani [:fox2mike] from comment #2)
> > So to start with, we'd need a good way of knowing :
> >
> > 3) SCL3 <=> releng AWS health
>
> This is actually 2 independent regions that can't talk to each other, so:
> 3) SCL3 <=> releng.use1
> 4) SCL3 <=> releng.usw2

...and also:
5) SCL3 <=> releng.use2
(In reply to John O'Duinn [:joduinn] from comment #4)
> 5) SCL3 <=> releng.use2

5) SCL3 <=> releng.usw1 (there is no use2 for releng).
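For illustration, a minimal sketch of how per-link health checks like these could be expressed as Nagios service objects on the scl3 Nagios host, one per remote site or region. The template name, ping thresholds, intervals, contact group names, and host names below are assumptions, not the production configuration.

    # Hypothetical check of the scl3 <=> releng.use1 link, defined on the
    # scl3 Nagios host; check_ping args are "warn rta,loss%" / "crit rta,loss%".
    # generic-service is the stock sample-config template; group names are placeholders.
    define service {
        use                     generic-service
        host_name               nagios1.private.releng.use1.mozilla.com
        service_description     link health: scl3 -> releng.use1
        check_command           check_ping!250.0,20%!500.0,60%
        check_interval          5
        retry_interval          1
        max_check_attempts      3
        contact_groups          build,oncall
    }
    # Analogous services would cover scl3 -> phx1, scl3 -> scl1,
    # scl3 -> releng.usw1 and scl3 -> releng.usw2.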
Hal, since Bug 901784 is closed out, it would be awesome if we could get an AWS instance per region (ashish, please provide the specs) in the correct "private" subnet/vlan. They should be called:

nagios1.private.releng.usw1.mozilla.com
nagios1.private.releng.usw2.mozilla.com
nagios1.private.releng.use1.mozilla.com

They can have ashish's keys to start with and once we have flows, they can be puppetized. Thanks!

Ashish - Once the instances are up, please file a firewall flow request as you need it with netops.
Flags: needinfo?(hwine)
Catlee -- who can set up the AWS instances mentioned in comment 6? (once Ashish provides specs)
Flags: needinfo?(hwine)
Based on engops meeting this morning, ideally these will be:
- size small instances
- RHEL or CentOS based
- our choice of IP based on the subnets identified in bug 901784 comment 1

:rail -- is this sufficient information? If not, please needinfo ashish on any remaining questions
Flags: needinfo?(rail)
(In reply to Hal Wine [:hwine] from comment #8)
> Based on engops meeting this morning, ideally these will be:
> - size small instances
> - RHEL or CentOS based
> - our choice of IP based on the subnets identified in bug 901784 comment 1
>

- Yes, small size instances with about 1GB RAM should suffice
- They need to be RHEL and in fact, they will be puppetized by Infra Puppet
- Ideally private, which is where the other Nagios instances lie, will also help replicating ACLs

FWIW we currently have nagios1.private.euw1 running on EC2. I'll reuse that setup here as far as possible.

Thanks!
Depends on: 903224
To avoid morphing this bug more, moved the AWS instance request to bug 903224.

(In reply to Shyam Mani [:fox2mike] from comment #2)
> So to start with, we'd need a good way of knowing :
>
> 1) SCL3 <=> PHX1 health
> 2) SCL3 <=> SCL1 health

How are these 2 coming?
Flags: needinfo?(rail)
(In reply to Shyam Mani [:fox2mike] from bug 903224 comment #1)
> Once [AWS instances are] up, Ashish will have to get flows opened with
> Netops and then finish his setup in time to test all this during the Aug
> 24th treeclosing maint window.

Why can't we test this new setup with the phx1 <-> scl3 <-> scl1 instances setup?
Flags: needinfo?(shyam)
(In reply to Hal Wine [:hwine] from comment #11)
> Why can't we test this new setup with the phx1 <-> scl3 <-> scl1 instances
> setup?

I believe we already have this in place. For the Aug 24th test, I don't think we're going to kill SCL3 or PHX1, but we might be able to disrupt SCL1 connections and test. I'll defer to Ashish on this portion (on whether we already have this in place).
Flags: needinfo?(shyam) → needinfo?(ashish)
(In reply to Shyam Mani [:fox2mike] from comment #12)
> I believe we already have this in place. For the Aug 24th test, I don't
> think we're going to kill SCL3 or PHX1, but we might be able to disrupt SCL1
> connections and test. I'll defer to Ashish on this portion (on whether we
> already have this in place).

We can get unidirectional connectivity info from SCL3->SCL1 by pinging the admin server there (we already do that, and can increase the frequency of checks).
Flags: needinfo?(ashish)
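For illustration, a sketch of what bumping the frequency of that existing SCL3 -> SCL1 ping might look like as a Nagios service; the FQDN, intervals, thresholds, and contact group names are assumptions.

    # Hypothetical tightened check from the scl3 Nagios host to the scl1 admin server.
    define service {
        use                     generic-service
        host_name               admin1a.infra.scl1.mozilla.com   ; placeholder FQDN
        service_description     link health: scl3 -> scl1 (admin server ping)
        check_command           check_ping!250.0,20%!500.0,60%
        check_interval          1        ; every minute instead of the usual 5
        retry_interval          1
        max_check_attempts      5
        contact_groups          build,oncall
    }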
While Bug 903224 is being worked on, I have set alerts from these to go to
the oncall and build as well.

nagios-scl3 <-> nagios-phx1
nagios-releng <-> nagios-scl3
nagios-phx1 <-> nagios-releng
nagios-releng -> admin1[a,b].infra.scl1

Note: nagios-to-nagios pings are bi-directional
Status: NEW → ASSIGNED
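In practice, "bi-directional" here means each Nagios instance carries its own object for its peer, so whichever side survives a link failure still alerts. A sketch under that assumption, with placeholder host names, address, and group names:

    # On nagios-scl3: the phx1 peer as a monitored host (address is a placeholder;
    # generic-host and check-host-alive come from the stock sample config).
    define host {
        use             generic-host
        host_name       nagios1.private.phx1.mozilla.com
        address         10.8.75.10
        check_command   check-host-alive
        contact_groups  build,oncall
    }
    # On nagios-phx1, a mirror-image host object points back at the scl3
    # instance, which is what makes the nagios-to-nagios pings bi-directional.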
(In reply to Ashish Vijayaram [:ashish] from comment #14)
> While Bug 903224 is being worked on, I have set alerts from these to go to
> the oncall and build as well.

Question: by "alerts ... go to build" do you mean these alerts are:
- reported in irc channel #buildduty
- emailed to release@m.c
- both
- something else

Thanks! This is great news!
Flags: needinfo?(ashish)
(In reply to Hal Wine [:hwine] from comment #15)
> Question: by "alerts ... go to build" do you mean these alerts are:
> - reported in irc channel #buildduty
> - emailed to release@m.c
> - both
> - something else
>
> Thanks! This is great news!

Both. The current Nagios contactgroup "build" notifies #buildduty and sends out an email to release@m.c.
Flags: needinfo?(ashish)
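A sketch of how a contactgroup like "build" can fan out to both an IRC bot and a mailing list. The member contact names, the notify-by-irc command, and the release@example.com address are placeholders ("release@m.c" above is not expanded here); the notify-*-by-email commands and 24x7 timeperiod are the stock Nagios sample-config names.

    define contactgroup {
        contactgroup_name   build
        alias               RelEng buildduty
        members             buildduty-irc,release-list
    }

    # IRC side: a contact whose notification command hands alerts to whatever
    # bot relays them into #buildduty (notify-by-irc is a placeholder name).
    define contact {
        contact_name                    buildduty-irc
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u,r
        service_notification_options    w,u,c,r
        host_notification_commands      notify-by-irc
        service_notification_commands   notify-by-irc
    }

    # Email side: a contact that mails the release list.
    define contact {
        contact_name                    release-list
        email                           release@example.com   ; placeholder address
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u,r
        service_notification_options    w,u,c,r
        host_notification_commands      notify-host-by-email
        service_notification_commands   notify-service-by-email
    }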
Also, I shall do some basic tests to verify alerting on Monday and update this bug.
Not sure how this relates to smokeping - the question came up in bug 910818 comment 2. I'll let you and :casey figure out the right thing.
Flags: needinfo?(ashish)
All three instances are puppetised and basic configuration is done. While verifying, I found another flow that needs to be opened from the instances to admin1.scl3, which is their admin host and routes host checks through it. Bug 915636 has been filed; pending its closure, I shall verify and enable notifications on all 3 new instances.

04:58:36 -!- nagios-use1 <nagios-use@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
05:01:23 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
05:03:51 -!- nagios-usw2 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
Flags: needinfo?(ashish)
This is all set up! There was an actual alert that went off yesterday from nagios-releng:

21:47 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has quit (Ping timeout)
21:49 < nagios-releng> Mon 21:49:38 PDT [4672] nagios1.private.releng.usw1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
...
22:25 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
22:25 < nagios-releng> Mon 22:25:26 PDT [4697] nagios1.private.releng.usw1.mozilla.com is UP :PING OK - Packet loss = 0%, RTA = 10.98 ms

With the setup complete, I'm turning on notifications for the new instances. Please treat alerts as production.
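For illustration, turning notifications on in a puppetised Nagios setup like this typically amounts to flipping a single directive in the host (and service) definitions; the host name, address, template, and group names below are assumptions.

    # Hypothetical host object for one of the new instances; during setup it
    # could carry notifications_enabled 0, and flipping it to 1 makes alerts live.
    define host {
        use                     generic-host
        host_name               nagios1.private.releng.usw1.mozilla.com
        address                 10.132.48.10     ; placeholder address
        check_command           check-host-alive
        notifications_enabled   1
        contact_groups          build,oncall
    }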
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 1077284
Product: mozilla.org → mozilla.org Graveyard