Closed
Bug 896812
Opened 11 years ago
Closed 11 years ago
Need monitoring for releng machines to key data centers
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: hwine, Assigned: ashish)
References
Details
(Whiteboard: [reit-nagios])
Every data center with RelEng machinery (including AWS regions) needs monitoring that notifies effectively when a link to a key data center goes down.
Key data centers are those that house the single instance of a resource needed for normal operations, i.e. loss of all connectivity to a key data center will close the trees. Examples are SCL3 (buildbot scheduler) and SCL1 (slavealloc).
Reporter
Comment 1•11 years ago
update: PHX1 is also a key data center for releases (including nightly & betas) as symbols and snippets are pushed there.
Updated•11 years ago
Assignee: server-ops → ashish
Comment 2•11 years ago
So to start with, we'd need a good way of knowing :
1) SCL3 <=> PHX1 health
2) SCL3 <=> SCL1 health
3) SCL3 <=> releng AWS health
Reporter
Comment 3•11 years ago
(In reply to Shyam Mani [:fox2mike] from comment #2)
> So to start with, we'd need a good way of knowing :
>
> 3) SCL3 <=> releng AWS health
This is actually 2 independent regions that can't talk to each other, so:
3) SCL3 <=> releng.use1
4) SCL3 <=> releng.usw2
Comment 4•11 years ago
(In reply to Hal Wine [:hwine] from comment #3)
> (In reply to Shyam Mani [:fox2mike] from comment #2)
> > So to start with, we'd need a good way of knowing :
> >
> > 3) SCL3 <=> releng AWS health
>
> This is actually 2 independent regions that can't talk to each other, so:
> 3) SCL3 <=> releng.use1
> 4) SCL3 <=> releng.usw2
...and also:
5) SCL3 <=> releng.use2
Comment 5•11 years ago
(In reply to John O'Duinn [:joduinn] from comment #4)
> 5) SCL3 <=> releng.use2
5) SCL3 <=> releng.usw1 (there is no use2 for releng).
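The link list built up in comments 2 through 5 can be summarized as a hub-and-spoke table. This is a minimal sketch, assuming the site labels below are used purely as illustrative identifiers (they are not hostnames from this bug), and the function name `monitored_links` is hypothetical:

```python
# Hypothetical summary of the links discussed in comments 2-5.
# Site labels are illustrative identifiers, not real hostnames.
HUB = "SCL3"

# Comment 5 corrects comment 4: releng has usw1, not use2.
SPOKES = ["PHX1", "SCL1", "releng.use1", "releng.usw2", "releng.usw1"]


def monitored_links(hub, spokes):
    """Return one (hub, spoke) pair per link whose health needs monitoring."""
    return [(hub, spoke) for spoke in spokes]


if __name__ == "__main__":
    for number, (a, b) in enumerate(monitored_links(HUB, SPOKES), start=1):
        print(f"{number}) {a} <=> {b}")
```

Keeping the hub separate from the spokes reflects that every link under discussion terminates at SCL3; the AWS regions cannot talk to each other directly.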
Comment 6•11 years ago
Hal, since Bug 901784 is closed out, it would be awesome if we could get an AWS instance per region (ashish, please provide the specs) in the correct "private" subnet/vlan. They should be called :
nagios1.private.releng.usw1.mozilla.com
nagios1.private.releng.usw2.mozilla.com
nagios1.private.releng.use1.mozilla.com
They can have ashish's keys to start with and once we have flows, they can be puppetized.
Thanks!
Ashish - Once the instances are up, please file firewall flow requests with netops as you need them.
Flags: needinfo?(hwine)
Reporter
Comment 7•11 years ago
Catlee -- who can set up the AWS instances mentioned in comment 6? (once Ashish provides specs)
Flags: needinfo?(hwine)
Reporter
Comment 8•11 years ago
Based on engops meeting this morning, ideally these will be:
- size small instances
- RHEL or CentOS based
- our choice of IP based on the subnets identified in bug 901784 comment 1
:rail -- is this sufficient information? If not, please needinfo ashish on any remaining questions
Flags: needinfo?(rail)
Assignee
Comment 9•11 years ago
(In reply to Hal Wine [:hwine] from comment #8)
> Based on engops meeting this morning, ideally these will be:
> - size small instances
> - RHEL or CentOS based
> - our choice of IP based on the subnets identified in bug 901784 comment 1
>
- Yes, small size instances with about 1GB RAM should suffice
- They need to be RHEL and in fact, they will be puppetized by Infra Puppet
- Ideally in the private subnet, which is where the other Nagios instances live; that will also help with replicating ACLs
FWIW we currently have nagios1.private.euw1 running on EC2. I'll reuse that setup here as far as possible. Thanks!
Reporter
Comment 10•11 years ago
To avoid morphing this bug more, moved the AWS instance request to bug 903224.
(In reply to Shyam Mani [:fox2mike] from comment #2)
> So to start with, we'd need a good way of knowing :
>
> 1) SCL3 <=> PHX1 health
> 2) SCL3 <=> SCL1 health
How are these 2 coming?
Flags: needinfo?(rail)
Reporter
Comment 11•11 years ago
(In reply to Shyam Mani [:fox2mike] from bug 903224 comment #1)
> Once [AWS instances are] up, Ashish will have to get flows opened with
> Netops and then finish his setup in time to test all this during the Aug
> 24th treeclosing maint window.
Why can't we test this new setup with the phx1 <-> scl3 <-> scl1 instances setup?
Flags: needinfo?(shyam)
Comment 12•11 years ago
(In reply to Hal Wine [:hwine] from comment #11)
> Why can't we test this new setup with the phx1 <-> scl3 <-> scl1 instances
> setup?
I believe we already have this in place. For the Aug 24th test, I don't think we're going to kill SCL3 or PHX1, but we might be able to disrupt SCL1 connections and test that. I'll defer to Ashish on whether we already have this in place.
Flags: needinfo?(shyam) → needinfo?(ashish)
Assignee
Comment 13•11 years ago
(In reply to Shyam Mani [:fox2mike] from comment #12)
> I believe we already have this in place. For the Aug 24th test, I don't
> think we're going to kill SCL3 or PHX1, we might be able to disrupt SCL1
> connections and test though. I'll defer to Ashish on this portion (on if we
> already have this in place)
We can get unidirectional connectivity info for SCL3->SCL1 by pinging the admin server there (we already do that and can increase the frequency of checks).
Flags: needinfo?(ashish)
Assignee
Comment 14•11 years ago
While Bug 903224 is being worked on, I have set alerts from these to go to the oncall and build as well.
nagios-scl3 <-> nagios-phx1
nagios-releng <-> nagios-scl3
nagios-phx1 <-> nagios-releng
nagios-releng -> admin1[a,b].infra.scl1
Note: nagios-to-nagios pings are bidirectional
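The check list above can be read as a set of directed ping checks, with each nagios-to-nagios pair counted in both directions. This is a minimal sketch using the short names from the list; expanding `admin1[a,b]` into two hosts named `admin1a` and `admin1b` is an assumption, and `directed_checks` is a hypothetical helper:

```python
# Hypothetical encoding of the check list above. Names are copied from the
# comment; expanding "admin1[a,b]" into two hosts is an assumption.
BIDIRECTIONAL = [
    ("nagios-scl3", "nagios-phx1"),
    ("nagios-releng", "nagios-scl3"),
    ("nagios-phx1", "nagios-releng"),
]
ONE_WAY = [
    ("nagios-releng", "admin1a.infra.scl1"),
    ("nagios-releng", "admin1b.infra.scl1"),
]


def directed_checks(bidirectional, one_way):
    """Expand each bidirectional pair into two directed (source, target) checks."""
    checks = []
    for a, b in bidirectional:
        checks.append((a, b))
        checks.append((b, a))
    checks.extend(one_way)
    return checks
```

Under these assumptions the three bidirectional pairs and the two one-way checks expand to eight directed checks, and SCL1 is only ever a target, never a source.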
Status: NEW → ASSIGNED
Reporter
Comment 15•11 years ago
(In reply to Ashish Vijayaram [:ashish] from comment #14)
> While Bug 903224 is being worked on, I have set alerts from these to go to
> the oncall and build as well.
Question: by "alerts ... go to build" do you mean these alerts are:
- reported in irc channel #buildduty
- emailed to release@m.c
- both
- something else
Thanks! This is great news!
Reporter
Updated•11 years ago
Flags: needinfo?(ashish)
Assignee
Comment 16•11 years ago
(In reply to Hal Wine [:hwine] from comment #15)
> Question: by "alerts ... go to build" do you mean these alerts are:
> - reported in irc channel #buildduty
> - emailed to release@m.c
> - both
> - something else
>
> Thanks! This is great news!
Both. The current Nagios contactgroup "build" notifies #buildduty and sends out an email to release@m.c.
Flags: needinfo?(ashish)
Assignee
Comment 17•11 years ago
Also, I shall do some basic tests to verify alerting on Monday and update this bug.
Assignee
Comment 18•11 years ago
The 3 new instances in USE1, USW1 and USW2 are already set up for monitoring from releng-scl3:
https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.use1.mozilla.com
https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.usw1.mozilla.com
https://nagios.mozilla.org/releng-scl3/cgi-bin/status.cgi?navbarsearch=1&host=nagios1.private.releng.usw2.mozilla.com
Pending Bug 905721 (or dependants), monitoring from the other direction will be set up.
Reporter
Comment 19•11 years ago
Not sure how this relates to smokeping - the question came up in bug 910818 comment 2. I'll let you and :casey figure out the right thing.
Flags: needinfo?(ashish)
Assignee
Comment 20•11 years ago
All three instances are puppetised and basic configuration is done. While verifying, I found another flow that needs to be opened, from the instances to admin1.scl3, which is their admin host; host checks are routed through it. Bug 915636 has been filed; once it is closed, I shall verify and enable notifications on all 3 new instances.
04:58:36 -!- nagios-use1 <nagios-use@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
05:01:23 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
05:03:51 -!- nagios-usw2 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
Flags: needinfo?(ashish)
Assignee
Comment 21•11 years ago
This is all set up! There was an actual alert that went off yesterday from nagios-releng:
21:47 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has quit (Ping timeout)
21:49 < nagios-releng> Mon 21:49:38 PDT [4672] nagios1.private.releng.usw1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
...
22:25 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
22:25 < nagios-releng> Mon 22:25:26 PDT [4697] nagios1.private.releng.usw1.mozilla.com is UP :PING OK - Packet loss = 0%, RTA = 10.98 ms
With the setup complete, I'm turning on notifications for the new instances. Please treat alerts as production.
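The alert lines quoted above follow a regular shape, so a DOWN alert and its recovery can be picked out programmatically. This is a minimal sketch, assuming only the line format seen in this comment; the regular expression and the function name `parse_alert` are illustrative and not part of any Nagios tooling:

```python
import re

# Pattern inferred from the two alert lines quoted in this comment only;
# other Nagios notification formats may not match.
ALERT_RE = re.compile(
    r"\[(?P<id>\d+)\]\s+(?P<host>\S+) is (?P<state>UP|DOWN)\s*:"
    r"PING (?P<status>\w+) - Packet loss = (?P<loss>\d+)%"
)


def parse_alert(line):
    """Return a dict describing a host alert line, or None if it doesn't match."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["id"] = int(fields["id"])      # notification number in brackets
    fields["loss"] = int(fields["loss"])  # packet loss as a percentage
    return fields
```

Join and quit lines from the IRC log do not match the pattern, so `parse_alert` returns `None` for them and only host state changes are reported.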
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: mozilla.org → mozilla.org Graveyard