Closed Bug 896812 Opened 11 years ago Closed 11 years ago

Need monitoring of links from releng machines to key data centers

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: ashish)

References

Details

(Whiteboard: [reit-nagios])

Every data center with RelEng machinery (including AWS regions) needs to notify effectively when a link to a key data center goes down. Key data centers are those which house the single instance of a resource needed for normal operations. I.e. loss of all connectivity to a key data center will close the trees. Examples are SCL3 (buildbot scheduler), SCL1 (slavealloc).
Update: PHX1 is also a key data center for releases (including nightlies & betas), as symbols and snippets are pushed there.
Assignee: server-ops → ashish
So to start with, we'd need a good way of knowing:

1) SCL3 <=> PHX1 health
2) SCL3 <=> SCL1 health
3) SCL3 <=> releng AWS health
(In reply to Shyam Mani [:fox2mike] from comment #2)
> So to start with, we'd need a good way of knowing :
>
> 3) SCL3 <=> releng AWS health

This is actually 2 independent regions that can't talk to each other, so:
3) SCL3 <=> releng.use1
4) SCL3 <=> releng.usw2
(In reply to Hal Wine [:hwine] from comment #3)
> (In reply to Shyam Mani [:fox2mike] from comment #2)
> > So to start with, we'd need a good way of knowing :
> >
> > 3) SCL3 <=> releng AWS health
>
> This is actually 2 independent regions that can't talk to each other, so:
> 3) SCL3 <=> releng.use1
> 4) SCL3 <=> releng.usw2

...and also:
5) SCL3 <=> releng.use2
(In reply to John O'Duinn [:joduinn] from comment #4)
> 5) SCL3 <=> releng.use2

5) SCL3 <=> releng.usw1 (there is no use2 for releng).
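For illustration, a minimal sketch of how per-link health checks like these could be expressed as Nagios service objects on the scl3 Nagios host, one per remote site or region. The template name, ping thresholds, intervals, contact group names, and host names below are assumptions, not the production configuration.

    # Hypothetical check of the scl3 <=> releng.use1 link, defined on the
    # scl3 Nagios host; check_ping args are "warn rta,loss%" / "crit rta,loss%".
    # generic-service is the stock sample-config template; group names are placeholders.
    define service {
        use                     generic-service
        host_name               nagios1.private.releng.use1.mozilla.com
        service_description     link health: scl3 -> releng.use1
        check_command           check_ping!250.0,20%!500.0,60%
        check_interval          5
        retry_interval          1
        max_check_attempts      3
        contact_groups          build,oncall
    }
    # Analogous services would cover scl3 -> phx1, scl3 -> scl1,
    # scl3 -> releng.usw1 and scl3 -> releng.usw2.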
Hal, since Bug 901784 is closed out, it would be awesome if we could get an AWS instance per region (ashish, please provide the specs) in the correct "private" subnet/vlan. They should be called:

nagios1.private.releng.usw1.mozilla.com
nagios1.private.releng.usw2.mozilla.com
nagios1.private.releng.use1.mozilla.com

They can have ashish's keys to start with and once we have flows, they can be puppetized. Thanks!

Ashish - Once the instances are up, please file a firewall flow request as you need it with netops.
Flags: needinfo?(hwine)
Catlee -- who can set up the AWS instances mentioned in comment 6? (once Ashish provides specs)
Flags: needinfo?(hwine)
Based on engops meeting this morning, ideally these will be:
- size small instances
- RHEL or CentOS based
- our choice of IP based on the subnets identified in bug 901784 comment 1

:rail -- is this sufficient information? If not, please needinfo ashish on any remaining questions
Flags: needinfo?(rail)
(In reply to Hal Wine [:hwine] from comment #8)
> Based on engops meeting this morning, ideally these will be:
> - size small instances
> - RHEL or CentOS based
> - our choice of IP based on the subnets identified in bug 901784 comment 1
>

- Yes, small size instances with about 1GB RAM should suffice
- They need to be RHEL and in fact, they will be puppetized by Infra Puppet
- Ideally private, which is where the other Nagios instances lie, will also help replicating ACLs

FWIW we currently have nagios1.private.euw1 running on EC2. I'll reuse that setup here as far as possible.

Thanks!
Depends on: 903224
To avoid morphing this bug more, moved the AWS instance request to bug 903224.

(In reply to Shyam Mani [:fox2mike] from comment #2)
> So to start with, we'd need a good way of knowing :
>
> 1) SCL3 <=> PHX1 health
> 2) SCL3 <=> SCL1 health

How are these 2 coming?
Flags: needinfo?(rail)
(In reply to Shyam Mani [:fox2mike] from bug 903224 comment #1)
> Once [AWS instances are] up, Ashish will have to get flows opened with
> Netops and then finish his setup in time to test all this during the Aug
> 24th treeclosing maint window.

Why can't we test this new setup with the phx1 <-> scl3 <-> scl1 instances setup?
Flags: needinfo?(shyam)
(In reply to Hal Wine [:hwine] from comment #11)
> Why can't we test this new setup with the phx1 <-> scl3 <-> scl1 instances
> setup?

I believe we already have this in place. For the Aug 24th test, I don't think we're going to kill SCL3 or PHX1, but we might be able to disrupt SCL1 connections and test. I'll defer to Ashish on this portion (on whether we already have this in place).
Flags: needinfo?(shyam) → needinfo?(ashish)
(In reply to Shyam Mani [:fox2mike] from comment #12)
> I believe we already have this in place. For the Aug 24th test, I don't
> think we're going to kill SCL3 or PHX1, but we might be able to disrupt SCL1
> connections and test. I'll defer to Ashish on this portion (on whether we
> already have this in place).

We can get unidirectional connectivity info from SCL3->SCL1 by pinging the admin server there (we already do that, and can increase the frequency of checks).
Flags: needinfo?(ashish)
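For illustration, a sketch of what bumping the frequency of that existing SCL3 -> SCL1 ping might look like as a Nagios service; the FQDN, intervals, thresholds, and contact group names are assumptions.

    # Hypothetical tightened check from the scl3 Nagios host to the scl1 admin server.
    define service {
        use                     generic-service
        host_name               admin1a.infra.scl1.mozilla.com   ; placeholder FQDN
        service_description     link health: scl3 -> scl1 (admin server ping)
        check_command           check_ping!250.0,20%!500.0,60%
        check_interval          1        ; every minute instead of the usual 5
        retry_interval          1
        max_check_attempts      5
        contact_groups          build,oncall
    }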
While Bug 903224 is being worked on, I have set alerts from these to go to
the oncall and build as well.

nagios-scl3 <-> nagios-phx1
nagios-releng <-> nagios-scl3
nagios-phx1 <-> nagios-releng
nagios-releng -> admin1[a,b].infra.scl1

Note: nagios-to-nagios pings are bi-directional
Status: NEW → ASSIGNED
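In practice, "bi-directional" here means each Nagios instance carries its own object for its peer, so whichever side survives a link failure still alerts. A sketch under that assumption, with placeholder host names, address, and group names:

    # On nagios-scl3: the phx1 peer as a monitored host (address is a placeholder;
    # generic-host and check-host-alive come from the stock sample config).
    define host {
        use             generic-host
        host_name       nagios1.private.phx1.mozilla.com
        address         10.8.75.10
        check_command   check-host-alive
        contact_groups  build,oncall
    }
    # On nagios-phx1, a mirror-image host object points back at the scl3
    # instance, which is what makes the nagios-to-nagios pings bi-directional.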
(In reply to Ashish Vijayaram [:ashish] from comment #14)
> While Bug 903224 is being worked on, I have set alerts from these to go to
> the oncall and build as well.

Question: by "alerts ... go to build" do you mean these alerts are:
- reported in irc channel #buildduty
- emailed to release@m.c
- both
- something else

Thanks! This is great news!
Flags: needinfo?(ashish)
(In reply to Hal Wine [:hwine] from comment #15)
> Question: by "alerts ... go to build" do you mean these alerts are:
> - reported in irc channel #buildduty
> - emailed to release@m.c
> - both
> - something else
>
> Thanks! This is great news!

Both. The current Nagios contactgroup "build" notifies #buildduty and sends out an email to release@m.c.
Flags: needinfo?(ashish)
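A sketch of how a contactgroup like "build" can fan out to both an IRC bot and a mailing list. The member contact names, the notify-by-irc command, and the release@example.com address are placeholders ("release@m.c" above is not expanded here); the notify-*-by-email commands and 24x7 timeperiod are the stock Nagios sample-config names.

    define contactgroup {
        contactgroup_name   build
        alias               RelEng buildduty
        members             buildduty-irc,release-list
    }

    # IRC side: a contact whose notification command hands alerts to whatever
    # bot relays them into #buildduty (notify-by-irc is a placeholder name).
    define contact {
        contact_name                    buildduty-irc
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u,r
        service_notification_options    w,u,c,r
        host_notification_commands      notify-by-irc
        service_notification_commands   notify-by-irc
    }

    # Email side: a contact that mails the release list.
    define contact {
        contact_name                    release-list
        email                           release@example.com   ; placeholder address
        host_notification_period        24x7
        service_notification_period     24x7
        host_notification_options       d,u,r
        service_notification_options    w,u,c,r
        host_notification_commands      notify-host-by-email
        service_notification_commands   notify-service-by-email
    }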
Also, I shall do some basic tests to verify alerting on Monday and update this bug.
Not sure how this relates to smokeping - the question came up in bug 910818 comment 2. I'll let you and :casey figure out the right thing.
Flags: needinfo?(ashish)
All three instances are puppetised and basic configuration is done. While verifying, I found another flow that needs to be opened from the instances to admin1.scl3, which is their admin host and routes host checks through it. Bug 915636 has been filed; pending its closure, I shall verify and enable notifications on all 3 new instances.

04:58:36 -!- nagios-use1 <nagios-use@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
05:01:23 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
05:03:51 -!- nagios-usw2 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
Flags: needinfo?(ashish)
This is all set up! There was an actual alert that went off yesterday from nagios-releng:

21:47 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has quit (Ping timeout)
21:49 < nagios-releng> Mon 21:49:38 PDT [4672] nagios1.private.releng.usw1.mozilla.com is DOWN :PING CRITICAL - Packet loss = 100%
...
22:25 -!- nagios-usw1 <nagios-usw@moz-539655E7.fw1.releng.scl3.mozilla.net> has joined #buildduty
22:25 < nagios-releng> Mon 22:25:26 PDT [4697] nagios1.private.releng.usw1.mozilla.com is UP :PING OK - Packet loss = 0%, RTA = 10.98 ms

With the setup complete, I'm turning on notifications for the new instances. Please treat alerts as production.
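For illustration, turning notifications on in a puppetised Nagios setup like this typically amounts to flipping a single directive in the host (and service) definitions; the host name, address, template, and group names below are assumptions.

    # Hypothetical host object for one of the new instances; during setup it
    # could carry notifications_enabled 0, and flipping it to 1 makes alerts live.
    define host {
        use                     generic-host
        host_name               nagios1.private.releng.usw1.mozilla.com
        address                 10.132.48.10     ; placeholder address
        check_command           check-host-alive
        notifications_enabled   1
        contact_groups          build,oncall
    }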
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 1077284
Product: mozilla.org → mozilla.org Graveyard