Closed Bug 752332 Opened 13 years ago Closed 11 years ago

enable nagios check of puppet agent status on bm32

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: hwine, Assigned: ashish)

Trial deploy of a new check that the puppet agent has run successfully recently. (See bug 685527#c4 for details.)
Please add checking of "check_puppet_agent" on buildbot-master32 via NRPE, with notifications disabled. A sample nagios service configuration is given at: <https://github.com/hwine/nagios-plugins/blob/master/check_puppet_agent>
After some burn-in and tuning, we'll deploy via puppet on a broader scale and then ask for general activation in another bug.
To clarify, please use this line for enabling the service: check_command check_nrpe!check_puppet_agent!3600!7200
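For readers unfamiliar with how the two numbers on that line are typically used, here is a minimal sketch in Python. It is not the actual check_puppet_agent plugin. It assumes that 3600 and 7200 are the warning and critical thresholds, expressed as seconds since the last successful puppet agent run. The names classify_age and format_status are hypothetical. The exit codes follow the standard Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def classify_age(age_sec, warn_sec=3600, crit_sec=7200):
    """Map seconds since the last successful run to a Nagios status code.

    The defaults mirror the 3600/7200 arguments in the check_command line;
    treating them as warning/critical ages in seconds is an assumption.
    """
    if age_sec is None or age_sec < 0:
        return UNKNOWN  # last run time could not be determined
    if age_sec >= crit_sec:
        return CRITICAL
    if age_sec >= warn_sec:
        return WARNING
    return OK


def format_status(age_sec, warn_sec=3600, crit_sec=7200):
    """Return (exit_code, one-line status message) for the given age."""
    code = classify_age(age_sec, warn_sec, crit_sec)
    if code == UNKNOWN:
        return code, "UNKNOWN: could not determine last puppet agent run"
    return code, f"{LABELS[code]}: Puppet agent last run: {age_sec} sec ago"


# Illustrative call: a run 1800 seconds ago is below the warning threshold.
code, message = format_status(1800)
print(message)
```

A real plugin would obtain the age from the puppet agent's state on the host and exit with the returned code so that NRPE can relay it to nagios.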
I've added the check to the existing nagios with notifications disabled. Rick, please make sure to copy this check from admin1.infra.scl1.mozilla.com to nagios1.private.releng.scl3.mozilla.com when you migrate things today.
Assignee: server-ops-releng → rbryce
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → phong
This is no longer needed, as releng is staying put on scl1.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → INVALID
The move to scl3 may be invalid, but tracking this on scl1 nagios is not. Moving back to server-ops-releng and marking fixed instead.
Assignee: rbryce → server-ops-releng
Component: Server Operations → Server Operations: RelEng
QA Contact: phong → arich
Resolution: INVALID → FIXED
(In reply to Rick Bryce [:rbryce] from comment #3)
> This is no longer needed, as releng is staying put on scl1.
Times have changed: we're in scl3. Please enable the check per comment 1 for buildbot-master32 on the nagios server at http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/
The plugin functions correctly locally; I want to triple-check that it works okay before rolling out to all hosts.
Assignee: server-ops-releng → server-ops
Status: RESOLVED → REOPENED
Component: Server Operations: RelEng → Server Operations
QA Contact: arich → shyam
Resolution: FIXED → ---
Given that we're not building masters with old-puppet any more, and that all current masters are on KVM and thus will be replaced in the move to scl3, is this still necessary?
Flags: needinfo?(hwine)
(In reply to Dustin J. Mitchell [:dustin] from comment #6)
> Given that we're not building masters with old-puppet any more, and that all
> current masters are on KVM and thus will be replaced in the move to scl3, is
> this still necessary?
Yes, it will be deployed on ALL puppetized non-talos machines. I'm just starting with this older one since the check used to work there, before nagios etc. were upgraded, which makes it easier to troubleshoot. And yes, we want a nagios alert for this condition. As I understood puppetAgain, the dashboard will flag the error but not trigger a nagios alert.
Flags: needinfo?(hwine)
We also get an email for every failed puppet run, in the releng-shared mailbox. I really don't think a nagios alert is necessary.
And I should add, at least Callek and I check those religiously. I'd like to know that others are watching that mailbox, too.
Per IRC chat with Dustin, we can proceed on hooking this up.
OK, there is no buildbot-master32:
Host buildbot-master32.srv.releng.scl3.mozilla.com not found: 3(NXDOMAIN)
Or am I missing something? :)
Assignee: server-ops → ashish
That was one of the buildbot-masters that was recently decommissioned.
(In reply to Ashish Vijayaram [:ashish] from comment #11)
> OK, there is no buildbot-master32:
>
> Host buildbot-master32.srv.releng.scl3.mozilla.com not found: 3(NXDOMAIN)
>
> Or am I missing something? :)
No, I am. It was there when I started testing. I'll move my setup, then update this request. Taking this out of your queue for now.
Assignee: ashish → hwine
Okay, build-master12 is even older than 32 was (puppet version) -- I'll have to do some work there to support the plugin. Ashish, can you hook up buildbot-master63.srv.releng.use1.mozilla.com (in AWS) please? The plugin runs clean there. Thanks!
Assignee: hwine → ashish
Done!
<nagios-releng> ashish: buildbot-master63.srv.releng.use1.mozilla.com:Puppet freshness is OK - OK: Puppet agent last run: 1706 sec ago
Status: REOPENED → RESOLVED
Closed: 12 years ago → 11 years ago
Resolution: --- → FIXED
(In reply to Hal Wine [:hwine] from comment #14)
> Okay, build-master12 is even older than 32 was (puppet version) -- I'll have
> to do some work there to support the plugin.
No update needed; see bug 685527 comment 13.
Product: mozilla.org → mozilla.org Graveyard