Closed
Bug 685527
Opened 13 years ago
Closed 11 years ago
nagios check for age of /var/lib/puppet/state/puppetdlock
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: hwine)
References
Details
(Whiteboard: [nagios][buildbotmaster][puppet])
Attachments
(2 files)
(deleted), patch | rail: review+ | hwine: checked-in+
(deleted), patch | rail: review+ | hwine: checked-in+
Caught a few masters with a lockfile from August 10; a restart of puppet was required to fix them. Presumably fallout from network issues?
In any case, we need a check for these lock files, or some other way of checking whether puppet is wedged.
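A minimal sketch of what such a lock-file check could look like, written as a Nagios-style function that returns an exit code and a message. The path comes from the bug summary; the function name and the warning/critical thresholds are assumptions for illustration, not what was eventually deployed.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Path from the bug summary.
LOCKFILE = "/var/lib/puppet/state/puppetdlock"


def check_lock_age(path, warn_secs=3600, crit_secs=4 * 3600, now=None):
    """Return (exit_code, message) describing how old the lock file is.

    The thresholds are illustrative assumptions; tune them to the run interval.
    """
    now = time.time() if now is None else now
    try:
        age = now - os.stat(path).st_mtime
    except FileNotFoundError:
        # No lock file means no run is in progress, which is the healthy case.
        return OK, "OK: no lock file at %s" % path
    except OSError as exc:
        return UNKNOWN, "UNKNOWN: cannot stat %s: %s" % (path, exc)

    if age >= crit_secs:
        return CRITICAL, "CRITICAL: lock file is %d seconds old" % age
    if age >= warn_secs:
        return WARNING, "WARNING: lock file is %d seconds old" % age
    return OK, "OK: lock file is %d seconds old" % age
```

A wrapper script would call `check_lock_age(LOCKFILE)`, print the message, and exit with the returned code so that NRPE can report it.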
Updated•13 years ago
OS: Linux → All
Priority: -- → P3
Hardware: x86_64 → All
Comment 1•13 years ago
at $previous_job we used to have cfengine touch a file every time it ran, then check the age of that file with nagios. That not only tells you that puppet ran, but that it ran successfully.
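The touch-a-file approach described here could be sketched roughly as follows. The function names and the stamp-file idea as applied to puppet are assumptions for illustration; the key difference from a lock-file check is that a missing stamp is treated as a failure, because it means no successful run was ever recorded.

```python
import os
import time

# Nagios plugin exit codes used by this sketch.
OK, CRITICAL = 0, 2


def touch_stamp(path):
    """Call at the end of a successful run to refresh the stamp file's mtime."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def check_stamp(path, max_age_secs, now=None):
    """Return (exit_code, message); a missing or old stamp is CRITICAL."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        return CRITICAL, "CRITICAL: %s missing; no successful run recorded" % path
    if age > max_age_secs:
        return CRITICAL, "CRITICAL: last successful run was %d seconds ago" % age
    return OK, "OK: last successful run was %d seconds ago" % age
```

Because the stamp is only touched after the run completes, a run that hangs or fails partway through shows up as an aging stamp rather than as a healthy check.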
Assignee | ||
Updated•13 years ago
Assignee: nobody → hwine
Reporter | ||
Comment 3•13 years ago
this bit us again today when signing3 was silently failing to sync up with puppet
Severity: normal → critical
Assignee | ||
Comment 4•13 years ago
Found a suitable nagios NRPE plugin via Nagios Exchange (thanks to :arr):
<https://github.com/aswen/nagios-plugins/blob/1766d1fdf3b32477a8fa0dc3d754188ed1c0e2cc/check_puppet_agent>
forked & modified for our puppet version at:
<https://github.com/hwine/nagios-plugins>
Installed on buildbot-master32 for testing, and works on that host:
[root@buildbot-master32 plugins]# ./check_nrpe -H 127.0.0.1 -c check_puppet_agent
OK: Puppet agent <unknown> running catalog version <unknown>
Next steps:
- have relops monitor on this one host
- adjust timings as appropriate
- put into puppet for deployment to all clients running puppetd (not talos)
- activate monitoring on all clients
Comment 5•12 years ago
The state file for our version of puppet doesn't contain lines with 'version:' or 'config:', which leads to those two '<unknown>' values in the OK message. Could we finesse those away?
Assignee | ||
Comment 6•12 years ago
Sure - I'm the one who put them there. :) My assumption is that when we migrate to a newer version, they may be present and would then auto-populate, since they were present in the upstream version of the code. I can certainly use different constant or dynamic values there for now.
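One way to finesse the placeholders away, sketched under the assumption that the state file consists of simple `key: value` lines (the real file format is not shown in this bug, and the function names are hypothetical): collect only the fields that are present, and build the OK message from whatever was found.

```python
def parse_state(lines):
    """Collect 'version' and 'config' values from 'key: value' lines, if present.

    Assumes a simple 'key: value' layout; the real state file may differ.
    """
    wanted = ("version", "config")
    found = {}
    for line in lines:
        key, sep, value = line.strip().partition(":")
        if sep and key in wanted:
            found[key] = value.strip()
    return found


def ok_message(state):
    """Build the OK line, leaving out any field the state file did not provide."""
    msg = "OK: Puppet agent"
    if state.get("version"):
        msg += " " + state["version"]
    if state.get("config"):
        msg += " running catalog version " + state["config"]
    return msg
```

With this shape, an older puppet that writes neither field yields a plain `OK: Puppet agent`, and the extra detail appears automatically once a newer version starts writing those lines.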
Assignee | ||
Updated•12 years ago
Status: NEW → ASSIGNED
Reporter | ||
Comment 7•12 years ago
this keeps biting us at really inconvenient times.
Hal - do you have time to finish this up?
Priority: P3 → P2
Assignee | ||
Comment 8•12 years ago
I can work on this soonish - fwiw, it will be my first significant puppet work, so if we need it faster than a week, someone else should grab it.
Assignee | ||
Comment 9•12 years ago
Passes manual tests from AWS buildbot master:
puppet agent --test --noop --environment test
shows 2 files to be deployed, and nrpe to be reconfigured
puppet agent --test --environment test
shows correct values for the files delivered via puppet
upstream of plugin code is https://github.com/hwine/nagios-plugins
Attachment #745427 -
Flags: review?(rail)
Comment 10•12 years ago
Comment on attachment 745427: puppetAgain patch for buildbot masters
Review of attachment 745427:
-----------------------------------------------------------------
You'll want this on foopies and imaging servers as well. It's probably best to include it from toplevel::server.
Comment 11•12 years ago
Comment on attachment 745427: puppetAgain patch for buildbot masters
(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> You'll want this on foopies and imaging servers as well. It's probably best
> to include it from toplevel::server.
Agreed. It won't work on the AWS puppet masters (unless you disable daemon_check), but it shouldn't hurt them either. Once we switch to the cluster model this won't be an issue.
Hal, can you move "include nrpe::check::puppet_agent" from buildmaster::buildbot_master to toplevel::server when you land?
Attachment #745427 -
Flags: review?(rail) → review+
Assignee | ||
Comment 12•12 years ago
Comment on attachment 745427: puppetAgain patch for buildbot masters
https://hg.mozilla.org/build/puppet/rev/985bdf0507d0
Attachment #745427 -
Flags: checked-in+
Assignee | ||
Comment 13•11 years ago
Note: this plugin does not work on the ancient buildbot-master12, the only one left inhouse at the moment. This is not an issue, as it's scheduled for replacement in bug 867593 by a modern version this plugin supports.
Assignee | ||
Comment 14•11 years ago
The test for puppet daemon running is very brittle, and gives lots of false positives.
Instead of fixing it, remove that functionality from this plugin, as many of our servers do not run a traditional puppet daemon. If a daemon check is needed, we'll implement it via a proper nagios check_proc.
Attachment #747043 -
Flags: review?(rail)
Comment 15•11 years ago
In fact, none run a puppet daemon. The servers run puppet from a cron task.
Updated•11 years ago
Attachment #747043 -
Flags: review?(rail) → review+
Assignee | ||
Comment 16•11 years ago
Comment on attachment 747043: remove broken & useless daemon code
https://hg.mozilla.org/build/puppet/rev/32beb7ce6cc0
Attachment #747043 -
Flags: checked-in+
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 17•11 years ago
What's left to do here?
Comment 18•11 years ago
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #17)
> What's left to do here?
I'm going to assume nothing.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED