Closed Bug 685527 Opened 13 years ago Closed 11 years ago

nagios check for age of /var/lib/puppet/state/puppetdlock

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: hwine)

References

Details

(Whiteboard: [nagios][buildbotmaster][puppet])

Attachments

(2 files)

Caught a few masters with a lockfile from August 10; they required a restart of puppet to fix. Presumably fallout from network issues? In any case, we need a check for these lock files, or some other way of detecting that puppet is wedged.
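A minimal sketch of the kind of check being asked for, assuming a Nagios-style plugin that reports on the age of the lock file named in the bug summary. The function names and the 30/60 minute thresholds are illustrative assumptions, not values from this bug; only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def lock_status(age_seconds, warn_seconds, crit_seconds):
    """Map a lock file's age to (exit_code, message).

    age_seconds is None when no lock file exists, which is the healthy
    case: puppet is simply not running right now.
    """
    if age_seconds is None:
        return OK, "OK: no puppet lock file present"
    age = int(age_seconds)
    if age_seconds >= crit_seconds:
        return CRITICAL, f"CRITICAL: lock file is {age}s old"
    if age_seconds >= warn_seconds:
        return WARNING, f"WARNING: lock file is {age}s old"
    return OK, f"OK: lock file is {age}s old"


def check_lock(path, warn_seconds=1800, crit_seconds=3600, now=None):
    """Stat the lock file and classify its age (thresholds are assumptions)."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return lock_status(None, warn_seconds, crit_seconds)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot stat {path}: {exc}"
    return lock_status(now - mtime, warn_seconds, crit_seconds)
```

A wrapper script would call check_lock("/var/lib/puppet/state/puppetdlock"), print the message, and exit with the returned code so that NRPE can pass the result back to nagios.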
OS: Linux → All
Priority: -- → P3
Hardware: x86_64 → All
At $previous_job we had cfengine touch a file every time it ran, then checked the age of that file with nagios. Doing the same here would tell us not only that puppet ran, but that it ran successfully.
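The touch-a-file approach can be sketched as follows, assuming the configuration run touches a stamp file only after it completes successfully. Unlike a lock file check, a missing stamp is treated as a failure here. The names and the 1 hour / 2 hour thresholds are hypothetical; nothing in this bug specifies them.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def stamp_status(age_seconds, warn_seconds, crit_seconds):
    """Classify the age of a success stamp; None means the stamp is missing."""
    if age_seconds is None:
        return CRITICAL, "CRITICAL: no successful run recorded"
    age = int(age_seconds)
    if age_seconds >= crit_seconds:
        return CRITICAL, f"CRITICAL: last successful run {age}s ago"
    if age_seconds >= warn_seconds:
        return WARNING, f"WARNING: last successful run {age}s ago"
    return OK, f"OK: last successful run {age}s ago"


def check_stamp(path, warn_seconds=3600, crit_seconds=7200, now=None):
    """Stat the stamp file and classify how long ago it was last touched."""
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return stamp_status(None, warn_seconds, crit_seconds)
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot stat {path}: {exc}"
    return stamp_status(now - mtime, warn_seconds, crit_seconds)
```

The design point is that the stamp goes stale whether the run is wedged, failing, or not being started at all, so one check covers all three cases.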
Assignee: nobody → hwine
This bit us again today when signing3 was silently failing to sync up with puppet.
Severity: normal → critical
Found a suitable nagios NRPE plugin via Nagios Exchange (thanks to :arr):
<https://github.com/aswen/nagios-plugins/blob/1766d1fdf3b32477a8fa0dc3d754188ed1c0e2cc/check_puppet_agent>
Forked & modified for our puppet version at: <https://github.com/hwine/nagios-plugins>

Installed on buildbot-master32 for testing, and works on that host:

[root@buildbot-master32 plugins]# ./check_nrpe -H 127.0.0.1 -c check_puppet_agent
OK: Puppet agent <unknown> running catalog version <unknown>

Next steps:
- have relops monitor on this one host
- adjust timings as appropriate
- put into puppet for deployment to all clients running puppetd (not talos)
- activate monitoring on all clients
Depends on: 752332
The state file for our version of puppet doesn't contain lines with 'version:' or 'config:', which leads to those two '<unknown>' in the OK message. Could we finesse those away?
Sure - I'm the one who put them there. :) My assumption is that when we migrate to a newer version, those lines may be present and would then auto-populate, since they were present in the upstream version of the code. I can certainly use different constant or dynamic values there for now.
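One way to finesse the '<unknown>' values away is to leave a field out of the OK message when its line is absent. The sketch below assumes the state file is plain text with one 'key: value' pair per line. Both that format and the helper names are assumptions for illustration, and this is not the code in the forked plugin.

```python
def extract_field(state_text, key):
    """Return the value of the first 'key: value' line, or None if absent or empty."""
    prefix = key + ":"
    for line in state_text.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            value = stripped[len(prefix):].strip()
            return value or None
    return None


def ok_message(state_text):
    """Build the OK line, mentioning only the fields the state file provides."""
    pieces = ["OK: Puppet agent"]
    version = extract_field(state_text, "version")
    config = extract_field(state_text, "config")
    if version:
        pieces.append(version)
    if config:
        pieces.append("running catalog version " + config)
    return " ".join(pieces)
```

With this shape, a newer puppet whose state file does carry those lines would populate the message automatically, matching the assumption above.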
Status: NEW → ASSIGNED
this keeps biting us at really inconvenient times. Hal - do you have time to finish this up?
Priority: P3 → P2
I can work on this soonish - fwiw, it will be my first significant puppet work, so if we need it faster than a week, someone else should grab.
Passes manual tests from AWS buildbot master:
  puppet agent --test --noop --environment test
    shows 2 files to be deployed, and nrpe to be reconfigured
  puppet agent --test --environment test
    shows correct values for the files delivered via puppet

Upstream of plugin code is https://github.com/hwine/nagios-plugins
Attachment #745427 - Flags: review?(rail)
Comment on attachment 745427 (puppetAgain patch for buildbot masters)

You'll want this on foopies and imaging servers as well. It's probably best to include it from toplevel::server.

Comment on attachment 745427 (puppetAgain patch for buildbot masters)

(In reply to Dustin J. Mitchell [:dustin] from comment #10)
> You'll want this on foopies and imaging servers as well. It's probably best
> to include it from toplevel::server.

Agree. It won't work on the AWS puppet masters (unless you disable daemon_check), but it shouldn't hurt them either. Once we switch to the cluster model this won't be an issue.

Hal, can you move "include nrpe::check::puppet_agent" from buildmaster::buildbot_master to toplevel::server when you land?
Attachment #745427 - Flags: review?(rail) → review+
Attachment #745427 - Flags: checked-in+
Note: this plugin does not work on the ancient buildbot-master12, the only one left in-house at the moment. This is not an issue, as it's scheduled for replacement in bug 867593 by a modern version this plugin supports.
The test for whether the puppet daemon is running is very brittle and gives lots of false positives. Instead of fixing it, remove that functionality from this plugin, as many of our servers do not run a traditional puppet daemon. If a daemon check is needed, we'll implement it via a proper nagios check_procs check.
Attachment #747043 - Flags: review?(rail)
In fact, none run a puppet daemon. The servers run puppet from a cron task.
Attachment #747043 - Flags: review?(rail) → review+
Attachment #747043 - Flags: checked-in+
Blocks: re-nagios
Product: mozilla.org → Release Engineering
What's left to do here?
(In reply to Dustin J. Mitchell [:dustin] (I read my bugmail; don't needinfo me) from comment #17)
> What's left to do here?

I'm going to assume nothing.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED