Closed Bug 1332640 Opened 8 years ago Closed 7 years ago

Nagios scriptworker monitoring for queue pending time and file ages

Categories

(Infrastructure & Operations :: MOC: Service Requests, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: sfraser, Assigned: jlaz, Mentored)

Attachments

(2 files)

Hello, if possible I'd like to get extra nagios checks for our signing scriptworkers. There are some file age checks, and a custom one for the number of pending tasks. The nrpe plugin for the pending tasks already exists on the workers, so if I've got my puppet right the attached diff should be helpful.
Assignee: nobody → sespinoza
Mentor: rchilds
Status: NEW → ASSIGNED
Blocks: 1314840
Hi folks, is there anything more you need from me for this bug?
committed c131f4f931967b89fe2dab42745110142b537058

checks failed, will need to revise these.

nagios-releng> Wed 19:32:50 PST [4053] signing-linux-1.srv.releng.use1.mozilla.com:Signing Scriptworker gpg creation log age is CRITICAL: FILE_AGE CRITICAL: File not found - /builds/scriptworker/logs/create_initial_gpg_homedirs.log (http://m.mozilla.org/Signing+Scriptworker+gpg+creation+log+age)
19:32:51 <nagios-releng> Wed 19:32:51 PST [4054] signing-linux-2.srv.releng.usw2.mozilla.com:Signing Scriptworker gpg rebuild log age is CRITICAL: FILE_AGE CRITICAL: File not found - /builds/scriptworker/logs/rebuild_gpg_homedirs.log (http://m.mozilla.org/Signing+Scriptworker+gpg+rebuild+log+age)
19:33:00 <nagios-releng> Wed 19:33:00 PST [4055] signing-linux-4.srv.releng.usw2.mozilla.com:Signing Scriptworker gpg creation log age is CRITICAL: FILE_AGE CRITICAL: File not found - /builds/scriptworker/logs/create_initial_gpg_homedirs.log (http://m.mozilla.org/Signing+Scriptworker+gpg+creation+log+age)
19:33:00 <nagios-releng> Wed 19:33:00 PST [4056] signing-linux-3.srv.releng.use1.mozilla.com:Signing scriptworker gpg_homedirs.lock age is CRITICAL: NRPE: Command check_file_age_ok_not_exists not defined (http://m.mozilla.org/Signing+scriptworker+gpg_homedirs.lock+age)
19:33:01 <nagios-releng> Wed 19:33:01 PST [4057] signing-linux-3.srv.releng.use1.mozilla.com:Scriptworker log age is CRITICAL: FILE_AGE CRITICAL: File not found - /builds/scriptworker/logs/worker.log (http://m.mozilla.org/Scriptworker+log+age)
I'd thought check_file_age_ok_not_exists was already present, since other configs seem to use it. I need input from someone who has access to those servers to check what's going on there.
Flags: needinfo?(aki)
Hm.

- We can stop monitoring create_initial_gpg_homedirs.log, because we've stopped updating it.
- We do still have /builds/scriptworker/logs/rebuild_gpg_homedirs.log and /builds/scriptworker/logs/worker.log ; as long as we're root or cltsign we should be able to see those files. Is this running as a different user?
- Am I looking in /etc/nagios/nrpe.d for check_file_age_ok_not_exists?

  -rw-r--r-- 1 root root 127 Aug 22 18:18 check_child_procs_regex.cfg
  -rw-r--r-- 1 root root 85 Aug 22 18:18 check_ide_smart.cfg
  -rw-r--r-- 1 root root 98 Aug 22 18:18 check_ntp_peer.cfg
  -rw-r--r-- 1 root root 95 Aug 22 18:18 check_ntp_time.cfg
  -rw-r--r-- 1 root root 111 Aug 22 18:19 check_procs_regex.cfg
  -rw-r--r-- 1 root root 93 Aug 22 18:19 check_puppet_agent.cfg
  -rw-r--r-- 1 root root 96 Aug 22 18:19 check_puppet_freshness.cfg
  -rw-r--r-- 1 root root 77 Aug 22 18:19 check_swap.cfg
  -rwxr-x--- 1 root root 89 Nov 10 22:59 scriptworker.cfg
  -rw-r--r-- 1 root root 466 Dec 8 23:51 signingworker.cfg

- I'm leaning towards adding you to the puppet shortlist so you can poke around.
Flags: needinfo?(aki)
Depends on: 1336179
:aki, could you add him to the puppet shortlist? Is that OK with you, :sfraser?
Flags: needinfo?(sfraser)
Flags: needinfo?(aki)
:sfraser is already in the puppet shortlist, and can poke around: https://hg.mozilla.org/build/puppet/file/tip/manifests/moco-config.pp#l154
Flags: needinfo?(aki)
Did we move forward with the changes that aki requested? Is there a standard version of check_file_age_ok_not_exists that could be added?
Flags: needinfo?(sfraser)
:sfraser, if you're talking about comment 5, that was addressed to you.
Right, apologies for the delay. I think there's been a bit of confusion.

> - We can stop monitoring create_initial_gpg_homedirs.log, because we've stopped updating it.

sal, do you need a diff for this?

> - We do still have /builds/scriptworker/logs/rebuild_gpg_homedirs.log and /builds/scriptworker/logs/worker.log ; as long as we're root or cltsign we should be able to see those files. Is this running as a different user?

nrpe is running as the nagios user. The directories aren't even group readable, so unless we change that, nrpe won't be able to read their state. I'm more reluctant to change nagios, as it would mean these nodes are more fragile.

> - Am I looking in /etc/nagios/nrpe.d for check_file_age_ok_not_exists?

Seems to be a wrapper around a local version of check_file_age that comes with the nrpe puppet module (sysadmins/puppet/modules/nrpe/files/plugins/check_file_age) so I'm unsure why that isn't being copied down at the moment. Can investigate, unless sal knows more about how to deploy it.
Flags: needinfo?(sespinoza)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #10)
> nrpe is running as the nagios user. The directories aren't even group readable, so unless we change that, nrpe won't be able to read their state. I'm more reluctant to change nagios, as it would mean these nodes are more fragile.

Adding this check to use sudoers would get around the nagios user permission issue.
Assignee: sespinoza → jlaz
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #10)
> Right, apologies for the delay. I think there's been a bit of confusion.
>
> > - We can stop monitoring create_initial_gpg_homedirs.log, because we've stopped updating it.
>
> sal, do you need a diff for this?
>
> > - We do still have /builds/scriptworker/logs/rebuild_gpg_homedirs.log and /builds/scriptworker/logs/worker.log ; as long as we're root or cltsign we should be able to see those files. Is this running as a different user?
>
> nrpe is running as the nagios user. The directories aren't even group readable, so unless we change that, nrpe won't be able to read their state. I'm more reluctant to change nagios, as it would mean these nodes are more fragile.
>
> - Am I looking in /etc/nagios/nrpe.d for check_file_age_ok_not_exists?
>
> Seems to be a wrapper around a local version of check_file_age that comes with the nrpe puppet module (sysadmins/puppet/modules/nrpe/files/plugins/check_file_age) so I'm unsure why that isn't being copied down at the moment. Can investigate, unless sal knows more about how to deploy it.

Regarding check_file_age_ok_not_exists, you are correct. Puppet should be generating a config file (/etc/nagios/nrpe.d/file_age.cfg) with the following content:

command[check_file_age_ok_not_exists]=/usr/lib64/nagios/plugins/custom/check_file_age -m -w $ARG1$ -c $ARG2$ -f '$ARG3$'

I believe check_file_age should already exist in that path, so I'm a bit unsure why puppet isn't generating that nrpe config file since I do not have access to the box
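For reference, here is a minimal Python sketch of what a file age check with an "OK if missing" mode does. It is not the actual check_file_age plugin from the sysadmin puppet module. The function name, the `ok_if_missing` parameter and the message wording are illustrative, and it assumes that the `-m` flag in the command definition above means "a missing file is not an error", which the thread implies but does not state. Treating a permission error as UNKNOWN is likewise a choice made for this sketch.

```python
import os
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_file_age(path, warn_secs, crit_secs, ok_if_missing=False, now=None):
    """Return (exit_code, message) based on the age of `path` in seconds.

    Hypothetical helper: `ok_if_missing` stands in for the assumed
    meaning of the `-m` flag discussed in this bug.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        if ok_if_missing:
            return OK, f"FILE AGE OK: File not found - {path}"
        return CRITICAL, f"FILE_AGE CRITICAL: File not found - {path}"
    except PermissionError:
        # The checking user cannot reach the file (for example, a
        # directory it is not allowed to search).
        return UNKNOWN, f"FILE_AGE UNKNOWN: Permission denied - {path}"

    age = int(now - mtime)
    if age > crit_secs:
        return CRITICAL, f"FILE_AGE CRITICAL: {path} is {age} seconds old"
    if age > warn_secs:
        return WARNING, f"FILE_AGE WARNING: {path} is {age} seconds old"
    return OK, f"FILE_AGE OK: {path} is {age} seconds old"
```

With `ok_if_missing=True` a file such as a lock file can legitimately be absent without raising an alert, while a stale copy still trips the thresholds.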
Flags: needinfo?(sespinoza)
(In reply to Justin Lazaro [:jlaz] from comment #12)
> > - Am I looking in /etc/nagios/nrpe.d for check_file_age_ok_not_exists?
> >
> > Seems to be a wrapper around a local version of check_file_age that comes with the nrpe puppet module (sysadmins/puppet/modules/nrpe/files/plugins/check_file_age) so I'm unsure why that isn't being copied down at the moment. Can investigate, unless sal knows more about how to deploy it.
>
> Regarding check_file_age_ok_not_exists, you are correct. Puppet should be generating a config file (/etc/nagios/nrpe.d/file_age.cfg) with the following content:
>
> command[check_file_age_ok_not_exists]=/usr/lib64/nagios/plugins/custom/check_file_age -m -w $ARG1$ -c $ARG2$ -f '$ARG3$'
>
> I believe check_file_age should already exist in that path, so I'm a bit unsure why puppet isn't generating that nrpe config file since I do not have access to the box

Perhaps this is a puppet conflict. Looking above, this machine seems to be managed by hg.m.o/build/puppet, in which case I need to copy the relevant portion of the sysadmin puppet. Can you point me at the right file to plagiarise?

About nrpe not being able to read the files, ryanc mentioned adding the check to sudoers. How should we go about that? A sudo call in the check definition, or within the checking script?
Comment on attachment 8860360 [details]
bug 1332640 fix accidental hardcoding of thresholds

https://reviewboard.mozilla.org/r/132390/#review135340

::: modules/nrpe/manifests/check/cltsign_file_ages.pp:22
(Diff revision 1)
> + }
> +
> + sudoers::custom {
> + 'check_file_age':
> + user => 'nagios',
> + runas => 'cltsign',

We may want to pass the user into this module. Once this works for signing scriptworker, we probably want to turn this check on for the other scriptworkers, which use `cltbld` instead of `cltsign`. I'm ok with that as a followup fix though; I'm happy to see forward momentum here!
Attachment #8860360 - Flags: review?(aki) → review+
LGTM, with the same note that we'll want to make this generic for use in other scriptworkers, soon after we enable this for signing scriptworkers.
:sal, we've added check_signing_file_age and check_signing_file_age_ok_if_missing, mirroring the sysadmin puppet ones, but running using sudo to the user that can read the files. I don't have access to the nagios hosts to check this from there. Can you verify and re-add the tests, please? Thanks
Flags: needinfo?(sespinoza)
Flags: needinfo?(sespinoza) → needinfo?(jlaz)
It looks like the NRPE config on signing-linux-3.srv.releng may still be missing the command definition for check_signing_file_age:

[root@nagios1.private.releng.scl3 plugins]# ./check_nrpe -H signing-linux-3.srv.releng.use1.mozilla.com -c check_file_age_ok_not_exists
NRPE: Command 'check_file_age_ok_not_exists' not defined

It does exist for check_signing_file_age_ok_if_missing though:

[root@nagios1.private.releng.scl3 plugins]# ./check_nrpe -H signing-linux-3.srv.releng.use1.mozilla.com -c check_signing_file_age_ok_if_missing
FILE_AGE UNKNOWN: No file specified
Flags: needinfo?(jlaz)
(In reply to Justin Lazaro [:jlaz] from comment #19)
> It looks like the NRPE config on signing-linux-3.srv.releng may still be missing the command definition for check_signing_file_age:

It appears to work for me on the node itself:

[sfraser@signing-linux-3.srv.releng.use1.mozilla.com nrpe.d]$ /usr/lib64/nagios/plugins/check_nrpe -H localhost -c check_signing_file_age -a "-f /builds/scriptworker/logs/rebuild_gpg_homedirs.log"
FILE_AGE CRITICAL: /builds/scriptworker/logs/rebuild_gpg_homedirs.log is 18m 15s seconds old and 262217244 bytes

I'm now trying to remember the warning threshold we talked about way back, and I'm coming up short. I think it was 30 minutes warning / 45 critical, for /builds/scriptworker/logs/rebuild_gpg_homedirs.log and /builds/scriptworker/logs/worker.log
Looks like it is 45m for warning and 60m for critical according to the commented out entries:

# "signing_scriptworker_gpg_rebuild_log" => {
#     service_description => "Signing Scriptworker gpg rebuild log age",
#     check_command => 'check_file_age!2700!3600!/builds/scriptworker/logs/rebuild_gpg_homedirs.log',
#     hostgroups => $nagiosbot ? {
#         'nagios4-releng' => [
#             'signing-scriptworkers',
#         ],
#         default => [
#         ]
#     }
# },

Both commands from comment 18 are valid from nagios1.private.releng.scl3. Let me know if you'd like me to activate these checks. Thanks again!
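As a quick sanity check on the 45m/60m reading, the short Python sketch below splits the `check_command` string on `!` (the separator Nagios uses between the command name and its arguments) and converts the first two arguments from seconds to minutes. The helper names are hypothetical, and it assumes the first two arguments are the warning and critical ages in seconds, as the commented-out entry suggests.

```python
def parse_check_command(check_command):
    """Split 'name!arg1!arg2!...' into (name, [arg1, arg2, ...])."""
    name, *args = check_command.split("!")
    return name, args


def thresholds_in_minutes(check_command):
    """Return (warning, critical) in minutes.

    Assumes the first two arguments are ages in seconds.
    """
    _, args = parse_check_command(check_command)
    warn_secs, crit_secs = int(args[0]), int(args[1])
    return warn_secs / 60, crit_secs / 60


cmd = "check_file_age!2700!3600!/builds/scriptworker/logs/rebuild_gpg_homedirs.log"
print(thresholds_in_minutes(cmd))  # (45.0, 60.0)
```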
Flags: needinfo?(sfraser)
Flags: needinfo?(aki)
rebuild_gpg_homedirs.log should exist, and I'd love the checks turned on for that and worker.log. If we're still checking create_initial_gpg_homedirs.log, that will alert because we're no longer updating that file. I'm fine with either enabling, seeing the create_initial_gpg_homedirs.log alert, and then disabling; I'm also fine with not enabling the create_initial_gpg_homedirs.log check. Thanks everyone!
Flags: needinfo?(aki)
Apologies, I'd missed the update. What aki said.
Flags: needinfo?(sfraser)
Flags: needinfo?(jlaz)
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #20)
> (In reply to Justin Lazaro [:jlaz] from comment #19)
> > It looks like the NRPE config on signing-linux-3.srv.releng may still be missing the command definition for check_signing_file_age:
>
> It appears to work for me on the node itself:
>
> [sfraser@signing-linux-3.srv.releng.use1.mozilla.com nrpe.d]$ /usr/lib64/nagios/plugins/check_nrpe -H localhost -c check_signing_file_age -a "-f /builds/scriptworker/logs/rebuild_gpg_homedirs.log"
> FILE_AGE CRITICAL: /builds/scriptworker/logs/rebuild_gpg_homedirs.log is 18m 15s seconds old and 262217244 bytes
>
> I'm now trying to remember the warning threshold we talked about way back, and I'm coming up short. I think it was 30 minutes warning / 45 critical, for /builds/scriptworker/logs/rebuild_gpg_homedirs.log and /builds/scriptworker/logs/worker.log

Tried to commit the checks today, but still ran into the issue of the command not being recognized on the node:

Error: Service check command 'check_signing_file_age' specified in service 'Signing scriptworker gpg_homedirs.lock age' for host 'signing-linux-3.srv.releng.use1.mozilla.com' not defined anywhere!

Which is completely strange because I can see that you're running it manually on the box.
Flags: needinfo?(jlaz)
Alright, made some progress here. I've defined the commands in our checkcommands nagios puppet manifest:

check_signing_file_age => '$USER1$/check_nrpe -H $HOSTADDRESS$ -t 15 -c check_signing_file_age -a $ARG1$ $ARG2$ $ARG3$',
check_signing_file_age_ok_if_missing => '$USER1$/check_nrpe -H $HOSTADDRESS$ -t 15 -c check_signing_file_age_ok_if_missing -a $ARG1$ $ARG2$ $ARG3$',

20:45:40 < nagios-releng> Tue 20:45:40 PDT [4054] [] signing-linux-3.srv.releng.use1.mozilla.com:Signing scriptworker gpg_homedirs.lock age is OK: FILE AGE OK: File not found - /builds/scriptworker/.gpg_homedirs.lock (http://m.mozilla.org/Signing+scriptworker+gpg_homedirs.lock+age)

I'll add the rest of the checks shortly.
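To make the command definitions above easier to follow, here is a small Python sketch of how the `$NAME$` macros in such a template get filled in before the command runs on the Nagios server. It is a simplified illustration, not Nagios's own macro engine. The `expand_macros` helper is hypothetical, and the `USER1` path and argument values are assumed for the example rather than taken from the deployed config.

```python
import re


def expand_macros(template, macros):
    """Replace each $NAME$ in `template` with macros['NAME'].

    Unknown macros are left untouched so they stay visible in the output.
    """
    def substitute(match):
        return macros.get(match.group(1), match.group(0))

    return re.sub(r"\$([A-Z0-9]+)\$", substitute, template)


template = (
    "$USER1$/check_nrpe -H $HOSTADDRESS$ -t 15 "
    "-c check_signing_file_age -a $ARG1$ $ARG2$ $ARG3$"
)
macros = {
    "USER1": "/usr/lib64/nagios/plugins",  # assumed plugin directory
    "HOSTADDRESS": "signing-linux-3.srv.releng.use1.mozilla.com",
    "ARG1": "2700",  # warning age in seconds (assumed)
    "ARG2": "3600",  # critical age in seconds (assumed)
    "ARG3": "/builds/scriptworker/logs/worker.log",
}
print(expand_macros(template, macros))
```

The service definition supplies `$ARG1$` to `$ARG3$` through the `!`-separated `check_command` string, which is why the server-side command has to exist in addition to the NRPE command on the node.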
Alright, we're looking good, and pretty close:

21:07:22 < nagios-releng> jlaz: [] signing-linux-3.srv.releng.use1.mozilla.com:Scriptworker log age is OK - FILE_AGE OK: /builds/scriptworker/logs/worker.log is 11m 19s seconds old and 517770751 bytes Last Checked: 2017-06-20 21:03:54 PDT
21:07:22 < nagios-releng> jlaz: [] signing-linux-3.srv.releng.use1.mozilla.com:Signing Scriptworker gpg rebuild log age is OK - FILE_AGE OK: /builds/scriptworker/logs/rebuild_gpg_homedirs.log is 18m 55s seconds old and 280743929 bytes Last Checked: 2017-06-20 20:57:20 PDT
21:07:23 < nagios-releng> jlaz: [] signing-linux-3.srv.releng.use1.mozilla.com:Signing scriptworker gpg_homedirs.lock age is OK - FILE AGE OK: File not found - /builds/scriptworker/.gpg_homedirs.lock Last Checked: 2017-06-20 20:57:20 PDT

The only check that seems to be having issues is the pending scriptworker tasks:

21:08:18 < nagios-releng> jlaz: [] signing-linux-3.srv.releng.use1.mozilla.com:Pending Scriptworker Tasks is UNKNOWN - (No output returned from plugin) Last Checked: 2017-06-20 21:08:13 PDT

sfraser, or aki, would you mind running the checks locally to see what may be causing these errors? Once we can figure this out, we can close out the bug
Evidently those require python3 to run.

[root@signing-linux-3.srv.releng.use1.mozilla.com ~]# /builds/scriptworker/bin/python /usr/lib64/nagios/plugins/nagios_pending_tasks.py -w 5 -c 10
PENDING_TASKS OK - 0/10 pending tasks
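For readers unfamiliar with Nagios plugins, the threshold logic behind output like this can be sketched in a few lines of Python 3. This is not the code of nagios_pending_tasks.py. The `evaluate_pending` function is hypothetical, the use of `>=` for both comparisons is an assumption, and so is the reading of the `0/10` in the output as "pending count / critical threshold".

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def evaluate_pending(pending, warn, crit):
    """Return (exit_code, message) for a pending-task count.

    Assumption: a count at or above a threshold triggers that state.
    """
    if pending >= crit:
        code = CRITICAL
    elif pending >= warn:
        code = WARNING
    else:
        code = OK
    return code, f"PENDING_TASKS {LABELS[code]} - {pending}/{crit} pending tasks"


code, message = evaluate_pending(0, warn=5, crit=10)
print(message)  # PENDING_TASKS OK - 0/10 pending tasks
```

A real plugin would print the message and exit with the code. A plugin that fails to start under the wrong interpreter prints nothing, which would be consistent with the "(No output returned from plugin)" result seen from the Nagios side.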
Attachment #8860360 - Flags: review?(jlorenzo) → review+
Comment on attachment 8860360 [details]
bug 1332640 fix accidental hardcoding of thresholds

https://reviewboard.mozilla.org/r/132390/#review140206

LGTM, with the same note that we'll want to make this generic for other scriptworker instance types.

::: modules/nrpe/manifests/check/check_pending_scriptworker_tasks.pp:4
(Diff revision 4)
> +# This Source Code Form is subject to the terms of the Mozilla Public
> +# License, v. 2.0. If a copy of the MPL was not distributed with this
> +# file, You can obtain one at http://mozilla.org/MPL/2.0/.
> +class nrpe::check::check_pending_scriptworker_tasks {

I think this could live in modules/scriptworker/, since it is scriptworker specific. I'm fine with keeping it where it is if it's working.

::: modules/nrpe/manifests/check/check_pending_scriptworker_tasks.pp:23
(Diff revision 4)
> + 'check_pending_scriptworker_tasks':
> + user => 'nagios',
> + runas => 'root',
> + command => "$plugins_dir/check_pending_scriptworker_tasks";
> + }
> +

errant whitespace? not sure if this will cause a warning.
Attachment #8860360 - Flags: review+
Attachment #8860360 - Flags: review?(bugspam.Callek) → review+
Nice! I've adjusted the thresholds from 5 (warning) and 10 (critical) to 45 and 50 respectively, per sfraser's request.

12:12 < nagios-releng> Wed 19:12:17 UTC [7604] [] signing-linux-3.srv.releng.use1.mozilla.com:Pending Scriptworker Tasks is OK: PENDING_TASKS OK - 0/10 pending tasks (http://m.mozilla.org/Pending+Scriptworker+Tasks)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED