Closed
Bug 1314840
Opened 8 years ago
Closed 7 years ago
additional nagios checks for signing scriptworkers
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: mozilla, Assigned: sfraser)
References
Details
Attachments
(3 files)
We already have nagios checks for signing scriptworkers (num processes) [1]. We'd like to add a few more.
First, we want to alert when the number of pending tasks gets too high. We can do this by calling the pendingTasks endpoint [2].
The credentials, provisionerId, and workerType are all in the config file. (For signing scriptworker 0.9.0 this is /builds/scriptworker/scriptworker.json; when we move to 1.0.0 this will be /builds/scriptworker/scriptworker.yaml)
We can either write a standalone script or a separate scriptworker endpoint. For the latter, we'd need to update setup.py [3] and write a function like this [4]:
context, _ = get_context_from_cmdln(sys.argv[1:])
loop = asyncio.get_event_loop()
result = loop.run_until_complete(context.queue.pendingTasks(context.config['provisioner_id'], context.config['worker_type']))
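As a rough illustration of the threshold logic such a check would need (this is not the script that was later attached), the sketch below maps a pending-task count onto the usual Nagios exit codes. The helper name `pending_status` and the example thresholds are hypothetical; the count itself would come from the `pendingTasks` response.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def pending_status(pending, warning, critical):
    """Return (exit_code, message) for a pending-task count.

    `pending_status` is a hypothetical helper; thresholds are inclusive.
    """
    if pending is None or pending < 0:
        return UNKNOWN, "UNKNOWN: could not determine pending task count"
    if pending >= critical:
        return CRITICAL, "CRITICAL: {} pending tasks".format(pending)
    if pending >= warning:
        return WARNING, "WARNING: {} pending tasks".format(pending)
    return OK, "OK: {} pending tasks".format(pending)
```

A wrapper script would print the message and exit with the returned code, which is what Nagios reads to decide whether to alert.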
Second, I'd love to have file age checks. These three files should exist and not be too old (perhaps changed within the last hour):
/builds/scriptworker/logs/create_initial_gpg_homedirs.log
/builds/scriptworker/logs/rebuild_gpg_homedirs.log
/builds/scriptworker/logs/worker.log
And this file should either not exist or be less than an hour old:
/builds/scriptworker/.gpg_homedirs.lock
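The two rules above ("must exist and be recent" versus "may be absent, but must be recent if present") can be sketched as one function with an `optional` switch. This is only an illustration under assumed names (`file_age_problem`, `max_age_seconds`), not the check that was eventually deployed.

```python
import os
import time


def file_age_problem(path, max_age_seconds, optional=False, now=None):
    """Return None if the file passes, else a short description of the problem.

    With optional=True a missing file is acceptable (the lock-file case).
    `now` can be injected for testing; it defaults to the current time.
    """
    now = time.time() if now is None else now
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None if optional else "{} does not exist".format(path)
    age = now - mtime
    if age > max_age_seconds:
        return "{} is {:.0f}s old (limit {}s)".format(path, age, max_age_seconds)
    return None
```

The three log files would be checked with `optional=False`, and `.gpg_homedirs.lock` with `optional=True`, both with a limit of 3600 seconds if the one-hour suggestion is kept.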
[1] https://hg.mozilla.org/build/puppet/file/tip/modules/signing_scriptworker/templates/nagios.cfg.erb
[2] https://docs.taskcluster.net/reference/platform/queue/api-docs#pendingTasks
[3] https://github.com/mozilla-releng/scriptworker/blob/03eb5b8/setup.py#L68-L69
[4] https://github.com/mozilla-releng/scriptworker/blob/03eb5b8/scriptworker/gpg.py#L1511-L1532
Assignee
Updated•8 years ago
Assignee: nobody → sfraser
Comment hidden (mozreview-request)
Reporter
Comment 2•8 years ago
mozreview-review
Comment on attachment 8815712 [details]
Add signing scriptworker monitoring for Bug 1314840
https://reviewboard.mozilla.org/r/96556/#review96870
::: modules/scriptworker/files/nagios_pending_tasks.py:73
(Diff revision 1)
> + report to nagios
> + """
> + args = get_args()
> +
> + context, credentials = get_context_from_cmdln(sys.argv[1:])
> + cleanup(context)
Oops, I guess I missed these earlier.
These look problematic... have you tried them?
1. This may be wrong, but sys.argv will have the --warning and --critical options, if specified. I think you're going to fail out with unrecognized arguments.
2. cleanup(context) nukes the work_dir, artifact_dir, and task_log_dir. Nagios is going to be running in the background, and tasks are going to be running in scriptworker. The tasks rely on the work_dir, artifact_dir, and task_log_dir, so every time nagios runs this check while a task is running, we're going to break the task.
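One way to sidestep the first concern is to have the check parse its own options and keep anything it does not recognise separate. The sketch below uses `argparse.ArgumentParser.parse_known_args` for that; the `--warning` and `--critical` names come from the comment above, while the function name and the default values are assumptions.

```python
import argparse


def split_args(argv):
    """Separate the check's own thresholds from any remaining arguments."""
    parser = argparse.ArgumentParser(description="pending tasks check (sketch)")
    parser.add_argument("--warning", type=int, default=5)    # default is an assumption
    parser.add_argument("--critical", type=int, default=10)  # default is an assumption
    # parse_known_args returns (namespace, leftover_list) instead of
    # exiting on arguments it does not recognise.
    return parser.parse_known_args(argv)
```

Only the leftover list would then be handed to any other parser, so the threshold options never reach it.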
I think you're mainly creating the context to get the taskcluster queue object to run `queue.pendingTasks`. However, there's an easier way.
Puppet already knows the taskcluster client id and access token: https://hg.mozilla.org/build/puppet/file/tip/modules/signing_scriptworker/manifests/init.pp#l67
You can import and call taskcluster.async Queue directly: https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/context.py#L111-L113
You already have your session from aiohttp, and you already have the credentials from puppet. If you don't want to populate them from puppet, you can read them from /builds/scriptworker/scriptworker.yaml : https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker/templates/scriptworker.yaml.erb#l6
I think that'll be cleaner. Let me know if you have questions?
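To illustrate the suggestion of querying the queue without building a scriptworker context, here is a minimal sketch that takes the queue client as a parameter. It assumes the client exposes an awaitable `pendingTasks(provisionerId, workerType)` whose response includes a `pendingTasks` count, as described in the API docs linked in the bug description; `FakeQueue`, `pending_count`, and `run_check` are hypothetical names, and a real check would pass in a taskcluster Queue built from the credentials instead of the stand-in.

```python
import asyncio


async def pending_count(queue, provisioner_id, worker_type):
    """Ask a queue client for the pending count of one worker type.

    `queue` is any object with an awaitable pendingTasks(provisionerId,
    workerType) method returning a dict that contains 'pendingTasks'.
    """
    response = await queue.pendingTasks(provisioner_id, worker_type)
    return response["pendingTasks"]


class FakeQueue:
    """Stand-in for illustration only; not part of any real client library."""

    async def pendingTasks(self, provisioner_id, worker_type):
        return {
            "provisionerId": provisioner_id,
            "workerType": worker_type,
            "pendingTasks": 3,
        }


def run_check(queue, provisioner_id, worker_type):
    """Run the coroutine to completion on a private event loop."""
    loop = asyncio.new_event_loop()
    try:
        return loop.run_until_complete(
            pending_count(queue, provisioner_id, worker_type)
        )
    finally:
        loop.close()
```

Because nothing here touches work_dir, artifact_dir, or task_log_dir, running it alongside a live scriptworker would not interfere with in-flight tasks.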
Attachment #8815712 -
Flags: review?(aki) → review-
Assignee
Comment 3•8 years ago
mozreview-review-reply
Comment on attachment 8815712 [details]
Add signing scriptworker monitoring for Bug 1314840
https://reviewboard.mozilla.org/r/96556/#review96870
> Oops, I guess I missed these earlier.
>
> These look problematic... have you tried them?
>
> 1. This may be wrong, but sys.argv will have the --warning and --critical options, if specified. I think you're going to fail out with unrecognized arguments.
> 2. cleanup(context) nukes the work_dir, artifact_dir, and task_log_dir. Nagios is going to be running in the background, and tasks are going to be running in scriptworker. The tasks rely on the work_dir, artifact_dir, and task_log_dir, so every time nagios runs this check while a task is running, we're going to break the task.
>
> I think you're mainly creating the context to get the taskcluster queue object to run `queue.pendingTasks`. However, there's an easier way.
>
> Puppet already knows the taskcluster client id and access token: https://hg.mozilla.org/build/puppet/file/tip/modules/signing_scriptworker/manifests/init.pp#l67
>
> You can import and call taskcluster.async Queue directly: https://github.com/mozilla-releng/scriptworker/blob/master/scriptworker/context.py#L111-L113
>
> You already have your session from aiohttp, and you already have the credentials from puppet. If you don't want to populate them from puppet, you can read them from /builds/scriptworker/scriptworker.yaml : https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker/templates/scriptworker.yaml.erb#l6
>
> I think that'll be cleaner. Let me know if you have questions?
The argument parsing seemed to work, but you're right, it does seem more sensible to remove any potential conflict. I'll look at the changes tomorrow.
Comment hidden (mozreview-request)
Assignee
Updated•8 years ago
Attachment #8816105 -
Flags: review?(aki)
Reporter
Comment 5•8 years ago
mozreview-review
Comment on attachment 8816105 [details]
Update the pending tasks check to no longer use/wipe the local context
https://reviewboard.mozilla.org/r/96900/#review97174
Thank you!
Attachment #8816105 -
Flags: review?(aki) → review+
Assignee
Comment 6•8 years ago
Changes pushed
Reporter
Comment 7•8 years ago
Hey Simon,
Is this bug ready to be closed?
I'm wondering if you've seen any nagios alerts about the queue? We'll probably see more load in January once we start signing all mozilla-central pushes, but verifying this works beforehand may help avoid last minute fixes.
Assignee
Comment 8•8 years ago
I've not seen any alerts, but I'm also not sure I'm set up to receive any. Where would they go by default?
Reporter
Comment 9•8 years ago
Probably to #platform-ops-alerts, which is pw protected, or #buildduty.
I don't see your gpg key info in the private releng git repo; have you added yourself? https://mana.mozilla.org/wiki/display/RelEng/Passwords
Reporter
Comment 10•8 years ago
I dug through the alerts in #platform-ops-alerts and don't see any that match 'signing-' or 'scriptworker'... it's possible we just haven't hit the alert threshold yet.
Assignee
Comment 11•8 years ago
The mana page is not found. Presumably I need to be in an extra group there, as I can't access /RelEng/
Reporter
Comment 12•8 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #11)
> mana page is not found, presumably I need to be in an extra group there, as
> I can't access /RelEng/
at https://ldapadmin1.private.scl3.mozilla.com , do you have the following groups?
cn=IntranetWiki,ou=groups,dc=mozilla
cn=RelEngWiki,ou=groups,dc=mozilla
cn=vpn_relengwiki,ou=groups,dc=mozilla
You need to vpn in to access that site.
I'm going to guess it's the 2nd one. We can file a bug to add you. If we're missing the correct group on the Day 1 checklist, let's update it :)
https://wiki.mozilla.org/ReleaseEngineering/Day_1_Checklist#LDAP.2C_SSH.2C_VPN
Reporter
Updated•8 years ago
Flags: needinfo?(sfraser)
Assignee
Comment 13•8 years ago
I've filed bug 1328233; cn=RelEngWiki isn't on the Day 1 checklist. Once I can confirm it gives me access, I'll add it.
Flags: needinfo?(sfraser)
Blocks: 1328258
Comment 14•8 years ago
Simon: are we unblocked here, i.e. do you have mana access now?
Flags: needinfo?(sfraser)
Assignee
Comment 15•8 years ago
We're unblocked, I can see the IRC channel although I've not yet seen any alerts go past. Is there a history in the nagios web interface that can show this?
Flags: needinfo?(sfraser)
Comment 16•8 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #15)
> We're unblocked, I can see the IRC channel although I've not yet seen any
> alerts go past. Is there a history in the nagios web interface that can show
> this?
I don't see anything in https://nagios.mozilla.org/releng-scl3/cgi-bin/history.cgi?host=all
Assignee
Comment 17•8 years ago
ok, I will poke at it tomorrow and try to force an alert
Reporter
Comment 18•8 years ago
I was able to force an alert for signing-linux-1 by killing the scriptworker process. That's working, but the queue + file ages aren't.
I'm now guessing we haven't added the alert to the sysadmin puppet module [1]. I think we need to patch modules/nagios/manifests/releng/services.pp in there. If we're able to get the same alerts for all scriptworker instances that use the shared scriptworker puppet module [2], that would be ideal. If we only add it to the signing scriptworkers, that's a good start.
Do you have time+headspace to keep looking at this? If you need me to help or take over, let me know. Otherwise I'm going to keep pointing you in what is hopefully the right direction :)
[1] https://mana.mozilla.org/wiki/display/SYSADMIN/Git#Git-Production
[2] https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker
Flags: needinfo?(sfraser)
Assignee
Comment 19•8 years ago
I can keep working on it, I just didn't have the access to test that it triggered.
Flags: needinfo?(sfraser)
Assignee
Comment 20•8 years ago
Was it this check_procs that alerted?
https://hg.mozilla.org/build/puppet/file/tip/modules/signingworker/templates/nagios.cfg.erb
Or this one from sysadmin puppet?
},
'signing-worker-procs' => {
service_description => 'procs - signing-worker',
contact_groups => 'build',
check_command => 'check_nrpe_procs_regex!/builds/signingworker/bin/signing-worker!1!1',
hostgroups => $::fqdn ? {
'nagios1.private.releng.scl3.mozilla.com' => [
'signing-workers'
],
default => [
]
}
},
(there's an equivalent for this for signing-scriptworkers, which also has these age/pending checks configured in nagios.cfg)
Assignee
Comment 21•8 years ago
How does the following look?
I'm nervous about the literal paths for /builds/scriptworker/ but the config definition for those is in a separate puppet
"service_file_age" => {
service_description => "Signing Scriptworker optional file ages",
check_command => 'nagios_check_file_ages.py!45!60!--optional!--from-file /builds/scriptworker/file_age_check_optionals.txt',
hostgroups => $::fqdn ? {
'nagios1.private.releng.scl3.mozilla.com' => [
'signing-scriptworkers',
],
default => [
]
}
},
"service_file_age" => {
service_description => "Signing Scriptworker file ages",
check_command => 'nagios_check_file_ages.py!45!60!--from-file /builds/scriptworker/file_age_check_required.txt',
hostgroups => $::fqdn ? {
'nagios1.private.releng.scl3.mozilla.com' => [
'signing-scriptworkers',
],
default => [
]
}
},
"service_queue_age" => {
service_description => "Pending Scriptworker Tasks",
check_command => 'nagios_pending_tasks.py!5!10',
hostgroups => $::fqdn ? {
'nagios1.private.releng.scl3.mozilla.com' => [
'signing-scriptworkers'
],
default => [
]
}
},
Flags: needinfo?(aki)
Reporter
Comment 22•8 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #20)
> Was it this check_procs that alerted?
>
> https://hg.mozilla.org/build/puppet/file/tip/modules/signingworker/templates/
> nagios.cfg.erb
>
> Or this one from sysadmin puppet?
>
> },
> 'signing-worker-procs' => {
> service_description => 'procs - signing-worker',
> contact_groups => 'build',
> check_command =>
> 'check_nrpe_procs_regex!/builds/signingworker/bin/signing-worker!1!1',
> hostgroups => $::fqdn ? {
> 'nagios1.private.releng.scl3.mozilla.com' => [
> 'signing-workers'
> ],
> default => [
> ]
> }
> },
> (there's an equivalent for this for signing-scriptworkers, which also has
> these age/pending checks configured in nagios.cfg)
This is the alert I saw... I'm going to guess it's the sysadmin puppet one. Which makes me wonder if we can remove the nagios erb's from the various scriptworker modules.
[2017-01-17 20:31:51] <nagios-releng> Tue 12:31:51 PST [4070] signing-linux-1.srv.releng.use1.mozilla.com:procs - scriptworker is CRITICAL: PROCS CRITICAL: 0 processes with regex args /builds/scriptworker/bin/scriptworker (http://m.mozilla.org/procs+-+scriptworker)
Reporter
Comment 23•8 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #21)
> How does the following look?
>
> I'm nervous about the literal paths for /builds/scriptworker/ but the config
> definition for those is in a separate puppet
I'm currently solving that by requiring all production scriptworker instances install into /builds/scriptworker.
>
> "service_file_age" => {
> service_description => "Signing Scriptworker optional file ages",
Maybe s,Signing ,, in all the descriptions.
Ideally we don't have to specify these checks for each instance type. As long as balrog, beetmover, pushapk, etc scriptworkers live in /builds/scriptworker, these checks should work for each.
Same goes for the other checks. Otherwise, lgtm, though maybe :arr or other relops person should review. Thank you!
Flags: needinfo?(aki)
Assignee
Comment 24•8 years ago
updated, and requesting input from relops:
"service_optional_file_age" => {
service_description => "Scriptworker optional file ages",
check_command => 'nagios_check_file_ages.py!45!60!--optional!--from-file /builds/scriptworker/file_age_check_optionals.txt',
hostgroups => $::fqdn ? {
'nagios1.private.releng.scl3.mozilla.com' => [
'signing-scriptworkers',
],
default => [
]
}
},
"service_file_age" => {
service_description => "Scriptworker file ages",
check_command => 'nagios_check_file_ages.py!45!60!--from-file /builds/scriptworker/file_age_check_required.txt',
hostgroups => $::fqdn ? {
'nagios1.private.releng.scl3.mozilla.com' => [
'signing-scriptworkers',
],
default => [
]
}
},
"service_queue_age" => {
service_description => "Pending Scriptworker Tasks",
check_command => 'nagios_pending_tasks.py!5!10',
hostgroups => $::fqdn ? {
'nagios1.private.releng.scl3.mozilla.com' => [
'signing-scriptworkers'
],
default => [
]
}
},
Flags: needinfo?(jwatkins)
Assignee
Comment 25•8 years ago
For clarity, just want to make sure that the above will do what we expect (that is, run the commands with the given arguments, report errors)
Comment 26•8 years ago
(In reply to Simon Fraser [:sfraser] ⌚️GMT from comment #24)
> updated, and requesting input from relops:
>
> "service_optional_file_age" => {
> service_description => "Scriptworker optional file ages",
> check_command =>
> 'nagios_check_file_ages.py!45!60!--optional!--from-file
> /builds/scriptworker/file_age_check_optionals.txt',
AFAIK, this will not work. You will need to also define a 'check_command' under puppet/modules/nagios/manifests/mozilla/checkcommands.pp which passes over to nrpe. The check_command under services.pp should specify the check (as defined in checkcommands.pp) and any variable arguments to be passed.
eg. 'nagios_check_file_ages!$ABS_PATH_TO_FILE!$WARNING_THRESHOLD_INT!$ERROR_THRESHOLD_INT'
Those args are then interpolated with the string defined in checkcommands.pp which should then call nrpe with the passed arguments, including the name of the check as defined on the host itself ( https://hg.mozilla.org/build/puppet/file/tip/modules/nrpe/manifests/check.pp#l19 )
If you want to check the nrpe response from the host before modifying nagios, you can log into nagios1.private.releng.scl3.mozilla.com and exec /usr/lib64/nagios/plugins/check_nrpe directly
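The interpolation described above can be pictured with a small sketch: the service's check_command string is split on `!`, the first piece names a command definition, and the remaining pieces fill `$ARG1$`, `$ARG2$`, and so on in that definition's command line. This is a simplified model in Python, not Nagios's own code; the function name and the example definition are hypothetical, and only `$ARGn$` macros are handled.

```python
import re


def expand_check_command(check_command, command_lines):
    """Expand 'name!arg1!arg2' against a table of command definitions.

    Only $ARGn$ macros are handled; real Nagios supports many more.
    """
    name, *args = check_command.split("!")
    template = command_lines[name]

    def substitute(match):
        index = int(match.group(1)) - 1
        # In this sketch an unset $ARGn$ expands to an empty string.
        return args[index] if 0 <= index < len(args) else ""

    return re.sub(r"\$ARG(\d+)\$", substitute, template)
```

For example, with a hypothetical definition `check_nrpe -H somehost -c check_file_age -a $ARG1$ $ARG2$ $ARG3$`, the service string `check_file_age_sketch!/builds/scriptworker/logs/worker.log!2700!3600` would expand to a check_nrpe call carrying the path and both thresholds.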
Flags: needinfo?(jwatkins)
Assignee
Comment 27•8 years ago
I don't have access to nagios1.private.releng, and the rest of the team appears not to, either.
I suspect that the commands to try are these:
/usr/lib64/nagios/plugins/check_nrpe -H signing-linux-1.srv.releng.use1.mozilla.com -t 30 -c nagios_pending_tasks.py -w 0 -c 1
/usr/lib64/nagios/plugins/check_nrpe -H signing-linux-1.srv.releng.use1.mozilla.com -t 15 -c check_file_age -a "-w 2700 -c 3600 -W -0 -C -0 -f /builds/scriptworker/logs/worker.log"
/usr/lib64/nagios/plugins/check_nrpe -H signing-linux-1.srv.releng.use1.mozilla.com -t 15 -c check_file_age_ok_not_exists -a 2700 3600 /builds/scriptworker/.gpg_homedirs.lock
Comment 28•8 years ago
A few things:
1) I forgot that check_nrpe is installed on all hosts, so you can test calling your check through check_nrpe directly on the localhost
2) The /etc/nagios/nrpe.d/signingworker.cfg is missing ']' on a couple of lines. You should also use $ARG1$ $ARG2$ etc. in that file rather than hardcoding threshold numbers
3) The nagios_pending_tasks.py should be world-executable (0755)
4) You should be able to run the check script by hand and see the formatted output nagios is expecting.
eg. ./check_ntp_peer -H localhost
NTP OK: Offset 0.011208 secs|offset=0.011208s;60.000000;120.000000;
5) It looks like the ./nagios_pending_tasks.py script is missing the asyncio python lib. I didn't look at the other script.
Traceback (most recent call last):
File "./nagios_pending_tasks.py", line 18, in <module>
import asyncio
ImportError: No module named asyncio
I believe you can use a virtualenv but that will need to be sourced properly on the [check] line in /etc/nagios/nrpe.d/check_ntp_peer.cfg file
Hope that helps!
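The formatted output mentioned in point 4 follows the common plugin convention of a status line, optionally followed by `|` and performance data in `label=value;warn;crit;` form, as in the check_ntp_peer example. A minimal sketch of producing such a line is below; the function name and the "PENDING" service label are hypothetical.

```python
def format_plugin_output(service, status, text, perfdata=None):
    """Build 'SERVICE STATUS: text|label=value;warn;crit;' style output.

    `perfdata` is an optional (label, value, warn, crit) tuple.
    """
    line = "{} {}: {}".format(service, status, text)
    if perfdata:
        label, value, warn, crit = perfdata
        line += "|{}={};{};{};".format(label, value, warn, crit)
    return line
```

Running a check by hand and comparing its output with this shape is a quick way to confirm Nagios will be able to parse it.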
Updated•8 years ago
Blocks: tcmigration-cleanup
Reporter
Comment 29•7 years ago
(In reply to Jake Watkins [:dividehex] from comment #28)
> 5) It looks like the ./nagios_pending_tasks.py script is missing the asyncio
> python lib. I didn't look at the other script.
> Traceback (most recent call last):
> File "./nagios_pending_tasks.py", line 18, in <module>
> import asyncio
> ImportError: No module named asyncio
>
> I believe you can use a virtualenv but that will need to be sourced properly
> on the [check] line in /etc/nagios/nrpe.d/check_ntp_peer.cfg file
The scriptworker venv in /builds/scriptworker/ should have py35, which has asyncio bundled. That would be /builds/scriptworker/bin/python, though we should use ${basedir}/bin/python like https://hg.mozilla.org/build/puppet/file/tip/modules/scriptworker/manifests/instance.pp#l12 .
Reporter
Comment 30•7 years ago
I think we can call this fixed, no? Thank you!
Reporter
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
Component: General Automation → General