Closed Bug 1321513 Opened 8 years ago Closed 8 years ago

Add nagios checks for pushapk scriptworkers

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlorenzo, Assigned: jlorenzo)

References

Details

Attachments

(2 files, 2 obsolete files)

Nagios config files have been created on the nodes sides[1] (added in bug 1307826). Pushapkworker doesn't rely on the scriptworker shared module yet (introduced in bug 1316702). If I understand bug 1317783 comment 3 correctly, this may not happen immediately (the scriptworker module enables chain of trust AFAUI). Therefore, I believe we can turn on nagios just like Aki did in bug 1295196 [1] https://hg.mozilla.org/build/puppet/file/8fbdd01cb464/modules/pushapkworker/templates/nagios.cfg.erb
Things have changed since bug 1295196 comment 1, no more SVN repo is required. Releng folks have access to the git repos listed at [1] (and more precisely, the nagios part of it) [1] https://mana.mozilla.org/wiki/display/SYSADMIN/Git
Attached patch add_pushapk_scriptworker_in_nagios.diff (obsolete) (deleted) — Splinter Review
The patch is totally similar to bug 1295196. Before asking for a formal review to the folks in #moc, I have a question for you, Aki. I'm not sure I need to add the resource "pushapk-scriptworker-procs", because "scriptworker-procs" already exist. Should I create it anyway, or is it better to just add a hostgroup to "scriptworker-procs"?
Attachment #8816127 - Flags: feedback?(aki)
Comment on attachment 8816127 [details] [diff] [review] add_pushapk_scriptworker_in_nagios.diff The current idea is to have each scriptworker instance use scriptworker-procs, rather than have one *-procs check per scriptworker instance type, especially since I believe they'll look exactly the same other than the name. If there isn't a strong reason to add another check, let's reuse the check.
Attachment #8816127 - Flags: feedback?(aki) → feedback+
Attached patch add_pushapk_scriptworker_in_nagios_rev2.diff (obsolete) (deleted) — Splinter Review
Here's a new revision where pushapk-scriptworkers is now listed under "scriptworker-procs" instead of a newly created entry. Aki, do you have the rights to review and merge my patch? Or should I ask Amy?
Attachment #8816127 - Attachment is obsolete: true
Attachment #8823219 - Flags: review?(aki)
Oops, I should've added --no-color when generating the diff.
Attachment #8823219 - Attachment is obsolete: true
Attachment #8823219 - Flags: review?(aki)
Attachment #8823221 - Flags: review?(aki)
For the record, the patch has also been rebased on top of the latest master.
Comment on attachment 8823221 [details] [diff] [review] add_pushapk_scriptworker_in_nagios_rev3.diff I think :arr is a better reviewer here.
Attachment #8823221 - Flags: review?(aki) → review?(arich)
Comment on attachment 8823221 [details] [diff] [review] add_pushapk_scriptworker_in_nagios_rev3.diff This looks fine to me. I presume that /builds/scriptworker/bin/scriptworker already exists on the client and that the nagios config has been modified in releng puppet to include the nrpe check?
Attachment #8823221 - Flags: review?(arich) → review+
The releng had been modified to support the nrpe check[1]. However, thanks for calling out `/builds/scriptworker`! I noticed pushapk-scriptworker and signing-scriptworker don't share the same root[2]. Therefore, something needs to be changed: a) either on the releng puppet; we make the root pushapk at /build/scriptworker b) or on the current nagios config; we extend the regex to support other roots. I see drawbacks in both scenarios: a) if you're logged on a machine, you may not know what type of scriptworker instance you have, unless you dig into its configuration b) the regex might become a nightmare after several instances being added. On the other had, having consistency across scriptworker instances is better, and will help setting up new ones. Moreover, if you're logged on a machine, there's great probability its FQDN refers to what type of scriptworker instance you're dealing with. Then, I'll change the puppet configuration. [1] https://hg.mozilla.org/build/puppet/file/3e2076f7aedc/modules/pushapkworker/manifests/init.pp#l63 [2] Pushapk being defined as: https://hg.mozilla.org/build/puppet/file/3e2076f7aedc/manifests/moco-config.pp#l453 Signing-scriptworker as: https://hg.mozilla.org/build/puppet/file/3e2076f7aedc/modules/signing_scriptworker/manifests/settings.pp#l2
Comment on attachment 8823758 [details] Bug 1321513 - Make pushapkworker live under /build/scriptworker https://reviewboard.mozilla.org/r/102242/#review102620 At some point, we should get pushapkworker using modules/scriptworker, and we can make sure all workers using modules/scriptworker have a /builds/scriptworker root. I think this works until that point.
Attachment #8823758 - Flags: review?(aki) → review+
Even though the puppet run was passing yesterday, a nit slipped through the patch. I updated it. The latest revision of that patch passed lint tests there: https://github.com/mozilla/build-puppet/pull/24 I tested the changes against dev-linux64-ec2-jlorenzo.dev.releng.use1.m.c: * /build/pushapkworker doesn't exist anymore * /build/scriptworker does and contains all the files * `supervisorctl status pushapkworker` says the process is running * /builds/scriptworker/logs/worker.log shows the process does what it needs to do (polling TC's queue) I think we're good to merge and deploy attachment 8823758 [details] (which is necessary to land attachment 8823221 [details] [diff] [review]).
Landed on default [1] and merged in production [2]. Puppet agent applied the changes restarted the process (which restarted properly)[3]. On the production machine, I see now /builds/scriptworker/ but not /build/pushapkworker/ anymore. This complies with [4]. [1] https://hg.mozilla.org/build/puppet/rev/1a429ca163d9 [2] https://hg.mozilla.org/build/puppet/rev/b4d2f4f777e4 [3] https://papertrailapp.com/systems/526654063/events?focus=753674615796166664 [4] https://papertrailapp.com/events?q=program%3Apuppet-agent&focus=753673798523449394
Comment on attachment 8823221 [details] [diff] [review] add_pushapk_scriptworker_in_nagios_rev3.diff This patch was landed as revision 568ad3460d032bf498f7bf0e9239938244f64784 on master in the git repo (linked in comment 1).
Attachment #8823221 - Flags: checked-in+
I connected to the releng nagios instance[1] and looked for pushapkwoker. The entry there shows: * 7 checks (PING + the 6 new checks added with attachment 8823221 [details] [diff] [review]). * /builds/scriptworker/bin/scriptworker is declared as running. I can tell it's also working, because this job[2] has just finished and it ran fine. * Every check is green. Looks like we're all set. Time to close this bug! Thanks for the help Ami and Aki! [1] Listed at https://mana.mozilla.org/wiki/display/SYSADMIN/Nagios#Nagios-Hardware [2] https://tools.taskcluster.net/task-inspector/#Tmlx52SQT9uwVhi9oEqagg/
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
(In reply to Aki Sasaki [:aki] from comment #12) > At some point, we should get pushapkworker using modules/scriptworker, and > we can make sure all workers using modules/scriptworker have a > /builds/scriptworker root. I agree. I looked more into using this module. I get the feeling I need a greater version than 0.7.1. That is to say, Chain of Trust has to be enabled. Hence, I guess the module can be added alongside bug 1317783.
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: