Closed Bug 735965 Opened 13 years ago Closed 7 years ago

nagios checks for old signmar processes on signing servers

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

x86_64
Linux
task
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: catlee, Unassigned)

References

Details

(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1960] )

Abandoned signmar processes are bad since they chew up the cpu FOREVAR! (see bug 732543) Let's get some nagios checks to catch these bad guys.

Here are some that we caught today:

cltsign 21503 1 86 Mar13 ? 1-06:21:24 /home/cltsign/instances/rel-key-signing-server-2/bin/signmar -d /home/cltsign/instances/rel-key-signing-server-2/secrets/mar -n rel1 -s /home/cltsign/instances/rel-key-signing-server-2/test-files/test.mar /tmp/tmpL2GFRB/test.mar.tmp
cltsign 21514 1 87 Mar13 ? 1-06:27:04 /home/cltsign/instances/rel-key-signing-server-2/bin/signmar -d /home/cltsign/instances/rel-key-signing-server-2/secrets/mar -n rel1 -s /home/cltsign/instances/rel-key-signing-server-2/test-files/test.mar /tmp/tmpL2GFRB/test.mar.tmp

A few things we could look at to catch them:
- ppid is 1
- they're old (from yesterday)
- they're signing the test mars
- they're hogging the CPUs
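The criteria in this comment can be sketched as a small filter over `ps -ef` output. This is only an illustration, not the check that was deployed: the function names, the one-hour CPU-time threshold, and the assumption that the columns are exactly `UID PID PPID C STIME TTY TIME CMD` (with TIME formatted as `[DD-]HH:MM:SS`) are all hypothetical.

```python
import re


def cpu_time_to_seconds(value):
    """Convert a ps TIME field such as '1-06:21:24' or '00:00:03' to seconds."""
    days = 0
    if "-" in value:
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def parse_ps_ef_line(line):
    """Parse one `ps -ef` row; return None if it does not have 8 columns."""
    # Assumed column order: UID PID PPID C STIME TTY TIME CMD
    fields = line.split(None, 7)
    if len(fields) < 8:
        return None
    user, pid, ppid, _cpu, _stime, _tty, cpu_time, cmd = fields
    return {
        "user": user,
        "pid": int(pid),
        "ppid": int(ppid),
        "cpu_seconds": cpu_time_to_seconds(cpu_time),
        "cmd": cmd,
    }


def looks_abandoned(proc, min_cpu_seconds=3600):
    """Apply the heuristics from the comment: orphaned, signmar, CPU hog.

    min_cpu_seconds is a made-up threshold for illustration.
    """
    is_orphan = proc["ppid"] == 1
    is_signmar = re.search(r"/signmar(\s|$)", proc["cmd"]) is not None
    is_hog = proc["cpu_seconds"] >= min_cpu_seconds
    return is_orphan and is_signmar and is_hog
```

Fed one of the rows quoted above, `looks_abandoned(parse_ps_ef_line(row))` would flag it, while a short-lived signmar that still has its real parent would pass.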
Hm, so should we try to use check_procs for this? I'm not sure what a "normal" process looks like vs a "rogue" one. http://nagiosplugins.org/man/check_procs details what we can look for. Should we do something like:

check_procs -w 30 -c 50 --metric=CPU -p 1 -u cltsign -C /home/cltsign/instances/rel-key-signing-server-2/bin/signmar

Or will that flag normal signing processes as well? Right now there's nothing called /home/cltsign/instances/rel-key-signing-server-2/bin/signmar running, so I'm not sure. Should we be checking the argument array (if an argument includes test.mar, will that always be a "rogue" process)?
Assignee: server-ops-releng → arich
Perhaps --metric=ELAPSED, if it can cope with multiple processes of the same name. AIUI these processes are launched for each file to sign and should run for a smallish number of seconds.
Processes with a parent pid of 1 are 'rogue', so I think '-p 1' should be fine. --metric=ELAPSED is probably a better thing to watch than CPU usage. These processes should not run for more than, let's say, 60 seconds.
So you want to make a check definition on the client end called check_rogue_signmar (or something) that has:

check_procs -m ELAPSED 60 -u cltsign -p 1 -C /home/cltsign/instances/rel-key-signing-server-2/bin/signmar

And I can add that to the servers?
(In reply to Amy Rich [:arich] [:arr] from comment #4)
> So you want to make a check definition on the client end called
> check_rogue_signmar (or something) that has:
>
> check_procs -m ELAPSED 60 -u cltsign -p 1 -C
> /home/cltsign/instances/rel-key-signing-server-2/bin/signmar

signmar could be run from any of the signing instances on the machine, so hardcoding this path won't catch everything. Could we find any process matching '*/signmar'?
check_procs -m ELAPSED 60 -u cltsign -p 1 --ereg-argument-array=signmar ?
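For anyone who wants to see the logic of this check outside check_procs, here is a rough Python sketch of the same idea: owned by cltsign, parent pid 1, older than 60 seconds, and "signmar" matching anywhere in the argument list. It is not the check that was deployed. It assumes the input text comes from something like `ps -eo user=,pid=,ppid=,etimes=,args=` (the `etimes` column, elapsed seconds, is a procps-ng feature), treats the bare 60 as a critical threshold (check_procs itself takes thresholds through -w/-c), and uses the usual Nagios plugin convention of exit code 0 for OK and 2 for CRITICAL. The function name is borrowed from the check_rogue_signmar suggestion above; everything else is hypothetical.

```python
import re

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes


def check_rogue_signmar(ps_output, user="cltsign", max_elapsed=60,
                        pattern=r"signmar"):
    """Return (exit_code, message) for rows of 'user pid ppid etimes args'."""
    rogue = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # blank or truncated row
        owner, pid, ppid, elapsed, args = fields
        if not (pid.isdigit() and ppid.isdigit() and elapsed.isdigit()):
            continue  # header line or unexpected format
        if owner != user or int(ppid) != 1:
            continue  # only orphaned processes of the signing user
        if int(elapsed) > max_elapsed and re.search(pattern, args):
            rogue.append(int(pid))
    if rogue:
        pids = ", ".join(str(pid) for pid in rogue)
        return CRITICAL, f"CRITICAL: {len(rogue)} rogue signmar process(es): {pids}"
    return OK, "OK: no rogue signmar processes"
```

A wrapper script would print the message and call `sys.exit()` with the returned code, which is what NRPE-style checks expect. Matching on the argument list rather than a fixed path is what lets this catch signmar from any signing instance.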
Just checking back in to see if this check has been implemented on the clients yet?
Please assign this back to me when the check has been implemented on the client end, and I will add it to the server.
Assignee: arich → nobody
Component: Server Operations: RelEng → Release Engineering: Machine Management
QA Contact: arich → armenzg
Product: mozilla.org → Release Engineering
This is not a Buildduty type of task but an addition to one of our supported platforms. Moving to the right component. Please adjust if needed. (Needinfo on me is required as I do not CC to bugs by default).
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
catlee, bhearsum: can one of you comment as to whether comment #6 is the proper check to be running on the signing server?
Flags: needinfo?(catlee)
Flags: needinfo?(bhearsum)
(In reply to Chris Cooper [:coop] from comment #10)
> catlee, bhearsum: can one of you comment as to whether comment #6 is the
> proper check to be running on the signing server?

I think that's correct based on rereading and fiddling with check_procs a little bit.
Flags: needinfo?(catlee)
Flags: needinfo?(bhearsum)
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1960]
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard