Closed
Bug 735965
Opened 13 years ago
Closed 7 years ago
nagios checks for old signmar processes on signing servers
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: catlee, Unassigned)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1960] )
Abandoned signmar processes are bad since they chew up the cpu FOREVAR! (see bug 732543)
Let's get some nagios checks to catch these bad guys. Here are some that we caught today:
cltsign 21503 1 86 Mar13 ? 1-06:21:24 /home/cltsign/instances/rel-key-signing-server-2/bin/signmar -d /home/cltsign/instances/rel-key-signing-server-2/secrets/mar -n rel1 -s /home/cltsign/instances/rel-key-signing-server-2/test-files/test.mar /tmp/tmpL2GFRB/test.mar.tmp
cltsign 21514 1 87 Mar13 ? 1-06:27:04 /home/cltsign/instances/rel-key-signing-server-2/bin/signmar -d /home/cltsign/instances/rel-key-signing-server-2/secrets/mar -n rel1 -s /home/cltsign/instances/rel-key-signing-server-2/test-files/test.mar /tmp/tmpL2GFRB/test.mar.tmp
A few things we could look at to catch them:
- ppid is 1
- they're old (from yesterday)
- they're signing the test mars
- they're hogging the CPUs
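As a rough illustration of the heuristics listed above, here is a minimal Python sketch that classifies a single line of `ps -ef` output. It assumes the column order shown in the pasted output (UID, PID, PPID, C, STIME, TTY, TIME, CMD). The function names and the 60-second CPU limit are hypothetical and not part of any existing check. The "old" heuristic is left out because the STIME column changes format with process age.

```python
import re

# ps prints cumulative CPU time as [[DD-]HH:]MM:SS, e.g. "1-06:21:24".
_PS_TIME = re.compile(r"^(?:(?:(\d+)-)?(\d+):)?(\d+):(\d+)$")


def ps_time_to_seconds(value):
    """Convert a ps TIME value such as '1-06:21:24' to seconds."""
    match = _PS_TIME.match(value)
    if match is None:
        raise ValueError("unrecognised ps time: %r" % value)
    days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def describe_signmar(ps_line, cpu_limit=60):
    """Return the heuristics a signmar process trips, or None for other commands.

    cpu_limit is an illustrative threshold in seconds, not a measured value.
    """
    _uid, pid, ppid, _c, _stime, _tty, cpu_time, cmd = ps_line.split(None, 7)
    argv = cmd.split()
    if argv[0] != "signmar" and not argv[0].endswith("/signmar"):
        return None
    return {
        "pid": int(pid),
        "orphaned": int(ppid) == 1,  # reparented to init
        "signing_test_mar": any(a.endswith("/test.mar") for a in argv[1:]),
        "cpu_hog": ps_time_to_seconds(cpu_time) > cpu_limit,
    }
```

Feeding it either of the two lines pasted above should report all three flags as true.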
Comment 1•13 years ago
Hm, so should we try to use check_procs for this? I'm not sure what a "normal" process looks like vs a "rogue" one.
http://nagiosplugins.org/man/check_procs details what we can look for.
Should we do something like:
check_procs -w 30 -c 50 --metric=CPU -p 1 -u cltsign -C /home/cltsign/instances/rel-key-signing-server-2/bin/signmar
Or will that also flag normal signing processes? Right now there's nothing called /home/cltsign/instances/rel-key-signing-server-2/bin/signmar running, so I'm not sure.
Should we be checking the argument array? If the arguments include test.mar, will that always be a "rogue" process?
Updated•13 years ago
Assignee: server-ops-releng → arich
Comment 2•13 years ago
Perhaps --metric=ELAPSED, if it can cope with multiple processes of the same name. AIUI these processes are launched once per file to sign and should run for only a few seconds.
Reporter
Comment 3•13 years ago
Processes with a parent pid of 1 are 'rogue', so I think '-p 1' should be fine. --metric=ELAPSED is probably a better thing to watch than CPU usage. Let's say these processes should not run for more than 60 seconds.
Comment 4•13 years ago
So you want to make a check definition on the client end called check_rogue_signmar (or something) that has:
check_procs -m ELAPSED 60 -u cltsign -p 1 -C /home/cltsign/instances/rel-key-signing-server-2/bin/signmar
And I can add that to the servers?
Reporter
Comment 5•13 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #4)
> So you want to make a check definition on the client end called
> check_rogue_signmar (or something) that has:
>
> check_procs -m ELAPSED 60 -u cltsign -p 1 -C
> /home/cltsign/instances/rel-key-signing-server-2/bin/signmar
signmar could be run from any of the signing instances on the machine, so hardcoding this path won't catch everything. Could we match any process whose command matches '*/signmar'?
Comment 6•13 years ago
check_procs -m ELAPSED 60 -u cltsign -p 1 --ereg-argument-array=signmar
?
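To make the intent of that command concrete, here is a hedged Python sketch of the same filtering logic: owned by cltsign, parent pid 1, elapsed time over 60 seconds, and an argument string matching a regex. It is not how check_procs itself is implemented. The function name, the tuple layout, the message text, and the choice to report CRITICAL rather than WARNING are all assumptions. Only the exit codes 0 and 2 follow the usual Nagios plugin convention.

```python
import re

# Conventional Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def check_rogue_signmar(processes, pattern=r"signmar", user="cltsign", max_elapsed=60):
    """Return (exit_code, message) for an iterable of process tuples.

    Each tuple is (user, ppid, elapsed_seconds, argument_string). This layout
    is hypothetical; a real check would have to build it from ps or /proc.
    """
    regex = re.compile(pattern)
    rogue = [
        proc for proc in processes
        if proc[0] == user            # like -u cltsign
        and proc[1] == 1              # like -p 1
        and proc[2] > max_elapsed     # like the ELAPSED metric with a 60s limit
        and regex.search(proc[3])     # like --ereg-argument-array=signmar
    ]
    if rogue:
        return CRITICAL, "SIGNMAR CRITICAL: %d rogue process(es)" % len(rogue)
    return OK, "SIGNMAR OK: no rogue processes"
```

Because the regex is searched against the whole argument string, it matches signmar from any signing instance path, which addresses the concern in comment 5. It would also match any other command line that happens to contain "signmar".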
Comment 7•13 years ago
Just checking back in to see if this check has been implemented on the clients yet?
Comment 8•13 years ago
Please assign this back to me when the check has been implemented on the client end, and I will add it to the server.
Assignee: arich → nobody
Component: Server Operations: RelEng → Release Engineering: Machine Management
QA Contact: arich → armenzg
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 9•11 years ago
This is not a Buildduty type of task but an addition to one of our supported platforms.
Moving to the right component.
Please adjust if needed.
(Needinfo on me is required as I do not CC to bugs by default).
Component: Buildduty → Platform Support
QA Contact: armenzg → coop
Comment 10•11 years ago
catlee, bhearsum: can one of you comment as to whether comment #6 is the proper check to be running on the signing server?
Flags: needinfo?(catlee)
Flags: needinfo?(bhearsum)
Comment 11•11 years ago
(In reply to Chris Cooper [:coop] from comment #10)
> catlee, bhearsum: can one of you comment as to whether comment #6 is the
> proper check to be running on the signing server?
I think that's correct based on rereading and fiddling with check_procs a little bit.
Flags: needinfo?(catlee)
Flags: needinfo?(bhearsum)
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1960]
Reporter
Updated•7 years ago
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WORKSFORME
Assignee
Updated•6 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard