Closed
Bug 591923
Opened 14 years ago
Closed 9 years ago
Monitor for excessive failed puppet runs
Categories
(Release Engineering :: General, defect, P4)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Unassigned)
References
()
Details
(Whiteboard: [puppet][monitoring])
It seems that we landed some puppet changes on Friday, and slaves slowly stopped being able to complete a puppet run successfully, so they didn't connect to buildbot.
We should have a way of noticing this before there is a large number of slaves down.
Maybe we should get an email like we get from the masters when there are exceptions.
See bug 591803#c3 for more details on what happened.
Here is the small fix on puppet that got us into the bad state:
http://hg.mozilla.org/build/puppet-manifests/rev/253f67007deb
Reporter
Updated•14 years ago
Priority: -- → P4
Comment 1•14 years ago
(In reply to comment #0)
> See bug 591803#c3 for more details on what it happened.
> Here is the small fix on puppet that got us into the bad state:
> http://hg.mozilla.org/build/puppet-manifests/rev/253f67007deb
s/into/out of/. The URL looks bogus; did you mean bug 590720?
A brute-force method would be to poll the master for a list of builders and check that each has some minimum number of slaves connected. That'd work for test masters, which are broken down by platform and where all slaves are expected to be connected, but not so well for pm01/03, where we have the full list for the pool and only expect some to be connected.
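A minimal sketch of that brute-force poll, assuming the old buildbot 0.8 web-status JSON API (`/json/builders`); the master URL and the minimum-slave threshold here are made-up placeholders, and the check logic is factored into a pure function so it can be reasoned about separately from the HTTP fetch:

```python
import json
from urllib.request import urlopen

# Assumptions: MASTER_URL and MIN_SLAVES are hypothetical values, and
# /json/builders is the buildbot 0.8 web-status endpoint that returns a
# dict of builder-name -> builder-info, where builder-info["slaves"]
# lists the slaves attached to that builder.
MASTER_URL = "http://test-master:8010"
MIN_SLAVES = 2

def understaffed_builders(builders, minimum):
    """Return builder names with fewer than `minimum` attached slaves."""
    return sorted(
        name for name, info in builders.items()
        if len(info.get("slaves", [])) < minimum
    )

def poll_master(url=MASTER_URL, minimum=MIN_SLAVES):
    """Fetch the builder list from the master and flag understaffed ones."""
    with urlopen(url + "/json/builders") as resp:
        builders = json.load(resp)
    return understaffed_builders(builders, minimum)
```

As comment 1 notes, this only works where every slave is expected to be connected; for pooled masters the threshold would have to be tuned per builder.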
Comment 2•14 years ago
I'm thinking more along the lines of the twisted exception watcher we have on the buildbot masters.
Comment 3•13 years ago
I didn't see this bug when I started working on bug 690590. The proposed solution in that bug is specifically monitoring for puppetca problems, but I could easily adapt it to also handle more general errors.
It wouldn't be hard to adapt that script to also look for failed puppet runs.
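One way such a script could detect failed runs is by scanning puppet agent log output for known failure markers. This is a sketch under assumptions: the error patterns below are illustrative examples of messages puppet emits on failed runs, not a complete or authoritative list, and the log format is whatever syslog produced on these hosts:

```python
import re

# Assumption: these patterns cover common puppet-agent failure messages
# (e.g. "Could not retrieve catalog ..."); a real deployment would tune
# this list against its own logs.
ERROR_PATTERNS = [
    re.compile(r"Could not retrieve catalog"),
    re.compile(r"Failed to apply catalog"),
    re.compile(r"\(err\):"),
]

def failed_run_lines(log_lines):
    """Return the log lines that indicate a failed puppet run."""
    return [
        line for line in log_lines
        if any(pattern.search(line) for pattern in ERROR_PATTERNS)
    ]
```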
Should we dupe to bug 690590?
Comment 4•11 years ago
Found in triage.
(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #3)
> I didn't see this bug when i started working on bug 690590. The proposed
> solution in that bug is specifically monitoring for puppetca problems, but I
> could easily adapt it to also handle more general errors.
>
> It wouldn't be hard to adapt that script to also look for failed puppet runs.
>
> Should we dupe to bug 690590?
Maybe, maybe not. Moving to the correct component for now. Let's see what makes sense to people with more context.
Blocks: re-nagios
Component: Release Engineering → Release Engineering: Developer Tools
QA Contact: hwine
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 5•11 years ago
All puppet masters are monitored for basic functionality.
We currently get emailed about failed puppet runs, but that's not an alert. Also, transient failures are harmless, as they automatically retry. The events we need to know about are when many hosts fail at once.
Foreman will give us a more comprehensive view, so perhaps there's some way to extract a meaningful signal from Foreman into nagios.
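One shape such a bridge could take is a nagios plugin that asks Foreman how many hosts are currently failing and maps that count to a nagios exit status. This is a sketch under assumptions: the Foreman URL, the exact host-search query string, and the warning/critical thresholds are all hypothetical and would need checking against the local Foreman install; only the count-to-status mapping is plain nagios convention (0=OK, 1=WARNING, 2=CRITICAL):

```python
import json
import sys
from urllib.request import urlopen

# Hypothetical values: adjust for the local Foreman instance and for
# how many simultaneously failing hosts should page someone.
FOREMAN_URL = "https://foreman.example.com"
WARN, CRIT = 5, 20

def classify(failed_count, warn=WARN, crit=CRIT):
    """Map a count of failing hosts to a (nagios exit code, message) pair."""
    if failed_count >= crit:
        return 2, "CRITICAL: %d hosts failing puppet" % failed_count
    if failed_count >= warn:
        return 1, "WARNING: %d hosts failing puppet" % failed_count
    return 0, "OK: %d hosts failing puppet" % failed_count

def check_foreman(url=FOREMAN_URL):
    # Assumption: Foreman's hosts API accepts a status-based search;
    # the exact query syntax here is illustrative, not verified.
    with urlopen(url + "/api/hosts?search=status.failed+%3E+0") as resp:
        total = json.load(resp)["total"]
    status, message = classify(total)
    print(message)
    sys.exit(status)
```

Thresholding on the count of failing hosts matches the observation above: single transient failures retry on their own, so only a spike across many hosts should alert.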
Summary: Add monitoring to puppet masters → Monitor for excessive failed puppet runs
Comment 6•9 years ago
Amy, Dustin - are we satisfied with the current level of Puppet monitoring? If so, we should resolve this.
Flags: needinfo?(arich)
Updated•9 years ago
Flags: needinfo?(arich)
Assignee
Updated•8 years ago
Component: Tools → General