Closed Bug 591923 Opened 14 years ago Closed 9 years ago

Monitor for excessive failed puppet runs

Categories

(Release Engineering :: General, defect, P4)

x86
macOS
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References


Details

(Whiteboard: [puppet][monitoring])

It seems that we landed some puppet changes on Friday, and slaves slowly started failing their puppet runs and not connecting to buildbot. We should have a way of noticing this before a large number of slaves are down. Maybe we should get an email, like the ones we get from the masters when there are exceptions. See bug 591803#c3 for more details on what happened. Here is the small fix on puppet that got us into the bad state: http://hg.mozilla.org/build/puppet-manifests/rev/253f67007deb
Priority: -- → P4
(In reply to comment #0)
> See bug 591803#c3 for more details on what happened.
> Here is the small fix on puppet that got us into the bad state:
> http://hg.mozilla.org/build/puppet-manifests/rev/253f67007deb

s/into/out of/. The URL looks bogus; did you mean to set 590720?

A brute force method would be to poll the master for a list of builders and check that each has some minimum number of slaves connected. That'd work for test masters, which are broken down by platform and where all slaves are expected to be connected, but not so well for pm01/03, where we have the full list for the pool and only expect some to be connected.
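Something like the rough sketch below; the master URL, the minimum-slave threshold, and the JSON field names are assumptions only -- buildbot's /json/builders output varies between versions, so this isn't taken from our actual setup:

#!/usr/bin/env python
"""Poll a buildbot master's JSON API and warn when any builder has fewer
slaves attached than expected."""

import json
import sys
import urllib2

MASTER_URL = "http://test-master01.build.example.com:8012"  # hypothetical master
MIN_SLAVES = 2  # minimum slaves we expect attached to each builder

def builders_short_of_slaves(master_url, min_slaves):
    builders = json.load(urllib2.urlopen(master_url + "/json/builders"))
    short = []
    for name, info in builders.iteritems():
        # "slaves" is assumed to list the slaves currently attached to the
        # builder; adjust the key for the buildbot version in use.
        attached = info.get("slaves", [])
        if len(attached) < min_slaves:
            short.append((name, len(attached)))
    return short

if __name__ == "__main__":
    short = builders_short_of_slaves(MASTER_URL, MIN_SLAVES)
    for name, count in short:
        print "WARNING: builder %s has only %d slaves attached" % (name, count)
    sys.exit(1 if short else 0)

As noted above, this fits test masters where every slave should be connected, but it would need per-pool thresholds to be useful for pm01/03.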
I'm thinking more along the lines of the twisted exception watcher we have on the buildbot masters.
I didn't see this bug when I started working on bug 690590. The proposed solution in that bug is specifically monitoring for puppetca problems, but I could easily adapt it to also handle more general errors. It wouldn't be hard to adapt that script to also look for failed puppet runs. Should we dupe to bug 690590?
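For what it's worth, the "look for failed puppet runs" part could be as simple as a log scan on the master. This is only a sketch; the log path, the failure patterns, and the threshold are assumptions, not what the bug 690590 script actually does:

#!/usr/bin/env python
"""Count puppet-run failures in the master's syslog and complain when the
count crosses a threshold."""

import re
import sys

LOGFILE = "/var/log/messages"            # assumed syslog location
FAILURE_PATTERNS = [                     # assumed failure messages
    re.compile(r"Could not retrieve catalog"),
    re.compile(r"Failed to apply catalog"),
]
MAX_FAILURES = 5                         # alert once this many runs have failed

def count_failures(logfile):
    failures = 0
    with open(logfile) as fh:
        for line in fh:
            if any(p.search(line) for p in FAILURE_PATTERNS):
                failures += 1
    return failures

if __name__ == "__main__":
    failures = count_failures(LOGFILE)
    if failures >= MAX_FAILURES:
        print "CRITICAL: %d failed puppet runs in %s" % (failures, LOGFILE)
        sys.exit(2)
    print "OK: %d failed puppet runs in %s" % (failures, LOGFILE)
    sys.exit(0)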
Found in triage.

(In reply to John Ford [:jhford] -- please use 'needinfo?' instead of a CC from comment #3)
> I didn't see this bug when I started working on bug 690590. The proposed
> solution in that bug is specifically monitoring for puppetca problems, but I
> could easily adapt it to also handle more general errors.
>
> It wouldn't be hard to adapt that script to also look for failed puppet runs.
>
> Should we dupe to bug 690590?

Maybe, maybe not. Moving to the correct component for now. Let's see what makes sense to people with more context.
Blocks: re-nagios
Component: Release Engineering → Release Engineering: Developer Tools
QA Contact: hwine
Product: mozilla.org → Release Engineering
All puppet masters are monitored for basic functionality. We currently get emailed about failed puppet runs, but that's not an alert. Also, transient failures are harmless, as they automatically retry. The events we need to know about are when many hosts fail at once. Foreman will give us a more comprehensive view, so perhaps there's some way to extract a meaningful signal from Foreman into nagios.
Summary: Add monitoring to puppet masters → Monitor for excessive failed puppet runs
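One way to pull that signal out of Foreman would be a small nagios check against its hosts API, sketched below with an assumed hostname, credentials, search query, and response field; the exact search syntax and JSON layout depend on the Foreman version in use:

#!/usr/bin/env python
"""Nagios-style check: ask Foreman how many hosts failed their last puppet
run and alert when too many fail at once."""

import base64
import json
import sys
import urllib
import urllib2

FOREMAN = "https://foreman.example.com"     # hypothetical Foreman instance
SEARCH = "last_report_status = failed"      # assumed search syntax
WARN, CRIT = 5, 20                          # hosts failing at once

def failed_host_count(user, password):
    req = urllib2.Request(FOREMAN + "/api/hosts?search=" + urllib.quote(SEARCH))
    req.add_header("Authorization",
                   "Basic " + base64.b64encode("%s:%s" % (user, password)))
    data = json.load(urllib2.urlopen(req))
    # "subtotal" is assumed to be the number of hosts matching the search.
    return data.get("subtotal", len(data.get("results", [])))

if __name__ == "__main__":
    count = failed_host_count("nagios", "secret")
    if count >= CRIT:
        print "CRITICAL: %d hosts failed their last puppet run" % count
        sys.exit(2)
    if count >= WARN:
        print "WARNING: %d hosts failed their last puppet run" % count
        sys.exit(1)
    print "OK: %d hosts failed their last puppet run" % count
    sys.exit(0)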
Amy, Dustin - are we satisfied with the current level of Puppet monitoring? If so, we should resolve this.
Flags: needinfo?(arich)
I am.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Flags: needinfo?(arich)
Component: Tools → General