Closed Bug 713281 Opened 13 years ago Closed 10 years ago

setup automatic monitoring of ~/update.log on buildbot masters for exceptions

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: bhearsum, Unassigned)

References

Details

(Whiteboard: [buildmasters])

bhearsum@mozilla.com (:bhearsum)

Reporter

Description

•

13 years ago

Bug 712988 tracks an issue where some masters were having trouble submitting data to the status DB. The end result was that tons of results were missing from TBPL. We should monitor ~/update.log on the masters the way we monitor twistd.log, to make this easier to catch.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 1

•

13 years ago

Given the series of tree-closing outages we've had in the last few weeks, this is something we should get to soon. Marking for triage-followup, but available if someone wants to grab it in the meantime.

Ideally, I'd prefer to do this via Nagios, as we already have Nagios on these masters, and it gets us email+IRC notifications like all other alerts, making buildduty life easier. Any gotchas/objections to that approach?

Also, once an exception is hit in the logs and manually resolved, we'll need to do something so that the "fixed" exception is no longer flagged by monitoring. From IRC w/catlee, moving/renaming those specific log files should do the trick for clearing the alert.

Summary: watch ~/update.log on masters for exceptions → setup automatic monitoring of ~/update.log on buildbot masters for exceptions

Whiteboard: [triage-followup]
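The Nagios approach from comment 1 could look roughly like the sketch below. It is only an illustration, not the script that exists for twistd.log: the pattern, the function names, and the choice to treat a missing log as "alert cleared" (matching the move/rename idea above) are all assumptions. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import re
from pathlib import Path

# Nagios plugin exit codes (standard plugin convention).
OK, CRITICAL = 0, 2

# Assumption: Python tracebacks and lines mentioning "exception" count as hits.
PATTERN = re.compile(r"Traceback \(most recent call last\)|exception", re.IGNORECASE)


def check_log(path):
    """Return (exit_code, message) for a Nagios-style check of one log file."""
    log = Path(path)
    if not log.exists():
        # Assumption: a moved or renamed log means the alert was cleared.
        return OK, f"OK: {log} not present"
    with log.open(errors="replace") as handle:
        hits = sum(1 for line in handle if PATTERN.search(line))
    if hits:
        return CRITICAL, f"CRITICAL: {hits} exception line(s) in {log}"
    return OK, f"OK: no exceptions in {log}"


def main(argv):
    """Entry point for a plugin wrapper: print the status line, return the code."""
    # Hypothetical default path; the bug only names ~/update.log.
    target = argv[1] if len(argv) > 1 else Path.home() / "update.log"
    code, message = check_log(target)
    print(message)
    return code
```

A wrapper script would call `sys.exit(main(sys.argv))` so that Nagios sees the exit status.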

Mike Taylor [:bear]

Comment 2

•

13 years ago

This doesn't fit the model Mozilla has for Nagios (it's all passive, from-the-outside checks). It is perfect for a tool I was learning about, and also for the work that Catlee and I are doing with new messaging tools. The tool would either grep or tail the log (heck, it could even be realtime) and just send a #buildduty alert when the regex is matched.

The log itself shouldn't be the worry for re-alerting, IMO; it should be whatever data is being stored about the alert, i.e. how many of each alert have happened in a given time frame. Yes, I'm talking about the releng dashboard here.
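To make the "how many of each alert in a given time frame" idea concrete, here is a minimal sketch. It is not the tool mentioned in this comment (that tool is not named here). The alert names, the patterns, and the assumption that every log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp are all hypothetical.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical alert names and patterns; the bug does not list any.
ALERTS = {
    "traceback": re.compile(r"Traceback \(most recent call last\)"),
    "db_error": re.compile(r"OperationalError"),
}

# Assumption: each log line starts with a "YYYY-MM-DD HH:MM:SS" timestamp.
STAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")


def count_alerts(lines, now, window=timedelta(hours=1)):
    """Count how often each alert pattern matched within `window` before `now`."""
    counts = Counter()
    for line in lines:
        stamp = STAMP.match(line)
        if not stamp:
            continue  # lines without a leading timestamp are skipped
        when = datetime.strptime(stamp.group(1), "%Y-%m-%d %H:%M:%S")
        if not (now - window <= when <= now):
            continue  # outside the time frame of interest
        for name, pattern in ALERTS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

An alerting layer could then notify only when a count crosses a threshold, rather than whenever an old line is still sitting in the log.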

Mike Taylor [:bear]

Comment 3

•

13 years ago

I'm changing this from critical (as it doesn't impact the current running of the buildmasters) and assigning it a P2 priority.

Severity: critical → major

Priority: -- → P2

Mike Taylor [:bear]

Updated

•

13 years ago

Whiteboard: [triage-followup] → [buildmasters][triage-followup]

Chris Cooper [:coop] (he/him)

Comment 4

•

13 years ago

(In reply to Mike Taylor [:bear] from comment #2)
> This is perfect for a tool I was learning about and also for the work that I
> and Catlee are doing with new messaging tools. The tool would either grep
> or tail the log (heck, could even be realtime) and just send a #buildduty
> alert when the regex is matched.
>
> The log itself shouldn't be the worry for realerting, IMO, it should be
> whatever data is being stored about the alert, i.e. how many of each alert
> have happened in a given time frame. Yes, i'm talking about the releng
> dashboard here.

Seems like a lot of work, and for a tool that doesn't exist yet.

Why don't we just modify the existing twistd.log script to handle the upload.log, and start logrotating the upload.log so we don't have to worry about re-alerting?

Whiteboard: [buildmasters][triage-followup] → [buildmasters]

Mike Taylor [:bear]

Comment 5

•

13 years ago

(In reply to Chris Cooper [:coop] from comment #4)
> Seems like a lot of work, and for a tool that doesn't exist yet.
>
> Why don't we just modify the existing twistd.log script to handle the
> upload.log, and start logrotating the upload.log so we don't have to worry
> about re-alerting?

Sure, that would be a reasonable v1.0 way of dealing with this.
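For illustration, the rotate-to-avoid-re-alerting part of that v1.0 could be sketched as below. In practice logrotate itself would presumably do this job. The function name and the timestamp suffix are assumptions, and the sketch assumes the writer reopens the log on each run instead of holding it open.

```python
import time
from pathlib import Path


def rotate_log(path):
    """Rename `path` with a timestamp suffix so the next scan starts clean.

    Returns the rotated path, or None if there was nothing to rotate.
    """
    log = Path(path)
    if not log.exists() or log.stat().st_size == 0:
        return None
    # Assumed naming scheme, e.g. update.log.20120105-104500
    rotated = log.with_name(f"{log.name}.{time.strftime('%Y%m%d-%H%M%S')}")
    log.rename(rotated)
    log.touch()  # leave an empty log behind for the writer to append to
    return rotated
```

Once the old exceptions live in the rotated copy, a check that only reads the current log stops flagging them.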

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: mozilla.org → Release Engineering

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 6

•

11 years ago

Found in triage.

Blocks: re-nagios

Component: Other → Platform Support

Chris Cooper [:coop] (he/him)

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Resolution: --- → WONTFIX

Nobody; OK to take it and work on it

Assignee

Updated

•

6 years ago

Component: Platform Support → Buildduty

Product: Release Engineering → Infrastructure & Operations

BMO Automation

Updated

•

5 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard


Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P2)


Security

(public)
