Closed
Bug 713281
Opened 13 years ago
Closed 10 years ago
setup automatic monitoring of ~/update.log on buildbot masters for exceptions
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task, P2)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: bhearsum, Unassigned)
References
Details
(Whiteboard: [buildmasters])
Bug 712988 tracks an issue where some masters were having trouble submitting data to the status DB. As a result, tons of results were missing from TBPL. We should monitor ~/update.log on the masters the same way we monitor twistd.log, to make this easier to catch.
Comment 1•13 years ago
Given the series of tree-closing outages we've had in the last few weeks, this is something we should get to soon. Marking for triage-followup, but it's available if someone wants to grab it in the meantime.
Ideally, I'd prefer to do this via nagios, as we already have nagios on these masters, and it gets us email+irc notifications like all other alerts - making buildduty life easier. Any gotchas/objections to that approach?
Also, once an exception shows up in the logs and has been manually resolved, we'll need a way to stop monitoring from flagging that "fixed" exception. From IRC w/catlee, moving/renaming those specific log files should do the trick for clearing the alert.
Summary: watch ~/update.log on masters for exceptions → setup automatic monitoring of ~/update.log on buildbot masters for exceptions
Whiteboard: [triage-followup]
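The Nagios-style approach from comment 1 could look roughly like the sketch below. It is illustrative only: the function name `check_log`, the traceback regex, and the choice to treat a missing log as OK (so that moving/renaming the file clears the alert) are all assumptions, not anything taken from the existing twistd.log check. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import re

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: exceptions show up in the log as Python tracebacks.
EXCEPTION_RE = re.compile(r"Traceback \(most recent call last\)")


def check_log(path):
    """Return (exit_code, message) for a Nagios-style check of one log file."""
    try:
        with open(path, errors="replace") as fh:
            hits = sum(1 for line in fh if EXCEPTION_RE.search(line))
    except FileNotFoundError:
        # Assumption: a missing log means it was moved/renamed to clear the alert.
        return OK, f"OK: {path} not present (assumed rotated away)"
    except OSError as exc:
        return UNKNOWN, f"UNKNOWN: cannot read {path}: {exc}"
    if hits:
        return CRITICAL, f"CRITICAL: {hits} exception(s) in {path}"
    return OK, f"OK: no exceptions in {path}"
```

A wrapper script would print the message and exit with the returned code. Because the whole file is rescanned on every run, the alert stays raised until the log is moved aside, which matches the clearing procedure described above.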
Comment 2•13 years ago
This doesn't fit the model Mozilla has for Nagios (it's all passive, from-the-outside checks).
This is perfect for a tool I was learning about and also for the work that Catlee and I are doing with new messaging tools. The tool would either grep or tail the log (heck, could even be realtime) and just send a #buildduty alert when the regex is matched.
The log itself shouldn't be the worry for re-alerting, IMO; it should be whatever data is being stored about the alert, i.e. how many of each alert have happened in a given time frame. Yes, I'm talking about the releng dashboard here.
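As a rough illustration of the idea in comment 2 (match a regex against each log line, then track how many of each alert happened in a given time frame), here is a minimal sketch. The pattern names, the regexes, and the `AlertWindow` class are hypothetical; the thread does not say which exceptions to match or how the dashboard stores its data.

```python
import re
from collections import Counter, deque

# Hypothetical patterns; the thread does not specify what to match.
PATTERNS = {
    "traceback": re.compile(r"Traceback \(most recent call last\)"),
    "db_error": re.compile(r"OperationalError"),
}


class AlertWindow:
    """Count regex matches per alert name over a sliding time window."""

    def __init__(self, window_seconds=3600):
        self.window_seconds = window_seconds
        self.events = deque()  # (timestamp, alert_name), oldest first

    def feed(self, line, now):
        """Record every pattern matching `line`; return the names that matched.

        Assumes `now` never decreases between calls.
        """
        matched = [name for name, rx in PATTERNS.items() if rx.search(line)]
        for name in matched:
            self.events.append((now, name))
        return matched

    def counts(self, now):
        """Drop events older than the window and return per-alert counts."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] < cutoff:
            self.events.popleft()
        return Counter(name for _, name in self.events)
```

A tailing process would call `feed()` for each new line and send a notification whenever it returns a non-empty list, while `counts()` supplies the per-alert totals that decide whether to alert again.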
Comment 3•13 years ago
I'm downgrading this from critical to major (it doesn't impact the current running of the buildmasters) and assigning it a P2 priority.
Severity: critical → major
Priority: -- → P2
Updated•13 years ago
Whiteboard: [triage-followup] → [buildmasters][triage-followup]
Comment 4•13 years ago
(In reply to Mike Taylor [:bear] from comment #2)
> This is perfect for a tool I was learning about and also for the work that
> Catlee and I are doing with new messaging tools. The tool would either grep
> or tail the log (heck, could even be realtime) and just send a #buildduty
> alert when the regex is matched.
>
> The log itself shouldn't be the worry for re-alerting, IMO; it should be
> whatever data is being stored about the alert, i.e. how many of each alert
> have happened in a given time frame. Yes, I'm talking about the releng
> dashboard here.
Seems like a lot of work, and for a tool that doesn't exist yet.
Why don't we just modify the existing twistd.log script to handle the upload.log, and start logrotating the upload.log so we don't have to worry about re-alerting?
Whiteboard: [buildmasters][triage-followup] → [buildmasters]
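One way to avoid re-alerting once the log is being rotated is an incremental scan that remembers how far it has already read and starts over when the file shrinks. The sketch below is an assumption about how that could work, not the existing twistd.log script (which is not shown in this bug); `scan_new_lines`, the state file, and the traceback regex are all hypothetical.

```python
import os
import re

# Assumption: exceptions show up in the log as Python tracebacks.
EXCEPTION_RE = re.compile(r"Traceback \(most recent call last\)")


def _load_offset(state_path):
    """Read the saved byte offset, or 0 if there is no usable state."""
    try:
        with open(state_path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return 0


def scan_new_lines(log_path, state_path):
    """Return exception lines appended to log_path since the previous call."""
    offset = _load_offset(state_path)
    try:
        with open(log_path, "rb") as fh:
            fh.seek(0, os.SEEK_END)
            if fh.tell() < offset:
                offset = 0  # file shrank: assume logrotate replaced it
            fh.seek(offset)
            data = fh.read()
    except FileNotFoundError:
        data, offset = b"", 0  # log rotated away; start over next time

    # Only consume complete lines so a half-written line is retried later.
    end = data.rfind(b"\n") + 1
    data = data[:end]
    with open(state_path, "w") as fh:
        fh.write(str(offset + end))

    text = data.decode("utf-8", errors="replace")
    return [line for line in text.splitlines() if EXCEPTION_RE.search(line)]
```

Detecting rotation by size alone is a simplification: if the new log grows past the saved offset before the next scan, the start of it would be skipped. Comparing inode numbers would close that gap.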
Comment 5•13 years ago
(In reply to Chris Cooper [:coop] from comment #4)
> Seems like a lot of work, and for a tool that doesn't exist yet.
>
> Why don't we just modify the existing twistd.log script to handle the
> upload.log, and start logrotating the upload.log so we don't have to worry
> about re-alerting?
Sure, that would be a reasonable v1.0 way of dealing with this.
Updated•11 years ago
Product: mozilla.org → Release Engineering
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard