Closed Bug 852357 Opened 12 years ago Closed 8 years ago

Reporting System for Build/Infrastructure Issues

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: k0scist, Unassigned)

Details

(Whiteboard: [buildfaster:?])

CC: :bhearsum, :edmorley, :RyanVM, :jgriffin, :jmaher

As part of builds (in the buildbot sense as used here, so this could also be a test run), the slave environment is introspected and modified to set up (and sanity-check) the build steps. For example, an existing hg clone could be updated or, given insufficient disk space or a bad repo state, cloned afresh.

With our existing infrastructure, if there are issues with setup steps, there are basically two (easy) possibilities:

1. turn the job orange - you broke it!
2. print to TinderboxPrint - but it is likely no one would see it

In some cases the build may safely proceed: a non-preferred fallback was used, the actual result differs from the expected result, or statistics (e.g. timings) are being gathered to act on later. For these, turning the job orange may be overkill, but it may be desirable to note the issue somewhat more visibly than via TinderboxPrint.

Disclaimer: this is a blue-sky idea; it is also not (necessarily) a trivial project. I also don't know to what extent this already exists in the build system or to what extent this is cared about. This is a rough proposal at best, if it's useful, not a request.

Possible use cases:

* noting issues that may require machine reimaging/maintenance (e.g. disk space)
* noting systematic timings/slowdowns (and other machine stats)
* noting prevalence of non-fatal problems or potential problems
* noting an (excessive?) number of retries (e.g. hg.m.o timeouts)
* noting slow downloads
* noting when a fallback method is used that takes longer than the preferred case or is otherwise less desirable

The no-tech (or at least no-infra) solution is to have each particular piece that is cared about generate and send its own notification. A more complete solution would entail a universal way of noting that there is an issue (and what it is) as well as a place to put it.
Note that while the no-tech solution is easy for a particular case, multiple cases will involve copy+pasting code, which will probably discourage notifying on a particular issue since each is roll-your-own. At the other end of the spectrum, a precisely tailored solution will take an excessive amount of time to spec and craft. Both extremes give perfect elasticity, though some middle ground is likely more pragmatic in terms of overall gain.

Noting the issue could be done in any number of ways, for example: scanning the logs, POSTing to some service (e.g. bugzilla), emailing some parties, uploading a file (somewhere), pulse, or leveraging TinderboxPrint or similar (and/or the TBPL 2.0 equivalent thereof) via an additional piece.

The place to put it could be bugzilla (likely, since it is our issue tracker whether or not it is ideal for this particular purpose, and this class of bugs could be harvested to make an additional dashboard), a mailing list, nagios, or a yet-to-exist web service.

IFF this is something worth doing, the first steps would be prioritizing based on added value and deciding the actual form of the solution based on a convolution of (need) and (bang for the buck).

Idea from https://bugzilla.mozilla.org/show_bug.cgi?id=851270#c31
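
To make the "universal way of noting that there is an issue, plus a place to put it" idea a bit more concrete, here is a rough sketch of what the middle ground might look like. Everything in it is hypothetical: the names `Issue`, `IssueReporter`, `log_sink` and `jsonl_sink`, the `NON-FATAL ISSUE` log prefix, and the issue kinds are made up for illustration and do not correspond to anything in buildbot, TBPL or bugzilla. The point is only that build steps call one `note()` function, and where the notes end up (log line, file, service) is decided by pluggable sinks.

```python
import json
import sys
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Issue:
    """One non-fatal observation made during setup (hypothetical schema)."""
    kind: str                      # e.g. "fallback_used", "retry", "slow_download"
    message: str                   # human-readable description
    data: dict = field(default_factory=dict)          # free-form details
    timestamp: float = field(default_factory=time.time)


class IssueReporter:
    """Collects issues and forwards each one to every configured sink."""

    def __init__(self, sinks=None):
        self.sinks = list(sinks or [])
        self.issues = []

    def note(self, kind, message, **data):
        issue = Issue(kind=kind, message=message, data=data)
        self.issues.append(issue)
        for sink in self.sinks:
            try:
                sink(issue)
            except Exception as exc:
                # Reporting must never turn the job orange by itself.
                print(f"issue sink failed: {exc}", file=sys.stderr)
        return issue

    def summary(self):
        """Count issues per kind, e.g. to spot excessive retries."""
        counts = {}
        for issue in self.issues:
            counts[issue.kind] = counts.get(issue.kind, 0) + 1
        return counts


def log_sink(issue):
    # Assumed line format; something scanning the logs could key on the prefix.
    print(f"NON-FATAL ISSUE {issue.kind}: {issue.message}")


def jsonl_sink(path):
    """Return a sink that appends each issue as one JSON line to `path`."""
    def sink(issue):
        with open(path, "a") as handle:
            handle.write(json.dumps(asdict(issue)) + "\n")
    return sink


reporter = IssueReporter(sinks=[log_sink])
reporter.note("fallback_used", "hg update failed; cloned afresh", elapsed_s=412)
reporter.note("retry", "hg.m.o timeout", attempt=2)
reporter.note("retry", "hg.m.o timeout", attempt=3)
print(reporter.summary())  # {'fallback_used': 1, 'retry': 2}
```

A sink that POSTs to a service or files a bug would plug in the same way as `jsonl_sink`; which sinks are worth writing is exactly the (need) vs (bang for the buck) question above.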
The new treeherder generic metadata fields sound like a good place to store this; all we then need is a UI for it, separate from the normal treeherder-ui view.
Product: mozilla.org → Release Engineering
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General