Closed Bug 1124059 Opened 10 years ago Closed 7 years ago

create a buildduty dashboard that highlights current infra health

Categories

(Release Engineering :: General, defect)

Hardware: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jlund, Assigned: jlund)

Jumping back into buildduty, I've realized I'm out of touch with 'reading' our current state. We also have a number of now-maturing parts in our infra that have to be factored in when interpreting individual metrics like, say, 'wait times'.

Below are some daily links for buildduty:

http://builddata.pub.build.mozilla.org/reports/pending/pending.html
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/index.html
https://secure.pub.build.mozilla.org/buildapi/reports/waittimes
https://www.hostedgraphite.com/da5c920d/grafana/#/dashboard/temp/5d509a54d0b496d184023bab35c17d0f41e6c607
http://nigelbabu.github.io/hgstats/#hours/2
http://netops2.private.scl3.mozilla.com/smokeping/sm.cgi?target=Datacenters.RELENG-SCL3.nagios1-releng-use1
https://wiki.mozilla.org/ReleaseEngineering/Maintenance#Reconfigs_.2F_Deployments
https://secure.pub.build.mozilla.org/buildapi/running
(callek's upcoming jacuzzi health page)

Understanding and interpreting what it means when some things look good and others look bad takes time and experience. Plus, a number of these links don't tell us what is considered normal for a given time.

Let's take an example: you might jump to the conclusion that something is broken because we have high pending and minimal unhealthy slaves. But in reality, the jacuzzi pools are 'full' and we have an abnormal number of pushes at the moment. Without having to investigate too much, we could conclude that pending may be high but nothing is 'broken', and our end2end build times are still performing well. Sure, we could improve our scheduling, but we don't need to close trees or spend a long time diagnosing at the expense of other buildduty work.
I propose we have a single report summary page that does three things:

1) scrapes/shares data from the above links to give an overall status check
2) explains what is considered 'recent normal state' vs 'current state'
   a) where possible this would be in the form of gauges[1]
3) interprets each summary and gives possible clues to what the smoking gun could be
   a) this would come after everything else in this bug is complete, and it need not be very intelligent

Note: the goal of this bug is to provide an ops analysis. It is not meant to provide any long-term performance analysis or to replace existing metric reports.

The rollout plan will be in baby steps. The first two things used as a proof of concept can be:

1) polling our releng repos (not just buildbot + mozharness) and highlighting recent bugs/changes that have landed
2) providing a gauge for pending, wait times, pushes and end2end stats

After that, a similar process can be used to add summaries for aws, jacuzzi, slave health, etc.

[1] https://developers.google.com/chart/interactive/docs/gallery/gauge
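To make the 'recent normal state' vs 'current state' idea a bit more concrete, here is a rough sketch of how a gauge and a not-very-intelligent interpretation step might look. Everything in it is an assumption for illustration: the names `Gauge`, `build_gauge` and `interpret`, the choice of a 10th-90th percentile band as 'normal', and the metric keys `pending`, `pushes` and `end2end`. It does not scrape any of the reports above; it only shows the comparison logic once the numbers are in hand.

```python
from dataclasses import dataclass


@dataclass
class Gauge:
    """One metric: its current value and the band considered 'recent normal'."""
    name: str
    current: float
    normal_low: float
    normal_high: float

    @property
    def state(self) -> str:
        if self.current < self.normal_low:
            return "below normal"
        if self.current > self.normal_high:
            return "above normal"
        return "normal"


def percentile(sorted_values, fraction):
    """Pick the sample closest to the given fraction of a sorted list."""
    index = round(fraction * (len(sorted_values) - 1))
    return sorted_values[index]


def build_gauge(name, recent, current, low=0.1, high=0.9):
    """Derive the 'normal' band from recent samples (assumed 10th-90th percentile)."""
    ordered = sorted(recent)
    return Gauge(name, current, percentile(ordered, low), percentile(ordered, high))


def interpret(gauges):
    """Hypothetical rule mirroring the high-pending example from the description."""
    states = {g.name: g.state for g in gauges}
    if states.get("pending") == "above normal":
        if states.get("pushes") == "above normal" and states.get("end2end") == "normal":
            return "pending is high, but pushes are unusually high and end2end is normal: busy, not broken"
        return "pending is high without a matching push spike: worth investigating"
    return "nothing unusual"
```

A caller would build one gauge per metric from whatever recent samples are available (for example, the same hour over the past few days) and pass the list to `interpret`. The percentile band is only one possible definition of 'normal'; a fixed threshold or a per-weekday baseline would slot into `build_gauge` the same way.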
Assignee: nobody → jlund
Component: Buildduty → Tools
QA Contact: bugspam.Callek → hwine
Component: Tools → General
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX