Closed Bug 1565939 Opened 5 years ago Closed 5 years ago

Configure monitoring for cloudops taskcluster

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: brian, Assigned: edunham)

Brian Pitts

Reporter

Description

•

5 years ago

We need to decide what we want to monitor for taskcluster, and how we're going to do it.

For the question of "are the services even running" we can count on k8s to try to keep them running and alert on rising restart count, which would indicate crashes.
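A minimal sketch of the restart-count check, assuming cumulative restart counts per pod have already been sampled at two points in time. The function name, the sample format, the pod names, and the threshold are all hypothetical and only illustrate the idea.

```python
from typing import Dict, List


def pods_with_rising_restarts(
    previous: Dict[str, int],
    current: Dict[str, int],
    max_new_restarts: int = 3,  # hypothetical threshold, tune per service
) -> List[str]:
    """Return the pods whose restart count grew by more than max_new_restarts.

    `previous` and `current` map pod name -> cumulative restart count at two
    sample times. Pods missing from `previous` are treated as starting at 0.
    """
    flagged = []
    for pod, count in current.items():
        delta = count - previous.get(pod, 0)
        if delta > max_new_restarts:
            flagged.append(pod)
    return sorted(flagged)
```

For example, with hypothetical samples `{"queue-web": 0, "hooks-scheduler": 2}` and then `{"queue-web": 1, "hooks-scheduler": 9}`, only `hooks-scheduler` is flagged. Comparing deltas rather than absolute counts keeps a long-lived pod with a few old restarts from alerting forever.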

For the web services we can track request rates and timings for success and errors using info from the load balancer and alert on high errors.
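A rough sketch of the "alert on high errors" rule, assuming the load balancer data has already been aggregated into a count of responses per HTTP status code for some time window. The function names, the 5% threshold, and the minimum request count are assumptions for illustration, not agreed values.

```python
from typing import Dict


def error_rate(status_counts: Dict[int, int]) -> float:
    """Fraction of requests in the window that ended in a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    errors = sum(n for status, n in status_counts.items() if 500 <= status <= 599)
    return errors / total


def should_alert(
    status_counts: Dict[int, int],
    threshold: float = 0.05,  # hypothetical: alert above 5% server errors
    min_requests: int = 100,  # hypothetical: ignore very quiet windows
) -> bool:
    """Alert only when the window has enough traffic and too many 5xx responses."""
    total = sum(status_counts.values())
    return total >= min_requests and error_rate(status_counts) > threshold
```

The minimum request count is there so that a single failed request during a quiet period does not page anyone.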

For the background services we need to figure out a way to know if they are completing work successfully or not.

Dustin J. Mitchell [:dustin] (he/him)

Updated

•

5 years ago

Depends on: 1561338

Brian Pitts

Reporter

Comment 1

•

5 years ago

For background services and crons we now have https://docs.taskcluster.net/docs/manual/deploying/monitoring

In addition to what was mentioned before, we should monitor rabbitmq queue depth. We can deploy https://github.com/influxdata/telegraf/tree/master/plugins/inputs/rabbitmq to the nonprod and prod per-realm telegrafs and point them at the correct rabbitmqs.
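To illustrate the kind of queue-depth check we would want on top of the collected data, here is a small sketch. It assumes queue stats shaped like the objects returned by the RabbitMQ management API's `/api/queues` endpoint (each with a `name` and a `messages` field). The function name and the depth threshold are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Tuple


def deep_queues(
    queues: Iterable[Dict[str, Any]],
    max_depth: int = 1000,  # hypothetical threshold, tune per queue
) -> List[Tuple[str, int]]:
    """Return (queue name, depth) pairs for queues deeper than max_depth.

    Each item is expected to look like an entry from the management API's
    /api/queues response. A queue without a "messages" field yet is
    treated as empty.
    """
    flagged = []
    for queue in queues:
        depth = queue.get("messages", 0)
        if depth > max_depth:
            flagged.append((queue["name"], depth))
    # Deepest queues first, so the worst backlog is at the top of the alert.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

In practice the threshold probably needs to be per queue, since some queues are expected to hold a backlog and others should stay near zero.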

Brian Pitts

Reporter

Updated

•

5 years ago

Component: Services → Operations: Taskcluster

Product: Taskcluster → Cloud Services

Brian Pitts

Reporter

Comment 2

•

5 years ago

Edunham has the rest set up (including PagerDuty and Pingdom), but rabbitmq monitoring is still WIP.

Brian Pitts

Reporter

Comment 3

•

5 years ago

Talked with edunham this morning about what's needed to close this:

- Set everything up for firefox ci and have bpitts review
- Retest log-based metrics PR, then have bpitts re-review and merge
- Automate or document rabbitmq user creation for use by telegraf plugin

Brian Pitts

Reporter

Comment 4

•

5 years ago

We can continue to iterate on what we graph and what we alert on, but I think all the basics are in place and working.

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Categories

(Cloud Services :: Operations: Taskcluster, task)
