Closed Bug 1565939 Opened 5 years ago Closed 5 years ago

Configure monitoring for cloudops taskcluster

Categories

(Cloud Services :: Operations: Taskcluster, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: brian, Assigned: edunham)

We need to decide what we want to monitor for taskcluster, and how we're going to do it.

For the question of "are the services even running" we can count on k8s to try to keep them running, and we can alert on a rising restart count, which would indicate crashes.
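As a rough illustration of the restart-count idea, the sketch below compares two snapshots of per-pod restart counts and flags pods whose count rose sharply. The function name, the snapshot shape (pod name mapped to restart count), and the threshold of 3 are all assumptions for illustration; in practice the counts would come from whatever already scrapes k8s.

```python
from typing import Dict, List


def rising_restarts(
    previous: Dict[str, int],
    current: Dict[str, int],
    threshold: int = 3,  # hypothetical alert threshold, tune per service
) -> List[str]:
    """Return pods whose restart count rose by at least `threshold`
    between two snapshots (pod name -> restart count)."""
    flagged = []
    for pod, count in current.items():
        # Pods missing from the previous snapshot are treated as starting at 0.
        delta = count - previous.get(pod, 0)
        if delta >= threshold:
            flagged.append(pod)
    return sorted(flagged)
```

Alerting on the increase rather than the absolute count avoids paging forever on a pod that crashed a few times long ago and has been stable since.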

For the web services we can track request rates and timings for success and errors using info from the load balancer and alert on high errors.
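A minimal sketch of the "alert on high errors" check, assuming we already have the HTTP status codes for a time window from the load balancer. Treating only 5xx responses as errors and the 5% threshold are assumptions, not something decided in this bug.

```python
from typing import Iterable


def error_rate(status_codes: Iterable[int]) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    codes = list(status_codes)
    if not codes:
        # No traffic in the window: report 0 rather than dividing by zero.
        return 0.0
    errors = sum(1 for code in codes if code >= 500)
    return errors / len(codes)


def should_alert(status_codes: Iterable[int], max_error_rate: float = 0.05) -> bool:
    """True when the 5xx fraction exceeds the (hypothetical) threshold."""
    return error_rate(status_codes) > max_error_rate
```

Using a ratio instead of a raw error count keeps the alert meaningful across services with very different request volumes.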

For the background services we need to figure out a way to know if they are completing work successfully or not.

For background services and crons we now have https://docs.taskcluster.net/docs/manual/deploying/monitoring

In addition to what was mentioned before, we should monitor rabbitmq queue depth. We can deploy https://github.com/influxdata/telegraf/tree/master/plugins/inputs/rabbitmq to the nonprod and prod per-realm telegrafs and point them at the correct rabbitmqs.
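To make the queue-depth check concrete, here is a small sketch that takes a list of queue objects shaped like the RabbitMQ management API's queue listing (each with a `name` and a `messages` count) and returns the queues above a limit. The function name and the limit of 1000 are assumptions; the telegraf plugin would normally collect these numbers and the alert would live in the monitoring system instead.

```python
from typing import Any, Dict, List


def deep_queues(queues: List[Dict[str, Any]], max_depth: int = 1000) -> Dict[str, int]:
    """Map queue name -> message count for queues deeper than `max_depth`.

    `queues` is assumed to be already-parsed JSON with "name" and
    "messages" keys per queue; a missing "messages" key counts as 0.
    """
    return {
        q["name"]: q.get("messages", 0)
        for q in queues
        if q.get("messages", 0) > max_depth
    }
```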

Component: Services → Operations: Taskcluster
Product: Taskcluster → Cloud Services

Edunham has the rest set up (including PagerDuty and Pingdom), but rabbitmq monitoring is still WIP.

Talked with edunham this morning about what's needed to close this:

  1. Set everything up for firefox ci and have bpitts review
  2. Retest log-based metrics PR then have bpitts re-review and merge
  3. Automate or document rabbitmq user creation for use by telegraf plugin
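For item 3, one possible way to automate the rabbitmq user creation is to generate the `rabbitmqctl` invocations from a script. The sketch below only builds the command lists and does not run them. The helper name, the `monitoring` tag, and the empty `^$` permission patterns are assumptions about what the telegraf plugin's user needs, and should be checked against the actual deployment.

```python
from typing import List


def monitoring_user_commands(username: str, password: str, vhost: str = "/") -> List[List[str]]:
    """Build rabbitmqctl commands that create a read-only monitoring user.

    Assumption: the "monitoring" tag plus vhost access with no resource
    permissions is enough for the telegraf rabbitmq input.
    """
    return [
        ["rabbitmqctl", "add_user", username, password],
        ["rabbitmqctl", "set_user_tags", username, "monitoring"],
        # configure / write / read patterns; "^$" matches no resource names.
        ["rabbitmqctl", "set_permissions", "-p", vhost, username, "^$", "^$", "^$"],
    ]
```

Each inner list can be passed to `subprocess.run(cmd, check=True)` on a host where `rabbitmqctl` is available; keeping command construction separate makes the steps easy to review or paste into documentation.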

We can continue to iterate on what we graph and what we alert on, but I think all the basics are in place and working.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED