Closed Bug 1441996 Opened 7 years ago Closed 3 years ago

Sentry connectivity checks for Socorro processes

Categories

(Socorro :: General, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1758701

People

(Reporter: osmose, Unassigned)

Details

We use Sentry across Socorro for reporting errors, but we have nothing in place to alert us when it is unreachable. This is also a bit complex because Sentry is typically how we'd report an error like this. For the webapp, we could add an endpoint that sends a test message to Sentry and returns a 500 error if it fails. Infra can hit this endpoint periodically, after a deploy, or both. For the processor/crontabber/etc, one suggestion was to report to Datadog when we can't send a test message on startup.
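A rough sketch of what the webapp check described above might look like, written framework-free so it stands alone. The names `sentry_check_response` and `send_test_message` are hypothetical: `send_test_message` stands in for whatever the webapp would use to send a test message to Sentry, and it is assumed to raise an exception when sending fails.

```python
from typing import Callable, Tuple


def sentry_check_response(send_test_message: Callable[[], None]) -> Tuple[int, str]:
    """Return an (HTTP status, body) pair for a hypothetical Sentry check endpoint.

    send_test_message is a placeholder for the real sending code and is
    assumed to raise an exception when the message cannot be sent.
    """
    try:
        send_test_message()
    except Exception as exc:
        # Sentry itself can't report this failure, so surface it as a 500
        # that whoever is polling the endpoint can act on.
        return 500, f"sentry check failed: {exc!r}"
    return 200, "sentry check ok"
```

A view in the webapp would call this and turn the pair into a real HTTP response. Note that a send call that returns without raising only shows the client accepted the message, not that Sentry received it.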
willkg: Besides the webapp, processor, and crontabber, are there any other processes/services we would want to cover with this kind of check?
willkg/miles: What options do we have for reporting besides Sentry itself and Datadog?
Flags: needinfo?(willkg)
Flags: needinfo?(miles)
Sentry and Datadog are the realistic places where this reporting should be handled. We could put this in the healthcheck/heartbeat endpoints in some capacity, but returning non-200 in those endpoints is treated as page-able downtime.
Flags: needinfo?(miles)
Seems to me that what we want to test here are two things:
1. will the code we have send exceptions to Sentry
2. is the configuration for the component correct
Both of those are things that change during deploys--they're not things that change on the whims of time. Given that, I don't want to add these to heartbeat-type healthchecks. I think we want to implement during-deploy checks that get run once during a deploy for each component.
For mechanisms, the webapp has that "./manage.py raven whatever" thing. I think we could build an equivalent thing for the processor and crontabber where a "pass" is "error got sent to Sentry" and a "fail" is "code raised an error trying to send an error to Sentry". Sending an incr to Datadog on fails is interesting, but I think I'd rather this used our existing deploy alerting for when deploys fail.
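The pass/fail rule for a during-deploy check could be sketched roughly like this. It is only an illustration of the idea, not the command that exists in Socorro: `run_sentry_deploy_check` and `send_test_error` are hypothetical names, and `send_test_error` is assumed to raise if the component's code or configuration can't send an error to Sentry.

```python
import sys
from typing import Callable


def run_sentry_deploy_check(send_test_error: Callable[[], None]) -> int:
    """Run a one-off Sentry check and return a process exit code.

    Pass (0): the test error was sent without the code raising.
    Fail (1): the code raised while trying to send the test error.
    """
    try:
        send_test_error()
    except Exception as exc:
        print(f"FAIL: could not send test error to Sentry: {exc!r}", file=sys.stderr)
        return 1
    print("PASS: test error sent to Sentry")
    return 0
```

A per-component deploy step could end with `sys.exit(run_sentry_deploy_check(...))`, so a failing check fails the deploy and goes through whatever deploy alerting is already in place rather than a separate Datadog counter.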
Flags: needinfo?(willkg)

We implemented a CLI that lets us test Sentry configuration and connectivity for any of the server nodes in bug #1758701, so I'm going to dupe this one to that.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → DUPLICATE