Closed Bug 657519 Opened 14 years ago Closed 14 years ago

Add nagios check for tinderbox mail processing

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: rtucker)

References

Details

No description provided.
That's weird, bugzilla ate the description. In bug 653969, we've seen that tinderbox's mail processing gets backed up. Bug 657521 is designed to mitigate that, but it would also be great if such slowness were proactively detected. Bug 653969 comment 14 suggests a nagios check of the queue length, with results like:

  CRITICAL - 12345 tinderbox mails unprocessed, oldest is 123 minutes old

How hard would this be? If it's not too hard, it would be a good stopgap to help keep tinderbox alive until we can make it go away.
Blocks: 653969
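For context, a Nagios check of this sort is just a command that prints one status line and reports its state through its exit code. A minimal sketch of the contract the requested check would have to meet (the standard plugin convention, not anything specific to this bug):

# Standard Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
# The single printed line becomes the alert text shown to oncall.
echo "CRITICAL - 12345 tinderbox mails unprocessed, oldest is 123 minutes old"
exit 2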
Rob, I'll leave this one to you since you monitor the mail and tinderbox servers.
This should be pretty easy. The mail queue is at /var/www/iscsi/webtools/tinderbox/data on dm-webtools02. There are other files besides spooled mail in that directory, so you can't just look at the size of the directory, but all mail spool filenames start with "tbx.".

Number of items in queue:

  ls -1 /var/www/iscsi/webtools/tinderbox/data/tbx* | wc -l

Timestamp of oldest item:

  ls -1t /var/www/iscsi/webtools/tinderbox/data/tbx* | tail -1 | xargs -n1 stat -c%y
Make that 'y' on the end be a 'Y' to get seconds since epoch for easy math.
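Putting those two commands together, here is a minimal sketch of what such a plugin could look like, assuming GNU stat/date are available on dm-webtools02. The script name, argument handling, and threshold arguments are placeholders for illustration, not the plugin that was actually deployed:

#!/bin/bash
# check_tinderbox_mail -- hypothetical sketch, not the plugin actually deployed.
# Usage: check_tinderbox_mail WARN_COUNT CRIT_COUNT WARN_SECS CRIT_SECS
# Prints one Nagios status line and exits 0 (OK), 1 (WARNING) or 2 (CRITICAL).

QUEUE=/var/www/iscsi/webtools/tinderbox/data
WARN_COUNT=$1; CRIT_COUNT=$2; WARN_AGE=$3; CRIT_AGE=$4

# Number of spooled tinderbox mails (spool filenames all start with "tbx.").
count=$(ls -1 "$QUEUE"/tbx* 2>/dev/null | wc -l)

# Age of the oldest spool file in seconds; ls -1t sorts newest first,
# so the last entry is the oldest.
age=0
if [ "$count" -gt 0 ]; then
    oldest=$(ls -1t "$QUEUE"/tbx* | tail -1)
    age=$(( $(date +%s) - $(stat -c%Y "$oldest") ))
fi

msg="$count tinderbox mails unprocessed, oldest is $(( age / 60 )) minutes old"

if [ "$count" -ge "$CRIT_COUNT" ] || [ "$age" -ge "$CRIT_AGE" ]; then
    echo "CRITICAL - $msg"; exit 2
elif [ "$count" -ge "$WARN_COUNT" ] || [ "$age" -ge "$WARN_AGE" ]; then
    echo "WARNING - $msg"; exit 1
else
    echo "OK - $msg"; exit 0
fi

Nagios takes the exit code as the service state and the printed line as the alert text, which matches the output format suggested in comment 0.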
What should the warning and critical values be for the message counts? There are currently about 188 messages in this queue.
Could we track, every 5 minutes over 24 hours, the following two numbers:

* # of unprocessed items
* age of the oldest item

If this report makes sense, could it be emailed to us every night? I believe this should give us a basis to determine what the warning and critical levels should be.
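One hedged way to collect that data, assuming cron is available on dm-webtools02 and using the paths from comment 3; the script path, log file, cron schedule, and mail recipient below are made up for illustration:

#!/bin/bash
# /usr/local/bin/tinderbox-queue-sample -- hypothetical sampler for this report.
# Suggested (also hypothetical) cron schedule:
#   */5 * * * *  root  /usr/local/bin/tinderbox-queue-sample
#   55 23 * * *  root  tail -288 /var/log/tinderbox-queue-stats.log | mail -s "tinderbox queue stats" <recipient>
# Appends "epoch-seconds  queued-count  oldest-file-mtime" to a log file;
# 288 lines of tail covers 24 hours of 5-minute samples.

QUEUE=/var/www/iscsi/webtools/tinderbox/data
LOG=/var/log/tinderbox-queue-stats.log

count=$(ls -1 "$QUEUE"/tbx* 2>/dev/null | wc -l)
oldest=$(ls -1t "$QUEUE"/tbx* 2>/dev/null | tail -1)
# Record "-" for the mtime when the queue is empty.
mtime=${oldest:+$(stat -c%Y "$oldest")}
echo "$(date +%s) $count ${mtime:--}" >> "$LOG"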
Tweaked the settings per Amy Rich in IRC. Set at 600/900 for the message counts, 10/15 minutes for the time delay.
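In terms of the hypothetical plugin sketched above, those settings would translate to an invocation along these lines; the command name and sample outputs are illustrative only:

# 600/900 messages; 10/15 minutes expressed as 600/900 seconds.
./check_tinderbox_mail 600 900 600 900
# Possible outputs (hypothetical values):
#   OK - 188 tinderbox mails unprocessed, oldest is 2 minutes old
#   WARNING - 188 tinderbox mails unprocessed, oldest is 12 minutes old
#   CRITICAL - 950 tinderbox mails unprocessed, oldest is 3 minutes old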
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Can these go to #build? I only have a vague understanding of how this all works, but if the tree is closed for maintenance or whatever, wouldn't the mail queue get stale? Then it starts paging oncall and we aren't sure what to do about it or whether the tree is closed or whatnot. Might make more sense for these to go to #build and then if there is a real issue, then filing a blocker is the correct way to get oncall attention...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
wfm
contact group changed to build
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #8)
> Can these go to #build? I only have a vague understanding of how this all
> works, but if the tree is closed for maintenance or whatever, wouldn't the
> mail queue get stale? Then it starts paging oncall and we aren't sure what
> to do about it or whether the tree is closed or whatnot. Might make more
> sense for these to go to #build and then if there is a real issue, then
> filing a blocker is the correct way to get oncall attention...

A tree closure wouldn't trigger either of these checks. There's nothing RelEng does during normal operations that would cause these to trigger, and nothing we can do to fix it. I don't think setting the contact group to build is appropriate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Then we need documentation on what these checks mean, what limits are acceptable, etc.

I was under the impression that releng slaves send mail to tinderbox to report their status. If that were the case, a tree closure would cause no e-mail to be sent, because the slaves are idle, which would cause the queue to get stale? I might have jumped the gun on my assumed workflow. I'm fine with this paging oncall, but being oncall right now and having received three alerts already, I have no idea what to do about them. Then they recovered, so I'm not sure what any of this means.

We just need to have documented what the alert means, what the thresholds are, and what action to take.
AFAIU this doesn't necessarily mean there is a tree closure.

I think it makes sense for #build to know when we get into this condition in case we are asked by devs. Depending on the circumstance we should let IT know so they can check whether any of the mail servers are having issues.

What do you say, catlee?

Should the IT action be to get in touch with buildduty and analyze in which tree we are seeing the problem?

I assume that late at night we will see this check go off with the hundreds of L10n jobs reporting around the same time. I don't think it makes sense for IT to be paged.
(In reply to comment #13)
> AFAIU this doesn't necessarily mean there is a tree closure.
>
> I think it makes sense for #build to know when we get into this condition
> in case we are asked by devs. Depending on the circumstance we should let
> IT know so they can check whether any of the mail servers are having
> issues.
>
> What do you say, catlee?

Yes, nagios should alert #build, but I don't think that means it shouldn't also page oncall. RelEng has no way of diagnosing or fixing these problems.

> Should the IT action be to get in touch with buildduty and analyze in
> which tree we are seeing the problem?
>
> I assume that late at night we will see this check go off with the
> hundreds of L10n jobs reporting around the same time. I don't think it
> makes sense for IT to be paged.

Right, this should be considered part of normal load, and we should adjust our thresholds to accommodate them if necessary.
I've got the check set up to alert the #sysadmins IRC channel and go to #build. We're going to hold off on paging oncall until we've had some time to confirm these thresholds are reasonable, rather than paging oncall after hours with an untested check.
(In reply to comment #12)
> Then we need documentation on what these checks mean, what limits are
> acceptable, etc.
>
> I was under the impression that releng slaves send mail to tinderbox to
> report their status. If that were the case, a tree closure would cause no
> e-mail to be sent, because the slaves are idle, which would cause the
> queue to get stale? I might have jumped the gun on my assumed workflow.
> I'm fine with this paging oncall, but being oncall right now and having
> received three alerts already, I have no idea what to do about them. Then
> they recovered, so I'm not sure what any of this means.

In the event that mail is not being sent to tinderbox, the queue will actually go *down* in size/oldest-file age, if anything, because this check only looks at files that have already been delivered via mail.
I think rob's got the right idea here - the intent of these checks is to get proactive notice that things are funny with tinderbox. We don't know if that's all the time, or even if tinderbox's queue length is the problem (other bugs suggest that other mail-handling facilities are actually causing the delay). So let's get the visibility without hammering anyone's pager. If we find a threshold level and define an appropriate IT (or releng?) response, we can revisit. If this is too noisy in-channel, or misses failures, we'll adjust the thresholds. tl;dr: rob++
I'm going to go ahead and enable this to page oncall. There haven't been any alerts on this since the 17th. The check seems to be doing exactly what we want it to.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
A warning state for time looks like:

  <nagios-sjc1> [80] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 11 minutes old

i.e. it doesn't include the number pending. Could we make it say:

  <nagios-sjc1> [80] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 11 minutes old (XX queued)

so that we know what the problem is, but also what the other part of the check is finding? Similarly, if the alert goes off for the queue length, give the age of the oldest message in brackets. And the same change for CRITICAL states?
Might I suggest two independent checks? One for time and one for queue size?
That'd work for me.
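A hedged sketch of how that split could look, reusing the count/age variables and thresholds computed in the earlier hypothetical script (again not the deployed plugin); each Nagios service would invoke it with a different mode, and the other metric is still reported in the status text for context:

# First argument picks which metric this instance of the check alerts on.
mode=$1
case "$mode" in
    count) value=$count; warn=$WARN_COUNT; crit=$CRIT_COUNT
           msg="$count tinderbox mails unprocessed (oldest is $(( age / 60 )) minutes old)" ;;
    age)   value=$age;   warn=$WARN_AGE;   crit=$CRIT_AGE
           msg="Oldest message is $(( age / 60 )) minutes old ($count queued)" ;;
    *)     echo "UNKNOWN - unrecognized mode '$mode'"; exit 3 ;;
esac

if   [ "$value" -ge "$crit" ]; then echo "CRITICAL - $msg"; exit 2
elif [ "$value" -ge "$warn" ]; then echo "WARNING - $msg";  exit 1
else                                echo "OK - $msg";       exit 0
fi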
Product: mozilla.org → mozilla.org Graveyard