Closed Bug 657519 Opened 14 years ago Closed 14 years ago

Add nagios check for tinderbox mail processing

Categories

(mozilla.org Graveyard :: Server Operations, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: rtucker)

References

Details

No description provided.
That's weird, bugzilla ate the description. In bug 653969, we've seen that tinderbox's mail processing gets backed up. Bug 657521 is designed to mitigate that, but it would also be great if such slowness were proactively detected. Bug 653969 comment 14 suggests a nagios check of the queue length, with results like:

  CRITICAL - 12345 tinderbox mails unprocessed, oldest is 123 minutes old

How hard would this be? If it's not too hard, it would be a good stopgap to help keep tinderbox alive until we can make it go away.
Blocks: 653969
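For context, a Nagios check of this sort is just a command that prints one status line and reports its state through its exit code. A minimal sketch of the contract the requested check would have to meet (the standard plugin convention, not anything specific to this bug):

# Standard Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
# The single printed line becomes the alert text shown to oncall.
echo "CRITICAL - 12345 tinderbox mails unprocessed, oldest is 123 minutes old"
exit 2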
Rob, I'll leave this one to you since you monitor the mail and tinderbox servers.
This should be pretty easy. The mail queue is at /var/www/iscsi/webtools/tinderbox/data on dm-webtools02. There are other files besides spooled mail in that directory, so you can't just look at the size of the directory, but all mail spool filenames start with "tbx.".

Number of items in queue:

  ls -1 /var/www/iscsi/webtools/tinderbox/data/tbx* | wc -l

Timestamp of oldest item:

  ls -1t /var/www/iscsi/webtools/tinderbox/data/tbx* | tail -1 | xargs -n1 stat -c%y
Make that 'y' on the end be a 'Y' to get seconds since epoch for easy math.
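Putting those two commands together, here is a minimal sketch of what such a plugin could look like, assuming GNU stat/date are available on dm-webtools02. The script name, argument handling, and threshold arguments are placeholders for illustration, not the plugin that was actually deployed:

#!/bin/bash
# check_tinderbox_mail -- hypothetical sketch, not the plugin actually deployed.
# Usage: check_tinderbox_mail WARN_COUNT CRIT_COUNT WARN_SECS CRIT_SECS
# Prints one Nagios status line and exits 0 (OK), 1 (WARNING) or 2 (CRITICAL).

QUEUE=/var/www/iscsi/webtools/tinderbox/data
WARN_COUNT=$1; CRIT_COUNT=$2; WARN_AGE=$3; CRIT_AGE=$4

# Number of spooled tinderbox mails (spool filenames all start with "tbx.").
count=$(ls -1 "$QUEUE"/tbx* 2>/dev/null | wc -l)

# Age of the oldest spool file in seconds; ls -1t sorts newest first,
# so the last entry is the oldest.
age=0
if [ "$count" -gt 0 ]; then
    oldest=$(ls -1t "$QUEUE"/tbx* | tail -1)
    age=$(( $(date +%s) - $(stat -c%Y "$oldest") ))
fi

msg="$count tinderbox mails unprocessed, oldest is $(( age / 60 )) minutes old"

if [ "$count" -ge "$CRIT_COUNT" ] || [ "$age" -ge "$CRIT_AGE" ]; then
    echo "CRITICAL - $msg"; exit 2
elif [ "$count" -ge "$WARN_COUNT" ] || [ "$age" -ge "$WARN_AGE" ]; then
    echo "WARNING - $msg"; exit 1
else
    echo "OK - $msg"; exit 0
fi

Nagios takes the exit code as the service state and the printed line as the alert text, which matches the output format suggested in comment 0.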
What should the warning and critical values be for the message counts? There are currently about 188 messages in this queue.
Could we track, every 5 minutes over 24 hours, the following two numbers:

* # of unprocessed items
* age of the oldest item

If this report makes sense, could it be emailed to us every night? I believe this should give us a basis to determine what the warning and critical levels should be.
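One hedged way to collect that data, assuming cron is available on dm-webtools02 and using the paths from comment 3; the script path, log file, cron schedule, and mail recipient below are made up for illustration:

#!/bin/bash
# /usr/local/bin/tinderbox-queue-sample -- hypothetical sampler for this report.
# Suggested (also hypothetical) cron schedule:
#   */5 * * * *  root  /usr/local/bin/tinderbox-queue-sample
#   55 23 * * *  root  tail -288 /var/log/tinderbox-queue-stats.log | mail -s "tinderbox queue stats" <recipient>
# Appends "epoch-seconds  queued-count  oldest-file-mtime" to a log file;
# 288 lines of tail covers 24 hours of 5-minute samples.

QUEUE=/var/www/iscsi/webtools/tinderbox/data
LOG=/var/log/tinderbox-queue-stats.log

count=$(ls -1 "$QUEUE"/tbx* 2>/dev/null | wc -l)
oldest=$(ls -1t "$QUEUE"/tbx* 2>/dev/null | tail -1)
# Record "-" for the mtime when the queue is empty.
mtime=${oldest:+$(stat -c%Y "$oldest")}
echo "$(date +%s) $count ${mtime:--}" >> "$LOG"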
Tweaked the settings per Amy Rich in IRC. Set at 600/900 for the message counts, 10/15 minutes for the time delay.
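In terms of the hypothetical plugin sketched above, those settings would translate to an invocation along these lines; the command name and sample outputs are illustrative only:

# 600/900 messages; 10/15 minutes expressed as 600/900 seconds.
./check_tinderbox_mail 600 900 600 900
# Possible outputs (hypothetical values):
#   OK - 188 tinderbox mails unprocessed, oldest is 2 minutes old
#   WARNING - 188 tinderbox mails unprocessed, oldest is 12 minutes old
#   CRITICAL - 950 tinderbox mails unprocessed, oldest is 3 minutes old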
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Can these go to #build? I only have a vague understanding of how this all works, but if the tree is closed for maintenance or whatever, wouldn't the mail queue get stale? Then it starts paging oncall and we aren't sure what to do about it or whether the tree is closed or whatnot. Might make more sense for these to go to #build and then if there is a real issue, then filing a blocker is the correct way to get oncall attention...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
wfm
contact group changed to build
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #8)
> Can these go to #build? I only have a vague understanding of how this all
> works, but if the tree is closed for maintenance or whatever, wouldn't the
> mail queue get stale? Then it starts paging oncall and we aren't sure what
> to do about it or whether the tree is closed or whatnot. Might make more
> sense for these to go to #build and then if there is a real issue, then
> filing a blocker is the correct way to get oncall attention...

A tree closure wouldn't trigger either of these checks. There's nothing RelEng does during normal operations that would cause these to trigger, and nothing we can do to fix it. I don't think setting the contact group to build is appropriate.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Then we need documentation on what these checks mean, what limits are acceptable, etc.

I was under the impression that releng slaves send mail to tinderbox to report their status. If that were the case, a tree closure would cause no e-mail to be sent, because the slaves are idle, which would cause the queue to get stale? I might have jumped the gun on my assumed workflow. I'm fine with this paging oncall, but being oncall right now and having received three alerts already, I have no idea what to do about them. Then they recovered, so I'm not sure what any of this means.

We just need to have documented what the alert means, what the thresholds are, and what action to take.
AFAIU this doesn't necessarily mean there is a tree closure.

I think it makes sense for #build to know when we get into this condition in case we are asked by devs. Depending on the circumstance we should let IT know so they can check whether any of the mail servers are having issues.

What do you say, catlee?

Should the IT action be to get in touch with buildduty and analyze in which tree we are seeing the problem?

I assume that late at night we will see this check go off with the hundreds of L10n jobs reporting around the same time. I don't think it makes sense for IT to be paged.
(In reply to comment #13)
> AFAIU this doesn't necessarily mean there is a tree closure.
>
> I think it makes sense for #build to know when we get into this condition
> in case we are asked by devs. Depending on the circumstance we should let
> IT know so they can check whether any of the mail servers are having
> issues.
>
> What do you say, catlee?

Yes, nagios should alert #build, but I don't think that means it shouldn't also page oncall. RelEng has no way of diagnosing or fixing these problems.

> Should the IT action be to get in touch with buildduty and analyze in
> which tree we are seeing the problem?
>
> I assume that late at night we will see this check go off with the
> hundreds of L10n jobs reporting around the same time. I don't think it
> makes sense for IT to be paged.

Right, this should be considered part of normal load, and we should adjust our thresholds to accommodate them if necessary.
I've got the check set up to alert the #sysadmins IRC channel and go to #build. We're going to hold off on paging oncall until we've had some time to confirm these thresholds are reasonable, rather than paging oncall after hours with an untested check.
(In reply to comment #12)
> Then we need documentation on what these checks mean, what limits are
> acceptable, etc.
>
> I was under the impression that releng slaves send mail to tinderbox to
> report their status. If that were the case, a tree closure would cause no
> e-mail to be sent, because the slaves are idle, which would cause the
> queue to get stale? I might have jumped the gun on my assumed workflow.
> I'm fine with this paging oncall, but being oncall right now and having
> received three alerts already, I have no idea what to do about them. Then
> they recovered, so I'm not sure what any of this means.

In the event that mail is not being sent to tinderbox, the queue will actually go *down* in size/oldest-file age, if anything, because this check only looks at files that have already been delivered via mail.
I think rob's got the right idea here - the intent of these checks is to get proactive notice that things are funny with tinderbox. We don't know if that's all the time, or even if tinderbox's queue length is the problem (other bugs suggest that other mail-handling facilities are actually causing the delay). So let's get the visibility without hammering anyone's pager. If we find a threshold level and define an appropriate IT (or releng?) response, we can revisit. If this is too noisy in-channel, or misses failures, we'll adjust the thresholds. tl;dr: rob++
I'm going to go ahead and enable this to page oncall. There haven't been any alerts on this since the 17th. The check seems to be doing exactly what we want it to.
Status: REOPENED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
A warning state for time looks like:

  <nagios-sjc1> [80] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 11 minutes old

i.e. it doesn't include the number pending. Could we make it say:

  <nagios-sjc1> [80] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 11 minutes old (XX queued)

so that we know what the problem is, but also what the other part of the check is finding? Similarly, if the alert goes off for the queue length, give the age of the oldest message in brackets. And the same change for CRITICAL states?
Might I suggest two independent checks? One for time and one for queue size?
That'd work for me.
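A hedged sketch of how that split could look, reusing the count/age variables and thresholds computed in the earlier hypothetical script (again not the deployed plugin); each Nagios service would invoke it with a different mode, and the other metric is still reported in the status text for context:

# First argument picks which metric this instance of the check alerts on.
mode=$1
case "$mode" in
    count) value=$count; warn=$WARN_COUNT; crit=$CRIT_COUNT
           msg="$count tinderbox mails unprocessed (oldest is $(( age / 60 )) minutes old)" ;;
    age)   value=$age;   warn=$WARN_AGE;   crit=$CRIT_AGE
           msg="Oldest message is $(( age / 60 )) minutes old ($count queued)" ;;
    *)     echo "UNKNOWN - unrecognized mode '$mode'"; exit 3 ;;
esac

if   [ "$value" -ge "$crit" ]; then echo "CRITICAL - $msg"; exit 2
elif [ "$value" -ge "$warn" ]; then echo "WARNING - $msg";  exit 1
else                                echo "OK - $msg";       exit 0
fi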
Product: mozilla.org → mozilla.org Graveyard