Closed Bug 659724 Opened 14 years ago Closed 13 years ago

tinderbox queue runner is not keeping up (dm-webtools02)

Categories

(Release Engineering :: General, defect)

defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: bear)

References

Details

(Whiteboard: [tinderbox][buildduty])

Attachments

(1 file)

We were just alerted that the tinderbox mail queue is 12 minutes long.  This means that the MTA's are doing their job, but that the queue runner can't keep up.

We're discussing parallelizing that queue-running somehow; here's the IRC convo:

14:14 < justdave> right now the cron job fires once per minute, and takes an exclusive lock on the queue
14:14 < justdave> it runs until it finishes the entire queue then exits
14:14 < bear-lunch> assign it to me so I can add it to my todo list when I review things with coop
14:14 < dustin> bear-lunch: you got it
14:14 < justdave> if the cron job fires again while it's still running, it sees the lock in place and exits
14:15  * joduinn happy that the nagios check caught it so we could all see the state of the world while it was still happening
14:15 < justdave> if we locked individual queue files instead of the entire queue, each invocation of the cron job could add an additional queue runner
14:15 < armenzg_buildduty> what is the hostname that processes this?
14:15 < justdave> so as it started getting behind, we'd add a new runner every minute until they all caught up
14:15 < bear-lunch> justdave - yea, we could do that with some sane upper limit
14:15 < justdave> yeah, 4 would probably be a good limit
14:15 < dustin> armenzg_buildduty: dm-webtools02
14:15 < justdave> seeing how there's 4 cores on the CPU

Over to bear to look into this and triage.
another idea from the same IRC conversation:

[15:19]  <justdave> bear-lunch: another thought... right now I think it sets a hash internally of what trees have been touched while it's processing, and then once it's finished with the queue, it updates the cache on all of those trees.
[15:20]  <justdave> we could have it touch some file in each tree that says "this tree's cache is out of date" and have a separate cron job come and rebuild those
See also explanation on dev.tree-management http://tinyurl.com/3puvc2f
Whiteboard: [tinderbox] → [tinderbox][buildduty]
[16:04]	<nagios-sjc1>	[06] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 18 minutes old
[16:24]	<nagios-sjc1>	[12] dm-webtools02:Tinderbox Mail is WARNING: WARNING: Oldest message is 21 minutes old
Getting bug 661365 security review would obsolete this bug.
Should we dupe this to bug 657521?

This was just to cover the occurrence last time it happened.
no, this is the code change needed to try and avoid the nagios alerts. the other bug is a hardware fix.
Lag is currently over 2k messages and 40 minutes, and we'll have to see if that subsides once nightlies clear away. Have we heard anything to confirm the security review is proceeding ? Should we be looking at threading (or something) at http://mxr.mozilla.org/mozilla/source/webtools/tinderbox/processbuild.pl#67:
 67 for my $file (@datafiles) {
 68     &process_mailfile("$::data_dir/$file");
 69 }
Beware the global vars at lines 39-43.
The person doing the security review (forget his name) let us know that he would working on the review, so it is moving.

The loop you mentioned above is close to what I was going to do after I review the code to make sure none of the called routines (like scrape.pl and some others) don't adjust some of the global data files - they are not async friendly.

The only one that really has me worried is limiting it to one process per tree
So, the lag problem went away when the 'monthly backup' finished, removing a bunch of load from the box & SAN ?
Well, the alerts stopped, but it was also the weekend.  We may be in the eye of the storm now.
Once we switch over to using TBPL, we'll no longer need to send emails, hence no longer concerned about this queue length. Keeping this bug open for now, as it remains a production concern until we switch over to tbpl.
Depends on: 625979
Details from notes from months-ago IRC conversations:

Switching to the new TBPL implementation is a step, and at some point makes the box no longer a concern of releng's, but does not allow tinderbox.mozilla.org to be turned off.

First, it still provides web views that are useful to some people (aki, at least) and are not available elsewhere.  To continue to support these views, releng should continue sending mails to tinderbox, at least for some branches.

Second, there are external projects, NSPR among them, which still use tinderbox as their primary build-reporting tool.  Of course, those are much lower load than the releng builds, so the queue problems we're seeing in this bug would not apply.
fyi: this queue is backed up again for the last ~6 hours. Example is:

Subject: ** PROBLEM alert - dm-webtools02/Tinderbox Queue is WARNING **
Date: Wed,  3 Aug 2011 00:35:17 -0700 (PDT)
From: nagios@dm-nagios01.mozilla.org (nagios)
To: release@mozilla.com

***** Nagios  *****

Notification Type: PROBLEM

Service: Tinderbox Queue
Host: dm-webtools02
Address: 10.2.74.14
State: WARNING

Date/Time: 08-03-2011 00:35:17

Additional Info:

WARNING: 3230 Messages Queued. Oldest is 106 minutes old
(In reply to comment #13)
> fyi: this queue is backed up again for the last ~6 hours. Example is:

Does comment 12 suggest that we should be finding resources to rework this code? Having fewer people depending on this without actually reducing load seems like a recipe for disaster.
We should improve our non-tinderbox web views so we can turn off MoCo email to this box instead, I think.
Seriously and constantly impacting development - not only is there the constant 20-80 minute (with spikes to 4 or 5 hours) lag processing the mail, there's also a lag of unknown duration between when the mail is processed and when the static json that tbpl reads is updated. The way I spotted it was by watching #firebot, where he reads quickparse.txt and reports when something changes, and noticing that 20 minutes later, the change still wasn't in tbpl. You can work around that lag by forcing static page generation, but the fact that I'm considering setting up a cronjob to load those URLs for the trees I'm interested in every 5 or 10 minutes is rather gross.
Severity: normal → blocker
Depends on: 669000
2 days ago we had to close all trees, yesterday it was bad (1 hour of backlog) but we were able to survive it, today trees are closed again and I don't expect the situation to improve at least till next merge.

If the new tbpl is too far away, may we at least move all tryserver messages to a separate queue? Those are usually the largest pile and there is no valid reason they should impact development on final trees.
(In reply to Marco Bonardo [:mak] (Away 8-14 Aug) from comment #18)
> If the new tbpl is too far away, may we at least move all tryserver messages
> to a separate queue? Those are usually the largest pile and there is no
> valid reason they should impact development on final trees.

Can we turn off try tinderbox mail altogether?
Aiui, we already have options for developers to a) have results mailed to them b) look at logs on ftp c) get results posted to bugzilla d) look at results in tbpl staging.
Blocks: 676822
Summary: tinderbox queue runner is not keeping up → tinderbox queue runner is not keeping up (dm-webtools02)
Depends on: 679448
This patch adds some *very* rough debug logs (to help us see what is happening when) and also some good timing and other stats per mail file

it's not the prettiest perl code, but my goal was to add and not change the underlying processing so we can insert this into production to start gathering data
Attachment #555322 - Flags: review?(rhelmer)
Attachment #555322 - Flags: feedback?(nrthomas)
Attachment #555322 - Flags: feedback?(aki)
Comment on attachment 555322 [details] [diff] [review]
add debug logging and stats logging to processbuild.pl

Nice, we should get useful data from this. There would be some ambiguity about which builds were submitting the most data because of varying use of compression on the mail. We could print the logcompression header too, or calculate the uncompressed size (perhaps by summing up $readbytes in compress_log_file).

>-use lib "@TINDERBOX_DIR@";
>+use lib "/var/www/iscsi/webtools/tinderbox";

>-$ENV{'PATH'} = "@SETUID_PATH@";
>-my $tinderboxdir = "@TINDERBOX_DIR@";
>+$ENV{'PATH'} = "/usr/bin:/bin";
>+my $tinderboxdir = "/var/www/iscsi/webtools/tinderbox";

aki, rhelmer - these were local changes already on the prod box.
Attachment #555322 - Flags: feedback?(nrthomas) → feedback+
Comment on attachment 555322 [details] [diff] [review]
add debug logging and stats logging to processbuild.pl

Review of attachment 555322 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.  Are we at all worried about space for the logs?  What is the available space on /tmp?

::: processbuild.pl
@@ -32,5 @@
>  umask 002;
>  
>  # setuid globals
> -$ENV{'PATH'} = "@SETUID_PATH@";
> -my $tinderboxdir = "@TINDERBOX_DIR@";

Do you want to commit these changes?  Or is this just an artifact of the patching process?
Attachment #555322 - Flags: feedback+
(In reply to Dustin J. Mitchell [:dustin] from comment #22)
> Comment on attachment 555322 [details] [diff] [review]
> add debug logging and stats logging to processbuild.pl
> 
> Review of attachment 555322 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good.  Are we at all worried about space for the logs?  What is the
> available space on /tmp?

once my tests and pokes are done, i'll move them to a "proper" log directory

> 
> ::: processbuild.pl
> @@ -32,5 @@
> >  umask 002;
> >  
> >  # setuid globals
> > -$ENV{'PATH'} = "@SETUID_PATH@";
> > -my $tinderboxdir = "@TINDERBOX_DIR@";
> 
> Do you want to commit these changes?  Or is this just an artifact of the
> patching process?

that's the diff from CVS (pre-install) to production - so yea, artifacts
Comment on attachment 555322 [details] [diff] [review]
add debug logging and stats logging to processbuild.pl

+1 for dustin's comments.
Attachment #555322 - Flags: feedback?(aki) → feedback+
Comment on attachment 555322 [details] [diff] [review]
add debug logging and stats logging to processbuild.pl

>Index: processbuild.pl
>===================================================================
> # setuid globals
>-$ENV{'PATH'} = "@SETUID_PATH@";
>-my $tinderboxdir = "@TINDERBOX_DIR@";
>+$ENV{'PATH'} = "/usr/bin:/bin";
>+my $tinderboxdir = "/var/www/iscsi/webtools/tinderbox";

Probably don't want to change these ^

Otherwise looks fine, r+ with those reverted (or explanation why they shouldn't be)
Attachment #555322 - Flags: review?(rhelmer) → review+
(In reply to Robert Helmer [:rhelmer] from comment #25)
> Comment on attachment 555322 [details] [diff] [review]
> add debug logging and stats logging to processbuild.pl
> 
> >Index: processbuild.pl
> >===================================================================
> > # setuid globals
> >-$ENV{'PATH'} = "@SETUID_PATH@";
> >-my $tinderboxdir = "@TINDERBOX_DIR@";
> >+$ENV{'PATH'} = "/usr/bin:/bin";
> >+my $tinderboxdir = "/var/www/iscsi/webtools/tinderbox";
> 
> Probably don't want to change these ^
> 

:) - yea, I was diff'ing against the code I was running on my test vm

> Otherwise looks fine, r+ with those reverted (or explanation why they
> shouldn't be)

my hope is that I can run this on production only to gather data - then decide if this will go into the codebase
bug#681836 will help give us more data to figure out where the bottleneck is.
Depends on: 681836
Related: https://bugzilla.mozilla.org/show_bug.cgi?id=681680
closing this as FIXED - tbpl has been deployed and the number of builders sending tbox email has been drastically reduced.

if for some reason this becomes a problem in the future, please shoot me before reopening...
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: