Closed
Bug 1130347
Opened 10 years ago
Closed 8 years ago
buildapi/self-serve timeouts & builds-4hr/running/pending Nagios age alerts at 8am UTC every day
Categories
(Release Engineering :: General, defect)
Release Engineering
General
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: emorley, Unassigned)
References
Details
(Keywords: treeherder)
We're getting Nagios alerts at a similar time each day (8am UTC) - eg:
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 02-06-2015 00:12:59
Additional Info: CRITICAL - Socket timeout after 10 seconds
Service: http file age - /buildjson/builds-running.js
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-06-2015 00:17:09
Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:01 ago - 8191 bytes in 0.006 second response time
Is some task running at that time each day?
Reporter | ||
Updated•10 years ago
|
Keywords: treeherder
Comment 1•10 years ago
|
||
We've got a bunch of reports that run at midnight PT, which I think lines up with this?
I'll try moving some to 00:30 and see if that makes a difference.
Reporter | ||
Comment 2•10 years ago
|
||
Ah great - thank you :-)
Comment 3•10 years ago
|
||
I also have bug 1129626 that I suspect could help this as well
Reporter | ||
Comment 4•10 years ago
|
||
(In reply to Chris AtLee [:catlee] from comment #1)
> We've got a bunch of reports that run at midnight PT, which I think lines up
> with this?
>
> I'll try moving some to 00:30 and see if that makes a difference.
Unfortunately saw these alerts again today:
Service: http file age - /buildjson/builds-running.js
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-07-2015 00:18:04
Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:00 ago - 8191 bytes in 0.006 second response time
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 02-07-2015 00:18:54
Additional Info: CRITICAL - Socket timeout after 10 seconds
Service: http file age - /buildjson/builds-pending.js
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-07-2015 00:19:04
Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:11 ago - 4506 bytes in 0.010 second response time
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-07-2015 00:20:04
Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:40 ago - 2848 bytes in 0.010 second response time
Depends on: 1129626
Reporter | ||
Comment 5•10 years ago
|
||
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 02-08-2015 00:41:02
Additional Info:
CRITICAL - Socket timeout after 10 seconds
Comment 6•10 years ago
|
||
for reference, another round today, seems to correspond to catlee's timing changes.
[02:41:36] nagios-releng Mon 23:41:47 PST [4634] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/b2g_bumper/b2g_bumper.stamp is 1489 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
[02:43:37] nagios-releng Mon 23:43:47 PST [4635] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is OK: FILE_AGE OK: /builds/b2g_bumper/b2g_bumper.stamp is 98 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
[03:02:31] =-= mdoglio|afk is now known as mdoglio
[03:15:57] nagios-releng Tue 00:16:07 PST [4636] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs)
[03:17:07] nagios-releng Tue 00:17:17 PST [4637] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:00 ago - 8191 bytes in 0.009 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js)
[03:17:07] nagios-releng Tue 00:17:17 PST [4638] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:21 ago - 2848 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[03:17:07] nagios-releng Tue 00:17:17 PST [4639] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:00 ago - 8191 bytes in 0.006 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js)
[03:20:57] nagios-releng Tue 00:21:08 PST [4640] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs)
[03:25:36] nagios-releng Tue 00:25:47 PST [4641] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/b2g_bumper/b2g_bumper.stamp is 1459 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
[03:25:46] nagios-releng Tue 00:25:57 PST [4642] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is OK: HTTP OK: HTTP/1.1 200 OK - 44936 bytes in 0.387 second response time (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs)
[03:27:06] nagios-releng Tue 00:27:17 PST [4643] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[03:27:07] nagios-releng Tue 00:27:17 PST [4644] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.010 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js)
[03:27:07] nagios-releng Tue 00:27:17 PST [4645] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.015 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js)
[03:27:36] nagios-releng Tue 00:27:47 PST [4646] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is OK: FILE_AGE OK: /builds/b2g_bumper/b2g_bumper.stamp is 34 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
[03:41:56] nagios-releng Tue 00:42:07 PST [4647] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs)
[03:46:06] nagios-releng Tue 00:46:17 PST [4648] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:38 ago - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[03:46:07] nagios-releng Tue 00:46:17 PST [4649] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:08 ago - 8191 bytes in 0.005 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js)
[03:46:07] nagios-releng Tue 00:46:17 PST [4650] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:08 ago - 2848 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js)
[03:46:46] nagios-releng Tue 00:46:57 PST [4651] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is OK: HTTP OK: HTTP/1.1 200 OK - 44924 bytes in 0.335 second response time (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs)
[03:51:06] nagios-releng Tue 00:51:17 PST [4652] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.009 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[03:51:07] nagios-releng Tue 00:51:18 PST [4653] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.014 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js)
[03:51:07] nagios-releng Tue 00:51:18 PST [4654] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is OK: HTTP OK: HTTP/1.1 200 OK - 4272 bytes in 0.022 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js)
[03:51:36] nagios-releng Tue 00:51:47 PST [4655] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/b2g_bumper/b2g_bumper.stamp is 1475 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
[03:57:03] =-= Bebe is now known as Bebe|mtg
[03:57:36] nagios-releng Tue 00:57:47 PST [4656] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is CRITICAL: FILE_AGE CRITICAL: /builds/b2g_bumper/b2g_bumper.stamp is 1834 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
[03:59:36] nagios-releng Tue 00:59:47 PST [4657] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is OK: FILE_AGE OK: /builds/b2g_bumper/b2g_bumper.stamp is 10 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
Reporter | ||
Comment 7•10 years ago
|
||
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-10-2015 20:08:17
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:02 ago - 8191 bytes in 0.005 second response time
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-11-2015 00:18:03
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:34 ago - 2848 bytes in 0.005 second response time
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-11-2015 00:49:03
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:35 ago - 8191 bytes in 0.005 second response time
Summary: buildapi/self-serve timeouts & builds-4hr/running/pending age alerts at 8am UTC every day → buildapi/self-serve timeouts & builds-4hr/running/pending Nagios age alerts at 8am UTC every day
Reporter | ||
Comment 8•10 years ago
|
||
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-12-2015 00:18:04
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:53 ago - 1424 bytes in 0.007 second response time
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-12-2015 00:44:07
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:58 ago - 8191 bytes in 0.005 second response time
Reporter | ||
Comment 9•10 years ago
|
||
Service: http - /buildapi/self-serve/jobs
Host: buildapi.pvt.build.mozilla.org
Address: 10.22.74.160
State: CRITICAL
Date/Time: 02-13-2015 00:42:25
Additional Info:
CRITICAL - Socket timeout after 10 seconds
Service: http file age - /buildjson/builds-pending.js
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-13-2015 00:43:14
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:05 ago - 1424 bytes in 0.005 second response time
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-13-2015 00:48:04
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:49 ago - 1424 bytes in 0.009 second response time
Reporter | ||
Comment 10•10 years ago
|
||
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-13-2015 04:14:07
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:19 ago - 8191 bytes in 0.006 second response time
Reporter | ||
Comment 11•10 years ago
|
||
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-13-2015 08:30:07
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:53 ago - 8191 bytes in 0.019 second response time
Reporter | ||
Comment 12•10 years ago
|
||
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-13-2015 09:09:07
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:01 ago - 8191 bytes in 0.009 second response time
Reporter | ||
Comment 13•10 years ago
|
||
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-15-2015 03:14:24
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:07 ago - 8191 bytes in 0.024 second response time
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 02-15-2015 03:28:24
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:17 ago - 8191 bytes in 0.011 second response time
Comment 14•10 years ago
|
||
This is even more frequent than once a day now. I've pinged the db team to see if they can give us info on the machine configs and replication times to see if this might be part of the problem.
Comment 15•10 years ago
|
||
And this is an issue again tonight
[21:24:58] nagios-releng Thu 18:25:05 PST [4124] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:57 ago - 7120 bytes in 0.006 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[21:30:27] nagios-releng Thu 18:30:34 PST [4125] cruncher.srv.releng.scl3.mozilla.com:Pending builds is OK: OK Pending Builds: 1974 (http://m.mozilla.org/Pending+builds)
[21:34:57] nagios-releng Thu 18:35:04 PST [4126] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.018 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[21:43:57] nagios-releng Thu 18:44:04 PST [4127] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:25 ago - 8191 bytes in 0.011 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[21:53:55] nagios-releng Thu 18:54:02 PST [4128] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.030 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[22:02:55] nagios-releng Thu 19:03:02 PST [4129] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:20 ago - 4272 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[22:22:58] nagios-releng Thu 19:23:05 PST [4130] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[22:31:58] nagios-releng Thu 19:32:05 PST [4131] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:13:22 ago - 1424 bytes in 0.003 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[22:51:58] nagios-releng Thu 19:52:05 PST [4132] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 4339 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[23:00:59] nagios-releng Thu 20:01:06 PST [4133] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:01 ago - 1424 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[23:20:58] nagios-releng Thu 20:21:05 PST [4135] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 7120 bytes in 0.005 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[23:29:56] nagios-releng Thu 20:30:03 PST [4136] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:03 ago - 8191 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[23:39:55] nagios-releng Thu 20:40:02 PST [4137] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.020 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[23:48:55] nagios-releng Thu 20:49:02 PST [4138] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:06 ago - 8191 bytes in 0.008 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
[23:53:59] nagios-releng Thu 20:54:06 PST [4139] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
Amy, did we hear back from our DBA's, and/or do you have more insight here?
Flags: needinfo?(arich)
Comment 16•10 years ago
|
||
I've heard nothing back from the DBAs yet, but I've cced sheeri and sean on this bug. Let me paste the email I sent to them so they can answer here in the bug:
We're trying to track down some issues (lots of flapping)
with some of our nagios checks which do some polling of the buildbot
database hosts. I'm pretty sure we've seen these issues since the migration
to the new servers. The specific check is:
builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz
There's a lot of soft flapping, and some hard notifications in the nagios
logs as well (you can look at the log archives for more examples of alerts):
http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/notifications.cgi?host=builddata.pub.build.mozilla.org&service=http+file+age+-+%2Fbuildjson%2Fbuilds-4hr.js.gz&type=0&
I know that the infrastructure has changed recently, so it's probably
useful for us to get a clear idea of some past and current data to help us
debug. What is the comparison between the infrastructure we were on and
what we're on now? I know that we had some physical hardware with (I think
24G) of RAM previously, and now we've moved to some VMs with less. For the
before and after pictures, can you tell us the following for any r/w and
r/o machines?
Type (Physical/Virtual)
How many of each (e.g. did we move form one ro to two?)
CPU (# cores, shared or reserved)
RAM
Any changes in LB config
Replication lag between the servers (hopefully we have some december vs now
numbers?)
I believe that the dbe team has a nagios check that's keeping track of
replication lag, too, correct? What's the sensitivity on that, and is it
reporting to build (basically #buildduty on irc), or just to the dbe
team/#sysadmins?
Flags: needinfo?(arich) → needinfo?(scabral)
Comment 17•10 years ago
|
||
Adding Matt as he's a DBA as well.
The infrastructure for buildbot changed in bug 1125227, and it last changed on Feb 18th. Originally buildbot had 24G RAM. We tried to reduce that to 4G or 6G RAM, as we've been able to do with 90% of our database servers, but buildbot needed more. During this reduction we added another (virtual) slave to see if it would help with the load and decrease replication lag on the slave, but it did not, so it was killed.
Currently, the master (buildbot2) has 24G RAM, and the slave (buildbot1) has 16G RAM. Back in Dec, the master was buildbot1 and the slave was buildbot2, and there's been a few flips back and forth, which are the only load balancer config changes I've made. Anything more complicated would have been done by webops.
CPU (which usually isn't an issue with MySQL dbs) is 4 core Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz. I don't know if they're shared or reserved, that's an issue for the virtualization folks.
Machines *were* physical but are now virtual. If virtual machines can't cut it, let's talk about buying hardware - I'm not sure whose budget that would fall under.
We have *increased* the amount of memory used for the index cache, which was completed in late January (r99398), but that would cause replication lag to be less, not more.
The nagios check only runs once every 5 minutes and does not always catch lag. You can see historical lag at https://rpm.newrelic.com/accounts/263620/dashboard/3101981/page/4 for buildbot1 and https://rpm.newrelic.com/accounts/263620/dashboard/3101982/page/4 for buildbot2 (buildbot2 was the slave in Dec...you can pretty much see where they flipped by the replication lag...the master never has replication lag).
Replication lag pages the DBA oncall, and puts an alert in #db-alerts, but does not notify anyone else.
For how long have the alerts been coming in at 8 am UTC?
The db machines are in Pacific time (ugh) and at 6 am and 6 pm Pacific (so, 2 pm and 2 am UTC until this weekend's time change) we run a checksum, which can have an effect on replication. It sounds like the timing does not match, though.
Flags: needinfo?(scabral)
Comment 18•10 years ago
|
||
Bug 1057184 also has some developer-centric suggestions for fixes. Replication lag is not a new issue with buildbot, though the regularity of it does seem new.
And not that I really need to say it again, but if there's ANY way buildbot could be changed so as to NOT require MyISAM, that would help A LOT. See http://docs.buildbot.net/0.8.5/developer/database.html#index-length-in-mysql - I have no idea what exactly needs such a long index, but I cannot imagine a scenario where there's no workaround for that (e.g. surrogate key using an autoincrement number).
Comment 19•10 years ago
|
||
Another piece of relevant information is that back a month or more ago when replication was a problem, we set the master and slave VIPs to both point to buildbot2 (so above I was not correct when I said buildbot1 is the slave - it's replicating buildbot2 but nothing's doing any queries on it).
Any replication lag on buildbot1 should NOT be affecting anything, as all applications should be using the VIP to connect.
Reporter | ||
Comment 20•10 years ago
|
||
These haven't occurred for a while.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Reporter | ||
Comment 21•10 years ago
|
||
So pretty much a day and a bit after I closed this, it started occurring again. Typical.
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 04-23-2015 00:44:19
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:45 ago - 8191 bytes in 0.009 second response time
<followed by instances every morning>
This morning's:
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 04-28-2015 00:19:00
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:33 ago - 2891 bytes in 0.004 second response time
Reporter | ||
Updated•10 years ago
|
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Comment 22•9 years ago
|
||
....still an issue?
Reporter | ||
Comment 23•9 years ago
|
||
Hasn't occurred since, but I presume the DB failure recently solved the problem for us, by making the queries more performant hehe :-)
Status: REOPENED → RESOLVED
Closed: 10 years ago → 9 years ago
Resolution: --- → WORKSFORME
Reporter | ||
Comment 24•9 years ago
|
||
This is happening pretty much daily at the moment - could someone take a look? :-)
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 11-16-2015 00:17:57
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:13:39 ago - 8191 bytes in 0.005 second response time
***** Nagios *****
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 11-16-2015 00:41:59
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:41 ago - 8191 bytes in 0.006 second response time
Status: RESOLVED → REOPENED
Flags: needinfo?(catlee)
Resolution: WORKSFORME → ---
Comment 25•9 years ago
|
||
From the #buildduty channel, this flapped twice last night:
03:18 <nagios-releng> Mon 00:17:57 PST [4000] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:13:39 ago - 8191 bytes in 0.005 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
03:20 aselagea|afk → aselagea|buildduty
03:23 <nagios-releng> Mon 00:22:57 PST [4001] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
03:42 <nagios-releng> Mon 00:41:59 PST [4002] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:41 ago - 8191 bytes in 0.006 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
03:52 <nagios-releng> Mon 00:51:59 PST [4003] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 4344 bytes in 0.022 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz)
I'm going to check cruncher to see whether we're DOSing ourselves during that stretch.
Comment 26•9 years ago
|
||
Would be useful to know if the r/w and r/o vips are currently pointing to a single machine or master and slave.
Comment 27•9 years ago
|
||
(In reply to Nick Thomas [:nthomas] from comment #26)
> Would be useful to know if the r/w and r/o vips are currently pointing to a
> single machine or master and slave.
Sheeri: I guess that question is for you. ^^
Flags: needinfo?(scabral)
Comment 28•9 years ago
|
||
On cruncher, we had a bunch of slave_health processes stacking up with reportor and allthethings at midnight at 00:00 and 00:30, so I've spaced those out a little better.
Updated•9 years ago
|
Flags: needinfo?(catlee)
Reporter | ||
Comment 29•9 years ago
|
||
That's great - thank you :-)
Will take a look in a few days and see if there have been any more nagios alert emails.
Comment 30•9 years ago
|
||
buildbot-ro and buildbot-rw point to the same machine. This is what I'd expect because the lag problem made it so we had to do this.
Flags: needinfo?(scabral)
Reporter | ||
Comment 31•9 years ago
|
||
This morning I received:
Notification Type: PROBLEM
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: CRITICAL
Date/Time: 11-18-2015 00:43:49
Additional Info:
HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:29 ago - 8191 bytes in 0.004 second response time
Notification Type: RECOVERY
Service: http file age - /buildjson/builds-4hr.js.gz
Host: builddata.pub.build.mozilla.org
Address: 63.245.215.57
State: OK
Date/Time: 11-18-2015 00:48:51
Additional Info:
HTTP OK: HTTP/1.1 200 OK - 4344 bytes in 0.025 second response time
I can't see anything helpful on the New Relic APM page for api.pub.build.m.o (https://rpm.newrelic.com/accounts/263620/applications/6284521) - guessing the scheduled tasks aren't using the `newrelic-admin run-program` wrapper.
The MySQL plugin page shows a spike in slow queries around that time (but there's no way for me to see what they were, since the APM page is required for that):
https://rpm.newrelic.com/accounts/263620/dashboard/3101981/page/2?tw%5Bend%5D=1447840620&tw%5Bstart%5D=1447829820
And disk IO on buildbot1.db.scl3.mozilla.com reached 100%:
https://rpm.newrelic.com/accounts/263620/servers/5241897?tw%5Bend%5D=1447840879&tw%5Bstart%5D=1447830079
Comment 32•9 years ago
|
||
Timing-wise, that lines up with daily reportor run on cruncher which starts at 00:30. The log shows the build_times query running from 00:34:46 to 00:44:55. I imagine this is pretty intensive query-wise:
https://github.com/catlee/reportor/blob/master/reports/build_times/build_times.py
Comment 33•9 years ago
|
||
(In reply to Ed Morley [:emorley] from comment #31)
>
> I can't see anything helpful on the New Relic APM page for api.pub.build.m.o
> (https://rpm.newrelic.com/accounts/263620/applications/6284521) - guessing
> the scheduled tasks aren't using the `newrelic-admin run-program` wrapper.
>
Celery tasks are not wrapped with new relic atm (iirc) however note that we don't use api.pub for this either, so thats irrelevant for our desire to track down the slowdown.
> And disk IO on buildbot1.db.scl3.mozilla.com reached 100%:
> https://rpm.newrelic.com/accounts/263620/servers/
> 5241897?tw%5Bend%5D=1447840879&tw%5Bstart%5D=1447830079
This, however is relevant.
Comment 34•9 years ago
|
||
I've disabled the 'build_times' report for now.
The 'build_steps' report is also pretty expensive - I wonder if we still need that.
Comment 35•9 years ago
|
||
We are working on migrating back to hardware, FWIW.
Reporter | ||
Comment 36•8 years ago
|
||
I'm no longer on the sheriffs list - Wes, do these still occur at 8am some days?
Flags: needinfo?(wkocher)
Not that I recall.
Flags: needinfo?(wkocher)
Reporter | ||
Updated•8 years ago
|
Status: REOPENED → RESOLVED
Closed: 9 years ago → 8 years ago
Resolution: --- → WORKSFORME
Assignee | ||
Updated•6 years ago
|
Component: General Automation → General
You need to log in
before you can comment on or make changes to this bug.
Description
•