Closed Bug 1130347 Opened 10 years ago Closed 8 years ago

buildapi/self-serve timeouts & builds-4hr/running/pending Nagios age alerts at 8am UTC every day

Categories

(Release Engineering :: General, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: emorley, Unassigned)

References

Details

(Keywords: treeherder)

We're getting Nagios alerts at a similar time each day (8am UTC) - eg: Service: http - /buildapi/self-serve/jobs Host: buildapi.pvt.build.mozilla.org Address: 10.22.74.160 State: CRITICAL Date/Time: 02-06-2015 00:12:59 Additional Info: CRITICAL - Socket timeout after 10 seconds Service: http file age - /buildjson/builds-running.js Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-06-2015 00:17:09 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:01 ago - 8191 bytes in 0.006 second response time Is some task running at that time each day?
Keywords: treeherder
We've got a bunch of reports that run at midnight PT, which I think lines up with this? I'll try moving some to 00:30 and see if that makes a difference.
Ah great - thank you :-)
I also have bug 1129626 that I suspect could help this as well
(In reply to Chris AtLee [:catlee] from comment #1) > We've got a bunch of reports that run at midnight PT, which I think lines up > with this? > > I'll try moving some to 00:30 and see if that makes a difference. Unfortunately saw these alerts again today: Service: http file age - /buildjson/builds-running.js Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-07-2015 00:18:04 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:00 ago - 8191 bytes in 0.006 second response time Service: http - /buildapi/self-serve/jobs Host: buildapi.pvt.build.mozilla.org Address: 10.22.74.160 State: CRITICAL Date/Time: 02-07-2015 00:18:54 Additional Info: CRITICAL - Socket timeout after 10 seconds Service: http file age - /buildjson/builds-pending.js Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-07-2015 00:19:04 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:11 ago - 4506 bytes in 0.010 second response time Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-07-2015 00:20:04 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:40 ago - 2848 bytes in 0.010 second response time
Depends on: 1129626
Service: http - /buildapi/self-serve/jobs Host: buildapi.pvt.build.mozilla.org Address: 10.22.74.160 State: CRITICAL Date/Time: 02-08-2015 00:41:02 Additional Info: CRITICAL - Socket timeout after 10 seconds
for reference, another round today, seems to correspond to catlee's timing changes. [02:41:36] nagios-releng Mon 23:41:47 PST [4634] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/b2g_bumper/b2g_bumper.stamp is 1489 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp) [02:43:37] nagios-releng Mon 23:43:47 PST [4635] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is OK: FILE_AGE OK: /builds/b2g_bumper/b2g_bumper.stamp is 98 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp) [03:02:31] =-= mdoglio|afk is now known as mdoglio [03:15:57] nagios-releng Tue 00:16:07 PST [4636] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs) [03:17:07] nagios-releng Tue 00:17:17 PST [4637] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:00 ago - 8191 bytes in 0.009 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js) [03:17:07] nagios-releng Tue 00:17:17 PST [4638] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:21 ago - 2848 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [03:17:07] nagios-releng Tue 00:17:17 PST [4639] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:00 ago - 8191 bytes in 0.006 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js) [03:20:57] nagios-releng Tue 00:21:08 PST [4640] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs) [03:25:36] nagios-releng Tue 00:25:47 PST [4641] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/b2g_bumper/b2g_bumper.stamp is 1459 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp) [03:25:46] nagios-releng Tue 00:25:57 PST [4642] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is OK: HTTP OK: HTTP/1.1 200 OK - 44936 bytes in 0.387 second response time (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs) [03:27:06] nagios-releng Tue 00:27:17 PST [4643] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [03:27:07] nagios-releng Tue 00:27:17 PST [4644] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.010 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js) [03:27:07] nagios-releng Tue 00:27:17 PST [4645] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.015 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js) [03:27:36] nagios-releng Tue 00:27:47 PST [4646] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is OK: FILE_AGE OK: /builds/b2g_bumper/b2g_bumper.stamp is 34 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp) [03:41:56] nagios-releng Tue 00:42:07 PST [4647] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is CRITICAL: CRITICAL - Socket timeout after 10 seconds (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs) [03:46:06] nagios-releng Tue 00:46:17 PST [4648] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:38 ago - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [03:46:07] nagios-releng Tue 00:46:17 PST [4649] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:08 ago - 8191 bytes in 0.005 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js) [03:46:07] nagios-releng Tue 00:46:17 PST [4650] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:08 ago - 2848 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js) [03:46:46] nagios-releng Tue 00:46:57 PST [4651] buildapi.pvt.build.mozilla.org:http - /buildapi/self-serve/jobs is OK: HTTP OK: HTTP/1.1 200 OK - 44924 bytes in 0.335 second response time (http://m.mozilla.org/http+-+/buildapi/self-serve/jobs) [03:51:06] nagios-releng Tue 00:51:17 PST [4652] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.009 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [03:51:07] nagios-releng Tue 00:51:18 PST [4653] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-pending.js is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.014 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-pending.js) [03:51:07] nagios-releng Tue 00:51:18 PST [4654] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-running.js is OK: HTTP OK: HTTP/1.1 200 OK - 4272 bytes in 0.022 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-running.js) [03:51:36] nagios-releng Tue 00:51:47 PST [4655] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/b2g_bumper/b2g_bumper.stamp is 1475 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp) [03:57:03] =-= Bebe is now known as Bebe|mtg [03:57:36] nagios-releng Tue 00:57:47 PST [4656] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is CRITICAL: FILE_AGE CRITICAL: /builds/b2g_bumper/b2g_bumper.stamp is 1834 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp) [03:59:36] nagios-releng Tue 00:59:47 PST [4657] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/b2g_bumper/b2g_bumper.stamp is OK: FILE_AGE OK: /builds/b2g_bumper/b2g_bumper.stamp is 10 seconds old and 0 bytes (http://m.mozilla.org/File+Age+-+/builds/b2g_bumper/b2g_bumper.stamp)
Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-10-2015 20:08:17 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:02 ago - 8191 bytes in 0.005 second response time Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-11-2015 00:18:03 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:34 ago - 2848 bytes in 0.005 second response time Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-11-2015 00:49:03 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:35 ago - 8191 bytes in 0.005 second response time
Summary: buildapi/self-serve timeouts & builds-4hr/running/pending age alerts at 8am UTC every day → buildapi/self-serve timeouts & builds-4hr/running/pending Nagios age alerts at 8am UTC every day
Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-12-2015 00:18:04 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:53 ago - 1424 bytes in 0.007 second response time Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-12-2015 00:44:07 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:58 ago - 8191 bytes in 0.005 second response time
Service: http - /buildapi/self-serve/jobs Host: buildapi.pvt.build.mozilla.org Address: 10.22.74.160 State: CRITICAL Date/Time: 02-13-2015 00:42:25 Additional Info: CRITICAL - Socket timeout after 10 seconds Service: http file age - /buildjson/builds-pending.js Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-13-2015 00:43:14 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:05 ago - 1424 bytes in 0.005 second response time Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-13-2015 00:48:04 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:49 ago - 1424 bytes in 0.009 second response time
Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-13-2015 04:14:07 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:19 ago - 8191 bytes in 0.006 second response time
Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-13-2015 08:30:07 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:53 ago - 8191 bytes in 0.019 second response time
Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-13-2015 09:09:07 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:01 ago - 8191 bytes in 0.009 second response time
Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-15-2015 03:14:24 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:07 ago - 8191 bytes in 0.024 second response time Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 02-15-2015 03:28:24 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:17 ago - 8191 bytes in 0.011 second response time
This is even more frequent than once a day now. I've pinged the db team to see if they can give us info on the machine configs and replication times to see if this might be part of the problem.
And this is an issue again tonight [21:24:58] nagios-releng Thu 18:25:05 PST [4124] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:57 ago - 7120 bytes in 0.006 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [21:30:27] nagios-releng Thu 18:30:34 PST [4125] cruncher.srv.releng.scl3.mozilla.com:Pending builds is OK: OK Pending Builds: 1974 (http://m.mozilla.org/Pending+builds) [21:34:57] nagios-releng Thu 18:35:04 PST [4126] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.018 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [21:43:57] nagios-releng Thu 18:44:04 PST [4127] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:11:25 ago - 8191 bytes in 0.011 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [21:53:55] nagios-releng Thu 18:54:02 PST [4128] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.030 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [22:02:55] nagios-releng Thu 19:03:02 PST [4129] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:20 ago - 4272 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [22:22:58] nagios-releng Thu 19:23:05 PST [4130] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [22:31:58] nagios-releng Thu 19:32:05 PST [4131] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:13:22 ago - 1424 bytes in 0.003 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [22:51:58] nagios-releng Thu 19:52:05 PST [4132] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 4339 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [23:00:59] nagios-releng Thu 20:01:06 PST [4133] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:12:01 ago - 1424 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [23:20:58] nagios-releng Thu 20:21:05 PST [4135] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 7120 bytes in 0.005 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [23:29:56] nagios-releng Thu 20:30:03 PST [4136] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:03 ago - 8191 bytes in 0.004 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [23:39:55] nagios-releng Thu 20:40:02 PST [4137] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.020 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [23:48:55] nagios-releng Thu 20:49:02 PST [4138] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:06 ago - 8191 bytes in 0.008 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) [23:53:59] nagios-releng Thu 20:54:06 PST [4139] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) Amy, did we hear back from our DBA's, and/or do you have more insight here?
Flags: needinfo?(arich)
I've heard nothing back from the DBAs yet, but I've cced sheeri and sean on this bug. Let me paste the email I sent to them so they can answer here in the bug: We're trying to track down some issues (lots of flapping) with some of our nagios checks which do some polling of the buildbot database hosts. I'm pretty sure we've seen these issues since the migration to the new servers. The specific check is: builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz There's a lot of soft flapping, and some hard notifications in the nagios logs as well (you can look at the log archives for more examples of alerts): http://nagios1.private.releng.scl3.mozilla.com/releng-scl3/cgi-bin/notifications.cgi?host=builddata.pub.build.mozilla.org&service=http+file+age+-+%2Fbuildjson%2Fbuilds-4hr.js.gz&type=0& I know that the infrastructure has changed recently, so it's probably useful for us to get a clear idea of some past and current data to help us debug. What is the comparison between the infrastructure we were on and what we're on now? I know that we had some physical hardware with (I think 24G) of RAM previously, and now we've moved to some VMs with less. For the before and after pictures, can you tell us the following for any r/w and r/o machines? Type (Physical/Virtual) How many of each (e.g. did we move form one ro to two?) CPU (# cores, shared or reserved) RAM Any changes in LB config Replication lag between the servers (hopefully we have some december vs now numbers?) I believe that the dbe team has a nagios check that's keeping track of replication lag, too, correct? What's the sensitivity on that, and is it reporting to build (basically #buildduty on irc), or just to the dbe team/#sysadmins?
Flags: needinfo?(arich) → needinfo?(scabral)
Adding Matt as he's a DBA as well. The infrastructure for buildbot changed in bug 1125227, and it last changed on Feb 18th. Originally buildbot had 24G RAM. We tried to reduce that to 4G or 6G RAM, as we've been able to do with 90% of our database servers, but buildbot needed more. During this reduction we added another (virtual) slave to see if it would help with the load and decrease replication lag on the slave, but it did not, so it was killed. Currently, the master (buildbot2) has 24G RAM, and the slave (buildbot1) has 16G RAM. Back in Dec, the master was buildbot1 and the slave was buildbot2, and there's been a few flips back and forth, which are the only load balancer config changes I've made. Anything more complicated would have been done by webops. CPU (which usually isn't an issue with MySQL dbs) is 4 core Intel(R) Xeon(R) CPU E5-2680 0 @ 2.70GHz. I don't know if they're shared or reserved, that's an issue for the virtualization folks. Machines *were* physical but are now virtual. If virtual machines can't cut it, let's talk about buying hardware - I'm not sure whose budget that would fall under. We have *increased* the amount of memory used for the index cache, which was completed in late January (r99398), but that would cause replication lag to be less, not more. The nagios check only runs once every 5 minutes and does not always catch lag. You can see historical lag at https://rpm.newrelic.com/accounts/263620/dashboard/3101981/page/4 for buildbot1 and https://rpm.newrelic.com/accounts/263620/dashboard/3101982/page/4 for buildbot2 (buildbot2 was the slave in Dec...you can pretty much see where they flipped by the replication lag...the master never has replication lag). Replication lag pages the DBA oncall, and puts an alert in #db-alerts, but does not notify anyone else. For how long have the alerts been coming in at 8 am UTC? The db machines are in Pacific time (ugh) and at 6 am and 6 pm Pacific (so, 2 pm and 2 am UTC until this weekend's time change) we run a checksum, which can have an effect on replication. It sounds like the timing does not match, though.
Flags: needinfo?(scabral)
Bug 1057184 also has some developer-centric suggestions for fixes. Replication lag is not a new issue with buildbot, though the regularity of it does seem new. And not that I really need to say it again, but if there's ANY way buildbot could be changed so as to NOT require MyISAM, that would help A LOT. See http://docs.buildbot.net/0.8.5/developer/database.html#index-length-in-mysql - I have no idea what exactly needs such a long index, but I cannot imagine a scenario where there's no workaround for that (e.g. surrogate key using an autoincrement number).
Another piece of relevant information is that back a month or more ago when replication was a problem, we set the master and slave VIPs to both point to buildbot2 (so above I was not correct when I said buildbot1 is the slave - it's replicating buildbot2 but nothing's doing any queries on it). Any replication lag on buildbot1 should NOT be affecting anything, as all applications should be using the VIP to connect.
These haven't occurred for a while.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
So pretty much a day and a bit after I closed this, it started occurring again. Typical. Notification Type: PROBLEM Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 04-23-2015 00:44:19 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:45 ago - 8191 bytes in 0.009 second response time <followed by instances every morning> This morning's: Notification Type: PROBLEM Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 04-28-2015 00:19:00 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:10:33 ago - 2891 bytes in 0.004 second response time
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
....still an issue?
Hasn't occurred since, but I presume the DB failure recently solved the problem for us, by making the queries more performant hehe :-)
Status: REOPENED → RESOLVED
Closed: 10 years ago9 years ago
Resolution: --- → WORKSFORME
This is happening pretty much daily at the moment - could someone take a look? :-) ***** Nagios ***** Notification Type: PROBLEM Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 11-16-2015 00:17:57 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:13:39 ago - 8191 bytes in 0.005 second response time ***** Nagios ***** Notification Type: PROBLEM Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 11-16-2015 00:41:59 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:41 ago - 8191 bytes in 0.006 second response time
Status: RESOLVED → REOPENED
Flags: needinfo?(catlee)
Resolution: WORKSFORME → ---
From the #buildduty channel, this flapped twice last night: 03:18 <nagios-releng> Mon 00:17:57 PST [4000] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:13:39 ago - 8191 bytes in 0.005 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) 03:20 aselagea|afk → aselagea|buildduty 03:23 <nagios-releng> Mon 00:22:57 PST [4001] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 8191 bytes in 0.007 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) 03:42 <nagios-releng> Mon 00:41:59 PST [4002] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is CRITICAL: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:41 ago - 8191 bytes in 0.006 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) 03:52 <nagios-releng> Mon 00:51:59 PST [4003] builddata.pub.build.mozilla.org:http file age - /buildjson/builds-4hr.js.gz is OK: HTTP OK: HTTP/1.1 200 OK - 4344 bytes in 0.022 second response time (http://m.mozilla.org/http+file+age+-+/buildjson/builds-4hr.js.gz) I'm going to check cruncher to see whether we're DOSing ourselves during that stretch.
Would be useful to know if the r/w and r/o vips are currently pointing to a single machine or master and slave.
(In reply to Nick Thomas [:nthomas] from comment #26) > Would be useful to know if the r/w and r/o vips are currently pointing to a > single machine or master and slave. Sheeri: I guess that question is for you. ^^
Flags: needinfo?(scabral)
On cruncher, we had a bunch of slave_health processes stacking up with reportor and allthethings at midnight at 00:00 and 00:30, so I've spaced those out a little better.
Flags: needinfo?(catlee)
That's great - thank you :-) Will take a look in a few days and see if there have been any more nagios alert emails.
buildbot-ro and buildbot-rw point to the same machine. This is what I'd expect because the lag problem made it so we had to do this.
Flags: needinfo?(scabral)
This morning I received: Notification Type: PROBLEM Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: CRITICAL Date/Time: 11-18-2015 00:43:49 Additional Info: HTTP CRITICAL: HTTP/1.1 200 OK - Last modified 0:09:29 ago - 8191 bytes in 0.004 second response time Notification Type: RECOVERY Service: http file age - /buildjson/builds-4hr.js.gz Host: builddata.pub.build.mozilla.org Address: 63.245.215.57 State: OK Date/Time: 11-18-2015 00:48:51 Additional Info: HTTP OK: HTTP/1.1 200 OK - 4344 bytes in 0.025 second response time I can't see anything helpful on the New Relic APM page for api.pub.build.m.o (https://rpm.newrelic.com/accounts/263620/applications/6284521) - guessing the scheduled tasks aren't using the `newrelic-admin run-program` wrapper. The MySQL plugin page shows a spike in slow queries around that time (but there's no way for me to see what they were, since the APM page is required for that): https://rpm.newrelic.com/accounts/263620/dashboard/3101981/page/2?tw%5Bend%5D=1447840620&tw%5Bstart%5D=1447829820 And disk IO on buildbot1.db.scl3.mozilla.com reached 100%: https://rpm.newrelic.com/accounts/263620/servers/5241897?tw%5Bend%5D=1447840879&tw%5Bstart%5D=1447830079
Timing-wise, that lines up with daily reportor run on cruncher which starts at 00:30. The log shows the build_times query running from 00:34:46 to 00:44:55. I imagine this is pretty intensive query-wise: https://github.com/catlee/reportor/blob/master/reports/build_times/build_times.py
(In reply to Ed Morley [:emorley] from comment #31) > > I can't see anything helpful on the New Relic APM page for api.pub.build.m.o > (https://rpm.newrelic.com/accounts/263620/applications/6284521) - guessing > the scheduled tasks aren't using the `newrelic-admin run-program` wrapper. > Celery tasks are not wrapped with new relic atm (iirc) however note that we don't use api.pub for this either, so thats irrelevant for our desire to track down the slowdown. > And disk IO on buildbot1.db.scl3.mozilla.com reached 100%: > https://rpm.newrelic.com/accounts/263620/servers/ > 5241897?tw%5Bend%5D=1447840879&tw%5Bstart%5D=1447830079 This, however is relevant.
Depends on: 1226959
I've disabled the 'build_times' report for now. The 'build_steps' report is also pretty expensive - I wonder if we still need that.
We are working on migrating back to hardware, FWIW.
I'm no longer on the sheriffs list - Wes, do these still occur at 8am some days?
Flags: needinfo?(wkocher)
Not that I recall.
Flags: needinfo?(wkocher)
Status: REOPENED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → WORKSFORME
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.