Closed Bug 1176784 Opened 9 years ago Closed 7 years ago

reconfigs should not rely on seta server

Categories

(Release Engineering :: General, defect)

defect
Not set
major

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rail, Assigned: kmoir)

Attachments

(4 files, 12 obsolete files)

(deleted), patch (jlund: review+, kmoir: review+)
(deleted), patch (kmoir: checked-in+)
(deleted), patch (kmoir: checked-in+)
(deleted), patch
Hit an issue with a release reconfig when the seta server was returning 500 errors:
Requested: make checkconfig
Executed: /bin/bash -l -c "cd /builds/buildbot/tests_scheduler && make checkconfig"
=============================== Standard output ===============================
cd master && /builds/buildbot/tests_scheduler/bin/buildbot checkconfig
HTTPError = 500
Traceback (most recent call last):
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/runner.py", line 1042, in doCheckConfig
    ConfigLoader(configFileName=configFileName)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/checkconfig.py", line 31, in __init__
    self.loadConfig(configFile, check_synchronously_only=True)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
    exec f in localDict
  File "/builds/buildbot/tests_scheduler/master/master.cfg", line 7, in <module>
    import config
  File "/tmp/tmpQRFxAr/config.py", line 1966, in <module>
  File "/tmp/tmpQRFxAr/config_seta.py", line 113, in loadSkipConfig
  File "/tmp/tmpQRFxAr/config_seta.py", line 31, in get_seta_platforms
  File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/tools/python27/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/tools/python27/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/tools/python27/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
make: *** [checkconfig] Error 1
Tried to checkconfig manually:
[cltbld@buildbot-master81.bb.releng.scl3.mozilla.com tests_scheduler]$ make checkconfig
cd master && /builds/buildbot/tests_scheduler/bin/buildbot checkconfig
HTTPError = 500
Traceback (most recent call last):
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/runner.py", line 1042, in doCheckConfig
    ConfigLoader(configFileName=configFileName)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/checkconfig.py", line 31, in __init__
    self.loadConfig(configFile, check_synchronously_only=True)
  File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
    exec f in localDict
  File "/builds/buildbot/tests_scheduler/master/master.cfg", line 7, in <module>
    import config
  File "/tmp/tmpJkacwm/config.py", line 1966, in <module>
  File "/tmp/tmpJkacwm/config_seta.py", line 113, in loadSkipConfig
  File "/tmp/tmpJkacwm/config_seta.py", line 31, in get_seta_platforms
  File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/tools/python27/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/tools/python27/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/tools/python27/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/tools/python27/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
make: *** [checkconfig] Error 1
Bug 1176802 for the alertmanager.allizom.org outage. It would be good if we had a copy on disk, or in a repo, so that we have something to fall back on if the server is down.
So what if we had a cron job, which polls seta and lands into buildbot-configs (if it gets a 200 with valid data). Then the automated reconfigs deploy to the masters on the hour. This would give us an audit trail and a way to rollback the seta config. Would have been helpful when 20k pending jobs turned up today.
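A minimal sketch of the "200 with valid data" gate such a cron job might apply before landing anything. The function name is hypothetical, and the only structural assumption is that the payload is a JSON object with a non-empty 'jobtypes' mapping, which is the key config_seta.py reads elsewhere in this bug. It is written for Python 3, whereas the masters ran Python 2.7.

```python
import json


def landable_seta_data(status_code, body):
    """Return parsed SETA data if it is worth committing, else None.

    status_code -- HTTP status of the response
    body        -- response body as text
    """
    if status_code != 200:
        return None
    try:
        data = json.loads(body)
    except ValueError:  # covers json.JSONDecodeError
        return None
    # Assumption: a usable payload is an object with a non-empty 'jobtypes' dict.
    jobtypes = data.get("jobtypes") if isinstance(data, dict) else None
    if not isinstance(jobtypes, dict) or not jobtypes:
        return None
    return data
```

Anything that returns None here would simply leave the previously landed copy in place.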
Flags: needinfo?(kmoir)
Sorry I didn't see this bug until today. And I didn't realize that it had caused problems in the past. I'll look at how to make this more resilient. Didn't realize alertmanager isn't managed by moc either and that we don't have a nagios alert on it. Will look into this.
Assignee: nobody → kmoir
Flags: needinfo?(kmoir)
Just as an fyi, I checked the logs on a Linux test master yesterday. A reconfig occurred at 17:00:
2015-08-31 17:00:01 - INFO - Checking whether we need to reconfig...
2015-08-31 17:00:05 - INFO - buildbotcustom: production-0.8 tag has moved - old rev: b2ecb14104783c60f9a4276f3031213aa7634a9a; new rev: 730073773a8c85759a7c592f7287e885845052f6
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:08 - INFO - buildbot-configs: production tag has moved - old rev: 7d72b9711f492a4c99fc84dd97ef0c76ed0ebec1; new rev: 84ef93b01b62774d9066908176fbf8103b5f8971
5 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:11 - INFO - tools: default tag has moved - old rev: df4e897f9bc7fb877dfcacf19d150543af619589; new rev: 487bb16f9bdfd8c05301692e4de3e2b1a2a105cc
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:11 - INFO - Starting reconfig. - 1441065601
cd master && /builds/buildbot/tests1-linux/bin/buildbot checkconfig
Config file is good!
2015-08-31 17:01:29 - INFO - Reconfig completed successfuly. - 1441065601
This corresponds to when the scheduling started skipping again on the scheduling master:
2015-08-31 17:03:48-0700 [-] tests-fx-team-win8_64-debug-unittest-7-3600: skipping with 1/7 important changes since only 79/3600s have elapsed
2015-08-31 17:05:53-0700 [-] tests-fx-team-win8_64-debug-unittest-7-3600: skipping with 1/7 important changes since only 204/3600s have elapsed
2015-08-31 17:05:54-0700 [-] t
Looking at the logs on the scheduling master, no jobs were skipped on Sunday Aug 30 (but I think trees were mostly closed due to downtime), and no jobs were skipped until 2015-08-31 17:03. So my suspicion is that when the buildbot masters were brought up after the downtime, the SETA server was unavailable (although I can't see that in the logs) and thus the masters didn't have any SETA data until the reconfig at 17:00 on August 31.
Depends on: 1200838
I was working on this for the past few days with another approach that didn't work for various reasons. I have a python script that I used before to generate seta configs before I changed it to manipulate the BRANCHES config itself. I can use it to generate the config files via cron, and update the configs to load these files instead of changing the branches config. My question is what credentials should I use to land this in bb-configs - ffxbld? Are there other examples of scripts that we use to land content in bbconfigs besides tagging for releases etc?
Flags: needinfo?(nthomas)
We also land into gecko repos in http://hg.mozilla.org/build/tools/file/default/scripts/periodic_file_updates/periodic_file_updates.sh. That and tagging use the ffxbld ssh key to push to ssh://hg.m.o. We could implement this as a cron that runs on bm81, or schedule jobs on the builders themselves. Either way you'd have access to the key you need.
Flags: needinfo?(nthomas)
not sure if this is the place to put this but travis says 15 test masters are failing right now:
```
2015-09-16 21:45:30,881 - Couldn't load test-output/bm109-tests1-windows/master.cfg
Traceback (most recent call last):
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 111, in dump_master
    c = loadMaster(path)
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 26, in loadMaster
    execfile(path, g, g)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/master.cfg", line 10, in <module>
    import config
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config.py", line 2152, in <module>
    loadSkipConfig(BRANCHES,"desktop")
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 115, in loadSkipConfig
    define_configs(b, platforms, BRANCHES)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 98, in define_configs
    platform = seta_platforms[p][0]
KeyError: 'Windows 8'
Traceback (most recent call last):
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 165, in <module>
    main()
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 146, in main
    dump = dump_master(args.masters[0])
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 111, in dump_master
    c = loadMaster(path)
  File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 26, in loadMaster
    execfile(path, g, g)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/master.cfg", line 10, in <module>
    import config
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config.py", line 2152, in <module>
    loadSkipConfig(BRANCHES,"desktop")
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 115, in loadSkipConfig
    define_configs(b, platforms, BRANCHES)
  File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 98, in define_configs
    platform = seta_platforms[p][0]
KeyError: 'Windows 8'
```
taking a look, it seems like http://alertmanager.allizom.org/data/setadetails/?date=2015-09-16&buildbot=1&branch=mozilla-inbound&inactive=1 today is including a key 'Windows 8' that I guess https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config_seta.py?offset=1800#14 is not happy about.
In [30]: url = "http://alertmanager.allizom.org/data/setadetails/?date=" + today + "&buildbot=1&branch=" + 'fx-team' + "&inactive=1"
In [31]: content = json.load(urllib2.urlopen(url))
In [32]: c = {}
In [33]: c['jobtypes'] = content['jobtypes'] # copy code from config_seta.py
In [34]: for p in c['jobtypes'][today]:
    platform = ' '.join(p.encode('utf-8').split()[0:-4])
    if platform not in platforms:
        platforms.append(platform)
   ....:
In [35]: platforms
Out[35]: ['Windows 8 64-bit', 'Windows 8', 'Rev5 MacOSX Yosemite 10.10', 'Rev5 MacOSX Yosemite', 'Ubuntu VM 12.04', 'Ubuntu VM', 'Rev4 MacOSX Snow Leopard 10.6', 'Windows 7', 'Windows 7 32-bit', 'Ubuntu VM 12.04 x64', 'Ubuntu ASAN VM 12.04 x64', 'Windows XP 32-bit', 'Windows XP', 'android-2-3-armv7-api9', 'android-4-3-armv7-api11']
let me know if I can change SETA at all.
We're wondering if SETA changed today, either code, data format, or data itself, and if that might be causing this.
For example, did we add talos jobs like 'Windows 8 64-bit fx-team talos g2-e10s' for the first time today? There's code which does platform = ' '.join(p.encode('utf-8').split()[0:-4]), which turns unittest-style jobs into 'Windows 8 64-bit' but talos jobs into 'Windows 8', and the latter isn't in seta_platforms at http://hg.mozilla.org/build/buildbot-configs/annotate/c71f4b72e5db/mozilla-tests/config_seta.py#l14
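To make the slicing behaviour concrete, here is a small Python 3 sketch of the same expression (minus the Python 2 .encode('utf-8') call). The talos builder name is the one quoted in this comment; the unittest-style name is an illustrative assumption about what such builder names look like, not one taken from the SETA data.

```python
def platform_from_builder(name):
    """Drop the last four whitespace-separated tokens, as config_seta.py does."""
    return " ".join(name.split()[0:-4])


# Illustrative unittest-style name: platform + branch + three trailing tokens.
unittest_job = "Windows 8 64-bit fx-team debug test mochitest-1"
# Talos name from the comment: platform + branch + only two trailing tokens.
talos_job = "Windows 8 64-bit fx-team talos g2-e10s"

print(platform_from_builder(unittest_job))
print(platform_from_builder(talos_job))
```

Because the talos name has one fewer trailing token, dropping four tokens also eats the last word of the platform, which is where the unexpected 'Windows 8' key comes from.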
Flags: needinfo?(jmaher)
This reduces the set of platforms to ['Windows 8 64-bit', 'Rev5 MacOSX Yosemite 10.10', 'Ubuntu VM 12.04', 'Rev4 MacOSX Snow Leopard 10.6', 'Windows 7 32-bit', 'Ubuntu VM 12.04 x64', 'Ubuntu ASAN VM 12.04 x64', 'Windows XP 32-bit', 'android-2-3-armv7-api9', 'android-4-3-armv7-api11'] which are all in seta_platforms. This will fix up master reconfigs, regardless of whether talos is meant to be in there or not (seems odd to me, but I'm lacking SETA context). It's obviously just as fragile as what it replaces, so consider it a short-term hack.
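The patch itself is not shown in this bug, so the following is only a guess at the shape of such a short-term hack: skip any derived platform name that is not a key in seta_platforms. The table contents and the function name are hypothetical; the real mapping lives in config_seta.py.

```python
# Hypothetical two-entry stand-in for the seta_platforms table in config_seta.py.
seta_platforms = {
    "Windows 8 64-bit": ("win8_64",),
    "Ubuntu VM 12.04": ("ubuntu32_vm",),
}


def known_platforms(builder_names, table):
    """Return derived platform names, keeping only those present in `table`."""
    platforms = []
    for name in builder_names:
        platform = " ".join(name.split()[0:-4])
        # Talos-style names yield truncated platforms such as 'Windows 8'; skip them.
        if platform in table and platform not in platforms:
            platforms.append(platform)
    return platforms
```

A filter like this avoids the KeyError but silently drops the talos builders, which is why it is only a stopgap.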
Attachment #8662124 - Flags: review?(jlund)
Attachment #8662124 - Flags: review+
Comment on attachment 8662124 [details] [diff] [review] [buildbot-configs] Handle talos builders
Review of attachment 8662124 [details] [diff] [review]:
-----------------------------------------------------------------
makes sense. r+
Attachment #8662124 - Flags: review?(jlund) → review+
Yes, we added talos to it in preparation for upcoming work. This is easy to revert on the SETA side. Quite possibly we need both a live and a staging API that SETA publishes?
Flags: needinfo?(jmaher)
Severity: critical → major
Depends on: 1208292
We had bm121 hang in a reconfig today, it was waiting on a socket to the SETA server. We could mitigate that by adding at http://hg.mozilla.org/build/buildbot-configs/file/fc34e111082d/mozilla-tests/config_seta.py#l31
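The comment does not say what would be added at that line. One plausible mitigation for a hang on a socket, assumed here and not confirmed by the bug, is passing an explicit timeout to the URL open call. A Python 3 sketch follows (the masters used urllib2 on Python 2.7, whose urlopen also accepts a timeout); the function name and the injectable opener parameter are hypothetical and exist only to keep the example testable.

```python
import json
import urllib.request


def fetch_seta(url, timeout=30, opener=urllib.request.urlopen):
    """Fetch and parse JSON, giving up after `timeout` seconds instead of blocking."""
    response = opener(url, timeout=timeout)
    try:
        return json.loads(response.read().decode("utf-8"))
    finally:
        response.close()
```

With a timeout set, a stalled connection raises an exception that the caller can retry or fall back on, instead of blocking the reconfig indefinitely.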
bm67 got itself into a funk today while reconfiging, hitting socket errors while reading the seta config.
twistd.log:
226118 2016-01-04 15:02:56-0800 [-] generic exception: Traceback (most recent call last):
226119 2016-01-04 15:02:56-0800 [-] File "/builds/buildbot/tests1-linux64/master/config_seta.py", line 45, in get_seta_platforms
226120 2016-01-04 15:02:56-0800 [-] response = urllib2.urlopen(url)
226121 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
226122 2016-01-04 15:02:56-0800 [-] return _opener.open(url, data, timeout)
226123 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 400, in open
226124 2016-01-04 15:02:56-0800 [-] response = self._open(req, data)
226125 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 418, in _open
226126 2016-01-04 15:02:56-0800 [-] '_open', req)
226127 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
226128 2016-01-04 15:02:56-0800 [-] result = func(*args)
226129 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 1207, in http_open
226130 2016-01-04 15:02:56-0800 [-] return self.do_open(httplib.HTTPConnection, req)
226131 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 1180, in do_open
226132 2016-01-04 15:02:56-0800 [-] r = h.getresponse(buffering=True)
226133 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/httplib.py", line 1030, in getresponse
226134 2016-01-04 15:02:56-0800 [-] response.begin()
226135 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/httplib.py", line 407, in begin
226136 2016-01-04 15:02:56-0800 [-] version, status, reason = self._read_status()
226137 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/httplib.py", line 365, in _read_status
226138 2016-01-04 15:02:56-0800 [-] line = self.fp.readline()
226139 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/socket.py", line 447, in readline
226140 2016-01-04 15:02:56-0800 [-] data = self._sock.recv(self._rbufsize)
226141 2016-01-04 15:02:56-0800 [-] error: [Errno 104] Connection reset by peer
226142 2016-01-04 15:02:56-0800 [-]
226143 2016-01-04 15:02:56-0800 [-] error while parsing config file
226144 2016-01-04 15:02:56-0800 [-] error during loadConfig
226145 2016-01-04 15:02:56-0800 [-] Unhandled Error
irc log:
17:56:54 <hwine> KWierso|afk: looks like issues, but not mine - I'll find someone
18:00:21 <relengbot> [sns alert] Mon 18:01:07 PST buildbot-master67.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
18:03:38 * jlund looks
18:07:03 <jlund> I wonder if kim and releaserunner's reconfig requests walked over each other
18:10:16 <jlund> hmm, nope. kim's finished in minutes. this looks like the 43.0.4 releaserunner reconfig never finished:
18:10:19 <jlund> [cltbld@buildbot-master67.bb.releng.use1.mozilla.com tests1-linux64]$ grep -r 'configuration' master/twistd.log.1 master/twistd.log
18:10:19 <jlund> master/twistd.log.1:2016-01-04 11:03:52-0800 [-] configuration update started
18:10:19 <jlund> master/twistd.log.1:2016-01-04 11:05:57-0800 [-] configuration update complete
18:10:19 <jlund> master/twistd.log:2016-01-04 15:00:31-0800 [-] loading configuration from /builds/buildbot/tests1-linux64/master/master.cfg
18:11:41 <jlund> 226190 2016-01-04 15:02:56-0800 [-] The new config file is unusable, so I'll ignore it.
18:11:41 <jlund> 226191 2016-01-04 15:02:56-0800 [-] I will keep using the previous config file instead.
18:13:00 <jlund> I think seta raised an error while reading the config:
18:13:01 <jlund> 2016-01-04 15:02:56-0800 [-] File "/builds/buildbot/tests1-linux64/master/config_seta.py", line 45, in get_seta_platforms
18:13:13 <jlund> 2016-01-04 15:02:56-0800 [-] generic exception: Traceback
18:13:51 <jlund> lost a socket connection. probably needs a beefier retry logic
18:14:12 <jlund> I'll try running a reconfig again since it looks like buildbot gave up trying and got itself out of sync with the rest of the masters
18:15:14 * hwine thanks jlund and goes to update docs with bad information
18:16:55 <jlund> np
18:19:57 <jlund> looks like stacktraces have stopped after:
18:19:59 <jlund> master/twistd.log:2016-01-04 18:18:07-0800 [-] configuration update complete
Flags: needinfo?(kmoir)
also, for aid in updating docs, I'll brain dump what I did here via history cmds:
635 cd /builds/buildbot/tests1-linux64/
642 grep -r 'configuration' master/twistd.log.1 master/twistd.log
master/twistd.log.1:2016-01-04 11:03:52-0800 [-] configuration update started
master/twistd.log.1:2016-01-04 11:05:57-0800 [-] configuration update complete
master/twistd.log:2016-01-04 15:00:31-0800 [-] loading configuration from /builds
643 vim master/twistd.log # found out why 15:00 reconfig never finished via log output in comment 17
644 source bin/activate # use buildbot venv
645 make checkconfig # check if current repos look good # also good to check buildbot-configs and custom are up to date
646 rm reconfig.lock # safe bc now we know that the reconfig gave up after log line "The new config file is unusable, so I'll ignore it."
647 make reconfig # reconfig to get master in sync with other masters
I think the best thing to do here is have a .json file checked into the tree which we can pull from. In that regard we would need to adjust the seta tools slightly since the .json file has hardcoded branch names in it. I could land the SETA change in tree when needed and the reconfig will pick up whenever it is done, even if there are no changes.
I revisited my patches to update the configs in tree/buildbot-configs today so we don't have problems with reconfigs if the seta server is unavailable, making progress.
Flags: needinfo?(kmoir)
We've been getting this occasionally:
2016-03-21 18:00:52-0700 [-] Unhandled Error
Traceback (most recent call last):
  File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/application/app.py", line 311, in runReactorWithLogging
    reactor.run()
  File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 1165, in run
    self.mainLoop()
  File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 1174, in mainLoop
    self.runUntilCurrent()
  File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
    call.func(*call.args, **call.kw)
--- <exception caught here> ---
  File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/master.py", line 628, in loadTheConfigFile
    d = self.loadConfig(f)
  File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
    exec f in localDict
  File "/builds/buildbot/tests1-linux32/master/master.cfg", line 21, in <module>
    reload(mobile_config)
  File "/builds/buildbot/tests1-linux32/master/mobile_config.py", line 3014, in <module>
    loadSkipConfig(BRANCHES, "mobile")
  File "/builds/buildbot/tests1-linux32/master/config_seta.py", line 145, in loadSkipConfig
    platforms = get_seta_platforms(b, platform_filter)
  File "/builds/buildbot/tests1-linux32/master/config_seta.py", line 70, in get_seta_platforms
    data = json.loads(response.read())
  File "/tools/python27/lib/python2.7/socket.py", line 351, in read
    data = self._sock.recv(rbufsize)
  File "/tools/python27/lib/python2.7/httplib.py", line 561, in read
    s = self.fp.read(amt)
  File "/tools/python27/lib/python2.7/socket.py", line 380, in read
    data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
2016-03-21 18:00:52-0700 [-] The new config file is unusable, so I'll ignore it.
2016-03-21 18:00:52-0700 [-] I will keep using the previous config file instead.
In Q2 we are looking at moving SETA to Heroku to make it more reliable. In addition, work will be done to make SETA data useful for taskcluster; ideally the target state is for SETA to live in taskcluster and work more smoothly there. If there is something we could do outside of moving SETA to Heroku, please ask for it either here or in a new bug.
Joel: is there a bug tracking the move of seta data to Heroku?
Flags: needinfo?(jmaher)
and we do have a bug 1253020 just for that!
Flags: needinfo?(jmaher)
Do you have a rough timeline for that move? In the meantime we may need to add a retry to the buildbot code, or beef up the SETA AWS instance to handle more concurrent connections.
Turns out we have retries, but they're not catching the socket.error exceptions. Bug 1259325 to fix that up. Need to do that because a failing reconfig leaves the master in a state where it burns builds in a few seconds, and they stay permapending on treeherder.
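The actual fix is tracked in Bug 1259325 and is not shown here. As a hedged illustration of the idea, a retry helper that explicitly catches socket-level errors might look like the Python 3 sketch below; the helper name, parameters and defaults are assumptions. Note that in Python 3 socket.error is an alias of OSError, whereas on the Python 2.7 masters it was a separate exception class that an HTTP-only except clause would miss.

```python
import socket
import time


def retry(func, attempts=3, delay=0.0, exceptions=(socket.error,)):
    """Call func() up to `attempts` times, retrying on the listed exceptions."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return func()
        except exceptions as exc:
            last_exc = exc
            if attempt + 1 < attempts:
                time.sleep(delay)  # simple fixed pause between attempts
    # Every attempt failed: surface the last error to the caller.
    raise last_exc
```

Whatever the retry policy, the final failure still has to be handled by the caller, for example by falling back to a cached copy of the data.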
Blocks: 1264618
SETA is returning consistent 500 errors, which blocks reconfigs on test masters (repos get updated, checkconfig fails, reconfig never happens). IIRC they don't have the previous config to fall back on.
The seta server is available again, looking at the logs on a test master it appears it was an intermittent error.
Depends on: 1286358
$ curl -I "http://alertmanager.allizom.org/data/setadetails/?date=2016-08-08&buildbot=1&branch=mozilla-inbound&inactive=1"
HTTP/1.1 500 Internal Server Error
Date: Tue, 09 Aug 2016 02:19:21 GMT
Server: Apache/2.4.7 (Ubuntu)
Connection: close
Content-Type: text/html; charset=iso-8859-1
Looks pretty consistent, and I would guess for 5+ hours based on some stuck reconfigs.
I have a script in progress to wget seta data, save it locally, and push it to bbconfigs for storage. Each branch we run seta on needs a different copy of the data; I'm working on an elegant way to store that now. My plan is to run this script via cron from the scheduling test master once a day (config in puppet), since seta is only updated once a day. I'm thinking that we would just land the data on default, because merging to production would trigger an unnecessary reconfig.
(In reply to Kim Moir [:kmoir] from comment #30)
> Have a script in progess to wget seta data, save it locally and push to
> bbconfigs for storage. Each branch we run seta on needs a different copy of
> the data, working on an elegant way to store that now. My plan is to run
> this script via cron from the scheduling test master once a day, (config in
> puppet) since seta is only updated once a day. I'm thinking that we would
> just land the data on default because merging to production would trigger an
> unneccessary reconfig.
I'm confused about the approach here. Is the thing we're storing in buildbot-configs an ultimate fallback if we can't find anything else? Here's what I envisage as the SETA process fallthrough:
* buildbot master starts or reconfig is triggered
** download new SETA data from server
*** if server unreachable, fallback to existing SETA data from last run (triggers warning)
**** if no previous run, fallback to archived copy of SETA data in buildbot-configs (triggers louder warning)
We'd run Kim's script on some reasonable cadence to refresh the in-tree SETA data. Is this accurate?
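The fallthrough described in this comment can be sketched roughly as follows. This is an illustration in Python 3, not the code that landed: the function name, the logger name, and the idea of passing the download step in as a callable are all assumptions.

```python
import json
import logging
import os

log = logging.getLogger("seta_fallback")  # hypothetical logger name


def load_seta_data(fetch, last_run_path, archived_path):
    """Return (data, source) from the first source that works.

    fetch          -- zero-argument callable that downloads and parses SETA data
    last_run_path  -- JSON file saved by the previous successful run
    archived_path  -- JSON file checked into buildbot-configs
    """
    try:
        return fetch(), "server"
    except Exception as exc:  # broad on purpose: any failure should fall through
        log.warning("SETA server unavailable (%s), trying cached data", exc)

    fallbacks = (
        ("last-run", last_run_path, logging.WARNING),
        ("archive", archived_path, logging.ERROR),  # the "louder" warning
    )
    for source, path, level in fallbacks:
        if os.path.exists(path):
            log.log(level, "using %s SETA data from %s", source, path)
            with open(path) as f:
                return json.load(f), source
    raise RuntimeError("no SETA data available from server, last run, or archive")
```

Returning the source alongside the data lets the caller decide how loudly to complain, and keeps the reconfig itself from failing as long as any copy exists.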
Side topic, as part of the SETA re-write I'm doing (bug 1306709), I hope to upload SETA information to an S3 archive.
coop: Yes, this is the same approach I had envisioned. I'll look at this bug again now that the nightly tcmigration stuff on my plate is mostly done.
Attached patch wip patches (obsolete) (deleted) — Splinter Review
Attached patch wip patches (obsolete) (deleted) — Splinter Review
Alin: is buildbot-master69.bb.releng.use1.mozilla.com still available for testing? I'd like to test these patches on it via my puppet testing instance. I see that it is up but not running jobs.
Flags: needinfo?(aselagea)
(In reply to Kim Moir [:kmoir] from comment #36)
> Alin: is buildbot-master69.bb.releng.use1.mozilla.com still available for
> testing? I'd like to test these patches on it via my puppet testing
> instance. I see that it is up but not running jobs.
Yes, it's still available.
Flags: needinfo?(aselagea)
Attached file bug1176784puppet.patch (obsolete) (deleted) —
Attachment #8809512 - Attachment is obsolete: true
Attached patch bug1176784tools.patch (obsolete) (deleted) — Splinter Review
Attachment #8809513 - Attachment is obsolete: true
Attached patch bug1176784puppet2.patch (obsolete) (deleted) — Splinter Review
Attachment #8810667 - Attachment is obsolete: true
Attached patch bug1176784puppet2.patch (obsolete) (deleted) — Splinter Review
Attachment #8810951 - Attachment is obsolete: true
Comment on attachment 8810942 [details] [diff] [review] bug1176784tools.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
...
>+ with open(temp_file, 'wt') as f:
>+ json.dump(data, f)
How would you feel about adding 'indent=4' to the dump(), and also sorting the list of builders? That would give us nice diffs in hg to track changes over time.
good suggestion nick, I'll do that and update the patch
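A sketch of what that suggestion could look like, in Python 3. The function name is hypothetical, and it assumes the payload shape seen elsewhere in this bug: a 'jobtypes' object mapping a date string to a list of builder names.

```python
import json


def dump_seta(data, path):
    """Write SETA data as indented JSON with builder lists in sorted order."""
    stable = dict(data)
    # Assumption: 'jobtypes' maps a date string to a list of builder names.
    stable["jobtypes"] = {
        day: sorted(builders)
        for day, builders in data.get("jobtypes", {}).items()
    }
    with open(path, "wt") as f:
        json.dump(stable, f, indent=4, sort_keys=True)
        f.write("\n")  # trailing newline keeps diffs tidy
```

With one builder per line and a stable order, a day-to-day change in the data shows up in hg as a handful of added or removed lines.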
Attached patch bug1176784bb.patch (obsolete) (deleted) — Splinter Review
Attached patch bug1176784tools2.patch (obsolete) (deleted) — Splinter Review
Attachment #8810942 - Attachment is obsolete: true
Attachment #8811020 - Flags: feedback?(nthomas)
Comment on attachment 8811020 [details] [diff] [review] bug1176784tools2.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
>+# main
>+today = date.today().strftime("%Y-%m-%d")
>+remote = "ssh://hg.mozilla.org/build/buildbot-configs"
>+ssh_key = "/home/cltbld/.ssh/ffxbuild_rsa"
>+ssh_username = "ffxbuild"
A little safety net until this is ready for primetime?
>+revision = "default"
>+localrepo = "/tmp/buildbot-configs"
>+configs_path = localrepo + "/mozilla-tests/"
>+msg = "updating seta data for " + today
msg seems unused, there's something very similar in the commit() though.
>+if os.path.exists(localrepo):
>+ shutil.rmtree(localrepo)
>+os.mkdir(localrepo)
>+clone(remote, localrepo,revision)
purge() is an alternative for an existing repo, but for buildbot-configs it doesn't really matter.
>+#assume data could not be fetched
>+status = False
>+for branch in seta_branches:
>+ status = update_seta_data(branch, configs_path)
>+ if status:
>+ #add files
>+ cmd = ['hg', 'add', '.']
>+ #commit files
>+value = run_cmd(cmd, cwd=localrepo)
So we'll have the .old file committed in the repo? Does that give us something over the hg history of the main file? hg add is a bit dangerous for committing temp files by accident.
Attachment #8811020 - Flags: feedback?(nthomas) → feedback+
Comment on attachment 8811015 [details] [diff] [review] bug1176784bb.patch
>diff --git a/mozilla-tests/config_seta.py b/mozilla-tests/config_seta.py
--- a/mozilla-tests/config_seta.py
+++ b/mozilla-tests/config_seta.py
@@ -72,7 +72,14 @@ def get_seta_platforms(branch, platform_
...
> c['jobtypes'] = data.get('jobtypes', None)
> platforms = []
> for p in c['jobtypes'][today]:
This last line could be a problem when we fall back to the in-repo data. today may well not match the date when the data was cached.
Attached patch bug1176784tools3.patch (obsolete) (deleted) — Splinter Review
Attachment #8811020 - Attachment is obsolete: true
Attached patch bug1176784bb2.patch (obsolete) (deleted) — Splinter Review
Attachment #8811015 - Attachment is obsolete: true
Attachment #8811919 - Flags: review?(nthomas)
Attachment #8811794 - Flags: review?(nthomas)
Attachment #8810961 - Flags: feedback?(nthomas)
Comment on attachment 8811919 [details] [diff] [review] bug1176784bb2.patch
>diff --git a/mozilla-tests/config_seta.py b/mozilla-tests/config_seta.py
>@@ -71,9 +70,21 @@ def get_seta_platforms(branch, platform_
> if os.environ.get('DISABLE_SETA'):
> return []
>
>+ global today
>+ today = date.today().strftime("%Y-%m-%d")
Doesn't look like this needs to be a global.
>+ if data == "":
>+ path = os.path.join(path, branch + "-seta.json")
>+ with open(path, 'r') as f:
>+ data = json.load(f)
Please add a log message when we fall back to disk data. We're not pumping the master logs into papertrail so there's no easy alerting, but a message in twistd.log would be useful if the sheriffs get in touch about SETA behaving strangely.
Attachment #8811919 - Flags: review?(nthomas) → review+
Comment on attachment 8811794 [details] [diff] [review] bug1176784tools3.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
...
>+sys.path.append("/builds/buildbot/tests_scheduler/tools/lib/python")
We often do this relative to the script, to make it more portable, e.g.
sys.path.append(os.path.join(os.path.dirname(__file__), "../../lib/python"))
>+ backup_file = configs_path + branch + "-seta.json.old"
No longer used?
>+ssh_key = "/home/cltbld/.ssh/ffxbuild_rsa"
>+ssh_username = "ffxbuild"
s/ffxbuild/ffxbld/ please.
>+revision = "default"
If we merge infrequently from default to production, will the seta config changes be picked up "soon enough"?
>+clone(remote, localrepo,revision)
Nit, whitespace missing.
>+#assume data could not be fetched
>+status = False
>+for branch in seta_branches:
>+ status = update_seta_data(branch, configs_path)
Did you mean to have a status check here?
>+revision = commit(localrepo, msg, user=ssh_username)
>+push_cmd = ['hg', 'push']
>+push_value = run_cmd(push_cmd, cwd=localrepo)
You've imported push from util.hg, did it not work out?
Attachment #8811794 - Flags: review?(nthomas) → review+
Comment on attachment 8810961 [details] [diff] [review] bug1176784puppet2.patch
This looks plausible, but a puppet expert I'm not.
>diff --git a/modules/buildmaster/manifests/seta_update.pp b/modules/buildmaster/manifests/seta_update.pp
>+# this class manages cleanup functionality for the buildbot databases
Showing its origin there.
>+ python::virtualenv {
>+ "$seta_update_dir":
>+ python => "${packages::mozilla::python27::python}",
>+ require => Class['packages::mozilla::python27'],
>+ user => "${users::builder::username}",
>+ group => "${users::builder::group}",
>+ packages => [
>+ "SQLAlchemy==0.7.9",
>+ "MySQL-python==1.2.3",
Are these packages really required?
>diff --git a/modules/buildmaster/templates/buildmaster-seta-update.erb b/modules/buildmaster/templates/buildmaster-seta-update.erb
>+@weekly <%=scope.lookupvar('users::builder::username')%> <%=@seta_update_dir%>/bin/python /builds/buildbot/tests_scheduler/tools/buildfarm/maintenance/update_seta.py -l <%=@seta_update_dir%>/update-seta.log
You have a tools checkout in <%=@seta_update_dir%>, so should use that I think. Does anything keep that up to date, or do we deploy changes with a manual hg pull? I like the idea of a log, but no sign of -l argument in the script.
Attachment #8810961 - Flags: feedback?(nthomas) → feedback+
Attached patch bug1176784bb3.patch (deleted) — Splinter Review
I made the variable "today" a global because if the local seta server is not available, its value is overwritten by the date value in the local copy of the seta data in the json file. If this is the case, then we need to use this value in the other methods, so I thought this was a better approach.
Attachment #8811919 - Attachment is obsolete: true
Attached patch bug1176784tools4.patch (obsolete) (deleted) — Splinter Review
Attachment #8811794 - Attachment is obsolete: true
Attached patch bug1176784puppet3.patch (obsolete) (deleted) — Splinter Review
Attachment #8810961 - Attachment is obsolete: true
Comment on attachment 8813281 [details] [diff] [review] bug1176784tools4.patch r? on this since I only asked for f? before
Attachment #8813281 - Flags: review?(nthomas)
(In reply to Kim Moir [:kmoir] from comment #53)
> I made the variable "today" a global because if the local seta server is not
> available, it's value is overwritten by the date value in the local copy of
> the seta data in the json file. If this is the case, then we need to use
> this value in the other methods, so I thought this was a better approach.
Ah, I see what you mean. Another approach would be strip the date part out, and only work with data['jobtypes'][<date>].
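One way to read that suggestion, sketched in Python 3: take the date from the data itself instead of from the clock, so a cached file works whatever day it was fetched. The function name is hypothetical, and picking the latest key is an assumption for the case where more than one date is present.

```python
def builders_from(data):
    """Return (date, builders) using whichever date the SETA payload carries."""
    jobtypes = data.get("jobtypes") or {}
    if not jobtypes:
        return None, []
    # ISO dates (YYYY-MM-DD) sort lexicographically, so max() is the most recent.
    day = max(jobtypes)
    return day, jobtypes[day]
```

This removes the need for a shared "today" value, since each caller can work from the returned builder list directly.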
Comment on attachment 8813281 [details] [diff] [review] bug1176784tools4.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
>+import logging
>+log = logging.getLogger(__name__)
I'm a convert to sending logs into papertrail. You could have a look at aws_watch_pending.py for an example of setting up a StreamHandler to send output to stdout/stderr, while also keeping a local log. Then you can append ' 2>&1 | logger -t update_seta' when you call the python script.
>+ except socket.error, e:
>+ log.warming("Socket error when accessing %s: %s" % (url, str(e)))
Nit, warming log, must be on fire ;-)
>+ print("Retrying")
Nit, missed one print -> log.<level> change.
>+ data = wfetch(url)
>+ if data:
>+ #test if data cannot be fetched
Nit, old comment or slightly the wrong place?
>+ parser = OptionParser()
>+ parser.set_defaults(
>+ filename=None,
>+ loglevel=logging.INFO,
>+ logfile=None,
>+ skip_orphans=False,
Nit, unused filename and skip_orphans.
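The aws_watch_pending.py setup referred to here is not reproduced in this bug, so the following is only a generic Python 3 sketch of the pattern described: a StreamHandler for console output plus an optional local log file. The function name, logger name and format string are assumptions.

```python
import logging
import sys


def setup_logging(logfile=None, level=logging.INFO):
    """Log to stdout, and also to `logfile` when one is given."""
    log = logging.getLogger("update_seta")  # hypothetical logger name
    log.setLevel(level)
    for handler in list(log.handlers):  # avoid duplicate handlers on repeat calls
        log.removeHandler(handler)
    fmt = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    stream = logging.StreamHandler(sys.stdout)
    stream.setFormatter(fmt)
    log.addHandler(stream)
    if logfile:
        file_handler = logging.FileHandler(logfile)
        file_handler.setFormatter(fmt)
        log.addHandler(file_handler)
    return log
```

Run from cron with ' 2>&1 | logger -t update_seta' appended, as suggested above, the console output would reach syslog while the file keeps a local copy.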
Attachment #8813281 - Flags: review?(nthomas) → review+
Attached patch bug1176784tools5.patch (deleted) — Splinter Review
Attachment #8813281 - Attachment is obsolete: true
Attached patch bug1176784puppet4.patch (deleted) — Splinter Review
Attachment #8813283 - Attachment is obsolete: true
Attachment #8814977 - Flags: checked-in+
Attachment #8813277 - Flags: checked-in+
Correct if not fixed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Component: General Automation → General