Closed
Bug 1176784
Opened 9 years ago
Closed 7 years ago
reconfigs should not rely on seta server
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: rail, Assigned: kmoir)
References
Details
Attachments
(4 files, 12 obsolete files)
(deleted), patch | jlund: review+, kmoir: review+
(deleted), patch | kmoir: checked-in+
(deleted), patch | kmoir: checked-in+
(deleted), patch
Hit an issue with a release reconfig when the seta server was returning 500 errors:
Requested: make checkconfig
Executed: /bin/bash -l -c "cd /builds/buildbot/tests_scheduler && make checkconfig"
=============================== Standard output ===============================
cd master && /builds/buildbot/tests_scheduler/bin/buildbot checkconfig
HTTPError = 500
Traceback (most recent call last):
File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/runner.py", line 1042, in doCheckConfig
ConfigLoader(configFileName=configFileName)
File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/checkconfig.py", line 31, in __init__
self.loadConfig(configFile, check_synchronously_only=True)
File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
exec f in localDict
File "/builds/buildbot/tests_scheduler/master/master.cfg", line 7, in <module>
import config
File "/tmp/tmpQRFxAr/config.py", line 1966, in <module>
File "/tmp/tmpQRFxAr/config_seta.py", line 113, in loadSkipConfig
File "/tmp/tmpQRFxAr/config_seta.py", line 31, in get_seta_platforms
File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/tools/python27/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/tools/python27/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/tools/python27/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/tools/python27/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
make: *** [checkconfig] Error 1
Reporter | ||
Comment 1•9 years ago
|
||
Tried to checkconfig manually:
[cltbld@buildbot-master81.bb.releng.scl3.mozilla.com tests_scheduler]$ make checkconfig
cd master && /builds/buildbot/tests_scheduler/bin/buildbot checkconfig
HTTPError = 500
Traceback (most recent call last):
File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/runner.py", line 1042, in doCheckConfig
ConfigLoader(configFileName=configFileName)
File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/scripts/checkconfig.py", line 31, in __init__
self.loadConfig(configFile, check_synchronously_only=True)
File "/builds/buildbot/tests_scheduler/lib/python2.7/site-packages/buildbot-0.8.2_hg_f8e28d877d11_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
exec f in localDict
File "/builds/buildbot/tests_scheduler/master/master.cfg", line 7, in <module>
import config
File "/tmp/tmpJkacwm/config.py", line 1966, in <module>
File "/tmp/tmpJkacwm/config_seta.py", line 113, in loadSkipConfig
File "/tmp/tmpJkacwm/config_seta.py", line 31, in get_seta_platforms
File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
return _opener.open(url, data, timeout)
File "/tools/python27/lib/python2.7/urllib2.py", line 406, in open
response = meth(req, response)
File "/tools/python27/lib/python2.7/urllib2.py", line 519, in http_response
'http', request, response, code, msg, hdrs)
File "/tools/python27/lib/python2.7/urllib2.py", line 444, in error
return self._call_chain(*args)
File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
result = func(*args)
File "/tools/python27/lib/python2.7/urllib2.py", line 527, in http_error_default
raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 500: Internal Server Error
make: *** [checkconfig] Error 1
Comment 2•9 years ago
|
||
Bug 1176802 for the alertmanager.allizom.org outage.
It would be good if we had a copy on disk, or in a repo, so that we have something to fall back on if the server is down.
Comment 3•9 years ago
|
||
So what if we had a cron job that polls seta and lands the data into buildbot-configs (only if it gets a 200 with valid data)? The automated reconfigs would then deploy it to the masters on the hour. This would give us an audit trail and a way to roll back the seta config. It would have been helpful when 20k pending jobs turned up today.
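A minimal sketch of the "only land on a 200 with valid data" check such a cron job would need. The function name is hypothetical, and the rule that valid data means a JSON object with a non-empty `jobtypes` mapping is an assumption; `jobtypes` is the key config_seta.py reads elsewhere in this bug, but the full schema is not confirmed here.

```python
import json


def is_landable(status_code, body):
    """Return True only for an HTTP 200 whose body looks like usable SETA data.

    Assumption: usable data is a JSON object with a non-empty 'jobtypes' dict.
    """
    if status_code != 200:
        return False
    try:
        data = json.loads(body)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    if not isinstance(data, dict):
        return False
    jobtypes = data.get("jobtypes")
    return isinstance(jobtypes, dict) and len(jobtypes) > 0
```

A cron job built on this would skip the commit entirely when the check fails, so a bad response never reaches the masters.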
Flags: needinfo?(kmoir)
Assignee | ||
Comment 4•9 years ago
|
||
Sorry I didn't see this bug until today. And I didn't realize that it had caused problems in the past. I'll look at how to make this more resilient.
Didn't realize alertmanager isn't managed by moc either and that we don't have a nagios alert on it. Will look into this.
Assignee: nobody → kmoir
Flags: needinfo?(kmoir)
Assignee | ||
Comment 5•9 years ago
|
||
Just as an fyi, I checked the logs on a Linux test master yesterday
A reconfig occurred at 17:00
2015-08-31 17:00:01 - INFO - Checking whether we need to reconfig...
2015-08-31 17:00:05 - INFO - buildbotcustom: production-0.8 tag has moved - old rev: b2ecb14104783c60f9a4276f3031213aa7634a9a; new rev: 730073773a8c85759a7c592f7287e885845052f6
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:08 - INFO - buildbot-configs: production tag has moved - old rev: 7d72b9711f492a4c99fc84dd97ef0c76ed0ebec1; new rev: 84ef93b01b62774d9066908176fbf8103b5f8971
5 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:11 - INFO - tools: default tag has moved - old rev: df4e897f9bc7fb877dfcacf19d150543af619589; new rev: 487bb16f9bdfd8c05301692e4de3e2b1a2a105cc
1 files updated, 0 files merged, 0 files removed, 0 files unresolved
2015-08-31 17:00:11 - INFO - Starting reconfig. - 1441065601
cd master && /builds/buildbot/tests1-linux/bin/buildbot checkconfig
Config file is good!
2015-08-31 17:01:29 - INFO - Reconfig completed successfuly. - 1441065601
This corresponds to when the scheduling started skipping again on the scheduling master
2015-08-31 17:03:48-0700 [-] tests-fx-team-win8_64-debug-unittest-7-3600: skipping with 1/7 important changes since only 79/3600s have elapsed
2015-08-31 17:05:53-0700 [-] tests-fx-team-win8_64-debug-unittest-7-3600: skipping with 1/7 important changes since only 204/3600s have elapsed
2015-08-31 17:05:54-0700 [-] t
Looking at the logs on the scheduling master, no jobs were skipped on Sunday Aug 30 (but I think trees were mostly closed due to downtime). And no jobs were skipped until 2015-08-31 17:03. So my suspicion is that when the buildbot masters were brought up after the downtime, the SETA server was unavailable (although I can't see that in the logs) and thus the masters didn't have any SETA data until the reconfig at 17:00 on August 31.
Assignee | ||
Comment 7•9 years ago
|
||
I was working on this for the past few days with another approach that didn't work for various reasons.
I have a python script that I previously used to generate seta configs, before I changed it to manipulate the BRANCHES config itself. I can use it to generate the config files via cron, and update the configs to load these files instead of changing the branches config.
My question is what credentials should I use to land this in bb-configs - ffxbld? Are there other examples of scripts that we use to land content in bbconfigs besides tagging for releases etc?
Flags: needinfo?(nthomas)
Comment 8•9 years ago
|
||
We also land into gecko repos in http://hg.mozilla.org/build/tools/file/default/scripts/periodic_file_updates/periodic_file_updates.sh. That and tagging use the ffxbld ssh key to push to ssh://hg.m.o.
We could implement this as a cron that runs on bm81, or schedule jobs on the builders themselves. Either way you'd have access to the key you need.
Flags: needinfo?(nthomas)
Comment 9•9 years ago
|
||
not sure if this is the place to put this but travis says 15 test masters are failing right now:
```
2015-09-16 21:45:30,881 - Couldn't load test-output/bm109-tests1-windows/master.cfg
Traceback (most recent call last):
File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 111, in dump_master
c = loadMaster(path)
File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 26, in loadMaster
execfile(path, g, g)
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/master.cfg", line 10, in <module>
import config
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config.py", line 2152, in <module>
loadSkipConfig(BRANCHES,"desktop")
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 115, in loadSkipConfig
define_configs(b, platforms, BRANCHES)
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 98, in define_configs
platform = seta_platforms[p][0]
KeyError: 'Windows 8'
Traceback (most recent call last):
File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 165, in <module>
main()
File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 146, in main
dump = dump_master(args.masters[0])
File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 111, in dump_master
c = loadMaster(path)
File "/home/travis/build/mozilla/build-buildbot-configs/.tox/braindump/buildbot-related/dump_master_json.py", line 26, in loadMaster
execfile(path, g, g)
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/master.cfg", line 10, in <module>
import config
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config.py", line 2152, in <module>
loadSkipConfig(BRANCHES,"desktop")
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 115, in loadSkipConfig
define_configs(b, platforms, BRANCHES)
File "/home/travis/build/mozilla/build-buildbot-configs/test-output/bm109-tests1-windows/config_seta.py", line 98, in define_configs
platform = seta_platforms[p][0]
KeyError: 'Windows 8'
```
taking a look, it seems like http://alertmanager.allizom.org/data/setadetails/?date=2015-09-16&buildbot=1&branch=mozilla-inbound&inactive=1 today is including a key 'Windows 8' that I guess https://dxr.mozilla.org/build-central/source/buildbot-configs/mozilla-tests/config_seta.py?offset=1800#14 is not happy about.
In [30]: url = "http://alertmanager.allizom.org/data/setadetails/?date=" + today + "&buildbot=1&branch=" + 'fx-team' + "&inactive=1"
In [31]: content = json.load(urllib2.urlopen(url))
In [32]: c = {}
In [33]: c['jobtypes'] = content['jobtypes']
# copy code from config_seta.py
In [34]: for p in c['jobtypes'][today]:
    platform = ' '.join(p.encode('utf-8').split()[0:-4])
    if platform not in platforms:
        platforms.append(platform)
   ....:
In [35]: platforms
Out[35]:
['Windows 8 64-bit',
'Windows 8',
'Rev5 MacOSX Yosemite 10.10',
'Rev5 MacOSX Yosemite',
'Ubuntu VM 12.04',
'Ubuntu VM',
'Rev4 MacOSX Snow Leopard 10.6',
'Windows 7',
'Windows 7 32-bit',
'Ubuntu VM 12.04 x64',
'Ubuntu ASAN VM 12.04 x64',
'Windows XP 32-bit',
'Windows XP',
'android-2-3-armv7-api9',
'android-4-3-armv7-api11']
Comment 10•9 years ago
|
||
let me know if I can change SETA at all.
Comment 11•9 years ago
|
||
We're wondering if SETA changed today, either code, data format, or data itself, and if that might be causing this.
Comment 12•9 years ago
|
||
For example, did we add talos jobs like 'Windows 8 64-bit fx-team talos g2-e10s' for the first time today ?
There's code which does
platform = ' '.join(p.encode('utf-8').split()[0:-4])
which turns unittest style jobs into 'Windows 8 64-bit', but talos into 'Windows 8', and the latter isn't in seta_platforms at http://hg.mozilla.org/build/buildbot-configs/annotate/c71f4b72e5db/mozilla-tests/config_seta.py#l14
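A small sketch of why that slicing breaks on talos names. The talos builder name is the one quoted in this comment; the unittest-style name is a hypothetical example of the longer form, and the helper name is made up for illustration. The `.encode('utf-8')` step from the Python 2 original is omitted here.

```python
def platform_from_builder(name):
    """Mirror the slicing quoted above: drop the last four words of a builder name."""
    return " ".join(name.split()[0:-4])


# Hypothetical unittest-style name: four trailing words after the platform
# (branch, build type, 'test', suite).
unittest_name = "Windows 8 64-bit fx-team debug test mochitest-1"
# Talos-style name from the comment: only three trailing words after the platform.
talos_name = "Windows 8 64-bit fx-team talos g2-e10s"

print(platform_from_builder(unittest_name))  # Windows 8 64-bit
print(platform_from_builder(talos_name))     # Windows 8
```

Because the talos name has one fewer trailing word, the fixed `[0:-4]` slice eats the last word of the platform itself.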
Flags: needinfo?(jmaher)
Comment 13•9 years ago
|
||
This reduces the set of platforms to
['Windows 8 64-bit',
'Rev5 MacOSX Yosemite 10.10',
'Ubuntu VM 12.04',
'Rev4 MacOSX Snow Leopard 10.6',
'Windows 7 32-bit',
'Ubuntu VM 12.04 x64',
'Ubuntu ASAN VM 12.04 x64',
'Windows XP 32-bit',
'android-2-3-armv7-api9',
'android-4-3-armv7-api11']
which are all in seta_platforms.
This will fix up master reconfigs, regardless of whether talos is meant to be in there or not (seems odd to me, but I'm lacking SETA context). It's obviously just as fragile as what it replaces, so consider it a short-term hack.
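The attached patch is not reproduced in this bug, so the following is only one way to get the same effect, assuming the goal is to ignore any derived platform name that is not a key of `seta_platforms`. The helper name and the mapping values are placeholders, not the real contents of config_seta.py.

```python
def known_platforms(candidates, seta_platforms):
    """Keep only candidates that are keys of seta_platforms, preserving order."""
    kept = []
    for platform in candidates:
        if platform in seta_platforms and platform not in kept:
            kept.append(platform)
    return kept


# Placeholder mapping; the real seta_platforms lives in config_seta.py.
seta_platforms = {
    "Windows 8 64-bit": ("placeholder-a", []),
    "Windows 7 32-bit": ("placeholder-b", []),
}
candidates = ["Windows 8 64-bit", "Windows 8", "Windows 7", "Windows 7 32-bit"]
print(known_platforms(candidates, seta_platforms))
```

Filtering this way avoids the `KeyError`, but it silently drops unknown platforms, which matches the "short-term hack" caveat above.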
Attachment #8662124 -
Flags: review?(jlund)
Assignee | ||
Updated•9 years ago
|
Attachment #8662124 -
Flags: review+
Comment 14•9 years ago
|
||
Comment on attachment 8662124 [details] [diff] [review]
[buildbot-configs] Handle talos builders
Review of attachment 8662124 [details] [diff] [review]:
-----------------------------------------------------------------
makes sense. r+
Attachment #8662124 -
Flags: review?(jlund) → review+
Comment 15•9 years ago
|
||
Yes, we added talos to it in preparation for upcoming work. This is easy to revert on the SETA side. Quite possibly we need both a live and a staging API that SETA publishes?
Flags: needinfo?(jmaher)
Comment 16•9 years ago
|
||
We had bm121 hang in a reconfig today; it was waiting on a socket to the SETA server. We could mitigate that by adding a timeout at
http://hg.mozilla.org/build/buildbot-configs/file/fc34e111082d/mozilla-tests/config_seta.py#l31
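A minimal sketch of bounding the SETA request with a timeout so a reconfig cannot block indefinitely on a socket. It is written for Python 3's `urllib.request`; the masters in this bug run Python 2.7, where `urllib2.urlopen` accepts the same `timeout` argument. The function name, the 30-second default and the injectable `opener` parameter are assumptions for illustration, not code from config_seta.py.

```python
import socket
import urllib.error
import urllib.request


def fetch_with_timeout(url, timeout=30, opener=urllib.request.urlopen):
    """Fetch url, returning the body bytes, or None if the server is slow or failing."""
    try:
        response = opener(url, timeout=timeout)
        return response.read()
    except (urllib.error.URLError, socket.timeout):
        # URLError also covers HTTPError (e.g. a 500 response).
        return None
```

Returning `None` instead of raising leaves the caller free to fall back to other data rather than aborting the whole config load.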
Comment 17•9 years ago
|
||
bm67 got itself into a funk today while reconfiging and having socket errors while reading seta config
twistd.log:
226118 2016-01-04 15:02:56-0800 [-] generic exception: Traceback (most recent call last):
226119 2016-01-04 15:02:56-0800 [-] File "/builds/buildbot/tests1-linux64/master/config_seta.py", line 45, in get_seta_platforms
226120 2016-01-04 15:02:56-0800 [-] response = urllib2.urlopen(url)
226121 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 126, in urlopen
226122 2016-01-04 15:02:56-0800 [-] return _opener.open(url, data, timeout)
226123 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 400, in open
226124 2016-01-04 15:02:56-0800 [-] response = self._open(req, data)
226125 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 418, in _open
226126 2016-01-04 15:02:56-0800 [-] '_open', req)
226127 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 378, in _call_chain
226128 2016-01-04 15:02:56-0800 [-] result = func(*args)
226129 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 1207, in http_open
226130 2016-01-04 15:02:56-0800 [-] return self.do_open(httplib.HTTPConnection, req)
226131 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/urllib2.py", line 1180, in do_open
226132 2016-01-04 15:02:56-0800 [-] r = h.getresponse(buffering=True)
226133 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/httplib.py", line 1030, in getresponse
226134 2016-01-04 15:02:56-0800 [-] response.begin()
226135 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/httplib.py", line 407, in begin
226136 2016-01-04 15:02:56-0800 [-] version, status, reason = self._read_status()
226137 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/httplib.py", line 365, in _read_status
226138 2016-01-04 15:02:56-0800 [-] line = self.fp.readline()
226139 2016-01-04 15:02:56-0800 [-] File "/tools/python27/lib/python2.7/socket.py", line 447, in readline
226140 2016-01-04 15:02:56-0800 [-] data = self._sock.recv(self._rbufsize)
226141 2016-01-04 15:02:56-0800 [-] error: [Errno 104] Connection reset by peer
226142 2016-01-04 15:02:56-0800 [-]
226143 2016-01-04 15:02:56-0800 [-] error while parsing config file
226144 2016-01-04 15:02:56-0800 [-] error during loadConfig
226145 2016-01-04 15:02:56-0800 [-] Unhandled Error
irc log:
17:56:54 <hwine> KWierso|afk: looks like issues, but not mine - I'll find someone
18:00:21 <relengbot> [sns alert] Mon 18:01:07 PST buildbot-master67.bb.releng.use1.mozilla.com maybe_reconfig.sh: ERROR - Reconfig lockfile is older than 120 minutes.
18:03:38 — jlund looks
18:07:03 <jlund> I wonder if kim and releaserunner's reconfig requests walked over each other
18:10:16 <jlund> hmm, nope. kim's finished in minutes. this looks like the 43.0.4 releaserunner reconfig never finished:
18:10:19 <jlund> [cltbld@buildbot-master67.bb.releng.use1.mozilla.com tests1-linux64]$ grep -r 'configuration' master/twistd.log.1 master/twistd.log
18:10:19 <jlund> master/twistd.log.1:2016-01-04 11:03:52-0800 [-] configuration update started
18:10:19 <jlund> master/twistd.log.1:2016-01-04 11:05:57-0800 [-] configuration update complete
18:10:19 <jlund> master/twistd.log:2016-01-04 15:00:31-0800 [-] loading configuration from /builds/buildbot/tests1-linux64/master/master.cfg
18:11:41 <jlund> 226190 2016-01-04 15:02:56-0800 [-] The new config file is unusable, so I'll ignore it.
18:11:41 <jlund> 226191 2016-01-04 15:02:56-0800 [-] I will keep using the previous config file instead.
18:13:00 <jlund> I think seta raised an error while reading the config:
18:13:01 <jlund> 2016-01-04 15:02:56-0800 [-] File "/builds/buildbot/tests1-linux64/master/config_seta.py", line 45, in get_seta_platforms
18:13:13 <jlund> 2016-01-04 15:02:56-0800 [-] generic exception: Traceback
18:13:51 <jlund> lost a socket connection. probably needs a beefier retry logic
18:14:12 <jlund> I'll try running a reconfig again since it looks like buildbot gave up trying and got itself out of sync with the rest oft he masters
18:15:14 — hwine thanks jlund and goes to update docs with bad information
18:16:55 <jlund> np
18:19:57 <jlund> looks like stacktraces have stopped after:
18:19:59 <jlund> master/twistd.log:2016-01-04 18:18:07-0800 [-] configuration update complete
Flags: needinfo?(kmoir)
Comment 18•9 years ago
|
||
also, for aid in updating docs, I'll brain dump what I did here via history cmds
635 cd /builds/buildbot/tests1-linux64/
642 grep -r 'configuration' master/twistd.log.1 master/twistd.log
master/twistd.log.1:2016-01-04 11:03:52-0800 [-] configuration update started
master/twistd.log.1:2016-01-04 11:05:57-0800 [-] configuration update complete
master/twistd.log:2016-01-04 15:00:31-0800 [-] loading configuration from /builds
643 vim master/twistd.log # found out why 15:00 reconfig never finished via log output in comment 17
644 source bin/activate # use buildbot venv
645 make checkconfig # check if current repos look good
# also good to check buildbot-configs and custom are up to date
646 rm reconfig.lock # safe bc now we know that the reconfig gave up after log line "The new config file is unusable, so I'll ignore it."
647 make reconfig # reconfig to get master in sync with other masters
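The log check in the steps above (did the last reconfig finish, or did buildbot give up?) could be sketched as a small helper. The function name and the returned labels are hypothetical; the matched strings are the twistd.log messages quoted in this bug.

```python
def last_reconfig_state(lines):
    """Classify the most recent reconfig attempt found in twistd.log lines.

    Returns 'complete', 'abandoned', 'in-progress', or None if no attempt is seen.
    """
    state = None
    for line in lines:
        if ("loading configuration from" in line
                or "configuration update started" in line):
            state = "in-progress"
        elif "configuration update complete" in line:
            state = "complete"
        elif "The new config file is unusable" in line:
            state = "abandoned"
    return state
```

Per the reasoning in the steps above, only an 'abandoned' result would suggest the stale reconfig.lock is safe to remove; an 'in-progress' result means a reconfig may still be running.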
Comment 19•9 years ago
|
||
I think the best thing to do here is have a .json file checked into the tree which we can pull from. In that regard we would need to adjust the seta tools slightly since the .json file has hardcoded branch names in it. I could land the SETA change in tree when needed and the reconfig will pick up whenever it is done, even if there are no changes.
Assignee | ||
Comment 20•9 years ago
|
||
I revisited my patches to update the configs in tree/buildbot-configs today so we don't have problems with reconfigs if the seta server is unavailable. Making progress.
Flags: needinfo?(kmoir)
Comment 21•9 years ago
|
||
We've been getting this occasionally:
2016-03-21 18:00:52-0700 [-] Unhandled Error
Traceback (most recent call last):
File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/application/app.py", line 311, in runReactorWithLogging
reactor.run()
File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 1165, in run
self.mainLoop()
File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 1174, in mainLoop
self.runUntilCurrent()
File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/twisted/internet/base.py", line 796, in runUntilCurrent
call.func(*call.args, **call.kw)
--- <exception caught here> ---
File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/master.py", line 628, in loadTheConfigFile
d = self.loadConfig(f)
File "/builds/buildbot/tests1-linux32/lib/python2.7/site-packages/buildbot-0.8.2_hg_8b87b4974e3c_production_0.8-py2.7.egg/buildbot/master.py", line 652, in loadConfig
exec f in localDict
File "/builds/buildbot/tests1-linux32/master/master.cfg", line 21, in <module>
reload(mobile_config)
File "/builds/buildbot/tests1-linux32/master/mobile_config.py", line 3014, in <module>
loadSkipConfig(BRANCHES, "mobile")
File "/builds/buildbot/tests1-linux32/master/config_seta.py", line 145, in loadSkipConfig
platforms = get_seta_platforms(b, platform_filter)
File "/builds/buildbot/tests1-linux32/master/config_seta.py", line 70, in get_seta_platforms
data = json.loads(response.read())
File "/tools/python27/lib/python2.7/socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "/tools/python27/lib/python2.7/httplib.py", line 561, in read
s = self.fp.read(amt)
File "/tools/python27/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
socket.error: [Errno 104] Connection reset by peer
2016-03-21 18:00:52-0700 [-] The new config file is unusable, so I'll ignore it.
2016-03-21 18:00:52-0700 [-] I will keep using the previous config file instead.
Comment 22•9 years ago
|
||
In Q2 we are looking at moving SETA to Heroku to make it more reliable. In addition, work will be done to make SETA data useful for taskcluster; ideally the target state is for SETA to live in taskcluster and work more smoothly there.
If there is something we could do outside of moving SETA to Heroku, please ask for it either here or in a new bug.
Assignee | ||
Comment 23•9 years ago
|
||
Joel: is there a bug tracking the move of seta data to Heroku?
Flags: needinfo?(jmaher)
Comment 25•9 years ago
|
||
Do you have a rough timeline for that move? In the meantime we may need to add a retry to the buildbot code, or beef up the SETA AWS instance to handle more concurrent connections.
Comment 26•9 years ago
|
||
Turns out we have retries, but they're not catching the socket.error exceptions. Bug 1259325 to fix that up. Need to do that because a failing reconfig leaves the master in a state where it burns builds in a few seconds, and they stay permapending on treeherder.
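A minimal sketch of a retry loop that also catches socket-level errors such as "Connection reset by peer". This is not the fix from bug 1259325, which is not shown here. The helper name, attempt count and delay are assumptions. In Python 3 `socket.error` is an alias of `OSError`, which `URLError` already subclasses; as far as I know, in Python 2.7 `urllib2.URLError` and `socket.error` are separate classes, so both must be named there.

```python
import socket
import time
import urllib.error


def fetch_with_retries(fetch, attempts=3, delay=0.0):
    """Call fetch() up to `attempts` times, retrying on HTTP and socket-level errors.

    Returns the fetched value, or None if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except (urllib.error.URLError, socket.error):
            # socket.error covers 'Connection reset by peer' (errno 104).
            if attempt + 1 < attempts:
                time.sleep(delay)
    return None
```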
Comment 27•8 years ago
|
||
SETA is returning consistent 500 errors, which blocks reconfigs on test masters (repos get updated, checkconfig fails, reconfig never happens). IIRC they don't have the previous config to fall back on.
Assignee | ||
Comment 28•8 years ago
|
||
The seta server is available again. Looking at the logs on a test master, it appears it was an intermittent error.
Depends on: 1286358
Comment 29•8 years ago
|
||
$ curl -I "http://alertmanager.allizom.org/data/setadetails/?date=2016-08-08&buildbot=1&branch=mozilla-inbound&inactive=1"
HTTP/1.1 500 Internal Server Error
Date: Tue, 09 Aug 2016 02:19:21 GMT
Server: Apache/2.4.7 (Ubuntu)
Connection: close
Content-Type: text/html; charset=iso-8859-1
Looks pretty consistent, and I would guess for 5+ hours based on some stuck reconfigs.
Assignee | ||
Comment 30•8 years ago
|
||
Have a script in progress to wget seta data, save it locally and push to bbconfigs for storage. Each branch we run seta on needs a different copy of the data; I'm working on an elegant way to store that now. My plan is to run this script via cron from the scheduling test master once a day (config in puppet), since seta is only updated once a day. I'm thinking that we would just land the data on default, because merging to production would trigger an unnecessary reconfig.
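A minimal sketch of the per-branch bookkeeping such a script needs: one URL and one local file per branch. The URL form is the one used earlier in this bug; the helper names, the `<branch>-seta.json` file naming and the branch list are assumptions for illustration.

```python
import os
from datetime import date


def seta_url(branch, day):
    """Build the setadetails URL for one branch, in the form shown earlier in this bug."""
    return ("http://alertmanager.allizom.org/data/setadetails/?date=" + day +
            "&buildbot=1&branch=" + branch + "&inactive=1")


def seta_file(configs_path, branch):
    """One JSON file per branch, named <branch>-seta.json (naming is an assumption)."""
    return os.path.join(configs_path, branch + "-seta.json")


today = date.today().strftime("%Y-%m-%d")
for branch in ["mozilla-inbound", "fx-team"]:  # illustrative branch list
    print(seta_url(branch, today), "->", seta_file("mozilla-tests", branch))
```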
Comment 31•8 years ago
|
||
(In reply to Kim Moir [:kmoir] from comment #30)
> Have a script in progress to wget seta data, save it locally and push to
> bbconfigs for storage. Each branch we run seta on needs a different copy of
> the data, working on an elegant way to store that now. My plan is to run
> this script via cron from the scheduling test master once a day, (config in
> puppet) since seta is only updated once a day. I'm thinking that we would
> just land the data on default because merging to production would trigger an
> unnecessary reconfig.
I'm confused about the approach here. Is the thing we're storing in buildbot-configs an ultimate fallback if we can't find anything else?
Here's what I envisage as the SETA process fallthrough:
* buildbot master starts or reconfig is triggered
** download new SETA data from server
*** if server unreachable, fallback to existing SETA data from last run (triggers warning)
**** if no previous run, fallback to archived copy of SETA data in buildbot-configs (triggers louder warning)
We'd run Kim's script on some reasonable cadence to refresh the in-tree SETA data.
Is this accurate?
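The fallthrough described above can be sketched as a chain of three sources tried in order. The function and parameter names are hypothetical, each loader is assumed to return `None` on failure, and stdlib `logging` stands in for whatever logging the master actually uses.

```python
import logging

log = logging.getLogger("seta")


def load_seta_data(from_server, from_last_run, from_repo):
    """Try each source in order; each callable returns data or None.

    Returns (data, source_name), or (None, None) if every source failed.
    """
    data = from_server()
    if data is not None:
        return data, "server"

    log.warning("SETA server unreachable, falling back to data from the last run")
    data = from_last_run()
    if data is not None:
        return data, "last-run"

    log.error("No SETA data from a previous run, falling back to in-repo archive")
    data = from_repo()
    if data is not None:
        return data, "repo"
    return None, None
```

Returning the source name alongside the data lets the caller record which tier was used, which matters when the archived copy is stale.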
Comment 32•8 years ago
|
||
Side topic, as part of the SETA re-write I'm doing (bug 1306709), I hope to upload SETA information to an S3 archive.
Assignee | ||
Comment 33•8 years ago
|
||
coop:
Yes, this is the same approach I had envisioned. I'll look at this bug again now that the nightly tcmigration stuff on my plate is mostly done.
Assignee | ||
Comment 34•8 years ago
|
||
Assignee | ||
Comment 35•8 years ago
|
||
Assignee | ||
Comment 36•8 years ago
|
||
Alin: is buildbot-master69.bb.releng.use1.mozilla.com still available for testing? I'd like to test these patches on it via my puppet testing instance. I see that it is up but not running jobs.
Flags: needinfo?(aselagea)
Comment 37•8 years ago
|
||
(In reply to Kim Moir [:kmoir] from comment #36)
> Alin: is buildbot-master69.bb.releng.use1.mozilla.com still available for
> testing? I'd like to test these patches on it via my puppet testing
> instance. I see that it is up but not running jobs.
Yes, it's still available.
Flags: needinfo?(aselagea)
Assignee | ||
Comment 38•8 years ago
|
||
Attachment #8809512 -
Attachment is obsolete: true
Assignee | ||
Comment 39•8 years ago
|
||
Attachment #8809513 -
Attachment is obsolete: true
Assignee | ||
Comment 40•8 years ago
|
||
Attachment #8810667 -
Attachment is obsolete: true
Assignee | ||
Comment 41•8 years ago
|
||
Attachment #8810951 -
Attachment is obsolete: true
Comment 42•8 years ago
|
||
Comment on attachment 8810942 [details] [diff] [review]
bug1176784tools.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
...
>+ with open(temp_file, 'wt') as f:
>+     json.dump(data, f)
How would you feel about adding 'indent=4' to the dump(), and also sorting the list of builders ? That would give us nice diffs in hg to track changes over time.
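A minimal sketch of that suggestion, assuming the data is nested dicts whose leaf lists hold builder-name strings. The helper names are hypothetical; `indent` and `sort_keys` are standard `json.dumps` parameters.

```python
import json


def _normalize(value):
    """Recursively sort lists of plain strings so output order is stable."""
    if isinstance(value, dict):
        return {k: _normalize(v) for k, v in value.items()}
    if isinstance(value, list):
        items = [_normalize(v) for v in value]
        if all(isinstance(v, str) for v in items):
            return sorted(items)
        return items
    return value


def dumps_for_diffs(data):
    """Serialize with 4-space indent and sorted keys, one value per line."""
    return json.dumps(_normalize(data), indent=4, sort_keys=True) + "\n"
```

With one builder per line in a stable order, a day-to-day change shows up in hg as a few added or removed lines rather than one rewritten blob.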
Assignee | ||
Comment 43•8 years ago
|
||
good suggestion nick, I'll do that and update the patch
Assignee | ||
Comment 44•8 years ago
|
||
Assignee | ||
Comment 45•8 years ago
|
||
Attachment #8810942 -
Attachment is obsolete: true
Assignee | ||
Updated•8 years ago
|
Attachment #8811020 -
Flags: feedback?(nthomas)
Comment 46•8 years ago
|
||
Comment on attachment 8811020 [details] [diff] [review]
bug1176784tools2.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
>+# main
>+today = date.today().strftime("%Y-%m-%d")
>+remote = "ssh://hg.mozilla.org/build/buildbot-configs"
>+ssh_key = "/home/cltbld/.ssh/ffxbuild_rsa"
>+ssh_username = "ffxbuild"
A little safety net until this is ready for primetime ?
>+revision = "default"
>+localrepo = "/tmp/buildbot-configs"
>+configs_path = localrepo + "/mozilla-tests/"
>+msg = "updating seta data for " + today
msg seems unused, there's something very similar in the commit() though.
>+if os.path.exists(localrepo):
>+    shutil.rmtree(localrepo)
>+os.mkdir(localrepo)
>+clone(remote, localrepo,revision)
purge() is an alternative for an existing repo, but for buildbot-configs it doesn't really matter.
>+#assume data could not be fetched
>+status = False
>+for branch in seta_branches:
>+ status = update_seta_data(branch, configs_path)
>+ if status:
>+ #add files
>+ cmd = ['hg', 'add', '.']
>+ #commit files
>+value = run_cmd(cmd, cwd=localrepo)
So we'll have the .old file committed in the repo ? Does that give us something over the hg history of the main file ? hg add is a bit dangerous for committing temp files by accident.
Attachment #8811020 -
Flags: feedback?(nthomas) → feedback+
Comment 47•8 years ago
|
||
Comment on attachment 8811015 [details] [diff] [review]
bug1176784bb.patch
>diff --git a/mozilla-tests/config_seta.py b/mozilla-tests/config_seta.py
--- a/mozilla-tests/config_seta.py
+++ b/mozilla-tests/config_seta.py
@@ -72,7 +72,14 @@ def get_seta_platforms(branch, platform_
...
> c['jobtypes'] = data.get('jobtypes', None)
> platforms = []
> for p in c['jobtypes'][today]:
This last line could be a problem when we fall back to the in-repo data. today may well not match the date when the data was cached.
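One way to avoid the mismatch, sketched under the assumption that `jobtypes` is keyed by ISO `YYYY-MM-DD` date strings: use today's key when present, otherwise the most recent key in the data. The helper name is hypothetical and this is not what the patch under review does.

```python
def pick_date_key(jobtypes, today):
    """Return today's key if present, else the most recent date key, else None.

    Assumes keys are ISO 'YYYY-MM-DD' strings, so string order matches date order.
    """
    if today in jobtypes:
        return today
    if not jobtypes:
        return None
    return max(jobtypes)
```

The loop would then iterate over `c['jobtypes'][pick_date_key(c['jobtypes'], today)]` after checking for `None`.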
Assignee | ||
Comment 48•8 years ago
|
||
Attachment #8811020 -
Attachment is obsolete: true
Assignee | ||
Comment 49•8 years ago
|
||
Attachment #8811015 -
Attachment is obsolete: true
Attachment #8811919 -
Flags: review?(nthomas)
Assignee | ||
Updated•8 years ago
|
Attachment #8811794 -
Flags: review?(nthomas)
Assignee | ||
Updated•8 years ago
|
Attachment #8810961 -
Flags: feedback?(nthomas)
Comment 50•8 years ago
|
||
Comment on attachment 8811919 [details] [diff] [review]
bug1176784bb2.patch
>diff --git a/mozilla-tests/config_seta.py b/mozilla-tests/config_seta.py
>@@ -71,9 +70,21 @@ def get_seta_platforms(branch, platform_
> if os.environ.get('DISABLE_SETA'):
> return []
>
>+ global today
>+ today = date.today().strftime("%Y-%m-%d")
Doesn't look like this needs to be a global.
>+ if data == "":
>+     path = os.path.join(path, branch + "-seta.json")
>+     with open(path, 'r') as f:
>+         data = json.load(f)
Please add a log message when we fallback to disk data. We're not pumping the master logs into papertrail so there's no easy alerting, but a message in twistd.log would be useful if the sheriffs get in touch about SETA behaving strangely.
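A minimal sketch of a disk fallback that announces itself. The file naming follows the quoted patch; the function name and message wording are assumptions, and stdlib `logging` stands in for the master's own logging to twistd.log. The usage lines at the end write a throwaway file so the example runs on its own.

```python
import json
import logging
import os
import tempfile

log = logging.getLogger("config_seta")


def load_cached_seta(path, branch):
    """Read <path>/<branch>-seta.json and log that the fallback was used."""
    cache_file = os.path.join(path, branch + "-seta.json")
    log.warning("SETA server data unavailable for %s; using cached copy %s",
                branch, cache_file)
    with open(cache_file, "r") as f:
        return json.load(f)


# Usage with a throwaway directory standing in for mozilla-tests/.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "fx-team-seta.json"), "w") as f:
    json.dump({"jobtypes": {}}, f)
cached = load_cached_seta(demo_dir, "fx-team")
```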
Attachment #8811919 -
Flags: review?(nthomas) → review+
Comment 51•8 years ago
|
||
Comment on attachment 8811794 [details] [diff] [review]
bug1176784tools3.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
...
+sys.path.append("/builds/buildbot/tests_scheduler/tools/lib/python")
We often do this relative to the script, to make it more portable. eg
sys.path.append(os.path.join(os.path.dirname(__file__), "../../lib/python"))
>+ backup_file = configs_path + branch + "-seta.json.old"
No longer used ?
>+ssh_key = "/home/cltbld/.ssh/ffxbuild_rsa"
>+ssh_username = "ffxbuild"
s/ffxbuild/ffxbld/ please.
>+revision = "default"
If we merge infrequently from default to production, will the seta config changes be picked up "soon enough" ?
>+clone(remote, localrepo,revision)
Nit, whitespace missing.
>+#assume data could not be fetched
>+status = False
>+for branch in seta_branches:
>+ status = update_seta_data(branch, configs_path)
Did you mean to have a status check here ?
>+revision = commit(localrepo, msg, user=ssh_username)
>+push_cmd = ['hg', 'push']
>+push_value = run_cmd(push_cmd, cwd=localrepo)
You've imported push from util.hg, did it not work out ?
Attachment #8811794 -
Flags: review?(nthomas) → review+
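One possible reading of the status-check question above, sketched in Python 3: in the quoted loop each iteration overwrites status, so only the last branch's result survives. The helper below keeps a per-branch record instead. The name update_all and the update_one callable are hypothetical; update_one stands in for a call such as update_seta_data(branch, configs_path) and is assumed to return a truthy value on success.

```python
import logging

log = logging.getLogger(__name__)


def update_all(branches, update_one):
    """Run update_one(branch) for every branch and return the branches
    whose data was updated successfully."""
    updated = []
    for branch in branches:
        if update_one(branch):
            updated.append(branch)
        else:
            log.warning("SETA data for %s could not be updated", branch)
    return updated
```

The caller could then commit and push only when the returned list is non-empty, which avoids creating an empty commit when the server was unreachable for every branch.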
Comment 52•8 years ago
|
||
Comment on attachment 8810961 [details] [diff] [review]
bug1176784puppet2.patch
This looks plausible, but a puppet expert I'm not.
>diff --git a/modules/buildmaster/manifests/seta_update.pp b/modules/buildmaster/manifests/seta_update.pp
>+# this class manages cleanup functionality for the buildbot databases
Showing its origin there.
>+ python::virtualenv {
>+ "$seta_update_dir":
>+ python => "${packages::mozilla::python27::python}",
>+ require => Class['packages::mozilla::python27'],
>+ user => "${users::builder::username}",
>+ group => "${users::builder::group}",
>+ packages => [
>+ "SQLAlchemy==0.7.9",
>+ "MySQL-python==1.2.3",
Are these packages really required ?
>diff --git a/modules/buildmaster/templates/buildmaster-seta-update.erb b/modules/buildmaster/templates/buildmaster-seta-update.erb
>+@weekly <%=scope.lookupvar('users::builder::username')%> <%=@seta_update_dir%>/bin/python /builds/buildbot/tests_scheduler/tools/buildfarm/maintenance/update_seta.py -l <%=@seta_update_dir%>/update-seta.log
You have a tools checkout in <%=@seta_update_dir%>, so should use that I think. Does anything keep that up to date, or do we deploy changes with a manual hg pull ?
I like the idea of a log, but no sign of -l argument in the script.
Attachment #8810961 -
Flags: feedback?(nthomas) → feedback+
Assignee | ||
Comment 53•8 years ago
|
||
I made the variable "today" a global because if the local seta server is not available, its value is overwritten by the date value in the local copy of the seta data in the json file. If this is the case, then we need to use this value in the other methods, so I thought this was a better approach.
Attachment #8811919 -
Attachment is obsolete: true
Assignee | ||
Comment 54•8 years ago
|
||
Attachment #8811794 -
Attachment is obsolete: true
Assignee | ||
Comment 55•8 years ago
|
||
Attachment #8810961 -
Attachment is obsolete: true
Assignee | ||
Comment 56•8 years ago
|
||
Comment on attachment 8813281 [details] [diff] [review]
bug1176784tools4.patch
r? on this since I only asked for f? before
Attachment #8813281 -
Flags: review?(nthomas)
Comment 57•8 years ago
|
||
(In reply to Kim Moir [:kmoir] from comment #53)
> I made the variable "today" a global because if the local seta server is not
> available, its value is overwritten by the date value in the local copy of
> the seta data in the json file. If this is the case, then we need to use
> this value in the other methods, so I thought this was a better approach.
Ah, I see what you mean. Another approach would be to strip the date part out, and only work with data['jobtypes'][<date>].
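A rough sketch of that alternative in Python 3, assuming the payload has the shape {"jobtypes": {"<date>": [...]}} with exactly one date key. That shape is inferred from the expression data['jobtypes'][<date>] above and is not confirmed by the patch; the function name jobs_without_date is hypothetical.

```python
def jobs_without_date(data):
    """Return the job list from a SETA payload, ignoring which date it is
    keyed under.

    Assumes data looks like {"jobtypes": {"2016-11-22": [...]}} with a
    single date entry; anything else raises ValueError.
    """
    jobtypes = data["jobtypes"]
    if len(jobtypes) != 1:
        raise ValueError("expected exactly one date key, got %d" % len(jobtypes))
    # Take the only value; the date key itself is never consulted.
    return next(iter(jobtypes.values()))
```

With this, neither the freshly fetched data nor the copy on disk needs a shared "today" value, so the global is no longer necessary.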
Comment 58•8 years ago
|
||
Comment on attachment 8813281 [details] [diff] [review]
bug1176784tools4.patch
>diff --git a/buildfarm/maintenance/update_seta.py b/buildfarm/maintenance/update_seta.py
>+import logging
>+log = logging.getLogger(__name__)
I'm a convert to sending logs into papertrail. You could have a look at aws_watch_pending.py for an example of setting up a StreamHandler to send output to stdout/stderr, while also keeping a local log. Then you can append ' 2>&1 | logger -t update_seta' when you call the Python script.
>+ except socket.error, e:
>+ log.warming("Socket error when accessing %s: %s" % (url, str(e)))
Nit, warming log, must be on fire ;-)
>+ print("Retrying")
Nit, missed one print -> log.<level> change.
>+ data = wfetch(url)
>+ if data:
>+ #test if data cannot be fetched
Nit, old comment, or slightly in the wrong place?
>+ parser = OptionParser()
>+ parser.set_defaults(
>+ filename=None,
>+ loglevel=logging.INFO,
>+ logfile=None,
>+ skip_orphans=False,
Nit, unused filename and skip_orphans.
Attachment #8813281 -
Flags: review?(nthomas) → review+
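A minimal sketch of the logging setup suggested above, in Python 3: a StreamHandler on stderr plus an optional local log file selected with -l. This is not copied from aws_watch_pending.py; the function names, the -v option, and the log format are assumptions made for illustration.

```python
import logging
import sys
from optparse import OptionParser

log = logging.getLogger(__name__)


def parse_args(argv):
    """Parse only the options this sketch uses (no unused defaults)."""
    parser = OptionParser()
    parser.set_defaults(loglevel=logging.INFO, logfile=None)
    parser.add_option("-l", "--logfile", dest="logfile",
                      help="also write log messages to this file")
    parser.add_option("-v", "--verbose", dest="loglevel",
                      action="store_const", const=logging.DEBUG)
    options, _args = parser.parse_args(argv)
    return options


def setup_logging(loglevel, logfile=None):
    """Send log output to stderr and, if requested, to a local file."""
    root = logging.getLogger()
    root.setLevel(loglevel)
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")

    stream = logging.StreamHandler(sys.stderr)
    stream.setFormatter(formatter)
    root.addHandler(stream)

    if logfile:
        to_file = logging.FileHandler(logfile)
        to_file.setFormatter(formatter)
        root.addHandler(to_file)
```

Because everything goes through the logging module, the stderr output can be piped to logger with ' 2>&1 | logger -t update_seta' as described, and a stray print("Retrying") becomes log.info("Retrying").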
Assignee | ||
Comment 59•8 years ago
|
||
Attachment #8813281 -
Attachment is obsolete: true
Assignee | ||
Comment 60•8 years ago
|
||
Attachment #8813283 -
Attachment is obsolete: true
Assignee | ||
Updated•8 years ago
|
Attachment #8814977 -
Flags: checked-in+
Assignee | ||
Updated•8 years ago
|
Attachment #8813277 -
Flags: checked-in+
Comment 61•8 years ago
|
||
Comment 62•7 years ago
|
||
Correct if not fixed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Updated•7 years ago
|
Component: General Automation → General