Closed Bug 605278 Opened 14 years ago Closed 13 years ago

[Tracking bug] Dynamic slave allocation to buildbot masters

Categories

(Release Engineering :: General, defect)

Platform: x86 / All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: dustin)

References

Details

(Whiteboard: [buildslaves][automation][q4goal])

Moving slaves between buildbot masters is a manual process right now, and a time-consuming one at that. It would be nice to have slaves query a service that would tell them where to connect (e.g. provide them with a buildbot.tac). If the service is down, the slave would connect to the previous known-good master.

catlee has done some work on a pylons project to handle slave allocation. That work can be found here: http://hg.mozilla.org/users/catlee_mozilla.com/slaves/

This project hasn't been properly scoped yet. Hopefully it can be worked into existing plans for buildbot in Q4. Sub-projects can hang off of this tracking bug as they are itemized.
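For context, the buildbot.tac a slave would fetch is just a short Twisted application file. A minimal sketch is below -- the master hostname, port, slave name, password, and basedir are placeholders, not real values, and on older installs BuildSlave is imported from buildbot.slave.bot rather than the buildslave package:

# Sketch of a slave-side buildbot.tac; all values are placeholders.
from twisted.application import service
from buildslave.bot import BuildSlave

basedir = '/builds/slave'                                  # placeholder basedir
buildmaster_host = 'buildbot-master.example.mozilla.org'   # assigned by the allocator
port = 9001                                                # placeholder master port
slavename = 'example-slave01'                              # placeholder slave name
passwd = open('/builds/slave/passwd').read().strip()       # password distributed separately

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive=600, usepty=False)
s.setServiceParent(application)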
In various discussions, it's become pretty clear that a web service is a good idea. At this point, I'm loath to put this support into Buildbot itself: I think that a web service that runs before starting the slave at all is a less disruptive change (and it doesn't require updating Buildbot on the slaves). I'm going to divide this up as follows:

1. Using a "fake" web service (.txt on people.m.o/~dmitchell, probably), get the slaves to download buildbot.tac before running it. In bug 508673, nthomas mentioned problems with auto-restarting behavior. Hopefully I'll encounter and solve that in this step.

2. Transition to a "real", dynamic web service. I don't want to set up an entirely new web service for this -- if I can drop a .php or .py or .cgi file somewhere that similar scripts are already located, that would be best.

3. Build a management frontend - probably a simple web display, and also a command-line tool?

I'll break those out into bugs as they firm up.
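As a rough illustration of step 2, the "real" service could be as small as a CGI script dropped into an existing scripts directory: it looks up the requesting slave's assigned master and emits a filled-in buildbot.tac. Everything below -- the allocation mapping, the tac template, and the 'slave' query parameter -- is invented for the sketch; it is not the catlee pylons app.

#!/usr/bin/env python
# Hypothetical minimal CGI allocator: slave name in, buildbot.tac out.
import cgi

# In a real deployment this mapping would live in a database; it's hard-coded
# here purely for illustration.
ALLOCATIONS = {
    'example-slave01': ('buildbot-master1.example.mozilla.org', 9001),
}

TAC_TEMPLATE = """\
from twisted.application import service
from buildslave.bot import BuildSlave

application = service.Application('buildslave')
BuildSlave('%(host)s', %(port)d, '%(slavename)s',
           open('/builds/slave/passwd').read().strip(),
           '/builds/slave', keepalive=600, usepty=False).setServiceParent(application)
"""

def main():
    form = cgi.FieldStorage()
    slavename = form.getfirst('slave', '')
    if slavename not in ALLOCATIONS:
        print 'Status: 404 Not Found\n'
        return
    host, port = ALLOCATIONS[slavename]
    print 'Content-Type: text/plain\n'
    print TAC_TEMPLATE % {'host': host, 'port': port, 'slavename': slavename}

if __name__ == '__main__':
    main()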
(In reply to comment #2)
> 1. using a "fake" web service (.txt on people.m.o/~dmitchell, probably), get
> the slaves to download buildbot.tac before running it. In bug 508673, nthomas
> mentioned problems with auto-restarting behavior. Hopefully I'll encounter and
> solve that in this step.

That comment may well be out of date since we started blocking buildbot startup on a successful puppet run. bhearsum is your man for the best info.
part 1 is in bug 611149. Adding onto #3, it would probably be nice for the frontend to loudly proclaim any slaves for which there is no configuration on record -- this might help to track down lost or misconfigured slaves!
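(The "loudly proclaim" part could be as simple as diffing the slaves that have asked the allocator for a tac against the slaves it has allocations for -- a hypothetical sketch, with both inputs assumed to be collected by the service:)

# Hypothetical: flag slaves that requested a tac but have no allocation on record.
def unknown_slaves(requests_seen, allocations):
    """requests_seen: set of slave names that hit the allocator;
    allocations: dict of slave name -> (master_host, master_port)."""
    return sorted(requests_seen - set(allocations))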
Status: NEW → ASSIGNED
Depends on: 611149
Blocks: 615301
OK, in a brief survey of how Buildbot slaves start up, here's what I've found so far.

== Linux ==

/etc/init.d/buildbot depends on /etc/init.d/puppet, which blocks until puppet runs. /etc/init.d/buildbot-tac will run right away. buildbot-tac is installed by puppet, so there are at least two race conditions here. buildbot-tac runs buildbot-tac.py only on the first boot, creating the buildbot.tac file for the slave. buildbot-tac.py is installed as part of an RPM (build-tools - bug 615301).

== Darwin ==

com.reductivelabs.puppet.plist runs /usr/local/bin/sleep-and-run-puppet.sh, which is presumably installed as part of the base image. This script sleeps for 60 seconds, then runs puppet in the foreground every 60 seconds until it succeeds. There is a buildbot-tac launchd script which runs /usr/local/bin/buildbot-tac, which is installed by puppet. This launchd script does not wait for puppet, so there is a race condition here, although the 60-second pause in the puppet launch script probably eliminates any risk. The buildbot launchd script waits until puppet has run, and then invokes 'buildbot start' directly.

== Windows ==

buildbot.bat is in cltbld's "Startup Items". It is installed via OPSI, but it's not in the OPSI hg repo, because it contains passwords. buildbot.bat runs buildbot-tac.py, via a checkout of http://hg.mozilla.org/build/tools at d:\tools. That checkout is only done once. Once buildbot-tac.py has created the tac file, buildbot.bat runs start-buildbot.bat, which sets up some VC++ variables and runs start-buildbot.sh, which finally runs 'buildbot start' in the appropriate directory.

== Windows Talos ==

Talos windows systems are not administered by OPSI, so someone starts Buildbot by hand, apparently.

Have I missed anything? What about Linux and Mac Talos slaves?
I'm not sure what OPSI and buildbot startup have to do with one another.....OPSI doesn't do any Buildbot launching, merely syncs out the files that do on certain platforms.

On XP machines, buildbot is started through startTalos.bat (from the "Startup" folder), which first calls out to buildbot-tac.py and then does a bit of other prep before calling buildbot.

On 32-bit Windows 7 (talos-r3-w7-NNN), startTalos.bat (on the Desktop) is launched at startup by the Task Scheduler. It doesn't look like we use buildbot-tac.py here at this time, but there's no reason we can't.

On 64-bit Windows 7 (t-r3-w764-NNN), startTalos.bat (on the Desktop) is launched at startup by the Task Scheduler. On this platform, startTalos.bat calls out to buildbot-tac.py prior to starting Buildbot.

On Linux test machines, startup looks like this:
- Start OS
- Launch X, autologin to GNOME as cltbld
- Autolaunch a Terminal session which runs /home/cltbld/run-puppet-and-buildbot.sh. That script:
  -- Runs Puppet in a loop until it succeeds
  -- Launches Buildbot

On Mac test machines, startup is the same as on Mac build machines, but the script is called run-puppet.sh.
Based on the previous two comments, there are six somewhat-related ways that buildslaves get started. From what I can tell, there's no good reason for these to be different, but changing all six will be difficult, especially for me!

I think that the easiest option right now is to create a runslave.py script that will try to download a new buildbot.tac and, failing that, use the existing buildbot.tac if one is present. Then it will launch the buildslave. This will allow us to remove the buildbot-tac{,.py} stuff (which is racy anyway).

I'll need to find a simple way to distribute this to slaves, too. It can be a file installed by puppet and OPSI, with the password distributed separately in its own file.
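To make the shape of that concrete, here's a rough sketch of what runslave.py could look like. The allocator URL, basedir, and the 'buildslave start' invocation are assumptions for illustration (older slaves would use 'buildbot start'); the real script will need per-platform tweaks.

#!/usr/bin/env python
"""Sketch of runslave.py: fetch a fresh buildbot.tac from the allocator,
fall back to the existing one if the fetch fails, then start the slave.
The URL and paths below are placeholders."""
import os
import socket
import subprocess
import urllib2

ALLOCATOR_URL = 'http://example.mozilla.org/buildslaves/%s.tac'  # placeholder
BASEDIR = '/builds/slave'                                        # placeholder
TAC_FILE = os.path.join(BASEDIR, 'buildbot.tac')

def fetch_tac(slavename):
    try:
        return urllib2.urlopen(ALLOCATOR_URL % slavename).read()
    except Exception:
        return None   # network/allocator failure -> fall back to existing tac

def main():
    slavename = socket.gethostname().split('.')[0]
    tac = fetch_tac(slavename)
    if tac:
        open(TAC_FILE, 'w').write(tac)
    elif not os.path.exists(TAC_FILE):
        raise SystemExit("no buildbot.tac available for %s" % slavename)
    # start the slave with whatever tac file we ended up with
    subprocess.check_call(['buildslave', 'start', BASEDIR])

if __name__ == '__main__':
    main()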
Sounds great, Dustin!
After a brief conf call with Ben and Catlee, I'm going to stick to the above plan. I'll get the scripts up on staging slaves shortly, and once I have a reasonably representative set of slaves tested, I'll get the patches up for review.

This is a big project, so it needs to be broken down into distinct steps, between which things are working and stable. The first of those steps is to get a set of .tac files matching existing slave configs stored in an http-accessible directory somewhere, and begin rolling out the slave modifications by hand/opsi/puppet as appropriate. Once that's done, and any day-to-day tasks have been automated with scripts in that directory, I'll start work on the backend.

The initial backend will be based on a static allocation of slaves to masters, stored in a database. Then I'll add dynamic balancing. Then I'll add a fancy-pants web2.0 UI.
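For the curious, the initial "static allocation in a database" could be little more than a single table plus a lookup. The schema and names below are invented for the sketch; the real backend will grow out of the pylons work linked in the description above.

# Sketch of the simplest possible static allocation backend (names invented).
import sqlite3

def get_master_for(db_path, slavename):
    """Return (master_host, master_port) for a slave, or None if unallocated.
    Assumes a table: slave_allocations(slavename, master_host, master_port)."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT master_host, master_port FROM slave_allocations "
            "WHERE slavename = ?", (slavename,)).fetchone()
    finally:
        conn.close()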
For (my) future reference, here's the logic to decide which password to use:

passwd_file = None
if slave_matches('moz2'):
    passwd_file = "password.buildslave"
elif slave_matches('try-'):
    passwd_file = "password.buildslave.try"
elif slave_matches('talos', 'qm-p'):
    if slave_matches('ubuntu', 'linux', 'fed'):
        passwd_file = "password.talos"
    elif slave_matches('leopard', 'tiger', 'snow'):
        passwd_file = "password.talos.mac"
Dustin, not all try slaves have "try-" in their names. Not all build slaves have "moz2" in their names. In fact, only the VMs and the minis do. I could be misinterpreting the previous comment!
I didn't make up that logic - it's from /etc/init.d/buildbot. It's probably incomplete.
Any machines which are on try and not prefixed with "try-" were moved there after they were cloned -- and hence not handled by buildbot-tac.py. At this point, I think it's still a fair assumption to say that "try-" is a try machine and otherwise is not. We'll have the ability to override the defaults, so it's not a big deal.
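For anyone reading along, the slave_matches() helper used in the password-selection snippet above is presumably just a substring test against the slave's short hostname. This is a guess at its shape, not the actual logic from /etc/init.d/buildbot:

import socket

def slave_matches(*substrings):
    """Return True if any of the given substrings appears in this slave's
    short hostname, e.g. slave_matches('talos', 'qm-p')."""
    hostname = socket.gethostname().split('.')[0].lower()
    return any(s in hostname for s in substrings)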
Depends on: 616003
Depends on: 616344
Depends on: 616350
Depends on: 616351
Depends on: 616352
Blocks: 613106
Depends on: 618369
Depends on: 628797
No longer blocks: 615301
Depends on: 629690
Depends on: 629692
No longer depends on: 616352
Depends on: 633365
Blocks: 637349
Depends on: 656175
No longer depends on: 629690
This is done at this point. The remaining blocker bug will happen when we re-deploy w764.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering