Closed Bug 605278 Opened 14 years ago Closed 13 years ago

[Tracking bug] Dynamic slave allocation to buildbot masters

Categories

(Release Engineering :: General, defect)

Platform: x86 / All
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: dustin)

References

Details

(Whiteboard: [buildslaves][automation][q4goal])

Moving slaves between buildbot masters is a manual process right now, and a time-consuming one at that. It would be nice to have slaves query a service that would tell them where to connect (e.g. provide them with a buildbot.tac). If the service is down, the slave would connect to the previous known-good master.

catlee has done some work on a pylons project to handle slave allocation. That work can be found here: http://hg.mozilla.org/users/catlee_mozilla.com/slaves/

This project hasn't been properly scoped yet. Hopefully it can be worked into existing plans for buildbot in Q4. Sub-projects can hang off of this tracking bug as they are itemized.
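For context, the buildbot.tac a slave would fetch is just a short Twisted application file. A minimal sketch is below -- the master hostname, port, slave name, password, and basedir are placeholders, not real values, and on older installs BuildSlave is imported from buildbot.slave.bot rather than the buildslave package:

# Sketch of a slave-side buildbot.tac; all values are placeholders.
from twisted.application import service
from buildslave.bot import BuildSlave

basedir = '/builds/slave'                                  # placeholder basedir
buildmaster_host = 'buildbot-master.example.mozilla.org'   # assigned by the allocator
port = 9001                                                # placeholder master port
slavename = 'example-slave01'                              # placeholder slave name
passwd = open('/builds/slave/passwd').read().strip()       # password distributed separately

application = service.Application('buildslave')
s = BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
               keepalive=600, usepty=False)
s.setServiceParent(application)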
In various discussions, it's become pretty clear that a web service is a good idea. At this point, I'm loath to put this support into Buildbot itself: I think that a web service that runs before starting the slave at all is a less disruptive change (and it doesn't require updating Buildbot on the slaves). I'm going to divide this up as follows:

1. Using a "fake" web service (.txt on people.m.o/~dmitchell, probably), get the slaves to download buildbot.tac before running it. In bug 508673, nthomas mentioned problems with auto-restarting behavior. Hopefully I'll encounter and solve that in this step.

2. Transition to a "real", dynamic web service. I don't want to set up an entirely new web service for this -- if I can drop a .php or .py or .cgi file somewhere that similar scripts are already located, that would be best.

3. Build a management frontend - probably a simple web display, and also a command-line tool?

I'll break those out into bugs as they firm up.
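As a rough illustration of step 2, the "real" service could be as small as a CGI script dropped into an existing scripts directory: it looks up the requesting slave's assigned master and emits a filled-in buildbot.tac. Everything below -- the allocation mapping, the tac template, and the 'slave' query parameter -- is invented for the sketch; it is not the catlee pylons app.

#!/usr/bin/env python
# Hypothetical minimal CGI allocator: slave name in, buildbot.tac out.
import cgi

# In a real deployment this mapping would live in a database; it's hard-coded
# here purely for illustration.
ALLOCATIONS = {
    'example-slave01': ('buildbot-master1.example.mozilla.org', 9001),
}

TAC_TEMPLATE = """\
from twisted.application import service
from buildslave.bot import BuildSlave

application = service.Application('buildslave')
BuildSlave('%(host)s', %(port)d, '%(slavename)s',
           open('/builds/slave/passwd').read().strip(),
           '/builds/slave', keepalive=600, usepty=False).setServiceParent(application)
"""

def main():
    form = cgi.FieldStorage()
    slavename = form.getfirst('slave', '')
    if slavename not in ALLOCATIONS:
        print 'Status: 404 Not Found\n'
        return
    host, port = ALLOCATIONS[slavename]
    print 'Content-Type: text/plain\n'
    print TAC_TEMPLATE % {'host': host, 'port': port, 'slavename': slavename}

if __name__ == '__main__':
    main()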
(In reply to comment #2)
> 1. using a "fake" web service (.txt on people.m.o/~dmitchell, probably), get
> the slaves to download buildbot.tac before running it. In bug 508673, nthomas
> mentioned problems with auto-restarting behavior. Hopefully I'll encounter and
> solve that in this step.

That comment may well be out of date since we started blocking buildbot startup on a successful puppet run. bhearsum is your man for the best info.
part 1 is in bug 611149. Adding onto #3, it would probably be nice for the frontend to loudly proclaim any slaves for which there is no configuration on record -- this might help to track down lost or misconfigured slaves!
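(The "loudly proclaim" part could be as simple as diffing the slaves that have asked the allocator for a tac against the slaves it has allocations for -- a hypothetical sketch, with both inputs assumed to be collected by the service:)

# Hypothetical: flag slaves that requested a tac but have no allocation on record.
def unknown_slaves(requests_seen, allocations):
    """requests_seen: set of slave names that hit the allocator;
    allocations: dict of slave name -> (master_host, master_port)."""
    return sorted(requests_seen - set(allocations))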
Status: NEW → ASSIGNED
Depends on: 611149
Blocks: 615301
OK, in a brief survey of how Buildbot slaves start up, here's what I've found so far.

== Linux ==

/etc/init.d/buildbot depends on /etc/init.d/puppet, which blocks until puppet runs. /etc/init.d/buildbot-tac will run right away. buildbot-tac is installed by puppet, so there are at least two race conditions here. buildbot-tac runs buildbot-tac.py only on the first boot, creating the buildbot.tac file for the slave. buildbot-tac.py is installed as part of an RPM (build-tools - bug 615301).

== Darwin ==

com.reductivelabs.puppet.plist runs /usr/local/bin/sleep-and-run-puppet.sh, which is presumably installed as part of the base image. This script sleeps for 60 seconds, then runs puppet in the foreground every 60 seconds until it succeeds. There is a buildbot-tac launchd script which runs /usr/local/bin/buildbot-tac, which is installed by puppet. This launchd script does not wait for puppet, so there is a race condition here, although the 60-second pause in the puppet launch script probably eliminates any risk. The buildbot launchd script waits until puppet has run, and then invokes 'buildbot start' directly.

== Windows ==

buildbot.bat is in cltbld's "Startup Items". It is installed via OPSI, but it's not in the OPSI hg repo, because it contains passwords. buildbot.bat runs buildbot-tac.py, via a checkout of http://hg.mozilla.org/build/tools at d:\tools. That checkout is only done once. Once buildbot-tac.py has created the tac file, buildbot.bat runs start-buildbot.bat, which sets up some VC++ variables and runs start-buildbot.sh, which finally runs 'buildbot start' in the appropriate directory.

== Windows Talos ==

Talos windows systems are not administered by OPSI, so someone starts Buildbot by hand, apparently.

Have I missed anything? What about Linux and Mac Talos slaves?
I'm not sure what OPSI and buildbot startup have to do with one another.....OPSI doesn't do any Buildbot launching, merely syncs out the files that do on certain platforms.

On XP machines, buildbot is started through startTalos.bat (from the "Startup" folder), which first calls out to buildbot-tac.py and then does a bit of other prep before calling buildbot.

On 32-bit Windows 7 (talos-r3-w7-NNN), startTalos.bat (on the Desktop) is launched at startup by the Task Scheduler. It doesn't look like we use buildbot-tac.py here at this time, but there's no reason we can't.

On 64-bit Windows 7 (t-r3-w764-NNN), startTalos.bat (on the Desktop) is launched at startup by the Task Scheduler. On this platform, startTalos.bat calls out to buildbot-tac.py prior to starting Buildbot.

On Linux test machines, startup looks like this:
- Start OS
- Launch X, autologin to GNOME as cltbld
- Autolaunch a Terminal session which runs /home/cltbld/run-puppet-and-buildbot.sh. That script:
  -- Runs Puppet in a loop until it succeeds
  -- Launches Buildbot

On Mac test machines, startup is the same as on Mac build machines, but the script is called run-puppet.sh.
Based on the previous two comments, there are six somewhat-related ways that buildslaves get started. From what I can tell, there's no good reason for these to be different, but changing all six will be difficult, especially for me!

I think that the easiest option right now is to create a runslave.py script that will try to download a new buildbot.tac and, failing that, use the existing buildbot.tac if one is present. Then it will launch the buildslave. This will allow us to remove the buildbot-tac{,.py} stuff (which is racy anyway).

I'll need to find a simple way to distribute this to slaves, too. It can be a file installed by puppet and OPSI, with the password distributed separately in its own file.
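To make the shape of that concrete, here's a rough sketch of what runslave.py could look like. The allocator URL, basedir, and the 'buildslave start' invocation are assumptions for illustration (older slaves would use 'buildbot start'); the real script will need per-platform tweaks.

#!/usr/bin/env python
"""Sketch of runslave.py: fetch a fresh buildbot.tac from the allocator,
fall back to the existing one if the fetch fails, then start the slave.
The URL and paths below are placeholders."""
import os
import socket
import subprocess
import urllib2

ALLOCATOR_URL = 'http://example.mozilla.org/buildslaves/%s.tac'  # placeholder
BASEDIR = '/builds/slave'                                        # placeholder
TAC_FILE = os.path.join(BASEDIR, 'buildbot.tac')

def fetch_tac(slavename):
    try:
        return urllib2.urlopen(ALLOCATOR_URL % slavename).read()
    except Exception:
        return None   # network/allocator failure -> fall back to existing tac

def main():
    slavename = socket.gethostname().split('.')[0]
    tac = fetch_tac(slavename)
    if tac:
        open(TAC_FILE, 'w').write(tac)
    elif not os.path.exists(TAC_FILE):
        raise SystemExit("no buildbot.tac available for %s" % slavename)
    # start the slave with whatever tac file we ended up with
    subprocess.check_call(['buildslave', 'start', BASEDIR])

if __name__ == '__main__':
    main()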
Sounds great, Dustin!
After a brief conf call with Ben and Catlee, I'm going to stick to the above plan. I'll get the scripts up on staging slaves shortly, and once I have a reasonably representative set of slaves tested, I'll get the patches up for review.

This is a big project, so it needs to be broken down into distinct steps, between which things are working and stable. The first of those steps is to get a set of .tac files matching existing slave configs stored in an http-accessible directory somewhere, and begin rolling out the slave modifications by hand/opsi/puppet as appropriate. Once that's done, and any day-to-day tasks have been automated with scripts in that directory, I'll start work on the backend.

The initial backend will be based on a static allocation of slaves to masters, stored in a database. Then I'll add dynamic balancing. Then I'll add a fancy-pants web2.0 UI.
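For the curious, the initial "static allocation in a database" could be little more than a single table plus a lookup. The schema and names below are invented for the sketch; the real backend will grow out of the pylons work linked in the description above.

# Sketch of the simplest possible static allocation backend (names invented).
import sqlite3

def get_master_for(db_path, slavename):
    """Return (master_host, master_port) for a slave, or None if unallocated.
    Assumes a table: slave_allocations(slavename, master_host, master_port)."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(
            "SELECT master_host, master_port FROM slave_allocations "
            "WHERE slavename = ?", (slavename,)).fetchone()
    finally:
        conn.close()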
For (my) future reference, here's the logic to decide which password to use:

passwd_file = None
if slave_matches('moz2'):
    passwd_file = "password.buildslave"
elif slave_matches('try-'):
    passwd_file = "password.buildslave.try"
elif slave_matches('talos', 'qm-p'):
    if slave_matches('ubuntu', 'linux', 'fed'):
        passwd_file = "password.talos"
    elif slave_matches('leopard', 'tiger', 'snow'):
        passwd_file = "password.talos.mac"
Dustin, not all try slaves have "try-" in their names. Not all build slaves have "moz2" in their names. In fact, only the VMs and the minis do. I could be misinterpreting the previous comment!
I didn't make up that logic - it's from /etc/init.d/buildbot. It's probably incomplete.
Any machines which are on try and not prefixed with "try-" were moved there after they were cloned -- and hence not handled by buildbot-tac.py. At this point, I think it's still a fair assumption to say that "try-" is a try machine and otherwise is not. We'll have the ability to override the defaults, so it's not a big deal.
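For anyone reading along, the slave_matches() helper used in the password-selection snippet above is presumably just a substring test against the slave's short hostname. This is a guess at its shape, not the actual logic from /etc/init.d/buildbot:

import socket

def slave_matches(*substrings):
    """Return True if any of the given substrings appears in this slave's
    short hostname, e.g. slave_matches('talos', 'qm-p')."""
    hostname = socket.gethostname().split('.')[0].lower()
    return any(s in hostname for s in substrings)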
Depends on: 616003
Depends on: 616344
Depends on: 616350
Depends on: 616351
Depends on: 616352
Blocks: 613106
Depends on: 618369
Depends on: 628797
No longer blocks: 615301
Depends on: 629690
Depends on: 629692
No longer depends on: 616352
Depends on: 633365
Blocks: 637349
Depends on: 656175
No longer depends on: 629690
This is done at this point. The remaining blocker bug will happen when we re-deploy w764.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering