Closed
Bug 1020202
Opened 11 years ago
Closed 9 years ago
Machines should be discovered automatically; we should not need to manage machine names in our configs
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: pmoore, Unassigned)
References
Details
(Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2110] )
An example of this:
https://bug1013035.bugzilla.mozilla.org/attachment.cgi?id=8425210
This change was needed because machines moved from scl1 to scl3.
Managing things like this by hand is incredibly error-prone.
I propose that we adopt naming conventions for hosts and, as soon as they are registered in DHCP, use DHCP as the source of information about which machines of each class exist in our network.
There's a whole bunch of automation that would simplify our lives massively. For example, if you put a new machine on the network following the naming convention, puppet could automatically detect it and apply the appropriate image, slavealloc could automatically add it, and the buildbot configs could be updated automatically. Before you know it, adding a machine to our infrastructure is just a case of giving it a hostname that matches the naming convention, and everything else happens automatically.
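To make the proposal concrete, here is a minimal sketch of the discovery side: classify whatever hostnames the registration source (DHCP leases, inventory, etc.) reports, instead of maintaining literal host lists in configs. The naming convention and example hostnames here are hypothetical, not an actual releng convention.

```python
import re

# Hypothetical convention: <pool>-<platform>-<NNN>, e.g. "bld-linux64-001".
HOST_PATTERN = re.compile(r"^(?P<pool>[a-z]+)-(?P<platform>[a-z0-9]+)-(?P<num>\d{3})$")

def pools_from_hostnames(hostnames):
    """Group discovered hostnames into pools according to the naming convention.

    Hostnames that don't match the convention (other hosts on the same
    network) are silently ignored.
    """
    pools = {}
    for name in hostnames:
        m = HOST_PATTERN.match(name)
        if m is None:
            continue  # not one of ours
        key = (m.group("pool"), m.group("platform"))
        pools.setdefault(key, []).append(name)
    return pools

# Whatever DHCP/inventory reports, including unrelated hosts:
discovered = ["bld-linux64-001", "bld-linux64-002", "try-linux64-001", "printer-3"]
print(pools_from_hostnames(discovered))
```

With something like this feeding puppet, slavealloc, and the buildbot configs, the hand-maintained lists in the linked attachment would become derived data rather than source.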
There is absolutely no reason for us to maintain lists like this: https://bug1013035.bugzilla.mozilla.org/attachment.cgi?id=8425210.
Comment 1•11 years ago
I like this!
On-site, inventory would be a good source of this info. On-demand/reserved instances in AWS could work with a query to EC2. But how would it work with spot?
Comment 2•11 years ago
Historically, the reason we've done the mapping is to allow relatively quick re-adjustment of pool sizes. The theory was that we could quickly move additional build machines into the try pool.
Before opining on whether this is a good idea, I'd like to understand how our existing use cases are handled under this approach. Note that:
a) too many of our use cases are unwritten lore
b) "no longer a business need" is a valid answer
Use cases that I believe I was told the old mapping supported were:
- easy to move machines from one pool to another without messing up relops & dcops (pre-inventory days)
(maybe that's it -- other team members with better lore should be consulted)
In general, I love the idea, as I hate those list comprehensions. I do not like "encoding" information into hostnames -- I believe in a UID for host names, with everything else handled by various lookups (inventory, CNAMEs, relengapi). "Encoding" has many shortcomings that have bitten both us and the groups trying to work with us (think builder names, branch names).
Comment 3•11 years ago
I worry about:
* staging/dev, where we may have non-standard naming conventions, and where we might otherwise want to point at non-production slave lists. As long as we can manage that easily, this should be ok.
* AIUI, spot instances don't have DNS, by design.
* test-masters.sh may become [more?] dependent on a network connection. I'd love to keep these things runnable on a laptop without network; we may be able to help this by pointing at a flat file.
* New slave additions will still require a reconfig. This probably isn't an issue, as it isn't a new requirement; just noting that adding new machines with the right names isn't sufficient on its own.
Comment 4•11 years ago
We chatted a bit about this today. I'd like to propose that we treat slavealloc as the Source of Truth for these names.
For production, we add something to 'make update' that downloads the relevant bits from slavealloc and saves them locally. master.cfg (or some file it imports) then references that local data to construct the list of buildbot slave objects.
For dev/staging, we could do something similar. TBD what to do with test-masters.sh.
The drawback here is if a slave is pulled from production to do some testing, then a reconfig happens, then the slave is put back into production in slavealloc, it will require another reconfig to be 'live'. IMO, this is a pretty minor downside.
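The 'make update' approach described above could look roughly like this in master.cfg: a function that filters a locally cached slavealloc snapshot down to this master's pool. The field names and pool names are illustrative assumptions, not slavealloc's actual schema.

```python
def slaves_for_pool(snapshot, pool):
    """Return the enabled slave names allocated to `pool` in the cached snapshot.

    `snapshot` is the parsed contents of the JSON file that 'make update'
    would download from slavealloc (hypothetical schema).
    """
    return sorted(s["name"] for s in snapshot
                  if s["pool"] == pool and s["enabled"])

# In master.cfg one would then do, roughly:
#   with open("slavealloc-snapshot.json") as f:
#       names = slaves_for_pool(json.load(f), pool="bld-linux64")
#   c['slaves'] = [BuildSlave(name, passwords[name]) for name in names]

snapshot = [
    {"name": "bld-linux64-001", "pool": "bld-linux64", "enabled": True},
    {"name": "bld-linux64-002", "pool": "bld-linux64", "enabled": False},
    {"name": "try-linux64-001", "pool": "try-linux64", "enabled": True},
]
print(slaves_for_pool(snapshot, "bld-linux64"))  # → ['bld-linux64-001']
```

Because the snapshot is read at config-load time, a slave re-enabled in slavealloc only becomes live at the next reconfig, which is the drawback noted above.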
Comment 5•11 years ago
Slavealloc also isn't trustworthy.
Maybe it's time to fix that, perhaps by moving slavealloc into relengapi?
Comment 6•11 years ago
We're certainly treating it as trustworthy from buildbot and puppet.
Comment 7•11 years ago
No, we were actually *very* careful about that in puppet for exactly this reason. Slave trustlevels, which are the important bit of data in terms of trustworthiness, are not gathered from slavealloc. In fact, the only thing puppet currently uses is the environment.
I agree that we should have a single list of slaves in a database somewhere -- just not in slavealloc as it's currently defined. I think that the easiest fix to this is to move the administrative bits of slavealloc into relengapi, which has proper authentication. The allocator itself is so simple that it could also be rewritten easily, thereby freeing slavealloc of Twisted, but that's optional.
Relatedly, would it make sense to synchronize that list of slaves from higher sources of truth, even though there are several of those (inventory for onsite, AWS for EC2, probably another vendor for cloud macs)? That could certainly be a subsequent step in developing this fix.
Updated•10 years ago
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2101]
Updated•10 years ago
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2101] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/2110]
Updated•7 years ago
Component: General Automation → General