Bug 643506 (Closed) · Opened 14 years ago · Closed 13 years ago

Move 7 automation and tools machines to new Addons Testing Secure buildbot pool

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: cmtalbert, Unassigned)

References

Details

(Whiteboard: [buildslaves][slaveduty])

In order to find a way forward until new slaves are purchased as replacements, we are donating 7 machines to the addons testing buildbot pool.  These machines will form a secure pool that AMO folks can wire up to a "push button" feature on their site, initiating talos tests on these slaves for a particular addon at will.
For example: a user goes to their addon's page and clicks the test button; a buildbot master (from bug 617762) then initiates talos tests on these slaves.

The 7 minis we are donating are below and can all be found in AFK, atop the black cabinet.  Please take only these 7:
* tools-r3-fed-002
* tools-r3-fed64-002
* tools-r3-snow-002
* tools-r3-leopard-002
* tools-r3-xp-002
* tools-r3-w7-002
* tools-r3-w764-002

These machines are not active and will be shut down a few minutes after this bug is filed.
Sent email to bhearsum and zandr with usernames/passwords for these machines.
Assignee: server-ops → server-ops-releng
Component: Server Operations → Server Operations: RelEng
QA Contact: mrz → zandr
These should be re-imaged and renamed as follows:

addon-r3-fed-001
addon-r3-fed-002
addon-r3-snow-001
addon-r3-snow-002
addon-r3-w7-001
addon-r3-w7-002
addon-r3-w7-003

This gives a reasonable (but small) number of machines per platform.  We would like these to live on an isolated network with talos-addon-master1.

ctalbert tells me that the machines are still just powered down and have not been touched since this bug was originally filed.
Blocks: 599169
No longer depends on: 599169
Assignee: server-ops-releng → zandr
Status: NEW → UNCONFIRMED
Ever confirmed: false
These machines will do initial baking in the releng vpn - please talk to dustin for details.
tools-r3-xp-002        -> addon-r3-fed-001  Port 6   reimage done
tools-r3-w7-002        -> addon-r3-fed-002  Port 12  reimage done
tools-r3-w764-002      -> addon-r3-snow-001 Port 13  booted into deploystudio
tools-r3-fed64-002     -> addon-r3-snow-002 Port 16  booted into deploystudio
tools-r3-fed-002       -> addon-r3-w7-001   Port 17  booted into deploystudio
tools-r3-leopard-002   -> addon-r3-w7-002   Port 18  booted into deploystudio
tools-r3-snow-002      -> addon-r3-w7-003   Port 19  booted into deploystudio
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
These have been added as addon-r3-<os>-<#>.build.scl1.mozilla.com and imaged appropriately.

Over to Releng for baking.
Assignee: zandr → nobody
Component: Server Operations: RelEng → Release Engineering
QA Contact: zandr → release
Whiteboard: [buildslaves][slaveduty]
Priority: -- → P3
(In reply to comment #5)
> These have been added as addon-r3-<os>-<#>.build.scl1.mozilla.com and imaged
> appropriately.
> 
> Over to Releng for baking.

As I am fairly new to this whole OPSI/Puppet stuff, I'm curious: how long does a machine need to be on the network to be "fully baked"?
Not long - it's almost trivial for puppet, but OPSI always takes some nontrivial TLC (aka, poking with a sharp stick) before it starts working.  I'll work on it today or tomorrow, depending on what comes up.
These two are done:

addon-r3-fed-001
addon-r3-fed-002

Each has its root, cltbld, and vnc passwords set to something different from the default - I'll give it to you on request in IRC.  These are ready to point at a buildmaster by simply adding a buildbot.tac in ~/talos-slave.  It's currently running runslave.py, which tries to access the slave allocator and nagios on every boot, but this won't hurt anything.  There are no SSH keys on these machines.  Hostnames are set to the above names, unqualified, which I don't expect to cause problems.
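For illustration, a slave-side buildbot.tac is a small Twisted application config. This is only a hedged sketch assuming the 0.8-era buildbot-slave package (consistent with this bug's timeframe); the master hostname, port, and password below are placeholders, not values from this bug:

```python
# Sketch of a slave-side buildbot.tac for ~/talos-slave (buildbot-slave 0.8.x).
# All values below are assumptions/placeholders; the real ones come from the
# buildmaster's configuration.
from twisted.application import service
from buildslave.bot import BuildSlave

basedir = '/Users/cltbld/talos-slave'
buildmaster_host = 'talos-addon-master1'   # assumed master hostname
port = 9001                                # assumed slave port on the master
slavename = 'addon-r3-fed-001'
passwd = 'CHANGEME'                        # per-slave password set on the master

application = service.Application('buildslave')
BuildSlave(buildmaster_host, port, slavename, passwd, basedir,
           keepalive=600, usepty=False).setServiceParent(application)
```

With this file in place, `buildslave start ~/talos-slave` picks it up on the next start; runslave.py normally generates an equivalent file from the slave allocator.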

Sadly, these two:

addon-r3-snow-001
addon-r3-snow-002

have the wrong version of Mac OS X on them.  I thought we had updated the refimage for the new version, but apparently not.  With Nick and Bear's help, I found the updater from Apple, but the update failed.  So, updating the refimage is now a blocker for this bug.  The -snow-* slaves are halted awaiting re-re-imaging.  Nothing's ever easy.

I'll work on getting the w7 slaves to talk to OPSI tomorrow.
Depends on: 655199
Please ensure that you scrub out puppet/opsi before you hand back, so that those talos slaves don't block on reboot waiting on those resources.
Ah, there are only W7 systems, so OPSI is not involved - a pleasure for everyone, let me tell you.  So 

addon-r3-w7-001
addon-r3-w7-002
addon-r3-w7-003

are finished, as well.  I shut down -001 and then thought better of it, since I may have forgotten something that I'll need to touch up.

For reference, since I assume you'll want to change the password I've assigned, the instructions for changing passwords are all here: https://intranet.mozilla.org/Build:Farm:Password_Maintenance  (for many you need to read the script).

So we're just waiting for the re-re-image of the snow machines.
From the last few comments here, Dustin is obviously doing the work, so I'm reassigning to him to avoid confusion in other groups.

(In reply to comment #10)
> So we're just waiting for the re-re-image of the snow machines.
Dustin, can you add the bug# tracking the re-re-imaging as a dep.bug so we can all follow along?
Assignee: nobody → dustin
The reimage will be on this bug, but the refimage snapshot that's blocking it is bug 656042
Depends on: releng-snapshots
No longer depends on: 655199
Assignee: dustin → zandr
Which network are these machines now located on?  They should be isolated on a network with talos-addon-master1 to protect the rest of our infrastructure.
They're still on the build network - waiting on bug 656042 to re-image the snow machines (and a few dozen others..)
w7-001 is powered back up, and snow-00[12] have been imaged.

I'm going to file a new bug for the network infra required to isolate these machines on a new VLAN.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Go ahead with the network move.  I'll make sure the snow machines are puppeted while you work on that.
Assignee: zandr → dustin
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Hmm,

dustin@lorentz ~ $ ssh cltbld@addon-r3-snow-001.build.scl1.mozilla.com
ssh: connect to host addon-r3-snow-001.build.scl1.mozilla.com port 22: Connection refused

dustin@lorentz ~ $ ssh cltbld@addon-r3-snow-002.build.scl1.mozilla.com
ssh: connect to host addon-r3-snow-002.build.scl1.mozilla.com port 22: Connection refused
So let me contradict myself in comment #16 - don't move these to the new network until I get access to the snow-leopard machines and make sure they're configured correctly.

And please help me with that :)
(In reply to comment #18)

> And please help me with that :)

Sigh. There's an issue with our new SL image. Will resolve tomorrow.
Does it make sense to split the snow-leopard slaves out of this bug, and move forward with the rest of this?
If you like, but I have a new SL image and will hit these this afternoon.

Comment 19 was written before the win64 firedrills.
Given that there are network changes to make, better to do that once than try to piece it out.  Let's make these the first SL slaves to get reimaged, though.
Reimaging of the SL machines is complete, and the hostnames have been updated.
Hooray!  The snow machines are now fully puppeted as well.

For the record, on the fedora systems, I edited run-puppet-and-buildslave.sh to remove the puppet runs:

if false; then ## commented out for addons testing
 .. puppet stuff
fi
 .. start buildbot

On the snow-leopard systems, I removed
 /Library/LaunchDaemons/com.reductivelabs.puppet.plist
and
--- /Library/LaunchAgents/org.mozilla.build.buildslave.plist.old        2011-06-06 19:49:48.000000000 -0700
+++ /Library/LaunchAgents/org.mozilla.build.buildslave.plist    2011-06-06 19:50:02.000000000 -0700
@@ -29,14 +29,9 @@
         <string>/usr/bin/python</string>
         <string>/usr/local/bin/runslave.py</string>
     </array>
-    <!-- do not run immediately when loaded -->
+    <!-- run immediately when loaded -->
     <key>RunAtLoad</key>
-    <false/>
-    <!-- but run when puppet (which is running as root) touches this file -->
-    <key>WatchPaths</key>
-    <array>
-        <string>/var/puppet/run/puppet.finished</string>
-    </array>
+    <true/>
     <key>WorkingDirectory</key>
     <string>/Users/cltbld</string>
 </dict>
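As a quick sanity check after an edit like the one above, the resulting plist can be parsed with Python's plistlib to confirm that RunAtLoad is now true and the puppet WatchPaths trigger is gone. The plist body below is a minimal stand-in for the edited file, not its real contents:

```python
import io
import plistlib

# Minimal stand-in for the edited LaunchAgent (assumed content, not the real file).
plist_bytes = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>ProgramArguments</key>
    <array>
        <string>/usr/bin/python</string>
        <string>/usr/local/bin/runslave.py</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>WorkingDirectory</key>
    <string>/Users/cltbld</string>
</dict>
</plist>
"""

data = plistlib.load(io.BytesIO(plist_bytes))
# After the edit, the slave should start at load rather than wait for puppet.
assert data["RunAtLoad"] is True
assert "WatchPaths" not in data
print("plist ok")
```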

Over to zandr for the new netops bug.
Assignee: dustin → zandr
As a note, we don't currently have a master to connect these slaves to - so we are not blocked by having this sit for a while longer.  See bug 617762, bug 659512.
Just reinstalled dongles, btw.

Are we done here?
zandr -- just the network move for these slaves.  I'll leave you to interface with netops there, as I don't have the details of the new vlan/network.
My master has appeared, so now I'm interested in the status of these slaves.

Are they still in netops limbo?
Dustin - the w7 machines are requesting an activation key; I thought that was built into the image.  Otherwise, do you have keys for them?
I'll take care of activating them. Might not happen today, if not I'll hit them first thing tomorrow.
(alice, I suspect they needed to be re-activated after moving to the new network)
Have the vista slaves been activated?
Can I get an update here?
addon-r3-w7-001 and 002 have been activated. I sawed off the limb I was sitting on, so 003 will take a site visit.
Moving so I can set colo-trip.
Assignee: zandr → server-ops-releng
Component: Release Engineering → Server Operations: RelEng
QA Contact: release → zandr
colo-trip: --- → scl1
and -003 is activated.
Status: REOPENED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
These are up and working now.
Component: Server Operations: RelEng → Release Engineering
Product: mozilla.org → Release Engineering