Closed Bug 474572 Opened 16 years ago Closed 15 years ago

set-up centralized deployment of software/machine configuration on win32 build slaves

Categories

(Release Engineering :: General, defect, P3)

x86
Windows Server 2003
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

References

Details

Attachments

(4 files)

Corey has been looking into ways in which we can quickly deploy software updates and other such things to our win32 build slaves. We've looked at OCS, and decided it's not a production system. Other options may be opsi, or perhaps even commercial solutions. This bug is to track progress on this. We're aiming to have a ready-to-deploy system (that is, running in staging without issue) by the end of the quarter. Corey, can you direct future updates about it to this bug, rather than e-mail?
A short synopsis: As Ben mentioned, we've tested OCS Inventory NG [1] and it seems a little too rough around the edges to provide all of the features we need. The one benefit that OCS has over any other open source software deployment package is that it will deploy to windows nodes while the nodes are running. There is an OCS service that runs on the nodes and "checks in" once an hour for updates. This works akin to the models of cfengine or puppet in the unix world. That single benefit does not outweigh the lack of features and cumbersome way the deployment happens on the client side. OPSI [2] deploys software during the bootup phase of the OS. This is different from OCS in that the system would need to be rebooted in order for changes to be made and new packages deployed. In addition, OPSI offers boot-time functions via PXE boot which can test memory, inventory hardware, or even deploy a fresh unattended install of the OS. The interface is more robust than OCS, and the system overall is much more featureful. The only problem I have run into with OPSI is trying to get the client to work properly in the win2k3 build slave reference image (whereas it is known to work fine on win2k3). I'll be working on getting this resolved ASAP so we can test OPSI further. Another option for the Windows side would be a commercial solution. There are a whole handful of options out there at varying prices. From a price perspective, Novell tends to come in below the competition with their Zenworks [3] package, something we can grab a trial of and test if desired. On the Linux/OSX side it has generally been accepted that puppet [4] is the way to go. I have not done anything with puppet on OSX but they report that Google uses puppet to manage multitudes of OSX workstations. As noted above, the way that puppet is able to deploy software versus the way OPSI deploys software is very different, so different methodologies would need to be employed here. 
As of this writing I am unable to find an open source package that handles software deployment and configuration management across all 3 platforms (Win, Lin, OSX). There are commercial solutions for this, KBox by KACE [5] being one of them.

[1] - http://www.ocsinventory-ng.org
[2] - http://www.opsi.org
[3] - http://www.novell.com/products/zenworks
[4] - http://reductivelabs.com/projects/puppet
[5] - http://www.kace.com

More on this later. Ideas and other input are more than welcome!
Assigning this to you Corey, hope you don't mind.
Assignee: nobody → cshields
Status: NEW → ASSIGNED
Priority: -- → P2
Update.. All of my OPSI testing at home was with WinXP, which worked great (both installing software on existing XP VMs and installing XP VMs from scratch), but there was a big issue in getting OPSI installed onto the Win2k3 reference image. The first component of OPSI is the 'preloginloader', which is the app that runs before the login process to check for any actions from the OPSI server. This was not installing properly at all on the win2k3 reference image (test-winslave.build.m.o). I was finally able to get this pushed to the win2k3 reference image this weekend. There was a goofed up winexecsvc service that needed to be removed from the win2k3 server, and the OPSI server side needed a newer winexe binary (meaning a custom compile of winexe). So, good progress there because that was a pain of a problem to solve, and I moved forward to deploying other OPSI components using the preloginloader. The next components that get installed are a software audit tool and its dependent python package. Somewhere in the installation of the python package, the win2k3 server hangs in a state where I cannot reach it via RDP (the RDP session connects but stays at a gray screen). At first I thought this may have been the python installation conflicting with a pre-existing python install in the reference image, so I tried the same steps on a bare win2k3 install (test-winslave2.build.m.o) and it produced the same result. So, now I have 2 win2k3 servers sitting in an unknown state. If I could get access to the console via VI (presumably) or get someone to take a look at them I'd appreciate it. Poke me on IRC if you want to take a stab at it. The bright side of all this is that the big OPSI blocker for the reference image is a non-issue now with the newer winexe. :)
Update... Over the last few nights and last weekend I ran into snags with OPSI trying to deploy packages only to have those packages sit at a console prompt from windows asking if the package should be trusted. Took a while but I found the registry settings required to allow "untrusted" packages to install on the win2k3 nodes. Basically you have to allow the launching of unsigned/untrusted apps within the appropriate internet security zone. Yet, doing this via the control panel only affects the current user so it has to be done manually through the registry to make it system-wide (HKLM). With this set I'm starting to create OPSI packages to test it out. I'm using the upcoming mobile build additions to the reference image[1] as something to trial this with. So far they seem to be working well and the first couple of packages are deploying just fine. Once I have them all done and working I'll uninstall the packages and let Ben have a shot at the UI. Again, any deployment with OPSI requires a reboot since the work is done in the pre-login boot up. But, when it comes to all of the build slaves this could be done easily with a linux script and winexe firing off a shutdown command across all of the slaves. [1] - https://wiki.mozilla.org/Mobile/Build/StepsToModifyRefVM
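The "linux script and winexe firing off a shutdown command" idea could look roughly like the sketch below. It only builds the command lines (a dry run) rather than running them. The credentials file path is hypothetical, the host list is just the two test VMs named in this bug, and the exact shutdown flags are an assumption that would need checking against the slaves.

```python
# Dry-run sketch: build one winexe command per slave that asks it to reboot.
# Nothing is executed here; the commands are only printed for review.

def reboot_command(host, auth_file):
    """Return the winexe argv that would ask a single Windows host to reboot."""
    return [
        "winexe",
        "-A", auth_file,        # read credentials from a file instead of argv
        f"//{host}",            # winexe addresses the target as //HOST
        "shutdown -r -f -t 0",  # assumed flags: reboot, force apps closed, no delay
    ]


def reboot_plan(hosts, auth_file):
    """Return the list of commands for every host, in the order given."""
    return [reboot_command(host, auth_file) for host in hosts]


if __name__ == "__main__":
    # Hypothetical inputs for illustration only.
    slaves = ["test-winslave", "test-winslave2"]
    for argv in reboot_plan(slaves, "/root/winexe.auth"):
        print(" ".join(argv))
```

Passing each list to subprocess.run would turn the dry run into a real rollout; keeping the two steps separate makes it easy to eyeball the plan first.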
Update.. I have the packages below set up and tested in OPSI, and able to deploy. I've been testing against both a build slave reference image VM and a bare win2k3 install VM. I've had the build slave reference VM (test-winslave) re-imaged to give us a fresh testbed for these deployments, and have sent a note off to bhearsum so he can give it a try. These packages were picked from the requirements mentioned in the URL above.

activesync
firefox (not necessary for the slaves, I was just testing this)
.NET compact framework v2 (netcfv2)
visualstudio (can be deployed but is already a part of the reference image so unnecessary for test-winslave)
windows mobile SDK (win-mobile-sdk)

Disk space is going to be the biggest issue here: the SDK takes quite a bit and the slave reference image is tight as it is.
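Since disk space is the main worry, a pre-flight check along these lines might be worth running before pushing a large package. This is only a sketch: the function name, the 1.5x safety margin, and the example size are assumptions, not anything OPSI provides.

```python
import shutil


def has_room(path, required_bytes, margin=1.5):
    """Return True if the volume holding `path` has enough free space.

    `margin` pads the requirement because installers usually need scratch
    space on top of the final installed size (1.5 is an arbitrary guess).
    """
    free = shutil.disk_usage(path).free
    return free >= required_bytes * margin


if __name__ == "__main__":
    # Hypothetical size for illustration; real numbers would come from the
    # package being deployed.
    sdk_bytes = 2 * 1024 ** 3
    print("enough space:", has_room(".", sdk_bytes))
```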
This is late notice, but we're having a conference call to go over the basics of OPSI-for-build-machines today at 2pm PDT. Anyone interested can join conf 294 to participate.
A bit of follow up on one of the issues from the phone conference.. The concern was brought up about the "status" of a product sometimes being displayed incorrectly. This would happen, for instance, whenever a goofed up product would install most of the way but then hang on an open file dialog or driver signing dialog, or something similar. OPSI would mark the product as "installed" when it really wasn't. Well, the short of this is that I haven't been able to replicate the problem when the product packages are properly setup and tested (maybe Ben has seen it happen?). It has always happened to me when I'm testing packages and they are not properly installing in the first place. Tonight I tried a little harsh environment testing by waiting for an installation to start and then killing the VM. OPSI would still show the product as "Installing" with an action request of "Setup". Start up the VM and it realizes that it needs to do the setup and starts over again, and I killed it in a different location a second time. Started it up the third time and let it install. Everything worked fine and it was quite resilient toward that kind of behavior. Tomorrow I'll try something similar and simulate a network outage to see how it reacts.
Corey, I also saw a reply to your post on the OPSI forum that talked about post-install verification. You can do all sorts of tests and call isFatalError if there's a problem. Given your testing and the above seems that this is going to be a non-issue after all.
Ah thanks for pointing that out. I thought that I was subscribed to the thread but I was wrong. I'm looking at those tests tonight. Testing for disk space ahead of time will be a must for packages like the SDK, which don't seem to check for themselves before the installation. Another result of carpet-pulling tests: I pulled the network on test-opsi in the middle of a product install. This caused the install to hang (understandably, as the source package is being run from a network share). The install timed out some ~10 minutes later and OPSI let the machine continue to the login process. The OPSI client showed the package as still "Installing" and the package was not installed successfully on the system. Upon reboot it tried again and was successful. So, I think the moral of the testing here is that if you set a group of systems to install something and all of them come back later as installed with one of them still "installing" for an unusual period of time, there may be something wrong. Now, these theories and methods could possibly be preempted by the testing talked about by the OPSI guys.
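The "still installing for an unusual period" rule of thumb could be checked mechanically. The sketch below assumes we can somehow obtain a (hostname, status, since) tuple per client; how that data would be pulled out of OPSI is not covered here, and the one-hour limit is an arbitrary placeholder.

```python
from datetime import datetime, timedelta


def find_stuck(clients, now, limit=timedelta(hours=1)):
    """Return hostnames that have been in the "installing" state too long.

    `clients` is an iterable of (hostname, status, since) tuples, where
    `since` is the datetime at which the status was last set.
    """
    return sorted(
        host
        for host, status, since in clients
        if status.lower() == "installing" and now - since > limit
    )


if __name__ == "__main__":
    now = datetime(2009, 3, 20, 12, 0)
    # Hypothetical sample data.
    sample = [
        ("slave01", "installed", now - timedelta(hours=5)),
        ("slave02", "installing", now - timedelta(hours=3)),
        ("slave03", "installing", now - timedelta(minutes=10)),
    ]
    print(find_stuck(sample, now))
```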
On the topic of using the software audit portion of OPSI: One of the downfalls of OPSI is that they have a product (software package) database and deployment, and a software inventory feature that is separate. That separation is a bit of a bummer. We could still make use of the software audit feature since it can be triggered by OPSI at every boot up, and the data is collected and stored centrally on the OPSI server. The data is stored in flat files, one file per OPSI client. It would not be difficult at all to write a script to parse these files looking for specific "must have" installations. If a host is missing something, then report it. Unfortunately the data isn't hosted in a relational database, but the scale at which we are talking here shouldn't require that.

With that said... the software audit package on the reference image hasn't run since March 4th. There is a missing python module and I'm stewing on the best avenue to take for the reference image. OPSI has a python package with the modules installed that it likes to deploy as a prerequisite to the swaudit package, and the reference image already includes python. As a comparison, the swaudit is still operating fine on the bare image. So, I'm still looking into that but wanted to put a note out for discussion with regard to using this data or not. I'm pasting a sample from one of these files below:

[{EDDF99D9-9FE3-4871-A7DB-D1522C51EE9A}]
displayversion = 2.0.7045
binaryname =
displayname = Microsoft .NET Compact Framework 2.0 SP2
installsize = 97743872
uninstallstring =
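A script for the "must have" check might look something like the sketch below. It assumes every audit file is INI-like, exactly as in the sample above; real files may have quirks this does not handle. The must-have list and the substring matching are illustrative choices.

```python
import configparser

# The sample entry pasted in this bug, used here as test input.
SAMPLE = """\
[{EDDF99D9-9FE3-4871-A7DB-D1522C51EE9A}]
displayversion = 2.0.7045
binaryname =
displayname = Microsoft .NET Compact Framework 2.0 SP2
installsize = 97743872
uninstallstring =
"""


def installed_names(audit_text):
    """Return the set of displayname values found in one client's audit data."""
    # interpolation=None keeps literal '%' characters from raising errors;
    # strict=False tolerates duplicate sections or keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(audit_text)
    return {parser[s].get("displayname", "").strip() for s in parser.sections()}


def missing(audit_text, must_have):
    """Return the must-have entries with no case-insensitive substring match."""
    names = [n.lower() for n in installed_names(audit_text)]
    return sorted(m for m in must_have if not any(m.lower() in n for n in names))


if __name__ == "__main__":
    print(missing(SAMPLE, [".NET Compact Framework", "ActiveSync"]))
```

Running this per client file and reporting any non-empty result would give the "if a host is missing something, then report it" behaviour without needing a database.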
Copy/paste from e-mail from Corey about properly detecting installation failure:

The tests are two-fold: check for a file that should exist when the package is installed, and check for a registry key (in the Uninstall registry). These are our canaries. I've used and tested this recipe in the windows mobile sdk product, the .net cf 2 product, and the activesync product. All have been repacked. When it fails, the OPSI client shows "Failed". And this does work very well: I typo'd a registry key and it showed up as failed. Only change necessary is in Aktionen. I've tried to make this generic so it could be reused and just change the variables:

[Aktionen]
DefVar $RegKeyCheck$
DefVar $RegKeyExpectedValue$
DefVar $ExpectedFile$

Winbatch_product_silent_install

; Registry key and files to check for a successful install
set $RegKeyCheck$ = GetRegistryStringValue("[HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall\{56DB0BD0-E3EB-49B4-A312-97CF88BE12CE}] DisplayVersion")
set $RegKeyExpectedValue$ = "6.0.0.17740"
set $ExpectedFile$ = "C:\Program Files\Windows Mobile 6 SDK\Managed Libraries\Microsoft.WindowsMobile.dll"

if not($RegKeyCheck$ = $RegKeyExpectedValue$)
  logError "Fatal: Windows Mobile SDK install must have failed, the registry key is missing, returned "+$RegKeyCheck$+"."
  isFatalError
endif
if not(FileExists($ExpectedFile$))
  logError "Fatal: After Installation "+$ExpectedFile$+" not found"
  isFatalError
endif
I'm going to work on converting the example packages on https://wiki.mozilla.org/ReleaseEngineering:OPSI to do installation failure detection, and then we're rolling this out in staging. I'll use this as a tracking bug and file any necessary dependents.
Depends on: 484980
Depends on: 484981
Alright, after some bumps and bruises we've got this deployed in our staging environment. One big question we haven't answered yet is "where do we keep package sources?". We can't keep them 100% in the public because we'll need to store ISOs, installers, and possibly other things subject to legal issues alongside. One option is to keep things entirely in the CVS /mofo repository - which is not public. Another is to keep the installation scripts and metadata in a public repository and store only the packages in mofo. This could get complicated when it comes to combining the two, though. Ignoring that complication I feel like this is the best option - it keeps the most we can in the public (and probably in Mercurial). We could also do the above but keep the packages only on the opsi server. This would probably simplify things but we run the risk of losing the packages if we reinstall. I'm going to leave this bug open until we solve that issue. Over the next few months we should use OPSI to deploy all new packages in staging so we can get a better feel for it and decide if we want to use it in production.
The tracker bug for the linux setup is Bug 486614 and the tracker bug for the mac osx setup is Bug 486615
(In reply to comment #13)
> Another is to keep the installation scripts and metadata in a public
> repository and store only the packages in mofo. This could get complicated when
> it comes to combining the two, though. Ignoring that complication I feel like
> this is the best option - it keeps the most we can in the public (and probably
> in Mercurial).

I've got a system based on this idea running right now. With the help of a couple scripts it seems to be working quite well. Those, and a sample package, are located in: http://hg.mozilla.org/users/bhearsum_mozilla.com/opsi-package-sources/. If we go with this we'll have to create an opsi-binaries module in the mofo repository, to house the binaries. The workflow for updating a package would be like this:

# root@opsi server
cd ~/opsi-binaries
cvs up
cd ~/opsi-package-sources
hg pull && hg up
./sync-binaries
./regenerate-package $somepackage
There is a slight concern with this, some packages you won't want to be public. The visual studio package is one example, as it contains a transform file that would reveal the license for that package. Granted, this is a one-off, but as this becomes the standard it would be easy for someone to forget and let something like that slip into the public repo. This is unless I'm missing the obvious and the mofo/binaries repository is private, in which case you could stick the transform there with no problem and the package source remains clean. I'm not sure of the mofo repository details (been years since I've dealt with it)
(In reply to comment #17)
> This is unless I'm missing the obvious and the mofo/binaries repository is
> private, in which case you could stick the transform there with no problem and
> the package source remains clean. I'm not sure of the mofo repository details
> (been years since I've dealt with it)

Ah, sorry. Yes. Oddly enough, the 'mofo' repository is a private CVS repository.
Attached patch opsi packaging helper scripts (deleted) — Splinter Review
There are large header comments in both of these scripts with finer details, but basically... sync-binaries is meant to be run on the OPSI server whenever packages are updated, to sync binaries into the opsi-package-sources clone. Then, regenerate-package can be used to create the .opsi package and register it with the OPSI server.
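For readers without access to the attachment, the general idea behind a "sync binaries into the clone" step can be illustrated as below. This is not the attached sync-binaries script; it is a generic sketch that assumes the binaries checkout mirrors the directory layout of the package sources clone, which may not match the real layout.

```python
import shutil
from pathlib import Path


def sync_tree(binaries_root, sources_root):
    """Copy every file from the binaries checkout into the sources clone.

    Relative paths are preserved and CVS bookkeeping directories are skipped.
    Returns the relative paths that were copied, as POSIX-style strings.
    """
    binaries_root = Path(binaries_root)
    sources_root = Path(sources_root)
    copied = []
    for src in sorted(binaries_root.rglob("*")):
        rel = src.relative_to(binaries_root)
        if src.is_dir() or "CVS" in rel.parts:
            continue
        dest = sources_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(rel.as_posix())
    return copied
```

Keeping the copy one-way (binaries into sources, never the reverse) is what keeps the private files out of the public repository, provided the copied paths are also ignored by Mercurial.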
Attachment #371472 - Flags: review?(cshields)
Attachment #371472 - Flags: review?(catlee)
Attachment #371472 - Flags: review?(catlee) → review+
Comment on attachment 371472 [details] [diff] [review] opsi packaging helper scripts Looks good to me. Not 100% sure what it's doing though :)
Comment on attachment 371472 [details] [diff] [review] opsi packaging helper scripts Tested the regen script, good thinking. Worked well. Sync script looks good too, pretty much self explanatory. This does leave the md5 files behind, but with this all being kept on the same server I don't see a use for the md5's anyway. They would only be necessary when distributing the .opsi files which we aren't doing. Cheers!
Attachment #371472 - Flags: review?(cshields) → review+
(In reply to comment #21)
> This does leave the md5 files behind, but with this all being kept on the same
> server I don't see a use for the md5's anyway. They would only be necessary
> when distributing the .opsi files which we aren't doing.
>
> Cheers!

Good point. I didn't mean to leave these behind, actually. For completeness, I'm going to update the script to copy them in, too.
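If the .opsi files ever are distributed, the md5 files could be used to verify them on the receiving end. A minimal sketch of computing the digest follows; how the .md5 file itself is laid out is not covered in this bug, so comparing against it is left out.

```python
import hashlib


def md5_hex(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```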
Depends on: 487410
Depends on: 487412
Attachment #371665 - Flags: review?(catlee) → review+
Comment on attachment 371472 [details] [diff] [review] opsi packaging helper scripts Pushed to the new repository, http://hg.mozilla.org/build/opsi-package-sources/
Attachment #371472 - Flags: checked‑in+
Comment on attachment 371665 [details] [diff] [review] use basename + copy the md5 file, in regen-packages Pushed to the new repository, http://hg.mozilla.org/build/opsi-package-sources/
Attachment #371665 - Flags: checked‑in+
I imported MozillaBuildSetup-1.3.exe into the new opsi-binaries/ module in mofo. There is now a clone of opsi-package-sources and a checkout of opsi-binaries on staging-opsi in ~cltbld. I ran sync-binaries to test it out, and then 'regenerate-package mozillabuild'. I had to push one bustage fix: regenerate-package used ~/ as the base dir for opsi-package-sources/opsi-packages but I discovered that I had to run that script as root, so I changed it to ~cltbld/ The staging part of this is done now. Production to come later on in the quarter.
Depends on: 491298
OS: Mac OS X → Windows Server 2003
Going to do the rollout of this in the coming weeks, taking this bug.
Depends on: 494213
Assignee: cshields → bhearsum
Priority: P2 → P3
While writing an OPSI Buildbot package I discovered that the regenerate-package script causes all installation state for the specified package to be lost. As it turns out, if we don't delete the existing package before installing the new version we don't have that problem. Seems important!
Attachment #379177 - Flags: review?(ccooper)
Depends on: 494426
Attachment #379177 - Flags: review?(ccooper) → review+
Comment on attachment 379177 [details] [diff] [review] don't remove the existing package in the regeneration script changeset: 6:a9c813d1e502
Attachment #379177 - Flags: checked‑in+
Depends on: 495825
Depends on: 495948
I've set up production-opsi.build.mozilla.org by following the instructions here: https://wiki.mozilla.org/ReferencePlatforms/OPSI_Server. We're ready to deploy this in production at this point - just need to find the right time to do so.
No longer depends on: 495825, 495948
thank you for not warning me about these dependency removals, bugzilla.
Depends on: 495825, 495948
We're all done here now! Big thanks to Corey for all his work on this, and Lukas for helping me deploy it.
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering