Closed Bug 412816 Opened 17 years ago Closed 17 years ago

add nrpe daemon to production grade tinderboxen and other important build machines

Categories

(Release Engineering :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: bhearsum)

Attachments

(1 file)

Since we're moving to automation for nightly/depend builds we don't need to worry about fx-win32-tbox, fx-linux-tbox, and bm-xserve08. I _think_ the following list is all the ones that it should be added to (ref platforms are excluded from this, see bug #412443):

fxdbug-linux-tbox
l10n-linux-tbox
staging-try1-linux-slave
moz2-win32-slave1
staging-try1-win32-slave
staging-try-master
try-master
l10n-win32-tbox
egg
try1-linux-slave
try1-win32newref-slave
fxdbug-win32-tbox
xr-win32-tbox
karma
fxexp-win32-tbox
moz2-linux-slave1
tbnewref-win32-tbox
xr-linux-tbox
tb-linux-tbox
bm-xserve01
bm-xserve02
bm-xserve07
bm-xserve11
bm-xserve12
bm-xserve15
bl-bldlnx01
bl-bldlnx03
bl-bldxp01
balsa-18branch
argo-vm
crazyhorse
patrocles
pacifica-vm
prometheus-vm

I *think* that's all of them. If someone could verify this that'd be great.
Assignee: nobody → bhearsum
Priority: -- → P2
Priority: P2 → P3
These look good for a start, and we should get this going. Do you need help with it? I saw that you added a win32 setup doc ( http://wiki.mozilla.org/Build:Nagios:Win32 ). Are there equivalents for Linux and OSX? Also, can we get the configs checked in if they are not already? They'd be a good starting point if nothing else; I understand that different machines have different partitions, thresholds, etc. I'll cross-check the list against the Build:Farm page, and try to dig up any machines that are not listed on Build:Farm while I'm at it.
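The cross-check against Build:Farm mentioned above amounts to a two-way set comparison. A minimal sketch, assuming both lists are available as plain hostnames: the `cross_check` helper and the sample inventory are hypothetical, and only the first four hostnames come from this bug.

```python
def cross_check(proposed, inventory):
    """Compare two hostname lists.

    Returns a pair of sorted lists:
    (hosts proposed but absent from the inventory,
     hosts in the inventory but absent from the proposal).
    """
    def normalise(names):
        # Ignore blank entries, surrounding whitespace, and letter case.
        return {n.strip().lower() for n in names if n.strip()}

    proposed_set = normalise(proposed)
    inventory_set = normalise(inventory)
    return (sorted(proposed_set - inventory_set),
            sorted(inventory_set - proposed_set))


if __name__ == "__main__":
    # Hostnames taken from the list in this bug.
    proposed = ["fxdbug-linux-tbox", "egg", "karma", "crazyhorse"]
    # Made-up inventory standing in for the Build:Farm page.
    inventory = ["fxdbug-linux-tbox", "egg", "example-unlisted-box"]

    not_in_inventory, not_in_proposal = cross_check(proposed, inventory)
    print("Proposed but not in inventory:", not_in_inventory)
    print("In inventory but not proposed:", not_in_proposal)
```

Normalising case and whitespace first keeps hand-copied wiki entries from showing up as false mismatches.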
I'm pretty caught up in other things right now, so anything you want to do would be appreciated :-).
These are basically the same as the configs being used on the release automation machines right now. I've added some header comments to them. They should "just work" when checked out but Mac/Linux do need to be adjusted for the number of Buildbot processes running. I propose checking them into mozilla/tools/nagios. If we want to start pushing things to hg we could create hg.mozilla.org/build/software-configs or misc-configs (or something) and put them there.
Attachment #306023 - Flags: review?(rhelmer)
Attachment #306023 - Flags: review?(rhelmer) → review+
Comment on attachment 306023 [checked in] basic nagios configs

RCS file: /cvsroot/mozilla/tools/nagios/NSC.ini,v
done
Checking in NSC.ini;
/cvsroot/mozilla/tools/nagios/NSC.ini,v <-- NSC.ini
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/tools/nagios/nrpe-linux.cfg,v
done
Checking in nrpe-linux.cfg;
/cvsroot/mozilla/tools/nagios/nrpe-linux.cfg,v <-- nrpe-linux.cfg
initial revision: 1.1
done
RCS file: /cvsroot/mozilla/tools/nagios/nrpe-mac.cfg,v
done
Checking in nrpe-mac.cfg;
/cvsroot/mozilla/tools/nagios/nrpe-mac.cfg,v <-- nrpe-mac.cfg
initial revision: 1.1
done
Attachment #306023 - Attachment description: basic nagios configs → [checked in] basic nagios configs
Alright, so far I've got the nrpe daemon running on the following machines:

fxdbug-linux-tbox
l10n-linux-tbox
argo
fx-linux-tbox
tb-linux-tbox
try1-linux-slave
try2-linux-slave
try-master
staging-master
moz2-linux-slave1
xr-linux-tbox
staging-try-master
staging-try1-linux-slave
prometheus-vm
cerberus-vm
fxdbug-win32-tbox
fx-win32-tbox
moz2-win32-slave1
l10n-win32-tbox
tbnewref-win32-tbox

I think most of the Macs were already running it but I updated the nrpe.cfg on the following (to add disk, load, user, and processes monitors):

bm-xserve01
bm-xserve02
bm-xserve04
bm-xserve07
bm-xserve08
bm-xserve11
bm-xserve12
bm-xserve15

I've got additional Windows boxes to put it on tomorrow, and once all of the new Macs come up I'll be checking their nrpe.cfg to make sure it is in line with the rest.
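For readers unfamiliar with nrpe.cfg, the disk, load, user, and process monitors are each a `command[...]` line that points at a Nagios plugin. The sketch below generates such lines and is not a copy of the checked-in configs: the plugin directory, every threshold, and the `check_buildbot` command name are assumptions for illustration. The process check takes the expected number of Buildbot processes as a parameter, since that is the value the earlier comment says needs adjusting per machine.

```python
# Assumed plugin location; it differs between distributions and on Mac.
PLUGIN_DIR = "/usr/lib/nagios/plugins"


def nrpe_command(name, plugin, args):
    """Format one nrpe.cfg command definition line."""
    return f"command[{name}]={PLUGIN_DIR}/{plugin} {args}"


def build_checks(buildbot_procs, disk_path="/"):
    """Return nrpe.cfg lines for disk, load, user, and process checks.

    buildbot_procs is the number of Buildbot processes expected on the
    machine. All thresholds are illustrative, not the values used in
    production.
    """
    if buildbot_procs < 1:
        raise ValueError("expected at least one Buildbot process")
    n = buildbot_procs
    return [
        # Warn below 20% free space, go critical below 10%.
        nrpe_command("check_disk", "check_disk",
                     f"-w 20% -c 10% -p {disk_path}"),
        # Thresholds for the 1, 5, and 15 minute load averages.
        nrpe_command("check_load", "check_load", "-w 15,10,5 -c 30,25,20"),
        # Warn at 5 logged-in users, go critical at 10.
        nrpe_command("check_users", "check_users", "-w 5 -c 10"),
        # Critical unless exactly n processes have "buildbot" in their args.
        nrpe_command("check_buildbot", "check_procs",
                     f"-c {n}:{n} -a buildbot"),
    ]


if __name__ == "__main__":
    for line in build_checks(buildbot_procs=2):
        print(line)
```

Generating the lines from one function keeps the per-machine difference down to a single argument, which makes it easier to keep a fleet of nrpe.cfg files in line with each other.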
I managed to get the nrpe daemon going on balsa-18branch today (Red Hat 7.2). crazyhorse and the Red Hat 8.0 machines have been troublesome; I'm still working on those. I also finished up with the Windows machines. In addition to the ones listed in comment #5, the nrpe daemon is running on:

try1-win32-slave
try2-win32-slave
xr-win32-tbox
staging-try1-win32-slave
pacifica-vm

bm-xserve16 is also ready to go. I'm about to file an IT bug to get monitoring on these going.
Ben, since IT added the checks in bug 430246, it looks like all the MSYS machines are failing the PING test. Is it some sort of firewall or MSYS issue on the ref platform?
Depends on: 420346
(In reply to comment #7)
> Ben, since IT added the checks in bug 430246, it looks like all the MSYS
> machines are failing the PING test. Some sort of firewall or MSYS issue on the
> ref platform ?

Yep, those VMs were blocking ICMP. I enabled it on them, and on the reference VM itself.
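A failing PING check alongside a working NRPE check is the usual signature of blocked ICMP rather than a dead host. A minimal sketch of that triage follows: the `diagnose` and `tcp_reachable` helpers are hypothetical, and it assumes the daemon listens on NRPE's default port, 5666.

```python
import socket

NRPE_PORT = 5666  # NRPE's default port; assumed unchanged on these hosts


def tcp_reachable(host, port=NRPE_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False


def diagnose(ping_ok, nrpe_ok):
    """Classify a host from its ping result and NRPE port reachability."""
    if ping_ok and nrpe_ok:
        return "ok"
    if nrpe_ok:
        return "icmp blocked: host is up but does not answer ping"
    if ping_ok:
        return "nrpe unreachable: daemon down or port filtered"
    return "host unreachable"


if __name__ == "__main__":
    # Example with the ping result supplied by hand, as Nagios reported it.
    print(diagnose(ping_ok=False, nrpe_ok=True))
```

Keeping `diagnose` free of network calls means the classification can be applied to results already reported by Nagios, with `tcp_reachable` used only when a fresh probe is wanted.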
Alright, I've got all this monitoring stuff sorted out now. avg_load notifications have been disabled for Windows machines. We should figure out a good way to monitor load on them at some point in the future.
Hit enter too soon. egg, karma, and crazyhorse are not being monitored. I can't find working packages for karma and egg (Red Hat 8), and rpm just hangs on crazyhorse. I bet we could get crazyhorse working after a reboot, but I'm not sure about the other two.
Does anyone feel strongly that egg, karma, and crazyhorse should be monitored? I'm inclined to just let them be.
I'm going to take the lack of response as silent agreement.
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → FIXED
Product: mozilla.org → Release Engineering