Closed Bug 917462 Opened 11 years ago Closed 11 years ago

please adjust nagios alert for gaia_bumper.stamp on buildbot-master66

Categories

(mozilla.org Graveyard :: Server Operations, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mozilla, Assigned: ericz)

References

Details

(Whiteboard: [reit-ops])

Attachments

(1 file)

buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp

Looks like it's warning at ~574 seconds and going critical at ~930 seconds? Could we adjust this to warn at 1200 seconds and go critical at 1800 seconds? It keeps flapping, sending notifications that need no human intervention.
Whiteboard: [reit-ops]
Assignee: infra → server-ops
Component: Infrastructure: Monitoring → Server Operations
Product: Infrastructure & Operations → mozilla.org
QA Contact: jdow → shyam
Assignee: server-ops → eziegenhorn
This is committed in rev 75199. It will take a bit to get pushed out via puppet.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Thank you!
This appeared today:

[19:39] <nagios-releng> Mon 19:39:55 PDT [4316] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 559 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)

Do you know how long it'll take to take effect?
Flags: needinfo?(eziegenhorn)
It's already in effect:

define service{
    use generic-service
    host_name buildbot-master66.srv.releng.usw2.mozilla.com
    service_description File Age - /builds/gaia_bumper/gaia_bumper.stamp
    check_command check_file_age!1200!1800!/builds/gaia_bumper/gaia_bumper.stamp

That's odd you saw it show up :|
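For readers unfamiliar with the `check_command` syntax quoted above: Nagios splits the value on `!`, treats the first field as the command name, and substitutes the remaining fields for the `$ARG1$`, `$ARG2$`, ... macros in the command definition. The sketch below illustrates that substitution only. The `COMMAND_LINE` template is hypothetical and is not the actual definition from checkcommands.pp, which this bug does not show.

```python
def expand_check_command(check_command: str, command_line: str) -> str:
    """Substitute $ARGn$ macros in command_line with the !-separated arguments."""
    # The first field is the command name; the rest are $ARG1$, $ARG2$, ...
    _name, *args = check_command.split("!")
    expanded = command_line
    for index, value in enumerate(args, start=1):
        expanded = expanded.replace(f"$ARG{index}$", value)
    return expanded


# Hypothetical command template, for illustration only.
COMMAND_LINE = (
    "check_nrpe -H $HOSTADDRESS$ -c check_file_age "
    "-a '-w $ARG1$ -c $ARG2$ -f $ARG3$'"
)

SERVICE_CHECK = "check_file_age!1200!1800!/builds/gaia_bumper/gaia_bumper.stamp"

if __name__ == "__main__":
    print(expand_check_command(SERVICE_CHECK, COMMAND_LINE))
```

If the template on the monitored host expects the arguments in a different order or shape than the service definition supplies, the thresholds that reach the plugin will not be the ones written in the service definition.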
:aki Yeah that has been in effect for two weeks, I have no idea how you saw this alert. I double-checked nagios1.private.releng.scl3 and it has the correct, current config values. I looked in the logs there and the last time this alert shows up was the end of June. What channel did you see this alert in?
Flags: needinfo?(eziegenhorn)
This was in #buildduty. Is nagios-releng controlled by this service?
Ok, so :ashish had some great insights into why this isn't working right (the host isn't puppetized, and there is a bad interaction with a weirdly-defined check), and he also got me access to the box, which will be a great help. I believe the critical threshold is working now. I am still working on the warning threshold, which still seems broken.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Ran a number of tests and this appears to be working reliably now. Let me know if it false-alarms any longer.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
From today:

[21:01] <nagios-releng> Sun 21:01:47 PDT [4742] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 501 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch: checkcommands.pp patch (deleted)
Ok, what it was actually alerting about is that the file is 0 bytes. For whatever reason, we specify that it should be at least 1 byte in size. This is not the behavior we want for this box, but I'm not sure about others. Additionally, we pass the -m flag to check_file_age, and it doesn't take a -m flag. Perhaps that's from an older version. Since fixing this affects other hosts, I'm going to ask Ashish to review the patch before I put it in.
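To make the diagnosis above concrete, here is a minimal model of a file-age check that also enforces a minimum size. It is a simplified sketch, not the plugin's source. The function name and parameter names are made up for illustration, and it assumes that a size threshold of 0 disables the size test.

```python
def file_age_status(age_s: int, size_b: int,
                    warn_age: int, crit_age: int,
                    warn_size: int = 0, crit_size: int = 0) -> str:
    """Classify a file by age (seconds) and size (bytes).

    Assumption: a size threshold of 0 means "do not check size".
    """
    if age_s > crit_age or (crit_size and size_b < crit_size):
        return "CRITICAL"
    if age_s > warn_age or (warn_size and size_b < warn_size):
        return "WARNING"
    return "OK"


if __name__ == "__main__":
    # A 559-second-old, 0-byte stamp file with the requested 1200/1800 thresholds:
    print(file_age_status(559, 0, 1200, 1800, warn_size=1))  # size test enabled
    print(file_age_status(559, 0, 1200, 1800, warn_size=0))  # size test disabled
```

Under this model, an empty stamp file warns at any age while a 1-byte minimum is in force, which is consistent with the "559 seconds old and 0 bytes" warning seen despite the 1200-second age threshold.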
Attachment #824183 - Flags: review?(ashish)
Comment on attachment 824183: checkcommands.pp patch

Review of attachment 824183:

The choice of 1 byte is quite likely historical. I can't imagine this change breaking anything, but keep an eye out after it's pushed out. The -m flag applies to an earlier version of the bundled plugin but doesn't work unless used along with a different NRPE plugin, check_file_age2...
Attachment #824183 - Flags: review?(ashish) → review+
Patch committed in r77265. Will watch #buildduty for a few days.
13:42 nagios-releng: Wed 13:42:07 PDT [4782] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 527 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)

13:58 nagios-releng: Wed 13:58:07 PDT [4783] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 1487 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)

13:00 nagios-releng: Wed 14:00:08 PDT [4784] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is OK: FILE_AGE OK: /builds/gaia_bumper/gaia_bumper.stamp is 15 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
[12:55] <nagios-releng> Fri 12:55:48 PDT [4852] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 591 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
[05:09] <nagios-releng> [#buildduty] Tue 05:09:49 PST [4153] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is WARNING: FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 554 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)

[05:19] <nagios-releng> [#buildduty] Tue 05:19:50 PST [4159] buildbot-master66.srv.releng.usw2.mozilla.com:File Age - /builds/gaia_bumper/gaia_bumper.stamp is OK: FILE_AGE OK: /builds/gaia_bumper/gaia_bumper.stamp is 61 seconds old and 0 bytes (http://m.allizom.org/File+Age+-+/builds/gaia_bumper/gaia_bumper.stamp)
Yeah something is still busted. To wit:

-sh-4.1$ /usr/lib64/nagios/plugins/check_nrpe -H buildbot-master66.srv.releng.usw2.mozilla.com -t 15 -c check_file_age -a "-w 677 -c 1500 -W 0 -C 0 -m -f /builds/gaia_bumper/gaia_bumper.stamp"
FILE_AGE WARNING: /builds/gaia_bumper/gaia_bumper.stamp is 261 seconds old and 0 bytes

That shouldn't warn with those parameters. Once I regain access to the box I'll troubleshoot more.
Depends on: 937888
We were bumping up against differences in /etc/nagios/nrpe.cfg between releng hosts and infra hosts. The two sets of hosts defined the arguments of the check_file_age check differently, so the first argument specified for releng hosts (only buildbot-master66 uses the check) was garbled and ignored. The check therefore fell back to the default warning age of 240 seconds. Dustin just landed a patch to nrpe.cfg for releng hosts to make it match infra hosts. I will watch it for a few more days.
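The fallback described above also accounts for the earlier test run, where a 261-second-old file warned even though -w 677 was requested. The sketch below models only that fallback. How the real plugin handled the garbled argument is not shown in this bug, so `effective_warn_age` and the sample garbled value are hypothetical. The 240-second default is the figure cited in the comment above.

```python
from typing import Optional

DEFAULT_WARN_AGE = 240  # default warning age cited in this bug, in seconds


def effective_warn_age(raw: Optional[str]) -> int:
    """Return the warning age actually applied.

    Hypothetical model: an argument that is missing or does not parse
    as a positive integer is ignored and the default is used instead.
    """
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return DEFAULT_WARN_AGE
    return value if value > 0 else DEFAULT_WARN_AGE


def warns(age_s: int, raw_warn: Optional[str]) -> bool:
    """True when the file is older than the effective warning age."""
    return age_s > effective_warn_age(raw_warn)


if __name__ == "__main__":
    print(warns(261, "677"))        # threshold delivered intact
    print(warns(261, "-w 677 -c"))  # hypothetical garbled value, default applies
```

With the threshold intact, 261 seconds is well under 677 and nothing fires. With the argument lost, the same file exceeds the 240-second default and warns.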
This has alerted in the last three days, but they were all valid alerts. Hesitantly going to close this again. Thanks for your patience.
Status: REOPENED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard