Closed Bug 697627 Opened 13 years ago Closed 11 years ago

nagios checks for pending builds

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: bhearsum, Assigned: zeller)

References

Details

(Whiteboard: [nagios])

Attachments

(3 files, 13 obsolete files)

nagios check script 13 years ago John Hopkins (:jhopkins) (deleted), text/plain
bug697627-buildapi.patch 11 years ago John Zeller [:zeller] (deleted), patch
bug697627.patch 11 years ago John Zeller [:zeller] (deleted), patch
bug697627.patch 11 years ago John Zeller [:zeller] (deleted), patch
bug697627-unittests.patch 11 years ago John Zeller [:zeller] (deleted), patch
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/plain	hwine : feedback-
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/x-python
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/x-python
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/x-python
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/x-python
test_check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/plain	hwine : review+
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/plain	hwine : review+
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/plain
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/plain	hwine : review+
test_check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/plain	hwine : review+ hwine : checked-in+
check_pending_builds.py 11 years ago John Zeller [:zeller] (deleted), text/plain	hwine : review+ hwine : checked-in+

bhearsum@mozilla.com (:bhearsum)

Reporter

Description

13 years ago

Bug 697374 tracked a problem where we had thousands of pending jobs (because 32-bit Linux test slaves weren't accepting new ones) and didn't notice for a while. This could have been caught sooner if we had some Nagios checks in place. A couple of ideas that were floated:

* A check that fails when we have over X pending jobs (to account for some unavoidable load). This might legitimately fail sometimes, but we could downtime or ack it in those situations. X should probably be around 500.
* A check that fails when there are more than X pending and fewer than Y running. I'm not sure what X and Y should be here.

Regardless of what we do, we probably need something on buildapi that returns the number of pending and running jobs in a simple form, so that it's easily understandable by whatever script is doing the checks.
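For illustration, the two ideas above could be sketched roughly as follows. This is only a sketch: the function names are made up for this example, the default of 500 comes from the suggestion above, and the thresholds for the second check are left as required parameters because no values have been proposed. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_pending(pending, max_pending=500):
    """Idea 1: fail once the pending count exceeds max_pending.

    The default of 500 is the value suggested in this bug; it is a
    starting point, not a tuned threshold.
    """
    if pending > max_pending:
        return CRITICAL, "CRITICAL - %d pending jobs (limit %d)" % (
            pending, max_pending)
    return OK, "OK - %d pending jobs" % pending


def check_pending_vs_running(pending, running, max_pending, min_running):
    """Idea 2: fail only when the queue is long AND few jobs are running.

    max_pending (X) and min_running (Y) have no defaults because no
    values have been agreed on.
    """
    if pending > max_pending and running < min_running:
        return CRITICAL, "CRITICAL - %d pending, only %d running" % (
            pending, running)
    return OK, "OK - %d pending, %d running" % (pending, running)
```

The second check stays quiet during ordinary heavy load, when many jobs are pending but plenty are also running, which is the case the first check would need a downtime or ack for.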

John Hopkins (:jhopkins)

Comment 1

13 years ago

Attached file nagios check script (deleted) — Details

We use the attached Perl script to check the Thunderbird buildbot build queue. It scrapes the /one_box_per_builder page.

Chris Cooper [:coop] (he/him)

Updated

13 years ago

Priority: -- → P3

Whiteboard: [nagios]

Hal Wine [:hwine] (use NI)

Updated

11 years ago

Blocks: re-nagios

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 2

11 years ago

Found in triage. Sounds like this is actually two items:

1) buildapi work to make the length of the pending builds queue measurable from the command line
2) nagios work to call buildapi and then alert us when length > some-threshold-to-be-determined

Moving to Tools for now. needinfo to hwine and zeller to see if this seems accurate and, if so, what they think the next steps are.
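To make the split concrete, here is a rough sketch of how the two pieces might fit together. Everything in it is an assumption for illustration: the JSON shape (a "pending" mapping of branch name to a list of jobs) is hypothetical and is not taken from the real buildapi output, the function names are made up, and the thresholds are left as command-line options since none have been decided.

```python
import argparse
import json

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_pending(payload_text):
    """Item 1: reduce a JSON payload to a single queue length.

    Assumed (hypothetical) shape:
        {"pending": {"<branch>": [job, ...], ...}}
    """
    data = json.loads(payload_text)
    return sum(len(jobs) for jobs in data["pending"].values())


def nagios_status(pending, warn, crit):
    """Item 2: map the queue length onto a Nagios exit code and status line.

    Alerts when the length is strictly greater than a threshold.
    """
    if pending > crit:
        return CRITICAL, "CRITICAL - %d pending builds" % pending
    if pending > warn:
        return WARNING, "WARNING - %d pending builds" % pending
    return OK, "OK - %d pending builds" % pending


def main(argv, payload_text):
    """Parse thresholds, evaluate the payload, print the status line."""
    parser = argparse.ArgumentParser(
        description="Check the length of the pending builds queue")
    parser.add_argument("-w", "--warning", type=int, required=True)
    parser.add_argument("-c", "--critical", type=int, required=True)
    args = parser.parse_args(argv)
    try:
        pending = count_pending(payload_text)
    except (ValueError, KeyError, AttributeError, TypeError) as exc:
        # Unparseable or unexpected payloads map to UNKNOWN, not OK.
        print("UNKNOWN - could not read pending builds: %s" % exc)
        return UNKNOWN
    code, message = nagios_status(pending, args.warning, args.critical)
    print(message)
    return code
```

A wrapper script would fetch the payload from buildapi, pass it in along with `sys.argv[1:]`, and hand the return value to `sys.exit()` so Nagios sees the right state. Keeping the fetch separate from the evaluation means the threshold logic can be unit tested without a live server.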

Component: Release Engineering → Release Engineering: Developer Tools

Flags: needinfo?(jozeller)

Flags: needinfo?(hwine)

QA Contact: hwine