Closed
Bug 1373938
Opened 7 years ago
Closed 6 years ago
Submit worker type pending count data to statsum and/or ActiveData
Categories
(Taskcluster :: Services, enhancement, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: garndt, Assigned: jhford)
Details
Attachments
(1 file)
It appears that release engineering has graphs in grafana to show historical data about pending counts.
The Taskcluster team has graphs of pending wait times in signalfx, but those might not be the numbers that are helpful to releng, and they are protected behind a login.
I'm not sure what mechanism acquires that data from buildbot and sends it to graphite, but perhaps we can adapt it to periodically look at the pending counts for the worker types that releng is concerned with and stuff those in there too. I'm happy to help in any way that I can.
Comment 1•7 years ago
We're currently using hostedgraphite for this kind of metric. e.g.
https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/grafana/dashboard/db/pending
I believe the metric name for this would be $prefix.releng.pending.$poolname
Does signalfx have no public access for dashboards?
Priority: -- → P2
Reporter | ||
Comment 2•7 years ago
(In reply to Chris AtLee [:catlee] from comment #1)
> We're currently using hostedgraphite for this kind of metric. e.g.
>
> https://www.hostedgraphite.com/da5c920d/86a8384e-d9cf-4208-989b-9538a1a53e4b/
> grafana/dashboard/db/pending
>
> I believe the metric name for this would be
> $prefix.releng.pending.$poolname
I came across this dashboard, but could not find out exactly what is stuffing the data into graphite. I think this is for AWS [1] but I didn't find where we're monitoring the same for physical machines. I think whatever is doing the polling for pending can just be adjusted to also do it for worker types that releng is concerned with. We provide an endpoint that can report pending counts for a given provisionerId/workerType [2].
>
> Does signalfx have no public access for dashboards?
They do not. I'm reaching out again to see where it's at on their roadmap. However, we are currently not tracking pending counts in signalfx. We track pending wait times. Not that it would be impossible to track pending counts, but we have found that wait times are much more useful for knowing when something is going wrong and people are waiting too long for results.
This is the dashboard, btw: https://app.signalfx.com/#/dashboard/Cp7oeIXAYDI . I can get you access if you don't have it already. (Note: osx testers are not on there yet. There is a follow-up bug to add them, but you can build a custom graph showing them.)
[1] https://github.com/mozilla-releng/build-cloud-tools/blob/master/cloudtools/scripts/aws_watch_pending.py
[2] https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#pendingTasks
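A minimal sketch of how a pending count from the endpoint in [2] could be turned into a graphite sample under the metric name discussed above. It assumes the decoded response carries the count in a "pendingTasks" field and that the receiving side accepts Graphite's plaintext line format ("path value timestamp"). The function names are hypothetical, and this is not what aws_watch_pending.py does.

```python
import time
from typing import Any, Mapping, Optional


def pending_count(payload: Mapping[str, Any]) -> int:
    """Extract the pending count from a decoded pendingTasks response.

    Assumption: the response JSON carries the count under "pendingTasks".
    """
    return int(payload["pendingTasks"])


def graphite_line(prefix: str, pool: str, count: int,
                  timestamp: Optional[int] = None) -> str:
    """Format one sample as a Graphite plaintext line: "<path> <value> <timestamp>"."""
    if timestamp is None:
        timestamp = int(time.time())
    # Metric path follows the $prefix.releng.pending.$poolname pattern from comment 1.
    return f"{prefix}.releng.pending.{pool} {count} {timestamp}\n"
```

The worker type name would stand in for the pool name here, so buildbot pools and Taskcluster worker types could share one dashboard.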
Comment 3•7 years ago
Catlee, is RelEng capturing this data?
Dustin, do we have any work planned for recording historical pending counts for all provisioner/workerType combinations somewhere? This feels generic enough that RelEng shouldn't have to build a custom solution.
Flags: needinfo?(dustin)
Flags: needinfo?(catlee)
Comment 4•7 years ago
We record this sort of information in signalfx. If we don't have pending counts, we could certainly add that. It has the issues described above regarding public access.
I just chatted with gps and he mentioned that ActiveData can slurp up this sort of information. We could probably modify tc-lib-monitor to send data to ActiveData in addition to SignalFx (and later only to ActiveData)
Flags: needinfo?(dustin)
Updated•7 years ago
Component: Platform Support → Queue
Product: Release Engineering → Taskcluster
QA Contact: catlee
Summary: Pending counts for gecko-t-osx-1010 are not recorded in graphite/grafana → Submit worker type pending count data to statsum and/or ActiveData
Comment 5•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #3)
> Catlee, is RelEng capturing this data?
No, I don't think we capture any non-buildbot pending counts.
Flags: needinfo?(catlee)
Comment 6•7 years ago
Coop,
Maybe we should consider this for 2018 Q1 or Q2?
Jonas, is this a lot of work to implement? I'm guessing the easiest option is for the queue to publish this data, but it could also be published by an external service that routinely queries the queue for pending count data. It depends on how monolithic we want to make the queue, I guess. ;)
In general, we should probably form some kind of working group that looks at operational concerns and the data we provide, composed of people from TaskCluster, RelEng, our Cloud Operations team(s)?, Build Duty, Sheriffs, and so on, to make sure we have a shared view on what data we want to publish, how we make it available, how we support the services that provide access to the data, etc. Historical capture of pending counts across workerTypes is probably just a starting point. Until now, we've mostly just published data we thought might be interesting, but we haven't really had an operations-driven project plan.
Pete
Flags: needinfo?(jopsen)
Flags: needinfo?(coop)
Comment 7•7 years ago
This would be fairly simple to implement separately from the queue, or as part of the queue. Either way, it would probably take a dedicated background process.
Flags: needinfo?(jopsen)
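A rough sketch of what such a dedicated background process could look like if it lived outside the queue. Everything here is an assumption for illustration: `fetch_pending` stands for whatever call returns the pending count for a provisionerId/workerType pair, and `report` stands for whatever sink ends up being chosen (statsum, ActiveData, graphite).

```python
import time
from typing import Callable, Optional, Sequence, Tuple

WorkerType = Tuple[str, str]  # (provisionerId, workerType)


def poll_once(worker_types: Sequence[WorkerType],
              fetch_pending: Callable[[str, str], int],
              report: Callable[[str, int], None]) -> int:
    """Fetch and report the pending count for each worker type once.

    Returns how many worker types were reported. A failed fetch is skipped
    so that one broken worker type does not block the others.
    """
    reported = 0
    for provisioner_id, worker_type in worker_types:
        try:
            count = fetch_pending(provisioner_id, worker_type)
        except Exception:
            continue
        report(f"{provisioner_id}.{worker_type}", count)
        reported += 1
    return reported


def run(worker_types: Sequence[WorkerType],
        fetch_pending: Callable[[str, str], int],
        report: Callable[[str, int], None],
        interval_seconds: float = 60.0,
        iterations: Optional[int] = None,
        sleep: Callable[[float], None] = time.sleep) -> None:
    """Poll on a fixed interval; iterations=None means run until stopped."""
    done = 0
    while iterations is None or done < iterations:
        poll_once(worker_types, fetch_pending, report)
        done += 1
        if iterations is None or done < iterations:
            sleep(interval_seconds)
```

Keeping the fetch and report steps as plain callables means the same loop could run inside the queue or as a separate service, which is the trade-off raised in comment 6.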
Comment 8•7 years ago
I'm not saying that it's something we should do in Q1/Q2, but if we move the queue to postgres, we'll get the option of running some long-running analytics on it. At least that's the thinking dustin and bstack have.
Comment 9•7 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Maybe we should consider this for 2018 Q1 or Q2?
Yes, I would love to start collecting this data somewhere useful. Let's decide where/how in January.
Flags: needinfo?(coop)
Comment 10•6 years ago
We're interested in dashboards again. This data may end up getting piped to a statuspage.io instance.
Assignee | ||
Comment 11•6 years ago
We're already doing the polling for this in the provisioner. This simple patch adds a monitor.measure call to that polling, so that the pending task count is recorded per worker type.
Attachment #8991826 -
Flags: review?(bstack)
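For illustration only, the shape of that change can be sketched as follows. This is not the attached patch: `RecordingMonitor` is a hypothetical stand-in for a monitoring client with a measure(name, value) method, and the metric key format is an assumption.

```python
from typing import Callable, Dict, List


class RecordingMonitor:
    """Hypothetical stand-in for a monitoring client exposing measure(name, value)."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = {}

    def measure(self, name: str, value: float) -> None:
        # Keep every sample in memory; a real client would forward it to a backend.
        self.samples.setdefault(name, []).append(value)


def check_worker_type(worker_type: str,
                      get_pending: Callable[[str], int],
                      monitor: RecordingMonitor) -> int:
    """One pass of an existing polling step, with a measurement added.

    The pending count is fetched for provisioning decisions anyway, so
    recording it costs one extra call and no extra request.
    """
    pending = get_pending(worker_type)
    # Assumed key format: one metric per worker type.
    monitor.measure(f"pending-tasks.{worker_type}", pending)
    return pending
```

Because the measurement piggybacks on polling that already happens, it adds no extra load on the queue, at the cost of only covering worker types that the polling component already looks at.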
Comment 12•6 years ago
Commit pushed to master at https://github.com/taskcluster/taskcluster-lib-api
https://github.com/taskcluster/taskcluster-lib-api/commit/470df11b81455d9320f35bcce0e2c4be2de1c3e4
Merge pull request #108 from taskcluster/bug1373938
Bug 1437461 - Use taskcluster-lib-artifact-go for uploading/downloading artifacts
Updated•6 years ago
Attachment #8991826 -
Flags: review?(bstack) → review+
Assignee | ||
Comment 13•6 years ago
I've landed this patch, and it will be part of the next deployment.
QA Contact: jhford
Assignee | ||
Updated•6 years ago
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Assignee | ||
Updated•6 years ago
Assignee: nobody → jhford
Updated•6 years ago
Component: Queue → Services