Closed Bug 1149789 Opened 9 years ago Closed 9 years ago

hooks: Build a hooks.taskcluster.net service (proposal)

Categories: Taskcluster :: Services, defect
Platform: x86_64 Linux
Severity: normal
Status: RESOLVED FIXED
People: (Reporter: jonasfj, Assigned: dustin)
Whiteboard: [bb2tc]
Attachments: (1 file)

:pmoore posted bug 1149504, and my thinking around this has been a bit simpler, i.e. only listening for pulse messages. Let's imagine it's called hooks.taskcluster.net, outlined as follows:

## Web API:
  PUT    /v1/hooks/<hook-group>/<hook-id>                   (create hook)
         {
           title:          "...",
           description:    "...",
           owner:          "someone@mozilla.com",
           emailErrors:    true || false,
           bindings: [
             {
               exchange:   "...",
               routingKey: "..."
             },
             ...
           ],
           task: {...}     // task definition
         }
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger           (trigger hook, for testing)
         {...} // trigger-payload
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger/<token>   (trigger hook from webhook w. token)
         {...} // trigger-payload
  GET    /v1/hooks/<hook-group>/<hook-id>/token             (get secret token for .../trigger/<token>)
  PATCH  /v1/hooks/<hook-group>/<hook-id>                   (update hook definition)
  GET    /v1/hooks/<hook-group>/<hook-id>                   (get hook definition)
  GET    /v1/hooks/                                         (list hook groups)
  GET    /v1/hooks/<hook-group>/                            (list hooks in given group)
  where <hook-group> and <hook-id> are identifiers of 1-22 characters, used in scopes etc.


## Background Worker:
Listens for messages on pulse, creating one AMQP queue for each <hook-group>/<hook-id> and binding
it to the list of bindings given. When a pulse message is received it triggers the hook with
trigger-payload as {exchange: '...', routingKey: '...', payload: {/*pulse message payload*/}}
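As a sketch (hypothetical helper names, not the service's actual code), the worker's message handling could keep queue naming and trigger-payload construction as pure functions:

```javascript
// One AMQP queue per hook; a deterministic queue name keeps re-binding
// idempotent if the worker restarts.
function queueNameFor(hookGroup, hookId) {
  return `hooks/${hookGroup}/${hookId}`;
}

// Wrap a raw pulse message into the trigger-payload shape described above:
// {exchange: '...', routingKey: '...', payload: {...}}
function toTriggerPayload(exchange, routingKey, messagePayload) {
  return {
    exchange: exchange,
    routingKey: routingKey,
    payload: messagePayload,
  };
}
```

The actual worker would bind the queue to each of the hook's listed bindings and call the hook's trigger logic with the payload built this way.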


## Triggering a hook:
When a hook is triggered, either through a webhook or by a pulse message, the task definition is
parameterized using the trigger-payload, so keys from the pulse message or webhook can be substituted
into the task definition before it is created.
Using something like https://www.npmjs.com/package/json-parameterization
lets us write strings like "{{2 days | from-now}}" for task.deadline, and
"{{<label> | as-slugid}}" to replace labels with random slugids.
Errors are printed to the logs, and maybe we email the owner, with at most one email per 12 hours.
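The substitution step could be sketched as follows. This is a simplified stand-in, not the actual json-parameterization API, and it only handles plain "{{key}}" substitution, not the "| from-now" / "| as-slugid" modifiers:

```javascript
// Minimal sketch of trigger-payload substitution: walk the task template
// and replace "{{key}}" placeholders in strings with values from the
// trigger-payload; unknown keys are left untouched.
function parameterize(template, params) {
  if (typeof template === 'string') {
    return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match, key) =>
      key in params ? params[key] : match);
  }
  if (Array.isArray(template)) {
    return template.map(item => parameterize(item, params));
  }
  if (template !== null && typeof template === 'object') {
    const result = {};
    for (const key of Object.keys(template)) {
      result[key] = parameterize(template[key], params);
    }
    return result;
  }
  return template;  // numbers, booleans, null pass through unchanged
}
```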


## Administration:
A UI is added to tools.taskcluster.net, and people can be given scopes such as
  hooks:create:releng-hooks/*
so groups of scopes can be delegated to people.

---------------------------------------------------
This will handle two sources of input:
 - pulse messages
 - webhooks that submit JSON (using the secret <token> API end-point)

For all other sources we set up something that publishes to pulse.
IMO we might want to maintain a github-pulse service, which publishes github events to pulse.
Similarly we could do something for the hg pushlog, etc...
Note, maybe having the end-points:
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger           (trigger hook, for testing)
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger/<token>   (trigger hook from webhook w. token)
  GET    /v1/hooks/<hook-group>/<hook-id>/token             (get secret token for .../trigger/<token>)
is a bad idea...

Instead we should add something that publishes a pulse message from a web-hook. Then we can pick up
that pulse message and do something with it.

It's always better to have pulse messages, because others can listen for them and do stuff...
So forcing everything through pulse is probably a good thing.
I really like this proposal Jonas; I think it is very well thought through. I also appreciate that you sketched out an API for clarity.

I have a couple of open questions for some peripheral matters, but I think the design is solid, so these questions are more for the sake of completeness than anything else.

1) What would you like to do for timer-based (cron-like) tasks, such as updating vcs caches, ETL processes etc? Would we have timers generating pulse messages? :/ Maybe a secondary service for handling cron-based jobs might help here?

2) In the case we wish to have tasks downstream of nagios alerts, would you propose we route all nagios alerts through pulse? Might nagios spikes topple pulse, or are we confident pulse could handle it? Maybe there are also high-volume services that could generate a lot of traffic, or at least be intermittently "bursty". I wonder if we can toughen up our pulse implementation?

I'm aware pulse is based on RabbitMQ which in theory can handle a *lot* of traffic, at the moment it just seems a little fragile, and I'm not sure if reliability is likely to considerably improve, especially if we add a lot of traffic. However, maybe these are not significant amounts.

I do favour simplicity, and pulse is the obvious candidate for routing all events through. So this does seem like a very good practice. Maybe we could consider special handling of these two use cases (cron/nagios) if we decide they don't sit well in pulse.

3) Another thought occurred to me. If the publishing of the pulse messages is not controlled by the party generating the downstream TaskCluster task graphs, the routing-key/exchange binding may not be sufficient to determine whether a task needs to fire. There might also need to be filtering on the message body, so that noop tasks are not created and consuming resources (e.g. the task fires up, some condition is not met in the message body, and the task completes).

However, I don't have a solid proposal to solve this. The most obvious solution I see is that the party generating the downstream tasks first publishes messages to their own exchange, filtering from a source AMQP channel, so that exchange/routing key is sufficient. However, this still requires that they write/deploy a service running somewhere to do the consuming/filtering/publishing, in which case they could talk to the scheduler directly rather than needing a pulse interface (since the main motivation of adding a pulse interface was to allow event-based task generation that does not require the deployment of a dedicated service).

Another option might be some JSON processing language that allows an expression to be passed that returns true/false based on evaluating the expression against the AMQP message body. Let me know what your thoughts are on this too!
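The last option above, a true/false expression evaluated against the message body, could be sketched minimally as a list of path/value conditions (a hypothetical shape, purely for illustration):

```javascript
// Hypothetical message-body filter: each condition names a dotted path into
// the pulse message payload and the value it must equal; the hook only
// fires when all conditions hold.
function lookup(obj, dottedPath) {
  return dottedPath.split('.').reduce(
    (cur, key) => (cur == null ? undefined : cur[key]), obj);
}

function shouldFire(conditions, messageBody) {
  return conditions.every(c => lookup(messageBody, c.path) === c.equals);
}
```

A real design would need richer operators (prefix match, numeric comparison), but even this shape would avoid creating noop tasks for messages that match only on routing key.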

Thanks for all your input! =)

Pete
... and also hg.mozilla.org / git.mozilla.org changes - should we route all internal hg/git pushes through pulse, for all repositories? For example, Release Engineering has a bunch of repositories they use, see:
https://wiki.mozilla.org/ReleaseEngineering/Repositories

If they wanted to move their CI to taskcluster for RelEng tools (at the moment relying heavily on travis) I think they'd have to write a custom hg / git poller, if the mozilla-taskcluster poller is just for gecko repositories(?) Or maybe we can get dev services to publish hg/git pushes to pulse. If that is our favoured solution, I can raise a bug against dev services to start that process.

Do we already have a github -> pulse integration?

Thanks!
Pete
So pulse publishers can only write to their own exchanges:
See: https://wiki.mozilla.org/Auto-tools/Projects/Pulse#Publishers
If a pulse publisher doesn't make routing patterns suitable for filtering that's a problem.
But not one we should fix here.

I think we should setup a github-pulse bridge, that you can add an entire organization to, and
then have all push events arrive on pulse. That would be nice for stats and many other things.
I filed bug 1150287 with details on that proposal, should be simple.

Regarding cron, we could either build that in, or develop a pulse publisher for
messages with a routing key like: "time.<year>.<month>.<day>.<day-of-week>.<hour>.<minute>".
But this is a little fragile if even a single message is dropped. On the other hand, we shouldn't
see that with pulse...
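Such a publisher's routing key could be derived from the current time like this (a sketch; hypothetical helper, field order follows the pattern above, using UTC):

```javascript
// Build the routing key for a hypothetical cron-style pulse publisher that
// sends one message per minute, keyed as
// "time.<year>.<month>.<day>.<day-of-week>.<hour>.<minute>" (UTC).
function timeRoutingKey(date) {
  const days = ['sunday', 'monday', 'tuesday', 'wednesday',
                'thursday', 'friday', 'saturday'];
  return ['time',
          date.getUTCFullYear(),
          date.getUTCMonth() + 1,   // getUTCMonth() is 0-based
          date.getUTCDate(),
          days[date.getUTCDay()],
          date.getUTCHours(),
          date.getUTCMinutes()].join('.');
}
```

A hook wanting an hourly trigger could then bind with a pattern like "time.*.*.*.*.*.0", since the AMQP topic wildcard * matches exactly one word.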

Note, for a lot of things we actually want scheduling in-tree. I can't recall our thoughts on that,
but I'm fairly sure jlal, catlee and I discussed something like that at PDX.

Regarding nagios, I have no real idea what this is. If it's large messages, or thousands of them
per second, then yeah, we might need something that filters them before they are published to
pulse.
(In reply to Pete Moore [:pmoore][:pete] from comment #2)

> 1) What would you like to do for timer-based (cron-like) tasks, such as
> updating vcs caches, ETL processes etc? Would we have timers generating
> pulse messages? :/ Maybe a secondary service for handling cron-based jobs
> might help here?

The "See Also" reference to bug 1088350 is in response to this.
If any of the tasks need credentials, where do we store them?
Would the user specify them in the task definition?

Could you build a dummy task definition in here that would show such a use case?
On a side note, in trigger-bot we configure a consumer and we then handle the message with a callback:
https://github.com/chmanchester/trigger-bot/blob/master/triggerbot/triggerbot_pulse.py#L147
@armenzg,
I wouldn't specify credentials in the task definition.
But I would specify scopes; we already do this, see the "task.scopes" property in the createTask docs.

Then I would just validate that the client that creates/modifies the hook has all the scopes contained
in "task.scopes".
See the docker-worker auth-proxy docs for details on how to use scopes declared in "task.scopes":
  http://docs.taskcluster.net/workers/docker-worker/
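The validation described above could be sketched like this (hypothetical helper names), using taskcluster's scope semantics where a scope ending in '*' satisfies any scope it prefixes:

```javascript
// A held scope satisfies a required scope if they are equal, or if the
// held scope ends in '*' and is a prefix of the required scope.
function scopeSatisfies(held, required) {
  return held === required ||
    (held.endsWith('*') && required.startsWith(held.slice(0, -1)));
}

// The hooks service would reject createHook/updateHook unless every scope
// in task.scopes is satisfied by some scope the caller holds.
function callerCanDeclare(callerScopes, taskScopes) {
  return taskScopes.every(req =>
    callerScopes.some(held => scopeSatisfies(held, req)));
}
```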
Component: TaskCluster → General
Product: Testing → Taskcluster
Component: General → Hooks
Flags: needinfo?(nhirata.bugzilla)
Assignee: nobody → amiyaguchi
So today is my last day, and I haven't quite been able to finish this service. However, I do think I have built enough of a service to be usable, at least for periodic scheduling.

The hooks service can be found at https://github.com/taskcluster/taskcluster-hooks/. This project encompasses a few different things. The first is a web service for creating, updating, deleting, and triggering hooks. The second is a periodic scheduler that polls the web service at a set interval for hooks that need to be triggered.

Scheduling works as follows, since it differs a bit from the other things discussed here. The schedule is defined under taskcluster-hooks/schemas/schedule.yml and can take a few different forms. In the definition, it looks something like the following:

hook: {
  schedule: {
    format: {
      type: 'none'  // can be one of 'none', 'daily', 'weekly', 'monthly'
      // dayOfWeek:  defined if 'weekly'; a list of days ('monday', 'tuesday', etc...)
      // dayOfMonth: defined if 'monthly'; a list of numbers between 1 and 31
      // timeOfDay:  defined if not 'none'; a list of numbers between 0 and 23
    }
  }
}
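The schedule formats above imply a next-fire-time computation. A minimal sketch for the 'daily' case, assuming timeOfDay holds UTC hours (a hypothetical helper, not the service's actual code):

```javascript
// Given a 'daily' schedule {timeOfDay: [hours...]} and the current time,
// return the next Date (UTC) at which the hook should fire.
function nextDailyFire(schedule, now) {
  const hours = schedule.timeOfDay.slice().sort((a, b) => a - b);
  for (const hour of hours) {
    const candidate = new Date(Date.UTC(
      now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hour));
    if (candidate > now) {
      return candidate;
    }
  }
  // All of today's hours have passed; fire at the earliest hour tomorrow.
  // Date.UTC normalizes day overflow, so month/year rollover is handled.
  return new Date(Date.UTC(
    now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1, hours[0]));
}
```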

I've also added an endpoint to the original spec that Jonas and I drew up on an etherpad at https://etherpad.mozilla.org/jonasfj-taskcluster-hooks-design. This is a schedule endpoint, and it allows you to view the above schedule, with the next scheduled date. There's also the question of whether jobs actually get run under the scheduler. As of now, I don't have a way to test the scheduler, so I don't actually know whether jobs will get run in a production environment. One way would be to keep a history table of triggered tasks, so we can associate a taskId with a given hook and a time.

I haven't deployed the service to heroku, so that still needs to be done, as well as assigning it the domain of hooks.taskcluster.net. 

Once that is done, the tools part can be added to taskcluster-tools. I realized I'm terrible at making a nice-looking form, so I've created an editor similar to the task-creator, with an explorer taking up a third of the space on the left-hand side of the page. I've fixed up most of the react quirks that I was having earlier.

There is a bit of validation work that needs to be done before it's completely bug-free (groupId and hookId need to conform to url-safe names), but I am able to create new hooks, edit them, and delete them just fine. There is also a way to trigger them from this webview. A few things that would make this nicer would be to add the next scheduled time (or maybe a list of scheduled dates), and possibly a better way of validating the form instead of blindly sending it out to the server.

The code can be found on my taskcluster-tools branch at https://github.com/acmiyaguchi/taskcluster-tools/tree/hooks. When the hooks service is registered with the rest of taskcluster, the reference to the js api dump of taskcluster-hooks can be replaced with the url to the service instead.

I'll still help with this, but in a much more minor capacity.
I forgot to mention a few things that I hadn't gotten around to. Here's an itemized list of things that I think need to be done, in no particular order of importance.

- binding and listening to pulse messages
- publishing events to pulse (hook was triggered?)
- emailing on errors
- passing data to task creation from trigger input
- update taskcluster-client
- probably better documentation

And things I mentioned in my previous post, but more compactly:
Hooks:
  - A proper test of bin/schedule-hooks.js
  - a way to verify that a task was triggered on schedule (keeping history logs, pulse?)  
  - deployment
Tools:
  - better form validation
  - change api reference to deployed service schema
Assignee: acmiyaguchi → nobody
Assignee: nobody → dustin
Flags: needinfo?(nhirata.bugzilla)
Whiteboard: [bb2tc]
Blocks: 1212616
fwiw, that looks very high priority for b2gdroid.
Version 1 -- lots still to do in versions 2 and up!
Attachment #8675875 - Flags: review?(jopsen)
Comment on attachment 8675875 [details]
https://github.com/taskcluster/taskcluster-hooks/pull/1

Solid start... a few problems:
 - authorization is not parameterized (easy to fix)
 - idempotency of scheduled tasks isn't ensured, see comments on "created" time
 - otherwise it is mostly small things

We need to refile this against a blank branch, as the initial work was never reviewed,
so there are probably a few more issues hiding in those parts too.
From what I can see, though, this is all headed the right way.
Attachment #8675875 - Flags: review?(jopsen) → review-
Depends on: 1219375
OK, it's deployed, ish.  Remaining:
 * tools (Anthony had something started here)
 * add to docs
 * regenerate client packages
Pete, can you update the taskcluster-client-go with the new API?  It's in manifest.json
Flags: needinfo?(pmoore)
Hi Dustin!

Done!

Go client: https://godoc.org/github.com/taskcluster/taskcluster-client-go/hooks
Java client: http://taskcluster.github.io/taskcluster-client-java/apidocs/org/mozilla/taskcluster/client/hooks/Hooks.html

This would have been fully automatic, except that at the moment there is one piece of data that is not tracked in the manifest, or anything it points to: the "docroot" which I use in the generation process.

See https://github.com/taskcluster/taskcluster-client-go/blob/master/codegenerator/model/apis.json

I'm quite keen to get it added into manifest.json or the api references, so that in future the publishing of updated clients will be fully automatic, as I can get all the information I need from manifest.json.

At the moment my computer runs a cron job every hour that checks for schema updates, and if it sees something in manifest.json it doesn't have a docroot for, it speaks to me. Literally.
https://github.com/petemoore/myscrapbook/blob/e2b8d41a9f50fce23ac0f32516f3034060b98181/update_clients.sh#L85
So this alerted me to the new API this morning, and almost made me spill my coffee. =)

By the way, the change introduced a new edge case for the java code generator which had my brain bleeding for a while, before I finally fixed it in https://github.com/taskcluster/taskcluster-client-java/commit/3e907f8f1289afa56542a9963c957b221fe7da0a.

Some API cleanup ideas
======================

I noticed we have two schemas with title "Hook Schedule" which resulted in:

* https://godoc.org/github.com/taskcluster/taskcluster-client-go/hooks#HookSchedule
* https://godoc.org/github.com/taskcluster/taskcluster-client-go/hooks#HookSchedule1

which unfortunately isn't so pretty. I wonder if we can rename one of them to something else to make this prettier?

I've also submitted a PR to fix the size of the cron array to 6 items: https://github.com/taskcluster/taskcluster-hooks/pull/7.

Thanks!
Flags: needinfo?(pmoore)
The tools work is still in progress.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Hooks → Services