Closed Bug 1149789 Opened 9 years ago Closed 9 years ago

hooks: Build a hooks.taskcluster.net service (proposal)

Categories: Taskcluster :: Services, defect
Platform: x86_64 Linux
Severity: normal
Status: RESOLVED FIXED
People: (Reporter: jonasfj, Assigned: dustin)
Whiteboard: [bb2tc]
Attachments: (1 file)

:pmoore posted bug 1149504, and my thinking around this has been a bit simpler, i.e. only listening for pulse messages. Let's imagine it's called hooks.taskcluster.net, outlined as follows:

## Web API:
  PUT    /v1/hooks/<hook-group>/<hook-id>                   (create hook)
         {
           title:          "...",
           description:    "...",
           owner:          "someone@mozilla.com",
           emailErrors:    true || false,
           bindings: [
             {
               exchange:   "...",
               routingKey: "..."
             },
             ...
           ],
           task: {...}     // task definition
         }
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger           (trigger hook, for testing)
         {...} // trigger-payload
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger/<token>   (trigger hook from webhook w. token)
         {...} // trigger-payload
  GET    /v1/hooks/<hook-group>/<hook-id>/token             (get secret token for .../trigger/<token>)
  PATCH  /v1/hooks/<hook-group>/<hook-id>                   (update hook definition)
  GET    /v1/hooks/<hook-group>/<hook-id>                   (get hook definition)
  GET    /v1/hooks/                                         (list hook groups)
  GET    /v1/hooks/<hook-group>/                            (list hooks in given group)
  where <hook-group> and <hook-id> are identifiers of 1-22 characters, used in scopes etc.


## Background Worker:
Listens for messages on pulse, creating one AMQP queue for each <hook-group>/<hook-id> and binding
it to the list of bindings given. When a pulse message is received it triggers the hook with
trigger-payload as {exchange: '...', routingKey: '...', payload: {/*pulse message payload*/}}
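As a sketch (hypothetical helper names, not the service's actual code), the worker's message handling could keep queue naming and trigger-payload construction as pure functions:

```javascript
// One AMQP queue per hook; a deterministic queue name keeps re-binding
// idempotent if the worker restarts.
function queueNameFor(hookGroup, hookId) {
  return `hooks/${hookGroup}/${hookId}`;
}

// Wrap a raw pulse message into the trigger-payload shape described above:
// {exchange: '...', routingKey: '...', payload: {...}}
function toTriggerPayload(exchange, routingKey, messagePayload) {
  return {
    exchange: exchange,
    routingKey: routingKey,
    payload: messagePayload,
  };
}
```

The actual worker would bind the queue to each of the hook's listed bindings and call the hook's trigger logic with the payload built this way.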


## Triggering a hook:
When a hook is triggered, either through a webhook or by a pulse message, the task definition is
parameterized using the trigger-payload, so keys from the pulse message or webhook can be substituted
into the task definition before it is created.
Using something like https://www.npmjs.com/package/json-parameterization
lets us write strings like "{{2 days | from-now}}" for task.deadline, and
"{{<label> | as-slugid}}" to replace labels with random slugids.
Errors are printed to the logs, and maybe we email the owner, with at most one email per 12 hours.
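The substitution step could be sketched as follows. This is a simplified stand-in, not the actual json-parameterization API, and it only handles plain "{{key}}" substitution, not the "| from-now" / "| as-slugid" modifiers:

```javascript
// Minimal sketch of trigger-payload substitution: walk the task template
// and replace "{{key}}" placeholders in strings with values from the
// trigger-payload; unknown keys are left untouched.
function parameterize(template, params) {
  if (typeof template === 'string') {
    return template.replace(/\{\{\s*([\w.-]+)\s*\}\}/g, (match, key) =>
      key in params ? params[key] : match);
  }
  if (Array.isArray(template)) {
    return template.map(item => parameterize(item, params));
  }
  if (template !== null && typeof template === 'object') {
    const result = {};
    for (const key of Object.keys(template)) {
      result[key] = parameterize(template[key], params);
    }
    return result;
  }
  return template;  // numbers, booleans, null pass through unchanged
}
```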


## Administration:
A UI is added to tools.taskcluster.net, and people can be given scopes such as
  hooks:create:releng-hooks/*
so groups of scopes can be delegated to people.

---------------------------------------------------
This will handle two sources of input:
 - pulse messages
 - webhooks that submit JSON (using the secret <token> API end-point)

For all other sources we set up something that publishes to pulse.
IMO we might want to maintain a github-pulse service, which publishes github events to pulse.
Similarly we could do something for the hg pushlog, etc...
Note, maybe having the end-points:
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger           (trigger hook, for testing)
  POST   /v1/hooks/<hook-group>/<hook-id>/trigger/<token>   (trigger hook from webhook w. token)
  GET    /v1/hooks/<hook-group>/<hook-id>/token             (get secret token for .../trigger/<token>)
is a bad idea...

Instead we should add something that publishes a pulse message from a web-hook. Then we can pick up
that pulse message and do something with it.

It's always better to have pulse messages, because others can listen for them and do stuff...
So forcing everything through pulse is probably a good thing.
I really like this proposal Jonas; I think it is very well thought through. I also appreciate that you sketched out an API for clarity.

I have a couple of open questions for some peripheral matters, but I think the design is solid, so these questions are more for the sake of completeness than anything else.

1) What would you like to do for timer-based (cron-like) tasks, such as updating vcs caches, ETL processes etc? Would we have timers generating pulse messages? :/ Maybe a secondary service for handling cron-based jobs might help here?

2) In the case we wish to have tasks downstream of nagios alerts, would you propose we route all nagios alerts through pulse? Might nagios spikes topple pulse, or are we confident pulse could handle it? Maybe there are also high-volume services that could generate a lot of traffic, or at least be intermittently "bursty". I wonder if we can toughen up our pulse implementation?

I'm aware pulse is based on RabbitMQ which in theory can handle a *lot* of traffic, at the moment it just seems a little fragile, and I'm not sure if reliability is likely to considerably improve, especially if we add a lot of traffic. However, maybe these are not significant amounts.

I do favour simplicity, and pulse is the obvious candidate for routing all events through. So this does seem like a very good practice. Maybe we could consider special handling of these two use cases (cron/nagios) if we decide they don't sit well in pulse.

3) Another thought occurred to me. If the publishing of the pulse messages is not controlled by the party generating the downstream TaskCluster task graphs, the routing-key/exchange binding may not be sufficient to determine whether a task needs to fire. There might also need to be filtering on the message body, so that noop tasks are not created and consuming resources (e.g. the task fires up, some condition is not met in the message body, and the task completes).

However, I don't have a solid proposal to solve this. The most obvious solution I see is that the party generating the downstream tasks first publishes messages to their own exchange, filtering from a source AMQP channel, so that exchange/routing key is sufficient. However, this still requires that they write/deploy a service running somewhere to do the consuming/filtering/publishing, in which case they could talk to the scheduler directly rather than needing a pulse interface (since the main motivation of adding a pulse interface was to allow event-based task generation that does not require the deployment of a dedicated service).

Another option might be some JSON processing language that allows an expression to be passed that returns true/false based on evaluating the expression against the AMQP message body. Let me know what your thoughts are on this too!
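The last option above, a true/false expression evaluated against the message body, could be sketched minimally as a list of path/value conditions (a hypothetical shape, purely for illustration):

```javascript
// Hypothetical message-body filter: each condition names a dotted path into
// the pulse message payload and the value it must equal; the hook only
// fires when all conditions hold.
function lookup(obj, dottedPath) {
  return dottedPath.split('.').reduce(
    (cur, key) => (cur == null ? undefined : cur[key]), obj);
}

function shouldFire(conditions, messageBody) {
  return conditions.every(c => lookup(messageBody, c.path) === c.equals);
}
```

A real design would need richer operators (prefix match, numeric comparison), but even this shape would avoid creating noop tasks for messages that match only on routing key.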

Thanks for all your input! =)

Pete
... and also hg.mozilla.org / git.mozilla.org changes - should we route all internal hg/git pushes through pulse, for all repositories? For example, Release Engineering has a bunch of repositories they use, see:
https://wiki.mozilla.org/ReleaseEngineering/Repositories

If they wanted to move their CI to taskcluster for RelEng tools (at the moment relying heavily on travis) I think they'd have to write a custom hg / git poller, if the mozilla-taskcluster poller is just for gecko repositories(?) Or maybe we can get dev services to publish hg/git pushes to pulse. If that is our favoured solution, I can raise a bug against dev services to start that process.

Do we already have a github -> pulse integration?

Thanks!
Pete
So pulse publishers can only write to their own exchanges:
See: https://wiki.mozilla.org/Auto-tools/Projects/Pulse#Publishers
If a pulse publisher doesn't make routing patterns suitable for filtering that's a problem.
But not one we should fix here.

I think we should setup a github-pulse bridge, that you can add an entire organization to, and
then have all push events arrive on pulse. That would be nice for stats and many other things.
I filed bug 1150287 with details on that proposal, should be simple.

Regarding cron, we could either build that in, or develop a pulse publisher for
messages with a routing key like: "time.<year>.<month>.<day>.<day-of-week>.<hour>.<minute>".
But this is a little fragile if even a single message is dropped. On the other hand, we shouldn't
see that with pulse...
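Such a publisher's routing key could be derived from the current time like this (a sketch; hypothetical helper, field order follows the pattern above, using UTC):

```javascript
// Build the routing key for a hypothetical cron-style pulse publisher that
// sends one message per minute, keyed as
// "time.<year>.<month>.<day>.<day-of-week>.<hour>.<minute>" (UTC).
function timeRoutingKey(date) {
  const days = ['sunday', 'monday', 'tuesday', 'wednesday',
                'thursday', 'friday', 'saturday'];
  return ['time',
          date.getUTCFullYear(),
          date.getUTCMonth() + 1,   // getUTCMonth() is 0-based
          date.getUTCDate(),
          days[date.getUTCDay()],
          date.getUTCHours(),
          date.getUTCMinutes()].join('.');
}
```

A hook wanting an hourly trigger could then bind with a pattern like "time.*.*.*.*.*.0", since the AMQP topic wildcard * matches exactly one word.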

Note, for a lot of things we actually want scheduling in-tree. I can't recall our thoughts on that,
but I'm fairly sure jlal, catlee and I discussed something like that at PDX.

Regarding nagios, I have no real idea what this is. If it's large messages, or thousands of them
per second, then yeah, we might need something that filters them before they are published to
pulse.
(In reply to Pete Moore [:pmoore][:pete] from comment #2)

> 1) What would you like to do for timer-based (cron-like) tasks, such as
> updating vcs caches, ETL processes etc? Would we have timers generating
> pulse messages? :/ Maybe a secondary service for handling cron-based jobs
> might help here?

The "See Also" reference to bug 1088350 is in response to this.
If any of the tasks need credentials, where do we store them?
Would the user specify them in the task definition?

Could you build a dummy task definition in here that would show such a use case?
On a side note, in trigger-bot we configure a consumer and we then handle the message with a callback:
https://github.com/chmanchester/trigger-bot/blob/master/triggerbot/triggerbot_pulse.py#L147
@armenzg,
I wouldn't specify credentials in the task definition.
But I would specify scopes; we already do this, see the "task.scopes" property in the createTask docs.

Then I would just validate that the client that creates/modifies the hook has all the scopes contained
in "task.scopes".
See the docker-worker auth-proxy docs for details on how to use scopes declared in "task.scopes":
  http://docs.taskcluster.net/workers/docker-worker/
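The validation described above could be sketched like this (hypothetical helper names), using taskcluster's scope semantics where a scope ending in '*' satisfies any scope it prefixes:

```javascript
// A held scope satisfies a required scope if they are equal, or if the
// held scope ends in '*' and is a prefix of the required scope.
function scopeSatisfies(held, required) {
  return held === required ||
    (held.endsWith('*') && required.startsWith(held.slice(0, -1)));
}

// The hooks service would reject createHook/updateHook unless every scope
// in task.scopes is satisfied by some scope the caller holds.
function callerCanDeclare(callerScopes, taskScopes) {
  return taskScopes.every(req =>
    callerScopes.some(held => scopeSatisfies(held, req)));
}
```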
Component: TaskCluster → General
Product: Testing → Taskcluster
Component: General → Hooks
Flags: needinfo?(nhirata.bugzilla)
Assignee: nobody → amiyaguchi
So today is my last day, and I haven't quite been able to finish this service. However, I do think I have built enough of a service to be usable, at least for periodic scheduling.

The hooks service can be found at https://github.com/taskcluster/taskcluster-hooks/. This project encompasses a few different things. The first is a web service for creating, updating, deleting, and triggering hooks. The second is a periodic scheduler that polls the web service at a set interval for hooks that need to be triggered.

Scheduling works as follows, since it differs a bit from the other things discussed here. The schedule is defined under taskcluster-hooks/schemas/schedule.yml and can take a few different forms. In the definition, it looks something like the following:

hook: {
  schedule: {
    format: {
      type: 'none'  // can be one of 'none', 'daily', 'weekly', 'monthly'
      // dayOfWeek:  defined if 'weekly'; a list of days ('monday', 'tuesday', etc...)
      // dayOfMonth: defined if 'monthly'; a list of numbers between 1 and 31
      // timeOfDay:  defined if not 'none'; a list of numbers between 0 and 23
    }
  }
}
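The schedule formats above imply a next-fire-time computation. A minimal sketch for the 'daily' case, assuming timeOfDay holds UTC hours (a hypothetical helper, not the service's actual code):

```javascript
// Given a 'daily' schedule {timeOfDay: [hours...]} and the current time,
// return the next Date (UTC) at which the hook should fire.
function nextDailyFire(schedule, now) {
  const hours = schedule.timeOfDay.slice().sort((a, b) => a - b);
  for (const hour of hours) {
    const candidate = new Date(Date.UTC(
      now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hour));
    if (candidate > now) {
      return candidate;
    }
  }
  // All of today's hours have passed; fire at the earliest hour tomorrow.
  // Date.UTC normalizes day overflow, so month/year rollover is handled.
  return new Date(Date.UTC(
    now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1, hours[0]));
}
```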

I've also added an endpoint to the original spec that Jonas and I drew up on an etherpad at https://etherpad.mozilla.org/jonasfj-taskcluster-hooks-design. This is a schedule endpoint, and it allows you to view the above schedule, with the next scheduled date. There's also the question of whether jobs actually get run under the scheduler. As of now, I don't have a way to test the scheduler, so I don't actually know whether jobs will get run in a production environment. One way would be to keep a history table of triggered tasks, so we can associate a taskId with a given hook and a time.

I haven't deployed the service to heroku, so that still needs to be done, as well as assigning it the domain of hooks.taskcluster.net. 

Once that is done, the tools part can be added to taskcluster-tools. I realized I'm terrible at making a nice-looking form, so I've created an editor similar to the task-creator, with an explorer taking up a third of the space on the left-hand side of the page. I've fixed up most of the react quirks that I was having earlier.

There is a bit of validation work that needs to be done before it's completely bug-free (groupId and hookId need to conform to url-safe names), but I am able to create new hooks, edit them, and delete them just fine. There is also a way to trigger them from this webview. A few things that would make this nicer would be to add the next scheduled time (or maybe a list of scheduled dates), and possibly a better way of validating the form instead of blindly sending it out to the server.

The code can be found on my taskcluster-tools branch at https://github.com/acmiyaguchi/taskcluster-tools/tree/hooks. When the hooks service is registered with the rest of taskcluster, the reference to the js api dump of taskcluster-hooks can be replaced with the url to the service instead.

I'll still help with this, but in a much more minor capacity.
I forgot to mention a few things that I hadn't gotten around to. Here's an itemized list of things that I think need to be done, in no particular order of importance.

- binding and listening to pulse messages
- publishing events to pulse (hook was triggered?)
- emailing on errors
- passing data to task creation from trigger input
- update taskcluster-client
- probably better documentation

And things I mentioned in my previous post, but more compactly:
Hooks:
  - A proper test of bin/schedule-hooks.js
  - a way to verify that a task was triggered on schedule (keeping history logs, pulse?)  
  - deployment
Tools:
  - better form validation
  - change api reference to deployed service schema
Assignee: acmiyaguchi → nobody
Assignee: nobody → dustin
Flags: needinfo?(nhirata.bugzilla)
Whiteboard: [bb2tc]
Blocks: 1212616
fwiw, that looks very high priority for b2gdroid.
Version 1 -- lots still to do in versions 2 and up!
Attachment #8675875 - Flags: review?(jopsen)
Comment on attachment 8675875 [details]
https://github.com/taskcluster/taskcluster-hooks/pull/1

Solid start... a few problems:
 - authorization is not parameterized (easy to fix)
 - idempotency of scheduled tasks isn't ensured, see comments on "created" time
 - otherwise it is mostly small things

We need to refile this against a blank branch, as the initial work was never reviewed,
so there are probably a few more issues hiding in those parts too.
From what I can see, though, this is all headed the right way.
Attachment #8675875 - Flags: review?(jopsen) → review-
Depends on: 1219375
OK, it's deployed, ish.  Remaining:
 * tools (Anthony had something started here)
 * add to docs
 * regenerate client packages
Pete, can you update the taskcluster-client-go with the new API?  It's in manifest.json
Flags: needinfo?(pmoore)
Hi Dustin!

Done!

Go client: https://godoc.org/github.com/taskcluster/taskcluster-client-go/hooks
Java client: http://taskcluster.github.io/taskcluster-client-java/apidocs/org/mozilla/taskcluster/client/hooks/Hooks.html

This would have been fully automatic, except that at the moment there is one piece of data that is not tracked in the manifest, or anything it points to: the "docroot" which I use in the generation process.

See https://github.com/taskcluster/taskcluster-client-go/blob/master/codegenerator/model/apis.json

I'm quite keen to get it added into manifest.json or the api references, so that in future the publishing of updated clients will be fully automatic, as I can get all the information I need from manifest.json.

At the moment my computer runs a cron job every hour that checks for schema updates, and if it sees something in manifest.json it doesn't have a docroot for, it speaks to me. Literally.
https://github.com/petemoore/myscrapbook/blob/e2b8d41a9f50fce23ac0f32516f3034060b98181/update_clients.sh#L85
So this alerted me to the new API this morning, and almost made me spill my coffee. =)

By the way, the change introduced a new edge case for the java code generator which had my brain bleeding for a while, before I finally fixed it in https://github.com/taskcluster/taskcluster-client-java/commit/3e907f8f1289afa56542a9963c957b221fe7da0a.

Some API cleanup ideas
======================

I noticed we have two schemas with title "Hook Schedule" which resulted in:

* https://godoc.org/github.com/taskcluster/taskcluster-client-go/hooks#HookSchedule
* https://godoc.org/github.com/taskcluster/taskcluster-client-go/hooks#HookSchedule1

which unfortunately isn't so pretty. I wonder if we can rename one of them to something else to make this prettier?

I've also submitted a PR to fix the size of the cron array to 6 items: https://github.com/taskcluster/taskcluster-hooks/pull/7.

Thanks!
Flags: needinfo?(pmoore)
The tools work is still in progress.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: Hooks → Services