Closed Bug 1574651 Opened 5 years ago Closed 5 years ago

Investigate ingesting pulse messages from multiple TC deployments

Categories

(Tree Management :: Treeherder: Data Ingestion, enhancement, P2)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

I know, it scares me too.

Servo will be moving to the community Taskcluster deployment, while Treeherder is currently set to consume from pulse, which contains only data from the firefox-ci Taskcluster deployment.

Servo reports its results to treeherder and lacks another good solution for displaying status.

What would be involved in consuming messages from a second RabbitMQ cluster (or exchange, at least) and generating build status for those messages? I think the tricky bit will be that the rootUrl at which to find the status of those tasks will differ, so we'll need to be careful and make sure all URLs point to the right place.
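To make that concrete, here's a rough sketch (not actual Treeherder code) of building task URLs from a per-deployment rootUrl instead of assuming taskcluster.net; it assumes the taskcluster-urls Python helper package, and the task ID is illustrative:

    import taskcluster_urls

    def task_status_url(root_url, task_id):
        # Build the queue API URL against the deployment's rootUrl rather
        # than a hard-coded https://taskcluster.net hostname.
        return taskcluster_urls.api(root_url, "queue", "v1",
                                    "task/{}/status".format(task_id))

    # e.g. task_status_url("https://community-tc.services.mozilla.com", "abc123TaskId")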

I'll spend some time investigating this, and may bend Armen's ear to validate my findings.

I fully endorse you and Armen working together as much as needed to get this working.

What will the new rootUrl look like?
What would the new exchanges look like? Are they available now?
Are the messages going to keep the same schema?
When will this be needed by?
I believe this would be possible without a lot of code changes. Testing it will be a bit cumbersome.

There are a few instances of rootUrl in here:
https://github.com/mozilla/treeherder/pull/5042/files

The command ./manage.py ingest_push_and_tasks is not fully mature and does not currently have support for Git branches.
However, you can still test ingesting a task at a time.

I see the current implementation working for the following:
https://treeherder.mozilla.org/#/jobs?repo=servo-master
https://treeherder.mozilla.org/#/jobs?repo=servo-auto

I don't see it fully working for:
https://treeherder.mozilla.org/#/jobs?repo=servo-try
https://treeherder.mozilla.org/#/jobs?repo=servo-prs

What's the difference between all four? What are the last two used for?

(In reply to Armen [:armenzg] from comment #2)

What will the new rootUrl look like?

Current leading candidate is https://community-tc.services.mozilla.com

What would the new exchanges look like? Are they available now?

The exchanges would be the same, just on a different RabbitMQ instance.

Are the messages going to keep the same schema?

Yes

When will this be needed by?

Before we move Servo to the new deployment, which will occur before we move Firefox-CI to its new deployment.

I believe this would be possible without a lot of code changes. Testing it will be a bit cumbersome.

There are a few instances of rootUrl in here:
https://github.com/mozilla/treeherder/pull/5042/files

The command ./manage.py ingest_push_and_tasks is not fully mature and does not currently have support for Git branches.
However, you can still test ingesting a task at a time.

Thanks -- I will start by looking at those files.

Homu is the tool Servo uses to manage pull requests in https://github.com/servo/servo/. Based on certain commands in GitHub comments, it can create a merge commit and push it to a branch to trigger some tasks:

  • The auto branch, where if all tests report green (through the GitHub Status API), Homu will then push to master (which in turn triggers another couple of tasks).
  • The try branch, where all tests are run and results are reported as a PR comment, but nothing is merged to master.
  • Various try-FOO branches, where a subset of the tests is run.

Servo feeds job/task data to Treeherder by setting a tc-treeherder.v2._/{repository}/{commitSha} route when creating tasks. The queue service then takes care of sending Pulse messages that Treeherder ingests. For tasks created in response to a push to a branch (typically made by Homu), the Treeherder "repository" name is set to servo-{branchName}, except if the branch is one of try-*, in which case it is set to servo-try. The logic for this is in .taskcluster.yml for the decision task, and in decision_task.py and decisionlib.py for other tasks.

For tasks created in response to a pull request event, the Treeherder "repository" name is set to servo-prs (in .taskcluster.yml and decision_task.py).
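As a rough sketch (not the actual decisionlib code), the repository-name mapping described above amounts to something like:

    def treeherder_repo(branch_name=None, is_pull_request=False):
        # Illustrative version of the mapping described above; the real logic
        # lives in .taskcluster.yml, decision_task.py, and decisionlib.py.
        if is_pull_request:
            return "servo-prs"
        if branch_name == "try" or branch_name.startswith("try-"):
            return "servo-try"
        return "servo-" + branch_name   # e.g. master -> servo-master, auto -> servo-auto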

There’s also a daily hook. I see that we’re missing a Treeherder route for that decision task. But then that decision task goes through the same code path as a push to the master branch when creating other tasks, so those go to Treeherder’s servo-master.

Treeherder inserts the taskId into its DB, and then uses that in the context of the configured rootUrl both on the frontend and backend, mostly to support actions. So, that would need some refactoring too.

I made an appointment for Cam, Armen, and me to talk on Thursday.

jgraham made a good point on IRC: we don’t need to track a Taskcluster rootUrl for each task, only one per repository.

Dang, I apologize for missing this meeting. I chimed in on IRC, but here's the distillation of my own 2 cents, some of these points already made by others:

I think keeping a single Treeherder instance would be best. Something might come up to make having a separate Treeherder a better option, but I can't think of anything right now.

As James said, we should just make sure to add the rootUrl to the repositories fixture/schema and build our URLs from there. We have Taskcluster URLs semi-hard-coded in a few places (like runnable-jobs.json, task inspector, etc). So same deal there. That's pretty straightforward.
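A rough sketch of that idea, with an assumed field name (tc_root_url) that may not match what actually lands:

    from django.db import models

    class Repository(models.Model):
        # Existing fields elided; this is only a sketch of the shape.
        name = models.CharField(max_length=128)
        # Per-repository Taskcluster deployment; URLs for runnable-jobs.json,
        # the task inspector, actions, etc. would be built from this instead
        # of a hard-coded hostname.
        tc_root_url = models.CharField(max_length=255,
                                       default="https://taskcluster.net")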

Edits after thinking more:

And then we'd just need to be able to read messages from both exchanges. Sounds like the task/push schema is going to be the same, so that's going to be straightforward. I think it's just a matter of adding the new exchange to our Pulse queues. But I don't know if there are complications in reading from Pulse exchanges on a different RabbitMQ instance. I think that should be fine. But I've never tried it.
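As a sketch of what consuming from more than one broker could look like with kombu (names, URLs, and queue/exchange choices are illustrative, not the actual Treeherder ingestion code):

    from kombu import Connection, Exchange, Queue

    # One entry per Taskcluster deployment; each may live on a different
    # RabbitMQ (Pulse) instance.
    SOURCES = [
        {"root_url": "https://taskcluster.net",
         "pulse_url": "amqps://user:password@pulse.mozilla.org"},
        {"root_url": "https://community-tc.services.mozilla.com",
         "pulse_url": "amqps://user:password@other-rabbitmq.example.com"},
    ]

    def build_consumers(callback):
        consumers = []
        for index, source in enumerate(SOURCES):
            connection = Connection(source["pulse_url"])
            exchange = Exchange("exchange/taskcluster-queue/v1/task-completed",
                                type="topic")
            queue = Queue(
                name="queue/treeherder/tasks-{}".format(index),
                exchange=exchange,
                routing_key="route.tc-treeherder.#",  # bind on the tc-treeherder route
                durable=True,
            )
            # Remember which deployment each message came from so ingestion
            # can record the right rootUrl alongside the taskId.
            consumers.append(connection.Consumer(
                [queue],
                callbacks=[lambda body, message, root_url=source["root_url"]:
                           callback(root_url, body, message)]))
        return consumers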

So this all sounds mostly like a pretty straight shot. :)

As I got started on this, I realized that this is going to interact poorly with the current sign-in implementation, which only works with one thing (Mozilla's Auth0, basically). So it will only have creds for one deployment. There's work going on regarding third-party login and that may offer a solution: when a user clicks or hits a key to trigger an action, then Treeherder should check if it has TC credentials for that rootUrl, and if not begin the third-party signin process, if such a thing is configured for that rootUrl. That can happen in a new tab, and will proceed without user interaction in most cases.

Until then, I'll just set things up so actions only work in repos with rootUrl https://taskcluster.net.
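Something as small as the following gate would do for now (illustrative only, assuming a per-repository tc_root_url field as sketched earlier):

    def actions_enabled(repository):
        # Until we can hold credentials for more than one deployment, only
        # allow actions against the deployment our current login flow covers.
        return repository.tc_root_url == "https://taskcluster.net"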

Next up:

  • ingest from multiple pulse exchanges, each for a different rootUrl
Status: NEW → ASSIGNED
Priority: -- → P2

I have some work on the next step already. I'll wrap that up and make a PR.

This PR will also fix Bug 1578524. Please mark it as fixed when this is merged.

Blocks: 1578524

This got deployed this morning and I noticed the pulse queues growing out of control.

I have added the following variables:
PULSE_PUSH_SOURCES to [{"root_url": "https://taskcluster.net", "github": true, "hgmo": true, "pulse_url": <prod_url>}]
PULSE_PUSH_TASKS to [{"root_url": "https://taskcluster.net", "pulse_url": <prod_url>}]
ROOT_URL to https://taskcluster.net

Are these correct?
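For reference, those JSON-valued variables could be read at startup along these lines (a sketch; the exact handling in Treeherder's settings may differ):

    import json
    import os

    # Each source entry pairs a Pulse broker URL with the Taskcluster
    # root_url its messages belong to, mirroring the values listed above.
    push_sources = json.loads(os.environ.get("PULSE_PUSH_SOURCES", "[]"))
    task_sources = json.loads(os.environ.get("PULSE_PUSH_TASKS", "[]"))

    for source in push_sources + task_sources:
        print(source["root_url"], "via", source.get("pulse_url"))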

The queues are now back under control.

Hi camd,
I saw that you added and removed the variables; is there a reason for this?
https://cl.ly/b00bb096b096

That looks correct. They were added and removed because they conflicted with variables used by the previous version of Treeherder. I should not have re-used variable names like that -- sorry!

Thanks for catching this quickly and avoiding data loss!

Yeah, the code prior to this merge would detect if those vars were set. If so, it would stop ingesting the old way. But the new way wasn't complete till this merge. So I added the new vars in anticipation of a push of this code, but that caused similar havoc. So I had to remove them to get back to normal.

My apologies for not touching base with you yesterday, Armen, about being sure these variables were set at deploy time. Fortunately, since the queue is now unbounded, we would not have lost data, even if you hadn't caught it so quickly. And nice job figuring out what was needed to fix it.

I think this is in production -- OK to close?

Indeed. It did not get reverted.

Congratulations and thanks for fixing it!

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
