Closed Bug 1149738 Opened 10 years ago Closed 9 years ago

Add the ability to re-trigger TaskCluster tasks to mozci

Categories

(Testing :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Assigned: armenzg)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Status: NEW → ASSIGNED
Depends on: 1174236
We can only add TaskCluster's cancel and re-trigger abilities, since those are the only APIs available.
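(For reference, a minimal sketch of what those two calls look like with the Python client; the clientId/accessToken values and the taskId are placeholders, and the scopes the client needs are discussed further down in this bug.)

import taskcluster

queue = taskcluster.Queue({'credentials': {'clientId': '...', 'accessToken': '...'}})
queue.rerunTask('UzIo9qsTTHa-NFqx6RfJig')   # re-run an existing task (only before its deadline)
queue.cancelTask('UzIo9qsTTHa-NFqx6RfJig')  # cancel a pending or running task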
Summary: Add TaskCluster support to mozci → Add the ability to re-trigger, cancel TaskCluster tasks to mozci
Blocks: 1178522
We need permanent credentials to rerun [1]. We can't rerun a task past its deadline [2]. We can also take an existing task and re-submit it with different timestamps, IIUC [3].

[1]
import taskcluster

queue = taskcluster.Queue({'credentials': {'clientId': 'ID', 'accessToken': 'tokenID'}})
queue.rerunTask("UzIo9qsTTHa-NFqx6RfJig")

[2]
Traceback (most recent call last):
  File "rerun.py", line 7, in <module>
    queue.rerunTask("UzIo9qsTTHa-NFqx6RfJig")
  File "/Users/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster/client.py", line 455, in apiCall
    return self._makeApiCall(e, *args, **kwargs)
  File "/Users/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster/client.py", line 232, in _makeApiCall
    return self._makeHttpRequest(entry['method'], route, payload)
  File "/Users/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster/client.py", line 424, in _makeHttpRequest
    superExc=rerr
taskcluster.exceptions.TaskclusterRestFailure: Task can't be scheduled past it's deadline

[3]
garndt: armenzg: that's right, I grab the task definition from the queue, replace the deadline in that definition and submit an entirely new task that will have a new task ID

######################
import taskcluster
import datetime

queue = taskcluster.Queue()
task = queue.task('UzIo9qsTTHa-NFqx6RfJig')
task_id = taskcluster.slugId()

artifacts = task['payload'].get('artifacts', {})
for artifact, definition in artifacts.iteritems():
    definition['expires'] = taskcluster.fromNow('365 days')

task['taskGroupId'] = task_id
task['expires'] = taskcluster.fromNow('365 days')
task['created'] = datetime.datetime.utcnow()
task['deadline'] = taskcluster.fromNow('24 hours')

queue.createTask(task_id, task)
chmanchester: do you have a bug that this one is blocking for you? From what I wrote in my previous comment, trigger-bot could still add the ability to retrigger TC tasks without depending on mozci (I sent you client credentials, to which we - me or garndt/jonasfj - can add more scopes). I will still add this ability to mozci, but I want to understand how hard this blocks you, since I'm going away for a week.
I don't have a separate bug; it's not clear to me how much work this will be. It's not blocking me until we move more tests to TaskCluster. We're only missing the ability to trigger Mulet jobs for now, and I think that's acceptable for the foreseeable future. I would prefer to take advantage of this ability in mozci; otherwise we'll end up building the same thing twice. When do you expect to implement it?
I can have it done after I come back from my break. I can use Mulet as my test case.
Thanks Armen, that sounds good!
garndt: What is the minimum scope I need to rerun a task? I have some permanent credentials created for this.

In [3]: import taskcluster
In [4]: queue = taskcluster.Queue()
In [5]: queue.rerunTask('PLmvHjAGSUaoV3MUUYAHLQ')
---------------------------------------------------------------------------
TaskclusterAuthFailure                    Traceback (most recent call last)
<ipython-input-5-ee3e3eeeec0b> in <module>()
----> 1 queue.rerunTask('PLmvHjAGSUaoV3MUUYAHLQ')

/home/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster-0.0.22-py2.7.egg/taskcluster/client.pyc in apiCall(self, *args, **kwargs)
    453     def addApiCall(e):
    454         def apiCall(self, *args, **kwargs):
--> 455             return self._makeApiCall(e, *args, **kwargs)
    456         return apiCall
    457

/home/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster-0.0.22-py2.7.egg/taskcluster/client.pyc in _makeApiCall(self, entry, *args, **kwargs)
    230         log.debug('Route is: %s', route)
    231
--> 232         return self._makeHttpRequest(entry['method'], route, payload)
    233
    234     def _processArgs(self, entry, *args, **kwargs):

/home/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster-0.0.22-py2.7.egg/taskcluster/client.pyc in _makeHttpRequest(self, method, route, payload)
    415                 status_code=status,
    416                 body=data,
--> 417                 superExc=rerr
    418             )
    419         # Raise TaskclusterRestFailure for all other issues

TaskclusterAuthFailure: Authorization Failed
Looking here [1], it appears that you need queue:rerun-task and assume:scheduler-id:<schedulerId>/<taskGroupId>

[1] http://docs.taskcluster.net/queue/api-docs/#rerunTask
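(In other words, the failing call above should go through once the client carries those scopes; a sketch with placeholder credentials:)

import taskcluster

# Assumes the client has been granted queue:rerun-task and the matching
# assume:scheduler-id:<schedulerId>/<taskGroupId> scope for this task.
queue = taskcluster.Queue({'credentials': {'clientId': 'my-client-id',
                                           'accessToken': 'my-access-token'}})
queue.rerunTask('PLmvHjAGSUaoV3MUUYAHLQ')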
<armenzg> what is the reason that we don't allow rerunning a task past its deadline?
<jonasfj> garndt, I always log errors with debug("err: %s, as JSON: %j", err, err, err.stack) to get properties too..
<jonasfj> armenzg, yes, basically we want to have a way to say that tasks a definitively dead and never going to run again...
<jonasfj> I guess we could allow deadline extension..
<jonasfj> but for now I'm quiet happy not doing this..
<jonasfj> idempotency is nice..
<armenzg> jonasfj, I'm implementing a piece of code which will take a task id and create a new task with new creation times and deadlines instead of rerunning
<armenzg> since I'm past the deadline
<jonasfj> armenzg, I think that's a good pattern...
<jonasfj> IMO we could have 3 concepts:
<jonasfj> retries - automatic when infrastructure failures
<jonasfj> reruns - semi automatic
<jonasfj> retriggers - when requested sometime later
<armenzg> jonasfj, there is no API for the last one, correct?
<jonasfj> reruns are currently handled at scheduler level, but we've discussed moving them into the queue... and say that they should happen for task-failures
<armenzg> IIUC
<jonasfj> armenzg, correct
<jonasfj> no API... not sure we should make one... because task context changes (taskId changes)
<jonasfj> there may be side effects if it was part of task-graph before
<jonasfj> or had dependencies...
<jonasfj> IMO, I don't like reruns because it means that down-stream tasks in a task-graph never really knows if their dependencies succeeded or failed...
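(For mozci this roughly translates into picking between the two operations based on the deadline; a hedged sketch, where the use of dateutil to parse the ISO deadline string is my assumption and not something from this bug:)

import datetime
import dateutil.parser  # assumption: only used here to parse the ISO deadline string
import taskcluster

queue = taskcluster.Queue({'credentials': {'clientId': '...', 'accessToken': '...'}})
task_id = 'PLmvHjAGSUaoV3MUUYAHLQ'  # placeholder
deadline = dateutil.parser.parse(queue.task(task_id)['deadline']).replace(tzinfo=None)

if datetime.datetime.utcnow() < deadline:
    # "rerun": same taskId, only possible before the deadline
    queue.rerunTask(task_id)
else:
    # "retrigger": clone the definition under a new taskId with fresh
    # created/deadline/expires values, as in the snippet from comment 2
    pass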
garndt: I can't rerun tasks past their deadline, so I want to create a task as in comment 2. According to the documentation, I need a scope of "queue:create-task:<provisionerId>/<workerType>" [1]. FYI, I have added queue:create-task:* to my client.

Any ideas on why this task that I created failed?
https://tools.taskcluster.net/task-inspector/#8Z6Q0L1YQUaTxJMFSUTeeg/0

[1] http://docs.taskcluster.net/queue/api-docs/#createTask
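(Side note: the concrete scope that createTask checks can be read off the task definition itself; a small illustrative snippet, with the taskId taken from this comment and the printed value only an example:)

import taskcluster

queue = taskcluster.Queue()
task = queue.task('8Z6Q0L1YQUaTxJMFSUTeeg')

# createTask checks queue:create-task:<provisionerId>/<workerType>
print 'queue:create-task:%s/%s' % (task['provisionerId'], task['workerType'])
# e.g. queue:create-task:aws-provisioner-v1/b2gtest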
Here's my current script:
https://github.com/armenzg/mozilla_ci_tools/blob/taskcluster_retrigger/mozci/scripts/misc/taskcluster_retrigger.py#L72

This is where the exception is being thrown from:
https://github.com/taskcluster/taskcluster-client.py/blob/master/taskcluster/client.py#L420

This is the error I'm hitting:

07/30/2015 10:56:46 DEBUG: Contents of new task:
07/30/2015 10:56:46 DEBUG: {"workerType": "b2gtest", "taskGroupId": "vzNJ6VpnT9-Ez4dQMQheuA", "expires": "2015-08-01T14:56:46.637870Z", "retries": 5, "extra": {"chunks": {"current": 1, "total": 1}, "treeherderEnv": ["production", "staging"], "treeherder": {"groupSymbol": "?", "collection": {"opt": true}, "productName": "b2g", "machine": {"platform": "b2g-linux64"}, "groupName": "Submitted by taskcluster", "build": {"platform": "b2g-linux64"}, "symbol": "Gu"}}, "created": "2015-07-30T14:56:46.637894Z", "tags": {"createdForUser": "armenzg@mozilla.com"}, "priority": "normal", "schedulerId": "task-graph-scheduler", "deadline": "2015-07-31T14:56:46.637910Z", "routes": ["tc-treeherder.try.9cb5a212f74ecfe912f3cd390787a57cf60407ad", "tc-treeherder-stage.try.9cb5a212f74ecfe912f3cd390787a57cf60407ad"], "scopes": ["docker-worker:image:taskcluster/tester:0.3.5", "queue:define-task:aws-provisioner-v1/test-c4-2xlarge", "queue:create-task:aws-provisioner-v1/test-c4-2xlarge", "docker-worker:cache:tc-vcs", "docker-worker:cache:linux-cache", "docker-worker:capability:device:loopbackVideo", "docker-worker:capability:device:loopbackAudio"], "payload": {"artifacts": {"public/logs/": {"path": "/home/worker/build/upload/logs/", "expires": "2016-07-29T14:56:46.637744Z", "type": "directory"}, "public/test_info/": {"path": "/home/worker/build/blobber_upload_dir/", "expires": "2016-07-29T14:56:46.637806Z", "type": "directory"}, "public/build": {"path": "/home/worker/artifacts/", "expires": "2016-07-29T14:56:46.637838Z", "type": "directory"}}, "image": "taskcluster/tester:0.3.5", "cache": {"linux-cache": "/home/worker/.cache", "tc-vcs": "/home/worker/.tc-vcs"}, "capabilities": {"devices": {"loopbackVideo": true, "loopbackAudio": true}}, "maxRunTime": 3600, "command": ["entrypoint", "./bin/pull_gaia.sh &&", "python ./mozharness/scripts/gaia_unit.py --no-read-buildbot-config --config-file ./mozharness/configs/b2g/gaia_unit_production_config.py --config-file ./mozharness_configs/gaia_integration_override.py --config-file ./mozharness_configs/remove_executables.py --download-symbols ondemand --no-pull --installer-url https://queue.taskcluster.net/v1/task/GkC7V8ZUTN67KcfUvpN-2w/artifacts/public/build/target.linux-x86_64.tar.bz2 --test-packages-url https://queue.taskcluster.net/v1/task/GkC7V8ZUTN67KcfUvpN-2w/artifacts/public/build/test_packages.json --gaia-repo https://hg.mozilla.org/integration/gaia-central --gaia-dir /home/worker --xre-url https://queue.taskcluster.net/v1/task/wXAHAaxDQpqxoWF1iljJjg/runs/0/artifacts/public/cache/xulrunner-sdk-40.zip\n"], "env": {"MOZHARNESS_REPOSITORY": "https://hg.mozilla.org/build/mozharness", "MOZILLA_BUILD_URL": "https://queue.taskcluster.net/v1/task/GkC7V8ZUTN67KcfUvpN-2w/artifacts/public/build/target.linux-x86_64.tar.bz2", "GAIA_HEAD_REPOSITORY": "https://hg.mozilla.org/integration/gaia-central", "GAIA_REV": "5965f93d5645b666c209e6f3e339426e302d163c", "GAIA_REF": "5965f93d5645b666c209e6f3e339426e302d163c", "MOZHARNESS_REV": "31dad082f2e4", "GAIA_BASE_REPOSITORY": "https://hg.mozilla.org/integration/gaia-central"}}, "provisionerId": "aws-provisioner-v1", "metadata": {"owner": "mozilla-taskcluster-maintenance@mozilla.com", "source": "http://todo.com/soon", "name": "[TC] Gaia Unit Test", "description": "Gaia Unit Test"}}
07/30/2015 10:56:46 DEBUG: Found a positional argument: vzNJ6VpnT9-Ez4dQMQheuA
07/30/2015 10:56:46 DEBUG: After processing positional arguments, we have: {u'taskId': 'vzNJ6VpnT9-Ez4dQMQheuA'}
07/30/2015 10:56:46 DEBUG: After keyword arguments, we have: {u'taskId': 'vzNJ6VpnT9-Ez4dQMQheuA'}
07/30/2015 10:56:46 DEBUG: Route is: task/vzNJ6VpnT9-Ez4dQMQheuA
07/30/2015 10:56:46 DEBUG: Full URL used is: https://queue.taskcluster.net/v1/task/vzNJ6VpnT9-Ez4dQMQheuA
07/30/2015 10:56:46 INFO: Not using hawk!
07/30/2015 10:56:46 DEBUG: Making attempt 0
07/30/2015 10:56:46 DEBUG: Making a PUT request to https://queue.taskcluster.net/v1/task/vzNJ6VpnT9-Ez4dQMQheuA
07/30/2015 10:56:46 DEBUG: HTTP Headers: {'Content-Type': 'application/json'}
07/30/2015 10:56:46 DEBUG: HTTP Payload: "{\"workerType\": \"b2gtest\", \"taskGroupId\": \"vzNJ6VpnT9-Ez4dQMQheuA\", \"expires\": \"2015-08-0 (limit 100 char)
07/30/2015 10:56:46 DEBUG: Received HTTP Status: 400
07/30/2015 10:56:46 DEBUG: Received HTTP Headers: {'content-length': '12', 'via': '1.1 vegur', 'x-content-type-options': 'nosniff', 'x-powered-by': 'Express', 'server': 'Cowboy', 'connection': 'keep-alive', 'date': 'Thu, 30 Jul 2015 14:56:46 GMT', 'content-type': 'text/html; charset=utf-8'}
07/30/2015 10:56:46 DEBUG: Received HTTP Payload: Bad Request (limit 1024 char)

Traceback (most recent call last):
  File "mozci/scripts/misc/taskcluster_retrigger.py", line 116, in <module>
    main()
  File "mozci/scripts/misc/taskcluster_retrigger.py", line 95, in main
    result = queue.createTask(new_task_id, task)
  File "/home/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster-0.0.22-py2.7.egg/taskcluster/client.py", line 455, in apiCall
    return self._makeApiCall(e, *args, **kwargs)
  File "/home/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster-0.0.22-py2.7.egg/taskcluster/client.py", line 232, in _makeApiCall
    return self._makeHttpRequest(entry['method'], route, payload)
  File "/home/armenzg/venv/tc/lib/python2.7/site-packages/taskcluster-0.0.22-py2.7.egg/taskcluster/client.py", line 424, in _makeHttpRequest
    superExc=rerr
taskcluster.exceptions.TaskclusterRestFailure: None
garndt: if I paste the task into the task creator, it goes through; however, the task shows up as failed:
https://pastebin.mozilla.org/8841082 <-- task as pasted in the task creator
https://tools.taskcluster.net/task-inspector/#gvmpTJ3fRJuKuVJIj5m21g/
I've used a mach command to help me construct a similar task definition [1]. That task is running properly (I think) [2]. I've tried doing a diff between the two task definitions and I don't know why mine would not work [3][4].

garndt: would you mind trying to run my script? My client is called mozilla-pulse-actions (you need to set up the two env variables for the client before the script will work). This is what I run:

python mozci/scripts/misc/taskcluster_retrigger.py -r PLmvHjAGSUaoV3MUUYAHLQ --debug

Notice that the tasks I create through the creator don't come with logs (except for the one task I pasted from the mach command).

[1] ./mach taskcluster-graph --project try --message "try: -b o -p linux64_gecko -u gaia-unit" --base-repository http://hg.mozilla.org/mozilla-central --head-repository http://hg.mozilla.org/try --head-rev 06c5d926ab7a --revision-hash 9cb5a212f74ecfe912f3cd390787a57cf60407ad --owner armenzg@mozilla.com
[2] https://tools.taskcluster.net/task-inspector/#_wWS_0lzRwycaEHZeTOZnA
[3] https://queue.taskcluster.net/v1/task/_wWS_0lzRwycaEHZeTOZnA (good one)
[4] https://queue.taskcluster.net/v1/task/pqtsyr4NQ5qt0IRdq3Wg0Q (bad one)
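(For anyone else trying to run it: the script picks its credentials up from the environment, roughly like the following; the exact variable names here are an assumption on my part, not taken from the script.)

import os
import taskcluster

# Hypothetical variable names: the script expects two environment variables
# carrying the clientId and accessToken of the mozilla-pulse-actions client.
credentials = {
    'clientId': os.environ['TASKCLUSTER_CLIENT_ID'],
    'accessToken': os.environ['TASKCLUSTER_ACCESS_TOKEN'],
}
queue = taskcluster.Queue({'credentials': credentials})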
Flags: needinfo?(garndt)
(In reply to Armen Zambrano Gasparnian [:armenzg] from comment #13)
> garndt: if I paste the task into the task creator, it goes through, however,
> the task shows up as failed:
> https://pastebin.mozilla.org/8841082 <-- task as pasted in task creator
> https://tools.taskcluster.net/task-inspector/#gvmpTJ3fRJuKuVJIj5m21g/

The task shows up as failed because the expiration of the artifacts is before the created date/time, I believe.
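(A hedged sketch of the kind of sanity check that would avoid this when re-stamping a cloned task; purely illustrative, not the final mozci code, and it reuses fromNow the same way the earlier snippets in this bug do.)

import taskcluster

def align_artifact_expirations(task):
    # Illustrative only: keep every artifact's 'expires' inside the task's own
    # lifetime so the worker can still create/upload them after the task is created.
    task['expires'] = taskcluster.fromNow('365 days')
    for definition in task['payload'].get('artifacts', {}).itervalues():
        definition['expires'] = task['expires']
    return task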
Flags: needinfo?(garndt)
In comment 12 I got a Bad Request because I was submitting a dictionary instead of a JSON object. After I changed that, I started getting another type of issue, "None of the scope-sets was satisfied" [1].

This is the code which I'm executing:
https://github.com/armenzg/mozilla_ci_tools/blob/taskcluster_retrigger/mozci/scripts/misc/taskcluster_retrigger.py#L60

[1]
07/30/2015 03:17:10 DEBUG: Found a positional argument: kKmBkbE7ShuV8zZ2jZlhGw
07/30/2015 03:17:10 DEBUG: After processing positional arguments, we have: {u'taskId': 'kKmBkbE7ShuV8zZ2jZlhGw'}
07/30/2015 03:17:10 DEBUG: After keyword arguments, we have: {u'taskId': 'kKmBkbE7ShuV8zZ2jZlhGw'}
07/30/2015 03:17:10 DEBUG: Route is: task/kKmBkbE7ShuV8zZ2jZlhGw
07/30/2015 03:17:10 DEBUG: Full URL used is: https://queue.taskcluster.net/v1/task/kKmBkbE7ShuV8zZ2jZlhGw
07/30/2015 03:17:10 DEBUG: parsed URL parts: {'hostname': u'queue.taskcluster.net', 'path': u'/v1/task/kKmBkbE7ShuV8zZ2jZlhGw', 'port': 443, 'query': '', 'resource': u'/v1/task/kKmBkbE7ShuV8zZ2jZlhGw', 'scheme': u'https'}
07/30/2015 03:17:10 DEBUG: artifacts={'app': None, 'dlg': None, 'ext': 'e30=', 'hash': None, 'host': u'queue.taskcluster.net', 'method': u'put', 'nonce': 'DFUSI9', 'port': 443, 'resource': u'/v1/task/kKmBkbE7ShuV8zZ2jZlhGw', 'ts': 1438283830}
07/30/2015 03:17:10 DEBUG: Making attempt 0
07/30/2015 03:17:10 DEBUG: Making a PUT request to https://queue.taskcluster.net/v1/task/kKmBkbE7ShuV8zZ2jZlhGw
07/30/2015 03:17:10 DEBUG: HTTP Headers: {'Content-Type': 'application/json', 'Authorization': 'Hawk id="T9J-xA9JSUKQzfR99NRtMg", ts="1438283830", nonce="DFUSI9", ext="e30=", mac="YB/xLpz5YTIUxjqNyg+fqIWO4cRhI0b60yvriYZ5MLo="'}
07/30/2015 03:17:10 DEBUG: HTTP Payload: {"workerType":"b2gtest","taskGroupId":"kKmBkbE7ShuV8zZ2jZlhGw","expires":"2016-07-30T19:17:10.853802 (limit 100 char)
07/30/2015 03:17:11 DEBUG: Received HTTP Status: 401
07/30/2015 03:17:11 DEBUG: Received HTTP Headers: {'content-length': '187', 'via': '1.1 vegur', 'x-powered-by': 'Express', 'server': 'Cowboy', 'access-control-request-method': '*', 'connection': 'keep-alive', 'date': 'Thu, 30 Jul 2015 19:17:10 GMT', 'access-control-allow-origin': '*', 'access-control-allow-methods': 'OPTIONS,GET,HEAD,POST,PUT,DELETE,TRACE,CONNECT', 'content-type': 'application/json; charset=utf-8', 'access-control-allow-headers': 'X-Requested-With,Content-Type,Authorization,Accept,Origin'}
07/30/2015 03:17:11 DEBUG: Received HTTP Payload: {
  "message": "Authorization Failed",
  "error": {
    "info": "None of the scope-sets was satisfied",
    "scopesets": [
      "queue:create-task:aws-provisioner-v1/b2gtest"
    ]
  }
} (limit 1024 char)
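(One way to sanity-check the body before submitting it, as a sketch rather than the exact change that landed in mozci: round-trip the task through json so non-JSON values, e.g. a 'created' set from datetime.utcnow(), show up locally instead of as a 400 from the queue.)

import json
import datetime

def jsonify_task(task):
    # Illustrative only: return a plain-JSON copy of the task, turning any
    # datetime values into ISO strings and failing loudly on anything else.
    def default(obj):
        if isinstance(obj, datetime.datetime):
            return obj.strftime('%Y-%m-%dT%H:%M:%S.%fZ')
        raise TypeError('%r is not JSON serializable' % obj)
    return json.loads(json.dumps(task, default=default))

# hypothetical usage:
# queue.createTask(new_task_id, jsonify_task(task))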
The "None of the scope-sets was satisfied" code comes from here [1] Is there a way I can some debug code in there? I assume the python client interacts with the javascript code in the taskcluster setup and I can't add some debug logic to it. [1] https://github.com/taskcluster/taskcluster-base/blob/master/api.js#L601 [2] https://github.com/taskcluster/taskcluster-base/blob/master/utils.js#L50
Ok, just separating out the issues a little...

The task creator issue is caused by the task's expiration being one year minus a couple of days in the future, while the worker tries to create the live log artifact with an expiration one year from today. Artifacts are not allowed to be created with an expiration past the expiration of the task. This should be handled better on the worker side.

The "None of the scope-sets was satisfied" issue was first caused by a stale scope cache in the queue; after flushing that, there was a new set of scope issues. TaskCluster credentials used to create tasks must encompass the scopes within the task itself, which is not the case right now.
Thank you so much for your help! I will be cleaning up this code tomorrow and testing it a bit more.

The current scopes for the client are the following:
queue:create-task:*
queue:define-task:*
docker-worker:cache:*
docker-worker:capability:*
docker-worker:image:*
queue:route:tc-treeherder*

<garndt> armenzg: https://github.com/taskcluster/docker-worker/blob/master/lib/features/local_live_log.js#L141-L142
<armenzg> garndt, you rock! Thanks Greg! I will go on my happy way :)
<garndt> https://github.com/taskcluster/docker-worker/blob/master/config/defaults.js#L115
<garndt> really we should fix that to where the expiration of the live log artifact is not beyond the task expiration
<garndt> to avoid this situation
<garndt> but I think that most people are not specifying expires, and it defaults to 1 year past creation or something
<garndt> so you could also just remove setting expires entirely if you wanted to I think
<garndt> for the task that is
<garndt> armenzg: you're very welcome sir
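(Following that suggestion, the simplest variant for the retrigger script is probably to drop the explicit task expiration and take the queue's default; a tiny sketch, not necessarily what ends up in the PR, with a placeholder taskId.)

import taskcluster

queue = taskcluster.Queue()
task = queue.task('PLmvHjAGSUaoV3MUUYAHLQ')

# Drop the copied task-level expiration and let the queue default it
# (roughly one year past creation), so the worker's live log artifact
# still fits inside the task's lifetime.
task.pop('expires', None)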
I've managed to re-trigger all tasks on a push:
https://treeherder.mozilla.org/#/jobs?repo=try&author=armenzg@mozilla.com&filter-searchStr=taskcluster

My set of scopes for mozilla-pulse-actions is:
queue:create-task:*
queue:define-task:*
docker-worker:cache:*
docker-worker:capability:*
docker-worker:image:*
tc-treeherder*
queue:route:*

I'm going to add the credentials to the pulse_actions Heroku app and start working on the next mozci release.
garndt: could you also give me your feedback on taskcluster_retrigger.py? PR: https://github.com/armenzg/mozilla_ci_tools/pull/308/files
Attachment #8641877 - Flags: review?(cmanchester)
Attachment #8641877 - Flags: feedback?(garndt)
Comment on attachment 8641877 [details]
ability to retrigger tasks on TaskCluster + fix logging issue

Provided feedback in the PR. Thanks Armen!
Attachment #8641877 - Flags: feedback?(garndt) → feedback+
Comment on attachment 8641877 [details]
ability to retrigger tasks on TaskCluster + fix logging issue

Looks good, thanks Armen! There were a couple of questions; I'd like to take another look at any updates.
Attachment #8641877 - Flags: review?(cmanchester)
Comment on attachment 8641877 [details]
ability to retrigger tasks on TaskCluster + fix logging issue

garndt: I've added the 24-hour timedelta. I believe I've addressed all comments. Would you like me to do anything about scheduling.py? It was not clear from your last comment.
Attachment #8641877 - Flags: review?(cmanchester)
Comment on attachment 8641877 [details]
ability to retrigger tasks on TaskCluster + fix logging issue

Just a question about logging that I think needs to be addressed, but otherwise we're good to go here!
Attachment #8641877 - Flags: review?(cmanchester) → review+
Once there's actually a need to cancel TC tasks and I get to work on it, I will file a bug for it. Removing that from the scope and simply closing this.
Summary: Add the ability to re-trigger, cancel TaskCluster tasks to mozci → Add the ability to re-trigger TaskCluster tasks to mozci
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED