[Ronin Mojave] Upgrade generic worker to v16.4.0 multiuser
Categories: Infrastructure & Operations :: RelOps: Puppet (task)
Tracking: Not tracked
People: Reporter: dividehex; Assigned: dhouse; NeedInfo
References: Blocks 1 open bug
Attachments: 1 file (deleted), text/x-phabricator-request
The Mojave 10.14.0 and 10.14.5 (workerType: gecko-t-osx-1014) workers need to be upgraded to the latest version of generic-worker (v15.1.0). We don't have golang installed on the macOS hosts in the datacenters, but I believe the kv Edwin is referring to is the generic-worker build environment. Currently, on the Yosemite host, we are running gw v14.1.2, which was built against go1.10.8, and on the Mojave hosts we are running gw v13.0.3, which was built against go1.10.3.
Latest release is here: https://github.com/taskcluster/generic-worker/releases
[taskcluster 2019-06-25T02:56:51.198Z] "generic-worker": {
[taskcluster 2019-06-25T02:56:51.198Z] "go-arch": "amd64",
[taskcluster 2019-06-25T02:56:51.198Z] "go-os": "darwin",
[taskcluster 2019-06-25T02:56:51.198Z] "go-version": "go1.10.3",
[taskcluster 2019-06-25T02:56:51.198Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v13.0.3",
[taskcluster 2019-06-25T02:56:51.198Z] "revision": "4790b3e4a8b726daa41aa102091b53b192aa79ef",
[taskcluster 2019-06-25T02:56:51.198Z] "source": "https://github.com/taskcluster/generic-worker/commits/4790b3e4a8b726daa41aa102091b53b192aa79ef",
[taskcluster 2019-06-25T02:56:51.198Z] "version": "13.0.3"
vs.
[taskcluster 2019-06-26T09:41:07.930Z] "generic-worker": {
[taskcluster 2019-06-26T09:41:07.930Z] "go-arch": "amd64",
[taskcluster 2019-06-26T09:41:07.930Z] "go-os": "darwin",
[taskcluster 2019-06-26T09:41:07.930Z] "go-version": "go1.10.8",
[taskcluster 2019-06-26T09:41:07.930Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v14.1.2",
[taskcluster 2019-06-26T09:41:07.930Z] "revision": "13118c4c1ba10f863f39d6c623b3dd59ca6e0f00",
[taskcluster 2019-06-26T09:41:07.930Z] "source": "https://github.com/taskcluster/generic-worker/commits/13118c4c1ba10f863f39d6c623b3dd59ca6e0f00",
[taskcluster 2019-06-26T09:41:07.930Z] "version": "14.1.2"
:dividehex - I noticed that our macosx1014 machines are running an older version of go (1.10.3):
https://taskcluster-artifacts.net/E2Hbfw8sRTSmzPBsiRp6GA/0/public/logs/live_backing.log
For comparison, macosx1010 machines are running go 1.10.8:
https://taskcluster-artifacts.net/DM7hLXa0TDODqqt0QRUFog/0/public/logs/live_backing.log
Comment 1•5 years ago
:dividehex - thanks for this bug. I will create another bug to have a vendored six installed in the global environment for macosx1014 machines (to mirror the current macosx1010 setup), as noted in the same email.
Comment 2•5 years ago
The six issue with macosx1014 will not be addressed with a global dependency; rather, another bug will be created to investigate solutions.
Comment 3•5 years ago
Please note, you'll need at least generic-worker 15.1.2, since that includes a fix for bug 1566159. However, 15.1.4 is the current latest and, if that works for you, would be best. If you hit any problems with 15.1.4 (or 15.1.3), 15.1.2 is also usable, but nothing prior to that.
Thanks!
Comment 4•5 years ago
Installation instructions: https://github.com/taskcluster/generic-worker#macos---multiusersimple-build
Reporter
Comment 5•5 years ago
:dragrom,
I've looked over Pete's instructions. Most of the instructions already take place in the puppet config. Below are the highlights of what needs to change in puppet in order to upgrade to generic-worker 15.1.4 multiuser on macOS:
I would suggest copying the generic_worker::init class to a new class (generic_worker::multiuser) and modifying that.
- generic-worker needs to run as root under multiuser
- You'll also need to change the working dir to /etc/generic-worker (don't forget to also change it in the plist)
- The generic-worker wrapper script will need to be launched from /Library/LaunchDaemons/ instead of /Library/LaunchAgents/ (see the sketch after this list)
- You'll need to add the UserName key to the generic-worker.plist.erb
<key>UserName</key>
<string>root</string>
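For reference, a rough sketch of the launchd side of these changes as one-off commands (the net.generic.worker.plist filename and paths are assumptions; in practice puppet should template and manage these files):
# stop the old per-user agent and move the job to LaunchDaemons so it runs as a root daemon
sudo launchctl unload /Library/LaunchAgents/net.generic.worker.plist 2>/dev/null || true
sudo mv /Library/LaunchAgents/net.generic.worker.plist /Library/LaunchDaemons/net.generic.worker.plist
# add the UserName key so launchd starts generic-worker as root
sudo /usr/libexec/PlistBuddy -c "Add :UserName string root" /Library/LaunchDaemons/net.generic.worker.plist
sudo launchctl load -w /Library/LaunchDaemons/net.generic.worker.plist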
Reporter
Comment 6•5 years ago
On another note, since we are upgrading from gw 13.x.x, we will probably need to remove openpgpSigningKeyLocation from the config and the exec that creates it.
See the v14.0.0 release notes:
https://github.com/taskcluster/generic-worker/releases/tag/v14.0.0
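A minimal sketch of that removal on a worker, assuming the config is the JSON file at /etc/generic-worker/config (path assumed) and jq is available; the real change belongs in the puppet template/exec:
# drop the obsolete openpgpSigningKeyLocation entry from the generic-worker config
sudo jq 'del(.openpgpSigningKeyLocation)' /etc/generic-worker/config > /tmp/gw-config.new && sudo mv /tmp/gw-config.new /etc/generic-worker/config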
Comment 7•5 years ago
[root@t-mojave-r7-465.test.releng.mdc1.mozilla.com ~]# generic-worker --version
generic-worker (multiuser engine) 15.1.4 [ revision: https://github.com/taskcluster/generic-worker/commits/c407e45e3f019599005971b30993f29eb3c59b0d ]
Comment 8•5 years ago
[root@t-mojave-r7-465.test.releng.mdc1.mozilla.com ~]# launchctl list|grep worker
- 0 net.generic.worker
Comment 9•5 years ago
Comment 10•5 years ago
Comment 11•5 years ago
Comment 12•5 years ago
It looks like a lot of the failures are permissions issues with reading/writing /builds/tooltool_cache. This is expected - each task runs as a different user, so unless /builds/tooltool_cache is world readable/writable, I'd expect to see that.
It would be preferable for tasks not to touch files outside the home directory of the task user; in other words, for /builds/tooltool_cache not to exist, but for tasks to mount a writable directory cache into the task directory containing the tooltool cache.
If changing the location of the tooltool cache is deemed to be outside the scope of this bug, making /builds/tooltool_cache world readable/writable would be a simple way to overcome the issue, but this solution comes with the caveat that it isn't as secure as using a writable directory cache, which guards the cache with taskcluster scopes, meaning only tasks with the required scopes can modify the cache, not any task that runs on the worker type. This is somewhat mitigated by being able to restrict which tasks can run on the worker type, but a writable directory cache would still be better, if only for the fact it is more explicit about its intentions.
Update: on reflection I see there are still some open bugs relating to generic-worker writable directory caches that could possibly bite us, so it is perhaps better to just make /builds/tooltool_cache world readable/writable for now (see the sketch below).
Some of these bugs might not be relevant (e.g. Windows only, or solved but not updated) but it would be best for us to review them before we use the same feature on macOS, in case macOS shares any of the same issues (and in any case it would be good for us to resolve these open bugs).
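If we go with that simple workaround, the per-worker change is roughly the following (a sketch; the exact mode is a choice, and puppet should enforce it rather than a one-off command):
# make the shared tooltool cache usable by every generated task user
sudo mkdir -p /builds/tooltool_cache
sudo chmod 1777 /builds/tooltool_cache   # world read/write; the sticky bit stops task users deleting each other's entries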
Comment 13•5 years ago
I fixed the Permission denied issue, but tests still fail.
egao: Can you please have a look over the failing tests? https://treeherder.mozilla.org/#/jobs?repo=try&revision=5105a3f391d75408afec3ce4698dde53833dd950
Comment 14•5 years ago
:dragrom - I am not quite sure what is happening in this bug, though we are now on generic-worker version 16.0.0. Has that been tried?
Otherwise I am not sure why there may be permission issues with generic-worker 15.1.4 and not on 13.x.x.
I recall that for Linux the permissions are specified in the Dockerfile, but for Windows and MacOSX systems I am not so sure how the permissions are set for the directories.
Though, this issue does bring forth a thought of mine: why not just grant the task user ownership of its whole subdirectory? I don't know of a good reason why we do not do so.
Comment 15•5 years ago
Comment 16•5 years ago
(In reply to Dragos Crisan [:dragrom] from comment #15)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=8940705911105018fa7f797ee90f711e246f253c
Hi Dragos,
Is someone assigned to look into these failures?
Comment 17•5 years ago
I tested the implementation of generic-worker multiuser on the OS X Mojave workers (for each job we create a user and run the job under this user) and I have 15 tests that fail with something like a cut-and-paste issue: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8940705911105018fa7f797ee90f711e246f253c
Can you please have a look over them?
Comment 19•5 years ago
:dragrom, can we find someone else to look into these test failures?
Assignee
Comment 21•5 years ago
Dragos, can we ignore the test failures for now and get it running only for PGO? Then we can troubleshoot the test failures without blocking PGO builds. (tomprince pointed this out as a possible option to me in #ci)
Comment 22•5 years ago
Comment 23•5 years ago
Dave, what tests are in PGO builds, or how can I run only PGO builds? If PGO doesn't fail, we can ignore the failed tests for now.
Assignee
Comment 24•5 years ago
(In reply to Dragos Crisan [:dragrom] from comment #23)
Dave, what tests are in PGO builds, or how can I run only PGO builds? If PGO doesn't fail, we can ignore the failed tests for now.
Chris, could you point us at which tasks the macos PGO workers can test running? (or send some tasks at the gecko-t-osx-1014-staging worker type?) Then we can validate that the new generic-worker is working correctly for PGO and finish building the workers for that pool (not blocked by other tests/tasks failing still on the new generic-worker on macos).
From the parent bug 1528374, this patch adds some tasks with a "/pgo" suffix: https://phabricator.services.mozilla.com/D20409
(but it looks like that is for the worker type t-osx-1010-pgo).
Comment 25•5 years ago
Yes, I can rebase the patches from bug 1528374 this afternoon and push some tasks to gecko-t-osx-1014-staging. To clarify though, I think it's very unlikely that the tests themselves will be fixed from doing PGO on the browser. Perhaps Tom meant using the new worker for the run task only? I think that would make sense and probably work.
Assignee
Comment 26•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #25)
Yes, I can rebase the patches from bug 1528374 this afternoon and push some tasks to gecko-t-osx-1014-staging. To clarify though, I think it's very unlikely that the tests themselves will be fixed from doing PGO on the browser. Perhaps Tom meant using the new worker for the run task only? I think that would make sense and probably work.
Thanks. I'm not aware of the full situation, and was offering a "why don't you just" when it looked like we were delaying pgo for non-pgo failures.
Comment 27•5 years ago
Ok, I've pushed to try using the new worker here, but it's failing to find python in the expected location: https://tools.taskcluster.net/groups/WIql72cqStqnMCtkd8nvtw/tasks/CnjSrSZORJm-Hg3CGa7gaw/runs/0/logs/public%2Flogs%2Flive.log
Dragos, is python installed on these workers? Where should we be checking for it? The code at https://searchfox.org/mozilla-central/rev/7cc0f0e89cb40e43bf5c96906f13d44705401042/taskcluster/taskgraph/transforms/job/run_task.py#161 seems invalid in this case.
Comment 29•5 years ago
Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333
Dragos, is there something more we need to do to make these pick up the right mercurial?
Comment 30•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #29)
Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333
Dragos, is there something more we need to do to make these pick up the right mercurial?
Do we have a restriction on the Mercurial version? On the worker we use 5.1:
Mercurial Distributed SCM (version 5.1)
(see https://mercurial-scm.org for more information)
Copyright (C) 2005-2019 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Comment 31•5 years ago
(In reply to Dragos Crisan [:dragrom] from comment #30)
(In reply to Chris Manchester (:chmanchester) from comment #29)
Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333
Dragos, is there something more we need to do to make these pick up the right mercurial?
Do we have a restriction for mercurial version? On the worker we use 5.1
Mercurial Distributed SCM (version 5.1)
(see https://mercurial-scm.org for more information)
Copyright (C) 2005-2019 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Fixed the robustcheckout issue. Now we have another error: "Exception: Could not detect environment shell!" Working to fix this.
Comment 32•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #29)
Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333
Dragos, is there something more we need to do to make these pick up the right mercurial?
The error seems to come from process.py and indicates that at least one of the following environment variables should be set:
SHELL
MOZILLABUILD
COMSPEC
It looks like this requirement was introduced by :gps in late 2012 in Bug 799648 - Move MozbuildObject functionality into mach mixins.
Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):
1. Adding an environment variable, such as SHELL: /bin/bash, to the task definition (a minimal sketch follows at the end of this comment)
2. Adapting process.py to use a different mechanism to establish an appropriate shell to use
3. Adding a generic-worker feature to allow config-based env vars to be added to the environment for all tasks run on that worker, rolling it out, and updating the config to explicitly set SHELL in all tasks (which could potentially negatively impact other tasks that don't wish to have SHELL set).
Option 3 is not only the most complex and time-intensive, I also think it is the least desirable due to the lack of transparency for task authors that additional environment variables are set, the pollution of this effect to all tasks that run on the worker type, and the overhead of config maintenance for worker types. For this reason, only 1) and 2) seem like feasible options, with 1) winning for me due to its simplicity, explicitness, transparency and backwards-compatibility. :-)
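For concreteness, a minimal sketch of option 1 (names below are illustrative, not taken from a real task): the task definition's env block, or a small wrapper at the start of the task command, sets SHELL explicitly so tools such as process.py can detect a shell.
#!/bin/bash
# hypothetical task wrapper: make the shell explicit before anything that probes for one
export SHELL=/bin/bash   # same effect as adding SHELL: /bin/bash to the task definition's env
exec "$@"                # run the real task command passed as arguments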
Comment 33•5 years ago
I exported the SHELL env variable. Now the task fails with a robustcheckout error:
https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270450157&revision=96eb8e821db156b46289634dadcb74dfc57f1333
https://tools.taskcluster.net/groups/RLw_YX_CSQK85kXNOQBelw/tasks/DPBzSboCSfqg9OYeTY1jWA/runs/3/logs/public%2Flogs%2Flive.log
Chris, can you please have a look?
Comment 34•5 years ago
I'm not sure... Connor, does the error in comment 33 look familiar at all? There seems to be a robustcheckout/mercurial mis-match somewhere along the way here.
Comment 35•5 years ago
Yes, robustcheckout has a compatibility check to ensure the extension version supports the Mercurial version. You can fix this by using the newest version, which supports Mercurial 5.1, from here.
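On a worker this roughly amounts to replacing the vendored extension file (a sketch; the hg.mozilla.org path and the /usr/local/lib/hgext install location are assumptions, and puppet should really deploy the file):
# fetch an up-to-date robustcheckout and swap it in, then sanity-check that hg loads it
curl -fsSL -o /tmp/robustcheckout.py https://hg.mozilla.org/mozilla-central/raw-file/tip/testing/mozharness/external_tools/robustcheckout.py
sudo install -m 0644 /tmp/robustcheckout.py /usr/local/lib/hgext/robustcheckout.py
hg version --config extensions.robustcheckout=/usr/local/lib/hgext/robustcheckout.py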
Comment 36•5 years ago
Thanks Connor. Dragos, it looks like we need to update robustcheckout on the worker (comment 35), does that seem feasible?
Comment 37•5 years ago
Updated robustcheckout.
Chris, can you please send more pgo jobs to the osx-1014 staging pool?
Comment 38•5 years ago
Hi Dragos, thank you very much for the update. I was able to complete a profile-run task on the new worker. I believe it needs two updates (I worked around these in my try push):
- Enabling sparse checkout for hg (this was done for the prior worker here: bug 1517831)
- Exporting SHELL... further? I needed to export SHELL in my task's run script to get things to work. I'm not sure if this is supposed to happen at a different layer, or in the worker definition itself, or why comment 33 wasn't quite enough to get this working.
Please let me know if I can debug further or you have any other questions.
Comment 39•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #38)
Hi Dragos, thank you very much for the update. I was able to complete a profile-run task on the new worker. I believe it needs two updates (I worked around these in my try push):
- Enabling sparse checkout for hg (this was done for the prior worker here: bug 1517831)
- Exporting SHELL... further? I needed to export SHELL in my task's run script to get things to work. I'm not sure if this is supposed to happen at a different layer, or in the worker definition itself, or why comment 33 wasn't quite enough to get this working.
Looks like option 1 from comment 32 works in our case. I'll continue to look for a way to use environment variables in the script.
Please let me know if I can debug further or you have any other questions.
Chris, I saw errors on https://taskcluster-artifacts.net/EKr323zwTJioGOCotJjfsw/0/public/build/profile-run-1.log. Is this expected?
Comment 40•5 years ago
Those are just warnings, I'm fairly sure that's ok. Were you able to enable sparse checkout on that worker and export SHELL so the task sees it?
Comment 41•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #40)
Those are just warnings, I'm fairly sure that's ok. Were you able to enable sparse checkout on that worker and export SHELL so the task sees it?
sparse is enabled in mercurial:
[dcrisan@t-mojave-r7-463.test.releng.mdc1.mozilla.com ~]$ hg config|grep extensions
extensions.share=
extensions.rebase=
extensions.mq=
extensions.purge=
extensions.robustcheckout=/usr/local/lib/hgext/robustcheckout.py
extensions.sparse=
Now, I'm working to export SHELL
Comment 42•5 years ago
Chris, can you test pgo jobs without exporting SHELL in the task definition?
Comment 43•5 years ago
Hmm, that still seems to fail, as in https://treeherder.mozilla.org/#/jobs?repo=try&revision=4037fa5826947c4bb4c0da39cbbd119bab33fe6f&selectedJob=272487526
Can you take a look? Thanks!
Comment 44•5 years ago
I ran some tests but without success. I recommend using option 1 from https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c32 (define the SHELL variable in the task definition).
Comment 45•5 years ago
Ok, I added that to the task definition and everything seems to be fine. Thanks very much for your work on this!
Comment 46•5 years ago
:dragrom, does anything else need to be tested before deploying this?
Comment 47•5 years ago
Chris,
Per Pete Moore's recommendation, I updated generic-worker to the latest version, 16.4.0. Can you please run the tests on this version?
Thanks
Comment 48•5 years ago
Sure, I re-ran the job here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8a18f63064181de9ffe373c996eb2e3caf943cb5&selectedJob=273330931
That seems to have completed successfully, did it run against the correct version?
Comment 49•5 years ago
Chris,
Per Pete's request, I installed generic-worker version v16.5.2 on the Mojave staging pool.
Can you please run the pgo test again? I tried to re-run the pgo profiler task, but the task generated an empty json file.
Comment 50•5 years ago
I made another push, here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0fb7e65b0cc90f60286e12f8def98183c80c465
Please let me know if there's anything else I can test.
Comment 51•5 years ago
Chris,
Do you know if there will be other pgo tests?
Comment 52•5 years ago
This is the only job I'm planning to run against this pool. Do we need anything else to deploy?
Comment 53•5 years ago
Hi Chris,
I believe there are failures at the moment in Dragos's latest try push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa13d6d1c363fd37f7b4a437a1794442dafa3bf0&selectedJob=274833798) in the old taskcluster.net deployment due to the SHELL environment variable missing.
Rather than setting this explicitly in generic-worker, I think it may be better for us to explicitly set this in the task definitions that require it. It looks like most (if not all) the failures are run-task tasks; do you think it would be feasible for run-task tasks to automatically set the SHELL env var on linux/mac tasks?
Let me know what you think.
In the meantime, I believe Dragos is going to make a new try push which should get picked up in the new firefox-ci cluster, and then manually edit the failed tasks to add the SHELL environment variable to the task definitions, and see if this resolves the issue. Of course it only makes sense for us to update run-task to do this if we first see that it indeed fixes the issue.
Thanks!
Pete
Comment 54•5 years ago
Adding the variable to run-task sounds ok to me, but a lot of these tasks are sort of out of my area, so I don't have much insight there. For the pgo profile run I added it to the task definition, which I have no issue with.
My impression at one point was that we were going to deploy a small pool for the pgo run tasks with the necessary features but not block that on greening up all the tests. Is that feasible?
Comment 55•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #54)
Adding the variable to run-task sounds ok to me, but a lot of these tasks are sort of out of my area, so I don't have much insight there. For the pgo profile run I added it to the task definition, which I have no issue with.
Many thanks Chris!
My impression at one point was that we were going to deploy a small pool for the pgo run tasks with the necessary features but not block that on greening up all the tests. Is that feasible?
That was my understanding too. Dragos, can you confirm?
Comment 56•5 years ago
I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet and adding this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.
Comment 57•5 years ago
Hi Pete, does anything remain to be done to deploy this? As of comment 50 this is ready to go from my side. Thanks!
Comment 58•5 years ago
(In reply to Dragos Crisan [:dragrom] from comment #56)
I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet and adding this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.
Hi Dragos, I'm afraid that won't work. The environment variables of the task commands are not derived from the environment variables of the generic-worker process. The tasks run as a different user, with different environment variables, and there is no way to set additional environment variables in task commands via a worker configuration change - to do that would require a generic-worker feature (see comment 32).
However, from comment 54, Chris already added the env var for the PGO profile run, so you don't need to add it.
I believe the remaining work is to make the Mac PGO worker pool a production pool, and to adapt the gecko decision task to use the new worker pool for Mac PGO tasks. I think you will need to do the first of these, and either you or Chris can probably do the second, and I think that should complete the work for this bug. Upgrading other Mac workers to multiuser is outside the scope of this bug.
Chris, Dragos, can you confirm we are in agreement? Thanks!
Comment 59•5 years ago
Indeed, I have patches to run the profile-run tasks against the right worker type, everything else conforms to my understanding. Thanks!
Comment 60•5 years ago
Hi Joel,
I have some tests that fail with generic-worker multiuser. Here is the try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa13d6d1c363fd37f7b4a437a1794442dafa3bf0&selectedJob=274845388
We have 2 major types of failures:
- one type of failure is generated by the missing SHELL env variable. To fix this failure we need to define SHELL=/bin/bash in the task definition
- another type is generated by timeouts on cut/paste operations.
Can you please have a look and help us to fix these issues?
Comment 61•5 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #58)
(In reply to Dragos Crisan [:dragrom] from comment #56)
I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet and adding this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.
Hi Dragos, I'm afraid that won't work. The environment variables of the task commands are not derived from the environment variables of the generic-worker process. The tasks run as a different user, with different environment variables, and there is no way to set additional environment variables in task commands via a worker configuration change - to do that would require a generic-worker feature (see comment 32).
However, from comment 54, Chris already added the env var for the PGO profile run, so you don't need to add it.
I believe the remaining work is to make the Mac PGO worker pool a production pool, and to adapt the gecko decision task to use the new worker pool for Mac PGO tasks. I think you will need to do the first of these, and either you or Chris can probably do the second, and I think that should complete the work for this bug. Upgrading other Mac workers to multiuser is outside the scope of this bug.
I'll add an extra task: to fix the failed tests. For this, I put a NI to Joel
Chris, Dragos, can you confirm we are in agreement? Thanks!
Comment 62•5 years ago
Hi Joel,
I have some tests that fail with generic-worker multiuser. Here is the try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa13d6d1c363fd37f7b4a437a1794442dafa3bf0&selectedJob=274845388
We have 2 major types of failures:
one type of failure is generated by the missing SHELL env variable. To fix this failure we need to define SHELL=/bin/bash in the task definition
another type is generated by timeouts on cut/paste operations.
Can you please have a look and help us to fix these issues?
Comment 63•5 years ago
this is a push from November 6th; we did the taskcluster migration after that and I am unable to retrigger jobs. Can you push with something modern so I can determine which issues are intermittent vs real ones?
As for cut/paste, this is usually a hiccup when creating a new environment. What are the differences in generic-worker from the previous version to the new version?
Comment 64•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #63)
this is a push from November 6th, we did the taskcluster migration after that and I am unable to retrigger jobs. Can you push with something modern so I can determine what are intermittent issues vs real ones.
As for cut/paste, this is usually a hiccup when creating a new environment, what are the differences in generic worker from previous version to new version
Sorry for the wrong link. This is the correct link: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f74b594792302984fc753758c3555590c5c81595
Comment 65•5 years ago
ok, the errors are real. It seems that cut/paste and in general access to external tools/services is broken. Did permissions/users change with this updated generic worker?
Comment 66•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #65)
ok, the errors are real. It seems that cut/paste and in general access to external tools/services is broken. Did permissions/users change with this updated generic worker?
I can give you some details. The generic-worker multiuser engine creates a user for each task. The workflow is something like this:
- the worker claims a task
- generic-worker creates a new user and reboots the worker
- generic-worker runs the task under the created user
- if the task finishes successfully, generic-worker deletes the task user
(a small diagnostic sketch of this behaviour follows at the end of this comment)
On the previous version of generic-worker (single user), all tasks ran under the same user (cltbld).
For more details about generic-worker, Pete Moore is the right person to respond.
Can you also fix the tasks that fail with the missing SHELL env, by adding SHELL=/bin/bash to the task definition?
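To make the difference from the old single-user (cltbld) setup visible, a throwaway diagnostic task on the multiuser staging pool could run something like this (a sketch; the generated account name format, e.g. a task_* user, is generic-worker's own):
id -un                                                  # prints the per-task generated user, not cltbld
echo "HOME=$HOME USER=$USER SHELL=${SHELL:-<unset>}"    # SHELL stays unset unless the task sets it
env | sort                                              # the full environment generic-worker gives the task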
Comment 67•5 years ago
(In reply to Dragos Crisan [:dragrom] from comment #61)
(In reply to Pete Moore [:pmoore][:pete] from comment #58)
(In reply to Dragos Crisan [:dragrom] from comment #56)
I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet and adding this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.
Hi Dragos, I'm afraid that won't work. The environment variables of the task commands are not derived from the environment variables of the generic-worker process. The tasks run as a different user, with different environment variables, and there is no way to set additional environment variables in task commands via a worker configuration change - to do that would require a generic-worker feature (see comment 32).
However, from comment 54, Chris already added the env var for the PGO profile run, so you don't need to add it.
I believe the remaining work is to make the Mac PGO worker pool a production pool, and to adapt the gecko decision task to use the new worker pool for Mac PGO tasks. I think you will need to do the first of these, and either you or Chris can probably do the second, and I think that should complete the work for this bug. Upgrading other Mac workers to multiuser is outside the scope of this bug.
I'll add an extra task: to fix the failed tests. For this, I put a NI to Joel
Dragos, my understanding of this is that we can deploy a small, separate pool to run the profile-run tasks (bug 1530732), which does not require greening up the rest of the tests. Is something preventing us from doing this?
Chris, Dragos, can you confirm we are in agreement? Thanks!
Comment 68•5 years ago
:pmoore, can you help figure out why, on OSX with the new "user per task" approach, tests are not able to access external processes (i.e. the clipboard)? Maybe there are permissions issues, etc.?
Comment 69•5 years ago
:egao, can you help figure out what is going on with the python tests:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=f74b594792302984fc753758c3555590c5c81595&selectedJob=277193246
Comment 71•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #68)
:pmoore, can you help figure out why on OSX with the new "user per task" that tests are not able to access external processes (i.e. clipboard)? Maybe there are permissions, etc. ?
The test tasks need to include the SHELL env var in the task definition (see comment 53 and comment 54).
Comment 72•5 years ago
I installed generic-worker multiuser on the https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014 workers. Chris, these will be the generic workers allocated to pgo. For the moment, I pinned these workers to my branch. You can use these workers to run your tests.
Comment 73•5 years ago
Joel, do you have any updates regarding failing tests?
Comment 74•5 years ago
I will let :egao handle this, he still has a needinfo assigned.
Comment 75•5 years ago
Sorry - I should have mentioned this earlier, but I do not understand what this bug is trying to do. If the new generic-worker is causing macosx1014 tests to fail, isn't that an issue with the generic-worker?
If I could get some clarification around what is happening that would be appreciated.
Comment 76•5 years ago
Thank you very much, this looks good! It seems that on try these will run against the plain gecko-t-osx-1014 workers, which need to have the sparse profile extension enabled, but I'm ok with turning off sparse and not blocking on that for now.
Comment 77•5 years ago
Hi Dragos, can you please update the robustcheckout extension on these workers? Per https://bugzilla.mozilla.org/show_bug.cgi?id=1528374#c18 these jobs fail due to an older version. Thanks!
Assignee
Comment 78•5 years ago
(In reply to Edwin Takahashi (:egao, :etakahashi) from comment #75)
Sorry - I should have mentioned this earlier, but I do not understand what this bug is trying to do. If the new generic-worker is causing macosx1014 tests to fail, isn't that an issue with the generic-worker?
If I could get some clarification around what is happening that would be appreciated.
egao, I think there are some tests re: https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c69 (https://tools.taskcluster.net/groups/K01QCpGiSU2DjC2IEDddeQ/tasks/ecDVDAYFSWicPn5ohbeztw/runs/0/logs/public%2Flogs%2Flive_backing.log) still failing because of SHELL not being set by generic-worker
Assignee
Comment 79•5 years ago
(In reply to Pete Moore [:pmoore][:pete] from comment #32)
Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):
pmoore, I'm sorry to respond late to this. I did some digging and found the previous version of generic-worker passed the environment through to the task: https://github.com/taskcluster/generic-worker/blob/v13.0.1/plat_all-unix-style.go#L116
In multi-user/new generic-worker, if not changed to passing the parent environment (not the "generic-worker" parent, but the launchd/shell parent with its HOME/PATH/USER/etc), could SHELL be added to the small set of standard shell variables? https://github.com/taskcluster/generic-worker/blob/master/multiuser_posix.go#L106 (like HOME, PATH, and USER)
Comment 80•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #77)
Hi Dragos, can you please update the robustcheckout extension on these workers. Per https://bugzilla.mozilla.org/show_bug.cgi?id=1528374#c18 these jobs fail due to an older version. Thanks!
Hi Chris,
I updated robustcheckout to the latest version.
Comment 81•5 years ago
Chris, please let me know if everything is OK now.
Comment 82•5 years ago
This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!
Comment 83•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #82)
This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!
Looking at the gecko-t-osx-1014 workers, I saw the hg sparse extension is enabled on them:
[extensions]
share=
rebase=
mq=
purge=
robustcheckout=/usr/local/lib/hgext/robustcheckout.py
; Enable sparse checkout only in macOS worker
sparse=
Comment 84•5 years ago
(In reply to Dave House [:dhouse] from comment #79)
(In reply to Pete Moore [:pmoore][:pete] from comment #32)
Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):
pmoore, I'm sorry to respond late to this. I did some digging and found the previous version of generic-worker passed the environment through to the task: https://github.com/taskcluster/generic-worker/blob/v13.0.1/plat_all-unix-style.go#L116
In multi-user/new generic-worker, if not changed to passing the parent environment (not the "generic-worker" parent, but the launchd/shell parent with it's HOME/PATH/USER/etc), could SHELL be added to the small set of standard shell variables? https://github.com/taskcluster/generic-worker/blob/master/multiuser_posix.go#L106 (like HOME, PATH, and USER)
It can, but I feel this may be the wrong place to do it. I believe the SHELL environment variable should really be task-specific, in case not all tasks wish /bin/bash to be the value, for example. I think this differs from HOME and USER, which are specific to the user account running the task, so can more easily be pre-baked to appropriate values.
In the past tasks did indeed inherit environment variables from the generic-worker parent process, which is why it was there before, before we had OS task user separation.
Note, any task that requires SHELL to be set can set it, and this also makes the choice explicit, rather than hidden in generic-worker code. On reflection, we probably should not set PATH, but this would mean that almost all tasks would need to explicitly set it, and this could be a point of great frustration, so a reasonable default is provided. It could be argued that /bin/bash is a reasonable default for SHELL, but it feels a little more biased than setting a default PATH, since it really is a user (or in our case, task) preference where multiple reasonable alternatives could be set, rather than a default system PATH which contains the core system binary directories.
I think with Chris's changes this issue is solved, and will be more explicit in tasks, so overall better, but if that doesn't work out for some reason, we can open up this discussion, of course. This is my personal opinion, after all, which others may disagree with.
Comment 85•5 years ago
(In reply to Dragos Crisan [:dragrom] from comment #83)
(In reply to Chris Manchester (:chmanchester) from comment #82)
This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!
Looking on gecko-t-osx-1014 workers I saw hg sparse extension is enabled on them:
[extensions]
share=
rebase=
mq=
purge=
robustcheckout=/usr/local/lib/hgext/robustcheckout.py
; Enable sparse checkout only in macOS worker
sparse=
Ah, my mistake. We're seeing the error from comment 33 again, which means we need to update the robustcheckout extension. Can you take a look? Thanks!
Comment 86•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #85)
(In reply to Dragos Crisan [:dragrom] from comment #83)
(In reply to Chris Manchester (:chmanchester) from comment #82)
This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!
Looking on gecko-t-osx-1014 workers I saw hg sparse extension is enabled on them:
[extensions]
share=
rebase=
mq=
purge=
robustcheckout=/usr/local/lib/hgext/robustcheckout.py
; Enable sparse checkout only in macOS worker
sparse=
Ah, my mistake. We're seeing the error from comment 33 again, which means we need to update the robustcheckout extension. Can you take a look? Thanks!
Chris, I looked at the pgo workers and it looks like all jobs are green on these workers. Please let me know if you see any errors.
Comment 87•5 years ago
I'm still seeing issues enabling sparse profiles on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bd6752b83c636fcf058e315d87b4b53586c71f65&selectedJob=281790785 These jobs run against the regular gecko-t-osx-1014 workers on try, so we need them to succeed there as well.
I'm looking into this because it might help with bug 1604486, which is already pretty frequent.
Comment 88•5 years ago
(In reply to Chris Manchester (:chmanchester) from comment #87)
I'm still seeing issues enabling sparse profiles on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bd6752b83c636fcf058e315d87b4b53586c71f65&selectedJob=281790785 These jobs run against the regular gecko-t-osx-1014 workers on try, so we need them to succeed there as well.
I'm looking into this because it might help with bug 1604486, which is already pretty frequent.
Looks like robustcheckout needs to be upgraded in-tree. I'll send a patch.
Assignee
Comment 89•5 years ago
(In reply to Pete Moore [:pmoore][:pete] on PTO until 2020 from comment #84)
(In reply to Dave House [:dhouse] from comment #79)
(In reply to Pete Moore [:pmoore][:pete] from comment #32)
Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):
pmoore, I'm sorry to respond late to this. I did some digging and found the previous version of generic-worker passed the environment through to the task: https://github.com/taskcluster/generic-worker/blob/v13.0.1/plat_all-unix-style.go#L116
In multi-user/new generic-worker, if not changed to passing the parent environment (not the "generic-worker" parent, but the launchd/shell parent with it's HOME/PATH/USER/etc), could SHELL be added to the small set of standard shell variables? https://github.com/taskcluster/generic-worker/blob/master/multiuser_posix.go#L106 (like HOME, PATH, and USER)
It can, but I feel this may be the wrong place to do it. I believe the SHELL environment variable should really be task-specific, in case not all tasks wish /bin/bash to be the value, for example. I think this differs from HOME and USER, which are specific to the user account running the task, so can more easily be pre-baked to appropriate values.
In the past tasks did indeed inherit environment variables from the generic-worker parent process, which is why it was there before, before we had OS task user separation.
Note, any task that requires SHELL to be set can set it, and this also makes the choice explicit, rather than hidden in generic-worker code. On reflection, we probably should not set PATH, but this would mean that almost all tasks would need to explicitly set it, and this could be a point of great frustration, so a reasonable default is provided. It could be argued that /bin/bash is a reasonable default for SHELL, but it feels a little more biased than setting a default PATH, since it really is a user (or in our case, task) preference where multiple reasonable alternatives could be set, rather than a default system PATH which contains the core system binary directories.
I understand your point of view. If task writers/maintainers don't have a problem with that, then I'm fine with it.
I think with Chris's changes this issue is solved, and will be more explicit in tasks, so overall better, but if that doesn't work out for some reason, we can open up this discussion, of course. This is my personal opinion, after all, which others may disagree with.
The change to not having SHELL set exposes tasks that expect it as failures now. I think Chris had fixed the pgo tasks that required SHELL, and I think it is non-pgo tasks that were failing after that (possibly all because of mach).
Comment 90•5 years ago
Hi Chris,
Looks like pgo queue is green: https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014.
Did you see any issues, or is everything OK with the PGO build?
Comment 91•5 years ago
Everything appears to be good from my end. Thank you very much for your work on this!
Comment 92•5 years ago
Hi Joel,
Do you have any news about the failed non-pgo tasks?
Comment 93•5 years ago
if we have this deployed for all osx workers then there are no problems
Assignee
Comment 94•5 years ago
(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #93)
if we have this deployed for all osx workers then there are no problems
I'll update the osx worker staging pool so that we can verify this is working for all macos tasks.
This has been running for PGO and for the staging pool, but not for all osx workers (yet).
Assignee
Comment 95•4 years ago
The homebrew removal is complete (bug 1563799). So we can proceed with updating generic-worker on the macos staging and production pools.
Assignee
Comment 96•4 years ago
(In reply to Dave House [:dhouse] from comment #89)
The change to not having SHELL set exposes tasks that expect it as failures now. I think Chris had fixed the pgo tasks that required SHELL, and I think it is non-pgo tasks that were failing after that (possibly all because of mach).
This issue resurfaced now that we're setting up a multi-user generic-worker to run another type of task (builds).
The fix was to default to /bin/sh for non-win32 in mach: https://phabricator.services.mozilla.com/D97686
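A quick way to sanity-check that behaviour locally (a sketch; assumes a mozilla-central checkout, and that env -u is available, as it is on macOS and Linux):
# run mach with SHELL deliberately removed, as it is under multiuser generic-worker
env -u SHELL ./mach --help   # or re-run whichever mach command previously raised "Could not detect environment shell!"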
Comment 97•3 years ago
This has been updated in all places where we were able to.