Closed Bug 1561956 Opened 5 years ago Closed 3 years ago

[Ronin Mojave] Upgrade generic worker to v16.4.0 multiuser

Categories

(Infrastructure & Operations :: RelOps: Puppet, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dhouse, NeedInfo)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

The Mojave 10.14.0 and 10.14.5 (workerType: gecko-t-osx-1014) workers need to be upgraded to the latest version of generic-worker (v15.1.0). We don't have golang installed on the macOS hosts in the datacenters, but I believe the key/value Edwin is referring to reflects the gw build environment rather than an installed go. Currently, on the Yosemite host, we are running gw v14.1.2, which was built against go1.10.8, and on the Mojave hosts we are running gw v13.0.3, which was built against go1.10.3.

Latest release is here: https://github.com/taskcluster/generic-worker/releases

[taskcluster 2019-06-25T02:56:51.198Z] "generic-worker": {
[taskcluster 2019-06-25T02:56:51.198Z] "go-arch": "amd64",
[taskcluster 2019-06-25T02:56:51.198Z] "go-os": "darwin",
[taskcluster 2019-06-25T02:56:51.198Z] "go-version": "go1.10.3",
[taskcluster 2019-06-25T02:56:51.198Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v13.0.3",
[taskcluster 2019-06-25T02:56:51.198Z] "revision": "4790b3e4a8b726daa41aa102091b53b192aa79ef",
[taskcluster 2019-06-25T02:56:51.198Z] "source": "https://github.com/taskcluster/generic-worker/commits/4790b3e4a8b726daa41aa102091b53b192aa79ef",
[taskcluster 2019-06-25T02:56:51.198Z] "version": "13.0.3"

vs.

[taskcluster 2019-06-26T09:41:07.930Z] "generic-worker": {
[taskcluster 2019-06-26T09:41:07.930Z] "go-arch": "amd64",
[taskcluster 2019-06-26T09:41:07.930Z] "go-os": "darwin",
[taskcluster 2019-06-26T09:41:07.930Z] "go-version": "go1.10.8",
[taskcluster 2019-06-26T09:41:07.930Z] "release": "https://github.com/taskcluster/generic-worker/releases/tag/v14.1.2",
[taskcluster 2019-06-26T09:41:07.930Z] "revision": "13118c4c1ba10f863f39d6c623b3dd59ca6e0f00",
[taskcluster 2019-06-26T09:41:07.930Z] "source": "https://github.com/taskcluster/generic-worker/commits/13118c4c1ba10f863f39d6c623b3dd59ca6e0f00",
[taskcluster 2019-06-26T09:41:07.930Z] "version": "14.1.2"

:dividehex - I noticed that our macosx1014 machines are running an older version of go (1.10.3):
https://taskcluster-artifacts.net/E2Hbfw8sRTSmzPBsiRp6GA/0/public/logs/live_backing.log

For comparison, macosx1010 machines are running go 1.10.8:
https://taskcluster-artifacts.net/DM7hLXa0TDODqqt0QRUFog/0/public/logs/live_backing.log

:dividehex - thanks for this bug. I will create another bug to have vendored six installed in the global environment for macosx1014 machines (to mirror current macosx1010 setup) as noted in the same email.

The six issue with macosx1014 will not be addressed with a global dependency; instead, another bug will be created to investigate solutions.

Assignee: jwatkins → dcrisan
Summary: [Ronin Mojave] Upgrade generic worker to v15.1.0 → [Ronin Mojave] Upgrade generic worker to v15.1.0 multiuser
Status: NEW → ASSIGNED

Please note, you'll need at least generic-worker 15.1.2, since that includes a fix for bug 1566159. However, 15.1.4 is the current latest and would be best if it works for you. If you hit any problems with 15.1.4 (or 15.1.3), 15.1.2 is also usable, but nothing prior to that.

Thanks!

Summary: [Ronin Mojave] Upgrade generic worker to v15.1.0 multiuser → [Ronin Mojave] Upgrade generic worker to v15.1.2 (or higher) multiuser
Blocks: 1530732
Summary: [Ronin Mojave] Upgrade generic worker to v15.1.2 (or higher) multiuser → [Ronin Mojave] Upgrade generic worker to v15.1.4 (or higher) multiuser

:dragrom,

I've looked over Pete's instructions. Most of them are already handled in the puppet config. Below are the highlights of what needs to change in puppet in order to upgrade to gw 15.1.4 multiuser on macOS:

I would suggest copying the generic_worker::init class to a new class (generic_worker::multiuser) and modify that.

  1. generic-worker needs to run as root under multiuser
  2. You'll also need to change the working dir to /etc/generic-worker (don't forget to also change it in the plist)
  3. The generic-worker wrapper script will need to be launched from /Library/LaunchDaemons/ instead of /Library/LaunchAgents/
  4. You'll need to add the UserName key to the generic-worker.plist.erb (see the plist sketch below):
<key>UserName</key>
<string>root</string>
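
For reference, here is a minimal sketch of what the resulting LaunchDaemon plist might look like once those changes are applied. The label matches the one shown by launchctl later in this bug; the wrapper-script path and the RunAtLoad/KeepAlive settings are assumptions for illustration, not values taken from the actual puppet template.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>net.generic.worker</string>
  <key>ProgramArguments</key>
  <array>
    <!-- assumed wrapper script location -->
    <string>/usr/local/bin/run-generic-worker.sh</string>
  </array>
  <!-- working directory moved to /etc/generic-worker (step 2) -->
  <key>WorkingDirectory</key>
  <string>/etc/generic-worker</string>
  <!-- run as root under multiuser (steps 1 and 4) -->
  <key>UserName</key>
  <string>root</string>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>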

On another note, since we are upgrading from gw 13.x.x, we will probably need to remove openpgpSigningKeyLocation from the config and the exec that creates it.

See the v14.0.0 release notes
https://github.com/taskcluster/generic-worker/releases/tag/v14.0.0

[root@t-mojave-r7-465.test.releng.mdc1.mozilla.com ~]# launchctl list|grep worker

  - 0 net.generic.worker

It looks like a lot of the failures are permissions issues with reading/writing /builds/tooltool_cache. This is expected - each task runs as a different user, so unless /builds/tooltool_cache is world readable/writable, I'd expect to see that.

It would be preferable for tasks not to touch files outside the home directory of the task user; in other words for /builds/tooltool_cache not to exist but tasks to mount a writable directory cache into the task directory containing the tooltool cache.

If changing the location of the tooltool cache is deemed to be outside the scope of this bug, making /builds/tooltool_cache world readable/writable would be a simple way to overcome the issue. The caveat is that this isn't as secure as using a writable directory cache, which guards the cache with taskcluster scopes, so only tasks with the required scopes can modify the cache, rather than any task that runs on the worker type. This is somewhat mitigated by being able to restrict which tasks can run on the worker type, but a writable directory cache would still be better, if only because it is more explicit about its intentions.
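
For illustration, a writable directory cache in a generic-worker task would look roughly like the payload fragment below; the cache name, directory, and scope string here are placeholders made up for this sketch, not values these tasks actually use.

"scopes": [
  "generic-worker:cache:gecko-t-osx-1014-tooltool-cache"
],
"payload": {
  "mounts": [
    {
      "cacheName": "gecko-t-osx-1014-tooltool-cache",
      "directory": "tooltool_cache"
    }
  ]
}

With something like this, the task would read and write the cache under the task directory's tooltool_cache rather than /builds/tooltool_cache, and only tasks granted the matching cache scope could modify it.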


Update: on reflection I see there are still some open bugs relating to generic-worker writable directory caches that could possibly bite us, perhaps better to just make /builds/tooltool_cache world readable/writable for now.

Some of these bugs might not be relevant (e.g. Windows only, or solved but not updated) but it would be best for us to review them before we use the same feature on macOS, in case macOS shares any of the same issues (and in any case it would be good for us to resolve these open bugs).

I fixed the Permission denied issue, but tests still fail.
egao: Can you please have a look over the failed tests? https://treeherder.mozilla.org/#/jobs?repo=try&revision=5105a3f391d75408afec3ce4698dde53833dd950

Flags: needinfo?(egao)

:dragrom - I am not quite sure what is happening in this bug, though we are now on generic-worker version 16.0.0. Has that been tried?

Otherwise I am not sure why there may be permission issues with generic-worker 15.1.4 and not on 13.x.x.

I recall that for Linux the permissions are specified in the Dockerfile, but for Windows and MacOSX systems I am not so sure how the permissions are set for the directories.

Though, this issue does bring forth a thought of mine; why not just grant to the task user ownership of the whole subdirectory for the task user? I don't know of a good reason why we do not do so.

Flags: needinfo?(egao)

(In reply to Dragos Crisan [:dragrom] from comment #15)

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8940705911105018fa7f797ee90f711e246f253c

Hi Dragos,
Is someone assigned to look into these failures?

Flags: needinfo?(dcrisan)

I tested the implementation of generic-worker multiuser on OS X Mojave workers (for each job we create a user and run the job under this user) and I have 15 tests that fail with something like a cut-and-paste issue: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8940705911105018fa7f797ee90f711e246f253c
Can you please have a look over them?

Flags: needinfo?(dcrisan) → needinfo?(enndeakin)

I think you have the wrong person.

Flags: needinfo?(enndeakin)

:dragrom, can we find someone else to look into these test failures?

Flags: needinfo?(dcrisan)

I pinged masayuki

Flags: needinfo?(dcrisan)

Dragos, can we ignore the test failures for now and get it running only for PGO? Then we can troubleshoot the test failures without blocking PGO builds. (tomprince pointed this out as a possible option to me in #ci)

Flags: needinfo?(dcrisan)
Flags: needinfo?(dcrisan)

Dave, what tests are in PGO builds, or how can I run only PGO builds? If PGO does not fail, we can ignore the failed tests for now.

Flags: needinfo?(dhouse)

(In reply to Dragos Crisan [:dragrom] from comment #23)

Dave, what tests are in PGO builds, or how can I run only PGO builds? If PGO does not fail, we can ignore the failed tests for now.

Chris, could you point us at which tasks the macos PGO workers can test running? (or send some tasks to the gecko-t-osx-1014-staging worker type?) Then we can validate that the new generic-worker is working correctly for PGO and finish building the workers for that pool (not blocked by other tests/tasks still failing on the new generic-worker on macos).

From the parent bug 1528374, this patch adds some tasks with a "/pgo" suffix: https://phabricator.services.mozilla.com/D20409
(but it looks like that is for the worker type t-osx-1010-pgo).

Flags: needinfo?(dhouse) → needinfo?(cmanchester)

Yes, I can rebase the patches from bug 1528374 this afternoon and push some tasks to gecko-t-osx-1014-staging. To clarify though, I think it's very unlikely that the tests themselves will be fixed from doing PGO on the browser. Perhaps Tom meant using the new worker for the run task only? I think that would make sense and probably work.

Flags: needinfo?(cmanchester)

(In reply to Chris Manchester (:chmanchester) from comment #25)

Yes, I can rebase the patches from bug 1528374 this afternoon and push some tasks to gecko-t-osx-1014-staging. To clarify though, I think it's very unlikely that the tests themselves will be fixed from doing PGO on the browser. Perhaps Tom meant using the new worker for the run task only? I think that would make sense and probably work.

Thanks. I'm not aware of the full situation, and was offering a "why don't you just" when it looked like we were delaying pgo for non-pgo failures.

Ok, I've pushed to try using the new worker here, but it's failing to find python in the expected location: https://tools.taskcluster.net/groups/WIql72cqStqnMCtkd8nvtw/tasks/CnjSrSZORJm-Hg3CGa7gaw/runs/0/logs/public%2Flogs%2Flive.log

Dragos, is python installed on these workers? Where should we be checking for it? The code at https://searchfox.org/mozilla-central/rev/7cc0f0e89cb40e43bf5c96906f13d44705401042/taskcluster/taskgraph/transforms/job/run_task.py#161 seems invalid in this case.

Flags: needinfo?(dcrisan)

Sorry, I believe I found the issue...

Flags: needinfo?(dcrisan)

Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333

Dragos, is there something more we need to do to make these pick up the right mercurial?

Flags: needinfo?(dcrisan)

(In reply to Chris Manchester (:chmanchester) from comment #29)

Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333

Dragos, is there something more we need to do to make these pick up the right mercurial?

Do we have a restriction on the Mercurial version? On the worker we use 5.1:

Mercurial Distributed SCM (version 5.1)
(see https://mercurial-scm.org for more information)

Copyright (C) 2005-2019 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

(In reply to Dragos Crisan [:dragrom] from comment #30)

(In reply to Chris Manchester (:chmanchester) from comment #29)

Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333

Dragos, is there something more we need to do to make these pick up the right mercurial?

Do we have a restriction on the Mercurial version? On the worker we use 5.1:

Mercurial Distributed SCM (version 5.1)
(see https://mercurial-scm.org for more information)

Copyright (C) 2005-2019 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Fixed the robustcheckout issue. Now we have another error: "Exception: Could not detect environment shell!" Working to fix this.

Flags: needinfo?(cmanchester)

(In reply to Chris Manchester (:chmanchester) from comment #29)

Getting a bit further, it looks like these workers aren't seeing the right mercurial version (although I can't say I understand the error, exactly): https://treeherder.mozilla.org/#/jobs?repo=try&selectedJob=270371622&revision=96eb8e821db156b46289634dadcb74dfc57f1333

Dragos, is there something more we need to do to make these pick up the right mercurial?

The error seems to come from process.py and indicates that at least one of the following environment variables should be set:

  • SHELL
  • MOZILLABUILD
  • COMSPEC

It looks like this requirement was introduced by :gps in late 2012 in Bug 799648 - Move MozbuildObject functionality into mach mixins.
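
Roughly, the check behaves like this sketch (an illustration of the logic described above, not the actual process.py code):

import os

def detect_shell():
    # Illustrative only: try the three variables named above in turn,
    # and fail the same way mach does when none of them is set.
    for var in ("SHELL", "MOZILLABUILD", "COMSPEC"):
        value = os.environ.get(var)
        if value:
            return value
    raise Exception("Could not detect environment shell!")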

Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):

  1. Adding an environment variable, such as SHELL: /bin/bash to the task definition
  2. Adapting process.py to use a different mechanism to establish an appropriate shell to use
  3. Adding a generic-worker feature to allow config-based env vars to be added to the environment for all tasks run on that worker, rolling it out, and updating the config to explicitly set SHELL in all tasks (which could potentially negatively impact other tasks that don't wish to have SHELL set).

Option 3 is not only the most complex and time-intensive, but I also think it is the least desirable, due to the lack of transparency for task authors that additional environment variables are set, the pollution of this effect to all tasks that run on the worker type, and the overhead of config maintenance for worker types. For this reason, only 1) and 2) seem like feasible options, with 1) winning for me due to its simplicity, explicitness, transparency and backwards-compatibility. :-)
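
As a concrete example of option 1, the task definition would just gain one entry in its env section; the fragment below is a hypothetical generic-worker payload excerpt, not a complete task definition.

"payload": {
  "env": {
    "SHELL": "/bin/bash"
  }
}

For in-tree tasks, the equivalent change would be adding SHELL to the task's env in the relevant taskgraph kind or transform.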

Flags: needinfo?(cmanchester)
Flags: needinfo?(cmanchester)

I'm not sure... Connor, does the error in comment 33 look familiar at all? There seems to be a robustcheckout/mercurial mismatch somewhere along the way here.

Flags: needinfo?(cmanchester)
Flags: needinfo?(sheehan)

Yes, robustcheckout has a compatibility check to ensure the extension version supports the Mercurial version. You can fix this by using the newest version, which supports Mercurial 5.1, from here.

Flags: needinfo?(sheehan)

Thanks Connor. Dragos, it looks like we need to update robustcheckout on the worker (comment 35), does that seem feasible?

Flags: needinfo?(dcrisan)

Updated robustcheckout.
Chris, can you please send another PGO job to the osx-1014 staging pool?

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

Hi Dragos, thank you very much for the update. I was able to complete a profile-run task on the new worker. I believe it needs two updates (I worked around these in my try push):

  • Enabling sparse checkout for hg (this was done for the prior worker here: bug 1517831)
  • Exporting SHELL... further? I needed to export SHELL in my task's run script to get things to work. I'm not sure if this is supposed to happen at a different layer, or in the worker definition itself, or why comment 33 wasn't quite enough to get this working.

Please let me know if I can debug further or you have any other questions.

Flags: needinfo?(cmanchester) → needinfo?(dcrisan)

(In reply to Chris Manchester (:chmanchester) from comment #38)

Hi Dragos, thank you very much for the update. I was able to complete a profile-run task on the new worker. I believe it needs two updates (I worked around these in my try push):

  • Enabling sparse checkout for hg (this was done for the prior worker here: bug 1517831)
  • Exporting SHELL... further? I needed to export SHELL in my task's run script to get things to work. I'm not sure if this is supposed to happen at a different layer, or in the worker definition itself, or why comment 33 wasn't quite enough to get this working.
    Looks like option 1 from comment 32 works in our case. I'll continue to look for a way to use environment variables in the script.

Please let me know if I can debug further or you have any other questions.

Chris, I saw errors on https://taskcluster-artifacts.net/EKr323zwTJioGOCotJjfsw/0/public/build/profile-run-1.log. Is this expected?

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

Those are just warnings, I'm fairly sure that's ok. Were you able to enable sparse checkout on that worker and export SHELL so the task sees it?

Flags: needinfo?(cmanchester) → needinfo?(dcrisan)

(In reply to Chris Manchester (:chmanchester) from comment #40)

Those are just warnings, I'm fairly sure that's ok. Were you able to enable sparse checkout on that worker and export SHELL so the task sees it?

sparse is enabled in mercurial:

[dcrisan@t-mojave-r7-463.test.releng.mdc1.mozilla.com ~]$ hg config|grep extensions
extensions.share=
extensions.rebase=
extensions.mq=
extensions.purge=
extensions.robustcheckout=/usr/local/lib/hgext/robustcheckout.py
extensions.sparse=

Now I'm working on exporting SHELL.

Flags: needinfo?(dcrisan)

Chris, can you test PGO jobs without exporting SHELL in the task definition?

Flags: needinfo?(cmanchester)
Flags: needinfo?(cmanchester) → needinfo?(dcrisan)

I ran some tests but without success. I recommend using option 1 from https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c32 (define the SHELL variable in the task definition).

Flags: needinfo?(dcrisan)

Ok, I added that to the task definition and everything seems to be fine. Thanks very much for your work on this!

:dragrom, does anything else need to be tested before deploying this?

Flags: needinfo?(dcrisan)

Chris,

Per Pete Moore's recommendation, I updated generic-worker to the latest version: 16.4.0. Can you please run the tests on this version?

Thanks

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)
Summary: [Ronin Mojave] Upgrade generic worker to v15.1.4 (or higher) multiuser → [Ronin Mojave] Upgrade generic worker to v16.4.0 multiuser

Sure, I re-ran the job here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8a18f63064181de9ffe373c996eb2e3caf943cb5&selectedJob=273330931

That seems to have completed successfully, did it run against the correct version?

Flags: needinfo?(cmanchester) → needinfo?(dcrisan)

Chris,
Per Pete's request, I installed generic-worker v16.5.2 on the Mojave staging pool.

Can you please run the PGO tests again? I tried to re-run the PGO profiler task, but the task generated an empty JSON file.

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

I made another push, here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c0fb7e65b0cc90f60286e12f8def98183c80c465

Please let me know if there's anything else I can test.

Flags: needinfo?(cmanchester) → needinfo?(dcrisan)

Chris,
Do you know if there will be other PGO tests?

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

This is the only job I'm planning to run against this pool. Do we need anything else to deploy?

Flags: needinfo?(cmanchester) → needinfo?(dcrisan)

Hi Chris,

I believe there are failures at the moment in Dragos's latest try push (https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa13d6d1c363fd37f7b4a437a1794442dafa3bf0&selectedJob=274833798) in the old taskcluster.net deployment due to the SHELL environment variable missing.

Rather than setting this explicitly in generic-worker, I think it may be better for us to explicitly set this in the task definitions that require it. It looks like most (if not all) of the failures are run-task tasks; do you think it would be feasible for run-task to automatically set the SHELL env var on linux/mac tasks?

Let me know what you think.

In the meantime, I believe Dragos is going to make a new try push which should get picked up in the new firefox-ci cluster, and then manually edit the failed tasks to add the SHELL environment variable to the task definitions, and see if this resolves the issue. Of course it only makes sense for us to update run-task to do this, if we first see that it indeed fixes the issue.

Thanks!
Pete

Flags: needinfo?(cmanchester)

Adding the variable to run-task sounds ok to me, but a lot of these tasks are sort of out of my area, so I don't have much insight there. For the pgo profile run I added it to the task definition, which I have no issue with.

My impression at one point was that we were going to deploy a small pool for the pgo run tasks with the necessary features but not block that on greening up all the tests. Is that feasible?

Flags: needinfo?(cmanchester) → needinfo?(pmoore)

(In reply to Chris Manchester (:chmanchester) from comment #54)

Adding the variable to run-task sounds ok to me, but a lot of these tasks are sort of out of my area, so I don't have much insight there. For the pgo profile run I added it to the task definition, which I have no issue with.

Many thanks Chris!

My impression at one point was that we were going to deploy a small pool for the pgo run tasks with the necessary features but not block that on greening up all the tests. Is that feasible?

That was my understanding too. Dragos, can you confirm?

Flags: needinfo?(pmoore)

I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet to add this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.

Flags: needinfo?(dcrisan)

Hi Pete, does anything remain to be done to deploy this? As of comment 50 this is ready to go from my side. Thanks!

Flags: needinfo?(pmoore)

(In reply to Dragos Crisan [:dragrom] from comment #56)

I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet to add this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.

Hi Dragos, I'm afraid that won't work. The environment variables of the task commands are not derived from the environment variables of the generic-worker process. The tasks run as a different user, with different environment variables, and there is no way to set additional environment variables in task commands via a worker configuration change - to do that would require a generic-worker feature (see comment 32).

However, from comment 54, Chris already added the env var for the PGO profile run, so you don't need to add it.

I believe the remaining work is to make the Mac PGO worker pool a production pool, and to adapt the gecko decision task to use the new worker pool for Mac PGO tasks. I think you will need to do the first of these, and either you or Chris can probably do the second, and I think that should complete the work for this bug. Upgrading other Mac workers to multiuser is outside the scope of this bug.

Chris, Dragos, can you confirm we are in agreement? Thanks!

Flags: needinfo?(pmoore)
Flags: needinfo?(dcrisan)
Flags: needinfo?(cmanchester)

Indeed, I have patches to run the profile-run tasks against the right worker type, everything else conforms to my understanding. Thanks!

Flags: needinfo?(cmanchester)

Hi Joel,

I have some tests that fail with generic-worker multiuser. Here is the try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa13d6d1c363fd37f7b4a437a1794442dafa3bf0&selectedJob=274845388

We have 2 major types of failures:

  • one type of failure is generated by the missing SHELL env variable. To fix this failure, we need to define SHELL=/bin/bash in the task definition
  • another type is generated by a timeout on a cut/paste operation.

Can you please have a look and help us to fix these issues?

Flags: needinfo?(dcrisan) → needinfo?(jmaher)

(In reply to Pete Moore [:pmoore][:pete] from comment #58)

(In reply to Dragos Crisan [:dragrom] from comment #56)

I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet to add this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.

Hi Dragos, I'm afraid that won't work. The environment variables of the task commands are not derived from the environment variables of the generic-worker process. The tasks run as a different user, with different environment variables, and there is no way to set additional environment variables in task commands via a worker configuration change - to do that would require a generic-worker feature (see comment 32).

However, from comment 54, Chris already added the env var for the PGO profile run, so you don't need to add it.

I believe the remaining work is to make the Mac PGO worker pool a production pool, and to adapt the gecko decision task to use the new worker pool for Mac PGO tasks. I think you will need to do the first of these, and either you or Chris can probably do the second, and I think that should complete the work for this bug. Upgrading other Mac workers to multiuser is outside the scope of this bug.

I'll add an extra task: fixing the failed tests. For this, I put an NI on Joel.

Chris, Dragos, can you confirm we are in agreement? Thanks!

Flags: needinfo?(jmaher)

Hi Joel,

I have some tests that fail with generic-worker multiuser. Here is the try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=fa13d6d1c363fd37f7b4a437a1794442dafa3bf0&selectedJob=274845388

We have 2 major types of failures:

one type of failure is generated by the missing SHELL env variable. To fix this failure, we need to define SHELL=/bin/bash in the task definition
another type is generated by a timeout on a cut/paste operation.

Can you please have a look and help us to fix these issues?

Flags: needinfo?(jmaher)

This is a push from November 6th; we did the taskcluster migration after that and I am unable to retrigger jobs. Can you push something more recent so I can determine which issues are intermittent vs. real ones?

As for cut/paste, this is usually a hiccup when creating a new environment. What are the differences in generic-worker from the previous version to the new version?

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #63)

This is a push from November 6th; we did the taskcluster migration after that and I am unable to retrigger jobs. Can you push something more recent so I can determine which issues are intermittent vs. real ones?

As for cut/paste, this is usually a hiccup when creating a new environment. What are the differences in generic-worker from the previous version to the new version?

Sorry for the wrong link. This is the correct link: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f74b594792302984fc753758c3555590c5c81595

Flags: needinfo?(jmaher)

ok, the errors are real. It seems that cut/paste and in general access to external tools/services is broken. Did permissions/users change with this updated generic worker?

Flags: needinfo?(jmaher)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #65)

ok, the errors are real. It seems that cut/paste and in general access to external tools/services is broken. Did permissions/users change with this updated generic worker?

I can give you some details. Generic-worker multiuser creates a user for each task. The workflow is something like this:

  • the worker claims a task
  • generic-worker creates a new user and reboots the worker
  • generic-worker runs the task under the created user
  • if the task finishes successfully, generic-worker deletes the task user

On the previous version of generic-worker (generic-worker single user), all tasks ran under the same user (cltbld).

For more details about generic-worker, Pete Moore is the right person to ask.

Can you also fix the tasks that fail with the missing SHELL env, by adding SHELL=/bin/bash in the task definition?

Flags: needinfo?(jmaher)

(In reply to Dragos Crisan [:dragrom] from comment #61)

(In reply to Pete Moore [:pmoore][:pete] from comment #58)

(In reply to Dragos Crisan [:dragrom] from comment #56)

I hesitate to change the run-task file to add the SHELL variable. I'll try the approach of using puppet to add this value only for the Mojave workers. From what I know, we will use a small pool to deploy the multiuser worker.

Hi Dragos, I'm afraid that won't work. The environment variables of the task commands are not derived from the environment variables of the generic-worker process. The tasks run as a different user, with different environment variables, and there is no way to set additional environment variables in task commands via a worker configuration change - to do that would require a generic-worker feature (see comment 32).

However, from comment 54, Chris already added the env var for the PGO profile run, so you don't need to add it.

I believe the remaining work is to make the Mac PGO worker pool a production pool, and to adapt the gecko decision task to use the new worker pool for Mac PGO tasks. I think you will need to do the first of these, and either you or Chris can probably do the second, and I think that should complete the work for this bug. Upgrading other Mac workers to multiuser is outside the scope of this bug.

I'll add an extra task: fixing the failed tests. For this, I put an NI on Joel.

Dragos, my understanding of this is that we can deploy a small, separate pool to run the profile-run tasks (bug 1530732), which does not require greening up the rest of the tests. Is something preventing us from doing this?

Chris, Dragos, can you confirm we are in agreement? Thanks!

Flags: needinfo?(dcrisan)

:pmoore, can you help figure out why, on OSX with the new "user per task" setup, tests are not able to access external processes (i.e. the clipboard)? Maybe there are permission issues, etc.?

Flags: needinfo?(jmaher) → needinfo?(pmoore)
Flags: needinfo?(egao)

Taking a look.

Flags: needinfo?(egao)

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #68)

:pmoore, can you help figure out why, on OSX with the new "user per task" setup, tests are not able to access external processes (i.e. the clipboard)? Maybe there are permission issues, etc.?

The test tasks need to include the SHELL env var in the task definition (see comment 53 and comment 54).

Flags: needinfo?(pmoore) → needinfo?(egao)

I installed generic-worker multiuser on the https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014 workers. Chris, these will be the generic-worker machines allocated to PGO. For the moment, I pinned these workers to my branch. You can use these workers to run your tests.

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

Joel, do you have any updates regarding the failing tests?

Flags: needinfo?(jmaher)

I will let :egao handle this; he still has a needinfo assigned.

Flags: needinfo?(jmaher)

Sorry - I should have mentioned this earlier, but I do not understand what this bug is trying to do. If the new generic-worker is causing macosx1014 tests to fail, isn't that an issue with the generic-worker?

If I could get some clarification around what is happening that would be appreciated.

Flags: needinfo?(egao)

Thank you very much, this looks good! It seems that on try these will run against the plain gecko-t-osx-1014 workers, which need to have the sparse profile extension enabled, but I'm ok with turning off sparse and not blocking on that for now.

Flags: needinfo?(cmanchester)

Hi Dragos, can you please update the robustcheckout extension on these workers. Per https://bugzilla.mozilla.org/show_bug.cgi?id=1528374#c18 these jobs fail due to an older version. Thanks!

Flags: needinfo?(dcrisan)

(In reply to Edwin Takahashi (:egao, :etakahashi) from comment #75)

Sorry - I should have mentioned this earlier, but I do not understand what this bug is trying to do. If the new generic-worker is causing macosx1014 tests to fail, isn't that an issue with the generic-worker?

If I could get some clarification around what is happening that would be appreciated.

egao, I think there are some tests re: https://bugzilla.mozilla.org/show_bug.cgi?id=1561956#c69 (https://tools.taskcluster.net/groups/K01QCpGiSU2DjC2IEDddeQ/tasks/ecDVDAYFSWicPn5ohbeztw/runs/0/logs/public%2Flogs%2Flive_backing.log) still failing because of SHELL not being set by generic-worker

(In reply to Pete Moore [:pmoore][:pete] from comment #32)

Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):

pmoore, I'm sorry to respond late to this. I did some digging and found the previous version of generic-worker passed the environment through to the task: https://github.com/taskcluster/generic-worker/blob/v13.0.1/plat_all-unix-style.go#L116

In multi-user/new generic-worker, if not changed to passing the parent environment (not the "generic-worker" parent, but the launchd/shell parent with its HOME/PATH/USER/etc), could SHELL be added to the small set of standard shell variables? https://github.com/taskcluster/generic-worker/blob/master/multiuser_posix.go#L106 (like HOME, PATH, and USER)

Flags: needinfo?(pmoore)

(In reply to Chris Manchester (:chmanchester) from comment #77)

Hi Dragos, can you please update the robustcheckout extension on these workers. Per https://bugzilla.mozilla.org/show_bug.cgi?id=1528374#c18 these jobs fail due to an older version. Thanks!

Hi Chris,

I updated robustcheckout to the latest version.

Flags: needinfo?(dcrisan)

Chris, please let me know if everything is OK now.

Flags: needinfo?(cmanchester)

This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!

Flags: needinfo?(cmanchester) → needinfo?(dcrisan)

(In reply to Chris Manchester (:chmanchester) from comment #82)

This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!

Looking at the gecko-t-osx-1014 workers, I saw the hg sparse extension is enabled on them:

[extensions]
share=
rebase=
mq=
purge=
robustcheckout=/usr/local/lib/hgext/robustcheckout.py
; Enable sparse checkout only in macOS worker
sparse=

Flags: needinfo?(dcrisan)

(In reply to Dave House [:dhouse] from comment #79)

(In reply to Pete Moore [:pmoore][:pete] from comment #32)

Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):

pmoore, I'm sorry to respond late to this. I did some digging and found the previous version of generic-worker passed the environment through to the task: https://github.com/taskcluster/generic-worker/blob/v13.0.1/plat_all-unix-style.go#L116

In multi-user/new generic-worker, if not changed to passing the parent environment (not the "generic-worker" parent, but the launchd/shell parent with its HOME/PATH/USER/etc), could SHELL be added to the small set of standard shell variables? https://github.com/taskcluster/generic-worker/blob/master/multiuser_posix.go#L106 (like HOME, PATH, and USER)

It can, but I feel this may be the wrong place to do it. I believe the SHELL environment variable should really be task-specific, in case not all tasks wish /bin/bash to be the value, for example. I think this differs from HOME and USER which are specific to the user account running the task, so can more easily be pre-baked to appropriate values.

In the past tasks did indeed inherit environment variables from the generic-worker parent process, which is why it was there before, before we had OS task user separation.

Note, any task that requires SHELL to be set can set it, and this also makes the choice explicit, rather than hidden in generic-worker code. On reflection, we probably should not set PATH, but this would mean that almost all tasks would need to explicitly set it, and this could be a point of great frustration, so a reasonable default is provided. It could be argued that /bin/bash is a reasonable default for SHELL but it feels a little more biased than setting a default PATH since it really is a user (or in our case, task) preference where multiple reasonable alternatives could be set, rather than a default system PATH which contains the core system binary directories.

I think with Chris's changes this issue is solved, and will be more explicit in tasks, so overall better, but if that doesn't work out for some reason, we can open up this discussion, of course. This is my personal opinion, after all, which others may disagree with.

Flags: needinfo?(pmoore)

(In reply to Dragos Crisan [:dragrom] from comment #83)

(In reply to Chris Manchester (:chmanchester) from comment #82)

This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!

Looking at the gecko-t-osx-1014 workers, I saw the hg sparse extension is enabled on them:

[extensions]
share=
rebase=
mq=
purge=
robustcheckout=/usr/local/lib/hgext/robustcheckout.py
; Enable sparse checkout only in macOS worker
sparse=

Ah, my mistake. We're seeing the error from comment 33 again, which means we need to update the robustcheckout extension. Can you take a look? Thanks!

Flags: needinfo?(dcrisan)

(In reply to Chris Manchester (:chmanchester) from comment #85)

(In reply to Dragos Crisan [:dragrom] from comment #83)

(In reply to Chris Manchester (:chmanchester) from comment #82)

This is almost ready. Thank you very much for your work on this. Would it also be possible to add the sparse extension to the hgrc of the workers running in the gecko-t-osx-1014 pool as well? These tasks will run against those on try and could benefit from that there as well. Thanks!

Looking at the gecko-t-osx-1014 workers, I saw the hg sparse extension is enabled on them:

[extensions]
share=
rebase=
mq=
purge=
robustcheckout=/usr/local/lib/hgext/robustcheckout.py
; Enable sparse checkout only in macOS worker
sparse=

Ah, my mistake. We're seeing the error from comment 33 again, which means we need to update the robustcheckout extension. Can you take a look? Thanks!

Chris, I looked at the PGO workers and it looks like all jobs are green on these workers. Please let me know if you see any errors.

Flags: needinfo?(dcrisan) → needinfo?(cmanchester)

I'm still seeing issues enabling sparse profiles on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bd6752b83c636fcf058e315d87b4b53586c71f65&selectedJob=281790785 These jobs run against the regular gecko-t-osx-1014 workers on try, so we need them to succeed there as well.

I'm looking into this because it might help with bug 1604486, which is already pretty frequent.

Flags: needinfo?(cmanchester)

(In reply to Chris Manchester (:chmanchester) from comment #87)

I'm still seeing issues enabling sparse profiles on try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bd6752b83c636fcf058e315d87b4b53586c71f65&selectedJob=281790785 These jobs run against the regular gecko-t-osx-1014 workers on try, so we need them to succeed there as well.

I'm looking into this because it might help with bug 1604486, which is already pretty frequent.

Looks like robustcheckout needs to be upgraded in-tree. I'll send a patch.

Depends on: 1605189

(In reply to Pete Moore [:pmoore][:pete] on PTO until 2020 from comment #84)

(In reply to Dave House [:dhouse] from comment #79)

(In reply to Pete Moore [:pmoore][:pete] from comment #32)

Perhaps on the old workers, one of these was already set. The options I see available to us, in order of preference (preferred to least preferred):

pmoore, I'm sorry to respond late to this. I did some digging and found the previous version of generic-worker passed the environment through to the task: https://github.com/taskcluster/generic-worker/blob/v13.0.1/plat_all-unix-style.go#L116

In multi-user/new generic-worker, if not changed to passing the parent environment (not the "generic-worker" parent, but the launchd/shell parent with its HOME/PATH/USER/etc), could SHELL be added to the small set of standard shell variables? https://github.com/taskcluster/generic-worker/blob/master/multiuser_posix.go#L106 (like HOME, PATH, and USER)

It can, but I feel this may be the wrong place to do it. I believe the SHELL environment variable should really be task-specific, in case not all tasks wish /bin/bash to be the value, for example. I think this differs from HOME and USER which are specific to the user account running the task, so can more easily be pre-baked to appropriate values.

In the past tasks did indeed inherit environment variables from the generic-worker parent process, which is why it was there before, before we had OS task user separation.

Note, any task that requires SHELL to be set can set it, and this also makes the choice explicit, rather than hidden in generic-worker code. On reflection, we probably should not set PATH, but this would mean that almost all tasks would need to explicitly set it, and this could be a point of great frustration, so a reasonable default is provided. It could be argued that /bin/bash is a reasonable default for SHELL but it feels a little more biased than setting a default PATH since it really is a user (or in our case, task) preference where multiple reasonable alternatives could be set, rather than a default system PATH which contains the core system binary directories.

I understand your point of view. If task writers/maintainers don't have a problem with that, then I'm fine with it.

I think with Chris's changes this issue is solved, and will be more explicit in tasks, so overall better, but if that doesn't work out for some reason, we can open up this discussion, of course. This is my personal opinion, after all, which others may disagree with.

The change to not having SHELL set now exposes tasks that expect it as failures. I think Chris had fixed the pgo tasks that required SHELL, and I think it is the non-pgo tasks that were failing after that (possibly all because of mach).

Hi Chris,

Looks like the PGO queue is green: https://firefox-ci-tc.services.mozilla.com/provisioners/releng-hardware/worker-types/gecko-3-t-osx-1014.

Did you see any issues, or is everything OK with the PGO build?

Flags: needinfo?(cmanchester)

Everything appears to be good from my end. Thank you very much for your work on this!

Flags: needinfo?(cmanchester)

Hi Joel,
Do you have any news about the failed non-PGO tasks?

Flags: needinfo?(jmaher)

if we have this deployed for all osx workers then there are no problems

Flags: needinfo?(jmaher)
Assignee: dcrisan → dhouse

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #93)

if we have this deployed for all osx workers then there are no problems

I'll update the osx worker staging pool so that we can verify this is working for all macos tasks.

This has been running for PGO and for the staging pool, but not for all osx workers (yet).

Flags: needinfo?(dhouse)

The homebrew removal is complete (bug 1563799), so we can proceed with updating generic-worker on the macOS staging and production pools.

(In reply to Dave House [:dhouse] from comment #89)

The change to not having SHELL set now exposes tasks that expect it as failures. I think Chris had fixed the pgo tasks that required SHELL, and I think it is the non-pgo tasks that were failing after that (possibly all because of mach).

This issue resurfaced now that we're setting up a multi-user generic-worker to run another type of task (builds).

The fix was to default to /bin/sh for non-win32 in mach: https://phabricator.services.mozilla.com/D97686
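
A rough sketch of that kind of fallback (illustrative only, not the actual patch in D97686):

import os
import sys

def detect_shell():
    # Fall back to a sensible default instead of raising when the worker
    # does not set SHELL (or COMSPEC on Windows).
    if sys.platform == "win32":
        return os.environ.get("COMSPEC", "cmd.exe")
    return os.environ.get("SHELL", "/bin/sh")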

This has been updated in all the places where we were able to.

Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED