Closed Bug 1399401 Opened 7 years ago Closed 6 years ago

Upgrade all win7/win10 gecko workers to generic-worker 10.7.8

Categories

(Taskcluster :: Services, enhancement)

Tracking

(Not tracked)

RESOLVED FIXED
mozilla61

People

(Reporter: pmoore, Assigned: pmoore)

References

(Blocks 1 open bug)

Details

Attachments

(4 files)

These worker types currently run outdated versions of generic-worker:

> gecko-t-win7-32: generic-worker 8.2.0
> gecko-t-win7-32-gpu: generic-worker 8.2.0
> gecko-t-win10-64: generic-worker 8.3.0

We should upgrade these worker types to 10.2.2 for the following benefits:

* all tasks user-sandboxed (dedicated user for each task, which is deleted after the task completes)
* more secure (no access to secrets on machines)
* more realistic environment (winlogon session belonging to a regular user, with full dedicated desktop environment)
* tasks cannot (intentionally or accidentally) interfere with each other
* latest features available
* includes several bug fixes and logging/monitoring improvements
* avoid needing to maintain two different branches of generic-worker

Currently when upgrading we hit these failures:

> Windows 7 opt:
> tc-M-e10s(5 bc1)
>
> Windows 7 debug:
> tc-M-e10s(5 bc5) tc-M(5 bc1 bc7)
>
> windows7-32-stylo-disabled opt:
> tc-M-e10s(5 bc2)
>
> windows7-32-stylo-disabled debug:
> tc-M-e10s(5 bc2)

> Windows 10 x64 opt:
> tc-X(X)
>
> Windows 10 x64 debug:
> tc-X(X) tc-M-e10s(5)
>
> windows10-64-stylo-disabled opt:
> tc-M-e10s(5)
>
> windows10-64-stylo-disabled debug:
> tc-M-e10s(5)

Fixing these failures will allow us to roll out new worker features to these worker types.

This bug originates from https://bugzilla.mozilla.org/show_bug.cgi?id=1382204#c59
Add this commit to your try push to switch to generic-worker 10.2.2: * https://hg.mozilla.org/try/raw-rev/835faadcf252b3b019476c713ab5459ccc6af951
Hi Joel, Is this something you can help me with? Many thanks, Pete
Flags: needinfo?(jmaher)
:pmoore, you can follow up with :mattn for the alert/dialog/notification failures and :rstrong for the xpcshell failures related to installation/updating. I could help after the migration, but as it stands many on our team are at full capacity for the rest of the month.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #4)
> :pmoore, you can followup with :mattn for the alert/dialog/notification
> failures and :rstrong for the xpcshell failures related to
> installation/updating.
>
> I could help after the migration, but as it stands many on our team are at
> full capacity for the rest of the month.

Matt, could you help me diagnose these alert/dialog/notification failures?

Robert, could you help me diagnose the xpcshell failures?

Many thanks guys!

Pete
Flags: needinfo?(robert.strong.bugs)
Flags: needinfo?(MattN+bmo)
For convenience I've made a new try push from latest mozilla-central revision: * https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa0ec3f9f9c876dc0ec7b8d8237d86f206dcbb51 I just did this by applying the patch from comment 1 against the latest mozilla-central revision (893fe1549e1e).
Depends on: 1398491
I spoke to Robert over IRC and he kindly pointed me to https://bugzilla.mozilla.org/show_bug.cgi?id=1067756#c21 That would explain it, because after the worker upgrade, task users would not have write access to these directories.
Flags: needinfo?(robert.strong.bugs)
(In reply to Pete Moore [:pmoore][:pete] from comment #7)
> I spoke to Robert over IRC and he kindly pointed me to
> https://bugzilla.mozilla.org/show_bug.cgi?id=1067756#c21
>
> That would explain it, because after the worker upgrade, task users would
> not have write access to these directories.

Rebuilding win7/win10 beta worker types in OpenCloudConfig:

* https://tools.taskcluster.net/groups/L6NZFeUNSlabxdmqJSgcPw

Thanks Robert!
rstrong has also just highlighted to me that the privilege granted in bug 1353889 for the GenericWorker account will need to be granted to the task users e.g. for gecko-t-win10-64-beta that would be: https://github.com/mozilla-releng/OpenCloudConfig/blob/b20fe867f48dde8dd040b339eef21047ca5b728d/userdata/Manifest/gecko-t-win10-64-beta.json#L1147
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> For convenience I've made a new try push from latest mozilla-central
> revision:
>
> * https://treeherder.mozilla.org/#/jobs?repo=try&revision=aa0ec3f9f9c876dc0ec7b8d8237d86f206dcbb51
>
> I just did this by applying the patch from comment 1 against the latest
> mozilla-central revision (893fe1549e1e).

gecko-1-b-win2012-beta was broken, hopefully fixed now with:

https://github.com/mozilla-releng/OpenCloudConfig/commit/bae0c1c78410197cb64a2e9e122b37e6e515255e

When the rollout of the new AMIs completes, we should be able to retrigger those broken builds. AMI rollouts are happening in:

* https://tools.taskcluster.net/groups/bv4SHLSIT3eEsKUbriOoVw/tasks/LjPipx7TQTKl9IuMaACpZw/runs/0
(In reply to Pete Moore [:pmoore][:pete] from comment #10)
> rstrong has also just highlighted to me that the privilege granted in bug
> 1353889 for the GenericWorker account will need to be granted to the task
> users
>
> e.g. for gecko-t-win10-64-beta that would be:
> https://github.com/mozilla-releng/OpenCloudConfig/blob/b20fe867f48dde8dd040b339eef21047ca5b728d/userdata/Manifest/gecko-t-win10-64-beta.json#L1147

Applied to beta worker types:

* https://github.com/mozilla-releng/OpenCloudConfig/commit/02a45d1e675b25f95bf764b095fa037181d5aa2b
(In reply to Pete Moore [:pmoore][:pete] from comment #14)
> New push with fixes:
>
> * https://treeherder.mozilla.org/#/jobs?repo=try&revision=bdd9ec4a5222cf3a3b82db8d155015e655fbc986

This hasn't fixed the tc-X(X) tests yet. I'm now checking whether the change to make the 'Mozilla Maintenance Service' folder writable to task users worked, with this test task:

* https://tools.taskcluster.net/groups/OO_g0scBShmyWdRDzYQ7mg/tasks/OO_g0scBShmyWdRDzYQ7mg/details
Backslash escaping syntax mistake in previous task, new task: * https://tools.taskcluster.net/groups/Z_FR3TFAQ-axtTqdVOV93w/tasks/Z_FR3TFAQ-axtTqdVOV93w/details
Updated task:

* https://tools.taskcluster.net/groups/DlCMdhvqS5us3EIgcm6Fiw/tasks/DlCMdhvqS5us3EIgcm6Fiw/runs/0/logs/public%2Flogs%2Flive.log

Looks like my change[1] didn't work for some reason:

Z:\task_1505581330>echo hellooo 1>"C:\Program Files (x86)\Mozilla Maintenance Service\hello.txt"
Access is denied.
[taskcluster 2017-09-16T17:05:27.754Z] Exit Code: 1

---
[1] https://github.com/mozilla-releng/OpenCloudConfig/commit/b20fe867f48dde8dd040b339eef21047ca5b728d
I still haven't got around to further investigating the issue in comment 17, and I'll be out for a few days now. Rob, if you get the time to look into this, it would be awesome, otherwise I can have a look when I'm back next week. Basically, I made a change to a manifest in OCC so that a directory is read/writable to Everyone, rolled everything out, but I can't write to that directory in a task. All the links are in comment 17. Like I say, I can also take a look when I'm back next week. Thanks guys!
Flags: needinfo?(rthijssen)
iirc the build system Windows images have the maintenance service installed. Do these systems?
my guess is that the command isn't succeeding because of missing quotes around the arg with spaces in it.

eg this line (https://github.com/mozilla-releng/OpenCloudConfig/commit/b20fe867f48dde8dd040b339eef21047ca5b728d#diff-87907bc6a1f2a26aacbddd5425eea212R977):

reads:
"C:\\Program Files (x86)\\Mozilla Maintenance Service",

but should read:
"\"C:\\Program Files (x86)\\Mozilla Maintenance Service\"",
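To illustrate the quoting guess above, here is a minimal Python sketch. It decodes the two JSON string literals quoted in the comment and splits each on spaces outside double quotes. The splitter `split_outside_quotes` is a hypothetical, simplified stand-in for Windows command-line parsing, not what OpenCloudConfig or cmd.exe actually runs.

```python
import json

# The two JSON string literals as quoted in the comment above.
UNQUOTED_JSON = r'"C:\\Program Files (x86)\\Mozilla Maintenance Service"'
QUOTED_JSON = r'"\"C:\\Program Files (x86)\\Mozilla Maintenance Service\""'


def split_outside_quotes(cmdline):
    """Split on spaces that are not inside double quotes (simplified model)."""
    tokens, current, in_quotes = [], "", False
    for ch in cmdline:
        if ch == '"':
            in_quotes = not in_quotes
            current += ch
        elif ch == " " and not in_quotes:
            if current:
                tokens.append(current)
            current = ""
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


unquoted_arg = json.loads(UNQUOTED_JSON)  # no literal quotes survive decoding
quoted_arg = json.loads(QUOTED_JSON)      # literal quotes are part of the value

unquoted_tokens = split_outside_quotes(unquoted_arg)
quoted_tokens = split_outside_quotes(quoted_arg)

print(len(unquoted_tokens), unquoted_tokens)
print(len(quoted_tokens), quoted_tokens)
```

Under this simplified model the unquoted value breaks into several arguments at each space, while the value with embedded quotes stays a single argument, which is consistent with the suggested manifest change.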
Flags: needinfo?(rthijssen)
Blocks: 1370877
(In reply to Rob Thijssen (:grenade - UTC+3) from comment #20)
> my guess is that the command isn't succeeding because of missing quotes
> around the arg with spaces in it.
>
> eg this line
> (https://github.com/mozilla-releng/OpenCloudConfig/commit/b20fe867f48dde8dd040b339eef21047ca5b728d#diff-87907bc6a1f2a26aacbddd5425eea212R977):
>
> reads:
> "C:\\Program Files (x86)\\Mozilla Maintenance Service",
>
> but should read:
> "\"C:\\Program Files (x86)\\Mozilla Maintenance Service\"",

Thanks Rob! I'm trying a push with this now, to see if it helps. Thanks for looking. :)

https://github.com/mozilla-releng/OpenCloudConfig/commit/3357d441046aad0442f2032aed6d7cc78bf996c1
(In reply to Pete Moore [:pmoore][:pete] from comment #23)
> Retrying failed xpcshell tasks:
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=bdd9ec4a5222cf3a3b82db8d155015e655fbc986&filter-searchStr=xpcshell&duplicate_jobs=visible

That fixed xpcshell! Now just the mochitests left.
great stuff
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #25)
> great stuff

All credit goes to rstrong and grenade for that one :-)

New try push against latest mozilla central head revision:

* https://treeherder.mozilla.org/#/jobs?repo=try&revision=f3e3fd07da461fc449a0038c51b69db7392a2e2f&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded
there are a lot of failures which all seem to be related to popups/notifications/other windows. For example, bc1 already runs on taskcluster VM and passes as tier-1, but it is failing with this change. I verified in the log that we are running:

10:54:42 INFO - 17 INFO TEST-START | browser/base/content/test/alerts/browser_notification_do_not_disturb.js
10:54:45 INFO - GECKO(2968) | MEMORY STAT | vsize 1742MB | vsizeMaxContiguous 131597346MB | residentFast 260MB | heapAllocated 108MB
10:54:45 INFO - 18 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_do_not_disturb.js | took 3117ms
10:54:45 INFO - 19 INFO checking window state
10:54:45 INFO - 20 INFO TEST-START | browser/base/content/test/alerts/browser_notification_open_settings.js
10:54:47 INFO - GECKO(2968) | MEMORY STAT | vsize 1791MB | vsizeMaxContiguous 131597346MB | residentFast 311MB | heapAllocated 140MB
10:54:47 INFO - 21 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_open_settings.js | took 2163ms
10:54:47 INFO - 22 INFO checking window state
10:54:47 INFO - 23 INFO TEST-START | browser/base/content/test/alerts/browser_notification_remove_permission.js
10:54:48 INFO - GECKO(2968) | MEMORY STAT | vsize 1792MB | vsizeMaxContiguous 131597346MB | residentFast 312MB | heapAllocated 142MB
10:54:48 INFO - 24 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_remove_permission.js | took 785ms
10:54:48 INFO - 25 INFO checking window state
10:54:48 INFO - 26 INFO TEST-START | browser/base/content/test/alerts/browser_notification_replace.js
10:54:49 INFO - GECKO(2968) | MEMORY STAT | vsize 1792MB | vsizeMaxContiguous 131597346MB | residentFast 296MB | heapAllocated 117MB
10:54:49 INFO - 27 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_replace.js | took 462ms
10:54:49 INFO - 28 INFO checking window state
10:54:49 INFO - 29 INFO TEST-START | browser/base/content/test/alerts/browser_notification_tab_switching.js
10:54:49 INFO - GECKO(2968) | MEMORY STAT | vsize 1783MB | vsizeMaxContiguous 131597346MB | residentFast 278MB | heapAllocated 104MB
10:54:49 INFO - 30 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_tab_switching.js | took 759ms

but on your try push I see failures like this:

07:58:53 INFO - 2 INFO TEST-START | browser/base/content/test/alerts/browser_notification_open_settings.js
07:59:38 INFO - TEST-INFO | started process screenshot
07:59:38 INFO - TEST-INFO | screenshot: exit 0
07:59:38 INFO - Buffered messages logged at 07:58:53
07:59:38 INFO - 3 INFO Entering test bound test_settingsOpen_observer
07:59:38 INFO - 4 INFO Opening a dummy tab so openPreferences=>switchToTabHavingURI doesn't use the blank tab.
07:59:38 INFO - 5 INFO Console message: [JavaScript Warning: "Use of nsIFile in content process is deprecated." {file: "resource://gre/modules/FileUtils.jsm" line: 174}]
07:59:38 INFO - 6 INFO simulate a notifications-open-settings notification
07:59:38 INFO - 7 INFO TEST-PASS | browser/base/content/test/alerts/browser_notification_open_settings.js | The notification settings tab opened -
07:59:38 INFO - Buffered messages logged at 07:58:54
07:59:38 INFO - 8 INFO Leaving test bound test_settingsOpen_observer
07:59:38 INFO - 9 INFO Entering test bound test_settingsOpen_button
07:59:38 INFO - 10 INFO Adding notification permission
07:59:38 INFO - 11 INFO Console message: [JavaScript Warning: "Use of nsIFile in content process is deprecated." {file: "resource://gre/modules/FileUtils.jsm" line: 174}]
07:59:38 INFO - 12 INFO Console message: [JavaScript Warning: "Unknown pseudo-class or pseudo-element ‘-moz-tree-line’. Ruleset ignored due to bad selector." {file: "chrome://global/content/xul.css" line: 654}]
07:59:38 INFO - 13 INFO Waiting for notification
07:59:38 INFO - Buffered messages finished
07:59:38 ERROR - 14 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/alerts/browser_notification_open_settings.js | Test timed out -
07:59:38 INFO - GECKO(3176) | MEMORY STAT | vsize 685MB | vsizeMaxContiguous 804MB | residentFast 195MB | heapAllocated 63MB
07:59:38 INFO - 15 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_open_settings.js | took 45078ms
07:59:38 INFO - Not taking screenshot here: see the one that was previously logged
07:59:38 ERROR - 16 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/alerts/browser_notification_open_settings.js | Found a tab after previous test timed out: http://example.org/browser/browser/base/content/test/alerts/file_dom_notifications.html -
07:59:38 INFO - 17 INFO checking window state
07:59:38 INFO - 18 INFO TEST-START | browser/base/content/test/alerts/browser_notification_remove_permission.js
08:00:23 INFO - Not taking screenshot here: see the one that was previously logged
08:00:23 INFO - Buffered messages logged at 07:59:38
08:00:23 INFO - 19 INFO Console message: [JavaScript Warning: "Use of nsIFile in content process is deprecated." {file: "resource://gre/modules/FileUtils.jsm" line: 174}]
08:00:23 INFO - Buffered messages finished
08:00:23 ERROR - 20 INFO TEST-UNEXPECTED-FAIL | browser/base/content/test/alerts/browser_notification_remove_permission.js | Test timed out -
08:00:23 INFO - GECKO(3176) | MEMORY STAT | vsize 684MB | vsizeMaxContiguous 804MB | residentFast 195MB | heapAllocated 65MB
08:00:23 INFO - 21 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_remove_permission.js | took 45328ms

given that, it looks like there is an issue with focus with the worker changes you are making :pete. Can you run the test on a loaner before/after your changes and watch what is going on? I suspect the answer might be obvious.
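The comparison above (tests that normally finish in a few seconds taking about 45 seconds before timing out) can be pulled out of a log mechanically. The following Python sketch assumes the "TEST-OK | <path> | took <N>ms" line shape seen in the extracts; the function name `slow_tests` and the 45000 ms threshold are illustrative assumptions, not part of the mochitest harness.

```python
import re

# Matches the "TEST-OK | <test path> | took <N>ms" shape seen in the log extracts.
TOOK_RE = re.compile(r"TEST-OK \| (?P<test>\S+) \| took (?P<ms>\d+)ms")


def slow_tests(log_text, threshold_ms=45000):
    """Return (test, duration_ms) pairs at or above the (assumed) threshold."""
    results = []
    for match in TOOK_RE.finditer(log_text):
        duration = int(match.group("ms"))
        if duration >= threshold_ms:
            results.append((match.group("test"), duration))
    return results


# Three lines copied from the log extracts above.
sample_log = "\n".join([
    "10:54:47 INFO - 21 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_open_settings.js | took 2163ms",
    "07:59:38 INFO - 15 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_open_settings.js | took 45078ms",
    "08:00:23 INFO - 21 INFO TEST-OK | browser/base/content/test/alerts/browser_notification_remove_permission.js | took 45328ms",
])

slow = slow_tests(sample_log)
for test, ms in slow:
    print(f"{ms:>6} ms  {test}")
```

Running the same filter over a log from the production worker type and one from the beta worker type would give a quick list of tests whose duration jumps to the timeout.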
Flags: needinfo?(MattN+bmo)
Blocks: 1403490
No longer blocks: 1403490
There was some investigation of those notification-related things in bug 1364517.
I'm hoping that attachment 8916804 [details] will fix any test failures related to browser/base/content/test/alerts/*. The issue being fixed is a race condition not specific to Windows 10 though so it may not be sufficient.
Depends on: 1352791
> Depends on: 1352791 FYI, even though that bug is still open, I landed a fix there for Windows 10. Is that bug still blocking this landing? Does this change make the Win7 failures worse?
Blocks: 1373551
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.2.2 or later → Upgrade all win7/win10 gecko workers to generic-worker 10.2.3
Blocks: 1394557
Blocks: 1343049
Pete, what are the next steps on this bug?
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #34) > Pete, what are the next steps on this bug? I've just landed https://github.com/mozilla-releng/OpenCloudConfig/pull/111 which updates all of our beta worker types to be identical to our production worker types, except for generic-worker version and configuration. I've also rigorously updated our worker type definitions in the aws provisioner to make sure the beta worker types also match the production versions. When the OpenCloudConfig changes have propagated to AWS, I'll trigger a new try job using the beta worker types to see what issues remain. This is much like I did in previous comments - just refreshing to latest versions, and then triggering a new try push. I suspect my try push will have to wait for tomorrow as it takes a couple of hours for all the changes to propagate, but hopefully by this time tomorrow we should have a new completed try push that we can evaluate.
please do a --rebuild 20 on your try push, that will help get data on failure rates.
So I've been having some problems getting the last beta worker type updated - gecko-t-win10-64-gpu-b - I'm going to have another try now - all my deploys until now have been failing due to either not being able to get instances, or the instances I had losing network connectivity (so it isn't possible to see what is going wrong).

This could just be a case of bug 1372172 hitting us during AMI creation.

See e.g. last failed OCC task just now:

https://tools.taskcluster.net/groups/R4iy9VIJSB6tG1f4UwVgmA/tasks/DTdqlTWNRC-naG3mtCgitg/runs/0/logs/public%2Flogs%2Flive.log

That links to a papertrail log that stops outputting for 95 minutes, between 14:50:01 and 16:25:05 CET:

https://papertrailapp.com/groups/2488493/events?q=i-09768a467658be83e

Dec 01 14:48:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is not running.
Dec 01 14:48:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: OpenCloudConfig is running.
Dec 01 14:48:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: instance appears to be initialising.
Dec 01 14:50:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is not running.
Dec 01 14:50:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: OpenCloudConfig is running.
Dec 01 14:50:01 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com HaltOnIdle: instance appears to be initialising.
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com Microsoft-Windows-GroupPolicy: Shutdown script failed. GPO Name : Local Group Policy GPO File System Path : C:\Windows\System32\GroupPolicy\Machine Script Name: C:\scripts\set_user_data.ps1
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000000100000.
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com Service_Control_Manager: The CldFlt service failed to start due to the following error: The request is not supported.
Dec 01 16:25:05 i-09768a467658be83e.gecko-t-win10-64-gpu-b.usw2.mozilla.com OpenCloudConfig: Windows update service is running

The above log extract indicates a problem, since HaltOnIdle is scheduled to run every 2 minutes, which it does up until 14:50, and then for 95 minutes we have no logging until we see the machine is rebooted. This suggests the worker is still running, but either loses network connectivity or the papertrail integration breaks. I did not reboot the machine from the outside, so I presume it rebooted as part of the environment preparation it was internally performing.

The observant log reader may also notice a repeated message earlier in the log:

"An error occurred (InstanceLimitExceeded) when calling the RunInstances operation: You have requested more instances (2) than your current instance limit of 1 allows for the specified instance type. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit."

This occurred multiple times in the log before the above extract, because there was another g3.4xlarge instance running in us-west-2, when we have a limit of 1. This other running instance was probably a runaway instance from a previously timed-out OCC task for a previous push. By terminating that rogue instance, I was able to get the task to continue. The price of this, though, was a delay that ate into the maxRunTime of the task. So the task might have been successful if it had been able to spawn a g3.4xlarge instance in us-west-2 immediately, rather than needing to wait for someone to terminate the running one.

For future reference, such a rogue instance can be seen under:

https://us-west-2.console.aws.amazon.com/ec2/v2/home?region=us-west-2#Instances:search=g3.4xlarge;sort=instanceId

I've now ensured that there are no g3.4xlarge instances running in us-west-2, and made a new OCC push to try to rebuild gecko-t-win10-64-gpu-b again:

https://tools.taskcluster.net/groups/GMo_mDgUSZqubrB74-7rrA/tasks/Kz1LVo5fS5ak5U1nWjR0MQ/details

I may not be around when this completes, but if it does complete successfully, I have prepared a try patch here:

https://bugzilla.mozilla.org/show_bug.cgi?id=1400012#c9

that makes it trivial to run a try push against the beta worker types - so if anyone wants to make a try push once the above task completes successfully, this patch should do the trick.
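The logging gap described in this comment can be checked directly from the timestamps. This is a minimal Python sketch that flags any gap between consecutive log lines longer than the 2-minute HaltOnIdle interval plus some slack. The function name `find_gaps`, the 30-second slack and the year 2017 (papertrail lines carry no year) are assumptions for illustration.

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, year=2017,
              expected=timedelta(minutes=2), slack=timedelta(seconds=30)):
    """Return (earlier, later, delta) for consecutive entries further apart
    than expected + slack. The year is an assumption, since the log omits it."""
    parsed = [datetime.strptime(f"{year} {ts}", "%Y %b %d %H:%M:%S")
              for ts in timestamps]
    gaps = []
    for earlier, later in zip(parsed, parsed[1:]):
        delta = later - earlier
        if delta > expected + slack:
            gaps.append((earlier, later, delta))
    return gaps


# Distinct timestamps taken from the papertrail extract above.
stamps = ["Dec 01 14:48:01", "Dec 01 14:50:01", "Dec 01 16:25:05"]
gaps = find_gaps(stamps)
for earlier, later, delta in gaps:
    print(f"no log output between {earlier:%H:%M:%S} and {later:%H:%M:%S} ({delta})")
```

For the extract above this reports a single gap of 1:35:04 between 14:50:01 and 16:25:05.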
Depends on: 1422870
(In reply to Pete Moore [:pmoore][:pete] from comment #37)
> So I've been having some problem getting the last beta worker type updated -
> gecko-t-win10-64-gpu-b - I'm going to have another try now - all my deploys
> until now have been failing due to either not being able to get instances,
> or the instances I had losing network connectivity (so it isn't possible to
> see what is going wrong).

All problems updating gecko-t-win10-64-gpu-b have been solved now, so I have made a new try push here:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=67581af6162e8c0dfaaa726c3fda298ef576a846&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded&filter-searchStr=windows&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable

Let's see how that goes! :-)
(In reply to Pete Moore [:pmoore][:pete] from comment #38)
> (In reply to Pete Moore [:pmoore][:pete] from comment #37)
> > So I've been having some problem getting the last beta worker type updated -
> > gecko-t-win10-64-gpu-b - I'm going to have another try now - all my deploys
> > until now have been failing due to either not being able to get instances,
> > or the instances I had losing network connectivity (so it isn't possible to
> > see what is going wrong).
>
> All problems updating gecko-t-win10-64-gpu-b have been solved now, so I have
> made a new try push here:
>
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=67581af6162e8c0dfaaa726c3fda298ef576a846&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded&filter-searchStr=windows&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable
>
> Let's see how that goes! :-)

Some problems with the Y: drive not getting mounted on some win 7 gpu jobs - but other than that, I think it is worth taking a look at already.
Flags: needinfo?(jmaher)
Rob any idea what the Y: drive mounting problem on gecko-t-win7-32-gpu-b might be related to? Thanks!
Flags: needinfo?(rthijssen)
From this screenshot from a problematic worker[1] we see that the Y: drive got mounted as E: instead of Y: -- [1] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win7-32-gpu-b/workers/us-east-1/i-03115ac894967073d
Assignee: nobody → pmoore
Status: NEW → ASSIGNED
I spotted that drive letters can be mapped in DriveLetterConfig.xml[1]. I think at the moment we are mounting drives in rundsc.ps1[2]. See EC2 docs[3] for details on the DriveLetterConfig.xml file. If using DriveLetterConfig.xml works, could that be an alternative to mounting in rundsc.ps1?

I checked one of our instances, and saw the file exists, but contains no mappings at the moment:

Z:\task_1512497232>type "C:\Program Files\Amazon\Ec2ConfigService\Settings\DriveLetterConfig.xml"
<?xml version="1.0" standalone="yes"?>
<DriveLetterMapping>
</DriveLetterMapping>

--
[1] C:\Program Files\Amazon\Ec2ConfigService\Settings\DriveLetterConfig.xml
[2] https://github.com/mozilla-releng/OpenCloudConfig/blob/cebf4fc5888510550a09f1ccdcf0d4001d7c32ec/userdata/rundsc.ps1#L306-L367
[3] http://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/UsingConfig_WinAMI.html#UsingConfigInterface_WinAMI
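As a rough illustration of what a populated mapping file might look like, the Python sketch below builds one with the standard library. The child element names (`Mapping`, `VolumeName`, `DriveLetter`) are an assumption based on the EC2Config documentation and should be checked against [3]; the volume label `task-drive` is purely hypothetical and not taken from any worker image.

```python
import xml.etree.ElementTree as ET


def build_drive_letter_config(mappings):
    """Build a DriveLetterMapping document from (volume_name, drive_letter) pairs.

    Element names below are assumed from the EC2Config docs, not verified here.
    """
    root = ET.Element("DriveLetterMapping")
    for volume_name, drive_letter in mappings:
        mapping = ET.SubElement(root, "Mapping")
        ET.SubElement(mapping, "VolumeName").text = volume_name
        ET.SubElement(mapping, "DriveLetter").text = drive_letter
    body = ET.tostring(root, encoding="unicode")
    # Same XML declaration as the file found on the instance.
    return '<?xml version="1.0" standalone="yes"?>\n' + body


# "task-drive" is a hypothetical volume label used only for this example.
xml_text = build_drive_letter_config([("task-drive", "Y:")])
print(xml_text)
```

Whether EC2Config would apply such a mapping early enough for the worker is a separate question and would need testing on a real instance.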
this is failing to get rid of xendpriv.exe (look at the logs for 'cl'). This is a hack I put in to allow the clipboard to run on the machines. We cannot remove the file due to access denied; I suspect you need to remove this in the setup, or fix the taskcluster worker to allow access to that file.

outside of this, many tests are failing for prompt/notification/multi-window - that is a common theme from earlier when looking at taskcluster workers for windows in the past.
Flags: needinfo?(jmaher)
most likely explanation is that gw started running before occ was able to assign the correct drive letter. we see in the logs that this is often the case on windows 2012 where we use newer gw. since the worker type experiencing the incorrect drive mappings is also running a newer gw, i suspect this is also the case here. when occ detects that gw has started before occ, it simply terminates itself (as a workaround to other issues experienced earlier) since we can't have both running.

note that using ec2 to assign the drive letter in DriveLetterConfig.xml will not be 100% effective as gw also doesn't wait for ec2config to complete before it starts up.

imo the best fix is to add a check inside gw to wait until occ has set the ready state flag before attempting to run tasks. there's simply nothing we can do in occ to get the drive mappings correct if gw starts before occ has run.
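The proposed check (wait for a ready flag before running tasks) could look roughly like the following. This is only a Python sketch of the idea: generic-worker itself is written in Go, and the thread does not say how OCC exposes its ready state, so the file-based flag, the function name `wait_for_ready_flag` and the timeout values are all hypothetical.

```python
import os
import tempfile
import time


def wait_for_ready_flag(flag_path, timeout_s=600.0, poll_s=1.0):
    """Poll until flag_path exists; return False if timeout_s elapses first.

    A file-based flag is an assumption made for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(flag_path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


# Demonstration with a temporary directory standing in for the worker machine.
demo_dir = tempfile.mkdtemp()
flag = os.path.join(demo_dir, "occ-ready.flag")  # hypothetical flag file name

not_ready = wait_for_ready_flag(flag, timeout_s=0.05, poll_s=0.01)

with open(flag, "w") as handle:
    handle.write("ready")

ready = wait_for_ready_flag(flag, timeout_s=0.05, poll_s=0.01)
print(not_ready, ready)
```

A worker following this pattern would claim tasks only after the call returns True, and would need some policy (retry, shut down, alert) for the timeout case.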
Flags: needinfo?(rthijssen)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #43)
> this is failing to get rid of xendpriv.exe (look at the logs for 'cl').
> This is a hack I put in the allow clipboard to run on the machines- we
> cannot remove the file due to access denied, I suspect you need to remove
> this in the setup, or fix the taskcluster worker to allow access to that
> file.
>
> outside of this, many tests are failing for
> prompt/notification/multi-window- that is a common theme from earlier when
> looking at taskcluster workers for windows in the past.

Indeed - it looks like this file is included on the workers.

Rob, do you know why this file is there, and where it comes from? Should this whole "C:\Program Files\Citrix\XenTools" directory be there at all?

-----

[taskcluster 2017-12-06T11:26:35.251Z] Worker Type (gecko-t-win7-32-beta) settings:
[taskcluster 2017-12-06T11:26:35.251Z] {
[taskcluster 2017-12-06T11:26:35.251Z]   "aws": {
[taskcluster 2017-12-06T11:26:35.251Z]     "ami-id": "ami-18038d77",
[taskcluster 2017-12-06T11:26:35.251Z]     "availability-zone": "eu-central-1c",
[taskcluster 2017-12-06T11:26:35.251Z]     "instance-id": "i-00f0b6a2a91125f87",
[taskcluster 2017-12-06T11:26:35.251Z]     "instance-type": "c4.2xlarge",
[taskcluster 2017-12-06T11:26:35.251Z]     "local-ipv4": "10.147.50.97",
[taskcluster 2017-12-06T11:26:35.251Z]     "public-hostname": "ec2-18-194-58-53.eu-central-1.compute.amazonaws.com",
[taskcluster 2017-12-06T11:26:35.251Z]     "public-ipv4": "18.194.58.53"
[taskcluster 2017-12-06T11:26:35.251Z]   },
[taskcluster 2017-12-06T11:26:35.251Z]   "config": {
[taskcluster 2017-12-06T11:26:35.251Z]     "deploymentId": "bec3aef21ffa",
[taskcluster 2017-12-06T11:26:35.251Z]     "runTasksAsCurrentUser": false
[taskcluster 2017-12-06T11:26:35.251Z]   },
[taskcluster 2017-12-06T11:26:35.251Z]   "generic-worker": {
[taskcluster 2017-12-06T11:26:35.251Z]     "go-arch": "386",
[taskcluster 2017-12-06T11:26:35.251Z]     "go-os": "windows",
[taskcluster 2017-12-06T11:26:35.251Z]     "go-version": "go1.9",
[taskcluster 2017-12-06T11:26:35.251Z]     "release": "https://github.com/taskcluster/generic-worker/releases/tag/v10.3.1",
[taskcluster 2017-12-06T11:26:35.251Z]     "revision": "bc1ecb9aa266105bf8a936fa451bff4e2a35843e",
[taskcluster 2017-12-06T11:26:35.251Z]     "source": "https://github.com/taskcluster/generic-worker/tree/bc1ecb9aa266105bf8a936fa451bff4e2a35843e",
[taskcluster 2017-12-06T11:26:35.251Z]     "version": "10.3.1"
[taskcluster 2017-12-06T11:26:35.251Z]   },
[taskcluster 2017-12-06T11:26:35.251Z]   "machine-setup": {
[taskcluster 2017-12-06T11:26:35.251Z]     "ami-created": "2017-12-05 14:36:21.569Z",
[taskcluster 2017-12-06T11:26:35.251Z]     "manifest": "https://github.com/mozilla-releng/OpenCloudConfig/blob/bec3aef21ffac1363747d6d5dc49079be1b61d1c/userdata/Manifest/gecko-t-win7-32-beta.json"
[taskcluster 2017-12-06T11:26:35.251Z]   }
[taskcluster 2017-12-06T11:26:35.251Z] }
[taskcluster 2017-12-06T11:26:35.251Z] Task ID: a_J-nNAvT2S_g1HbHQnypg
[taskcluster 2017-12-06T11:26:35.251Z] === Task Starting ===
[taskcluster 2017-12-06T11:26:36.299Z] Uploading redirect artifact public/logs/live.log to URL https://clbduniaaaawak4uxpi4qn4c3mgrkwj5uxhp3xkefmvo3mhn.taskcluster-worker.net:60023/log/TorEb--jSeqsgkgKBjqzPw with mime type "text/plain; charset=utf-8" and expiry 2017-12-06T11:27:35.889Z
[taskcluster 2017-12-06T11:26:36.738Z] Executing command 0: dir "C:\Program Files\Citrix\XenTools\XenDPriv.exe"

Z:\task_1512559551>dir "C:\Program Files\Citrix\XenTools\XenDPriv.exe"
 Volume in drive C is OSDisk
 Volume Serial Number is FC62-2D8F

 Directory of C:\Program Files\Citrix\XenTools

04/08/2014 04:07 PM 12,288 XenDPriv.exe
 1 File(s) 12,288 bytes
 0 Dir(s) 11,514,023,936 bytes free

[taskcluster 2017-12-06T11:26:36.780Z] Exit Code: 0
[taskcluster 2017-12-06T11:26:36.780Z] Success Code: 0x0
[taskcluster 2017-12-06T11:26:36.780Z] User Time: 15.6001ms
[taskcluster 2017-12-06T11:26:36.780Z] Kernel Time: 0s
[taskcluster 2017-12-06T11:26:36.780Z] Wall Time: 30ms
[taskcluster 2017-12-06T11:26:36.780Z] Peak Memory: 2273280
[taskcluster 2017-12-06T11:26:36.780Z] Result: SUCCEEDED
[taskcluster 2017-12-06T11:26:36.780Z] === Task Finished ===
[taskcluster 2017-12-06T11:26:36.780Z] Task Duration: 42ms
Pete, that is related to the xen vm toolchain that amazon uses for its workers. We need some of the Xen tools, but not that specific file which luckily works for fixing our clipboard problems.
I added the following to remove it from the golden AMIs (see https://github.com/mozilla-releng/OpenCloudConfig/commit/17db37e19674751ff1baacb9da438f494a148663):

+ {
+   "ComponentName": "DeleteXenDPriv.exe",
+   "ComponentType": "CommandRun",
+   "Comment": "See https://bugzilla.mozilla.org/show_bug.cgi?id=1399401#c43 and https://bugzilla.mozilla.org/show_bug.cgi?id=1394757",
+   "Command": "cmd.exe",
+   "Arguments": [
+     "/c",
+     "del",
+     "/f",
+     "/q",
+     "\"C:\\Program Files\\Citrix\\XenTools\\XenDPriv.exe\""
+   ],
+   "Validate": {
+     "PathsNotExist": [
+       "C:\\Program Files\\Citrix\\XenTools\\XenDPriv.exe"
+     ]
+   }
+ },

But the logs on a live worker show this isn't able to delete the file:

20171206164250-DeleteXenDPriv.exe-stderr.log
============================================
Access is denied.

Running attrib shows that the file is not read-only, which was my first thought about why we are not able to delete it:

Z:\task_1512656869>attrib "C:\Program Files\Citrix\XenTools\XenDPriv.exe"
A I C:\Program Files\Citrix\XenTools\XenDPriv.exe

Rob (:grenade) suggested it might be because the file is in use. I'll look more in depth into bug 1394757 to see how this file was deleted in the test setup code before.

Note that deleting this file during test setup no longer works, because tests do not run as admin. Deleting it during test setup is also bad because it changes system state - i.e. tests running before a test that deletes this file could well behave differently to tests that run after this system file is deleted. It is therefore better for the file not to make it into a live environment in the first place, and for the test environment to be consistent between test runs - which is why I have chosen to delete it entirely from the worker environment in OpenCloudConfig.
what we do in the test script is: 1) kill XenDPriv.exe 2) rename the file (but deleting is fine as well) it will try to restart all the time if you just kill the process and the file exists.
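For anyone following along, here is a minimal sketch of the kill-then-rename sequence described above. It is not the actual test script: the function names and the ".disabled" suffix are made up for illustration, and the only external tool assumed is the stock Windows `taskkill` command. On a real worker this would presumably need admin rights.

```python
import os
import subprocess
from pathlib import Path


def kill_process(image_name):
    """Force-kill every process with this image name (Windows only)."""
    # taskkill is the stock Windows CLI; check=False because a non-zero
    # exit status most likely just means the process was not running.
    subprocess.run(["taskkill", "/F", "/IM", image_name], check=False)


def disable_executable(path, kill=kill_process, suffix=".disabled"):
    """Kill the process, then rename its binary so it cannot be restarted.

    Returns the new path, or None if the file was not there.
    `suffix` is a hypothetical naming choice, not taken from the test script.
    """
    path = Path(path)
    # Step 1: kill first, otherwise the file may still be in use.
    kill(path.name)
    if not path.exists():
        return None
    # Step 2: rename, so nothing can relaunch the process from this path.
    target = path.with_name(path.name + suffix)
    os.replace(path, target)
    return target
```

Usage would be something like `disable_executable(r"C:\Program Files\Citrix\XenTools\XenDPriv.exe")`. The `kill` parameter is injectable only so the ordering can be exercised without a real process.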
pmoore: if this file exists on the base ami, we can either remove it there and bake a new base ami or put the logic to kill it with fire in this method: https://github.com/mozilla-releng/OpenCloudConfig/blob/master/userdata/rundsc.ps1#L105 i'd kind of like it in the remove-legacystuff method just so there's a record in source code that we are doing so deliberately but not too fussed since this bug is also a good record.
(In reply to Rob Thijssen (:grenade UTC+2) from comment #49) > i'd kind of like it in the remove-legacystuff method just so there's a > record in source code that we are doing so deliberately but not too fussed > since this bug is also a good record. For now I've removed it in the manifest, but we can move it to rundsc.ps1 if you prefer. --- I've made a new try push here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=ce439beeb616415a842e11a69d1ad10a58117eef&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded&filter-searchStr=windows&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable --- (In reply to Joel Maher ( :jmaher) (UTC-5) from comment #36) > please do a --rebuild 20 on your try push, that will help get data on > failure rates. I've done this in the new try push above - hope I got the syntax right (I just added it to the end)! :-)
Pete, from comment 43: outside of this, many tests are failing for prompt/notification/multi-window- that is a common theme from earlier when looking at taskcluster workers for windows in the past. this holds true with your latest push, none of the test failures were fixed, so --rebuild 20 seemed a bit overkill.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #51) > Pete, from comment 43: > outside of this, many tests are failing for > prompt/notification/multi-window- that is a common theme from earlier when > looking at taskcluster workers for windows in the past. Thanks - do you know if these problems were ever solved? > this holds true with your latest push, none of the test failures were fixed, > so --rebuild 20 seemed a bit overkill. Sorry, I didn't realise --rebuild 20 would also rerun tasks that passed. Indeed, I was pretty alarmed to see how many tasks got generated when I came in this morning! I won't be doing that again unless there is some very strong justification.
Summary of permanent failures
=============================

Windows 7 opt:
1) test-windows7-32/opt-mochitest-browser-chrome-e10s-3
2) test-windows7-32/opt-mochitest-chrome-3

Windows 7 debug:
3) test-windows7-32/debug-mochitest-5
4) test-windows7-32/debug-mochitest-browser-chrome-7
5) test-windows7-32/debug-mochitest-browser-chrome-e10s-4
6) test-windows7-32/debug-mochitest-clipboard

Windows 10 opt:
7) test-windows10-64/opt-mochitest-browser-chrome-e10s-3
8) test-windows10-64/opt-mochitest-chrome-3
9) test-windows10-64/opt-mochitest-e10s-5

Windows 10 debug:
10) test-windows10-64/debug-mochitest-browser-chrome-e10s-1
11) test-windows10-64/debug-mochitest-chrome-3
12) test-windows10-64/debug-mochitest-e10s-5
One of the failures in test-windows7-32/debug-mochitest-clipboard is: 00:09:34 ERROR - 138 INFO TEST-UNEXPECTED-FAIL | devtools/client/commandline/test/browser_cmd_screenshot.js | arg.filename.value (for 'screenshot C:\Users\task_1512691342\AppData\Local\Temp\TestScreenshotFile.png') - Got C:\Users ask_1512691342\AppData\Local\Temp\TestScreenshotFile.png, expected C:\Users\task_1512691342\AppData\Local\Temp\TestScreenshotFile.png Here we can see the failure is simply because the string is getting unescaped, i.e. `C:\Users\task` -> `C:\Users<tab>ask` because `\t` is being interpreted as the tab character. This is clearly a buggy test that needs fixing. I suspect there is something funny going on here: https://dxr.mozilla.org/mozilla-central/rev/457b0fe91e0d49a5bc35014fb6f86729cd5bac9b/devtools/client/commandline/test/browser_cmd_screenshot.js#106
Flags: needinfo?(jmaher)
Hi Matt, Do you have any ideas about what might be the cause of the failures in comment 50 (and comment 54)? Or do you know if there are any open existing bugs that I can make dependencies of this bug if any of them are currently being investigated? Thanks!
Flags: needinfo?(MattN+bmo)
:pmoore, interesting file on the clipboard failure, could we make it upper case Task to avoid this? I agree we should look into a fix for the test. :jryans, I see you in the file commit history often for browser_cmd_screenshot.js, would you happen to know where we get the filename value and why it might interpret a \t in the full path as a <tab> character? ^^ see comment 54.
Flags: needinfo?(jmaher) → needinfo?(jryans)
Out of curiosity I've triggered the tasks from comment 53 again (just once each) but configured the task users to be in the Administrators group (using the "osGroups" feature in generic-worker[1]). I put them in a single task group here: https://tools.taskcluster.net/groups/caeMQxVJQf6ix0UiYgXnvQ I'm curious if this will fix any of them. -- [1] https://docs.taskcluster.net/reference/workers/generic-worker/payload
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #56) > :jryans, I see you in the file commit history often for > browser_cmd_screenshot.js, would you happen to know where we get the > filename value and why it might interpret a \t in the full path as a <tab> > character? ^^ see comment 54. Hmm, that's a fun one! I think this patch should fix the issue, but I don't have a simple way to verify it myself.
Flags: needinfo?(jryans)
Comment on attachment 8935897 [details] [diff] [review] Escape backslashes in GCLI screenshot test Review of attachment 8935897 [details] [diff] [review]: ----------------------------------------------------------------- ::: devtools/client/commandline/test/browser_cmd_screenshot.js @@ +104,3 @@ > check: { > args: { > filename: { value: "" + file.path }, Shouldn't the replace be on line 106 instead of line 103 ? I think line 103 just creates a description, whereas line 106 is the filename that is passed through.
Flags: needinfo?(jryans)
(In reply to Pete Moore [:pmoore][:pete] from comment #59) > Comment on attachment 8935897 [details] [diff] [review] > Escape backslashes in GCLI screenshot test > > Review of attachment 8935897 [details] [diff] [review]: > ----------------------------------------------------------------- > > ::: devtools/client/commandline/test/browser_cmd_screenshot.js > @@ +104,3 @@ > > check: { > > args: { > > filename: { value: "" + file.path }, > > Shouldn't the replace be on line 106 instead of line 103 ? I think line 103 > just creates a description, whereas line 106 is the filename that is passed > through. I believe the `setup` is the text to enter, while the `check` block states the expected value certain arguments should have after parsing. The issue seems to be related to how we parse the backslashes in the input, so that's why I modified the `setup` to escape the text entered. However if the patch doesn't work and your version does, that's fine too!
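To make the `\t` problem concrete, here is a small sketch. The `toy_unescape` function is only a stand-in for whatever the real command-line parser does (an assumption, not the GCLI code): it processes a few known escapes and leaves unknown ones alone, which matches the "Got" value in the failure log. Doubling the backslashes before the text is entered makes the path survive the round trip.

```python
# Escapes the toy parser understands; anything else is left untouched.
ESCAPES = {"t": "\t", "n": "\n", "\\": "\\"}

# The path from the failure log, as a raw string (single backslashes).
RAW_PATH = r"C:\Users\task_1512691342\AppData\Local\Temp\TestScreenshotFile.png"


def toy_unescape(text):
    """Hypothetical stand-in for a parser that interprets backslash escapes."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in ESCAPES:
            # Known escape: emit the character it stands for.
            out.append(ESCAPES[text[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def escape_backslashes(path):
    """Double every backslash so an escape-processing parser restores it."""
    return path.replace("\\", "\\\\")


if __name__ == "__main__":
    # Unescaped input: "\t" in "\task_..." is turned into a tab character.
    print(repr(toy_unescape(RAW_PATH)))
    # Escaped input: the original path comes back intact.
    print(repr(toy_unescape(escape_backslashes(RAW_PATH))))
```

This only illustrates why escaping the entered text helps; whether the `setup` or the `check` side is the right place for it in the real test is exactly the question discussed above.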
Flags: needinfo?(jryans)
Blocks: 1360198
(In reply to J. Ryan Stinnett [:jryans] (use ni?) from comment #60) > (In reply to Pete Moore [:pmoore][:pete] from comment #59) > > Comment on attachment 8935897 [details] [diff] [review] > > Escape backslashes in GCLI screenshot test > > > > Review of attachment 8935897 [details] [diff] [review]: > > ----------------------------------------------------------------- > > > > ::: devtools/client/commandline/test/browser_cmd_screenshot.js > > @@ +104,3 @@ > > > check: { > > > args: { > > > filename: { value: "" + file.path }, > > > > Shouldn't the replace be on line 106 instead of line 103 ? I think line 103 > > just creates a description, whereas line 106 is the filename that is passed > > through. > > I believe the `setup` is the text to enter, while the `check` block states > the expected value certain arguments should have after parsing. > > The issue seems to be related to how we parse the backslashes in the input, > so that's why I modified the `setup` to escape the text entered. > > However if the patch doesn't work and your version does, that's fine too! Thanks Ryan! Trying your patch in * https://treeherder.mozilla.org/#/jobs?repo=try&revision=98eea9e5205be091f8f78af72c48c87f4c544870&filter-tier=1&filter-tier=2&filter-tier=3&duplicate_jobs=visible&group_state=expanded&filter-searchStr=windows&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.2.3 → Upgrade all win7/win10 gecko workers to generic-worker 10.4.1
The try push in comment 61 is looking much better! Note, the try push is based on this mozilla-central push, which has some starred failures already: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=6f5fac320fcb6625603fa8a744ffa8523f8b3d71&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-searchStr=windows I've retriggered failures, to see if they are intermittent.
Hey Joel, Is there anyone that can help me with the last couple of failures? https://tinyurl.com/ycatybqe Many thanks! Pete
Flags: needinfo?(jmaher)
I don't really see any obvious pattern. The notification tests have been disabled since they were intermittently failing on the old worker :(
Flags: needinfo?(MattN+bmo)
the failures are all prompts/multi-window failures, when you are logged into a session, can you use the browser and get prompts and multiple windows? Can you run the tests locally in a vnc/rdp session and reproduce the failures? Once we get to that point, it will be easier to determine who can help.
Flags: needinfo?(jmaher)
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #65) > the failures are all prompts/multi-window failures, when you are logged into > a session, can you use the browser and get prompts and multiple windows? > Can you run the tests locally in a vnc/rdp session and reproduce the > failures? Once we get to that point, it will be easier to determine who can > help. You're quite right - these are the next things to check. In order to do that, I've implemented a (rather basic) native RDP worker feature that will allow us to RDP in while the task is running (see bug 1172273). This is subtly different to the existing Windows loaner procedure as it will (hopefully, if it works) get you in to the actual session running the task with the real task user. I've released an alpha release of the worker which I'm deploying to our beta worker types in https://tools.taskcluster.net/groups/PKEayE2bRg-RnPWtAYYPHQ so when that deployment is complete, I will test out the new RDP procedure, and see if I can see what is going wrong.
Depends on: 1172273
Depends on: 1433854
Blocks: 1433854
No longer depends on: 1433854
So I'm able to watch tests running now via RDP. Example try push: (5887a82a1d0416c0724ee355f59d3c90e6fcb83f): * https://tinyurl.com/ydywxj2k I connected initially with my native screen resolution, which seems to have caused the screen resolution update to 1280x1024 to fail, so tests did not run. I then reconnected with 1280x1024 resolution, and was able to manually run the task, only to discover it passed. I'll trigger the task again, connecting via RDP with 1280x1024 from the start, to see if the tests then run or not, and if we get the same failure[1] we consistently get when we don't connect via RDP using the upgraded worker. -- [1] https://public-artifacts.taskcluster.net/FZncnsa3QmWrvwUNYgfWyg/0/public/test_info/mozilla-test-fail-screenshot_jbm8ac.png
Note: in order to connect via RDP, the workflow is:

1) Add the following patches to your gecko (firefox) checkout, to enable the beta worker types:
> curl -L 'https://bug1399401.bmoattachments.org/attachment.cgi?id=8935897' | hg import -
> curl -L 'https://bug1400012.bmoattachments.org/attachment.cgi?id=8948627' | hg import -
2) Prepare any other commits for changes you'd like to test, as normal, and push to try.
3) Find the failing task you want to play with in treeherder, and visit it in the taskcluster task inspector
4) Go to Actions -> Edit Task
5) In the "payload" section add "rdpInfo": "ldap/<ldapUser>/rdpinfo.txt" (e.g. "ldap/pmoore@mozilla.com/rdpinfo.txt")
6) Add "generic-worker:allow-rdp:aws-provisioner-v1/<workerType>" to the scopes list, e.g.
> scopes:
>   - 'generic-worker:allow-rdp:aws-provisioner-v1/gecko-t-win7-32-beta'
7) Ask somebody in #taskcluster to grant you the generic-worker:allow-rdp:aws-provisioner-v1/<workerType> scope and queue:get-artifact:ldap/<ldapUser>/* for the workerType(s) and ldap user you use
8) Run the task, and when it starts, go to "Run Artifacts" to see the rdpInfo.txt file appear with rdp connection information
9) Enter the connection information into your RDP client of choice
10) Connect with screen resolution 1280x1024!
Note, bug 1436002 will simplify step 7 in comment 68. :)
Blocks: 1368961
Blocks: tc-stability
Blocks: 1333957
No longer blocks: tc-stability
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.4.1 → Upgrade all win7/win10 gecko workers to generic-worker 10.5.1
(In reply to Pete Moore [:pmoore][:pete] from comment #68)
> Note: in order to connect via RDP, the workflow is:
.... <snip/> ....

This is now a little bit simpler (step 5 changed, step 7 removed):

1) Add the following patches to your gecko (firefox) checkout, to enable the beta worker types:
> curl -L 'https://bug1399401.bmoattachments.org/attachment.cgi?id=8935897' | hg import -
> curl -L 'https://bug1400012.bmoattachments.org/attachment.cgi?id=8948627' | hg import -
2) Prepare any other commits for changes you'd like to test, as normal, and push to try.
3) Find the failing task you want to play with in treeherder, and visit it in the taskcluster task inspector
4) Go to Actions -> Edit Task
5) Add rdpInfo to the payload section:
> payload:
>   rdpInfo: 'login-identity/<login-identity>/rdpinfo.txt'
For example, 'login-identity/mozilla-ldap/pmoore@mozilla.com/rdpinfo.txt' (check https://tools.taskcluster.net/credentials to see what your login identity is, e.g. you should have the scope queue:create-artifact:login-identity/<login-identity>/*).
6) Add "generic-worker:allow-rdp:aws-provisioner-v1/<workerType>" to the scopes list, e.g.
> scopes:
>   - 'generic-worker:allow-rdp:aws-provisioner-v1/gecko-t-win7-32-beta'
7) Run the task, and when it starts, go to "Run Artifacts" to see the rdpInfo.txt file appear with rdp connection information
8) Enter the connection information into your RDP client of choice
9) Connect with screen resolution 1280x1024!
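In case it helps with the task-editing steps, here is a rough sketch of the same edit applied to a task definition held as a plain dict. The helper name `add_rdp_access` is made up for illustration and nothing here talks to taskcluster; it only shows which two fields change. The default provisioner ID is the one used in the scope example above.

```python
import copy


def add_rdp_access(task, login_identity, worker_type,
                   provisioner="aws-provisioner-v1"):
    """Return a copy of `task` with the rdpInfo artifact and RDP scope added.

    Hypothetical helper: mirrors the manual "Edit Task" steps, nothing more.
    """
    task = copy.deepcopy(task)  # leave the caller's definition untouched

    # payload.rdpInfo names the artifact the connection details are written to.
    payload = task.setdefault("payload", {})
    payload["rdpInfo"] = "login-identity/{}/rdpinfo.txt".format(login_identity)

    # The task also needs the allow-rdp scope for its worker type.
    scope = "generic-worker:allow-rdp:{}/{}".format(provisioner, worker_type)
    scopes = task.setdefault("scopes", [])
    if scope not in scopes:
        scopes.append(scope)
    return task


if __name__ == "__main__":
    edited = add_rdp_access(
        {"payload": {"command": ["..."]}, "scopes": []},
        login_identity="mozilla-ldap/pmoore@mozilla.com",
        worker_type="gecko-t-win7-32-beta",
    )
    print(edited["payload"]["rdpInfo"])
    print(edited["scopes"])
```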
Blocks: 1358545
Blocks: 1439517
Pete: can I ask you to trigger another try run so I can look at current results? My try access is still broken and this will allow me to retrigger as necessary.
Flags: needinfo?(pmoore)
(In reply to Chris Cooper [:coop] from comment #71) > Pete: can I ask you to trigger another try run so I can look at current > results? My try access is still broken and this will allow me to retrigger > as necessary. I spoke with Joel and Kendall in the TC migration mtg today. Pete: can I ask you to collate a list of the currently failing tests (from a new Try run, hopefully) in a new bug comment? I'm going to look at the failures myself using your loaner method and then write that process up so we can get a dev to help.
(In reply to Chris Cooper [:coop] from comment #72) > (In reply to Chris Cooper [:coop] from comment #71) > > Pete: can I ask you to trigger another try run so I can look at current > > results? My try access is still broken and this will allow me to retrigger > > as necessary. No problem - I've made a try push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=85b4ef4fa06f4a75d9b50f8a3de2a3ecab3f7afd > > I spoke with Joel and Kendall in the TC migration mtg today. > > Pete: can I ask you to collate a list of the currently failing tests (from a > new Try run, hopefully) in a new bug comment? I'm going to look at the > failures myself using your loaner method and then write that process up so > we can get a dev to help. I'll be gone by the time this try push completes - but following that treeherder link above should be the authoritative source of the information. Note - I made it from running step 1 and 2 from comment 70. If anyone is investigating failures, that same comment explains how to retrigger the task with an interactive loaner, and troubleshoot the issue while the task is actually running.
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #73) > No problem - I've made a try push: > > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=85b4ef4fa06f4a75d9b50f8a3de2a3ecab3f7afd The jobs are not currently running due to bug 1443595.
Depends on: 1443595
From jmaher via email: "here is the most recent push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=85b4ef4fa06f4a75d9b50f8a3de2a3ecab3f7afd&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=runnable&filter-searchStr=x64 here you can see: c3 - * toolkit/content/tests/chrome/test_bug360437.xul * toolkit/content/tests/chrome/test_dialogfocus.xul * toolkit/content/tests/chrome/test_showcaret.xul * toolkit/content/tests/widgets/test_menubar.xul * toolkit/mozapps/downloads/tests/chrome/test_unknownContentType_delayedbutton.xul 5 - * toolkit/components/prompts/test/test_prompts.html * toolkit/components/prompts/test/test_modal_prompts.html bc2/bc3 - * browser/components/customizableui/test/browser_panelUINotifications_multiWindow.js these are all tests that deal with focus, specifically multi modal/window test cases. In many of the screenshots you can see we pop up the window, but it is opaque in that you can see a shadow of it in the foreground. Ideally you could watch a test run locally and then compare it to a loaner and see the difference. Most of these tests send keys to specific windows to type/click/hotkey. I wonder if there is some quirk where the windows or keys are getting crossed between users such as current user and administrator."
here are the components for the failures: c3 - * toolkit/content/tests/chrome/test_bug360437.xul ** Toolkit :: Find Toolbar * toolkit/content/tests/chrome/test_dialogfocus.xul ** Toolkit :: XUL Widgets * toolkit/content/tests/chrome/test_showcaret.xul ** Toolkit :: XUL Widgets * toolkit/content/tests/widgets/test_menubar.xul ** Core :: XUL * toolkit/mozapps/downloads/tests/chrome/test_unknownContentType_delayedbutton.xul ** Toolkit :: Downloads API 5 - * toolkit/components/passwordmgr/test/mochitest/test_prompt.html ** Toolkit :: Password Manager * toolkit/components/prompts/test/test_modal_prompts.html ** Toolkit :: General bc2/bc3 - * browser/components/customizableui/test/browser_panelUINotifications_multiWindow.js ** Firefox :: Toolbars and Customization Ideally there is something in the code of the tests that is in common and not seen in other tests, we could pinpoint the actions which seem to cause failure in this new environment.
the test failures in mochitest-e10s-5 are concerning because when I disable the above mentioned tests, the next test(s) in the list start failing in the same way (timeout). This looks to be that we would end up disabling all prompt and modal tests for password manager and in toolkit general- the other failures seem to go away clean with disabling tests. One observation I noticed was many of these failures are on the 3rd window, so we have the harness and we open a new window for a test and that new window opens a dialog or another new window.
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.5.1 → Upgrade all win7/win10 gecko workers to generic-worker 10.7.1
Pete and I met yesterday and discussed this. Here's a summation of our thoughts. Windows has a few user experience interactions (pop-ups, messages, modal windows) that appear on first-run. Since the new worker is using a new user for every run, these interactions may appear every single time we run a test unless we find the correct settings to toggle them off. I can recall this happening before on Mac. We don't know if this is the actual cause, and the timing of these interactions is unknown. Three ways we could proceed here: 1) To quote Pete, we should add a "big, dirty sleep" to the start of the test run, say 5 minutes. This will give us enough time to establish an RDP connection before the test starts to see if there's an errant popup, etc stealing focus. It may also give the pop-ups enough time to clear on their own before the test starts. 2) Failing that, we could use a Windows sys call to try to figure out which window has focus during each test. 3) We could do a screen recording of an entire test run. This would allow us to step through, rewind, etc to observe a behavior that may be too quick to manually notice otherwise.
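For option 2, one possible shape of the focus check is sketched below. It assumes the standard user32 calls GetForegroundWindow, GetWindowTextLengthW and GetWindowTextW, reached through ctypes; the function name is made up, and how it would be hooked into the test harness is left open. On non-Windows platforms it simply returns None.

```python
import ctypes
import sys


def foreground_window_title():
    """Return the title of the window that currently has focus.

    Returns None when not running on Windows, and "" when no window
    is in the foreground. Sketch only; not wired into any harness.
    """
    if sys.platform != "win32":
        return None

    from ctypes import wintypes

    user32 = ctypes.windll.user32
    # Declare signatures so window handles are not truncated on 64-bit.
    user32.GetForegroundWindow.restype = wintypes.HWND
    user32.GetWindowTextLengthW.argtypes = [wintypes.HWND]
    user32.GetWindowTextW.argtypes = [wintypes.HWND, wintypes.LPWSTR, ctypes.c_int]

    hwnd = user32.GetForegroundWindow()
    if not hwnd:
        return ""

    length = user32.GetWindowTextLengthW(hwnd)
    buf = ctypes.create_unicode_buffer(length + 1)
    user32.GetWindowTextW(hwnd, buf, length + 1)
    return buf.value


if __name__ == "__main__":
    print(foreground_window_title())
```

Logging this at the start and end of each test would at least tell us whether something other than the browser holds focus when a prompt test times out.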
(In reply to Chris Cooper [:coop] from comment #79) > 3) We could do a screen recording of an entire test run. This would allow us > to step through, rewind, etc to observe a behavior that may be too quick to > manually notice otherwise. https://www.dvdvideosoft.com/products/dvd/Free-Screen-Video-Recorder.htm looks like it might do the trick here.
(In reply to Pete Moore [:pmoore][:pete] from comment #80) > (In reply to Chris Cooper [:coop] from comment #79) > > > 3) We could do a screen recording of an entire test run. This would allow us > > to step through, rewind, etc to observe a behavior that may be too quick to > > manually notice otherwise. > > https://www.dvdvideosoft.com/products/dvd/Free-Screen-Video-Recorder.htm > looks like it might do the trick here. I had some issues with installing "Free Screen Video Recorder", so I'm taking a look at "OBS Studio" instead: https://obsproject.com/
(In reply to Chris Cooper [:coop] from comment #79) > 1) To quote Pete, we should add a "big, dirty sleep" to the start of the > test run, say 5 minutes. This will give us enough time to establish an RDP > connection before the test starts to see if there's an errant popup, etc > stealing focus. It may also give the pop-ups enough time to clear on their > own before the test starts. I've made a new try push to try this out: remote: View your change here: remote: https://hg.mozilla.org/try/rev/04c887284be5672c06d78ae93624c0624e33e722 remote: remote: Follow the progress of your build on Treeherder: remote: https://treeherder.mozilla.org/#/jobs?repo=try&revision=04c887284be5672c06d78ae93624c0624e33e722
Note, I've (hopefully) fixed the issue with the taskbar on both Windows 7 and Windows 10 not being hidden in bug 1433851 and am testing in a new try push: https://tinyurl.com/ycwrff4e
(In reply to Pete Moore [:pmoore][:pete] from comment #82) > (In reply to Chris Cooper [:coop] from comment #79) > > 1) To quote Pete, we should add a "big, dirty sleep" to the start of the > > test run, say 5 minutes. This will give us enough time to establish an RDP > > connection before the test starts to see if there's an errant popup, etc > > stealing focus. It may also give the pop-ups enough time to clear on their > > own before the test starts. > > I've made a new try push to try this out: > > remote: View your change here: > remote: > https://hg.mozilla.org/try/rev/04c887284be5672c06d78ae93624c0624e33e722 > remote: > remote: Follow the progress of your build on Treeherder: > remote: > https://treeherder.mozilla.org/#/ > jobs?repo=try&revision=04c887284be5672c06d78ae93624c0624e33e722 I forgot to say - the big dirty sleep didn't help. :(
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #77)
> the test failures in mochitest-e10s-5 are concerning because when I disable
> the above mentioned tests, the next test(s) in the list start failing in the
> same way (timeout). This looks to be that we would end up disabling all
> prompt and modal tests for password manager and in toolkit general- the
> other failures seem to go away clean with disabling tests.
>
> One observation I noticed was many of these failures are on the 3rd window,
> so we have the harness and we open a new window for a test and that new
> window opens a dialog or another new window.

This is now resolved.

TL;DR: It was due to the settings here[1].

In release 10.7.7 I've upgraded generic-worker to Go 1.10, and while doing so rediscovered these STARTUPINFO settings[3]. Go 1.10 makes it possible to use the Go standard library to make CreateProcessAsUser system calls, which previously was not possible, so I have migrated the worker to use the standard library for spawning task user processes, and done away with these flags. While migrating, I discovered the STARTUPINFO flags are controlled by the standard library in Go 1.10, and do not allow for any customisation.

From the MSDN docs[3,4] we see the difference between the flag settings I was using, and those used in the standard library:

Pre generic-worker 10.7.7
=========================

Before, we set the following STARTUPINFO process flags[1]:

si.Flags = win32.STARTF_FORCEOFFFEEDBACK | syscall.STARTF_USESHOWWINDOW
si.ShowWindow = syscall.SW_SHOWMINNOACTIVE

From the MSDN docs[3,4]:

STARTF_FORCEOFFFEEDBACK
Indicates that the feedback cursor is forced off while the process is starting. The Normal Select cursor is displayed.

STARTF_USESHOWWINDOW
The wShowWindow member contains additional information.

SW_SHOWMINNOACTIVE
Displays the window as a minimized window. This value is similar to SW_SHOWMINIMIZED, except the window is not activated.

Post generic-worker 10.7.7
==========================

Now we set the flags like this[2]:

si.Flags = STARTF_USESTDHANDLES

From the MSDN docs[3]:

STARTF_USESTDHANDLES
The hStdInput, hStdOutput, and hStdError members contain additional information.....

Conclusion
==========

The problem here was with SW_SHOWMINNOACTIVE, which creates a non-activated window. Rereading the docs reminded me that our failures were focus related, which led me to try adopting the standard library instead, to see if that solved the issue. Of course we could have continued to use our custom runlib library, and adapted the flags, but this seemed like a good opportunity to simplify our codebase, and use the new feature of the Go standard library.

--
[1] https://github.com/taskcluster/runlib/blob/4ab38b9ff487347cfe9707ca800d305baab444b5/subprocess/subprocess_windows.go#L139-L140
[2] https://github.com/golang/go/blob/go1.10/src/syscall/exec_windows.go#L311-L320
[3] https://msdn.microsoft.com/en-us/library/windows/desktop/ms686331%28v=vs.85%29.aspx
[4] https://msdn.microsoft.com/en-us/library/windows/desktop/ms633548(v=vs.85).aspx
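To spell out the difference between the two configurations, here is a small platform-independent sketch. The numeric values are the documented Win32 constants; the helper `starts_without_activation` and the two tuples are invented for illustration and are not generic-worker code. It assumes wShowWindow is left at 0 in the new setup, which does not matter because STARTF_USESHOWWINDOW is not set there.

```python
# Win32 STARTUPINFO.dwFlags bits (documented values).
STARTF_USESHOWWINDOW = 0x00000001
STARTF_FORCEOFFFEEDBACK = 0x00000080
STARTF_USESTDHANDLES = 0x00000100

# ShowWindow commands that show a window WITHOUT activating it.
SW_SHOWNOACTIVATE = 4
SW_SHOWMINNOACTIVE = 7
SW_SHOWNA = 8
NON_ACTIVATING = {SW_SHOWNOACTIVATE, SW_SHOWMINNOACTIVE, SW_SHOWNA}

# (dwFlags, wShowWindow) before and after the 10.7.7 change.
PRE_10_7_7 = (STARTF_FORCEOFFFEEDBACK | STARTF_USESHOWWINDOW, SW_SHOWMINNOACTIVE)
POST_10_7_7 = (STARTF_USESTDHANDLES, 0)  # wShowWindow assumed unset


def starts_without_activation(flags, show_window):
    """True if the process is asked to create its first window non-activated.

    wShowWindow is only honoured when STARTF_USESHOWWINDOW is set.
    """
    return bool(flags & STARTF_USESHOWWINDOW) and show_window in NON_ACTIVATING


if __name__ == "__main__":
    print(hex(PRE_10_7_7[0]), starts_without_activation(*PRE_10_7_7))
    print(hex(POST_10_7_7[0]), starts_without_activation(*POST_10_7_7))
```

With the old flags the first window is requested minimized and non-activated, which fits the focus-related timeouts; with the new flags no show-window hint is passed at all.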
Depends on: 1448197
Depends on: 1447265
I believe all blocking issues have now been resolved - but I'm on PTO - so will look at rolling out generic-worker next week when I'm back.
Latest try push with new settings. Failures were all intermittents, that passed in retries: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7af721a1d8a445af27f3ceaea939c75d6eb6266a&group_state=expanded
Summary: Upgrade all win7/win10 gecko workers to generic-worker 10.7.1 → Upgrade all win7/win10 gecko workers to generic-worker 10.7.8
Blocks: 1180187
Currently preparing the deployment...
Attachment #8935897 - Flags: review+
Pushed by pmoore@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/1f8e34bd956b Escape backslashes in GCLI screenshot test,r=pmoore
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
As Pete wrote this upgrade hasn't been done yet, so I will reopen for now.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This should do all the magic. I've spent a couple of days double and triple checking worker type definitions, and am reasonably confident everything has been accounted for. This is really a *major* upgrade, so the potential for something to go wrong is higher than normal. I've taken a snapshot of the (confidential) worker type definitions, which I will share with the team, so that they can be rolled back if needed. This means if any problems are discovered the rollback process is two-fold: 1) Revert PR 128 from OpenCloudConfig 2) Request that somebody in the taskcluster team reverts the worker type definitions to their current state (i.e. to the versions I am sending them in an email this afternoon)
Attachment #8964886 - Flags: review?(rthijssen)
Attachment #8964886 - Flags: review?(rthijssen) → review+
We haven't had any complaints of problems yet, so I'll close this now. Please reopen if any issues appear!
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
No longer depends on: 1448197
Status: RESOLVED → REOPENED
Flags: needinfo?(pmoore)
Resolution: FIXED → ---
Nice spot, Joel! Does this look ok?
Attachment #9023219 - Flags: review?(jmaher)
Comment on attachment 9023219 [details] [diff] [review] gecko patch: enable coalescing on win7/win10 worker types Review of attachment 9023219 [details] [diff] [review]: ----------------------------------------------------------------- cool!
Attachment #9023219 - Flags: review?(jmaher) → review+
Pete: is this live now?
Flags: needinfo?(pmoore)
Pushed by pmoore@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/cef0a23a3849 enable coalescing on Windows 7 and Windows 10 worker types,r=jmaher

It should be soon - I've just pushed to mozilla-inbound, and this bug should get automatically closed when it lands on mozilla-central, so let's leave it open.

Fingers crossed! Thanks for chasing me up. :)

Flags: needinfo?(pmoore)
Status: REOPENED → RESOLVED
Closed: 7 years ago → 6 years ago
Resolution: --- → FIXED
Component: Integration → Services