1372172 - investigate ec2 windows instances stalled in impaired state

Assignee

Description

•

7 years ago

Windows 7 and 10 instances are too frequently entering an impaired state (hung. no logging to papertrail and no running of taskcluster tasks). this is currently worked around by monitoring ec2 via cron script and rebooting instances determined to be in this state. event log entries from around the time the impaired state is detected: Jun 12 12:37:52 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com generic-worker: 2017/06/12 11:37:51 Exiting worker with exit code 67 Jun 12 12:37:56 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Application_Error: Faulting application name: Explorer.EXE, version: 10.0.15063.0, time stamp: 0xa25b6083 Faulting module name: PROPSYS.dll, version: 7.0.15063.0, time stamp: 0x74650d36 Exception code: 0xc0000006 Fault offset: 0x000000000003a895 Faulting process id: 0x17bc Faulting application start time: 0x01d2e36c5df05d7a Faulting application path: C:\Windows\Explorer.EXE Faulting module path: C:\Windows\SYSTEM32\PROPSYS.dll Report Id: bd44c21e-c016-49c6-a3c4-2ebae3a3c865 Faulting package full name: Faulting package-relative application ID: Jun 12 12:37:56 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Application_Error: Windows cannot access the file for one of the following reasons: there is a problem with the network connection, the disk that the file is stored on, or the storage drivers installed on this computer; or the disk is missing. Windows closed the program Windows Explorer because of this error. Program: Windows Explorer File: The error value is listed in the Additional Data section. User Action 1. Open the file again. This situation might be a temporary problem that corrects itself when the program runs again. 2. If the file still cannot be accessed and - It is on the network, your network administrator should verify that there is not a problem with the network and that the server can be contacted. - It is on a removable disk, for example, a floppy disk or CD-ROM, verify that the disk is fully inserted into the computer. 3. Check and repair the file system by running CHKDSK. To run CHKDSK, click Start, click Run, type CMD, and then click OK. At the command prompt, type CHKDSK /F, and then press ENTER. 4. If the problem persists, restore the file from a backup copy. 5. Determine whether other files on the same disk can be opened. If not, the disk might be damaged. If it is a hard disk, contact your administrator or computer hardware vendor for further assistance. Additional Data Error value: C000026E Disk type: 0 Jun 12 12:37:58 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (I-026957CE49B4B) has initiated the restart of computer I-026957CE49B4B on behalf of user NT AUTHORITY\SYSTEM for the following reason: No title for this reason could be found Reason Code: 0x800000ff Shutdown Type: restart Comment: Rebooting as generic worker ran successfully Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is not running. Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: OpenCloudConfig is not running. Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Get-Uptime :: last boot: 6/12/2017 11:09:19 AM; uptime: 00:28:42.9313708. Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: remote desktop session is not active. Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: Is-ConditionTrue :: drive format is not in progress. Jun 12 12:38:03 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com HaltOnIdle: instance failed validity check and will be halted. Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: An attempt to write to the file "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\EDB.log" at offset 159744 (0x0000000000027000) for 4096 (0x00001000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup. Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: Unable to write to section 2 while flushing logfile Z:\task_1497265725\AppData\Local\TileDataLayer\Database\EDB.log. Error -1022 (0xfffffc02). Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: The logfile sequence in "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\" has been halted due to a fatal error. No further updates are possible for the databases that use this logfile sequence. Please correct the problem and restart or restore from backup. Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: An attempt to write to the file "Z:\task_1497265725\AppData\Local\TileDataLayer\Database\vedatamodel.jfm" at offset 0 (0x0000000000000000) for 8192 (0x00002000) bytes failed after 0.000 seconds with system error 21 (0x00000015): "The device is not ready. ". The write operation will fail with error -1022 (0xfffffc02). If this error persists then the file may be damaged and may need to be restored from a previous backup. Jun 12 12:38:40 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com ESENT: svchost (3768) TILEREPOSITORYS-1-5-21-458149028-3186198778-2564299442-1003: The database engine stopped the instance (0) with error (-510). Internal Timing Sequence: [1] 0.000006 +J(0) [2] 0.000012 +J(0) [3] 0.000007 +J(0) [4] 0.000005 +J(0) [5] 0.000038 +J(0) +M(C:0K, Fs:1, WS:4K # 0K, PF:0K # 0K, P:0K) [6] 0.000016 +J(0) [7] - [8] 0.000009 +J(0) [9] 0.000885 +J(0) +M(C:0K, Fs:3, WS:-36K # 0K, PF:-48K # 0K, P:-48K) [10] - [11] 0.001168 +J(0) +M(C:0K, Fs:2, WS:8K # 0K, PF:0K # 0K, P:0K) [12] - [13] 0.000052 +J(0) +M(C:0K, Fs:3, WS:8K # 0K, PF:-4K # 0K, P:-4K) [14] 0.000037 +J(0) +M(C:0K, Fs:0, WS:-8K # 0K, PF:-8K # 0K, P:-8K) [15] 0.000016 +J(0) +M(C:0K, Fs:0, WS:-12K # 0K, PF:-12K # 0K, P:-12K) [16] 0.000006 +J(0). Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-GroupPolicy: Shutdown script failed. GPO Name : Local Group Policy GPO File System Path : C:\Windows\System32\GroupPolicy\Machine Script Name: C:\scripts\set_user_data.ps1 Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000595E00000. Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Service_Control_Manager: The CldFlt service failed to start due to the following error: The request is not supported. Jun 12 12:39:17 i-026957ce49b4b6b1c.gecko-t-win10-64-gpu.usw2.mozilla.com Microsoft-Windows-Kernel-PnP: The driver \Driver\WudfRd failed to load for the device SWD\WPDBUSENUM\{70ffd6cb-3efa-11e7-9146-806e6f6e6963}#0000000000100000.

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 1

•

7 years ago

Attached file reboot-impaired-instances.sh (obsolete) (deleted) — Details

this is the script i am using to reboot impaired instances. it runs under cron every 10 minutes. rebooted instances appear to perform fine and run tc tasks correctly.

Pete Moore [:pmoore][:pete]

Comment 2

•

7 years ago

Rob, this slightly modified version of the inner loop should only kill instances that have been impaired for more than ten minutes: /usr/bin/aws ec2 describe-instance-status --profile taskcluster --region "${aws_region}" --filters 'Name=instance-status.status,Values=impaired' --query 'InstanceStatuses[*].{InstanceId:InstanceId,Since:InstanceStatus.Details[0].ImpairedSince}' --output text | while read instance_id impaired_since do # check if impaired for more than 600 seconds (10 mins) # note, since dates are iso 8601 in UTC, a string comparison of dates is sufficient and correct if [ "${impaired_since}" \< "$(date -d @$(($(date +%s) - 600)) --iso-8601=seconds)" ]; then key_name=$(aws ec2 describe-instances --profile taskcluster --region ${aws_region} --instance-ids ${instance_id} | jq -r '.Reservations[0].Instances[0].KeyName') case "${key_name}" in *gecko-t-win*) aws ec2 reboot-instances --profile taskcluster --region ${aws_region} --instance-ids ${instance_id}>/home/grenade/cron/log/reboot-${aws_region}-${instance_id}.json 2>&1 echo $(/usr/bin/date -u '+%Y-%m-%d %H:%M:%S') ${aws_region} ${instance_id} ${key_name}>> /home/grenade/cron/log/reboot.log ;; *) echo ${key_name}>/home/grenade/cron/log/skip-${aws_region}-${instance_id}.log 2>&1 ;; esac fi done The reason I'd like to try this adjusted version is that I would expect all rebooting instances to be impaired at times, but only if they remain impaired for some time (I think 10 mins should cover it), should this indicate a real problem. If the status checks run less frequently than every 10 mins, we should increase the duration above (600s) accordingly. I'm curious if when filtering out the false positives, if we're just left with win10 gpu or not (I'm wondering if the instances of the other types is just related to the status check running while it is in the process of rebooting). In any case, this script adjustment should hopefully filter out some of the noise and reduce some of the false positives, leaving us with a clearer picture of how widespread the problem is. Many thanks!

Greg Arndt [:garndt]

Comment 3

•

7 years ago

My 2 cents is that it might be better to monitor for system status rather than instance status. I've seen instance status go into an impaired state right after reboot for a few minutes, but then resolve. The instances that are truly rogue always seem to have an impaired system status though.

Greg Arndt [:garndt]

Comment 4

•

7 years ago

Also for some more datapoints to add to the investigation, while dealing with these instances last week and over the weekend, I noticed: 1. primarily concentrated to the windows 10 gpu worker type 2. failure can happen at different places 3. Some instances claimed no tasks at all 4. some were in the middle of a task (and not the same type of task),. These I think are the ones reported as exception here 5. some completed the last task successfully and failed after reboot I assume I also was only analyzing instances which were in a system impaired state, not instance. I found that instances can often be in an instance impaired state for a few minutes but then resolved. Some of the information was collected and put into this sheet: https://docs.google.com/spreadsheets/d/1rqFavRrFUefvtd6pAPmU0l2pGCuYwt86HZOzi4D4-zU/edit#gid=1470233531

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 5

•

7 years ago

Attached file reboot.log (obsolete) (deleted) — Details

some quick counts of impaired instances by worker type (between 2017-06-12 11:41:16 and 2017-06-13 13:23:27 UTC): - gecko-t-win7-32: 742 - gecko-t-win7-32-gpu: 504 - gecko-t-win10-64: 39 - gecko-t-win10-64-gpu: 1362

Pete Moore [:pmoore][:pete]

Comment 6

•

7 years ago

For papertrail troubleshooting, it looks like this papertrail search finds the instances that we reboot via this script: https://papertrailapp.com/groups/853883/events?q=+The+system+has+rebooted+without+cleanly+shutting+down+first

Pete Moore [:pmoore][:pete]

Comment 7

•

7 years ago

I notice that e.g. this worker (i-0f82062fc0bb59517/us-west-2/gecko-t-win7-32-gpu) successfully ran 8 tasks[1] (rebooting between each one) before being caught by reboot-impaired-instances.sh shortly after its last reboot[2] (before claiming a 9th task): Jun 13 11:39:55 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 09:39:54 Running task https://tools.taskcluster.net/task-inspector/#LgCf6LkKR2emOJlhj_7OpA/0 Jun 13 11:57:40 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 09:57:39 Running task https://tools.taskcluster.net/task-inspector/#ObKuKSvISmq2Auos23tVGQ/0 Jun 13 12:03:43 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:03:42 Running task https://tools.taskcluster.net/task-inspector/#QKpFbfqbTRmBcWZuvSWJZQ/0 Jun 13 12:11:26 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:11:25 Running task https://tools.taskcluster.net/task-inspector/#LUMtilzmSaaJLrHuj96sxw/0 Jun 13 12:17:23 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:17:21 Running task https://tools.taskcluster.net/task-inspector/#Tgyfl2XdSrSuPG_ElaPhZA/0 Jun 13 12:48:02 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 10:47:59 Running task https://tools.taskcluster.net/task-inspector/#LlANDjYtRLyZJXS7YKP4Kg/0 Jun 13 13:16:21 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 11:16:19 Running task https://tools.taskcluster.net/task-inspector/#IbQKYt8dRXCcaPBbf2Z9nQ/0 Jun 13 14:21:37 i-0f82062fc0bb59517.gecko-t-win7-32-gpu.usw2.mozilla.com generic-worker: 2017/06/13 12:21:35 Running task https://tools.taskcluster.net/task-inspector/#dr14IIpSSGaqlxDhMEGutg/0 This seems suspicious to me, because it was able to start up 8 times without crashing, yet was determined to be impaired on the 9th startup, and before any task was claimed that might have exercised it in some way. I still wonder if we are catching too many instances in our reboot-impaired-instances.sh net, even with the 10 minute wait from comment 2. Maybe we should adjust the query per comment 3? What do you guys think? -- [1] https://papertrailapp.com/systems/942648811/events?q=Running+task&focus=811235686693920784 [2] https://papertrailapp.com/systems/942648811/events?focus=811265815423926311&selected=811265815423926311

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 8

•

7 years ago

we're not trapping instances that are not impaired. there's a different ec2 cli call for instances whose state has not yet been determined. the call we are running is specifically designed to only reboot instances that are genuinely impaired. even if my statement above were completely incorrect (it's not, but hypothetically speaking), we know we have an issue with instances going into impaired state. the kill script is a temporary measure in order to keep the fires lit. worrying about the kill script does absolutely nothing about the very real issue that we have instances becoming impaired which would simply hold resources until we ground to a halt. if we could focus attention on determining how instances are getting impaired instead, i think that would be the most productive use of our time. i appreciate that the script may be flawed. but i am completely sure that the script is not what is causing instances to become impaired so criticisms of the script might be perfectly valid but achieve precisely nothing in debugging the cause of the impaired state. polishing the temporary script will not solve our problem.

Pete Moore [:pmoore][:pete]

Comment 9

•

7 years ago

Can we roll back the golden AMIs and the OCC manifests, and then apply the changes (be they base AMI changes, or OCC changes) iteratively, until we reproduce the issue? Is there anything you would like a hand with?

Flags: needinfo?(rthijssen)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 10

•

7 years ago

i'm focusing on the event log messages about lost network connectivity and the configuration of the pv driver. please feel free to roll back images if you prefer that approach for identifying the cause of the impaired state.

Flags: needinfo?(rthijssen)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 11

•

7 years ago

found a suspect in the event logs for win 10 64 gpu: Application: ViscosityService.exe Framework Version: v4.0.30319 Description: The process was terminated due to an unhandled exception. Exception Info: System.Runtime.InteropServices.COMException at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32, IntPtr) at System.Management.SinkForEventQuery.Cancel() at System.Management.ManagementEventWatcher.Stop() at System.Management.ManagementEventWatcher.Finalize() could explain the loss of network since viscosity creates a virtual adapter for the vpn. timing of the viscosity install also coincides with the start of the issues with impaired instances so testing a patch now which disables it: - https://github.com/mozilla-releng/OpenCloudConfig/commit/cd9c0ef083e416ee325760ba9e6b2c3d254f6bc1 - https://tools.taskcluster.net/task-group-inspector/#/JiBoMQ9JSMGR9-1ZbC17KQ

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Updated

•

7 years ago

Assignee: relops → rthijssen

Status: NEW → ASSIGNED

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 12

•

7 years ago

Attached file reboot-impaired-instances.sh (obsolete) (deleted) — Details

the windows 10 gpu instances still go into impaired state and need to be rebooted when this happens. if they aren't rebooted, they use up all of our configured allocation for gecko-t-win10-64-gpu and the pending count for that instance type goes up while no work gets done on the impaired instances. i run the attached script to hunt down impaired instances and reboot them as part of a cron job. which i have configured to run every five minutes. someone will need to run this script or something like it while i'm on pto or until we resolve the root cause of the instances going into this state.

Attachment #8876703 - Attachment is obsolete: true

Attachment #8877149 - Attachment is obsolete: true

Attachment #8919820 - Flags: feedback?(mcornmesser)

Mark Cornmesser [:markco]

Comment 13

•

7 years ago

I think I have something working here. Thought he logs are generating currently. I am going to check it again in an hour. (In reply to Rob Thijssen (:grenade - UTC+3) from comment #12) > Created attachment 8919820 [details] > reboot-impaired-instances.sh > > the windows 10 gpu instances still go into impaired state and need to be > rebooted when this happens. if they aren't rebooted, they use up all of our > configured allocation for gecko-t-win10-64-gpu and the pending count for > that instance type goes up while no work gets done on the impaired instances. > > i run the attached script to hunt down impaired instances and reboot them as > part of a cron job. which i have configured to run every five minutes. > > someone will need to run this script or something like it while i'm on pto > or until we resolve the root cause of the instances going into this state.

Mark Cornmesser [:markco]

Updated

•

7 years ago

Attachment #8919820 - Flags: feedback?(mcornmesser) → feedback+

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 14

•

7 years ago

https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu has again issues 421 pending jobs and 256 instances in the pool - many in state "exception" or "completed".

Marco Castelluccio [:marco]

Updated

•

7 years ago

Blocks: 1419475

Pete Moore [:pmoore][:pete]

Updated

•

7 years ago

Blocks: 1425746

Pete Moore [:pmoore][:pete]

Updated

•

7 years ago

Blocks: tc-stability

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Updated

•

7 years ago

Depends on: 1436274

Chris Cooper [:coop] (he/him)

Updated

•

7 years ago

No longer blocks: tc-stability

Chris Cooper [:coop] (he/him)

Updated

•

7 years ago

Blocks: tc-stability

Chris Cooper [:coop] (he/him)

Comment 15

•

7 years ago

(In reply to Rob Thijssen (:grenade UTC+2) from comment #12) > i run the attached script to hunt down impaired instances and reboot them as > part of a cron job. which i have configured to run every five minutes. > > someone will need to run this script or something like it while i'm on pto > or until we resolve the root cause of the instances going into this state. Rob: Where is the script running right now? If it's still running on someone's (your?) personal machine, can we please get it moved onto an AWS instance or a datacenter machine?

Flags: needinfo?(rthijssen)

Pete Moore [:pmoore][:pete]

Comment 16

•

7 years ago

:jhford has established that since today, that script no longer will work, as it relies on some convention which is no longer followed. I've terminated workers in the most affected pools, and :jhford is seeing if he can stand up a modified version of the script. The path back to "old-world" where that script works is non-trivial, which is why :jhford is initially investigating patching the existing script from this bug.

Chris Cooper [:coop] (he/him)

Comment 17

•

7 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #16) > :jhford has established that since today, that script no longer will work, > as it relies on some convention which is no longer followed. > > I've terminated workers in the most affected pools, and :jhford is seeing if > he can stand up a modified version of the script. > > The path back to "old-world" where that script works is non-trivial, which > is why :jhford is initially investigating patching the existing script from > this bug. (canceling the NI on :grenade) It's worrying in that we deployed a provisioning change that invalidated other parts of our infra, and found out about it after the fact. How do we keep the Taskcluster and RelOps sides of this coin in better sync going forward?

Flags: needinfo?(rthijssen)

John Ford [:jhford] CET/CEST Berlin Time

Updated

•

7 years ago

Depends on: 1437973

John Ford [:jhford] CET/CEST Berlin Time

Comment 18

•

7 years ago

Comment on attachment 8919820 [details] reboot-impaired-instances.sh Obsoleted in favour of https://bugzilla.mozilla.org/attachment.cgi?id=8950677&action=edit This script was using internal implementation details of the provisioner to determine which instances belonged to which workerType. For future reference, the correct endpoint to determine this information is: https://aws-provisioner.taskcluster.net/v1/state/gecko-t-win10-64-gpu Bug 1437973 will track deployment of this script.

Attachment #8919820 - Attachment is obsolete: true

Dustin J. Mitchell [:dustin] (he/him)

Comment 19

•

7 years ago

Trees remain closed for this since about 20:00UTC. We have moved the winstance-slayer script into a more visible hook in bug 1437973 and started terminating instead of rebooting, but to no avail -- pending seems to keep going up. It does not appear that the instances are taking any jobs.

Chris Cooper [:coop] (he/him)

Updated

•

7 years ago

Severity: normal → blocker

Priority: -- → P1

John Ford [:jhford] CET/CEST Berlin Time

Comment 20

•

7 years ago

(In reply to Chris Cooper [:coop] from comment #17) > It's worrying in that we deployed a provisioning change that invalidated > other parts of our infra, and found out about it after the fact. Agreed. I didn't know anything about this bug beyond a passing familiarity with Gecko testing on Windows being unreliable, so it came as a shock to me as well. This logic of script could've been integrated into ec2-manager in about two hours. The bulk of that that code is what was written for 1437973. Greg was involved in this bug and we'd spoken about checks similar to this as well as the removal of the ssh key pair name in our 1x1s for months, so it's confusing to me that I wasn't looped in. More concerning, however, is that the way that this script is developed and deployed. It's critical to having open trees, but it seems that it's being run as a cronjob on someone's workstation. That's fine when it's a couple days to get through a hairy spot, but this has been running for eight months. It's using internal implementation details of another system. Using unsupported interfaces and workstation deployments for something critical is risky, as this situation has highlighted. This script could've also negatively impacted EC2-Manager, causing other tree closures. It's using two API endpoints which are extremely expensive on the EC2 API call rate limits. The two describe* calls can list theoretically unbounded data inside the EC2 api. That could exhaust our API rate limit quota. If that happens, ec2-manager would fail when it needed to call describeInstances endpoint itself. This call limit exhaustion is the direct cause of nearly all provisioning outages in the last year or so, so this is not an abstract concern. Maybe the solution is to block all access to the instances in our EC2 account other than the supported interfaces of the AWS-Provisioner. When a need for more access arises, we can make sure it's done in the most reliable way possible or at least document the exceptions. > How do we keep the Taskcluster and RelOps sides of this coin in better sync > going forward? I will include Relops on emails relating to major EC2-Manager or AWS-Provisioner changes if desired. As I'm not sure which internal details are used, I don't know which changes are important. I can also suggest following the EC2-Manager repository. That's where the development for all of these changes happens. Issues and Pull Requests are the best source of information. Changes are usually landed to master branch a day before deployment. Some upcoming changes which might be important for Relops in EC2-Manager are: - switching how we generate taskcluster credentials for workers - removal of provisioner secrets - we will be doing reliability based bidding - we will be rotating the credentials which ec2-manager uses The first two changes are changes to supported interfaces and backwards compatibility is a high priority. The last two should be invisible. It would be great if Relops could let me know which tools operate on the taskcluster ec2 account. The tools of interest will either call listing operations or change state. Anything which does either thing automatically would be great too. Things like CloudWatch monitoring isn't important.

Chris Cooper [:coop] (he/him)

Comment 21

•

7 years ago

I cc-ed :markco on the AWS case, but in case[1] he can't see it, I'm copying the response here: "From my understanding of your issue, you have an instance which is failing status checks & would like some assistance in investigating this issue. "I have reviewed your instance i-001e32d40884f82ed & can see from looking at the status checks there is an issue with EC2 detecting the Network card for your instance which is generally caused by the interface being disabled within Windows. "To resolve this issue, can you please create a new Elastic Network Interface, Attach it to your instance, then RDP to your instance using the new ENI IP address & ensure the Primary Network interface is not disabled?" 1. https://console.aws.amazon.com/support/v1#/case/?displayId=4875481371&language=en

Pete Moore [:pmoore][:pete]

Comment 22

•

7 years ago

The problem turned out to be two-fold, but from the same root cause. We had code that relied on the key name of provisioned instances containing the worker type name. 1) The kill instances script relied on this, and therefore no longer was effective in killing workers. 2) OpenCloudConfig relied on this, and therefore OCC Windows instances would not work. Non-OCC Windows workers are not affected (such as those used by nss, and nss-try). 1) Has a fix here: https://mozilla.logbot.info/taskcluster/20180214#c14298284 2) Has a fix here: https://github.com/mozilla-releng/OpenCloudConfig/commit/d7d2df5a174087bad52e7d3636ae92e043f999f0 The fix from 1) is already deployed. It will take one to two hours for 2) to automatically deploy to the affected worker types, at which point we should see the backlogs to start to recover.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 23

•

7 years ago

Thank you for the investigation and fixing the issue. Pending counts are going down, sheriffs waited for the jobs pending on non-Try trees to run up to latest potential merge candidates, did a merge and then reopened the trees a few minutes ago. Autoland will get closed again because there should be many patches pending and we will reopen it again after we got some build and test coverage.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 24

•

7 years ago

Trees are closing again because Windows builds stopped running after some time. Do we hit an AWS query limit? https://tools.taskcluster.net/aws-provisioner/gecko-3-b-win2012/ currently lists 256 pending jobs for gecko-3-b-win2012. Pending build jobs whose platforms start with 'win' grouped by state: https://sql.telemetry.mozilla.org/queries/51150 pending 394 running 5 The 'running' number is falling. Pending build jobs whose platforms start with 'win', looked at the column 'state': https://sql.telemetry.mozilla.org/queries/51149

Flags: needinfo?(pmoore)

Summary: investigate ec2 windows instances stalled in impaired state → Tree closure: investigate ec2 windows instances stalled in impaired state

John Ford [:jhford] CET/CEST Berlin Time

Comment 25

•

7 years ago

Flags: needinfo?(pmoore)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 26

•

7 years ago

The Windows builds not starting is being tracked in bug 1438152, got told this bug is only about gecko-t-win10-64-gpu.

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1438152

Chris Cooper [:coop] (he/him)

Updated

•

7 years ago

Severity: blocker → normal

Priority: P1 → P3

Chris Cooper [:coop] (he/him)

Updated

•

7 years ago

Depends on: 1438684

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Updated

•

7 years ago

Summary: Tree closure: investigate ec2 windows instances stalled in impaired state → investigate ec2 windows instances stalled in impaired state

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Updated

•

7 years ago

Blocks: 1437215

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 27

•

6 years ago

Any progress on this? It's blocking being able to run some kinds of tests in CI for QuantumRender, which in turn means it's going to be harder to ship QuantumRender.

Blocks: 1424755

Pete Moore [:pmoore][:pete]

Updated

•

6 years ago

No longer depends on: 1436274

Eric Rahm [:erahm]

Comment 28

•

6 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #27) > Any progress on this? It's blocking being able to run some kinds of tests in > CI for QuantumRender, which in turn means it's going to be harder to ship > QuantumRender. :coop, can you help prioritize this? This is blocking enabling QR for Windows Nightly.

Flags: needinfo?(coop)

Eric Rahm [:erahm]

Comment 29

•

6 years ago

If it helps, this is super easy to repro, just run with the patch from [1]: > ./mach try -b o -p win64 -u awsy-e10s[windows10-64-qr] -t none --rebuild 5 --artifact [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=65633282852e99abbed00c59ef8df057cb9ababb

Marco Castelluccio [:marco]

Comment 30

•

6 years ago

You can also easily reproduce with the patch from https://treeherder.mozilla.org/#/jobs?repo=try&revision=c57c6fe3ba01f2e3e62b34d6d02ca77c573fe2ac.

Pete Moore [:pmoore][:pete]

Updated

•

6 years ago

Blocks: 1339136

Chris Cooper [:coop] (he/him)

Comment 32

•

6 years ago

Redirecting NI to :fubar. I can't speak directly for :grenade's time.

Flags: needinfo?(coop) → needinfo?(klibby)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 33

•

6 years ago

Attached file m3-livelog (deleted) — Details

Per :pmoore's request in bug 1339136 comment 18, I ran curl -L on the livelog for one of these jobs that ended up going blue. The task is at [1], I think it was run 3 or 4 on which I ran curl: $ curl --location -o m3-livelog "https://queue.taskcluster.net/v1/task/DTrZAoN_SQq6iMZ2RW858A/runs/3/artifacts/public%2Flogs%2Flive.log" % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 100 138 100 138 0 0 461 0 --:--:-- --:--:-- --:--:-- 463 100 2928k 0 2928k 0 0 3199 0 --:--:-- 0:15:37 --:--:-- 0 curl: (56) LibreSSL SSL_read: SSL_ERROR_SYSCALL, errno 60 The last couple of lines in the log are this: 15:55:51 INFO - GECKO(3112) | --DOMWINDOW == 23 (0000016EB614BA00) [pid = 3388] [serial = 823] [outer = 0000000000000000] [url = http://mochi.test:8888/tests/dom/tests/mochitest/bugs/file_window_bar.html?default]^M 15:55:51 INFO - GECKO(3112) | --DOMWINDOW == 22 (0000016EB7843800) [pid = 3388] [serial = 820] [outer = 0000000000000000] [url = http://mochi.test:8888/tests/dom/tests/mochitest/bugs/fi so it looks like it just cuts off at some point. The "full" log (i.e. the m3-livelog file I got from curl) is attached. [1] https://tools.taskcluster.net/groups/Qq3j73IqQ76miagF9laLxQ/tasks/DTrZAoN_SQq6iMZ2RW858A/details

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 34

•

6 years ago

Also I should note that the m3-livelog file is 2998272 bytes, which is an exact multiple of 16k, so it's likely that something was buffering the output and the last stuff in the file is just whatever was last flushed, not necessarily the last output generated by the worker.

Kendall Libby [:fubar] (he/him)

Comment 35

•

6 years ago

Mark, can you have a go at this while Rob's on PTO? Given that we have a method to reproduce (thanks!) maybe we can manually run that while watching the event log, etc, and get a better idea of what's tripping us up.

Flags: needinfo?(klibby) → needinfo?(mcornmesser)

Mark Cornmesser [:markco]

Comment 36

•

6 years ago

Rgr.

Flags: needinfo?(mcornmesser)

Kendall Libby [:fubar] (he/him)

Comment 37

•

6 years ago

Out of curiosity, are any of the tasks/jobs that we've run into this issue on also run on buildbot/ESR52?

Mark Cornmesser [:markco]

Comment 38

•

6 years ago

What is the problem being reproduced in comment 29 and comment 30? Are those producing instances in an impaired state, or is an artifact upload issue? Do we have anyway to identify instances that are in an impaired state?

Eric Rahm [:erahm]

Comment 39

•

6 years ago

(In reply to Mark Cornmesser [:markco] from comment #38) > What is the problem being reproduced in comment 29 and comment 30? Are those > producing instances in an impaired state, or is an artifact upload issue? > > Do we have anyway to identify instances that are in an impaired state? I believe the instances are in an impaired state; trying to inspect one of the machines [1] indicates it's not provisioned. [1] https://tools.taskcluster.net/provisioners/aws-provisioner-v1/worker-types/gecko-t-win10-64-gpu/workers/eu-central-1/i-0aace1bdc7bd7ba24

Mark Cornmesser [:markco]

Comment 40

•

6 years ago

Because of bug 1443589 I have not been able to spend much time on this bug this week.

Pete Moore [:pmoore][:pete]

Updated

•

6 years ago

Blocks: 1474729

Marco Castelluccio [:marco]

Comment 41

•

6 years ago

We can still reproduce this with Windows coverage builds. Are you still planning to investigate this?

Flags: needinfo?(rthijssen)

Flags: needinfo?(mcornmesser)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 42

•

6 years ago

hi Marco, yes, we're still actively trying to solve this. it would be good to test if the problem can be reproduced on the new auto-generated amis (from bug 1474548) used on our beta workers. if you have a good way to reproduce the problem, can i ask that you import this commit: https://hg.mozilla.org/try/raw-rev/0103760 before pushing to try? the commit forces all windows tasks onto our beta workers where we'd like to check if the generated windows 10 amis maybe don't suffer from problems that cause them to go offline. thank you!

Flags: needinfo?(rthijssen)

Flags: needinfo?(mcornmesser)

Marco Castelluccio [:marco]

Comment 43

•

6 years ago

(In reply to Rob Thijssen (:grenade UTC+2) from comment #42) > hi Marco, yes, we're still actively trying to solve this. > > it would be good to test if the problem can be reproduced on the new > auto-generated amis (from bug 1474548) used on our beta workers. > > if you have a good way to reproduce the problem, can i ask that you import > this commit: https://hg.mozilla.org/try/raw-rev/0103760 before pushing to > try? the commit forces all windows tasks onto our beta workers where we'd > like to check if the generated windows 10 amis maybe don't suffer from > problems that cause them to go offline. > > thank you! Yes, that patch seems to fix the problem: https://treeherder.mozilla.org/#/jobs?repo=try&revision=4535f9712505f214b9078a1081c442ebbc1fdc9f.

Pete Moore [:pmoore][:pete]

Comment 44

•

6 years ago

Great news, thanks guys!

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 45

•

6 years ago

The patch worked for me as well: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=a73d0ef7bae6722730367ef060f312d8a375623c Any ETA on landing it? Really excited to see this finally fixed!

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 46

•

6 years ago

jmaher: we think we're ready to roll out auto-generated windows 10 amis for gecko-t-win10-64-gpu worker types. we'd like to do a test run on the gecko-t-win10-64-gpu-b beta workers that runs a fairly heavy test load against that worker type. i would like to create some try pushes that trigger all the tests that already run on gecko-t-win10-64-gpu but redirect them to gecko-t-win10-64-gpu-b using the worker swap from https://hg.mozilla.org/try/raw-rev/0103760. do you know what try syntax i should use to trigger quite a lot of win 10 gpu tests? do you want to do any testing as well that i should wait on?

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Comment 47

•

6 years ago

here are some examples: https://searchfox.org/mozilla-central/search?q=%3A+hardware&case=false&regexp=false&path=taskcluster%2Fci ignore the |default: hardware|, those are perf tests. I would also do win10-ccov as well- you can find that in |./mach try fuzzy -q 'windows ccov'| since these tests fail 90% of the time on existing configurations, just getting them to run a few times will be more than adequate :) If you want to up the load, just run all win10 jobs 4 or 5 times :)

Flags: needinfo?(jmaher)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 48

•

6 years ago

here's what i came up with to test (hopefully i got my patches and syntax correct): https://treeherder.mozilla.org/#/jobs?repo=try&revision=dd4980f

Joel Maher ( :jmaher ) (UTC -8)

Comment 49

•

6 years ago

this looks good- a few perma failing tests in reftest- those can be fixed easily.

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 50

•

6 years ago

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #49) > this looks good- a few perma failing tests in reftest- those can be fixed > easily. great. is there anything obvious on infra that could be causing those reftest failures? i converted some of the base64 images that had pixel differences to see if i could spot a desktop popup or something but the images i saw were just greyscale spheres on a white background for both the reference png and the test png. should i wait on promoting this ami to gecko-t-win10-64-gpu or is it easier to get reftest fixed if we go live with the updated ami?

Flags: needinfo?(jmaher)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 51

•

6 years ago

here's the last test run i did last night that included mochitest-gpu as well: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=c406a22

Joel Maher ( :jmaher ) (UTC -8)

Comment 52

•

6 years ago

Flags: needinfo?(matt.woodrow)

Flags: needinfo?(jmaher)

Flags: needinfo?(dbaron)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 53

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b4de24b8e77d29ec18aaabc5bd91277526d2721e

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 54

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=69c8640a5bead9619e92b9abfce76f5be8bfa9c8

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 55

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=68cc57cdf7cf6d1fbc1d3ff9a2637ed5f2da9003

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 56

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=78df06871fc38b7855ee408e0593202ebca21355

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 57

•

6 years ago

crashtest, mochitest-(1-5) & mochitest-media-e10s are currently configured to run - on gpu instances (virtual-with-gpu) when running a "Windows 10 x64 QuantumRender" [windows10-64-qr] test configuration and - without gpu (virtual) when running a normal "Windows 10 x64" [Windows 10] test configuration. see: https://dxr.mozilla.org/mozilla-central/search?q=windows10-64-qr%2F.*%3A+virtual-with-gpu it looks like mochitest-media-e10s is failing when run on the new ami under the windows10-64-qr config. i don't know if that's significant, but thought i would mention it in case we need to fix that too. if someone understands the significance of why these suites run on gpu for the qr configuration only, maybe they could make a call on what we should do about this. it looks like it was switched on in bug 1426155. here's a push showing failure for mochitest-media-e10s on the new ami (beta) with gpu and success on old ami (prod) without gpu: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=68cc57c crashtest & mochitest-(1-5) look fine: https://treeherder.mozilla.org/#/jobs?repo=try&group_state=expanded&revision=78df068

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 58

•

6 years ago

(In reply to Rob Thijssen (:grenade UTC+2) from comment #57) > it looks like mochitest-media-e10s is failing when run on the new ami under > the windows10-64-qr config. i don't know if that's significant, but thought > i would mention it in case we need to fix that too. For now we can disable the failing test on the windows10-64-qr platform. Although really it looks like the test doesn't stall, it just takes a really long time as the log shows output still coming from the test at the point that it times out. I'm happy to write a patch to deal with this failure if you need me to. > if someone understands > the significance of why these suites run on gpu for the qr configuration > only, maybe they could make a call on what we should do about this. These suites run on gpu because the qr configuration has webrender enabled, and webrender only works if you have a gpu.

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 59

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=40fd362ed5dd76488c5c192755fd58163501f031

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 60

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3b0e40d71163d0852a43670b7e38d3cd1e6d6ec4

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 61

•

6 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #58) > For now we can disable the failing test on the windows10-64-qr platform. > Although really it looks like the test doesn't stall, it just takes a really > long time Actually it looks like the test is waiting for a particular pixel to turn red but never sees that happening, and it keeps looping and checking until the test times out. I'm not sure offhand why this is happening as the screenshot seems to show the expected behaviour, but I'm happy to disable the test on QR until we figure it out.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Updated

•

6 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1504162

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 62

•

6 years ago

Attached patch patch to disable failing mda test on QR (deleted) — Details — Splinter Review

Matt Woodrow (:mattwoodrow)

Comment 63

•

6 years ago

transforms-1.html is failing with a max difference of 1 with the two images disagreeing on 127 vs 128 (~0.5 as a float). That's super subtle, and probably just a slight difference in floating point calculations somewhere, maybe the graphics driver. We should just fuzz it.

Flags: needinfo?(matt.woodrow)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Updated

•

6 years ago

Blocks: 1493688

David Baron :dbaron:

Comment 64

•

6 years ago

redirecting my needinfo from comment 52 to mats

Flags: needinfo?(dbaron) → needinfo?(mats)

Mats Palmgren (inactive)

Comment 65

•

6 years ago

(In reply to Joel Maher ( :jmaher ) (UTC-4) from comment #52) > :dbaron, can you look at the failure in > grid-placement-auto-implicit-001.html > (https://searchfox.org/mozilla-central/source/layout/reftests/css-grid/ > reftest.list#38)? Looks like it's using JS to generate some part of the test/reference, so I suspect it needs to use "reftest-wait". Please disable it and file a bug in Core/Layout:Grid.

Flags: needinfo?(mats)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 66

•

6 years ago

jmaher: i believe we need 3 patches to tests in order to start using the new auto generated amis. - disable failing mda test on QR (as per :kats' patch in comment 62) - fuzz reftests/transforms-1.html tests (as per :mattwoodrow's comment 63) - disable reftests/grid-placement-auto-implicit-001.html (as per mats comment 65) is this something you are able to organise or assign to the appropriate people?

Flags: needinfo?(jmaher)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 67

•

6 years ago

If Joel's not around I'm happy to do this.

Joel Maher ( :jmaher ) (UTC -8)

Comment 68

•

6 years ago

let me get some bugs in place and we can resolve them independently (ideally this week)

Flags: needinfo?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

6 years ago

Depends on: 1507152

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

6 years ago

Depends on: 1507154

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

6 years ago

Attachment #9022091 - Flags: review+

Joel Maher ( :jmaher ) (UTC -8)

Comment 69

•

6 years ago

in addition there are a series of small adjustments to make to existing fuzzy-if statements outlined in comment 52. I have try pushes going for the first two bugs

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

6 years ago

Depends on: 1507161

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 70

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1b3f941763cc41d16604b23b6d7ce5d7214e8ad0

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 71

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=e503d9385eabe44062cc77d360f7c48c377adbb3

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

6 years ago

Depends on: 1509054

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Updated

•

6 years ago

Depends on: 1509064

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 72

•

6 years ago

Comment on attachment 9022091 [details] [diff] [review] patch to disable failing mda test on QR I landed this patch on bug 1509064

Attachment #9022091 - Flags: checked-in+

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 73

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=2f3b1f166e22759463b34cd56ac25267bf00e9ad

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 74

•

6 years ago

jmaher, kats: are you happy for me to promote the new gecko-t-win10-64-gpu ami (from beta to prod), based on the test results linked in comment 73 above, or should i wait for further fuzzy patches? i note that there are a couple failures: - test-windows10-64-qr/opt-reftest-e10s-2 (R2): failed 3/3 retries - test-windows10-64-qr/opt-awsy-base-dmd-e10s (ab-d): failed 2/3 retries

Flags: needinfo?(kats)

Flags: needinfo?(jmaher)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 75

•

6 years ago

(In reply to Rob Thijssen (:grenade UTC+2) from comment #74) > i note that there are a couple failures: > - test-windows10-64-qr/opt-reftest-e10s-2 (R2): failed 3/3 retries This one needs fixing still - it looks like the test on windows has both the 'webrender' and 'skiaContent' annotations activated, and the latter is taking precedence over the former. If we switch the order of those annotations it should be good. I'll land a patch for this. > - test-windows10-64-qr/opt-awsy-base-dmd-e10s (ab-d): failed 2/3 retries This is a pre-existing failure, so I'm not worried about this > jmaher, kats: > are you happy for me to promote the new gecko-t-win10-64-gpu ami (from beta > to prod), based on the test results linked in comment 73 above, or should i > wait for further fuzzy patches? > Please wait for me to fix the fuzzy failure above. The trees are closed at the moment so it will be whenever the trees reopen.

Flags: needinfo?(kats)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 76

•

6 years ago

(In reply to Kartikaya Gupta (email:kats@mozilla.com) from comment #75) > This one needs fixing still - it looks like the test on windows has both the > 'webrender' and 'skiaContent' annotations activated, and the latter is > taking precedence over the former. If we switch the order of those > annotations it should be good. I'll land a patch for this. Sorry, I was looking at an old version of the code. The actual problem is the &&isDebugBuild clause on the annotation. I'll need to drop that since it happens on both opt and debug. Will do that once trees reopen.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 77

•

6 years ago

I landed the patch on inbound earlier today. If the AMI change happens as part of an in-tree change then you could land it on inbound now and it should be good. Otherwise we should wait until that patch has propagated to all the trunk trees.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 78

•

6 years ago

The patch is on central now so we should be good to go!

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 79

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=07b2a8df57f6fcbc33ade6fa216502e00b017c6d

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 80

•

6 years ago

new amis are now live: https://github.com/mozilla-releng/OpenCloudConfig/commit/606fe69601a6621b144580f17365234e5347766d https://github.com/mozilla-releng/OpenCloudConfig/commit/496b6e8d17392503b50c05086c38167d5747fea6 https://tools.taskcluster.net/task-group-inspector/#/X-DxvMHGR3uKGODnIykkxA

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

Flags: needinfo?(jmaher)

Resolution: --- → FIXED

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 81

•

6 years ago

\o/ thank you!

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 82

•

6 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b39449b9d479da84edc40c6fd1fc30a743659dd8

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 83

•

6 years ago

New failures on Windows 10 have increase heavily: https://sql.telemetry.mozilla.org/queries/60373/ Rob, can you take a look at those, please? Table with links to bugs, logs and infos on job type and platform: https://sql.telemetry.mozilla.org/queries/60371/

Flags: needinfo?(rthijssen)

Rob Thijssen [:grenade (EET/UTC+0300)]

Assignee

Comment 84

•

6 years ago

i'm afraid i don't know what to make of this data or what info is being asked of me but i'll take a stab at a response: the change implemented by this bug was to create and use a new windows 10 ami which uses a source controlled, automated process to deterministically create an ec2 ami from a windows build 1703 iso (see bug 1474548), replacing an ami that was manually created (by a human, installing windows on a vm, and importing the resulting vhd into ec2, as an ami). the change gives us a mechanism to deterministically modify the ami generation process in an incremental and audited way in order to improve the repeatability of that exercise for subsequent builds of windows without having to rely on human memory from the last time an ami was manually built. a happy byproduct of the change was that instances created from the new amis do not suffer from going into an impaired state whenever the gpu is put through its paces, hence the resolution of this bug. we can of course revert the deployment to go back to the manually created ami which did not have the problems referred to by the telemetry link (but which did have other problems, like instances that die mid-task), or we can choose to bite the bullet and start to incrementally improve the automated process to reduce the number of intermittently failing tests, as and when we understand the causes of the individual failures. i prefer the latter, but it's not my call and i'm happy to work with whatever moves us forward. if there's any other specific questions i can answer, please feel free to ni me again.

Flags: needinfo?(rthijssen)

Chris Cooper [:coop] (he/him)

Comment 85

•

6 years ago

(In reply to Rob Thijssen (:grenade UTC+2) from comment #84) > i'm afraid i don't know what to make of this data I'm also somewhat overwhelmed by the data. Sebastian: can you (or one of the sheriffs) call out a few standout bugs from that list to block this bug on if you want Rob to dig into this more, please?

Flags: needinfo?(aryx.bugmail)

Joel Maher ( :jmaher ) (UTC -8)

Comment 86

•

6 years ago

I think the problem is there are so many new bugs that are individual- it isn't as if one or two bugs are happening at large scale, this is more of a trend that shows a low rate of windows 10 new intermittent bugs and in the last 3 days a massive jump in new issues. I also agree with you :coop, this is a hard problem to understand with that data- I think looking into common failures (crashes, hangs, startup hangs, focus, specific suites/messages) would help make this actionable. In fact it would be nice if we had metrics broken down daily like: * config - overall intermittent rate, overall new intermittent rate * test suite - overall intermittent rate, overall new intermittent rate * buckets (hangs, timeouts, crashes, infra, test failure) - overall intermittent rate, overall new intermittent rate We could at least have a common ground for understanding what the different patterns are and catch issues that are more widespread easier.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

6 years ago

Depends on: 1511293

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 87

•

6 years ago

The failure rate uptick is caused by Windows 10 update prompts overlaying the screen which causes focus, click automation and color detection issues. See bug 1511293.

Flags: needinfo?(aryx.bugmail) → needinfo?(rthijssen)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 88

•

6 years ago

That issue should be fixed in bug 1510220.

Flags: needinfo?(rthijssen)

reboot-impaired-instances.sh 7 years ago Rob Thijssen [:grenade (EET/UTC+0300)] (deleted), text/plain		Details
reboot.log 7 years ago Rob Thijssen [:grenade (EET/UTC+0300)] (deleted), text/plain		Details
reboot-impaired-instances.sh 7 years ago Rob Thijssen [:grenade (EET/UTC+0300)] (deleted), application/x-shellscript	markco : feedback+	Details
m3-livelog 6 years ago Kartikaya Gupta (email:kats@mozilla.staktrace.com) (deleted), text/plain		Details
patch to disable failing mda test on QR 6 years ago Kartikaya Gupta (email:kats@mozilla.staktrace.com) (deleted), patch	jmaher : review+ kats : checked-in+	Details \| Diff \| Splinter Review