worker should log an exception with queue on exit code 1073807364 in order to trigger a retry
Categories
(Taskcluster :: Workers, enhancement)
Tracking
(Not tracked)
People
(Reporter: grenade, Assigned: grenade)
References
Details
Attachments
(1 file)
on windows builders in gcp on tier 3, we see a significant number of builds failing with a final log message of:
[taskcluster:error] exit status 1073807364
the symptom is an old one and can be caused by different issues. typically it occurs when the instance is lost mid-task through a reboot or instance termination that is unexpected by the worker process.
i initially assumed we were getting terminations from gcloud but the papertrail instance logs don't really support that assumption. for example, here are the last few lines of a build log:
11:32:10 INFO - mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:10 INFO - mozmake.EXE[5]: Entering directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:10 INFO - media/webrtc/trunk/gtest/webrtc-gtest.exe
11:32:10 INFO - mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16 INFO - mozmake.EXE[5]: Entering directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16 INFO - z:/task_1555325675/build/src/clang/bin/lld-link.exe -NOLOGO -OUT:../../../../dist/bin/webrtc-gtest.exe -PDB:webrtc-gtest.pdb -IMPLIB:webrtc-gtest.lib -SUBSYSTEM:CONSOLE,6.01 -STACK:1572864 -guard:cf,nolongjmp -LARGEADDRESSAWARE -RELEASE -SAFESEH -DEBUG -OPT:REF,ICF clang_rt.profile-i386.lib @z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest/webrtc-gtest_exe_pgo.list ./module.res ..\..\..\..\mozglue\build\mozglue.lib kernel32.lib user32.lib gdi32.lib winmm.lib wsock32.lib advapi32.lib secur32.lib winmm.lib Strmiids.lib d3d11.lib dxgi.lib amstrmid.lib dmoguids.lib msdmo.lib shell32.lib shlwapi.lib strmiids.lib ws2_32.lib wmcodecdspuuid.lib
11:32:16 INFO - lld-link: warning: ignoring unknown argument: -RELEASE
11:32:16 INFO - mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16 INFO - mozmake.EXE[5]: Entering directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:16 INFO - touch -t `date +%Y%m%d%H%M.%S -d 'now+5seconds'` pgo.relink
11:32:16 INFO - mozmake.EXE[5]: Leaving directory 'z:/task_1555325675/build/src/obj-firefox/media/webrtc/trunk/gtest'
11:32:17 INFO - mozmake.EXE[5]: Nothing to be done for 'syms'.
[taskcluster 2019-04-15T11:39:30.853Z] Exit Code: 1073807364
[taskcluster 2019-04-15T11:39:30.853Z] User Time: 0s
[taskcluster 2019-04-15T11:39:30.853Z] Kernel Time: 0s
[taskcluster 2019-04-15T11:39:30.853Z] Wall Time: 32m10.7963246s
[taskcluster 2019-04-15T11:39:30.853Z] Result: FAILED
[taskcluster 2019-04-15T11:39:30.853Z] === Task Finished ===
[taskcluster 2019-04-15T11:39:30.853Z] Task Duration: 42m44.7676934s
[taskcluster 2019-04-15T11:39:31.367Z] Uploading artifact public/logs/localconfig.json from file logs\localconfig.json with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:32.040Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:32.975Z] Uploading artifact public/chain-of-trust.json from file generic-worker\chain-of-trust.json with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:33.343Z] Uploading artifact public/chain-of-trust.json.sig from file generic-worker\chain-of-trust.json.sig with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-13T10:34:46.420Z
[taskcluster 2019-04-15T11:39:33.880Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/Va8tiYL9T5GvQIagmugNIg/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-13T10:34:46.420Z
[taskcluster:error] exit status 1073807364
and here's the last few entries from the instance papertrail logs at the same time:
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Test-InstanceProductivity :: begin - 2019-04-15T11:38:01.1562091Z
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-Terminating :: begin - 2019-04-15T11:38:01.1869996Z
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-Terminating :: end - 2019-04-15T11:38:01.2065237Z
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-ConditionTrue :: drive Y: is detected.
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-ConditionTrue :: drive Z: is detected.
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Is-ConditionTrue :: generic-worker is running.
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Test-InstanceProductivity :: instance appears to be productive.
Apr 15 11:38:01 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com HaltOnIdle: Test-InstanceProductivity :: end - 2019-04-15T11:38:01.2670906Z
Apr 15 11:38:45 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.
Apr 15 11:38:52 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR last message repeated 2 times
Apr 15 11:39:02 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/logs/localconfig.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331081&Signature=B%2BaFuLYamQ9LlVQ9qA52%2BjCaf8o%3D HTTP/1.1 Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com User-Agent: Go-http-client/1.1 Content-Length: 1331 Content-Encoding: gzip Content-Type: application/octet-stream Accept-Encoding: gzip
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/logs/localconfig.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331081&Signature=B%2BaFuLYamQ9LlVQ9qA52%2BjCaf8o%3D
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK Content-Length: 0 Date: Mon, 15 Apr 2019 11:39:32 GMT Etag: "d9c56c6b785ca907d72ed166cb6309a1" Server: AmazonS3 X-Amz-Id-2: 4T/eElP/LpdGoNpJjtRkhxSdnKoNDVQtSGZQkAh7cvbSeb4pJ8dmhJOUdvbGSFBmmBGLTUAldlI= X-Amz-Request-Id: F34D3872B0E69178 X-Amz-Version-Id: fW9_rF9cY3bunOIP_55PcnZimv1LHarL
Apr 15 11:39:32 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Chain of Trust...
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/logs/certified.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331082&Signature=H%2FC9YPwtFQCXuhBdIxSZ1zXuN3Y%3D HTTP/1.1 Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com User-Agent: Go-http-client/1.1 Content-Length: 476139 Content-Encoding: gzip Content-Type: text/plain; charset=utf-8 Accept-Encoding: gzip
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/logs/certified.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331082&Signature=H%2FC9YPwtFQCXuhBdIxSZ1zXuN3Y%3D
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK Content-Length: 0 Date: Mon, 15 Apr 2019 11:39:33 GMT Etag: "589cc5a2881ec948da215aab72535922" Server: AmazonS3 X-Amz-Id-2: G5Qqf9Mrjt2nP8cu06cxvU56EQDSV8zYbPFSRtwiRvUIgGWXo0ThoZ69HvM1p+UJW4AtFNl/rXE= X-Amz-Request-Id: B0FD95DFF30CE0D2 X-Amz-Version-Id: DWfn9mpCscE6tTyGav8M6QA.E5_f6ZnW
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331083&Signature=GZKF%2F9TOcm3luu1Se9LyB9xcIAg%3D HTTP/1.1 Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com User-Agent: Go-http-client/1.1 Content-Length: 2287 Content-Encoding: gzip Content-Type: text/plain; charset=utf-8 Accept-Encoding: gzip
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331083&Signature=GZKF%2F9TOcm3luu1Se9LyB9xcIAg%3D
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response
Apr 15 11:39:33 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK Content-Length: 0 Date: Mon, 15 Apr 2019 11:39:34 GMT Etag: "fa0b6e3aab78925230e370c09158ddc6" Server: AmazonS3 X-Amz-Id-2: XyieHF5QNmTod3TkSP5ZTRJjDQ+89u+pEuquH8NGcdHutDPIVu3ZfWDsLiRsJXFgkmf1s/afIhw= X-Amz-Request-Id: DA6B11EAFD867C89 X-Amz-Version-Id: YFzf6HgWN_.fOght8tcDZYngjL9Dmtc_
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json.sig?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331083&Signature=ydoCiJ2HEddW%2B4Pkwii8GNutPqs%3D HTTP/1.1 Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com User-Agent: Go-http-client/1.1 Content-Length: 116 Content-Encoding: gzip Content-Type: application/octet-stream Accept-Encoding: gzip
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/chain-of-trust.json.sig?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=application%2Foctet-stream&Expires=1555331083&Signature=ydoCiJ2HEddW%2B4Pkwii8GNutPqs%3D
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK Content-Length: 0 Date: Mon, 15 Apr 2019 11:39:34 GMT Etag: "8c69926670ca9ad05dbc93b05859edf0" Server: AmazonS3 X-Amz-Id-2: T2kJBEV1dPP6zvmLfZ04vQ/OCzlPSMVeKycnm3AreiWg6L443iLwy+lNIf66lrZLE4hMGkFL7iE= X-Amz-Request-Id: 27E059F90EE9F495 X-Amz-Version-Id: M7O07vvCBzAwfbU2R6Z9Q45NBVFieFma
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Supersede...
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Mounts/Caches...
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature OS Groups...
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Taskcluster Proxy...
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopped taskcluster proxy process (PID 3740)
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Stopping task feature Live Log...
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Redirecting public/logs/live.log to public/logs/live_backing.log
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: WARNING: could not terminate livelog exposure: close tcp [::]:60023: use of closed network connection
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR last message repeated 3 times
Apr 15 11:39:34 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Request
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: PUT /Va8tiYL9T5GvQIagmugNIg/0/public/logs/live_backing.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331084&Signature=8T%2FC%2F97nslh2gbxC71POu5fpVt8%3D HTTP/1.1 Host: taskcluster-public-artifacts.s3.us-west-2.amazonaws.com User-Agent: Go-http-client/1.1 Content-Length: 476349 Content-Encoding: gzip Content-Type: text/plain; charset=utf-8 Accept-Encoding: gzip
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: 1 put requests issued to https://taskcluster-public-artifacts.s3.us-west-2.amazonaws.com/Va8tiYL9T5GvQIagmugNIg/0/public/logs/live_backing.log?AWSAccessKeyId=AKIAJTPURLXKI2RSRWOQ&Content-Type=text%2Fplain%3B%20charset%3Dutf-8&Expires=1555331084&Signature=8T%2FC%2F97nslh2gbxC71POu5fpVt8%3D
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Response
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: HTTP/1.1 200 OK Content-Length: 0 Date: Mon, 15 Apr 2019 11:39:35 GMT Etag: "8f47d39d054c22b12398614337344af8" Server: AmazonS3 X-Amz-Id-2: bs1nKj6vth2jhB9gEltHaK3KGipHYX5PNwj/cRBRe2g9FOlB6rcZNX5WdoBhljvU5oUQoLeZ3H4= X-Amz-Request-Id: FAA486598297BAB9 X-Amz-Version-Id: o7ZwmW92hKLOw0xpxgGfHQHSvhboQW75
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Resolving task Va8tiYL9T5GvQIagmugNIg ...
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: ERROR(s) encountered: exit status 1073807364
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Making system call GetProfilesDirectoryW with args: [0 C0422161FC]
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Result: 0 0 The data area passed to a system call is too small.
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Making system call GetProfilesDirectoryW with args: [C0423B45C0 C0422161FC]
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Result: 1 0 The operation completed successfully.
Apr 15 11:39:35 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Looking for existing task users to delete...
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Resolved 1 tasks in total so far.
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Saving file file-caches.json (absolute path: C:\generic-worker\file-caches.json)
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Saving file directory-caches.json (absolute path: C:\generic-worker\directory-caches.json)
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com generic-worker: Exiting worker with exit code 67
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com GCEMetadataScripts: INFO: 2019/04/15 11:39:35.594969 main.go:407: GCEMetadataScripts: Starting shutdown scripts (version 4.2.1@1).
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com GCEMetadataScripts: INFO: 2019/04/15 11:39:35.634992 main.go:416: GCEMetadataScripts: No shutdown scripts to run.
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com GCEMetadataScripts: INFO: 2019/04/15 11:39:35.645749 main.go:421: GCEMetadataScripts: Finished running shutdown scripts.
Apr 15 11:39:36 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com User32: The process C:\Windows\system32\shutdown.exe (VM-SMX3T1SFAI5D) has initiated the restart of computer VM-SMX3T1SFAI5D on behalf of user NT AUTHORITY\SYSTEM for the following reason: Application: Maintenance (Planned) Reason Code: 0x80040001 Shutdown Type: restart Comment: rebooting; generic worker task run completed
Apr 15 11:39:37 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR last message repeated 3 times
Apr 15 11:39:38 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com Service_Control_Manager: The sshd service terminated unexpectedly. It has done this 1 time(s).
Apr 15 11:39:39 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: ERROR Failed to retrieve eventlog fields; The handle is invalid.
Apr 15 11:39:42 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: WARNING received a system shutdown request
Apr 15 11:39:42 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: WARNING stopping nxlog service
Apr 15 11:39:42 vm-smx3t1sfai5dmdsro.gecko-1-b-win2012-gamma.usc1.mozilla.com nxlog: WARNING nxlog-ce received a termination request signal, exiting...
it's difficult to say what's going on but from the perspective of what's in the logs, it looks like
- a task finished normally
- generic-worker exited with an exit code of 67
- which was interpreted by the wrapper script as a request to reboot
- the instance never restarted (this suggests something was wrong on the infra front, but i haven't figured out what that is yet)
in any case, retriggering tasks that fail in this way, usually results in successful builds. as such, it feels like we could improve the resilience of what happens or is shown in treeherder by triggering a retry instead of showing a build or test failure.
here's some links to recent failures:
- https://treeherder.mozilla.org/#/jobs?resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=3&searchStr=2012&group_state=expanded&repo=try
- https://treeherder.mozilla.org/#/jobs?resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=3&searchStr=2012&group_state=expanded&repo=autoland
- https://treeherder.mozilla.org/#/jobs?resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=3&searchStr=2012&group_state=expanded&repo=mozilla-inbound
- https://treeherder.mozilla.org/#/jobs?resultStatus=testfailed%2Cbusted%2Cexception%2Crunnable&tier=3&searchStr=2012&group_state=expanded&repo=mozilla-central
Assignee | ||
Updated•6 years ago
|
Comment 1•6 years ago
|
||
We should be able to do this in the gecko task definitions with:
payload:
...
onExitStatus:
retry:
- 1073807364
We could include this in all gecko windows tasks in the decision code machinery.
See onExitStatus docs.
Assignee | ||
Comment 2•6 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #1)
We should be able to do this in the gecko task definitions with:
payload: ... onExitStatus: retry: - 1073807364
We could include this in all gecko windows tasks in the decision code machinery.
See onExitStatus docs.
testing now:
https://hg.mozilla.org/try/rev/7f63803
https://treeherder.mozilla.org/#/jobs?repo=try&tier=1%2C2%2C3&revision=7f63803
Assignee | ||
Comment 3•6 years ago
|
||
seems to work well.
i deliberately killed worker instance vm-mh2jgoe41ewuocvcr whilst it was running its task.
the livelog recorded these messages before the task run was recorded as an exception and re-triggered:
updating [==================> ] 112100/275564 5m46s
updating [==================> ] 112500/275564 5m46s
updating [==================> ] 112900/275564 5m47s[taskcluster 2019-04-30T20:51:36.164Z] Exit Code: 3221225786
[taskcluster 2019-04-30T20:51:36.164Z] User Time: 15.625ms
[taskcluster 2019-04-30T20:51:36.164Z] Kernel Time: 0s
[taskcluster 2019-04-30T20:51:36.164Z] Wall Time: 4m14.2193896s
[taskcluster 2019-04-30T20:51:36.164Z] Result: FAILED
[taskcluster 2019-04-30T20:51:36.164Z] === Task Finished ===
[taskcluster 2019-04-30T20:51:36.164Z] Task Duration: 4m14.4235599s
[taskcluster:error] Uploading error artifact public/logs from file logs with message "Could not read directory 'Z:\\task_1556652867\\logs'", reason "file-missing-on-worker" and expiry 2019-05-28T20:35:50.681Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1556652867\logs'
[taskcluster:error] Uploading error artifact public/build from file public/build with message "Could not read directory 'Z:\\task_1556652867\\public\\build'", reason "file-missing-on-worker" and expiry 2019-05-28T20:35:50.681Z
[taskcluster:error] TASK FAILURE during artifact upload: file-missing-on-worker: Could not read directory 'Z:\task_1556652867\public\build'
[taskcluster 2019-04-30T20:51:38.143Z] Uploading artifact public/logs/certified.log from file generic-worker\certified.log with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-28T20:35:50.681Z
[taskcluster 2019-04-30T20:51:39.109Z] Uploading artifact public/chain-of-trust.json from file generic-worker\chain-of-trust.json with content encoding "gzip", mime type "text/plain; charset=utf-8" and expiry 2019-05-28T20:35:50.681Z
[taskcluster 2019-04-30T20:51:39.649Z] Uploading artifact public/chain-of-trust.json.sig from file generic-worker\chain-of-trust.json.sig with content encoding "gzip", mime type "application/octet-stream" and expiry 2019-05-28T20:35:50.681Z
[taskcluster 2019-04-30T20:51:40.340Z] Uploading redirect artifact public/logs/live.log to URL https://queue.taskcluster.net/v1/task/F5HBezBETdWtTP_u-PQV5A/runs/0/artifacts/public/logs/live_backing.log with mime type "text/plain; charset=utf-8" and expiry 2019-05-28T20:35:50.681Z
[taskcluster:error] Task appears to have failed intermittently - exit code 3221225786 found in task payload.onExitStatus list
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1556652867\logs'
[taskcluster:error] file-missing-on-worker: Could not read directory 'Z:\task_1556652867\public\build'
Assignee | ||
Comment 4•6 years ago
|
||
when an instance is terminated while it is still running a task, the generic worker process exits with an interrupt exit code. this change treats such exit codes as an exception which triggers a task retry
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Updated•6 years ago
|
Comment 5•6 years ago
|
||
Comment on attachment 9061761 [details]
trigger retry on instance termination mid-task
Looks good Rob, but I think just needs a comment to explain why these exit codes are special, and also needs to be filtered to only add them to tasks on windows workers (e.g. to avoid the mac tests and linux talos jobs we run on generic-worker).
Thanks!
Comment 6•6 years ago
|
||
Note, the current implementation won't easily allow other exit codes to be added to the list - ideally we'd implement this the same way we have in docker-worker, e.g. here.
Although maybe we can do that in a subsequent bug, in order not to block this one.
Assignee | ||
Comment 7•6 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
Note, the current implementation won't easily allow other exit codes to be added to the list - ideally we'd implement this the same way we have in docker-worker, e.g. here.
Although maybe we can do that in a subsequent bug, in order not to block this one.
here's an attempt at what i think was suggested:
task_def['payload'] = {
'command': worker['command'],
'maxRunTime': worker['max-run-time'],
'onExitStatus': {},
}
if worker['os'] == 'windows':
worker.setdefault('retry-exit-status', []).extend(
# these codes (on windows) indicate a process interruption,
# rather than a task run failure. see bug 1544403.
1073807364, # process force-killed due to system shutdown
3221225786, # sigint (any interrupt)
)
if 'retry-exit-status' in worker:
task_def['payload']['onExitStatus']['retry'] = worker['retry-exit-status']
instead of what was implemented in the patch:
task_def['payload'] = {
'command': worker['command'],
'maxRunTime': worker['max-run-time'],
'onExitStatus': {
'retry': [
# these codes (on windows) indicate a process interruption,
# rather than a task run failure. see bug 1544403.
1073807364, # process force-killed due to system shutdown
3221225786, # sigint (any interrupt)
],
},
}
i'm of the fairly strong opinion that the second of these two implementations is cleaner. i'm not sure what was meant by "won't easily allow other exit codes to be added to the list". please clarify that. possibly with a code example here as i must have failed to comprehend that. i think it's pretty much exactly the same amount of effort to add exit codes to either implementation.
Comment 8•6 years ago
|
||
(In reply to Rob Thijssen [:grenade (EET)] from comment #7)
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
Note, the current implementation won't easily allow other exit codes to be added to the list - ideally we'd implement this the same way we have in docker-worker, e.g. here.
Although maybe we can do that in a subsequent bug, in order not to block this one.
here's an attempt at what i think was suggested:
task_def['payload'] = { 'command': worker['command'], 'maxRunTime': worker['max-run-time'], 'onExitStatus': {}, } if worker['os'] == 'windows': worker.setdefault('retry-exit-status', []).extend( # these codes (on windows) indicate a process interruption, # rather than a task run failure. see bug 1544403. 1073807364, # process force-killed due to system shutdown 3221225786, # sigint (any interrupt) ) if 'retry-exit-status' in worker: task_def['payload']['onExitStatus']['retry'] = worker['retry-exit-status']
instead of what was implemented in the patch:
task_def['payload'] = { 'command': worker['command'], 'maxRunTime': worker['max-run-time'], 'onExitStatus': { 'retry': [ # these codes (on windows) indicate a process interruption, # rather than a task run failure. see bug 1544403. 1073807364, # process force-killed due to system shutdown 3221225786, # sigint (any interrupt) ], }, }
i'm of the fairly strong opinion that the second of these two implementations is cleaner. i'm not sure what was meant by "won't easily allow other exit codes to be added to the list". please clarify that. possibly with a code example here as i must have failed to comprehend that. i think it's pretty much exactly the same amount of effort to add exit codes to either implementation.
The payload builder of docker-worker includes an optional property retry-exit-status
here. This allows any of the task-building machinery to mutate the list of exit codes that should cause a retry, by manipulating the retry-exit-status
worker property. For example,
The retry exit codes, like the other parts of the task payload (mounts, commands to run, artifacts to include, ...), should be part of the generic-worker payload builder (like it is for the docker worker) so that any of the transformation steps can mutate the list. In the final transformation, the values from the list should be taken to generate the payload, like we do in docker-worker.
The reason I say it isn't possible to mutate the list currently, is because in your patch there currently are no means for the transforms to alter the list contents, since the final transformation is deciding what goes in there.
Assignee | ||
Comment 9•6 years ago
|
||
Assignee | ||
Comment 10•6 years ago
|
||
(In reply to Pete Moore [:pmoore][:pete] from comment #8)
The reason I say it isn't possible to mutate the list currently, is because in your patch there currently are no means for the transforms to alter the list contents, since the final transformation is deciding what goes in there.
thanks for explaining that. it sounds like a great feature to add. i'll leave that for another feature request as it sounds like it has wider implications than the scope of this bug, which is just about getting generic worker to stop reporting task exceptions as build failures.
i've added the platform check you requested to the phabricator patch and i look forward to getting that merged and this issue closed.
Assignee | ||
Comment 11•6 years ago
|
||
Assignee | ||
Updated•6 years ago
|
Comment 12•6 years ago
|
||
Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e1d2dc6db0fa
trigger retry on instance termination mid-task r=pmoore
Comment 13•6 years ago
|
||
bugherder |
Updated•2 years ago
|
Description
•