Bug 1307803
Opened 8 years ago
Closed 8 years ago
Windows worker hanging on build failure
Categories
(Taskcluster :: Workers, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: ted, Assigned: pmoore)
Attachments
(2 files)
I did a try push here, and all the Windows taskcluster builds failed:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3dc4cbe71997
However, they failed with 'claim-expired', like:
https://tools.taskcluster.net/task-inspector/#NIaiZPtqQvuzrNcCD4WaCw/0
Looking at one that was still running:
https://tools.taskcluster.net/task-inspector/#NIaiZPtqQvuzrNcCD4WaCw/1
I saw that the build had failed:
<...>
13:12:49 INFO - client.mk:373: recipe for target 'configure' failed
13:12:49 INFO - mozmake.EXE: *** [configure] Error 1
14:32:49 INFO - Automation Error: mozprocess timed out after 4800 seconds running ['C:\\mozilla-build\\msys\\bin\\bash.exe', 'Z:\\task_1475671472\\build\\src\\mach', '--log-no-times', 'build', '-v']
14:32:49 ERROR - timed out after 4800 seconds of no output
<...>
I'm not sure what's happening here, but something is failing to notice that the build is broken. This shouldn't have hit the mozprocess timeout either; it's just a simple configure failure. I've attached the full build log that the snippet above was cut from.
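For reference, the gap between the last make output and the mozprocess error matches the no-output timeout exactly, which is consistent with the process producing nothing further after configure failed. A quick check of that arithmetic, using only the two timestamps from the excerpt above (the variable names are just for illustration):

```python
from datetime import datetime, timedelta

# Times of day copied from the log excerpt above.
configure_failed = datetime.strptime("13:12:49", "%H:%M:%S")
timeout_reported = datetime.strptime("14:32:49", "%H:%M:%S")

# Elapsed time between the configure failure and the timeout message.
gap = timeout_reported - configure_failed
print(int(gap.total_seconds()))  # 4800
```

So the 4800 seconds of silence appear to start at the moment the configure target failed.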
Updated•8 years ago
Severity: normal → major
Updated•8 years ago
Assignee: nobody → pmoore
Comment 1•8 years ago (Assignee)
I'm not sure what caused this, but I see it is running on generic worker 5.2.0.
A redesign of failure handling in generic worker landed this evening in 6.0.0, so I propose that we first try the refactored worker to see if that fixes things.
I'll raise a PR against https://github.com/mozilla-releng/OpenCloudConfig for this - then when it merges, the rollout should be automatic, as far as I understand.
Comment 2•8 years ago (Assignee)
Attachment #8798157 -
Flags: review?(rthijssen)
Comment 3•8 years ago (Assignee)
(note: the last line of the log also scared me, as it looked like some funky process magic might be going on)
Comment 4•8 years ago
I looked at one of the instances that had an issue with a run of this task: instance i-0471eb89d40b860a7.
Looking in papertrail, I see the following:
Oct 05 15:57:27 win2012-i-0471eb89d40b860a7 generic-worker: 2016/10/05 15:57:26 Reclaiming task NIaiZPtqQvuzrNcCD4WaCw...
Oct 05 15:57:28 win2012-i-0471eb89d40b860a7 generic-worker: 2016/10/05 15:57:27 Reclaimed task NIaiZPtqQvuzrNcCD4WaCw successfully.
Oct 05 15:58:02 win2012-i-0471eb89d40b860a7 HaltOnIdle: Is-Running :: generic-worker is running.
Oct 05 15:58:02 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive.
Oct 05 15:59:27 win2012-i-0471eb89d40b860a7 Microsoft-Windows-DSC: Job {B0A72FFE-8B14-11E6-8149-126690B82E1F} : Configuration is sent from computer NULL by user sid S-1-5-18.
Oct 05 16:03:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: Is-Running :: generic-worker is running.
Oct 05 16:03:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive.
Oct 05 16:08:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: Is-Running :: generic-worker is running.
Oct 05 16:08:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive.
That was the last thing in the papertrail logs from this worker (note that I was searching a few hours after that last log message).
I'm not sure what caused it to stop logging. I no longer see that machine in the AWS console, so it is clearly gone.
Is it possible this machine crashed or was terminated by other means? Does the generic worker monitor AWS spot kills and terminate gracefully?
When the machines shut down, is there something I could see in the logs that would indicate that?
Sorry Pete, I attempted to determine what was going wrong without much luck.
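One way to spot a worker that went quiet like this is to take the newest timestamp in its log lines and compare it with the current time. The following is a rough sketch only: it assumes the syslog-style prefix seen in the excerpt above (month, day, time, host), assumes the year is supplied separately because the prefix omits it, and uses hypothetical helper names `last_log_time` and `silence_since`.

```python
from datetime import datetime, timedelta

# A few lines copied from the papertrail excerpt above.
SAMPLE = [
    "Oct 05 15:57:28 win2012-i-0471eb89d40b860a7 generic-worker: 2016/10/05 15:57:27 Reclaimed task NIaiZPtqQvuzrNcCD4WaCw successfully.",
    "Oct 05 16:03:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive.",
    "Oct 05 16:08:01 win2012-i-0471eb89d40b860a7 HaltOnIdle: instance appears to be productive.",
]

def last_log_time(lines, year=2016):
    """Return the newest timestamp in `lines`, or None if none parse."""
    stamps = []
    for line in lines:
        parts = line.split(None, 3)  # month, day, time, rest of line
        if len(parts) < 4:
            continue
        try:
            stamp = datetime.strptime(
                f"{year} {parts[0]} {parts[1]} {parts[2]}", "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # not a syslog-style line; skip it
        stamps.append(stamp)
    return max(stamps) if stamps else None

def silence_since(lines, now, year=2016):
    """Return how long the host has been silent as of `now`, or None."""
    last = last_log_time(lines, year)
    return None if last is None else now - last

print(last_log_time(SAMPLE))  # 2016-10-05 16:08:01
```

Since HaltOnIdle logs roughly every five minutes in the excerpt, a silence much longer than that would suggest the instance stopped or disappeared rather than simply sat idle.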
Comment 5•8 years ago
After talking with Pete, it appears the generic worker does not monitor the spot termination endpoint that AWS provides, which indicates whether an instance will be spot killed within the next 2 minutes. I'm not sure if that is the cause of this, but some symptoms point that way.
I've added bug 1308224
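To illustrate what monitoring that endpoint could look like, here is a minimal sketch. It is not how generic worker does or would implement it. It assumes the classic instance metadata path `spot/termination-time`, which returns 404 until a termination is scheduled and a timestamp afterwards; newer instance configurations may additionally require a metadata session token, which this sketch omits. The function names, the polling interval, and the `on_notice` callback are all hypothetical.

```python
import time
import urllib.error
import urllib.request

# Instance metadata path; only reachable from inside an EC2 instance.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def fetch_termination_time(url=TERMINATION_URL, timeout=2):
    """Return the scheduled termination time as a string, or None if not scheduled."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no termination scheduled yet
            return None
        raise

def watch_for_termination(fetch, on_notice, sleep=time.sleep,
                          interval=5, max_polls=None):
    """Poll `fetch` until it reports a termination time, then call `on_notice`.

    Returns the notice, or None if `max_polls` is reached first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = fetch()
        if notice is not None:
            on_notice(notice)  # e.g. log it and resolve the running task
            return notice
        polls += 1
        sleep(interval)
    return None
```

On a worker, this might be called as `watch_for_termination(fetch_termination_time, handler)` from a background thread, where `handler` logs the notice so that a later reader of the logs can tell a spot kill from a crash.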
Comment 6•8 years ago
Comment on attachment 8798157 [details]
Github Pull Request for OpenCloudConfig
I think we're going to skip 6.0.0 for gecko-1-b-win2012 and gecko-3-b-win2012 and move to the next release.
Attachment #8798157 -
Flags: review?(rthijssen) → review-
Updated•8 years ago (Assignee)
Component: Worker → Generic-Worker
Comment 7•8 years ago (Assignee)
I believe this no longer occurs. ted, is it ok for me to close?
Flags: needinfo?(ted)
Comment 8•8 years ago (Reporter)
Sure, I haven't seen this occur in anything I've done lately.
Flags: needinfo?(ted)
Updated•8 years ago
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Updated•6 years ago
Component: Generic-Worker → Workers