packet.net "machine-13" is very slow
Categories
(Taskcluster :: General, defect)
Tracking
(Not tracked)
People
(Reporter: gbrown, Assigned: coop)
References
(Blocks 1 open bug)
Over the last 24 hours or so, several odd failures have been reported in bug 1411358 for test-android-em-7.0-x86* tests. In these failures, tests run very slowly, resulting in timeouts.
Logs show they all have the same worker-id: "machine-13".
[taskcluster 2019-03-29 10:30:56.118Z] Task ID: fKGlkRUURLmaXmljhVdxUg
[taskcluster 2019-03-29 10:30:56.118Z] Worker ID: machine-13
[taskcluster 2019-03-29 10:30:56.118Z] Worker Group: packet-sjc1
[taskcluster 2019-03-29 10:30:56.118Z] Worker Node Type: packet.net
[taskcluster 2019-03-29 10:30:56.118Z] Worker Type: gecko-t-linux
Reboot, quarantine, remove?
Examples:
https://treeherder.mozilla.org/logviewer.html#?job_id=236858209&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=236770611&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=236770603&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=236671157&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=236660940&repo=autoland
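The worker-id check described above can be sketched as a small script. This is only an illustration, assuming the raw log text of each failed task has already been downloaded; the function names `worker_id` and `failures_by_worker` are hypothetical, and the regular expression is based solely on the header lines quoted in this report.

```python
import re
from collections import Counter

# Matches header lines such as:
# [taskcluster 2019-03-29 10:30:56.118Z] Worker ID: machine-13
WORKER_ID_RE = re.compile(r"^\[taskcluster [^\]]+\] Worker ID: (\S+)\s*$")


def worker_id(log_text):
    """Return the worker ID from a task log's header, or None if absent."""
    for line in log_text.splitlines():
        match = WORKER_ID_RE.match(line)
        if match:
            return match.group(1)
    return None


def failures_by_worker(logs):
    """Count failed-task logs per worker ID; logs without a header are skipped."""
    ids = (worker_id(text) for text in logs)
    return Counter(i for i in ids if i is not None)


# Header excerpt copied from the log quoted in this report.
sample = """\
[taskcluster 2019-03-29 10:30:56.118Z] Task ID: fKGlkRUURLmaXmljhVdxUg
[taskcluster 2019-03-29 10:30:56.118Z] Worker ID: machine-13
[taskcluster 2019-03-29 10:30:56.118Z] Worker Group: packet-sjc1
"""
print(failures_by_worker([sample, sample]).most_common())
```

If one worker dominates the resulting counts, that points at a single bad machine rather than a problem with the tests themselves.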
Comment 1•6 years ago
I don't think anyone but wander knows how to do anything other than quarantine this worker, but at the very least a reboot seems in order.
Reporter | ||
Comment 2•6 years ago
Failures continue:
https://treeherder.mozilla.org/logviewer.html#?job_id=237554633&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=237554635&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=237554631&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=237293150&repo=mozilla-central
Assignee | ||
Comment 3•6 years ago
I just rebooted it from the console. Let me know if that helps.
Reporter | ||
Comment 4•6 years ago
machine-13 failures continue after the reboot:
https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=238471439&repo=autoland&lineNumber=18055
Assignee | ||
Comment 5•6 years ago
(In reply to Geoff Brown [:gbrown] from comment #4)
> machine-13 failures continue after the reboot:
> https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=238471439&repo=autoland&lineNumber=18055
I just kicked off a reinstall for machine-13.
Reporter | ||
Comment 6•6 years ago
That seemed to shift the trouble from machine-13 to machine-20?
and/or machine-6?
(I'm not sure what to make of this!)
Assignee | ||
Comment 7•6 years ago
(In reply to Geoff Brown [:gbrown] from comment #6)
> That seemed to shift the trouble from machine-13 to machine-20?
> and/or machine-6?
> (I'm not sure what to make of this!)
These instances functionally live forever, so it's possible they accumulate enough cruft over time to fail jobs. Maybe we should do a rolling reinstall of the entire pool?
Reporter | ||
Comment 8•6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #7)
> Maybe we should do a rolling reinstall of the entire pool?
Sure, that seems like a good idea.
Assignee | ||
Comment 9•6 years ago
(In reply to Geoff Brown [:gbrown] from comment #8)
> (In reply to Chris Cooper [:coop] pronoun: he from comment #7)
> > Maybe we should do a rolling reinstall of the entire pool?
> Sure, that seems like a good idea.
I think I finally figured out the terraform state today. I've already reinstalled machine-0 and am now iterating through the rest of the pool one at a time. machine-13 is currently down because I screwed that one up differently, but it should come back online with the rest.
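For illustration only, the one-at-a-time approach described above could be sketched as follows. This is not the procedure actually used here: `reinstall` and `is_healthy` are hypothetical callables standing in for whatever provisioning and health-check mechanism is in place, and the choice to stop at the first worker that fails to recover is an assumption.

```python
import time


def rolling_reinstall(workers, reinstall, is_healthy, poll_seconds=30, max_polls=40):
    """Reinstall workers one at a time, waiting for each to recover.

    `reinstall` and `is_healthy` are hypothetical callables supplied by the
    caller. Returns the list of workers that never became healthy.
    """
    failed = []
    for worker in workers:
        reinstall(worker)
        for _ in range(max_polls):
            if is_healthy(worker):
                break
            time.sleep(poll_seconds)
        else:
            # The worker never recovered: record it and stop, so the pool
            # does not lose more capacity while the problem is investigated.
            failed.append(worker)
            break
    return failed
```

Handling one worker at a time keeps the rest of the pool available for jobs, at the cost of a slower rollout.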
Assignee | ||
Comment 10•6 years ago
OK, that took longer than it should have, but along the way I learned more about packet.net, terraform, and docker-worker than I probably ever wanted to know. All of the instances in packet.net have been re-created.
Please re-open if machine-13 is still slow, or file a new bug for systemic issues.