Closed Bug 1540280 Opened 6 years ago Closed 6 years ago

packet.net "machine-13" is very slow

Categories

(Taskcluster :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: coop)

References

(Blocks 1 open bug)

Details

Crash Data

Over the last 24 hours or so there have been several odd failures reported in bug 1411358 for test-android-em-7.0-x86* tests. In these failures, tests are running very slowly, resulting in timeouts.

Logs show they all have the same worker-id: "machine-13".

[taskcluster 2019-03-29 10:30:56.118Z] Task ID: fKGlkRUURLmaXmljhVdxUg
[taskcluster 2019-03-29 10:30:56.118Z] Worker ID: machine-13
[taskcluster 2019-03-29 10:30:56.118Z] Worker Group: packet-sjc1
[taskcluster 2019-03-29 10:30:56.118Z] Worker Node Type: packet.net
[taskcluster 2019-03-29 10:30:56.118Z] Worker Type: gecko-t-linux
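
A minimal sketch of confirming which worker ran a given task through the Taskcluster Queue API, assuming the taskcluster Python client; the rootUrl is a placeholder for illustration:

import taskcluster

# rootUrl is an assumption; use the deployment these tasks actually ran on.
queue = taskcluster.Queue({"rootUrl": "https://taskcluster.net"})

# Task ID taken from the log excerpt above.
status = queue.status("fKGlkRUURLmaXmljhVdxUg")["status"]
for run in status["runs"]:
    # Each claimed run records the worker that took it.
    print(run["runId"], run.get("workerGroup"), run.get("workerId"))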

Reboot, quarantine, remove?

Examples:

https://treeherder.mozilla.org/logviewer.html#?job_id=236858209&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=236770611&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=236770603&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=236671157&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=236660940&repo=autoland

I don't think that anyone but wander knows how to do anything other than quarantine this worker. But at the very least a reboot seems in order.
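
A rough sketch of quarantining the worker through the Queue API, assuming the taskcluster Python client and the required scopes; the rootUrl, credentials, and provisioner ID below are placeholders, not the real values:

import taskcluster

# rootUrl, credentials, and provisionerId are placeholders for illustration.
queue = taskcluster.Queue({
    "rootUrl": "https://taskcluster.net",
    "credentials": {"clientId": "...", "accessToken": "..."},
})

queue.quarantineWorker(
    "releng-hardware",   # provisionerId: a guess; check the worker explorer for the real one
    "gecko-t-linux",     # workerType, from the log above
    "packet-sjc1",       # workerGroup
    "machine-13",        # workerId
    # Set quarantineUntil to a past date to lift the quarantine later.
    {"quarantineUntil": "2019-04-30T00:00:00.000Z"},
)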

I just rebooted it from the console. Let me know if that helps.
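
For reference, the same reboot can be issued against the packet.net REST API instead of the console; a sketch using requests, where the device UUID and API token are placeholders:

import requests

DEVICE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder: machine-13's packet.net device UUID
TOKEN = "..."                                        # placeholder: a packet.net API token

# POST a reboot action against the device.
resp = requests.post(
    f"https://api.packet.net/devices/{DEVICE_ID}/actions",
    headers={"X-Auth-Token": TOKEN},
    json={"type": "reboot"},
)
resp.raise_for_status()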

Assignee: nobody → coop
Status: NEW → ASSIGNED
Flags: needinfo?(gbrown)

(In reply to Geoff Brown [:gbrown] from comment #4)
> machine-13 failures continue after the reboot:
>
> https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=238471439&repo=autoland&lineNumber=18055

I just kicked off a reinstall for machine-13.

(In reply to Geoff Brown [:gbrown] from comment #6)
> That seemed to shift the trouble from machine-13 to machine-20? and/or
> machine-6? (I'm not sure what to make of this!)

These instances functionally live forever, so it's possible they accumulate enough cruft over time to fail jobs. Maybe we should do a rolling reinstall of the entire pool?

(In reply to Chris Cooper [:coop] pronoun: he from comment #7)
> Maybe we should do a rolling reinstall of the entire pool?

Sure, that seems like a good idea.

(In reply to Geoff Brown [:gbrown] from comment #8)
> (In reply to Chris Cooper [:coop] pronoun: he from comment #7)
> > Maybe we should do a rolling reinstall of the entire pool?
>
> Sure, that seems like a good idea.

I think I finally figured out the terraform state today. I've already re-installed machine-0, and am iterating through the rest of the pool now, one at a time. machine-13 is currently down because I screwed that one up differently, but it should come back online with the rest.
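
A rolling re-create of the pool amounts to tainting and re-applying one device at a time so the rest keep taking jobs. A rough sketch, assuming a current terraform CLI; the resource addresses and pool size are assumptions, not the actual terraform config:

import subprocess

# Assumed resource naming and count; adjust to the real terraform config.
machines = [f"packet_device.machine[{i}]" for i in range(25)]

for resource in machines:
    # Mark the device for re-creation, then apply only that resource.
    subprocess.run(["terraform", "taint", resource], check=True)
    subprocess.run(["terraform", "apply", "-target=" + resource, "-auto-approve"], check=True)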

OK, that took longer than it should have, but along the way I learned more about packet.net, terraform, and docker-worker than I probably ever wanted to know. All of the instances in packet.net have been re-created.

Please re-open if machine-13 is still slow, or file a new bug for systemic issues.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Crash Signature: [@ libc.so + 0x8c66a]