Closed Bug 1540280 Opened 6 years ago Closed 6 years ago

packet.net "machine-13" is very slow

Categories

(Taskcluster :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gbrown, Assigned: coop)

References

(Blocks 1 open bug)

Details

Crash Data

Over the last 24 hours or so there have been several odd failures reported in bug 1411358 for test-android-em-7.0-x86* tests. In these failures, tests are running very slowly, resulting in timeouts.

Logs show they all have the same worker-id: "machine-13".

[taskcluster 2019-03-29 10:30:56.118Z] Task ID: fKGlkRUURLmaXmljhVdxUg
[taskcluster 2019-03-29 10:30:56.118Z] Worker ID: machine-13
[taskcluster 2019-03-29 10:30:56.118Z] Worker Group: packet-sjc1
[taskcluster 2019-03-29 10:30:56.118Z] Worker Node Type: packet.net
[taskcluster 2019-03-29 10:30:56.118Z] Worker Type: gecko-t-linux
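
A minimal sketch of confirming which worker ran a given task through the Taskcluster Queue API, assuming the taskcluster Python client; the rootUrl is a placeholder for illustration:

import taskcluster

# rootUrl is an assumption; use the deployment these tasks actually ran on.
queue = taskcluster.Queue({"rootUrl": "https://taskcluster.net"})

# Task ID taken from the log excerpt above.
status = queue.status("fKGlkRUURLmaXmljhVdxUg")["status"]
for run in status["runs"]:
    # Each claimed run records the worker that took it.
    print(run["runId"], run.get("workerGroup"), run.get("workerId"))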

Reboot, quarantine, remove?

Examples:

https://treeherder.mozilla.org/logviewer.html#?job_id=236858209&repo=mozilla-central
https://treeherder.mozilla.org/logviewer.html#?job_id=236770611&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=236770603&repo=mozilla-inbound
https://treeherder.mozilla.org/logviewer.html#?job_id=236671157&repo=autoland
https://treeherder.mozilla.org/logviewer.html#?job_id=236660940&repo=autoland

I don't think that anyone but wander knows how to do anything other than quarantine this worker. But at the very least a reboot seems in order.
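
A rough sketch of quarantining the worker through the Queue API, assuming the taskcluster Python client and the required scopes; the rootUrl, credentials, and provisioner ID below are placeholders, not the real values:

import taskcluster

# rootUrl, credentials, and provisionerId are placeholders for illustration.
queue = taskcluster.Queue({
    "rootUrl": "https://taskcluster.net",
    "credentials": {"clientId": "...", "accessToken": "..."},
})

queue.quarantineWorker(
    "releng-hardware",   # provisionerId: a guess; check the worker explorer for the real one
    "gecko-t-linux",     # workerType, from the log above
    "packet-sjc1",       # workerGroup
    "machine-13",        # workerId
    # Set quarantineUntil to a past date to lift the quarantine later.
    {"quarantineUntil": "2019-04-30T00:00:00.000Z"},
)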

I just rebooted it from the console. Let me know if that helps.
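
For reference, the same reboot can be issued against the packet.net REST API instead of the console; a sketch using requests, where the device UUID and API token are placeholders:

import requests

DEVICE_ID = "00000000-0000-0000-0000-000000000000"  # placeholder: machine-13's packet.net device UUID
TOKEN = "..."                                        # placeholder: a packet.net API token

# POST a reboot action against the device.
resp = requests.post(
    f"https://api.packet.net/devices/{DEVICE_ID}/actions",
    headers={"X-Auth-Token": TOKEN},
    json={"type": "reboot"},
)
resp.raise_for_status()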

Assignee: nobody → coop
Status: NEW → ASSIGNED
Flags: needinfo?(gbrown)

(In reply to Geoff Brown [:gbrown] from comment #4)
> machine-13 failures continue after the reboot:
>
> https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=238471439&repo=autoland&lineNumber=18055

I just kicked off a reinstall for machine-13.

(In reply to Geoff Brown [:gbrown] from comment #6)
> That seemed to shift the trouble from machine-13 to machine-20? and/or
> machine-6? (I'm not sure what to make of this!)

These instances functionally live forever, so it's possible they accumulate enough cruft over time to fail jobs. Maybe we should do a rolling reinstall of the entire pool?

(In reply to Chris Cooper [:coop] pronoun: he from comment #7)
> Maybe we should do a rolling reinstall of the entire pool?

Sure, that seems like a good idea.

(In reply to Geoff Brown [:gbrown] from comment #8)
> (In reply to Chris Cooper [:coop] pronoun: he from comment #7)
> > Maybe we should do a rolling reinstall of the entire pool?
>
> Sure, that seems like a good idea.

I think I finally figured out the terraform state today. I've already re-installed machine-0, and am iterating through the rest of the pool now, one at a time. machine-13 is currently down because I screwed that one up differently, but it should come back online with the rest.
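
A rolling re-create of the pool amounts to tainting and re-applying one device at a time so the rest keep taking jobs. A rough sketch, assuming a current terraform CLI; the resource addresses and pool size are assumptions, not the actual terraform config:

import subprocess

# Assumed resource naming and count; adjust to the real terraform config.
machines = [f"packet_device.machine[{i}]" for i in range(25)]

for resource in machines:
    # Mark the device for re-creation, then apply only that resource.
    subprocess.run(["terraform", "taint", resource], check=True)
    subprocess.run(["terraform", "apply", "-target=" + resource, "-auto-approve"], check=True)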

OK, that took longer than it should have, but along the way I learned more about packet.net, terraform, and docker-worker than I probably ever wanted to know. All of the instances in packet.net have been re-created.

Please re-open if machine-13 is still slow, or file a new bug for systemic issues.

Status: ASSIGNED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Crash Signature: [@ libc.so + 0x8c66a]