1595383 - trees closed - terraform-packet gecko-t-linux instances not taking jobs

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

5 years ago

https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-linux&refresh=5m shows that terraform-packet gecko-t-linux has >1k pending tasks but 0 running workers.

Trees will get closed for this because this is needed for our Android 7.0 x86-64 test coverage.

Flags: needinfo?(edunham)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Updated

5 years ago

Assignee: nobody → relops

Component: Operations: Taskcluster → RelOps: General

Flags: needinfo?(edunham) → needinfo?(aerickson)

Product: Cloud Services → Infrastructure & Operations

QA Contact: klibby

Chris Cooper [:coop] (he/him)

Comment 1

5 years ago

I rebooted machine-0 and it is now running a single job. That's good and bad: each instance is meant to have 4 workers running.

Looking back through the boot logs, it looks like there were some startup issues, but I'm not familiar enough with Packet to know whether this is normal.

[   80.104430] cloud-init[1140]: 2019-11-10 14:49:10,659 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [47/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c6321d68>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[  130.428890] cloud-init[1140]: 2019-11-10 14:50:01,687 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [98/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c62ae438>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[  151.451801] cloud-init[1140]: 2019-11-10 14:50:22,710 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c62aef98>, 'Connection to 169.254.169.254 timed out. (connect timeout=20.0)'))]
[  152.453387] cloud-init[1140]: 2019-11-10 14:50:23,711 - DataSourceEc2.py[CRITICAL]: Giving up on md from ['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds

[  156.652495] cloud-init[1566]: Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 finished at Sun, 10 Nov 2019 14:50:27 +0000. Datasource DataSourceNone.  Up 156.62 seconds
[  156.676129] cloud-init[1566]: 2019-11-10 14:50:27,891 - cc_final_message.py[WARNING]: Used fallback datasource
[FAILED] Failed to start Execute cloud user/final scripts.
See 'systemctl status cloud-final.service' for details.

root@machine-0:~# systemctl status cloud-final.service
● cloud-final.service - Execute cloud user/final scripts
   Loaded: loaded (/lib/systemd/system/cloud-final.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2019-11-10 14:50:27 UTC; 7min ago
  Process: 1566 ExecStart=/usr/bin/cloud-init modules --mode=final (code=exited, status=1/FAILURE)
 Main PID: 1566 (code=exited, status=1/FAILURE)

Nov 10 14:50:27 machine-0 ec2[1658]: 256 SHA256:zcsZ7TfEdtDR7NNDbCcI1iM4l2ioeUf2lBtz5pFmNp8 root@machine-0 (ED25519)
Nov 10 14:50:27 machine-0 ec2[1658]: 2048 SHA256:8zEmaNUSUWdlDL73KworM6sOmqPiRpEyk2AF4K2yrVk root@machine-0 (RSA)
Nov 10 14:50:27 machine-0 ec2[1658]: -----END SSH HOST KEY FINGERPRINTS-----
Nov 10 14:50:27 machine-0 ec2[1658]: #############################################################
Nov 10 14:50:27 machine-0 cloud-init[1566]: ci-info: no authorized ssh keys fingerprints found for user ubuntu.
Nov 10 14:50:27 machine-0 cloud-init[1566]: Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 finished at Sun, 10 Nov 2019 14:50:27 +0000. Datasource DataSourceNone.  Up 156.62 seconds
Nov 10 14:50:27 machine-0 cloud-init[1566]: 2019-11-10 14:50:27,891 - cc_final_message.py[WARNING]: Used fallback datasource
Nov 10 14:50:27 machine-0 systemd[1]: cloud-final.service: Main process exited, code=exited, status=1/FAILURE
Nov 10 14:50:27 machine-0 systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Nov 10 14:50:27 machine-0 systemd[1]: Failed to start Execute cloud user/final scripts.
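
For anyone checking other instances for the same symptoms, here is a minimal Python sketch that scans boot log lines for the three messages quoted above: the EC2 metadata lookup giving up, cloud-init using the fallback datasource, and cloud-final.service failing. The function name and the shape of the summary are assumptions made for illustration, not part of any existing tooling, and finding these messages does not by itself show they are related to the workers not taking jobs.

```python
import re

# Patterns are taken from the log lines quoted in this comment.
GAVE_UP = re.compile(r"DataSourceEc2\.py\[CRITICAL\]: Giving up on md")
FALLBACK = re.compile(r"Used fallback datasource")
UNIT_RESULT = re.compile(r"cloud-final\.service: Failed with result '([^']+)'")


def summarize_boot_log(lines):
    """Report which of the three symptoms appear in an iterable of log lines.

    Hypothetical helper: the name and the returned dict are illustrative only.
    """
    summary = {
        "metadata_gave_up": False,
        "fallback_datasource": False,
        "cloud_final_result": None,
    }
    for line in lines:
        if GAVE_UP.search(line):
            summary["metadata_gave_up"] = True
        if FALLBACK.search(line):
            summary["fallback_datasource"] = True
        match = UNIT_RESULT.search(line)
        if match:
            summary["cloud_final_result"] = match.group(1)
    return summary
```

It could be fed the output of `journalctl -b` or a saved console log, one line per list element.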

Chris Cooper [:coop] (he/him)

Comment 2

5 years ago

I've rebooted all the packet.net instances now and they all seem to be taking jobs. However, they are all running only a single worker instead of 4.

Chris Cooper [:coop] (he/him)

Updated

5 years ago

Assignee: relops → wcosta

Status: NEW → ASSIGNED

Chris Cooper [:coop] (he/him)

Comment 3

5 years ago

I checked /etc/start-worker.yml on machine-0 and it has capacity: 4, so maybe this is an issue with the provider?

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 4

5 years ago

After the restarts, 50 machines started to take 4 tasks each; the backlog is gone.

Trees remain closed for bug 1595368.

Status: ASSIGNED → RESOLVED

Closed: 5 years ago

Flags: needinfo?(aerickson)

Resolution: --- → FIXED

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Comment 5

5 years ago

Issue has returned, no active workers, trees closed: https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-linux&refresh=5m

Status: RESOLVED → REOPENED

Flags: needinfo?(wcosta)

Resolution: FIXED → ---

Wander Lairson Costa

Assignee

Comment 6

5 years ago

I found out what's wrong. There is a typo in the config file: instead of shutdown.enabled it has shutdown.enable. Combined with a recent change in docker-worker that enables shutdown by default, this caused the machines to halt after a period of time. I fixed it and I am deploying the machines, which should take around 1 hour.
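
A misspelled key like this is silently ignored, so the default takes over. One way to catch it before deployment is to reject any key the worker does not recognise. The Python sketch below illustrates the idea under stated assumptions: the KNOWN_KEYS set is hypothetical and is not taken from docker-worker's real schema, and the config is assumed to be already loaded into nested dicts.

```python
# Hypothetical list of accepted dotted keys; NOT docker-worker's actual schema.
KNOWN_KEYS = {"capacity", "shutdown.enabled", "shutdown.afterIdleSeconds"}


def flatten(config, prefix=""):
    """Turn a nested dict into a list of dotted key paths for its leaf values."""
    keys = []
    for name, value in config.items():
        dotted = f"{prefix}{name}"
        if isinstance(value, dict):
            keys.extend(flatten(value, dotted + "."))
        else:
            keys.append(dotted)
    return keys


def unknown_keys(config, known=KNOWN_KEYS):
    """Return the dotted keys present in config that are not in the known set."""
    return sorted(key for key in flatten(config) if key not in known)
```

With this check, a config containing `{"shutdown": {"enable": False}}` would be reported as having the unknown key `shutdown.enable`, while the corrected `shutdown.enabled` passes.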

Flags: needinfo?(wcosta)

Wander Lairson Costa

Assignee

Comment 7

5 years ago

In the middle of the process there was a problem with Azure and I had to start over, but some machines are up and you should see jobs running.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Updated

5 years ago

Status: REOPENED → RESOLVED

Closed: 5 years ago → 5 years ago

Resolution: --- → FIXED

Categories

(Infrastructure & Operations :: RelOps: General, defect)

Tracking

(Not tracked)

People

(Reporter: aryx, Assigned: wcosta)

References

(Regression)

Details

(Keywords: regression)

Crash Data

Security

(public)
