Closed Bug 1595383 Opened 5 years ago Closed 5 years ago

trees closed - terraform-packet gecko-t-linux instances not taking jobs

Categories

(Infrastructure & Operations :: RelOps: General, defect)

defect
Not set
blocker

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: aryx, Assigned: wcosta)

References

(Regression)

Details

(Keywords: regression)

https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-linux&refresh=5m shows that terraform-packet gecko-t-linux has >1k pending tasks but 0 running workers.

Trees will get closed for this because this is needed for our Android 7.0 x86-64 test coverage.

Flags: needinfo?(edunham)
Assignee: nobody → relops
Component: Operations: Taskcluster → RelOps: General
Flags: needinfo?(edunham) → needinfo?(aerickson)
Product: Cloud Services → Infrastructure & Operations
QA Contact: klibby

I rebooted machine-0 and it is now running a single job. That's good and bad: each instance is meant to have 4 workers running.

Looking back through the boot logs, it looks like there were some startup issues, but I'm not familiar enough with Packet to know whether this is normal.

[   80.104430] cloud-init[1140]: 2019-11-10 14:49:10,659 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [47/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c6321d68>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[  130.428890] cloud-init[1140]: 2019-11-10 14:50:01,687 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [98/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c62ae438>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[  151.451801] cloud-init[1140]: 2019-11-10 14:50:22,710 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c62aef98>, 'Connection to 169.254.169.254 timed out. (connect timeout=20.0)'))]
[  152.453387] cloud-init[1140]: 2019-11-10 14:50:23,711 - DataSourceEc2.py[CRITICAL]: Giving up on md from ['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds
[  156.652495] cloud-init[1566]: Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 finished at Sun, 10 Nov 2019 14:50:27 +0000. Datasource DataSourceNone.  Up 156.62 seconds
[  156.676129] cloud-init[1566]: 2019-11-10 14:50:27,891 - cc_final_message.py[WARNING]: Used fallback datasource
[FAILED] Failed to start Execute cloud user/final scripts.
See 'systemctl status cloud-final.service' for details.
root@machine-0:~# systemctl status cloud-final.service
● cloud-final.service - Execute cloud user/final scripts
   Loaded: loaded (/lib/systemd/system/cloud-final.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2019-11-10 14:50:27 UTC; 7min ago
  Process: 1566 ExecStart=/usr/bin/cloud-init modules --mode=final (code=exited, status=1/FAILURE)
 Main PID: 1566 (code=exited, status=1/FAILURE)

Nov 10 14:50:27 machine-0 ec2[1658]: 256 SHA256:zcsZ7TfEdtDR7NNDbCcI1iM4l2ioeUf2lBtz5pFmNp8 root@machine-0 (ED25519)
Nov 10 14:50:27 machine-0 ec2[1658]: 2048 SHA256:8zEmaNUSUWdlDL73KworM6sOmqPiRpEyk2AF4K2yrVk root@machine-0 (RSA)
Nov 10 14:50:27 machine-0 ec2[1658]: -----END SSH HOST KEY FINGERPRINTS-----
Nov 10 14:50:27 machine-0 ec2[1658]: #############################################################
Nov 10 14:50:27 machine-0 cloud-init[1566]: ci-info: no authorized ssh keys fingerprints found for user ubuntu.
Nov 10 14:50:27 machine-0 cloud-init[1566]: Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 finished at Sun, 10 Nov 2019 14:50:27 +0000. Datasource DataSourceNone.  Up 156.62 seconds
Nov 10 14:50:27 machine-0 cloud-init[1566]: 2019-11-10 14:50:27,891 - cc_final_message.py[WARNING]: Used fallback datasource
Nov 10 14:50:27 machine-0 systemd[1]: cloud-final.service: Main process exited, code=exited, status=1/FAILURE
Nov 10 14:50:27 machine-0 systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Nov 10 14:50:27 machine-0 systemd[1]: Failed to start Execute cloud user/final scripts.

I've rebooted all the packet.net instances now and they all seem to be taking jobs. However, they are all running only a single worker instead of 4.

Assignee: relops → wcosta
Status: NEW → ASSIGNED

I checked /etc/start-worker.yml on machine-0 and it has capacity: 4, so maybe this is an issue with the provider?

After the restarts, 50 machines started to take 4 tasks each, and the backlog is gone.

Trees remain closed for bug 1595368.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Flags: needinfo?(aerickson)
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Flags: needinfo?(wcosta)
Resolution: FIXED → ---

I found out what's wrong. There is a typo in the config file: it has shutdown.enable instead of shutdown.enabled. Combined with a recent change in docker-worker that enables shutdown by default, this caused the machines to halt after a period of time. I fixed it and I am deploying the machines, which should take around 1 hour.
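For illustration only, here is a minimal sketch of how a misspelled key like the one above could be caught before deployment, by rejecting any config key that is not in a known list. This is not how docker-worker validates its config. The `ALLOWED_KEYS` table, the `find_unknown_keys` helper, the idea that `capacity` and `shutdown` live in the same file, and the `False` values are all assumptions made for the example; only the key names `capacity`, `shutdown.enabled` and `shutdown.enable` come from this bug.

```python
from typing import Any

# Hypothetical schema: nested dicts name the allowed keys; None marks a leaf.
# The layout is assumed for this example, not taken from docker-worker.
ALLOWED_KEYS: dict[str, Any] = {
    "capacity": None,
    "shutdown": {"enabled": None},
}


def find_unknown_keys(config: dict, allowed: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of config keys that the schema does not list."""
    unknown = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if key not in allowed:
            # A misspelled key would otherwise be ignored silently,
            # leaving the default value in effect.
            unknown.append(path)
            continue
        sub_schema = allowed[key]
        if isinstance(sub_schema, dict) and isinstance(value, dict):
            unknown.extend(find_unknown_keys(value, sub_schema, prefix=f"{path}."))
    return unknown


if __name__ == "__main__":
    # Assumed values: the intent was to turn the shutdown behaviour off.
    broken = {"capacity": 4, "shutdown": {"enable": False}}
    fixed = {"capacity": 4, "shutdown": {"enabled": False}}
    print(find_unknown_keys(broken, ALLOWED_KEYS))  # ['shutdown.enable']
    print(find_unknown_keys(fixed, ALLOWED_KEYS))   # []
```

Failing the deploy when the returned list is non-empty would turn a silent fallback to the default into an immediate, visible error.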

Flags: needinfo?(wcosta)

In the middle of the process there was a problem with Azure and I had to start over, but some machines are up and you should see jobs running.

Status: REOPENED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED