trees closed - terraform-packet gecko-t-linux instances not taking jobs
Categories
(Infrastructure & Operations :: RelOps: General, defect)
Tracking
(Not tracked)
People
(Reporter: aryx, Assigned: wcosta)
References
(Regression)
Details
(Keywords: regression)
https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-linux&refresh=5m shows that terraform-packet gecko-t-linux has >1k pending tasks but 0 running workers.
Trees will be closed for this because this worker type is needed for our Android 7.0 x86-64 test coverage.
Comment 1•5 years ago
I rebooted machine-0 and it is now running a single job. That's good and bad: each instance is meant to have 4 workers running.
Looking back through the boot logs, it looks like there were some startup issues, but I'm not familiar enough with Packet to know whether this is normal.
[ 80.104430] cloud-init[1140]: 2019-11-10 14:49:10,659 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [47/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c6321d68>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[ 130.428890] cloud-init[1140]: 2019-11-10 14:50:01,687 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [98/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c62ae438>, 'Connection to 169.254.169.254 timed out. (connect timeout=50.0)'))]
[ 151.451801] cloud-init[1140]: 2019-11-10 14:50:22,710 - url_helper.py[WARNING]: Calling 'http://169.254.169.254/2009-04-04/meta-data/instance-id' failed [119/120s]: request error [HTTPConnectionPool(host='169.254.169.254', port=80): Max retries exceeded with url: /2009-04-04/meta-data/instance-id (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fe3c62aef98>, 'Connection to 169.254.169.254 timed out. (connect timeout=20.0)'))]
[ 152.453387] cloud-init[1140]: 2019-11-10 14:50:23,711 - DataSourceEc2.py[CRITICAL]: Giving up on md from ['http://169.254.169.254/2009-04-04/meta-data/instance-id'] after 120 seconds
[ 156.652495] cloud-init[1566]: Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 finished at Sun, 10 Nov 2019 14:50:27 +0000. Datasource DataSourceNone. Up 156.62 seconds
[ 156.676129] cloud-init[1566]: 2019-11-10 14:50:27,891 - cc_final_message.py[WARNING]: Used fallback datasource
[FAILED] Failed to start Execute cloud user/final scripts.
See 'systemctl status cloud-final.service' for details.
root@machine-0:~# systemctl status cloud-final.service
● cloud-final.service - Execute cloud user/final scripts
Loaded: loaded (/lib/systemd/system/cloud-final.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Sun 2019-11-10 14:50:27 UTC; 7min ago
Process: 1566 ExecStart=/usr/bin/cloud-init modules --mode=final (code=exited, status=1/FAILURE)
Main PID: 1566 (code=exited, status=1/FAILURE)
Nov 10 14:50:27 machine-0 ec2[1658]: 256 SHA256:zcsZ7TfEdtDR7NNDbCcI1iM4l2ioeUf2lBtz5pFmNp8 root@machine-0 (ED25519)
Nov 10 14:50:27 machine-0 ec2[1658]: 2048 SHA256:8zEmaNUSUWdlDL73KworM6sOmqPiRpEyk2AF4K2yrVk root@machine-0 (RSA)
Nov 10 14:50:27 machine-0 ec2[1658]: -----END SSH HOST KEY FINGERPRINTS-----
Nov 10 14:50:27 machine-0 ec2[1658]: #############################################################
Nov 10 14:50:27 machine-0 cloud-init[1566]: ci-info: no authorized ssh keys fingerprints found for user ubuntu.
Nov 10 14:50:27 machine-0 cloud-init[1566]: Cloud-init v. 19.2-36-g059d049c-0ubuntu2~18.04.1 finished at Sun, 10 Nov 2019 14:50:27 +0000. Datasource DataSourceNone. Up 156.62 seconds
Nov 10 14:50:27 machine-0 cloud-init[1566]: 2019-11-10 14:50:27,891 - cc_final_message.py[WARNING]: Used fallback datasource
Nov 10 14:50:27 machine-0 systemd[1]: cloud-final.service: Main process exited, code=exited, status=1/FAILURE
Nov 10 14:50:27 machine-0 systemd[1]: cloud-final.service: Failed with result 'exit-code'.
Nov 10 14:50:27 machine-0 systemd[1]: Failed to start Execute cloud user/final scripts.
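For triaging many instances at once, the two signals in the log above (repeated metadata timeouts, then the datasource giving up) can be picked out mechanically. This is a minimal sketch: the regular expressions are derived only from the log lines quoted in this comment, and the function name `summarize_boot_log` is hypothetical rather than part of cloud-init or any existing tool.

```python
import re

# Patterns derived from the cloud-init lines quoted above (assumption:
# other cloud-init versions may word these messages differently).
TIMEOUT_RE = re.compile(
    r"Calling '(?P<url>[^']+)' failed \[(?P<elapsed>\d+)/(?P<limit>\d+)s\]"
)
GIVE_UP_RE = re.compile(r"DataSourceEc2\.py\[CRITICAL\]: Giving up on md")


def summarize_boot_log(lines):
    """Collect failed metadata calls and whether the datasource gave up."""
    attempts = []
    gave_up = False
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            attempts.append(
                (match["url"], int(match["elapsed"]), int(match["limit"]))
            )
        elif GIVE_UP_RE.search(line):
            gave_up = True
    return {"failed_attempts": attempts, "gave_up": gave_up}
```

A summary with `gave_up` set to true matches the "Used fallback datasource" warning seen above. Whether that is expected on Packet is still an open question in this comment.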
Comment 2•5 years ago
I've rebooted all the packet.net instances now and they all seem to be taking jobs. However, they are all running only a single worker instead of 4.
Comment 3•5 years ago
I checked /etc/start-worker.yml on machine-0 and it has capacity: 4, so maybe this is an issue with the provider?
Comment 4•5 years ago (Reporter)
After the restarts, 50 machines started to take 4 tasks each, and the backlog is gone.
Trees remain closed for bug 1595368.
Comment 5•5 years ago (Reporter)
The issue has returned: no active workers, trees closed. See https://earthangel-b40313e5.influxcloud.net/d/slXwf4emz/workers?orgId=1&var-workerType=gecko-t-linux&refresh=5m
Comment 6•5 years ago (Assignee)
I found out what's wrong. There is a typo in the config file: instead of shutdown.enabled it is shutdown.enable, and that, combined with a recent change in docker-worker to enable shutdown by default, caused the machines to halt after a period of time. I fixed it and I am deploying the machines, which should take around 1 hour.
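A misspelled key like this is silently ignored, so the default takes over. One way to catch it before deploying is to compare the config's dotted key paths against an allow-list. The following is a minimal sketch, not docker-worker's actual validation: `KNOWN_KEYS`, `flatten`, and `unknown_keys` are hypothetical names, and the allow-list contains only the two keys mentioned in this bug.

```python
import difflib

# Hypothetical allow-list; a real one would cover the full worker config.
KNOWN_KEYS = {"capacity", "shutdown.enabled"}


def flatten(config, prefix=""):
    """Yield the dotted key path of every leaf in a nested dict."""
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=path + ".")
        else:
            yield path


def unknown_keys(config, known=KNOWN_KEYS):
    """Map each unrecognized key path to its closest known key, if any."""
    candidates = sorted(known)
    return {
        path: difflib.get_close_matches(path, candidates, n=1)
        for path in flatten(config)
        if path not in known
    }


# The typo described above is reported together with a suggested fix.
print(unknown_keys({"capacity": 4, "shutdown": {"enable": False}}))
```

Failing the deploy when `unknown_keys` returns anything would turn this kind of typo into an immediate error, rather than machines halting some time later.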
Comment 7•5 years ago (Assignee)
In the middle of the process there was a problem with Azure, so I had to start over, but some machines are up and you should see jobs running.