Closed
Bug 899784
Opened 11 years ago
Closed 10 years ago
Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: armenzg, Unassigned)
References
Details
https://tbpl.mozilla.org/php/getParsedLog.php?id=25925156&tree=Mozilla-Inbound#error0

./configs/talos/linux_config.py: "title": os.uname()[1].lower().split('.')[0],
./configs/talos/mac_config.py: "title": os.uname()[1].lower().split('.')[0],

12:55:08 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:08 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:13 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:13 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:23 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:23 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:43 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:43 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:56:23 CRITICAL - DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:56:23 CRITICAL - DEBUG : process_Request line: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43 CRITICAL - FAIL: Graph server unreachable (5 attempts)
12:57:43 CRITICAL - RETURN:No machine_name called 'client-builders-mac-mini-10' can be found
12:57:43 CRITICAL - RETURN: raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43 ERROR - Traceback (most recent call last):
12:57:43 CRITICAL - talos.utils.talosError: 'Graph server unreachable (5 attempts)\nsend failed, graph server says:\nNo machine_name called \'client-builders-mac-mini-10\' can be found\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 271, in handleRequest\n metadata = MetaDataFromTalos(databaseCursor, databaseModule, inputStream)\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 63, in __init__\n self.doDatabaseThings(databaseCursor)\n File "/var/www/html/graphs/server/pyfomatic/collect.py", line 92, in doDatabaseThings\n raise DatabaseException("No machine_name called \'%s\' can be found" % self.machine_name)\n\n'
12:57:43 ERROR - Return code: 1
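Both talos configs derive the "title" from the machine's hostname with the expression quoted above, and the graph server then rejects the unknown name 'client-builders-mac-mini-10'. The following is a minimal sketch of that derivation plus a pre-flight name check. It assumes the rev4 slave naming pattern seen in this bug (talos-r4-snow-NNN, talos-r4-lion-NNN); that pattern, the helper names, and the full hostnames in the example are hypothetical and not part of talos.

```python
import re

# Same expression the talos configs use for "title". The real configs call
# os.uname()[1] themselves; here the nodename is passed in so it can be tested.
def talos_title(nodename: str) -> str:
    return nodename.lower().split('.')[0]

# Assumption: rev4 talos slaves follow the pattern seen in this bug
# (talos-r4-snow-029, talos-r4-lion-067, ...). Not a documented convention.
TALOS_R4_NAME = re.compile(r"talos-r4-(snow|lion)-\d{3}")

def looks_like_talos_slave(title: str) -> bool:
    """True if the derived title matches the assumed rev4 naming pattern."""
    return TALOS_R4_NAME.fullmatch(title) is not None

if __name__ == "__main__":
    # Both full hostnames are hypothetical; only the short names come from the bug.
    for nodename in ("talos-r4-snow-029.build.example.com",
                     "client-builders-mac-mini-10.local"):
        title = talos_title(nodename)
        print(f"{title}: {'ok' if looks_like_talos_slave(title) else 'unexpected name'}")
```

Run directly, this prints "talos-r4-snow-029: ok" and "client-builders-mac-mini-10: unexpected name". A check of this kind would stop a misnamed slave before it runs a test, rather than after the results fail to upload.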
Updated•11 years ago
Blocks: t-snow-r4-0051
Reporter
Updated•11 years ago
Summary: Some machines can loose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Comment 1•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=25966660&tree=Mozilla-Inbound
Comment 2•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=26005016&tree=Mozilla-Inbound
Reporter
Comment 3•11 years ago
We're switching to the new Puppet infra soon (<1 week). If we have problematic slaves, we should disable them until they are synced up with the new Puppet infra.
Callek, what is the bug for the new Puppet infra?
Slaves with the issue:
slave: talos-r4-snow-029
slave: talos-r4-lion-067
slave: talos-r4-snow-053
Flags: needinfo?(bugspam.Callek)
Summary: Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Reporter
Updated•11 years ago
Flags: needinfo?(bugspam.Callek) → needinfo?(coop)
Comment 4•11 years ago
Even with Puppet attached, we weren't immune to this, but we would error out *in Puppet* before taking jobs. I would shy away from disabling these slaves; wait times on these platforms are already terrible.
Running the steps in remote_scutil_cmds.bash, either via the script or by hand, will resurrect a machine in this state:
https://hg.mozilla.org/build/braindump/file/8ccc8daef11b/mac-related/remote_scutil_cmds.bash
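The braindump script linked in comment 4 is not reproduced in this bug. Judging only from its name, it drives macOS `scutil`, so the sketch below assumes that resurrecting a slave means resetting the three names `scutil --set` manages: `ComputerName`, `LocalHostName` and `HostName`. The helper names and the example hostnames are hypothetical, and the real script may do more or less than this. Commands are only printed unless `dry_run=False`, and `scutil --set` has to run as root.

```python
import subprocess
from typing import List

# The three macOS name settings, in the order they are reset.
NAME_KEYS = ("ComputerName", "LocalHostName", "HostName")

def scutil_commands(short_name: str, fqdn: str) -> List[List[str]]:
    """Build (but do not run) scutil commands restoring a slave's name.

    Hypothetical helper: assumes the fix is just resetting these three keys.
    """
    if "." in short_name:
        # LocalHostName is the Bonjour name and may not contain dots.
        raise ValueError("short_name must not contain dots")
    values = {
        "ComputerName": short_name,
        "LocalHostName": short_name,
        "HostName": fqdn,
    }
    return [["scutil", "--set", key, values[key]] for key in NAME_KEYS]

def apply(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False (needs root)."""
    for argv in commands:
        print(" ".join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)

if __name__ == "__main__":
    # Hypothetical FQDN; only the short name appears in this bug.
    apply(scutil_commands("talos-r4-snow-056", "talos-r4-snow-056.build.example.com"))
```

Keeping command construction separate from execution makes it easy to review what would run on a slave before touching it over a remote shell.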
Flags: needinfo?(coop)
Comment 5•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=26168871&tree=Mozilla-Inbound
Comment 6•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=26191944&tree=Mozilla-Inbound
Comment 7•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=26287922&tree=Mozilla-Inbound - snow-056
Assignee
Updated•11 years ago
Product: mozilla.org → Release Engineering
Comment 9•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=26627050&tree=Mozilla-Aurora
Comment 10•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=27024084&tree=Mozilla-Aurora
Comment 11•11 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=27254632&tree=Mozilla-Aurora
Comment 12•10 years ago
RyanVM says he hasn't seen this bug in ages.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Assignee
Updated•6 years ago
Component: General Automation → General