Closed Bug 899784 Opened 11 years ago Closed 10 years ago

Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10

Categories

(Release Engineering :: General, defect)

Hardware: x86
OS: macOS
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: armenzg, Unassigned)

References

Details

https://tbpl.mozilla.org/php/getParsedLog.php?id=25925156&tree=Mozilla-Inbound#error0

./configs/talos/linux_config.py:    "title": os.uname()[1].lower().split('.')[0],
./configs/talos/mac_config.py:    "title": os.uname()[1].lower().split('.')[0],
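
Both configs take the title sent to the graph server from the short hostname, so whatever name the machine currently reports is used verbatim. A minimal sketch of that expression follows; the helper name `title_from_nodename` and the `build.example.org` domain are made up for illustration and are not part of the talos configs.

```python
import os


def title_from_nodename(nodename):
    """Apply the same expression the talos configs use to a given node name."""
    return nodename.lower().split('.')[0]


# What the configs effectively compute on the running machine.
current_title = title_from_nodename(os.uname()[1])

# Hypothetical fully qualified names, for illustration only.
print(title_from_nodename("talos-r4-snow-029.build.example.org"))
# -> talos-r4-snow-029
print(title_from_nodename("client-builders-mac-mini-10.build.example.org"))
# -> client-builders-mac-mini-10
```

A machine that has fallen back to a generic name therefore reports a title the graph server has no record of, which matches the DatabaseException in the log.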

12:55:08 CRITICAL -  DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:08 CRITICAL -  DEBUG : process_Request line:     raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:13 CRITICAL -  DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:13 CRITICAL -  DEBUG : process_Request line:     raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:23 CRITICAL -  DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:23 CRITICAL -  DEBUG : process_Request line:     raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:55:43 CRITICAL -  DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:55:43 CRITICAL -  DEBUG : process_Request line:     raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:56:23 CRITICAL -  DEBUG : process_Request line: No machine_name called 'client-builders-mac-mini-10' can be found
12:56:23 CRITICAL -  DEBUG : process_Request line:     raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43 CRITICAL -  FAIL: Graph server unreachable (5 attempts)
12:57:43 CRITICAL -  RETURN:No machine_name called 'client-builders-mac-mini-10' can be found
12:57:43 CRITICAL -  RETURN:    raise DatabaseException("No machine_name called '%s' can be found" % self.machine_name)
12:57:43    ERROR -  Traceback (most recent call last):
12:57:43 CRITICAL -  talos.utils.talosError: 'Graph server unreachable (5 attempts)\nsend failed, graph server says:\nNo machine_name called \'client-builders-mac-mini-10\' can be found\n  File "/var/www/html/graphs/server/pyfomatic/collect.py", line 271, in handleRequest\n    metadata = MetaDataFromTalos(databaseCursor, databaseModule, inputStream)\n  File "/var/www/html/graphs/server/pyfomatic/collect.py", line 63, in __init__\n    self.doDatabaseThings(databaseCursor)\n  File "/var/www/html/graphs/server/pyfomatic/collect.py", line 92, in doDatabaseThings\n    raise DatabaseException("No machine_name called \'%s\' can be found" % self.machine_name)\n\n'
12:57:43    ERROR - Return code: 1
Blocks: 713055
No longer depends on: 713055
Summary: Some machines can loose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
We're switching to the new Puppet infra soon (<1 week).
If we have problematic slaves, we should disable them until we sync up with the new Puppet infra.

Callek, what is the bug for the new Puppet infra?


Slaves with the issue:
slave: talos-r4-snow-029
slave: talos-r4-lion-067
slave: talos-r4-snow-053
Flags: needinfo?(bugspam.Callek)
Summary: Some machines can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10 → Rev4 machines have Puppet disabled which can lose their name and burn talos jobs because they end up with a name like client-builders-mac-mini-10
Flags: needinfo?(bugspam.Callek) → needinfo?(coop)
Even with puppet attached, we weren't immune to this, but we would error out *in puppet* before taking jobs.
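
A comparable guard could in principle run outside Puppet before a slave takes jobs. This is only a rough sketch: the regex assumes every slave follows the `talos-r4-<os>-<number>` shape of the names listed in this bug, and `hostname_is_sane` and `preflight_exit_code` are hypothetical names, not existing code.

```python
import os
import re

# Assumed naming scheme, inferred from the slave names listed in this bug.
EXPECTED_NAME = re.compile(r"^talos-r4-[a-z]+-\d{3}$")


def hostname_is_sane(title):
    """Return True if the short hostname looks like a known talos slave name."""
    return EXPECTED_NAME.match(title) is not None


def preflight_exit_code(title=None):
    """Return 0 if the slave may take jobs, 1 if its name looks wrong."""
    if title is None:
        # Same derivation as the talos configs.
        title = os.uname()[1].lower().split('.')[0]
    if not hostname_is_sane(title):
        print("refusing to take jobs: unexpected hostname %r" % title)
        return 1
    return 0
```

Failing early like this would cost one slave's capacity instead of burning the talos jobs it picks up.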

I would shy away from disabling these slaves. Wait times on these platforms are already terrible. 

Running the steps in remote_scutil_cmds.bash, either via the script or by hand, will resurrect a machine in this state:

https://hg.mozilla.org/build/braindump/file/8ccc8daef11b/mac-related/remote_scutil_cmds.bash
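
The script's contents are not reproduced in this bug. As a rough sketch of the kind of steps involved, assuming the fix amounts to re-setting the three names macOS tracks with `scutil --set` (an assumption; check the script for the real steps), the commands for a given slave could be generated like this. `scutil_commands` is a hypothetical helper and only builds the commands, it does not run them.

```python
def scutil_commands(short_name):
    """Build (but do not run) scutil commands that reset a Mac's names.

    Assumption: the same short name is used for all three keys.
    """
    return [
        ["sudo", "scutil", "--set", key, short_name]
        for key in ("ComputerName", "LocalHostName", "HostName")
    ]


for cmd in scutil_commands("talos-r4-snow-029"):
    print(" ".join(cmd))
```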
Flags: needinfo?(coop)
Product: mozilla.org → Release Engineering
RyanVM says he hasn't seen this bug in ages.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Component: General Automation → General