Closed Bug 793358 Opened 12 years ago Closed 12 years ago

Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" or "verify.py failed" ending in a "process killed by signal 15"

Categories

(Release Engineering :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: philor, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [testing][mobile][android_tier_1][android_tier_∞])

+++ This bug was initially created as a clone of Bug #660480 +++ Because 2000 comments is too many to load dozens of times every day. https://tbpl.mozilla.org/php/getParsedLog.php?id=15415619&tree=Profiling tegra-073
A lot of disconnections on the fly. My first feeling was that we have overloaded foopies but the logs are from a Sunday which is lower load than during a weekeday. cpu_wio shows nothing interesting for all of the Linux foopies. http://ganglia1.build.scl1.mozilla.com/ganglia/?r=day&cs=&ce=&m=cpu_wio&s=by+name&c=RelEngSCL1&h=&host_regex=foopy&max_graphs=0&tab=m&vn=&sh=1&z=large&hc=2 load_one does not show as percentages which is weird to read: http://ganglia1.build.scl1.mozilla.com/ganglia/?r=day&cs=&ce=&m=load_one&s=by+name&c=RelEngSCL1&h=&host_regex=foopy&max_graphs=0&tab=m&vn=&sh=1&z=large&hc=2 Loading an individual foopy gives easier graphs to read: http://ganglia1.build.scl1.mozilla.com/ganglia/?r=week&cs=&ce=&c=RelEngSCL1&h=foopy34.build.scl1.mozilla.com&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS I have always looked for cpu_wio to recognize problems on our buildbot masters but I can't tell if it is the same for the foopies. FTR I am not looking at this bug any further as I don't want to be derailed and don't have any ideas on what to try. ###################### When I look at a single tegra I can see this: (from foopy07 - *2 failed jobs instead of 1 showing on log*) 2012-09-23 19:24:08-0700 [Broker,client] startCommand:shell [id 395830] 2012-09-23 19:24:08-0700 [Broker,client] RunProcess._startCommand 2012-09-23 19:24:08-0700 [Broker,client] python /builds/sut_tools/verify.py 2012-09-23 19:24:08-0700 [Broker,client] in dir /builds/tegra-078/test/build (timeout 1200 secs) 2012-09-23 19:24:08-0700 [Broker,client] watching logfiles {} 2012-09-23 19:24:08-0700 [Broker,client] argv: ['python', '/builds/sut_tools/verify.py'] 2012-09-23 19:24:08-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test/build', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'} 2012-09-23 19:24:08-0700 [Broker,client] closing stdin 2012-09-23 19:24:08-0700 [Broker,client] using PTY: False 2012-09-23 19:29:14-0700 [-] sending app-level keepalive 2012-09-23 19:39:14-0700 [-] sending app-level keepalive 2012-09-23 19:44:08-0700 [-] command timed out: 1200 seconds without output, killing pid 8178 2012-09-23 19:44:08-0700 [-] trying os.kill(-pid, 9) 2012-09-23 19:44:08-0700 [-] trying process.signalProcess('KILL') 2012-09-23 19:44:08-0700 [-] signal KILL sent successfully 2012-09-23 19:44:08-0700 [-] command finished with signal 9, exit code None, elapsedTime: 1200.008209 2012-09-23 19:44:08-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101e545f0> 2012-09-23 19:44:08-0700 [Broker,client] startCommand:shell [id 396717] 2012-09-23 19:44:08-0700 [Broker,client] RunProcess._startCommand 2012-09-23 19:44:08-0700 [Broker,client] python /builds/sut_tools/reboot.py 10.250.49.66 2012-09-23 19:44:08-0700 [Broker,client] in dir /builds/tegra-078/test/. (timeout 1800 secs) 2012-09-23 19:44:08-0700 [Broker,client] watching logfiles {} 2012-09-23 19:44:08-0700 [Broker,client] argv: ['python', '/builds/sut_tools/reboot.py', '10.250.49.66'] 2012-09-23 19:44:08-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'} 2012-09-23 19:44:08-0700 [Broker,client] closing stdin 2012-09-23 19:44:08-0700 [Broker,client] using PTY: False 2012-09-23 19:49:14-0700 [-] sending app-level keepalive 2012-09-23 19:59:14-0700 [-] sending app-level keepalive 2012-09-23 20:09:14-0700 [-] sending app-level keepalive 2012-09-23 20:14:08-0700 [-] command timed out: 1800 seconds without output, killing pid 8731 2012-09-23 20:14:08-0700 [-] trying os.kill(-pid, 9) 2012-09-23 20:14:08-0700 [-] trying process.signalProcess('KILL') 2012-09-23 20:14:08-0700 [-] signal KILL sent successfully 2012-09-23 20:14:08-0700 [-] command finished with signal 9, exit code None, elapsedTime: 1800.018783 2012-09-23 20:14:08-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101e54758> 2012-09-23 20:19:14-0700 [-] sending app-level keepalive 2012-09-23 20:19:33-0700 [Broker,client] SlaveBuilder.remote_print(Android Tegra 250 mozilla-inbound talos remote-trobocheck3): message from master: ping 2012-09-23 20:19:33-0700 [Broker,client] startCommand:shell [id 397528] 2012-09-23 20:19:33-0700 [Broker,client] RunProcess._startCommand 2012-09-23 20:19:33-0700 [Broker,client] bash -c echo $SUT_IP 2012-09-23 20:19:33-0700 [Broker,client] in dir /builds/tegra-078/test/build (timeout 1200 secs) 2012-09-23 20:19:33-0700 [Broker,client] watching logfiles {} 2012-09-23 20:19:33-0700 [Broker,client] argv: ['bash', '-c', 'echo $SUT_IP'] 2012-09-23 20:19:33-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test/build', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'} 2012-09-23 20:19:33-0700 [Broker,client] closing stdin 2012-09-23 20:19:33-0700 [Broker,client] using PTY: False 2012-09-23 20:19:33-0700 [-] command finished with signal None, exit code 0, elapsedTime: 0.016709 2012-09-23 20:19:33-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101b97440> 2012-09-23 20:19:34-0700 [Broker,client] startCommand:shell [id 397529] 2012-09-23 20:19:34-0700 [Broker,client] RunProcess._startCommand 2012-09-23 20:19:34-0700 [Broker,client] python /builds/sut_tools/verify.py 2012-09-23 20:19:34-0700 [Broker,client] in dir /builds/tegra-078/test/build (timeout 1200 secs) 2012-09-23 20:19:34-0700 [Broker,client] watching logfiles {} 2012-09-23 20:19:34-0700 [Broker,client] argv: ['python', '/builds/sut_tools/verify.py'] 2012-09-23 20:19:34-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test/build', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'} 2012-09-23 20:19:34-0700 [Broker,client] closing stdin 2012-09-23 20:19:34-0700 [Broker,client] using PTY: False 2012-09-23 20:20:36-0700 [-] command finished with signal 15, exit code None, elapsedTime: 62.217295 2012-09-23 20:20:36-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101b97440> 2012-09-23 20:20:36-0700 [Broker,client] startCommand:shell [id 397711] 2012-09-23 20:20:36-0700 [Broker,client] RunProcess._startCommand 2012-09-23 20:20:36-0700 [Broker,client] python /builds/sut_tools/reboot.py 10.250.49.66 2012-09-23 20:20:36-0700 [Broker,client] in dir /builds/tegra-078/test/../talos-data (timeout 1800 secs) 2012-09-23 20:20:36-0700 [Broker,client] watching logfiles {} 2012-09-23 20:20:36-0700 [Broker,client] argv: ['python', '/builds/sut_tools/reboot.py', '10.250.49.66'] 2012-09-23 20:20:36-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/talos-data', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'} 2012-09-23 20:20:36-0700 [Broker,client] closing stdin 2012-09-23 20:20:36-0700 [Broker,client] using PTY: False 2012-09-23 20:20:37-0700 [-] Received SIGTERM, shutting down. 2012-09-23 20:20:37-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x101e34098> 2012-09-23 20:20:37-0700 [-] command interrupted, killing pid 9535 2012-09-23 20:20:37-0700 [-] trying os.kill(-pid, 9) 2012-09-23 20:20:37-0700 [-] trying process.signalProcess('KILL') 2012-09-23 20:20:37-0700 [-] signal KILL sent successfully 2012-09-23 20:20:37-0700 [Broker,client] lost remote 2012-09-23 20:20:37-0700 [Broker,client] lost remote (from buildbot-master19) 2012-09-23 20:20:36-0700 [Broker,483,10.250.48.208] <RemoteShellCommand '['python', '/builds/sut_tools/verify.py']'> rc=-1 2012-09-23 20:20:36-0700 [-] closing log <buildbot.status.builder.LogFile instance at 0x3c7a2cf8> 2012-09-23 20:20:36-0700 [-] releaseLocks(<buildbotcustom.steps.base.C instance at 0x2aaabb9976c8>): [] 2012-09-23 20:20:36-0700 [-] step 'verify_tegra_state' complete: failure 2012-09-23 20:20:36-0700 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']'>) 2012-09-23 20:20:36-0700 [-] cmd.args = {'workdir': '../talos-data', 'timeout': 1800, 'env': {}, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'logEnviron': True, 'want_stderr': 1, 'logfiles': {}} 2012-09-23 20:20:36-0700 [-] <RemoteShellCommand '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']'>: RemoteCommand.run [397711] 2012-09-23 20:20:36-0700 [-] command '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']' in dir '../talos-data' 2012-09-23 20:20:36-0700 [-] LoggedRemoteCommand.start 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] BuildSlave.detached(tegra-078) 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] <Build Android Tegra 250 mozilla-inbound talos remote-trobocheck3>.lostRemote 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] stopping currentStep <buildbotcustom.steps.base.C instance at 0x2c43c170> 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] addCompleteLog(interrupt) 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] RemoteCommand.interrupt <RemoteShellCommand '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']'> [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] RemoteCommand.disconnect: lost slave 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] releaseLocks(<buildbotcustom.steps.base.C instance at 0x2c43c170>): [] 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] step 'reboot device' complete: exception 2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] <Build Android Tegra 250 mozilla-inbound talos remote-trobocheck3>: build finished ###################### build/tests/reftes remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] ######################
> 2012-09-23 20:19:34-0700 [Broker,client] argv: ['python', > '/builds/sut_tools/verify.py'] This could be the root reason. verify.py when something fails sets error.flg, which clientproxy kills off. We don't [yet] reliably flush stdout during this step, so buildbot may see a timeout due to that (python buffers stdout by default when no tty) And there are many reasons verify.py can fail, legitimately. And it would cause a force-kill of buildbot during or after verify.py runs.
https://tbpl.mozilla.org/php/getParsedLog.php?id=15496392&tree=Mozilla-Aurora tegra-150 (yeah, it's busted, but I get very little satisfaction from filing on the "two out of three are purple" bustages)
https://tbpl.mozilla.org/php/getParsedLog.php?id=16206518&tree=Mozilla-Beta tegra-250 (And a whole bunch on tegra-3xx since they got stirred up on Monday, but they don't count, they're garbage.)
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.