Closed
Bug 793358
Opened 12 years ago
Closed 12 years ago
Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" or "verify.py failed" ending in a "process killed by signal 15"
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
WORKSFORME
People
(Reporter: philor, Unassigned)
References
Details
(Keywords: intermittent-failure, Whiteboard: [testing][mobile][android_tier_1][android_tier_∞])
+++ This bug was initially created as a clone of Bug #660480 +++
Because 2000 comments is too many to load dozens of times every day.
https://tbpl.mozilla.org/php/getParsedLog.php?id=15415619&tree=Profiling
tegra-073
Reporter | ||
Comment 1•12 years ago
|
||
Reporter | ||
Comment 2•12 years ago
|
||
Reporter | ||
Comment 3•12 years ago
|
||
Reporter | ||
Comment 4•12 years ago
|
||
Reporter | ||
Comment 5•12 years ago
|
||
Reporter | ||
Comment 6•12 years ago
|
||
Reporter | ||
Comment 7•12 years ago
|
||
Reporter | ||
Comment 8•12 years ago
|
||
Reporter | ||
Comment 9•12 years ago
|
||
Reporter | ||
Comment 10•12 years ago
|
||
Reporter | ||
Comment 11•12 years ago
|
||
Reporter | ||
Comment 12•12 years ago
|
||
Reporter | ||
Comment 13•12 years ago
|
||
Comment 14•12 years ago
|
||
A lot of disconnections on the fly.
My first feeling was that we have overloaded foopies but the logs are from a Sunday which is lower load than during a weekeday.
cpu_wio shows nothing interesting for all of the Linux foopies.
http://ganglia1.build.scl1.mozilla.com/ganglia/?r=day&cs=&ce=&m=cpu_wio&s=by+name&c=RelEngSCL1&h=&host_regex=foopy&max_graphs=0&tab=m&vn=&sh=1&z=large&hc=2
load_one does not show as percentages which is weird to read:
http://ganglia1.build.scl1.mozilla.com/ganglia/?r=day&cs=&ce=&m=load_one&s=by+name&c=RelEngSCL1&h=&host_regex=foopy&max_graphs=0&tab=m&vn=&sh=1&z=large&hc=2
Loading an individual foopy gives easier graphs to read:
http://ganglia1.build.scl1.mozilla.com/ganglia/?r=week&cs=&ce=&c=RelEngSCL1&h=foopy34.build.scl1.mozilla.com&tab=m&vn=&mc=2&z=medium&metric_group=ALLGROUPS
I have always looked for cpu_wio to recognize problems on our buildbot masters but I can't tell if it is the same for the foopies.
FTR I am not looking at this bug any further as I don't want to be derailed and don't have any ideas on what to try.
######################
When I look at a single tegra I can see this:
(from foopy07 - *2 failed jobs instead of 1 showing on log*)
2012-09-23 19:24:08-0700 [Broker,client] startCommand:shell [id 395830]
2012-09-23 19:24:08-0700 [Broker,client] RunProcess._startCommand
2012-09-23 19:24:08-0700 [Broker,client] python /builds/sut_tools/verify.py
2012-09-23 19:24:08-0700 [Broker,client] in dir /builds/tegra-078/test/build (timeout 1200 secs)
2012-09-23 19:24:08-0700 [Broker,client] watching logfiles {}
2012-09-23 19:24:08-0700 [Broker,client] argv: ['python', '/builds/sut_tools/verify.py']
2012-09-23 19:24:08-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test/build', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'}
2012-09-23 19:24:08-0700 [Broker,client] closing stdin
2012-09-23 19:24:08-0700 [Broker,client] using PTY: False
2012-09-23 19:29:14-0700 [-] sending app-level keepalive
2012-09-23 19:39:14-0700 [-] sending app-level keepalive
2012-09-23 19:44:08-0700 [-] command timed out: 1200 seconds without output, killing pid 8178
2012-09-23 19:44:08-0700 [-] trying os.kill(-pid, 9)
2012-09-23 19:44:08-0700 [-] trying process.signalProcess('KILL')
2012-09-23 19:44:08-0700 [-] signal KILL sent successfully
2012-09-23 19:44:08-0700 [-] command finished with signal 9, exit code None, elapsedTime: 1200.008209
2012-09-23 19:44:08-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101e545f0>
2012-09-23 19:44:08-0700 [Broker,client] startCommand:shell [id 396717]
2012-09-23 19:44:08-0700 [Broker,client] RunProcess._startCommand
2012-09-23 19:44:08-0700 [Broker,client] python /builds/sut_tools/reboot.py 10.250.49.66
2012-09-23 19:44:08-0700 [Broker,client] in dir /builds/tegra-078/test/. (timeout 1800 secs)
2012-09-23 19:44:08-0700 [Broker,client] watching logfiles {}
2012-09-23 19:44:08-0700 [Broker,client] argv: ['python', '/builds/sut_tools/reboot.py', '10.250.49.66']
2012-09-23 19:44:08-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'}
2012-09-23 19:44:08-0700 [Broker,client] closing stdin
2012-09-23 19:44:08-0700 [Broker,client] using PTY: False
2012-09-23 19:49:14-0700 [-] sending app-level keepalive
2012-09-23 19:59:14-0700 [-] sending app-level keepalive
2012-09-23 20:09:14-0700 [-] sending app-level keepalive
2012-09-23 20:14:08-0700 [-] command timed out: 1800 seconds without output, killing pid 8731
2012-09-23 20:14:08-0700 [-] trying os.kill(-pid, 9)
2012-09-23 20:14:08-0700 [-] trying process.signalProcess('KILL')
2012-09-23 20:14:08-0700 [-] signal KILL sent successfully
2012-09-23 20:14:08-0700 [-] command finished with signal 9, exit code None, elapsedTime: 1800.018783
2012-09-23 20:14:08-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101e54758>
2012-09-23 20:19:14-0700 [-] sending app-level keepalive
2012-09-23 20:19:33-0700 [Broker,client] SlaveBuilder.remote_print(Android Tegra 250 mozilla-inbound talos remote-trobocheck3): message from master: ping
2012-09-23 20:19:33-0700 [Broker,client] startCommand:shell [id 397528]
2012-09-23 20:19:33-0700 [Broker,client] RunProcess._startCommand
2012-09-23 20:19:33-0700 [Broker,client] bash -c echo $SUT_IP
2012-09-23 20:19:33-0700 [Broker,client] in dir /builds/tegra-078/test/build (timeout 1200 secs)
2012-09-23 20:19:33-0700 [Broker,client] watching logfiles {}
2012-09-23 20:19:33-0700 [Broker,client] argv: ['bash', '-c', 'echo $SUT_IP']
2012-09-23 20:19:33-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test/build', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'}
2012-09-23 20:19:33-0700 [Broker,client] closing stdin
2012-09-23 20:19:33-0700 [Broker,client] using PTY: False
2012-09-23 20:19:33-0700 [-] command finished with signal None, exit code 0, elapsedTime: 0.016709
2012-09-23 20:19:33-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101b97440>
2012-09-23 20:19:34-0700 [Broker,client] startCommand:shell [id 397529]
2012-09-23 20:19:34-0700 [Broker,client] RunProcess._startCommand
2012-09-23 20:19:34-0700 [Broker,client] python /builds/sut_tools/verify.py
2012-09-23 20:19:34-0700 [Broker,client] in dir /builds/tegra-078/test/build (timeout 1200 secs)
2012-09-23 20:19:34-0700 [Broker,client] watching logfiles {}
2012-09-23 20:19:34-0700 [Broker,client] argv: ['python', '/builds/sut_tools/verify.py']
2012-09-23 20:19:34-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/test/build', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'}
2012-09-23 20:19:34-0700 [Broker,client] closing stdin
2012-09-23 20:19:34-0700 [Broker,client] using PTY: False
2012-09-23 20:20:36-0700 [-] command finished with signal 15, exit code None, elapsedTime: 62.217295
2012-09-23 20:20:36-0700 [-] SlaveBuilder.commandComplete <buildslave.commands.shell.SlaveShellCommand instance at 0x101b97440>
2012-09-23 20:20:36-0700 [Broker,client] startCommand:shell [id 397711]
2012-09-23 20:20:36-0700 [Broker,client] RunProcess._startCommand
2012-09-23 20:20:36-0700 [Broker,client] python /builds/sut_tools/reboot.py 10.250.49.66
2012-09-23 20:20:36-0700 [Broker,client] in dir /builds/tegra-078/test/../talos-data (timeout 1800 secs)
2012-09-23 20:20:36-0700 [Broker,client] watching logfiles {}
2012-09-23 20:20:36-0700 [Broker,client] argv: ['python', '/builds/sut_tools/reboot.py', '10.250.49.66']
2012-09-23 20:20:36-0700 [Broker,client] environment: {'PWD': '/builds/tegra-078/talos-data', '__CF_USER_TEXT_ENCODING': '0x1F6:0:0', 'SUT_NAME': 'tegra-078', 'HOME': '/Users/cltbld', 'PATH': '/opt/local/bin:/opt/local/sbin:/opt/local/Library/Frameworks/Python.framework/Versions/2.6/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin', 'SUT_IP': '10.250.49.66'}
2012-09-23 20:20:36-0700 [Broker,client] closing stdin
2012-09-23 20:20:36-0700 [Broker,client] using PTY: False
2012-09-23 20:20:37-0700 [-] Received SIGTERM, shutting down.
2012-09-23 20:20:37-0700 [-] stopCommand: halting current command <buildslave.commands.shell.SlaveShellCommand instance at 0x101e34098>
2012-09-23 20:20:37-0700 [-] command interrupted, killing pid 9535
2012-09-23 20:20:37-0700 [-] trying os.kill(-pid, 9)
2012-09-23 20:20:37-0700 [-] trying process.signalProcess('KILL')
2012-09-23 20:20:37-0700 [-] signal KILL sent successfully
2012-09-23 20:20:37-0700 [Broker,client] lost remote
2012-09-23 20:20:37-0700 [Broker,client] lost remote
(from buildbot-master19)
2012-09-23 20:20:36-0700 [Broker,483,10.250.48.208] <RemoteShellCommand '['python', '/builds/sut_tools/verify.py']'> rc=-1
2012-09-23 20:20:36-0700 [-] closing log <buildbot.status.builder.LogFile instance at 0x3c7a2cf8>
2012-09-23 20:20:36-0700 [-] releaseLocks(<buildbotcustom.steps.base.C instance at 0x2aaabb9976c8>): []
2012-09-23 20:20:36-0700 [-] step 'verify_tegra_state' complete: failure
2012-09-23 20:20:36-0700 [-] ShellCommand.startCommand(cmd=<RemoteShellCommand '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']'>)
2012-09-23 20:20:36-0700 [-] cmd.args = {'workdir': '../talos-data', 'timeout': 1800, 'env': {}, 'want_stdout': 1, 'usePTY': 'slave-config', 'maxTime': None, 'logEnviron': True, 'want_stderr': 1, 'logfiles': {}}
2012-09-23 20:20:36-0700 [-] <RemoteShellCommand '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']'>: RemoteCommand.run [397711]
2012-09-23 20:20:36-0700 [-] command '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']' in dir '../talos-data'
2012-09-23 20:20:36-0700 [-] LoggedRemoteCommand.start
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] BuildSlave.detached(tegra-078)
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] <Build Android Tegra 250 mozilla-inbound talos remote-trobocheck3>.lostRemote
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] stopping currentStep <buildbotcustom.steps.base.C instance at 0x2c43c170>
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] addCompleteLog(interrupt)
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] RemoteCommand.interrupt <RemoteShellCommand '['python', '/builds/sut_tools/reboot.py', '10.250.49.66']'> [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] RemoteCommand.disconnect: lost slave
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] releaseLocks(<buildbotcustom.steps.base.C instance at 0x2c43c170>): []
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] step 'reboot device' complete: exception
2012-09-23 20:20:37-0700 [Broker,483,10.250.48.208] <Build Android Tegra 250 mozilla-inbound talos remote-trobocheck3>: build finished
######################
build/tests/reftes
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
######################
Reporter | ||
Comment 15•12 years ago
|
||
Reporter | ||
Comment 16•12 years ago
|
||
Reporter | ||
Comment 17•12 years ago
|
||
Reporter | ||
Comment 18•12 years ago
|
||
Comment 19•12 years ago
|
||
> 2012-09-23 20:19:34-0700 [Broker,client] argv: ['python',
> '/builds/sut_tools/verify.py']
This could be the root reason. verify.py when something fails sets error.flg, which clientproxy kills off. We don't [yet] reliably flush stdout during this step, so buildbot may see a timeout due to that (python buffers stdout by default when no tty)
And there are many reasons verify.py can fail, legitimately. And it would cause a force-kill of buildbot during or after verify.py runs.
Reporter | ||
Comment 20•12 years ago
|
||
Reporter | ||
Comment 21•12 years ago
|
||
Reporter | ||
Comment 22•12 years ago
|
||
Reporter | ||
Comment 23•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=15496392&tree=Mozilla-Aurora
tegra-150 (yeah, it's busted, but I get very little satisfaction from filing on the "two out of three are purple" bustages)
Reporter | ||
Comment 24•12 years ago
|
||
Reporter | ||
Comment 25•12 years ago
|
||
Reporter | ||
Comment 26•12 years ago
|
||
Reporter | ||
Comment 27•12 years ago
|
||
Reporter | ||
Comment 28•12 years ago
|
||
Reporter | ||
Comment 29•12 years ago
|
||
Reporter | ||
Comment 30•12 years ago
|
||
Reporter | ||
Comment 31•12 years ago
|
||
Reporter | ||
Comment 32•12 years ago
|
||
Reporter | ||
Comment 33•12 years ago
|
||
Reporter | ||
Comment 34•12 years ago
|
||
Reporter | ||
Comment 35•12 years ago
|
||
Reporter | ||
Comment 36•12 years ago
|
||
Reporter | ||
Comment 37•12 years ago
|
||
Reporter | ||
Comment 38•12 years ago
|
||
Reporter | ||
Comment 39•12 years ago
|
||
Comment 40•12 years ago
|
||
Reporter | ||
Comment 41•12 years ago
|
||
Reporter | ||
Comment 42•12 years ago
|
||
Reporter | ||
Comment 43•12 years ago
|
||
Reporter | ||
Comment 44•12 years ago
|
||
Reporter | ||
Comment 45•12 years ago
|
||
Reporter | ||
Comment 46•12 years ago
|
||
Reporter | ||
Comment 47•12 years ago
|
||
Reporter | ||
Comment 48•12 years ago
|
||
Reporter | ||
Comment 49•12 years ago
|
||
Reporter | ||
Comment 50•12 years ago
|
||
Reporter | ||
Comment 51•12 years ago
|
||
Reporter | ||
Comment 52•12 years ago
|
||
Reporter | ||
Comment 53•12 years ago
|
||
Reporter | ||
Comment 54•12 years ago
|
||
Reporter | ||
Comment 55•12 years ago
|
||
Reporter | ||
Comment 56•12 years ago
|
||
Reporter | ||
Comment 57•12 years ago
|
||
Reporter | ||
Comment 58•12 years ago
|
||
https://tbpl.mozilla.org/php/getParsedLog.php?id=16206518&tree=Mozilla-Beta
tegra-250
(And a whole bunch on tegra-3xx since they got stirred up on Monday, but they don't count, they're garbage.)
Reporter | ||
Comment 59•12 years ago
|
||
Reporter | ||
Comment 60•12 years ago
|
||
Reporter | ||
Comment 61•12 years ago
|
||
Reporter | ||
Comment 62•12 years ago
|
||
Reporter | ||
Comment 63•12 years ago
|
||
Reporter | ||
Comment 64•12 years ago
|
||
Reporter | ||
Comment 65•12 years ago
|
||
Reporter | ||
Comment 66•12 years ago
|
||
Updated•12 years ago
|
Keywords: intermittent-failure
Comment 67•12 years ago
|
||
Reporter | ||
Updated•12 years ago
|
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
Assignee | ||
Updated•11 years ago
|
Product: mozilla.org → Release Engineering
You need to log in
before you can comment on or make changes to this bug.
Description
•