Closed Bug 660480 Opened 14 years ago Closed 12 years ago

Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" or "verify.py failed" ending in a "process killed by signal 15"

Categories

(Release Engineering :: General, defect, P3)

ARM
Android
defect

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: philor, Unassigned)

Details

(Whiteboard: [testing][mobile][android_tier_1][android_tier_∞])

Attachments

(1 file, 3 obsolete files)

Rather quietly in http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306616745.1306617247.17001.gz python ../../sut_tools/cleanup.py 10.250.49.29 {env} process killed by signal 15 program finished with exit code -1 and a bit more noisily (and thus possibly a separate thing) in http://tinderbox.mozilla.org/showlog.cgi?log=Mobile/1306585906.1306586450.32029.gz python ../../sut_tools/cleanup.py 10.250.49.18 {env} ... send cmd: rmdr /mnt/sdcard/tests recv'ing... response: Deleting file(s) from /mnt/sdcard/tests Deleted browser_output.txt ... $> send cmd: uninst org.mozilla.fennec uninstallAppAndReboot: True Waiting for tegra to come back... Try 1 reconnecting socket unable to connect socket reconnecting socket unable to connect socket reconnecting socket unable to connect socket reconnecting socket unable to connect socket reconnecting socket unable to connect socket devroot None Try 2 reconnecting socket unable to connect socket devroot None Try 3 reconnecting socket unable to connecprocess killed by signal 15 program finished with exit code -1
http://tinderbox.mozilla.org/showlog.cgi?log=Try/1306635706.1306636210.12483.gz Android Tegra 250 try talos remote-tzoom on 2011/05/28 19:21:46 s: tegra-077 send cmd: uninst org.mozilla.fennec uninstallAppAndReboot: True Waiting for tegra to come back... Try 1 reconnecting socket unable to connect socket reconnecting sockprocess killed by signal 15 program finished with exit code -1
Summary: Intermittent jsreftest-1 "Cleanup Device failed" → Intermittent tegra "Cleanup Device failed"
Bear, Aki - any thoughts here?
Priority: -- → P5
These happen when a Tegra has decided to wander off and stare at the stars for a couple cycles. We need to change the error message so this triggers a purple/retry so Philor can know it's a device issue and not a code issue.
Rumor has it that people are starting to miss my constant bugspam, and want an end to the "i;r" starring. http://tbpl.allizom.org/php/getParsedLog.php?id=6113194 http://tbpl.allizom.org/php/getParsedLog.php?id=6113197
Whatever status that is that buildbot-based tbpl shows as blue, plus an automatic retrigger, would be super-awesome. Offhand, I'd guess this is at least two thirds of the explosions of failure where a single run has ten or fifteen or twenty failures by the time I get done retriggering enough to get everything run. http://tbpl.allizom.org/php/getParsedLog.php?id=6113406 Android Tegra 250 mozilla-central talos remote-tp4m_nochrome on 2011-08-24 19:54:37 PDT python run_tests.py --noisy local.yml in dir /builds/tegra-065/test/../talos-data/talos/ (timeout 21600 secs) ... reconnecting socket RETURN:s: tegra-065 RETURN:id:20110824171455 RETURN:<a href = "http://hg.mozilla.org/mozilla-central/rev/e58e98a89827">rev:e58e98a89827</a> tegra-065: Started Wed, 24 Aug 2011 19:57:52 Running test tp4m: Started Wed, 24 Aug 2011 19:57:52 process killed by signal 15 program finished with exit code -1
Summary: Intermittent tegra "Cleanup Device failed" → Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" ending in a "process killed by signal 15"
(In reply to John O'Duinn [:joduinn] from comment #117) > Is this a DUP of bug#681861 ?
Whiteboard: [android_tier_1]
No.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6363156 https://tbpl.mozilla.org/php/getParsedLog.php?id=6360965 https://tbpl.mozilla.org/php/getParsedLog.php?id=6361425 https://tbpl.mozilla.org/php/getParsedLog.php?id=6369812 https://tbpl.mozilla.org/php/getParsedLog.php?id=6363165 https://tbpl.mozilla.org/php/getParsedLog.php?id=6369814 https://tbpl.mozilla.org/php/getParsedLog.php?id=6360961 https://tbpl.mozilla.org/php/getParsedLog.php?id=6370084 https://tbpl.mozilla.org/php/getParsedLog.php?id=6360958 https://tbpl.mozilla.org/php/getParsedLog.php?id=6360966 https://tbpl.mozilla.org/php/getParsedLog.php?id=6361344 https://tbpl.mozilla.org/php/getParsedLog.php?id=6361417 https://tbpl.mozilla.org/php/getParsedLog.php?id=6361416 https://tbpl.mozilla.org/php/getParsedLog.php?id=6361433 https://tbpl.mozilla.org/php/getParsedLog.php?id=6361429 https://tbpl.mozilla.org/php/getParsedLog.php?id=6363254 https://tbpl.mozilla.org/php/getParsedLog.php?id=6369679 https://tbpl.mozilla.org/php/getParsedLog.php?id=6369685 https://tbpl.mozilla.org/php/getParsedLog.php?id=6369816 https://tbpl.mozilla.org/php/getParsedLog.php?id=6369811 https://tbpl.mozilla.org/php/getParsedLog.php?id=6352832 https://tbpl.mozilla.org/php/getParsedLog.php?id=6352311 https://tbpl.mozilla.org/php/getParsedLog.php?id=6352397 https://tbpl.mozilla.org/php/getParsedLog.php?id=6352473 https://tbpl.mozilla.org/php/getParsedLog.php?id=6352477 https://tbpl.mozilla.org/php/getParsedLog.php?id=6352967 https://tbpl.mozilla.org/php/getParsedLog.php?id=6350971 https://tbpl.mozilla.org/php/getParsedLog.php?id=6350972
https://tbpl.mozilla.org/php/getParsedLog.php?id=6392546&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6392618&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6392723&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6393728&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396167&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396299&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395851&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395839&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6394879&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396730&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395056&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6394915&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6394917&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6394990&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396851&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6397683&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6397691&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396858&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396362&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396720&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395130&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395262&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396311&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395408&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395251&tree=Mozilla-Inbound 
https://tbpl.mozilla.org/php/getParsedLog.php?id=6395120&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6395416&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396298&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396859&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6397790&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6397853&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6396310&tree=Mozilla-Inbound Thought experiment: how many of these per push constitutes unacceptable? Clearly, if every push was 23/23 these, we would hide them on tbpl until they were fixed or shut off. Suppose it was 22/23 these? 17/23 these, and 4/23 bug 681855?
https://tbpl.mozilla.org/php/getParsedLog.php?id=6408301&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6408304&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6408299&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6410796&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6410804&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6410801&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6410451&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6410802&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6408373&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6410800&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6411824&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6412220&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6412221&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6412231&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412143&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412150&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412232&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412029&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412026&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412134&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412145&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6412144&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6409055&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6409062&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6409059&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6409061&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6409889&tree=Firefox 
https://tbpl.mozilla.org/php/getParsedLog.php?id=6409979&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6409974&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6412686&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6412616&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413091&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413092&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413345&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413444&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6413439&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6405728&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410645&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6405551&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410647&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6405721&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6405853&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6405556&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410510&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410443&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410450&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410504&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6410506&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6413628&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6413623&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406608&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406606&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406697&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6414099&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6406412&tree=Mozilla-Beta 
https://tbpl.mozilla.org/php/getParsedLog.php?id=6406513&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6414204&tree=Mozilla-Beta https://tbpl.mozilla.org/php/getParsedLog.php?id=6414094&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6414208&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6414346&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6413968&tree=Mozilla-Aurora https://tbpl.mozilla.org/php/getParsedLog.php?id=6413978&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413846&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413835&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413715&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413721&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6413708&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6414812&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6414808&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6415959&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416073&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6415835&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416074&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6415837&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6415841&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6415822&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416428&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416425&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416913&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416908&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416909&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416910&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6416903&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6417023&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6417024&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6417361&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6417362&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6417364&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6417021&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6417776&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6417779&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6417867&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6417778&tree=Firefox
https://tbpl.mozilla.org/php/getParsedLog.php?id=6424492&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6424487&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6424632&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6424490&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6424620&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6425099&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6425101&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6425244&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6421379&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6421416&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6421375&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6423360&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6423747&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6423749&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6421488&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6423909&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6423824&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6422870&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6425355&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6425353&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6425354&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6425414&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=6424498&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6423905&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6424062&tree=Firefox https://tbpl.mozilla.org/php/getParsedLog.php?id=6424056&tree=Firefox 
https://tbpl.mozilla.org/php/getParsedLog.php?id=6424201&tree=Firefox
Whiteboard: [android_tier_1]
From birch, a little different from the standard fare here: https://tbpl.mozilla.org/php/getParsedLog.php?id=7406754&tree=Birch
Summary: Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" ending in a "process killed by signal 15" → Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" ending in a "process killed by signal 15"
Is comment #95 still the correct path forward here, i.e. if the "python /builds/sut_tools/cleanup.py $IPADDR" fails, would retrying just that step likely fix the issue, or is that tegra hung at that point and we need to retry the entire job on a new tegra?
Priority: -- → P3
Whiteboard: [testing][mobile][android_tier_1]
Whiteboard: [testing][mobile][android_tier_1][android_tier_∞] → [testing][mobile][android_tier_1][android_tier_∞][triagefollowup]
Assignee: nobody → coop
Whiteboard: [testing][mobile][android_tier_1][android_tier_∞][triagefollowup] → [testing][mobile][android_tier_1][android_tier_∞]
Depends on: 690311
(In reply to Phil Ringnalda (:philor) from comment #582)
> All you have to do is add one per-step RETRY log filter, so I have something
> to copy-paste from. That's all. Just one.

It's slightly more involved than that. Some of these failing steps are not set up to RETRY automatically, so that functionality needs to be added as well.
Yeah, that's the perfect that is being allowed to be the enemy of the good.
Attached patch RETRY on common tegra errors (obsolete) (deleted) — Splinter Review
Tagging Ben for review based on the provenance of most of the retry code. This patch does the following:
* adds a class of tegra-specific errors. We may not want to RETRY on all of these, but at least we'll have a list we/philor can augment.
* changes some test steps that are hitting these errors to use RetryingShellCommand. In some cases, this requires reorganizing how the classes are initialized. Specifically, the test command needs to be created/appended to before calling init in the super class.

(In reply to Phil Ringnalda (:philor) from comment #584)
> Yeah, that's the perfect that is being allowed to be the enemy of the good.

Going to ignore that.
Attachment #590316 - Flags: review?(bhearsum)
Comment on attachment 590316 [details] [diff] [review]
RETRY on common tegra errors

Review of attachment 590316 [details] [diff] [review]:
-----------------------------------------------------------------

::: steps/unittest.py
@@ +940,5 @@
> return evaluateRemoteMochitest(self.name, cmd.logs['stdio'].getText(),
> superResult)
>
>
> +class RemoteReftestStep(ReftestMixin, ChunkingMixin, RetryingShellCommand):

You're switching from a ShellCommandReportTime here, are we going to lose functionality because of that?
Attachment #590316 - Flags: review?(bhearsum) → review+
Comment on attachment 590316 [details] [diff] [review]
RETRY on common tegra errors

(In reply to Ben Hearsum [:bhearsum] from comment #586)
> You're switching from a ShellCommandReportTime here, are we going to lose
> functionality because of that?

My first version of the patch actually changed ShellCommandReportTimeout to use RetryingShellCommand, but I changed my mind due to the 3 levels of command report state checking involved. ShellCommandReportTimeout was implemented to surface timeout status to tinderbox. If it's still an issue in TBPL after this patch lands, I'll post a fix. Landing this will save philor some headaches in the meantime.

https://hg.mozilla.org/build/buildbotcustom/rev/053913150d1a
Attachment #590316 - Flags: checked-in+
Landed in production with tonight's reconfig.
Comment on attachment 590316 [details] [diff] [review]
RETRY on common tegra errors

http://hg.mozilla.org/build/buildbotcustom/rev/763366ebe6e3

Backed out for errors in these classes of builds
* talos - toolsdir property not found when we come to run the perf tests
* suspected android unit - jobs failing to run, 0 second runs that the master doesn't claim to know anything about
Attachment #590316 - Flags: checked-in+ → checked-in-
Depends on: 723006
The problem was that swapping from ShellCommand to RetryingShellCommand requires the toolsdir property to be set, and it's currently not set on talos and/or Android unit tests, so buildbot was barfing with the following traceback when it came to run the step:

Traceback (most recent call last):
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/buildstep.py", line 728, in startStep
    d.addCallback(self._startStep_2)
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/internet/defer.py", line 260, in addCallback
    callbackKeywords=kw)
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/internet/defer.py", line 249, in addCallbacks
    self._runCallbacks()
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/twisted/internet/defer.py", line 441, in _runCallbacks
    self.result = callback(self.result, *args, **kw)
--- <exception caught here> ---
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/buildstep.py", line 769, in _startStep_2
    skip = self.start()
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/steps/shell.py", line 212, in start
    command = properties.render(self.command)
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/properties.py", line 108, in render
    return [ self.render(e) for e in value ]
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/properties.py", line 106, in render
    return value.render(self.pmap)
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/properties.py", line 194, in render
    s = self.fmtstring % pmap
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/properties.py", line 169, in __getitem__
    rv = properties[key]
  File "/builds/buildbot/tests1-linux/lib/python2.6/site-packages/buildbot-0.8.2_hg_aeaa057e9df6_production_0.8-py2.6.egg/buildbot/process/properties.py", line 50, in __getitem__
    rv = self.properties[name][0]
exceptions.KeyError: 'toolsdir'
(In reply to Nick Thomas [:nthomas] from comment #589)
> Comment on attachment 590316 [details] [diff] [review]
> RETRY on common tegra errors
>
> http://hg.mozilla.org/build/buildbotcustom/rev/763366ebe6e3
>
> Backed out for errors in these classes of builds
> * talos - toolsdir property not found when we come to run the perf tests
> * suspected android unit - jobs failing to run, 0 second runs that the
> master doesn't claim to know anything about

OK, this is a mess. Non-Android test jobs (both talos and unittest) don't have an existing tools checkout to rely on. I *could* add the required steps to checkout the tools repo for test jobs, but I think that adds unnecessary overhead to a test process we're trying to slim down. The right fix is to make sure the tools repo is checked out on all the slaves at boot time (bug 712205), but that's a non-trivial amount of work.

We can still benefit in this bug by just marking the android cleanup steps to retry though. The foopys have a tools checkout we can use for retry-ing, and it will get rid of one class of existing errors. I'll post a revised patch shortly that limits the scope of the fix to that (for now).
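For readers skimming the thread: the `KeyError: 'toolsdir'` in the traceback above is plain Python %-formatting against a mapping that lacks the key. The following is only a minimal sketch of that mechanism, not buildbot's code; `PropertyMap`, `render`, and the `/builds/tools` path are hypothetical names chosen for illustration.

```python
class PropertyMap(dict):
    """Hypothetical stand-in for the mapping build properties are read from."""


def render(fmtstring, props):
    # Same operation as the `s = self.fmtstring % pmap` frame in the traceback.
    return fmtstring % props


# Assumed example values: a job with a tools checkout, and one without.
android_cleanup_props = PropertyMap(toolsdir="/builds/tools")
talos_props = PropertyMap()  # no tools checkout, so no toolsdir property

print(render("%(toolsdir)s/sut_tools/cleanup.py", android_cleanup_props))

try:
    render("%(toolsdir)s/sut_tools/cleanup.py", talos_props)
except KeyError as exc:
    # The step dies here, before the command ever runs.
    print("missing property:", exc)
```

This is why limiting the change to steps that already have the property set avoids the failure, at the cost of covering fewer steps.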
Depends on: 712205
Attached patch RETRY on tegra cleanup errors (obsolete) (deleted) — Splinter Review
(In reply to Chris Cooper [:coop] from comment #786)
> I'll post a revised patch shortly that limits the scope of the fix to that
> (for now).

...and here it is. It's a subset of the previous patch, really: it doesn't change the unittest or talos factories, and makes sure the toolsdir prop is set prior to android cleanup.
Attachment #590316 - Attachment is obsolete: true
Attachment #598551 - Flags: review?(bhearsum)
Attachment #598551 - Flags: review?(bhearsum) → review+
That doesn't seem to have gone too well, e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=9563430&tree=Mozilla-Beta retries cleanup.py five times, every time getting "Warning proxy.flg found during cleanup" and then a stream of reconnect/failed to connect, and then doesn't actually set RETRY in the end.
Actually, the comment 789 one set exception, but https://tbpl.mozilla.org/php/getParsedLog.php?id=9563847&tree=Mozilla-Beta set failed after five retries on a tegra that was long gone, and was red instead of RETRY blue.
https://tbpl.mozilla.org/php/getParsedLog.php?id=9569690&tree=Mozilla-Inbound with the cute "I want to pointlessly retry 5 times even though the tegra is refusing my connection attempts, but a signal 15 kill hit me while sleeping after the third, and here's an exception status rather than retry."
And as a sample of the breadth: https://tbpl.mozilla.org/php/getParsedLog.php?id=9569690&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566885&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566626&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9567086&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566785&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566936&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9567055&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566886&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566735&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9567364&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9567031&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9570117&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9570082&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566048&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9566370&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9565645&tree=Mozilla-Inbound https://tbpl.mozilla.org/php/getParsedLog.php?id=9568960&tree=Mozilla-Inbound Handy in a way, though, it's scooped up several other sorts of failure so now nearly every Android failure of any color is this, instead of just most of the purple :)
Attachment #598551 - Flags: checked-in+ → checked-in-
Assignee: coop → nobody
(In reply to Chris Cooper [:coop] from comment #585)
> * changes some test steps that are hitting these errors to use
> RetryingShellCommand. In some cases, this requires reorganizing how the
> classes are initialized. Specifically, the test command needs to be
> created/appended to before calling init in the super class

I still don't understand the why of this: was it necessary to have a log_eval_func, or just to see whether retrying five times would find success, before giving up on the run and setting RETRY to do it again on a less busted slave? Nothing I've ever heard about this class of failure (basically, all of them, since apparently nearly every Tegra failure is "yeah, that Tegra has wandered off to stare at the wall and maybe chew on some paint flakes") would lead me to think that it'll be better in a second or five or ten or thirty.

Muddling through the manual makes it look like you can have a log_eval_func on just a plain ShellCommand, so why don't we try the absolute simplest thing that might possibly work, two regexes for signal 15 and signal 9 on just the cleanup step, with no RetryingShellCommand, and no adding in flunkOnFailure, and then if that works build up from there?
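The "two regexes" proposal can be sketched roughly as follows. This illustrates the idea only and is not the patch attached to this bug: `tegra_kill_errors` and `evaluate_log` are hypothetical names, and the numeric status constants are assumed to mirror buildbot's result codes.

```python
import re

# Assumed to mirror buildbot's result constants; defined locally so the
# sketch runs without buildbot installed.
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

# Hypothetical (pattern, status) list for the cleanup step only.
tegra_kill_errors = (
    (re.compile(r"process killed by signal 15"), RETRY),
    (re.compile(r"process killed by signal 9"), RETRY),
)


def evaluate_log(log_text, exit_code, errors=tegra_kill_errors):
    """Return RETRY if any pattern matches, else judge by exit code alone."""
    for pattern, status in errors:
        if pattern.search(log_text):
            return status
    return SUCCESS if exit_code == 0 else FAILURE


sample = (
    "reconnecting socket\n"
    "unable to connect socket\n"
    "process killed by signal 15\n"
    "program finished with exit code -1\n"
)
print(evaluate_log(sample, -1) == RETRY)
```

Note that the sketch assumes the evaluator is handed text that actually contains the "process killed by" line; whether the real step log exposes that line to the evaluator is a separate question.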
Attached patch nearly as little as maybe possible (obsolete) (deleted) — Splinter Review
If I really believed myself about doing as little as possible, I'd only do one or the other, not both unittest and talos, but if it works and I only got RETRY on one, not both, I'd be sobbing until the next reconfig.
Assignee: nobody → philringnalda
Status: NEW → ASSIGNED
Attachment #600622 - Flags: review?(coop)
Attachment #600622 - Flags: review?(coop)
Attachment #600622 - Flags: review+
Attachment #600622 - Flags: checked-in+
Comment on attachment 600622 [details] [diff] [review] nearly as little as maybe possible Backed out in http://hg.mozilla.org/build/buildbotcustom/rev/5f1e278ddad0 for being wishful thinking. One part was clear to me once I stopped just hoping for magic: these patches have made things worse rather than better because once we add a log_eval_func, we no longer get the global_errors which include several things that we have to retry multiple times a day, so devicemanager.DMError and Remote Device Error within whichever steps we break turn from retry to red. The part I'm not quite clear about is the way we don't manage to set RETRY for things we do match, but I suspect the answer is down around http://hg.mozilla.org/build/buildbotcustom/annotate/1c11026036ff/process/factory.py#l8166 or a little less obviously around http://hg.mozilla.org/build/buildbotcustom/annotate/813db58c220f/steps/base.py#l13 - I get the impression that if your command sets a non-zero exit code, you have to defeat the attempt that regex_log_evaluator will make to set FAILURE based on that by having a worst_status() that will let your RETRY trump FAILURE. The thing I don't get is how the things that have a log_eval_func using upload_errors seem to be somehow getting around that (unless none of our upload errors set a non-zero exit code?).
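To make the suspicion about exit codes concrete, here is a small sketch of the difference between an evaluator that lets a non-zero exit code win and one that combines the two results through a ranking in which RETRY trumps FAILURE. It is an assumption-laden illustration, not the code at the linked revisions: the ranking order, `naive_evaluate`, and `ranked_evaluate` are hypothetical, and only the name `worst_status` is borrowed from the comment above.

```python
# Assumed to mirror buildbot's result constants.
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)

# Hypothetical ranking, most severe first, with RETRY outranking FAILURE.
SEVERITY = (RETRY, EXCEPTION, FAILURE, WARNINGS, SKIPPED, SUCCESS)


def worst_status(a, b):
    """Return whichever of the two statuses ranks as more severe."""
    for status in SEVERITY:
        if status in (a, b):
            return status
    raise ValueError("unknown status: %r, %r" % (a, b))


def naive_evaluate(log_status, exit_code):
    # A non-zero exit code overrides whatever the log patterns said.
    if exit_code != 0:
        return FAILURE
    return log_status


def ranked_evaluate(log_status, exit_code):
    # The exit code contributes a status, but the ranking decides the winner.
    exit_status = SUCCESS if exit_code == 0 else FAILURE
    return worst_status(log_status, exit_status)


# A killed cleanup step: the log pattern says RETRY, the exit code is -1.
print(naive_evaluate(RETRY, -1) == FAILURE)
print(ranked_evaluate(RETRY, -1) == RETRY)
```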
Attachment #600622 - Flags: checked-in+ → checked-in-
Attached patch even less (deleted) — Splinter Review
Well, I'm not proud. Or tired. This does two things:
* copies over the working things from global_errors, so we don't lose them and make things worse than they are now
* notices that the two regexes we were adding have something in common with the one thing in global_errors for the sake of Tegras which does *not* work, a thing those three do not have in common with the other two which do work: "process killed by signal 15", "process killed by signal 9" and bug 720073's "program finished with exit code 80" are all the full log line out to the end, while the working ones like "devicemanager.DMError" are matching log lines that have something else following that in the same line, like "devicemanager.DMError: unable to connect to 10.250.49.66 after 5 attempts".

I don't have any ghost of a hint of an idea why they wouldn't match the end of a line, but if I'm right and that's the problem, then we can put Smart People on the task of figuring out why.
Attachment #598551 - Attachment is obsolete: true
Attachment #600622 - Attachment is obsolete: true
Attachment #601523 - Flags: review?(coop)
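As a quick sanity check on the end-of-line theory, here is a small sketch using plain Python re. It assumes the evaluator does a substring-style search on each line, which is an assumption about its internals, and line_matches() is a hypothetical helper. Under that assumption a pattern matches equally well whether it is the whole line or is followed by more text.

```python
import re

# Patterns quoted in the comment above, used as plain substrings.
SIGNAL_15 = re.compile(r"process killed by signal 15")
DM_ERROR = re.compile(r"devicemanager\.DMError")


def line_matches(pattern, line):
    """True if the pattern occurs anywhere in the line, including at its end."""
    return pattern.search(line) is not None
```

Both line_matches(SIGNAL_15, "process killed by signal 15") and line_matches(DM_ERROR, "devicemanager.DMError: unable to connect to 10.250.49.66 after 5 attempts") come back true. So if the three regexes never fire, it may be worth checking which text the evaluator is actually handed before blaming the regex engine.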
Your recent landed changes went live around 7:30 AM PDT.
That one's better in that it didn't make things worse, but now I'm wildly guessing that cmd.logs.values() doesn't actually include the "process killed by" or "finished with exit code" lines.
Assignee: philringnalda → nobody
Status: ASSIGNED → NEW
<dustin> yeah, they're headers
<dustin> which are not included in stdio
<dustin> so you'd need to use a LogWatcher
<dustin> or modify the existing one to also scan headers
was where I decided someone else was going to be the one who fixed this.
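The point about headers can be sketched with a toy model of a step log. This is not the Buildbot LogFile API. The channel numbers, SAMPLE_CHUNKS, text_for() and found_kill() are all assumptions made up for illustration. It only shows why a scan limited to stdout and stderr never sees a line that lives in the header channel.

```python
import re

# Channel ids; assumed to resemble how a step log tags its chunks.
STDOUT, STDERR, HEADER = 0, 1, 2

# Hypothetical log: the kill message is written by the harness as a header,
# not by the command itself on stdout.
SAMPLE_CHUNKS = [
    (HEADER, "python ../../sut_tools/cleanup.py 10.250.49.29\n"),
    (STDOUT, "reconnecting socket\nunable to connect socket\n"),
    (HEADER, "process killed by signal 15\nprogram finished with exit code -1\n"),
]

KILLED = re.compile(r"process killed by signal 15")


def text_for(chunks, channels):
    """Concatenate the text of every chunk whose channel is in `channels`."""
    return "".join(text for channel, text in chunks if channel in channels)


def found_kill(chunks, channels):
    """True if the kill message appears in the selected channels."""
    return KILLED.search(text_for(chunks, channels)) is not None
```

found_kill(SAMPLE_CHUNKS, (STDOUT, STDERR)) is false, while adding HEADER to the channel tuple makes it true, which is the "also scan headers" option from the IRC snippet.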
Priority: P3 → --
Comment on attachment 601523 [details] [diff] [review] even less Backed out in 21e83d4b3843 since it was more trouble to turn it into something on another buildstep that could actually catch failures that exist in the log than just to start over from scratch.
Attachment #601523 - Flags: checked-in+ → checked-in-
Armen is going to take a stab at fixing bug 690311, but that's a couple of weeks out still.
Priority: -- → P3
That'll probably take a chunk out of this, but I'd expect that's mostly bug 686084, since this is not at all exclusively "the first time we looked for the tegra, it wasn't there," and a preclean that made sure the tegra wasn't going to just wander off in the middle of the job would have to be "are you a tegra?" "yes I am" "well, then the odds are fairly good that you'll wander off in the middle of a job." https://tbpl.mozilla.org/php/getParsedLog.php?id=9907218&tree=Mozilla-Inbound
I am adding cleanup at the end of the run hoping that it increases the likelihood that we are already clean at the beginning of the run. On staging I have been having very good results but that might be pure luck.
https://bug734221.bugzilla.mozilla.org/attachment.cgi?id=605549

Perhaps some of the issues we are seeing are due to bugs in devicemanager.py. I already found one last night.

###########
On the other hand, some of these logs point to the foopy-buildslave losing its connection and *not* the actual board.

remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.

###########
I can also see installApp failing even though DM returned a hash for the file.

python /builds/sut_tools/installApp.py 10.250.51.61 build/fennec-12.0.en-US.android-arm.apk org.mozilla.firefox_beta
in dir /builds/tegra-221/test/. (timeout 1200 secs)
watching logfiles {}
argv: ['python', '/builds/sut_tools/installApp.py', '10.250.51.61', u'build/fennec-12.0.en-US.android-arm.apk', 'org.mozilla.firefox_beta']
environment:
PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
PWD=/builds/tegra-221/test
SUT_IP=10.250.51.61
SUT_NAME=tegra-221
__CF_USER_TEXT_ENCODING=0x1F5:0:0
closing stdin
using PTY: False
copying build/fennec/application.ini to build/talos/remoteapp.ini
connecting to: 10.250.51.61
reconnecting socket
devroot /mnt/sdcard/tests
/builds/tegra-221/test/../proxy.flg 10.250.48.221 50221
Current device time is 2012/03/13 22:52:18
Setting device time to 2012/03/13 22:52:18
Current device time is 2012/03/13 22:52:18
results: {'process': [['1001', '1149', 'com.android.phone'], ['10007', '1140', 'com.android.inputmethod.latin'], ['1000', '1020', 'system'], ['10029', '1256', 'com.android.deskclock'], ['10034', '1683', 'org.mozilla.ffxcp'], ['10033', '1674', 'org.mozilla.fencp'], ['10031', '1355', 'com.mozilla.SUTAgentAndroid'], ['1000', '1169', 'com.android.settings'], ['10018', '1160', 'com.android.launcher'], ['10013', '1384', 'com.cooliris.media'], ['10004', '1282', 'android.process.media'], ['10009', '1366', 'com.android.quicksearchbox'], ['10002', '1376', 'com.android.music'], ['10032', '1345', 'com.mozilla.watcher'], ['10006', '1324', 'com.android.mms'], ['10010', '1307', 'com.android.providers.calendar'], ['10014', '1290', 'com.android.email'], ['10017', '1271', 'com.android.bluetooth'], ['10015', '1198', 'android.process.acore']]}
results: {'memory': ['PA:830160896']}
results: {'uptime': ['0 days 0 hours 8 minutes 11 seconds 270 ms']}
results: {'screen': ['X:1600 Y:1200']}
results: {'os': ['harmony-eng 2.2 FRF91 20110202.102810 test-keys']}
results: {'screen': ['X:1600 Y:1200']}
INFO: we have a current resolution of 1600, 1200
INFO: adjusting screen resolution to 1024, 768 and rebooting
calling reboot
DEBUG: gCallbackData is: on port: 50221
Creating server with 10.250.48.221:50221
Calling disconnect on callback server
. .
Got data back from agent: System rebooted
Shutting down server now
Waiting for tegra to come back...
Try 1
reconnecting socket
devroot /mnt/sdcard/tests
results: {'screen': ['X:1024 Y:768']}
Installing /mnt/sdcard/tests/fennec-12.0.en-US.android-arm.apk
in push file with: build/fennec-12.0.en-US.android-arm.apk, and: /mnt/sdcard/tests/fennec-12.0.en-US.android-arm.apk
local hash returned: '4051fe2d46e7af9099cd8d8ae5b44e80'
sending: push /mnt/sdcard/tests/fennec-12.0.en-US.android-arm.apk
process killed by signal 15
program finished with exit code -1
elapsedTime=272.153844

################################
I also see cases of cleanup.py failing.
Because Tegras are Tegras, first log I looked at post-updateSUT.py was https://tbpl.mozilla.org/php/getParsedLog.php?id=10126090&tree=Mozilla-Inbound
Summary: Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" ending in a "process killed by signal 15" → Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" ending in a "process killed by signal 15"
Summary: Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" ending in a "process killed by signal 15" → Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" ending in a "process killed by signal 15"
Summary: Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" ending in a "process killed by signal 15" → Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" ending in a "process killed by signal 15"
(In reply to :armenzg - Off and back on Wed. 25th from comment #1065)
> I am adding cleanup at the end of the run hoping that it increases the
> likelihood that we are already clean at the beginning of the run.

So, how can we evaluate the success or failure of that? I don't have any idea how to tell about the success, but I know about the failure because I see it all day long, as we hit this or bug 686084 during the second cleanup step, because Tegras have brutally bad ADD, and you can only keep their attention so long, and the second cleanup step means trying to keep it longer and failing a job that completed successfully (or failed, but in another way).
https://tbpl.mozilla.org/php/getParsedLog.php?id=11073399&tree=Birch
That wasn't really an explosion of this, just that the reboot step following an instance of bug 681861 does hit this, since the Tegra was dead so it's not going to reboot. https://tbpl.mozilla.org/php/getParsedLog.php?id=11327953&tree=Mozilla-Inbound
Summary: Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" ending in a "process killed by signal 15" → Intermittent tegra "Cleanup Device failed" or "'python run_tests.py ...' failed" or "Configure Device failed" or "updateSUT.py failed" or "verify.py failed" ending in a "process killed by signal 15"
Depends on: 793091
Blocks: 793358
Continued in bug 793358, because we're sick of loading this one.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
Product: mozilla.org → Release Engineering