Closed
Bug 811444
Opened 12 years ago
Closed 12 years ago
android panda boards magically reboot in the middle of the test
Categories
(Testing :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla20
People
(Reporter: jmaher, Assigned: jmaher)
References
Details
Attachments
(3 files, 1 obsolete file)
(deleted), patch | ted: review+
(deleted), patch | kmoir: review+, jmaher: feedback+
(deleted), patch | bhearsum: review+
Oh, memories of Tegra. I have seen this in the log files: we are unable to connect, then we fail, then we get the logcat and it shows a startup sequence.
We need to make sure we are setting the proper watcher.ini values and that the latest version of the watcher (my custom version) will not reboot unless specifically told to.
If all that is good, then we have other problems, maybe a common test case, or a history on the board.
Assignee | ||
Comment 1•12 years ago
|
||
I have verified we are running a legit version of watcher which by default doesn't reboot. I have also verified that we are not setting a watcher.ini file which will tell watcher to reboot.
Now the chance of watcher being the cause of reboots is much less a concern for me.
I am seeing this a lot today while testing a fix for the sutagent hanging when shelling out to system commands. The reboots share no common spot within a test suite, no common test suite, and no common panda board.
This leads me to question the relay.py script and the power strips. I would really like to have a central place to look at all requests that get sent to the power strip so I can determine if we are maybe sending the data. I would also like to get a log from the power strip to see what it thinks it rebooted.
Comment 2•12 years ago
|
||
Joel, are you talking about pandas or tegras? (Comment 1 and the subject don't agree.) No pandas are attached to power strips. They're all attached to relay boards inside chassis. All tegras are attached to power strips.
Comment 3•12 years ago
|
||
I'm certain that the relays don't log. There are no pandas on power strips (by which I assume you mean PDUs), and anyway I'm fairly certain the PDUs don't log either.
Mozpool will be that central place - you can look at the device_logs table, or aggregate /var/log/mozpool.log across all imaging servers if you want the relayhost/bank/relay breakdown. But since sut_lib is doing reboots itself, rather than using mozpool, that won't show you what you need to know today.
If you can give me some examples of pandas that have failed this way, I can check for mozpool having power-cycled them. But I haven't had any problems with misfires, so that's unlikely to be it.
I've been using chassis 3 (panda-{0034..0045}) to test mozpool for the last two weeks, so this may simply be a matter of two groups using the same chassis.
Assignee | ||
Comment 4•12 years ago
|
||
:arr,
I am only talking about pandas. I really have no idea what technology lies behind the scenes to cycle the power, but whatever it is (I guess relay board in this case), I would like to get a log from it.
Assignee | ||
Comment 5•12 years ago
|
||
Dustin, I believe we are using chassis 2 (panda-{0022-0033}). Does that mean chassis 1 is panda-0000 -> panda-0021?
It sounds like we are not toggling the same boards. If I wanted to check which host relay:bank mozpool was calling, how could I do that?
Assignee | ||
Comment 6•12 years ago
|
||
I know that panda-relay-002.build.scl1.mozilla.com is the host we use for the relay. If that is being called from mozpool, we have an answer; if not, we are back at the drawing board.
Speaking of mozpool, is there an easy api to call or a python module we can import to initiate a device reboot?
Comment 7•12 years ago
|
||
The pandas are all listed here:
http://mobile-services.build.scl1.mozilla.com/ui/lifeguard.html
chassis numbers match relay hostnames, so chassis 1 is just panda-0010 and panda-0021 (dunno why)
I don't see panda-relay-002 in the mozpool logs.
And yes, of course there's an API -- that's what we've been working on for two weeks!
https://wiki.mozilla.org/ReleaseEngineering/Mozpool
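As a rough illustration of "chassis numbers match relay hostnames", here is a minimal Python sketch. It uses only the board ranges quoted in comments 3 and 5 (chassis 2 and chassis 3) and the hostname form seen in comment 6. The function name, the lookup table, and the assumption that every chassis follows the panda-relay-00N pattern are hypothetical, and chassis 1 is left out because its range is not clearly stated here.

```python
import re

# Board ranges per chassis, taken only from the ranges quoted in this bug.
# Chassis 1 is omitted on purpose: its range is not clearly stated.
CHASSIS_RANGES = {
    2: (22, 33),  # panda-0022 .. panda-0033
    3: (34, 45),  # panda-0034 .. panda-0045
}


def relay_host_for(panda_name):
    """Return the assumed relay hostname for a panda, or None if unknown."""
    match = re.fullmatch(r"panda-(\d{4})", panda_name)
    if match is None:
        return None
    number = int(match.group(1))
    for chassis, (first, last) in CHASSIS_RANGES.items():
        if first <= number <= last:
            # Hostname pattern assumed from panda-relay-002 in comment 6.
            return "panda-relay-%03d.build.scl1.mozilla.com" % chassis
    return None
```

A lookup like this only says which relay host should be involved; whether mozpool actually called it still has to come from the mozpool logs.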
Assignee | ||
Comment 8•12 years ago
|
||
Assignee | ||
Updated•12 years ago
|
Blocks: android_4.0_testing
Comment 9•12 years ago
|
||
OK, this does the needed work for buildbot and makes a few assumptions I want to call out:
* attachment 685737 [details] [diff] [review] will get an r+ (if it doesn't we need to peek at this again with whatever revised patch does)
* That we are ok with the hack being hidden at this low a level (I couldn't come up with a less hacky idea for passing the var down to here)
* That I am correct and attachment 685737 [details] [diff] [review] only affects mochi and is not wanted/needed for reftests
* That if we are to turn on any panda mochitests on other train-branches we would backport the --run-slower arg.
* b2g pandas won't use this buildbot factory.
* This won't land until *after* the mozilla-central code patch lands and gets carried forward to cedar.
Attachment #686041 -
Flags: review?(kmoir)
Attachment #686041 -
Flags: feedback?(jmaher)
Updated•12 years ago
|
Attachment #686041 -
Flags: review?(kmoir) → review+
Comment 10•12 years ago
|
||
Comment on attachment 685737 [details] [diff] [review]
add a --run-slower to mochitest options (1.0)
Review of attachment 685737 [details] [diff] [review]:
-----------------------------------------------------------------
This sucks. :-( I can only assume we're overloading the device and that's causing it to reboot. It would be better to find a root cause here, but as a band-aid we can deal with this.
::: testing/mochitest/tests/SimpleTest/TestRunner.js
@@ +441,5 @@
>
> TestRunner.updateUI(tests);
> TestRunner._currentTest++;
> + if (TestRunner.runSlower) {
> + setTimeout(TestRunner.runNextTest, 1000);
Have you done any testing to see if we can get away with a lower value? It'd be nice to make this as low as possible without causing issues.
Attachment #685737 -
Flags: review?(ted) → review+
Assignee | ||
Comment 11•12 years ago
|
||
inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8c28e8d114a
I would like to leave this bug open to experiment with lower values instead of 1000.
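For experimenting with values lower than 1000 ms, the throttling idea can be modelled outside the harness. The following is a minimal Python analogue of the pause-between-tests approach, not the TestRunner.js code that landed; the function name, the `delay_seconds` parameter, and the injectable `sleep` are all assumptions made for illustration.

```python
import time


def run_tests(tests, delay_seconds=0.0, sleep=time.sleep):
    """Run each test callable in order, pausing between consecutive tests.

    delay_seconds plays the role of the 1000 ms setTimeout in the patch;
    sleep is injectable so the pacing can be checked without waiting.
    """
    results = []
    for index, test in enumerate(tests):
        results.append(test())
        is_last = index == len(tests) - 1
        if delay_seconds > 0 and not is_last:
            sleep(delay_seconds)
    return results
```

With the delay exposed as a parameter, trying 0.5 or 0.25 seconds is a one-argument change, and the cost is easy to estimate: roughly delay_seconds times the number of tests added to each run.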
Assignee | ||
Comment 12•12 years ago
|
||
Comment on attachment 686041 [details] [diff] [review]
[buildbotcustom] v1
Review of attachment 686041 [details] [diff] [review]:
-----------------------------------------------------------------
looks good.
Attachment #686041 -
Flags: feedback?(jmaher) → feedback+
Comment 13•12 years ago
|
||
Can one of you briefly summarize the problem and fix? I'm mostly curious.
Assignee | ||
Comment 14•12 years ago
|
||
The problem is described in bug 815726. The solution is a hack to slow down the tests, which causes Fennec to use less CPU and memory. All signs point to overheating, but it could easily be something else.
Comment 15•12 years ago
|
||
OK, thanks! An alternative approach might be to look at the kernel CPU governor configuration, but since this is critical-path, don't let me distract you :)
Assignee | ||
Comment 16•12 years ago
|
||
do you have a pointer to how that can be configured?
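On Linux kernels with cpufreq enabled, the governor is usually exposed through sysfs. The sketch below is a minimal illustration of reading and changing it, assuming the panda kernel exposes the standard cpufreq files under /sys/devices/system/cpu/cpu0/cpufreq (not confirmed in this bug) and that the caller has root. The function names are hypothetical.

```python
import os

# Standard Linux cpufreq location for the first CPU; whether the panda
# kernel exposes it is an assumption.
CPUFREQ_DIR = "/sys/devices/system/cpu/cpu0/cpufreq"


def read_governor(cpufreq_dir=CPUFREQ_DIR):
    """Return the name of the currently active governor."""
    with open(os.path.join(cpufreq_dir, "scaling_governor")) as f:
        return f.read().strip()


def available_governors(cpufreq_dir=CPUFREQ_DIR):
    """Return the list of governors the kernel offers."""
    path = os.path.join(cpufreq_dir, "scaling_available_governors")
    with open(path) as f:
        return f.read().split()


def set_governor(name, cpufreq_dir=CPUFREQ_DIR):
    """Switch to the named governor; requires write access to sysfs."""
    if name not in available_governors(cpufreq_dir):
        raise ValueError("governor not available: %s" % name)
    with open(os.path.join(cpufreq_dir, "scaling_governor"), "w") as f:
        f.write(name)
```

On a device this would typically run through a root shell, and each CPU core has its own cpufreq directory, so a real change would need to cover every core.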
Comment 17•12 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20
Comment 18•12 years ago
|
||
We forgot the "leave open" whiteboard annotation for this bug, so it was resolved with the m-c landing.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 19•12 years ago
|
||
Comment on attachment 686041 [details] [diff] [review]
[buildbotcustom] v1
http://hg.mozilla.org/build/buildbotcustom/rev/1075c14a159d
(no checked-in flag here)
Comment 20•12 years ago
|
||
This is in production.
Comment 21•12 years ago
|
||
in production
Updated•12 years ago
|
Summary: panda boards magically reboot in the middle of the test → android panda boards magically reboot in the middle of the test
Comment 22•12 years ago
|
||
Several recent logs in bug 722166 show pandaboards rebooting: comments 1439, 1438, and 1436.
Assignee | ||
Comment 23•12 years ago
|
||
In looking at the test logs on mozilla-central and mozilla-inbound we are not running --run-slower anymore for the pandaboard mochitests. I have confirmed that we continue to have problems with rebooting and adding --run-slower appears to help remediate this problem.
For some data, a regular run yields just >50C from the kernel board reading, but with --run-slower, I always stay <46C. This is data collected from 10 idle boards and 1 board running smoketests.
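A reading like this can be collected with a few lines of Python. This is a minimal sketch that assumes the board exposes the generic Linux thermal sysfs file, which reports millidegrees Celsius; the actual path used for the numbers above is not given in this bug, so the default path, the function names, and the 50C threshold (taken from the regular-run figure above) are illustrative assumptions.

```python
# Generic Linux thermal sysfs path; the panda kernel may expose its
# sensor somewhere else, so treat this default as an assumption.
DEFAULT_TEMP_PATH = "/sys/class/thermal/thermal_zone0/temp"


def read_temp_celsius(path=DEFAULT_TEMP_PATH):
    """Read a thermal zone file (millidegrees C) and return degrees C."""
    with open(path) as f:
        return int(f.read().strip()) / 1000.0


def is_running_hot(temp_c, threshold_c=50.0):
    """Return True when a reading is above the chosen threshold."""
    return temp_c > threshold_c
```

Sampling this once per test on a few boards would show whether the reboots line up with readings above the threshold.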
Comment 24•12 years ago
|
||
The problem is here
http://mxr.mozilla.org/build/source/buildbotcustom/process/factory.py#5369
if 'panda' in self.platform:
is never true because platform has a value of android
We don't want to run these on tegras. Not sure if we should add a value to RemoteUnittestFactory so we have another value that describes the platform better.
It did work before, not sure if recent changes broke something and the value passed used to be panda_android.
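The failure mode is easy to reproduce in isolation. This is a minimal sketch, not the factory.py code: the function names are hypothetical, and the values "android", "panda_android" and "tegra_android" are the platform and slave platform names that appear in this bug.

```python
def broken_check(platform):
    # Mirrors the condition quoted above: the substring 'panda' never
    # occurs in the platform value "android".
    return 'panda' in platform


def slave_platform_check(slave_platform):
    # Same substring test applied to the slave platform instead.
    return 'panda' in slave_platform


print(broken_check("android"))                # False
print(slave_platform_check("panda_android"))  # True
print(slave_platform_check("tegra_android"))  # False
```

The check itself is fine; it is the value being tested that carries no panda information, which is why the flag silently stopped being added.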
Assignee | ||
Comment 25•12 years ago
|
||
can we go with slaveName instead of platform?
is it possible to distinguish between android 4.0 and android 2.2 ?
Comment 26•12 years ago
|
||
I don't think slaveName is available to RemoteUnittestFactory now either. I have to think about how to refactor it.
Comment 27•12 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #23)
> In looking at the test logs on mozilla-central and mozilla-inbound we are
> not running --run-slower anymore for the pandaboard mochitests.
Good spot - thank you for chasing this Joel/Kim :-)
Comment 28•12 years ago
|
||
I looked at this again today and I don't know how it ever worked (although I recall looking at the logs and seeing the parameter there when it landed). In our buildbot-configs,
PLATFORMS['android']['slave_platforms'] = ['tegra_android', 'panda_android']
platform is always android, it's the slave platform that is panda_android
If you look at the build properties on a page from a recent panda android build, the platform is panda_android. I know how to capture that at build time, but I don't see how to use this in an if statement and add the slowTests parameter for only panda_android
http://buildbot.net/buildbot/docs/latest/manual/cfg-properties.html
because this doesn't appear to be supported by buildbot (Using Properties in Steps section).
I'm not sure how to implement this, any ideas would be appreciated.
Comment 29•12 years ago
|
||
I wrote a patch yesterday to address this issue with doStepIf in factory.py and setting properties to determine if the device was a panda by doing a find on the foopy for the panda related file. I tested it this morning and it's not pretty at all, very hacky. Given that Jake and Joel are testing a new power supply next week for the pandas chassis as they have determined that there are issues with the existing ones, I think perhaps we should not sink any more time into trying to make the tests run slower. I'll focus on getting the test infrastructure set up in bug 853947 so they can assess the power issues.
Assignee | ||
Comment 30•12 years ago
|
||
Thanks kmoir! If our power supply solution works, then life is good, otherwise this is a good fallback solution which we can use if needed.
Comment 31•12 years ago
|
||
Patch to check if the foopy has attached pandas if running mochitests. If so, add --run-slower. I've tested this on my dev-master and it works.
Attachment #732434 -
Flags: review?(bhearsum)
Comment 32•12 years ago
|
||
Comment on attachment 732434 [details] [diff] [review]
patch
Review of attachment 732434 [details] [diff] [review]:
-----------------------------------------------------------------
::: process/factory.py
@@ +5396,5 @@
> + extract_fn=ifAPanda,
> + flunkOnFailure=False,
> + haltOnFailure=False,
> + warnOnFailure=False
> + ))
sorry, drive-by review.
could you use a SetBuildProperty step instead? then you could use a property function that looks at build.slavename to determine if it's a panda or not?
the find command as written will traverse all of /builds, only to return the first matching entry with panda-*[0-9] in the name. this could be pretty expensive!
Comment 33•12 years ago
|
||
Comment on attachment 732434 [details] [diff] [review]
patch
Review of attachment 732434 [details] [diff] [review]:
-----------------------------------------------------------------
::: process/factory.py
@@ +5396,5 @@
> + extract_fn=ifAPanda,
> + flunkOnFailure=False,
> + haltOnFailure=False,
> + warnOnFailure=False
> + ))
I second catlee's comments. Let me know if you want help with the SetBuildProperty part. https://mxr.mozilla.org/build-central/source/buildbotcustom/process/factory.py#4729 might be a decent example.
Attachment #732434 -
Flags: review?(bhearsum) → review-
Comment 34•12 years ago
|
||
I changed it like this
def ifAPanda(build):
    slavename = build.slavename
    if re.match(r'panda-[0-9]{4}\+?', slavename):
        return "True"
    else:
        return "False"

self.addStep(SetBuildProperty(
    property_name="slowTests",
    value=ifAPanda,
))
so on the build page slowTests is now a Step instead of a SetProperty and has the correct value
Like here
http://dev-master01.build.scl1.mozilla.com:8036/builders/Android%20Tegra%20250%20mozilla-central%20opt%20test%20mochitest-1/builds/30
However, I'm not sure how to parse this in unittest.py anymore since it's not a property associated with build but rather a step. I tried to find some examples and hacked around a bit but was unable to find a solution - catlee or bhearsum - suggestions?
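The slave name test can be exercised on its own, outside buildbot. The sketch below reuses the regular expression from this comment; the helper names and the idea of turning the result into an argument list are hypothetical and are not how unittest.py consumes the property.

```python
import re

# Same pattern as in the comment above. re.match only anchors at the
# start, so a longer name such as "panda-00345" would also match.
PANDA_SLAVE_RE = re.compile(r'panda-[0-9]{4}\+?')


def is_panda(slavename):
    """Return True when the slave name looks like a panda board."""
    return PANDA_SLAVE_RE.match(slavename) is not None


def mochitest_extra_args(slavename):
    """Return the extra mochitest arguments for this slave (hypothetical)."""
    return ['--run-slower'] if is_panda(slavename) else []
```

Returning a real boolean rather than the strings "True" and "False" avoids a common trap: the string "False" is truthy in Python, so any later `if` on the property value has to compare against the string explicitly.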
Comment 35•12 years ago
|
||
tested on my dev-master and it works.
Attachment #732434 -
Attachment is obsolete: true
Attachment #732808 -
Flags: review?(bhearsum)
Comment 36•12 years ago
|
||
Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead
Review of attachment 732808 [details] [diff] [review]:
-----------------------------------------------------------------
r=me, but can you get rid of the excess whitespace when you land?
Attachment #732808 -
Flags: review?(bhearsum) → review+
Comment 37•12 years ago
|
||
Comment on attachment 732808 [details] [diff] [review]
patch that uses SetBuildProperty instead
Checked in and fixed whitespace
http://hg.mozilla.org/build/buildbotcustom/rev/e854634ca5bb
Comment 38•12 years ago
|
||
latest patch is in production
Comment 39•12 years ago
|
||
Verified that this is working by looking at recent test runs in tbpl.
Status: REOPENED → RESOLVED
Closed: 12 years ago → 12 years ago
Resolution: --- → FIXED