Closed Bug 690311 Opened 13 years ago Closed 13 years ago

clean and verify connectivity to a tegra before launching a job on it

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task, P1)

ARM
Android
task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: Callek)

References

Details

(Keywords: intermittent-failure, Whiteboard: [mobile_unittests][android_tier_1_mozharness])

Attachments

(3 files, 8 obsolete files)

Currently a lot of the purple and red that we have falls into the category of pre/post test scripts failing.  I don't have exact numbers, but I would say it is between 30-50% of the red/purple that I look at.

What I am proposing is:
- test finishes
- do basic reporting
- move tegra to a cleanup pool
- cleanup the tegra (remove files, adjust resolution, reboot, etc...)
- cleanup the foopy (remove test specific files, kill zombie webservers)
- verify we can get a deviceroot on the tegra
- if OK, move tegra to 'available' queue
- if failed, move tegra to 'offline' queue

We will still run into situations where the network goes offline, but this should minimize that noise.
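For illustration, here is a minimal Python sketch of the proposed queue transitions. Every name in it (POOLS, cleanup_tegra, cleanup_foopy, get_deviceroot) is a made-up stand-in, not real sut_tools/clientproxy API:

POOLS = {'available': set(), 'cleanup': set(), 'offline': set()}

def move(tegra, pool):
    for members in POOLS.values():
        members.discard(tegra)
    POOLS[pool].add(tegra)

def recycle(tegra):
    move(tegra, 'cleanup')
    cleanup_tegra(tegra)          # remove files, adjust resolution, reboot, etc.
    cleanup_foopy(tegra)          # remove test-specific files, kill zombie webservers
    if get_deviceroot(tegra):     # connectivity check: can we get a deviceroot?
        move(tegra, 'available')  # safe to take the next job
    else:
        move(tegra, 'offline')    # pull it out for buildduty attention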
Blocks: 438871
bear: do we have any facility to attempt what Joel is proposing? 

It sounds like he wants to separate test result reporting from the hardware cleanup/recycling reporting. Not sure how we bubble this up so that we still get notified when we have hardware issues. Maybe every tegra test job splits into two: the actual test job that triggers a dependent cleanup job. The cleanup job could then notify releng loudly without devs noticing/caring.
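Purely as a sketch of that "split into two jobs" idea: in a buildbot master.cfg, a dependent scheduler is wired up roughly like this (scheduler/builder names are invented, and note the caveat raised in the comments below that buildbot can't pin the dependent job to the same tegra):

from buildbot.scheduler import Scheduler, Dependent

c = BuildmasterConfig = {}
test_scheduler = Scheduler(name='tegra-tests', branch='mozilla-central',
                           treeStableTimer=None,
                           builderNames=['tegra-android-opt-unittest'])
# fires only after the upstream test buildset completes; this is where
# the cleanup/sanity job would live, complaining loudly to releng only
cleanup_scheduler = Dependent(name='tegra-cleanup', upstream=test_scheduler,
                              builderNames=['tegra-cleanup'])
c['schedulers'] = [test_scheduler, cleanup_scheduler]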
Priority: -- → P3
we don't have anything immediately obvious to do this that I know of - the one thing I thought of last night was to make a dependent job that is always run post-tegra job that does all of the sanity checks.  It would be able to set the tegra error state if something is wrong which would take that tegra out of the pool

but I need to talk to one of our buildbot scheduler gurus to see if that is feasible
This sounds promising, guys.  I am open to suggestions other than what I proposed, or some related variation.  Anything to reduce our orange/red/purple on the tbpl page will only give street cred to the android tegra farm!
catlee:

as the scheduler guru (I hope) - does what I outline in comment 2 make any sense or should I slowly step away from the computer and take a break?
(In reply to Mike Taylor [:bear] from comment #2)
> we don't have anything immediately obvious to do this that I know of - the
> one thing I thought of last night was to make a dependent job that is always
> run post-tegra job that does all of the sanity checks.  It would be able to
> set the tegra error state if something is wrong which would take that tegra
> out of the pool
> 
> but I need to talk to one of our buildbot scheduler gurus to see if that is
> feasible

Aiui there's no way to specify which tegra to run a dependent job on.
This is one reason I want a builder per slave so we can target slaves.

However, there is nothing keeping us from putting more maintenance post-reboot-step; we may want to do this with always-run steps that ping/cleanup the tegra and pull it out of commission if it's not reachable.
Also, a post-reboot-step ping/cleanup/maintenance will get a lot of this, but there's always the capacity for a tegra to fall over between finishing this and starting the next job.  This won't be a magic bullet to solve all tegra connectivity.

The way we solve this on all other platforms is by running buildbot directly on the slave.  If the slave isn't up, it won't be able to take jobs.  The fact that the foopy can accept jobs on behalf of the tegra without the tegra being available is inherently problematic.
can the post-reboot-step pull the tegra out of the pool if it detects a failure?
Assignee: nobody → bear
Blocks: 686085
Blocks: 681855
Blocks: 681861
Blocks: 687098
Blocks: 686143
Blocks: 675297
Blocks: 687116
Whiteboard: [orange][mobile_unittests][android_tier_1] → [orange][mobile_unittests][android_tier_1_mozharness]
So this is effectively a blocker. 

We're limping along right now, wasting philor/mbrubeck's time starring these oranges across (at least) 7 blocker bugs, and generating a shit-ton of useless bugmail in the process.

Are we honestly waiting until we can run buildbot directly on each tegra (comment #6)? Is there a bug on file for that already, and should it be a blocker here?

If that isn't the solution, how are we going to address this? Is a mozharness script the answer, as indicated in the whiteboard?

(In reply to Aki Sasaki [:aki] (back jan 9) from comment #5)
> However, there is nothing keeping us from putting more maintenance
> post-reboot-step; we may want to do this with always-run steps that
> ping/cleanup the tegra and pull it out of commission if it's not reachable.

Can we at least do this ASAP, please? Having one failure and then pulling the tegra automatically *should* prevent pile-ups (and will reduce my annoyance with this markedly).
Severity: normal → blocker
Priority: P3 → P1
so mozharness resolves this problem;  We were on track last quarter to have mozharness on the tegras, but a crazy idea to have native fennec was born and all resources from ateam and releng went to supporting that.

If there are other resources to put on mozharness, that would unblock a lot of ateam stuff as well while solving a handful of tegra automation bugs, including this one.
Conveniently, we're not wasting my time anymore: I lost all faith in Android tests, and I now just ignore all failures in them.
Hal is going to take on the mozharness work here. He'll sync up with Aki next week when Aki gets back from vacation.

(In reply to Phil Ringnalda (:philor) from comment #10)
> Conveniently, we're not wasting my time anymore: I lost all faith in Android
> tests, and I now just ignore all failures in them.

That makes me both sad and happy at the same time.
Assignee: bear → hwine
Severity: blocker → critical
Priority: P1 → P2
(In reply to Joel Maher (:jmaher) from comment #9)
> so mozharness resolves this problem;  We were on track last quarter to have
> mozharness on the tegras, but a crazy idea to have native fennec was born
> and all resources from ateam and releng went to supporting that.
> 
> If there are other resources to put on mozharness, that would unblock a lot
> of ateam stuff as well while solving a handful of tegra automation bugs,
> including this one.

Yes, with a caveat: we were on track to get Android *talos* on mozharness last quarter if Android native hadn't reared its head.  We also have to do unit tests, and the checking-out-tegras solution has some other moving parts that haven't been fully fleshed out yet.

I can work with Hal on this, but this isn't a small project by any stretch of the imagination.
Is it possible to raise this to a P1? 

Pretty much everyone (bar philor) using TBPL at this point seems to be ignoring unstarred Android failures unless a whole run goes orange/red (due to the sheer number of false positives) - and will inevitably continue doing so until this bug is resolved - which I feel is going to bite us in the long run...
Right now our only real fix would be porting all mobile testing to mozharness.
The talos portion of that was scheduled for Q4 until Android Native took over my entire schedule.

Android Native is still my main priority for Q1; I will have more time later this quarter or Q2.
From reading the last few comments, it seems like people are relying on MozHarness to fix all this. However, that's a significant amount of work, and even if we started now, it doesn't sound like we'd have MozHarness for a while.

In the meantime, is there anything we can do to make machines more stable in the short term?
The lowest of the low-hanging fruit is to redo (and stage) the patch in bug 660480.
(In reply to John O'Duinn [:joduinn] from comment #15) 
> In the meantime, is there anything we can do to make machines more stable
> in the short term?

I think there are some smarts we can build in to the current system as a precursor to moving everything to mozharness.

I'm not 100% sure which part of the toolchain needs to change...I'm guessing clientproxy? Assuming clientproxy knows "why" the tegra was just rebooted -- mid-run software install vs. end-of-run reboot -- I would like clientproxy to initiate all the cleanup steps on the tegra that we currently run as part of the test job itself. If the cleanup steps succeed, only then is the tegra marked as available. clientproxy could try a few times to cleanup, and then mark the tegra as "bad" and complain loudly to releng.

This would shorten the cycle time for all our tegra tests by removing the cleanup steps, and should hopefully allow more tegra tests to pass (or fail legitimately) by making sure tegras taking jobs have a fighting chance of actually running tests.
Assignee: hwine → nobody
Component: Release Engineering → Release Engineering: Platform Support
QA Contact: release → coop
(In reply to Chris Cooper [:coop] from comment #17) 
> This would shorten the cycle time for all our tegra tests by removing the
> cleanup steps, and should hopefully allow more tegra tests to pass (or fail
> legitimately) by making sure tegras taking jobs have a fighting chance of
> actually running tests.

I would note that this is essentially what Joel proposed when he filed the bug; I'm just advocating that we add the code to clientproxy (or a helper script) to get this done.
(In reply to Chris Cooper [:coop] from comment #17)
> (In reply to John O'Duinn [:joduinn] from comment #15) 
> > In the meantime, is there anything we can do to make machines more stable
> > in the short term?
> 
> I think there are some smarts we can build in to the current system as a
> precursor to moving everything to mozharness.
> 
> I'm not 100% sure which part of the toolchain needs to change...I'm guessing
> clientproxy? Assuming clientproxy knows "why" the tegra was just rebooted
> -- mid-run software install vs. end-of-run reboot -- I would like
> clientproxy to initiate all the cleanup steps on the tegra that we currently
> run as part of the test job itself. If the cleanup steps succeed, only then
> is the tegra marked as available. clientproxy could try a few times to
> cleanup, and then mark the tegra as "bad" and complain loudly to releng.
> 
> This would shorten the cycle time for all our tegra tests by removing the
> cleanup steps, and should hopefully allow more tegra tests to pass (or fail
> legitimately) by making sure tegras taking jobs have a fighting chance of
> actually running tests.

clientproxy knows that a tegra is being rebooted for the install step (presence of the reboot.flg semaphore), so it treats all other reboots as errors and takes down buildbot during that process to avoid a job being started.

if the reboot step sets an "ok I'm done, go check" flag, then clientproxy could wait for the tegra to reboot and run any helper script we want - only when that new flag is cleared would it restart buildbot.
Attached patch rough outline of adding a verify.sh script run (obsolete) (deleted) β€” β€” Splinter Review
This is a first stab at adding the bugfix to clientproxy.

buildbot reboot step should set the "verify.flg" and reboot the tegra

If during the reboot handling clientproxy sees a "verify.flg" file it will run ../sut_tools/verify.sh and then wait

verify.sh should do whatever it needs to do, and if successful reboot the tegra and remove the verify.flg

when the tegra comes back, clientproxy will see it and then restart buildbot
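A rough sketch of that handshake as it might look inside clientproxy's reboot handling (tegra_is_up() is a hypothetical stand-in for the real reachability check; the actual clientproxy code differs):

import os, time, subprocess

def handle_reboot(tegra_dir, events):
    verify_flg = os.path.join(tegra_dir, 'verify.flg')
    if os.path.exists(verify_flg):
        # verify.sh removes the flag and reboots the tegra on success
        subprocess.call(['../sut_tools/verify.sh'])
        while os.path.exists(verify_flg):
            time.sleep(60)         # still verifying, or verify failed
    while not tegra_is_up():
        time.sleep(60)             # tegra coming back from the reboot
    events.put(('start',))         # only now restart buildbot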
Attachment #604282 - Flags: review?(aki)
Attachment #604282 - Flags: feedback?(armenzg)
Comment on attachment 604282 [details] [diff] [review]
rough outline of adding a verify.sh script run

You don't check the verify script for any return value; will it set flags or something that will tell CP what to do?
I left it open whether verify.sh would return a value and/or just clear the flag and reboot the tegra.

If I end up writing it then I'll probably do both, but yeah - that needs to be figured out.
Comment on attachment 604282 [details] [diff] [review]
rough outline of adding a verify.sh script run

This doesn't do anything until we create verify.flg and verify.sh; then we need to see what verify.sh does.  Should be a good start though.

(Curious whether, if verify.sh hard kills cp, that in turn kills the verify.sh call.)
Attachment #604282 - Flags: review?(aki) → review+
Comment on attachment 604282 [details] [diff] [review]
rough outline of adding a verify.sh script run

Good start bear.

What happens if verify.sh does not pass? would the tegra reboot/verify forever?
Attachment #604282 - Flags: feedback?(armenzg) → feedback+
That is an open question in my mind right now.  I can easily enough add a counter to clientproxy to track how many times it's called before it's cleaned up, or it can be tracked in the verify code itself.

Or we can take the hardline route of saying that if a verify fails the tegra is taken offline and a bug filed.

like all things tegra, we won't know for sure (or even find out) until we start testing it
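For concreteness, one shape the counter could take: a hypothetical verify.count file kept next to the other .flg semaphores in /builds/tegra-NNN (none of this is in the current patch):

import os

MAX_VERIFY_ATTEMPTS = 3

def bump_verify_count(tegra_dir):
    counter = os.path.join(tegra_dir, 'verify.count')
    count = 1
    if os.path.exists(counter):
        count = int(open(counter).read().strip()) + 1
    open(counter, 'w').write(str(count))
    return count

clientproxy would call this before each re-verify; once the count exceeds MAX_VERIFY_ATTEMPTS it would set error.flg and take the tegra offline, and a successful verify would delete verify.count.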
I want to take a stab at upgrading the SUT agent in between taking jobs.
The reason for this is that in bug 734221 I have not been able to figure out how to make updateSUT.py work well inside of buildbot without turning the job red.
I could have a downtime and forget about making the update of the SUT agent a standard procedure.
taking per joduinn
Assignee: nobody → bugspam.Callek
Priority: P2 → P1
Attached patch WIP (obsolete) (deleted) β€” β€” Splinter Review
Attachment #604282 - Attachment is obsolete: true
Comment on attachment 607328 [details] [diff] [review]
WIP

-                    events.put(('start',))
+                    events.put(('verify',))

Don't add the "verify" event.  Otherwise you would be adding two states to the queue (I think): one for a successful run of updateSUT.py (event/verify) and another one for verify.sh (event/start).

                 else:
                     log.info('updateSUT.py has had issues')
                     for i in output:
                         log.info(i)
I like how you added log.warning for verify.sh on failure. Could you please also add it for failures of updateSUT.py?

Thanks for posting patches early! That is great :)
Attached patch WIP 2 (obsolete) (deleted) β€” β€” Splinter Review
Still untested, but getting closer
Attachment #607328 - Attachment is obsolete: true
Comment on attachment 607620 [details] [diff] [review]
WIP 2

This looks good.
I don't see the need to verify the version inside of verify.sh/py, since updateSUT.py fails if the version is not correct.
If you want, you can call updateSUT.py from inside verify.sh/py and set the flags according to the errcode or do the current double calling if it complicates things. I like it that you grabbed target_version as EXPECTED_SUT_VERSION; that way the scripts don't fall out of sync :D

FYI, updateSUT.py does 5 retries at 90-second intervals.
I like the direction Callek is taking with moving parts (or all?) of updateSUT inside of the verify code.  IMO that is exactly part of what verify should be doing.

If the SUT agent is the incorrect version, try to fix it; if that fails, flag for recovery.
Attached patch WIP 3 (obsolete) (deleted) β€” β€” Splinter Review
This *should* be good for me to begin testing.
Attachment #607620 - Attachment is obsolete: true
Ok, rudimentary testing results are back, and it looks like I caused tegra-010 and tegra-012 to get into a perm error state due to screen resolution.

foopy05:builds cltbld$ grep "Screen" */error.flg
tegra-010/error.flg:Unexpected Screen on tegra, got 'X:1600 Y:1200' expected 'X:1024 Y:768'
tegra-012/error.flg:Unexpected Screen on tegra, got 'X:1680 Y:1050' expected 'X:1024 Y:768'

While Bug 737427 [I think] should erase the current cause of this, it should be easy to bake in an attempt at solving this automatically, which I will do here for completeness (and buildduty sanity).
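A sketch of what that automatic attempt could look like, assuming devicemanager's getInfo('screen') and an adjustResolution() helper (the latter implied by the patch's EXPECTED_TEGRA_SCREEN_ARGS); a resolution change only takes effect after a reboot, so the tegra still gets flagged this pass:

from sut_lib import setFlag

EXPECTED_TEGRA_SCREEN = 'X:1024 Y:768'
EXPECTED_TEGRA_SCREEN_ARGS = {'width': 1024, 'height': 768, 'type': 'crt'}

def checkAndFixScreen(dm, errorFile):
    screen = dm.getInfo('screen')['screen'][0]
    if screen == EXPECTED_TEGRA_SCREEN:
        return True
    setFlag(errorFile, "Unexpected Screen on tegra, got '%s' expected '%s'"
                       % (screen, EXPECTED_TEGRA_SCREEN))
    dm.adjustResolution(**EXPECTED_TEGRA_SCREEN_ARGS)  # takes effect on next boot
    return False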
Attached patch [tools] v1 (obsolete) (deleted) β€” β€” Splinter Review
Ok, this should do it.

This has had testing on foopy05. I was not able to test all the error conditions easily (I don't have any tegras here with a broken sdcard mount, nor any that I can ping but cannot telnet to, nor a single one with an outdated SUTAgent [after armen's cp code to update that]).

I decided to do away with the shell script here, as it did not provide any benefit afaict above and beyond invoking the python file directly.
Attachment #607658 - Attachment is obsolete: true
Attachment #608144 - Flags: review?(jmaher)
Attachment #608144 - Flags: review?(bear)
Comment on attachment 608144 [details] [diff] [review]
[tools] v1

Review of attachment 608144 [details] [diff] [review]:
-----------------------------------------------------------------

I don't like the global use of dm, but I am fine with it as this is a utility script and everything has a clean true/false return path.

::: sut_tools/verify.py
@@ +17,5 @@
> +
> +def canPing(ip):
> +    """ Check a tegra is reachable by ping
> +
> +    Returns False on failure, True on Succes

Missing s at the end of Success

@@ +30,5 @@
> +                setFlag(errorFile, "Unable to ping tegra after %s Retries" % numRetries)
> +                print "WARNING: Unable to ping tegra after %s try" % numRetries
> +                return False
> +            else:
> +                print "INFO: Unable to ping tegra after %s try. Sleeping for 60s then retrying" % curRetry

comment doesn't match sleep below

@@ +40,5 @@
> +def canTelnet(ip):
> +    """ Checks if we can establish a Telnet session (via devicemanager)
> +
> +    Sets global `dm`
> +    Returns False on failure, True on Succes

Missing s at the end of Success

@@ +64,5 @@
> +
> +def checkVersion(dm):
> +    """ Verify SUTAgent Version
> +
> +    Returns False on failure, True on Succes

Missing s at the end of Success

@@ +100,5 @@
> +    """ Attempt to write a temp file to the SDCard
> +
> +    We use this existing verify script as the source of the temp file
> +
> +    Returns False on failure, True on Succes

missing s at the end of Success

@@ +121,5 @@
> +            setFlag(errorFile, "Unable to cleanup from written tempfile")
> +            return False
> +    except Exception, e:
> +        setFlag(errorFile, "Unknown error while testing ability to write to" \
> +                           "sdcard, see following exception: %s" % e)

after the exception, you return True; we should return False here.

@@ +144,5 @@
> +    if not checkAndFixScreen(dm):
> +        return 1
> +
> +    if not checkSDCard(dm):
> +        return 1

can we return 0 at the end of main here?
Attachment #608144 - Flags: review?(jmaher) → review+
Comment on attachment 608144 [details] [diff] [review]
[tools] v1

>diff --git a/sut_tools/verify.py b/sut_tools/verify.py
>new file mode 100644
>--- /dev/null
>+++ b/sut_tools/verify.py
>@@ -0,0 +1,155 @@
>+#!/usr/bin/env python
>+
>+import sys
>+import os
>+import time
>+from sut_lib import pingTegra, setFlag
>+import devicemanagerSUT as devicemanager
>+from updateSUT import target_version as EXPECTED_SUT_VERSION
>+from updateSUT import connect, version
>+
>+MAX_RETRIES = 5
>+EXPECTED_TEGRA_SCREEN = 'X:1024 Y:768'
>+EXPECTED_TEGRA_SCREEN_ARGS = {'width': 1024, 'height': 768, 'type': 'crt'}
>+
>+errorFile = None
>+dm = None
>+
>+def canPing(ip):
>+    """ Check a tegra is reachable by ping
>+
>+    Returns False on failure, True on Succes
>+    """
>+    curRetry = 0
>+    print "INFO: attempting to ping tegra"
>+    while curRetry < MAX_RETRIES:
>+        ret, _ = pingTegra(ip)
>+        if ret != True:

safer and, IMO, easier to read if you avoid mixing != and a boolean constant - why not use |if not ret:| ?

>+            curRetry += 1
>+            if curRetry == numRetries:

where is numRetries defined?

>+                setFlag(errorFile, "Unable to ping tegra after %s Retries" % numRetries)
>+                print "WARNING: Unable to ping tegra after %s try" % numRetries
>+                return False
>+            else:
>+                print "INFO: Unable to ping tegra after %s try. Sleeping for 60s then retrying" % curRetry
>+                time.sleep(90)

60 or 90?  also, should we really sleep? can we instead just increase the max retry count?

>+        else:
>+            break # we're done here
>+    return True
>+
>+def canTelnet(ip):
>+    """ Checks if we can establish a Telnet session (via devicemanager)
>+
>+    Sets global `dm`
>+    Returns False on failure, True on Succes
>+    """
>+    global dm
>+    curRetry = 0
>+    sleepDuration = 0
>+    while curRetry < MAX_RETRIES:
>+        try:
>+            dm = connect(ip, sleepDuration)
>+        except:
>+            curRetry += 1
>+            if curRetry == numRetries:

where is numRetries defined?

>+                setFlag(errorFile, "Unable to connect to tegra after %s Retries" % numRetries)
>+                print "WARNING: Unable to connect to tegra after %s try" % numRetries
>+                return False
>+            else:
>+                print "INFO: Unable to connect to tegra after %s try" % curRetry
>+                sleepDuration = 90
>+        else:
>+            break # We're done here
>+    return True

<snip>

>+def main(tegra, ip):
>+    global dm, errorFile
>+    tegraPath = os.path.join('/builds', tegra)
>+    errorFile = os.path.join(tegraPath, 'error.flg')
>+
>+    if not canPing(ip):
>+        # TODO Reboot via PDU if ping fails
>+        return 1
>+
>+    if not canTelnet(ip):
>+        return 1
>+
>+    if not checkVersion(dm):
>+        # TODO Move updateSUT from cp to here
>+        return 1
>+
>+    if not checkAndFixScreen(dm):
>+        return 1
>+
>+    if not checkSDCard(dm):
>+        return 1
>+
>+if __name__ == '__main__':
>+    if (len(sys.argv) <> 3):
>+        print "usage: verify.py <tegra name> <ip address>"
>+        sys.exit(1)
>+
>+    sys.exit(main(sys.argv[1], sys.argv[2]))

This is a "style" issue, but with all of the other functions returning True/False we have the can*() series returning 1.  I get that it allows you to just inline it to the sys.exit() but if you switch it to True/False then you can also have the return code in a single spot in case you decide to return something else later.  The only reason I could see to do the above is if you have different return values based on what part fails.

Like Joel said, using a global DM has been a source of odd errors in the past - if you're using a global DM then you will need to make sure you check all of the returns from any dm.* call.

a typo nit:  s/Succes/Success/
Attachment #608144 - Flags: review?(bear) → review-
(In reply to Mike Taylor [:bear] from comment #37)
> >+                setFlag(errorFile, "Unable to ping tegra after %s Retries" % numRetries)
> >+                print "WARNING: Unable to ping tegra after %s try" % numRetries
> >+                return False
> >+            else:
> >+                print "INFO: Unable to ping tegra after %s try. Sleeping for 60s then retrying" % curRetry
> >+                time.sleep(90)
> 
> 60 or 90?  also, should we really sleep? can we instead just increase the
> max retry count?

I feel a sleep is a much better counter to a temporary network outage, among other things, like keeping actual network load lower when stuff gets oversaturated (nothing like having every foopy trying to ping every tegra at just about the same time). At least this way we give things a bit of a break. 5 tries 90 seconds apart means we will ALLOW the verify script to gracefully wait on being pingable for 7.5 minutes. I can dial that back to 30s or 60s sleeps if you want, though.
Attached patch [tools] v1.1 (obsolete) (deleted) β€” β€” Splinter Review
Corrects issues, adds a dmAlive function that eliminates bear's concerns with global dm.
Attachment #608144 - Attachment is obsolete: true
Attachment #608405 - Flags: review?(bear)
Comment on attachment 608405 [details] [diff] [review]
[tools] v1.1

looking good! let's get this thing tested!
Attachment #608405 - Flags: review?(bear) → review+
Comment on attachment 608405 [details] [diff] [review]
[tools] v1.1

Bear suggests you sanity-check here.
Attachment #608405 - Flags: feedback?(armenzg)
Attached patch [tools] Part 1 final (deleted) β€” β€” Splinter Review
This changed from v1.1 per conversation with armenzg and bear over IRC. I was told to land on Friday, but landing this slipped my mind.

The changes for record:
* verify.py need only get passed the tegra name
* comment out (I opted for `if |False and|` ... instead) the check for Screen Resolution until that consolidation patch lands again.

This approach/patch has r=bear/jmaher and has been tested by me on foopy5. Leaving feedback open for armen in case he has any followup needs for this stuff.

I do plan on doing a part 2 for the "Clean" part of this bug, which *should* be relatively easy.

The plan is for a staggered rollout of this onto the production foopies, and it does require a stop_cp/start_cp run.
Attachment #608405 - Attachment is obsolete: true
Attachment #609194 - Flags: feedback?(armenzg)
Attachment #608405 - Flags: feedback?(armenzg)
Comment on attachment 609194 [details] [diff] [review]
[tools] Part 1 final

http://hg.mozilla.org/build/tools/rev/dc933e3ad581
Attachment #609194 - Flags: checked-in+
Comment on attachment 609194 [details] [diff] [review]
[tools] Part 1 final

Review of attachment 609194 [details] [diff] [review]:
-----------------------------------------------------------------

It all looks good. Feel free to take nits into consideration or not (not sure if nit is the right word for the comments I made).

::: sut_tools/verify.py
@@ +26,5 @@
> +            return True
> +    except:
> +        pass # the actual exception holds no additional value here
> +    setFlag(errorFile, "Device manager lost connection to tegra")
> +    return False

Nit: Would it look better to put setFlag and return False inside the except clause?

@@ +146,5 @@
> +    except Exception, e:
> +        setFlag(errorFile, "Unknown error while testing ability to write to" \
> +                           "sdcard, see following exception: %s" % e)
> +        return False
> +    return True

Nit: perhaps return True should be at the end of the try clause rather than after it? Not sure what makes it cleaner.
Attachment #609194 - Flags: feedback?(armenzg) → feedback+
Does this code usually have license headers?
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #44)
> Comment on attachment 609194 [details] [diff] [review]
> [tools] Part 1 final
> 
> Review of attachment 609194 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> It all looks good. Feel free to take nits into consideration or not (not
> sure if nit is the right word for the comments I made).
> 
> ::: sut_tools/verify.py
> @@ +26,5 @@
> > +            return True
> > +    except:
> > +        pass # the actual exception holds no additional value here
> > +    setFlag(errorFile, "Device manager lost connection to tegra")
> > +    return False
> 
> Nit: Would it look better to put setFlag and return False inside the except
> clause?

Not IMO, because we only return True on the success path. It can return unsuccessfully without throwing, and I'd rather catch that as well.

> @@ +146,5 @@
> > +    except Exception, e:
> > +        setFlag(errorFile, "Unknown error while testing ability to write to" \
> > +                           "sdcard, see following exception: %s" % e)
> > +        return False
> > +    return True
> 
> Nit: perhaps return True should be at the end of the try clause rather than
> after it? Not sure what makes it cleaner.

This one is just personal preference; if you insist I'll change it, but to me it is clearer the way it is.
(In reply to Ms2ger from comment #45)
> Does this code usually have license headers?

The other sut_agent code I saw here did not, so I didn't bother adding it to my added code. Easy to add in if need be, but given that, I didn't bother.
(In reply to Justin Wood (:Callek) from comment #47)
> (In reply to Ms2ger from comment #45)
> > Does this code usually have license headers?
> 
> The other sut_agent code I saw here did not, so I didn't bother adding it to
> my added code. Easy to add in if need be, but given that, I didn't bother.

I never did add license headers to the sut_tools related code, but we really should start getting back into that habit.

As we both touch this code we should be adding the new MPLv2 headers.
Attached patch [tools] part 2 - WIP 1 (obsolete) (deleted) β€” β€” Splinter Review
This should add in the cleanup stuff.

I decided to reuse the script that buildbot currently runs here, but it needed some tweaking to work well from the cwd that verify runs in, without also breaking buildbot.

As well as other tweaks to prevent the early sys.exit() and to be sure that it can be properly imported, etc.

The general idea is that uninstalls need a reboot, so let it reboot early; verify.py should not wait on the reboot, and should instead just not start buildbot and re-verify at the next startup of the tegra.

Once this lands and bakes a bit in prod, we'll have a patch for buildbot to drop some of these verify/cleanup steps from its end (or at least make some of them non-failing).

I also added in the MPLv2 headers to the two files I touched.
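A sketch of that reboot-early idea (the return-code names and cleanup_main() here are illustrative, not the final code):

RETCODE_SUCCESS = 0
RETCODE_ERROR = 1
RETCODE_REBOOTING = 2

def verify_main(tegra, ip):
    rc = cleanup_main(tegra, ip)    # may uninstall apps and reboot early
    if rc == RETCODE_REBOOTING:
        # do NOT wait for the reboot: leave buildbot stopped and let the
        # next tegra startup trigger a fresh verify pass
        return rc
    if rc != RETCODE_SUCCESS:
        return RETCODE_ERROR
    # ... ping/telnet/SUT-version/sdcard checks continue as before ...
    return RETCODE_SUCCESS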
Attachment #609417 - Flags: feedback?(jmaher)
Attachment #609417 - Flags: feedback?(bear)
Depends on: 739384
Attachment #609417 - Flags: feedback?(bear) → feedback+
Comment on attachment 609417 [details] [diff] [review]
[tools] part 2  - WIP 1

Testing looks relatively sane, but there is some weird file-name/cleanup stuff that seems not to return failure.

I am not certain whether my script caused this, nor whether it is something to worry over; I just want to point it out:
http://callek.pastebin.mozilla.org/1535722
I would not be surprised if you are seeing something that we were not aware of that causes problems down the road.

Perhaps we want to get the output of which processes are still around and ensure that we kill anything that could be keeping a hold of those files.

Let us know if you find more things like this. There is nothing better than learning more about the innards of the tegras.

2012-03-26 14:36:35,969 DEBUG   MainProcess: INFO: attempting to create file /mnt/sdcard/writetest
2012-03-26 14:36:35,970 DEBUG   MainProcess: removing file: /mnt/sdcard/writetest
2012-03-26 14:36:35,970 DEBUG   MainProcess: devroot /mnt/sdcard/tests
2012-03-26 14:36:35,970 DEBUG   MainProcess: removeDir() returned [Deleting file(s) from /mnt/sdcard
/tests
2012-03-26 14:36:35,970 DEBUG   MainProcess:    Unable to delete AΒͺV.Γ’-^uAβ–Œ
2012-03-26 14:36:35,970 DEBUG   MainProcess: Deleting file(s) from /mnt/sdcard/tests/profile
2012-03-26 14:36:35,971 DEBUG   MainProcess: Deleting file(s) from /mnt/sdcard/tests/profile/Cache.T
rash710348615
2012-03-26 14:36:35,971 DEBUG   MainProcess: Deleting file(s) from /mnt/sdcard/tests/profile/Cache.T
rash710348615/E
2012-03-26 14:36:35,971 DEBUG   MainProcess: Deleting file(s) from /mnt/sdcard/tests/profile/Cache.T
rash710348615/E/20
2012-03-26 14:36:35,971 DEBUG   MainProcess:    Unable to delete Γ’-'β™₯Ò£Òz.AΒΆAΒ½Γ’^T
2012-03-26 14:36:35,971 DEBUG   MainProcess: Unable to delete directory /mnt/sdcard/tests/profile/Ca
che.Trash710348615/E/20
These are actually filenames which exist with these unicode characters as the name.  I found that we are unable to delete these via sut or adb shell; the only way I could delete them was via an sdcard format.  Upon further thought, an sdcard format would be a good idea in the cleanup step!  The one caveat I can think of is that we need to reboot after a format.
Comment on attachment 609417 [details] [diff] [review]
[tools] part 2  - WIP 1

Review of attachment 609417 [details] [diff] [review]:
-----------------------------------------------------------------

this is looking really good.
Attachment #609417 - Flags: feedback?(jmaher) → feedback+
(In reply to Joel Maher (:jmaher) from comment #52)
> these are actually filenames which exist with these unicode characters as
> the name.  I found that we are unable to delete these via sut or adb shell. 
> the only way I could delete them was via a sdcard format.  Upon further
> thought, and sdcard format would be a good idea in the cleanup step!  The
> one caveat I can think of is that we need to reboot after a format.

That might not be a bad idea at all, and jmaher mentioned it in his email.

* What do you guys think?
* Do we know what the upside would be? (besides deleting these odd filenames that we don't yet know whether they cause any trouble later on)
* How hard is it to re-format? How long does the process take?
* What is the down side of formatting?
* Could it be troublesome after a thousand re-formats?
* Is it healthier for the SD card?
* What happens on failed attempt of re-formatting?
The reformat is really fast (seconds).  I would assume it is less wear and tear on the sdcard, since we wouldn't be deleting each individual file.

The biggest downside I can think of is requiring a reboot afterwards.  This might not be required, but I think it is, as the OS stores some information and other partitions on the sdcard which don't appear to exist.  Technically we have to reboot during install, so maybe we can ignore the reboot.
Attached patch [tools] Part 1.5 (deleted) β€” β€” Splinter Review
Go figure: while I had this code issue in my testing on foopy05, nothing failed this check.
Attachment #609916 - Flags: review?(armenzg)
Attachment #609916 - Flags: review?(armenzg) → review+
Comment on attachment 609916 [details] [diff] [review]
[tools] Part 1.5

http://hg.mozilla.org/build/tools/rev/282d8239eb1b
Attachment #609916 - Flags: checked-in+
Attachment #609916 - Attachment is patch: true
Attached patch [tools] Part 2 - v1 (obsolete) (deleted) β€” β€” Splinter Review
This part is running on foopy05 right now, has caught the weird unicode directory/file names we were unable to delete in some runs.

I opted not to do the reformatting automatically; that just feels prone to trouble. If it turns out to be a major issue when we deploy, then I'm happy to revisit it.
Attachment #609417 - Attachment is obsolete: true
Attachment #610333 - Flags: review?(bear)
Attachment #610333 - Flags: review?(armenzg)
Attachment #610333 - Flags: feedback?(jmaher)
Comment on attachment 610333 [details] [diff] [review]
[tools] Part 2 - v1

Review of attachment 610333 [details] [diff] [review]:
-----------------------------------------------------------------

all we are missing is removing/wiping the sdcard data.

::: sut_tools/cleanup.py
@@ +89,5 @@
> +        if errcode == 2:
> +            print "processes from previous run were detected and cleaned up"
> +        elif errcode == 3:
> +            setFlag(errorFile, "Remote Device Error: process from previous test run present")
> +            return KILLSTALLED_ERROR

I don't think we will get here because we uninstall all previous apps
Attachment #610333 - Flags: feedback?(jmaher) → feedback+
Comment on attachment 610333 [details] [diff] [review]
[tools] Part 2 - v1

Review of attachment 610333 [details] [diff] [review]:
-----------------------------------------------------------------

::: sut_tools/cleanup.py
@@ +89,5 @@
> +        if errcode == 2:
> +            print "processes from previous run were detected and cleaned up"
> +        elif errcode == 3:
> +            setFlag(errorFile, "Remote Device Error: process from previous test run present")
> +            return KILLSTALLED_ERROR

Actually, I think you're wrong: checkStalled runs against the foopy, not the tegra, so uninstalling apps on the tegra itself can still leave processes running on the foopy.
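Illustrative only, since the real checkStalled lives in sut_lib and differs in detail, but a foopy-side sweep amounts to something like:

import os
import signal
import subprocess

def kill_stale_foopy_processes(tegra):
    ps = subprocess.Popen(['ps', '-eo', 'pid,command'],
                          stdout=subprocess.PIPE).communicate()[0]
    killed = 0
    for line in ps.splitlines()[1:]:
        pid, _, cmd = line.strip().partition(' ')
        # anything still referencing this tegra's build dir on the foopy
        # (e.g. a zombie web server from the last run) is fair game
        if '/builds/%s/' % tegra in cmd:
            os.kill(int(pid), signal.SIGKILL)
            killed += 1
    return killed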
Comment on attachment 610333 [details] [diff] [review]
[tools] Part 2 - v1

Review of attachment 610333 [details] [diff] [review]:
-----------------------------------------------------------------

This is looking pretty good :)

In general, should we be using our standard return codes so we can retry the jobs (where needed)?
https://github.com/buildbot/buildbot/blob/master/master/buildbot/status/results.py#L16

Since you are looking at this code, I have gone and analyzed a typical problem we see on tbpl.
If you look below you will see that tegra-101 failed in cleanup.py. [1]
You can also notice that there is a warning flag that gets cleared out, but we don't know what its contents were.
> Warning proxy.flg found during cleanup
Would you please be able to modify sut_lib.py or cleanup.py to print out its contents? I believe it might give us some clues about what happened on the previous job, and perhaps we'd be able to recover. bear might want to weigh in.

I guess I am trying to understand whether that flag should have been a warning or something more severe, rather than just being cleared out.
Please feel free to question my reasoning.
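One possible shape for the suggested change (the real clearFlag lives in sut_lib.py; the dump parameter here is hypothetical):

import os

def clearFlag(flagFile, dump=False):
    if os.path.exists(flagFile):
        if dump:
            print "Contents of %s:" % flagFile
            print open(flagFile).read()   # record why the flag was set
        os.remove(flagFile)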

[1]
https://tbpl.mozilla.org/php/getParsedLog.php?id=10481533&tree=Mozilla-Inbound&full=1

python /builds/sut_tools/cleanup.py 10.250.49.89
 in dir /builds/tegra-101/test/../talos-data (timeout 1200 secs)
 watching logfiles {}
 argv: ['python', '/builds/sut_tools/cleanup.py', '10.250.49.89']
 environment:
  PATH=/opt/local/bin:/opt/local/sbin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin
  PWD=/builds/tegra-101/talos-data
  SUT_IP=10.250.49.89
  SUT_NAME=tegra-101
  __CF_USER_TEXT_ENCODING=0x1F5:0:0
 closing stdin
 using PTY: False
Warning proxy.flg found during cleanup
Connecting to: 10.250.49.89
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
reconnecting socket
unable to connect socket
devroot None
/builds/tegra-101/talos-data/../error.flg
Remote Device Error: devRoot from devicemanager [None] is not correct
program finished with exit code 1
elapsedTime=30.135135

::: sut_tools/cleanup.py
@@ +12,5 @@
>  
> +# main() RETURN CODES
> +ERRORCODE_SUCCESS = 0
> +GENERIC_ERROR = 1
> +KILLSTALLED_ERROR = 2

suggestions/nits:
RETCODE_SUCCESS = 0
RETCODE_ERROR = 2
RETCODE_KILLSTALLED = 1 or 4 (depending on what you do with it later)

These suggestions are according to these:
https://github.com/buildbot/buildbot/blob/master/master/buildbot/status/results.py#L16
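For reference, the linked results.py at the time boils down to:

# buildbot's standard result codes (master/buildbot/status/results.py)
SUCCESS, WARNINGS, FAILURE, SKIPPED, EXCEPTION, RETRY = range(6)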

@@ +48,5 @@
> +    # Now Verify that they are all gone
> +    for p in processNames:
> +        if dm.dirExists('/data/data/%s' % p):
> +            setFlag(errorFile, "Unable to properly uninstall %s" % p)
> +            return GENERIC_ERROR

Should this be a RETRY JOB return code?
Any idea how we could pull the board out of taking jobs? Or stop buildbot and have cp start it again after the verify steps?

If GENERIC_ERROR(1) causes the job to be retried, please ignore my comment.

@@ +61,5 @@
> +        status = dm.removeDir(devRoot)
> +        print "removeDir() returned [%s]" % status
> +        if status is None or not status:
> +           setFlag(errorFile, "Remote Device Error: call to removeDir() returned [%s]" % status)
> +           return GENERIC_ERROR

Same as previous GENERIC_ERROR comment.

@@ +64,5 @@
> +           setFlag(errorFile, "Remote Device Error: call to removeDir() returned [%s]" % status)
> +           return GENERIC_ERROR
> +        if dm.dirExists(devRoot):
> +            setFlag(errorFile, "Remote Device Error: Unable to properly remove %s" % devRoot)
> +            return GENERIC_ERROR

Same as previous GENERIC_ERROR comment.

@@ +76,5 @@
> +            dm.verifySendCMD(['exec dd if=/mnt/sdcard/hosts of=/system/etc/hosts'])
> +        except devicemanager.DMError, e:
> +            print "Exception hit while trying to restore /system/etc/hosts: %s" % str(e)
> +            setFlag(errorFile, "failed to restore /system/etc/hosts")
> +            return GENERIC_ERROR

Same as previous GENERIC_ERROR comment.

@@ +79,5 @@
> +            setFlag(errorFile, "failed to restore /system/etc/hosts")
> +            return GENERIC_ERROR
> +        if not dm.fileExists('/system/etc/hosts'):
> +            setFlag(errorFile, "failed to restore /system/etc/hosts")
> +            return GENERIC_ERROR

Same as previous GENERIC_ERROR comment.

@@ +89,5 @@
> +        if errcode == 2:
> +            print "processes from previous run were detected and cleaned up"
> +        elif errcode == 3:
> +            setFlag(errorFile, "Remote Device Error: process from previous test run present")
> +            return KILLSTALLED_ERROR

I think Callek is right.

Why do we return status 2? What do we do with it afterwards?
How is it different from GENERIC_ERROR?
Comment on attachment 610333 [details] [diff] [review]
[tools] Part 2 - v1

r+ with the constant suggestions. As long as the names start with RETCODE I am happy.
There is nothing in this patch that can break things.
To be honest, most of my comments were about the original code rather than the patch, and are improvements rather than fixes.
If you could output the contents of the flag in this pass, that would be great. Otherwise we can do it later.
Attachment #610333 - Flags: review?(armenzg) → review+
Attached patch [tools] Part 2 - v1.5 (deleted) β€” β€” Splinter Review
r+=armenzg, f+=jmaher; giving bear a shot to comment as well.

From IRC: the retcodes I'm using here are the same ones the code used to use with sys.exit(), so the retcodes match that for now [rather than introduce more changes in this bug]. The consts are added so changing what codes we use will actually be easier later, if we decide it's worth doing.

Added a way to do the clearFlag printing as suggested (f? to armen for that alone), and I did change the return const names as discussed.

----

Future improvements I feel should be separate followup bugs [some harder than others]:
* Enable the Screen Resolution check/change (Bug 737427)
* Auto PDU Cycle a tegra if it's unpingable for too long
* Auto PDU Cycle a tegra if it is pingable but SUTAgent can't connect for too long
* Move updateSUT.py invocation from both clientproxy.cp and buildbot to verify.py
* Format the sdcard via adb and reboot if we can't properly delete /mnt/sdcard/tests
* Delete additional files on the sdcard [e.g. /mnt/sdcard/Downloads/*]
* [maybe] Stop running cleanup.py during buildbot test cycles and allow it to always be run by verify instead.
Attachment #610333 - Attachment is obsolete: true
Attachment #610704 - Flags: review?(bear)
Attachment #610704 - Flags: feedback?(armenzg)
Attachment #610333 - Flags: review?(bear)
Attachment #610704 - Flags: review?(bear) → review+
Blocks: 741476
Blocks: 741478
Blocks: 741480
Attachment #610704 - Flags: feedback?(armenzg) → feedback+
Comment on attachment 610704 [details] [diff] [review]
[tools] Part 2 - v1.5

This was checked in with:
http://hg.mozilla.org/build/tools/rev/2f7a4ba088eb

All the foopies have now been updated with this change as well.

I have notified on dev.tree-management that this was deployed (just in case).

I have also updated the maintenance page.
Attachment #610704 - Flags: checked-in+
Blocks: 742479
Blocks: 742480
This bug is now fully deployed. Thanks bear/armen for the reviews, advice, and help deploying!

(In reply to Justin Wood (:Callek) from comment #63)
> ----
> 
> Future improvements I feel should be separate followup bugs [some harder
> than others]:
> * Auto PDU Cycle a tegra if it's unpingable for too long
> * Auto PDU Cycle a tegra if it is pingable but SUTAgent can't connect for
> too long

Both of those are Bug 741476

> * Move updateSUT.py invocation from both clientproxy.cp and buildbot to
> verify.py

Bug 741478

> * Format the sdcard via adb and reboot if we can't properly delete
> /mnt/sdcard/tests

Bug 741480

> * Delete additional files on the sdcard [e.g. /mnt/sdcard/Downloads/*]

Bug 742479

> * [maybe] Stop running cleanup.py during buildbot test cycles and allow it
> to always be run by verify instead.

Bug 742480
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
No longer blocks: 686085
Blocks: 686085
Whiteboard: [orange][mobile_unittests][android_tier_1_mozharness] → [mobile_unittests][android_tier_1_mozharness]
Product: mozilla.org → Release Engineering
Component: Platform Support → Buildduty
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard