Closed Bug 772531 Opened 12 years ago Closed 9 years ago

detect when a tegra reboots via log file parsing, mark test as red and allow tbpl to get accurate information

Categories

(Release Engineering :: General, defect)

ARM
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

Attachments

(4 files, 1 obsolete file)

I have detected a pattern in our log files that show up when the tegra reboots:
Error receiving data from socket. cmd=ps
; err=[Errno 54] Connection reset by peer
reconnecting socket


This makes sense and causes us to report orange although no tests have failed.

If we detect this pattern in the log output, I would like to mark the test red, and display a message to TBPL that would incdicate "Android device rebooted (; err=[Errno 54] Connection reset by peer)"
We would get this for free with mozharness.
There's also buildbotcustom parsing for unit tests.
well, the real problem is why are we getting a high rate of reboots...bad tegras, bad sdcards, or tests doing troublesome things.  Mozharness would be awesome, and in the meantime, lets get this added to our infrastructure.
The colour is determined by the status code that buildbot provides.

What TBPL shows in the annotated summary box is determined by:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/e2e344885c80/php/inc/GeneralErrorFilter.php#l24

Alternatively, the log can just prefix the line with "TinderboxPrint: " and we'll have the message shown in the list of bullets below the annotated summary box.
maybe the best method is to return a proper error from the harness and output "Automation error:" at the beginning of one of the lines.
Detecting a line in the log and taking an action can be accomplished by:
1) by adding a log_eval_func to the specific step. [1]

This can be done by either adding it in-step [2] or on the errors.py [3] file and importing it.

The steps to be modified are in RemoteUnittestFactory [4]

To add "TinderboxPrint:" messages we can modify createSummary IIUC. jlund would know as he has been looking at that recently.

"Automation error:" can be added to errors.py.

[1]
log_eval_func=lambda c,s: regex_log_evaluator(c, s, talos_hgweb_errors),
[2]
http://hg.mozilla.org/build/buildbotcustom/file/default/misc.py#l419
   420             log_eval_func=lambda c,s: regex_log_evaluator(c, s, (
   420              (re.compile('# TBPL WARNING #'), WARNINGS),
   421              (re.compile('# TBPL FAILURE #'), FAILURE),
   422              (re.compile('# TBPL EXCEPTION #'), EXCEPTION),
   423              (re.compile('# TBPL RETRY #'), RETRY),
   424             ))
[3]
http://hg.mozilla.org/build/buildbotcustom/file/281fc2c2fa9e/status/errors.py
[4] http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l6793
(In reply to Joel Maher (:jmaher) from comment #4)
> maybe the best method is to return a proper error from the harness and
> output "Automation error:" at the beginning of one of the lines.

This would be my preference (over TinderboxPrint) :-)
this patch looks for a known error and delivers a tbpl parsable message.  This doesn't change the job red, but we can build from this.
Attachment #640728 - Flags: review?(wlachance)
Comment on attachment 640728 [details] [diff] [review]
report a real error from devicemanager when we receive a known failure state (1.0)

Idea of patch is good but could be improved in a few ways to be more robust and clear. I'll give an r+ assuming the below is implemented (or given a good reason why not). ;)

># HG changeset patch
># Parent 2e702f2948b447e46e28cb3094308c6846f254f6
>Bug 772531 - detect when a tegra reboots via log file parsing, mark test as red and allow tbpl to get accurate information. r=wlach
>
>diff -r 2e702f2948b4 build/mobile/devicemanagerSUT.py
>--- a/build/mobile/devicemanagerSUT.py	Tue Jul 10 14:56:47 2012 -0400
>+++ b/build/mobile/devicemanagerSUT.py	Tue Jul 10 15:46:43 2012 -0400
>@@ -213,6 +213,9 @@
>           except socket.error, msg:

Change this to:
            except socket.error, err:

(you'll see why below)

>             self._sock.close()
>             self._sock = None
>+            # This error shows up with we have our tegra rebooted.
>+            if str(msg.strip()) == "[Errno 54] Connection reset by peer":

I think it's better to just look for the errno, rather than examining a string whose format could change.

I'd suggest replacing this with:

>             if err[0] == errno.ECONNRESET:

(you'll also need to import errno on top)

>+              raise AgentError("Automation error: Error receiving data from socket (possible reboot). cmd="+str(cmd)+"; err="+str(msg), True)

I find this kind of concatenation less clear than the standard method, where we substitute in variables via '%'. Also, there's no need to cast to string explicitly. So try:

raise AgentError("Automation error: Error receiving data from socket (possible reboot). cmd=%s; err=%s" % (cmd, msg), True)
Attachment #640728 - Flags: review?(wlachance) → review+
I'd be glad to *try* add the glue needed for this from the buildbot end.

I have noticed that one of buildbots error checking for 'upload_errors' would have caught this if it was considered an upload step. However it would have set this to RETRY AKA Blue not red: https://github.com/mozilla/buildbotcustom/blob/master/status/errors.py#L28

I am wondering if the Tegra is rebooting, should we be going for the same(blue) and signalling for a Retry?

Are there any logs of this?

I can take a look into parsing Joel's recent append: 'Automation error: Error receiving data from socket' and change the color accordingly in buildbot.
I wonder if adding something like this to global_errors would work without any further changes needed. 

This would set buildbots return status to RETRY and in return blue in tbpl.

I think for the tegra's unittests we are already passing global errors here:
http://mxr.mozilla.org/build/source/buildbotcustom/steps/base.py#17

as I dont think we are setting log_eval_func but I will certainly need confirmation from someone who knows more
I suppose this could have dangerous effects since its 'global' and lots of other builds would pass through it. I am not sure if 'connection reset by peer' appears in other logs and should be ignored AKA stay green/success
I am not sure either.  That is something we should test in staging with desktop builds.  If we see that on an Android build I would say we are safe to throw an error.  I have been testing this for a while today and about half the reboots fall into that category.
Component: Release Engineering: Automation (Release Automation) → Release Engineering: Automation (General)
QA Contact: bhearsum → catlee
it would probably be safer to add an argument to the 'step' its rebooting and check for jmahers: 'Automation Error' via something among the lines of:

log_eval_func=lambda c,s: regex_log_evaluator(c, s, (
global_errors + ((re.compile("Automation error: Error receiving data from socket (possible reboot)"), FAILURE),)
Assignee: nobody → jlund
this patch should be good for testing in staging *once* joel's patch lands as this new patch will not check for 'connection reset by peer' but instead looks for 'Automation error: Error'. 

I think this approach is cleaner and safer as it only targets tegra machines specifically for unittests.
Attachment #640768 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/a4b0d896e9ca
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
We forgot to add a whiteboard to prevent devs from closing this bug.

jlund this is still an issue.
Can I review your bug?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Blocks: 771626
Attachment #641174 - Flags: review?(armenzg)
Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure

Review of attachment 641174 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great.

jmaher, do you want the jobs to turn red/FAILURE? or RETRY?
Attachment #641174 - Flags: review?(armenzg)
Attachment #641174 - Flags: review+
Attachment #641174 - Flags: feedback?(jmaher)
Ideally retry, but I would really like a way to track this.  If they turn red and we can star them easily to track the frequency that would be fine as well.
Attachment #643496 - Flags: review?(wlachance)
Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure

Review of attachment 641174 [details] [diff] [review]:
-----------------------------------------------------------------

looks good to me
Attachment #641174 - Flags: feedback?(jmaher) → feedback+
I will land and reconfigure the masters in the morning.
I don't want to have any moving pieces while Callek is doing any investigations.
Comment on attachment 643496 [details] [diff] [review]
print out the agent error message (1.0)

Yep, this is the right thing.
Attachment #643496 - Flags: review?(wlachance) → review+
https://hg.mozilla.org/mozilla-central/rev/27884b05391b
Status: REOPENED → RESOLVED
Closed: 12 years ago12 years ago
Resolution: --- → FIXED
Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure

I reconfigured the masters with jlund's flagging reboots code.
http://hg.mozilla.org/build/buildbotcustom/rev/def40730fbbe

Let's now hope that we catch *only* the right thing.
Attachment #641174 - Flags: checked-in+
In case we have to land any other patches to m-i (adding [leave open]).
https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Whiteboard: [leave open]
armen. I think this is the problem. 

I am wondering is there a way I can debug what has been landed in buildbot or test this on dev-master. After a silly mess up, I would like to make sure I am reading and implementing this right. What do you think is best to do?
Attachment #643982 - Flags: review?(armenzg)
Comment on attachment 643982 [details] [diff] [review]
changes talos_error to RETRY instead of FAILURE

I have thought about it a little more.
We don't have more capacity to RETRY and we don't know yet how often this happens.
We should RETRY but for now let's try to spot occurences.

jmaher, could you please get a patch when you can return an error code (2) rather than a WARNING code (1)?
I want to prevent to re-queue even more jobs.

This is in accordance to:
https://github.com/buildbot/buildbot/blob/master/master/buildbot/status/results.py#L16
Attachment #643982 - Flags: review?(armenzg) → review-
Blocks: 778688
Whiteboard: [leave open] → [leave open][sheriff-want]
I don't think this is critical anymore.  My patch to get the harnesses to bubble the information from devicemanager to sys.exit() in the harness is horribly hacky and didn't quite work so well.

I can leave this out there as a bug and tackle it some other day.
Much of this bug has been covered by bug 793091, bug 790613, bug 790595, bug 794017, bug 793678 & bug 790618; to the extent that this bug doesn't affect sheriffs much now so removing bug 778688 from dependency chain.
No longer blocks: 778688
Whiteboard: [leave open][sheriff-want]
Product: mozilla.org → Release Engineering
I am not actively working on this.

my focus/priority is primarily here:
    - Bug 858797 - fx desktop builds in mozharness
Assignee: jlund → nobody
Do we still want this?
Flags: needinfo?(coop)
Ew, tegra. Get it away from me!
Status: REOPENED → RESOLVED
Closed: 12 years ago9 years ago
Flags: needinfo?(coop)
Resolution: --- → WONTFIX
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: