Closed Bug 772531 Opened 12 years ago Closed 9 years ago

detect when a tegra reboots via log file parsing, mark test as red and allow tbpl to get accurate information

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: jmaher, Unassigned)

References

Details

Attachments

(4 files, 1 obsolete file)

report a real error from devicemanager when we receive a known failure state (1.0) 12 years ago Joel Maher ( :jmaher ) (UTC -8) (deleted), patch	wlach : review+	Details \| Diff \| Splinter Review
adjust global errors to parse for 'connection reset by peer' 12 years ago Jordan Lund (:jlund) (deleted), patch		Details \| Diff \| Splinter Review
checks for tegra reboot error and changes buildbot to red / failure 12 years ago Jordan Lund (:jlund) (deleted), patch	armenzg : review+ jmaher : feedback+ armenzg : checked-in+	Details \| Diff \| Splinter Review
print out the agent error message (1.0) 12 years ago Joel Maher ( :jmaher ) (UTC -8) (deleted), patch	wlach : review+	Details \| Diff \| Splinter Review
changes talos_error to RETRY instead of FAILURE 12 years ago Jordan Lund (:jlund) (deleted), patch	armenzg : review-	Details \| Diff \| Splinter Review

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Description

•

12 years ago

I have detected a pattern in our log files that show up when the tegra reboots:
Error receiving data from socket. cmd=ps
; err=[Errno 54] Connection reset by peer
reconnecting socket


This makes sense and causes us to report orange although no tests have failed.

If we detect this pattern in the log output, I would like to mark the test red, and display a message to TBPL that would incdicate "Android device rebooted (; err=[Errno 54] Connection reset by peer)"

Aki Sasaki (not active)

Comment 1

•

12 years ago

We would get this for free with mozharness.
There's also buildbotcustom parsing for unit tests.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 2

•

12 years ago

well, the real problem is why are we getting a high rate of reboots...bad tegras, bad sdcards, or tests doing troublesome things.  Mozharness would be awesome, and in the meantime, lets get this added to our infrastructure.

Ed Morley [:emorley]

Comment 3

•

12 years ago

The colour is determined by the status code that buildbot provides.

What TBPL shows in the annotated summary box is determined by:
http://hg.mozilla.org/users/mstange_themasta.com/tinderboxpushlog/file/e2e344885c80/php/inc/GeneralErrorFilter.php#l24

Alternatively, the log can just prefix the line with "TinderboxPrint: " and we'll have the message shown in the list of bullets below the annotated summary box.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 4

•

12 years ago

maybe the best method is to return a proper error from the harness and output "Automation error:" at the beginning of one of the lines.

Armen [:armenzg]

Comment 5

•

12 years ago

Detecting a line in the log and taking an action can be accomplished by:
1) by adding a log_eval_func to the specific step. [1]

This can be done by either adding it in-step [2] or on the errors.py [3] file and importing it.

The steps to be modified are in RemoteUnittestFactory [4]

To add "TinderboxPrint:" messages we can modify createSummary IIUC. jlund would know as he has been looking at that recently.

"Automation error:" can be added to errors.py.

[1]
log_eval_func=lambda c,s: regex_log_evaluator(c, s, talos_hgweb_errors),
[2]
http://hg.mozilla.org/build/buildbotcustom/file/default/misc.py#l419
   420             log_eval_func=lambda c,s: regex_log_evaluator(c, s, (
   420              (re.compile('# TBPL WARNING #'), WARNINGS),
   421              (re.compile('# TBPL FAILURE #'), FAILURE),
   422              (re.compile('# TBPL EXCEPTION #'), EXCEPTION),
   423              (re.compile('# TBPL RETRY #'), RETRY),
   424             ))
[3]
http://hg.mozilla.org/build/buildbotcustom/file/281fc2c2fa9e/status/errors.py
[4] http://hg.mozilla.org/build/buildbotcustom/file/default/process/factory.py#l6793

Ed Morley [:emorley]

Comment 6

•

12 years ago

(In reply to Joel Maher (:jmaher) from comment #4)
> maybe the best method is to return a proper error from the harness and
> output "Automation error:" at the beginning of one of the lines.

This would be my preference (over TinderboxPrint) :-)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 7

•

12 years ago

Attached patch report a real error from devicemanager when we receive a known failure state (1.0) (deleted) — Details — Splinter Review

this patch looks for a known error and delivers a tbpl parsable message.  This doesn't change the job red, but we can build from this.

Attachment #640728 - Flags: review?(wlachance)

William Lachance (:wlach)

Comment 8

•

12 years ago

Comment on attachment 640728 [details] [diff] [review]
report a real error from devicemanager when we receive a known failure state (1.0)

Idea of patch is good but could be improved in a few ways to be more robust and clear. I'll give an r+ assuming the below is implemented (or given a good reason why not). ;)

># HG changeset patch
># Parent 2e702f2948b447e46e28cb3094308c6846f254f6
>Bug 772531 - detect when a tegra reboots via log file parsing, mark test as red and allow tbpl to get accurate information. r=wlach
>
>diff -r 2e702f2948b4 build/mobile/devicemanagerSUT.py
>--- a/build/mobile/devicemanagerSUT.py	Tue Jul 10 14:56:47 2012 -0400
>+++ b/build/mobile/devicemanagerSUT.py	Tue Jul 10 15:46:43 2012 -0400
>@@ -213,6 +213,9 @@
>           except socket.error, msg:

Change this to:
            except socket.error, err:

(you'll see why below)

>             self._sock.close()
>             self._sock = None
>+            # This error shows up with we have our tegra rebooted.
>+            if str(msg.strip()) == "[Errno 54] Connection reset by peer":

I think it's better to just look for the errno, rather than examining a string whose format could change.

I'd suggest replacing this with:

>             if err[0] == errno.ECONNRESET:

(you'll also need to import errno on top)

>+              raise AgentError("Automation error: Error receiving data from socket (possible reboot). cmd="+str(cmd)+"; err="+str(msg), True)

I find this kind of concatenation less clear than the standard method, where we substitute in variables via '%'. Also, there's no need to cast to string explicitly. So try:

raise AgentError("Automation error: Error receiving data from socket (possible reboot). cmd=%s; err=%s" % (cmd, msg), True)

Attachment #640728 - Flags: review?(wlachance) → review+

Jordan Lund (:jlund)

Comment 9

•

12 years ago

I'd be glad to *try* add the glue needed for this from the buildbot end.

I have noticed that one of buildbots error checking for 'upload_errors' would have caught this if it was considered an upload step. However it would have set this to RETRY AKA Blue not red: https://github.com/mozilla/buildbotcustom/blob/master/status/errors.py#L28

I am wondering if the Tegra is rebooting, should we be going for the same(blue) and signalling for a Retry?

Are there any logs of this?

I can take a look into parsing Joel's recent append: 'Automation error: Error receiving data from socket' and change the color accordingly in buildbot.

Jordan Lund (:jlund)

Comment 10

•

12 years ago

Attached patch adjust global errors to parse for 'connection reset by peer' (obsolete) (deleted) — Details — Splinter Review

I wonder if adding something like this to global_errors would work without any further changes needed. 

This would set buildbots return status to RETRY and in return blue in tbpl.

I think for the tegra's unittests we are already passing global errors here:
http://mxr.mozilla.org/build/source/buildbotcustom/steps/base.py#17

as I dont think we are setting log_eval_func but I will certainly need confirmation from someone who knows more

Jordan Lund (:jlund)

Comment 11

•

12 years ago

I suppose this could have dangerous effects since its 'global' and lots of other builds would pass through it. I am not sure if 'connection reset by peer' appears in other logs and should be ignored AKA stay green/success

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 12

•

12 years ago

I am not sure either.  That is something we should test in staging with desktop builds.  If we see that on an Android build I would say we are safe to throw an error.  I have been testing this for a while today and about half the reboots fall into that category.

bhearsum@mozilla.com (:bhearsum)

Updated

•

12 years ago

Component: Release Engineering: Automation (Release Automation) → Release Engineering: Automation (General)

QA Contact: bhearsum → catlee

Jordan Lund (:jlund)

Comment 13

•

12 years ago

it would probably be safer to add an argument to the 'step' its rebooting and check for jmahers: 'Automation Error' via something among the lines of:

log_eval_func=lambda c,s: regex_log_evaluator(c, s, (
global_errors + ((re.compile("Automation error: Error receiving data from socket (possible reboot)"), FAILURE),)

Jordan Lund (:jlund)

Updated

•

12 years ago

Assignee: nobody → jlund

Jordan Lund (:jlund)

Comment 14

•

12 years ago

Attached patch checks for tegra reboot error and changes buildbot to red / failure (deleted) — Details — Splinter Review

this patch should be good for testing in staging *once* joel's patch lands as this new patch will not check for 'connection reset by peer' but instead looks for 'Automation error: Error'. 

I think this approach is cleaner and safer as it only targets tegra machines specifically for unittests.

Attachment #640768 - Attachment is obsolete: true

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 15

•

12 years ago

landed on inbound:
https://hg.mozilla.org/integration/mozilla-inbound/rev/a4b0d896e9ca

Ed Morley [:emorley]

Comment 16

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/a4b0d896e9ca

Status: NEW → RESOLVED

Closed: 12 years ago

Resolution: --- → FIXED

Armen [:armenzg]

Comment 17

•

12 years ago

We forgot to add a whiteboard to prevent devs from closing this bug.

jlund this is still an issue.
Can I review your bug?

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Armen [:armenzg]

Updated

•

12 years ago

Blocks: 771626

Jordan Lund (:jlund)

Updated

•

12 years ago

Attachment #641174 - Flags: review?(armenzg)

Armen [:armenzg]

Comment 18

•

12 years ago

Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure

Review of attachment 641174 [details] [diff] [review]:
-----------------------------------------------------------------

This looks great.

jmaher, do you want the jobs to turn red/FAILURE? or RETRY?

Attachment #641174 - Flags: review?(armenzg)

Attachment #641174 - Flags: review+

Attachment #641174 - Flags: feedback?(jmaher)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 19

•

12 years ago

Ideally retry, but I would really like a way to track this.  If they turn red and we can star them easily to track the frequency that would be fine as well.

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 20

•

12 years ago

Attached patch print out the agent error message (1.0) (deleted) — Details — Splinter Review

Attachment #643496 - Flags: review?(wlachance)

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 21

•

12 years ago

Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure

Review of attachment 641174 [details] [diff] [review]:
-----------------------------------------------------------------

looks good to me

Attachment #641174 - Flags: feedback?(jmaher) → feedback+

Armen [:armenzg]

Comment 22

•

12 years ago

I will land and reconfigure the masters in the morning.
I don't want to have any moving pieces while Callek is doing any investigations.

William Lachance (:wlach)

Comment 23

•

12 years ago

Comment on attachment 643496 [details] [diff] [review]
print out the agent error message (1.0)

Yep, this is the right thing.

Attachment #643496 - Flags: review?(wlachance) → review+

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 24

•

12 years ago

landed fix for printing the error message:
https://hg.mozilla.org/integration/mozilla-inbound/rev/27884b05391b

Ed Morley [:emorley]

Comment 25

•

12 years ago

https://hg.mozilla.org/mozilla-central/rev/27884b05391b

Status: REOPENED → RESOLVED

Closed: 12 years ago → 12 years ago

Resolution: --- → FIXED

Armen [:armenzg]

Comment 26

•

12 years ago

Comment on attachment 641174 [details] [diff] [review]
checks for tegra reboot error and changes buildbot to red / failure

I reconfigured the masters with jlund's flagging reboots code.
http://hg.mozilla.org/build/buildbotcustom/rev/def40730fbbe

Let's now hope that we catch *only* the right thing.

Attachment #641174 - Flags: checked-in+

Armen [:armenzg]

Comment 27

•

12 years ago

In case we have to land any other patches to m-i (adding [leave open]).
https://wiki.mozilla.org/Tree_Rules/Inbound#Please_do_the_following_after_pushing_to_inbound

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Whiteboard: [leave open]

Jordan Lund (:jlund)

Comment 28

•

12 years ago

Attached patch changes talos_error to RETRY instead of FAILURE (deleted) — Details — Splinter Review

armen. I think this is the problem. 

I am wondering is there a way I can debug what has been landed in buildbot or test this on dev-master. After a silly mess up, I would like to make sure I am reading and implementing this right. What do you think is best to do?

Attachment #643982 - Flags: review?(armenzg)

Armen [:armenzg]

Comment 29

•

12 years ago

Comment on attachment 643982 [details] [diff] [review]
changes talos_error to RETRY instead of FAILURE

I have thought about it a little more.
We don't have more capacity to RETRY and we don't know yet how often this happens.
We should RETRY but for now let's try to spot occurences.

jmaher, could you please get a patch when you can return an error code (2) rather than a WARNING code (1)?
I want to prevent to re-queue even more jobs.

This is in accordance to:
https://github.com/buildbot/buildbot/blob/master/master/buildbot/status/results.py#L16

Attachment #643982 - Flags: review?(armenzg) → review-

Ed Morley [:emorley]

Updated

•

12 years ago

Blocks: 778688

Ed Morley [:emorley]

Updated

•

12 years ago

Whiteboard: [leave open] → [leave open][sheriff-want]

Joel Maher ( :jmaher ) (UTC -8)

Reporter

Comment 30

•

12 years ago

I don't think this is critical anymore.  My patch to get the harnesses to bubble the information from devicemanager to sys.exit() in the harness is horribly hacky and didn't quite work so well.

I can leave this out there as a bug and tackle it some other day.

Ed Morley [:emorley]

Comment 31

•

12 years ago

Much of this bug has been covered by bug 793091, bug 790613, bug 790595, bug 794017, bug 793678 & bug 790618; to the extent that this bug doesn't affect sheriffs much now so removing bug 778688 from dependency chain.

No longer blocks: 778688

Whiteboard: [leave open][sheriff-want]

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Product: mozilla.org → Release Engineering

Jordan Lund (:jlund)

Comment 32

•

11 years ago

I am not actively working on this.

my focus/priority is primarily here:
    - Bug 858797 - fx desktop builds in mozharness

Assignee: jlund → nobody

Rail Aliiev [:rail]

Comment 33

•

9 years ago

Do we still want this?

Flags: needinfo?(coop)

Chris Cooper [:coop] (he/him)

Comment 34

•

9 years ago

Ew, tegra. Get it away from me!

Status: REOPENED → RESOLVED

Closed: 12 years ago → 9 years ago

Flags: needinfo?(coop)

Resolution: --- → WONTFIX

Nobody; OK to take it and work on it

Assignee

Updated

•

6 years ago

Component: General Automation → General

You need to log in before you can comment on or make changes to this bug.