Closed Bug 474950 (opened 16 years ago, closed 16 years ago)

logs of talos builds all look like they end with a networking problem

Tracking

(Not tracked)

Status:

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: anodelman)

Details

Attachments

(1 file)

[Checked in]better warning message before scheduled talos reboots (patch, deleted), 16 years ago, alice nodelman [:alice] [:anode]. Flags: mozilla: review+, anodelman: checked-in+

David Baron :dbaron:

Reporter

Description

•

16 years ago

The actual problem in bug 474915 went undetected by at least two people because they looked at the logs showing up on tinderbox and misinterpreted them. Talos runs now all end with:

    Reached max number of runs before reboot required, restarting machine...
    [Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
    ]

The last bit there is what people expect to see when a networking problem causes a build to go orange, so these real crashes were dismissed as networking issues. These logs should not end with this failure, since people look for the last failure in the log to see why a build went orange.

Nick Thomas [:nthomas] (UTC+12)

Comment 1

•

16 years ago

I added the "Reached max number of runs before reboot required, restarting machine..." message for just this reason. Suggestions on rewording the message, or making it more obvious, are welcome.

Chris AtLee [:catlee]

Comment 2

•

16 years ago

Unfortunately, I don't think there's a good way to have the logs NOT end with this message, since we are telling the machine to reboot once it's done.

(not currently active) Ted Mielczarek

Comment 3

•

16 years ago

Ideally, we could get the tinderbox error parser to highlight the relevant failure lines. If need be, we can make an ep_talos.pl to highlight Talos-specific errors. If they show up in the summary, then it's fairly clear what happened. The unit test error parser looks like this: http://mxr.mozilla.org/mozilla/source/webtools/tinderbox/ep_unittest.pl
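As a rough illustration of the idea, here is a minimal sketch in Python. It is not the tinderbox parser API: the real error parsers such as ep_unittest.pl are Perl scripts, and the function name `highlight_failures` and the failure patterns below are assumptions made up for this example. Only the reboot banner text is taken from the log quoted in the description.

```python
import re

# Hypothetical failure patterns; the strings real Talos runs emit may differ.
FAILURE_PATTERNS = [
    re.compile(r"FAIL"),
    re.compile(r"Traceback \(most recent call last\)"),
    re.compile(r"crash", re.IGNORECASE),
]

# Start of the banner printed before the scheduled reboot (from the description).
REBOOT_BANNER = "Reached max number of runs before reboot required"


def highlight_failures(log_lines):
    """Return (line_number, text) pairs worth showing in a log summary.

    Everything from the reboot banner onward is ignored, because the
    ConnectionLost failure that follows it is expected.
    """
    highlighted = []
    for number, text in enumerate(log_lines, start=1):
        if REBOOT_BANNER in text:
            break  # the rest of the log is the expected disconnect
        if any(pattern.search(text) for pattern in FAILURE_PATTERNS):
            highlighted.append((number, text.rstrip("\n")))
    return highlighted
```

With a summary built this way, a real crash earlier in the run would be listed, while the disconnect caused by the reboot would not appear at all.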

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 4

•

16 years ago

(In reply to comment #3)
> Ideally, we could get the tinderbox error parser to highlight the relevant
> failure lines. If need be, we can make an ep_talos.pl to highlight
> Talos-specific errors. If they show up in the summary, then it's fairly clear
> what happened.

I'm not optimistic about tinderbox changes like that, based on experiences like bug#454055, so I don't want to start down that path unless we know it's going to be accepted.

Right now, I'm tempted to WONTFIX this. However, if it would help to change the text of what nthomas added in comment#1 to be even clearer (for example "***IGNORE THE FOLLOWING ERROR"), we could do that.

(not currently active) Ted Mielczarek

Comment 5

•

16 years ago

(In reply to comment #4)
> I'm not optimistic about tinderbox changes like that, based on experiences like
> bug#454055, so I don't want to start down that path unless we know it's going to be
> accepted.

I added ep_unittest.pl in bug 394250, so this is pretty non-contentious.

> Right now, I'm tempted to WONTFIX this. However, if it would help to change the
> text of what nthomas added in comment#1 to be even clearer (for example
> "***IGNORE THE FOLLOWING ERROR"), we could do that.

I think WONTFIX is a bad idea, as this is clearly affecting developers.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 6

•

16 years ago

(In reply to comment #5)
> (In reply to comment #4)
> > Right now, I'm tempted to WONTFIX this. However, if it would help to change the
> > text of what nthomas added in comment#1 to be even clearer (for example
> > "***IGNORE THE FOLLOWING ERROR"), we could do that.
>
> I think WONTFIX is a bad idea, as this is clearly affecting developers.

dbaron/ted: I don't know if it's possible to easily suppress the last disconnect message. However, changing the text of comment#1 to what's in comment#4 (or to something else that you prefer) is something we can do easily enough, if that helps.

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 7

•

16 years ago

Yeah, something in bold letters clearly stating that the following error is expected and should be ignored would be enough I think.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 8

•

16 years ago

Would this be ok?

***** END OF RUN - NOW DOING SCHEDULED REBOOT *****

OS: Mac OS X → All

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 9

•

16 years ago

Please explicitly mention that there is an error that is expected and that should be ignored; that is the important part for people looking at the log.

John O'Duinn [:joduinn] (please use "needinfo?" flag)

Comment 10

•

16 years ago

Would this be ok?

***** END OF RUN - NOW DOING SCHEDULED REBOOT; FOLLOWING ERROR MSG EXPECTED *****
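A minimal sketch of how a reboot script might print this notice, assuming Python. The names `REBOOT_NOTICE` and `announce_reboot` are hypothetical and are not taken from the actual script; only the message wording comes from this comment.

```python
import sys

# Wording proposed in this comment.
REBOOT_NOTICE = (
    "***** END OF RUN - NOW DOING SCHEDULED REBOOT; "
    "FOLLOWING ERROR MSG EXPECTED *****"
)


def announce_reboot(stream=sys.stdout):
    """Write the notice and flush it before the reboot is triggered."""
    stream.write(REBOOT_NOTICE + "\n")
    # Flush explicitly so the notice is not left sitting in a buffer
    # when the machine goes down and the connection is lost.
    stream.flush()
```

The explicit flush is the main design point here: if the notice stayed buffered, the log could end with the disconnect failure and no explanation before it.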

Component: Release Engineering: Talos → Release Engineering

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 11

•

16 years ago

Sounds great

alice nodelman [:alice] [:anode]

Assignee

Updated

•

16 years ago

Assignee: nobody → anodelman

Priority: -- → P2

alice nodelman [:alice] [:anode]

Assignee

Comment 12

•

16 years ago

Doing some tests on talos stage just to make sure that everything works as expected.

alice nodelman [:alice] [:anode]

Assignee

Comment 13

•

16 years ago

Attached patch [Checked in]better warning message before scheduled talos reboots (deleted)

Attachment #360997 - Flags: review?(aki)

Aki Sasaki (not active)

Updated

•

16 years ago

Attachment #360997 - Flags: review?(aki) → review+

alice nodelman [:alice] [:anode]

Assignee

Comment 14

•

16 years ago

Comment on attachment 360997
[Checked in]better warning message before scheduled talos reboots

Checking in perf-staging/scripts/count_and_reboot.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perf-staging/scripts/count_and_reboot.py,v  <--  count_and_reboot.py
new revision: 1.2; previous revision: 1.1
done
Checking in perfmaster2/scripts/count_and_reboot.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/scripts/count_and_reboot.py,v  <--  count_and_reboot.py
new revision: 1.3; previous revision: 1.2
done

Attachment #360997 - Attachment description: better warning message before scheduled talos reboots → [Checked in]better warning message before scheduled talos reboots

Attachment #360997 - Flags: checked-in+

alice nodelman [:alice] [:anode]

Assignee

Comment 15

•

16 years ago

Pushed to production, should show up on upcoming talos cycles.

alice nodelman [:alice] [:anode]

Assignee

Comment 16

•

16 years ago

Now appearing in talos logs.

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

David Baron :dbaron:

Reporter

Comment 17

•

16 years ago

Wasn't good enough to not trick sdwilsh: http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1235077383.1235080074.5935.gz

Shawn Wilsher :sdwilsh

Comment 18

•

16 years ago

Admittedly, I just skimmed, but when we get a long log in the short log, I tend to press end, and then read the last line and then upwards until I find an error.

(not currently active) Ted Mielczarek

Comment 19

•

16 years ago

What I suggested in comment 3 would probably help. (It would get us a short log.)

Jonas Sicking (:sicking) No longer reading bugmail consistently

Comment 20

•

16 years ago

Also, putting a couple of newlines *before* the current message, but no newline after it, would probably help. Skimming is something we should take into account.
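One way to read this suggestion, sketched in Python. The helper name `format_reboot_notice` and its `blank_lines_before` parameter are hypothetical, and the sketch assumes the preceding log output already ends with a newline.

```python
def format_reboot_notice(message, blank_lines_before=2):
    """Return the notice preceded by blank lines and with no trailing newline.

    The leading newlines set the notice apart from earlier output. Leaving
    off the trailing newline means whatever is written next (the expected
    failure) continues on the same line as the notice.
    """
    return "\n" * blank_lines_before + message
```

Someone who jumps to the end of the log and reads the last line would then see the notice and the expected failure together.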

David Baron :dbaron:

Reporter

Comment 21

•

16 years ago

I filed bug 489523 (with a new suggestion of mine) and bug 489524 (with comment 3) as followups.

Nobody; OK to take it and work on it

Updated

•

11 years ago

Product: mozilla.org → Release Engineering
