Closed Bug 474950 Opened 16 years ago Closed 16 years ago

logs of talos builds all look like they end with a networking problem

Categories

(Release Engineering :: General, defect, P2)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dbaron, Assigned: anodelman)

Details

Attachments

(1 file)

The actual problem in bug 474915 went undetected by at least two people because they looked at the logs showing up on tinderbox and misinterpreted them, because talos runs now all end with:
Reached max number of runs before reboot required, restarting machine...
[Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
The last bit there is the thing people expect when a networking problem causes a build to go orange, so these real crashes were dismissed as networking issues. These logs should not end with this failure, since people look for the last failure in the log to see why a build went orange.
I added the "Reached max number of runs before reboot required, restarting machine..." message for just this reason. Suggestions on rewording the message, or making it more obvious, are welcome.
Unfortunately, I don't think there's a good way to have the logs NOT end with this message, since we are telling the machine to reboot once it's done.
Ideally, we could get the tinderbox error parser to highlight the relevant failure lines. If need be, we can make an ep_talos.pl to highlight Talos-specific errors. If they show up in the summary, then it's fairly clear what happened. The unit test error parser looks like this: http://mxr.mozilla.org/mozilla/source/webtools/tinderbox/ep_unittest.pl
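To illustrate the idea of highlighting only the relevant failure lines, here is a minimal Python sketch of a log filter. It is not ep_unittest.pl or any actual tinderbox parser: the function name `relevant_error_lines` and the error patterns are assumptions made up for this example, and only the reboot marker text is taken from this bug.

```python
import re

# Marker text quoted in this bug; everything after it is the expected disconnect.
EXPECTED_REBOOT_MARKER = "Reached max number of runs before reboot required"

# Hypothetical patterns for "error-looking" lines; a real parser may differ.
ERROR_PATTERN = re.compile(
    r"(Traceback|Failure instance|\bFAIL\b|\bcrash)", re.IGNORECASE
)


def relevant_error_lines(log_text):
    """Return error-looking lines that appear before the scheduled-reboot marker."""
    lines = log_text.splitlines()
    for index, line in enumerate(lines):
        if EXPECTED_REBOOT_MARKER in line:
            # Drop the marker and everything after it, including the
            # ConnectionLost failure caused by the reboot itself.
            lines = lines[:index]
            break
    return [line for line in lines if ERROR_PATTERN.search(line)]
```

With a filter along these lines, a summary would show the real crash and leave out the trailing ConnectionLost failure.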
(In reply to comment #3)
> Ideally, we could get the tinderbox error parser to highlight the relevant
> failure lines. If need be, we can make an ep_talos.pl to highlight
> Talos-specific errors. If they show up in the summary, then it's fairly clear
> what happened.
I'm not optimistic about tinderbox changes like that, based on experiences like bug#454055, so I don't want to start down that path unless we know it's going to be accepted.
Right now, I'm tempted to WONTFIX this. However, if it would help to change the text of what nthomas added in comment#1 to be even more clear (for example "***IGNORE THE FOLLOWING ERROR"), we could do that.
(In reply to comment #4)
> I'm not optimistic about tinderbox changes like that, based on experiences like
> bug#454055, so dont want to start down that path unless we know its going to be
> accepted.
I added ep_unittest.pl in bug 394250, so this is pretty non-contentious.
> Right now, I'm tempted to WONTFIX this. However, if it would help to change the
> text of what nthomas added in comment#1 to be even more clear (for example
> "***IGNORE THE FOLLOWING ERROR"), we could do that.
I think WONTFIX is a bad idea, as this is clearly affecting developers.
(In reply to comment #5)
> (In reply to comment #4)
> > Right now, I'm tempted to WONTFIX this. However, if it would help to change the
> > text of what nthomas added in comment#1 to be even more clear (for example
> > "***IGNORE THE FOLLOWING ERROR"), we could do that.
>
> I think WONTFIX is a bad idea, as this is clearly affecting developers.
dbaron/ted: I don't know if it's possible to easily suppress the last disconnect message. However, changing the text of comment#1 to what's in comment#4 (or to something else that you prefer) is something we can do easily enough, if that helps.
Yeah, something in bold letters clearly stating that the following error is expected and should be ignored would be enough I think.
Would this be ok? ***** END OF RUN - NOW DOING SCHEDULED REBOOT *****
OS: Mac OS X → All
Please explicitly mention that there is an error that is expected and that should be ignored; that is the important part for people looking at the log.
Would this be ok? ***** END OF RUN - NOW DOING SCHEDULED REBOOT; FOLLOWING ERROR MSG EXPECTED *****
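As a rough sketch of where such a message would be emitted, the Python below counts completed runs and hands back the banner once a reboot is due. This is not the contents of count_and_reboot.py, which are not shown in this bug: `finish_run`, `REBOOT_BANNER`, and the counter handling are hypothetical, and only the banner wording comes from the proposal above.

```python
# Banner wording proposed in this bug; all names below are hypothetical.
REBOOT_BANNER = (
    "***** END OF RUN - NOW DOING SCHEDULED REBOOT; "
    "FOLLOWING ERROR MSG EXPECTED *****"
)


def finish_run(completed_runs, max_runs):
    """Record one finished run.

    Returns (new_count, message). message is the banner when the run limit
    has been reached and a reboot is due, otherwise None.
    """
    completed_runs += 1
    if completed_runs >= max_runs:
        # Reset the counter so counting starts over after the reboot.
        return 0, REBOOT_BANNER
    return completed_runs, None
```

A caller would print and flush the returned message before issuing the reboot, so that the banner reaches the log ahead of the lost connection.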
Component: Release Engineering: Talos → Release Engineering
Assignee: nobody → anodelman
Priority: -- → P2
Doing some tests on talos stage just to make sure that everything works as expected.
Attachment #360997 - Flags: review?(aki) → review+
Comment on attachment 360997
[Checked in]better warning message before scheduled talos reboots
Checking in perf-staging/scripts/count_and_reboot.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perf-staging/scripts/count_and_reboot.py,v <-- count_and_reboot.py
new revision: 1.2; previous revision: 1.1
done
Checking in perfmaster2/scripts/count_and_reboot.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/scripts/count_and_reboot.py,v <-- count_and_reboot.py
new revision: 1.3; previous revision: 1.2
done
Attachment #360997 - Attachment description: better warning message before scheduled talos reboots → [Checked in]better warning message before scheduled talos reboots
Attachment #360997 - Flags: checked‑in+ checked‑in+
Pushed to production, should show up on upcoming talos cycles.
Now appearing in talos logs.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Admittedly, I just skimmed, but when we get a long log in the short log, I tend to press End, then read the last line and work upwards until I find an error.
What I suggested in comment 3 would probably help. (It would get us a short log.)
Also, putting a couple of newlines *before* the current message, but no newline after it, would probably help. Skimming is something we should take into account.
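The spacing suggested here could look like the Python sketch below. `format_banner` is a hypothetical helper, not part of the checked-in script, and whether the following error then lands directly after the message depends on how the log writer emits it, which is an assumption here.

```python
def format_banner(message, leading_blank_lines=2):
    """Prefix the message with blank lines and drop any trailing newline."""
    return "\n" * leading_blank_lines + message.rstrip("\n")
```

Writing the result with `sys.stdout.write` rather than `print` avoids adding a newline back at the end.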
I filed bug 489523 (with a new suggestion of mine) and bug 489524 (with comment 3) as followups.
Product: mozilla.org → Release Engineering