Closed
Bug 474950
Opened 16 years ago
Closed 16 years ago
logs of talos builds all look like they end with a networking problem
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dbaron, Assigned: anodelman)
Details
Attachments (1 file)
patch (deleted) | mozilla: review+ | anodelman: checked-in+
The actual problem in bug 474915 went undetected by at least two people who looked at the logs on tinderbox and misinterpreted them, because all talos runs now end with:
Reached max number of runs before reboot required, restarting machine...
[Failure instance: Traceback (failure with no frames): twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion.
]
The last bit there is the thing people expect when a networking problem causes a build to go orange, so these real crashes were dismissed as networking issues.
These logs should not end with this failure, since people look for the last failure in the log to see why a build went orange.
Comment 1•16 years ago
I added the
Reached max number of runs before reboot required, restarting machine...
for just this reason. Suggestions for rewording the message, or making it more obvious, are welcome.
Comment 2•16 years ago
Unfortunately, I don't think there's a good way to have the logs NOT end with this message, since we are telling the machine to reboot once it's done.
Comment 3•16 years ago
Ideally, we could get the tinderbox error parser to highlight the relevant failure lines. If need be, we can make an ep_talos.pl to highlight Talos-specific errors. If they show up in the summary, then it's fairly clear what happened.
The unit test error parser looks like this:
http://mxr.mozilla.org/mozilla/source/webtools/tinderbox/ep_unittest.pl
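The idea of a Talos-specific error summary can be sketched roughly as follows. This is an illustration in Python, not the tinderbox parser (which is written in Perl), and the patterns in `ERROR_PATTERNS` are hypothetical, since this bug does not list the actual Talos failure strings. The only text taken from this bug is the reboot message from comment 1.

```python
import re

# Hypothetical patterns: the real Talos failure strings are not given in this bug.
ERROR_PATTERNS = [
    re.compile(r"\bcrash", re.IGNORECASE),
    re.compile(r"^FAIL\b"),
    re.compile(r"Traceback"),
]

# Wording taken from comment 1; output after this line is expected reboot noise.
REBOOT_BANNER = "Reached max number of runs before reboot required"


def summarize_errors(log_lines):
    """Return (line_number, text) pairs for lines that look like real errors.

    Scanning stops at the reboot banner, so the expected ConnectionLost
    traceback that follows it never reaches the summary.
    """
    summary = []
    for number, line in enumerate(log_lines, start=1):
        if REBOOT_BANNER in line:
            break
        if any(p.search(line) for p in ERROR_PATTERNS):
            summary.append((number, line.rstrip("\n")))
    return summary
```

With a summary like this, the disconnect traceback at the end of the log would not appear in the short view at all, so only failures from the test run itself would be highlighted.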
Comment 4•16 years ago
(In reply to comment #3)
> Ideally, we could get the tinderbox error parser to highlight the relevant
> failure lines. If need be, we can make an ep_talos.pl to highlight
> Talos-specific errors. If they show up in the summary, then it's fairly clear
> what happened.
I'm not optimistic about tinderbox changes like that, based on experiences like bug#454055, so I don't want to start down that path unless we know it's going to be accepted.
Right now, I'm tempted to WONTFIX this. However, if it would help to change the text of what nthomas added in comment#1 to be even more clear (for example "***IGNORE THE FOLLOWING ERROR"), we could do that.
Comment 5•16 years ago
(In reply to comment #4)
> I'm not optimistic about tinderbox changes like that, based on experiences like
> bug#454055, so don't want to start down that path unless we know it's going to be
> accepted.
I added ep_unittest.pl in bug 394250, so this is pretty non-contentious.
> Right now, I'm tempted to WONTFIX this. However, if it would help to change the
> text of what nthomas added in comment#1 to be even more clear (for example
> "***IGNORE THE FOLLOWING ERROR"), we could do that.
I think WONTFIX is a bad idea, as this is clearly affecting developers.
Comment 6•16 years ago
(In reply to comment #5)
> (In reply to comment #4)
> > Right now, I'm tempted to WONTFIX this. However, if it would help to change the
> > text of what nthomas added in comment#1 to be even more clear (for example
> > "***IGNORE THE FOLLOWING ERROR"), we could do that.
>
> I think WONTFIX is a bad idea, as this is clearly affecting developers.
dbaron/ted: I don't know if it's possible to easily suppress the last disconnect message. However, changing the text from comment#1 to what's in comment#4 (or to something else that you prefer) is something we can do easily enough, if that helps.
Yeah, something in bold letters clearly stating that the following error is expected and should be ignored would be enough, I think.
Comment 8•16 years ago
Would this be ok?
***** END OF RUN - NOW DOING SCHEDULED REBOOT *****
OS: Mac OS X → All
Please explicitly mention that the error that follows is expected and should be ignored; that is the important part for people looking at the log.
Comment 10•16 years ago
Would this be ok?
***** END OF RUN - NOW DOING SCHEDULED REBOOT; FOLLOWING ERROR MSG EXPECTED *****
Component: Release Engineering: Talos → Release Engineering
Sounds great
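For illustration, the agreed wording could be emitted by a run-counting script along these lines. This is a minimal sketch and not the contents of the real count_and_reboot.py, which this bug does not show; the counter file, `MAX_RUNS`, and the `reboot` and `out` parameters are all assumptions made for the example. Only the banner text comes from comment 10.

```python
import os

MAX_RUNS = 5  # hypothetical threshold, not taken from the real script

# Banner wording proposed in comment 10.
BANNER = ("***** END OF RUN - NOW DOING SCHEDULED REBOOT; "
          "FOLLOWING ERROR MSG EXPECTED *****")


def count_and_maybe_reboot(counter_path, max_runs=MAX_RUNS, reboot=None, out=print):
    """Increment a persisted run counter; announce and request a reboot at the limit.

    Returns True when a reboot was requested, False otherwise.
    """
    runs = 0
    if os.path.exists(counter_path):
        with open(counter_path) as f:
            runs = int(f.read().strip() or 0)
    runs += 1
    if runs < max_runs:
        with open(counter_path, "w") as f:
            f.write(str(runs))
        return False
    # Reset the counter before rebooting so the next boot starts from zero.
    with open(counter_path, "w") as f:
        f.write("0")
    # Print the banner last, so it sits right before the expected disconnect error.
    out(BANNER)
    if reboot is not None:
        reboot()  # hypothetical callback; a real script would restart the machine
    return True
```

Passing `reboot` and `out` as parameters is only there so the sketch can be exercised without restarting anything.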
Updated•16 years ago (Assignee)
Assignee: nobody → anodelman
Priority: -- → P2
Comment 12•16 years ago (Assignee)
Doing some tests on talos stage just to make sure that everything works as expected.
Comment 13•16 years ago (Assignee)
Attachment #360997 - Flags: review?(aki)
Updated•16 years ago
Attachment #360997 - Flags: review?(aki) → review+
Comment 14•16 years ago (Assignee)
Comment on attachment 360997
[Checked in]better warning message before scheduled talos reboots
Checking in perf-staging/scripts/count_and_reboot.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perf-staging/scripts/count_and_reboot.py,v <-- count_and_reboot.py
new revision: 1.2; previous revision: 1.1
done
Checking in perfmaster2/scripts/count_and_reboot.py;
/cvsroot/mozilla/tools/buildbot-configs/testing/talos/perfmaster/scripts/count_and_reboot.py,v <-- count_and_reboot.py
new revision: 1.3; previous revision: 1.2
done
Attachment #360997 - Attachment description: better warning message before scheduled talos reboots → [Checked in]better warning message before scheduled talos reboots
Attachment #360997 - Flags: checked‑in+
Comment 15•16 years ago (Assignee)
Pushed to production; it should show up in upcoming talos cycles.
Comment 16•16 years ago (Assignee)
Now appearing in talos logs.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 17•16 years ago (Reporter)
The new message wasn't clear enough to avoid tricking sdwilsh:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1235077383.1235080074.5935.gz
Comment 18•16 years ago
Admittedly, I just skimmed, but when we get a long log in the short log, I tend to press End, read the last line, and then work upwards until I find an error.
Comment 19•16 years ago
What I suggested in comment 3 would probably help. (It would get us a short log.)
Also, putting a couple of newlines *before* the current message, but no newline after it, would probably help. Skimming is something we should take into account.
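The spacing suggestion might look like the sketch below. It assumes "no newline after it" means no blank line between the notice and the error that follows, so the notice still ends with a single line terminator; the function name and the default of two leading blank lines are hypothetical.

```python
def format_reboot_notice(message, leading_blank_lines=2):
    """Return the notice set off by blank lines above and none below.

    The result ends with exactly one newline, so whatever the log prints
    next (the expected disconnect error) starts on the very next line.
    """
    return "\n" * leading_blank_lines + message.strip() + "\n"
```

Someone reading upwards from the end of the log would then hit the notice immediately above the traceback, with a visual gap separating both from the test output.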
Comment 21•16 years ago (Reporter)
Updated•11 years ago
Product: mozilla.org → Release Engineering