Bug 467322
Opened 16 years ago
Closed 16 years ago
buildbot slaves occasionally lose connection to production-master.b.m.o
Categories
(Release Engineering :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: Gavin, Assigned: joduinn)
Details
Failed red with:
remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
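For anyone tallying how often this failure shows up, here is a minimal sketch that counts failures by exception class in log text. It assumes each failure line has the `<class '...'>: message` shape quoted above; the names `FAILURE_RE` and `summarize_failures` are hypothetical and not part of buildbot or Twisted.

```python
import re
from collections import Counter

# Assumption: failure lines look like the ones quoted in this bug,
# i.e. they contain "<class 'dotted.ExceptionName'>: message".
FAILURE_RE = re.compile(r"<class '([\w.]+)'>:\s*(.*)")


def summarize_failures(log_text):
    """Return a Counter mapping exception class name to occurrence count."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILURE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The two failure lines from the description above.
sample = """remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]
[Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion.
]"""

counts = summarize_failures(sample)
print(counts)
```

Run against a longer log, this would give a rough per-exception count, which may help tell whether dropped connections are getting more or less frequent.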
Comment 1•16 years ago
Should be helped by bug 466499, which is to happen early Monday morning.
Updated•16 years ago
Summary: Linux mozilla-1.9.1 build busted → buildbot slaves occasionally lose connection to their master
Comment 2•16 years ago
Hmm, apparently that didn't help. See bug 467634 for the latest theory.
Depends on: 467634
Comment 3•16 years ago (Assignee)
This is only with production-master.b.m.o, aiui, so updating summary.
To echo comment #2, we've already:
- switched to faster disks earlier this week
- increased RAM on this VM last week
...neither of which made any difference, afaict.
Also, there was some debate on whether the new l10n changes were a factor in this. Did we ever confirm that the l10n changes were not a factor?
Summary: buildbot slaves occasionally lose connection to their master → buildbot slaves occasionally lose connection to production-master.b.m.o
Comment 4•16 years ago
production-1.9-master is also seeing slaves connect and disconnect, although I haven't seen it interrupt an active build.
Comment 5•16 years ago
I've also seen production-1.8-master being slow to respond when opening an ssh connection, both in displaying the password prompt and the shell prompt. Wild theory: it's as if the host optimisation parks a semi-idle VM where it has just enough resources to tick over, and then it takes time to get more when it needs them.
Comment 6•16 years ago (Assignee)
I've been talking with mrz about this a few times, so might as well grab this bug, unless someone else wants it?
Assignee: nobody → joduinn
Priority: -- → P2
Comment 7•16 years ago (Assignee)
Yesterday during the RelEng meeting, bherarum noted there had been no dropped connections so far that day. Today, I see only one dropped connection: moz2-linux64-slave01 dropped at 16:41:41 and reconnected at 16:47:27. All in all, quite stable.
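As a quick check on how long that slave was disconnected, here is a minimal sketch that computes the gap between the two timestamps. The helper name `outage_duration` is hypothetical, and it assumes both times are on the same day (or that the reconnect is on the following day if it is earlier than the drop).

```python
from datetime import datetime, timedelta


def outage_duration(dropped, reconnected):
    """Return the time between a drop and a reconnect, given HH:MM:SS strings."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(dropped, fmt)
    end = datetime.strptime(reconnected, fmt)
    # Assumption: a reconnect time earlier than the drop time means the
    # reconnect happened after midnight on the following day.
    if end < start:
        end += timedelta(days=1)
    return end - start


# Timestamps for moz2-linux64-slave01 from this comment.
gap = outage_duration("16:41:41", "16:47:27")
print(gap)  # 0:05:46
```

So the single outage lasted a little under six minutes.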
Not sure exactly what "fixed" this:
- the VMware reservation/partitioning settings for production-master VM?
- the reduced load of cronjob cleanup work running in the production-master VM?
- some combination of those?
We've got l10n running on these systems, so that's not it. We deleted a few VMs across the entire set of RelEng ESX hosts, but not enough that I think it made a difference, imho.
Leaving this bug open for a little bit longer - while we play "wait and see".
Priority: P2 → P3
Comment 8•16 years ago
(In reply to comment #7)
> We've got l10n running on these systems, so that's not it. We deleted a few
> VMs across the entire set of RelEng ESX hosts, but not enough that I think it
> made a difference, imho.
We do *not* have l10n running consistently on these systems. l10n was only re-enabled for m-c this morning.
Comment 9•16 years ago (Assignee)
(In reply to comment #8)
> (In reply to comment #7)
> > We've got l10n running on these systems, so that's not it.
> We do *not* have l10n running consistently on these systems. l10n was only
> re-enabled for m-c this morning.
Yes, true, we've not been running l10n consistently. We had l10n running for a few hours earlier this week (Monday), but then took l10n offline (for unrelated reasons). We re-enabled it yesterday (Thursday) morning on mozilla-central, and it's still running there just fine. Neither time did enabling l10n kill production-master.
(In reply to comment #7)
> Leaving this bug open for a little bit longer - while we play "wait and see".
No other dropped connections on production-master since the one dropped connection from the linux64 slave earlier this week. I'm calling this bug well and truly FIXED. Filed bug 470462 to track setting up VMware reservations on our other production VMs to prevent this biting us elsewhere.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated•11 years ago
Product: mozilla.org → Release Engineering