Closed Bug 467322 Opened 16 years ago Closed 16 years ago

buildbot slaves occasionally lose connection to production-master.b.m.o

Categories

(Release Engineering :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Gavin, Assigned: joduinn)


Details

Failed red with: remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ]
Should be helped by bug 466499, which is scheduled to happen early Monday morning.
Summary: Linux mozilla-1.9.1 build busted → buildbot slaves occasionally lose connection to their master
Hmm, apparently that didn't help. bug 467634 for the latest theory.
Depends on: 467634
This is only with production-master.b.m.o, aiui, so updating summary. To echo comment #2 below, we've already:
- switched to faster disks earlier this week
- increased RAM on this VM last week
...neither of which made any difference, afaict. Also, there was some debate on whether the new l10n changes were a factor in this. Did we ever confirm that the l10n changes were not a factor?
Summary: buildbot slaves occasionally lose connection to their master → buildbot slaves occasionally lose connection to production-master.b.m.o
production-1.9-master is also seeing slaves connect and disconnect, although I haven't seen it interrupt an active build.
I've also seen production-1.8-master being slow to respond when opening an ssh connection, both in displaying the password prompt and the shell prompt. Wild theory: the host optimisation parks a semi-idle VM where it has just enough resources to tick over, and then it takes time to get more resources when it needs them.
I've been talking with mrz about this a few times, so might as well grab this bug, unless someone else wants it?
Assignee: nobody → joduinn
Priority: -- → P2
Yesterday during the RelEng meeting, bhearsum noted there were no dropped connections so far that day. Today, I see only one dropped connection: moz2-linux64-slave01 dropped at 16:41:41 and reconnected at 16:47:27. All in all, quite stable. Not sure exactly what "fixed" this:
- the VMware reservation/partitioning settings for the production-master VM?
- the reduced load from cronjob cleanup work running in the production-master VM?
- some combination of those?
We've got l10n running on these systems, so that's not it. We deleted a small few VMs across the entire set of RelEng ESX hosts, but not enough that I think it made a difference, imho. Leaving this bug open for a little bit longer while we play "wait and see".
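The dropped-connection accounting in the comment above (one detach at 16:41:41, one reattach at 16:47:27) can be sketched as a small log-scraping script. This is a hedged illustration only: the log line format, sample entries, and slave name handling below are assumptions made up for the example, not the real twistd.log output of a buildbot master of that era.

```python
import re
from datetime import datetime

# Hypothetical sample of master log lines; this format is an assumption
# for illustration, not real buildbot twistd.log output.
SAMPLE_LOG = """\
2008-12-05 16:41:41 slave moz2-linux64-slave01 detached
2008-12-05 16:47:27 slave moz2-linux64-slave01 attached
"""

LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"slave (?P<name>\S+) (?P<event>detached|attached)$"
)

def downtime_per_slave(log_text):
    """Return {slave_name: total seconds spent disconnected}."""
    detached_at = {}  # slave -> datetime of its last detach
    totals = {}       # slave -> accumulated downtime in seconds
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        name = m.group("name")
        if m.group("event") == "detached":
            detached_at[name] = ts
        elif name in detached_at:
            gap = (ts - detached_at.pop(name)).total_seconds()
            totals[name] = totals.get(name, 0.0) + gap
    return totals

print(downtime_per_slave(SAMPLE_LOG))
# the 16:41:41 -> 16:47:27 gap quoted above is 346 seconds
```

Pairing each detach with the next attach for the same slave is enough here; a detach with no later attach simply accumulates no downtime, which matches the "still disconnected" case.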
Priority: P2 → P3
(In reply to comment #7)
> We've got l10n running on these systems, so that not it. We deleted a small few
> VMs across the entire set of RelEng ESX hosts, but not enough that I think it
> made a difference, imho.

We do *not* have l10n running consistently on these systems. l10n was only re-enabled for m-c this morning.
(In reply to comment #8)
> (In reply to comment #7)
> > We've got l10n running on these systems, so that not it.
> We do *not* have l10n running consistently on these systems. l10n was only
> re-enabled for m-c this morning.

Yes, true, we've not been running l10n consistently. We had l10n running for a few hours earlier this week (Monday), but then took l10n offline (for unrelated reasons). We re-enabled it again yesterday (Thursday) morning on mozilla-central, and it's still running there just fine. Each time, enabling l10n did not kill production-master.

(In reply to comment #7)
> Leaving this bug open for a little bit longer - while we play "wait and see".

No other dropped connections on production-master since the one dropped connection from the linux64 slave earlier this week. I'm calling this bug well and truly FIXED. Filed bug#470462 to track setting up VMware reservations on our other production VMs to prevent this biting us elsewhere.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Maintenance → Release Engineering
Product: mozilla.org → Release Engineering