Closed Bug 467322 Opened 16 years ago Closed 16 years ago

buildbot slaves occasionally lose connection to production-master.b.m.o

Categories

(Release Engineering :: General, defect, P3)

x86
Linux
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Gavin, Assigned: joduinn)


Details

Failed red with: remoteFailed: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ] [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion. ]
Should be helped by bug 466499, which is scheduled to happen early Monday morning.
Summary: Linux mozilla-1.9.1 build busted → buildbot slaves occasionally lose connection to their master
Hmm, apparently that didn't help. bug 467634 for the latest theory.
Depends on: 467634
This is only with production-master.b.m.o, aiui, so updating summary. To echo comment #2 below, we've already:
- switched to faster disks earlier this week
- increased RAM on this VM last week
...neither of which made any difference, afaict. Also, there was some debate on whether the new l10n changes were a factor in this. Did we ever confirm that the l10n changes were not a factor?
Summary: buildbot slaves occasionally lose connection to their master → buildbot slaves occasionally lose connection to production-master.b.m.o
production-1.9-master is also seeing slaves connect and disconnect, although I haven't seen it interrupt an active build.
I've also seen production-1.8-master being slow to respond when opening an ssh connection, both in displaying the password prompt and the shell prompt. Wild theory: the host optimisation parks a semi-idle VM where it has just enough resources to tick over, and then it takes time to get more resources when it needs them.
I've been talking with mrz about this a few times, so might as well grab this bug, unless someone else wants it?
Assignee: nobody → joduinn
Priority: -- → P2
Yesterday during the RelEng meeting, bhearsum noted there were no dropped connections so far that day. Today, I see only one dropped connection: moz2-linux64-slave01 dropped at 16:41:41 and reconnected at 16:47:27. All in all, quite stable. Not sure exactly what "fixed" this:
- the VMware reservation/partitioning settings for the production-master VM?
- the reduced load from cronjob cleanup work running in the production-master VM?
- some combination of those?
We've got l10n running on these systems, so that's not it. We deleted a small few VMs across the entire set of RelEng ESX hosts, but not enough that I think it made a difference, imho. Leaving this bug open for a little bit longer while we play "wait and see".
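The dropped-connection accounting in the comment above (one detach at 16:41:41, one reattach at 16:47:27) can be sketched as a small log-scraping script. This is a hedged illustration only: the log line format, sample entries, and slave name handling below are assumptions made up for the example, not the real twistd.log output of a buildbot master of that era.

```python
import re
from datetime import datetime

# Hypothetical sample of master log lines; this format is an assumption
# for illustration, not real buildbot twistd.log output.
SAMPLE_LOG = """\
2008-12-05 16:41:41 slave moz2-linux64-slave01 detached
2008-12-05 16:47:27 slave moz2-linux64-slave01 attached
"""

LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"slave (?P<name>\S+) (?P<event>detached|attached)$"
)

def downtime_per_slave(log_text):
    """Return {slave_name: total seconds spent disconnected}."""
    detached_at = {}  # slave -> datetime of its last detach
    totals = {}       # slave -> accumulated downtime in seconds
    for line in log_text.splitlines():
        m = LINE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        name = m.group("name")
        if m.group("event") == "detached":
            detached_at[name] = ts
        elif name in detached_at:
            gap = (ts - detached_at.pop(name)).total_seconds()
            totals[name] = totals.get(name, 0.0) + gap
    return totals

print(downtime_per_slave(SAMPLE_LOG))
# the 16:41:41 -> 16:47:27 gap quoted above is 346 seconds
```

Pairing each detach with the next attach for the same slave is enough here; a detach with no later attach simply accumulates no downtime, which matches the "still disconnected" case.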
Priority: P2 → P3
(In reply to comment #7)
> We've got l10n running on these systems, so that not it. We deleted a small few
> VMs across the entire set of RelEng ESX hosts, but not enough that I think it
> made a difference, imho.

We do *not* have l10n running consistently on these systems. l10n was only re-enabled for m-c this morning.
(In reply to comment #8)
> (In reply to comment #7)
> > We've got l10n running on these systems, so that not it.
> We do *not* have l10n running consistently on these systems. l10n was only
> re-enabled for m-c this morning.

Yes, true, we've not been running l10n consistently. We had l10n running for a few hours earlier this week (Monday), but then took l10n offline (for unrelated reasons). We re-enabled it again yesterday (Thursday) morning on mozilla-central, and it's still running there just fine. Each time, enabling l10n did not kill production-master.

(In reply to comment #7)
> Leaving this bug open for a little bit longer - while we play "wait and see".

No other dropped connections on production-master since the one dropped connection from the linux64 slave earlier this week. I'm calling this bug well and truly FIXED. Filed bug#470462 to track setting up VMware reservations on our other production VMs to prevent this biting us elsewhere.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Component: Release Engineering: Maintenance → Release Engineering
Product: mozilla.org → Release Engineering