Closed Bug 655304 Opened 14 years ago Closed 13 years ago

Replace iX slave HDs with 'enterprise' drives

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zandr, Assigned: zandr)


Details

[Explanation of how we got here below]

We are going to replace all of the drives in all of the iX half-depth 1Us (anything with ...ix-slave... in the hostname) with WD RE4 RAID-duty enterprise drives.

The first ones to get replaced will be the 8 machines they still have. After that we'll have to figure out how large a batch we can spare at a time. Turnaround time will be approximately 3 business days for each set.

iX will replace the drives, burn in the systems again, and return them to us.

Let's use this bug as a log of activity around rotating all the machines through replacement.

Backstory: We met with iX systems yesterday and came up with a plan going forward for the iX 1U machines. They have done additional testing and experimentation on the few machines they haven't returned yet, and found the following:

* Replacing fans doesn't solve the problem reliably.
* The problem follows the chassis, not the drive.
* Enterprise drives don't show the problem.

All of this suggests that this is still a vibration issue, and desktop drives just aren't going to cut it in this chassis.

As verification, they took the worst of the machines they had, one that couldn't exceed 15MB/s with a new desktop drive. They put an enterprise drive in it, and throughput exceeded 85MB/s. That's the worst result they were able to create using enterprise drives.
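For context, here is a minimal sketch of one way to spot-check sequential disk throughput on a slave. The bug doesn't say how iX produced the 15MB/s and 85MB/s figures, and the test file path below is just an assumption for illustration.

#!/usr/bin/env python
# Hypothetical throughput spot-check -- the bug does not say how iX measured
# the 15MB/s vs 85MB/s numbers; this just illustrates one way to get a rough
# sequential read/write figure. TEST_FILE is an assumed path.
import os
import time

TEST_FILE = "/builds/throughput_test.bin"
CHUNK = 4 * 1024 * 1024        # 4 MiB per write
TOTAL = 1024 * 1024 * 1024     # 1 GiB test file

def write_pass():
    buf = os.urandom(CHUNK)
    start = time.time()
    with open(TEST_FILE, "wb") as f:
        written = 0
        while written < TOTAL:
            f.write(buf)
            written += CHUNK
        f.flush()
        os.fsync(f.fileno())   # force the data out to the disk before stopping the clock
    return TOTAL / (time.time() - start) / 1e6

def read_pass():
    start = time.time()
    with open(TEST_FILE, "rb") as f:
        while f.read(CHUNK):
            pass
    return TOTAL / (time.time() - start) / 1e6

if __name__ == "__main__":
    print("sequential write: %.1f MB/s" % write_pass())
    print("sequential read:  %.1f MB/s" % read_pass())
    os.remove(TEST_FILE)

Note that the read pass can look artificially fast if the file is still in the OS page cache, so the write number is the more trustworthy of the two for a quick check like this.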
(In reply to comment #1)
> [Explanation of how we got here below]
>
> We are going to replace all of the drives in all of the iX half-depth 1Us
> (anything with ...ix-slave... in the hostname) with WD RE4 RAID-duty
> enterprise drives.
>
> The first ones to get replaced will be the 8 machines they still have. After
> that we'll have to figure out how large a batch we can spare at a time.
> Turnaround time will be approximately 3 business days for each set.

Great news. Hopefully, this will resolve all these long-running, hard-to-debug headaches.

> iX will replace the drives, burn in the systems again, and return them to us.

Will the replacement drives have the image/toolchain installed, so we just power them up when they return from iX? Or will the machines be returned with blank drives and we do a full clean reinstall on each machine as it comes back to us?

(Each plan has pros/cons. I'm not sure which is the better approach, tbh, just asking because it wasn't clear in comment #0.)
I'm game. This is going to be a long process, so let's keep the paperwork straight. For each batch, let's open a new bug that lists the machines in the batch both by hostname and asset tag, blocking this bug. We can reference these batch bugs from each of the failed-hardware bugs that are floating around. When machines return, I'll annotate them as such in the notes section of the inventory - "WD RE4 drive installed in yyyy-mm-dd batch". I'll keep a spreadsheet to track our progress through the entire set of releng IX systems. Since some are masters, it could take a while.
The machines will need to be re-imaged as they return - but that's easy to do while re-racking anyway.
(In reply to comment #1)
> Will the replacement drives have the image/toolchain installed, so we just
> power them up when they return from iX? Or will the machines be returned
> with blank drives and we do a full clean reinstall on each machine as it
> comes back to us?

They will come with blank drives, or possibly with the FreeBSD burn-in tools installed. We'll reimage as they come in.
For my own curiosity:

* are we getting the replacement drives for free (or at least deeply discounted)?
* how much would the enterprise drive bump up the cost of buying a new ix machine, assuming we want to drink from that well again in the future?
(In reply to comment #6)
> For my own curiosity:
>
> * are we getting the replacement drives for free (or at least deeply
> discounted)?

We're buying the drives; they're doing the labor to swap them out and repeat the burn-in.

> * how much would the enterprise drive bump up the cost of buying a new ix
> machine, assuming we want to drink from that well again in the future?

Net, maybe $30-40? The drives are $71 each.
Assignee: server-ops-releng → zandr
Anything I can do to get a batch or two started here? I have an empty spreadsheet just *waiting* to track the mayhem...
The 8 machines that are at iX systems from the last repair batch will be in late this week (which probably actually means Monday).
Bug 656474 has 9 machines on it, not 8, by the way. We have lots more machines already down and waiting to go to iX. If they're shipping these, can we make up a new batch from that and cross-ship? If they're delivering the machines themselves, should we have a stack ready for them to pick up when they drop off the bug 656474 batch? To summarize: I'm ready to start another batch whenever you are. Just say the word.
As a last-minute addition, we're looking at doubling the RAM in these machines (to 8G) while they're at iX. Still awaiting a quote on that, but the new RAM would go into the 9 machines in bug 656474 before they're returned, and into all subsequent batches. In other news, I'll be putting together a batch of 40 for iX to pick up when they drop off the bug 656474 batch.
As a reminder to self: plan to move the mw* and mv-* hosts *last*, but the *-ix-* hosts that are in mtv1 and want to be in scl1 *first* (so, in the next batch, batch III).
I just duped a pile of broken iX machine bugs against the batch II bug (bug 657015). I'm going to dupe the remaining three against this bug. Please consider that a nomination to include these in batch III along with all the *-ix-* hosts in MV (would like to free up that space for the RelOps Lab).

Those machines are:
linux-ix-slave34
linux64-ix-slave26
linux64-ix-slave37
More for batch III:
buildbot-master2
w32-ix-slave05
Add buildbot-master1 to the list
Status update:

The big picture here is that we're trying to recover as much value from these systems as possible, but given the history of the problem, getting them "fixed" is unlikely to happen soon, and work is proceeding apace to get replacement systems in place and get our builder capacity back up where it should be.

Based on some questions in the releng meeting, here are some numbers for right now (don't hold me to the exact values here - these are just counts from the spreadsheet in the URL field):

number of working ix systems: 164
number of disabled ix systems: 56
number of systems (in second ix purchase) && (now in mtv1) && (broken): 0 - see bug 663950
remaining to be repaired: 220, if we decide to "repair" all of them
at ix: 1
returned, but not visible to releng: 56
returned, in production: 5

iX has had a few ideas for fixing this - putting the same hardware in a new chassis, some sort of foam insert, or a new heatsink/fan combination. They'll be onsite tomorrow afternoon.

I don't know anything about costs, contracts, quotes, etc., for any of this, and that's probably best not put in a public bug anyway.
iX was onsite today and returned two machines (the 1 'at ix', plus another one that was tracked in the spreadsheet as out for repair) with an entirely different heatsink/fan arrangement. We then converted 8 more machines for a total of 10 units. Of the four we've tested, three show very good performance; one remains disappointing. Stay tuned as we test the other six.
Amy points out we're not doing this particular project anymore. I'll open a new bug to track the new path to victory.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations