Closed Bug 655304 Opened 14 years ago Closed 13 years ago

Replace iX slave HDs with 'enterprise' drives

Categories

(Infrastructure & Operations :: RelOps: General, task)

Hardware: x86
OS: macOS
Type: task
Priority: Not set
Severity: normal
Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: zandr, Assigned: zandr)


Details

[Explanation of how we got here below]

We are going to replace all of the drives in all of the iX half-depth 1Us (anything with ...ix-slave... in the hostname) with WD RE4 RAID-duty enterprise drives.

The first ones to get replaced will be the 8 machines they still have. After that we'll have to figure out how large a batch we can spare at a time. Turnaround time will be approximately 3 business days for each set.

iX will replace the drives, burn in the systems again, and return them to us.

Let's use this bug as a log of activity around rotating all the machines through replacement.

Backstory: We met with iX systems yesterday and came up with a plan going forward for the iX 1U machines. They have done additional testing and experimentation on the few machines they haven't returned yet, and found the following:

* Replacing fans doesn't solve the problem reliably.
* The problem follows the chassis, not the drive.
* Enterprise drives don't show the problem.

All of this suggests that this is still a vibration issue, and desktop drives just aren't going to cut it in this chassis.

As verification, they took the worst of the machines they had, one that couldn't exceed 15MB/s with a new desktop drive. They put an enterprise drive in it, and throughput exceeded 85MB/s. That's the worst result they were able to create using enterprise drives.
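For context, here is a minimal sketch of one way to spot-check sequential disk throughput on a slave. The bug doesn't say how iX produced the 15MB/s and 85MB/s figures, and the test file path below is just an assumption for illustration.

#!/usr/bin/env python
# Hypothetical throughput spot-check -- the bug does not say how iX measured
# the 15MB/s vs 85MB/s numbers; this just illustrates one way to get a rough
# sequential read/write figure. TEST_FILE is an assumed path.
import os
import time

TEST_FILE = "/builds/throughput_test.bin"
CHUNK = 4 * 1024 * 1024        # 4 MiB per write
TOTAL = 1024 * 1024 * 1024     # 1 GiB test file

def write_pass():
    buf = os.urandom(CHUNK)
    start = time.time()
    with open(TEST_FILE, "wb") as f:
        written = 0
        while written < TOTAL:
            f.write(buf)
            written += CHUNK
        f.flush()
        os.fsync(f.fileno())   # force the data out to the disk before stopping the clock
    return TOTAL / (time.time() - start) / 1e6

def read_pass():
    start = time.time()
    with open(TEST_FILE, "rb") as f:
        while f.read(CHUNK):
            pass
    return TOTAL / (time.time() - start) / 1e6

if __name__ == "__main__":
    print("sequential write: %.1f MB/s" % write_pass())
    print("sequential read:  %.1f MB/s" % read_pass())
    os.remove(TEST_FILE)

Note that the read pass can look artificially fast if the file is still in the OS page cache, so the write number is the more trustworthy of the two for a quick check like this.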
(In reply to comment #1)
> [Explanation of how we got here below]
>
> We are going to replace all of the drives in all of the iX half-depth 1Us
> (anything with ...ix-slave... in the hostname) with WD RE4 RAID-duty
> enterprise drives.
>
> The first ones to get replaced will be the 8 machines they still have. After
> that we'll have to figure out how large a batch we can spare at a time.
> Turnaround time will be approximately 3 business days for each set.

Great news. Hopefully, this will resolve all these long-running, hard-to-debug headaches.

> iX will replace the drives, burn in the systems again, and return them to us.

Will the replacement drives have the image/toolchain installed, so we just power them up when they return from iX? Or will the machines be returned with blank drives and we do a full clean reinstall on each machine as it comes back to us?

(Each plan has pros/cons. I'm not sure which is the better approach, tbh, just asking because it wasn't clear in comment #0.)
I'm game. This is going to be a long process, so let's keep the paperwork straight. For each batch, let's open a new bug that lists the machines in the batch both by hostname and asset tag, blocking this bug. We can reference these batch bugs from each of the failed-hardware bugs that are floating around. When machines return, I'll annotate them as such in the notes section of the inventory - "WD RE4 drive installed in yyyy-mm-dd batch". I'll keep a spreadsheet to track our progress through the entire set of releng IX systems. Since some are masters, it could take a while.
The machines will need to be re-imaged as they return - but that's easy to do while re-racking anyway.
(In reply to comment #1)
> Will the replacement drives have the image/toolchain installed, so we just
> power them up when they return from iX? Or will the machines be returned
> with blank drives and we do a full clean reinstall on each machine as it
> comes back to us?

They will come with blank drives, or possibly with the FreeBSD burn-in tools installed. We'll reimage as they come in.
For my own curiosity:

* are we getting the replacement drives for free (or at least deeply discounted)?
* how much would the enterprise drive bump up the cost of buying a new ix machine, assuming we want to drink from that well again in the future?
(In reply to comment #6)
> For my own curiosity:
>
> * are we getting the replacement drives for free (or at least deeply
> discounted)?

We're buying the drives; they're doing the labor to swap them out and repeat the burn-in.

> * how much would the enterprise drive bump up the cost of buying a new ix
> machine, assuming we want to drink from that well again in the future?

Net, maybe $30-40? The drives are $71 each.
Assignee: server-ops-releng → zandr
Anything I can do to get a batch or two started here? I have an empty spreadsheet just *waiting* to track the mayhem...
The 8 machines that are at iX systems from the last repair batch will be in late this week (which probably actually means Monday).
Bug 656474 has 9 machines on it, not 8, by the way. We have lots more machines already down and waiting to go to iX. If they're shipping these, can we make up a new batch from that and cross-ship? If they're delivering the machines themselves, should we have a stack ready for them to pick up when they drop off the bug 656474 batch? To summarize: I'm ready to start another batch whenever you are. Just say the word.
As a last-minute addition, we're looking at doubling the RAM in these machines (to 8G) while they're at iX. Still awaiting a quote on that, but the new RAM would go into the 9 machines in bug 656474 before they're returned, and into all subsequent batches. In other news, I'll be putting together a batch of 40 for iX to pick up when they drop off the bug 656474 batch.
As a reminder to self: plan to move the mw* and mv-* hosts *last*, but the *-ix-* hosts that are in mtv1 and want to be in scl1 *first* (so, in the next batch, batch III).
I just duped a pile of broken iX machine bugs against the batch II bug (bug 657015). I'm going to dupe the remaining three against this bug. Please consider that a nomination to include these in batch III along with all the *-ix-* hosts in MV (would like to free up that space for the RelOps Lab).

Those machines are:
linux-ix-slave34
linux64-ix-slave26
linux64-ix-slave37
More for batch III:
buildbot-master2
w32-ix-slave05
Add buildbot-master1 to the list
Status update:

The big picture here is that we're trying to recover as much value from these systems as possible, but given the history of the problem, getting them "fixed" is unlikely to happen soon, and work is proceeding apace to get replacement systems in place and get our builder capacity back up where it should be.

Based on some questions in the releng meeting, here are some numbers for right now (don't hold me to the exact values here - these are just counts from the spreadsheet in the URL field):

number of working ix systems: 164
number of disabled ix systems: 56
number of systems (in second ix purchase) && (now in mtv1) && (broken): 0 - see bug 663950
remaining to be repaired: 220, if we decide to "repair" all of them
at ix: 1
returned, but not visible to releng: 56
returned, in production: 5

iX has had a few ideas for fixing this - putting the same hardware in a new chassis, some sort of foam insert, or a new heatsink/fan combination. They'll be onsite tomorrow afternoon.

I don't know anything about costs, contracts, quotes, etc., for any of this, and that's probably best not put in a public bug anyway.
iX was onsite today and returned two machines (the 1 'at ix', plus another one that was tracked in the spreadsheet as out for repair) with an entirely different heatsink/fan arrangement. We then converted 8 more machines for a total of 10 units. Of the four we've tested, three show very good performance; one remains disappointing. Stay tuned as we test the other six.
Amy points out we're not doing this particular project anymore. I'll open a new bug to track the new path to victory.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations