Closed Bug 1056143 Opened 10 years ago Closed 10 years ago

use excess pandas to backfill broken production pandas

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86
macOS
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: arich, Unassigned)

Attachments

(1 file)

This bug is to track the production pandas that are broken and the pandas that we're going to take out of other chassis (which will be moving into storage) to backfill the broken production chassis. Coop, can you please start by providing a list of the "broken" pandas so we can determine which we want to backfill and which chassis we want to put into storage?
Flags: needinfo?(coop)
Blocks: 1056145
Van just resolved the most recent panda recovery bug yesterday, so I'm putting a bunch of questionable pandas back into service today that may or may not hold up. There are a few known-bad pandas I can add to the list today, but I want to see how those recovered pandas hold up before adding ones from that list.
Flags: needinfo?(coop)
I've added the disabled pandas as dependencies, but here's a list in case that's easier:

* panda-0091
* panda-0126
* panda-0129
* panda-0157
* panda-0191
* panda-0219
* panda-0257
* panda-0370
* panda-0373
* panda-0460
* panda-0476
* panda-0490
* panda-0539
* panda-0584
* panda-0587
* panda-0592
* panda-0643
* panda-0647
* panda-0681
* panda-0726
* panda-0730
* panda-0731
* panda-0734
* panda-0736
* panda-0747
* panda-0749
* panda-0778
* panda-0797
* panda-0803
* panda-0807
* panda-0819
* panda-0832
* panda-0834
* panda-0835
* panda-0848

I'll mark them all for decomm in slavealloc and the relevant bugs.

Note: I may have more to add once I go through the list of "broken" pandas, i.e. pandas that are enabled but not reporting.
I've added panda-0588 to the list.
Blocks: panda-0588
Added:

* panda-0330
* panda-0344
* panda-0489
Added:

* panda-0052
* panda-0095
* panda-0107
* panda-0234
* panda-0294
* panda-0302
* panda-0337
* panda-0621
* panda-0664
* panda-0665
* panda-0674
That's good enough for now. The only other pandas I worry about are marked in red as "broken" in slave health: https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=panda That so many of them cluster around 117 days and 38 days since their last job makes me wonder whether a colo move (or similar large event) knocked a whole batch of good pandas offline. I will file a follow-up bug for that.
(In reply to Chris Cooper [:coop] from comment #6)
> I will file a follow-up bug for that.

Bug 1057069 filed.
There doesn't appear to be any specific pattern, and all of the "broken" pandas are spread out over p1 - p9 (none that I could see in p10). The distribution is as follows:

p1: panda-0091 panda-0095 panda-0107 panda-0126 panda-0129 panda-0157
p2: panda-0191 panda-0219 panda-0234 panda-0257
p3: panda-0294 panda-0302 panda-0330 panda-0337 panda-0344
p4: panda-0370 panda-0373
p5: panda-0460 panda-0476 panda-0489 panda-0490
p6: panda-0539 panda-0584 panda-0587 panda-0588 panda-0592
p7: panda-0621 panda-0643 panda-0647 panda-0664 panda-0665 panda-0674 panda-0681
p8: panda-0726 panda-0730 panda-0731 panda-0734 panda-0736 panda-0747 panda-0749 panda-0778
p9: panda-0797 panda-0803 panda-0807 panda-0819 panda-0832 panda-0834 panda-0835 panda-0848

It doesn't make a great deal of difference, but there are slightly higher numbers of failures in the last three racks, which adds to my belief that those are the three racks we should put in storage. That means that we should disable the pandas listed here:

https://inventory.mozilla.org/en-US/systems/racks/?rack=372
https://inventory.mozilla.org/en-US/systems/racks/?rack=373
https://inventory.mozilla.org/en-US/systems/racks/?rack=374

And my suggestion is that we do backfill as follows:

pandas from panda-chassis-055 relocated into p1 and p2:

p1:
panda-0091 -> panda-0610
panda-0095 -> panda-0611
panda-0107 -> panda-0612
panda-0126 -> panda-0613
panda-0129 -> panda-0614
panda-0157 -> panda-0615

p2:
panda-0191 -> panda-0616
panda-0219 -> panda-0617
panda-0234 -> panda-0618
panda-0257 -> panda-0619

pandas from panda-chassis-056 relocated into p3, p4, and p5:

p3:
panda-0294 -> panda-0620
panda-0302 -> panda-0622
panda-0330 -> panda-0623
panda-0337 -> panda-0624
panda-0344 -> panda-0625

p4:
panda-0370 -> panda-0626
panda-0373 -> panda-0627

p5:
panda-0460 -> panda-0628
panda-0476 -> panda-0629
panda-0489 -> panda-0630
panda-0490 -> panda-0081

pandas from panda-chassis-057 relocated into p6:

p6:
panda-0539 -> panda-0631
panda-0584 -> panda-0632
panda-0587 -> panda-0633
panda-0588 -> panda-0634
panda-0592 -> panda-0635

That leaves panda-chassis-057 with extra boards in it and makes it the one we should scavenge from first when we next need backfill.

Coop, does that look good to you?
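For anyone who wants to double-check the per-rack failure counts behind the comment above, here is a minimal Python sketch. The rack lists are copied from that comment; the names `broken_by_rack`, `counts`, and `worst_three` are made up for this illustration and do not correspond to any slavealloc or inventory tooling.

```python
# Broken pandas per rack, copied from the distribution in the comment above.
broken_by_rack = {
    "p1": ["panda-0091", "panda-0095", "panda-0107", "panda-0126",
           "panda-0129", "panda-0157"],
    "p2": ["panda-0191", "panda-0219", "panda-0234", "panda-0257"],
    "p3": ["panda-0294", "panda-0302", "panda-0330", "panda-0337",
           "panda-0344"],
    "p4": ["panda-0370", "panda-0373"],
    "p5": ["panda-0460", "panda-0476", "panda-0489", "panda-0490"],
    "p6": ["panda-0539", "panda-0584", "panda-0587", "panda-0588",
           "panda-0592"],
    "p7": ["panda-0621", "panda-0643", "panda-0647", "panda-0664",
           "panda-0665", "panda-0674", "panda-0681"],
    "p8": ["panda-0726", "panda-0730", "panda-0731", "panda-0734",
           "panda-0736", "panda-0747", "panda-0749", "panda-0778"],
    "p9": ["panda-0797", "panda-0803", "panda-0807", "panda-0819",
           "panda-0832", "panda-0834", "panda-0835", "panda-0848"],
}

# Number of broken boards in each rack.
counts = {rack: len(names) for rack, names in broken_by_rack.items()}

# The three racks with the most broken boards (hypothetical helper value).
worst_three = sorted(counts, key=counts.get, reverse=True)[:3]

for rack, n in counts.items():
    print(f"{rack}: {n}")
print("most failures:", sorted(worst_three))
```

By this count, p7, p8, and p9 have 7, 8, and 8 broken boards, against 2 to 6 in each of p1 through p6.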
Flags: needinfo?(coop)
(In reply to Amy Rich [:arich] [:arr] from comment #8)
> That means that we should disable the pandas listed here:
>
> https://inventory.mozilla.org/en-US/systems/racks/?rack=372
> https://inventory.mozilla.org/en-US/systems/racks/?rack=373
> https://inventory.mozilla.org/en-US/systems/racks/?rack=374
>
> Coop, does that look good to you?

That's fine. I'll mark the pandas from those racks as disabled and in storage in slavealloc.
Flags: needinfo?(coop)
(In reply to Chris Cooper [:coop] from comment #9)
> That's fine. I'll mark the pandas from those racks as disabled and in
> storage in slavealloc.

This is done.
Attached patch remove-dead-pandas.patch (deleted) — Splinter Review
This removes the dead pandas from nagios. I'll wait until the backfill pandas are in place before adding them to nagios.
The following have also been decommissioned in inventory:

panda-0091.p1.releng.scl3.mozilla.com
panda-0095.p1.releng.scl3.mozilla.com
panda-0107.p1.releng.scl3.mozilla.com
panda-0126.p1.releng.scl3.mozilla.com
panda-0129.p1.releng.scl3.mozilla.com
panda-0157.p1.releng.scl3.mozilla.com
panda-0191.p2.releng.scl3.mozilla.com
panda-0219.p2.releng.scl3.mozilla.com
panda-0234.p2.releng.scl3.mozilla.com
panda-0294.p3.releng.scl3.mozilla.com
panda-0302.p3.releng.scl3.mozilla.com
panda-0330.p3.releng.scl3.mozilla.com
panda-0337.p3.releng.scl3.mozilla.com
panda-0344.p3.releng.scl3.mozilla.com
panda-0370.p4.releng.scl3.mozilla.com
panda-0373.p4.releng.scl3.mozilla.com
panda-0460.p5.releng.scl3.mozilla.com
panda-0476.p5.releng.scl3.mozilla.com
panda-0489.p5.releng.scl3.mozilla.com
panda-0490.p5.releng.scl3.mozilla.com
panda-0539.p6.releng.scl3.mozilla.com
panda-0584.p6.releng.scl3.mozilla.com
panda-0587.p6.releng.scl3.mozilla.com
panda-0588.p6.releng.scl3.mozilla.com
panda-0592.p6.releng.scl3.mozilla.com
(In reply to Amy Rich [:arich] [:arr] from comment #8)

dcops, could you please backfill pandas as described in comment #8? The notation is:

dead panda -> replacement panda scavenged from a chassis we want to put in storage.

e.g. panda-0091 -> panda-0610

We'll need to update inventory for each board with the new location information, IP, vlan, as well as the mobile imaging server and panda-relay key/values. Since there are only a handful, do you want to do them manually, or do you want to try to get uberj's assistance to do these as a batch?

Once they're all up and functional, I'll add them to nagios.
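If the inventory updates end up being done as a batch, the "dead panda -> replacement panda" notation could be turned into a lookup table first. The following is only a sketch of that idea: `PAIR_RE` and `parse_backfill` are hypothetical names, the four-digit hostname pattern is an assumption based on the names in this bug, and nothing here talks to inventory itself.

```python
import re

# Matches one "dead -> replacement" pair, e.g. "panda-0091 -> panda-0610".
# Assumption: hostnames are always "panda-" followed by four digits.
PAIR_RE = re.compile(r"(panda-\d{4})\s*->\s*(panda-\d{4})")


def parse_backfill(text):
    """Return a {dead: replacement} dict parsed from the plan text.

    Raises ValueError if a dead board is listed twice or if the same
    replacement board is assigned to more than one slot.
    """
    mapping = {}
    used = set()
    for dead, replacement in PAIR_RE.findall(text):
        if dead in mapping:
            raise ValueError(f"{dead} is listed more than once")
        if replacement in used:
            raise ValueError(f"{replacement} is assigned more than once")
        mapping[dead] = replacement
        used.add(replacement)
    return mapping


# Example using a few pairs from comment #8; rack labels such as "p1:"
# are simply ignored by the pattern.
plan = parse_backfill(
    "p1: panda-0091 -> panda-0610 panda-0095 -> panda-0611 "
    "p2: panda-0191 -> panda-0616"
)
for dead, replacement in plan.items():
    print(f"{dead} is replaced by {replacement}")
```

Because the pattern allows any whitespace around the arrow, a pair that wraps across a line break is still picked up.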
Assignee: arich → server-ops-dcops
Component: RelOps → Server Operations: DCOps
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → dmoore
IP, vlan, mobile imaging server, and panda-relay updated in inventory. Waiting on physical move and update of location information in inventory.
colo-trip: --- → scl3
UPDATE: Remaining pandas that need replacement.

p3:
panda-0330 -> panda-0623
panda-0337 -> panda-0624
panda-0344 -> panda-0625

p4:
panda-0370 -> panda-0626
panda-0373 -> panda-0627

p5:
panda-0460 -> panda-0628
panda-0476 -> panda-0629
panda-0489 -> panda-0630
panda-0490 -> panda-0081

pandas from panda-chassis-057 relocated into p6:

p6:
panda-0539 -> panda-0631
panda-0584 -> panda-0632
panda-0587 -> panda-0633
panda-0588 -> panda-0634
panda-0592 -> panda-0635
All pandas have been replaced, and rack locations and switch ports have been updated in inventory.
Self-tested each replaced board and installed a fresh copy of 4.0.4_v3.3 on it.
Assignee: server-ops-dcops → relops
Status: NEW → RESOLVED
Closed: 10 years ago
Component: Server Operations: DCOps → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → arich
Resolution: --- → FIXED
Depends on: 1072405