Closed
Bug 1056143
Opened 10 years ago
Closed 10 years ago
use excess pandas to backfill broken production pandas
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: arich, Unassigned)
References
Details
Attachments
(1 file)
(deleted),
patch
This bug is to track the production pandas that are broken and the pandas that we're going to take out of other chassis (which will be moving into storage) to backfill the broken production chassis.
Coop, can you please start by providing a list of the "broken" pandas so we can determine which we want to backfill and which chassis we want to put into storage?
Reporter | ||
Updated•10 years ago
Flags: needinfo?(coop)
Comment 1•10 years ago
Van just resolved the most recent panda recovery bug yesterday, so I'm putting a bunch of questionable pandas back into service today that may or may not hold up.
There are a few known-bad pandas I can add to the list today, but I want to see how those recovered pandas hold up before adding ones from that list.
Flags: needinfo?(coop)
Updated•10 years ago
Blocks: panda-0091, panda-0126, panda-0129, panda-0157, panda-0191, panda-0219, panda-0257, panda-0370, panda-0373, panda-0460, panda-0476, panda-0490, panda-0539, panda-0584, panda-0587, panda-0592, panda-0643, panda-0647, panda-0681, panda-0726, panda-0730, panda-0731, panda-0734, panda-0736, panda-0747, panda-0749, panda-0778, panda-0797, panda-0803, panda-0807, panda-0819, panda-0832, panda-0834, panda-0835, panda-0848
Comment 2•10 years ago
I've added the disabled pandas as dependencies, but here's a list in case that's easier:
* panda-0091
* panda-0126
* panda-0129
* panda-0157
* panda-0191
* panda-0219
* panda-0257
* panda-0370
* panda-0373
* panda-0460
* panda-0476
* panda-0490
* panda-0539
* panda-0584
* panda-0587
* panda-0592
* panda-0643
* panda-0647
* panda-0681
* panda-0726
* panda-0730
* panda-0731
* panda-0734
* panda-0736
* panda-0747
* panda-0749
* panda-0778
* panda-0797
* panda-0803
* panda-0807
* panda-0819
* panda-0832
* panda-0834
* panda-0835
* panda-0848
I'll mark them all for decomm in slavealloc and the relevant bugs.
Note: I may have more to add once I go through the list of "broken" pandas, i.e. pandas that are enabled but not reporting.
Comment 5•10 years ago
Added:
* panda-0052
* panda-0095
* panda-0107
* panda-0234
* panda-0294
* panda-0302
* panda-0337
* panda-0621
* panda-0664
* panda-0665
* panda-0674
Blocks: panda-0052, panda-0095, panda-0107, panda-0234, panda-0294, panda-0302, panda-0337, panda-0621, panda-0664, panda-0665, panda-0674
Comment 6•10 years ago
That's good enough for now.
The only other pandas I worry about are marked in red as "broken" in slave health:
https://secure.pub.build.mozilla.org/builddata/reports/slave_health/slavetype.html?class=test&type=panda
That so many of them cluster around 117 days and 38 days since their last job makes me wonder whether a colo move (or similar large event) knocked a whole batch of good pandas offline.
I will file a follow-up bug for that.
Comment 7•10 years ago
(In reply to Chris Cooper [:coop] from comment #6)
> I will file a follow-up bug for that.
Bug 1057069 filed.
Reporter | ||
Comment 8•10 years ago
There doesn't appear to be any specific pattern, and all of the "broken" pandas are spread out over p1 - p9 (none that I could see in p10).
The distribution is as follows:
p1:
panda-0091
panda-0095
panda-0107
panda-0126
panda-0129
panda-0157
p2:
panda-0191
panda-0219
panda-0234
panda-0257
p3:
panda-0294
panda-0302
panda-0330
panda-0337
panda-0344
p4:
panda-0370
panda-0373
p5:
panda-0460
panda-0476
panda-0489
panda-0490
p6:
panda-0539
panda-0584
panda-0587
panda-0588
panda-0592
p7:
panda-0621
panda-0643
panda-0647
panda-0664
panda-0665
panda-0674
panda-0681
p8:
panda-0726
panda-0730
panda-0731
panda-0734
panda-0736
panda-0747
panda-0749
panda-0778
p9:
panda-0797
panda-0803
panda-0807
panda-0819
panda-0832
panda-0834
panda-0835
panda-0848
It doesn't make a great deal of difference, but there are slightly higher numbers of failures in the last three racks, which adds to my belief that those are the three racks we should put in storage.
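As a quick check on that observation, the per-rack counts can be tallied from the distribution above. This is only a sketch: the `BROKEN_BY_RACK` name and the `tally` helper are made up for illustration, and the counts are transcribed by hand from the list in this comment rather than pulled from inventory or slave health.

```python
# Number of "broken" pandas per rack, counted by hand from the list above.
# p10 is 0 because none were seen there.
BROKEN_BY_RACK = {
    "p1": 6, "p2": 4, "p3": 5, "p4": 2, "p5": 4,
    "p6": 5, "p7": 7, "p8": 8, "p9": 8, "p10": 0,
}


def tally(racks):
    """Return the total number of broken pandas across the given racks."""
    return sum(BROKEN_BY_RACK[rack] for rack in racks)


total = tally(BROKEN_BY_RACK)               # every rack
last_three = tally(["p7", "p8", "p9"])      # the three racks with the most failures

for rack, count in BROKEN_BY_RACK.items():
    print(f"{rack}: {count}")
print(f"p7-p9 account for {last_three} of {total} broken pandas")
```

Racks p7 through p9 are the only ones with seven or more failures each, which is the "slightly higher" skew mentioned above.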
That means that we should disable the pandas listed here:
https://inventory.mozilla.org/en-US/systems/racks/?rack=372
https://inventory.mozilla.org/en-US/systems/racks/?rack=373
https://inventory.mozilla.org/en-US/systems/racks/?rack=374
And my suggestion is that we do backfill as follows:
pandas from panda-chassis-055 relocated into p1 and p2:
p1:
panda-0091 -> panda-0610
panda-0095 -> panda-0611
panda-0107 -> panda-0612
panda-0126 -> panda-0613
panda-0129 -> panda-0614
panda-0157 -> panda-0615
p2:
panda-0191 -> panda-0616
panda-0219 -> panda-0617
panda-0234 -> panda-0618
panda-0257 -> panda-0619
pandas from panda-chassis-056 relocated into p3, p4, and p5
p3:
panda-0294 -> panda-0620
panda-0302 -> panda-0622
panda-0330 -> panda-0623
panda-0337 -> panda-0624
panda-0344 -> panda-0625
p4:
panda-0370 -> panda-0626
panda-0373 -> panda-0627
p5:
panda-0460 -> panda-0628
panda-0476 -> panda-0629
panda-0489 -> panda-0630
panda-0490 -> panda-0081
pandas from panda-chassis-057 relocated into p6
p6:
panda-0539 -> panda-0631
panda-0584 -> panda-0632
panda-0587 -> panda-0633
panda-0588 -> panda-0634
panda-0592 -> panda-0635
That leaves panda-chassis-057 with extra boards in it and makes it the one we should scavenge from first when we next need backfill.
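Before handing a plan like this to dcops, it may be worth checking it mechanically. The sketch below parses the `dead -> replacement` lines from this comment and looks for boards listed twice or used on both sides of the arrow. The names `PLAN_TEXT`, `parse_plan`, and `check_plan` are hypothetical and not part of any existing tooling.

```python
import re

# Backfill plan copied from this comment: dead panda -> replacement panda.
PLAN_TEXT = """
panda-0091 -> panda-0610
panda-0095 -> panda-0611
panda-0107 -> panda-0612
panda-0126 -> panda-0613
panda-0129 -> panda-0614
panda-0157 -> panda-0615
panda-0191 -> panda-0616
panda-0219 -> panda-0617
panda-0234 -> panda-0618
panda-0257 -> panda-0619
panda-0294 -> panda-0620
panda-0302 -> panda-0622
panda-0330 -> panda-0623
panda-0337 -> panda-0624
panda-0344 -> panda-0625
panda-0370 -> panda-0626
panda-0373 -> panda-0627
panda-0460 -> panda-0628
panda-0476 -> panda-0629
panda-0489 -> panda-0630
panda-0490 -> panda-0081
panda-0539 -> panda-0631
panda-0584 -> panda-0632
panda-0587 -> panda-0633
panda-0588 -> panda-0634
panda-0592 -> panda-0635
"""

PAIR = re.compile(r"^(panda-\d{4}) -> (panda-\d{4})$")


def parse_plan(text):
    """Turn 'dead -> replacement' lines into a list of (dead, replacement) tuples."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = PAIR.match(line)
        if match is None:
            raise ValueError(f"unrecognised plan line: {line!r}")
        pairs.append(match.groups())
    return pairs


def check_plan(pairs):
    """Return a list of human-readable problems; an empty list means none were found."""
    dead = [d for d, _ in pairs]
    replacements = [r for _, r in pairs]
    problems = []
    if len(set(dead)) != len(dead):
        problems.append("a dead panda is listed more than once")
    if len(set(replacements)) != len(replacements):
        problems.append("a replacement panda is used more than once")
    for name in sorted(set(dead) & set(replacements)):
        problems.append(f"{name} is both dead and a replacement")
    return problems


pairs = parse_plan(PLAN_TEXT)
problems = check_plan(pairs)
print(f"{len(pairs)} replacements planned, problems: {problems or 'none'}")
```

This only checks the plan against itself. It does not confirm that a replacement board is healthy or that it really lives in the chassis named above.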
Coop, does that look good to you?
Flags: needinfo?(coop)
Comment 9•10 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #8)
> That means that we should disable the pandas listed here:
>
> https://inventory.mozilla.org/en-US/systems/racks/?rack=372
> https://inventory.mozilla.org/en-US/systems/racks/?rack=373
> https://inventory.mozilla.org/en-US/systems/racks/?rack=374
>
> Coop, does that look good to you?
That's fine. I'll mark the pandas from those racks as disabled and in storage in slavealloc.
Flags: needinfo?(coop)
Comment 10•10 years ago
(In reply to Chris Cooper [:coop] from comment #9)
> That's fine. I'll mark the pandas from those racks as disabled and in
> storage in slavealloc.
This is done.
Reporter | ||
Comment 11•10 years ago
This patch removes the dead pandas from nagios. I'll wait until the backfill ones are in place before adding them to nagios.
Reporter | ||
Comment 12•10 years ago
The following have also been decommissioned in inventory:
panda-0091.p1.releng.scl3.mozilla.com
panda-0095.p1.releng.scl3.mozilla.com
panda-0107.p1.releng.scl3.mozilla.com
panda-0126.p1.releng.scl3.mozilla.com
panda-0129.p1.releng.scl3.mozilla.com
panda-0157.p1.releng.scl3.mozilla.com
panda-0191.p2.releng.scl3.mozilla.com
panda-0219.p2.releng.scl3.mozilla.com
panda-0234.p2.releng.scl3.mozilla.com
panda-0294.p3.releng.scl3.mozilla.com
panda-0302.p3.releng.scl3.mozilla.com
panda-0330.p3.releng.scl3.mozilla.com
panda-0337.p3.releng.scl3.mozilla.com
panda-0344.p3.releng.scl3.mozilla.com
panda-0370.p4.releng.scl3.mozilla.com
panda-0373.p4.releng.scl3.mozilla.com
panda-0460.p5.releng.scl3.mozilla.com
panda-0476.p5.releng.scl3.mozilla.com
panda-0489.p5.releng.scl3.mozilla.com
panda-0490.p5.releng.scl3.mozilla.com
panda-0539.p6.releng.scl3.mozilla.com
panda-0584.p6.releng.scl3.mozilla.com
panda-0587.p6.releng.scl3.mozilla.com
panda-0588.p6.releng.scl3.mozilla.com
panda-0592.p6.releng.scl3.mozilla.com
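The hostnames above all follow the pattern `panda-NNNN.pN.releng.scl3.mozilla.com`. A minimal sketch of building one from a board name and its rack, assuming that pattern holds for every board; the `panda_fqdn` helper is made up for illustration and is not part of the inventory tooling.

```python
DOMAIN = "releng.scl3.mozilla.com"


def panda_fqdn(board, rack):
    """Build a hostname such as panda-0091.p1.releng.scl3.mozilla.com."""
    if not board.startswith("panda-"):
        raise ValueError(f"unexpected board name: {board!r}")
    if not rack.startswith("p"):
        raise ValueError(f"unexpected rack name: {rack!r}")
    return f"{board}.{rack}.{DOMAIN}"


print(panda_fqdn("panda-0091", "p1"))
```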
Reporter | ||
Comment 13•10 years ago
(In reply to Amy Rich [:arich] [:arr] from comment #8)
dcops, could you please backfill pandas as described in comment #8? The notation is:
dead panda -> replacement panda scavenged from a chassis we want to put in storage. e.g.
panda-0091 -> panda-0610
We'll need to update inventory for each board with the new location information, IP, vlan, as well as the mobile imaging server and panda-relay key/values. Since there are only a handful, do you want to do them manually, or do you want to try to get uberj's assistance to do these as a batch?
Once they're all up and functional, I'll add them to nagios.
Assignee: arich → server-ops-dcops
Component: RelOps → Server Operations: DCOps
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → dmoore
Reporter | ||
Comment 14•10 years ago
IP, vlan, mobile imaging server, and panda-relay updated in inventory. Waiting on physical move and update of location information in inventory.
Updated•10 years ago
colo-trip: --- → scl3
Comment 15•10 years ago
UPDATE: Remaining pandas that need replacement:
panda-0330 -> panda-0623
panda-0337 -> panda-0624
panda-0344 -> panda-0625
p4:
panda-0370 -> panda-0626
panda-0373 -> panda-0627
p5:
panda-0460 -> panda-0628
panda-0476 -> panda-0629
panda-0489 -> panda-0630
panda-0490 -> panda-0081
pandas from panda-chassis-057 relocated into p6
p6:
panda-0539 -> panda-0631
panda-0584 -> panda-0632
panda-0587 -> panda-0633
panda-0588 -> panda-0634
panda-0592 -> panda-0635
Comment 16•10 years ago
All pandas have been replaced, rack location and switch ports updated in inventory.
Reporter | ||
Comment 17•10 years ago
Self-tested and installed a fresh copy of 4.0.4_v3.3 on each replaced board.
Assignee: server-ops-dcops → relops
Status: NEW → RESOLVED
Closed: 10 years ago
Component: Server Operations: DCOps → RelOps
Product: mozilla.org → Infrastructure & Operations
QA Contact: dmoore → arich
Resolution: --- → FIXED