Closed Bug 902657 Opened 11 years ago Closed 11 years ago

panda-recovery

Categories

(Infrastructure & Operations :: DCOps, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: Callek, Assigned: jpech)

References

Details

+++ This bug was initially created as a clone of Bug #817103 +++
Blocks: panda-0282
Blocks: panda-0292
Blocks: panda-0295
Blocks: panda-0300
Blocks: panda-0301
Blocks: panda-0305
Blocks: panda-0306
Blocks: panda-0325
Blocks: panda-0340
Blocks: panda-0387
Blocks: panda-0396
Blocks: panda-0479
Blocks: panda-0482
Blocks: panda-0729
Blocks: panda-0737
Blocks: panda-0743
Blocks: panda-0763
Blocks: panda-0769
Blocks: panda-0770
Blocks: panda-0788
Blocks: panda-0172
Blocks: panda-0180
Blocks: panda-0296
Blocks: panda-0313
Blocks: panda-0371
Blocks: panda-0392
Blocks: panda-0395
Blocks: panda-0664
Blocks: panda-0674
Blocks: panda-0696
Blocks: panda-0720
Blocks: panda-0820
Assignee: relops → jwatkins
Blocks: panda-0810
No longer blocks: panda-0479
No longer blocks: panda-0482
Depends on: 909440
Blocks: panda-0739
Assignee: jwatkins → achavez
All of these had failures; SD cards were replaced on 2013-08-26. The following pandas can be put back into service:
panda-0282 ready
panda-0300 ready
panda-0301 ready
panda-0305 ready
panda-0306 ready
panda-0313 ready
panda-0325 ready
panda-0387 ready
panda-0396 ready
panda-0729 ready
panda-0737 ready
panda-0743 ready
panda-0763 ready
panda-0769 ready
panda-0770 ready
panda-0788 ready
panda-0810 ready
panda-0820 ready
The following had SD card failures; the SD cards were replaced and the pandas can be put back in production:
panda-0292
panda-0295
panda-0296
Panda board failure; decommissioned and will be replaced with a new panda:
panda-0172
Assignee: achavez → arich
Status: NEW → ASSIGNED
Assignee: arich → achavez
These pandas were incorrectly flagged by the new mozpool selftest. The selftest has since been corrected and they now pass it without issue. We can close the tracker bugs and return them to service:
0835
0834
0803
0731
0674
0664
Also a false positive. Please return to service: 0819
panda-0036 removed/decommissioned
panda-0044 removed/decommissioned
panda-0172 removed/decommissioned
panda-0180 passed self test/sd card replaced
panda-0259 passed self test/sd card replaced
panda-0262 passed self test/sd card replaced
panda-0265 passed self test/sd card replaced
panda-0340 removed/decommissioned
panda-0371 removed/decommissioned
panda-0392 removed/decommissioned
panda-0395 removed/decommissioned
panda-0696 removed/decommissioned
panda-0720 removed/decommissioned
panda-0739 passed self test/sd card replaced
panda-0784 passed self test/sd card replaced
panda-0795 passed self test/sd card replaced
panda-0801 passed self test/sd card replaced
panda-0816 passed self test/sd card replaced
panda-0864 passed self test/sd card replaced
panda-0870 passed self test/sd card replaced
Before completely removing the decommissioned boards from mozpool/production, we should double-check them. I see some boards that were removed even though they had previously passed their tests.
Component: RelOps → Server Operations: DCOps
Product: Infrastructure & Operations → mozilla.org
QA Contact: arich → dmoore
colo-trip: --- → scl1
Whiteboard: [Will work with Jake on this Thursday]
Whiteboard: [Will work with Jake on this Thursday] → [Will work with Jake on this after summit2013]
Ashlee Chavez [:Ashlee] 2013-10-02 16:47:22 EDT
Whiteboard: [Will work with Jake on this Thursday] → [Will work with Jake on this after summit2013]

ETA on peeking guys?
Flags: needinfo?(jwatkins)
Flags: needinfo?(achavez)
Blocks: panda-0558
Blocks: panda-0843
(In reply to Justin Wood (:Callek) from comment #9)
> Ashlee Chavez [:Ashlee] 2013-10-02 16:47:22 EDT
> Whiteboard: [Will work with Jake on this Thursday] → [Will work with Jake on
> this after summit2013]
>
> ETA on peeking guys?

I've spoken with Jake via IRC; we have not come to a conclusion as to when we will be able to tackle this. Jake, any ideas?
Depends on: panda-0818
Flags: needinfo?(achavez)
Blocks: panda-0818
No longer depends on: panda-0818
Blocks: panda-0479
Blocks: panda-0482
Blocks: panda-0701
Blocks: panda-0357
Blocks: panda-0347
Since Ashlee has moved to another team, I want to volunteer to take over this bug, and I hope to resolve it before my internship ends.. (T.T) What, and who, can help guide me through the process of recovering the panda boards?
Flags: needinfo?(bugspam.Callek)
Whiteboard: [Will work with Jake on this after summit2013]
Hey John,

Please work with Jake (already needinfo'd) and coordinate with dmoore as well to properly allocate your resources here. (It would be good, imho, to have a permanent member of DCOps also go through this process with Jake, so we don't lose the mindshare when your internship ends.)
Flags: needinfo?(bugspam.Callek) → needinfo?(dmoore)
(In reply to Justin Wood (:Callek) from comment #12)
> Hey John,
>
> Please work with Jake (already needinfo'd) and coord with dmoore as well to
> properly allocate your resources here. (It would be good imho, to have a
> permanent member of dcops also go through this process with Jake, so we
> don't lose the mindshare when your internship ends)

Will do. Thanks for the info!
Assignee: achavez → jpech
DCOps got a good 4 hours of training on panda therapy yesterday, so we should get this bug resolved soon and move on to a weekly "r/f and clone" schedule.
Flags: needinfo?(jwatkins)
The majority of the pandas are in the "ready" state for releng to proceed with testing. The pandas below are failing and will need further troubleshooting. If releng can close out the working pandas in the "block" list above, it will help me narrow down exactly which pandas need investigation (similar to tegra bugs). Thanks!

panda-0172 failed_pxe_booting vhua-Unable to read "preEnv.txt" from mmc 0:1 **
panda-0444 failed_pxe_booting vhua-23.533630] panic occurred, switching back to text console
panda-0479 failed_pxe_booting vhua-not in chassis
panda-0482 failed_pxe_booting vhua-not in chassis
panda-0638 failed_pxe_booting panda-android-4.0.4_v3.1
panda-0797 failed_pxe_booting android
panda-0081 failed_self_test dividehex-panda-intervention
panda-0173 failed_self_test
panda-0280 failed_self_test vhua-selftest.py[INFO]: test_preseed_file_integrity[FAILED] boot.scr :
panda-0720 failed_self_test vhua-selftest.py[INFO]: test_mmc_blk_dev[FAILED] /dev/mmcblk0 - No such file or directory (tried multiple SD cards)
panda-0678 locked_out android
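Editor's sketch: the status list above has a regular shape (device name, state, optional free-text note), so it can be grouped by state to see which pandas need which kind of follow-up. The following is a minimal Python sketch under that assumption. `PandaStatus`, `parse_status_line`, and `group_by_state` are hypothetical names introduced here for illustration; they are not part of mozpool or any releng tooling.

```python
import re
from typing import Dict, Iterable, List, NamedTuple, Optional


class PandaStatus(NamedTuple):
    """One line of the status list (hypothetical record type)."""
    name: str
    state: str
    note: str


# Assumed line shape: "panda-NNNN <state> <optional free-text note>"
LINE_RE = re.compile(r"^(panda-\d{4})\s+(\S+)(?:\s+(.*))?$")


def parse_status_line(line: str) -> Optional[PandaStatus]:
    """Parse one status line; return None if it does not match the shape."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    name, state, note = match.groups()
    return PandaStatus(name, state, note or "")


def group_by_state(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each state to the device names reported in that state."""
    groups: Dict[str, List[str]] = {}
    for line in lines:
        record = parse_status_line(line)
        if record is not None:
            groups.setdefault(record.state, []).append(record.name)
    return groups


if __name__ == "__main__":
    sample = [
        "panda-0479 failed_pxe_booting vhua-not in chassis",
        "panda-0173 failed_self_test",
        "panda-0678 locked_out android",
    ]
    for state, names in group_by_state(sample).items():
        print(state, names)
```

Lines that do not start with a `panda-NNNN` name are skipped rather than raising, since the list was typed by hand into a comment.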
(In reply to Vinh Hua [:vinh] from comment #15)
> Majority of the pandas are in the "ready" state for releng to proceed with
> testing. The below pandas are failing and will need further
> troubleshooting. If releng can close out the working pandas in the "block"
> list above, then it will help me narrow down exactly which pandas need
> investigation (similar to tegra bugs). Thanks!
>
>
> panda-0172 failed_pxe_booting vhua-Unable to read "preEnv.txt" from mmc 0:1

Unable to read "preEnv.txt" from mmc 0:1 is a normal error message. The uboot loader should continue to load/netboot. Did you try swapping out the SD card on this? If you did, does it halt at that message or continue booting?

> **
> panda-0444 failed_pxe_booting vhua-23.533630] panic occurred, switching
> back to text console

This sounds like a pandaboard hardware issue, and it would be interesting to see an entire serial console capture to better identify the deeper issue here. It might be something we can add to selftest to check for. I would suggest swapping the SD card if you haven't. If you have, and it still continues, remove the panda board (and order a replacement if we don't have any spare boards).

> panda-0479 failed_pxe_booting vhua-not in chassis
> panda-0482 failed_pxe_booting vhua-not in chassis

These 2 panda boards were removed from service and should be replaced. I'll remove them from mozpool. See bug 836808.

> panda-0638 failed_pxe_booting panda-android-4.0.4_v3.1

SD card swap didn't work here? If so, we can assume the pandaboard is dead and should be replaced.

> panda-0797 failed_pxe_booting android

Same here?

> panda-0081 failed_self_test dividehex-panda-intervention

What is the reason the self_test failed here? (see device log)

> panda-0173 failed_self_test

Same. Why did it fail? (see device log)

> panda-0280 failed_self_test vhua-selftest.py[INFO]:
> test_preseed_file_integrity[FAILED] boot.scr :

A boot.scr integrity check failure indicates an outdated preseed image and should be fixed by:
1.) force state to 'troubleshooting'
2.) please_image -> repair-boot
3.) please_self_test

> panda-0720 failed_self_test vhua-selftest.py[INFO]:
> test_mmc_blk_dev[FAILED] /dev/mmcblk0 - No such file or directory (tried
> multiple SD cards)

This test indicates a bad pandaboard. Remove and replace.

> panda-0678 locked_out android

I have no idea why (or who) locked_out this panda. Check with #releng or #ateam. There should always be a bug # in the comment of a panda that is locked_out, for obvious reasons. If no one claims they have reserved it, force state to troubleshooting and then run a selftest.

Aside from the pandas listed here, I do think we should close this bug, migrate the list in c15 to a new recovery bug, and return the rest of the pandas to production. We really want to get into the habit of a weekly bug for DCOps to handle.

Callek: is it reasonable to do this sometime this week, so new problem pandas don't get cluttered up here?
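Editor's sketch: the per-device advice in the comment above amounts to a small decision table keyed on the device state and a substring of the failure note. The Python below restates that table as a lookup, purely as an illustration. `recommend` is a hypothetical helper, not a mozpool API; the step strings are paraphrased from the comment, and nothing here talks to mozpool or changes any device state.

```python
from typing import List


def recommend(state: str, note: str = "") -> List[str]:
    """Return the follow-up steps suggested in this bug for a failing panda.

    Hypothetical helper: it only encodes the advice given in the comment
    above and falls back to manual investigation for anything else.
    """
    note = note.lower()

    if state == "failed_pxe_booting":
        if "not in chassis" in note:
            # Board was physically removed from service.
            return ["replace the board and remove the old entry from mozpool"]
        return [
            "swap the SD card",
            "if it still fails to boot, remove and replace the board",
        ]

    if state == "failed_self_test":
        if "test_preseed_file_integrity" in note:
            # Outdated preseed image: the three-step repair from the comment.
            return [
                "force state to 'troubleshooting'",
                "please_image -> repair-boot",
                "please_self_test",
            ]
        if "test_mmc_blk_dev" in note:
            return ["bad board: remove and replace"]
        return ["check the device log for the reason the self test failed"]

    if state == "locked_out":
        return [
            "check with #releng or #ateam for who reserved it",
            "if unclaimed, force state to 'troubleshooting' and run a selftest",
        ]

    return ["no guidance in this bug: investigate manually"]


if __name__ == "__main__":
    for step in recommend("failed_self_test", "test_preseed_file_integrity[FAILED] boot.scr :"):
        print(step)
```

Matching on substrings of the note is fragile by design; it mirrors how the failures were reported by hand in this bug rather than any structured selftest output.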
Flags: needinfo?(bugspam.Callek)
(In reply to Jake Watkins [:dividehex] from comment #16)
> Aside from the pandas listed here, I do think we should close this bug,
> migrate the list in c15 a new recovery bug and return the rest of the pandas
> to production. We really want to get into the habit of a weekly bug for
> DCOPs to handle.
>
> Callek: is it reasonable to do this sometime this week so new problem pandas
> don't get cluttered up here.

Indeed, I had already planned to do so today, and got caught up with a power outage at home --> doing so now.
Flags: needinfo?(dmoore)
Flags: needinfo?(bugspam.Callek)
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Alias: panda-recovery
Product: mozilla.org → Infrastructure & Operations
No longer blocks: panda-0283