Closed Bug 764534 Opened 12 years ago Closed 12 years ago

develop remote imaging process for Panda ES boards to be used in production

Categories

(Infrastructure & Operations :: RelOps: General, task)

ARM
Android
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dividehex)

References

Details

(Whiteboard: u=panda c=it p=2 [re-panda])

We need to develop a way to remotely re-image or reinitialize panda boards that fall down while in production. Note: panda boards do not have NVRAM or a flash. Boot code is located in the first partition on the sdcard.

Our current working idea is to:

Re-image sdcard via PXE boot - This solution would be accomplished by scripting the panda uboot loader (located on the first partition of the sdcard) to automatically look for a pxe boot server and attempt to load a boot file by its MAC address or other unique identifier. If the file is *NOT* found, the uboot code will give up and try booting android from the SDcard. If the file is found, it will continue by booting a small initrd linux image which will be loaded with further scripts to start the re-imaging process (such as mounting a NFS export, building the rest of the android partitions, and using dd to image the partitions.)

The process would be triggered by having the pxe boot file, named with the MAC/unique_id of the panda board, generated on the PXE server (which is also removed immediately after the process is done) and then issuing a 'drop pwr' command to the relay board that controls the power to that individual panda.

The limitations to this process are:
- the uboot code must be able to look for a file based on a unique and static MAC or other unique_ID to that panda
- the uboot code must be able to timeout and default to booting the SDcard if PXE boot fails (does not find a pxe boot file)
- if uboot code or the first partition becomes corrupted on the SDcard, this will require a datacenter visit to replace the card.
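As a rough illustration of the trigger step described above, here is a minimal sketch of creating and removing a per-board PXE config file. It assumes the pxelinux-style naming convention (a `01-` prefix followed by the MAC with dashes, under a `pxelinux.cfg` directory), which this bug does not confirm as the final layout. The function names `pxe_config_name`, `arm_reimage` and `disarm_reimage` are hypothetical, and the relay board's 'drop pwr' command is deliberately left out because its interface is not described here.

```python
import re
from pathlib import Path


def pxe_config_name(mac: str) -> str:
    """Return a pxelinux-style config name for a MAC: '01-' plus dashed octets.

    Assumption: the boot loader looks up its config by this convention.
    """
    octets = re.split(r"[:-]", mac.strip().lower())
    if len(octets) != 6 or not all(re.fullmatch(r"[0-9a-f]{2}", o) for o in octets):
        raise ValueError(f"not a MAC address: {mac!r}")
    return "01-" + "-".join(octets)


def arm_reimage(tftp_root: Path, mac: str, config_text: str) -> Path:
    """Create the per-board config so the next power cycle netboots (hypothetical helper)."""
    cfg_dir = tftp_root / "pxelinux.cfg"
    cfg_dir.mkdir(parents=True, exist_ok=True)
    path = cfg_dir / pxe_config_name(mac)
    path.write_text(config_text)
    return path


def disarm_reimage(path: Path) -> None:
    """Remove the config once imaging is done so the board boots the SDcard again."""
    if path.exists():
        path.unlink()
```

The caller would arm the board, power-cycle it through the relay, and disarm once the imaging scripts report completion. Because no default config exists, a board with no per-MAC file falls through to booting Android.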
Assignee: server-ops-releng → ted.mielczarek
I am going through all bugs for the tracking work for the panda boards. How are we progressing on this? Rough is fine.
As part of bug 731670 I found that u-boot correctly generates a unique MAC address for the Pandaboard's eth0. (In fact, the fix for that bug was simply to pass the MAC from u-boot down to the net driver.)
I'm going to go ahead and take this bug since I have done some work on it. I've successfully scripted u-boot to first attempt to pxe boot and upon failing will boot Android from the SDcard.
Assignee: ted.mielczarek → jwatkins
I've set up a test environment at home to begin building a linux initramfs image with tools to re-image the sdcard remotely. This will be built upon an ubuntu 12.04 arm core minimal filesystem.

A couple of things are needed when it comes time to set this up in production:

* a separate tftp server from the pxe server that is already in place in scl1. (The tftp server IP will be set in the boot.scr.) This is because the panda attempts to load a PXE config file on every boot, so there cannot be a default file such as what is currently set up for PXE booting in scl1. Loading a default pxe config and having a timeout to LOCALBOOT 0 will not work for this solution.
* an NFS export to serve up fs images. (This should probably be the same server or vms as the tftp server.) These images will be the actual android partitions that get dumped to the SDcard.
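To make the NFS-plus-dd idea concrete, here is a minimal sketch that only builds the command lines an imaging script might run, without executing anything. The export path, mount point, image names and device nodes in the usage note are all placeholders; the real partition layout and scripts are not given in this bug. The function name `imaging_commands` is hypothetical.

```python
from typing import Dict, List


def imaging_commands(nfs_export: str, mount_point: str,
                     images: Dict[str, str]) -> List[List[str]]:
    """Build (not run) the commands to mount an NFS export and dd images to devices.

    `images` maps a target block device to an image file name on the export.
    All values are supplied by the caller; nothing here is the production layout.
    """
    cmds = [["mount", "-t", "nfs", "-o", "ro", nfs_export, mount_point]]
    for device, image in images.items():
        cmds.append(["dd", f"if={mount_point}/{image}", f"of={device}", "bs=4M"])
    cmds.append(["sync"])  # flush writes before the board is power-cycled
    cmds.append(["umount", mount_point])
    return cmds
```

For example, `imaging_commands("server:/export", "/mnt", {"/dev/mmcblk0p2": "system.img"})` (placeholder values) yields a mount, one dd, a sync and an umount, which could then be passed to `subprocess.run` one at a time with `check=True`.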
Depends on: 773275
I've updated the boot.scr to allow uboot to attempt a pxe boot before defaulting to the sdcard. I've also removed the smsc95xx.macaddr=${usbethaddr} since it wasn't needed and wasn't in the correct env var. It is also not needed since the linaro build already has a patch to generate the mac id off the cpu die id. If you want to add android boot args, append them to the bootargs var (inside the quotes).

For pandas in scl1, this is the boot.scr to use:

setenv initrd_high "0xffffffff"
setenv fdt_high "0xffffffff"
setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi omapfb.mode=dvi:1024x768MR-24@60 consoleblank=0"
setenv bootandroid "echo Booting Android from SDcard; fatload mmc 0:1 0x80200000 uImage; fatload mmc 0:1 0x81600000 uInitrd; bootm 0x80200000 0x81600000"
setenv bootpxefirst "echo Launching PXE boot... ; if usb start; then set autoload no; bootp; setenv serverip 10.12.48.27; if pxe get; then pxe boot; else run bootandroid; fi; fi"
run bootpxefirst

For pandas that aren't in scl1 and that need to skip the pxe boot process, simply change "run bootpxefirst" to "run bootandroid" before running mkimage.
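The "change the last line before running mkimage" step can be sketched as below. This is an illustration under assumptions, not the tooling actually used: the helper names are hypothetical, the `setenv` definitions are passed in by the caller rather than repeated here, and the mkimage arguments are the usual ones for wrapping a U-Boot script (`-T script`) with placeholder file names.

```python
from typing import List


def final_command(use_pxe: bool) -> str:
    """Pick the last line of the boot script: PXE-first in scl1, SDcard-only elsewhere."""
    return "run bootpxefirst" if use_pxe else "run bootandroid"


def render_boot_script(definitions: List[str], use_pxe: bool) -> str:
    """Join the setenv definitions and append the chosen final 'run' line."""
    return "\n".join(definitions + [final_command(use_pxe)]) + "\n"


def mkimage_argv(source: str, output: str = "boot.scr") -> List[str]:
    """Build (not run) a mkimage command line that wraps a script as a U-Boot image."""
    return ["mkimage", "-A", "arm", "-O", "linux", "-T", "script",
            "-C", "none", "-n", "boot script", "-d", source, output]
```

A caller would write the rendered text to a source file (for instance `boot.txt`, a placeholder name) and then run the returned argv with `subprocess.run(..., check=True)` to produce the boot.scr that goes on the first partition.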
No longer blocks: android_4.0_testing
Whiteboard: u=panda c=it p=2
Depends on: 788687
Depends on: 788632
Blocks: 789129
Whiteboard: u=panda c=it p=2 → u=panda c=it p=2 [re-panda]
Depends on: 791228
I've made progress on a LIVE Linux boot image. It boots from a PXE server and gives a more functional Linux environment than just an initrd w/busybox.
Here is an update from the IT side of work.

Completed:
- Netboot linux live runtime environment
- Re-imaging scripts in linux environment

Work in progress:
- mobile-services puppet manifests
- pxe config management code module

I was able to produce a proof of concept by netbooting the linux runtime env and letting the scripts handle the sdcard re-imaging. Worked flawlessly.
Depends on: 797946
No longer depends on: 801643
Ah, there's been progress since Jake's update. The "work in progress" items are complete. We have a working imaging system ready for integration with releng (buildbot), with ateam's systems (MozPool and Lifeguard), and with the hardware in scl1 (new servers, chassis, etc.). I emailed release@ a few days ago about integration, and haven't heard anything back. I'll get a bug open on that so we can all read it. Mark Coté and I talked last week about the a-team integration, and we are going to regroup later this week. That may result in some design changes to bmm, but I don't expect anything particularly challenging. There are some minor "polish" items I'd like to fix up, but none of those block deployment. The last open dependency here (bug 799616) is for monitoring. That's specified and handed off to the SREs, but not in place yet since the hardware's not installed. So I'll call this finished and remove the dep.
Status: NEW → RESOLVED
Closed: 12 years ago
No longer depends on: 799616
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations