Closed
Bug 764534
Opened 12 years ago
Closed 12 years ago
develop remote imaging process for Panda ES boards to be used in production
Categories
(Infrastructure & Operations :: RelOps: General, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: dividehex, Assigned: dividehex)
References
Details
(Whiteboard: u=panda c=it p=2 [re-panda])
We need to develop a way to remotely re-image or reinitialize panda boards that fall down while in production.
Note: panda boards do not have NVRAM or flash. The boot code is located in the first partition on the sdcard.
Our current working idea is to:
Re-image sdcard via PXE boot
- This would be accomplished by scripting the panda u-boot loader (located on the first partition of the sdcard) to automatically look for a PXE boot server and attempt to load a boot file named by the board's MAC address or other unique identifier. If the file is *NOT* found, the u-boot code will give up and boot Android from the sdcard. If the file is found, u-boot will boot a small initrd Linux image loaded with further scripts that carry out the re-imaging (such as mounting an NFS export, building the rest of the Android partitions, and using dd to image the partitions).
The process would be triggered by generating the PXE boot file, named with the MAC/unique_id of the panda board, on the PXE server and then issuing a 'drop pwr' command to the relay board that controls the power to that individual panda. The PXE boot file is removed immediately after the process is done.
The limitations to this process are:
- the uboot code must be able to look for a file based on a unique and static MAC or other unique_ID to that panda
- the uboot code must be able to timeout and default to booting the SDcard if PXE boot fails (does not find a pxe boot file)
- if uboot code or the first partition becomes corrupted on the SDcard, this will require a datacenter visit to replace the card.
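The trigger step described above can be sketched in Python. This is an illustration only: the `pxelinux.cfg/01-<mac>` file name assumes the PXELINUX naming convention, which u-boot's `pxe get` is understood to follow, and the function names, the TFTP root path, and the config contents are hypothetical rather than part of any tooling described in this bug. The power cycle itself is left as a comment because the relay board's interface isn't specified here.

```python
import os

HEX_DIGITS = set("0123456789abcdef")


def pxe_config_name(mac: str) -> str:
    """Return the per-board PXE config path for a MAC address.

    Assumes PXELINUX-style naming: '01-' (the ARP hardware type for
    Ethernet) followed by the lowercase MAC with dash separators.
    """
    octets = mac.replace("-", ":").lower().split(":")
    valid = len(octets) == 6 and all(
        len(o) == 2 and set(o) <= HEX_DIGITS for o in octets
    )
    if not valid:
        raise ValueError(f"not a MAC address: {mac!r}")
    return "pxelinux.cfg/01-" + "-".join(octets)


def arm_reimage(tftp_root: str, mac: str, config_text: str) -> str:
    """Create the per-board PXE file so the next boot re-images the board."""
    path = os.path.join(tftp_root, pxe_config_name(mac))
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(config_text)
    # Next step (not shown): power-cycle the board via its relay.
    return path


def disarm_reimage(tftp_root: str, mac: str) -> None:
    """Remove the per-board PXE file so later boots fall back to the sdcard."""
    path = os.path.join(tftp_root, pxe_config_name(mac))
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # already removed; nothing to do
```

Keeping creation and removal as separate calls mirrors the requirement that the file exists only for the duration of one re-imaging run.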
Assignee
Updated•12 years ago
Assignee: server-ops-releng → ted.mielczarek
Assignee
Updated•12 years ago
Blocks: android_4.0_testing
Comment 1•12 years ago
I am going through all bugs for the tracking work for the panda boards.
How are we progressing on this? Rough is fine.
Comment 2•12 years ago
As part of bug 731670 I found that u-boot correctly generates a unique MAC address for the Pandaboard's eth0. (In fact, the fix for that bug was simply to pass the MAC from u-boot down to the net driver.)
Assignee
Comment 3•12 years ago
I'm going to go ahead and take this bug since I have done some work on it. I've successfully scripted u-boot to attempt a PXE boot first and, if that fails, to boot Android from the SDcard.
Assignee: ted.mielczarek → jwatkins
Assignee
Comment 4•12 years ago
I've set up a test environment at home to begin building a Linux initramfs image with tools to re-image the sdcard remotely. This will be built upon an Ubuntu 12.04 ARM core minimal filesystem.
A couple things needed when it comes time to set this up in production:
* a separate tftp server from the pxe server that is already in place in scl1 (the tftp server IP will be set in the boot.scr). This is because u-boot attempts to load a PXE config file on every boot, so there cannot be a default file such as the one currently set up for PXE booting in scl1. Loading a default pxe config and having a timeout to LOCALBOOT 0 will not work for this solution.
* an NFS export to serve up fs images (should probably be on the same server or VMs as the tftp server). These images will be the actual Android partitions that get dumped to the SDcard.
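To make the NFS requirement concrete, here is a rough Python sketch of the command sequence a re-imaging script in the netbooted environment might run. Everything specific in it is an assumption for illustration: the function name, the export path, the mount point, the image file names, and the partition device names are all hypothetical, and the actual scripts are not shown in this bug. The function only builds the commands; it does not run them.

```python
def reimage_plan(nfs_export, mountpoint, images):
    """Build the commands to copy partition images from NFS to the sdcard.

    nfs_export: 'server:/path' of the NFS export (hypothetical value)
    mountpoint: local directory to mount the export on
    images:     mapping of target block device -> image file name
    """
    commands = [
        # Mount read-only; the images are only ever read from the export.
        ["mount", "-t", "nfs", "-o", "ro,nolock", nfs_export, mountpoint],
    ]
    for device, image in images.items():
        # One dd per partition image, written straight to the block device.
        commands.append(["dd", f"if={mountpoint}/{image}", f"of={device}", "bs=1M"])
    commands.append(["sync"])  # flush writes before the board is rebooted
    commands.append(["umount", mountpoint])
    return commands


if __name__ == "__main__":
    plan = reimage_plan(
        "imaging-server:/export/panda",            # hypothetical export
        "/mnt/images",                             # hypothetical mount point
        {"/dev/mmcblk0p2": "system.img"},          # hypothetical mapping
    )
    for cmd in plan:
        print(" ".join(cmd))
```

Returning the commands as argument lists rather than executing them keeps the sketch safe to run anywhere and makes the sequence easy to review before pointing it at a real card.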
Assignee
Comment 5•12 years ago
I've updated the boot.scr to allow u-boot to attempt a PXE boot before defaulting to the sdcard. I've also removed smsc95xx.macaddr=${usbethaddr}, since it wasn't needed and wasn't in the correct env var. It is also unnecessary because the linaro build already has a patch to generate the MAC address from the CPU die ID. If you want to add Android boot args, append them to the bootargs var (inside the quotes).
For pandas in scl1, this is the boot.scr to use:
setenv initrd_high "0xffffffff"
setenv fdt_high "0xffffffff"
setenv bootargs "console=ttyO2,115200n8 rootwait ro earlyprintk fixrtc nocompcache vram=48M omapfb.vram=0:24M,1:24M mem=456M@0x80000000 mem=512M@0xA0000000 init=/init androidboot.console=ttyO2 omapdss.def_disp=dvi omapfb.mode=dvi:1024x768MR-24@60 consoleblank=0"
setenv bootandroid "echo Booting Android from SDcard; fatload mmc 0:1 0x80200000 uImage; fatload mmc 0:1 0x81600000 uInitrd; bootm 0x80200000 0x81600000"
setenv bootpxefirst "echo Launching PXE boot... ; if usb start; then set autoload no; bootp; setenv serverip 10.12.48.27; if pxe get; then pxe boot; else run bootandroid; fi; fi"
run bootpxefirst
For pandas that aren't in scl1 and need to skip the PXE boot process, simply change "run bootpxefirst" to "run bootandroid" before running mkimage.
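As a sketch of that last step, the Python helper below switches the final `run` line of a boot script source and builds a `mkimage` command line to wrap it as a u-boot script image. The function names and the `boot.cmd` source file name are assumptions. The `mkimage` flags are the ones commonly used for script images; the exact invocation used for these boards is not given in this bug.

```python
def select_boot_target(script_text: str, target: str) -> str:
    """Return the script with its last 'run boot...' line set to target."""
    choices = ("bootpxefirst", "bootandroid")
    if target not in choices:
        raise ValueError(f"unknown boot target: {target!r}")
    lines = script_text.splitlines()
    # Search from the end so only the final dispatch line is changed.
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].strip() in tuple(f"run {c}" for c in choices):
            lines[i] = f"run {target}"
            break
    else:
        raise ValueError("no 'run bootpxefirst' or 'run bootandroid' line found")
    return "\n".join(lines) + "\n"


def mkimage_command(source: str = "boot.cmd", output: str = "boot.scr"):
    """Build the argv for wrapping a script source as a u-boot script image."""
    return [
        "mkimage",
        "-A", "arm",       # target architecture
        "-O", "linux",     # operating system field
        "-T", "script",    # image type: u-boot script
        "-C", "none",      # no compression
        "-a", "0",         # load address (unused for scripts)
        "-e", "0",         # entry point (unused for scripts)
        "-n", "boot script",
        "-d", source,
        output,
    ]
```

For example, `select_boot_target(text, "bootandroid")` followed by running `mkimage_command()` through `subprocess.run` would produce a boot.scr that skips the PXE attempt.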
Updated•12 years ago
No longer blocks: android_4.0_testing
Updated•12 years ago
Whiteboard: u=panda c=it p=2 → u=panda c=it p=2 [re-panda]
Assignee
Comment 6•12 years ago
I've made progress on a LIVE Linux boot image. It boots from a PXE server and gives a more functional Linux environment than just an initrd w/busybox.
Assignee
Comment 7•12 years ago
Here is an update from the IT side of work
Completed:
Netboot linux live runtime environment
Re-imaging scripts in linux environment
Work in progress:
mobile-services puppet manifests
pxe config management code module
I was able to produce a proof of concept by netbooting the linux runtime env and letting the scripts handle the sdcard re-imaging. Worked flawlessly.
Comment 8•12 years ago
Ah, there's been progress since Jake's update. The "work in progress" items are complete. We have a working imaging system ready for integration with releng (buildbot), with ateam's systems (MozPool and Lifeguard), and with the hardware in scl1 (new servers, chassis, etc.).
I emailed release@ a few days ago about integration, and haven't heard anything back. I'll get a bug open on that so we can all read it.
Mark Coté and I talked last week about the a-team integration, and we are going to regroup later this week. That may result in some design changes to bmm, but I don't expect anything particularly challenging.
There are some minor "polish" items I'd like to fix up, but none of those block deployment.
The last open dependency here (bug 799616) is for monitoring. That's specified and handed off to the SREs, but not in place yet since the hardware's not installed. So I'll call this finished and remove the dep.
Updated•11 years ago
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations