Closed Bug 819545 Opened 12 years ago Closed 12 years ago

determine why android img is stable and mozpool installed is not

Categories

(Infrastructure & Operations :: RelOps: General, task)

x86_64
Windows 7
task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dividehex, Assigned: dividehex)


We need to figure out why the sector-by-sector image of an Android sdcard is stable and why the mozpool-deployed tarballs are not. We have installed the .img image on panda-0022 through panda-0027. On the other hand, we have installed Android tarballs via mozpool on panda-0028 through panda-0033. (panda-0029 and panda-0033 have been showing weird behavior all around, so let's not use those for now.) One working idea is that the preseed image used by mozpool adds net device initialization/DHCP/PXE boot to the boot.scr before handing over to the kernel already installed on the sdcard. This could be causing a dirty initial state for the Android kernel. For the first test, let's take the pandas with the mozpool-deployed tarballs, PXE boot them into maintenance mode, and swap their boot.scr for the non-PXE boot.scr from the IMG. :jmaher, can you describe, in as much detail as possible, how the mozpool-deployed pandas are unstable or less stable? Do you see any of the same behavior, to any degree, on the IMG-installed pandas? Also, where exactly on /mnt/sdcard does sutagent put test data? Does it create dirs, or does it expect certain dirs to be already present? Thanks.
(In reply to Jake Watkins [:dividehex] from comment #0)
> For the first test, lets take the pandas with the mozpool deployed tarballs
> and pxe boot them into maintenance mode and change out their boot.scr for
> the non-pxe boot.scr from the IMG.
I've done this to panda-{0028,0030,0031,0032}. They have also been locked_out in mozpool, so you are free to test them and use relay.py.
Sutagent is not available most of the time (actually, ping is not either). I do a relay powercycle, and it takes 2-3 tries before sutagent is accessible. Another issue is that the device sometimes becomes unresponsive during a test. I haven't debugged this, but it is frustrating since this doesn't happen with the raw image. For reference, I ran 402 tests against the raw images over the weekend and only 22 failed. I ran 6 attempts at the same smoketest on a different panda from lifeguard and not one succeeded.
* Could we be delaying the bootup long enough that we miss out on our IP address?
* Could our /media partition or /system partition be mounted differently, in a way that would affect performance and cause things to time out?
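For comparison, the counts reported in this comment can be turned into failure rates. This is only a minimal sketch: the helper name `failure_rate` is illustrative and not part of any mozpool or sutagent tooling, and the two runs differ in size and test mix, so the figures are a rough comparison at best.

```python
def failure_rate(failed, total):
    """Return the fraction of runs that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


# Counts taken from the comment above.
raw_image = failure_rate(22, 402)  # raw .img pandas, weekend run
mozpool = failure_rate(6, 6)       # mozpool-deployed panda, smoketest attempts

print(f"raw image: {raw_image:.1%} of tests failed")
print(f"mozpool:   {mozpool:.1%} of attempts failed")
```

With only 6 attempts on the mozpool side, the second rate is a coarse estimate, but the gap is large enough to point at the deployment method rather than noise.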
Blocks: 799698
Joel, what were the differences in results between panda-{0028,0030..0032} and other normal mozpool-imaged pandas? That's the critical bisection data we need to determine the next step.
We found that the boards would not boot up and get on the network consistently, and that many times during test execution we would fail due to a lost connection or a reboot. This is not the case with the raw image.
Does "the boards" in comment 5 mean the devices specifically configured with the non-pxe boot.scr? If so, then we can rule out u-boot and look to problems with the other partitions.
jmaher: I'm setting up different samples of pandas with certain changes to try to isolate the cause of the instability. The first single-blind test sample is called sample "A". The sample A set is panda-{0522-0526}. Please test this set as rigorously as you can and let me know what the failure rates were compared to the stable raw image rates. Also note any symptoms, new or previously seen. As much detail as possible would be much appreciated.
jmaher: Sample B is ready for testing. panda-{0527-0531}
Blocks: 824767
The panda boards ran 49 tests and passed 48 of them; I call that success. Let's use this new technique.
The last change for sample B was fixing the boot args in boot.scr, which had been truncated in a copy and paste from the wiki when the preseed was built. We will need to update the preseed image. As for the pandas with the preseed already installed, we can write a second-stage script for mozpool to update the boot.scr.
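A truncation like this can be caught by comparing the boot args in the built script against the reference copy. The following is a rough sketch, not the check that was actually used here: it assumes the plain-text source of each script is available (boot.scr itself is a wrapped U-Boot script image), and the function names and sample boot args are hypothetical, not the real panda values.

```python
import re


def extract_bootargs(script_text):
    """Return the value of the first 'setenv bootargs ...' line, or None."""
    match = re.search(r"^\s*setenv\s+bootargs\s+(.*)$", script_text, re.MULTILINE)
    if match is None:
        return None
    # Drop surrounding whitespace and any quote characters at either end.
    return match.group(1).strip().strip("'\"")


def looks_truncated(reference, candidate):
    """True if candidate is a strict prefix of reference."""
    return candidate != reference and reference.startswith(candidate)


# Hypothetical script sources for illustration only.
reference_script = (
    "setenv bootargs 'console=ttyO2,115200n8 root=/dev/mmcblk0p2 rootwait init=/init'\n"
    "bootm\n"
)
built_script = (
    "setenv bootargs 'console=ttyO2,115200n8 root=/dev/mmcb\n"
    "bootm\n"
)

ref_args = extract_bootargs(reference_script)
built_args = extract_bootargs(built_script)

if looks_truncated(ref_args, built_args):
    print("boot args in the built script look truncated")
```

A strict-prefix test only catches cut-off text; boot args that differ in other ways would need a full comparison.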
I'm going to call this fixed. Deployment is being tracked in bug 825322 and bug 826694.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Component: Server Operations: RelEng → RelOps
Product: mozilla.org → Infrastructure & Operations