Closed Bug 525037 Opened 15 years ago Closed 14 years ago

Investigate formatting working partition on boot for Talos

Categories

(Release Engineering :: General, defect, P2)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: jhford)

References

Details

(Whiteboard: [talos][automation])

Attachments

(4 files, 2 obsolete files)

Running off of a fresh partition may help stabilize numbers for certain tests/platforms. It will at least give us a more consistent starting state.

* Format 2nd partition on boot
* Have Firefox and pagesets, etc. run off of the 2nd partition
** Temporary profile should be created there as well.

Buildbot, apache, etc. will still run off of the primary partition.
We are already doing this with our maemo machines (http://hg.mozilla.org/build/tools/file/tip/buildfarm/mobile/production-sd/rootfs/etc/init.d/buildbot). I am sure that this script would work for our Linux and Mac slaves, and maybe even the Windows ones if there is a command-line way to format the drives. As a work-in-progress portion of this script, I am investigating the use of the Python standard library logging API to send device status before starting buildbot (though this remote monitoring does not yet work).
There is also work in bug 525030 to clean out logs on Talos machines with each reboot, which might also help stabilize numbers. Catlee noted that Apache in particular has a lot of errors in the logs.
Assignee: catlee → nobody
Priority: -- → P5
Whiteboard: [talos][automation]
Assignee: nobody → nrthomas
Assignee: nrthomas → lsblakk
Whiteboard: [talos][automation] → [talos][automation][triagefollowup]
(In reply to comment #0)
> Running off of a fresh partition may help stabilize numbers for certain
> tests/platforms. It will at least give us a more consistent starting state.
>
> * Format 2nd partition on boot
> * Have Firefox and pagesets, etc. run off of the 2nd partition
> ** Temporary profile should be created there as well.
>
> Buildbot, apache, etc. will still run off of the primary partition.

Which tests are we hoping to improve by doing this? Should this be an auto-tools bug for the investigation stage, to see if doing this does in fact improve anything?
Whiteboard: [talos][automation][triagefollowup] → [talos][automation]
(In reply to comment #5)
> Which tests are we hoping to improve by doing this?

Any perf tests that are showing significant variation between runs are likely (we hope!) to see more consistent numbers from doing this, e.g. Ts/Tshutdown, and perhaps even Tp4.
We currently do this on mobile. We are running fewer iterations of the tests but are seeing fairly stable numbers [1], from what I have been told.

On Linux, this isn't too difficult to do. We create a blank file of whatever size we need the partition to be, mkfs.ext2 it, then loopback-mount it. This is advantageous because you can later unmount the file and save it to another location for future inspection. A disadvantage is that this particular method only works on Linux. The steps we use for this are located at:
http://hg.mozilla.org/build/tools/file/8de071880651/buildfarm/mobile/n900-imaging/rootfs/root-skel/etc/event.d/buildbot#l28

[1] examples:
http://graphs.mozilla.org/#tests=[[16,11,463],[16,11,464],[16,11,465],[16,11,466],[16,11,467],[16,11,468],[16,11,469],[16,11,470],[16,11,471],[16,11,472],[16,11,474],[16,11,475],[16,11,476],[16,11,477],[16,11,478],[16,11,479],[16,11,480],[16,11,481],[16,11,615],[16,11,616],[16,11,617],[16,11,618],[16,11,619],[16,11,620],[16,11,621],[16,11,622],[16,11,624],[16,11,626],[16,11,627],[16,11,628],[16,11,629],[16,11,630],[16,11,631],[16,11,632],[16,11,633],[16,11,634],[16,11,636],[16,11,637],[16,11,638],[16,11,639],[16,11,642],[16,11,643]]
http://graphs.mozilla.org/#tests=[[21,11,463],[21,11,464],[21,11,465],[21,11,466],[21,11,467],[21,11,468],[21,11,469],[21,11,470],[21,11,471],[21,11,472],[21,11,474],[21,11,475],[21,11,476],[21,11,477],[21,11,478],[21,11,479],[21,11,480],[21,11,481],[21,11,615],[21,11,616],[21,11,617],[21,11,618],[21,11,619],[21,11,620],[21,11,621],[21,11,622],[21,11,624],[21,11,626],[21,11,627],[21,11,628],[21,11,629],[21,11,630],[21,11,631],[21,11,632],[21,11,633],[21,11,634],[21,11,636],[21,11,637],[21,11,638],[21,11,639],[21,11,642],[21,11,643]]&sel=1282169202,1283535188
http://graphs.mozilla.org/#tests=[[23,11,463],[23,11,464],[23,11,465],[23,11,466],[23,11,467],[23,11,468],[23,11,469],[23,11,470],[23,11,471],[23,11,472],[23,11,474],[23,11,475],[23,11,476],[23,11,477],[23,11,478],[23,11,479],[23,11,480],[23,11,481],[23,11,615],[23,11,616],[23,11,617],[23,11,618],[23,11,619],[23,11,620],[23,11,621],[23,11,622],[23,11,624],[23,11,626],[23,11,627],[23,11,628],[23,11,629],[23,11,630],[23,11,631],[23,11,632],[23,11,633],[23,11,634],[23,11,636],[23,11,637],[23,11,638],[23,11,639],[23,11,642],[23,11,643]]&sel=1282442242,1283535188
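For reference, a minimal sketch of that loopback approach on Linux; the file name, size, and mount point here are illustrative assumptions, not the values from the actual n900 init script linked above:

#!/bin/sh
# Minimal sketch of the loopback-file approach described above.
# IMG, MNT, and the 420MB size are illustrative assumptions, not
# the values from the actual n900 init script.
IMG=/builds/tests.img
MNT=/builds/tests

# create a fresh, zeroed backing file on every boot
dd if=/dev/zero of="$IMG" bs=1M count=420

# put a brand-new ext2 filesystem inside it (-F skips the
# "this is not a block device" confirmation)
mkfs.ext2 -F "$IMG"

# loopback-mount it; tests then run out of $MNT
mkdir -p "$MNT"
mount -o loop "$IMG" "$MNT"

# later, to keep the filesystem around for inspection:
#   umount "$MNT" && mv "$IMG" /some/archive/location

Unmounting and saving the backing file aside preserves the exact filesystem state of a given run, which is the inspection advantage mentioned above.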
Assignee: lsblakk → jhford
(In reply to comment #7)
> On Linux, this isn't too difficult to do. We create a blank file of whatever
> size we need the partition to be, mkfs.ext2 it, then loopback-mount it.

At risk of the perfect being the enemy of the good, this still might not fix the problem because you're at the mercy of the primary FS creating a relatively unfragmented 420MB file. And it's being created 1024 bytes at a time (modulo buffering?), so I wouldn't be so sure about the FS not being dumb about this -- which, really, is the whole reason we're here in the first place. It's likely better than the status quo, but if we're going to kill this dead (dead!), a new FS on a reserved area of the oxide really seems like the way to go.
(In reply to comment #8)
> (In reply to comment #7)
> > On Linux, this isn't too difficult to do. We create a blank file of whatever
> > size we need the partition to be, mkfs.ext2 it, then loopback-mount it.
>
> At risk of the perfect being the enemy of the good, this still might not fix
> the problem because you're at the mercy of the primary FS creating a relatively
> unfragmented 420MB file. And it's being created 1024 bytes at a time (modulo
> buffering?), so I wouldn't be so sure about the FS not being dumb about this --
> which, really, is the whole reason we're here in the first place.

Yes, that is a valid concern. We are creating this file on a 32GB FAT32 volume which is maybe using 1GB of files, including this file. The main reason for doing this on the phones was not to improve numbers, but to avoid being at the mercy of FAT32's suckage. The alternative (formatting the real fs) is a very long process (32GB on a class 2? SDHC card) and requires low-level changes to OS startup.

> It's likely better than the status quo, but if we're going to kill this dead
> (dead!), a new FS on a reserved area of the oxide really seems like the way to
> go.

For the real Linux desktop machines, that'd be what I would do, for sure.
Depends on: 615603
I have requested that two currently offline minis be brought back to the office for use in support of this bug. My plan is to run Linux on two minis with a new fs on each boot in a staging environment, to evaluate whether this is a useful step forward. The changes required for Windows, and possibly Mac, are outside the scope of this bug.
Since we are using LVM, I was able to painlessly shrink the root volume without having to futz around with the MBR! I used a command similar to

# lvresize -r -L 50G /dev/vg_talosr3fedref/lv_root

in a Fedora 14 live CD session. Once I was back into Fedora 12 on the disk, I created a tests volume with

# lvcreate vg_talosr3fedref -L 5G -n lv_tests

To emulate a reboot on the non-formatting machine, I am thinking that I will "mount -o remount /" and see if the filesystem cache is cleared. If it isn't, I will try

# sync
# echo 3 > /proc/sys/vm/drop_caches

I don't want to reboot between tests because that will complicate this testing.
Are you proposing we stop rebooting between jobs in production too? There's value in having the machine in a clean state before starting tests as well.
(In reply to comment #11)
> Since we are using LVM, I was able to painlessly shrink the root volume without
> having to futz around with the MBR! I used a command similar to
>
> # lvresize -r -L 50G /dev/vg_talosr3fedref/lv_root
>
> in a Fedora 14 live CD session. Once I was back into Fedora 12 on the disk, I
> created a tests volume with
>
> # lvcreate vg_talosr3fedref -L 5G -n lv_tests
>
> To emulate a reboot on the non-formatting machine, I am thinking that I will
> "mount -o remount /" and see if the filesystem cache is cleared. If it isn't,
> I will try
>
> # sync
> # echo 3 > /proc/sys/vm/drop_caches
>
> I don't want to reboot between tests because that will complicate this testing.

I have verified that the filesystem cache is cleared on an explicit umount & mount, but not on a mount -o remount. I am going to do the same partition jiggling on both machines, but only reformat one.
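A minimal sketch of that cache-clearing step, given those findings; the device name is the one from comment #11, but the mount point and the script itself are assumptions, not the actual test code:

#!/bin/sh
# Sketch: clear the fs cache for the tests volume between runs.
# Per the verification above, an explicit umount/mount drops the
# cached pages for this filesystem; "mount -o remount" does not.
# TESTS_MNT is an illustrative assumption.
TESTS_DEV=/dev/vg_talosr3fedref/lv_tests
TESTS_MNT=/builds/tests

sync
umount "$TESTS_MNT"
mount "$TESTS_DEV" "$TESTS_MNT"

# heavier, system-wide alternative (drops page cache, dentries,
# and inodes for all filesystems):
#   sync; echo 3 > /proc/sys/vm/drop_caches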
(In reply to comment #12)
> Are you proposing we stop rebooting between jobs in production too? There's
> value in having the machine in a clean state before starting tests as well.

Not at all; this is purely an investigation of the filesystem's effect on numbers. The way I have written my test script, I can either run N iterations and exit, or have the computer start the script on every reboot, run once, reboot, ad infinitum. Because one approach will take longer than the other, I will only get the same number of data points if I do N runs and then exit.
I have also changed the sudo rule to allow passwordless arbitrary sudo as root.
Attached file bench-talos (deleted) —
This is a script that runs Talos on a machine, with the filesystem cache cleared after each run to loosely approximate a system reboot. The test procedure is going to be:

0. copy data onto machine
1. full, cold restart
2. close terminal (that won't work)
3. open terminal again from kicker
4. run "./bench-talos" on one machine and "FORMAT=1 ./bench-talos" on the other
5. ?
6. profit
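The attachment itself isn't reproduced in the bug; a hypothetical sketch of what such a harness could look like, reusing the LVM volume from comment #11 (the Talos invocation, paths, logging, and run count are all assumptions, not the contents of the actual attachment):

#!/bin/sh
# Hypothetical sketch of a bench-talos-style harness. The Talos
# entry point, paths, and default run count are assumptions.
TESTS_DEV=/dev/vg_talosr3fedref/lv_tests
TESTS_MNT=/builds/tests
RUNS=${RUNS:-200}

i=0
while [ "$i" -lt "$RUNS" ]; do
    umount "$TESTS_MNT" 2>/dev/null

    # FORMAT=1 selects the formatting machine's behaviour:
    # a brand-new ext2 fs for every single run
    if [ -n "$FORMAT" ]; then
        mkfs.ext2 -q "$TESTS_DEV"
    fi

    # mounting (with or without the format) clears the fs cache
    # for this volume, loosely approximating a reboot
    mount "$TESTS_DEV" "$TESTS_MNT"

    # run the perf tests out of the tests volume
    # (hypothetical entry point and config name)
    (cd "$TESTS_MNT/talos" && python run_tests.py ts_cold.config) \
        > "$HOME/results/run-$i.log" 2>&1

    i=$((i + 1))
done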
(In reply to comment #16)

I ran this test overnight to verify that the script works for long periods (it does) and found that it is taking around 20 minutes to run through Ts and Tp4.

Quick math tells me that I'll be able to do about 70 test runs per day. I am going to change the number from 2000 runs to 200 and see what the numbers look like after the weekend.

I am also moving these machines into the second floor server room and putting them on UPS.
Status: NEW → ASSIGNED
Priority: P5 → P2
(In reply to comment #17)
> (In reply to comment #16)
>
> I ran this test overnight to verify that the script works for long periods (it
> does) and found that it is taking around 20 minutes to run through Ts and Tp4.
>
> Quick math tells me that I'll be able to do about 70 test runs per day. I am
> going to change the number from 2000 runs to 200 and see what the numbers look
> like after the weekend.
>
> I am also moving these machines into the second floor server room and putting
> them on UPS.

It looks like the machine which was doing the formats per-run froze. I am going to reboot it to figure out how far through it got. There is a good chance that doing a bunch of mkfs runs screwed up the system. I am going to try a rebooting version of the script to see if that improves reliability.
Attached file results from computer that didn't format (obsolete) (deleted) —
Attached file results from computer that did format (obsolete) (deleted) —
Attachment #496970 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #20)
> Created attachment 496970 [details]
> results from computer that did format

Not sure why this didn't run/report Tp4.
Attachment #496967 - Attachment is obsolete: true
Attachment #496970 - Attachment is obsolete: true
Attached file get the stddev and mean from data (deleted) —
Output from script:

Formatting ts_cold stddev: 73.2167448757
Non-formatting ts_cold stddev: 87.6435537353
Formatting ts_cold_shutdown stddev: 7.49324117601
No-formatting ts_cold_shutdown stddev: 8.58330280141
Formatting ts_cold mean: 6365.83147059
Non-formatting ts_cold mean: 5896.51210526
Formatting ts_cold_shutdown mean: 460.328823529
No-formatting ts_cold_shutdown mean: 477.011578947
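The attached analysis script isn't shown in the bug; a minimal sketch of computing a mean and population standard deviation from a one-number-per-line results file (the input format, and the use of the population rather than sample stddev, are assumptions):

#!/bin/sh
# Hypothetical sketch: mean and population stddev of a data file
# with one measurement per line. The actual (now-obsolete)
# attachment's input format and exact math may have differed.
awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        # population variance: E[x^2] - (E[x])^2
        printf "mean:   %f\n", mean
        printf "stddev: %f\n", sqrt(sumsq / n - mean * mean)
    }
' "$1"

Run as, e.g., ./stats.sh ts_cold_results.txt; a sample stddev would divide by n-1 instead of n.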
Attachment #514287 - Attachment mime type: application/octet-stream → text/plain
This has been investigated and I believe that there is no further action to be taken. While testing this, I found that the machine that did format the working drive was not able to actually run non-Ts tests reliably (hence the lack of data), and that it wasn't able to complete anywhere near as many test runs as the device that didn't format. While there are advantages to formatting for every test run, the slowdown caused by the actual format (a non-trivial amount of time), as well as the lower reliability, means that this would hurt us more than help. This test was done on Linux, where I would expect automated filesystem management to be at its strongest.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Thanks for looking into this, jhford.
Product: mozilla.org → Release Engineering