Closed Bug 525037 Opened 15 years ago Closed 14 years ago

Investigate formatting working partition on boot for Talos

Categories

(Release Engineering :: General, defect, P2)

x86
All
defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: jhford)

References

Details

(Whiteboard: [talos][automation])

Attachments

(4 files, 2 obsolete files)

Running off of a fresh partition may help stabilize numbers for certain tests/platforms. It will at least give us a more consistent starting state.

* Format 2nd partition on boot
* Have Firefox and pagesets, etc. run off of the 2nd partition
** Temporary profile should be created there as well.

Buildbot, apache, etc. will still run off of the primary partition.
We are already doing this with our maemo machines (http://hg.mozilla.org/build/tools/file/tip/buildfarm/mobile/production-sd/rootfs/etc/init.d/buildbot). I am sure that this script would work for our Linux and Mac slaves, and maybe even the Windows ones if there is a command-line way to format the drives. As a work-in-progress portion of this script, I am investigating the use of the Python standard library logging API to send device status before starting buildbot (though this remote monitoring does not yet work).
There is also work in bug 525030 to clean out logs on Talos machines with each reboot, which might also help stabilize numbers. Catlee noted that Apache in particular has a lot of errors in the logs.
Assignee: catlee → nobody
Priority: -- → P5
Whiteboard: [talos][automation]
Assignee: nobody → nrthomas
Assignee: nrthomas → lsblakk
Whiteboard: [talos][automation] → [talos][automation][triagefollowup]
(In reply to comment #0)
> Running off of a fresh partition may help stabilize numbers for certain
> tests/platforms. It will at least give us a more consistent starting state.
>
> * Format 2nd partition on boot
> * Have Firefox and pagesets, etc. run off of the 2nd partition
> ** Temporary profile should be created there as well.
>
> Buildbot, apache, etc. will still run off of the primary partition.

Which tests are we hoping to improve by doing this? Should this be an auto-tools bug for the investigation stage, to see if doing this does in fact improve anything?
Whiteboard: [talos][automation][triagefollowup] → [talos][automation]
(In reply to comment #5)
> Which tests are we hoping to improve by doing this?

Any perf tests that are showing significant variation between runs are likely (we hope!) to see more consistent numbers from doing this, e.g. Ts/Tshutdown, and perhaps even Tp4.
We currently do this on mobile. We are running fewer iterations of the tests but are seeing fairly stable numbers [1], from what I have been told.

On Linux, this isn't too difficult to do. We create a blank file of whatever size we need the partition to be, mkfs.ext2 it, then loopback-mount it. This is advantageous because you can later unmount the file and save it to another location for future inspection. A disadvantage is that this particular method only works on Linux. The steps we use for this are located at:
http://hg.mozilla.org/build/tools/file/8de071880651/buildfarm/mobile/n900-imaging/rootfs/root-skel/etc/event.d/buildbot#l28

[1] examples:
http://graphs.mozilla.org/#tests=[[16,11,463],[16,11,464],[16,11,465],[16,11,466],[16,11,467],[16,11,468],[16,11,469],[16,11,470],[16,11,471],[16,11,472],[16,11,474],[16,11,475],[16,11,476],[16,11,477],[16,11,478],[16,11,479],[16,11,480],[16,11,481],[16,11,615],[16,11,616],[16,11,617],[16,11,618],[16,11,619],[16,11,620],[16,11,621],[16,11,622],[16,11,624],[16,11,626],[16,11,627],[16,11,628],[16,11,629],[16,11,630],[16,11,631],[16,11,632],[16,11,633],[16,11,634],[16,11,636],[16,11,637],[16,11,638],[16,11,639],[16,11,642],[16,11,643]]
http://graphs.mozilla.org/#tests=[[21,11,463],[21,11,464],[21,11,465],[21,11,466],[21,11,467],[21,11,468],[21,11,469],[21,11,470],[21,11,471],[21,11,472],[21,11,474],[21,11,475],[21,11,476],[21,11,477],[21,11,478],[21,11,479],[21,11,480],[21,11,481],[21,11,615],[21,11,616],[21,11,617],[21,11,618],[21,11,619],[21,11,620],[21,11,621],[21,11,622],[21,11,624],[21,11,626],[21,11,627],[21,11,628],[21,11,629],[21,11,630],[21,11,631],[21,11,632],[21,11,633],[21,11,634],[21,11,636],[21,11,637],[21,11,638],[21,11,639],[21,11,642],[21,11,643]]&sel=1282169202,1283535188
http://graphs.mozilla.org/#tests=[[23,11,463],[23,11,464],[23,11,465],[23,11,466],[23,11,467],[23,11,468],[23,11,469],[23,11,470],[23,11,471],[23,11,472],[23,11,474],[23,11,475],[23,11,476],[23,11,477],[23,11,478],[23,11,479],[23,11,480],[23,11,481],[23,11,615],[23,11,616],[23,11,617],[23,11,618],[23,11,619],[23,11,620],[23,11,621],[23,11,622],[23,11,624],[23,11,626],[23,11,627],[23,11,628],[23,11,629],[23,11,630],[23,11,631],[23,11,632],[23,11,633],[23,11,634],[23,11,636],[23,11,637],[23,11,638],[23,11,639],[23,11,642],[23,11,643]]&sel=1282442242,1283535188
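For reference, a minimal sketch of that loopback approach on Linux; the file name, size, and mount point here are illustrative assumptions, not the values from the actual n900 init script linked above:

#!/bin/sh
# Minimal sketch of the loopback-file approach described above.
# IMG, MNT, and the 420MB size are illustrative assumptions, not
# the values from the actual n900 init script.
IMG=/builds/tests.img
MNT=/builds/tests

# create a fresh, zeroed backing file on every boot
dd if=/dev/zero of="$IMG" bs=1M count=420

# put a brand-new ext2 filesystem inside it (-F skips the
# "this is not a block device" confirmation)
mkfs.ext2 -F "$IMG"

# loopback-mount it; tests then run out of $MNT
mkdir -p "$MNT"
mount -o loop "$IMG" "$MNT"

# later, to keep the filesystem around for inspection:
#   umount "$MNT" && mv "$IMG" /some/archive/location

Unmounting and saving the backing file aside preserves the exact filesystem state of a given run, which is the inspection advantage mentioned above.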
Assignee: lsblakk → jhford
(In reply to comment #7)
> On Linux, this isn't too difficult to do. We create a blank file of whatever
> size we need the partition to be, mkfs.ext2 it, then loopback-mount it.

At risk of the perfect being the enemy of the good, this still might not fix the problem because you're at the mercy of the primary FS creating a relatively unfragmented 420MB file. And it's being created 1024 bytes at a time (modulo buffering?), so I wouldn't be so sure about the FS not being dumb about this -- which, really, is the whole reason we're here in the first place. It's likely better than the status quo, but if we're going to kill this dead (dead!), a new FS on a reserved area of the oxide really seems like the way to go.
(In reply to comment #8)
> (In reply to comment #7)
> > On Linux, this isn't too difficult to do. We create a blank file of whatever
> > size we need the partition to be, mkfs.ext2 it, then loopback-mount it.
>
> At risk of the perfect being the enemy of the good, this still might not fix
> the problem because you're at the mercy of the primary FS creating a relatively
> unfragmented 420MB file. And it's being created 1024 bytes at a time (modulo
> buffering?), so I wouldn't be so sure about the FS not being dumb about this --
> which, really, is the whole reason we're here in the first place.

Yes, that is a valid concern. We are creating this file on a 32GB FAT32 volume which is maybe using 1GB of files, including this file. The main reason for doing this on the phones was not to improve numbers, but to avoid being at the mercy of FAT32's suckage. The alternative (formatting the real fs) is a very long process (32GB on a class 2? SDHC card) and requires low-level changes to OS startup.

> It's likely better than the status quo, but if we're going to kill this dead
> (dead!), a new FS on a reserved area of the oxide really seems like the way to
> go.

For the real Linux desktop machines, that'd be what I would do, for sure.
Depends on: 615603
I have requested that two currently offline minis be brought back to the office for use in support of this bug. My plan is to run Linux on two minis with a new fs on each boot in a staging environment, to evaluate whether this is a useful step forward. The changes required for Windows, and possibly Mac, are outside the scope of this bug.
Since we are using LVM, I was able to painlessly shrink the root volume without having to futz around with the MBR! I used a command similar to

# lvresize -r -L 50G /dev/vg_talosr3fedref/lv_root

in a Fedora 14 live CD session. Once I was back into Fedora 12 on the disk, I created a tests volume with

# lvcreate vg_talosr3fedref -L 5G -n lv_tests

To emulate a reboot on the non-formatting machine, I am thinking that I will "mount -o remount /" and see if the filesystem cache is cleared. If it isn't, I will try

# sync
# echo 3 > /proc/sys/vm/drop_caches

I don't want to reboot between tests because that will complicate this testing.
Are you proposing we stop rebooting between jobs in production too? There's value in having the machine in a clean state before starting tests as well.
(In reply to comment #11)
> Since we are using LVM, I was able to painlessly shrink the root volume without
> having to futz around with the MBR! I used a command similar to
>
> # lvresize -r -L 50G /dev/vg_talosr3fedref/lv_root
>
> in a Fedora 14 live CD session. Once I was back into Fedora 12 on the disk, I
> created a tests volume with
>
> # lvcreate vg_talosr3fedref -L 5G -n lv_tests
>
> To emulate a reboot on the non-formatting machine, I am thinking that I will
> "mount -o remount /" and see if the filesystem cache is cleared. If it isn't,
> I will try
>
> # sync
> # echo 3 > /proc/sys/vm/drop_caches
>
> I don't want to reboot between tests because that will complicate this testing.

I have verified that the filesystem cache is cleared on an explicit umount & mount, but not on a mount -o remount. I am going to do the same partition jiggling on both machines, but only reformat one.
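A minimal sketch of that cache-clearing step, given those findings; the device name is the one from comment #11, but the mount point and the script itself are assumptions, not the actual test code:

#!/bin/sh
# Sketch: clear the fs cache for the tests volume between runs.
# Per the verification above, an explicit umount/mount drops the
# cached pages for this filesystem; "mount -o remount" does not.
# TESTS_MNT is an illustrative assumption.
TESTS_DEV=/dev/vg_talosr3fedref/lv_tests
TESTS_MNT=/builds/tests

sync
umount "$TESTS_MNT"
mount "$TESTS_DEV" "$TESTS_MNT"

# heavier, system-wide alternative (drops page cache, dentries,
# and inodes for all filesystems):
#   sync; echo 3 > /proc/sys/vm/drop_caches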
(In reply to comment #12)
> Are you proposing we stop rebooting between jobs in production too? There's
> value in having the machine in a clean state before starting tests as well.

Not at all; this is purely an investigation of the filesystem's effect on numbers. The way I have written my test script, I can either run N iterations and exit, or have the computer start the script on every reboot, run once, reboot, ad infinitum. Because one approach will take longer than the other, I will only get the same number of data points if I do N runs and then exit.
I have also changed the sudo rule to allow passwordless arbitrary sudo as root.
Attached file bench-talos (deleted) —
This is a script that runs Talos on a machine, with the filesystem cache cleared after each run to loosely approximate a system reboot. The test procedure is going to be:

0. copy data onto machine
1. full, cold restart
2. close terminal (that won't work)
3. open terminal again from kicker
4. run "./bench-talos" on one machine and "FORMAT=1 ./bench-talos" on the other
5. ?
6. profit
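The attachment itself isn't reproduced in the bug; a hypothetical sketch of what such a harness could look like, reusing the LVM volume from comment #11 (the Talos invocation, paths, logging, and run count are all assumptions, not the contents of the actual attachment):

#!/bin/sh
# Hypothetical sketch of a bench-talos-style harness. The Talos
# entry point, paths, and default run count are assumptions.
TESTS_DEV=/dev/vg_talosr3fedref/lv_tests
TESTS_MNT=/builds/tests
RUNS=${RUNS:-200}

i=0
while [ "$i" -lt "$RUNS" ]; do
    umount "$TESTS_MNT" 2>/dev/null

    # FORMAT=1 selects the formatting machine's behaviour:
    # a brand-new ext2 fs for every single run
    if [ -n "$FORMAT" ]; then
        mkfs.ext2 -q "$TESTS_DEV"
    fi

    # mounting (with or without the format) clears the fs cache
    # for this volume, loosely approximating a reboot
    mount "$TESTS_DEV" "$TESTS_MNT"

    # run the perf tests out of the tests volume
    # (hypothetical entry point and config name)
    (cd "$TESTS_MNT/talos" && python run_tests.py ts_cold.config) \
        > "$HOME/results/run-$i.log" 2>&1

    i=$((i + 1))
done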
(In reply to comment #16)

I ran this test overnight to verify that the script works for long periods (it does) and found that it is taking around 20 minutes to run through Ts and Tp4.

Quick math tells me that I'll be able to do about 70 test runs per day. I am going to change the number from 2000 runs to 200 and see what the numbers look like after the weekend.

I am also moving these machines into the second floor server room and putting them on UPS.
Status: NEW → ASSIGNED
Priority: P5 → P2
(In reply to comment #17)
> (In reply to comment #16)
>
> I ran this test overnight to verify that the script works for long periods (it
> does) and found that it is taking around 20 minutes to run through Ts and Tp4.
>
> Quick math tells me that I'll be able to do about 70 test runs per day. I am
> going to change the number from 2000 runs to 200 and see what the numbers look
> like after the weekend.
>
> I am also moving these machines into the second floor server room and putting
> them on UPS.

It looks like the machine which was doing the formats per-run froze. I am going to reboot it to figure out how far through it got. There is a good chance that doing a bunch of mkfs runs screwed up the system. I am going to try a rebooting version of the script to see if that improves reliability.
Attached file results from computer that didn't format (obsolete) (deleted) —
Attached file results from computer that did format (obsolete) (deleted) —
Attachment #496970 - Attachment mime type: application/octet-stream → text/plain
(In reply to comment #20)
> Created attachment 496970 [details]
> results from computer that did format

Not sure why this didn't run/report Tp4.
Attachment #496967 - Attachment is obsolete: true
Attachment #496970 - Attachment is obsolete: true
Attached file get the stddev and mean from data (deleted) —
Output from script:

Formatting ts_cold stddev: 73.2167448757
Non-formatting ts_cold stddev: 87.6435537353
Formatting ts_cold_shutdown stddev: 7.49324117601
No-formatting ts_cold_shutdown stddev: 8.58330280141
Formatting ts_cold mean: 6365.83147059
Non-formatting ts_cold mean: 5896.51210526
Formatting ts_cold_shutdown mean: 460.328823529
No-formatting ts_cold_shutdown mean: 477.011578947
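The attached analysis script isn't shown in the bug; a minimal sketch of computing a mean and population standard deviation from a one-number-per-line results file (the input format, and the use of the population rather than sample stddev, are assumptions):

#!/bin/sh
# Hypothetical sketch: mean and population stddev of a data file
# with one measurement per line. The actual (now-obsolete)
# attachment's input format and exact math may have differed.
awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
        mean = sum / n
        # population variance: E[x^2] - (E[x])^2
        printf "mean:   %f\n", mean
        printf "stddev: %f\n", sqrt(sumsq / n - mean * mean)
    }
' "$1"

Run as, e.g., ./stats.sh ts_cold_results.txt; a sample stddev would divide by n-1 instead of n.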
Attachment #514287 - Attachment mime type: application/octet-stream → text/plain
This has been investigated and I believe that there is no further action to be taken. While testing this, I found that the machine that did format the working drive was not able to actually run non-Ts tests reliably (hence the lack of data), and that it wasn't able to complete anywhere near as many test runs as the device that didn't format. While there are advantages to formatting for every test run, the slowdown caused by the actual format (a non-trivial amount of time), as well as the lower reliability, means that this would hurt us more than help. This test was done on Linux, where I would expect automated filesystem management to be at its strongest.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Thanks for looking into this, jhford.
Product: mozilla.org → Release Engineering