Closed
Bug 525037
Opened 15 years ago
Closed 14 years ago
Investigate formatting working partition on boot for Talos
Categories
(Release Engineering :: General, defect, P2)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: catlee, Assigned: jhford)
References
Details
(Whiteboard: [talos][automation])
Attachments
(4 files, 2 obsolete files)
Running off of a fresh partition may help stabilize numbers for certain tests/platforms. It will at least give us a more consistent starting state.
* Format 2nd partition on boot
* Have Firefox and pagesets, etc. run off of the 2nd partition
** Temporary profile should be created there as well.
Buildbot, apache, etc. will still run off of the primary partition.
Woo woo!
Assignee
Comment 2•15 years ago
We are already doing this with our Maemo machines (http://hg.mozilla.org/build/tools/file/tip/buildfarm/mobile/production-sd/rootfs/etc/init.d/buildbot). I am sure that this script would work for our Linux and Mac slaves, and maybe even the Windows ones if there is a command-line way to format the drives. As a WIP portion of this script, I am investigating using the Python standard library logging API to send device status before starting buildbot (though this remote monitoring does not yet work).
Assignee
Comment 3•15 years ago
We could use diskutil on OS X (http://developer.apple.com/mac/library/documentation/Darwin/Reference/ManPages/man8/diskutil.8.html) and format on Windows (http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/format.mspx?mfr=true).
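For example, hedged sketches of what those commands would look like (the volume identifiers and labels here are placeholders, not actual slave config). On OS X:
# diskutil eraseVolume JHFS+ tests disk0s3
and on Windows:
> echo Y | format D: /FS:NTFS /Q /V:tests
(format prompts for confirmation, hence the piped Y; /Q does a quick format.)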
Comment 4•15 years ago
There is also work in bug 525030 to clean out logs on the Talos machines on each reboot, which might also help stabilize numbers. Catlee noted that Apache in particular accumulates a lot of errors in its logs.
Reporter
Updated•15 years ago
Assignee: catlee → nobody
Updated•15 years ago
Priority: -- → P5
Updated•15 years ago
Whiteboard: [talos][automation]
Updated•14 years ago
Assignee: nobody → nrthomas
Updated•14 years ago
Assignee: nrthomas → lsblakk
Updated•14 years ago
Whiteboard: [talos][automation] → [talos][automation][triagefollowup]
Comment 5•14 years ago
(In reply to comment #0)
> Running off of a fresh partition may help stabilize numbers for certain
> tests/platforms. It will at least give us a more consistent starting state.
>
> * Format 2nd partition on boot
> * Have Firefox and pagesets, etc. run off of the 2nd partition
> ** Temporary profile should be created there as well.
>
> Buildbot, apache, etc. will still run off of the primary partition.
Which tests are we hoping to improve by doing this?
Should this be an auto-tools bug for the investigation stage to see if in fact doing this does improve anything?
Updated•14 years ago
Whiteboard: [talos][automation][triagefollowup] → [talos][automation]
Comment 6•14 years ago
(In reply to comment #5)
> Which tests are we hoping to improve by doing this?
Any perf tests that show significant variation between runs should (we hope!) see more consistent numbers from this, e.g. Ts/Tshutdown, and perhaps even Tp4.
Assignee
Comment 7•14 years ago
We currently do this on mobile. We are running fewer iterations of the tests but, from what I have been told, are seeing fairly stable numbers [1].
On Linux this isn't too difficult to do: we create a blank file of whatever size we need the partition to be, run mkfs.ext2 on it, then loopback-mount it. This is advantageous because you can later unmount the file and save it somewhere else for future inspection. A disadvantage is that this particular method only works on Linux.
The steps we use for this are located at:
http://hg.mozilla.org/build/tools/file/8de071880651/buildfarm/mobile/n900-imaging/rootfs/root-skel/etc/event.d/buildbot#l28
[1] examples
http://graphs.mozilla.org/#tests=[[16,11,463],[16,11,464],[16,11,465],[16,11,466],[16,11,467],[16,11,468],[16,11,469],[16,11,470],[16,11,471],[16,11,472],[16,11,474],[16,11,475],[16,11,476],[16,11,477],[16,11,478],[16,11,479],[16,11,480],[16,11,481],[16,11,615],[16,11,616],[16,11,617],[16,11,618],[16,11,619],[16,11,620],[16,11,621],[16,11,622],[16,11,624],[16,11,626],[16,11,627],[16,11,628],[16,11,629],[16,11,630],[16,11,631],[16,11,632],[16,11,633],[16,11,634],[16,11,636],[16,11,637],[16,11,638],[16,11,639],[16,11,642],[16,11,643]]
http://graphs.mozilla.org/#tests=[[21,11,463],[21,11,464],[21,11,465],[21,11,466],[21,11,467],[21,11,468],[21,11,469],[21,11,470],[21,11,471],[21,11,472],[21,11,474],[21,11,475],[21,11,476],[21,11,477],[21,11,478],[21,11,479],[21,11,480],[21,11,481],[21,11,615],[21,11,616],[21,11,617],[21,11,618],[21,11,619],[21,11,620],[21,11,621],[21,11,622],[21,11,624],[21,11,626],[21,11,627],[21,11,628],[21,11,629],[21,11,630],[21,11,631],[21,11,632],[21,11,633],[21,11,634],[21,11,636],[21,11,637],[21,11,638],[21,11,639],[21,11,642],[21,11,643]]&sel=1282169202,1283535188
http://graphs.mozilla.org/#tests=[[23,11,463],[23,11,464],[23,11,465],[23,11,466],[23,11,467],[23,11,468],[23,11,469],[23,11,470],[23,11,471],[23,11,472],[23,11,474],[23,11,475],[23,11,476],[23,11,477],[23,11,478],[23,11,479],[23,11,480],[23,11,481],[23,11,615],[23,11,616],[23,11,617],[23,11,618],[23,11,619],[23,11,620],[23,11,621],[23,11,622],[23,11,624],[23,11,626],[23,11,627],[23,11,628],[23,11,629],[23,11,630],[23,11,631],[23,11,632],[23,11,633],[23,11,634],[23,11,636],[23,11,637],[23,11,638],[23,11,639],[23,11,642],[23,11,643]]&sel=1282442242,1283535188
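For reference, the recipe in that script boils down to something like this (the path and size here are illustrative, not what the devices actually use):
# dd if=/dev/zero of=/builds/tests.img bs=1M count=420
# mkfs.ext2 -F /builds/tests.img
# mkdir -p /builds/tests
# mount -o loop /builds/tests.img /builds/tests
Unmounting later leaves tests.img as a self-contained image that can be copied off for inspection.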
Updated•14 years ago
Assignee: lsblakk → jhford
Comment 8•14 years ago
(In reply to comment #7)
> On linux, this isn't too difficult to do. We create a blank file of whatever
> size we need the partition to be, mkfs.ext2 it then loopback mount it.
At risk of the perfect being the enemy of the good, this still might not fix the problem because you're at the mercy of the primary FS creating a relatively unfragmented 420MB file. And it's being created 1024 bytes at a time (modulo buffering?) so I wouldn't be so sure about the FS not being dumb about this -- which, really, is the whole reason we're here in the first place.
It's likely better than the status quo, but if we're going to kill this dead (dead!), a new FS on a reserved area of the oxide really seems like the way to go.
Assignee
Comment 9•14 years ago
(In reply to comment #8)
> (In reply to comment #7)
>
> > On linux, this isn't too difficult to do. We create a blank file of whatever
> > size we need the partition to be, mkfs.ext2 it then loopback mount it.
>
> At risk of the perfect being the enemy of the good, this still might not fix
> the problem because you're at the mercy of the primary FS creating a relatively
> unfragmented 420MB file. And it's being created 1024 bytes at a time (modulo
> buffering?) so I wouldn't be so sure about the FS not being dumb about this --
> which, really, is the whole reason we're here in the first place.
Yes, that is a valid concern. We are creating this file on a 32GB FAT32 volume which holds maybe 1GB of files, including this one. The main reason for doing this on the phones was not to improve numbers, but to avoid being at the mercy of FAT32's suckage. The alternative (formatting the real filesystem) is a very long process (32GB on a class 2? SDHC card) and requires low-level changes to OS startup.
> It's likely better than the status quo, but if we're going to kill this dead
> (dead!), a new FS on a reserved area of the oxide really seems like the way to
> go.
For the real Linux desktop machines, that'd be what I would do, for sure.
Assignee
Comment 10•14 years ago
I have requested that two currently offline minis be brought back to the office for use in support of this bug.
My plan is to run Linux on two minis in a staging environment, with a fresh filesystem on each boot, to evaluate whether this is a useful step forward. The changes required for Windows and possibly Mac are outside the scope of this bug.
Assignee
Comment 11•14 years ago
Since we are using LVM, I was able to painlessly shrink the root volume without having to futz around with the MBR! I used a command similar to
# lvresize -r -L 50G /dev/vg_talosr3fedref/lv_root
in a Fedora 14 live CD session. Once I was back into Fedora 12 on the disk, I created a tests volume with
# lvcreate vg_talosr3fedref -L 5G -n lv_tests
To emulate a reboot on the non-formatting machine, I am thinking that I will "mount -o remount /" and see if the filesystem cache is cleared. If it isn't, I will try
# sync
# echo 3 > /proc/sys/vm/drop_caches
I don't want to reboot between tests because that would complicate this testing.
Comment 12•14 years ago
Are you proposing we stop rebooting between jobs in production too? There's value in having the machine in a clean state before starting tests as well.
Assignee
Comment 13•14 years ago
(In reply to comment #11)
> since we are using LVM, I was able to painlessly shrink the root volume without
> having to futz around with the MBR! I used a command similar to
>
> # lvresize -r -L 50G /dev/vg_talosr3fedref/lv_root
>
> in a Fedora 14 live cd session. Once I was back into Fedora 12 on the disk, I
> created a tests volume with
>
> # lvcreate vg_talosr3fedref -L 5G -n lv_tests
>
> To emulate a reboot on the non-formatting machine, I am thinking that I will
> "mount -o remount /" and see if the filesystem cache is cleared. If it isn't,
> I will try
>
> # sync
> # echo 3 > /proc/sys/vm/drop_caches
>
> I don't want to reboot between tests because that will complicate this testing.
I have verified that the filesystem cache is cleared on an explicit umount & mount but not on a mount -o remount. I am going to do the same partition jiggling on both machines but only reformat one.
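Concretely, the cache flush between runs on the non-formatting machine will be a full cycle along these lines (the mount point is a placeholder; the lv_tests volume is from comment 11):
# umount /builds/tests
# mount /dev/vg_talosr3fedref/lv_tests /builds/tests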
Assignee
Comment 14•14 years ago
(In reply to comment #12)
> Are you proposing we stop rebooting between jobs in production too ? There's
> value in having the machine in a clean state before starting tests too.
Not at all; this is purely an investigation of the filesystem's effect on the numbers. The way I have written my test script, I can either run N iterations and exit, or have the computer start the script on every reboot, run once, reboot, ad infinitum. Because one approach will take longer than the other, I will only get the same number of data points if I do N runs then exit.
Assignee
Comment 15•14 years ago
I have also changed the sudo rule to allow passwordless arbitrary sudo as root.
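In sudoers terms that is roughly (the account name here is a placeholder for whatever user the test slaves run as):
cltbld ALL=(ALL) NOPASSWD: ALL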
Assignee
Comment 16•14 years ago
This is a script that runs Talos on a machine, with the filesystem cache cleared after each run to loosely approximate a system reboot (a sketch of what the loop does follows the procedure below).
The test procedure is going to be:
0. copy data onto machine
1. full, cold restart
2. close terminal (that won't work)
3. open terminal again from kicker
4. run "./bench-talos" on one machine and "FORMAT=1 ./bench-talos" on the other
5. ?
6. profit
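Roughly, the main loop of the script does the following (a hypothetical sketch only; the attached script is authoritative, and the paths and volume names are illustrative):
#!/bin/bash
# Run a fixed number of Talos passes. With FORMAT=1 the tests volume
# gets a fresh filesystem before each pass; otherwise the mount is just
# cycled so the fs cache is dropped, loosely approximating a reboot.
for run in $(seq 1 200); do
    umount /builds/tests
    if [ -n "$FORMAT" ]; then
        mkfs.ext2 -q /dev/vg_talosr3fedref/lv_tests
    fi
    mount /dev/vg_talosr3fedref/lv_tests /builds/tests
    # ...unpack Firefox and the pageset onto /builds/tests, run one Ts/Tp4 pass...
done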
Assignee
Comment 17•14 years ago
(In reply to comment #16)
I ran this test overnight to verify that the script works for long periods (it does) and found that it takes around 20 minutes to run through Ts and Tp4.
Quick math tells me that I'll be able to do about 70 test runs per day. I am going to change the number of runs from 2000 to 200 and see what the numbers look like after the weekend.
I am also moving these machines into the second floor server room and putting them on UPS.
Status: NEW → ASSIGNED
Priority: P5 → P2
Assignee
Comment 18•14 years ago
(In reply to comment #17)
> (In reply to comment #16)
>
> I ran this test overnight to verify that the script works for long periods (it
> does) and found that it is taking around 20 minutes to run through TS and TP4.
>
> Quick math tells me that I'll be able to do about 70 test runs per day. I am
> going to change the number from 2000 runs to 200 and see what the numbers look
> like after the weekend
>
> I am also moving these machines into the second floor server room and putting
> them on UPS.
It looks like the machine that was doing the per-run formats froze. I am going to reboot it to figure out how far through it got. There is a good chance that doing a bunch of mkfs runs screwed up the system. I am going to try a rebooting version of the script to see if that improves reliability.
Assignee
Comment 19•14 years ago
Assignee
Comment 20•14 years ago
Assignee
Updated•14 years ago
Attachment #496970 - Attachment mime type: application/octet-stream → text/plain
Assignee
Comment 21•14 years ago
(In reply to comment #20)
> Created attachment 496970 [details]
> results from computer that did format
Not sure why this didn't run/report Tp4.
Assignee
Comment 22•14 years ago
Attachment #496967 - Attachment is obsolete: true
Assignee
Comment 23•14 years ago
Attachment #496970 - Attachment is obsolete: true
Assignee
Comment 24•14 years ago
Output from script:
Formatting ts_cold stddev: 73.2167448757
Non-formatting ts_cold stddev: 87.6435537353
Formatting ts_cold_shutdown stddev: 7.49324117601
No-formatting ts_cold_shutdown stddev: 8.58330280141
Formatting ts_cold mean: 6365.83147059
Non-formatting ts_cold mean: 5896.51210526
Formatting ts_cold_shutdown mean: 460.328823529
No-formatting ts_cold_shutdown mean: 477.011578947
Assignee
Updated•14 years ago
Attachment #514287 - Attachment mime type: application/octet-stream → text/plain
Assignee
Comment 25•14 years ago
This has been investigated and I believe there is no further action to be taken. While testing, I found that the machine that did format the working drive was not able to run non-Ts tests reliably (hence the lack of data), and that it wasn't able to complete anywhere near as many test runs as the machine that didn't format.
While there are advantages to formatting on every test run, the slowdown caused by the format itself (a non-trivial amount of time), as well as the lower reliability, means this would hurt us more than help. This test was done on Linux, where I would expect automated filesystem management to be at its strongest.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 26•14 years ago
Thanks for looking into this, jhford.
Updated•11 years ago
Product: mozilla.org → Release Engineering