Closed
Bug 988262
Opened 11 years ago
Closed 11 years ago
EC2 test spot instances with 15GB disk running low on (or completely out of) available disk when checking out gaia-1_4, gaia-1_2 and gaia-central under /builds/hg-shared/integration
Categories
(Infrastructure & Operations Graveyard :: CIDuty, task)
Infrastructure & Operations Graveyard
CIDuty
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: pmoore, Assigned: rail)
References
Details
Attachments
(2 files)
bump the size (patch, deleted) | catlee: review+ | rail: checked-in+
resize.diff (patch, deleted) | catlee: review+ | rail: checked-in+
Disk full on tst-linux64-spot-757:
==================================
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com integration]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 15G 15G 82M 100% /
udev 1.9G 4.0K 1.9G 1% /dev
tmpfs 751M 656K 750M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 1.9G 4.8M 1.9G 1% /run/shm
Biggest "eaters":
=================
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com ~]$ du -sk /* 2>/dev/null | sort -n | tail -5
337212 /tools
802088 /var
2420164 /usr
3147704 /home
7199976 /builds
Relatively big:
===============
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com .android]$ du -sh /home/cltbld/.android/avd
2.6G /home/cltbld/.android/avd
Three gaia repositories checked out, consuming 5.1G
===================================================
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com test]$ cd /builds/hg-shared/integration/
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com integration]$ du -sk * | sort -n
1425320 gaia-1_2
1901980 gaia-1_4
1920616 gaia-central
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com integration]$ du -sh /builds/hg-shared/integration/
5.1G /builds/hg-shared/integration/
A random spot (tst-linux64-spot-248) without 100% disk usage:
=============================================================
[cltbld@tst-linux64-spot-248.test.releng.use1.mozilla.com ~]$ du -sk /* 2>/dev/null | sort -n | tail -5
337212 /tools
801104 /var
2420164 /usr
3350764 /home
5182808 /builds
However, this one is already at 88% usage, with only 1.9G available.
[cltbld@tst-linux64-spot-248.test.releng.use1.mozilla.com ~]$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 15G 13G 1.9G 88% /
udev 1.9G 4.0K 1.9G 1% /dev
tmpfs 751M 652K 750M 1% /run
none 5.0M 0 5.0M 0% /run/lock
none 1.9G 5.7M 1.9G 1% /run/shm
This spot instance does *not* have the 1.9GB gaia-1_4 repo checked out!
[cltbld@tst-linux64-spot-248.test.releng.use1.mozilla.com integration]$ ls -ltr
total 8
drwxrwxr-x 19 cltbld cltbld 4096 Mar 11 18:00 gaia-central
drwxrwxr-x 19 cltbld cltbld 4096 Mar 23 02:17 gaia-1_2
In other words, it looks like as soon as a spot instance has gaia-central, gaia-1_2 *and* gaia-1_4 cloned, we are going to hit this issue.
From:
https://mxr.mozilla.org/build/source/cloud-tools/configs/tst-linux32
https://mxr.mozilla.org/build/source/cloud-tools/configs/tst-linux64
it looks like we only create 15G instances (thanks mgerva for the link).
I guess we need to either work out how to save disk space *somewhere* or increase this value.
Or maybe we can "jacuzzi" the spots a bit more, so that different gaia versions are tested on different instances?
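As a quick diagnostic, the `df` output above can be filtered mechanically for filesystems over a usage threshold. A minimal sketch; the 85% cutoff and the embedded sample are illustrative, not part of any releng tooling:

```shell
# Flag filesystems whose Use% exceeds a threshold (85% here).
# The sample mirrors the df -h output above; on a real slave you
# would pipe live `df -h` output into the same awk filter.
df_sample='Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G   15G   82M 100% /
udev            1.9G  4.0K  1.9G   1% /dev
tmpfs           751M  656K  750M   1% /run'

over=$(printf '%s\n' "$df_sample" |
    awk 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > 85) print $1, $5 "%", $6 }')
echo "$over"
```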
Comment 1•11 years ago
disabled this slave in slavealloc
Assignee
Comment 3•11 years ago
We can easily bump the size of the root partition for spot instances; all new instances will come with more space. On-demand instances would need to be recreated.
This will also affect our AWS bill.
Comment 4•11 years ago
We could also have the slaves pull the various gaia repos into the same local repo. As long as they're checking things out by revision, this should be safe.
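A hedged sketch of that idea: since the gaia branch repos share most of their history, pulling all of them into one local clone stores the common changesets only once, and each job updates to its scheduled revision. The shared path, repo URLs, and `GAIA_REV` variable below are placeholders, not the actual releng configuration:

```shell
# Hypothetical single shared store instead of three full clones.
SHARED=/builds/hg-shared/integration/gaia

# Clone once, then pull the other branch repos into the same store;
# each extra head costs far less than a separate 1.4-1.9G clone.
# (URLs are illustrative.)
hg clone https://hg.mozilla.org/integration/gaia-central "$SHARED"
hg -R "$SHARED" pull https://hg.mozilla.org/integration/gaia-1_2
hg -R "$SHARED" pull https://hg.mozilla.org/integration/gaia-1_4

# Each job updates to the exact revision it was scheduled with,
# which is what makes sharing one local repo safe.
hg -R "$SHARED" update --clean --rev "$GAIA_REV"
```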
Comment 5•11 years ago
Breakdown on a running machine that still has 2G free:
[root@tst-linux64-spot-034.test.releng.use1.mozilla.com /]# du -chs * 2>/dev/null
0 1
6.9M bin
22M boot
5.1G builds
4.0K dev
12M etc
3.2G home
0 initrd.img
216M lib
4.0K lib64
16K lost+found
4.0K media
4.0K mnt
4.0K opt
0 proc
104K root
728K run
7.9M sbin
4.0K selinux
4.0K srv
0 sys
60K tmp
330M tools
2.4G usr
788M var
0 vmlinuz
12G total
[root@tst-linux64-spot-034.test.releng.use1.mozilla.com /]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/xvda1 15G 13G 2.0G 87% /
Updated•11 years ago
Comment 6•11 years ago
(In reply to Pete Moore [:pete][:pmoore] from comment #0)
> [cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com .android]$ du -sh
> /home/cltbld/.android/avd
> 2.6G /home/cltbld/.android/avd
The Android 2.3 jobs extract AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz there and that is approximately the expected size.
I think it could be reduced to about 25% of that, though: we distribute 4 identical avd definitions but only use one of them. (The Android 4.2 x86 jobs need 4 avd definitions because they run up to 4 emulators at a time, but the Android 2.3 jobs on ec2 slaves only run 1 emulator at a time.)
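If the tarball really does ship four identical avd definitions and the 2.3 jobs only ever start one emulator, the unused copies could simply be deleted after extraction. A sketch with made-up directory names (the real avd names under /home/cltbld/.android/avd may differ), run against a scratch directory so it is self-contained:

```shell
# Simulate the extracted layout in a temp dir; on a slave this would
# be /home/cltbld/.android/avd with the real (unknown here) avd names.
AVD_DIR=$(mktemp -d)
mkdir -p "$AVD_DIR"/test-1.avd "$AVD_DIR"/test-2.avd \
         "$AVD_DIR"/test-3.avd "$AVD_DIR"/test-4.avd

# Keep one definition, delete the three duplicates: roughly the 75%
# saving estimated above if all four are the same size.
for avd in "$AVD_DIR"/test-2.avd "$AVD_DIR"/test-3.avd "$AVD_DIR"/test-4.avd; do
    rm -rf "$avd"
done

remaining=$(ls "$AVD_DIR")
echo "$remaining"
```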
Comment 7•11 years ago
RyanVM found these 5:
tst-linux64-spot-125
tst-linux64-spot-665 (unreachable, assumed terminated)
tst-linux64-spot-411
tst-linux64-spot-751
tst-linux64-spot-646
Did a short sweep through TBPL for m-c for others:
tst-linux64-spot-781 (unreachable, assumed terminated)
tst-linux64-spot-481
tst-linux64-spot-386
tst-linux64-spot-757
tst-linux64-spot-687
There are surely more.
Assignee
Comment 8•11 years ago
Let's bump the root partition size until we can figure out how to reduce disk usage.
Attachment #8397593 - Flags: review?(catlee)
Updated•11 years ago
Attachment #8397593 - Flags: review?(catlee) → review+
Assignee
Comment 9•11 years ago
Comment on attachment 8397593 [details] [diff] [review]
bump the size
https://hg.mozilla.org/build/cloud-tools/rev/ca4694057ad2
This change will affect new spot instances, but not existing on-demand ones.
Attachment #8397593 - Flags: checked-in+
Assignee
Comment 10•11 years ago
It turns out that it's not enough to set the size of the volume; we also need to grow the partition when the new size is larger than the one in the AMI. This won't work for an HVM root device, but should work for PV instances.
Attachment #8398848 - Flags: review?
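For PV instances the grow step can be done from inside the instance at boot. A rough sketch of the mechanism (tools and device names vary by AMI, and this is not the actual cloud-tools change):

```shell
# On a PV instance whose root EBS volume is larger than the partition
# baked into the AMI, grow the partition, then the filesystem.
# /dev/xvda1 matches the df output above; adjust for other AMIs.
growpart /dev/xvda 1      # cloud-utils: extend partition 1 to fill the volume
resize2fs /dev/xvda1      # grow the ext filesystem online to the new size
```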
Assignee
Updated•11 years ago
Attachment #8398848 - Flags: review? → review?(catlee)
Updated•11 years ago
Attachment #8398848 - Flags: review?(catlee) → review+
Assignee
Comment 11•11 years ago
Comment on attachment 8398848 [details] [diff] [review]
resize.diff
https://hg.mozilla.org/build/cloud-tools/rev/280ebcbdbe3a
Attachment #8398848 - Flags: checked-in+
Assignee
Comment 12•11 years ago
the last fix made it work \o/
The on-demand instances still use 15G volumes.
Assignee
Updated•11 years ago
Assignee: nobody → rail
Comment 14•11 years ago
Rail: What is the plan to age out the old on-demand and old spot instances?
Flags: needinfo?(rail)
Assignee
Comment 15•11 years ago
(In reply to John Hopkins (:jhopkins) from comment #14)
> Rail: What is the plan to age out the old on-demand and old spot instances?
I was thinking about re-creating them during this TCW window. Is there any reason why this should be done earlier?
Flags: needinfo?(rail)
Comment 16•11 years ago
Rail: we've been failing test runs due to "out of disk space" messages from time to time all week. I conferred with edmorley and he says we can wait until the TCW to reimage.
Assignee
Comment 17•11 years ago
No need to wait, they were idle, so I recreated 200 on-demand instances (tst-linux64-{0,3}{01..99})
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Updated•7 years ago
Product: Release Engineering → Infrastructure & Operations
Updated•5 years ago
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard