Closed Bug 988262 Opened 11 years ago Closed 11 years ago

EC2 test spot instances with 15GB disk running low on (or completely out of) available disk when checking out gaia-1_4, gaia-1_2 and gaia-central under /builds/hg-shared/integration

Categories

(Infrastructure & Operations Graveyard :: CIDuty, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: pmoore, Assigned: rail)

References

Details

Attachments

(2 files)

Disk full on tst-linux64-spot-757:
==================================

[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com integration]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G   15G   82M 100% /
udev            1.9G  4.0K  1.9G   1% /dev
tmpfs           751M  656K  750M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            1.9G  4.8M  1.9G   1% /run/shm

Biggest "eaters":
=================

[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com ~]$ du -sk /* 2>/dev/null | sort -n | tail -5
337212   /tools
802088   /var
2420164  /usr
3147704  /home
7199976  /builds

Relatively big:
===============

[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com .android]$ du -sh /home/cltbld/.android/avd
2.6G  /home/cltbld/.android/avd

Three gaia repositories checked out, consuming 5.1G
===================================================

[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com test]$ cd /builds/hg-shared/integration/
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com integration]$ du -sk * | sort -n
1425320  gaia-1_2
1901980  gaia-1_4
1920616  gaia-central
[cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com integration]$ du -sh /builds/hg-shared/integration/
5.1G  /builds/hg-shared/integration/

A random spot (tst-linux64-spot-248) without 100% disk usage:
=============================================================

[cltbld@tst-linux64-spot-248.test.releng.use1.mozilla.com ~]$ du -sk /* 2>/dev/null | sort -n | tail -5
337212   /tools
801104   /var
2420164  /usr
3350764  /home
5182808  /builds

However, already at 88% usage!! Only 1.9G available.

[cltbld@tst-linux64-spot-248.test.releng.use1.mozilla.com ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G   13G  1.9G  88% /
udev            1.9G  4.0K  1.9G   1% /dev
tmpfs           751M  652K  750M   1% /run
none            5.0M     0  5.0M   0% /run/lock
none            1.9G  5.7M  1.9G   1% /run/shm

This spot instance does *not* have the 1.9Gb gaia-1_4 repo checked out!

[cltbld@tst-linux64-spot-248.test.releng.use1.mozilla.com integration]$ ls -ltr
total 8
drwxrwxr-x 19 cltbld cltbld 4096 Mar 11 18:00 gaia-central
drwxrwxr-x 19 cltbld cltbld 4096 Mar 23 02:17 gaia-1_2

In other words, it looks like as soon as a spot has gaia-central, gaia-1_2 *and* gaia-1_4 cloned - we are going to hit this issue!

From:
https://mxr.mozilla.org/build/source/cloud-tools/configs/tst-linux32
https://mxr.mozilla.org/build/source/cloud-tools/configs/tst-linux64
it looks like we only create 15G instances (thanks mgerva for the link).

I guess we need to either work out how to save disk space *somewhere* or increase this value.

Or maybe we can "jacuzzi" the spots a bit more, so that different gaia versions are tested on different instances?
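For anyone triaging an affected slave by hand, a minimal sketch of the same survey (paths as above; the cleanup line is an assumption on my part, is only safe while no job is running, and relies on the automation re-cloning a missing shared repo):

  df -h /                                        # headroom left on the 15G root volume
  du -sh /builds/hg-shared/integration/*         # which gaia clones this slave has accumulated
  rm -rf /builds/hg-shared/integration/gaia-1_2  # reclaims roughly 1.4G on the slave above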
disabled this slave in slavealloc
We can bump the size of the root partition easily for spot instances; all new spot instances will then come with more space. On-demand instances would need to be recreated. This will also affect our AWS bill.
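For illustration only (the real change is in the cloud-tools configs linked in comment 0, not a hand-run CLI call), the root volume size is part of the launch request's block device mapping; the AMI id, instance type, device name and 20G figure below are all placeholders:

  aws ec2 run-instances \
      --image-id ami-xxxxxxxx \
      --instance-type m1.medium \
      --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":20,"DeleteOnTermination":true}}]'
  # VolumeSize is in GiB; asking for more than the AMI snapshot size gives the instance a bigger root volume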
We could also have the slaves pull the various gaia repos into the same local repo. As long as they're checking things out by revision, this should be safe.
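A rough sketch of that idea, reusing the paths from comment 0. The gaia branches share history, so pulling them into a single clone should work; $GAIA_REV stands in for whatever revision a job is asked to test:

  cd /builds/hg-shared/integration/gaia-central
  hg pull /builds/hg-shared/integration/gaia-1_2   # fold the 1.2 branch heads into the same store
  hg pull /builds/hg-shared/integration/gaia-1_4   # likewise for 1.4
  hg update -r "$GAIA_REV"                         # safe only because jobs pin an explicit revision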
Breakdown on a running machine with 2G free still:

[root@tst-linux64-spot-034.test.releng.use1.mozilla.com /]# du -chs * 2>/dev/null
0       1
6.9M    bin
22M     boot
5.1G    builds
4.0K    dev
12M     etc
3.2G    home
0       initrd.img
216M    lib
4.0K    lib64
16K     lost+found
4.0K    media
4.0K    mnt
4.0K    opt
0       proc
104K    root
728K    run
7.9M    sbin
4.0K    selinux
4.0K    srv
0       sys
60K     tmp
330M    tools
2.4G    usr
788M    var
0       vmlinuz
12G     total

[root@tst-linux64-spot-034.test.releng.use1.mozilla.com /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/xvda1       15G   13G  2.0G  87% /
(In reply to Pete Moore [:pete][:pmoore] from comment #0)
> [cltbld@tst-linux64-spot-757.test.releng.usw2.mozilla.com .android]$ du -sh /home/cltbld/.android/avd
> 2.6G /home/cltbld/.android/avd

The Android 2.3 jobs extract AVDs-armv7a-gingerbread-build-2014-01-23-ubuntu.tar.gz there, and that is approximately the expected size. I think it could be reduced to about 25% of that, though: we distribute 4 identical avd definitions but only use one of them. (The Android 4.2 x86 jobs need 4 avd definitions because they run up to 4 emulators at a time, but the Android 2.3 jobs on ec2 slaves only run 1 emulator at a time.)
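A sketch of that trim, assuming the four definitions in the tarball really are interchangeable; the avd/ini names below are made up:

  cd /home/cltbld/.android/avd
  ls                                  # e.g. test-1.avd ... test-4.avd plus matching .ini files (names assumed)
  rm -rf test-2.avd test-3.avd test-4.avd test-2.ini test-3.ini test-4.ini
  du -sh /home/cltbld/.android/avd    # should drop from ~2.6G to roughly a quarter of that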
RyanVM found these 5:

tst-linux64-spot-125
tst-linux64-spot-665 (unreachable, assumed terminated)
tst-linux64-spot-411
tst-linux64-spot-751
tst-linux64-spot-646

Did a short sweep through TBPL for m-c for others:

tst-linux64-spot-781 (unreachable, assumed terminated)
tst-linux64-spot-481
tst-linux64-spot-386
tst-linux64-spot-757
tst-linux64-spot-687

There are surely more.
Attached patch bump the size (deleted) — Splinter Review
Let's bump the root partition size until we can figure out how to reduce disk usage.
Attachment #8397593 - Flags: review?(catlee)
Attachment #8397593 - Flags: review?(catlee) → review+
Comment on attachment 8397593 [details] [diff] [review]
bump the size

https://hg.mozilla.org/build/cloud-tools/rev/ca4694057ad2

This change will affect new spot instances, but not existing on-demand ones.
Attachment #8397593 - Flags: checked-in+
Attached patch resize.diff (deleted) — Splinter Review
It turns out that it's not enough to set the size of the volume properly; we also need to grow the partition when the new size is larger than the one in the AMI. This won't work for HVM root devices, but should work for PV instances.
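For reference, a minimal sketch of what growing a PV root at first boot can look like (this is not the attached resize.diff; it assumes cloud-utils' growpart is installed and an ext filesystem on /dev/xvda1):

  growpart /dev/xvda 1    # expand partition 1 to fill the larger volume (skip if the root is an unpartitioned device)
  resize2fs /dev/xvda1    # grow the filesystem into the new space; works online for ext3/ext4
  df -h /                 # should now report the bigger size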
Attachment #8398848 - Flags: review?
Attachment #8398848 - Flags: review? → review?(catlee)
Attachment #8398848 - Flags: review?(catlee) → review+
The last fix made it work \o/ The on-demand instances still use 15G volumes.
Assignee: nobody → rail
Rail: What is the plan to age out the old on-demand and old spot instances?
Flags: needinfo?(rail)
(In reply to John Hopkins (:jhopkins) from comment #14)
> Rail: What is the plan to age out the old on-demand and old spot instances?

I was thinking about re-creating them during this TCW window. Is there any reason why this should be done earlier?
Flags: needinfo?(rail)
Depends on: 991020
Rail: we've been failing test runs due to "out of disk space" messages from time to time all week. I conferred with edmorley and he says we can wait until the TCW to reimage.
No need to wait; they were idle, so I recreated 200 on-demand instances (tst-linux64-{0,3}{01..99}).
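For clarity, that host list is shell brace expansion and can be previewed with:

  echo tst-linux64-{0,3}{01..99}   # tst-linux64-001 ... tst-linux64-099 and tst-linux64-301 ... tst-linux64-399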
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Product: Release Engineering → Infrastructure & Operations
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard
