Closed Bug 942498 Opened 11 years ago Closed 11 years ago

Use spot instances for try builds

Categories

(Release Engineering :: General, defect)

x86_64
Linux
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: rail)

Attachments

(4 files)

We should be able to use spot instances pretty easily for try builds. Try builds are always clobber builds, so as long as we have a semi-recent local copy of the try repo, builds on fresh spot instances shouldn't be slower than builds on regular in-house or ondemand instances.
Assignee: nobody → rail
Attached patch configs (deleted) — Splinter Review
Attachment #8346212 - Flags: review?(catlee)
Attachment #8346212 - Flags: review?(catlee) → review+
Attached file slavealloc CSV (deleted) —
Attached file inventory additions (deleted) —
I extended subnets in AWS to use the whole block of 10.DC.64.0/22 and added missing SOA entries to the inventory.
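For a sense of what "the whole block" of a /22 covers, here is a minimal sketch using Python's standard `ipaddress` module. The bug writes the second octet as "DC", so the `10.0.64.0/22` value below is a hypothetical stand-in, and the split into /24s is only there to show the range; the real subnet layout is not stated in this bug.

```python
import ipaddress

# Hypothetical stand-in: the bug gives the second octet only as "DC",
# so 0 is used here purely for illustration.
block = ipaddress.ip_network("10.0.64.0/22")

# A /22 leaves 10 host bits, i.e. 2**10 addresses.
print(block.num_addresses)

# Viewed as /24s, the block spans four consecutive networks.
subnets = list(block.subnets(new_prefix=24))
for subnet in subnets:
    print(subnet)


def in_block(addr):
    """Return True if addr (a dotted-quad string) falls inside the block."""
    return ipaddress.ip_address(addr) in block
```

A membership check like `in_block` is the kind of test an allowed-IPs rule has to get right once the range grows from a smaller subnet to the full /22.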
I have had 2 spot instances running in the staging environment since yesterday. If I don't find any major issues, I'd like to bring some into production early next week.
Attached patch [puppet] allowed IPs (deleted) — Splinter Review
The try subnet is actually 10.DC.64.0/22.
Attachment #8348247 - Flags: review?(dustin)
Attachment #8348247 - Flags: review?(dustin) → review+
Mon, Dec 16
total jobs, jobs on spots, spot retries
528, 14 (2%), 9 (64%)
The retry ratio is very high here. Since build jobs take much longer, the probability of being killed is much higher. I'll try to bump the price a bit and see the difference.
It was better yesterday:
total jobs, jobs on spots, spot retries
1097, 54 (4%), 1 (1%)
I'm going to bump the limits.
Thu, Dec 19
total jobs, jobs on spots, spot retries
781, 219 (28%), 28 (12%)
I think the high retry percentage is correlated with the shortage of m3.xlarge these days.
date, total jobs, jobs on spots, spot retries
2013-12-20, 546, 286 (52%), 14 (4%)
2013-12-21, 133, 122 (91%), 0 (0%)
2013-12-22, 17, 17 (100%), 0 (0%)
2013-12-23, 236, 137 (58%), 0 (0%)
2013-12-24, 226, 152 (67%), 0 (0%)
2013-12-25, 33, 23 (69%), 0 (0%)
2013-12-26, 5, 5 (100%), 0 (0%)
2013-12-27, 1, 1 (100%), 0 (0%)
2013-12-28, 0, 0 (0%), 0 (0%)
2013-12-29, 38, 34 (89%), 0 (0%)
2013-12-30, 83, 26 (31%), 0 (0%)
2013-12-31, 27, 27 (100%), 0 (0%)
I'm going to monitor try builds on spot instances to figure out the cases where we don't retry properly.
I bumped the try limit up to 200 in total: https://hg.mozilla.org/build/cloud-tools/rev/c362cf6b1af6
FTR, after talking to Taras about the AWS spot pricing model, we decided to bump the bid prices up to 10¢ for m1.medium (vs. the 12¢ on-demand price). The bid is the highest price we are willing to pay if the "market" price goes up. This should improve our retry stats. I adjusted the prices on Friday, Jan 3rd, in the afternoon PT.
date, total jobs, jobs on spots, spot retries
2014-01-01, 3, 3 (100%), 0 (0%)
2014-01-02, 262, 245 (93%), 1 (0%)
2014-01-03, 289, 283 (97%), 25 (8%)
2014-01-04, 163, 152 (93%), 8 (5%)
2014-01-05, 3, 3 (100%), 3 (100%)
The stats are still not representative due to low load.
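The reasoning behind raising the bid can be sketched as follows. This is a simplified model of the spot market as described in the comment above (the bid is a ceiling, not the price paid), not the cloud-tools implementation; the function name and the market prices in the usage lines are hypothetical.

```python
from decimal import Decimal


def spot_outcome(bid, market_price):
    """Simplified spot model: return (keeps_running, hourly_price_paid).

    The instance keeps running while the market price stays at or below
    the bid, and it is billed at the market price rather than the bid.
    If the market price rises above the bid, the instance is reclaimed
    and the job running on it has to be retried.
    """
    if market_price > bid:
        return False, None
    return True, market_price


# A bid of 10 cents per hour, expressed in dollars.
bid = Decimal("0.10")

# Hypothetical market prices, for illustration only.
print(spot_outcome(bid, Decimal("0.03")))  # still running, billed at the market price
print(spot_outcome(bid, Decimal("0.11")))  # reclaimed, so the build is retried
```

Under this model, a higher bid does not raise the cost while the market is quiet; it only widens the range of price spikes an instance survives, which is why it should lower the retry rate.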
I'm going to leave the current limits as is because I see some spikes in retries. I hope they're related to the usw2 network blips, but it'd be better to isolate them.
date, total jobs, jobs on spots, spot retries
2014-01-06, 366, 362 (98%), 9 (2%)
2014-01-07, 754, 611 (81%), 106 (17%)
2014-01-08, 1093, 713 (65%), 54 (7%)
Updated stats for try-linux64
date, total jobs, jobs on spots, spot retries, o-d retries
2014-01-01, 3, 3 (100%), 0 (0%), 0 (0%)
2014-01-02, 262, 245 (93%), 1 (0%), 0 (0%)
2014-01-03, 289, 283 (97%), 25 (8%), 0 (0%)
2014-01-04, 163, 152 (93%), 8 (5%), 0 (0%)
2014-01-05, 3, 3 (100%), 3 (100%), 0 (0%)
2014-01-06, 366, 362 (98%), 9 (2%), 0 (0%)
2014-01-07, 754, 611 (81%), 106 (17%), 0 (0%)
2014-01-08, 1093, 713 (65%), 54 (7%), 0 (0%)
2014-01-09, 1231, 940 (76%), 69 (7%), 2 (0%)
2014-01-10, 1312, 848 (64%), 6 (0%), 4 (0%)
2014-01-11, 468, 345 (73%), 1 (0%), 28 (22%)
2014-01-12, 267, 267 (100%), 0 (0%), 0 (0%)
I bumped the limits for try jobs twice: https://hg.mozilla.org/build/cloud-tools/rev/098082cd1239
This has been quite stable so far, closing.
date, total jobs, jobs on spots, spot retries, o-d retries
2014-01-13, 790, 662 (83%), 7 (1%), 0 (0%)
2014-01-14, 889, 843 (94%), 32 (3%), 0 (0%)
2014-01-15, 1244, 1079 (86%), 105 (9%), 0 (0%)
2014-01-16, 1072, 997 (93%), 2 (0%), 0 (0%)
2014-01-17, 1158, 814 (70%), 122 (14%), 0 (0%)
2014-01-18, 377, 340 (90%), 0 (0%), 0 (0%)
2014-01-19, 152, 152 (100%), 0 (0%), 0 (0%)
2014-01-20, 461, 461 (100%), 2 (0%), 0 (0%)
2014-01-21, 1363, 814 (59%), 15 (1%), 0 (0%)
2014-01-22, 997, 679 (68%), 14 (2%), 1 (0%)
2014-01-23, 1249, 688 (55%), 3 (0%), 2 (0%)
2014-01-24, 1064, 704 (66%), 10 (1%), 0 (0%)
2014-01-25, 138, 113 (81%), 6 (5%), 0 (0%)
2014-01-26, 67, 67 (100%), 0 (0%), 0 (0%)
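For reading the percentage columns: the figures in these tables appear consistent with truncated integer percentages, with spot jobs measured against total jobs, spot retries against spot jobs, and o-d retries against the remaining (on-demand) jobs. That is inferred from the numbers, not from the script that produced them, so the sketch below is an assumption; `pct` and `format_row` are hypothetical names.

```python
def pct(part, whole):
    """Truncated integer percentage; 0 when there is nothing to divide by."""
    return 100 * part // whole if whole else 0


def format_row(date, total, on_spot, spot_retries, od_retries):
    """Render one stats row in the comma-separated layout used in this bug."""
    on_demand = total - on_spot  # jobs that did not run on spot instances
    return (
        f"{date}, {total}, "
        f"{on_spot} ({pct(on_spot, total)}%), "
        f"{spot_retries} ({pct(spot_retries, on_spot)}%), "
        f"{od_retries} ({pct(od_retries, on_demand)}%)"
    )


# Raw counts taken from the 2014-01-11 row.
print(format_row("2014-01-11", 468, 345, 1, 28))
```

With this reading, the 22% on 2014-01-11 means 28 retries out of the 123 jobs that ran on on-demand instances that day, not 22% of all jobs.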
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: General Automation → General