Closed Bug 977611 Opened 11 years ago Closed 11 years ago

Use local storage for builds

Categories

(Release Engineering :: General, defect)

Hardware: x86_64
OS: Linux
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Unassigned)

References

Details

Attachments

(5 files, 10 obsolete files)

1. (deleted), patch - rail: review+, rail: feedback+, massimo: checked-in+
2. (deleted), patch - catlee: review+, massimo: checked-in+
3. (deleted), patch - catlee: review+, massimo: checked-in+
4. (deleted), patch - catlee: review+, massimo: checked-in+
5. (deleted), patch - catlee: review+, massimo: checked-in+
With only a handful of builders per slave, we should be able to use the local SSD drives attached to most EC2 instances for builds. This will be awesome for lots of reasons:
- Less waiting for EBS -> faster builds! -> less $$$
- Less IO to EBS -> less $$$
- We can reduce the disk size of our EBS volumes -> less $$$
To make this happen we need to teach lvm-init how to detect the available ephemeral storage, so we can use different instance types without worrying about their LVM configs. We should also rethink how we generate AMIs for spot builders. At the moment we clone one of the on-demand instances without any modifications such as disk size or ephemeral storage configuration.
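For illustration, one way to detect the attached ephemeral devices is to query the EC2 instance metadata service. This is a minimal sketch (Python 2.7, as available on the build slaves), not the actual lvm-init change; the /dev/sd* -> /dev/xvd* fix-up is an assumption based on how HVM instances typically rename devices.

    # Sketch: list the block devices backing ephemeral (instance) storage.
    import os
    import urllib2

    METADATA_URL = 'http://169.254.169.254/latest/meta-data/block-device-mapping/'


    def get_ephemeral_devices():
        """Return the block devices that back ephemeral (instance) storage."""
        devices = []
        mappings = urllib2.urlopen(METADATA_URL, timeout=10).read().splitlines()
        for name in mappings:
            if not name.startswith('ephemeral'):
                continue
            dev = urllib2.urlopen(METADATA_URL + name, timeout=10).read().strip()
            # the metadata service reports 'sdb', 'sdc', ...; on many instance
            # types the kernel exposes them as /dev/xvdb, /dev/xvdc, ...
            for candidate in ('/dev/%s' % dev, '/dev/xvd%s' % dev[-1]):
                if os.path.exists(candidate):
                    devices.append(candidate)
                    break
        return devices


    if __name__ == '__main__':
        print get_ephemeral_devices()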
Hi Chris, this patch (based on your scripts) adds the jacuzzi_metadata script and manage_instance_storage.py. jacuzzi_metadata downloads the content of http://jacuzzi-allocator.pub.build.mozilla.org/v1/machines/<hostname> locally at boot, and manage_instance_storage.py uses the content of that file to decide where to mount the instance storage (if any). The jacuzzi_metadata script is executed at boot time, before the manage instance script. I am still working on the manage instance script to find all the possible scenarios and edge cases. In case you need it, dev-bld-linux64-ec2-catlee already has the scripts checked out and is ready for testing.
Attachment #8398527 - Flags: feedback?(catlee)
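A sketch of the boot-time jacuzzi metadata fetch described above (Python 2.7), assuming the allocator returns JSON and that a failed fetch should still write a file so manage_instance_storage.py can fall back to the non-jacuzzi path; the real script's error handling may differ.

    # Sketch: cache this host's jacuzzi allocation where
    # manage_instance_storage.py expects to find it.
    import json
    import socket
    import urllib2

    ALLOCATOR = 'http://jacuzzi-allocator.pub.build.mozilla.org/v1/machines/%s'
    JACUZZI_METADATA_FILE = '/etc/jacuzzi_metadata.json'


    def main():
        hostname = socket.gethostname().partition('.')[0]
        try:
            data = json.load(urllib2.urlopen(ALLOCATOR % hostname, timeout=30))
        except (urllib2.URLError, ValueError):
            # not allocated to a jacuzzi (404) or allocator unreachable
            data = None
        with open(JACUZZI_METADATA_FILE, 'w') as f:
            json.dump(data, f)


    if __name__ == '__main__':
        main()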
Comment on attachment 8398527 [details] [diff] [review]
[puppet] Bug 977611 - manage instance storage script.patch

Review of attachment 8398527 [details] [diff] [review]:
-----------------------------------------------------------------

Woot! I looked at manage_instance_storage.py only and have 2 minor comments:

::: modules/aws/files/manage_instance_storage.py
@@ +107,5 @@
> +    need_format = True
> +    output = get_output_form_cmd(cmd=blkid_cmd, raise_on_error=False)
> +    if output:
> +        for line in output.splitlines():
> +            if 'ID_FS_TYPE=ext4' in line:

IIRC, by default they come formatted as ext3. It's worth checking for both ext3 and ext4.

@@ +189,5 @@
> +    # just in case...
> +    # log fstab content before updating it
> +    log.debug(read_fstab())
> +    old_fstab = read_fstab()
> +    with open('/etc/fstab', 'w') as out_fstab:

I'd rather write to a temporary file and rename it later.
Attachment #8398527 - Flags: feedback+
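An illustrative sketch of the "already formatted?" check discussed in the review above, assuming blkid's udev-style output; the function name and structure are placeholders, not the patch's exact code.

    # Sketch: skip mkfs if the device already carries an ext3/ext4 filesystem.
    import subprocess


    def needs_format(device):
        """Return True if `device` has no usable ext3/ext4 filesystem on it."""
        try:
            output = subprocess.check_output(['blkid', '-o', 'udev', device])
        except subprocess.CalledProcessError:
            # blkid exits non-zero when it finds no recognizable filesystem
            return True
        for line in output.splitlines():
            if line.startswith('ID_FS_TYPE='):
                return line.split('=', 1)[1] not in ('ext3', 'ext4')
        return True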
Added the requested fixes:
* skip formatting if the disk is already ext3 or ext4
* use a temporary file to modify /etc/fstab rather than editing it in place
Attachment #8398527 - Attachment is obsolete: true
Attachment #8398527 - Flags: feedback?(catlee)
Attachment #8398580 - Flags: feedback?(rail)
Comment on attachment 8398580 [details] [diff] [review]
[puppet] Bug 977611 - manage instance storage script II.patch

Review of attachment 8398580 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/aws/files/manage_instance_storage.py
@@ +53,5 @@
> +        raise
> +    return False
> +
> +
> +def get_output_form_cmd(cmd, cwd=None, raise_on_error=True):

nit: s/_form_/_from_/

@@ +154,5 @@
> +
> +def fstab_line(device):
> +    """check if device is in fstab"""
> +    is_fstab_line = False
> +    for line in read_fstab():

should probably make this skip lines beginning with #

@@ +203,5 @@
> +
> +
> +def my_name():
> +    import socket
> +    return socket.gethostname().partition('.')[0]

is this used, or leftover from debugging?

@@ +214,5 @@
> +    _mount_point = '/mnt/instance_storage'
> +    try:
> +        with open(jacuzzi_metadata_file) as data_file:
> +            if json.load(data_file):
> +                # hey I am a Jacuzzi!

can you add a check here that # of builders in the jacuzzi is <= 3?
Attachment #8398580 - Flags: feedback+
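A sketch of the fstab lookup with the comment-skipping the review above asks for; read_fstab() and fstab_line() only loosely mirror the patch's helpers.

    # Sketch: look up a device in /etc/fstab, ignoring commented-out lines.
    ETC_FSTAB = '/etc/fstab'


    def read_fstab():
        with open(ETC_FSTAB) as f:
            return f.readlines()


    def fstab_line(device):
        """Return the fstab line mounting `device`, or None."""
        for line in read_fstab():
            stripped = line.strip()
            if not stripped or stripped.startswith('#'):
                continue
            if stripped.split()[0] == device:
                return line
        return None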
Comment on attachment 8398580 [details] [diff] [review]
[puppet] Bug 977611 - manage instance storage script II.patch

Review of attachment 8398580 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm

don't forget that manage_instance_storage.py may conflict with lvm-init from aws_watch_pending.py (for spot instances) and aws_create_instance.py (for on-demand instances). We should prep a deployment plan for this.
Attachment #8398580 - Flags: feedback?(rail) → feedback+
(In reply to rail from comment #6)
> Comment on attachment 8398580 [details] [diff] [review]
> [puppet] Bug 977611 - manage instance storage script II.patch
>
> Review of attachment 8398580 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> lgtm
>
> don't forget that manage_instance_storage.py may conflict with lvm-init from
> aws_watch_pending.py (for spot instances) and aws_create_instance.py (for
> on-demand instances). We should prep a deployment plan for this.

If you delete /etc/lvm-init/lvm-init.json as part of this patch, this should be enough to prevent lvm-init from running on boot.
Thanks for the feedback! This patch fixes the errors you found and adds a get_builders_from() function to handle the new problems found with broken, malformed, or missing jacuzzi_metadata.json files.
Attachment #8398580 - Attachment is obsolete: true
Attachment #8399498 - Flags: feedback?
(In reply to rail from comment #7)
> If you delete /etc/lvm-init/lvm-init.json as part of this patch, this should
> be enough to prevent lvm-init from running on boot.

Rail, the last patch does not change anything about lvm-init; I am trying to understand whether it is possible to remove lvm-init completely.
(In reply to Massimo Gervasini [:mgerva] from comment #9)
> (In reply to rail from comment #7)
> > If you delete /etc/lvm-init/lvm-init.json as part of this patch, this should
> > be enough to prevent lvm-init from running on boot.
>
> Rail, the last patch does not change anything about lvm-init; I am trying to
> understand whether it is possible to remove lvm-init completely.

Yup, you can do this:
1) rm the file I mentioned (puppet)
2) for CentOS, make sure that the lvm-init package is absent (puppet)
3) for Ubuntu, make sure that /sbin/lvm-init is absent (puppet)

Once these are landed:
4) delete the lvm-init related code from aws_create_instance.py and scripts/aws_watch_pending.py
Rail, did I miss anything?
Attachment #8400035 - Flags: feedback?(rail)
Comment on attachment 8400035 [details] [diff] [review]
[cloud-tools] remove lvm-init configuration files.patch

lgtm
Attachment #8400035 - Flags: feedback?(rail) → feedback+
Attached patch [puppet] instance_storage.pp.patch (obsolete) (deleted)
This patch:
* removes /etc/lvm-init/lvm-init.json
* removes the lvm-init package on CentOS
* removes /sbin/lvm-init on Ubuntu
Attachment #8400053 - Flags: feedback?(rail)
Comment on attachment 8400053 [details] [diff] [review]
[puppet] instance_storage.pp.patch

Review of attachment 8400053 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm. BTW, we should remove the last block (removing lvm-init) once we are done with that.
Attachment #8400053 - Flags: feedback?(rail) → feedback+
Updated the jacuzzi mount point to /builds/slave. We already have hg/git shared directories in /builds; if we mounted the instance storage there, we would have to clone the shared directories again.
Attachment #8400674 - Flags: feedback?(rail)
Attachment #8400674 - Flags: feedback?(rail) → feedback+
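A sketch of the mount-point decision the patches converge on: jacuzzi machines with a small, dedicated set of builders mount the instance storage at /builds/slave, everything else at /mnt/instance_storage. The <= 3 threshold comes from the earlier review comment; get_builders_from() and the 'builders' key are reconstructed here as assumptions, not the patch's exact code.

    # Sketch: choose where to mount the instance storage based on the cached
    # jacuzzi allocation. Paths follow the discussion in this bug.
    import json
    import logging

    log = logging.getLogger(__name__)

    DEFAULT_MOUNT_POINT = '/mnt/instance_storage'
    JACUZZI_MOUNT_POINT = '/builds/slave'
    JACUZZI_METADATA_FILE = '/etc/jacuzzi_metadata.json'
    MAX_JACUZZI_BUILDERS = 3


    def get_builders_from(metadata_file):
        """Return the builders allocated to this machine ([] on any error)."""
        try:
            with open(metadata_file) as f:
                data = json.load(f) or {}
            return data.get('builders', [])
        except (IOError, ValueError):
            # missing, empty or broken jacuzzi_metadata.json
            return []


    def get_mount_point():
        builders = get_builders_from(JACUZZI_METADATA_FILE)
        if builders and len(builders) <= MAX_JACUZZI_BUILDERS:
            log.info('jacuzzi machine (%d builders): mounting at %s',
                     len(builders), JACUZZI_MOUNT_POINT)
            return JACUZZI_MOUNT_POINT
        return DEFAULT_MOUNT_POINT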
Hi catlee, this patch contains only two fixes:
* added require => Class['packages::mozilla::python27'] in modules/jacuzzi_metadata/manifests/init.pp; otherwise /usr/local/bin/jacuzzi_metadata.py fails its first run because python27 is not yet available in /tools
* fixed the wrong JACUZZI_METADATA_FILE path in manage_instance_storage.py
Attachment #8399498 - Attachment is obsolete: true
Attachment #8400053 - Attachment is obsolete: true
Attachment #8400674 - Attachment is obsolete: true
Attachment #8399498 - Flags: feedback?
Attachment #8401276 - Flags: review?(catlee)
Comment on attachment 8401276 [details] [diff] [review]
[puppet] Bug 977611 - use local storage for builds.patch

Review of attachment 8401276 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/aws/manifests/instance_storage.pp
@@ +27,5 @@
> +        enable => true;
> +    }
> +    file {
> +        "/etc/lvm-init/lvm-init.json":
> +        ensure => absent;

nit: indent this one more level. same below
Attachment #8401276 - Flags: review?(catlee) → review+
Hi catlee, in this patch:
* fixed indentation
* updated fstab line to have dump = 0 and fsck = 0
* updated string formatting
Attachment #8401276 - Attachment is obsolete: true
Attachment #8401442 - Flags: review?(catlee)
Comment on attachment 8401442 [details] [diff] [review]
[puppet] Bug 977611 - use local storage for builds.patch

Review of attachment 8401442 [details] [diff] [review]:
-----------------------------------------------------------------

shipit!

         _~
      _~ )_)_~
      )_))_))_)
      _!__!__!_
      \______t/
    ~~~~~~~~~~~~~
Attachment #8401442 - Flags: review?(catlee) → review+
Attachment #8401442 - Flags: checked-in+
Now try machines mount the instance storage disk(s) under /builds/slave.
Attachment #8401483 - Flags: review?(catlee)
/etc/slave-trustlevel is a single line; no need for lists here.
Attachment #8401483 - Attachment is obsolete: true
Attachment #8401483 - Flags: review?(catlee)
Attachment #8401504 - Flags: review?(catlee)
Attachment #8401504 - Flags: review?(catlee) → review+
Attachment #8401504 - Flags: checked-in+
in production
Hi catlee, this new patch takes care of unmounting the logical volume if it is already mounted; it also takes care of unmounting the swap file (if any). I did not find any easy way to map the output of vgs -o lv_path to an entry in fstab, so the code looks messy. I'm happy to nuke that part of the code if there is a better way to do it.
Attachment #8401963 - Flags: feedback?(catlee)
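A sketch of the clean-up step this patch deals with: making sure nothing on the volume is mounted or used as swap before the storage is re-purposed. Comparing resolved device paths against /proc/mounts and /proc/swaps is one straightforward approach; the actual patch maps vgs -o lv_path output to fstab entries instead, as described above, and a swap *file* living on the volume would need its own swapoff before the umount.

    # Sketch: release a logical volume before re-using the instance storage.
    import os
    import subprocess


    def _canonical(device):
        # /dev/<vg>/<lv> and /dev/mapper/<vg>-<lv> are both symlinks to the
        # same /dev/dm-N node, so compare resolved paths
        return os.path.realpath(device)


    def release_device(device):
        """Disable `device` as swap and unmount it if it is currently in use."""
        target = _canonical(device)
        with open('/proc/swaps') as f:
            for line in f.readlines()[1:]:          # skip the header line
                if _canonical(line.split()[0]) == target:
                    subprocess.check_call(['swapoff', line.split()[0]])
        with open('/proc/mounts') as f:
            for line in f:
                if _canonical(line.split()[0]) == target:
                    subprocess.check_call(['umount', line.split()[1]])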
Hi Chris, here's a slightly better version of the previous patch for volumes already mounted/swap files in use. Changes:
* removed redundant checks
* updated logging messages
* minor fixes
Attachment #8401963 - Attachment is obsolete: true
Attachment #8401963 - Flags: feedback?(catlee)
Attachment #8402579 - Flags: review?(catlee)
Comment on attachment 8402579 [details] [diff] [review]
[puppet] Bug 977611 - unmount already mounted vg and swapfile.patch

Review of attachment 8402579 [details] [diff] [review]:
-----------------------------------------------------------------

just a few tweaks required, then we can land it I think!

::: modules/aws/files/manage_instance_storage.py
@@ +148,5 @@
> +        # vgs command failed, no volume groups
> +        log.debug('there are no volume groups')
> +    except IndexError:
> +        log.debug('No volume groups found')
> +    return token

this will return either the raw output from 'vgs -o <token>', or the original token name in the case there are no volume groups. perhaps move the 'return token' inside the try: clause, and return None after the except clauses?

@@ +246,5 @@
> +        for line in read_fstab():
> +            if old_fstab_line not in line:
> +                out_fstab.write(line)
> +        log.debug('removed %s from /etc/fstab', old_fstab_line.strip())
> +    os.rename(temp_fstab.name, '/etc/fstab')

wrap this in a try/except block where you delete temp_fstab.name in the case of unhandled exceptions.
Attachment #8402579 - Flags: review?(catlee) → review-
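A sketch of the temp-file fstab rewrite with the error handling asked for in the review above: write the new fstab to a temporary file, rename it into place, and remove the temporary file if anything goes wrong. Helper names are placeholders, and creating the temp file next to /etc/fstab (so os.rename stays on one filesystem) is an added design choice, not necessarily what the patch does.

    # Sketch: replace a stale /etc/fstab entry, cleaning up on failure.
    import logging
    import os
    import tempfile

    log = logging.getLogger(__name__)
    ETC_FSTAB = '/etc/fstab'


    def read_fstab():
        with open(ETC_FSTAB) as f:
            return f.readlines()


    def remove_from_fstab(old_fstab_line):
        # keep the temp file on the same filesystem so os.rename is atomic
        temp_fstab = tempfile.NamedTemporaryFile(
            delete=False, dir=os.path.dirname(ETC_FSTAB))
        try:
            with open(temp_fstab.name, 'w') as out_fstab:
                for line in read_fstab():
                    if old_fstab_line not in line:
                        out_fstab.write(line)
            os.rename(temp_fstab.name, ETC_FSTAB)
            log.debug('removed %s from %s', old_fstab_line.strip(), ETC_FSTAB)
        except (IOError, OSError):
            # don't leave a half-written temp file behind
            os.unlink(temp_fstab.name)
            raise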
Hey catlee, thanks for the review. This new patch adds better exception handling for the _query_vgs() and os.rename() calls, moves '/etc/fstab' into a constant, and updates the log messages.
Attachment #8402579 - Attachment is obsolete: true
Attachment #8402691 - Flags: review?
Comment on attachment 8402691 [details] [diff] [review]
[puppet] Bug 977611 - unmount already mounted vg and swapfile.patch

Review of attachment 8402691 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/aws/files/manage_instance_storage.py
@@ +243,5 @@
> +    temp_fstab = tempfile.NamedTemporaryFile(delete=False)
> +    with open(temp_fstab.name, 'w') as out_fstab:
> +        for line in read_fstab():
> +            if old_fstab_line not in line:
> +                out_fstab.write(line)

put this whole 'with' block into the 'try' block.
Attachment #8402691 - Flags: review? → review+
Attachment #8402691 - Flags: checked-in+
Attachment #8400035 - Flags: review?(rail)
Attachment #8400035 - Flags: review?(rail) → review+
Attachment #8400035 - Flags: checked-in+
Hi Chris, changes in this patch:
* if the instance storage has enough space, mount it in /builds/slave - "enough space" is currently set to 20 GB; how big should it be? 100? more?
* updated pvcreate() so it manages the case where the instance type changes from one to multiple ephemeral devices (e.g. m3.large -> c3.xlarge)
* updated the log level to INFO and cleaned up the log messages
* some refactoring
Attachment #8404152 - Flags: feedback?(catlee)
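A sketch of the multi-device LVM setup the comment above refers to: every ephemeral device becomes a physical volume and joins a single volume group, so an instance-type change from one to several ephemeral disks is handled uniformly. The volume group name and the pvs-based check are assumptions, not the patch's exact code.

    # Sketch: pool all detected ephemeral devices into one LVM volume group.
    import subprocess

    VG_NAME = 'vg_instance_storage'  # placeholder; the real VG name may differ


    def setup_volume_group(devices):
        """Create or extend a VG spanning all given ephemeral devices."""
        # devices already labelled as physical volumes, one per line from pvs
        existing_pvs = subprocess.check_output(
            ['pvs', '--noheadings', '-o', 'pv_name']).split()
        new_devices = [d for d in devices if d not in existing_pvs]
        for dev in new_devices:
            subprocess.check_call(['pvcreate', dev])
        if subprocess.call(['vgs', VG_NAME]) != 0:
            # no volume group yet: create it over every device
            subprocess.check_call(['vgcreate', VG_NAME] + devices)
        elif new_devices:
            # e.g. m3.large -> c3.xlarge added a second ephemeral disk
            subprocess.check_call(['vgextend', VG_NAME] + new_devices)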
Comment on attachment 8404152 [details] [diff] [review]
[puppet] Bug 977611 - use instance storage if it has enough space.patch

Review of attachment 8404152 [details] [diff] [review]:
-----------------------------------------------------------------

::: modules/aws/files/manage_instance_storage.py
@@ +11,4 @@
> 
>  log = logging.getLogger(__name__)
> +log.setLevel(logging.INFO)

what's the purpose of this? We're already setting the log level in main()

@@ +17,5 @@
>  DEFAULT_MOUNT_POINT = '/mnt/instance_storage'
>  JACUZZI_MOUNT_POINT = '/builds/slave'
>  JACUZZI_METADATA_FILE = '/etc/jacuzzi_metadata.json'
>  ETC_FSTAB = '/etc/fstab'
> +REQ_BUILDS_SIZE = 20  # size in GB

I think 120 GB is a good starting point. 20 is definitely too small.

@@ +358,5 @@
> +    # test if device has enough space, if so mount the disk
> +    # in JACUZZI_MOUNT_POINT regardless the type of machine
> +    # assumption here: there's only one volume group
> +    if vg_size() >= REQ_BUILDS_SIZE:
> +        log.info('disk size: >= REQ_BUILDS_SIZE (%d GB)', REQ_BUILDS_SIZE)

would be nice to log what the calculated disk size is here, e.g.:

    device_size = vg_size()
    if device_size >= REQ_BUILDS_SIZE:
        log.info('disk size: %d GB >= REQ_BUILDS_SIZE (%d GB)', device_size, REQ_BUILDS_SIZE)
Attachment #8404152 - Flags: feedback?(catlee) → feedback+
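A sketch of the size check with the logging suggested above; `vgs --noheadings --units g -o vg_size` is a standard way to read a volume group's size, though the patch's vg_size() may parse it differently, and the helper names here are placeholders.

    # Sketch: only use the instance storage for builds if it is big enough.
    import logging
    import subprocess

    log = logging.getLogger(__name__)
    REQ_BUILDS_SIZE = 120  # size in GB


    def vg_size(vg_name):
        """Return the size of the volume group in GB (0.0 if it cannot be read)."""
        try:
            out = subprocess.check_output(
                ['vgs', '--noheadings', '--units', 'g', '-o', 'vg_size', vg_name])
            return float(out.strip().lstrip('<').rstrip('gG'))
        except (subprocess.CalledProcessError, ValueError):
            return 0.0


    def big_enough_for_builds(vg_name):
        device_size = vg_size(vg_name)
        if device_size >= REQ_BUILDS_SIZE:
            log.info('disk size: %d GB >= REQ_BUILDS_SIZE (%d GB)',
                     device_size, REQ_BUILDS_SIZE)
            return True
        log.info('disk size: %d GB < REQ_BUILDS_SIZE (%d GB)',
                 device_size, REQ_BUILDS_SIZE)
        return False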
Hi Chris, in this patch:
* REQ_BUILDS_SIZE = 120
* removed the get_swap_file() method - not used
* updated docstrings and comments
Attachment #8404152 - Attachment is obsolete: true
Attachment #8405295 - Flags: review?(catlee)
Attachment #8405295 - Flags: review?(catlee) → review+
Attachment #8405295 - Flags: checked-in+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Component: General Automation → General