Closed Bug 859573 Opened 12 years ago Closed 11 years ago

Expose system resource usage of jobs

Categories

(Release Engineering :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(5 files, 6 obsolete files)

(deleted), patch; mozilla: review+
(deleted), patch; mozilla: review+
(deleted), patch; mozilla: review+
(deleted), patch; mozilla: review+
(deleted), patch; dustin: review+, jhopkins: checked-in+
I think it would be rad if we exposed the resource usage of all the build and test jobs that ran as part of our automation. i.e. let's post a machine-readable file on the ftp server with a breakdown of CPU, memory, and disk usage over the lifetime of the job. I wouldn't be surprised if we had something already monitoring system resource usage. However, I'm willing to bet it isn't correlated to jobs.

The end goal of this request is to monitor system resource usage over time and notice important trends such as changes in CPU and disk usage. We could use this as a measurement for whether build system optimization improvements have any effect, for example.

In my ideal world, we would only measure system resource usage for the processes actually attached to the job at hand. In reality, this would require probing a tree of processes, some very short-lived. By doing it this way, you would isolate resource usage of the job from all the other stuff running on a machine (cron jobs, etc). Unfortunately, doing it right requires considerable effort on some operating systems (e.g. we'd likely be looking at containers on Linux). In light of this, whole-system/non-isolated resource usage should be sufficient for initial deployment. We can iterate on it later.

Since lots of our tooling is written in Python, we could leverage the psutil Python package (already in m-c) to collect the data. There is some work in bug 802420 to build a generic resource usage recorder Python class.
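(For illustration: a minimal sketch of the kind of whole-system sampler being requested here, using psutil. The sample() helper, the polling loop, the job_is_running() predicate, and the output filename are all hypothetical, not from any attachment in this bug.)

import json
import time

import psutil

def sample():
    """Take one whole-system snapshot of CPU, memory, and disk usage."""
    io = psutil.disk_io_counters()
    return {
        'time': time.time(),
        'cpu_percent': psutil.cpu_percent(interval=None),
        'mem_percent': psutil.virtual_memory().percent,
        'io_read_bytes': io.read_bytes,
        'io_write_bytes': io.write_bytes,
    }

samples = []
while job_is_running():  # hypothetical predicate for the job's lifetime
    samples.append(sample())
    time.sleep(1)

# Publish the raw series as a machine-readable artifact next to the logs.
with open('resource-usage.json', 'w') as fh:
    json.dump(samples, fh)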
As discussed with jmaher just now on IRC, longer term I'd like to have per-test resource usage. I should be able to query per-test-file CPU, I/O, and memory usage. While granular per-test data may not always be extremely useful ("random" GCs while a single Firefox process is running hundreds of mochitest files are one source of skew), it is data, and enough data could provide a useful signal. I'm not sure it would ever be sufficient to replace Talos tests, but it would be a good compromise between having no performance data and "use Talos."

That dream aside, this bug is just about getting the highest-level perf readings at a system level. We can figure out performance monitoring inside test runners in other bugs.
I'd love to see a system where the build slave keeps track of these metrics just as it also keeps track of text log data. The CPU load, disk i/o, etc. are just other types of time series data. You could throw video recording in for extra debugging goodness. We've made provisions for reporting this data in Treeherder, but I don't have a good sense right now of how to capture this in general.
Having learned a bit more about how the automation works, I think the best place for this functionality is mozharness. I make this conclusion on the assumption that mozharness scripts will eventually replace most of the buildbot steps from buildbotcustom. Therefore, it makes little sense to add functionality to buildbot when it could be accomplished in mozharness.

Here is what I propose:

1) The psutil Python package is installed on all automation systems. (If we're lucky, this is simply a "pip install". Worst case we need to package it.)
2) mozharness is updated to use https://github.com/mozilla/mozbase/blob/master/mozsystemmonitor/mozsystemmonitor/resourcemonitor.py to record system resource usage. We have it recording background activity and creating markers when mozharness actions start and stop.
3a) Print a resource usage summary at the end of mozharness script execution.
3b) Figure out how to upload the raw data for later analysis.

This could all be optional, of course. If psutil isn't available or doesn't work, we'd fall back to no resource recording (like today). I'm doing something similar for the build system itself in bug 883209. I'm even building an HTML visualizer of the raw resource data!

I think the hardest thing I described is rolling out psutil everywhere and figuring out remote devices. Followup fodder? Is this plan appealing?
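(A rough sketch of how mozharness could drive the resourcemonitor.py linked in point 2, using its start/stop and phase-marker API; the method names are taken from that module but should be treated as best-effort, and run_action() is a hypothetical stand-in for the dispatch loop.)

from mozsystemmonitor.resourcemonitor import SystemResourceMonitor

monitor = SystemResourceMonitor(poll_interval=1.0)
monitor.start()

# Wrap each mozharness action in a named phase so the background samples
# can later be correlated to the action that was executing.
monitor.begin_phase('create-virtualenv')
run_action('create-virtualenv')
monitor.finish_phase('create-virtualenv')

monitor.stop()

# 3a) a summary at the end of execution...
cpu = monitor.aggregate_cpu_percent(phase='create-virtualenv', per_cpu=False)
print('create-virtualenv used %.0f%% CPU' % cpu)

# 3b) ...and the raw samples, which could be uploaded for later analysis.
raw = monitor.as_dict()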
Sounds great! mozharness does know how to manage virtualenvs, so it could do a 'pip install' on every job, but if we're going to want this for all our builds and tests, it may make sense to deploy system-wide.
Attached patch Record and log resource usage of jobs (obsolete) (deleted) — Splinter Review
This is pretty much what the mozharness patch needs to look like (I think). I haven't tested this completely because I need to figure out a mozharness script that will run locally. Anyone have any suggestions?

I obviously need to update requirements.txt and setup.py. I may want to guard the new import with an "except ImportError" to handle cases where this mozbase package isn't deployed (probably everywhere right now). But if it's easy to deploy this mozbase package at the time this mozharness is deployed, we should just do that. On that note, mozsystemmonitor isn't currently on PyPI. Will file a bug to make that happen.

It's worth mentioning that mozsystemmonitor requires psutil to actually work. However, it still appears to work if psutil isn't available (i.e. it won't throw when you start and stop recording, etc).

The patch is obviously very crude when it comes to logging resource usage. But I think that's OK. Some things we may consider before landing:

1) Ignore actions that are very short and/or have no resource data points
2) Allow scripts to annotate (possibly via decorator) which actions are important and only print usage of those.

My goal is for the data to be actionable. People should be like "this test harness is only using 31% CPU - we should try to make it use more CPU." Long term, I'd like to dump the raw data to facilitate deeper forensics. That's likely blocked on bug 749421, however. Until then, baby steps: something is better than nothing.
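(The "except ImportError" guard mentioned above would look something like this; a sketch, not the attached patch. The maybe_start_monitor() helper is hypothetical.)

try:
    from mozsystemmonitor.resourcemonitor import SystemResourceMonitor
except ImportError:
    # mozbase package not deployed on this machine; degrade to no
    # resource recording rather than failing the job.
    SystemResourceMonitor = None

def maybe_start_monitor():
    """Start monitoring only when the package is available."""
    if SystemResourceMonitor is None:
        return None
    monitor = SystemResourceMonitor()
    monitor.start()
    return monitor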
Assignee: nobody → gps
Comment on attachment 768181 [details] [diff] [review]
Record and log resource usage of jobs

>--- a/mozharness/base/script.py
>+++ b/mozharness/base/script.py

mozharness.base.* is intended to be non-Mozilla-specific, and mozharness is intended to be a generic script harness. Assuming that all scripts anywhere, run by anyone, will want resource monitoring is a very TBPL/production-level oriented assumption. With all the issues we're hitting trying to fix bug 805925, I'm going to assume we need a specific machine setup or set of permissions to be able to monitor these things? I think if we wanted the resource monitoring to be a global thing, we would need ways to turn this on and off, at the least.

> try:
>     import simplejson as json
>     assert json
> except ImportError:
>     import json
>
>+from mozsystemmonitor.resourcemonitor import SystemResourceMonitor

This is problematic because the current expectation is that mozharness will bootstrap itself. The workflow is clone, run script, and the script sets up any python dependencies needed in a virtualenv that it then uses for later actions. Putting the import here will burn all mozharness jobs on every tree. Currently requirements.txt is only used to set up the virtualenv needed to run the nosetests.

I think one possible solution is to subclass BaseScript, and to create a helper method to import SystemResourceMonitor, and to run that helper method *after* the virtualenv is created, a la http://hg.mozilla.org/build/mozharness/file/27b230a75983/mozharness/mozilla/testing/mozpool.py#l38

Then run actions based on a) whether the config is set to run resource monitoring, and b) whether self.system_resource_monitor is not None.
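(The subclass-plus-helper approach suggested here could look roughly like the following; the class and method names are made up for illustration, and only the "import *after* create-virtualenv" idea comes from the comment.)

from mozharness.base.script import BaseScript

class ResourceMonitoringScript(BaseScript):
    """BaseScript subclass that defers the mozsystemmonitor import until
    after the virtualenv (and thus psutil) exists."""

    system_resource_monitor = None

    def start_resource_monitoring(self):
        # Call this after create-virtualenv, a la the mozpool helper
        # linked above. If the config doesn't opt in, do nothing.
        if not self.config.get('resource_monitor'):
            return

        try:
            from mozsystemmonitor.resourcemonitor import SystemResourceMonitor
        except ImportError:
            self.warning('mozsystemmonitor not available; not monitoring')
            return

        self.system_resource_monitor = SystemResourceMonitor()
        self.system_resource_monitor.start()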
(In reply to Aki Sasaki [:aki] from comment #6)

Thank you, Aki. I was ignorant of many of the things you stated. Your feedback seems very reasonable given those constraints and I will address them in the next revision.

On IRC, Callek was saying something outside of mozharness/build system (like collectd) feels better than what's being done here. First, having used collectd, I highly recommend it. Second, I think the two solutions are complementary. collectd will give you a very high-level view of overall resource utilization over all of time. While the solution in this bug will provide much of the same data, it has the advantage of a) being runnable everywhere and b) being correlated to job/action activity. I believe these are sufficient to warrant redundancy and integration with mozharness.
This makes sense, and I'd love to see what our resource usage is. I was just noting some potential issues in this specific patch.
Per earlier feedback, integrating resource collection into BaseScript directly won't be possible because of a classic chicken-or-egg problem. So, it will need to be implemented in a derived class.

To correlate resource usage with build steps, resource monitoring will need to integrate with the "dispatch loop" in BaseScript.run(). Currently, there's no way to do that aside from reimplementing that method. This patch changes that.

This patch adds two functions to mozharness.base that can be used as decorators - PreScriptStep and PostScriptStep. If a BaseScript-derived class contains methods with these decorators, those methods will be called before or after execution of each step. More detailed info is in source docs. These "listeners" are generic and can be used for anything. I intend to use them for resource monitoring, but others may very well come up with other clever uses for them.

In a subsequent patch, I'd like to add pre-run and post-run hook points. But, I'd like to run this implementation by you first to make sure we're on the same page. I'm also open to the idea of adding decorators that can be registered against a specific action (perhaps by passing an argument into the decorators I've defined so far) and replacing the preflight and postflight functions with these.
Attachment #771035 - Flags: review?(aki)
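(Usage of the two decorators would look something like this, per the patch description; the method names are illustrative, and later revisions in this bug rename the decorators to PreScriptAction/PostScriptAction and friends.)

from mozharness.base.script import BaseScript, PostScriptStep, PreScriptStep

class MonitoredScript(BaseScript):
    @PreScriptStep
    def _on_step_start(self, action):
        # Called before every step dispatched by BaseScript.run().
        self.info('step starting: %s' % action)

    @PostScriptStep
    def _on_step_end(self, action, success=None):
        # Called after every step, including ones that failed.
        self.info('step finished: %s (success: %s)' % (action, success))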
Comment on attachment 771035 [details] [diff] [review]
Part 1: Register action listeners via decorators

I think I'm overall good with this, but the exception handling below seems off:

>+        for fn in self._pre_step_listeners:
>+            method = getattr(self, fn)
>+            method(action)

Is it just me, or did you intend to wrap this for loop in the try/except, and instead wrapped the existing action calls instead?

>+        success = True
>+        try:
>+            self._possibly_run_method("preflight_%s" % method_name)
>+            self._possibly_run_method(method_name, error_if_missing=True)
>+            self._possibly_run_method("postflight_%s" % method_name)
>+        except:
>+            success = False
>+            raise

Sometimes exceptions are helpful to show where in the code something happened, but in long loops I find a well-formed error message narrows down the issue faster. Plus throwing exceptions bypasses the logging. I'm basically trying not to show the user an exception, if I can instead log something useful.

>+        finally:
>+            for fn in self._post_step_listeners:
>+                method = getattr(self, fn)
>+                method(action, success=success)

I'm guessing we also want to wrap this in the try/except, and not the action call.

>-        sys.exit(self.return_code)
>+
>+        if exit:
>+            sys.exit(self.return_code)

I guess you're looking at wrapping self.run() in a child object or something?
(In reply to Gregory Szorc [:gps] from comment #7) FYI, there was a meeting back in April between ateam, releng, and IT discussing the use of collectd/graphite to correlate data as you're suggesting. (IT already uses collectd/graphite for gathering/correlating stats on all of their internal systems.) We talked about the kinds of data we would want to checkpoint both from a system perspective and from an application perspective and how we could then correlate the results in various ways. relops (the IT group that supports releng) has already built collectd packages for both centos and ubuntu (OS X is next on the list) and written puppet manifests for them, but we are waiting for permission from releng/ateam to enable data collection on hosts. The primary concern for test machines is that collection of data does not interfere with timing tests. There was also discussion of application-level statistics gathering in mozharness and from various other tools like slavealloc and treestatus (so we could tell if changes in performance were due to a tree closure or disabling of a significant portion of the pool, for example), but I'm not sure how far along those portions of the project are since they were not within the purview of relops/IT.
Amy: Please read comments 6 and 7 regarding collectd. The patches in this bug should be isolated to mozharness, so I think that partially addresses the application-level statistics you referred to.
(In reply to Gregory Szorc [:gps] from comment #12)

In the meeting we held, it was the representatives from ateam and releng that suggested placing the telemetry in mozharness. Maybe they will later have a larger meeting where they debate and disagree on this (or maybe they already have). I see that ctalbert and ted are cced on this bug, so maybe they have more information.
Amy: I think the work gps has done here fits in nicely with what we talked about, by instrumenting mozharness to provide specific data on what's happening during build jobs. It'd just be a matter of making this code report what it collects to graphite and we'd be in business.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #14)
> Amy: I think the work gps has done here fits in nicely with what we talked
> about, by instrumenting mozharness to provide specific data on what's
> happening during build jobs. It'd just be a matter of making this code
> report what it collects to graphite and we'd be in business.

A better model would be to have it report to the local collectd daemon on the buildslave, which would push the collected metrics from mozharness to the graphite server along with the data from the other collectd modules. This would allow us to manage where and how the data gets aggregated to IT's graphite server via puppet. Since I've been building out the collectd configuration for the releng infra, I can help make this happen. It shouldn't be too hard to connect the two components.
I think IT/RelEng should continue to deploy collectd+graphite(+carbon/whisper) to the slaves. This will provide overall machine resource usage outside the context of mozharness. However, collectd/graphite by itself is not sufficient for the goals in this bug for the following reasons:

1) collectd only collects at fixed time intervals. I would like collection to occur at specific points of time within mozharness so data for different "phases" of mozharness execution is clearly separated from each other.

2) carbon/whisper only stores numeric data associated with times. Again, one of the goals is to correlate named events/phases. We can't store this data in carbon/whisper. Therefore we can't display it with graphite.

3) Anyone running mozharness should be able to get resource usage monitoring. I'm going to assert that setting up collectd locally won't meet that requirement because a) it really only works well on Linux (OS X support was partial last I checked) and b) it really wants to run as root.

4) We'd like to report on resource usage in the output of mozharness (possibly for display on TBPL and elsewhere). This means mozharness needs a way to access the data. I suppose mozharness could open a pipe to collectd or could access local storage or communicate with the remote carbon server. But, this introduces a run-time dependency which has operational considerations and may also violate #3.

These points lead me to believe that collectd is insufficient for the goals set out in this bug (collecting and correlating data from mozharness). However, collectd is terrific for high-level system resource usage. As I stated in comment #7, I believe there is room for two complementary resource collection systems.

While I would like for collectd/graphite and mozharness to initially be fully decoupled (I don't want this bug to be blocked on other work), it doesn't have to always be this way. We could later retool mozharness to write data into collectd or carbon. While we couldn't write event/phase data for correlation, the raw data would be in the graphite domain. However, if collectd is running on the machines and already collecting cpu/memory/io stats, then I'm not sure if this would provide us anything.

If you still have questions after reading this comment, perhaps it's best to take things offline. I don't like long conversations derailing bugs with ongoing code reviews.
Comment on attachment 771035 [details] [diff] [review]
Part 1: Register action listeners via decorators

Clearing review flag until comment 10 is addressed.
Attachment #771035 - Flags: review?(aki)
(In reply to Aki Sasaki [:aki] from comment #10)
> >-        sys.exit(self.return_code)
> >+
> >+        if exit:
> >+            sys.exit(self.return_code)
>
> I guess you're looking at wrapping self.run() in a child object or something?

This was the quick fix. I think a better solution is to remove the sys.exit() from run() and move it into all of its callers. Or, we can provide a separate method - like run_and_exit(). The reason I don't want run() to sys.exit() is because it prohibits things like running multiple scripts from the same process. (OK, technically you can monkeypatch sys.exit, but you shouldn't have to do that.)
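(The run_and_exit() idea, sketched; only the general shape is implied by the comment, and the surrounding BaseScript internals are elided.)

import sys

class BaseScript(object):
    # ...existing attributes, all_actions, run_action(), etc. elided...

    def run(self):
        # Return the exit code instead of calling sys.exit(), so multiple
        # scripts can run within one process.
        for action in self.all_actions:
            self.run_action(action)
        return self.return_code

    def run_and_exit(self):
        # Command-line entry points keep the old exiting behavior.
        sys.exit(self.run())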
Depends on: 892140
I have implemented 4 decorators corresponding to the new hook points: pre run, pre action, post action, and post run. The post-action hooks are guaranteed to run, even if execution fails. This has a nice symmetric property to it.

There is a slight backwards-incompatible change with this code. Previously, exceptions when calling the action method weren't getting caught. In this patch, run() now catches all exceptions (except KeyboardInterrupt and SystemExit, because those inherit from BaseException, not Exception) and logs them. I think the end result is more or less the same. But exception handling is more explicit now.
Attachment #773721 - Flags: review?(aki)
Attachment #771035 - Attachment is obsolete: true
Attachment #773719 - Attachment is obsolete: true
Comment on attachment 773721 [details] [diff] [review]
Part 1: Register script listeners via decorators

Nit: pyflakes says:

mozharness/base/script.py:921: local variable 'e' is assigned to but never used
mozharness/base/script.py:974: local variable 'e' is assigned to but never used

Looks like you're using |except Exception as e:| in many more places; it's just warning about the last unused assignment per scope.

>diff --git a/mozharness/base/script.py b/mozharness/base/script.py
>old mode 100755
>new mode 100644

Nit: Unintended side effect? Doesn't seem to have any negative consequences; just curious.

>+        for k in dir(self):
>+            item = getattr(self, k)
>+
>+            # We only decorate methods, so ignore other types.
>+            if not inspect.ismethod(item):
>+                continue
>+
>+            if hasattr(item, '_pre_run_listener'):
>+                self._listeners['pre_run'].append(k)
>+
>+            if hasattr(item, '_pre_action_listener'):
>+                self._listeners['pre_action'].append(k)
>+
>+            if hasattr(item, '_post_action_listener'):
>+                self._listeners['post_action'].append(k)
>+
>+            if hasattr(item, '_post_run_listener'):
>+                self._listeners['post_run'].append(k)

Looks like these are going to run in some arbitrary order, if we have multiple listeners of a type? I'm ok with that, just clarifying.

>+        for fn in self._listeners['pre_action']:
>+            try:
>+                method = getattr(self, fn)
>+                method(action)
>+            except Exception as e:
>+                self.error("Exception during pre-action for %s: %s" % (action,
>+                           traceback.format_exc()))
>+
>+                for fn in self._listeners['post_action']:
>+                    try:
>+                        method = getattr(self, fn)
>+                        method(action, success=False)
>+                    except Exception as e:
>+                        self.error("An additional exception occurred during "
>+                                   "post-action for %s: %s" % (action,
>+                                   traceback.format_exc()))
>+
>+                self.fatal("Aborting due to exception in pre-action listener.")
>+
>+        # We always run post action listeners, even if the main routine failed.
>+        success = False
>+        try:
>+            self._possibly_run_method("preflight_%s" % method_name)
>+            self._possibly_run_method(method_name, error_if_missing=True)
>+            self._possibly_run_method("postflight_%s" % method_name)
>+            success = True
>+        finally:
>+            post_success = True
>+            for fn in self._listeners['post_action']:
>+                try:
>+                    method = getattr(self, fn)
>+                    method(action, success=success and self.return_code == 0)
>+                except Exception as e:
>+                    post_success = False
>+                    self.error("Exception during post-action for %s: %s" % (
>+                               action, traceback.format_exc()))
>+
>+            if not post_success:
>+                self.fatal("Aborting due to failure in post-action listener.")

I'd like to see some logging happening describing what's happening. I'm worried about the scenario where the action finishes or hits an exception, or we hit an exception in the pre-actions, and we expect the script to stop executing or continue to the next action. Instead, we start running the post_action listener methods, which might take some non-trivial amount of time, and aren't guaranteed to log anything. To the user, this might appear as if something is hung or stuck. Maybe before calling each method() we should log something? I'd default to self.info() for this.

>+        try:
>+            for action in self.all_actions:
>+                self.run_action(action)
>+        except Exception as e:
>+            self.fatal("Uncaught exception: %s" % traceback.format_exc())
>+        finally:

I thought the self.fatal() here was going to sys.exit() before the finally:, but testing shows the finally: block runs. Weird, but cool, I think.

Thanks for the patch + tests! I'm ok with this landing, with the nits addressed and logging added.
If you need another review pass or want to discuss my comments further, I'm open to that too.
Attachment #773721 - Flags: review?(aki) → review+
I addressed review comments.

As I was implementing this, in the back of my mind I was also thinking "this should be able to replace the preflight and postflight hooks." So, before your review came in, I was already implementing code to have the pre-action and post-action decorators only apply to specific actions. (The impetus was I ran into a roadblock with resource monitoring patches where multiple classes needed to define a preflight function and I figured it would be easiest to avoid requiring super() everywhere and accomplish things with independent functions/observers. So I did.)

This means we can replace the preflight and postflight functions with decorated functions now!
Attachment #773754 - Flags: review?(aki)
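(The action-scoped form described above reads like this in practice; compare the b2g_emulator_unittest.py hunk quoted in a later review. The class name and method body here are illustrative.)

from mozharness.base.script import BaseScript, PreScriptAction

class MyTestScript(BaseScript):
    # Passing an action name scopes the listener to that one action,
    # replacing the old preflight_<action> convention without requiring
    # super() chains across mixins.
    @PreScriptAction('create-virtualenv')
    def _pre_create_virtualenv(self, action):
        self.info('about to populate the virtualenv')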
This patch provides a formal API for declaring virtualenv packages. Before, scripts had to override create_virtualenv to install custom packages. Now, they simply need to make a few API calls at setup time to register packages, and the default implementation of create_virtualenv will do the right thing. I am going to build on top of this to inject the system resource monitor installation into the virtualenv install.

IMO, this is a cleaner API than having N classes all override the same functions and call super() everywhere. It feels more declarative and less fragile to me.

I may have missed something important with the script modification. I fully don't grok how variables come from different locations (like sometimes how modules are obtained from the local instance and others they are obtained from a config object).

One thing I don't like about this patch is that virtualenv modules are registered in a pre action listener. I /think/ __init__ would make more sense. However, I wasn't sure if all the referenced variables are defined/proper at the time of __init__, so I preserved existing semantics by having the code still execute during run().

If you want to push back against the scripts changes, I'll gladly revert those files. However, I'd really like the register API to stick, as it makes global package registration easier. This should be more obvious in a subsequent patch.
Attachment #773759 - Flags: review?(aki)
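(In practice, the declarative API reads like this; the package names and path are illustrative, and registration happens at run time via a pre-action listener as described above.)

from mozharness.base.python import VirtualenvMixin
from mozharness.base.script import BaseScript, PreScriptAction

class MyScript(VirtualenvMixin, BaseScript):
    # Register just before create-virtualenv runs, since config may not
    # be fully set up at __init__ time.
    @PreScriptAction('create-virtualenv')
    def _register_modules(self, action):
        self.register_virtualenv_module('mozinstall')
        self.register_virtualenv_module('gaia-ui-tests',
            url='/path/to/gaia-ui-tests',  # illustrative path
            method='pip')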
Attached patch Part 3: Record system resource usage of scripts (obsolete) (deleted) — Splinter Review
And here's my first stab at resource monitoring hooked up to the new hooks/listeners functionality!

I created a new class under mozharness.base. It registers the mozsystemmonitor package with the virtualenv (currently via pip). Once the virtualenv is present, it starts resource monitoring. It records when actions started and finished so at the end it can correlate recorded probes against mozharness actions/steps. Example output is below.

Unfortunately, I couldn't get Marionette tests to actually finish (it's complaining about some HTTP server). But, things do appear to be working!

This patch probably needs a bit more work before it can land. I'd like to expose an API so other classes can get at the raw info. This will be needed so a Mozilla-specific class can log the special line to get things to display on TBPL (I would like TBPL to display average CPU % to shine awareness at how CPU-inefficient our automation is).

Anyway, I'm not looking for full review now. But, I would appreciate feedback.

21:31:35 ERROR - Marionette exited with return code 1: harness failures
21:31:35 ERROR - # TBPL FAILURE #
21:31:35 INFO - Running post-action listener: _resource_record_post_action
21:31:35 INFO - Running post-run listener: _resource_record_post_run
21:31:36 INFO - Total resource usage - Wall time: 37s; CPU: 46%; Read bytes: 41286656; Write bytes: 231290880; Read time: 1077; Write time: 2960
21:31:36 INFO - install - Wall time: 32s; CPU: 47%; Read bytes: 231947776; Write bytes: 226891776; Read time: 19513; Write time: 2877
21:31:36 INFO - run-marionette - Wall time: 6s; CPU: 38%; Read bytes: 15077376; Write bytes: 2361344; Read time: 538; Write time: 43
21:31:36 INFO - Copying logs to upload dir...
21:31:36 INFO - mkdir: /Users/gps/src/moz-build/mozharness/build/upload/logs
Attachment #768181 - Attachment is obsolete: true
Attachment #773769 - Flags: feedback?(aki)
Attachment #773721 - Attachment is obsolete: true
Attachment #773754 - Flags: review?(aki) → review+
Landed part 1 to get in front of bit rot: https://hg.mozilla.org/build/mozharness/rev/87a8c74982cf
Status: NEW → ASSIGNED
Comment on attachment 773759 [details] [diff] [review]
Part 2: API for declaring virtualenv packages

(In reply to Gregory Szorc [:gps] from comment #23)
> This patch provides a formal API for declaring virtualenv packages.
> Before, scripts had to override create_virtualenv to install custom
> packages. Now, they simply need to make a few API calls at setup time to
> register packages and the default implementation of create_virtualenv
> will do the right thing.

I think this looks great. r=me with the script issues addressed. I knew mozharness.base.python was getting a bit ugly, and it could still use additional cleanup, but this is a great start.

> I may have missed something important with the script modification. I
> fully don't grok how variables come from different locations (like
> sometimes how modules are obtained from the local instance and others
> they are obtained from a config object).
>
> One thing I don't like about this patch is that virtualenv modules are
> registered in a pre action listener. I /think/ __init__ would make more
> sense. However, I wasn't sure if all the referenced variables are
> defined/proper at the time of __init__, so I preserved existing
> semantics by having the code still execute during run().

I think it's safer to allow for runtime setting for now; there are plenty of instances where we define behavior after cloning source, well into run().

>@@ -307,16 +323,21 @@
>             module_url = self.config.get('%s_url' % module, module_url)
>             install_method = 'pip'
>             if module in ('pywin32',):
>                 install_method = 'easy_install'
>             self.install_module(module=module,
>                                 module_url=module_url,
>                                 install_method=install_method,
>                                 requirements=requirements)
>+
>+        for module, url, method, requirements in self._virtualenv_modules:
>+            self.install_module(module=module, module_url=url,
>+                install_method=method, requirements=requirements or ())

I like how the existing logic isn't touched, ensuring that this is backwards-compatible until we decide to use this api everywhere. This will work for all cases except where we need a custom module installed in some order before the end of self.config['virtualenv_modules'] (to satisfy dependencies). I'm not sure how often we'll hit that case... and we have hardcoded distribute, pip, and pywin32 cases in create_virtualenv() already. If you have thoughts about this, cool; if not, I'm noting for the future.

>diff --git a/scripts/b2g_emulator_unittest.py b/scripts/b2g_emulator_unittest.py
<snip>
>+    @PreScriptAction('create-virtualenv')
>+    def _pre_create_virtualenv(self, action):
>         if self.tree_config.get('use_puppetagain_packages'):
>-            self.virtualenv_requirements = [
>-                os.path.join('tests', 'b2g', 'b2g-unittest-requirements.txt')
>-            ]
>-            self.virtualenv_modules = [
>-                'mozinstall',
>-                {'marionette': os.path.join('tests', 'marionette')},
>-            ]
<snip>
>+            self.register_virtualenv_module('mozinstall',
>+                requirements=requirements)
>+            self.register_virtualenv_module('marionette',
>+                url=os.path.join('tests', 'marionette'), requirements=requirements)

We may have an issue here with module order... a |pip install -r requirements.txt py| installed py first, then the requirements. I'm not sure what order we need here. Or if pip is smart enough to do the right thing if there are interdependencies.
Then again, I think we're ignoring requirements completely if modules is defined:
http://hg.mozilla.org/build/mozharness/file/87a8c74982cf/mozharness/base/python.py#l299
So I'm not entirely sure how this is working at all =\

Could you either talk to :ahal about the desired behavior here, and/or keep an eye on this on Cedar after landing on default? Both would be preferable.

>diff --git a/scripts/marionette.py b/scripts/marionette.py
<snip>
>         if self.config.get('gaiatest'):
>-            dirs = self.query_abs_dirs()
>-            self.install_module(module='gaia-ui-tests',
>-                                module_url=dirs['abs_gaiatest_dir'],
>-                                install_method='pip')
>+            self.register_virtualenv_module('gaia-ui-tests',
>+                url=self.query_abs_dirs(), method='pip')

I think the url is wrong -- you want self.query_abs_dirs()['abs_gaiatest_dir']

If the intention was to remove all overrides of create_virtualenv() and their respective super() calls, there's another one in mozharness.mozilla.testing.talos.Talos. I think it should still work without refactoring if that wasn't your intention, though.
Attachment #773759 - Flags: review?(aki) → review+
Comment on attachment 773769 [details] [diff] [review]
Part 3: Record system resource usage of scripts

>diff --git a/mozharness/base/python.py b/mozharness/base/python.py
<snip>
>+    def activate_virtualenv(self):
>+        """Import the virtualenv's packages into this Python interpreter."""
>+        bin_dir = os.path.dirname(self.query_python_path())
>+        activate = os.path.join(bin_dir, 'activate_this.py')
>+        execfile(activate, dict(__file__=activate))

Nice. I did some testing; looks like this should be fine on windows.

>+    def __init__(self, *args, **kwargs):
>+        super(ResourceMonitoringMixin, self).__init__(*args, **kwargs)
>+
>+        self.register_virtualenv_module('mozsystemmonitor', method='pip')

If we foresee needing a specific version (or url, for testing a specific package) of mozsystemmonitor, and not just use the latest available, we may want to allow for querying self.config for this. I'm ok waiting til that's actually a need, though.

>+    @PreScriptRun
>+    def _resorce_record_pre_run(self):
>+        self._times = [('run_start', time.time())]

SP? Should work either way, but s,resorce,resource,

ScriptFactory does some setup work before we get to launching the mozharness script, so this won't reflect the start time of the job. I previously thought about hardcoding something like this at the beginning of run(), but it makes sense to add these via the hooks if desired, now that they're available.

>+        def log_usage(prefix, duration, cpu_percent, cpu_times, io):
>+            message = '{prefix} - Wall time: {duration:.0f}s; ' \
>+                'CPU: {cpu_percent: .0f}%; ' \
>+                'Read bytes: {io_read_bytes}; Write bytes: {io_write_bytes}; ' \
>+                'Read time: {io_read_time}; Write time: {io_write_time}'

I could see wanting to make these messages configurable, but I'm happy for that to wait til we have a real use case for it.

>+        log_usage('Total resource usage', duration, cpu_percent, cpu_times, io)
>+        #self.info('TinderboxPrint: CPU usage: %.0f%%' % cpu_percent)

I think this is the only Mozilla-specific thing, and it's only a commented-out log message, so it's fairly innocuous. I think we should (eventually):

* rename TinderboxPrint as it's now a misnomer (will require changing lots and lots of code; maybe it's easiest to add a new magic string in TBPL and let the old one die over time), and
* allow for turning this on or off.

The above comment about making the messages configurable would, in fact, allow for us to add a TinderboxPrint (or its future equivalent) without hardcoding it in the non-Mozilla-specific mozharness.base area. Then again, a generic magic string might allow us to hardcode it without it being a big deal.
Attachment #773769 - Flags: feedback?(aki) → feedback+
Andrew: Please see review comments for part 2.
Flags: needinfo?(ahalberstadt)
(In reply to Aki Sasaki [:aki] from comment #26)
> >diff --git a/scripts/marionette.py b/scripts/marionette.py
> <snip>
> >         if self.config.get('gaiatest'):
> >-            dirs = self.query_abs_dirs()
> >-            self.install_module(module='gaia-ui-tests',
> >-                                module_url=dirs['abs_gaiatest_dir'],
> >-                                install_method='pip')
> >+            self.register_virtualenv_module('gaia-ui-tests',
> >+                url=self.query_abs_dirs(), method='pip')
>
> I think the url is wrong -- you want
> self.query_abs_dirs()['abs_gaiatest_dir']

Good catch!

> If the intention was to remove all overrides of create_virtualenv() and
> their respective super() calls, there's another one in
> mozharness.mozilla.testing.talos.Talos. I think it should still work
> without refactoring if that wasn't your intention, though.

Initially it was my intention, yes. However, there was a little too much muckery in this method and I felt like it was too fragile for me to tinker with as part of this bug :)

May I propose a comment indicating a future enhancement to use the PreScriptAction hooks?
Attached patch Part 3: Record system resource usage of scripts (obsolete) (deleted) — Splinter Review
I removed recording of run start and finish times because they weren't being used. I fixed a minor formatting issue with the log output. I pinned the versions of mozsystemmonitor and psutil to known working versions.

I decided against some of the other features I described, notably TBPL printing. I'd like to get this rolled out first; then we can add new features later.

One thing I'd like to add: some machines may fail to create the virtualenv because psutil needs to compile C for a binary Python extension. If we're lucky, it will just work. If not, we may need to add a feature to virtualenv population to swallow errors for "optional" packages. Then, I can file bugs against RelEng to support compiled Python extensions on the infrastructure. Or, if we run a PyPI mirror, we may install pre-compiled packages. I guess we'll wait and see.
Attachment #774448 - Flags: review?(aki)
Attachment #773769 - Attachment is obsolete: true
(In reply to Gregory Szorc [:gps] from comment #28)
> Andrew: Please see review comments for part 2.

What exactly do you want my opinion on? What to do if modules and requirements are both defined? Seems like ignoring requirements if modules is defined is somewhat surprising behaviour. I would think installing requirements.txt first and then installing individual modules (which may overwrite whatever was in requirements.txt) would make the most sense.
Flags: needinfo?(ahalberstadt)
Attachment #774448 - Flags: review?(aki) → review+
Depends on: 893254
I don't know if this bug is complete or not, but I merged these over to ash-mozharness and it's causing jobs to go red there: https://tbpl.mozilla.org/php/getParsedLog.php?id=25242974&tree=Ash&full=1
Backed out part 3 because psutil isn't installable everywhere, and pushed a trivial fix for pieces not calling super().__init__.

https://hg.mozilla.org/build/mozharness/rev/03ea44fd7a74
https://hg.mozilla.org/build/mozharness/rev/c2563db9d324
This trivial patch enables support for optional Python packages.
Attachment #775021 - Flags: review?(aki)
Comment on attachment 775021 [details] [diff] [review]
Part 3: Support optional Python package in virtualenv

>         if self.run_command(command,
>                             error_list=VirtualenvErrorList,
>-                            cwd=dirs['abs_work_dir']) != 0:
>+                            cwd=dirs['abs_work_dir']) != 0 and not optional:
>             self.fatal("Unable to install %s!" % module_url)

Maybe a self.info() or self.warning() if it fails, but the package is optional?
Attachment #775021 - Flags: review?(aki) → review+
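(With the review suggestion folded in, the install path might read as follows; this paraphrases the quoted hunk rather than reproducing the exact landed change.)

if self.run_command(command,
                    error_list=VirtualenvErrorList,
                    cwd=dirs['abs_work_dir']) != 0:
    if optional:
        # Leave a trace in the log, but don't kill the job over a
        # package we can live without.
        self.warning("Unable to install optional package %s." % module_url)
    else:
        self.fatal("Unable to install %s!" % module_url)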
psutil and mozsystemmonitor are now optional packages. Ideally, mozsystemmonitor would be required. However, I /think/ that if the psutil install fails, mozsystemmonitor will attempt to install psutil via its own dependencies and this will cause mozsystemmonitor to fail as well. It's easy enough to work around mozsystemmonitor not being present, however.
Attachment #775025 - Flags: review?(aki)
Attachment #774448 - Attachment is obsolete: true
Comment on attachment 775025 [details] [diff] [review]
Part 4: Record system resource usage of scripts

I think this will work. I am trying to get Cedar green before EOD, so there isn't any risk of a mozharness merge on Monday breaking everything, so it's up to you if you think you want to hold off landing or not.
Attachment #775025 - Flags: review?(aki) → review+
Not sure what this is:
https://tbpl.mozilla.org/php/getParsedLog.php?id=25246306&tree=Cedar&full=1

17:30:52 ERROR - Exception during post-run listener: Traceback (most recent call last):
17:30:52 ERROR -   File "/builds/slave/test/scripts/mozharness/base/script.py", line 1014, in run
17:30:52 ERROR -     method()
17:30:52 ERROR -   File "/builds/slave/test/scripts/mozharness/base/python.py", line 419, in _resource_record_post_run
17:30:52 ERROR -     self._log_resource_usage()
17:30:52 ERROR -   File "/builds/slave/test/scripts/mozharness/base/python.py", line 453, in _log_resource_usage
17:30:52 ERROR -     log_usage(phase, end_time - start_time, cpu_percent, cpu_times, io)
17:30:52 ERROR -   File "/builds/slave/test/scripts/mozharness/base/python.py", line 443, in log_usage
17:30:52 ERROR -     io_write_time=io.write_time))
17:30:52 ERROR - ValueError: Unknown format code 'f' for object of type 'str'
17:30:52 FATAL - Aborting due to failure in post-run listener.
17:30:52 FATAL - Exiting -1
I think the entire B2G Emu (VM) opt set of jobs don't like patch 4...
(In reply to Aki Sasaki [:aki] from comment #41)
> I think the entire B2G Emu (VM) opt set of jobs don't like patch 4...

That's a very peculiar error! It smells like a bug in mozsystemmonitor. I pushed a bustage fix that traps this ValueError. I'm optimistic this will be a sufficient band-aid.

https://hg.mozilla.org/build/mozharness/rev/69e449b4b26f
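(The band-aid presumably wraps the formatting call, roughly like this; a guess at the shape of rev 69e449b4b26f, not a copy of it.)

try:
    log_usage(phase, end_time - start_time, cpu_percent, cpu_times, io)
except ValueError:
    # mozsystemmonitor occasionally hands back strings where numbers are
    # expected (the "Unknown format code 'f'" above); don't let a logging
    # hiccup turn a green job red.
    self.warning("Could not log resource usage for %s" % phase)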
Blocks: 893388
Blocks: 893391
mozharness logging was throwing on the last bustage fix. I removed some of the verbose on-exception logging.

https://hg.mozilla.org/build/mozharness/rev/314de25615a9
Merged to mozharness production.
No longer depends on: 893254
Blocks: 895388
Blocks: 896037
Blocks: 896718
Blocks: 900070
Product: mozilla.org → Release Engineering
This is in production. Not sure why it wasn't resolved.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 917346
Comment on attachment 8370239 [details] [diff] [review]
[puppet] collect disk statistics for EC2 instances

Review of attachment 8370239 [details] [diff] [review]:
-----------------------------------------------------------------

r=dividehex (irc)
Attachment #8370239 - Flags: review?(dustin) → review+
Comment on attachment 8370239 [details] [diff] [review]
[puppet] collect disk statistics for EC2 instances

https://hg.mozilla.org/build/puppet/rev/5f54cb661801
Attachment #8370239 - Flags: checked-in+
Puppet change live in production (I'm CCing myself).
Component: General Automation → General