Closed Bug 1121571 Opened 9 years ago Closed 9 years ago

Improve Talos profiling

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: mstange, Assigned: mstange)

References

Details

Attachments

(19 files, 19 obsolete files)

(deleted), patch
jmaher
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Yoric
: review+
Details | Diff | Splinter Review
(deleted), patch
jmaher
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
jmaher
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
jmaher
: review+
Details | Diff | Splinter Review
(deleted), patch
jmaher
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
jmaher
: review+
Details | Diff | Splinter Review
(deleted), patch
Details | Diff | Splinter Review
(deleted), patch
jmaher
: review+
Details | Diff | Splinter Review
      No description provided.
Depends on: 1121576
Depends on: 1121591
The problems with the current talos profiling setup are:
 - We dump exactly one profile per browser run, which contains all tested pages. It would be much better to have a separate profile per tested page.
 - Profiles don't have high accuracy. Since each profile spans such a large time frame (the whole browser run), the sampling interval is often set to 10ms so that the resulting profile file still has a manageable size. A lower sampling interval would result in much more useful profiles.
 - Many resulting profile files are truncated because we dump the profile during shutdown and kill the browser if shutdown takes more than 5 seconds.
 - Profiles aren't symbolicated, they contain raw addresses.
 - Many tests don't pause/resume the profiler properly, so the resulting profile contains stuff that Talos didn't measure.
 - The Compositor thread isn't profiled.
Attachment #8549223 - Flags: review?(jmaher)
Attachment #8549226 - Flags: review?(jmaher)
Attached patch 07-Bug_1121571___Fix_tresize_profiling_.diff (obsolete) (deleted) — Splinter Review
Attachment #8549227 - Flags: review?(jmaher)
Attachment #8549229 - Flags: review?(jmaher)
Attachment #8549230 - Flags: review?(jmaher)
Attachment #8549231 - Flags: review?(jmaher)
Attachment #8549232 - Flags: review?(jmaher)
Attachment #8549234 - Flags: review?(jmaher)
Attachment #8549235 - Flags: review?(jmaher)
Attachment #8549236 - Flags: review?(jmaher)
I've fired off a try push here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=8e8db0eec687

Previous pushes with older patch sets:
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=b54a23073798
https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=020ae04cbd73
(Windows 8 red is because unwinding on Windows 64 bit is broken at the moment)
Depends on: 1109862
(In reply to Markus Stange [:mstange] from comment #17)
> (Windows 8 red is because unwinding on Windows 64 bit is broken at the
> moment)

... and now it has been fixed by bug 1120126! W00t!
Depends on: 1120126
Comment on attachment 8549221 [details] [diff] [review]
01-Bug_1121571___Explicitly_call_out_shutdown_timeout_and_print_a_message_for_shutdown_termination_.diff

Review of attachment 8549221 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!
Attachment #8549221 - Flags: review?(jmaher) → review+
Comment on attachment 8549222 [details] [diff] [review]
02-Bug_1121571___Create_a_directory_for_profiles__don_t_use_startup_profiling_and_change_the_profile_zip_layout_.diff

Review of attachment 8549222 [details] [diff] [review]:
-----------------------------------------------------------------

it appears we are not turning on the profiler, or we don't have a need for these variables.

::: talos/ttest.py
@@ +329,5 @@
> +                sps_profile_dir = tempfile.mkdtemp();
> +
> +                sps_profile_interval = test_config.get('sps_profile_interval', 1)
> +                sps_profile_entries = test_config.get('sps_profile_entries', 1000000)
> +                sps_profile_threads = 'GeckoMain,Compositor'

do we ever use these variables?  Before we applied them to environment variables which are removed with this patch and now not used.
Attachment #8549222 - Flags: review?(jmaher) → review-
Comment on attachment 8549223 [details] [diff] [review]
03-Bug_1121571___Add_Profiler_js_capabilities_.diff

Review of attachment 8549223 [details] [diff] [review]:
-----------------------------------------------------------------

I would like to ensure we safeguard these values.

::: talos/page_load_test/tart/addon/content/Profiler.js
@@ +74,5 @@
> +        return;
> +      }
> +      profiler_interval = obj.sps_profile_interval;
> +      profiler_entries = obj.sps_profile_entries;
> +      profiler_threadsArray = obj.sps_profile_threads.split(",");

what happens if these values don't exist?  also we are doing a split on sps_profile_threads, so that assumes it exists and it is a string.
Attachment #8549223 - Flags: review?(jmaher) → review-
Comment on attachment 8549224 [details] [diff] [review]
04-Bug_1121571___Pass_profiling_information_to_pageloader_and_make_pageloader_dump_profiles_.diff

Review of attachment 8549224 [details] [diff] [review]:
-----------------------------------------------------------------

keep in mind pageloader.js was just cleaned up yesterday (2015-01-14) to run through e10s code paths all the time and remove regular code paths.  I am not sure if that will affect this patch for bitrot or functionality.
Attachment #8549224 - Flags: review?(jmaher) → review+
Comment on attachment 8549222 [details] [diff] [review]
02-Bug_1121571___Create_a_directory_for_profiles__don_t_use_startup_profiling_and_change_the_profile_zip_layout_.diff

Review of attachment 8549222 [details] [diff] [review]:
-----------------------------------------------------------------

I see how you use these in a future patch.  I would have preferred that you put it all in the same patch, but as it stands, this is just fine.
Attachment #8549222 - Flags: review- → review+
Comment on attachment 8549225 [details] [diff] [review]
05-Bug_1121571___Pass_profiling_info_to_non_pageloader_tests_via_URL_params_.diff

Review of attachment 8549225 [details] [diff] [review]:
-----------------------------------------------------------------

looks good!
Attachment #8549225 - Flags: review?(jmaher) → review+
Comment on attachment 8549226 [details] [diff] [review]
06-Bug_1121571___Fix_tp5o_scroll_profiling_.diff

Review of attachment 8549226 [details] [diff] [review]:
-----------------------------------------------------------------

thanks.

::: talos/pageloader/chrome/pageloader.js
@@ +209,5 @@
>                         "  }" +
>                         "} " +
>                         "addEventListener('load', _contentLoadHandler, true); ";
>                       content.selectedBrowser.messageManager.loadFrameScript(contentScript, false, true);
> +                     content.selectedBrowser.messageManager.loadFrameScript("chrome://pageloader/content/Profiler.js", true, true);

our other scripts are loaded as "false, true", but this is loaded with "true, true".  Is there a reason for this?
Attachment #8549226 - Flags: review?(jmaher) → review+
Comment on attachment 8549227 [details] [diff] [review]
07-Bug_1121571___Fix_tresize_profiling_.diff

Review of attachment 8549227 [details] [diff] [review]:
-----------------------------------------------------------------

a few nits below, r+ as they are minor.

::: talos/generate-tresize-xpi.html
@@ +15,5 @@
>      var files = [
>        "chrome.manifest",
>        "install.rdf",
>        "content/framescript.js",
> +      "content/Profiler.js"

you are missing a trailing , here

::: talos/startup_test/tresize/addon/content/tresize.overlay.xul
@@ +5,1 @@
>  <script type="application/x-javascript" src="tresize.js" />

this specifically calls out language="Javascript" and add a </script> tag, this is differethen than the other <script...> tags in this file, why?
Attachment #8549227 - Flags: review?(jmaher) → review+
Comment on attachment 8549229 [details] [diff] [review]
08-Bug_1121571___Profile_startup_using_profiler_env_vars_.diff

Review of attachment 8549229 [details] [diff] [review]:
-----------------------------------------------------------------

r- as I would like to learn more about the url parameters vs environment variables for sps profiling knobs.

::: talos/startup_test/sessionrestore/main.js
@@ +15,2 @@
>  
> +    var duration = startup_info.sessionRestored - startup_info.sessionRestoreInit;

is there any concern of altering these numbers in general while profiling is not enabled?

::: talos/ttest.py
@@ +357,5 @@
> +                        'MOZ_PROFILER_STARTUP': '1',
> +                        'MOZ_PROFILER_INTERVAL': str(sps_profile_interval),
> +                        'MOZ_PROFILER_ENTRIES': str(sps_profile_entries),
> +                        "MOZ_PROFILER_THREADS": str(sps_profile_threads)
> +                    }

is there an advantage of doing this vs sending params into pageloader?  This is starting to get confusing.
Attachment #8549229 - Flags: review?(jmaher) → review-
Attachment #8549230 - Flags: review?(jmaher) → review+
Comment on attachment 8549231 [details] [diff] [review]
10-Bug_1121571___Print_some_debug_messages_.diff

Review of attachment 8549231 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/ttest.py
@@ +417,5 @@
>                                                                  profile_dir,
>                                                                  test_config['url'],
>                                                                  profiling_info=profiling_info)
>  
> +                utils.info("command_args: {0}".format(command_args))

I would like to see this be a bit more specific, maybe:
utils.info("browser command line args: ...")
Attachment #8549231 - Flags: review?(jmaher) → review+
Comment on attachment 8549232 [details] [diff] [review]
11-Bug_1121571___Increase_the_timeout_when_profiling_.diff

Review of attachment 8549232 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/ttest.py
@@ +411,5 @@
>                  # Run the test
>                  timeout = test_config.get('timeout', 7200) # 2 hours default
> +                if sps_profile:
> +                    # When profiling, give the browser plenty of extra time to dump the profile.
> +                    timeout += 3600

an extra hour?  Why not an extra 5 minutes?
Attachment #8549232 - Flags: review?(jmaher) → review-
Comment on attachment 8549233 [details] [diff] [review]
12-Bug_1121571___Add_more_granular_profiler_pausing_for_many_tests_.diff

Review of attachment 8549233 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/page_load_test/v8_7/base.js
@@ +212,5 @@
>  // average time it takes to run a single iteration.
>  BenchmarkSuite.prototype.RunSingleBenchmark = function(benchmark, data) {
>    function Measure(data) {
>      var elapsed = 0;
> +    Profiler.resume(benchmark.name);

are you sure this 'benchmark.name' will name sense?
Attachment #8549233 - Flags: review?(jmaher) → review+
Comment on attachment 8549234 [details] [diff] [review]
13-Bug_1121571___Do_symbolication_right_on_the_talos_machine_.diff

Review of attachment 8549234 [details] [diff] [review]:
-----------------------------------------------------------------

is there anyway we can have sym*.py as a published python module, then we don't have to add the code.  At the very least this looks as if it has existed elsewhere for a while- we should reference where it comes from.

also for sps.py, please make it 4 space indentation.  Quite possibly sps.py comes from the same source as the other sym*.py files, maybe it can be rolled into a package as well?

::: talos/sps.py
@@ +1,4 @@
> +try:
> +    import simplejson as json
> +except ImportError:
> +    import json

now that we run on python 2.7, this isn't needed anymore

@@ +15,5 @@
> +        if startMeasurementMarker in marker["name"]:
> +          in_measurement = True
> +        if stopMeasurementMarker in marker["name"]:
> +          in_measurement = False
> +    if True or in_measurement:

this if statement seems odd, why not just "if in_measurement" ?

::: talos/symbolicationRequest.py
@@ +101,5 @@
> +          return
> +        self.forwardCount = rawRequests["forwarded"]
> +
> +      # Only used for compatibility with older clients.
> +      # TODO: Remove after June 2013.

I assume this code is copied from somewhere, this could have been removed 1.5 years ago!

::: talos/ttest.py
@@ +541,5 @@
> +                    symbolicator = symbolication.ProfileSymbolicator({
> +                        # Trace-level logging (verbose)
> +                        "enableTracing": 0,
> +                        # Fallback server if symbol is not found locally
> +                        "remoteSymbolServer": "http://symbolapi.mozilla.org:80/",

I don't know if we will have access to this from the talos machine
Attachment #8549234 - Flags: review?(jmaher) → review-
Comment on attachment 8549235 [details] [diff] [review]
14-Bug_1121571___Read_symbols_from_system_libraries_on_OS_X_.diff

Review of attachment 8549235 [details] [diff] [review]:
-----------------------------------------------------------------

the ttest.py changes are file, I see what you are doing in symbolication.py, although my previous review to have it in a separate package would be preferred.
Attachment #8549235 - Flags: review?(jmaher) → review+
Comment on attachment 8549236 [details] [diff] [review]
15-Bug_1121571___Don_t_resume_profiling_at_the_end_of_TART_.diff

Review of attachment 8549236 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!
Attachment #8549236 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #22)
> Comment on attachment 8549223 [details] [diff] [review]
> 03-Bug_1121571___Add_Profiler_js_capabilities_.diff
> 
> Review of attachment 8549223 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I would like to ensure we safeguard these values.
> 
> ::: talos/page_load_test/tart/addon/content/Profiler.js
> @@ +74,5 @@
> > +        return;
> > +      }
> > +      profiler_interval = obj.sps_profile_interval;
> > +      profiler_entries = obj.sps_profile_entries;
> > +      profiler_threadsArray = obj.sps_profile_threads.split(",");
> 
> what happens if these values don't exist?  also we are doing a split on
> sps_profile_threads, so that assumes it exists and it is a string.

Good point, I'll add some validation.

(In reply to Joel Maher (:jmaher) from comment #23)
> Comment on attachment 8549224 [details] [diff] [review]
> 04-
> Bug_1121571___Pass_profiling_information_to_pageloader_and_make_pageloader_du
> mp_profiles_.diff
> 
> Review of attachment 8549224 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> keep in mind pageloader.js was just cleaned up yesterday (2015-01-14) to run
> through e10s code paths all the time and remove regular code paths.  I am
> not sure if that will affect this patch for bitrot or functionality.

The patch is already based on that updated version, and everything's working correctly. :-)

(In reply to Joel Maher (:jmaher) from comment #24)
> Comment on attachment 8549222 [details] [diff] [review]
> 02-
> Bug_1121571___Create_a_directory_for_profiles__don_t_use_startup_profiling_an
> d_change_the_profile_zip_layout_.diff
> 
> Review of attachment 8549222 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I see how you use these in a future patch.  I would have preferred that you
> put it all in the same patch, but as it stands, this is just fine.

Sorry, yeah, should have reshuffled a little.

(In reply to Joel Maher (:jmaher) from comment #26)
> Comment on attachment 8549226 [details] [diff] [review]
> 06-Bug_1121571___Fix_tp5o_scroll_profiling_.diff
> 
> Review of attachment 8549226 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> thanks.
> 
> ::: talos/pageloader/chrome/pageloader.js
> @@ +209,5 @@
> >                         "  }" +
> >                         "} " +
> >                         "addEventListener('load', _contentLoadHandler, true); ";
> >                       content.selectedBrowser.messageManager.loadFrameScript(contentScript, false, true);
> > +                     content.selectedBrowser.messageManager.loadFrameScript("chrome://pageloader/content/Profiler.js", true, true);
> 
> our other scripts are loaded as "false, true", but this is loaded with
> "true, true".  Is there a reason for this?

Hmm, I must have misunderstood the meaning of that argument when I wrote this part. It should work just fine with false. I'll change it. (The docs say: "If true, this flag means that the frame script will be loaded into any new frames opened after the loadFrameScript() call, until removeDelayedFrameScript() is called for that script.")

(In reply to Joel Maher (:jmaher) from comment #27)
> Comment on attachment 8549227 [details] [diff] [review]
> 07-Bug_1121571___Fix_tresize_profiling_.diff
> 
> Review of attachment 8549227 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> a few nits below, r+ as they are minor.
> 
> ::: talos/generate-tresize-xpi.html
> @@ +15,5 @@
> >      var files = [
> >        "chrome.manifest",
> >        "install.rdf",
> >        "content/framescript.js",
> > +      "content/Profiler.js"
> 
> you are missing a trailing , here

will fix

> 
> ::: talos/startup_test/tresize/addon/content/tresize.overlay.xul
> @@ +5,1 @@
> >  <script type="application/x-javascript" src="tresize.js" />
> 
> this specifically calls out language="Javascript" and add a </script> tag,
> this is differethen than the other <script...> tags in this file, why?

I must have copied it from somewhere else. But you're right, since this is a xul file, we can use a self-closing tag (<script/>) here. I'll update it to match the other one.

(In reply to Joel Maher (:jmaher) from comment #28)
> Comment on attachment 8549229 [details] [diff] [review]
> 08-Bug_1121571___Profile_startup_using_profiler_env_vars_.diff
> 
> Review of attachment 8549229 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r- as I would like to learn more about the url parameters vs environment
> variables for sps profiling knobs.
> 
> ::: talos/startup_test/sessionrestore/main.js
> @@ +15,2 @@
> >  
> > +    var duration = startup_info.sessionRestored - startup_info.sessionRestoreInit;
> 
> is there any concern of altering these numbers in general while profiling is
> not enabled?

Actually, let me split up this patch into one that only shuffles stuff around and one that adds the profiling infrastructure. I think my changes won't change any numbers but I'd like Yoric to confirm.

> 
> ::: talos/ttest.py
> @@ +357,5 @@
> > +                        'MOZ_PROFILER_STARTUP': '1',
> > +                        'MOZ_PROFILER_INTERVAL': str(sps_profile_interval),
> > +                        'MOZ_PROFILER_ENTRIES': str(sps_profile_entries),
> > +                        "MOZ_PROFILER_THREADS": str(sps_profile_threads)
> > +                    }
> 
> is there an advantage of doing this vs sending params into pageloader?  This
> is starting to get confusing.

Yes, it really is confusing. I'll add a comment.
These environment variables are *not* read by talos JS code. Talos JS code doesn't get a chance to run until we're almost finished starting up the browser. Instead, they're read by Gecko code that runs extremely early during startup, and they will cause Gecko to start profiling as soon as it can, so that we'll get a profile that contains as much of startup as possible.

(In reply to Joel Maher (:jmaher) from comment #30)
> Comment on attachment 8549232 [details] [diff] [review]
> 11-Bug_1121571___Increase_the_timeout_when_profiling_.diff
> 
> Review of attachment 8549232 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: talos/ttest.py
> @@ +411,5 @@
> >                  # Run the test
> >                  timeout = test_config.get('timeout', 7200) # 2 hours default
> > +                if sps_profile:
> > +                    # When profiling, give the browser plenty of extra time to dump the profile.
> > +                    timeout += 3600
> 
> an extra hour?  Why not an extra 5 minutes?

I'll have a look through the logs, but 5 minutes are almost certainly not enough. I'll come back to this patch when I have more evidence.

(In reply to Joel Maher (:jmaher) from comment #31)
> Comment on attachment 8549233 [details] [diff] [review]
> 12-Bug_1121571___Add_more_granular_profiler_pausing_for_many_tests_.diff
> 
> Review of attachment 8549233 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: talos/page_load_test/v8_7/base.js
> @@ +212,5 @@
> >  // average time it takes to run a single iteration.
> >  BenchmarkSuite.prototype.RunSingleBenchmark = function(benchmark, data) {
> >    function Measure(data) {
> >      var elapsed = 0;
> > +    Profiler.resume(benchmark.name);
> 
> are you sure this 'benchmark.name' will name sense?

I'll add a sanity check, but this worked without throwing any errors, as far as I can tell. (Looks like Linux is the only platform that ran V8 in my try push, and the profile isn't big enough to contain the complete run...)

(In reply to Joel Maher (:jmaher) from comment #32)
> Comment on attachment 8549234 [details] [diff] [review]
> 13-Bug_1121571___Do_symbolication_right_on_the_talos_machine_.diff
> 
> Review of attachment 8549234 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> is there anyway we can have sym*.py as a published python module, then we
> don't have to add the code.  At the very least this looks as if it has
> existed elsewhere for a while- we should reference where it comes from.
> 
> also for sps.py, please make it 4 space indentation.  Quite possibly sps.py
> comes from the same source as the other sym*.py files, maybe it can be
> rolled into a package as well?
> 
> ::: talos/sps.py
> @@ +1,4 @@
> > +try:
> > +    import simplejson as json
> > +except ImportError:
> > +    import json
> 
> now that we run on python 2.7, this isn't needed anymore

What should I replace it with? Just "import json"? Because "import simplejson" didn't work.

> @@ +15,5 @@
> > +        if startMeasurementMarker in marker["name"]:
> > +          in_measurement = True
> > +        if stopMeasurementMarker in marker["name"]:
> > +          in_measurement = False
> > +    if True or in_measurement:
> 
> this if statement seems odd, why not just "if in_measurement" ?

Oops, this is a leftover from an earlier experiment apparently.

> ::: talos/symbolicationRequest.py
> @@ +101,5 @@
> > +          return
> > +        self.forwardCount = rawRequests["forwarded"]
> > +
> > +      # Only used for compatibility with older clients.
> > +      # TODO: Remove after June 2013.
> 
> I assume this code is copied from somewhere, this could have been removed
> 1.5 years ago!

Indeed! I'll update the original code and import the updated version.

> ::: talos/ttest.py
> @@ +541,5 @@
> > +                    symbolicator = symbolication.ProfileSymbolicator({
> > +                        # Trace-level logging (verbose)
> > +                        "enableTracing": 0,
> > +                        # Fallback server if symbol is not found locally
> > +                        "remoteSymbolServer": "http://symbolapi.mozilla.org:80/",
> 
> I don't know if we will have access to this from the talos machine

It worked in my test runs. I was surprised, too! Should I do anything to make sure that it continues to accessable?

(In reply to Joel Maher (:jmaher) from comment #33)
> Comment on attachment 8549235 [details] [diff] [review]
> 14-Bug_1121571___Read_symbols_from_system_libraries_on_OS_X_.diff
> 
> Review of attachment 8549235 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> the ttest.py changes are file, I see what you are doing in symbolication.py,
> although my previous review to have it in a separate package would be
> preferred.

I'll move the symbolication files into a profiler/ subdirectory, as discussed on IRC.
lets just use 'import json', thanks for the rest of the comments, they all look good!
(In reply to Joel Maher (:jmaher) from comment #29)
> Comment on attachment 8549231 [details] [diff] [review]
> 10-Bug_1121571___Print_some_debug_messages_.diff
> 
> Review of attachment 8549231 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: talos/ttest.py
> @@ +417,5 @@
> >                                                                  profile_dir,
> >                                                                  test_config['url'],
> >                                                                  profiling_info=profiling_info)
> >  
> > +                utils.info("command_args: {0}".format(command_args))
> 
> I would like to see this be a bit more specific, maybe:
> utils.info("browser command line args: ...")

Actually, utils.GenerateBrowserCommandLine already prints this command line, but at the "debug" log level, which I hadn't enabled, so I didn't see it. I'll just remove this change.
> (In reply to Joel Maher (:jmaher) from comment #31)
> > Comment on attachment 8549233 [details] [diff] [review]
> > 12-Bug_1121571___Add_more_granular_profiler_pausing_for_many_tests_.diff
> > 
> > Review of attachment 8549233 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > ::: talos/page_load_test/v8_7/base.js
> > @@ +212,5 @@
> > >  // average time it takes to run a single iteration.
> > >  BenchmarkSuite.prototype.RunSingleBenchmark = function(benchmark, data) {
> > >    function Measure(data) {
> > >      var elapsed = 0;
> > > +    Profiler.resume(benchmark.name);
> > 
> > are you sure this 'benchmark.name' will name sense?
> 
> I'll add a sanity check, but this worked without throwing any errors, as far
> as I can tell.

Actually, I'll leave it as is. There's a Benchmark "class" at the top of that file, and that explicitly contains a name field, and I'm pretty sure all variables with the name "benchmark" in that file are an instance of Benchmark.
I've shuffled some variables around and made the profile zip end in ".sps.zip".
Attachment #8549222 - Attachment is obsolete: true
Now with added type checks.
Attachment #8549223 - Attachment is obsolete: true
Attachment #8550515 - Flags: review?(jmaher)
with ", false, true);"
Attachment #8549224 - Attachment is obsolete: true
Attachment #8549225 - Attachment is obsolete: true
Attachment #8549226 - Attachment is obsolete: true
added the missing comma and changed the <script/> declaration
Attachment #8549227 - Attachment is obsolete: true
Comment on attachment 8550515 [details] [diff] [review]
03-Bug_1121571___Add_Profiler_js_capabilities_.diff

Review of attachment 8550515 [details] [diff] [review]:
-----------------------------------------------------------------

::: talos/page_load_test/tart/addon/content/Profiler.js
@@ +52,5 @@
> +
> +    for (var i in pairs) {
> +      if (pairs[i] !== "") {
> +        var pair = pairs[i].split("=");
> +        result[decodeURIComponent(pair[0])] = decodeURIComponent(pair[1]);

here you assume that pair has a length of 2, most likely it will, either fix it or add a comment
Attachment #8550515 - Flags: review?(jmaher) → review+
This patch does two things:
 1. It makes us not wait for the "load" event. This was just an unneeded delay, as far as I can tell.
 2. It moves the setTimeout(..., 0) into the finish() function so that we have a place in the finish() function that is hit as soon after "sessionstore-windows-restored" as possible.

It shouldn't change any numbers. David, do you agree?
Attachment #8550521 - Flags: review?(dteller)
Just the profiler changes from the old part 8, this time.
Attachment #8550522 - Flags: review?(jmaher)
I had to increase the profile size for v8 and I also increased the resolution for tresize a little.
Attachment #8549229 - Attachment is obsolete: true
Attachment #8549230 - Attachment is obsolete: true
Print just the "activating profiling" message.
Just five minutes this time. The longer timeouts I was seeing are likely caused by deadlocks in the profiler that hang the browser completely. I still need to find out how to fix those.
Attachment #8549231 - Attachment is obsolete: true
Attachment #8549232 - Attachment is obsolete: true
Attachment #8550527 - Flags: review?(jmaher)
I've moved the imported files to a "profiler" subdirectory. I've also simplified the imported files a little, and removed functions we don't use. I've kept the indentation as-is.
I was seeing out-of-memory errors during profile symbolication (specifically in the "profile = json.load(profile_file)" line), so I'm now catching the MemoryError exception.
Attachment #8549234 - Attachment is obsolete: true
Attachment #8550529 - Flags: review?(jmaher)
Attachment #8549235 - Attachment is obsolete: true
Attachment #8550531 - Attachment is obsolete: true
This error occurred in some of the Windows runs: "The process cannot access the file because it is being used by another process: 'c:\\docume~1\\cltbld~1.t-x\\locals~1\\temp\\tmpv0wi28\\page_0.sps'"
It happened in the "d" and "s" tests of this try push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a819e4547e25

I don't know what's happening here. The browser is the only process that knows of that file, and it should absolutely be closed by now. Maybe you know what's going on?
Attachment #8550535 - Flags: review?(jmaher)
There were a few tp timeouts, and I still haven't figured out their cause. In any case, dumping profiles on timeout will make this easier to track down.

Here's an example log (search for "__FAILTimeout in tp5n__FAIL"):
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/mstange@themasta.com-44c6445d8121/try-macosx64/try_mountainlion_test-tp5o-bm108-tests1-macosx-build0.txt.gz
Attachment #8550536 - Flags: review?(jmaher)
This might have unintended side effects. Any ideas?
Attachment #8550537 - Flags: review?(jmaher)
Comment on attachment 8550535 [details] [diff] [review]
17-Bug_1121571___Retry_profile_rmtree_for_five_minutes_.diff

This didn't fix the issue:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e7b17651136c
Attachment #8550535 - Flags: review?(jmaher)
Comment on attachment 8550537 [details] [diff] [review]
19-Bug_1121571___Continue_even_if_test_results_add_throws_an_error_.diff

Review of attachment 8550537 [details] [diff] [review]:
-----------------------------------------------------------------

a few random chances of failure in the future, but if we get a failed case to start with the job will be red/orange anyway, so forcing us to collect more information is a win.
Attachment #8550537 - Flags: review?(jmaher) → review+
Comment on attachment 8550536 [details] [diff] [review]
18-Bug_1121571___Also_dump_the_profile_from_pageloader_if_it_times_out_.diff

Review of attachment 8550536 [details] [diff] [review]:
-----------------------------------------------------------------

good find.
Attachment #8550536 - Flags: review?(jmaher) → review+
Comment on attachment 8550535 [details] [diff] [review]
17-Bug_1121571___Retry_profile_rmtree_for_five_minutes_.diff

Review of attachment 8550535 [details] [diff] [review]:
-----------------------------------------------------------------

I would like to see this done in general and address my question.

::: talos/ttest.py
@@ +622,5 @@
> +                # For some reason, on Windows, big profiles are sometimes locked
> +                # by another process even after all Firefox processes have been
> +                # terminated. Allow up to 5 minutes for the file lock to be
> +                # released.
> +                utils.rmtree_until_timeout(sps_profile_dir, 5 * 60)

fwiw, we see this on windows in general, could you apply this to profile cleanup overall?

::: talos/utils.py
@@ +279,5 @@
> +            return
> +        except Exception as e:
> +            if elapsed >= timeout:
> +                raise e
> +            time.sleep(1)

this assumes that shutl.rmtree will hit an exception in a few milliseconds, isn't it possible that we could hit the exception after a few seconds due to a locked file or some type of timeout?
Attachment #8550535 - Flags: review-
Comment on attachment 8550529 [details] [diff] [review]
14-Bug_1121571___Do_symbolication_right_on_the_talos_machine_.diff

Review of attachment 8550529 [details] [diff] [review]:
-----------------------------------------------------------------

I would like some indication of the root of these files and a mpl at the top.  Ideally they would live in a profiling/, sps/, or the existing scripts/ subdir instead of the root dir.  If they were in their own subdir, it would be easy to create a readme.txt which could point to the github root.

I am fine with the files, not with the locations- a few nits below.

::: talos/profiler/symFileManager.py
@@ +201,5 @@
> +            #print "Can't fit " + pdbName + "/" + pdbId
> +            break
> +          fetchedSymbols[(pdbName, pdbId)] = symbolInfo
> +          fetchedCount += symbolInfo.GetEntryCount()
> +          #print "fetchedCount = " + str(fetchedCount) + "  after " + pdbName + "/" + pdbId

we have a lot of print statements commented out, can we remove these?

::: talos/ttest.py
@@ +590,5 @@
> +                            except MemoryError as e:
> +                                utils.info("Ran out of memory while trying to symbolicate profile")
> +                            except Exception as e:
> +                                utils.info(e)
> +                                utils.info("Encountered an exception during profile symbolication")

could we be more explicit about which test we were not able to get sps profiles for in the error message?
Attachment #8550529 - Flags: review?(jmaher) → review-
Comment on attachment 8550527 [details] [diff] [review]
12-Bug_1121571___Increase_the_timeout_when_profiling_.diff

Review of attachment 8550527 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!
Attachment #8550527 - Flags: review?(jmaher) → review+
Comment on attachment 8550522 [details] [diff] [review]
09-Bug_1121571___Profile_startup_using_profiler_env_vars_.diff

Review of attachment 8550522 [details] [diff] [review]:
-----------------------------------------------------------------

thanks!
Attachment #8550522 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #63)
> Comment on attachment 8550535 [details] [diff] [review]
> 17-Bug_1121571___Retry_profile_rmtree_for_five_minutes_.diff

> ::: talos/ttest.py
> @@ +622,5 @@
> > +                # For some reason, on Windows, big profiles are sometimes locked
> > +                # by another process even after all Firefox processes have been
> > +                # terminated. Allow up to 5 minutes for the file lock to be
> > +                # released.
> > +                utils.rmtree_until_timeout(sps_profile_dir, 5 * 60)
> 
> fwiw, we see this on windows in general, could you apply this to profile
> cleanup overall?

Sure. Not sure it's going to work, though.

> ::: talos/utils.py
> @@ +279,5 @@
> > +            return
> > +        except Exception as e:
> > +            if elapsed >= timeout:
> > +                raise e
> > +            time.sleep(1)
> 
> this assumes that shutl.rmtree will hit an exception in a few milliseconds,

time.sleep accepts seconds, not milliseconds. I'll add a comment about that.
Does that clear up the confusion, or is there something else that's unclear or that I'm missing? The loop doesn't really care how long rmtree takes until it throws an exception. It waits for it to either succeed or throw, no matter how long it takes to do that. And if it throws, we discard the exception, until we've seen |timeout| many exceptions. And we wait for one second between tries.

> isn't it possible that we could hit the exception after a few seconds due to
> a locked file or some type of timeout?

Apparently. But in my latest push, a timeout of 5 minutes didn't help. I'll try increasing the timeout, maybe I just haven't waited long enough.
(In reply to Joel Maher (:jmaher) from comment #64)
> Comment on attachment 8550529 [details] [diff] [review]
> 14-Bug_1121571___Do_symbolication_right_on_the_talos_machine_.diff
> 
> Review of attachment 8550529 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I would like some indication of the root of these files and a mpl at the
> top.

Good point.

> Ideally they would live in a profiling/, sps/, or the existing
> scripts/ subdir instead of the root dir.

They're in the "profiler" dir in this patch, not in the root dir.

> If they were in their own subdir,
> it would be easy to create a readme.txt which could point to the github root.

I'll add one.

> ::: talos/profiler/symFileManager.py
> @@ +201,5 @@
> > +            #print "Can't fit " + pdbName + "/" + pdbId
> > +            break
> > +          fetchedSymbols[(pdbName, pdbId)] = symbolInfo
> > +          fetchedCount += symbolInfo.GetEntryCount()
> > +          #print "fetchedCount = " + str(fetchedCount) + "  after " + pdbName + "/" + pdbId
> 
> we have a lot of print statements commented out, can we remove these?

Sure.

> ::: talos/ttest.py
> @@ +590,5 @@
> > +                            except MemoryError as e:
> > +                                utils.info("Ran out of memory while trying to symbolicate profile")
> > +                            except Exception as e:
> > +                                utils.info(e)
> > +                                utils.info("Encountered an exception during profile symbolication")
> 
> could we be more explicit about which test we were not able to get sps
> profiles for in the error message?

Good idea.
Comment on attachment 8550521 [details] [diff] [review]
08-Bug_1121571___Call_finish___immediately_when_sessionstore_windows_restored_fires_.diff

Review of attachment 8550521 [details] [diff] [review]:
-----------------------------------------------------------------

I don't see any reason why this wouldn't work. So r=me.
Attachment #8550521 - Flags: review?(dteller) → review+
Attachment #8550515 - Attachment is obsolete: true
Attachment #8550529 - Attachment is obsolete: true
Attachment #8551437 - Flags: review?(jmaher)
New push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a40998427912
This one has an updated mozilla-central base. I hope it won't hit bug 1122947 during the Talos runs.
Blocks: 1123446
Attachment #8551437 - Flags: review?(jmaher) → review+
Comment on attachment 8551439 [details] [diff] [review]
17-Bug_1121571___Retry_rmtree_for_twenty_minutes__in_case_the_file_is_still_locked_by_another_process_.diff

Review of attachment 8551439 [details] [diff] [review]:
-----------------------------------------------------------------

this looks good.  My reservations about the times were misdirected.
Attachment #8551439 - Flags: review?(jmaher) → review+
I'm going to land all of these patches now. The remaining issues can be fixed in follow-up bugs. It's already strictly an improvement over the old state.
Depends on: 1123852
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Depends on: 1149364
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: