Closed Bug 776947 Opened 12 years ago Closed 10 years ago

ignore running processes in talos when you are in --develop mode

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: ankit.goyal90, Mentored)

References

Details

(Whiteboard: [good next bug][lang=python])

Attachments

(2 files, 4 obsolete files)

this frustrates everybody who runs talos.  There is no reason that we need to have no other processes running that match our process name (i.e. firefox).

If we are running in --develop mode, this means we are working on a local system and trying to add a test or debug a test.  We really don't care about reproducible numbers at this stage (that is what try server is for).
This *should* be pretty easy to ignore.  That said, I have noticed in semi-accidental experimentation that if one does have a browser running, Talos often becomes very confused and can do horrible things like deleting your session history of the browser instance that it did not start.  So while ignoring running processes is pretty easy, I'm guessing there's other process management stuff in talos that should probably be fixed along with this (or we could just move to mozprocess/mozrunner)
(In reply to Jeff Hammel [:jhammel] from comment #1)
> Talos often becomes very confused and can do horrible things like deleting
> your session history of the browser instance that it did not start.

!!! Yikes!

/me reverts his local change to talos/ttest.py and decides to close his main Firefox after all
Heh, well I wasn't meaning to sound alarmist, though I have seen this (and haven't had time to track down what exactly is happening).  FWIW, I haven't seen it do anything to the (non-talos) profile other than killing the history and tabs of the active session.
(In reply to Joel Maher (:jmaher) from comment #0)
> this frustrates everybody who runs talos.  There is no reason that we need
> to have no other processes running that match our process name (i.e.
> firefox).
> 
> If we are running in --develop mode, this means we are working on a local
> system and trying to add a test or debug a test.  We really don't care about
> reproducible numbers at this stage (that is what try server is for).

Also note that on Android only one instance of a given application may be running at a particular time, so we'll still want to check for that condition on that particular platform (so we can give an informative error).
There are a lot of little changes to make this work.  This patch takes care of the remote scenario where we still have the solo process requirement.
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #645349 - Flags: review?(jhammel)
Comment on attachment 645349 [details] [diff] [review]
allow for solo_process mode to be toggled with --develop (1.0)

I'm not a big fan of passing around solo_process everywhere.  I'd prefer to start getting talos on mozprocess before this bug as so much of this is trying to patch fairly messy code.  We have to change several places: it would be nicer to change one.

+        if not solo_process:
+            extra_args += " -no-remote "
+

Is there a reason we shouldn't add this flag in all cases?

         processes = self._ffprocess.checkAllProcesses(browser_config['process'], browser_config['child_process'])
-        if processes:
+        if processes and browser_config['solo_process']:

Should we even bother checking for processes in this case?

-                if (i > 0) and self._ffprocess.checkAllProcesses(browser_config['process'], browser_config['child_process']):
+                if (i > 0) and \
+                   browser_config['solo_process'] and \
+                   self._ffprocess.checkAllProcesses(browser_config['process'], browser_config['child_process']):
                     raise talosError("previous cycle still running")

So this will lose detection of the previous cycle running in --develop mode.  IMHO, not a great idea.
yeah, we lose a lot when we ignore the process state and this patch is much uglier than I thought it would be.

for the -no-remote, the goal of talos is to run as close to what an end user will run as.  I don't know what -no-remote does internally, but no end user runs that way.

I am fine shelving this until mozprocess happens, I know we keep putting that off, but we could realistically do that this quarter as a lot of the hurdles have been tackled.
Yeah, I think with the current state of talos, mozprocess should be....well, not easy but unblocked as much as it can be.  We just need to find time to do it.  We should probably make it an unofficial goal to get this added this quarter and, if sept comes and this isn't started, assess what time we have and throw some weight at it
Comment on attachment 645349 [details] [diff] [review]
allow for solo_process mode to be toggled with --develop (1.0)

taking this down for review for now.  please let me know if i should reconsider it
Attachment #645349 - Flags: review?(jhammel)
this bug can be a reality now that we use mozprocess to manage our browser.  I don't believe the original patch is that useful other than as a way to see what crazy things were done back in the old days.

Solving this would be a good win for developers, I would consider this a good_third_bug or something, it is a bit more complex than many would like.
I'll take a stab at this one to learn a bit more about talos.
Assignee: jmaher → dminor
It's been three months and I haven't had a chance to look at this, so I'm going to unassign myself. Maybe we should make this a mentored "good next bug"?
Assignee: dminor → nobody
Status: ASSIGNED → NEW
this bug could use a better description, if you are picking this up to work on please ask questions
Blocks: 1088251
Mentor: jmaher, dminor
Whiteboard: [good next bug][lang=python]
I am getting idea here that when we use --develop, we have to ignore the previously running instances of firefox, there should be no interaction with these instances. Is that all or there is more?
Ankit, correct.  We want to monitor our current process that we launch.  We shouldn't care about any previous existing browsers or even new ones launched by the user.

When we switched to mozprocess to launch the process this has made it possible to get the process id and just manage that instance of the process.
All instances of firefox(whether new tab or window) are running under a single process. I checked PID both on Windows and Linux. I think firefox is forking internally. So, I am stuck here!
This bug is supporting the feature when you have your regular instance of Firefox running, then you run talos in addition to it.

Talos needs to launch firefox with the "--no-remote" cli parameter so that it creates a separate instance.  I did just verify that launching the same instance of firefox-trunk, one normal and the other with --no-remote yields two unique firefox-trunk processes.

Let me know if this information helps you move forward.
Thanx, that is helpful.
Comment on attachment 8542845 [details] [diff] [review]
Bug 776947 - ignore running processes in talos when you are in --develop mode . r=jmaher

Review of attachment 8542845 [details] [diff] [review]:
-----------------------------------------------------------------

overall, I think we can make ffprocess.py a much smoother bit of code.  First off we don't need to preserve the old functions, lets get rid of duplicated code.

Second, now that we have pids, we don't need to get all processes, etc. I believe we can just have two functions:
checkProcesses()
cleanupProcesses()

of course these might call internal functions, but those are the two entry points we care about.

I believe you have done the hard work, now it is just making it clean.

::: talos/ffprocess.py
@@ +84,5 @@
> +    def TerminateProcesses(self, timeout, pids, *process_names):
> +        """Helper function to terminate all processes with the given process name and PID
> +
> +        Args:
> +            process_names: String or strings containing the process name, i.e. "firefox"

I would rather edit TerminateAllProcesses instead of duplicating the function.

@@ +91,5 @@
> +        for process_name in process_names:
> +            try:
> +                process_pids = mozpid.get_pids(process_name)
> +                pids_to_terminate = [process_pid for process_pid in process_pids if process_pid in pids]
> +                for p_id in pids_to_terminate:

we just need-
for p in pids:
  terminateprocess()

why do the entire mozpid.get_pids?

::: talos/run_tests.py
@@ +140,4 @@
>  
>      # set browser_config
>      browser_config=configurator.browser_config()
> +    

nit: extra whitespace here

::: talos/ttest.py
@@ +127,2 @@
>                                           browser_config['child_process'],
> +                                         browser_config['browser_wait'], 

nit: you have trailing whitespace here
Attachment #8542845 - Flags: review?(jmaher) → review-
Yeah, Joel. I was waiting for your approval of my approach. I already saw all these issues.
awesome.  If you want to gain consensus on an approach, you can ask for feedback instead of a review.  A neat feature in bugzilla to avoid getting the dreaded r-.

Looking forward to your patch!
Attachment #8542845 - Attachment is obsolete: true
Attachment #8543132 - Flags: review?(jmaher)
Attachment #8542845 - Flags: review- → feedback-
Comment on attachment 8543132 [details] [diff] [review]
Bug 776947 - ignore running processes in talos when you are in --develop mode . r=jmaher

Review of attachment 8543132 [details] [diff] [review]:
-----------------------------------------------------------------

a few small details here.  One question I do have is how do we handle the child_process (i.e. plugin-container?)  Right now we return a single pid from TalosProcess, but in reality firefox has the main process and the plugin-container process.  I do realize that I have overlooked this when discussing this bug originally and giving you feedback on your first patch.

Quite possibly could you comment on how you thing we could solve the plugin-container solution and test if we really do clean that up or not.  Once we resolve the child process quesiton, this patch has good code otherwise!

::: talos/ffprocess.py
@@ +28,5 @@
> +            ret = self._TerminateProcess(pid, timeout)
> +            if ret:
> +                results.append("(%s): %s" % (pid, ret))
> +            else:
> +                pids.remove(pid)

why do you remove pids here?

@@ +33,5 @@
>          return ",".join(results)
>  
> +    def checkProcesses(self, pids):
> +        """Returns a list of browser related PIDs still running
> +        

nit: blank line with whitespace

@@ +40,5 @@
> +        Returns:
> +            A list containing PIDs which are still running
> +        """
> +        return [pid for pid in pids if utils.is_running(pid)]
> +        

nit: blank line with whitespace

::: talos/run_tests.py
@@ +145,5 @@
> +    # set 'remote' in browser_config to False, if --develop is specified to PerfConfigurator.py
> +    if browser_config['develop']:
> +        browser_config['extra_args'] = '--no-remote'
> +        browser_config['remote'] = False
> + 

nit: you have a blank line with whitespace, both on lines 143 and 149

::: talos/ttest.py
@@ +469,3 @@
>                          cleanup = TalosProcess.TalosProcess(['python'] + test_config['cleanup'].split(), env=os.environ.copy())
>                          cleanup.run()
> +                        self._pids.append(cleanup.pid)

I don't believe we need a pid here, in this case we are running an external script which isn't the browser.
Attachment #8543132 - Flags: review?(jmaher) → review-
I believe `plugin-container` is child process of firefox, so it will be terminated when we terminate firefox.

> +            ret = self._TerminateProcess(pid, timeout)
> +            if ret:
> +                results.append("(%s): %s" % (pid, ret))
> +            else:
> +                pids.remove(pid)

I am removing PIDs here, because if we don't get any result from _TerminateProcess(), that PID is not running. It is unnecessary to keep in PIDs list and iterate over and over during cleanup.
in this case we make a deep copy of pids for this loop, I assume this modifies the self._pids variable that we store the list of pids in?
I already made deepcopy. Look at the loop.
got it, lets just figure out the child process and we should be good.
Wrote a hack to get child PIDs.
Also filed a bug: https://bugzilla.mozilla.org/show_bug.cgi?id=1117651
Attachment #8543132 - Attachment is obsolete: true
Attachment #8543803 - Flags: review?(jmaher)
Ankit, wow, this is cool.  I got curious how we deal with this in our other harnesses, and it appears that mozprocess (which TalosProcess is subclasses from) does handle child processes.  Take a look here:
http://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozprocess/mozprocess/processhandler.py#115

In our mochitest harness, we use mozprocess to launch a server and here is logic to kill it:
http://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#371


For the main firefox process, we do something like this:
http://dxr.mozilla.org/mozilla-central/source/testing/mochitest/runtests.py#1550

which then calls into mozrunner (we use .start() and .wait() methods):
http://dxr.mozilla.org/mozilla-central/source/testing/mozbase/mozrunner/mozrunner/base/runner.py#75

Those reference a processhandler which are part of mozprocess (as referenced in the first link), these effectively call wait() which if you drill down enough you can see that wait() handles child processes in the os specific _wait() definitions.

Long story short, I might have misled you on a patch of wasted effort whereas your first patch would have been more than enough.

Do let me know your thoughts on this.  If you have other concerns lets sort it out, otherwise we should take your first patch and file a cleanup bug to remove the references to child_process in the talos code base :)
Attachment #8543803 - Flags: review?(jmaher)
Comment on attachment 8543132 [details] [diff] [review]
Bug 776947 - ignore running processes in talos when you are in --develop mode . r=jmaher

Review of attachment 8543132 [details] [diff] [review]:
-----------------------------------------------------------------

can you address the nits in this patch.

::: talos/ffprocess.py
@@ +28,5 @@
> +            ret = self._TerminateProcess(pid, timeout)
> +            if ret:
> +                results.append("(%s): %s" % (pid, ret))
> +            else:
> +                pids.remove(pid)

why do you remove pids here?

@@ +33,5 @@
>          return ",".join(results)
>  
> +    def checkProcesses(self, pids):
> +        """Returns a list of browser related PIDs still running
> +        

nit: blank line with whitespace

@@ +40,5 @@
> +        Returns:
> +            A list containing PIDs which are still running
> +        """
> +        return [pid for pid in pids if utils.is_running(pid)]
> +        

nit: blank line with whitespace

::: talos/run_tests.py
@@ +145,5 @@
> +    # set 'remote' in browser_config to False, if --develop is specified to PerfConfigurator.py
> +    if browser_config['develop']:
> +        browser_config['extra_args'] = '--no-remote'
> +        browser_config['remote'] = False
> + 

nit: you have a blank line with whitespace, both on lines 143 and 149

::: talos/ttest.py
@@ +469,3 @@
>                          cleanup = TalosProcess.TalosProcess(['python'] + test_config['cleanup'].split(), env=os.environ.copy())
>                          cleanup.run()
> +                        self._pids.append(cleanup.pid)

I don't believe we need a pid here, in this case we are running an external script which isn't the browser.
Attachment #8543132 - Flags: review- → review+
Attachment #8543803 - Attachment is obsolete: true
Attachment #8544367 - Flags: review?(jmaher)
Comment on attachment 8544367 [details] [diff] [review]
Bug 776947 - ignore running processes in talos when you are in --develop mode . r=jmaher

Review of attachment 8544367 [details] [diff] [review]:
-----------------------------------------------------------------

overall this code is r+ quality, thanks for addressing the nits.  I have ran this in production and on windows we are having some issues:

try push that runs this code:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d1575a927a71

specific example:
http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/jmaher@mozilla.com-d1575a927a71/try-win32/try_win7-ix_test-chromez-bm119-tests1-windows-build56.txt.gz (search for 'Failed tresize')

It appears that it is failing on the 'ts' style test where we launch the browser to measure raw startup times.  Lets see if there is a timing issue with fast startup/shutdown times, or maybe something windows related.  

Marking as r- until we get that sorted out, quite possibly it should be an easy fix!
Attachment #8544367 - Flags: review?(jmaher) → review-
Apparently, mozprocess requires some upgrade.
In talos/talos/utils.py, function is_running() calls >>  
In https://github.com/mozilla/gecko-dev/blob/master/testing/mozbase/mozprocess/mozprocess/pid.py, function ps() calls >> 
subprocess.Popen(['ps', arg], stdout=subprocess.PIPE)

Now, `ps` is not available on windows.
on windows, we run in a unix style shell which allows for ps to be available.  If I get some time today or tomorrow, I will run some tests on a slave and see if I can see what is the root cause.

I am concerned with a couple things:
1) why this seems to happen on the startup only tests
2) why this happens at the end instead of in the middle of the test
3) what type of wait/timeout are we doing
4) if there are any retries we are doing 

looking into those questions might uncover an answer :)
one interesting thing here is that the pid list keeps growing, could it be that this works except that the pids get reused over time?
Apparently, _TerminateProcess() in ffprocess_win32.py, is returning a value in case of everything.
It was not consistent with ffprocess_linux.py and ffprocess_mac.py
I updated the patch.
I hope this works. We have done a lot of work on this so far.
Attachment #8544367 - Attachment is obsolete: true
Attachment #8545702 - Flags: review?(jmaher)
Comment on attachment 8545702 [details] [diff] [review]
Bug 776947 - ignore running processes in talos when you are in --develop mode . r=jmaher

Review of attachment 8545702 [details] [diff] [review]:
-----------------------------------------------------------------

This seems to solve the problem!  Thanks for finding and fixing this:)
Attachment #8545702 - Flags: review?(jmaher) → review+
Assignee: nobody → ankit.goyal90
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: