Closed Bug 865311 Opened 11 years ago Closed 10 years ago

Talos process checking is over-ambitious and wrong, Part Deux

Categories

(Testing :: Talos, defect)

All
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rnewman, Unassigned)

References

Details

(Whiteboard: [mozbase][mozharness+talos])

+++ This bug was initially created as a clone of Bug #700722 +++

jhammel: "I would advocate that the longer term fix is moving to mozrunner ( https://bugzilla.mozilla.org/show_bug.cgi?id=698898 ) and making the check for processes optional and off by default."

This process check is biting again, now that Fennec is accruing background services and content providers.

It's inaccurate to check for process lifetime on Android. The OS manages processes. Any check for processes should instead be a check for an activity or service.

Can we get some traction on the above change?
is this a fennec specific issue, or is this related to all of talos?  According to the initial comment this is related to Fennec, but according to the bugs this is blocking this is related to desktop only.  Please clarify and specify what we should be doing here?  My understanding is there is no conflict in the process checking on mobile.
Sorry, blocked bugs came across in the bug clone. (I don't know why they're there on the clone source.)

This is an issue on mobile; see your discussion with gbrown today in #mobile.

No test harness should be checking for process existence on Android, unless you just uninstalled the application and want to make sure it's gone.
No longer blocks: 713055, 650887
how could we check for an activity?  Checking for a process has worked for years of android automation.  I am interested in looking into this.  While talking about the bug that gbrown found, we see that the scenario he observed/theorized is only seen on reftest and mochitest, not on talos.  This is odd because talos is much different than reftest/mochitest.  Here are some stats:
http://brasstacks.mozilla.com/orangefactor/index.html?display=Bug&bugid=663657&startday=2013-04-01&endday=2013-04-24&tree=trunk
(In reply to Joel Maher (:jmaher) from comment #3)
> how could we check for an activity?  Checking for a process has worked for
> years of android automation.

I guess my question is "why do you need to check?".

As far as I can tell, you can use

  adb shell dumpsys | fgrep FocusedWindow

to see what the frontmost activity is. But if you kill it with `adb am`, it -- and all of its background processes -- should be dead.

> I am interested in looking into this.  While
> talking about the bug that gbrown found, we see that the scenario he
> observed/theorized is only seen on reftest and mochitest, not on talos. 

Feel free to morph this bug accordingly :D

> This is odd because talos is much different than reftest/mochitest.  Here
> are some stats:
> http://brasstacks.mozilla.com/orangefactor/index.
> html?display=Bug&bugid=663657&startday=2013-04-01&endday=2013-04-
> 24&tree=trunk

Note that test timeouts are not necessarily the result of this bug, as I'm sure you're aware!
We don't have direct adb access from the machine, but our sutagent can execute a dumpsys.  Solving the bug that gbrown was talking about earlier today will require actually killing the browser, because we would still have the same foreground window.

In all our automation (desktop,fennec,b2g) we launch a process and keep track of it.  I believe we could switch to log file polling.
Dumpsys outputs A LOT of information, I'm not sure if calling that is a good idea. Potentially we could add support inside the agent for getting this information though:

http://stackoverflow.com/questions/5446565/android-how-do-i-check-if-activity-is-running
adb shell dumpsys window input | wc -l
68

about the same amount of lines we have from 'adb shell ps'
in looking at using another method for this other than process checking, we need to know when the test is completed.  Right now we determine that by ensuring:
1) the process is dead
2) no output in the log file for <timeout> duration <- this generates a failure

To detect if the process is dead, we poll the process and wait for it to not show up in the process list.  On desktop, we have a handle to the process and if the handle is invalid we know the process has terminated.

Switching to 'FocusedWindow' method will not solve any bugs that would show up by polling the process list.
(In reply to Joel Maher (:jmaher) from comment #8)

> Switching to 'FocusedWindow' method will not solve any bugs that would show
> up by polling the process list.

Using FocusedWindow and looking for a Fennec activity will determine when the activity has finished, even if background services are still running in the same process.

It will also prevent your checks from being incorrect when you kill a process and Android restarts it (which is the mechanism by which gbrown was affected). Kill the process and then check whether the foreground activity is Fennec's, rather than kill the process and see if it's alive.

That you don't get a false positive in situations where Android has decided that the process should be running does indeed solve bugs that show up by polling the process list.

My fundamental point is that equating "process running" with "tests still running" is an invalid equation on Android, because Android is in charge of when a process with a given name should be alive.


> in looking at using another method for this other than process checking, we
> need to know when the test is completed.

It sounds like log file watching is the way to go. You already need to do 'smart' log watching to determine whether tests have output anything in a time window, so you might as well watch for "AND I'M DONE!".
To clarify my observations, see https://tbpl.mozilla.org/?tree=Try&rev=f24787dde0a1. With reference to the first Panda rc1 run, see https://tbpl.mozilla.org/php/getParsedLog.php?id=22172318&tree=Try&full=1, which has multiple logcats, due to the enhanced logging -- see the logcat that starts at 04-23 20:03:13.906.

remoteautomation.py's RProcess.wait loops for up to 3600 seconds (1 hour) dumping stdout until dm.processExist("org.mozilla.fennec") returns false. The logcat shows that the test ends successfully and apparently normally after a few minutes:

04-23 20:03:22.289 D/GeckoProfile( 3151): Found profile dir: ...
...
04-23 20:03:38.031 I/TestRunner( 3151): finished: testAboutPage(org.mozilla.fennec.tests.testAboutPage)
04-23 20:03:38.031 I/TestRunner( 3151): passed: testAboutPage(org.mozilla.fennec.tests.testAboutPage)
04-23 20:03:38.039 D/AndroidRuntime( 3140): Shutting down VM
04-23 20:03:38.039 I/ActivityManager( 1402): Force stopping package org.mozilla.fennec uid=10046
04-23 20:03:38.039 I/ActivityManager( 1402): Killing proc 3197:org.mozilla.f3nn3c.UpdateService/10046: force stop
04-23 20:03:38.039 I/ActivityManager( 1402): Killing proc 3151:org.mozilla.fennec/10046: force stop
04-23 20:03:38.062 I/AndroidRuntime( 3140): NOTE: attach of thread 'Binder Thread #3' failed

but then org.mozilla.fennec is restarted:

04-23 20:03:59.976 I/ActivityManager( 1402): Start proc org.mozilla.fennec for broadcast org.mozilla.fennec/org.mozilla.gecko.background.announcements.AnnouncementsStartReceiver: pid=3231 uid=10046 gids={3003, 1015, 1006}
04-23 20:04:00.093 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.browser: org.mozilla.gecko.db.BrowserProvider
04-23 20:04:00.109 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.tabs: org.mozilla.gecko.db.TabsProvider
04-23 20:04:00.109 I/ActivityThread( 3231): Pub org.mozilla.fennec.db.formhistory: org.mozilla.gecko.db.FormHistoryProvider

and sutAgent messages in the logcat confirm that RProcess.wait is continuing to loop waiting for org.mozilla.fennec to end.

Device info dumped by my modification to RProcess.wait confirms that the device manager is finding org.mozilla.fennec with pid=3231 after 10 minutes of waiting:

{'uptime': ... 'process': [['10046', '3231', 'org.mozilla.fennec'], 
['10044', '1882', 'com.mozilla.SUTAgentAndroid']...

For some reason (maybe just timing?), this problem only affects some runs of testAboutPage.

Note that these observations apply to:
 - mochitest-robotium (rc1, rc2)
 - Panda, Android 4.0
 - remoteautomation / RProcess.wait

I suspect that there are similar issues in other test suites and remote platforms, but I don't know, and don't have any evidence one way or the other.


> My fundamental point is that equating "process running" with "tests still 
> running" is an invalid equation on Android, because Android is in charge of when 
> a process with a given name should be alive.

This resonates with me. At the same time, when I look at devicemanager / remoteautomation code, it seems "right" for that code to launch a process and then use a ps process list to determine if the process is still running.

Another option to consider: sutAgent currently keeps a process object for launched processes and waits for some period of time for it to complete. We don't rely on that currently, but perhaps we could with some sut + devicemanagerSUT changes.
Another option: remoteautomation could note the pid of the launched process, and wait for the pid to end, rather than the process name.
(In reply to Richard Newman [:rnewman] from comment #9)
> (In reply to Joel Maher (:jmaher) from comment #8)

> It sounds like log file watching is the way to go. You already need to do
> 'smart' log watching to determine whether tests have output anything in a
> time window, so you might as well watch for "AND I'M DONE!".

I wonder if it wouldn't be even easier just to have an API endpoint in the webserver which is called by client code upon completion. This would eliminate the need to continually poll the log file.
(In reply to Geoff Brown [:gbrown] from comment #11)
> Another option: remoteautomation could note the pid of the launched process,
> and wait for the pid to end, rather than the process name.

That doesn't work well, because sutAgent's exec may wait until the launched process completes before returning; there isn't currently an opportunity for devicemanager or remoteautomation to look up the pid until that exec command completes, and that may be too late. It looks to me like any pid-based solution would require sutAgent changes.
An api endpoint would require us to modify the mochitest harness (fairly easy, but might raise a few eyebrows) and the same with the reftest harness.  I am not sure about xpcshell.  For talos we would have to modify the harness as well.

My goal is to keep harnesses as similar as possible.  A SUTAgent/DM change might be the best choice.  Polling the log file is hacky.  Whatever we do, I would like to treat this as we do the desktop tests- if we make a major difference here, lets do it on desktop.
I'm still in favor of polling activities. IMO on Android, an activity is closer in concept to a desktop process. A "process" on Android is more of an implementation detail that we shouldn't rely on.

Can we use, for example, ActivityManager.getRunningTasks in SUTAgent? http://developer.android.com/reference/android/app/ActivityManager.html#getRunningTasks%28int%29
Blocks: 663657
Try run showing Panda rc stability with announcements disabled: https://tbpl.mozilla.org/?tree=Try&rev=a8711d551a68 (evidence that our panda rc woes are caused by service restarts / process management).
I can think of two semi-hacky ways of working around the immediate issue that gbrown is seeing:
1) The first would be to parse the log files for the INFO SimpleTests FINISHED string.
2) The second would be to store the pid of the process on startup and poll the list of processes during the waitForFinish call to see if that pid still exists.

Both of these could be done with changes to remoteautomation.py.

Currently we are polling based upon the process name (e.g. org.mozilla.fennec) and so miss restarts. Polling based upon activities would not fix the issue gbrown found because the activity ends up being restarted, and we would still end up waiting until we timeout.

If Android kills the fennec process (due to OOM, I guess) and restarts it, will that test run still be valid? If it is still valid, we should go with the log file. Otherwise, we should look at the pids.
(In reply to Dan Minor [:dminor] from comment #17)

> Currently we are polling based upon the process name (e.g.
> org.mozilla.fennec) and so miss restarts. Polling based upon activities
> would not fix the issue gbrown found because the activity ends up being
> restarted, and we would still end up waiting until we timeout.

The *activity* isn't restarted. The *process* is restarted to continue to host background services and content providers.
(In reply to Dan Minor [:dminor] from comment #17)
> 2) The second would be to store the pid of the process on startup and poll
> the list of processes during the waitForFinish call to see if that pid still
> exists.

I don't think it is currently possible for remoteautomation to get the pid of the process on startup -- see my argument with myself in comment 13.
> The *activity* isn't restarted. The *process* is restarted to continue to
> host background services and content providers.

Ok, to make sure I follow you, there will be no fennec activity even though a fennec process is running?
(In reply to Geoff Brown [:gbrown] from comment #19)
> (In reply to Dan Minor [:dminor] from comment #17)
> > 2) The second would be to store the pid of the process on startup and poll
> > the list of processes during the waitForFinish call to see if that pid still
> > exists.
> 
> I don't think it is currently possible for remoteautomation to get the pid
> of the process on startup -- see my argument with myself in comment 13.

I had thought to get the PID shortly after startup through python calls, but maybe that wouldn't be reliable enough.
we can get a pid from the current ps we do to poll the process:
Trying 192.168.1.86...
Connected to 192.168.1.86.
Escape character is '^]'.
$>ps
10000	1520	android.process.media
10047	3992	org.mozilla.fennec
10044	1921	com.mozilla.SUTAgentAndroid
10007	1632	com.android.smspush
1001	1575	com.android.phone
10045	1533	net.rocboronat.android.wallpaper.npe
10014	1546	com.android.inputmethod.latin
1000	1426	system
10036	2029	com.android.quicksearchbox
10028	2012	com.svox.pico
10017	1997	com.android.gallery3d
10008	1982	com.android.musicfx
10027	1967	com.android.defcontainer
10001	1615	android.process.acore
10038	1819	com.android.providers.calendar
10021	1892	com.android.calendar
10030	1716	com.android.deskclock
10024	1598	com.android.launcher
1000	1873	com.android.settings
10031	1732	com.android.email
10033	1765	com.android.exchange
10043	1836	com.mozilla.watcher
1000	1504	com.android.systemui
$>

why don't we just ensure the pid:procname match up, that can be done in remoteautomation.py (and we should ensure this in talos as well).  I would not like to poll the log file as other harnesses don't do that.
Possibly relevant: ahal's been working on refactoring the logic for launching applications for b2g in bug 865349. If he succeeds (and I think he will), we'll probably want to do a similar refactoring for Android. CC'ing him on this bug.
(In reply to Dan Minor [:dminor] from comment #20)
> > The *activity* isn't restarted. The *process* is restarted to continue to
> > host background services and content providers.
> 
> Ok, to make sure I follow you, there will be no fennec activity even though
> a fennec process is running?

Correct. Android uses the process to run all of the code that ships in the Fennec APK -- content providers, background services, alarm handlers, etc.

The Fennec browser activity -- named "@ANDROID_PACKAGE_NAME@.App" -- is just one of those things.

The issue here is that the tools are using the process as a proxy for that activity. In trying to quit the activity they force-kill the process. Android says "tut tut" and fixes the mistake by restarting the process to handle the announcements service, shortly prior to the tools checking for success by seeing if the process is dead.

Ideally you would gracefully close the activity (e.g., by using a custom broadcast intent, calling Activity.finish() when it's done, etc.) rather than nuking the process.
(In reply to Joel Maher (:jmaher) from comment #22)
> why don't we just ensure the pid:procname match up, that can be done in
> remoteautomation.py (and we should ensure this in talos as well). 

I think that won't work reliably, at least for robocop-via-sut, because:
 - the process is launched via dmSUT.fireProcess: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/testing/mozbase/mozdevice/mozdevice/devicemanagerSUT.py#l466
 - sutAgent's exec will wait for up to 300 seconds for the process to end before returning: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/build/mobile/sutagent/android/DoCommand.java#l3734
 - by the time automation/remoteautomation gets back control: https://hg.mozilla.org/mozilla-central/file/681bbf7717c1/build/automation.py.in#l1225
it is too late -- we don't know if the process we launched is still running or completed and was restarted.
Depends on: 865944
Blocks: 873892
Blocks: 872477
now that we use topactivity for finding the process, can we resolve this bug?
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.