Closed Bug 695020 Opened 13 years ago Closed 13 years ago

Race condition in devicemanagerSUT fireProcess method

Categories

(Testing :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
mozilla11

People

(Reporter: wlach, Assigned: bmoss)

References

Details

Attachments

(1 file, 1 obsolete file)

We repeatedly check in deviceManagerSUT.py:fireProcess to see if it has started up, but with a 3 second interval. This means that if our process starts up and stops in between the checking interval, we'll assume that it never started. I can easily reproduce this bug here inside talos by running bcontroller with the startup test repeatedly. It doesn't actually *fail* (we accept a None return value from launchProcess and see if there is a log file anyway), but it sure does slow things down since we wait the entire timeout period for Fennec to start (when it has already started and stopped). Possible solutions to this problem: 1. Check inside SUTAgent itself to see if the process has started (problem: we currently use startActivity to start Fennec processes, and that doesn't seem to have any provision for sending a notification that something has started or failed to start: http://developer.android.com/reference/android/content/ContextWrapper.html) 2. Don't guarantee that the process has started inside fireProcess and make users of it (or things that call it like launchProcess) design their stuff to do something smarter than just waiting for something to start/stop (problem: asking users of an API to be smart is a recipe for disaster, invariably people will cut corners if something's too much effort) 3. Reduce the checking interval to something finer grained, like 0.1 seconds (problem: this doesn't strictly address the issue... there's still a race condition) I think the only viable solutions are (3) and (1). I would personally much rather do (1), but am not sure if there's an API that can do what we want here.
personally adjusting sutagent doesn't seem like a good idea since we would need to update all the agents (unless we didn't plan on rolling it out). I agree this is problematic and we should come up with a better solution. Is there a solution using devicemanagerADB that we can use?
Well, we'd only need to roll out a new agent if we broke the API. It's not yet clear to me that we couldn't fix this issue without changing the API. If so, we could update the agents incrementally.
Attached patch Add code to check for proper launching of apps (obsolete) (deleted) — Splinter Review
This patch adds checking for application launch within the StartJavaPrg function and returns same.
Attachment #571109 - Flags: review?(ctalbert)
Comment on attachment 571109 [details] [diff] [review] Add code to check for proper launching of apps Review of attachment 571109 [details] [diff] [review]: ----------------------------------------------------------------- Overall, the approach looks good, but I see at least one thing that needs correcting and have questions on some others, so I have to r- this patch. Details below... ::: build/mobile/sutagent/android/DoCommand.java @@ +3156,5 @@ > + } > + catch (IllegalThreadStateException itse) { > + itse.printStackTrace(); > + sRet = "\nuninst command timed out"; > + } This pattern of changes is happening a number of times and doesn't have anything to do with startPrg changes, right? I seem to recall this was an earlier fix that we made when we realized the agent was crashing on uninstall. I thought these had been checked in a while ago, but evidently not? Just want to make sure this is not part of the startPrg fix for my own sanity. @@ +3501,5 @@ > RedirOutputThread outThrd = new RedirOutputThread(pProc, out); > outThrd.start(); > + while (lcv < 30) { > + try { > + outThrd.join(10000); All these thread.join calls have a different timeout. Are these numbers significant? If not can we standardize on one constant? I can't imagine these are significant since it's just a timer to tell the system to "fail if it hasn't started by x". From what I read though, it looks like it will return the moment the process is created, so we aren't waiting a (potentially long time) for an app to be visible and usable. Therefore I think we could go with a constant and get rid of the magic numbers. @@ +3508,5 @@ > + lcv = 30; > + } > + catch (IllegalThreadStateException itse) { > + lcv++; > + itse.printStackTrace(); If we are retrying something 30 times, do we want 30 stack traces of failing to start? Maybe we do, I'm not sure. @@ +3510,5 @@ > + catch (IllegalThreadStateException itse) { > + lcv++; > + itse.printStackTrace(); > + } > + } It looks to me like this addition is also not for the startprg change in the bug. I think the change for the bug is the change in startjavaprg, right? Cause if this is also supposed to call the findProcThread code, we have a problem, since it doesn't. That's one of the reasons I have to r-, since I'm not sure if this is a deliberate omission or not. I think it is, but let's be sure. @@ +3627,5 @@ > + try { > + outThrd.join(10000); > + int nRetCode = pProc.exitValue(); > + sRet = "return code [" + nRetCode + "]"; > + lcv = 30; Nit: wouldn't a break; statement be cleaner/more readable? Same goes for the above loop. @@ +3631,5 @@ > + lcv = 30; > + } > + catch (IllegalThreadStateException itse) { > + lcv++; > + itse.printStackTrace(); Nit: same comment on stack trace as above. ::: build/mobile/sutagent/android/FindProcThread.java @@ +28,5 @@ > + > + if (aMgr == null) > + return; > + > + while (bStillRunning && !bFoundIt) { This logic looks good. But, please add a comment here about the process that happens. It took me a couple of minutes staring at it to realize you didn't have an infinite loop with the continue statement. Since we need to re-implement this behavior in any future agent (I think it'd be good, unless we have better hooks than we have on android) a comment here would be good so that re-implementers have something to go on. ::: build/mobile/sutagent/android/Makefile.in @@ +50,5 @@ > CmdWorkerThread.java \ > DataWorkerThread.java \ > DoAlert.java \ > DoCommand.java \ > + FindProcThread.java This needs a line continuation character \ at the end (like the other lines). Otherwise, this will break the build. Hence, my r-.
Attachment #571109 - Flags: review?(ctalbert) → review-
(In reply to Clint Talbert ( :ctalbert ) from comment #4) > Comment on attachment 571109 [details] [diff] [review] [diff] [details] [review] > Add code to check for proper launching of apps > > Review of attachment 571109 [details] [diff] [review] [diff] [details] [review]: > ----------------------------------------------------------------- > > Overall, the approach looks good, but I see at least one thing that needs > correcting and have questions on some others, so I have to r- this patch. > Details below... > > ::: build/mobile/sutagent/android/DoCommand.java > @@ +3156,5 @@ > > + } > > + catch (IllegalThreadStateException itse) { > > + itse.printStackTrace(); > > + sRet = "\nuninst command timed out"; > > + } > > This pattern of changes is happening a number of times and doesn't have > anything to do with startPrg changes, right? I seem to recall this was an > earlier fix that we made when we realized the agent was crashing on > uninstall. I thought these had been checked in a while ago, but evidently > not? > > Just want to make sure this is not part of the startPrg fix for my own > sanity. > Yes, it appears that the fix was never committed, or somehow got fat fingered. > @@ +3501,5 @@ > > RedirOutputThread outThrd = new RedirOutputThread(pProc, out); > > outThrd.start(); > > + while (lcv < 30) { > > + try { > > + outThrd.join(10000); > > All these thread.join calls have a different timeout. Are these numbers > significant? If not can we standardize on one constant? I can't imagine > these are significant since it's just a timer to tell the system to "fail if > it hasn't started by x". From what I read though, it looks like it will > return the moment the process is created, so we aren't waiting a > (potentially long time) for an app to be visible and usable. Therefore I > think we could go with a constant and get rid of the magic numbers. > > @@ +3508,5 @@ > > + lcv = 30; > > + } > > + catch (IllegalThreadStateException itse) { > > + lcv++; > > + itse.printStackTrace(); > > If we are retrying something 30 times, do we want 30 stack traces of failing > to start? Maybe we do, I'm not sure. Should be ok > > @@ +3510,5 @@ > > + catch (IllegalThreadStateException itse) { > > + lcv++; > > + itse.printStackTrace(); > > + } > > + } > > It looks to me like this addition is also not for the startprg change in the > bug. I think the change for the bug is the change in startjavaprg, right? > Cause if this is also supposed to call the findProcThread code, we have a > problem, since it doesn't. That's one of the reasons I have to r-, since > I'm not sure if this is a deliberate omission or not. I think it is, but > let's be sure. > > @@ +3627,5 @@ > > + try { > > + outThrd.join(10000); > > + int nRetCode = pProc.exitValue(); > > + sRet = "return code [" + nRetCode + "]"; > > + lcv = 30; > > Nit: wouldn't a break; statement be cleaner/more readable? Same goes for > the above loop. > > @@ +3631,5 @@ > > + lcv = 30; > > + } > > + catch (IllegalThreadStateException itse) { > > + lcv++; > > + itse.printStackTrace(); > > Nit: same comment on stack trace as above. > > ::: build/mobile/sutagent/android/FindProcThread.java > @@ +28,5 @@ > > + > > + if (aMgr == null) > > + return; > > + > > + while (bStillRunning && !bFoundIt) { > > This logic looks good. But, please add a comment here about the process > that happens. It took me a couple of minutes staring at it to realize you > didn't have an infinite loop with the continue statement. Since we need to > re-implement this behavior in any future agent (I think it'd be good, unless > we have better hooks than we have on android) a comment here would be good > so that re-implementers have something to go on. > > ::: build/mobile/sutagent/android/Makefile.in > @@ +50,5 @@ > > CmdWorkerThread.java \ > > DataWorkerThread.java \ > > DoAlert.java \ > > DoCommand.java \ > > + FindProcThread.java > > This needs a line continuation character \ at the end (like the other > lines). Otherwise, this will break the build. Hence, my r-. oops
Attached patch Addressing clints comments (deleted) — Splinter Review
Attachment #571109 - Attachment is obsolete: true
Attachment #572529 - Flags: review?(ctalbert)
Comment on attachment 572529 [details] [diff] [review] Addressing clints comments r+ Looks good. I'll try a quick build and land it for you.
Attachment #572529 - Flags: review?(ctalbert) → review+
Blocks: 705192
This one got away from me. Landed on inbound: https://hg.mozilla.org/integration/mozilla-inbound/rev/8eeb504515e3 Sorry for the lapse in checking in.
Assignee: nobody → bmoss
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla11
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: