Closed
Bug 504440
Opened 15 years ago
Closed 14 years ago
Use a global timeout for command line tests to kill the process if Firefox doesn't respond
Categories
(Testing Graveyard :: Mozmill, defect, P1)
Testing Graveyard
Mozmill
Tracking
(Not tracked)
VERIFIED
FIXED
People
(Reporter: whimboo, Assigned: k0scist)
References
Details
(Whiteboard: [qae-p1][mozmill-1.4.2+])
Attachments
(2 files, 1 obsolete file)
(deleted),
patch
|
cmtalbert
:
review+
|
Details | Diff | Splinter Review |
(deleted),
patch
|
harth
:
review+
|
Details | Diff | Splinter Review |
Given by our yesterdays meeting we should have a frozen check for Mozmill. Under some situations it can happen that the network connection is dropped and commands cannot be transferred anymore by jsbridge. Firefox will stay open forever and you have to kill it manually. There should be a default timeout which users can modify via a command line option.
This would be necessary for the integration into the build system (see bug 457265). Adding it on the blocking list for now.
Comment 1•15 years ago
|
||
This has to be implemented on Python end and will be a global timeout when no events have occured for a specific amount of time.
Summary: Add frozen check to mozmill → Add global timeout to mozmill command line
Comment 2•15 years ago
|
||
Needs to be done ASAP for more reliable runs on buildbot.
Whiteboard: [mozmill-1.2.1]
Reporter | ||
Updated•15 years ago
|
Summary: Add global timeout to mozmill command line → Use a global timeout for command line tests to kill the process if Firefox doesn't respond
Reporter | ||
Updated•15 years ago
|
Whiteboard: [mozmill-1.2.1] → [mozmill-1.2.2]
Reporter | ||
Updated•15 years ago
|
Severity: critical → blocker
Reporter | ||
Updated•15 years ago
|
Whiteboard: [mozmill-1.2.2]
Reporter | ||
Comment 4•15 years ago
|
||
As discussed with Clint on phone today we would have to raise it as a major thing to fix. While running our daily Mozmill tests on the QA lab machine now we still have failures and a couple of process stay open. Given that I have to check the box each day after the test run.
It's eventually something testdev could work on next week.
Whiteboard: [qae-p1]
Solving this in two parts:
* Part 1: between firefox and mozmill. I've put something in place that destroys mozmill if it detects that no action is happening in the test framework - no new tests starting, no tests ending etc.
* Part 2: if a modal dialog is orphaned and thus causing tests to timeout, we should trap that and try to identify if an unknown modal dialog is hanging out. If a dialog is hanging out, we should kill all open windows (except main window, if we can determine that) and we will then try to keep going with the tests.
Forgot to Add in the info for the git stuff. it's on this branch: http://github.com/ctalbert/mozmill/commits/foobar2.
Reporter | ||
Comment 7•15 years ago
|
||
As talked on IRC with Mikeal, part 1 needs to handle the command line option and should not be enabled by default. Otherwise for users who run the command "mozmill -b %app%", Mozmill will be killed after 30 minutes.
Reporter | ||
Comment 8•15 years ago
|
||
Clint, when you will be able to continue with the remaining stuff?
Also something which came into my mind and would be really useful: When we kill the Firefox process, do we know which test gets run actually? If yes, we should add an error too.
Assignee: nobody → ctalbert
Status: NEW → ASSIGNED
Reporter | ||
Comment 9•15 years ago
|
||
Since a couple of days I get the following error when Firefox gets closed after a test-run. Could it be related to the already checked-in first patch?
Exception in thread Thread-6 (most likely raised during interpreter shutdown):
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 460, in __bootstrap
File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/threading.py", line 625, in run
File "/Volumes/data/build/tools/mozmill/mozmill/__init__.py", line 277, in stop
<type 'exceptions.TypeError'>: 'NoneType' object is not callable
Unhandled exception in thread started by
Error in sys.excepthook:
Original exception was:
Assignee | ||
Comment 10•14 years ago
|
||
The global timeout is addressed in (obsolescent) bug 574043. Repeating from
https://bugzilla.mozilla.org/show_bug.cgi?id=574043#c2
> I've added a global timeout to jsbridge:
>
> http://github.com/k0s/jsbridge/commit/39f6b9ea983cb9e469751f7ee012218f0d085ab8
> http://github.com/k0s/jsbridge/commit/bd49c30218388c80c83e0ba8e46f99b7e41fdfbd
> http://github.com/k0s/jsbridge/commit/0388256d3de02c5ae4ae785ae28e8e78362efeb0
>
> This seems to work correctly; on my ubuntu machine, the tests run through
> passing, as they did before. On the CentOS VM, instead of the tests hanging,
> the exception is thrown and the run ends with "Application quit unexpectedly."
>
> There are a few more things that need to be done:
> - a CLI switch in mozmill should be added that sets this on the bridge
> object(s)
> - there should be a per-test timeout in mozmill/extension's JS; currently, the
> whole run stops (as jsbridge doesn't know anything about "tests") whereas, for
> reproducibility, individual tests should timeout. The default value (whatever
> it might be) should be overridable on a per-test basis.
> - mozmill/mozrunner should ensure that Firefox is killed when Something Bad
> Happens; currently, the jsbridge stuff kills the run successfully, but Firefox
> will remain open, which would be bad for buildbot
The CLI switch is also done: http://github.com/k0s/mozmill/commit/945cc4b8654330a9c72a28f5071eb4b9d41fb161
So what remains to be done is better cleanup of Firefox upon failure. I'm guessing this is just a missing try: except: condition.
Assignee | ||
Updated•14 years ago
|
Assignee: ctalbert → jhammel
Assignee | ||
Comment 11•14 years ago
|
||
Attachment #454555 -
Flags: review?(ctalbert)
Assignee | ||
Comment 12•14 years ago
|
||
This patch does a few things:
* front end to setting the global timeout for communication in jsbridge
* makes both of the wait JS functions go through the same function; seems to work, but the comment scares me
* better cleanup of firefox upon disconnect failure; again, seems to work, but may not be adequate for all cases [hard to test]
* cleanup
Attachment #454574 -
Flags: review?(harthur)
Comment 13•14 years ago
|
||
Comment on attachment 454574 [details] [diff] [review]
patch to mozmill to support global timeout
>+ parser_options[("--timeout",)] = dict(dest="timeout", type="float",
>+ default=60.,
>+ help="seconds before harness timeout if no communication is taking place")
why float?
> except JSBridgeDisconnectError:
>- print 'Application unexpectedly closed'
>+ print >> sys.stderr, 'Disconnect Error: Application unexpectedly closed'
we print all our other errors to stdout, why did you choose stderr for this one?
>+ while(gWaitOnLocked) {
>+ throw 'waitOn locked';
>+ dump("Waiting...");
> }
>
>+ gWaitOnLocked = true;
Did this work? I don't know about adding this if it doesn't help. Also, this is kind of for a separate bug.
>+
> // Calls repeatedly every X milliseconds until clearInterval is called
>- var interval = hwindow.setInterval(wait, milliseconds);
>+ var interval = hwindow.setInterval(callback, milliseconds);
>
> var thread = Components.classes["@mozilla.org/thread-manager;1"]
> .getService()
>@@ -409,14 +408,30 @@ function sleep (milliseconds) {
> // This blocks execution until our while loop condition is invalidated. Note
> // that you must use a simple boolean expression for the loop, a function call
> // will not work.
>- while(!self.timeup)
>- thread.processNextEvent(true);
>+ while(condition(self)) {
>+ // see nsIThread.idl
>+ // https://developer.mozilla.org/en/nsIThread
>+ var retval = thread.processNextEvent(false);
>+ }
>+
I don't see 'retval' used anywhere
Other than these questions it looks great.
Comment 14•14 years ago
|
||
Comment on attachment 454555 [details] [diff] [review]
patch to jsbridge to have a global timeout
>--- a/jsbridge/network.py
>+++ b/jsbridge/network.py
>@@ -45,15 +45,16 @@ from threading import Thread
>@@ -170,24 +167,40 @@ class Bridge(Telnet):
>
>+
>+ # XXX we've actually already connected in Telnet
>+ self.connect((host, port))
Why the XXX comment here? If this is a condition we should handle in the future let's either go ahead and do it now (it would seem to be easy to me to print an error if we find that we are already connected and exit) or file a follow on bug to handle it in the future. But since all we really need here is a simple error message, we should just ensure we have that for now.
>
> def handle_connect(self):
> self.register()
>
> def run(self, _uuid, exec_string, interval=.2, raise_exeption=True):
>+ global ctr
>+
As discussed, I think this is more readable (to me, if it's better python to do it this way, then maybe we should keep it) as a member of the base class, probably with a better variable name to indicate that it is the current "uptime" between messages from the javascript end that is running tests.
r=me with those nits.
Attachment #454555 -
Flags: review?(ctalbert) → review+
Assignee | ||
Comment 15•14 years ago
|
||
(In reply to comment #13)
> (From update of attachment 454574 [details] [diff] [review])
>
> >+ parser_options[("--timeout",)] = dict(dest="timeout", type="float",
> >+ default=60.,
> >+ help="seconds before harness timeout if no communication is taking place")
>
> why float?
As opposed to int? Why not?
>
> > except JSBridgeDisconnectError:
> >- print 'Application unexpectedly closed'
> >+ print >> sys.stderr, 'Disconnect Error: Application unexpectedly closed'
>
> we print all our other errors to stdout, why did you choose stderr for this
> one?
I can do stdout if desired. I just wanted to ensure things were unbuffered.
> >+ while(gWaitOnLocked) {
> >+ throw 'waitOn locked';
> >+ dump("Waiting...");
> > }
> >
> >+ gWaitOnLocked = true;
>
> Did this work? I don't know about adding this if it doesn't help. Also, this is
> kind of for a separate bug.
It is separate. As far as if it works, it doesn't seem to break anything. This is *not* to say that it in fact works. I'd be happy to revert this change, if desired, as it is actually a separate issue.
> >+
> > // Calls repeatedly every X milliseconds until clearInterval is called
> >- var interval = hwindow.setInterval(wait, milliseconds);
> >+ var interval = hwindow.setInterval(callback, milliseconds);
> >
> > var thread = Components.classes["@mozilla.org/thread-manager;1"]
> > .getService()
> >@@ -409,14 +408,30 @@ function sleep (milliseconds) {
> > // This blocks execution until our while loop condition is invalidated. Note
> > // that you must use a simple boolean expression for the loop, a function call
> > // will not work.
> >- while(!self.timeup)
> >- thread.processNextEvent(true);
> >+ while(condition(self)) {
> >+ // see nsIThread.idl
> >+ // https://developer.mozilla.org/en/nsIThread
> >+ var retval = thread.processNextEvent(false);
> >+ }
> >+
>
> I don't see 'retval' used anywhere
retval isn't used. It originally was for debugging purposes. Should I nix?
>
> Other than these questions it looks great.
Assignee | ||
Comment 16•14 years ago
|
||
(In reply to comment #14)
> (From update of attachment 454555 [details] [diff] [review])
> >--- a/jsbridge/network.py
> >+++ b/jsbridge/network.py
> >@@ -45,15 +45,16 @@ from threading import Thread
> >@@ -170,24 +167,40 @@ class Bridge(Telnet):
> >
> >+
> >+ # XXX we've actually already connected in Telnet
> >+ self.connect((host, port))
> Why the XXX comment here? If this is a condition we should handle in the
> future let's either go ahead and do it now (it would seem to be easy to me to
> print an error if we find that we are already connected and exit) or file a
> follow on bug to handle it in the future. But since all we really need here is
> a simple error message, we should just ensure we have that for now.
I put the XXX comment in as there is some duplication here: see
http://github.com/mozautomation/mozmill/blob/master/jsbridge/jsbridge/network.py#L59
vs.
http://github.com/mozautomation/mozmill/blob/master/jsbridge/jsbridge/network.py#L176
I assume that this might have been done for a reason. I have not exhaustively debugged whether the dual-implementation solved some race condition or not. In general, if this sort of thing is intentional, the intention should be documented such that future hackers don't have to guess what it is.
I would gladly comment out either/or legacy connect, though I can't guess if this will break anything in practice.
> >
> > def handle_connect(self):
> > self.register()
> >
> > def run(self, _uuid, exec_string, interval=.2, raise_exeption=True):
> >+ global ctr
> >+
> As discussed, I think this is more readable (to me, if it's better python to do
> it this way, then maybe we should keep it) as a member of the base class,
> probably with a better variable name to indicate that it is the current
> "uptime" between messages from the javascript end that is running tests.
>
> r=me with those nits.
K; I'll do so. As mentioned, the only reason I kept it global was to specifically note that it is global. Putting it as an attribute on jsbridge will not change its scope, I just wanted to be explicit about it.
Assignee | ||
Comment 17•14 years ago
|
||
(In reply to comment #15)
> (In reply to comment #13)
<snip/>
> > >+ while(gWaitOnLocked) {
> > >+ throw 'waitOn locked';
> > >+ dump("Waiting...");
> > > }
> > >
> > >+ gWaitOnLocked = true;
> >
> > Did this work? I don't know about adding this if it doesn't help. Also, this is
> > kind of for a separate bug.
>
> It is separate. As far as if it works, it doesn't seem to break anything.
> This is *not* to say that it in fact works. I'd be happy to revert this
> change, if desired, as it is actually a separate issue.
Moved to bug 575405
Assignee | ||
Comment 18•14 years ago
|
||
a new patch with the refactoring of waitForEval + sleep moved to bug 575405
anything else I need to do to get an r+?
Attachment #454574 -
Attachment is obsolete: true
Attachment #454653 -
Flags: review?(harthur)
Attachment #454574 -
Flags: review?(harthur)
Updated•14 years ago
|
Attachment #454653 -
Flags: review?(harthur) → review+
Comment 19•14 years ago
|
||
looks good, push it
Reporter | ||
Comment 20•14 years ago
|
||
Why do we still use a float as timeout value?
Assignee | ||
Comment 21•14 years ago
|
||
(In reply to comment #20)
> Why do we still use a float as timeout value?
Why shouldn't we?
Reporter | ||
Comment 22•14 years ago
|
||
Solved the question on IRC. Python is using seconds instead of milliseconds per default. So float is the correct way to go. Sorry.
Assignee | ||
Comment 23•14 years ago
|
||
pushed to http://github.com/mozautomation/mozmill/tree/1.4.2 ; should i close or wait until this is pushed to master and/or until 1.4.2 is put in m-c?
Reporter | ||
Comment 24•14 years ago
|
||
What blocks us from pushing it to the master? Once this is done we will mark it as fixed. m-c doesn't play into account here.
Assignee | ||
Comment 25•14 years ago
|
||
(In reply to comment #24)
> What blocks us from pushing it to the master? Once this is done we will mark it
> as fixed. m-c doesn't play into account here.
I marked as fixed anyway. I'm not sure what the workflow is for pushing mozauto 1.4.2 -> mozauto master, but I'm guessing it should only be done when we're ready to release (does that sound right?)
Reporter | ||
Comment 26•14 years ago
|
||
I don't think so. Once a feature has been reviewed and has been finalized by the developer it should be also pushed to the master immediately. That way we can get wider exposure.
Assignee | ||
Comment 27•14 years ago
|
||
Forgot to mark this as fixed for 1.4.2
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Comment 28•14 years ago
|
||
(In reply to comment #26)
> I don't think so. Once a feature has been reviewed and has been finalized by
> the developer it should be also pushed to the master immediately. That way we
> can get wider exposure.
What Henrik said. We should land on both branches simultaneously when fixing issues on 1.4.2. When doing 2.0 work on Master, that will not be landed on 1.4.2 (perhaps obviously), but everything that gets changed on 1.4.2 should also be changed on master.
Reporter | ||
Comment 29•14 years ago
|
||
Hasn't been landed on the master yet -> reopen.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 30•14 years ago
|
||
(In reply to comment #29)
> Hasn't been landed on the master yet -> reopen.
:ctalbert, :henrik: well, this really depresses me. I've tried really hard to follow what development guidelines we have for mozmill and 1.4.2 and future. However this isn't documented:
https://wiki.mozilla.org/Auto-tools/Projects/Mozmill/RepoSetup#Setting_up_for_Development_.28for_1.4.2.29
We should document our workflow so that dummies like me have some set of instructions in order to follow.
Also, I wasn't going to bring it up, but what is the purpose of pushing to two branches? In most development methodologies I am familiar with (both single-trunk VCS and mercuria/bzr DVCS) there is a trunk/master which is either stable or unstable. If the trunk is stable, the development is done on whatever active branch until it is collapsed to another branch or to trunk, and then the development branch is closed except for backports.
In the other model, the trunk is unstable and development is done on the trunk. The changes are, sometimes selectively, moved over to a release branch at a time when a release is near.
Are we following either of these models? What is the point of pushing to both branches instead of doing so? Why isn't this documented?
I'm sorry to be an ass, but quite frrankly I try really hard to learn and use a version control system well that I don't like very much. When on the day before a summit I find that my bugs are reopened, I wonder if I am really as much of a horrible programmer and possibly person as I feel like right now.
Reporter | ||
Comment 31•14 years ago
|
||
(In reply to comment #30)
> https://wiki.mozilla.org/Auto-tools/Projects/Mozmill/RepoSetup#Setting_up_for_Development_.28for_1.4.2.29
>
> We should document our workflow so that dummies like me have some set of
> instructions in order to follow.
Well, we miss the part for the master branch. I will leave the documentation up to Clint.
> Also, I wasn't going to bring it up, but what is the purpose of pushing to two
> branches? In most development methodologies I am familiar with (both
> single-trunk VCS and mercuria/bzr DVCS) there is a trunk/master which is either
> stable or unstable. If the trunk is stable, the development is done on
> whatever active branch until it is collapsed to another branch or to trunk, and
> then the development branch is closed except for backports.
We should discuss that offline and not on this bug.
> Are we following either of these models? What is the point of pushing to both
> branches instead of doing so? Why isn't this documented?
Master will get refactoring patches soon which will not land on the 1.4.2 branch. We can't use 1.4.2 for mozmill-2.0 development. Because of that you will have to push to both master and 1.4.2.
> version control system well that I don't like very much. When on the day
> before a summit I find that my bugs are reopened, I wonder if I am really as
> much of a horrible programmer and possibly person as I feel like right now.
I don't see a reason why it has something todo with the summit here. You have only fixed the problem on 1.4.2 and not master. I have already stated that on comment 26. So once you have pushed the fix to the master you can mark this bug as closed.
Status: REOPENED → ASSIGNED
Reporter | ||
Comment 32•14 years ago
|
||
Jeff, when will you push the patch to the master? Even there is no documentation it will work the same way as for 1.4.2. Thanks.
Assignee | ||
Comment 33•14 years ago
|
||
(In reply to comment #32)
> Jeff, when will you push the patch to the master? Even there is no
> documentation it will work the same way as for 1.4.2. Thanks.
done, http://github.com/mozautomation/mozmill/commit/e99a3d115217709e0114578bd9fe744da69c9f0f
Status: ASSIGNED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Reporter | ||
Comment 34•14 years ago
|
||
Tested locally and on our qa-horus box. There is no hanging test-run anymore. I'm impressed. Even with that verification I have filed other bugs I have discovered as follow-up bugs. See the dependency list.
Marking this one as verified fixed.
Status: RESOLVED → VERIFIED
Reporter | ||
Comment 35•14 years ago
|
||
Comment on attachment 454653 [details] [diff] [review]
patch to mozmill to support global timeout
>@@ -502,16 +518,20 @@ class CLI(jsbridge.CLI):
> help="Report the results. Requires url to results server.")
> parser_options[("--showall",)] = dict(dest="showall", default=False, action="store_true",
> help="Show all test output.")
>+ parser_options[("--timeout",)] = dict(dest="timeout", type="float",
>+ default=60.,
>+ help="seconds before harness timeout if no communication is taking place")
>
> def __init__(self, *args, **kwargs):
> super(CLI, self).__init__(*args, **kwargs)
> self.mozmill = self.mozmill_class(runner_class=mozrunner.FirefoxRunner,
> profile_class=mozrunner.FirefoxProfile,
>- jsbridge_port=int(self.options.port))
>+ jsbridge_port=int(self.options.port),
>+ jsbridge_timeout=self.options.timeout
>+ )
Not a very friendly way to set the options. Derived classes do not have a chance to modify those values. I will try how to workaround that issue tomorrow. If needed I will file a follow-up bug.
Reporter | ||
Comment 36•14 years ago
|
||
(In reply to comment #35)
> Not a very friendly way to set the options. Derived classes do not have a
> chance to modify those values. I will try how to workaround that issue
> tomorrow. If needed I will file a follow-up bug.
'self.mozmill.jsbridge_timeout = value' did the trick for now. I hope that it will become more friendly with the reorganization of the classes for Mozmill 2.0.
For the software update tests I'm able now to set a longer timeout value. So we are good here.
Assignee | ||
Comment 37•14 years ago
|
||
(In reply to comment #35)
> Comment on attachment 454653 [details] [diff] [review]
> patch to mozmill to support global timeout
>
> >@@ -502,16 +518,20 @@ class CLI(jsbridge.CLI):
> > help="Report the results. Requires url to results server.")
> > parser_options[("--showall",)] = dict(dest="showall", default=False, action="store_true",
> > help="Show all test output.")
> >+ parser_options[("--timeout",)] = dict(dest="timeout", type="float",
> >+ default=60.,
> >+ help="seconds before harness timeout if no communication is taking place")
> >
> > def __init__(self, *args, **kwargs):
> > super(CLI, self).__init__(*args, **kwargs)
> > self.mozmill = self.mozmill_class(runner_class=mozrunner.FirefoxRunner,
> > profile_class=mozrunner.FirefoxProfile,
> >- jsbridge_port=int(self.options.port))
> >+ jsbridge_port=int(self.options.port),
> >+ jsbridge_timeout=self.options.timeout
> >+ )
>
> Not a very friendly way to set the options. Derived classes do not have a
> chance to modify those values. I will try how to workaround that issue
> tomorrow. If needed I will file a follow-up bug.
I'm not sure what you mean by this. Are you talking about derived classes to the CLI interface? Ultimately, CLI shouldn't be a controller, which it currently is, if that's what you're referring to. Why not instantiate the MozMill object in the derived class instead of going through CLI?
Assignee | ||
Comment 38•14 years ago
|
||
(In reply to comment #36)
> (In reply to comment #35)
> > Not a very friendly way to set the options. Derived classes do not have a
> > chance to modify those values. I will try how to workaround that issue
> > tomorrow. If needed I will file a follow-up bug.
>
> 'self.mozmill.jsbridge_timeout = value' did the trick for now. I hope that it
> will become more friendly with the reorganization of the classes for Mozmill
> 2.0.
"Touching" values with more than one dot is ultimately subject to implementation changes.
> For the software update tests I'm able now to set a longer timeout value. So we
> are good here.
Reporter | ||
Comment 39•14 years ago
|
||
(In reply to comment #37)
> I'm not sure what you mean by this. Are you talking about derived classes to
> the CLI interface? Ultimately, CLI shouldn't be a controller, which it
> currently is, if that's what you're referring to. Why not instantiate the
> MozMill object in the derived class instead of going through CLI?
Well, with your refactoring of the CLI stuff into the Mozmill classes it should be possible now. There wasn't a way for me with Mozmill 1.4.1. I will have to check that once Mozmill 1.4.2 has been released and I could update our automation scripts.
(In reply to comment #38)
> "Touching" values with more than one dot is ultimately subject to
> implementation changes.
That's my concern too. But to stay compatible with Mozmill 1.4.1 for now, I do not have another option. I have to subclass the CLI interface. Once the above step has been done, this dependency would have been killed.
Updated•8 years ago
|
Product: Testing → Testing Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•