Bug 571630 (Closed) - Mozmill tests mystically stop in the middle of a test-run
Opened 14 years ago · Closed 14 years ago
Categories: Testing Graveyard :: Mozmill (defect)
Tracking: (Not tracked)
Status: VERIFIED WORKSFORME
People: Reporter: whimboo; Assigned: k0scist
Whiteboard: [mozmill-1.4.2+]
Attachments: 3 files, 1 obsolete file
For the last several days - I can't really recall when it started to happen - the Mozmill test-runs on our machines (mainly VMs) stall at some point. I suspect it is the step when all normal tests have been processed and Firefox has to be shut down before the restart tests start. In exactly this state, no error is listed in the Error Console. So it's nothing obvious we could fix; it needs some deeper investigation.
In our Windows VMs we start the test-run with a scheduled task which simply calls the start.bat file of the MozillaBuild environment. This bat file has a login script assigned, which is our testrun_daily.py script. It gets executed the same way as before. The setup is similar on Linux and OS X; both call the testrun_daily.py script directly.
Could this somehow be related to how we close Firefox? Is the event not processed correctly, so that the Mozmill back-end waits for a shutdown that never happens? Curiously it doesn't happen all the time, which makes it even harder to find a way to reproduce.
This is really a blocker for us and we will have to investigate it. If it's related to Mozmill it has to be fixed for 1.4.2 or even earlier.
Reporter
Comment 1•14 years ago
One of our test-runs on Win7 x64 stopped here for Namoroka:
Test Failed : testVerifyDefaultBookmarks in c:\users\mozilla\appdata\local\temp\tmp7o8bxk.mozmill-tests\firefox\restartTests\testDefaultBookmarks\test1.js
Passed 215 :: Failed 9 :: Skipped 0
Test Failed : testVerifyDefaultBookmarks in c:\users\mozilla\appdata\local\temp\tmp7o8bxk.mozmill-tests\firefox\restartTests\testDefaultBookmarks\test1.js
Test Failed : testVerifyDefaultBookmarks in c:\users\mozilla\appdata\local\temp\tmp7o8bxk.mozmill-tests\firefox\restartTests\testDefaultBookmarks\test1.js
After the first restart test finished and Mozmill restarted for the next restart test, nothing happened. I will also check our other boxes.
Reporter
Comment 2•14 years ago
Today's test-runs were all ok, except on Windows 2000. Namoroka hung after the normal tests and Minefield after the software update tests. I had to kill the process twice.
It looks like you, Clint or Jeff, could use our Win2000 VM as a system to reproduce the hang for the needed investigation.
Assignee
Comment 3•14 years ago
(In reply to comment #2)
> Today's test-runs were all ok, except on Windows 2000. Namoroka hung after the
> normal tests and Minefield after the software update tests. I had to kill the
> process twice.
>
> It looks like you, Clint or Jeff, could use our Win2000 VM as a system to
> reproduce the hang for the needed investigation.
I should be able to reproduce this on the CentOS 5 VM, as this is where I have consistently seen the hang before. For now I'm working under the assumption that the issue is the same (it sounds like it is). I'll investigate and report back when I know more.
Assignee
Comment 4•14 years ago
Not a very exciting picture. This just shows the stalled state of minefield on the CentOS VM. When running, firefox is entirely gray. Not sure if this is related, but erring on the side of accumulating evidence here.
Assignee
Comment 5•14 years ago
Log output of buildbot's mozmill step (see http://k0s.org/mozilla/hg/buildbotcustom-patches/file/5a630eaf0afa/bug-516984#l231 ). Mozmill hangs at the `testDownloadManagerClosed` test and stays there until buildbot kills it:
"""
<snip/>
INFO:mozmill:mozmill.setTest :: {'name': 'testDownloadManagerClosed', 'filename': '/builds/testor/slave/full/build/mozmill/tests/firefox/testPrivateBrowsing/testDownloadManagerClosed.js'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.click()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.click()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.check(ID: showWhenDownloading, state: false)'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.keypress()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.open()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.click()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
command timed out: 600 seconds without output, killing pid 2530
process killed by signal 9
program finished with exit code -1
elapsedTime=725.384229
"""
Because of this, the restart tests don't work:
"""
(Gecko:2631): GnomeUI-WARNING **: While connecting to session manager:
Authentication Rejected, reason : None of the authentication protocols specified are supported and host-based authentication failed.
(Gecko:2631): GnomeUI-WARNING **: While connecting to session manager:
Authentication Rejected, reason : None of the authentication protocols specified are supported and host-based authentication failed.
jsbridge: Exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIServerSocket.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: file:///tmp/tmpVCwYNd.mozrunner/extensions/jsbridge@mozilla.com/resource/modules/server.js :: anonymous :: line 274" data: no]
command timed out: 300 seconds without output, killing pid 2622
process killed by signal 9
program finished with exit code -1
elapsedTime=303.401281
"""
I have seen similar behaviour running by hand. I'll try again interactively and see if I can pin down anything.
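For reference, the "command timed out: 600 seconds without output, killing pid ..." lines above are buildbot acting as an output-inactivity watchdog: it kills the child once it stops producing output. A minimal Python sketch of that idea (illustrative only, not buildbot's actual implementation; the function name and timeout value are assumptions):
{{{
import subprocess
import sys
import threading

def run_with_inactivity_timeout(cmd, timeout=600):
    """Run `cmd`, killing it if it emits no output for `timeout` seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    timer = [None]

    def arm():
        # (Re)start the kill timer whenever a new line of output arrives.
        if timer[0] is not None:
            timer[0].cancel()
        timer[0] = threading.Timer(timeout, proc.kill)
        timer[0].start()

    arm()
    for line in iter(proc.stdout.readline, ''):
        sys.stdout.write(line)
        arm()
    timer[0].cancel()
    return proc.wait()
}}}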
Assignee
Comment 6•14 years ago
Confirmed: running manually in the same way that buildbot does gives me the same results:
{{{
export PYTHONHOME=/builds/testor/slave/full/build/mozmill
cd /builds/testor/slave/full/build/mozmill
sleep 5 && bin/python bin/mozmill --showall -b /builds/testor/slave/full/build/firefox/firefox -t tests/firefox
}}}
Hangs indefinitely (note: /builds/testor is the path to my test buildslave).
I did the `sleep 5 && ...` business to confirm that the window does not properly receive focus (I move off-screen during this 5-second delay), and indeed the refresh issue is the same as when running via buildbot. Not sure if this is a VM issue or a CentOS 5 issue (or some other environmental factor). This does not happen on my non-VM Ubuntu 10.06 machine.
As per buildbot, trying to close the windows gives a "Not responding" dialog. Even after a (GUI) force kill, the process remains in `ps aux`, though the windows are gone. Rerunning mozmill (or mozmill-restart) without genuinely killing the process gives a similar message to the one in the buildbot logs:
"""
[root@localhost mozmill]# bin/python bin/mozmill --showall -b /builds/testor/slave/full/build/firefox/firefox -t tests/firefox
(Gecko:3020): GnomeUI-WARNING **: While connecting to session manager:
Authentication Rejected, reason : None of the authentication protocols specified are supported and host-based authentication failed.
(Gecko:3020): GnomeUI-WARNING **: While connecting to session manager:
Authentication Rejected, reason : None of the authentication protocols specified are supported and host-based authentication failed.
jsbridge: Exception: [Exception... "Component returned failure code: 0x80004005 (NS_ERROR_FAILURE) [nsIServerSocket.init]" nsresult: "0x80004005 (NS_ERROR_FAILURE)" location: "JS frame :: file:///tmp/tmp-J7TDt.mozrunner/extensions/jsbridge@mozilla.com/resource/modules/server.js :: anonymous :: line 274" data: no]
"""
The tests do not run in this case (mozmill hangs immediately and never returns). A new Firefox window is opened, but it is never painted (rendered).
Assignee
Comment 7•14 years ago
(In reply to comment #5)
> Log output of buildbot's mozmill step [...] Mozmill hangs at the
> `testDownloadManagerClosed` test and stays there until buildbot kills it:
> <snip/>
> I have seen similar behaviour running by hand. I'll try again interactively
> and see if I can pin down anything.
Running interactively allows one to get the traceback:
INFO:mozmill:mozmill.setTest :: {'name': 'testDownloadManagerClosed', 'filename': '/builds/testor/slave/full/build/mozmill/tests/firefox/testPrivateBrowsing/testDownloadManagerClosed.js'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.click()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.click()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.check(ID: showWhenDownloading, state: false)'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.keypress()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.open()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.click()'}
DEBUG:mozmill:Test Pass: {'function': 'Controller.waitForElement()'}
Traceback (most recent call last):
File "/builds/testor/slave/full/build/mozmill/bin/mozmill", line 8, in <module>
load_entry_point('mozmill==1.4.1', 'console_scripts', 'mozmill')()
File "/builds/testor/slave/full/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 576, in cli
CLI().run()
File "/builds/testor/slave/full/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 553, in run
self._run()
File "/builds/testor/slave/full/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 524, in _run
self.options.report)
File "/builds/testor/slave/full/build/mozmill/lib/python2.5/site-packages/mozmill/__init__.py", line 197, in run_tests
frame.runTestDirectory(test)
File "/builds/testor/slave/full/build/mozmill/lib/python2.5/site-packages/jsbridge/jsobjects.py", line 127, in __call__
response = self._bridge_.execFunction(self._name_, args)
File "/builds/testor/slave/full/build/mozmill/lib/python2.5/site-packages/jsbridge/network.py", line 205, in execFunction
return self.run(_uuid, 'bridge.execFunction('+ ', '.join(exec_args)+')', interval)
File "/builds/testor/slave/full/build/mozmill/lib/python2.5/site-packages/jsbridge/network.py", line 186, in run
sleep(interval)
KeyboardInterrupt
Comment 8•14 years ago
Given that these are happening in different places, whimboo, can you confirm whether this is the same hang?
Assignee
Comment 9•14 years ago
(In reply to comment #8)
> given that these are happening in different places, whimboo, can you confirm if
> this is the same hang?
If not, I'll file separately (though the symptoms appear similar). Again, this only happens for me on a VM, and then it happens very consistently, at least in the case where the VM goes out of focus during the test run.
Assignee
Comment 10•14 years ago
(In reply to comment #4)
> Created an attachment (id=451062) [details]
> firefox not refreshing on CentOS VM
>
> Not a very exciting picture. This just shows the stalled state of minefield on
> the CentOS VM. When running, firefox is entirely gray. Not sure if this is
> related, but erring on the side of accumulating evidence here.
It turns out this is a red herring: Firefox doesn't render even when the tests are run interactively!
Assignee
Comment 11•14 years ago
Comment on attachment 451062 [details]
firefox not refreshing on CentOS VM
no need for this attachment; unrelated (and currently unknown) issue
Attachment #451062 - Attachment is obsolete: true
Assignee
Comment 12•14 years ago
Based on where the error occurs (when jsbridge is waiting for feedback) and the fact that Firefox becomes unresponsive (consistently for me, though maybe :whimboo has different results), this could be at least in part a Firefox issue.
Assignee
Comment 13•14 years ago
Preliminary guess that could be completely wrong:
Looking at the sending side of jsbridge, it seems that we try to err out on socket sending failures (in Bridge.run()) and, because they're asynchronous (?), an error never occurs even if Firefox is hung.
We can time out on this. It's not an elegant solution.
We can try to do something more synchronous, at least as far as the update pings go. The rest can (and probably should) remain async.
Also TBD is why Firefox hangs here. Does the JSBridge extension cause it to hang? Or is it even worse than that?
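For illustration, a minimal sketch of the timeout idea against the polling loop the tracebacks point at (jsbridge/network.py, which ends up in `sleep(interval)` inside `run()`). The `send` call and the `callbacks` registry here are assumptions for the sketch, not the actual jsbridge internals:
{{{
import time

class BridgeTimeout(Exception):
    """Raised when the JS side never answers a bridge call."""

def run_with_timeout(bridge, uuid, code, interval=0.25, timeout=60.0):
    # Instead of polling forever for the callback keyed by `uuid`,
    # give up after `timeout` seconds so a hung Firefox surfaces as
    # an explicit error rather than an indefinite hang.
    bridge.send(code)                    # assumed send primitive
    deadline = time.time() + timeout
    while uuid not in bridge.callbacks:  # assumed callback registry
        if time.time() > deadline:
            raise BridgeTimeout("no response for %s after %s seconds"
                                % (uuid, timeout))
        time.sleep(interval)
    return bridge.callbacks.pop(uuid)
}}}
As noted, this only bounds the wait; it doesn't explain why Firefox stops responding in the first place.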
Reporter
Comment 14•14 years ago
OK, so what about the following facts:
* Why haven't we seen this freeze in the past? It only started to manifest recently. So I wonder if we should run tests by simply:
** Using an environment with an older Mozmill (and jsbridge, mozrunner) release installed. It would be good to know if we have a regression in Mozmill 1.4.1. Jeff, I guess you are the one who could most easily do this, because you can always reproduce the situation. I don't want to modify our daily test-run box.
** Using Firefox builds from one or two months ago. If we don't have a regression in Mozmill we should run tests with older versions of Firefox. It could be that a patch which affects us has landed on all three branches. Al, would you have a wild guess, or are you not aware of any bug which could be related to this problem?
** Given that this happens across platforms, I don't think an OS update is the cause.
Assignee
Updated•14 years ago
Assignee: nobody → jhammel
Assignee
Updated•14 years ago
Assignee: jhammel → nobody
Assignee
Comment 15•14 years ago
I am moving my observed behaviour to bug 574043 as it may not be the same issue. Disregard my comments here as they may not pertain to what :whimboo observes. I am also unassigning myself, as I am unfamiliar with this issue.
Reporter
Comment 16•14 years ago
Today's test-run on Windows 2000 didn't hang with Mozmill 1.4. I will wait for tomorrow's test-run. I can hand the VM over to you once I'm sure we can reproduce it with Mozmill 1.4.1 on a consistent basis.
Reporter
Comment 17•14 years ago
Clint is copying the Win2k VM from the qa-horus machine now. Here are some facts:
* There is a Mozmill-1.4 and a Mozmill-1.4.1 VM, which run our daily test-runs
* Mozmill-1.4 has a tweaked extension for compatibility with Minefield
* Check the task scheduler for the task we start each day (~8am PDT)
* It's highly reproducible when you let Windows start the test-run at 8am
Reporter
Comment 18•14 years ago
Today I got a similar hang on my OS X box at home. It also happens during the checkDefaultBookmarks test. Here is the Python stack for that hang:
File "/Volumes/data/testing/envs/release/bin/mozmill-restart", line 8, in <module>
load_entry_point('mozmill==1.4.1', 'console_scripts', 'mozmill-restart')()
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/mozmill/__init__.py", line 582, in restart_cli
RestartCLI().run()
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/mozmill/__init__.py", line 569, in run
super(RestartCLI, self).run(*args, **kwargs)
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/mozmill/__init__.py", line 553, in run
self._run()
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/mozmill/__init__.py", line 524, in _run
self.options.report)
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/mozmill/__init__.py", line 443, in run_tests
self.run_dir(d, report, sleeptime)
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/mozmill/__init__.py", line 390, in run_dir
frame.runTestFile(test)
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/jsbridge/jsobjects.py", line 127, in __call__
response = self._bridge_.execFunction(self._name_, args)
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/jsbridge/network.py", line 205, in execFunction
return self.run(_uuid, 'bridge.execFunction('+ ', '.join(exec_args)+')', interval)
File "/Volumes/data/testing/envs/release/lib/python2.6/site-packages/jsbridge/network.py", line 186, in run
sleep(interval)
KeyboardInterrupt
Assignee
Comment 19•14 years ago
(In reply to comment #18)
> Today I got a similar hang on my OS X box at home. It also happens during the
> checkDefaultBookmarks test. Here is the Python stack for that hang:
> <snip/>
Could you try the fix from http://github.com/mozautomation/mozmill/tree/1.4.2 ?
Also, can you tell what the test is trying to do? When I observed something similar, it wasn't (technically) a problem with mozmill (or jsbridge, etc.); it was that the test was waiting to close a dialog window while another dialog window was on top, so it would just wait indefinitely.
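As a general pattern (this is not mozmill's actual API), about the only thing a harness can do with that kind of indefinite wait is to bound it, so that it turns into a visible failure. A minimal Python sketch with hypothetical names:
{{{
import time

class WaitTimeout(Exception):
    """A condition that should have become true never did."""

def wait_for(condition, timeout=30.0, interval=0.1, msg="condition"):
    # Poll `condition()` until it returns True or `timeout` seconds pass.
    # A dialog that never closes (e.g. because another modal dialog is on
    # top of it) then shows up as an explicit failure instead of a hang.
    deadline = time.time() + timeout
    while not condition():
        if time.time() > deadline:
            raise WaitTimeout("timed out after %s seconds waiting for %s"
                              % (timeout, msg))
        time.sleep(interval)

# Hypothetical usage:
# wait_for(lambda: download_manager_is_closed(),
#          timeout=10, msg="download manager to close")
}}}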
Reporter
Comment 20•14 years ago
Clint has the Win2k VM now. So you should be able to reproduce it too. If you don't have success I will try to get the dev version of Mozmill installed.
Assignee
Updated•14 years ago
Assignee: nobody → jhammel
Reporter
Comment 21•14 years ago
If you don't have the /z/ drive you have to do the following:
1. Clone http://hg.mozilla.org/qa/mozmill-automation/ to the /c/ drive
2. Go to /c/mozilla-build/ and open start-mozmill-testrun.bat
3. Replace the /z/data/scripts folder with the one from step 1
4. Run the test-run via the task scheduler at a given time, not manually.
Assignee
Comment 22•14 years ago
This now hangs on the Windows 2000 VM with the test_daily script. I'm not sure how reliable this hang point is; I will investigate. Some facts:
* the browser is still responsive
* this is with whatever default mozmill the test_daily script invokes (the 1.4.1 series)
I would like to try to reproduce it "by hand" (that is, by running `mozmill [options]`). If that works, it will pinpoint the problem much more precisely.
Then I would like to see if this occurs with mozauto master. Maybe the jsbridge timeout will fix this problem.
If that is the case, would it be okay to mark this bug as closed, possibly with 579791 as a dependency? If not, what is considered a good enough fix?
Reporter
Comment 23•14 years ago
(In reply to comment #22)
> Then I would like to see if this occurs with mozauto master. Maybe the
> jsbridge timeout will fix this problem.
>
> If that is the case, would it be okay to mark this bug as closed, possibly with
> 579791 as a dependency? If not, what is considered a good enough fix?
It's good to hear that you are able to reproduce it. But the global timeout patch doesn't fix the problem, it only hides it elegantly. We definitely have to find out what's causing this problem. I'm not really sure how much more information I can supply here. The only thing is that you should save the state of the VM before testing, so you can always go back and replay the same test. I hope that helps to find the underlying reason.
As said above, I wasn't able to find the reason with an official build of Mozmill. Perhaps a dev build can help with some debug logging output.
Assignee
Comment 24•14 years ago
(In reply to comment #23)
> (In reply to comment #22)
> > <snip/>
> It's good to hear that you are able to reproduce it.
I didn't claim that much. I get to a point where it hangs. Whether it is the same hang as alluded to in this bug, I can't really know; I can only assume the issues are related or identical and go from there.
> But the global timeout
> patch doesn't fix the problem, it hides it elegantly. We definitely have to
> find out what's causing this problem.
I agree that we have to figure out what's going on. I don't necessarily agree that timeouts (again, a global timeout should be a last-ditch effort; tests should also be timed out individually) are not the correct fix as far as the harness is concerned. If there are failing tests, intermittent or not, then those should be fixed. If the harness isn't doing as expected, then that should be fixed. There is a lot of gray area between these two points. Tests should behave as nicely as possible, and in general ease-of-use should flow upstream from test -> shared_modules -> mozmill as the need arises. But, ultimately, mozmill can't do much more for, say, waiting for an element that never appears than timing out.
Since what we have here is an unknown unknown, the first step is to determine what the problem is and whether it should be classified as a harness failure. I can't jump to fixing the error without figuring out what it is. In general the steps are as follows:
1. Reproduce the failure as seen; I've possibly done this. In any case, I've been able to get a hang.
2. Reproduce the failure running the tests manually, i.e. from the command line. Still in progress.
3.a. Determine the minimum command (with respect both to complexity and turnaround time) required to reproduce the failure. (This is 3.a. because 3.a and 3.b are interdependent.)
3.b. Determine whether the failure is intermittent, and how intermittent it is: every other time? every 30th time? While intermittent failures should be fixed, they are much harder to pin down.
4. Once the problem is reliably reproducible (while continuing to seek steps that increase this reliability), introduce diagnostics appropriate to actually find what the problem is. These diagnostics may be introduced on an ad hoc basis, or they may be debugging tools worth having in general.
All of these steps just get to what the problem is, which is a huge unknown right now. So "fixing the problem" can't simply mean "making this issue go away". The next step is:
5. Figure out what to do next. This will depend on what the problem is. For this particular case, it may involve fixing the tests, fixing the harness, or both. But that can't be known until the problem is determined. It may be advisable to re-ticket the issue at that point.
> I'm not really sure how much more
> information I can supply to help in here. The only thing is that you should
> save the state of the VM before testing it, so you can always go back and
> replay the same test. I hope it helps to find the underlying reason.
I'm not really an expert at VMs or VM workflows. I would prefer to use VirtualBox, really; I'm using VMware just for this bug, and VirtualBox is a free and open-source tool.
> As said above, I wasn't able to find the reason with an official build of
> Mozmill. Perhaps a dev build can help with some debug logging output.
In general, the more reproduction steps provided, the easier it is to diagnose. There is also a good deal of leverage gained from having the issue reporter provide them. For this specific bug, I'm not sure what I'm looking for, and I don't know Windows very well, which will make reproduction and a solution much harder.
Assignee
Comment 25•14 years ago
another hang observed
Comment 26•14 years ago
(In reply to comment #24)
> (In reply to comment #23)
> > (In reply to comment #22)
> > > Then I would like to see if this occurs with mozauto master. Maybe the
> > > jsbridge timeout will fix this problem.
> > >
> > > If that is the case, would it be okay to mark this bug as closed, possibly with
> > > 579791 as a dependency? If not, what is considered a good enough fix?
> >
If it turns out that this was fixed by some other bug, it should be marked as WORKSFORME; that's the standard way we denote such things.
> I agree that we have to figure out what's going on. I don't necessarily agree
> that timeouts (again, global timeout should be a last-ditch effort; tests
> should also be timed out individually) is not the correct fix as far as the
> harness is concerned. If there is a failing test(s), intermittently or not,
> then those should be fixed. If the harness isn't doing as expected, then that
> should be fixed. There is a lot of gray area between these two points. Tests
> should behave as nicely as possible, and in general ease-of-use should flow
> upstream from test -> shared_modules -> mozmill as the need arises. But,
> ultimately, mozmill can't do much more for, say, waiting for an element that
> never appears than timing out.
Exactly.
>
> In general the steps are as follows:
>
> 1. Reproduce the failure as seen; I've possibly done this. In any case, I've
> been able to get a hang.
>
> 2. Reproduce the failure running the tests manually. Still in progress. In
> other words, reproduce it from the command line.
>
> 3.a. Determine the minimum command (with respect both to complexity and
> turnaround time) required to reproduce the failure. This is 3.a. as 3.a and
> 3.b are interdependent
Henrik, why can't we determine what command your script uses to start Mozmill? This is all automated; why can't we run that script by hand? Why do we have to launch it from the scheduler to see this hang? If that's the case, then this is likely an OS integration issue, and frankly, if it is an OS integration issue with Windows 2000, I probably don't care, as we're about to de-list support for it.
> 5. Figure out what to do next. This will depend on what the problem is. For
> this particular case, it may involve fixing the tests, fixing the harness, or
> both. But that can't be known until the problem is determined. It may be
> advisble to reticket the issue at that point.
Whatever the correct solution is, we'll just use this bug as a vehicle for it. We'll just change the summary to indicate what it's about.
>
> > I'm not really sure how much more
> > information I can supply to help in here. The only thing is that you should
> > save the state of the VM before testing it, so you can always go back and
> > replay the same test. I hope it helps to find the underlying reason.
>
> I'm not really an expert at how to use VMs or workflow. I would prefer to use
> virtualbox, really, as I'm using VMWare just for this bug and virtualbox is a
> free and opensource tool.
Both of them can store a snapshot. So if you know that the VM will hang right before the test, then you can store a snapshot of the state at that point. Then you can restore the snapshot, run the test and you will have a reproducible hang.
With VMware, there is also the ability to do record/replay, where you can basically record execution and replay it under a debugger. That's probably what Henrik is talking about, but it doesn't help us here since replay debugging is not supported with Windows 2000 as the guest OS.
>
> > As said above, I wasn't able to find the reason with an official build of
> > Mozmill. Perhaps a dev build can help with some debug logging output.
>
Henrik, what do you mean? You weren't able to use --showall to determine which test was hanging? You weren't able to watch the output and guess, from where the run was, which test it was hanging on? Does it only happen with 1.4.1? I thought you said that it didn't happen with 1.4.1? Maybe I have my numbers backwards? I'm kind of confused.
Assignee
Updated•14 years ago
OS: All → Windows 2000
Hardware: All → x86
Assignee
Comment 27•14 years ago
(In reply to comment #26)
> (In reply to comment #24)
> > (In reply to comment #23)
> > > (In reply to comment #22)
<snip/>
> > turnaround time) required to reproduce the failure. This is 3.a. as 3.a and
> > 3.b are interdependent
> Henrik, why can't we determine what command your script uses to start Mozmill?
We can, I'm doing the detective work now.
> This is all automated, why can't we run that script by hand? Why do we have to
> launch it from the scheduler to see this hang?
The hangs I have seen have not been from the scheduler, but from running `python test_daily.py --config=/path/to/config` directly. This is one of the reasons I'm not sure it is the same issue. Once I have a feel for the various failures, I will drill down to running mozmill directly and ascertain whether it fails in the same way (my guess: it will). Then hopefully I can drill down to a subset of tests that fails (at least intermittently).
> If that's the case, then this
> is likely an OS integration issue, and frankly if it is an OS integration issue
> with windows 2k, I probably don't care as we're about to de-list support for
> it.
Has this been tried on other OSes? Is the plan to keep the QA box on Windows 2000? Is it worthwhile to test on other versions of Windows? I have an XP VM (the build machine VM) and a non-VM installation of Windows 7. I am not confident that I can set up a parallel environment, but if this is worth investigating, I can do it.
> > 5. Figure out what to do next. This will depend on what the problem is. For
> > this particular case, it may involve fixing the tests, fixing the harness, or
> > both. But that can't be known until the problem is determined. It may be
> > advisble to reticket the issue at that point.
> Whatever the correct solution is, we'll just use this bug as a vehicle for it.
> We'll just change the summary to indicate what it's about.
K.
>
> >
> > > I'm not really sure how much more
> > > information I can supply to help in here. The only thing is that you should
> > > save the state of the VM before testing it, so you can always go back and
> > > replay the same test. I hope it helps to find the underlying reason.
> >
> > I'm not really an expert at how to use VMs or workflow. I would prefer to use
> > virtualbox, really, as I'm using VMWare just for this bug and virtualbox is a
> > free and opensource tool.
> Both of them can store a snapshot. So if you know that the VM will hang right
> before the test, then you can store a snapshot of the state at that point.
> Then you can restore the snapshot, run the test and you will have a
> reproducible hang.
>
> With VMWare, there is also the ability to do record/replay where you can
> basically record execution and replay using a debugger. That's probably what
> Henrik is talking about, but it doesn't help us here since Replay debugging is
> not supported for windows 2000 OS as the guest OS.
I'll use my old-fashioned but tried-and-true debugging techniques then.
Reporter
Comment 28•14 years ago
We have seen this problem on other machines with XP, Vista, and I believe Win 7 as well, so it's not restricted to Win2k. I only chose Win2k because that is the machine where I have seen this hang daily. Every time, it happened after restartTests/testDefaultBookmarks.js.
Assignee
Comment 29•14 years ago
another hang on windows. This happens in shared-modules/testSearchApi.js
Reporter
Comment 30•14 years ago
(In reply to comment #29)
> Created attachment 459476 [details]
> hanging on windows 2k, no. 3
>
> another hang on windows. This happens in shared-modules/testSearchApi.js
That's not relevant to this bug. What you are seeing here is a bug in handling the modal window. It's under shared-modules/testModalDialog.js. Not really sure if we already have fixed that. Which tests are you using? A fresh pull from mozmill-tests or the ones from m-c?
Assignee
Comment 31•14 years ago
(In reply to comment #30)
> (In reply to comment #29)
> > Created attachment 459476 [details] [details]
> > hanging on windows 2k, no. 3
> >
> > another hang on windows. This happens in shared-modules/testSearchApi.js
>
> That's not relevant to this bug. What you are seeing here is a bug in handling
> the modal window. It's under shared-modules/testModalDialog.js. Not really sure
> if we already have fixed that. Which tests are you using? A fresh pull from
> mozmill-tests or the ones from m-c?
Again, I honestly have no idea what I'm looking for in this bug. I am still at the stage of running `python test_daily.py --config=/c/mozmill/testrun_daily.ini` and seeing what breaks. I have not altered any of the software on the machine, so whatever this does in the state the VM was left in is what I am doing.
If you can give me more precise reproduction steps, things will go quicker. Really, the only things I know how to do are (see comment #24 amongst others):
* guess what the failure is. I don't know the symptoms exactly, so I don't know what I'm looking for. I can only guess that every hang ~= mystically stops.
* record each failure. These can then be assessed and fixed.
* once I have this corpus, narrow down to something that I think reproduces the bug.
Assignee
Comment 32•14 years ago
Got the failure on firefox/softwareUpdate/testDirectUpdate/test1.js ( https://bug571630.bugzilla.mozilla.org/attachment.cgi?id=458851 )
Is this the failure? Or is this unrelated?
Reporter
Comment 33•14 years ago
I have new steps. Please follow them; hopefully you will then be able to reproduce.
Steps:
1. Do a git pull mozauto 1.4.2 to get the latest dev version
2. Get all mozmill-tests
3. Execute Mozmill like "mozmill -b /Applications/Minefield.app/ --show-errors -t firefox/testTabbedBrowsing/"
As you can see above, we hang inside the teardownModule. It's not related to its content, because even removing all the code doesn't help. We are still waiting for something and don't continue with the next test; that means after the global timeout we kill Firefox.
Reporter
Comment 34•14 years ago
Looks like this problem is specific to bug 579943. See bug 579943 comment 14 ff.
I will have to go back to square one. Sorry for the noise.
OS: Windows 2000 → All
Hardware: x86 → All
Comment 35•14 years ago
Unreproducible. Please open follow-on bugs if it comes back.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → WORKSFORME
Reporter
Comment 36•14 years ago
Will do. As discussed in the meeting, I haven't seen it anymore in the last few days.
Status: RESOLVED → VERIFIED
Updated•8 years ago
Product: Testing → Testing Graveyard