Closed Bug 420216 Opened 17 years ago Closed 16 years ago

buildbot should be able to kill subprocesses properly

Categories

(Release Engineering :: General, defect, P3)

defect

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rhelmer, Unassigned)

References

()

Details

Attachments

(1 obsolete file)

http://twistedmatrix.com/trac/ticket/2726

This patch seems to let twisted kill subprocesses on win32 reliably, from my quick tests on 1.9 relauto staging. It needs to be installed on the win32 slaves (I hacked it in-place on site-packages to test, which I'll revert so we can do it right :P).

We should totally take this patch, and beg them to check it in upstream. There are no negative comments on the upstream bug, I think they just haven't gotten around to it, yet.
Confirmed that the same box totally breaks without this same patch (throws exception, leaves subprocs around, etc).

I ran it several times with the patch, exits cleanly on the first try, all subprocs (perl calling make calling cvs) dead, as expected.
*swoon*

We should get this installed on relevant machines (talos/unittest/tryserver seem to the most vulnerable) pretty quickly! We don't have twisted checked in anywhere, which is fine, so we can just locally install these patches, I guess.
I would like to echo bhearsum's "swoon" and add a <3.

Hopefully we can lobby the twisted people to get this checked in. Nice work, Rob!
(In reply to comment #3)
> I would like to echo bhearsum's "swoon" and add a <3.
> 
> Hopefully we can lobby the twisted people to get this checked in. Nice work,
> Rob!


Check out the ticket; while this patch does work fine, it isn't really acceptable as-is for several reasons. I'm going to help get this in, but if anyone wants to help out that'd be great :)

(In reply to comment #4)
> (In reply to comment #3)
> > I would like to echo bhearsum's "swoon" and add a <3.
> > 
> > Hopefully we can lobby the twisted people to get this checked in. Nice work,
> > Rob!
> 
> 
> Check out the ticket; while this patch does work fine, it isn't really
> acceptable as-is for several reasons. I'm going to help get this in, but if
> anyone wants to help out that'd be great :)
> 

That said, this is only used in one place in Buildbot and so is fine for us. Once this does get fixed in Twisted, Buildbot is going to need to change to support it (probably using signalProcessGroup instead of signalProcess), so I think it would be totally reasonable for us to take this patch to stop the bleeding.
The plan to apply this to our active buildbots is r+ by me.  I'd like to get this in along with the select-random-slave patch.
Component: Build & Release → Release Engineering: Projects
Priority: -- → P3
QA Contact: build → release
To be clear, the goal here is for Buildbot to be able to kill subprocesses properly. Whether this is through a Twisted patch or Buildbot one doesn't matter. The Twisted jobobject patch is a possible solution, as is Buildbot launching subprocesses directly on win32.
Summary: take twisted win32 Job object patch → buildbot should be able to kill subprocesses properly
Can we get this fix in to our new buildbot 0.7.7 rollout?
Has anyone here actually tested the patch to make sure it works? I think it'd be great to roll this in if it does -- but we should confirm that first :).
still needs testing. I can try this on the moz2 machines, though I'm loathe to add more churn to those boxes.
I'm currently trying out this patch on qm-pxp-stage01 and qm-pvista-stage01.  

I've installed it by applying the patch to the correct twisted file on my local machine then copying the whole file into the correct directory on the talos machines, removing the .pyc files and then restarting the build slave.

I'm still trying to come up with a testing scenario to determine if we are actually seeing any improvement in process killing.  So far processes are killed when I shut down the build slave entirely (CTL-C in the slave cmd) and when I kill the cmd where the build slave was started.  Unfortunately, I believe that these two cases were already being correctly handled.  I'll see if I can come up with some definitive proof that it would help the talos boxes.
Assigning to myself while I do the talos side of the work.
Assignee: nobody → anodelman
Priority: P3 → P2
I forced a buildbot restart while the win boxes where testing a build and the existing browser processes where correctly killed.  I believe that this (mostly) recreates the situation that we've been seeing where we've had talos machines lose network connectivity with the master.  


From what I've seen this would be a useful patch to apply to all the talos win machines.  It won't be a magic bullet since it is unlikely to catch processes not owned by talos (like crash reporters and the like), but it would handle a common case that we hit semi-frequently.
(In reply to comment #8)
> Can we get this fix in to our new buildbot 0.7.7 rollout?

(In reply to comment #9)
> Has anyone here actually tested the patch to make sure it works? I think it'd
> be great to roll this in if it does -- but we should confirm that first :).

In meeting yesterday, Mikeal said he'd not tested this on 0.7.7, but did test with an earlier version. 

Mikeal, do you remember what version of buildbot you did test with?
No, I said I hadn't tested it but rhelmer had and that it worked for him, it's in the notes at the top of the bug.

I don't know which version he was using.
(In reply to comment #13)
> From what I've seen this would be a useful patch to apply to all the talos win
> machines.  It won't be a magic bullet since it is unlikely to catch processes
> not owned by talos (like crash reporters and the like), but it would handle a
> common case that we hit semi-frequently.

That's great news! Since this is a slave-side patch, we'll have to figure out how and when to roll this out to the talos and unittest slaves. We should be able to script something that's a little less painful than manually applying this file to all these machines.

rhelmer: any progress on getting this reviewed upstream? Anything we can do to help out?
Greetings,

I'm quite happy to hear that the solution in the patch on the Twisted tracker seems to address this issue.  Windows buildslaves have long been a maintenance nightmare for the Twisted project, too.

I'd very much like to get this fixed in Twisted.  Our Windows resources are fairly scarce, though (as you can probably tell from the level of activity on that ticket).  We're short both on actual Windows machines to test on (though we have a few Windows buildslaves, we don't actually have direct access to them), and on developers with actual Windows experience.

A couple things would greatly help get this fixed in Twisted:

  * What are the circumstances this is intended to fix?  From what I've read on this ticket and from what I know about Windows APIs, I think that the problem here is that the test process run by the buildslave creates child processes (with CreateProcess?).  When the build is cancelled or interrupted, the buildslave kills only the test process it ran, leaving the grandchildren to continue running.  The problem is solved if, instead of the current behavior, the test process and all the children it has created (and their children, and so on) are killed.  Is this accurate?

  * Do job objects have any implications aside from allowing processes to be grouped together?  That is, if Twisted is changed to always use job objects (so as to allow everything to be killed), what changes in behavior, if any, will we see for the existing supported uses of the process control API?

Thanks for any more information you can provide.
Yes, the goal of this patch is to kill grandchild+ generations of processes.

As far as I know, job objects don't have any side effects by default... they just group processes. You can set resource limits for an entire job, or terminate an entire job, but that doesn't happen unless you specifically ask for it.
(In reply to comment #17)
> Greetings,

Hi Jean-Paul,

> I'm quite happy to hear that the solution in the patch on the Twisted tracker
> seems to address this issue.  Windows buildslaves have long been a maintenance
> nightmare for the Twisted project, too.

As I understand it, this is a problem for anyone running Windows build-slaves and has been from the beginning.

> I'd very much like to get this fixed in Twisted.  Our Windows resources are
> fairly scarce, though (as you can probably tell from the level of activity on
> that ticket).  We're short both on actual Windows machines to test on (though
> we have a few Windows buildslaves, we don't actually have direct access to
> them), and on developers with actual Windows experience.

We've got access to lots of machinery and would be willing to help out if needed. This has been a source of pain for awhile now. Our most limited resource is time.

> A couple things would greatly help get this fixed in Twisted:
> 
>   * What are the circumstances this is intended to fix?  From what I've read on
> this ticket and from what I know about Windows APIs, I think that the problem
> here is that the test process run by the buildslave creates child processes
> (with CreateProcess?).  When the build is cancelled or interrupted, the
> buildslave kills only the test process it ran, leaving the grandchildren to
> continue running.  The problem is solved if, instead of the current behavior,
> the test process and all the children it has created (and their children, and
> so on) are killed.  Is this accurate?

You are completely accurate. As Ben mentioned in his reply, any processes and their ancestors should all be killed.

>   * Do job objects have any implications aside from allowing processes to be
> grouped together?  That is, if Twisted is changed to always use job objects (so
> as to allow everything to be killed), what changes in behavior, if any, will we
> see for the existing supported uses of the process control API?

As Benjamin mentions in his reply, job objects don't have any side-effects, they're just a grouping mechanism.

> Thanks for any more information you can provide.

My pleasure. Let us know if there's anything we can do to help.
Applied patch to windows talos machines:

qm-mini-xp01,02,03,04,05, qm-mini-vista01,02,03,04,05, qm-pxp-jss01,02,03, qm-pxp-fast01,02, qm-pxp-trunk01,02,03,04,05,06, qm-pxp-try01

All are now green.  Talos windows ref images still need to be updated so that all future talos windows machines will have the patch.  I will push that into the work that needs to be done on the ref images to make them cleanly reboot.  For this bug the talos work is complete.
Assignee: anodelman → nobody
Priority: P2 → P3
Multiple tinderboxes seem to have bug 381004 these days.
Do they have this patch ? Could they ?
This patch was only applied to talos testing boxes.  

It would be worth investigating if it could be applied to build machines.
I put [1] on bm-win2k3-unittest-02-hw, the tracemonkey unittest box having trouble with JIT enabled. The path on there to _dumbwin32proc.py is /c/mozilla-builds/python25/Lib/site-packages/twisted/internet/. The patch didn't apply cleanly so I had to merge manually; original is backed up to _dumbwin32proc.py.bak.

[1] http://twistedmatrix.com/trac/attachment/ticket/2726/_dumbwin32proc.patch?format=raw
(In reply to comment #23)
I ended up removing this again, as it was preventing timeouts from happening at all. The slave's twisted.log has the likes of 

Unhandled Error
Traceback (most recent call last):
  File "c:\mozilla-build\python25\Lib\site-packages\twisted\application\app.py", line 348, in startReactor
    self.config, oldstdout, oldstderr, self.profiler, reactor)
  File "c:\mozilla-build\python25\Lib\site-packages\twisted\application\app.py", line 273, in runReactorWithLogging
    reactor.run()
  File "c:\mozilla-build\python25\Lib\site-packages\twisted\internet\base.py", line 1048, in run
    self.mainLoop()
  File "c:\mozilla-build\python25\Lib\site-packages\twisted\internet\base.py", line 1057, in mainLoop
    self.runUntilCurrent()
--- <exception caught here> ---
  File "c:\mozilla-build\python25\Lib\site-packages\twisted\internet\base.py", line 705, in runUntilCurrent
    call.func(*call.args, **call.kw)
  File "c:\mozilla-build\python25\Lib\site-packages\buildbot\slave\commands.py", line 501, in doTimeout
    self.kill(msg)
  File "c:\mozilla-build\python25\Lib\site-packages\buildbot\slave\commands.py", line 510, in kill
    msg += ", killing pid %d" % self.process.pid
exceptions.TypeError: int argument required
Might want to try downgrading to Twisted 2.4 or 2.5 and trying the patch again.
Downgraded to:

http://tmrc.mit.edu/mirror/twisted/Twisted/2.5/Twisted_NoDocs-2.5.0.win32-py2.5.exe

The patch applied cleanly and now buildbot shows that a timeout occurred, and then keeps going on to the next step.
(In reply to comment #26)
> Downgraded to:
> 
> http://tmrc.mit.edu/mirror/twisted/Twisted/2.5/Twisted_NoDocs-2.5.0.win32-py2.5.exe
> 
> The patch applied cleanly and now buildbot shows that a timeout occurred, and
> then keeps going on to the next step.

To be clear, this is only on the tracemonkey win32 unittest machine, right?

Also, you mention that it proceeds to the next step -- does it kill the subprocesses properly?
yes, this is only on bm-win2k3-unittest-02-hw which is running tracemonkey unittests.

I'm guessing it kills subprocesses properly because the next build runs without issue (except the hangs of course).
(In reply to comment #28)
> yes, this is only on bm-win2k3-unittest-02-hw which is running tracemonkey
> unittests.
Is this installed on all the unittest slaves or only on bm-win2k3-unittest-02-hw? 

Also, is this installed on any build slaves?
It is not installed on any "build" slaves. AFAIK it is only installed on the Talos machines, and h/w tracemonkey unittest machine
Attached file Patched _dumbwin32proc.py from Twisted 2.5.0 (obsolete) (deleted) —
The version of patch.exe on the ref platform can't seem to cope with the patch on the twisted bug tracker, so this is a copy of the file from Twisted 2.5.0 with the changes applied.
Comment on attachment 344234 [details]
Patched _dumbwin32proc.py from Twisted 2.5.0

Not necessary if you avoid using "wget -O- ..."
Attachment #344234 - Attachment is obsolete: true
I've downgraded to Twisted to 2.5.0 and applied the patch on moz2-win32-slave14 (the new VM for tracemonkey). Steps:
* stop the buildbot slave
* disconnect as cltbld and log back on as Adminstrator user, run D:\mozilla-build\start-msvc8.bat
* rm -rfv /d/mozilla-build/python25/Lib/site-packages/{twisted/,Twisted-8.0.1-py2.5.egg-info/,zope/} /d/mozilla-build/python25/scripts/twistd*
* wget http://tmrc.mit.edu/mirror/twisted/Twisted/2.5/Twisted_NoDocs-2.5.0.win32-py2.5.exe
* ./Twisted_NoDocs-2.5.0.win32-py2.5.exe
* cd /d/mozilla-build/python25/Lib/site-packages/twisted/internet/
* wget -O patch "http://twistedmatrix.com/trac/attachment/ticket/2726/_dumbwin32proc.patch?format=raw"
* patch -p8 < patch
* log off as Administrator, and back on as cltbld
* run 'buildbot --version' to make sure Twisted 2.5.0. is listed
* start the buildbot slave again
List of machines with the patch applied:

* all of talos
* moz2-win32-slave14 (tracemonkey unittest, plus bm-win2k3-unittest-02-hw which was doing this job)
* moz2-win32-slave07 (mozilla-central unittest)
Also 
* moz2-win32-slave08 (mozilla-central unittest)
* win2k3sp2-vc8tools-ref-vm (the windows ref platform, doc'd as part of Version 10)
I am having some problems with moz2-win32-slave04 (bug 460791#c70):
"buildbot.slave.commands.TimeoutError: SIGKILL failed to kill process"
Is it related to this bug?
If the info is needed, this slave is being used for build, unittest and tracemonkey
Armen, we don't have the patch to twisted on moz2-win32-slave04, so that message is expected at this stage. I think we'll end up deploying it there at some stage.

John has been updating new windows slaves, so the current list of patched machines is
* all of talos
* win2k3sp2-vc8tools-ref-vm (the windows ref platform, doc'd as part of Version
10)
* moz2-win32-slave14 (tracemonkey unittest, plus bm-win2k3-unittest-02-hw which
was doing this job)
* moz2-win32-slave07,08 (mozilla-central unittest)
* moz2-win32-slave11,12,13,15,16,17,18 (build/leak test/nightly builders)
In addition to comment #37, 
* catlee has done moz2-win32-slave04 (bug 464057)
* I've done moz2-win32-slave03 (bug 464380)
Also put this change on 
* moz-win32-slave1,02,05,06 (build/leak test/nightly builders)
* moz-win32-slave09,10 (unittest staging)
So that's all the moz-win32-* boxes, and future ones via the refplatform.

These are the buildbot slaves we have left (I think)
* 1.9.0 unit tests - fx-win32-1.9-slave09,10,11. Lukas, what do you think about those ones ?
* try server slaves - try1-win32-slave, try2-win32-slave, try-win32-slave03. Do we need this there Ben ?
> * try server slaves - try1-win32-slave, try2-win32-slave, try-win32-slave03. Do
> we need this there Ben ?

We haven't this in *forever* on the try server. But it wouldn't hurt.
The three try slaves were recreated recently, so they're fixed via the ref platform. Gonna call this done.
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Moving closed Future bugs into Release Engineering in preparation for removing the Future component.
Component: Release Engineering: Future → Release Engineering
Product: mozilla.org → Release Engineering
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: