Closed Bug 851270 Opened 12 years ago Closed 9 years ago

Don't clobber the source checkout

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: massimo)

References

Details

(Whiteboard: [buildfaster:P1][capacity])

Attachments

(4 files, 6 obsolete files)

(deleted), patch
catlee
: feedback+
massimo
: checked-in+
Details | Diff | Splinter Review
(deleted), patch
catlee
: review+
massimo
: checked-in+
Details | Diff | Splinter Review
(deleted), patch
catlee
: review+
massimo
: checked-in+
Details | Diff | Splinter Review
(deleted), patch
Callek
: review+
massimo
: checked-in+
Details | Diff | Splinter Review
I was poking through build logs and noticed the following:

---
Executing: ['python', '/builds/slave/m-cen-osx64-000000000000000000/tools/buildfarm/utils/hgtool.py', '--mirror', 'http://hg-internal.dmz.scl3.mozilla.com/mozilla-central', '--bundle', 'http://ftp.mozilla.org/pub/mozilla.org/firefox/bundles/mozilla-central.hg', 'http://hg.mozilla.org/mozilla-central', 'build']
Checking if share extension works
command: START
command: hg help share
command: cwd: /builds/slave/m-cen-osx64-000000000000000000
command: output:
command: END (0.30 elapsed)

/builds/slave/m-cen-osx64-000000000000000000/build doesn't appear to be a valid hg directory; clobbering
command: START
command: hg path default
command: cwd: /builds/hg-shared/mozilla-central
command: output:
http://hg.mozilla.org/mozilla-central

command: END (0.20 elapsed)

Updating shared repo
command: START
command: hg path default
command: cwd: /builds/hg-shared/mozilla-central
command: output:
http://hg.mozilla.org/mozilla-central

command: END (0.16 elapsed)

command: START
command: hg pull -r b672877ed04656a9f4041f31ef5d8030df662726 http://hg-internal.dmz.scl3.mozilla.com/mozilla-central
command: cwd: /builds/hg-shared/mozilla-central
command: output:
pulling from http://hg-internal.dmz.scl3.mozilla.com/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes
added 5 changesets with 8 changes to 8 files
(run 'hg update' to get a working copy)
command: END (3.44s elapsed)

Trying to share /builds/hg-shared/mozilla-central to /builds/slave/m-cen-osx64-000000000000000000/build
command: START
command: hg share -U /builds/hg-shared/mozilla-central /builds/slave/m-cen-osx64-000000000000000000/build
command: cwd: /builds/slave/m-cen-osx64-000000000000000000
command: output:
command: END (0.14s elapsed)

command: START
command: hg update -C -r b672877ed04656a9f4041f31ef5d8030df662726
command: cwd: /builds/slave/m-cen-osx64-000000000000000000/build
command: output:
76788 files updated, 0 files merged, 0 files removed, 0 files unresolved
command: END (485.15s elapsed)

command: START
command: hg parent --template {node|short}
command: cwd: /builds/slave/m-cen-osx64-000000000000000000/build
command: output:
b672877ed046
command: END (0.77 elapsed)

Got revision b672877ed046
program finished with exit code 0
elapsedTime=490.712928
---

The interesting output there is:

---
command: START
command: hg update -C -r b672877ed04656a9f4041f31ef5d8030df662726
command: cwd: /builds/slave/m-cen-osx64-000000000000000000/build
command: output:
76788 files updated, 0 files merged, 0 files removed, 0 files unresolved
command: END (485.15s elapsed)
---

8:08 to update the Mercurial repository. Wat?! Looking at more logs, it appears this is common. Here's a Windows log:

---
command: START
command: hg update -C -r d3d9a1552a4d62a7b80c53b5af8ef6fd610b3ae6
command: cwd: e:\builds\moz2_slave\m-cen-w32-00000000000000000000\build
command: output:
76788 files updated, 0 files merged, 0 files removed, 0 files unresolved
command: END (664.56s elapsed)
---

11 minutes?!

It appears we are completely removing the source directory checkout and repopulating it from scratch. Why?

Instead of blowing away the entire source directory and performing a full update, we should instead clean out the old source directory and then update it to the revision specified. You could do this with something like:

  hg st -n -i -u -0 | xargs -0 rm
  hg up -C -r <rev>

This should complete in seconds rather than minutes and thus will make builds start a lot faster.
Whiteboard: [buildfaster:P1][capacity]
We're not blowing away the source directory and repopulating it from scratch. This only happens for dep builds if the build directory has been clobbered for some reason. Try and nightly builds are always clobbers.

If we're seeing this happen in general in dep builds that haven't been clobbered, then that's definitely a bug.
We should never need to delete the source directory, even for clobber builds. The hg st command I posted plus an up -C is equivalent to a clean checkout and should be much more efficient.

I double checked and non-clobber builds on inbound do appear to be sane. However, 
Try exhibits the bad behavior (presumably since all try builds are clobber builds).
Summary: More intelligent handling of Mercurial updates → Don't clobber the source checkout
Note that the full |hg up| is only part of the problem: the clobber step (rm -rf) can consume many minutes as well. I'm looking at a Try Windows log where rm -rf takes 9 minutes! I reckon this change could make Windows clobber builds start 15+ minutes faster!
iirc, there is a faster rm -rf that :jhford (i believe?) wrote for windows; not sure if that is/can be utilized.

Though I agree, avoiding altogether is even better
I think i wrote it in some other bug, but we could leverage file system level snapshot facilities (volume shadow copy, hfs+ snapshots, lvm snapshots/zfs/btrfs) to replace rm -rf.
We could also mount the source checkout as a read-only filesystem. But that would require changes to the build such that we never write files into the source tree. Do we have a tracking bug on that?
(In reply to Gregory Szorc [:gps] from comment #0)
>   hg st -n -i -u -0 | xargs -0 rm

hg purge --all ?
I definitely wrote a win32 native program that did rm -rf using native windows calls.  It was deployed prematurely and there were a couple issues with it on non-wow64 systems, iirc.  I don't know how much interest there still is in it and I'm not sure where a good version of the code ended up.
Bug 727551 / http://hg.mozilla.org/users/jford_mozilla.com/jhford-native-rm/ afaict :-)
Attached patch Part 1: Add purge ability to hgtool, v1 (obsolete) (deleted) — Splinter Review
I gave hgtool the ability to purge. When you pass --purge to hgtool, it will |hg purge -a --all| before |hg up|. This is effectively a clobber without actually blowing away the source dir. It should remove all the object directories and any temporary files and directories created in the source directory.

If hg purge fails, it will exit with a non-0 status code. This should be caught be existing except blocks and trigger a full rm -rf style clobber.

I have not tested this patch. I have no clue how to test it. I'd love it if someone else took this patch and ran it to the finish line.
Attachment #725229 - Flags: review?(bhearsum)
Attached patch Part 2: Purge, don't clobber, try builds, v1 (obsolete) (deleted) — Splinter Review
I removed the rm -rf step from Try builds and added --purge to the hgtool invocation. The net effect should be the same.

Again, I haven't tested this and have no clue how to test this.

There are other parts in this file that need the clobber logic updated. I figured Try would be a good place to start.
Attachment #725230 - Flags: review?(bhearsum)
Note that hg purge is an extension that needs to be enabled.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #12)
> Note that hg purge is an extension that needs to be enabled.

Ahh, good point. Presumably there is a .hgrc file lingering in one of these repositories? If not, we can always emulate hg purge with hg st | rm (see my initial comment).
.hgrc doesn't need to be in the repository. There are other places mercurial looks for one.
(In reply to Mike Hommey [:glandium] from comment #14)
> .hgrc doesn't need to be in the repository. There are other places mercurial
> looks for one.

I know how hgrc works :)

What I meant was I'm guessing there is an hgrc file in build-tools or buildbotcustom or some other repo that is automagically installed as part of running jobs.
Our hgrc files are centrally managed by puppet (on linux and OSX), so as long as the version of hg we have on there supports purge, it's easy to enable.

hgtool could test if the purge extension is enabled, and use some fallback code if it doesn't.
Afaict purge has been a bundled extension since forever:
http://selenic.com/repo/hg/log/a07be8953733/hgext/purge.py

(hg v1.0 was released in 2008, so after it was moved into hgext/)

...so we shouldn't even need the fallback :-)
I think that since we control the entire stack and given the performance repercussions if purge is not used, we should explicitly require purge. If nothing else, supporting N - 1 configurations is easier and less error prone than N. KISS.
Comment on attachment 725229 [details] [diff] [review]
Part 1: Add purge ability to hgtool, v1

Review of attachment 725229 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Gregory Szorc [:gps] from comment #2)
> We should never need to delete the source directory, even for clobber
> builds. The hg st command I posted plus an up -C is equivalent to a clean
> checkout and should be much more efficient.
> 
> I double checked and non-clobber builds on inbound do appear to be sane.
> However, 
> Try exhibits the bad behavior (presumably since all try builds are clobber
> builds).

Just wanted to note that we will *still* see full source clobbers due to the ensure-we-have-enough-disk-space script that runs at the start of most jobs. This is still great to have though, especially on try!

(In reply to Gregory Szorc [:gps] from comment #18)
> I think that since we control the entire stack and given the performance
> repercussions if purge is not used, we should explicitly require purge. If
> nothing else, supporting N - 1 configurations is easier and less error prone
> than N. KISS.

I know what you're saying here, but I respectfully disagree. Even though we control the whole stack, we still don't want to make the assumption that "purge" is always available. Changes to puppet, addition of new platforms, etc. all can cause us to lose our hgrc and end up with "purge" being unavailable. We need to be able to fallback gracefully to reduce the possibility of unnecessary tree burning. You can use the "hg share" logic as a model to test for purge: https://github.com/mozilla/build-tools/blob/master/lib/python/util/hg.py#L345

With that done, I think we should just drop the --purge option from hgtool altogether and make it the default and only fall back to deleting the entire directory if purge is unavailable or doesn't work for some reason.

::: lib/python/util/hg.py
@@ +98,5 @@
>          ver = (0, 0, 0)
>      log.debug("Running hg version %s", ver)
>      return ver
>  
> +def do_purge(dest):

Naming nit: drop the "do_" to match the style of the rest of this file. Also, please add a test or two (https://github.com/mozilla/build-tools/blob/master/lib/python/mozilla_buildtools/test/test_util_hg.py).
Attachment #725229 - Flags: review?(bhearsum) → review-
Comment on attachment 725230 [details] [diff] [review]
Part 2: Purge, don't clobber, try builds, v1

This patch won't be necessary, per my previous comment about making purging the normal behaviour.
Attachment #725230 - Attachment is obsolete: true
Attachment #725230 - Flags: review?(bhearsum)
(In reply to Ben Hearsum [:bhearsum] from comment #19)
> I know what you're saying here, but I respectfully disagree. Even though we
> control the whole stack, we still don't want to make the assumption that
> "purge" is always available. Changes to puppet, addition of new platforms,
> etc. all can cause us to lose our hgrc and end up with "purge" being
> unavailable. We need to be able to fallback gracefully to reduce the
> possibility of unnecessary tree burning. 

The only downside is that we potentially forget to update an hgrc and go 6 months without realising we are doing the fallback method and wasting a load of builder time. I'd rather this be noisy is possible.
(In reply to Ed Morley [:edmorley UTC+0] from comment #21)
> (In reply to Ben Hearsum [:bhearsum] from comment #19)
> > I know what you're saying here, but I respectfully disagree. Even though we
> > control the whole stack, we still don't want to make the assumption that
> > "purge" is always available. Changes to puppet, addition of new platforms,
> > etc. all can cause us to lose our hgrc and end up with "purge" being
> > unavailable. We need to be able to fallback gracefully to reduce the
> > possibility of unnecessary tree burning. 
> 
> The only downside is that we potentially forget to update an hgrc and go 6
> months without realising we are doing the fallback method and wasting a load
> of builder time. I'd rather this be noisy is possible.

Can we throw a warning that TBPL or some other tool can pick up?
We could either turn the job orange (but not burn the build), or else keep it green, but notify in the tinderbox print area (similar to how we do for the "this was a clobber build" message). ie output to the log:

TinderboxPrint:hg purge extension not found! Using inefficient srcdir clobber!

etc
(In reply to Ed Morley [:edmorley UTC+0] from comment #23)
> We could either turn the job orange (but not burn the build), or else keep
> it green, but notify in the tinderbox print area (similar to how we do for
> the "this was a clobber build" message). ie output to the log:
> 
> TinderboxPrint:hg purge extension not found! Using inefficient srcdir
> clobber!

My goal here is not to close the tree if we screw up an hgrc. If orange because of this would do it let's do the latter. If the orange wouldn't be a big deal, either sounds fine to me.
Orange on a build or two on a newly added platform/tree sounds manageable - we'll just star as we do for any other failure until purge has been added to the hgrc :-)
Depends on: 852170
Attached patch look for inefficient clobbers (obsolete) (deleted) — Splinter Review
(In reply to Ed Morley [:edmorley UTC+0] from comment #25)
> Orange on a build or two on a newly added platform/tree sounds manageable -
> we'll just star as we do for any other failure until purge has been added to
> the hgrc :-)

Okay, this patch adds an error regex for the string you suggested, and will turn the build orange if its found. We just need to get hgtool to print this to make it work.
Attachment #726201 - Flags: review?(emorley)
(In reply to Ben Hearsum [:bhearsum] from comment #19)
> (In reply to Gregory Szorc [:gps] from comment #18)
> > I think that since we control the entire stack and given the performance
> > repercussions if purge is not used, we should explicitly require purge. If
> > nothing else, supporting N - 1 configurations is easier and less error prone
> > than N. KISS.
> 
> I know what you're saying here, but I respectfully disagree. Even though we
> control the whole stack, we still don't want to make the assumption that
> "purge" is always available. Changes to puppet, addition of new platforms,
> etc. all can cause us to lose our hgrc and end up with "purge" being
> unavailable. We need to be able to fallback gracefully to reduce the
> possibility of unnecessary tree burning. You can use the "hg share" logic as
> a model to test for purge:
> https://github.com/mozilla/build-tools/blob/master/lib/python/util/hg.py#L345

I don't fully understand the complexities and requirements of the buildbot infrastructure, but I still think we should require purge. If a bad hgrc is rolled out, then the recently deployed system configuration is bad and should be reverted. We would have the system qualify/test builds instead of leaving it up to humans to fix warnings. What's wrong with this strictness?
 
> With that done, I think we should just drop the --purge option from hgtool
> altogether and make it the default and only fall back to deleting the entire
> directory if purge is unavailable or doesn't work for some reason.

It's not that simple.

Before, "clobber" and "update" were two separate steps in buildbot land. Now, they are one. That step needs to distinguish between "clobber and update" and "just update." --purge provides that distinction.

In the ideal world, I'd say we should purge on every build just because we're not supposed to be putting things in the srcdir. However, we can't simply do this because the objdir is a child directory of the srcdir and the purge would remove the objdir. In my ideal world, we should move the objdir so it isn't a child of the srcdir. Then we can more easily clobber one or the other. We can also do things like mount the srcdir on a read-only filesystem to enforce the "no writes in srcdir are allowed" ideal. We /could/ exclude the objdir from the purges. However, we would need to pass a pattern to purge telling it what directories are object directories. Do they always begin with obj-*? Can you guarantee this for all time on all platforms? That assumption seems rather fragile to me. Even if we do that, we still need a way to determine whether we're clobbering the world.

I guess what I'm trying to say is I don't see a way to achieve the optimal end state without introducing a new command argument.

> ::: lib/python/util/hg.py
> @@ +98,5 @@
> >          ver = (0, 0, 0)
> >      log.debug("Running hg version %s", ver)
> >      return ver
> >  
> > +def do_purge(dest):

> Naming nit: drop the "do_" to match the style of the rest of this file.
> Also, please add a test or two
> (https://github.com/mozilla/build-tools/blob/master/lib/python/
> mozilla_buildtools/test/test_util_hg.py).

Yeah, I had to do this because "purge" was a function argument and would have overridden the global function name.

Finally, fixing this issue is beyond my regular sphere of responsibility. I only submitted the patches because I was poking around to see what was going on and realized the fix seemed trivial. Since Ben and others have opinions here and are more familiar with the systems, I'd appreciate if someone else carried this bug to completion.
I think you can enable purge, even if the .hgrc is bad or missing, by

hg --config extensions.purge= purge
(In reply to Gregory Szorc [:gps] from comment #27)
> I don't fully understand the complexities and requirements of the buildbot
> infrastructure, but I still think we should require purge. If a bad hgrc is
> rolled out, then the recently deployed system configuration is bad and
> should be reverted. We would have the system qualify/test builds instead of
> leaving it up to humans to fix warnings. What's wrong with this strictness?

The problem is that if a bad update is rolled out, we could be burning jobs for hours or longer (and thus, holding the tree closed), before it's noticed and fixed. I agree 100% that we should be using purge everywhere, I just don't think we should be enforcing that by unnecessarily burning jobs.

> > With that done, I think we should just drop the --purge option from hgtool
> > altogether and make it the default and only fall back to deleting the entire
> > directory if purge is unavailable or doesn't work for some reason.
> 
> It's not that simple.
> 
> Before, "clobber" and "update" were two separate steps in buildbot land.
> Now, they are one. That step needs to distinguish between "clobber and
> update" and "just update." --purge provides that distinction.

We might just be having a difference in terminology here...the commands from comment #0 are being run from hgtool.py or one of the libraries it calls - not Buildbot. We can modify hgtool/mercurial() to do whatever we want. In this case I think we want to:
- Try to purge if the directory exists
-- If purge fails, rm -rf and repopulate the directory (using the existing logic that does that).
-- If purge succeeds, pull/update (using the existing logic that does that).

> In the ideal world, I'd say we should purge on every build just because
> we're not supposed to be putting things in the srcdir. However, we can't
> simply do this because the objdir is a child directory of the srcdir and the
> purge would remove the objdir. In my ideal world, we should move the objdir
> so it isn't a child of the srcdir.Then we can more easily clobber one or
> the other.

I filed bug 852199 on moving the objdirs.

> We can also do things like mount the srcdir on a read-only
> filesystem to enforce the "no writes in srcdir are allowed" ideal.

Interesting idea, not sure how well it works for platforms other than Linux. Feel free to open a bug on it, though.

> We
> /could/ exclude the objdir from the purges. However, we would need to pass a
> pattern to purge telling it what directories are object directories. Do they
> always begin with obj-*?
> 
> I guess what I'm trying to say is I don't see a way to achieve the optimal
> end state without introducing a new command argument.

An alternative here would be to write out an .hgignore before running hgtool.py.

> > ::: lib/python/util/hg.py
> > @@ +98,5 @@
> > >          ver = (0, 0, 0)
> > >      log.debug("Running hg version %s", ver)
> > >      return ver
> > >  
> > > +def do_purge(dest):
> 
> > Naming nit: drop the "do_" to match the style of the rest of this file.
> > Also, please add a test or two
> > (https://github.com/mozilla/build-tools/blob/master/lib/python/
> > mozilla_buildtools/test/test_util_hg.py).
> 
> Yeah, I had to do this because "purge" was a function argument and would
> have overridden the global function name.

Ah, OK.

> Finally, fixing this issue is beyond my regular sphere of responsibility. I
> only submitted the patches because I was poking around to see what was going
> on and realized the fix seemed trivial. Since Ben and others have opinions
> here and are more familiar with the systems, I'd appreciate if someone else
> carried this bug to completion.

OK, I'll try to find an owner for this. Thanks for your initial work.
(In reply to Aki Sasaki [:aki] from comment #28)
> I think you can enable purge, even if the .hgrc is bad or missing, by
> 
> hg --config extensions.purge= purge

In fact....we already do this for mq: https://github.com/mozilla/build-tools/blob/master/lib/python/util/hg.py#L486

So we can just do that here and not worry about bug 852170.
(In reply to Ben Hearsum [:bhearsum] from comment #29)
> (In reply to Gregory Szorc [:gps] from comment #27)
> > I don't fully understand the complexities and requirements of the buildbot
> > infrastructure, but I still think we should require purge. If a bad hgrc is
> > rolled out, then the recently deployed system configuration is bad and
> > should be reverted. We would have the system qualify/test builds instead of
> > leaving it up to humans to fix warnings. What's wrong with this strictness?
> 
> The problem is that if a bad update is rolled out, we could be burning jobs
> for hours or longer (and thus, holding the tree closed), before it's noticed
> and fixed. I agree 100% that we should be using purge everywhere, I just
> don't think we should be enforcing that by unnecessarily burning jobs.

<and other comments>

From a onlooker point of view, it seems like what *ideally* we would have would be a way of introspecting and noting "build"/machine issues separately from burning jobs.  Instead, orange v. no orange on the basis of, here, "is purgeable?" (+ tinderboxprint) is being focused on since that's all we can trivially do with our current infrastructure.  Not saying that a deeper system can/should be done/specced now/ever, but I think its worth noting when a solution is a quick measure v something in good form
(In reply to Ben Hearsum [:bhearsum] from comment #29)
> > We
> > /could/ exclude the objdir from the purges. However, we would need to pass a
> 
> An alternative here would be to write out an .hgignore before running
> hgtool.py.
 
.hgignore already excludes ^obj 

:-)
We want to purge ignored files from the source dir, especially on Try. We sometimes don't want to purge the objdir.
Comment on attachment 726201 [details] [diff] [review]
look for inefficient clobbers

This patch is no longer necessary if we go with the solution in comment 30.
Attachment #726201 - Attachment is obsolete: true
Attachment #726201 - Flags: review?(emorley)
(In reply to Gregory Szorc [:gps] from comment #2)
> We should never need to delete the source directory, even for clobber
> builds. The hg st command I posted plus an up -C is equivalent to a clean
> checkout and should be much more efficient.

This comment slipped by me earlier. I just wanted to note that the reason we clobber things sometimes is disk space. We have one build directory per job (where a job is as specific as "mozilla-central opt build" or "mozilla-inbound b2g desktop". Because we have hundreds of different jobs, we can't keep src/objdirs forever. We make up for this a bit by favouring slaves that have most recently done a build of the same type when choosing one for a pending job.

Having re-read the first part of this bug I'm actually confused about what running purge is buying us, except in the extreme cases where we can't update the existing checkout (eg, try repo reset).
(In reply to Jeff Hammel [:jhammel] from comment #31)
<snip/>
> From a onlooker point of view, it seems like what *ideally* we would have
> would be a way of introspecting and noting "build"/machine issues separately
> from burning jobs.  Instead, orange v. no orange on the basis of, here, "is
> purgeable?" (+ tinderboxprint) is being focused on since that's all we can
> trivially do with our current infrastructure.  Not saying that a deeper
> system can/should be done/specced now/ever, but I think its worth noting
> when a solution is a quick measure v something in good form

Will file this as requested.  Meant to note more explicitly that this is completely out of scope for this bug
Purges saves you from having to recreate all the files in the srcdir. For any 2 builds, the work required to switch from rev A to B will likely be much smaller than the work required to perform a fresh checkout. The purge is to ensure any remnants from a previous build are gone, thus treating things like a clobber.
(In reply to Gregory Szorc [:gps] from comment #37)
> Purges saves you from having to recreate all the files in the srcdir. For
> any 2 builds, the work required to switch from rev A to B will likely be
> much smaller than the work required to perform a fresh checkout. The purge
> is to ensure any remnants from a previous build are gone, thus treating
> things like a clobber.

Ah. So, it sounds like this would be primarily useful for nightly builds, where we always start with a completely fresh repository. We already don't clobber dep builds as part of the build process (but they do sometimes get clobbered to free up space, as noted earlier). It would be a correctness fix for those builds still.
(In reply to Ben Hearsum [:bhearsum] from comment #38)
> Ah. So, it sounds like this would be primarily useful for nightly builds

And Try surely?
It's 2013: why are we talking about not enough free disk space?
(In reply to Jeff Hammel [:jhammel] from comment #36)
> (In reply to Jeff Hammel [:jhammel] from comment #31)
> <snip/>
> > From a onlooker point of view, it seems like what *ideally* we would have
> > would be a way of introspecting and noting "build"/machine issues separately
> > from burning jobs.  Instead, orange v. no orange on the basis of, here, "is
> > purgeable?" (+ tinderboxprint) is being focused on since that's all we can
> > trivially do with our current infrastructure.  Not saying that a deeper
> > system can/should be done/specced now/ever, but I think its worth noting
> > when a solution is a quick measure v something in good form
> 
> Will file this as requested.  Meant to note more explicitly that this is
> completely out of scope for this bug

bug 852357
According to http://brasstacks.mozilla.com/gofaster/#/overhead/build our win32 builders spend ~1 hour in "build setup and teardown." http://brasstacks.mozilla.com/gofaster/#/executiontime/build shows that our total win32 build time is ~2 hours (I'm pretty sure this doesn't include "build setup and teardown"). Putting the two together, ~33% of the build time for win32 builders is spent in setup and teardown. Or, as I'll call it "buildbot overhead."

And, from my earlier comments, we are losing 15+ minutes (possibly even 20+) to the issues described in this bug. Assuming its 15 minutes exactly and assuming the 1 and 2 hour numbers above are exact, build times would go from 180 to 165 minutes. That's a 9% increase in builder capacity.
I think we need to be a bit careful with that overhead number. Everything that doesn't match this config
  http://mxr.mozilla.org/build/source/braindump/reports/buildfaster_report.py#243
counts as setup/tear down in buildfaster. Note that it's not including long-running things like 'make check'.

Taking a (statistically insignificant) sample of two builds:
* total time: 135min, previously clobbered and took 10mins to get a working copy of the source from the hg share already on disk, 52min compile, 8 mins building symbols and packaging, 58min make check
* total time: 117 min, true dep build with 75 sec hg update, 41 min compile, 8 min packaging, 57 min make check

For the first example, the clobbering would have taken a few minutes on some other job. Even with that, 'buildbot overhead' is a lot less than a third of the total time.
Assignee: nobody → mgervasini
Worth noting that very long paths on Windows cannot be deleted by mercurial:

C:\Users\jhopkins\tmp\buildbot-configs>hg purge --all
0xxxxxxxxxxxxxxxxxxxx\1xxxxxxxxxxxxxxxxxxxx\2xxxxxxxxxxxxxxxxxxxx\3xxxxxxxxxxxxx
xxxxxxx\4xxxxxxxxxxxxxxxxxxxx\5xxxxxxxxxxxxxxxxxxxx\6xxxxxxxxxxxxxxxxxxxx\7xxxxx
xxxxxxxxxxxxxxx\8xxxxxxxxxxxxxxxxxxxx\9xxxxxxxxxxxxxxxxxxxx: The system cannot f
ind the path specified
abort: The system cannot find the path specified: 'C:\Users\jhopkins\tmp\buildbo
t-configs\0xxxxxxxxxxxxxxxxxxxx/1xxxxxxxxxxxxxxxxxxxx/2xxxxxxxxxxxxxxxxxxxx/3xxx
xxxxxxxxxxxxxxxxx/4xxxxxxxxxxxxxxxxxxxx/5xxxxxxxxxxxxxxxxxxxx/6xxxxxxxxxxxxxxxxx
xxx/7xxxxxxxxxxxxxxxxxxxx/8xxxxxxxxxxxxxxxxxxxx/9xxxxxxxxxxxxxxxxxxxx/*.*'

catlee pointed out that we can fall back to a clobber when that happens.
We should add an explicit test for this in hgtool's test suite.
Attachment #725229 - Attachment is obsolete: true
Attachment #738505 - Flags: feedback?(catlee)
Enabling purge on try branch
Attachment #738509 - Flags: feedback?(catlee)
Attachment #738509 - Attachment is patch: true
Attachment #738509 - Attachment mime type: text/x-patch → text/plain
Comment on attachment 738509 [details] [diff] [review]
enabling hg --purge on try branch

Review of attachment 738509 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good!

Can you add this to your ~/.hgrc:

[diff]
git=1
showfunc = 1

Having the class/function names is really helpful for reading the diffs.
Attachment #738509 - Flags: feedback?(catlee) → feedback+
Comment on attachment 738505 [details] [diff] [review]
based on gps patch: renamed do_purge, added tests

Review of attachment 738505 [details] [diff] [review]:
-----------------------------------------------------------------

looks good!

Ideally we should handle those bad long filenames on windows.

We definitely need to re-raise the exception from inside purge() so that the calling functions can remove dest.

::: lib/python/mozilla_buildtools/test/test_util_hg.py
@@ +325,5 @@
> +        open(fileToModify, 'w').write('just a test')
> +        purge(self.wc)
> +        content = open(fileToModify).read()
> +        self.assertEqual(content, 'just a test')
> +

can we add a test for purging really long filenames?

::: lib/python/util/hg.py
@@ +105,5 @@
> +        run_cmd(['hg', '--config', 'extensions.purge=', 'purge', '-a', '--all'], cwd=dest)
> +    except subprocess.CalledProcessError:
> +        # purge failed
> +        # https://bugzilla.mozilla.org/show_bug.cgi?id=851270#c44
> +        log.error('purge failed (%s)' % dest)

Ok, I think probably the best thing to do is to re-raise the exception here. Both calls below are inside try/except clauses which call remove_dest after the exception is caught.

remove_dest should ideally be adjusted with jhopkins' logic from clobberer

@@ +390,4 @@
>              remove_path(dest)
>          elif not os.path.exists(os.path.join(dest, ".hg", "sharedpath")):
>              try:
> +                purge(dest)

should this have an 'if autoPurge:' guard?
Attachment #738505 - Flags: feedback?(catlee) → feedback-
FYI, Aki ported my Windows clobber routine to Mozharness and added unit tests: 
 https://hg.mozilla.org/build/mozharness/rev/0308cdea2f32
changes in this patch:

* added missing 'if autoPurge' check
* ported mozharness _rmtree_windows() in commands.py
* added long path test
* remove exception handling when purge fails (just log and re-raise)
* added tox.ini for testing
Attachment #738505 - Attachment is obsolete: true
Attachment #741471 - Flags: feedback?(catlee)
* ported fixes made in mozharness _rmtree_windows().
* added cwd=os.path.normpath(os.path.join(dest, '..')) in purge command
Attachment #741471 - Attachment is obsolete: true
Attachment #741471 - Flags: feedback?(catlee)
Attachment #742301 - Flags: feedback?(catlee)
Attachment #742301 - Flags: feedback?(catlee) → feedback+
Comment on attachment 742301 [details] [diff] [review]
updated _rmtree_windows(), added cwd in purge command

Review of attachment 742301 [details] [diff] [review]:
-----------------------------------------------------------------

::: lib/python/mozilla_buildtools/test/test_util_hg.py
@@ +309,5 @@
> +
> +    def testPurgeUntrackedDirectory(self):
> +        rev = clone(self.repodir, self.wc, update_dest=False)
> +        self.assertEquals(rev, None)
> +        directoryToPurge = os.path.join(self.wc, 'directoyTopurge')

Was this meant to say 'directoy'?

::: lib/python/util/commands.py
@@ +1,1 @@
> +# ***** BEGIN LICENSE BLOCK *****

"***** BEGIN LICENSE BLOCK *****" is a MPL1.1-ism, and is no longer encouraged.

::: lib/python/util/hg.py
@@ +195,5 @@
>              # We need to make sure our paths are correct though
>              if os.path.exists(os.path.join(dest, '.hg')):
>                  adjust_paths(dest, default=repo)
> +            return mercurial(repo, dest, branch, revision, autoPurge=True,
> +                             update_dest=update_dest, clone_by_rev=clone_by_rev,)

Why the extra comma?
(In reply to Ben Hearsum [:bhearsum] from comment #35)
> (In reply to Gregory Szorc [:gps] from comment #2)
> > We should never need to delete the source directory, even for clobber
> > builds. 
> 
> This comment slipped by me earlier. I just wanted to note that the reason we
> clobber things sometimes is disk space. We have one build directory per job
> (where a job is as specific as "mozilla-central opt build" or
> "mozilla-inbound b2g desktop". Because we have hundreds of different jobs,
> we can't keep src/objdirs forever.

I understand that we can't keep objdirs forever (too many permutations), however there's no reason to need to delete the srcdir checkouts, since there should only need to be one per tree & they are smaller than the objdirs. Or do we create a new checkout for each of opt/debug/...? (In which case that's something we should fix asap).

(In reply to Nick Thomas [:nthomas] from comment #43)
> I think we need to be a bit careful with that overhead number. Everything
> that doesn't match this config
>  
> http://mxr.mozilla.org/build/source/braindump/reports/buildfaster_report.
> py#243
> counts as setup/tear down in buildfaster. 

I'm sitting watching an inprogress m-c linux nightly retrigger where I've had to force a clobber on multiple platforms using the clobberer (something that happens very often due to our sucky build dependencies), and it's spent 16 minutes in the checking_clobber_times step (removing old srcdirs and objdirs) & another 10 mins in the hg_update step (checking mozilla-central back out again for the lolz), according to the buildbot master summary. 

Whilst I follow that in the specific examples below the overhead may have been overstated, this bug still is impacting us significantly.
(In reply to Ed Morley [:edmorley UTC+1] from comment #54)
> (In reply to Ben Hearsum [:bhearsum] from comment #35)
> > (In reply to Gregory Szorc [:gps] from comment #2)
> > > We should never need to delete the source directory, even for clobber
> > > builds. 
> > 
> > This comment slipped by me earlier. I just wanted to note that the reason we
> > clobber things sometimes is disk space. We have one build directory per job
> > (where a job is as specific as "mozilla-central opt build" or
> > "mozilla-inbound b2g desktop". Because we have hundreds of different jobs,
> > we can't keep src/objdirs forever.
> 
> I understand that we can't keep objdirs forever (too many permutations),
> however there's no reason to need to delete the srcdir checkouts, since
> there should only need to be one per tree & they are smaller than the
> objdirs. Or do we create a new checkout for each of opt/debug/...? (In which
> case that's something we should fix asap).

Yes, each type of build gets its own source checkout. Changing this would be a significant undertaking I think, since most of the build scripts assume they're operating out of the root of a source checkout.

> (In reply to Nick Thomas [:nthomas] from comment #43)
> > I think we need to be a bit careful with that overhead number. Everything
> > that doesn't match this config
> >  
> > http://mxr.mozilla.org/build/source/braindump/reports/buildfaster_report.
> > py#243
> > counts as setup/tear down in buildfaster. 
> 
> I'm sitting watching an inprogress m-c linux nightly retrigger where I've
> had to force a clobber on multiple platforms using the clobberer (something
> that happens very often due to our sucky build dependencies), and it's spent
> 16 minutes in the checking_clobber_times step (removing old srcdirs and
> objdirs)

Ideally we do this before starting buildbot so it's not in the critical path of doing builds (bug 699195)

> & another 10 mins in the hg_update step (checking mozilla-central
> back out again for the lolz), according to the buildbot master summary. 

This machine didn't have a local copy of m-c for whatever reason. It most likely got deleted to free up space for another build. The code that does that is here:
http://hg.mozilla.org/build/tools/file/default/buildfarm/maintenance/purge_builds.py#l242

> Whilst I follow that in the specific examples below the overhead may have
> been overstated, this bug still is impacting us significantly.

Here are some numbers for the past week. I've looked at the time taken for 'hg_update' steps.
number of checkouts: 10756
mean time:           171s
50th percentile time:120s
90th precentile time:452s
I ran the same numbers for 'clean_old_builds' to see if it's a significant problem:
n: 39062
mean: 9.5s
50th: 1s
90th: 7s
and 'checking_clobber_times':
n 17910
mean 42.3s
50th 1s
90th 139s
In this patch:
* removed license headers
* now PYWIN32 is always defined in commands.py
* minor fixes
Attachment #742301 - Attachment is obsolete: true
Attachment #745320 - Flags: review?(catlee)
Attachment #745320 - Flags: review?(catlee) → review+
Attachment #745320 - Flags: checked-in+
Attachment #738509 - Flags: checked-in+
in production
From a Windows 7 opt try I just pushed:

command: START
command: hg --config extensions.purge= purge -a --all e:\builds\moz2_slave\try-w32-0000000000000000000000\build
command: cwd: e:\builds\moz2_slave\try-w32-0000000000000000000000
command: output:
command: END (127.06s elapsed)

command: START
command: hg update -C -r f4cceab72020554b302791c075249dc125813c7f
command: cwd: e:\builds\moz2_slave\try-w32-0000000000000000000000\build
command: output:
1583 files updated, 0 files merged, 151 files removed, 0 files unresolved
command: END (28.06s elapsed)

155s total to get the source tree in a sane state.

Contrast with https://tbpl.mozilla.org/php/getParsedLog.php?id=22639341&tree=Try&full=1 which took 588s to rm -rf the source dir and 485s to |hg up| with an empty slate. That's 1073s before and 155s currently. If those numbers are average, Windows clobber builds (all try builds) will now complete 918s == 15:18 faster.

\o/

\o/
Interesting that this works on win32 builds. It's failing on linux builds, and falling back to a fresh clobber.
now using dest as cwd for purge command.
Attachment #746418 - Flags: review?(catlee)
Attachment #746418 - Flags: review?(catlee) → review+
Attachment #746418 - Flags: checked-in+
massimo, this patch added 9 new pep8 violations: http://10.134.48.37:8080/job/tools_tests/194/
bug 870323 has some fallout related to Python 2.7 specific syntax ("except foo as e" vs. "except foo, e")
Attached patch fix for pep8 violations (deleted) — Splinter Review
Callek, this patch fixes all the pep8 validation errors and removes some duplicated tests.
Attachment #748002 - Flags: review?(bugspam.Callek)
Comment on attachment 748002 [details] [diff] [review]
fix for pep8 violations

r+ on the pep8 fixups.

Can you please either point out what each of these tests dupes are, and then reflag me for review. Or flag someone else more familiar with those tests/code for review?
Attachment #748002 - Flags: review?(bugspam.Callek)
Attachment #748002 - Flags: review?
Attachment #748002 - Flags: review+
(In reply to Justin Wood (:Callek) from comment #66)
> Comment on attachment 748002 [details] [diff] [review]
> fix for pep8 violations
> 
> r+ on the pep8 fixups.
> 
> Can you please either point out what each of these tests dupes are, and then
> reflag me for review. Or flag someone else more familiar with those
> tests/code for review?

Hi Callek,

by mistake I have added the following tests twice:

* testPurgeUntrackedDirectory
* testPurgeTrackedFile
* testPurgeUntrackedFile

I have spotted them fixing the pep8 errors.
Attachment #748002 - Flags: review? → review?(bugspam.Callek)
Comment on attachment 748002 [details] [diff] [review]
fix for pep8 violations

Review of attachment 748002 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for the explanation -- clearing my extra review request (since two r+'s by the same person just looks silly ;-) ) you're good to land this. Thanks again for the fixup
Attachment #748002 - Flags: review?(bugspam.Callek)
Depends on: 873067
I just saw a l10n build log where 9:45 of the 10:03 it took to manage source code was spent in hg up, doing a full/clobber update, of course. It looked like the entire job was taking about 20 minutes. So, giving l10n builds the same treatment should net major savings for l10n build times!
Attachment #748002 - Flags: checked-in+
Product: mozilla.org → Release Engineering
Anything left to do here?
Flags: needinfo?(mgervasini)
Didn't this whole thing get backed out?
(In reply to John Hopkins (:jhopkins) from comment #70)
> Anything left to do here?

Hi John, I don't know if it's worth to extend the same procedure to nightly builds, but it's something we could try if this thing has not been backed out as gps suggests
Flags: needinfo?(mgervasini)
You can also fix this for Git. I noticed Git checkouts for B2G builds are also in the 10+ minute range. e.g. https://tbpl.mozilla.org/php/getParsedLog.php?id=26921499&tree=Mozilla-Central&full=1 shows the source checkout lasting from 04:02:54 to 04:14:25.
Blocks: 950850
Yes, this got backed out due to some intermittent bustage. Looks like an hg bug, but we could never reproduce.
For git, most of the time is spent fetching new tags, which we don't generally need to do. I have a patch for 'repo' that needs to get upstreamed which will avoid a fetch if the manifest specifies an absolute commit id rather than a refname. Combined with work in bug 899969 which will give us completely specified manifests, we should get much faster b2g checkouts.
Depends on: 969689, 883918
Given the work in bug 969689, let's give this another shot.

Re-enabled purge on try here:
https://hg.mozilla.org/build/buildbotcustom/rev/6c60d0187e8a

This will go live after the next reconfig.
(In reply to Chris AtLee [:catlee] from comment #76)
> Given the work in bug 969689, let's give this another shot.
> 
> Re-enabled purge on try here:
> https://hg.mozilla.org/build/buildbotcustom/rev/6c60d0187e8a
> 
> This will go live after the next reconfig.

in production, godspeed!
(In reply to Ben Hearsum [:bhearsum] from comment #77)
> in production, godspeed!

Should this bug be closed now? I'm aware we still have "clobber issues" but I'm not sure if they are the same ones that this bug aimed to solve.
Flags: needinfo?(catlee)
gps: ^^
Flags: needinfo?(catlee) → needinfo?(gps)
If purge (instead of rm) is being used in production, then yes, this should be closed.
Flags: needinfo?(gps)
let's close this
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Component: General Automation → General
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: