Open Bug 1055298 Opened 10 years ago Updated 4 years ago

(HEADlessTry) Reimplement try in HEADless, scalable, cloudy way

Categories: Developer Services :: Mercurial: hg.mozilla.org (enhancement)
Severity: normal
Tracking: Not tracked
People: Reporter: taras.mozilla, Assigned: sheehan
Keywords: meta
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1521]
Attachments: 2 files, 1 obsolete file

The upcoming reviewboard tool will require 'try' changesets to be persistent. We would also like to stop resetting try once the number of heads gets too high.

We came up with a new bundle + object-store architecture where:
1) A push to try becomes "store a bundle in S3 of what the client would've pushed to try". This also stores the repository URL (e.g. http://hg.mozilla.org/mozilla-central) as metadata on the bundle.
In practice this means pinging an LDAP-authed server to get temporary S3 credentials to upload a file to the "try" bucket. One also needs to notify pushlog.

2) A pull of X from try becomes "fetch document id X from S3, figure out which repo needs to be checked out to unbundle (e.g. from the repository URL metadata stored in step 1), try to apply the bundle, and fall back to `hg pull` in case the repo is too old".

3) We need to implement a new URL format with the push ID in the URL, so that when we hand it off to hgweb it knows which bundle to pull from S3 and which URL to check out to render a particular try push. Our infra will not use this for hg clones; it is only for 'manual' browsing.
Summary: Reimplement try in scalable, cloudy way → Reimplement try in HEADless, scalable, cloudy way
Blocks: 1055302
Blocks: 1055304
Depends on: 1053567
Summary: Reimplement try in HEADless, scalable, cloudy way → (HEADlessTry) Reimplement try in HEADless, scalable, cloudy way
Note this doesn't take care of the pushlog aspect.
Attachment #8478747 - Flags: review?(gps)
Assignee: nobody → mh+mozilla
Status: NEW → ASSIGNED
Comment on attachment 8478747 [details] [diff] [review]
part 1 - Mercurial extension for headless try pushes

I believe the design has been warped a bit since we attended the Mercurial sprint this past weekend. If you still want me to take a look, re-request review.
Attachment #8478747 - Flags: review?(gps)
Product: Release Engineering → Developer Services
Depends on: 961279
Morphing this into a tracker bug. Hold on.
Keywords: meta
Depends on: 1078912
Depends on: 1078916
Depends on: 1078918
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/170]
Whiteboard: [kanban:engops:https://kanbanize.com/ctrl_board/6/170] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1508] [kanban:engops:https://kanbanize.com/ctrl_board/6/170]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1508] [kanban:engops:https://kanbanize.com/ctrl_board/6/170] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1507] [kanban:engops:https://kanbanize.com/ctrl_board/6/170]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1507] [kanban:engops:https://kanbanize.com/ctrl_board/6/170] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1514] [kanban:engops:https://kanbanize.com/ctrl_board/6/170]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1514] [kanban:engops:https://kanbanize.com/ctrl_board/6/170] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1515] [kanban:engops:https://kanbanize.com/ctrl_board/6/170]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1515] [kanban:engops:https://kanbanize.com/ctrl_board/6/170] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1521] [kanban:engops:https://kanbanize.com/ctrl_board/6/170]
Whiteboard: [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1521] [kanban:engops:https://kanbanize.com/ctrl_board/6/170] → [kanban:engops:https://mozilla.kanbanize.com/ctrl_board/6/1521]
No longer depends on: 1078912
Assignee: mh+mozilla → gps
Depends on: 1104374
No longer depends on: 1053567
Attached file MozReview Request: bz://1055298/gps (obsolete) (deleted) —
Attachment #8530476 - Flags: review?(mh+mozilla)
/r/1085 - bundleheads: extension for bundle-based heads (bug 1055298)

Pull down this commit:

hg pull review -r 0ca63c593ce8625b2f3ba7eaeff508d45c9c9cb9
https://reviewboard.mozilla.org/r/1083/#review639

I think I got all the major pieces working. This includes pushlog. (I'm using a pushlog-compatible database schema to trick the existing hgweb JSON APIs into emitting a valid response.) This works without any code changes to existing pushlog code, aside from the bug fixes, which I'll split off into a separate commit later.

I need to add replication for the bundle index data. But that should be a minor feature compared to the hacks to coerce vanilla clients into pushing and pulling data not actually in the repositories.

I may also want to add support for serving each bundle from its own URL space. Currently, if you hit `/rev/<node>`, we use the most recent bundle containing `<node>`. This could be confusing if people want https://hg.mozilla.org/try/rev/<node> to just work. I think we'll need to provide https://hg.mozilla.org/try/bundle/<bundle>/rev/<node> to avoid any ambiguity. How we'll generate URLs for that in Treeherder, etc, I don't know. I consider this a P2 feature unless people start screaming for it to be a launch time requirement. The way I see it, we already drop commit data when we reset Try: I don't think *temporary* unavailability of exact parent/children data is going to be painful when we do the headless switchover.

The S3 code hasn't been tested. I'm confident it is buggy. I think I also need to implement a multi-level cache for S3 bundles. Otherwise, things like pushlog lookups may require dozens of S3 queries and could get quite costly (I need to do some math here).

I view the core of the extension as pretty solid. I'm pretty confident with the test coverage. I think I hit all the major goals of a headless repo with the current implementation. Now we just need to iterate towards the finish line. And, we need to start working on the operational aspects. Scripts to convert existing Try data to bundles. A daemon on hgssh1 that aggregates Firefox repos, etc.
(In reply to Gregory Szorc [:gps] from comment #7)
> I may also want to add support for serving each bundle from its own URL
> space. Currently, if you hit `/rev/<node>`, we use the most recent bundle
> containing `<node>`. This could be confusing if people want
> https://hg.mozilla.org/try/rev/<node> to just work. I think we'll need to
> provide https://hg.mozilla.org/try/bundle/<bundle>/rev/<node> to avoid any
> ambiguity. 

Just to check I understand the implications of this:

1) Someone pushes several commits to try
2) They amend some of the most recent commits, but leave the others untouched (and so with the same SHA)
3) They push to try again
4) If they view the older try push in treeherder (or even on hgweb), everything else works as currently, except if they follow a https://hg.mozilla.org/try/rev/<node> link, the diff will be correct (since the SHA is the same), but the child commits link shown on the hgweb page will be for the descendants of the newer try push, not their older one?

If so, I think that's fairly reasonable for the short term.

> How we'll generate URLs for that in Treeherder, etc, I don't
> know. I consider this a P2 feature unless people start screaming for it to
> be a launch time requirement. The way I see it, we already drop commit data
> when we reset Try: I don't think *temporary* unavailability of exact
> parent/children data is going to be painful when we do the headless
> switchover.

I guess we just annotate the pushlog response with a bundle identifier, and add specific handling for it in downstream consumers of the pushlog. Schema change + some UI changes in Treeherder, won't be too much extra work tbh.

That said, I guess we could simplify things if the bundle identifier was the SHA of the bundle's tip commit (ie the one used currently for /pushloghtml?changeset=) - then treeherder can just derive it from what we already store as the push identifier.
(In reply to Ed Morley (moved to Treeherder) [:edmorley] from comment #8)
> (In reply to Gregory Szorc [:gps] from comment #7)
> > I may also want to add support for serving each bundle from its own URL
> > space. Currently, if you hit `/rev/<node>`, we use the most recent bundle
> > containing `<node>`. This could be confusing if people want
> > https://hg.mozilla.org/try/rev/<node> to just work. I think we'll need to
> > provide https://hg.mozilla.org/try/bundle/<bundle>/rev/<node> to avoid any
> > ambiguity. 
> 
> Just to check I understand the implications of this:
> 
> 1) Someone pushes several commits to try
> 2) They amend some of the most recent commits, but leave the others
> untouched (and so with the same SHA)
> 3) They push to try again
> 4) If they view the older try push in treeherder (or even on hgweb),
> everything else works as currently, except if they follow a
> https://hg.mozilla.org/try/rev/<node> link, the diff will be correct (since
> the SHA is the same), but the child commits link shown on the hgweb page
> will be for the descendants of the newer try push, not their older one?
> 
> If so, I think that's fairly reasonable for the short term.

Yes, that's pretty much it.

> > How we'll generate URLs for that in Treeherder, etc, I don't
> > know. I consider this a P2 feature unless people start screaming for it to
> > be a launch time requirement. The way I see it, we already drop commit data
> > when we reset Try: I don't think *temporary* unavailability of exact
> > parent/children data is going to be painful when we do the headless
> > switchover.
> 
> I guess we just annotate the pushlog response with a bundle identifier, and
> add specific handling for it in downstream consumers of the pushlog. Schema
> change + some UI changes in Treeherder, won't be too much extra work tbh.

Something like that, yeah. We should probably put URLs in the pushlog response. Because having downstream systems construct URLs is generally a bad idea and against the hyperlinked world philosophy.

> That said, I guess we could simplify things if the bundle identifier was the
> SHA of the bundle's tip commit (ie the one used currently for
> /pushloghtml?changeset=) - then treeherder can just derive it from what we
> already store as the push identifier.

Using just the tip identifier locks us into 1 bundle per push. Mercurial is working on a "bundle2" format, which we'll eventually want to switch to.
(In reply to Gregory Szorc [:gps] from comment #9)
> Using just the tip identifier locks us into 1 bundle per push. Mercurial is
> working on a "bundle2" format, which we'll eventually want to switch to.

There is still one bundle per push with bundle2. Or were you thinking to separate changegroups? Sounds useless. And what would you do with the other data (phases, obsmarkers, bookmarks, etc.)?
Depends on: 1108729
https://reviewboard.mozilla.org/r/1083/#review771

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +``hg pull`` exhibitis similar behavior: only non-bundle-based changesets

exhibits

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +bundle2 pull    yes         yes

are we actually enabling bundle2 on the server?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    def write(self, data, bundlenodes):

@abc.abstractmethod?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +        They key of the written data MUST be returned.

The

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +class simplefilebundleindex(object):

This should inherit from abstractbundleindex.

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +        filename = util.sha1(data).hexdigest()

Considering the bundles can't have more than one head, why not store bundles by the sha1 of their tip changeset instead?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    # 5) A changegroup.cg1unpacker() is returned. It's internal stream has

Its

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    return 1

IIRC returning 0 here would break the client, right?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +# client's way of saying "I know about this head, you don't need to send it

You mean it's the client's way of saying "I know you have this head that I don't have, give it to me"

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    res = orig(repo, proto, key)

res = orig(repo, proto, key)
if not isbundleheadsrepo(repo):
    return res

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    res = orig(repo, proto)

same as above

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    res = '%s %s\n' % (res, '0' * 40)

heh, mercurial is actually happy with the nullid being returned as a head?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +            return orig(repo, source, heads=heads, **kwargs)

There are too many orig(repo, source, heads=heads, **kwargs) calls for my taste. Can you try rewrapping this function so there's only one?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +        """

How about using named tuples? That would make the bundles[n][p] later on slightly more readable.

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +to public phase.

You should mention that without load balancing affinity, matters get worse.

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    command is just a web command.

Too many "is" in this sentence.

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    original/unfiltered repo class and then recrease a proxy object for it.

recreate?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +                        if len(changeid) != 20:

You're already transforming into the binary form above; do you really expect there to be non-40-byte, non-20-byte changeids that end up with a usable value after calling bin()?

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +                    newrepo = makebundlerepo(repo, bundle)

So the sad part here is that in the pull case you end up calling makebundlerepo twice for the same request.

::: hgext/bundleheads/s3store.py
(Diff revision 1)
> +            self._conn = S3Connection(self._access_key, self._secret_key,

FWIW, here's an interesting bit about boto: S3Connection doesn't actually open a connection. Neither does get_bucket with validate=False.

::: hgext/bundleheads/s3store.py
(Diff revision 1)
> +                bucket=self._bucket)

There is no bucket argument to S3Connection.

::: hgext/bundleheads/s3store.py
(Diff revision 1)
> +        key.hey = name

"hey"?

I generally prefer to use the bucket methods instead of creating Key instances directly.

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +                    raise Abort('unknown storetype: %s' % storetype)

It feels weird that the vfs store type and the s3 store type are not handled in a similar way. I started writing that you're not hooking s3 here, until I realized it is hooked differently.

::: hgext/bundleheads/sqliteindex.py
(Diff revision 1)
> +    'CREATE INDEX IF NOT EXISTS pushlog_user ON pushlog (user)',

I would be more comfortable if this shared code with the pushlog extension.

::: hgext/bundleheads/__init__.py
(Diff revision 1)
> +    repo.ui.write('produced bundle %s with %d changesets in %d bytes\n' % (

the "in n bytes" part reads weird.

::: hgext/bundleheads/tests/test-bundle-availability.t
(Diff revision 1)
> +We can pull over HTTP immediate after we pushed via SSH

immediately

::: hgext/bundleheads/tests/test-bundle-availability.t
(Diff revision 1)
> +  (run 'hg update' to get a working copy)

check phases status?

::: hgext/bundleheads/tests/test-hgweb.t
(Diff revision 1)
> +

It would help the test readability to have a hg log -G -T '{node} {desc}' output here.

::: hgext/bundleheads/tests/test-bundle-availability.t
(Diff revision 1)
> +  $ cat hg.pid >> $DAEMON_PIDS

Don't you also need a command to cleanup daemons?

::: hgext/bundleheads/tests/test-local-unbundle-allowed.t
(Diff revision 1)
> +Pulling will not trigger special unbundle code path

Maybe it would be better to have at least one push performed beforehand?

::: hgext/bundleheads/tests/test-pull.t
(Diff revision 1)
> +

You should also be able to hg clone -r

::: hgext/bundleheads/tests/test-pull.t
(Diff revision 1)
> +  $ hg pull -r 3fda63883101

might be good to check that a bundle2 is actually exchanged here.

::: hgext/bundleheads/tests/test-push-basic.t
(Diff revision 1)
> +When we add a new commit on top of an existing bundled commit and push, we get a new bundle

a new bundle containing the existing bundled commit. (worth mentioning)

::: hgext/bundleheads/tests/test-sqliteindex.t
(Diff revision 1)
> +  }

It would be good to have a test with the case where two bundles share changesets.

::: hgext/pushlog-legacy/pushlog-feed.py
(Diff revision 1)
> -                        self.entries.append((id,user,date,node))
> +                        self.entries.append((id,user,date,node.encode('ascii')))

Those pushlog changes seem like they should go with the pushlog refactor bug.
Attachment #8530476 - Flags: review?(mh+mozilla)
https://reviewboard.mozilla.org/r/1083/#review1089

> are we actually enabling bundle2 on the server?

I plan to enable bundle2 as soon as it is stable. But presumably bundle2 will be enabled in Mercurial core once it is stable...

I'm worried about BC issues if we enable bundle2 now. It isn't easy for us to atomically update all Mercurial clients. I'd rather not open this can of worms.

> This should inherit from abstractbundleindex.

Indeed. Every time I see abstract base classes in Python, a voice in my head says "stop writing Java" and I fall back to doing something like this. But I'll use `abc` if that's what you want. It certainly is a textbook use case.

> Considering the bundles can't have more than one head, why not store bundles by the sha1 of their tip changeset instead?

We certainly could do this today without any problems. I'm a bit concerned about what the future holds. e.g. if we start storing bundles with multiple heads or if we change what needs to get covered in the hash. I kinda like the idea of making everything content-derived. That seems to have the fewest problems with key collisions.

> the "in n bytes" part reads weird.

I plan on rewriting this. Will probably delete it outright or move it to a debug statement.

> IIRC returning 0 here would break the client, right?

I believe so. And, uh, I forgot exactly what breaks, if anything. I'll figure it out and document it.

> heh, mercurial is actually happy with the nullid being returned as a head?

I was surprised by that as well!

I think relying on this is definitely a bit haphazard. If it breaks in future Mercurial versions, we can set it to some arbitrary value on the server: the value is treated as a black box to the client (assuming the changeset doesn't actually exist).

> I would be more comfortable if this shared code with the pushlog extension.

I would too. When I first wrote this, it was before pushlog was written as an extension (or at least before that code was reviewed). I didn't want to bloat scope to include the pushlog rewrite. Fast-forward a few weeks and things have changed. I'll re-evaluate this.

> Don't you also need a command to cleanup daemons?

No. $DAEMON_PIDS is magic built into the Mercurial test harness (for better or worse).

> Those pushlog changes seem like they should go with the pushlog refactor bug.

Yup. I got lazy with my patch writing. IIRC these patches conflicted and I wanted to keep the heads separate in my local repo.
Attachment #8530476 - Flags: review?(mh+mozilla)
/r/1085 - bundleheads: extension for bundle-based heads (bug 1055298)

Pull down this commit:

hg pull review -r b8b94e977a3aeb98956899cb19dbed6ee642c46f
https://reviewboard.mozilla.org/r/1083/#review1091

No major changes in this push. Just incorporating low-hanging fruit from previous review. Mostly comment changes.
> > Don't you also need a command to cleanup daemons?
>
> No. $DAEMON_PIDS is magic built into the Mercurial test harness (for better or worse).

Why do so many tests invoke "$TESTDIR/killdaemons.py" $DAEMON_PIDS, then? cargo-culting?
(In reply to Mike Hommey [:glandium] from comment #16)
> > > Don't you also need a command to cleanup daemons?
> >
> > No. $DAEMON_PIDS is magic built into the Mercurial test harness (for better or worse).
> 
> Why do so many tests invoke "$TESTDIR/killdaemons.py" $DAEMON_PIDS, then?
> cargo-culting?

It is either cargo-culting or tests wishing to kill daemons before the test actually finishes. I know there's a few tests that stop daemons via killdaemons.py.
https://reviewboard.mozilla.org/r/1083/#review1101

::: hgext/bundleheads/tests/test-bundle-availability.t
(Diff revisions 1 - 2)
> -We can pull over HTTP immediate after we pushed via SSH
> +We can pull over HTTP immediatey after we pushed via SSH

immediately

::: hgext/pushlog-legacy/pushlog-feed.py
(Diff revisions 1 - 2)
> +            pass

Congratulations, you triggered a bug in reviewboard, where diff orig 1 + diff 1 2 != diff orig 2.

Many of the comments from the previous review still apply.
Attachment #8530476 - Flags: review?(mh+mozilla)
(In reply to Gregory Szorc [:gps] from comment #13)
> > Considering the bundles can't have more than one head, why not store bundles by the sha1 of their tip changeset instead?
> 
> We certainly could do this today without any problems. I'm a bit concerned
> about what the future holds. e.g. if we start storing bundles with multiple
> heads or if we change what needs to get covered in the hash. I kinda like
> the idea of making everything content-derived. That seems to have the fewest
> problems with key collisions.

Why would we ever allow multiple-head bundles? That seems like a hell of a stretch to imagine a hypothetical future where this might be allowed. Especially considering how just using the sha1 of the push head would simplify reverse lookups: you wouldn't actually have to store anything more than we currently have in the pushlog!
(In reply to Mike Hommey [:glandium] from comment #18)
> Congratulations, you triggered a bug in reviewboard, where diff orig 1 +
> diff 1 2 != diff orig 2.

ah... it's just being confused by both patches not using the same base... I thought it was smarter than that at interdiffing...
https://reviewboard.mozilla.org/r/1083/#review1109

::: hgext/pushlog-legacy/pushlog-feed.py
(Diff revisions 1 - 2)
> +            pass

This is because I rebased.

RB is showing the actual intradiff, rebase and all.

You can make the argument that RB should only show diffs from relevant hunks. But I can easily make the opposite argument: if a file changes, you probably want to see what else in the file changed between revisions. After all, review is about approving the final state of a file, not just the changes.

But yeah, it would be nice if RB had a way to show/hide *unrelated changes*.
Attachment #8530476 - Attachment is obsolete: true
No longer blocks: 1055302
Facebook has built the "infinitepush" extension, which is basically my PoC extension done better: https://bitbucket.org/facebook/hg-experimental/src/default/infinitepush/?at=default I'm told the extensive comments in the bundleheads extension aided their work.

I have yet to look at it in more detail. I'll likely chat with its designers in a few weeks.
QA Contact: hwine → klibby
Following discussion at the Mercurial 4.6 developer sprint, the "infinitepush" extension was just upstreamed with functionality that will make it suitable for "Try server" use cases.

Essentially, the repo is configured to be in a mode where all incoming pushes are siphoned off to a "store" containing bundle files. There is an "index" containing which changesets are in which bundles. An `hg clone` or `hg pull` will operate on the "base" repo. But if an `hg pull -r <rev>` is performed, the server can find data from the various stored bundles and serve it transparently. The "index" and "store" bits are defined via an interface, so you can have different implementations to suit your fancy. Upstream has support for file and MySQL indexes and stores currently.

I envision us using this in the following manner:

* try repo becomes an `hg share` of the mozilla-unified repo.
* We configure a SQL-based index - either backed by a local SQLite file or a proper hosted database
* We store bundles in a SQL database or S3 bucket

This will take a bit of work to deploy on hg.mo because of replication and pushlog. If we use a hosted SQL server for the index and store, that problem should go away. Pushlog will need work to handle pushes referencing changesets that don't exist in the repo proper. (I'm pretty sure that if we used pushlog with infinitepush as-is, pushlog would elide entries because they aren't in the store.)
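Under that plan, the server-side repo configuration might look something like the following (option names follow the upstream infinitepush extension as of the time; the paths are placeholders, and a production deployment would likely use the SQL-backed index and store rather than the file-backed ones shown here):

```ini
[extensions]
infinitepush =

[infinitepush]
# Act as an infinitepush server: siphon incoming pushes into the
# bundle store instead of adding heads to the repo proper.
server = yes
# File-backed index and store; 'sql' variants exist upstream.
indextype = disk
indexpath = /repo/hg/infinitepush-index
storetype = disk
storepath = /repo/hg/infinitepush-bundles
```

With this in place, `hg pull -r <rev>` for a stored changeset is served transparently from the bundle store, while plain `hg clone`/`hg pull` only see the base repo.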

Also, unassigning the bug because I'm not actively working on it. (Although I did just review and land infinitepush upstream at https://phab.mercurial-scm.org/D2096.)
Assignee: gps → nobody
Status: ASSIGNED → NEW

It looks like the infinitepush extension is currently slated for removal at the end of 2020: https://www.mercurial-scm.org/wiki/WhatsNew#Mercurial_5.2_.282019-11-05.29

Assignee: nobody → sheehan
Type: defect → enhancement
OS: Windows 8.1 → Unspecified
Hardware: x86_64 → Unspecified