Bug 1517978 (Open), opened 6 years ago, updated 1 year ago

Implement syntactically-bound token-centric blame / annotate to better deal with code reformattings and refactorings (AKA hyperblame and hyperannotate)

Categories: Webtools :: Searchfox (enhancement)
Severity: normal
Status: ASSIGNED

People: (Reporter: kats, Assigned: asuth)

This is bug 1517657 but for searchfox, since it has the same problem. I like Karl's idea of using the most recently modified line of the ones that get collapsed.

I spent some time writing code to do this and it seems to work, but it is really slow. It also only fixes some of the problems: within a hunk, lines early in the hunk can get split or merged, which throws off the blame for the rest of the lines.

I have an idea on how to do this better: for each revision to ignore, we should compute an accurate mapping from the "after" state to the "before" state, and save that in a tarball on S3 so that we don't have to recompute it on every run. Instead we would just compute the mapping for any new revisions added to the list.

In order to compute the mapping accurately I think we want to take the "before" code and the "after" code from the hunk and recursively do longest-substring matches to compute a character mapping. Then, for a given line in the "after" text, go through each character on the line that has a corresponding "before" character, and pick the most recently modified of the containing "before" lines. This should build an accurate line map, which we would save to disk and put in S3. But it's going to be relatively expensive to compute.
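As a concrete sketch of that recursive matching (my own hypothetical shape, not an actual implementation), the following computes the character mapping with a naive O(n*m) dynamic program; a real implementation would swap in a faster suffix-structure-based matcher:

    // Longest common substring of `before` and `after`, as
    // (start_in_before, start_in_after, length); None if no overlap.
    fn longest_common_substring(before: &[u8], after: &[u8]) -> Option<(usize, usize, usize)> {
        let mut best = (0usize, 0usize, 0usize);
        let mut prev = vec![0usize; after.len() + 1];
        for i in 0..before.len() {
            let mut cur = vec![0usize; after.len() + 1];
            for j in 0..after.len() {
                if before[i] == after[j] {
                    // Extend the common suffix ending at (i, j).
                    cur[j + 1] = prev[j] + 1;
                    if cur[j + 1] > best.2 {
                        best = (i + 1 - cur[j + 1], j + 1 - cur[j + 1], cur[j + 1]);
                    }
                }
            }
            prev = cur;
        }
        if best.2 > 0 { Some(best) } else { None }
    }

    // Recursively map each "after" byte offset to its "before" offset;
    // offsets not covered by any common substring stay None (new text).
    fn map_offsets(
        before: &[u8], before_base: usize,
        after: &[u8], after_base: usize,
        mapping: &mut [Option<usize>], // indexed by absolute "after" offset
    ) {
        if let Some((b_start, a_start, len)) = longest_common_substring(before, after) {
            for k in 0..len {
                mapping[after_base + a_start + k] = Some(before_base + b_start + k);
            }
            // Recurse on the unmatched text to the left and right of the match.
            map_offsets(&before[..b_start], before_base,
                        &after[..a_start], after_base, mapping);
            map_offsets(&before[b_start + len..], before_base + b_start + len,
                        &after[a_start + len..], after_base + a_start + len, mapping);
        }
    }

With the mapping in hand, the per-line step is just: for each "after" line, collect the "before" lines containing the mapped offsets and keep the most recently modified one.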

The bcmp crate should help with the recursive substring matching part, since it has an implementation of Ukkonen's algorithm, which is what we want for fast substring matching.

Bug 1517657 comment 2 shows another possible solution.

(In reply to Marco Castelluccio [:marco] from comment #2)
> Bug 1517657 comment 2 shows another possible solution.

Holy cow, the https://cregit.linuxsources.org/code/4.19/net/ethernet/eth.c.html example seems amazing. It would definitely be fantastic for searchfox to be able to display that just a portion of a line changed, etc. I suspect this means that the code formatting logic would want a stateful HTML emitter that would track outstanding highlighting ranges and fragment them appropriately across the different lines and div nesting that we do for the position: sticky stuff. (Or the existing ad hoc code could be extended VERY carefully...)
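To make "stateful HTML emitter" concrete, here's a rough sketch of what I mean (all names are hypothetical, HTML escaping is omitted, and it assumes sorted, non-overlapping byte ranges that fall on character boundaries):

    // Fragments highlight ranges at line boundaries so that every line's
    // HTML is self-contained and can sit in its own (sticky-friendly) div.
    struct HighlightEmitter {
        ranges: Vec<(usize, usize)>, // absolute (start, end) byte offsets
        pos: usize,                  // absolute offset of the next line
    }

    impl HighlightEmitter {
        fn new(mut ranges: Vec<(usize, usize)>) -> Self {
            ranges.sort();
            HighlightEmitter { ranges, pos: 0 }
        }

        // Emit one line, clamping any overlapping highlight to the line so
        // that no <span> ever crosses a line/div boundary.
        fn emit_line(&mut self, line: &str) -> String {
            let (start, end) = (self.pos, self.pos + line.len());
            let mut html = String::new();
            let mut cursor = start;
            for &(hs, he) in &self.ranges {
                let (hs, he) = (hs.max(start), he.min(end));
                if hs >= he {
                    continue; // this range doesn't touch this line
                }
                html.push_str(&line[cursor - start..hs - start]);
                html.push_str("<span class=\"hl\">");
                html.push_str(&line[hs - start..he - start]);
                html.push_str("</span>");
                cursor = he;
            }
            html.push_str(&line[cursor - start..]);
            self.pos = end + 1; // skip the newline
            html
        }
    }

The state that survives across emit_line calls is just the absolute offset; a multi-line highlight naturally becomes one balanced span per line it touches.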

It looks like https://github.com/mozilla/microannotate is where the tool lives?

Blocks: 1673468

(In reply to Andrew Sutherland [:asuth] (he/him) from comment #3)
> It looks like https://github.com/mozilla/microannotate is where the tool lives?

Sorry for not replying earlier... Yes, that's the tool! It supports splitting by word, and it also has an option for removing comments (useful if you're interested in blaming code only, though I guess that's not the case for Searchfox).

I spent some time looking at https://github.com/mozilla/microannotate with some help from Marco today (in #codecoverage on chat.mozilla.org). My understanding is:

  • microannotate transforms the source so that there's one non-whitespace token per source line, splitting with a \w+ regex plus explicit treatment of a bunch of programming punctuation characters as their own tokens (a rough sketch follows this list).
    • Whitespace is ignored and is re-inferred from the original text/token stream at rendering/generation time.
    • The optional rust-code-analysis hookup is only for comment removal and does not inform the determination of what's a token.
  • This allows git's normal line-centric blame/annotate logic to operate on a per-token basis (which means searchfox's blame precomputation strategy is still viable).
  • Because whitespace is no longer significant, changes to whitespace don't impact blame history at all. Other code formatting changes like introduction/removal of braces/extra parentheses similarly do not meaningfully impact blame.
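For illustration, here's my rough approximation of that transform (not microannotate's actual code; the real tool's exact token rules may differ):

    // Rewrite source so each non-whitespace token sits on its own line:
    // \w+ runs become one token, other non-whitespace characters become
    // single-character tokens, and whitespace is simply dropped.
    fn tokenize_to_lines(source: &str) -> String {
        let mut out = String::new();
        let mut word = String::new();
        for ch in source.chars() {
            if ch.is_alphanumeric() || ch == '_' {
                word.push(ch); // accumulate a \w+ run
            } else {
                if !word.is_empty() {
                    out.push_str(&word);
                    out.push('\n');
                    word.clear();
                }
                if !ch.is_whitespace() {
                    // Punctuation becomes its own token; whitespace is
                    // re-inferred from the original text at render time.
                    out.push(ch);
                    out.push('\n');
                }
            }
        }
        if !word.is_empty() {
            out.push_str(&word);
            out.push('\n');
        }
        out
    }

Running that on int x = 42; yields the five lines int / x / = / 42 / ;, which is what lets git's line-oriented diff and blame machinery operate per-token.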

Various thoughts:

  • The big question is whether this makes the greedy diff algorithm do stuff that's unacceptably wackier than a line-centric approach. It opens up more territory for bizarre longest subsequences even as it removes whitespace from the equation.
    • Looking at some random git log -p's from the derived repo at https://github.com/marco-c/gecko-dev-wordified (may be private), it seems like there are some weird reuses of random punctuation tokens, but that's hardly surprising and is something that can potentially be addressed by making the policy decision to not surface / allow following the blame of random punctuation that's almost certain to be nonsensical.
    • Otherwise the diffs make sense and that also makes sense because most patches to mozilla-central are going to be quite small in size, especially with whitespace removed from the equation.
  • The availability of https://github.com/mozilla/rust-code-analysis presents some interesting potential to semantically bind the punctuation characters so that they aren't randomly reused and thereby help avoid weird diffs.
    • That is, for the closing } of the function doFoo, one could emit } doFoo or } HASH, where HASH is a fixed-size hash over "doFoo", trading string length against collision chance (see the sketch after this list).
    • This additional semantic data attached to the tokens could also assist in reconstructing moves in a file as tokens whose identities under normal blame processing would be ambiguous are rendered entirely unambiguous. If the opening and closing braces of a function were both deleted and reappeared elsewhere in the file, that's a move, eh?
    • That said, one might also go directly to an AST-based diff, such as https://github.com/afnanenayet/diffsitter but that would represent a more fundamental architectural change.
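A tiny sketch of the } HASH idea (hypothetical; note that DefaultHasher's output isn't guaranteed stable across Rust releases, so a real implementation would want a pinned hash function):

    use std::collections::hash_map::DefaultHasher;
    use std::hash::{Hash, Hasher};

    // Bind a punctuation token to its enclosing definition by suffixing
    // a truncated hash: short enough to keep tokens compact, long enough
    // that collisions within one file are unlikely.
    fn bind_punctuation(token: &str, enclosing: &str) -> String {
        let mut h = DefaultHasher::new();
        enclosing.hash(&mut h);
        format!("{} {:08x}", token, h.finish() as u32)
    }

    fn main() {
        // For the closing brace of doFoo, emits something like "} 9b2f41c7".
        println!("{}", bind_punctuation("}", "doFoo"));
    }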

Using rust-code-analysis to split lines in a smarter way was one of the things I wanted to try at the time.

That said, one might also go directly to an AST-based diff, such as https://github.com/afnanenayet/diffsitter but that would represent a more fundamental architectural change.

This would be super interesting, I wanted to support something like that in rust-code-analysis! Yeah, I think using that would be way more complex than just word-level, where you can more or less use the tools we already have with very few changes.

An interesting practical example of the diff algorithm doing unexpected-but-reasonable things just came up in a use-case for :Gijs, where the question of the origin of nsWebBrowserPersist::CalculateAndAppendFileExt came up. microannotate says the origin is https://hg.mozilla.org/mozilla-central/rev/7c5745d2ff1f1c26945f5d68b9c68673e62c5797, and the gecko-dev-wordified commit at https://github.com/marco-c/gecko-dev-wordified/commit/a786fcc4e1dc666300f0ad5a59a3e7e407466e25 seems to show what's happening: because a relatively large block of code (nsWebBrowserPersist::EnumPersistURIs) was moved around in that commit, the diff algorithm found it was easier to move the nsWebBrowserPersist::CalculateAndAppendFileExt token stream around by deleting it and re-adding it.

That is, a trade-off was made here about which blame should survive, and the algorithm picked the less obvious option.

I'll be hobby-hacking on this during PTO this week, so I'm assigning this to myself to avoid overlap on this goal. This does not mean that the current blame implementation is going anywhere! (And in particular I see from https://github.com/mozsearch/mozsearch/pull/644 that :kats may be looking at some enhancements? edit: I now see this was in response to a rustc warning.) My hope is just to get this to a prototype stage I can share[1]. I'm optimistic because I previously got a tree-sitter based tokenizer in place that implements the "semantic"[2] binding I proposed in comment 5, so I'm fairly confident about that feature. The question for me is how much progress I'll make on the other history-related functionality I'm envisioning, which does need to be baked into the history processing pipeline[3].

1: I think this probably crosses into functionality that would be useful for others even if it won't be land-able in its current state, so I've re-created an "asuth" channel that will be available at https://asuth.searchfox.org/ and will mention any seemingly-working runs at https://chat.mozilla.org/#/room/#searchfox:mozilla.org and/or on my blog at https://www.visophyte.org/blog/ which is on planet.mozilla.org. Because there are costs associated with channels like this, I'll turn it back off if no one's using it once I'm not actively hacking on this.

2: "syntax-bound" is probably a better way to put it since I'm using tree-sitter which for something like C++ is definitely not semantic, although the use of the pattern-matching query language (like is used for the "tags.scm" files) does mean that we are operating at a slightly higher level of abstraction than just the pure AST. In particular, this does open up some possibilities to impose additional layers in the binding stack based on custom heuristics that aren't something that we would magically have if we were consume a clang AST but instead would also need to create via tree-matching.

3: Note that I'm very explicitly building on top of build-blame.rs. It just will get used twice: the first time we build the "dual" syntax-based representation of the underlying tree, and then the second stage consumes that syntax git tree instead of the source git tree. The only real change to history processing for the second stage is that in the '+' and '-' processing we can leverage the binding stack to infer code re-ordering and moves instead of treating all '+' diff lines as new code (a rough sketch of that move inference follows). Also, this is a spot where I'd like to derive information to allow git log -S-like super-powers, where we can do things like create token "tombstones" so that if someone searches for a token that no longer exists, the search results can provide a link to the history of the token[4].
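Sketching what that '+'/'-' move inference could look like (an entirely hypothetical structure on my part): removed tokens get indexed by their syntax-bound identity, and an added token with a matching identity inherits the old blame instead of being attributed to the new commit.

    use std::collections::HashMap;

    #[derive(Clone)]
    struct BlameEntry {
        rev: String, // revision currently blamed for the token
    }

    // Resolve blame for the '+' side of a diff given the '-' side.
    fn resolve_blame(
        removed: &[(String, BlameEntry)], // (bound token, old blame)
        added: &[String],                 // bound tokens on '+' lines
        current_rev: &str,
    ) -> Vec<BlameEntry> {
        // Pool the removed tokens by identity so moves can be matched up.
        let mut pool: HashMap<&str, Vec<&BlameEntry>> = HashMap::new();
        for (tok, blame) in removed {
            pool.entry(tok.as_str()).or_default().push(blame);
        }
        added
            .iter()
            .map(|tok| match pool.get_mut(tok.as_str()).and_then(|v| v.pop()) {
                // The token also vanished elsewhere in the diff: treat it
                // as a move and carry the old blame forward.
                Some(old) => (*old).clone(),
                // Genuinely new code: blame the current revision.
                None => BlameEntry { rev: current_rev.to_string() },
            })
            .collect()
    }

This is where the syntax binding pays off: an unbound } would match almost anything, but a } bound to its enclosing function only matches its true counterpart.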

4: There are some interesting options for keeping the amount of data that searchfox has to deal with at any time reasonable and thereby keeping searchfox responsive. In particular, my tentative plan is to take an RRDtool-like approach where the HEAD revision of any summary history file would have detailed per-revision information for "recent" changes, and then apply levels of aggregation/summarization for "medium old" and "very old" changes. Any consolidated (JSON-ND) record could reference the pre-change-HEAD revision where the pre-aggregated data can be found, and each aggregated block would also hold the first/last revisions of the data it's a roll-up for (a hypothetical record shape is sketched below). There would also be a header in the file which could indicate that a ridiculously common token has been categorized as a stopword and is being handled differently. This needn't mean that we don't do anything for the token at all: there could be interesting fun to be had from having weekly token added/removed stats to be able to show comparative visualizations of the use of RefPtr versus UniquePtr. Stats for the token "the" might not be quite so useful, but as long as the indexer and web-server don't catch on fire, there might not need to be special handling for extreme stopwords.
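For concreteness, a hypothetical shape for those JSON-ND records (the field names are invented by me, and this assumes a serde dependency):

    use serde::{Deserialize, Serialize};

    // One line of the summary history file per record, RRDtool-style:
    // fine-grained entries for recent history, roll-ups for older spans.
    #[derive(Serialize, Deserialize)]
    #[serde(tag = "kind")]
    enum HistoryRecord {
        // Detailed per-revision entry for "recent" changes.
        Change { rev: String, token: String, added: u32, removed: u32 },
        // Aggregated block covering "medium old" / "very old" changes.
        Rollup {
            first_rev: String,
            last_rev: String,
            added: u32,
            removed: u32,
            // Pre-change-HEAD revision of this file where the
            // pre-aggregation detail can still be found.
            detail_in_rev: Option<String>,
        },
        // Header entry flagging a ridiculously common token as a stopword.
        Stopword { token: String },
    }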

Assignee: nobody → bugmail
Status: NEW → ASSIGNED

Thanks to (deeply appreciated) manager support, I have a (time-boxed 1-2 week) OKR to finish this out now. Thanks to the progress made during my PTO week I'm quite confident about being able to land a core implementation of this[1], and so I've re-titled the bug to capture the implementation choices made (and because I keep having trouble finding this bug via the awesomebar).

1: There are probably some open questions about how much of the rest of my hobby stack should also be landed, but I'll address that on the maintainer-ish mailing list.

Summary: Do better with skip-blame for when lines get collapsed → Implement syntactically-bound token-centric blame / annotate to better deal with code reformattings and refactorings (AKA hyperblame and hyperannotate)