1258460 - Write microbenchmarks for tree traversal algorithms

Reporter

Description

•

9 years ago

In bug 1226919 and its dependencies, we are writing generic tree traveral algorithms, and replacing manual loops over certain trees, such as the Layer tree, with calls to these algorithms. I don't expect this transformation to have any performance impact, but it would be good to verify that this is the case. Now that we have a micro-benchmarking framework (added in bug 1256408), that would be a good way to verify this.

Bas Schouten (:bas.schouten)

Updated

•

9 years ago

Whiteboard: gfx-noted

Kevin Wern

Assignee

Updated

•

9 years ago

Assignee: nobody → kevin.m.wern

Botond Ballo [:botond]

Reporter

Updated

•

9 years ago

Mentor: botond

Kevin Wern

Assignee

Comment 1

•

8 years ago

Hey, Botond! I'm taking a look over the patch you linked to, and am starting work on this bug this weekend. I'm thinking that I should extend TestTreeTraversal to include benchmarks for a node being traversed by normal recursion, and then the corresponding solution using our new functions. In addition, I've been thinking two other things: - Are there functions in gfx/apz code that would make good 'practical' tests? - Other than the fact that we're calling lambda functions, is there any potential for overhead? I can't think of any way that the specific objects being captured or calls being made would expose any performance difference (other than magnifying any differences because more work is done in each call). Any other information you think would be helpful is appreciated, as well.

Flags: needinfo?(botond)

Botond Ballo [:botond]

Reporter

Comment 2

•

8 years ago

(In reply to Kevin Wern from comment #1) Thanks for working on this! > I'm thinking that I should extend TestTreeTraversal to > include benchmarks for a node being traversed by normal recursion, and then > the corresponding solution using our new functions. That sounds like the right idea. > - Are there functions in gfx/apz code that would make good 'practical' tests? Nothing in particular comes to mind. It might make sense to have a test case that does a search of the tree, and one that traverses the entire tree and performs some computation (e.g. say the leaves start off storing numbers, the algorithm could compute, for each node, the sum of the numbers of its children; or, if you want to be particularly realistic, you could use regions instead of numbers, and take the union of the regions). > - Other than the fact that we're calling lambda functions, is there any > potential for overhead? I can't think of any way that the specific objects > being captured or calls being made would expose any performance difference > (other than magnifying any differences because more work is done in each > call). A couple of ideas: - Capturing a lot of variables, including some by reference, might make it harder for the compiler to inline the call to the lambda, so it might be worth testing that. - Using a separate stack for keeping track of state as we traverse the tree, as we do in some places (e.g. ApplyAsyncContentTransformToTree), might be more expensive than before, when the recursion just made use of the program's stack for this purpose. It's probably worth testing this too. > Any other information you think would be helpful is appreciated, as well. Feel free to build the tree structure you're testing on algorithmically, e.g. loop to add 20 levels of depth and things like that.

Flags: needinfo?(botond)

Kevin Wern

Assignee

Comment 3

•

8 years ago

Attached file Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms (deleted) — Details

Review commit: https://reviewboard.mozilla.org/r/58428/diff/#index_header See other reviews: https://reviewboard.mozilla.org/r/58428/

Attachment #8761412 - Flags: review?(botond)

Kevin Wern

Assignee

Comment 4

•

8 years ago

Attached file Output from tests, including benchmarks (deleted) — Details

Kevin Wern

Assignee

Comment 5

•

8 years ago

I tried to use regions rather than numbers for ForEachNode operations. If you feel these cases are too complicated, please let me know. Additionally, is 26 variables enough of a test for your first point? I couldn't find anything regarding compiler implementation that gave a concrete number. I attached the output of one test, and will also attach 3 additional trials I ran for an overview of how everything seems to be stacking up.

Kevin Wern

Assignee

Comment 6

•

8 years ago

Attached file Overview of test cases, tab-separated (deleted) — Details

Botond Ballo [:botond]

Reporter

Comment 7

•

8 years ago

Comment on attachment 8761412 [details] Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms https://reviewboard.mozilla.org/r/58428/#review55530 Thanks, Kevin, for these thorough benchmarks! A few comments: ::: gfx/tests/gtest/TestTreeTraversal.cpp:14 (Diff revision 1) > +#include <queue> > + > +#define PERFORMANCE_TREE_DEPTH 20 > +#define PERFORMANCE_TREE_CHILD_COUNT 2 > +#define PERFORMANCE_TREE_LEAF_COUNT 1048576 // 2 ** 20 > +#define PERFORMANCE_REGION_XWRAP 100000 These can be constant integers instead of defines. ::: gfx/tests/gtest/TestTreeTraversal.cpp:1252 (Diff revision 1) > +{ > + int depth = PERFORMANCE_TREE_DEPTH; > + int childrenCount = PERFORMANCE_TREE_CHILD_COUNT; > + RefPtr<SearchTestNodeForward> needleNode; > + RefPtr<SearchTestNodeForward> root = CreateBenchmarkTree<SearchTestNodeForward>(depth, childrenCount, > + [&needleNode](SearchTestNodeForward* aNode) { It's often convenient to use a lambda as an action, but of course it doesn't have to be a lambda - it can be a regular function or function object. In this case, as we re-use this particular action in many different tests, it makes sense to factor it out into a function object, which is then re-used by the various tests. This applies to several lambdas that we use over and over again in these tests. ::: gfx/tests/gtest/TestTreeTraversal.cpp:1254 (Diff revision 1) > + int childrenCount = PERFORMANCE_TREE_CHILD_COUNT; > + RefPtr<SearchTestNodeForward> needleNode; > + RefPtr<SearchTestNodeForward> root = CreateBenchmarkTree<SearchTestNodeForward>(depth, childrenCount, > + [&needleNode](SearchTestNodeForward* aNode) { > + aNode->SetType(SearchNodeType::Hay); > + if (aNode->GetFirstChild() == nullptr) { Perhaps we could add an IsLeaf() method to TestNodeReverse and TestNodeForward, for better readability? ::: gfx/tests/gtest/TestTreeTraversal.cpp:1296 (Diff revision 1) > +static RefPtr<Node> DepthFirstSearchCaptureVariablesForwardRecursive(RefPtr<Node> aNode, > + int a, int b, int c, int d, int e, int f, > + int g, int h, int i, int j, int k, int l, > + int m, int& n, int& o, int& p, int& q, int& r, > + int& s, int& t, int& u, int& v, int& w, int& x, > + int& y, int& z) These variables may just be optimized out if they're not actually used. Perhaps we could modify the search to actually use them? For example, we could have the nodes store an integer value, and search for the one whose value is the sum of these variables. This applies to both the plain and algorithm versions. ::: gfx/tests/gtest/TestTreeTraversal.cpp:1498 (Diff revision 1) > +// each ancestor's region as the union of its child regions. > +template <typename Node> > +static void ForEachNodePostOrderForwardRecursive(RefPtr<Node> aNode) > +{ > + if (aNode->GetFirstChild()) { > + aNode->SetRegion(new nsRegion()); We're leaking these regions. It would be better to have the nodes store them by value. ::: gfx/tests/gtest/TestTreeTraversal.cpp:1506 (Diff revision 1) > + node = node->GetNextSibling()) { > + ForEachNodePostOrderForwardRecursive(node); > + nsRegion* currentRegion = aNode->GetRegion(); > + nsRegion* childRegion = node->GetRegion(); > + nsRegion* newRegion = new nsRegion(currentRegion->OrWith(*childRegion)); > + aNode->SetRegion(newRegion); currentRegion->OrWith() modifies currentRegion, so it's sufficient to do: currentRegion->OrWith(\*childRegion); instead of creating and setting a new region. Applies to the algorithm version too. ::: gfx/tests/gtest/TestTreeTraversal.cpp:1526 (Diff revision 1) > + int y = squareCount / xWrap; > + aNode->SetRegion(new nsRegion(nsRect(x, y, 1, 1))); > + squareCount++; > + } > + }); > + ForEachNodePostOrderForwardRecursive(root); Do we want to make some assertion about the root node's region? ::: gfx/tests/gtest/TestTreeTraversal.cpp:1572 (Diff revision 1) > +// Starting with a tree whose root has a rectangular region of size > +// PERFORMANCE_TREE_LEAF_COUNT x 1, for each node, split the region into > +// PERFORMANCE_TREE_CHILD_COUNT separate regions of equal width and assign to > +// each child left-to-right. In the end, every node's region should equal the > +// sum of its childrens' regions, and each level of depth's regions should sum > +// to the root's region. Nice :) ::: gfx/tests/gtest/TestTreeTraversal.cpp:1631 (Diff revision 1) > + nChildren++; > + } > + nsRect bounds = aNode->GetRegion()->GetBounds(); > + int childWidth = bounds.width / nChildren; > + int x = bounds.x; > + for (RefPtr<ForEachTestNodeForward> node = aNode->GetFirstChild(); I note that here, we're looping over the children of each node during the action, and then the algorithm will loop over the children again to traverse them, so we're doing more looping overall than the "plain" version. This is not necessarily a bad thing - if people are going to be introducing extra loops like this when refactoring their code to use the algorithms, then doing so in the benchmark is totally fair - I'm just pointing it out. ::: gfx/tests/gtest/TestTreeTraversal.cpp:1652 (Diff revision 1) > +// a starting width of PERFORMANCE_TREE_LEAF_COUNT, and an empty tree, create a > +// tree with the same conditions as > +// ((Plain|TreeTraversal)_ForwardForEachNodePerformance) by assigning regions > +// of the current width, starting from the min x and min y coordinates. For > +// each level of depth, decrease the current width by a factor of > +// PERFORMANCE_TREE_CHILD_COUNT, and maintiain a stack of ancestor regions. s/maintiain/maintain ::: gfx/tests/gtest/TestTreeTraversal.cpp:1653 (Diff revision 1) > +// tree with the same conditions as > +// ((Plain|TreeTraversal)_ForwardForEachNodePerformance) by assigning regions > +// of the current width, starting from the min x and min y coordinates. For > +// each level of depth, decrease the current width by a factor of > +// PERFORMANCE_TREE_CHILD_COUNT, and maintiain a stack of ancestor regions. > +// Use the stack to track the poriton of each region still available to assign s/poriton/portion ::: gfx/tests/gtest/TestTreeTraversal.cpp:1654 (Diff revision 1) > +// ((Plain|TreeTraversal)_ForwardForEachNodePerformance) by assigning regions > +// of the current width, starting from the min x and min y coordinates. For > +// each level of depth, decrease the current width by a factor of > +// PERFORMANCE_TREE_CHILD_COUNT, and maintiain a stack of ancestor regions. > +// Use the stack to track the poriton of each region still available to assign > +// to children, which determines the aformentioned min x and min y coordinates. s/aformentioned/aforementioned ::: gfx/tests/gtest/TestTreeTraversal.cpp:1669 (Diff revision 1) > + aRectangleWidth /= aChildrenCount; > + > + for (RefPtr<Node> node = aNode->GetFirstChild(); > + node != nullptr; > + node = node->GetNextSibling()) { > + ForEachNodeForwardStackRecursive(node, aRectangleWidth, aRegion, aChildrenCount); Did you mean to pass |newRegion| instead of |aRegion|? Otherwise why compute |newRegion| at all? ::: gfx/tests/gtest/TestTreeTraversal.cpp:1670 (Diff revision 1) > + > + for (RefPtr<Node> node = aNode->GetFirstChild(); > + node != nullptr; > + node = node->GetNextSibling()) { > + ForEachNodeForwardStackRecursive(node, aRectangleWidth, aRegion, aChildrenCount); > + newRegion = newRegion.SubOut(*(node->GetRegion())); Like OrWith(), SubOut() modifies the region it's called on, so it's sufficient to just do: newRegion.SubOut(\*(node->GetRegion())); ::: gfx/tests/gtest/TestTreeTraversal.cpp:1702 (Diff revision 1) > +static void TreeTraversal_ForwardForEachNodeStackPerformance() > +{ > + int depth = PERFORMANCE_TREE_DEPTH; > + int childrenCount = PERFORMANCE_TREE_CHILD_COUNT; > + int rectangleWidth = PERFORMANCE_TREE_LEAF_COUNT; > + stack<nsRegion*> regionStack; Store the regions in the stack by value, so we don't leak them. ::: gfx/tests/gtest/TestTreeTraversal.cpp:1714 (Diff revision 1) > + nsRegion* parentRegion = regionStack.top(); > + nsRect parentRect = parentRegion->GetBounds(); > + nsRect newRect(parentRect.x, parentRect.y, rectangleWidth, 1); > + nsRegion newRegion(newRect); > + aNode->SetRegion(new nsRegion(newRegion)); > + regionStack.top() = new nsRegion(regionStack.top()->SubOut(newRegion)); Just: regionStack.top()->SubOut(newRegion); ::: gfx/tests/gtest/TestTreeTraversal.cpp:1830 (Diff revision 1) > + RefPtr<SearchTestNodeReverse> needleNode; > + RefPtr<SearchTestNodeReverse> root = CreateBenchmarkTree<SearchTestNodeReverse>(depth, childrenCount, > + [&needleNode](SearchTestNodeReverse* aNode) { > + aNode->SetType(SearchNodeType::Hay); > + if (aNode->GetLastChild() == nullptr) { > + needleNode = aNode; Why not |if (!needleNode)| as in the previous test? ::: gfx/tests/gtest/TestTreeTraversal.cpp:1955 (Diff revision 1) > + int childrenCount = PERFORMANCE_TREE_CHILD_COUNT; > + RefPtr<SearchTestNodeReverse> needleNode; > + RefPtr<SearchTestNodeReverse> root = CreateBenchmarkTree<SearchTestNodeReverse>(depth, childrenCount, > + [&needleNode](SearchTestNodeReverse* aNode) { > + aNode->SetType(SearchNodeType::Hay); > + if (aNode->GetLastChild() == nullptr) { Similarly, should this be |if (!needleNode)|?

Attachment #8761412 - Flags: review?(botond)

Botond Ballo [:botond]

Reporter

Comment 8

•

8 years ago

(In reply to Kevin Wern from comment #6) > Created attachment 8761440 [details] > Overview of test cases, tab-separated Interesting results! If I'm interpreting them correctly, they are showing that: - For forward searches, the traversal algorithms actually outperform hand-coded loops in all tests except for the one with the separate stack. - For reverse searches, the results are more mixed, with the algorithms outperforming hand-coded loops in some tests but not others. - For searches with a separate stack, the hand-coded loop noticeably outperforms the algorithm. The difference between forward and reverse searches is a bit surprising to me; it may, however, just be due to noise in the results. The performance hit with a separate stack is expected, as we're maintaining a separate, heap-allocated data structure. We should probably be careful that we don't have such traversals in critical code paths (I don't believe we currently do).

Kevin Wern

Assignee

Comment 9

•

8 years ago

Ok! A few of the comments were just mistakes I made repurposing previous code or search/replacing--I'll do another pass for that once I'm done with this round of changes. Just one question: (In reply to Botond Ballo [:botond] from comment #7) > It's often convenient to use a lambda as an action, but of course it doesn't > have to be a lambda - it can be a regular function or function object. > > In this case, as we re-use this particular action in many different tests, > it makes sense to factor it out into a function object, which is then > re-used by the various tests. > > This applies to several lambdas that we use over and over again in these > tests. Can normal functions capture external variables for state? Otherwise, I'd try doing something creative, but I think a huge advantage here is that we don't have to reuse code for things like creating argument structs or supporting different argument lists in CreateBenchmarkTreeRecursive.

Botond Ballo [:botond]

Reporter

Comment 10

•

8 years ago

(In reply to Kevin Wern from comment #9) > (In reply to Botond Ballo [:botond] from comment #7) > > It's often convenient to use a lambda as an action, but of course it doesn't > > have to be a lambda - it can be a regular function or function object. > > > > In this case, as we re-use this particular action in many different tests, > > it makes sense to factor it out into a function object, which is then > > re-used by the various tests. > > > > This applies to several lambdas that we use over and over again in these > > tests. > > Can normal functions capture external variables for state? Otherwise, I'd > try doing something creative, but I think a huge advantage here is that we > don't have to reuse code for things like creating argument structs or > supporting different argument lists in CreateBenchmarkTreeRecursive. Plain functions can't, but function objects can. Example: With lambdas: void CallingFunction() { Type var; ... Foo(..., [&var](int arg){ /* use var and arg */ }); } With function objects: struct Action { Type& var; void operator()(int arg) { /* use var and arg */ } }; void CallingFunction() { Type var; ... Foo(..., Action{var}); } Note the use of C++11 uniform initialization to avoid having to write a constructor for Action.

Kevin Wern

Assignee

Comment 11

•

8 years ago

Comment on attachment 8761412 [details] Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58428/diff/1-2/

Attachment #8761412 - Flags: review?(botond)

Botond Ballo [:botond]

Reporter

Comment 12

•

8 years ago

Kevin, apologies for the long delay here. I've been travelling for the past couple of weeks. I hope to review your updated patch early next week.

Botond Ballo [:botond]

Reporter

Comment 13

•

8 years ago

Comment on attachment 8761412 [details] Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms https://reviewboard.mozilla.org/r/58428/#review58186 Thanks! Just one small comment remaining: ::: gfx/tests/gtest/TestTreeTraversal.cpp:1721 (Diff revision 2) > + > + for (RefPtr<Node> node = aNode->GetFirstChild(); > + node != nullptr; > + node = node->GetNextSibling()) { > + ForEachNodeForwardStackRecursive(node, aRectangleWidth, newRegion, aChildrenCount); > + newRegion = newRegion.SubOut(node->GetRegion()); Just newRegion.SubOut(node->GetRegion()); is sufficient.

Attachment #8761412 - Flags: review?(botond) → review+

Botond Ballo [:botond]

Reporter

Comment 14

•

8 years ago

https://reviewboard.mozilla.org/r/58428/#review58188 ::: gfx/tests/gtest/TestTreeTraversal.cpp:2171 (Diff revision 2) > + > + for (RefPtr<Node> node = aNode->GetLastChild(); > + node != nullptr; > + node = node->GetPrevSibling()) { > + ForEachNodeReverseStackRecursive(node, aRectangleWidth, newRegion, aChildrenCount); > + newRegion = newRegion.SubOut(node->GetRegion()); Here too.

Botond Ballo [:botond]

Reporter

Comment 15

•

8 years ago

I made the two adjustments I suggested, and pushed the patch to Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=27a894d5224b

Botond Ballo [:botond]

Reporter

Comment 16

•

8 years ago

(In reply to Botond Ballo [:botond] from comment #15) > I made the two adjustments I suggested, and pushed the patch to Try: > > https://treeherder.mozilla.org/#/jobs?repo=try&revision=27a894d5224b The Try push has some static analysis failures of this form: TestTreeTraversal.cpp:49:1: error: Refcounted record 'TestNodeReverse<SearchNodeType>' has multiple mRefCnt members The problem is that the derived classes (TestNodeReverse and TestNodeForward) and the base class (TestNodeBase) both have NS_INLINE_DECL_REFCOUNTING. It's sufficient for the base class to have it.

Kevin Wern

Assignee

Comment 17

•

8 years ago

Comment on attachment 8761412 [details] Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms Review request updated; see interdiff: https://reviewboard.mozilla.org/r/58428/diff/2-3/

Kevin Wern

Assignee

Comment 18

•

8 years ago

Sorry for the delay, just made those changes now. Try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=578935e372f2

Botond Ballo [:botond]

Reporter

Comment 19

•

8 years ago

Thanks! (In reply to Kevin Wern from comment #18) > https://treeherder.mozilla.org/#/jobs?repo=try&revision=578935e372f2 This is looking good.

Pulsebot

Comment 20

•

8 years ago

Pushed by bballo@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9740c3f2b075 create microbenchmarks for TreeTraversal.h algorithms r=botond

Botond Ballo [:botond]

Reporter

Comment 21

•

8 years ago

^ Autolanded.

Carsten Book [:Tomcat]

Comment 22

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9740c3f2b075

Status: NEW → RESOLVED

Closed: 8 years ago

status-firefox50: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla50

Bug 1258460: create microbenchmarks for TreeTraversal.h algorithms 8 years ago Kevin Wern (deleted), text/x-review-board-request	botond : review+	Details
Output from tests, including benchmarks 8 years ago Kevin Wern (deleted), text/plain		Details
Overview of test cases, tab-separated 8 years ago Kevin Wern (deleted), text/plain		Details