Closed Bug 1431291 Opened 7 years ago Closed 7 years ago

Hyperchunking of reftests on instances is inefficient, wastes a lot of money on GPU instances

Categories

(Firefox Build System :: Task Configuration, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Unassigned)

References

(Blocks 1 open bug)

Details

Bug 1373578 and bug 1396260 both increased the number of chunks being executed for various reftest configurations to 32 because it made some intermittent failures go away. The price we paid for working around the problem instead of fixing the underlying problem is that we are now spending a lot more money on GPU-enabled AWS instances to run these tasks. And that's because the efficiency of reftest tasks on these workers is... pretty bad.

Looking at logs, we spend most of the task in setup overhead! In https://public-artifacts.taskcluster.net/HBB4iBUCRviEJ19_bw5DZA/0/public/logs/live_backing.log, we start at 00:20:12, run Firefox at 00:24:24, and finish at 00:25:15. So of the ~303s the task ran, 252s was spent getting the task ready to run. That's ~83%! This in and of itself is a major problem and someone should look into it.

But the issue I want to get on people's radar is the cost we're incurring for this overhead, and specifically how much "hyperchunking" the reftests made it worse. Here is the daily spend on g2.2xlarge instances in TaskCluster:

Day        Usage (hours)   Cost
---------- --------------- ------
2017-09-01 12256           3300
2017-09-02 12406           3199
2017-09-03 6775            1495
2017-09-04 6623            1402
2017-09-05 10809           2622
2017-09-06 13401           3063
2017-09-07 13063           3137
2017-09-08 10266           2610
2017-09-09 14407           3595
2017-09-10 15657           3281
2017-09-11 4772            1077
2017-09-12 8987            2526
2017-09-13 11612           3212
2017-09-14 12016           3613
2017-09-15 12347           3118
2017-09-16 9071            2846
2017-09-17 6832            1408
2017-09-18 4930            906
2017-09-19 10415           2697
2017-09-20 12220           2747
2017-09-21 14803           3258
2017-09-22 14464           2790
2017-09-23 9097            1674
2017-09-24 6188            935
2017-09-25 5156            907
2017-09-26 10391           2034
2017-09-27 10614           2104
2017-09-28 12301           2815
2017-09-29 17054           4275
2017-09-30 16617           3370
2017-10-01 5132            759
2017-10-02 7832            1279
2017-10-03 7368            1320
2017-10-04 6272            1303
2017-10-05 8277            1818
2017-10-06 10323           2541
2017-10-07 8765            1765
2017-10-08 7378            1352
2017-10-09 6208            1024
2017-10-10 9177            1777
2017-10-11 10465           2249
2017-10-12 11516           2563
2017-10-13 14838           3210
2017-10-14 16784           3340
2017-10-15 12828           2275
2017-10-16 10961           1743
2017-10-17 12431           2254
2017-10-18 16432           4216
2017-10-19 16282           3472
2017-10-20 16787           3642
2017-10-21 14572           3641
2017-10-22 11023           2211
2017-10-23 5608            982
2017-10-24 14330           3575
2017-10-25 16589           5164
2017-10-26 17270           5036
2017-10-27 18361           4699
2017-10-28 16364           4434
2017-10-29 14899           3599
2017-10-30 10520           2594
2017-10-31 14609           4651
2017-11-01 13811           4970
2017-11-02 14584           4911
2017-11-03 16756           4214
2017-11-04 15333           4325
2017-11-05 13663           2812
2017-11-06 9664            1741
2017-11-07 14327           3836
2017-11-08 16126           3431
2017-11-09 16142           3852
2017-11-10 16360           3828
2017-11-11 16588           3874
2017-11-12 11578           2058
2017-11-13 8300            1743
2017-11-14 14164           3948
2017-11-15 14399           3732
2017-11-16 16288           4443
2017-11-17 14440           3768
2017-11-18 15586           4067
2017-11-19 9768            2642
2017-11-20 9701            2181
2017-11-21 13579           3126
2017-11-22 14647           4526
2017-11-23 16110           4127
2017-11-24 14830           3488
2017-11-25 14824           3258
2017-11-26 8791            1803
2017-11-27 6071            1189
2017-11-28 12990           3494
2017-11-29 15488           3655
2017-11-30 12644           2789
2017-12-01 17380           3881

If you plot this, you see an uptick in usage and cost in the middle of October. It's even clearer when you look at the 14-day moving average. That's when https://hg.mozilla.org/mozilla-central/rev/f68eef8bbd21 landed. The 14-day moving average cost increased from ~$2,500/day to ~$3,300/day. Or ~$290,000/year. In other words, we could justify >1 FTE to work on this problem and only this problem for a full year.
You can see this in the monthly spend numbers (keep in mind All Hands, holidays, and the shutdown skewing numbers for December):

Month      Usage (hours)   Cost
---------- --------------- ------
2017-02-01 629             408
2017-03-01 7648            1426
2017-04-01 19987           4124
2017-05-01 188373          59886
2017-06-01 157841          44803
2017-07-01 174782          49773
2017-08-01 253260          68712
2017-09-01 268130          63487
2017-10-01 318426          73488
2017-11-01 378893          88710
2017-12-01 411135          100758
2018-01-01 310800          93546
2018-01-17 150465          43313

What the big jump in July/August was for, I don't know. But someone should probably look into that too!

Normally test machines are cheap. But the GPU-enabled ones aren't, so we need to be more careful with their utilization or we can spend a lot of money very quickly. The changes to increase the number of chunks for these GPU-enabled tests exacerbated some inefficiencies in CI. Given the amount of money involved, we should either undo the hyperchunking or improve the efficiency of these tasks so we're not spending 4+ minutes getting each task ready to run.

needinfo coop so he can triage this.
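(Editorial note: a minimal sketch of how the 14-day moving average and the annualized delta cited above could be computed from the daily table. This is not the reporter's actual analysis tooling; only the first few rows of `daily_cost` are shown, and the numbers in the final calculation are the approximate figures quoted in the comment.)

```python
# Minimal sketch, assuming the (day, cost) pairs from the daily table above.
# Only the first few rows are listed; fill in the rest to reproduce the series.
from collections import deque

daily_cost = [
    ("2017-09-01", 3300),
    ("2017-09-02", 3199),
    ("2017-09-03", 1495),
    # ... remaining (day, cost) rows from the table above ...
]

def trailing_average(series, window=14):
    """Yield (day, trailing `window`-day average cost) once the window is full."""
    buf = deque(maxlen=window)
    for day, cost in series:
        buf.append(cost)
        if len(buf) == window:
            yield day, sum(buf) / window

for day, avg in trailing_average(daily_cost):
    print(f"{day}: ${avg:,.0f}/day (14-day average)")

# Annualizing the observed jump in the moving average:
# ~$3,300/day - ~$2,500/day = ~$800/day.
print(f"Annualized delta: ~${(3300 - 2500) * 365:,}/year")
```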
Flags: needinfo?(coop)
FWIW, about half the startup overhead is extracting the zip files with test files:

00:20:14 INFO - Downloading and extracting to Z:\task_1516234807\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, reftest/*, jsreftest/*, mozinfo.json from https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip
00:20:14 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip'}, attempt #1
00:20:14 INFO - Fetch https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.common.tests.zip into memory
00:20:15 INFO - Content-Length response header: 38590587
00:20:15 INFO - Bytes received: 38590587
00:21:42 INFO - Downloading and extracting to Z:\task_1516234807\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, reftest/*, jsreftest/*, mozinfo.json from https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip
00:21:42 INFO - retry: Calling fetch_url_into_memory with args: (), kwargs: {'url': u'https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip'}, attempt #1
00:21:42 INFO - Fetch https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw/artifacts/public/build/target.reftest.tests.zip into memory
00:21:44 INFO - Content-Length response header: 60572713
00:21:44 INFO - Bytes received: 60572713
00:22:54 INFO - proxxy config: {}

The long-term fix for that is "run tests from source checkouts." The closest bug we have is bug 1286900. The naive solution is to do a normal clone + checkout, but the performance of that on Windows is abysmal. So we need infrastructure changes to the Mercurial server to make the performance not suck. That's tracked in bug 1428470 and should land in Q2 or Q3.
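(Editorial note: a rough sketch of the fetch-into-memory plus selective-unzip pattern visible in the log above. This is not mozharness's actual implementation; the URL and directory patterns are copied from the log, while the function itself and the destination path handling are illustrative.)

```python
# Sketch only: mimics "Fetch ... into memory" followed by extracting a subset
# of archive members, as the mozharness log above describes.
import fnmatch
import io
import zipfile
from urllib.request import urlopen

URL = ("https://queue.taskcluster.net/v1/task/WA4O6l0vQdeRtu1eSdzNUw"
       "/artifacts/public/build/target.common.tests.zip")
WANTED = ["bin/*", "certs/*", "config/*", "mach", "marionette/*",
          "modules/*", "mozbase/*", "tools/*", "reftest/*",
          "jsreftest/*", "mozinfo.json"]

def fetch_and_extract(url, patterns, dest):
    # Download the whole archive into memory, then extract only the
    # members matching the requested directory patterns.
    data = urlopen(url).read()
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        members = [m for m in zf.namelist()
                   if any(fnmatch.fnmatch(m, p) for p in patterns)]
        zf.extractall(dest, members=members)

# Example (destination path taken from the log):
# fetch_and_extract(URL, WANTED, r"Z:\task_1516234807\build\tests")
```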
Also, we knew the zip files were super slow on test instances. That's what initially led us down the "run tests from source checkouts" path a while back. But our assumption - and a reason that work fell off the priority list - was that testers are cheap, so the slowdown, while annoying, didn't seem to have much of an impact. I recall discussion with lmandel and/or jgriffin where we thought we could overcome the inefficiencies by using more chunks. If tests only cost pennies an hour, throwing money at the problem as a stop-gap is viable. With expensive GPU-enabled test workers, the calculus changes and the strategy of throwing money at the problem backfires in a big way :/
I should also add that running tests from source checkouts is the key that unlocks a lot of other wins. For example, these tasks spend a few dozen seconds installing Python packages. I'm almost certain this involves some network activity. We're definitely invoking setup.py, which involves new process overhead (expensive on Windows). And we're likely copying a number of files around. If we run from a source checkout, we can leverage sys.path hacks to reference in-repo Python packages and most of the Python overhead will go away. We do this in `mach` and the Firefox build system to cut down on overhead, for example.
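(Editorial note: a minimal sketch of the "sys.path hack" idea described above, i.e. pointing Python at in-repo packages instead of pip-installing them. The environment variable and the list of package roots are hypothetical examples, not the actual harness configuration.)

```python
# Sketch: make in-tree Python packages importable directly from a source
# checkout, avoiding setup.py invocations, process spawns, and file copies.
import os
import sys

# Hypothetical: location of the source checkout on the worker.
TOPSRCDIR = os.environ.get("GECKO_PATH", "/builds/worker/checkouts/gecko")

# Hypothetical list of in-repo package roots a test harness might need.
IN_TREE_PACKAGES = [
    "testing/mozbase/mozinfo",
    "testing/mozbase/mozlog",
    "testing/marionette/client",
]

for rel in IN_TREE_PACKAGES:
    path = os.path.join(TOPSRCDIR, rel)
    if path not in sys.path:
        sys.path.insert(0, path)

# From here on, imports like `import mozinfo` resolve straight from the
# checkout: no network access, no setup.py, no copying files around.
```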
Undoing the increase in chunking without doing anything else would lead to even more savings, because if these jobs go back to the way they were, sheriffs will demote them to tier-3, since in that state they didn't even vaguely qualify to be visible in the default Treeherder view. I'm told that doing so would then mean we would need to close all trees, because we cannot have open trees without Windows reftests. Major savings!
I have considered going to 64 chunks instead of 32. We have so many intermittents on Windows 7 reftests specifically for 2 main reasons:
* unable to keep up and draw at the speed we run reftests (no development resources to fix Firefox in low-memory environments)
* the OS theme is inconsistent (in 1.5% of jobs we fail to set the theme properly, and the chance of intermittent failure is then high)

This topic of money has never been a concern; if it is, we should be focusing on a lot of other things in addition to this.
Depends on: 1431467
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #5)
> I have considered going to 64 chunks instead of 32-

Hyperchunking has been a long-term goal as a key pillar of significantly reducing end-to-end times (bug 1262834). So I support efforts to move in this direction.

> we have so many
> intermittents on windows 7 reftests specifically for 2 main reasons:
> * unable to keep up and draw at the speed we run reftests (no development
> resources to fix firefox in low memory environments)
> * the OS theme is inconsistent (1.5% of jobs we fail to set the theme
> properly and the chance of intermittent failure is high)

However, hyperchunking as a workaround (to avoid intermittent failures or reduce the failure rate)... isn't great.

Correct me if I'm wrong, but the rough framework for our "make end-to-end times as fast as possible" project was to have each test chunk complete in <5 minutes. Our target was to get the builds down to 15-20 minutes (ideally less). If tests completed in ~5 minutes, end-to-end times would be <30 minutes and we could focus energy on optimizing builds to further reduce end-to-end times.

I bring this up because the GPU tasks are already completing in the ~5 minute range, and ~4 minutes of that is task startup overhead. Using more chunks will further decrease efficiency and blow up costs. And it yields little to no end-user win from an end-to-end time perspective (it does help with intermittents, though).

> This topic of money has never been a concern, if it is we should be focusing
> on a lot of other things in addition to this.

It's not really been a concern for test machines because test machines historically didn't cost a lot. Many of our test instances cost <$0.02/hour. We can run >1,000 test workers for what it costs to employ 1 person. It's easy to justify that cost.

What changed in 2017 is that the g2.2xlarge instances entered the scene and changed the calculus for the cost of test workers. They went from ~$0 to $50-60k/month practically overnight. Now we're flirting with $100k/month. In contrast, we run ~4x more m3.large and m3.xlarge instance-hours for ~50% of the cost of the g2.2xlarge. The g2's are ~8x more expensive. We're now spending about as much on them as we are on build workers.

We have historically been concerned with the cost of operating the build workers. Those are substantially more expensive instances. We've been careful not to be wasteful by over-provisioning those instances, because they can easily contribute to runaway costs (like the g2's have). FWIW, bug 1430878 tracks provisioning >30 vCPU count instances. Look for those to make an entrance soon...

I don't have the full context on the cause of the intermittent reftest failures. But it seems to me that identifying and fixing the root cause is a sound investment. Even if we keep splitting up test chunks, the issue will still be there. From my perspective, chunking the tests feels like a very expensive way of sweeping dirt under the rug.
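(Editorial note: a back-of-envelope sketch of why more chunks mean more money spent on fixed setup overhead. The ~4 minutes of overhead comes from the comments above; the hourly rate and task counts are pure assumptions for illustration, not measured values.)

```python
# Toy model: every chunk pays roughly the same fixed setup cost, so doubling
# the chunk count roughly doubles the overhead spend while the amount of
# actual test work per push stays constant.
OVERHEAD_MIN = 4.0      # per-task setup time from the comments above (minutes)
HOURLY_RATE = 0.65      # assumed g2.2xlarge cost, USD/hour (illustrative)
TASKS_PER_DAY = 2000    # assumed number of GPU reftest tasks/day (illustrative)

def daily_overhead_cost(chunk_multiplier=1.0):
    tasks = TASKS_PER_DAY * chunk_multiplier
    return tasks * (OVERHEAD_MIN / 60.0) * HOURLY_RATE

print(f"current chunking: ${daily_overhead_cost(1.0):,.0f}/day on setup alone")
print(f"double chunking:  ${daily_overhead_cost(2.0):,.0f}/day on setup alone")
```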
As I mentioned in the Developer Workflow mtg today, I chatted about this with jmaher in the TC migration mtg yesterday. Developer resources are required to fix the underlying tests. I'm happy to drive that request up the management chain to try to make it happen. jmaher: do you have a short-list of the dev teams we need to target based on the tests that are failing?
Flags: needinfo?(coop) → needinfo?(jmaher)
This is a Windows 7 reftest issue - I would start with :jet and :milan.
Flags: needinfo?(jmaher)
Reftest run-by-manifest is close to landing, which should hopefully let us reduce the chunks again.
Depends on: 1353461
Product: TaskCluster → Firefox Build System
we run similar chunks on all configs now as of bug 1449587
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED