Closed Bug 1272175 Opened 9 years ago Closed 8 years ago

Decrease chunking of bc and dt mochitests on OSX opt

Categories: Release Engineering :: General, defect
Priority: Not set
Severity: normal
Tracking: Not tracked
Status: RESOLVED INCOMPLETE
People: Reporter: gps; Assignee: Unassigned
References: Blocks 1 open bug

On OS X 10.10 opt, we're currently running 7 bc chunks and 8 dt chunks. The jobs usually complete nice and fast (<10 minutes). Unfortunately, this isn't the most efficient way to run mochitest jobs because of the startup overhead associated with each job.

Take http://archive.mozilla.org/pub/firefox/try-builds/gszorc@mozilla.com-865f5cbeb7fb50c5993ba9cba3621c568bca51b0/try-macosx64/try_yosemite_r7_test-mochitest-e10s-browser-chrome-6-bm106-tests1-macosx-build1277.txt.gz for example. This job started at 2016-05-09 23:23:48.342202 and ended at 2016-05-09 23:29:30.347325. That's only ~340s, which is pretty fast. But if you look at the log, we don't start running tests until 23:25:40, or 118s into the job. That means ~35% of the job is spent in startup overhead.

My data shows we're doing ~100,000 bc/dt jobs per month on OS X, or ~4,000/day (factoring in the weekend, when machines are mostly idle). At ~120s of startup overhead each, 4,000 jobs per day means 480,000s of machine time per day spent on job startup overhead. An individual OS X machine has work to do maybe 75% of the day, or 64,800s, so 480,000s of machine time is equivalent to ~7.5 testers. We currently have 195 enabled OS X testers. If we reduce the number of bc and dt chunks by half, that buys us the equivalent of ~3.75 new testers and increases our OS X capacity by ~2%. Not the greatest win in the world, but not insignificant either. I could argue a greater percentage win if our machines are busy less than 75% of the day.

Decreasing the chunking does mean longer job times. But machines will be more efficient because they'll spend more of their running time executing tests. Since we have an OS X capacity issue right now, I think it is better to increase machine efficiency and reduce our wait times at the expense of jobs taking longer.

(It would be really cool if the scheduler changed the chunk size dynamically based on available machines, so we'd get the best of both worlds. But that's probably difficult, and it would confuse sheriffs and tools that expect mostly static chunk counts.)
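For reference, a quick Python sketch of the back-of-envelope arithmetic above (the constants are the rounded figures from this comment, not new measurements):

# Back-of-envelope capacity math using the rounded figures from the comment above.
OVERHEAD_PER_JOB_S = 120        # ~118s observed before tests start running
JOBS_PER_DAY = 4000             # ~100,000 bc/dt jobs/month on OS X
MACHINE_BUSY_FRACTION = 0.75    # a tester has work ~75% of the day
ENABLED_TESTERS = 195

busy_seconds_per_machine = 24 * 60 * 60 * MACHINE_BUSY_FRACTION     # 64,800s
overhead_seconds_per_day = OVERHEAD_PER_JOB_S * JOBS_PER_DAY        # 480,000s
testers_worth_of_overhead = overhead_seconds_per_day / busy_seconds_per_machine  # ~7.4

# Halving the bc/dt chunk counts roughly halves the number of jobs, so it
# recovers about half of that overhead.
recovered_testers = testers_worth_of_overhead / 2                   # ~3.7
capacity_gain = recovered_testers / ENABLED_TESTERS                 # ~1.9%

print("overhead ~ %.1f testers; halving chunks ~ +%.1f%% capacity"
      % (testers_worth_of_overhead, capacity_gain * 100))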
I believe we have the chunking set at 7 to make it equal across all platforms for a given flavor (opt/debug). In addition, winxp falls into a similar trap :) As a note, we are getting another large pool of OS X 10.10 Mac minis; we have them, we just need to get them set up in the datacenter.

Looking at the overhead, we have (times in seconds):

:20 download and extract zip files (as a note, an old optimization for tests.zip produces caution errors about filenames not matching bin/*, etc.; maybe we could be more efficient in the unzip, as that alone is at 5 seconds)
:54 create the virtualenv and install packages into it. This seems like a big waste of time; reducing our dependencies on so many libraries/packages would be nice, and finding a more efficient way to install them would be even better.
:34 install the .dmg -> filesystem. Holy crap, this is just a glorified unzip.
:05 run the harness and get the first result from a test
:04 blobber upload

I see 117 seconds of overhead and 226 seconds of actually running tests, confirming what :gps wrote initially. If we can optimize our virtualenv path, it gives us wins on all platforms! Likewise, optimizing the .dmg extraction could save time on all OS X jobs.

I guess the question is: if we had 60 seconds of overhead, what would be the ideal job time, assuming static scheduling? Maybe 10 minutes, so 90% goes towards running tests? Would just trimming the overhead be enough to avoid changing the chunking? (See the rough sketch below.)

To change the chunking, I believe we have to do it inside of buildbot, and it should be considered for winxp as well :)
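As a rough illustration of that static-scheduling trade-off, here is a small Python sketch; the total suite runtime (1600s, roughly 7 chunks x ~226s from the job above) and the chunk counts compared are hypothetical, just to show the shape of the calculation:

# Rough chunking trade-off under static scheduling (illustrative numbers only).
def startup_share(total_test_seconds, chunks, overhead_per_job_s):
    """Fraction of each job spent on startup overhead rather than tests."""
    test_time_per_chunk = total_test_seconds / chunks
    return overhead_per_job_s / (overhead_per_job_s + test_time_per_chunk)

TOTAL_TEST_S = 1600  # hypothetical: ~7 chunks x ~226s of real test time

for overhead in (118, 60):           # today's overhead vs. a hoped-for 60s
    for chunks in (8, 7, 4, 2):
        share = startup_share(TOTAL_TEST_S, chunks, overhead)
        job_len = overhead + TOTAL_TEST_S / chunks
        print("overhead=%3ds chunks=%d job~%4ds startup share~%2d%%"
              % (overhead, chunks, job_len, round(share * 100)))

Under these made-up numbers, 60s of overhead with the current 7 chunks still leaves roughly 20% of each job as startup time, so trimming overhead helps everywhere but doesn't entirely remove the incentive to use fewer, longer chunks.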
Summary: Decrease chunking of bc and rt mochitests on OSX opt → Decrease chunking of bc and dt mochitests on OSX opt
I completely agree that the overhead seems a bit high. I believe there are bugs on file for all of these issues. I've been looking at the uncompress issue in the past ~24 hours. I haven't really looked at virtualenv yet because it's a bit more work to solve.

Another source of overhead we have to factor in is reboots, although I'm not sure what all we're rebooting for these days. Reboots can easily add a few more minutes of overhead.

If the new machines are put into service within days, I think we can wontfix this as not worth the trouble. I'd rather spend the time reducing job overhead.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → INCOMPLETE
Component: General Automation → General