Closed
Bug 1312383
Opened 8 years ago
Closed 8 years ago
taskcluster windows 7 ec2 instances underperform
Categories
(Release Engineering :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: grenade, Assigned: grenade)
References
Details
Attachments
(1 file)
cppunit tests on tc win 7 take longer than they could/should to complete.
timings from https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=22e62e6b9783b9ff92a37b4318421b15cb411b1b:
tc win 7 opt: 18 minutes
tc win 7 dbg: 18 minutes
bb win 7 opt: 8 minutes
bb win 7 dbg: 9 minutes
tc win 10 opt: 7 minutes
tc win 10 dbg: 7 minutes
also, when running checkout-based tests, tc win 7 fails to complete the hg checkout due to a timeout. tc win 10 performs the checkout quickly. both use ephemeral disks for the checkout and the robustcheckout shared folders. examples at:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=a7faf377dfbdd56be037f2efe35ca35587a099c4&group_state=expanded&exclusion_profile=false&filter-tier=2
tc win 7 log: https://public-artifacts.taskcluster.net/axL3dLhrRCerYhux3oIcZQ/0/public/logs/live_backing.log
tc win 10 log: https://public-artifacts.taskcluster.net/Wqd3tgCHRkquFWXpqZjzvQ/0/public/logs/live_backing.log
Assignee
Comment 1 • 8 years ago
worth noting that the current tc win 7 ami is based on releng ami-60167577 (use1) which was created on 2016-08-24.
today i'm rebasing on releng ami-94790083 which was created on 2016-09-20 and is the current base for golden (https://github.com/mozilla-releng/build-cloud-tools/blob/cffc9e1170a6325cd2f10b750f66907abbc15cde/configs/t-w732#L6)
Assignee: nobody → rthijssen
Status: NEW → ASSIGNED
Assignee
Comment 2 • 8 years ago
things i tried which did not affect performance (time taken for the cppunit test to complete):
- correct the PythonPath and InstallPath registry settings: https://github.com/mozilla-releng/OpenCloudConfig/commit/f00e2f4cfc0019d73d8614f764fc549c71ee87eb
- use python 2.7.3 (an exact copy of c:\mozilla-build\python27 from buildbot win7 slaves + virtualenv/virtualenv_support from mozilla-build 2.2.0): https://github.com/mozilla-releng/OpenCloudConfig/commit/1ab2f74e73fbb7a32ae8eabb795938f08f8754dc
- remove the pagefile and ram drive (r:) - no difference to performance either way
- ngen executeQueuedItems (https://github.com/mozilla-releng/OpenCloudConfig/commit/8599dc41f1672a0531f39c789c20de4a4120871d)
things to try next (it's looking more and more like the .net upgrade causes the slowdown):
- hand build the ami (no occ, no .net 4.5, no powershell upgrade), just an exact copy of bb win7 with runner and buildbot removed and generic worker installed
Comment 3 • 8 years ago
Thanks for all your hard work on this Rob!
Assignee
Comment 4 • 8 years ago
i've now ruled out the .net upgrade/dsc. latest amis are hand built exact copies of buildbot win 7 slaves with the only modification being removal of runner/buildbot and installation of the generic worker.
the only places we could be losing performance now are the generic worker itself, or some hack to the cltbld user that somehow made it fast (which seems unlikely).
Comment 5 • 8 years ago
given the fact that we are losing most of the performance in downloading builds/packages, how can we narrow that down? It does seem odd that generic-worker could be causing this, unless something really unusual is going on.
We do create users dynamically - I wonder if there is some quirk specific to windows 7 with newly created users, or some flag that makes them act differently.
Looking at :grenade's recent push, I see this log:
https://public-artifacts.taskcluster.net/DvCA9fJdQSWNuZemCVh_HQ/0/public/logs/live_backing.log
I see a few instances of timeout/sleep/retry:
01:58:56 INFO - c:\mozilla-build\python\python.exe: can't open file 'C:\mozilla-build\tooltool.py': [Errno 2] No such file or directory
01:58:56 ERROR - Return code: 2
01:58:56 INFO - retry: Failed, sleeping 240 seconds before retrying
02:02:56 INFO - retry: Calling run_command with args: (['c:\\mozilla-build\\python\\python.exe', 'C:\\mozilla-build\\tooltool.py', '--url',
this is the other area of longer delays:
01:52:03 INFO - Processing z:\task_1477703515\build\tests\marionette\client
01:52:05 INFO - Processing z:\task_1477703515\build\tests\marionette\marionette\runner\mixins\browsermob-proxy-py
01:52:09 INFO - Processing z:\task_1477703515\build\tests\marionette
01:52:56 INFO - Building wheels for collected packages: browsermob-proxy
and 8 seconds to download 818k:
01:50:30 INFO - Downloading http://pypi.pub.build.mozilla.org/pub/mock-1.0.1.tar.gz (818kB)
01:50:38 INFO - Building wheels for collected packages: mock
and 15 seconds to install some python modules:
01:47:26 INFO - Successfully built blobuploader requests docopt
01:47:26 INFO - Installing collected packages: requests, docopt, blobuploader
01:47:43 INFO - Successfully installed blobuploader-1.2.4 docopt-0.6.1 requests-1.2.3
01:47:44 INFO - Return code: 0
of course the download of tests.zip is 1+ minutes here:
01:42:51 INFO - x-amz-meta-url: https://s3-us-west-2.amazonaws.com/taskcluster-public-artifacts/eTf1lMx_R5GulM6kwfj12w/0/public/build/firefox-52.0a1.en-US.win32.common.tests.zip
01:42:51 INFO - x-amz-request-id: D7589D676E109906
01:42:51 INFO - Expected file size: 17133438
01:42:51 INFO - Obtained file size: 17133438
01:44:04 INFO - Downloading and extracting to Z:\task_1477703515\build\tests these dirs bin/*, certs/*, config/*, mach, marionette/*, modules/*, mozbase/*, tools/*, cppunittest/* from https://queue.taskcluster.net/v1/task/eTf1lMx_R5GulM6kwfj12w/artifacts/public/build/firefox-52.0a1.en-US.win32.cppunittest.tests.zip
the trend is really the high overhead of installing python modules (download and installation). I really wish we could reduce that in general, as it normally accounts for about 2 minutes of overhead in each job - here it seems to account for much more of the time (5+ minutes).
Assignee
Comment 6 • 8 years ago
the tooltool error is a red herring and was fixed in a subsequent ami.
i am aware that there is a performance penalty to all that downloading and i am working to mitigate that by sharing the local pypi cache (between subsequent task users on the same host).
however, windows 10, which is spun up in the same availability zones, performs the same cppunit test in consistently less time (6-7 min) than buildbot slaves (7-8 min). windows 7 takes 17 - 19 min consistently. this points to something other than the time taken to download since win 10 workers have to perform the same downloads.
since we know that we have absolute parity with the buildbot ami which doesn't underperform, we can also rule out network or io drivers. the only thing left to look at is the process spawning mechanism on the 32 bit version of the generic worker (used only by tc win 7 and no other worker types).
i'm leaning towards some glitch or difference in the way that the operating system reacts to the g-w mechanism for spawning the task process, which should be easy to demonstrate by running the cppunit test outside of generic-worker (manually from a cmd prompt) and comparing the time it takes to complete with the same test run from inside g-w.
updates to follow...
Assignee
Comment 7 • 8 years ago
well, it's not generic worker either. cppunit took 28 minutes to complete when run outside of g-w.
Assignee
Comment 8 • 8 years ago
i've discounted the multiple calls to proxxy and pypi.pvt hosts with this change to mozharness configs (which removed noise from the logs but didn't make for a faster test run):
+ "proxxy": {},
"find_links": [
- "http://pypi.pvt.build.mozilla.org/pub",
"http://pypi.pub.build.mozilla.org/pub",
],
Assignee
Comment 9 • 8 years ago
pmoore and i did some debugging of a long running process under g-w. a task was created with a single long running command:
hg robustcheckout --sharebase y:\\hg-shared --purge --upstream https://hg.mozilla.org/mozilla-unified --revision 916cbaf21a63da2f7953933481a502d7db9b39a0 https://hg.mozilla.org/integration/mozilla-inbound/ .\\build\\src
the same command (with altered paths), run manually from a cmd prompt, completes in 10-15 minutes. under g-w it times out, incomplete, after 3 hours.
using process monitor and process explorer we ascertained that the process running under g-w has a low process priority (the lowest possible) while manual processes have a normal process priority.
overriding the process priority (to increase it) of the g-w process allowed the task to complete in a similar time to the manual process.
pmoore is exploring ways to increase the process priority of the g-w process.
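for reference, a minimal sketch (not code from this bug) of the kind of check and override we did by hand in process explorer, using golang.org/x/sys/windows; the pid is a placeholder for whatever task process g-w spawned:

package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/windows"
)

// win32 constants used below (defined locally so the sketch is self-contained)
const (
	processSetInformation          = 0x0200     // PROCESS_SET_INFORMATION
	processQueryLimitedInformation = 0x1000     // PROCESS_QUERY_LIMITED_INFORMATION
	aboveNormalPriorityClass       = 0x00008000 // ABOVE_NORMAL_PRIORITY_CLASS
)

func main() {
	pid := uint32(1234) // placeholder: pid of the task process running under g-w

	h, err := windows.OpenProcess(processSetInformation|processQueryLimitedInformation, false, pid)
	if err != nil {
		log.Fatalf("OpenProcess: %v", err)
	}
	defer windows.CloseHandle(h)

	// under g-w this came back as the lowest priority class; manual cmd prompt runs were normal
	before, err := windows.GetPriorityClass(h)
	if err != nil {
		log.Fatalf("GetPriorityClass: %v", err)
	}
	fmt.Printf("priority class before: %#x\n", before)

	// raising the class is the override that let the robustcheckout task complete
	if err := windows.SetPriorityClass(h, aboveNormalPriorityClass); err != nil {
		log.Fatalf("SetPriorityClass: %v", err)
	}
}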
Assignee: rthijssen → nobody
Status: ASSIGNED → NEW
Assignee
Comment 10 • 8 years ago
Attachment #8812741 - Flags: review?(pmoore)
Assignee
Comment 11 • 8 years ago
i tried bumping up the process priority for the generic-worker process:
https://github.com/mozilla-releng/OpenCloudConfig/commit/f260dedd95dd0c5fef0ec04118423ce8e7210536
i tried both "Real Time" (0) and "Above Normal" (3). it didn't seem to make any difference. my guess is that the generic worker process is running with the new priority but that the processes it spawns (the builds/tests) are not. https://papertrailapp.com/groups/2488493/events?q=%22process+priority+for+generic+worker%22
i also read (http://stackoverflow.com/a/1664006/68115) that in order to increase a process priority, the SeIncreaseBasePriorityPrivilege permission is required, so i've added that to the permissions granted to the g-w user (https://github.com/mozilla-releng/OpenCloudConfig/commit/33ed93c39cae94bd972ff4157247c0d9c16ee632). i'm hoping that helps it to spawn processes with a higher priority.
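as an illustration only (an assumed approach, not the actual generic-worker change): one way to make sure spawned processes get a sensible priority regardless of what the scheduled task inherits is to pass a priority class explicitly in the CreateProcess creation flags when launching the task command, e.g.:

package main

import (
	"log"
	"os/exec"
	"syscall"
)

// ABOVE_NORMAL_PRIORITY_CLASS from the win32 headers; priority classes are valid
// bits in dwCreationFlags when creating a process.
const aboveNormalPriorityClass = 0x00008000

func main() {
	// placeholder command standing in for the task's payload
	cmd := exec.Command("hg", "version")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		CreationFlags: aboveNormalPriorityClass,
	}
	if err := cmd.Run(); err != nil {
		log.Fatalf("task command failed: %v", err)
	}
}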
Assignee
Comment 12 • 8 years ago
https://books.google.co.uk/books?id=WZdCAwAAQBAJ&pg=PT506&lpg=PT506&dq=%22Image+File+Execution+Options%22+PerfOptions+Io+Priority+Cpu+Priority&source=bl&ots=nilSGhacsd&sig=x-Q6uDIa3tDquumLyNQZ13pkAtE&hl=en&sa=X&ved=0ahUKEwjKkvCXlrrQAhVCDMAKHbEXCHEQ6AEILDAC#v=onepage&q=%22Image%20File%20Execution%20Options%22%20PerfOptions%20Io%20Priority%20Cpu%20Priority&f=false
this book states "The process's default base-scheduling priority (which starts as Normal, or 8, unless the parent process was set to Idle or Below Normal, in which case the setting is inherited)."
which suggests that until we get the scheduled task starting at a priority higher than "Below Normal", we won't get higher priorities for spawned processes.
Assignee
Comment 13 • 8 years ago
i found a workaround! i've applied a registry hack which ensures that hg.exe and python.exe always run in the "Above Normal" process priority class.
cppunit test now completes in 3 minutes. it was 18 minutes on ephemeral and 12 on ebs volumes previously.
https://github.com/mozilla-releng/OpenCloudConfig/commit/60920c08b0fe314e5526985a588e3f83918145e0
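to illustrate the mechanism (a sketch only - the authoritative change is the occ commit above; the value 6 is PROCESS_PRIORITY_CLASS_ABOVE_NORMAL as documented for the PerfOptions key, and is assumed to match what the commit sets):

package main

import (
	"log"

	"golang.org/x/sys/windows/registry"
)

// forceAboveNormal writes Image File Execution Options\<exe>\PerfOptions\CpuPriorityClass
// so that every instance of <exe> starts in the above-normal priority class.
func forceAboveNormal(exe string) error {
	path := `SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\` + exe + `\PerfOptions`
	k, _, err := registry.CreateKey(registry.LOCAL_MACHINE, path, registry.SET_VALUE)
	if err != nil {
		return err
	}
	defer k.Close()
	return k.SetDWordValue("CpuPriorityClass", 6)
}

func main() {
	for _, exe := range []string{"hg.exe", "python.exe"} {
		if err := forceAboveNormal(exe); err != nil {
			log.Fatalf("%s: %v", exe, err)
		}
	}
}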
Assignee
Updated • 8 years ago
Assignee: nobody → rthijssen
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Comment 14 • 8 years ago
Commits pushed to master at https://github.com/taskcluster/generic-worker
https://github.com/taskcluster/generic-worker/commit/e2fa6601905f0003e8c7699fd62e2097b0bcfae5
bug 1312383 - set scheduled task process priority to realtime
not sure it will help, but the current scheduled task runs its process with priority 7 (below normal) while priority 0 (real time) is the highest, 10 (idle) is the lowest.
https://msdn.microsoft.com/en-us/library/windows/desktop/aa383512(v=vs.85).aspx
https://github.com/taskcluster/generic-worker/commit/67b96f57ade0838da11bebf322d3d226c9cf60a2
bug 1312383 - set scheduled task process priority to above normal
testing with realtime yielded build times the same as before (no improvement). setting to above normal in the hope it provokes the os into doing something
https://github.com/taskcluster/generic-worker/commit/ddd5ebb19645ffb971aa3687e5772e61389109a4
Merge pull request #30 from grenade/patch-5
bug 1312383 - set scheduled task process priority to above normal
Updated • 8 years ago
Attachment #8812741 - Flags: review?(pmoore) → review+
Updated • 7 years ago
Component: General Automation → General
Comment 15 • 6 years ago