Closed Bug 1291940 Opened 8 years ago Closed 7 years ago

TaskCluster slows down considerably when using AUFS

Categories

(Taskcluster :: Workers, defect, P1)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: gps, Assigned: wcosta)

References

(Blocks 2 open bugs)

Details

In bug 1290282, I was investigating some oddities where c4.8xlarge instances barely had any performance wins over c3.4xlarge instances on TC Firefox build tasks despite having 2x the number of CPU cores. I knew something was wrong after garndt posted dstat output while `mach build` was doing some heavy compilation:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 10   5  83   3   0   0| 705k   16M|   0     0 |   0     0 | 18k   43k
 39  15  42   4   0   0|   0    25M|  42k   44k|   0     0 | 52k  106k
 36   8  29  28   0   0|4096B   21M|  35k   34k|   0     0 | 38k   68k
 47   9  43   1   0   0|   0  4728k|  54k   56k|   0     0 | 48k   96k
 45  14  41   0   0   0|   0    28k|  25k   25k|   0     0 | 46k   93k
 42  14  44   0   0   0|   0    20k|  74k   74k|   0     0 | 53k  104k
 34  20  46   0   0   0|   0    36k|  77k   76k|   0     0 | 57k  111k
 35  19  45   0   0   0|   0  1376k|  53k   52k|   0     0 | 53k  104k
 37  20  43   0   0   0|   0  9456k|  79k   82k|   0     0 | 54k  104k
 28  27  46   0   0   0|   0    36k|  82k   81k|   0     0 | 54k  105k
 31  24  45   0   0   0|   0  8192B|  41k   41k|   0     0 | 51k   99k
 40  18  42   0   0   0|   0    28k|  44k   43k|   0     0 | 50k   97k
 36  21  42   0   0   0|   0  1444k| 135k  139k|   0     0 | 45k   88k
 33  21  46   1   0   0|   0    21M| 105k  100k|   0     0 | 47k   92k
 27  26  47   0   0   0|   0  2020k| 129k  132k|   0     0 | 52k  103k
 38  19  43   0   0   0|   0    32k|  90k   92k|   0     0 | 50k   96k

Note the very high sys (kernel) CPU usage and number of context switches. For comparison, here's what I got on a non-TC c4.8xlarge instance doing the same operation:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
 93   6   1   0   0   0|   0    41M| 660B  832B|   0     0 |9801  4182
 92   7   1   0   0   0|   0  7320k|2016B   13k|   0     0 |9596  7047
 93   7   1   0   0   0|   0  8888k|2310B   10k|   0     0 |9567  4825
 91   8   1   0   0   0|   0  9248k|2574B   11k|   0     0 |9798  6004
 93   7   1   0   0   0|   0  7464k| 924B 3044B|   0     0 |9530  4706
 94   6   0   0   0   0|   0   102M| 198B  730B|   0     0 | 10k  4105
 94   5   1   0   0   0|   0   105M|1386B 8132B|   0     0 | 12k  7619
 92   7   1   0   0   0|   0    16M|1716B 6180B|   0     0 |9710  8397
 93   6   0   0   0   0|   0  4096k| 396B 1220B|   0     0 |9570  5332

The sys numbers are still a bit high for my liking. But context switches are *way* down, dare I say reasonable.

The TC team stood up a docker worker and gave me SSH access to investigate. I initially thought OS and kernel differences might be to blame. My instance was running Ubuntu 16.04 (Linux 4.4). TC was running Ubuntu 14.04 (Linux 3.13) and had a number of outdated system packages. Upgrading system packages (including the kernel) and rebooting appeared to make things a bit faster, although I would have to measure again to confirm.

I built Firefox from source outside of Docker and both my instance and the TC instance had remarkably similar times:

5:04 wall vs 5:03 wall
143:28 user vs 143:08 user
9:18 sys vs 9:29 sys

Inside Docker was a different story.
Building from a host-mounted ext4 volume was slightly slower:

5:42 wall
136:49 user
13:54 sys

But building from a Docker aufs volume was much slower:

8:38 wall
132:19 user
75:03 sys

dstat output during parts of the build was horrendous:

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw
  9  82   9   0   0   0|   0    72k|3300B   15k|   0     0 | 20k   27k
  7  83  11   0   0   0|   0    64k|3366B   12k|   0     0 | 19k   29k
  3  88   9   0   0   0|   0    68k|1188B 5074B|   0     0 | 20k   29k
  5  86   9   0   0   0|   0    64k|  66B  358B|   0     0 | 25k   25k
 12  78  10   0   0   0|   0   576k|2574B   10k|   0     0 | 18k   27k
  6  80  14   0   0   0|   0    30M| 990B 3684B|   0     0 | 19k   28k
  4  91   4   0   0   0|   0    64k| 198B  722B|   0     0 | 21k   22k
  5  93   1   0   0   0|   0    64k| 132B  874B|   0     0 | 22k   21k
  9  85   6   0   0   0|   0    92k|2310B 8290B|   0     0 | 19k   26k
  4  87   9   0   0   0|   0  1316k| 198B  866B|   0     0 | 18k   25k
  3  95   3   0   0   0|   0    18M| 198B  714B|   0     0 | 21k   21k
  7  90   3   0   0   0|   0    64k| 990B 4310B|   0     0 | 17k   22k

95% CPU time in the kernel!!!

So it appears aufs under high load is a steaming pile. This shouldn't come as a surprise - aufs is an abandoned project and overlayfs is the equivalent technology that's part of the Linux kernel.

So where does this leave us? Well, TaskCluster is using aufs heavily in production. My understanding is that unless a cache is declared (caches use volumes mounted from the host and are backed by ext4), your I/O inside a TC task will be using aufs. And if aufs has performance problems, well, TC tasks will be slow.

The good news is a *lot* of Firefox automation uses caches (read: no aufs), so the negative impact of aufs is mitigated a bit. However, caches are stripped on most Try tasks because of concerns over cache poisoning. So e.g. a TC Firefox build will use ext4 on mozilla-central but aufs on Try. That means Try tasks will be slower than non-Try tasks. In theory, this applies to non-build tasks as well.

A potential silver lining is that the aufs performance degradation might only show up with large CPU counts driving high concurrent I/O loads. In bug 1290282, c3.4xlarge instances were faster than c3.2xlarge instances (not linear as expected, but still pretty good), whereas c3.8xlarge hardly had any advantage over c3.4xlarge despite double the cores. I'll have to run some tests on less beefy instances to see if aufs overhead is measurable on e.g. 1, 2, and 4 core instances, which make up most of the instances in TC. If it is, then this performance problem could result in thousands of hours of wasted machine time every day.

Per discussion in #taskcluster, jonasfj and garndt have ideas for addressing this at the worker level. I'll let them chime in.
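For reference, the numbers above came from watching dstat alongside the build, and the two Docker scenarios were along these lines (the image name and paths here are illustrative, not the real ones):

# 1-second samples; this is the default dstat column set shown above
dstat 1

# build from a host-mounted ext4 volume: src/objdir I/O bypasses the storage driver
docker run -ti -v /mnt/ext4/firefox-src:/builds/src desktop-build \
    sh -c 'cd /builds/src && ./mach build'

# build from a path inside the container filesystem: all I/O goes through aufs
docker run -ti desktop-build \
    sh -c 'cd /builds/src-in-image && ./mach build'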
We currently build on c3.2xlarge instances on an Ubuntu 14.04 host. I decided to compare the performance of a build on a c4.2xlarge (8 vCPUs) instance at the host level and inside a Docker aufs volume to quantify the impact aufs has on TC tasks today.

Host build (ext3):
19:30 wall
138:54 user
7:23 sys

Docker build (aufs):
20:15 wall
138:50 user
9:33 sys

There appears to be slight overhead to aufs on a c4.2xlarge (8 vCPUs), although this was a sample size of 1, so the results could be within variance. FWIW, I did witness rather high sys CPU time during the "export" tier and during parts of the "compile" tier. But it wasn't as high as on the c4.8xlarge. (I suspect the "export" tier is bad because it creates thousands of files and symlinks and is effectively a stress test for I/O writing.)

On a c4.4xlarge (16 vCPUs):

Host build (ext3):
10:59 wall
141:54 user
7:28 sys

Docker build (aufs):
11:47 wall
140:32 user
15:00 sys

sys time is higher with aufs. dstat output also shows sys CPU time was higher throughout this build compared to the c4.2xlarge. Still not as bad as the c4.8xlarge, but noticeably worse than the c4.2xlarge.

The good news is the aufs impact at these sizes appears minimal: the ~45s gap between native and aufs on c4.2xlarge remained intact on c4.4xlarge (although I'm not sure how that happened with ~5 minutes more kernel time), and the impact isn't horrendous like with c4.8xlarge instances. Since we don't run many instances with more than 8 cores in TC (do we run any at all?), this hopefully means we don't have major, widespread performance issues on TC due to aufs.

The bad news is that until we address this underlying problem with aufs, running high vCPU instances in TC will have diminishing returns for many workloads. We'll get *some* wins, but they won't be as large as expected because aufs will eat the extra CPU. Put another way, running high vCPU instances is, to some extent, throwing money away. c4.4xlarge has an arguably tolerable amount of aufs overhead, but we should definitely avoid c4.8xlarge instances for the time being, as aufs could eat a lot of their extra CPU.
In #taskcluster yesterday, I stated that the excessive kernel time appears to have gone away in Ubuntu 16.04. I was wrong. Kernel overhead on a c4.8xlarge on Ubuntu 16.04 is *worse* than 14.04!

Ubuntu 14.04 aufs (from comment #0):
8:38 wall
132:19 user
75:03 sys

Ubuntu 16.04 aufs:
20:53 wall
103:03 user
461:28 sys  !!!

Ubuntu 16.04 overlayfs:
5:49 wall
135:32 user
9:39 sys

Building outside of Docker yielded the same time as within Docker on overlayfs. That's an encouraging sign for overlayfs.
Did you use the overlay or overlay2 docker storage driver?
overlay with docker 1.10 (the Docker that ships with Ubuntu 16.04).
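For the record, switching the daemon's driver on that setup was roughly the following (the config location varies by packaging, and switching drivers hides any images/containers created under the old driver):

# Ubuntu's docker packaging reads DOCKER_OPTS from /etc/default/docker
echo 'DOCKER_OPTS="--storage-driver=overlay"' | sudo tee -a /etc/default/docker
sudo systemctl restart docker
docker info | grep -i 'storage driver'   # confirm the driver took effect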
There has been some talk of mitigating the impact of aufs by aggressively using Docker volumes (which are essentially bind mounted from the host filesystem). This is a good idea in theory. So let's put it to the test. On the same c4.8xlarge Ubuntu 16.04 instance used in comment #2, building Firefox with a source directory and object directory volume bind mounted from the host, but still using aufs for the Docker container:

Ubuntu 16.04 host volume + aufs container:
10:09 wall
111:27 user
117:41 sys

As you can see, we're somewhere between native/overlayfs and full aufs despite avoiding aufs for the Firefox source and output. What gives?

My theory is that the *reads* from the aufs container filesystem by processes like python and gcc loading standard library modules and headers are triggering aufs overhead. So even though we do hardly any write I/O to aufs, aufs is still slowing us down by interfering with the read path. FWIW, dstat reported little to no disk reads during the build. This is presumably because everything was in the page cache. But apparently the page cache isn't good enough for aufs: it still interferes with reading. (A rough way to check this is sketched at the end of this comment.)

It's looking more and more like avoiding aufs entirely is the only path forward.
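Untested sketch for checking the read-path theory (file and directory names are made up): count which paths the toolchain opens during a single compile. Anything under the container root still goes through aufs even with src/objdir bind mounted.

# trace file opens for one compiler invocation inside the container
strace -f -e trace=open,openat -o /tmp/opens.log gcc -c -o /dev/null some_file.c

# opens under the container root (headers, shared libraries) are served by aufs
grep -c '"/usr/' /tmp/opens.log
# opens under the bind-mounted source tree bypass aufs
grep -c '"/builds/src/' /tmp/opens.log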
I built with overlay2 and it has nearly identical performance as overlay.
Depends on: 1293395
Blocks: thunder-try
We have what looks to be an intermittent failure related to aufs in bug 1295474. garndt: given the performance and stability implications of running aufs, what is the timeline for not using aufs?
Blocks: 1295474
Flags: needinfo?(garndt)
Also note that bug 1256004 was also likely due to aufs, although we were able to work around it.
I don't have a timeline really, but I can say I can start looking into this soon with others. I think making the changes is easy enough; it's validating that everything continues running as expected that might take a little time.
Flags: needinfo?(garndt)
Assignee: nobody → garndt
FWIW, I tried switching my local dev environment to overlay2 and I'm running into what look like bugs due to the filesystem not behaving like a filesystem should. So it looks like that storage driver is still buggy :/ I've used devicemapper with some success. However, it is *really* slow at creating new image layers compared to aufs and overlayfs, which is a big deal for my local development environment, which rebuilds tons of Docker images frequently. However, I don't think this is a major issue for TC because it treats Docker images as tarballs and doesn't do any of the fancy layering foo. The matrix at https://docs.docker.com/engine/userguide/storagedriver/selectadriver/ also seems to indicate devicemapper might be the least crappy storage driver.
That's really good to know. Is it worth trying overlay2 on the TC instances, or is it just way too far gone to be of use?
Everything coming out of Docker seems to indicate overlayfs/overlay2 is the future. But if there are bugs in the filesystem, I'm inclined to write it off. I should clarify my comment about layer creation being slow. It's like 1-3s instead of say 100ms. This likely won't matter unless you are building or pulling images with many layers.
Another option is btrfs.
I upgraded to Docker 1.12 and haven't yet reproduced the overlayfs bug I was seeing before. It's possible I wasn't actually running overlay2 on the machine encountering the issues referenced in comment #10. Regarding btrfs, I'll have to give it another try. I think its layer generation is also a bit slow. But, again, we shouldn't care about this so much.
Ahh - the issue I was having with overlay was related to hard links (some processes were confused by the presence of hard links where hard links should not have been present). The docs at https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/ indicate that "overlay" relied heavily on hard links and "overlay2" does not. FWIW, the bottom of that page contains some nuggets about overlayfs, notably that open(2) doesn't conform to POSIX standards and rename(2) isn't supported either. Good times. A "real" filesystem like btrfs keeps sounding better and better.
UNIX domain sockets don't work with the overlay filesystem: https://github.com/docker/docker/issues/12080
garndt: I keep seeing people bitten by AUFS bugs or performance issues. The problems aren't obvious because AUFS "just works" most of the time. I suspect the use of AUFS is a "silent killer" of overall performance on TaskCluster, especially on large-core instances (read: we're wasting money due to sub-optimal performance). What's the story on transitioning off AUFS? Depending on the timeline, it may justify someone doing work to have tasks alert when they are doing obviously sub-optimal things on AUFS, like not having the workspace be a cache. But that won't save us: frequently accessed files baked into the Docker image will still kill perf because they are hosted on AUFS. And Try tasks can't use caches easily, meaning they always get AUFS and are therefore slower.
Flags: needinfo?(garndt)
Sorry for the delayed response, I just returned from PTO and will be looking at this once I get caught up. I will have to see where we are with other priority projects on our plate. Last time I looked at this, I think I hit a rabbit hole of issues when trying to switch the fs drivers, but we can investigate with a new set of eyes. I'm going to leave the ni? at least for a couple of days to keep it on my radar.
Switching ni? to wander, he will start looking into this.
Flags: needinfo?(garndt) → needinfo?(wcosta)
Component: Worker → Docker-Worker
Blocks: 1399895
Assignee: garndt → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
While I was hacking on bug 1415725, I cranked out a patch to switch docker-worker to overlay2. PR at https://github.com/taskcluster/docker-worker/pull/337.
Commit pushed to master at https://github.com/taskcluster/docker-worker
https://github.com/taskcluster/docker-worker/commit/feb14875f6dbee6cdbca619a280f4b56deeb398b

Upgrade to kernel 4.4 and use overlay2 storage driver

AUFS is the source of numerous performance problems and filesystem consistency issues. Bug 1291940 has plenty of details. The tl;dr is AUFS takes out a global kernel lock for pretty much all I/O operations and concurrent I/O grinds to a halt as you add more concurrency. Firefox builds actually get slower as you use more cores. In addition, there are consistency issues with AUFS that result in files randomly not being available after write.

In this commit, I bump the kernel version from 3.13 to 4.4 (so overlayfs is usable), add support for the overlay2 storage driver (added in Docker 1.12 - which we conveniently use), and switch the default storage driver to overlay2. The version of v4l2loopback was also bumped to the latest in order to pull in fixes so it compiles with Linux 4.4.
wcosta just pushed this out ~2 hours ago. So all new worker instances should be using the new AMIs with Linux 4.4 and overlay2 storage driver. I'm optimistic we'll see perfherder alerts as a result of this change. Could take a while for existing workers running AUFS to self-terminate, however.
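To spot-check that a given worker picked up the new AMI, something like this from an SSH session on the worker should do:

# expect a 4.4.x kernel and the overlay2 storage driver on the new AMIs
uname -r
docker info | grep -E 'Storage Driver|Kernel Version'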
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-inbound,1445527,1,2&series=mozilla-inbound,1582370,1,2&series=mozilla-inbound,1446792,1,2 is showing a ~4 minute Linux64 opt build time drop for c4.4xlarge today. This will almost certainly generate a Perfherder alert once enough data comes in.
I rolled back the AMIs today after a report of the docker daemon failing to start on some instances. Connecting to a machine, docker.log showed a failure to create logical volumes:

ls: cannot access /dev/nvme*n*: No such file or directory
umount: /dev/xvdb: not mounted
No physical volume label read from /dev/xvdb
Can't open /dev/xvdb exclusively. Mounted filesystem?
Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'.
No physical volume label read from /dev/xvdb
Can't open /dev/xvdb exclusively. Mounted filesystem?
Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'.
No volume groups found
Creating logical volume 'instance_storage'
ls: cannot access /dev/nvme*n*: No such file or directory
umount: /dev/xvdb: not mounted

Tracking down the issue, the root cause was vgcreate trying to open /dev/xvdb exclusively and failing:

open("/dev/xvdb", O_RDWR|O_EXCL|O_DIRECT|O_NOATIME) = -1 EBUSY (Device or resource busy)
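Next step is to figure out what is holding /dev/xvdb open when vgcreate runs. Something along these lines should narrow it down (untested):

# who has the device open?
sudo fuser -v /dev/xvdb
sudo lsof /dev/xvdb

# is it already mounted or claimed by device-mapper / LVM?
grep xvdb /proc/mounts
sudo dmsetup ls
sudo pvs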
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Today I rolled out new images after some tests yesterday. gecko-t-linux-xlarge is the only worker type that has shown the problem so far.
Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due to using an older, buggy AMI. The new AMIs appear to "just work." wcosta: can we resolve this bug now?
Flags: needinfo?(wcosta)
(In reply to Gregory Szorc [:gps] from comment #27)
> Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due
> to using an older, buggy AMI. The new AMIs appear to "just work."
>
> wcosta: can we resolve this bug now?

I found one case with the newest AMI where the error happened. I connected to the machine, and after I rebooted it, the error was gone. It feels like there is a very subtle race condition in docker daemon initialization that produces the wrong behavior once in a while. Not sure if we should close this bug and investigate in a follow-up one, as it seems not to affect production. The only downside is the Papertrail backlog.
Flags: needinfo?(wcosta)
If the incidence rate is low, let's get a new bug on file. FWIW, I think the way we create the volume (as part of docker's initialization routine) is kinda wonky. IMO this should be a one-time operation performed as part of launching the EC2 instance. But we can discuss this in another bug.
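Roughly what I have in mind for that one-time setup, run from cloud-init or a unit ordered Before=docker.service (the logical volume name and mount point are illustrative, not necessarily what docker-worker uses today):

# one-time instance-storage setup, before the docker daemon ever starts
pvcreate /dev/xvdb
vgcreate instance_storage /dev/xvdb
lvcreate -l 100%FREE -n docker instance_storage
mkfs.ext4 /dev/instance_storage/docker
mkdir -p /var/lib/docker
mount /dev/instance_storage/docker /var/lib/docker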
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
The problem seems to have gotten a lot more frequent once I added c5 instances.
Rolled back the AMIs, as we won't take advantage of c5 instances and there is still occasional bustage with other instance types. I didn't revert the patches, however.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
New AMIs got rolled out today. Workers are still recycling. But it looks like things will stick this time.
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Blocks: 1424461
Blocks: 1425493
Component: Docker-Worker → Workers