Bug 1291940 (Closed) - TaskCluster slows down considerably when using AUFS
Opened 8 years ago, Closed 7 years ago
Categories: Taskcluster :: Workers (defect, P1)
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: gps, Assigned: wcosta
References: Blocks 2 open bugs
In bug 1290282, I was investigating some oddities where c4.8xlarge instances barely had any performance wins over c3.4xlarge instances on TC Firefox build tasks despite having 2x the number of CPU cores.
I knew something was wrong after garndt posted dstat output while `mach build` was doing some heavy compilation:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
10 5 83 3 0 0| 705k 16M| 0 0 | 0 0 | 18k 43k
39 15 42 4 0 0| 0 25M| 42k 44k| 0 0 | 52k 106k
36 8 29 28 0 0|4096B 21M| 35k 34k| 0 0 | 38k 68k
47 9 43 1 0 0| 0 4728k| 54k 56k| 0 0 | 48k 96k
45 14 41 0 0 0| 0 28k| 25k 25k| 0 0 | 46k 93k
42 14 44 0 0 0| 0 20k| 74k 74k| 0 0 | 53k 104k
34 20 46 0 0 0| 0 36k| 77k 76k| 0 0 | 57k 111k
35 19 45 0 0 0| 0 1376k| 53k 52k| 0 0 | 53k 104k
37 20 43 0 0 0| 0 9456k| 79k 82k| 0 0 | 54k 104k
28 27 46 0 0 0| 0 36k| 82k 81k| 0 0 | 54k 105k
31 24 45 0 0 0| 0 8192B| 41k 41k| 0 0 | 51k 99k
40 18 42 0 0 0| 0 28k| 44k 43k| 0 0 | 50k 97k
36 21 42 0 0 0| 0 1444k| 135k 139k| 0 0 | 45k 88k
33 21 46 1 0 0| 0 21M| 105k 100k| 0 0 | 47k 92k
27 26 47 0 0 0| 0 2020k| 129k 132k| 0 0 | 52k 103k
38 19 43 0 0 0| 0 32k| 90k 92k| 0 0 | 50k 96k
Note the very high sys (kernel) CPU usage and number of context switches. For comparison, here's what I got on a non-TC c4.8xlarge instance doing the same operation:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
93 6 1 0 0 0| 0 41M| 660B 832B| 0 0 |9801 4182
92 7 1 0 0 0| 0 7320k|2016B 13k| 0 0 |9596 7047
93 7 1 0 0 0| 0 8888k|2310B 10k| 0 0 |9567 4825
91 8 1 0 0 0| 0 9248k|2574B 11k| 0 0 |9798 6004
93 7 1 0 0 0| 0 7464k| 924B 3044B| 0 0 |9530 4706
94 6 0 0 0 0| 0 102M| 198B 730B| 0 0 | 10k 4105
94 5 1 0 0 0| 0 105M|1386B 8132B| 0 0 | 12k 7619
92 7 1 0 0 0| 0 16M|1716B 6180B| 0 0 |9710 8397
93 6 0 0 0 0| 0 4096k| 396B 1220B| 0 0 |9570 5332
The sys numbers are still a bit high for my liking. But context switches are *way* down, dare I say reasonable.
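(For reference, those columns are what a plain dstat invocation prints by default; the exact command wasn't captured in this bug, but something like this reproduces it, with the 5-second sampling interval being an assumption:)

  dstat 5    # total CPU, disk, net, paging and system (interrupts/context switches), sampled every 5s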
The TC team stood up a docker worker and gave me SSH access to investigate.
I initially thought OS and kernel differences might be to blame. My instance was running Ubuntu 16.04 (Linux 4.4). TC was running Ubuntu 14.04 (Linux 3.13) and had a number of outdated system packages. Upgrading system packages (including the kernel) and rebooting appeared to make things a bit faster, although I would have to measure again to confirm.
I built Firefox from source outside of Docker and both my instance and TC instance had remarkably similar times:
5:04 wall vs 5:03 wall
143:28 user vs 143:08 user
9:18 sys vs 9:29 sys
Inside Docker was a different story.
Building from a host-mounted ext4 volume was slightly slower:
5:42 wall
136:49 user
13:54 sys
But building from a Docker aufs volume was much slower:
8:38 wall
132:19 user
75:03 sys
dstat output during parts of the build was horrendous:
----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read writ| recv send| in out | int csw
9 82 9 0 0 0| 0 72k|3300B 15k| 0 0 | 20k 27k
7 83 11 0 0 0| 0 64k|3366B 12k| 0 0 | 19k 29k
3 88 9 0 0 0| 0 68k|1188B 5074B| 0 0 | 20k 29k
5 86 9 0 0 0| 0 64k| 66B 358B| 0 0 | 25k 25k
12 78 10 0 0 0| 0 576k|2574B 10k| 0 0 | 18k 27k
6 80 14 0 0 0| 0 30M| 990B 3684B| 0 0 | 19k 28k
4 91 4 0 0 0| 0 64k| 198B 722B| 0 0 | 21k 22k
5 93 1 0 0 0| 0 64k| 132B 874B| 0 0 | 22k 21k
9 85 6 0 0 0| 0 92k|2310B 8290B| 0 0 | 19k 26k
4 87 9 0 0 0| 0 1316k| 198B 866B| 0 0 | 18k 25k
3 95 3 0 0 0| 0 18M| 198B 714B| 0 0 | 21k 21k
7 90 3 0 0 0| 0 64k| 990B 4310B| 0 0 | 17k 22k
95% CPU time in the kernel!!!
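(For concreteness, the host-volume vs. container-filesystem comparison above boils down to something like the following; the image names and paths are illustrative rather than the exact commands used:)

  # Build from a host-mounted ext4 volume: a bind mount bypasses the storage driver.
  docker run --rm -v /mnt/ext4/gecko:/builds/gecko -w /builds/gecko build-image \
    sh -c 'time ./mach build'

  # Build from the container's own filesystem: with the checkout baked into the
  # image, all source/objdir I/O goes through the storage driver (aufs here).
  docker run --rm -w /builds/gecko build-image-with-checkout \
    sh -c 'time ./mach build'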
So it appears aufs under high load is a steaming pile. This shouldn't come as a surprise - aufs is an abandoned project and overlayfs is the equivalent technology that's part of the Linux kernel.
So where does this leave us? Well, TaskCluster is using aufs heavily in production. My understanding is that unless a cache is declared (caches use volumes mounted from the host and are backed by ext4), your I/O inside a TC task will be using aufs. And if aufs has performance problems, well, TC tasks will be slow.
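For anyone unfamiliar with the mechanism: a docker-worker task opts into a host-backed volume by declaring a cache in its payload, roughly like this (the cache name and mount point below are illustrative):

  "payload": {
    "cache": {
      "level-1-build-workspace": "/home/worker/workspace"
    }
  },
  "scopes": [
    "docker-worker:cache:level-1-build-workspace"
  ]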
The good news is a *lot* of Firefox automation uses caches (read: no aufs), so the negative impact of aufs is mitigated a bit. However, caches are stripped on most Try tasks because of concerns over cache poisoning. So e.g. a TC Firefox build will use ext4 on mozilla-central but aufs on Try. That means Try tasks will be slower than non-Try tasks. In theory, this applies to non-build tasks as well.
A potential silver lining is that the aufs performance degradation might only be for large CPU counts driving high concurrent I/O loads. In bug 1290282, c3.4xlarge instances were faster than c3.2xlarge instances (not linear as expected but still pretty good) whereas c3.8xlarge hardly had any advantage over c3.4xlarge despite double the cores. I'll have to run some tests on less beefy instances to see if aufs overhead is measurable on e.g. 1, 2, and 4 core instances, which make up most of the instances in TC. If it is, then this performance problem could result in thousands of hours of wasted machine time every day.
Per discussion in #taskcluster, jonasfj and garndt have ideas for addressing this at the worker level. I'll let them chime in.
Reporter
Comment 1 • 8 years ago
We currently build on c3.2xlarge instances on an Ubuntu 14.04 host. I decided to compare the performance of a build on a c4.2xlarge (8 vCPUs) instance at the host level and inside a Docker aufs volume to quantify the impact aufs has on TC tasks today.
Host build (ext3):
19:30 wall
138:54 user
7:23 sys
Docker build (aufs):
20:15 wall
138:50 user
9:33 sys
There appears to be slight overhead to aufs on a c4.2xlarge (8 vCPUs), although this was a sample size of 1, so the results could be within normal variance. FWIW, I did witness rather high sys CPU time during the "export" tier and during parts of the "compile" tier, but it wasn't as high as it is on the c4.8xlarge. (I suspect the "export" tier is bad because it creates thousands of files and symlinks and is effectively a stress test for write I/O.)
On a c4.4xlarge (16 vCPUs):
Host build (ext3):
10:59 wall
141:54 user
7:28 sys
Docker build (aufs):
11:47 wall
140:32 user
15:00 sys
sys time is higher with aufs. dstat output also shows sys CPU time was higher throughout this build compared to the c4.2xlarge. Still not as bad as c4.8xlarge. But noticeably worse than c4.2xlarge.
The good news is the aufs impact on c4.2xlarge instances appears minimal, and the ~45s gap between native and aufs remained intact on the c4.4xlarge (although I'm not sure how that happened with ~5 minutes more kernel time). The impact isn't horrendous like it is with c4.8xlarge instances. Since we don't run many instances with more than 8 cores in TC (do we run any at all?), this hopefully means we don't have major, widespread performance issues on TC due to aufs.
The bad news is that until we address this underlying problem with aufs, running high vCPU instances in TC will have diminishing returns for many workloads. We'll get *some* wins. But they won't be as large as expected because aufs will eat the extra CPU. Put another way, running high vCPU instances is, to some extent, throwing money away. c4.4xlarge has an arguably tolerable amount of aufs overhead. But we should definitely avoid c4.8xlarge instances for the time being, as aufs could eat a lot of their extra CPU.
Reporter
Comment 2 • 8 years ago
In #taskcluster yesterday, I stated that the excessive kernel time appears to have gone away in Ubuntu 16.04. I was wrong. Kernel overhead on a c4.8xlarge on Ubuntu 16.04 is *worse* than 14.04!
Ubuntu 14.04 aufs (from comment #0):
8:38 wall
132:19 user
75:03 sys
Ubuntu 16.04 aufs:
20:53 wall
103:03 user
461:28 sys !!!
Ubuntu 16.04 overlayfs:
5:49 wall
135:32 user
9:39 sys
Building outside of Docker yielded the same time as within Docker on overlayfs. That's an encouraging sign for overlayfs.
Comment 3 • 8 years ago
Did you use the overlay or overlay2 docker storage driver?
Reporter
Comment 4 • 8 years ago
overlay with docker 1.10 (the Docker that ships with Ubuntu 16.04).
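(For reference, you can confirm which driver a daemon is using with `docker info`; on Docker 1.10 the driver is selected with a daemon flag, commonly set via DOCKER_OPTS in /etc/default/docker on Ubuntu, though exactly how the flag reaches the daemon depends on the init setup:)

  docker info | grep -i 'storage driver'      # e.g. "Storage Driver: aufs"

  # /etc/default/docker (illustrative; Docker 1.10 predates daemon.json)
  DOCKER_OPTS="--storage-driver=overlay"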
Reporter
Comment 5 • 8 years ago
There has been some talk of mitigating the impact of aufs by aggressively using Docker volumes (which are essentially bind mounted from the host filesystem). This is a good idea in theory. So let's put that to the test.
On the same c4.8xlarge Ubuntu 16.04 instance used in comment #2, building Firefox with a source directory and object directory volume bind mounted from the host but still using aufs for the Docker container:
Ubuntu 16.04 host volume + aufs container:
10:09 wall
111:27 user
117:41 sys
As you can see, we're somewhere between native/overlayfs and full aufs despite avoiding aufs for the Firefox source and output.
What gives? My theory is that the *reads* from the aufs container filesystem by processes like python and gcc loading standard library modules and headers are triggering aufs overhead. So even though we do hardly any write I/O to aufs, aufs is still slowing us down by interfering with the read path. FWIW, dstat reported little to no disk reads during the build, presumably because everything was in the page cache. But apparently the page cache isn't good enough for aufs: it still interferes with reading.
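(One way to sanity-check that theory is to profile the kernel while the build is running and see whether aufs symbols dominate; this is a generic suggestion rather than something captured in this bug:)

  sudo perf top -g    # during the build; aufs_* and lock-contention symbols near the
                      # top would confirm the kernel time is going to the storage driver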
It's looking more and more like avoiding aufs entirely is the only path forward.
Reporter
Comment 6 • 8 years ago
I built with overlay2 and it has nearly identical performance to overlay.
Updated • 8 years ago
Blocks: thunder-try
Reporter
Comment 7 • 8 years ago
We have what looks to be an intermittent failure related to aufs in bug 1295474.
garndt: given the performance and stability implications of running aufs, what is the timeline for not using aufs?
Blocks: 1295474
Flags: needinfo?(garndt)
Comment 8 • 8 years ago
Note that bug 1256004 was also likely due to aufs, although we were able to work around it.
Comment 9 • 8 years ago
I don't really have a timeline, but I can say I'll start looking into this soon with others. I think making the changes is easy enough; it's validating that everything continues running as expected that might take a little time.
Flags: needinfo?(garndt)
Updated • 8 years ago
Assignee: nobody → garndt
Reporter
Comment 10 • 8 years ago
FWIW, I tried switching my local dev environment to overlay2 and I'm running into what looks like bugs due to the filesystem not behaving like a filesystem should. So it looks like that storage driver is still buggy :/
I've used devicemapper with some success. However, it is *really* slow at creating new image layers compared to aufs and overlayfs, which is a big deal for my local development environment, which rebuilds tons of Docker images frequently. That said, I don't think this is a major issue for TC because it treats Docker images as tarballs and doesn't do any of the fancy layering foo.
The matrix at https://docs.docker.com/engine/userguide/storagedriver/selectadriver/ also seems to indicate devicemapper might be the least crappy storage driver.
Comment 11 • 8 years ago
That's really good to know. Is it worth trying overlay2 on the TC instances, or is it just way too far gone to be of use?
Reporter
Comment 12 • 8 years ago
Everything coming out of Docker seems to indicate overlayfs/overlay2 is the future. But if there are bugs in the filesystem, I'm inclined to write it off.
I should clarify my comment about layer creation being slow. It's like 1-3s instead of say 100ms. This likely won't matter unless you are building or pulling images with many layers.
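(If anyone wants a rough number for that per-layer overhead on a given storage driver, a throwaway build with a few no-op layers is enough; this is purely illustrative:)

  cat > Dockerfile <<'EOF'
  FROM ubuntu:16.04
  RUN true
  RUN true
  RUN true
  EOF
  time docker build --no-cache -t layer-timing-test .    # each RUN creates a layer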
Comment 13 • 8 years ago
Another option is btrfs.
Reporter
Comment 14 • 8 years ago
I upgraded to Docker 1.12 and haven't yet reproduced the overlayfs bug I was seeing before. It's possible I wasn't actually running overlay2 on the machine encountering the issues referenced in comment #10.
Regarding btrfs, I'll have to give it another try. I think its layer generation is also a bit slow. But, again, we shouldn't care about this so much.
Reporter
Comment 15 • 8 years ago
Ahh - the issue I was having with overlay was related to hard links (some processes were confused by the presence of hard links where hard links should not have been present). The docs at https://docs.docker.com/engine/userguide/storagedriver/overlayfs-driver/ indicate that "overlay" relied heavily on hard links and "overlay2" does not.
FWIW, the bottom of that page contains some nuggets about overlayfs, notably that open(2) doesn't conform to POSIX standards and rename(2) isn't supported either. Good times. A "real" filesystem like btrfs keeps sounding better and better.
Reporter
Comment 16 • 8 years ago
UNIX domain sockets don't work with the overlay filesystem: https://github.com/docker/docker/issues/12080
Reporter
Comment 17 • 7 years ago
garndt: I keep seeing people bit by AUFS bugs or performance issues. The problems aren't obvious because AUFS "just works" most of the time. I suspect the use of AUFS is a "silent killer" of overall performance on TaskCluster, especially on large-core instances (read: we're wasting money due to sub-optimal performance).
What's the story on transitioning off AUFS? Depending on the timeline, it may justify someone doing work to have tasks alert when they are doing obviously sub-optimal things on AUFS like not having the workspace be a cache. But that won't save us: frequently accessed files baked into the Docker image will still kill perf because they are hosted on AUFS. And Try tasks can't use caches easily, meaning they always get AUFS and are therefore slower.
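A minimal sketch of such an alert, assuming the task exposes its working directory in something like a WORKSPACE variable (nothing like this exists in the workers today):

  # Warn when the workspace is still on the container filesystem (aufs) rather
  # than a host-backed cache. WORKSPACE is a placeholder for the task's workdir.
  fstype=$(df --output=fstype "$WORKSPACE" | tail -n 1)
  if [ "$fstype" = "aufs" ]; then
    echo "WARNING: workspace is on aufs; declare a cache to avoid slow I/O" >&2
  fi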
Flags: needinfo?(garndt)
Comment 18 • 7 years ago
Sorry for the delayed response, I just returned from PTO and will be looking at this once I get caught up. I will have to see where we are with other priority projects on our plate. Last time I looked at this, I think I hit a rabbit hole of issues when trying to switch the fs drivers, but we can investigate with a new set of eyes.
I'm going to leave the ni? at least for a couple of days to keep it on my radar.
Comment 19 • 7 years ago
Switching ni? to wander, he will start looking into this.
Flags: needinfo?(garndt) → needinfo?(wcosta)
Updated • 7 years ago
Component: Worker → Docker-Worker
Assignee
Updated • 7 years ago
Assignee: garndt → wcosta
Status: NEW → ASSIGNED
Flags: needinfo?(wcosta)
Reporter
Comment 20 • 7 years ago
While I was hacking on bug 1415725, I cranked out a patch to switch docker-worker to overlay2. PR at https://github.com/taskcluster/docker-worker/pull/337.
Comment 21 • 7 years ago
Commit pushed to master at https://github.com/taskcluster/docker-worker
https://github.com/taskcluster/docker-worker/commit/feb14875f6dbee6cdbca619a280f4b56deeb398b
Upgrade to kernel 4.4 and use overlay2 storage driver
AUFS is the source of numerous performance problems and
filesystem consistency issues. Bug 1291940 has plenty of
details. The tl;dr is AUFS takes out a global kernel lock
for pretty much all I/O operations and concurrent I/O
grinds to a halt as you add more concurrency. Firefox builds
actually get slower as you use more cores. In addition, there
are consistency issues with AUFS that result in files randomly
not being available after write.
In this commit, I bump the kernel version from 3.13 to 4.4
(so overlayfs is usable), add support for the overlay2 storage
driver (added in Docker 1.12 - which we conveniently use),
and switch the default storage driver to overlay2.
The version of v4l2loopback was also bumped to the latest
in order to pull in fixes so it compiles with Linux 4.4.
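(The AMI plumbing is in the commit above; generically, pointing a Docker 1.12 daemon at overlay2 looks roughly like this. Note that images stored under the old driver won't be visible after the switch and need to be re-pulled:)

  cat > /etc/docker/daemon.json <<'EOF'
  {
    "storage-driver": "overlay2"
  }
  EOF
  systemctl restart docker
  docker info | grep -i 'storage driver'    # should now report overlay2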
Reporter
Comment 22 • 7 years ago
wcosta just pushed this out ~2 hours ago. So all new worker instances should be using the new AMIs with Linux 4.4 and overlay2 storage driver.
I'm optimistic we'll see perfherder alerts as a result of this change. Could take a while for existing workers running AUFS to self-terminate, however.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Reporter
Comment 23 • 7 years ago
https://treeherder.mozilla.org/perf.html#/graphs?series=mozilla-inbound,1445527,1,2&series=mozilla-inbound,1582370,1,2&series=mozilla-inbound,1446792,1,2 is showing a ~4 minute Linux64 opt build time drop for c4.4xlarge today. This will almost certainly generate a Perfherder alert once enough data comes in.
Assignee
Comment 24 • 7 years ago
I rolled back the AMIs today after a report of the docker daemon failing to start on some instances. Connecting to a machine, docker.log showed a failure to create logical volumes:
ls: cannot access /dev/nvme*n*: No such file or directory
umount: /dev/xvdb: not mounted
No physical volume label read from /dev/xvdb
Can't open /dev/xvdb exclusively. Mounted filesystem?
Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'.
No physical volume label read from /dev/xvdb
Can't open /dev/xvdb exclusively. Mounted filesystem?
Unable to add physical volume '/dev/xvdb' to volume group 'instance_storage'.
No volume groups found
Creating logical volume 'instance_storage'
ls: cannot access /dev/nvme*n*: No such file or directory
umount: /dev/xvdb: not mounted
Tracking down the issue, the root cause was vgcreate trying to open /dev/xvdb exclusively and failing:
open /dev/xvdb exclusively open("/dev/xvdb", O_RDWR|O_EXCL|O_DIRECT|O_NOATIME) = -1 EBUSY (Device or resource busy)
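(The usual way to chase an EBUSY like that is to ask the kernel what already has the device claimed; these are generic commands rather than the exact steps taken:)

  fuser -vm /dev/xvdb        # processes using the device or a filesystem mounted from it
  lsof /dev/xvdb             # open file descriptors on the block device
  grep xvdb /proc/mounts     # is it (still) mounted somewhere?
  dmsetup table              # is device-mapper/LVM already layered on top of it?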
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee
Comment 25 • 7 years ago
Today I rolled out the images after some tests yesterday. gecko-t-linux-xlarge is the only worker type that has shown the problem so far.
Assignee
Comment 26 • 7 years ago
* Today I rolled out
Reporter
Comment 27 • 7 years ago
Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due to using an older, buggy AMI. The new AMIs appear to "just work."
wcosta: can we resolve this bug now?
Flags: needinfo?(wcosta)
Assignee
Comment 28 • 7 years ago
(In reply to Gregory Szorc [:gps] from comment #27)
> Per IRC discussion, it looks like failures in gecko-t-linux-xlarge were due
> to using an older, buggy AMI. The new AMIs appear to "just work."
>
> wcosta: can we resolve this bug now?
I found one case with the newest AMI where the error happened. I connected to the machine, and after I rebooted it, the error was gone. It feels like there is a very subtle race condition in docker daemon initialization that produces the wrong behavior once in a while. I'm not sure if we should close this bug and investigate in a follow-up one, as it seems not to affect production. The only downside is Papertrail backlog.
Flags: needinfo?(wcosta)
Reporter
Comment 29 • 7 years ago
If the incidence rate is low, let's get a new bug on file.
FWIW, I think the way we create the volume (as part of docker's initialization routine) is kinda wonky. IMO this should be a one-time operation performed as part of launching the EC2 instance. But we can discuss this in another bug.
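Something along these lines, run once from instance user-data before Docker starts, is the kind of one-time setup I mean (the device name is illustrative; the volume-group name is taken from the log in comment 24):

  set -e
  DEV=/dev/xvdb                      # illustrative; some instance types expose nvme devices instead
  umount "$DEV" 2>/dev/null || true
  pvcreate -ff -y "$DEV"
  vgcreate instance_storage "$DEV"
  lvcreate -l 100%VG -n docker instance_storage
  mkfs.ext4 /dev/instance_storage/docker
  mkdir -p /var/lib/docker
  mount /dev/instance_storage/docker /var/lib/docker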
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Assignee
Comment 30 • 7 years ago
The problem seems to have gotten a lot more frequent once I added c5 instances.
Assignee
Comment 31 • 7 years ago
Rolled back the AMIs, as we won't take advantage of c5 instances and there was still occasional bustage with other instances. I didn't revert the patches, however.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Reporter
Comment 32 • 7 years ago
New AMIs got rolled out today. Workers are still recycling. But it looks like things will stick this time.
Status: REOPENED → RESOLVED
Closed: 7 years ago → 7 years ago
Resolution: --- → FIXED
Updated • 6 years ago
Component: Docker-Worker → Workers