Closed Bug 1286935 Opened 8 years ago Closed 8 years ago

Requests to queue.taskcluster.net from Heroku sometimes taking up to 30 seconds

Categories

(Taskcluster :: Services, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: emorley, Assigned: jhford)

References

Details

Attachments

(3 files)

Treeherder's log parser fetches logs from queue.taskcluster.net in order to parse them. For the last 12 hours, our SCL3 instance was seeing request times between 1.0-1.5s:
https://rpm.newrelic.com/accounts/677903/applications/5585473/externals?tw%5Bend%5D=1468523469&tw%5Bstart%5D=1468480269#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d&sort_by=average_call_time
(will attach graph screenshots shortly for those without New Relic access)

However, for the same timeframe, our Heroku instance saw times between 2-29s:
https://rpm.newrelic.com/accounts/677903/applications/14179733/externals?tw%5Bend%5D=1468523454&tw%5Bstart%5D=1468480254#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d&sort_by=average_call_time

An example of a log that was slow to fetch:
https://queue.taskcluster.net/v1/task/G4y51plRTzu3UgrjjFWkLQ/runs/0/artifacts/public%2Flogs%2Flive_backing.log

I'm presuming this may be because the Heroku instance's requests are being routed through cloud-mirror. As such, I'm guessing that the problem is either:
a) cloud-mirror was under too much load at those times
b) Treeherder was the first consumer to request that log, and the first fetch just takes a while (until cloud-mirror has mirrored it to the relevant region)
c) some other issue?

Some questions:

1) Is there logging somewhere for cloud-mirror that can show how long requests are taking / the current load? If not, could we add some?

2) I see cloud-mirror's `maxWaitForCachedCopy` is set to 28000ms, which matches the max times we're seeing. Can we lower this please?

3) Given that I'm pretty sure Treeherder is the majority consumer of these logs (can we work out some stats? it sets a unique user-agent) - and Heroku's US region choice (where Treeherder will soon live) is us-east - should we just store the Taskcluster logs in us-east to start with, to save having to always use cloud-mirror? Is there a reason Taskcluster's AWS instances use us-west? (The Heroku parts of Taskcluster are us-east already.)
Flags: needinfo?(jhford)
Flags: needinfo?(garndt)
Attached image scl3-request-durations.jpg (deleted) —
Attached image heroku-request-durations.jpg (deleted) —
As far as configuring cloud-mirror goes, I will defer that to John.

I'm pretty sure Treeherder might be the only consumer of the logs from within AWS. There are people that will view them from the task inspector or the raw log button in Treeherder, but that's outside of AWS and not using cloud-mirror. I can't think of anything within AWS that would be requesting those logs other than Treeherder, and since Heroku runs in us-east and our canonical artifact bucket is us-west-2, that guarantees that every log for jobs reported to Treeherder will be transferred over to us-east.

Our instances producing those logs can be spawned in any of the three US EC2 regions, but the upload of logs always goes to our canonical bucket, us-west-2, first. We have talked before about posting artifacts to buckets within the region and then somehow getting those back into us-west-2, but I don't think there was much traction there.

There is also no way to request an artifact and explicitly say not to use cloud-mirror. It's an automatic operation by the queue.
Attached image us-east-1_copy_times.png (deleted) —
Here are the copy times we have for us-east-1, but they're not broken down per artifact, so not terribly helpful. We have a lot of artifacts of various sizes being transferred (some being many gigabytes, resulting in higher times).
Flags: needinfo?(garndt)
Thank you, that's helpful. Some more questions (in addition to #1/#2 in comment 0 for John):

4) It is my understanding that the point of cloud-mirror is to save money (and I guess time) by avoiding multiple region-to-region file transfers for frequently requested resources. For docker images or other similar artifacts (I'm just guessing as to what else it mirrors) this makes perfect sense, but I wonder if logs are normally only ever accessed once from AWS, by just Treeherder? In which case perhaps cloud-mirror is redundant for logs specifically? Although "Treeherder" means both the stage and prod instances of Treeherder, to be fair.

5) Even if #4 isn't true (and we still want to use cloud-mirror for logs), it would presumably be cheaper to make the canonical S3 bucket be in us-east, rather than always uploading to us-west and immediately mirroring to us-east.

It's worth noting as well that Treeherder requests every job's log, not just those for failing jobs.

(In reply to Greg Arndt [:garndt] from comment #3)
> We have talked before about posting artifacts to buckets
> within the region and then somehow get those back into us-west-2, but I
> don't think there was much traction there.

I don't think there would be much cost saving from doing this, given #5 above?

> There is also not a way to request a artifact and explicitly say to not use
> cloud-mirror. It's an automatic operation by the queue.

Ah, that was going to be my next question.
Also it would seem that in the cases where cloud-mirror hits maxWaitForCachedCopy, we get the worst of all worlds:
* a long request time for Treeherder
* S3 transfer cost for Treeherder connecting directly to the original resource
* S3 transfer cost from the cloud-mirror mirroring attempt, which is going to finish too late to be any use (and the cached resource may never get used, since Treeherder may end up having been the only AWS tool to request it)
Note: we can't disable cloud-mirror for logs only. The only way to do so would be a hack/regexp on the name... I don't want to go there :)

I am not sure we can/should solve this. Cost-wise it probably/hopefully doesn't matter what we do, as logs are ideally not a significant portion of traffic.

Maybe cloud-mirror could be taught to skip caching of small artifacts... Or do the caching of small artifacts in the web process... though I'll admit that is a bit scary too.

@jhford, maybe we should review the cloud-mirror stats to see if cache hit rates have changed recently. I hope Treeherder is just a small number of requests, but it is hard to tell...
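A minimal sketch of the "skip caching of small artifacts" idea, assuming cloud-mirror can afford a HEAD request per artifact (the 8 MB threshold and helper name are made up here, not cloud-mirror's actual code):

======================================================================
var request = require('request-promise');

// Arbitrary cut-off below which mirroring is assumed not to be worth it.
var SMALL_ARTIFACT_THRESHOLD = 8 * 1024 * 1024; // 8 MB

// Returns true if the artifact is big enough to be worth an inter-region
// copy; small artifacts would instead get a 302 straight to the canonical
// bucket rather than being queued for a copy node.
async function shouldMirror(upstreamUrl) {
  let response = await request.head({
    uri: upstreamUrl,
    resolveWithFullResponse: true,
  });
  let size = parseInt(response.headers['content-length'] || '0', 10);
  return size >= SMALL_ARTIFACT_THRESHOLD;
}
======================================================================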
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #7)
> I am not sure we can/should solve this. Cost-wise it probably/hopefully
> doesn't matter what we do, as logs are ideally not a significant portion of
> traffic..

This bug was filed about the 15x slowdown affecting Treeherder due to cloud-mirror, rather than about cost. So we really do need to fix this bug.

Can we just upload all logs to us-east instead? (Presumably cloud-mirror then wouldn't need to do anything, so the slowdown would go away.)
(In reply to Ed Morley [:emorley] from comment #0)
> Some questions:
>
> 1) Is there logging somewhere for cloud-mirror that can show how long
> requests are taking / current load? If not, could we add some?

Yes: https://app.signalfx.com/#/dashboard/Ci0GbQDAcAM?startTime=-1d&endTime=Now

I'm not sure what you need to do to access this, but it shows all of our metrics regarding the current copy speed and copy times, as well as our cache hit and miss rates. We usually have a very good rate as long as the inter-region transfer speeds are good.

> 2) I see cloud-mirror's `maxWaitForCachedCopy` is set to 28000ms, which
> matches the max times we're seeing. Can we lower this please?

For larger artifacts, we really do need to have >25000ms. Getting into special timeouts for different types of artifacts is a lot of added complexity and likely wouldn't help in the end, unless we're starved for copy nodes. If we're bound by the time it takes for a cloud-mirror copy node to do the file transfer, your process would also be bound by the same speed.

I don't have metrics right now on how many items are waiting in the SQS copy queue, but it's theoretically possible that there aren't enough copiers around. I will look into adding a metric that periodically monitors how many items are in the work queue, and into increasing the number of copy nodes. Sadly, docker cloud is very limiting here: we can only have ten containers in a service without a couple of janky hacks. I'm considering dropping docker cloud and just deploying on bare metal. I've filed bug 1287429 to track the monitoring work. Dropping docker-cloud is in the very early stages.

> 3) Given I'm pretty sure Treeherder is the majority consumer of these logs
> (can we work out some stats? it sets a unique user-agent) - and Heroku's US
> region choice (where Treeherder will soon live) is in US-east, should we
> just store the taskcluster logs in US-east to start with, to save having to
> always use cloud-mirror? Is there a reason Taskcluster's AWS instances use
> US-west? (The Heroku parts of Taskcluster are US-east already)

I'm not sure why us-west-2 was chosen as the canonical region, but it was. I'm not sure what would be involved in switching to us-east-1, but I don't personally have any issues with that change happening. I don't know if special-casing logs is really the right thing to do, or if it would be better to just move all artifacts. Jonas could speak more to this than me.
Flags: needinfo?(jhford)
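For reference, the maxWaitForCachedCopy behaviour discussed above amounts to "poll for the cached copy, then fall back to the canonical URL". A rough illustration, not cloud-mirror's actual implementation (the lookup function and poll interval are placeholders):

======================================================================
var MAX_WAIT_FOR_CACHED_COPY = 28000; // ms, the config value discussed above
var POLL_INTERVAL = 500;              // ms, made up

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// lookupCachedCopy is a placeholder for whatever backend holds the cache
// state (e.g. a Redis lookup); it resolves to the in-region URL or null.
async function redirectUrlFor(upstreamUrl, lookupCachedCopy) {
  let deadline = Date.now() + MAX_WAIT_FOR_CACHED_COPY;
  while (Date.now() < deadline) {
    let cached = await lookupCachedCopy(upstreamUrl);
    if (cached) {
      return cached;   // 302 to the copy in the requester's region
    }
    await sleep(POLL_INTERVAL);
  }
  return upstreamUrl;  // give up and 302 to the canonical bucket
}
======================================================================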
Ed, as a test, I've created a second identical service in docker cloud and bumped us up to 20 total copier processes running on 4 m4.xlarges. Please let me know if your times go down or not.
Flags: needinfo?(emorley)
Relevant pull request: https://github.com/taskcluster/cloud-mirror/pull/18

This will let us see how long each message waits before its work starts.
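The measurement itself could be as simple as stamping each SQS message when it is enqueued and reporting the delta when a copier picks it up. A sketch under that assumption (the field and metric names here are illustrative, not the PR's actual schema):

======================================================================
var AWS = require('aws-sdk');
var sqs = new AWS.SQS({region: 'us-west-2'});

// Enqueue a copy request with an enqueue timestamp embedded in the body.
async function enqueueCopy(queueUrl, upstreamUrl) {
  await sqs.sendMessage({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({
      url: upstreamUrl,
      enqueuedAt: Date.now(),
    }),
  }).promise();
}

// When a copier dequeues the message, report how long it sat waiting.
function reportWaitTime(rawMessageBody, reportMetric) {
  let body = JSON.parse(rawMessageBody);
  reportMetric('cloud-mirror.sqs-message-wait-ms', Date.now() - body.enqueuedAt);
}
======================================================================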
Thank you for spinning up the additional instance and opening that PR. The long (20-30s response time) requests come in spikes throughout the day, so we'll need to wait a bit to see how things have improved.

I've sent you an invite to the Treeherder New Relic account, so you can follow along too, at:
https://rpm.newrelic.com/accounts/677903/applications/14179733/externals#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d

(Use the time adjustment menu to switch to e.g. the 24 hour or 3 day view to compare.)
Flags: needinfo?(emorley)
Sadly there was another massive slowdown between 05:35 and 06:55 UTC+1 today.

Permalink of the last 3 days until now:
https://rpm.newrelic.com/accounts/677903/applications/14179733/externals?tw%5Bend%5D=1468912544&tw%5Bstart%5D=1468653344#id=5b2245787465726e616c2f71756575652e7461736b636c75737465722e6e65742f616c6c222c22225d

(In reply to John Ford [:jhford] from comment #9)
> I'm not sure why us-west-2 was chosen as the canonical region, but it was.
> I'm not sure what would be involved in switching to us-east-1, but I don't
> personally have any issues with that change happening. I don't know if
> special casing logs is really the right thing to do, or if it would be
> better to just move all artifacts. Jonas could speak more to this than me.

Jonas?
Flags: needinfo?(jopsen)
UTC+1 is London time right now, right? Berlin/CEST is +2? If so, that corresponds directly to a large spike in all transfer times. It looks like it was related to extremely large artifacts being requested. I suspect these are B2G artifacts. The transfer speeds remained relatively constant, but the transfer times were a lot longer. I suspect that these extremely large files are starving out the copiers for the smaller files.

Based on the CPU utilization being so low, I've added another 10 containers, but left the number of nodes static. This *should* make it harder to become starved for copy nodes, but until I land PR #18, I won't know for sure. That's what I'm about to do.

I think something needs to be done about these massive files. They're taking up to 6 minutes to complete, which isn't acceptable. I don't know much about the jobs that are using these files. Greg, Wander, do you know what's going on here? Are these things that we really do need to be caching? I suspect that if they are, we need to do some special work in the tasks to ensure 100% that they are fetched from the cache and never redirected to the original URL.

If we're confident that we can trust Content-Length, I think we could do a couple of things to mitigate this issue:

1. Ignore files that have a Content-Length above 2GB. These files are definitely not going to transfer in the window; maybe it's better to just skip them. The downside is that these are the files we get the biggest caching win with.
2. Create two buckets (and thus two work queues) of really-big files and normal-sized files, and have a small pool of containers that work through the big files, leaving the normal containers to pick up the slack (see the sketch after this comment).

And a couple of other things to consider:

1. Move artifacts to us-east-1. We're so tightly coupled to Heroku, it's sort of weird that we use us-west-2 as our canonical region.
2. Figure out how to deploy Heroku apps in us-west-2. I saw something about Heroku being able to pick the AWS region, but I don't know the details.
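A rough sketch of the second mitigation above (two work queues split by size); the queue URLs and the 2 GB cut-off are placeholders, not real configuration:

======================================================================
var AWS = require('aws-sdk');
var request = require('request-promise');

var sqs = new AWS.SQS({region: 'us-west-2'});
var BIG_FILE_QUEUE = 'https://sqs.us-west-2.amazonaws.com/EXAMPLE/big-copies';
var NORMAL_QUEUE   = 'https://sqs.us-west-2.amazonaws.com/EXAMPLE/normal-copies';
var BIG_FILE_THRESHOLD = 2 * 1024 * 1024 * 1024; // 2 GB

// Route each copy request by Content-Length so multi-gigabyte files cannot
// starve the copiers that handle ordinary artifacts such as logs.
async function enqueueBySize(upstreamUrl) {
  let response = await request.head({
    uri: upstreamUrl,
    resolveWithFullResponse: true,
  });
  let size = parseInt(response.headers['content-length'] || '0', 10);
  let queueUrl = size >= BIG_FILE_THRESHOLD ? BIG_FILE_QUEUE : NORMAL_QUEUE;
  await sqs.sendMessage({
    QueueUrl: queueUrl,
    MessageBody: JSON.stringify({url: upstreamUrl, size: size}),
  }).promise();
}
======================================================================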
Bug 1287450 is interfering with my ability to get this done.
(In reply to John Ford [:jhford] from comment #14)
> 1. figure out how to deploy heroku apps in us-west-2. I saw something about
> heroku being able to pick aws region, but I don't know the details.

This isn't possible unless we (a) persuade whomever pays for Heroku to switch us to their Enterprise contract, and (b) once on Enterprise, switch from the 'common runtime' to 'private spaces' (and this requires a manual migration of the app).

(Though that said, I would like to persuade them to switch to Enterprise regardless, since we could really do with SSO and the finer-grained team permissions.)

$ heroku regions
ID         Location                 Runtime
─────────  ───────────────────────  ──────────────
eu         Europe                   Common Runtime
us         United States            Common Runtime
frankfurt  Frankfurt, Germany       Private Spaces
oregon     Oregon, United States    Private Spaces
tokyo      Tokyo, Japan             Private Spaces
virginia   Virginia, United States  Private Spaces

And:

$ http --session=heroku -b https://api.heroku.com/regions/us
{
    "country": "United States",
    "created_at": "2012-11-21T21:44:16Z",
    "description": "United States",
    "id": "59accabd-516d-4f0e-83e6-6e3757701145",
    "locale": "Virginia",
    "name": "us",
    "private_capable": false,
    "provider": {
        "name": "amazon-web-services",
        "region": "us-east-1"
    },
    "updated_at": "2015-08-20T01:37:59Z"
}

$ http --session=heroku -b https://api.heroku.com/regions/oregon
{
    "country": "United States",
    "created_at": "2015-08-20T01:37:59Z",
    "description": "Oregon, United States",
    "id": "be90a7d3-570f-4ba7-bc42-f33c2cbece22",
    "locale": "Oregon",
    "name": "oregon",
    "private_capable": true,
    "provider": {
        "name": "amazon-web-services",
        "region": "us-west-2"
    },
    "updated_at": "2015-08-20T01:37:59Z"
}

See: https://devcenter.heroku.com/articles/regions
I just switched from 4x m4.xlarge to 2x c4.4xlarge, which cost roughly the same amount of money. It seems to be lowering the transfer times. It looks like this instance type has much better inter-region transfer speeds: with the same number of containers, it appears that the 2.8GB file that was taking 4.86 minutes is now taking 2.26 minutes. This is not a solution, but if we can double the throughput of even the slowest files, we'll work through the queue quicker.
A few random thoughts here:

* Ignoring large files for the moment seems likely to alleviate this issue short-term (at a cost in transfer).
* This seems like the perfect job for AWS SQS + autoscaling, where AWS itself would spin up the additional capacity during spikes. The autoscaled instances could either run the copy operations in multiple node processes, or in multiple docker containers configured to start on boot.
* We might consider read-through (that is, redirecting to the source object) after a short wait for a copy to complete. That short wait could scale with object size, so it's 500ms for a small JSON blob but tens of seconds for a multi-GB device image.
* It may also be useful to "pre-warm" certain caches on task completion. This could be a simple pulse listener that issues a specially-crafted "HEAD" request to cloud-mirror for artifacts matching a list of regexes (see the sketch after this comment). This would avoid, or at least reduce, the wait-for-copy penalty paid by the first requester.

In determining the best solution, it would be helpful to know:
- how many times an artifact is requested (ideally as a function of size)
- when, relative to a task's completion, an artifact is requested
- what the outliers are from these patterns (e.g., docker images, tc-vcs caches)

This information should allow generation of a simple model of hit rate, storage costs, and requester wait time, which are the parameters we're trying to balance.
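A minimal sketch of the pre-warm idea from the last bullet, assuming that a HEAD through the queue's artifact endpoint (followed through its redirect) is enough to kick off the cloud-mirror copy; the regex list and function shape are invented for illustration:

======================================================================
var request = require('request-promise');

// Artifacts worth warming as soon as the task completes.
var PREWARM_PATTERNS = [
  /^public\/logs\/live_backing\.log$/,
];

async function prewarmArtifacts(taskId, runId, artifactNames) {
  for (let name of artifactNames) {
    if (!PREWARM_PATTERNS.some(re => re.test(name))) {
      continue;
    }
    let url = 'https://queue.taskcluster.net/v1/task/' + taskId +
              '/runs/' + runId + '/artifacts/' + encodeURIComponent(name);
    // The HEAD response itself is discarded; its only purpose is to start
    // the copy before the first real consumer (e.g. Treeherder) asks.
    await request.head({uri: url, followAllRedirects: true, simple: false});
  }
}
======================================================================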
> I'm not sure why us-west-2 was chosen as the canonical region, but it was.

Back in the day, us-east-1 didn't have read-after-write consistency for the first PUT. In fact, us-east-1 AFAIK still offers some endpoints with eventual consistency, though it should be said that it now has endpoints that are read-after-write consistent for the first PUT.

As the bucket is stored in the azure entities for artifacts, we can change the canonical bucket. But it would take some minor tricks, and some asserts in the code would have to be removed; cloud-mirror would also have to be moved. If there is significant interest, it's something we can put on the road map.
Flags: needinfo?(jopsen)
If we can solve this bug by increasing the instance sizes and/or numbers for cloud-mirror, or else via some of the other suggestions in e.g. comment 14 / comment 18, then which region is used for storage makes very little difference to me. (It will just be up to the Taskcluster team to make the cost/benefit tradeoff of comment 19.)

We just need to find _a_ solution to this in the extremely short term, since in 1-2 weeks Treeherder will be on Heroku, and any log parsing issues due to cloud-mirror will become a tree-closing event.
(In reply to Ed Morley [:emorley] from comment #20)
> If we can solve this bug via increasing instance sizes/and or numbers for
> cloud-mirror, or else via some of the other suggestions in eg comment 14 /
> comment 18, then which region is used for storage makes very little
> difference to me. (It will just be up to the Taskcluster team to make a
> cost/benefit tradeoff of comment 19.)

I switched to a significantly better instance type for the copier and have gone to a total of 40 copy nodes. My graphs are a lot better now.

> We just need to find _a_ solution to this in the extremely short term, since
> in 1-2 weeks Treeherder will be on Heroku, and any log parsing issues due to
> cloud-mirror will become a tree closing event.

In that case, this bug would not cause a tree-closing event. Barring exceptional cases, cloud-mirror guarantees a 302 redirect to a copy of the artifact in your region or the canonical one within 30 seconds.
(In reply to John Ford [:jhford] from comment #21)
> In that case, this bug would not cause a tree-closing event. Barring
> exceptional cases, cloud-mirror guarantees a 302 redirect to a copy of the
> artifact in your region or the canonical one within 30 seconds.

That timeout makes the entire parsing job take 15x longer. Until now the workload has been CPU-bound, so we're not using eventlet with celery, and this causes a backlog.
(In reply to John Ford [:jhford] from comment #21)
> I switched to a significantly better instance type for the copier and have
> gone to a total of 40 copy nodes. My graphs are a lot better now

Many thanks :-)
So it looks like that with the c4.4xlarge instead of m4.xlarge and the increased number of docker cloud containers, we've reduced the peak time in New Relic from ~28s to ~4.5s. The average time is ~2.5s. The changes have been in production for >24 hours with the new node type, and that period included many of the large files that were previously thought to be causing the delay. Ed and I discussed this on IRC and we've decided to call this fixed.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Many thanks! :-)
Assignee: nobody → jhford
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I'm writing a patch right now to monitor the number of messages that are waiting and in progress. Once deployed, we will see whether the issue is copy nodes being starved or something else. As long as we aren't starved for copy nodes, cloud-mirror really doesn't add much, if any, overhead vs. doing the request yourself inside Heroku directly to us-west-2.

We have lots of data showing that the us-west-2 -> us-east-1 transfer speeds are quite low. I'm doing a few different things to try to improve the situation, but this is likely down to low speeds between the regions.
I am now monitoring the number of pending and inflight messages with SQS. This will tell us for sure whether the issue is transfer speeds or an application issue with cloud-mirror. As long as we don't see large numbers of cloud-mirror.production.sqs-messages.waiting, we're bound only by transfer speeds.
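For anyone wanting to reproduce this kind of monitoring, the SQS counts mentioned above map onto queue attributes that can be polled; a sketch, with the queue URL and the "inflight" metric name as placeholders (the "waiting" metric name is the one quoted above):

======================================================================
var AWS = require('aws-sdk');
var sqs = new AWS.SQS({region: 'us-west-2'});

async function reportQueueDepth(queueUrl, reportMetric) {
  let result = await sqs.getQueueAttributes({
    QueueUrl: queueUrl,
    AttributeNames: [
      'ApproximateNumberOfMessages',           // waiting for a copy node
      'ApproximateNumberOfMessagesNotVisible', // currently being processed
    ],
  }).promise();
  reportMetric('cloud-mirror.production.sqs-messages.waiting',
               parseInt(result.Attributes.ApproximateNumberOfMessages, 10));
  reportMetric('cloud-mirror.production.sqs-messages.inflight',
               parseInt(result.Attributes.ApproximateNumberOfMessagesNotVisible, 10));
}
======================================================================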
Thank you :-)
The relevant chart is here: https://app.signalfx.com/#/chart/CoNJtW8AcAc?startTime=-1d&endTime=Now If that gets to be rather large, then we are being starved for copy nodes.
Hmm so the request times regularly spike every 4 hours looking at New Relic (I don't have access to the graph in comment 31). What's scheduled every 4 hours?
I've since moved cloud-mirror copiers onto even bigger machines and haven't seen too many long file copy times. Are we OK to close this bug out?
Sounds good :-)
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Seeing more spikes/backlogs today. Please can we just put the logs in the correct region to start with? (Comment 8, comment 19). This all seems like such a hack around not doing so...
(In reply to Ed Morley [:emorley] from comment #36)
> Seeing more spikes/backlogs today.
> Please can we just put the logs in the correct region to start with?
> (Comment 8, comment 19).
>
> This all seems like such a hack around not doing so...

Ed, us-west-2 is the correct region for artifacts. Logs are artifacts. Heroku being in us-east-1 is the issue that needs to be worked around. Our reason for choosing a region other than us-east-1 is that S3 in that region does not offer immediate consistency, which adds a lot of complexity to this problem.

I have been busy working on two things to improve this for you. The first is that I wrote an entirely new SQS queue library: http://github.com/jhford/sqs-simple. This will enable us to easily have a bunch more file copiers. Second is that I've used that library to implement having even more file copiers in each process. That work is being tracked here: https://github.com/taskcluster/cloud-mirror/pull/19. These should help this situation substantially, so please understand that the performance of cloud-mirror is not something that has been forgotten.

Most importantly, though, is to understand that we're working with a distributed system. A 30s delay is unfortunate, but as that's an expected thing to happen, your system should be able to handle this occurrence correctly.
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #37)
> Our reason for choosing a region other than us-east-1 is that S3 in that
> region does not offer immediate consistency, which adds a lot of complexity
> to this problem.

My understanding from comment 19 was that this is now mostly fixed.

> I have been busy working on two things to improve this for you.

Much appreciated :-)

> Most importantly, though, is to understand that we're working with a
> distributed system. A 30s delay is unfortunate, but as that's an expected
> thing to happen, your system should be able to handle this occurrence
> correctly.

I agree we should handle occasional failures (we do), but a backlog in anything to do with the CI system is a tree-closing event - whether that be job scheduling becoming backlogged, or log parsing now increasing in duration from 5s to 30s per log.

I think we're just going to have to move ahead with the Heroku migration, and in the worst case sheriffs will just have to close trees if the backlogs are problematic.
Another option: given logs are typically only accessed once (particularly given the caching period is only 24 hours), can we just skip cloud-mirror for them? I don't see how it's really saving much money.

From an outsider's perspective, queue.taskcluster.net "works" from !AWS, but is very unusable from AWS, due to a cost-saving system from which we personally see no benefit, and from which I suspect you see little benefit (for logs specifically) either.
(In reply to Ed Morley [:emorley] from comment #39)
> Another option:
>
> Given logs are typically only accessed once (particularly given the caching
> period is only 24 hours), can we just skip cloud-mirror for them? I don't
> see how it's really saving much money.
>
> From an outsiders perspective, queue.taskcluster.net "works" from !AWS, but
> is very unusable from AWS, due to a cost-saving system for which we
> personally see no benefit, and I suspect you see little benefit from (for
> logs specifically) either.

It's actually quite usable from within AWS; we use this system for all of our builds and tests to fetch any artifacts which they use. It's not just a cost-savings system, rather it's a performance-raising system. Instead of doing the slow inter-region transfers for each task, we have a small overhead on the first transfer or two, in return for having nearly-zero transfer times thereafter. If you're interested, we usually have an 80+% cache-hit rate on resources cached with this system.

Logs are not treated in any special way as far as the Queue (the relevant system here) is concerned. I really don't want to get into special-casing logs, either.

If you'd really like to work around cloud-mirror right now, you could do this:

1. HEAD request for the artifact on queue.taskcluster.net
2. Grab the Location: header from the previous request
3. The cloud-mirror route is https://cloud-mirror.taskcluster.net/v1/redirect/s3/us-east-1/<urlEncodedUpstreamUrl>
4. decodeURIComponent(urlEncodedUpstreamUrl)
5. GET the result of 4.

This is not at all a supported method... If you chose to implement this, it'd be wise to have it fall back to the correct API. This is a bad idea that I'm mentioning to be complete, and is not a recommendation.

Here's a sample implementation using async/await and request-promise in node. Provided, again, only to be complete and absolutely not even slightly recommended. Expect this to break at literally any time.

======================================================================
var _url = require('url');
var request = require('request-promise');

async function main(url) {
  // HEAD the queue artifact URL without following the redirect, so we can
  // inspect the Location header that points at cloud-mirror.
  let response = await request.head({
    uri: url,
    followRedirect: false,
    simple: false,
    resolveWithFullResponse: true,
  });

  // The last path segment of the cloud-mirror URL is the URL-encoded
  // upstream (canonical bucket) URL; decode it to get the direct S3 URL.
  let segments = _url.parse(response.headers.location).pathname.split('/');
  let s3Location = decodeURIComponent(segments[segments.length - 1]);

  console.log(JSON.stringify({location: s3Location, outcome: 'success'}));
}

main(process.argv[2]).catch(err => {
  console.log(JSON.stringify({err: err.stack || err, outcome: 'failure'}));
  process.exit(1);
});
======================================================================
(In reply to Ed Morley [:emorley] from comment #39)
> Given logs are typically only accessed once (particularly given the caching
> period is only 24 hours), can we just skip cloud-mirror for them? I don't
> see how it's really saving much money.

Jonas, how do you feel about having a skip-cloud-mirror option or endpoint that's protected by a scope that we only give to Treeherder? I think that's better to implement because it's a general solution to the problem, rather than doing some sort of special casing based on log name and then having to adjust that every time. If we do this, we should be very hesitant to give out this scope. Maybe we should also be using the CDN URL for this instead of S3 directly? I'm not sure off hand.
Flags: needinfo?(jopsen)
I'm also curious to see what the effects of multi-consumer would be. The main bottleneck in this system is actually the inter-region transfer time and not anything else. Ed, talking directly with S3 is not going to reduce the transfer times much I suspect, aside from the 30s case where there weren't enough copy nodes.
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #40)
> It's actually quite usable from within AWS, we use this system for all of
> our builds and tests to fetch any artifacts which they use.

For artifacts that are (a) read many times and (b) large, then yes, there is definitely value.

My previous comment was referencing the log parsing use case specifically. I don't believe there are any other consumers of these logs apart from Treeherder, and each log is only read once (well, twice if we include Treeherder stage, but it will request it at about the same time, so I'm guessing it results in no net saving in transfer time).

> If you'd really like to work around cloud-mirror right now, you could do
> this:

Thank you for this suggestion, though I agree it might be good to settle on something slightly less fragile :-)

(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #42)
> Ed, talking directly with S3 is not going to reduce the transfer times much
> I suspect, aside from the 30s case where there weren't enough copy nodes.

The transfer times aren't the problem at all. It's the 30s case that is the sole cause of our issues.

Thank you again for your continued help with this!
(In reply to John Ford [:jhford] CET/CEST Berlin Time from comment #40)
> Logs are not treated in any special way as far as the Queue (the relevant
> system here) is concerned. I really don't want to get into special casing
> logs, either.

Why, out of curiosity?

I think the main issue here is the design assumption that all artifacts are equal regardless of size/type/intended consumer, and that we have to treat log files the same as 1GB images.

As for what means we use to differentiate (file type, location, size, special header, alternative access URL), I'm happy to leave that up to you - but I think any other solution (like increasing the instance size or quantity of the cloud-mirror copier nodes) misses the point that for the log file use case, cloud-mirror adds no value, only complexity and points of failure.

Of course, if there's some other consumer or consideration for logs that I'm not aware of, please do say :-)
> how do you feel about having a skip-cloud-mirror option or endpoint that's
> protected by a scope that we only give to treeherder?

I would really prefer to avoid any special hacks based on file name etc.

@jhford, perhaps the better thing to do is to bypass cloud-mirror for small files - either on the cloud-mirror side (as it does a HEAD request, right?), or on the queue side if we recorded file size. Or we could implement the two-SQS-queues approach for cloud-mirror; that would help too. But perhaps cloud-mirror should just redirect small artifacts to the origin region/bucket.
Flags: needinfo?(jopsen)
I think we need to do something to address Ed's concerns, as they are causing hardship for others by delaying job status and eventually closing trees.

So we have a class of users downloading artifacts who are latency-sensitive (Treeherder, and also users with browsers), and a class of users who are bandwidth-sensitive (mostly other tasks). These users have some patterns we might exploit: the latency-sensitive users tend to be grabbing small artifacts, while bandwidth-sensitivity is really only an issue for large artifacts. Likewise, latency-sensitive users tend to hit artifacts that are only downloaded once, while bandwidth-sensitive users hit artifacts which are downloaded multiple times.

We can exploit the first difference as Jonas has suggested: do not mirror small artifacts, and just redirect to their origin region. This leaves the definition of "small" somewhat vague, and our log files do tend to get rather large (100MB).

We can exploit the second difference by responding with a redirect to the origin for the first request to an object while triggering a copy, and only redirecting to the copy when it is complete. Here "first" should be considered as some distributed-system-friendly approximation.

We could also just leave it to the user, adding an extra header that says "I am latency sensitive" which would cause a redirect to the origin every time.
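A sketch of that second approach (redirect the first requester to the origin while the copy happens in the background); the cache interface and enqueueCopy function here are hypothetical, not an existing API:

======================================================================
// cache.get(url) resolves to the in-region URL if a copy exists, else null;
// enqueueCopy(url) queues the inter-region copy without blocking on it.
async function resolveArtifact(upstreamUrl, cache, enqueueCopy) {
  let cached = await cache.get(upstreamUrl);
  if (cached) {
    return cached;       // later requests get the in-region copy
  }
  // Approximately the "first" request: start the copy, but answer the
  // latency-sensitive caller immediately with the origin URL.
  await enqueueCopy(upstreamUrl);
  return upstreamUrl;
}
======================================================================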
Ah, much more clearly framed than my comments - thank you :-)

(In reply to Dustin J. Mitchell [:dustin] from comment #46)
> We can exploit the first difference as Jonas has suggested: do not mirror
> small artifacts, and just redirect to their origin region. This leaves the
> definition of "small" somewhat vague, and our log files do tend to get
> rather large (100MB).

All of the logs should be gzipped, so the actual file size is typically <10MB in many cases, IIRC?
(In reply to Dustin J. Mitchell [:dustin] from comment #46)
> I think we need to do something to address Ed's concerns, as they are
> causing hardship for others by delaying job status and eventually closing
> trees.

I think that if this could close trees, it's probably some bit of measurement we want to do even on the TaskCluster side: to say that a job status should be updated within Treeherder within X seconds of job resolution, especially if a 30 second delay here would cause a tree closure. This could just as well be something happening within taskcluster-treeherder that delays the job status update.

> So we have a class of users downloading artifacts which are
> latency-sensitive (treeherder, and also users with browsers);

The solution to reducing latency for users with browsers might be different here, since they will not be going through cloud-mirror (unless they are in EC2), and they will also need to be aware of needing to set headers to avoid any caching/copy middle layer - unless we base it on the user agent or something as well, to know that the request is coming from a browser.

> We could also just leave it to the user, adding an extra header that says "I
> am latency sensitive" which would cause a redirect to the origin every time.

I do like the idea of doing something like this, especially for files that do not need to be transferred and stored in another bucket (such as log files). It seems like a lot of overhead for a one-time artifact like this.

------------------------------

On a different note, I wanted to understand the frequency at which this happens, so I used one of the New Relic links from earlier to see the typical response times for queue.taskcluster.net. Looking back in August, it doesn't seem like there were any spikes >5k ms, so I think I might be looking at the wrong thing. I did notice some spikes in the 20k+ ms range in July.

Also to note, it appears that cloud-mirror has made some of this a lot better (specifically the timeout), because as I understand it, s3-copy-proxy had a 90 second timeout. There's still some work to do here to make things better. John has some PRs to improve copy speeds/times.
It's been pointed out to me that I missed the part where having any delay in log processing delays all other log processing. That's definitely bad. Sorry that I missed that point originally.
(In reply to Ed Morley [:emorley] from comment #43)
> For artifacts that are (a) read many times, (b) large, then yes there is
> definitely value.
>
> My previous comment was referencing the log parsing use case specifically. I
> don't believe there are any other consumers of these logs apart from
> Treeherder, and each log is only read once (well twice if we include
> Treeherder stage, but it will request it about the same time, so I'm
> guessing results in no net saving in transfer time).

Special-casing things leads to really unmaintainable code on our side. The concern I have is that we could add this special case for logs, but then we'll need a special case for system X, then another for system Y.

The reason for the latency spikes is that we're running out of copier nodes. I have a patch for that which is awaiting review, but that should be ready to go pretty soon. Even for small files or files that are used once, the potential performance penalty of cloud-mirror, when it has enough copy nodes, should be negligible. The time to copy has three main elements:

1. time to get a copy node
2. time to do the inter-region transfer
3. time to do the intra-region transfer

The first should go away entirely once my patch lands. The second will be the same whether you fetch from S3 or cloud-mirror fetches from S3. The third is negligible. If we can use the same pipeline for all artifacts, we can avoid the complexities of having special pipelines for one style of artifact vs. another.

> Thank you for this suggestion, though I agree it might be good to settle on
> something slightly less fragile :-)

I'm very relieved ;)

(In reply to Ed Morley [:emorley] from comment #44)
> I think the main issue here is the assumption in the design that all
> artifacts are equal regardless of size/type/intended consumer, and that we
> have to treat log files the same as 1GB images.

I made the assumption that all artifacts should be equal because I believe it to be true. The root of the issue here is that we're running out of copy nodes. One of the strategies we've discussed to avoid this is bucketing requests into a small-file/low-latency-demand bucket and a large-file/high-latency-works bucket. With this, we can keep the general approach for the actual actions taken, but make optimizations about how to order that work. Let's see how things change when we have the same bucket but more copiers deployed!

> As for what means we use to differentiate (file type, location, size,
> special header, alternative access URL) I'm happy to leave up to you - but I
> think any other solution (like increasing instance size or quantity of the
> cloud-mirror copier nodes) is missing the point that for the log file use
> case, cloud-mirror adds no value, only complexity and points of failure.

Whether using cloud-mirror for all artifacts adds or removes complexity is a matter of perspective. I think a concern here, though, is that a delay in processing a single log is able to bring the whole log ingestion down. Even if you're talking to S3 directly, I have data to show that us-west-2 to us-east-1 often runs as slow as 0.25kb/s. At that speed, a 10mb file might itself take longer than 30s just to transfer.

> Of course, if there's some other consumer or consideration for logs that I'm
> not aware of, please do say :-)

I'm always open to having a chat on Vidyo or equivalent about this. Maybe there are architectural choices in the Treeherder system which could be revisited now that you've changed the deployment environment significantly.
One idea that I had: instead of shipping *every* log across the continent, which has transfer time and cost penalties, you could always do the log processing in us-west-2 and ship only the result of the processing across. I think this is actually the best outcome here, because you would get significantly better performance out of it, assuming the result of log processing is smaller than the original log. I'm happy to help design something which would achieve this.
> Special casing things leads to really unmaintainable code on our side. The
> concern I have is that we could add this special case for logs, but then
> we'll need a special case for system X, then another for system Y.

I agree that in general things shouldn't be special-cased; however, there is a significant distinction between special-casing poorly chosen attributes (which is not what I was suggesting, hence "As for what means we use to differentiate ... I'm happy to leave up to you"), and having two distinct types of consumers that are not going to scope-creep into 3 or 4 types later:
* latency-sensitive
* bandwidth-sensitive

> One idea that I had was that instead of shipping *every* log across the
> continent, which will have transfer time and cost penalties, you could
> always do the log processing in us-west-2 and ship only the result of
> processing across. I think this is actually the best outcome here, because
> you will get significantly better performance out of this, if the result of
> log processing is smaller than the original log.

A few considerations:
* Log processing requires a bunch of DB access, so moving it to us-west would increase DB latency, unless all of Treeherder moved.
* Heroku doesn't offer us-west for the standard common runtime apps, only Enterprise private spaces (though we may be moving to the Enterprise plan soon) - and the latter has a few considerations over the common runtime. (Plus all of the other Taskcluster Heroku apps are us-east.)
* The current Treeherder log viewer routes the log chunk being viewed through the Treeherder API, so even if log processing were handled in the same region as the logs, there would be latency viewing the logs in the browser if the S3->S3 transfer is slow. (The new proposed log viewer skips the API entirely, however.)

One option (and something we want anyway) is to move more of Taskcluster to structured logging, since the structured error summaries it produces are much smaller (Treeherder only has to fetch the summary, not the full structured log, during processing). We would also need to omit the steps covered by the structured log from the raw log (currently difficult, since the structured log doesn't cover all of stdout/stderr), since Treeherder has to parse that too. (CC jgraham)

> I think a concern here, though, is that delay of processing a single log is
> able to bring the whole log ingestion down. Even if you're talking to S3
> directly, I have data to show that us-west-2 to us-east-1 often runs as slow
> as 0.25kb/s. At that speed, a 10mb file itself might be longer than 30s
> just to transfer.

This is orthogonal to whether we use cloud-mirror or not, though. Even if we do the above (ie: move more things to structured logging, so we have smaller files to transfer), any cloud-mirror copier-node issue still causes ingestion lag - and in fact will become a bigger component of the overall time. ie, time to transfer:
* with cloud-mirror: <up to 30s copier node delays> + <S3 west-east transfer time of a small file>
* without cloud-mirror: <S3 west-east transfer time of a small file>

If S3-S3 transfer times are regularly bad, we should also file issues against AWS.
(In reply to Ed Morley [:emorley] from comment #51)
> I agree in general things shouldn't be special-cased, however there is a
> significant distinction between special-casing poorly chosen attributes
> (which is not what I was suggesting), and having two distinct types of
> consumers that are not going to scope-creep into 3 or 4 types later:
> * latency-sensitive
> * bandwidth-sensitive

I agree that there are two classes, but the absolute long pole of cloud-mirror is the actual inter-region transfer time, in all cases where we aren't copy-node starved. Given that, adding complexity to handle these two classes differently, when the potential improvement to be had is so small, doesn't feel like the right thing to do.

> A few considerations:
> * Log processing requires a bunch of DB access, so moving it to us-west
> would increase DB latency, unless all of Treeherder moved.

Maybe the DB latency increase is worth the transfer time trade-off? Maybe you could do more of the analysis as part of the query, so that latency is less critical?

> * Heroku doesn't offer us-west for the standard common runtime apps, only
> Enterprise private spaces (though we may be moving to the Enterprise plan
> soon) - and the latter has a few considerations over the common runtime.
> (Plus all of the other Taskcluster Heroku apps are us-east.)

I would also *really* love to have this, for cloud-mirror itself! I just don't know when or if this is happening.

> * The current Treeherder log viewer routes the log chunk being viewed
> through the Treeherder API, so even if log processing were handled in the
> same region as the logs, there would be latency viewing the logs in the
> browser if the S3->S3 transfer is slow. (The new proposed log viewer skips
> the API entirely, however.)

Skipping the Treeherder API seems like the right solution to this problem to me. But in the meantime, doesn't this mean that we are using the us-east-1 copy of the artifact more than once? Doesn't that mean that we need to have these behind cloud-mirror anyway?

> One option (and something we want anyway) is to move more of Taskcluster to
> structured logging, since the structured error summaries it produces are
> much smaller (Treeherder only has to fetch the summary, not the full
> structured log, during processing).

That's a great idea!

> We would also need to omit the steps covered by the structured log from the
> raw log (currently difficult, since the structured log doesn't cover all of
> stdout/stderr), since Treeherder has to parse that too. (CC jgraham)

Maybe that logic could be moved into the job itself? That might make it easier for developers who wish to add new patterns to the parsing to self-serve. Maybe more of the Treeherder logic could move into the job itself? This model has worked really well for Taskcluster itself, and means that task-specific parsing code can live alongside the code that generates the logs that it tests. I suspect credentials and ACLs are what'd undermine that, though.

> This is orthogonal to whether we use cloud-mirror or not, though.
>
> Even if we do the above (ie: move more things to structured logging, so we
> have smaller files to transfer), any cloud-mirror copier-node issue still
> causes ingestion lag - and in fact will become a bigger component of the
> overall time.

Right. The lag introduced by cloud-mirror is a very well understood problem.
Without the lag waiting for a copy node, cloud-mirror being there makes no practical difference, and allows us to have no special casing.

> ie, time to transfer:
> * with cloud-mirror: <up to 30s copier node delays> + <S3 west-east transfer
> time of a small file>
> * without cloud-mirror: <S3 west-east transfer time of a small file>

With the multi-consumer cloud-mirror branch, there should not be issues with waiting for copy nodes.

> If S3-S3 transfer times are regularly bad, we should also file issues
> against AWS.

This is a known problem that is not unique to us. I'm pretty sure this is a constraint outside of Amazon's control, but it's something I am working on. Oregon and Virginia are quite far from each other; Oregon -> California copies run the same code path but are way faster.
The multi-consumer branch was deployed into production today and has drastically reduced the number of messages that sit around waiting. I've seen a 3-4x increase in message-processing concurrency in the load peaks today, and I've seen no messages that were waiting for a copy node. I'd like to mark this bug as fixed, if you agree. If this problem shows up again, let's use a new bug.
Sounds good to me! Thank you for your work on this :-)
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
As we discussed last night and today, this morning I changed the formation of copy nodes. The change was from 10 containers of 100 queue listeners each on one c4.8xlarge, to the same load split across one c4.8xlarge and two c4.4xlarges. We're no longer seeing any issues with message-processing concurrency, and the transfer speeds are back to where they were before. In the 3 hours since this change went live, the New Relic reported times are back to the 2-3s that they should be, with no spikes in that period.

I suspect that a couple of improvements are still possible (bug 1304697), but we're in a pretty good state here. I'm marking this bug RESOLVED FIXED because we're no longer starved for nodes and we're seeing good transfer times. The issue underlying this bug has been fully addressed, so if there are further issues, let's open a new bug.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Depends on: 1305768
Component: Integration → Services