Closed Bug 1284516 Opened 8 years ago Closed 8 years ago

Huge lag in funsize tasks on taskcluster causes partial updates to be available hours later than usual

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dbaron, Unassigned)

References

Details

(Keywords: regression)

Attachments

(1 file)

Updates for Windows mozilla-central nightly builds have been full updates rather than partial updates for a few days now. I definitely noticed this today (July 5), and I noticed it on a previous day this weekend, probably yesterday (July 4). padenot also noticed it when updating today.
Any idea what's going on here? This is still happening (updating from July 6 to July 7 builds).
Flags: needinfo?(rail)
Summary: starting around July 4, Windows nightly updates are full updates rather than partial → starting around July 4, Windows (32-bit) nightly updates are full updates rather than partial
Sorry for the tardy reply. I believe the main reason for this is that we generate partials after we publish completes. There is a lag until we generate and publish the corresponding partials, and it depends on the load in TaskCluster. I filed bug 1195365 ages ago, but haven't had a chance to look at it yet. :/
Depends on: 1195365
Flags: needinfo?(rail)
This actually affects all platforms and Aurora too.
Summary: starting around July 4, Windows (32-bit) nightly updates are full updates rather than partial → starting around July 4, updates are full updates rather than partial
I think it's better to move this over to Taskcluster in case it continues, so we can figure out why this has regressed so badly lately.
Component: General Automation → Scheduler
Product: Release Engineering → Taskcluster
QA Contact: catlee
Summary: starting around July 4, updates are full updates rather than partial → Huge lag in funsize tasks on taskcluster causes partial updates to be available hours later than usual
I can see a lot of worker-shutdowns. Dustin, are other tasks as heavily affected as funsize? I wonder if this is a general issue with AWS or specific to the funsize workers.
Flags: needinfo?(dustin)
It would depend on the instance types in use -- that generally occurs when the price goes up substantially. My understanding from yesterday was that this was an issue with the funsize tasks themselves, and not something we can fix by, e.g., adding more instances (as that would cause more Balrog contention).
Component: Scheduler → General
Flags: needinfo?(dustin)
I think we need more detailed info from Rail here. So far as I can see the Treeherder links do not show any failures regarding Balrog today.
Flags: needinfo?(rail)
As I said in comment #2, we should address the root cause, bug 1195365. We can try to minimize the lag, but there is no way we can get the partials immediately.
Flags: needinfo?(rail)
With bug 1195365 fixed we will have a longer delay for every update - depending on when during the funsize process the complete MARs are made available. I strongly assume that will happen only after all partials have been created for a given locale and platform. If funsize takes longer to produce partials, as we see here, we will all have to wait much longer before updates are offered and before we can test the new build. IMHO that's not what we want. As David has seen, and as I can second from our firefox-ui-update tests over the last couple of days, it takes a very long time before partials are available. So there has to be a reason why it takes that long, and that is exactly what we have to identify and fix. If it is not the Balrog issue mentioned above, what is it?
Looks like this got a lot better over the past few days. I don't really see the long lag anymore.
Rail, I think we can mark this fixed or WFM?
Flags: needinfo?(rail)
... or incomplete
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(rail)
Resolution: --- → INCOMPLETE
Let's reopen ;) because today I can see a huge lag again. Looking at the AWS provisioner stats in TC I can see 289 pending jobs for the custom funsize-balrog worker. Its max capacity is 32 tasks, which means we are piling up a huge backlog of already generated and signed partial updates that simply are not getting uploaded quickly enough.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Here is an example: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=b7f7ae14590aced450bb0b0469dfb38edd2c0ace&filter-searchStr=update
generating task: Started: Thu Sep 1, 15:30:39 - Ended: Thu Sep 1, 15:41:32
signing task: Started: Thu Sep 1, 15:42:57 - Ended: Thu Sep 1, 15:44:12
upload task: Requested: Thu Sep 1, 15:30:36 - Duration: Not started (queued for 67 minute(s))
The upload task takes about 3 minutes to run. Due to the number of partials for mozilla-aurora, we are still not done with uploading those partials yet. So nightlies for mozilla-central will still have to wait a bit before their partials get uploaded.
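(As an aside, one way to watch that backlog is to poll the TaskCluster queue's pending-count endpoint for the worker type. The sketch below is illustrative only: the queue base URL, provisioner ID, and the "pendingTasks" response field are assumptions from memory and may not match the actual deployment.)

```python
# Rough sketch: poll the TaskCluster queue for the number of pending tasks
# on the funsize-balrog worker type. The base URL, provisioner ID and the
# "pendingTasks" response field are assumptions, not verified against the
# deployment in this bug.
import time
import requests

QUEUE = "https://queue.taskcluster.net/v1"   # assumed public queue endpoint
PROVISIONER = "aws-provisioner-v1"           # assumed provisioner ID
WORKER_TYPE = "funsize-balrog"

def pending_count():
    url = f"{QUEUE}/pending/{PROVISIONER}/{WORKER_TYPE}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json().get("pendingTasks", 0)

if __name__ == "__main__":
    while True:
        print(f"pending funsize-balrog tasks: {pending_count()}")
        time.sleep(300)  # check every 5 minutes
```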
(In reply to Henrik Skupin (:whimboo) from comment #14)
> Let's reopen ;) because today I can see a huge lag again. Looking at the AWS
> provisioner stats in TC I can see 289 pending jobs for the custom
> funsize-balrog worker. Its max capacity is 32 tasks, which means we are
> piling up a huge backlog of already generated and signed partial updates
> that simply are not getting uploaded quickly enough.

Fixing this completely is not going to be easy because of the extremely spiky nature of the load. Even if the backend webapps can handle the load, there are still a lot of potential collisions in the database that cause retries. However, now that we've finished our transition to CloudOps infrastructure, I wonder if we can handle more load than before. Looks like we currently peak around 8 requests/second to the admin app, and write lag doesn't go beyond 50ms. Benson, do you think we could handle more concurrent traffic to the admin app without causing issues? Maybe we can try increasing it slowly until we find a sweet spot?
Flags: needinfo?(bwong)
> Benson, do you think we could handle more concurrent traffic

Yes. Lots of available capacity. I'm more concerned about the collisions and retries. I suppose it is TBD how that scales (linearly? exponentially?) and what its effect on the database is.
Flags: needinfo?(bwong)
(In reply to Benson Wong [:mostlygeek] from comment #18)
> > Benson, do you think we could handle more concurrent traffic
>
> Yes. Lots of available capacity.
> I'm more concerned about the collisions and retries. I suppose it is TBD how
> that scales (linearly? exponentially?) and what its effect on the database is.

Yeah. We should be able to track this pretty easily - collisions show up as 400 responses. I think the main metric we care about is how long it takes to get nightly changes into the database, which should line up with the spike of admin reqs/sec. If that shortens, we're probably in a better place, even if we have a larger number of 400s. I suspect that the length of the admin reqs/sec spike and the number of 400s will track pretty closely, though.

Rail, how about we try bumping this to 64 to start, and see how that goes?
Flags: needinfo?(rail)
(In reply to Ben Hearsum (:bhearsum) from comment #19)
> Rail, how about we try bumping this to 64 to start, and see how that goes?

This is done now. Let's see how it goes tomorrow.
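(For illustration, a minimal sketch of the kind of capacity change being discussed, assuming the worker-type definition is kept as a local JSON template. The template path and field layout are hypothetical, and actually pushing the updated definition to the AWS provisioner is done with provisioner tooling that is not shown here.)

```python
# Hypothetical sketch: bump maxCapacity in a local worker-type template.
# The template path is an assumption, not the actual funsize deployment file;
# uploading the new definition to the AWS provisioner is a separate step.
import json

TEMPLATE = "worker-types/funsize-balrog.json"  # assumed local template

with open(TEMPLATE) as f:
    definition = json.load(f)

old = definition.get("maxCapacity")
definition["maxCapacity"] = 64  # was 32; later in this bug reduced to 16

with open(TEMPLATE, "w") as f:
    json.dump(definition, f, indent=2, sort_keys=True)

print(f"maxCapacity: {old} -> {definition['maxCapacity']}")
```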
Flags: needinfo?(rail)
(In reply to Rail Aliiev [:rail] from comment #20)
> (In reply to Ben Hearsum (:bhearsum) from comment #19)
> > Rail, how about we try bumping this to 64 to start, and see how that goes?
>
> This is done now. Let's see how it goes tomorrow.

I think this helped a little bit. Yesterday morning's big spike in requests to the admin API started at 11:20 and ended at 16:20. Today's started at 11:20 and ended at 15:20. There was an increase in 400s, but nothing else spiked. 400s get retried, so that shouldn't be causing any issues. I'm not sure exactly what this means in terms of the delay between completes and partials being published, but it should be an improvement. We could try pushing the max workers a bit higher as well, maybe to 72 as the next step, but let's consider that next week.

No matter what we do here, I think it's important to prioritize https://bugzilla.mozilla.org/show_bug.cgi?id=1195365 at some point. If we truly care about making sure people get partials, the only real solution is to publish partials and completes at exactly the same time.
Shouldn't this be easier once we have the Nightly graph for Taskcluster builds, with the funsize tasks integrated into it afterward?
RyanVM pointed out that funsize submitter jobs are sometimes timing out now, e.g.: https://treeherder.mozilla.org/logviewer.html#?job_id=3514280&repo=mozilla-aurora#L220
This appears to be new since the 6th: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204281&entireHistory=true&tree=mozilla-aurora

It's really tough to reason about what will actually help here, so I'd like to try going the opposite way and reducing maxCapacity down to 16. Doing so should reduce the number of data races we're getting. It's possible that this will reduce the overall end-to-end time of partial generation; it's also possible it will increase it. Rail, Whimboo - what do you two think about giving this a try for a few days or a week?

Another thing that may help, regardless of maxCapacity, is to greatly reduce the sleep time between submission attempts. Sleeping for more than a second between attempts is really just a waste of time, because we know the errors aren't being caused by the Balrog service being down.
Flags: needinfo?(rail)
Flags: needinfo?(hskupin)
I'm not sure if that would be good to do now, just before the merge. If updates are ready later, we are more likely to see update failures when there are additional nightly builds. But I will let Rail decide, given that he knows the funsize and Balrog systems way better.
Flags: needinfo?(hskupin)
(In reply to Ben Hearsum (:bhearsum) from comment #23)
> It's really tough to reason about what will actually help here, so I'd like
> to try going the opposite way and reducing maxCapacity down to 16. Doing so
> should reduce the number of data races we're getting. It's possible that
> this will reduce the overall end-to-end time of partial generation; it's
> also possible it will increase it. Rail, Whimboo - what do you two think
> about giving this a try for a few days or a week?

I think that is the main idea of balrogworker, where you submit updates serially per blob.

> Another thing that may help, regardless of maxCapacity, is to greatly reduce
> the sleep time between submission attempts. Sleeping for more than a second
> between attempts is really just a waste of time, because we know the errors
> aren't being caused by the Balrog service being down.

Agreed. We can reduce maxCapacity and the sleep time, and increase the number of retries. Sounds like a plan!

Leaving the NI in place for now.
(In reply to Rail Aliiev [:rail] from comment #25)
> (In reply to Ben Hearsum (:bhearsum) from comment #23)
> > Another thing that may help, regardless of maxCapacity, is to greatly
> > reduce the sleep time between submission attempts. Sleeping for more than
> > a second between attempts is really just a waste of time, because we know
> > the errors aren't being caused by the Balrog service being down.
>
> Agreed. We can reduce maxCapacity and the sleep time, and increase the
> number of retries. Sounds like a plan!
>
> Leaving the NI in place for now.

OK. I'm not sure if it's safe to bump maxCapacity while so many workers are still running, so I'll do it later, once things have calmed down. The merge week has been pushed off by one week anyway, so we have no risk of impacting it. I'll put together a patch for the sleep time and number of retries, too.
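(For context, here is a minimal sketch of the kind of retry change being discussed: a short, fixed sleep between many attempts instead of long back-offs. The helper and the submission function are hypothetical stand-ins; the attempt count and sleep value are illustrative and not taken from the actual patch below.)

```python
# Illustrative sketch only: retry a Balrog submission with a short sleep
# between attempts and more attempts overall, instead of long back-offs.
# `submit_to_balrog` is a hypothetical stand-in for the real submission call.
import time

def retry(action, attempts=10, sleeptime=1, retry_exceptions=(Exception,)):
    """Call `action` up to `attempts` times, sleeping briefly between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_exceptions:
            if attempt == attempts:
                raise  # out of retries, propagate the last error
            time.sleep(sleeptime)  # ~1s: Balrog is up, collisions clear fast

def submit_to_balrog():
    # Placeholder for the real API request; a 400 (data collision) would be
    # raised as an exception here and retried by the helper above.
    raise NotImplementedError

# Example usage (would retry the placeholder until attempts are exhausted):
# retry(submit_to_balrog)
```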
Attached patch [checked-in] retry faster (deleted) — Splinter Review
Attachment #8790287 - Flags: review?(rail)
Flags: needinfo?(rail)
Attachment #8790287 - Flags: review?(rail) → review+
Comment on attachment 8790287 [details] [diff] [review]
[checked-in] retry faster

https://hg.mozilla.org/build/tools/rev/fd9b9b5c2df5

Rail, I think you need to rebuild the Docker image to make this live?
Attachment #8790287 - Attachment description: retry faster → [checked-in] retry faster
Flags: needinfo?(rail)
Regenerated the image, updated the template, deployed. We'll see the changes tomorrow.
Flags: needinfo?(rail)
(In reply to Ben Hearsum (:bhearsum) from comment #26)
> (In reply to Rail Aliiev [:rail] from comment #25)
> > (In reply to Ben Hearsum (:bhearsum) from comment #23)
> > > Another thing that may help, regardless of maxCapacity, is to greatly
> > > reduce the sleep time between submission attempts. Sleeping for more
> > > than a second between attempts is really just a waste of time, because
> > > we know the errors aren't being caused by the Balrog service being down.
> >
> > Agreed. We can reduce maxCapacity and the sleep time, and increase the
> > number of retries. Sounds like a plan!
> >
> > Leaving the NI in place for now.
>
> OK. I'm not sure if it's safe to bump maxCapacity while so many workers are
> still running, so I'll do it later, once things have calmed down. The merge
> week has been pushed off by one week anyway, so we have no risk of impacting
> it.

maxCapacity is now at 16. Let's see how things go tomorrow after all these changes.
Component: General → Balrog: Backend
Product: Taskcluster → Release Engineering
QA Contact: bhearsum
Component: Balrog: Backend → General Automation
QA Contact: bhearsum → catlee
With maxCapacity set to 16, it looks like the end-to-end time of partial submission took about 30 minutes longer than when it was set to 64. There were about the same number of 400s. It's hard to know how this has actually impacted the delay between complete and partial publishing - Henrik, do you have a sense of that?
Flags: needinfo?(hskupin)
https://bugzilla.mozilla.org/show_bug.cgi?id=1195365 is getting worked on now. When fixed, it will prevent completes from being published ahead of partials. That will fix this bug, but is unlikely to improve the retry situation.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I still hope that at some point updates will be available earlier, so we don't run into a race condition when multiple nightly builds on the same day overlap in their partial update generation.
Flags: needinfo?(hskupin)
This should finally be fixed now with the landing of bug 1195365.
Component: General Automation → General