Closed Bug 1284516 Opened 8 years ago Closed 8 years ago

Huge lag in funsize tasks on taskcluster causes partial updates to be available hours later than usual

Categories

(Release Engineering :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dbaron, Unassigned)

References

Details

(Keywords: regression)

Attachments

(1 file)

Updates for Windows mozilla-central nightly builds have been full updates rather than partial updates for a few days now. I definitely noticed this today (July 5), and I noticed it on a previous day this weekend, probably yesterday (July 4). padenot also noticed it when updating today.
Any idea what's going on here? This is still happening (updating from July 6 to July 7 builds).
Flags: needinfo?(rail)
Summary: starting around July 4, Windows nightly updates are full updates rather than partial → starting around July 4, Windows (32-bit) nightly updates are full updates rather than partial
Sorry for the tardy reply. I believe the main reason for this is that we generate partials after we publish completes. There is a lag until we generate and publish the corresponding partials, and it depends on the load in TaskCluster. I filed bug 1195365 ages ago, but haven't had a chance to look at it yet. :/
Depends on: 1195365
Flags: needinfo?(rail)
This actually affects all platforms and Aurora too.
Summary: starting around July 4, Windows (32-bit) nightly updates are full updates rather than partial → starting around July 4, updates are full updates rather than partial
I think it's better to move this over to Taskcluster in case it continues, so we can figure out why this has regressed so badly lately.
Component: General Automation → Scheduler
Product: Release Engineering → Taskcluster
QA Contact: catlee
Summary: starting around July 4, updates are full updates rather than partial → Huge lag in funsize tasks on taskcluster causes partial updates to be available hours later than usual
I can see a lot of worker-shutdowns. Dustin, are other tasks as heavily affected as funsize? I wonder if this is a general issue with AWS or specific to the funsize workers.
Flags: needinfo?(dustin)
It would depend on the instance types in use -- that generally occurs when the price goes up substantially. My understanding from yesterday was that this was an issue with the funsize tasks themselves, and not something we can fix by, e.g., adding more instances (as that would cause more Balrog contention).
Component: Scheduler → General
Flags: needinfo?(dustin)
I think we need more detailed info from Rail here. So far as I can see the Treeherder links do not show any failures regarding Balrog today.
Flags: needinfo?(rail)
As I said in comment #2, we should address the root cause, bug 1195365. We can try to minimize the lag, but there is no way we can get the partials immediately.
Flags: needinfo?(rail)
With bug 1195365 fixed we will have a longer delay for every update - depending on when during the funsize process the complete MARs are made available. I strongly assume that will happen only after all partials have been created for a given locale and platform. If funsize takes longer to produce partials, as we see here, we will all have to wait much longer before updates are offered and before we can test the new build. IMHO that's not what we want. As David has seen, and as I can second from our firefox-ui-update tests over the last couple of days, it takes a very long time before partials are available. So there has to be a reason why it takes that long, and that is exactly what we have to identify and fix. If it is not the Balrog issue mentioned above, what is it?
Looks like this got a lot better over the past few days. I don't really see the long lag anymore.
Rail, I think we can mark this fixed or WFM?
Flags: needinfo?(rail)
... or incomplete
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(rail)
Resolution: --- → INCOMPLETE
Let's reopen ;) because today I can see a huge lag again. Looking at the AWS provisioner stats in TC I can see 289 pending jobs for the custom funsize-balrog worker. Its max capacity is 32 tasks, which means we are piling up a huge backlog of already generated and signed partial updates that simply are not getting uploaded quickly enough.
Status: RESOLVED → REOPENED
Resolution: INCOMPLETE → ---
Here is an example: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=b7f7ae14590aced450bb0b0469dfb38edd2c0ace&filter-searchStr=update
generating task: Started: Thu Sep 1, 15:30:39 - Ended: Thu Sep 1, 15:41:32
signing task: Started: Thu Sep 1, 15:42:57 - Ended: Thu Sep 1, 15:44:12
upload task: Requested: Thu Sep 1, 15:30:36 - Duration: Not started (queued for 67 minute(s))
The upload task takes about 3 minutes to run. Due to the number of partials for mozilla-aurora, we are still not done with uploading those partials yet. So nightlies for mozilla-central will still have to wait a bit before their partials get uploaded.
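(As an aside, one way to watch that backlog is to poll the TaskCluster queue's pending-count endpoint for the worker type. The sketch below is illustrative only: the queue base URL, provisioner ID, and the "pendingTasks" response field are assumptions from memory and may not match the actual deployment.)

```python
# Rough sketch: poll the TaskCluster queue for the number of pending tasks
# on the funsize-balrog worker type. The base URL, provisioner ID and the
# "pendingTasks" response field are assumptions, not verified against the
# deployment in this bug.
import time
import requests

QUEUE = "https://queue.taskcluster.net/v1"   # assumed public queue endpoint
PROVISIONER = "aws-provisioner-v1"           # assumed provisioner ID
WORKER_TYPE = "funsize-balrog"

def pending_count():
    url = f"{QUEUE}/pending/{PROVISIONER}/{WORKER_TYPE}"
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.json().get("pendingTasks", 0)

if __name__ == "__main__":
    while True:
        print(f"pending funsize-balrog tasks: {pending_count()}")
        time.sleep(300)  # check every 5 minutes
```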
(In reply to Henrik Skupin (:whimboo) from comment #14)
> Let's reopen ;) because today I can see a huge lag again. Looking at the AWS
> provisioner stats in TC I can see 289 pending jobs for the custom
> funsize-balrog worker. Its max capacity is 32 tasks, which means we are
> piling up a huge backlog of already generated and signed partial updates
> that simply are not getting uploaded quickly enough.

Fixing this completely is not going to be easy because of the extremely spiky nature of the load. Even if the backend webapps can handle the load, there are still a lot of potential collisions in the database that cause retries. However, now that we've finished our transition to CloudOps infrastructure, I wonder if we can handle more load than before. Looks like we currently peak around 8 requests/second to the admin app, and write lag doesn't go beyond 50ms. Benson, do you think we could handle more concurrent traffic to the admin app without causing issues? Maybe we can try increasing it slowly until we find a sweet spot?
Flags: needinfo?(bwong)
> Benson, do you think we could handle more concurrent traffic

Yes. Lots of available capacity. I'm more concerned about the collisions and retries. I suppose it is TBD how that scales (linearly? exponentially?) and what its effect on the database is.
Flags: needinfo?(bwong)
(In reply to Benson Wong [:mostlygeek] from comment #18)
> > Benson, do you think we could handle more concurrent traffic
>
> Yes. Lots of available capacity.
> I'm more concerned about the collisions and retries. I suppose it is TBD how
> that scales (linearly? exponentially?) and what its effect on the database is.

Yeah. We should be able to track this pretty easily - collisions show up as 400 responses. I think the main metric we care about is how long it takes to get nightly changes into the database, which should line up with the spike of admin reqs/sec. If that shortens, we're probably in a better place, even if we have a larger number of 400s. I suspect that the length of the admin reqs/sec spike and the number of 400s will track pretty closely, though.

Rail, how about we try bumping this to 64 to start, and see how that goes?
Flags: needinfo?(rail)
(In reply to Ben Hearsum (:bhearsum) from comment #19)
> Rail, how about we try bumping this to 64 to start, and see how that goes?

This is done now. Let's see how it goes tomorrow.
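(For illustration, a minimal sketch of the kind of capacity change being discussed, assuming the worker-type definition is kept as a local JSON template. The template path and field layout are hypothetical, and actually pushing the updated definition to the AWS provisioner is done with provisioner tooling that is not shown here.)

```python
# Hypothetical sketch: bump maxCapacity in a local worker-type template.
# The template path is an assumption, not the actual funsize deployment file;
# uploading the new definition to the AWS provisioner is a separate step.
import json

TEMPLATE = "worker-types/funsize-balrog.json"  # assumed local template

with open(TEMPLATE) as f:
    definition = json.load(f)

old = definition.get("maxCapacity")
definition["maxCapacity"] = 64  # was 32; later in this bug reduced to 16

with open(TEMPLATE, "w") as f:
    json.dump(definition, f, indent=2, sort_keys=True)

print(f"maxCapacity: {old} -> {definition['maxCapacity']}")
```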
Flags: needinfo?(rail)
(In reply to Rail Aliiev [:rail] from comment #20)
> (In reply to Ben Hearsum (:bhearsum) from comment #19)
> > Rail, how about we try bumping this to 64 to start, and see how that goes?
>
> This is done now. Let's see how it goes tomorrow.

I think this helped a little bit. Yesterday morning's big spike in requests to the admin API started at 11:20 and ended at 16:20. Today's started at 11:20 and ended at 15:20. There was an increase in 400s, but nothing else spiked. 400s get retried, so that shouldn't be causing any issues. I'm not sure exactly what this means in terms of the delay between completes and partials being published, but it should be an improvement. We could try pushing the max workers a bit higher as well, maybe to 72 as the next step, but let's consider that next week.

No matter what we do here, I think it's important to prioritize https://bugzilla.mozilla.org/show_bug.cgi?id=1195365 at some point. If we truly care about making sure people get partials, the only real solution is to publish partials and completes at exactly the same time.
Shouldn't this be easier once we have the Nightly graph for Taskcluster builds, with the funsize tasks integrated into it afterward?
RyanVM pointed out that funsize submitter jobs are sometimes timing out now, e.g.: https://treeherder.mozilla.org/logviewer.html#?job_id=3514280&repo=mozilla-aurora#L220
This appears to be new since the 6th: https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1204281&entireHistory=true&tree=mozilla-aurora

It's really tough to reason about what will actually help here, so I'd like to try going the opposite way and reducing maxCapacity down to 16. Doing so should reduce the number of data races we're getting. It's possible that this will reduce the overall end-to-end time of partial generation; it's also possible it will increase it. Rail, Whimboo - what do you two think about giving this a try for a few days or a week?

Another thing that may help, regardless of maxCapacity, is to greatly reduce the sleep time between submission attempts. Sleeping for more than a second between attempts is really just a waste of time, because we know the errors aren't being caused by the Balrog service being down.
Flags: needinfo?(rail)
Flags: needinfo?(hskupin)
I'm not sure if that would be good to do now, just before the merge. If updates are ready later, we are more likely to see update failures when there are additional nightly builds. But I will let Rail decide, given that he knows the funsize and Balrog systems way better.
Flags: needinfo?(hskupin)
(In reply to Ben Hearsum (:bhearsum) from comment #23)
> It's really tough to reason about what will actually help here, so I'd like
> to try going the opposite way and reducing maxCapacity down to 16. Doing so
> should reduce the number of data races we're getting. It's possible that
> this will reduce the overall end-to-end time of partial generation; it's
> also possible it will increase it. Rail, Whimboo - what do you two think
> about giving this a try for a few days or a week?

I think that is the main idea of balrogworker, where you submit updates serially per blob.

> Another thing that may help, regardless of maxCapacity, is to greatly reduce
> the sleep time between submission attempts. Sleeping for more than a second
> between attempts is really just a waste of time, because we know the errors
> aren't being caused by the Balrog service being down.

Agreed. We can reduce maxCapacity and the sleep time, and increase the number of retries. Sounds like a plan!

Leaving the NI in place for now.
(In reply to Rail Aliiev [:rail] from comment #25)
> (In reply to Ben Hearsum (:bhearsum) from comment #23)
> > Another thing that may help, regardless of maxCapacity, is to greatly
> > reduce the sleep time between submission attempts. Sleeping for more than
> > a second between attempts is really just a waste of time, because we know
> > the errors aren't being caused by the Balrog service being down.
>
> Agreed. We can reduce maxCapacity and the sleep time, and increase the
> number of retries. Sounds like a plan!
>
> Leaving the NI in place for now.

OK. I'm not sure if it's safe to bump maxCapacity while so many workers are still running, so I'll do it later, once things have calmed down. The merge week has been pushed off by one week anyway, so we have no risk of impacting it. I'll put together a patch for the sleep time and number of retries, too.
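(For context, here is a minimal sketch of the kind of retry change being discussed: a short, fixed sleep between many attempts instead of long back-offs. The helper and the submission function are hypothetical stand-ins; the attempt count and sleep value are illustrative and not taken from the actual patch below.)

```python
# Illustrative sketch only: retry a Balrog submission with a short sleep
# between attempts and more attempts overall, instead of long back-offs.
# `submit_to_balrog` is a hypothetical stand-in for the real submission call.
import time

def retry(action, attempts=10, sleeptime=1, retry_exceptions=(Exception,)):
    """Call `action` up to `attempts` times, sleeping briefly between tries."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_exceptions:
            if attempt == attempts:
                raise  # out of retries, propagate the last error
            time.sleep(sleeptime)  # ~1s: Balrog is up, collisions clear fast

def submit_to_balrog():
    # Placeholder for the real API request; a 400 (data collision) would be
    # raised as an exception here and retried by the helper above.
    raise NotImplementedError

# Example usage (would retry the placeholder until attempts are exhausted):
# retry(submit_to_balrog)
```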
Attached patch [checked-in] retry faster (deleted) — Splinter Review
Attachment #8790287 - Flags: review?(rail)
Flags: needinfo?(rail)
Attachment #8790287 - Flags: review?(rail) → review+
Comment on attachment 8790287 [details] [diff] [review]
[checked-in] retry faster

https://hg.mozilla.org/build/tools/rev/fd9b9b5c2df5

Rail, I think you need to rebuild the Docker image to make this live?
Attachment #8790287 - Attachment description: retry faster → [checked-in] retry faster
Flags: needinfo?(rail)
Regenerated the image, updated the template, deployed. We'll see the changes tomorrow.
Flags: needinfo?(rail)
(In reply to Ben Hearsum (:bhearsum) from comment #26)
> (In reply to Rail Aliiev [:rail] from comment #25)
> > (In reply to Ben Hearsum (:bhearsum) from comment #23)
> > > Another thing that may help, regardless of maxCapacity, is to greatly
> > > reduce the sleep time between submission attempts. Sleeping for more
> > > than a second between attempts is really just a waste of time, because
> > > we know the errors aren't being caused by the Balrog service being down.
> >
> > Agreed. We can reduce maxCapacity and the sleep time, and increase the
> > number of retries. Sounds like a plan!
> >
> > Leaving the NI in place for now.
>
> OK. I'm not sure if it's safe to bump maxCapacity while so many workers are
> still running, so I'll do it later, once things have calmed down. The merge
> week has been pushed off by one week anyway, so we have no risk of impacting
> it.

maxCapacity is now at 16. Let's see how things go tomorrow after all these changes.
Component: General → Balrog: Backend
Product: Taskcluster → Release Engineering
QA Contact: bhearsum
Component: Balrog: Backend → General Automation
QA Contact: bhearsum → catlee
With maxCapacity set to 16, it looks like the end-to-end time of partial submission took about 30 minutes longer than when it was set to 64. There were about the same number of 400s. It's hard to know how this has actually impacted the delay between complete and partial publishing - Henrik, do you have a sense of that?
Flags: needinfo?(hskupin)
https://bugzilla.mozilla.org/show_bug.cgi?id=1195365 is getting worked on now. When fixed, it will prevent completes from being published ahead of partials. That will fix this bug, but is unlikely to improve the retry situation.
Status: REOPENED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
I still hope that at some point updates will be available earlier, so we don't run into a race condition when multiple nightly builds on the same day overlap in their partial update generation.
Flags: needinfo?(hskupin)
This should finally be fixed now with the landing of bug 1195365.
Component: General Automation → General