1223872 - Single locale release updates should not race each other

Reporter

Description

•

9 years ago

We see a lot of races in funsize similar to this:

1) locale A requests data_version
2) locale B requests data_version
3) locale B updates the release blob
4) locale A fails to update the release blob

The two locales probably don't update the same data in the blob, so we can probably find a workaround for this.
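The sequence above can be reproduced with a toy in-memory model of optimistic locking. This is only a sketch: `ReleaseStore`, `OutdatedDataError`, the blob layout and the locale data are hypothetical names made up for illustration, not Balrog's actual code or schema.

```python
class OutdatedDataError(Exception):
    """Raised when a write is based on a stale data_version."""


class ReleaseStore:
    """Toy stand-in for a release row guarded by data_version (hypothetical)."""

    def __init__(self, blob):
        self.blob = blob
        self.data_version = 1

    def get_data_version(self):
        return self.data_version

    def update(self, locale, locale_data, old_data_version):
        # Optimistic locking: reject the write if someone else wrote first.
        if old_data_version != self.data_version:
            raise OutdatedDataError(
                f"expected data_version {self.data_version}, got {old_data_version}"
            )
        self.blob["locales"][locale] = locale_data
        self.data_version += 1


store = ReleaseStore({"locales": {}})
version_seen_by_a = store.get_data_version()  # 1) locale A requests data_version
version_seen_by_b = store.get_data_version()  # 2) locale B requests data_version
store.update("de", {"buildID": "b"}, version_seen_by_b)  # 3) locale B updates the blob

try:
    store.update("fr", {"buildID": "a"}, version_seen_by_a)  # 4) locale A fails
    locale_a_succeeded = True
except OutdatedDataError:
    locale_a_succeeded = False
```

Locale A is rejected even though it only touches its own entry under `locales`, which is why a workaround looks feasible.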

Rail Aliiev [:rail]

Reporter

Comment 1

•

9 years ago

Attached patch balrog_metrics-tools-4.diff (obsolete) (deleted)

Let's start with some basic request metrics for now. We can walk the TC tasks using the index, fetch their logs, process with something like https://github.com/etsy/logster (would require some custom parser) and submit to graphite.
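For the last step of that pipeline, a sample can be formatted in Graphite's plaintext protocol, which is one `path value timestamp` line per sample. The sketch below only builds the line; the metric path `funsize.balrog.request_seconds` and the helper name `graphite_line` are made up for illustration and do not come from the attached patch.

```python
import time


def graphite_line(path, value, timestamp=None):
    """Format one sample as 'path value timestamp' plus a trailing newline."""
    if timestamp is None:
        timestamp = time.time()
    return f"{path} {value} {int(timestamp)}\n"


# Hypothetical metric name and value, with a fixed timestamp for reproducibility.
line = graphite_line("funsize.balrog.request_seconds", 12.5, timestamp=1447200000)
```

The resulting line would then be written to the carbon plaintext listener (port 2003 by default) over a TCP socket.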

Attachment #8687173 - Flags: review?(catlee)

Rail Aliiev [:rail]

Reporter

Updated

•

9 years ago

Assignee: nobody → rail

Rail Aliiev [:rail]

Reporter

Comment 2

•

9 years ago

BTW, this is an issue for l10n repacks as well now. It all suddenly started on Tue Nov 3, see https://treeherder.mozilla.org/#/jobs?repo=mozilla-aurora&revision=bc9c6e996006&filter-searchStr=l10n&exclusion_profile=false I wonder if it is somehow related to all the recent DB/webhead migrations...

Rail Aliiev [:rail]

Reporter

Comment 3

•

9 years ago

According to the webapp/db PHX1-SCL3 dashboard, aus migrated on Monday, Oct 26

Rail Aliiev [:rail]

Reporter

Comment 4

•

9 years ago

Attached patch balrog_metrics-tools-5.diff (deleted)

req.elapsed doesn't look reliable from what I see: it's usually less than a second, while the surrounding log lines are 10-15 secs apart. Not sure why; let's go with the old-school way.
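The "old school way" presumably means taking wall-clock timestamps around the whole call. Per the requests documentation, `elapsed` only measures the time between sending the request and finishing parsing the response headers, so it can be shorter than what the caller actually waits; whether that explains the gap seen here is not confirmed in this bug. A minimal sketch, where `timed_call` and `fake_submission` are hypothetical names and the sleep stands in for the real HTTP request:

```python
import time


def timed_call(func, *args, **kwargs):
    """Run func and return (result, wall-clock seconds the caller waited)."""
    start = time.monotonic()
    result = func(*args, **kwargs)
    return result, time.monotonic() - start


def fake_submission():
    # Placeholder for the real HTTP request to the admin API.
    time.sleep(0.05)
    return 200


status, seconds = timed_call(fake_submission)
print(f"status={status} elapsed={seconds:.3f}s")
```

`time.monotonic()` is used instead of `time.time()` so that system clock adjustments cannot skew the measurement.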

Attachment #8687173 - Attachment is obsolete: true

Attachment #8687173 - Flags: review?(catlee)

Attachment #8687210 - Flags: review?(catlee)

Chris AtLee [:catlee]

Updated

•

9 years ago

Attachment #8687173 - Attachment is obsolete: false

Chris AtLee [:catlee]

Updated

•

9 years ago

Attachment #8687210 - Flags: review?(catlee) → review+

Rail Aliiev [:rail]

Reporter

Comment 5

•

9 years ago

Comment on attachment 8687210 (balrog_metrics-tools-5.diff): https://hg.mozilla.org/build/tools/rev/58900072a047

Attachment #8687210 - Flags: checked-in+

Rail Aliiev [:rail]

Reporter

Updated

•

9 years ago

Depends on: 1224674

Rail Aliiev [:rail]

Reporter

Comment 6

•

9 years ago

Looks like this is related to the aus migration, which according to the dashboard happened on Oct 26. It starts with the push from the 25th (nightly builds on the 26th): https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=d53a52b39a95&filter-searchStr=update-%20balrog

The whole picture can be seen at this link: https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&filter-searchStr=update-&fromchange=f8283eaf5aad

We are interested in jobs ending with "u" (en-USu, 1.1u, 5u, etc.) with "HTTP 400" errors in the summary.

On Oct 29 we stopped funsize (see https://bugzilla.mozilla.org/show_bug.cgi?id=1219879#c3 and https://bugzilla.mozilla.org/show_bug.cgi?id=1220252#c3). It was turned back on Mon, Nov 2 (https://bugzilla.mozilla.org/show_bug.cgi?id=1220857). Nov 2 and 3 were terrible: tons of balrog submission errors.

Sounds like we degraded after the migration, but I'm not sure where: networking, db, TC-to-VPN delays...

Rail Aliiev [:rail]

Reporter

Updated

•

9 years ago

Attachment #8687173 - Attachment is obsolete: true

Shyam Mani [:fox2mike]

Updated

•

9 years ago

See Also: → https://bugzilla.mozilla.org/show_bug.cgi?id=1224698

Rail Aliiev [:rail]

Reporter

Comment 7

•

9 years ago

Back to the pool - I don't think that I can look at this until after Mozlando.

Assignee: rail → nobody

Comment hidden (Intermittent Failures Robot)

bhearsum@mozilla.com (:bhearsum)

Comment 10

•

8 years ago

Varun is actively working on this!

Assignee: nobody → varunj.1011

[github robot]

Comment 11

•

8 years ago

Commit pushed to master at https://github.com/mozilla/balrog

https://github.com/mozilla/balrog/commit/18047698e064bff5af1459323fe943daf6c5175a
bug 1223872: merge blob updates on server when safe to do so (#93). r=bhearsum
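The idea of merging when safe can be illustrated with a generic three-way merge of nested dicts. This is a sketch of the general technique under stated assumptions, not the code from the commit: `merge_blobs`, the blob layout and the conflict rule are all hypothetical.

```python
import copy

_MISSING = object()  # sentinel for "key not present"


def merge_blobs(base, theirs, ours):
    """Three-way merge of nested dicts; raise ValueError on a real conflict.

    base:   blob at the data_version the client read
    theirs: blob currently stored on the server
    ours:   blob the client is trying to submit
    """
    merged = {}
    for key in set(base) | set(theirs) | set(ours):
        b = base.get(key, _MISSING)
        t = theirs.get(key, _MISSING)
        o = ours.get(key, _MISSING)
        if t == o:
            value = t  # both sides agree
        elif o == b:
            value = t  # only the server copy changed
        elif t == b:
            value = o  # only the client changed
        elif isinstance(t, dict) and isinstance(o, dict):
            # Both sides changed a nested dict: merge one level down.
            value = merge_blobs(b if isinstance(b, dict) else {}, t, o)
        else:
            raise ValueError(f"conflicting changes to {key!r}")
        if value is not _MISSING:
            merged[key] = copy.deepcopy(value)
    return merged


# Hypothetical blobs: two locales updated independently from the same base.
base = {"name": "example-release", "locales": {}}
theirs = {"name": "example-release", "locales": {"de": {"buildID": "b"}}}
ours = {"name": "example-release", "locales": {"fr": {"buildID": "a"}}}
merged = merge_blobs(base, theirs, ours)
```

In this example both locale entries survive, while two different changes to the same leaf value would raise `ValueError` and leave the caller to fall back to rejecting the stale write.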

Comment hidden (Intermittent Failures Robot)

Nick Thomas [:nthomas] (UTC+12)

Comment 14

•

8 years ago

This is in production I think, but we still hit some issues around old_data_version, e.g. over the weekend:

https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=bbb29a9b88dd680dbb59577cbe4dc6e58d117100&filter-searchStr=l10n&exclusion_profile=false&selectedJob=4293214
https://treeherder.mozilla.org/logviewer.html#?job_id=4293214&repo=mozilla-central#L24037

Could you investigate, Varun?

Varun Joshi (:vjoshi)

Assignee

Comment 15

•

8 years ago

(In reply to Nick Thomas [:nthomas] from comment #14)
> This is in production I think, but we still hit some issues around
> old_data_version. eg over the weekend
>
> eg
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=bbb29a9b88dd680dbb59577cbe4dc6e58d117100&filter-searchStr=l10n&exclusion_profile=false&selectedJob=4293214
> https://treeherder.mozilla.org/logviewer.html#?job_id=4293214&repo=mozilla-central#L24037
>
> Could you investigate Varun ?

Yes, I'm on it!

bhearsum@mozilla.com (:bhearsum)

Comment 16

•

8 years ago

(In reply to Nick Thomas [:nthomas] from comment #14)
> This is in production I think, but we still hit some issues around
> old_data_version. eg over the weekend
>
> eg
> https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=bbb29a9b88dd680dbb59577cbe4dc6e58d117100&filter-searchStr=l10n&exclusion_profile=false&selectedJob=4293214
> https://treeherder.mozilla.org/logviewer.html#?job_id=4293214&repo=mozilla-central#L24037
>
> Could you investigate Varun ?

It's a bit confusing right now actually. CloudOps is running code with this, but we haven't moved admin traffic over to them yet. The WebOps admin box is still running older code, so this is effectively not in production yet. We should be cutting over admin later this week, so hopefully we'll have this in production sometime next week.

Nick Thomas [:nthomas] (UTC+12)

Comment 17

•

8 years ago

That makes more sense. Perhaps we should update the webops admin soon, so that we're not changing hosting and code when we swap over to CloudOps.

Comment hidden (Intermittent Failures Robot)

bhearsum@mozilla.com (:bhearsum)

Comment 22

•

8 years ago

This landed in production a while ago. Thanks Varun!

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

BMO Automation

Updated

•

5 years ago

Product: Release Engineering → Release Engineering Graveyard

balrog_metrics-tools-4.diff, 9 years ago, Rail Aliiev [:rail] (deleted), patch
balrog_metrics-tools-5.diff, 9 years ago, Rail Aliiev [:rail] (deleted), patch, catlee: review+, rail: checked-in+