Closed Bug 1501167 Opened 6 years ago Closed 5 years ago

Rerunning a balrog submit results in duplicate partials information

Categories

(Release Engineering Graveyard :: Applications: Balrog (backend), enhancement)

RESOLVED MOVED

People

(Reporter: nthomas, Unassigned)

In Firefox 64.0b3 we reran some partial generation, signing, balrog submission, and beetmover (see bug 1501113). Rerunning balrog submission added new partial update data without removing the previous data, and we ended up with two "from": "Firefox-63.0b14-build1" blocks. merge_lists() should be smarter about this.
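To make the symptom concrete, here is a minimal sketch of how a list merge that only drops exact duplicates would keep both blocks. This is an illustration only, not Balrog's actual merge_lists(); `append_unique` and the `hashValue` values are made up for the example.

```python
def append_unique(old, new):
    """Hypothetical merge: keep old items, append new items not already present."""
    result = list(old)
    for item in new:
        if item not in result:  # equality compares the whole dict, not just "from"
            result.append(item)
    return result


# The rerun produced a different mar file, so the entry differs from the first
# one even though "from" is the same (hash values are placeholders).
first_run = [{"from": "Firefox-63.0b14-build1", "hashValue": "aaa"}]
rerun = [{"from": "Firefox-63.0b14-build1", "hashValue": "bbb"}]

partials = append_unique(first_run, rerun)
froms = [p["from"] for p in partials]
print(froms)
```

Both entries survive because they are unequal as whole dicts, which matches the two "from" blocks described above.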

It feels like maybe the right thing to do here is to return a 409 Conflict whenever we receive an update that would add multiple partials with the same "from" entry.

This would cause automation to fail the first time, and require someone to remove the old entry before re-running. Would this be a good solution and decrease confusion?
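A rough sketch of the check behind that idea, assuming partials are a list of dicts with a "from" key. `DuplicatePartialError` and `check_partials` are hypothetical names; the web layer would be responsible for turning the error into a 409 Conflict.

```python
from collections import Counter


class DuplicatePartialError(Exception):
    """Hypothetical error that an API layer could map to 409 Conflict."""


def check_partials(partials):
    """Raise if two or more partials share the same "from" value."""
    counts = Counter(p["from"] for p in partials)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    if duplicated:
        raise DuplicatePartialError(
            "duplicate partial 'from' entries: " + ", ".join(duplicated)
        )


# Passes silently: each "from" appears once.
check_partials([{"from": "Firefox-63.0b13-build1"},
                {"from": "Firefox-63.0b14-build1"}])
```

With this approach the rerun fails loudly instead of silently storing two entries, which is the trade-off discussed above.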

Flags: needinfo?(nthomas)
Flags: needinfo?(jlund)

From a practical point of view, if all the other balrog submission is complete then it's relatively straightforward to manually download the json, remove the old partial info, and resubmit it (manual work, we could make it a little safer with scripting). If other platforms were still running it would be necessary to wait until they were done, to avoid reverting all the automation submissions between the manual download and resubmit.

Given we submit all the partials plus the complete in a single job, what are the downsides to replacing a locale entry instead?
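For illustration, replacing a locale entry wholesale could look like the sketch below. It assumes a release blob shaped as platforms → locales → locale data with "partials" and "completes" lists; `replace_locale` is a hypothetical helper, and the platform name and hash values are placeholders.

```python
import copy


def replace_locale(blob, platform, locale, locale_data):
    """Return a copy of blob with one locale entry replaced, not merged."""
    updated = copy.deepcopy(blob)
    locales = (
        updated.setdefault("platforms", {})
        .setdefault(platform, {})
        .setdefault("locales", {})
    )
    locales[locale] = locale_data  # old partials/completes are dropped
    return updated


blob = {"platforms": {"WINNT_x86_64-msvc": {"locales": {"en-US": {
    "partials": [{"from": "Firefox-63.0b14-build1", "hashValue": "aaa"}],
    "completes": [{"from": "*", "hashValue": "ccc"}],
}}}}}

resubmitted = {
    "partials": [{"from": "Firefox-63.0b14-build1", "hashValue": "bbb"}],
    "completes": [{"from": "*", "hashValue": "ddd"}],
}
new_blob = replace_locale(blob, "WINNT_x86_64-msvc", "en-US", resubmitted)
```

Because the whole entry is swapped, a rerun cannot leave stale partials behind. This only works if each submission carries every partial and the complete for that locale, as described above.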

Flags: needinfo?(nthomas)

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #2)

> From a practical point of view, if all the other balrog submission is
> complete then it's relatively straightforward to manually download the json,
> remove the old partial info, and resubmit it (manual work, we could make it
> a little safer with scripting). If other platforms were still running it
> would be necessary to wait until they were done, to avoid reverting all the
> automation submissions between the manual download and resubmit.
>
> Given we submit all the partials plus the complete in a single job, what are
> the downsides to replacing a locale entry instead?

I think I'm a bit confused about the exact circumstances that trigger this bug. It sounds like we have instances where we have multiple different submissions for the same partial/complete, but with different data? And we hit this when they run at the same time and run the update merge code? And all of this only happens when we're retriggering in circumstances like bug 1501113?

In any case, it's possible to handle this in merge_lists, although it would be a hack (I think we'd have to make some guesses on whether or not we're in a list of partials/completes based on the shape of the data). Depending on exactly how this is happening, we might be able to find a better fix.
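A sketch of what such a shape-based guess might look like, assuming update lists are lists of dicts that all carry a "from" key. `looks_like_update_list` and `merge_lists_sketch` are hypothetical names; this is not the real merge_lists().

```python
def looks_like_update_list(items):
    """Guess from the data shape: a non-empty list of dicts that all have "from"."""
    return bool(items) and all(isinstance(i, dict) and "from" in i for i in items)


def merge_lists_sketch(old, new):
    """Merge two lists; for partials/completes, the new entry per "from" wins."""
    if looks_like_update_list(old) and looks_like_update_list(new):
        merged = {item["from"]: item for item in old}
        merged.update((item["from"], item) for item in new)
        return list(merged.values())
    # Generic fallback: keep old items and append anything not already present.
    result = list(old)
    for item in new:
        if item not in result:
            result.append(item)
    return result


old = [{"from": "Firefox-63.0b14-build1", "hashValue": "aaa"}]
new = [{"from": "Firefox-63.0b14-build1", "hashValue": "bbb"}]
merged = merge_lists_sketch(old, new)
```

The guess is the fragile part: any other list of dicts that happens to have a "from" key would be treated the same way, which is why this would be a hack.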

Flags: needinfo?(jlund)

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #3)

> I think I'm a bit confused about the exact circumstances that trigger this bug. It sounds like we have instances where we have multiple different submissions for the same partial/complete, but with different data? And we hit this when they run at the same time and run the update merge code? And all of this only happens when we're retriggering in circumstances like bug 1501113?

Comment #0 was a problem generating partials where we reran the generation and downstreams, meaning we had different mar files and wanted to replace data already in Balrog. Pretty uncommon thing to happen. Having looked through today's code I'm not sure how this happened, as addLocaleToRelease() appears to replace, and we really need that for Firefox-mozilla-central-nightly-latest.

(In reply to Nick Thomas [:nthomas] (UTC+13) from comment #4)

> (In reply to bhearsum@mozilla.com (:bhearsum) from comment #3)
>
> > I think I'm a bit confused about the exact circumstances that trigger this bug. It sounds like we have instances where we have multiple different submissions for the same partial/complete, but with different data? And we hit this when they run at the same time and run the update merge code? And all of this only happens when we're retriggering in circumstances like bug 1501113?
>
> Comment #0 was a problem generating partials where we reran the generation
> and downstreams, meaning we had different mar files and wanted to replace
> data already in Balrog. Pretty uncommon thing to happen. Having looked
> through today's code I'm not sure how this happened, as
> [addLocaleToRelease()](https://github.com/mozilla/balrog/blob/695f19fa6b7167b2b27beb65d3bfae5a21ff32dc/auslib/db.py#L2083)
> appears to replace, and we really need that for
> Firefox-mozilla-central-nightly-latest.

Hm, maybe we should add some logging code to make it possible to figure out what happened when it happens again. The scenario I described in comment #3 was the only one I could imagine hitting this, and it sounds like those weren't our circumstances when this was filed.
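As a rough idea of what such logging could capture, the sketch below records the "from" values before and after a change whenever the result contains duplicates. The logger name and `log_partials_change` are hypothetical, and this is not the code that was actually deployed.

```python
import logging

log = logging.getLogger("partials_debug")  # hypothetical logger name


def log_partials_change(locale, before, after):
    """Log a warning if the updated partials repeat a "from" value.

    Returns the sorted list of duplicated "from" values (empty if none).
    """
    before_froms = [p["from"] for p in before]
    after_froms = [p["from"] for p in after]
    duplicated = sorted({f for f in after_froms if after_froms.count(f) > 1})
    if duplicated:
        log.warning(
            "locale %s: duplicate partials for %s (before=%s, after=%s)",
            locale, duplicated, before_froms, after_froms,
        )
    return duplicated
```

Logging both the old and new lists makes it possible to tell afterwards whether the duplicate came from a merge or was already present in the submitted data.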

We've now got extra debugging statements for this in Balrog prod. The next time we hit it again I hope we'll be able to get enough information to make sense of it.

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #7)

> We've now got extra debugging statements for this in Balrog prod. The next
> time we hit it again I hope we'll be able to get enough information to make
> sense of it.

Looks like we hit this again with the 2019060222 nightlies, for at least some locales (eg: "tl"): https://treeherder.mozilla.org/#/jobs?repo=mozilla-central&revision=6d71d3ca012438d7eac6e8f9471e198a10eabc70

We hit this again last night in bug 1574404.

(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #9)

> We hit this again last night in bug 1574404.

Well, not exactly rerunning, but more like bug 1537710.

Taking a step back to look at what else could cause this, in bug 1537710.

Blocks: 1579125

We're going to burn down Balrog submissions soon, see https://github.com/mozilla-releng/balrog/issues/1049. We'll look into this more/ensure it's not an issue as part of that.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → MOVED
Product: Release Engineering → Release Engineering Graveyard