Closed Bug 1814727 Opened 2 years ago Closed 2 years ago

include attribution information in builds published to archive.mozilla.org

Categories

(Release Engineering :: Release Automation: Signing, enhancement, P2)

Tracking

(firefox113 fixed)

RESOLVED FIXED
Tracking Status
firefox113 --- fixed

People

(Reporter: bhearsum, Assigned: bhearsum)

References

(Blocks 1 open bug, Regressed 1 open bug)

Details

Attachments

(7 files, 1 obsolete file)

We use attribution information to help understand many parts of Firefox telemetry. At the moment, all attribution information is added at download time, through the attribution service (as long as the download was initiated through Bedrock + Bouncer).

One of the gaps of this system is that installers on archive.mozilla.org provide no attribution information.

It looks like this is fairly trivial to correct -- we just need to write attribution data into our installers at some point after signing them. There are two basic options we have here:

  1. If we're OK with Taskcluster artifacts having attribution information in them, the best place to do this is in signingscript, as part of the repackage-signing tasks (those are the ones that sign the installer).
  2. If we prefer that attribution information is only present in the builds on archive.mozilla.org, we could do it in beetmover instead.

If we want to get very granular, we could attribute builds that sit in different places differently (I'm thinking of Taskcluster artifacts, candidates directories, and releases directories, but I may be forgetting something).

Nick, a couple of questions for you:

  • Do you have an opinion on the above options?
  • Do you have a preference on what the attribution code looks like? Obviously we'd include source set to some reasonable value (eg: archive.mozilla.org) - but I'm not sure there's any need to include any of the other fields.

I made a hacky prototype of this over in https://github.com/mozilla-releng/scriptworker-scripts/compare/master...bhearsum:scriptworker-scripts:sentinel-attribution?expand=1, and it worked fine after a run through try. It needs some polish and runtime configuration added to it - but it proves that it's pretty easy to do anywhere where we have the installers on disk already.

Flags: needinfo?(nalexander)

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #0)

Nick, a couple of questions for you:

  • Do you have an opinion on the above options?

I don't think we can or should pursue granularity: release promotion is all about not making changes during the promotion process. I think that means we want our hashes to be consistent, so that we can easily say a candidate rc1 is the same as the Beta N or the Release .0.

I would like to look ahead to macOS attribution, which differs from Windows attribution. On macOS, the scheme I'm working towards will have the postSigningData as an extended attribute in the DMG, which is downstream of the signed builds. So whatever we do needs to be flexible enough to accommodate attribution in the Rpk tasks as well as the Bs tasks. Since we don't currently sign DMGs (and this scheme requires unsigned DMGs) the signing service can't be a single point of attribution.

These kinds of manipulations feel outside of the scope of Beetmover tasks, IMO. How would a regular user QA them without actually being able to move the beets?

  • Do you have a preference on what the attribution code looks like? Obviously we'd include source set to some reasonable value (eg: archive.mozilla.org) - but I'm not sure there's any need to include any of the other fields.

Looking at the currently accepted UTM parameters, I'd start with including:

  • source
  • dltoken, with either all 0s or a well-known sentinel GUID
  • possibly medium

We should defer to Marketing for what the source and medium should actually be, so that it fits their ontology; and we should defer to DS about what the dltoken should be (if present at all -- but I hope it is).
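As a rough illustration of the fields proposed above, here is a minimal sketch that assembles an attribution code from `source`, an optional `medium`, and a sentinel `dltoken`. It assumes, purely for illustration, that the code is a percent-encoded string of `key=value` pairs; the function name, the all-zero sentinel, and the example values are hypothetical and would still need sign-off from Marketing and DS.

```python
from urllib.parse import quote, urlencode

# Hypothetical sentinel: an all-zero GUID, one of the two options floated above.
SENTINEL_DLTOKEN = "00000000-0000-0000-0000-000000000000"


def build_attribution_code(source, medium=None, dltoken=SENTINEL_DLTOKEN):
    """Join the chosen fields as key=value pairs, then percent-encode the result.

    The encoding scheme is an assumption made for this sketch only.
    """
    fields = {"source": source}
    if medium is not None:
        fields["medium"] = medium
    if dltoken is not None:
        fields["dltoken"] = dltoken
    raw = urlencode(fields)  # e.g. "source=...&dltoken=..."
    return quote(raw, safe="")  # also encodes "=" and "&"


if __name__ == "__main__":
    print(build_attribution_code("archive.mozilla.org"))
```

Passing `dltoken=None` drops the token entirely, which covers the "if present at all" case.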

I made a hacky prototype of this over in https://github.com/mozilla-releng/scriptworker-scripts/compare/master...bhearsum:scriptworker-scripts:sentinel-attribution?expand=1, and it worked fine after a run through try. It needs some polish and runtime configuration added to it - but it proves that it's pretty easy to do anywhere where we have the installers on disk already.

Cool cool!

Flags: needinfo?(nalexander)

(In reply to Nick Alexander :nalexander [he/him] from comment #1)

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #0)

Nick, a couple of questions for you:

  • Do you have an opinion on the above options?

I don't think we can or should pursue granularity: release promotion is all about not making changes during the promotion process. I think that means we want our hashes to be consistent, so that we can easily say a candidate rc1 is the same as the Beta N or the Release .0.

That's a great point - thank you for raising it. This certainly rules out doing it in beetmover, and also reminds me that we need to make sure that the checksums files we publish will contain hashes that are calculated post-attribution. My cursory reading suggests that they should (they consume checksums that beetmover generates), but we should verify this assumption.
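One way to check that assumption by hand is to hash the attributed installer and compare the result against the published entry. The sketch below is only an illustration: the helper names are made up, and it assumes the published digest is available as a hex string for a known algorithm.

```python
import hashlib


def file_digest(path, algorithm="sha512", chunk_size=1024 * 1024):
    """Hash a file in chunks so large installers need not fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def checksum_matches(path, published_digest, algorithm="sha512"):
    """True if the file on disk hashes to the published (hex) digest."""
    return file_digest(path, algorithm) == published_digest.strip().lower()
```

If the published hash had been computed before attribution, `checksum_matches` would return False for the attributed installer, since writing the attribution data changes the file's bytes.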

I would like to look ahead to macOS attribution, which differs from Windows attribution. On macOS, the scheme I'm working towards will have the postSigningData as an extended attribute in the DMG, which is downstream of the signed builds. So whatever we do needs to be flexible enough to accommodate attribution in the Rpk tasks as well as the Bs tasks. Since we don't currently sign DMGs (and this scheme requires unsigned DMGs) the signing service can't be a single point of attribution.

I agree that it's ideal to keep all the attribution for all platforms in the same place. It sounds like macOS attribution can happen during repackage or anytime after. Windows attribution, on the other hand, can happen no earlier than repackage-signing (which is after repackage, and doesn't even run for macOS). The next task for both platforms is beetmover.

We could introduce a new task right before that, but it feels awfully wasteful just to download a bunch of builds to tweak a few bits and republish them. My instinct is to say that we ought to integrate this with repackage-signing and repackage for Windows and macOS respectively.

These kinds of manipulations feel outside of the scope of Beetmover tasks, IMO. How would a regular user QA them without actually being able to move the beets?

We do have the ability to post to the staging version of archive with separate beetmover workers on Try -- but this is probably a moot point since we're ruling out beetmover as a place for this anyways.

  • Do you have a preference on what the attribution code looks like? Obviously we'd include source set to some reasonable value (eg: archive.mozilla.org) - but I'm not sure there's any need to include any of the other fields.

Looking at the currently accepted UTM parameters, I'd start with including:

  • source
  • dltoken, with either all 0s or a well-known sentinel GUID
  • possibly medium

We should defer to Marketing for what the source and medium should actually be, so that it fits their ontology; and we should defer to DS about what the dltoken should be (if present at all -- but I hope it is).

Thank you, I'll nail these details down with the right parties.

Assignee: nobody → bhearsum

There's actually nothing partner-specific in this script, and it's about to be used for other types of attribution as well.

This allows us to easily append attributions, which is helpful when configuring this script in taskgraph. (You can set up some defaults, and then add others for specific jobs.)
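A minimal sketch of the "defaults plus appended extras" idea, using a repeatable argparse flag. The flag name, its three positional values, and the helper name are assumptions for illustration and are not taken from the actual script.

```python
import argparse


def collect_attributions(argv, defaults=()):
    """Return the default attributions followed by any given on the command line.

    Each attribution is an (input, output, code) triple; this shape is hypothetical.
    """
    parser = argparse.ArgumentParser(description="Attribute one or more installers.")
    parser.add_argument(
        "--attribution",
        nargs=3,
        action="append",  # every repeat of the flag adds one more triple
        default=[],
        metavar=("INPUT", "OUTPUT", "CODE"),
        help="May be given multiple times.",
    )
    args = parser.parse_args(argv)
    return [tuple(d) for d in defaults] + [tuple(a) for a in args.attribution]
```

With this shape, a kind can define shared defaults once and individual jobs can add their own `--attribution` entries without restating them.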

As far as I can tell, all current usage of this script uses the environment variables, so it should be safe to remove the current arguments.

Depends on D170239

This appears to be entirely unused; this image already contains python3, and things appear to be using it.

Depends on D170240

Notably, the actual attribution code we're using is stored in browser. This was largely motivated by the fact that the subsequent revision in this stack will also need it, and this seemed like the best way to make it shareable between the two. The only alternative I could come up with was stuffing it into a transform - but it's really just data - there's no reason it ought to live in such a place. (We do have precedent for this sort of thing with both locale and whats new page information, so I don't think it's breaking huge new ground.) Nick - I'm tagging you mainly on this part, but I welcome any other comments you may have (here or in the rest of the stack for that matter).

The other notable part of this patch is that I've explicitly decided not to use the multi_dep loader, nor reimplement any of its magic pulling of properties in a transform. I find that this makes it more clear what's actually going on, and easier to debug when making changes. The downside, of course, is that there's some verbosity in the kind - all platforms we need to run this for must be explicitly listed. I'm open to debate on whether or not this is the right trade-off, so feel free to push back if you disagree.

Depends on D170242

The kind is more or less the same as the en-US counterpart in the previous revision.

As with the en-US attribution kind, this also does not use the multi_dep loader to set up the per-locale tasks. Of course, we do need to split by locale, which is now being done quite explicitly by the new split_by_locale that looks at the specified locales file, and uses specific platforms specified in the kind. As with the previous revision, please feel free to push back if any of you feel any of this is going in the wrong direction, is worse than multi_dep, etc. etc.
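To make the explicit approach concrete, here is a rough sketch of what a split-by-locale step could look like. It is not the transform from the patch: the locales file format (one locale per line, `#` comments), the task dictionary shape, and the attribute names are all assumptions for illustration.

```python
import copy


def parse_locales(text):
    """Read locale codes from a file body: first token per line, '#' starts a comment."""
    locales = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            locales.append(line.split()[0])
    return locales


def split_by_locale(task, locales, platforms):
    """Yield one copy of `task` for every explicitly listed platform and locale."""
    for platform in platforms:
        for locale in locales:
            new_task = copy.deepcopy(task)
            new_task["name"] = f"{task['name']}-{platform}-{locale}"
            new_task["attributes"] = {
                **task.get("attributes", {}),
                "build_platform": platform,
                "locale": locale,
            }
            yield new_task
```

The trade-off described above shows up directly: nothing is inferred from upstream tasks, so every platform has to be listed, but the generated tasks are easy to predict.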

Depends on D170243

This is mostly just switching the upstream tasks that we pull the Windows installers from. The only wrinkle is that we're not attributing the asan-reporter installers (and we shouldn't IMO), so I had to add support for keying upstream tasks on platform in the beetmover manifests. (If we simply listed all three possible upstreams, we ended up pulling two installers for platforms that are attributed...and I don't even know which one would get published, or if both would.)
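The keying idea can be sketched as a small resolver that picks a value per platform and falls back to a default. This is a simplified stand-in, not the manifest code from the patch: the `by-platform` layout, exact-match lookup, and the upstream task names in the example are assumptions.

```python
def resolve_by_platform(value, platform):
    """Return the entry for `platform` from a {"by-platform": {...}} mapping.

    Plain values are returned unchanged. Matching is exact; a "default" key,
    if present, is used when the platform is not listed.
    """
    if isinstance(value, dict) and set(value) == {"by-platform"}:
        options = value["by-platform"]
        if platform in options:
            return options[platform]
        if "default" in options:
            return options["default"]
        raise KeyError(f"no entry for platform {platform!r} and no default")
    return value


# Hypothetical example: unattributed asan-reporter builds keep their old upstream.
upstreams = {
    "by-platform": {
        "win64-asan-reporter": ["repackage-signing"],
        "default": ["attribution"],
    }
}
```

Resolving to exactly one upstream per platform avoids the ambiguity described above, where two installers could be pulled for the same platform.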

Depends on D170245

Comment on attachment 9318479 [details]
Bug 1814727: Remove python2 from partner repack docker image r?#releng-reviewers!

Revision D170241 was moved to bug 1817983. Setting attachment 9318479 [details] to obsolete.

Attachment #9318479 - Attachment is obsolete: true

Hi Ben! so here's the Data Science perspective on this:

  • I agree with Nick on not pursuing granularity.
  • To expand on that, what we really need to know is:
    • If an installation/profile came from archive.mozilla.org:
      • We just need to know that they did so
      • We don't necessarily need to know how they got to archive.mozilla.org.
    • But if an installation/profile came from mozilla.org, we need to know both that:
      • They did so (meaning we want to distinguish between these users and those who came from archive.mozilla.org)
      • How they got to mozilla.org.
    • If an installation / profile came from neither archive.mozilla.org nor mozilla.org:
      • We just need to know that they fall into this category

Given those requirements, I think we should be very intentional about the attribution schema. Using the currently existing attribution fields (source/medium/campaign/value) introduces some risk:

  • Attribution signifies different things, depending on where the user downloaded
    • For cases where the download happened on mozilla.org, attribution.source (and the other fields) are identifying how users arrived at the website
    • For cases where the download happened on archive.mozilla.org, if attribution.source was set to "archive.mozilla.org", it would be identifying that the download happened on archive.mozilla.org
    • Note that for downloads that happened on mozilla.org, there isn't any explicit identifier saying this came from mozilla.org. This was never a problem previously, because if attribution simply existed, we could assume it came from mozilla.org. However, if we start attributing archive.mozilla.org downloads using the attribution fields, this is no longer the case.

Given that, I think it would be best to implement a new separate sub-field in attribution (attribution.dl_source, for example) that distinguishes between mozilla.org and archive.mozilla.org downloads specifically for this use case.

  • So downloads from mozilla.org would continue to get values in the attribution.source, attribution.medium, attribution.campaign, and attribution.term fields,
  • downloads from archive.mozilla.org would get attribution.dl_source='archive.mozilla.org' (while leaving the source, medium, campaign, and term fields blank).

I'm not sure how much more difficult that makes things, so I'm interested in feedback.
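To illustrate how the three categories above could be told apart once a separate field exists, here is a small sketch. The function name and the dictionary shape are hypothetical; the field names follow the proposal in this comment (`dl_source` plus the existing source, medium, campaign, and term fields).

```python
ARCHIVE = "archive.mozilla.org"
UTM_FIELDS = ("source", "medium", "campaign", "term")


def classify_download(attribution):
    """Bucket an install by where its installer was downloaded.

    Returns "archive.mozilla.org", "mozilla.org", or "other".
    """
    attribution = attribution or {}
    if attribution.get("dl_source") == ARCHIVE:
        return ARCHIVE
    # Under the proposal, only mozilla.org downloads carry the UTM-style fields.
    if any(attribution.get(field) for field in UTM_FIELDS):
        return "mozilla.org"
    return "other"
```

Because the archive case is keyed on its own field, the existing fields keep their single meaning of "how the user arrived at mozilla.org".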

Now for dl token, there's a similar complication. Currently, the dl token serves two functions:

  1. connect installs and profiles back to the Google Analytics session data from mozilla.org, so we can attach attribution and other session info back to the installs / profiles.
  2. act as a unique identifier for a download so that we can de-dupe multiple installs / profiles from a single download

The first function is not necessary for downloads from archive.mozilla.org; the second function, however, would be very useful.

If we include a dl token in the attribution for archive.mozilla.org, we would need to make sure that they're unique (among themselves, as well as unique against the dl tokens being generated from downloads going through mozilla.org).

And I'm assuming that these download tokens / events would not be logged in the current dl token logs being generated by stubinstall service (we wouldn't want them ending up in those logs).

A few of us met to discuss comment #11. To recap the relevant parts:

  • We will use a separate field (dlsource) for this new information instead of polluting existing ones.
  • dltoken will not be set, partly because it's not technically feasible at this time, but also because philosophically, archive.mozilla.org is The Official Record of builds, and we ought not to do anything that would alter them per-download.

We've decided to use a new attribution field when attributing our vanilla builds. This field is valid as the only field in the attribution data.
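The validation rule described here (accept `dlsource` on its own, or else require the existing keys) could be sketched as follows. The exact set of previously required keys is not stated in this bug, so `REQUIRED_KEYS` below is a placeholder assumption, as is the function name.

```python
# Placeholder: the real list of previously required keys is not given in this bug.
REQUIRED_KEYS = frozenset({"source", "medium", "campaign", "content"})


def is_valid_attribution(fields):
    """Accept attribution data that has `dlsource`, or all of the required keys."""
    present = {key for key, value in fields.items() if value}
    if "dlsource" in present:
        return True  # valid even as the only field
    return REQUIRED_KEYS <= present
```

For example, `{"dlsource": "mozorg"}` would pass on its own, while `{"source": "x"}` alone would be rejected; the `"mozorg"` value is only an illustrative string.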

Depends on D170240

Pushed by bhearsum@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0fe1b1ad7279
Rename partner_attribution.py to a more generic name r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/3baea1a65d39
enhance attribution script args to make it easier to append additional attributions r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/4aa2d6544ea9
allow `dlsource` OR existing required keys when attributing builds r=releng-reviewers,gbrown
https://hg.mozilla.org/integration/autoland/rev/93e85cf13248
Add a transform that knows how to pull in command-context from an external file r=taskgraph-reviewers,ahal
https://hg.mozilla.org/integration/autoland/rev/834dd2eec9a6
add tasks for attributing en-US builds r=ahal,nalexander
https://hg.mozilla.org/integration/autoland/rev/834b431a6a21
add tasks for attributing l10n builds r=ahal
https://hg.mozilla.org/integration/autoland/rev/14d94e19b541
adjust beetmover tasks to be downstream of attribution tasks for Windows r=releng-reviewers,gbrown

I'll note that while this has landed, other work is required before we receive Telemetry with it (notably https://bugzilla.mozilla.org/show_bug.cgi?id=1819997). There's also some related work to start setting this for dynamic attribution as well (https://github.com/mozilla-services/stubattribution/issues/159, https://github.com/mozilla/bedrock/issues/12836).

Regressions: 1839815
Regressions: 1846489