Closed Bug 1444399 Opened 7 years ago Closed 5 years ago

Add mar download telemetry to evaluate impact of downloading mars over https

Categories

(Toolkit :: Application Update, task, P2)

task

Tracking

()

RESOLVED WONTFIX
Install Update Workflow Prioritized

People

(Reporter: jvehent, Assigned: agashlin)

References

(Blocks 1 open bug)

Details

MAR downloads are redirect to the HTTP endpoint of the download CDN. We should use HTTPS instead. > $ curl -si http://download.mozilla.org/?product=firefox-49.0.1-partial-48.0.2 > Location: http://download.cdn.mozilla.net/pub/firefox/releases/49.0.1/update/win32/en-US/firefox-48.0.2-49.0.1.partial.mar
A quick look at the AWS calculator seems to indicate little difference in pricing between HTTP and HTTPS traffic, so I'm hoping this change no longer impacts our budget, but we should confirm that before making the change.
Group: infra
This should be WONTFIX for now. cdn.mozilla.net has pins set in https://hg.mozilla.org/releases/mozilla-beta/file/default/security/manager/ssl/StaticHPKPins.h#l707, and we know there are cases where user's connections are MITMed, breaking the cert validation.
Blocks: 1444406
We do indeed have evidence that about 4% of our users are being MITMed when connecting to AUS5 [1]. This is likely true for connections to the download CDN. MITM will violate the static pins and block the connection, but those pins expire a few weeks after Firefox is released. Below are the current pin expiration dates per release: mozilla-beta: Wed Jun 13 20:15:38 UTC 2018 mozilla-release: Tue Jun 26 00:00:00 UTC 2018 mozilla-esr52: Wed Jun 13 20:53:29 UTC 2018 If we were to move MAR downloads behind HTTPS, potentially 4% of our users that are being MITMed would fail to download updates until their pins expire. After this, they'll connect to the CDN successfully. Needinfo keeler to confirm this, as he owns moz::pkix. Also, I though pins were set to expire 7 weeks after a release, just enough time for us to release a new version, but the dates above are much longer than that. Maybe the logic that calculate the expiration dates has not been updated when we removed aurora? [1] https://jhalderm.com/pub/papers/interception-ndss17.pdf
Flags: needinfo?(dkeeler)
If we do want to do this, it's just two steps: 1. I'll need to enable the akamai SSL endpoint in bouncer 2. Releng will need to start setting the ssl flag on mar products
Group: infra
If a user is being MITM'd and the corresponding root has been imported/trusted into their profile, the pins will be bypassed by default, so updates should succeed here. If the corresponding root hasn't been imported/trusted, then basically the entire web will be broken for them (as in, they will not be able to connect to google, gmail, facebook, twitter, amazon, etc.). In terms of pin expiring, we basically set the expiration date manually for each release to be around the time we release version n + 2 (see e.g. bug 1427957). The ESR and Beta expiration dates should be about 14 weeks from when each point version/build is released.
Flags: needinfo?(dkeeler)
Why not just have balrog serve an https url for the mar file? Before this change telemetry should be added specifically targeting this change so we know as close as possible what the impact of this change is if any. We are trying to get to below 2% of clients on versions less than 3 versions prior to the latest version and even an extremely small impact for all clients could easily be a major impact to this goal. If there is an impact then this defense in depth (mar signing already prevents an invalid mar file from being applied) then the value vs. cost would need to be discussed and likely the tradeoff would need to be agreed to by directors or equivalent.
What should be done with this bug?
Flags: needinfo?(jvehent)
I think the next step is to follow :rstrong's recommendation: > Before this change telemetry should be added specifically targeting this change so we know as close as possible what the impact of this change is if any. Then test the impact of the change on nightly, then beta. :jimm, is this something you could add to the updater toolkit roadmap?
Flags: needinfo?(jvehent) → needinfo?(jmathies)
Assignee: oremj → jmathies
(In reply to Julien Vehent [:ulfr] from comment #8) > I think the next step is to follow :rstrong's recommendation: > > > Before this change telemetry should be added specifically targeting this change so we know as close as possible what the impact of this change is if any. > > Then test the impact of the change on nightly, then beta. > > :jimm, is this something you could add to the updater toolkit roadmap? will do.
Flags: needinfo?(jmathies)
Blocks: 1468549
Blocks: 1445702

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #5)

<snip>
In terms of pin expiring, we basically set the expiration date manually for
each release to be around the time we release version n + 2 (see e.g. bug
1427957). The ESR and Beta expiration dates should be about 14 weeks from
when each point version/build is released.

Hi Dana,
What does "set the expiration date manually for each release to be around the time we release version n + 2" mean in practice? Will a client older than that not be able to update? A long time ago I was given a failure rate due to pinning (iirc around 0.5%). What is the current failure rate?

Flags: needinfo?(dkeeler)

For example, 67 released recently with the pinning expiration time set to Aug 15 2019. Since we're planning on releasing 68 on/around Jul 9 2019 (according to https://wiki.mozilla.org/Release_Management/Calendar anyway), those pins will still be active for about a month in 67 at that point. (So it looks like n+2 isn't really true any longer...)
In any case, when those pins expire, they won't be enforced, so expiry would make it easier to update rather than harder.
Some quick back-of-the-envelope calculations using https://mzl.la/2KqkNvl and buckets 32/33 (failures/successes for *.cdn.mozilla.net) indicate 160k failures vs 707m successes, or a failure rate of 0.02%. (I think I can run the numbers on release using atmo if necessary, but it might take a little while.)

Flags: needinfo?(dkeeler)

You may find this little bash snippet helpful:

mozpins () {
    echo "Preloaded Public Key Pins expiration dates"
    for release in mozilla-beta mozilla-release mozilla-esr52 mozilla-esr60
    do
        loc="https://hg.mozilla.org/releases/${release}/raw-file/tip/security/manager/ssl/StaticHPKPins.h"
        echo -n "$release: "
        TZ='UTC' date --date="@$((
        $( \
            curl -s $loc | grep kPreloadPKPinsExpirationTime | awk -F'[()]' '{print $2}' \
        ) / 10**6 ))"
    done
}
$ mozpins
Preloaded Public Key Pins expiration dates
mozilla-beta: Mon Sep  9 12:36:53 UTC 2019
mozilla-release: Thu Aug 15 12:31:43 UTC 2019
mozilla-esr52: Thu Dec 13 06:21:31 UTC 2018
mozilla-esr60: Mon Sep  9 12:38:39 UTC 2019
Status: NEW → ASSIGNED
Component: Operations: Product Delivery → Installer
Product: Cloud Services → Firefox
QA Contact: oremj
Component: Installer → Application Update
Product: Firefox → Toolkit

Julien, since this was changed to a client component, what needs to be changed on the client to fix this bug?

Flags: needinfo?(jvehent)

It is blocked on implementing your recommendation from comment 6:

Before this change telemetry should be added specifically targeting this change so we know as close as possible what the impact of this change is if any. We are trying to get to below 2% of clients on versions less than 3 versions prior to the latest version and even an extremely small impact for all clients could easily be a major impact to this goal. If there is an impact then this defense in depth (mar signing already prevents an invalid mar file from being applied) then the value vs. cost would need to be discussed and likely the tradeoff would need to be agreed to by directors or equivalent.

Flags: needinfo?(jvehent)
Priority: -- → P3

Ahhh... so this bug is being morphed. Changing summary to reflect what this bug is about now.

Julien, can you provide details as to what should be checked and reported to telemetry?

Flags: needinfo?(jvehent)
Summary: Bouncer should redirect MAR downloads to HTTPS → Add mar download telemetry to evaluate impact of downloading mars over https

The question we want to answer is "how many clients fails to download updated over https but succeed over http?".
If the updater can handle it, perhaps the easiest way to obtain that data would be to upgrade http download links to https, and fall back to http if that fails.
Ideally, the telemetry ping indicates what situation the updater ran into:

  • download link was https and succeeded
  • download link was https and failed
  • download link was http, upgraded to https and succeeded
  • download link was http, upgraded to https and failed, downgraded to http and succeeded
  • download link was http, upgraded to https and failed, downgraded to http and failed
  • download link was http and succeeded (no upgrade attempt)
  • download link was http and failed (no upgrade attempt)

How does that sound?

Flags: needinfo?(jvehent)

That might be ok but it is work in addition to the telemetry reporting but A / B testing might be better than using a fallback. I was asking for the networking error codes that would indicate the failure case if they are known.

I'm not sure what those error codes would be... perhaps Dana knows?

Flags: needinfo?(dkeeler)

I suppose you could use nsINSSErrorsService.getErrorClass [0]. Given an nsresult, it will tell you if the error is (roughly) categorized as due to a bad certificate or due to a problem with TLS (and if the error wasn't either of those, the implementation will throw an error).

[0] https://searchfox.org/mozilla-central/rev/40e889be8ff926e32f7567957f4c316f14f6fbef/security/manager/ssl/nsINSSErrorsService.idl#41

Flags: needinfo?(dkeeler)
Blocks: 1431892

Robert, does comment 19 above answer your question?

Yes but I haven't been allocated time to work on this.

Blocks: 1614388
Install Update Workflow: --- → Prioritized
Assignee: jmathies → agashlin
Priority: P3 → P2
Depends on: 1621971

Closing as won't fix for now; we're going to use the existing update ping to monitor for issues. See 1629033 for details.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.