Closed Bug 1572418 Opened 5 years ago Closed 5 years ago

[autoconfig] Exchange config fails due to TLS OCSP requests

Categories

(Thunderbird :: Account Manager, defect)

defect
Not set
normal

Tracking

(thunderbird_esr6069+ fixed, thunderbird_esr6869+ fixed, thunderbird69 fixed, thunderbird70 fixed)

RESOLVED FIXED
Thunderbird 70.0
Tracking Status
thunderbird_esr60 69+ fixed
thunderbird_esr68 69+ fixed
thunderbird69 --- fixed
thunderbird70 --- fixed

People

(Reporter: BenB, Assigned: neil)

References

Details

Attachments

(1 file, 4 obsolete files)

In the automatic account creation dialog:

Situation:
When we find an Exchange config, we fetch addons.json to get the addons that can support Exchange servers in Thunderbird. We request https://live.thunderbird.net/autoconfig/addons.json , which redirects to https://autoconfig.thunderbird.net/addons.json .

In FetchHTTP, we are settings a timeout of 5 seconds for any HTTPS calls that we make during the autoconfiguration. That is because we contact a number of different servers, including servers from the ISP, and they are often badly configured and do not respond in time or at all. To avoid the user waiting a lot time, we set this timeout. An HTTP server that needs more than 5 seconds to respond, I would consider it broken. DNS can add time, but that's usually no more than 1 second for a cold cache. Then we added HTTPS into the mix. TLS setup take a moment, but usually only 200 ms.

There's a utility called hey which can measure the time taken to make HTTP requests. With its default test settings of 200 requests, max 50 simultaneously, the slowest request only took 1.3671 seconds, with 25% of requests actually finishing in half the time. Given that we're using a timeout of 5 second, this should be plenty of time for the request to finish.

Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.

On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

Impact:
In the statistics, I am seeing a massive discrepancy between people who fetched addons.json (from the server perspective) and who installed Owl. The discrepancy cannot be explained with people not wanting the addon, it's way too huge. This bug is a possible explanation: The users request addons.json, but never receive it, so they cannot install the addon.

Solution:
3 options:

  1. Remove the redirect on the server (cuts the time in half)
  2. Increase the timeout to 15 seconds, only for this addons.json call
  3. Pre-warm the OSCP by making TLS requests to these 2 servers at the time when we open the dialog. That would fetch the OSCP while the user types, it would be cached, so at the time we call the servers, we already have the OSCP response and the calls are much faster. Additional advantage: The MX and ISPDB requests would also be faster.

We should combine at least 2 of the above solutions, i.e. pre-warm (to reduce user wait time) and increase timeout (to increase reliability).

Assignee: nobody → ben.bucksch
Assignee: ben.bucksch → neil

(In reply to Ben Bucksch from comment #1)

  1. Increase the timeout to 15 seconds, only for this addons.json call
    Actually the issue affects all calls, for instance I made a test capture in a new profile and it was the request to https://live.thunderbird.net redirected to https://autoconfig.thunderbird.net that failed (again presumably due to an OCSP timeout, as retrying the request succeeded).

Sorry, that wasn't clear of me, because the request for addons.json uses the same servers. Sorry about that. I think the request I was looking at was for the ISPdata.

Attached patch Proposed patch (obsolete) (deleted) — Splinter Review

I increased the timeout to 10 seconds for all requests because we now make a lot of requests to various servers as part of autodiscover.

Attachment #9084019 - Flags: review?(ben.bucksch)
Attached patch Specific requests only (obsolete) (deleted) — Splinter Review
Attachment #9084129 - Flags: review?(ben.bucksch)

(In reply to Ben Bucksch (:BenB) from comment #0)

Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.

On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier? Where is the CA and why does it take so long to connect to it? Does OCSP really make an additional, sequential server request for each https URL? The certificate on live.thunderbird.net and the one on autoconfig are the exact same certificate.

Basically what I'm saying here is that this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.

autoconfig latency is poor(~800-1100ms) in many regions because it's not behind any CDN for logging purposes(Cloudflare does not allow log access without an Enterprise account), but we may not need real IPs in those logs anymore, so I could put it back behind Cloudflare. I would need to remove the rate limiter first.

Comment on attachment 9084129 [details] [diff] [review] Specific requests only Good patch. Please change HEAD to OPTIONS r+ BenB with that change
Attachment #9084129 - Flags: review?(ben.bucksch) → review+
Attachment #9084019 - Attachment is obsolete: true
Attachment #9084019 - Flags: review?(ben.bucksch)
Attachment #9084019 - Flags: review-

(In reply to Andrei Hajdukewycz from comment #6)

the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier?

All I can tell you is what I observed on my machine while I was doing a lot of testing of new profiles, where the developer tools claim that it's spending that long in TLS setup, and warming up the request significantly cuts down the time taken for the real request.

Hey sancus, not every request takes 4000 ms, but some do. For example, I do not see that problem on my machine. The tests were performed on Neil's machine. He saw it in some, but not all requests. Still, it was fairly reproducible for him. He could code and test against these timeouts.

Of course, the OSCP checks are cached for a certain amount of time, so you would only see this in a new profile. That made it harder to discover the bug.

this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.

I concur, and I am glad you consider it the same.

As to whether this is widespread: I don't know any tools or facilities that would both allow me to query from different DSL connections around the world, and also do it in the same way as Thunderbird does, with OSCP. Commandline tools normally don't bother with this, as evidenced by the good and fast 'hay' results that Neil saw on the same machine where Thunderbird requests are slow.

I can only go by symptoms, and if the various server calls don't give a reliable response, that would explain a number of symptoms that I see in the statistics that were inexplicable so far. But I cannot be certain.

autoconfig latency is poor(~800-1100ms) in many regions

While 1100 ms is certainly not fast, it's still acceptable. I would worry upwards of 2000ms, or outliers above 4000ms, including all redirects commulative.

Attached patch Use OPTIONS (obsolete) (deleted) — Splinter Review
Attachment #9084129 - Attachment is obsolete: true
Attachment #9084282 - Flags: review?(ben.bucksch)

The pre-warming would help in any case. We already know we will make these calls, and that the OSCP check is necessary, so we can make the check earlier, and shave off 200ms in the fast case, and rescue an otherwise failing call in the worst case. We chose HTTP OPTIONS as method, to avoid messing up the stats, and give less work for the server. But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?

Comment on attachment 9084282 [details] [diff] [review] Use OPTIONS r+ BenB Thanks.
Attachment #9084282 - Flags: review?(ben.bucksch) → review+

(In reply to Ben Bucksch (:BenB) from comment #11)

But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?

The additional server calls are fine, no problem.

Keywords: checkin-needed
Summary: Autoconfig: Exchange config fails due to TLS OSCP requests → [autoconfig] Exchange config fails due to TLS OSCP requests

Sigh, no HG header, the patch doesn't apply to trunk (I fixed that) and no commit message. What's a good commit message, certainly not the bug summary.

Flags: needinfo?(ben.bucksch)
Keywords: checkin-needed

[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB

Flags: needinfo?(ben.bucksch)

@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.

Attached patch 1572418.diff - rebased (obsolete) (deleted) — Splinter Review

Oh, I was going to land it now, I can do the HG header, the summary, but not the explanation. While you're at it, please rebase to latest trunk, there was a merge failure in emailWizard.js.

Sorry for forgetting the commit message.

Attachment #9084282 - Attachment is obsolete: true
Attachment #9084987 - Attachment is obsolete: true
Attachment #9085041 - Flags: review+

Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.

Keywords: checkin-needed

(In reply to Jorg K from comment #19)

Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.

Bah, I didn't see this, otherwise I would have fixed my other attachment just now. Sorry about that.

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/e056b7118c3a
Mitigate first-time HTTPS overhead to avoid ISPDB/addon.json fetch failures. r=BenB

Status: NEW → RESOLVED
Closed: 5 years ago
Keywords: checkin-needed
Resolution: --- → FIXED

hg qimport bz:1572418 failed here :-(

You guys cause me head-aches. One says ...
[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB
@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.

... and the other provides this commit message:
Mitigate first-time HTTPS overhead

So I combined the two since most people won't understand the TLS OSCP.

And then the explanation:

  • Preflight a couple of URLs that we know we're likely to access for speed
  • Increase the timeout on those two URLs for reliability

Why does one sentence say "a couple of" and the other "two". It's the same two you pre-fetch. What's "preflight"? I fixed that explanation as I deemed fit. Sigh, I just noticed that "for speed" had it's counterpart in "for reliability", so I shouldn't have removed it. Too bad.

Target Milestone: --- → Thunderbird 70.0

(In reply to Jorg K from comment #22)

You guys cause me head-aches.

Actually I had a (real) headache at the time, which I'm going to blame for overlooking that Ben said "this summary". Sorry about that.

Comment on attachment 9085041 [details] [diff] [review] Rebase and add message and description [Approval Request Comment] Regression caused by (bug #): User impact if declined: Risk of new user being unable to autoconfigure correctly Testing completed (on c-c, etc.): Risk to taking this patch (and alternatives if risky):
Attachment #9085041 - Flags: approval-comm-beta?
Attachment #9085041 - Flags: approval-comm-beta? → approval-comm-beta+
Comment on attachment 9085041 [details] [diff] [review] Rebase and add message and description This should uplift only in combination with bug 1571772.
Attachment #9085041 - Flags: approval-comm-esr68?
Attachment #9085041 - Flags: approval-comm-esr60?

(In reply to Ben Bucksch from comment #26)

This should uplift only in combination with bug 1571772.

Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.

Well, I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.

Neil wrote in comment 27:

Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.

No, I really meant bug 1571772 and this bug 1572418 together. The relation is on the high level, not on the technical level.

Jörg wrote in comment 28:

I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.

Agreed. That's what I meant.

(In reply to Ben Bucksch from comment #29)

Agreed. That's what I meant.

Sorry, I was confused because you made different requests on bug 1572467.

Attachment #9085041 - Flags: approval-comm-esr68? → approval-comm-esr68+
Attachment #9085041 - Flags: approval-comm-esr60? → approval-comm-esr60+

Thanks, Jörg!

There seems something wrong with the logic of this whole adjustment:

The addons.json used to determine if there's a suitable add-on for the protocol (exchange) couldn't be loaded due to a delay in lookup. But that is only (should only) be called after we found a config. And to find a config we ask the ISPDB, which is on same server we just queried to get a configuration in the first place! All network setup should already be done when we go into the next call.
Am I missing something?

(I think increasing the timeout is alright though.)

Regressions: 1594366

It's true that the addons.json is loaded only after we found a config that has an Exchange server in it. There are currently 2 ways to get an Exchange config: Via Microsoft AutoDiscover protocol, or via ISPDB. The AutoDiscover protocol has been there since last year (you had reviewed that change). This exists to support on-premise Exchange servers. That ISPDB can also return an Exchange config is rather new and was added to support Office365, which has a config in the ISPDB.

We have so many different autoconfig methods, and each need network calls, which are slow, that we start all the network calls in parallel. We query the email domain with our own autoconfig method. We query the ISPDB for the domain. We get the MX host, and query the ISPDB for that MX domain. We do AutoDiscover. Unfortunately, because many networks are configured to DROP packets, some of these queries have to time out before they finally fail, so many of these network calls can take 10s. If you chain them, one after the other, it becomes really slow. But even though the network calls are all started at the same time, we use them in the order that I mentioned, and if a more preferred method succeeds, we abort the rest. This change has been very much needed and dramatically increased the speed of the configuration from what could be 1.5 minutes to 10 seconds. Note that most calls are to the very domain where you host your emails, in fact the very first call is to the domain, and it's always been like that.

I think what happened here is that the AutoDiscover results were quick, because they came from the on-premise Exchange and the company-internal network, but the ISPDB and addons.json calls were slow due to the OSCP lookups, and the corresponding DNS lookups. Even if the ISPDB call happens first, and takes 7 seconds, and the addons.json call starts 1 second later while the ISPDB call is still in flight, the addons.json call would still wait for the same DNS lookups and OSCP queries, and block as well, and timeout as well.

Please don't concentrate on the addons.json call here. That's just how we found it. But the same problem also happened for the ISPDB call. These calls also timed out, even though I have no statistics about it. Given that they happen later than the ISPDB, I would expect that even more of the ISPDB calls timed out, based on the same reasoning that you mentioned, namely that the ISPDB call happens first.

That the calls to ispdb or addons.json has an explicit timeout at all seems like a bug. It should just be a normal fetch and results will show once done.

Summary: [autoconfig] Exchange config fails due to TLS OSCP requests → [autoconfig] Exchange config fails due to TLS OCSP requests
Regressions: 1596000
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: