Closed Bug 1572418 Opened 5 years ago Closed 5 years ago

[autoconfig] Exchange config fails due to TLS OCSP requests

Tracking

(thunderbird_esr6069+ fixed, thunderbird_esr6869+ fixed, thunderbird69 fixed, thunderbird70 fixed)

Status:

RESOLVED FIXED

Milestone:

Thunderbird 70.0

Tracking Flags:

Tracking

Status

thunderbird_esr60

69+

fixed

thunderbird_esr68

69+

fixed

thunderbird69

---

fixed

thunderbird70

---

fixed

People

(Reporter: BenB, Assigned: neil)

References

Details

Attachments

(1 file, 4 obsolete files)

Proposed patch 5 years ago neil@parkwaycc.co.uk (deleted), patch	BenB : review-	Details \| Diff \| Splinter Review
Specific requests only 5 years ago neil@parkwaycc.co.uk (deleted), patch	BenB : review+	Details \| Diff \| Splinter Review
Use OPTIONS 5 years ago neil@parkwaycc.co.uk (deleted), patch	BenB : review+	Details \| Diff \| Splinter Review
1572418.diff - rebased 5 years ago Jorg K (CEST = GMT+2) (deleted), patch		Details \| Diff \| Splinter Review
Rebase and add message and description 5 years ago neil@parkwaycc.co.uk (deleted), patch	neil : review+ jorgk-bmo : approval-comm-beta+ jorgk-bmo : approval-comm-esr60+ jorgk-bmo : approval-comm-esr68+	Details \| Diff \| Splinter Review

Ben Bucksch (:BenB)

Reporter

Description

•

5 years ago

In the automatic account creation dialog:

Situation:
When we find an Exchange config, we fetch addons.json to get the addons that can support Exchange servers in Thunderbird. We request https://live.thunderbird.net/autoconfig/addons.json , which redirects to https://autoconfig.thunderbird.net/addons.json .

In FetchHTTP, we are settings a timeout of 5 seconds for any HTTPS calls that we make during the autoconfiguration. That is because we contact a number of different servers, including servers from the ISP, and they are often badly configured and do not respond in time or at all. To avoid the user waiting a lot time, we set this timeout. An HTTP server that needs more than 5 seconds to respond, I would consider it broken. DNS can add time, but that's usually no more than 1 second for a cold cache. Then we added HTTPS into the mix. TLS setup take a moment, but usually only 200 ms.

There's a utility called hey which can measure the time taken to make HTTP requests. With its default test settings of 200 requests, max 50 simultaneously, the slowest request only took 1.3671 seconds, with 25% of requests actually finishing in half the time. Given that we're using a timeout of 5 second, this should be plenty of time for the request to finish.

Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.

On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

Impact:
In the statistics, I am seeing a massive discrepancy between people who fetched addons.json (from the server perspective) and who installed Owl. The discrepancy cannot be explained with people not wanting the addon, it's way too huge. This bug is a possible explanation: The users request addons.json, but never receive it, so they cannot install the addon.

Ben Bucksch (:BenB)

Reporter

Comment 1

•

5 years ago

Solution:
3 options:

Remove the redirect on the server (cuts the time in half)
Increase the timeout to 15 seconds, only for this addons.json call
Pre-warm the OSCP by making TLS requests to these 2 servers at the time when we open the dialog. That would fetch the OSCP while the user types, it would be cached, so at the time we call the servers, we already have the OSCP response and the calls are much faster. Additional advantage: The MX and ISPDB requests would also be faster.

We should combine at least 2 of the above solutions, i.e. pre-warm (to reduce user wait time) and increase timeout (to increase reliability).

Assignee: nobody → ben.bucksch

Ben Bucksch (:BenB)

Reporter

Updated

•

5 years ago

Assignee: ben.bucksch → neil

neil@parkwaycc.co.uk

Assignee

Comment 2

•

5 years ago

(In reply to Ben Bucksch from comment #1)

Increase the timeout to 15 seconds, only for this addons.json call
Actually the issue affects all calls, for instance I made a test capture in a new profile and it was the request to https://live.thunderbird.net redirected to https://autoconfig.thunderbird.net that failed (again presumably due to an OCSP timeout, as retrying the request succeeded).

neil@parkwaycc.co.uk

Assignee

Comment 3

•

5 years ago

Sorry, that wasn't clear of me, because the request for addons.json uses the same servers. Sorry about that. I think the request I was looking at was for the ISPdata.

neil@parkwaycc.co.uk

Assignee

Comment 4

•

5 years ago

Attached patch Proposed patch (obsolete) (deleted) — Details — Splinter Review

I increased the timeout to 10 seconds for all requests because we now make a lot of requests to various servers as part of autodiscover.

Attachment #9084019 - Flags: review?(ben.bucksch)

neil@parkwaycc.co.uk

Assignee

Comment 5

•

5 years ago

Attached patch Specific requests only (obsolete) (deleted) — Details — Splinter Review

Attachment #9084129 - Flags: review?(ben.bucksch)

Andrei Hajdukewycz [:sancus]

Comment 6

•

5 years ago

(In reply to Ben Bucksch (:BenB) from comment #0)

Bug:
However, Thunderbird enabled OSCP by default, for TLS certificate revokation checks. That happens transparently during the TLS setup and makes another server request, to the CA. And apparently, that call is slooooooow.

On the same machine that ran hey, the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier? Where is the CA and why does it take so long to connect to it? Does OCSP really make an additional, sequential server request for each https URL? The certificate on live.thunderbird.net and the one on autoconfig are the exact same certificate.

Basically what I'm saying here is that this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.

autoconfig latency is poor(~800-1100ms) in many regions because it's not behind any CDN for logging purposes(Cloudflare does not allow log access without an Enterprise account), but we may not need real IPs in those logs anymore, so I could put it back behind Cloudflare. I would need to remove the rate limiter first.

Ben Bucksch (:BenB)

Reporter

Comment 7

•

5 years ago

Comment on attachment 9084129 [details] [diff] [review] Specific requests only Good patch. Please change HEAD to OPTIONS r+ BenB with that change

Attachment #9084129 - Flags: review?(ben.bucksch) → review+

Ben Bucksch (:BenB)

Reporter

Updated

•

5 years ago

Attachment #9084019 - Attachment is obsolete: true

Attachment #9084019 - Flags: review?(ben.bucksch)

Ben Bucksch (:BenB)

Reporter

Updated

•

5 years ago

Attachment #9084019 - Flags: review-

neil@parkwaycc.co.uk

Assignee

Comment 8

•

5 years ago

(In reply to Andrei Hajdukewycz from comment #6)

the call to https://live.thunderbird.net/autoconfig/addons.json took up to 4532ms on TLS setup alone! Consequently, the redirect calls to https://autoconfig.thunderbird.net/addons.json were cancelled due to the timeout.

I don't really pretend to fully understand OSCP, but are you saying that any https request Thunderbird makes takes an additional 4000ms because of this? That can't be the case -- my locally installed beta doesn't behave this way. Even on a new profile it takes well under 200ms to contact live.thunderbird.net in the account setup process. Is this an outlier?

All I can tell you is what I observed on my machine while I was doing a lot of testing of new profiles, where the developer tools claim that it's spending that long in TLS setup, and warming up the request significantly cuts down the time taken for the real request.

Ben Bucksch (:BenB)

Reporter

Comment 9

•

5 years ago

Hey sancus, not every request takes 4000 ms, but some do. For example, I do not see that problem on my machine. The tests were performed on Neil's machine. He saw it in some, but not all requests. Still, it was fairly reproducible for him. He could code and test against these timeouts.

Of course, the OSCP checks are cached for a certain amount of time, so you would only see this in a new profile. That made it harder to discover the bug.

this kind of delay can't possibly be acceptable, normal operation and, if it is actually widespread, should be considered a bug.

I concur, and I am glad you consider it the same.

As to whether this is widespread: I don't know any tools or facilities that would both allow me to query from different DSL connections around the world, and also do it in the same way as Thunderbird does, with OSCP. Commandline tools normally don't bother with this, as evidenced by the good and fast 'hay' results that Neil saw on the same machine where Thunderbird requests are slow.

I can only go by symptoms, and if the various server calls don't give a reliable response, that would explain a number of symptoms that I see in the statistics that were inexplicable so far. But I cannot be certain.

autoconfig latency is poor(~800-1100ms) in many regions

While 1100 ms is certainly not fast, it's still acceptable. I would worry upwards of 2000ms, or outliers above 4000ms, including all redirects commulative.

neil@parkwaycc.co.uk

Assignee

Comment 10

•

5 years ago

Attached patch Use OPTIONS (obsolete) (deleted) — Details — Splinter Review

Attachment #9084129 - Attachment is obsolete: true

Attachment #9084282 - Flags: review?(ben.bucksch)

Ben Bucksch (:BenB)

Reporter

Comment 11

•

5 years ago

The pre-warming would help in any case. We already know we will make these calls, and that the OSCP check is necessary, so we can make the check earlier, and shave off 200ms in the fast case, and rescue an otherwise failing call in the worst case. We chose HTTP OPTIONS as method, to avoid messing up the stats, and give less work for the server. But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?

Ben Bucksch (:BenB)

Reporter

Comment 12

•

5 years ago

Comment on attachment 9084282 [details] [diff] [review] Use OPTIONS r+ BenB Thanks.

Attachment #9084282 - Flags: review?(ben.bucksch) → review+

Andrei Hajdukewycz [:sancus]

Comment 13

•

5 years ago

(In reply to Ben Bucksch (:BenB) from comment #11)

But still, that would increase the server calls, so that's why I've asked you. sancus, do you agree with this change?

The additional server calls are fine, no problem.

neil@parkwaycc.co.uk

Assignee

Updated

•

5 years ago

Keywords: checkin-needed

Ben Bucksch (:BenB)

Reporter

Updated

•

5 years ago

Summary: Autoconfig: Exchange config fails due to TLS OSCP requests → [autoconfig] Exchange config fails due to TLS OSCP requests

Jorg K (CEST = GMT+2)

Comment 14

•

5 years ago

Sigh, no HG header, the patch doesn't apply to trunk (I fixed that) and no commit message. What's a good commit message, certainly not the bug summary.

Flags: needinfo?(ben.bucksch)

Keywords: checkin-needed

Ben Bucksch (:BenB)

Reporter

Comment 15

•

5 years ago

[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB

Flags: needinfo?(ben.bucksch)

Ben Bucksch (:BenB)

Reporter

Comment 16

•

5 years ago

@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.

Jorg K (CEST = GMT+2)

Comment 17

•

5 years ago

Attached patch 1572418.diff - rebased (obsolete) (deleted) — Details — Splinter Review

Oh, I was going to land it now, I can do the HG header, the summary, but not the explanation. While you're at it, please rebase to latest trunk, there was a merge failure in emailWizard.js.

neil@parkwaycc.co.uk

Assignee

Comment 18

•

5 years ago

Attached patch Rebase and add message and description (deleted) — Details — Splinter Review

Sorry for forgetting the commit message.

Attachment #9084282 - Attachment is obsolete: true

Attachment #9084987 - Attachment is obsolete: true

Attachment #9085041 - Flags: review+

Jorg K (CEST = GMT+2)

Comment 19

•

5 years ago

Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.

Keywords: checkin-needed

neil@parkwaycc.co.uk

Assignee

Comment 20

•

5 years ago

(In reply to Jorg K from comment #19)

Almost perfect now. The line with the commit message needs the bug number, bug I'll add that, don't worry.

Bah, I didn't see this, otherwise I would have fixed my other attachment just now. Sorry about that.

Pulsebot

Comment 21

•

5 years ago

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/e056b7118c3a
Mitigate first-time HTTPS overhead to avoid ISPDB/addon.json fetch failures. r=BenB

Status: NEW → RESOLVED

Closed: 5 years ago

Keywords: checkin-needed

Resolution: --- → FIXED

Jorg K (CEST = GMT+2)

Comment 22

•

5 years ago

hg qimport bz:1572418 failed here :-(

You guys cause me head-aches. One says ...
[autoconfig] Increase timeouts and pre-fetch TLS OSCP to avoid ISPDB/addon.json fetch failures. r=BenB
@Neil: Please provide a patch with HG header, this summary, and a longer explanation about the 2 fixes.

... and the other provides this commit message:
Mitigate first-time HTTPS overhead

So I combined the two since most people won't understand the TLS OSCP.

And then the explanation:

Preflight a couple of URLs that we know we're likely to access for speed
Increase the timeout on those two URLs for reliability

Why does one sentence say "a couple of" and the other "two". It's the same two you pre-fetch. What's "preflight"? I fixed that explanation as I deemed fit. Sigh, I just noticed that "for speed" had it's counterpart in "for reliability", so I shouldn't have removed it. Too bad.

Target Milestone: --- → Thunderbird 70.0

neil@parkwaycc.co.uk

Assignee

Comment 23

•

5 years ago

(In reply to Jorg K from comment #22)

You guys cause me head-aches.

Actually I had a (real) headache at the time, which I'm going to blame for overlooking that Ben said "this summary". Sorry about that.

neil@parkwaycc.co.uk

Assignee

Comment 24

•

5 years ago

Comment on attachment 9085041 [details] [diff] [review] Rebase and add message and description [Approval Request Comment] Regression caused by (bug #): User impact if declined: Risk of new user being unable to autoconfigure correctly Testing completed (on c-c, etc.): Risk to taking this patch (and alternatives if risky):

Attachment #9085041 - Flags: approval-comm-beta?

Jorg K (CEST = GMT+2)

Updated

•

5 years ago

Attachment #9085041 - Flags: approval-comm-beta? → approval-comm-beta+

Jorg K (CEST = GMT+2)

Comment 25

•

5 years ago

TB 69 beta 4:
https://hg.mozilla.org/releases/comm-beta/rev/5477fcc25b929b4d74bc4cb60ed24cccd030bb0e

status-thunderbird69: --- → fixed

status-thunderbird70: --- → fixed

Ben Bucksch (:BenB)

Reporter

Comment 26

•

5 years ago

Comment on attachment 9085041 [details] [diff] [review] Rebase and add message and description This should uplift only in combination with bug 1571772.

Attachment #9085041 - Flags: approval-comm-esr68?

Attachment #9085041 - Flags: approval-comm-esr60?

neil@parkwaycc.co.uk

Assignee

Comment 27

•

5 years ago

(In reply to Ben Bucksch from comment #26)

This should uplift only in combination with bug 1571772.

Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.

Jorg K (CEST = GMT+2)

Comment 28

•

5 years ago

Well, I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.

Ben Bucksch (:BenB)

Reporter

Comment 29

•

5 years ago

Neil wrote in comment 27:

Have you got your bugs mixed up? Bug 1572467 is the one that belongs with bug 1571772.

No, I really meant bug 1571772 and this bug 1572418 together. The relation is on the high level, not on the technical level.

Jörg wrote in comment 28:

I'd be inclined to to throw bug 1572418 (this one here) and bug 1571772 and bug 1572467 into the same lot and uplift all or none.

Agreed. That's what I meant.

neil@parkwaycc.co.uk

Assignee

Comment 30

•

5 years ago

(In reply to Ben Bucksch from comment #29)

Agreed. That's what I meant.

Sorry, I was confused because you made different requests on bug 1572467.

Jorg K (CEST = GMT+2)

Updated

•

5 years ago

Attachment #9085041 - Flags: approval-comm-esr68? → approval-comm-esr68+

Jorg K (CEST = GMT+2)

Comment 31

•

5 years ago

TB 68.1 ESR:
https://hg.mozilla.org/releases/comm-esr68/rev/1b43445ce5f59d9932741c5abd188f6ece733afe

status-thunderbird_esr68: --- → fixed

tracking-thunderbird_esr68: --- → 69+

Jorg K (CEST = GMT+2)

Updated

•

5 years ago

Attachment #9085041 - Flags: approval-comm-esr60? → approval-comm-esr60+

Jorg K (CEST = GMT+2)

Comment 32

•

5 years ago

TB 60.9 ESR:
https://hg.mozilla.org/releases/comm-esr60/rev/ee3679e4a7199ba12c08a251080575d961b4b420

status-thunderbird_esr60: --- → fixed

tracking-thunderbird_esr60: --- → 69+

Ben Bucksch (:BenB)

Reporter

Comment 33

•

5 years ago

Thanks, Jörg!

Ben Bucksch (:BenB)

Reporter

Updated

•

5 years ago

Blocks: exchange-meta

Magnus Melin [:mkmelin]

Comment 34

•

5 years ago

There seems something wrong with the logic of this whole adjustment:

The addons.json used to determine if there's a suitable add-on for the protocol (exchange) couldn't be loaded due to a delay in lookup. But that is only (should only) be called after we found a config. And to find a config we ask the ISPDB, which is on same server we just queried to get a configuration in the first place! All network setup should already be done when we go into the next call.
Am I missing something?

(I think increasing the timeout is alright though.)

Magnus Melin [:mkmelin]

Updated

•

5 years ago

Regressions: 1594366

Ben Bucksch (:BenB)

Reporter

Comment 35

•

5 years ago

It's true that the addons.json is loaded only after we found a config that has an Exchange server in it. There are currently 2 ways to get an Exchange config: Via Microsoft AutoDiscover protocol, or via ISPDB. The AutoDiscover protocol has been there since last year (you had reviewed that change). This exists to support on-premise Exchange servers. That ISPDB can also return an Exchange config is rather new and was added to support Office365, which has a config in the ISPDB.

We have so many different autoconfig methods, and each need network calls, which are slow, that we start all the network calls in parallel. We query the email domain with our own autoconfig method. We query the ISPDB for the domain. We get the MX host, and query the ISPDB for that MX domain. We do AutoDiscover. Unfortunately, because many networks are configured to DROP packets, some of these queries have to time out before they finally fail, so many of these network calls can take 10s. If you chain them, one after the other, it becomes really slow. But even though the network calls are all started at the same time, we use them in the order that I mentioned, and if a more preferred method succeeds, we abort the rest. This change has been very much needed and dramatically increased the speed of the configuration from what could be 1.5 minutes to 10 seconds. Note that most calls are to the very domain where you host your emails, in fact the very first call is to the domain, and it's always been like that.

I think what happened here is that the AutoDiscover results were quick, because they came from the on-premise Exchange and the company-internal network, but the ISPDB and addons.json calls were slow due to the OSCP lookups, and the corresponding DNS lookups. Even if the ISPDB call happens first, and takes 7 seconds, and the addons.json call starts 1 second later while the ISPDB call is still in flight, the addons.json call would still wait for the same DNS lookups and OSCP queries, and block as well, and timeout as well.

Please don't concentrate on the addons.json call here. That's just how we found it. But the same problem also happened for the ISPDB call. These calls also timed out, even though I have no statistics about it. Given that they happen later than the ISPDB, I would expect that even more of the ISPDB calls timed out, based on the same reasoning that you mentioned, namely that the ISPDB call happens first.

Magnus Melin [:mkmelin]

Comment 36

•

5 years ago

That the calls to ispdb or addons.json has an explicit timeout at all seems like a bug. It should just be a normal fetch and results will show once done.

Summary: [autoconfig] Exchange config fails due to TLS OSCP requests → [autoconfig] Exchange config fails due to TLS OCSP requests

Ben Bucksch (:BenB)

Reporter

Updated

•

5 years ago

Regressions: 1596000

You need to log in before you can comment on or make changes to this bug.