Closed Bug 1452436 (Opened 7 years ago, Closed 7 years ago)
Slack page can't load channel list from Slack's WebSocket server when using DoH and Google's DNS
Categories: Core :: Networking: DNS, defect, P1
Status: VERIFIED FIXED in mozilla62
Tracking | Status
---|---
firefox-esr52 | unaffected
firefox-esr60 | disabled
firefox59 | unaffected
firefox60 | disabled
firefox61 | verified
firefox62 | verified
People: Reporter: cpeterson, Assigned: bagder
Whiteboard: [necko-triaged][trr]
Attachments (3 files, 2 obsolete files):
1. (deleted), image/png
2. (deleted), text/plain
3. (deleted), patch (mcmanus: review+, RyanVM: approval-mozilla-beta+)
STR:
1. Enable DoH prefs with Google's DNS:
network.trr.mode = 2 or 1
network.trr.uri = https://dns.google.com/experimental
2. Restart Firefox.
3. Load Slack, e.g. https://mozilla.slack.com/
RESULT:
Slack can load the page's wireframe, but it can't load the channel list or recent messages from Slack's WebSocket server. See the attached screenshot. Occasionally, Slack can load the channel list and recent messages after waiting 20 seconds (which just happens to be a multiple of the Slack SOCKET-MANAGER's 10000 ms timeout below).
This problem only seems to happen with Google's DNS. If I use Cloudflare's DNS (network.trr.uri = https://mozilla.cloudflare-dns.com/dns-query) or disable DoH (network.trr.mode = 0) and reload the Slack page, then the channel list and recent messages load correctly.
When using Google's DNS, I see WebSocket error messages in the devtools console like:
Firefox can’t establish a connection to the server at wss://mpmulti-5eri.lb.slack-msgs.com/...
Apr-7 22:49:21.398 [T027LFU12] SOCKET-MANAGER: Changing from prov_connecting to error
Apr-7 22:49:21.400 [T027LFU12] SOCKET-MANAGER: will stay in error state for up to 10000 ms
Apr-7 22:49:21.401 T027LFU12 SOCKET-MANAGER: Closing socket due to unknown error
Assignee
Updated • 7 years ago
Assignee: nobody → daniel
Priority: -- → P1
Whiteboard: [necko-triaged][trr]
Reporter
Comment 1 • 7 years ago
This seems more like a race condition than a problem with Google's DNS. Today I can also reliably reproduce the problem with Cloudflare's DNS in mode = 1 (but mode = 2 still works).
Comment hidden (obsolete)
Reporter
Comment 3 • 7 years ago
I take that back: I can reproduce on both macOS and Windows 10. On macOS, I can make the problem start and stop just by changing the prefs. On Windows, however, it seems like I have to restart the browser for the new prefs to trigger the problem.
OS: Unspecified → All
Comment 4 • 7 years ago
I can reproduce on Windows 10, too. I'm suffering from that. :/
Comment 5 • 7 years ago
I'm also seeing this, on macOS. Fixed by flipping trr.mode back from 2 to 0.
Assignee
Comment 6 • 7 years ago
I've managed to reproduce a long delay a few times but I can't seem to do it very reliably and I have not been able to observe exactly why this problem happens. I'm logging at nsHostResolver:5,nsHttp:5 and I can see that the slack host names are "CNAME-chased" quite a bit but I can't spot any DOH-related failures (due to CNAMEs or otherwise).
Mode 2 should cause a retry on native resolve if TRR fails, but I see no failures.
Mode 1 should just go with the fastest (and use native if TRR fails).
Based on all this, it seems like TRR doesn't actually report a failure when this effect happens.
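To illustrate what the modes are supposed to do, here is a simplified sketch with made-up function names; this is not the actual nsHostResolver code, just the intended behavior (mode 4, "shadow", comes up later in this bug):

```cpp
#include <optional>
#include <string>

// Stub resolvers, illustration only; the real code talks to the DoH
// server (TRR) and to the OS resolver (native).
std::optional<std::string> TrrResolve(const std::string&) { return std::nullopt; }
std::optional<std::string> NativeResolve(const std::string&) {
  return std::string("203.0.113.7");  // documentation-range address
}

// Sketch of the trr.mode dispatch discussed in this bug.
std::optional<std::string> Resolve(const std::string& host, int trrMode) {
  switch (trrMode) {
    case 0:  // off: native only
      return NativeResolve(host);
    case 1:  // race native vs TRR, fastest answer wins; sequenced here
             // for brevity, the real resolver fires both in parallel
      if (auto trr = TrrResolve(host)) return trr;
      return NativeResolve(host);
    case 2:  // TRR first: retry on native only if TRR reports a failure
      if (auto trr = TrrResolve(host)) return trr;
      return NativeResolve(host);
    case 3:  // TRR only: no native fallback at all
      return TrrResolve(host);
    case 4:  // shadow: run TRR for measurement but always use native
      (void)TrrResolve(host);
      return NativeResolve(host);
    default:
      return NativeResolve(host);
  }
}
```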
Comment 7 • 7 years ago
Chris, can you turn on DNS logging in the about:networking tab and repro for Daniel? Mode 1 is the most curious.
Comment 8 • 7 years ago
Actually, you might want all of HTTP logging on. It's possible the DNS data from the cloud is the problem, not the DNS interaction.
Reporter
Comment 9 • 7 years ago
Here is a zip with log files of mode=0, mode=1, and mode=2:
https://www.dropbox.com/s/d5x1x8cc3fk2n6e/bug-1452436-logs.zip?dl=0
I recorded these log files with the default modules: timestamp,sync,nsHttp:5,cache2:5,nsSocketTransport:5,nsHostResolver:5. If there is a different HTTP module I should specify, just let me know.
I can still reproduce this Slack connection problem very easily with mode=1 while mode=2 connects successfully.
Reporter
Comment 11 • 7 years ago
(In reply to Chris Peterson [:cpeterson] from comment #9)
> I can still reproduce this Slack connection problem very easily with mode=1
> while mode=2 connects successfully.
Did any DoH changes land recently? Slack used to fail to load for me with mode=1 while mode=0 and mode=2 loaded successfully. But today Slack fails to load with either mode=1 or mode=2 (on Mac and Windows with both Google's and Cloudflare's DNS servers).
Assignee
Comment 12 • 7 years ago
There have been a few smaller changes recently, and a change of how CNAME responses are handled landed just the other day (bug 1455972).
I've stared at your logs a lot and I've tried to make sense of the symptoms but failed so far - there don't seem to be any TRR failures logged when you get these problems!
I keep using the mozilla slack in a tab of mine all the time and I alternate between trr mode 2 and 3 (usually with cloudflare) and I experience none of these problems. (I've seen delayed loads at some point but I'm far from being able to reproduce).
mode 0 isn't using TRR at all so I ignore that.
mode 1 is racing native vs trr and goes with the fastest response, which would indicate that sometimes you get the TRR response first and for some reason it seems to give you the "wrong" IP or an IP that is slow for you.
mode 2 is trr-first, and as long as you get an IP from the DOH server (and Firefox doesn't fail to connect to it) you'll use the TRR address, so why mode 2 would work and mode 1 would not is really hard to grasp.
It might offer some additional data points if you check "about:networking" on the DNS tab when things go wrong/right and see if the TRR column says true or false for the involved host names.
Can you record some new logs with these more "absolute" failures you experience now? Maybe this situation gives a better picture.
Flags: needinfo?(daniel)
Reporter
Comment 13 • 7 years ago
Here's a snapshot of my about:networking DNS tab with trr mode 2. After waiting about one minute, Slack finally loaded.
Hostname | Family | TRR | Addresses | Expires (Seconds)
ca.slack-edge.com | ipv4 | false | 151.101.40.102 | 18
mpmulti-s86e.lb.slack-msgs.com | ipv4 | false | 18.216.16.163 | 14
mpmulti-wq3r.lb.slack-msgs.com | ipv4 | true | 54.149.194.173 | 17
a.slack-edge.com | ipv4 | true | 151.101.40.102 | 9
mozilla.cloudflare-dns.com | ipv4 | false | 104.16.112.25, 104.16.111.25 | 90
lh4.googleusercontent.com | ipv4 | true | 172.217.0.33 | 36
secure.gravatar.com | ipv4 | true | 192.0.73.2 | 59
lh3.googleusercontent.com | ipv4 | true | 172.217.0.33 | 36
a.slack-edge.com | ipv4 | true | 151.101.40.102 | 9
emoji.slack-edge.com | ipv4 | true | 151.101.40.102 | 9
lh6.googleusercontent.com | ipv4 | true | 172.217.0.33 | 36
Assignee
Comment 14 • 7 years ago
Here's a freshly recorded log file when this happens ("timestamp,sync,nsWebSocket:5,nsHostResolver:5").
This shows that TRR works fine the whole time, but the websockets fail to connect to the given addresses, with a 20 second timeout on each attempt.
There's no fallback to native DNS resolve in the websocket logic, so the next attempt (if any) may very well also fail. I think we need to add logic there in a similar style to what we're doing for the socket transport:
https://searchfox.org/mozilla-central/source/netwerk/base/nsSocketTransport2.cpp#1820-1830
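Roughly, the socket-transport pattern I'm referring to looks like this (a simplified standalone sketch with made-up names and types; the real logic is in the error-recovery path linked above): if connecting to a TRR-provided address fails, re-resolve natively once and retry.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the transport state; illustration only.
struct Transport {
  std::string host;
  std::vector<std::string> addresses;  // from the most recent resolve
  bool resolvedByTrr = false;          // did TRR supply these addresses?
  bool retriedWithNative = false;      // only fall back once
};

bool Connect(Transport& t);
std::vector<std::string> ResolveNativeOnly(const std::string& host);

// Sketch of the recovery step: when a connect to a TRR-provided address
// fails, redo the resolve with TRR bypassed and try again.
bool ConnectWithTrrFallback(Transport& t) {
  if (Connect(t)) return true;
  if (t.resolvedByTrr && !t.retriedWithNative) {
    t.retriedWithNative = true;
    t.addresses = ResolveNativeOnly(t.host);  // bypass TRR this time
    t.resolvedByTrr = false;
    return Connect(t);
  }
  return false;  // out of fallbacks; surface the error
}

// Trivial stubs so the sketch compiles; real code uses the socket
// transport and DNS services.
bool Connect(Transport&) { return false; }
std::vector<std::string> ResolveNativeOnly(const std::string&) { return {}; }
```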
I can't explain why TRR returns different/non-functional addresses and why a native resolve doesn't. It surprises me that the Slack web site, of all sites, would show this behavior.
Michal, as someone somewhat familiar with the websockets code, would you agree with my conclusion, and do you have a suggestion on how I'd best add the fallback/retry?
Flags: needinfo?(michal.novotny)
Comment 15 • 7 years ago
(In reply to Daniel Stenberg [:bagder] from comment #14)
> Michal, as somewhat familiar with the websockets code, would you agree with
> my conclusion and do you have a suggestion on how I'd best add the
> fallback/retry ?
We use the address from the resolver only in nsWSAdmissionManager, to control how many websocket channels we're connecting to some host at the same time. We create the http channel in WebSocketChannel::AsyncOpen with the given URI, so no fallback is necessary in the websockets code.
Flags: needinfo?(michal.novotny)
Comment 16 • 7 years ago
Daniel, can you confirm that this will have no impact in shadow mode?
Assignee
Comment 17 • 7 years ago
I can confirm that in mode 4 this issue doesn't trigger.
I haven't seen this problem happen in mode 4 (nor 0), and nobody else has reported it there either. That also follows the logic that the problem is somehow related to the particular IP addresses returned from TRR instead of native. After talking with the team this evening I got some new ideas on how to proceed with my debugging from here, so I hope to have some further details on this soon.
Assignee
Comment 18 • 7 years ago
Ok Nick,
Let me bounce this with you as well. I have a 6MB gzipped logfile here: https://daniel.haxx.se/test/mozilla-slack-broken2-2018-04-27.log.gz (about 60MB unpacked)
That's a log of nsHostResolver, nsWebSocket and nsHttp when I load mozilla.slack.com from my own Firefox dev build a few days ago.
Search for "mpmulti-pzxk.lb.slack-msgs.com" and look at the WebSocket HTTP request that is sent at timestamp 2018-04-27 13:26:20.822957.
WebSockets here implies HTTP/1 while the rest of this site is served over h2. I'm asking for your help to interpret the logs here to spot why this request takes ~95 seconds(!) to fulfill. The WebSockets code itself only allows 20 seconds and times out before the actual HTTP response arrives.
The TRR name resolve is done in 22ms at "2018-04-27 13:26:20.860225". I can't see any problems with that.
After almost 7 seconds (2018-04-27 13:26:27.617449), we get "GetSpdyActiveConn() request for ent 0x7f6e42944580 .S...X.[tlsflags0x00000000]mpmulti-pzxk.lb.slack-msgs.com:443 did not find an active connection"
... which to me sounds as if we're looking for an h2 connection for this request? I can't see anything else interesting done on this request before then and it puzzles me greatly.
I can't see how this is related to TRR/name resolving at all, but right now I'm trying to figure out exactly what's going on so that I can possibly backtrack from there and figure out where the problem is introduced in this scenario. Since it doesn't happen without TRR it pretty much has to be involved somehow!
Flags: needinfo?(hurley)
Comment 19 • 7 years ago
I've looked at the bits specifically related to the GetSpdyActiveConn bits. It looks like we hit that code as part of ProcessPendingQForEntry, which, while not strictly necessary (the conninfo's hash key has the 'X' bit set, indicating no spdy allowed), does the right thing (there is, of course, no spdy conn available for an entry whose conninfo allows no spdy).
For reference, it looks like we're trying to ProcessPendingQForEntry due to the code at https://searchfox.org/mozilla-central/source/netwerk/protocol/http/nsHttpConnectionMgr.cpp#2543 (we were asked to process the pending queue for something with an empty pending queue, so we just go looking for something to process)
The other thing I noticed around that bit of the log is that it then tries to dispatch the transaction with no connection available (indicating at that point that we haven't successfully made a connection). We definitely *try* to make a connection (2018-04-27 13:26:20.823149 UTC - [12579:Socket Thread]: V/nsHttp nsHttpConnectionMgr::MakeNewConnection 0x7f6e65e97740 ent=0x7f6e42944580 trans=0x7f6e409f2c00).
If you look at the DispatchPendingQ log lines a bit below the GetSpdyActiveConn stuff, it looks like the connection stays in the state where it has a half-open connection (implied by the second DispatchPendingQ line) that has not yet completed. It keeps going around like that every time we ProcessPendingQForEntry. Eventually the existing halfopen goes away (timed out, I presume, no socket transport stuff in the log), and we try to MakeNewConnection again.
That all indicates to me that the connection is just taking a *really* long time to get set up. Are we certain that the TRR server is returning the same rrset as plain ol' dns? (I didn't see any mention of that in my quick scan of the bug, apologies if I missed that.)
Flags: needinfo?(hurley)
Assignee
Comment 20 • 7 years ago
Thanks for taking a look Nick!
> Are we certain that the TRR server is returning the same rrset as plain ol' dns?
They often don't return the exact same addresses, no, but they return addresses from the same general IP ranges - which isn't a surprise since these are all Amazon things.
I have however found that in many cases when we get an A record back from TRR for this host, which also typically includes two CNAMEs, I will get a different A record if we "chase" the CNAME down and fetch that A record. This might hint that the A record we get in the first response is actually... stale? I'm now working on trying this theory out more to see what I can learn.
I've also been thinking about TTL times, especially with responses using CNAMEs, and I intend to go back to study exactly how they are supposed to work for us here and make sure that our code does what my reading says it should.
Assignee
Comment 21 • 7 years ago
Ok, here's a patch for testing and I'm curious how this changes the slack page for people who can repro this problem.
This patch does a few things that change TRR behavior:
1 - it brings back CNAME "chasing" instead of relying on the address records that also come back (reverting bug 1455972). Empirical testing shows that this makes us get a different address in a significant share of the cases.
2 - it prefers the last CNAME instead of the first, which is mostly arbitrary, but in the Slack case it gets to the final name faster.
3 - I figured out that the code didn't correctly use the TRR-provided TTL in all cases, so I have now made sure that all addresses resolved with TRR set the TTL correctly for the DNS cache entry (with grace time zero).
Using this patch in my own testing, I seem to have at least reduced the likelihood of problems quite significantly. I'll continue to test more...
I'm mostly looking for some input before I try to land this.
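To make points 1 and 2 concrete, the chasing logic is roughly this (a simplified standalone sketch with a made-up response shape, not the patch itself): re-query the final CNAME until an A record comes back, and keep that response's TTL for the cache.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Hypothetical parsed-DoH-response shape, illustration only.
struct DnsResponse {
  std::vector<std::string> cnames;     // CNAME chain, in response order
  std::vector<std::string> addresses;  // A/AAAA records, if any
  uint32_t ttl = 0;                    // smallest TTL in the response
};

DnsResponse DohQuery(const std::string& name);

struct Resolved {
  std::vector<std::string> addresses;
  uint32_t ttl;
};

// Chase CNAMEs with a hop cap; prefer the *last* CNAME of each response
// so we reach the final name in fewer queries (point 2 above).
std::optional<Resolved> ResolveChasingCnames(std::string name) {
  for (int hops = 0; hops < 8; ++hops) {  // cap guards against loops
    DnsResponse r = DohQuery(name);
    if (!r.addresses.empty()) {
      // Point 3: take the TTL from the response that actually carried
      // the addresses, rather than from an earlier one in the chain.
      return Resolved{r.addresses, r.ttl};
    }
    if (r.cnames.empty()) return std::nullopt;  // dead end
    name = r.cnames.back();  // last CNAME, not the first
  }
  return std::nullopt;
}

// Stub so the sketch compiles; the real code issues an HTTPS request
// to the resolver configured in network.trr.uri.
DnsResponse DohQuery(const std::string&) { return {}; }
```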
Flags: needinfo?(cpeterson)
Comment 22 • 7 years ago
ni?myself to also try with this patch, since I'd been reproducing basically constantly.
Flags: needinfo?(hurley)
Comment 23 • 7 years ago
> Ok, here's a patch for testing and I'm curious how this changes the slack page for people who can repro this problem.
Is there some kind of performance decrease?
Assignee
Comment 24 • 7 years ago
(In reply to Jens Hausdorf from comment #23)
> Is there some kind of performance decrease?
Yes, a small one. Performance goes back to how it was before bug 1455972 landed, and I doubt anyone even noticed a performance difference (improvement) with that. Following CNAMEs means doing another HTTP request (potentially a whole series of them) to get the "actual" address, so typically one extra round-trip of time for a "fresh" resolve. (We don't cache the CNAMEs; we only cache the end results in the in-memory DNS cache.)
Reporter
Comment 25 • 7 years ago
(In reply to Daniel Stenberg [:bagder] from comment #21)
> Created attachment 8972846 [details] [diff] [review]
> 0001-bug-1452436-TRR-use-CNAME-and-TTL-better-r-mcmanus.patch
>
> Ok, here's a patch for testing and I'm curious how this changes the slack
> page for people who can repro this problem.
In my local build, this patch seems to improve things a little, but I can still easily reproduce the websocket connection problems. I'm testing a debug build on macOS with network.trr.uri = https://dns.google.com/experimental.
The first time I load mozilla.slack.com in a browser session seems to be the most likely to reproduce the problem. I can repro with either network.trr.mode = 1 or 2.
Flags: needinfo?(cpeterson)
Comment 26 • 7 years ago
So, the first time I go to load slack with this patch applied, I hit the bug. However, like :cpeterson, this does appear to have improved the situation (it's not *every* time I load slack, as it was previously). The only other difference with my build is that it's a debug build (whereas of course I've been experiencing this on release builds otherwise). Just to rule out that difference, I've pushed to try (macOS only, since that's what I'm running) to get a "real" build out of infra to make sure.
For reference, I've been using trr mode 3, and https://1.1.1.1/dns-query as my resolver
Flags: needinfo?(hurley)
Comment hidden (mozreview-request)
Assignee
Comment 28 • 7 years ago
We eventually figured out the ultimate reason for this problem and it turned out to not really be about websockets.
Here's how the bug happens:
1. A name resolve is done for host X (TRR asks for A and AAAA records separately)
2. A response for the A records arrives
3. The host resolver callback fires early (before the AAAA response has arrived)
4. *Another* resolve is then started for the same host X
5. The AAAA response arrives
6. The second callback is then wrongly inhibited because the code sees that a callback has been called already for the A records
The fix (just submitted for review):
Makes sure that if a new resolve is started while the existing record is "in progress", its callback gets called at once if a callback has already been called for this record.
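In sketch form (made-up names, not the actual host resolver code): the record remembers whether a callback has already fired, and a resolve that attaches after that point is notified immediately from the partial result instead of waiting on the outstanding AAAA response.

```cpp
#include <functional>
#include <string>
#include <vector>

using Callback = std::function<void(const std::string& host)>;

// Hypothetical in-progress record for one host; illustration only.
struct HostRecord {
  std::string host;
  bool resolving = false;      // a query (e.g. AAAA) is still outstanding
  bool callbackFired = false;  // an earlier (A) response already notified
  std::vector<Callback> waiting;
};

// Buggy flow (steps 1-6 above): a second resolve only queued its
// callback, and the late AAAA response then skipped notification
// because callbackFired was already true, so the second caller hung.
// Fixed flow: notify the newcomer right away when a callback has
// already fired for this record.
void AddResolve(HostRecord& rec, Callback cb) {
  if (rec.resolving && rec.callbackFired) {
    cb(rec.host);  // the fix: call back at once with what we have
    return;
  }
  rec.waiting.push_back(cb);  // otherwise wait for the next response
}
```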
Comment 29 • 7 years ago
mozreview-review
Comment on attachment 8974397 [details]
bug 1452436 - callback when resolving a host already in progress
https://reviewboard.mozilla.org/r/242746/#review248846
Attachment #8974397 - Flags: review?(mcmanus)
Comment 30 • 7 years ago
mozreview-review
Comment on attachment 8974397 [details]
bug 1452436 - callback when resolving a host already in progress
https://reviewboard.mozilla.org/r/242746/#review248848
Assignee
Comment 31 • 7 years ago
A better take on fixing this issue
Attachment #8972846 - Attachment is obsolete: true
Attachment #8974397 - Attachment is obsolete: true
Attachment #8974678 - Flags: review?(mcmanus)
Updated • 7 years ago
Attachment #8974678 - Flags: review?(mcmanus) → review+
Assignee
Updated • 7 years ago
Keywords: checkin-needed
Assignee
Comment 32 • 7 years ago
Comment on attachment 8974678 [details] [diff] [review]
0001-bug-1452436-callback-when-resolving-a-host-already-i.patch
Approval Request Comment
[Feature/Bug causing the regression]: bug 1452436
[User impact if declined]: It causes non-obvious timeouts/hangs for users of TRR, triggered perhaps most easily when websockets are involved.
[Is this code covered by automated tests?]: no
[Has the fix been verified in Nightly?]: it is about to be
[Needs manual test from QE? If yes, steps to reproduce]: enable TRR, go to mozilla.slack.com and see if it updates fine
[List of other uplifts needed for the feature/fix]:
[Is the change risky?]: no
[Why is the change risky/not risky?]: the change is tiny and only affects users with TRR enabled
[String changes made/needed]:
Attachment #8974678 - Flags: approval-mozilla-beta?
Updated • 7 years ago
status-firefox62: --- → affected
status-firefox-esr52: --- → unaffected
status-firefox-esr60: --- → disabled
Comment 33 • 7 years ago
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9d23e0279ecd
Callback when resolving a host already in progress. r=mcmanus
Keywords: checkin-needed
Comment 34 • 7 years ago
bugherder
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla62
Updated • 7 years ago
Flags: qe-verify+
Comment 35 • 7 years ago
Comment on attachment 8974678 [details] [diff] [review]
0001-bug-1452436-callback-when-resolving-a-host-already-i.patch
Fixes a Slack bug for users preffing on TRR for testing. Since the change only affects users who've preffed TRR on, I'll approve for 61.0b5.
Attachment #8974678 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Comment 36 • 7 years ago
bugherder uplift
Comment 37 • 6 years ago
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:62.0) Gecko/20100101 Firefox/62.0
Build ID: 20180514220126
I verified this issue on MacOS 10.13.4, Windows 10 and Ubuntu 16.04 with the latest FF Nightly 61.0a1 (2018-05-14), and the issue is no longer reproducible.
We will also verify this in beta (61.0b5) after the fix is uplifted.
Comment 38 • 6 years ago
(In reply to Cristian Craciun from comment #37)
> User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:62.0)
> Gecko/20100101 Firefox/62.0
> Build ID: 20180514220126
>
> I verified this issue on MacOS 10.13.4, Windows 10 and Ubuntu 16.04 with the
> latest FF Nightly 61.0a1(2018-05-14) and the issue is no longer
> reproducible.
> We will also verify this in beta (61.0b5) after the fix is uplifted.
Edit: latest FF Nightly 62.0a1(2018-05-14)
Comment 39 • 6 years ago
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:61.0) Gecko/20100101 Firefox/61.0
Build ID: 20180514150347
Verified as fixed on Firefox 61.0b5 on MacOS 10.13.4, Windows 10 and Ubuntu 16.04
Comment 40 • 6 years ago
I'm still/again seeing this on 62.0a1 (2018-05-21) (64-bit) (resetting trr.mode from 1 to 0 makes it go away) on macOS.
Assignee
Comment 41 • 6 years ago
I double-checked on my Mac right now and it works as snappily as it is supposed to on the 2018-05-21 version for me. I also can't spot any difference in behavior between Windows, Mac and Linux on the Slack page. I suspect what you're experiencing is not the same problem that was already fixed.
Did you change any other trr pref than mode and uri?
Can you switch to trr mode 2 and enable logging[*] with "timestamp,nsHostResolver:5" when you go to the slack page and experience this problem, and then share that log with me/us?
[*] = https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/HTTP_logging
Comment 42 • 6 years ago
Hmm, can't reproduce now in either mode 1 or 2. Probably not worth logging now? Sorry for the false alarm!
Assignee
Comment 43 • 6 years ago
No worries. If you can't reproduce it's not worth logging. Let's save that for if this ever happens again... =)
Reporter
Comment 44 • 6 years ago
I had problems loading Slack yesterday, too. I thought it might be a DoH regression, but I think it was a Slack server problem:
https://status.slack.com/2018-05/42acc9f7860c0416
Comment 45 • 6 years ago
I had a similar problem again earlier. I was going to log it, but when I toggled trr.mode from 1 to 2, the problem resolved itself (and I can't reproduce it again setting it back to mode = 1).