Closed Bug 1810137 Opened 2 years ago Closed 2 years ago

With DoH (using NextDNS) recently, phabricator pages take a long time to open on Nightly.

Categories

(Core :: Networking: DNS, defect, P2)

Tracking

RESOLVED WORKSFORME

People

(Reporter: mayankleoboy1, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged][necko-priority-review])

Attachments

(3 files)

Attached file log.txt-main.34900.7z (deleted)

Example: Go to https://phabricator.services.mozilla.com/D165214

I am attaching the network logging.

This is not a recent regression, as a build from Dec-22 also repros. It might be that my network provider is blocking NextDNS or something.

Attached file about:support (deleted)

OK, this got better when I disabled DoH (using NextDNS).
If I use Cloudflare with DoH, things are better than DoH with NextDNS.

Summary: Recently, phabricator pages take a long time to open on Nightly. → With DoH (using NextDNS) recently, phabricator pages take a long time to open on Nightly.
Component: Networking → Networking: DNS

Thanks for the log.
It shows that the first connection attempt to phabricator.services.mozilla.com is using address 54.148.248.183.

2023-01-13 13:58:51.234000 UTC - [Parent 34900: Socket Thread]: E/nsSocketTransport nsSocketTransport::SendStatus [this=18ac7691400 status=804b0007]
  804b0007 = STATUS_CONNECTING_TO
2023-01-13 13:58:51.234000 UTC - [Parent 34900: Socket Thread]: D/nsSocketTransport   trying address: 54.148.248.183

It fails with NS_ERROR_NET_TIMEOUT, so we try to connect again with another address, 35.167.158.137.

2023-01-13 13:59:12.270000 UTC - [Parent 34900: Socket Thread]: D/nsSocketTransport nsSocketTransport::RecoverFromError [this=18ac7691400 state=3 cond=804b000e]
2023-01-13 13:59:12.270000 UTC - [Parent 34900: Socket Thread]: D/nsSocketTransport   trying again with next ip address
2023-01-13 13:59:12.270000 UTC - [Parent 34900: Socket Thread]: D/nsSocketTransport nsSocketTransport::PostEvent [this=18ac7691400 type=1 status=0 param=0]
2023-01-13 13:59:12.270000 UTC - [Parent 34900: Socket Thread]: D/nsSocketTransport   trying address: 35.167.158.137
2023-01-13 13:59:12.270000 UTC - [Parent 34900: Socket Thread]: D/nsSocketTransport   idle [0] { handler=18ac7691400 condition=0 pollflags=6 }
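For anyone reading similar logs, the status=804b0007 and cond=804b000e values above can be split into a module and a code. This is a minimal sketch, not taken from the Firefox tree: it assumes the usual nsresult failure layout (0x80000000 | ((module + 0x45) << 16) | code) and that module 6 is the network module. The function and table names are made up for illustration, and the table holds only the two names mentioned in this bug.

```python
# Sketch: split an nsresult hex string from a log line into its fields.
# Assumption: failure codes are laid out as
#   0x80000000 | ((module + 0x45) << 16) | code
# and module 6 is the network module.

MODULE_BASE_OFFSET = 0x45  # assumed value of NS_ERROR_MODULE_BASE_OFFSET
NETWORK_MODULE = 6         # assumed value of NS_ERROR_MODULE_NETWORK

# Only the two names that appear in this bug's comments.
KNOWN_NETWORK_CODES = {
    0x07: "STATUS_CONNECTING_TO",
    0x0E: "NS_ERROR_NET_TIMEOUT",
}


def decode_nsresult(hex_value: str) -> dict:
    """Return the failure bit, module, code and (if known) name."""
    value = int(hex_value, 16)
    module = ((value >> 16) - MODULE_BASE_OFFSET) & 0x1FFF
    code = value & 0xFFFF
    name = KNOWN_NETWORK_CODES.get(code) if module == NETWORK_MODULE else None
    return {
        "failure_bit": bool(value >> 31),
        "module": module,
        "code": code,
        "name": name,
    }


if __name__ == "__main__":
    for raw in ("804b0007", "804b000e"):
        print(raw, decode_nsresult(raw))
```

Both values land in the same module and differ only in the low 16 bits, which matches the reading above: one is a progress status, the other is the timeout that triggers RecoverFromError.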

In the end, the second attempt also fails, so we try to resolve phabricator.services.mozilla.com with the native resolver.

2023-01-13 13:59:33.312000 UTC - [Parent 34900: Socket Thread]: V/nsHttp DnsAndConnectSocket::OnOutputStreamReady [this=18aae864480 ent=phabricator.services.mozilla.com primary]
2023-01-13 13:59:33.312000 UTC - [Parent 34900: Socket Thread]: V/nsHttp   failed to connect with TRR enabled, try w/o
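To put a number on "a long time", the three timestamps quoted above (first address tried, second address tried, fallback without TRR) can be subtracted from each other. A small sketch follows; the helper names are hypothetical, and it assumes every log line starts with a timestamp in the format shown in this bug, followed by " UTC".

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d %H:%M:%S.%f"


def parse_stamp(line: str) -> datetime:
    """Parse the leading timestamp of a log line (everything before ' UTC')."""
    stamp = line.split(" UTC", 1)[0]
    return datetime.strptime(stamp, STAMP_FORMAT)


def attempt_durations(lines):
    """Seconds elapsed between each consecutive pair of log lines."""
    stamps = [parse_stamp(line) for line in lines]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]


# Timestamps copied from the excerpts in this comment.
LINES = [
    "2023-01-13 13:58:51.234000 UTC - trying address: 54.148.248.183",
    "2023-01-13 13:59:12.270000 UTC - trying address: 35.167.158.137",
    "2023-01-13 13:59:33.312000 UTC - failed to connect with TRR enabled, try w/o",
]

if __name__ == "__main__":
    durations = attempt_durations(LINES)
    print(durations, sum(durations))
```

Each of the two attempts blocks for roughly 21 seconds before timing out, so the page load waits about 42 seconds before the native resolver is even asked.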

Interestingly, the native resolver returns different addresses than TRR.

2023-01-13 13:59:33.647000 UTC - [Parent 34900: DNS Resolver #1]: D/nsHostResolver Caching host [phabricator.services.mozilla.com] record for 60 seconds (grace 0).
2023-01-13 13:59:33.647000 UTC - [Parent 34900: DNS Resolver #1]: D/nsHostResolver CompleteLookup: phabricator.services.mozilla.com has 52.35.152.109
2023-01-13 13:59:33.647000 UTC - [Parent 34900: DNS Resolver #1]: D/nsHostResolver CompleteLookup: phabricator.services.mozilla.com has 54.202.98.228
2023-01-13 13:59:33.647000 UTC - [Parent 34900: DNS Resolver #1]: D/nsHostResolver CompleteLookup: phabricator.services.mozilla.com has 35.163.76.207

It seems that NextDNS might return some unavailable IP addresses for phabricator.services.mozilla.com.
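The comparison behind that observation can be written down as a set difference. This is only a sketch with a hypothetical helper name. One caveat: the log excerpts show the addresses the socket tried after the TRR lookup, not the TRR response itself, so the TRR set here is assumed to be just those two addresses.

```python
from ipaddress import ip_address

# Addresses the socket tried after the TRR lookup (from the log above).
TRR_ADDRESSES = {ip_address(a) for a in ("54.148.248.183", "35.167.158.137")}

# Addresses reported by the native resolver (from the log above).
NATIVE_ADDRESSES = {
    ip_address(a) for a in ("52.35.152.109", "54.202.98.228", "35.163.76.207")
}


def compare_answers(trr, native):
    """Split two answer sets into shared, TRR-only and native-only addresses."""
    return {
        "shared": sorted(trr & native),
        "trr_only": sorted(trr - native),
        "native_only": sorted(native - trr),
    }


if __name__ == "__main__":
    for label, addresses in compare_answers(TRR_ADDRESSES, NATIVE_ADDRESSES).items():
        print(label, [str(a) for a in addresses])
```

With the addresses from this log the two sets do not overlap at all, which is consistent with every TRR-derived connection timing out while the native answer works.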

I assume we don't have a good way to mitigate this case. Valentin, what do you think?

Flags: needinfo?(valentin.gosu)

So, when connecting to all of the TRR IP addresses fails, we then try bypassing TRR here.
If that succeeds, I think that's a good indication that we shouldn't use TRR for that domain name.

In OnSocketConnected, if we are falling back to native DNS then we can add it to the temporary domain blocklist.

This would make it so the next time we try to establish a connection to this domain (if it's within 60 seconds) we don't use TRR because it will fail anyway. I'm not sure how well this would work in practice, as presumably we shouldn't be connecting to the same host so often.
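As a rough model of the idea above (not the actual necko code, and not a patch), the behaviour could look like the sketch below. The class name, method names and the on_socket_connected helper are all hypothetical; only the 60 second window and the "add the host when the native fallback connects" rule come from this comment.

```python
import time
from typing import Callable, Dict


class TemporaryTrrBlocklist:
    """Hypothetical model of a per-host, time-limited 'skip TRR' list."""

    def __init__(self, ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry can be tested
        self._expiry: Dict[str, float] = {}

    def add(self, host: str) -> None:
        """Skip TRR for this host until the TTL runs out."""
        self._expiry[host] = self._clock() + self._ttl

    def should_skip_trr(self, host: str) -> bool:
        """True while the host is still inside its blocklist window."""
        expiry = self._expiry.get(host)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._expiry[host]  # entry expired, forget it
            return False
        return True


def on_socket_connected(blocklist: TemporaryTrrBlocklist, host: str,
                        used_native_fallback: bool) -> None:
    """Hypothetical hook: remember hosts that only connected without TRR."""
    if used_native_fallback:
        blocklist.add(host)


if __name__ == "__main__":
    blocklist = TemporaryTrrBlocklist()
    on_socket_connected(blocklist, "phabricator.services.mozilla.com", True)
    print(blocklist.should_skip_trr("phabricator.services.mozilla.com"))
```

The short expiry is the weak point noted above: a host that is revisited a few minutes later would go through the full pair of timeouts again.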

On my machine, nextDNS returns the same IPs as native DNS, but that may not be the case for everyone.
@Mayank, are you using the default nextDNS TRR: https://firefox.dns.nextdns.io/ or a custom one?

I'm wondering if this is a problem with phabricator/AWS - not accepting connections that are to the wrong zone.
We should also add some telemetry to see how often this happens.

Blocks: doh
Severity: -- → S3
Flags: needinfo?(valentin.gosu) → needinfo?(mayankleoboy1)
Priority: -- → P2
Whiteboard: [necko-triaged][necko-priority-review]

(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)

> On my machine, nextDNS returns the same IPs as native DNS, but that may not be the case for everyone.
> @Mayank, are you using the default nextDNS TRR: https://firefox.dns.nextdns.io/ or a custom one?

In my about:config, network.trr.uri=https://firefox.dns.nextdns.io/

Also, I tried to repro just now, and the issue seems to be fixed.
Will attach a log.

Flags: needinfo?(mayankleoboy1) → needinfo?(valentin.gosu)
Attached file Logging_23jan2022.zip (deleted)

Thanks Mayank. I'll close this as WORKSFORME.
Please reopen if you find it's happening again.

Status: NEW → RESOLVED
Closed: 2 years ago
Flags: needinfo?(valentin.gosu)
Resolution: --- → WORKSFORME