Make sure we only set AddrHostRecord::mTRRUsed = true when we actually send the request
Categories
(Core :: Networking: DNS, defect, P2)
Tracking
()
People
(Reporter: valentin, Assigned: valentin)
References
(Blocks 1 open bug)
Details
(Whiteboard: [necko-triaged][trr])
Attachments
(2 files)
(deleted),
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details |
(deleted),
text/x-phabricator-request
|
RyanVM
:
approval-mozilla-beta+
|
Details |
Previously we'd set mTRRUsed to true here even if no TRR lookups actually happened.
This has the potential to screw with our telemetry data.
We should instead make sure this is only set when we actually write the bytes to the socket.
Assignee | ||
Comment 1•4 years ago
|
||
Comment 2•4 years ago
|
||
Looking at the usage of mTRRUsed it may be that is a bit overloaded.
mTRRUsed is used for telemetry and in my opinion it is fine as it is now.
mTRRUsed is also used for BlockList hostnames: in this case we may need a more fine grain distinguish: blocklist if server returns an error, but do not block if connection to the TRR server fails.
Comment 3•4 years ago
|
||
One more note: our blockList is not really very broken, because it checks that the corresponding native lookup has succeeded. We only wrongly block something if TRR server is blocked (and maybe in some corner cases that should be rare, e.g. network comes up in between TRR and native request)
Assignee | ||
Comment 4•4 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #3)
One more note: our blockList is not really very broken, because it checks that the corresponding native lookup has succeeded. We only wrongly block something if TRR server is blocked (and maybe in some corner cases that should be rare, e.g. network comes up in between TRR and native request)
I mentioned this in https://phabricator.services.mozilla.com/D81517#2499399 but I should probably explain it further.
I think it's broken, especially when IPv6 support is missing, because we will call into TrrLookup for an AAAA type, set mTRRUsed = true, exit because there's no IPv6 support, do a native lookup which succeeds, then blocklist the domain because it thinks that the TRR request failed, even though we didn't perform one.
This would explain a large part of the missing TRR records.
Comment 5•4 years ago
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)
(In reply to Dragana Damjanovic [:dragana] from comment #3)
One more note: our blockList is not really very broken, because it checks that the corresponding native lookup has succeeded. We only wrongly block something if TRR server is blocked (and maybe in some corner cases that should be rare, e.g. network comes up in between TRR and native request)
I mentioned this in https://phabricator.services.mozilla.com/D81517#2499399 but I should probably explain it further.
I think it's broken, especially when IPv6 support is missing, because we will call into TrrLookup for an AAAA type, set mTRRUsed = true, exit because there's no IPv6 support, do a native lookup which succeeds, then blocklist the domain because it thinks that the TRR request failed, even though we didn't perform one.This would explain a large part of the missing TRR records.
OK, that is probably true, but the patch is not doing the right think.
Comment 6•4 years ago
|
||
What we want to know:
Temporary exclude hostnames that:
- TRR request was send but TRR resolver could not resolve it (this mean that we have send a request and request and receive a response from the TRR server), (Question: should request that have timed out also be excluded? we should maybe try this out on nightly only)
- the corresponding native request succeeded
We need more telemetry:
- TRR was OK
- TRR responded but response was corrupted,
- TRR server responded with an error - host not found
- TRR request timed out
- connection to TRR server could not be established (we may collect a couple of errors here, e.g. connection reset, connection timeout, TRR server's hostname could not be resolved)
- a host is temporary excluded
- a host is permanently excluded (the list in the pref)
- ipv6 not perform because ipv6 is disabled.
Am I missing something?
we will need to separate the above probe for ipv4 and IPv6.
We already have some of this telemetry: DNS_TRR_SUCCESS, DNS_TRR_BLACKLISTED2, etc. it may be useful to combine them or at least audit them.
When analyzing telemetry we should exclude users that have TRR disabled because of the OS parental control and VPNs. We should also collect telemetry about that as well.
We should have telemetry about the highest number of host in the temporary exclusion list at any point in time during a session.
Comment 7•4 years ago
|
||
For the temporal exclusion list we may check that both ipv6 and ipv4 has failed.
Assignee | ||
Comment 8•4 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #5)
(In reply to Valentin Gosu [:valentin] (he/him) from comment #4)
This would explain a large part of the missing TRR records.
OK, that is probably true, but the patch is not doing the right think.
Could you explain why it is wrong?
Looking at where it is used: https://searchfox.org/mozilla-central/search?q=mTRRUsed&path=&case=false®exp=false
It seems all the sites want to check if we actually got a response back from TRR or not. Do you mean the check should be even more strict, as to exclude errors that may occur after we sent the request?
Assignee | ||
Updated•4 years ago
|
Comment 9•4 years ago
|
||
Sorry for late response.
First I have not seen your comment in phabricator when I wrote my comment above (discussion at too places...)
So I am ok in changing the behavior of TRRUsed if you change the telemetry as well. The comments in the patch and bugzilla did not say anything about that. I was saying it is wrong because it breaks the telemetry completely.
I would suggest to change the name of the variable, actually to add a new one so that telemetry is not corrupted. Also mTRRUsed is not anymore a good name for this variable.
You should double check if the timing information are set in the channel before OnStartRequest. Timing information are a bit weird and only used for diagnostics(not an critical functionality) so may be missing or wrong. If you are not sure that they are set, please propagate this information to the channel in a different way. If you want to test this please test it with http1.1 and http/2. Thinking of it ... maybe it is better to get this information in another way, because I am not sure but we may send the NS_NET_STATUS_SENDING_TO event when we actually do not send data because a connection the TRR server failed.
Updated•4 years ago
|
Assignee | ||
Comment 10•4 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #9)
Sorry for late response.
First I have not seen your comment in phabricator when I wrote my comment above (discussion at too places...)
So I am ok in changing the behavior of TRRUsed if you change the telemetry as well. The comments in the patch and bugzilla did not say anything about that. I was saying it is wrong because it breaks the telemetry completely.
I would suggest to change the name of the variable, actually to add a new one so that telemetry is not corrupted. Also mTRRUsed is not anymore a good name for this variable.
You should double check if the timing information are set in the channel before OnStartRequest.
OnStartRequest is emited when we have responseHeaders, so surely the request has already been sent?
Timing information are a bit weird and only used for diagnostics(not an critical functionality) so may be missing or wrong.
We enable timing here
If you are not sure that they are set, please propagate this information to the channel in a different way. If you want to test this please test it with http1.1 and http/2. Thinking of it ... maybe it is better to get this information in another way, because I am not sure but we may send the NS_NET_STATUS_SENDING_TO event when we actually do not send data because a connection the TRR server failed.
I don't know whether we need to check http1.1 - we should never actually use that, right?
I was considering checking the status code of the channel in OnStartRequest, but that would not work if we did send the request, but the channel times out waiting for a response. 🙁
I'll also write some tests that the telemetry looks as we expect it to.
Comment 11•4 years ago
|
||
Comment 12•4 years ago
|
||
Backed out changeset 16ff0a677ab2 for causing failures in nsHostResolver.cpp
Backout link: https://hg.mozilla.org/integration/autoland/rev/1d15ac60a6131ed85f52620d99a4e96baea7a7ef
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=308531602&repo=autoland&lineNumber=3344
Assignee | ||
Comment 13•4 years ago
|
||
Due to a change in timing in this patch, when we reset the confirmation pref
at the end of the test, a TRR request would happen after we changed the prefs
leading to accessing a non-local IP in testing and causing a crash.
This should be gated on being in the correct mode anyway
Depends on D81517
Comment 14•4 years ago
|
||
Comment 15•4 years ago
|
||
Backed out for perma failures.
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=308575218&repo=autoland&lineNumber=3350
Backout: https://hg.mozilla.org/integration/autoland/rev/64c9f376e6a24b69459d28be752a4edd9bd21514
Comment 17•4 years ago
|
||
Comment 18•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/6054a345cd86
https://hg.mozilla.org/mozilla-central/rev/ad363ca253e7
Comment 19•4 years ago
|
||
Comment on attachment 9160048 [details]
Bug 1649127 - Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana
Beta/Release Uplift Approval Request
- User impact if declined: TRR is incorrectly disabled for some host names.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: Yes
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This flag only controls whether to temporary disable TRR for a host name. It does not change functionaly in the way that could break DNS resolution. It will just decrease the number of hostnames for which TRR is temporary disabled.
- String changes made/needed: none
Updated•4 years ago
|
Comment 20•4 years ago
|
||
Comment on attachment 9160048 [details]
Bug 1649127 - Make sure we only set AddrHostRecord::mTRRUsed = true when TRRServiceChannel::AsyncOpen succeeds r=dragana
Approved for 79.0b7.
Updated•4 years ago
|
Comment 21•4 years ago
|
||
bugherder uplift |
https://hg.mozilla.org/releases/mozilla-beta/rev/5dcf654c68c9
https://hg.mozilla.org/releases/mozilla-beta/rev/e53d58def7d8
Updated•4 years ago
|
Updated•4 years ago
|
Comment 22•4 years ago
|
||
Hello,
Can you please provide some steps for QA to verify this issue?
Thank you!
Assignee | ||
Comment 23•4 years ago
|
||
I'm not sure manual verification is really possible with this.
Updated•4 years ago
|
Description
•