Open Bug 1355903 Opened 8 years ago Updated 2 years ago

Re-enable Certificate Transparency telemetry collection

Categories

(Core :: Security: PSM, defect, P3)

defect

Tracking

()

People

(Reporter: jcj, Unassigned)

References

(Depends on 2 open bugs, Blocks 1 open bug)

Details

(Whiteboard: [psm-blocked])

Bug 1353216 shut off certificate transparency telemetry collection due to a performance regression. This bug should turn it back on after performance is improved.
Depends on: 1355993
sorry for the drive-by partially informed comment - but given that this is just for reporting is this something that can be done asynchronously queued and processed at 'idle' time? (its been a while, but iirc firefox has an event for that) so the math doesn't impact responsiveness? or is there too much state tied up in the connection to practically copy/addref for queueing?
(In reply to Patrick McManus [:mcmanus] from comment #1) > sorry for the drive-by partially informed comment - but given that this is > just for reporting is this something that can be done asynchronously queued > and processed at 'idle' time? (its been a while, but iirc firefox has an > event for that) so the math doesn't impact responsiveness? or is there too > much state tied up in the connection to practically copy/addref for queueing? It's a good idea, but an eventual goal is to have CT enforcement, too -- which would need to be synchronous to be effective.
Keeler and I discussed more what is needed to re-enable our CT telemetry collection: == Better Tests == Before we get to performance testing, I think we need more significant tests in-tree, as specified in bug 1361177. == Speedups == It seems that most of the performance regression was due to unnecessary cert chain validations, fixed in Bug 1356611. The unnecessary validations were made much worse by the slower-than-desired verification algorithm; Keeler filed Bug 1355993 to address speeding that up. It should be possible to get some performance increase, but given the magnitude of the fix from Bug 1356611, we _might not need_ the algorithm speedup to proceed. We'd optimally like to see a Talos test for HTTPS that could show a regression during CT (Bug 1356073), but perhaps we can proceed with a plan that checks telemetry instead: == Speedup Test Plan == When we disabled CT, the SSL_SUCCESFUL_CERT_VALIDATION_TIME_MOZILLAPKIX telemetry metric dropped precipitously [1] (the drop corresponds to the disable date) [2]. In-code, this metric is firmly tied to the CT performance. I think it's OK to move on without the Talos test and verify whether our performance is OK using Nightly telemetry. If slows down more than a little, we'd want to wait on Bug 1355993 to proceed. == Recommendation Summary == 1. We need the tests in bug 1361177. 2. Then we can re-enable using this bug and monitor SSL_SUCCESFUL_CERT_VALIDATION_TIME_MOZILLAPKIX. 2.a. If performance slows down by an unacceptable amount, we back-out the enable commit. 2.b. If we back-out, then we block on Bug 1355993 to re-try. [1] https://i.have.insufficient.coffee/ct-validation-time-fix.png [2] https://mzl.la/2qKALHN
Depends on: 1361177
After further discussion, I am amending my recommendation that we only re-enable after we also have Talos tests, and marking this bug as being blocked on Bug 1356073: == Recommendation Summary (Revised) == 1. We need the tests in bug 1361177. 2. We also need the Talos test in bug 1356073 to monitor for HTTPS performance regressions. 3. Then we can re-enable using this bug and monitor SSL_SUCCESFUL_CERT_VALIDATION_TIME_MOZILLAPKIX in Nightly. 4.a. If performance slows down by an unacceptable amount, we back-out the enable commit. 4.b. If we back-out, then we block on Bug 1355993 before re-trying.
Depends on: 1356073
Priority: -- → P3
Whiteboard: [psm-blocked]
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.