Open
Bug 1192833
Opened 9 years ago
Updated 9 years ago
(k-randomization reporting) [META] Investigate k-rounds randomization reporting
Categories
(Content Services Graveyard :: General, defect)
Tracking
(Not tracked)
NEW
Iteration: 43.2 - Sep 7
People
(Reporter: mzhilyaev, Assigned: mzhilyaev)
References
(Blocks 4 open bugs)
Details
(Whiteboard: [story])
Investigate the privacy and extraction-precision aspects of a user data collection scheme where users report RR-randomized categorical vectors k times. The reports are then anonymized and mixed. This scheme allows a sqrt(k) reduction in noise. Its privacy guarantee also appears to be proportional to L*k/N, where N is the number of distinct users and L is the length of the categorical vector.
For large audiences this method promises to deliver accurate user aggregates without significant loss of privacy.
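A minimal sketch of the reporting step, assuming basic randomized response where each bit is reported as 1 with probability p if the true bit is 1 and with probability q if it is 0; the function names and parameter values are illustrative, not taken from the design doc:

import random

def randomize(bits, p, q):
    # One RR round: each true bit is reported as 1 with probability p,
    # each false bit is reported as 1 with probability q.
    return [1 if random.random() < (p if b else q) else 0 for b in bits]

def report_k_rounds(bits, p, q, k):
    # A user submits k independently randomized copies of their vector;
    # the server then anonymizes and mixes reports across all N users.
    return [randomize(bits, p, q) for _ in range(k)]

example = report_k_rounds([1, 0, 0, 1, 0], p=0.75, q=0.25, k=3)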
Assignee
Comment 1•9 years ago
The resulting proofs are here:
https://docs.google.com/document/d/1_kquDsJfN_MTgBjo4mOAmxKdmiHurFryJGKGBsIVT04/edit?pli=1#
Noise is reduced by sqrt(k).
The privacy guarantee is L*k/N, where:
L - length of the reported bit vector
k - number of randomization rounds
N - number of distinct users
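A small simulation sketch of the sqrt(k) claim, assuming the RR scheme described above with illustrative parameters (p=0.75, q=0.25, a fixed cohort of 500 users); the estimator inverts the expected report rate r = t*p + (1-t)*q:

import random
import statistics

def estimate_rate(reports, p, q):
    # Unbiased inversion of RR: the observed rate is r = t*p + (1-t)*q,
    # so the estimate of the true rate t is (r - q) / (p - q).
    r = sum(reports) / len(reports)
    return (r - q) / (p - q)

def estimator_spread(bits, k, p, q, trials=200):
    # Empirical standard deviation of the estimate over repeated
    # randomizations of a fixed cohort of true bits.
    estimates = []
    for _ in range(trials):
        reports = []
        for b in bits:
            reports.extend(
                1 if random.random() < (p if b else q) else 0
                for _ in range(k)
            )
        estimates.append(estimate_rate(reports, p, q))
    return statistics.stdev(estimates)

bits = [1 if random.random() < 0.3 else 0 for _ in range(500)]
for k in (1, 4, 16):
    # Quadrupling k should roughly halve the spread: the sqrt(k) reduction.
    print(k, estimator_spread(bits, k, p=0.75, q=0.25))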
Assignee
Comment 2•9 years ago
The results listed above could be incorrect. I might have discovered an attack that would invalidate the privacy guarantee limits above.
Blocks: Sprint_CS_S1
Blocks: Sprint_CS_S2
Updated•9 years ago
Flags: needinfo?(ekr)
Assignee
Comment 3•9 years ago
Notes to reviewer:
The doc is mostly finished; the section on "Relaxing Privacy bounds" is still being worked on:
https://docs.google.com/document/d/1_kquDsJfN_MTgBjo4mOAmxKdmiHurFryJGKGBsIVT04/edit?pli=1#heading=h.3b41pubv2rqo
There seems to be a way to relax the required collection size significantly without opening the door to a high-cardinality attack. Stay tuned.
Assignee
Comment 4•9 years ago
The research is complete and ready for review.
Final version: https://docs.google.com/document/d/1_kquDsJfN_MTgBjo4mOAmxKdmiHurFryJGKGBsIVT04/edit?pli=1#heading=h.ivaimes7hdrg
Main findings:
L - length of the reported bit vector
k - number of randomization rounds
N - number of distinct users
p, q - parameters of randomization
1. Noise is reduced by sqrt(k).
2. Breaking point
If N*q^L >= 1, the observer can't determine the presence of an outlier: the privacy guarantee is 0 regardless of k.
3. Tipping point
If N >= (p/q)^L, the observer may suspect the collection contains a real outlier, but can't distinguish the randomized output of a real user from the noise of the collection, hence an attack against RR is impossible. The privacy guarantee is 2k, but the observer can't conduct the attack against the randomized responses of a distinct user. (Findings 2 and 3 are illustrated in the sketch after this list.)
4. When L is large, it's possible to decrease p to recover privacy and increase k to recover precision, enabling a prescribed precision without privacy loss.
5. Regardless of L, mutually exclusive categorical bit vectors occupy only 2 bits of privacy; categorical vectors with no more than S set bits occupy 2*S bits of privacy.
6. It's possible to collect search queries and user histories by randomizing string representations and running a dictionary attack against the randomized corpus of strings.
7. Privacy bounds can be further relaxed on a case-by-case basis to satisfy infrastructure cost limitations.
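A numeric sketch of the two thresholds in findings 2 and 3; the formulas are taken verbatim from this comment, but the concrete values of N, p, q, L below are illustrative only:

def breaking_point_holds(N, q, L):
    # Finding 2: if N * q^L >= 1, the observer can't determine
    # the presence of an outlier.
    return N * q**L >= 1

def tipping_point_holds(N, p, q, L):
    # Finding 3: if N >= (p/q)^L, a real outlier is hidden in the
    # collection's noise and the attack against RR is impossible.
    return N >= (p / q)**L

N, p, q, L = 10**6, 0.75, 0.25, 10
print(breaking_point_holds(N, q, L))    # 10**6 * 0.25**10 ~ 0.95 -> False
print(tipping_point_holds(N, p, q, L))  # (0.75/0.25)**10 = 59049 <= 10**6 -> True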
Assignee
Comment 5•9 years ago
(In reply to maxim zhilyaev from comment #4)
>
> 3. Tipping point
> If N >= (p/q)^L, the observer may suspect the collection contains a real
> outlier, but can't distinguish the randomized output of a real user from
> the noise of the collection, hence an attack against RR is impossible. The
> privacy guarantee is 2k, but the observer can't conduct the attack against
> the randomized responses of a distinct user.
>
Correction: the privacy guarantee is k*ln(2), not 2k.
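One way a k*ln(2) total could arise, sketched via sequential composition, assuming the per-report privacy loss for a single bit is the standard RR log-likelihood-ratio bound and the parameters are tuned so that bound equals ln(2); the values p=0.5, q=0.25 are an assumption for illustration, not taken from the doc:

import math

def per_report_epsilon(p, q):
    # Privacy loss of one RR report on a single bit: the worst-case
    # log-likelihood ratio over the two possible outputs.
    return max(math.log(p / q), math.log((1 - q) / (1 - p)))

p, q, k = 0.5, 0.25, 10
# max(ln(0.5/0.25), ln(0.75/0.5)) = ln(2), so k reports compose to k*ln(2).
print(k * per_report_epsilon(p, q), k * math.log(2))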
Updated•9 years ago
Flags: needinfo?(ekr)