Re-identification of Home Addresses from Anonymized Spatial Data
Author Information
Author(s): Cassa Christopher A, Wieland Shannon C, Mandl Kenneth D
Primary Institution: Children's Hospital Informatics Program, Children's Hospital Boston
Hypothesis
Can multiple anonymized versions of the same data set be used to re-identify original geographic locations?
Conclusion
Multiple versions of the same data, each anonymized by Gaussian skew, can be used to ascertain original geographic locations.
Supporting Evidence
- With ten anonymized copies, the average distance from the re-identified address to the original decreased from 0.7 km to 0.2 km.
- With fifty anonymized copies, the average distance decreased from 0.7 km to 0.1 km.
- The study demonstrates that averaging multiple anonymized data sets can significantly weaken privacy protections.
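The reported distances are roughly what a simple averaging argument predicts. The derivation below is a sketch under assumptions not stated in this summary: each anonymized copy displaces the true location x_0 by independent, zero-mean Gaussian noise with the same per-axis standard deviation sigma, and the attacker takes the plain mean of n copies.

```latex
% Assumed model: X_i = x_0 + \varepsilon_i, with \varepsilon_i \sim \mathcal{N}(0, \sigma^2 I_2), independent across copies.
\begin{align*}
\bar{X}_n &= \frac{1}{n}\sum_{i=1}^{n} X_i
           = x_0 + \frac{1}{n}\sum_{i=1}^{n} \varepsilon_i, \\
\bar{X}_n - x_0 &\sim \mathcal{N}\!\left(0,\ \frac{\sigma^2}{n} I_2\right), \\
\mathbb{E}\,\lVert \bar{X}_n - x_0 \rVert &= \sigma\sqrt{\frac{\pi}{2n}}
           = \frac{1}{\sqrt{n}}\;\mathbb{E}\,\lVert X_1 - x_0 \rVert .
\end{align*}
% With a single-copy mean displacement of 0.7 km:
%   n = 10:  0.7 / \sqrt{10} \approx 0.22 km
%   n = 50:  0.7 / \sqrt{50} \approx 0.10 km
```

Under these assumptions the expected error shrinks in proportion to 1/sqrt(n), which is consistent with the 0.2 km and 0.1 km figures listed above.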
Takeaway
If you have many copies of a secret address, each shifted a little at random, you can average them to figure out where the real address is.
Methodology
The study generated 10,000 geocoded patient addresses, anonymized them using Gaussian and uniform skew methods, and averaged multiple anonymized versions of each address to assess re-identification risk.
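The averaging attack described here can be illustrated with a small simulation. This is a minimal sketch, not the authors' code: it covers only the Gaussian skew, it uses hypothetical planar coordinates in kilometres rather than real geocoded addresses, and the function names, the `sigma_km` value (chosen so a single copy is displaced about 0.7 km on average), and the number of simulated addresses are all assumptions made for illustration.

```python
import math
import random


def skew_point(x, y, sigma_km, rng):
    """Return an anonymized copy of (x, y), shifted by independent
    Gaussian noise on each axis (assumed form of the Gaussian skew)."""
    return x + rng.gauss(0.0, sigma_km), y + rng.gauss(0.0, sigma_km)


def reidentify(copies):
    """Estimate the original location as the centroid of the copies."""
    n = len(copies)
    mean_x = sum(p[0] for p in copies) / n
    mean_y = sum(p[1] for p in copies) / n
    return mean_x, mean_y


def mean_error_km(n_copies, n_addresses=2000, sigma_km=0.56, seed=0):
    """Average distance between the true and re-identified locations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_addresses):
        # Hypothetical true location inside a 10 km x 10 km square.
        x, y = rng.uniform(0.0, 10.0), rng.uniform(0.0, 10.0)
        copies = [skew_point(x, y, sigma_km, rng) for _ in range(n_copies)]
        est_x, est_y = reidentify(copies)
        total += math.hypot(est_x - x, est_y - y)
    return total / n_addresses


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(f"{n:>2} copies: mean error {mean_error_km(n):.2f} km")
```

Running the script prints the mean error for 1, 10, and 50 copies. Under the stated assumptions the error should fall roughly in proportion to one over the square root of the number of copies.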
Potential Biases
Bias may arise from the specific anonymization methods used and from the assumptions made about data access.
Limitations
The study primarily focuses on two anonymization methods and may not account for other potential vulnerabilities.
Participant Demographics
The study used artificially generated geocoded values for patients in Boston, MA.
Digital Object Identifier (DOI)