Synthetic Datasets for Understanding Data Breaches
Author Information
Author(s): Abhishek Sharma, May Bantan
Primary Institution: Indian Institute of Technology Madras, India
Hypothesis
Can synthetic datasets help researchers understand the risks associated with data breaches?
Conclusion
The study successfully generated synthetic datasets that can be used to analyze the implications of data breaches without compromising real personal information.
Supporting Evidence
- The synthetic datasets allow for ethical analysis of data breaches without using real sensitive information.
- Generated datasets can help map relationships between data and enhance understanding of data breach implications.
- The study provides a baseline for future research directions in data breach analysis.
Takeaway
This study created fake data that looks real to help people understand how hackers can misuse personal information without using actual sensitive data.
Methodology
The authors used the Faker Python library to generate synthetic profiles of 4 million unique individuals, then built scenario-based datasets representing various data breach incidents.
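A minimal sketch of this kind of pipeline is shown below. It is not the authors' code. The field list, the function names make_profiles and make_breach_scenario, and the idea of building each breach scenario by sampling a fraction of the base profiles and exposing only some fields are all assumptions made for illustration. Faker is a third-party package (pip install Faker).

```python
import random


def make_profiles(n, seed=0):
    """Generate n synthetic US-style profiles with Faker.

    The chosen fields are illustrative assumptions, not the study's schema.
    """
    # Imported lazily so make_breach_scenario below works without Faker installed.
    from faker import Faker

    fake = Faker("en_US")       # US locale: the study's datasets use US-based addresses
    fake.seed_instance(seed)    # reproducible output for this generator
    profiles = []
    for i in range(n):
        profiles.append({
            "id": i,                                          # stable key for linking datasets
            "name": fake.name(),
            "address": fake.address().replace("\n", ", "),  # flatten multi-line address
            "email": fake.unique.email(),                     # .unique avoids duplicate emails
            "phone": fake.phone_number(),
            "ssn": fake.ssn(),
        })
    return profiles


def make_breach_scenario(profiles, leaked_fields, fraction, seed=0):
    """Simulate one breach (hypothetical helper).

    A random fraction of the profiles is affected, and only the listed
    fields (plus the id) are exposed.
    """
    rng = random.Random(seed)
    k = int(len(profiles) * fraction)
    victims = rng.sample(profiles, k)
    return [{f: p[f] for f in ("id", *leaked_fields)} for p in victims]
```

For example, make_breach_scenario(make_profiles(1000), ["email", "phone"], 0.1) would return 100 records containing only id, email, and phone. Keeping a shared id across scenarios is one way to let an analyst join several simulated breaches and see how much of a profile can be reassembled.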
Potential Biases
The dataset may be biased because it excludes diverse geographic and demographic information.
Limitations
The datasets are limited to US-based addresses and do not include gender or GPS coordinates to avoid complications.
Participant Demographics
The dataset consists of synthetic profiles of individuals with various personally identifiable information (PII) attributes.