Simulating data breaches: Synthetic datasets for depicting personally identifiable information through scenario-based breaches
2024

Synthetic Datasets for Understanding Data Breaches

Sample size: 4,000,000 · publication · 10 minutes · Evidence: high

Author Information

Author(s): Abhishek Sharma, May Bantan

Primary Institution: Indian Institute of Technology Madras, India

Hypothesis

Can synthetic datasets help researchers understand the risks associated with data breaches?

Conclusion

The study successfully generated synthetic datasets that can be used to analyze the implications of data breaches without compromising real personal information.

Supporting Evidence

  • The synthetic datasets allow for ethical analysis of data breaches without using real sensitive information.
  • Generated datasets can help map relationships between data and enhance understanding of data breach implications.
  • The study provides a baseline for future research directions in data breach analysis.
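As a rough illustration of what mapping relationships between data can mean, the sketch below merges records from two breach dumps that share an email address. The function name `link_breaches`, the choice of email as the join key, and the sample records are all assumptions made for this example; they are not taken from the study.

```python
from collections import defaultdict


def link_breaches(*breaches, key="email"):
    """Merge records from several breach dumps that share the same key value.

    Hypothetical helper: the study does not prescribe this linkage method.
    """
    merged = defaultdict(dict)
    for records in breaches:
        for rec in records:
            # Later dumps add new fields (or overwrite equal ones) per person.
            merged[rec[key]].update(rec)
    return dict(merged)


# Hand-written, obviously fake records standing in for two breach scenarios.
retail = [
    {"email": "a@example.com", "name": "Alex Doe", "address": "1 Main St"},
]
clinic = [
    {"email": "a@example.com", "dob": "1990-01-01"},
    {"email": "b@example.com", "dob": "1985-05-05"},
]

combined = link_breaches(retail, clinic)
# Fields now known about one person once both dumps are joined.
exposed_fields = sorted(combined["a@example.com"])
```

Neither dump alone pairs a home address with a date of birth, but the joined record does, which is the kind of compounding exposure such datasets make it possible to study safely.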

Takeaway

This study created realistic fake data so that people can examine how hackers might misuse personal information, without exposing any actual sensitive data.

Methodology

Synthetic datasets were generated using the Faker Python library to create profiles of 4 million unique individuals, followed by scenario-based datasets representing various data breach incidents.

Potential Biases

The dataset may be biased because it excludes diverse geographic and demographic information.

Limitations

The datasets are limited to US-based addresses and, to avoid complications, omit gender and GPS coordinates.

Participant Demographics

The dataset consists of synthetic profiles of individuals, each with various personally identifiable information (PII) attributes.

Digital Object Identifier (DOI)

10.1016/j.dib.2024.111207
