Automated Approach to Harmonise Variable Names in Datasets
Author Information
Author(s): Xavier Bosch-Capblanch
Primary Institution: Swiss Tropical and Public Health Institute
Hypothesis
How can inconsistencies in variable names, labels, values, and value labels across datasets be resolved automatically to produce fully harmonised datasets?
Conclusion
Efficient and tested automated algorithms should be used to support the harmonisation process needed to analyse multiple datasets.
Supporting Evidence
- The algorithm achieved 100% sensitivity and specificity after a second iteration.
- The automated approach identified a DTP3 variable that was missing in other surveys.
- The program can process one variable in three to five seconds.
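For readers unfamiliar with the metrics in the first point, the sketch below shows how sensitivity and specificity could be computed for a variable search. It is an illustration only, not the study's evaluation code: the function name, the variable names, and the two "iterations" are invented, and it assumes a reference set of truly relevant variables is available.

```python
def sensitivity_specificity(all_vars, flagged, relevant):
    """Score a variable search against a reference set.

    all_vars: every variable name that was searched
    flagged:  names the search returned
    relevant: names that should have been returned (reference standard)
    """
    all_vars, flagged, relevant = set(all_vars), set(flagged), set(relevant)
    tp = len(flagged & relevant)             # relevant and found
    fn = len(relevant - flagged)             # relevant but missed
    fp = len(flagged - relevant)             # found but not relevant
    tn = len(all_vars - flagged - relevant)  # correctly left out
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical variable names, for illustration only.
all_vars = ["h7", "h7d", "v012", "s45", "v106"]
relevant = ["h7", "s45"]

# First pass: one relevant variable missed, one irrelevant variable flagged.
first = sensitivity_specificity(all_vars, ["h7", "v106"], relevant)

# Second pass, after refining the search terms: every relevant variable found.
second = sensitivity_specificity(all_vars, ["h7", "s45"], relevant)
```

Sensitivity falls when relevant variables are missed; specificity falls when irrelevant ones are flagged. Both reach 1.0 (100%) only when the flagged set equals the reference set.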
Takeaway
This study shows a way to automatically fix names and labels in data from different surveys so they can be compared easily.
Methodology
The study used automated algorithms to search for and harmonise variable names across multiple datasets.
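A minimal sketch of what a key-term search and renaming step of this kind might look like is given below. It is not the study's program: the function names, the dataset layout (a dictionary mapping each variable name to its label), the survey and variable names, and the rule of renaming only when exactly one candidate is found are all assumptions made for illustration.

```python
import re


def find_candidates(datasets, key_terms):
    """Return, per dataset, the variables whose name or label contains a key term."""
    if not key_terms:
        raise ValueError("at least one key term is required")
    # Case-insensitive match on any of the user-defined key terms.
    pattern = re.compile("|".join(re.escape(t) for t in key_terms), re.IGNORECASE)
    return {
        ds_name: [
            var for var, label in variables.items()
            if pattern.search(var) or pattern.search(label)
        ]
        for ds_name, variables in datasets.items()
    }


def build_rename_map(candidates, target_name):
    """Map the single candidate in each dataset to the harmonised name.

    Datasets with no candidate, or with several, are left out so that
    they can be reviewed by hand (an assumption of this sketch).
    """
    return {
        ds_name: {found[0]: target_name}
        for ds_name, found in candidates.items()
        if len(found) == 1
    }


# Hypothetical surveys: variable name -> variable label.
datasets = {
    "survey_a": {"h7": "Received DPT 3", "v012": "Respondent age"},
    "survey_b": {"s45c": "DTP3 vaccination", "v012": "Age of respondent"},
    "survey_c": {"v012": "Age"},
}

candidates = find_candidates(datasets, key_terms=["dtp3", "dpt 3"])
rename_map = build_rename_map(candidates, target_name="dtp3")
```

In this example the label "Received DPT 3" is found only because "dpt 3" is listed as a key term alongside "dtp3", which shows how strongly the result depends on the terms supplied.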
Limitations
The algorithm relies on user-defined key terms and may miss variables if those terms are not properly defined.