Using PCA for Population Structure Inference
Author Information
Author(s): Lee Chih, Abdool Ali, Huang Chun-Hsi
Primary Institution: University of Connecticut
Hypothesis
Can PCA and generic clustering algorithms effectively infer population structure from genotype data?
Conclusion
The proposed PCA-based approach is faster and more scalable than the traditional STRUCTURE algorithm for population structure inference.
Supporting Evidence
- PCA reduced the number of variables from around 5,000 to at most 70.
- Soft K-means performed comparably to STRUCTURE on the distant dataset.
- The BIC score produced the same predictions as STRUCTURE on simulated datasets.
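To make the soft K-means result above more concrete, here is a minimal sketch of the general technique in plain Python. It is not the authors' implementation: the function name soft_kmeans, the stiffness parameter beta, the starting centers, and the toy data are all assumptions chosen for illustration. Each point receives a degree of membership (a "responsibility") in every cluster instead of a single hard label.

```python
import math


def soft_kmeans(points, centers, beta=1.0, n_iter=50):
    """Soft K-means sketch: every point gets a membership weight per cluster.

    points  : list of equal-length coordinate lists (e.g. PC scores)
    centers : list of initial cluster centers (hypothetical starting values)
    beta    : stiffness (assumed name); larger values approach hard K-means
    """
    centers = [list(c) for c in centers]
    resp = []
    for _ in range(n_iter):
        # Assignment step: responsibilities proportional to
        # exp(-beta * squared distance to each center).
        resp = []
        for p in points:
            d2 = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            d_min = min(d2)  # subtract the minimum for numerical stability
            w = [math.exp(-beta * (d - d_min)) for d in d2]
            total = sum(w)
            resp.append([v / total for v in w])
        # Update step: each center becomes the responsibility-weighted mean.
        for k in range(len(centers)):
            weight = sum(r[k] for r in resp)
            if weight > 0:
                centers[k] = [
                    sum(r[k] * p[j] for r, p in zip(resp, points)) / weight
                    for j in range(len(points[0]))
                ]
    # resp holds the responsibilities from the last assignment step.
    return centers, resp


# Toy data (made up): two well-separated groups in two dimensions.
points = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
centers, resp = soft_kmeans(points, centers=[[0.0, 0.0], [1.0, 1.0]], beta=2.0)

# A hard label can be recovered as the cluster with the largest responsibility.
labels = [max(range(len(r)), key=r.__getitem__) for r in resp]
```

On real genotype data the inputs would be the PC scores of each individual rather than raw coordinates, and beta would need tuning; both choices are left open here.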
Takeaway
This study shows that PCA can help group individuals by their genetic data quickly, which is useful for understanding population structure.
Methodology
The study used PCA to reduce the dimensionality of the genotype data, then applied K-means, soft K-means, and spectral clustering to infer population structure.
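The pipeline described above (PCA followed by a generic clustering algorithm) can be sketched roughly as follows. This is an illustrative example using NumPy, not the study's code: the helper names pca_scores and kmeans, the 0/1/2 genotype coding, the synthetic allele frequencies, and the fixed choice of two components are all assumptions. The study selected significant PCs using a p-value, which this sketch does not attempt to reproduce.

```python
import numpy as np


def pca_scores(genotypes, n_components):
    """Project individuals onto the top principal components via SVD."""
    centered = genotypes - genotypes.mean(axis=0)  # center each marker
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def kmeans(x, k, n_iter=100, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per row of x."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows chosen at random.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every row to every center, then nearest center.
        dist = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels


# Synthetic example: 2 groups of 30 individuals, 200 markers coded 0/1/2.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=200)  # hypothetical allele frequencies
freq_b = rng.uniform(0.1, 0.9, size=200)
genotypes = np.vstack([
    rng.binomial(2, freq_a, size=(30, 200)),
    rng.binomial(2, freq_b, size=(30, 200)),
]).astype(float)

scores = pca_scores(genotypes, n_components=2)  # 60 x 2 instead of 60 x 200
labels = kmeans(scores, k=2)
```

Clustering the low-dimensional scores rather than the raw marker matrix is what keeps the clustering step cheap; swapping kmeans for soft K-means or spectral clustering would leave the PCA step unchanged.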
Potential Biases
The p-value threshold chosen for selecting significant PCs may bias the clustering results.
Limitations
The study's results may be affected by noisy and non-informative principal components.
Participant Demographics
The study included 1,064 individuals from 51 populations.
Statistical Information
P-Value
0.05
Statistical Significance
p<0.05