Sample Size Requirements for Classification Algorithms in Clinical Data
Author Information
Author(s): Tsafnat Guy, Figueroa Rosa, Gad Ahmed G, Hasan Md Rakibul, Silvey Scott MS, Liu Jinze PhD
Primary Institution: Virginia Commonwealth University
Hypothesis
What are the optimal sample sizes for various binary classification algorithms in clinical datasets?
Conclusion
The sample sizes needed to reach AUC stability vary by dataset and by method across four popular classification algorithms, and they are associated with dataset-level characteristics.
Supporting Evidence
- XGBoost performed best on 14 of the 16 datasets.
- Logistic Regression required the smallest sample size to reach AUC stability.
- More balanced classes were associated with smaller required sample sizes.
- Greater dataset nonlinearity was associated with larger expected sample sizes.
Takeaway
This study helps researchers estimate how many samples they need when using different algorithms to predict health outcomes accurately.
Methodology
The study analyzed 16 large open-source datasets using four machine learning algorithms to determine optimal sample sizes for achieving stable AUC.
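The idea of a sample size at which AUC becomes stable can be illustrated with a minimal sketch. It is not the study's actual procedure: the stability rule used here (AUC stays within a fixed tolerance of the AUC at the largest sample size), the tolerance of 0.02, the function names, and the learning-curve numbers are all assumptions made up for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def auc_score(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for binary labels (0/1) via the rank-sum identity.

    Tied scores receive the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    rank_sum_pos = 0.0
    n_pos = 0
    i = 0
    while i < n:
        # Find the run of items sharing the same score.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        pos_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * pos_in_run
        n_pos += pos_in_run
        i = j
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC requires at least one positive and one negative")
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def stable_sample_size(
    curve: List[Tuple[int, float]], tolerance: float = 0.02
) -> Optional[int]:
    """Smallest n from which every larger n stays within `tolerance`
    of the AUC at the largest n. `curve` must be sorted by n.

    Assumed (hypothetical) stability rule, not taken from the study.
    """
    final_auc = curve[-1][1]
    stable_n = None
    for n, auc in reversed(curve):
        if abs(final_auc - auc) <= tolerance:
            stable_n = n
        else:
            break
    return stable_n


# Toy check of the AUC helper on four hand-made predictions.
toy_auc = auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])

# Invented learning curve: (training sample size, test AUC).
toy_curve = [(100, 0.70), (500, 0.78), (1000, 0.815), (5000, 0.825), (10000, 0.83)]
toy_stable_n = stable_sample_size(toy_curve, tolerance=0.02)
```

In practice each point on the curve would come from training a classifier on a subsample of that size and scoring it on held-out data, ideally averaged over repeated subsamples. The tolerance is a design choice that trades a smaller sample size against a larger accepted loss in AUC.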
Limitations
The study was limited to 16 datasets, so its findings may not generalize to datasets with more features or different types of data.
Statistical Information
P-Value
p<0.05
Statistical Significance
p<0.05