Sample Size Requirements for Popular Classification Algorithms in Tabular Clinical Data: Empirical Study
2024

Sample Size Requirements for Classification Algorithms in Clinical Data

Sample size: 16 | publication | 10 minutes | Evidence: moderate

Author Information

Author(s): Tsafnat Guy, Figueroa Rosa, Gad Ahmed G, Hasan Md Rakibul, Silvey Scott MS, Liu Jinze PhD

Primary Institution: Virginia Commonwealth University

Hypothesis

What are the optimal sample sizes for various binary classification algorithms in clinical datasets?

Conclusion

The sample sizes needed to reach AUC stability among four popular classification algorithms vary by dataset and method and are associated with dataset-level characteristics.

Supporting Evidence

  • XGBoost performed the best on 14 out of 16 datasets.
  • Logistic Regression required the smallest sample size to reach stability.
  • More balanced classes were associated with smaller required sample sizes.
  • Dataset nonlinearity increased the expected sample sizes.

Takeaway

This study helps researchers estimate how many samples they need when applying different classification algorithms to predict health outcomes.

Methodology

The study analyzed 16 large open-source datasets using four machine learning algorithms to determine optimal sample sizes for achieving stable AUC.
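As an illustration of what "stable AUC" can mean in practice, here is a minimal sketch in plain Python. It is not the authors' code. It assumes a learning curve has already been measured (AUC at increasing training sizes) and defines stability as staying within a tolerance of the AUC at the largest size. The function names `auc` and `stable_sample_size`, the tolerance of 0.02, and the example numbers are all assumptions made for illustration.

```python
from typing import Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outranks a random
    negative (ties count as 0.5). Labels are 0 or 1."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def stable_sample_size(
    sizes: Sequence[int],
    aucs: Sequence[float],
    tolerance: float = 0.02,  # assumed threshold, chosen for illustration
) -> int:
    """Smallest training size from which every later AUC stays within
    `tolerance` of the AUC at the largest size. `sizes` must be ascending."""
    if not sizes or len(sizes) != len(aucs):
        raise ValueError("sizes and aucs must be non-empty and equal length")
    final_auc = aucs[-1]
    stable = sizes[-1]
    # Walk backwards from the largest size until the curve leaves the band.
    for n, a in zip(reversed(sizes), reversed(aucs)):
        if abs(final_auc - a) > tolerance:
            break
        stable = n
    return stable


if __name__ == "__main__":
    # Hypothetical learning curve for one algorithm on one dataset.
    sizes = [100, 500, 1000, 5000, 10000]
    aucs = [0.70, 0.78, 0.815, 0.825, 0.83]
    print(stable_sample_size(sizes, aucs))
```

Running this procedure once per algorithm and dataset would give one required sample size per pair, which could then be compared against dataset-level characteristics such as class balance.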

Limitations

The study was limited to 16 datasets, so its findings may not generalize to datasets with more features or other types of data.

Statistical Information

P-Value

p<0.05

Statistical Significance

p<0.05

Digital Object Identifier (DOI)

10.2196/60231
