Predicting disease risks from highly imbalanced data using random forest
2011

Predicting Disease Risks Using Random Forest

Sample size: 8,000,000 | Publication | 10 minutes | Evidence: high

Author Information

Author(s): Mohammed Khalilia, Sounak Chakraborty, Mihail Popescu

Primary Institution: University of Missouri

Hypothesis

Can random forest classifiers effectively predict disease risks from imbalanced healthcare data?

Conclusion

The random forest method successfully predicted eight disease categories with an average AUC of 88.79%, overcoming class imbalance issues.

Supporting Evidence

  • The random forest method outperformed SVM, bagging, and boosting in terms of AUC when predicting disease risks.
  • Using repeated random sub-sampling improved the detection rate for diseases in the dataset.
  • The average AUC across the disease categories was about 89%, indicating that the models separate at-risk patients from others well.
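
For readers unfamiliar with AUC, the following minimal sketch shows one way to compute it and to average it across disease categories. It uses the pairwise definition: the AUC is the fraction of (positive, negative) patient pairs in which the positive patient gets the higher score. The function names, disease names, labels, and scores are hypothetical illustrations, not the study's code or data.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for patients with the disease, 0 otherwise.
    scores: predicted risk, higher means more likely to have the disease.
    Quadratic in the number of patients, so suitable for illustration only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


def mean_auc(per_disease):
    """Average the AUC over a dict of {disease: (labels, scores)}."""
    values = [auc(labels, scores) for labels, scores in per_disease.values()]
    return sum(values) / len(values)


# Hypothetical toy data for two disease categories.
toy = {
    "disease_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "disease_b": ([0, 1], [0.2, 0.9]),
}
average = mean_auc(toy)
```

Unlike plain accuracy, this measure does not reward a model for always predicting the majority class, which is one reason it is commonly reported for imbalanced data.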

Takeaway

This study shows that we can use computer programs to help predict diseases based on patient data, even when some diseases are very rare.

Methodology

The study trained random forest classifiers on Nationwide Inpatient Sample (NIS) data from the Healthcare Cost and Utilization Project (HCUP), using repeated random sub-sampling to address class imbalance.
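
Repeated random sub-sampling can be sketched as follows. This is one common reading of the technique, and the study's exact partitioning may differ. The sketch assumes label 1 marks the rare class, builds several balanced training subsets (each holding all rare-class records plus an equally sized random chunk of the common class), and combines per-subset model predictions by majority vote. The helper names are hypothetical, and model training itself is omitted; in practice each subset would be used to fit one classifier, such as a random forest.

```python
import random


def balanced_subsamples(labels, seed=0):
    """Return lists of record indices, each balanced between the two classes.

    Assumption: label 1 is the minority class and label 0 the majority class.
    Each sub-sample holds every minority index plus a non-overlapping random
    chunk of majority indices of the same size. Leftover majority records
    that cannot fill a whole chunk are dropped.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    if not minority or len(majority) < len(minority):
        raise ValueError("expected a non-empty minority class smaller than the majority")
    rng.shuffle(majority)
    k = len(minority)
    subsamples = []
    for start in range(0, len(majority) - k + 1, k):
        subsamples.append(minority + majority[start:start + k])
    return subsamples


def majority_vote(votes):
    """Combine 0/1 predictions from the per-sub-sample models for one patient.

    Returns 1 only when strictly more than half of the models vote 1.
    """
    return int(2 * sum(votes) > len(votes))


# Hypothetical toy labels: 3 patients with the disease, 10 without.
labels = [1] * 3 + [0] * 10
subsamples = balanced_subsamples(labels, seed=42)

# One classifier would be trained per sub-sample; here we only combine
# three made-up predictions for a single new patient.
decision = majority_vote([1, 0, 1])
```

Because every subset is balanced, no single model is dominated by the common class, while the ensemble as a whole still sees most of the majority-class records.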

Potential Biases

The use of historical data without patient identifiers may introduce bias.

Limitations

The dataset lacks chronological order of diagnoses and does not provide anonymous patient identifiers.

Participant Demographics

The dataset includes demographic variables such as age, race, and sex.

Statistical Information

P-Value

0.2

Statistical Significance

p<0.05

Digital Object Identifier (DOI)

10.1186/1472-6947-11-51
