Predicting Disease Risks Using Random Forest
Author Information
Author(s): Khalilia Mohammed, Chakraborty Sounak, Popescu Mihail
Primary Institution: University of Missouri
Hypothesis
Can random forest classifiers effectively predict disease risks from imbalanced healthcare data?
Conclusion
The random forest method successfully predicted eight disease categories with an average AUC of 88.79%, overcoming class imbalance issues.
Supporting Evidence
- The random forest method outperformed other classifiers, such as support vector machines (SVM) and boosting, in predicting disease risks.
- Using repeated random sub-sampling improved the detection rate for diseases in the dataset.
- The average AUC across all diseases was about 89.05%, indicating that the classifiers separated patients with and without each disease well.
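For readers unfamiliar with the metric, AUC can be read as the probability that a randomly chosen patient with the disease receives a higher risk score than a randomly chosen patient without it. The sketch below computes it by direct pairwise comparison and then averages per-disease values. It is a minimal illustration, not the authors' code; the function names `auc` and `mean_auc` are hypothetical.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5).

    labels: 0/1 ground truth per record; scores: predicted risk per record.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative record")
    # Count positive/negative pairs where the positive is ranked higher.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(per_disease_auc):
    """Unweighted mean of per-disease AUC values (hypothetical helper)."""
    values = list(per_disease_auc)
    return sum(values) / len(values)
```

The pairwise form is quadratic in the number of records, so it suits small checks only; a rank-based computation would be preferable on a dataset of inpatient scale.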
Takeaway
This study shows that we can use computer programs to help predict diseases based on patient data, even when some diseases are very rare.
Methodology
The study used the National Inpatient Sample data to train random forest classifiers, employing repeated random sub-sampling to address data imbalance.
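The sub-sampling idea can be sketched as follows: draw several training subsets that each pair all records of the rare class with an equally sized random draw from the common class, fit one classifier (for example, a random forest) per subset, and combine their predictions by vote. This is a simplified illustration under assumptions: the function names, the number of rounds, and the tie-breaking rule are hypothetical, and the study's exact partitioning scheme may differ.

```python
import random


def balanced_subsamples(labels, n_rounds, seed=0):
    """Return n_rounds index lists, each with equal counts of both classes.

    labels: sequence of 0/1 values, one per record.
    Every sub-sample keeps all minority-class records and adds a fresh
    random draw of the same size from the majority class.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    subsamples = []
    for _ in range(n_rounds):
        drawn = rng.sample(majority, len(minority))  # without replacement
        subsamples.append(sorted(minority + drawn))
    return subsamples


def majority_vote(votes):
    """Combine 0/1 predictions from the per-subsample models for one record.

    Ties resolve to 0 (an assumption of this sketch).
    """
    return 1 if 2 * sum(votes) > len(votes) else 0


# Toy example: 5 positive records among 100.
labels = [1] * 5 + [0] * 95
subsamples = balanced_subsamples(labels, n_rounds=7)
# A classifier would be trained on the records at each index list, and
# majority_vote would merge the resulting predictions per test record.
```

Using an odd number of rounds avoids tied votes, which is why the toy example draws seven sub-samples.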
Potential Biases
The historical data carry no patient identifiers, so records from the same patient cannot be linked; this may bias the results.
Limitations
The dataset lacks chronological order of diagnoses and does not provide anonymous patient identifiers.
Participant Demographics
The dataset includes demographic variables such as age, race, and sex.
Statistical Information
P-Value
0.2
Statistical Significance
p<0.05