Which Is Better: Holdout or Full-Sample Classifier Design?
Author Information
Author(s): Marcel Brun, Xu Qian, Edward R Dougherty
Primary Institution: Translational Genomics Research Institute
Hypothesis
Is it better to design a classifier and estimate its error on the full sample or to design a classifier on a training subset and estimate its error on the holdout test subset?
Conclusion
Full-sample design consistently outperforms holdout design.
Supporting Evidence
- Full-sample design provides better classifiers than holdout design.
- Holdout error estimation generally yields a higher expected bound on the classifier's error than full-sample error estimators do.
- The study uses a variety of classification rules including 3-nearest neighbor and linear discriminant analysis.
Takeaway
This study compares two ways to build and test a classifier, a computer program that sorts data into categories: using all the data for both steps, or splitting the data into one part for building the classifier and one part for testing it. It finds that using all the data is usually better.
Methodology
The study uses simulations to compare full-sample and holdout designs across various classification rules and data models.
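The comparison can be illustrated with a small simulation. The sketch below is not the authors' code or data model: it assumes two Gaussian classes with identity covariance, a hand-written linear discriminant analysis rule, a sample of 40 points, a 50/50 holdout split, and resubstitution as the full-sample error estimator. All function names and parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def sample(rng, n, dim=5, delta=1.0):
    """Draw n points with alternating labels (assumed model, not the paper's).
    Class 0 ~ N(0, I); class 1 ~ N(delta * 1, I)."""
    y = np.arange(n) % 2
    x = rng.standard_normal((n, dim)) + delta * y[:, None]
    return x, y

def fit_lda(x, y):
    """Fit LDA with a pooled covariance; classify as 1 when x @ w + b > 0."""
    m0, m1 = x[y == 0].mean(axis=0), x[y == 1].mean(axis=0)
    centered = np.vstack([x[y == 0] - m0, x[y == 1] - m1])
    cov = centered.T @ centered / (len(x) - 2)
    w = np.linalg.solve(cov, m1 - m0)
    b = -w @ (m0 + m1) / 2
    return w, b

def error_rate(w, b, x, y):
    """Fraction of points in (x, y) that the linear rule misclassifies."""
    return float(np.mean((x @ w + b > 0).astype(int) != y))

def compare(n=40, holdout_fraction=0.5, reps=200, seed=0):
    """Average true and estimated errors for both designs over many samples."""
    rng = np.random.default_rng(seed)
    # A large independent sample stands in for the true error.
    x_big, y_big = sample(rng, 20000)
    n_train = int(n * (1 - holdout_fraction))
    out = {"full_true": [], "full_resub": [], "hold_true": [], "hold_est": []}
    for _ in range(reps):
        x, y = sample(rng, n)
        # Full-sample design: train on everything, estimate by resubstitution.
        w, b = fit_lda(x, y)
        out["full_true"].append(error_rate(w, b, x_big, y_big))
        out["full_resub"].append(error_rate(w, b, x, y))
        # Holdout design: train on the first part, estimate on the rest.
        w, b = fit_lda(x[:n_train], y[:n_train])
        out["hold_true"].append(error_rate(w, b, x_big, y_big))
        out["hold_est"].append(error_rate(w, b, x[n_train:], y[n_train:]))
    return {k: float(np.mean(v)) for k, v in out.items()}

if __name__ == "__main__":
    for name, value in compare().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the holdout classifier sees only half the sample, so its average true error should come out higher than that of the full-sample classifier. The holdout estimate is unbiased for the classifier it tests, whereas resubstitution tends to be optimistic, which is why the study weighs classifier quality against estimator quality rather than either alone.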
Potential Biases
Error estimates may be biased by how the data are split into training and test subsets.
Limitations
The study relies primarily on simulated data and may not fully capture real-world complexities.
Participant Demographics
The study includes data from 295 breast cancer patients, with 115 in the 'good prognosis' class and 180 in the 'poor prognosis' class.