ParaKMeans: A Fast K-means Clustering Software for Laboratories
Author Information
Author(s): Kraj Piotr, Sharma Ashok, Garge Nikhil, Podolsky Robert, McIndoe Richard A
Primary Institution: Center for Biotechnology and Genomic Medicine, Medical College of Georgia, Augusta, GA, USA
Hypothesis
Can a parallelized K-means algorithm improve clustering performance for large datasets in laboratory settings?
Conclusion
ParaKMeans significantly speeds up clustering of large datasets and is user-friendly for laboratory use.
Supporting Evidence
- ParaKMeans provides significant performance gains over a wide range of datasets using as few as seven nodes.
- The average time taken to cluster each dataset was reduced from 24.33 minutes to 3.03 minutes using 7 nodes.
- ParaKMeans was significantly faster than the Cluster program in all tested combinations of genes and arrays.
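As a rough check on the timing figures above, the implied speedup can be computed directly from the two averages. This is only an arithmetic sketch: the variable names are illustrative, and it assumes the 24.33-minute figure is the baseline that the 7-node run is compared against, which the summary does not spell out.

```python
# Average clustering times reported in the summary (minutes).
baseline_minutes = 24.33   # before parallelization (assumed baseline)
parallel_minutes = 3.03    # using 7 nodes

# Speedup is the ratio of the old time to the new time.
speedup = baseline_minutes / parallel_minutes

# Fraction of the original run time that was eliminated.
time_saved_fraction = 1 - parallel_minutes / baseline_minutes

print(f"speedup: {speedup:.2f}x")
print(f"time saved: {time_saved_fraction:.1%}")
```

Under that assumption the reported averages correspond to roughly an eightfold reduction in run time.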
Takeaway
ParaKMeans is a computer program that helps scientists group similar data together much faster by using many computers at once.
Methodology
The software implements a parallelized K-means clustering algorithm using a client-server model with web services for distance calculations.
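The general idea of a parallelized K-means can be sketched as follows. This is a minimal illustration, not ParaKMeans's actual code or web-service interface: the function names (`node_task`, `parallel_kmeans`) are hypothetical, local threads stand in for remote compute nodes, and squared Euclidean distance is assumed as the distance measure. Each "node" assigns its share of the rows to the nearest centroid and returns partial sums, which the client merges into updated centroids.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def node_task(chunk, centroids):
    """Work done by one hypothetical compute node: assign each row in its
    chunk to the nearest centroid and return per-cluster sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for row in chunk:
        nearest = min(range(k), key=lambda c: squared_distance(row, centroids[c]))
        counts[nearest] += 1
        for j, value in enumerate(row):
            sums[nearest][j] += value
    return sums, counts


def parallel_kmeans(data, centroids, n_nodes=2, max_iter=100, tol=1e-9):
    """Client side: split the data, farm out distance work, merge results."""
    # Each simulated node receives every n_nodes-th row.
    chunks = [data[i::n_nodes] for i in range(n_nodes)]
    k, dim = len(centroids), len(centroids[0])
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        for _ in range(max_iter):
            partials = list(pool.map(node_task, chunks, [centroids] * n_nodes))
            updated = []
            for c in range(k):
                count = sum(p[1][c] for p in partials)
                if count == 0:
                    # Keep an empty cluster's centroid where it was.
                    updated.append(list(centroids[c]))
                    continue
                updated.append(
                    [sum(p[0][c][j] for p in partials) / count for j in range(dim)]
                )
            # Stop once no centroid moves by more than the tolerance.
            shift = max(squared_distance(o, n) for o, n in zip(centroids, updated))
            centroids = updated
            if shift < tol:
                break
    return centroids
```

Calling `parallel_kmeans(rows, initial_centroids, n_nodes=7)` would mimic a seven-node run in structure only. Threads in a single Python process will not reproduce the reported timings; in a real deployment each `node_task` call would be a request to a separate machine.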
Limitations
Performance gains depend on the number of compute nodes available and on the size of the dataset.
Statistical Information
P-Value
p<0.0001
Statistical Significance
p<0.05
Digital Object Identifier (DOI)