ParaKMeans: Implementation of a parallelized K-means algorithm suitable for general laboratory use
2008

ParaKMeans: A Fast K-means Clustering Software for Laboratories


Author Information

Author(s): Kraj Piotr, Sharma Ashok, Garge Nikhil, Podolsky Robert, McIndoe Richard A

Primary Institution: Center for Biotechnology and Genomic Medicine, Medical College of Georgia, Augusta, GA, USA

Hypothesis

Can a parallelized K-means algorithm improve clustering performance for large datasets in laboratory settings?

Conclusion

ParaKMeans significantly speeds up clustering of large datasets and is user-friendly for laboratory use.

Supporting Evidence

  • ParaKMeans provides significant performance gains over a wide range of datasets using as few as seven nodes.
  • The average time taken to cluster each dataset was reduced from 24.33 minutes to 3.03 minutes using 7 nodes.
  • ParaKMeans was significantly faster than the Cluster program in all tested combinations of genes and arrays.
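The timing figures above imply a speedup factor that can be checked with simple arithmetic. The sketch below only restates the two reported averages (24.33 and 3.03 minutes); the function names are illustrative, and the summary does not say what configuration produced the 24.33 minute baseline, so no per-node efficiency is derived here.

```python
def speedup(baseline_minutes, parallel_minutes):
    """Ratio of baseline run time to parallel run time."""
    return baseline_minutes / parallel_minutes


def time_reduction_percent(baseline_minutes, parallel_minutes):
    """Percentage of the baseline run time that was saved."""
    return 100.0 * (1.0 - parallel_minutes / baseline_minutes)


# Average clustering times reported in the summary (minutes).
BASELINE_MINUTES = 24.33
SEVEN_NODE_MINUTES = 3.03

reported_speedup = speedup(BASELINE_MINUTES, SEVEN_NODE_MINUTES)
reported_reduction = time_reduction_percent(BASELINE_MINUTES, SEVEN_NODE_MINUTES)
```

With the reported averages this works out to a speedup of roughly 8.0 and a run-time reduction of roughly 87.5%.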

Takeaway

ParaKMeans is a computer program that helps scientists group similar data together much faster by using many computers at once.

Methodology

The software implements a parallelized K-means clustering algorithm using a client-server model with web services for distance calculations.
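As a rough illustration of how the distance and assignment work in K-means can be split across compute nodes, here is a minimal sketch. It is not the ParaKMeans implementation: a local thread pool stands in for the web-service compute nodes, squared Euclidean distance is assumed, and every name (`node_partial_sums`, `parallel_kmeans`, `n_nodes`) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def node_partial_sums(chunk, centroids):
    """Work for one hypothetical compute node: assign each point in its
    chunk to the nearest centroid, then return per-cluster sums and counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        nearest = min(range(k), key=lambda c: sq_dist(point, centroids[c]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return sums, counts


def parallel_kmeans(points, centroids, n_nodes=7, max_iter=100):
    """The 'client' side: split the data once, then repeatedly send the
    current centroids to every node and merge the partial results."""
    centroids = [list(c) for c in centroids]
    k, dim = len(centroids), len(centroids[0])
    # Each node keeps the same slice of the data for the whole run.
    chunks = [points[i::n_nodes] for i in range(n_nodes)]
    chunks = [chunk for chunk in chunks if chunk]
    with ThreadPoolExecutor(max_workers=max(1, len(chunks))) as pool:
        for _ in range(max_iter):
            results = list(
                pool.map(node_partial_sums, chunks, [centroids] * len(chunks))
            )
            new_centroids = []
            for j in range(k):
                total = sum(counts[j] for _, counts in results)
                if total == 0:
                    # Empty cluster: keep its previous centroid.
                    new_centroids.append(list(centroids[j]))
                else:
                    new_centroids.append(
                        [sum(sums[j][d] for sums, _ in results) / total
                         for d in range(dim)]
                    )
            if new_centroids == centroids:
                break  # Centroids stopped moving, so the run has converged.
            centroids = new_centroids
    return centroids


# Tiny usage example with two well-separated groups of 2-D points.
example_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
example_result = parallel_kmeans(example_points, [[0, 0], [10, 10]], n_nodes=3)
```

In this sketch only the centroids and the per-cluster sums and counts cross the node boundary on each iteration, while the data chunks stay on their nodes. That is one common way to keep communication small, and it is a design assumption here, not a detail confirmed by the summary.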

Limitations

Performance may vary with the number of compute nodes and the size of the dataset.

Statistical Information

P-Value

p<0.0001

Statistical Significance

p<0.05

Digital Object Identifier (DOI)

10.1186/1471-2105-9-200
