Probing Metagenomics by Rapid Cluster Analysis of Very Large Datasets
2008

Analyzing Large Metagenomic Datasets with Clustering

Sample size: 44 publication
Evidence: high

Author Information

Author(s): Li Weizhong, Wooley John C., Godzik Adam

Primary Institution: California Institute for Telecommunications and Information Technology, University of California San Diego

Hypothesis

Can advanced clustering strategies improve the analysis of large metagenomic datasets?

Conclusion

The new clustering approach significantly reduces computational effort and enhances the analysis of large metagenomic datasets.

Supporting Evidence

  • The clustering identified over 33 thousand large predicted protein clusters comprising nearly 6 million sequences.
  • About 20% of these clusters did not match known protein families and may represent novel families.
  • The clustering required about two orders of magnitude less computational effort than comparable clustering in the original GOS study.

Takeaway

This study shows a faster way to group similar predicted protein sequences from ocean samples, helping scientists learn more about the microbes that live in the ocean.

Methodology

The study used a modified CD-HIT algorithm to hierarchically cluster 17.4 million Open Reading Frames (ORFs) predicted from the Global Ocean Sampling (GOS) study.
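
The general idea behind this kind of clustering can be illustrated with a minimal Python sketch. It is not the authors' modified CD-HIT: the `similarity` function is a crude stand-in for the alignment-based identity that CD-HIT computes, the function names are hypothetical, and the thresholds in the usage note are illustrative assumptions rather than values taken from the study. The sketch shows greedy incremental clustering (longest sequence first, each sequence joins the first representative it matches) applied in successive rounds, where each round re-clusters only the representatives of the previous one.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def similarity(a: str, b: str) -> float:
    """Crude stand-in for sequence identity (assumption, not CD-HIT's measure).

    Returns the fraction of the shorter sequence covered by matching blocks.
    """
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / min(len(a), len(b))


def greedy_cluster(seqs: Dict[str, str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering, processing the longest sequences first.

    Each sequence joins the first existing representative it matches at or
    above the threshold; otherwise it becomes a new representative.
    """
    clusters: Dict[str, List[str]] = {}
    for name in sorted(seqs, key=lambda n: (-len(seqs[n]), n)):
        for rep in clusters:
            if similarity(seqs[rep], seqs[name]) >= threshold:
                clusters[rep].append(name)
                break
        else:
            clusters[name] = [name]
    return clusters


def hierarchical_cluster(seqs: Dict[str, str], thresholds: List[float]) -> Dict[str, List[str]]:
    """Cluster in rounds, re-clustering only representatives at each threshold."""
    members: Dict[str, List[str]] = {name: [name] for name in seqs}
    for threshold in thresholds:
        reps = {name: seqs[name] for name in members}
        merged = greedy_cluster(reps, threshold)
        # Fold the members of every absorbed representative into its new cluster.
        members = {
            rep: [m for child in children for m in members[child]]
            for rep, children in merged.items()
        }
    return members
```

For example, `hierarchical_cluster(seqs, [0.9, 0.5])` first groups near-identical sequences and then merges the resulting representatives at a looser threshold. Because each later round compares only representatives, far fewer comparisons are needed than when every sequence is compared against every other, which is the intuition behind the reduced computational effort.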

Limitations

The study focused primarily on large clusters; small clusters were analyzed less thoroughly because they are more likely to be spurious.

Participant Demographics

Samples were taken from 41 locations across the Atlantic and southern Pacific oceans.

Digital Object Identifier (DOI)

10.1371/journal.pone.0003375
