Analyzing Large Metagenomic Datasets with Clustering
Author Information
Author(s): Li Weizhong, Wooley John C., Godzik Adam
Primary Institution: California Institute for Telecommunications and Information Technology, University of California San Diego
Hypothesis
Can advanced clustering strategies improve the analysis of large metagenomic datasets?
Conclusion
The new clustering approach significantly reduces computational effort and enhances the analysis of large metagenomic datasets.
Supporting Evidence
- The clustering identified over 33 thousand large predicted protein clusters comprising nearly 6 million sequences.
- About 20% of these large clusters did not match known protein families, suggesting that they may represent novel families.
- The clustering required about two orders of magnitude less computational effort than the protein family analysis in the original GOS study.
Takeaway
This study shows a faster way to group similar protein sequences predicted from ocean samples, helping scientists understand more about the tiny living things in the ocean.
Methodology
The study used a modified CD-HIT algorithm to hierarchically cluster 17.4 million open reading frames (ORFs) from the Global Ocean Sampling (GOS) study.
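The general idea behind this kind of clustering can be illustrated with a minimal Python sketch. It is not the authors' modified CD-HIT: it assumes a toy k-mer overlap score in place of alignment-based sequence identity, and the function names, thresholds (0.9, then 0.6), and sequences are hypothetical, chosen only for illustration. Sequences are processed from longest to shortest; each one joins the first representative it matches, or becomes a new representative. The hierarchical step then re-clusters the representatives at a looser threshold.

```python
from typing import Dict, List, Set


def kmer_set(seq: str, k: int = 3) -> Set[str]:
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a: str, b: str, k: int = 3) -> float:
    """Toy score (an assumption, not alignment identity): the fraction of
    the shorter sequence's k-mers that also occur in the longer one."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    shorter = ka if len(a) <= len(b) else kb
    if not shorter:
        return 0.0
    return len(ka & kb) / len(shorter)


def greedy_cluster(seqs: List[str], threshold: float) -> Dict[str, List[str]]:
    """Greedy incremental clustering: longest sequences first, each sequence
    joins the first representative it matches or starts a new cluster."""
    clusters: Dict[str, List[str]] = {}
    for seq in sorted(seqs, key=len, reverse=True):
        for rep in clusters:
            if similarity(rep, seq) >= threshold:
                clusters[rep].append(seq)
                break
        else:
            clusters[seq] = [seq]  # no match: seq becomes a representative
    return clusters


def hierarchical_cluster(
    seqs: List[str], thresholds: List[float]
) -> List[Dict[str, List[str]]]:
    """Cluster at the strictest threshold, then re-cluster only the
    representatives at each looser threshold, merging their members."""
    members = {s: [s] for s in seqs}  # exact duplicates collapse here
    levels = []
    for t in thresholds:
        grouped = greedy_cluster(list(members), t)
        members = {
            rep: [m for r in reps for m in members[r]]
            for rep, reps in grouped.items()
        }
        levels.append(members)
    return levels


# Hypothetical toy sequences, not data from the study.
seqs = [
    "MKTAYIAKQRQISFVKSHFSRQ",
    "MKTAYIAKQRQISFVKSHF",
    "MKTAYIAKQRQGSFVKAHFSRQ",
    "GSHMLEDPVDAFKRRW",
    "GSHMLEDPVDAF",
]
levels = hierarchical_cluster(seqs, thresholds=[0.9, 0.6])
for threshold, level in zip([0.9, 0.6], levels):
    print(threshold, {rep: len(group) for rep, group in level.items()})
```

Because each later round compares only representatives rather than all sequences, the number of comparisons shrinks at every level, which is one plausible reason a hierarchical scheme of this kind scales to millions of ORFs.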
Limitations
The study focused primarily on large clusters; small clusters received less analysis because they may be spurious.
Participant Demographics
Samples were taken from 41 locations across the Atlantic and southern Pacific oceans.
Digital Object Identifier (DOI)