Compression-based classification of biological sequences and structures via the Universal Similarity Metric: experimental assessment
2007

Assessing the Universal Similarity Metric for Biological Data

Sample size: 6 data sets
Evidence: moderate

Author Information

Author(s): Paolo Ferragina, Raffaele Giancarlo, Valentina Greco, Giovanni Manzini, Gabriel Valiente

Hypothesis

Can the Universal Similarity Metric (USM) and its approximations effectively classify biological sequences and structures?

Conclusion

The USM methodology is robust and competitive with existing techniques for classifying biological data.

Supporting Evidence

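  • UCD (Universal Compression Dissimilarity) and NCD (Normalized Compression Dissimilarity) yield nearly identical results, while CD (Compression Dissimilarity) is almost always worse than both.
  • UPGMA (Unweighted Pair Group Method with Arithmetic Mean) generally yields better classifications than NJ (Neighbor Joining).
  • PPMd (for generic data) and Gencompress (for DNA) are the best-performing compression algorithms.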
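
To make the clustering step concrete, here is a minimal sketch of UPGMA (average-linkage clustering) in plain Python. It is an illustration, not the software used in the study: the function name `upgma`, the four labels, and the distance values are all invented for the example. At each step the two closest clusters are merged, and the distance from the merged cluster to every other cluster is the size-weighted average of the two old distances.

```python
from itertools import combinations


def upgma(labels, dist):
    """Build a UPGMA tree as nested tuples.

    `dist` maps frozenset({a, b}) to the dissimilarity between items a and b.
    (Hypothetical interface chosen for this sketch.)
    """
    clusters = {label: 1 for label in labels}  # cluster -> number of leaves
    d = dict(dist)                             # working copy of the distances
    while len(clusters) > 1:
        # Pick the pair of current clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance to the merged cluster: size-weighted mean of old distances.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))


# Made-up dissimilarities between four hypothetical sequences.
example_dist = {
    frozenset(("A", "B")): 0.10,
    frozenset(("C", "D")): 0.20,
    frozenset(("A", "C")): 0.80,
    frozenset(("A", "D")): 0.90,
    frozenset(("B", "C")): 0.85,
    frozenset(("B", "D")): 0.95,
}
tree = upgma(["A", "B", "C", "D"], example_dist)
print(tree)
```

In practice the dissimilarities would come from a USM approximation such as NCD rather than from hand-picked numbers.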

Takeaway

This study tested a compression-based way to compare biological sequences and structures, the Universal Similarity Metric, and found that it works well even with large data sets.

Methodology

The study used 25 compression algorithms and all three known approximations of USM (UCD, NCD and CD) to classify biological sequences and structures across six data sets relevant to molecular biology.
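
As a rough illustration of how a compressor turns into a dissimilarity measure, the sketch below computes two of the three approximations using their standard definitions: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) and CD(x, y) = C(xy) / (C(x) + C(y)), where C(.) is the compressed length. Several assumptions apply: Python's `zlib` stands in for the compressor (it is not claimed to match any of the 25 programs tested), UCD is omitted because it needs conditional compression, and the helpers `random_dna` and `mutate` and their test sequences are hypothetical.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes; zlib is an assumed stand-in for C(x)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Dissimilarity of x and y."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def cd(x: bytes, y: bytes) -> float:
    """Compression Dissimilarity of x and y."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def random_dna(length: int, seed: int) -> bytes:
    """Hypothetical helper: reproducible pseudo-random DNA string."""
    rng = random.Random(seed)
    return bytes(rng.choice(b"ACGT") for _ in range(length))


def mutate(seq: bytes, n_changes: int, seed: int) -> bytes:
    """Hypothetical helper: overwrite n_changes positions with random bases."""
    rng = random.Random(seed)
    out = bytearray(seq)
    for pos in rng.sample(range(len(out)), n_changes):
        out[pos] = rng.choice(b"ACGT")  # may coincide with the original base
    return bytes(out)


original = random_dna(2000, seed=1)
close_variant = mutate(original, 10, seed=2)   # near-copy of `original`
unrelated = random_dna(2000, seed=3)           # independent sequence

print("NCD original vs variant  :", ncd(original, close_variant))
print("NCD original vs unrelated:", ncd(original, unrelated))
print("CD  original vs variant  :", cd(original, close_variant))
print("CD  original vs unrelated:", cd(original, unrelated))
```

The near-copy should score noticeably lower than the unrelated sequence under both measures, because compressing the concatenation reuses what was already seen in the first string.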

Limitations

Earlier experimental assessments of USM had tested only a few data compressors and offered limited comparisons among the USM approximations and with existing methods; this study was designed to address those gaps.

Digital Object Identifier (DOI)

10.1186/1471-2105-8-252
