Matching curated genome databases: a non trivial task
2008

CorBank: A Tool for Matching Genome Databases

Sample size: 641 complete genomes

Evidence: moderate

Author Information

Author(s): Descorps-Declère Stéphane, Barba Matthieu, Labedan Bernard

Primary Institution: Institut de Génétique et Microbiologie, Université Paris Sud XI, CNRS UMR 8621

Hypothesis

Can we effectively cross-reference protein identifiers between independently curated genome databases?

Conclusion

CorBank efficiently detects differences between RefSeq and Genome Reviews versions of curated genomes, suggesting a need for better coordination between the two databases.

Supporting Evidence

  • 98% of the 1,983,258 amino acid sequences match between the two databases.
  • Only 50 of the 641 complete genomes analyzed match perfectly.
  • Differences in coding sequences were found in 321 species.

Takeaway

CorBank helps scientists find and compare gene information from different databases, showing that they often have different details about the same genes.

Methodology

CorBank uses hash tables to match amino acid sequences and identify differences in structural annotations between genome databases.
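The hash-table idea can be illustrated with a minimal Python sketch. This is not CorBank's actual code: the function names, the (identifier, sequence) record format, and the toy identifiers and sequences are assumptions made for illustration only. Each database is indexed by amino acid sequence, so identifiers that carry identical sequences in both databases can be paired with a simple key lookup, and sequences present in only one database stand out as candidate annotation differences.

```python
from collections import defaultdict


def index_by_sequence(records):
    """Build a hash table mapping each amino acid sequence to its identifiers.

    `records` is an iterable of (protein_id, sequence) pairs (assumed format).
    """
    index = defaultdict(list)
    for protein_id, sequence in records:
        index[sequence.upper()].append(protein_id)
    return index


def match_databases(records_a, records_b):
    """Pair identifiers whose sequences are identical in both databases.

    Returns (matched, only_a, only_b), where `matched` maps a shared
    sequence to the identifier lists from each database, and `only_a` /
    `only_b` list identifiers whose sequence has no exact counterpart.
    """
    index_a = index_by_sequence(records_a)
    index_b = index_by_sequence(records_b)

    shared = index_a.keys() & index_b.keys()
    matched = {seq: (index_a[seq], index_b[seq]) for seq in shared}
    only_a = sorted(pid for seq in index_a.keys() - shared for pid in index_a[seq])
    only_b = sorted(pid for seq in index_b.keys() - shared for pid in index_b[seq])
    return matched, only_a, only_b


# Toy records with made-up identifiers and sequences (hypothetical data).
refseq_like = [("rs1", "MKT"), ("rs2", "MAL"), ("rs3", "MKT")]
genome_reviews_like = [("gr1", "MKT"), ("gr2", "MALV")]

matched, only_refseq, only_genome_reviews = match_databases(
    refseq_like, genome_reviews_like
)
print(matched)
print(only_refseq, only_genome_reviews)
```

Exact-sequence lookup only finds identical proteins. Entries left unmatched, such as the toy pair "MAL" and "MALV" above, would need a further comparison step (for example, of gene coordinates) to decide whether they are two annotations of the same gene.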

Potential Biases

Independent curation efforts may lead to increasingly divergent interpretations of the same genomic data.

Limitations

The study does not establish a direct correlation between sequencing age and divergence levels, and some differences may be due to subjective interpretations of gene annotations.

Digital Object Identifier (DOI)

10.1186/1471-2164-9-501
