Evaluating Text Mining Methods for Gene Identification

Sample size: 89 publications

Evidence: moderate

Author Information

Author(s): Colosimo Marc E, Morgan Alexander A, Yeh Alexander S, Colombe Jeffrey B, Hirschman Lynette

Primary Institution: The MITRE Corporation

Hypothesis

Can automated systems accurately generate unique gene identifiers from PubMed abstracts for model organisms?

Conclusion

Clear annotation guidelines and interannotator experiments are crucial for validating gene lists, as abstracts alone often miss many genes.

Supporting Evidence

Interannotator agreement was 87% for Fly and 91% for Yeast, but only 69% for Mouse.
The study found that abstracts contained only 25% of genes for Fly and 36% for Mouse.
Answer pooling was faster and identified more conflicting genes than interannotator analysis.
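
As a rough illustration of how agreement figures like those above could be computed, the sketch below compares the gene identifier sets that annotators produced for the same abstract. It assumes agreement is measured as shared identifiers divided by all identifiers listed by either annotator; this summary does not state which measure the study used, so that choice, the function names, and the placeholder identifiers are all hypothetical.

```python
from itertools import combinations


def pairwise_agreement(genes_a, genes_b):
    """Share of gene identifiers listed by both annotators, out of all
    identifiers listed by either one (an assumed overlap measure)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Two empty lists agree trivially.
        return 1.0
    return len(a & b) / len(union)


def mean_agreement(annotations):
    """Average the pairwise agreement over every pair of annotators."""
    pairs = list(combinations(annotations.values(), 2))
    return sum(pairwise_agreement(x, y) for x, y in pairs) / len(pairs)


# Placeholder identifiers, not real database entries.
annotations = {
    "annotator_1": {"GENE_A", "GENE_B", "GENE_C"},
    "annotator_2": {"GENE_A", "GENE_B", "GENE_D"},
    "annotator_3": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
}

print(f"Mean pairwise agreement: {mean_agreement(annotations):.0%}")
```

Genes on which a pair disagrees (the symmetric difference of the two sets) are the natural candidates for review against the annotation guidelines.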

Takeaway

This study shows that finding genes in research papers is tricky because sometimes important genes are only mentioned in the full text, not in the short summaries.

Methodology

The study involved preparing training and test materials, generating gene lists from abstracts, and assessing interannotator agreement.

Potential Biases

Differences in curation criteria among databases may introduce bias in gene identification.

Limitations

The study relied on abstracts, which often omit genes that appear only in the full text, so gene lists built from abstracts alone can be incomplete.
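
To make this limitation concrete, the following sketch measures what fraction of a paper's full gene list can also be found in its abstract. The function name and the placeholder identifiers are hypothetical, and the approach (a simple set comparison) is an assumption made for illustration, not the procedure reported in the study.

```python
def abstract_coverage(full_list_genes, abstract_genes):
    """Return the fraction of the full gene list that is also found in
    the abstract, plus the sorted genes the abstract misses."""
    full = set(full_list_genes)
    found = full & set(abstract_genes)
    missing = sorted(full - found)
    if not full:
        # Nothing to cover; report zero rather than divide by zero.
        return 0.0, missing
    return len(found) / len(full), missing


# Placeholder identifiers, not real database entries.
full_list = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
in_abstract = {"GENE_A"}

coverage, missing = abstract_coverage(full_list, in_abstract)
print(f"Coverage: {coverage:.0%}; missing from abstract: {missing}")
```

A low coverage value for a paper signals that its abstract alone is a poor basis for building or validating that paper's gene list.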

Participant Demographics

Three annotators with varying levels of expertise in biological sciences participated.

Digital Object Identifier (DOI)

10.1186/1471-2105-6-S1-S12
