Evaluating Text Mining Methods for Gene Identification
Author Information
Author(s): Colosimo Marc E, Morgan Alexander A, Yeh Alexander S, Colombe Jeffrey B, Hirschman Lynette
Primary Institution: The MITRE Corporation
Hypothesis
Can automated systems accurately generate unique gene identifiers from PubMed abstracts for model organisms?
Conclusion
Clear annotation guidelines and interannotator experiments are crucial for validating gene lists, as abstracts alone often miss many genes.
Supporting Evidence
- Interannotator agreement was 87% for Fly and 91% for Yeast, but only 69% for Mouse.
- Abstracts mentioned only 25% of the genes for Fly and 36% for Mouse; the remaining genes appeared only in the full text.
- Answer pooling was faster and identified more conflicting genes than interannotator analysis.
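The two checks named in the evidence above can be sketched in a few lines. This is an illustration only, not the study's scoring procedure: the summary does not say how agreement was computed, so the sketch assumes agreement is the overlap of two annotators' gene lists (shared identifiers divided by all identifiers either listed), and assumes answer pooling means collecting the gene lists returned by several systems and flagging identifiers that many systems return but the reference list lacks. The function names, the `min_votes` parameter, and the placeholder identifiers are all hypothetical.

```python
from __future__ import annotations

from collections import Counter


def agreement(list_a: set[str], list_b: set[str]) -> float:
    """Overlap of two annotators' gene lists for the same abstract.

    Assumed definition: identifiers both annotators listed, divided by
    identifiers either annotator listed.
    """
    union = list_a | list_b
    if not union:
        return 1.0  # both lists empty: nothing to disagree about
    return len(list_a & list_b) / len(union)


def pooled_conflicts(
    system_lists: list[set[str]], reference: set[str], min_votes: int = 2
) -> list[str]:
    """Identifiers returned by at least `min_votes` systems but absent
    from the reference list. These are candidates for manual re-review."""
    votes = Counter(gene for genes in system_lists for gene in genes)
    return sorted(
        gene
        for gene, count in votes.items()
        if count >= min_votes and gene not in reference
    )


# Placeholder identifiers, not real database entries.
annotator_1 = {"GENE_A", "GENE_B", "GENE_C"}
annotator_2 = {"GENE_B", "GENE_C", "GENE_D"}
print(agreement(annotator_1, annotator_2))

systems = [{"GENE_A", "GENE_B"}, {"GENE_B", "GENE_C"}, {"GENE_B", "GENE_C", "GENE_D"}]
print(pooled_conflicts(systems, reference={"GENE_A", "GENE_B"}))
```

Under these assumptions, pooling needs no second human pass over every abstract, which fits the observation that it was the faster of the two checks.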
Takeaway
This study shows that finding genes in research papers is tricky because important genes are sometimes mentioned only in the full text, not in the abstract.
Methodology
The study involved preparing training and test materials, generating gene lists from abstracts, and assessing interannotator agreement.
Potential Biases
Differences in curation criteria among databases may introduce bias in gene identification.
Limitations
The study relied on abstracts, which often do not mention every relevant gene, so gene lists built from them can be incomplete.
Participant Demographics
Three annotators with varying levels of expertise in biological sciences participated.