Identifying Gene and Protein Mentions in Text
Author Information
Author(s): Ryan McDonald, Fernando Pereira
Primary Institution: University of Pennsylvania
Hypothesis
Can conditional random fields effectively identify gene and protein mentions in biomedical text?
Conclusion
The study demonstrates that conditional random fields can accurately identify gene and protein mentions in text, achieving a precision of 86.4% and recall of 78.7%.
Supporting Evidence
- The model achieved a precision of 86.4% and recall of 78.7%.
- The system was trained on 7,500 annotated MEDLINE sentences.
- Adding ABGene lexicons significantly improved precision and recall.
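The two reported figures can be combined into a single F1 score, the harmonic mean of precision and recall. The sketch below is plain arithmetic on the numbers above. The function name `f1_score` is an assumption made for illustration, and the resulting value is derived here, not quoted from the study.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported above, converted from percentages to fractions.
reported_f1 = f1_score(0.864, 0.787)
print(f"F1 = {reported_f1:.3f}")  # prints "F1 = 0.824"
```

Because the harmonic mean is pulled toward the smaller of the two values, the lower recall weighs on the combined score more than the higher precision lifts it.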
Takeaway
This study shows a way to find names of genes and proteins in text using a smart computer program that learns from examples.
Methodology
The study used conditional random fields with a diverse feature set to tag gene and protein mentions in text.
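To make "a diverse feature set" concrete, the following is a minimal sketch of the kind of per-token features a sequence tagger might consume. Everything in it is an assumption for illustration: the feature names, the tiny `EXAMPLE_LEXICON` (a stand-in for a real lexicon such as ABGene's), the `B-GENE`/`O` label scheme, and the example sentence. It is not the authors' actual feature set.

```python
import re

# Hypothetical stand-in for a gene lexicon; not the ABGene lexicon itself.
EXAMPLE_LEXICON = {"p53", "brca1", "il-2"}


def word_shape(token: str) -> str:
    """Map uppercase letters to 'A', lowercase to 'a', digits to '0'."""
    shape = re.sub(r"[A-Z]", "A", token)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)


def token_features(tokens: list[str], i: int) -> dict[str, object]:
    """Build an illustrative feature dictionary for the token at position i."""
    token = tokens[i]
    return {
        "word": token.lower(),
        "shape": word_shape(token),
        "prefix3": token[:3].lower(),
        "suffix3": token[-3:].lower(),
        "has_digit": any(c.isdigit() for c in token),
        "has_hyphen": "-" in token,
        "in_lexicon": token.lower() in EXAMPLE_LEXICON,
        # Neighbouring words give the tagger local context.
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


def sentence_features(tokens: list[str]) -> list[dict[str, object]]:
    """Feature dictionaries for every token in a sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]


tokens = ["Mutations", "in", "BRCA1", "were", "detected"]
labels = ["O", "O", "B-GENE", "O", "O"]  # hypothetical gold tags
features = sentence_features(tokens)
print(features[2]["shape"], features[2]["in_lexicon"])
```

A conditional random field would then learn weights over such features jointly with transitions between adjacent labels, so that the tag chosen for one token can depend on the tags of its neighbours.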
Limitations
The system struggles to identify the exact boundaries of gene mentions, which can lower both precision and recall.
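The effect of a boundary error on both metrics can be sketched as follows, assuming exact-match scoring, where a predicted mention counts only if its start and end both agree with the annotation. The scoring rule, the `B`/`I`/`O` labels, the function names, and the example sentence are all assumptions for illustration; the summary does not state the evaluation protocol used.

```python
def bio_to_spans(labels: list[str]) -> set[tuple[int, int]]:
    """Convert B/I/O labels to a set of (start, end) spans, end exclusive."""
    spans = set()
    start = None
    for i, label in enumerate(labels):
        if label in ("B", "O"):
            if start is not None:
                spans.add((start, i))  # close the mention that was open
            start = i if label == "B" else None
        elif label == "I" and start is None:
            start = i  # tolerate an I with no preceding B
    if start is not None:
        spans.add((start, len(labels)))
    return spans


def exact_match_scores(gold: list[str], predicted: list[str]) -> tuple[float, float]:
    """Precision and recall where only exactly matching spans count as correct."""
    gold_spans = bio_to_spans(gold)
    pred_spans = bio_to_spans(predicted)
    true_pos = len(gold_spans & pred_spans)
    precision = true_pos / len(pred_spans) if pred_spans else 0.0
    recall = true_pos / len(gold_spans) if gold_spans else 0.0
    return precision, recall


tokens = ["human", "interleukin", "2", "receptor", "and", "p53"]
gold = ["B", "I", "I", "I", "O", "B"]
pred = ["O", "B", "I", "I", "O", "B"]  # first mention starts one token late
precision, recall = exact_match_scores(gold, pred)
print(precision, recall)  # prints "0.5 0.5"
```

Under this assumption, the single misplaced boundary yields a predicted span that matches nothing (hurting precision) and leaves the annotated span unfound (hurting recall), even though three of its four tokens were tagged as part of a mention.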