Identifying gene and protein mentions in text using conditional random fields
2005

Identifying Gene and Protein Mentions in Text

Sample size: 10000 publication

Evidence: moderate

Author Information

Author(s): Ryan McDonald, Fernando Pereira

Primary Institution: University of Pennsylvania

Hypothesis

Can conditional random fields effectively identify gene and protein mentions in biomedical text?

Conclusion

The study demonstrates that conditional random fields can accurately identify gene and protein mentions in text, achieving a precision of 86.4% and recall of 78.7%.

Supporting Evidence

  • The model achieved a precision of 86.4% and recall of 78.7%.
  • The system was trained on 7500 annotated MEDLINE sentences.
  • Adding ABGene lexicons significantly improved precision and recall.

Takeaway

This study shows a way to find names of genes and proteins in text using a smart computer program that learns from examples.

Methodology

The study used conditional random fields with a diverse feature set to tag gene and protein mentions in text.
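As an illustration of what a "diverse feature set" for a token tagger can look like, here is a minimal sketch of per-token feature extraction. The feature names, the `word_shape` helper, and the choice of features are assumptions made for illustration; this summary does not list the features the authors actually used, and the CRF training step itself is not shown.

```python
import re


def word_shape(tok):
    """Map a token to a coarse shape: uppercase -> A, lowercase -> a, digit -> 0."""
    shape = re.sub(r"[A-Z]", "A", tok)
    shape = re.sub(r"[a-z]", "a", shape)
    shape = re.sub(r"[0-9]", "0", shape)
    return shape


def token_features(tokens, i):
    """Return a dict of illustrative (hypothetical) features for tokens[i]."""
    tok = tokens[i]
    feats = {
        "word.lower": tok.lower(),
        "has_digit": any(c.isdigit() for c in tok),
        "has_upper": any(c.isupper() for c in tok),
        "has_hyphen": "-" in tok,
        "suffix3": tok[-3:].lower(),
        "shape": word_shape(tok),
    }
    # Neighbouring words give the tagger local context.
    feats["prev.lower"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.lower"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


if __name__ == "__main__":
    sentence = ["the", "IL-2", "gene"]
    print(token_features(sentence, 1))
```

Orthographic cues such as digits, capitals, and hyphens are plausible signals for names like "IL-2" or "p53", which is why they appear in this sketch; a lexicon lookup (for example against the ABGene lists mentioned above) could be added as one more boolean feature.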

Limitations

The system struggles to identify the exact boundaries of gene mentions. When mentions are scored by exact match, a single boundary error can lower both precision and recall.
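To see why boundary errors matter, the sketch below scores predicted mentions against gold mentions under strict exact-match evaluation. It assumes a B/I/O tagging scheme and strict matching, neither of which is specified in this summary; the function names and the example sentence are hypothetical.

```python
def bio_to_spans(tags):
    """Convert a B/I/O tag sequence into a set of (start, end) spans, end exclusive."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new mention begins; close any mention still open.
            if start is not None:
                spans.add((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.add((start, i))
            start = None
    if start is not None:
        spans.add((start, len(tags)))
    return spans


def exact_match_scores(gold_tags, pred_tags):
    """Precision and recall where a mention counts only if both boundaries match."""
    gold = bio_to_spans(gold_tags)
    pred = bio_to_spans(pred_tags)
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical tokens: the wilms tumor 1 gene and p53
    gold = ["O", "B", "I", "I", "O", "O", "B"]
    pred = ["O", "O", "B", "I", "O", "O", "B"]  # first mention starts one token late
    print(exact_match_scores(gold, pred))
```

In this example the prediction overlaps the first gold mention but misses its left boundary, so it counts as one false positive and one false negative at once, and both scores drop to 0.5.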

Digital Object Identifier (DOI)

10.1186/1471-2105-6-S1-S6
