GENETAG: a tagged corpus for gene/protein named entity recognition
2005

Sample size: 20,000 sentences

Evidence: moderate

Author Information

Author(s): Tanabe Lorraine, Xie Natalie, Thom Lynne H, Matten Wayne, Wilbur W John

Primary Institution: National Center for Biotechnology Information, National Library of Medicine, NIH

Hypothesis

The study aims to create a standardized test corpus for evaluating biomedical named entity recognition systems.

Conclusion

The GENETAG corpus was successfully annotated, but the intricate manual judgments that annotation required made consistent tagging difficult.

Supporting Evidence

  • GENETAG consists of 20K sentences tagged with gene/protein names.
  • The annotation guidelines were designed to allow flexible matching to the gold standard.
  • Semantic constraints were applied to define gene/protein entities.
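
One way to picture flexible matching is a scorer that accepts either the gold-standard span or any listed alternative boundary for the same name. The sketch below is illustrative only: the character-offset spans, the `alternatives` mapping, and the `score` function are assumptions made for this example, not the corpus file format or the official evaluation script.

```python
from typing import Dict, List, Set, Tuple

# (start, end) character offsets into the sentence, end exclusive.
# This span representation is an assumption made for the example.
Span = Tuple[int, int]


def score(
    predicted: Set[Span],
    gold: List[Span],
    alternatives: Dict[Span, Set[Span]],
) -> Tuple[float, float]:
    """Return (precision, recall) under flexible matching.

    A predicted span counts as correct if it equals a gold span or one
    of the alternative spans listed for that gold span. Each gold span
    can be matched at most once.
    """
    matched_gold: Set[Span] = set()
    true_positives = 0
    for span in sorted(predicted):
        for gold_span in gold:
            if gold_span in matched_gold:
                continue
            if span == gold_span or span in alternatives.get(gold_span, set()):
                matched_gold.add(gold_span)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical sentence and annotations, invented for illustration.
sentence = "The human p53 protein binds DNA."
gold = [(10, 21)]                              # "p53 protein"
alternatives = {(10, 21): {(10, 13), (4, 21)}}  # "p53", "human p53 protein"

flexible = score({(10, 13)}, gold, alternatives)  # system tagged only "p53"
strict = score({(10, 13)}, gold, {})              # same output, no alternatives
```

With the alternatives supplied, the shorter span "p53" is credited as a correct match; with an empty mapping, the same prediction counts as both a false positive and a missed name.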

Takeaway

The researchers made a big list of sentences to help computers learn how to find names of genes and proteins in medical texts.

Methodology

The corpus was constructed from 20K MEDLINE sentences, 15K of which were used for the BioCreAtIvE Task 1A competition. Experts annotated the sentences using a web interface.

Limitations

The manual annotation process led to inconsistencies in tagging.

Digital Object Identifier (DOI)

10.1186/1471-2105-6-S3
