GENETAG: A Tagged Corpus for Gene/Protein Named Entity Recognition
Author Information
Author(s): Tanabe Lorraine, Xie Natalie, Thom Lynne H, Matten Wayne, Wilbur W John
Primary Institution: National Center for Biotechnology Information, National Library of Medicine, NIH
Hypothesis
The study aims to create a standardized test corpus for evaluating biomedical named entity recognition systems.
Conclusion
The GENETAG corpus was successfully annotated, but annotation required intricate manual judgments that reduced tagging consistency.
Supporting Evidence
- GENETAG consists of 20K sentences tagged with gene/protein names.
- The annotation guidelines were designed to allow flexible matching to the gold standard.
- Semantic constraints were applied to define gene/protein entities.
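The idea of flexible matching to a gold standard can be illustrated with a small scoring sketch. It is not the corpus's actual evaluation script. It assumes each gold entity is a token span with an optional set of alternate spans that are also accepted, and that a prediction counts as correct if it equals the gold span or one of its alternates. The names GoldEntity and score, and the example spans, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldEntity:
    """A gold-standard name plus alternate boundaries that are also accepted (assumed layout)."""
    span: tuple            # (start_token, end_token), inclusive
    alternates: frozenset = frozenset()

    def accepts(self, candidate):
        # A candidate is correct if it is the gold span or any listed alternate.
        return candidate == self.span or candidate in self.alternates


def score(gold, predicted):
    """Return (precision, recall, f1) for one sentence under flexible matching."""
    matched = set()  # indices of gold entities already credited
    true_positives = 0
    for candidate in predicted:
        for i, entity in enumerate(gold):
            # Each gold entity may be credited at most once.
            if i not in matched and entity.accepts(candidate):
                matched.add(i)
                true_positives += 1
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical sentence with two gold names; the first also accepts a shorter form.
gold = [
    GoldEntity(span=(0, 3), alternates=frozenset({(1, 3)})),
    GoldEntity(span=(5, 6)),
]
predicted = [(1, 3), (5, 7)]  # first hits an alternate, second has a wrong boundary

print(score(gold, predicted))
```

Under exact matching alone, neither prediction in this example would count; accepting alternates credits the first one while still penalizing the boundary error in the second.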
Takeaway
The researchers built a large collection of tagged sentences to help computers learn to find the names of genes and proteins in biomedical text.
Methodology
The corpus was constructed from 20K MEDLINE sentences, 15K of which were used for the BioCreAtIvE Task 1A competition. Experts annotated the sentences using a web interface.
Limitations
The manual annotation process led to inconsistencies in tagging.
Digital Object Identifier (DOI)