GENETAG: A Tagged Corpus for Gene/Protein Named Entity Recognition

Sample size: 20,000 sentences

Evidence: moderate

Author Information

Author(s): Tanabe Lorraine, Xie Natalie, Thom Lynne H, Matten Wayne, Wilbur W John

Primary Institution: National Center for Biotechnology Information, National Library of Medicine, NIH

The study aims to create a standardized test corpus for evaluating biomedical named entity recognition systems.

The GENETAG corpus was annotated successfully, but the intricate manual judgments the task required reduced tagging consistency.

GENETAG consists of 20K sentences tagged with gene/protein names.
The annotation guidelines were designed to allow flexible matching to the gold standard.
Semantic constraints were applied to define gene/protein entities.
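
As a rough illustration of what flexible matching to a gold standard can mean in practice, the sketch below scores predicted gene/protein spans against gold spans, where each gold span may carry a set of acceptable alternative boundaries. The span representation (character offsets), the `alternates` mapping, and the function name `score_flexible` are hypothetical choices made for this example; they are not the corpus's file format or its official scoring procedure.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: (start, end) character offsets, end exclusive.
Span = Tuple[int, int]


def score_flexible(
    gold: List[Span],
    alternates: Dict[Span, Set[Span]],
    predicted: List[Span],
) -> Tuple[int, int, int]:
    """Return (true_positives, false_positives, false_negatives).

    A prediction counts as correct if it equals a gold span or one of the
    alternative boundaries listed for that gold span. Each gold span can be
    matched at most once.
    """
    matched: Set[Span] = set()
    tp = fp = 0
    for p in predicted:
        hit = None
        for g in gold:
            if g in matched:
                continue  # this gold entity was already credited
            if p == g or p in alternates.get(g, set()):
                hit = g
                break
        if hit is None:
            fp += 1
        else:
            matched.add(hit)
            tp += 1
    fn = len(gold) - len(matched)
    return tp, fp, fn


# One gold entity with one acceptable longer boundary; two predictions.
gold = [(0, 10)]
alternates = {(0, 10): {(0, 15)}}
predicted = [(0, 15), (20, 25)]
print(score_flexible(gold, alternates, predicted))  # (1, 1, 0)
```

Under strict matching the first prediction would count as both a false positive and a false negative; accepting listed alternatives credits it as a single true positive.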

The researchers made a big list of sentences to help computers learn how to find names of genes and proteins in medical texts.

The corpus was built from 20K MEDLINE sentences, which experts annotated through a web interface; 15K of these sentences were used for the BioCreAtIvE Task 1A competition.

The manual annotation process led to inconsistencies in tagging.
