Exploiting MeSH indexing in MEDLINE to generate a data set for word sense disambiguation

2011

Creating a Biomedical Word Sense Disambiguation Dataset

Sample size: 203 ambiguous entities

Evidence: high

Author Information

Author(s): Jimeno-Yepes Antonio J, McInnes Bridget T, Aronson Alan R

Primary Institution: National Library of Medicine

Hypothesis

Can we automatically generate a comprehensive dataset for word sense disambiguation in the biomedical domain using MeSH indexing?

Conclusion

The MSH WSD dataset enables better evaluation of word sense disambiguation algorithms in the biomedical field by providing a larger and more diverse set of terms.

Supporting Evidence

The MSH WSD dataset contains 203 ambiguous entities, including 106 abbreviations.
The dataset covers 81 out of 133 semantic types in the UMLS.
Knowledge-based methods showed varying performance but maintained relative rankings across datasets.

Takeaway

This study made a big list of confusing medical terms and their meanings so computers can understand them better. It's like helping a robot know what 'cold' means when talking about weather or a sickness.

Methodology

The study used the UMLS Metathesaurus and MeSH indexing to automatically create a dataset of ambiguous terms from MEDLINE citations.
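The labeling idea can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline. It assumes that each sense of an ambiguous term is linked to one MeSH heading, and that a citation becomes a training example for a sense only when the term appears in its text and exactly one of the candidate headings was assigned by indexers. The `Citation` class, the `label_citations` function, the sense identifiers `M1` and `M2`, and the example citations are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Citation:
    """A simplified MEDLINE citation (hypothetical structure)."""
    pmid: str
    text: str                 # title and abstract, concatenated
    mesh_headings: set[str]   # MeSH headings assigned by indexers


def label_citations(term: str,
                    sense_to_heading: dict[str, str],
                    citations: list[Citation]) -> list[tuple[str, str]]:
    """Return (pmid, sense) pairs for citations that can be labeled unambiguously.

    Assumption: a citation is usable only if it mentions the term and is
    indexed with the MeSH heading of exactly one candidate sense.
    """
    labeled = []
    for citation in citations:
        # Naive case-insensitive substring match; a real system would tokenize.
        if term.lower() not in citation.text.lower():
            continue
        matching = [sense for sense, heading in sense_to_heading.items()
                    if heading in citation.mesh_headings]
        if len(matching) == 1:
            labeled.append((citation.pmid, matching[0]))
    return labeled


# Illustrative data only: two senses of "cold" and three made-up citations.
EXAMPLE_SENSES = {"M1": "Common Cold", "M2": "Cold Temperature"}
EXAMPLE_CITATIONS = [
    Citation("c1", "Zinc for the treatment of cold symptoms.",
             {"Common Cold", "Zinc"}),
    Citation("c2", "Cold exposure and susceptibility to infection.",
             {"Common Cold", "Cold Temperature"}),
    Citation("c3", "Thermoregulation in hibernating mammals.",
             {"Cold Temperature"}),
]

if __name__ == "__main__":
    print(label_citations("cold", EXAMPLE_SENSES, EXAMPLE_CITATIONS))
```

In this sketch, a citation indexed with the headings of both senses is skipped because the indexing does not say which sense the term carries, and a citation that never mentions the term is skipped because it offers no context to learn from.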

Potential Biases

The dataset relies on existing annotations and on an automatic generation process, either of which may introduce bias.

Limitations

The dataset may not cover senses of a term that are not included in the Metathesaurus.

Statistical Information

P-Value

p <= 0.005

Digital Object Identifier (DOI)

10.1186/1471-2105-12-223
