Creating a Biomedical Word Sense Disambiguation Dataset
Author Information
Author(s): Jimeno-Yepes Antonio J, McInnes Bridget T, Aronson Alan R
Primary Institution: National Library of Medicine
Hypothesis
Can we automatically generate a comprehensive dataset for word sense disambiguation in the biomedical domain using MeSH indexing?
Conclusion
The MSH WSD dataset enables better evaluation of word sense disambiguation algorithms in the biomedical field by providing a larger and more diverse set of terms.
Supporting Evidence
- The MSH WSD dataset contains 203 ambiguous entities, including 106 abbreviations.
- The dataset covers 81 out of 133 semantic types in the UMLS.
- Knowledge-based methods differed in accuracy, but their relative ranking stayed the same across datasets.
Takeaway
This study made a big list of confusing medical terms and their meanings so computers can understand them better. It's like helping a robot know what 'cold' means when talking about weather or a sickness.
Methodology
The study used the UMLS Metathesaurus and MeSH indexing to automatically create a dataset of ambiguous terms from MEDLINE citations.
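The labeling idea behind this approach can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes that each candidate sense of an ambiguous term maps to a distinct MeSH heading, and that a citation is kept as a labeled example only when it mentions the term and is indexed with exactly one of those headings. The `Citation` class, the `label_citations` function, the sense labels, and the toy citations are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical stand-in for a MEDLINE citation."""
    pmid: str
    text: str                 # title plus abstract
    mesh_headings: frozenset  # MeSH headings assigned by indexers


def label_citations(term, sense_to_heading, citations):
    """Label each citation that mentions `term` and is indexed with
    exactly one of the candidate MeSH headings (assumed rule)."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    labeled = []
    for c in citations:
        if not pattern.search(c.text):
            continue  # the term does not occur as a whole word
        matches = [sense for sense, heading in sense_to_heading.items()
                   if heading in c.mesh_headings]
        if len(matches) == 1:  # skip unindexed or still-ambiguous citations
            labeled.append((c.pmid, matches[0]))
    return labeled


# Toy data, invented for illustration only.
senses = {"common_cold": "Common Cold", "cold_temperature": "Cold Temperature"}
citations = [
    Citation("1", "Patients with a cold reported nasal congestion.",
             frozenset({"Common Cold", "Humans"})),
    Citation("2", "Exposure to cold reduced enzyme activity.",
             frozenset({"Cold Temperature"})),
    Citation("3", "Cold exposure and the common cold.",
             frozenset({"Common Cold", "Cold Temperature"})),
    Citation("4", "Scolding behaviour in birds.",
             frozenset({"Common Cold"})),
]

labeled = label_citations("cold", senses, citations)
print(labeled)
```

In this toy run, citation 3 is dropped because both headings apply, and citation 4 is dropped because "cold" appears only inside another word. Requiring exactly one matching heading trades coverage for cleaner labels, which is one plausible way to limit noise in an automatically generated dataset.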
Potential Biases
The dataset relies on existing annotations and an automatic generation process, either of which may introduce bias.
Limitations
The dataset may not cover senses of a term that are missing from the Metathesaurus.
Statistical Information
P-Value
p <= 0.005
Statistical Significance
p <= 0.005