A scalable machine-learning approach to recognize chemical names within large text databases
2006

Machine Learning for Recognizing Chemical Names in Text

Sample size: 13.1 million publications

Evidence: high

Author Information

Author(s): Jonathan D. Wren

Primary Institution: The University of Oklahoma

Hypothesis

A first-order Markov Model (MM) can effectively distinguish chemical names from other terms in text.

Conclusion

The study demonstrated that a first-order Markov Model can recognize chemical names in large text databases, with high precision and recall on smaller test sets and somewhat lower precision at full MEDLINE scale.

Supporting Evidence

  • The Markov Model achieved ~93% recall and ~99% precision on smaller test sets.
  • The method processed 13.1 million MEDLINE records with an average precision of 82.7%.
  • The study found that the number of spelling variants for a chemical name correlates with its frequency in literature.
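
For readers unfamiliar with the two metrics, here is a minimal Python sketch of how precision and recall are computed from match counts. The counts are hypothetical, chosen only so the results land near the reported ~99% and ~93%; they are not figures from the study.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Return (precision, recall) from raw match counts."""
    # Precision: of the terms flagged as chemical, the share that really are.
    precision = true_pos / (true_pos + false_pos)
    # Recall: of the real chemical names, the share that were flagged.
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Hypothetical counts for illustration only (not taken from the paper).
precision, recall = precision_recall(true_pos=93, false_pos=1, false_neg=7)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

A recognizer can trade one metric for the other: flagging fewer terms usually raises precision and lowers recall.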

Takeaway

This study shows that a computer program can learn to find chemical names in large amounts of text, helping scientists organize information better.

Methodology

A first-order Markov Model was trained on chemical names and tested on MEDLINE records to evaluate its performance in recognizing chemical terms.
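
As a rough illustration of the idea, the following Python sketch trains a character-level first-order Markov model on a handful of chemical names and compares its score with that of a second model trained on ordinary words. The class name, the add-one smoothing, the background model, the toy word lists, and the zero threshold are all assumptions made for this example; the study's actual training data, scoring, and decision rule may differ.

```python
import math
from collections import defaultdict


class CharMarkovModel:
    """First-order character model: P(next char | previous char)."""

    def __init__(self, alphabet_size=28):
        # Assumed alphabet: 26 letters plus start/end markers.
        # Real chemical names also contain digits, hyphens, brackets, etc.
        self.alphabet_size = alphabet_size
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def train(self, words):
        for word in words:
            padded = "^" + word.lower() + "$"  # mark word start and end
            for prev, cur in zip(padded, padded[1:]):
                self.counts[prev][cur] += 1
                self.totals[prev] += 1

    def avg_log_prob(self, word):
        """Mean log-probability per transition, with add-one smoothing."""
        padded = "^" + word.lower() + "$"
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.counts[prev][cur] + 1
            denominator = self.totals[prev] + self.alphabet_size
            total += math.log(numerator / denominator)
        return total / (len(padded) - 1)


def looks_chemical(word, chem_model, background_model, threshold=0.0):
    """Flag a word when the chemical model explains it better."""
    margin = chem_model.avg_log_prob(word) - background_model.avg_log_prob(word)
    return margin > threshold


# Toy training data, invented for this sketch.
chem_model = CharMarkovModel()
chem_model.train(["ethanol", "methanol", "propanol", "butanol", "benzene",
                  "toluene", "acetone", "chloroform", "cyclohexane",
                  "acetaldehyde"])

background_model = CharMarkovModel()
background_model.train(["the", "patient", "study", "results", "were", "with",
                        "treatment", "group", "analysis", "between"])

for candidate in ["pentanol", "patients"]:
    print(candidate, looks_chemical(candidate, chem_model, background_model))
```

Averaging the log-probability over transitions keeps long names from being penalized for their length. Very short words give the model few transitions to judge, which is consistent with the difficulty with short terms noted under Limitations.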

Limitations

The model may overestimate the number of unique chemical names due to 'tag-along' prefixes and suffixes, and it struggles with short terms.

Digital Object Identifier (DOI)

10.1186/1471-2105-7-S2-S3
