Mining Protein Function from Text Using Support Vector Machines

Sample size: 1858 publications

Evidence: moderate

Author Information

Author(s): Simon B Rice, Goran Nenadic, Benjamin J Stapley

Primary Institution: University of Manchester

Hypothesis

Can a supervised machine-learning approach effectively assign Gene Ontology terms to human proteins based on text mining?

Conclusion

A machine-learning approach to mining protein function predictions from text can yield good performance only if sufficient training data is available and a significant amount of supporting data is used for prediction.

Supporting Evidence

The study evaluated the performance of text mining systems in assigning Gene Ontology terms to proteins.
Results showed that the method performed better with a larger set of relevant documents.
Precision of selected supporting text was variable, ranging from 3% to 50%.

Takeaway

This study shows that using lots of documents helps computers figure out what proteins do by reading about them, but they struggle when there's not enough information.

Methodology

The study used a supervised machine-learning approach, based on support vector machines, to assign Gene Ontology terms to proteins from co-occurring terms extracted from documents.
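The general idea can be illustrated with a minimal sketch. It is not the authors' implementation: the six toy documents, the single target term ("kinase activity"), the helper names (featurize, train_linear_svm, decision) and all hyperparameters are assumptions made for illustration. It trains a linear support vector machine by sub-gradient descent on the hinge loss, using counts of the terms in each document as features, and then scores unseen text for the target term.

```python
from collections import Counter


def featurize(text):
    """Bag of lower-cased terms found in one document (toy tokenizer)."""
    return Counter(text.lower().split())


def decision(weights, bias, feats):
    """Signed score: positive means the target term is predicted."""
    return bias + sum(weights.get(t, 0.0) * c for t, c in feats.items())


def train_linear_svm(examples, epochs=200, lam=0.01, lr=0.1):
    """Sub-gradient descent on the L2-regularised hinge loss.

    examples: list of (Counter, label) pairs, label in {+1, -1}.
    epochs, lam and lr are illustrative values, not tuned settings.
    """
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for feats, label in examples:
            score = decision(weights, bias, feats)
            # Regularisation step: shrink every weight slightly.
            for t in weights:
                weights[t] *= 1.0 - lr * lam
            # Hinge-loss step: update only if the margin is violated.
            if label * score < 1.0:
                for t, c in feats.items():
                    weights[t] = weights.get(t, 0.0) + lr * label * c
                bias += lr * label
    return weights, bias


# Hypothetical training documents: +1 = supports "kinase activity", -1 = does not.
raw = [
    ("protein phosphorylates substrate kinase domain atp binding", +1),
    ("kinase catalyses phosphorylation of serine residues", +1),
    ("atp dependent phosphorylation by the kinase domain", +1),
    ("protein transports glucose across the membrane", -1),
    ("membrane channel transports ions", -1),
    ("transporter moves glucose through the membrane channel", -1),
]
examples = [(featurize(text), label) for text, label in raw]
weights, bias = train_linear_svm(examples)

for text in ("kinase phosphorylation atp domain",
             "membrane channel transports glucose"):
    score = decision(weights, bias, featurize(text))
    print(text, "->", "assign term" if score > 0 else "do not assign")
```

A real system would train one such classifier per Gene Ontology term on far larger document sets, which is consistent with the study's finding that performance depends on the amount of training and supporting text available.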

Limitations

The method works poorly on single documents and short passages, and the performance is highly dependent on the availability of relevant training data.

Digital Object Identifier (DOI)

10.1186/1471-2105-6-S1-S22
