Predicting Protein Functions from Genomic Sequences
Author Information
Author(s): Ross D. King, Andreas Karwath, Amanda Clare, Luc Dehaspe
Primary Institution: University of Wales, Aberystwyth
Hypothesis
Can data mining techniques be used to predict protein functional classes from genomic sequences?
Conclusion
The study predicts protein functional classes from sequence in Mycobacterium tuberculosis and Escherichia coli, with an estimated accuracy of 60-80% depending on the level of functional assignment.
Supporting Evidence
- 65% of the ORFs with no assigned function in M. tuberculosis were predicted to have a function.
- 24% of the ORFs with no assigned function in E. coli were predicted to have a function.
- The rules were founded on detection of remote homology, convergent evolution, and horizontal gene transfer.
Takeaway
Scientists used computer programs to guess what proteins do based on their DNA sequences, and they were pretty good at it!
Methodology
The study used data mining techniques, including inductive logic programming and rule learning, to analyze genomic sequences and predict protein functions.
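To make the idea of rule learning concrete, the sketch below shows how a learned if-then rule could be applied to an ORF and scored against ORFs of known function. It is a minimal illustration, not the authors' method: the study used inductive logic programming over relational data, whereas this sketch uses simple propositional rules over a flat feature dictionary. The feature names (length, hydrophobic_fraction), the thresholds, the class labels, and the data are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, float]


@dataclass
class Rule:
    """An if-then rule: if every condition holds, predict `predicted_class`."""
    predicted_class: str
    conditions: List[Callable[[Features], bool]]

    def matches(self, orf: Features) -> bool:
        return all(cond(orf) for cond in self.conditions)


def predict(orf: Features, rules: List[Rule]) -> List[str]:
    """Return the classes predicted by every rule that fires on this ORF."""
    return [r.predicted_class for r in rules if r.matches(orf)]


def rule_accuracy(rule: Rule,
                  labelled: List[Tuple[Features, str]]) -> Optional[float]:
    """Fraction of covered ORFs whose known class equals the rule's prediction.

    Returns None when the rule covers no ORF in the labelled set.
    """
    covered = [label for feats, label in labelled if rule.matches(feats)]
    if not covered:
        return None
    return sum(label == rule.predicted_class for label in covered) / len(covered)


# Hypothetical rule and data, for illustration only.
TRANSPORT_RULE = Rule(
    predicted_class="transport",
    conditions=[
        lambda o: o["length"] > 300,
        lambda o: o["hydrophobic_fraction"] >= 0.4,
    ],
)

LABELLED = [
    ({"length": 450, "hydrophobic_fraction": 0.50}, "transport"),
    ({"length": 320, "hydrophobic_fraction": 0.45}, "metabolism"),
    ({"length": 120, "hydrophobic_fraction": 0.60}, "transport"),
]

if __name__ == "__main__":
    unassigned = {"length": 400, "hydrophobic_fraction": 0.41}
    print(predict(unassigned, [TRANSPORT_RULE]))
    print(rule_accuracy(TRANSPORT_RULE, LABELLED))
```

Measuring accuracy only on the ORFs a rule covers mirrors how a rule set can assign functions to some unassigned ORFs while staying silent on the rest.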
Potential Biases
The accuracy estimates may be optimistic, because the unassigned ORFs come from a different distribution than the training data (ORFs with known function).
Limitations
The rules are often complicated and their biological basis can be difficult to understand.
Statistical Information
P-Value
p<0.05
Statistical Significance
p<0.05