Accurate prediction of protein functional class from sequence in the Mycobacterium tuberculosis and Escherichia coli genomes using data mining
2000

Predicting Protein Functions from Genomic Sequences

Sample size: 8213
Evidence: moderate

Author Information

Author(s): Ross D. King, Andreas Karwath, Amanda Clare, Luc Dehaspe

Primary Institution: University of Wales, Aberystwyth

Hypothesis

Can data mining techniques be used to predict protein functional classes from genomic sequences?

Conclusion

The study predicts protein functional classes from sequence in Mycobacterium tuberculosis and Escherichia coli, with an estimated accuracy of 60-80%.

Supporting Evidence

  • 65% of the ORFs with no assigned function in M. tuberculosis were predicted to have a function.
  • 24% of the ORFs with no assigned function in E. coli were predicted to have a function.
  • The rules were founded on detection of remote homology, convergent evolution, and horizontal gene transfer.
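
The percentages above are coverage figures: the share of unassigned ORFs for which at least one rule produces a prediction. A minimal sketch of that calculation follows. The rules, attribute names (`length`, `has_remote_homolog`, `hydrophobic_fraction`), class labels, and example records are all invented for illustration and are not taken from the study.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical ORF record: attribute name -> value (names invented for this sketch).
Orf = Dict[str, Any]
# A rule pairs a condition on an ORF with the functional class it predicts.
Rule = Tuple[Callable[[Orf], bool], str]

# Illustrative rules only; the study's learned rules are not reproduced here.
RULES: List[Rule] = [
    (lambda o: o["length"] > 300 and o["has_remote_homolog"], "transport"),
    (lambda o: o["hydrophobic_fraction"] > 0.5, "membrane"),
]


def predict(orf: Orf, rules: List[Rule]) -> Optional[str]:
    """Return the class of the first rule that fires, or None if no rule applies."""
    for condition, label in rules:
        if condition(orf):
            return label
    return None


def coverage(orfs: List[Orf], rules: List[Rule]) -> float:
    """Fraction of ORFs that receive a prediction from at least one rule."""
    if not orfs:
        return 0.0
    predicted = sum(1 for orf in orfs if predict(orf, rules) is not None)
    return predicted / len(orfs)


# Made-up unassigned ORFs, purely to exercise the functions.
unassigned: List[Orf] = [
    {"length": 450, "has_remote_homolog": True, "hydrophobic_fraction": 0.2},
    {"length": 120, "has_remote_homolog": False, "hydrophobic_fraction": 0.7},
    {"length": 200, "has_remote_homolog": False, "hydrophobic_fraction": 0.1},
    {"length": 90, "has_remote_homolog": True, "hydrophobic_fraction": 0.3},
]

print(f"coverage = {coverage(unassigned, RULES):.0%}")
```

Coverage says nothing about correctness: an ORF counts as predicted whenever a rule fires, whether or not the predicted class is right.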

Takeaway

Scientists used computer programs to predict what proteins do from their sequences alone, and the predictions were right about 60-80% of the time.

Methodology

The study applied data mining techniques, including inductive logic programming and rule learning, to sequence data from the two genomes, producing rules that predict the functional class of a protein.
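
To give a feel for what rule learning means here, the sketch below uses a deliberately simple stand-in, a one-attribute rule learner (often called OneR). It is not the inductive logic programming pipeline used in the study. The attribute names (`length_bin`, `homolog_hit`), class labels, and training records are hypothetical.

```python
from collections import Counter, defaultdict

def learn_one_rule(examples, labels):
    """Pick the single attribute whose value -> majority-class mapping
    makes the fewest errors on the training data (OneR)."""
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    best = None
    for attr in sorted(examples[0]):
        # Count class frequencies for each value of this attribute.
        counts = defaultdict(Counter)
        for example, label in zip(examples, labels):
            counts[example[attr]][label] += 1
        mapping = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(mapping[ex[attr]] != y for ex, y in zip(examples, labels))
        if best is None or errors < best[0]:
            best = (errors, attr, mapping)
    _, attr, mapping = best
    return attr, mapping, default


def classify(rule, example):
    """Apply a learned rule; unseen attribute values get the default class."""
    attr, mapping, default = rule
    return mapping.get(example[attr], default)


# Hypothetical training ORFs with known functional class (invented data).
train = [
    {"length_bin": "long", "homolog_hit": "yes"},
    {"length_bin": "long", "homolog_hit": "no"},
    {"length_bin": "short", "homolog_hit": "yes"},
    {"length_bin": "short", "homolog_hit": "no"},
    {"length_bin": "long", "homolog_hit": "yes"},
]
train_labels = ["enzyme", "transport", "enzyme", "transport", "enzyme"]

rule = learn_one_rule(train, train_labels)
print("chosen attribute:", rule[0])
print("prediction:", classify(rule, {"length_bin": "short", "homolog_hit": "no"}))
```

A real rule learner combines several conditions per rule and estimates accuracy on held-out data; this sketch only shows the basic idea of deriving readable if-then rules from labelled examples.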

Potential Biases

The estimated accuracy may be optimistic, because the ORFs with no assigned function come from a different distribution than the ORFs used for training.

Limitations

The rules are often complicated and their biological basis can be difficult to understand.

Statistical Information

P-Value

p<0.05
