Mining Biomolecular Data using Background Knowledge and Artificial Neural Networks

Qicheng Ma, Jason T. L. Wang, and James R. Gattiker

Biomolecular data mining is the activity of finding significant information in protein, DNA, and RNA molecules. Such information may include motifs, clusters, genes, protein signatures, and classification rules. This chapter presents an example of biomolecular data mining: the recognition of promoters in DNA. We propose a two-level ensemble of classifiers to recognize E. coli promoter sequences. The first-level classifiers include three Bayesian neural networks, each of which learns from a different feature set. The outputs of the first-level classifiers are combined in the second level to give the final result. To improve the recognition rate, we use background knowledge (i.e., the characteristics of promoter sequences) and employ new techniques to extract high-level features from the sequences. We also use an expectation-maximization (EM) algorithm to locate the binding sites of the promoter sequences. An empirical study shows that the proposed approach achieves a precision rate of 95%, indicating that it performs well.

Keywords: Biomolecular data mining, Bayesian neural networks, Background knowledge, Human Genome Project, Expectation-maximization algorithm.
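
The two-level arrangement described in the abstract can be illustrated with a minimal sketch. It is not the chapter's implementation. The three first-level scorers below are hypothetical stand-ins for the Bayesian neural networks: two check for the well-known E. coli consensus hexamers (TTGACA at the -35 region, TATAAT at the -10 region) and one measures A/T content. The second-level rule, a weighted average with a 0.5 threshold, is also an assumption, since the abstract does not state how the outputs are combined. All function names and score values are illustrative only.

```python
from typing import Callable, Sequence

# A first-level scorer maps a DNA string to a promoter score in [0, 1].
Scorer = Callable[[str], float]


def box35_score(seq: str) -> float:
    """Hypothetical stand-in: high score if the -35 consensus TTGACA occurs."""
    return 0.9 if "TTGACA" in seq else 0.1


def box10_score(seq: str) -> float:
    """Hypothetical stand-in: high score if the -10 consensus TATAAT occurs."""
    return 0.9 if "TATAAT" in seq else 0.1


def at_content_score(seq: str) -> float:
    """Hypothetical stand-in: fraction of A/T bases in the sequence."""
    return sum(base in "AT" for base in seq) / len(seq) if seq else 0.0


def combine(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Second level (assumed rule): weighted average of first-level outputs."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


def classify(seq: str, scorers: Sequence[Scorer],
             weights: Sequence[float], threshold: float = 0.5) -> bool:
    """Run every first-level scorer, combine the outputs, apply the threshold."""
    scores = [scorer(seq) for scorer in scorers]
    return combine(scores, weights) >= threshold


SCORERS = [box35_score, box10_score, at_content_score]
WEIGHTS = [1.0, 1.0, 1.0]  # equal weights, chosen only for illustration

if __name__ == "__main__":
    candidate = "TTGACA" + "GCGCGCGCGCGCGCGCG" + "TATAAT"
    print(classify(candidate, SCORERS, WEIGHTS))
    print(classify("GCGCGCGC", SCORERS, WEIGHTS))
```

In a real system, each scorer would be replaced by a trained classifier over its own feature set, and the combination weights would be learned from data instead of fixed by hand.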