| Abstract: |
Recognizing genes in DNA sequences requires integrating several types of
information. These may include the four-letter DNA sequence itself, the
frequencies of the corresponding codons, homology search scores, and splice
site signal strengths. We have developed a software system of multi-stream
Hidden Markov Models (HMMs) with which this information can be easily
integrated under a consistent probabilistic measure.
The output symbols of HMMs are the signals that we can observe. In the field
of bioinformatics, the output symbols of the HMMs are mostly the four letters
of nucleic acids or the twenty letters of amino acids. However, the output
symbols can be anything, as long as a probability distribution can be attached to them.
They can be discrete symbols, real values, real-valued vectors, or multiple
streams of such values. We propose gene annotation using HMMs with multiple
streams, which combine the sequence with other pre-processed information.
The important feature of multi-stream HMMs is that the weights of the streams
can be optimized for each model. A multi-stream HMM with adjustable weights
permits very flexible design of gene-finding systems. |
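
As an illustration of how stream weights might enter the emission score, the following minimal sketch uses a common multi-stream formulation in which the log emission probability of a state is the weighted sum of the per-stream log probabilities. The abstract does not give the system's actual formula, so this formulation is an assumption, as are the names `multistream_log_emission` and `state_log_emission`, the base probabilities, and the Gaussian parameters of the homology-score stream.

```python
import math

def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian (used for a real-valued stream)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def multistream_log_emission(stream_log_probs, weights):
    """Weighted sum of per-stream log probabilities.

    Assumed formulation: log b(o) = sum_s w_s * log b_s(o_s).
    """
    if len(stream_log_probs) != len(weights):
        raise ValueError("need exactly one weight per stream")
    return sum(w * lp for w, lp in zip(weights, stream_log_probs))

# Hypothetical parameters of a single HMM state (illustrative values only).
BASE_PROBS = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}  # discrete DNA stream
SCORE_MEAN, SCORE_VAR = 2.0, 1.0                       # homology-score stream

def state_log_emission(base, score, weights):
    """Combine the DNA-letter stream and the score stream for one state."""
    lp_base = math.log(BASE_PROBS[base])
    lp_score = gaussian_log_pdf(score, SCORE_MEAN, SCORE_VAR)
    return multistream_log_emission([lp_base, lp_score], weights)

# A weight of 0 switches a stream off; larger weights increase its influence.
sequence_only = state_log_emission("G", 2.0, weights=(1.0, 0.0))
both_streams = state_log_emission("G", 2.0, weights=(1.0, 1.0))
```

Under this assumption, setting a stream's weight to zero reduces the model to the remaining streams, which is one way the adjustable weights could let each model rely more or less on pre-processed evidence.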