UCSC-CRL-94-24: USING MARKOV MODELS AND HIDDEN MARKOV MODELS TO FIND REPETITIVE EXTRAGENIC PALINDROMIC SEQUENCES IN ESCHERICHIA COLI

07/01/1994 09:00 AM
Computer Engineering
This paper presents a technique for using simple Markov models and hidden Markov models (HMMs) to search for interesting sequences in a database of DNA sequences. The models are used to create a cost map for each sequence in the database. These cost maps can be searched rapidly for subsequences that have significantly lower costs than a null model. Milosavljevic\'s algorithmic significance test is used to determine when a subsequence is significantly found. The sequences reported are trimmed to maximize the signal-to-noise ratio (cost savings / sqrt(length)). Methods are given for automatically constructing simple Markov models and hidden Markov models from small training sets. The techniques are illustrated by searching a database of E. coli genomic DNA, EcoSeq6, for clusters of Repetitive Extragenic Palindromic sequences (REPs). Of the known REPs, 91% are found with simple Markov models starting with a single REP cluster as a seed, and 95% are found by a hidden Markov model built from the results of the simple Markov model search. There are no false positives from the simple Markov models, and the few extra sequences found by the HMMs may be genuinely related sequences.

UCSC-CRL-94-24