UCSC-CRL-94-23: STOCHASTIC CONTEXT-FREE GRAMMARS FOR MODELING THREE SPLICEOSOMAL SMALL NUCLEAR RIBONUCLEIC ACIDS

06/01/1994 09:00 AM
Computer Engineering
In this thesis, stochastic context-free grammars (SCFGs) are applied to the problems of folding, aligning and modeling families of homologous RNA sequences, specifically the small nuclear RNA (snRNA) sequences of the spliceosome. SCFGs generalize the hidden Markov models (HMMs) used in related work on protein and DNA to capture the sequences' common primary and secondary structure. This thesis discusses a recently introduced algorithm, Tree-Grammar EM, for deducing SCFG parameters automatically from unaligned, unfolded training sequences [SBU+93, SBM+94, SBH+94a] and demonstrates its application to modeling three of the five prominent trans-acting factors in spliceosomal small nuclear RNA: U1, U5 and U6. Tree-Grammar EM, a generalization of the HMM forward-backward algorithm, is based on tree grammars and is faster than the previously proposed inside-outside SCFG training algorithm. Results show that, after being trained on as few as about 20 snRNA sequences, each of the three models can discern snRNA sequences from similar-length RNA sequences of other kinds, can find the secondary structure of new snRNA sequences and can produce multiple alignments of snRNA sequences.
Notes: Master's Thesis
