UCSC-CRL-95-11: REGULARIZERS FOR ESTIMATING DISTRIBUTIONS OF AMINO ACIDS FROM SMALL SAMPLES

03/01/1995 09:00 AM
Computer Engineering
This paper examines several different methods for estimating the distribution of amino acids in a specific context, given a very small sample of amino acids from that distribution. These distribution estimators, sometimes called regularizers, are frequently used when aligning sequences to each other or to models such as profiles or hidden Markov models. The distribution estimators considered here are zero-offsets, pseudocounts, substitution matrices (with several variants), feature alphabets, and Dirichlet mixture regularizers. A new method is presented for setting the parameters of the regularizers to minimize the encoding cost (also called the entropy) of the training data, for all possible samples from the training data. The optimal parameter settings depend on the size of the sample, but the optimization method can also be used to get good performance over a range of sample sizes. The optimal settings with this method are not the same as the traditional values used for the parameters. The regularizers are evaluated based on how well they estimate the distributions of the columns of a multiple alignment---specifically, the expected encoding cost per amino acid using the regularizer method and all possible samples from each column. The differences between the regularizers are fairly small (less than 0.2 bits per column), but large enough to make a significant difference when many columns are combined, as is done in an alignment. In general, the pseudocounts have the lowest encoding costs for samples of size zero, substitution matrices have the lowest encoding costs for samples of size one, and Dirichlet mixtures have the lowest for larger samples. One of the substitution matrix variants, which added pseudocounts and scaled counts, does almost as well as the best Dirichlet mixtures, but with a lower computation cost.
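To make the two central quantities concrete, the following is a minimal sketch of a pseudocount regularizer and the encoding-cost criterion it is tuned against. The pseudocount formula and the bits-per-symbol cross-entropy are standard; the specific pseudocount values, the toy 4-letter alphabet (the real setting uses 20 amino acids), and the sample counts are illustrative assumptions, not values from the paper.

```python
import math

def pseudocount_estimate(counts, pseudocounts):
    """Estimate a distribution from a small sample by adding pseudocounts
    to the observed counts and renormalizing."""
    total = sum(counts) + sum(pseudocounts)
    return [(c + z) / total for c, z in zip(counts, pseudocounts)]

def encoding_cost_bits(true_dist, est_dist):
    """Expected encoding cost in bits per symbol when symbols drawn from
    true_dist are coded with a code optimal for est_dist (cross-entropy)."""
    return -sum(p * math.log2(q) for p, q in zip(true_dist, est_dist) if p > 0)

# Toy 4-letter alphabet for illustration only.
counts = [3, 1, 0, 0]          # a very small sample from one column
alpha  = [0.5, 0.5, 0.5, 0.5]  # assumed uniform pseudocounts

est = pseudocount_estimate(counts, alpha)
# Unseen letters still get nonzero probability, so the encoding cost
# stays finite even for symbols absent from the sample.
```

Tuning a regularizer in the sense of the abstract means choosing the pseudocount vector (or the analogous parameters of the other methods) to minimize this encoding cost averaged over all possible samples drawn from the training columns.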