UCSC-CRL-90-41: CATEGORIZATION OF MACROMOLECULAR SEQUENCES BY MINIMAL LENGTH ENCODING

09/01/1990 09:00 AM
Biomolecular Engineering
A new method for categorization of macromolecular sequences is developed by applying the Minimal Length Encoding principle. While the method is developed with the particular application in mind, the formal model may be applied to any categorization problem where the observations can be represented by using a finite set of attributes that take values from a finite set. The proposed method is shown to be more general than the categorization procedures based on Weighted Parsimony and Compatibility principles. The proposed method also provides a principled tradeoff between the number of inferred classes and their fit to data. A statistical significance test for the presence of classes is also proposed. The categorization problem is proven to be computationally hard even under various simplifying assumptions. To avoid the excessive computation time, an algorithm that is based on a local search strategy is proposed. The algorithm was applied to rediscover several known macromolecular sequence families as well as to discover new classes of Bacteriophage T7 promoters and Alu sequences. The algorithm was also tested on simulated data to determine the range of data parameters where it categorizes the sequences correctly. While the application of the Minimal Length Encoding principle was restricted to the categorization of macromolecules, a very general background on the process of categorization and the Minimal Length Encoding principle is also provided. It is proposed that categorization, and perhaps inductive inference in general, can be viewed as a process of minimal length encoding of observations. Notes: Ph.D. Thesis

This report is not available for download at this time.