02/01/2002 09:00 AM
Biomolecular Engineering
Clusters are increasingly used in research institutions and in industry to solve complex problems. Computational biology, or bioinformatics, is an important and growing research area that can make use of clusters. In most bioinformatics applications, as in many other scientific applications, I/O is a bottleneck that limits performance. The problem is made more significant by the fact that the GenBank database, the repository for genome data, grows exponentially, doubling every 14 months. Since the cost of disks has been decreasing exponentially, replication of databases is feasible. Like RAID, which mirrors data to provide availability and reliability, our proposed model provides these features, and it combines them with performance improvement through smart replication and location-transparent storage. We are developing a user-level library for this new model of location-transparent storage. The library can be viewed as a parallel I/O library that not only stripes files but also maintains read-only replicas of records and information about access times. A read access to a record is therefore redirected to the most appropriate location, sometimes by making an additional replica. Unlike a traditional cache, where there is a strict hierarchy of access times, our model does not assume a strict ordering. The cost function that determines the ordering is currently fixed at the start of application execution. The parameters for the cost function are obtained from an access pattern characterization study that we carried out on a variety of environments, ranging from clusters to supercomputers. An extension would be to include continuously varying parameters such as network line breaks, bandwidth, or hardware failures. Ultimately, we envision linking replication with a dynamic run-time performance model that can provide performance data about the execution environment on the fly to calculate access costs.
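
The redirection of a read to the most appropriate replica can be illustrated with a minimal sketch. This is not the library described in the abstract: the names Location, access_cost, and ReplicaCatalog, the latency-plus-transfer-time cost formula, and the replicate_ratio threshold are all assumptions made for illustration. The only properties taken from the abstract are that cost parameters are fixed at start-up, that a read goes to the cheapest location holding a replica, and that a read may create an additional replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Location:
    """A storage location that can hold read-only replicas (hypothetical model)."""
    name: str
    latency_s: float       # fixed per-access latency in seconds (assumed parameter)
    bandwidth_bps: float   # sustained bandwidth in bytes per second (assumed parameter)


def access_cost(loc: Location, record_bytes: int) -> float:
    """Estimated seconds to read one record from loc (assumed cost function)."""
    return loc.latency_s + record_bytes / loc.bandwidth_bps


class ReplicaCatalog:
    """Tracks which locations hold a replica of each record."""

    def __init__(self, locations, local_name, replicate_ratio=2.0):
        # Cost parameters are fixed here, at the start of execution.
        self.locations = {loc.name: loc for loc in locations}
        self.local = local_name
        self.replicate_ratio = replicate_ratio  # assumed replication policy
        self.replicas = {}  # record id -> set of location names

    def add_replica(self, record_id, loc_name):
        self.replicas.setdefault(record_id, set()).add(loc_name)

    def read(self, record_id, record_bytes):
        """Return the name of the location chosen to serve this read."""
        holders = self.replicas[record_id]
        # Pick the cheapest holder; ties are broken by name for determinism.
        best = min(
            holders,
            key=lambda n: (access_cost(self.locations[n], record_bytes), n),
        )
        best_cost = access_cost(self.locations[best], record_bytes)
        local_cost = access_cost(self.locations[self.local], record_bytes)
        # If a local copy would be much cheaper than the best existing one,
        # record an additional replica so that later reads are served locally.
        if self.local not in holders and best_cost > self.replicate_ratio * local_cost:
            self.add_replica(record_id, self.local)
        return best
```

In this sketch the cost depends on the record size as well as on the location, so a low-latency, low-bandwidth location can be cheaper for small records while a high-latency, high-bandwidth location is cheaper for large ones. That is one possible reading of a model without a strict ordering of access times; the abstract does not specify the actual cost function.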