UCSC-SOE-11-04: SciHadoop: Array-based Query Processing in Hadoop

Joe Buck, Noah Watkins, Jeff Lefevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, Scott A. Brandt
01/31/2011 09:00 AM
Computer Science
Hadoop has become the de-facto standard platform for large-scale analysis in commercial applications and increasingly also in scientific applications. However, applying Hadoop’s byte stream data model to scientific data that is commonly stored according to highly-structured, abstract data models causes a number of inefficiencies that significantly limits the scalability of Hadoop applications in science. In this paper we introduce SciHadoop, a modification of Hadoop which allows scientists to specify abstract queries using a logical, array-based data model and which executes these queries as map/reduce programs defined on the logical data model. We describe the implementation of a SciHadoop pro- totype and use it to quantify the performance of three lev- els of accumulative optimizations over the Hadoop baseline where a NetCDF data set is managed in a default Hadoop configuration: the first optimization avoids remote reads by subdividing the input space of mappers on the logical level and instantiates mapper tasks with subqueries against the logical data model, the second optimization avoids full file scans by taking advantage of metadata available in the scientific data, and the third optimization further minimizes data transfers by pulling holistic functions (i.e. functions that cannot compute partial results) to mappers whenever possible.