Sean Halle, Dmitry Nadezhkin, Albert Cohen
01/12/2010 09:00 AM
Computer Engineering
There is an increasing need for a framework that supports research on portable high-performance parallelism. Such a framework would facilitate two main goals: discovering how to separate application-specific code from hardware-specific code, and supporting research on parallel schedulers and optimizations. The framework would be agnostic to language, however the knowledge gained might inform future language research with better understanding of the boundary between application and hardware. We propose a first step towards such a framework; to separate application code from hardware specific code, we define a "bi-directional" library interface that has both, library functions implemented by application-code, and library functions implemented by hardware-specific code. We also make the scheduler a first-class entity that is called as a library function. We provide a sample implementation of the framework that can automatically link the two directions and generate executables, from the same source, for each platform being investigated. To support research on run-time schedulers we provide a pool of instrumented applications and our implementation of the framework to plug research schedulers into. The interface makes an instrumented application appear to the scheduler as a black box that has a number of "knobs" to manipulate, via the interface. Optimizations that performCode StructureTransforms are also supported by the application pool and our sample implementation. The researcher is free to use their preferred language to write command-line tools to perform the transforms. The framework would invoke the plugged-in tools on the suite of applications and generate ready-to-run executables for the target platform. We detail our first installment, which focuses on data parallelism, stating the interface defined so far. We describe the process of instrumenting a set of three applications, Hamiltonian Path (an NP-Complete problem) in Java, H264 Deblocking in C, and Matrix Multiply in Java. We describe the process of implementing, then plugging-in run-time schedulers for three hardware platforms: Multi-core, the Cell processor, and a heterogeneous collection of multi-core. We also share details of our sample implementation that transform tools and run-time schedulers can plug into.