Award Finalist/Winner

SC Conference - Activity Details

Scaling Highly-parallel Data-intensive applications using MapReduce on a Parallel Clustered File System

Team Members:
Prasenjit Sarkar  (IBM Almaden Research Center)
Rajagopalan Ananthanarayanan  (IBM Almaden Research Center)
Karan Gupta  (IBM Almaden Research Center)
Ioannis Koltsidas  (University of Edinburgh)
Prashant Pandey  (IBM Almaden Research Center)
Mansi Shah  (IBM Almaden Research Center)
Renu Tewari  (IBM Almaden Research Center)
Guanying Wang  (Virginia Tech)
Gong Zhang  (Georgia Institute of Technology)
Challenges Session
SC08 Storage Challenge
Tuesday,  11:30AM - 12:00PM
Room 17A/17B
A new class of data-intensive supercomputing applications (e.g., business analytics and web search) processes massive amounts of data, with a greater focus on semantically transforming the data than on the traditional supercomputing concerns of computational speed and guaranteed data consistency. These applications are embarrassingly parallel and well suited to the MapReduce programming framework, which lets users perform large-scale data analysis while the runtime handles the system architecture, data partitioning, and task scheduling. In this paper, we demonstrate a terabyte sort initiated by a business intelligence application on an innovative architecture that runs GPFS over a cluster of commodity machines with direct-attached storage. The architecture maximizes storage performance through optimizations such as wide striping across the cluster machines, large block sizes, and exposing data locality information so that computation can be shipped to the data. We show initial proof-of-concept results that support our approach.
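As a rough illustration of two of the optimizations named in the abstract (wide striping and using data locality to ship computation to the data), the following minimal Python sketch places blocks round-robin across nodes and then assigns each map task to the node that stores its block whenever that node has a free slot. It is not the authors' implementation and does not use any GPFS or MapReduce API: the function names `stripe_blocks` and `schedule_map_tasks`, the node names, and the per-node slot counts are all hypothetical, chosen only for this example.

```python
def stripe_blocks(num_blocks, nodes):
    """Hypothetical wide striping: block i is stored on node i mod len(nodes)."""
    return {block: nodes[block % len(nodes)] for block in range(num_blocks)}


def schedule_map_tasks(block_locations, slots_per_node):
    """Assign one map task per block, preferring the node that holds the block.

    block_locations: dict mapping block id -> node that stores the block.
    slots_per_node:  dict mapping node -> number of map tasks it can run.
    Returns (assignment, remote_count), where assignment maps block id -> node
    and remote_count is the number of tasks that had to read a non-local block.
    """
    free = dict(slots_per_node)  # copy so the caller's dict is not mutated
    assignment = {}
    remote_count = 0
    for block, home in sorted(block_locations.items()):
        if free.get(home, 0) > 0:
            node = home  # local task: computation is shipped to the data
        else:
            # Fall back to the node with the most free slots (a remote read).
            node = max(free, key=free.get)
            if free[node] == 0:
                raise RuntimeError("no free map slots left in the cluster")
            remote_count += 1
        free[node] -= 1
        assignment[block] = node
    return assignment, remote_count


if __name__ == "__main__":
    nodes = ["n0", "n1", "n2", "n3"]  # assumed four-node cluster
    locations = stripe_blocks(8, nodes)
    assignment, remote = schedule_map_tasks(locations, {n: 2 for n in nodes})
    print("assignment:", assignment)
    print("remote reads:", remote)
```

With blocks striped evenly and enough slots on every node, every task in this sketch runs where its block lives; when some nodes have no free slots, the fallback path shows the cost of losing locality as a count of remote reads.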
IEEE Computer Society / ACM   20 Years - Unleashing the Power of HPC