SC Conference - Activity Details

Parallel I/O for Very Large Scale Data Sets in the R Statistical Computing Package

Jeffrey A. Delmerico (University at Buffalo)
Matthew D. Jones (University at Buffalo)
Steve M. Gallo (University at Buffalo)
Daniel P. Gaile (University at Buffalo)
Posters Session
Tuesday, 05:15PM - 07:00PM
Room Rotunda Lobby
In many scientific and medical applications, the ability to produce data is outpacing the capability of computer systems to process and analyze it. Conventional systems perform poorly on data sets that exceed their memory capacity, which can be severely limiting. We present a software package that decomposes a large data set while concurrently recording metadata about the decomposition, so that the data can later be retrieved and processed within the memory limits of the system. Its performance has been optimized for parallel I/O, and it scales well with both system size and processor count. The impetus for this work is the desire to perform statistical analyses, within the R programming environment, on data produced by the latest high-resolution DNA microarrays. The poster will describe the problem, the approach, and possible applications, and will include visual aids illustrating the performance achieved on real-world data sets and storage systems.
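The decomposition-plus-metadata idea in the abstract can be illustrated with a minimal sketch. It is not the authors' package and is written in Python rather than R: the function names (`decompose`, `read_rows`), the JSON chunk format, and the `manifest.json` layout are all assumptions made for illustration, and the sketch is serial rather than parallel. It shows only how recording each chunk's row range while writing lets a later reader load just the chunks that overlap a request.

```python
import json
import os
import tempfile


def decompose(rows, out_dir, rows_per_chunk):
    """Stream rows into fixed-size chunk files, recording metadata as we go.

    Hypothetical layout: one JSON file per chunk plus a manifest.json that
    stores each chunk's file name, first row index, and row count.
    """
    manifest = {"rows_per_chunk": rows_per_chunk, "chunks": []}
    buffer, start = [], 0

    def flush():
        nonlocal buffer, start
        if not buffer:
            return
        name = f"chunk_{len(manifest['chunks']):05d}.json"
        with open(os.path.join(out_dir, name), "w") as fh:
            json.dump(buffer, fh)
        # Metadata is recorded at the moment the chunk is written.
        manifest["chunks"].append(
            {"file": name, "first_row": start, "n_rows": len(buffer)}
        )
        start += len(buffer)
        buffer = []

    for row in rows:  # only one chunk is held in memory at a time
        buffer.append(row)
        if len(buffer) == rows_per_chunk:
            flush()
    flush()  # write the final, possibly short, chunk

    manifest["total_rows"] = start
    with open(os.path.join(out_dir, "manifest.json"), "w") as fh:
        json.dump(manifest, fh)
    return manifest


def read_rows(out_dir, first, last):
    """Return rows in [first, last), opening only the chunks that overlap."""
    with open(os.path.join(out_dir, "manifest.json")) as fh:
        manifest = json.load(fh)
    result = []
    for chunk in manifest["chunks"]:
        lo = chunk["first_row"]
        hi = lo + chunk["n_rows"]
        if hi <= first or lo >= last:
            continue  # chunk lies entirely outside the requested range
        with open(os.path.join(out_dir, chunk["file"])) as fh:
            data = json.load(fh)
        result.extend(data[max(first, lo) - lo:min(last, hi) - lo])
    return result


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        decompose(([i, i * i] for i in range(10)), d, rows_per_chunk=4)
        print(read_rows(d, 3, 6))  # rows 3..5, read from two of three chunks
```

Because every chunk is a separate file described by the manifest, independent processes could in principle each be handed a different subset of chunks, which is one plausible way such a layout lends itself to parallel I/O.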
IEEE Computer Society / ACM   20 Years - Unleashing the Power of HPC