SC Conference - Activity Details

The Growing Need for Resilience in HPC Software

Primary Session Leader:
Gregory M. Thorson  (SGI)

Secondary Session Leaders:
John T. Daly  (Los Alamos National Laboratory)
Stephen L. Scott  (Oak Ridge National Laboratory)
Birds-of-a-Feather Session
Wednesday, 12:15PM - 01:15PM
Room Ballroom E
Low reliability in petascale systems and the desire for low-cost HPC hardware are increasing the need for resilience in software. Reliability can come through redundancy, but this is often expensive; hence the need for resilient system and application software. Driven by the search engine market, new techniques have been developed for automatically restarting failed processes. These techniques are useful in some areas of HPC, but other algorithms are so tightly coupled that an individual thread or process cannot simply be restarted from the beginning. This is where checkpointing has traditionally been deployed. However, with the size of the systems on the drawing board today, it appears that checkpointing, as currently implemented, is quickly reaching the end of its feasibility. A dialogue within the HPC community is needed to explore when resilience techniques are required and which ones are appropriate. We must then work together to lower the barrier to adoption of these techniques.
IEEE Computer Society  /  ACM     20 Years - Unleashing the Power of HPC