SC Conference - Activity Details
The Growing Need for Resilience in HPC Software
Primary Session Leader:
Gregory M. Thorson (SGI)
Secondary Session Leaders:
Low reliability in petascale systems and the desire for low-cost HPC hardware are increasing the need for resilience in software. Reliability can come through redundancy, but this is often expensive; hence the need for resilient system and application software. Driven by the search engine market, new techniques have been developed for automatically restarting failed processes. Although these techniques are useful for some areas of HPC, other algorithms are so fundamentally coupled that an individual thread or process cannot simply be restarted from the beginning. This is where checkpointing has traditionally been deployed. However, with the size of systems on the drawing board today, it appears that checkpointing, as currently implemented, is quickly reaching the end of its feasibility. A dialogue within the HPC community is needed to explore when resilience techniques are required and which ones are appropriate. We must then work together to lower the barrier to adoption of these techniques.