Coordinated Fault Tolerance in High-end Computing Environments

Primary Session Leader:
Pete Beckman (Argonne National Laboratory)

Secondary Session Leaders:
Rinku Gupta (Argonne National Laboratory)
Al Geist (Oak Ridge National Laboratory)

Birds-of-a-Feather Session
Tuesday, 12:15PM - 01:15PM
Room: Ballroom G

The Coordinated Infrastructure for Fault Tolerant Systems (CIFTS) initiative provides a standard framework, through the Fault Tolerance Backplane (FTB), in which any component of the software stack can report faults or be notified of them through a common interface, enabling coordinated fault tolerance and recovery. At SC'07, an enthusiastic audience from industry, academia, and research institutions participated in the CIFTS BOF. Building on that success, the objectives of the SC'08 BOF are:

1. Discuss the experiences gained and the challenges faced in comprehensive fault management on petascale leadership machines, and the impact of the CIFTS framework in this environment. Teams developing FTB-enabled software such as MVAPICH2, MPICH2, Open MPI, Cobalt, and others will share their experiences.
2. Discuss recent enhancements and planned developments for CIFTS, and solicit audience feedback.
3. Bring together individuals who are responsible for high-end, petascale computing infrastructures and who have an interest in developing fault tolerance specifically for these environments.
