HPC Fellowship: Process-Level Fault Tolerance for Job Healing in HPC Environments

Chao Wang (North Carolina State University)
Doctoral Research Showcase Session
Wednesday, 03:30PM - 04:00PM
Room 17A/17B
As the number of nodes in high-performance computing (HPC) environments keeps increasing, faults are becoming commonplace. Frequently deployed checkpoint/restart mechanisms generally require a complete restart of the job. Yet some node failures can be anticipated by detecting a deteriorating health status in today's systems, an opportunity that proactive fault tolerance (FT) can exploit. Our work proposes novel, scalable mechanisms in support of proactive FT and significant enhancements to reactive FT, including the following contributions: (1) Provide a transparent job pause service that lets live nodes remain active and roll back to the last checkpoint while failed nodes are dynamically replaced by spares, after which the job resumes from that checkpoint; (2) Complement reactive FT with proactive FT through a process-level live migration mechanism that supports continued execution of an application during much of the migration; (3) Develop incremental checkpointing techniques that capture only the data changed since the last checkpoint, reducing the cost of reactive FT.
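
The idea behind contribution (3) can be illustrated with a minimal sketch. It is not the mechanism developed in this work: it assumes a fixed-size memory image and detects changed pages by comparing hashes, whereas a real process-level implementation would more likely rely on operating-system dirty-page tracking. The names IncrementalCheckpointer, PAGE_SIZE, and page_digests are hypothetical and chosen only for illustration.

```python
import hashlib

PAGE_SIZE = 4096  # hypothetical page granularity, chosen for illustration


def page_digests(memory, page_size=PAGE_SIZE):
    """Return one SHA-256 digest per fixed-size page of the memory image."""
    return [
        hashlib.sha256(memory[i:i + page_size]).digest()
        for i in range(0, len(memory), page_size)
    ]


class IncrementalCheckpointer:
    """Toy incremental checkpointer (assumes the image never changes size).

    The first checkpoint stores every page; each later checkpoint stores
    only the pages whose contents differ from the previous checkpoint.
    """

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self._digests = []      # digests recorded at the previous checkpoint
        self.checkpoints = []   # list of {page_index: page_bytes} deltas

    def checkpoint(self, memory):
        """Record a checkpoint and return the delta that was saved."""
        digests = page_digests(memory, self.page_size)
        delta = {}
        for idx, digest in enumerate(digests):
            is_new = idx >= len(self._digests)
            if is_new or self._digests[idx] != digest:
                start = idx * self.page_size
                delta[idx] = bytes(memory[start:start + self.page_size])
        self._digests = digests
        self.checkpoints.append(delta)
        return delta

    def restore(self):
        """Rebuild the latest image by replaying all deltas in order."""
        pages = {}
        for delta in self.checkpoints:
            pages.update(delta)  # later deltas overwrite older pages
        return b"".join(pages[i] for i in range(len(pages)))


if __name__ == "__main__":
    mem = bytearray(4 * PAGE_SIZE)
    cp = IncrementalCheckpointer()
    print(len(cp.checkpoint(mem)))   # full checkpoint: all pages saved
    mem[5000] = 1                    # modify one byte inside page 1
    print(sorted(cp.checkpoint(mem)))  # only the changed page is saved
    print(cp.restore() == bytes(mem))
```

In this sketch the cost of a checkpoint after the first one scales with the number of modified pages rather than with the total image size, which is the saving that incremental checkpointing aims for. Restoring requires the initial full checkpoint plus every later delta.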
