Student Contribution

SC Conference - Activity Details

Proactive Process-Level Live Migration in HPC Environments

Chao Wang  (North Carolina State University)
Frank Mueller  (North Carolina State University)
Christian Engelmann  (Oak Ridge National Laboratory)
Stephen L. Scott  (Oak Ridge National Laboratory)
Papers Session
Thursday,  11:30AM - 12:00PM
Room Ballroom F
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
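The relationship between the proactively handled fraction of faults and the checkpoint count can be illustrated with a rough back-of-the-envelope sketch. It assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF); the abstract does not state which model was used, so this is an assumption, and the function names and the sample cost and MTBF values are hypothetical. If a fraction p of faults is avoided by migration, only the remaining 1 - p must be covered by checkpoints, which stretches the effective MTBF by 1 / (1 - p).

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the optimal checkpoint interval.

    Assumption: this model is used here only for illustration.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_reduction(proactive_fraction: float) -> float:
    """Relative drop in checkpoint count when a fraction of faults is
    handled proactively (by migration) instead of by checkpoint/restart.

    The effective MTBF grows by 1 / (1 - p), so the interval grows by
    1 / sqrt(1 - p) and the checkpoint count shrinks by sqrt(1 - p).
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    return 1.0 - math.sqrt(1.0 - proactive_fraction)


if __name__ == "__main__":
    cost_s = 60.0          # hypothetical checkpoint cost
    mtbf_s = 8 * 3600.0    # hypothetical system MTBF
    p = 0.7                # fraction of faults handled proactively

    base = young_interval(cost_s, mtbf_s)
    stretched = young_interval(cost_s, mtbf_s / (1.0 - p))
    print(f"interval without proactive FT: {base:.0f} s")
    print(f"interval with proactive FT:    {stretched:.0f} s")
    print(f"checkpoint count reduced by:   {checkpoint_reduction(p):.1%}")
```

Under these assumptions, handling 70% of faults proactively reduces the checkpoint count by roughly 45%, which is consistent with the "nearly in half" figure quoted above.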
The full paper can be found in the IEEE Xplore Digital Library and the ACM Digital Library.
