Student Contribution

SC Conference - Activity Details



Proactive Process-Level Live Migration in HPC Environments

Authors:
Chao Wang  (North Carolina State University)
Frank Mueller  (North Carolina State University)
Christian Engelmann  (Oak Ridge National Laboratory)
Stephen L. Scott  (Oak Ridge National Laboratory)
Papers Session
Scheduling
Thursday,  11:30AM - 12:00PM
Room Ballroom F
Abstract:
As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace. Reactive fault tolerance (FT) often does not scale due to massive I/O requirements and relies on manual job resubmission. This work complements reactive with proactive FT at the process level. Through health monitoring, a subset of node failures can be anticipated when a node's health deteriorates. A novel process-level live migration mechanism supports continued execution of applications during much of the process migration. This scheme is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs. Experiments indicate that 1-6.5 seconds of prior warning are required to successfully trigger live process migration, while similar operating system virtualization mechanisms require 13-24 seconds. This self-healing approach complements reactive FT by nearly cutting the number of checkpoints in half when 70% of the faults are handled proactively.
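The closing claim about checkpoints can be sanity-checked with a short calculation. The sketch below is not taken from the abstract; it assumes that the checkpoint interval is chosen with Young's first-order approximation, sqrt(2 * C * MTBF), and that faults handled proactively no longer need to be covered by checkpoints. The function names and the sample values for checkpoint cost and MTBF are hypothetical and chosen only for illustration.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the optimal checkpoint interval.

    Assumption: the abstract does not say which interval model is used.
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction, checkpoint_cost_s=300.0, mtbf_s=86400.0):
    """Checkpoint count with proactive FT divided by the count without it.

    proactive_fraction is the share of faults that are predicted and handled
    by migration. Only the remaining faults must be covered by checkpoints,
    so the effective MTBF seen by the reactive scheme grows by 1 / (1 - p).
    The default cost and MTBF are hypothetical; they cancel out of the ratio.
    """
    if not 0.0 <= proactive_fraction < 1.0:
        raise ValueError("proactive_fraction must be in [0, 1)")
    baseline = young_interval(checkpoint_cost_s, mtbf_s)
    effective_mtbf = mtbf_s / (1.0 - proactive_fraction)
    with_proactive = young_interval(checkpoint_cost_s, effective_mtbf)
    # For a fixed run length, the number of checkpoints is inversely
    # proportional to the interval between them.
    return baseline / with_proactive


if __name__ == "__main__":
    print(f"checkpoint ratio at 70% proactive: {checkpoint_ratio(0.7):.3f}")
```

Under these assumptions the ratio reduces to sqrt(1 - p), which for p = 0.7 is roughly 0.55 and is consistent with "nearly cutting the number of checkpoints in half".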
The full paper can be found in the IEEE Xplore Digital Library and the ACM Digital Library.