Student Contribution
SC Conference - Activity Details
Proactive Process-Level Live Migration in HPC Environments
Authors:
Chao Wang (North Carolina State University)
Frank Mueller (North Carolina State University)
Christian Engelmann (Oak Ridge National Laboratory)
Stephen L. Scott (Oak Ridge National Laboratory)
Papers Session: Scheduling
Thursday, 11:30AM - 12:00PM
Room: Ballroom F
Abstract:
As the number of nodes in high-performance computing environments
keeps increasing, faults are becoming commonplace. Reactive fault
tolerance (FT) often does not scale due to massive I/O requirements
and relies on manual job resubmission.
This work complements reactive with proactive FT at the process
level. Through health monitoring, a subset of node failures can be
anticipated when a node's health deteriorates. A novel process-level live
migration mechanism supports continued execution of applications
during much of the process migration. This scheme is integrated into an
MPI execution environment to transparently sustain health-inflicted
node failures, which eradicates the need to restart and requeue MPI
jobs. Experiments indicate that 1-6.5 seconds of prior
warning are required to successfully trigger live process migration
while similar operating system virtualization mechanisms require 13-24
seconds. This self-healing approach complements reactive FT by nearly
cutting the number of checkpoints in half when 70% of the faults are
handled proactively.
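
The "nearly in half" figure can be checked with a short calculation. The sketch below assumes Young's first-order approximation for the optimal checkpoint interval, sqrt(2 * C * MTBF), which the abstract itself does not state. It also assumes that faults handled proactively no longer count toward the failure rate that checkpointing must cover. The function names and the checkpoint cost and MTBF values are illustrative placeholders, not measurements from the paper.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order optimal checkpoint interval (assumed model)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def checkpoint_ratio(proactive_fraction: float) -> float:
    """Relative number of checkpoints once a fraction of faults is
    handled proactively.

    Removing a fraction p of faults scales the effective MTBF by
    1 / (1 - p), so the interval grows by sqrt(1 / (1 - p)) and the
    checkpoint count shrinks by sqrt(1 - p).
    """
    return math.sqrt(1.0 - proactive_fraction)


# Illustrative placeholder values only.
cost_s = 60.0          # hypothetical time to write one checkpoint
mtbf_s = 6 * 3600.0    # hypothetical system mean time between failures

base = young_interval(cost_s, mtbf_s)
improved = young_interval(cost_s, mtbf_s / (1.0 - 0.7))

print(f"interval grows by a factor of {improved / base:.2f}")
print(f"checkpoint count scales by {checkpoint_ratio(0.7):.2f}")
```

Under these assumptions the count scales by sqrt(0.3), about 0.55, which is consistent with the abstract's claim of nearly halving the number of checkpoints.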