SC Conference - Activity Details
Enhancing Fault Resilience via Virtual Machine Checkpointing Mechanism
Authors:
Kasidit Chanchio
(Thammasat University)
Hong Ong
(Oak Ridge National Laboratory)
Chokchai Leangsuksun
(Louisiana Tech University)
Posters Session
Tuesday, 05:15PM - 07:00PM
Room Rotunda Lobby
Abstract:
With the advent of petaflops clusters, the ability to provide reliability and availability at scale is of paramount importance to new scientific discoveries. Checkpointing is a common technique for dealing with failures, while virtualization allows multiple runtime instances to run on the same compute system to maximize resource utilization. In this poster, we present a novel framework that implements a policy-driven virtual machine checkpointing mechanism. Our mechanism uses a latency-hiding protocol that enables checkpointing and computation to execute concurrently. Because the framework is implemented in KVM, it provides implicit system-level fault tolerance without modifying existing operating systems, applications, or hardware. For a quantitative evaluation, we compare the performance of three different checkpointing protocols using the Linpack benchmarks. For a qualitative evaluation, we demonstrate a working Windows XP guest running a Mathematica benchmark. The evaluation shows that the framework is scalable, highly efficient, and incurs low overhead. Finally, we discuss future work.