SC Conference - Activity Details

Enhancing Fault Resilience via Virtual Machine Checkpointing Mechanism

Kasidit Chanchio (Thammasat University)
Hong Ong (Oak Ridge National Laboratory)
Chokchai Leangsuksun (Louisiana Tech University)
Posters Session
Tuesday, 05:15PM - 07:00PM
Room Rotunda Lobby
With the advent of petaflops clusters, the ability to provide reliability and availability at scale is of paramount importance to new scientific discoveries. Checkpointing is a common technique for dealing with failures, while virtualization allows multiple runtime instances to run on the same compute system to maximize resource utilization. In this poster, we review a novel framework that implements a policy-driven virtual machine checkpointing mechanism. Our mechanism uses a latency-hiding protocol that enables checkpointing and computation to execute concurrently. Because the framework is implemented in KVM, it provides implicit system-level fault tolerance without modifying existing operating systems, applications, or hardware. For a quantitative evaluation, we compare the performance of three different checkpointing protocols using the Linpack benchmarks. For a qualitative evaluation, we demonstrate a Windows XP guest running a Mathematica benchmark. The evaluation shows that the framework is scalable and highly efficient, and that it incurs low overhead. Finally, we discuss future work.
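The abstract does not spell out the latency-hiding protocol, so the following is only a generic sketch of the underlying idea: computation pauses briefly to copy its state, then continues while a background thread saves the copy. It uses plain Python threads rather than KVM, and every name in it (`run_with_background_checkpoints`, `store`, the toy `state` dictionary) is a hypothetical stand-in, not part of the authors' framework.

```python
import copy
import threading


def run_with_background_checkpoints(steps, checkpoint_every, store):
    """Run a toy computation, saving periodic checkpoints in the background.

    Hypothetical illustration only: `store` is any list-like object that
    stands in for stable storage, and `state` stands in for VM memory.
    """
    state = {"step": 0, "total": 0}
    writer = None  # thread that is currently saving a checkpoint, if any

    for i in range(1, steps + 1):
        # Stand-in for one unit of guest computation.
        state["step"] = i
        state["total"] += i

        if i % checkpoint_every == 0:
            # Allow at most one checkpoint in flight at a time.
            if writer is not None:
                writer.join()
            # Short pause: take a consistent private copy of the state.
            snapshot = copy.deepcopy(state)
            # Save the copy in the background while computation continues.
            writer = threading.Thread(target=store.append, args=(snapshot,))
            writer.start()

    # Wait for the last checkpoint to finish before returning.
    if writer is not None:
        writer.join()
    return state


saved = []
final = run_with_background_checkpoints(steps=10, checkpoint_every=4, store=saved)
```

In this sketch the only time computation stops is during the in-memory copy. The slower save step overlaps with later computation, which is the general sense in which a checkpointing protocol can hide latency.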
IEEE Computer Society / ACM. 20 Years - Unleashing the Power of HPC