SC Conference - Activity Details
A Dependable Server I/O Networking Environment for High Performance
Authors:
Hsing-bung Chen
(Los Alamos National Laboratory)
Parks Fields
(Los Alamos National Laboratory)
Posters Session
Tuesday, 05:15PM - 07:00PM
Room Rotunda Lobby
Abstract:
We implemented a dependable server I/O fault-management mechanism
for LANL's PaScalBB-based High Performance Computing cluster
systems, so that computational jobs can run 24x7 without service
interruption during unexpected physical I/O link failures and connection
loss. This mechanism, named Dead Server I/O Gateway Detection and
Recovery (DGD), detects a data path connectivity problem within
seconds of its occurrence, removes the entry of the dead I/O gateway from
the Multi-Path routing table, transfers the affected I/O path to
an available entry in the Multi-Path routing table, and then resumes
the existing I/O data stream. DGD can tolerate multiple single
points of failure, keep the streaming I/O data moving, and
seamlessly continue and finish computation jobs. We have developed a
proof-of-concept implementation of this DGD mechanism on several large
Linux clusters as a blueprint for a production-type
Reliability-Availability-Serviceability (RAS) solution.
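
The detect, remove, transfer, and resume cycle described in the abstract can be illustrated with a small sketch. This is not the authors' DGD implementation, which the abstract does not detail. The class `MultiPathTable`, its methods, the least-loaded placement rule, and the injected `is_alive` probe are all hypothetical names and choices made for illustration only.

```python
from typing import Callable, Dict, List


class MultiPathTable:
    """Toy multi-path routing table (hypothetical): live gateways plus flow placement."""

    def __init__(self, gateways: List[str]) -> None:
        self.gateways: List[str] = list(gateways)
        self.flows: Dict[str, str] = {}  # flow id -> gateway currently carrying it

    def assign(self, flow_id: str) -> str:
        """Place a flow on the live gateway that currently carries the fewest flows."""
        if not self.gateways:
            raise RuntimeError("no live I/O gateway available")
        load = {gw: 0 for gw in self.gateways}
        for gw in self.flows.values():
            if gw in load:  # flows still pointing at a removed gateway are not counted
                load[gw] += 1
        chosen = min(self.gateways, key=lambda gw: load[gw])
        self.flows[flow_id] = chosen
        return chosen

    def detect_and_recover(self, is_alive: Callable[[str], bool]) -> List[str]:
        """Probe every gateway, drop the dead ones, and move their flows elsewhere.

        Raises RuntimeError (from assign) if flows need moving and no gateway survives.
        """
        dead = [gw for gw in self.gateways if not is_alive(gw)]
        for gw in dead:
            self.gateways.remove(gw)  # remove the dead entry from the table
        for flow_id, gw in list(self.flows.items()):
            if gw in dead:
                self.assign(flow_id)  # transfer the flow to a surviving entry
        return dead


if __name__ == "__main__":
    table = MultiPathTable(["gw1", "gw2", "gw3"])
    for job in ("job-a", "job-b", "job-c"):
        table.assign(job)
    # Assumption: a real probe would test link or path reachability; here gw2 is "down".
    removed = table.detect_and_recover(lambda gw: gw != "gw2")
    print(removed, table.flows)
```

In this sketch the probe is passed in as a function so that the table logic stays independent of how connectivity is actually checked. A periodic caller would invoke `detect_and_recover` at an interval short enough to meet the "within seconds" detection target mentioned in the abstract.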