SC Conference - Activity Details

A Dependable Server I/O Networking Environment for High Performance

Hsing-bung Chen (Los Alamos National Laboratory)
Parks Fields (Los Alamos National Laboratory)
Posters Session
Tuesday, 05:15PM - 07:00PM
Room Rotunda Lobby
We implemented a dependable server I/O fault-management mechanism for LANL's PaScalBB-based High Performance Computing cluster systems, allowing computational jobs to run 24x7 without service interruption during unexpected physical I/O link failures and connection loss. This mechanism, named Dead Server I/O Gateway Detection and Recovery (DGD), detects a data path connectivity problem within seconds of its occurrence, removes the entry of the dead I/O gateway from the Multi-Path routing table, transfers the affected I/O path to an available entry in the Multi-Path routing table, and then resumes the existing I/O data stream. DGD can tolerate multiple single points of failure, keep streaming I/O data moving, and allow computation jobs to continue and finish seamlessly. We have developed a proof-of-concept implementation of the DGD mechanism on several large Linux clusters as a blueprint for a production-type Reliability-Availability-Serviceability (RAS) solution.
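The detect, remove, transfer, and resume sequence described above can be illustrated with a small in-memory sketch. This is not the DGD implementation: the abstract does not say how gateways are probed or how flows are re-balanced, so the names `MultiPathTable`, `assign`, `detect_and_recover`, and `is_alive`, as well as the least-loaded placement policy, are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class MultiPathTable:
    """Toy multi-path routing table (hypothetical): live gateways plus
    a mapping from each I/O flow to the gateway that carries it."""
    gateways: List[str]
    flows: Dict[str, str] = field(default_factory=dict)

    def assign(self, flow_id: str) -> str:
        """Place a flow on the live gateway carrying the fewest flows.
        Least-loaded placement is an assumed policy, not taken from the source."""
        if not self.gateways:
            raise RuntimeError("no live I/O gateway available")
        load = {gw: 0 for gw in self.gateways}
        for gw in self.flows.values():
            if gw in load:  # flows still pointing at removed gateways are ignored
                load[gw] += 1
        chosen = min(self.gateways, key=lambda gw: load[gw])
        self.flows[flow_id] = chosen
        return chosen


def detect_and_recover(table: MultiPathTable,
                       is_alive: Callable[[str], bool]) -> List[str]:
    """One detection pass: probe every gateway, drop dead entries from the
    table, and move the affected flows to the surviving gateways."""
    # Detect: is_alive stands in for whatever connectivity probe is used.
    dead = [gw for gw in table.gateways if not is_alive(gw)]
    # Remove: delete all dead entries before re-balancing any flow.
    table.gateways = [gw for gw in table.gateways if gw not in dead]
    # Transfer: re-home each flow that was using a dead gateway.
    for flow_id, gw in list(table.flows.items()):
        if gw in dead:
            table.assign(flow_id)
    return dead
```

In use, a monitor would call `detect_and_recover` periodically with a real probe in place of `is_alive`; flows on healthy gateways are left untouched, so only the streams behind the failed gateway are moved.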
IEEE Computer Society / ACM     20 Years - Unleashing the Power of HPC