Analyzing Failure Events on ORNL’s Cray XT4

Byung-Hoon Park  (Oak Ridge National Laboratory)
Ziming Zheng  (Illinois Institute of Technology)
Zhiling Lan  (Illinois Institute of Technology)
Al Geist  (Oak Ridge National Laboratory)
Posters Session
Tuesday,  05:15PM - 07:00PM
Room Rotunda Lobby
Detecting and diagnosing failures in a supercomputer is challenging but crucial to improving its reliability, availability, and serviceability (RAS). As an initial step toward this end, we seek to uncover correlated system events that have similar occurrence patterns. Such event sets constitute characteristic signatures of a machine in various states. In this study, we analyzed Cray system log data collected on the Cray XT4 at Oak Ridge National Laboratory between May and December 2007. We then applied statistical and data mining approaches to sift out statistically and temporally correlated event types. From the analysis, we report (1) burst occurrence patterns found in fatal system events, (2) cumulative hazard functions that best fit the occurrences of fatal events, (3) correlated fatal events suggested by association rule mining and the lift measure, and (4) sets of both fatal and non-fatal events that are suggested to have causal relationships by cross-correlation analysis.
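
To illustrate the lift measure mentioned in item (3), the following is a minimal sketch, not the authors' implementation. It assumes the log has already been cut into time windows, each reduced to the set of event types seen in it; the function name pairwise_lift and the event labels EVT_A, EVT_B, EVT_C are hypothetical. Lift for a pair is P(a and b) / (P(a) * P(b)), so values above 1 suggest the two event types co-occur more often than independence would predict.

```python
from collections import Counter
from itertools import combinations


def pairwise_lift(windows):
    """Return {(a, b): lift} for every pair of event types that co-occur.

    `windows` is a list of sets; each set holds the event types observed
    in one time window (an assumed preprocessing step, not from the source).
    """
    n = len(windows)
    single = Counter()  # number of windows containing each event type
    pair = Counter()    # number of windows containing both types of a pair
    for w in windows:
        types = sorted(set(w))  # sorted so each pair has one canonical order
        single.update(types)
        pair.update(combinations(types, 2))
    return {
        (a, b): (count / n) / ((single[a] / n) * (single[b] / n))
        for (a, b), count in pair.items()
    }


# Hypothetical toy data: four time windows with made-up event labels.
windows = [
    {"EVT_A", "EVT_B"},
    {"EVT_A", "EVT_B"},
    {"EVT_C"},
    {"EVT_A", "EVT_C"},
]
lifts = pairwise_lift(windows)
```

In this toy input, EVT_A and EVT_B appear together in two of four windows, giving a lift above 1, while EVT_A and EVT_C give a lift below 1. The window length is a free parameter that a real analysis would have to choose.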
