SC Conference - Activity Details

System-wide Performance Equivalence Class Detection Using Clustering

Todd Gamblin  (University of North Carolina at Chapel Hill)
Bronis R. de Supinski  (Lawrence Livermore National Laboratory)
Martin Schulz  (Lawrence Livermore National Laboratory)
Robert J. Fowler  (Renaissance Computing Institute)
Daniel A. Reed  (Microsoft Research)
Posters Session
Tuesday, 05:15PM - 07:00PM
Room Rotunda Lobby
The largest current systems contain hundreds of thousands of processors, and petascale systems are expected to have millions or more. Manual data analysis is not feasible at these process counts, and engineers will need compact, aggregate descriptions of behaviors common to the distributed processes. We are developing techniques to model on-node performance based on application-semantic measures. Our on-node model correlates timing data and progress rates with hardware performance metric data. A strong correlation with application progress lets us determine the most important performance metrics on a per-node basis. Additionally, we apply clustering techniques to our on-node models to detect behavioral equivalence classes among application nodes: we first use scalable aggregation techniques to reduce the on-node data to a manageable size, then cluster the compact representation. We show that our approach can scalably diagnose performance problems in large parallel applications.
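The two steps described above (correlating progress with hardware metrics on each node, then clustering nodes by the result) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not name a correlation measure or a clustering algorithm, so Pearson correlation and a small k-means with farthest-point seeding are assumed here, and the function names and metric names (`flops`, `cache_misses`) are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sample lists (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def node_signature(progress, metrics):
    """Compact per-node summary: one correlation per metric, in sorted metric-name order.

    progress: list of progress-rate samples for one node (assumed input format).
    metrics:  dict mapping metric name -> list of samples taken at the same times.
    """
    return tuple(pearson(progress, metrics[name]) for name in sorted(metrics))


def most_important_metric(progress, metrics):
    """Metric whose samples correlate most strongly (in absolute value) with progress."""
    return max(sorted(metrics), key=lambda name: abs(pearson(progress, metrics[name])))


def _dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iters=20):
    """Plain k-means over node signatures; returns one cluster label per point.

    Seeding is deterministic: the first point, then repeatedly the point
    farthest from all centers chosen so far.
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(_dist2(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda c: _dist2(p, centers[c])) for p in points]
        # Move each center to the mean of its members (empty clusters stay put).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels
```

In this sketch, each node reduces its raw samples to a short signature locally, so only the signatures need to be gathered before clustering; nodes that receive the same label are treated as one behavioral equivalence class.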
IEEE Computer Society / ACM     20 Years - Unleashing the Power of HPC