SC Conference - Activity Details

Lessons Learned at 208K: Toward Debugging Millions of Cores

Gregory L. Lee (Lawrence Livermore National Laboratory)
Dong H. Ahn (Lawrence Livermore National Laboratory)
Dorian C. Arnold (University of Wisconsin-Madison)
Bronis R. de Supinski (Lawrence Livermore National Laboratory)
Matthew Legendre (University of Wisconsin-Madison)
Barton P. Miller (University of Wisconsin-Madison)
Martin Schulz (Lawrence Livermore National Laboratory)
Ben Liblit (University of Wisconsin-Madison)
Papers Session
Large-Scale System Performance
Tuesday, 04:30PM - 05:00PM
Room Ballroom E
In this paper, we present challenges to petascale tool development, using the Stack Trace Analysis Tool (STAT) as a case study. STAT is a lightweight tool that gathers and merges stack traces from a parallel application to identify process equivalence classes. We use results gathered with thousands of tasks on an InfiniBand cluster and with up to 208K processes on BlueGene/L to identify current scalability issues as well as challenges that will arise at the petascale. We then present the solutions to these challenges that we have implemented and show the resulting performance improvements. We also discuss future plans to meet the debugging demands of petascale machines.
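The idea of merging stack traces into process equivalence classes can be illustrated with a minimal sketch. This is not STAT's implementation or API; the function names, the dictionary-based tree, and the sample traces are all hypothetical, chosen only to show how tasks that share a call path fall into the same class.

```python
from collections import defaultdict


def group_by_stack_trace(traces):
    """Map each distinct call path to the sorted list of ranks that reported it."""
    classes = defaultdict(list)
    for rank, frames in traces.items():
        classes[tuple(frames)].append(rank)
    return {path: sorted(ranks) for path, ranks in classes.items()}


def build_prefix_tree(traces):
    """Merge traces into a call-prefix tree; each node records the ranks that reached it."""
    root = {"ranks": set(), "children": {}}
    for rank, frames in traces.items():
        node = root
        node["ranks"].add(rank)
        for frame in frames:
            node = node["children"].setdefault(
                frame, {"ranks": set(), "children": {}}
            )
            node["ranks"].add(rank)
    return root


# Hypothetical traces (outermost frame first) from four MPI ranks.
sample_traces = {
    0: ["main", "solve", "MPI_Barrier"],
    1: ["main", "solve", "MPI_Barrier"],
    2: ["main", "solve", "compute"],
    3: ["main", "solve", "MPI_Barrier"],
}

classes = group_by_stack_trace(sample_traces)
tree = build_prefix_tree(sample_traces)

for path, ranks in classes.items():
    print(" > ".join(path), ranks)
```

In this toy input, three ranks wait in the barrier while one is still computing, so a developer could attach a full debugger to one representative of each class rather than to every process.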
The full paper can be found in the IEEE Xplore Digital Library and the ACM Digital Library.
IEEE Computer Society / ACM   20 YEARS - UNLEASHING THE POWER OF HPC