Seeking new ways to help supercomputers bounce back
Posted July 24, 2013
Like professional athletes, supercomputers must play smarter just to stay in the game.
Once athletes reach their personal limits for strength, speed, and skill, all room for improvement lies in game play – finding ways to optimize the resources at hand. Yet the question remains: After committing an error, does an athlete return to top form?
It all comes down to resilience.
High-performance computing (HPC) faces a similar challenge as it approaches exascale capability – the capacity to complete a million trillion calculations per second or handle millions of trillions of data bits. Resilience is a big problem because the sheer size and complexity of hardware in each machine make failure inevitable.
Hardware also is near its limits for size and efficiency. As designers shrink circuits and reduce voltages, systems are more susceptible to soft errors – the unpredictable and unavoidable switching of 0s to 1s and vice versa that low-level noise or ambient radiation can trigger without hardware damage.
A University of Texas at Austin (UT) computer scientist and his colleagues are collaborating with Cray Inc. to develop a new approach, called containment domains. The concept could bring resilience strategies into play when and where they’re needed most.
“We are working to bring resilience to a footing similar to more traditional programmer concerns,” says Mattan Erez, UT associate professor of electrical and computer engineering. “Containment domains are the abstraction we came up with that satisfies these goals and can be used consistently across system and programming layers.”
A containment domain is a programming device that isolates an algorithm until all its components and iterations have been completed, checked for accuracy, and corrected, if necessary. Only after the resulting data pass these tests are they allowed to serve as inputs in subsequent algorithms.
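The isolate, check, and correct cycle described above can be illustrated with a minimal Python sketch. It is not the researchers' actual interface: the names run_contained, ContainmentError, flaky_sum and sum_check are hypothetical, correction is assumed to mean simple re-execution from preserved inputs, and the soft error is simulated by flipping one bit on the first attempt.

```python
import copy


class ContainmentError(Exception):
    """Raised when no attempt passes the check, so the error must escalate."""


def run_contained(compute, check, inputs, max_retries=3):
    """Run `compute` in isolation and release its result only if `check` passes.

    Hypothetical helper: inputs are preserved so a failed attempt can be
    re-executed without touching anything outside the domain.
    """
    preserved = copy.deepcopy(inputs)        # preserve inputs for re-execution
    for _ in range(max_retries + 1):
        work = copy.deepcopy(preserved)      # each attempt gets a clean copy
        result = compute(work)
        if check(preserved, result):         # verify before releasing the data
            return result
    raise ContainmentError("result failed its check on every attempt")


def make_flaky_sum():
    """Build a summation that suffers a simulated soft error on its first call."""
    calls = {"n": 0}

    def flaky_sum(values):
        calls["n"] += 1
        total = sum(values)
        if calls["n"] == 1:
            total ^= 1 << 4                  # flip one bit, as a soft error might
        return total

    return flaky_sum, calls


def sum_check(values, total):
    # Independent recomputation; a stand-in for a cheaper application-level test.
    return total == sum(values)


flaky_sum, calls = make_flaky_sum()
total = run_contained(flaky_sum, sum_check, [1, 2, 3, 4])
print(total, calls["n"])
```

In this sketch the corrupted first result never leaves run_contained; the caller sees only the value that passed the check. If every attempt failed, the exception would hand the problem to whatever code encloses the call.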
The group first devised the idea under the Echelon Ubiquitous High Performance Computing program, supported by the Defense Advanced Research Projects Agency (DARPA), Erez says. He’s completing the first year of a DOE Early Career Research Award supporting his project to develop promising solutions to the problem of resilience.