Refreshing the mesh,
and other career tales
Today’s largest supercomputers incorporate hundreds of thousands of cores. Upcoming systems like Sequoia, under development at Lawrence Livermore National Laboratory (LLNL), will have more than 1 million. And exascale systems are expected to have hundreds of millions of cores, millions of memory chips and hundreds of thousands of disk drives.
“At these scales,” says Greg Bronevetsky, an LLNL computer scientist, “supercomputers become unreliable simply because of the large numbers of components involved, with exascale machines expected to encounter continuous hardware failures.”
Bronevetsky’s work looks at a key aspect of these failures, or hardware faults: their effect on applications.
“The current state of the art is to execute each application of interest thousands of times, each time injecting it with a random fault,” Bronevetsky says. “The result is a profile of the application errors most likely to result from hardware faults and the types of hardware faults most likely to cause each type of application error.”
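The campaign Bronevetsky describes can be pictured with a toy sketch (this is not LLNL's tooling; the program, the fault model of a single random bit flip in a 64-bit float, and the error categories are all illustrative assumptions). Each trial injects one fault into a simple summation and classifies the outcome, building the kind of error profile the quote refers to:

```python
import math
import random
import struct

def flip_bit(x: float, bit: int) -> float:
    """Flip one bit of the 64-bit representation of x -- a simulated hardware fault."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped

def run_with_fault(data, rng):
    """Sum the data, injecting one random bit flip into one input value."""
    target = rng.randrange(len(data))
    total = 0.0
    for i, x in enumerate(data):
        if i == target:
            x = flip_bit(x, rng.randrange(64))
        total += x
    return total

def campaign(data, trials=1000, tol=1e-9, seed=0):
    """Run many injection trials and tally a rough application-error profile."""
    rng = random.Random(seed)
    truth = sum(data)
    profile = {"benign": 0, "silent data corruption": 0, "detectable (NaN/Inf)": 0}
    for _ in range(trials):
        result = run_with_fault(data, rng)
        if not math.isfinite(result):
            profile["detectable (NaN/Inf)"] += 1
        elif abs(result - truth) <= tol * max(1.0, abs(truth)):
            profile["benign"] += 1
        else:
            profile["silent data corruption"] += 1
    return profile

print(campaign([1.0] * 100))
```

Even this toy version hints at the cost problem: each data point is a full re-execution of the program, which is why thousands of runs per application quickly become prohibitive at scale.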
This procedure is so expensive, he says, that it can be done for only a few applications of high importance.
To address this conundrum, Bronevetsky is devising a modular fault analysis approach that breaks the application into its constituent software components, such as linear solvers and physics models. He will then perform fault injections on each component, producing a statistical profile of how faults affect and travel through the component. He plans to connect these component profiles to produce a model of how hardware faults affect the entire application.
Given a distribution of faults, the model can predict the resulting distribution of application errors. Likewise, given a detected application error, the model can provide reverse analysis to come up with a probability distribution of hardware faults that most likely caused the error.
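The forward and reverse analyses described above are, in essence, an application of Bayes' rule. The sketch below illustrates the idea with made-up numbers (the fault types, error types, and probabilities are assumptions for illustration, not measured LLNL data): the forward model turns a fault distribution into an error distribution, and the reverse model turns an observed error into a posterior over faults.

```python
# Assumed prior over hardware fault types (illustrative numbers).
fault_prior = {"ALU bit flip": 0.7, "memory bit flip": 0.3}

# Assumed P(application error | hardware fault), as a fault-injection
# campaign might measure it.
error_given_fault = {
    "ALU bit flip":    {"benign": 0.6, "wrong answer": 0.3, "crash": 0.1},
    "memory bit flip": {"benign": 0.4, "wrong answer": 0.2, "crash": 0.4},
}

def predict_errors(prior, likelihood):
    """Forward analysis: error distribution implied by a fault distribution."""
    out = {}
    for fault, p_fault in prior.items():
        for err, p_err in likelihood[fault].items():
            out[err] = out.get(err, 0.0) + p_fault * p_err
    return out

def infer_fault(observed_error, prior, likelihood):
    """Reverse analysis: posterior over faults given an observed error (Bayes' rule)."""
    joint = {f: prior[f] * likelihood[f][observed_error] for f in prior}
    z = sum(joint.values())
    return {f: p / z for f, p in joint.items()}

print(predict_errors(fault_prior, error_given_fault))
print(infer_fault("crash", fault_prior, error_given_fault))
```

With these particular numbers, a crash points more strongly at a memory bit flip than an ALU one, even though memory faults are assumed rarer, because crashes are far more likely given a memory fault.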
“The modularity of this mechanism will make it possible to assemble models of arbitrary applications out of pre-generated profiles of popular (software) libraries and services,” Bronevetsky says. “This will make large-scale complex systems easier to manage for system administrators and easier to use for computational scientists” — and help LLNL and others build ever larger and more powerful computers.
Speeding up scientific computing
Michelle Mills Strout, assistant professor of computer science at Colorado State University, will focus on models and tools that enable scientists to develop faster, more precise computational models of the physical world.
“Computing has become the third pillar of science along with theory and experimentation,” Strout says. “Some examples include molecular dynamics simulations that track atom movement in proteins over simulated femtoseconds and climate simulations for the whole Earth at tens-of-kilometers resolution.”
Such simulations require the constant evolution of algorithms that model the physical phenomena under study. The simulations also must keep up with rapidly changing implementation details that boost performance on highly parallel computer architectures.
“Currently, the algorithm and implementation specifications are strongly coupled,” Strout says. To correct for the resulting “code obfuscation that makes algorithm and implementation details difficult,” she will program new libraries that allow critical algorithms to “operate on sparse and dense matrices as well as computations that can be expressed as task graphs.”
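The decoupling Strout describes can be sketched in miniature (this is not her libraries' actual interface; the power-iteration algorithm and both storage formats here are illustrative assumptions). The algorithm is written once against an abstract matrix-vector product, so the same specification runs unchanged on a dense matrix or a sparse one:

```python
def dense_matvec(A, x):
    """Matrix-vector product for a dense matrix stored as a list of rows."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def sparse_matvec(A, x):
    """Matrix-vector product for a sparse matrix stored as (i, j) -> value."""
    y = [0.0] * len(x)
    for (i, j), v in A.items():
        y[i] += v * x[j]
    return y

def power_iteration(matvec, A, x, steps=50):
    """Algorithm specification: estimates the dominant eigenvalue of A.
    It never touches A's storage layout, only the matvec interface."""
    norm = 1.0
    for _ in range(steps):
        y = matvec(A, x)
        norm = max(abs(v) for v in y)
        x = [v / norm for v in y]
    return x, norm

# The same 2x2 matrix in both representations; eigenvalues are 3 and 1.
dense = [[2.0, 1.0], [1.0, 2.0]]
sparse = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
x0 = [1.0, 0.0]
_, lam_dense = power_iteration(dense_matvec, dense, x0)
_, lam_sparse = power_iteration(sparse_matvec, sparse, x0)
print(lam_dense, lam_sparse)  # both converge to the dominant eigenvalue, 3.0
```

Swapping the implementation means passing a different matvec; the algorithm specification stays untouched, which is the separation of concerns Strout's libraries aim to provide at scale.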