Exascale Science
May 2012

Blueprints for power

Computer designers are rethinking nearly everything in their quest to develop systems capable of exaflops-speed calculations.

The architects of tomorrow’s exascale computers are designing systems that borrow from and contribute to an unlikely source: the consumer electronics industry. The result will be systems exquisitely designed to meet the needs of scientists studying complex systems such as jet engine efficiency, large-scale weather patterns and earthquake models.

These machines won’t just be about a thousand times faster than today’s fastest high-performance computing systems. They’ll also require so much energy to operate that computer architects must completely rethink how these systems will be built. They’re breaking down and redesigning everything from microprocessor interconnects to memory and storage to conserve energy and increase efficiency. In many areas, architects are going back to the drawing board and asking what scientific questions the systems will be used to answer, and then optimizing the machine design to perform those operations.

“Once you choose the specific (scientific) performance metric that you want to improve, then you can start to make huge strides in determining the right hardware and software design tradeoffs to improve energy efficiency and effectiveness,” says John Shalf, who leads research into energy-efficient processors for scientific applications at Lawrence Berkeley National Laboratory. “Unlike the HPC community, which has tended to focus on flops, an energy-sensitive exascale design must use the application as the thing to measure performance.”

Unlike previous generations of large-scale computers, exascale systems are not expected to get any of their improved performance from faster processors, says Kathy Yelick, associate laboratory director for computing sciences at Berkeley Lab. Instead, exascale computing will extend the concept of massive parallelism. Systems will consist of tens of thousands of nodes running in parallel, with each node containing tens of thousands of processor cores.

“What these cores will look like and how they will be organized is an open question,” Yelick says. “What we do know is that there will be about a billion of them in a single system. And because any data movement outside of the chip really costs you in terms of time and energy consumption, it will be more important than ever at exascale to avoid doing any kind of communication and, whenever possible, to compute locally.”

As individual components continue to shrink, more space becomes available on each chip, and the question becomes: How do you make the best use of that space? Processor design is an active area, with both academic research groups and commercial vendors putting forth innovative concepts for energy-efficient design. One popular idea, called system-on-a-chip, consists of a processor with its own local memory and a link to the network, allowing each processing node to act independently. Some groups have instead proposed using GPUs – graphics processing units – and a network interface on each chip to minimize data movement.

“These independent nodes will greatly affect the way computing systems are structured and the way programs are coded and executed,” Yelick says. The details of how memory will be organized, whether some processors will be specialized or restricted to operate in tandem with others, and the cost of synchronizing between them are all important.

“Nodes, depending on how they are organized, might have hundreds of thousands of independent computing elements,” says Rick Stevens, associate laboratory director responsible for computing, environment and life sciences research at Argonne National Laboratory. “Those elements would be connected internally through electronics, as is the case today, or perhaps optics.”

The energy cost of electronic communication increases with distance; the energy consumed by optically enabled communication is less dependent on distance.

“It’s really about how to re-architect the chip such that the millions of small wires that connect components inside the architecture are actually minimized,” Stevens says. “The most likely scenario will combine both concepts: Some amount of the performance you get will be from massive parallelism, and some will be a result of minimizing data motion within each node.”

Researchers continue to argue over how homogeneous the individual processing units should be, since computing elements can be designed to perform certain operations more efficiently. An example is the petascale system under construction at Oak Ridge National Laboratory: a hybrid architecture that combines traditional commodity processors with GPUs, which have different internal instruction sets. (For more, see “From games to mainframes, GPUs boost computing power.”)

“Years of experience in consumer electronics devices has demonstrated that specialization can improve energy efficiency,” Stevens says. “But trading specialization off with the development costs and the difficulty of programming has always been an issue.”

Exascale system architecture will likely contain both more processors and a heterogeneous mix of processor types, Yelick says. That also will significantly change how programs are coded and executed. Programmers will need to rewrite code to funnel different numerical problems to the appropriate processing-unit type.
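
As a rough sketch of what that funneling might look like in practice, the Python snippet below routes different classes of numerical work to different processor types. The task categories, names and the choice of Python are purely illustrative; they are not drawn from any specific exascale programming model.

    # Illustrative sketch only: route each class of numerical work to the
    # processor type best suited to it, rather than treating all cores alike.
    # The task categories and mappings here are invented for illustration.

    def choose_processor(task_kind):
        """Pick a processing-unit type for a given class of work."""
        if task_kind == "dense_linear_algebra":
            return "GPU"   # highly regular, data-parallel work
        if task_kind == "irregular_graph":
            return "CPU"   # branch-heavy, latency-sensitive work
        return "CPU"       # default to the general-purpose cores

    for kind in ("dense_linear_algebra", "irregular_graph", "io_bound"):
        print(kind, "->", choose_processor(kind))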

For example, programmers want to be able to write a program that treats memory as if it were a single, global space. But the amount of shared memory available on an exascale machine is still an open question and could be quite small per processing unit, Stevens says. It is easier to build a machine with separated memory addresses, but it is much easier to program if the memory is unified. System architects are still struggling with how transparent they can make the memory space while internally partitioning it for maximum efficiency.

One of the most innovative ideas to increase efficiency in exascale systems posits that memory, rather than simply storing information, could be enlisted to perform simple patterns of operations, or what’s known as a stencil. For example, a common arithmetic operation in high-performance computing involves communicating with nearest neighbors in a set pattern to determine their values, and then averaging those values.

That kind of simple, repetitive operation is ideally suited to being performed in memory, Stevens says. “That might save an enormous amount of power, possibly a factor of 20, so whether this capability could be built into the memory or into the switch is one of the interesting architecture questions for exascale.”
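
To make the idea concrete, here is a minimal Python sketch of the nearest-neighbor averaging stencil described above, applied to a one-dimensional grid. The code illustrates the operation itself, not any proposed processing-in-memory hardware, and the function name and sample values are invented.

    # A simple three-point averaging stencil on a 1-D grid, the kind of
    # repetitive nearest-neighbor operation described above. In a
    # processing-in-memory design, logic like this could in principle run
    # where the data live instead of shipping values to a distant core.

    def average_stencil(grid):
        """Average each interior point with its two nearest neighbors."""
        result = list(grid)
        for i in range(1, len(grid) - 1):
            result[i] = (grid[i - 1] + grid[i] + grid[i + 1]) / 3.0
        return result

    values = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0]
    print(average_stencil(values))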

Another critical issue looming in the jump to exascale computing is the failure rate of individual components in a system that is expected to be considerably larger than current high-performance systems. Assuming that the failure rate for components can be improved by a factor of 10 over the next 10 years, while the system itself grows by a factor of a thousand, the system would fail about once every hour.
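
A rough back-of-the-envelope version of that argument, with the baseline mean time between failures chosen purely for illustration, looks like this:

    # Back-of-the-envelope sketch of the scaling argument above.
    # The baseline figure is an illustrative assumption, not a measured value.

    baseline_mtbf_hours = 100.0   # suppose today's system fails roughly every 4 days
    component_growth = 1000.0     # an exascale system has ~1,000x more components
    reliability_gain = 10.0       # per-component failure rates improve 10x

    # Failures scale up with component count and down with reliability gains.
    exascale_mtbf_hours = baseline_mtbf_hours * reliability_gain / component_growth
    print(f"Estimated mean time between failures: {exascale_mtbf_hours:.1f} hours")
    # -> roughly one failure per hour, as described above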

To avoid this untenable situation, the system architecture must have a built-in failure detection system so that small glitches can be located immediately and systems restored locally, rather than globally.

Stevens calls this approach “micro-checkpointing.” The approach is similar in concept to the method used to back up personal computers. Each incremental backup records only information that has changed since the previous backup. The difference is in speed. For exascale computing, backups might take place every millisecond and be distributed over multiple nodes, so one node crashing wouldn’t take down the entire system, as is the case today. Backups would allow the system to reconstruct the node locally.
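
A minimal sketch of that incremental, node-local scheme might look like the Python below. The class and method names are invented for illustration; an actual system would copy each delta into neighboring nodes’ memory and run every few milliseconds rather than on demand.

    # Sketch of the "micro-checkpointing" idea: record only what changed since
    # the last checkpoint so a failed node can be reconstructed locally.
    # Names are invented for illustration.

    class MicroCheckpointer:
        def __init__(self, initial_state):
            self.committed = dict(initial_state)  # last checkpointed state
            self.dirty = {}                       # changes since that checkpoint

        def write(self, key, value):
            self.dirty[key] = value               # record only what changed

        def checkpoint(self):
            delta = dict(self.dirty)              # small incremental backup
            self.committed.update(delta)
            self.dirty.clear()
            return delta                          # would be copied to another node

        def recover(self):
            return dict(self.committed)           # rebuild the node's last state

    node = MicroCheckpointer({"temperature": 300.0, "pressure": 1.0})
    node.write("temperature", 301.5)
    node.checkpoint()       # only the temperature update is recorded
    print(node.recover())   # {'temperature': 301.5, 'pressure': 1.0}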

“The idea here is to make the individual nodes much more robust so the systems don’t go down so much but also to take away from the programmer the responsibility for backups,” Stevens says. “That’s the vision. Whether we can actually make that happen, whether it would require too much additional bandwidth inside the machine that might be competing with the application, is not clear.”

What’s needed now is proof of concept for the new system architectures, a kind of 21st-century test bed that will allow design concepts to be tried out. One such test bed will come courtesy of Mira, a 10-petaflops IBM Blue Gene/Q being installed at the Argonne Leadership Computing Facility. Mira will support scientific computing in areas such as climate science and engine design, and it will be available as an INCITE (Innovative and Novel Computational Impact on Theory and Experiment) and ASCR Leadership Computing Challenge resource.

The system contains early versions of several ideas necessary for exascale computing. First, it employs a low-speed, low-energy, high-core-count approach, with about 750,000 computing cores and 750 terabytes of total memory. Its programmable pre-fetch engine anticipates programmed data movement and retrieves data from memory so they are immediately available when needed. What’s called a “transactional memory system” takes the first step toward micro-checkpointing by divvying up changes to memory into manageable chunks, or blocks, that can be written all at once. The design makes the movement of data to memory a small transaction, which greatly reduces the energy cost as well as the bottleneck in information processing. (For more, see “Mira: Supercomputing’s next generation coming soon.”)
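
The general idea of grouping memory updates into blocks that are committed all at once can be sketched as follows. This is an illustration of the concept only, not a description of Blue Gene/Q’s actual transactional-memory hardware, and all names are invented.

    # Conceptual sketch: buffer individual memory updates and apply them to
    # main memory in blocks, so each commit is one larger, cheaper transaction.
    # Not a model of Blue Gene/Q's hardware; names are invented.

    class WriteBuffer:
        def __init__(self, memory, block_size=4):
            self.memory = memory          # the memory being updated
            self.block_size = block_size  # updates grouped per commit
            self.pending = []             # updates awaiting commit

        def store(self, address, value):
            self.pending.append((address, value))
            if len(self.pending) >= self.block_size:
                self.commit()

        def commit(self):
            # Apply the whole block of updates in one step.
            for address, value in self.pending:
                self.memory[address] = value
            self.pending.clear()

    memory = {}
    buf = WriteBuffer(memory)
    for addr in range(6):
        buf.store(addr, addr * addr)
    buf.commit()    # flush the remaining updates
    print(memory)   # {0: 0, 1: 1, 2: 4, 3: 9, 4: 16, 5: 25}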

Once these strategies have been tested, the next generation of machines, approaching exascale capability, can be built on an architecture that serves the needs of the end user: the scientific community.

“The co-design process is trying to get applications people into the middle of the design process for the next-generation machines,” Shalf says. “This is really inserting the scientist into the design process to make a more effective machine for science.”