At exascale, being oblivious
to a fault keeps apps running
Posted December 12, 2012
The FOX software stack.
Click image to enlarge and for more information.
|
Computer scientist Maya Gokhale is optimistic about exascale computing’s fast, bright future. But to achieve this success, the Lawrence Livermore National Laboratory researcher and her colleagues focus on an exascale computer’s inevitable failures, from processor cores, to memory and communications links – the unprecedented millions of hardware parts. The failure of any one of these elements could hobble a 1018 floating-point-operation scientific calculation.
“Failure will be a constant companion in exascale computing,” says Gokhale, a 30-year veteran in thinking about the silicon edge of high-performance computing (HPC). “We expect applications will have to function in an environment of near-continual failure: something, somewhere in the machine will always be malfunctioning, either temporarily, or simply breaking.”
Yet just as the most successful entrepreneurs learn and grow from overcoming failure, HPC scientists like Gokhale believe planning for breakdowns at the exascale will not only improve scientific computing; it may be a tipping point toward a new paradigm in how supercomputers are programmed.
As part of its effort to outsmart faults at the exascale, the DOE Office of Science is looking to FOX – the Fault-Oblivious eXtreme-scale execution environment project.
Led by Gokhale, FOX is an ambitious project, now in the last of its three years, with a transformational vision. To succeed at the exascale, FOX’s creators believe, HPC must fundamentally change its approach to failure – from one of seeking hardware perfection at all costs to one of embracing hardware failure when designing a new generation of operating systems.
The failure frontier
With 100 million to a billion processing cores spread among millions of nodes, an exascale supercomputer will represent an unprecedented opportunity for scientific computing. But it also will spawn a new frontier in terms of failure rates.

