Seeking new ways to help supercomputers bounce back
Several research groups are also developing new algorithms that tolerate faults. The programs grind away, eventually converging on the correct answer despite encountering errors along the way.
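Such self-correcting behavior is easiest to see in iterative methods. Here is a minimal sketch (all names hypothetical) using Jacobi iteration on a small, diagonally dominant linear system: because each sweep contracts toward the fixed point, a transient fault that corrupts the iterate is gradually "forgotten" and the method still converges.

```python
# Hypothetical sketch of a fault-tolerant iterative algorithm:
# Jacobi iteration converges even after a simulated soft error.

def jacobi_solve(A, b, sweeps=200, corrupt_at=None):
    n = len(b)
    x = [0.0] * n
    for k in range(sweeps):
        if k == corrupt_at:      # simulate a transient soft error
            x[0] += 1000.0       # large corruption of one entry
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i))
             / A[i][i] for i in range(n)]
    return x

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]

clean = jacobi_solve(A, b)                  # no fault injected
faulty = jacobi_solve(A, b, corrupt_at=50)  # fault at sweep 50
# both runs converge to (approximately) the same solution, [1, 1, 1]
```

The remaining sweeps after the fault damp the corruption geometrically, which is why no checkpoint is needed at all for this class of algorithm.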
Other approaches try variations on checkpointing. Multilevel checkpointing stores the states of individual nodes locally (in or near the nodes) and fairly frequently. The nodes are checkpointed individually, then as small groups, and then in groups of groups of increasing size. The computer can recover data at the most efficient level, rolling back only the smallest group needed to reach a point before the fault.

In distributed checkpointing, a handful of nodes save portions of the computation's state for one another. If any one node fails, much of the data can be retrieved from its neighbors.
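The multilevel idea can be sketched in a few lines of code. This toy example (all names hypothetical) keeps checkpoints at two levels, per node and per group, and recovers from the cheapest level that still has usable data.

```python
# Toy sketch of multilevel checkpointing: per-node checkpoints are
# cheap and frequent; group checkpoints cover larger failures.
# Recovery rolls back only the smallest unit containing the fault.

class MultilevelCheckpoint:
    def __init__(self, num_nodes, group_size):
        self.num_nodes = num_nodes
        self.group_size = group_size
        self.node_ckpt = {}    # level 0: per-node state
        self.group_ckpt = {}   # level 1: per-group state

    def checkpoint(self, states):
        for node, state in enumerate(states):
            self.node_ckpt[node] = state              # frequent, cheap
        for g in range(0, self.num_nodes, self.group_size):
            self.group_ckpt[g // self.group_size] = list(
                states[g:g + self.group_size])        # less frequent

    def recover(self, failed_node, local_copy_intact):
        # Prefer the cheapest level whose data survived the fault.
        if local_copy_intact:
            return ("node", [self.node_ckpt[failed_node]])
        group = failed_node // self.group_size
        return ("group", self.group_ckpt[group])

ckpt = MultilevelCheckpoint(num_nodes=8, group_size=4)
ckpt.checkpoint([f"state-{i}" for i in range(8)])
level, data = ckpt.recover(failed_node=5, local_copy_intact=False)
# rolls back group 1 (nodes 4-7) only, not the whole machine
```

A real system would add further levels (groups of groups, then global storage), but the recovery rule is the same: climb only as high in the hierarchy as the fault forces you to.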
Like multilevel checkpointing, containment domains preserve the state of a small part of the system before an algorithm begins. But they go further, sequestering the algorithm from the rest of the system so that each error can be ignored or corrected as appropriate.
Corrective measures then come into play within the containment domain. Because any data recovery is uncoordinated with the rest of the system, containment domains avoid an expensive hazard of multilevel checkpointing: they do not interact with neighboring nodes, an interaction that could otherwise trigger a domino effect of unnecessary rollbacks.
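The pattern can be sketched as follows (a hypothetical illustration, not the actual containment-domain API): preserve inputs on entry, run the body, check the result locally, and on a detected error re-execute from the preserved state. Recovery never touches a neighboring domain.

```python
# Hypothetical sketch of a containment domain: preserve, execute,
# check, and recover entirely within the domain's own boundary.

def run_in_domain(inputs, body, check, max_retries=3):
    preserved = list(inputs)            # preserve state on entry
    for attempt in range(max_retries):
        result = body(list(preserved))  # work on a copy of the state
        if check(result):               # domain-local fault detection
            return result               # success: nothing rolled back
    raise RuntimeError("uncorrectable fault escalated to parent domain")

faults = iter([True, False])  # first attempt hits a simulated fault

def sum_with_fault(xs):
    total = sum(xs)
    return total + 999 if next(faults) else total

result = run_in_domain([1, 2, 3], sum_with_fault,
                       check=lambda r: r == 6)
# returns 6 after one local retry; neighbors never roll back
```

If the domain exhausts its retries, the error escalates to the enclosing domain, which mirrors how the hierarchy of domains contains progressively larger faults.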
Because containment domains allow multiple mechanisms for preserving calculations, they offer several advantages. First, they are fine-grained, confining checks to as few nodes as the algorithm requires.
Second, containment domains are expressive, giving programmers the flexibility to embed the appropriate fault-detection and data-recovery strategies into each containment domain, depending on the algorithm, using relatively few lines of code.
Third, containment domains are tunable. Programmers can assign each error type a weight based on the algorithm's context, so that when an error is detected, the response is proportionate to its severity rather than a blanket rollback.
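A toy sketch of that tunability (weights, thresholds, and action names are all hypothetical): each error type carries a programmer-assigned weight, and the handler picks the strongest response whose threshold the weight reaches.

```python
# Hypothetical sketch of weighted error handling: the response
# escalates with the programmer-assigned weight of the error type.

RESPONSES = [(0.0, "ignore"),
             (0.3, "recompute locally"),
             (0.7, "restore from replica"),
             (1.0, "roll back domain")]

def respond(error_type, weights):
    w = weights[error_type]
    chosen = "ignore"
    for threshold, action in RESPONSES:   # thresholds are ascending
        if w >= threshold:
            chosen = action               # keep escalating while w allows
    return chosen

weights = {"single_bit_flip": 0.4, "node_crash": 1.0}
respond("single_bit_flip", weights)   # "recompute locally"
respond("node_crash", weights)        # "roll back domain"
```

The point is that a harmless bit flip in, say, a visualization buffer need not trigger the same machinery as a lost node.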
Finally, drawing on the programmer's knowledge of the running algorithms, containment domains may also exploit the fact that different nodes redundantly store the same data for performance reasons. Error-ridden data in one node can then be restored not from a checkpoint but from another node holding the same information.
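That replica-based recovery can be sketched in miniature (a hypothetical example using a CRC to stand in for whatever integrity check a real system would use): if a block fails its checksum, it is restored from the peer copy rather than from a disk checkpoint.

```python
# Hypothetical sketch of replica-based recovery: a corrupted block is
# restored from a redundant peer copy instead of a stored checkpoint.
import zlib

def checksum(block):
    return zlib.crc32(bytes(block))

primary = [1, 2, 3, 4]
replica = list(primary)              # redundant copy on another node
expected = checksum(primary)         # recorded when the block was valid

primary[2] = 99                      # simulate corruption in one node

if checksum(primary) != expected:    # detect via checksum mismatch
    primary = list(replica)          # restore from the peer node

# primary is back to [1, 2, 3, 4] without touching any checkpoint
```

Pulling a block from a live peer is typically far cheaper than rolling an entire node back to its last checkpoint, which is the efficiency the paragraph above describes.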
As HPC expands toward exascale capability, small errors could cause big problems. Containment domains are likely to help scientists run more efficient algorithms on supercomputers, yielding more reliable answers, faster, to the big questions that only these big machines can address.