At exascale, being oblivious to a fault keeps apps running
To tie the pieces together and test its approach, the research team is using an allotment of 10 million processor hours on Intrepid, awarded through DOE’s Innovative and Novel Computational Impact on Theory and Experiment (INCITE) program.
One of FOX’s key elements is TASCEL, a middleware task management library developed by a team, led by scientist Sriram Krishnamoorthy, at DOE’s Pacific Northwest National Laboratory. With TASCEL, all of a computation’s task queues are stored in a shared memory location. Thus, if a computing node fails, its current task queue can be recovered and reassigned to another node.
“We also have middleware and libraries that keep a consistent data store,” Gokhale says. “We don’t get rid of the task until the work of the task has been completed and stored in the data store. In case of node failure, the replicated task queues and data storage are automatically rebalanced, and the remaining nodes continue oblivious to the failure.”
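The recovery scheme described above can be illustrated with a minimal sketch. It is not TASCEL's API: the class `SharedTaskStore`, its methods, and the round-robin reassignment policy are hypothetical names and choices made here for illustration, and a plain in-process dictionary stands in for the shared, replicated memory that a real system would use. It shows only the two ideas in the text: a task leaves its queue only after its result is stored, and a failed node's queue is handed to the surviving nodes.

```python
from collections import deque


class SharedTaskStore:
    """Toy stand-in for globally visible task queues plus a result store.

    Hypothetical illustration only; this is not TASCEL's interface.
    """

    def __init__(self, node_ids):
        self.queues = {n: deque() for n in node_ids}  # one queue per node
        self.results = {}                             # the "data store"

    def submit(self, node_id, task_id, fn, *args):
        self.queues[node_id].append((task_id, fn, args))

    def run_one(self, node_id):
        """Run the task at the head of a node's queue, if any.

        The task is removed only after its result is stored, so a crash
        part-way through leaves it in the queue for recovery.
        """
        queue = self.queues[node_id]
        if not queue:
            return False
        task_id, fn, args = queue[0]       # peek, do not pop yet
        self.results[task_id] = fn(*args)  # store the result first
        queue.popleft()                    # only now retire the task
        return True

    def fail_node(self, node_id):
        """Reassign a failed node's pending tasks to the survivors."""
        orphaned = self.queues.pop(node_id)
        survivors = sorted(self.queues)
        if not survivors:
            raise RuntimeError("no surviving nodes to take over the work")
        # Assumed policy: simple round-robin over the remaining nodes.
        for i, task in enumerate(orphaned):
            self.queues[survivors[i % len(survivors)]].append(task)


def square(x):
    return x * x


def run_demo():
    store = SharedTaskStore(["n0", "n1", "n2"])
    for i in range(6):
        store.submit(f"n{i % 3}", i, square, i)
    store.run_one("n1")    # n1 finishes one task...
    store.fail_node("n1")  # ...then fails with another still queued
    for node in list(store.queues):
        while store.run_one(node):
            pass
    return store.results
```

Calling `run_demo()` returns results for all six tasks even though one node is lost midway: the survivors simply find extra work in their queues and never need to handle the failure explicitly.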
The FOX project also employs systems software capable of core and node specialization: simultaneously running various programs and operating systems on different types of cores and nodes, all in the same exascale machine.
NIX, a prototype operating system Minnich co-developed, efficiently divides cores on multicore processors according to function. Kittyhawk, a cloud environment developed by scientists at Boston University, lets users run different operating systems on different nodes. This has enabled the FOX team to port code to the Blue Gene architecture and run a variety of codes to solve problems.
In the FOX project’s final year, Gokhale says, the team will speed-test its fault-oblivious environment with a massively parallel quantum chemistry program that originated at Sandia.
“Quantum chemistry computation is enormously challenging because it’s highly irregular, with some parts requiring an enormous amount of communication,” she says. “This will really put FOX to the test.”
Minnich suggests that the challenge fault-oblivious computing faces is more human than technical. From a programming perspective, it’s as big a shift as the historic move from vector to massively parallel computing.
Regardless of whether the future of exascale HPC is fault-tolerant, he says, the FOX project’s approach is part of what’s needed to get there.