FastBit: Digging through
Posted Dec 23, 2008
Imagine trying to find someone in a New York City telephone book if the names were listed randomly.
Scientific databases prove even more daunting, perhaps holding a quadrillion bytes of data – about 5,000 desktop hard drives – or more. As databases grow, quickly finding specific information would be unmanageable without some data organization that facilitates searching.
Around the turn of the millennium, a team of scientists at Lawrence Berkeley National Laboratory (LBNL) began work on this database-indexing problem with something called FastBit. Since then, the technology has sped large data searches across the spectrum, from sifting through Enron e-mails and hunting new-drug molecules to tracking burn episodes in combustion simulations. The work earned the Berkeley researchers a 2008 R&D 100 Award for technology advances.
Many database indexes use tree data structures, such as B-trees. These look like upside-down trees, branching out into ever-more categories as the index goes deeper into a database. A search goes down the tree to find a specific section, sections within a section and so on. Such indexes work well when data are frequently added or deleted.
“B-trees are designed to support changes made to the data,” says Arie Shoshani, senior staff scientist at LBNL’s High Performance Computing Research Department in the Computational Research Division. “They make it possible to go down the tree and find the page or section where a change is made, and you only touch that section.”
But he adds, “The cost is that more space is needed for accommodating data changes. For scientific data that do not change over time, the extra space is wasteful and slows down the search.”
Bitmaps provide an alternative indexing approach. “We have been working on the idea of using bitmaps for scientific applications,” says Kesheng “John” Wu, who works in the Scientific Data Management Research Group at LBNL. “This method is known to work well for low-cardinality data, like the gender of a customer: male, female or unknown.”
For the most part, though, bitmaps only work efficiently when the cardinality is below, say, 100 distinct values for an attribute. “We wanted a way to make bitmaps work when the cardinality is higher, such as with instruments recording temperature or pressure,” Wu says.
That desire spawned FastBit.