spacer
ASCR Home Button ASCR Organization Button ASCR News Button Contact ASCR Button
DOE Homepage Science Homepage
ASCRlogo ASCR Discovery home page 


FastBit: Digging through
databases faster

(page 3 of 5)

Wu, Shoshani, and Otoo originally developed WAH to search data generated by the Department of Energy high-energy physics experiment called STAR. Instead of grabbing data by the byte, WAH organizes data by word – specifically, the native word length of the computer being used. WAH can grab one word – say, 32 bits, depending on the CPU – in less time than BBC uses to gather one byte. In fact, WAH can be configured to work with words that are 8, 16, 32, 64, 128 or more bits in length.

How does WAH work with 32-bit words? Rather than 32 bits, WAH compression uses groups of 31. WAH starts with a bitmap of thousands or millions of 0s and 1s and parses it into 31-bit groups. These groups could consist of all 0s, all 1s, or a combination. If a 31-bit group is a mixture of 0s and 1s, WAH stores the literal sequence, making a so-called literal word. Moreover, the most significant bit (MSB) is set to 0, which indicates that it’s a literal word, and the remaining 31 bits consist of the actual sequence from the bitmap. If a 31-bit sequence is all 0s or all 1s, then WAH creates a fill word. In such cases, WAH sets the MSB to 1, indicating a fill word. Then, the second MSB is the fill bit: 0 if the word consists of all 0s, and 1 if the word consists of all 1s. The remaining 30 bits encode the length of a fill. The length is an integer representing the number of 31-bit groups. A WAH-encoded fill could stretch across many words if a bitmap includes a long string of 0s or 1s. When WAH gets to the last word, which is generally only partially filled, it gets stored as a literal word.

By using 31-bit counts, the compressed data in fill words align exactly with uncompressed literal words (since one bit is used to distinguish between a literal and fill word), resulting in very efficient logical operations that can use the compressed fills directly, without decompressing them.

Wu explains: “With WAH, both compressed and decompressed indexes are word-aligned and generate results as compressed bitmaps that are word-aligned. That way, we avoid the need to look at individual bits in compressed data.” For example, to do a bitwise logical operation on two words that are fills, the software knows what every bit is, and there’s no need to decompress the data to run the operation. Likewise, literal words keep the bits available for easy bitwise logical operations.

How much WAH compresses a bitmap depends on the stretches of 1s or 0s in it. “On a typical scientific dataset, where the data are skewed,” Shoshani says, “the compressed data are about 30 percent of the size of the original. By comparison, B-trees tend to be two to four times larger than the original data.”

« Previous       1   |   2   |   3   |   4   |   5   |   Print     Next »

Web Policies Button No Fear Act Button Site Map Button Privacy Button Phone Book Button Employment Button
spacer