Thursday, June 21, 2012

How fast is bit packing?

One of the most anticipated changes in Lucene/Solr 4.0 is its improved memory efficiency. Indeed, according to several benchmarks, you could expect a 2/3 reduction in memory use for a Lucene-based application (such as Solr or ElasticSearch) compared to Lucene 3.x.

One of the techniques that Lucene uses to reduce its memory footprint is bit-packing. This means that integer array values, instead of being fixed-size (8, 16, 32 or 64 bits per value), can have any size in the [1-64] range. If you store 17-bits integers this way, this is a 47% reduction of the size of your array compared to an int[]!

Here is what the interface looks like:

interface Mutable {
  long get(int index);
  void set(int index, long value);
  int size();
}

Under the hood, this interface has 4 implementations that have different speed and memory efficiency:

  1. Direct8, Direct16, Direct32 and Direct64 that just wrap a byte[], a short[], an int[] or a long[],
  2. Packed64, which packs values contiguously in 64-bits (long) blocks,
  3. Packed64SingleBLock, that looks like Packed64 but uses padding bits to prevent values from spanning across several blocks (32 bits per value at most),
  4. Packed8ThreeBlocks and Packed16ThreeBlocks, that store values in either 3 bytes (24 bits per value) or 3 shorts (48 bits per value).

In case you are interested, the code is available in Lucene svn repository.

Direct{8,16,32,64}

The methods of these classes directly translate to operations on an array:

  • Direct8: byte[],
  • Direct16: short[],
  • Direct32: int[],
  • Direct64: long[].

Operations on these classes should be very fast given that they directly translate into array accesses. However, these implementations also have the same drawback as arrays, which is that if you want to store 17-bits values, you will need to use a Direct32, which has a 88% memory overhead for 17-bits values.

Packed64

This implementation stores values contiguously in 64-bits blocks. This is the most compact implementation: if you want to store a million17-bits values, it will require roughly 17 * 1000000 / 8 ~= 2MB space. One pitfall is that some values may span across two different blocks (when the number of bits per value is not a divisor of n), as a consequence, to avoid costly CPU branches, the implementation of the get and set methods are a little tricky and always update 2 blocks with different shifts and masks.

Packed64SingleBlock

This implementation is similar to Packed64 but does not allow its values to span across several blocks. If you want to store 21-bits values, every block will consist of 3 21-bits values (using 3*21=63 bits) and 64-63=1 padding bit (2% space loss). Here are the different value sizes that this class accepts.

Bits per valueValues per blockPadding bitsSpace loss
32200%
21312%
16400%
12546%
10646%
9712%
8800%
7912%
61046%
51246%
41600%
32112%
23200%
16400%

Packed{8,16}ThreeBlocks

This class uses 3 bytes or shorts to store a single value. It is well-suited for 24 and 48-bits values, but has a maximum size of Integer.MAX_VALUE/3 (so that the underlying array can be addressed by an int).

How do they compare?

For every number of bits per value, there are 2 to 4 available implementations. One important criterion to select the one that best suits your needs is the memory overhead.

Here are the memory overheads for every number of bits per value and bit-packing scheme. The X-axis is the number of bits per value while the Y-axis is the memory overhead (space loss / actually used space).

For every bit-packing scheme, I only considered the most compact implementation. I could use a Direct64 to store 20-bits values, but it is very likely to have similar (probably a little worse since the CPU cache is less likely to help) performance to a Direct32, although it requires twice as more space.

For example, there are 4 available implementations to store 20-bits values:

  • Direct32 (32 bits per value), which has 60% memory overhead
  • Packed64 (20 bits per value), which has 0% memory overhead
  • Packed64SingleBlock (21 bits + 1/3 padding bit per value), which has 7% memory overhead
  • Packed8ThreeBlocks (24 bits per value), which has 20% memory overhead

Even if we now know how compact the different implementations are, it is still very difficult to decide which implementation to use whithout knowing their relative performance characteristics. This is why I wrote a simple benchmark that for every number of bits per value in [1,64]:

  • creates 2 to 4 packed integer arrays (one per implementation) of size 10,000,000
  • tests their random write performance (offsets are randomly chosen in the [0, 10000000[ range),
  • tests their random read performance.

The X-axis is the number of bits per value while the Y-axis is the number of read/written values per second.

The Direct* implementations are clearly faster than the packed implementations (~3x faster than Packed64 and 2x faster than Packed64SingleBlock). However, it is interesting to observe that the Packed*ThreeBlocks implementations are almost as fast as the Direct* implementations.

Packed64 and Packed64SingleBLock are much faster with small values (1 or 2 bits), due to the fact that the CPU caches can hold many more values at the same time, resulting in fewer cache misses when trying to access the data.

Now, how do read operations compare?

This time results are very different. The fastest implementation are still the Direct* ones, but they are only ~18% faster than Packed64 and Packed*ThreeBlocks, and only ~8% faster than Packed64SingleBLock on average. This means that for read-only use cases, you could save a lot of memory by switching your arrays to a packed implementation while keeping performance to the same level.

Conclusion

Although bit-packing can help reduce memory use significantly, it is very rarely used in practice, probably because:

  • people usually don’t know how many bits per value they actually need,
  • 8, 16, 32 and 64-bits arrays are language built-ins, while packed arrays require some extra coding.
However, this experiment shows that you can achieve significant reductions in memory use by using packed integer arrays, without sacrificing performance too much since packed arrays can be almost as fast as raw arrays, especially for read operations.