Wednesday, November 14, 2012

Stored fields compression in Lucene 4.1

Last time, I tried to explain how efficient stored fields compression can help when your index grows larger than your I/O cache. Indeed, magnetic disks are so slow that it is usually worth spending a few CPU cycles on compression in order to avoid disk seeks.

I have very good news for you: the stored fields format I used for these experiments will become the new default stored fields format as of Lucene 4.1! Here are the main highlights:

  • only one disk seek per document in the worst case (compared to two with the previous default stored fields format)
  • documents are compressed together in blocks of 16 KB or more using the blazing fast LZ4 compression algorithm
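
To make the "blocks of 16 KB or more" idea concrete, here is a minimal Python sketch of how documents could be grouped into chunks. It is an illustration under assumptions, not Lucene's actual code: the names `build_chunks` and `_flush` are hypothetical, and `zlib` (Deflate) stands in for LZ4 only because LZ4 is not part of the Python standard library.

```python
import zlib

CHUNK_SIZE = 16 * 1024  # 16 KB threshold mentioned in the post


def _flush(pending):
    """Compress the buffered documents together as one chunk (hypothetical helper)."""
    lengths = [len(doc) for doc in pending]
    # zlib is only a stand-in here; Lucene 4.1 uses LZ4 by default.
    return lengths, zlib.compress(b"".join(pending))


def build_chunks(docs, chunk_size=CHUNK_SIZE):
    """Group serialized documents into chunks of at least chunk_size bytes.

    A document is never split across two chunks: it is appended whole, and
    the chunk is flushed once the buffered size reaches the threshold.
    Returns a list of (doc_lengths, compressed_bytes) tuples.
    """
    chunks = []
    pending, pending_size = [], 0
    for doc in docs:
        pending.append(doc)
        pending_size += len(doc)
        if pending_size >= chunk_size:
            chunks.append(_flush(pending))
            pending, pending_size = [], 0
    if pending:  # trailing documents that did not fill a whole chunk
        chunks.append(_flush(pending))
    return chunks
```

With this scheme, three 6,000-byte documents end up in one chunk, while a single 40,000-byte document forms a larger chunk of its own. The per-document lengths are kept next to each chunk so that a reader can locate a document inside the decompressed data.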

Over the last few weeks, I’ve had the opportunity to talk about this new stored fields format with various Lucene users and developers who raised interesting questions that I’ll try to answer:

  • What happens if my documents are larger than 16KB? This stored fields format prevents documents from spreading across chunks: if your documents are larger than 16KB, you will have larger chunks that contain only one document.
  • Is it configurable? Yes and no: the stored fields format that will be used by Lucene41Codec is not configurable. However, it is based on another format: CompressingStoredFieldsFormat, which allows you to configure the chunk size and the compression algorithm to use (LZ4, LZ4 HC or Deflate).
  • Are there limitations? Yes, there is one: individual documents cannot be larger than 2^32 - 2^16 bytes (a little less than 4 GB). But this should be fine for most (if not all) use-cases.
  • Can I disable compression? Of course you can: all you need to do is write a new codec that uses a stored fields format which does not compress stored fields, such as Lucene40StoredFieldsFormat.
  • My index is stored in memory / on an SSD, does it still make sense to compress stored fields? I think so:
    • it won’t slow down your search engine: on my very slow laptop (Core 2 Duo T6670), decompressing a 16 KB block of English text takes 80 µs on average, so even if your result pages have 50 documents, your queries will only be 4 ms slower (much less with faster hardware and/or smaller pages)
    • RAM and SSD are expensive, so thanks to stored fields compression you’ll be able to have larger indexes on the same hardware, or equivalent indexes on cheaper hardware
  • Can I plug in my own compression algorithm? Unfortunately you can’t, but if you really need to use a different compression algorithm, the code should be easy to adapt. However you should be aware of two optimizations of the LZ4 implementation in Lucene that you would almost certainly need to implement if you want to achieve similar performance:
    • it doesn’t compress to a temporary buffer before writing the compressed data to disk; instead, it writes directly to a Lucene DataOutput, which proved to be faster (with MMapDirectory at least)
    • it stops decompressing as soon as enough data has been decompressed: for example, if you need to retrieve the second document of a chunk, which is stored between offsets 1024 and 2048 of the chunk, Lucene will only decompress 2 KB of data.
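
The second optimization, stopping decompression early, can be sketched in a few lines of Python. This is again an illustration under assumptions rather than Lucene's implementation: `read_document` is a hypothetical helper, the per-document lengths are assumed to be stored alongside the chunk, and `zlib` (Deflate) replaces LZ4 because its `decompressobj().decompress(data, max_length)` call can cap the amount of output produced.

```python
import zlib


def read_document(compressed_chunk, doc_lengths, doc_index):
    """Return (document_bytes, bytes_decompressed) for one document of a chunk.

    Only the prefix of the chunk up to the end of the requested document
    is decompressed; everything after it is left untouched.
    """
    start = sum(doc_lengths[:doc_index])
    end = start + doc_lengths[doc_index]
    if end == 0:
        # max_length=0 would mean "no limit" for zlib, so handle it explicitly.
        return b"", 0
    decompressor = zlib.decompressobj()
    # Produce at most `end` bytes of output, then stop.
    prefix = decompressor.decompress(compressed_chunk, end)
    return prefix[start:end], len(prefix)


# Usage: a 16 KB chunk made of sixteen 1 KB documents.
docs = [bytes([65 + i]) * 1024 for i in range(16)]
chunk = zlib.compress(b"".join(docs))
lengths = [len(d) for d in docs]

# The second document sits between offsets 1024 and 2048 of the chunk,
# so only 2 KB of output is produced instead of the full 16 KB.
doc, decompressed = read_document(chunk, lengths, 1)
```

The saving grows with the position of the document in the chunk: documents near the start are cheap to fetch, while the last document still requires decompressing the whole chunk.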

Many thanks to Robert Muir who helped me improve and fix this new stored fields format!