Tuesday, October 9, 2012

Efficient compressed stored fields with Lucene

Whatever you are storing on disk, everything usually goes perfectly well until your data becomes too large for your I/O cache. Until then, most disk accesses actually never touch disk and are almost as fast as reading or writing to main memory. The problem arises when your data becomes too large: disk accesses that can’t be served through the I/O cache will trigger an actual disk seek, and everything will suddenly become much slower. Once data becomes that large, there are three options: either you find techniques to reduce disk seeks (usually by loading some data in memory and/or relying more on sequential access), buy more RAM or better disks (SSD?), or performance will degrade as your data will keep growing.

If you have a Lucene index with some stored fields, I wouldn’t be surprised that most of the size of your index is due to its .fdt files. For example, when indexing 10M documents from a wikipedia dump that Mike McCandless uses for nightly benchmarks, the .fdt files are 69.3% of the index size.

.fdt is one of the two file extensions that are used for stored fields in Lucene. You can read more about how they work in Lucene40StoredFieldsFormat’s docs. The important thing to know is that loading a document from disk requires two disk seeks:

  • one in the fields index file (.fdx),
  • one in the fields data file (.fdt).

The fields index file being usually small (~ 8 * maxDoc bytes), the I/O cache should be able to serve most disk seeks in this file. However the fields data file is often much larger (a little more than the original data) so the seek in this file is more likely to translate to an actual disk seek. In the worst case that all seeks in this file translate to actual disk seeks, if your search engine displays p results per page, it won’t be able to handle more than 100/p requests per second (given that a disk seek on commodity hard drives is ~ 10ms). As a consequence the hit rate of the I/O cache on this file is very important for your query throughput. One option to improve the I/O cache hit rate is to compress stored fields so that the fields data file is smaller overall.

Up to version 2.9, Lucene had an option to compress stored fields but it has been deprecated and then removed (see LUCENE-652 for more information). In newer versions, users can still compress documents but this has to be done at the document level instead of the index level. However, the problem is still the same: if you are working with small fields, most compression algorithms are inefficient. In order to fix it, ElasticSearch 0.19.5 introduced store-level compression: it compresses large (64KB) fixed-size blocks of data instead of single fields in order to improve the compression ratio. This is probably the best way to compress small docs with Lucene up to version 3.6.

Fortunately, Lucene 4.0 (which should be released very soon) introduces flexible indexing: it allows you to customize Lucene low-level behavior, in particular the index files formats. With LUCENE-4226, Lucene got a new StoredFieldsFormat that efficiently compresses stored fields. Handling compression at the codec level allows for several optimizations compared to ElasticSearch’s approach:

  • blocks can have variable size so that documents never spread across two blocks (so that loading a document from disk never requires uncompressing more than one block),
  • uncompression can stop as soon as enough data has been uncompressed,
  • less memory is required.

Lucene40StoredFieldsFormat vs. CompressingStoredFieldsFormat

In order to ensure that it is really a win to compress stored fields, I ran a few benchmarks on a large index:

  • 10M documents,
  • every document has 4 stored fields:
    • an ID (a few bytes),
    • a title (a few bytes),
    • a date (a few bytes),
    • a body (up to 1KB).

CompressingStoredFieldsFormat has been instantiated with the following parameters:

  • compressionMode = FAST (fast compression and fast uncompression, but high compression ratio, uses LZ4 under the hood),
  • chunkSize = 16K (means that data will be compressed into blocks of ~16KB),
  • storedFieldsIndexFormat = MEMORY_CHUNK (the most compact fields index format, requires at most 12 bytes of memory per chunk of compressed documents).

Index size

  • Lucene40StoredFieldsFormat
    • Fields index: 76M
    • Field data: 9.4G
  • CompressingStoredFieldsFormat
    • Fields index: 1.7M
    • Field data: 5.7G

Indexing speed

Indexing took almost the same time with both StoredFieldsFormats (~ 37 minutes) and ingestion rates are very similar:

Document loading speed

I measured the average time to load a document from disk using random document identifiers in the [0 - maxDoc[ range. According to free, my I/O cache was ~ 5.2G when I ran these tests:

  • Lucene40StoredFieldsFormat: 11.5ms,
  • CompressingStoredFieldsFormat: 4.25ms.

In the case of Lucene40StoredFieldsFormat, the fields data file is much larger than the I/O cache so many requests to load a document translated to an actual disk seek. On the contrary, CompressingStoredFieldsFormat's fields data file is only a little larger than the I/O cache, so most seeks are served by the I/O cache. This explains why loading documents from disk was more than 2x faster, although it requires more CPU because of uncompression.

In that very particular case it would probably be even faster to switch to a more aggressive compression mode or a larger block size so that the whole .fdt file can fit into the I/O cache.

Conclusion

Unless your server has very fast I/O, it is usually faster to compress the fields data file so that most of it can fit into the I/O cache. Compared to Lucene40StoredFieldsFormat, CompressingStoredFieldsFormat allows for efficient stored fields compression and therefore better performance.

Notes

  1. jpountz posted this