Wednesday, January 23, 2013

Putting term vectors on a diet

What are term vectors?

Term vectors are an interesting Lucene feature, which allows for retrieving a single-document inverted index for any document ID of your index. This means that given any document ID, you can quickly list all its unique terms in sorted order, and for every term you can quickly know its original positions and offsets. For example, if you indexed the following document:

Field nameField value
textthe quick brown fox jumps over the lazy dog

You would retrieve the following term vectors:

the20, 6[0,3], [31,34]

For very small documents, it makes little sense to store term vectors given than they can be recomputed very quickly by re-analyzing a document’s stored fields. But if your documents are large or if your analysis pipeline is expensive, storing term vectors on disk can be much faster than computing them on the fly. So far, term vectors have been mainly used for highlighting and MoreLikeThis (searching for similar documents) but there is an interesting issue open in Lucene JIRA to use term vectors to perform partial document updates.

However, term vectors come with a cost. They store a lot of information and often take up a lot of disk space. This is bad because it can make indexing and searching slower (especially if the index size grows beyond the size of your OS cache).

Term vectors compression

Having worked on stored fields compression in the past months, my first idea was to apply the same recipe: collect enough raw data to fill a 16 KB block, then compress it and flush it to disk. However term vectors are more challenging to compress: terms are already unique so it is rather hard for LZ codecs such as LZ4 to reach good compression ratios. Moreover, general-purpose compression algorithms are usually not very good at compressing numeric data (frequencies, term positions and offsets) so I needed something else.

After long hours of trial and error, I managed to write a new term vectors format based on LZ4 and bit-packing which efficiently compresses term vectors for various cases. Depending on the collection of documents, the compression ratio of the term vector files varied from 0.53 to 0.90. For example, indexing term vectors (with positions and offsets enabled) for 1M articles from the English Wikipedia database generates 5.9G of term vector files with the default codec from Lucene 4.0 or 4.1. By switching to this new term vectors format, the size of the term vector files decreased to 3.9G! Another good news is that this size reduction made indexing faster: while indexing those 1M articles took 1038 seconds with the current term vectors format, it took only 870 seconds with this new compressed format (see ingestion rate charts below).

Ingestion rate with the current default format.

Ingestion rate with the new compressed format.

Although this new format is still very experimental, I think it’s promising and would make a good candidate to become the new default term vectors format for a future version of Lucene. If you are interested in better understanding how it works and the compression ratio you can expect from this format, you can read more about it in Lucene Jira.


  1. jpountz posted this