February 2013
1 post
2 tags
lz4-java 1.1.0 is out
I’m happy to announce the release of lz4-java 1.1.0. Artifacts can be downloaded from Maven Central and javadocs can be found at jpountz.github.com/lz4-java/1.1.0/docs/.
Release highlights
lz4 has been upgraded from r87 to r88 (improves LZ4 HC compression speed).
Experimental streaming support: data is serialized into fixed-size blocks of compressed data. This can be useful for people...
January 2013
2 posts
1 tag
Putting term vectors on a diet
What are term vectors?
Term vectors are an interesting Lucene feature that lets you retrieve a single-document inverted index for any document ID in your index. This means that given any document ID, you can quickly list all of its unique terms in sorted order, and for every term you can quickly look up its original positions and offsets. For example, if you indexed the following...
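To make the idea of a single-document inverted index concrete, here is a minimal sketch in plain Java. It does not use Lucene's API or its on-disk format: the class name `TermVectorSketch`, the `Occurrence` type, and the naive `\w+` tokenizer with lowercasing are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: not Lucene's term vectors implementation.
class TermVectorSketch {

    /** One occurrence of a term: its token position and character offsets. */
    static final class Occurrence {
        final int position;
        final int startOffset;
        final int endOffset;

        Occurrence(int position, int startOffset, int endOffset) {
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        @Override
        public String toString() {
            return "pos=" + position + " offsets=[" + startOffset + "," + endOffset + ")";
        }
    }

    /** Builds a sorted term -> occurrences map for a single document. */
    static TreeMap<String, List<Occurrence>> termVector(String text) {
        // A TreeMap keeps the unique terms in sorted order.
        TreeMap<String, List<Occurrence>> terms = new TreeMap<>();
        // Assumed tokenizer: runs of word characters, lowercased.
        Matcher m = Pattern.compile("\\w+").matcher(text);
        int position = 0;
        while (m.find()) {
            String term = m.group().toLowerCase(Locale.ROOT);
            terms.computeIfAbsent(term, k -> new ArrayList<>())
                 .add(new Occurrence(position, m.start(), m.end()));
            position++;
        }
        return terms;
    }

    public static void main(String[] args) {
        Map<String, List<Occurrence>> tv = termVector("the quick fox jumps over the lazy fox");
        for (Map.Entry<String, List<Occurrence>> e : tv.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each unique term appears once, in sorted order, together with every position and offset pair at which it occurred in the document.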
2 tags
lz4-java 1.0.0 released
I am happy to announce that I released the first version of lz4-java, version 1.0.0.
lz4-java is a Java port of the lz4 compression library and the xxhash hashing library, which are both known for being blazing fast.
This release is based on lz4 r87 and xxhash r6. Artifacts have been pushed to Maven Central (net.jpountz.lz4:lz4:jar:1.0.0) and javadocs can be found at...
November 2012
1 post
2 tags
Stored fields compression in Lucene 4.1
Last time, I tried to explain how efficient stored fields compression can help when your index grows larger than your I/O cache. Indeed, magnetic disks are so slow that it is usually worth spending a few CPU cycles on compression in order to avoid disk seeks.
I have very good news for you: the stored fields format I used for these experiments will become the new default stored fields format as...
October 2012
1 post
2 tags
Efficient compressed stored fields with Lucene
Whatever you are storing on disk, everything usually goes perfectly well until your data becomes too large for your I/O cache. Until then, most disk accesses actually never touch disk and are almost as fast as reading or writing to main memory. The problem arises when your data becomes too large: disk accesses that can’t be served through the I/O cache will trigger an actual disk seek, and...
July 2012
1 post
1 tag
Wow, LZ4 is fast!
I’ve been doing some experiments with LZ4 recently and I must admit that I am truly impressed. For those not familiar with LZ4, it is a compression format from the LZ77 family. Compared to other similar algorithms (such as Google’s Snappy), LZ4’s file format does not allow for very high compression ratios since:
you cannot reference sequences that are more than 64 KB backwards...
June 2012
2 posts
1 tag
What is the theory behind Apache Lucene?
There is a recurring request from users to have more insight into Lucene internals. For example, see:
Lucene user mailing-list - lucene algorithm?,
StackOverflow - How does Lucene index documents?,
Quora - Could you introduce the index-file structure and theory of Lucene?.
Although most of the ideas behind Lucene are explained in any good book on Information Retrieval, Lucene also implements...
2 tags
How fast is bit packing?
One of the most anticipated changes in Lucene/Solr 4.0 is its improved memory efficiency. Indeed, according to several benchmarks, you could expect a two-thirds reduction in memory use for a Lucene-based application (such as Solr or ElasticSearch) compared to Lucene 3.x.
One of the techniques that Lucene uses to reduce its memory footprint is bit-packing. This means that integer array values, instead...
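As a rough sketch of what bit-packing means in general, the following plain Java code stores each value using a fixed number of bits inside a `long[]`. This is not Lucene's implementation: the class name `BitPackingSketch`, the `pack` and `get` methods, and the little-endian bit layout are assumptions chosen for illustration.

```java
// Hypothetical illustration only: not Lucene's packed integers implementation.
class BitPackingSketch {

    /** Packs values that each fit in bitsPerValue bits (1..32) into a long[]. */
    static long[] pack(int[] values, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32) {
            throw new IllegalArgumentException("bitsPerValue must be in [1, 32]");
        }
        long totalBits = (long) values.length * bitsPerValue;
        long[] blocks = new long[(int) ((totalBits + 63) / 64)];
        for (int i = 0; i < values.length; i++) {
            long value = values[i] & 0xFFFFFFFFL; // treat the int as unsigned
            if (bitsPerValue < 32 && (value >>> bitsPerValue) != 0) {
                throw new IllegalArgumentException("value does not fit: " + values[i]);
            }
            long bitPos = (long) i * bitsPerValue;
            int block = (int) (bitPos >>> 6);  // which long holds the first bit
            int shift = (int) (bitPos & 63);   // bit index inside that long
            blocks[block] |= value << shift;
            int spill = shift + bitsPerValue - 64; // bits overflowing into the next long
            if (spill > 0) {
                blocks[block + 1] |= value >>> (bitsPerValue - spill);
            }
        }
        return blocks;
    }

    /** Reads back the value stored at the given index. */
    static int get(long[] blocks, int bitsPerValue, int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // The high bits of the value live at the start of the next long.
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        long mask = (1L << bitsPerValue) - 1;
        return (int) (value & mask);
    }

    public static void main(String[] args) {
        int[] values = {1, 5, 3, 7, 0, 2};    // every value fits in 3 bits
        long[] packed = pack(values, 3);       // 18 bits in total, so a single long
        for (int i = 0; i < values.length; i++) {
            System.out.println(i + " -> " + get(packed, 3, i));
        }
    }
}
```

In this sketch, six 3-bit values occupy one `long` (8 bytes) rather than six `int`s (24 bytes), at the cost of a few shifts and masks on every read.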