What is the theory behind Apache Lucene?
Users regularly ask for more insight into Lucene internals. For example, see:
- Lucene user mailing list - "lucene algorithm?"
- StackOverflow - "How does Lucene index documents?"
- Quora - "Could you introduce the index-file structure and theory of Lucene?"
Although most of the ideas behind Lucene are explained in any good book on Information Retrieval, Lucene also implements some advanced algorithms for specific tasks. In these cases, it is probably easier to read an article describing the idea than to reverse-engineer the code. This is why I started a wiki page to collect links to research papers and blog articles that explain some advanced ideas behind Lucene.
Feel free to help me improve this wiki page by suggesting Lucene algorithms that deserve an entry on it!