Sorted data sets are very useful since they make a lot of things easier:
- checking for duplicates,
- computing frequencies (the number of times each unique element appears),
- compression (thanks to delta encoding and bit-packing for example),
- searching thanks to binary search,
Java provides the ability to sort arrays and lists thanks to Arrays.sort and Collections.sort. These methods are flexible enough to accept custom comparators, so that you can define which sort order to use depending on your needs, but you don’t have the choice on the sorting algorithm although they can have very different speed and memory characteristics. Here is how the Oracle JVM 1.7 sorts data:
- Arrays.sort(Object) uses TimSort,
- Arrays.sort on native arrays uses a dual-pivot quicksort,
- Collections.sort dumps the list into an Object array, sorts the array with TimSort, and then re-adds elements to the list from the array.
Although these APIs serve most use-cases, sometimes you would want to have more control over the implementation, for example:
- To sort parallel arrays: imagine you have an array of objects and an array of floats, where the float at offset i is the score of the object at the same offset in the other array. Now you want to sort objects by their score. It is doable by writing a list view on top of these two arrays, but since Collections.sort dumps data into an Object array, this would be very memory intensive.
- To avoid useless object allocations: to sort a random-access list (such as ArrayList, but not LinkedList), there is no need to dump all your elements into an array, you could instead sort the list in place and save memory.
- To better fit your data: if your data is “almost” sorted, there is a good chance that TimSort would perform faster than Quicksort.
- To better fit your constraints: if you want to sort a huge array which occupies a large part of your heap, then TimSort might not be the best algorithm since it requires up to n/2 temporary slots. You could instead use an in-place sorting algorithm.
- To better reuse memory: say you want to sort 100,000 small Object arrays. For every array, Java’s Arrays.sort will create a new temporary array (TimSort needs temporary storage to perform merges). By having more control over the sorting implementation, the temporary storage could be reused.
In order to have more control over sorting, Lucene initially imported CGlib’s SorterTemplate and improved it over time. But then arose the need to use TimSort to sort partially-sorted data, and it was very hard to fold this sorting algorithm into SorterTemplate since this sorting algorithm requires temporary storage (on the contrary to the quicksort and in-place merge-sort implemented by SorterTemplate). This is why I started a small GitHub project to refactor SorterTemplate so that it can support more sorting algorithms:
- a modified TimSort, which allows you to configure the amount of temporary storage that can be allocated,
- Merge-sort, with configurable memory overhead similarly to TimSort,
- Introspective sort, which is essentially an improved Quicksort,
- Heap-sort, on both binary and ternay heaps.
For example, here is some code which sorts two parallel arrays using introspective sort:
These new classes are now being used in Lucene, but I think it could be very useful to other projects, so feel free to use them! Feedback is very welcome.