YSK that ANN is already in the process of being added to Lucene: https://github....

kleebeesh · on July 30, 2020

Yep, I'm aware.

The Lucene implementation seems early and slow-moving. It looks like they're trying to create new storage formats and use graph-based search methods. OpenDistro wrapped a C++ binary that also uses a graph-based method. It works quite well, but it only supports L2 similarity and comes with the operational burden of running a rather large sidecar process completely separate from the JVM.

The approach I've taken is to support five similarity functions (L1, L2, Angular, Jaccard, Hamming), support both sparse and dense vectors, implement everything inside the JVM with no sidecar processes and no changes to Lucene, and use hashing-based search methods (i.e., LSH). IMO the last point has a clear advantage over graph-based methods: the hashes are treated just like regular text tokens, which is the access pattern ES/Lucene is built around. Of course it will likely lose to a C++ implementation in terms of raw speed because it's the JVM, but IMO that matters less than making the plugin trivial to run and scale.
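
To make the "hashes as tokens" idea concrete, here's a rough Python sketch of MinHash LSH for Jaccard similarity on top of a plain inverted index. It is only an illustration, not the plugin's actual code: the names (`make_hash_params`, `minhash_tokens`, `TokenIndex`), the banding scheme, and the in-memory dict standing in for a Lucene term index are all assumptions made for the example.

```python
import random
from collections import defaultdict

PRIME = 2_147_483_647  # 2^31 - 1; set elements are assumed to be ints in [0, PRIME)


def make_hash_params(num_hashes, seed=0):
    """Hypothetical helper: random (a, b) pairs for hashes h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(num_hashes)]


def minhash_tokens(indices, params, band_size):
    """Map a non-empty set of ints to string tokens, one token per band of MinHash values."""
    signature = [min((a * i + b) % PRIME for i in indices) for a, b in params]
    tokens = []
    for band, start in enumerate(range(0, len(signature), band_size)):
        chunk = signature[start:start + band_size]
        # Prefix with the band number so equal values in different bands never match.
        tokens.append(f"{band}:" + ",".join(map(str, chunk)))
    return tokens


def jaccard(a, b):
    """Exact Jaccard similarity, used to re-rank the candidates the index returns."""
    return len(a & b) / len(a | b)


class TokenIndex:
    """Toy inverted index (token -> doc ids), standing in for a text-token index."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, tokens):
        for token in tokens:
            self.postings[token].add(doc_id)

    def candidates(self, tokens):
        """Doc ids sharing at least one token with the query, most matches first."""
        counts = defaultdict(int)
        for token in tokens:
            for doc_id in self.postings.get(token, ()):
                counts[doc_id] += 1
        return sorted(counts, key=lambda d: (-counts[d], d))
```

Indexing a vector just means storing its tokens, and a query is an ordinary "match any of these terms" lookup followed by an exact `jaccard` re-rank of the candidates. A larger `band_size` makes each token more selective (fewer false candidates, lower recall), and more bands pull recall back up.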

I don't think there's a definitively better approach yet. It's an interesting problem and it'll be interesting to see what ends up working well.