
How is this similar to / different from Johnson-Lindenstrauss? (https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...)

The principle seems very similar: find a simpler representation that preserves distance.



The JL lemma is essentially an error bound, which reassures us that dimensionality reduction (compress big vectors → small vectors) is not a lost cause. More specifically, it says that even if you perform a random projection—the most naive DR method—to compress a high-dimensional data matrix to a low-dimensional one (i.e. same number of rows, but many fewer columns), it will only distort the pairwise Euclidean distances between points (row vectors of the data matrix) by $this_much.

Example: if you crudely crunch MNIST from a [60000 x 784] matrix down to [60000 x 15] matrix (15 is arbitrary here—anything smaller than the original 784 works) using random projection, the JL lemma tells you how badly you mangled the similarity structure of your dataset in the worst case scenario. It turns out it’s not so bad, so we can use DR to compress our data matrix and then confidently do useful things with the vectors it contains (like search for similar ones).
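You can see the JL effect directly with a random Gaussian projection in numpy. The shapes below are arbitrary stand-ins (random data rather than actual MNIST), just to show that pairwise distances survive the compression reasonably well:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a data matrix: 1000 random points in 784 dimensions
# (shapes chosen to echo the MNIST example; the data here are random).
n, d, k = 1000, 784, 64
X = rng.normal(size=(n, d))

# Random projection: a Gaussian matrix scaled by 1/sqrt(k) so that
# squared distances are preserved in expectation.
P = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ P  # compressed to [1000 x 64]

# Compare pairwise distances before and after on 100 random pairs.
i = rng.integers(0, n, size=100)
j = rng.integers(0, n, size=100)
mask = i != j  # skip degenerate zero-distance pairs
orig = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
proj = np.linalg.norm(Y[i[mask]] - Y[j[mask]], axis=1)
ratios = proj / orig  # each ratio stays close to 1
```

Even at 784 → 64 dimensions the distance ratios cluster tightly around 1, which is exactly the worst-case guarantee JL quantifies.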

ScaNN is an approximate nearest neighbor (ANN) search algorithm. There is an entire zoo of them, but they all have the same goal: given a query vector q and a data matrix X (where q has the same number of elements as a row vector in X), find the k vectors in X that are “most similar” to q, without exhaustively checking all the possible matches. Most similar could be any similarity or distance measure, but in this case it’s the simple inner product.
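For reference, the exact (exhaustive) version of that problem is one matrix-vector product plus a top-k selection. The shapes here are made up; ANN methods like ScaNN exist precisely to approximate this answer without scoring every row:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10000, 128))  # data matrix (hypothetical shapes)
q = rng.normal(size=128)           # query vector, same width as a row of X

# Brute-force maximum inner product search: score everything, keep top k.
k = 5
scores = X @ q
topk = np.argpartition(scores, -k)[-k:]        # unordered top-k indices
topk = topk[np.argsort(scores[topk])[::-1]]    # sorted best-first
```

This is the ground truth that ANN benchmarks measure recall against.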

To do this more quickly, the algorithm uses a type of vector quantization (i.e. sort the high-D vectors into buckets), which is similar to k-means. K-means can itself be construed as a dimensionality reduction technique—you are just reducing each vector to a single dimension, its cluster label. That way you're only looking in buckets where there's hope of finding close matches.
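A minimal sketch of that bucket-then-search idea, using plain Lloyd's k-means in numpy (not ScaNN's actual anisotropic quantization; all shapes and parameter names here are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5000, 32))  # data matrix
q = rng.normal(size=32)          # query vector

# 1) "Presort the socks": partition rows into buckets with a few
#    iterations of toy k-means.
n_buckets = 16
centers = X[rng.choice(len(X), n_buckets, replace=False)]
for _ in range(10):
    assign = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
    for b in range(n_buckets):
        if (assign == b).any():
            centers[b] = X[assign == b].mean(axis=0)

# 2) At query time, only score rows in the few most promising buckets
#    (those whose centers have the largest inner product with q),
#    instead of all 5000 rows.
n_probe = 3
best_buckets = np.argsort(centers @ q)[::-1][:n_probe]
candidates = np.flatnonzero(np.isin(assign, best_buckets))
best = candidates[np.argmax(X[candidates] @ q)]
```

The speed/recall trade-off lives in `n_probe`: probe more buckets and you approach brute force; probe fewer and you risk missing the true nearest neighbor if it landed in an unprobed bucket.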

This works great in real life with socks—if you’re trying to find a match for a stray black dress sock, then 1) make sure you presort your socks so that they are approximately binned by similarity and 2) look in the bin that contains socks most similar to your “query sock” (i.e. don’t look in the colorful athletic sock or plain white dad-sock bins). I’m not actually kidding, if you have a billion unmatched socks in a box, try this and you’ll see.

Anyways...as for this particular algorithm, someone else may be able to explain the details better. But my coarse understanding is that the loss function (which measures how good the bucketing is—how purely you pre-binned your socks) was defined in a way that accounts for the density of the region in question—a neighbor in NYC means something different than a neighbor in Wyoming, so you need more granular buckets for denser regions. However, I thought this was essentially how almost every ANN algorithm worked, so how this one blows all the others out of the water (faster for a given recall value) is mystifying to me.

[PS—someone please correct me if any aspect of my loosey goosey explanation is inaccurate]



