I had access to a large collection of digital art, and I wanted to see if I could apply CLIP to find similar images. I used CLIP to compute an embedding for every image, then used scikit-learn’s KNN implementation to find the top-6 related images, among other things.
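As a rough sketch of that pipeline (with random unit vectors standing in for the actual CLIP embeddings, which would come from the model):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Stand-in for CLIP embeddings: 1000 "images", 512-dim unit vectors.
# Real embeddings would come from a CLIP image encoder; these are random.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 512))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

# Cosine distance = 1 - cosine similarity. Ask for 7 neighbours because
# the query point comes back as its own nearest neighbour (distance 0).
knn = NearestNeighbors(n_neighbors=7, metric="cosine").fit(embeddings)
distances, indices = knn.kneighbors(embeddings[:1])

# indices[0][0] is the query itself; the rest are the top-6 related images.
top6 = indices[0][1:]
```

Exact KNN like this is fine for a static collection, but as noted below it doesn't scale to live insertions.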

This got me thinking that I could try to set up a live KNN to
quickly search for images using text descriptions. Obviously, adding a
single item to a dataset and then recomputing KNN from scratch is not the
right approach, so I began looking for approximations. I found *Locality
Sensitive Hashing* and began to look into it.

I eventually found a weird bit of trivia that I think helps explain the algorithm, and contributed some of my findings below to the Wikipedia page.

Locality sensitive hashing is an approximation to KNN. I found a specific implementation with an interesting approach: mmapping its index from disk. I wanted to understand how LSH approximates KNN, so I thought I’d look into some of the details of this particular implementation.

The authors explain that they use random projections for the hashing function, and the repo even links out to that specific section of the Wikipedia page for LSH. The way it works is that we repeatedly split the space using randomly chosen hyperplanes, yielding a binary tree. They do this multiple times, producing a forest of trees. The more trees, the more accurate the search will be.

That doesn’t really explain how this approximates KNN though, so I thought I’d look into it a little more. Following the Wikipedia link down the rabbit hole, I found that this hashing method was originally developed at Princeton, was used by Google for some apparently nefarious user-tracking system, and is called SimHash. Apparently it’s still in use today at Google, and probably many other places.

First I want to go over the gist of the algorithm. It seems pretty
clear, and the specific hashing mechanism used in the Spotify
implementation involves cutting a high dimensional space repeatedly
until you have created a sort of “pocket” (understood as a
*bucket*) where your points live, kind of like a random version
of R-trees.

This cutting can be understood as a tree structure, or the leaves
of the tree can be interpreted as buckets in a hash table. You can
index new points into a bucket, and manually check the points in the
same bucket, which are *probably* similar. You can make it more
likely that you’ll find all relevant nearby datapoints by creating
some number *k* of these trees and using them all for lookups. In
fact, you could vary the number of trees consulted at query time to
trade accuracy for speed as needed.
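To make the bucket idea concrete, here's a minimal sketch. Note this is *not* the Spotify implementation (which builds actual trees); it's the flat hash-table flavor of random-projection LSH, with made-up dimensions and table counts:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM, N_PLANES, N_TABLES = 64, 12, 8  # illustrative sizes, not tuned

# One set of random hyperplanes per table. A point's bucket key is the
# pattern of signs of its dot products with the planes (a 12-bit code).
tables = [rng.normal(size=(N_PLANES, DIM)) for _ in range(N_TABLES)]

def bucket_key(v, planes):
    bits = (planes @ v) > 0
    return bits.tobytes()  # hashable bucket id

# Index some points into every table's hash map.
points = rng.normal(size=(500, DIM))
index = [dict() for _ in range(N_TABLES)]
for i, v in enumerate(points):
    for t, planes in enumerate(tables):
        index[t].setdefault(bucket_key(v, planes), []).append(i)

def candidates(q, n_tables=N_TABLES):
    # Union of the query's buckets across the first n_tables tables;
    # varying n_tables at query time trades recall for speed.
    out = set()
    for t in range(n_tables):
        out.update(index[t].get(bucket_key(q, tables[t]), []))
    return out

cands = candidates(points[0])
```

The candidate set is then scanned manually (exact cosine distance) to pick the true nearest neighbours, which is the "check the points in the same bucket" step described above.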

Anyway, the section on wikipedia says (*said*, because I
edited it after this research)

> It is not difficult to prove that, for two vectors \(u,v\), \(Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}\), where \(\theta(u,v)\) is the angle between \(u\) and \(v\). \(1 - \frac{\theta(u,v)}{\pi}\) is closely related to \(\cos(\theta(u,v))\).

It was not really clear to me *how* those are related, and
the paper
on this technique is similarly succinct.

One thing it does mention, though, is:

> Note that the function \(1 − \frac{\theta}{\pi}\) is closely related to the function \(\cos(\theta)\). (In fact it is always within a factor \(0.878\) from it)

That sure is a weird constant! Later on we find:

> This was used by Goemans and Williamson … in their rounding scheme for the semidefinite programming relaxation of MAX-CUT

This citation leads to a 1995 paper on an approximation algorithm for MAX-CUT, the NP-hard problem of partitioning the nodes of a graph into two subsets with the maximum number of edges between them.

This approximation uses a technique from a subfield of convex optimization called semi-definite programming, and is shown to produce a cut whose expected value is at least \(87.8\%\) of the optimum.

Here’s a timestamped link to a video from a very under-appreciated YouTube channel with a nice diagram that I think gives good intuition for where the “not difficult to prove” relationship and the constant come from.

Basically, for two vectors \(\overrightarrow{\boldsymbol{u}}\) and \(\overrightarrow{\boldsymbol{v}}\), when \(\overrightarrow{\boldsymbol{u}} = -\overrightarrow{\boldsymbol{v}}\), the angle between them is the maximal angle \(\pi\); if the vectors get any closer, the angle shrinks, and with it the chance that a random hyperplane lands between them. So the likelihood that a random hyperplane separates them (\(Pr[h(u) \neq h(v)]\)) is \(\frac{\theta}{\pi}\), and therefore the probability \(Pr[h(u) = h(v)]\) is \(1 - \frac{\theta}{\pi}\).
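This is easy to check empirically. The sketch below (my own, not from the paper) draws random Gaussian normals as hyperplanes through the origin and measures how often two vectors at a known angle land on opposite sides:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two unit vectors at a known angle theta (in 2-D for simplicity; the
# argument is the same in higher dimensions).
theta = 1.0  # radians
u = np.array([1.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta)])

# h(x) = sign(r . x) for a random Gaussian vector r defines a random
# hyperplane through the origin (r is its normal).
n_trials = 200_000
r = rng.normal(size=(n_trials, 2))
separated = ((r @ u > 0) != (r @ v > 0)).mean()

# The empirical separation rate should be close to theta / pi.
```

With these numbers `separated` lands within Monte Carlo noise of \(\frac{\theta}{\pi} \approx 0.318\).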

And how is \(\frac{\theta}{\pi}\)
related to \(\cos(\theta)\)? I
suggest you watch the whole series on semi-definite programming above
to learn more, but the timestamp I linked suggests we plot the ratio
between \(\frac{\theta}{\pi}\) and
\(\frac{1-\cos(\theta)}{2}\) and find
its minimum, which is \(.878\). In
other words, \(\frac{\theta}{\pi}\) is roughly proportional to
\(\frac{1-\cos(\theta)}{2}\). Indeed, if you plot
those two functions you can see that, within the bounds
\([0,\pi]\), they are rather close,
and if you plot
their difference you see it never gets much beyond
\(.10\). In fact, the original paper finds the minimum of
\(\frac{2}{\pi}\frac{\theta}{1-\cos(\theta)}\)
on the interval \([0, \pi]\), which is exactly that ratio.
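We can recover the constant numerically by evaluating that ratio on a fine grid (a quick sketch; a proper optimizer would pin it down more precisely):

```python
import numpy as np

# Evaluate the Goemans-Williamson ratio (2/pi) * theta / (1 - cos(theta))
# on a fine grid of (0, pi] and take the minimum. The ratio blows up as
# theta -> 0, so starting just above zero is safe.
theta = np.linspace(1e-6, np.pi, 1_000_000)
ratio = (2 / np.pi) * theta / (1 - np.cos(theta))
alpha = ratio.min()
# alpha comes out to about 0.878, the constant from the paper.
```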

So we’ve solved the mystery of how partitioning a space using random
hyperplanes approximates cosine distance.
Is that it? I think so. We repeatedly slice the space in half, and
things with similar cosine distance will *probably* end up on
the same side. Do it enough times and we get something interpretable
as a tree, or a hash table if you want to look at it that way, and you
have buckets that are increasingly more likely to contain items with
similar cosine distance. If you create more of these random hyperplane
trees you are more likely to get all the points you are looking
for.

Since I’m eager to post this, especially given its length, I will postpone a discussion of how to use this to de-dupe an image set, for which an IPython notebook is probably better suited anyway.