I had access to a large collection of digital art, and I wanted to see if I could apply CLIP to find similar images. I used CLIP to encode an embedding for every image, and then used scikit-learn’s KNN implementation to find the top-6 related images, among other things.
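For concreteness, here’s a minimal sketch of what that pipeline might look like, assuming OpenAI’s clip package and scikit-learn; the model name, device handling, and image paths are just illustrative.

```python
# Hedged sketch: encode a collection of images with CLIP, then find the
# top-6 nearest neighbours with scikit-learn's exact KNN.
import clip
import torch
from PIL import Image
from sklearn.neighbors import NearestNeighbors

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model choice is illustrative

image_paths = ["art/0001.png", "art/0002.png"]  # ... the rest of the collection

# One CLIP embedding per image.
with torch.no_grad():
    embeddings = torch.cat([
        model.encode_image(preprocess(Image.open(p)).unsqueeze(0).to(device))
        for p in image_paths
    ]).float().cpu().numpy()

# Exact KNN; cosine distance matches how CLIP embeddings are usually compared.
knn = NearestNeighbors(n_neighbors=6, metric="cosine").fit(embeddings)
distances, indices = knn.kneighbors(embeddings[:1])  # top-6 neighbours of the first image
```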
This got me thinking that I could try to set up a live KNN to quickly search for images using text descriptions. Obviously, adding a single item to a dataset and then recomputing KNN is not the correct approach, so I began looking for approximations. I found Locality Sensitive Hashing and began to look into it.
I eventually found a weird bit of trivia that I think helped explain the algorithm, and contributed some of my findings below to the Wikipedia page.
Locality sensitive hashing is an approximation to KNN. I found a specific implementation with an interesting approach: mmapping an index from disk. I wanted to do a bit of research into how LSH approximates KNN, so I thought I’d look into some of the details of this particular implementation.
The authors explain that they use random projections for the hashing function, and the repo even links out to that specific section of the Wikipedia page for LSH. The way it works is that we repeatedly split the space using randomly chosen hyperplanes, which yields a binary tree. They do this multiple times to build a forest of trees; the more trees, the more accurate the search will be.
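Here’s a toy sketch of that splitting process, just to make it concrete. It’s not the library’s actual code (which chooses its splitting planes in its own way); it simply uses random hyperplanes through the origin and stops when a leaf is small enough.

```python
# Hedged sketch: build a forest of random-hyperplane trees over a set of points.
import numpy as np

def build_tree(points, indices, leaf_size, rng):
    # Stop splitting once the node is small enough.
    if len(indices) <= leaf_size:
        return {"leaf": indices}
    normal = rng.standard_normal(points.shape[1])   # a random hyperplane through the origin
    side = points[indices] @ normal > 0             # which side of the plane each point falls on
    left, right = indices[~side], indices[side]
    if len(left) == 0 or len(right) == 0:           # degenerate split; just stop here
        return {"leaf": indices}
    return {"normal": normal,
            "left": build_tree(points, left, leaf_size, rng),
            "right": build_tree(points, right, leaf_size, rng)}

def build_forest(points, n_trees=10, leaf_size=16, seed=0):
    rng = np.random.default_rng(seed)
    all_indices = np.arange(len(points))
    return [build_tree(points, all_indices, leaf_size, rng) for _ in range(n_trees)]
```

More trees means more chances for two nearby points to land in the same leaf somewhere in the forest, which is why accuracy improves with the number of trees.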
That doesn’t really explain how this approximates KNN though, so I thought I’d look into it a little more. Following the Wikipedia link down the rabbit-hole, I found that this hashing method, called SimHash, was originally developed at Princeton and used by Google for some apparently nefarious user-tracking system. Apparently it’s still in use today at Google, and probably many other places.
First I want to go over the gist of the algorithm. It seems pretty clear: the specific hashing mechanism used in the Spotify implementation involves cutting a high-dimensional space repeatedly until you have created a sort of “pocket” (understood as a bucket) where your points live, kind of like a randomized version of R-trees.
This cutting can be understood as a tree structure, or the leaves of the tree can be interpreted as buckets in a hash table. You can index new points into a bucket, and manually check the points that land in the same bucket, which are probably similar. You can make it more likely that you’ll find all the relevant nearby points by creating some number k of these trees and using them all for lookups; in fact, you could vary the number of trees you consult at runtime to trade accuracy for speed as needed.
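To make the bucket view concrete, here’s a small SimHash-style sketch (again, not the library’s code): each table hashes a vector to the sign pattern of a handful of random projections, and a query only inspects candidates that share a bucket in at least one table. The parameter names and sizes are arbitrary.

```python
# Hedged sketch: random-hyperplane LSH viewed as a set of hash tables.
import numpy as np

class RandomHyperplaneLSH:
    def __init__(self, dim, n_planes=12, n_tables=8, seed=0):
        rng = np.random.default_rng(seed)
        # One bundle of random hyperplane normals per table.
        self.planes = rng.standard_normal((n_tables, n_planes, dim))
        self.tables = [dict() for _ in range(n_tables)]

    def _keys(self, vec):
        # Bucket key per table: the tuple of sign bits of the projections.
        return [tuple((P @ vec > 0).astype(int)) for P in self.planes]

    def add(self, vec, item_id):
        for table, key in zip(self.tables, self._keys(vec)):
            table.setdefault(key, []).append(item_id)

    def candidates(self, vec, n_tables=None):
        # Consulting fewer tables at query time trades recall for speed.
        tables = self.tables if n_tables is None else self.tables[:n_tables]
        found = set()
        for table, key in zip(tables, self._keys(vec)):
            found.update(table.get(key, []))
        return found
```

The candidates you get back are just the points sharing a bucket with the query; you would then score them exactly (for example by cosine distance) to pick the final neighbours.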
Anyway, the section on Wikipedia says (said, because I edited it after this research):
It is not difficult to prove that, for two vectors \(u,v\), \(Pr[h(u) = h(v)] = 1 - \frac{\theta(u,v)}{\pi}\), where \(\theta(u,v)\) is the angle between \(u\) and \(v\). \(1 - \frac{\theta(u,v)}{\pi}\) is closely related to \(\cos(\theta(u,v))\).
It was not really clear to me how those are related, and the paper on this technique is similarly succinct.
One thing it does mention, though, is:
Note that the function \(1 − \frac{\theta}{\pi}\) is closely related to the function \(\cos(\theta)\). (In fact it is always within a factor \(0.878\) from it)
That sure is a weird constant! Later on we find:
This was used by Goemans and Williamson … in their rounding scheme for the semidefinite programming relaxation of MAX-CUT
This citation leads to a 1995 paper about an approximation method for the NP-hard MAX-CUT problem, the problem of partitioning the nodes of a graph into two subsets with the maximum number of edges between them.
This approximation uses a technique from a subfield of convex optimization called semi-definite programming, and is shown to produce a cut whose expected value is at least \(87.8\%\) of the optimum.
Here’s a timestamped link to a video from a very under-appreciated YouTube channel with a nice diagram that I think gives a good intuition for where the “not difficult to prove” relationship and the constant come from.
Basically, for two vectors \(\overrightarrow{\boldsymbol{u}}\) and \(\overrightarrow{\boldsymbol{v}}\), when \(\overrightarrow{\boldsymbol{u}} = -\overrightarrow{\boldsymbol{v}}\), the angle between them is the maximal angle \(\pi\), i.e. if the vectors get any closer the angle is reduced. A random hyperplane through the origin separates the two vectors exactly when the line it cuts through their shared plane falls inside the angle \(\theta(u,v)\) between them, so the likelihood that it separates them (\(Pr[h(u) \neq h(v)]\)) is \(\frac{\theta}{\pi}\). And therefore the probability \(Pr[h(u) = h(v)]\) is \(1 - \frac{\theta}{\pi}\).
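If you want to convince yourself of that claim numerically, a quick Monte Carlo check works: draw many random hyperplanes through the origin and count how often two fixed vectors land on the same side. The vectors and angle below are arbitrary.

```python
# Hedged sketch: empirically check Pr[h(u) = h(v)] = 1 - theta / pi.
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0                                          # angle between u and v, in radians
u = np.array([1.0, 0.0, 0.0])
v = np.array([np.cos(theta), np.sin(theta), 0.0])

normals = rng.standard_normal((200_000, 3))          # random hyperplane normals
same_side = np.mean(np.sign(normals @ u) == np.sign(normals @ v))

print(same_side, 1 - theta / np.pi)                  # both should come out near 0.68
```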
And how is \(\frac{\theta}{\pi}\) related to \(\cos(\theta)\)? I suggest you watch the whole series on semi-definite programming above to learn more, but the timestamp I linked suggests we plot the ratio between \(\frac{\theta}{\pi}\) and \(\frac{1-\cos(\theta)}{2}\) and find its minimum, which is \(.878\). In other words, \(\frac{\theta}{\pi}\) closely tracks \(\frac{1-\cos(\theta)}{2}\). Indeed, if you plot those two functions you can see that, within the bounds \([0,\pi]\), they are rather close, and if you plot their difference you see it never gets very much beyond \(.10\). In fact, the original paper finds the minimum of \(\frac{2}{\pi}\frac{\theta}{1-\cos(\theta)}\) on the interval \((0, \pi]\).
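You can reproduce that constant in a couple of lines of numpy; the grid resolution here is arbitrary.

```python
# Hedged sketch: minimise (2/pi) * theta / (1 - cos(theta)) on (0, pi] with a dense grid.
import numpy as np

theta = np.linspace(1e-6, np.pi, 1_000_000)
ratio = (2 / np.pi) * theta / (1 - np.cos(theta))
print(ratio.min())   # ~0.8786, the Goemans-Williamson constant
```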
So we’ve solved the mystery of how partitioning a space with random hyperplanes approximates cosine distance. Is that it? I think so. We repeatedly slice the space in half, and vectors that are close in cosine distance will probably end up on the same side. Do it enough times and we get something interpretable as a tree, or a hash table if you want to look at it that way, with buckets that are increasingly likely to contain items that are close in cosine distance. If you create more of these random-hyperplane trees you are more likely to get all the points you are looking for.
Since I’m eager to post this up, especially given its length, I will postpone a discussion of how to use this to de-dupe an image set, for which an IPython notebook is probably better suited anyway.