Spectral clustering
Information clustering
Information we have can often be represented as a bunch of numerical variables in some state space. For instance, we could have information on users’ age and the time they spend online, and we could represent this information as two real numbers (which happen to be nonnegative). So what we’d have is a representation of this information as a set of vectors in $\mathbb{R}^2$.
It’s often the case that we would like to group these samples into sets by some notion of “likeness”. We’d like, for instance, two samples with similar ages and similar time spent online to be put into the same group. Based on this information, we’ll try to see if we notice patterns, or if we can use this data to, say, target campaigns better, or offer certain deals to certain groups.
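To make this concrete, here is a minimal sketch (in Python, with invented numbers) of what such a representation might look like; nothing about any actual dataset is implied.

```python
import numpy as np

# Each row is one user: (age in years, hours spent online per week).
# The values are made up purely for illustration.
samples = np.array([
    [19.0, 30.5],
    [22.0, 28.0],
    [45.0,  6.5],
    [47.0,  5.0],
    [33.0, 15.0],
])

# Every sample is a vector in R^2, so "likeness" can be measured
# with ordinary Euclidean distance between rows.
print(np.linalg.norm(samples[0] - samples[1]))
```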
As a graphical representation, we could have two variables, plotted on a plane, and several datapoints:
We might think there’s some interesting pattern underlying this information, so we can try dividing it into, say, 3 clusters:
Notice that I’m not defining this notion of a “pattern” formally: What type of clustering you do, and what your clusters will look like as a result, is up to you. You may want to cluster your information in different ways for different purposes, and different cluster shapes will help reveal different patterns about your data.
Here, I’ve used a very simple method of clustering, called K-means clustering. This worked relatively well, since the underlying information was simply three 2-dimensional random variables, each with a Gaussian distribution. The resulting clusters do seem to group the data “nicely”, in this case.
With that covered, let’s move on to K-means clustering.
K-means
Basics
K-means is one of the simplest ways to cluster data in $\mathbb{R}^n$.

The key idea here is that if your data is structured in this way, each of your clusters can be summarized by a single point, its centroid, and each sample can be assigned to the cluster whose centroid is closest to it.
This will work well when your data really is structured as a bunch of points “orbiting” some centroids. Note that the centroids aren’t part of the data set, they are just what we use to cluster objects. So if we were clustering a bunch of planets with their positions, where the planets have been organized as solar systems, we’d see that planets seem to be orbiting their respective stars, even if our data did not include said stars’ positions. We’d cluster the planets according, implicitly, to which star they orbit.
The algorithm is relatively crude:
Input: data points $x_1, \dots, x_n \in \mathbb{R}^d$, and the number of clusters $k$.
Output: clusters $C_1, \dots, C_k$.

1. Let $c_1, \dots, c_k$ be initial centroids, chosen in some unspecified way.
2. For $j = 1, \dots, k$, let $C_j = \{ x_i : \lVert x_i - c_j \rVert \le \lVert x_i - c_l \rVert \text{ for all } l \}$, with the ties broken in any way (each $x_i$ goes in exactly one cluster $C_j$).
3. For $j = 1, \dots, k$, let $c'_j = \frac{1}{|C_j|} \sum_{x \in C_j} x$.
4. Let $\delta = \max_j \lVert c'_j - c_j \rVert$.
5. If $\delta = 0$, stop and return $C_1, \dots, C_k$. Otherwise, set $c_j = c'_j$ for every $j$, and go to step 2.
Notice that, in step 3, each new centroid $c'_j$ is simply the mean of the points currently assigned to $C_j$, which is where the algorithm gets its name.
The convergence rate and termination of the algorithm are not trivial to analyze. For termination, note that each step (both the reassignment of points and the recomputation of centroids) can only decrease the within-cluster sum of squares defined below, and since there are only finitely many ways to partition the data into $k$ clusters, the algorithm must eventually stop making changes.
What the algorithm tries to minimize is the within-cluster sum of squares. That is, it tries to minimize:

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x \in C_j} \lVert x - c_j \rVert^2$$

With $c_j$ the centroid of cluster $C_j$.
Note that K-means is not guaranteed to minimize this quantity globally: depending on the initial centroids, it may converge to a local minimum that is far from the best possible clustering.
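As a rough sketch of the procedure described above (not the implementation used for the figures in this post), the whole iteration fits in a few lines of numpy; the initialization and the stopping tolerance here are arbitrary choices.

```python
import numpy as np

def kmeans(points, k, seed=0, tol=1e-9, max_iter=100):
    """Lloyd's iteration: assign points to the nearest centroid,
    then move each centroid to the mean of its cluster."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the points as initial centroids (one arbitrary choice).
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its cluster.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Steps 4-5: stop once the centroids no longer move.
        if np.max(np.linalg.norm(new_centroids - centroids, axis=1)) <= tol:
            break
        centroids = new_centroids
    return labels, centroids

def wcss(points, labels, centroids):
    """Within-cluster sum of squares, the quantity K-means tries to reduce."""
    return sum(np.sum((points[labels == j] - c) ** 2) for j, c in enumerate(centroids))
```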
K-means peculiarities
If you play around with the algorithm a bit, you’ll see that it has certain peculiarities in the clusters it returns.
- It tries to give all clusters “radiuses” of approximately the same size. That is, if we consider each cluster as all the datapoints within some distance of a centroid, these distances tend to be similar across all clusters.
- It doesn’t deal well with clusters within clusters or, in general, with cluster intersections. It will greedily split the space among all the clusters involved.
- It doesn’t deal well with clusters that aren’t globular in shape to begin with. Think of the following 2-dimensional dataset:
The points here don’t look like globular clusters. Indeed, trying to run k-means on them, with the obvious choice of $k$, gives the following:
This, while a valid clustering, and one with a small within-cluster sum of squares, does not seem to reflect the data very well. K-means greedily split the space among the centroids, but the data wasn’t really produced by a process that generates that sort of structure.
So what can we do to try to mitigate this? Clearly this data has a lot of structure. How can we try to exploit this structure to cluster it better? Is there a way to pick better starting points for k-means to do so?
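As a concrete illustration of the failure mode, here is a hedged sketch: I’m assuming a dataset of two concentric rings, a typical non-globular shape (the dataset pictured above may well be different), and running scikit-learn’s k-means on it.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

def ring(n, radius, noise=0.05):
    """n noisy points on a circle of the given radius, centred at the origin."""
    angles = rng.uniform(0, 2 * np.pi, n)
    r = radius + rng.normal(scale=noise, size=n)
    return np.column_stack([r * np.cos(angles), r * np.sin(angles)])

# Two concentric rings: clearly structured, but not globular.
points = np.vstack([ring(200, 1.0), ring(200, 3.0)])

# K-means happily returns a partition, but it cuts straight across both
# rings instead of separating the inner ring from the outer one.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
```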
Spectral graph theory
All the way over, on another side of mathematics, spectral clustering takes advantage of the spectrum of a graph’s Laplacian to gather information about the structure of a graph. So first, let’s define the (unnormalized) Laplacian of a graph, which is the object of study of spectral graph theory.
Definition: Given a weighted graph $G = (V, E)$ on $n$ nodes, its weight matrix $W$ is the symmetric $n \times n$ matrix whose entry $w_{ij} \ge 0$ is the weight of the edge between nodes $i$ and $j$ ($0$ if there is no such edge).

We also define its degree matrix $D$ as the diagonal matrix whose $i$-th diagonal entry is $\sum_{j} w_{ij}$, the total weight of the edges incident to node $i$.

For convenience, we can also write $d_i := \sum_{j} w_{ij}$ for the degree of node $i$, so that $D = \operatorname{diag}(d_1, \dots, d_n)$.

Definition: Given a weighted graph $G$ with weight matrix $W$ and degree matrix $D$, its (unnormalized) Laplacian is $L := D - W$.
Lemma: Let $G$ be an undirected weighted graph with Laplacian $L = D - W$. Then, for every vector $f \in \mathbb{R}^n$,

$$f^\top L f = \frac{1}{2} \sum_{i,j=1}^{n} w_{ij} (f_i - f_j)^2$$

In particular, $L$ is symmetric and positive semi-definite, and the constant vector $\mathbb{1}$ is an eigenvector with eigenvalue $0$.

Thus, if $f$ is an eigenvector of $L$ with eigenvalue $0$, then $f^\top L f = 0$, which forces $f_i = f_j$ whenever $w_{ij} > 0$: such an eigenvector must be constant on each connected component of $G$. If $G$ is connected, the eigenvalue $0$ therefore has multiplicity one.

For the disconnected case, without loss of generality, we write $L$ in block-diagonal form, ordering the nodes by connected component:

$$L = \begin{pmatrix} L_1 & & \\ & \ddots & \\ & & L_k \end{pmatrix}$$

Where each $L_i$ is the Laplacian of the $i$-th connected component.

The spectrum of $L$ is then the union of the spectra of the $L_i$ (with each eigenvector of $L_i$ padded with zeros on the other components), so the eigenvalue $0$ appears exactly once per connected component, with that component’s indicator vector as the corresponding eigenvector.
What we conclude from this, is that if we know the spectrum of a graph’s Laplacian, we also have information about exactly what its connected components are.
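A quick numerical check of this, on a toy graph with two connected components (weights chosen arbitrarily):

```python
import numpy as np

# Weight matrix of a small undirected graph with two connected
# components: {0, 1, 2} and {3, 4}.
W = np.array([
    [0, 2, 1, 0, 0],
    [2, 0, 3, 0, 0],
    [1, 3, 0, 0, 0],
    [0, 0, 0, 0, 4],
    [0, 0, 0, 4, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))   # degree matrix
L = D - W                    # unnormalized Laplacian

eigenvalues = np.linalg.eigvalsh(L)  # L is symmetric, so eigvalsh applies
# The multiplicity of the eigenvalue 0 equals the number of components: 2.
print(np.sum(np.isclose(eigenvalues, 0.0)))  # -> 2
```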
Before we move on to the next section, we must define a normalized version of the Laplacian matrix we saw above:
Definition: Given a weighted graph $G$ with weight matrix $W$, degree matrix $D$ and Laplacian $L = D - W$, its normalized Laplacian is

$$L_{rw} := D^{-1} L = I - D^{-1} W$$

with $D^{-1}$ well defined as long as no node has degree zero, which we will assume from now on.
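Continuing the toy example above, the normalized Laplacian is one extra line; as before this is just a sketch, and it assumes every node has nonzero degree.

```python
import numpy as np

W = np.array([
    [0, 2, 1],
    [2, 0, 3],
    [1, 3, 0],
], dtype=float)

d = W.sum(axis=1)                 # node degrees
L = np.diag(d) - W                # unnormalized Laplacian
L_rw = np.diag(1.0 / d) @ L       # normalized (random-walk) Laplacian: D^{-1} L

# L_rw still has eigenvalue 0 with the constant vector as eigenvector.
print(np.allclose(L_rw @ np.ones(3), np.zeros(3)))  # -> True
```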
Cuts and spectra
How does this relate to our problem of clustering? To wrap these concepts together, we need to make a small visit back to the concept of cuts in a graph.
Recall the notion of a “cut” in a weighted graph:
Definition: Given a weighted graph $G = (V, E)$, a cut is a partition of its nodes into two disjoint, nonempty sets $A$ and $\bar{A} = V \setminus A$.

We say that the value of a cut $(A, \bar{A})$ is

$$\operatorname{cut}(A, \bar{A}) := \sum_{i \in A,\, j \in \bar{A}} w_{ij}$$

That is, it is the sum of the weights of the edges that cross partitions. This can be generalized to multiple partitions, $A_1, \dots, A_k$:

$$\operatorname{cut}(A_1, \dots, A_k) := \frac{1}{2} \sum_{j=1}^{k} \operatorname{cut}(A_j, \bar{A}_j)$$

Where the factor of $\frac{1}{2}$ is there so that each crossing edge is only counted once (otherwise it would be counted once from each of the two sets it separates).
One could attempt to find the partition that minimizes this value directly, but in practice the minimum cut often just splits off a single node (or a handful of nodes) weakly connected to the rest of the graph, which is not what we want from clusters. To favour reasonably sized clusters, we instead use the normalized cut:

$$\operatorname{Ncut}(A_1, \dots, A_k) := \sum_{j=1}^{k} \frac{\operatorname{cut}(A_j, \bar{A}_j)}{\operatorname{vol}(A_j)}$$

Where $\operatorname{vol}(A_j) := \sum_{i \in A_j} d_i$ is the volume of $A_j$, the total degree of its nodes.
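These two quantities are straightforward to compute; here is a small sketch, with an arbitrary toy weight matrix and partition.

```python
import numpy as np

def cut_value(W, A):
    """Total weight of the edges between A and its complement."""
    mask = np.zeros(len(W), dtype=bool)
    mask[list(A)] = True
    return W[mask][:, ~mask].sum()

def ncut(W, partition):
    """Normalized cut of a partition (a list of disjoint index sets)."""
    degrees = W.sum(axis=1)
    total = 0.0
    for A in partition:
        vol = degrees[list(A)].sum()      # vol(A): sum of degrees in A
        total += cut_value(W, A) / vol
    return total

W = np.array([
    [0, 2, 1, 0],
    [2, 0, 3, 1],
    [1, 3, 0, 0],
    [0, 1, 0, 0],
], dtype=float)

print(cut_value(W, {0, 1, 2}))      # only edge (1, 3) crosses -> 1.0
print(ncut(W, [{0, 1, 2}, {3}]))    # 1/13 + 1/1, roughly 1.077
```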
We could now attempt to minimize $\operatorname{Ncut}$ over all partitions into $k$ sets, but doing so exactly is computationally hard, so we will rewrite the problem in matrix form and then relax it.

For a given set of clusters $A_1, \dots, A_k$, define the indicator vectors $h_1, \dots, h_k \in \mathbb{R}^n$ by

$$h_{ij} := \begin{cases} \dfrac{1}{\sqrt{\operatorname{vol}(A_j)}} & \text{if node } i \in A_j \\ 0 & \text{otherwise} \end{cases}$$

And of course, we could put these vectors as the columns of an $n \times k$ matrix $H$.
As a first step, we see that these vectors are orthonormal with respect to $D$.

Observation: $h_j^\top D h_j = \sum_{i} d_i h_{ij}^2 = \sum_{i \in A_j} \frac{d_i}{\operatorname{vol}(A_j)} = 1$, while $h_j^\top D h_l = 0$ for $j \neq l$, since no node belongs to two clusters.

And so $H^\top D H = I$.
Now comes the relation with graph spectra, via the Laplacian.
Observation: $h_j^\top L h_j = \dfrac{\operatorname{cut}(A_j, \bar{A}_j)}{\operatorname{vol}(A_j)}$.

To see what this product is, we see that

$$h_j^\top L h_j = h_j^\top D h_j - h_j^\top W h_j = \frac{1}{\operatorname{vol}(A_j)} \left( \sum_{r \in A_j} d_r - \sum_{r, s \in A_j} w_{rs} \right)$$

Thus it is the sum, over all nodes in $A_j$, of their degrees, minus the weights of all the edges between nodes of $A_j$, all divided by $\operatorname{vol}(A_j)$.

Since we are subtracting the weights of all the edges that go into $A_j$ from other nodes of $A_j$, what remains inside the parenthesis is exactly the weight of the edges that leave $A_j$, i.e. $\operatorname{cut}(A_j, \bar{A}_j)$.
Thus, we can also formulate the problem of minimizing $\operatorname{Ncut}$ as follows: since $\operatorname{Ncut}(A_1, \dots, A_k) = \sum_{j} h_j^\top L h_j = \operatorname{Tr}(H^\top L H)$, we want to minimize $\operatorname{Tr}(H^\top L H)$ over all matrices $H$ built from cluster indicator vectors as above, subject to $H^\top D H = I$.

To make things a bit more symmetric, we can write $T := D^{1/2} H$, so that the constraint becomes $T^\top T = I$ and the objective becomes $\operatorname{Tr}(T^\top D^{-1/2} L D^{-1/2} T)$.

We can then relax this by not requiring that $H$ (equivalently $T$) be built from cluster indicator vectors, keeping only the orthogonality constraint. By the Rayleigh–Ritz theorem, the relaxed problem is solved by taking as columns of $T$ the eigenvectors corresponding to the $k$ smallest eigenvalues of $D^{-1/2} L D^{-1/2}$; equivalently, the columns of $H = D^{-1/2} T$ are the first $k$ eigenvectors of $L_{rw} = D^{-1} L$.
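Before moving on to the algorithm, here is a small numerical sanity check (again with an arbitrary toy graph and partition) that the matrix $H$ built from the indicator vectors satisfies $H^\top D H = I$ and $\operatorname{Tr}(H^\top L H) = \operatorname{Ncut}$.

```python
import numpy as np

# Toy weighted graph and a partition of its 4 nodes into two clusters.
W = np.array([
    [0, 2, 1, 0],
    [2, 0, 3, 1],
    [1, 3, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
partition = [{0, 1, 2}, {3}]

d = W.sum(axis=1)
D = np.diag(d)
L = D - W

# Build H: h_{ij} = 1/sqrt(vol(A_j)) if node i is in A_j, else 0.
n, k = len(W), len(partition)
H = np.zeros((n, k))
for j, A in enumerate(partition):
    vol = d[list(A)].sum()
    H[list(A), j] = 1.0 / np.sqrt(vol)

print(np.allclose(H.T @ D @ H, np.eye(k)))   # H^T D H = I -> True
# Tr(H^T L H) should equal Ncut = sum_j cut(A_j, complement) / vol(A_j).
print(np.trace(H.T @ L @ H))                 # -> 1/13 + 1/1, roughly 1.077
```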
This leads us to the following algorithm (Shi and Malik (2000)):
Input: a weighted graph $G$ with $n$ nodes (given by its weight matrix $W$), and the number of clusters $k$.
Output: clusters $A_1, \dots, A_k$.

1. Let $L_{rw} = D^{-1}(D - W)$ be the normalized Laplacian of the graph.
2. Let $u_1, \dots, u_k$ be the first $k$ eigenvectors of $L_{rw}$ (those corresponding to the $k$ smallest eigenvalues).
3. Let $U$ be an $n \times k$ matrix with the $u_i$ as its columns.
4. Let $y_i \in \mathbb{R}^k$ be the $i$-th row of $U$.
5. Cluster the $y_1, \dots, y_n$ by some clustering algorithm, such as $k$-means, into clusters $B_1, \dots, B_k$.
6. Let $A_j = \{ i : y_i \in B_j \}$.
The last step, the clustering, is needed because we relaxed the condition that the columns of $H$ be actual cluster indicator vectors: after the relaxation, the rows $y_i$ are just points in $\mathbb{R}^k$, so we still need an ordinary clustering step to turn them back into a discrete partition of the nodes.
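Putting the whole pipeline together, here is a compact sketch in Python (this is not the multi-language implementation described below): it builds a similarity graph with a Gaussian kernel, takes the first $k$ generalized eigenvectors of $Lu = \lambda D u$ (equivalently, eigenvectors of $L_{rw}$), and runs k-means on the rows of the resulting matrix. The kernel width `sigma` and the fully connected similarity graph are arbitrary modelling choices.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(points, k, sigma=0.5):
    """Sketch of normalized spectral clustering (Shi-Malik flavour)."""
    # Build a fully connected similarity graph with a Gaussian kernel.
    sq_dists = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W

    # First k generalized eigenvectors of L u = lambda D u,
    # i.e. eigenvectors of L_rw = D^{-1} L.
    _, U = eigh(L, D, subset_by_index=[0, k - 1])

    # Each row of U is the new representation y_i of node i;
    # cluster those rows with ordinary k-means.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(U)
```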
Results and implementation
How does this perform on the dataset above? Well, intuitively we wanted to preserve this notion of “connectedness”; we saw how a graph’s spectrum (the eigenvalues and eigenvectors of the graph’s Laplacian) holds some of this information, and how the algorithm is a relaxation of the problem of minimizing the normalized cut.
This was done using Python for visualization, Haskell to construct the data, and C++ to cluster the points. The whole mess is over at this GitHub repository. The only difference is that there, instead of finding the eigenvectors