Spectral clustering

Information clustering

Information we have can often be represented as a bunch of numerical variables in some state space. For instance, we could have information on users’ age and the time they spend online, and we could represent this information as two real numbers (which happen to be nonnegative). So what we’d have is a representation of this information as a set of vectors in $\mathbb{R}^2$, one vector for each sample we have. In more complex examples, we could have a very large number of variables.

It’s often the case that we would like to group these samples into sets according to some notion of “likeness”. We’d like, for instance, two samples with similar ages and similar time spent online to be put into the same group. Based on this information, we’ll try to see if we notice patterns, or if we can use this data to, say, target campaigns better, or offer certain deals to certain groups.

As a graphical representation, we could have two variables, plotted on a plane, and several datapoints:

Unclustered datapoints

We might think there’s some interesting pattern underlying this information, so we can try dividing it into, say, 3 clusters:

Clustered datapoints

Notice that I’m not defining this notion of a “pattern” formally: What type of clustering you do, and what your clusters will look like as a result, is up to you. You may want to cluster your information in different ways for different purposes, and different cluster shapes will help reveal different patterns about your data.

Here, I’ve used a very simple method of clustering, called K-means clustering. This worked relatively well, since the underlying information was simply samples from three 2-dimensional Gaussian distributions. The resulting clusters do seem to group the data “nicely” in this case.

With that covered, let’s move on to K-means clustering.

K-means

Basics

K-means is one of the simplest ways to cluster data in $\mathbb{R}^n$. Essentially, the idea is to guess that the datapoints are distributed as globular clusters, that is, clusters which look like slightly deformed $n$-spheres, and try to cluster the datapoints according to which sphere they belong to.

The key idea here is that if your data is structured in this way, each of your clusters $C_i$ can be defined by a centroid $c_i \in \mathbb{R}^n$, and the points belonging to $C_i$ will be all the points in your dataset that are nearer to $c_i$ than to any other $c_j$. That is, this is a kind of Voronoi tessellation of your datapoints.

This will work well when your data really is structured as a bunch of points “orbiting” some centroids. Note that the centroids aren’t part of the data set, they are just what we use to cluster objects. So if we were clustering a bunch of planets with their positions, where the planets have been organized as solar systems, we’d see that planets seem to be orbiting their respective stars, even if our data did not include said stars’ positions. We’d cluster the planets according, implicitly, to which star they orbit.

The algorithm is relatively crude:

Input: $P \subset \mathbb{R}^n$, the datapoints to cluster, $k \in \mathbb{N}$, the number of clusters to make, and $\epsilon \in \mathbb{R}_{>0}$, a stopping threshold.

Output: $C_1, \ldots, C_k$, a partition of $P$.

  1. Let $c_1, \ldots, c_k \in \mathbb{R}^n$ be initial centroids, chosen in some unspecified way.
  2. For $i \in [1, k]$, let $C_i = \{p \in P : \lVert p - c_i \rVert \le \lVert p - c_j \rVert \ \forall j \in [1, k]\}$, with the ties broken in any way (each $p \in P$ goes in exactly one cluster $C_i$).
  3. For $i \in [1, k]$, let $c_i' = \frac{\sum_{p \in C_i} p}{|C_i|}$.
  4. Let $\delta = \max_{i \in [1, k]} \lVert c_i' - c_i \rVert$.
  5. If $\delta \le \epsilon$, stop and return $\{C_i : i \in [1, k]\}$. Otherwise, set $c_i = c_i'\ \forall i \in [1, k]$, and go to step 2.

Notice that, in step 3, $C_i$ could be empty (if no point was assigned to $c_i$ in step 2). If that happens, we’ve lost a cluster, and we should somehow recover it. Techniques include splitting up the largest cluster into two, assigning the centroid to some random point in the dataset, or simply starting over with a new set of initial $\{c_i\}$.
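To make the steps concrete, here is a minimal sketch in Python with NumPy. This is not the implementation used for the plots in this post: it picks random datapoints as the initial centroids (one common way to handle step 1) and simply restarts if a cluster comes out empty.

```python
import numpy as np

def kmeans(P, k, eps=1e-6, rng=None):
    """Cluster the rows of P (an m x n array) into k clusters.

    Returns (labels, centroids), where labels[i] is the index of the
    cluster P[i] was assigned to.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Step 1: pick k distinct datapoints as the initial centroids.
    c = P[rng.choice(len(P), size=k, replace=False)]
    while True:
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(P[:, None, :] - c[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if len(np.unique(labels)) < k:
            # A cluster came out empty: restart with fresh initial centroids.
            c = P[rng.choice(len(P), size=k, replace=False)]
            continue
        # Step 3: recompute each centroid as the mean of its cluster.
        new_c = np.array([P[labels == i].mean(axis=0) for i in range(k)])
        # Steps 4 and 5: stop once no centroid has moved more than eps.
        if np.linalg.norm(new_c - c, axis=1).max() <= eps:
            return labels, new_c
        c = new_c
```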

The convergence rate and termination of the algorithm are not trivial to analyze. For termination, the usual argument is that the within-cluster sum of squares (defined below) never increases from one iteration to the next, and is bounded below by 0.

What the algorithm tries to minimize is the within-cluster sum of squares. That is, it tries to minimize:

$$D = \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - \mu_i \rVert^2$$

With $\mu_i = \frac{\sum_{p \in C_i} p}{|C_i|}$, the centroid of cluster $C_i$. You can check that, locally (that is, for a given cluster), the way to minimize that sum is to set $\mu_i$ equal to the centroid of the cluster. K-means then just iterates until this is true.

Note that K-means is not guaranteed to minimize $D$. Indeed, minimizing $D$, for either unbounded $k$, $n$, or both, is NP-hard. K-means is, however, often useful in practice, relatively fast, and embarrassingly parallel.
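For completeness, the objective itself is straightforward to compute from a labelling; a small sketch, using the same (points, labels) layout as the kmeans sketch above:

```python
import numpy as np

def within_cluster_ss(P, labels):
    """D = sum over clusters of the squared distances to the cluster centroid."""
    return sum(((P[labels == i] - P[labels == i].mean(axis=0)) ** 2).sum()
               for i in np.unique(labels))
```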

K-means peculiarities

If you play around with the algorithm a bit, you’ll see that it has certain peculiarities in the clusters it returns.

Unclustered concentric circles

The points here don’t look like globular clusters. Indeed, trying to run k-means on them, with $k = 3$, yields the following clustering:

K-means clustering concentric circles

This, while a valid clustering that locally minimizes the within-cluster sum of squared distances, does not seem to reflect the data very well. K-means tried to divide the data among centroids that each greedily claim the points nearest to them, but the data wasn’t really produced by a process that generates that sort of structure.
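If you want to play with data of this shape yourself, it is easy to generate; here is a quick sketch (the data shown in the plots was generated separately, in Haskell):

```python
import numpy as np

def concentric_circles(n_per_ring=200, radii=(1.0, 3.0, 5.0), noise=0.1, rng=None):
    """Sample noisy points on concentric circles, one ring per radius."""
    rng = np.random.default_rng(0) if rng is None else rng
    rings = []
    for r in radii:
        theta = rng.uniform(0, 2 * np.pi, n_per_ring)
        ring = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
        rings.append(ring + rng.normal(scale=noise, size=ring.shape))
    return np.vstack(rings)
```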

So what can we do to try to mitigate this? Clearly this data has a lot of structure. How can we try to exploit this structure to cluster it better? Is there a way to pick better starting points for k-means to do so?

Spectral graph theory

All the way over, on another side of mathematics, spectral clustering takes advantage of the spectrum of a graph’s Laplacian to gather information about the structure of a graph. So first, let’s define the (unnormalized) Laplacian of a graph, which is the object of study of spectral graph theory.

Definition: Given a weighted graph $G = (V, E)$, with weight function $w : V \times V \to \mathbb{R}_{\ge 0}$, we define its weight matrix $W = (w_{i,j})$ as $w_{i,j} = w(v_i, v_j)$.

We also define its degree matrix $D$, as a diagonal matrix where

$$d_{i,i} = \sum_{k=1}^{n} w(v_i, v_k)$$

For convenience, we can also write $d_i$ for $d_{i,i}$, if the context is clear.

Definition: Given a weighted graph $G = (V, E)$, with weight function $w : V \times V \to \mathbb{R}_{\ge 0}$, we define the (unnormalized) Laplacian $L$ of $G$ as

$$L = D - W$$

Lemma: Let $G = (V, E)$ be a weighted graph with weight function $w : V \times V \to \mathbb{R}_{\ge 0}$. Let $C = \{C_1, \ldots, C_\kappa\}$ be the partition of $V$ into the connected components of $G$. Then the eigenspace of the eigenvalue $0$ of $L$ has dimension $\kappa$, and the indicator vectors $\mathbb{1}_i$, where $(\mathbb{1}_i)_j = 1 \iff v_j \in C_i$, act as eigenvectors of eigenvalue $0$.

First, we’ll do the connected case. Then $\kappa = 1$. I’ll leave it as an exercise to see that for any vector $v \in \mathbb{R}^n$, we have

$$v^t L v = \frac{1}{2} \sum_{i,j=1}^{n} w_{i,j} (v_i - v_j)^2$$

Thus, if $v$ is an eigenvector of eigenvalue $0$, we have that this sum is $0$. But then, since the weights are all nonnegative, whenever there is an edge of nonzero weight between vertices $i$ and $j$, we must have $v_i = v_j$. So $v$ must be constant along any path, and, in the same way, constant on any connected component. But there’s only one connected component, and it’s the entire graph, so we have that $v = \mathbb{1}$, the vector all of whose components are $1$, up to multiplication by a nonzero scalar.

For the disconnected case, without loss of generality (reordering the vertices so that each component’s vertices are contiguous), we write

$$L = \begin{bmatrix} L_1 & 0 & \cdots & 0 \\ 0 & L_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & L_\kappa \end{bmatrix}$$

where $L_i$ is the Laplacian of the connected component $C_i$.

The spectrum of $L$ at $0$ is the union of the spectra of the $L_i$ at $0$. Take any $L_i$, and we have that the eigenvectors corresponding to $0$ will be the vectors with coordinates equal to $1$ for the vertices in that connected component, and $0$ elsewhere (viewing them as vectors in $\mathbb{R}^n$). So they are the indicator vectors for that connected component. Since, by the connected case, the eigenvalue $0$ of each block has a one-dimensional eigenspace, the eigenspace of $0$ for $L$ has dimension $\kappa$, which was to be shown. ∎

What we conclude from this is that if we know the spectrum of a graph’s Laplacian, we also have information about exactly what its connected components are.

Before we move on to the next section, we must define a normalized version of the Laplacian matrix we saw above:

Definition: Given a weighted graph $G = (V, E)$, we define the normalized Laplacian $L_{rw}$ of $G$ as

$$L_{rw} = D^{-1} L = I - D^{-1} W$$

with $W$ the weight matrix of $G$, $D$ the degree matrix of $G$, and $L$ the unnormalized Laplacian of $G$.
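As a quick numerical sanity check of the lemma and of these definitions, here is a sketch that builds $W$, $D$, $L$, and $L_{rw}$ for a small graph with two connected components, and confirms that the eigenvalue $0$ shows up twice for both Laplacians:

```python
import numpy as np

# Weight matrix of a small graph with two connected components:
# {v0, v1, v2} form a triangle, and {v3, v4} are joined by a single edge.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)

d = W.sum(axis=1)                 # degrees d_i
D = np.diag(d)                    # degree matrix
L = D - W                         # unnormalized Laplacian
L_rw = np.linalg.inv(D) @ L       # normalized Laplacian D^{-1} L

print(np.sum(np.isclose(np.linalg.eigvalsh(L), 0)))    # 2: one zero eigenvalue per component
print(np.sum(np.isclose(np.linalg.eigvals(L_rw), 0)))  # 2 as well (L_rw is not symmetric,
                                                       # hence the general eigvals routine)
```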

Cuts and spectra

How does this relate to our problem of clustering? To wrap these concepts together, we need to make a small visit back to the concept of cuts in a graph.

Recall the notion of a “cut” in a weighted graph:

Definition: Given a weighted graph $G = (V, E)$, with weight matrix $W = (w_{i,j})$, a cut in $G$ is a partition of $V$ into disjoint sets $S$ and $T$.

We say that the value of a cut $(S, T)$ is

$$\text{cut}(S, T) = \sum_{v_i \in S} \sum_{v_j \in T} w_{i,j}$$

That is, it is the sum of the weights of the edges that cross the partition. This can be generalized to a partition into several sets,

$$\text{cut}(A_1, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \text{cut}(A_i, \overline{A_i})$$

Where the factor of $\frac{1}{2}$ is there to avoid counting each edge weight twice. What this is counting, then, is, for the sets $A_i$ we’ve given it, the sum of the weights of the edges that leave each set.

One could attempt to find the $A_i$ that minimize this, in order to obtain some sort of “good partitioning of the data”. However, it will often be the case that, in order to minimize this sum, the clusters will tend to be disproportionate in size (a lot of small clusters and one big cluster, for instance). Specifically, when $k = 2$, the solution will often be a partition that splits off a single vertex. Intuitively, this is no good - we would like clusters to be “balanced” in some sense. It is to this end that we introduce the normalized cut, Ncut, by Shi and Malik (2000).

$$\text{Ncut}(A_1, \ldots, A_k) = \frac{1}{2} \sum_{i=1}^{k} \frac{\text{cut}(A_i, \overline{A_i})}{\text{vol}(A_i)}$$

Where $\text{vol}(A_i) = \sum_{v_j \in A_i} d_j$. That is, we divide by the sum of the weights of the edges incident to nodes in the cluster. This means that, to minimize Ncut, the volumes of the clusters should not be very tiny. Formally, the minimum of the function $\sum_{i=1}^{k} \frac{1}{\text{vol}(A_i)}$, for a fixed total volume, is achieved when $\text{vol}(A_i) = \text{vol}(A_j)\ \forall i, j$. This is then a measure of how well balanced the clusters are, in terms of intra-cluster weight.
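Both quantities are easy to compute directly from the weight matrix; here is a small sketch, where a partition is given as a list of index arrays, one per $A_i$:

```python
import numpy as np

def cut_value(W, A, B):
    """Sum of the weights of the edges with one endpoint in A and one in B."""
    return W[np.ix_(A, B)].sum()

def ncut(W, parts):
    """Normalized cut of a partition, given as a list of index arrays."""
    d = W.sum(axis=1)
    everything = np.arange(W.shape[0])
    return 0.5 * sum(cut_value(W, A, np.setdiff1d(everything, A)) / d[A].sum()
                     for A in parts)
```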

We could now attempt to minimize Ncut directly, and so our problem becomes: minimize $\text{Ncut}(A_1, \ldots, A_k)$, subject to the $A_i$ being disjoint and $\bigcup_{i=1}^{k} A_i = V$. While this is indeed what we want, we can reformulate this into a statement more amenable to optimization methods, using indicator vectors.

For a given set of $n$ vertices $v_1, \ldots, v_n$, and a partition of $V$ into $A_1, \ldots, A_k$, we define indicator vectors $f_1, \ldots, f_k \in \mathbb{R}^n$, as:

$$f_{i,j} = \begin{cases} \frac{1}{\sqrt{\text{vol}(A_i)}} & \text{if } v_j \in A_i \\ 0 & \text{otherwise} \end{cases}$$

And of course, we could put these vectors as the columns of an $n \times k$ matrix, let’s call it $F$.

As a first step, we see that $F$ is almost orthogonal:

Observation: $F^t D F = I$.

Recall that $D$ is the diagonal matrix such that $d_{i,i} = d_i$, the sum of the weights of the edges incident to $v_i$. So let us consider any element $(i, j)$ of $F^t D F$, which can be written as $\langle f_i, (DF)_j \rangle$. Observe that $(DF)_j = (d_1 f_{j,1}, d_2 f_{j,2}, \ldots, d_n f_{j,n})^t$. Let’s call this $\tilde{f}_j$. We now take a look at $\langle f_i, \tilde{f}_j \rangle$. If $i \neq j$, then all of the products will involve a zero, since otherwise there would be a $k$ such that both $f_{i,k}$ and $f_{j,k}$ are not zero, but this cannot happen, since that would imply $v_k$ is in both $A_i$ and $A_j$. If $i = j$, the inner product becomes

$$\sum_{k=1}^{n} f_{i,k} \tilde{f}_{i,k} = \sum_{v_j \in A_i} \frac{1}{\sqrt{\text{vol}(A_i)}} \, d_j \, \frac{1}{\sqrt{\text{vol}(A_i)}} = \frac{1}{\text{vol}(A_i)} \sum_{v_j \in A_i} d_j = \frac{\text{vol}(A_i)}{\text{vol}(A_i)} = 1$$

And so $(F^t D F)_{i,j} = \delta_{i,j}$, or equivalently, $F^t D F = I$.

Now comes the relation with graph spectra, via the Laplacian.

Observation: $f_i^t L f_i = \frac{\text{cut}(A_i, \overline{A_i})}{\text{vol}(A_i)}$

Recall that $L = D - W$. Thus, due to the previous observation, $f_i^t L f_i = f_i^t D f_i - f_i^t W f_i = 1 - f_i^t W f_i$.

To see what this product is, we see that

$$(W f_i)_j = \sum_{v_k \in A_i} \frac{w_{j,k}}{\sqrt{\text{vol}(A_i)}} = \frac{\sum_{v_k \in A_i} w_{j,k}}{\sqrt{\text{vol}(A_i)}}$$

Thus it is, for each node $v_j$, the sum of the weights of the edges going from $v_j$ into $A_i$ (divided by $\sqrt{\text{vol}(A_i)}$, of course). Let us then define $i_{v_j}$ as $\sum_{v_k \in A_i} w_{j,k}$. Recall that $\text{vol}(A_i)$ was the sum of all the weights of all the edges incident to nodes in $A_i$. Hence,

$$f_i^t L f_i = 1 - \sum_{v_j \in A_i} f_{i,j} \frac{i_{v_j}}{\sqrt{\text{vol}(A_i)}} = 1 - \frac{\sum_{v_j \in A_i} i_{v_j}}{\text{vol}(A_i)} = \frac{\text{vol}(A_i) - \sum_{v_j \in A_i} i_{v_j}}{\text{vol}(A_i)} = \frac{\text{cut}(A_i, \overline{A_i})}{\text{vol}(A_i)}$$

Since we are subtracting the weight of the edges that go from $A_i$ into itself from the sum of the weights of all edges incident to nodes in $A_i$ (both counted once per endpoint in $A_i$), we are left with the sum of the weights of the edges that have one endpoint in $A_i$ and the other in $\overline{A_i}$. This is the definition of $\text{cut}(A_i, \overline{A_i})$.
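Before turning these observations into an optimization problem, here is a quick numerical check of both of them, on a random symmetric weight matrix and an arbitrary partition:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random symmetric weight matrix on 8 vertices, and an arbitrary 3-part partition.
n = 8
W = rng.uniform(0, 1, (n, n))
W = np.triu(W, 1) + np.triu(W, 1).T
parts = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5, 6, 7])]

d = W.sum(axis=1)
D = np.diag(d)
L = D - W

# Build F, whose columns are the scaled indicator vectors f_1, ..., f_k.
F = np.zeros((n, len(parts)))
for i, A in enumerate(parts):
    F[A, i] = 1.0 / np.sqrt(d[A].sum())              # 1 / sqrt(vol(A_i))

print(np.allclose(F.T @ D @ F, np.eye(len(parts))))  # True: F^t D F = I

for i, A in enumerate(parts):
    comp = np.setdiff1d(np.arange(n), A)
    lhs = F[:, i] @ L @ F[:, i]
    rhs = W[np.ix_(A, comp)].sum() / d[A].sum()      # cut(A_i, complement) / vol(A_i)
    print(np.isclose(lhs, rhs))                      # True for each i
```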

Thus, we can also formulate the problem of minimizing Ncut as minimizing $\text{Tr}(F^t L F)$, subject to $F^t D F = I$, with $F$ as described above.

To make things a bit more symmetric, we can write $T = D^{\frac{1}{2}} F$, and attempt to minimize $\text{Tr}(T^t D^{-\frac{1}{2}} L D^{-\frac{1}{2}} T)$, subject to $T^t T = I$, with $T \in \mathbb{R}^{n \times k}$, and $T$ and $F$ as described above.

We can then relax this by not requiring that $T$ be defined as above, but instead letting it be any real $n \times k$ matrix such that $T^t T = I$. This problem is now handed over to Courant-Fischer-Weyl, which tells us that the solution is the matrix $T$ made from writing the first $k$ eigenvectors of $D^{-\frac{1}{2}} L D^{-\frac{1}{2}}$ (those corresponding to the $k$ smallest eigenvalues) as its columns. Since $u$ is an eigenvector of $D^{-\frac{1}{2}} L D^{-\frac{1}{2}}$ exactly when $D^{-\frac{1}{2}} u$ is an eigenvector of $L_{rw}$ with the same eigenvalue, this also means that the $F$ we need is the one with the first $k$ eigenvectors of $L_{rw}$ as its columns.

This leads us to the following algorithm (Shi and Malik (2000)):

Input: $W$, a weight matrix for some weighted graph $G$, and $k$, a number of clusters to construct.

Output: $A_1, \ldots, A_k$, clusters such that $\bigcup_{i=1}^{k} A_i = V$.

  1. Let $L_{rw}$ be the normalized Laplacian of $G$.
  2. Let $v_1, \ldots, v_k$ be the first $k$ eigenvectors of $L_{rw}$.
  3. Let $F$ be an $n \times k$ matrix with the $\{v_i\}$ as its columns.
  4. Let $r_i$ be the $i$th row of $F$.
  5. Cluster the $\{r_i\}$ by some clustering algorithm, such as k-means, into clusters $C_1, \ldots, C_k$.
  6. Let $A_i = \{v_j \mid r_j \in C_i\}$.

The last step, the clustering, is needed because we relaxed the condition that $F$ was really made up of indicator vectors. We can’t ask the relaxed solution to literally consist of indicator vectors, but we can cluster its rows by similarity, hopefully recovering the information that true indicator vectors would have given us.
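Here is a minimal sketch of the algorithm above in Python. It is not the C++ implementation linked below: it gets the eigenvectors of $L_{rw}$ by solving the generalized problem $Lv = \lambda Dv$ (the same trick mentioned at the end of the post), uses SciPy’s k-means for step 5, and assumes the graph has no isolated vertices, so that $D$ is invertible.

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def spectral_clustering(W, k):
    """Cluster the vertices of the graph with weight matrix W into k groups."""
    d = W.sum(axis=1)
    D = np.diag(d)
    L = D - W                      # unnormalized Laplacian
    # Steps 1-2: the eigenvectors of L_rw are the solutions of L v = lambda D v;
    # eigh returns eigenvalues in ascending order, so the first k columns are
    # the eigenvectors we want.
    _, vecs = eigh(L, D)
    F = vecs[:, :k]                # step 3: n x k matrix of eigenvectors
    # Steps 4-5: cluster the rows r_i of F with k-means.
    _, labels = kmeans2(F, k, minit='++')
    return labels                  # step 6: labels[j] = i  means  v_j is in A_i
```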

Results and implementation

How does this perform on the dataset above? Well, intuitively we wanted to preserve this notion of “connectedness”, and we saw how a graph’s spectrum (the eigenvalues and eigenvectors of the graph’s Laplacian) holds some of this information, and how this is a relaxation of minimizing Ncut, the problem of finding decently sized clusters, with a decently large internal edge weight sum, and a small external edge weight sum. I’ll let you be the judge of how this performed on the data shown above:

Spectral clustered datapoints

This was done using Python for visualization, Haskell to construct the data, and C++ to cluster the points. The whole mess is over at this GitHub repository. The only difference is that there, instead of finding the eigenvectors v of Lrw, I solve the system Lv=λDv.