K-Means Clustering and How it Relates to the Gaussian Mixture Model

Greetings,

I would like to write a post on Gaussian Mixture models. The post will be a tutorial with a, hopefully, intuitive explanation of when and how to use Gaussian Mixture (GM) models. However, before I begin with GM, I really ought to start with K-Means (KM)…

K-Means Clustering

K-means clustering is the most fundamental, ‘vanilla’ clustering algorithm. The name says it all: ‘k’ is the number of clusters, and the ‘means’ are the cluster centers we estimate. So, for a given data set, we come up with k estimated means that form the cluster centers. These k centers are learned iteratively, where the optimization criterion is to minimize the total squared Euclidean distance between the data points and their assigned centers. Once we have the k cluster centers, we can assign a new observation to the cluster whose mean/center is closest to it.
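To make the iteration concrete, here is a minimal sketch of the two alternating steps (assign each point to its nearest center, then move each center to the mean of its points), written in plain numpy. It is only an illustration, not production code; for instance, it does not handle a cluster going empty:

    import numpy as np

    def k_means(X, k, n_iters=100, seed=0):
        # start from k randomly chosen data points as the initial centers
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iters):
            # assignment step: label each point with its nearest center (Euclidean distance)
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # update step: move each center to the mean of the points assigned to it
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centers, centers):   # stop once the centers no longer move
                break
            centers = new_centers
        return centers, labels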

KM can be used in supervised and unsupervised learning problems. When class assignments are present in the training data, a validation set can be created to tune the KM parameters. For example, the choice of k, the convergence criteria, and the initial assignment of cluster centroids can all be tuned with a validation set. In an unsupervised setting, KM may need to be run several times from different initializations to find a good solution, since KM is prone to getting stuck in local optima.
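As a sketch of what that tuning looks like in practice (assuming scikit-learn’s KMeans, which is what I use below), these are the main knobs: n_clusters is k, init sets how the initial centroids are chosen, tol and max_iter control convergence, and n_init re-runs the whole algorithm from several different starts and keeps the best run:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)   # placeholder 2D data

    # n_init=10 restarts KMeans from 10 different initial centroid sets and keeps
    # the run with the lowest inertia_ (total within-cluster sum of squared distances),
    # which helps with the local-optimum problem mentioned above
    km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=300, tol=1e-4).fit(X)
    print(km.inertia_)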

There are many positive aspects of KM. But one of the not-so-good points of this clustering model is its ‘hard’ assignment to clusters. In particular, KM partitions the data space into a Voronoi tessellation: any point that falls into a given cell is assigned to that cluster alone. There is no ‘maybe’ with KM…

To illustrate the KM algorithm, I have written a short example in Python and posted it on my GitHub. In the ‘k_means.py’ file you can see that I generate two sets of 2D random data using numpy’s multivariate_normal function. I then discover the two clusters with KMeans from sklearn.cluster. The data and the discovered cluster centers are shown below:

Data with two centroids.
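If you would rather not pull the repository, here is a sketch along the same lines as k_means.py (the exact means, covariances and sample sizes I used may differ):

    import numpy as np
    from sklearn.cluster import KMeans

    # two 2D Gaussian clusters with a small positive covariance between the dimensions
    rng = np.random.default_rng(1)
    cov = [[1.0, 0.3], [0.3, 1.0]]
    left = rng.multivariate_normal([0.0, 0.0], cov, size=200)
    right = rng.multivariate_normal([4.0, 0.0], cov, size=200)
    X = np.vstack([left, right])

    km = KMeans(n_clusters=2, n_init=10).fit(X)
    print(km.cluster_centers_)   # the two discovered centroids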

The KMeans.predict method is used to generate predictions. Below I show, side by side, the actual clusters and the predicted ones. Any point that is closer to the left centroid is assigned to the left cluster, and the same applies to the right one.

Actual vs. Predicted
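Continuing with the km model fitted in the sketch above, predict simply returns the index of the nearest centroid for each point, which is exactly the hard assignment discussed earlier:

    # continuing from the sketch above: predict() returns one hard label per point,
    # the index of the nearest centroid, with no notion of "maybe"
    new_points = np.array([[2.0, 0.0],    # roughly halfway between the two clusters
                           [4.5, 0.5]])
    print(km.predict(new_points))         # e.g. [0 1] -- cluster indices, not probabilities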

In reality, where the two clusters meet we would like some probabilistic assignments. That is, we would like to be able to say that there is an x% chance that the point belongs to the left cluster, but also a y% chance that it belongs to the right cluster. KM cannot do that. Also, in terms of cluster spread and direction, if you think of clusters as spheres, KM treats every cluster as having the same spherical shape. Our blue and red clusters have a small positive covariance and KM cannot detect it. But the Gaussian Mixture model can. Let’s look at GM next.
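As a small preview before we do, here is a sketch of the kind of output we are after, assuming scikit-learn’s GaussianMixture (how it actually works is the subject of the next part):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    cov = [[1.0, 0.3], [0.3, 1.0]]
    X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, size=200),
                   rng.multivariate_normal([4.0, 0.0], cov, size=200)])

    gm = GaussianMixture(n_components=2).fit(X)
    print(gm.predict_proba([[2.0, 0.0]]))  # soft assignment, e.g. roughly [[0.6, 0.4]]
    print(gm.covariances_)                 # a full covariance matrix per component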