# K-Means Clustering and How it Relates to the Gaussian Mixture Model

Greetings,

I would like to write a post on the Gaussian Mixture models. The post would be a tutorial with a, hopefully, intuitive explanation of when and how to use Gaussian Mixture (GM) models. However, before I begin with GM, I really ought to start with KM…

### K-Means Clustering

K-means clustering is the most fundamental ‘vanilla’ type clustering algorithm. The name says it all: the ‘k’ is the k-number of clusters, and the ‘mean’ is our target assignment. So, for a given data set, we come up with k estimations of means that form the cluster centers. The learning of these k centers is iterative, where the optimization criteria is to minimize the total Euclidean distance between the data points and k centers. Once we have k cluster centers, we can assign new observations to a cluster which mean/center is closest to the unobserved point.

KM can be used in supervised and unsupervised learning problems. When class assignments are present in the training data, the validation set can be created to adjust KM parameters. For example, the choice for k, the convergence criteria, the initial assignment of cluster centroids – all of these can be tuned with a validation set. In an unsupervised setting, the KM may need to be repeated several times to help finding the most optimal solution, since KM is very prone to the local optimum.

There are many positive aspects of KM. But, one of the not-so-good points of this clustering model is the ‘hard assignments’ to clusters. In particular, KM partitions the data space into a Voronoi tessellation. Any point that falls into some partition is assigned to that partition alone. There is no ‘maybe’ with KM…

To illustrate the KM algorithm, I have written a short example in python and posted it on my github. In the ‘k_means.py’ file you can see that I generate two sets of 2D random data using numpy multivariate_normal module. I proceed to discover the two clusters with KMeans from sklearn.cluster. The data and the discovered cluster centers are: