There are many flavors of clustering algorithms available to data scientists today. To name just a few: k-means, KNN, LDA, parametric mixture models (e.g. Gaussian Mixture), hidden Markov models for time series, and SOMs. Each of these models has been thoroughly studied, and several variations have been proposed and adopted. I like clustering tasks because looking for and finding patterns in data is one of my favorite challenges.
The Gaussian Mixture (GM) model is an unsupervised clustering model that is as easy to grasp as k-means but more flexible. Fundamentally, GM is a parametric model (i.e. we assume a specific distribution for the data) that uses the Expectation Maximization (EM) algorithm to learn the parameters of the distribution. The flexibility mentioned earlier comes in the form of a covariance matrix, which gives the freedom to specify the shape and orientation of each cluster. Although GM is not a mixed-membership clustering model, it does allow for overlap in which cluster a data point belongs to. This overlap is the probabilistic assignment that GM converges onto. So, how does GM generate these probabilistic assignments? GM uses EM. Take a look at this tutorial on EM I have written for KDnuggets; it has enough details and formulas, so I won’t repeat them here. In a nutshell, we begin with these ingredients and EM does the rest: the number of components/clusters we want to map the data to, the choice of covariance matrix (e.g. diagonal or full), some idea for the initial assignment of the means and covariances as well as the cluster responsibilities, and a stopping criterion. If the term ‘cluster responsibilities’ does not make sense to you, think of these as the proportions of data that should be assigned to each cluster. A minimal sketch of how these ingredients map onto code follows below.
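Here is a minimal sketch (not my project code) of how those ingredients line up with scikit-learn's GaussianMixture: the number of components, the covariance type, a stopping criterion, and the soft probabilistic assignments that EM converges to. The synthetic data and all parameter values are illustrative assumptions.

```python
# A minimal sketch, assuming synthetic 2-D data; parameter values are illustrative only.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data with three loosely shaped groups
X, _ = make_blobs(n_samples=600, centers=3,
                  cluster_std=[1.0, 2.0, 0.5], random_state=42)

gm = GaussianMixture(
    n_components=3,          # number of components/clusters to map the data to
    covariance_type="full",  # each cluster gets its own shape and orientation
    tol=1e-4,                # stopping criterion for EM
    max_iter=200,
    random_state=42,
)
gm.fit(X)

# Mixing weights: the proportion of data EM assigns to each cluster
print("cluster proportions:", gm.weights_)
print("cluster means:\n", gm.means_)

# Soft assignments: each row sums to 1 and gives the probability that the
# point belongs to each component -- the overlap mentioned above.
print(gm.predict_proba(X[:5]))
```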
GM can be used for anomaly detection, and there is an abundance of academic work to support this. If the non-anomalous data is Gaussian with some mean and variance, the points that receive low probability under the fitted model may be flagged as anomalous. At its heart, anomaly detection is a different beast from classification, most importantly because the cost function is heavily weighted towards recall rather than precision. Let’s see how GM performs on the shuttle dataset provided for outlier detection studies by Ludwig Maximilian University of Munich. In this dataset, class “2” is the outlier class. Class “1” is the dominant class, and a good classifier should achieve an accuracy above 95% given the class imbalance. However, since we are testing GM on correctly predicting all instances of class “2” only, let’s set aside the false positives for now and concentrate on pure recall. I used the scikit-learn GMM module and reduced the dataset to just the 2nd and the 5th features, which have the greatest explanatory power on this dataset. I also trained six separate models, each pairing one of the other classes against class “2”. Note that scikit-learn GMM does not require initial assignments for the means and covariances, as it performs this initialization randomly. Also note that because of this random initialization you are not guaranteed the same outcome on each model run. You can see the code for this work on my GitHub space.
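To make the flagging idea concrete, here is a hedged sketch of the low-probability thresholding described above. It is not the code from my GitHub space: the stand-in data, the single-component fit, and the 1st-percentile threshold are all assumptions for illustration, and the real shuttle dataset would need to be loaded and restricted to the chosen features separately.

```python
# A sketch of flagging low-probability points as anomalies; data and threshold are illustrative.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # stand-in for the dominant class
anomalies = rng.normal(loc=4.0, scale=0.5, size=(20, 2))  # stand-in for the outlier class

# Fit the mixture on the non-anomalous data only
gm = GaussianMixture(n_components=1, covariance_type="full", random_state=0)
gm.fit(normal)

# Log-likelihood of each point under the fitted model
scores_normal = gm.score_samples(normal)
scores_anom = gm.score_samples(anomalies)

# Flag anything below a low percentile of the training scores; the 1st
# percentile is an arbitrary illustrative cutoff, not a tuned value.
threshold = np.percentile(scores_normal, 1)
flagged = scores_anom < threshold
print(f"recall on the synthetic anomaly class: {flagged.mean():.2f}")
```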
So, how did GM do? Remarkably well. Below I am showing the results for each trained model. I also plot the data spread for class 1 and class 2 to show that the anomaly class is not a pure outlier in this dataset. Since in an anomaly detection task false negatives are more costly than false positives, we can see that GM performed well, making a single misclassification in the model trained on classes 7 and 2.
