Gaussian Mixture Model with Application to Anomaly Detection

There are many flavors of clustering algorithms available to data scientists today: to name just a few, k-means, KNN, LDA, parametric mixture models (e.g. Gaussian Mixture), hidden Markov models for time series, and SOMs. Each of these has been thoroughly studied, and several variations have been proposed and adopted. I like clustering tasks, since looking for and finding patterns in data is one of my favorite challenges.

The Gaussian Mixture (GM) model is usually an unsupervised clustering model that is as easy to grasp as k-means but offers more flexibility. Fundamentally, GM is a parametric model (i.e. we assume a specific distribution for the data) that uses the Expectation Maximization (EM) algorithm to learn the parameters of that distribution. The flexibility mentioned earlier comes in the form of a covariance matrix, which gives us the freedom to specify the shape and direction of each cluster. Although GM is not a mixed-membership clustering model, it does allow for overlap in cluster membership; this overlap is the probabilistic assignment that GM converges onto. So, how does GM generate the probabilistic assignments? GM uses EM. Take a look at this tutorial on EM I have written for KDnuggets; it has enough details and formulas, so I won't repeat them here. In a nutshell, we begin with these ingredients and EM does the rest: the number of components/clusters we want to map the data to, the choice of covariance matrix (e.g. diagonal or full), some idea for the initial assignment of means and covariances, the cluster responsibilities, and a stopping criterion. If the term 'cluster responsibilities' does not make sense to you, think of these as the proportions of data that should be assigned to each cluster.
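
To make those ingredients concrete, here is a minimal sketch using synthetic 2D data (the means, covariances and sample sizes are made up for illustration) and the current scikit-learn API, where the model is called sklearn.mixture.GaussianMixture:

import numpy as np
from sklearn.mixture import GaussianMixture

# Two synthetic 2D clusters (illustrative values only)
np.random.seed(0)
X = np.vstack([
    np.random.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], 300),
    np.random.multivariate_normal([4, 4], [[1.0, -0.2], [-0.2, 0.5]], 300),
])

gm = GaussianMixture(
    n_components=2,          # number of components/clusters to map the data to
    covariance_type='full',  # 'full', 'diag', 'tied' or 'spherical'
    tol=1e-3,                # stopping criterion for EM
    max_iter=100,
    random_state=1,
)
gm.fit(X)                       # EM does the rest
resp = gm.predict_proba(X)      # cluster responsibilities (soft assignments)
print(gm.means_, gm.weights_)   # learned means and mixing proportions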

GM can be used for anomaly detection, and there is an abundance of academic work to support this. If the non-anomalous data is Gaussian with some mean and variance, the points that receive low probability under the fitted distribution may be flagged as anomalous. At its heart, anomaly detection is a different beast from classification, most importantly because the cost function is heavily weighted towards recall rather than precision. Let's see how GM performs on the shuttle dataset provided for outlier detection studies by Ludwig Maximilian University of Munich. In this dataset, class "2" is the outlier class. Class "1" is the dominant class, and a good classifier should achieve an accuracy above 95% given the class imbalance. However, since we are only testing whether GM correctly predicts all instances of class "2", let's gloss over the false positives for now and concentrate on pure recall. I used the scikit-learn GMM module and reduced the dataset to just the 2nd and the 5th feature, which on this dataset have the greatest explanatory power. I also trained six separate models, each pairing one of the other classes against class "2". Note that scikit-learn GMM does not require initial assignments for the means and covariances, as it performs this initialization randomly; because of that randomness, you are not guaranteed the same outcome on each model run. You can see the code for this work on my github space.
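
The shuttle-dataset code itself lives on github; as a sketch of the general idea (with synthetic data standing in for the real thing), you can fit the mixture on the non-anomalous points and flag anything whose likelihood under the fitted model falls below a chosen threshold:

import numpy as np
from sklearn.mixture import GaussianMixture

np.random.seed(2)
X_normal = np.random.multivariate_normal([0, 0], [[1, 0.2], [0.2, 1]], 1000)
X_new = np.random.uniform(-8, 8, size=(20, 2))   # points to screen for anomalies

gm = GaussianMixture(n_components=1, covariance_type='full', random_state=1).fit(X_normal)

# score_samples returns per-point log-likelihoods under the fitted model;
# anything below, say, the 1st percentile of the training scores is flagged.
threshold = np.percentile(gm.score_samples(X_normal), 1)
is_anomaly = gm.score_samples(X_new) < threshold
print("Flagged %d of %d screened points" % (is_anomaly.sum(), len(X_new)))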

So, how did GM do? Remarkably well. Below I show the results for each trained model. I also plot the data spread for classes 1 and 2 to show that the anomaly class is not purely an outlier in this dataset. Since in an anomaly detection task the cost of a false negative is higher than the cost of a false positive, we can say that GM performed well, making a single misclassification in the model trained on classes 7 and 2.

Data spread for class 1 and 2 (top) and model results (bottom).

High Performance Python

Greetings All,

I have recently come across a nice and short YouTube talk given by Ben Lerner on how to improve the performance of Python code. Here is a quick rundown of the techniques he mentions that I found particularly useful:

Avoid accidentally quadratic code – take advantage of the built-in data types that give you an O(n) loop for free:

The running complexity of the code below is O(n^2); the quadratic factor comes from searching the entire second list for each item in the first list. Nice.
A typical question from a smug interviewer will be: can we do better than that? "Yes, we can", you can reply, smiling even "smug-lier" – we will just use Python sets.

import numpy as np
import time

np.random.seed(1)

list1 = range(60000)
random = np.random.randint(0, 59999, 60000)

list2 = random.tolist()  # WORST PERFORMANCE: membership test on a list is O(n)
intersect = []
start_time = time.time()

for l1 in list1:
    if l1 in list2:
        intersect.append(l1)

print("List one has %s elements in common with list two" % len(intersect))
print("--- %s seconds ---" % (time.time() - start_time))

list2 = random  # BETTER PERFORMANCE: NumPy array membership scans in C
intersect = []
start_time = time.time()

for l1 in list1:
    if l1 in list2:
        intersect.append(l1)

print("List one has %s elements in common with list two" % len(intersect))
print("--- %s seconds ---" % (time.time() - start_time))

list2 = set(random)  # BEST PERFORMANCE: set membership is O(1) on average
intersect = []
start_time = time.time()

for l1 in list1:
    if l1 in list2:
        intersect.append(l1)

print("List one has %s elements in common with list two" % len(intersect))
print("--- %s seconds ---" % (time.time() - start_time))


List in a set.

Note that the only thing we have changed is putting list2 in a set. This example is trivial and could be done with two sets and intersection(), as shown below. But sometimes you need to perform a more complex task than finding an intersection, and keeping the "set" trick in mind will be helpful. Just remember that putting a list in a set removes duplicates.
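
For completeness, here is what the two-sets version looks like (a sketch; the variable names mirror the snippet above):

import numpy as np

np.random.seed(1)
list1 = range(60000)
list2 = np.random.randint(0, 59999, 60000).tolist()

# Same result as the loops above, computed directly with set intersection.
# Note: order is not preserved and duplicates are dropped.
intersect = set(list1) & set(list2)
print("List one has %s elements in common with list two" % len(intersect))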

Function calls and global variables are expensive

In general, the advice is to avoid unnecessary function calls and to avoid global variables. Every function call carries overhead (a new stack frame has to be set up and torn down), and global variables are looked up more slowly than locals because they are resolved through the module namespace at run time.
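
A rough, unscientific sketch of the global-versus-local lookup cost (the function and variable names are made up for illustration):

import time

counter = 0  # global

def bump_global(n):
    global counter
    for _ in range(n):
        counter += 1   # every access is resolved through the module namespace

def bump_local(n):
    local_counter = 0
    for _ in range(n):
        local_counter += 1   # local names use the faster per-frame lookup
    return local_counter

for fn in (bump_global, bump_local):
    start = time.time()
    fn(5000000)
    print("%s took %s seconds" % (fn.__name__, time.time() - start))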

Use map/filter/reduce instead of for-loops

map applies a function to every item of an iterable (i.e. map(function, iterable)).
filter is similar: it returns the items of an iterable for which the function evaluates to true.
reduce cumulatively applies a function over an iterable to reduce it to a single output value.
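
A quick sketch of all three (in Python 3, reduce lives in functools):

from functools import reduce

nums = [1, 2, 3, 4, 5, 6]

squares = list(map(lambda n: n * n, nums))        # [1, 4, 9, 16, 25, 36]
evens = list(filter(lambda n: n % 2 == 0, nums))  # [2, 4, 6]
total = reduce(lambda acc, n: acc + n, nums, 0)   # 21
print(squares, evens, total)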

Use cProfile to analyse slow code

cProfile can be used to generate code statistics to pin down the loop or the function call that slows down your code. It is very easy to use.
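
The quickest way in is either the command line (python -m cProfile -s cumtime your_script.py) or a one-liner around the function you suspect; a sketch:

import cProfile

def suspect_function():
    # stand-in for the code you actually want to profile
    return sum(i * i for i in range(1000000))

# Prints per-call counts and timings, sorted by cumulative time.
cProfile.run('suspect_function()', sort='cumulative')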

In-line C with Cython

The main reason NumPy is fast is that about 60% of its code is C. Inlining C with Python is not hard with Cython, which adds a small number of keywords to the Python language to give access to the static data types of C or C++. Finally, Cython can also be used to simply compile Python into C.

Hope this gives you enough useful pointers. Happy flying!

Two Ways to Perform Linear Regression in Python with Numpy and Scikit-Learn

Greetings,

This is a short post to share two ways (there are many more) to perform pain-free linear regression in Python. I will use numpy.poly1d and sklearn.linear_model.

Using numpy.polyfit we can fit data to a polynomial of any specified degree by minimizing the least squares error (LSE). LSE is the most common cost function for fitting linear models; it is calculated as the sum of squared differences between the predicted and the actual values. So, let's say we have some target variable y that is generated according to some function of x. In real-life situations we almost never deal with a perfectly predictable function that generates the target variable; there is usually a lot of noise. However, let's assume for simplicity that our function is perfectly predictable. Our task is to come up with a model (i.e. learn the polynomial coefficients and select the best polynomial degree) to explain the target variable y. The function for the data is:

f(x)=-32x^4-x^2+6x-\sqrt{x}
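
For reference, the quantity that the fit minimizes over the n training points, with \hat{y}_i the predicted and y_i the actual value, is:

LSE=\sum_{i=1}^{n}(\hat{y}_i-y_i)^2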

I will start by generating the train and the test data from 1,000 points:

 

import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

# implement linear regression with numpy and sklearn

# 1. Generate some random data
np.random.seed(seed=0)
size = 1000
x_data = np.random.random(size)
y_data = 6*x_data - 32*x_data**4 - x_data**2 - np.sqrt(x_data)  # -32x^4 - x^2 + 6x - sqrt(x)

# 2. Split equally between test and train (integer division keeps the index an int)
x_train = x_data[:size//2]
x_test = x_data[size//2:]

y_train = y_data[:size//2]
y_test = y_data[size//2:]

 

We will test polynomials with degrees 1, 2 and 3. To use numpy.polyfit, we only need two lines of code. To use sklearn, we need to transform x into polynomial features of the degree we are testing; for that, we use sklearn.preprocessing.PolynomialFeatures. So, we proceed to generate the three models and calculate the final LSE as follows:

 

plot_config = ['bs', 'b*', 'g^']
plt.plot(x_test, y_test, 'ro', label="Actual")

# 3. Set the polynomial degree to be fitted between 1 and 3
top_degree = 3
d_degree = np.arange(1, top_degree+1)
for degree in d_degree:

    # 4. Fit a numpy polynomial of the given degree and measure its error
    poly_fit = np.poly1d(np.polyfit(x_train, y_train, degree))
    # print(poly_fit.coeffs)

    predict = poly_fit(x_test)
    error = np.mean((predict - y_test)**2)

    print("Error with poly1d at degree %s is %s" % (degree, error))

    # 5. Create and fit a polynomial with sklearn LinearRegression
    model = Pipeline([('poly', PolynomialFeatures(degree=degree)),
                      ('linear', linear_model.LinearRegression())])
    model = model.fit(x_train[:, np.newaxis], y_train[:, np.newaxis])

    predict_sk = model.predict(x_test[:, np.newaxis])
    error_sk = np.mean((predict_sk.flatten() - y_test)**2)

    print("Error with sklearn LinearRegression at degree %s is %s" % (degree, error_sk))

    plt.plot(x_test, predict, plot_config[degree-1], label="Fit "+str(degree)+" degree poly")
    plt.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=3,
               ncol=2, mode="expand", borderaxespad=0.)

plt.show()

 

The final results (plotted from numpy.polyfit only) show a very good fit at degree 3; we could have produced an almost perfect fit at degree 4. The two methods (numpy and sklearn) produce identical accuracy. Under the hood, both sklearn and numpy.polyfit use linalg.lstsq to solve for the coefficients.
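
If you want to see that equivalence directly, a small sketch (reusing x_train and y_train from the code above) solves the degree-3 problem with np.linalg.lstsq on a Vandermonde matrix and should reproduce the polyfit coefficients up to numerical precision:

# Columns of np.vander(x, 4) are x^3, x^2, x^1, x^0.
A = np.vander(x_train, 4)
coeffs, residuals, rank, sv = np.linalg.lstsq(A, y_train, rcond=None)

print(coeffs)                           # highest-degree coefficient first
print(np.polyfit(x_train, y_train, 3))  # same values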

Linear Regression with numpy.
Compare LSE from numpy and sklearn.

K-Means Clustering and How it Relates to the Gaussian Mixture Model

Greetings,

I would like to write a post on the Gaussian Mixture models. The post would be a tutorial with a, hopefully, intuitive explanation of when and how to use Gaussian Mixture (GM) models. However, before I begin with GM, I really ought to start with KM…

K-Means Clustering

K-means clustering is the most fundamental, 'vanilla' clustering algorithm. The name says it all: the 'k' is the number of clusters, and the 'means' are our target assignments. So, for a given data set, we come up with k estimated means that form the cluster centers. The learning of these k centers is iterative, where the optimization criterion is to minimize the total Euclidean distance between the data points and the k centers. Once we have the k cluster centers, we can assign a new observation to the cluster whose mean/center is closest to it.

KM can be used in supervised and unsupervised learning problems. When class assignments are present in the training data, a validation set can be created to adjust the KM parameters. For example, the choice of k, the convergence criterion, and the initial assignment of cluster centroids can all be tuned with a validation set. In an unsupervised setting, KM may need to be repeated several times to help find the best solution, since KM is very prone to local optima.

There are many positive aspects of KM. But one of the not-so-good points of this clustering model is the 'hard assignment' to clusters. In particular, KM partitions the data space into a Voronoi tessellation, and any point that falls into a partition is assigned to that partition alone. There is no 'maybe' with KM…

To illustrate the KM algorithm, I have written a short example in python and posted it on my github. In the 'k_means.py' file you can see that I generate two sets of 2D random data using NumPy's multivariate_normal, and then discover the two clusters with KMeans from sklearn.cluster. The data and the discovered cluster centers are:

Data with two centroids.
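
The full script is on github; a minimal sketch of the same workflow (the means, covariances and sample sizes here are illustrative, not the ones in k_means.py) looks roughly like this:

import numpy as np
from sklearn.cluster import KMeans

np.random.seed(0)
# Two 2D Gaussian blobs standing in for the generated data
cluster_a = np.random.multivariate_normal([0, 0], [[1, 0.3], [0.3, 1]], 250)
cluster_b = np.random.multivariate_normal([4, 3], [[1, 0.3], [0.3, 1]], 250)
X = np.vstack([cluster_a, cluster_b])

km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)
print(km.cluster_centers_)        # the two discovered centroids

# Hard assignment: a new point simply goes to the closest centroid.
print(km.predict([[2.0, 1.5]]))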

The KMeans.predict method is used to generate predictions. Below I show, side by side, the actual clusters and the predicted ones. Any point that is closer to the left centroid is assigned to the left cluster, and the same applies for the right cluster.

Actual vs. Predicted.

In reality, where the two clusters meet we would like some probabilistic assignment. That is, we would like to be able to say that there is an x% chance that the point belongs to the left cluster, but also a y% chance that it belongs to the right cluster. KM cannot do that. Also, in terms of cluster spread and direction: if you think of clusters as spheres, KM outputs clusters with identical shapes. Our blue and red clusters have a small positive covariance, and KM cannot detect it. But the Gaussian Mixture model can. Let's look at GM next.

Learning Predictive Rules on the Poker Hand Data Set

Hello again,

In my last post I shared a rather simple piece of python code that builds a decision tree classifier to recognize a hand in a poker game. The simplicity of my solution stemmed from the fact that I added features that guided the tree model to learn accurate class representations. In particular, I added binary features to signal 'four-of-a-kind', 'three-of-a-kind', etc. I was able to add these features because I had full domain knowledge. But what could I do if I did not have any idea of what these features should be?

Well, the authors of the data set have published several papers on this. One of them, Evolutionary Data Mining with Automatic Rule Generalization from 2002, introduced improvements to their RAGA system, which is capable of extracting predictive rules. They used an evolutionary search algorithm (genetic algorithms and genetic programming) to synthesize new rules, select the rules with the best support and coverage, and apply a generalization step to the best rules. Essentially, the recipe includes the following functionality:

  1. Synthesis of a population of rules and pruning of invalid rules.
  2. A fitness evaluation function.
  3. Rule cross-over and mutation.
  4. A rule generalization function.

The rule synthesis can be achieved with combinations of features under "=" and "!=" conditions. The fitness evaluation can use the standard rule support and confidence metrics. The cross-over and mutation can be a step where single features are "AND"ed together to create longer rules. The rule-generalization step is the most complex: it would involve grouping commonly occurring item-sets into new symbols. For example, if we have several rules that include three ranks being the same, we could combine them into a single item-set:

Sets combination.

After several generations, such symbols should evolve into more general rules (i.e. no reference to '7', just R1=R2=R3). Here, the introduction of new symbols into the item set is the feature-engineering step that I performed manually in my decision tree code.
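
As a toy illustration (a hypothetical sketch, not the RAGA system), a rule can be represented as a conjunction of "="/"!=" conditions and scored with the support and confidence metrics mentioned above; the column names R1, R2, R3 and CLASS are made up:

# Toy rule representation: a rule is a list of (feature, operator, value) conditions.
def matches(rule, row):
    """Return True if a data row satisfies every condition in the rule."""
    ops = {'=': lambda a, b: a == b, '!=': lambda a, b: a != b}
    return all(ops[op](row[feature], value) for feature, op, value in rule)

def support_and_confidence(rule, rows, target_class):
    """Standard rule-mining fitness: support and confidence of rule -> target_class."""
    covered = [row for row in rows if matches(rule, row)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for row in covered if row['CLASS'] == target_class)
    return len(covered) / len(rows), correct / len(covered)

# Example: "three ranks are equal" expressed with '=' conditions on hypothetical columns.
rule = [('R1', '=', 7), ('R2', '=', 7), ('R3', '=', 7)]
rows = [
    {'R1': 7, 'R2': 7, 'R3': 7, 'CLASS': 'three-of-a-kind'},
    {'R1': 7, 'R2': 2, 'R3': 7, 'CLASS': 'pair'},
]
print(support_and_confidence(rule, rows, 'three-of-a-kind'))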