K-Means Clustering and How it Relates to the Gaussian Mixture Model

Greetings,

I would like to write a post on Gaussian Mixture models. The post will be a tutorial with a, hopefully, intuitive explanation of when and how to use Gaussian Mixture (GM) models. However, before I begin with GM, I really ought to start with K-Means (KM)…

K-Means Clustering

K-means clustering is the most fundamental, ‘vanilla’ clustering algorithm. The name says it all: ‘k’ is the number of clusters, and the ‘means’ are the cluster centers we estimate. So, for a given data set, we come up with k estimated means that form the cluster centers. These k centers are learned iteratively, where the optimization criterion is to minimize the total squared Euclidean distance between the data points and their assigned centers. Once we have the k cluster centers, we can assign a new observation to the cluster whose mean/center is closest to it.
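
For readers who like to see the iteration spelled out, here is a bare-bones Python sketch of the k-means loop (Lloyd's algorithm). It is a minimal illustration, not production code: it ignores the empty-cluster edge case and does not test for convergence.

import numpy as np

def lloyd_kmeans(X, k, n_iter=100):
    # Initialize the centers with k points sampled from the data
    centers = X[np.random.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return centers, labels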

KM can be used in supervised and unsupervised learning problems. When class assignments are present in the training data, a validation set can be created to tune KM parameters: the choice of k, the convergence criteria, the initial assignment of cluster centroids. In an unsupervised setting, KM may need to be run several times with different initializations, since it is prone to converging to a local optimum.

KM has many positive aspects. But one of its not-so-good points is the ‘hard assignment’ of points to clusters. In particular, KM partitions the data space into a Voronoi tessellation: any point that falls into a partition is assigned to that partition alone. There is no ‘maybe’ with KM…

To illustrate the KM algorithm, I have written a short example in Python and posted it on my GitHub. In the ‘k_means.py’ file you can see that I generate two sets of 2D random data using numpy’s multivariate_normal function, and then discover the two clusters with KMeans from sklearn.cluster.
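
If you prefer not to open the file, the gist of it looks roughly like this (the means, covariance and sample sizes below are illustrative, not the exact values in k_means.py):

import numpy as np
from sklearn.cluster import KMeans

# Two 2D clusters drawn from multivariate normals with a small positive covariance
cov = [[1.0, 0.3], [0.3, 1.0]]
X = np.vstack([np.random.multivariate_normal([0, 0], cov, 200),
               np.random.multivariate_normal([4, 4], cov, 200)])

km = KMeans(n_clusters=2).fit(X)
print(km.cluster_centers_)

The data and the discovered cluster centers are: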

Data with two centroids.

The KMeans predict method is used to generate predictions. Below I show, side by side, the actual clusters and the predicted ones. Any point that is closer to the left centroid is assigned to the left cluster, and likewise for the right cluster.

Actual vs. Predicted

In reality, where the two clusters meet we would like some probabilistic assignment. That is, we would like to be able to say that there is an x% chance that a point belongs to the left cluster, but also a y% chance that it belongs to the right cluster. KM cannot do that. KM is also limited in terms of cluster spread and direction: it effectively treats clusters as identically shaped spheres. Our blue and red clusters have a small positive covariance, and KM cannot detect it. But the Gaussian Mixture model can. Let’s look at GM next.
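
As a quick preview, scikit-learn’s GaussianMixture (from sklearn.mixture, available in recent versions) produces exactly these soft assignments. A minimal sketch, reusing the X generated above:

from sklearn.mixture import GaussianMixture

gm = GaussianMixture(n_components=2).fit(X)
# Soft assignment: a probability for each cluster, e.g. for a borderline point
print(gm.predict_proba([[2.0, 2.0]]))
# The fitted covariance matrices capture each cluster's spread and direction
print(gm.covariances_)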

Learning Predictive Rules on the Poker Hand Data Set

Hello again,

In my last post I shared a rather simple Python solution for building a decision tree classifier to recognize a hand in a poker game. The simplicity of my solution stemmed from the fact that I added features that guided the tree model to learn accurate class representations. In particular, I added binary features to signal ‘four-of-a-kind’, ‘three-of-a-kind’, etc. I was able to add these features because I had full domain knowledge. But what could I do if I did not have any idea of what these features should be?

Well, the authors of the data set have published several papers on this. One of them is “Evolutionary Data Mining with Automatic Rule Generalization” from 2002, where they introduced improvements to the RAGA system, which is capable of extracting predictive rules. They used an evolutionary search algorithm (a combination of genetic algorithms and genetic programming) to synthesize new rules, select the rules with the best support and coverage, and apply a generalization step to the best rules. Essentially, the recipe includes the following functionality:

  1. Synthesis of a population of rules and pruning of invalid rules.
  2. Fitness evaluation function.
  3. Rules cross-over and mutation.
  4. Rules generalization function.

Rule synthesis can be achieved by combining features with the “=” and “!=” operators. The fitness evaluation can use the standard rule support and confidence metrics. The cross-over and mutation step can “AND” single features together to create longer rules. The rule-generalization step is the most complex: it involves grouping commonly occurring item-sets into new symbols. For example, if we have several rules that include three ranks being the same, we can combine them into a single item-set:

Sets combination

After several generations, such symbols should evolve into more general rules (i.e. no reference to ‘7’, just R1=R2=R3). Here, the introduction of new symbols into the item set is the feature engineering step that I performed manually in my decision tree code.
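
To make the fitness evaluation step concrete, here is a rough Python sketch of how a rule’s support and confidence could be computed. The hand representation and the names are my own illustration, not the actual RAGA implementation:

# Hypothetical sketch: evaluating a candidate rule's support and confidence.
# A hand is a dict such as {"R1": 7, "R2": 7, "R3": 7, "R4": 2, "R5": 9, "class": 3}.
# A rule is a list of (field_a, op, field_b) conditions joined with AND,
# e.g. three-of-a-kind as [("R1", "=", "R2"), ("R2", "=", "R3")].

def rule_holds(hand, rule):
    ops = {"=": lambda a, b: a == b, "!=": lambda a, b: a != b}
    return all(ops[op](hand[fa], hand[fb]) for fa, op, fb in rule)

def support_and_confidence(hands, rule, target_class):
    covered = [h for h in hands if rule_holds(h, rule)]
    correct = [h for h in covered if h["class"] == target_class]
    support = len(covered) / float(len(hands))
    confidence = len(correct) / float(len(covered)) if covered else 0.0
    return support, confidence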

Learning Binary Decision Trees from the Poker Hand Data Set

Greetings, my blog readers!

What a year this has been so far! I am finally finding some time and energy to write new posts on my blog. I am really looking forward to blogging more about machine learning and data science!

In this post I will share with you my approach to building a model that learns to recognize a poker hand. This problem, along with the training and testing data, is available from the UCI machine learning repository here. My approach to this task is based on recognizing the following about the data:

  1. The training set of all positive examples is incomplete.
  2. It is possible to express all positive examples as a set of if-then rules.
  3. Given the training data and a classifier capable of learning the high-order rules, we should be able to achieve 100% accuracy on the test data. Thus the fact that the set of positive examples is incomplete does not matter.
  4. To evaluate our classifier we can stick with a simple accuracy ratio since our target accuracy is 100%.

The poker hand data set

There are 10 classes in the poker hand data set. Class 0 represents “nothing in hand” and is the largest class. The set of positive examples can be extended if we generate an exhaustive set of combinations representing a class. For example, we can fully specify the Flush by listing all possibilities of five cards of the same suit. We would need to list \frac{13!}{5!(13-5)!}\cdot 4 - 40 = 5108 combinations, where the minus 40 subtracts the 40 possibilities of a straight flush.

I am not going to list all possibilities. Instead, I will train several binary decision tree models, where each tree recognizes one class vs. the rest. What do you think will happen if we present the raw training data to a decision tree model, for example scikit-learn’s DecisionTreeClassifier? Because our data is numeric, the tree will contain branches and nodes that do not represent meaningful rules that can be translated to categorical data. In other words, the decision tree will locally optimize the splitting criterion and fail to see the big picture. If we want to use the decision tree model, we need to add more features: specifically, features that let the tree spot the higher-order separations that exist in this data.
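
For concreteness, the naive attempt would look something like this (the column names follow the UCI description; the file name assumes the standard training file from the repository):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Assumes the UCI training file is in the working directory
cols = ["S1", "C1", "S2", "C2", "S3", "C3", "S4", "C4", "S5", "C5", "CLASS"]
train = pd.read_csv("poker-hand-training-true.data", names=cols)

# Raw numeric columns: the tree can only make threshold splits such as C1 <= 6.5,
# which cannot express a rule like "three of the five ranks are equal"
clf = DecisionTreeClassifier().fit(train[cols[:-1]], train["CLASS"])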

Engineering more features

So, what features do we need to add? We need binary features indicating a flush, four of a kind, three of a kind, a pair, a full house, etc. Classes 6 and 8 can then be learned by combining the features we introduce. One could say that feature engineering for this problem is cheating, as it defeats the purpose of the task. I agree with this completely, and the main lesson of this post is that when presented with data such as the poker hand data set, the best choice of classifier must take into account the higher-order abstractions. If adding new features is not possible, then an un/supervised approach to first discover such constructs may be required.
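
To give an idea of what I mean, here is a hedged sketch of such binary features; the names are illustrative, not the exact ones used in poker_hand.py:

from collections import Counter

def engineer_features(ranks, suits):
    # ranks: the 5 card ranks of a hand; suits: the 5 card suits
    counts = sorted(Counter(ranks).values(), reverse=True)
    return {
        "is_flush": len(set(suits)) == 1,
        "four_of_a_kind": counts[0] == 4,
        "full_house": counts[0] == 3 and counts[1] == 2,
        "three_of_a_kind": counts[0] == 3 and counts[1] == 1,
        "two_pairs": counts[0] == 2 and counts[1] == 2,
        "one_pair": counts[0] == 2 and counts[1] == 1,
    }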

Python code

Let’s take a look at a possible solution in Python (2.7). You can download it from my GitHub space here. The file “poker_hand.py” contains the solution. I have provided enough comments to make it clear what I am doing at each step. Please let me know if you can think of a more elegant way to concatenate all the ranks and suits. The code also has the ability to graph the trees using graphviz. Here is an example of a tree to pick out the Royal Flush:

Royal Flush tree

The code achieves 100% accuracy. No wonder! We told the tree what to learn…

My GitHub Space

New Year, new coding practice!
From now on I will publish the full solutions to all the coding projects I blog about on my GitHub space. This will make it easier for my readers to re-use my code or propose changes.
Each new post, where relevant, will include a link to a GitHub repository.

The solution for the latest post on Inverted Indices can be found here: https://github.com/elena-sharova/Inverted-Index.git

Inverted Index in C#

Greetings my blog readers!

This is my last post for 2015. All in all, 2015 has been a good year. Many people from all over the world have read my posts and tried my code. Some even subscribed…

In this post I will do something rather simple: write an inverted index builder in C#. What is an inverted index? It is the simplest form of document indexing that allows Boolean queries on text data. For example, imagine we have several text files and we would like to find out which of them contain all or some of the search terms. We can group search terms using the regular Boolean AND and OR expressions.

To illustrate, let’s say we have two files, 1.txt and 2.txt, containing the following:

file 1: one star is sparkling bright
file 2: two stars are sparkling even brighter

An inverted index on such data is built in four steps:

  1. Collect the documents.
  2. Tokenize the text, splitting it into elements (tokens).
  3. Perform linguistic pre-processing (e.g. removing punctuation, removing capitalization, etc.) to normalize the elements.
  4. Create an index in the form of a dictionary, where the keys are the elements and the values are the documents that contain them.

It is really as simple as that. I am not going to code the normalization step, as it would dilute the main emphasis of this post. Instead, I will provide two simple extensions to allow AND and OR queries on the index.

So, given our short example files, what does the inverted index look like? Here is a visual example:

Simple Inverted Index

I use a C# Dictionary to store the index. The AND and OR are provided as extension methods and, for the moment, can handle only two keywords. It would be simple to remove this restriction by passing an array or a list to the extension methods instead (a possible multi-term AND is sketched inside the Extensions class below). In the end, the code is just an example of using LINQ (Intersect, Union and SelectMany). If you end up using this code, please add your own exception handling.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace InvertedIndex
{

    public static class Extensions
    {
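        // Documents that contain BOTH terms: intersect the two posting lists.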
        public static IEnumerable<string> And (this Dictionary<string, List<string>> index, string firstTerm, string secondTerm)
        {
            return(from d in index
                         where d.Key.Equals(firstTerm)
                         select d.Value).SelectMany(x=>x).Intersect
                            ((from d in index
                             where d.Key.Equals(secondTerm)
                             select d.Value).SelectMany(x => x));
        }

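        // Documents that contain EITHER term: union of the posting lists, duplicates removed.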
        public static IEnumerable<string> Or(this Dictionary<string, List<string>> index, string firstTerm, string secondTerm)
        {
            return (from d in index
                        where d.Key.Equals(firstTerm) || d.Key.Equals(secondTerm)
                        select d.Value).SelectMany(x=>x).Distinct();
        }
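
        // A possible multi-term AND, sketched to show how the two-keyword
        // restriction could be removed. Assumes at least one search term is
        // supplied and that terms can be enumerated more than once (e.g. a list).
        public static IEnumerable<string> And(this Dictionary<string, List<string>> index, IEnumerable<string> terms)
        {
            // A term missing from the index means no document can match all terms
            if (terms.Any(t => !index.ContainsKey(t)))
                return Enumerable.Empty<string>();

            return terms.Select(t => index[t].AsEnumerable())
                        .Aggregate((acc, docs) => acc.Intersect(docs));
        }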

    }

    class EntryPoint
    {
        public static Dictionary<string, List<string>> invertedIndex;
 
        static void Main(string[] args)
        {
            invertedIndex = new Dictionary<string, List<string>>();
            string folder = "C:\\Users\\Elena\\Documents\\Visual Studio 2013\\Projects\\InvertedIndex\\Files\\";

            foreach (string file in Directory.EnumerateFiles(folder, "*.txt"))
            {
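                // Read the file, split on spaces, and drop duplicate words within the file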
                List<string> content = System.IO.File.ReadAllText(file).Split(' ').Distinct().ToList();
                addToIndex(content, file.Replace(folder, ""));
            }

            var resAnd = invertedIndex.And("star", "sparkling");
            var resOr = invertedIndex.Or("star", "sparkling");
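
            // Display the documents matched by each query
            Console.WriteLine("AND: " + string.Join(", ", resAnd));
            Console.WriteLine("OR: " + string.Join(", ", resOr));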

            Console.ReadLine();
        }

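        // Add each word to the index, recording the document it appears in.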
        private static void addToIndex(List<string> words, string document)
        {
            foreach (var word in words)
            {
                if (!invertedIndex.ContainsKey(word))
                {
                    invertedIndex.Add(word, new List<string> { document });
                }
                else
                {
                    invertedIndex[word].Add(document);
                }
            }
        }
    }
}

Enjoy and Happy New Year!

P.S. And here is the GitHub link: https://github.com/elena-sharova/Inverted-Index.git