Learning Binary Decision Trees from the Poker Hand Data Set

Greetings, my blog readers!

What a year this has been so far! I am finally finding some time and energy to write new posts on my blog. I am really looking forward to blogging more about machine learning and data science!

In this post I will share with you my approach to building a model to learn to recognize a poker hand. This problem and the training and testing data are available from the UCI machine learning repository here. My approach to this task is based on recognizing the following about the data:

  1. The training set of all positive examples is incomplete.
  2. It is possible to express all positive examples as a set of if/else/then rules.
  3. Given the training data and a classifier capable of learning the high-order rules, we should be able to achieve a 100% accuracy on test data. Thus the fact that the set of positive examples is incomplete does not matter.
  4. To evaluate our classifier we can stick with a simple accuracy ratio since our target accuracy is 100%.

The poker hand data set

There are 10 classes in the poker data set. Class 0 represents “nothing in hand” and is the largest class. The set of positive examples can be extended if we generate an exhaustive set of combinations representing a class. For example, we can fully specify the Flush by listing all possibilities of five cards of the same suit. We would need to list \frac{13!}{5!(13-5)!}\cdot 4 -40=5108 combinations, where minus 40 comes from subtracting the 40 possibilities of a straight flush.

I am not going to list all possibilities. Instead, I will train several binary decision tree models. Each tree will recognize one class vs. the rest. What do you think will happen if we present the raw training data to a decision tree model? Take, for example, scikit-learn DecisionTreeClassifier. Because our data is numeric, the tree will contain branches and nodes that do not represent meaningful rules that can be translated to categorical data. In other words, the decision tree model will try to locally optimize the splitting criteria and fail to see the big picture. If we want to use the decision tree model, we need to add more features. Specifically, we need to add features that will let the tree to spot the higher order separations that exist in this data.

Engineering more features

So, what features do we need to add? We need to add binary features indicating the Flush, 4 of a kind, 3 of a kind, a pair, full house, etc. Classes 6 and 8 can be learned by combining the features we are going to introduce. One can say that features engineering for this problem is cheating as it defeats the purpose of the task. I agree with this completely, and the main lesson of this post is that when presented with such data as the poker hand data set, the best choice of the classifier must take into account the higher-order abstracts. If adding new features is not possible, then using an un/supervised approach to first discover such constructs may be required.

Python code

Let’s take a look at a possible solution in Python (2.7). You can download it from my github space here. The “poker_hand.py” contains the solution. I have provided enough comments to make it clear what I am doing at each step. Please let me know if you can think of a more elegant way to concatenate_all_ranks and suits. The code has an ability to graph the trees using graphiz. Here is an example of a tree to pick-out the Royal Flush:

Royal Flush tree
Royal Flush tree

The code achieves 100% accuracy. No wonder! We told the tree what to learn…

Executing a Python Script from a C# Program

Python is a great programming language. It also has fantastic libraries (e.g. numPy, sciPy, pandas, etc.) that can significantly simplify and speed-up development. So, it is no surprise that sometimes when considering a coding solution for some problem you scratch your head and wish if you could only do it in Python (i.e. instead of C# or other low level programming language)!

In this short post I will show you how to execute a Python scrip and capture its output from a C# console-based program. Essentially, running a Python script from C# can be done either with IronPython dll referenced within your C# project and using the IronPython.Hosting library; or via kicking off a process that executes a Python program. The advantage of the later is that the script can be written in any Python flavor, provided you have installed the Python shell compatible with the script.

For this example I am using the following Python program saved as ‘hello.py’:

 


import sys

def main():
    print('Hello')
    if len(sys.argv)>=3:
        x = sys.argv[1]
        y = sys.argv[2]
        # print concatenated parameters
        print(x+y)

if __name__=='__main__':
    main()

 

The following C# code will start a process and assign it to the Python interpreter with the Python program name passed as argument. The output is parsed as stream. However, the output can be written to a file instead.

 

using System;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;

namespace ExecutingPythonScript
{
    class EntryPoint
    {
        static void Main(string[] args)
        {

            int x = 1;
            int y = 2;
            string progToRun = "C:\\Python27\\hello.py";
            char[] splitter = {'\r'};

            Process proc = new Process();
            proc.StartInfo.FileName = "python.exe";
            proc.StartInfo.RedirectStandardOutput = true;
            proc.StartInfo.UseShellExecute = false;

            // call hello.py to concatenate passed parameters
            proc.StartInfo.Arguments = string.Concat(progToRun, " ", x.ToString(), " ", y.ToString());
            proc.Start();

            StreamReader sReader = proc.StandardOutput;
            string[] output = sReader.ReadToEnd().Split(splitter);

            foreach (string s in output)
                Console.WriteLine(s);

            proc.WaitForExit();

            Console.ReadLine();
        }
    }
}

 

And that is all. It is that simple. One last thing to point out: progToRun could have been a string literal assigned to the actual Python code. This works well when you have a small task and don’t want to write a Python program to hold it.

***Thanks to Johannes for spotting a coding type in the variable name, which is now fixed.