Learning Binary Decision Trees from the Poker Hand Data Set

Greetings, my blog readers!

What a year this has been so far! I am finally finding some time and energy to write new posts on my blog. I am really looking forward to blogging more about machine learning and data science!

In this post I will share with you my approach to building a model to learn to recognize a poker hand. This problem and the training and testing data are available from the UCI machine learning repository here. My approach to this task is based on recognizing the following about the data:

  1. The training set does not contain all possible positive examples; it is incomplete.
  2. It is possible to express all positive examples as a set of if/then rules.
  3. Given the training data and a classifier capable of learning the higher-order rules, we should be able to achieve 100% accuracy on the test data. Thus the fact that the set of positive examples is incomplete does not matter.
  4. To evaluate our classifier we can stick with a simple accuracy ratio since our target accuracy is 100%.

The poker hand data set

There are 10 classes in the poker data set. Class 0 represents “nothing in hand” and is the largest class. The set of positive examples can be extended if we generate an exhaustive set of combinations representing a class. For example, we can fully specify the Flush by listing all possibilities of five cards of the same suit. We would need to list \frac{13!}{5!(13-5)!}\cdot 4 -40=5108 combinations, where the minus 40 comes from subtracting the 40 possible straight flushes.
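
Just to make that count concrete, here is a quick Python sketch of mine (not from the original post) that enumerates the flush hands with itertools; the 1-13 rank and 1-4 suit encodings follow the UCI description, but treat them as an assumption:

from itertools import combinations

ranks = range(1, 14)   # card ranks, assumed encoded 1..13
suits = range(1, 5)    # suits, assumed encoded 1..4

def is_straight(hand_ranks):
    """Five consecutive ranks; the ace (1) can also play high."""
    ordered = sorted(hand_ranks)
    return ordered == list(range(ordered[0], ordered[0] + 5)) or ordered == [1, 10, 11, 12, 13]

# every five-card, same-suit combination, minus the 40 straight/royal flushes
flushes = [(suit, combo) for suit in suits
           for combo in combinations(ranks, 5)
           if not is_straight(combo)]

print(len(flushes))  # 5108, matching the formula above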

I am not going to list all possibilities. Instead, I will train several binary decision tree models. Each tree will recognize one class vs. the rest. What do you think will happen if we present the raw training data to a decision tree model? Take, for example, scikit-learn's DecisionTreeClassifier. Because our data is numeric, the tree will contain branches and nodes that do not represent meaningful rules that can be translated to categorical data. In other words, the decision tree model will try to locally optimize the splitting criteria and fail to see the big picture. If we want to use the decision tree model, we need to add more features. Specifically, we need to add features that let the tree spot the higher-order separations that exist in this data.
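
For reference, the one-tree-per-class setup I have in mind looks roughly like this (a sketch only; the file and column names follow the UCI description rather than my "poker_hand.py", so adjust as needed):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# assumed UCI layout: five (suit, rank) pairs followed by the hand class
columns = ['S1', 'C1', 'S2', 'C2', 'S3', 'C3', 'S4', 'C4', 'S5', 'C5', 'CLASS']
train = pd.read_csv('poker-hand-training-true.data', header=None, names=columns)

X = train.drop('CLASS', axis=1)
trees = {}
for hand_class in sorted(train['CLASS'].unique()):
    y = (train['CLASS'] == hand_class).astype(int)  # this class vs. the rest
    trees[hand_class] = DecisionTreeClassifier().fit(X, y)

Trained on the raw columns alone, these trees latch onto local numeric splits rather than the rules we actually care about, which is why the next section adds features.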

Engineering more features

So, what features do we need to add? We need to add binary features indicating the Flush, 4 of a kind, 3 of a kind, a pair, full house, etc. Classes 6 and 8 can be learned by combining the features we are going to introduce. One can say that feature engineering for this problem is cheating, as it defeats the purpose of the task. I agree with this completely, and the main lesson of this post is that when presented with data such as the poker hand data set, the best choice of classifier must take the higher-order abstractions into account. If adding new features is not possible, then using a supervised or unsupervised approach to first discover such constructs may be required.
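
To make the idea concrete, here is a rough sketch of the kind of binary indicators I mean (the function and its names are illustrative, not the exact code in "poker_hand.py"):

from collections import Counter

def engineer_features(hand):
    """hand is a list of five (suit, rank) tuples; returns binary features
    exposing the higher-order structure a decision tree can split on."""
    suits = [suit for suit, _ in hand]
    ranks = [rank for _, rank in hand]
    rank_counts = sorted(Counter(ranks).values(), reverse=True)

    return {
        'is_flush':      int(len(set(suits)) == 1),
        'has_pair':      int(rank_counts[0] >= 2),
        'has_two_pairs': int(rank_counts[:2] == [2, 2]),
        'has_three':     int(rank_counts[0] >= 3),
        'has_four':      int(rank_counts[0] == 4),
        'is_full_house': int(rank_counts[:2] == [3, 2]),
    }

# four of a kind: the 'has_four' indicator lights up
print(engineer_features([(1, 9), (2, 9), (3, 9), (4, 9), (1, 2)]))

A tree can split on these directly, and the composite hands follow from combinations of them (a full house, for instance, is three of a kind plus a pair).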

Python code

Let’s take a look at a possible solution in Python (2.7). You can download it from my github space here. The file “poker_hand.py” contains the solution. I have provided enough comments to make it clear what I am doing at each step. Please let me know if you can think of a more elegant way to concatenate all the ranks and suits. The code can also graph the trees using Graphviz. Here is an example of a tree that picks out the Royal Flush:

[Figure: Royal Flush tree]

The code achieves 100% accuracy. No wonder! We told the tree what to learn…

Inverted Index in C#

Greetings my blog readers!

This is my last post for 2015. All in all 2015 has been a good year. Many people from all over the world have read my posts, tried my code. Some even subscribed…

In this post I will do something rather simple – write an inverted index builder in C#. What is an inverted index? It is the simplest form of document indexing, used to perform boolean queries on text data. For example, imagine we have several text files and we would like to find out which of them contain all or some of the search terms. We can group the search terms using the usual boolean AND and OR expressions.

To illustrate, let’s say we have two files, 1.txt and 2.txt. The two files contain the following:

file 1: one star is sparkling bright
file 2: two stars are sparkling even brighter

An inverted index on such data is built in four steps:

  1. Collect the documents.
  2. Tokenize the text by parsing each element.
  3. Perform linguistic pre-processing (e.g. removing punctuation, removing capitalization, etc.) to normalize the elements.
  4. Create an index in the form of a dictionary, where the keys are the elements and the values are the documents that contain them.

It is really as simple as that. I am not going to code the normalization step, as it would distract from the main emphasis of this post. Instead, I will provide two simple extensions that allow AND and OR queries on the index.

So, given our short example files, what does the inverted index look like? Here is a visual example:

[Figure: Simple Inverted Index]
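
In plain dictionary form (a small Python sketch of mine, just to make the structure concrete before the C# version), the index maps each normalised term to the list of documents containing it, and the boolean queries reduce to set operations on those lists:

inverted_index = {
    "one":       ["1.txt"],
    "star":      ["1.txt"],
    "is":        ["1.txt"],
    "sparkling": ["1.txt", "2.txt"],
    "bright":    ["1.txt"],
    "two":       ["2.txt"],
    "stars":     ["2.txt"],
    "are":       ["2.txt"],
    "even":      ["2.txt"],
    "brighter":  ["2.txt"],
}

# AND is an intersection of the posting lists, OR is a union
print(set(inverted_index["star"]) & set(inverted_index["sparkling"]))  # -> 1.txt
print(set(inverted_index["star"]) | set(inverted_index["sparkling"]))  # -> 1.txt, 2.txt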

I use a C# Dictionary to store such an index. The AND and OR are provided as extensions, and for the moment can handle only two keywords. It would be simple to remove this restriction by passing an array or a list to the extension methods instead. In the end, the code is just an example of using LINQ (Intersect, Union and SelectMany). If you end up using this code, please add your own exception handling.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace InvertedIndex
{

    public static class Extensions
    {
        public static IEnumerable<string> And (this Dictionary<string, List<string>> index, string firstTerm, string secondTerm)
        {
            return(from d in index
                         where d.Key.Equals(firstTerm)
                         select d.Value).SelectMany(x=>x).Intersect
                            ((from d in index
                             where d.Key.Equals(secondTerm)
                             select d.Value).SelectMany(x => x));
        }

        public static IEnumerable<string> Or(this Dictionary<string, List<string>> index, string firstTerm, string secondTerm)
        {
            return (from d in index
                        where d.Key.Equals(firstTerm) || d.Key.Equals(secondTerm)
                        select d.Value).SelectMany(x=>x).Distinct();
        }

    }

    class EntryPoint
    {
        public static Dictionary<string, List<string>> invertedIndex;
 
        static void Main(string[] args)
        {
            invertedIndex = new Dictionary<string, List<string>>();
            string folder = "C:\\Users\\Elena\\Documents\\Visual Studio 2013\\Projects\\InvertedIndex\\Files\\";

            foreach (string file in Directory.EnumerateFiles(folder, "*.txt"))
            {
                List<string> content = System.IO.File.ReadAllText(file).Split(' ').Distinct().ToList();
                addToIndex(content, file.Replace(folder, ""));
            }

            var resAnd = invertedIndex.And("star", "sparkling");
            var resOr = invertedIndex.Or("star", "sparkling");

            //print the documents returned by each query
            Console.WriteLine("AND: {0}", string.Join(", ", resAnd));
            Console.WriteLine("OR: {0}", string.Join(", ", resOr));

            Console.ReadLine();
        }

        private static void addToIndex(List<string> words, string document)
        {
            foreach (var word in words)
            {
                if (!invertedIndex.ContainsKey(word))
                {
                    invertedIndex.Add(word, new List<string> { document });
                }
                else
                {
                    invertedIndex[word].Add(document);
                }
            }
        }
    }
}

Enjoy and Happy New Year!

P.S. And here is the GitHub link: https://github.com/elena-sharova/Inverted-Index.git

Decision Tree Classifier – Part 1

To continue my blogging on machine learning (ML) classifiers, I am turning to decision trees. The post on decision trees will be in two parts. Part 1 will provide an introduction to how decision trees work and how they are built. Part 2 will contain the C# implementation of an example decision tree classifier. As in all my posts, I prefer a clear and informal explanation over a terse mathematical one. So, any pedants out there – look away now!

Decision trees are a great choice for inductive inference and have been widely used for a long time in the field of Artificial Intelligence (AI). In this post we will cover the decision tree algorithm known as ID3.
There are several reasons why decision trees are great classifiers:

  • decision trees are easy to understand;
  • decision trees work well with messy data or missing values;
  • decision trees give insight into complex data;
  • decision trees can be visualized to allow for inference inspection or correction;
  • decision tree building algorithms can be as simple or sophisticated as required (e.g. they can incorporate pruning, weights, etc.);

Decision trees work best with discrete classes. That is, the output class for each instance is either a string, a boolean or an integer. If you are working with continuous values, you may consider rounding and mapping to a discrete output class, or you may need to look for another classifier. Decision trees are used for classification problems. For example, the taxonomy of organisms, plants, minerals, etc. lends itself naturally to decision tree classifiers. This is because in the field of taxonomy we are dealing with a set of records containing values for some attributes, where the values come from a finite set of known possibilities, and each record may or may not have a classification. Medicine is another field that makes use of decision trees. Almost all illnesses can be categorized by their symptoms, so decision trees can aid doctors in diagnosis.

It is time for an example, which I am borrowing from [1]. The tennis playing example in ML is like the ‘Hello World’ in programming languages. We are given some data about the weather conditions that are appropriate for playing tennis. Our task is to construct a decision tree based on this data, and use the tree to classify unknown weather conditions. The learning data is:

Table 1
Outlook  | Temperature | Humidity | Wind   | Play Tennis
sunny    | hot         | high     | strong | no
sunny    | hot         | high     | weak   | no
overcast | hot         | high     | weak   | yes
rain     | mild        | high     | weak   | yes
rain     | cool        | normal   | weak   | yes
rain     | cool        | normal   | strong | no
overcast | cool        | normal   | strong | yes
sunny    | mild        | high     | weak   | no
sunny    | cool        | normal   | weak   | yes
rain     | mild        | normal   | weak   | yes
sunny    | mild        | normal   | strong | yes
overcast | mild        | high     | strong | yes
overcast | hot         | normal   | weak   | yes
rain     | mild        | high     | strong | no

Table 1 tells us which weather conditions are good for playing tennis outdoors. For example, if it is sunny, and the wind is weak, and the temperature is mild but it is very humid, then playing tennis outdoors is not advisable. On the other hand, on an overcast day, regardless of the wind, playing tennis outdoors should be OK.

The most fundamental concept behind decision tree classifiers is that of order. The amount of order in the data can be measured by assessing its consistency. For example, imagine that every row in Table 1 had a ‘yes’ in the Play Tennis column. This would tell us that regardless of the wind, temperature and humidity, we can always play tennis outside. The data would have perfect order, and we would be able to build a perfect classifier for it – one that always says ‘yes’ and is always correct. The other extreme would be where the outcome class differs for every observation. For an inductive learner like a decision tree, this would mean that it is impossible to classify a new instance unless it perfectly matches some instance in the training set. This is why a decision tree classifier won’t work for continuous-class problems.

In information theory the concept of order (actually, the lack of it) is often represented by entropy. It is a scary word for something that is rather simple. The definition of entropy was proposed by the founder of information theory, Claude Shannon; it is a probabilistic measure over categorical data defined as:

Entropy(S) = -\sum_{i=1}^{n}p_i \log_2 p_i

where S is the training data set with some target classification that takes on n different values, p_i is the probability of the i-th target value, and \log_2 is the logarithm base 2. The base two is due to entropy being a measure of the expected encoding length in binary format (i.e. 0s and 1s). Note that the smaller the individual probabilities, the greater the entropy; and the closer the entropy is to zero, the more order there is in the data. For example, a 50/50 split between two classes gives an entropy of 1, while a single class gives an entropy of 0.

We can now calculate the entropy of our data, one column at a time. Let’s take the Play Tennis column. We have 9 ‘yes’ and 5 ‘no’, which gives us an entropy of Entropy(S)=-\frac{9}{14} \log_2(\frac{9}{14}) - \frac{5}{14}\log_2(\frac{5}{14})=0.940, accurate to 3 d.p. There is a reason why we started with the class column. It will be used to measure the information gain we can achieve by splitting the data one attribute/column at a time, which is the core idea behind building decision trees. At each step we will be picking the attribute that achieves the greatest information gain to split the tree recursively. The information gain is, essentially, the reduction in entropy achieved by the split; and the root node of any decision tree is the attribute with the greatest information gain. Information gain is formally defined as:

InfoGain(S,A) = Entropy(S) -\sum_{v \in Values(A)}\frac{|S_v|}{|S|}Entropy(S_v)

where Values(A) is the set of all values attribute A takes. For example, the Wind attribute takes values in {strong, weak}, and the Humidity attribute takes values in {high, normal}. |S_v| and |S| are the number of records where A takes the value v and the total number of records respectively. Finally, Entropy(S_v) is the entropy of all the records where attribute A takes on value v, computed with respect to the target class. The last sentence is important. It means that, given all the records where attribute A takes on the same value v, we are interested in the dispersion of the target class. For example, take a look at all the records where Wind is strong. Here we have 3 records where the target class is ‘no’ and 3 records where it is ‘yes’. Thus, the corresponding entropy is Entropy(S_{wind=strong})=-\frac{3}{6} \log_2(\frac{3}{6}) -\frac{3}{6} \log_2(\frac{3}{6})=1.0.
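
The 0.940 figure above and the Wind = strong entropy are easy to reproduce with a few lines of Python (a quick sketch of mine, not part of the original post):

from math import log
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = float(len(labels))
    return -sum((count / total) * log(count / total, 2)
                for count in Counter(labels).values())

# the Play Tennis column of Table 1, top to bottom
play_tennis = ['no', 'no', 'yes', 'yes', 'yes', 'no', 'yes',
               'no', 'yes', 'yes', 'yes', 'yes', 'yes', 'no']
# the Play Tennis values of the six records where Wind is strong
wind_strong = ['no', 'no', 'yes', 'yes', 'yes', 'no']

print("%.3f" % entropy(play_tennis))  # 0.940
print("%.3f" % entropy(wind_strong))  # 1.000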

So, which attribute/column should be the tree’s root node? It should be the one that achieves the greatest information gain:

InfoGain(S,Wind) = 0.940 - \frac{8}{14}\times 0.811 - \frac{6}{14}\times 1.0=0.048
InfoGain(S,Humidity) = 0.940 - \frac{7}{14}\times 0.985 - \frac{7}{14}\times 0.592=0.151
InfoGain(S,Temperature) = 0.940 - \frac{4}{14}\times 1.0 - \frac{4}{14}\times 0.811 - \frac{6}{14}\times 0.918=0.029
InfoGain(S,Outlook) = 0.940 - \frac{5}{14}\times 0.971 - \frac{4}{14}\times 0.0 - \frac{5}{14}\times 0.971=0.246
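
These four gains can be checked programmatically as well. A short sketch (the entropy helper is repeated so the snippet runs on its own):

from math import log
from collections import Counter

def entropy(labels):
    total = float(len(labels))
    return -sum((count / total) * log(count / total, 2)
                for count in Counter(labels).values())

def info_gain(rows, attribute, target='Play Tennis'):
    """Entropy of the target minus the weighted entropy of each subset of the attribute."""
    gain = entropy([r[target] for r in rows])
    for value in set(r[attribute] for r in rows):
        subset = [r[target] for r in rows if r[attribute] == value]
        gain -= len(subset) / float(len(rows)) * entropy(subset)
    return gain

columns = ['Outlook', 'Temperature', 'Humidity', 'Wind', 'Play Tennis']
table1 = [dict(zip(columns, row)) for row in [
    ('sunny', 'hot', 'high', 'strong', 'no'),
    ('sunny', 'hot', 'high', 'weak', 'no'),
    ('overcast', 'hot', 'high', 'weak', 'yes'),
    ('rain', 'mild', 'high', 'weak', 'yes'),
    ('rain', 'cool', 'normal', 'weak', 'yes'),
    ('rain', 'cool', 'normal', 'strong', 'no'),
    ('overcast', 'cool', 'normal', 'strong', 'yes'),
    ('sunny', 'mild', 'high', 'weak', 'no'),
    ('sunny', 'cool', 'normal', 'weak', 'yes'),
    ('rain', 'mild', 'normal', 'weak', 'yes'),
    ('sunny', 'mild', 'normal', 'strong', 'yes'),
    ('overcast', 'mild', 'high', 'strong', 'yes'),
    ('overcast', 'hot', 'normal', 'weak', 'yes'),
    ('rain', 'mild', 'high', 'strong', 'no'),
]]

for attribute in ['Wind', 'Humidity', 'Temperature', 'Outlook']:
    print("%s %.3f" % (attribute, info_gain(table1, attribute)))
# Wind 0.048, Humidity 0.152, Temperature 0.029, Outlook 0.247

The Humidity and Outlook values come out as 0.152 and 0.247 rather than 0.151 and 0.246 only because the hand calculations above round the intermediate entropies to three decimal places; the ranking of the attributes is unchanged.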

It is clear that Outlook achieves the greatest information gain and should be the root of the tree. The worst information gain would be achieved with the Temperature attribute. This is what our tree looks like now:

[Figure 1: the tree after the first split, on Outlook]

At this stage we have the training data split into three clusters. Where outlook is overcast, we have a target class of ‘yes’. We now proceed to recursively build the sub-trees in the other two clusters. Where outlook is sunny, we have a total of 5 records. These are reproduced in Table 2 below. Out of the three remaining attributes we again pick the one that achieves the greatest information gain. If you inspect the data in Table 2, it is easy to see that Humidity is that attribute, since all five records can be split as follows: if humidity is normal, then ‘yes’, else ‘no’. Its corresponding information gain is

InfoGain(S_{sunny},Humidity) = 0.970 - \frac{3}{5}\times 0 - \frac{2}{5}\times 0=0.970

where 0.970 is the entropy of the sunny subset, again with respect to the Play Tennis class: Entropy(S_{sunny}) = - \frac{3}{5}\log_2(\frac{3}{5})- \frac{2}{5}\log_2(\frac{2}{5})=0.970.

Table 2
Outlook | Temperature | Humidity | Wind   | Play Tennis
sunny   | hot         | high     | strong | no
sunny   | hot         | high     | weak   | no
sunny   | mild        | high     | weak   | no
sunny   | cool        | normal   | weak   | yes
sunny   | mild        | normal   | strong | yes

At this stage we have classified 9 records, which now sit under the tree’s terminal descendant nodes (leaves):

[Figure 2: the tree after splitting the sunny branch on Humidity]

We can now deal with the records in the last outlook cluster, i.e. where outlook is rain. Here we have five records, which are reproduced below in Table 3:

Table 3
Outlook | Temperature | Humidity | Wind   | Play Tennis
rain    | mild        | high     | weak   | yes
rain    | cool        | normal   | weak   | yes
rain    | cool        | normal   | strong | no
rain    | mild        | normal   | weak   | yes
rain    | mild        | high     | strong | no

We apply the same algorithm as before by selecting the attribute for the next node that gives the greatest information gain. It is easy to spot that the Wind attribute achieves just that. Its total information gain is

InfoGain(S_{rain},Wind) = 0.970 - \frac{3}{5}\times 0 - \frac{2}{5}\times 0=0.970

The complete tree looks like this:

[Figure 3: the complete decision tree]

A few things should be noted about the final decision tree. Firstly, the Temperature attribute is completely missing. The constructed decision tree allowed us to see that this attribute is redundant, at least given the training dataset. The algorithm favors shorter trees over longer ones, which is the main part of its inductive bias (the algorithm’s policy). Secondly, this is not a binary tree, since we have parent nodes with more than two edges (i.e. the Outlook node). Finally, the construction algorithm searches in one direction only and cannot back-track to a parent node. In the grand scheme of things this implies that a decision tree can end up at a locally optimal, rather than globally optimal, solution.

The next blog post will be on how to implement the decision tree classifier in C#.

References:

[1] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.

k Nearest Neighbor Classifier

On my blog space I am going to share with you example implementations of the most common machine learning techniques. The code will be either in C# or Python.

This is the first post in a series of several posts to come on algorithms commonly used to implement classifiers. A classifier is a machine learning algorithm that is allowed to build a representation of data on some training dataset, and is ultimately used to classify new, unobserved instances. Classifiers are at the heart of machine learning and data mining, and they have wide use in medicine, speech and image recognition, and even finance. Broadly speaking, all classifiers fall into two categories: supervised and unsupervised. Supervised classifiers need to be ‘trained’, in the sense that they are fed training data with known classifications, which is used to construct the model. Unsupervised classifiers are a bit more complex, and work with unlabeled/unclassified training data (e.g. k-means clustering) or even have to learn the target class for each instance by trial and error (e.g. reinforcement learning).

We’ll begin by looking at the most basic instance-based classifier, known as the K-Nearest Neighbour (kNN). Here K is the number of instances used to cast the vote when labeling a previously unobserved instance. To demonstrate how kNN works, we will take this example dataset:

Table 1
Attr 1 | Attr 2 | Attr 3 | Class
0.7    | 10     | 300    | A
0.14   | 9      | 120    | A
1.0    | 12     | 200    | B
1.12   | 15     | 300    | B
0.4    | 7      | 150    | A
0.6    | 8      | 600    | A
1.15   | 15     | 600    | B
1.12   | 11     | 400    | B

The above is a dataset with four columns and eight rows. The first three columns are the data (kNN requires a numerical format), and the last column is the class. For the purpose of this example it does not matter what the data represents. I created a random example with a hidden pattern which, hopefully, our kNN algorithm can recognise.

Given a previously unobserved instance I={i_0,…,i_N}, we calculate the Euclidean distance between I and each known instance in the dataset as follows:

D_i= \sqrt{\sum_{k=0}^{N}(Z^{i}_{k}-I_{k})^2}

Here Z^{i}_{k} is the value of attribute k for some instance i whose classification is known (see the dataset), and I_k is the k-th attribute of the unclassified instance.
For example, given I = {Attr1=12, Attr2=11, Attr3=500}, the resulting distances, after normalisation, are the following:

Table 2
D      | Class
11.196 | A
11.772 | A
10.909 | B
10.792 | B
11.519 | A
11.295 | A
10.756 | B
10.774 | B

The distances were calculated on normalised data. That is, instead of using the original values in Table 1, where the 3rd column dominates all other values, we normalise each value according to:

\frac{Z_{k}^{i}-Min(Z_{k})}{Max(Z_{k})-Min(Z_{k})}

Again, Z is from the dataset. The instance that we need to classify is also normalised. For example, the normalised values for Table 1 are:

Table 3
Attr 1 | Attr 2 | Attr 3 | Class
0.554  | 0.375  | 0.375  | A
0      | 0.25   | 0      | A
0.851  | 0.625  | 0.167  | B
0.970  | 1      | 0.375  | B
0.257  | 0      | 0.062  | A
0.455  | 0.125  | 1      | A
1      | 1      | 1      | B
0.970  | 0.5    | 0.583  | B

The normalised I = {11.74, 0.5, 0.792}, which was calculated with max and min from the dataset, excluding the instance we need to classify.

Ok, now that we have calculated the distances (Table 2), we can proceed to vote on which class the instance I should belong to. To do this, we select the K smallest distances and look at their corresponding classes. Let’s take K=3; then the smallest distances with votes are: 10.756 for B, 10.774 for B, and 10.792 for B. The new instance clearly belongs to class B. This is spot on, as the hidden pattern is: Attr1 < 1 results in A, else B.

Consider another example I = {0.8, 11, 500}. For this instance, after normalisation, the top 3 distances with classes are: 0.379 for B, 0.446 for A, and 0.472 for A. The majority is A, so, the instance is classified as A.
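
Before moving on to the C# implementation, here is a compact Python sketch of mine that runs the whole pipeline on Table 1 (min-max normalisation, Euclidean distances, K=3 majority vote) and reproduces both classifications above:

from math import sqrt
from collections import Counter

# Table 1: three numeric attributes and a class label
data = [
    ([0.70, 10, 300], 'A'), ([0.14,  9, 120], 'A'),
    ([1.00, 12, 200], 'B'), ([1.12, 15, 300], 'B'),
    ([0.40,  7, 150], 'A'), ([0.60,  8, 600], 'A'),
    ([1.15, 15, 600], 'B'), ([1.12, 11, 400], 'B'),
]

# per-attribute minimum and maximum for min-max normalisation
columns = list(zip(*[row for row, _ in data]))
mins = [min(c) for c in columns]
maxs = [max(c) for c in columns]

def normalise(row):
    return [(v - lo) / float(hi - lo) for v, lo, hi in zip(row, mins, maxs)]

def classify(instance, k=3):
    query = normalise(instance)
    distances = [(sqrt(sum((a - b) ** 2 for a, b in zip(normalise(row), query))), label)
                 for row, label in data]
    top_k = [label for _, label in sorted(distances)[:k]]
    return Counter(top_k).most_common(1)[0][0]

print(classify([12, 11, 500]))    # B
print(classify([0.8, 11, 500]))   # A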

Several modifications to the basic kNN algorithm exist. Some employ weights to improve classification, others use an even K and break ties in a weighted fashion. Overall, it is a powerful, easy-to-implement classifier. If you are dealing with non-numeric attributes, these can be quantified through some mapping from the non-numeric values to numbers. Always remember to normalise your data, otherwise you run the risk of introducing bias into the classifier.

Let’s now look at how kNN can be implemented in C#. At the start, I introduce several extensions to aid in data manipulation. In C#, extension methods are an extremely useful concept and are quite addictive. And so is LINQ. You will notice that I try to use LINQ queries where possible. Most worker methods are private and my properties are read-only. Another thing to note is that my kNN constructor is private. Instead, users of the class call the initialiseKNN method, which ensures that K is odd, since I am not using weights and don’t provide for tie breaks.

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace kNN
{
    //extension method to aid in algorithm implementation
    public static class Extensions
    {
        //converts string representation of number to a double
        public static IEnumerable<double> ConvertToDouble<T>(this IEnumerable<T> array)
        {
            dynamic ds;
            foreach (object st in array)
            {
                ds = st;
                yield return Convert.ToDouble(ds);
            }
        }

        //returns a row in a 2D array
        public static T[] Row<T>(this T[,] array, int r)
        {
            T[] output = new T[array.GetLength(1)];
            if (r < array.GetLength(0))
            {
                for (int i = 0; i < array.GetLength(1); i++)
                    output[i] = array[r, i];
            }
            return output;
        }

        //converts a List of Lists to a 2D matrix
        public static T[,] ToMatrix<T>(this IEnumerable<List<T>> collection, int depth, int length)
        {
            T[,] output = new T[depth, length];
            int i = 0, j = 0;
            foreach (var list in collection)
            {
                foreach (var val in list)
                {
                    output[i, j] = val;
                    j++;
                }
                i++; j = 0;
            }

            return output;
        }

        //returns the classification that appears most frequently in the array of classifications
        public static string Majority<T>(this T[] array)
        {
            if (array.Length > 0)
            {
                int unique = array.Distinct().Count();
                if (unique == 1)
                    return array[0].ToString();

                return (from item in array
                             group item by item into g
                             orderby g.Count() descending
                             select g.Key).First().ToString();
            }
            else
                return "";
        }
    }

    /// <summary>
    /// kNN class implements the K Nearest Neighbor instance based classifier
    /// </summary>
    public sealed class kNN
    {
        //private constructor allows to ensure k is odd
        private  kNN(int K, string FileName, bool Normalise)
        {
            k = K;
            PopulateDataSetFromFile(FileName, Normalise);
        }

        /// <summary>
        /// Initialises the kNN class, the observations data set and the number of neighbors to use in voting when classifying
        /// </summary>
        /// <param name="K">integer representiong the number of neighbors to use in the classifying instances</param>
        /// <param name="FileName">string file name containing knows numeric observations with string classes</param>
        /// <param name="Normalise">boolean flag for normalising the data set</param>
        public static kNN initialiseKNN(int K, string FileName, bool Normalise)
        {
            if (K % 2 > 0)
                return new kNN(K, FileName, Normalise);
            else
            {
                Console.WriteLine("K must be odd.");
                return null;
            }
        }

        //read-only properties
        internal int K { get { return k; } }
        internal Dictionary<List<double>, string> DataSet { get { return dataSet;} }

        /// <summary>
        /// Classifies the instance according to a kNN algorithm
        /// calculates the Euclidean distance between the instance and the known data
        /// </summary>
        /// <param name="instance">List of doubles representing the instance values</param>
        /// <returns>returns string - classification</returns>
        internal string Classify(List<double> instance)
        {
            int i=0;
            double [] normalisedInstance = new double[length];

            if (instance.Count!=length)
            {
                return "Wrong number of instance parameters.";
            }

            if (normalised)
            {
                foreach (var one in instance)
                {
                    normalisedInstance[i] = (one - originalStatsMin.ElementAt(i)) / (originalStatsMax.ElementAt(i) - originalStatsMin.ElementAt(i));
                    i++;
                }
            }
            else
            {
                normalisedInstance = instance.ToArray();
            }

            double[,] keyValue = dataSet.Keys.ToMatrix(depth, length);
            double[] distances = new double[depth];

            Dictionary<double, string> distDictionary = new Dictionary<double, string>();
            for (i = 0; i < depth; i++)
            {
                distances[i] = Math.Sqrt(keyValue.Row(i).Zip(normalisedInstance, (one, two) => (one - two) * (one - two)).ToArray().Sum());
                distDictionary.Add(distances[i], dataSet.Values.ToArray()[i]);

            }

            //select top votes
            var topK = (from d in distDictionary.Keys
                        orderby d ascending
                        select d).Take(k).ToArray();

            //obtain the corresponding classifications for the top votes
            var result = (from d in distDictionary
                        from t in topK
                        where d.Key==t
                        select d.Value).ToArray();

            return result.Majority();
        }
        /// <summary>
        /// Processes the file with the comma-separated training data and populates the dictionary
        /// all values except for the class must be numeric
        /// the class is the last element in the dataset for each record
        /// </summary>
        /// <param name="fileName">string fileName - the name of the file with the training data</param>
        /// <param name="normalise">bool normalise - true if the data needs to be normalised, false otherwiese</param>
        private void PopulateDataSetFromFile(string fileName, bool normalise)
        {
            using (StreamReader sr = new StreamReader(fileName,true))
            {
                List<string> allItems = sr.ReadToEnd().TrimEnd().Split('\n').ToList();

                if (allItems.Count > 1)
                {
                    string[] array = allItems.ElementAt(0).Split(',');
                    length = array.Length - 1;
                    foreach (string item in allItems)
                    {
                        array = item.Split(',');
                        dataSet.Add(array.Where(p => p != array.Last()).ConvertToDouble().ToList(), array.Last().ToString().TrimEnd());
                    }
                    array = null;
                    //record the number of observations so Classify works even when normalisation is off
                    depth = dataSet.Count;
                }
                else
                    Console.WriteLine("No items in the data set");
            }
            if (normalise)
            {
                NormaliseDataSet();
                normalised = true;
            }
        }

        private void NormaliseDataSet()
        {
            var keyCollection = from n in dataSet.Keys
                                select n;
            var valuesCollection = from n in dataSet.Values
                                   select n;

            depth = dataSet.Keys.Count;
            double[,] transpose = new double[length, depth];
            double[,] original = new double[depth, length];
            int i = 0, j = 0;

            //transpose
            foreach (var keyList in keyCollection)
            {
                foreach (var key in keyList)
                {
                    transpose[i, j] = key;
                    i++;
                }
                j++; i = 0;
            }

            //normalise
            double max, min;

            for (i = 0; i < length; i++)
            {
                originalStatsMax.Add (max = transpose.Row(i).Max());
                originalStatsMin.Add(min = transpose.Row(i).Min());

                for (j = 0; j < depth; j++)
                {
                    transpose[i, j] = (transpose[i, j] - min) / (max - min);
                }

            }
            for (i = 0; i < depth; i++)
            {
                for (j = 0; j < length; j++)
                    original[i, j] = transpose[j, i];
            }

            //overwrite the current values with the normalised ones
            dataSet = new Dictionary<List<double>, string>();
            for (i = 0; i < depth; i++)
            {
                dataSet.Add(original.Row(i).ToList(), valuesCollection.ElementAt(i));
            }
        }

        //private members
        private Dictionary<List<double>, string> dataSet = new Dictionary<List<double>,string>();
        private List<double> originalStatsMin = new List<double>();
        private List<double> originalStatsMax = new List<double>();
        private int k=0;
        private int length=0;
        private int depth=0;
        private bool normalised = false;
    }

    class EntryPoint
    {
        static void Main(string[] args)
        {
            kNN examplekNN = kNN.initialiseKNN(3,"DataSet.txt",true);

            List<double> instance2Classify = new List<double> {12,11,500};
            string result = examplekNN.Classify(instance2Classify);
            Console.WriteLine("This instance is classified as: {0}", result);
            Console.ReadLine();
        }
    }
}



In my next blog we will look at decision trees – a slightly more complicated, but also more powerful machine learning algorithm.