# Decision Tree Classifier – Part 1

To continue my blogging on machine learning (ML) classifiers, I am turning to decision trees. The post on decision trees will be in two parts. Part 1 will provide an introduction to how decision trees work and how they are build. Part 2 will contain the C# implementation of an example decision trees classifier. As in all my posts, I prefer clear and informal explanation over the terse mathematical one. So, any pedants out there – look away now!

Decision trees are a great choice for inductive inference and have been widely used for a long time in the field of Artificial Intelligence (AI). In this post we will cover the decision tree algorithm known as ID3.
There are several reasons why decision trees are great classifiers:

• decision trees are easy to understand;
• decision trees work well with messy data or missing values;
• decision trees give insight into complex data;
• decision trees can be visualized to allow for inference inspection or correction;
• decision tree building algorithms can be as simple or sophisticated as required (e.g. they can incorporate pruning, weights, etc.);

Decision trees work best with discrete classes. That is, the output class for each instance is either a string, boolean or an integer. If you are working with continuous values, you may consider rounding and mapping to a discrete output class, or may need to look for another classifier. Decision trees are used for classification problems. For example, the taxonomy of organisms, plants, minerals, etc. lends itself naturally to decision tree classifiers. This is because in a field of taxonomy we are dealing with a set of records containing values for some attributes. The values come from a finite set of known features. Each record may or may not have a classification. Medicine is another field that makes use of decision trees. Almost all illnesses can be categorized by symptoms, thus decision trees aid doctors in illness diagnosis.

It is time for an example, which I am borrowing from . The tennis playing example in ML is like the ‘Hello World’ in programming languages. We are given some data about the weather conditions that are appropriate for playing tennis. Our task is to construct a decision tree based on this data, and use the tree to classify unknown weather conditions. The learning data is:

Table 1
Outlook Temperature Humidity Wind Play Tennis
sunny hot high strong no
sunny hot high weak no
overcast hot high weak yes
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
overcast cool normal strong yes
sunny mild high weak no
sunny cool normal weak yes
rain mild normal weak yes
sunny mild normal strong yes
overcast mild high strong yes
overcast hot normal weak yes
rain mild high strong no

Table 1 tells us which weather conditions are good for playing tennis outdoors. For example, if it is sunny, and the wind is weak, and the temperature is mild but it is very humid, then playing tennis outdoors is not advisable. On the other hand, on an overcast day, regardless of the wind, playing tennis outdoors should be OK.

The most fundamental concept behind the decision tree classifiers is that of order. The amount of order in the data can be measured by assessing its consistency. For example, imagine that every row in Table 1 had a ‘yes’ associated with the Play Tennis column. This would tell us that regardless of the wind, temperature, humidity, we can always play tennis outside. The data would have perfect order, and we would be able to build a perfect classifier for the data – the one that always says ‘yes’ and is always correct. The other extreme would be where the outcome class differs for every observation. For an inductive learner like a decision tree, this would mean that it is impossible to classify new instance unless it perfectly matches some instance in the training set. This is why decision tree classifier won’t work for continuous class problems.

In information theory the concept of order (actually, lack of it), is often represented by entropy. It is a scary word for something that is rather simple. Entropy definition was proposed by the founder of information theory – Claude Shannon, and it is a probabilistic measure of categorical data defined as: $Entropy(S) = -\sum_{i=1}^{n}p_i \log_2 p_i$

where S is the training data set with some target classification that takes on n different values. The discrete target values are assigned their probabilities p, and $\log_2$ is logarithm base 2. The base two is due to the entropy being a measure of the expected encoding in binary format (i.e. 0s and 1s). Note that the smaller are individual probabilities, the greater is the absolute value of entropy. Also, the closer is entropy to zero, the more order there is in the data.

We now can calculate the entropy of our data, one column at a time. Let’s take the Play Tennis column. We have 9 ‘yes’ and 5 ‘no’, this gives us the entropy of $Entropy(S)=-\frac{9}{14} \log_2(\frac{9}{14}) + -\frac{5}{14}\log_2(\frac{5}{14})=0.940$, accurate to 3 d.p. There is a reason why we started with the class column. It will be used to measure the information gain we can achieve by splitting the data one attribute/column at a time, which is the core idea behind building decision trees. At each step we will be picking the attribute that achieves the greatest information gain to split the tree recursively. The information gain is, essentially, a positive difference in entropy; and the root node of any decision tree is the attribute with the greatest information gain. Information gain is formally defined as: $InfoGain(S,A) = Entropy(S) -\sum_{i=1}^{A}\frac{|S_i|}{|S|}Entropy(S_i)$

where A is the set of all values some attribute takes. For example, the Wind column takes values in {strong, weak}, and the Humidity column takes on values in {high, normal}. $|S_i|$ and $|S|$ are the number of records taking on the value i and the total number of records respectively. Finally, $Entropy(S_i)$ is the corresponding entropy of all records where the attribute A takes on value i with respect to the target class. The last sentence is important. It means that, given all the records where attribute A takes on the same value i we are interested to know the dispersion of the target class. For example, take a look at all the records where Wind is strong. Here we have 3 records where the target class is ‘no’ and 3 records where it is ‘yes’. Thus, the corresponding entropy is $Entropy(S_{wind=strong})=-\frac{3}{6} \log_2(\frac{3}{6}) -\frac{3}{6} \log_2(\frac{3}{6})=1.0$

So, which attribute/column should be the tree’s root node? It should be the one that achieves the greatest information gain: $InfoGain(S,Wind) = 0.940 - \frac{8}{14}\times 0.811 - \frac{6}{14}\times 1.0=0.048$ $InfoGain(S,Humidity) = 0.940 - \frac{7}{14}\times 0.985 - \frac{7}{14}\times 0.592=0.151$ $InfoGain(S,Temperature) = 0.940 - \frac{4}{14}\times 1.0 - \frac{4}{14}\times 0.811 - \frac{6}{14}\times 0.918=0.029$ $InfoGain(S,Outlook) = 0.940 - \frac{5}{14}\times 0.971 - \frac{4}{14}\times 0.0 - \frac{5}{14}\times 0.971=0.246$

It is clear that Outlook achieves the greatest information gain and should be the root of the tree. The worst information gain would be achieved with Temperature attribute. This is what our tree now looks: Figure 1

At this stage we have the training data split into three clusters. Where outlook is overcast, we have a target class of ‘yes’. We now proceed recursively building the sub-tree in the other two clusters. Where outlook is sunny, we have a total of 5 records. These are reproduced in Table 2 below. Out of three possible attributes for the node we again pick the one that achieves the greatest information gain. If you inspect the data in Table 2, it is easy to see that humidity is that attribute since all five records can be split as following: if humidity is normal, then ‘yes’, else ‘no’. Its corresponding information gain is $InfoGain(S_{sunny},Humidity) = 0.970 - \frac{3}{5}\times 0 - \frac{2}{5}\times 0=0.970$

where 0.970 is entropy of Outlook is sunny, again, with respect to the Play Tennis class: $Entropy(S_{sunny}) = - \frac{3}{5}\log_2(\frac{3}{5})- \frac{2}{5}\log_2(\frac{2}{5})=0.970$.

Table 2
Outlook Temperature Humidity Wind Play Tennis
sunny hot high strong no
sunny hot high weak no
sunny mild high weak no
sunny cool normal weak yes
sunny mild normal strong yes

At this stage we have classified 9 records and our tree has 9 terminal descendant nodes (leaves): Figure 2

We can now deal with the records in the last outlook cluster, i.e. where outlook is rain. Here we have five records, which are reproduced below in Table 3:

Table 3
Outlook Temperature Humidity Wind Play Tennis
rain mild high weak yes
rain cool normal weak yes
rain cool normal strong no
rain mild normal weak yes
rain mild high strong no

We apply the same algorithm as before by selecting the attribute for the next node that gives the greatest information gain. It is easy to spot that the Wind attribute achieves just that. Its total information gain is $InfoGain(S_{rain},Wind) = 0.970 - \frac{3}{5}\times 0 - \frac{2}{5}\times 0=0.970$

The complete tree looks like this: Figure 3

A few things should be noted about the final decision tree. Firstly, the Temperature attribute is completely missing. The constructed decision tree allowed us to see that this attribute is redundant, at least given the training dataset. Decision trees always favor the shorter trees over the longer ones, which is the main part of its inductive bias (or the algorithm’s policy). Secondly, this is not a binary tree, since we have parent nodes with more than two edges (i.e. the Outlook node). Finally, the constructed decision tree allows for a one-way searching algorithm. This means that we cannot back-track to the parent node. In the grand scheme of things this implies that a decision tree can end up at a locally optimal, rather than globally optimal solution.

The next blog post will be on how to implement the decision tree classifier in C#.

References:

. T.M.Mitchell. Machine Learning. McGraw-Hill. 1997.

Advertisements

# k Nearest Neighbor Classifier

On my blog space I am going to share with you example implementations of the most common machine learning techniques. The code will be either in C# or Python.

This is the first post in the series of several posts to come, which will be on algorithms commonly used to implement classifiers. A classifier is a machine learning algorithm that is allowed to build a representation of data on some training dataset, and ultimately is used to classify new unobserved instances. Classifiers are at the heart of machine learning and data mining, and they have wide use in medicine, speech and image recognition, and even finance. Broadly speaking all classifier fall into two categories: supervised and unsupervised. Supervised classifiers need to be ‘trained’, in the sense that they are fed the training data with known classifications, which is used to construct the priory. Unsupervised classifiers are a bit more complex, and work with unlabeled/unclassified training data (e.g. k-means clusters) or even have to learn the target class for each instance by trial and error (e.g. reinforced learning).

We’ll begin by looking at the most basic instance based classifier known as the K-Nearest Neighbour (kNN). Here K is the number of instances used to cast the vote when labeling previously unobserved instance. To demonstrate how kNN works, we will take this example dataset:

Table 1
Attr 1 Attr 2 Attr 3 Class
0.7 10 300 A
0.14 9 120 A
1.0 12 200 B
1.12 15 300 B
0.4 7 150 A
0.6 8 600 A
1.15 15 600 B
1.12 11 400 B

The above is a dataset with four columns and eight rows. The first three columns are the data (numerical format is required for kNN), the last column is the class. For the purpose of this example it does not matter what the data represents. I created random example with a hidden pattern, which, hopefully, our kNN algorithm can recognise.

Given previously unobserved instance I={i0,…,iN,class}, we calculate the Euclidean distance between I and each known instance in the dataset as follows: $D_i= \sqrt{\sum_{k=0}^{N}(Z^{i}_{k}-I^{i}_{k})^2}$

Here Z is a sequence of values of some instance i in attribute k for which a classification is given (see the dataset), and I is the unclassified instance.
For example, given I = {Attr1=12, Attr2=11, Attr3=500}, the resulting distance matrix, after normalisation, is the following:

Table 2
D Class
11.196 A
11.772 A
10.909 B
10.792 B
11.519 A
11.295 A
10.756 B
10.774 B

The distances were calculated on normalised data. That is, instead of using the original values in Table 1, where the 3rd column dominates all other values, we normalise each value according to: $\frac{Z_{k}^{i}-Min(Z_{k})}{Max(Z_{k})-Min(Z_{k})}$

Again, Z is from the dataset. The instance that we need to classify is also normalised. For example, the normalised values for Table 1 are:

Table 3
Attr 1 Attr 2 Attr 3 Class
0.554 0.375 0.375 A
0 0.25 0 A
0.851 0.625 0.167 B
0.970 1 0.375 B
0.257 0 0.062 A
0.455 0.125 1 A
1 1 1 B
0.970 0.5 0.583 B

The normalised I = {11.74, 0.5, 0.792}, which was calculated with max and min from the dataset, excluding the instance we need to classify.

Ok, now that we have calculated the distances (Table 2), we can proceed to vote on which class the instance I should belong to. To do this, we select K smallest distances and look at their corresponding classes. Let’s take K=3, then the smallest distances with votes are: 10.756 for B, 10.774 for B, and 10.792 for B. The new instance clearly belongs to class B. This is spot on, as the pattern is: Attr1<1 and Attr2<10 result in A, else B.

Consider another example I = {0.8, 11, 500}. For this instance, after normalisation, the top 3 distances with classes are: 0.379 for B, 0.446 for A, and 0.472 for A. The majority is A, so, the instance is classified as A.

Several modifications exist to the basic kNN algorithm. Some employ weights to improve classification, others use an even K and break ties in a weighted-fashion. Overall, it is a powerful, easy to implement classifier. If you are dealing with non-numeric data parameters, these can be quantified through some mapping from non-numeric value to numbers. Always remember to normalise your data, otherwise you are running a chance of introducing bias into the classifier.

Let’s now look at how kNN can be implemented with C#. At the start, I introduce several extensions to aid in data manipulation. In C#, extensions are an extremely useful concept and are quite addictive. And so is LINQ. You will notice that I try to use LINQ query where possible. Most worker method are private and my properties are read-only. Another thing to note is that my kNN constructor is private. Instead, the class users call initialiseKNN method which ensures that K is odd, since I am not using weights and don’t provide for tie breaks.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace kNN
{
//extension method to aid in algorithm implementation
public static class Extensions
{
//converts string representation of number to a double
public static IEnumerable<double> ConvertToDouble<T>(this IEnumerable<T> array)
{
dynamic ds;
foreach (object st in array)
{
ds = st;
yield return Convert.ToDouble(ds);
}
}

//returns a row in a 2D array
public static T[] Row<T>(this T[,] array, int r)
{
T[] output = new T[array.GetLength(1)];
if (r < array.GetLength(0))
{
for (int i = 0; i < array.GetLength(1); i++)
output[i] = array[r, i];
}
return output;
}

//converts a List of Lists to a 2D matrix
public static T[,] ToMatrix<T>(this IEnumerable<List<T>> collection, int depth, int length)
{
T[,] output = new T[depth, length];
int i = 0, j = 0;
foreach (var list in collection)
{
foreach (var val in list)
{
output[i, j] = val;
j++;
}
i++; j = 0;
}

return output;
}

//returns the classification that appears most frequently in the array of classifications
public static string Majority<T>(this T[] array)
{
if (array.Length > 0)
{
int unique = array.Distinct().Count();
if (unique == 1)
return array.ToString();

return (from item in array
group item by item into g
orderby g.Count() descending
select g.Key).First().ToString();
}
else
return "";
}
}

/// <summary>
/// kNN class implements the K Nearest Neighbor instance based classifier
/// </summary>
public sealed class kNN
{
//private constructor allows to ensure k is odd
private  kNN(int K, string FileName, bool Normalise)
{
k = K;
PopulateDataSetFromFile(FileName, Normalise);
}

/// <summary>
/// Initialises the kNN class, the observations data set and the number of neighbors to use in voting when classifying
/// </summary>
/// <param name="K">integer representiong the number of neighbors to use in the classifying instances</param>
/// <param name="FileName">string file name containing knows numeric observations with string classes</param>
/// <param name="Normalise">boolean flag for normalising the data set</param>
public static kNN initialiseKNN(int K, string FileName, bool Normalise)
{
if (K % 2 > 0)
return new kNN(K, FileName, Normalise);
else
{
Console.WriteLine("K must be odd.");
return null;
}
}

//read-only properties
internal int K { get { return k; } }
internal Dictionary<List<double>, string> DataSet { get { return dataSet;} }

/// <summary>
/// Classifies the instance according to a kNN algorithm
/// calculates Eucledian distance between the instance and the know data
/// </summary>
/// <param name="instance">List of doubles representing the instance values</param>
/// <returns>returns string - classification</returns>
internal string Classify(List<double> instance)
{
int i=0;
double [] normalisedInstance = new double[length];

if (instance.Count!=length)
{
return "Wrong number of instance parameters.";
}

if (normalised)
{
foreach (var one in instance)
{
normalisedInstance[i] = (one - originalStatsMin.ElementAt(i)) / (originalStatsMax.ElementAt(i) - originalStatsMin.ElementAt(i));
i++;
}
}
else
{
normalisedInstance = instance.ToArray();
}

double[,] keyValue = dataSet.Keys.ToMatrix(depth, length);
double[] distances = new double[depth];

Dictionary<double, string> distDictionary = new Dictionary<double, string>();
for (i = 0; i < depth; i++)
{
distances[i] = Math.Sqrt(keyValue.Row(i).Zip(normalisedInstance, (one, two) => (one - two) * (one - two)).ToArray().Sum());
distDictionary.Add(distances[i], dataSet.Values.ToArray()[i]);

}

//select top votes
var topK = (from d in distDictionary.Keys
orderby d ascending
select d).Take(k).ToArray();

//obtain the corresponding classifications for the top votes
var result = (from d in distDictionary
from t in topK
where d.Key==t
select d.Value).ToArray();

return result.Majority();
}
/// <summary>
/// Processess the file with the comma separated training data and populates the dictionary
/// all values except for the class must be numeric
/// the class is the last element in the dataset for each record
/// </summary>
/// <param name="fileName">string fileName - the name of the file with the training data</param>
/// <param name="normalise">bool normalise - true if the data needs to be normalised, false otherwiese</param>
private void PopulateDataSetFromFile(string fileName, bool normalise)
{
using (StreamReader sr = new StreamReader(fileName,true))
{
List<string> allItems = sr.ReadToEnd().TrimEnd().Split('\n').ToList();

if (allItems.Count > 1)
{
string[] array = allItems.ElementAt(0).Split(',');
length = array.Length - 1;
foreach (string item in allItems)
{
array = item.Split(',');
dataSet.Add(array.Where(p => p != array.Last()).ConvertToDouble().ToList(), array.Last().ToString().TrimEnd());
}
array = null;
}
else
Console.WriteLine("No items in the data set");
}
if (normalise)
{
NormaliseDataSet();
normalised = true;
}
}

private void NormaliseDataSet()
{
var keyCollection = from n in dataSet.Keys
select n;
var valuesCollection = from n in dataSet.Values
select n;

depth = dataSet.Keys.Count;
double[,] transpose = new double[length, depth];
double[,] original = new double[depth, length];
int i = 0, j = 0;

//transpose
foreach (var keyList in keyCollection)
{
foreach (var key in keyList)
{
transpose[i, j] = key;
i++;
}
j++; i = 0;
}

//normalise
double max, min;

for (i = 0; i < length; i++)
{
originalStatsMax.Add (max = transpose.Row(i).Max());
originalStatsMin.Add(min = transpose.Row(i).Min());

for (j = 0; j < depth; j++)
{
transpose[i, j] = (transpose[i, j] - min) / (max - min);
}

}
for (i = 0; i < depth; i++)
{
for (j = 0; j < length; j++)
original[i, j] = transpose[j, i];
}

//overwrite the current values with the normalised ones
dataSet = new Dictionary<List<double>, string>();
for (i = 0; i < depth; i++)
{
dataSet.Add(original.Row(i).ToList(), valuesCollection.ElementAt(i));
}
}

//private members
private Dictionary<List<double>, string> dataSet = new Dictionary<List<double>,string>();
private List<double> originalStatsMin = new List<double>();
private List<double> originalStatsMax = new List<double>();
private int k=0;
private int length=0;
private int depth=0;
private bool normalised = false;
}

class EntryPoint
{
static void Main(string[] args)
{
kNN examplekNN = kNN.initialiseKNN(3,"DataSet.txt",true);

List<double> instance2Classify = new List<double> {12,11,500};
string result = examplekNN.Classify(instance2Classify);
Console.WriteLine("This instance is classified as: {0}", result);
Console.ReadLine();
}
}
}


In my next blog we will look at decision trees – a slightly more complicated, but also more powerful machine learning algorithm.

# Solving ODEs Numerically

In this post we are going to look at two methods for approximating a solution to some ordinary differential equation. We will only consider the first order separable initial value problems. Here, the ‘first order’ means that the problem contains a first derivative only. The word ‘separable’ implies that the derivative can be separated from the derivative function (i.e. we can separate the derivative terms from the terms that do not involve derivatives). Finally, an initial value problem is a problem for which an initial condition is given, thus we have two starting values for x and y.

The following are examples of the types of problems we are considering: $\frac{dy}{dx}=xy^{\frac{1}{3}}, y(1)=1$ $\frac{dy}{dx}=4x-3, y(0)=0$ $\frac{dy}{dx}=\cos{x}-3, y(0)=0$

The above examples contain expressions for the derivative of y(x). Thus, we are given the derivative, which is some function f() of x and y, and we are also given the initial value. This initial value is the value of x for which y(x) is already known. Please note that we do not know the exact expression for y(x), we just have the expression for the derivative of y w.r.t. x.

Solving a differential equation means finding the y(x). Numerical solutions to differential equations usually approximate the y(x) at some required point. So, a numerical solution does not produce an expression for y(x), it just approximates the value of y(x) at some point x. Given enough approximations, we should also be able to graph y(x) on some interval.

There are many types of approximations to differential equations. Here we will look at Euler’s and Runge-Kutta methods. Both methods are based on the Taylor series expansion. Since the function for the 1st derivative is known, this means that we have an equation for the slope to the curve y(x) makes on the (x,y) plane.

The basic definition of the first derivative is given by $\frac{dy}{dx}\approx \frac{\Delta}{\Delta x}= \frac{(y(x+h)-y(x))}{h}$, which says that for some small change h, we can approximate the rate of change in y to the rate of change in x using this difference ratio. This can be rearranged to give us $y(x+h)$. Thus: $y(x+h)=h[\frac{dy}{dx}]+y(x)$. This expression can be used to solve a differential equation for some other value of x, since we can start with our initial condition, take a small step h, and iteratively approximate y(x). The smaller is the increment step, the better will be our approximation. This is the fundamental idea behind the Euler’s method.

The Runge-Kutta method is a little bit more complex, but it is more accurate as well. This approach is building on the fact that $y(x+h)=h[\frac{dy}{dx}]+y(x)$ is an approximation that can be improved upon. In particular, the Runge-Kutta method is the 4th order approximation to the ordinary differential equation, thus we have: $y(x+h)=y(x)+h[\frac{1}{6}(k1+2\times k2+2\times k3+k4)]$ where $k1=f(x,y)$, $k2=f(x+ \frac{h}{2},y+ \frac{h}{2}\times k1)$, $k3=(x+ \frac{h}{2},y+ \frac{h}{2}\times k2)$ and $k4=f(x+h,y+h\times k3)$.

It comes as no surprise that the Runge-Kutta is more accurate than the Euler’s method since it is better at approximating the curve of y(x).

We will now look at the implementation of these two methods in C#. My code is given below. Note that I am following my usual coding style by making the individual solution classes to derive from an abstract InitialValue class. Most of the variables and functionality that is common to both method is contained within the parent class. The differential function is implemented as a delegate. Finally, I am also populating the metadata for the class constructors and the solution methods this time, as this is a good practice.

using System;
using System.Collections.Generic;

namespace DifferentialEquations
{

abstract class InitialValue
{
internal delegate double DerivativeFunction(double x, double y);

internal InitialValue(DerivativeFunction func, double initValX, double initValY, double solPoint)
{
givenFunction = func;
this.initValX = initValX;
this.initValY = initValY;
this.solPoint = solPoint;
CalcDistance();
}

internal double Delta
{
get
{
return delta;
}
}

internal int Distance
{
get
{
return distance;
}
}

//calculates absolute distance between the x point for required solution and the initial value of x
private void CalcDistance()
{
double res = 0.0;

if (this.initValX < 0 && this.solPoint < 0)
{
res = Math.Abs(Math.Min(this.initValX, this.solPoint) - Math.Max(this.initValX, this.solPoint));
}
else if (this.initValX > 0 && this.solPoint > 0)
{
res = Math.Max(this.initValX, this.solPoint) - Math.Min(this.initValX, this.solPoint);
}
else
{
res = Math.Abs(this.solPoint) + Math.Abs(this.initValX);
}

this.distance = (int)res;
}

internal abstract double[] SolveDifferentialEquation();

protected DerivativeFunction givenFunction;
private const double  delta=0.01;
protected double initValX;
protected double initValY;
protected double solPoint;
private int distance=0;

}

sealed class EulerMethod:InitialValue
{

/// <summary>
/// Initialises the Euler method
/// </summary>
/// <param name="func">differential function to approximate</param>
/// <param name="initValX">initial value for x</param>
/// <param name="initValY">initial value for y</param>
/// <param name="solPoint">solution point: value of x for which y=f(x) is required</param>
internal EulerMethod(DerivativeFunction func, double initValX, double initValY, double solPoint):base(func,initValX, initValY, solPoint)
{
altDelta = base.Delta;
}

/// <summary>
/// Initialises the Euler method
/// </summary>
/// <param name="func">differential function to approximate</param>
/// <param name="initValX">initial value for x</param>
/// <param name="initValY">initial value for y</param>
/// <param name="solPoint">solution point: value of x for which y=f(x) is required</param>
/// <param name="delta">alternative value for delta</param>
internal EulerMethod(DerivativeFunction func, double initValX, double initValY, double solPoint, double delta)
: base(func, initValX, initValY, solPoint)
{
//this check will not provide for all cases but good enough for demonstration
if (delta < base.Distance)
altDelta = delta;
else
{
Console.WriteLine("Delta value too large. Using default delta of 0.01.");
altDelta = base.Delta;
}
}

/// <summary>
/// Solves a differential equation using the Euler method
/// </summary>
/// <returns>y=f(x) solution at the required point of x</returns>
internal override double[] SolveDifferentialEquation()
{
direction = (this.initValX < base.solPoint) ? true : false;

//the position at index 0 will be taken by initial value
int numPoints = (int)(base.Distance/ this.altDelta)+1;
double[] solutionPoints = new double[numPoints];
double x = this.initValX;

solutionPoints = this.initValY;

for (int i = 0; i < numPoints-1; i++)
{
solutionPoints[i + 1] = solutionPoints[i] + altDelta * this.givenFunction(x, solutionPoints[i]);
if (direction)
x = x + altDelta;
else
x = x - altDelta;
}
Console.WriteLine("Euler Solution at x=: " + base.solPoint + " is " + solutionPoints.GetValue(numPoints - 1));
return solutionPoints;
}

private bool direction;
private double altDelta;
}

sealed class RungeKuttaMethod : InitialValue
{
/// <summary>
/// Initialises the Runge-Kutta method
/// </summary>
/// <param name="func">differential function to approximate</param>
/// <param name="initValX">initial value for x</param>
/// <param name="initValY">initial value for y</param>
/// <param name="solPoint">solution point: value of x for which y=f(x) is required</param>
internal RungeKuttaMethod(DerivativeFunction func, double initValX, double initValY, double solPoint)
: base(func, initValX, initValY, solPoint)
{
altDelta = base.Delta;
}

/// <summary>
/// Initialises the Runge-Kutta method
/// </summary>
/// <param name="func">differential function to approximate</param>
/// <param name="initValX">initial value for x</param>
/// <param name="initValY">initial value for y</param>
/// <param name="solPoint">solution point: value of x for which y=f(x) is required</param>
/// <param name="delta">alternative delta value</param>
internal RungeKuttaMethod(DerivativeFunction func, double initValX, double initValY, double solPoint, double delta)
: base(func, initValX, initValY, solPoint)
{
//this check will not provide for all cases but good enough for demonstration
if (delta < base.Distance)
altDelta = delta;
else
{
Console.WriteLine("Delta value too large. Using default delta of 0.01.");
altDelta = base.Delta;
}
}

/// <summary>
/// Solves a differential equation using Runge-Kutta method
/// </summary>
/// <returns>y=f(x) solution at the required point of x</returns>
internal override double[] SolveDifferentialEquation()
{
direction = (this.initValX < base.solPoint) ? true : false;

//the position at index 0 will be taken by initial value
int numPoints = (int)(base.Distance/ this.altDelta)+1;
double[] solutionPoints = new double[numPoints];
double x = this.initValX;
double k1, k2, k3, k4 = this.initValY;
double scale = 1.0/6.0;
double increment = altDelta/2;

solutionPoints = this.initValY;

for (int i = 0; i < numPoints - 1; i++)
{
k1 = this.givenFunction(x, solutionPoints[i]);
k2 = this.givenFunction(x + increment, solutionPoints[i] + increment * k1);
k3 = this.givenFunction(x + increment, solutionPoints[i] + increment * k2);
k4 = this.givenFunction(x + altDelta, solutionPoints[i] + altDelta * k3);
solutionPoints[i + 1] = solutionPoints[i] + altDelta*scale *(k1+2.0*k2+2.0*k3+k4);
if (direction)
x = x + altDelta;
else
x = x - altDelta;
}
Console.WriteLine("Runge-Kutta Solution at x=: "+base.solPoint+" is " + solutionPoints.GetValue(numPoints - 1));
return solutionPoints;
}

private bool direction;
private double altDelta;
}

class EntryPoint
{
static void Main(string[] args)
{
EulerMethod eu = new EulerMethod(DerivFunction, 0, 0,5);
double[]resEU = eu.SolveDifferentialEquation();

RungeKuttaMethod rk = new RungeKuttaMethod(DerivFunction, 0, 0,5);
double[] resRK = rk.SolveDifferentialEquation();

Console.ReadLine();
}

static double DerivFunction(double x, double y)
{
//double pow = (1.0 / 3.0);
//return x * Math.Pow(y, pow);
//return 14.0 * x - 3.0;
return Math.Cos(x) - 3.0;
}
}
}


Below are the results for three initial value problems. Here the h increment is 0.01. The Runge-Kutta method provides a very good approximation.

Table 1
Differential Equation Euler’s Runge-Kutta Exact Solution (at x=5) $\frac{dy}{dx}=xy^{\frac{1}{3}}, y(1)=1$ 26.8956 26.9999 27 $\frac{dy}{dx}=4x-3, y(0)=0$ 159.6499 159.9999 160 $\frac{dy}{dx}=\cos{x}-3, y(0)=0$ -15.9553 -15.9589 -15.9589

# Solving Nonlinear Equations Numerically

Knowing the roots of some nonlinear equation allows us to answer the question about which values of x make $f(x)$ equal to zero. There are many approaches to findings the roots of equation in nonlinear algebra. Some of the well-known ones include: the linear interpolation (or bisection) method, Newton’s method, Bernoulli’s method and its variations. In this post I am demonstrating an implementation of the two basic methods: bisection and Newton’s. We will also compare their performances.

The bisection method for finding the roots of some function $f(x)$ involves two starting values a and b, where a<b and the function is defined on [a,b]. The solution is approximated by iteratively halving the distance between a and b, and resetting one of these points. The point to reset depends on both, the sign of the slope of $f(x)$ on [a,b] and the sign of $f(x_1)$, where $x_1$ is the current approximation. This process terminates when either of the following conditions is reached: (a) we find the solution to some acceptable accuracy, or (b) we do not find the solution and terminate because the current distance on which we iterate is less than some error value (i.e. we have exhausted the interval without convergence), or we have exceeded the number of allowed iterations. The solution may not be obtained if the starting points are chosen poorly, and there are no roots on the [a,b] interval. For example, take a look at Figure 1, which is a plot of $f(x)=x^2-2$, where some possible [a,b] interval is marked. Imagine instead we pick [a,b] to be [2,3]. Since nowhere on [2,3] is $f(x)=0$, this interval does not provide any roots. Figure 1

Which of the two roots do you think this method finds if we take a and b to be -2.0 and 2.0? The answer depends on the used method to approximate the slope of the function and what we do when this slope is zero. That is, it will be equally correct to move to the right or to the left. So, it is up to us.

The Newton’s method is a little bit more complicated. Essentially, it is based on the idea that, starting with some guess value $x_0$, we can improve upon the guess by updating it with the value of x for which $f(x)$ is the tangent line to the function. We repeat until convergence to a solution is achieved. Mathematically, this can be expressed as: $m(x)=f'(x_0)\times(x_1-x_0) +f(x_0)$ $x_1=x_0-\frac{f(x_0)}{f'(x_0)}$

where $x_1$ is the next approximation and $m(x)$ is the tangent line function.

As you can see, the Newton’s method requires us to approximate the first derivative of $f(x)$. This is why I think it is a bit more complex. Also, there are several conditions, under which this method fails to find a solution. One of them is that $f'(root)$ is undefined. Another is that $f'(x) =0$. More conditions can be found on Wikipedia.

Below I am providing one possible way to implement these two methods. Please note the following points about my implementation:

1. One can never provide too many checks for potential run-time failures. I chose to stop at just a few. That is, for any possible solution x, $f(x)$ should be defined. Also, we should be able to estimate the 1st derivative of $f(x)$. While testing the implementation on some functions I noticed that if the 1st derivative is close to zero, then Newton’s method tends to diverge. That is why I am rounding to the nearest 3 f.d. on line 104.
2. My approximation to the first derivative of $f(x)$ uses four points.
3. I have left a few Debug.Assert in the code to help with debugging the code for some ill-behaved functions.
4. The function for which we need to find the roots, is passed as a delegate.

Below is my implementation along with the run example:

using System;
using System.Collections.Generic;

namespace SolvingNonLinearEquations
{
public abstract class NonlinearEquationSolutionMethod
{
public delegate double NonLinearFunction(double x);

public NonlinearEquationSolutionMethod(NonLinearFunction func)
{
this.func = func;
}

protected NonLinearFunction func;
protected const double maxiter = 100;
}

public sealed class BisectionMethod: NonlinearEquationSolutionMethod
{
private BisectionMethod(double from, double to, double err, NonLinearFunction func)
:base(func)
{
this.from = from;
this.to = to;
this.err = err;
}

//ensure we correctly intitialise from and to points where from < to
public static BisectionMethod SetBisectionMethod(double from, double to, double err, NonLinearFunction func)
{
if (from >= to)
return null;
else
return  new BisectionMethod(from, to, err, func);
}

public double Solve()
{
double solution = (from+to)*0.5;
int iter = 1;

//determine of the function has positive or negative slope on the given interval
double slope = func(to) - func(from);
double temp=0.0;

while ((to - from) > err && iter < maxiter)
{
temp = func(solution);
System.Diagnostics.Debug.Assert(!temp.Equals(Double.NaN), "Division by zero or another illegal operation has occured.");

if (temp < 0.0)
{
if (slope > 0)
from = solution;
else
to = solution;
}
else
{
if (slope > 0)
to = solution;
else
from = solution;
}
solution = (from + to) * 0.5;
++iter;
}
if (iter == maxiter || Math.Abs(func(solution)) > Math.Abs(err))
{
Console.WriteLine("Failed to converge.");
solution = default(double);
}
return solution;
}

private double from;
private double to;
private double err;
}

public sealed class NewtonMethod:NonlinearEquationSolutionMethod
{
public NewtonMethod(double guess,  double err, NonLinearFunction func)
: base(func)
{
this.guess = guess;
this.err = err;
}

public double Solve()
{
//Newton's method requires computing the function's first derivative
int iter = 0;
double temp = func(guess);
double tempdev = numFirstDerivative(func, guess);

if (temp==0)
return guess;

//check if the derivative exists at the initial guess
//if not, we cannot use Newton's method with this guess
if (tempdev.Equals(Double.NaN) || Math.Round(tempdev, 3).Equals(0.000))
{
Console.WriteLine("Derivate of this function is undefined or zero at this point.");
return default(double);
}

while(iter<=maxiter && Math.Abs(func(guess))>err && numFirstDerivative(func, guess)!=0)
{
System.Diagnostics.Debug.Assert(!temp.Equals(Double.NaN), "Division by zero or another illegal operation has occured.");
guess = guess - (temp/numFirstDerivative(func, guess));
++iter;
temp = func(guess);
}

if (func(guess).Equals(Double.NaN) || numFirstDerivative(func, guess).Equals(Double.NaN))
guess = default(double);

return guess;
}

//approximates the first derivative of a single-variabled function
private double numFirstDerivative(NonLinearFunction func, double x)
{
double delta = 0.01;
return (func(x - 2 * delta) - 8 * func(x - delta) + 8 * func(x + delta) - func(x + 2 * delta)) / (12 * delta);
}

private double guess;
private double err;
}

class EntryPoint
{
static void Main(string[] args)
{
BisectionMethod bm = BisectionMethod.SetBisectionMethod(-2.0, 2.0, 0.001, funcToSolve);
NewtonMethod nm = new NewtonMethod(0.5, 0.001, funcToSolve);

Console.WriteLine(bm.Solve());
Console.WriteLine(nm.Solve());
Console.ReadLine();
}

static double funcToSolve(double x)
{
return x*x-2.0;
}
}
}


For function $f(x)=x^2-2$, the two roots are $\pm\sqrt{2}$. In my implementation, the bisection method finds the negative root, while the Newton’s method finds the positive root, since the initial guess is closer to the positive root. It takes 13 iterations to approximate the solution with the bisection method. While it takes only 4 iterations with the Newton’s method.

# Comparing Numerical Integration Methods

In this post I will re-introduce and compare four well-known numerical integration techniques for approximating definite integrals of single-variable functions (e.g. polynomials). The compared techniques, in the order of least to the most accurate, are:

• Rectangle rule;
• Mid-point rule;
• Trapezoid rule;
• and Simpson’s rule;

Let me begin by an informal description of the idea behind numerical integration. All numerical integration involves approximating the area under the function’s curve. For example, Figure 1 is a plot of f(x)=2+cos(2*sqrt(x)) for x in the interval between 0 and 2. The area under the curve can be approximated in different ways. The simplest way is to break it up into many rectangles and compute their areas instead. Figure 1

Figure 2 illustrates this simple ‘rectangle rule’ approach, where I only show three rectangles. Clearly, for this function, the rectangle rule approximation will overestimate the area. However, this is due to the fact that I began to break-up the area under the function’s curve left-to-right. This means that the area of the first rectangle is calculated by evaluating the functions at point x=a, and scaling by the distance between a and b, where b is the starting point of the next rectangle. Was I to start from left to right, as shown in Figure 3, I would have underestimated the area. If you are thinking that some kind of average of the two methods can give a more accurate approximation, you are absolutely correct. This, in fact, is what trapezoid-rule approximation does. Also, it is clear, that the more rectangles are used to break-up the area, the more accurate is the approximation. Figure 2 Figure 3

It is probably a good time to formally define these numerical methods.
For any two adjacent points a and b, the area of a single rectangle is approximated as follows:

• Rectangle rule: $\int_a^b \! f(x) \, \mathrm{d}x \approx f(a)(b-a)$
• Mid-point rule: $\int_a^b \! f(x) \, \mathrm{d}x \approx f(\frac{a+b}{2})(b-a)$
• Trapezoid rule: $\int_a^b \! f(x) \, \mathrm{d}x \approx \frac{1}{2}[f(a)+f(b)](b-a)$
• and Simpson’s rule: $\int_a^b \! f(x) \, \mathrm{d}x \approx \frac{1}{3}[f(a)+f(\frac{a+b}{2})+f(b)]$

Note that in the above definitions, I am assuming that b>a, and that my increment step is set to 1. Below I am providing a rather naïve implementation of these four numerical integration methods in C#:

using System;

//functionality to perform several types of single-variable function numerical integration
namespace NumericalIntegration
{
public abstract class NumericalIntegration
{
public delegate double SingleVarFunction(double x);

public NumericalIntegration(double from, double to, double steps, SingleVarFunction func)
{
this.fromPoint = from;
this.toPoint = to;
this.numSteps = steps;
eps=0.0005;
h = (toPoint - fromPoint) / (numSteps + 1.0);
this.func=func;
}

public abstract double Integration();

protected double fromPoint;
protected double toPoint;
protected SingleVarFunction func;
protected double numSteps;
protected double eps;
protected double h;
}

public sealed class RectangleRule: NumericalIntegration
{
public RectangleRule(double from, double to, double steps, SingleVarFunction func)
:base(from,to,steps,func)
{ }

public override double Integration()
{
double a = fromPoint;
double value = func(fromPoint);
for (int i = 1; i <= numSteps; i++)
{
a = a + h;
value += func(a);
}
return h*value;
}
}
public sealed class MidPointRule: NumericalIntegration
{
public MidPointRule(double from, double to, double steps, SingleVarFunction func)
:base(from,to,steps,func)
{}

public override double Integration()
{
double a = fromPoint;
double value = func((2.0 * fromPoint + h) / 2.0);
for (int i = 1; i <= numSteps; i++)
{
a = a + h;
value += func((2.0 * a + h) / 2.0);
}
return h * value;
}
}

public sealed class TrapezoidRule: NumericalIntegration
{
public TrapezoidRule(double from, double to, double steps, SingleVarFunction func)
: base(from, to, steps, func)
{}
public override double Integration()
{
double a = fromPoint;
double value = (func(fromPoint)+func(fromPoint+h));
for (int i = 1; i <= numSteps; i++)
{
a = a + h;
value += (func(a) + func(a + h));
}
return 0.5 * h * value;
}
}

public sealed class SimpsonRule: NumericalIntegration
{
public SimpsonRule(double from, double to, double steps, SingleVarFunction func)
: base(from, to, steps, func)
{
h = (toPoint - fromPoint) / (numSteps * 2.0);
}

public override double Integration()
{
double valueEven = 0;
double valueOdd = 0;

for (int i = 1; i <= numSteps; i++)
{
valueOdd += func(fromPoint + this.h * (2.0 * i - 1));
if (i <= numSteps-1)
valueEven += func(fromPoint + this.h * 2.0 * i);
}
return (func(fromPoint)+func(toPoint)+2.0*valueEven+4.0*valueOdd)*(this.h/3.0);
}
}
}


In summary, the code defines an abstract NumericalIntegration class which has two members: a delegate to point to the function over which we need to integrate, an abstract Integration method which must be overridden by the deriving classes. The assignment of the integration parameters occurs through the base class constructor. However, in case of the SimpsonRule class, I override the definition of h.

Please note that for the sake of this comparison, I will not be attempting to estimate the number of rectangles or partitions needed to achieve some convergence criterion. However, I am still declaring eps in the base class constructor, which can be used for this purpose.

Ok, so, let’s evaluate the Integration methods of each rule for the following functions and parameters: a=0, b=2, number of partitions = 8. Also, we’ll compare the performance on the following three functions:

• func1: $f(x)=2+\cos(2\times\sqrt{x})$
• func2: $f(x)=2+\sin(2\times\sqrt{x})$
• func3: $f(x)=0.4\times\ln(x^2)$

The last function’s lower bound is set to 0.1 instead of zero.
Table 1 summarises the results of each numerical integration rule. The last column shows the mathematical solution. Clearly, the Simpson’s rule is the most accurate here.

Table 1
func rectangle mid-point trapezoid Simpson’s solution (to 6 d.p)
1 3.68415 3.45633 3.46733 3.46000 3.46000
2 5.41971 5.51309 5.45394 5.49218 5.49946
3 -0.50542 -0.21474 -0.25244 -0.22756 -0.22676

For the sake of completeness I provide the Main function code as well:

namespace TestNumericalIntegration
{
using NumericalIntegration;

class Program
{
static void Main(string[] args)
{
double from = 0.1;
double to = 2.0;
double steps = 20.0;

RectangleRule ri = new RectangleRule(from, to, steps, funcToIntegrate);
MidPointRule mi = new MidPointRule(from, to, steps, funcToIntegrate);
TrapezoidRule ti = new TrapezoidRule(from, to, steps, funcToIntegrate);
SimpsonRule si = new SimpsonRule(from, to, steps, funcToIntegrate);

double[] res = { ri.Integration(), mi.Integration(), ti.Integration(), si.Integration() };
foreach (double r in res)
Console.WriteLine(r.ToString("F5", System.Globalization.CultureInfo.InvariantCulture));
Console.ReadLine();
}

static double funcToIntegrate(double x)
{
//return 2.0+System.Math.Cos(2.0*System.Math.Sqrt(x));       //first function
//return 2.0 + System.Math.Sin(2.0 * System.Math.Sqrt(x)); //second function
return 0.4 * System.Math.Log(x*x);                       //third function

}
}
}


I hope you’ve found this useful. In my next post I will examine various approaches to numerical integration of multi-dimensional integrals.